Freitag, 27. Juli 2018

missing oraagent.bin process

It all started with a small error when I tried to stop/start a service:

srvctl start service -d cmpp01_site1 -s cmpp01
PRCD-1084 : Failed to start service CMPP01 
PRCR-1079 : Failed to start resource ora.cmpp01_site1.cmpp01.svc 
CRS-2680: Clean of 'ora.cmpp01_site1.cmpp01.svc' on 'av2l954p' failed 
CRS-5802: Unable to start the agent process 

It's not fine the agent can not be started. so let's check the logfile crsd_oraagent_oracle.trc:
2018-07-23 21:29:14.339 : CRSCOMM:2939654272:  Ipc: Starting send thread
2018-07-23 21:29:14.340 :GIPCXCPT:2939654272:  gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29)
2018-07-23 21:29:14.340 :GIPCXCPT:2939654272:  gipcConnectSyncF [connectToServer : clsIpcClient.cpp : 380]: EXCEPTION[ ret gipcretConnectionRefused (29) ]  failed sync connect endp 0x1cbb5b0 [0000000000000090] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x1cbe170, sendp 0x1cbdf20 status 13flags 0xa108871a, flags-2 0x0, usrFlags 0x30000 }, addr 0x1cbc990 [0000000000000097] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, flags 0x0
2018-07-23 21:29:14.340 : CRSCOMM:2939654272:  IpcC: gipcConnect() failed, rc= 29
2018-07-23 21:29:14.340 : CRSCOMM:2939654272: [FFAIL] IpcC: Could not connect to (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_IPC_SOCKET_11)) ret = 29

It clearly says there is something going wrong, unfortunately it doesn't tell in detail what's going wrong or what's required.

Maybe the most beautiful but useless analysis here was provided by tfactl summary -crs:
DETAILS                            STATUS    COMPONENT  
  .------------------------------.   PROBLEM   CRS        
  | CRS_SERVER_STATUS   : ONLINE |                        
  | CRS_STATE           : ONLINE |                        
  | CRS_INTEGRITY_CHECK : FAIL   |                        
  | CRS_RESOURCE_STATUS : ONLINE |                        

As neither My Oracle Support nor google helped a lot here, I started a random cry on twitter:

And Anil had mercy on me :
This was not the answer to my problem, but it started a very valuable conversation in which I learned enough to identify the issue.

It's important to know that (this) oraagent.bin process is started by crsd.bin, and it is restarted if it disappears for any reason.
So to see what's really going and and leads to the error above, running strace on crsd.bin and it's forks/clones:
strace -f -p <pid_of_crsd.bin> -o <outfile>

as expected crsd.bin creates a new process for oraagent.bin:
18631 clone( <unfinished ...>
18631 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe273dfe9d0) = 18633
18633 execve("/appl/grid/grid_12201/bin/oraagent.bin", ["/appl/grid/grid_12201/bin/oraage"...], [/* 52 vars */]) = 0

and if everything runs as expected the socket handling looks like
18633 socket(AF_LOCAL, SOCK_STREAM, 0 
18633 <... socket resumed> )            = 69  <== the creation of the socked
18633 connect(69, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_IPC_SOCKET_11"}, 110 <unfinished ...>
18633 <... connect resumed> )           = 0   <== the successful connection of the sockets file descriptor 69 to  /var/tmp/.oracle/sCRSD_IPC_SOCKET_11

But in my problematic case it's this:
22612 connect(69, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_IPC_SOCKET_11"}, 110  <unfinished ...>
22612  <... connect resumed> )           = -1 ENOENT (No such file or directory)

The first finding here:
The error
gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29)
in the logfile correlates with
<... connect resumed> )           = -1 ENOENT (No such file or directory)

With this specific information a short check in /var/tmp/.oracle shows the missing file is not there.
A crosscheck in /proc/<pid_of_crsd.bin>/fd/ shows
lrwx------. 1 root root 64 Jul 26 22:00 /proc/2144/fd/214 -> /var/tmp/.oracle/sCRSD_IPC_SOCKET_11_lock (deleted)

this all happens on grid binaries 12.2 with latest patch set.

I showed now:
1) if oraagent.bin disappears, crsd.bin creates a new oraagent.bin process
2) the oraagent.bin process tries to connect a socket on /var/tmp/.oracle/sCRSD_IPC_SOCKET_11 to communicate with crsd.bin
3) if 2) fails, gipcretConnectionRefused (29) is shown in the logfile.

Right now I know:
If /var/tmp/.oracle/sCRSD_IPC_SOCKET_11 is missing but oraagent.bin does run, I can operate the cluster resources, but whenever oraagent.bin disappears, cluster resources can not be managed anymore.

If this is the only problem (and no other sockets in /var/tmp/.oracle are missing) I can restart crsd which will also restart oraagent.bin and everything works like expected.
crsctl stop res ora.crsd -init
CRS-2673: Attempting to stop 'ora.crsd' on 'node1'
CRS-2677: Stop of 'ora.crsd' on 'node1' succeeded

crsctl start res ora.crsd -init  
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded

My next questions to answer is
1) why does oraagent.bin disappear at all?
2) why does files in /var/tmp/.oracle disappear?

This is still not answered and under investigation.
Kommentar veröffentlichen