srvctl start service -d cmpp01_site1 -s cmpp01
PRCD-1084 : Failed to start service CMPP01 PRCR-1079 : Failed to start resource ora.cmpp01_site1.cmpp01.svc CRS-2680: Clean of 'ora.cmpp01_site1.cmpp01.svc' on 'av2l954p' failed CRS-5802: Unable to start the agent process
It's not fine the agent can not be started. so let's check the logfile crsd_oraagent_oracle.trc:
2018-07-23 21:29:14.339 : CRSCOMM:2939654272: Ipc: Starting send thread 2018-07-23 21:29:14.340 :GIPCXCPT:2939654272: gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29) 2018-07-23 21:29:14.340 :GIPCXCPT:2939654272: gipcConnectSyncF [connectToServer : clsIpcClient.cpp : 380]: EXCEPTION[ ret gipcretConnectionRefused (29) ] failed sync connect endp 0x1cbb5b0 [0000000000000090] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x1cbe170, sendp 0x1cbdf20 status 13flags 0xa108871a, flags-2 0x0, usrFlags 0x30000 }, addr 0x1cbc990 [0000000000000097] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, flags 0x0 2018-07-23 21:29:14.340 : CRSCOMM:2939654272: IpcC: gipcConnect() failed, rc= 29 2018-07-23 21:29:14.340 : CRSCOMM:2939654272: [FFAIL] IpcC: Could not connect to (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_IPC_SOCKET_11)) ret = 29
It clearly says there is something going wrong, unfortunately it doesn't tell in detail what's going wrong or what's required.
Maybe the most beautiful but useless analysis here was provided by tfactl summary -crs:
DETAILS STATUS COMPONENT +----------------------------------+---------+-----------+ .------------------------------. PROBLEM CRS | CRS_SERVER_STATUS : ONLINE | | CRS_STATE : ONLINE | | CRS_INTEGRITY_CHECK : FAIL | | CRS_RESOURCE_STATUS : ONLINE | '------------------------------' +----------------------------------+---------+-----------+
As neither My Oracle Support nor google helped a lot here, I started a random cry on twitter:
I can trace— Martin Berger (@martinberx) July 24, 2018
RCD-1316 : failed to stop services xxx
CRS-5802: Unable to start the agent process
down to
crsd_oraagent_oracle.trc
CRSCOMM:12345: [FFAIL] IpcC: Could not connect to (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_IPC_SOCKET_11)) ret = 29
and now?
any hints about this socket?
And Anil had mercy on me :
This was not the answer to my problem, but it started a very valuable conversation in which I learned enough to identify the issue.This is s local IPC request so check if crsd is running? Ps ef |grep crsd.bin.. Also check the crsd logs. It should have the gory details. Btw, Which version is this ?— Anil Nair (@RACMasterPM) July 24, 2018
It's important to know that (this) oraagent.bin process is started by crsd.bin, and it is restarted if it disappears for any reason.
So to see what's really going and and leads to the error above, running strace on crsd.bin and it's forks/clones:
strace -f -p <pid_of_crsd.bin> -o <outfile>
as expected crsd.bin creates a new process for oraagent.bin:
18631 clone( <unfinished ...> ... 18631 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe273dfe9d0) = 18633 ... 18633 execve("/appl/grid/grid_12201/bin/oraagent.bin", ["/appl/grid/grid_12201/bin/oraage"...], [/* 52 vars */]) = 0
and if everything runs as expected the socket handling looks like
18633 socket(AF_LOCAL, SOCK_STREAM, 0... 18633 <... socket resumed> ) = 69 <== the creation of the socked ... 18633 connect(69, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_IPC_SOCKET_11"}, 110 <unfinished ...> ... 18633 <... connect resumed> ) = 0 <== the successful connection of the sockets file descriptor 69 to /var/tmp/.oracle/sCRSD_IPC_SOCKET_11
But in my problematic case it's this:
22612 connect(69, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_IPC_SOCKET_11"}, 110 <unfinished ...> ... 22612 <... connect resumed> ) = -1 ENOENT (No such file or directory)
The first finding here:
The error
gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29)in the logfile correlates with
<... connect resumed> ) = -1 ENOENT (No such file or directory)
With this specific information a short check in /var/tmp/.oracle shows the missing file is not there.
A crosscheck in /proc/<pid_of_crsd.bin>/fd/ shows
lrwx------. 1 root root 64 Jul 26 22:00 /proc/2144/fd/214 -> /var/tmp/.oracle/sCRSD_IPC_SOCKET_11_lock (deleted)
this all happens on grid binaries 12.2 with latest patch set.
I showed now:
1) if oraagent.bin disappears, crsd.bin creates a new oraagent.bin process
2) the oraagent.bin process tries to connect a socket on /var/tmp/.oracle/sCRSD_IPC_SOCKET_11 to communicate with crsd.bin
3) if 2) fails, gipcretConnectionRefused (29) is shown in the logfile.
Right now I know:
If /var/tmp/.oracle/sCRSD_IPC_SOCKET_11 is missing but oraagent.bin does run, I can operate the cluster resources, but whenever oraagent.bin disappears, cluster resources can not be managed anymore.
If this is the only problem (and no other sockets in /var/tmp/.oracle are missing) I can restart crsd which will also restart oraagent.bin and everything works like expected.
crsctl stop res ora.crsd -init CRS-2673: Attempting to stop 'ora.crsd' on 'node1' CRS-2677: Stop of 'ora.crsd' on 'node1' succeeded crsctl start res ora.crsd -init CRS-2672: Attempting to start 'ora.crsd' on 'node1' CRS-2676: Start of 'ora.crsd' on 'node1' succeeded
My next questions to answer is
1) why does oraagent.bin disappear at all?
2) why does files in /var/tmp/.oracle disappear?
This is still not answered and under investigation.