berxblog: Juli 2018

2018-07-27

missing oraagent.bin process

It all started with a small error when I tried to stop/start a service:

srvctl start service -d cmpp01_site1 -s cmpp01

PRCD-1084 : Failed to start service CMPP01 
PRCR-1079 : Failed to start resource ora.cmpp01_site1.cmpp01.svc 
CRS-2680: Clean of 'ora.cmpp01_site1.cmpp01.svc' on 'av2l954p' failed 
CRS-5802: Unable to start the agent process

It's not fine the agent can not be started. so let's check the logfile crsd_oraagent_oracle.trc:

2018-07-23 21:29:14.339 : CRSCOMM:2939654272:  Ipc: Starting send thread
2018-07-23 21:29:14.340 :GIPCXCPT:2939654272:  gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29)
2018-07-23 21:29:14.340 :GIPCXCPT:2939654272:  gipcConnectSyncF [connectToServer : clsIpcClient.cpp : 380]: EXCEPTION[ ret gipcretConnectionRefused (29) ]  failed sync connect endp 0x1cbb5b0 [0000000000000090] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x1cbe170, sendp 0x1cbdf20 status 13flags 0xa108871a, flags-2 0x0, usrFlags 0x30000 }, addr 0x1cbc990 [0000000000000097] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, flags 0x0
2018-07-23 21:29:14.340 : CRSCOMM:2939654272:  IpcC: gipcConnect() failed, rc= 29
2018-07-23 21:29:14.340 : CRSCOMM:2939654272: [FFAIL] IpcC: Could not connect to (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_IPC_SOCKET_11)) ret = 29

It clearly says there is something going wrong, unfortunately it doesn't tell in detail what's going wrong or what's required.

Maybe the most beautiful but useless analysis here was provided by tfactl summary -crs:

DETAILS                            STATUS    COMPONENT  
+----------------------------------+---------+-----------+
  .------------------------------.   PROBLEM   CRS        
  | CRS_SERVER_STATUS   : ONLINE |                        
  | CRS_STATE           : ONLINE |                        
  | CRS_INTEGRITY_CHECK : FAIL   |                        
  | CRS_RESOURCE_STATUS : ONLINE |                        
  '------------------------------'                        
+----------------------------------+---------+-----------+

As neither My Oracle Support nor google helped a lot here, I started a random cry on twitter:

I can trace
RCD-1316 : failed to stop services xxx
CRS-5802: Unable to start the agent process
down to
crsd_oraagent_oracle.trc
CRSCOMM:12345: [FFAIL] IpcC: Could not connect to (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_IPC_SOCKET_11)) ret = 29
and now?
any hints about this socket?
— Martin Berger (@martinberx) July 24, 2018

And Anil had mercy on me :

This is s local IPC request so check if crsd is running? Ps ef |grep crsd.bin.. Also check the crsd logs. It should have the gory details. Btw, Which version is this ?
— Anil Nair (@RACMasterPM) July 24, 2018

This was not the answer to my problem, but it started a very valuable conversation in which I learned enough to identify the issue.

It's important to know that (this) oraagent.bin process is started by crsd.bin, and it is restarted if it disappears for any reason.
So to see what's really going and and leads to the error above, running strace on crsd.bin and it's forks/clones:
strace -f -p <pid_of_crsd.bin> -o <outfile>

as expected crsd.bin creates a new process for oraagent.bin:

18631 clone( <unfinished ...>
...
18631 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe273dfe9d0) = 18633
...
18633 execve("/appl/grid/grid_12201/bin/oraagent.bin", ["/appl/grid/grid_12201/bin/oraage"...], [/* 52 vars */]) = 0

and if everything runs as expected the socket handling looks like

18633 socket(AF_LOCAL, SOCK_STREAM, 0 
...
18633 <... socket resumed> )            = 69  <== the creation of the socked
...
18633 connect(69, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_IPC_SOCKET_11"}, 110 <unfinished ...>
...
18633 <... connect resumed> )           = 0   <== the successful connection of the sockets file descriptor 69 to  /var/tmp/.oracle/sCRSD_IPC_SOCKET_11

But in my problematic case it's this:

22612 connect(69, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_IPC_SOCKET_11"}, 110  <unfinished ...>
...
22612  <... connect resumed> )           = -1 ENOENT (No such file or directory)

The first finding here:
The error

gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29)

in the logfile correlates with

<... connect resumed> )           = -1 ENOENT (No such file or directory)

With this specific information a short check in /var/tmp/.oracle shows the missing file is not there.
A crosscheck in /proc/<pid_of_crsd.bin>/fd/ shows

lrwx------. 1 root root 64 Jul 26 22:00 /proc/2144/fd/214 -> /var/tmp/.oracle/sCRSD_IPC_SOCKET_11_lock (deleted)

this all happens on grid binaries 12.2 with latest patch set.

I showed now:
1) if oraagent.bin disappears, crsd.bin creates a new oraagent.bin process
2) the oraagent.bin process tries to connect a socket on /var/tmp/.oracle/sCRSD_IPC_SOCKET_11 to communicate with crsd.bin
3) if 2) fails, gipcretConnectionRefused (29) is shown in the logfile.

Right now I know:
If /var/tmp/.oracle/sCRSD_IPC_SOCKET_11 is missing but oraagent.bin does run, I can operate the cluster resources, but whenever oraagent.bin disappears, cluster resources can not be managed anymore.

If this is the only problem (and no other sockets in /var/tmp/.oracle are missing) I can restart crsd which will also restart oraagent.bin and everything works like expected.

crsctl stop res ora.crsd -init
CRS-2673: Attempting to stop 'ora.crsd' on 'node1'
CRS-2677: Stop of 'ora.crsd' on 'node1' succeeded

crsctl start res ora.crsd -init  
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded

My next questions to answer is
1) why does oraagent.bin disappear at all?
2) why does files in /var/tmp/.oracle disappear?

This is still not answered and under investigation.

2018-07-23

enabling Database Vault on e-business RAC database

Right now I'm in a process to setup Database Vault for an E-Business suite database. This is a 2 node RAC cluster.
The DB is 12.1.0.2 with April 17 BP.

As the DB exists already, I followed How To Enable Database Vault in a 12c database ? (Doc ID 2112167.1).
Everything looks smooth, but unfortunately, at the Configuration of DV

exec dvsys.configure_dv('DVOWNER','DVMANAGER');

failes with

ERROR at line 1: 
ORA-47500: Database Vault cannot be configured. 
ORA-06512: at "DVSYS.CONFIGURE_DV", line 23 
ORA-06512: at "DVSYS.CONFIGURE_DV", line 72 
ORA-06512: at line 2

ORA-47500 tells

47500, 00000, "Database Vault cannot be configured."
// *Cause: The Database Vault configuration failed because some Database Vault  
//         objects were missing.
// *Action: Run the dvremov.sql script to remove Database Vault and then
//          reinstall Database Vault.

This was the first try: remove DV and start again.
Nothing changed.

Some research on MOS led to Known Issues In A Database Vault Environment (Doc ID 2330775.1)
there the explanation for ORA-47500 is:

Cause: There are many invalid objects in the database.

Solution: Run $ORACLE_HOME/rdbms/admin/utlrp.sql to validate the invalid objects

So all invalid objects were removed/fixed and - DV enabled again.
With same error.

A lot more investigation and a SR was required here.

When catmac.sql was run it completed and didn't show any hints it failed at the end.
Especially register DB and it's validation were shown successful:

SQL> DECLARE
  2      num number;
  3  BEGIN
  4      dbms_registry.loaded('DV');
  5      SYS.validate_dv;
  6  END;
  7  /

PL/SQL procedure successfully completed.

After a lot more investigation, an issue in the (manually activated) spool of catmac.sql was found:

ERROR at line 1:
ORA-31011: XML parsing failed
ORA-19213: error occurred in XML processing at lines 49
LPX-00217: invalid character 21 (U+0015)
ORA-06512: at "DVSYS.DBMS_MACADM", line 160
ORA-06512: at line 2

and this error stack leads to Installation Of Database Vault Fails With ORA-19213, LPX-00217: Invalid Character 21 (U+0015) (Doc ID 2324871.1)
As so often: the problem looks obvious, but only after it was found.

The problem is the environment variable
ORA_NLS10=<ORACLE_HOME>/nls/data/9idata
Which is a requirement for Oracle E-business Suite.

Here RAC has it fair share of complexity as this environment is not (only) defined in a .bash_profile or similar, but also in cluster registry:

srvctl getenv database -d <DB_UNIQUE_NAME>
EBSSIB01_SITE1:
ORA_NLS10=<ORACLE_HOME>/nls/data/9idata

The fix was to disable this setting:

srvctl unsetenv database -d <DB_UNIQUE_NAME> -t ORA_NLS10

restart all instances
and now successfully enable DV at last!

Of course, afterwards ORA_NLS10 must be set again

srvctl setenv  database -d <DB_UNIQUE_NAME> -t "ORA_NLS10=<ORACLE_HOME>/nls/data/9idata"

and instances restarted once more.

2018-07-19

OUD - no operational route

Oracle Universal Directory is by far not my preferred tool.
Probably it's me, I'm just not used to it.

The latest joy was an error when I tried to run

eusm addDatabase

and I got

javax.naming.AuthenticationException: [LDAP: error code 49 - Invalid Credentials]

Very unspecific, so I logged into Oracle Unified Directory Services Manager (OUDSM)

There a click on Data Browser showed another error message:

LDAP: error code 52 - The load balancing workflow element 
has no operational route. 
This may happen when no route has been configured 
or all the configured routes are down.

Google was of no help here, but it was obvious something is "down":

As there was no change on OUD, I asked admins of those proxy data sources - my friends the AD admins.
They confessed they changed something: they changed authentication to "ldap_bind: Strong authentication required"

This leads to a quite straight forward solution:
Enable LDAPS and import "their" keys into OUDs Trust Manager:

First import the keys:
I decided to go for the preferred keystore: JKS
First the password for the keystore: it's located in <OUD>/config/keystore.pin
and the keys installed with

<java_home>/bin/keytool \
-importcert -alias priv-root -file /tmp/priv-root.cer \
-keystore config/truststore -storetype JKS

And of course the config must be adapted also:

With these settings, the Data Sources are reachable again - everything fine.
(I had to restart OUD that all Data Sources were available - no clue why)

If you ever hit such an issue, don't follow my approach blindly.
Might it help at least a little big.

berxblog