Tuesday, 4 September 2018

TFA reports full dba_feature_usage_statistics to MyOracleSupport

Those who deal with My Oracle Support (MOS) regularly will have noticed its frequent requests for Trace File Analyzer (TFA) diagnostic collections.
Few will know that such a collection also contains the full content of the database's DBA_FEATURE_USAGE_STATISTICS view, at least when it is created with
tfactl diagcollect -database <DB_UNIQUE_NAME>

The collection runs tfa_home/resources/sql/db_feature_usage.sql, and its result is stored in the <hostname>.tfa_<timestamp>.zip file.
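If you want to see what is handed over before you upload such a collection, a minimal sketch (the zip name follows the pattern above; the query is my own, not the one from db_feature_usage.sql):

# list the files inside the generated collection and look for the feature-usage report
unzip -l <hostname>.tfa_<timestamp>.zip | grep -i feature

# or query the view directly to see which data it exposes
sqlplus -s / as sysdba <<'EOF'
select name, detected_usages, currently_used
  from dba_feature_usage_statistics
 where detected_usages > 0
 order by name;
EOF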

I did not find any mention of this behavior in the documentation, so it seems worth posting about.

This is true at least for version 18.3.0.0.0 (the latest version available as of 2018-09-04).

RBAL process needs to release FORMER disk

Oracle's Automatic Storage Management (ASM) has many nice advantages over other volume managers. One of my favorites is the ability to add, remove, or rebalance disks without any interruption to the database. Unfortunately, removing devices after the rebalance has completed is not as straightforward as expected:
In a current environment (Grid Infrastructure 12.2.0.1.180417, RDBMS 12.1.0.2.180116 and others), after an ALTER DISKGROUP xxx DROP DISK yyy the affected disks are in status FORMER.
Due to a specific scheme of directories and symlinks, it takes some effort to get to the real /dev/dm-xxx device.
Still, several RBAL processes of different DBs (I did not find a deterministic pattern as to which processes or DBs) hold a file handle on this specific device. To make it more interesting, the diskgroup (DG) or disk was never used by any of those DBs. We have a strict mapping of DBs to DGs, so it's easy for me to claim that.
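A rough sketch of how I check this (the disk alias and device name are just examples, not generic paths):

# follow the directory/symlink chain down to the real device-mapper device
readlink -f /dev/oracleasm/disks/DATA_0007        # resolves to something like /dev/dm-17

# list the processes that still hold a file handle on the dropped device;
# the RBAL background processes show up as ora_rbal_<SID>
lsof /dev/dm-17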

To make the RBAL processes release this "stale" file handle, I have to run an
  ALTER DISKGROUP zzz REBALANCE
Any DG does the trick; it does not need to be related to the DB of the RBAL process at all.
As diskgroups are normally balanced already, this is a kind of no-op. Still, it wakes up RBAL and makes it release unused file handles.
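As a sketch, the whole check-and-release cycle then looks roughly like this (connected to the ASM instance; the diskgroup name is an example):

sqlplus -s / as sysasm <<'EOF'
-- dropped disks whose headers are still marked FORMER
select group_number, path, header_status from v$asm_disk where header_status = 'FORMER';
-- any diskgroup will do; the statement just wakes up the RBAL processes
alter diskgroup DATA rebalance;
EOF
# afterwards the lsof check from above should no longer list the dropped device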

I still have no idea why these RBAL processes open a device their DB never needs at all;
but as I have a proper workaround, I won't investigate any further.

May this help someone out there, or at least me, when I hit this odd behavior again ;-)

Wednesday, 8 August 2018

unwanted cleanup in /var/tmp/.oracle

In my previous post, missing oraagent.bin process, I raised a question:
why do files in /var/tmp/.oracle disappear?
At least this question I am now able to answer.

With the help of the Linux kernel's audit system and a proper rule:
-a always,exit -F arch=b64 -S rename,rmdir,unlink,unlinkat,renameat -F path=/var/tmp/.oracle -F key=ora.delete
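(A sketch of how I load such a rule and read the results back; auditctl and ausearch come with the audit package.)

# load the rule into the running audit configuration (persist it via /etc/audit/rules.d/ if needed)
auditctl -a always,exit -F arch=b64 -S rename,rmdir,unlink,unlinkat,renameat -F path=/var/tmp/.oracle -F key=ora.delete

# later, search for all events tagged with this key; -i resolves uids, syscalls and timestamps
ausearch -k ora.delete -i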

I was able to see the culprit process:
----
type=PROCTITLE msg=audit(08/06/2018 17:13:48.382:58122) : proctitle=/usr/bin/systemd-tmpfiles --clean 
type=PATH msg=audit(08/06/2018 17:13:48.382:58122) : item=1 name=sCRSD_IPC_SOCKET_11 inode=50924979 dev=00:25 mode=socket,777 ouid=root ogid=root rdev=00:00 nametype=DELETE 
type=PATH msg=audit(08/06/2018 17:13:48.382:58122) : item=0 name=/ inode=45518 dev=00:25 mode=dir,sticky,777 ouid=root ogid=root rdev=00:00 nametype=PARENT 
type=CWD msg=audit(08/06/2018 17:13:48.382:58122) :  cwd=/ 
type=SYSCALL msg=audit(08/06/2018 17:13:48.382:58122) : arch=x86_64 syscall=unlinkat success=yes exit=0 a0=0x5 a1=0x556fb11fb17b a2=0x0 a3=0x3 items=2 ppid=1 pid=25307 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=systemd-tmpfile exe=/usr/bin/systemd-tmpfiles key=ora.delete 
----

So it was a feature of systemd: systemd-tmpfiles.

In detail, it comes from this particular line in
/usr/lib/tmpfiles.d/tmp.conf:
v /var/tmp 1777 root root 30d

The fix for this is easy: create a proper file (copy) in
/etc/tmpfiles.d
and add these lines at the end:
# Exclude /var/tmp/.oracle
x /var/tmp/.oracle
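As a sketch, the complete change can look like this (the file name below /etc/tmpfiles.d is arbitrary):

# copy the shipped config; files in /etc/tmpfiles.d take precedence over /usr/lib/tmpfiles.d
cp /usr/lib/tmpfiles.d/tmp.conf /etc/tmpfiles.d/tmp.conf

# append the exclude so systemd-tmpfiles --clean leaves /var/tmp/.oracle alone
cat >> /etc/tmpfiles.d/tmp.conf <<'EOF'
# Exclude /var/tmp/.oracle
x /var/tmp/.oracle
EOF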
But as research on MOS and Google did not come up with any issue involving systemd-tmpfiles and crsd.bin, oraagent.bin, or any other keyword, something else must be a problem here as well.
This (and the other question from my previous post) remains a topic for further investigation.

So much for now.

Friday, 27 July 2018

missing oraagent.bin process

It all started with a small error when I tried to stop/start a service:

srvctl start service -d cmpp01_site1 -s cmpp01
PRCD-1084 : Failed to start service CMPP01 
PRCR-1079 : Failed to start resource ora.cmpp01_site1.cmpp01.svc 
CRS-2680: Clean of 'ora.cmpp01_site1.cmpp01.svc' on 'av2l954p' failed 
CRS-5802: Unable to start the agent process 

It's not fine that the agent cannot be started, so let's check the logfile crsd_oraagent_oracle.trc:
2018-07-23 21:29:14.339 : CRSCOMM:2939654272:  Ipc: Starting send thread
2018-07-23 21:29:14.340 :GIPCXCPT:2939654272:  gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29)
2018-07-23 21:29:14.340 :GIPCXCPT:2939654272:  gipcConnectSyncF [connectToServer : clsIpcClient.cpp : 380]: EXCEPTION[ ret gipcretConnectionRefused (29) ]  failed sync connect endp 0x1cbb5b0 [0000000000000090] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x1cbe170, sendp 0x1cbdf20 status 13flags 0xa108871a, flags-2 0x0, usrFlags 0x30000 }, addr 0x1cbc990 [0000000000000097] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, flags 0x0
2018-07-23 21:29:14.340 : CRSCOMM:2939654272:  IpcC: gipcConnect() failed, rc= 29
2018-07-23 21:29:14.340 : CRSCOMM:2939654272: [FFAIL] IpcC: Could not connect to (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_IPC_SOCKET_11)) ret = 29

It clearly says that something is going wrong; unfortunately, it does not tell in detail what is wrong or what would be required.

Maybe the most beautiful but useless analysis here was provided by tfactl summary -crs:
DETAILS                            STATUS    COMPONENT  
+----------------------------------+---------+-----------+
  .------------------------------.   PROBLEM   CRS        
  | CRS_SERVER_STATUS   : ONLINE |                        
  | CRS_STATE           : ONLINE |                        
  | CRS_INTEGRITY_CHECK : FAIL   |                        
  | CRS_RESOURCE_STATUS : ONLINE |                        
  '------------------------------'                        
+----------------------------------+---------+-----------+

As neither My Oracle Support nor Google helped a lot here, I sent out a random cry for help on Twitter:


And Anil had mercy on me:
This was not the answer to my problem, but it started a very valuable conversation in which I learned enough to identify the issue.

It's important to know that (this) oraagent.bin process is started by crsd.bin, and that it is restarted if it disappears for any reason.
So to see what is really going on and what leads to the error above, I ran strace on crsd.bin and its forks/clones:
strace -f -p <pid_of_crsd.bin> -o <outfile>
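(A small sketch around that command; the output file is an example.)

pgrep -f crsd.bin                                   # find the pid of crsd.bin to attach to
strace -f -p <pid_of_crsd.bin> -o /tmp/crsd.strace  # needs to run as root, like crsd.bin itself
grep -E 'clone|execve|connect' /tmp/crsd.strace     # narrow the output to the interesting calls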

As expected, crsd.bin creates a new process for oraagent.bin:
18631 clone( <unfinished ...>
...
18631 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe273dfe9d0) = 18633
...
18633 execve("/appl/grid/grid_12201/bin/oraagent.bin", ["/appl/grid/grid_12201/bin/oraage"...], [/* 52 vars */]) = 0

And if everything runs as expected, the socket handling looks like this:
18633 socket(AF_LOCAL, SOCK_STREAM, 0 
...
18633 <... socket resumed> )            = 69  <== the creation of the socket
...
18633 connect(69, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_IPC_SOCKET_11"}, 110 <unfinished ...>
...
18633 <... connect resumed> )           = 0   <== the successful connection of the socket's file descriptor 69 to /var/tmp/.oracle/sCRSD_IPC_SOCKET_11

But in my problematic case it's this:
22612 connect(69, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_IPC_SOCKET_11"}, 110  <unfinished ...>
...
22612  <... connect resumed> )           = -1 ENOENT (No such file or directory)

The first finding here:
The error
gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29)
in the logfile correlates with
<... connect resumed> )           = -1 ENOENT (No such file or directory)

With this specific information, a short check in /var/tmp/.oracle confirms the file is indeed missing.
A crosscheck in /proc/<pid_of_crsd.bin>/fd/ shows:
lrwx------. 1 root root 64 Jul 26 22:00 /proc/2144/fd/214 -> /var/tmp/.oracle/sCRSD_IPC_SOCKET_11_lock (deleted)
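(For reference, the two checks are plain shell; the pid is a placeholder as above.)

ls -la /var/tmp/.oracle/ | grep -i CRSD               # the socket file sCRSD_IPC_SOCKET_11 is gone
ls -l /proc/<pid_of_crsd.bin>/fd | grep '(deleted)'   # crsd.bin still holds handles on deleted files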

This all happens on Grid Infrastructure 12.2 binaries with the latest patch set.


So far I have shown:
1) if oraagent.bin disappears, crsd.bin creates a new oraagent.bin process
2) the new oraagent.bin process tries to connect to a socket at /var/tmp/.oracle/sCRSD_IPC_SOCKET_11 to communicate with crsd.bin
3) if 2) fails, gipcretConnectionRefused (29) is shown in the logfile.



Right now I know:
If /var/tmp/.oracle/sCRSD_IPC_SOCKET_11 is missing but oraagent.bin is still running, I can operate the cluster resources; but whenever oraagent.bin disappears, cluster resources cannot be managed anymore.

If this is the only problem (and no other sockets in /var/tmp/.oracle are missing), I can restart crsd, which will also restart oraagent.bin, and everything works as expected:
crsctl stop res ora.crsd -init
CRS-2673: Attempting to stop 'ora.crsd' on 'node1'
CRS-2677: Stop of 'ora.crsd' on 'node1' succeeded

crsctl start res ora.crsd -init  
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded


My next questions to answer are:
1) why does oraagent.bin disappear at all?
2) why do files in /var/tmp/.oracle disappear?

These are still not answered and under investigation.

Monday, 23 July 2018

enabling Database Vault on an E-Business RAC database

Right now I'm in the process of setting up Database Vault for an E-Business Suite database. This is a 2-node RAC cluster.
The DB is 12.1.0.2 with the April 2017 BP.

As the DB already exists, I followed How To Enable Database Vault in a 12c database ? (Doc ID 2112167.1).
Everything looked smooth, but unfortunately, at the configuration of DV,
exec dvsys.configure_dv('DVOWNER','DVMANAGER');
fails with
ERROR at line 1: 
ORA-47500: Database Vault cannot be configured. 
ORA-06512: at "DVSYS.CONFIGURE_DV", line 23 
ORA-06512: at "DVSYS.CONFIGURE_DV", line 72 
ORA-06512: at line 2 

ORA-47500 says:
47500, 00000, "Database Vault cannot be configured."
// *Cause: The Database Vault configuration failed because some Database Vault  
//         objects were missing.
// *Action: Run the dvremov.sql script to remove Database Vault and then
//          reinstall Database Vault.

So this was the first attempt: remove DV and start over.
Nothing changed.


Some research on MOS led to Known Issues In A Database Vault Environment (Doc ID 2330775.1), where the explanation for ORA-47500 is:

Cause: There are many invalid objects in the database.

Solution: Run $ORACLE_HOME/rdbms/admin/utlrp.sql to validate the invalid objects

So all invalid objects were removed or fixed, and DV was enabled again.
Same error.


A lot more investigation and an SR were required here.

When catmac.sql was run, it completed and did not show any hint at the end that it had failed.
In particular, the registration of the DB and its validation were shown as successful:
SQL> DECLARE
  2      num number;
  3  BEGIN
  4      dbms_registry.loaded('DV');
  5      SYS.validate_dv;
  6  END;
  7  /

PL/SQL procedure successfully completed.


After a lot more investigation, an issue in the (manually activated) spool of catmac.sql was found:
ERROR at line 1:
ORA-31011: XML parsing failed
ORA-19213: error occurred in XML processing at lines 49
LPX-00217: invalid character 21 (U+0015)
ORA-06512: at "DVSYS.DBMS_MACADM", line 160
ORA-06512: at line 2
This error stack leads to Installation Of Database Vault Fails With ORA-19213, LPX-00217: Invalid Character 21 (U+0015) (Doc ID 2324871.1).
As so often, the problem looks obvious, but only after it has been found.


The problem is the environment variable
ORA_NLS10=<ORACLE_HOME>/nls/data/9idata
which is a requirement for Oracle E-Business Suite.

Here RAC has its fair share of complexity, as this environment variable is not (only) defined in a .bash_profile or similar, but also in the cluster registry:
srvctl getenv database -d <DB_UNIQUE_NAME>
EBSSIB01_SITE1:
ORA_NLS10=<ORACLE_HOME>/nls/data/9idata

The fix was to disable this setting:
srvctl unsetenv database -d <DB_UNIQUE_NAME> -t ORA_NLS10
restart all instances,
and then enable DV successfully at last!

Of course, afterwards ORA_NLS10 must be set again:
srvctl setenv database -d <DB_UNIQUE_NAME> -t "ORA_NLS10=<ORACLE_HOME>/nls/data/9idata"
and the instances restarted once more.
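Put together, the whole dance looks roughly like this as a sketch (the DB unique name is taken from the getenv output above; restarts done with plain srvctl stop/start):

srvctl unsetenv database -d EBSSIB01_SITE1 -t ORA_NLS10
srvctl stop database -d EBSSIB01_SITE1
srvctl start database -d EBSSIB01_SITE1
# ... enable Database Vault now (catmac.sql / dvsys.configure_dv as per Doc ID 2112167.1) ...
srvctl setenv database -d EBSSIB01_SITE1 -t "ORA_NLS10=<ORACLE_HOME>/nls/data/9idata"
srvctl stop database -d EBSSIB01_SITE1
srvctl start database -d EBSSIB01_SITE1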

Thursday, 19 July 2018

OUD - no operational route

Oracle Unified Directory (OUD) is by far not my preferred tool.
Probably it's me; I'm just not used to it.

The latest joy was an error when I tried to run
eusm addDatabase
and I got
javax.naming.AuthenticationException: [LDAP: error code 49 - Invalid Credentials]

Very unspecific, so I logged into Oracle Unified Directory Services Manager (OUDSM).


There, a click on Data Browser showed another error message:
LDAP: error code 52 - The load balancing workflow element 
has no operational route. 
This may happen when no route has been configured 
or all the configured routes are down.






Google was of no help here, but it was obvious that something was "down".

As there was no change on the OUD side, I asked the admins of those proxy data sources, my friends the AD admins.
They confessed they had changed something: they now require strong authentication, which shows up as "ldap_bind: Strong authentication required".

This leads to a quite straightforward solution:
enable LDAPS and import "their" certificates into OUD's trust manager.

First, import the certificates.
I decided to go for the preferred keystore type: JKS.
The password for the keystore is located in <OUD>/config/keystore.pin,
and the certificates are installed with:
<java_home>/bin/keytool \
-importcert -alias priv-root -file /tmp/priv-root.cer \
-keystore config/truststore -storetype JKS
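To double-check that the certificate landed in the truststore (a sketch; keytool prompts for the password from keystore.pin):

<java_home>/bin/keytool -list \
-keystore config/truststore -storetype JKS -alias priv-root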






And of course the configuration must be adapted as well:

With these settings, the Data Sources are reachable again; everything is fine.
(I had to restart OUD before all Data Sources became available again, no clue why.)

If you ever hit such an issue, don't follow my approach blindly.
May it help at least a little bit.

Tuesday, 26 June 2018

when it runs on your computer ...

... please don't consider it ready for customers.

I tried to enable Database Vault on an existing 12.2 database.
A proper document to follow is How To Enable Database Vault in a 12c database ? (Doc ID 2112167.1).

At some point it tells you to run
@$ORACLE_HOME/rdbms/admin/catmac.sql system temp <system_password>

Unfortunately this fails with:
CREATE USER dvsys IDENTIFIED BY "D_DVSYSPW"
*
ERROR at line 1:
ORA-28003: password verification for the specified password failed
ORA-20025: Password must contain at least 1 digit(s)
(You need to spool the results to a file to find this in the output stream.)

The reason is obvious:
SQL> select * from dba_profiles where resource_name='PASSWORD_VERIFY_FUNCTION' and Profile='DEFAULT';

PROFILE                        RESOURCE_NAME                    RESOURCE LIMIT                                    COM
------------------------------ -------------------------------- -------- ---------------------------------------- ---
DEFAULT                        PASSWORD_VERIFY_FUNCTION         PASSWORD ORA12C_VERIFY_FUNCTION                   NO

and so is the solution:
alter profile default limit PASSWORD_VERIFY_FUNCTION null;
@$ORACLE_HOME/rdbms/admin/catmac.sql system temp <system_password>
alter profile default limit PASSWORD_VERIFY_FUNCTION ORA12C_VERIFY_FUNCTION;
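Since the ORA-28003 scrolls by somewhere in the middle of catmac.sql's output, spooling helps; a rough sketch of the whole run (assuming catmac.sql returns control to the SQL*Plus session at the end):

sqlplus / as sysdba <<'EOF'
spool /tmp/catmac.log
alter profile default limit PASSWORD_VERIFY_FUNCTION null;
@?/rdbms/admin/catmac.sql system temp <system_password>
alter profile default limit PASSWORD_VERIFY_FUNCTION ORA12C_VERIFY_FUNCTION;
spool off
EOF
# any errors hiding in the output?
grep -i 'ORA-' /tmp/catmac.log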


Oracle requires you to disable one security feature to enable another one.
In this sandbox environment it's not a specific, hand-made password verify function; it's the default one from Oracle.
It would be very nice if Oracle tested its security features against each other.

There is no problem for me, but I'm wondering how seriously (and holistically) Oracle takes "security".