On a 4 node cluster 2 nodes got Linux patches and were rebooted. This was done after 10pm for minimal business impact. (The application is written by Oracle - so it's not fully TAF aware).
The DBs started fine, all services were available.
At 3:50 in the morning I got a call some processes show strange errors.
As I connected to the Server, the Instance shows running proceses. But I could not connect to the instance:
With proper environment variables, the errors were:
sqlplus system/xxx SQL*Plus: Release 11.2.0.4.0 Production on Thu Oct 19 03:58:52 2017 Copyright (c) 1982, 2013, Oracle. All rights reserved. Enter password: ERROR: ORA-01034: ORACLE not available ORA-27123: unable to attach to shared memory segment Linux-x86_64 Error: 22: Invalid argument Additional information: 25 Additional information: 323780623 Additional information: -1610612736 Process ID: 0 Session ID: 0 Serial number: 0 Enter user-name: ERROR: ORA-01017: invalid username/password; logon denied
and
sqlplus system/xxx SQL*Plus: Release 11.2.0.4.0 Production on Thu Oct 19 03:59:33 2017 Copyright (c) 1982, 2013, Oracle. All rights reserved. Connected to an idle instance. SQL>
So it seems it's related to shared memory.
First checks showed there are enough hugepages configured and available, ipcs lists them correct.
in alert.log I found entries like
Thu Oct 19 03:34:41 2017 Process q003 died, see its trace file Thu Oct 19 03:35:40 2017 Process W000 died, see its trace file Process W000 died, see its trace file Process W000 died, see its trace file Thu Oct 19 03:35:52 2017 Process W000 died, see its trace file Process W000 died, see its trace file Thu Oct 19 03:35:57 2017 Process m000 died, see its trace file Process W000 died, see its trace file Thu Oct 19 03:36:04 2017and
Thu Oct 19 04:00:02 2017 Process J000 died, see its trace file kkjcre1p: unable to spawn jobq slave process Errors in file /...diag.../trace/INST_NAME_cjq0_61788.trc: Process W000 died, see its trace file Thu Oct 19 04:00:04 2017
this contains
*** 2017-10-19 04:00:02.536 *** SESSION ID:(631.21) 2017-10-19 04:00:02.536 *** CLIENT ID:() 2017-10-19 04:00:02.536 *** SERVICE NAME:(SYS$BACKGROUND) 2017-10-19 04:00:02.536 *** MODULE NAME:() 2017-10-19 04:00:02.536 *** ACTION NAME:() 2017-10-19 04:00:02.536 Process J000 is dead (pid=119122 req_ver=453 cur_ver=453 state=KSOSP_SPAWNED). *** 2017-10-19 04:00:09.545 Process J000 is dead (pid=119264 req_ver=456 cur_ver=456 state=KSOSP_SPAWNED). *** 2017-10-19 04:00:12.539 Process J000 is dead (pid=119278 req_ver=25 cur_ver=25 state=KSOSP_SPAWNED).
With all this information (and some search effort) the best match I found is Connected to an idle instance, while database instance is running (Process J000 died, see its trace file kkjcre1p: unable to spawn jobq slave process ) (Doc ID 1578491.1).
But why is prelink doing such bad things?
Some information about the memory structure of the running instance and new started process, together with a good old strace will shed some light:
first of all the /proc/<pid>/maps content of the running smon (any process attached to the instance would do):
00400000-0bdbb000 r-xp 00000000 fd:ac 4730278 /O_H/bin/oracle 0bfba000-0c1b4000 rw-p 0b9ba000 fd:ac 4730278 /O_H/bin/oracle 0c1b4000-0c20b000 rw-p 00000000 00:00 0 0c89b000-0c8fe000 rw-p 00000000 00:00 0 [heap] 60000000-a0000000 rw-s 00000000 00:0c 323747854 /SYSV00000000 (deleted) a0000000-3260000000 rw-s 00000000 00:0c 323780623 /SYSV00000000 (deleted) 3260000000-3260200000 rw-s 00000000 00:0c 323813392 /SYSV6a4ae364 (deleted) 3c2ee00000-3c2ee01000 r-xp 00000000 fd:01 1625 /lib64/libaio.so.1.0.1 3c2ee01000-3c2f000000 ---p 00001000 fd:01 1625 /lib64/libaio.so.1.0.1 3c2f000000-3c2f001000 rw-p 00000000 fd:01 1625 /lib64/libaio.so.1.0.1 7feb42051000-7feb42061000 rw-p 00000000 00:05 4603 /dev/zero 7feb42061000-7feb42071000 rw-p 00000000 00:05 4603 /dev/zero 7feb42071000-7feb42081000 rw-p 00000000 00:05 4603 /dev/zero 7feb42081000-7feb42091000 rw-p 00000000 00:05 4603 /dev/zero 7feb42091000-7feb420a1000 rw-p 00000000 00:05 4603 /dev/zero ... 7feb42750000-7feb42751000 rw-p 0007f000 00:05 4603 /dev/zero 7feb42751000-7feb42891000 rw-p 00000000 00:00 0 7feb42891000-7feb4289d000 r-xp 00000000 fd:01 3650 /lib64/libnss_files-2.12.so 7feb4289d000-7feb42a9c000 ---p 0000c000 fd:01 3650 /lib64/libnss_files-2.12.so 7feb42a9c000-7feb42a9d000 r--p 0000b000 fd:01 3650 /lib64/libnss_files-2.12.so 7feb42a9d000-7feb42a9e000 rw-p 0000c000 fd:01 3650 /lib64/libnss_files-2.12.so 7feb42a9e000-7feb42b9e000 rw-p 00000000 00:00 0 7feb42b9e000-7feb42ba8000 r-xp 00000000 fd:a4 2880 /usr/lib64/libnuma.so.1 (deleted) 7feb42ba8000-7feb42da8000 ---p 0000a000 fd:a4 2880 /usr/lib64/libnuma.so.1 (deleted) 7feb42da8000-7feb42da9000 rw-p 0000a000 fd:a4 2880 /usr/lib64/libnuma.so.1 (deleted) 7feb42da9000-7feb42f33000 r-xp 00000000 fd:01 118 /lib64/libc-2.12.so (deleted) 7feb42f33000-7feb43133000 ---p 0018a000 fd:01 118 /lib64/libc-2.12.so (deleted) 7feb43133000-7feb43137000 r--p 0018a000 fd:01 118 /lib64/libc-2.12.so (deleted) 7feb43137000-7feb43139000 rw-p 0018e000 fd:01 118 /lib64/libc-2.12.so (deleted) 7feb43139000-7feb4313d000 rw-p 00000000 00:00 0 7feb4313d000-7feb43153000 r-xp 00000000 fd:01 3646 /lib64/libnsl-2.12.so (deleted) 7feb43153000-7feb43352000 ---p 00016000 fd:01 3646 /lib64/libnsl-2.12.so (deleted) 7feb43352000-7feb43353000 r--p 00015000 fd:01 3646 /lib64/libnsl-2.12.so (deleted) 7feb43353000-7feb43354000 rw-p 00016000 fd:01 3646 /lib64/libnsl-2.12.so (deleted) 7feb43354000-7feb43356000 rw-p 00000000 00:00 0 7feb43356000-7feb4336d000 r-xp 00000000 fd:01 142 /lib64/libpthread-2.12.so.#prelink#.O7kB09 (deleted) 7feb4336d000-7feb4356d000 ---p 00017000 fd:01 142 /lib64/libpthread-2.12.so.#prelink#.O7kB09 (deleted) 7feb4356d000-7feb4356e000 r--p 00017000 fd:01 142 /lib64/libpthread-2.12.so.#prelink#.O7kB09 (deleted) 7feb4356e000-7feb4356f000 rw-p 00018000 fd:01 142 /lib64/libpthread-2.12.so.#prelink#.O7kB09 (deleted) 7feb4356f000-7feb43573000 rw-p 00000000 00:00 0 7feb43573000-7feb435f6000 r-xp 00000000 fd:01 3644 /lib64/libm-2.12.so (deleted) 7feb435f6000-7feb437f5000 ---p 00083000 fd:01 3644 /lib64/libm-2.12.so (deleted) 7feb437f5000-7feb437f6000 r--p 00082000 fd:01 3644 /lib64/libm-2.12.so (deleted) 7feb437f6000-7feb437f7000 rw-p 00083000 fd:01 3644 /lib64/libm-2.12.so (deleted) 7feb437f7000-7feb437f9000 r-xp 00000000 fd:01 3642 /lib64/libdl-2.12.so (deleted) 7feb437f9000-7feb439f9000 ---p 00002000 fd:01 3642 /lib64/libdl-2.12.so (deleted) 7feb439f9000-7feb439fa000 r--p 00002000 fd:01 3642 /lib64/libdl-2.12.so (deleted) 7feb439fa000-7feb439fb000 rw-p 00003000 fd:01 3642 /lib64/libdl-2.12.so (deleted) 7feb439fb000-7feb43a07000 r-xp 00000000 fd:ac 4593158 /O_H/lib/libocrutl11.so 7feb43a07000-7feb43b06000 ---p 0000c000 fd:ac 4593158 /O_H/lib/libocrutl11.so ...Here we see at the top 3 shared memory segments (bold) - the one with id 323780623 will be of special interest.
As I try to attach a new process (sqlplus) to the instance, strace shows me
... 9134 shmat(323780623, 0xa0000000, 0) = -1 EINVAL (Invalid argument) 9134 shmctl(323780623, IPC_STAT, 0x7fffc48703c8) = 0 9134 shmat(323780623, 0, 0) = 0x2aaaaac00000 9134 shmdt(0x2aaaaac00000) = 0 9134 uname({sys="Linux", node="av3l291p", ...}) = 0 9134 times({tms_utime=0, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 432291212 9134 times({tms_utime=0, tms_stime=1, tms_cutime=0, tms_cstime=0}) = 432291212 9134 stat("/O_H/rdbms/log", {st_mode=S_IFDIR|0755, st_size=503808, ...}) = 0 9134 stat("/O_H/rdbms/log/ebssip021_ora_9134.trc", 0x7fffc486c810) = -1 ENOENT (No such file or directory) 9134 open("/O_H/rdbms/log/ebssip021_ora_9134.trc", O_WRONLY|O_CREAT|O_TRUNC, 0660) = 4 9134 lstat("/O_H/rdbms/log/ebssip021_ora_9134.trc", {st_mode=S_IFREG|0640, st_size=0, ...}) = 0 ...Its a nice habit to write a trace file in such a situation. Oracle Trace Files are often great. In this case, it contains (beside many other things) information about the shared memory segments:
*********************** Dumping ipcs output ******************** ------ Shared Memory Segments -------- key shmid owner perms bytes nattch status 0x00000000 260440066 oracle 600 4096 0 0x00000000 260472835 oracle 600 4096 0 0x2bac4398 260505604 oracle 600 24576 95 0x00000000 320405509 oracle 640 67108864 85 0x00000000 320438278 oracle 640 10670309376 85 0xa6f47cf8 320471047 oracle 640 2097152 85 0x00000000 323485704 oracle 640 33554432 94 0x00000000 323551241 oracle 640 16777216 146 0x00000000 323584010 oracle 640 1056964608 146 0x8015f33c 323616779 oracle 640 2097152 146 0x00000000 323649548 oracle 640 2113929216 94 0xb047bb08 323682317 oracle 640 2097152 94 0x00000000 323747854 oracle 640 1073741824 76 0x00000000 323780623 oracle 640 213674622976 75 0x6a4ae364 323813392 oracle 640 2097152 76 0x00000000 326828049 oracle 640 50331648 128 0x00000000 326860818 oracle 640 5318377472 128 0x3df7a980 326893587 oracle 640 2097152 128
and the process map:
******************* Dumping process map **************** 00400000-0bdbb000 r-xp 00000000 fd:ac 4730278 /O_H/bin/oracle 0bfba000-0c1b4000 rw-p 0b9ba000 fd:ac 4730278 /O_H/bin/oracle 0c1b4000-0c20b000 rw-p 00000000 00:00 0 0d2c4000-0d309000 rw-p 00000000 00:00 0 [heap] 60000000-a0000000 rw-s 00000000 00:0c 323747854 /SYSV00000000 (deleted) 324e000000-324e020000 r-xp 00000000 fd:01 119 /lib64/ld-2.12.so 324e220000-324e221000 r--p 00020000 fd:01 119 /lib64/ld-2.12.so 324e221000-324e222000 rw-p 00021000 fd:01 119 /lib64/ld-2.12.so 324e222000-324e223000 rw-p 00000000 00:00 0 324e400000-324e58a000 r-xp 00000000 fd:01 120 /lib64/libc-2.12.so 324e58a000-324e78a000 ---p 0018a000 fd:01 120 /lib64/libc-2.12.so 324e78a000-324e78e000 r--p 0018a000 fd:01 120 /lib64/libc-2.12.so 324e78e000-324e790000 rw-p 0018e000 fd:01 120 /lib64/libc-2.12.so
My sqlplus process tried to attach the shared memory segment 323780623 at address 0xa0000000 ( shmat(323780623, 0xa0000000, 0) ) but it failed (-1 EINVAL).
The reason for the failure is the alredy linked library /lib64/ld-2.12.so at address 324e000000 (and some others as well).
But why is it so important to have this shared memory at the specific address? As strace shows, it was able to map it on some other address without any issue ( shmat(323780623, 0, 0 = 0x2aaaaac00000 ).
But Oracle requires the shared memory segments mapped at the same address for all processes of an instance. It uses the direct memory address to access all it's memory structures. E.g. when querying saddr or paddr from v$session, these are the real addresses.
So if you map a shared memory process with your own binary at the same address, you can access the memory structures direct using these addresses.
So what bad did prelink make to the system?
The Linux server was patched at about midnight. A lot of new RPMs were installed and the server rebooted. Clusterware started and fine and int started all instances.
But sometime close to 03:34 prelink was started, it did a lot of work with all the new libraries and created a new /etc/prelink.cache file, so all new started binaries loads their libraries according to this file. In my case the shared memory segment could not be loaded at the specific address anymore - so the Instance was still healthy, but no new process could attach anymore.
My solution was to kill the instance and restart it.
Generally I recommend to disable prelink on Linux hosts running Oracle database.
Oracle also recommends doing so.
Keine Kommentare:
Kommentar veröffentlichen