In this post I try to show what's going on if a local listener dies in a 11gR2 RAC environment. My basic question is: When does (a) SCAN-Listener knows the local Listener disappeared?
My testcase (a sandbox):
- A 2-node RAC - all actions are run on node 1, if not explicit defined.
- My test-DB is called TTT04 (Test, you know?)
- I have 3 SCAN listeners there, but I want to make the test-case easier so I do pin down my connection string to only one SCAN-listener (it's SCAN2 in my case):
TTT04_bx = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = 172.24.32.117) (PORT = 1521)) # SCAN2 (CONNECT_DATA = (SERVICE_NAME = TTT04_SITE1) ) ) - start tracing pmon:
ps -ef | grep pmon | grep TTT04
SQL> oradebug setospid <pid_of_pmon>
Oracle pid: 2, Unix process pid: <pid_of_pmon>, image: oracle@<node1> (PMON)
SQL> oradebug Event 10257 trace name context forever, level 16
Statement processed.
- just to make sure server side load balancing will lead my to node1:
on node2: several
bzip2 -z -c /dev/urandom > /dev/null &
/tmp/bx1.sql
connect berx/berx123#@TTT04_bx
spool /tmp/bx1.txt
select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS'), HOST_NAME from v$instance;
exit
/tmp/bx2.sql
connect berx/berx123#@TTT04_bx
spool /tmp/bx2.txt
select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS'), HOST_NAME from v$instance;
exit
My command is
kill -9 `pgrep -f "tnslsnr LISTENER "` ; lsnrctl services LISTENER_SCAN2 > /tmp/lsnr1.txt ; sqlplus /nolog @/tmp/bx1.sql & sleep 5 ; lsnrctl services LISTENER_SCAN2 > /tmp/lsnr2.txt; sqlplus /nolog @/tmp/bx2.sql
and the result on the Terminal:
SQL*Plus: Release 11.2.0.3.0 Production on Sat May 5 23:00:50 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
ERROR:
ORA-12541: TNS:no listener
SP2-0640: Not connected
[1]+ Done sqlplus /nolog @/tmp/bx1.sql 2> /tmp/bx1.err
SQL*Plus: Release 11.2.0.3.0 Production on Sat May 5 23:00:55 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected.
TO_CHAR(SYSDATE,'YY HOST_NAME
------------------- ---------
2012-05-05 23:00:55 <node2>
Now the question is, what happened between these 5 seconds of
sleep 5?- pmons tracefile
TTT041_pmon_4399.trcshows
*** 2012-05-05 23:00:48.391
kmmlrl: status: succ=4, wait=0, fail=0
*** 2012-05-05 23:00:51.398
kmmlrl: status: succ=3, wait=0, fail=1
kmmlrl: update retry
kmmgdnu: TTT04_SITE1
goodness=0, delta=1,
flags=0x4:unblocked/not overloaded, update=0x6:G/D/-
kmmlrl: node load 394
kmmlrl: (ADDRESS=(PROTOCOL=TCP)(HOST=<node1>-vip)(PORT=1521)) block
kmmlrl: nsgr update returned 0
kmmlrl: nsgr register returned 0
*** 2012-05-05 23:00:54.401
kmmlrl: status: succ=3, wait=0, fail=1
*** 2012-05-05 23:00:57.402
kmmlrl: status: succ=3, wait=0, fail=1
Just a short explanation what you can see here: pmon distributes every 3 seconds the status he knows about all listeners he knows to all other listeners. Between 23:00:48 and 23:00:51 pmon found the local_listener disappeared - and informed all the other listeners about this fact.
- what the LISTENER_SCAN2 knows:
lsnr1.txt
Service "TTT04_SITE1" has 2 instance(s).
Instance "TTT041", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))
Instance "TTT042", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))
lsnr2.txt
Service "TTT04_SITE1" has 2 instance(s).
Instance "TTT041", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:blocked
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))
Instance "TTT042", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))
Direct after the kill of the local_listener, LISTENER_SCAN2 still believes both local_listeners are in state:ready - therefore it directs the connection to the listener with lower load (on node1) which I just killed.
But after only 5 seconds, it knows it is in state:blocked and therefore directs my 2nd connection attempt to node2.
What is this all about?
- If a Listener dies (for any reason) there is a periode of about 3 seconds, where other processes might rely on it's existence and service.
- PMON is the process which informas all listeners about the status of the others (one more reason to make sure it never stucks.
- PMONs Listener REGISTRATION is something different.
My todays findings where supported by Understanding and Troubleshooting Instance Load Balancing [Note:263599.1].