In this post I try to show what happens when a local listener dies in an 11gR2 RAC environment. My basic question is: when does a SCAN listener know that the local listener has disappeared?
My testcase (a sandbox):
- A 2-node RAC. All actions are run on node 1 unless stated otherwise.
- My test DB is called TTT04 (Test, you know?)
- I have 3 SCAN listeners there, but to keep the test case simple I pin my connection string to a single SCAN listener (SCAN2 in my case):
TTT04_bx =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 172.24.32.117)(PORT = 1521)) # SCAN2
    (CONNECT_DATA =
      (SERVICE_NAME = TTT04_SITE1)
    )
  )
- start tracing pmon:
ps -ef | grep pmon | grep TTT04
SQL> oradebug setospid <pid_of_pmon>
Oracle pid: 2, Unix process pid: <pid_of_pmon>, image: oracle@<node1> (PMON)
SQL> oradebug Event 10257 trace name context forever, level 16
Statement processed.
- just to make sure server-side load balancing will lead me to node1, I put some load on node2 by starting several of these there:
bzip2 -z -c /dev/urandom > /dev/null &
/tmp/bx1.sql
connect berx/berx123#@TTT04_bx
spool /tmp/bx1.txt
select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS'), HOST_NAME from v$instance;
exit
/tmp/bx2.sql
connect berx/berx123#@TTT04_bx
spool /tmp/bx2.txt
select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS'), HOST_NAME from v$instance;
exit
My command is
kill -9 `pgrep -f "tnslsnr LISTENER "` ; lsnrctl services LISTENER_SCAN2 > /tmp/lsnr1.txt ; sqlplus /nolog @/tmp/bx1.sql & sleep 5 ; lsnrctl services LISTENER_SCAN2 > /tmp/lsnr2.txt; sqlplus /nolog @/tmp/bx2.sql
and the result on the Terminal:
SQL*Plus: Release 11.2.0.3.0 Production on Sat May 5 23:00:50 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
ERROR:
ORA-12541: TNS:no listener
SP2-0640: Not connected
[1]+ Done sqlplus /nolog @/tmp/bx1.sql 2> /tmp/bx1.err
SQL*Plus: Release 11.2.0.3.0 Production on Sat May 5 23:00:55 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected.
TO_CHAR(SYSDATE,'YY HOST_NAME
------------------- ---------
2012-05-05 23:00:55 <node2>
Now the question is: what happened during the 5 seconds of sleep 5?
- pmon's trace file TTT041_pmon_4399.trc shows
*** 2012-05-05 23:00:48.391
kmmlrl: status: succ=4, wait=0, fail=0
*** 2012-05-05 23:00:51.398
kmmlrl: status: succ=3, wait=0, fail=1
kmmlrl: update retry
kmmgdnu: TTT04_SITE1
goodness=0, delta=1,
flags=0x4:unblocked/not overloaded, update=0x6:G/D/-
kmmlrl: node load 394
kmmlrl: (ADDRESS=(PROTOCOL=TCP)(HOST=<node1>-vip)(PORT=1521)) block
kmmlrl: nsgr update returned 0
kmmlrl: nsgr register returned 0
*** 2012-05-05 23:00:54.401
kmmlrl: status: succ=3, wait=0, fail=1
*** 2012-05-05 23:00:57.402
kmmlrl: status: succ=3, wait=0, fail=1
Just a short explanation of what you can see here: every 3 seconds pmon sends the status it knows about all listeners to all other listeners. Between 23:00:48 and 23:00:51 pmon found that the local listener had disappeared (succ drops from 4 to 3 and fail rises to 1) and informed all the other listeners about this fact.
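To find that moment without reading the whole trace, a small helper can print the first timestamp at which a kmmlrl status line reports a failed listener. This is only a sketch: the function name first_listener_failure is made up for illustration, and it assumes the trace has exactly the layout shown in the excerpt above (a "*** <date> <time>" line followed by a "kmmlrl: status:" line ending in fail=N).

```shell
# first_listener_failure TRACEFILE   (hypothetical helper, not an Oracle tool)
# Prints the timestamp of the first "kmmlrl: status" line with fail > 0.
# Usage (file name taken from the post): first_listener_failure TTT041_pmon_4399.trc
first_listener_failure() {
    awk '
        # Timestamp lines look like: *** 2012-05-05 23:00:51.398
        /^[ \t]*\*\*\* / { ts = $2 " " $3 }
        # Status lines look like: kmmlrl: status: succ=3, wait=0, fail=1
        /kmmlrl: status:/ {
            n = $NF                  # last field, e.g. "fail=1"
            sub(/^fail=/, "", n)     # keep only the number
            if (n + 0 > 0) { print ts; exit }
        }
    ' "$1"
}
```

Run against the excerpt above, this should point at the 23:00:51 entry, the first one where fail is no longer 0.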
- what LISTENER_SCAN2 knows:
lsnr1.txt
Service "TTT04_SITE1" has 2 instance(s).
Instance "TTT041", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))
Instance "TTT042", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))
lsnr2.txt
Service "TTT04_SITE1" has 2 instance(s).
Instance "TTT041", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:blocked
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))
Instance "TTT042", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))
Directly after the kill of the local listener, LISTENER_SCAN2 still believes both local listeners are in state:ready. It therefore directs the connection to the listener on the node with the lower load (node1), which is the one I just killed.
But after only 5 seconds it knows the handler on node1 is in state:blocked and therefore directs my 2nd connection attempt to node2.
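Comparing the two snapshots by eye works for two instances, but a short filter makes the difference obvious. The sketch below reduces a saved lsnrctl services output to one line per instance with its handler state. The function name handler_states is hypothetical, and it assumes the output format shown above (an Instance "..." line followed by a "DEDICATED" handler line containing state:...). If the file contains several services, it prints the instances of all of them.

```shell
# handler_states FILE   (hypothetical helper)
# Prints "<instance> <handler state>" for every DEDICATED handler found in
# a saved "lsnrctl services" output such as /tmp/lsnr1.txt or /tmp/lsnr2.txt.
handler_states() {
    awk '
        /Instance "/ {
            split($0, q, "\"")       # q[2] is the name between the first pair of quotes
            inst = q[2]
        }
        /"DEDICATED"/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^state:/) {
                    sub(/^state:/, "", $i)
                    print inst, $i
                }
        }
    ' "$1"
}
```

Called as handler_states /tmp/lsnr1.txt and handler_states /tmp/lsnr2.txt, the only line that differs between the two outputs should be the one for TTT041.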
What is this all about?
- If a listener dies (for any reason) there is a period of about 3 seconds during which other processes might still rely on its existence and service.
- PMON is the process which informs all listeners about the status of the others (one more reason to make sure it never gets stuck).
- PMON's listener REGISTRATION is something different.
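The fixed sleep 5 in my test only shows that the update arrived within 5 seconds. To narrow down the window from the first point above, one could poll the SCAN listener once per second instead. The following is a rough sketch, not something I ran as part of this test: poll_until is a made-up helper, and the resolution is limited to about one second plus the runtime of the polled command.

```shell
# poll_until PATTERN TRIES CMD [ARGS...]   (hypothetical helper)
# Runs CMD about once per second, at most TRIES times, until its output
# contains PATTERN. Prints the number of attempts that were needed and
# returns 0; returns 1 if PATTERN never showed up.
poll_until() {
    pattern=$1
    tries=$2
    shift 2
    i=1
    while [ "$i" -le "$tries" ]; do
        if "$@" 2>/dev/null | grep -q -- "$pattern"; then
            echo "$i"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Intended use right after killing the local listener (assumption: the
# blocked handler shows up as "state:blocked", as in lsnr2.txt):
#   poll_until 'state:blocked' 10 lsnrctl services LISTENER_SCAN2
```

The printed attempt number gives a rough upper bound, in seconds, on how long the SCAN listener kept the stale state:ready entry.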
Today's findings were supported by Understanding and Troubleshooting Instance Load Balancing [Note:263599.1].
5 comments:
What happens if a node goes down? There is no PMON process left to inform SCAN that a local listener has disappeared.
Yury
Yury,
I tried the small testcase:
lsnrctl service LISTENER_SCAN2 > /tmp/lsnr3.txt; kill -9 `pgrep -f ora_pmon_TTT041` ; lsnrctl service LISTENER_SCAN2 > /tmp/lsnr4.txt
in /tmp/lsnr3.txt I found (as expected)
Service "TTT04_SITE1" has 2 instance(s).
Instance "TTT041", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))
Instance "TTT042", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))
and in /tmp/lsnr4.txt:
Service "TTT04_SITE1" has 1 instance(s).
Instance "TTT042", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))
You can see that directly after I killed PMON, LISTENER_SCAN2 no longer listed instance TTT041. This is at least an indication that the SCAN listener wipes out all the information from a PMON when it loses its TCP connection.
So here I did not have any gap at all.
I think we are getting there. We now have a good, repeatable case. We see that after some time the SCAN listener is updated to the right status. Now it is time to confirm that our understanding of events is right:
- If the listener goes down: PMON checks its listeners every 3 seconds (is that the default interval? is there any way to change it?). If a listener is not found, it notifies the SCAN listeners, and the SCAN listeners clean up the info.
- If the local PMON and listener both go down: "This is at least an indication that the SCAN listener wipes out all the information from a PMON when it loses its TCP connection." It looks like the SCAN listener checks on a regular basis whether the connection to PMON is alive (what is the interval? can we change it?) and, if it isn't, cleans up all the information that PMON provided before.
Does that sound right?
Yury
Yury,
yes, it sounds right.
I just don't know whether PMON's 3-second interval can be changed.
I also don't know whether the listener polls its TCP connection from PMON (asynchronously) or catches the failure and reacts immediately.
For both questions I would need a debugger and the knowledge to use it; I'm just not trained to debug at that level.
Hi Martin,
I think sniffing the network could be a good way to see whether the surviving host tries to contact the evicted host. tcpdump plus netstat should be a good starting point.
regards,
Marcin