berxblog: who cares if a listener is dying

2012-05-05

who cares if a listener is dying

In this post I try to show what's going on if a local listener dies in a 11gR2 RAC environment. My basic question is: When does (a) SCAN-Listener knows the local Listener disappeared?
My testcase (a sandbox):

A 2-node RAC - all actions are run on node 1, if not explicit defined.
My test-DB is called TTT04 (Test, you know?)

I have 3 SCAN listeners there, but I want to make the test-case easier so I do pin down my connection string to only one SCAN-listener (it's SCAN2 in my case):

TTT04_bx =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 172.24.32.117)
                               (PORT = 1521)) # SCAN2
    (CONNECT_DATA =
      (SERVICE_NAME = TTT04_SITE1)
    )
  )

start tracing pmon:
ps -ef | grep pmon | grep TTT04
SQL> oradebug setospid <pid_of_pmon> Oracle pid: 2, Unix process pid: <pid_of_pmon>, image: oracle@<node1> (PMON) SQL> oradebug Event 10257 trace name context forever, level 16 Statement processed.
just to make sure server side load balancing will lead my to node1:
on node2: several
bzip2 -z -c /dev/urandom > /dev/null &

An now the real test. My 2 test-scripts:
/tmp/bx1.sql



connect berx/berx123#@TTT04_bx

spool /tmp/bx1.txt

select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS'), HOST_NAME  from v$instance;

exit

/tmp/bx2.sql



connect berx/berx123#@TTT04_bx

spool /tmp/bx2.txt

select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS'), HOST_NAME  from v$instance;

exit

My command is

kill -9 `pgrep  -f "tnslsnr LISTENER "` ; lsnrctl services LISTENER_SCAN2 > /tmp/lsnr1.txt ; sqlplus /nolog @/tmp/bx1.sql & sleep 5 ;  lsnrctl services LISTENER_SCAN2 > /tmp/lsnr2.txt; sqlplus /nolog @/tmp/bx2.sql

and the result on the Terminal:



SQL*Plus: Release 11.2.0.3.0 Production on Sat May 5 23:00:50 2012



Copyright (c) 1982, 2011, Oracle.  All rights reserved.



ERROR:

ORA-12541: TNS:no listener





SP2-0640: Not connected

[1]+  Done                    sqlplus /nolog @/tmp/bx1.sql 2> /tmp/bx1.err



SQL*Plus: Release 11.2.0.3.0 Production on Sat May 5 23:00:55 2012



Copyright (c) 1982, 2011, Oracle.  All rights reserved.



Connected.



TO_CHAR(SYSDATE,'YY  HOST_NAME

-------------------  ---------

2012-05-05 23:00:55  <node2>

Now the question is, what happened between these 5 seconds of sleep 5?

pmons tracefile TTT041_pmon_4399.trc shows
*** 2012-05-05 23:00:48.391 kmmlrl: status: succ=4, wait=0, fail=0 *** 2012-05-05 23:00:51.398 kmmlrl: status: succ=3, wait=0, fail=1 kmmlrl: update retry kmmgdnu: TTT04_SITE1 goodness=0, delta=1, flags=0x4:unblocked/not overloaded, update=0x6:G/D/- kmmlrl: node load 394 kmmlrl: (ADDRESS=(PROTOCOL=TCP)(HOST=<node1>-vip)(PORT=1521)) block kmmlrl: nsgr update returned 0 kmmlrl: nsgr register returned 0 *** 2012-05-05 23:00:54.401 kmmlrl: status: succ=3, wait=0, fail=1 *** 2012-05-05 23:00:57.402 kmmlrl: status: succ=3, wait=0, fail=1
Just a short explanation what you can see here: pmon distributes every 3 seconds the status he knows about all listeners he knows to all other listeners. Between 23:00:48 and 23:00:51 pmon found the local_listener disappeared - and informed all the other listeners about this fact.
what the LISTENER_SCAN2 knows:
lsnr1.txt
Service "TTT04_SITE1" has 2 instance(s). Instance "TTT041", status READY, has 1 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521)) Instance "TTT042", status READY, has 1 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))

lsnr2.txt
Service "TTT04_SITE1" has 2 instance(s). Instance "TTT041", status READY, has 1 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:blocked REMOTE SERVER (ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521)) Instance "TTT042", status READY, has 1 handler(s) for this service... Handler(s): "DEDICATED" established:0 refused:0 state:ready REMOTE SERVER (ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))
Direct after the kill of the local_listener, LISTENER_SCAN2 still believes both local_listeners are in state:ready - therefore it directs the connection to the listener with lower load (on node1) which I just killed.
But after only 5 seconds, it knows it is in state:blocked and therefore directs my 2nd connection attempt to node2.

What is this all about?

If a Listener dies (for any reason) there is a periode of about 3 seconds, where other processes might rely on it's existence and service.
PMON is the process which informas all listeners about the status of the others (one more reason to make sure it never stucks.
PMONs Listener REGISTRATION is something different.

My todays findings where supported by Understanding and Troubleshooting Instance Load Balancing [Note:263599.1].

5 Kommentare:

Yury hat gesagt…: What happens is a node goes down? There is no PMON process to inform SCAN that a local listener disappeared?

Yury; 6. Mai 2012 um 12:14
Martin Berger hat gesagt…: Yury,
I tried the small testcase:

lsnrctl service LISTENER_SCAN2 > /tmp/lsnr3.txt; kill -9 `pgrep -f ora_pmon_TTT041` ; lsnrctl service LISTENER_SCAN2 > /tmp/lsnr4.txt
in /tmp/lsnr3.txt I found (as expected)
Service "TTT04_SITE1" has 2 instance(s).
Instance "TTT041", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))
Instance "TTT042", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))

and in /tmp/lsnr4.txt:
Service "TTT04_SITE1" has 1 instance(s).
Instance "TTT042", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))

You can see, directly after I killed PMON, LISTENER_SCAN2 did not mention it's instance TTT041 anymore. This is at least an indication PMON wipes out all the infos from a PMON when it loses it's TCP connection.

So here I did not have any gap at all.; 6. Mai 2012 um 20:39
Yury hat gesagt…: I think we are getting there. We now have a good repeatable case. We see that after some time SCAN got updated up to the right status. Now it is time to confirm that our understanding of events is right:
- If Listener goes down. PMON verifies listeners on 3 secs bases (is it default time frame? is there any way we can change it?). If no listener found it notifies SCAN listeners. SCAN listeners clean up the info
- If local PMON and Listener goes down. "This is at least an indication SCAN LISTENER wipes out all the infos from a PMON when it loses it's TCP connection.". It looks like SCAN Listeners checks if connection to PMON is alive on regular basis (what is the time frame? can we change it?) and if it isn't alive it cleans up all the information provided before by the PMON

Does it sound right?

Yury; 7. Mai 2012 um 02:15
Martin Berger hat gesagt…: Yury,
yes, it sounds right.
I just don't know if PMONs 3 sec. interval can be changed;
I also don't know if LISTENER polls it's TCP connection from PMON (async) or catches it's failure and reacts immediately.
Both is something I would need a debugger and knowledge in these tasks. I'm just not trained in debug into that level.; 7. Mai 2012 um 08:45
Marcin Przepiorowski hat gesagt…: Hi Martin,
I think sniffing network could be a good start to see if survived host is trying to contact evicted host. tcpdump plus netstat should be good starting point.

regards,
Marcin; 8. Mai 2012 um 11:01

Kommentar veröffentlichen

berxblog

2012-05-05

who cares if a listener is dying

5 Kommentare:

Follower

Über mich