Samstag, 5. Mai 2012

who cares if a listener is dying


In this post I try to show what's going on if a local listener dies in a 11gR2 RAC environment. My basic question is: When does (a) SCAN-Listener knows the local Listener disappeared?
My testcase (a sandbox):

  • A 2-node RAC - all actions are run on node 1, if not explicit defined.
  • My test-DB is called TTT04 (Test, you know?)
  • I have 3 SCAN listeners there, but I want to make the test-case easier so I do pin down my connection string to only one SCAN-listener (it's SCAN2 in my case):
    TTT04_bx =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = 172.24.32.117)
                                   (PORT = 1521)) # SCAN2
        (CONNECT_DATA =
          (SERVICE_NAME = TTT04_SITE1)
        )
      )
  • start tracing pmon:
    ps -ef | grep pmon | grep TTT04
    SQL> oradebug setospid <pid_of_pmon>
    Oracle pid: 2, Unix process pid: <pid_of_pmon>, image: oracle@<node1> (PMON)
    SQL> oradebug Event 10257 trace name context forever, level 16
    Statement processed.

  • just to make sure server side load balancing will lead my to node1:
    on node2: several
    bzip2 -z -c /dev/urandom > /dev/null &
An now the real test. My 2 test-scripts:
/tmp/bx1.sql

connect berx/berx123#@TTT04_bx
spool /tmp/bx1.txt
select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS'), HOST_NAME from v$instance;
exit


/tmp/bx2.sql

connect berx/berx123#@TTT04_bx
spool /tmp/bx2.txt
select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS'), HOST_NAME from v$instance;
exit


My command is
kill -9 `pgrep -f "tnslsnr LISTENER "` ; lsnrctl services LISTENER_SCAN2 > /tmp/lsnr1.txt ; sqlplus /nolog @/tmp/bx1.sql & sleep 5 ; lsnrctl services LISTENER_SCAN2 > /tmp/lsnr2.txt; sqlplus /nolog @/tmp/bx2.sql


and the result on the Terminal:

SQL*Plus: Release 11.2.0.3.0 Production on Sat May 5 23:00:50 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

ERROR:
ORA-12541: TNS:no listener


SP2-0640: Not connected
[1]+ Done sqlplus /nolog @/tmp/bx1.sql 2> /tmp/bx1.err

SQL*Plus: Release 11.2.0.3.0 Production on Sat May 5 23:00:55 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected.

TO_CHAR(SYSDATE,'YY HOST_NAME
------------------- ---------
2012-05-05 23:00:55 <node2>





Now the question is, what happened between these 5 seconds of sleep 5?


  • pmons tracefile TTT041_pmon_4399.trc shows

    *** 2012-05-05 23:00:48.391
    kmmlrl: status: succ=4, wait=0, fail=0

    *** 2012-05-05 23:00:51.398
    kmmlrl: status: succ=3, wait=0, fail=1
    kmmlrl: update retry
    kmmgdnu: TTT04_SITE1
    goodness=0, delta=1,
    flags=0x4:unblocked/not overloaded, update=0x6:G/D/-
    kmmlrl: node load 394
    kmmlrl: (ADDRESS=(PROTOCOL=TCP)(HOST=<node1>-vip)(PORT=1521)) block
    kmmlrl: nsgr update returned 0
    kmmlrl: nsgr register returned 0

    *** 2012-05-05 23:00:54.401
    kmmlrl: status: succ=3, wait=0, fail=1

    *** 2012-05-05 23:00:57.402
    kmmlrl: status: succ=3, wait=0, fail=1

    Just a short explanation what you can see here: pmon distributes every 3 seconds the status he knows about all listeners he knows to all other listeners. Between 23:00:48 and 23:00:51 pmon found the local_listener disappeared - and informed all the other listeners about this fact.
  • what the LISTENER_SCAN2 knows:
    lsnr1.txt

    Service "TTT04_SITE1" has 2 instance(s).
    Instance "TTT041", status READY, has 1 handler(s) for this service...
    Handler(s):
    "DEDICATED" established:0 refused:0 state:ready
    REMOTE SERVER
    (ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))
    Instance "TTT042", status READY, has 1 handler(s) for this service...
    Handler(s):
    "DEDICATED" established:0 refused:0 state:ready
    REMOTE SERVER
    (ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))


    lsnr2.txt

    Service "TTT04_SITE1" has 2 instance(s).
    Instance "TTT041", status READY, has 1 handler(s) for this service...
    Handler(s):
    "DEDICATED" established:0 refused:0 state:blocked
    REMOTE SERVER
    (ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))
    Instance "TTT042", status READY, has 1 handler(s) for this service...
    Handler(s):
    "DEDICATED" established:0 refused:0 state:ready
    REMOTE SERVER
    (ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))

    Direct after the kill of the local_listener, LISTENER_SCAN2 still believes both local_listeners are in state:ready - therefore it directs the connection to the listener with lower load (on node1) which I just killed.
    But after only 5 seconds, it knows it is in state:blocked and therefore directs my 2nd connection attempt to node2.

What is this all about?

  • If a Listener dies (for any reason) there is a periode of about 3 seconds, where other processes might rely on it's existence and service.
  • PMON is the process which informas all listeners about the status of the others (one more reason to make sure it never stucks.
  • PMONs Listener REGISTRATION is something different. 


My todays findings where supported by Understanding and Troubleshooting Instance Load Balancing [Note:263599.1].

Kommentare:

Yury hat gesagt…

What happens is a node goes down? There is no PMON process to inform SCAN that a local listener disappeared?

Yury

Martin Berger hat gesagt…

Yury,
I tried the small testcase:

lsnrctl service LISTENER_SCAN2 > /tmp/lsnr3.txt; kill -9 `pgrep -f ora_pmon_TTT041` ; lsnrctl service LISTENER_SCAN2 > /tmp/lsnr4.txt

in /tmp/lsnr3.txt I found (as expected)
Service "TTT04_SITE1" has 2 instance(s).
Instance "TTT041", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))

Instance "TTT042", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))


and in /tmp/lsnr4.txt:
Service "TTT04_SITE1" has 1 instance(s).
Instance "TTT042", status READY, has 1 handler(s) for this service...
Handler(s):
"DEDICATED" established:0 refused:0 state:ready
REMOTE SERVER
(ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))


You can see, directly after I killed PMON, LISTENER_SCAN2 did not mention it's instance TTT041 anymore. This is at least an indication PMON wipes out all the infos from a PMON when it loses it's TCP connection.

So here I did not have any gap at all.

Yury hat gesagt…

I think we are getting there. We now have a good repeatable case. We see that after some time SCAN got updated up to the right status. Now it is time to confirm that our understanding of events is right:
- If Listener goes down. PMON verifies listeners on 3 secs bases (is it default time frame? is there any way we can change it?). If no listener found it notifies SCAN listeners. SCAN listeners clean up the info
- If local PMON and Listener goes down. "This is at least an indication SCAN LISTENER wipes out all the infos from a PMON when it loses it's TCP connection.". It looks like SCAN Listeners checks if connection to PMON is alive on regular basis (what is the time frame? can we change it?) and if it isn't alive it cleans up all the information provided before by the PMON

Does it sound right?

Yury

Martin Berger hat gesagt…

Yury,
yes, it sounds right.
I just don't know if PMONs 3 sec. interval can be changed;
I also don't know if LISTENER polls it's TCP connection from PMON (async) or catches it's failure and reacts immediately.
Both is something I would need a debugger and knowledge in these tasks. I'm just not trained in debug into that level.

Marcin Przepiorowski hat gesagt…

Hi Martin,
I think sniffing network could be a good start to see if survived host is trying to contact evicted host. tcpdump plus netstat should be good starting point.

regards,
Marcin