Samstag, 5. Mai 2012

who cares if a listener is dying


In this post I try to show what's going on if a local listener dies in a 11gR2 RAC environment. My basic question is: When does (a) SCAN-Listener knows the local Listener disappeared?
My testcase (a sandbox):

  • A 2-node RAC - all actions are run on node 1, if not explicit defined.
  • My test-DB is called TTT04 (Test, you know?)
  • I have 3 SCAN listeners there, but I want to make the test-case easier so I do pin down my connection string to only one SCAN-listener (it's SCAN2 in my case):
    TTT04_bx =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = 172.24.32.117)
                                   (PORT = 1521)) # SCAN2
        (CONNECT_DATA =
          (SERVICE_NAME = TTT04_SITE1)
        )
      )
  • start tracing pmon:
    ps -ef | grep pmon | grep TTT04
    SQL> oradebug setospid <pid_of_pmon>
    Oracle pid: 2, Unix process pid: <pid_of_pmon>, image: oracle@<node1> (PMON)
    SQL> oradebug Event 10257 trace name context forever, level 16
    Statement processed.

  • just to make sure server side load balancing will lead my to node1:
    on node2: several
    bzip2 -z -c /dev/urandom > /dev/null &
An now the real test. My 2 test-scripts:
/tmp/bx1.sql

connect berx/berx123#@TTT04_bx
spool /tmp/bx1.txt
select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS'), HOST_NAME from v$instance;
exit


/tmp/bx2.sql

connect berx/berx123#@TTT04_bx
spool /tmp/bx2.txt
select to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS'), HOST_NAME from v$instance;
exit


My command is
kill -9 `pgrep -f "tnslsnr LISTENER "` ; lsnrctl services LISTENER_SCAN2 > /tmp/lsnr1.txt ; sqlplus /nolog @/tmp/bx1.sql & sleep 5 ; lsnrctl services LISTENER_SCAN2 > /tmp/lsnr2.txt; sqlplus /nolog @/tmp/bx2.sql


and the result on the Terminal:

SQL*Plus: Release 11.2.0.3.0 Production on Sat May 5 23:00:50 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

ERROR:
ORA-12541: TNS:no listener


SP2-0640: Not connected
[1]+ Done sqlplus /nolog @/tmp/bx1.sql 2> /tmp/bx1.err

SQL*Plus: Release 11.2.0.3.0 Production on Sat May 5 23:00:55 2012

Copyright (c) 1982, 2011, Oracle. All rights reserved.

Connected.

TO_CHAR(SYSDATE,'YY HOST_NAME
------------------- ---------
2012-05-05 23:00:55 <node2>





Now the question is, what happened between these 5 seconds of sleep 5?


  • pmons tracefile TTT041_pmon_4399.trc shows

    *** 2012-05-05 23:00:48.391
    kmmlrl: status: succ=4, wait=0, fail=0

    *** 2012-05-05 23:00:51.398
    kmmlrl: status: succ=3, wait=0, fail=1
    kmmlrl: update retry
    kmmgdnu: TTT04_SITE1
    goodness=0, delta=1,
    flags=0x4:unblocked/not overloaded, update=0x6:G/D/-
    kmmlrl: node load 394
    kmmlrl: (ADDRESS=(PROTOCOL=TCP)(HOST=<node1>-vip)(PORT=1521)) block
    kmmlrl: nsgr update returned 0
    kmmlrl: nsgr register returned 0

    *** 2012-05-05 23:00:54.401
    kmmlrl: status: succ=3, wait=0, fail=1

    *** 2012-05-05 23:00:57.402
    kmmlrl: status: succ=3, wait=0, fail=1

    Just a short explanation what you can see here: pmon distributes every 3 seconds the status he knows about all listeners he knows to all other listeners. Between 23:00:48 and 23:00:51 pmon found the local_listener disappeared - and informed all the other listeners about this fact.
  • what the LISTENER_SCAN2 knows:
    lsnr1.txt

    Service "TTT04_SITE1" has 2 instance(s).
    Instance "TTT041", status READY, has 1 handler(s) for this service...
    Handler(s):
    "DEDICATED" established:0 refused:0 state:ready
    REMOTE SERVER
    (ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))
    Instance "TTT042", status READY, has 1 handler(s) for this service...
    Handler(s):
    "DEDICATED" established:0 refused:0 state:ready
    REMOTE SERVER
    (ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))


    lsnr2.txt

    Service "TTT04_SITE1" has 2 instance(s).
    Instance "TTT041", status READY, has 1 handler(s) for this service...
    Handler(s):
    "DEDICATED" established:0 refused:0 state:blocked
    REMOTE SERVER
    (ADDRESS=(PROTOCOL=TCP)(HOST=<host1>-vip)(PORT=1521))
    Instance "TTT042", status READY, has 1 handler(s) for this service...
    Handler(s):
    "DEDICATED" established:0 refused:0 state:ready
    REMOTE SERVER
    (ADDRESS=(PROTOCOL=TCP)(HOST=<host2>-vip)(PORT=1521))

    Direct after the kill of the local_listener, LISTENER_SCAN2 still believes both local_listeners are in state:ready - therefore it directs the connection to the listener with lower load (on node1) which I just killed.
    But after only 5 seconds, it knows it is in state:blocked and therefore directs my 2nd connection attempt to node2.

What is this all about?

  • If a Listener dies (for any reason) there is a periode of about 3 seconds, where other processes might rely on it's existence and service.
  • PMON is the process which informas all listeners about the status of the others (one more reason to make sure it never stucks.
  • PMONs Listener REGISTRATION is something different. 


My todays findings where supported by Understanding and Troubleshooting Instance Load Balancing [Note:263599.1].
Kommentar veröffentlichen