Montag, 19. März 2012

looking close at TAF


At the moment I'm trying to collect and sort some informations about Oracles Transparent Application Failover. There is a lot of general information available in the wild, but no deeper details. Here I try to show my findings.

Testcase

For my test-database with DB_UNIQUE_NAME: TTT06_SITE1 I created the service
srvctl add service -d TTT06_SITE1 -s TTT06_TAF -P BASIC -e SELECT -r TTT061,TTT062 .
The tnsnames.ora entry is
TTT06_TAF =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = OFF)
      (ADDRESS = (PROTOCOL = TCP)(HOST = crs908.my.domain)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = TTT06_TAF)(SERVER=DEDICATED)
    )
  )

tracing

Just
strace -f -t -s 128 -o sqlplus_taf.strace sqlplus "berx/berx123#@TTT06_TAF"
I will look closely on the sqlplus_taf.strace soon, just the testcase can be finished easily:

current instance


SELECT (SELECT instance_number
        FROM   v$instance) inst,
       s.sid,
       s.service_name,
       s.failover_type,
       s.failover_method,
       s.failed_over,
       p.spid
FROM   v$process p,
       v$session s
WHERE  s.paddr = p.addr
       AND addr IN (SELECT paddr
                    FROM   v$session
                    WHERE  audsid = Sys_context('USERENV', 'SESSIONID'));


   INST      SID SERVICE_NAME FAILOVER_TYPE FAILOVER_M FAI   SPID
------- -------- ------------ ------------- ---------- --- ------
      1      144    TTT06_TAF        SELECT      BASIC  NO  23440

and after a startup force in a 2nd session in instance 1

new instance


/

   INST      SID SERVICE_NAME FAILOVER_TYPE FAILOVER_M FAI   SPID
------- -------- ------------ ------------- ---------- --- ------
      2      146    TTT06_TAF        SELECT      BASIC YES  14927

what's going on

A short excerpt of the sqlplus_taf.strace
First sqlplus tries to access ~/.tnsnames.ora, fails and then opens $TNS_ADMIN/tnsnames.ora. Of course there it reads the connection string shown above.
Next it tries to resolve the HOST entry:
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("<my_dns>")}, 28) = 0
and gets all IPs for my SCAN-DNS.
sqlplus asks one of the SCAN listeners:
connect(9, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("<any SCAN IP>")}, 16) = -1 EINPROGRESS (Operation now in progress)
for the SERVICE and gets a kind of redirect:
read(9, "\1\10\0\0\6\0\0\0\0@(ADDRESS=(PROTOCOL=TCP)(HOST=<NODE1-vip>)(PORT=1521))\0(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<SCAN IP>)(PORT=1"..., 8208) = 264
The SCAN-Listener is of no good anymore: close(9). sqlplus looks up the name of <NODE2-vip> in /etc/hosts and tries it's next step with the <NODE1-vip> listener:
connect(9, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("<NODE1-vip IP>")}, 16) = -1 EINPROGRESS (Operation now in progress)

The listener creates a server process for sqlplus - and let's them do their private business.


The startup force killed the server process for sqlplus. But it doesn't know anything about it, until it get's the <newline> from the terminal.

Of course filehandle 9 is somewhat dead and gets close(9). Now really the same steps as above (just tnsnames.ora is not re-read!): SCAN IP lookup, redirect to an NODE-vip, etc.

So only tnsnames.ora is cached, all other lookups and connections are re-run again.
Take this into account if you try to change your setup (IPs, lookups, DNS) while connections are active.

Kommentar veröffentlichen