Posts mit dem Label strace werden angezeigt. Alle Posts anzeigen
Posts mit dem Label strace werden angezeigt. Alle Posts anzeigen

2019-01-03

connection manager - what happens at startup & shutdown

Oracle Connection Manager (cman) is a great tool to create a gateway between networks which can not be connected, or filter those who can access to a service.
Most of the time, it's a very robust implementation and doesn't need a lot of attention. But if it's required to dig into it, a basic knowledge about it's components can be useful.


In this post I'll show what happens in a simple startup (& shutdown) of a mostly "default" cman.
The test was done on Oracle Linux 7.5 with cman 12.2.0.1.181016.


tl:dr

In my setup, processes are called this order.
at startup, cmctl spawns several processes. these processes communicate via sockets and has some shared memory segments

The details: 

The results in this post show some lines of interest cenerated by
strace -f -o <some_dir&gt>/cman_$$.strace -y -s 128 cman
and the 3 commands
administer cman_berx
startup
shutdown

Before each quote of the tracefile I write the line number - this is not an exact offset for your tests, just shows the sequence and sometimes big gaps which I do not show. I will name PIDs and executes together for better readability.

1:
10115 execve("<ORACLE_HOME>/bin/cmctl", ["cmctl"], [/* 118 vars */]) = 0
process 10115 is now cmctl.

200:
10115 open("<ORACLE_HOME>/env.ora", O_RDONLY) = 3
... 
10115 read(3, "# Copyright (c) 2015, Oracle and/or its affiliates. All rights reserved.\n# \n# NAME\n#   env.ora\n#\n# FUNCTION\n#   Oracle ENVironme"..., 4096) = 852
The undocumented file <ORACLE_HOME>/env.ora is checked for TNSNAMES_ORA parameter there - my file is default, so nothing is set there.

231:
10115 access("<TNS_ADMIN>/cman.ora", F_OK) = -1 ENOENT (No such file or directory)
10115 access("/etc/cman.ora", F_OK)     = -1 ENOENT (No such file or directory)
10115 access("<ORACLE_HOME>/network/admin/cman.ora", F_OK) = 0
10115 open("<ORACLE_HOME>/network/admin/cman.ora", O_RDONLY) = 3
... 
10115 read(3<<ORACLE_HOME>/network/admin/cman.ora>, "######################################################################\n#\n# Copyright (c) 2001,2002, Oracle Corporation. All righ"..., 4096) = 3996 
... 
10115 close(3<<ORACLE_HOME>/network/admin/cman.ora>) = 0 
10115 (cman) searches for cman.ora in <TNS_ADMIN> (nothing there) /etc (also nothing) and <ORACLE_HOME>/network/admin (there is an entry for cman_berx).

297:
10115 write(1</dev/pts/0>, "CMCTL> ", 7) = 7
10115 read(0</dev/pts/0>, "administer cman_berx\n", 1024) = 21
10115 (cmctl) writes a prompt (CMCTL>) and gets my command line command (administer cman_berx).

364:
10115 connect(4<socket:[1549779484]>, {sa_family=AF_INET, sin_port=htons(1522), sin_addr=inet_addr("x.y.z.78")}, 16) = -1 EINPROGRESS (Operation now in progress)
10115 (cmctl) tries to connect to the address:port provided in cman.ora - and can't connect. I assume that's the moment it realises it must start all related processes.


418:
10115 geteuid()                         = 5001
10115 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b8713915150) = 10286
10115 rt_sigprocmask(SIG_BLOCK, [PIPE], NULL, 8) = 0
10286 set_robust_list(0x2b8713915160, 24 <unfinished ...>
10115 rt_sigaction(SIGPIPE, {0x2b8717669de0, ~[ILL TRAP ABRT BUS FPE SEGV USR2 TERM XCPU XFSZ SYS RTMIN RT_1], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x2b8718e9f680},  <unfinished ...>
10286 <... set_robust_list resumed> )   = 0
10115 <... rt_sigaction resumed> {SIG_DFL, [], 0}, 8) = 0
10115 rt_sigprocmask(SIG_UNBLOCK, [PIPE], NULL, 8) = 0
10286 clone( <unfinished ...>
10115 wait4(10286,  <unfinished ...>
10286 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b8713915150) = 10287
10287 set_robust_list(0x2b8713915160, 24) = 0
10287 close(6<pipe:[1549773687]>)       = 0
10287 close(7<pipe:[1549773688]> <unfinished ...>
10286 exit_group(0)                     = ?
10287 <... close resumed> )             = 0
10287 setsid( <unfinished ...>
10286 +++ exited with 0 +++
10115 (cmctl) creates (clone) a new process (10286) - this process immediately clones another one (10287) and then terminates itself (exit_group). By this there is no parent-child relationship between 10115 (cmctl) and 10287. Still they both use the same binary (cmctl) and filehandles etc. - so 10287 closes some filehandles it doesn't require.

449:
10287 execve("/appl/oracle/product/cman_12201/bin/cmadmin", ["/appl/oracle/product/cman_12201/bin/cmadmin", "cman_berx", "-inherit"], [/* 119 vars */]) = 0
process 10287 is now cmadmin.


2602:
10287 access("/var/tmp/.oracle", F_OK)  = 0
10287 chmod("/var/tmp/.oracle", 01777)  = -1 EPERM (Operation not permitted)
10287 socket(AF_LOCAL, SOCK_STREAM, 0)  = 6<socket:[1549770245]>
...
10287 access("/var/tmp/.oracle_500100", F_OK) = -1 ENOENT (No such file or directory)
10287 mkdir("/var/tmp/.oracle_500100", 0700) = 0
10287 chmod("/var/tmp/.oracle_500100", 0700) = 0
10287 access("/var/tmp/.oracle_500100/s#10287.1", F_OK) = -1 ENOENT (No such file or directory)
10287 bind(6<socket:[1549770245]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10287.1"}, 110) = 0
10287 chmod("/var/tmp/.oracle_500100/s#10287.1", 0777) = 0
10287 listen(6<socket:[1549770245]>, 100) = 0
10287 getsockname(6<socket:[1549770245]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10287.1"}, [36]) = 0
10287 chmod("/var/tmp/.oracle_500100/s#10287.1", 0777) = 0 
10287 (cmadmin) checks if /var/tmp/.oracle exists and if it can change it's permission. In my environment the directory exists (as Grid Infrastructure is installed) but it's permissions can't be changed. It belongs to root.  Then /var/tmp/.oracle_500100 is created and a socket is created there. For easier readability, the PID is part of the socket name.

2631:
10287 clone(child_stack=0x2b6fd20de570, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b6fd20e99d0, tls=0x2b6fd20e9700, child_tidptr=0x2b6fd20e99d0) = 10289
10287 (cmadmin) creates another process (10289).

2701:
10287 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10290
10287 (cmadmin) creates another process (10290).

2719:
10290 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10291
...
10290 exit_group(0)                     = ?
...
10290 +++ exited with 0 +++
10290 creates 10291 and terminates itself.

2764:
10291 execve("/appl/oracle/product/cman_12201/bin/tnslsnr", ["/appl/oracle/product/cman_12201/bin/tnslsnr", "ifile=/appl/oracle/product/cman_12201/network/admin/cman.ora", "cman_berx", "-inherit", "-mode", "proxy"], [/* 119 vars */]) = 0
10291 becomes tnslsnr


3613:
10291 bind(11<socket:[1549780776]>, {sa_family=AF_INET, sin_port=htons(1522), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
10291 listen(11<socket:[1549780776]>, 128) = 0
10291 getsockname(11<socket:[1549780776]>, {sa_family=AF_INET, sin_port=htons(1522), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
10291 (tnslsnr) listens on 0.0.0.0:1522 - But in cman.ora a specific IP address is given.

3669:
10291 write(12<<DIAG_DEST>/netcman/<hostname>/cman_berx/trace/cman_berx.log>, "Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=<server.domain>)(PORT=1522)))\n", 91) = 91
10291 (tnslsnr) writes something different to the logfile (?)

3674:
10291 access("/var/tmp/.oracle", F_OK)  = 0
10291 chmod("/var/tmp/.oracle", 01777)  = -1 EPERM (Operation not permitted)
10291 socket(AF_LOCAL, SOCK_STREAM, 0)  = 14<socket:[1549790968]>
10291 access("/var/tmp/.oracle/s#10291.1", F_OK) = -1 ENOENT (No such file or directory)
10291 bind(14<socket:[1549790968]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/s#10291.1"}, 110) = 0
10291 chmod("/var/tmp/.oracle/s#10291.1", 0777) = 0
10291 listen(14<socket:[1549790968]>, 1) = 0
10291 getsockname(14<socket:[1549790968]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/s#10291.1"}, [29]) = 0
10291 chmod("/var/tmp/.oracle/s#10291.1", 0777) = 0
10291 (tnslsnr) creates its own socket file for communication purpose.

4088:
10291 connect(16<socket:[1549778909]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sprocr_local_conn_0_PROL"}, 110) = 0

4149:
10291 connect(15<socket:[1549778910]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_UI_SOCKET"}, 110) = 0
10291 (tnslsnr) checks some sockets which belong to Grid Infrastructure.

4616:
10291 clone(child_stack=0x2ac6b580b570, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2ac6b58169d0, tls=0x2ac6b5816700, child_tidptr=0x2ac6b58169d0) = 10292
10291 (tnslsnr) creates PID 10292

4754:
10292 clone( <unfinished ...>
10291 <... close resumed> )             = 0
10292 <... clone resumed> child_stack=0x2ac6b584c570, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2ac6b58579d0, tls=0x2ac6b5857700, child_tidptr=0x2ac6b58579d0) = 10293
10293 set_robust_list(0x2ac6b58579e0, 24 <unfinished ...>
10292 creates PID 10293

4790:
10293 exit(0)                           = ?
10292 <... poll resumed> )              = 1 ([{fd=15, revents=POLLIN}])
10293 +++ exited with 0 +++
10293 exits - it did not "clone" or "execve" anything

4841:
10287 shmget(0x59248000, 2462280, IPC_CREAT|IPC_EXCL|0600 <unfinished ...>
10287 (cmadmin) allocates a shared memory segment with key 0x59248000.

4937:
10287 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10294
...
10294 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10295
...
10294 exit_group(0)                     = ?
...
10294 +++ exited with 0 +++
10287 (cmadmin) creates PID 10294, this creates PID 10295 and terminates itself.


4967:
10295 execve("/appl/oracle/product/cman_12201/bin/cmgw", ["/appl/oracle/product/cman_12201/bin/cmgw", "cmgw0", "0", "16", "cman_berx", "SNLSM:59248000"], [/* 119 vars */]) = 0
10295 becomes cmgw0 - you can see the parameter SNLSM - it gives the shared memory key from 10287 (cmadmin).

5158:
10295 shmget(0x59248000, 1, 0)          = 2099576860
10295 (cmgw0) attaches to shared memory 10287 (cmadmin).


5566:
10287 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10296
...
10296 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10297
...
10296 exit_group(0)                     = ?
10287 (cmadmin) creates PID 10296 which creates 10297 and exits. - same as above

5625:
10297 execve("/appl/oracle/product/cman_12201/bin/cmgw", ["/appl/oracle/product/cman_12201/bin/cmgw", "cmgw1", "1", "16", "cman_berx", "SNLSM:59248000"], [/* 119 vars */] <unfinished ...>
10297 becomes cmgw1.

some (later) important sockets:
5680:
10295 connect(5<socket:[1549804554]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10287.1"}, 110 <unfinished ...>

6447:
10297 connect(5<socket:[1549798977]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10287.1"}, 110 <unfinished ...>

7448:
10295 bind(7<socket:[1549804555]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10295.1"}, 110 <unfinished ...>

7458:
10297 bind(7<socket:[1549798978]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10297.1"}, 110 <unfinished ...>

7896:
10115 connect(5<socket:[1549779550]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10287.1"}, 110 <unfinished ...> 

At this stage, cman is started, all processes are in memory and happy to server any requests.



The shutdown- command will be much faster:

8553:
10115 <... read resumed> "shutdown\n", 1024) = 9
10115 (cmctl) recieves "shutdown" from command line

8554
10115 write(5, "\0\0\0S\6$\0\0\0\0E\0\0\200\2\0\0\0\0\4\0\0\220_8\2\0\0\0\0\20\0#\0\0\000901A4C4F9FD8-3939-E034-0800208AB384\1\0v\0\0\0\3\0\0\0\0\0", 83) = 83
101115 (cmctl) informs 10287 (cmadmin)

8568:
10287 write(5&lt;socket:[1549778933]&gt;, "\0\0\0S\6 \0\0\0\0E\0\0\200\2\0\0\0\0\4\0\0p\344t\2\0\0\0\0\20\0#\0\0\00091986B2B10AC-3001-E034-0800208AB384\1\0l\0\0\0\1\0\0\0\0\0", 83) = 83
10287 times( &lt;unfinished ...&gt;
10295 &lt;... epoll_wait resumed&gt; [{EPOLLIN, {u32=13989488, u64=13989488}}], 1024, -1) = 1
10287 &lt;... times resumed&gt; NULL)         = 1269637186
10287 epoll_wait(7&lt;anon_inode:[eventpoll]&gt;,  &lt;unfinished ...&gt;
10295 read(5&lt;socket:[1549804554]&gt;, "\0\0\0S\6 \0\0\0\0E\0\0\200\2\0\0\0\0\4\0\0p\344t\2\0\0\0\0\20\0#\0\0\00091986B2B10AC-3001-E034-0800208AB384\1\0l\0\0\0\1\0\0\0\0\0", 8208) = 83
1
10287 (cmadmin) informs 10295 (cmgw0) (and later 10297 (cmgw1) by the same method) to shutdown

some cleanup takes place:

8585:
10295 unlink("/var/tmp/.oracle_500100/s#10295.1" <unfinished ...>

8590:
10295 rmdir("/var/tmp/.oracle_500100" <unfinished ...>

8621:
10297 unlink("/var/tmp/.oracle_500100/s#10297.1" <unfinished ...>




and all the processes terminate:

8869:
10289 exit(0)                           = ?
...
10289 +++ exited with 0 +++

8927:
10295 exit_group(0)                     = ?

8967:
10297 exit_group(0)                     = ?

9096:
10291 unlink("/var/tmp/.oracle/s#10291.1" <unfinished ...>


9293:
10291 exit_group(0)                     = ?

9305:
10287 exit_group(0)                     = ?

9321:
10115 exit_group(0)                     = ?

That's a simple startup and shutdown of connection manager - nothing fancy here, just a lot of processes and sockets which are created - and then deleted.

I hope you never need to read this article to it's end. But if you need to, it should provide some value for you!

2018-07-27

missing oraagent.bin process

It all started with a small error when I tried to stop/start a service:

srvctl start service -d cmpp01_site1 -s cmpp01
PRCD-1084 : Failed to start service CMPP01 
PRCR-1079 : Failed to start resource ora.cmpp01_site1.cmpp01.svc 
CRS-2680: Clean of 'ora.cmpp01_site1.cmpp01.svc' on 'av2l954p' failed 
CRS-5802: Unable to start the agent process 

It's not fine the agent can not be started. so let's check the logfile crsd_oraagent_oracle.trc:
2018-07-23 21:29:14.339 : CRSCOMM:2939654272:  Ipc: Starting send thread
2018-07-23 21:29:14.340 :GIPCXCPT:2939654272:  gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29)
2018-07-23 21:29:14.340 :GIPCXCPT:2939654272:  gipcConnectSyncF [connectToServer : clsIpcClient.cpp : 380]: EXCEPTION[ ret gipcretConnectionRefused (29) ]  failed sync connect endp 0x1cbb5b0 [0000000000000090] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=00000000-00000000-0))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, readyRef (nil), ready 0, wobj 0x1cbe170, sendp 0x1cbdf20 status 13flags 0xa108871a, flags-2 0x0, usrFlags 0x30000 }, addr 0x1cbc990 [0000000000000097] { gipcAddress : name 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_IPC_SOCKET_11)(GIPCID=00000000-00000000-0))', objFlags 0x0, addrFlags 0x4 }, flags 0x0
2018-07-23 21:29:14.340 : CRSCOMM:2939654272:  IpcC: gipcConnect() failed, rc= 29
2018-07-23 21:29:14.340 : CRSCOMM:2939654272: [FFAIL] IpcC: Could not connect to (ADDRESS=(PROTOCOL=IPC)(KEY=CRSD_IPC_SOCKET_11)) ret = 29

It clearly says there is something going wrong, unfortunately it doesn't tell in detail what's going wrong or what's required.

Maybe the most beautiful but useless analysis here was provided by tfactl summary -crs:
DETAILS                            STATUS    COMPONENT  
+----------------------------------+---------+-----------+
  .------------------------------.   PROBLEM   CRS        
  | CRS_SERVER_STATUS   : ONLINE |                        
  | CRS_STATE           : ONLINE |                        
  | CRS_INTEGRITY_CHECK : FAIL   |                        
  | CRS_RESOURCE_STATUS : ONLINE |                        
  '------------------------------'                        
+----------------------------------+---------+-----------+

As neither My Oracle Support nor google helped a lot here, I started a random cry on twitter:


And Anil had mercy on me :
This was not the answer to my problem, but it started a very valuable conversation in which I learned enough to identify the issue.

It's important to know that (this) oraagent.bin process is started by crsd.bin, and it is restarted if it disappears for any reason.
So to see what's really going and and leads to the error above, running strace on crsd.bin and it's forks/clones:
strace -f -p <pid_of_crsd.bin> -o <outfile>

as expected crsd.bin creates a new process for oraagent.bin:
18631 clone( <unfinished ...>
...
18631 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe273dfe9d0) = 18633
...
18633 execve("/appl/grid/grid_12201/bin/oraagent.bin", ["/appl/grid/grid_12201/bin/oraage"...], [/* 52 vars */]) = 0

and if everything runs as expected the socket handling looks like
18633 socket(AF_LOCAL, SOCK_STREAM, 0 
...
18633 <... socket resumed> )            = 69  <== the creation of the socked
...
18633 connect(69, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_IPC_SOCKET_11"}, 110 <unfinished ...>
...
18633 <... connect resumed> )           = 0   <== the successful connection of the sockets file descriptor 69 to  /var/tmp/.oracle/sCRSD_IPC_SOCKET_11

But in my problematic case it's this:
22612 connect(69, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_IPC_SOCKET_11"}, 110  <unfinished ...>
...
22612  <... connect resumed> )           = -1 ENOENT (No such file or directory)

The first finding here:
The error
gipcInternalConnectSync: failed sync request, ret gipcretConnectionRefused (29)
in the logfile correlates with
<... connect resumed> )           = -1 ENOENT (No such file or directory)

With this specific information a short check in /var/tmp/.oracle shows the missing file is not there.
A crosscheck in /proc/<pid_of_crsd.bin>/fd/ shows
lrwx------. 1 root root 64 Jul 26 22:00 /proc/2144/fd/214 -> /var/tmp/.oracle/sCRSD_IPC_SOCKET_11_lock (deleted)

this all happens on grid binaries 12.2 with latest patch set.


I showed now:
1) if oraagent.bin disappears, crsd.bin creates a new oraagent.bin process
2) the oraagent.bin process tries to connect a socket on /var/tmp/.oracle/sCRSD_IPC_SOCKET_11 to communicate with crsd.bin
3) if 2) fails, gipcretConnectionRefused (29) is shown in the logfile.



Right now I know:
If /var/tmp/.oracle/sCRSD_IPC_SOCKET_11 is missing but oraagent.bin does run, I can operate the cluster resources, but whenever oraagent.bin disappears, cluster resources can not be managed anymore.

If this is the only problem (and no other sockets in /var/tmp/.oracle are missing) I can restart crsd which will also restart oraagent.bin and everything works like expected.
crsctl stop res ora.crsd -init
CRS-2673: Attempting to stop 'ora.crsd' on 'node1'
CRS-2677: Stop of 'ora.crsd' on 'node1' succeeded

crsctl start res ora.crsd -init  
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded


My next questions to answer is
1) why does oraagent.bin disappear at all?
2) why does files in /var/tmp/.oracle disappear?

This is still not answered and under investigation.

2012-03-19

looking close at TAF


At the moment I'm trying to collect and sort some informations about Oracles Transparent Application Failover. There is a lot of general information available in the wild, but no deeper details. Here I try to show my findings.

Testcase

For my test-database with DB_UNIQUE_NAME: TTT06_SITE1 I created the service
srvctl add service -d TTT06_SITE1 -s TTT06_TAF -P BASIC -e SELECT -r TTT061,TTT062 .
The tnsnames.ora entry is
TTT06_TAF =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = OFF)
      (ADDRESS = (PROTOCOL = TCP)(HOST = crs908.my.domain)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = TTT06_TAF)(SERVER=DEDICATED)
    )
  )

tracing

Just
strace -f -t -s 128 -o sqlplus_taf.strace sqlplus "berx/berx123#@TTT06_TAF"
I will look closely on the sqlplus_taf.strace soon, just the testcase can be finished easily:

current instance


SELECT (SELECT instance_number
        FROM   v$instance) inst,
       s.sid,
       s.service_name,
       s.failover_type,
       s.failover_method,
       s.failed_over,
       p.spid
FROM   v$process p,
       v$session s
WHERE  s.paddr = p.addr
       AND addr IN (SELECT paddr
                    FROM   v$session
                    WHERE  audsid = Sys_context('USERENV', 'SESSIONID'));


   INST      SID SERVICE_NAME FAILOVER_TYPE FAILOVER_M FAI   SPID
------- -------- ------------ ------------- ---------- --- ------
      1      144    TTT06_TAF        SELECT      BASIC  NO  23440

and after a startup force in a 2nd session in instance 1

new instance


/

   INST      SID SERVICE_NAME FAILOVER_TYPE FAILOVER_M FAI   SPID
------- -------- ------------ ------------- ---------- --- ------
      2      146    TTT06_TAF        SELECT      BASIC YES  14927

what's going on

A short excerpt of the sqlplus_taf.strace
First sqlplus tries to access ~/.tnsnames.ora, fails and then opens $TNS_ADMIN/tnsnames.ora. Of course there it reads the connection string shown above.
Next it tries to resolve the HOST entry:
connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("<my_dns>")}, 28) = 0
and gets all IPs for my SCAN-DNS.
sqlplus asks one of the SCAN listeners:
connect(9, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("<any SCAN IP>")}, 16) = -1 EINPROGRESS (Operation now in progress)
for the SERVICE and gets a kind of redirect:
read(9, "\1\10\0\0\6\0\0\0\0@(ADDRESS=(PROTOCOL=TCP)(HOST=<NODE1-vip>)(PORT=1521))\0(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=<SCAN IP>)(PORT=1"..., 8208) = 264
The SCAN-Listener is of no good anymore: close(9). sqlplus looks up the name of <NODE2-vip> in /etc/hosts and tries it's next step with the <NODE1-vip> listener:
connect(9, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("<NODE1-vip IP>")}, 16) = -1 EINPROGRESS (Operation now in progress)

The listener creates a server process for sqlplus - and let's them do their private business.


The startup force killed the server process for sqlplus. But it doesn't know anything about it, until it get's the <newline> from the terminal.

Of course filehandle 9 is somewhat dead and gets close(9). Now really the same steps as above (just tnsnames.ora is not re-read!): SCAN IP lookup, redirect to an NODE-vip, etc.

So only tnsnames.ora is cached, all other lookups and connections are re-run again.
Take this into account if you try to change your setup (IPs, lookups, DNS) while connections are active.