Sonntag, 5. Mai 2019

invisible IOs

My team identified an interesting case of invisible IOs this week. During some checks of system statistics graphs, something unexpected ocured:
One of those "small & unimportant" test DBs had an interesting IO pattern:

This was something interesting, so a colleague decided to have a deeper look at it. The next obvious step is to see if AAS shows anything interesting. And it does:

A lot of blue (which is used for IO in all tools I'm aware of) was expected, but it's obviously dominated by green (CPU). What a surprise!

After some quick crosschecks if any of the systems show fake data, we agreed it's all correct. Still the question exists: why does system statistics show a lot of IO, whereas ASH looks very different?

More data is needed, so one question was: out of all the direct path read IOs we saw in this timespawn, how fast were they? The first source for a check is DBA_HIST_SYSTEM_EVENT:
    MAX(total_waits) - MIN(total_waits) AS sum_total_waits,
    MAX(time_waited_micro) - MIN(time_waited_micro) AS sum_time_waited,
    ( MAX(time_waited_micro) - MIN(time_waited_micro) ) / ( MAX(total_waits) - MIN(total_waits) + 1) time_per_wait
    dba_hist_snapshot       snap,
    dba_hist_system_event   sysev
WHERE snap.snap_id = sysev.snap_id
    AND begin_interval_time BETWEEN to_timestamp('2019-05-03 06:00:00', 'YYYY-MM-DD HH24:MI:SS') 
                                AND to_timestamp('2019-05-03 20:00:00', 'YYYY-MM-DD HH24:MI:SS')
        ( sysev.event_name,          sysev.wait_class ),
        ( sysev.wait_class )
ORDER BY    3 DESC,    2,    1;

------------------------------ --------------- --------------- --------------- --------------
                               User I/O              156659399     87378811924         557.76
                               System I/O            134570949     10821645896          80.42
                               Idle                   66603429 ###############     2432741.08
                               Network                65725910      5330619377          81.10
                               Administrative         20587618    191843235750        9318.38
direct path read               User I/O               10353091      2001455155         193.32
db file sequential read        User I/O                1008098       255164480         253.11
                               Other                    947941     10721785435       11310.59
control file sequential read   System I/O               472653        11995170          25.38
                               Application              460657        73815782         160.24
SQL*Net message from client    Idle                     303956    500628620721     1647037.64

The topmost User I/O event is direct path read. During the 14 hour period, it happened at about 200 times every second. Based on the ASH graph above we can assume it all happened in one session, so in this session every second about 40ms were spent in direct path read.
This now raises the question why these 40% of time are not visible in ASH? The best answer I found so far is described in ASHviz: Densities and dark matter by John Beresniewicz.
Short events are underrepresented in sampled data. And as one single direct path read lasted only 193μs on average, the observation was hit by this bias.
With this very simple test-DB it was possible to investigate the situation to a reasonable degree of understanding, But chasing the real producer of high IOs in a busy production system might be misleading when only ASH is used.

This situation is kind of a downer for me as ASH does not even close show the truth. (this is a good example, where tracing is better than sampling). So a reliable source for additional information is required.
In recent Oracle versions (10.1 or later) there is some information available: Of course v$sesstat shows statistics for every session, but the numbers are cumulative and must be processed somehow.
Oracle provides v$sessmetric for this purpose, but it only shows the last x seconds and only a small subset of statistics. A more generic solution is Tanel Põders snapper. But both approaches can only show what's going on at the time of observation.
For something more DBA_HIST_ish, DBA_HIST_SESSMETRIC_HISTORY looks promising, but it is misleading, as This view is populated only if a session metric exceeds a server metric threshold that was configured using the DBMS_SERVER_ALERT package.
There is much more about these features & their limitations in Kyle Hailys Oracle 10g Performance: chapter 04 new features.

With all that in mind, I hope for additional information about session statistic deltas. (SSD would be a perfect and unambiguous abbreviation 😅)
In short, it should write all changed session stats for all sessions at every snapshot. In addition for every closed session their statistics must be saved, otherwise their data is lost. Tanel implemented something similar already in his Session-level statspack but he decided not to store session statistics by default to save space. Also no logoff-trigger is there (yet).
Let's see if there will ever be a SSD implementation available.

Samstag, 30. März 2019

limit IOs by instances

This is a follow up to my previous post.

I had to answer if I can limit the (IO) resources a DB instance can utilize. Unfortunately, in a simple instance I can not do so. It can be done in PDBs, but right now PDBs are out of scope.
So a simpler approach was developed: limiting IOs by creating cgroups.

There are 2 steps:

  1. disks
    1. create a proper cgroup
    2. get all related devices for a given ASM diskgroup
    3. set proper limits for all the devices
  2. regularly add all matching processes to this cgroup

In the cgroups mountpoint, a new directory must be created which is used as "root" for all cgroups, so it does not collide with other cgroups implementations in the system.

This leads to a structure like
where ORACLE is the "root" and "BX1" is the name of a specific cgroup.
In this group limits can be set, e.g. in blkio.throttle.read_bps_device or blkio.throttle.read_iops_device.
As the limit is per device, the total limit is divided by the number of devices.

To have any effect on processes, all processes PIDs which are under the regime of this cgroup are added to /sys/fs/cgroup/blkio/ORACLE/BX1/tasks.
For more flexibility, a list of patters (to match cmd in ps -eo pid,cmd ) are defined for each cgroup individualy. e.g. all processes, which either match a foreground process which was connected via listener BXTST02.*LOCAL=NO or any parallel process ora_p[[:digit:]].{2}_BXTST02.

In my configuration (crontab), the disks are added to the cgroup once per hour, whereas processes are added every minute.
This can lead to some delays if disks are added, and every single process can live up to one minute without any limits.

But for a longer period, it should be quite stable. (or at least, the best I can do in the sort time given)

The effect is acceptable. In a generic test with SLOB the picture is obvious:
When the processes were added to the cgroup, MB/s dropped down to the 5MB/s configured.

Of course by these limits, the average response time (or wait time) goes up.
In another test where SLOB was in the cgroup all the time, the MEAN responsetime was 0.027 sec.
But a histogram shows more than 50% of the calls finish within 10 ms (a reasonalbe value for the storage system in this test) but there is a peak between 25 and 50 ms which dominates the total response time.
RANGE {min ≤ e < max}    DURATION       %   CALLS      MEAN       MIN       MAX
---------------------  ----------  ------  ------  --------  --------  --------
 1.     0ms      5ms     6.010682    2.0%   2,290  0.002625  0.000252  0.004999
 2.     5ms     10ms    26.712672    9.0%   3,662  0.007295  0.005002  0.009997
 3.    10ms     15ms    12.935713    4.4%   1,090  0.011868  0.010003  0.014998
 4.    15ms     20ms     6.828035    2.3%     398  0.017156  0.015003  0.019980
 5.    20ms     25ms     4.846490    1.6%     218  0.022232  0.020039  0.024902
 6.    25ms     50ms    17.002454    5.7%     471  0.036099  0.025085  0.049976
 7.    50ms    100ms   182.104408   61.6%   2,338  0.077889  0.050053  0.099991
 8.   100ms  1,000ms    39.354627   13.3%     326  0.120720  0.100008  0.410570
 9. 1,000ms       +∞
---------------------  ----------  ------  ------  --------  --------  --------
TOTAL (9)              295.795081  100.0%  10,793  0.027406  0.000252  0.410570

This also can be seen in a graph:

The system is working and stable; probably not perfect but good enough for it's requirements.
There was a discussion if this should be achieved on storage layer. This would limit every process immediately, but would also be a much stricter setting. As an example I can exclude logwriter from any cgroup and let it work as fast as possible, whereas IO limits on storage side would put logwriter and foreground processes in same limits.

The script can be found at github, but it has some prerequisits and might not work on other than my current systems without adaption. Don't hesitate to contact me if you really want to give it a try 😉

Sonntag, 17. Februar 2019

shared resources

Some days ago I had one of these calls from an application colleague:
The database was slow yesterday, can you please check what's the problem?
Of course, I had some short discussion if he really means the DB or should it be called the application is slow. Also some other questions needed to be asked first, e.g. if it's a response time or throughput issue, when it was "good" last time, what "bad" and "good" means in numbers (seconds/ experience, requests / second), if it's affecting all the experiences on this DB or only a subset. Can it be reproduced or at least does it occur on a known pattern. Also the DB name and users affected I had to ask.
Some time I should create a comprehensive checklist for these questions as I'm missing some of them regularly. But that not today's topic.

As the answers are very vague (at least I got a DB & username with some rough timestamp when it might has started) I checked the DB. I'm it would be more precise to focus on user experience. But I know the developers quite well; it would cost a lot of effort & management pressure to convince them to implement proper instrumentation.
So I was going for low hanging fruits first. ASH/AWR showed this picture:
There is time of more activities in the Database, and it's quite visible where it started and ended. With some more analysis it was obvious the number of calls didn't change dramatically. Getting data from the application (yes, there was at least an activity log) showed in fact the number of requests followed a daily pattern, but this didn't look like the picture above. The pattern of demand was similar to other days; but on any other day AWR showed no such picture, only a "breathe" similar to the applications demand.
By this conversation we implicit generated a (still very blurry) "good" pattern to compare the "bad" one.

These bits of information were not perfect, but at least something I could use for further investigations.
AWR shows much higher IO times than normal. I crosschecked the most active SQL_IDs from a "good" pattern against the "bad" one and could confirm they are IO-bound in "good" as well, so they might be affected by this unusual IO.
On the DB-cluster TFA is installed, and so is oswatcher. (Again, aggregates only. But still much better information than nothing). A small awk script analysed the data there (I just don't know why oswatcher doesn't has a SQL interface similar to osquery ) and there were huge response time, but no queueing on the Linux hosts.
This would now require digging deeper through the storage layers (FC SAN and several big central storage boxes). Unfortunately I does not have access to them and their performance data is not externalized right now and the SAN admin was to busy with other things, like compiling reports for senior managers 🤦.
But at least I'm somehow lucky: At this company a nice schema is implemented which gives ASM-disks much more information than only /dev/mapper/anything. As the name of the storage box is part of the ASM-disk path, by some knowledge about the possible servers involved (e.g. only servers in this datacenter, and with some other attributes could be involved at all) and a centralized repository I got a list of all DB-servers and DBs potentially involved on that layer.
With this list and another part of our repository (in that case periodic v$sysstat snapshots [aggregates of aggregates - guess the accuracy there]) a list of other DBs with high IO numbers during that time could be compiled.
Checking these databases there was a top producer of IO:

Some additional analysis showed this load on this particular DB created sufficient IO to saturate one subsystem in the storage box. By doing so all other system which use the same subsystem also suffered. To crosscheck that, the Storage admin (who completed the report in the meantime) provided other DBs utilizing that subsystem - and they showed similar patterns at exactly that time.

Even with all the data collected, this accumulation of coincidents is still no prove what was going on. But it was sufficient likely for the management to think about remediation activities.

Here the question might show up why only the admins of this one application showed up and complained about their system? During the investigation it was visible that other DBs showed similar patterns and so other applications were likely to suffer also.
I invested some time to  get answers to these questions. This is not related to the initial goal of the activities, only to satisfy my curiosity and improve my knowledge about the whole landscape.
It showed these other applications also suffered, but the impact was much smaller for 2 reasons:

  1. The portion of DB IO time to their user experience was relatively small. Even these high response time didn't cause a lot of harm. 
  2. The requirements of the most suffering application were more strict. It can be explained by the time a person accepts to wait after a click: If this click is on a webpage, most readers of this blog might understand what's going on behind the scene and some tenth of a second up to some seconds are acceptable for an answer. But if you flip a switch to turn on the light, they expect the light to shine immediately (aka. within less time than the human senses can grasp)

After all the analysis was done and roughly explained to the manager of the most suffering application, he invited for a meeting.
A manager, some people with different roles in development and ops, a representative for the "culprit application which caused the high load" and I were there. He opened the meeting with a summary (condensed by me to):
We and they share the same storage, they did some heavy query and we suffered. 
Technically this is right, but the way it's provided to those not experienced in IT leads to the managers request:
We must ensure they do not cause such IO load anymore. 

This might seem legit, but it might not lead to the managers real goal: Make his application reasonable fast in as many situations as possible. So I had to kill this branch of the discussion for a moment and explained, why they are not the real problem:
As the whole IT is a shared system on various layers (different hosts share the same storage system, different virtual hosts share the same VM layer, different instances share the same DB-cluster, different sessions share the same instance, ...) it is an unreasonable big effort to define resource limits, quotas and similar for all components involved. Especially in a highly dynamic environment with hundreds of DBs, thousands of servers and zillions of processes.
A majority in this room showed some shock & horror and claimed they were not aware of the high level of complexity in basic infrastructure.

With some effort we can limit the IO consumption of this particular DB which caused the issue this time. But next time (and there will be a next time) we only can be sure this particular DB is not involved.
As that's not sufficient, the next request was:
Let's then monitor the storage box and alarm every time a component is saturated.
A nice idea, isn't it?
But whom to alarm? -
The SAN admin of course. 
What should he do then? He can't reduce the load and can't increase the storage capacity either (at least not short term without budget, vendor involved, ...). 
So let's inform those which are connected to this resource! 
In our example it would have been two dozen applications - most of them didn't suffer or care.
Then only the top 5.
In our case you wouldn't have been alarmed.

This also doesn't lead to anything useful.

About at that time I raised a simple question:
Who in this room is responsible for the (suffering part of the) application?
(the managers face went pale for a second).
After some moments, an external (not employed) developer raised his hand and muttered something like "If no one wants to, I can take that". 
I was slightly bored of this meeting, so to speed up things my next question was:

Is every IO your application doing [during the critical activity] necessary?

The argument in short goes like that: If you are less dependent on IO timing, even bad IOs can't harm that much. I call this

make it robust against impacts out of your control

 Beside the obvious effect to improve the application, a nice side effect is a less busy total environment. This again pays back as better service for the whole company.

The meeting stopped close to that point. The new found architect & developers promised to see if they can make their application more robust, the manager wants to raise awareness in all IT departments what a shared environment is and how it can affect business processes within the company.
I have a mixed feeling: the analysis seems valid, but it was full of educated guess & shortcuts - this doesn't feel right as it contains "experience" and "luck"; there is a method which describes a more structured way, I just felt unable to follow these steps . I hope I could transfer the initial reaction (finger-pointing to another team) to something more valuable for the company: use less resources, make your application robust.
But I doubt my effort was lasting.

Donnerstag, 3. Januar 2019

connection manager - what happens at startup & shutdown

Oracle Connection Manager (cman) is a great tool to create a gateway between networks which can not be connected, or filter those who can access to a service.
Most of the time, it's a very robust implementation and doesn't need a lot of attention. But if it's required to dig into it, a basic knowledge about it's components can be useful.

In this post I'll show what happens in a simple startup (& shutdown) of a mostly "default" cman.
The test was done on Oracle Linux 7.5 with cman


In my setup, processes are called this order.
at startup, cmctl spawns several processes. these processes communicate via sockets and has some shared memory segments

The details: 

The results in this post show some lines of interest cenerated by
strace -f -o <some_dir&gt>/cman_$$.strace -y -s 128 cman
and the 3 commands
administer cman_berx

Before each quote of the tracefile I write the line number - this is not an exact offset for your tests, just shows the sequence and sometimes big gaps which I do not show. I will name PIDs and executes together for better readability.

10115 execve("<ORACLE_HOME>/bin/cmctl", ["cmctl"], [/* 118 vars */]) = 0
process 10115 is now cmctl.

10115 open("<ORACLE_HOME>/env.ora", O_RDONLY) = 3
10115 read(3, "# Copyright (c) 2015, Oracle and/or its affiliates. All rights reserved.\n# \n# NAME\n#   env.ora\n#\n# FUNCTION\n#   Oracle ENVironme"..., 4096) = 852
The undocumented file <ORACLE_HOME>/env.ora is checked for TNSNAMES_ORA parameter there - my file is default, so nothing is set there.

10115 access("<TNS_ADMIN>/cman.ora", F_OK) = -1 ENOENT (No such file or directory)
10115 access("/etc/cman.ora", F_OK)     = -1 ENOENT (No such file or directory)
10115 access("<ORACLE_HOME>/network/admin/cman.ora", F_OK) = 0
10115 open("<ORACLE_HOME>/network/admin/cman.ora", O_RDONLY) = 3
10115 read(3<<ORACLE_HOME>/network/admin/cman.ora>, "######################################################################\n#\n# Copyright (c) 2001,2002, Oracle Corporation. All righ"..., 4096) = 3996 
10115 close(3<<ORACLE_HOME>/network/admin/cman.ora>) = 0 
10115 (cman) searches for cman.ora in <TNS_ADMIN> (nothing there) /etc (also nothing) and <ORACLE_HOME>/network/admin (there is an entry for cman_berx).

10115 write(1</dev/pts/0>, "CMCTL> ", 7) = 7
10115 read(0</dev/pts/0>, "administer cman_berx\n", 1024) = 21
10115 (cmctl) writes a prompt (CMCTL>) and gets my command line command (administer cman_berx).

10115 connect(4<socket:[1549779484]>, {sa_family=AF_INET, sin_port=htons(1522), sin_addr=inet_addr("x.y.z.78")}, 16) = -1 EINPROGRESS (Operation now in progress)
10115 (cmctl) tries to connect to the address:port provided in cman.ora - and can't connect. I assume that's the moment it realises it must start all related processes.

10115 geteuid()                         = 5001
10115 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b8713915150) = 10286
10115 rt_sigprocmask(SIG_BLOCK, [PIPE], NULL, 8) = 0
10286 set_robust_list(0x2b8713915160, 24 <unfinished ...>
10115 rt_sigaction(SIGPIPE, {0x2b8717669de0, ~[ILL TRAP ABRT BUS FPE SEGV USR2 TERM XCPU XFSZ SYS RTMIN RT_1], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x2b8718e9f680},  <unfinished ...>
10286 <... set_robust_list resumed> )   = 0
10115 <... rt_sigaction resumed> {SIG_DFL, [], 0}, 8) = 0
10115 rt_sigprocmask(SIG_UNBLOCK, [PIPE], NULL, 8) = 0
10286 clone( <unfinished ...>
10115 wait4(10286,  <unfinished ...>
10286 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b8713915150) = 10287
10287 set_robust_list(0x2b8713915160, 24) = 0
10287 close(6<pipe:[1549773687]>)       = 0
10287 close(7<pipe:[1549773688]> <unfinished ...>
10286 exit_group(0)                     = ?
10287 <... close resumed> )             = 0
10287 setsid( <unfinished ...>
10286 +++ exited with 0 +++
10115 (cmctl) creates (clone) a new process (10286) - this process immediately clones another one (10287) and then terminates itself (exit_group). By this there is no parent-child relationship between 10115 (cmctl) and 10287. Still they both use the same binary (cmctl) and filehandles etc. - so 10287 closes some filehandles it doesn't require.

10287 execve("/appl/oracle/product/cman_12201/bin/cmadmin", ["/appl/oracle/product/cman_12201/bin/cmadmin", "cman_berx", "-inherit"], [/* 119 vars */]) = 0
process 10287 is now cmadmin.

10287 access("/var/tmp/.oracle", F_OK)  = 0
10287 chmod("/var/tmp/.oracle", 01777)  = -1 EPERM (Operation not permitted)
10287 socket(AF_LOCAL, SOCK_STREAM, 0)  = 6<socket:[1549770245]>
10287 access("/var/tmp/.oracle_500100", F_OK) = -1 ENOENT (No such file or directory)
10287 mkdir("/var/tmp/.oracle_500100", 0700) = 0
10287 chmod("/var/tmp/.oracle_500100", 0700) = 0
10287 access("/var/tmp/.oracle_500100/s#10287.1", F_OK) = -1 ENOENT (No such file or directory)
10287 bind(6<socket:[1549770245]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10287.1"}, 110) = 0
10287 chmod("/var/tmp/.oracle_500100/s#10287.1", 0777) = 0
10287 listen(6<socket:[1549770245]>, 100) = 0
10287 getsockname(6<socket:[1549770245]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10287.1"}, [36]) = 0
10287 chmod("/var/tmp/.oracle_500100/s#10287.1", 0777) = 0 
10287 (cmadmin) checks if /var/tmp/.oracle exists and if it can change it's permission. In my environment the directory exists (as Grid Infrastructure is installed) but it's permissions can't be changed. It belongs to root.  Then /var/tmp/.oracle_500100 is created and a socket is created there. For easier readability, the PID is part of the socket name.

10287 clone(child_stack=0x2b6fd20de570, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2b6fd20e99d0, tls=0x2b6fd20e9700, child_tidptr=0x2b6fd20e99d0) = 10289
10287 (cmadmin) creates another process (10289).

10287 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10290
10287 (cmadmin) creates another process (10290).

10290 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10291
10290 exit_group(0)                     = ?
10290 +++ exited with 0 +++
10290 creates 10291 and terminates itself.

10291 execve("/appl/oracle/product/cman_12201/bin/tnslsnr", ["/appl/oracle/product/cman_12201/bin/tnslsnr", "ifile=/appl/oracle/product/cman_12201/network/admin/cman.ora", "cman_berx", "-inherit", "-mode", "proxy"], [/* 119 vars */]) = 0
10291 becomes tnslsnr

10291 bind(11<socket:[1549780776]>, {sa_family=AF_INET, sin_port=htons(1522), sin_addr=inet_addr("")}, 16) = 0
10291 listen(11<socket:[1549780776]>, 128) = 0
10291 getsockname(11<socket:[1549780776]>, {sa_family=AF_INET, sin_port=htons(1522), sin_addr=inet_addr("")}, [16]) = 0
10291 (tnslsnr) listens on - But in cman.ora a specific IP address is given.

10291 write(12<<DIAG_DEST>/netcman/<hostname>/cman_berx/trace/cman_berx.log>, "Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=<server.domain>)(PORT=1522)))\n", 91) = 91
10291 (tnslsnr) writes something different to the logfile (?)

10291 access("/var/tmp/.oracle", F_OK)  = 0
10291 chmod("/var/tmp/.oracle", 01777)  = -1 EPERM (Operation not permitted)
10291 socket(AF_LOCAL, SOCK_STREAM, 0)  = 14<socket:[1549790968]>
10291 access("/var/tmp/.oracle/s#10291.1", F_OK) = -1 ENOENT (No such file or directory)
10291 bind(14<socket:[1549790968]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/s#10291.1"}, 110) = 0
10291 chmod("/var/tmp/.oracle/s#10291.1", 0777) = 0
10291 listen(14<socket:[1549790968]>, 1) = 0
10291 getsockname(14<socket:[1549790968]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/s#10291.1"}, [29]) = 0
10291 chmod("/var/tmp/.oracle/s#10291.1", 0777) = 0
10291 (tnslsnr) creates its own socket file for communication purpose.

10291 connect(16<socket:[1549778909]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sprocr_local_conn_0_PROL"}, 110) = 0

10291 connect(15<socket:[1549778910]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle/sCRSD_UI_SOCKET"}, 110) = 0
10291 (tnslsnr) checks some sockets which belong to Grid Infrastructure.

10291 clone(child_stack=0x2ac6b580b570, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2ac6b58169d0, tls=0x2ac6b5816700, child_tidptr=0x2ac6b58169d0) = 10292
10291 (tnslsnr) creates PID 10292

10292 clone( <unfinished ...>
10291 <... close resumed> )             = 0
10292 <... clone resumed> child_stack=0x2ac6b584c570, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x2ac6b58579d0, tls=0x2ac6b5857700, child_tidptr=0x2ac6b58579d0) = 10293
10293 set_robust_list(0x2ac6b58579e0, 24 <unfinished ...>
10292 creates PID 10293

10293 exit(0)                           = ?
10292 <... poll resumed> )              = 1 ([{fd=15, revents=POLLIN}])
10293 +++ exited with 0 +++
10293 exits - it did not "clone" or "execve" anything

10287 shmget(0x59248000, 2462280, IPC_CREAT|IPC_EXCL|0600 <unfinished ...>
10287 (cmadmin) allocates a shared memory segment with key 0x59248000.

10287 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10294
10294 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10295
10294 exit_group(0)                     = ?
10294 +++ exited with 0 +++
10287 (cmadmin) creates PID 10294, this creates PID 10295 and terminates itself.

10295 execve("/appl/oracle/product/cman_12201/bin/cmgw", ["/appl/oracle/product/cman_12201/bin/cmgw", "cmgw0", "0", "16", "cman_berx", "SNLSM:59248000"], [/* 119 vars */]) = 0
10295 becomes cmgw0 - you can see the parameter SNLSM - it gives the shared memory key from 10287 (cmadmin).

10295 shmget(0x59248000, 1, 0)          = 2099576860
10295 (cmgw0) attaches to shared memory 10287 (cmadmin).

10287 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10296
10296 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b6fd1f80150) = 10297
10296 exit_group(0)                     = ?
10287 (cmadmin) creates PID 10296 which creates 10297 and exits. - same as above

10297 execve("/appl/oracle/product/cman_12201/bin/cmgw", ["/appl/oracle/product/cman_12201/bin/cmgw", "cmgw1", "1", "16", "cman_berx", "SNLSM:59248000"], [/* 119 vars */] <unfinished ...>
10297 becomes cmgw1.

some (later) important sockets:
10295 connect(5<socket:[1549804554]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10287.1"}, 110 <unfinished ...>

10297 connect(5<socket:[1549798977]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10287.1"}, 110 <unfinished ...>

10295 bind(7<socket:[1549804555]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10295.1"}, 110 <unfinished ...>

10297 bind(7<socket:[1549798978]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10297.1"}, 110 <unfinished ...>

10115 connect(5<socket:[1549779550]>, {sa_family=AF_LOCAL, sun_path="/var/tmp/.oracle_500100/s#10287.1"}, 110 <unfinished ...> 

At this stage, cman is started, all processes are in memory and happy to server any requests.

The shutdown- command will be much faster:

10115 <... read resumed> "shutdown\n", 1024) = 9
10115 (cmctl) recieves "shutdown" from command line

10115 write(5, "\0\0\0S\6$\0\0\0\0E\0\0\200\2\0\0\0\0\4\0\0\220_8\2\0\0\0\0\20\0#\0\0\000901A4C4F9FD8-3939-E034-0800208AB384\1\0v\0\0\0\3\0\0\0\0\0", 83) = 83
101115 (cmctl) informs 10287 (cmadmin)

10287 write(5&lt;socket:[1549778933]&gt;, "\0\0\0S\6 \0\0\0\0E\0\0\200\2\0\0\0\0\4\0\0p\344t\2\0\0\0\0\20\0#\0\0\00091986B2B10AC-3001-E034-0800208AB384\1\0l\0\0\0\1\0\0\0\0\0", 83) = 83
10287 times( &lt;unfinished ...&gt;
10295 &lt;... epoll_wait resumed&gt; [{EPOLLIN, {u32=13989488, u64=13989488}}], 1024, -1) = 1
10287 &lt;... times resumed&gt; NULL)         = 1269637186
10287 epoll_wait(7&lt;anon_inode:[eventpoll]&gt;,  &lt;unfinished ...&gt;
10295 read(5&lt;socket:[1549804554]&gt;, "\0\0\0S\6 \0\0\0\0E\0\0\200\2\0\0\0\0\4\0\0p\344t\2\0\0\0\0\20\0#\0\0\00091986B2B10AC-3001-E034-0800208AB384\1\0l\0\0\0\1\0\0\0\0\0", 8208) = 83
10287 (cmadmin) informs 10295 (cmgw0) (and later 10297 (cmgw1) by the same method) to shutdown

some cleanup takes place:

10295 unlink("/var/tmp/.oracle_500100/s#10295.1" <unfinished ...>

10295 rmdir("/var/tmp/.oracle_500100" <unfinished ...>

10297 unlink("/var/tmp/.oracle_500100/s#10297.1" <unfinished ...>

and all the processes terminate:

10289 exit(0)                           = ?
10289 +++ exited with 0 +++

10295 exit_group(0)                     = ?

10297 exit_group(0)                     = ?

10291 unlink("/var/tmp/.oracle/s#10291.1" <unfinished ...>

10291 exit_group(0)                     = ?

10287 exit_group(0)                     = ?

10115 exit_group(0)                     = ?

That's a simple startup and shutdown of connection manager - nothing fancy here, just a lot of processes and sockets which are created - and then deleted.

I hope you never need to read this article to it's end. But if you need to, it should provide some value for you!