2019-03-30

limit IOs by instance

This is a follow-up to my previous post.

I had to answer whether I can limit the (IO) resources a DB instance can utilize. Unfortunately, this is not possible for a simple instance; it can be done for PDBs, but right now PDBs are out of scope.
So a simpler approach was developed: limiting IOs by creating cgroups.

There are 2 steps:

  1. disks
    1. create a proper cgroup
    2. get all related devices for a given ASM diskgroup
    3. set proper limits for all the devices
  2. regularly add all matching processes to this cgroup

In the cgroups mountpoint, a new directory must be created which serves as the "root" for all my cgroups, so they do not collide with other cgroup users on the system.

This leads to a structure like
/sys/fs/cgroup/blkio/ORACLE/BX1/
where ORACLE is the "root" and BX1 is the name of a specific cgroup.
In this group, limits can be set, e.g. in blkio.throttle.read_bps_device or blkio.throttle.read_iops_device.
As each limit is set per device, the total limit for the diskgroup is divided by the number of its devices.
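
A minimal sketch of this step (the device names and the 5 MB/s total are made-up examples here; in reality the device list is derived from the ASM diskgroup):

  CGROUP=/sys/fs/cgroup/blkio/ORACLE/BX1
  DEVICES="/dev/sdb /dev/sdc /dev/sdd /dev/sde"  # members of the diskgroup
  TOTAL_BPS=$((5 * 1024 * 1024))                 # 5 MB/s for the whole diskgroup

  mkdir -p "$CGROUP"                             # creates the "root" and the cgroup

  # the limit is per device, so each device gets its share of the total
  NDEV=$(echo $DEVICES | wc -w)
  BPS=$((TOTAL_BPS / NDEV))

  for DEV in $DEVICES; do
    MAJMIN=$(lsblk -dno MAJ:MIN "$DEV" | tr -d ' ')  # e.g. "8:16"
    echo "$MAJMIN $BPS" > "$CGROUP/blkio.throttle.read_bps_device"
  done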

To have any effect, the PIDs of all processes which should be under the regime of this cgroup are added to /sys/fs/cgroup/blkio/ORACLE/BX1/tasks.
For more flexibility, a list of patterns (matched against cmd in ps -eo pid,cmd) is defined for each cgroup individually, e.g. all foreground processes which were connected via listener (BXTST02.*LOCAL=NO) and all parallel processes (ora_p[[:digit:]].{2}_BXTST02).
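
A sketch of such a scan, with the two example patterns from above (pgrep -f, which matches against the full command line, stands in here for the matching of cmd in ps -eo pid,cmd):

  CGROUP=/sys/fs/cgroup/blkio/ORACLE/BX1
  for P in 'BXTST02.*LOCAL=NO' 'ora_p[[:digit:]].{2}_BXTST02'; do
    for PID in $(pgrep -f "$P"); do
      # the tasks file accepts only one PID per write;
      # errors for processes that already exited are ignored
      echo "$PID" > "$CGROUP/tasks" 2>/dev/null
    done
  done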

In my configuration (crontab), the disks are added to the cgroup once per hour, whereas processes are added every minute.
This can lead to some delay when disks are added, and every new process can live for up to one minute without any limits.
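
The crontab entries could look roughly like this (the script names are hypothetical):

  # m h dom mon dow  command
  0 * * * *  /usr/local/bin/cgroup_disks.sh     BX1   # re-check devices hourly
  * * * * *  /usr/local/bin/cgroup_processes.sh BX1   # catch new processes every minute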

But over a longer period it should be quite stable (or at least, the best I can do in the short time given).

The effect is acceptable. In a generic test with SLOB the picture is obvious:
when the processes were added to the cgroup, throughput dropped to the configured 5 MB/s.

Of course, with these limits the average response time (or wait time) goes up.
In another test, where SLOB was in the cgroup all the time, the MEAN response time was 0.027 sec.
But a histogram shows that more than 50% of the calls finish within 10 ms (a reasonable value for the storage system in this test), while a peak between 50 and 100 ms dominates the total response time.
RANGE {min ≤ e < max}    DURATION       %   CALLS      MEAN       MIN       MAX
---------------------  ----------  ------  ------  --------  --------  --------
 1.     0ms      5ms     6.010682    2.0%   2,290  0.002625  0.000252  0.004999
 2.     5ms     10ms    26.712672    9.0%   3,662  0.007295  0.005002  0.009997
 3.    10ms     15ms    12.935713    4.4%   1,090  0.011868  0.010003  0.014998
 4.    15ms     20ms     6.828035    2.3%     398  0.017156  0.015003  0.019980
 5.    20ms     25ms     4.846490    1.6%     218  0.022232  0.020039  0.024902
 6.    25ms     50ms    17.002454    5.7%     471  0.036099  0.025085  0.049976
 7.    50ms    100ms   182.104408   61.6%   2,338  0.077889  0.050053  0.099991
 8.   100ms  1,000ms    39.354627   13.3%     326  0.120720  0.100008  0.410570
 9. 1,000ms       +∞
---------------------  ----------  ------  ------  --------  --------  --------
TOTAL (9)              295.795081  100.0%  10,793  0.027406  0.000252  0.410570
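
Reading the table: (2,290 + 3,662) / 10,793 ≈ 55% of all calls finish below 10 ms, but the 50-100 ms bucket alone contributes 182.1 of the 295.8 seconds total duration (61.6%).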

This can also be seen in a graph of the response-time distribution.

The system is working and stable; probably not perfect, but good enough for its requirements.
There was a discussion whether this should be achieved at the storage layer instead. That would limit every process immediately, but it would also be a much stricter setting. For example, I can exclude the log writer from any cgroup and let it work as fast as possible, whereas IO limits on the storage side would put the log writer and foreground processes under the same limits.

The script can be found on github, but it has some prerequisites and might not work on systems other than my current ones without adaptation. Don't hesitate to contact me if you really want to give it a try 😉