2019-03-30

limit IOs by instance

This is a follow-up to my previous post.

I had to answer whether I can limit the (IO) resources a DB instance can utilize. Unfortunately, in a simple (non-CDB) instance I cannot do so. It can be done for PDBs, but right now PDBs are out of scope.
So a simpler approach was developed: limiting IOs by creating cgroups.

There are 2 steps:

  1. disks
    1. create a proper cgroup
    2. get all related devices for a given ASM diskgroup
    3. set proper limits for all the devices
  2. regularly add all matching processes to this cgroup

In the cgroups mountpoint, a new directory must be created which is used as a "root" for all my cgroups, so they do not collide with other cgroup users on the system.

This leads to a structure like
/sys/fs/cgroup/blkio/ORACLE/BX1/
where ORACLE is the "root" and "BX1" is the name of a specific cgroup.
In this group limits can be set, e.g. in blkio.throttle.read_bps_device or blkio.throttle.read_iops_device.
As the limit is per device, the total limit is divided by the number of devices.
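
A minimal sketch of this step, assuming the 5 MB/s total read limit used in the test below; the device names and the way they are resolved from the ASM diskgroup are made up for illustration, the real script is more elaborate:

#!/bin/bash
# sketch: create the cgroup and spread a total read limit over all devices
GROUP=/sys/fs/cgroup/blkio/ORACLE/BX1   # "ORACLE" root, cgroup "BX1"
TOTAL_BPS=$((5 * 1024 * 1024))          # 5 MB/s for the whole instance
DEVICES="/dev/sdb /dev/sdc /dev/sdd"    # in reality resolved from the ASM diskgroup

mkdir -p "$GROUP"

# the limit is per device, so divide the total evenly
set -- $DEVICES
LIMIT=$(( TOTAL_BPS / $# ))

for DEV in $DEVICES; do
  MAJMIN=$(lsblk -dno MAJ:MIN "$DEV" | tr -d ' ')
  # the throttle files expect "major:minor bytes_per_second"
  echo "$MAJMIN $LIMIT" > "$GROUP/blkio.throttle.read_bps_device"
done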

To have any effect on processes, the PIDs of all processes which fall under the regime of this cgroup are added to /sys/fs/cgroup/blkio/ORACLE/BX1/tasks.
For more flexibility, a list of patterns (matched against cmd in ps -eo pid,cmd) is defined for each cgroup individually, e.g. all foreground processes which were connected via the listener (BXTST02.*LOCAL=NO) or any parallel process (ora_p[[:digit:]].{2}_BXTST02).
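
A sketch of this sweep, using the two patterns from above (everything else is illustrative):

#!/bin/bash
# sketch: move all processes matching the cgroup's patterns into its tasks file
GROUP=/sys/fs/cgroup/blkio/ORACLE/BX1
PATTERNS='BXTST02.*LOCAL=NO|ora_p[[:digit:]].{2}_BXTST02'

ps -eo pid,cmd --no-headers | while read -r PID CMD; do
  if [[ $CMD =~ $PATTERNS ]]; then
    # writing a PID to tasks moves that process into the cgroup
    echo "$PID" > "$GROUP/tasks"
  fi
done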

In my configuration (crontab), the disks are added to the cgroup once per hour, whereas processes are added every minute.
This can lead to some delay when disks are added, and every single process can live for up to one minute without any limits.
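
The crontab entries look roughly like this (the script names and arguments are made up, only the schedule matches my setup):

# hourly: resolve the diskgroup's devices and set per-device limits
0 * * * * /usr/local/bin/cgroup_limit_disks.sh BX1
# every minute: sweep for matching processes and add them to tasks
* * * * * /usr/local/bin/cgroup_add_tasks.sh BX1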

But over a longer period it should be quite stable (or at least, the best I can do in the short time given).

The effect is acceptable. In a generic test with SLOB the picture is obvious:
when the processes were added to the cgroup, throughput dropped to the configured 5 MB/s.

Of course, with these limits the average response time (or wait time) goes up.
In another test, where SLOB was in the cgroup all the time, the MEAN response time was 0.027 sec.
But a histogram shows that more than 50% of the calls finish within 10 ms (a reasonable value for the storage system in this test), while the peak between 50 and 100 ms dominates the total response time.
RANGE {min ≤ e < max}    DURATION       %   CALLS      MEAN       MIN       MAX
---------------------  ----------  ------  ------  --------  --------  --------
 1.     0ms      5ms     6.010682    2.0%   2,290  0.002625  0.000252  0.004999
 2.     5ms     10ms    26.712672    9.0%   3,662  0.007295  0.005002  0.009997
 3.    10ms     15ms    12.935713    4.4%   1,090  0.011868  0.010003  0.014998
 4.    15ms     20ms     6.828035    2.3%     398  0.017156  0.015003  0.019980
 5.    20ms     25ms     4.846490    1.6%     218  0.022232  0.020039  0.024902
 6.    25ms     50ms    17.002454    5.7%     471  0.036099  0.025085  0.049976
 7.    50ms    100ms   182.104408   61.6%   2,338  0.077889  0.050053  0.099991
 8.   100ms  1,000ms    39.354627   13.3%     326  0.120720  0.100008  0.410570
 9. 1,000ms       +∞
---------------------  ----------  ------  ------  --------  --------  --------
TOTAL (9)              295.795081  100.0%  10,793  0.027406  0.000252  0.410570

This can also be seen in a graph:

The system is working and stable; probably not perfect, but good enough for its requirements.
There was a discussion whether this should be achieved on the storage layer instead. That would limit every process immediately, but it would also be a much stricter setting. For example, I can exclude the log writer from any cgroup and let it work as fast as possible, whereas IO limits on the storage side would put the log writer and foreground processes under the same limits.

The script can be found on github, but it has some prerequisites and might not work on systems other than my current ones without adaptation. Don't hesitate to contact me if you really want to give it a try 😉

2 comments:

Mikhail Velikikh said…

Hi Martin,

I use something similar on Amazon Cloud to restrict some administrative tasks from consuming all of the EC2 instance's resources.
When it comes to RMAN backups, I utilize the RMAN Rate Limit feature.
However, if I need to perform some other tasks, such as building new indexes on "big" tables or some bulk data movement, and I want to minimize the load on the database, I add my processes under cgroup manually.
Do you use a listener shared across several instances that should have their own limits?
As I see it, once the listener is under a cgroup, all spawned processes will be under this cgroup as well:

[root@oel71db2 ~]# mkdir -p /sys/fs/cgroup/blkio/ORACLE/BX1/
[root@oel71db2 ~]# pgrep -af tns
15 netns
2061 /u01/app/oracle/product/12.1.0/dbhome_1/bin/tnslsnr LISTENER -inherit
[root@oel71db2 ~]# echo 2061 >> /sys/fs/cgroup/blkio/ORACLE/BX1/tasks
[root@oel71db2 ~]# ps -o cgroup,cmd -p 2061
CGROUP CMD
2:blkio:/ORACLE/BX1,1:name= /u01/app/oracle/product/12.1.0/dbhome_1/bin/tnslsnr LISTENER -inherit
[root@oel71db2 ~]# ps -o cgroup,cmd -p 19806
CGROUP CMD
2:blkio:/ORACLE/BX1,1:name= oraclecdb12c (LOCAL=NO)

My disk configuration is pretty much stable, so I do not use cron to recalculate limits when disks are added to or removed from the ASM instance.
To my knowledge, the cgroup IO limits are per device, so in my environment I have to recalculate all existing limits once a new disk is added in order to stay under the instance limit.
For example, say I have an R4.XLarge instance whose I/O limit is around 100MB/s, and 4 ASM disks.
If I want to restrict my processes to 50MB/s to keep the remainder for the application, then I add 4 lines to my blkio.throttle.read_bps_device with 12.5MB/s each.
However, when I add a new disk, the fifth one, I have to recalculate all previous limits to make sure that I consume no more than 50MB/s.
Do you do something similar or when you add a new disk, you just set the limit for that added disk?

Martin Berger said…

Hi Mikhail,
thank you for your great questions.
You are right, for RMAN I prefer its rate limit.
In the current setup my scripts were created for, there is a shared listener for many instances. As I want separate limits per instance, I cannot use the listener to limit its children. That approach would also miss other processes, like parallel processes, which normally are not children or grandchildren of the listener.
To overcome this limitation, the script runs every minute via cron, checks for all processes which match the patterns, and adds them to tasks.
You are also right, IO and MB/s limits are per device. In my script the per-device value is calculated as reduced_limit=$(($4/$devicecount)).
So the limit is recalculated and set properly.
In my case this is done once per hour (disks are not added that often anyhow).
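
With the numbers from your example, that calculation gives (simple shell arithmetic, just for illustration):

echo $(( 50 * 1024 * 1024 / 4 ))   # 13107200 bytes/s = 12.5 MB/s per device
echo $(( 50 * 1024 * 1024 / 5 ))   # 10485760 bytes/s = 10.0 MB/s per device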