Using Linux Control Groups to Constrain Process Memory

This post originally appeared on the Rittman Mead blog.

Linux Control Groups (cgroups) are a nifty way to limit the amount of a resource, such as CPU, memory, or I/O throughput, that a process or group of processes may use. Frits Hoogland wrote a great blog post demonstrating how to use them to constrain the I/O a particular process could use, and it was the inspiration for this one. I have been digging into the performance characteristics of OBIEE in certain conditions, including how it behaves under memory pressure. I’ll write more about that in a future post, but wanted to write this short one to demonstrate how cgroups can be used to constrain the memory that a given Linux process can be allocated.

This was done on Amazon EC2 running an image imported originally from Oracle’s OBIEE SampleApp, built on Oracle Linux 6.5.

$ uname -a
Linux demo.us.oracle.com 2.6.32-431.5.1.el6.x86_64 #1 SMP Tue Feb 11 11:09:04 PST 2014 x86_64 x86_64 x86_64 GNU/Linux

First off, install the package needed to use cgroups, and start the service. Throughout this post, shell commands prefixed with # are run as root and those prefixed with $ as a non-root user:

# yum install libcgroup
# service cgconfig start

Create a cgroup (I’m shamelessly ripping off Frits’ code here, hence the same cgroup name ;-) ):

# cgcreate -g memory:/myGroup

You can use cgget to view the current limits, usage, & high watermarks of the cgroup:

# cgget -g memory:/myGroup|grep bytes
memory.memsw.limit_in_bytes: 9223372036854775807
memory.memsw.max_usage_in_bytes: 0
memory.memsw.usage_in_bytes: 0
memory.soft_limit_in_bytes: 9223372036854775807
memory.limit_in_bytes: 9223372036854775807
memory.max_usage_in_bytes: 0
memory.usage_in_bytes: 0

For more information about what each field means, see the doc here.
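The value 9223372036854775807 shown for the limits above is 2^63 - 1, the largest signed 64-bit integer, which in practice means that no limit has been set yet. As a small sketch, a helper (here called describe_limit, a made-up name that is not part of libcgroup) could make that explicit when reading the numbers:

```shell
# Value reported by cgget above when no limit has been set (2^63 - 1).
UNLIMITED=9223372036854775807

# describe_limit BYTES: print "unlimited" for the sentinel value,
# otherwise echo the limit back in bytes.
# Hypothetical helper for illustration; compares as a string to avoid
# any integer overflow in the shell.
describe_limit() {
    if [ "$1" = "$UNLIMITED" ]; then
        echo "unlimited"
    else
        echo "$1 bytes"
    fi
}

describe_limit 9223372036854775807
describe_limit 104857600
```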

To test out the cgroup ability to limit memory used by a process we’re going to use the tool stress, which can be used to generate CPU, memory, or IO load on a server. It’s great for testing what happens to a server under resource pressure, and also for testing memory allocation capabilities of a process which is what we’re using it for here.

We’re going to configure cgroups to add stress to the myGroup group whenever it runs:

$ cat /etc/cgrules.conf
*:stress memory myGroup
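
For reference, a rule in cgrules.conf is read as a user (optionally followed by a colon and a process name), then the controllers, then the destination cgroup. The sketch below only splits the rule above into those parts so the meaning of each field is visible; explain_cgrule is a hypothetical helper, not a libcgroup tool:

```shell
# explain_cgrule RULE: split a cgrules.conf-style line into its fields.
# Assumed field layout: <user>[:<process name>] <controllers> <destination>
explain_cgrule() {
    printf '%s\n' "$1" | {
        # read does word splitting without glob expansion, so "*" is safe here.
        read -r who controllers destination
        user=${who%%:*}
        case $who in
            *:*) process=${who#*:} ;;
            *)   process="(any)" ;;
        esac
        echo "user=$user process=$process controllers=$controllers destination=$destination"
    }
}

explain_cgrule "*:stress memory myGroup"
```

Read this way, the rule says: for any user (*), a process named stress is placed in myGroup under the memory controller.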

[Re-]start the cg rules engine service:

# service cgred restart

Now we’ll use the watch command to re-issue the cgget command every second, enabling us to watch the cgroup’s metrics in real time:

# watch --interval 1 cgget -g memory:/myGroup
/myGroup:
memory.memsw.failcnt: 0
memory.memsw.limit_in_bytes: 9223372036854775807
memory.memsw.max_usage_in_bytes: 0
memory.memsw.usage_in_bytes: 0
memory.oom_control: oom_kill_disable 0
        under_oom 0
memory.move_charge_at_immigrate: 0
memory.swappiness: 60
memory.use_hierarchy: 0
memory.stat: cache 0
        rss 0
        mapped_file 0
        pgpgin 0
        pgpgout 0
        swap 0
        inactive_anon 0
        active_anon 0
        inactive_file 0
        active_file 0
        unevictable 0
        hierarchical_memory_limit 9223372036854775807
        hierarchical_memsw_limit 9223372036854775807
        total_cache 0
        total_rss 0
        total_mapped_file 0
        total_pgpgin 0
        total_pgpgout 0
        total_swap 0
        total_inactive_anon 0
        total_active_anon 0
        total_inactive_file 0
        total_active_file 0
        total_unevictable 0
memory.failcnt: 0
memory.soft_limit_in_bytes: 9223372036854775807
memory.limit_in_bytes: 9223372036854775807
memory.max_usage_in_bytes: 0
memory.usage_in_bytes: 0

In a separate terminal (or even better, use screen!) run stress, telling it to grab 150MB of memory:

$ stress --vm-bytes 150M --vm-keep -m 1

Review the cgroup, and note that the usage fields have increased:

/myGroup:
memory.memsw.failcnt: 0
memory.memsw.limit_in_bytes: 9223372036854775807
memory.memsw.max_usage_in_bytes: 157548544
memory.memsw.usage_in_bytes: 157548544
memory.oom_control: oom_kill_disable 0
        under_oom 0
memory.move_charge_at_immigrate: 0
memory.swappiness: 60
memory.use_hierarchy: 0
memory.stat: cache 0
        rss 157343744
        mapped_file 0
        pgpgin 38414
        pgpgout 0
        swap 0
        inactive_anon 0
        active_anon 157343744
        inactive_file 0
        active_file 0
        unevictable 0
        hierarchical_memory_limit 9223372036854775807
        hierarchical_memsw_limit 9223372036854775807
        total_cache 0
        total_rss 157343744
        total_mapped_file 0
        total_pgpgin 38414
        total_pgpgout 0
        total_swap 0
        total_inactive_anon 0
        total_active_anon 157343744
        total_inactive_file 0
        total_active_file 0
        total_unevictable 0
memory.failcnt: 0
memory.soft_limit_in_bytes: 9223372036854775807
memory.limit_in_bytes: 9223372036854775807
memory.max_usage_in_bytes: 157548544
memory.usage_in_bytes: 157548544

Both memory.memsw.usage_in_bytes and memory.usage_in_bytes are 157548544 bytes, which is 150.25MB.
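
The conversion is simply bytes divided by 1024 * 1024. A minimal sketch, assuming awk is available (bytes_to_mb is an illustrative name):

```shell
# bytes_to_mb BYTES: convert a byte count to MB (1 MB = 1048576 bytes),
# printed with two decimal places.
bytes_to_mb() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

bytes_to_mb 157548544   # prints 150.25
```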

Having a look at the process stats for stress shows us:

$ ps -ef|grep stress
oracle   15296  9023  0 11:57 pts/12   00:00:00 stress --vm-bytes 150M --vm-keep -m 1
oracle   15297 15296 96 11:57 pts/12   00:06:23 stress --vm-bytes 150M --vm-keep -m 1
oracle   20365 29403  0 12:04 pts/10   00:00:00 grep stress

$ cat /proc/15297/status

Name:   stress
State:  R (running)
[...]
VmPeak:   160124 kB
VmSize:   160124 kB
VmLck:         0 kB
VmHWM:    153860 kB
VmRSS:    153860 kB
VmData:   153652 kB
VmStk:        92 kB
VmExe:        20 kB
VmLib:      2232 kB
VmPTE:       328 kB
VmSwap:        0 kB
[...]

The man page for proc gives us more information about these fields, but of particular note are:

VmSize: Virtual memory size.
VmRSS: Resident set size.
VmSwap: Swapped-out virtual memory size by anonymous private pages

Our stress process has a VmSize of 156MB, VmRSS of 150MB, and zero swap.
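
To pull out just those three fields in MB, something like the following could be used. It is a sketch that reads status-style text on standard input; vm_summary is a made-up helper name and the sample input is copied from the output above:

```shell
# vm_summary: read /proc/<pid>/status-style text on stdin and print
# VmSize, VmRSS and VmSwap in whole MB (values in the file are in kB).
vm_summary() {
    awk '/^Vm(Size|RSS|Swap):/ { printf "%s %d MB\n", $1, int($2 / 1024) }'
}

vm_summary <<'EOF'
Name:   stress
VmPeak:   160124 kB
VmSize:   160124 kB
VmRSS:    153860 kB
VmSwap:        0 kB
EOF
```

Against a live process the same function could be fed from /proc/PID/status instead of the sample text.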

Kill the stress process, and set a memory limit of 100MB for any process in this cgroup:

# cgset -r memory.limit_in_bytes=100m myGroup

Run cgget and you should see the new limit. Note that at this stage we’re just setting memory.limit_in_bytes and leaving memory.memsw.limit_in_bytes unchanged.

# cgget -g memory:/myGroup|grep limit|grep bytes
memory.memsw.limit_in_bytes: 9223372036854775807
memory.soft_limit_in_bytes: 9223372036854775807
memory.limit_in_bytes: 104857600
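
The 100m passed to cgset shows up as 104857600 because the suffix is treated as a power-of-1024 multiplier (100 * 1024 * 1024). The sketch below reproduces that arithmetic; size_to_bytes is an illustrative helper, and the handling of the k and g suffixes is an assumption based on the same convention:

```shell
# size_to_bytes SIZE: expand a size such as 100m into bytes,
# treating k/m/g (either case) as powers of 1024.
size_to_bytes() {
    num=${1%[kKmMgG]}    # strip a trailing suffix, if any
    case $1 in
        *[kK]) echo $((num * 1024)) ;;
        *[mM]) echo $((num * 1024 * 1024)) ;;
        *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}

size_to_bytes 100m   # prints 104857600
```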

Let’s see what happens when we try to allocate increasing amounts of memory, observing the cgroup usage and the process’s virtual memory information at each point:

15MB:

$ stress --vm-bytes 15M --vm-keep -m 1
stress: info: [31942] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

# cgget -g memory:/myGroup|grep usage|grep -v max
memory.memsw.usage_in_bytes: 15990784
memory.usage_in_bytes: 15990784

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm
VmPeak:    21884 kB
VmSize:    21884 kB
VmLck:         0 kB
VmHWM:     15616 kB
VmRSS:     15616 kB
VmData:    15412 kB
VmStk:        92 kB
VmExe:        20 kB
VmLib:      2232 kB
VmPTE:        60 kB
VmSwap:        0 kB

50MB:

$ stress --vm-bytes 50M --vm-keep -m 1
stress: info: [32419] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

# cgget -g memory:/myGroup|grep usage|grep -v max
memory.memsw.usage_in_bytes: 52748288
memory.usage_in_bytes: 52748288

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm
VmPeak:    57724 kB
VmSize:    57724 kB
VmLck:         0 kB
VmHWM:     51456 kB
VmRSS:     51456 kB
VmData:    51252 kB
VmStk:        92 kB
VmExe:        20 kB
VmLib:      2232 kB
VmPTE:       128 kB
VmSwap:        0 kB

100MB:

$ stress --vm-bytes 100M --vm-keep -m 1
stress: info: [20379] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

# cgget -g memory:/myGroup|grep usage|grep -v max
memory.memsw.usage_in_bytes: 105197568
memory.usage_in_bytes: 104738816

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm
VmPeak:   108924 kB
VmSize:   108924 kB
VmLck:         0 kB
VmHWM:    102588 kB
VmRSS:    101448 kB
VmData:   102452 kB
VmStk:        92 kB
VmExe:        20 kB
VmLib:      2232 kB
VmPTE:       232 kB
VmSwap:     1212 kB

Note that VmSwap has now gone above zero, despite the machine having plenty of usable memory:

# vmstat -s
     16330912  total memory
     14849864  used memory
     10583040  active memory
      3410892  inactive memory
      1481048  free memory
       149416  buffer memory
      8204108  swap cache
      6143992  total swap
      1212184  used swap
      4931808  free swap

So it looks like the memory cap has kicked in and the stress process is being forced to get the additional memory that it needs from swap.

Let’s tighten the screw a bit further:

$ stress --vm-bytes 200M --vm-keep -m 1
stress: info: [21945] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

The process is now using 100MB of swap (since we’ve asked it to grab 200MB but the cgroup is constraining it to 100MB of real memory):

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm
VmPeak:   211324 kB
VmSize:   211324 kB
VmLck:         0 kB
VmHWM:    102616 kB
VmRSS:    102600 kB
VmData:   204852 kB
VmStk:        92 kB
VmExe:        20 kB
VmLib:      2232 kB
VmPTE:       432 kB
VmSwap:   102460 kB

The cgget command confirms that we’re using swap, as the memsw value shows:

# cgget -g memory:/myGroup|grep usage|grep -v max
memory.memsw.usage_in_bytes: 209788928
memory.usage_in_bytes: 104759296
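
Since memory.memsw.usage_in_bytes counts memory plus swap, subtracting memory.usage_in_bytes from it gives a rough figure for the swap charged to the cgroup. A sketch of that subtraction over cgget-style output (cgroup_swap_bytes is a hypothetical helper, fed here with the two lines above):

```shell
# cgroup_swap_bytes: read "name: value" lines on stdin (as printed by cgget)
# and print memsw usage minus memory usage, i.e. approximate swap in bytes.
cgroup_swap_bytes() {
    awk -F': ' '
        $1 == "memory.memsw.usage_in_bytes" { memsw = $2 }
        $1 == "memory.usage_in_bytes"       { mem = $2 }
        END { printf "%d\n", memsw - mem }
    '
}

cgroup_swap_bytes <<'EOF'
memory.memsw.usage_in_bytes: 209788928
memory.usage_in_bytes: 104759296
EOF
```

For the figures above this comes to 105029632 bytes, roughly 100MB, which is in line with the VmSwap value reported for the process.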

So now what happens if we curtail the use of all memory, including swap? To do this we’ll set the memory.memsw.limit_in_bytes parameter. Note that if cgset is run whilst a task in the cgroup is executing, the change seems to be ignored when the new limit is below the amount currently in use (per the usage_in_bytes field). If the new limit is above the amount in use then the change takes effect immediately:

Current state:

# cgget -g memory:/myGroup|grep bytes
memory.memsw.limit_in_bytes: 9223372036854775807
memory.memsw.max_usage_in_bytes: 209915904
memory.memsw.usage_in_bytes: 209784832
memory.soft_limit_in_bytes: 9223372036854775807
memory.limit_in_bytes: 104857600
memory.max_usage_in_bytes: 104857600
memory.usage_in_bytes: 104775680

Set the limit below what is currently in use (150m limit vs 200m in use):
```
# cgset -r memory.memsw.limit_in_bytes=150m myGroup
```

Check the limit; it remains unchanged:

# cgget -g memory:/myGroup|grep bytes
memory.memsw.limit_in_bytes: 9223372036854775807
memory.memsw.max_usage_in_bytes: 209993728
memory.memsw.usage_in_bytes: 209784832
memory.soft_limit_in_bytes: 9223372036854775807
memory.limit_in_bytes: 104857600
memory.max_usage_in_bytes: 104857600
memory.usage_in_bytes: 104751104

Set the limit above what is currently in use (250m limit vs 200m in use):
```
# cgset -r memory.memsw.limit_in_bytes=250m myGroup
```

Check the limit; it has taken effect:

# cgget -g memory:/myGroup|grep bytes
memory.memsw.limit_in_bytes: 262144000
memory.memsw.max_usage_in_bytes: 210006016
memory.memsw.usage_in_bytes: 209846272
memory.soft_limit_in_bytes: 9223372036854775807
memory.limit_in_bytes: 104857600
memory.max_usage_in_bytes: 104857600
memory.usage_in_bytes: 104816640

So now we’ve got limits in place of 100MB real memory and 250MB total (real + swap). What happens when we test that out?

$ stress --vm-bytes 245M --vm-keep -m 1
stress: info: [25927] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

The process is using 245MB total (VmData), of which 95MB is resident (VmRSS) and 150MB is swapped out (VmSwap):

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm
VmPeak:   257404 kB
VmSize:   257404 kB
VmLck:         0 kB
VmHWM:    102548 kB
VmRSS:     97280 kB
VmData:   250932 kB
VmStk:        92 kB
VmExe:        20 kB
VmLib:      2232 kB
VmPTE:       520 kB
VmSwap:   153860 kB

The cgroup stats reflect this:

# cgget -g memory:/myGroup|grep bytes
memory.memsw.limit_in_bytes: 262144000
memory.memsw.max_usage_in_bytes: 257159168
memory.memsw.usage_in_bytes: 257007616
[...]
memory.limit_in_bytes: 104857600
memory.max_usage_in_bytes: 104857600
memory.usage_in_bytes: 104849408

If we try to go above this absolute limit (memory.memsw.limit_in_bytes) then the cgroup kicks in and stops the process getting the memory. The worker is killed with signal 9, which in turn causes stress to fail:

$ stress --vm-bytes 250M --vm-keep -m 1
stress: info: [27356] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [27356] (415) <-- worker 27357 got signal 9
stress: WARN: [27356] (417) now reaping child worker processes
stress: FAIL: [27356] (451) failed run completed in 3s
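
A rough back-of-the-envelope check shows why 250M fails where 245M succeeded. The numbers below are taken from the cgget output earlier in the post; the idea that the small overhead seen at 245M stays the same at 250M is an assumption made for illustration:

```shell
limit=262144000            # memory.memsw.limit_in_bytes (250m)
usage_at_245m=257007616    # memory.memsw.usage_in_bytes seen with --vm-bytes 245M

request_245m=$((245 * 1024 * 1024))
request_250m=$((250 * 1024 * 1024))

# Whatever the cgroup was charged beyond the 245M that stress asked for.
overhead=$((usage_at_245m - request_245m))

# Assumption: the same overhead applies to a 250M request.
projected=$((request_250m + overhead))

echo "overhead at 245M:        $overhead bytes"
echo "projected usage at 250M: $projected bytes"
echo "over the limit by:       $((projected - limit)) bytes"
```

The 250M request on its own already equals the 262144000 byte limit, so any overhead at all pushes the total past it.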

This gives you an indication of how careful you need to be using this type of low-level process control. Most tools will not be happy if they are starved of resource, including memory, and may well behave in unstable ways.

Thanks to Frits Hoogland for reading a draft of this post and providing valuable feedback.
