Cgroups — Deep Dive into Resource Management in Kubernetes

This is what happens when you set resource requests and limits on your Kubernetes Pods and Deployments

Martin Heinz
Published in Better Programming
7 min read · Feb 20, 2023


There’s a lot of “magic” that happens behind the scenes to make Kubernetes work as a whole. One of those pieces is resource management and resource allocation, which is done by Linux cgroups.

In this article, we will take a deep dive into what cgroups are, how Kubernetes uses them to manage Node resources, and how we can take advantage of them beyond setting resource requests and limits on Pods.

What are Cgroups?

First things first: what are cgroups anyway? Control Groups, or cgroups for short, are a Linux kernel feature that takes care of resource allocation (CPU time, memory, network bandwidth, I/O), prioritization, and accounting (i.e., how much is the container using?). Additionally, besides being a Linux primitive, cgroups are also a building block of containers, so without cgroups there would be no containers.

As the name implies, cgroups are groups: they organize processes into a parent-child hierarchy, which forms a tree. So if, for example, a parent cgroup is assigned 128Mi of RAM, then the combined RAM usage of all of its children cannot exceed 128Mi.

This hierarchy lives in /sys/fs/cgroup/, which is the cgroup filesystem (cgroupfs). There you will find sub-trees for all Linux processes. Here we're interested in how cgroups impact scheduling and the resources assigned/allocated to our Kubernetes Pods, so the part we care about is kubepods.slice/:
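A simplified sketch of that sub-tree on a Node running CRI-O with the systemd cgroup driver; Pod UIDs and container IDs are placeholders, and other runtimes use a different scope prefix:

    /sys/fs/cgroup/
    └── kubepods.slice/
        ├── kubepods-besteffort.slice/
        │   └── ...
        ├── kubepods-burstable.slice/
        │   └── kubepods-burstable-pod<pod-uid>.slice/
        │       └── crio-<container-id>.scope/
        │           ├── cpu.weight
        │           ├── cpu.max
        │           ├── memory.min
        │           └── memory.max
        └── kubepods-pod<pod-uid>.slice/      (Guaranteed Pods)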

All the Kubernetes cgroups are located in the kubepods.slice/ subdirectory, which has further kubepods-besteffort.slice/ and kubepods-burstable.slice/ subdirectories for the BestEffort and Burstable QoS (Quality-of-Service) classes, while Guaranteed Pods sit directly under kubepods.slice/. Under these directories, you will find directories for each Pod, and inside those, further directories for each container.

At each level, there are files such as cpu.weight or cpu.max that specify how much of a particular resource, e.g. CPU, this group can use. For clarity, the file tree above only shows these files at the deepest level.

Finally, at the leaves of the tree are the files that describe how much memory (memory.min and memory.max), CPU (cpu.weight and cpu.max), or other resources each container gets to work with. These files are a direct translation of the resource requests and limits defined in Pod manifests. However, if you were to look at these files, the values in them don't seem obviously related to the requests and limits, so what do they mean and how did they get there?

How Does It Work?

Let’s now walk through all the steps to better understand how the Pod requests and limits get translated/propagated all the way to files in /sys/fs/....

We begin with a simple Pod definition that includes memory and CPU requests/limits:
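A minimal sketch of such a manifest; the Pod name and the CPU values match the ones used later in this article, while the image and memory values are only illustrative:

    apiVersion: v1
    kind: Pod
    metadata:
      name: webserver
    spec:
      containers:
        - name: webserver
          image: nginx:1.25       # illustrative image
          resources:
            requests:
              memory: "128Mi"     # illustrative value
              cpu: "250m"         # -> cpu.weight
            limits:
              memory: "256Mi"     # illustrative value
              cpu: "500m"         # -> cpu.max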

When we create/apply this Pod manifest, the Pod gets assigned to a Node, and the kubelet on that Node takes the PodSpec and passes it via the Container Runtime Interface (CRI) to the container runtime, e.g. containerd or CRI-O, which translates it into a lower-level OCI JSON spec describing the container that will be created:
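A heavily trimmed sketch of what crictl inspect shows for such a container; the structure follows the OCI runtime spec, the container ID is a placeholder, and the memory limit of 268435456 bytes corresponds to the illustrative 256Mi above:

    {
      "info": {
        "runtimeSpec": {
          "linux": {
            "cgroupsPath": "kubepods-burstable-pod6910effd_ea14_4f76_a7de_53c333338acb.slice:crio:<container-id>",
            "resources": {
              "memory": {
                "limit": 268435456
              },
              "cpu": {
                "shares": 256,
                "quota": 50000,
                "period": 100000
              }
            }
          }
        }
      }
    }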

As you can see from the above, this spec includes cgroupsPath, which is the directory where the cgroup files will be located. It also includes the already translated requests and limits under info.runtimeSpec.linux.resources (we will talk about what these values mean a bit later).

This spec is then passed to the lower-level OCI container runtime, most likely runc, which talks to the systemd driver, which in turn creates a systemd scope unit and also sets the values in the cgroupfs files.

To first inspect the systemd scope unit:
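For example (the container ID is a placeholder; the crio- scope prefix assumes CRI-O):

    # find the container ID of the webserver Pod's container
    crictl ps --name webserver

    # walk the cgroup tree recursively; our container shows up as a systemd scope
    systemd-cgls --no-pager

    # show the systemd properties behind the cgroup settings of that scope
    systemctl show --no-pager crio-<container-id>.scope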

We first find the container ID using crictl ps, which is the CRI equivalent of docker ps. In the output of this command we see our Pod webserver and the container ID. We then use systemd-cgls, which recursively shows the contents of the control groups. In its output, we see the group with our container's ID, which is crio-029d006435420.... Finally, we use systemctl show --no-pager crio-029d006435420..., which gives us the systemd properties that were used to set the values in the cgroup files.

To then inspect the cgroups filesystem itself:
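For example, with the values the illustrative 128Mi/256Mi memory and 250m/500m CPU requests/limits from earlier would produce (memory.min stays at 0 unless Memory QoS is enabled):

    cd /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6910effd_ea14_4f76_a7de_53c333338acb.slice

    cat cpu.weight    # 10            <- 250m CPU request
    cat cpu.max       # 50000 100000  <- 500m CPU limit
    cat memory.min    # 134217728     <- 128Mi memory request (with Memory QoS)
    cat memory.max    # 268435456     <- 256Mi memory limit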

We go to the directory kubepods-burstable-pod6910effd_ea14_4f76_a7de_53c333338acb.slice, which was listed in the output of systemd-cgls. This is the cgroup directory for the whole webserver Pod. Here we find the individual cgroup files. The files and values we mostly care about are cpu.weight, cpu.max, memory.min and memory.max, as these describe the CPU and memory requests/limits of the Pod. But what do the values mean?

  • cpu.weight - This is the CPU request. It is converted to a so-called weight (also called "shares"). It's in the range of 1 to 10000 and describes how much CPU the container will get in comparison to other containers. If there were only two processes on the system, one with a weight of 2000 and the other with 8000, the former would get 20% and the latter 80% of the CPU cycles. In this case, the 250m request translates to a weight of 10, and a Pod with a 450m CPU request would get a weight of 18 (see the sketch after this list).
  • cpu.max - This is the CPU limit. The two values in the file indicate that the group may consume up to $MAX of CPU time in each $PERIOD duration (both in microseconds); max in place of $MAX indicates no limit. In this case the file reads 50000 100000, therefore at most 0.5 (500m) CPU.
  • memory.min - This is the memory request in bytes. It is only set if Memory QoS is enabled in the cluster (explained later).
  • memory.max - A memory usage hard limit in bytes. If a cgroup's memory usage reaches this limit and can't be reduced, the OOM killer is invoked in the cgroup and the container gets killed.
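As a quick sketch of the CPU request conversion mentioned above: millicores are first turned into cgroup v1 "shares", which are then mapped onto the 1-10000 cpu.weight range (this mirrors the formula runc uses for the cgroup v1-to-v2 conversion):

    millicpu=250                                    # the Pod's CPU request in millicores
    shares=$(( millicpu * 1024 / 1000 ))            # cgroup v1 CPU shares (250m -> 256)
    weight=$(( 1 + (shares - 2) * 9999 / 262142 ))  # cgroup v2 cpu.weight (256 -> 10)
    echo "${millicpu}m -> ${shares} shares -> cpu.weight ${weight}"

Plugging in 450 instead of 250 yields a weight of 18, matching the numbers above.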

There are a lot of other files besides these four; however, none of them can currently be set through Pod manifests.

As a side note, if you want to poke around yourself, an alternative/faster way to find these values is to get the path to the cgroupfs from the container runtime spec mentioned earlier:
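For example, something along these lines (jq is used only for convenience; with the systemd cgroup driver the value is in slice:prefix:name form rather than a literal filesystem path):

    # print the cgroupsPath from the container runtime spec
    crictl inspect <container-id> | jq -r '.info.runtimeSpec.linux.cgroupsPath'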

Monitoring

Apart from enforcing resource allocation, cgroups are also used for monitoring resource consumption. This is done by the cAdvisor component included in the kubelet. Looking at the cAdvisor metrics is also an easier way to view the values in the cgroup files.

To view cAdvisor metrics you can use:
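For example (the token, node IP, and node name are placeholders; the kubelet endpoint requires authorization):

    # directly from the kubelet on the Node
    curl -sk --header "Authorization: Bearer $TOKEN" \
      https://<node-ip>:10250/metrics/cadvisor

    # or through the API Server, from your local machine
    kubectl proxy &
    curl -s http://localhost:8001/api/v1/nodes/<node-name>/proxy/metrics/cadvisor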

If you have access to the cluster Node, then you can get the metrics directly from the kubelet API using the first curl command above. Alternatively, you can use kubectl proxy to get access to the Kubernetes API Server and run curl locally, specifying one of your nodes in the path.

Regardless of which option you use, you will get a huge list of metrics in Prometheus format.

Some of the more interesting metrics you will find there are:
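For illustration, a few widely used cAdvisor series (names as exported; labels trimmed):

    container_cpu_usage_seconds_total          # cumulative CPU time consumed by the container
    container_cpu_cfs_throttled_periods_total  # periods in which the container was throttled by its CPU limit
    container_memory_working_set_bytes         # current working set, i.e., memory in active use
    container_spec_cpu_shares                  # the CPU request translated to shares/weight
    container_spec_memory_limit_bytes          # the memory limit, i.e., the value of memory.max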

And as a final summary of the whole propagation and translation from Pod manifest all the way to cgroupfs, here's a little table:
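Something along these lines (cgroup v2 file names; only the fields discussed in this article):

    Pod manifest field           cgroup v2 file   Meaning
    resources.requests.cpu       cpu.weight       relative CPU weight (1-10000)
    resources.limits.cpu         cpu.max          hard CPU limit per period ($MAX $PERIOD)
    resources.requests.memory    memory.min       guaranteed memory (only with Memory QoS)
    resources.limits.memory      memory.max       hard memory limit (OOM kill when exceeded)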

Why Should You Care?

With all the newly acquired knowledge about cgroups in Kubernetes, there’s a question: why even bother learning this when Linux and Kubernetes do all the work for us? Well, a deeper understanding is always beneficial in my opinion, and you never know when you will need this knowledge for debugging. More importantly, though, knowing how it works makes it possible to implement and take advantage of some advanced features:

  • For example the Memory QoS feature, which was briefly mentioned earlier. Most people don’t know this, but currently (as of Kubernetes v1.26) memory requests in the Pod manifest are not taken into consideration by the container runtime and are therefore effectively ignored. Additionally, there’s no way to throttle memory usage, and when a container reaches its memory limit, it simply gets OOM killed. With the introduction of the Memory QoS feature, which is currently in Alpha, Kubernetes can take advantage of the additional cgroup files memory.min and memory.high to throttle a container instead of straight-up killing it. (Note: The memory.min value in the earlier examples is populated only because Memory QoS was enabled on the cluster.)
  • Another possible advanced application of cgroups could be a container-aware OOM killer. Let’s say you have a Pod with a logging sidecar; if the Pod reaches its memory limit, the main container in the Pod might get killed because of the memory consumption of the sidecar. With a container-aware OOM killer, we could in theory configure the Pod so that the sidecar is killed first when the memory limit is reached.
  • Thanks to cgroups, it’s also possible to run Kubernetes components, such as the kubelet or the container runtime (CRI), in rootless mode (using this Alpha feature), which is great for security.
  • And finally, it’s also good to have some knowledge of cgroups if you’re a Java developer, because the JDK looks at the cgroup files to figure out how much CPU and memory is available.

And these are just the tip of the iceberg. There are many more things that cgroups can help us with, and chances are that in the future we will see more resource management features added to Kubernetes, for example for managing other types of resources such as disk throttling, network I/O, or resource pressure (PSI).

Closing Thoughts

While a lot of the things in Kubernetes might seem like magic, when you look closely you will find out that it’s really just a clever use of core Linux components and features, and resource management is no exception as we’ve seen in this article.

Additionally, while cgroups might seem like an implementation detail from the point of view of a cluster operator or user, having an understanding of how they work can be beneficial when troubleshooting difficult issues or when using advanced features like the ones described in the previous section.

Want to Connect?

This article was originally posted at martinheinz.dev.
