Cgroups — Deep Dive into Resource Management in Kubernetes
This is what happens when you set resource requests and limits on your Kubernetes Pods and Deployments

There’s a lot of “magic” that happens behind the scenes to make Kubernetes work. One of those things is resource management and resource allocation, which is handled by Linux cgroups.
In this article, we will take a deep dive into what cgroups are, how Kubernetes uses them to manage Node resources, and how we can take advantage of them beyond setting resource requests and limits on Pods.
What are Cgroups?
First things first: what are cgroups anyway? Control Groups, or cgroups for short, are a Linux kernel feature that takes care of resource allocation (CPU time, memory, network bandwidth, I/O), prioritization, and accounting (that is, how much is the container using?). Additionally, besides being a Linux primitive, cgroups are also a building block of containers, so without cgroups there would be no containers.
As the name implies, cgroups are groups: they organize processes into a parent-child hierarchy, which forms a tree. So if, for example, a parent cgroup is assigned 128Mi of RAM, then the sum of RAM usage of all of its children cannot exceed 128Mi.
This hierarchy lives in /sys/fs/cgroup/, which is the cgroup filesystem (cgroupfs). There you will find sub-trees for all Linux processes. Here we're interested in how cgroups impact scheduling and the resources assigned/allocated to our Kubernetes Pods, so the part we care about is kubepods.slice/:
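(The tree listing from the original article isn't reproduced in this copy; on a Node using cgroup v2 and the systemd cgroup driver it looks roughly like the sketch below, with the Pod UID and container ID being placeholders.)

```
/sys/fs/cgroup/
└── kubepods.slice/
    ├── kubepods-besteffort.slice/
    ├── kubepods-burstable.slice/
    │   └── kubepods-burstable-pod<pod-uid>.slice/
    │       └── crio-<container-id>.scope/
    │           ├── cpu.weight
    │           ├── cpu.max
    │           ├── memory.min
    │           └── memory.max
    └── kubepods-guaranteed.slice/
```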
All the Kubernetes cgroups are located in the kubepods.slice/ subdirectory, which has further kubepods-besteffort.slice/, kubepods-burstable.slice/, and kubepods-guaranteed.slice/ subdirectories, one for each QoS (Quality of Service) class. Under these directories, you will find directories for each Pod and, inside those, further directories for each container.
At each level, there are files such as cpu.weight or cpu.max that specify how much of a particular resource, e.g. CPU, this group can use. For clarity, the above file tree only shows these files at the deepest level.
Finally, here at the leaves of the tree are the files that describe how much memory (memory.min and memory.max), CPU (cpu.weight and cpu.max), or other resources each container gets to work with. These files are a direct translation of the resource requests and limits defined in Pod manifests. However, if you were to look at these files, the values you would find there don't actually seem related to the requests and limits, so what do they mean and how did they get there?
How Does It Work?
Let’s now walk through all the steps to better understand how the Pod requests and limits get translated and propagated all the way to the files in /sys/fs/....
We begin with a simple Pod definition that includes memory and CPU requests/limits:
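The manifest itself isn't shown in this copy of the article; a minimal sketch that matches the values referenced later (the Pod is named webserver, with a 250m CPU request and a 500m CPU limit; the image and memory values are illustrative) could look like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webserver
spec:
  containers:
    - name: webserver
      image: nginx            # illustrative image
      resources:
        requests:
          cpu: "250m"         # referenced later: ends up as cpu.weight = 10
          memory: "64Mi"      # illustrative; only lands in memory.min when Memory QoS is enabled
        limits:
          cpu: "500m"         # referenced later: ends up as cpu.max = 50000 100000
          memory: "128Mi"     # illustrative; ends up as memory.max in bytes
```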
When we create/apply this Pod manifest, the Pod gets assigned to a Node, and the kubelet on the Node takes this PodSpec and passes it to the Container Runtime Interface (CRI), e.g. containerd or CRI-O, which translates it into a lower-level OCI JSON spec that describes the container that will be created:
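The generated spec isn't reproduced here; one way to peek at the relevant parts on the Node is via crictl (the container ID is a placeholder, and the output below is illustrative, pieced together from the values discussed in this article and the example manifest above):

```sh
crictl inspect <container-id> | jq '.info.runtimeSpec.linux | {cgroupsPath, resources}'
# Illustrative output:
# {
#   "cgroupsPath": "kubepods-burstable-pod<pod-uid>.slice:crio:<container-id>",
#   "resources": {
#     "memory": { "limit": 134217728 },                       # 128Mi limit from the example manifest
#     "cpu": { "shares": 256, "quota": 50000, "period": 100000 }
#   }
# }
```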
As you can see from the above, this spec includes cgroupsPath, which is the directory where the cgroup files will be located. It also includes the already translated requests and limits under info.runtimeSpec.linux.resources (we will talk about what these values mean a bit later).
This spec is then passed to the lower-level OCI container runtime, most likely runc, which talks to the systemd driver, which in turn creates a systemd scope unit and also sets the values in the files in cgroupfs.
To first inspect the systemd scope unit:
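The exact commands aren't included in this copy of the article; based on the description that follows, they were along these lines (run on the Node itself; the container name filter and the .scope suffix on the truncated unit name are assumptions):

```sh
crictl ps --name webserver                            # find the container ID of our Pod's container
systemd-cgls --no-pager                               # recursively list control group contents
systemctl show --no-pager 'crio-<container-id>.scope' # show the scope unit's properties
```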
We first find the container ID using crictl ps, which is the CRI equivalent of docker ps. In the output of this command, we see our Pod webserver and the container ID. We then use systemd-cgls, which recursively shows control group contents. In its output, we see the group with our container's ID, which is crio-029d006435420.... Finally, we use systemctl show --no-pager crio-029d006435420..., which gives us the systemd properties that were used to set the values in the cgroup files.
To then inspect the cgroups filesystem itself:
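Again, the original listing isn't included here; assuming the cgroup v2 unified hierarchy and the systemd cgroup driver, the relevant files can be read roughly like this (the Pod slice name is the one from the article; the memory values shown are illustrative, the CPU values are the ones discussed below):

```sh
cd /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6910effd_ea14_4f76_a7de_53c333338acb.slice
head cpu.weight cpu.max memory.min memory.max
# ==> cpu.weight <==
# 10
# ==> cpu.max <==
# 50000 100000
# ==> memory.min <==
# 67108864          (illustrative: 64Mi request, populated only with Memory QoS)
# ==> memory.max <==
# 134217728         (illustrative: 128Mi limit)
```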
We go to the directory kubepods-burstable-pod6910effd_ea14_4f76_a7de_53c333338acb.slice, which was listed in the output of systemd-cgls. This is the cgroup directory for the whole webserver Pod. Here we find the individual cgroup files. The files and values that we mostly care about are cpu.weight, cpu.max, memory.min, and memory.max, as these are the ones that describe the CPU and memory requests/limits of the Pod. But what do the values mean?
- cpu.weight - This is the CPU request. It is converted to the so-called weight (also called "shares"). It's in the range of 1 to 10000 and describes how much CPU the container will get in comparison to other containers. If you had only 2 processes on the system and one had a weight of 2000 and the other 8000, then the former would get 20% and the latter 80% of CPU cycles. In this case, 250m equals 10 "shares", so if we were to run a Pod with a 450m CPU request, it would get 18 "shares" (see the conversion sketch after this list).
- cpu.max - This is the CPU limit. The values in the file indicate that the group may consume up to $MAX in each $PERIOD duration; max for $MAX indicates no limit. In this case: consume 50000 / 100000, therefore at most 0.5 (500m) CPU.
- memory.min - This is the memory request in bytes. It is only set if Memory QoS is enabled in the cluster (explained later).
- memory.max - A memory usage hard limit in bytes. If a cgroup's memory usage reaches this limit and can't be reduced, the OOM killer is invoked in the cgroup and the container gets killed.
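The 250m → 10 and 450m → 18 numbers come from a two-step conversion: the kubelet turns millicores into cgroup v1 style "shares" (millicores × 1024 / 1000), and runc then maps those shares onto the 1-10000 cpu.weight range. A minimal sketch of that arithmetic, assuming the formula runc currently uses:

```sh
# Back-of-the-envelope reproduction of the millicores -> cpu.weight conversion
# (integer arithmetic, mirroring what kubelet and runc do as of this writing):
millicores_to_weight() {
  local millicores=$1
  local shares=$(( millicores * 1024 / 1000 ))    # kubelet: millicores -> v1 "shares"
  echo $(( 1 + (shares - 2) * 9999 / 262142 ))    # runc: v1 shares -> v2 cpu.weight
}
millicores_to_weight 250   # prints 10
millicores_to_weight 450   # prints 18
```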
There are a lot of other files besides these four; however, none of them can currently be set through Pod manifests.
As a side note, if you want to poke around yourself, an alternative/faster way to find these values might be to get the path to the cgroupfs from the container runtime spec mentioned earlier:
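For example (a sketch, assuming CRI-O or containerd exposing the runtime spec under info.runtimeSpec, and jq being available on the Node):

```sh
crictl inspect $(crictl ps -q --name webserver) \
  | jq -r '.info.runtimeSpec.linux.cgroupsPath'
# With the systemd driver this prints a "<parent slice>:<prefix>:<container id>" triple
# that maps onto the directories under /sys/fs/cgroup/ shown above.
```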
Monitoring
Apart from enforcing resource allocation, cgroups are also used for monitoring resource consumption. This is done by the cAdvisor component included in the kubelet. Looking at the cAdvisor metrics also serves as an easier way to view the cgroup file values.
To view cAdvisor metrics you can use:
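The commands aren't included in this copy; based on the description below, they would be roughly as follows (the node name, port, and authentication details depend on your cluster):

```sh
# Directly on the Node, against the kubelet API (needs a token/cert the kubelet accepts):
curl -sk -H "Authorization: Bearer $TOKEN" https://localhost:10250/metrics/cadvisor

# Or from your own machine, through the API server proxy:
kubectl proxy &
curl -s http://localhost:8001/api/v1/nodes/<node-name>/proxy/metrics/cadvisor
```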
If you have access to the cluster Node, then you can get the metrics directly from the kubelet API using the first curl command above. Alternatively, you can use kubectl proxy to get access to the Kubernetes API Server and run curl locally, specifying one of your Nodes in the path.
Regardless of which option you use, you will get a huge list of metrics that will look like this sample.
Some of the more interesting metrics you will find there are:
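The article's original list isn't reproduced in this copy; the metric families most directly tied to the cgroup files above are the container_spec_* and usage series, which you can filter for in the scrape, e.g. (the metric names are standard cAdvisor metrics, though the author's exact list may have differed):

```sh
curl -s http://localhost:8001/api/v1/nodes/<node-name>/proxy/metrics/cadvisor \
  | grep -E 'container_spec_(cpu_shares|cpu_quota|cpu_period|memory_limit_bytes)|container_(cpu_usage_seconds_total|memory_working_set_bytes)' \
  | grep webserver
```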
And as a final summary of the whole propagation and translation from the Pod manifest all the way to cgroupfs, here’s a little table:
Why Should You Care?
With all the newly acquired knowledge about cgroups in Kubernetes, there’s a question: “Why even bother learning this, when Linux and Kubernetes do all the work for us?” Well, a deeper understanding is always beneficial in my opinion, and you never know when you will need this knowledge for debugging. More importantly, though, knowing how it all works makes it possible to implement and take advantage of some advanced features:
- For example, Memory QoS, which was briefly mentioned earlier. Most people don’t know this, but currently (as of Kubernetes v1.26) memory requests in the Pod manifest are not taken into consideration by the container runtime and are therefore effectively ignored. Additionally, there’s no way to throttle memory usage, and when the container reaches the memory limit, it simply gets OOM killed. With the introduction of the Memory QoS feature, which is currently in Alpha, Kubernetes can take advantage of the additional cgroup files memory.min and memory.high to throttle a container instead of straight-up killing it. (Note: The memory.min value in the earlier examples is populated only because Memory QoS was enabled on the cluster; a sample kubelet configuration for enabling it follows this list.)
- Another possible advanced application for cgroups could be a container-aware OOM killer. Let’s say you have a Pod with a logging sidecar: if the Pod reaches the memory usage limit, then the main container in the Pod might get killed because of the memory consumption of the sidecar. With a container-aware OOM killer, we could in theory configure the Pod so that the sidecar is killed first when the memory limit is reached.
- Thanks to cgroups, it’s also possible to run Kubernetes components, such as the kubelet or CRI, in rootless mode (using this Alpha feature), which is great for security.
- And finally, it’s also good to have some knowledge of cgroups if you’re a Java developer, because the JDK looks at the cgroup files to understand how much CPU and memory is available.
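A minimal sketch of enabling Memory QoS, assuming the alpha MemoryQoS feature gate and a Node running cgroup v2 (the exact mechanics may change while the feature is in Alpha):

```yaml
# Fragment of a KubeletConfiguration enabling the alpha Memory QoS feature;
# with it enabled, the kubelet starts populating memory.min / memory.high from memory requests.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
```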
And these are just the tip of the iceberg. There are many more things that cgroups can help us with, and chances are that in the future we will see more resource management features added to Kubernetes, for example for managing other types of resources such as disk throttling, network I/O, or resource pressure (PSI).
Closing Thoughts
While a lot of the things in Kubernetes might seem like magic, when you look closely you will find out that it’s really just a clever use of core Linux components and features, and resource management is no exception as we’ve seen in this article.
Additionally, while cgroups might seem like an implementation detail from the point of view of a cluster operator or user, having an understanding of how they work can be beneficial when troubleshooting difficult issues or when using advanced features like the ones described in the previous section.
Want to Connect?
This article was originally posted at martinheinz.dev.