Hack the Container: Understanding Docker's Inner Workings
Peekaboo Kernel: Understanding Namespaces and Cgroups
Not surprisingly, Docker stands on the shoulders of Linux kernel features to provide containerization. Two of the most important features are namespaces and cgroups.
When Docker launches a container, a process is created in an isolated environment using these two features, as follows:
- It relies on Linux namespaces to isolate processes by giving each container its own view of system resources such as process IDs, networking, filesystems, and users. For example, the PID namespace ensures that processes inside a container cannot see or interact with processes outside the container.
- It uses control groups (cgroups) to limit, account for, and prioritize resource usage like CPU, memory, and I/O, ensuring containers can share a host safely and predictably. For example, when we launch a container, we can specify memory and CPU limits to prevent a container from consuming all of the host's resources. Control groups achieve this by organizing processes into hierarchical groups and applying resource limits to them.
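Both mechanisms are visible from any shell through /proc, even before Docker is involved. As a minimal sketch (the helper names show_namespaces and show_cgroup are made up for illustration and only read from /proc), the following prints the namespaces and the cgroup that the current shell belongs to:

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they only read from /proc.

# Print each namespace type and its identifier for a PID (default: this shell).
show_namespaces() {
    pid="${1:-$$}"
    for ns in /proc/"$pid"/ns/*; do
        printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
}

# Print the cgroup membership of a PID (default: this shell).
show_cgroup() {
    cat "/proc/${1:-$$}/cgroup"
}

show_namespaces
show_cgroup
```

Two processes that print the same identifier for a given namespace type share that namespace; a containerized process shows identifiers that differ from the host's. On a cgroups v2 system, the cgroup file contains a single line starting with 0::.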
Let's experiment with these two features to understand how they work. To follow along, you need a Linux machine (virtual or physical) with root access. We are using Ubuntu for this example.
Test if your system is using cgroups v2 by running the following command:
ls /sys/fs/cgroup/cgroup.controllers 2>/dev/null && \
echo "cgroups v2 is enabled" || \
echo "cgroups v2 is not enabled"
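An alternative check, sketched here under the assumption that GNU stat is available, inspects the filesystem type mounted at /sys/fs/cgroup: cgroup2fs indicates the unified (v2) hierarchy, while tmpfs indicates v1 or a hybrid setup. The function name cgroup_mode is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: classify the cgroup setup from the filesystem type.
cgroup_mode() {
    # GNU stat: -f queries the filesystem, %T prints its type as a name.
    fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null) || fstype=""
    case "$fstype" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1 or hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

cgroup_mode
```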
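Most Linux distributions ship with cgroups v2 enabled by default nowadays. If your system is not using cgroups v2, you can enable it by adding systemd.unified_cgroup_hierarchy=1 to your kernel command line and rebooting your machine. The commands below use grubby and grub2-mkconfig, which are found on Red Hat-family distributions; on Ubuntu, add the argument to GRUB_CMDLINE_LINUX in /etc/default/grub and run update-grub instead: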
# NOTE: Only run these commands if cgroups v2 is not enabled
# Add the argument to GRUB configuration
grubby --update-kernel=ALL \
--args="systemd.unified_cgroup_hierarchy=1"
# Regenerate GRUB configuration
grub2-mkconfig -o /boot/grub2/grub.cfg
# Reboot the system
reboot
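After the reboot, it is worth confirming that the kernel actually booted with the new argument. A minimal sketch: the helper has_kernel_arg (a hypothetical name) looks for an exact argument in /proc/cmdline, or in another file passed as an optional second parameter.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the exact argument appears on the
# kernel command line (or in the file given as the second parameter).
has_kernel_arg() {
    arg="$1"
    file="${2:-/proc/cmdline}"
    # Put each space-separated argument on its own line, then match it exactly.
    tr ' ' '\n' < "$file" | grep -Fxq -- "$arg"
}

if has_kernel_arg "systemd.unified_cgroup_hierarchy=1"; then
    echo "kernel argument is present"
else
    echo "kernel argument is missing"
fi
```

On distributions where cgroups v2 is already the default, the argument is absent, and the earlier check against /sys/fs/cgroup is the one to trust.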
With cgroups v2 enabled, let's move to the next step. Create a new execution context using the following command:
unshare --fork --pid --mount-proc bash
ℹ️ The execution context is the environment in which a process runs. It includes the process itself and resources such as memory, file handles, and so on.
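The unshare command disassociates parts of the process execution context. Here, --pid creates a new PID namespace, --fork runs bash as a child of unshare so that it becomes PID 1 in that namespace, and --mount-proc mounts a fresh /proc (in a new mount namespace) so that tools like ps reflect the new namespace.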
ℹ️ unshare() allows a process or a thread to disassociate parts of its execution context that are currently being shared with other processes/threads. Part of the execution context, such as the mount namespace, is shared when a new process/thread is created using fork() or vfork(), while other parts, such as virtual memory, may be shared by explicit request when creating a process or thread using clone().
You can now view the list of processes running in the current namespace using the following command:
ps aux
You should see only two processes: the bash shell you just created and the ps command itself. This is because we created a new PID namespace, and in this namespace, there are no other processes running.
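To confirm that the shell really lives in a new PID namespace, compare namespace identifiers. In this sketch, pid_ns_id is a hypothetical helper that reads the identifier from /proc.

```shell
#!/bin/sh
# Hypothetical helper: print the PID namespace identifier of a process
# (default: the calling process).
pid_ns_id() {
    readlink "/proc/${1:-self}/ns/pid"
}

echo "shell PID: $$"
echo "PID namespace: $(pid_ns_id)"
```

Run it once on the host and once inside the shell started by unshare. Inside, the shell PID is 1 and the namespace identifier differs from the host's.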
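Now, let's create a new cgroup from this shell using the following command. The cgroup hierarchy under /sys/fs/cgroup is not part of the PID namespace, so the new group is also visible from the host: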
# Create a new cgroup directory
mkdir -p "/sys/fs/cgroup/mygroup"
# List the cgroup files
ls "/sys/fs/cgroup/mygroup"
You should see files related to the memory and CPU controllers, among other files. These files are used to configure and monitor the cgroup.
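Which controller files appear depends on the controllers the parent cgroup delegates to its children, listed in cgroup.subtree_control. The sketch below checks for a controller there; controller_enabled is a hypothetical helper, and its optional second argument exists only so it can be pointed at another file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a controller is listed in the parent's
# cgroup.subtree_control (or in the file given as the second parameter).
controller_enabled() {
    name="$1"
    file="${2:-/sys/fs/cgroup/cgroup.subtree_control}"
    [ -r "$file" ] || return 1
    # One controller name per line, then match the name exactly.
    tr ' ' '\n' < "$file" | grep -Fxq -- "$name"
}

for c in memory cpu; do
    if controller_enabled "$c"; then
        echo "$c: enabled for child cgroups"
    else
        echo "$c: not enabled for child cgroups"
    fi
done
```

If a controller is missing, root can enable it with echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control, after which the corresponding files (such as cpu.max) appear in mygroup.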
The next step is to define a limit for the memory and CPU controllers. Let's set a very low limit for both controllers to see how they work:
# Set memory limit to 50 MiB (the value is written in bytes)
echo $((50*1024*1024)) > "/sys/fs/cgroup/mygroup/memory.max"
# Disable swap memory for this cgroup
# optional, but makes your test decisive
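The memory limit is written in bytes, so 50*1024*1024 corresponds to 50 MiB. As a small sketch, mib_to_bytes (a hypothetical helper) does the conversion, and the limit can be read back to confirm the kernel accepted it:

```shell
#!/bin/sh
# Hypothetical helper: convert a size in MiB to bytes.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

limit_file="/sys/fs/cgroup/mygroup/memory.max"
echo "expected: $(mib_to_bytes 50)"
if [ -r "$limit_file" ]; then
    echo "actual:   $(cat "$limit_file")"
else
    echo "actual:   $limit_file is not readable (was the cgroup created?)"
fi
```

A value of max in memory.max means that no limit is set for the cgroup.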