This is a write-up of a talk I gave at CAT BarCamp, an awesome unconference at Portland State University. The talk started with the self-imposed challenge “give an intro to containers without Docker or rkt.”
Often thought of as cheap VMs, containers are just isolated groups of processes running on a single host. That isolation leverages several underlying technologies built into the Linux kernel: namespaces, cgroups, chroots and lots of terms you’ve probably heard before.
So, let’s have a little fun and use those underlying technologies to build our own containers.
On today’s agenda:
- Container file systems
- chroot
- Creating namespaces with unshare
- Entering namespaces with nsenter
- Getting around chroot with mounts
- cgroups
- Container security and capabilities
Container file systems
Container images, the thing you download from the internet, are literally just tarballs (or tarballs in tarballs if you’re fancy). The least magical part of a container is the files you interact with.
For this post I’ve built a simple tarball by stripping down a Docker image. The tarball holds something that looks like a Debian file system and will be our playground for isolating processes.
$ wget https://github.com/ericchiang/containers-from-scratch/releases/download/v0.1.0/rootfs.tar.gz
$ sha256sum rootfs.tar.gz
c79bfb46b9cf842055761a49161831aee8f4e667ad9e84ab57ab324a49bc828c rootfs.tar.gz
First, explode the tarball and poke around.
$ # tar needs sudo to create /dev files and setup file ownership
$ sudo tar -zxf rootfs.tar.gz
$ ls rootfs
bin dev home lib64 mnt proc run srv tmp var
boot etc lib media opt root sbin sys usr
$ ls -al rootfs/bin/ls
-rwxr-xr-x. 1 root root 118280 Mar 14 2015 rootfs/bin/ls
The resulting directory looks an awful lot like a Linux system. There’s a bin directory with executables, an etc with system configuration, a lib with shared libraries, and so on.
Actually building this tarball is an interesting topic, but one we’ll be glossing over here. For an overview, I’d strongly recommend the excellent talk “Minimal Containers” by my coworker Brian Redbeard.
chroot
The first tool we’ll be working with is chroot. A thin wrapper around the similarly named syscall, it allows us to restrict a process’ view of the file system. In this case, we’ll restrict our process to the “rootfs” directory then exec a shell.
Once we’re in there we can poke around, run commands, and do typical shell things.
$ sudo chroot rootfs /bin/bash
root@localhost:/# ls /
bin dev home lib64 mnt proc run srv tmp var
boot etc lib media opt root sbin sys usr
root@localhost:/# which python
/usr/bin/python
root@localhost:/# /usr/bin/python -c 'print "Hello, container world!"'
Hello, container world!
root@localhost:/#
It’s worth noting that this works because of all the things baked into the tarball. When we execute the Python interpreter, we’re executing rootfs/usr/bin/python, not the host’s Python. That interpreter depends on shared libraries and device files that have been intentionally included in the archive.
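To see just how thin that wrapper is, here’s a rough Go sketch of the same idea: call the chroot syscall, change into the new root, and exec a shell. It assumes it’s run as root from the directory containing the unpacked rootfs; it’s an illustration of the syscall, not the chroot tool itself.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Restrict this process' view of the file system to ./rootfs.
	if err := syscall.Chroot("rootfs"); err != nil {
		fmt.Fprintln(os.Stderr, "chroot:", err)
		os.Exit(1)
	}
	// Make sure our working directory is inside the new root.
	if err := os.Chdir("/"); err != nil {
		fmt.Fprintln(os.Stderr, "chdir:", err)
		os.Exit(1)
	}
	// Exec a shell that exists inside the tarball.
	cmd := exec.Command("/bin/bash")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}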
Speaking of applications, instead of a shell we can run one in our chroot.
$ sudo chroot rootfs python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
If you’re following along at home, you’ll be able to view everything the file server can see at http://localhost:8000/.
Creating namespaces with unshare
How isolated is this chrooted process? Let’s run a command on the host in another terminal.
$ # outside of the chroot
$ top
Sure enough, we can see the top invocation from inside the chroot.
$ sudo chroot rootfs /bin/bash
root@localhost:/# mount -t proc proc /proc
root@localhost:/# ps aux | grep top
1000 24753 0.1 0.0 156636 4404 ? S+ 22:28 0:00 top
root 24764 0.0 0.0 11132 948 ? S+ 22:29 0:00 grep top
Better yet, our chrooted shell is running as root, so it has no problem killing the top process.
root@localhost:/# pkill top
So much for containment.
This is where we get to talk about namespaces. Namespaces allow us to create restricted views of systems like the process tree, network interfaces, and mounts.
Creating a namespace is super easy: it just takes a single syscall with one argument, unshare. The unshare command line tool gives us a nice wrapper around this syscall and lets us set up namespaces manually. In this case, we’ll create a PID namespace for the shell, then execute the chroot like in the last example.
$ sudo unshare -p -f --mount-proc=$PWD/rootfs/proc \
chroot rootfs /bin/bash
root@localhost:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 20268 3240 ? S 22:34 0:00 /bin/bash
root 2 0.0 0.0 17504 2096 ? R+ 22:34 0:00 ps aux
root@localhost:/#
Having created a new process namespace, we’ll notice something a bit funny when poking around our chroot. Our shell thinks its PID is 1?! What’s more, we can’t see the host’s process tree anymore.
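For the curious, here’s roughly what unshare -p -f is doing for us, sketched in Go: the child process is cloned with CLONE_NEWPID, so it starts life in a fresh PID namespace. (The /proc remount and the chroot are left out to keep the sketch short, and it needs to run as root.)

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Start a shell in a new PID namespace. Inside it, the shell
	// sees itself as PID 1, just like the unshare -p -f example.
	cmd := exec.Command("/bin/bash")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWPID,
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}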
Entering namespaces with nsenter
A powerful aspect of namespaces is their composability; processes may choose to separate some namespaces but share others. For instance, it may be useful for two programs to have isolated PID namespaces, but share a network namespace (e.g. Kubernetes pods). This brings us to the setns syscall and the nsenter command line tool.
Let’s find the shell running in a chroot from our last example.
$ # From the host, not the chroot.
$ ps aux | grep /bin/bash | grep root
...
root 29840 0.0 0.0 20272 3064 pts/5 S+ 17:25 0:00 /bin/bash
The kernel exposes namespaces under /proc/(PID)/ns as files. In this case, /proc/29840/ns/pid is the process namespace we’re hoping to join.
$ sudo ls -l /proc/29840/ns
total 0
lrwxrwxrwx. 1 root root 0 Oct 15 17:31 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx. 1 root root 0 Oct 15 17:31 mnt -> 'mnt:[4026532434]'
lrwxrwxrwx. 1 root root 0 Oct 15 17:31 net -> 'net:[4026531969]'
lrwxrwxrwx. 1 root root 0 Oct 15 17:31 pid -> 'pid:[4026532446]'
lrwxrwxrwx. 1 root root 0 Oct 15 17:31 user -> 'user:[4026531837]'
lrwxrwxrwx. 1 root root 0 Oct 15 17:31 uts -> 'uts:[4026531838]'
The nsenter command provides a wrapper around setns to enter a namespace. We’ll provide the namespace file, then run unshare to remount /proc and chroot to set up a chroot. This time, instead of creating a new namespace, our shell will join the existing one.
$ sudo nsenter --pid=/proc/29840/ns/pid \
unshare -f --mount-proc=$PWD/rootfs/proc \
chroot rootfs /bin/bash
root@localhost:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 20272 3064 ? S+ 00:25 0:00 /bin/bash
root 5 0.0 0.0 20276 3248 ? S 00:29 0:00 /bin/bash
root 6 0.0 0.0 17504 1984 ? R+ 00:30 0:00 ps aux
Having entered the namespace successfully, when we run ps in the second shell (PID 5) we see the first shell (PID 1).
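Under the hood, nsenter opens the namespace file and calls setns before exec’ing the target command. Here’s a hedged Go sketch of that flow using golang.org/x/sys/unix; the /proc/29840/ns/pid path is just the PID from the example above, so substitute your own, and run it as root.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	// setns applies to the calling thread, so pin this goroutine to one OS thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// The PID here comes from the example above; substitute your own.
	f, err := os.Open("/proc/29840/ns/pid")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// Join the target PID namespace. This doesn't move the current process;
	// it means children created from here on are born inside that namespace.
	if err := unix.Setns(int(f.Fd()), unix.CLONE_NEWPID); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Any child we start now lands in the joined namespace.
	cmd := exec.Command("/bin/bash")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}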
Getting around chroot with mounts
When deploying an “immutable” container, it often becomes important to inject files or directories into the chroot, either for storage or configuration. For this example, we’ll create some files on the host, then expose them read-only to the chrooted shell using mount.
First, let’s make a new directory to mount into the chroot and create a file there.
$ mkdir readonlyfiles
$ echo "hello" > readonlyfiles/hi.txt
Next, we’ll create a target directory in our container and bind mount the directory, providing the -o ro argument to make it read-only. If you’ve never seen a bind mount before, think of this like a symlink on steroids.
$ sudo mkdir -p rootfs/var/readonlyfiles
$ sudo mount --bind -o ro $PWD/readonlyfiles $PWD/rootfs/var/readonlyfiles
The chrooted process can now see the mounted files.
$ sudo chroot rootfs /bin/bash
root@localhost:/# cat /var/readonlyfiles/hi.txt
hello
However, it can’t write to them.
root@localhost:/# echo "bye" > /var/readonlyfiles/hi.txt
bash: /var/readonlyfiles/hi.txt: Read-only file system
Though this is a pretty basic example, it can quickly be expanded to things like NFS or in-memory file systems by switching the arguments to mount.
Use umount to remove the bind mount (rm won’t work).
$ sudo umount $PWD/rootfs/var/readonlyfiles
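For reference, here’s approximately what that read-only bind mount does at the syscall level, sketched in Go with golang.org/x/sys/unix. mount(8) performs it in two steps: a plain bind mount, then a remount that flips on the read-only flag. The paths below are the same ones used in the example, and the program needs to run as root.

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	src := "readonlyfiles"            // directory on the host
	dst := "rootfs/var/readonlyfiles" // target inside the chroot

	// Step 1: create a plain bind mount of src onto dst.
	if err := unix.Mount(src, dst, "", unix.MS_BIND, ""); err != nil {
		fmt.Fprintln(os.Stderr, "bind mount:", err)
		os.Exit(1)
	}
	// Step 2: remount the bind mount read-only (the equivalent of -o ro).
	flags := uintptr(unix.MS_BIND | unix.MS_REMOUNT | unix.MS_RDONLY)
	if err := unix.Mount("", dst, "", flags, ""); err != nil {
		fmt.Fprintln(os.Stderr, "remount read-only:", err)
		os.Exit(1)
	}
	fmt.Println("mounted", src, "read-only at", dst)
}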
cgroups
cgroups, short for control groups, allow kernel-imposed isolation on resources like memory and CPU. After all, what’s the point of isolating processes if they can still kill their neighbors by hogging RAM?
The kernel exposes cgroups through the /sys/fs/cgroup directory. If your machine doesn’t have one, you may have to mount the memory cgroup to follow along.
$ ls /sys/fs/cgroup/
blkio cpuacct cpuset freezer memory net_cls,net_prio perf_event systemd
cpu cpu,cpuacct devices hugetlb net_cls net_prio pids
For this example we’ll create a cgroup to restrict the memory of a process. Creating a cgroup is easy: just create a directory. In this case we’ll create a memory group called “demo”. Once created, the kernel fills the directory with files that can be used to configure the cgroup.
$ sudo su
# mkdir /sys/fs/cgroup/memory/demo
# ls /sys/fs/cgroup/memory/demo/
cgroup.clone_children memory.memsw.failcnt
cgroup.event_control memory.memsw.limit_in_bytes
cgroup.procs memory.memsw.max_usage_in_bytes
memory.failcnt memory.memsw.usage_in_bytes
memory.force_empty memory.move_charge_at_immigrate
memory.kmem.failcnt memory.numa_stat
memory.kmem.limit_in_bytes memory.oom_control
memory.kmem.max_usage_in_bytes memory.pressure_level
memory.kmem.slabinfo memory.soft_limit_in_bytes
memory.kmem.tcp.failcnt memory.stat
memory.kmem.tcp.limit_in_bytes memory.swappiness
memory.kmem.tcp.max_usage_in_bytes memory.usage_in_bytes
memory.kmem.tcp.usage_in_bytes memory.use_hierarchy
memory.kmem.usage_in_bytes notify_on_release
memory.limit_in_bytes tasks
memory.max_usage_in_bytes
To adjust a value we just have to write to the corresponding file. Let’s limit the cgroup to 100MB of memory and turn off swap.
# echo "100000000" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
# echo "0" > /sys/fs/cgroup/memory/demo/memory.swappiness
The tasks file is special: it contains the list of processes assigned to the cgroup. To join the cgroup we can write our own PID.
# echo $$ > /sys/fs/cgroup/memory/demo/tasks
Finally, we need a memory-hungry application; save the following Python script as hungry.py.
f = open("/dev/urandom", "r")
data = ""
i = 0
while True:
    data += f.read(10000000) # 10mb
    i += 1
    print "%dmb" % (i*10,)
If you’ve set up the cgroup correctly, this program won’t crash your computer.
# python hungry.py
10mb
20mb
30mb
40mb
50mb
60mb
70mb
80mb
Killed
If that didn’t crash your computer, congratulations!
cgroups can’t be removed until every process in the tasks file has exited or been reassigned to another group. Exit the shell and remove the directory with rmdir (don’t use rm -r).
# exit
exit
$ sudo rmdir /sys/fs/cgroup/memory/demo
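Because the whole cgroup dance is just file system operations, it’s easy to script. Here’s a rough Go sketch, assuming the same cgroup v1 memory hierarchy at /sys/fs/cgroup/memory and the “demo” group from the example: it creates the group, writes the limits, joins the group, then runs the memory hog as a child (children inherit their parent’s cgroup). As in the shell walkthrough, the group still has to be removed with rmdir after every member has exited.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strconv"
)

func write(path, value string) {
	if err := os.WriteFile(path, []byte(value), 0644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

func main() {
	// cgroup v1 path, matching the "demo" group created above.
	cg := "/sys/fs/cgroup/memory/demo"

	// mkdir /sys/fs/cgroup/memory/demo
	if err := os.MkdirAll(cg, 0755); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// echo 100000000 > memory.limit_in_bytes; echo 0 > memory.swappiness
	write(filepath.Join(cg, "memory.limit_in_bytes"), "100000000")
	write(filepath.Join(cg, "memory.swappiness"), "0")

	// echo $$ > tasks: put *this* process in the cgroup.
	write(filepath.Join(cg, "tasks"), strconv.Itoa(os.Getpid()))

	// Run the memory hog inside the cgroup; the kernel kills it around 100MB.
	cmd := exec.Command("python", "hungry.py")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}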
Container security and capabilities
Containers are extremely effective ways of running arbitrary code from the internet as root, and this is where the low overhead of containers hurts us. Containers are significantly easier to break out of than a VM. As a result, many technologies used to improve the security of containers, such as SELinux, seccomp, and capabilities, involve limiting the power of processes already running as root.
In this section we’ll be exploring Linux capabilities.
Consider the following Go program which attempts to listen on port 80.
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	if _, err := net.Listen("tcp", ":80"); err != nil {
		fmt.Fprintln(os.Stdout, err)
		os.Exit(2)
	}
	fmt.Println("success")
}
What happens when we compile and run this?
$ go build -o listen listen.go
$ ./listen
listen tcp :80: bind: permission denied
Predictably, this program fails; listening on port 80 requires permissions we don’t have. Of course we could just use sudo, but we’d like to give the binary just the one permission it needs: listening on low-numbered ports.
Capabilities are a set of discrete powers that together make up everything root can do. This ranges from setting the system clock to killing arbitrary processes. In this case, CAP_NET_BIND_SERVICE allows executables to listen on ports below 1024.
We can grant the executable CAP_NET_BIND_SERVICE using the setcap command.
$ sudo setcap cap_net_bind_service=+ep listen
$ getcap listen
listen = cap_net_bind_service+ep
$ ./listen
success
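If you want a program to check which capabilities it’s actually running with, one simple approach (a sketch, not part of the original demo) is to parse the CapEff bitmask from /proc/self/status; CAP_NET_BIND_SERVICE is capability number 10.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	const capNetBindService = 10 // capability number for CAP_NET_BIND_SERVICE

	f, err := os.Open("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "CapEff:") {
			continue
		}
		// CapEff is a hex bitmask of the process' effective capability set.
		hex := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
		mask, err := strconv.ParseUint(hex, 16, 64)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		if mask&(1<<capNetBindService) != 0 {
			fmt.Println("can bind to low ports")
		} else {
			fmt.Println("cannot bind to low ports")
		}
		return
	}
}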
For things already running as root, like most containerized apps, we’re more interested in taking capabilities away than granting them. First, let’s see all the powers our root shell has:
$ sudo su
# capsh --print
Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,37+ep
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,37
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=0(root)
gid=0(root)
groups=0(root)
Yeah, that’s a lot of capabilities.
As an example, we’ll use capsh to drop a few capabilities, including CAP_CHOWN. If things work as expected, our shell shouldn’t be able to modify file ownership despite running as root.
$ sudo capsh --drop=cap_chown,cap_setpcap,cap_setfcap,cap_sys_admin --chroot=$PWD/rootfs --
root@localhost:/# whoami
root
root@localhost:/# chown nobody /bin/ls
chown: changing ownership of '/bin/ls': Operation not permitted
Conventional wisdom still states that VM isolation is mandatory when running untrusted code. But security features like capabilities are important for protecting against compromised applications running in containers.
Beyond more elaborate tools like seccomp, SELinux, and capabilities, applications running in containers generally benefit from the same kinds of best practices as applications running outside of one: know what you’re linking against, don’t run as root in your container, and update for known security issues in a timely fashion.
Conclusion
Containers aren’t magic. Anyone with a Linux machine can play around with them, and tools like Docker and rkt are just wrappers around things built into every modern kernel. No, you probably shouldn’t go and implement your own container runtime. But having a better understanding of these lower level technologies will help you work with these higher level tools (especially when debugging).
There’s a ton of topics I wasn’t able to cover today, networking and copy-on-write file systems probably being the biggest two. However, I hope this acts as a good starting point for anyone wanting to get their hands dirty. Happy hacking!