The Invisible Layer: How Abstraction Is Making Software Engineers Dumber

What Docker Actually Is

9 min read Chapter 18 of 56
Summary

Demystifies Docker by building a container from scratch using Linux namespaces, cgroups, and overlay filesystems, then maps each primitive to what docker run actually does, giving the reader a mental model for debugging container issues at the OS level.

What Docker Actually Is

Docker is not a virtual machine. If you take one thing from this section, take that. A virtual machine runs a complete operating system with its own kernel on emulated hardware. Docker runs your process on the host kernel with some clever isolation tricks. The difference isn’t semantic — it’s architectural, and confusing the two will lead you to wrong conclusions about performance, security, and debugging.

A container is a regular Linux process with three constraints applied to it:

  1. Namespaces — it can’t see certain things
  2. Cgroups — it can’t use more than certain amounts of resources
  3. An overlay filesystem — it sees a custom view of the filesystem

That’s it. No hypervisor. No guest kernel. No hardware emulation. Just a process with blinders on and a budget.

macOS/Windows note: Docker Desktop on macOS and Windows runs a Linux VM behind the scenes (using Apple’s Virtualization framework on macOS, and WSL 2 or Hyper-V on Windows). Your container runs inside that Linux VM. You’re getting a VM whether you wanted one or not, but the container inside it still uses the primitives described here.

Namespaces: What Your Process Can See

A Linux namespace restricts what a process can see. There are several types, each isolating a different aspect of the system:

| Namespace | Isolates | Effect |
|-----------|----------|--------|
| PID | Process IDs | Container sees its own PID 1, can’t see host processes |
| NET | Network stack | Container gets its own IP, interfaces, routing table |
| MNT | Mount points | Container sees its own filesystem tree |
| UTS | Hostname | Container can have its own hostname |
| IPC | Inter-process communication | Separate shared memory, semaphores |
| USER | User/group IDs | UID 0 in container can map to unprivileged UID on host |

You can create namespaces manually with unshare:

# Create a new PID and mount namespace, run bash inside it
sudo unshare --pid --mount --fork bash

# Inside this new namespace:
echo $$
# 1   <-- this bash IS PID 1 in this namespace

# Mount a fresh /proc so ps works correctly in the new PID namespace
mount -t proc proc /proc

ps aux
# USER  PID %CPU %MEM    VSZ   RSS TTY  STAT START   TIME COMMAND
# root    1  0.0  0.0   7236  4020 pts/0 S  12:00   0:00 bash
# root    2  0.0  0.0  10072  3312 pts/0 R+ 12:00   0:00 ps aux

# Only two processes visible. The host has hundreds.
# Exit when done; the private /proc mount disappears along with the mount namespace
exit

From inside this namespace, the bash process believes it’s PID 1 — the init process. It can’t see any of the host’s processes. From the host’s perspective, this is still just a regular process with a regular PID (say, 45823). The namespace is a lens, not a wall.
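Both views of the same process can be read from one file. On kernels 4.1 and later, /proc/<pid>/status carries an NSpid line that lists the process’s PID in each nested PID namespace, outermost first. Here is a minimal sketch for pulling that line out; ns_pids is a hypothetical helper name, and it takes the status file as an argument so you can point it at any process:

```shell
# ns_pids FILE: print the PIDs from the NSpid line of a /proc/<pid>/status file.
# The first number is the PID in the outermost namespace (the host's view);
# the last number is the PID the process sees for itself.
ns_pids() {
    awk '$1 == "NSpid:" {
        for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "\n")
    }' "$1"
}

# Example, run on the host with the host-side PID used in the text:
#   ns_pids /proc/45823/status
```

For the namespaced bash above, the line would hold two numbers: the host PID, then 1. A process that was never placed in a new PID namespace shows a single number.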

You can inspect what namespaces a process belongs to:

# On the host, find the actual PID of our namespaced bash
# Then look at its namespace memberships
ls -la /proc/45823/ns/
# lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532589]'    # different!
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026532590]'    # different!
# lrwxrwxrwx 1 root root 0 ... net -> 'net:[4026531840]'    # same as host

Each namespace is identified by an inode number. Processes in the same namespace share the same inode. This is how you can tell whether two containers share a network namespace — compare their /proc/<pid>/ns/net links.
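That comparison is easy to script. The sketch below assumes nothing beyond readlink; same_ns and the PROC_ROOT override are hypothetical names introduced here for illustration (PROC_ROOT exists only so the helper can be pointed at a test directory instead of the real /proc):

```shell
# PROC_ROOT is a hypothetical override for testing; it defaults to the real /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

# same_ns PID_A PID_B TYPE: succeed if both processes are in the same namespace
# of the given type (net, pid, mnt, uts, ipc, user, cgroup).
# Returns 0 if shared, 1 if different, 2 if either link cannot be read.
same_ns() {
    a=$(readlink "$PROC_ROOT/$1/ns/$3") || return 2   # no such PID, or no permission
    b=$(readlink "$PROC_ROOT/$2/ns/$3") || return 2
    [ "$a" = "$b" ]
}

# Example (container names are placeholders; reading another user's ns links needs root):
#   A=$(docker inspect --format '{{.State.Pid}}' container_a)
#   B=$(docker inspect --format '{{.State.Pid}}' container_b)
#   same_ns "$A" "$B" net && echo "shared network namespace"
```

The distinct return code for unreadable links matters: without it, two failed lookups would compare as equal empty strings and report a shared namespace that doesn’t exist.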

You can enter an existing namespace with nsenter:

# Enter the namespaces of a running container
# (Roughly what "docker exec" does under the hood: it joins the container's namespaces via setns)
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' my_container)
sudo nsenter --target "$CONTAINER_PID" --mount --pid --net bash

That command joins the mount, PID, and network namespaces of the container’s main process. You’re now seeing what the container sees. This is more flexible than docker exec because you can choose which namespaces to enter: you could join the network namespace but keep the host’s PID and mount namespaces, for instance, and debug the container’s network with the host’s own tools.

Cgroups: What Your Process Can Use

Namespaces control visibility. Cgroups (control groups) control resource consumption. A cgroup sets hard limits on how much CPU, memory, I/O bandwidth, and other resources a process (or group of processes) can use.

Cgroup configuration lives in a filesystem, typically mounted at /sys/fs/cgroup/. You can create and configure cgroups by creating directories and writing to files:

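# Create a cgroup (cgroups v2)
# If memory.max or cpu.max is missing afterwards, enable the controllers for child groups first:
#   echo "+memory +cpu" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
sudo mkdir /sys/fs/cgroup/my_container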

# Set a memory limit of 100MB
echo $((100 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/my_container/memory.max

# Set a CPU limit: 50% of one core (50000 out of 100000 microseconds)
echo "50000 100000" | sudo tee /sys/fs/cgroup/my_container/cpu.max

# Add the current shell to this cgroup
echo $$ | sudo tee /sys/fs/cgroup/my_container/cgroup.procs

# Now this shell and all its children are limited to
# 100MB RAM and 50% of one CPU core

If the processes in this cgroup together use more than 100MB of memory and the kernel cannot reclaim enough to get back under the limit, the kernel’s OOM (Out Of Memory) killer terminates a process in the cgroup. This is the mechanism behind the dreaded “OOMKilled” status in Kubernetes: your container exceeded its cgroup memory limit, and the kernel killed it. It wasn’t a crash. It was an execution.
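The kernel keeps a record of these kills. On cgroups v2, each cgroup directory has a memory.events file, and its oom_kill counter goes up every time a process in that cgroup is killed by the OOM killer. A small sketch for reading it, where oom_kills is a hypothetical helper name:

```shell
# oom_kills CGROUP_DIR: print how many processes the OOM killer has
# terminated in this cgroup, read from its memory.events file.
oom_kills() {
    awk '$1 == "oom_kill" { print $2 }' "$1/memory.events"
}

# Example, using the cgroup created above:
#   oom_kills /sys/fs/cgroup/my_container
```

A non-zero count tells you the process was killed for exceeding its limit rather than dying on its own, which is exactly the distinction you need when a container vanishes without a stack trace.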

You can see a container’s cgroup limits:

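# Find the cgroup of a Docker container
# (This path assumes the cgroupfs cgroup driver; with the systemd driver, look under
#  /sys/fs/cgroup/system.slice/docker-$CONTAINER_ID.scope/ instead)
CONTAINER_ID=$(docker inspect --format '{{.Id}}' my_container)
cat "/sys/fs/cgroup/docker/$CONTAINER_ID/memory.max"
cat "/sys/fs/cgroup/docker/$CONTAINER_ID/memory.current"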

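When someone says “my container is using 250MB”, this is where that number comes from: memory.current in the cgroup filesystem. (Tools such as docker stats may subtract page cache before reporting it, so their figure can be lower.)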
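You don’t have to guess where a container’s cgroup directory lives. On cgroups v2, /proc/<pid>/cgroup holds a single line of the form 0::/path, and that path is relative to the cgroup mount point. The sketch below uses it to locate the directory and then compares usage to the limit. The names cgroup_dir, mem_usage, and the CGROUP_ROOT override are hypothetical, introduced here for illustration:

```shell
# CGROUP_ROOT is a hypothetical override; it defaults to the usual cgroup v2 mount point.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# cgroup_dir FILE: given a /proc/<pid>/cgroup file, print the process's
# cgroup v2 directory (the "0::/path" line, prefixed with the mount point).
cgroup_dir() {
    awk -v root="$CGROUP_ROOT" '/^0::/ { sub(/^0::/, ""); print root $0 }' "$1"
}

# mem_usage CGROUP_DIR: print "current / limit" in bytes,
# with a percentage when a limit is set.
mem_usage() {
    current=$(cat "$1/memory.current") || return 1
    limit=$(cat "$1/memory.max") || return 1
    if [ "$limit" = "max" ]; then
        # memory.max contains the literal word "max" when no limit is configured
        echo "$current / unlimited"
    else
        echo "$current / $limit ($((current * 100 / limit))%)"
    fi
}

# Example, run on the host:
#   PID=$(docker inspect --format '{{.State.Pid}}' my_container)
#   mem_usage "$(cgroup_dir "/proc/$PID/cgroup")"
```

Going through the PID works regardless of which cgroup driver Docker was configured with, because the kernel reports the path it actually used.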

The Overlay Filesystem: Layered Reality

A Docker image isn’t a single filesystem snapshot. It’s a stack of read-only layers, and each container started from it gets a thin writable layer on top.

When you write a Dockerfile like:

# Layer 1: base Ubuntu filesystem
FROM ubuntu:22.04
# Layer 2: adds python3 binaries
RUN apt-get update && apt-get install -y python3
# Layer 3: adds your code
COPY app.py /app/

Each of these instructions creates a layer: a directory containing only the files that changed. (Instructions that don’t touch the filesystem, such as ENV or CMD, only add metadata.) The overlay filesystem (OverlayFS) merges these layers into a single coherent view:

┌─────────────────────────┐
│   Writable Layer        │  ← Container writes go here
├─────────────────────────┤
│   Layer 3: COPY app.py  │  ← Read-only
├─────────────────────────┤
│   Layer 2: RUN apt-get  │  ← Read-only
├─────────────────────────┤
│   Layer 1: ubuntu:22.04 │  ← Read-only
└─────────────────────────┘

When the container reads /app/app.py, OverlayFS looks down through the layers until it finds the file. When the container writes a file, the write goes to the top writable layer. When the container modifies an existing file from a lower layer, OverlayFS copies it to the writable layer first (copy-on-write again — the same principle as fork()).
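The lookup and copy-up rules can be modelled with ordinary directories. What follows is a toy model in shell, not how the kernel implements OverlayFS: it ignores whiteouts (the markers that record deletions) and directory merging, and the function names overlay_lookup and overlay_copy_up are hypothetical.

```shell
# overlay_lookup PATH UPPER LOWERS: print the layer directory that supplies PATH.
# LOWERS is a colon-separated list, topmost layer first.
# The writable UPPER layer is always searched before any lower layer.
overlay_lookup() {
    path=$1
    old_ifs=$IFS
    IFS=:
    set -- $2 $3            # split on ":" into upper, lower1, lower2, ...
    IFS=$old_ifs
    for layer in "$@"; do
        if [ -e "$layer/$path" ]; then
            echo "$layer"
            return 0
        fi
    done
    return 1                # not present in any layer
}

# overlay_copy_up PATH UPPER LOWERS: copy PATH into the writable layer before
# it is modified, leaving the read-only copy untouched (copy-on-write).
overlay_copy_up() {
    src=$(overlay_lookup "$1" "$2" "$3") || return 1
    if [ "$src" != "$2" ]; then
        mkdir -p "$2/$(dirname "$1")" && cp -p "$src/$1" "$2/$1"
    fi
}
```

After a copy-up, overlay_lookup finds the file in the upper directory first, so the lower layer’s copy is shadowed rather than changed. That is why ten containers can share one image without seeing each other’s edits.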

You can see the actual layer directories on disk:

docker inspect my_container --format '{{.GraphDriver.Data.MergedDir}}'
# /var/lib/docker/overlay2/abc123.../merged

docker inspect my_container --format '{{json .GraphDriver.Data}}' | python3 -m json.tool
# {
#   "LowerDir": "/var/lib/docker/overlay2/layer3/diff:
#                /var/lib/docker/overlay2/layer2/diff:
#                /var/lib/docker/overlay2/layer1/diff",
#   "MergedDir": "/var/lib/docker/overlay2/abc123/merged",
#   "UpperDir":  "/var/lib/docker/overlay2/abc123/diff",
#   "WorkDir":   "/var/lib/docker/overlay2/abc123/work"
# }

LowerDir is the stack of read-only layers. UpperDir is the writable layer. MergedDir is the combined view the container sees.
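LowerDir arrives as one long colon-separated string, which is hard to read once an image has more than a few layers. A tiny sketch for splitting it, with lower_layers as a hypothetical helper name:

```shell
# lower_layers LOWERDIR: print each read-only layer directory on its own line,
# in the order given (topmost layer first).
lower_layers() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# Example:
#   lower_layers "$(docker inspect my_container --format '{{.GraphDriver.Data.LowerDir}}')"
```

Each printed directory can be listed with ls to see exactly which files that layer contributed.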

This is why images can share layers: if ten images are all FROM ubuntu:22.04, that base layer exists only once on disk. It’s also why write-heavy workloads inside a running container can be slow: the first write to a file that lives in a lower layer copies the whole file up to the writable layer, and every access goes through OverlayFS instead of straight to ext4 or xfs. Databases inside containers suffer from this, which is why production database containers use volume mounts that bypass OverlayFS entirely.

Building a Container Without Docker

Let’s put the primitives together. Here’s a minimal “container” using only shell commands:

# 1. Create a root filesystem (just use Alpine's minimal rootfs)
mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer
curl -fsSL -o alpine.tar.gz https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.0-x86_64.tar.gz
tar xzf alpine.tar.gz -C rootfs

# 2. Set up a cgroup with resource limits
sudo mkdir /sys/fs/cgroup/mycontainer
echo $((50 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/mycontainer/memory.max

# 3. Launch a process with new namespaces, in the cgroup,
#    with the new root filesystem
echo $$ | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs
sudo unshare --pid --mount --uts --ipc --fork chroot rootfs /bin/sh -c '
    mount -t proc proc /proc
    hostname mycontainer
    echo "Hello from inside a container!"
    echo "PID: $$"
    echo "Hostname: $(hostname)"
    ps aux
    exec /bin/sh
'

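This gives you an isolated process with its own PID namespace (it sees itself as PID 1), its own hostname, its own filesystem root, and a 50MB memory limit. It still shares the host’s network, because no NET namespace was requested, and a plain chroot stands in for the overlay filesystem. No Docker daemon. No image registry. Just a handful of Linux primitives composed together.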
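If step 2 fails because memory.max doesn’t exist, the usual cause is that the memory controller isn’t enabled for child cgroups. On cgroups v2, a parent’s cgroup.subtree_control file lists the controllers its children get. A hedged preflight sketch, with child_controller_enabled as a hypothetical helper name:

```shell
# child_controller_enabled CONTROLLER [PARENT_DIR]: succeed if cgroups created
# under PARENT_DIR (default /sys/fs/cgroup) will have CONTROLLER's files.
# Returns 0 if enabled, 1 if not, 2 if PARENT_DIR is not a readable cgroup v2 directory.
child_controller_enabled() {
    parent=${2:-/sys/fs/cgroup}
    [ -r "$parent/cgroup.subtree_control" ] || return 2
    # The file is one space-separated line, for example "cpu memory pids"
    for c in $(cat "$parent/cgroup.subtree_control"); do
        if [ "$c" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# Example:
#   child_controller_enabled memory || echo "enable it: echo +memory | sudo tee /sys/fs/cgroup/cgroup.subtree_control"
```

A return code of 2 usually means the machine is still on cgroups v1, where the file layout is different and the commands above won’t apply as written.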

What docker run Actually Does

When you run docker run -it --memory=512m --cpus=1.5 ubuntu bash, Docker:

  1. Pulls the image layers (if not cached) and assembles them into an overlay filesystem
  2. Creates a new set of namespaces: PID, NET, MNT, UTS, IPC (optionally USER)
  3. Creates a cgroup and sets memory.max to 512MB, cpu.max to 150000 100000
  4. Sets up a virtual ethernet pair (veth) connecting the container’s network namespace to the docker0 bridge
  5. Sets the overlay MergedDir as the root filesystem via pivot_root (similar to chroot but more secure)
  6. Drops Linux capabilities (the container’s root can’t load kernel modules, for example)
  7. Applies seccomp filters (blocks ~44 dangerous syscalls like reboot and mount)
  8. Executes bash as PID 1 inside the container

Every one of these steps uses a kernel feature that already existed: the three primitives above, plus veth devices, capabilities, and seccomp. Docker is an orchestrator of existing kernel features, not a new virtualization technology.
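The arithmetic in step 3 is worth seeing once. The sketch below converts the two flags into the values that end up in the cgroup files, assuming the 100000-microsecond period used throughout this section and Docker’s binary units for memory sizes. The helper names cpus_to_cpu_max and mem_to_bytes are hypothetical, not part of any Docker tooling:

```shell
# cpus_to_cpu_max CPUS: convert a --cpus value into the "quota period" pair
# written to cpu.max, with a fixed period of 100000 microseconds.
cpus_to_cpu_max() {
    awk -v cpus="$1" 'BEGIN { printf "%.0f 100000\n", cpus * 100000 }'
}

# mem_to_bytes SIZE: convert a --memory value such as 512m or 1g into bytes.
# Suffixes are binary: k = 1024, m = 1024^2, g = 1024^3; b or no suffix means bytes.
mem_to_bytes() {
    num=${1%[bkmgBKMG]}     # strip one trailing unit letter, if present
    case $1 in
        *[kK]) echo $((num * 1024)) ;;
        *[mM]) echo $((num * 1024 * 1024)) ;;
        *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
        *)     echo "$num" ;;
    esac
}

cpus_to_cpu_max 1.5    # prints: 150000 100000
mem_to_bytes 512m      # prints: 536870912
```

Reading cpu.max this way round is just as useful: a quota of 150000 against a period of 100000 means the cgroup may consume 1.5 cores’ worth of CPU time in every 100ms window.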

Why This Matters for Debugging

When a container “can’t reach the network,” the problem is in the NET namespace — check docker inspect for network settings, use nsenter --net to enter its network namespace and run ip addr / ip route. When a container gets OOMKilled, the problem is in its cgroup — check memory.max vs memory.current. When a container’s filesystem writes are slow, the problem might be OverlayFS — consider a volume mount.

The abstraction Docker provides is valuable. But debugging requires you to see through the abstraction to the primitives underneath. You can’t fix a namespace issue by restarting the container. You can’t fix a cgroup limit by upgrading your base image. You have to know what layer the problem lives on.

Docker didn’t invent containers. It made them convenient. Convenience is good — until it breaks, and you need to understand what’s actually happening.