Containers

Basics

There is no such thing as a container, but only as much as containing is a thing you do.

Linux containers

Modern containers are a mix of cgroups, namespaces, pivot_root (like chroot but more appropriate for containers), and seccomp. But this is just one possible implementation. There are also, certainly, other ways of implementing containers.

Docker : Docker is a reference implementation for the concept of containers. It provided a simple, practical way to bundle applications and their dependencies in a relatively standardized way. Docker’s main innovation was the layering system
LXC & LXD
Other linux container solutions: Some implementations (e.g. firecracker-containerd) also use SELinux and CPU virtualization support.

Containers without namespaces

One could ship application components as containers, but have them run as VMs, eg. firecracker-containerd
Dockers for Windows and Macs do not even use “cgroups” and “namespaces” because these technologies are not available on these stacks, it resorts to plain old VMs.

Comparison w Solaris Zones and BSD Jails

Ramblings from Jessie
Solaris Zones, BSD Jails are first class concepts.
Containers on the other hand are not real things.
“Containers are less isolated than Jails/Zones but allow for a flexibility and control that is not possible with Jails, Zones, or VMs. And THAT IS A FEATURE.”
Containers are not sandboxing but we can attempt to sandbox it and good attempts have been made. One example is gvisor

Images

What is it?

a .tar
We basically bundle the application together w deps and just enough of a Linux root filesystem to run it.
Usually updated in version compared to updated in place

Implementations

Using Dockerfile
Using nix dockerTools (NixOS)

TODO Concept of `user` during image creation

Use of `USER`

It’s reasonable to build the application as root and switch only to a non-root USER at the very end of the Dockerfile using USER.
- Later at runtime, USER can always be overridden by the --user cli flag.
docker - Install all packages in root and switch to a non-root user or use `gosu` or `sudo` directly in a dockerfile? - Stack Overflow
While you can just make up UIDs when passing --user, usually you’d want that user to be actually present inside the container
- In those case you’d use useradd / groupapp like normal linux installation
- the process is split into two operations, even with the standard Docker tools, i.e.
  - first you create the new user so that you can configure the files to be owned by it (useradd phase)
  - and then you have to configure the containers produced by this image to be run as that particular user by default. (USER / gosu etc.)
  - You can do it in multi-staged builds (even with nix dockerTools but I don’t see it as strictly necessary)

Use of `gosu`

gosu is only for de-elevating from root to a normal user. It is normally used as the last step of an entrypoint script to run the actual service as a non-root user (ie exec gosu nobody:nobody redis-server). This is useful when you need to do a few setup steps that require root (like chown a volume directory) and yet not have the service running as root. If you do not need any root access before the service starts, then USER nobody:nobody in the Dockerfile (or —user nobody:nobody on docker run) will accomplish the same thing (gosu uses the same function from runc that docker uses).

Gosu and user ? · Issue #55 · tianon/gosu · GitHub
dockerfile - Docker using gosu vs USER - Stack Overflow
gosu and rootless: https://github.com/containers/podman/issues/6816
My heuristic: reachout for gosu only if you need to do priviledged operations in the entrypoint file, otherwise just use USER

Filesystem and storage

Overlayfs

Containers usually don’t have a persistent root filesystem
overlayfs is used to create a temporary layer on top of the container image.
This is thrown away when the container is stopped.

Block devices

Docker and containerd use overlay filesystems instead of loopback devices.
CSI might mount it as loopback

Mounts

See Mounting related stuff for OS mounts instead of container mounts

	Host volume/Bind mount	Named volume
Permission issues	We need to manually manage it	Usually no permission issue
Overwrite	Will overwite when mounted	Will merge files when mounted
Management	Managed outside of docker/podman	Managed by docker/podman
Remote storage	Not possible	Possible via drivers(`local` and `image`)
SELinux	-	I am not using it but people do it

Host volumes / `bind mounts`

persistent data outside of the container image is grafted on to the container’s filesystem via a bind mount to another location on the host.
We need to make sure UID/GID inside the container match what’s there in the directory/file we’re bind mounting etc.

Named volumes

When you create a volume, it’s stored within a directory on the Docker host.

When you mount the volume into a container, this directory is what’s mounted into the container.

This is similar to the way that bind mounts work

Except that volumes are managed by Docker and are isolated from the core functionality of the host machine.

Can be assigned a “driver”, either local or image
- local can further be configured in ways to use normal Operating Systems Mounting related stuff! So in a way we can do bind mounts too with name:local volume but idk why you’d want to do that. (using --opt option to pass mount options etc.)

Other volumes

Host volumes(bind mounts) and Names volumes(managed by docker) are ideas around Docker and Podman.
If you’re using some Orchestrator, things might be totally different in terms of how volumes are handled.
- For eg. Nomad automatically bind mounts its “task directories” into the container while you can configure “nomad volumes” to the container. You could also do podman/docker volume etc. These become specific to the orchestrator at that point.

Container Security (containers using linux namespaces)

Privileged and Unprivileged & Rootless

on use of the phrase “privileged containers”

I am not against use of this term but for clarity, I like to think it like this:

There’s no privileged or non/un-privileged containers. But only containers made to run in a privileged/non-privileged manner by combination of various things that can be applied to a running container and it’s a spectrum.
Following are some of those things which determine if a “running” container is actually privileged.
- user namespaces: Whether user namespaces(uid/gid) is even involved. This can be per container or across containers etc (see userns) or none.
- rootless/rootfull: This is just using user namespace in an opinionated way with regards to the root user and few other runtime specific things.
  - Rootless containers using Podman | Enable Sysadmin
- Linux Capabilities: Even if you run the container in rootfull mode(eg. no uid mapping whatsoever), if you get use capabilities root inside the container would not have as much power as it would normally have, however it’d still be able to edit any files etc. which is still concerning.
- --user flag : Is “specific to the main command” being run, defaults to root, can be overridden by using USER in the Containerfile or by passing this option during run. This is useful both in rootfull and rootless modes.
- --priviliged flag: What this flag does depends on the container engine you’re using. But affects the execution of the container as a whole. See this for more info.

At any given setup multiple of these(or more) things will be at play and will actually determine whether if things are actually “privileged”.

The `--user` vs rootless mode

“Rootless mode” is handled differently by Docker and Podman (better to assume that the ideas don’t transfer directly)

The podman docs give a good description: podman-run — Podman documentation

First of all, for you to specify --user the uid to be specified must exist inside the container. You’d using something like useradd for that during image creation.
This sets the UID to be used ONLY FOR THE COMMAND following that execution and overrides USER set during image creation.
When you use rootfull container, eg. default Docker, you can pass in the --user flag, this will run the main command as that user but then you can exec into the container and then run command as root inside the container which is also root outside the container. When running in rootless mode, this is not the case, in the exec case, you’ll be un-previledged user outside the container even if you’re root inside.
So since --user overrides the user, and rootless more re-maps the uid, do we need --user in rootless mode? Yes.
- See Should you use the —user flag in rootless containers? | Enable Sysadmin
- Summary: root inside the container maps to a valid uid in the host, but other uid other than root map to fake uids which have less capabilities. (See linked post). So using --user + rootless is safer.

Linux Capabilities in rootless vs rootful

Capabilities are per process things but they can be under a “user namespace”,
When you’re using rootfull, capabilities go as root (root in container and host are the same)
When using rootless, even you give the capabilities you’re giving that inside the usernamespace

Different meanings of Privileged and Unprivileged & Rootless

Unprivileged containers != Rootless containers

If containers are running as non-root users, when the runtime is still running as root, we don’t call them Rootless Containers.

Context	Idea	Description	Consequences
Linux	`root` user	User with UID of `0`	Tools such as `htop` will automatically label user as `root` if sees the `uid` of 0
	`non-root` user	User with UID other than `0`
Docker daemon	Rootfull	Daemon running as `root`
	Rootless	Daemon running as ~~non-root~~ `non-privileged` user
	Privileged	N/A
	Unprivileged	N/A
Docker container (at runtime)	Rootfull	`root` in container is `root` in host	Get fired as an SRE
	Rootless	`root` in the container is the `non-root user` on behalf of which the docker daemon was run	Mounted files in the container will be owned by `root` which in the host are owned by `non-root` user
			linux user namespaces come into play
			In the mounted path, other files not owned by “the” `non-root` user will show up as `nobody:nobody` in the container
			Other `users` inside the container will have a shifted user id and group id.
	Privileged/Unprivileged	Depends on various things
Podman container (at runtime)	Rootfull	run the initial process as the root of the user namespace they are launched in. (uns is `host`)
	Rootless	run the initial process as the root of the user namespace they are launched in. (uns is mapped)
	Privileged/Unprivileged	Depends on various things

Rootless in Docker and Podman
- User namespace
  - You can run rootfull in podman by using sudo
  - Docker by default does NOT create user namespace (uns, i.e lsns will not list any) it does create other namespaces ofc, but Podman being run as rootless by default will.
- Network namespace
  - See Docker for docker networking, which uses bridge
  - Podman uses slirp4netns to provide ip address
    
    There’s no shared network for rootless containers. Each one is plumbed into a tap interface which is then networked out to the host by slirp4netns. So if you start two containers in rootless mode, by default, they can’t talk directly to each other without exposing ports on the host. All your containers get the same IP address. On the installation I’m using they all get 10.0.2.100
    
    By comparison, if you run containers rootfully, the networking looks much more similar to the default Docker configuration. Containers will get an individual IP address, and will be able to communicate with other containers on the bridge network that they’ve been connected to.
  - MacVlan: Rootfull podman containers to do something similar, but with DHCP support. See More Podman - Rootfull containers, Networking and processes for more info.

Rootless mode in Nomad
- Rootless Nomad · Issue #13669 · hashicorp/nomad · GitHub (Not fully supported yet, can use Podman as the task driver which can help)

UID/GID and SUID/SGID

subuid and subgid subordinate(uids/gids)

Different from setuid and setgid bits

/etc/subuid and /etc/subgid let you assign extra user ids and group ids to a particular user. The subuid file contains a list of users and the user ids that the user is allowed to impersonate.
- Any resource owned by user(inside the container) which is not mapped(outside container) will get id -1 (nobody)
- Range of ids you assign are no longer available to be assigned to other users (both primary and via subuid)
- user to which they are assigned now ‘owns’ these ids.
Configured via the subid field in /etc/nsswitch.conf (nsswitch) file. (Has default set to files)
See shadow-utils
There’s also a limit to how many entries you can make in /etc/subuid files etc

Manual configuration

$ cat /etc/subuid
#<user>:base_id:total_nos_of_ids_allowed
user1:100000:65536
$ cat /etc/subgid
user1:100000:65536

Using usemod

usermod --add-subuids 1000000-1000999999 root
usermod --add-sugids 1000000-1000999999 pappu
usermod --add-subuids 1000000-1000999999 --add-sugids 1000000-1000999999 xyzuser

What about uid_map and gid_map ?

These are specific to user NS. The newuidmap tool can help. newuidmap sets /proc/[pid]/uid_map based on its command line arguments and the uids allowed.

Example usecase

Take an example usecase: “container” needs to run as root!

In this case, we can re-map this user(root inside container) to a less-privileged user on the Docker host.

use usermod to “allocate” a range of suid to a user(on host). These uid(s) in the range are all fake!
Any uid from the range can be used to “impersonate” the user(host) inside another user(inside the container)
- i.e If in any case someone breaks out of the container, the uid they’ll have is the “fake” id, which has no privileges on the host system at all.
After this you somehow map this to the user namespace (see Linux Namespaces)
- In raw linux, you’d use something like newidmap to map uid_map and gid_map
- In Docker, you’d use the userns-remap config option(daemon.json)
  - You’d do something like {"userns-remap": "testuser"} where testuser is the one you ran usermod for. If you keep the value as default, docker would try to create a user dockremap for you. (I dont like that autocreate idea honestly)
  - After that, container runtime should automatically pick the “fake uid”.
  - See Isolate containers with a user namespace | Docker Docs for more info
THIS user-ns IS NOT PER CONTAINER!
- Per container user namespace is not properly supported yet
- Support for user namespaces in Nomad · Issue #23918 · hashicorp/nomad · GitHub
- Docker user namespacing map user in container to specific user in host · Issue #27548 · moby/moby · GitHub
- Podman does support: see https://docs.podman.io/en/v4.4/markdown/options/userns.container.html
  - Understanding rootless Podman’s user namespace modes | Enable Sysadmin
  - There’s also --subuidname

Creating “unprivileged” containers

Unprivileged created by:

Taking a set of normal UIDs and GIDs from the host
Usually at least 65536 of each (to be POSIX compliant)
Then mapping those into the container
Implementations mostly expect /etc/subuid to contain at least 65536 subuids.
This allows LXC & LXD to do the “shift” in containers because it has a reserved pool of UIDs and GIDs.

`bind mounts` and UID

bind mounts (also called host volume in contrast to named volume) don’t play well with user namespace mapping etc.
Mounts are a very separate topic, podman even uses SELinux for filesystem namespacing etc.
But since bind mounts use host filesystem directly, it’s likely that there will be permission issues which will need to be handled individually
I like to avoid bind mounts whenever possible
See https://github.com/paperless-ngx/paperless-ngx/issues/4242

Resources

Container networking

Overview

Also see Docker networking
net ns can be connected using a Linux Virtual Ethernet Device or veth pair
From a network architecture point of view, all containers on a given Docker host are sitting on bridge interfaces.
Different container managers(docker, podman, lxd etc) provide a number of ways networking can be done with containers.
An interesting one is the bridged networking approach, which essentially boils down to 3 things.
1. Creating veth pair from host to net namespace-X. Every new container will add new veth interface and remove it once container is stopped. Eg. lxc info <instance_name> will show the veth created for the instance
2. Adding a bridge for the veth pair to talk through. When you install docker, it automatically creates a docker0 bridge created for containers to communicate. bridge is a L2 device, uses ARP.
3. Adding iptables rules to access outside network
See Introduction to Linux interfaces for virtual networking and Deep Dive into Linux Networking and Docker - Bridge, vETH and IPTables
Also see 8.2.5 About Veth and Macvlan

TODO How do things happen with the network namespace

https://www.reddit.com/r/podman/comments/14uxhrx/which_alternative_for_slirp4netns_in_rootless/
https://github.com/eriksjolund/podman-networking-docs
https://github.com/containers/podman/discussions/21451
unprivileged users cannot create networking interfaces on the host.
- The default networking mode for rootful containers on the other side is netavark, which allows a container to have a routable IP address.
https://github.com/containers/podman/blob/main/docs/tutorials/basic_networking.md

TODO Containers in Practice

TODO Golang Questions

What user to use gid etc.
Which image to use, why not alpine
- I recomend agains alpine images. They use muslc, that can be a real troublemaker at times. If you don’t need any tooling, then distroless image is great. Otherwise debian-slim ticks everything for me.
- You can also use SCRATCH but be aware of things like outgoing HTTPS requests, where you need a local CA certificate list for validating certificates.
  - We can also do COPY —from=build etc/ssl/certs/ca-certificates.crt /etc/ssl/certs
- Distroless
  - a base image with just ca-certificates, a passwd entry and tzdata, which are a few dependencies the Go runtime might look for at runtime.
  - So it’s pretty nice, rather than using SCRATCH
    - Eg. https://github.com/a1comms/gcs-sftp-server/blob/master/Dockerfile
- Q: Why not OS base image in golang applications? what about cgo.
Which image tool to use, docker or something else?
so if you are using containerization in development and not building inside the container, you are missing out on one of the major advantages of the paradigm.
- So we should be building inside the container?
- https://docs.docker.com/build/building/multi-stage/
- Building a docker image for a Go programm : golang
  - Usually people use multi-stage builds. Use the golang image to build the binary, then copy that binary to the more minimal alpine image.

🐏 mogoz

Table of Contents

Containers

Basics

Linux containers

Containers without namespaces

Comparison w Solaris Zones and BSD Jails

Images

What is it?

Implementations

TODO Concept of `user` during image creation

Use of `USER`

Use of `gosu`

Filesystem and storage

Overlayfs

Block devices

Mounts

Host volumes / `bind mounts`

Named volumes

Other volumes

Container Security (containers using linux namespaces)

Privileged and Unprivileged & Rootless

on use of the phrase “privileged containers”

The `--user` vs rootless mode

Linux Capabilities in rootless vs rootful

Different meanings of Privileged and Unprivileged & Rootless

UID/GID and SUID/SGID

subuid and subgid subordinate(uids/gids)

Example usecase

Creating “unprivileged” containers

`bind mounts` and UID

Resources

Container networking

Overview

TODO How do things happen with the network namespace

TODO Containers in Practice

TODO Golang Questions

Graph View

Backlinks

🐏 mogoz

Table of Contents

Containers

Basics §

Linux containers §

Containers without namespaces §

Comparison w Solaris Zones and BSD Jails §

Images §

What is it? §

Implementations §

TODO Concept of user during image creation §

Use of USER §

Use of gosu §

Filesystem and storage §

Overlayfs §

Block devices §

Mounts §

Host volumes / bind mounts §

Named volumes §

Other volumes §

Container Security (containers using linux namespaces) §

Privileged and Unprivileged & Rootless §

on use of the phrase “privileged containers” §

The --user vs rootless mode §

Linux Capabilities in rootless vs rootful §

Different meanings of Privileged and Unprivileged & Rootless §

UID/GID and SUID/SGID §

subuid and subgid subordinate(uids/gids) §

Example usecase §

Creating “unprivileged” containers §

bind mounts and UID §

Resources §

Container networking §

Overview §

TODO How do things happen with the network namespace §

TODO Containers in Practice §

TODO Golang Questions §

Graph View

Backlinks

Basics

Linux containers

Containers without namespaces

Comparison w Solaris Zones and BSD Jails

Images

What is it?

Implementations

TODO Concept of `user` during image creation

Use of `USER`

Use of `gosu`

Filesystem and storage

Overlayfs

Block devices

Mounts

Host volumes / `bind mounts`

Named volumes

Other volumes

Container Security (containers using linux namespaces)

Privileged and Unprivileged & Rootless

on use of the phrase “privileged containers”

The `--user` vs rootless mode

Linux Capabilities in rootless vs rootful

Different meanings of Privileged and Unprivileged & Rootless

UID/GID and SUID/SGID

subuid and subgid subordinate(uids/gids)

Example usecase

Creating “unprivileged” containers

`bind mounts` and UID

Resources

Container networking

Overview

TODO How do things happen with the network namespace

TODO Containers in Practice

TODO Golang Questions