tags : OCI Ecosystem, Linux

Basics

There is no such thing as a container, but only as much as containing is a thing you do.

Linux containers

Modern containers are a mix of cgroups, namespaces, pivot_root (like chroot but more appropriate for containers), and seccomp. But this is just one possible implementation. There are also, certainly, other ways of implementing containers.

  • Docker : Docker is a reference implementation for the concept of containers. It provided a simple, practical way to bundle applications and their dependencies in a relatively standardized way. Docker’s main innovation was the layering system
  • LXC & LXD
  • Other linux container solutions: Some implementations (e.g. firecracker-containerd) also use SELinux and CPU virtualization support.
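To make the “mix of namespaces” idea above concrete, here is a minimal sketch that lists the namespaces a process belongs to by reading `/proc/<pid>/ns`. The function name `current_namespaces` is made up for this note, and the sketch assumes a Linux host with `/proc` mounted; elsewhere it just returns an empty dict.

```python
import os


def current_namespaces(pid="self"):
    """Return {namespace_name: identifier} for a process, read from /proc.

    Each entry in /proc/<pid>/ns is a symlink such as 'pid:[4026531836]'.
    Two processes share a namespace when the identifiers match.
    Returns an empty dict when the directory cannot be read.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result
    for name in names:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Some entries may be unreadable without extra privileges.
            continue
    return result


if __name__ == "__main__":
    for name, ident in current_namespaces().items():
        print(f"{name:20} {ident}")
```

Running it once on the host and once inside a container and comparing the identifiers is a quick way to see which namespaces the container actually got.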

Containers without namespaces

  • One could ship application components as containers, but have them run as VMs, e.g. firecracker-containerd
  • Docker Desktop for Windows and macOS cannot use “cgroups” and “namespaces” on the host, because these are Linux kernel features that are not available on those stacks. It resorts to a plain old Linux VM and runs the containers inside it.

Comparison w Solaris Zones and BSD Jails

  • Ramblings from Jessie
  • Solaris Zones, BSD Jails are first class concepts.
  • Containers on the other hand are not real things.
  • “Containers are less isolated than Jails/Zones but allow for a flexibility and control that is not possible with Jails, Zones, or VMs. And THAT IS A FEATURE.”
  • Containers are not a sandbox by themselves, but we can attempt to sandbox them, and good attempts have been made. One example is gVisor.

Images

What is it?

  • a .tar
  • We basically bundle the application together w deps and just enough of a Linux root filesystem to run it.
  • Usually updated by publishing a new version rather than being updated in place

Implementations

  • Using Dockerfile
  • Using nix dockerTools (NixOS)

TODO Concept of user during image creation

Use of USER

  • It’s reasonable to build the application as root and switch to a non-root user only at the very end of the Dockerfile using USER.
    • Later at runtime, USER can always be overridden by the --user cli flag.
  • docker - Install all packages in root and switch to a non-root user or use `gosu` or `sudo` directly in a dockerfile? - Stack Overflow
  • While you can just make up UIDs when passing --user, usually you’d want that user to be actually present inside the container
    • In those cases you’d use useradd / groupadd like on a normal Linux installation
    • the process is split into two operations, even with the standard Docker tools, i.e.
      • first you create the new user so that you can configure the files to be owned by it (useradd phase)
      • and then you have to configure the containers produced by this image to be run as that particular user by default. (USER / gosu etc.)
      • You can do it in multi-staged builds (even with nix dockerTools but I don’t see it as strictly necessary)

Use of gosu

gosu is only for de-elevating from root to a normal user. It is normally used as the last step of an entrypoint script to run the actual service as a non-root user (i.e. exec gosu nobody:nobody redis-server). This is useful when you need to do a few setup steps that require root (like chown a volume directory) and yet not have the service running as root. If you do not need any root access before the service starts, then USER nobody:nobody in the Dockerfile (or --user nobody:nobody on docker run) will accomplish the same thing (gosu uses the same function from runc that docker uses).
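The “de-elevate, then exec” pattern can be sketched in a few lines. This is not gosu’s actual code (gosu is written in Go and also resolves user names and the target user’s supplementary groups); it is only an illustration of the order of operations, and `drop_privileges_and_exec` is a hypothetical name.

```python
import os


def drop_privileges_and_exec(uid, gid, argv):
    """Switch from root to uid:gid, then replace this process with argv.

    Order matters: supplementary groups and the GID must be changed while
    we are still root, because after setuid() we may no longer change them.
    """
    os.setgroups([])          # drop root's supplementary groups
    os.setgid(gid)            # set the primary group
    os.setuid(uid)            # finally give up root itself
    os.execvp(argv[0], argv)  # exec: the service replaces this process
```

Because of the final exec, the service keeps the PID of the entrypoint and receives signals directly, with no root-owned wrapper process left behind. Calling this as a non-root user fails with a PermissionError, which is the expected behaviour.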

Filesystem and storage

Overlayfs

  • Containers usually don’t have a persistent root filesystem
  • overlayfs is used to create a temporary layer on top of the container image.
  • This layer is thrown away when the container is removed; with Docker it survives a plain stop and start.
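A toy model of the idea above, using dicts instead of directories: reads fall through from a writable upper layer to the read-only image layers, writes only ever touch the upper layer, and deletions are recorded as “whiteouts”. The class `OverlayView` is invented for this note and is not how overlayfs is implemented in the kernel.

```python
class OverlayView:
    """Toy overlay: one writable upper dict over read-only lower dicts."""

    _WHITEOUT = object()  # marks a path as deleted in the upper layer

    def __init__(self, *lower_layers):
        # lower_layers are ordered bottom to top, like image layers.
        self.lower = list(lower_layers)
        self.upper = {}

    def read(self, path):
        if path in self.upper:
            value = self.upper[path]
            if value is self._WHITEOUT:
                raise FileNotFoundError(path)
            return value
        for layer in reversed(self.lower):  # topmost lower layer wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.upper[path] = data  # lower layers are never modified

    def delete(self, path):
        self.read(path)  # raises if the path is not visible
        self.upper[path] = self._WHITEOUT
```

Creating a second `OverlayView` over the same lower dicts corresponds to starting a fresh container from the same image: none of the first view’s writes or deletions are visible.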

Block devices

  • Docker and containerd use overlay filesystems instead of loopback devices.
  • CSI might mount it as loopback

Mounts

See Mounting related stuff for OS mounts instead of container mounts

| | Host volume/Bind mount | Named volume |
|---|---|---|
| Permission issues | We need to manually manage it | Usually no permission issue |
| Overwrite | Will overwrite when mounted | Will merge files when mounted |
| Management | Managed outside of docker/podman | Managed by docker/podman |
| Remote storage | Not possible | Possible via drivers (local and image) |
| SELinux | - | I am not using it but people do it |

Host volumes / bind mounts

  • persistent data outside of the container image is grafted on to the container’s filesystem via a bind mount to another location on the host.
  • We need to make sure UID/GID inside the container match what’s there in the directory/file we’re bind mounting etc.

Named volumes

  • When you create a volume, it’s stored within a directory on the Docker host.
  • When you mount the volume into a container, this directory is what’s mounted into the container.
  • This is similar to the way that bind mounts work
    • Except that volumes are managed by Docker and are isolated from the core functionality of the host machine.
  • Can be assigned a “driver”, either local or image
    • local can further be configured in ways to use normal Operating Systems Mounting related stuff! So in a way we can do bind mounts too with name:local volume but idk why you’d want to do that. (using --opt option to pass mount options etc.)

Other volumes

  • Host volumes (bind mounts) and named volumes (managed by Docker) are ideas around Docker and Podman.
  • If you’re using some Orchestrator, things might be totally different in terms of how volumes are handled.
    • E.g. Nomad automatically bind mounts its “task directories” into the container, while you can also configure “nomad volumes” for the container. You could also do podman/docker volume etc. These become specific to the orchestrator at that point.

Container Security (containers using linux namespaces)

Privileged and Unprivileged & Rootless

on use of the phrase “privileged containers”

I am not against use of this term but for clarity, I like to think it like this:

  • There’s no privileged or non/un-privileged containers. But only containers made to run in a privileged/non-privileged manner by combination of various things that can be applied to a running container and it’s a spectrum.
  • Following are some of those things which determine if a “running” container is actually privileged.
    • user namespaces: Whether user namespaces(uid/gid) is even involved. This can be per container or across containers etc (see userns) or none.
    • rootless/rootfull: This is just using user namespace in an opinionated way with regards to the root user and few other runtime specific things.
    • Linux Capabilities: Even if you run the container in rootfull mode (e.g. no uid mapping whatsoever), if you restrict capabilities, root inside the container would not have as much power as it would normally have. However, it’d still be able to edit any files etc., which is still concerning.
    • --user flag: Is “specific to the main command” being run. The user defaults to root and can be changed by using USER in the Containerfile or by passing this option during run. This is useful both in rootfull and rootless modes.
    • --privileged flag: What this flag does depends on the container engine you’re using, but it affects the execution of the container as a whole. See this for more info.

In any given setup, several of these (or more) things will be at play and will determine whether things are actually “privileged”.

The --user vs rootless mode

  • First of all, if you pass a user name to --user, that user must exist inside the container (a numeric UID works even without a passwd entry). You’d use something like useradd for that during image creation.
  • This sets the UID ONLY FOR THE COMMAND being run and overrides the USER set during image creation.
  • When you use a rootfull container, e.g. default Docker, you can pass in the --user flag and the main command runs as that user. But you can still exec into the container and run commands as root inside the container, which is also root outside the container. In rootless mode this is not the case: in the exec case you’ll be an unprivileged user outside the container even if you’re root inside.
  • So since --user overrides the user, and rootless mode re-maps the uid, do we need --user in rootless mode? Yes.

Linux Capabilities in rootless vs rootful

  • Capabilities are per-process things, but they can be scoped to a “user namespace”.
  • When you’re using rootfull, capabilities apply as real root (root in the container and on the host are the same).
  • When using rootless, even if you grant capabilities, you’re granting them only inside the user namespace.
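Since capabilities are per-process, they can be inspected per process: the `CapEff` line in `/proc/<pid>/status` holds the effective set as a hex bitmask. Below is a small decoding sketch. The helper names are made up, and `CAP_NAMES` is deliberately only a partial table (bit numbers as defined in the kernel’s capability header); unknown bits are printed as `cap_<n>`.

```python
# Partial table of capability bit numbers; extend as needed.
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(hex_mask):
    """Turn a hex capability mask into a list of capability names."""
    mask = int(hex_mask, 16)
    return [
        CAP_NAMES.get(bit, f"cap_{bit}")
        for bit in range(mask.bit_length())
        if (mask >> bit) & 1
    ]


def effective_caps(status_text):
    """Extract and decode the CapEff line from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return []
```

Note that the mask alone does not tell you which user namespace the capabilities are valid in. The same `CAP_SYS_ADMIN` bit means far less for a process in a rootless container than for one sharing the host’s user namespace, which is the point of the list above.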

Different meanings of Privileged and Unprivileged & Rootless

Unprivileged containers != Rootless containers

If containers are running as non-root users while the runtime is still running as root, we don’t call them Rootless Containers.

| Context | Idea | Description | Consequences |
|---|---|---|---|
| Linux | root user | User with UID of 0 | Tools such as htop will automatically label the user as root if they see the uid of 0 |
| | non-root user | User with UID other than 0 | |
| Docker daemon | Rootfull | Daemon running as root | |
| | Rootless | Daemon running as non-root non-privileged user | |
| | Privileged | N/A | |
| | Unprivileged | N/A | |
| Docker container (at runtime) | Rootfull | root in container is root in host | Get fired as an SRE |
| | Rootless | root in the container is the non-root user on behalf of which the docker daemon was run; linux user namespaces come into play | Mounted files in the container will be owned by root which in the host are owned by non-root user. In the mounted path, other files not owned by “the” non-root user will show up as nobody:nobody in the container. Other users inside the container will have a shifted user id and group id. |
| | Privileged/Unprivileged | Depends on various things | |
| Podman container (at runtime) | Rootfull | run the initial process as the root of the user namespace they are launched in. (uns is host) | |
| | Rootless | run the initial process as the root of the user namespace they are launched in. (uns is mapped) | |
| | Privileged/Unprivileged | Depends on various things | |
  • Rootless in Docker and Podman

    • User namespace
      • You can run rootfull in podman by using sudo
      • Docker by default does NOT create a user namespace (uns, i.e. lsns will not list any), though it does create the other namespaces ofc. Podman, being run as rootless by default, will.
    • Network namespace
      • See Docker for docker networking, which uses bridge

      • Podman uses slirp4netns to provide ip address

        There’s no shared network for rootless containers. Each one is plumbed into a tap interface which is then networked out to the host by slirp4netns. So if you start two containers in rootless mode, by default, they can’t talk directly to each other without exposing ports on the host. All your containers get the same IP address. On the installation I’m using they all get 10.0.2.100

        By comparison, if you run containers rootfully, the networking looks much more similar to the default Docker configuration. Containers will get an individual IP address, and will be able to communicate with other containers on the bridge network that they’ve been connected to.

      • MacVlan: Rootfull podman containers can use macvlan to do something similar, but with DHCP support. See More Podman - Rootfull containers, Networking and processes for more info.

UID/GID and SUID/SGID

subuid and subgid (subordinate uids/gids)

Different from setuid and setgid bits

  • /etc/subuid and /etc/subgid let you assign extra user ids and group ids to a particular user. The subuid file contains a list of users and the user ids that the user is allowed to impersonate.
    • Any resource owned by a user that has no mapping across the container boundary will show up with the overflow id 65534 (nobody)
    • Range of ids you assign are no longer available to be assigned to other users (both primary and via subuid)
    • user to which they are assigned now ‘owns’ these ids.
  • Configured via the subid field in /etc/nsswitch.conf (nsswitch) file. (Has default set to files)
  • See shadow-utils
  • There’s also a limit to how many entries you can make in /etc/subuid files etc
  • Manual configuration

    $ cat /etc/subuid
    #<user>:base_id:total_nos_of_ids_allowed
    user1:100000:65536
    $ cat /etc/subgid
    user1:100000:65536
  • Using usermod

    usermod --add-subuids 1000000-1000999999 root
    usermod --add-subgids 1000000-1000999999 pappu
    usermod --add-subuids 1000000-1000999999 --add-subgids 1000000-1000999999 xyzuser
  • What about uid_map and gid_map ?

    These are specific to user NS. The newuidmap tool can help. newuidmap sets /proc/[pid]/uid_map based on its command line arguments and the uids allowed.
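Each line of `/proc/[pid]/uid_map` has three numbers: first id inside the namespace, first id outside, and the length of the range. The sketch below translates ids in both directions. The function names are made up, and the sample map is an assumption: it is what a typical rootless setup would look like for a host user with uid 1000 and the `user1:100000:65536` entry shown earlier.

```python
OVERFLOW_ID = 65534  # kernel default shown for unmapped ids ("nobody")


def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(f) for f in fields))
    return entries


def to_host_id(container_id, id_map):
    """Host id behind a container id, or None if the id is not mapped."""
    for inside, outside, length in id_map:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


def to_container_id(host_id, id_map):
    """Id that a host id appears as inside the namespace."""
    for inside, outside, length in id_map:
        if outside <= host_id < outside + length:
            return inside + (host_id - outside)
    return OVERFLOW_ID


# Assumed example: container root is host uid 1000, the rest is shifted.
SAMPLE_MAP = parse_id_map("0 1000 1\n1 100000 65536\n")
```

With `SAMPLE_MAP`, container uid 0 is host uid 1000, container uid 33 is host uid 100032, and a file owned by host uid 1001 (some other user) appears as 65534 inside the container.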

Example usecase

Take an example usecase: “container” needs to run as root!

In this case, we can re-map this user(root inside container) to a less-privileged user on the Docker host.

Creating “unprivileged” containers

Unprivileged containers are created by:

  • Taking a set of normal UIDs and GIDs from the host
  • Usually at least 65536 of each (to be POSIX compliant)
  • Then mapping those into the container
  • Implementations mostly expect /etc/subuid to contain at least 65536 subuids.
  • This allows LXC & LXD to do the “shift” in containers because it has a reserved pool of UIDs and GIDs.
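A quick sanity check for the points above: does a user have at least 65536 subordinate ids, and do any assigned ranges collide? This is only a sketch over the `user:start:count` format shown earlier. The helper names are hypothetical, and skipping `#` lines is an assumption made so that the annotated listing from this note parses.

```python
from collections import namedtuple

SubIdRange = namedtuple("SubIdRange", "user start count")


def parse_subid(text):
    """Parse /etc/subuid or /etc/subgid style content."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        user, start, count = line.split(":")
        ranges.append(SubIdRange(user, int(start), int(count)))
    return ranges


def total_ids(ranges, user):
    """Number of subordinate ids assigned to a user across all entries."""
    return sum(r.count for r in ranges if r.user == user)


def overlapping(ranges):
    """Return pairs of ranges that share at least one id."""
    ordered = sorted(ranges, key=lambda r: r.start)
    clashes = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.start + a.count:
                break  # later ranges start even further right
            clashes.append((a, b))
    return clashes
```

Usage would be something like `total_ids(parse_subid(open("/etc/subuid").read()), "user1") >= 65536`.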

bind mounts and UID

  • bind mounts (also called host volume in contrast to named volume) don’t play well with user namespace mapping etc.
  • Mounts are a very separate topic, podman even uses SELinux for filesystem namespacing etc.
  • But since bind mounts use host filesystem directly, it’s likely that there will be permission issues which will need to be handled individually
  • I like to avoid bind mounts whenever possible
  • See https://github.com/paperless-ngx/paperless-ngx/issues/4242

Resources

Container networking

Overview

  • Also see Docker networking
  • net ns can be connected using a Linux Virtual Ethernet Device or veth pair
  • From a network architecture point of view, all containers on a given Docker host are sitting on bridge interfaces.
  • Different container managers(docker, podman, lxd etc) provide a number of ways networking can be done with containers.
  • An interesting one is the bridged networking approach, which essentially boils down to 3 things.
    1. Creating a veth pair from the host to net namespace-X. Every new container adds a new veth interface, which is removed once the container is stopped. E.g. lxc info <instance_name> will show the veth created for the instance
    2. Adding a bridge for the veth pair to talk through. When you install docker, it automatically creates a docker0 bridge for containers to communicate. A bridge is an L2 device and uses ARP.
    3. Adding iptables rules to access outside network
  • See Introduction to Linux interfaces for virtual networking and Deep Dive into Linux Networking and Docker - Bridge, vETH and IPTables
  • Also see 8.2.5 About Veth and Macvlan
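The three steps of the bridged approach can be written down as plain `ip` and `iptables` invocations. The sketch below only builds the command list; it does not run anything, since that needs root. All names and addresses (`demo`, `br-demo`, `172.18.0.0/24`) are made up, and this is a hand-rolled illustration, not what Docker or LXD execute internally.

```python
import shlex


def bridge_setup_commands(netns="demo", bridge="br-demo",
                          subnet="172.18.0.0/24",
                          bridge_ip="172.18.0.1/24",
                          container_ip="172.18.0.2/24"):
    """Return the commands for a minimal bridged container network."""
    host_if, peer_if = f"veth-{netns}", f"ceth-{netns}"
    gateway = bridge_ip.split("/")[0]
    lines = [
        # 1. veth pair: one end stays on the host, one moves into the netns
        f"ip netns add {netns}",
        f"ip link add {host_if} type veth peer name {peer_if}",
        f"ip link set {peer_if} netns {netns}",
        f"ip netns exec {netns} ip addr add {container_ip} dev {peer_if}",
        f"ip netns exec {netns} ip link set {peer_if} up",
        # 2. bridge: the host ends of all veth pairs get attached to it
        f"ip link add {bridge} type bridge",
        f"ip addr add {bridge_ip} dev {bridge}",
        f"ip link set {bridge} up",
        f"ip link set {host_if} master {bridge}",
        f"ip link set {host_if} up",
        f"ip netns exec {netns} ip route add default via {gateway}",
        # 3. NAT: masquerade traffic leaving the bridge subnet
        "sysctl -w net.ipv4.ip_forward=1",
        f"iptables -t nat -A POSTROUTING -s {subnet} ! -o {bridge} -j MASQUERADE",
    ]
    return [shlex.split(line) for line in lines]


if __name__ == "__main__":
    for cmd in bridge_setup_commands():
        print(shlex.join(cmd))
```

To actually apply it, each entry could be passed to `subprocess.run(cmd, check=True)` as root. Keeping generation and execution separate makes it easy to review the commands first.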

TODO How do things happen with the network namespace

TODO Containers in Practice

TODO Golang Questions

  • What user to use gid etc.
  • Which image to use, why not alpine
    • I recommend against alpine images. They use musl libc, which can be a real troublemaker at times. If you don’t need any tooling, then a distroless image is great. Otherwise debian-slim ticks everything for me.
    • You can also use scratch, but be aware of things like outgoing HTTPS requests, where you need a local CA certificate list for validating certificates.
      • We can also do COPY --from=build /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
    • Distroless
    • Q: Why not OS base image in golang applications? what about cgo.
  • Which image tool to use, docker or something else?
  • If you are using containerization in development and not building inside the container, you are missing out on one of the major advantages of the paradigm.