tags : OCI Ecosystem, Linux

Basics

There is no such thing as a container; containing is something you do, not a thing that exists.

Linux containers

Modern containers are a mix of cgroups, namespaces, pivot_root (like chroot but more appropriate for containers), and seccomp. This is just one possible implementation, though; there are other ways of implementing containers (a minimal by-hand sketch follows the list below).

  • Docker: Docker is a reference implementation for the concept of containers. It provided a simple, practical way to bundle applications and their dependencies in a relatively standardized way. Docker’s main innovation was the layering system.
  • LXC & LXD
  • Other Linux container solutions: Some implementations (e.g. firecracker-containerd) also use SELinux and CPU virtualization support.
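A minimal sketch of combining these primitives by hand, assuming a root filesystem has already been unpacked at /srv/rootfs (a made-up path); real runtimes use pivot_root, seccomp profiles, and much more careful plumbing:

# New mount/UTS/IPC/net/PID namespaces, then confine the filesystem view
# (chroot stands in here for the pivot_root a real runtime would do).
sudo unshare --fork --pid --mount --uts --net --ipc \
    chroot /srv/rootfs /bin/sh
# Resource limits come from cgroups; systemd-run is one convenient way to apply them
# (some-app is a placeholder for whatever you want to confine).
sudo systemd-run --scope -p MemoryMax=256M -p CPUQuota=50% some-app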

Containers without namespaces

  • One could ship application components as containers but have them run as VMs, e.g. firecracker-containerd.
  • Docker for Windows and Docker for Mac do not even use cgroups and namespaces, because these technologies are not available on those platforms; they resort to running containers inside a plain old VM.

Comparison w Solaris Zones and BSD Jails

  • Ramblings from Jessie
  • Solaris Zones, BSD Jails are first class concepts.
  • Containers on the other hand are not real things.
  • “Containers are less isolated than Jails/Zones but allow for a flexibility and control that is not possible with Jails, Zones, or VMs. And THAT IS A FEATURE.”
  • Containers are not a sandbox, but we can attempt to sandbox them, and good attempts have been made. One example is gVisor.

Images

What is it?

  • a .tar
  • We basically bundle the application together with its dependencies and just enough of a Linux root filesystem to run it (see the peek-inside sketch after this list).
  • Usually updated by shipping a new image version rather than being updated in place.
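To see this for yourself (assuming Docker is installed and the image has been pulled; the exact archive layout differs across Docker versions, but it is always tar layers plus JSON metadata):

docker save debian:stable-slim -o debian.tar
tar tf debian.tar | head
# expect manifest/config JSON files plus one tarball per image layer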

Implementations

Dockerfile

Nix/Guix

Filesystem and storage

Overlayfs

  • Containers usually don’t have a persistent root filesystem.
  • overlayfs is used to create a temporary writable layer on top of the read-only container image (a hand-rolled sketch follows this list).
  • This layer is thrown away when the container is stopped.
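A hand-rolled sketch of roughly what the runtime does, assuming an image root filesystem has been unpacked at /srv/image-rootfs (made-up paths):

mkdir -p /tmp/ctr/{upper,work,merged}
sudo mount -t overlay overlay \
    -o lowerdir=/srv/image-rootfs,upperdir=/tmp/ctr/upper,workdir=/tmp/ctr/work \
    /tmp/ctr/merged
# All writes land in upperdir; the image in lowerdir stays read-only.
# Unmounting and deleting upper/work is the “thrown away on stop” step.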

Bind mounts

  • Persistent data outside of the container image is grafted onto the container’s filesystem via a bind mount from another location on the host (sketch below).
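By hand this is just another mount, assuming the merged overlay from the sketch above and a host directory /srv/appdata (made-up paths):

mkdir -p /tmp/ctr/merged/data
sudo mount --bind /srv/appdata /tmp/ctr/merged/data
# Writes to /data inside the container now persist on the host in /srv/appdata.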

Privileged and Unprivileged

Whether the root user in the container is the “real” root user (uid 0 at the kernel level).

Unprivileged containers != Rootless containers

If the processes inside the container run as non-root users but the runtime is still running as root, we don’t call them rootless containers; rootless means the runtime itself runs without root privileges.

Unprivileged containers are created by:

  • Taking a set of normal UIDs and GIDs from the host
  • Usually at least 65536 of each (to be POSIX compliant)
  • Then mapping those into the container
  • Implementations mostly expect /etc/subuid to contain at least 65536 subuids.
  • This allows LXD to do the “shift” in containers because it has a reserved pool of UIDs and GIDs.

subuid and subgid (subordinate UIDs/GIDs)

  • /etc/subuid and /etc/subgid let you assign extra user IDs and group IDs to a particular user. The subuid file contains a list of users and the user IDs each of them is allowed to impersonate.
    • Any resource owned by a user inside the container whose ID is not mapped outside the container shows up as the overflow ID (nobody, typically 65534).
    • The range of IDs you assign is no longer available to be assigned to other users (neither as primary IDs nor via subuid).
    • The user to which they are assigned now ‘owns’ these IDs.
  • The lookup source is configured via the subid field in the /etc/nsswitch.conf (nsswitch) file, which defaults to files (example after this list).
  • Different from setuid and setgid bits
  • See shadow-utils
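To check which source is in use (on many systems the subid line is simply absent, in which case the files default applies):

$ grep subid /etc/nsswitch.conf
subid: files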

Manual configuration

$ cat /etc/subuid
# <user>:<first_subordinate_id>:<number_of_ids_allowed>
user1:100000:65536
$ cat /etc/subgid
user1:100000:65536

Using usermod

usermod --add-subuids 1000000-1000999999 root
usermod --add-subgids 1000000-1000999999 pappu
usermod --add-subuids 1000000-1000999999 --add-subgids 1000000-1000999999 xyzuser

What about uid_map and gid_map ?

These are specific to user namespaces. The newuidmap tool can help: newuidmap sets /proc/[pid]/uid_map based on its command-line arguments and the UIDs the calling user is allowed to use (per /etc/subuid). A sketch follows.
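A sketch with made-up numbers: map container UIDs/GIDs 0-65535 onto host IDs 100000-165535 for a process (PID 4321) that has just unshared a user namespace. This only succeeds if the range is listed in /etc/subuid and /etc/subgid for the calling user:

$ newuidmap 4321 0 100000 65536
$ newgidmap 4321 0 100000 65536
$ cat /proc/4321/uid_map
         0     100000      65536
# container uid 0 is host uid 100000, for a range of 65536 ids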

Container networking

  • Also see Docker networking
  • Network namespaces can be connected using a Linux Virtual Ethernet Device, or veth pair.
  • From a network architecture point of view, all containers on a given Docker host are sitting on bridge interfaces.
  • Different container managers (Docker, Podman, LXD, etc.) provide a number of ways networking can be done with containers.
  • An interesting one is the bridged networking approach, which essentially boils down to 3 things (a by-hand sketch follows this list).
    1. Creating a veth pair from the host into network namespace X. Every new container adds a new veth interface and removes it once the container is stopped, e.g. lxc info <instance_name> will show the veth created for the instance.
    2. Adding a bridge for the veth pairs to talk through. When you install Docker, it automatically creates a docker0 bridge for containers to communicate over. A bridge is an L2 device and uses ARP.
    3. Adding iptables rules to reach the outside network.
  • See Introduction to Linux interfaces for virtual networking and Deep Dive into Linux Networking and Docker - Bridge, vETH and IPTables
  • Also see 8.2.5 About Veth and Macvlan
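A by-hand sketch of those 3 steps with iproute2 and iptables; the names (ctr1, veth-host, veth-ctr, br-ctr) and the 10.0.0.0/24 subnet are made up for illustration:

# 1. veth pair, one end moved into the container's network namespace
ip netns add ctr1
ip link add veth-host type veth peer name veth-ctr
ip link set veth-ctr netns ctr1
# 2. bridge on the host, host end of the veth attached to it
ip link add br-ctr type bridge
ip link set veth-host master br-ctr
ip link set br-ctr up; ip link set veth-host up
ip addr add 10.0.0.1/24 dev br-ctr
ip -n ctr1 addr add 10.0.0.2/24 dev veth-ctr
ip -n ctr1 link set veth-ctr up
ip -n ctr1 route add default via 10.0.0.1
# 3. iptables NAT so the container can reach the outside network
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 ! -o br-ctr -j MASQUERADE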

TODO Containers in Practice

TODO Golang Questions

  • What user/UID and GID to run as, etc.
  • Which image to use, why not alpine
    • I recommend against Alpine images. They use musl libc, which can be a real troublemaker at times. If you don’t need any tooling, then a distroless image is great. Otherwise debian-slim ticks everything for me.
    • You can also use scratch, but be aware of things like outgoing HTTPS requests, where you need a local CA certificate bundle for validating certificates.
      • We can also do COPY --from=build /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
    • Distroless
    • Q: Why not use an OS base image for Golang applications? What about cgo?
  • Which image tool to use, docker or something else?
  • So if you are using containerization in development but not building inside the container, you are missing out on one of the major advantages of the paradigm.