This section covers how containers are isolated from the host, as well as from each other, using kernel namespaces. Namespaces are the most significant kernel feature here: they virtualize resources and isolate processes from each other, and namespaces alone are enough to create a container of sorts, see nsexec.
Here is the definition from the manual page namespaces(7), as there probably isn't a better one.
A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes.
There are 7 namespace types at the moment. Each type always has a global (initial) namespace, so every process is always a member of exactly one namespace of each type.
Linux so far has the following namespaces.
The number in brackets is the kernel version in which the namespace was introduced.
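Namespace membership can be inspected through /proc/<pid>/ns, where each entry is a symbolic link whose target names the namespace type and an inode number; two processes in the same namespace see the same target. The following is a minimal sketch, assuming Linux with procfs mounted at /proc. The function name list_namespaces is only illustrative.

```python
import os


def list_namespaces(pid="self"):
    """Map namespace type -> link target (e.g. 'pid:[<inode>]') for a process."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        return {}  # not Linux, no such process, or procfs is not mounted
    namespaces = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable for this process
    return namespaces


if __name__ == "__main__":
    for name, target in list_namespaces().items():
        print(f"{name:20} {target}")
```

Comparing the output for two PIDs shows which namespaces they share.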
Isolates the mount points. A process has its own view of the mount points and changes are not propagated to other namespaces.
Isolates IPC resources. System V IPC objects and POSIX message queues.
Isolates the process ID number space. Processes in different PID namespaces can have the same PID, and a process cannot see the PIDs of other namespaces, except for those nested below its own.
Isolates the network resources like network devices, IPv4 and IPv6 protocol stacks, IP routing tables, firewalls etc.
Isolates the user and group IDs. An unprivileged user in the "root" namespace can be user ID 0 in the new namespace. When a new user namespace is created, the user gets full capabilities(7) inside the namespace.
Isolates the view of the
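The user namespace behaviour described in the list above can be tried without root. The sketch below forks a child, calls unshare(2) with CLONE_NEWUSER through ctypes, and maps UID 0 inside the new namespace to the caller's UID outside by writing /proc/self/uid_map. It assumes a Linux kernel that allows unprivileged user namespaces; if the kernel or a sandbox forbids it, the function returns None. The function name is hypothetical and the flag value is taken from <linux/sched.h>.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>


def uid_inside_new_user_namespace():
    """Return the UID seen inside a fresh user namespace after mapping it
    to 0, or None if creating the namespace is not permitted here."""
    libc = ctypes.CDLL(None, use_errno=True)
    outer_uid = os.getuid()
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: the only process that changes namespaces
        os.close(read_fd)
        try:
            if libc.unshare(CLONE_NEWUSER) == 0:
                # "inside-uid outside-uid count": UID 0 here is outer_uid there.
                with open("/proc/self/uid_map", "w") as uid_map:
                    uid_map.write(f"0 {outer_uid} 1")
                os.write(write_fd, str(os.getuid()).encode())
        except Exception:
            pass  # treated as "not permitted"; nothing is written to the pipe
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return int(data) if data else None


if __name__ == "__main__":
    print("UID outside:", os.getuid())
    print("UID inside: ", uid_inside_new_user_namespace())
```

The parent process is left untouched because only the forked child unshares.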
The mount namespace isolates mount points, so different namespaces can effectively have different filesystem trees. Changes to mount points may or may not be propagated to other namespaces depending on the propagation type of the mount (shared, private, slave or unbindable), see mount(8). In the container context, mounts are typically set up so that anything happening to mount points inside the container is not propagated elsewhere, which keeps the container's view isolated.
Image courtesy of Wonchang Song
The PID namespace isolates PID numbers. PID namespaces form a hierarchy in which the parent namespace can see all the PIDs in its child namespaces. When a new namespace is created, its first process gets PID 1 and acts as a sort of init process for that namespace. Ideally it should reap its child processes, because every process in a child namespace also occupies a PID in each ancestor namespace, so unreaped zombies can eventually exhaust the root PID space.
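To see the "first process gets PID 1" rule in practice, the sketch below unshares a PID namespace (together with a user namespace, so that no root privileges are needed) and then forks. The process calling unshare(2) keeps its own PID; only its next child becomes PID 1 of the new namespace. This assumes a Linux kernel that allows unprivileged user namespaces; otherwise the function returns None. The helper names are hypothetical and the flag values are taken from <linux/sched.h>.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # values from <linux/sched.h>
CLONE_NEWPID = 0x20000000


def _pid_of_forked_child():
    """Fork once and return, as text, the PID the child sees for itself."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        os.write(write_fd, str(os.getpid()).encode())
        os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)  # reap the child, as an init process should
    return data


def first_pid_in_new_namespace():
    """Return the PID of the first process in a new PID namespace,
    or None if creating the namespace is not permitted here."""
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # intermediate process: unshares, then forks the "init"
        os.close(read_fd)
        try:
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0:
                os.write(write_fd, _pid_of_forked_child())
        except Exception:
            pass  # treated as "not permitted"
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return int(data) if data else None


if __name__ == "__main__":
    print("PID seen inside the new namespace:", first_pid_in_new_namespace())
```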
The network namespace creates a completely new network stack, including
routing tables. In a new network namespace you get just the loopback
interface lo and nothing else, so you are unable to connect to
the network (see nsexec). A physical network interface can
reside in only one namespace at a time, so very often, to connect the
namespace somewhere, the virtual Ethernet device pair
is used together with
In any case
comes in handy for adding a device to the namespace.
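The state of a fresh network namespace can be checked with a short sketch: a forked child unshares a network namespace (together with a user namespace, so that no root privileges are needed) and lists the interfaces it can see with socket.if_nameindex(). This assumes a Linux kernel that allows unprivileged user namespaces; otherwise the function returns None. The function name is hypothetical and the flag values are taken from <linux/sched.h>.

```python
import ctypes
import os
import socket

CLONE_NEWUSER = 0x10000000  # values from <linux/sched.h>
CLONE_NEWNET = 0x40000000


def interfaces_in_new_netns():
    """Return the interface names visible in a fresh network namespace,
    or None if creating the namespace is not permitted here."""
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: the only process that changes namespaces
        os.close(read_fd)
        try:
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWNET) == 0:
                names = [name for _index, name in socket.if_nameindex()]
                os.write(write_fd, ",".join(names).encode())
        except Exception:
            pass  # treated as "not permitted"
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read().decode()
    os.waitpid(pid, 0)
    return data.split(",") if data else None


if __name__ == "__main__":
    print("host namespace:", [name for _index, name in socket.if_nameindex()])
    print("new namespace: ", interfaces_in_new_netns())
```

On the host the list normally contains the real interfaces, while the new namespace should show lo; depending on which kernel modules are loaded, a few fallback tunnel devices may appear as well.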
Creating new namespaces
There are two syscalls for creating a new namespace.
is to disassociate from the parent process context and thus create a new one.
There is also setns(2) which allows you to enter an existing namespace.
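As an illustration of creating a namespace from within a running process, the sketch below calls unshare(2) through ctypes to get a new UTS namespace and then changes the hostname inside it, similar in spirit to what nsexec does with --uts. It assumes a Linux kernel that allows unprivileged user namespaces (CLONE_NEWUSER is added so that no root privileges are needed); otherwise the function returns None. The function name is hypothetical and the flag values are taken from <linux/sched.h>.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000  # values from <linux/sched.h>
CLONE_NEWUSER = 0x10000000


def hostname_in_new_uts_namespace(new_name="myhost"):
    """Set a hostname inside a fresh UTS namespace and return what is seen
    there, or None if creating the namespace is not permitted here."""
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: the only process that changes namespaces
        os.close(read_fd)
        try:
            libc.sethostname.argtypes = [ctypes.c_char_p, ctypes.c_size_t]
            raw = new_name.encode()
            if (libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0
                    and libc.sethostname(raw, len(raw)) == 0):
                os.write(write_fd, socket.gethostname().encode())
        except Exception:
            pass  # treated as "not permitted"
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        seen = pipe.read().decode()
    os.waitpid(pid, 0)
    return seen or None


if __name__ == "__main__":
    print("hostname before:", socket.gethostname())
    print("hostname inside:", hostname_in_new_uts_namespace("myhost"))
    print("hostname after: ", socket.gethostname())
```

The hostname of the calling process stays the same, because the change is confined to the child's UTS namespace.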
unshare and nsenter in the shell
$ unshare --fork --pid --mount-proc
Runs a new shell in its own PID namespace. It needs to remount procfs,
as otherwise tools like
ps would still show the processes of the parent namespace.
nsexec is a minimal example of how to use namespaces to isolate
processes, and one could argue that it creates a container using the
host filesystem and programs.
$ ./nsexec --help
Create a child process that executes a shell command in new namespace(s),
Usage: ./nsexec [OPTIONS] <CMD>

    -h, --help           print this help
    -n, --net            new network namespace
    -p, --pid            new PID namespace
    -u, --uts HOSTNAME   new UTS namespace
    -v, --verbose        more verbose output
    <CMD>                command to be executed
See the Code
$ sudo ./nsexec -npu myhost bash
myhost> ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 10:45 pts/3    00:00:00 bash
root         6     1  0 10:45 pts/3    00:00:00 ps -ef
myhost> ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
myhost> exit
exit