Use private cgroup namespaces for cgroup v2 #63

Draft
wants to merge 1 commit into
base: main
56 changes: 52 additions & 4 deletions pkg/cluster/cluster.go
@@ -12,6 +12,7 @@ import (
 	"fmt"
 	"io"
 	"os"
+	"path"
 	"path/filepath"
 	"regexp"
 	"strconv"
@@ -307,10 +308,57 @@ func (c *Cluster) createMachineRunArgs(machine *Machine, name string, i int) []s
 		"--tmpfs", "/tmp:exec,mode=777",
 	}
 	if docker.CgroupVersion() == "2" {
-		runArgs = append(runArgs, "--cgroupns", "host",
-			"--cgroup-parent", "bootloose.slice",
-			"-v", "/sys/fs/cgroup:/sys/fs/cgroup:rw")
-
+		runArgs = append(runArgs, "--cgroupns", "private")
+
+		if !machine.spec.Privileged {
+			// Non-privileged containers will have their /sys/fs/cgroup folder
+			// mounted read-only, even when running in private cgroup
+			// namespaces. This is a bummer for init systems. Containers could
+			// probably remount the cgroup fs in read-write mode, but that would
+			// require CAP_SYS_ADMIN _and_ custom logic in the container's
+			// entry point. Podman has `--security-opt unmask=/sys/fs/cgroup`,
+			// but that's not a thing for Docker. The only other way to get a
+			// writable cgroup fs inside the container is to explicitly mount
+			// it. Some references:
+			// - https://github.com/moby/moby/issues/42275
+			// - https://serverfault.com/a/1054414
+
+			// Docker will use cgroups like
+			// <cgroup-parent>/docker-{{ContainerID}}.scope.
+			//
+			// Ideally, we could mount those to /sys/fs/cgroup inside the
+			// containers. But there's a chicken-and-egg problem, as we only
+			// know the container ID _after_ the container creation. As a
+			// duct-tape solution, we mount our own cgroup as the root, which is
+			// unrelated to the Docker-managed one:
+			// <cgroup-parent>/cluster-{{ClusterID}}.scope/machine-{{MachineID}}.scope
+
+			// FIXME: How to clean this up? Especially when Docker is being run
+			// on a different machine?
Comment on lines +336 to +337
Member Author

I think this is the only remaining concern with this approach. It improves the isolation of bootloose machines a lot, but the cgroups created inside those machines won't be cleaned up. They would be if we could somehow leverage the Docker-managed cgroups, but the chicken-and-egg problem stated above rules that out.

Member Author

Maybe we could run bootloose itself in a Docker container to do the cleanup? This would be a pretty heavy cleanup procedure, but it is the only way I can think of to tackle this.


+			// Just assume that the cgroup fs is mounted at its default
+			// location. We could try to figure this out via
+			// /proc/self/mountinfo, but it's really not worth the hassle.
+			const cgroupMountpoint = "/sys/fs/cgroup"
+
+			// Use this as the parent cgroup for everything. Note that if Docker
+			// uses the systemd cgroup driver, the cgroup name has to end with
+			// .slice. This is not a requirement for the cgroupfs driver; it
+			// won't care. Hence, just always use the .slice suffix, no matter
+			// if it's required or not.
+			const cgroupParent = "bootloose.slice"
+
+			cg := path.Join(
+				cgroupMountpoint, cgroupParent,
+				fmt.Sprintf("cluster-%s.scope", c.spec.Cluster.Name),
+				fmt.Sprintf("machine-%s.scope", name),
+			)
+
+			runArgs = append(runArgs,
+				"--cgroup-parent", cgroupParent,
+				"-v", fmt.Sprintf("%s:%s:rw", cg, cgroupMountpoint),
+			)
+		}
 	} else {
 		runArgs = append(runArgs, "-v", "/sys/fs/cgroup:/sys/fs/cgroup:ro")
 	}