Skip to content

Commit

Permalink
runtime-config-linux: Separate mknod from cgroups
Browse files Browse the repository at this point in the history
With mknod entries in linux.devices and cgroups entries in
linux.resources.devices.  Background discussion in [1].

For specifying device cgroups independent of device creation.  This
makes it easy to distinguish between configs that call for cgroup
adjustments (which have linux.resources entries) from those that
don't.  Without this split, folks interested in making that
distinction would have to parse the device section to determine if it
included cgroup changes.  This will also make it easy to drop either
portion (mknod [2] or cgroups [3]) independently of the other if the
project decides to do so.

Using seperate sections for mknod and cgroups also allows us to avoid
the complicated validation rules needed for the combined format
mknod/cgroup [4].

Now that there is a section specific to supplying devices, I shifted
the default device listing over from config-linux [5].  The /dev/ptmx
entry is a bit awkward, since it's not a device, but it seemed to fit
better over here.  But I would also be fine leaving it with the other
mounts in config-linux.

fileMode, uid, and gid are optional, because mknod(2) doesn't need
them and specifies the handling when they aren't set [6,7].
Similarly, major/minor numbers are only required for S_IFCHR and
S_IFBLK [6].  I've left off wording about required runtime behavior
for unset values, because I'd rather address that with a blanket rule
[8].

For the cgroup, access is optional because the kernel docs show an
example that doesn't write an access field to the devices.deny file
[9].  The current kernel docs don't go into much detail on this
behavior (I expect unset and 'rwm' are equivalent), but if the kernel
doesn't need a value written, the spec should get out of the way and
allow users to not specify a value.

The reference links are sorted into two blocks, with kernel-doc links
sorted alphabetically followed by man pages sorted alphabetically by
section.  The cgroup link is new since 2016-01-13 [10].

[1]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/y_Fsa2_jJaM
     Subject: Separate config entries for device mknod and cgroups?
     Date: Mon, 5 Oct 2015 12:46:55 -0700
     Message-ID: <20151005194655.GN28418@odin.tremily.us>
[2]: opencontainers#98
[3]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/qWHoKs8Fsrk
     Subject: removal of cgroups from the OCI Linux spec
     Date: Wed, 28 Oct 2015 17:01:59 +0000
     Message-ID: <CAD2oYtO1RMCcUp52w-xXemzDTs+J6t4hS5Mm4mX+uBnVONGDfA@mail.gmail.com>
[4]: opencontainers#101
[5]: opencontainers#171 (comment)
[6]: http://man7.org/linux/man-pages/man2/mknod.2.html#DESCRIPTION
[7]: https://github.com/opencontainers/specs/pull/298/files#r51053835
[8]: opencontainers#285 (comment)
[9]: https://kernel.org/doc/Documentation/cgroup-v1/devices.txt
[10]: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34a9304a96d6351c2d35dcdc9293258378fc0bd8

Signed-off-by: W. Trevor King <wking@tremily.us>
  • Loading branch information
wking authored and Ma Shimiao committed Aug 18, 2016
1 parent 4649a63 commit 0b0c35e
Show file tree
Hide file tree
Showing 2 changed files with 115 additions and 89 deletions.
176 changes: 94 additions & 82 deletions config-linux.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,27 +16,19 @@ Valid values are the strings for capabilities defined in [the man page](http://m
]
```

## Default Devices and File Systems
## Default File Systems

The Linux ABI includes both syscalls and several special file paths.
Applications expecting a Linux environment will very likely expect these files paths to be setup correctly.

The following devices and filesystems MUST be made available in each application's filesystem

| Path | Type | Notes |
| ------------ | ------ | ------- |
| /proc | [procfs](https://www.kernel.org/doc/Documentation/filesystems/proc.txt) | |
| /sys | [sysfs](https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt) | |
| /dev/null | [device](http://man7.org/linux/man-pages/man4/null.4.html) | |
| /dev/zero | [device](http://man7.org/linux/man-pages/man4/zero.4.html) | |
| /dev/full | [device](http://man7.org/linux/man-pages/man4/full.4.html) | |
| /dev/random | [device](http://man7.org/linux/man-pages/man4/random.4.html) | |
| /dev/urandom | [device](http://man7.org/linux/man-pages/man4/random.4.html) | |
| /dev/tty | [device](http://man7.org/linux/man-pages/man4/tty.4.html) | |
| /dev/console | [device](http://man7.org/linux/man-pages/man4/console.4.html) | |
| /dev/pts | [devpts](https://www.kernel.org/doc/Documentation/filesystems/devpts.txt) | |
| /dev/ptmx | [device](https://www.kernel.org/doc/Documentation/filesystems/devpts.txt) | Bind-mount or symlink of /dev/pts/ptmx |
| /dev/shm | [tmpfs](https://www.kernel.org/doc/Documentation/filesystems/tmpfs.txt) | |
The following filesystems MUST be made available in each application's filesystem

| Path | Type |
| -------- | ------ |
| /proc | [procfs](https://www.kernel.org/doc/Documentation/filesystems/proc.txt) |
| /sys | [sysfs](https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt) |
| /dev/pts | [devpts](https://www.kernel.org/doc/Documentation/filesystems/devpts.txt) |
| /dev/shm | [tmpfs](https://www.kernel.org/doc/Documentation/filesystems/tmpfs.txt) |

## Namespaces

Expand Down Expand Up @@ -115,93 +107,59 @@ There is a limit of 5 mappings which is the Linux kernel hard limit.

## Devices

`devices` is an array specifying the list of devices to be created in the container.
`devices` is an array specifying the list of devices that MUST be available in the container.
The runtime may supply them however it likes (with [mknod][mknod.2], by bind mounting from the runtime mount namespace, etc.).

The following parameters can be specified:

* **`type`** *(char, required)* - type of device: `c`, `b`, `u` or `p`. More info in `man mknod`.

* **`path`** *(string, optional)* - full path to device inside container

* **`major, minor`** *(int64, required)* - major, minor numbers for device. More info in `man mknod`. There is a special value: `-1`, which means `*` for `device` cgroup setup.

* **`permissions`** *(string, optional)* - cgroup permissions for device. A composition of `r` (*read*), `w` (*write*), and `m` (*mknod*).

* **`fileMode`** *(uint32, optional)* - file mode for device file

* **`uid`** *(uint32, optional)* - uid of device owner

* **`gid`** *(uint32, optional)* - gid of device owner

**`fileMode`**, **`uid`** and **`gid`** are required if **`path`** is given and are otherwise not allowed.
* **`type`** *(char, required)* - type of device: `c`, `b`, `u` or `p`.
More info in [mknod(1)][mknod.1].
* **`path`** *(string, required)* - full path to device inside container.
* **`major, minor`** *(int64, required unless **`type`** is `p`)* - [major, minor numbers][devices] for the device.
* **`fileMode`** *(uint32, optional)* - file mode for the device.
You can also control access to devices [with cgroups](#device-whitelist).
* **`uid`** *(uint32, optional)* - id of device owner.
* **`gid`** *(uint32, optional)* - id of device group.

###### Example

```json
"devices": [
{
"path": "/dev/random",
"path": "/dev/fuse",
"type": "c",
"major": 1,
"minor": 8,
"permissions": "rwm",
"major": 10,
"minor": 229,
"fileMode": 0666,
"uid": 0,
"gid": 0
},
{
"path": "/dev/urandom",
"type": "c",
"major": 1,
"minor": 9,
"permissions": "rwm",
"fileMode": 0666,
"uid": 0,
"gid": 0
},
{
"path": "/dev/null",
"type": "c",
"major": 1,
"minor": 3,
"permissions": "rwm",
"fileMode": 0666,
"uid": 0,
"gid": 0
},
{
"path": "/dev/zero",
"type": "c",
"major": 1,
"minor": 5,
"permissions": "rwm",
"fileMode": 0666,
"uid": 0,
"gid": 0
},
{
"path": "/dev/tty",
"type": "c",
"major": 5,
"path": "/dev/sda",
"type": "b",
"major": 8,
"minor": 0,
"permissions": "rwm",
"fileMode": 0666,
"uid": 0,
"gid": 0
},
{
"path": "/dev/full",
"type": "c",
"major": 1,
"minor": 7,
"permissions": "rwm",
"fileMode": 0666,
"fileMode": 0660,
"uid": 0,
"gid": 0
}
]
```

###### Default Devices

In addition to any devices configured with this setting, the runtime MUST also supply:

* [`/dev/null`][null.4]
* [`/dev/zero`][zero.4]
* [`/dev/full`][full.4]
* [`/dev/random`][random.4]
* [`/dev/urandom`][random.4]
* [`/dev/tty`][tty.4]
* [`/dev/console`][console.4]
* [`/dev/ptmx`][pts.4].
A [bind-mount or symlink of the container's `/dev/pts/ptmx`][devpts].

## Control groups

Also known as cgroups, they are used to restrict resource usage for a container and handle device access.
Expand All @@ -228,6 +186,46 @@ You can configure a container's cgroups via the `resources` field of the Linux c
Do not specify `resources` unless limits have to be updated.
For example, to run a new process in an existing container without updating limits, `resources` need not be specified.

#### Device whitelist

`devices` is an array of entries to control the [device whitelist][cgroups-devices].
The runtime MUST apply entries in the listed order.

The following parameters can be specified:

* **`allow`** *(boolean, required)* - whether the entry is allowed or denied.
* **`type`** *(char, optional)* - type of device: `a` (all), `c` (char), or `b` (block).
`null` or unset values mean "all", mapping to `a`.
* **`major, minor`** *(int64, optional)* - [major, minor numbers][devices] for the device.
`null` or unset values mean "all", mapping to [`*` in the filesystem API][cgroups-devices].
* **`access`** *(string, optional)* - cgroup permissions for device.
A composition of `r` (read), `w` (write), and `m` (mknod).

###### Example

```json
"devices": [
{
"allow": false,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 10,
"minor": 229,
"access": "rw"
},
{
"allow": true,
"type": "b",
"major": 8,
"minor": 0,
"access": "r"
}
]
```

#### Disable out-of-memory killer

`disableOOMKiller` contains a boolean (`true` or `false`) that enables or disables the Out of Memory killer for a cgroup.
Expand Down Expand Up @@ -587,3 +585,17 @@ Setting `noNewPrivileges` to true prevents the processes in the container from g
```json
"noNewPrivileges": true,
```

[cgroups-devices]: https://www.kernel.org/doc/Documentation/cgroup-v1/devices.txt
[devices]: https://www.kernel.org/doc/Documentation/devices.txt
[devpts]: https://www.kernel.org/doc/Documentation/filesystems/devpts.txt

[mknod.1]: http://man7.org/linux/man-pages/man1/mknod.1.html
[mknod.2]: http://man7.org/linux/man-pages/man2/mknod.2.html
[console.4]: http://man7.org/linux/man-pages/man4/console.4.html
[full.4]: http://man7.org/linux/man-pages/man4/full.4.html
[null.4]: http://man7.org/linux/man-pages/man4/null.4.html
[pts.4]: http://man7.org/linux/man-pages/man4/pts.4.html
[random.4]: http://man7.org/linux/man-pages/man4/random.4.html
[tty.4]: http://man7.org/linux/man-pages/man4/tty.4.html
[zero.4]: http://man7.org/linux/man-pages/man4/zero.4.html
28 changes: 21 additions & 7 deletions config_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ type Linux struct {
CgroupsPath *string `json:"cgroupsPath,omitempty"`
// Namespaces contains the namespaces that are created and/or joined by the container
Namespaces []Namespace `json:"namespaces"`
// Devices are a list of device nodes that are created and enabled for the container
// Devices are a list of device nodes that are created for the container
Devices []Device `json:"devices"`
// ApparmorProfile specified the apparmor profile for the container.
ApparmorProfile string `json:"apparmorProfile"`
Expand Down Expand Up @@ -213,6 +213,8 @@ type Network struct {

// Resources has container runtime resource constraints
type Resources struct {
// Devices are a list of device rules for the whitelist controller
Devices []DeviceCgroup `json:"devices"`
// DisableOOMKiller disables the OOM killer for out of memory conditions
DisableOOMKiller *bool `json:"disableOOMKiller,omitempty"`
// Specify an oom_score_adj for the container.
Expand All @@ -231,7 +233,7 @@ type Resources struct {
Network *Network `json:"network,omitempty"`
}

// Device represents the information on a Linux special device file
// Device represents the mknod information for a Linux special device file
type Device struct {
// Path to the device.
Path string `json:"path"`
Expand All @@ -241,14 +243,26 @@ type Device struct {
Major int64 `json:"major"`
// Minor is the device's minor number.
Minor int64 `json:"minor"`
// Cgroup permissions format, rwm.
Permissions string `json:"permissions"`
// FileMode permission bits for the device.
FileMode os.FileMode `json:"fileMode"`
FileMode *os.FileMode `json:"fileMode,omitempty"`
// UID of the device.
UID uint32 `json:"uid"`
UID *uint32 `json:"uid,omitempty"`
// Gid of the device.
GID uint32 `json:"gid"`
GID *uint32 `json:"gid,omitempty"`
}

// DeviceCgroup represents a device rule for the whitelist controller
type DeviceCgroup struct {
// Allow or deny
Allow bool `json:"allow"`
// Device type, block, char, etc.
Type *rune `json:"type,omitempty"`
// Major is the device's major number.
Major *int64 `json:"major,omitempty"`
// Minor is the device's minor number.
Minor *int64 `json:"minor,omitempty"`
// Cgroup access permissions format, rwm.
Access *string `json:"access,omitempty"`
}

// Seccomp represents syscall restrictions
Expand Down

0 comments on commit 0b0c35e

Please sign in to comment.