Run aws-node as unprivileged pod #796

Closed · mogren opened this issue Jan 17, 2020 · 20 comments
Labels: 2.x CNI plugin (Features and issues to address in 2.x CNI plugin), enhancement

mogren (Contributor) commented Jan 17, 2020

We currently set the aws-node pod to be privileged, but that might not be needed.

Check if CAP_NET_ADMIN and CAP_DAC_OVERRIDE are enough to set the RPF check to loose and to copy the binary and config file:

// If node port support is enabled, configure the kernel's reverse path filter check on eth0 for "loose"
// filtering. This is required because
// - NodePorts are exposed on eth0
// - The kernel's RPF check happens after incoming packets to NodePorts are DNATted to the pod IP.
// - For pods assigned to secondary ENIs, the routing table includes source-based routing. When the kernel does
// the RPF check, it looks up the route using the pod IP as the source.
// - Thus, it finds the source-based route that leaves via the secondary ENI.
// - In "strict" mode, the RPF check fails because the return path uses a different interface to the incoming
// packet. In "loose" mode, the check passes because some route was found.
primaryIntfRPFilter := "/proc/sys/net/ipv4/conf/" + primaryIntf + "/rp_filter"
const rpFilterLoose = "2"
log.Debugf("Setting RPF for primary interface: %s", primaryIntfRPFilter)
err = n.setProcSys(primaryIntfRPFilter, rpFilterLoose)
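
For reference, setProcSys isn't shown in the snippet above. A minimal sketch of what such a helper might look like (an assumption for illustration, not the actual implementation in this repo):

package networkutils

import "os"

// setProcSys writes a value to a /proc/sys entry. Sketch only: the write
// fails with "read-only file system" when /proc/sys is not mounted
// writable, which is exactly the failure mode hit below.
func setProcSys(path, value string) error {
	return os.WriteFile(path, []byte(value), 0644)
}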

mogren (Contributor, Author) commented Jan 20, 2020

Unfortunately, having the capabilities limited to ["NET_ADMIN", "DAC_OVERRIDE"] is not enough to write to /proc/sys/net/*. From the logs:

[DEBUG] Setting RPF for primary interface: /proc/sys/net/ipv4/conf/eth0/rp_filter
[ERROR] Failed to set up host networkfailed to configure eth0 RPF check: open /proc/sys/net/ipv4/conf/eth0/rp_filter: read-only file system
[ERROR] Initialization failure: ipamd init: failed to set up host network: failed to configure eth0 RPF check: open /proc/sys/net/ipv4/conf/eth0/rp_filter: read-only file system

mogren closed this as completed Jan 20, 2020
SaranBalaji90 changed the title from "Try to limit permissions on the ipamd pod" to "Move rp_filter setting outside aws-node and run aws-node as unprivileged pod" Mar 23, 2020
SaranBalaji90 changed the title from "Move rp_filter setting outside aws-node and run aws-node as unprivileged pod" to "Run aws-node as unprivileged pod" Mar 23, 2020
SaranBalaji90 (Contributor) commented Mar 23, 2020

It looks like /proc gets mounted with write access only for privileged pods.

There are a few ways to remove the privileged permission from aws-node:

  1. Set rp_filter through an init container:
    initContainers:
      - command:
        - sh
        - -c
        - sysctl net.ipv4.conf.eth0.rp_filter=2
        image: golang:1.13-stretch
        name: rp_filter_setting
        securityContext:
          privileged: true
  2. Set rp_filter when the EC2 instance starts up. The problem with this approach is that anyone who builds a custom AMI has to make sure the change is added to their AMI build scripts.
  3. Enable unsafe sysctls - https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/#enabling-unsafe-sysctls - but this doesn't help us, because the kubelet rejects the pod allocation when a net.* sysctl is used together with host networking: https://github.com/kubernetes/kubernetes/blob/7f23a743e8c23ac6489340bbb34fa6f1d392db9d/pkg/kubelet/sysctl/whitelist.go#L89

This leaves us with option 1 as the most suitable for our use case.

Proposed solution:

  1. Add a flag that tells the aws-node pod whether to perform or skip setting the rp_filter value for eth0 (see the sketch after this list).
  2. Add an init container as described above and remove "privileged" mode for aws-node.
  3. At a later time, deprecate the flag and remove it completely from aws-node.
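
A minimal sketch of how the flag in step 1 could gate the existing rp_filter code (the DISABLE_RP_FILTER_UPDATE name is hypothetical, not an existing aws-node setting):

package networkutils

import "os"

// skipRPFilterUpdate reports whether the (hypothetical) environment
// variable DISABLE_RP_FILTER_UPDATE asks aws-node to leave rp_filter
// alone, e.g. because an init container has already configured it.
func skipRPFilterUpdate() bool {
	return os.Getenv("DISABLE_RP_FILTER_UPDATE") == "true"
}

The existing condition would then become if n.nodePortSupportEnabled && !skipRPFilterUpdate() { ... } around the RPF setup.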

SaranBalaji90 reopened this Mar 23, 2020
M00nF1sh (Contributor) commented Mar 24, 2020

I feel like we should use src_valid_mark instead: torvalds/linux@28f6aee

BTW, it's indeed possible to avoid any sysctl setting, e.g. by mangling traffic from eth0 with a custom TOS value and adding a routing policy rule for that TOS value that uses the main table, but it's a bit tricky.

https://github.com/torvalds/linux/blob/v4.14/net/ipv4/route.c#L1879
https://github.com/torvalds/linux/blob/v4.14/net/ipv4/route.c#L1660
https://github.com/torvalds/linux/blob/v4.14/net/ipv4/route.c#L1679
https://github.com/torvalds/linux/blob/v4.14/net/ipv4/fib_frontend.c#L315

mogren (Contributor, Author) commented Mar 25, 2020

PR #130, where this code was added, has some good comments.

SaranBalaji90 (Contributor) commented Mar 25, 2020

Nice deep dive @M00nF1sh. I tested both of your suggestions and both seem to work fine:

# with TOS
sudo iptables -A PREROUTING -t mangle -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j TOS --set-tos 0x08
sudo ip rule add pref 1025 tos 0x08 table main

# with MARK
sudo iptables -A PREROUTING -t mangle -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j MARK --set-xmark 0x80/0x80
sudo sysctl net.ipv4.conf.eth0.src_valid_mark=1

Removed the 'if' block from

if n.nodePortSupportEnabled {

and was able to get aws-node running with the following security context:

securityContext:
  capabilities:
    add:
    - NET_ADMIN

anguslees (Contributor) commented Mar 25, 2020

We should continue the above investigation, but for context I note that we read-write mount the host's /var/log (can rewrite logs) and /var/run/docker{,shim}.sock (host-root equivalent) into the aws-node container. We will also need to drop or reduce these hostPath volumes somehow before privileged=false has any meaning.

SaranBalaji90 (Contributor) commented Apr 1, 2020

For now, I'm thinking of checking whether we have write access to the net.ipv4.conf.eth0.rp_filter file, and updating rp_filter only if we do. With this we don't have to introduce another env variable for users to decide whether to perform this operation. This would simplify the user experience with respect to updates (it helps both variants of update that users perform: just editing the aws-node DaemonSet version number, as well as applying the manifest completely).
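
A rough sketch of that check, assuming unix.Access from golang.org/x/sys is used to probe writability (the function and constant names here are illustrative):

package networkutils

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

const primaryIntfRPFilter = "/proc/sys/net/ipv4/conf/eth0/rp_filter"

// updateRPFilterIfWritable sets rp_filter to "loose" (2) only when the
// proc entry is writable, so an unprivileged aws-node with a read-only
// /proc/sys skips the update instead of failing initialization.
func updateRPFilterIfWritable() error {
	if err := unix.Access(primaryIntfRPFilter, unix.W_OK); err != nil {
		log.Printf("%s is not writable, skipping rp_filter update: %v", primaryIntfRPFilter, err)
		return nil
	}
	return os.WriteFile(primaryIntfRPFilter, []byte("2"), 0644)
}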

mogren (Contributor, Author) commented Jun 10, 2020

Resolved by adding the init container in #955

mogren closed this as completed Jun 10, 2020
anguslees (Contributor) commented Aug 18, 2020

This was closed prematurely, so I'm reopening it so we don't lose the remaining action item raised earlier.

#955 moved the literal privileged=true Kubernetes option to an earlier init container, but we still expose the CRI socket to the aws-node persistent container. This allows the aws-node container to trivially (e.g.) start a new privileged container, so the aws-node pod remains "privileged" in every practical sense.

Remaining action item:

  • Remove CRI socket from aws-node container (or equivalent docker/containerd socket)

(For tracking: Write access to all of /var/log was removed in #987, and the docker socket was removed in #1075)

TBBle commented Dec 5, 2020

Per aws/containers-roadmap#1048 (comment), the aws-node Pod needs NET_RAW as well as NET_ADMIN. It's currently undeclared in the DaemonSet, but because it's one of the default capabilities added by the Docker runtime, its absence is not noticed until a PodSecurityPolicy tries to take it away.

The symptom observed is:

daemonset pods get scheduled but fail. On top of that the aws-k8s-agent fails out silently.

jayanthvn self-assigned this Dec 16, 2020
jayanthvn (Contributor) commented:

#1352 will remove CRI socket read from aws-node. This will be merged to v1.8 release.

jayanthvn added this to the v1.8.0 milestone Jan 27, 2021
jayanthvn removed this from the v1.8.0 milestone Apr 21, 2021
jayanthvn assigned haouc and unassigned jayanthvn and couralex6 Jun 10, 2021
csuzhangxc commented Nov 9, 2021

#1352 will remove CRI socket read from aws-node. This will be merged to v1.8 release.

This PR is still not merged, even though v1.9.3 has been released.

jayanthvn added this to the v1.11.0 milestone Feb 25, 2022
jayanthvn assigned M00nF1sh and unassigned haouc Feb 25, 2022
jayanthvn removed this from the v1.11.0 milestone Apr 18, 2022
github-actions (bot) commented:

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions bot added the "stale" label Jun 18, 2022
TBBle commented Jun 19, 2022

#1352 was closed unmerged

Closing this PR since this will be handled as part of v1.11.0 release.

but I don't see any change removing the CRI socket read (i.e. bumping checkpointMigrationPhase to 2) in that release. I do see #1958, which looks like it was preparing to support not reading from the CRI socket, but it did not itself make that change.

Since 1.11.0 added to the data being stored in the datastore file, I guess this can't be advanced until at least 1.12.0: users will need to upgrade to a 1.11.x version first, so the right data is in the file before upgrading to a phase-2 release, in order to preserve the no-reboot upgrade path.

github-actions bot removed the "stale" label Jun 20, 2022
github-actions (bot) commented:

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions bot added the "stale" label Sep 21, 2022
TBBle commented Sep 22, 2022

#1352 lives again.

github-actions bot removed the "stale" label Sep 23, 2022
jayanthvn (Contributor) commented:

Moved to checkpoint migration phase 2 and switched to using a state file instead of the CRI socket for IP allocation pool restore: #2110. This is part of the v1.12.0 release - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.12.0.

github-actions (bot) commented:

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

github-actions bot added the "stale" label Jan 11, 2023
M00nF1sh removed the "stale" label Jan 12, 2023
jdn5126 (Contributor) commented Jan 12, 2023

Closing this, as the init container must run with privileged access, while the main container does not.

jdn5126 closed this as completed Jan 12, 2023
github-actions (bot) commented:

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
