Skip to content
This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

chore(CIS): add protect-kernel-defaults #999

Merged
merged 7 commits into from
Apr 25, 2019

Conversation

andyzhangx
Copy link
Contributor

@andyzhangx andyzhangx commented Apr 10, 2019

Reason for Change:

fix: add protect-kernel-defaults for CIS conformance

about protect-kernel-defaults: Default kubelet behaviour for kernel tuning. If set, kubelet errors if any of kernel tunables is different than kubelet defaults.

[FAIL] 2.1.6 Ensure that the --protect-kernel-defaults argument is set to true (Scored)
2.1.6 If using a Kubelet config file, edit the file to set protectKernelDefaults: true .
If using command line arguments, edit the kubelet service file
/etc/systemd/system/kubelet.service on each worker node and
set the below parameter in KUBELET_SYSTEM_PODS_ARGS variable.
--protect-kernel-defaults=true
Based on your system, restart the kubelet service. For example:
systemctl daemon-reload
systemctl restart kubelet.service

Issue Fixed:

Fixes #446

Requirements:

Notes:

/assign @jackfrancis @CecileRobertMichon
cc  @feiskyer

@andyzhangx
Copy link
Contributor Author

/azp run pr-e2e

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@codecov
Copy link

codecov bot commented Apr 10, 2019

Codecov Report

Merging #999 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #999      +/-   ##
==========================================
+ Coverage   74.43%   74.43%   +<.01%     
==========================================
  Files         131      131              
  Lines       18318    18319       +1     
==========================================
+ Hits        13635    13636       +1     
  Misses       3902     3902              
  Partials      781      781

@andyzhangx
Copy link
Contributor Author

it looks like this flag would make kubelet down:

Apr 10 13:07:45 k8s-master-29058287-0 kubelet[23084]: F0410 13:07:45.841909   23084 kubelet.go:1324] Failed to start ContainerManager [Invalid kernel flag: vm/overcommit_memory, expected value: 1, actual value: 0, Invalid kernel flag: kernel/panic, expected value: 10, actual value: 0, Invalid kernel flag: kernel/panic_on_oops, expected value: 1, actual value: 0]
Apr 10 13:07:45 k8s-master-29058287-0 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Apr 10 13:07:45 k8s-master-29058287-0 systemd[1]: kubelet.service: Unit entered failed state.
Apr 10 13:07:45 k8s-master-29058287-0 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 10 13:07:46 k8s-master-29058287-0 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.

/hold
Let me hold this PR to do further investigation.

@andyzhangx

This comment has been minimized.

@andyzhangx andyzhangx closed this Apr 15, 2019
@andyzhangx andyzhangx reopened this Apr 19, 2019
@andyzhangx andyzhangx changed the title fix: add protect-kernel-defaults for CIS conformance [WIP] fix: add protect-kernel-defaults for CIS conformance Apr 19, 2019
@andyzhangx
Copy link
Contributor Author

/hold
/azp run pr-e2e

@andyzhangx andyzhangx changed the title [WIP] fix: add protect-kernel-defaults for CIS conformance [WIP] chore(CIS): add protect-kernel-defaults Apr 19, 2019
@andyzhangx
Copy link
Contributor Author

andyzhangx commented Apr 19, 2019

Interesting thing:

  • In the beginning on VM set up, /proc/sys/kernel/panic is 0, so kubelet could not start up with --protect-kernel-defaults=true since /proc/sys/kernel/panic check failed, and then when I removed the --protect-kernel-defaults=true flag manually on that machine, and kubelet could start up, and then /proc/sys/kernel/panic changed to 10. This time I could configure kubelet with --protect-kernel-defaults=true flag, all are working well.
$ cat /proc/sys/kernel/panic
0
# remove  "--protect-kernel-defaults=true" flag in "/etc/default/kubelet" file
$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet
Warning: kubelet.service changed on disk. Run 'systemctl daemon-reload' to reload units.
$ cat /proc/sys/kernel/panic
10

That's the reason why this PR always failed.
cc @jackfrancis @feiskyer
It looks like kubelet start up could affect /proc/sys/kernel/panic value.

@jackfrancis
Copy link
Member

@cchildress @khenidak @cpuguy83 Any thoughts on changing the host OS setting /proc/sys/kernel/panic to 10? (from 0, i.e., disabled)

@andyzhangx andyzhangx force-pushed the cis-protect-kernel-defaults branch from f74dfa3 to 83014d0 Compare April 22, 2019 03:16
@acs-bot acs-bot added size/S and removed size/XS labels Apr 22, 2019
@andyzhangx
Copy link
Contributor Author

andyzhangx commented Apr 22, 2019

Thanks guys, I have figured out that I should set kernel turnable falgs before set up kubelet, it works now.
The code change only happens on Linux, while there are some Windows test failures.
Update:
There are no failures on Windows, it's intermittent failures.

@andyzhangx andyzhangx changed the title [WIP] chore(CIS): add protect-kernel-defaults chore(CIS): add protect-kernel-defaults Apr 22, 2019
@andyzhangx
Copy link
Contributor Author

/hold cancel

@jackfrancis
Copy link
Member

@andyzhangx I've moved bits around here:

  1. now validating config via E2E
  2. made this a default and not a static (i.e., not user-overridable) configuration
  3. added unit tests

I'd still like a review from one of @cchildress @khenidak @cpuguy83 @weinong @yangl900 before merging

@CecileRobertMichon CecileRobertMichon added the needs-rebase Changes in the target branch require a `git rebase` and `git push -f` label Apr 22, 2019
@andyzhangx andyzhangx force-pushed the cis-protect-kernel-defaults branch from 1c88f15 to f5a3008 Compare April 24, 2019 12:33
net.ipv6.conf.default.accept_redirects = 0
# refer to https://github.com/kubernetes/kubernetes/blob/75d45bdfc9eeda15fb550e00da662c12d7d37985/pkg/kubelet/cm/container_manager_linux.go#L359-L397
vm.overcommit_memory = 1
Copy link
Contributor Author

@andyzhangx andyzhangx Apr 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackfrancis I found this code change will fail all e2e test, and actully the kernel flags are not set correctly after this PR:

Failed to start ContainerManager [Invalid kernel flag: kernel/panic, expected value: 10, actual value: 0, Invalid kernel flag: kernel/panic_on_oops, expected value: 1, actual value: 0, Invalid kernel flag: vm/overcommit_memory, expected value: 1, actual value: 0]

There is also no /etc/sysctl.d/60-CIS.conf file under /etc/sysctl.d/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by putting the whole thing behind "if ubuntu distro" only logic. Once the changes in this file are baked into the VHD then we can turn the functionality on, and run tests, by default

@jackfrancis
Copy link
Member

@andyzhangx, we haven't baked this new config file into a VHD yet, so we need to verify this functionality by using the "ubuntu" distro. I've just made a change to do that, and validated that it works:

$ sudo cat /etc/default/kubelet 

KUBELET_OPTS=

KUBELET_CONFIG=--address=0.0.0.0 --allow-privileged=true --anonymous-auth=false --authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/azure.json --cgroups-per-qos=true --client-ca-file=/etc/kubernetes/certs/ca.crt --cloud-config=/etc/kubernetes/azure.json --cloud-provider=azure --cluster-dns=10.0.0.10 --cluster-domain=cluster.local --enforce-node-allocatable=pods --event-qps=0 --eviction-hard=memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5% --feature-gates=PodPriority=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --image-pull-progress-deadline=30m --keep-terminated-pod-volumes=false --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=30 --network-plugin=cni --node-status-update-frequency=10s --non-masquerade-cidr=0.0.0.0/0 --pod-infra-container-image=k8s.gcr.io/pause-amd64:3.1 --pod-manifest-path=/etc/kubernetes/manifests --pod-max-pids=-1 --protect-kernel-defaults=true --streaming-connection-idle-timeout=5m 
KUBELET_IMAGE=k8s.gcr.io/hyperkube-amd64:v1.14.1
KUBELET_NODE_LABELS=kubernetes.io/role=master,node-role.kubernetes.io/master=,kubernetes.azure.com/cluster=kubernetes-westus2-11363



KUBELET_REGISTER_NODE=--register-node=true
KUBELET_REGISTER_WITH_TAINTS=--register-with-taints=node-role.kubernetes.io/master=true:NoSchedule
$ kubectl get nodes -o wide
NAME                                 STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8s-agentpool1-14659950-vmss000000   Ready    agent    3m28s   v1.14.1   10.240.0.4     <none>        Ubuntu 16.04.6 LTS   4.15.0-1042-azure   docker://3.0.4
k8s-agentpool1-14659950-vmss000001   Ready    agent    3m28s   v1.14.1   10.240.0.35    <none>        Ubuntu 16.04.6 LTS   4.15.0-1042-azure   docker://3.0.4
k8s-master-14659950-0                Ready    master   3m28s   v1.14.1   10.255.255.5   <none>        Ubuntu 16.04.6 LTS   4.15.0-1042-azure   docker://3.0.4

@jackfrancis
Copy link
Member

Validated new E2E test coverage on "ubuntu" distro:

  should validate Ubuntu host OS network configuration on all nodes
  /Users/jackfrancis/work/src/github.com/Azure/aks-engine/test/e2e/kubernetes/kubernetes_test.go:207

$ k config view -o json

$ k get nodes -o json

$ scp -i /Users/jackfrancis/work/src/github.com/Azure/aks-engine/_output/kubernetes-westus2-11363-ssh -o StrictHostKeyChecking=no scripts/net-config-validate.sh azureuser@kubernetes-westus2-11363.westus2.cloudapp.azure.com:/tmp/net-config-validate.sh
2019/04/25 09:16:45 
Authorized uses only. All activity may be monitored and reported.



$ ssh -A -i /Users/jackfrancis/work/src/github.com/Azure/aks-engine/_output/kubernetes-westus2-11363-ssh -o ConnectTimeout=30 -o StrictHostKeyChecking=no azureuser@kubernetes-westus2-11363.westus2.cloudapp.azure.com -p 22 scp -o StrictHostKeyChecking=no /tmp/net-config-validate.sh k8s-agentpool1-14659950-vmss000000:/tmp/net-config-validate.sh

$ ssh -A -i /Users/jackfrancis/work/src/github.com/Azure/aks-engine/_output/kubernetes-westus2-11363-ssh -p 22 -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR azureuser@kubernetes-westus2-11363.westus2.cloudapp.azure.com ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR k8s-agentpool1-14659950-vmss000000 "/tmp/net-config-validate.sh"

$ ssh -A -i /Users/jackfrancis/work/src/github.com/Azure/aks-engine/_output/kubernetes-westus2-11363-ssh -o ConnectTimeout=30 -o StrictHostKeyChecking=no azureuser@kubernetes-westus2-11363.westus2.cloudapp.azure.com -p 22 scp -o StrictHostKeyChecking=no /tmp/net-config-validate.sh k8s-agentpool1-14659950-vmss000001:/tmp/net-config-validate.sh

$ ssh -A -i /Users/jackfrancis/work/src/github.com/Azure/aks-engine/_output/kubernetes-westus2-11363-ssh -p 22 -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR azureuser@kubernetes-westus2-11363.westus2.cloudapp.azure.com ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR k8s-agentpool1-14659950-vmss000001 "/tmp/net-config-validate.sh"

$ ssh -A -i /Users/jackfrancis/work/src/github.com/Azure/aks-engine/_output/kubernetes-westus2-11363-ssh -o ConnectTimeout=30 -o StrictHostKeyChecking=no azureuser@kubernetes-westus2-11363.westus2.cloudapp.azure.com -p 22 scp -o StrictHostKeyChecking=no /tmp/net-config-validate.sh k8s-master-14659950-0:/tmp/net-config-validate.sh

$ ssh -A -i /Users/jackfrancis/work/src/github.com/Azure/aks-engine/_output/kubernetes-westus2-11363-ssh -p 22 -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR azureuser@kubernetes-westus2-11363.westus2.cloudapp.azure.com ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR k8s-master-14659950-0 "/tmp/net-config-validate.sh"
•
azureuser@k8s-master-14659950-0:~$ grep 'cat /proc/sys/kernel/panic' /tmp/net-config-validate.sh 
cat /proc/sys/kernel/panic | grep $KERNEL_PANIC_VALUE || exit 1
cat /proc/sys/kernel/panic_on_oops | grep $KERNEL_PANIC_ON_OOPS_VALUE || exit 1
azureuser@k8s-master-14659950-0:~$ grep '/proc/sys/vm/overcommit_memory' /tmp/net-config-validate.sh 
cat /proc/sys/vm/overcommit_memory | grep $VM_OVERCOMMIT_MEMORY_VALUE || exit 1

@jackfrancis
Copy link
Member

/azp run pr-e2e

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@mboersma mboersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@acs-bot acs-bot added the lgtm label Apr 25, 2019
@mboersma mboersma removed the needs-rebase Changes in the target branch require a `git rebase` and `git push -f` label Apr 25, 2019
@acs-bot
Copy link

acs-bot commented Apr 25, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@acs-bot acs-bot merged commit 4ef8ca6 into Azure:master Apr 25, 2019
@@ -81,6 +81,12 @@ func (cs *ContainerService) setKubeletConfig() {
"--streaming-connection-idle-timeout": "5m",
}

// "--protect-kernel-defaults" is true is currently only valid using base Ubuntu OS image
// until the changes are baked into a VHD
if cs.Properties.IsUbuntuDistroForAllNodes() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it will always return false now @jackfrancis

// IsUbuntuDistroForAllNodes returns true if all of the agent pools plus masters are running the base Ubuntu image
func (p *Properties) IsUbuntuDistroForAllNodes() bool {
        if len(p.AgentPoolProfiles) > 0 {
                for _, ap := range p.AgentPoolProfiles {
                        if ap.Distro != Ubuntu && ap.Distro != Ubuntu1804 {
                                return false
                        }
                }
        }
        if p.MasterProfile != nil {
                return p.MasterProfile.Distro == Ubuntu || p.MasterProfile.Distro == Ubuntu1804
        }
        return true
}

I think I have already tried the latest Ubuntu version:

    "agentpoolosImageSKU": {
      "value": "aks-ubuntu-1604-201904"
    },
    "agentpoolosImageVersion": {
      "value": "2019.04.24"
    },

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kubelet config: --protect-kernel-defaults
5 participants