
Commit ca89fa7

Merge branch 'cgroup-bpf'
Daniel Mack says:

====================
Add eBPF hooks for cgroups

This is v9 of the patch set to allow eBPF programs for network filtering
and accounting to be attached to cgroups, so that they apply to all
sockets of all tasks placed in that cgroup. The logic can also be
extended for other cgroup-based eBPF logic.

Again, only minor details are updated in this version.

Changes from v8:

* Move the egress hooks into ip_finish_output() and ip6_finish_output()
  so they run after the netfilter hooks. For IPv4 multicast, add a new
  ip_mc_finish_output() callback that is invoked on success by
  netfilter, and call the eBPF program from there.

Changes from v7:

* Replace the static inline function cgroup_bpf_run_filter() with two
  specific macros for ingress and egress. This addresses David Miller's
  concern regarding skb->sk vs. sk in the egress path. Thanks a lot to
  Daniel Borkmann and Alexei Starovoitov for the suggestions.

Changes from v6:

* Rebased to 4.9-rc2.

* Add EXPORT_SYMBOL(__cgroup_bpf_run_filter). The kbuild test robot now
  succeeds in building this version of the patch set.

* Switch from bpf_prog_run_save_cb() to bpf_prog_run_clear_cb() to not
  tamper with the contents of skb->cb[]. Pointed out by Daniel Borkmann.

* Use sk_to_full_sk() in the egress path, as suggested by Daniel
  Borkmann.

* Renamed BPF_PROG_TYPE_CGROUP_SOCKET to BPF_PROG_TYPE_CGROUP_SKB, as
  requested by David Ahern.

* Added Alexei's Acked-by tags.

Changes from v5:

* The eBPF programs now operate on L3 rather than on L2 of the packets,
  and the egress hooks were moved from __dev_queue_xmit() to
  ip*_output().

* For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
  through BPF_LD_[ABS|IND] instructions, but hook up the
  bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
  for the help.

Changes from v4:

* Plug an skb leak when dropping packets due to eBPF verdicts in
  __dev_queue_xmit(). Spotted by Daniel Borkmann.

* Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
  operate on timewait or request sockets. Suggested by Daniel Borkmann.

* Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
  Spotted by Rami Rosen.

* Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.

Changes from v3:

* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
  BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
  __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.

* Dropped the attach_flags member from the anonymous struct for BPF
  attach operations in union bpf_attr. They can be added later on via
  CHECK_ATTR. Requested by Daniel Borkmann and Alexei.

* Release old_prog at the end of __cgroup_bpf_update rather than at the
  beginning to fix a race gap between program updates and their users.
  Spotted by Daniel Borkmann.

* Plugged an skb leak when dropping packets on the egress path. Spotted
  by Daniel Borkmann.

* Add cgroups@vger.kernel.org to the loop, as suggested by Rami Rosen.

* Some minor coding style adaptations not worth mentioning in
  particular.

Changes from v2:

* Fixed the RCU locking details Tejun pointed out.

* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.

Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stub out
  related functions for !CONFIG_CGROUP_BPF as static inline nops. This
  way, the call sites are not cluttered with #ifdef guards while the
  feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers that are pinned to the cgroup, and one for the
  programs that are effective. When a program is attached or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths to
  keep performance penalties close to zero if the feature is not in use.

* Overall cleanup to make the accessors use the program arrays. This
  should make it much easier to add new program types, which will then
  automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with xchg()
  and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2 parents: 619228d + d8c5b17
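As a concrete illustration of the verdict convention described in the changelog above (a BPF_PROG_TYPE_CGROUP_SKB program returns 1 to let the packet pass and 0 to drop it), here is a minimal sketch in restricted C. The section name and the restricted-C form are assumptions for the example; the sample program shipped with this series builds equivalent raw BPF instructions by hand instead.

#include <linux/bpf.h>

/* "cgroup/skb" is a hypothetical section name; loader conventions vary. */
__attribute__((section("cgroup/skb"), used))
int cgroup_skb_pass_all(struct __sk_buff *skb)
{
        /*
         * Programs of this type operate on L3: skb->data starts at the
         * network header. Filtering or accounting logic would go here.
         */
        return 1;       /* 1 = let the packet pass, 0 = drop it */
}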

File tree

15 files changed: +612, -2 lines changed

include/linux/bpf-cgroup.h

+79
@@ -0,0 +1,79 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include <linux/bpf.h>
+#include <linux/jump_label.h>
+#include <uapi/linux/bpf.h>
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+        /*
+         * Store two sets of bpf_prog pointers, one for programs that are
+         * pinned directly to this cgroup, and one for those that are
+         * effective when this cgroup is accessed.
+         */
+        struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+        struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+                         struct cgroup *parent,
+                         struct bpf_prog *prog,
+                         enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+                       struct bpf_prog *prog,
+                       enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+                            struct sk_buff *skb,
+                            enum bpf_attach_type type);
+
+/* Wrappers for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled. */
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) \
+({ \
+        int __ret = 0; \
+        if (cgroup_bpf_enabled) \
+                __ret = __cgroup_bpf_run_filter(sk, skb, \
+                                                BPF_CGROUP_INET_INGRESS); \
+ \
+        __ret; \
+})
+
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) \
+({ \
+        int __ret = 0; \
+        if (cgroup_bpf_enabled && sk && sk == skb->sk) { \
+                typeof(sk) __sk = sk_to_full_sk(sk); \
+                if (sk_fullsock(__sk)) \
+                        __ret = __cgroup_bpf_run_filter(__sk, skb, \
+                                                BPF_CGROUP_INET_EGRESS); \
+        } \
+        __ret; \
+})
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+                                      struct cgroup *parent) {}
+
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
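The networking call sites for these macros live in other patches of this series and are not part of this diff. As a rough sketch (the function name, placement, and error handling here are illustrative only), an ingress hook is expected to consume the macro like this:

/* Hypothetical call site, for illustration only. */
static int example_ingress_hook(struct sock *sk, struct sk_buff *skb)
{
        /*
         * Expands to ({ 0; }) without CONFIG_CGROUP_BPF, and the static
         * key keeps the cost near zero while no program is attached.
         * Returns 0 to accept the skb, -EPERM if the verdict was "drop".
         */
        int err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);

        if (err) {
                kfree_skb(skb);
                return err;
        }

        return 0;
}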

include/linux/cgroup-defs.h

+4
@@ -16,6 +16,7 @@
 #include <linux/percpu-refcount.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/workqueue.h>
+#include <linux/bpf-cgroup.h>

 #ifdef CONFIG_CGROUPS

@@ -300,6 +301,9 @@ struct cgroup {
        /* used to schedule release agent */
        struct work_struct release_agent_work;

+       /* used to store eBPF programs */
+       struct cgroup_bpf bpf;
+
        /* ids of the ancestors at each level including self */
        int ancestor_ids[];
 };

include/uapi/linux/bpf.h

+17
@@ -73,6 +73,8 @@ enum bpf_cmd {
        BPF_PROG_LOAD,
        BPF_OBJ_PIN,
        BPF_OBJ_GET,
+       BPF_PROG_ATTACH,
+       BPF_PROG_DETACH,
 };

 enum bpf_map_type {
@@ -98,8 +100,17 @@ enum bpf_prog_type {
        BPF_PROG_TYPE_TRACEPOINT,
        BPF_PROG_TYPE_XDP,
        BPF_PROG_TYPE_PERF_EVENT,
+       BPF_PROG_TYPE_CGROUP_SKB,
 };

+enum bpf_attach_type {
+       BPF_CGROUP_INET_INGRESS,
+       BPF_CGROUP_INET_EGRESS,
+       __MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD       1

 /* flags for BPF_MAP_UPDATE_ELEM command */
@@ -150,6 +161,12 @@ union bpf_attr {
                __aligned_u64   pathname;
                __u32           bpf_fd;
        };
+
+       struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+               __u32           target_fd;      /* container object to attach to */
+               __u32           attach_bpf_fd;  /* eBPF program to attach */
+               __u32           attach_type;
+       };
 } __attribute__((aligned(8)));

 /* BPF helper function descriptions:
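From userspace, the new commands are driven through the bpf(2) syscall, with the cgroup identified by a file descriptor on its directory. A minimal sketch follows; the helper name is made up for this example, and prog_fd is assumed to come from BPF_PROG_LOAD with BPF_PROG_TYPE_CGROUP_SKB.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int attach_prog_to_cgroup(int prog_fd, const char *cgroup_path,
                                 enum bpf_attach_type type)
{
        union bpf_attr attr;
        int cg_fd, ret;

        /* the target cgroup is referenced by an fd on its directory */
        cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);
        if (cg_fd < 0)
                return -1;

        memset(&attr, 0, sizeof(attr));
        attr.target_fd = cg_fd;         /* cgroup to attach to */
        attr.attach_bpf_fd = prog_fd;   /* loaded eBPF program */
        attr.attach_type = type;        /* e.g. BPF_CGROUP_INET_INGRESS */

        ret = syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
        close(cg_fd);
        return ret;
}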

init/Kconfig

+12
@@ -1154,6 +1154,18 @@ config CGROUP_PERF
 
          Say N if unsure.
 
+config CGROUP_BPF
+       bool "Support for eBPF programs attached to cgroups"
+       depends on BPF_SYSCALL && SOCK_CGROUP_DATA
+       help
+         Allow attaching eBPF programs to a cgroup using the bpf(2)
+         syscall command BPF_PROG_ATTACH.
+
+         In which context these programs are accessed depends on the type
+         of attachment. For instance, programs that are attached using
+         BPF_CGROUP_INET_INGRESS will be executed on the ingress path of
+         inet sockets.
+
 config CGROUP_DEBUG
        bool "Example controller"
        default n

kernel/bpf/Makefile

+1
@@ -5,3 +5,4 @@ obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
+obj-$(CONFIG_CGROUP_BPF) += cgroup.o

kernel/bpf/cgroup.c

+167
@@ -0,0 +1,167 @@
+/*
+ * Functions to manage eBPF programs attached to cgroups
+ *
+ * Copyright (c) 2016 Daniel Mack
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License. See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/bpf-cgroup.h>
+#include <net/sock.h>
+
+DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
+EXPORT_SYMBOL(cgroup_bpf_enabled_key);
+
+/**
+ * cgroup_bpf_put() - put references of all bpf programs
+ * @cgrp: the cgroup to modify
+ */
+void cgroup_bpf_put(struct cgroup *cgrp)
+{
+        unsigned int type;
+
+        for (type = 0; type < ARRAY_SIZE(cgrp->bpf.prog); type++) {
+                struct bpf_prog *prog = cgrp->bpf.prog[type];
+
+                if (prog) {
+                        bpf_prog_put(prog);
+                        static_branch_dec(&cgroup_bpf_enabled_key);
+                }
+        }
+}
+
+/**
+ * cgroup_bpf_inherit() - inherit effective programs from parent
+ * @cgrp: the cgroup to modify
+ * @parent: the parent to inherit from
+ */
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
+{
+        unsigned int type;
+
+        for (type = 0; type < ARRAY_SIZE(cgrp->bpf.effective); type++) {
+                struct bpf_prog *e;
+
+                e = rcu_dereference_protected(parent->bpf.effective[type],
+                                              lockdep_is_held(&cgroup_mutex));
+                rcu_assign_pointer(cgrp->bpf.effective[type], e);
+        }
+}
+
+/**
+ * __cgroup_bpf_update() - Update the pinned program of a cgroup, and
+ *                         propagate the change to descendants
+ * @cgrp: The cgroup which descendants to traverse
+ * @parent: The parent of @cgrp, or %NULL if @cgrp is the root
+ * @prog: A new program to pin
+ * @type: Type of pinning operation (ingress/egress)
+ *
+ * Each cgroup has a set of two pointers for bpf programs; one for eBPF
+ * programs it owns, and one for the programs that are effective for
+ * execution.
+ *
+ * If @prog is not %NULL, this function attaches a new program to the cgroup
+ * and releases the one that is currently attached, if any. @prog is then made
+ * the effective program of type @type in that cgroup.
+ *
+ * If @prog is %NULL, the currently attached program of type @type is released,
+ * and the effective program of the parent cgroup (if any) is inherited to
+ * @cgrp.
+ *
+ * Then, the descendants of @cgrp are walked and the effective program for
+ * each of them is set to the effective program of @cgrp unless the
+ * descendant has its own program attached, in which case the subbranch is
+ * skipped. This ensures that delegated subcgroups with own programs are left
+ * untouched.
+ *
+ * Must be called with cgroup_mutex held.
+ */
+void __cgroup_bpf_update(struct cgroup *cgrp,
+                         struct cgroup *parent,
+                         struct bpf_prog *prog,
+                         enum bpf_attach_type type)
+{
+        struct bpf_prog *old_prog, *effective;
+        struct cgroup_subsys_state *pos;
+
+        old_prog = xchg(cgrp->bpf.prog + type, prog);
+
+        effective = (!prog && parent) ?
+                rcu_dereference_protected(parent->bpf.effective[type],
+                                          lockdep_is_held(&cgroup_mutex)) :
+                prog;
+
+        css_for_each_descendant_pre(pos, &cgrp->self) {
+                struct cgroup *desc = container_of(pos, struct cgroup, self);
+
+                /* skip the subtree if the descendant has its own program */
+                if (desc->bpf.prog[type] && desc != cgrp)
+                        pos = css_rightmost_descendant(pos);
+                else
+                        rcu_assign_pointer(desc->bpf.effective[type],
+                                           effective);
+        }
+
+        if (prog)
+                static_branch_inc(&cgroup_bpf_enabled_key);
+
+        if (old_prog) {
+                bpf_prog_put(old_prog);
+                static_branch_dec(&cgroup_bpf_enabled_key);
+        }
+}
+
+/**
+ * __cgroup_bpf_run_filter() - Run a program for packet filtering
+ * @sk: The socket sending or receiving traffic
+ * @skb: The skb that is being sent or received
+ * @type: The type of program to be executed
+ *
+ * If no socket is passed, or the socket is not of type INET or INET6,
+ * this function does nothing and returns 0.
+ *
+ * The program type passed in via @type must be suitable for network
+ * filtering. No further check is performed to assert that.
+ *
+ * This function will return %-EPERM if an attached program was found and
+ * it returned != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter(struct sock *sk,
+                            struct sk_buff *skb,
+                            enum bpf_attach_type type)
+{
+        struct bpf_prog *prog;
+        struct cgroup *cgrp;
+        int ret = 0;
+
+        if (!sk || !sk_fullsock(sk))
+                return 0;
+
+        if (sk->sk_family != AF_INET &&
+            sk->sk_family != AF_INET6)
+                return 0;
+
+        cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+
+        rcu_read_lock();
+
+        prog = rcu_dereference(cgrp->bpf.effective[type]);
+        if (prog) {
+                unsigned int offset = skb->data - skb_network_header(skb);
+
+                /* run the program with skb->data at the L3 (network) header */
+                __skb_push(skb, offset);
+                ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
+                __skb_pull(skb, offset);
+        }
+
+        rcu_read_unlock();
+
+        return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter);
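The BPF_PROG_ATTACH syscall handler that ties the new uapi to cgroup_bpf_update() lives in kernel/bpf/syscall.c and is not shown in this excerpt. As a sketch of the expected flow (not the exact upstream code; the function name is illustrative):

/* Sketch of the attach path, for illustration only. */
static int example_bpf_prog_attach(const union bpf_attr *attr)
{
        struct bpf_prog *prog;
        struct cgroup *cgrp;

        switch (attr->attach_type) {
        case BPF_CGROUP_INET_INGRESS:
        case BPF_CGROUP_INET_EGRESS:
                prog = bpf_prog_get_type(attr->attach_bpf_fd,
                                         BPF_PROG_TYPE_CGROUP_SKB);
                if (IS_ERR(prog))
                        return PTR_ERR(prog);

                cgrp = cgroup_get_from_fd(attr->target_fd);
                if (IS_ERR(cgrp)) {
                        bpf_prog_put(prog);
                        return PTR_ERR(cgrp);
                }

                /* takes cgroup_mutex and propagates to all descendants */
                cgroup_bpf_update(cgrp, prog, attr->attach_type);
                cgroup_put(cgrp);
                return 0;

        default:
                return -EINVAL;
        }
}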
