add raw socket (AF_PACKET) backend for virtio-net host #171
Nice, thanks Hajime! LGTM, I'll wait a couple of days for more reviews before merging.
} lkl_netdev_linux_fdnet_ops;

struct lkl_netdev_linux_fdnet *lkl_register_netdev_linux_fdnet(int fd);
void lkl_unregister_netdev_linux_fdnet(struct lkl_netdev_linux_fdnet *nd);
Is lkl_unregister_netdev_linux_fdnet used anywhere externally?
Not at the moment, but I intend it to be used externally.
(I added comments for those function prototypes.)
Do you have performance numbers vs. tap? If it bypasses the kernel efficiently, it should give better numbers.
35034c4 to b9c0c66
These are the numbers for raw socket vs. tap (also included in the commit message). They are based on a hijacked netperf (client) talking to a native netserver (server) over a 10 Gbps back-to-back link.
The improvements are not that significant, and I guess that without QDISC_BYPASS (i.e., when the host kernel is older than Linux 3.14) the differences might be trivial too.
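To put the netperf figures from the commit message in perspective, the relative gain of the raw socket backend over tap can be computed per test. The following is only a small sketch: the helper names `relative_gain_pct` and `print_netperf_gains` are made up for illustration, and the input values are simply the numbers quoted in this PR.

```c
#include <stdio.h>

/* Relative gain of `raw` over `tap`, in percent (hypothetical helper). */
double relative_gain_pct(double raw, double tap)
{
	return (raw - tap) / tap * 100.0;
}

/* Print the gain for each netperf test quoted in the commit message. */
void print_netperf_gains(void)
{
	static const struct {
		const char *test;
		double raw;	/* raw socket with QDISC_BYPASS */
		double tap;
	} results[] = {
		{ "TCP_RR (Trans/sec)", 9519.31, 9486.03 },
		{ "TCP_STREAM (Mbps)", 2184.79, 2130.39 },
		{ "UDP_STREAM (Mbps)", 3654.32, 3108.10 },
	};
	size_t i;

	for (i = 0; i < sizeof(results) / sizeof(results[0]); i++)
		printf("%-20s %+.2f%%\n", results[i].test,
		       relative_gain_pct(results[i].raw, results[i].tap));
}
```

With these inputs the gains come out at roughly 0.35% for TCP_RR, 2.6% for TCP_STREAM and 17.6% for UDP_STREAM, so UDP_STREAM is the only test where the difference is clearly visible.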
nd = (struct lkl_netdev_linux_fdnet *)
	malloc(sizeof(struct lkl_netdev_linux_fdnet));
if (!nd) {
	fprintf(stderr, "raw: failed to allocate memory\n");
"raw" in error msg?
fixed. thanks.
LGTM
This commit is intended to allow introducing similar backends on top of a file-descriptor-based netdev backend, such as a raw-socket-based one.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
b9c0c66 to c9b18e9
Updated the patch set to address the comment from @liuyuan10.
c9b18e9 to 0ab0505
This patch introduces a new backend for virtio-net that uses an AF_PACKET socket (a.k.a. raw socket) to bypass the host kernel's network stack and use the LKL network stack instead.

It is convenient since we don't have to add an additional net_device (e.g., tap) for LKL, and it is possibly faster than tuntap with the PACKET_QDISC_BYPASS socket option (available since Linux 3.14). One drawback is that it requires root privilege (sudo or the suid bit set) to use.

Example usage:

    sudo LKL_HIJACK_NET_IFTYPE=raw LKL_HIJACK_NET_IFPARAMS=docker0 \
        LKL_HIJACK_NET_IP=172.17.0.39 LKL_HIJACK_NET_NETMASK_LEN=24 \
        ./bin/lkl-hijack.sh ping 172.17.0.2

Some benchmarks with netperf:

- TCP_RR
  raw(QDISC_BYPASS): 9519.31 Trans/sec
  tap: 9486.03 Trans/sec
- TCP_STREAM
  raw(QDISC_BYPASS): 2184.79 Mbps
  tap: 2130.39 Mbps
- UDP_STREAM
  raw(QDISC_BYPASS): 3654.32 Mbps
  tap: 3108.10 Mbps

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
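For readers unfamiliar with AF_PACKET, the following is a minimal sketch of how such a socket might be opened, bound to an interface, and switched to PACKET_QDISC_BYPASS. It is not the code from this patch: the function name `open_raw_socket` is hypothetical, and treating the bypass option as best-effort is an assumption made here for illustration.

```c
#include <arpa/inet.h>		/* htons */
#include <errno.h>
#include <linux/if_ether.h>	/* ETH_P_ALL */
#include <linux/if_packet.h>	/* struct sockaddr_ll, PACKET_QDISC_BYPASS */
#include <net/if.h>		/* if_nametoindex */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: open an AF_PACKET socket bound to `ifname`.
 * Returns the file descriptor, or -1 with errno set.
 */
int open_raw_socket(const char *ifname)
{
	struct sockaddr_ll sll;
	unsigned int ifindex;
	int fd, saved_errno;

	ifindex = if_nametoindex(ifname);
	if (ifindex == 0)
		return -1;	/* no such interface */

	/* Requires CAP_NET_RAW, hence the root requirement noted above. */
	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0)
		return -1;

	memset(&sll, 0, sizeof(sll));
	sll.sll_family = AF_PACKET;
	sll.sll_protocol = htons(ETH_P_ALL);
	sll.sll_ifindex = (int)ifindex;
	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return -1;
	}

#ifdef PACKET_QDISC_BYPASS
	{
		/* Best-effort: hosts older than Linux 3.14 reject this option. */
		int one = 1;

		(void)setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
				 &one, sizeof(one));
	}
#endif
	return fd;
}
```

A caller would pass the interface name given in LKL_HIJACK_NET_IFPARAMS (docker0 in the example above) and hand the resulting descriptor to the fd-based backend. Run without root, the socket() call is expected to fail.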
0ab0505 to a28b355
In the current code, when creating a new fib6 table, tb6_root.leaf gets initialized to net->ipv6.ip6_null_entry. If a default route is being added with rt->rt6i_metric = 0xffffffff, fib6_add() will add this route after net->ipv6.ip6_null_entry. As null_entry is shared, it could cause problem.

In order to fix it, set fn->leaf to NULL before calling fib6_add_rt2node() when trying to add the first default route. And reset fn->leaf to null_entry when adding fails or when deleting the last default route.

syzkaller reported the following issue which is fixed by this commit:

WARNING: suspicious RCU usage
4.15.0-rc5+ lkl#171 Not tainted
-----------------------------
net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
4 locks held by swapper/0/0:
 #0: ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
 #0: ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
 #1: (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
 #1: (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
 #2: (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
 #3: (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline]
 #3: (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ lkl#171
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
 fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
 fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
 fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
 fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
 fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
 __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
 fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
 fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
 fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
 call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
 expire_timers kernel/time/timer.c:1357 [inline]
 __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:540 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
 </IRQ>

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 66f5d6c ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
this includes two commits: 1) refactor the tap code to support generic file descriptor handling, 2) add raw socket backend for virtio-net.