add raw socket (AF_PACKET) backend for virtio-net host #171

Merged: 2 commits, Jul 6, 2016

@thehajime (Member) commented Jul 2, 2016

This includes two commits: 1) refactor the tap code to support generic file descriptor handling, and 2) add a raw socket backend for virtio-net.



@ghost commented Jul 3, 2016

Nice, thanks Hajime! LGTM. I'll wait a couple of days for more reviews before merging.

} lkl_netdev_linux_fdnet_ops;

struct lkl_netdev_linux_fdnet *lkl_register_netdev_linux_fdnet(int fd);
void lkl_unregister_netdev_linux_fdnet(struct lkl_netdev_linux_fdnet *nd);
Member

is lkl_unregister_netdev_linux_fdnet used anywhere externally?

Member Author

Not at the moment, but I intend it to be used externally.

(I added comments for those function prototypes)

@liuyuan10 (Member) commented

Do you have performance numbers vs. tap? If it bypasses the host kernel efficiently, it should give better numbers.

@thehajime thehajime force-pushed the feature-virtio-rawsock branch 2 times, most recently from 35034c4 to b9c0c66 Compare July 5, 2016 05:05
@thehajime (Member Author) commented

> Do you have performance numbers vs. tap? If it bypasses the host kernel efficiently, it should give better numbers.

These are the numbers for raw socket vs. tap (also included in the commit message).

They were measured with a hijacked netperf (client) against a native netserver (server), connected via a 10 Gbps back-to-back link.
The raw socket run uses the outgoing NIC directly, while the tap run goes through a host bridge (br0, consisting of the tap device and the outgoing NIC), so I guess it's not an apples-to-apples comparison.

- TCP_RR
 raw(QDISC_BYPASS): 9519.31 Trans/sec
 tap:               9486.03 Trans/sec
- TCP_STREAM
 raw(QDISC_BYPASS): 2184.79 Mbps
 tap:               2130.39 Mbps
- UDP_STREAM
 raw(QDISC_BYPASS): 3654.32 Mbps
 tap:               3108.10 Mbps

The improvements are not that significant, and I guess that without QDISC_BYPASS (i.e., when the host kernel is older than Linux 3.14) the differences might be trivial too.
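
To put the figures above in relative terms, the gain of raw over tap can be computed as (raw - tap) / tap. A minimal sketch of that arithmetic follows; `relative_gain_pct` and `print_gains` are hypothetical helper names for illustration and are not part of LKL or netperf.

```c
#include <stdio.h>

/* Hypothetical helper: gain of raw over tap, in percent of the tap result. */
static double relative_gain_pct(double raw, double tap)
{
	return (raw - tap) / tap * 100.0;
}

/* Print one line per benchmark, using the figures quoted in this thread. */
static void print_gains(void)
{
	printf("TCP_RR:     %+.2f%%\n", relative_gain_pct(9519.31, 9486.03));
	printf("TCP_STREAM: %+.2f%%\n", relative_gain_pct(2184.79, 2130.39));
	printf("UDP_STREAM: %+.2f%%\n", relative_gain_pct(3654.32, 3108.10));
}
```

By this measure the gains are roughly 0.4% (TCP_RR), 2.6% (TCP_STREAM) and 17.6% (UDP_STREAM), so only the UDP_STREAM difference stands out, and the caveat about the bridge in the tap setup still applies.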

nd = (struct lkl_netdev_linux_fdnet *)
	malloc(sizeof(struct lkl_netdev_linux_fdnet));
if (!nd) {
	fprintf(stderr, "raw: failed to allocate memory\n");
Member

"raw" in error msg?

Member Author

fixed. thanks.

@liuyuan10 (Member) commented

LGTM
Thanks for the detailed numbers

This commit is intended to allow introducing similar backends based on a
file descriptor, such as a raw socket-based one.

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
@thehajime thehajime force-pushed the feature-virtio-rawsock branch from b9c0c66 to c9b18e9 Compare July 5, 2016 22:53
@thehajime (Member Author) commented

Updated the patch set to address the comments from @liuyuan10.

@thehajime thehajime force-pushed the feature-virtio-rawsock branch from c9b18e9 to 0ab0505 Compare July 5, 2016 22:57
This patch introduces a new backend for virtio-net, which uses an AF_PACKET
socket (a.k.a. raw socket) to bypass the host kernel's network stack and use
the LKL network stack instead.  It is convenient since we don't have to add
an additional net_device (e.g., tap) for LKL, and it is possibly faster than
tuntap with the PACKET_QDISC_BYPASS socket option (available since Linux
3.14).  One drawback is that it requires root privileges (sudo or the suid
bit set) to use.
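
For readers unfamiliar with AF_PACKET, the following is a minimal sketch of how such a socket might be opened, bound to one interface, and switched to qdisc bypass. It is not the code from this patch: `fill_bind_addr` and `open_raw_fd` are hypothetical names, and the choice to treat PACKET_QDISC_BYPASS as best effort is an assumption.

```c
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct sockaddr_ll, PACKET_QDISC_BYPASS */
#include <net/if.h>           /* if_nametoindex */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: build the link-layer address used to bind to one NIC. */
static void fill_bind_addr(struct sockaddr_ll *sll, unsigned int ifindex)
{
	memset(sll, 0, sizeof(*sll));
	sll->sll_family = AF_PACKET;
	sll->sll_protocol = htons(ETH_P_ALL);
	sll->sll_ifindex = (int)ifindex;
}

/* Hypothetical helper: returns a bound AF_PACKET fd, or -1 on failure. */
static int open_raw_fd(const char *ifname)
{
	struct sockaddr_ll sll;
	unsigned int ifindex;
	int fd, one = 1;

	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0)
		return -1;	/* typically EPERM without CAP_NET_RAW */

	ifindex = if_nametoindex(ifname);
	if (ifindex == 0)
		goto fail;	/* no such interface */

	fill_bind_addr(&sll, ifindex);
	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0)
		goto fail;

#ifdef PACKET_QDISC_BYPASS
	/* Assumption: best effort, failure is ignored on hosts before 3.14. */
	(void)setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
#endif
	return fd;
fail:
	close(fd);
	return -1;
}
```

A descriptor obtained this way is the kind of value that `lkl_register_netdev_linux_fdnet(int fd)`, whose prototype appears earlier in this review, takes as its argument; how the patch itself wires the two together is not shown here.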

example usage is like this:

sudo LKL_HIJACK_NET_IFTYPE=raw LKL_HIJACK_NET_IFPARAMS=docker0 \
LKL_HIJACK_NET_IP=172.17.0.39 LKL_HIJACK_NET_NETMASK_LEN=24 \
./bin/lkl-hijack.sh ping 172.17.0.2

some benchmarks with netperf:

- TCP_RR
 raw(QDISC_BYPASS): 9519.31 Trans/sec
 tap:               9486.03 Trans/sec
- TCP_STREAM
 raw(QDISC_BYPASS): 2184.79 Mbps
 tap:               2130.39 Mbps
- UDP_STREAM
 raw(QDISC_BYPASS): 3654.32 Mbps
 tap:               3108.10 Mbps

Signed-off-by: Hajime Tazaki <thehajime@gmail.com>
@thehajime thehajime force-pushed the feature-virtio-rawsock branch from 0ab0505 to a28b355 Compare July 5, 2016 23:04
@ghost ghost merged commit 7881bd3 into lkl:master Jul 6, 2016
@thehajime thehajime deleted the feature-virtio-rawsock branch October 21, 2016 02:39
thehajime pushed a commit to libos-nuse/lkl-linux that referenced this pull request Mar 19, 2018
In the current code, when creating a new fib6 table, tb6_root.leaf gets
initialized to net->ipv6.ip6_null_entry.
If a default route is being added with rt->rt6i_metric = 0xffffffff,
fib6_add() will add this route after net->ipv6.ip6_null_entry. As
null_entry is shared, it could cause problem.

In order to fix it, set fn->leaf to NULL before calling
fib6_add_rt2node() when trying to add the first default route.
And reset fn->leaf to null_entry when adding fails or when deleting the
last default route.

syzkaller reported the following issue which is fixed by this commit:

WARNING: suspicious RCU usage
4.15.0-rc5+ lkl#171 Not tainted
-----------------------------
net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
4 locks held by swapper/0/0:
 #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
 #0:  ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
 #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
 #1:  (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
 #2:  (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
 #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline]
 #3:  (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ lkl#171
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
 fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
 fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
 fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
 fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
 fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
 __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
 fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
 fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
 fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
 call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
 expire_timers kernel/time/timer.c:1357 [inline]
 __run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
 run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
 __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
 invoke_softirq kernel/softirq.c:365 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:540 [inline]
 smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
 apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
 </IRQ>

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 66f5d6c ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
This pull request was closed.