Networking Stack
The OSv networking stack originates from FreeBSD circa 2013 but has since been heavily modified to implement Van Jacobson's "network channels" design, which reduces the number of locks and lock operations. For more theory and high-level design details, please read the "Network Channels" chapter of the OSv paper.
This Wiki focuses instead on the code and where these design ideas are implemented. It still touches just the tip of the iceberg, which is the code of the networking stack located mostly under the bsd/
subtree.
One can use trace.py
to effectively study the OSv networking stack. There are numerous trace points that can be enabled when running a given app and then extracted and analyzed using the aforementioned tool as described in this wiki page.
A good test bed might be the httpserver-monitoring-api
app which can be built and started with the following trace points enabled:
./scripts/build image=httpserver-monitoring-api.fg
./scripts/run.py --trace=net_packet*,tcp*,tso*,inpcb*,in_lltable* --trace-backtrace --api
curl http://localhost:8000/os/threads # Just to trigger some networking activity
./scripts/trace.py extract
./scripts/trace.py list --tcpdump -blLF
0xffff800001981040 /libhttpserver- 0 0.117564227 tcp_state tp=0xffffa000015b9c00, CLOSED -> CLOSED
tcpcb::set_state(int) bsd/sys/netinet/tcp_var.h:233
tcp_attach bsd/sys/netinet/tcp_usrreq.cc:1618
tcp_usr_attach bsd/sys/netinet/tcp_usrreq.cc:130
socreate bsd/sys/kern/uipc_socket.cc:335
socreate(int, int, int) bsd/sys/kern/uipc_syscalls.cc:118
sys_socket bsd/sys/kern/uipc_syscalls.cc:133
linux_socket bsd/sys/compat/linux/linux_socket.cc:616
socket bsd/sys/kern/uipc_syscalls_wrap.cc:333
There are a number of data structures that are key to understanding the networking stack. Most of them originate from FreeBSD and only some have been extended with OSv-specific data, especially for net channel needs.
The struct mbuf
is a key structure used to exchange data between a network driver and the rest of the stack. It is used at the very bottom of the stack. It is made of a header (32 bytes) followed by data. Typically, multiple mbufs are linked together as a list called an "mbuf chain". For more details read here.
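To make the idea of an "mbuf chain" concrete, here is a minimal sketch of a linked list of buffers and a helper that sums the payload over the whole chain. The names toy_mbuf, chain_append() and chain_length() are hypothetical; the real struct mbuf has a much richer header and is not modeled here.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, heavily simplified stand-in for struct mbuf.
// Only the "data plus pointer to the next buffer" shape is modeled.
struct toy_mbuf {
    std::string data;                // payload bytes held by this buffer
    std::unique_ptr<toy_mbuf> next;  // next buffer in the same chain
};

// Append a new buffer at the tail of the chain and return the head.
inline std::unique_ptr<toy_mbuf> chain_append(std::unique_ptr<toy_mbuf> head,
                                              std::string bytes)
{
    auto node = std::make_unique<toy_mbuf>();
    node->data = std::move(bytes);
    if (!head) {
        return node;  // the new buffer is the whole chain
    }
    toy_mbuf* tail = head.get();
    while (tail->next) {
        tail = tail->next.get();
    }
    tail->next = std::move(node);
    return head;
}

// Total payload length of a chain, summed over every buffer in it.
inline std::size_t chain_length(const toy_mbuf* m)
{
    std::size_t total = 0;
    for (; m != nullptr; m = m->next.get()) {
        total += m->data.size();
    }
    return total;
}
```

A chain built from the buffers "abc" and "de" reports a total length of 5, even though no single buffer holds all the bytes.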
The struct socket
represents a socket object behind a socket API and typically maps to a single connection.
The net channel is a direct bottom-up traffic line flowing from a network driver acting as a producer to an app thread calling recv()
, poll()
, epoll()
among others acting as a consumer. The net channels are designed to avoid most of the locking involved when typically traversing layer by layer. Relatedly, one can see many references in the code to both "fast path" and "slow path". To understand both and net channels, one can start by looking at this code in the virtio-net
driver (there is similar code in the vmxnet3
driver):
void net::receiver()
{
    ...
    bool fast_path = _ifn->if_classifier.post_packet(m_head);
    if (!fast_path) {
        (*_ifn->if_input)(_ifn, m_head);
    }
    ...
}
In essence, this code is called to process incoming data (RX) from the network card and it tries to "push" the resulting mbuf
via the network channel (fast-path). If that fails it falls back to the if_input
from the FreeBSD way of doing things.
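The "try the fast path, otherwise fall back" decision can be modeled in isolation. The sketch below is only an illustration under stated assumptions: toy_packet, toy_ifnet, rx_path and deliver() are hypothetical names, and the two hooks merely stand in for if_classifier.post_packet() and if_input.

```cpp
#include <functional>
#include <string>

// Hypothetical packet type; the real code passes struct mbuf pointers.
struct toy_packet {
    std::string payload;
};

// Hypothetical stand-in for struct ifnet with only the two hooks
// that the receive loop uses.
struct toy_ifnet {
    std::function<bool(toy_packet&)> post_packet;  // models if_classifier.post_packet()
    std::function<void(toy_packet&)> if_input;     // models the FreeBSD-style if_input hook
};

enum class rx_path { fast, slow };

// Offer the packet to the classifier first; only when no net channel
// accepts it does the packet take the traditional input path.
inline rx_path deliver(toy_ifnet& ifn, toy_packet& p)
{
    if (ifn.post_packet(p)) {
        return rx_path::fast;
    }
    ifn.if_input(p);
    return rx_path::slow;
}
```

Returning which path was taken is a choice made for this sketch only, so that the behavior is easy to observe; the driver code above does not report it.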
The if_classifier
, a member of the struct ifnet
describing network interface and defined in if_var.h
, is an instance of the class classifier
. The method post_packet()
used in the code above, is part of the "producer" interface; its role is to classify whether the mbuf
in question has a corresponding net channel and, if so, push
the mbuf
on that net channel and wake consumers of the net channel. So the network card driver, virtio-net
in this example, is a "producer" in the context of the net channel and threads blocked when calling send
, recv
and poll
are "consumers". Also, an instance of a net channel corresponds to a single TCP connection.
Here is an example of the "successful" fast path traversal:
0xffff8000015ff040 virtio-net-rx 0 21.143180806 net_packet_in b'IP truncated-ip - 14 bytes missing! 192.168.122.1.36394 > 192.168.122.15.8000: Flags [P.], seq 2688002:2688090, ack 2893961834, win 65535, length 88'
log_packet_in(mbuf*, int) core/net_trace.cc:143
classifier::post_packet(mbuf*) core/net_channel.cc:133
virtio::net::receiver() drivers/virtio-net.cc:542
std::_Function_handler<void (), virtio::net::net(virtio::virtio_device&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) drivers/virtio-net.cc:243
__invoke_impl<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:61
__invoke_r<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:154
_M_invoke /usr/include/c++/11/bits/std_function.h:290
sched::thread::main() core/sched.cc:1267
thread_main_c arch/x64/arch-switch.hh:325
thread_main arch/x64/entry.S:116
Now, how does the post_packet()
exactly "classify" the packet? Under the hood, it calls the method classify_ipv4_tcp()
, which in turn first verifies whether the packet belongs in the "fast path" category, meaning more or less:
- is it an IP packet?
- does it carry a TCP payload?
- is the underlying TCP connection in the right state, that is, does the packet carry none of the flags TH_SYN, TH_FIN, or TH_RST?
The last condition effectively means that only sockets in the state - ESTABLISHED, CLOSE_WAIT, FIN_WAIT_2, and TIME_WAIT - would "participate" in the fast path traversal. In other words, the fast path only plays a role when a TCP connection is established and the slow path is what happens during establishing and tear-down of a TCP connection.
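The checks listed above can be sketched as a single predicate. This is not the code of classify_ipv4_tcp(); toy_headers and fast_path_candidate() are hypothetical names, and the sketch assumes the relevant header fields have already been parsed. The numeric constants are the standard EtherType for IPv4, the IP protocol number for TCP, and the standard TCP header flag bits.

```cpp
#include <cstdint>

constexpr std::uint16_t ethertype_ipv4 = 0x0800;  // EtherType for IPv4
constexpr std::uint8_t ipproto_tcp = 6;           // IP protocol number for TCP

// TCP header flag bits.
constexpr std::uint8_t th_fin = 0x01;
constexpr std::uint8_t th_syn = 0x02;
constexpr std::uint8_t th_rst = 0x04;
constexpr std::uint8_t th_push = 0x08;
constexpr std::uint8_t th_ack = 0x10;

// Hypothetical summary of the already-parsed headers of one packet.
struct toy_headers {
    std::uint16_t ether_type;
    std::uint8_t ip_proto;
    std::uint8_t tcp_flags;
};

// True when the packet may travel over a net channel: IPv4, TCP, and
// none of the connection setup or teardown flags set.
inline bool fast_path_candidate(const toy_headers& h)
{
    if (h.ether_type != ethertype_ipv4) {
        return false;
    }
    if (h.ip_proto != ipproto_tcp) {
        return false;
    }
    return (h.tcp_flags & (th_syn | th_fin | th_rst)) == 0;
}
```

A plain data segment with PSH and ACK set passes the predicate, while a SYN or a FIN segment is rejected and would therefore take the slow path.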
The post_packet()
pushes an mbuf
onto the net channel only if one exists. But when does a net channel get created? The net channel gets constructed by tcp_setup_net_channel()
and destroyed by tcp_teardown_net_channel()
or tcp_free_net_channel()
. The tcp_setup_net_channel() function gets called when a TCP connection gets established in tcp_do_segment()
(in two places). The tcp_teardown_net_channel()
gets called by tcp_do_segment()
when a socket in the ESTABLISHED state transitions to CLOSE_WAIT, and when an established socket is closed in tcp_usr_close()
and tcp_usrclosed()
. The tcp_free_net_channel()
on the other hand, gets called by tcp_discardcb()
when the process of TCP connection closing begins in other TCP state machine cases.
The tcp_setup_net_channel()
is key as it binds the "consumers" of a net channel by calling add_poller()
and add_epoll
. It also registers a new net channel in the RCU hashtable kept as part of the classifier
.
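The lifecycle described above (register a channel when the connection is established, look it up for every incoming packet, remove it on teardown) can be sketched with an ordinary map keyed by the connection's address and port tuple. Everything here is an assumption made for illustration: conn_key, toy_channel and toy_classifier are hypothetical, and a plain std::map replaces the RCU hashtable, so this sketch is not safe for concurrent use.

```cpp
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical connection identifier:
// (source address, source port, destination address, destination port).
using conn_key = std::tuple<std::uint32_t, std::uint16_t,
                            std::uint32_t, std::uint16_t>;

// Hypothetical channel that only counts the packets posted to it.
struct toy_channel {
    int queued = 0;
};

class toy_classifier {
public:
    // Called when a connection becomes established.
    void add(const conn_key& k, toy_channel* ch) { _channels[k] = ch; }

    // Called when the connection leaves the established phase.
    void remove(const conn_key& k) { _channels.erase(k); }

    // Producer side: returns true only if a channel exists for the key.
    bool post_packet(const conn_key& k)
    {
        auto it = _channels.find(k);
        if (it == _channels.end()) {
            return false;  // no channel, the caller falls back to the slow path
        }
        ++it->second->queued;
        return true;
    }

private:
    std::map<conn_key, toy_channel*> _channels;
};
```

Before add() and after remove(), post_packet() returns false for the same key, which mirrors why packets seen during connection setup and teardown end up on the slow path.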
Another question remains: how do "consumers" consume data of net channels? The post_packet()
method discussed above wakes all interested consumers after a successful push. The consumers include the _waiting_thread
- a member of the net channel - and pollers and "epollers" woken by wake_pollers(). Now, how does an mbuf
exactly get consumed? The critical places in the bsd/
tree are sprinkled with calls to process_queue()
which pops an mbuf
off a queue and passes it on by invoking a callback method _process_packet
which was set when constructing the net channel. This may be illustrated by this stack trace:
0xffff800001de8040 >/tests/misc-tc 3 24.416126421 net_packet_handling b'IP truncated-ip - 1366 bytes missing! 192.168.122.1.9999 > 192.168.122.15.20728: Flags [.], seq 10059266:10060706, ack 2978885776, win 65535, length 1440'
log_packet_handling(mbuf*, int) core/net_trace.cc:153
std::_Function_handler<void (mbuf*), tcp_setup_net_channel::{lambda(mbuf*)#1}>::_M_invoke(std::_Any_data const&, mbuf*&&) bsd/sys/netinet/tcp_input.cc:3193
operator() bsd/sys/netinet/tcp_input.cc:3231
__invoke_impl<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:61
__invoke_r<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:154
_M_invoke /usr/include/c++/11/bits/std_function.h:290
std::function<void (mbuf*)>::operator()(mbuf*) const /usr/include/c++/11/bits/std_function.h:590
net_channel::process_queue() core/net_channel.cc:37
int sbwait_tmo<osv::clock::uptime>(socket*, sockbuf*, boost::optional<std::chrono::time_point<osv::clock::uptime, osv::clock::uptime::duration> >) bsd/sys/kern/uipc_sockbuf.cc:167
sbwait bsd/sys/kern/uipc_sockbuf.cc:190
soreceive_generic bsd/sys/kern/uipc_socket.cc:1464
kern_recvit bsd/sys/kern/uipc_syscalls.cc:607
sys_recvfrom bsd/sys/kern/uipc_syscalls.cc:673
sys_recvfrom bsd/sys/kern/uipc_syscalls.cc:707
linux_recv bsd/sys/compat/linux/linux_socket.cc:866
recv bsd/sys/kern/uipc_syscalls_wrap.cc:183 #libc API call
The sbwait_tmo()
in the stack above waits on the net channel associated with the socket object in question and once awoken proceeds to call process_queue()
as can be seen in the shortened version of the code:
int sbwait_tmo(socket* so, struct sockbuf *sb, boost::optional<std::chrono::time_point<Clock>> timeout)
{
    ...
    if (so->so_nc && !so->so_nc_busy) {
        so->so_nc_busy = true;
        sched::thread::wait_for(SOCK_MTX_REF(so), *so->so_nc, sb->sb_cc_wq, tmr, sc);
        so->so_nc_busy = false;
        so->so_nc_wq.wake_all(SOCK_MTX_REF(so));
    } else {
        sched::thread::wait_for(SOCK_MTX_REF(so), so->so_nc_wq, sb->sb_cc_wq, tmr, sc);
    }
    ...
    if (so->so_nc) {
        so->so_nc->process_queue();
    }
    return 0;
}
Lastly, the _process_packet
invoked by process_queue()
is actually the function tcp_net_channel_packet()
that disassembles the mbuf
popped off a net channel queue and pushes it up the stack by calling the tcp_do_segment()
function which by itself is long and pretty complicated.
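The producer and consumer roles around process_queue() can be illustrated with a small single-threaded model. The class toy_net_channel below is hypothetical: it keeps packets in a std::deque and uses strings instead of mbufs, whereas the real channel has to be safe between the driver thread and the consuming thread, which this sketch deliberately ignores.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

class toy_net_channel {
public:
    // The callback plays the role of _process_packet, fixed at construction.
    explicit toy_net_channel(std::function<void(std::string)> process_packet)
        : _process_packet(std::move(process_packet)) {}

    // Producer side: the driver queues a packet for the consumer.
    void push(std::string pkt) { _queue.push_back(std::move(pkt)); }

    // Consumer side: drain the queue in arrival order, handing each
    // packet to the callback. Returns the number of packets processed.
    std::size_t process_queue()
    {
        std::size_t processed = 0;
        while (!_queue.empty()) {
            std::string pkt = std::move(_queue.front());
            _queue.pop_front();
            _process_packet(std::move(pkt));
            ++processed;
        }
        return processed;
    }

private:
    std::deque<std::string> _queue;
    std::function<void(std::string)> _process_packet;
};
```

In this model the packet is processed in the context of whoever calls process_queue(), which matches the stack trace above where the TCP processing runs on the application thread inside recv() rather than on the driver thread.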
Consuming the data through net channels in the context of poll
and epoll
is handled differently, and the key elements are the socket_file
and epoll_file
classes, respectively. Here is an example stack trace illustrating some of the details:
0xffff800001981040 /libhttpserver- 0 19.880613508 net_packet_handling b'IP truncated-ip - 14 bytes missing! 192.168.122.1.36398 > 192.168.122.15.8000: Flags [P.], seq 2496002:2496090, ack 233273190, win 65535, length 88'
log_packet_handling(mbuf*, int) core/net_trace.cc:153
std::_Function_handler<void (mbuf*), tcp_setup_net_channel::{lambda(mbuf*)#1}>::_M_invoke(std::_Any_data const&, mbuf*&&) bsd/sys/netinet/tcp_input.cc:3193
operator() bsd/sys/netinet/tcp_input.cc:3231
__invoke_impl<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:61
__invoke_r<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:154
_M_invoke /usr/include/c++/11/bits/std_function.h:290
std::function<void (mbuf*)>::operator()(mbuf*) const /usr/include/c++/11/bits/std_function.h:590
net_channel::process_queue() core/net_channel.cc:37
socket_file::poll(int) bsd/sys/kern/sys_socket.cc:260
epoll_file::add(epoll_key, epoll_event*) core/epoll.cc:99
epoll_ctl core/epoll.cc:308
Coming back to the original code, if the "fast path" fails, that is, when post_packet()
returns false, the if_input
function (the "slow path") is called. Here is an example of a "slow path" execution:
0xffff800001783040 virtio-net-rx 0 19.881495336 net_packet_in b'IP 192.168.122.1.36398 > 192.168.122.15.8000: Flags [F.], seq 2496090, ack 233281200, win 65535, length 0'
log_packet_in(mbuf*, int) core/net_trace.cc:143
netisr_dispatch_src bsd/sys/net/netisr.cc:768
virtio::net::receiver() drivers/virtio-net.cc:544
std::_Function_handler<void (), virtio::net::net(virtio::virtio_device&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) drivers/virtio-net.cc:243
__invoke_impl<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:61
__invoke_r<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:154
_M_invoke /usr/include/c++/11/bits/std_function.h:290
sched::thread::main() core/sched.cc:1267
thread_main_c arch/x64/arch-switch.hh:325
thread_main arch/x64/entry.S:116
The netisr_dispatch_src
- the FreeBSD stack routine - is what the if_input
member of the struct ifnet
points to.
To conclude, the fast path is fast because it hands a packet directly to a net channel, whereas the slow path traverses all the traditional stack call paths that involve many locks.
There are many ways the network stack can be dissected and analyzed but one common one is to look at the direction of traffic and how it travels through the layers. One direction is a top-down one starting with the libc functions like send()
, recv()
and others implemented in bsd/sys/kern/uipc_syscalls_wrap.cc
called by an application at the socket layer; the stack then converts user buffers to TCP packets, attaches IP headers to those TCP packets, and finally sends them out via the network card driver. Here is an example of a stack trace illustrating the send()
function call traversing all the way down the stack to push out an mbuf
onto the network interface (note ether_output_frame()
):
0xffff8000019db040 >/tests/misc-tc 2 121.338417028 virtio_net_tx_packet_size vring 0xffffa000011a9200 vec_sz 3
virtio::net::txq::try_xmit_one_locked(virtio::net::net_req*) drivers/virtio-net.cc:712
virtio::net::txq::try_xmit_one_locked(void*) drivers/virtio-net.cc:655
osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::xmit(mbuf*) include/osv/percpu_xmit.hh:293
ether_output_frame bsd/sys/net/if_ethersubr.cc:398
ether_output bsd/sys/net/if_ethersubr.cc:366
ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*) bsd/sys/netinet/ip_output.cc:621
tcp_output bsd/sys/netinet/tcp_output.cc:1385
tcp_usr_send(socket*, int, mbuf*, bsd_sockaddr*, mbuf*, thread*) bsd/sys/netinet/tcp_usrreq.cc:832
sosend_generic bsd/sys/kern/uipc_socket.cc:1075
kern_sendit bsd/sys/kern/uipc_syscalls.cc:515
sys_sendto bsd/sys/kern/uipc_syscalls.cc:470
sys_sendto bsd/sys/kern/uipc_syscalls.cc:554
linux_send bsd/sys/compat/linux/linux_socket.cc:859
send bsd/sys/kern/uipc_syscalls_wrap.cc:239
In another example below:
0xffff8000019db040 >/tests/misc-tc 2 121.339244385 virtio_net_tx_packet_size vring 0xffffa000011a9200 vec_sz 2
virtio::net::txq::try_xmit_one_locked(virtio::net::net_req*) drivers/virtio-net.cc:712
virtio::net::txq::try_xmit_one_locked(void*) drivers/virtio-net.cc:655
osv::xmitter<virtio::net::txq, 4096u, std::function<bool ()>, boost::iterators::function_output_iterator<osv::xmitter_functor<virtio::net::txq> > >::xmit(mbuf*) include/osv/percpu_xmit.hh:293
ether_output_frame bsd/sys/net/if_ethersubr.cc:398
ether_output bsd/sys/net/if_ethersubr.cc:366
ip_output(mbuf*, mbuf*, route*, int, ip_moptions*, inpcb*) bsd/sys/netinet/ip_output.cc:621
tcp_output bsd/sys/netinet/tcp_output.cc:1385
tcp_do_segment(mbuf*, tcphdr*, socket*, tcpcb*, int, int, unsigned char, int, bool&) bsd/sys/netinet/tcp_input.cc:1421
tcp_net_channel_packet bsd/sys/netinet/tcp_input.cc:3212
operator() bsd/sys/netinet/tcp_input.cc:3231
__invoke_impl<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:61
__invoke_r<void, tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)>&, mbuf*> /usr/include/c++/11/bits/invoke.h:154
_M_invoke /usr/include/c++/11/bits/std_function.h:290
std::function<void (mbuf*)>::operator()(mbuf*) const /usr/include/c++/11/bits/std_function.h:590
net_channel::process_queue() core/net_channel.cc:37
int sbwait_tmo<osv::clock::uptime>(socket*, sockbuf*, boost::optional<std::chrono::time_point<osv::clock::uptime, osv::clock::uptime::duration> >) bsd/sys/kern/uipc_sockbuf.cc:167
sbwait bsd/sys/kern/uipc_sockbuf.cc:190
soreceive_generic bsd/sys/kern/uipc_socket.cc:1464
kern_recvit bsd/sys/kern/uipc_syscalls.cc:607
sys_recvfrom bsd/sys/kern/uipc_syscalls.cc:673
sys_recvfrom bsd/sys/kern/uipc_syscalls.cc:707
linux_recv bsd/sys/compat/linux/linux_socket.cc:866
recv bsd/sys/kern/uipc_syscalls_wrap.cc:183
we have the recv()
function trying to receive data over a socket that traverses over the net channel and eventually hits the tcp_do_segment()
that triggers TCP output to send an ACK. Just like in the former stacktrace, the ether_output_frame()
is the one which calls the if_transmit
- a member of the struct ifnet
- to push out an mbuf
onto the associated network card.
In order to efficiently transmit data in the top-down flow, OSv uses an optimization technique based on the per-cpu TX queues. The main idea is to use the xmitter
class introduced by this commit to try to push an mbuf
onto network card if there is no contention and otherwise put it on a per-cpu queue which is processed later by special worker thread. For more details about how the xmitter
has been integrated into the virtio-net
and vmxnet3
drivers please look at this commit and that one respectively.
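The "send directly when uncontended, otherwise defer to a per-cpu queue" idea can be modeled as follows. This is a sketch under loud assumptions and not the xmitter class itself: toy_xmitter, try_acquire(), release() and drain() are hypothetical names, contention is represented by a single atomic flag, packet ordering between direct and deferred sends is ignored, and the per-cpu queues are plain deques that would not be safe with real concurrent producers.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

class toy_xmitter {
public:
    explicit toy_xmitter(std::size_t cpus) : _percpu(cpus) {}

    // Try to take exclusive ownership of the (modeled) transmit ring.
    bool try_acquire() { return !_busy.exchange(true, std::memory_order_acquire); }
    void release() { _busy.store(false, std::memory_order_release); }

    // Returns true if the packet went straight to the wire, false if it
    // was parked on the calling CPU's queue for the worker to send later.
    bool xmit(std::size_t cpu, std::string pkt)
    {
        if (try_acquire()) {
            _wire.push_back(std::move(pkt));
            release();
            return true;
        }
        _percpu.at(cpu).push_back(std::move(pkt));
        return false;
    }

    // Worker side: move every deferred packet to the wire.
    // Returns the number of packets sent, or 0 if the ring is busy.
    std::size_t drain()
    {
        if (!try_acquire()) {
            return 0;
        }
        std::size_t sent = 0;
        for (auto& q : _percpu) {
            while (!q.empty()) {
                _wire.push_back(std::move(q.front()));
                q.pop_front();
                ++sent;
            }
        }
        release();
        return sent;
    }

    const std::vector<std::string>& wire() const { return _wire; }

private:
    std::atomic<bool> _busy{false};
    std::vector<std::deque<std::string>> _percpu;  // one queue per CPU
    std::vector<std::string> _wire;                // packets handed to the card
};
```

Calling try_acquire() from the outside before xmit() is a convenient way to simulate another CPU holding the ring: the packet is then deferred and only reaches the wire once release() and drain() have run.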
A good part of the opposite, bottom-up direction has been extensively discussed in the section about net channels and the slow and fast paths above. Here is another "slow path" example illustrating a TCP state transition when data arrives:
0xffff800001783040 virtio-net-rx 1 140.273629729 tcp_state tp=0xffffa00002a8b400, FIN_WAIT_1 -> FIN_WAIT_2
tcpcb::set_state(int) ./bsd/sys/netinet/tcp_var.h:233
tcp_do_segment(mbuf*, tcphdr*, socket*, tcpcb*, int, int, unsigned char, int, bool&) bsd/sys/netinet/tcp_input.cc:2277
tcp_input bsd/sys/netinet/tcp_input.cc:956
ip_input bsd/sys/netinet/ip_input.cc:774
netisr_dispatch_src bsd/sys/net/netisr.cc:769
netisr_dispatch_src bsd/sys/net/netisr.cc:769
virtio::net::receiver() drivers/virtio-net.cc:544
std::_Function_handler<void (), virtio::net::net(virtio::virtio_device&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) drivers/virtio-net.cc:243
__invoke_impl<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:61
__invoke_r<void, virtio::net::net(virtio::virtio_device&)::<lambda()>&> /usr/include/c++/11/bits/invoke.h:154
_M_invoke /usr/include/c++/11/bits/std_function.h:290
sched::thread::main() core/sched.cc:1267
thread_main_c arch/x64/arch-switch.hh:325
thread_main arch/x64/entry.S:116
As mbuf
s travel up and down the stack, relevant functions get called to process them depending on family and protocol. To accommodate it, OSv re-uses the switch tables from FreeBSD.
For example, the netisr_dispatch_src()
is called by a network driver (through ether_input()
, which the if_input
member of struct ifnet
is set to) to propagate an mbuf
up the stack:
int netisr_dispatch_src(u_int proto, uintptr_t source, struct mbuf *m)
{
    ...
    netisr_proto[proto].np_handler(m);
    ...
}
In this case, the ether netisr
handler - ip_input()
- gets called for the protocol NETISR_ETHER
(the np_handler
gets set by netisr_register()
routine).
The ip_input()
ends up calling the tcp_input()
function using the switch table ip_protox
like so:
void ip_input(struct mbuf *m)
{
    uint8_t protocol;
    int hlen;
    m = ip_preprocess_packet(m, protocol, hlen);
    if (!m) {
        return;
    }
    (*inetsw[ip_protox[protocol]].pr_input)(m, hlen);
}
The switch tables for the inet
domain are set up in bsd/sys/netinet/in_proto.cc
.