Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloning shared skbs #173

Closed
krizhanovsky opened this issue Aug 5, 2015 · 9 comments
Closed

Cloning shared skbs #173

krizhanovsky opened this issue Aug 5, 2015 · 9 comments
Assignees
Milestone

Comments

@krizhanovsky
Copy link
Contributor

The patch 2636242 makes the skb 'shared', i.e. skb_shared() returns true, so tcp_transmit_skb() makes the skb copy or clone and this is surely unwished behavior. Also, for example packet_rcv() from net/packet/af_packet.c does skb clones on following pattern:

    if (skb_shared(skb))
            struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);

So the fix introduces performance problem by unnecessary copying. Also since we do modify skb data we should prohibit skb clones at all in Tempesta path with something like:

    --- a/tempesta_fw/sock.c
    +++ b/tempesta_fw/sock.c
    @@ -445,6 +445,10 @@ ss_tcp_process_data(struct sock *sk)
        struct tcp_sock *tp = tcp_sk(sk);

        skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
    +       if (unlikely(skb_cloned(skb))) {
    +           SS_WARN("Cannot proceess cloned skb\n");
    +           goto out;
    +       }
            if (unlikely(before(tp->copied_seq, TCP_SKB_CB(skb)->seq))) {
                SS_WARN("recvmsg bug: TCP sequence gap at seq %X"
                    " recvnxt %X\n",

And the patch seems leads to real skb dops on ingress packets:

    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a51c0, sk->sk_socket           (null), state (Established)
    [tempesta]   new client socket: sk=ffff88002c4a51c0, state=1
    [tempesta]   new client: cli=ffff88002c123670
    [tempesta]   new client socket is accepted: sk=ffff88002c4a51c0, conn=ffff880078e9d5a0, cli=ffff88002c123670
    [sync_sockets]   ss_drain_accept_queue: sk ffff88002c4a58c0, sk->sk_socket           (null), state (Listen)
    [sync_sockets]   ss_tcp_data_ready: sk ffff88002c4a51c0, sk->sk_socket           (null), state (Established)
    [tempesta]   Link new msg ffff88002c369040 with connection ffff880078e9d5a0
    [tempesta]   Add skb ffff880079344340 to message ffff88002c369040
    [tempesta]   Received 288 client data bytes on conn=ffff880078e9d5a0
    [tempesta]   parse 288 client data bytes (GET / HTTP/1.1
    Host: 172.16.0.5
    User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: en-US,en;q=0.5
    Accept-Encoding: gzip, deflate
    Connection: keep-alive

    ) on req=ffff88002c369040
    [tempesta]   enter FSM at state 0
    [tempesta]   parser: Req_0(0:0): c=0x47(G), r=-2
    [tempesta]   parser: Req_Method(1:0): c=0x47(G), r=-2
    [tempesta]   parser: Req_Uri(11:0): c=0x2f(/), r=-2
    [tempesta]   parser: Req_UriAbsPath(22:0): c=0x20( ), r=-2
    [tempesta]   parser: Req_HttpVer(23:0): c=0x48(H), r=-2
    [tempesta]   parser: Req_EoL(31:0): c=0xd(.), r=-2
    [tempesta]   parser: Req_EoL(31:0): c=0xa(.), r=-2
    [tempesta]   parser: Req_Hdr(32:0): c=0x48(H), r=-2
    [tempesta]   open header at char [H]
    [tempesta]   parser: RGen_LWS(10000:37): c=0x20( ), r=-2
    [tempesta]   parser: RGen_LWS(10000:37): c=0x31(1), r=-2
    [tempesta]   enter FSM at state 37
    [tempesta]   parser: Req_HdrHostV(37:0): c=0x31(1), r=-2
    [tempesta]   open header at char [1]
    [tempesta]   enter FSM at state 1
    [tempesta]   parser: Req_I_H(1:1): c=0x31(1), r=-2
    [tempesta]   parser: Req_I_H(1:1): c=0x37(7), r=-2
    [tempesta]   parser: Req_I_H(1:1): c=0x32(2), r=-2
    [tempesta]   parser: Req_I_H(1:1): c=0x2e(.), r=-2
    [tempesta]   parser: Req_I_H(1:1): c=0x31(1), r=-2
    [tempesta]   parser: Req_I_H(1:1): c=0x36(6), r=-2
    [tempesta]   parser: Req_I_H(1:1): c=0x2e(.), r=-2
    [tempesta]   parser: Req_I_H(1:1): c=0x30(0), r=-2
    [tempesta]   parser: Req_I_H(1:1): c=0x2e(.), r=-2
    [tempesta]   parser: Req_I_H(1:1): c=0x35(5), r=-2
    [tempesta]   parser: Req_I_H(1:1): c=0xd(.), r=-2
    [tempesta]   parse header __req_parse_host: ret=10 len=266 id=0
    [tempesta]   store header chunk len=32 data=ffff88007ac14402 hdr=<0x4,0,ffff88007ac14418>
    [tempesta]   store header w/ ptr=ffff88007ac14418 len=10 flags=6 id=0
    [tempesta]   parser: RGen_LF(10001:0): c=0xa(.), r=-2
    [tempesta]   parser: Req_Hdr(32:0): c=0x55(U), r=-2
    [tempesta]   open header at char [U]
    [tempesta]   parser: Req_HdrOther(108:0): c=0x55(U), r=-2
    [tempesta]   store header chunk len=122 data=ffff88007ac14402 hdr=<0x0,0,ffff88007ac14424>
    [tempesta]   store header w/ ptr=ffff88007ac14424 len=88 flags=2 id=4
    [tempesta]   parser: RGen_LF(10001:0): c=0xa(.), r=-2
    [tempesta]   parser: Req_Hdr(32:0): c=0x41(A), r=-2
    [tempesta]   open header at char [A]
    [tempesta]   parser: Req_HdrOther(108:0): c=0x41(A), r=-2
    [tempesta]   store header chunk len=195 data=ffff88007ac14402 hdr=<0x0,0,ffff88007ac1447e>
    [tempesta]   store header w/ ptr=ffff88007ac1447e len=71 flags=2 id=5
    [tempesta]   parser: RGen_LF(10001:0): c=0xa(.), r=-2
    [tempesta]   parser: Req_Hdr(32:0): c=0x41(A), r=-2
    [tempesta]   open header at char [A]
    [tempesta]   parser: Req_HdrOther(108:0): c=0x41(A), r=-2
    [tempesta]   store header chunk len=228 data=ffff88007ac14402 hdr=<0x0,0,ffff88007ac144c7>
    [tempesta]   store header w/ ptr=ffff88007ac144c7 len=31 flags=2 id=6
    [tempesta]   parser: RGen_LF(10001:0): c=0xa(.), r=-2
    [tempesta]   parser: Req_Hdr(32:0): c=0x41(A), r=-2
    [tempesta]   open header at char [A]
    [tempesta]   parser: Req_HdrOther(108:0): c=0x41(A), r=-2
    [tempesta]   store header chunk len=260 data=ffff88007ac14402 hdr=<0x0,0,ffff88007ac144e8>
    [tempesta]   store header w/ ptr=ffff88007ac144e8 len=30 flags=2 id=7
    [tempesta]   parser: RGen_LF(10001:0): c=0xa(.), r=-2
    [tempesta]   parser: Req_Hdr(32:0): c=0x43(C), r=-2
    [tempesta]   open header at char [C]
    [tempesta]   parser: Req_HdrC(38:0): c=0x6f(o), r=-2
    [tempesta]   parser: Req_HdrConnection(60:0): c=0x3a(:), r=-2
    [tempesta]   parser: RGen_LWS(10000:61): c=0x20( ), r=-2
    [tempesta]   parser: RGen_LWS(10000:61): c=0x6b(k), r=-2
    [tempesta]   enter FSM at state 61
    [tempesta]   parser: Req_HdrConnectionV(61:0): c=0x6b(k), r=-2
    [tempesta]   open header at char [k]
    [tempesta]   enter FSM at state 1
    [tempesta]   parser: I_Conn(1:1): c=0x6b(k), r=-2
    [tempesta]   parser: I_EoT(6:6): c=0xd(.), r=0
    [tempesta]   parse header __parse_connection: ret=10 len=14 id=2
    [tempesta]   store header chunk len=284 data=ffff88007ac14402 hdr=<0x4,0,ffff88007ac14514>
    [tempesta]   store header w/ ptr=ffff88007ac14514 len=10 flags=6 id=2
    [tempesta]   parser: RGen_LF(10001:0): c=0xa(.), r=-2
    [tempesta]   parser: Req_Hdr(32:0): c=0xd(.), r=-2
    [tempesta]   parser: Req_HdrDone(109:0): c=0xa(.), r=-2
    [tempesta]   parse msg body: flags=0x2 content_length=0
    [tempesta]   Request parsed: len=288 parsed=288 msg_len=288 res=0
    [tempesta]   TFW_HTTP_FSM_REQ_MSG return code 0
    [tempesta]   Matching request: ffff88002c369040, list: ffff88002c4a4040
    [tempesta]   tfw_sched_http: use server group: 'default'
    [tempesta]   added X-Forwarded-For header: X-Forwarded-For: 172.16.0.1

    [sync_sockets]   ss_send: sk ffff88002c4a8140, sk->sk_socket           (null), state (Established)
    [sync_sockets]   ss_send:168 entail skb=ffff880079344340 data_len=0 len=317
    [sync_sockets]   ss_send:199 sk=ffff88002c4a8140 is_queue_empty=0 tcp_send_head(sk)=ffff880079344340 sk->sk_state=1
    [sync_sockets]   ss_tcp_data_ready: sk ffff88002c4a8140, sk->sk_socket           (null), state (Established)
    [sync_sockets] Warning: Cannot proceess cloned skb
    [tempesta]   connection lost: 127.0.0.1:8080
    [tempesta]   Free msg:           (null)
    [tempesta]   Free msg: ffff88002c369040
    [tempesta]   free skb ffff880079344340: truesize=2240 sk=          (null), destructor=          (null) users=1 type=Conn_Clnt
    [sync_sockets]   Close socket ffff88002c4a8140 (account=1)
    [sync_sockets]   __ss_do_close: sk ffff88002c4a8140, sk->sk_socket           (null), state (Established)
    [sync_sockets]   free rcv skb ffff88002c271348
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c369740, sk->sk_socket           (null), state (Established)
    [tempesta]   connected: 127.0.0.1:8080
    microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a6180, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   Peer connection closing
    [tempesta]   connection lost: 127.0.0.1:8080
    [tempesta]   Free msg:           (null)
    [sync_sockets]   Close socket ffff88002c4a6180 (account=1)
    [sync_sockets]   __ss_do_close: sk ffff88002c4a6180, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   free rcv skb ffff88002d586e88
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a8140, sk->sk_socket           (null), state (Established)
    [tempesta]   connected: 127.0.0.1:8080
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a8840, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   Peer connection closing
    [tempesta]   connection lost: 127.0.0.1:8080
    [tempesta]   Free msg:           (null)
    [sync_sockets]   Close socket ffff88002c4a8840 (account=1)
    [sync_sockets]   __ss_do_close: sk ffff88002c4a8840, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   free rcv skb ffff88002c25ee88
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c250780, sk->sk_socket           (null), state (Established)
    [tempesta]   connected: 127.0.0.1:8080
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a6880, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   Peer connection closing
    [tempesta]   connection lost: 127.0.0.1:8080
    [tempesta]   Free msg:           (null)
    [sync_sockets]   Close socket ffff88002c4a6880 (account=1)
    [sync_sockets]   __ss_do_close: sk ffff88002c4a6880, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   free rcv skb ffff88002c25ea08
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a8840, sk->sk_socket           (null), state (Established)
    [tempesta]   connected: 127.0.0.1:8080
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c369740, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   Peer connection closing
    [tempesta]   connection lost: 127.0.0.1:8080
    [tempesta]   Free msg:           (null)
    [sync_sockets]   Close socket ffff88002c369740 (account=1)
    [sync_sockets]   __ss_do_close: sk ffff88002c369740, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   free rcv skb ffff88002d586a08
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a6180, sk->sk_socket           (null), state (Established)
    [tempesta]   connected: 127.0.0.1:8080
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c250780, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   Peer connection closing
    [tempesta]   connection lost: 127.0.0.1:8080
    [tempesta]   Free msg:           (null)
    BUG: scheduling while atomic: httpd/3099/0x10000101
    Modules linked in: tfw_sched_rr(O) tfw_sched_http(O) tfw_sched_hash(O) tfw_sched_dummy(O) tempesta_fw(O) tempesta_db(O) nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables xfs exportfs ppdev microcode serio_raw pcspkr i2c_piix4 i2c_core parport_pc parport dm_mirror dm_region_hash dm_log dm_mod uinput btrfs xor zlib_deflate raid6_pq libcrc32c ata_generic pata_acpi e1000 ata_piix floppy ipv6 autofs4
    CPU: 0 PID: 3099 Comm: httpd Tainted: G           O 3.10.10+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
     ffff88007fc12d00 ffff88007fc039f0 ffffffff814cdd5f ffff88007fc03a00
     ffffffff814ca4d0 ffff88007fc03a68 ffffffff814d1618 ffff88002c337fd8
     0000000000000246 0000000000012d00 ffff88002c337fd8 0000000000012d00
    Call Trace:
     <IRQ>  [<ffffffff814cdd5f>] dump_stack+0x19/0x1b
     [<ffffffff814ca4d0>] __schedule_bug+0x48/0x56
     [<ffffffff814d1618>] __schedule+0x6c8/0x7c0
     [<ffffffff8106f3b6>] __cond_resched+0x26/0x30
     [<ffffffff814d1afa>] _cond_resched+0x3a/0x50
     [<ffffffff813f1a52>] lock_sock_nested+0x12/0x40
     [<ffffffffa0472053>] lock_sock+0x1d/0x1f [tempesta_fw]
     [<ffffffffa04741be>] ss_connect+0x85/0xe8 [tempesta_fw]
     [<ffffffffa0475311>] tfw_sock_srv_connect_try+0xb8/0xfb [tempesta_fw]
     [<ffffffffa0475698>] tfw_sock_srv_connect_failover+0xd3/0x119 [tempesta_fw]
     [<ffffffffa047327b>] ss_droplink+0x80/0xaf [tempesta_fw]
     [<ffffffffa0473b32>] ss_tcp_state_change+0x21a/0x29c [tempesta_fw]
     [<ffffffff814501f9>] tcp_fin+0x169/0x1e0
     [<ffffffff81452af8>] tcp_data_queue+0x6c8/0xc90
     [<ffffffff814559af>] tcp_rcv_established+0x1bf/0x8a0
     [<ffffffff81460704>] tcp_v4_do_rcv+0x1b4/0x470
     [<ffffffff8123a2a6>] ? tempesta_sock_tcp_rcv+0x26/0x30
     [<ffffffff81461d3e>] tcp_v4_rcv+0x68e/0x7e0
     [<ffffffff8143c2c0>] ? ip_rcv_finish+0x320/0x320
     [<ffffffff8143c390>] ip_local_deliver_finish+0xd0/0x210
     [<ffffffff8143c6a8>] ip_local_deliver+0x88/0x90
     [<ffffffff8143c018>] ip_rcv_finish+0x78/0x320
     [<ffffffff8143c91d>] ip_rcv+0x26d/0x390
     [<ffffffff81405776>] __netif_receive_skb_core+0x5e6/0x820
     [<ffffffff814059cd>] __netif_receive_skb+0x1d/0x60
     [<ffffffff814064b0>] process_backlog+0xa0/0x160
     [<ffffffff81405d79>] net_rx_action+0x139/0x220
     [<ffffffff810461f0>] __do_softirq+0x100/0x250
     [<ffffffff814dc47c>] call_softirq+0x1c/0x30
     <EOI>  [<ffffffff810041ed>] do_softirq+0x3d/0x80
     [<ffffffff81045814>] local_bh_enable+0x94/0xa0
     [<ffffffff81440828>] ip_finish_output+0x1b8/0x3b0
     [<ffffffff81441d70>] ip_output+0x90/0xa0
     [<ffffffff81441495>] ip_local_out+0x25/0x30
     [<ffffffff814417eb>] ip_queue_xmit+0x14b/0x3f0
     [<ffffffff81458f2d>] tcp_transmit_skb+0x45d/0x8c0
     [<ffffffff81459524>] tcp_write_xmit+0x194/0xb90
     [<ffffffff8145a13e>] __tcp_push_pending_frames+0x2e/0xc0
     [<ffffffff8145b014>] tcp_send_fin+0x64/0xe0
     [<ffffffff81449e24>] tcp_shutdown+0x54/0x60
     [<ffffffff81472929>] inet_shutdown+0x89/0x120
     [<ffffffff813f019d>] SyS_shutdown+0x7d/0x90
     [<ffffffff814db242>] system_call_fastpath+0x16/0x1b
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a8140, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   Peer connection closing
    [tempesta]   connection lost: 127.0.0.1:8080
    [tempesta]   Free msg:           (null)
    [sync_sockets]   Close socket ffff88002c4a8140 (account=1)
    [sync_sockets]   __ss_do_close: sk ffff88002c4a8140, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   free rcv skb ffff8800795e9a08
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a6880, sk->sk_socket           (null), state (Established)
    [tempesta]   connected: 127.0.0.1:8080
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a8840, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   Peer connection closing
    [tempesta]   connection lost: 127.0.0.1:8080
    [tempesta]   Free msg:           (null)
    [sync_sockets]   Close socket ffff88002c4a8840 (account=1)
    [sync_sockets]   __ss_do_close: sk ffff88002c4a8840, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   free rcv skb ffff8800795e9588
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a8140, sk->sk_socket           (null), state (Established)
    [tempesta]   connected: 127.0.0.1:8080
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a6180, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   Peer connection closing
    [tempesta]   connection lost: 127.0.0.1:8080
    [tempesta]   Free msg:           (null)
    [sync_sockets]   Close socket ffff88002c4a6180 (account=1)
    [sync_sockets]   __ss_do_close: sk ffff88002c4a6180, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   free rcv skb ffff8800795e97c8
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c4a8840, sk->sk_socket           (null), state (Established)
    [tempesta]   connected: 127.0.0.1:8080
    [sync_sockets]   Close socket ffff88002c250780 (account=1)
    [sync_sockets]   __ss_do_close: sk ffff88002c250780, sk->sk_socket           (null), state (Close Wait)
    [sync_sockets]   free rcv skb ffff8800795e9c48
    [sync_sockets]   ss_tcp_state_change: sk ffff88002c250080, sk->sk_socket           (null), state (Established)
    [tempesta]   connected: 127.0.0.1:8080
    sysctl[3130]: sysctl: cannot stat /proc/sys/fs/nfs/nlm_tcpport: No such file or directory
    sysctl[3130]: sysctl: cannot stat /proc/sys/fs/nfs/nlm_udpport: No such file or directory

The problem also should be debugged and we should prohibit cloned skbs in Tempesta path.

@krizhanovsky krizhanovsky added this to the 0.4.0 Web Server milestone Aug 5, 2015
@krizhanovsky
Copy link
Contributor Author

Related to these fixes: #56 (comment)

@keshonok
Copy link
Contributor

keshonok commented Aug 7, 2015

In the regular course of action tcp_transmit_skb() always makes a clone (with skb_clone()) or a copy (with pskb_copy() if an SKB is a clone already) of an SKB that it is asked to transmit. That is regulated by the clone_it argument to the function, and it's set to zero only for non-data SKBs that are generated by the network stack, such as ACK, etc. The clone or the copy then makes its way down the stack to a network device, while the original stays in the TX queue awaiting an ACK, or for possible retransmits.

When a client of a back end server sit on the same host as Tempesta, SKBs are transmitted and received via the loopback driver. It means that the same cloned SKBs that are sent out, are received and make their way up the stack to Tempesta.

In other words, receiving cloned SKBs is a pretty normal situation in the network stack. The stack is aware of that, and knows how to work with that.

Note that utilities like tcpdump are different, and do not have any effect on what's described above. They are given a shared SKB, and if they want to modify it they need to either clone or copy it depending on how deep the modification runs. These SKBs do not get to Tempesta.

The fix mentioned above makes SKBs shared on their way out of Tempesta to a client or a back end server (skb->users > 1). That is a totally different matter than cloned SKBs. As it is, the fix doesn't change the way network stack works, and it doesn't introduce any extra copying.

What is unusual about the fix is that is makes cloned SKBs shared, which looks more like an oxymoron as "cloning" and "sharing" of SKBs are just two different kinds of sharing of SKBs. So, usually, these two kinds of sharing do not appear together in the same SKB. Having both in the same SKB is totally unconventional.

Tempesta needs to be able to modify data in an SKB, like inserting or removing HTTP headers. That may happen anywhere in the data, not just in the packet header. Neither cloning or sharing of SKBs provides that, and only a full SKB copy can provide full protection. Still, SKBs that get to Tempesta are assumed to be in sole possession of the Linux kernel, so it's assumed that data in those SKBs can be modified. Usually that is the case anyway, but it needs to be stated clearly.

With all of the above, there are several places where Tempesta and the kernel need modifications:

  1. An SKB may be split in the kernel's TX path due to a different MSS. If an SKB is cloned, it's uncloned first (see tcp_fragment()). That, however, causes a BUG() in pskb_expand_head() that doesn't expect a shared SKB (recall that shared SKBs can't be modified at all). However, we do know here that a shared SKB is not harmful as it's from Tempesta. It makes sense to expand the condition in pskb_expand_head() and only do a BUG() if an SKB is not from Tempesta. However, currently we don't know if there are other places like that. An alternative way is to unclone SKBs right before they are shared on their way out of Tempesta in ss_send().
  2. Pipelined requests are split into separate requests, so SKBs are split in two. If an SKB came in cloned, it has to be uncloned first. Need that in ss_skb_split(). The BUG() mentioned above is not triggered because SKBs are made shared only on their way out of Tempesta.
  3. SKBs are also always cloned with skb_clone() if they're original, or copied with pskb_copy() if they're clones when TCP stack sends them down to a network device. Clones/copies are sent down, and the originals stay in TCP write queue awaiting for ACK or a possible retransmit. Note that pskb_copy() makes a copy of sk_buff{} and the header in one pass, whereas skb_clone() makes a copy of sk_buff{}, and pskb_expand_head(), or "unclone", makes a copy of the header. So, while the alternative fix mentioned in (1) above is cleaner and more straightforward, it may be a little bit "cheaper" to do it the other way. I'm not sure if that's worth it though. That needs to be investigated.
  4. As Tempesta processes data in an SKB, it sets various pointers into SKB's data, like pointers to HTTP headers, etc. Certain manipulations on SKBs that happen in the network stack (see (1) and (3) above) move SKB data, and that may make Tempesta pointers invalid. These are the pointers that point into the linear part of an SKB. These pointers may need to be adjusted after SKBs are sent and before the pointers are used. We have this case in Tempesta with request SKBs as they are kept around until the response comes, and data from request SKBs are used then for caching. So this is a pretty serious matter that needs a proper resolution.

@keshonok
Copy link
Contributor

Current solution is to unclone all cloned SKBs that come to Tempesta (see 4971758). That allows Tempesta to not worry about cloned packets altogether. Incoming cloned SKBs are unwelcome in following situations:

  1. Pipelined requests or responses. These are split into separate HTTP messages, and that's done by splitting the SKBs these HTTP messages are in.
  2. Inserting/Deleting/Replacing HTTP headers. That also requires SKBs that are not clones.

The SKBs are almost surely modified for HTTP requests, but at this time there's no modification of HTTP responses. In case performance becomes critical, it may make sense to unclone SKBs only when necessary, i.e. when they are definitely modified in any of the cases above.

Also, in case cloned SKBs are seen on the way out of Tempesta (in case we decide to do it in an "effective way" in future), certain measures described above will be necessary that deal with SKBs that are both cloned and shared, as Tempesta makes SKBs shared on their way out.

The bigger issue remains that an SKB may still be split or merged in the kernel network stack, and that may make pointers into the SKB invalid. These are the pointers that point to every small part of an HTTP message, like each header field, and the body. These pointers are used later for caching of HTTP messages. A solution needs to be found.

@krizhanovsky krizhanovsky modified the milestones: TBD, 0.4.0 Web Server Aug 10, 2015
@krizhanovsky
Copy link
Contributor Author

Actually, there is no problem with skb handled data, only network header pointers are adjusted while data is kept untouched. See #122 (comment). There is example of sendfile(2) which can send the same data many times and surely w/o mangling of original pages. Meantime, skb head data also kept safe.

@keshonok
Copy link
Contributor

I don't think that's completely true. Below are the cases where Tempesta can lose SKB data or end up with invalid pointers to HTTP message data stored in an SKB.

  1. An SKB is split due to a different MSS, and the split occurs in the linear part. The original SKB is trimmed to a new size, and remaining data from the linear part, as well as all paged data is moved to a new SKB. The data is neither lost nor mangled in this case. Even though the original SKB loses most of data and the SKB size now says that it's much smaller, the linear part itself is not moved or mangled. That memory is still there, and Tempesta pointers stay valid. Paged data is moved to a new SKB that Tempesta doesn't know about and has not control of, so Tempesta loses control of paged data. When that data is transmitted, the new SKB is freed and paged data's memory becomes invalid, along with Tempesta pointers into that data.
    This occurs in skb_split_inside_header() called from skb_split() called from tcp_fragment() that can be called from tso_fragment() or directly.
  2. Same as above, but the split occurs in paged data. Similarly, some pages stay in the original SKB that Tempesta has a reference to, and some pages are moved to a new SKB that Tempesta knows nothing of. When that new SKB is transmitted it's freed and data in those pages is lost. Tempesta pointers into that data become invalid.
    This occurs in skb_split_no_header() called from skb_split() called from tcp_fragment() that can be called from tso_fragment() or directly.
  3. Any event that requires extra head space. An outgoing packet in the egress path may need more head space than what was allocated for it in the ingress path. That may cause either relocation of data within the linear part using extra tail space, or relocation of data to a newly allocated linear part. Both actions lead to invalid pointers to sections of an HTTP message that are located in the linear part of an SKB. I believe that's an infrequent case, yet I don't think it can be completely excluded.
    This occurs in pskb_expand_head(). Besides an obvious case, it may also be called due to GSO/TSO processing: skb_cow() or skb_cow_head() called from skb_segment() called from tcp_tso_segment() called from inet_gso_segment().
  4. Most of SKB mangling occurs on TCP ACK processing or retransmit. It may happen that only part of SKB's data is ACKed. In that case an SKB is adjusted to start from a new address in the data. Full pages of data that were received by the other end of connection are unreferenced and disappear from an SKB, and effectively they are lost. That memory becomes invalid, as do Tempesta pointers into that memory.
    This occurs in tcp_fragment() called tcp_match_skb_to_sack() (that is eventually called from tcp_ack()).
    There's also a path in retransmit code through tcp_fragment() called from tcp_mark_head_lost() called from tcp_timeout_skbs() called from tcp_update_scoreboard().
    There're also path in retransmit code through tcp_trim_head() called from __tcp_retransmit_skb(). pskb_trim() there is also destructive.
  5. I didn't research the consequences well enough but it appears that SKBs may also be merged in the retransmit path. That may present similar challenges as well. In one example, though, ttcp_collapse_retrans() leaves all Tempesta pointers perfectly valid.
  6. Also, I am not sure yet how the fragment list of an SKB plays into all of this. It may complicate things further.

keshonok added a commit that referenced this issue Aug 19, 2015
Implement a workaround to issues described in #173 and #122.
@krizhanovsky krizhanovsky modified the milestones: 0.5.0 Web Server, 0.6 TBD Sep 2, 2015
@krizhanovsky
Copy link
Contributor Author

There is also GFSM issue. Lets assume HTTP FSM calls other FSM, for example ICAP, and the called FSM returns POSTPONE code. This means that the FSM uses the skb (probably not only data kept by the skb, but also the skb itself in case of TL program or low-level network classifier) and lower layer, the caller, FSM must not free the skb.

This is not the case for current Frang implementation, but we must keep this in mind during #102 implementing.

Skb can be freed when all GFSM and/or TL programs return PASS or BLOCK on it. This could be tracked by bitmask, i.e. skb can be freed when POSTPONE bit is zero.

Since network layer is stateless, then probably we can make an agreement than low layer network classifiers don't postpone skbs while the higher layer GFSM users use skb data only and we can free skbs and get/put data pages carries by skbs (this is originally @keshonok proposal).

@krizhanovsky
Copy link
Contributor Author

@krizhanovsky
Copy link
Contributor Author

skb_get() commented out by d8e9bd0 is really wrong solution since tcp_push() -> __tcp_push_pending_frames() -> tcp_write_xmit() -> tso_fragment() -> tcp_fragment() changes the skb data pointers (not the data itself, leaving all data pointers in TfwStr correct) by skb_trim() or skb_split(). The new skbs are just pushed to socket send queue w/o adjusting current skb, so all Tempesta's skb lists are unaware about the changes. So we can't use skb after data transmission because it has wrong data pointers and lengths. Rather we must clone it. tcp_fragment() doesn't care about shared skb, but correctly handles cloned skbs.

@krizhanovsky
Copy link
Contributor Author

Actually there is not much we can do: tcp_transmit_skb() changed skb data pointers and lengths, so we must clone skb before transmission. However, it's unclear why during transmission work flow mentioned in #173 (comment) tcp_transmit_skb() can't use the same skb and does pskb_copy() or skb_clone() - it seems the code can be reworked to reduce number of copies.

This is pure linux kernel issue which should be investigated more. For now 4.1.6 does ever more tcp_transmit_skb() calls with armed clone_it. Probably, this will change with future kernel releases. Will back to the issue on next kernel tree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants