
A new iteration of #498: message streaming mode. NOT done and work in process #1183

Closed
wants to merge 2,662 commits

Conversation

@vankoven (Contributor) commented Feb 12, 2019

The task is not completed yet and the work is still in progress. The PR is here for a very high-level review, to estimate the pros and cons of the chosen approach.

The code compiles and even works in some scenarios, but it doesn't look good for now. Last week I tried to fix the issues I had in streaming, especially the scenarios where one (or both) of the streaming peers suddenly disconnects mid-stream, but I completely ruined the streaming process: the code became overcomplicated, unreliable and prone to crashes. So I reverted to the very beginning and moved forward commit by commit, trying to simplify the code and make it more straightforward, but I haven't managed to finish it yet and some pieces of the code are still too messy and unreliable.

Here are the issues that appeared during the implementation.

TCP window steering

There are only two situations where receive window steering is required. In the first, a client feeds us tons of pipelined requests. That is not a big problem and can be resolved via Frang limits: Frang will block clients that are too fast, and normally clients are slower than the servers; closing such client connections with a 429 status code is probably acceptable. The second case is message streaming: we don't want to allow a client to eat all the memory with a single request (e.g. a POST request with a Blu-ray image in the body).

Note that window steering can't slow down clients that don't use request pipelining: such a client sends a request and the window gets smaller, then it receives the response and the window is back to maximum again.

The task is to use N as the buffer size for both unparsed skbs in the receive queue and parsed requests in TfwCliConn->seq_queue. If the requests stored in seq_queue have already filled the buffer, advertise a zero window to the client. When a response is forwarded to the client, the corresponding request must be destroyed and the advertised window should grow again.
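
A minimal sketch of that policy, assuming a per-connection byte budget N; `tfw_cli_conn_rcv_bytes()` and `tfw_cli_conn_seq_bytes()` are hypothetical helper names, not existing Tempesta API:

```c
/*
 * Illustrative sketch only: compute the receive window to advertise to a
 * client, given a per-connection buffer budget of N bytes. Both the unparsed
 * skbs in the receive queue and the parsed requests sitting in
 * TfwCliConn->seq_queue count against the budget. Both helpers are
 * hypothetical.
 */
static unsigned int
tfw_cli_advertised_wnd(TfwCliConn *cli_conn, unsigned int n)
{
	unsigned int used = tfw_cli_conn_rcv_bytes(cli_conn)
			    + tfw_cli_conn_seq_bytes(cli_conn);

	/* Buffer full: advertise a zero window until responses are
	 * forwarded and their requests are destroyed. */
	return used >= n ? 0 : n - used;
}
```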

To control the window size, we need to force the kernel to take into account the space occupied by the requests in the seq_queue. If we look at the TCP buffers, we will see two of them: the "network" buffer advertised as the receive window and the "application" buffer used to isolate the application from the network. Currently skbs are orphaned from the TCP socket right after receive, so the TCP stack sees no memory allocated to the socket and advertises the maximum window size. My idea was to register the size of a parsed HTTP message in sk->sk_rmem_alloc to make the kernel aware of the used memory and avoid manual window size calculations. The trick worked: while the client was filling the "application" buffer, it was advertised the maximum window size, but once the "network" buffer was used the window got smaller, down to zero. Another nice result is the possibility to set different buffer limits for client and server connections.
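
A minimal sketch of that trick, assuming hypothetical function names; only sk->sk_rmem_alloc and the atomic helpers are real kernel API, and the use of hm->msg.len is illustrative rather than the PR's actual code:

```c
#include <net/sock.h>

/*
 * Sketch of the accounting idea: charge the size of a parsed HTTP message to
 * the client socket so that the regular TCP code sees the memory as used and
 * shrinks the advertised window on its own. Function names are hypothetical.
 */
static inline void
tfw_http_msg_charge_rmem(struct sock *sk, TfwHttpMsg *hm)
{
	/* Make the kernel see the parsed message as allocated receive memory. */
	atomic_add((int)hm->msg.len, &sk->sk_rmem_alloc);
}

static inline void
tfw_http_msg_uncharge_rmem(struct sock *sk, TfwHttpMsg *hm)
{
	/* Release the charge once the paired response has been forwarded
	 * and the request is destroyed. */
	atomic_sub((int)hm->msg.len, &sk->sk_rmem_alloc);
}
```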

The code worked in all scenarios except one: closing connections in a highly concurrent environment. Dropped requests might be deregistered from sk->sk_rmem_alloc twice, so the counter wasn't empty at socket destruction and the kernel could emit pretty long traces when wrk stops. The issue is sporadic.
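
One possible way to avoid the double deregistration, shown only as a sketch; the `rmem_charged` field is hypothetical and not part of TfwHttpMsg:

```c
/*
 * Sketch of making the uncharge idempotent: remember how many bytes were
 * charged to sk->sk_rmem_alloc and atomically take that value exactly once,
 * so concurrent "drop request" paths can't subtract it twice. The
 * hm->rmem_charged field is hypothetical.
 */
static inline void
tfw_http_msg_uncharge_rmem_once(struct sock *sk, TfwHttpMsg *hm)
{
	unsigned int len = xchg(&hm->rmem_charged, 0);

	if (len)
		atomic_sub(len, &sk->sk_rmem_alloc);
}
```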

Tricky streaming situations

Streamed messages may still have to be buffered.

This happens when the receiver is not ready to receive a stream. E.g. a client requested two pipelined URIs: the first is very slow to complete, the second is fast but may be a full BD image. It's not possible to stream the BD image to the client until the first request is responded to. We also can't put the server connection on hold: a client may intentionally slow down receiving, so the server connection would get stuck. That is why window steering can't be applied to server connections.
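
A sketch of the ordering check behind this, under the assumption that requests are linked into the client's seq_queue via a seq_list member; the helper name is hypothetical:

```c
#include <linux/list.h>

/*
 * Illustrative only: a streamed response may be forwarded to the client only
 * when its request sits at the head of the client's seq_queue, because HTTP
 * pipelining requires responses to be sent in request order. The seq_list
 * linkage is assumed here for illustration.
 */
static bool
tfw_stream_resp_may_forward(TfwCliConn *cli_conn, TfwHttpReq *req)
{
	return list_first_entry(&cli_conn->seq_queue, TfwHttpReq,
				msg.seq_list) == req;
}
```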

Some parts of a streamed message must be buffered for some time

Some high-level HTTP processing can't work with partial data. E.g. trailer headers can't be processed in parts, we have to buffer them fully before processing. Although the RFC allows ignoring trailer headers in streaming mode, we still want to validate the headers and modify them if the req_hdr_set directive is set.
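
A sketch of the buffering decision, assuming hypothetical predicates for the trailer state (none of these names exist in Tempesta):

```c
/*
 * Illustrative only: decide whether the current chunk of a streamed message
 * may be forwarded right away or must stay buffered. Trailer headers are
 * buffered until the whole trailer section is parsed, so that it can be
 * validated and modified (e.g. by req_hdr_set) as a unit. Both predicates
 * are hypothetical.
 */
static bool
tfw_stream_chunk_may_forward(TfwHttpMsg *hm)
{
	if (tfw_http_parsing_trailer(hm))
		return tfw_http_trailer_complete(hm);
	return true;
}
```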

A response can appear at any time.

HTTP is a request-response protocol: a response is sent only after the request has been received. Well, forget about that during streaming. Imagine that a part of the request has been streamed to the backend and a new part is being parsed; a response may already have been received from the backend. Scenario 1: stream a GET request with a long body to Nginx and make a pause, Nginx will respond immediately without waiting for the end of the message. Scenario 2: stream a part of a request and make a syntax error, the backend will respond with a 4xx code immediately (Tempesta behaves the same way).

The target connection of a streamed message can disappear at any time.

Imagine a request is being streamed and a part of it has already been received by the backend, which has already sent a response. Now an error happens and the backend connection is closed. We still need to receive the rest of the stream, but there is no need to send it anywhere since the response has been received; just drop it part by part.
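
A sketch of that "drop it part by part" path, under the assumption that we can tell the paired response has already been forwarded; every name is hypothetical except __kfree_skb():

```c
#include <linux/skbuff.h>

/*
 * Illustrative sketch: the backend connection is gone, but its response has
 * already been forwarded to the client. The remaining chunks of the streamed
 * request are still read from the client socket, but simply freed instead of
 * being forwarded anywhere. tfw_http_resp_forwarded() is a hypothetical
 * predicate.
 */
static void
tfw_stream_sink_chunk(TfwHttpReq *req, struct sk_buff *skb)
{
	if (tfw_http_resp_forwarded(req)) {
		__kfree_skb(skb);	/* nobody needs this data any more */
		return;
	}
	/* ...otherwise the normal forwarding path would run here... */
}
```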

The client disconnected before the streamed message was received in full

If the partially received request wasn't sent to a backend, simply evict it. If it was, then there is no recovery possibility other than closing the server connection.
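
A sketch of that policy; tfw_connection_close() is mentioned elsewhere in this PR, while the other names (and the exact calling convention shown) are hypothetical:

```c
/*
 * Illustrative only: handle a client disconnect in the middle of a streamed
 * request. If nothing has been forwarded yet, the partial request is simply
 * evicted; once a prefix has been streamed to a backend, the only recovery
 * is to close that server connection. The predicate and the srv_conn
 * accessor are hypothetical, and the tfw_connection_close() usage is assumed.
 */
static void
tfw_stream_cli_drop(TfwHttpReq *req)
{
	if (!tfw_http_req_partially_forwarded(req)) {
		tfw_http_msg_free((TfwHttpMsg *)req);
		return;
	}
	/* There is no way to "unsend" the already streamed prefix. */
	tfw_connection_close(tfw_http_req_srv_conn(req), true);
}
```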

(The list is not finished and will be continued shortly.)

aleksostapenko and others added 30 commits October 11, 2018 20:50
 Fix #806: Reduce count of IPI during work_queue processing.
2. Fix handshake checksumming. Now we successfully finish handshakes (yahoo!);
3. Introduce TfwFsmData->trail so as not to treat the TLS TAG as part of an HTTP request;
4. Introduce ttls_bzero_safe() for fast & non-optimizable zeroing in process context;
5. Coding style cleanup for PK and some other minor cleanups.
from processing the tag if an HTTP request spans over a TLS record boundary.
Sometimes with make -jN, tdbq fails to link with libtdb, since there is
no libtdb yet.
Reset default vhost during start of Tempesta FW
Drops everything from the TfwAddr union, making it essentially an alias for
struct sockaddr_in6. IPv4 addresses are converted to IPv4-mapped addresses
at the parsing stage.

Macros for log reporting with addresses of both IPv4 and IPv6 are unified.
They got a new parameter controlling whether ports should be reported too.
Use struct sockaddr_in6 for all addresses
the lists and let FSMs keep their skb lists. skb lists can be sent
to GFSM calls only from the TLS level, which is special in its need
to collect skbs for complete records, so move the list processing to
the HTTP layer and process each skb in the list separately.
and accounting of parsed bytes in the recent http and ss changes.
Preserve correct fsm state when moving to a next state
Fix unstable JS challenge passing
drop unused "unused" parameter of frang_conn_limit()
…LS 0.1:

don't close the TCP connection on a server response with 'Connection: close' -
in this case the TLS close notification message is sent on a closed connection.
Add new TfwConnHooks->conn_close hook to send TLS close notification alert
just before closing a connection, so there is tfw_connection_close()
wrapper instead of direct ss_close() calls.
Also adds a rare case of empty slice handling.
When data are in multiple chunks, it's necessary to initialize 'tr' at the
beginning of every new chunk.
…ulation

fixups for sticky cookie calculation in case of segmented data
…itting

test cookie parser splitting string in the right places
If it was 0 for some reason, we force it to be UINT_MAX instead, exactly
to avoid such checks at run time.
@krizhanovsky (Contributor) commented Feb 13, 2019

Previous versions of the pull request with comments - #1012 and #1067

@krizhanovsky (Contributor) left a comment

I made only a high-level review, but I really appreciate the way #498 is solved in this PR. I updated the original issue with the additional information discussed in this and the previous pull requests.

```c
 * I.e. don't stream messages byte by byte, buffer some data before forward
 * it; if timeout is exceeded, stream as many data as available, don't wait
 * for more. It can be rather long wait in case of long polling. The idea
 * behind is same as for Nagle algorithm in TCP.
```

I think we should stream messages with skb granularity to avoid costly skb splitting or allocation.

"set client_msg_buffering to %u.\n",
tfw_cfg_cli_rmem, tfw_cfg_cli_msg_buff_sz,
tfw_cfg_cli_rmem);
tfw_cfg_cli_msg_buff_sz = tfw_cfg_cli_rmem;

If we have equal buffer and rmem, then it seems we'll frequently send a zero receive window for large messages: we have to announce a zero window to the client until we free all the memory occupied by the forwarded HTTP message.

```c
 * message processing inside the Tempesta due to message modifications.
 * Two variants are possible: mirror all modifications to sk->sk_rmem_alloc
 * or operate only by original message size. The second case was chosen
 * because its simplicity and error-free possibilities.
```

Yes, agree.

```c
 * Limit total memory used to serve single TCP connection.
 *
 * Received skb may be stored in one of two buffers: socket receive buffer
 * (sk->sk_receive_queue) or Tempesta connection buffer (TfwCliConn->seq_queue).
```

As soon as an skb is placed in sk_receive_queue, a socket callback is called and Tempesta removes the skb from the queue, so we don't actually use the receive buffer at all except for the TCP OOO segment queue. So tcp_rmem is consumed only by OOO segments in our case. This note probably doesn't affect the PR, but it's worth keeping in mind.

```c
TfwHttpMsg *
tfw_http_msg_alloc_stream_part(TfwHttpMsg *hm_head)
{
	TfwHttpStream *stream;
```

I supposed that if we receive a part of a streamed HTTP message, then there is a current TfwHttpMsg to which we can attach the new data, use for the parser and so on. Why do we need TfwHttpStream and why do we even allocate a new descriptor?

```c
 * more streamed and not fully received message in its queue. So the streamed
 * message is actually get buffered until the receiver is ready. There are also
 * some cases when a more of the message must be buffered before is can be
 * processed, e.g. trailer header processing.
```

If there are 2 streamed messages, then there should be 2 incomplete TfwHttpMsg descriptors; why not use them? We could even use parts of the TfwMsg->skb_head list, possibly with some skb pointers into the list, to handle the message headers and only the current stream buffer...

```c
new_measure:
	tp->rcvq_space.seq = tp->copied_seq;
	tp->rcvq_space.time = tp->tcp_mstamp;
}
```

OK, I love what's going on here - now it seems like the right way to solve the issue!

@vankoven vankoven changed the title PR of failed #498 task A new iteration of #498: message streaming mode. NOT done and work in process Feb 14, 2019
@vankoven vankoven marked this pull request as draft September 2, 2020 14:13
@krizhanovsky krizhanovsky deleted the ik-streaming-failed branch April 4, 2023 21:54