From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: "mfreemon@cloudflare.com" <mfreemon@cloudflare.com>
Date: Wed, 1 Mar 2023 20:06:28 -0600
Subject: [PATCH] Add a sysctl to allow TCP window shrinking in order to honor
memory limits
Under certain circumstances, the tcp receive buffer memory limit
set by autotuning is ignored, and the receive buffer can grow
unrestrained until it reaches tcp_rmem[2].
To reproduce: Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()). This will fill the tcp receive buffer
all the way to tcp_rmem[2], ignoring the autotuning limit
(sk_rcvbuf).
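The sender side of this reproduction can be sketched roughly as follows.
This is only an illustration: the helper name send_small_packets() and
the payload bytes are made up here, and the loop is bounded by a count
rather than infinite so it can be exercised in a test. It assumes an
already-connected stream socket whose peer never reads.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: send `count` 4-byte payloads on a connected
 * socket, sleeping 1 ms after each send(). Returns the number of
 * payloads sent, or -1 if a send() fails or is short.
 */
static int send_small_packets(int fd, int count)
{
	static const char payload[4] = { 'p', 'i', 'n', 'g' };
	const struct timespec gap = { .tv_sec = 0, .tv_nsec = 1000000 };
	int sent = 0;

	assert(fd >= 0 && count >= 0);

	while (sent < count) {
		ssize_t n = send(fd, payload, sizeof(payload), MSG_NOSIGNAL);

		if (n != (ssize_t)sizeof(payload))
			return -1;
		sent++;
		nanosleep(&gap, NULL);	/* 1 ms gap between small packets */
	}
	return sent;
}
```

To observe the behavior described above, the receiver would accept the
connection and then never call recv(), while the sender runs this loop
for long enough to exceed sk_rcvbuf.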
As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.
The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of bytes ACKed back
to the sender. This problem has previously been identified in
RFC 7323, appendix F [1].
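The granularity problem can be illustrated with a small user-space
model. This is a sketch, not kernel code: the helper names are invented
for illustration, and never_shrink_window() only mirrors the
"new_win < cur_win" branch of tcp_select_window() before this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Round a window down to a multiple of the scaling unit (1 << wscale). */
static uint32_t win_round_down(uint32_t win, unsigned int wscale)
{
	assert(wscale <= 14);	/* RFC 7323 limits the shift count to 14 */
	return win & ~((1u << wscale) - 1);
}

/* Round a window up to a multiple of the scaling unit. */
static uint32_t win_round_up(uint32_t win, unsigned int wscale)
{
	uint32_t unit;

	assert(wscale <= 14);
	unit = 1u << wscale;
	return (win + unit - 1) & ~(unit - 1);
}

/* Value carried in the 16-bit window field of the TCP header. */
static uint16_t win_to_wire(uint32_t win, unsigned int wscale)
{
	return (uint16_t)(win >> wscale);
}

/* Model of the pre-patch rule: if the memory-based window (new_win)
 * is smaller than what is still on offer (cur_win), keep offering
 * cur_win, rounded up to the scaling unit.
 */
static uint32_t never_shrink_window(uint32_t cur_win, uint32_t new_win,
				    unsigned int wscale)
{
	if (new_win < cur_win)
		return win_round_up(cur_win, wscale);
	return new_win;
}
```

With wscale = 7 the unit is 128 bytes. After a 4-byte segment arrives
against a 65536-byte window, cur_win is 65532, and the rule above
rounds it back up to 65536. The right edge therefore advances by 4
bytes on every such ACK, regardless of how much memory the queued
segment actually consumes.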
The Linux kernel currently adheres to never shrinking the window.
In addition to the overallocation of memory mentioned above, this
is also functionally incorrect, because once tcp_rmem[2] is
reached, the receiver will drop in-window packets resulting in
retransmissions and an eventual timeout of the tcp session. A
receive buffer full condition should instead result in a zero
window and an indefinite wait.
In practice, this problem is largely hidden for most flows. It
is not applicable to mice flows. Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.
But this problem does show up for other types of flows. Good
examples are websockets and other types of flows that send small
amounts of data spaced apart slightly in time. In these cases,
we directly encounter the problem described in [1].
RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window
management have made clear that the sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].
This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit set by
autotuning (sk_rcvbuf). This new functionality is enabled with
the following sysctl:
sysctl: net.ipv4.tcp_shrink_window
This sysctl changes how the TCP window is calculated.
If sysctl tcp_shrink_window is zero (the default value), then the
window is never shrunk.
If sysctl tcp_shrink_window is non-zero, then the memory limit
set by autotuning is honored. This requires that the TCP window
be shrunk ("retracted") as described in RFC 1122.
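As a rough illustration of the non-zero case, the rounding applied to
the candidate window can be modeled in user space as below. The
function shrink_mode_window() is hypothetical and heavily simplified:
it ignores the mss, allowed_space and memory-pressure checks that
__tcp_select_window() also performs, and keeps only the rounding steps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of the window chosen when
 * tcp_shrink_window is non-zero. Not kernel code.
 */
static uint32_t shrink_mode_window(uint32_t free_space,
				   uint32_t rcv_ssthresh,
				   unsigned int wscale)
{
	uint32_t unit;

	assert(wscale <= 14);	/* RFC 7323 limits the shift count to 14 */
	unit = 1u << wscale;

	/* The memory-based limit is hard: round down so the offered
	 * window never exceeds it. Less than one unit becomes zero.
	 */
	free_space &= ~(unit - 1);

	/* rcv_ssthresh is not a hard limit, so when it is the
	 * binding constraint the result is rounded up instead.
	 */
	if (free_space > rcv_ssthresh)
		free_space = (rcv_ssthresh + unit - 1) & ~(unit - 1);

	return free_space;
}
```

In both branches the result is an exact multiple of the scaling unit,
so the value advertised on the wire describes the same window that
the receiver is accounting for.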
[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323
---
Documentation/networking/ip-sysctl.rst | 14 ++++++
include/net/netns/ipv4.h | 2 +
net/ipv4/sysctl_net_ipv4.c | 7 +++
net/ipv4/tcp_ipv4.c | 1 +
net/ipv4/tcp_output.c | 59 +++++++++++++++++++-------
5 files changed, 68 insertions(+), 15 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index e7b3fa7bb3f7..114ea77f4786 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -965,6 +965,20 @@ tcp_tw_reuse - INTEGER
tcp_window_scaling - BOOLEAN
Enable window scaling as defined in RFC1323.
+tcp_shrink_window - BOOLEAN
+ This changes how the TCP receive window is calculated when window
+ scaling is in effect.
+
+ RFC 7323, section 2.4, says there are instances when a retracted
+ window can be offered, and that TCP implementations MUST ensure
+ that they handle a shrinking window, as specified in RFC 1122.
+
+ - 0 - Disabled. The window is never shrunk.
+ - 1 - Enabled. The window is shrunk when necessary to remain within
+ the memory limit set by autotuning (sk_rcvbuf).
+
+ Default: 0
+
tcp_wmem - vector of 3 INTEGERs: min, default, max
min: Amount of memory reserved for send buffers for TCP sockets.
Each TCP socket has rights to use it due to fact of its birth.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index bea45ca29cd0..476378afdd99 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -231,5 +231,7 @@ struct netns_ipv4 {
atomic_t rt_genid;
siphash_key_t ip_id_key;
+
+ unsigned int sysctl_tcp_shrink_window;
};
#endif
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index fab6da51e4c6..bf5386395ebd 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1398,6 +1398,13 @@ static struct ctl_table ipv4_net_table[] = {
.mode = 0644,
.proc_handler = proc_douintvec_minmax,
},
+ {
+ .procname = "tcp_shrink_window",
+ .data = &init_net.ipv4.sysctl_tcp_shrink_window,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ },
{ }
};
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a0a3880b8cf9..725c2aa3b515 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3217,6 +3217,7 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.tcp_congestion_control = &tcp_reno;
net->ipv4.sysctl_tcp_collapse_max_bytes = 0;
+ net->ipv4.sysctl_tcp_shrink_window = 0;
return 0;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 85f9a3a99bd6..c08cb445d5f7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -260,8 +260,8 @@ static u16 tcp_select_window(struct sock *sk)
u32 old_win = tp->rcv_wnd;
u32 cur_win = tcp_receive_window(tp);
u32 new_win = __tcp_select_window(sk);
+ struct net *net = sock_net(sk);
- /* Never shrink the offered window */
if (new_win < cur_win) {
/* Danger Will Robinson!
* Don't update rcv_wup/rcv_wnd here or else
@@ -270,11 +270,15 @@ static u16 tcp_select_window(struct sock *sk)
*
* Relax Will Robinson.
*/
- if (new_win == 0)
- NET_INC_STATS(sock_net(sk),
- LINUX_MIB_TCPWANTZEROWINDOWADV);
- new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+ if (!net->ipv4.sysctl_tcp_shrink_window) {
+ /* Never shrink the offered window */
+ if (new_win == 0)
+ NET_INC_STATS(sock_net(sk),
+ LINUX_MIB_TCPWANTZEROWINDOWADV);
+ new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+ }
}
+
tp->rcv_wnd = new_win;
tp->rcv_wup = tp->rcv_nxt;
@@ -2956,6 +2960,7 @@ u32 __tcp_select_window(struct sock *sk)
{
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
+ struct net *net = sock_net(sk);
/* MSS for the peer's data. Previous versions used mss_clamp
* here. I don't know if the value based on our guesses
* of peer's MSS is better for the performance. It's more correct
@@ -2977,16 +2982,24 @@ u32 __tcp_select_window(struct sock *sk)
if (mss <= 0)
return 0;
}
+
+ if (net->ipv4.sysctl_tcp_shrink_window) {
+ /* new window should always be an exact multiple of scaling factor */
+ free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+ }
+
if (free_space < (full_space >> 1)) {
icsk->icsk_ack.quick = 0;
if (tcp_under_memory_pressure(sk))
tcp_adjust_rcv_ssthresh(sk);
- /* free_space might become our new window, make sure we don't
- * increase it due to wscale.
- */
- free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+ if (!net->ipv4.sysctl_tcp_shrink_window) {
+ /* free_space might become our new window, make sure we don't
+ * increase it due to wscale.
+ */
+ free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+ }
/* if free space is less than mss estimate, or is below 1/16th
* of the maximum allowed, try to move to zero-window, else
@@ -2997,10 +3010,24 @@ u32 __tcp_select_window(struct sock *sk)
*/
if (free_space < (allowed_space >> 4) || free_space < mss)
return 0;
+
+ if (net->ipv4.sysctl_tcp_shrink_window && free_space < (1 << tp->rx_opt.rcv_wscale))
+ return 0;
}
- if (free_space > tp->rcv_ssthresh)
+ if (free_space > tp->rcv_ssthresh) {
free_space = tp->rcv_ssthresh;
+ if (net->ipv4.sysctl_tcp_shrink_window) {
+ /* new window should always be an exact multiple of scaling factor
+ *
+ * For this case, we ALIGN "up" (increase free_space) because
+ * we know free_space is not zero here, it has been reduced from
+ * the memory-based limit, and rcv_ssthresh is not a hard limit
+ * (unlike sk_rcvbuf).
+ */
+ free_space = ALIGN(free_space, (1 << tp->rx_opt.rcv_wscale));
+ }
+ }
/* Don't do rounding if we are using window scaling, since the
* scaled window will not line up with the MSS boundary anyway.
@@ -3008,11 +3035,13 @@ u32 __tcp_select_window(struct sock *sk)
if (tp->rx_opt.rcv_wscale) {
window = free_space;
- /* Advertise enough space so that it won't get scaled away.
- * Import case: prevent zero window announcement if
- * 1<<rcv_wscale > mss.
- */
- window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+ if (!net->ipv4.sysctl_tcp_shrink_window) {
+ /* Advertise enough space so that it won't get scaled away.
+ * Import case: prevent zero window announcement if
+ * 1<<rcv_wscale > mss.
+ */
+ window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+ }
} else {
window = tp->rcv_wnd;
/* Get the largest window that is a nice multiple of mss.
--
2.39.2