android_kernel_samsung_sm8650/net
mfreemon@cloudflare.com 0796c53424 tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
[ Upstream commit b650d953cd391595e536153ce30b4aab385643ac ]

Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session.  A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time.  In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-19 23:08:54 +02:00
..
6lowpan
9p 9p: virtio: make sure 'offs' is initialized in zc_request 2023-09-13 09:42:21 +02:00
802 mrp: introduce active flags to prevent UAF when applicant uninit 2022-12-31 13:33:02 +01:00
8021q vlan: fix a potential uninit-value in vlan_dev_hard_start_xmit() 2023-05-24 17:32:47 +01:00
appletalk
atm atm: hide unused procfs functions 2023-06-09 10:34:16 +02:00
ax25 ax25: move from strlcpy with unused retval to strscpy 2022-08-22 17:55:50 -07:00
batman-adv batman-adv: Hold rtnl lock during MTU update via netlink 2023-08-30 16:11:08 +02:00
bluetooth Bluetooth: ISO: Fix handling of listen for unicast 2023-10-10 22:00:40 +02:00
bpf Revert "bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES" 2023-03-17 08:50:32 +01:00
bpfilter
bridge neighbour: fix data-races around n->output 2023-10-10 22:00:42 +02:00
caif net: caif: Fix use-after-free in cfusbl_device_notify() 2023-03-17 08:50:24 +01:00
can can: isotp: isotp_sendmsg(): fix TX state detection and wait behavior 2023-10-19 23:08:52 +02:00
ceph libceph: fix potential hang in ceph_osdc_notify() 2023-08-11 12:08:19 +02:00
core net: refine debug info in skb_checksum_help() 2023-10-19 23:08:53 +02:00
dcb net: dcb: choose correct policy to parse DCB_ATTR_BCN 2023-08-11 12:08:17 +02:00
dccp dccp: fix dccp_v4_err()/dccp_v6_err() again 2023-10-06 14:56:39 +02:00
devlink devlink: remove reload failed checks in params get/set callbacks 2023-09-23 11:11:01 +02:00
dns_resolver
dsa net: dsa: sja1105: always enable the send_meta options 2023-07-19 16:22:06 +02:00
ethernet net: gro: skb_gro_header helper function 2022-08-25 10:33:21 +02:00
ethtool ipv6: Remove in6addr_any alternatives. 2023-09-19 12:28:10 +02:00
hsr net: hsr: Add __packed to struct hsr_sup_tlv. 2023-10-06 14:56:57 +02:00
ieee802154 net: ieee802154: fix error return code in dgram_bind() 2022-10-07 09:29:17 +02:00
ife
ipv4 tcp: enforce receive buffer memory limits by allowing the tcp window to shrink 2023-10-19 23:08:54 +02:00
ipv6 ipv6: remove one read_lock()/read_unlock() pair in rt6_check_neigh() 2023-10-10 22:00:46 +02:00
iucv net/iucv: Fix size of interrupt data 2023-03-22 13:33:50 +01:00
kcm kcm: Fix error handling for SOCK_DGRAM in kcm_sendmsg(). 2023-09-19 12:28:10 +02:00
key net: af_key: fix sadb_x_filter validation 2023-08-23 17:52:32 +02:00
l2tp ipv4, ipv6: Fix handling of transhdrlen in __ip{,6}_append_data() 2023-10-10 22:00:42 +02:00
l3mdev
lapb
llc llc: Don't drop packet from non-root netns. 2023-07-27 08:50:45 +02:00
mac80211 wifi: mac80211: fix potential key use-after-free 2023-10-10 22:00:41 +02:00
mac802154 mac802154: fix missing INIT_LIST_HEAD in ieee802154_if_add() 2022-12-05 09:53:08 +01:00
mctp net: mctp: purge receive queues on sk destruction 2023-02-06 08:06:34 +01:00
mpls net: mpls: fix stale pointer if allocation fails during device rename 2023-02-22 12:59:53 +01:00
mptcp mptcp: fix delegated action races 2023-10-19 23:08:49 +02:00
ncsi ncsi: Propagate carrier gain/loss events to the NCSI controller 2023-10-06 14:56:57 +02:00
netfilter net: prevent address rewrite in kernel_bind() 2023-10-19 23:08:50 +02:00
netlabel netlabel: fix shift wrapping bug in netlbl_catmap_setlong() 2023-09-13 09:42:24 +02:00
netlink netlink: remove the flex array from struct nlmsghdr 2023-10-10 22:00:46 +02:00
netrom netrom: Deny concurrent connect(). 2023-09-13 09:42:35 +02:00
nfc nfc: nci: assert requested protocol is valid 2023-10-19 23:08:54 +02:00
nsh net: nsh: Use correct mac_offset to unwind gso skb in nsh_gso_segment() 2023-05-24 17:32:45 +01:00
openvswitch net: openvswitch: reject negative ifindex 2023-08-23 17:52:35 +02:00
packet net/packet: annotate data-races around tp->status 2023-08-16 18:27:26 +02:00
phonet
psample genetlink: start to validate reserved header bytes 2022-08-29 12:47:15 +01:00
qrtr net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume() 2023-04-20 12:35:09 +02:00
rds net: prevent address rewrite in kernel_bind() 2023-10-19 23:08:50 +02:00
rfkill
rose net/rose: Fix to not accept on connected socket 2023-02-22 12:59:42 +01:00
rxrpc rxrpc: Fix hard call timeout units 2023-05-17 11:53:35 +02:00
sched net/sched: Retire rsvp classifier 2023-09-23 11:11:13 +02:00
sctp sctp: update hb timer immediately after users change hb_interval 2023-10-10 22:00:44 +02:00
smc net/smc: Fix pos miscalculation in statistics 2023-10-19 23:08:54 +02:00
strparser
sunrpc Revert "SUNRPC dont update timeout value on connection reset" 2023-10-06 14:57:03 +02:00
switchdev
tipc tipc: fix a potential deadlock on &tx->lock 2023-10-10 22:00:43 +02:00
tls net/tls: do not free tls_rec on async operation in bpf_exec_tx_verdict() 2023-09-19 12:28:09 +02:00
unix af_unix: Fix data-race around unix_tot_inflight. 2023-09-19 12:28:02 +02:00
vmw_vsock vsock: avoid to close connected socket after the timeout 2023-05-24 17:32:44 +01:00
wireless wifi: cfg80211: fix cqm_config access race 2023-10-10 22:00:40 +02:00
x25 net/x25: Fix to not accept on connected socket 2023-02-09 11:28:13 +01:00
xdp xsk: Fix xsk_diag use-after-free error during socket cleanup 2023-09-19 12:28:01 +02:00
xfrm xfrm: add forgotten nla_policy for XFRMA_MTIMER_THRESH 2023-08-23 17:52:32 +02:00
compat.c use less confusing names for iov_iter direction initializers 2023-02-09 11:28:04 +01:00
devres.c
Kconfig Remove DECnet support from kernel 2022-08-22 14:26:30 +01:00
Kconfig.debug net: make NET_(DEV|NS)_REFCNT_TRACKER depend on NET 2022-09-20 14:23:56 -07:00
Makefile devlink: move code to a dedicated directory 2023-08-30 16:11:00 +02:00
socket.c net: prevent address rewrite in kernel_bind() 2023-10-19 23:08:50 +02:00
sysctl_net.c