lineage-22.1
11880 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
|
8bb51cdc4f |
xfrm: fix inbound ipv4/udp/esp packets to UDPv6 dualstack sockets
[ Upstream commit 1166a530a84758bb9e6b448fc8c195ed413f5ded ]
Before Linux v5.8 an AF_INET6 SOCK_DGRAM (udp/udplite) socket
with SOL_UDP, UDP_ENCAP, UDP_ENCAP_ESPINUDP{,_NON_IKE} enabled
would just unconditionally use xfrm4_udp_encap_rcv(), afterwards
such a socket would use the newly added xfrm6_udp_encap_rcv()
which only handles IPv6 packets.
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Benedict Wong <benedictwong@google.com>
Cc: Yan Yan <evitayan@google.com>
Fixes:
|
||
|
ed6634a559 |
Merge 'android14-6.1' into 'android14-6.1-lts'
This catches the -lts branch up with all of the recent changes that have gone into the non-lts branch, INCLUDING the ABI update which we want here to ensure that we do NOT break any newly added dependent symbols (and to bring back in the reverts that were required before the ABI break). This includes the following commits: |
||
|
1707d64dab |
UPSTREAM: tcp: deny tcp_disconnect() when threads are waiting
[ Upstream commit 4faeee0cf8a5d88d63cdbc3bab124fb0e6aed08c ] Historically connect(AF_UNSPEC) has been abused by syzkaller and other fuzzers to trigger various bugs. A recent one triggers a divide-by-zero [1], and Paolo Abeni was able to diagnose the issue. tcp_recvmsg_locked() has tests about sk_state being not TCP_LISTEN and TCP REPAIR mode being not used. Then later if socket lock is released in sk_wait_data(), another thread can call connect(AF_UNSPEC), then make this socket a TCP listener. When recvmsg() is resumed, it can eventually call tcp_cleanup_rbuf() and attempt a divide by 0 in tcp_rcv_space_adjust() [1] This patch adds a new socket field, counting number of threads blocked in sk_wait_event() and inet_wait_for_connect(). If this counter is not zero, tcp_disconnect() returns an error. This patch adds code in blocking socket system calls, thus should not hurt performance of non blocking ones. Note that we probably could revert commit |
||
|
07873e75c6 |
UPSTREAM: bpf, sockmap: Incorrectly handling copied_seq
[ Upstream commit e5c6de5fa025882babf89cecbed80acf49b987fa ] The read_skb() logic is incrementing the tcp->copied_seq which is used for among other things calculating how many outstanding bytes can be read by the application. This results in application errors, if the application does an ioctl(FIONREAD) we return zero because this is calculated from the copied_seq value. To fix this we move tcp->copied_seq accounting into the recv handler so that we update these when the recvmsg() hook is called and data is in fact copied into user buffers. This gives an accurate FIONREAD value as expected and improves ACK handling. Before we were calling the tcp_rcv_space_adjust() which would update 'number of bytes copied to user in last RTT' which is wrong for programs returning SK_PASS. The bytes are only copied to the user when recvmsg is handled. Doing the fix for recvmsg is straightforward, but fixing redirect and SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then call this from skmsg handlers. This fixes another issue where a broken socket with a BPF program doing a resubmit could hang the receiver. This happened because although read_skb() consumed the skb through sock_drop() it did not update the copied_seq. Now if a single reccv socket is redirecting to many sockets (for example for lb) the receiver sk will be hung even though we might expect it to continue. The hang comes from not updating the copied_seq numbers and memory pressure resulting from that. We have a slight layer problem of calling tcp_eat_skb even if its not a TCP socket. To fix we could refactor and create per type receiver handlers. I decided this is more work than we want in the fix and we already have some small tweaks depending on caller that use the helper skb_bpf_strparser(). So we extend that a bit and always set the strparser bit when it is in use and then we can gate the seq_copied updates on this. Fixes: |
||
|
f9cc0b7f9b |
UPSTREAM: bpf, sockmap: TCP data stall on recv before accept
[ Upstream commit ea444185a6bf7da4dd0df1598ee953e4f7174858 ] A common mechanism to put a TCP socket into the sockmap is to hook the BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program that can map the socket info to the correct BPF verdict parser. When the user adds the socket to the map the psock is created and the new ops are assigned to ensure the verdict program will 'see' the sk_buffs as they arrive. Part of this process hooks the sk_data_ready op with a BPF specific handler to wake up the BPF verdict program when data is ready to read. The logic is simple enough (posted here for easy reading) static void sk_psock_verdict_data_ready(struct sock *sk) { struct socket *sock = sk->sk_socket; if (unlikely(!sock || !sock->ops || !sock->ops->read_skb)) return; sock->ops->read_skb(sk, sk_psock_verdict_recv); } The oversight here is sk->sk_socket is not assigned until the application accepts() the new socket. However, its entirely ok for the peer application to do a connect() followed immediately by sends. The socket on the receiver is sitting on the backlog queue of the listening socket until its accepted and the data is queued up. If the peer never accepts the socket or is slow it will eventually hit data limits and rate limit the session. But, important for BPF sockmap hooks when this data is received TCP stack does the sk_data_ready() call but the read_skb() for this data is never called because sk_socket is missing. The data sits on the sk_receive_queue. Then once the socket is accepted if we never receive more data from the peer there will be no further sk_data_ready calls and all the data is still on the sk_receive_queue(). Then user calls recvmsg after accept() and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler. The handler checks for data in the sk_msg ingress queue expecting that the BPF program has already run from the sk_data_ready hook and enqueued the data as needed. So we are stuck. To fix do an unlikely check in recvmsg handler for data on the sk_receive_queue and if it exists wake up data_ready. We have the sock locked in both read_skb and recvmsg so should avoid having multiple runners. Fixes: |
||
|
028591f2c8 |
UPSTREAM: bpf, sockmap: Handle fin correctly
[ Upstream commit 901546fd8f9ca4b5c481ce00928ab425ce9aacc0 ] The sockmap code is returning EAGAIN after a FIN packet is received and no more data is on the receive queue. Correct behavior is to return 0 to the user and the user can then close the socket. The EAGAIN causes many apps to retry which masks the problem. Eventually the socket is evicted from the sockmap because its released from sockmap sock free handling. The issue creates a delay and can cause some errors on application side. To fix this check on sk_msg_recvmsg side if length is zero and FIN flag is set then set return to zero. A selftest will be added to check this condition. Fixes: |
||
|
a59051006b |
UPSTREAM: bpf, sockmap: Pass skb ownership through read_skb
[ Upstream commit 78fa0d61d97a728d306b0c23d353c0e340756437 ] The read_skb hook calls consume_skb() now, but this means that if the recv_actor program wants to use the skb it needs to inc the ref cnt so that the consume_skb() doesn't kfree the sk_buff. This is problematic because in some error cases under memory pressure we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue(). Then we get this, skb_linearize() __pskb_pull_tail() pskb_expand_head() BUG_ON(skb_shared(skb)) Because we incremented users refcnt from sk_psock_verdict_recv() we hit the bug on with refcnt > 1 and trip it. To fix lets simply pass ownership of the sk_buff through the skb_read call. Then we can drop the consume from read_skb handlers and assume the verdict recv does any required kfree. Bug found while testing in our CI which runs in VMs that hit memory constraints rather regularly. William tested TCP read_skb handlers. [ 106.536188] ------------[ cut here ]------------ [ 106.536197] kernel BUG at net/core/skbuff.c:1693! [ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1 [ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014 [ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330 [ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202 [ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20 [ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8 [ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000 [ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8 [ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8 [ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000 [ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0 [ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 106.542255] Call Trace: [ 106.542383] <IRQ> [ 106.542487] __pskb_pull_tail+0x4b/0x3e0 [ 106.542681] skb_ensure_writable+0x85/0xa0 [ 106.542882] sk_skb_pull_data+0x18/0x20 [ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9 [ 106.543536] ? migrate_disable+0x66/0x80 [ 106.543871] sk_psock_verdict_recv+0xe2/0x310 [ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0 [ 106.544561] tcp_read_skb+0x7b/0x120 [ 106.544740] tcp_data_queue+0x904/0xee0 [ 106.544931] tcp_rcv_established+0x212/0x7c0 [ 106.545142] tcp_v4_do_rcv+0x174/0x2a0 [ 106.545326] tcp_v4_rcv+0xe70/0xf60 [ 106.545500] ip_protocol_deliver_rcu+0x48/0x290 [ 106.545744] ip_local_deliver_finish+0xa7/0x150 Fixes: |
||
|
ee4c9c95ff |
Merge 6.1.34 into android14-6.1-lts
Changes in 6.1.34
scsi: megaraid_sas: Add flexible array member for SGLs
net: sfp: fix state loss when updating state_hw_mask
spi: mt65xx: make sure operations completed before unloading
platform/surface: aggregator: Allow completion work-items to be executed in parallel
platform/surface: aggregator_tabletsw: Add support for book mode in KIP subsystem
spi: qup: Request DMA before enabling clocks
afs: Fix setting of mtime when creating a file/dir/symlink
wifi: mt76: mt7615: fix possible race in mt7615_mac_sta_poll
bpf, sockmap: Avoid potential NULL dereference in sk_psock_verdict_data_ready()
neighbour: fix unaligned access to pneigh_entry
net: dsa: lan9303: allow vid != 0 in port_fdb_{add|del} methods
net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294
bpf: Fix UAF in task local storage
bpf: Fix elem_size not being set for inner maps
net/ipv6: fix bool/int mismatch for skip_notify_on_dev_down
net/smc: Avoid to access invalid RMBs' MRs in SMCRv1 ADD LINK CONT
net: enetc: correct the statistics of rx bytes
net: enetc: correct rx_bytes statistics of XDP
net/sched: fq_pie: ensure reasonable TCA_FQ_PIE_QUANTUM values
drm/i915: Explain the magic numbers for AUX SYNC/precharge length
drm/i915: Use 18 fast wake AUX sync len
Bluetooth: hci_sync: add lock to protect HCI_UNREGISTER
Bluetooth: Fix l2cap_disconnect_req deadlock
Bluetooth: ISO: don't try to remove CIG if there are bound CIS left
Bluetooth: L2CAP: Add missing checks for invalid DCID
wifi: mac80211: use correct iftype HE cap
wifi: cfg80211: reject bad AP MLD address
wifi: mac80211: mlme: fix non-inheritence element
wifi: mac80211: don't translate beacon/presp addrs
qed/qede: Fix scheduling while atomic
wifi: cfg80211: fix locking in sched scan stop work
selftests/bpf: Verify optval=NULL case
selftests/bpf: Fix sockopt_sk selftest
netfilter: nft_bitwise: fix register tracking
netfilter: conntrack: fix NULL pointer dereference in nf_confirm_cthelper
netfilter: ipset: Add schedule point in call_ad().
netfilter: nf_tables: out-of-bound check in chain blob
ipv6: rpl: Fix Route of Death.
tcp: gso: really support BIG TCP
rfs: annotate lockless accesses to sk->sk_rxhash
rfs: annotate lockless accesses to RFS sock flow table
net: sched: add rcu annotations around qdisc->qdisc_sleeping
drm/i915/selftests: Stop using kthread_stop()
drm/i915/selftests: Add some missing error propagation
net: sched: move rtm_tca_policy declaration to include file
net: sched: act_police: fix sparse errors in tcf_police_dump()
net: sched: fix possible refcount leak in tc_chain_tmplt_add()
bpf: Add extra path pointer check to d_path helper
drm/amdgpu: fix Null pointer dereference error in amdgpu_device_recover_vram
lib: cpu_rmap: Fix potential use-after-free in irq_cpu_rmap_release()
net: bcmgenet: Fix EEE implementation
bnxt_en: Don't issue AP reset during ethtool's reset operation
bnxt_en: Query default VLAN before VNIC setup on a VF
bnxt_en: Skip firmware fatal error recovery if chip is not accessible
bnxt_en: Prevent kernel panic when receiving unexpected PHC_UPDATE event
bnxt_en: Implement .set_port / .unset_port UDP tunnel callbacks
batman-adv: Broken sync while rescheduling delayed work
Input: xpad - delete a Razer DeathAdder mouse VID/PID entry
Input: psmouse - fix OOB access in Elantech protocol
Input: fix open count when closing inhibited device
ALSA: hda: Fix kctl->id initialization
ALSA: ymfpci: Fix kctl->id initialization
ALSA: gus: Fix kctl->id initialization
ALSA: cmipci: Fix kctl->id initialization
ALSA: hda/realtek: Add quirk for Clevo NS50AU
ALSA: ice1712,ice1724: fix the kcontrol->id initialization
ALSA: hda/realtek: Add a quirk for HP Slim Desktop S01
ALSA: hda/realtek: Add Lenovo P3 Tower platform
ALSA: hda/realtek: Add quirks for Asus ROG 2024 laptops using CS35L41
drm/i915/gt: Use the correct error value when kernel_context() fails
drm/amd/pm: conditionally disable pcie lane switching for some sienna_cichlid SKUs
drm/amdgpu: fix xclk freq on CHIP_STONEY
drm/amdgpu: change reserved vram info print
drm/amd/pm: Fix power context allocation in SMU13
drm/amd/display: Reduce sdp bw after urgent to 90%
wifi: iwlwifi: mvm: Fix -Warray-bounds bug in iwl_mvm_wait_d3_notif()
can: j1939: j1939_sk_send_loop_abort(): improved error queue handling in J1939 Socket
can: j1939: change j1939_netdev_lock type to mutex
can: j1939: avoid possible use-after-free when j1939_can_rx_register fails
mptcp: only send RM_ADDR in nl_cmd_remove
mptcp: add address into userspace pm list
mptcp: update userspace pm infos
selftests: mptcp: update userspace pm addr tests
selftests: mptcp: update userspace pm subflow tests
ceph: fix use-after-free bug for inodes when flushing capsnaps
s390/dasd: Use correct lock while counting channel queue length
Bluetooth: Fix use-after-free in hci_remove_ltk/hci_remove_irk
Bluetooth: fix debugfs registration
Bluetooth: hci_qca: fix debugfs registration
tee: amdtee: Add return_origin to 'struct tee_cmd_load_ta'
rbd: move RBD_OBJ_FLAG_COPYUP_ENABLED flag setting
rbd: get snapshot context after exclusive lock is ensured to be held
virtio_net: use control_buf for coalesce params
soc: qcom: icc-bwmon: fix incorrect error code passed to dev_err_probe()
pinctrl: meson-axg: add missing GPIOA_18 gpio group
usb: usbfs: Enforce page requirements for mmap
usb: usbfs: Use consistent mmap functions
mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM
mm: page_table_check: Ensure user pages are not slab pages
arm64: dts: qcom: sc8280xp: Flush RSC sleep & wake votes
ARM: at91: pm: fix imbalanced reference counter for ethernet devices
ARM: dts: at91: sama7g5ek: fix debounce delay property for shdwc
ASoC: codecs: wsa883x: do not set can_multi_write flag
ASoC: codecs: wsa881x: do not set can_multi_write flag
arm64: dts: qcom: sc7180-lite: Fix SDRAM freq for misidentified sc7180-lite boards
arm64: dts: imx8qm-mek: correct GPIOs for USDHC2 CD and WP signals
arm64: dts: imx8-ss-dma: assign default clock rate for lpuarts
ASoC: mediatek: mt8195-afe-pcm: Convert to platform remove callback returning void
ASoC: mediatek: mt8195: fix use-after-free in driver remove path
ASoC: simple-card-utils: fix PCM constraint error check
blk-mq: fix blk_mq_hw_ctx active request accounting
arm64: dts: imx8mn-beacon: Fix SPI CS pinmux
i2c: mv64xxx: Fix reading invalid status value in atomic mode
firmware: arm_ffa: Set handle field to zero in memory descriptor
gpio: sim: fix memory corruption when adding named lines and unnamed hogs
i2c: sprd: Delete i2c adapter in .remove's error path
riscv: mm: Ensure prot of VM_WRITE and VM_EXEC must be readable
eeprom: at24: also select REGMAP
soundwire: stream: Add missing clear of alloc_slave_rt
riscv: fix kprobe __user string arg print fault issue
vduse: avoid empty string for dev name
vhost: support PACKED when setting-getting vring_base
vhost_vdpa: support PACKED when setting-getting vring_base
ksmbd: fix out-of-bound read in deassemble_neg_contexts()
ksmbd: fix out-of-bound read in parse_lease_state()
ksmbd: check the validation of pdu_size in ksmbd_conn_handler_loop
Revert "ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled"
ext4: only check dquot_initialize_needed() when debugging
wifi: rtw89: correct PS calculation for SUPPORTS_DYNAMIC_PS
wifi: rtw88: correct PS calculation for SUPPORTS_DYNAMIC_PS
Revert "staging: rtl8192e: Replace macro RTL_PCI_DEVICE with PCI_DEVICE"
Linux 6.1.34
Note, commit
|
||
|
32d0f34bbc |
Revert "tcp: deny tcp_disconnect() when threads are waiting"
This reverts commit
|
||
|
2a77668d45 |
This is the 6.1.33 stable release
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmSC5VIACgkQONu9yGCS aT5RPhAAiVFNzTuQT4DtPzXUzl9hpNtdtZPVa/z28+SbOZyf2YgyDGXLHvnGbJ/2 8DWDV9uSsxdX2InNqzD/IbRSiHjXprpDssthq3Qr5aPH7FO76uICWndrCk0dhZsK kI/+J7BqS1vgtaxsZeo/IHmMQJ5oEzx/JzvcyK5po0rykNDCxWNnh8cK4YtFOVtk eRD8cPWXvJGn88pdPPlQuS75MKBGcAUZLodN//tP+x2bcWzocaTZUCEHL36eLcVc 0CxPykCpFOcLFLIJWQ+pY2/HR2ynTBxYoaXsTpscR+FKbS+Lz9B6PUoXCvqaV2/e lriLjg22lbqxBbBhEk5NLBVozajtU/gNq6pptp/EnZahwjjyavuToZviWf8NWfs0 2u+zQlolinCKnm+8o18dRn24kI7LbUSD2w+V8FydSQNHMikvu/xHgDdLgzmj2XAf ZIAkHdGjRzKL2euDPrp28D5vPfCqDjqT2wUE2vUsc+Ax4k6ewFCPs3cweWD8hoFS fAjTC3Q/oNp6eEbWuWJPxl+DW/tD3ezRGeqrRCXQwubcgwB5iaS5ItdCCfG/lfiJ PNHf4kpg4FlyBf8aPD+R3QA6KOuS1owNNk3cx72zHs8zPusosHWj9hDrXeYVn06G gj1SIoC+jC/L5nbYH9WFLnKm9+EQ28lcp9j7f1PdlDhkcJmzBRY= =Qjnb -----END PGP SIGNATURE----- Merge 6.1.33 into android14-6.1-lts Changes in 6.1.33 RDMA/bnxt_re: Fix the page_size used during the MR creation phy: amlogic: phy-meson-g12a-mipi-dphy-analog: fix CNTL2_DIF_TX_CTL0 value RDMA/efa: Fix unsupported page sizes in device RDMA/hns: Fix timeout attr in query qp for HIP08 RDMA/hns: Fix base address table allocation RDMA/hns: Modify the value of long message loopback slice dmaengine: at_xdmac: fix potential Oops in at_xdmac_prep_interleaved() RDMA/bnxt_re: Fix a possible memory leak RDMA/bnxt_re: Fix return value of bnxt_re_process_raw_qp_pkt_rx iommu/rockchip: Fix unwind goto issue iommu/amd: Don't block updates to GATag if guest mode is on iommu/amd: Handle GALog overflows iommu/amd: Fix up merge conflict resolution nfsd: make a copy of struct iattr before calling notify_change dmaengine: pl330: rename _start to prevent build error riscv: Fix unused variable warning when BUILTIN_DTB is set net/mlx5: Drain health before unregistering devlink net/mlx5: SF, Drain health before removing device net/mlx5: fw_tracer, Fix event handling net/mlx5e: Don't attach netdev profile while handling internal error net: mellanox: mlxbf_gige: Fix skb_panic splat under memory pressure netrom: fix info-leak in nr_write_internal() af_packet: Fix data-races of pkt_sk(sk)->num. tls: improve lockless access safety of tls_err_abort() amd-xgbe: fix the false linkup in xgbe_phy_status perf ftrace latency: Remove unnecessary "--" from --use-nsec option mtd: rawnand: ingenic: fix empty stub helper definitions RDMA/irdma: Prevent QP use after free RDMA/irdma: Fix Local Invalidate fencing af_packet: do not use READ_ONCE() in packet_bind() tcp: deny tcp_disconnect() when threads are waiting tcp: Return user_mss for TCP_MAXSEG in CLOSE/LISTEN state if user_mss set net/smc: Scan from current RMB list when no position specified net/smc: Don't use RMBs not mapped to new link in SMCRv2 ADD LINK net/sched: sch_ingress: Only create under TC_H_INGRESS net/sched: sch_clsact: Only create under TC_H_CLSACT net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs net/sched: Prohibit regrafting ingress or clsact Qdiscs net: sched: fix NULL pointer dereference in mq_attach net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report udp6: Fix race condition in udp6_sendmsg & connect nfsd: fix double fget() bug in __write_ports_addfd() nvme: fix the name of Zone Append for verbose logging net/mlx5e: Fix error handling in mlx5e_refresh_tirs net/mlx5: Read embedded cpu after init bit cleared iommu/mediatek: Flush IOTLB completely only if domain has been attached net/sched: flower: fix possible OOB write in fl_set_geneve_opt() tcp: fix mishandling when the sack compression is deferred. net: dsa: mv88e6xxx: Increase wait after reset deactivation mtd: rawnand: marvell: ensure timing values are written mtd: rawnand: marvell: don't set the NAND frequency select rtnetlink: call validate_linkmsg in rtnl_create_link mptcp: avoid unneeded __mptcp_nmpc_socket() usage mptcp: add annotations around msk->subflow accesses mptcp: avoid unneeded address copy mptcp: simplify subflow_syn_recv_sock() mptcp: consolidate passive msk socket initialization mptcp: fix data race around msk->first access mptcp: add annotations around sk->sk_shutdown accesses drm/amdgpu: release gpu full access after "amdgpu_device_ip_late_init" watchdog: menz069_wdt: fix watchdog initialisation ALSA: hda: Glenfly: add HD Audio PCI IDs and HDMI Codec Vendor IDs. ASoC: Intel: soc-acpi-cht: Add quirk for Nextbook Ares 8A tablet drm/amdgpu: Use the default reset when loading or reloading the driver mailbox: mailbox-test: Fix potential double-free in mbox_test_message_write() drm/ast: Fix ARM compatibility btrfs: abort transaction when sibling keys check fails for leaves ARM: 9295/1: unwind:fix unwind abort for uleb128 case hwmon: (k10temp) Add PCI ID for family 19, model 78h media: rcar-vin: Select correct interrupt mode for V4L2_FIELD_ALTERNATE platform/x86: intel_scu_pcidrv: Add back PCI ID for Medfield platform/mellanox: fix potential race in mlxbf-tmfifo driver gfs2: Don't deref jdesc in evict drm/amdgpu: set gfx9 onwards APU atomics support to be true fbdev: imsttfb: Fix use after free bug in imsttfb_probe fbdev: modedb: Add 1920x1080 at 60 Hz video mode fbdev: stifb: Fix info entry in sti_struct on error path nbd: Fix debugfs_create_dir error checking block/rnbd: replace REQ_OP_FLUSH with REQ_OP_WRITE nvme-pci: add NVME_QUIRK_BOGUS_NID for HS-SSD-FUTURE 2048G nvme-pci: add quirk for missing secondary temperature thresholds ASoC: amd: yc: Add DMI entry to support System76 Pangolin 12 ASoC: dwc: limit the number of overrun messages um: harddog: fix modular build xfrm: Check if_id in inbound policy/secpath match ASoC: dt-bindings: Adjust #sound-dai-cells on TI's single-DAI codecs ALSA: hda/realtek: Add quirks for ASUS GU604V and GU603V ASoC: ssm2602: Add workaround for playback distortions media: dvb_demux: fix a bug for the continuity counter media: dvb-usb: az6027: fix three null-ptr-deref in az6027_i2c_xfer() media: dvb-usb-v2: ec168: fix null-ptr-deref in ec168_i2c_xfer() media: dvb-usb-v2: ce6230: fix null-ptr-deref in ce6230_i2c_master_xfer() media: dvb-usb-v2: rtl28xxu: fix null-ptr-deref in rtl28xxu_i2c_xfer media: dvb-usb: digitv: fix null-ptr-deref in digitv_i2c_xfer() media: dvb-usb: dw2102: fix uninit-value in su3000_read_mac_address media: netup_unidvb: fix irq init by register it at the end of probe media: dvb_ca_en50221: fix a size write bug media: ttusb-dec: fix memory leak in ttusb_dec_exit_dvb() media: mn88443x: fix !CONFIG_OF error by drop of_match_ptr from ID table media: dvb-core: Fix use-after-free due on race condition at dvb_net media: dvb-core: Fix use-after-free due to race at dvb_register_device() media: dvb-core: Fix kernel WARNING for blocking operation in wait_event*() media: dvb-core: Fix use-after-free due to race condition at dvb_ca_en50221 ASoC: SOF: debug: conditionally bump runtime_pm counter on exceptions ASoC: SOF: pcm: fix pm_runtime imbalance in error handling ASoC: SOF: sof-client-probes: fix pm_runtime imbalance in error handling ASoC: SOF: pm: save io region state in case of errors in resume s390/pkey: zeroize key blobs s390/topology: honour nr_cpu_ids when adding CPUs ACPI: resource: Add IRQ override quirk for LG UltraPC 17U70P wifi: rtl8xxxu: fix authentication timeout due to incorrect RCR value ARM: dts: stm32: add pin map for CAN controller on stm32f7 arm64/mm: mark private VM_FAULT_X defines as vm_fault_t arm64: vdso: Pass (void *) to virt_to_page() wifi: mac80211: simplify chanctx allocation wifi: mac80211: consider reserved chanctx for mindef wifi: mac80211: recalc chanctx mindef before assigning wifi: iwlwifi: mvm: Add locking to the rate read flow scsi: core: Decrease scsi_device's iorequest_cnt if dispatch failed wifi: b43: fix incorrect __packed annotation net: wwan: t7xx: Ensure init is completed before system sleep netfilter: conntrack: define variables exp_nat_nla_policy and any_addr with CONFIG_NF_NAT nvme-multipath: don't call blk_mark_disk_dead in nvme_mpath_remove_disk nvme: do not let the user delete a ctrl before a complete initialization ALSA: oss: avoid missing-prototype warnings drm/msm: Be more shouty if per-process pgtables aren't working atm: hide unused procfs functions ceph: silence smatch warning in reconnect_caps_cb() drm/amdgpu: skip disabling fence driver src_irqs when device is unplugged ublk: fix AB-BA lockdep warning nvme-pci: Add quirk for Teamgroup MP33 SSD block: Deny writable memory mapping if block is read-only KVM: arm64: vgic: Fix a circular locking issue KVM: arm64: vgic: Wrap vgic_its_create() with config_lock KVM: arm64: vgic: Fix locking comment media: mediatek: vcodec: Only apply 4K frame sizes on decoder formats mailbox: mailbox-test: fix a locking issue in mbox_test_message_write() drivers: base: cacheinfo: Fix shared_cpu_map changes in event of CPU hotplug media: uvcvideo: Don't expose unsupported formats to userspace iio: accel: st_accel: Fix invalid mount_matrix on devices without ACPI _ONT method iio: adc: mxs-lradc: fix the order of two cleanup operations HID: google: add jewel USB id HID: wacom: avoid integer overflow in wacom_intuos_inout() iio: imu: inv_icm42600: fix timestamp reset dt-bindings: iio: adc: renesas,rcar-gyroadc: Fix adi,ad7476 compatible value iio: light: vcnl4035: fixed chip ID check iio: adc: stm32-adc: skip adc-channels setup if none is present iio: adc: ad_sigma_delta: Fix IRQ issue by setting IRQ_DISABLE_UNLAZY flag iio: dac: mcp4725: Fix i2c_master_send() return value handling iio: addac: ad74413: fix resistance input processing iio: adc: ad7192: Change "shorted" channels to differential iio: adc: stm32-adc: skip adc-diff-channels setup if none is present iio: dac: build ad5758 driver when AD5758 is selected net: usb: qmi_wwan: Set DTR quirk for BroadMobi BM818 dt-bindings: usb: snps,dwc3: Fix "snps,hsphy_interface" type usb: cdns3: fix NCM gadget RX speed 20x slow than expection at iMX8QM usb: gadget: f_fs: Add unbind event before functionfs_unbind md/raid5: fix miscalculation of 'end_sector' in raid5_read_one_chunk() misc: fastrpc: return -EPIPE to invocations on device removal misc: fastrpc: reject new invocations during device removal scsi: stex: Fix gcc 13 warnings ata: libata-scsi: Use correct device no in ata_find_dev() drm/amdgpu: enable tmz by default for GC 11.0.1 drm/amd/pm: reverse mclk and fclk clocks levels for SMU v13.0.4 drm/amd/pm: reverse mclk and fclk clocks levels for vangogh drm/amd/pm: resolve reboot exception for si oland drm/amd/pm: reverse mclk clocks levels for SMU v13.0.5 drm/amd/pm: reverse mclk and fclk clocks levels for yellow carp drm/amd/pm: reverse mclk and fclk clocks levels for renoir x86/mtrr: Revert 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case") mmc: vub300: fix invalid response handling mmc: pwrseq: sd8787: Fix WILC CHIP_EN and RESETN toggling order tty: serial: fsl_lpuart: use UARTCTRL_TXINV to send break instead of UARTCTRL_SBK btrfs: fix csum_tree_block page iteration to avoid tripping on -Werror=array-bounds phy: qcom-qmp-combo: fix init-count imbalance phy: qcom-qmp-pcie-msm8996: fix init-count imbalance block: fix revalidate performance regression powerpc/iommu: Limit number of TCEs to 512 for H_STUFF_TCE hcall iommu/amd: Fix domain flush size when syncing iotlb tpm, tpm_tis: correct tpm_tis_flags enumeration values riscv: perf: Fix callchain parse error with kernel tracepoint events io_uring: undeprecate epoll_ctl support selinux: don't use make's grouped targets feature yet mtdchar: mark bits of ioctl handler noinline tracing/timerlat: Always wakeup the timerlat thread tracing/histograms: Allow variables to have some modifiers tracing/probe: trace_probe_primary_from_call(): checked list_first_entry selftests: mptcp: connect: skip if MPTCP is not supported selftests: mptcp: pm nl: skip if MPTCP is not supported selftests: mptcp: join: skip if MPTCP is not supported selftests: mptcp: sockopt: skip if MPTCP is not supported selftests: mptcp: userspace pm: skip if MPTCP is not supported mptcp: fix connect timeout handling mptcp: fix active subflow finalization ext4: add EA_INODE checking to ext4_iget() ext4: set lockdep subclass for the ea_inode in ext4_xattr_inode_cache_find() ext4: disallow ea_inodes with extended attributes ext4: add lockdep annotations for i_data_sem for ea_inode's fbcon: Fix null-ptr-deref in soft_cursor serial: 8250_tegra: Fix an error handling path in tegra_uart_probe() serial: cpm_uart: Fix a COMPILE_TEST dependency powerpc/xmon: Use KSYM_NAME_LEN in array size test_firmware: fix a memory leak with reqs buffer test_firmware: fix the memory leak of the allocated firmware buffer KVM: arm64: Populate fault info for watchpoint KVM: x86: Account fastpath-only VM-Exits in vCPU stats ksmbd: fix credit count leakage ksmbd: fix UAF issue from opinfo->conn ksmbd: fix incorrect AllocationSize set in smb2_get_info ksmbd: fix slab-out-of-bounds read in smb2_handle_negotiate ksmbd: fix multiple out-of-bounds read during context decoding KEYS: asymmetric: Copy sig and digest in public_key_verify_signature() fs/ntfs3: Validate MFT flags before replaying logs regmap: Account for register length when chunking tpm, tpm_tis: Request threaded interrupt handler iommu/amd/pgtbl_v2: Fix domain max address drm/amd/display: Have Payload Properly Created After Resume xfs: verify buffer contents when we skip log replay tls: rx: strp: don't use GFP_KERNEL in softirq context arm64: efi: Use SMBIOS processor version to key off Ampere quirk selftests: mptcp: diag: skip if MPTCP is not supported selftests: mptcp: simult flows: skip if MPTCP is not supported selftests: mptcp: join: avoid using 'cmp --bytes' ext4: enable the lazy init thread when remounting read/write Linux 6.1.33 Note, the following commits were reverted from this merge, due to conflicts with other KVM patches. If they are needed later, they can be brought back in a way that enables them to actually build properly: |
||
|
3a53767f1f |
Revert "bpf, sockmap: Pass skb ownership through read_skb"
This reverts commit
|
||
|
51ffabff7c |
Revert "bpf, sockmap: Handle fin correctly"
This reverts commit
|
||
|
3ce63059c1 |
Revert "bpf, sockmap: TCP data stall on recv before accept"
This reverts commit
|
||
|
0851b00164 |
Revert "bpf, sockmap: Incorrectly handling copied_seq"
This reverts commit
|
||
|
f8e6aa0e60 |
tcp: gso: really support BIG TCP
[ Upstream commit 82a01ab35bd02ba4b0b4e12bc95c5b69240eb7b0 ] We missed that tcp_gso_segment() was assuming skb->len was smaller than 65535 : oldlen = (u16)~skb->len; This part came with commit |
||
|
9166225c3b |
net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294
[ Upstream commit e209fee4118fe9a449d4d805361eb2de6796be39 ]
With this commit, all the GIDs ("0 4294967294") can be written to the
"net.ipv4.ping_group_range" sysctl.
Note that 4294967295 (0xffffffff) is an invalid GID (see gid_valid() in
include/linux/uidgid.h), and an attempt to register this number will cause
-EINVAL.
Prior to this commit, only up to GID 2147483647 could be covered.
Documentation/networking/ip-sysctl.rst had "0 4294967295" as an example
value, but this example was wrong and causing -EINVAL.
Fixes:
|
||
|
26b6ad0f34 |
Merge 6.1.32 into android14-6.1-lts
Changes in 6.1.32 inet: Add IP_LOCAL_PORT_RANGE socket option ipv{4,6}/raw: fix output xfrm lookup wrt protocol firmware: arm_ffa: Fix usage of partition info get count flag selftests/bpf: Fix pkg-config call building sign-file platform/x86/amd/pmf: Fix CnQF and auto-mode after resume tls: rx: device: fix checking decryption status tls: rx: strp: set the skb->len of detached / CoW'ed skbs tls: rx: strp: fix determining record length in copy mode tls: rx: strp: force mixed decrypted records into copy mode tls: rx: strp: factor out copying skb data tls: rx: strp: preserve decryption status of skbs when needed net/mlx5: E-switch, Devcom, sync devcom events and devcom comp register gpio-f7188x: fix chip name and pin count on Nuvoton chip bpf, sockmap: Pass skb ownership through read_skb bpf, sockmap: Convert schedule_work into delayed_work bpf, sockmap: Reschedule is now done through backlog bpf, sockmap: Improved check for empty queue bpf, sockmap: Handle fin correctly bpf, sockmap: TCP data stall on recv before accept bpf, sockmap: Wake up polling after data copy bpf, sockmap: Incorrectly handling copied_seq blk-mq: fix race condition in active queue accounting vfio/type1: check pfn valid before converting to struct page net: page_pool: use in_softirq() instead page_pool: fix inconsistency for page_pool_ring_[un]lock() net: phy: mscc: enable VSC8501/2 RGMII RX clock wifi: rtw89: correct 5 MHz mask setting wifi: iwlwifi: mvm: support wowlan info notification version 2 wifi: iwlwifi: mvm: fix potential memory leak RDMA/rxe: Fix the error "trying to register non-static key in rxe_cleanup_task" octeontx2-af: Add validation for lmac type drm/amd: Don't allow s0ix on APUs older than Raven bluetooth: Add cmd validity checks at the start of hci_sock_ioctl() Revert "thermal/drivers/mellanox: Use generic thermal_zone_get_trip() function" block: fix bio-cache for passthru IO cpufreq: amd-pstate: Update policy->cur in amd_pstate_adjust_perf() cpufreq: amd-pstate: Add ->fast_switch() callback netfilter: ctnetlink: Support offloaded conntrack entry deletion tools headers UAPI: Sync the linux/in.h with the kernel sources Linux 6.1.32 Change-Id: I70ca0d07b33b26c2ed7613e6532eb9ae845112ee Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
03c3264a15 |
Merge 6.1.31 into android14-6.1-lts
Changes in 6.1.31 usb: dwc3: fix gadget mode suspend interrupt handler issue tpm, tpm_tis: Avoid cache incoherency in test for interrupts tpm, tpm_tis: Only handle supported interrupts tpm_tis: Use tpm_chip_{start,stop} decoration inside tpm_tis_resume tpm, tpm_tis: startup chip before testing for interrupts tpm: Re-enable TPM chip boostrapping non-tpm_tis TPM drivers tpm: Prevent hwrng from activating during resume watchdog: sp5100_tco: Immediately trigger upon starting. drm/amd/amdgpu: update mes11 api def drm/amdgpu/mes11: enable reg active poll skbuff: Proactively round up to kmalloc bucket size platform/x86: hp-wmi: Fix cast to smaller integer type warning net: dsa: mv88e6xxx: Add RGMII delay to 88E6320 drm/amd/display: hpd rx irq not working with eDP interface ocfs2: Switch to security_inode_init_security() arm64: Also reset KASAN tag if page is not PG_mte_tagged x86/mm: Avoid incomplete Global INVLPG flushes platform/x86/intel/ifs: Annotate work queue on stack so object debug does not complain ALSA: hda/ca0132: add quirk for EVGA X299 DARK ALSA: hda: Fix unhandled register update during auto-suspend period ALSA: hda/realtek: Enable headset onLenovo M70/M90 SUNRPC: Don't change task->tk_status after the call to rpc_exit_task mmc: sdhci-esdhc-imx: make "no-mmc-hs400" works mmc: block: ensure error propagation for non-blk power: supply: axp288_fuel_gauge: Fix external_power_changed race power: supply: bq25890: Fix external_power_changed race ASoC: rt5682: Disable jack detection interrupt during suspend net: cdc_ncm: Deal with too low values of dwNtbOutMaxSize m68k: Move signal frame following exception on 68020/030 xtensa: fix signal delivery to FDPIC process xtensa: add __bswap{si,di}2 helpers parisc: Use num_present_cpus() in alternative patching code parisc: Handle kgdb breakpoints only in kernel context parisc: Fix flush_dcache_page() for usage from irq context parisc: Allow to reboot machine after system halt parisc: Enable LOCKDEP support parisc: Handle kprobes breakpoints only in kernel context gpio: mockup: Fix mode of debugfs files btrfs: use nofs when cleaning up aborted transactions dt-binding: cdns,usb3: Fix cdns,on-chip-buff-size type drm/mgag200: Fix gamma lut not initialized. drm/radeon: reintroduce radeon_dp_work_func content drm/amd/pm: add missing NotifyPowerSource message mapping for SMU13.0.7 drm/amd/pm: Fix output of pp_od_clk_voltage Revert "binder_alloc: add missing mmap_lock calls when using the VMA" Revert "android: binder: stop saving a pointer to the VMA" binder: add lockless binder_alloc_(set|get)_vma() binder: fix UAF caused by faulty buffer cleanup binder: fix UAF of alloc->vma in race with munmap() selftests/memfd: Fix unknown type name build failure drm/amd/amdgpu: limit one queue per gang perf/x86/uncore: Correct the number of CHAs on SPR x86/topology: Fix erroneous smp_num_siblings on Intel Hybrid platforms irqchip/mips-gic: Don't touch vl_map if a local interrupt is not routable irqchip/mips-gic: Use raw spinlock for gic_lock debugobjects: Don't wake up kswapd from fill_pool() fbdev: udlfb: Fix endpoint check net: fix stack overflow when LRO is disabled for virtual interfaces udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated(). USB: core: Add routines for endpoint checks in old drivers USB: sisusbvga: Add endpoint checks media: radio-shark: Add endpoint checks ASoC: lpass: Fix for KASAN use_after_free out of bounds net: fix skb leak in __skb_tstamp_tx() drm: fix drmm_mutex_init() selftests: fib_tests: mute cleanup error message octeontx2-pf: Fix TSOv6 offload bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps lan966x: Fix unloading/loading of the driver ipv6: Fix out-of-bounds access in ipv6_find_tlv() cifs: mapchars mount option ignored power: supply: leds: Fix blink to LED on transition power: supply: mt6360: add a check of devm_work_autocancel in mt6360_charger_probe power: supply: bq27xxx: Fix bq27xxx_battery_update() race condition power: supply: bq27xxx: Fix I2C IRQ race on remove power: supply: bq27xxx: Fix poll_interval handling and races on remove power: supply: bq27xxx: Add cache parameter to bq27xxx_battery_current_and_status() power: supply: bq27xxx: Move bq27xxx_battery_update() down power: supply: bq27xxx: Ensure power_supply_changed() is called on current sign changes power: supply: bq27xxx: After charger plug in/out wait 0.5s for things to stabilize power: supply: bq25890: Call power_supply_changed() after updating input current or voltage power: supply: bq24190: Call power_supply_changed() after updating input current power: supply: sbs-charger: Fix INHIBITED bit for Status reg optee: fix uninited async notif value firmware: arm_ffa: Check if ffa_driver remove is present before executing firmware: arm_ffa: Fix FFA device names for logical partitions fs: fix undefined behavior in bit shift for SB_NOUSER regulator: pca9450: Fix BUCK2 enable_mask platform/x86: ISST: Remove 8 socket limit coresight: Fix signedness bug in tmc_etr_buf_insert_barrier_packet() ARM: dts: imx6qdl-mba6: Add missing pvcie-supply regulator x86/pci/xen: populate MSI sysfs entries xen/pvcalls-back: fix double frees with pvcalls_new_active_socket() x86/show_trace_log_lvl: Ensure stack pointer is aligned, again ASoC: Intel: Skylake: Fix declaration of enum skl_ch_cfg ASoC: Intel: avs: Fix declaration of enum avs_channel_config ASoC: Intel: avs: Access path components under lock cxl: Wait Memory_Info_Valid before access memory related info sctp: fix an issue that plpmtu can never go to complete state forcedeth: Fix an error handling path in nv_probe() platform/mellanox: mlxbf-pmc: fix sscanf() error checking net/mlx5e: Fix SQ wake logic in ptp napi_poll context net/mlx5e: Fix deadlock in tc route query code net/mlx5e: Use correct encap attribute during invalidation net/mlx5e: do as little as possible in napi poll when budget is 0 net/mlx5: DR, Fix crc32 calculation to work on big-endian (BE) CPUs net/mlx5: Handle pairing of E-switch via uplink un/load APIs net/mlx5: DR, Check force-loopback RC QP capability independently from RoCE net/mlx5: Fix error message when failing to allocate device memory net/mlx5: Collect command failures data only for known commands net/mlx5: Devcom, fix error flow in mlx5_devcom_register_device net/mlx5: Devcom, serialize devcom registration arm64: dts: imx8mn-var-som: fix PHY detection bug by adding deassert delay firmware: arm_ffa: Set reserved/MBZ fields to zero in the memory descriptors regulator: mt6359: add read check for PMIC MT6359 net/smc: Reset connection when trying to use SMCRv2 fails. 3c589_cs: Fix an error handling path in tc589_probe() net: phy: mscc: add VSC8502 to MODULE_DEVICE_TABLE Linux 6.1.31 Change-Id: I1043b7dd190672829baaf093f690e70a07c7a6dd Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
26c1cc6858 |
This is the 6.1.30 stable release
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmRuPHsACgkQONu9yGCS aT6USxAAx2uklTRE3mmIS9qytOjb8Z3gsA8LVaaQ3f25CWNiuverNj0mFyNtI9KX 84ZBS/G8aHA6z0dtdyMupHznHehQp7pVo0LOeVMz2bR+CjkpRQei2NimG8bGRcFK W6c40w99lD9dYpaal3yajs+k+LF3BktmBNc0SynCjjyEy4YA5RbWOhtGX6P4VRqs sPXcmmAHsqDPLfqsgsHiBNsiw+dCP7jY1a17rTxz1g49/4zS6BEGtxxpU4UZNbph rKrX0sgF8UM15IfdFc0CiOXhAcL7QQfUbucJ/94180gclF4j6QqAMueAr6mLWkFd Pj7vLn/KD2wA2dzTBekHZ9SYp31xcXomkzfdLoMMnazfy3RL4sO7WhJks0k0T2En 3LIlsRZx/C2ztf3SLq2z2Bw/ExaefrydLI9cWJBi7CQ5yUVO15edcv40W4pxoMOL xFDZhCksC+JNc74HPYKTmg+SJQsxtYeLrwb6zW43aJByY+rls70crfhdS5fORvmH G8qDS2PCNAqpulxyxQtYxiIcRiM4SqPskves+3nu7gBFGfsv2AJU1gNCorIpZuW8 DS2jrMwPv7gH+eUvqrnrtdA+Vk4TYWslg0mPlVNavX98i9/dC9Vjss3yXCYh7Q6u 0+BpSBLtKM4pahaMgKpYv/V/r+GKvIt7Npki8o/bs1nuykF04aw= =hAQM -----END PGP SIGNATURE----- Merge 6.1.30 into android14-6.1-lts Changes in 6.1.30 drm/fbdev-generic: prohibit potential out-of-bounds access drm/mipi-dsi: Set the fwnode for mipi_dsi_device ARM: 9296/1: HP Jornada 7XX: fix kernel-doc warnings net: skb_partial_csum_set() fix against transport header magic value net: mdio: mvusb: Fix an error handling path in mvusb_mdio_probe() scsi: ufs: core: Fix I/O hang that occurs when BKOPS fails in W-LUN suspend tick/broadcast: Make broadcast device replacement work correctly linux/dim: Do nothing if no time delta between samples net: stmmac: Initialize MAC_ONEUS_TIC_COUNTER register net: Fix load-tearing on sk->sk_stamp in sock_recv_cmsgs(). net: phy: bcm7xx: Correct read from expansion register netfilter: nf_tables: always release netdev hooks from notifier netfilter: conntrack: fix possible bug_on with enable_hooks=1 bonding: fix send_peer_notif overflow netlink: annotate accesses to nlk->cb_running net: annotate sk->sk_err write from do_recvmmsg() net: deal with most data-races in sk_wait_event() net: add vlan_get_protocol_and_depth() helper tcp: add annotations around sk->sk_shutdown accesses gve: Remove the code of clearing PBA bit ipvlan:Fix out-of-bounds caused by unclear skb->cb net: mscc: ocelot: fix stat counter register values net: datagram: fix data-races in datagram_poll() af_unix: Fix a data race of sk->sk_receive_queue->qlen. af_unix: Fix data races around sk->sk_shutdown. drm/i915/guc: Don't capture Gen8 regs on Xe devices drm/i915: Fix NULL ptr deref by checking new_crtc_state drm/i915/dp: prevent potential div-by-zero drm/i915: Expand force_probe to block probe of devices as well. drm/i915: taint kernel when force probing unsupported devices fbdev: arcfb: Fix error handling in arcfb_probe() ext4: reflect error codes from ext4_multi_mount_protect() to its callers ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled ext4: allow to find by goal if EXT4_MB_HINT_GOAL_ONLY is set ext4: allow ext4_get_group_info() to fail refscale: Move shutdown from wait_event() to wait_event_idle() selftests: cgroup: Add 'malloc' failures checks in test_memcontrol rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access open: return EINVAL for O_DIRECTORY | O_CREAT fs: hfsplus: remove WARN_ON() from hfsplus_cat_{read,write}_inode() drm/displayid: add displayid_get_header() and check bounds better drm/amd/display: populate subvp cmd info only for the top pipe drm/amd/display: Correct DML calculation to align HW formula platform/x86: x86-android-tablets: Add Acer Iconia One 7 B1-750 data drm/amd/display: Enable HostVM based on rIOMMU active drm/amd/display: Use DC_LOG_DC in the trasform pixel function regmap: cache: Return error in cache sync operations for REGCACHE_NONE remoteproc: imx_dsp_rproc: Add custom memory copy implementation for i.MX DSP Cores arm64: dts: qcom: msm8996: Add missing DWC3 quirks media: cx23885: Fix a null-ptr-deref bug in buffer_prepare() and buffer_finish() media: pci: tw68: Fix null-ptr-deref bug in buf prepare and finish media: pvrusb2: VIDEO_PVRUSB2 depends on DVB_CORE to use dvb_* symbols ACPI: processor: Check for null return of devm_kzalloc() in fch_misc_setup() drm/rockchip: dw_hdmi: cleanup drm encoder during unbind memstick: r592: Fix UAF bug in r592_remove due to race condition arm64: dts: imx8mq-librem5: Remove dis_u3_susphy_quirk from usb_dwc3_0 firmware: arm_sdei: Fix sleep from invalid context BUG ACPI: EC: Fix oops when removing custom query handlers drm/amd/display: fixed dcn30+ underflow issue remoteproc: stm32_rproc: Add mutex protection for workqueue drm/tegra: Avoid potential 32-bit integer overflow drm/msm/dp: Clean up handling of DP AUX interrupts ACPICA: Avoid undefined behavior: applying zero offset to null pointer ACPICA: ACPICA: check null return of ACPI_ALLOCATE_ZEROED in acpi_db_display_objects arm64: dts: qcom: sdm845-polaris: Drop inexistent properties irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4 ACPI: video: Remove desktops without backlight DMI quirks drm/amd/display: Correct DML calculation to follow HW SPEC drm/amd: Fix an out of bounds error in BIOS parser drm/amdgpu: Fix sdma v4 sw fini error media: Prefer designated initializers over memset for subdev pad ops media: mediatek: vcodec: Fix potential array out-of-bounds in decoder queue_setup wifi: ath: Silence memcpy run-time false positive warning bpf: Annotate data races in bpf_local_storage wifi: brcmfmac: pcie: Provide a buffer of random bytes to the device wifi: brcmfmac: cfg80211: Pass the PMK in binary instead of hex ext2: Check block size validity during mount scsi: lpfc: Prevent lpfc_debugfs_lockstat_write() buffer overflow scsi: lpfc: Correct used_rpi count when devloss tmo fires with no recovery bnxt: avoid overflow in bnxt_get_nvram_directory() net: pasemi: Fix return type of pasemi_mac_start_tx() net: Catch invalid index in XPS mapping netdev: Enforce index cap in netdev_get_tx_queue scsi: target: iscsit: Free cmds before session free lib: cpu_rmap: Avoid use after free on rmap->obj array entries scsi: message: mptlan: Fix use after free bug in mptlan_remove() due to race condition gfs2: Fix inode height consistency check scsi: ufs: ufs-pci: Add support for Intel Lunar Lake ext4: set goal start correctly in ext4_mb_normalize_request ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa() crypto: jitter - permanent and intermittent health errors f2fs: Fix system crash due to lack of free space in LFS f2fs: fix to drop all dirty pages during umount() if cp_error is set f2fs: fix to check readonly condition correctly samples/bpf: Fix fout leak in hbm's run_bpf_prog bpf: Add preempt_count_{sub,add} into btf id deny list md: fix soft lockup in status_resync wifi: iwlwifi: pcie: fix possible NULL pointer dereference wifi: iwlwifi: add a new PCI device ID for BZ device wifi: iwlwifi: pcie: Fix integer overflow in iwl_write_to_user_buf wifi: iwlwifi: mvm: fix ptk_pn memory leak block, bfq: Fix division by zero error on zero wsum wifi: ath11k: Ignore frags from uninitialized peer in dp. wifi: iwlwifi: fix iwl_mvm_max_amsdu_size() for MLO null_blk: Always check queue mode setting from configfs wifi: iwlwifi: dvm: Fix memcpy: detected field-spanning write backtrace wifi: ath11k: Fix SKB corruption in REO destination ring nbd: fix incomplete validation of ioctl arg ipvs: Update width of source for ip_vs_sync_conn_options Bluetooth: btusb: Add new PID/VID 04ca:3801 for MT7663 Bluetooth: Add new quirk for broken local ext features page 2 Bluetooth: btrtl: add support for the RTL8723CS Bluetooth: Improve support for Actions Semi ATS2851 based devices Bluetooth: btrtl: check for NULL in btrtl_set_quirks() Bluetooth: btintel: Add LE States quirk support Bluetooth: hci_bcm: Fall back to getting bdaddr from EFI if not set Bluetooth: Add new quirk for broken set random RPA timeout for ATS2851 Bluetooth: L2CAP: fix "bad unlock balance" in l2cap_disconnect_rsp Bluetooth: btrtl: Add the support for RTL8851B staging: rtl8192e: Replace macro RTL_PCI_DEVICE with PCI_DEVICE HID: apple: Set the tilde quirk flag on the Geyser 4 and later staging: axis-fifo: initialize timeouts in init only ASoC: amd: yc: Add DMI entries to support HP OMEN 16-n0xxx (8A42) HID: logitech-hidpp: Don't use the USB serial for USB devices HID: logitech-hidpp: Reconcile USB and Unifying serials spi: spi-imx: fix MX51_ECSPI_* macros when cs > 3 usb: typec: ucsi: acpi: add quirk for ASUS Zenbook UM325 ALSA: hda: LNL: add HD Audio PCI ID ASoC: amd: Add Dell G15 5525 to quirks list ASoC: amd: yc: Add ThinkBook 14 G5+ ARP to quirks list for acp6x HID: apple: Set the tilde quirk flag on the Geyser 3 HID: Ignore battery for ELAN touchscreen on ROG Flow X13 GV301RA HID: wacom: generic: Set battery quirk only when we see battery data usb: typec: tcpm: fix multiple times discover svids error serial: 8250: Reinit port->pm on port specific driver unbind mcb-pci: Reallocate memory region to avoid memory overlapping sched: Fix KCSAN noinstr violation lkdtm/stackleak: Fix noinstr violation recordmcount: Fix memory leaks in the uwrite function soundwire: dmi-quirks: add remapping for Intel 'Rooks County' NUC M15 phy: st: miphy28lp: use _poll_timeout functions for waits soundwire: qcom: gracefully handle too many ports in DT soundwire: bus: Fix unbalanced pm_runtime_put() causing usage count underflow mfd: intel_soc_pmic_chtwc: Add Lenovo Yoga Book X90F to intel_cht_wc_models mfd: dln2: Fix memory leak in dln2_probe() mfd: intel-lpss: Add Intel Meteor Lake PCH-S LPSS PCI IDs parisc: Replace regular spinlock with spin_trylock on panic path platform/x86: Move existing HP drivers to a new hp subdir platform/x86: hp-wmi: add micmute to hp_wmi_keymap struct drm/amdgpu: drop gfx_v11_0_cp_ecc_error_irq_funcs xfrm: don't check the default policy if the policy allows the packet Revert "Fix XFRM-I support for nested ESP tunnels" drm/msm/dp: unregister audio driver during unbind drm/msm/dpu: Assign missing writeback log_mask drm/msm/dpu: Move non-MDP_TOP INTF_INTR offsets out of hwio header drm/msm/dpu: Remove duplicate register defines from INTF dt-bindings: display/msm: dsi-controller-main: Document qcom, master-dsi and qcom, sync-dual-dsi platform: Provide a remove callback that returns no value ASoC: fsl_micfil: Fix error handler with pm_runtime_enable cpupower: Make TSC read per CPU for Mperf monitor xfrm: Reject optional tunnel/BEET mode templates in outbound policies af_key: Reject optional tunnel/BEET mode templates in outbound policies drm/msm: Fix submit error-path leaks selftests: seg6: disable DAD on IPv6 router cfg for srv6_end_dt4_l3vpn_test selftets: seg6: disable rp_filter by default in srv6_end_dt4_l3vpn_test net: fec: Better handle pm_runtime_get() failing in .remove() net: phy: dp83867: add w/a for packet errors seen with short cables ALSA: firewire-digi00x: prevent potential use after free wifi: mt76: connac: fix stats->tx_bytes calculation ALSA: hda/realtek: Apply HP B&O top speaker profile to Pavilion 15 sfc: disable RXFCS and RXALL features by default vsock: avoid to close connected socket after the timeout tcp: fix possible sk_priority leak in tcp_v4_send_reset() serial: arc_uart: fix of_iomap leak in `arc_serial_probe` serial: 8250_bcm7271: balance clk_enable calls serial: 8250_bcm7271: fix leak in `brcmuart_probe` erspan: get the proto with the md version for collect_md net: dsa: rzn1-a5psw: enable management frames for CPU port net: dsa: rzn1-a5psw: fix STP states handling net: dsa: rzn1-a5psw: disable learning for standalone ports net: hns3: fix output information incomplete for dumping tx queue info with debugfs net: hns3: fix sending pfc frames after reset issue net: hns3: fix reset delay time to avoid configuration timeout net: hns3: fix reset timeout when enable full VF media: netup_unidvb: fix use-after-free at del_timer() SUNRPC: double free xprt_ctxt while still in use SUNRPC: always free ctxt when freeing deferred request SUNRPC: Fix trace_svc_register() call site ASoC: mediatek: mt8186: Fix use-after-free in driver remove path ASoC: SOF: topology: Fix logic for copying tuples drm/exynos: fix g2d_open/close helper function definitions net: nsh: Use correct mac_offset to unwind gso skb in nsh_gso_segment() virtio-net: Maintain reverse cleanup order virtio_net: Fix error unwinding of XDP initialization tipc: add tipc_bearer_min_mtu to calculate min mtu tipc: do not update mtu if msg_max is too small in mtu negotiation tipc: check the bearer min mtu properly when setting it by netlink s390/cio: include subchannels without devices also for evaluation can: dev: fix missing CAN XL support in can_put_echo_skb() net: bcmgenet: Remove phy_stop() from bcmgenet_netif_stop() net: bcmgenet: Restore phy_stop() depending upon suspend/close ice: introduce clear_reset_state operation ice: Fix ice VF reset during iavf initialization wifi: cfg80211: Drop entries with invalid BSSIDs in RNR wifi: mac80211: fortify the spinlock against deadlock by interrupt wifi: mac80211: fix min center freq offset tracing wifi: mac80211: Abort running color change when stopping the AP wifi: iwlwifi: mvm: fix cancel_delayed_work_sync() deadlock wifi: iwlwifi: fw: fix DBGI dump wifi: iwlwifi: fix OEM's name in the ppag approved list wifi: iwlwifi: mvm: fix OEM's name in the tas approved list wifi: iwlwifi: mvm: don't trust firmware n_channels scsi: storvsc: Don't pass unused PFNs to Hyper-V host net: tun: rebuild error handling in tun_get_user tun: Fix memory leak for detached NAPI queue. cassini: Fix a memory leak in the error handling path of cas_init_one() net: dsa: mv88e6xxx: Fix mv88e6393x EPC write command offset igb: fix bit_shift to be in [1..8] range vlan: fix a potential uninit-value in vlan_dev_hard_start_xmit() net: wwan: iosm: fix NULL pointer dereference when removing device net: pcs: xpcs: fix C73 AN not getting enabled net: selftests: Fix optstring netfilter: nf_tables: fix nft_trans type confusion netfilter: nft_set_rbtree: fix null deref on element insertion bridge: always declare tunnel functions ALSA: usb-audio: Add a sample rate workaround for Line6 Pod Go USB: usbtmc: Fix direction for 0-length ioctl control messages usb-storage: fix deadlock when a scsi command timeouts more than once USB: UHCI: adjust zhaoxin UHCI controllers OverCurrent bit value usb: dwc3: gadget: Improve dwc3_gadget_suspend() and dwc3_gadget_resume() usb: dwc3: debugfs: Resume dwc3 before accessing registers usb: gadget: u_ether: Fix host MAC address case usb: typec: altmodes/displayport: fix pin_assignment_show Revert "usb: gadget: udc: core: Prevent redundant calls to pullup" Revert "usb: gadget: udc: core: Invoke usb_gadget_connect only when started" xhci-pci: Only run d3cold avoidance quirk for s2idle xhci: Fix incorrect tracking of free space on transfer rings ALSA: hda: Fix Oops by 9.1 surround channel names ALSA: hda: Add NVIDIA codec IDs a3 through a7 to patch table ALSA: hda/realtek: Add quirk for Clevo L140AU ALSA: hda/realtek: Add a quirk for HP EliteDesk 805 ALSA: hda/realtek: Add quirk for 2nd ASUS GU603 ALSA: hda/realtek: Add quirk for HP EliteBook G10 laptops ALSA: hda/realtek: Fix mute and micmute LEDs for yet another HP laptop can: j1939: recvmsg(): allow MSG_CMSG_COMPAT flag can: isotp: recvmsg(): allow MSG_CMSG_COMPAT flag can: kvaser_pciefd: Set CAN_STATE_STOPPED in kvaser_pciefd_stop() can: kvaser_pciefd: Call request_irq() before enabling interrupts can: kvaser_pciefd: Empty SRB buffer in probe can: kvaser_pciefd: Clear listen-only bit if not explicitly requested can: kvaser_pciefd: Do not send EFLUSH command on TFD interrupt can: kvaser_pciefd: Disable interrupts in probe error path wifi: rtw88: use work to update rate to avoid RCU warning SMB3: Close all deferred handles of inode in case of handle lease break SMB3: drop reference to cfile before sending oplock break ksmbd: smb2: Allow messages padded to 8byte boundary ksmbd: allocate one more byte for implied bcc[0] ksmbd: fix wrong UserName check in session_user ksmbd: fix global-out-of-bounds in smb2_find_context_vals KVM: Fix vcpu_array[0] races statfs: enforce statfs[64] structure initialization maple_tree: make maple state reusable after mas_empty_area() mm: fix zswap writeback race condition serial: Add support for Advantech PCI-1611U card serial: 8250_exar: Add support for USR298x PCI Modems serial: qcom-geni: fix enabling deactivated interrupt thunderbolt: Clear registers properly when auto clear isn't in use vc_screen: reload load of struct vc_data pointer in vcs_write() to avoid UAF ceph: force updating the msg pointer in non-split case drm/amd/pm: fix possible power mode mismatch between driver and PMFW drm/amdgpu/gmc11: implement get_vbios_fb_size() drm/amdgpu/gfx10: Disable gfxoff before disabling powergating. drm/amdgpu/gfx11: Adjust gfxoff before powergating on gfx11 as well drm/amdgpu: refine get gpu clock counter method drm/amdgpu/gfx11: update gpu_clock_counter logic dt-bindings: ata: ahci-ceva: Cover all 4 iommus entries powerpc/iommu: DMA address offset is incorrectly calculated with 2MB TCEs powerpc/iommu: Incorrect DDW Table is referenced for SR-IOV device tpm/tpm_tis: Disable interrupts for more Lenovo devices powerpc/64s/radix: Fix soft dirty tracking nilfs2: fix use-after-free bug of nilfs_root in nilfs_evict_inode() s390/dasd: fix command reject error on ESE devices s390/crypto: use vector instructions only if available for ChaCha20 s390/qdio: fix do_sqbs() inline assembly constraint arm64: mte: Do not set PG_mte_tagged if tags were not initialized rethook: use preempt_{disable, enable}_notrace in rethook_trampoline_handler rethook, fprobe: do not trace rethook related functions remoteproc: imx_dsp_rproc: Fix kernel test robot sparse warning crypto: testmgr - fix RNG performance in fuzz tests drm/amdgpu: declare firmware for new MES 11.0.4 drm/amd/amdgpu: introduce gc_*_mes_2.bin v2 drm/amdgpu: reserve the old gc_11_0_*_mes.bin Linux 6.1.30 Change-Id: I411885affcf017410aab34bf3fba2dde96df6593 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
ef75a88787 |
Merge 6.1.28 into android14-6.1-lts
Changes in 6.1.28 ASOC: Intel: sof_sdw: add quirk for Intel 'Rooks County' NUC M15 ASoC: Intel: soc-acpi: add table for Intel 'Rooks County' NUC M15 ASoC: soc-pcm: fix hw->formats cleared by soc_pcm_hw_init() for dpcm x86/hyperv: Block root partition functionality in a Confidential VM ASoC: amd: yc: Add DMI entries to support Victus by HP Laptop 16-e1xxx (8A22) iio: adc: palmas_gpadc: fix NULL dereference on rmmod ASoC: Intel: bytcr_rt5640: Add quirk for the Acer Iconia One 7 B1-750 ASoC: da7213.c: add missing pm_runtime_disable() net: wwan: t7xx: do not compile with -Werror selftests mount: Fix mount_setattr_test builds failed scsi: mpi3mr: Handle soft reset in progress fault code (0xF002) net: sfp: add quirk enabling 2500Base-x for HG MXPD-483II platform/x86: thinkpad_acpi: Add missing T14s Gen1 type to s2idle quirk list wifi: ath11k: reduce the MHI timeout to 20s tracing: Error if a trace event has an array for a __field() asm-generic/io.h: suppress endianness warnings for readq() and writeq() x86/cpu: Add model number for Intel Arrow Lake processor wireguard: timers: cast enum limits members to int in prints wifi: mt76: mt7921e: Set memory space enable in PCI_COMMAND if unset ASoC: amd: fix ACP version typo mistake ASoC: amd: ps: update the acp clock source. arm64: Always load shadow stack pointer directly from the task struct arm64: Stash shadow stack pointer in the task struct on interrupt powerpc/boot: Fix boot wrapper code generation with CONFIG_POWER10_CPU PCI: kirin: Select REGMAP_MMIO PCI: pciehp: Fix AB-BA deadlock between reset_lock and device_lock PCI: qcom: Fix the incorrect register usage in v2.7.0 config phy: qcom-qmp-pcie: sc8180x PCIe PHY has 2 lanes IMA: allow/fix UML builds usb: gadget: udc: core: Invoke usb_gadget_connect only when started usb: gadget: udc: core: Prevent redundant calls to pullup usb: dwc3: gadget: Stall and restart EP0 if host is unresponsive USB: dwc3: fix runtime pm imbalance on probe errors USB: dwc3: fix runtime pm imbalance on unbind hwmon: (k10temp) Check range scale when CUR_TEMP register is read-write hwmon: (adt7475) Use device_property APIs when configuring polarity tpm: Add !tpm_amd_is_rng_defective() to the hwrng_unregister() call site posix-cpu-timers: Implement the missing timer_wait_running callback media: ov8856: Do not check for for module version blk-stat: fix QUEUE_FLAG_STATS clear blk-crypto: don't use struct request_queue for public interfaces blk-crypto: add a blk_crypto_config_supported_natively helper blk-crypto: move internal only declarations to blk-crypto-internal.h blk-crypto: Add a missing include directive blk-mq: release crypto keyslot before reporting I/O complete blk-crypto: make blk_crypto_evict_key() return void blk-crypto: make blk_crypto_evict_key() more robust staging: iio: resolver: ads1210: fix config mode tty: Prevent writing chars during tcsetattr TCSADRAIN/FLUSH xhci: fix debugfs register accesses while suspended serial: fix TIOCSRS485 locking serial: 8250: Fix serial8250_tx_empty() race with DMA Tx serial: max310x: fix IO data corruption in batched operations tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem fs: fix sysctls.c built MIPS: fw: Allow firmware to pass a empty env ipmi:ssif: Add send_retries increment ipmi: fix SSIF not responding under certain cond. iio: addac: stx104: Fix race condition when converting analog-to-digital iio: addac: stx104: Fix race condition for stx104_write_raw() kheaders: Use array declaration instead of char wifi: mt76: add missing locking to protect against concurrent rx/status calls pwm: meson: Fix axg ao mux parents pwm: meson: Fix g12a ao clk81 name soundwire: qcom: correct setting ignore bit on v1.5.1 pinctrl: qcom: lpass-lpi: set output value before enabling output ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus ring-buffer: Sync IRQ works before buffer destruction crypto: api - Demote BUG_ON() in crypto_unregister_alg() to a WARN_ON() crypto: safexcel - Cleanup ring IRQ workqueues on load failure crypto: arm64/aes-neonbs - fix crash with CFI enabled crypto: ccp - Don't initialize CCP for PSP 0x1649 rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed reiserfs: Add security prefix to xattr name in reiserfs_security_write() KVM: nVMX: Emulate NOPs in L2, and PAUSE if it's not intercepted KVM: arm64: Avoid vcpu->mutex v. kvm->lock inversion in CPU_ON KVM: arm64: Avoid lock inversion when setting the VM register width KVM: arm64: Use config_lock to protect data ordered against KVM_RUN KVM: arm64: Use config_lock to protect vgic state KVM: arm64: vgic: Don't acquire its_lock before config_lock relayfs: fix out-of-bounds access in relay_file_read drm/amd/display: Remove stutter only configurations drm/amd/display: limit timing for single dimm memory drm/amd/display: fix PSR-SU/DSC interoperability support drm/amd/display: fix a divided-by-zero error KVM: RISC-V: Retry fault if vma_lookup() results become invalid ksmbd: fix racy issue under cocurrent smb2 tree disconnect ksmbd: call rcu_barrier() in ksmbd_server_exit() ksmbd: fix NULL pointer dereference in smb2_get_info_filesystem() ksmbd: fix memleak in session setup ksmbd: not allow guest user on multichannel ksmbd: fix deadlock in ksmbd_find_crypto_ctx() ACPI: video: Remove acpi_backlight=video quirk for Lenovo ThinkPad W530 i2c: omap: Fix standard mode false ACK readings riscv: mm: remove redundant parameter of create_fdt_early_page_table tracing: Fix permissions for the buffer_percent file swsmu/amdgpu_smu: Fix the wrong if-condition drm/amd/pm: re-enable the gfx imu when smu resume iommu/amd: Fix "Guest Virtual APIC Table Root Pointer" configuration in IRTE RISC-V: Align SBI probe implementation with spec Revert "ubifs: dirty_cow_znode: Fix memleak in error handling path" ubifs: Fix memleak when insert_old_idx() failed ubi: Fix return value overwrite issue in try_write_vid_and_data() ubifs: Free memory for tmpfile name ubifs: Fix memory leak in do_rename ceph: fix potential use-after-free bug when trimming caps xfs: don't consider future format versions valid cxl/hdm: Fail upon detecting 0-sized decoders bus: mhi: host: Remove duplicate ee check for syserr bus: mhi: host: Use mhi_tryset_pm_state() for setting fw error state bus: mhi: host: Range check CHDBOFF and ERDBOFF ASoC: dt-bindings: qcom,lpass-rx-macro: correct minItems for clocks kunit: improve KTAP compliance of KUnit test output kunit: fix bug in the order of lines in debugfs logs rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check selftests/resctrl: Return NULL if malloc_and_init_memory() did not alloc mem selftests/resctrl: Move ->setup() call outside of test specific branches selftests/resctrl: Allow ->setup() to return errors selftests/resctrl: Check for return value after write_schemata() selinux: fix Makefile dependencies of flask.h selinux: ensure av_permissions.h is built when needed tpm, tpm_tis: Do not skip reset of original interrupt vector tpm, tpm_tis: Claim locality before writing TPM_INT_ENABLE register tpm, tpm_tis: Disable interrupts if tpm_tis_probe_irq() failed tpm, tpm_tis: Claim locality before writing interrupt registers tpm, tpm: Implement usage counter for locality tpm, tpm_tis: Claim locality when interrupts are reenabled on resume erofs: stop parsing non-compact HEAD index if clusterofs is invalid erofs: initialize packed inode after root inode is assigned erofs: fix potential overflow calculating xattr_isize drm/rockchip: Drop unbalanced obj unref drm/i915/dg2: Drop one PCI ID drm/vgem: add missing mutex_destroy drm/probe-helper: Cancel previous job before starting new one drm/amdgpu: register a vga_switcheroo client for MacBooks with apple-gmux tools/x86/kcpuid: Fix avx512bw and avx512lvl fields in Fn00000007 soc: ti: pm33xx: Fix refcount leak in am33xx_pm_probe arm64: dts: renesas: r8a77990: Remove bogus voltages from OPP table arm64: dts: renesas: r8a774c0: Remove bogus voltages from OPP table arm64: dts: renesas: r9a07g044: Update IRQ numbers for SSI channels arm64: dts: renesas: r9a07g054: Update IRQ numbers for SSI channels arm64: dts: renesas: r9a07g043: Introduce SOC_PERIPHERAL_IRQ() macro to specify interrupt property arm64: dts: renesas: r9a07g043: Update IRQ numbers for SSI channels drm/mediatek: dp: Only trigger DRM HPD events if bridge is attached drm/msm/disp/dpu: check for crtc enable rather than crtc active to release shared resources EDAC/skx: Fix overflows on the DRAM row address mapping arrays ARM: dts: qcom-apq8064: Fix opp table child name regulator: core: Shorten off-on-delay-us for always-on/boot-on by time since booted arm64: dts: ti: k3-am62-main: Fix GPIO numbers in DT arm64: dts: ti: k3-am62a7-sk: Fix DDR size to full 4GB arm64: dts: ti: k3-j721e-main: Remove ti,strobe-sel property arm64: dts: broadcom: bcmbca: bcm4908: fix NAND interrupt name arm64: dts: broadcom: bcmbca: bcm4908: fix LED nodenames arm64: dts: broadcom: bcmbca: bcm4908: fix procmon nodename arm64: dts: qcom: msm8998: Fix stm-stimulus-base reg name arm64: dts: qcom: sc7280: fix EUD port properties arm64: dts: qcom: sdm845: correct dynamic power coefficients arm64: dts: qcom: sdm845: Fix the PCI I/O port range arm64: dts: qcom: msm8998: Fix the PCI I/O port range arm64: dts: qcom: sc7280: Fix the PCI I/O port range arm64: dts: qcom: ipq8074: Fix the PCI I/O port range arm64: dts: qcom: ipq6018: Fix the PCI I/O port range arm64: dts: qcom: msm8996: Fix the PCI I/O port range arm64: dts: qcom: sm8250: Fix the PCI I/O port range arm64: dts: qcom: sm8150: Fix the PCI I/O port range arm64: dts: qcom: sm8450: Fix the PCI I/O port range ARM: dts: qcom: ipq4019: Fix the PCI I/O port range ARM: dts: qcom: ipq8064: Fix the PCI I/O port range ARM: dts: qcom: sdx55: Fix the unit address of PCIe EP node x86/MCE/AMD: Use an u64 for bank_map media: bdisp: Add missing check for create_workqueue media: platform: mtk-mdp3: Add missing check and free for ida_alloc media: amphion: decoder implement display delay enable media: av7110: prevent underflow in write_ts_to_decoder() firmware: qcom_scm: Clear download bit during reboot drm/bridge: adv7533: Fix adv7533_mode_valid for adv7533 and adv7535 media: max9286: Free control handler arm64: dts: ti: k3-am625: Correct L2 cache size to 512KB arm64: dts: ti: k3-am62a7: Correct L2 cache size to 512KB drm/msm/adreno: drop bogus pm_runtime_set_active() drm: msm: adreno: Disable preemption on Adreno 510 virt/coco/sev-guest: Double-buffer messages arm64: dts: qcom: sm8350-microsoft-surface: fix USB dual-role mode property drm/amd/display/dc/dce60/Makefile: Fix previous attempt to silence known override-init warnings ACPI: processor: Fix evaluating _PDC method when running as Xen dom0 mmc: sdhci-of-esdhc: fix quirk to ignore command inhibit for data arm64: dts: qcom: sm8450: fix pcie1 gpios properties name drm: rcar-du: Fix a NULL vs IS_ERR() bug ARM: dts: gta04: fix excess dma channel usage firmware: arm_scmi: Fix xfers allocation on Rx channel perf/arm-cmn: Move overlapping wp_combine field ARM: dts: stm32: fix spi1 pin assignment on stm32mp15 arm64: dts: apple: t8103: Disable unused PCIe ports cpufreq: mediatek: fix passing zero to 'PTR_ERR' cpufreq: mediatek: fix KP caused by handler usage after regulator_put/clk_put cpufreq: mediatek: raise proc/sram max voltage for MT8516 cpufreq: mediatek: Raise proc and sram max voltage for MT7622/7623 cpufreq: qcom-cpufreq-hw: Revert adding cpufreq qos arm64: dts: mediatek: mt8192-asurada: Fix voltage constraint for Vgpu ACPI: VIOT: Initialize the correct IOMMU fwspec drm/lima/lima_drv: Add missing unwind goto in lima_pdev_probe() drm/mediatek: dp: Change the aux retries times when receiving AUX_DEFER mailbox: mpfs: switch to txdone_poll soc: bcm: brcmstb: biuctrl: fix of_iomap leak soc: renesas: renesas-soc: Release 'chipid' from ioremap() gpu: host1x: Fix potential double free if IOMMU is disabled gpu: host1x: Fix memory leak of device names arm64: dts: qcom: sc7280-herobrine-villager: correct trackpad supply arm64: dts: qcom: sc7180-trogdor-lazor: correct trackpad supply arm64: dts: qcom: sc7180-trogdor-pazquel: correct trackpad supply arm64: dts: qcom: msm8994-kitakami: drop unit address from PMI8994 regulator arm64: dts: qcom: msm8994-msft-lumia-octagon: drop unit address from PMI8994 regulator arm64: dts: qcom: apq8096-db820c: drop unit address from PMI8994 regulator drm/ttm: optimize pool allocations a bit v2 drm/ttm/pool: Fix ttm_pool_alloc error path regulator: core: Consistently set mutex_owner when using ww_mutex_lock_slow() regulator: core: Avoid lockdep reports when resolving supplies x86/apic: Fix atomic update of offset in reserve_eilvt_offset() arm64: dts: qcom: msm8994-angler: Fix cont_splash_mem mapping arm64: dts: qcom: msm8994-angler: removed clash with smem_region arm64: dts: sc7180: Rename qspi data12 as data23 arm64: dts: sc7280: Rename qspi data12 as data23 media: mediatek: vcodec: Use 4K frame size when supported by stateful decoder media: mediatek: vcodec: Make MM21 the default capture format media: mediatek: vcodec: Force capture queue format to MM21 media: mediatek: vcodec: add params to record lat and core lat_buf count media: mediatek: vcodec: using each instance lat_buf count replace core ready list media: mediatek: vcodec: move lat_buf to the top of core list media: mediatek: vcodec: add core decode done event media: mediatek: vcodec: remove unused lat_buf media: mediatek: vcodec: making sure queue_work successfully media: mediatek: vcodec: change lat thread decode error condition media: cedrus: fix use after free bug in cedrus_remove due to race condition media: rkvdec: fix use after free bug in rkvdec_remove platform/x86/amd/pmf: Move out of BIOS SMN pair for driver probe platform/x86/amd: pmc: Don't try to read SMU version on Picasso platform/x86/amd: pmc: Hide SMU version and program attributes for Picasso platform/x86/amd: pmc: Don't dump data after resume from s0i3 on picasso platform/x86/amd: pmc: Move idlemask check into `amd_pmc_idlemask_read` platform/x86/amd: pmc: Utilize SMN index 0 for driver probe platform/x86/amd: pmc: Move out of BIOS SMN pair for STB init media: dm1105: Fix use after free bug in dm1105_remove due to race condition media: saa7134: fix use after free bug in saa7134_finidev due to race condition media: platform: mtk-mdp3: fix potential frame size overflow in mdp_try_fmt_mplane() media: rcar_fdp1: Fix refcount leak in probe and remove function media: v4l: async: Return async sub-devices to subnotifier list media: hi846: Fix memleak in hi846_init_controls() drm/amd/display: Fix potential null dereference media: rc: gpio-ir-recv: Fix support for wake-up media: venus: dec: Fix handling of the start cmd media: venus: dec: Fix capture formats enumeration order regulator: stm32-pwr: fix of_iomap leak x86/ioapic: Don't return 0 from arch_dynirq_lower_bound() arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step perf/arm-cmn: Fix port detection for CMN-700 media: mediatek: vcodec: fix decoder disable pm crash media: mediatek: vcodec: add remove function for decoder platform driver debugobject: Prevent init race with static objects drm/i915: Make intel_get_crtc_new_encoder() less oopsy tick/common: Align tick period with the HZ tick. ACPI: bus: Ensure that notify handlers are not running after removal cpufreq: use correct unit when verify cur freq rpmsg: glink: Propagate TX failures in intentless mode as well hwmon: (pmbus/fsp-3y) Fix functionality bitmask in FSP-3Y YM-2151E platform/chrome: cros_typec_switch: Add missing fwnode_handle_put() wifi: ath6kl: minor fix for allocation size wifi: ath9k: hif_usb: fix memory leak of remain_skbs wifi: ath11k: Use platform_get_irq() to get the interrupt wifi: ath5k: Use platform_get_irq() to get the interrupt wifi: ath5k: fix an off by one check in ath5k_eeprom_read_freq_list() wifi: ath11k: fix SAC bug on peer addition with sta band migration wifi: brcmfmac: support CQM RSSI notification with older firmware wifi: ath6kl: reduce WARN to dev_dbg() in callback tools: bpftool: Remove invalid \' json escape wifi: rtw88: mac: Return the original error from rtw_pwr_seq_parser() wifi: rtw88: mac: Return the original error from rtw_mac_power_switch() bpf: take into account liveness when propagating precision bpf: fix precision propagation verbose logging crypto: qat - fix concurrency issue when device state changes scm: fix MSG_CTRUNC setting condition for SO_PASSSEC wifi: ath11k: fix deinitialization of firmware resources selftests/bpf: Fix a fd leak in an error path in network_helpers.c bpf: Remove misleading spec_v1 check on var-offset stack read net: pcs: xpcs: remove double-read of link state when using AN vlan: partially enable SIOCSHWTSTAMP in container net/packet: annotate accesses to po->xmit net/packet: convert po->origdev to an atomic flag net/packet: convert po->auxdata to an atomic flag libbpf: Fix ld_imm64 copy logic for ksym in light skeleton. net: dsa: qca8k: remove assignment of an_enabled in pcs_get_state() netfilter: keep conntrack reference until IPsecv6 policy checks are done bpf: Fix __reg_bound_offset 64->32 var_off subreg propagation scsi: target: core: Change the way target_xcopy_do_work() sets restiction on max I/O scsi: target: Move sess cmd counter to new struct scsi: target: Move cmd counter allocation scsi: target: Pass in cmd counter to use during cmd setup scsi: target: iscsit: isert: Alloc per conn cmd counter scsi: target: iscsit: Stop/wait on cmds during conn close scsi: target: Fix multiple LUN_RESET handling scsi: target: iscsit: Fix TAS handling during conn cleanup scsi: megaraid: Fix mega_cmd_done() CMDID_INT_CMDS net: sunhme: Fix uninitialized return code f2fs: handle dqget error in f2fs_transfer_project_quota() f2fs: fix uninitialized skipped_gc_rwsem f2fs: apply zone capacity to all zone type f2fs: compress: fix to call f2fs_wait_on_page_writeback() in f2fs_write_raw_pages() f2fs: fix scheduling while atomic in decompression path crypto: caam - Clear some memory in instantiate_rng crypto: sa2ul - Select CRYPTO_DES wifi: rtlwifi: fix incorrect error codes in rtl_debugfs_set_write_rfreg() wifi: rtlwifi: fix incorrect error codes in rtl_debugfs_set_write_reg() scsi: libsas: Add sas_ata_device_link_abort() scsi: hisi_sas: Handle NCQ error when IPTT is valid wifi: rt2x00: Fix memory leak when handling surveys f2fs: fix iostat lock protection net: qrtr: correct types of trace event parameters selftests: xsk: Use correct UMEM size in testapp_invalid_desc selftests: xsk: Disable IPv6 on VETH1 selftests: xsk: Deflakify STATS_RX_DROPPED test selftests/bpf: Wait for receive in cg_storage_multi test bpftool: Fix bug for long instructions in program CFG dumps crypto: drbg - Only fail when jent is unavailable in FIPS mode xsk: Fix unaligned descriptor validation f2fs: fix to avoid use-after-free for cached IPU bio wifi: iwlwifi: fix duplicate entry in iwl_dev_info_table bpf/btf: Fix is_int_ptr() scsi: lpfc: Fix ioremap issues in lpfc_sli4_pci_mem_setup() net: ethernet: stmmac: dwmac-rk: rework optional clock handling net: ethernet: stmmac: dwmac-rk: fix optional phy regulator handling wifi: ath11k: fix writing to unintended memory region bpf, sockmap: fix deadlocks in the sockhash and sockmap nvmet: fix error handling in nvmet_execute_identify_cns_cs_ns() nvmet: fix Identify Namespace handling nvmet: fix Identify Controller handling nvmet: fix Identify Active Namespace ID list handling nvmet: fix I/O Command Set specific Identify Controller nvme: fix async event trace event nvme-fcloop: fix "inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage" selftests/bpf: Use read_perf_max_sample_freq() in perf_event_stackmap selftests/bpf: Fix leaked bpf_link in get_stackid_cannot_attach blk-mq: don't plug for head insertions in blk_execute_rq_nowait wifi: iwlwifi: debug: fix crash in __iwl_err() wifi: iwlwifi: trans: don't trigger d3 interrupt twice wifi: iwlwifi: mvm: don't set CHECKSUM_COMPLETE for unsupported protocols bpf, sockmap: Revert buggy deadlock fix in the sockhash and sockmap f2fs: fix to check return value of f2fs_do_truncate_blocks() f2fs: fix to check return value of inc_valid_block_count() md/raid10: fix task hung in raid10d md/raid10: fix leak of 'r10bio->remaining' for recovery md/raid10: fix memleak for 'conf->bio_split' md/raid10: fix memleak of md thread md/raid10: don't call bio_start_io_acct twice for bio which experienced read error wifi: iwlwifi: mvm: don't drop unencrypted MCAST frames wifi: iwlwifi: yoyo: skip dump correctly on hw error wifi: iwlwifi: yoyo: Fix possible division by zero wifi: iwlwifi: mvm: initialize seq variable wifi: iwlwifi: fw: move memset before early return jdb2: Don't refuse invalidation of already invalidated buffers io_uring/rsrc: use nospec'ed indexes wifi: iwlwifi: make the loop for card preparation effective wifi: mt76: mt7915: expose device tree match table wifi: mt76: handle failure of vzalloc in mt7615_coredump_work wifi: mt76: add flexible polling wait-interval support wifi: mt76: mt7921e: fix probe timeout after reboot wifi: mt76: fix 6GHz high channel not be scanned mt76: mt7921: fix kernel panic by accessing unallocated eeprom.data wifi: mt76: mt7921: fix missing unwind goto in `mt7921u_probe` wifi: mt76: mt7921e: improve reliability of dma reset wifi: mt76: mt7921e: stop chip reset worker in unregister hook wifi: mt76: connac: fix txd multicast rate setting wifi: iwlwifi: mvm: check firmware response size netfilter: conntrack: restore IPS_CONFIRMED out of nf_conntrack_hash_check_insert() netfilter: conntrack: fix wrong ct->timeout value wifi: iwlwifi: fw: fix memory leak in debugfs ixgbe: Allow flow hash to be set via ethtool ixgbe: Enable setting RSS table to default values net/mlx5e: Don't clone flow post action attributes second time net/mlx5: E-switch, Create per vport table based on devlink encap mode net/mlx5: E-switch, Don't destroy indirect table in split rule net/mlx5e: Fix error flow in representor failing to add vport rx rule net/mlx5: Remove "recovery" arg from mlx5_load_one() function net/mlx5: Suspend auxiliary devices only in case of PCI device suspend Revert "net/mlx5: Remove "recovery" arg from mlx5_load_one() function" net/mlx5: Use recovery timeout on sync reset flow net/mlx5e: Nullify table pointer when failing to create net: stmmac:fix system hang when setting up tag_8021q VLAN for DSA ports bpf: Fix race between btf_put and btf_idr walk. bpf: Don't EFAULT for getsockopt with optval=NULL netfilter: nf_tables: don't write table validation state without mutex net: dpaa: Fix uninitialized variable in dpaa_stop() net/sched: sch_fq: fix integer overflow of "credit" ipv4: Fix potential uninit variable access bug in __ip_make_skb() Revert "Bluetooth: btsdio: fix use after free bug in btsdio_remove due to unfinished work" netlink: Use copy_to_user() for optval in netlink_getsockopt(). net: amd: Fix link leak when verifying config failed tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. ipmi: ASPEED_BT_IPMI_BMC: select REGMAP_MMIO instead of depending on it ASoC: cs35l41: Only disable internal boost drivers: staging: rtl8723bs: Fix locking in _rtw_join_timeout_handler() drivers: staging: rtl8723bs: Fix locking in rtw_scan_timeout_handler() pstore: Revert pmsg_lock back to a normal mutex usb: host: xhci-rcar: remove leftover quirk handling usb: dwc3: gadget: Change condition for processing suspend event serial: stm32: Re-assert RTS/DE GPIO in RS485 mode only if more data are transmitted fpga: bridge: fix kernel-doc parameter description iio: light: max44009: add missing OF device matching serial: 8250_bcm7271: Fix arbitration handling spi: atmel-quadspi: Don't leak clk enable count in pm resume spi: atmel-quadspi: Free resources even if runtime resume failed in .remove() spi: imx: Don't skip cleanup in remove's error path usb: gadget: udc: renesas_usb3: Fix use after free bug in renesas_usb3_remove due to race condition ASoC: soc-compress: Inherit atomicity from DAI link for Compress FE PCI: imx6: Install the fault handler only on compatible match ASoC: es8316: Handle optional IRQ assignment linux/vt_buffer.h: allow either builtin or modular for macros spi: qup: Don't skip cleanup in remove's error path interconnect: qcom: rpm: drop bogus pm domain attach spi: fsl-spi: Fix CPM/QE mode Litte Endian vmci_host: fix a race condition in vmci_host_poll() causing GPF of: Fix modalias string generation PCI/EDR: Clear Device Status after EDR error recovery ia64: mm/contig: fix section mismatch warning/error ia64: salinfo: placate defined-but-not-used warning scripts/gdb: bail early if there are no clocks scripts/gdb: bail early if there are no generic PD HID: amd_sfh: Correct the structure fields HID: amd_sfh: Correct the sensor enable and disable command HID: amd_sfh: Fix illuminance value HID: amd_sfh: Add support for shutdown operation HID: amd_sfh: Correct the stop all command HID: amd_sfh: Increase sensor command timeout for SFH1.1 HID: amd_sfh: Handle "no sensors" enabled for SFH1.1 cacheinfo: Check sib_leaf in cache_leaves_are_shared() coresight: etm_pmu: Set the module field drm/panel: novatek-nt35950: Improve error handling ASoC: fsl_mqs: move of_node_put() to the correct location PCI/PM: Extend D3hot delay for NVIDIA HDA controllers drm/panel: novatek-nt35950: Only unregister DSI1 if it exists spi: cadence-quadspi: fix suspend-resume implementations i2c: cadence: cdns_i2c_master_xfer(): Fix runtime PM leak on error path i2c: xiic: xiic_xfer(): Fix runtime PM leak on error path scripts/gdb: raise error with reduced debugging information uapi/linux/const.h: prefer ISO-friendly __typeof__ sh: sq: Fix incorrect element size for allocating bitmap buffer usb: gadget: tegra-xudc: Fix crash in vbus_draw usb: chipidea: fix missing goto in `ci_hdrc_probe` usb: mtu3: fix kernel panic at qmu transfer done irq handler firmware: stratix10-svc: Fix an NULL vs IS_ERR() bug in probe tty: serial: fsl_lpuart: adjust buffer length to the intended size serial: 8250: Add missing wakeup event reporting spi: cadence-quadspi: use macro DEFINE_SIMPLE_DEV_PM_OPS staging: rtl8192e: Fix W_DISABLE# does not work after stop/start spmi: Add a check for remove callback when removing a SPMI driver virtio_ring: don't update event idx on get_buf fbdev: mmp: Fix deferred clk handling in mmphw_probe() selftests/powerpc/pmu: Fix sample field check in the mmcra_thresh_marked_sample_test macintosh/windfarm_smu_sat: Add missing of_node_put() powerpc/perf: Properly detect mpc7450 family powerpc/mpc512x: fix resource printk format warning powerpc/wii: fix resource printk format warnings powerpc/sysdev/tsi108: fix resource printk format warnings macintosh: via-pmu-led: requires ATA to be set powerpc/rtas: use memmove for potentially overlapping buffer copy sched/fair: Fix inaccurate tally of ttwu_move_affine perf/core: Fix hardlockup failure caused by perf throttle Revert "objtool: Support addition to set CFA base" riscv: Fix ptdump when KASAN is enabled sched/rt: Fix bad task migration for rt tasks tracing/user_events: Ensure write index cannot be negative clk: at91: clk-sam9x60-pll: fix return value check IB/hifi1: add a null check of kzalloc_node in hfi1_ipoib_txreq_init RDMA/siw: Fix potential page_array out of range access clk: mediatek: mt2712: Add error handling to clk_mt2712_apmixed_probe() clk: mediatek: Consistently use GATE_MTK() macro clk: mediatek: mt7622: Properly use CLK_IS_CRITICAL flag clk: mediatek: mt8135: Properly use CLK_IS_CRITICAL flag RDMA/rdmavt: Delete unnecessary NULL check clk: qcom: gcc-qcm2290: Fix up gcc_sdcc2_apps_clk_src workqueue: Fix hung time report of worker pools rtc: omap: include header for omap_rtc_power_off_program prototype RDMA/mlx4: Prevent shift wrapping in set_user_sq_size() rtc: meson-vrtc: Use ktime_get_real_ts64() to get the current time rtc: k3: handle errors while enabling wake irq RDMA/erdma: Use fixed hardware page size fs/ntfs3: Fix memory leak if ntfs_read_mft failed fs/ntfs3: Add check for kmemdup fs/ntfs3: Fix OOB read in indx_insert_into_buffer fs/ntfs3: Fix slab-out-of-bounds read in hdr_delete_de() iommu/mediatek: Set dma_mask for PGTABLE_PA_35_EN power: supply: generic-adc-battery: fix unit scaling clk: add missing of_node_put() in "assigned-clocks" property parsing RDMA/siw: Remove namespace check from siw_netdev_event() clk: qcom: gcc-sm6115: Mark RCGs shared where applicable power: supply: rk817: Fix low SOC bugs RDMA/cm: Trace icm_send_rej event before the cm state is reset RDMA/srpt: Add a check for valid 'mad_agent' pointer IB/hfi1: Fix SDMA mmu_rb_node not being evicted in LRU order IB/hfi1: Fix bugs with non-PAGE_SIZE-end multi-iovec user SDMA requests clk: imx: fracn-gppll: fix the rate table clk: imx: fracn-gppll: disable hardware select control clk: imx: imx8ulp: Fix XBAR_DIVBUS and AD_SLOW clock parents NFSv4.1: Always send a RECLAIM_COMPLETE after establishing lease iommu/amd: Set page size bitmap during V2 domain allocation clk: qcom: lpasscc-sc7280: Skip qdsp6ss clock registration clk: qcom: lpassaudiocc-sc7280: Add required gdsc power domain clks in lpass_cc_sc7280_desc clk: qcom: gcc-sm8350: fix PCIe PIPE clocks handling clk: qcom: dispcc-qcm2290: get rid of test clock clk: qcom: dispcc-qcm2290: Remove inexistent DSI1PHY clk Input: raspberrypi-ts - fix refcount leak in rpi_ts_probe swiotlb: relocate PageHighMem test away from rmem_swiotlb_setup swiotlb: fix debugfs reporting of reserved memory pools RDMA/mlx5: Check pcie_relaxed_ordering_enabled() in UMR RDMA/mlx5: Fix flow counter query via DEVX SUNRPC: remove the maximum number of retries in call_bind_status RDMA/mlx5: Use correct device num_ports when modify DC clocksource/drivers/davinci: Fix memory leak in davinci_timer_register when init fails openrisc: Properly store r31 to pt_regs on unhandled exceptions timekeeping: Fix references to nonexistent ktime_get_fast_ns() SMB3: Add missing locks to protect deferred close file list SMB3: Close deferred file handles in case of handle lease break ext4: fix i_disksize exceeding i_size problem in paritally written case ext4: fix use-after-free read in ext4_find_extent for bigalloc + inline pinctrl: renesas: r8a779a0: Remove incorrect AVB[01] pinmux configuration pinctrl: renesas: r8a779f0: Fix tsn1_avtp_pps pin group pinctrl: renesas: r8a779g0: Fix Group 4/5 pin functions pinctrl: renesas: r8a779g0: Fix Group 6/7 pin functions pinctrl: renesas: r8a779g0: Fix ERROROUTC function names leds: TI_LMU_COMMON: select REGMAP instead of depending on it pinctrl: ralink: reintroduce ralink,rt2880-pinmux compatible string dmaengine: mv_xor_v2: Fix an error code. leds: tca6507: Fix error handling of using fwnode_property_read_string pwm: mtk-disp: Disable shadow registers before setting backlight values pwm: mtk-disp: Configure double buffering before reading in .get_state() soundwire: cadence: rename sdw_cdns_dai_dma_data as sdw_cdns_dai_runtime soundwire: intel: don't save hw_params for use in prepare phy: tegra: xusb: Add missing tegra_xusb_port_unregister for usb2_port and ulpi_port phy: ti: j721e-wiz: Fix unreachable code in wiz_mode_select() dma: gpi: remove spurious unlock in gpi_ch_init dmaengine: dw-edma: Fix to change for continuous transfer dmaengine: dw-edma: Fix to enable to issue dma request on DMA processing dmaengine: at_xdmac: do not enable all cyclic channels pinctrl-bcm2835.c: fix race condition when setting gpio dir thermal/drivers/mediatek: Use devm_of_iomap to avoid resource leak in mtk_thermal_probe mfd: tqmx86: Do not access I2C_DETECT register through io_base mfd: tqmx86: Specify IO port register range more precisely mfd: tqmx86: Correct board names for TQMxE39x mfd: ocelot-spi: Fix unsupported bulk read mfd: arizona-spi: Add missing MODULE_DEVICE_TABLE hte: tegra: fix 'struct of_device_id' build error hte: tegra-194: Fix off by one in tegra_hte_map_to_line_id() ACPI: PM: Do not turn of unused power resources on the Toshiba Click Mini PM: hibernate: Turn snapshot_test into global variable PM: hibernate: Do not get block device exclusively in test_resume mode afs: Fix updating of i_size with dv jump from server afs: Fix getattr to report server i_size on dirs, not local size afs: Avoid endless loop if file is larger than expected parisc: Fix argument pointer in real64_call_asm() parisc: Ensure page alignment in flush functions ALSA: usb-audio: Add quirk for Pioneer DDJ-800 ALSA: hda/realtek: Add quirk for ThinkPad P1 Gen 6 ALSA: hda/realtek: Add quirk for ASUS UM3402YAR using CS35L41 ALSA: hda/realtek: support HP Pavilion Aero 13-be0xxx Mute LED ALSA: hda/realtek: Fix mute and micmute LEDs for an HP laptop nilfs2: do not write dirty data after degenerating to read-only nilfs2: fix infinite loop in nilfs_mdt_get_block() mm: do not reclaim private data from pinned page drbd: correctly submit flush bio on barrier md/raid10: fix null-ptr-deref in raid10_sync_request md/raid5: Improve performance for sequential IO kasan: hw_tags: avoid invalid virt_to_page() mtd: core: provide unique name for nvmem device, take two mtd: core: fix nvmem error reporting mtd: core: fix error path for nvmem provider mtd: spi-nor: core: Update flash's current address mode when changing address mode mailbox: zynqmp: Fix IPI isr handling kcsan: Avoid READ_ONCE() in read_instrumented_memory() mailbox: zynqmp: Fix typo in IPI documentation wifi: rtl8xxxu: RTL8192EU always needs full init wifi: rtw89: fix potential race condition between napi_init and napi_enable clk: microchip: fix potential UAF in auxdev release callback clk: rockchip: rk3399: allow clk_cifout to force clk_cifout_src to reparent scripts/gdb: fix lx-timerlist for Python3 btrfs: scrub: reject unsupported scrub flags s390/dasd: fix hanging blockdevice after request requeue ia64: fix an addr to taddr in huge_pte_offset() mm/mempolicy: correctly update prev when policy is equal on mbind vhost_vdpa: fix unmap process in no-batch mode dm verity: fix error handling for check_at_most_once on FEC dm clone: call kmem_cache_destroy() in dm_clone_init() error path dm integrity: call kmem_cache_destroy() in dm_integrity_init() error path dm flakey: fix a crash with invalid table line dm ioctl: fix nested locking in table_clear() to remove deadlock concern dm: don't lock fs when the map is NULL in process of resume blk-iocost: avoid 64-bit division in ioc_timer_fn cifs: fix potential use-after-free bugs in TCP_Server_Info::hostname cifs: protect session status check in smb2_reconnect() thunderbolt: Use correct type in tb_port_is_clx_enabled() prototype bonding (gcc13): synchronize bond_{a,t}lb_xmit() types wifi: ath11k: synchronize ath11k_mac_he_gi_to_nl80211_he_gi()'s return type perf auxtrace: Fix address filter entire kernel size perf intel-pt: Fix CYC timestamps after standalone CBR block/blk-iocost (gcc13): keep large values in a new enum sfc (gcc13): synchronize ef100_enqueue_skb()'s return type i40e: Remove unused i40e status codes i40e: Remove string printing for i40e_status i40e: use int for i40e_status drm/amd/display (gcc13): fix enum mismatch debugobject: Ensure pool refill (again) scsi: libsas: Grab the ATA port lock in sas_ata_device_link_abort() netfilter: nf_tables: deactivate anonymous set from preparation phase Linux 6.1.28 Change-Id: I61b5133e2d051cc2aa39b8c7c1be3fc25da40210 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
c3fc733798 |
tcp: fix mishandling when the sack compression is deferred.
[ Upstream commit 30c6f0bf9579debce27e45fac34fdc97e46acacc ]
In this patch, we mainly try to handle sending a compressed ack
correctly if it's deferred.
Here are more details in the old logic:
When sack compression is triggered in the tcp_compressed_ack_kick(),
if the sock is owned by user, it will set TCP_DELACK_TIMER_DEFERRED
and then defer to the release cb phrase. Later once user releases
the sock, tcp_delack_timer_handler() should send a ack as expected,
which, however, cannot happen due to lack of ICSK_ACK_TIMER flag.
Therefore, the receiver would not sent an ack until the sender's
retransmission timeout. It definitely increases unnecessary latency.
Fixes:
|
||
|
752836e1a2 |
tcp: Return user_mss for TCP_MAXSEG in CLOSE/LISTEN state if user_mss set
[ Upstream commit 34dfde4ad87b84d21278a7e19d92b5b2c68e6c4d ]
This patch replaces the tp->mss_cache check in getting TCP_MAXSEG
with tp->rx_opt.user_mss check for CLOSE/LISTEN sock. Since
tp->mss_cache is initialized with TCP_MSS_DEFAULT, checking if
it's zero is probably a bug.
With this change, getting TCP_MAXSEG before connecting will return
default MSS normally, and return user_mss if user_mss is set.
Fixes:
|
||
|
c2251ce048 |
tcp: deny tcp_disconnect() when threads are waiting
[ Upstream commit 4faeee0cf8a5d88d63cdbc3bab124fb0e6aed08c ] Historically connect(AF_UNSPEC) has been abused by syzkaller and other fuzzers to trigger various bugs. A recent one triggers a divide-by-zero [1], and Paolo Abeni was able to diagnose the issue. tcp_recvmsg_locked() has tests about sk_state being not TCP_LISTEN and TCP REPAIR mode being not used. Then later if socket lock is released in sk_wait_data(), another thread can call connect(AF_UNSPEC), then make this socket a TCP listener. When recvmsg() is resumed, it can eventually call tcp_cleanup_rbuf() and attempt a divide by 0 in tcp_rcv_space_adjust() [1] This patch adds a new socket field, counting number of threads blocked in sk_wait_event() and inet_wait_for_connect(). If this counter is not zero, tcp_disconnect() returns an error. This patch adds code in blocking socket system calls, thus should not hurt performance of non blocking ones. Note that we probably could revert commit |
||
|
5dd0547a3e |
UPSTREAM: mm: replace vma->vm_flags direct modifications with modifier calls
Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 1c71222e5f2393b5ea1a41795c67589eea7e3490) Bug: 161210518 Change-Id: Ifc352b487db109adab17dd33a83f5c7e68c0bbc6 Signed-off-by: Suren Baghdasaryan <surenb@google.com> |
||
|
1d39b94f8c |
UPSTREAM: ipv{4,6}/raw: fix output xfrm lookup wrt protocol
With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the
protocol field of the flow structure, build by raw_sendmsg() /
rawv6_sendmsg()), is set to IPPROTO_RAW. This breaks the ipsec policy
lookup when some policies are defined with a protocol in the selector.
For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
specify the protocol. Just accept all values for IPPROTO_RAW socket.
For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
without breaking backward compatibility (the value of this field was never
checked). Let's add a new kind of control message, so that the userland
could specify which protocol is used.
Fixes:
|
||
|
9713594a2b |
UPSTREAM: inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Change-Id: I06e1860472cd2f90bf030076be0c87b9b775a3df Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> (cherry picked from commit 91d0b78c5177f3e42a4d8738af8ac19c3a90d002) Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
fe735073a5 |
bpf, sockmap: Incorrectly handling copied_seq
[ Upstream commit e5c6de5fa025882babf89cecbed80acf49b987fa ]
The read_skb() logic is incrementing the tcp->copied_seq which is used for
among other things calculating how many outstanding bytes can be read by
the application. This results in application errors, if the application
does an ioctl(FIONREAD) we return zero because this is calculated from
the copied_seq value.
To fix this we move tcp->copied_seq accounting into the recv handler so
that we update these when the recvmsg() hook is called and data is in
fact copied into user buffers. This gives an accurate FIONREAD value
as expected and improves ACK handling. Before we were calling the
tcp_rcv_space_adjust() which would update 'number of bytes copied to
user in last RTT' which is wrong for programs returning SK_PASS. The
bytes are only copied to the user when recvmsg is handled.
Doing the fix for recvmsg is straightforward, but fixing redirect and
SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then
call this from skmsg handlers. This fixes another issue where a broken
socket with a BPF program doing a resubmit could hang the receiver. This
happened because although read_skb() consumed the skb through sock_drop()
it did not update the copied_seq. Now if a single reccv socket is
redirecting to many sockets (for example for lb) the receiver sk will be
hung even though we might expect it to continue. The hang comes from
not updating the copied_seq numbers and memory pressure resulting from
that.
We have a slight layer problem of calling tcp_eat_skb even if its not
a TCP socket. To fix we could refactor and create per type receiver
handlers. I decided this is more work than we want in the fix and we
already have some small tweaks depending on caller that use the
helper skb_bpf_strparser(). So we extend that a bit and always set
the strparser bit when it is in use and then we can gate the
seq_copied updates on this.
Fixes:
|
||
|
ab90b68f65 |
bpf, sockmap: TCP data stall on recv before accept
[ Upstream commit ea444185a6bf7da4dd0df1598ee953e4f7174858 ]
A common mechanism to put a TCP socket into the sockmap is to hook the
BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program
that can map the socket info to the correct BPF verdict parser. When
the user adds the socket to the map the psock is created and the new
ops are assigned to ensure the verdict program will 'see' the sk_buffs
as they arrive.
Part of this process hooks the sk_data_ready op with a BPF specific
handler to wake up the BPF verdict program when data is ready to read.
The logic is simple enough (posted here for easy reading)
static void sk_psock_verdict_data_ready(struct sock *sk)
{
struct socket *sock = sk->sk_socket;
if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
return;
sock->ops->read_skb(sk, sk_psock_verdict_recv);
}
The oversight here is sk->sk_socket is not assigned until the application
accepts() the new socket. However, its entirely ok for the peer application
to do a connect() followed immediately by sends. The socket on the receiver
is sitting on the backlog queue of the listening socket until its accepted
and the data is queued up. If the peer never accepts the socket or is slow
it will eventually hit data limits and rate limit the session. But,
important for BPF sockmap hooks when this data is received TCP stack does
the sk_data_ready() call but the read_skb() for this data is never called
because sk_socket is missing. The data sits on the sk_receive_queue.
Then once the socket is accepted if we never receive more data from the
peer there will be no further sk_data_ready calls and all the data
is still on the sk_receive_queue(). Then user calls recvmsg after accept()
and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler.
The handler checks for data in the sk_msg ingress queue expecting that
the BPF program has already run from the sk_data_ready hook and enqueued
the data as needed. So we are stuck.
To fix do an unlikely check in recvmsg handler for data on the
sk_receive_queue and if it exists wake up data_ready. We have the sock
locked in both read_skb and recvmsg so should avoid having multiple
runners.
Fixes:
|
||
|
3a2129ebae |
bpf, sockmap: Handle fin correctly
[ Upstream commit 901546fd8f9ca4b5c481ce00928ab425ce9aacc0 ]
The sockmap code is returning EAGAIN after a FIN packet is received and no
more data is on the receive queue. Correct behavior is to return 0 to the
user and the user can then close the socket. The EAGAIN causes many apps
to retry which masks the problem. Eventually the socket is evicted from
the sockmap because its released from sockmap sock free handling. The
issue creates a delay and can cause some errors on application side.
To fix this check on sk_msg_recvmsg side if length is zero and FIN flag
is set then set return to zero. A selftest will be added to check this
condition.
Fixes:
|
||
|
4ae2af3e59 |
bpf, sockmap: Pass skb ownership through read_skb
[ Upstream commit 78fa0d61d97a728d306b0c23d353c0e340756437 ]
The read_skb hook calls consume_skb() now, but this means that if the
recv_actor program wants to use the skb it needs to inc the ref cnt
so that the consume_skb() doesn't kfree the sk_buff.
This is problematic because in some error cases under memory pressure
we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
Then we get this,
skb_linearize()
__pskb_pull_tail()
pskb_expand_head()
BUG_ON(skb_shared(skb))
Because we incremented users refcnt from sk_psock_verdict_recv() we
hit the bug on with refcnt > 1 and trip it.
To fix lets simply pass ownership of the sk_buff through the skb_read
call. Then we can drop the consume from read_skb handlers and assume
the verdict recv does any required kfree.
Bug found while testing in our CI which runs in VMs that hit memory
constraints rather regularly. William tested TCP read_skb handlers.
[ 106.536188] ------------[ cut here ]------------
[ 106.536197] kernel BUG at net/core/skbuff.c:1693!
[ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
[ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
[ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
[ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
[ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
[ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
[ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
[ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
[ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
[ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
[ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 106.542255] Call Trace:
[ 106.542383] <IRQ>
[ 106.542487] __pskb_pull_tail+0x4b/0x3e0
[ 106.542681] skb_ensure_writable+0x85/0xa0
[ 106.542882] sk_skb_pull_data+0x18/0x20
[ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
[ 106.543536] ? migrate_disable+0x66/0x80
[ 106.543871] sk_psock_verdict_recv+0xe2/0x310
[ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0
[ 106.544561] tcp_read_skb+0x7b/0x120
[ 106.544740] tcp_data_queue+0x904/0xee0
[ 106.544931] tcp_rcv_established+0x212/0x7c0
[ 106.545142] tcp_v4_do_rcv+0x174/0x2a0
[ 106.545326] tcp_v4_rcv+0xe70/0xf60
[ 106.545500] ip_protocol_deliver_rcu+0x48/0x290
[ 106.545744] ip_local_deliver_finish+0xa7/0x150
Fixes:
|
||
|
3f5413c954 |
ipv{4,6}/raw: fix output xfrm lookup wrt protocol
[ Upstream commit 3632679d9e4f879f49949bb5b050e0de553e4739 ]
With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the
protocol field of the flow structure, build by raw_sendmsg() /
rawv6_sendmsg()), is set to IPPROTO_RAW. This breaks the ipsec policy
lookup when some policies are defined with a protocol in the selector.
For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
specify the protocol. Just accept all values for IPPROTO_RAW socket.
For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
without breaking backward compatibility (the value of this field was never
checked). Let's add a new kind of control message, so that the userland
could specify which protocol is used.
Fixes:
|
||
|
6728486447 |
inet: Add IP_LOCAL_PORT_RANGE socket option
[ Upstream commit 91d0b78c5177f3e42a4d8738af8ac19c3a90d002 ] Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 | PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 3632679d9e4f ("ipv{4,6}/raw: fix output xfrm lookup wrt protocol") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
2a112f0462 |
udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated().
commit ad42a35bdfc6d3c0fc4cb4027d7b2757ce665665 upstream. syzbot reported [0] a null-ptr-deref in sk_get_rmem0() while using IPPROTO_UDPLITE (0x88): 14:25:52 executing program 1: r0 = socket$inet6(0xa, 0x80002, 0x88) We had a similar report [1] for probably sk_memory_allocated_add() in __sk_mem_raise_allocated(), and commit |
||
|
820a60a416 |
tcp: fix possible sk_priority leak in tcp_v4_send_reset()
[ Upstream commit 1e306ec49a1f206fd2cc89a42fac6e6f592a8cc1 ]
When tcp_v4_send_reset() is called with @sk == NULL,
we do not change ctl_sk->sk_priority, which could have been
set from a prior invocation.
Change tcp_v4_send_reset() to set sk_priority and sk_mark
fields before calling ip_send_unicast_reply().
This means tcp_v4_send_reset() and tcp_v4_send_ack()
no longer have to clear ctl_sk->sk_mark after
their call to ip_send_unicast_reply().
Fixes:
|
||
|
b4c0af8974 |
tcp: add annotations around sk->sk_shutdown accesses
[ Upstream commit e14cadfd80d76f01bfaa1a8d745b1db19b57d6be ]
Now sk->sk_shutdown is no longer a bitfield, we can add
standard READ_ONCE()/WRITE_ONCE() annotations to silence
KCSAN reports like the following:
BUG: KCSAN: data-race in tcp_disconnect / tcp_poll
write to 0xffff88814588582c of 1 bytes by task 3404 on cpu 1:
tcp_disconnect+0x4d6/0xdb0 net/ipv4/tcp.c:3121
__inet_stream_connect+0x5dd/0x6e0 net/ipv4/af_inet.c:715
inet_stream_connect+0x48/0x70 net/ipv4/af_inet.c:727
__sys_connect_file net/socket.c:2001 [inline]
__sys_connect+0x19b/0x1b0 net/socket.c:2018
__do_sys_connect net/socket.c:2028 [inline]
__se_sys_connect net/socket.c:2025 [inline]
__x64_sys_connect+0x41/0x50 net/socket.c:2025
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff88814588582c of 1 bytes by task 3374 on cpu 0:
tcp_poll+0x2e6/0x7d0 net/ipv4/tcp.c:562
sock_poll+0x253/0x270 net/socket.c:1383
vfs_poll include/linux/poll.h:88 [inline]
io_poll_check_events io_uring/poll.c:281 [inline]
io_poll_task_func+0x15a/0x820 io_uring/poll.c:333
handle_tw_list io_uring/io_uring.c:1184 [inline]
tctx_task_work+0x1fe/0x4d0 io_uring/io_uring.c:1246
task_work_run+0x123/0x160 kernel/task_work.c:179
get_signal+0xe64/0xff0 kernel/signal.c:2635
arch_do_signal_or_restart+0x89/0x2a0 arch/x86/kernel/signal.c:306
exit_to_user_mode_loop+0x6f/0xe0 kernel/entry/common.c:168
exit_to_user_mode_prepare+0x6c/0xb0 kernel/entry/common.c:204
__syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:297
do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x03 -> 0x00
Fixes:
|
||
|
65531f5675 |
net: deal with most data-races in sk_wait_event()
[ Upstream commit d0ac89f6f9879fae316c155de77b5173b3e2c9c9 ]
__condition is evaluated twice in sk_wait_event() macro.
First invocation is lockless, and reads can race with writes,
as spotted by syzbot.
BUG: KCSAN: data-race in sk_stream_wait_connect / tcp_disconnect
write to 0xffff88812d83d6a0 of 4 bytes by task 9065 on cpu 1:
tcp_disconnect+0x2cd/0xdb0
inet_shutdown+0x19e/0x1f0 net/ipv4/af_inet.c:911
__sys_shutdown_sock net/socket.c:2343 [inline]
__sys_shutdown net/socket.c:2355 [inline]
__do_sys_shutdown net/socket.c:2363 [inline]
__se_sys_shutdown+0xf8/0x140 net/socket.c:2361
__x64_sys_shutdown+0x31/0x40 net/socket.c:2361
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff88812d83d6a0 of 4 bytes by task 9040 on cpu 0:
sk_stream_wait_connect+0x1de/0x3a0 net/core/stream.c:75
tcp_sendmsg_locked+0x2e4/0x2120 net/ipv4/tcp.c:1266
tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1484
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:651
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
__sys_sendto+0x246/0x300 net/socket.c:2142
__do_sys_sendto net/socket.c:2154 [inline]
__se_sys_sendto net/socket.c:2150 [inline]
__x64_sys_sendto+0x78/0x90 net/socket.c:2150
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x00000000 -> 0x00000068
Fixes:
|
||
|
fc60067260 |
ipv4: Fix potential uninit variable access bug in __ip_make_skb()
[ Upstream commit 99e5acae193e369b71217efe6f1dad42f3f18815 ]
Like commit ea30388baebc ("ipv6: Fix an uninit variable access bug in
__ip6_make_skb()"). icmphdr does not in skb linear region under the
scenario of SOCK_RAW socket. Access icmp_hdr(skb)->type directly will
trigger the uninit variable access bug.
Use a local variable icmp_type to carry the correct value in different
scenarios.
Fixes:
|
||
|
18c6e1f4af |
Revert "Revert "raw: Fix NULL deref in raw_get_next().""
This reverts commit
|
||
|
01e7770c33 |
Revert "Revert "raw: use net_hash_mix() in hash function""
This reverts commit
|
||
|
55e4f0c551 |
This is the 6.1.25 stable release
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmRBFW0ACgkQONu9yGCS aT7Jew//Ytw9+JQ71LT1TuJnQ1GayJOL1BW5hgxoYgnBFasWDwsGA9rzHs6KHqHb 0Vjk7MX7VZB+6zWakOxY5CFVM33J4fS7wY8WE2bj8X3QQhD/J0HQDMdELvSBi3qF 7xI6sghEQEwOuwAj2+CBqm/q7rA5FTnO1QgJuk/AKJ6PHGRiQeZ7q1zGpFvSaj7S cyKvY99RsjnUN+PYk4LE2+u/6DVCqiWYVDQrdjalb9zsrXg4+nmPH6ZJzZX8+bbM eM0xAR675V8TXqDi+8bj7tWmiS52XyjYF3Q/bu9BmU67DqslH9FFyVQxhgTHUZpN qWXkojEU2djIc3qt7T/bpZS/vD8Kg3Px1CgyIRN8Y5SlZfhZyqVdTZ4AQCtJuLQJ wDIdQCLlGzzDNFvbD+LdfJSjZt7Ig1sM/HwtPZhUA9yF0FN1XV3dcESzCOeI0/S7 ohRh8cs1sidnxrbvVwiVNENSqbJD7G9/9vVjIfyfcnt57q+fs6xCBhpOyNoVOp74 I5i6ALMcVZoAB50vDjnoGZsSRe9W2AmOV6UMIkVCvRCWYFqBpgVftMTAACNyljni UlXmO7aDQj+nbHD/auclFtU02oHQbk62FSrwoWMFS090zWztQqUhgRY7Qnl13yCM poEvrKlskXhvunsNtdVmI5O3N2GANWKgGwkyFIiXvgxKkw1qpUo= =zeN9 -----END PGP SIGNATURE----- Merge 6.1.25 into android14-6.1 Changes in 6.1.25 Revert "pinctrl: amd: Disable and mask interrupts on resume" drm/amd/display: Pass the right info to drm_dp_remove_payload ALSA: emu10k1: fix capture interrupt handler unlinking ALSA: hda/sigmatel: add pin overrides for Intel DP45SG motherboard ALSA: i2c/cs8427: fix iec958 mixer control deactivation ALSA: hda: patch_realtek: add quirk for Asus N7601ZM ALSA: hda/realtek: Add quirks for Lenovo Z13/Z16 Gen2 ALSA: firewire-tascam: add missing unwind goto in snd_tscm_stream_start_duplex() ALSA: emu10k1: don't create old pass-through playback device on Audigy ALSA: hda/sigmatel: fix S/PDIF out on Intel D*45* motherboards ALSA: hda/hdmi: disable KAE for Intel DG2 Bluetooth: L2CAP: Fix use-after-free in l2cap_disconnect_{req,rsp} Bluetooth: Fix race condition in hidp_session_thread bluetooth: btbcm: Fix logic error in forming the board name. Bluetooth: Free potentially unfreed SCO connection Bluetooth: hci_conn: Fix possible UAF btrfs: restore the thread_pool= behavior in remount for the end I/O workqueues btrfs: fix fast csum implementation detection fbmem: Reject FB_ACTIVATE_KD_TEXT from userspace mtdblock: tolerate corrected bit-flips mtd: rawnand: meson: fix bitmask for length in command word mtd: rawnand: stm32_fmc2: remove unsupported EDO mode mtd: rawnand: stm32_fmc2: use timings.mode instead of checking tRC_min KVM: arm64: PMU: Restore the guest's EL0 event counting after migration fbcon: Fix error paths in set_con2fb_map fbcon: set_con2fb_map needs to set con2fb_map! drm/i915/dsi: fix DSS CTL register offsets for TGL+ clk: sprd: set max_register according to mapping range RDMA/irdma: Do not generate SW completions for NOPs RDMA/irdma: Fix memory leak of PBLE objects RDMA/irdma: Increase iWARP CM default rexmit count RDMA/irdma: Add ipv4 check to irdma_find_listener() IB/mlx5: Add support for 400G_8X lane speed RDMA/erdma: Update default EQ depth to 4096 and max_send_wr to 8192 RDMA/erdma: Inline mtt entries into WQE if supported RDMA/erdma: Defer probing if netdevice can not be found clk: rs9: Fix suspend/resume RDMA/cma: Allow UD qp_type to join multicast only bpf: tcp: Use sock_gen_put instead of sock_put in bpf_iter_tcp LoongArch, bpf: Fix jit to skip speculation barrier opcode dmaengine: apple-admac: Handle 'global' interrupt flags dmaengine: apple-admac: Set src_addr_widths capability dmaengine: apple-admac: Fix 'current_tx' not getting freed 9p/xen : Fix use after free bug in xen_9pfs_front_remove due to race condition bpf, arm64: Fixed a BTI error on returning to patched function KVM: arm64: Initialise hypervisor copies of host symbols unconditionally KVM: arm64: Advertise ID_AA64PFR0_EL1.CSV2/3 to protected VMs niu: Fix missing unwind goto in niu_alloc_channels() tcp: restrict net.ipv4.tcp_app_win bonding: fix ns validation on backup slaves iavf: refactor VLAN filter states iavf: remove active_cvlans and active_svlans bitmaps net: openvswitch: fix race on port output Bluetooth: hci_conn: Fix not cleaning up on LE Connection failure Bluetooth: Fix printing errors if LE Connection times out Bluetooth: SCO: Fix possible circular locking dependency sco_sock_getsockopt Bluetooth: Set ISO Data Path on broadcast sink drm/armada: Fix a potential double free in an error handling path qlcnic: check pci_reset_function result net: wwan: iosm: Fix error handling path in ipc_pcie_probe() cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume() sctp: fix a potential overflow in sctp_ifwdtsn_skip RDMA/core: Fix GID entry ref leak when create_ah fails selftests: openvswitch: adjust datapath NL message declaration udp6: fix potential access to stale information net: macb: fix a memory corruption in extended buffer descriptor mode skbuff: Fix a race between coalescing and releasing SKBs libbpf: Fix single-line struct definition output in btf_dump ARM: 9290/1: uaccess: Fix KASAN false-positives ARM: dts: qcom: apq8026-lg-lenok: add missing reserved memory power: supply: rk817: Fix unsigned comparison with less than zero power: supply: cros_usbpd: reclassify "default case!" as debug power: supply: axp288_fuel_gauge: Added check for negative values selftests/bpf: Fix progs/find_vma_fail1.c build error. wifi: mwifiex: mark OF related data as maybe unused i2c: imx-lpi2c: clean rx/tx buffers upon new message i2c: hisi: Avoid redundant interrupts efi: sysfb_efi: Add quirk for Lenovo Yoga Book X91F/L block: ublk_drv: mark device as LIVE before adding disk ACPI: video: Add backlight=native DMI quirk for Acer Aspire 3830TG drm: panel-orientation-quirks: Add quirk for Lenovo Yoga Book X90F hwmon: (peci/cputemp) Fix miscalculated DTS for SKX hwmon: (xgene) Fix ioremap and memremap leak verify_pefile: relax wrapper length check asymmetric_keys: log on fatal failures in PE/pkcs7 nvme: send Identify with CNS 06h only to I/O controllers wifi: iwlwifi: mvm: fix mvmtxq->stopped handling wifi: iwlwifi: mvm: protect TXQ list manipulation drm/amdgpu: add mes resume when do gfx post soft reset drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs drm/amdgpu/gfx: set cg flags to enter/exit safe mode ACPI: resource: Add Medion S17413 to IRQ override quirk x86/hyperv: Move VMCB enlightenment definitions to hyperv-tlfs.h KVM: selftests: Move "struct hv_enlightenments" to x86_64/svm.h KVM: SVM: Add a proper field for Hyper-V VMCB enlightenments x86/hyperv: KVM: Rename "hv_enlightenments" to "hv_vmcb_enlightenments" KVM: SVM: Flush Hyper-V TLB when required tracing: Add trace_array_puts() to write into instance tracing: Have tracing_snapshot_instance_cond() write errors to the appropriate instance maple_tree: fix write memory barrier of nodes once dead for RCU mode ksmbd: avoid out of bounds access in decode_preauth_ctxt() riscv: add icache flush for nommu sigreturn trampoline HID: intel-ish-hid: Fix kernel panic during warm reset net: sfp: initialize sfp->i2c_block_size at sfp allocation net: phy: nxp-c45-tja11xx: add remove callback net: phy: nxp-c45-tja11xx: fix unsigned long multiplication overflow scsi: ses: Handle enclosure with just a primary component gracefully x86/PCI: Add quirk for AMD XHCI controller that loses MSI-X state in D3hot cgroup: fix display of forceidle time at root cgroup/cpuset: Fix partition root's cpuset.cpus update bug cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach() drm/amd/pm: correct SMU13.0.7 pstate profiling clock settings drm/amd/pm: correct SMU13.0.7 max shader clock reporting mptcp: use mptcp_schedule_work instead of open-coding it mptcp: stricter state check in mptcp_worker ubi: Fix failure attaching when vid_hdr offset equals to (sub)page size ubi: Fix deadlock caused by recursively holding work_sem i2c: mchp-pci1xxxx: Update Timing registers powerpc/papr_scm: Update the NUMA distance table for the target node sched/fair: Fix imbalance overflow x86/rtc: Remove __init for runtime functions i2c: ocores: generate stop condition after timeout in polling mode cifs: fix negotiate context parsing nvme-pci: mark Lexar NM760 as IGNORE_DEV_SUBNQN nvme-pci: add NVME_QUIRK_BOGUS_NID for T-FORCE Z330 SSD cgroup/cpuset: Skip spread flags update on v2 cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods Linux 6.1.25 Change-Id: Ib4d2c49ea9bacb8d8dbdb7b3a4eecce937016427 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
|
2039635543 |
Revert "raw: use net_hash_mix() in hash function"
This reverts commit
|
||
|
cc7a00d2d6 |
Revert "raw: Fix NULL deref in raw_get_next()."
This reverts commit
|
||
|
9d7765638f |
tcp: restrict net.ipv4.tcp_app_win
[ Upstream commit dc5110c2d959c1707e12df5f792f41d90614adaa ]
UBSAN: shift-out-of-bounds in net/ipv4/tcp_input.c:555:23
shift exponent 255 is too large for 32-bit type 'int'
CPU: 1 PID: 7907 Comm: ssh Not tainted 6.3.0-rc4-00161-g62bad54b26db-dirty #206
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x136/0x150
__ubsan_handle_shift_out_of_bounds+0x21f/0x5a0
tcp_init_transfer.cold+0x3a/0xb9
tcp_finish_connect+0x1d0/0x620
tcp_rcv_state_process+0xd78/0x4d60
tcp_v4_do_rcv+0x33d/0x9d0
__release_sock+0x133/0x3b0
release_sock+0x58/0x1b0
'maxwin' is int, shifting int for 32 or more bits is undefined behaviour.
Fixes:
|
||
|
db9c9086d3 |
bpf: tcp: Use sock_gen_put instead of sock_put in bpf_iter_tcp
[ Upstream commit 580031ff9952b7dbf48dedba6b56a100ae002bef ]
While reviewing the udp-iter batching patches, noticed the bpf_iter_tcp
calling sock_put() is incorrect. It should call sock_gen_put instead
because bpf_iter_tcp is iterating the ehash table which has the req sk
and tw sk. This patch replaces all sock_put with sock_gen_put in the
bpf_iter_tcp codepath.
Fixes:
|
||
|
5a08a32e62 |
ping: Fix potentail NULL deref for /proc/net/icmp.
[ Upstream commit ab5fb73ffa01072b4d8031cc05801fa1cb653bee ] After commit |
||
|
b34056bedf |
raw: Fix NULL deref in raw_get_next().
[ Upstream commit 0a78cf7264d29abeca098eae0b188a10aabc8a32 ] Dae R. Jeong reported a NULL deref in raw_get_next() [0]. It seems that the repro was running these sequences in parallel so that one thread was iterating on a socket that was being freed in another netns. unshare(0x40060200) r0 = syz_open_procfs(0x0, &(0x7f0000002080)='net/raw\x00') socket$inet_icmp_raw(0x2, 0x3, 0x1) pread64(r0, &(0x7f0000000000)=""/10, 0xa, 0x10000000007f) After commit |
||
|
53a0031217 |
raw: use net_hash_mix() in hash function
[ Upstream commit 6579f5bacc2c4cbc5ef6abb45352416939d1f844 ] Some applications seem to rely on RAW sockets. If they use private netns, we can avoid piling all RAW sockets bound to a given protocol into a single bucket. Also place (struct raw_hashinfo).lock into its own cache line to limit false sharing. Alternative would be to have per-netns hashtables, but this seems too expensive for most netns where RAW sockets are not used. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Stable-dep-of: 0a78cf7264d2 ("raw: Fix NULL deref in raw_get_next().") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
0185e87c69 |
icmp: guard against too small mtu
[ Upstream commit 7d63b67125382ff0ffdfca434acbc94a38bd092b ]
syzbot was able to trigger a panic [1] in icmp_glue_bits(), or
more exactly in skb_copy_and_csum_bits()
There is no repro yet, but I think the issue is that syzbot
manages to lower device mtu to a small value, fooling __icmp_send()
__icmp_send() must make sure there is enough room for the
packet to include at least the headers.
We might in the future refactor skb_copy_and_csum_bits() and its
callers to no longer crash when something bad happens.
[1]
kernel BUG at net/core/skbuff.c:3343 !
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 15766 Comm: syz-executor.0 Not tainted 6.3.0-rc4-syzkaller-00039-gffe78bbd5121 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
RIP: 0010:skb_copy_and_csum_bits+0x798/0x860 net/core/skbuff.c:3343
Code: f0 c1 c8 08 41 89 c6 e9 73 ff ff ff e8 61 48 d4 f9 e9 41 fd ff ff 48 8b 7c 24 48 e8 52 48 d4 f9 e9 c3 fc ff ff e8 c8 27 84 f9 <0f> 0b 48 89 44 24 28 e8 3c 48 d4 f9 48 8b 44 24 28 e9 9d fb ff ff
RSP: 0018:ffffc90000007620 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000000001e8 RCX: 0000000000000100
RDX: ffff8880276f6280 RSI: ffffffff87fdd138 RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
R10: 00000000000001e8 R11: 0000000000000001 R12: 000000000000003c
R13: 0000000000000000 R14: ffff888028244868 R15: 0000000000000b0e
FS: 00007fbc81f1c700(0000) GS:ffff88802ca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2df43000 CR3: 00000000744db000 CR4: 0000000000150ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
icmp_glue_bits+0x7b/0x210 net/ipv4/icmp.c:353
__ip_append_data+0x1d1b/0x39f0 net/ipv4/ip_output.c:1161
ip_append_data net/ipv4/ip_output.c:1343 [inline]
ip_append_data+0x115/0x1a0 net/ipv4/ip_output.c:1322
icmp_push_reply+0xa8/0x440 net/ipv4/icmp.c:370
__icmp_send+0xb80/0x1430 net/ipv4/icmp.c:765
ipv4_send_dest_unreach net/ipv4/route.c:1239 [inline]
ipv4_link_failure+0x5a9/0x9e0 net/ipv4/route.c:1246
dst_link_failure include/net/dst.h:423 [inline]
arp_error_report+0xcb/0x1c0 net/ipv4/arp.c:296
neigh_invalidate+0x20d/0x560 net/core/neighbour.c:1079
neigh_timer_handler+0xc77/0xff0 net/core/neighbour.c:1166
call_timer_fn+0x1a0/0x580 kernel/time/timer.c:1700
expire_timers+0x29b/0x4b0 kernel/time/timer.c:1751
__run_timers kernel/time/timer.c:2022 [inline]
Fixes:
|
||
|
9c7d680368 |
erspan: do not use skb_mac_header() in ndo_start_xmit()
[ Upstream commit 8e50ed774554f93d55426039b27b1e38d7fa64d8 ]
Drivers should not assume skb_mac_header(skb) == skb->data in their
ndo_start_xmit().
Use skb_network_offset() and skb_transport_offset() which
better describe what is needed in erspan_fb_xmit() and
ip6erspan_tunnel_xmit()
syzbot reported:
WARNING: CPU: 0 PID: 5083 at include/linux/skbuff.h:2873 skb_mac_header include/linux/skbuff.h:2873 [inline]
WARNING: CPU: 0 PID: 5083 at include/linux/skbuff.h:2873 ip6erspan_tunnel_xmit+0x1d9c/0x2d90 net/ipv6/ip6_gre.c:962
Modules linked in:
CPU: 0 PID: 5083 Comm: syz-executor406 Not tainted 6.3.0-rc2-syzkaller-00866-gd4671cb96fa3 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
RIP: 0010:skb_mac_header include/linux/skbuff.h:2873 [inline]
RIP: 0010:ip6erspan_tunnel_xmit+0x1d9c/0x2d90 net/ipv6/ip6_gre.c:962
Code: 04 02 41 01 de 84 c0 74 08 3c 03 0f 8e 1c 0a 00 00 45 89 b4 24 c8 00 00 00 c6 85 77 fe ff ff 01 e9 33 e7 ff ff e8 b4 27 a1 f8 <0f> 0b e9 b6 e7 ff ff e8 a8 27 a1 f8 49 8d bf f0 0c 00 00 48 b8 00
RSP: 0018:ffffc90003b2f830 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 000000000000ffff RCX: 0000000000000000
RDX: ffff888021273a80 RSI: ffffffff88e1bd4c RDI: 0000000000000003
RBP: ffffc90003b2f9d8 R08: 0000000000000003 R09: 000000000000ffff
R10: 000000000000ffff R11: 0000000000000000 R12: ffff88802b28da00
R13: 00000000000000d0 R14: ffff88807e25b6d0 R15: ffff888023408000
FS: 0000555556a61300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055e5b11eb6e8 CR3: 0000000027c1b000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__netdev_start_xmit include/linux/netdevice.h:4900 [inline]
netdev_start_xmit include/linux/netdevice.h:4914 [inline]
__dev_direct_xmit+0x504/0x730 net/core/dev.c:4300
dev_direct_xmit include/linux/netdevice.h:3088 [inline]
packet_xmit+0x20a/0x390 net/packet/af_packet.c:285
packet_snd net/packet/af_packet.c:3075 [inline]
packet_sendmsg+0x31a0/0x5150 net/packet/af_packet.c:3107
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg+0xde/0x190 net/socket.c:747
__sys_sendto+0x23a/0x340 net/socket.c:2142
__do_sys_sendto net/socket.c:2154 [inline]
__se_sys_sendto net/socket.c:2150 [inline]
__x64_sys_sendto+0xe1/0x1b0 net/socket.c:2150
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f123aaa1039
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc15d12058 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f123aaa1039
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000020000040 R09: 0000000000000014
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f123aa648c0
R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
Fixes:
|
||
|
1c5642cfa6 |
ipv4: Fix incorrect table ID in IOCTL path
[ Upstream commit 8a2618e14f81604a9b6ad305d57e0c8da939cd65 ] Commit |
||
|
b339c0af83 |
tcp: Fix bind() conflict check for dual-stack wildcard address.
[ Upstream commit d9ba9934285514f1f95d96326a82398a22dc77f2 ] Paul Holzinger reported [0] that commit |
||
|
a69b72b57b |
net: tunnels: annotate lockless accesses to dev->needed_headroom
[ Upstream commit 4b397c06cb987935b1b097336532aa6b4210e091 ]
IP tunnels can apparently update dev->needed_headroom
in their xmit path.
This patch takes care of three tunnels xmit, and also the
core LL_RESERVED_SPACE() and LL_RESERVED_SPACE_EXTRA()
helpers.
More changes might be needed for completeness.
BUG: KCSAN: data-race in ip_tunnel_xmit / ip_tunnel_xmit
read to 0xffff88815b9da0ec of 2 bytes by task 888 on cpu 1:
ip_tunnel_xmit+0x1270/0x1730 net/ipv4/ip_tunnel.c:803
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
write to 0xffff88815b9da0ec of 2 bytes by task 2379 on cpu 0:
ip_tunnel_xmit+0x1294/0x1730 net/ipv4/ip_tunnel.c:804
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip6_finish_output2+0x9bc/0xc50 net/ipv6/ip6_output.c:134
__ip6_finish_output net/ipv6/ip6_output.c:195 [inline]
ip6_finish_output+0x39a/0x4e0 net/ipv6/ip6_output.c:206
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip6_output+0xeb/0x220 net/ipv6/ip6_output.c:227
dst_output include/net/dst.h:444 [inline]
NF_HOOK include/linux/netfilter.h:302 [inline]
mld_sendpack+0x438/0x6a0 net/ipv6/mcast.c:1820
mld_send_cr net/ipv6/mcast.c:2121 [inline]
mld_ifc_work+0x519/0x7b0 net/ipv6/mcast.c:2653
process_one_work+0x3e6/0x750 kernel/workqueue.c:2390
worker_thread+0x5f2/0xa10 kernel/workqueue.c:2537
kthread+0x1ac/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
value changed: 0x0dd4 -> 0x0e14
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 2379 Comm: kworker/0:0 Not tainted 6.3.0-rc1-syzkaller-00002-g8ca09d5fa354-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
Workqueue: mld mld_ifc_work
Fixes:
|
||
|
9180aa4622 |
tcp: tcp_make_synack() can be called from process context
[ Upstream commit bced3f7db95ff2e6ca29dc4d1c9751ab5e736a09 ] tcp_rtx_synack() now could be called in process context as explained in |
||
|
079d37e162 |
netfilter: tproxy: fix deadlock due to missing BH disable
[ Upstream commit 4a02426787bf024dafdb79b362285ee325de3f5e ]
The xtables packet traverser performs an unconditional local_bh_disable(),
but the nf_tables evaluation loop does not.
Functions that are called from either xtables or nftables must assume
that they can be called in process context.
inet_twsk_deschedule_put() assumes that no softirq interrupt can occur.
If tproxy is used from nf_tables its possible that we'll deadlock
trying to aquire a lock already held in process context.
Add a small helper that takes care of this and use it.
Link: https://lore.kernel.org/netfilter-devel/401bd6ed-314a-a196-1cdc-e13c720cc8f2@balasys.hu/
Fixes:
|
||
|
f45cf3ae30 |
bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser()
[ Upstream commit d900f3d20cc3169ce42ec72acc850e662a4d4db2 ] When the buffer length of the recvmsg system call is 0, we got the flollowing soft lockup problem: watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [a.out:6149] CPU: 3 PID: 6149 Comm: a.out Kdump: loaded Not tainted 6.2.0+ #30 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:remove_wait_queue+0xb/0xc0 Code: 5e 41 5f c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 57 <41> 56 41 55 41 54 55 48 89 fd 53 48 89 f3 4c 8d 6b 18 4c 8d 73 20 RSP: 0018:ffff88811b5978b8 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff88811a7d3780 RCX: ffffffffb7a4d768 RDX: dffffc0000000000 RSI: ffff88811b597908 RDI: ffff888115408040 RBP: 1ffff110236b2f1b R08: 0000000000000000 R09: ffff88811a7d37e7 R10: ffffed10234fa6fc R11: 0000000000000001 R12: ffff88811179b800 R13: 0000000000000001 R14: ffff88811a7d38a8 R15: ffff88811a7d37e0 FS: 00007f6fb5398740(0000) GS:ffff888237180000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000000 CR3: 000000010b6ba002 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> tcp_msg_wait_data+0x279/0x2f0 tcp_bpf_recvmsg_parser+0x3c6/0x490 inet_recvmsg+0x280/0x290 sock_recvmsg+0xfc/0x120 ____sys_recvmsg+0x160/0x3d0 ___sys_recvmsg+0xf0/0x180 __sys_recvmsg+0xea/0x1a0 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc The logic in tcp_bpf_recvmsg_parser is as follows: msg_bytes_ready: copied = sk_msg_recvmsg(sk, psock, msg, len, flags); if (!copied) { wait data; goto msg_bytes_ready; } In this case, "copied" always is 0, the infinite loop occurs. According to the Linux system call man page, 0 should be returned in this case. Therefore, in tcp_bpf_recvmsg_parser(), if the length is 0, directly return. Also modify several other functions with the same problem. Fixes: |
||
|
95c131b41f |
tcp: tcp_check_req() can be called from process context
[ Upstream commit 580f98cc33a260bb8c6a39ae2921b29586b84fdf ] This is a follow up of commit |
||
|
512b6c4b83 |
netfilter: x_tables: fix percpu counter block leak on error path when creating new netns
[ Upstream commit 0af8c09c896810879387decfba8c942994bb61f5 ]
Here is the stack where we allocate percpu counter block:
+-< __alloc_percpu
+-< xt_percpu_counter_alloc
+-< find_check_entry # {arp,ip,ip6}_tables.c
+-< translate_table
And it can be leaked on this code path:
+-> ip6t_register_table
+-> translate_table # allocates percpu counter block
+-> xt_register_table # fails
there is no freeing of the counter block on xt_register_table fail.
Note: xt_percpu_counter_free should be called to free it like we do in
do_replace through cleanup_entry helper (or in __ip6t_unregister_table).
Probability of hitting this error path is low AFAICS (xt_register_table
can only return ENOMEM here, as it is not replacing anything, as we are
creating new netns, and it is hard to imagine that all previous
allocations succeeded and after that one in xt_register_table failed).
But it's worth fixing even the rare leak.
Fixes:
|
||
|
3dd6ac9733 |
netfilter: ebtables: fix table blob use-after-free
[ Upstream commit e58a171d35e32e6e8c37cfe0e8a94406732a331f ]
We are not allowed to return an error at this point.
Looking at the code it looks like ret is always 0 at this
point, but its not.
t = find_table_lock(net, repl->name, &ret, &ebt_mutex);
... this can return a valid table, with ret != 0.
This bug causes update of table->private with the new
blob, but then frees the blob right away in the caller.
Syzbot report:
BUG: KASAN: vmalloc-out-of-bounds in __ebt_unregister_table+0xc00/0xcd0 net/bridge/netfilter/ebtables.c:1168
Read of size 4 at addr ffffc90005425000 by task kworker/u4:4/74
Workqueue: netns cleanup_net
Call Trace:
kasan_report+0xbf/0x1f0 mm/kasan/report.c:517
__ebt_unregister_table+0xc00/0xcd0 net/bridge/netfilter/ebtables.c:1168
ebt_unregister_table+0x35/0x40 net/bridge/netfilter/ebtables.c:1372
ops_exit_list+0xb0/0x170 net/core/net_namespace.c:169
cleanup_net+0x4ee/0xb10 net/core/net_namespace.c:613
...
ip(6)tables appears to be ok (ret should be 0 at this point) but make
this more obvious.
Fixes:
|
||
|
6965c92ef4 |
inet: fix fast path in __inet_hash_connect()
[ Upstream commit 21cbd90a6fab7123905386985e3e4a80236b8714 ]
__inet_hash_connect() has a fast path taken if sk_head(&tb->owners) is
equal to the sk parameter.
sk_head() returns the hlist_entry() with respect to the sk_node field.
However entries in the tb->owners list are inserted with respect to the
sk_bind_node field with sk_add_bind_node().
Thus the check would never pass and the fast path never execute.
This fast path has never been executed or tested as this bug seems
to be present since commit
|
||
|
0ae9d81109 |
txhash: fix sk->sk_txrehash default
[ Upstream commit c11204c78d6966c5bda6dd05c3ac5cbb193f93e3 ]
This code fix a bug that sk->sk_txrehash gets its default enable
value from sysctl_txrehash only when the socket is a TCP listener.
We should have sysctl_txrehash to set the default sk->sk_txrehash,
no matter TCP, nor listerner/connector.
Tested by following packetdrill:
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 socket(..., SOCK_DGRAM, IPPROTO_UDP) = 4
// SO_TXREHASH == 74, default to sysctl_txrehash == 1
+0 getsockopt(3, SOL_SOCKET, 74, [1], [4]) = 0
+0 getsockopt(4, SOL_SOCKET, 74, [1], [4]) = 0
Fixes:
|
||
|
5a19095103 |
use less confusing names for iov_iter direction initializers
[ Upstream commit de4eda9de2d957ef2d6a8365a01e26a435e958cb ] READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Stable-dep-of: 6dd88fd59da8 ("vhost-scsi: unbreak any layout for response") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
12b0ec7c69 |
bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listener
[ Upstream commit ddce1e091757d0259107c6c0c7262df201de2b66 ] A listening socket linked to a sockmap has its sk_prot overridden. It points to one of the struct proto variants in tcp_bpf_prots. The variant depends on the socket's family and which sockmap programs are attached. A child socket cloned from a TCP listener initially inherits their sk_prot. But before cloning is finished, we restore the child's proto to the listener's original non-tcp_bpf_prots one. This happens in tcp_create_openreq_child -> tcp_bpf_clone. Today, in tcp_bpf_clone we detect if the child's proto should be restored by checking only for the TCP_BPF_BASE proto variant. This is not correct. The sk_prot of listening socket linked to a sockmap can point to to any variant in tcp_bpf_prots. If the listeners sk_prot happens to be not the TCP_BPF_BASE variant, then the child socket unintentionally is left if the inherited sk_prot by tcp_bpf_clone. This leads to issues like infinite recursion on close [1], because the child state is otherwise not set up for use with tcp_bpf_prot operations. Adjust the check in tcp_bpf_clone to detect all of tcp_bpf_prots variants. Note that it wouldn't be sufficient to check the socket state when overriding the sk_prot in tcp_bpf_update_proto in order to always use the TCP_BPF_BASE variant for listening sockets. Since commit |
||
|
f9753ebd61 |
ipv4: prevent potential spectre v1 gadget in fib_metrics_match()
[ Upstream commit 5e9398a26a92fc402d82ce1f97cc67d832527da0 ]
if (!type)
continue;
if (type > RTAX_MAX)
return false;
...
fi_val = fi->fib_metrics->metrics[type - 1];
@type being used as an array index, we need to prevent
cpu speculation or risk leaking kernel memory content.
Fixes:
|
||
|
6850fe301d |
ipv4: prevent potential spectre v1 gadget in ip_metrics_convert()
[ Upstream commit 1d1d63b612801b3f0a39b7d4467cad0abd60e5c8 ]
if (!type)
continue;
if (type > RTAX_MAX)
return -EINVAL;
...
metrics[type - 1] = val;
@type being used as an array index, we need to prevent
cpu speculation or risk leaking kernel memory content.
Fixes:
|
||
|
dc706d7787 |
tcp: fix rate_app_limited to default to 1
[ Upstream commit 300b655db1b5152d6101bcb6801d50899b20c2d6 ]
The initial default value of 0 for tp->rate_app_limited was incorrect,
since a flow is indeed application-limited until it first sends
data. Fixing the default to be 1 is generally correct but also
specifically will help user-space applications avoid using the initial
tcpi_delivery_rate value of 0 that persists until the connection has
some non-zero bandwidth sample.
Fixes:
|
||
|
6df50d9469 |
tcp: avoid the lookup process failing to get sk in ehash table
[ Upstream commit 3f4ca5fafc08881d7a57daa20449d171f2887043 ]
While one cpu is working on looking up the right socket from ehash
table, another cpu is done deleting the request socket and is about
to add (or is adding) the big socket from the table. It means that
we could miss both of them, even though it has little chance.
Let me draw a call trace map of the server side.
CPU 0 CPU 1
----- -----
tcp_v4_rcv() syn_recv_sock()
inet_ehash_insert()
-> sk_nulls_del_node_init_rcu(osk)
__inet_lookup_established()
-> __sk_nulls_add_node_rcu(sk, list)
Notice that the CPU 0 is receiving the data after the final ack
during 3-way shakehands and CPU 1 is still handling the final ack.
Why could this be a real problem?
This case is happening only when the final ack and the first data
receiving by different CPUs. Then the server receiving data with
ACK flag tries to search one proper established socket from ehash
table, but apparently it fails as my map shows above. After that,
the server fetches a listener socket and then sends a RST because
it finds a ACK flag in the skb (data), which obeys RST definition
in RFC 793.
Besides, Eric pointed out there's one more race condition where it
handles tw socket hashdance. Only by adding to the tail of the list
before deleting the old one can we avoid the race if the reader has
already begun the bucket traversal and it would possibly miss the head.
Many thanks to Eric for great help from beginning to end.
Fixes:
|
||
|
ddb98087bd |
net/ulp: use consistent error code when blocking ULP
commit 8ccc99362b60c6f27bb46f36fdaaccf4ef0303de upstream. The referenced commit changed the error code returned by the kernel when preventing a non-established socket from attaching the ktls ULP. Before to such a commit, the user-space got ENOTCONN instead of EINVAL. The existing self-tests depend on such error code, and the change caused a failure: RUN global.non_established ... tls.c:1673:non_established:Expected errno (22) == ENOTCONN (107) non_established: Test failed at step #3 FAIL global.non_established In the unlikely event existing applications do the same, address the issue by restoring the prior error code in the above scenario. Note that the only other ULP performing similar checks at init time - smc_ulp_ops - also fails with ENOTCONN when trying to attach the ULP to a non-established socket. Reported-by: Sabrina Dubroca <sd@queasysnail.net> Fixes: 2c02d41d71f9 ("net/ulp: prevent ULP without clone op from entering the LISTEN status") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/7bb199e7a93317fb6f8bf8b9b2dc71c18f337cde.1674042685.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
7d242f4a0c |
net/ulp: prevent ULP without clone op from entering the LISTEN status
[ Upstream commit 2c02d41d71f90a5168391b6a5f2954112ba2307c ]
When an ULP-enabled socket enters the LISTEN status, the listener ULP data
pointer is copied inside the child/accepted sockets by sk_clone_lock().
The relevant ULP can take care of de-duplicating the context pointer via
the clone() operation, but only MPTCP and SMC implement such op.
Other ULPs may end-up with a double-free at socket disposal time.
We can't simply clear the ULP data at clone time, as TLS replaces the
socket ops with custom ones assuming a valid TLS ULP context is
available.
Instead completely prevent clone-less ULP sockets from entering the
LISTEN status.
Fixes:
|
||
|
01ca3695f1 |
tcp: Add TIME_WAIT sockets in bhash2.
[ Upstream commit 936a192f974018b4f6040f6f77b1cc1e75bd8666 ] Jiri Slaby reported regression of bind() with a simple repro. [0] The repro creates a TIME_WAIT socket and tries to bind() a new socket with the same local address and port. Before commit |
||
|
01a3015206 |
mptcp: remove MPTCP 'ifdef' in TCP SYN cookies
commit 3fff88186f047627bb128d65155f42517f8e448f upstream. To ease the maintenance, it is often recommended to avoid having #ifdef preprocessor conditions. Here the section related to CONFIG_MPTCP was quite short but the next commit needs to add more code around. It is then cleaner to move specific MPTCP code to functions located in net/mptcp directory. Now that mptcp_subflow_request_sock_ops structure can be static, it can also be marked as "read only after init". Suggested-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
588d0b8462 |
net/tunnel: wait until all sk_user_data reader finish before releasing the sock
[ Upstream commit 3cf7203ca620682165706f70a1b12b5194607dce ]
There is a race condition in vxlan that when deleting a vxlan device
during receiving packets, there is a possibility that the sock is
released after getting vxlan_sock vs from sk_user_data. Then in
later vxlan_ecn_decapsulate(), vxlan_get_sk_family() we will got
NULL pointer dereference. e.g.
#0 [ffffa25ec6978a38] machine_kexec at ffffffff8c669757
#1 [ffffa25ec6978a90] __crash_kexec at ffffffff8c7c0a4d
#2 [ffffa25ec6978b58] crash_kexec at ffffffff8c7c1c48
#3 [ffffa25ec6978b60] oops_end at ffffffff8c627f2b
#4 [ffffa25ec6978b80] page_fault_oops at ffffffff8c678fcb
#5 [ffffa25ec6978bd8] exc_page_fault at ffffffff8d109542
#6 [ffffa25ec6978c00] asm_exc_page_fault at ffffffff8d200b62
[exception RIP: vxlan_ecn_decapsulate+0x3b]
RIP: ffffffffc1014e7b RSP: ffffa25ec6978cb0 RFLAGS: 00010246
RAX: 0000000000000008 RBX: ffff8aa000888000 RCX: 0000000000000000
RDX: 000000000000000e RSI: ffff8a9fc7ab803e RDI: ffff8a9fd1168700
RBP: ffff8a9fc7ab803e R8: 0000000000700000 R9: 00000000000010ae
R10: ffff8a9fcb748980 R11: 0000000000000000 R12: ffff8a9fd1168700
R13: ffff8aa000888000 R14: 00000000002a0000 R15: 00000000000010ae
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffa25ec6978ce8] vxlan_rcv at ffffffffc10189cd [vxlan]
#8 [ffffa25ec6978d90] udp_queue_rcv_one_skb at ffffffff8cfb6507
#9 [ffffa25ec6978dc0] udp_unicast_rcv_skb at ffffffff8cfb6e45
#10 [ffffa25ec6978dc8] __udp4_lib_rcv at ffffffff8cfb8807
#11 [ffffa25ec6978e20] ip_protocol_deliver_rcu at ffffffff8cf76951
#12 [ffffa25ec6978e48] ip_local_deliver at ffffffff8cf76bde
#13 [ffffa25ec6978ea0] __netif_receive_skb_one_core at ffffffff8cecde9b
#14 [ffffa25ec6978ec8] process_backlog at ffffffff8cece139
#15 [ffffa25ec6978f00] __napi_poll at ffffffff8ceced1a
#16 [ffffa25ec6978f28] net_rx_action at ffffffff8cecf1f3
#17 [ffffa25ec6978fa0] __softirqentry_text_start at ffffffff8d4000ca
#18 [ffffa25ec6978ff0] do_softirq at ffffffff8c6fbdc3
Reproducer: https://github.com/Mellanox/ovs-tests/blob/master/test-ovs-vxlan-remove-tunnel-during-traffic.sh
Fix this by waiting for all sk_user_data reader to finish before
releasing the sock.
Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Fixes:
|
||
|
26a844a91e |
bpf, sockmap: Fix data loss caused by using apply_bytes on ingress redirect
[ Upstream commit 9072931f020bfd907d6d89ee21ff1481cd78b407 ]
Use apply_bytes on ingress redirect, when apply_bytes is less than
the length of msg data, some data may be skipped and lost in
bpf_tcp_ingress().
If there is still data in the scatterlist that has not been consumed,
we cannot move the msg iter.
Fixes:
|
||
|
e45de3007a |
bpf, sockmap: Fix missing BPF_F_INGRESS flag when using apply_bytes
[ Upstream commit a351d6087bf7d3d8440d58d3bf244ec64b89394a ]
When redirecting, we use sk_msg_to_ingress() to get the BPF_F_INGRESS
flag from the msg->flags. If apply_bytes is used and it is larger than
the current data being processed, sk_psock_msg_verdict() will not be
called when sendmsg() is called again. At this time, the msg->flags is 0,
and we lost the BPF_F_INGRESS flag.
So we need to save the BPF_F_INGRESS flag in sk_psock and use it when
redirection.
Fixes:
|
||
|
113236e8f4 |
bpf, sockmap: Fix repeated calls to sock_put() when msg has more_data
[ Upstream commit 7a9841ca025275b5b0edfb0b618934abb6ceec15 ]
In tcp_bpf_send_verdict() redirection, the eval variable is assigned to
__SK_REDIRECT after the apply_bytes data is sent, if msg has more_data,
sock_put() will be called multiple times.
We should reset the eval variable to __SK_NONE every time more_data
starts.
This causes:
IPv4: Attempt to release TCP socket in state 1 00000000b4c925d7
------------[ cut here ]------------
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 5 PID: 4482 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0x110
Modules linked in:
CPU: 5 PID: 4482 Comm: sockhash_bypass Kdump: loaded Not tainted 6.0.0 #1
Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
Call Trace:
<TASK>
__tcp_transmit_skb+0xa1b/0xb90
? __alloc_skb+0x8c/0x1a0
? __kmalloc_node_track_caller+0x184/0x320
tcp_write_xmit+0x22a/0x1110
__tcp_push_pending_frames+0x32/0xf0
do_tcp_sendpages+0x62d/0x640
tcp_bpf_push+0xae/0x2c0
tcp_bpf_sendmsg_redir+0x260/0x410
? preempt_count_add+0x70/0xa0
tcp_bpf_send_verdict+0x386/0x4b0
tcp_bpf_sendmsg+0x21b/0x3b0
sock_sendmsg+0x58/0x70
__sys_sendto+0xfa/0x170
? xfd_validate_state+0x1d/0x80
? switch_fpu_return+0x59/0xe0
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x37/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes:
|
||
|
d24fd986ae |
net: Return errno in sk->sk_prot->get_port().
[ Upstream commit 7a7160edf1bfde25422262fb26851cef65f695d3 ]
We assume the correct errno is -EADDRINUSE when sk->sk_prot->get_port()
fails, so some ->get_port() functions return just 1 on failure and the
callers return -EADDRINUSE instead.
However, mptcp_get_port() can return -EINVAL. Let's not ignore the error.
Note the only exception is inet_autobind(), all of whose callers return
-EAGAIN instead.
Fixes:
|
||
|
073b50de84 |
udp: Clean up some functions.
[ Upstream commit 919dfa0b20ae56060dce0436eb710717f8987d18 ] This patch adds no functional change and cleans up some functions that the following patches touch around so that we make them tidy and easy to review/revert. The change is mainly to keep reverse christmas tree order. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 7a7160edf1bf ("net: Return errno in sk->sk_prot->get_port().") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
c0d999348e |
ipv4: Fix incorrect route flushing when table ID 0 is used
Cited commit added the table ID to the FIB info structure, but did not
properly initialize it when table ID 0 is used. This can lead to a route
in the default VRF with a preferred source address not being flushed
when the address is deleted.
Consider the following example:
# ip address add dev dummy1 192.0.2.1/28
# ip address add dev dummy1 192.0.2.17/28
# ip route add 198.51.100.0/24 via 192.0.2.2 src 192.0.2.17 metric 100
# ip route add table 0 198.51.100.0/24 via 192.0.2.2 src 192.0.2.17 metric 200
# ip route show 198.51.100.0/24
198.51.100.0/24 via 192.0.2.2 dev dummy1 src 192.0.2.17 metric 100
198.51.100.0/24 via 192.0.2.2 dev dummy1 src 192.0.2.17 metric 200
Both routes are installed in the default VRF, but they are using two
different FIB info structures. One with a metric of 100 and table ID of
254 (main) and one with a metric of 200 and table ID of 0. Therefore,
when the preferred source address is deleted from the default VRF,
the second route is not flushed:
# ip address del dev dummy1 192.0.2.17/28
# ip route show 198.51.100.0/24
198.51.100.0/24 via 192.0.2.2 dev dummy1 src 192.0.2.17 metric 200
Fix by storing a table ID of 254 instead of 0 in the route configuration
structure.
Add a test case that fails before the fix:
# ./fib_tests.sh -t ipv4_del_addr
IPv4 delete address route tests
Regular FIB info
TEST: Route removed from VRF when source address deleted [ OK ]
TEST: Route in default VRF not removed [ OK ]
TEST: Route removed in default VRF when source address deleted [ OK ]
TEST: Route in VRF is not removed by address delete [ OK ]
Identical FIB info with different table ID
TEST: Route removed from VRF when source address deleted [ OK ]
TEST: Route in default VRF not removed [ OK ]
TEST: Route removed in default VRF when source address deleted [ OK ]
TEST: Route in VRF is not removed by address delete [ OK ]
Table ID 0
TEST: Route removed in default VRF when source address deleted [FAIL]
Tests passed: 8
Tests failed: 1
And passes after:
# ./fib_tests.sh -t ipv4_del_addr
IPv4 delete address route tests
Regular FIB info
TEST: Route removed from VRF when source address deleted [ OK ]
TEST: Route in default VRF not removed [ OK ]
TEST: Route removed in default VRF when source address deleted [ OK ]
TEST: Route in VRF is not removed by address delete [ OK ]
Identical FIB info with different table ID
TEST: Route removed from VRF when source address deleted [ OK ]
TEST: Route in default VRF not removed [ OK ]
TEST: Route removed in default VRF when source address deleted [ OK ]
TEST: Route in VRF is not removed by address delete [ OK ]
Table ID 0
TEST: Route removed in default VRF when source address deleted [ OK ]
Tests passed: 9
Tests failed: 0
Fixes:
|
||
|
f96a3d7455 |
ipv4: Fix incorrect route flushing when source address is deleted
Cited commit added the table ID to the FIB info structure, but did not
prevent structures with different table IDs from being consolidated.
This can lead to routes being flushed from a VRF when an address is
deleted from a different VRF.
Fix by taking the table ID into account when looking for a matching FIB
info. This is already done for FIB info structures backed by a nexthop
object in fib_find_info_nh().
Add test cases that fail before the fix:
# ./fib_tests.sh -t ipv4_del_addr
IPv4 delete address route tests
Regular FIB info
TEST: Route removed from VRF when source address deleted [ OK ]
TEST: Route in default VRF not removed [ OK ]
TEST: Route removed in default VRF when source address deleted [ OK ]
TEST: Route in VRF is not removed by address delete [ OK ]
Identical FIB info with different table ID
TEST: Route removed from VRF when source address deleted [FAIL]
TEST: Route in default VRF not removed [ OK ]
RTNETLINK answers: File exists
TEST: Route removed in default VRF when source address deleted [ OK ]
TEST: Route in VRF is not removed by address delete [FAIL]
Tests passed: 6
Tests failed: 2
And pass after:
# ./fib_tests.sh -t ipv4_del_addr
IPv4 delete address route tests
Regular FIB info
TEST: Route removed from VRF when source address deleted [ OK ]
TEST: Route in default VRF not removed [ OK ]
TEST: Route removed in default VRF when source address deleted [ OK ]
TEST: Route in VRF is not removed by address delete [ OK ]
Identical FIB info with different table ID
TEST: Route removed from VRF when source address deleted [ OK ]
TEST: Route in default VRF not removed [ OK ]
TEST: Route removed in default VRF when source address deleted [ OK ]
TEST: Route in VRF is not removed by address delete [ OK ]
Tests passed: 8
Tests failed: 0
Fixes:
|
||
|
ee496694b9 |
ip_gre: do not report erspan version on GRE interface
Although the type I ERSPAN is based on the barebones IP + GRE
encapsulation and no extra ERSPAN header. Report erspan version on GRE
interface looks unreasonable. Fix this by separating the erspan and gre
fill info.
IPv6 GRE does not have this info as IPv6 only supports erspan version
1 and 2.
Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes:
|
||
|
c25b7a7a56 |
inet: ping: use hlist_nulls rcu iterator during lookup
ping_lookup() does not acquire the table spinlock, so iteration should
use hlist_nulls_for_each_entry_rcu().
Spotted during code review.
Fixes:
|
||
|
d5082d386e |
ipv4: Fix route deletion when nexthop info is not specified
When the kernel receives a route deletion request from user space it tries to delete a route that matches the route attributes specified in the request. If only prefix information is specified in the request, the kernel should delete the first matching FIB alias regardless of its associated FIB info. However, an error is currently returned when the FIB info is backed by a nexthop object: # ip nexthop add id 1 via 192.0.2.2 dev dummy10 # ip route add 198.51.100.0/24 nhid 1 # ip route del 198.51.100.0/24 RTNETLINK answers: No such process Fix by matching on such a FIB info when legacy nexthop attributes are not specified in the request. An earlier check already covers the case where a nexthop ID is specified in the request. Add tests that cover these flows. Before the fix: # ./fib_nexthops.sh -t ipv4_fcnal ... TEST: Delete route when not specifying nexthop attributes [FAIL] Tests passed: 11 Tests failed: 1 After the fix: # ./fib_nexthops.sh -t ipv4_fcnal ... TEST: Delete route when not specifying nexthop attributes [ OK ] Tests passed: 12 Tests failed: 0 No regressions in other tests: # ./fib_nexthops.sh ... Tests passed: 228 Tests failed: 0 # ./fib_tests.sh ... Tests passed: 186 Tests failed: 0 Cc: stable@vger.kernel.org Reported-by: Jonas Gorski <jonas.gorski@gmail.com> Tested-by: Jonas Gorski <jonas.gorski@gmail.com> Fixes: |
||
|
06ccc8ec70 |
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says: ==================== ipsec 2022-11-23 1) Fix "disable_policy" on ipv4 early demuxP Packets after the initial packet in a flow might be incorectly dropped on early demux if there are no matching policies. From Eyal Birger. 2) Fix a kernel warning in case XFRM encap type is not available. From Eyal Birger. 3) Fix ESN wrap around for GSO to avoid a double usage of a sequence number. From Christian Langrock. 4) Fix a send_acquire race with pfkey_register. From Herbert Xu. 5) Fix a list corruption panic in __xfrm_state_delete(). Thomas Jarosch. 6) Fix an unchecked return value in xfrm6_init(). Chen Zhongjin. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Fix ignored return value in xfrm6_init() xfrm: Fix oops in __xfrm_state_delete() af_key: Fix send_acquire race with pfkey_register xfrm: replay: Fix ESN wrap around for GSO xfrm: lwtunnel: squelch kernel warning in case XFRM encap type is not available xfrm: fix "disable_policy" on ipv4 early demux ==================== Link: https://lore.kernel.org/r/20221123093117.434274-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
568fe84940 |
ipv4: Fix error return code in fib_table_insert()
In fib_table_insert(), if the alias was already inserted, but node not
exist, the error code should be set before return from error handling path.
Fixes:
|
||
|
e0833d1fed |
dccp/tcp: Fixup bhash2 bucket when connect() fails.
If a socket bound to a wildcard address fails to connect(), we
only reset saddr and keep the port. Then, we have to fix up the
bhash2 bucket; otherwise, the bucket has an inconsistent address
in the list.
Also, listen() for such a socket will fire the WARN_ON() in
inet_csk_get_port(). [0]
Note that when a system runs out of memory, we give up fixing the
bucket and unlink sk from bhash and bhash2 by inet_put_port().
[0]:
WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
Modules linked in:
CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
FS: 00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
inet_listen (net/ipv4/af_inet.c:228)
__sys_listen (net/socket.c:1810)
__x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
RIP: 0033:0x7f8ac051de5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
</TASK>
Fixes:
|
||
|
8c5dae4c1a |
dccp/tcp: Update saddr under bhash's lock.
When we call connect() for a socket bound to a wildcard address, we update saddr locklessly. However, it could result in a data race; another thread iterating over bhash might see a corrupted address. Let's update saddr under the bhash bucket's lock. Fixes: |
||
|
8acdad37cd |
dccp/tcp: Remove NULL check for prev_saddr in inet_bhash2_update_saddr().
When we call inet_bhash2_update_saddr(), prev_saddr is always non-NULL. Let's remove the unnecessary test. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
77934dc6db |
dccp/tcp: Reset saddr on failure after inet6?_hash_connect().
When connect() is called on a socket bound to the wildcard address, we change the socket's saddr to a local address. If the socket fails to connect() to the destination, we have to reset the saddr. However, when an error occurs after inet_hash6?_connect() in (dccp|tcp)_v[46]_conect(), we forget to reset saddr and leave the socket bound to the address. From the user's point of view, whether saddr is reset or not varies with errno. Let's fix this inconsistent behaviour. Note that after this patch, the repro [0] will trigger the WARN_ON() in inet_csk_get_port() again, but this patch is not buggy and rather fixes a bug papering over the bhash2's bug for which we need another fix. For the record, the repro causes -EADDRNOTAVAIL in inet_hash6_connect() by this sequence: s1 = socket() s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s1.bind(('127.0.0.1', 10000)) s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000))) # or s1.connect(('127.0.0.1', 10000)) s2 = socket() s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s2.bind(('0.0.0.0', 10000)) s2.connect(('127.0.0.1', 10000)) # -EADDRNOTAVAIL s2.listen(32) # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2); [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09 Fixes: |
||
|
764f848589 |
ipv4/fib: Replace zero-length array with DECLARE_FLEX_ARRAY() helper
Zero-length arrays are deprecated[1] and are being replaced with flexible array members in support of the ongoing efforts to tighten the FORTIFY_SOURCE routines on memcpy(), correctly instrument array indexing with UBSAN_BOUNDS, and to globally enable -fstrict-flex-arrays=3. Replace zero-length array with flexible-array member in struct key_vector. This results in no differences in binary output. [1] https://github.com/KSPP/linux/issues/78 Cc: Jakub Kicinski <kuba@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: David Ahern <dsahern@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
52d1aa8b82 |
netfilter: conntrack: Fix data-races around ct mark
nf_conn:mark can be read from and written to in parallel. Use
READ_ONCE()/WRITE_ONCE() for reads and writes to prevent unwanted
compiler optimizations.
Fixes:
|
||
|
aeac4ec8f4 |
tcp: configurable source port perturb table size
On embedded systems with little memory and no relevant
security concerns, it is beneficial to reduce the size
of the table.
Reducing the size from 2^16 to 2^8 saves 255 KiB
of kernel RAM.
Makes the table size configurable as an expert option.
The size was previously increased from 2^8 to 2^16
in commit
|
||
|
0c175da7b0 |
tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent
If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
of TCPOPT_SACK_PERM is called to enable sack after data is sent
and dupacks are received , it will trigger a warning in function
tcp_verify_left_out() as follows:
============================================
WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
tcp_timeout_mark_lost+0x154/0x160
tcp_enter_loss+0x2b/0x290
tcp_retransmit_timer+0x50b/0x640
tcp_write_timer_handler+0x1c8/0x340
tcp_write_timer+0xe5/0x140
call_timer_fn+0x3a/0x1b0
__run_timers.part.0+0x1bf/0x2d0
run_timer_softirq+0x43/0xb0
__do_softirq+0xfd/0x373
__irq_exit_rcu+0xf6/0x140
The warning is caused in the following steps:
1. a socket named socketA is created
2. socketA enters repair mode without build a connection
3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
directly
4. socketA leaves repair mode
5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup
ack receives) increase
6. socketA enters repair mode again
7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
increases
9. sack_outs + lost_out > packets_out triggers since lost_out and
sack_outs increase repeatly
In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
Step7 not happen and the warning will not be triggered. As suggested by
Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
already sent.
socket-tcp tests in CRIU has been tested as follows:
$ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \
--ignore-taint
socket-tcp* represent all socket-tcp tests in test/zdtm/static/.
Fixes:
|
||
|
8ec95b9471 |
bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues
When running `test_sockmap` selftests, the following warning appears: WARNING: CPU: 2 PID: 197 at net/core/stream.c:205 sk_stream_kill_queues+0xd3/0xf0 Call Trace: <TASK> inet_csk_destroy_sock+0x55/0x110 tcp_rcv_state_process+0xd28/0x1380 ? tcp_v4_do_rcv+0x77/0x2c0 tcp_v4_do_rcv+0x77/0x2c0 __release_sock+0x106/0x130 __tcp_close+0x1a7/0x4e0 tcp_close+0x20/0x70 inet_release+0x3c/0x80 __sock_release+0x3a/0xb0 sock_close+0x14/0x20 __fput+0xa3/0x260 task_work_run+0x59/0xb0 exit_to_user_mode_prepare+0x1b3/0x1c0 syscall_exit_to_user_mode+0x19/0x50 do_syscall_64+0x48/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae The root case is in commit |
||
|
71b7786ea4 |
net: also flag accepted sockets supporting msghdr originated zerocopy
Without this only the client initiated tcp sockets have SOCK_SUPPORT_ZC.
The listening socket on the server also has it, but the accepted
connections didn't, which meant IORING_OP_SEND[MSG]_ZC will always
fails with -EOPNOTSUPP.
Fixes:
|
||
|
e276d62dcf |
net/ulp: remove SOCK_SUPPORT_ZC from tls sockets
Remove SOCK_SUPPORT_ZC when we're setting ulp as it might not support
msghdr::ubuf_info, e.g. like TLS replacing ->sk_prot with a new set of
handlers.
Cc: <stable@vger.kernel.org> # 6.0
Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes:
|
||
|
fee9ac0664 |
net: remove SOCK_SUPPORT_ZC from sockmap
sockmap replaces ->sk_prot with its own callbacks, we should remove
SOCK_SUPPORT_ZC as the new proto doesn't support msghdr::ubuf_info.
Cc: <stable@vger.kernel.org> # 6.0
Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes:
|
||
|
bac0f937c3 |
nh: fix scope used to find saddr when adding non gw nh
As explained by Julian, fib_nh_scope is related to fib_nh_gw4, but fib_info_update_nhc_saddr() needs the scope of the route, which is the scope "before" fib_nh_scope, ie fib_nh_scope - 1. This patch fixes the problem described in commit |
||
|
e021c329ee |
Revert "ip: fix dflt addr selection for connected nexthop"
This reverts commit
|
||
|
745b913a59 |
Revert "ip: fix triggering of 'icmp redirect'"
This reverts commit
|
||
|
337a0a0b63 |
Including fixes from bpf.
Current release - regressions: - eth: fman: re-expose location of the MAC address to userspace, apparently some udev scripts depended on the exact value Current release - new code bugs: - bpf: - wait for busy refill_work when destroying bpf memory allocator - allow bpf_user_ringbuf_drain() callbacks to return 1 - fix dispatcher patchable function entry to 5 bytes nop Previous releases - regressions: - net-memcg: avoid stalls when under memory pressure - tcp: fix indefinite deferral of RTO with SACK reneging - tipc: fix a null-ptr-deref in tipc_topsrv_accept - eth: macb: specify PHY PM management done by MAC - tcp: fix a signed-integer-overflow bug in tcp_add_backlog() Previous releases - always broken: - eth: amd-xgbe: SFP fixes and compatibility improvements Misc: - docs: netdev: offer performance feedback to contributors Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmNW024ACgkQMUZtbf5S IrvX7w//SP/zKZwgzC13zd2rrCP16TX2QvkHPmSLvcldQDXdCypmsoc5Vb8UNpkG jwAuy2pxqPy2oxTwTBQv9TNRT2oqEFOsFTK+w410whlL7g1wZ02aXU8qFhV2XumW o4gRtM+UISPUKFbOnawdK1XlrNdeLF3bjETvW2GP9zxCb0iqoQXtDDNKxv2B2iQA MSyTtzHA4n9GS7LKGtPgsP2Ose7h1Z+AjTIpQH1nvfEHJUf/wmxUdCK+fuwfeLjY PhmYaPG/333j1bfBk1Ms/nUYA5KRXlEj9A/7jDtxhxNEwaTNKyLB19a6oVxXxpSQ x/k+nZP1RColn5xeco5a1X9aHHQ46PJQ8wVAmxYDIeIA5XPMgShNmhAyjrq1ac+o 9vYeYpmnMGSTLdBMvGbWpynWHe7SddgF8LkbnYf2HLKbxe4bgkOnmxOUH4q9iinZ MfVSknjax4DP0C7X1kGgR6WyltWnkrahOdUkINsIUNxj0KxJa/eStpJIbJrfkxNV gHbOjB2/bF3SXENrS4A0IJCgsbO9YugN83Eyu0WDWQOw9wVgopzxOJx9R+H0wkVH XpGGP8qi1DZiTE3iQiq1LHj6f6kirFmtt9QFH5yzaqtKBaqXakHaXwUO4VtD+BI9 NPFKvFL6jrp8EAn0PTM/RrvhJZN+V0bFXiyiMe0TLx+aR0UMxGc= =dD6N -----END PGP SIGNATURE----- Merge tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf. The net-memcg fix stands out, the rest is very run-off-the-mill. Maybe I'm biased. Current release - regressions: - eth: fman: re-expose location of the MAC address to userspace, apparently some udev scripts depended on the exact value Current release - new code bugs: - bpf: - wait for busy refill_work when destroying bpf memory allocator - allow bpf_user_ringbuf_drain() callbacks to return 1 - fix dispatcher patchable function entry to 5 bytes nop Previous releases - regressions: - net-memcg: avoid stalls when under memory pressure - tcp: fix indefinite deferral of RTO with SACK reneging - tipc: fix a null-ptr-deref in tipc_topsrv_accept - eth: macb: specify PHY PM management done by MAC - tcp: fix a signed-integer-overflow bug in tcp_add_backlog() Previous releases - always broken: - eth: amd-xgbe: SFP fixes and compatibility improvements Misc: - docs: netdev: offer performance feedback to contributors" * tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits) net-memcg: avoid stalls when under memory pressure tcp: fix indefinite deferral of RTO with SACK reneging tcp: fix a signed-integer-overflow bug in tcp_add_backlog() net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed docs: netdev: offer performance feedback to contributors kcm: annotate data-races around kcm->rx_wait kcm: annotate data-races around kcm->rx_psock net: fman: Use physical address for userspace interfaces net/mlx5e: Cleanup MACsec uninitialization routine atlantic: fix deadlock at aq_nic_stop nfp: only clean `sp_indiff` when application firmware is unloaded amd-xgbe: add the bit rate quirk for Molex cables amd-xgbe: fix the SFP compliance codes check for DAC cables amd-xgbe: enable PLL_CTL for fixed PHY modes only amd-xgbe: use enums for mailbox cmd and sub_cmds amd-xgbe: Yellow carp devices do not need rrc bpf: Use __llist_del_all() whenever possbile during memory draining bpf: Wait for busy refill_work when destroying bpf memory allocator MAINTAINERS: add keyword match on PTP ... |
||
|
3d2af9cce3 |
tcp: fix indefinite deferral of RTO with SACK reneging
This commit fixes a bug that can cause a TCP data sender to repeatedly
defer RTOs when encountering SACK reneging.
The bug is that when we're in fast recovery in a scenario with SACK
reneging, every time we get an ACK we call tcp_check_sack_reneging()
and it can note the apparent SACK reneging and rearm the RTO timer for
srtt/2 into the future. In some SACK reneging scenarios that can
happen repeatedly until the receive window fills up, at which point
the sender can't send any more, the ACKs stop arriving, and the RTO
fires at srtt/2 after the last ACK. But that can take far too long
(O(10 secs)), since the connection is stuck in fast recovery with a
low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
available.
This fix changes the logic in tcp_check_sack_reneging() to only rearm
the RTO timer if data is cumulatively ACKed, indicating forward
progress. This avoids this kind of nearly infinite loop of RTO timer
re-arming. In addition, this meets the goals of
tcp_check_sack_reneging() in handling Windows TCP behavior that looks
temporarily like SACK reneging but is not really.
Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
and provided critical packet traces that enabled root-causing this
issue. Also, many thanks to Jakub Kicinski for testing this fix.
Fixes:
|
||
|
ec791d8149 |
tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and
in tcp_add_backlog(), the variable limit is caculated by adding
sk_rcvbuf, sk_sndbuf and 64 * 1024, it may exceed the max value
of int and overflow. This patch reduces the limit budget by
halving the sndbuf to solve this issue since ACK packets are much
smaller than the payload.
Fixes:
|
||
|
942e01ab90 |
io_uring-6.1-2022-10-22
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmNUFz4QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqrSEAC+jhEaIB4srOr5DMta/CxBoKwiZIcMmsaK pzRMFSTKWWtsx3COGjT0vwzm4VuZsZztE+A6buYs5riEDsI0l5TJiZ0fOqi9N0nB orehq+7T2Cn5E848bMzLo7tSmlibAionrOQde5PbtmDcltuKddu9TiXNzD6XufLB dLWbTRLdfxuFW0c8DYwng+KNBocXad64gu3ADuxKVGkWDs9tfOvaFWE/NgoXsEoq a7uaBCF+DBWYhHWk0WPOA2+BLyNMN+g7owX1GWqW/Sr48CQDJSw5YnpKCi8+jZdb uHreUIH96w/2A7CFfNCOfx5MhYCrX/j9ik6mDt2B8Gbh6vg3LlMADSo6xXSPag0r 7Lu7AVr7Sko6NKU2x/pCzv/U85TFuvuqSfH+YFK3rdEouZk26o5PfJEugAtD3gKv smAk+ATmgx/iye5a2Uq7ClVVdOcdQULrC15/8XdcG7eI+l2q3AbgTa53PdBw3oF7 S+ANKMP5kPkPe1wDxFR0g3v7vsZmmfahRuss3xWC+PnHZPFZPQFRIohjWSsu1Exl Ztri7Xy/ypC7bZ5F1pch1AjiLfLCGzpmKjT4QAy/mSFAJVboRDb0PTwN1w7uVCBQ qK8TIw2iVKjEeIps/CedO+nQQrxhOKcizxLIPTyfaT6ZOJrbHalcQheEJqpWMnrF w7dYkDnjYw== =tTNe -----END PGP SIGNATURE----- Merge tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux Pull io_uring follow-up from Jens Axboe: "Currently the zero-copy has automatic fallback to normal transmit, and it was decided that it'd be cleaner to return an error instead if the socket type doesn't support it. Zero-copy does work with UDP and TCP, it's more of a future proofing kind of thing (eg for samba)" * tag 'io_uring-6.1-2022-10-22' of git://git.kernel.dk/linux: io_uring/net: fail zc sendmsg when unsupported by socket io_uring/net: fail zc send when unsupported by socket net: flag sockets supporting msghdr originated zerocopy |
||
|
e993ffe3da |
net: flag sockets supporting msghdr originated zerocopy
We need an efficient way in io_uring to check whether a socket supports zerocopy with msghdr provided ubuf_info. Add a new flag into the struct socket flags fields. Cc: <stable@vger.kernel.org> # 6.0 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/3dafafab822b1c66308bb58a0ac738b1e3f53f74.1666346426.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
6d36c728bc |
Networking fixes for 6.1-rc2, including fixes from netfilter
Current release - regressions: - revert "net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}" - revert "net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init()" - dsa: uninitialized variable in dsa_slave_netdevice_event() - eth: sunhme: uninitialized variable in happy_meal_init() Current release - new code bugs: - eth: octeontx2: fix resource not freed after malloc Previous releases - regressions: - sched: fix return value of qdisc ingress handling on success - sched: fix race condition in qdisc_graft() - udp: update reuse->has_conns under reuseport_lock. - tls: strp: make sure the TCP skbs do not have overlapping data - hsr: avoid possible NULL deref in skb_clone() - tipc: fix an information leak in tipc_topsrv_kern_subscr - phylink: add mac_managed_pm in phylink_config structure - eth: i40e: fix DMA mappings leak - eth: hyperv: fix a RX-path warning - eth: mtk: fix memory leaks Previous releases - always broken: - sched: cake: fix null pointer access issue when cake_init() fails -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmNRKdcSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkYn8P/31xjE9/BRQKVGQOMxj78vhQvVHEZYXJ OJcaLcjxUCqj6hu3pEcuf88PeTicyfEqN32zzH1k+SS8jGQCmoVtXUfbZ7pDR6Tc rsAqVhLD6JnYkGEgtzm3i+8EfSeBoCy9kT4JZzRxQmOfZr1rBmtoMHOB4cGk9g1K lSF3KJcKT1GDacB/gVei+ms0Y+Q9WULOg3OFuyLSeltAkhZKaTfx/qqsLLEHFqZc u6eR31GwG28Y4GVurLQSOdaWrFOKqmPFOpzvjmeKC2RBqS6hVl4/YKZTmTV53Lee brm6kuVlU7CJVZEN2qF8G2+/SqLgcB0o26JVnml1kT8n0GlFAbyFf5akawjT8/Je G/zgz6k6wUAI2g3nSPNmgqVtobsypthzWL/bOpWfGfJFXxGOgLG3pbZYIl816Tha KnibZqQOBHxfPaUzh0xCLhidoi5G0T8ip9o9tyKlnmvbKY/EWk6HiIjCWlxnkPiO GRdHkyF7KMxqo/QuE9AK3LnD/AsLeWcuqzMveiaTbYLkMjGDW1yz3otL7KXW5l8U zYoUn1HQLkNqDE17+PjRo28awOMyN6ujXggKBK/hfXVnPpdW3yWPUoslqQdVn5KC 3PLeSNM1v4UQSMWx1alRx1PvA+zYDX4GQpSSXbQgYGVim8LMwZQ1xui2qWF8xEau k9ZKfMSUNGEr =y2As -----END PGP SIGNATURE----- Merge tag 'net-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Current release - regressions: - revert "net: fix cpu_max_bits_warn() usage in netif_attrmask_next{,_and}" - revert "net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init()" - dsa: uninitialized variable in dsa_slave_netdevice_event() - eth: sunhme: uninitialized variable in happy_meal_init() Current release - new code bugs: - eth: octeontx2: fix resource not freed after malloc Previous releases - regressions: - sched: fix return value of qdisc ingress handling on success - sched: fix race condition in qdisc_graft() - udp: update reuse->has_conns under reuseport_lock. - tls: strp: make sure the TCP skbs do not have overlapping data - hsr: avoid possible NULL deref in skb_clone() - tipc: fix an information leak in tipc_topsrv_kern_subscr - phylink: add mac_managed_pm in phylink_config structure - eth: i40e: fix DMA mappings leak - eth: hyperv: fix a RX-path warning - eth: mtk: fix memory leaks Previous releases - always broken: - sched: cake: fix null pointer access issue when cake_init() fails" * tag 'net-6.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (43 commits) net: phy: dp83822: disable MDI crossover status change interrupt net: sched: fix race condition in qdisc_graft() net: hns: fix possible memory leak in hnae_ae_register() wwan_hwsim: fix possible memory leak in wwan_hwsim_dev_new() sfc: include vport_id in filter spec hash and equal() genetlink: fix kdoc warnings selftests: add selftest for chaining of tc ingress handling to egress net: Fix return value of qdisc ingress handling on success net: sched: sfb: fix null pointer access issue when sfb_init() fails Revert "net: sched: fq_codel: remove redundant resource cleanup in fq_codel_init()" net: sched: cake: fix null pointer access issue when cake_init() fails ethernet: marvell: octeontx2 Fix resource not freed after malloc netfilter: nf_tables: relax NFTA_SET_ELEM_KEY_END set flags requirements netfilter: rpfilter/fib: Set ->flowic_uid correctly for user namespaces. ionic: catch NULL pointer issue on reconfig net: hsr: avoid possible NULL deref in skb_clone() bnxt_en: fix memory leak in bnxt_nvm_test() ip6mr: fix UAF issue in ip6mr_sk_done() when addrconf_init_net() failed udp: Update reuse->has_conns under reuseport_lock. net: ethernet: mediatek: ppe: Remove the unused function mtk_foe_entry_usable() ... |
||
|
4b549ccce9 |
xfrm: replay: Fix ESN wrap around for GSO
When using GSO it can happen that the wrong seq_hi is used for the last
packets before the wrap around. This can lead to double usage of a
sequence number. To avoid this, we should serialize this last GSO
packet.
Fixes:
|
||
|
1fcc064b30 |
netfilter: rpfilter/fib: Set ->flowic_uid correctly for user namespaces.
Currently netfilter's rpfilter and fib modules implicitely initialise ->flowic_uid with 0. This is normally the root UID. However, this isn't the case in user namespaces, where user ID 0 is mapped to a different kernel UID. By initialising ->flowic_uid with sock_net_uid(), we get the root UID of the user namespace, thus keeping the same behaviour whether or not we're running in a user namepspace. Note, this is similar to commit |
||
|
69421bf984 |
udp: Update reuse->has_conns under reuseport_lock.
When we call connect() for a UDP socket in a reuseport group, we have
to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel
could select a unconnected socket wrongly for packets sent to the
connected socket.
However, the current way to set has_conns is illegal and possible to
trigger that problem. reuseport_has_conns() changes has_conns under
rcu_read_lock(), which upgrades the RCU reader to the updater. Then,
it must do the update under the updater's lock, reuseport_lock, but
it doesn't for now.
For this reason, there is a race below where we fail to set has_conns
resulting in the wrong socket selection. To avoid the race, let's split
the reader and updater with proper locking.
cpu1 cpu2
+----+ +----+
__ip[46]_datagram_connect() reuseport_grow()
. .
|- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size)
| . |
| |- rcu_read_lock()
| |- reuse = rcu_dereference(sk->sk_reuseport_cb)
| |
| | | /* reuse->has_conns == 0 here */
| | |- more_reuse->has_conns = reuse->has_conns
| |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */
| | |
| | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
| | | more_reuse)
| `- rcu_read_unlock() `- kfree_rcu(reuse, rcu)
|
|- sk->sk_state = TCP_ESTABLISHED
Note the likely(reuse) in reuseport_has_conns_set() is always true,
but we put the test there for ease of review. [0]
For the record, usually, sk_reuseport_cb is changed under lock_sock().
The only exception is reuseport_grow() & TCP reqsk migration case.
1) shutdown() TCP listener, which is moved into the latter part of
reuse->socks[] to migrate reqsk.
2) New listen() overflows reuse->socks[] and call reuseport_grow().
3) reuse->max_socks overflows u16 with the new listener.
4) reuseport_grow() pops the old shutdown()ed listener from the array
and update its sk->sk_reuseport_cb as NULL without lock_sock().
shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(),
but, reuseport_has_conns_set() is called only for UDP under lock_sock(),
so likely(reuse) never be false in reuseport_has_conns_set().
[0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/
Fixes:
|
||
|
f1947d7c8a |
Random number generator fixes for Linux 6.1-rc1.
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmNHYD0ACgkQSfxwEqXe A655AA//dJK0PdRghqrKQsl18GOCffV5TUw5i1VbJQbI9d8anfxNjVUQiNGZi4et qUwZ8OqVXxYx1Z1UDgUE39PjEDSG9/cCvOpMUWqN20/+6955WlNZjwA7Fk6zjvlM R30fz5CIJns9RFvGT4SwKqbVLXIMvfg/wDENUN+8sxt36+VD2gGol7J2JJdngEhM lW+zqzi0ABqYy5so4TU2kixpKmpC08rqFvQbD1GPid+50+JsOiIqftDErt9Eg1Mg MqYivoFCvbAlxxxRh3+UHBd7ZpJLtp1UFEOl2Rf00OXO+ZclLCAQAsTczucIWK9M 8LCZjb7d4lPJv9RpXFAl3R1xvfc+Uy2ga5KeXvufZtc5G3aMUKPuIU7k28ZyblVS XXsXEYhjTSd0tgi3d0JlValrIreSuj0z2QGT5pVcC9utuAqAqRIlosiPmgPlzXjr Us4jXaUhOIPKI+Musv/fqrxsTQziT0jgVA3Njlt4cuAGm/EeUbLUkMWwKXjZLTsv vDsBhEQFmyZqxWu4pYo534VX2mQWTaKRV1SUVVhQEHm57b00EAiZohoOvweB09SR 4KiJapikoopmW4oAUFotUXUL1PM6yi+MXguTuc1SEYuLz/tCFtK8DJVwNpfnWZpE lZKvXyJnHq2Sgod/hEZq58PMvT6aNzTzSg7YzZy+VabxQGOO5mc= =M+mV -----END PGP SIGNATURE----- Merge tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull more random number generator updates from Jason Donenfeld: "This time with some large scale treewide cleanups. The intent of this pull is to clean up the way callers fetch random integers. The current rules for doing this right are: - If you want a secure or an insecure random u64, use get_random_u64() - If you want a secure or an insecure random u32, use get_random_u32() The old function prandom_u32() has been deprecated for a while now and is just a wrapper around get_random_u32(). Same for get_random_int(). - If you want a secure or an insecure random u16, use get_random_u16() - If you want a secure or an insecure random u8, use get_random_u8() - If you want secure or insecure random bytes, use get_random_bytes(). The old function prandom_bytes() has been deprecated for a while now and has long been a wrapper around get_random_bytes() - If you want a non-uniform random u32, u16, or u8 bounded by a certain open interval maximum, use prandom_u32_max() I say "non-uniform", because it doesn't do any rejection sampling or divisions. Hence, it stays within the prandom_*() namespace, not the get_random_*() namespace. I'm currently investigating a "uniform" function for 6.2. We'll see what comes of that. By applying these rules uniformly, we get several benefits: - By using prandom_u32_max() with an upper-bound that the compiler can prove at compile-time is ≤65536 or ≤256, internally get_random_u16() or get_random_u8() is used, which wastes fewer batched random bytes, and hence has higher throughput. - By using prandom_u32_max() instead of %, when the upper-bound is not a constant, division is still avoided, because prandom_u32_max() uses a faster multiplication-based trick instead. - By using get_random_u16() or get_random_u8() in cases where the return value is intended to indeed be a u16 or a u8, we waste fewer batched random bytes, and hence have higher throughput. This series was originally done by hand while I was on an airplane without Internet. Later, Kees and I worked on retroactively figuring out what could be done with Coccinelle and what had to be done manually, and then we split things up based on that. So while this touches a lot of files, the actual amount of code that's hand fiddled is comfortably small" * tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: prandom: remove unused functions treewide: use get_random_bytes() when possible treewide: use get_random_u32() when possible treewide: use get_random_{u8,u16}() when possible, part 2 treewide: use get_random_{u8,u16}() when possible, part 1 treewide: use prandom_u32_max() when possible, part 2 treewide: use prandom_u32_max() when possible, part 1 |
||
|
740ea3c4a0 |
tcp: Clean up kernel listener's reqsk in inet_twsk_purge()
Eric Dumazet reported a use-after-free related to the per-netns ehash
series. [0]
When we create a TCP socket from userspace, the socket always holds a
refcnt of the netns. This guarantees that a reqsk timer is always fired
before netns dismantle. Each reqsk has a refcnt of its listener, so the
listener is not freed before the reqsk, and the net is not freed before
the listener as well.
OTOH, when in-kernel users create a TCP socket, it might not hold a refcnt
of its netns. Thus, a reqsk timer can be fired after the netns dismantle
and access freed per-netns ehash.
To avoid the use-after-free, we need to clean up TCP_NEW_SYN_RECV sockets
in inet_twsk_purge() if the netns uses a per-netns ehash.
[0]: https://lore.kernel.org/netdev/CANn89iLXMup0dRD_Ov79Xt8N9FM0XdhCHEN05sf3eLwxKweM6w@mail.gmail.com/
BUG: KASAN: use-after-free in tcp_or_dccp_get_hashinfo
include/net/inet_hashtables.h:181 [inline]
BUG: KASAN: use-after-free in reqsk_queue_unlink+0x320/0x350
net/ipv4/inet_connection_sock.c:913
Read of size 8 at addr ffff88807545bd80 by task syz-executor.2/8301
CPU: 1 PID: 8301 Comm: syz-executor.2 Not tainted
6.0.0-syzkaller-02757-gaf7d23f9d96a #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 09/22/2022
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:317 [inline]
print_report.cold+0x2ba/0x719 mm/kasan/report.c:433
kasan_report+0xb1/0x1e0 mm/kasan/report.c:495
tcp_or_dccp_get_hashinfo include/net/inet_hashtables.h:181 [inline]
reqsk_queue_unlink+0x320/0x350 net/ipv4/inet_connection_sock.c:913
inet_csk_reqsk_queue_drop net/ipv4/inet_connection_sock.c:927 [inline]
inet_csk_reqsk_queue_drop_and_put net/ipv4/inet_connection_sock.c:939 [inline]
reqsk_timer_handler+0x724/0x1160 net/ipv4/inet_connection_sock.c:1053
call_timer_fn+0x1a0/0x6b0 kernel/time/timer.c:1474
expire_timers kernel/time/timer.c:1519 [inline]
__run_timers.part.0+0x674/0xa80 kernel/time/timer.c:1790
__run_timers kernel/time/timer.c:1768 [inline]
run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1803
__do_softirq+0x1d0/0x9c8 kernel/softirq.c:571
invoke_softirq kernel/softirq.c:445 [inline]
__irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1107
</IRQ>
Fixes:
|
||
|
f49cd2f4d6 |
tcp: Fix data races around icsk->icsk_af_ops.
setsockopt(IPV6_ADDRFORM) and tcp_v6_connect() change icsk->icsk_af_ops
under lock_sock(), but tcp_(get|set)sockopt() read it locklessly. To
avoid load/store tearing, we need to add READ_ONCE() and WRITE_ONCE()
for the reads and writes.
Thanks to Eric Dumazet for providing the syzbot report:
BUG: KCSAN: data-race in tcp_setsockopt / tcp_v6_connect
write to 0xffff88813c624518 of 8 bytes by task 23936 on cpu 0:
tcp_v6_connect+0x5b3/0xce0 net/ipv6/tcp_ipv6.c:240
__inet_stream_connect+0x159/0x6d0 net/ipv4/af_inet.c:660
inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:724
__sys_connect_file net/socket.c:1976 [inline]
__sys_connect+0x197/0x1b0 net/socket.c:1993
__do_sys_connect net/socket.c:2003 [inline]
__se_sys_connect net/socket.c:2000 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:2000
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff88813c624518 of 8 bytes by task 23937 on cpu 1:
tcp_setsockopt+0x147/0x1c80 net/ipv4/tcp.c:3789
sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3585
__sys_setsockopt+0x212/0x2b0 net/socket.c:2252
__do_sys_setsockopt net/socket.c:2263 [inline]
__se_sys_setsockopt net/socket.c:2260 [inline]
__x64_sys_setsockopt+0x62/0x70 net/socket.c:2260
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0xffffffff8539af68 -> 0xffffffff8539aff8
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 23937 Comm: syz-executor.5 Not tainted
6.0.0-rc4-syzkaller-00331-g4ed9c1e971b1-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 08/26/2022
Fixes:
|
||
|
364f997b5c |
ipv6: Fix data races around sk->sk_prot.
Commit |
||
|
d38afeec26 |
tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct().
Originally, inet6_sk(sk)->XXX were changed under lock_sock(), so we were able to clean them up by calling inet6_destroy_sock() during the IPv6 -> IPv4 conversion by IPV6_ADDRFORM. However, commit |
||
|
acc641ab95 |
netfilter: rpfilter/fib: Populate flowic_l3mdev field
Use the introduced field for correct operation with VRF devices instead
of conditionally overwriting flowic_oif. This is a partial revert of
commit
|
||
|
3a5913183a |
xfrm: fix "disable_policy" on ipv4 early demux
The commit in the "Fixes" tag tried to avoid a case where policy check
is ignored due to dst caching in next hops.
However, when the traffic is locally consumed, the dst may be cached
in a local TCP or UDP socket as part of early demux. In this case the
"disable_policy" flag is not checked as ip_route_input_noref() was only
called before caching, and thus, packets after the initial packet in a
flow will be dropped if not matching policies.
Fix by checking the "disable_policy" flag also when a valid dst is
already available.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216557
Reported-by: Monil Patel <monil191989@gmail.com>
Fixes:
|
||
|
72e560cb8c |
tcp: cdg: allow tcp_cdg_release() to be called multiple times
Apparently, mptcp is able to call tcp_disconnect() on an already
disconnected flow. This is generally fine, unless current congestion
control is CDG, because it might trigger a double-free [1]
Instead of fixing MPTCP, and future bugs, we can make tcp_disconnect()
more resilient.
[1]
BUG: KASAN: double-free in slab_free mm/slub.c:3539 [inline]
BUG: KASAN: double-free in kfree+0xe2/0x580 mm/slub.c:4567
CPU: 0 PID: 3645 Comm: kworker/0:7 Not tainted 6.0.0-syzkaller-02734-g0326074ff465 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
Workqueue: events mptcp_worker
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:317 [inline]
print_report.cold+0x2ba/0x719 mm/kasan/report.c:433
kasan_report_invalid_free+0x81/0x190 mm/kasan/report.c:462
____kasan_slab_free+0x18b/0x1c0 mm/kasan/common.c:356
kasan_slab_free include/linux/kasan.h:200 [inline]
slab_free_hook mm/slub.c:1759 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1785
slab_free mm/slub.c:3539 [inline]
kfree+0xe2/0x580 mm/slub.c:4567
tcp_disconnect+0x980/0x1e20 net/ipv4/tcp.c:3145
__mptcp_close_ssk+0x5ca/0x7e0 net/mptcp/protocol.c:2327
mptcp_do_fastclose net/mptcp/protocol.c:2592 [inline]
mptcp_worker+0x78c/0xff0 net/mptcp/protocol.c:2627
process_one_work+0x991/0x1610 kernel/workqueue.c:2289
worker_thread+0x665/0x1080 kernel/workqueue.c:2436
kthread+0x2e4/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
</TASK>
Allocated by task 3671:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:45 [inline]
set_alloc_info mm/kasan/common.c:437 [inline]
____kasan_kmalloc mm/kasan/common.c:516 [inline]
____kasan_kmalloc mm/kasan/common.c:475 [inline]
__kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:525
kmalloc_array include/linux/slab.h:640 [inline]
kcalloc include/linux/slab.h:671 [inline]
tcp_cdg_init+0x10d/0x170 net/ipv4/tcp_cdg.c:380
tcp_init_congestion_control+0xab/0x550 net/ipv4/tcp_cong.c:193
tcp_reinit_congestion_control net/ipv4/tcp_cong.c:217 [inline]
tcp_set_congestion_control+0x96c/0xaa0 net/ipv4/tcp_cong.c:391
do_tcp_setsockopt+0x505/0x2320 net/ipv4/tcp.c:3513
tcp_setsockopt+0xd4/0x100 net/ipv4/tcp.c:3801
mptcp_setsockopt+0x35f/0x2570 net/mptcp/sockopt.c:844
__sys_setsockopt+0x2d6/0x690 net/socket.c:2252
__do_sys_setsockopt net/socket.c:2263 [inline]
__se_sys_setsockopt net/socket.c:2260 [inline]
__x64_sys_setsockopt+0xba/0x150 net/socket.c:2260
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 16:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:45
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:367 [inline]
____kasan_slab_free+0x166/0x1c0 mm/kasan/common.c:329
kasan_slab_free include/linux/kasan.h:200 [inline]
slab_free_hook mm/slub.c:1759 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1785
slab_free mm/slub.c:3539 [inline]
kfree+0xe2/0x580 mm/slub.c:4567
tcp_cleanup_congestion_control+0x70/0x120 net/ipv4/tcp_cong.c:226
tcp_v4_destroy_sock+0xdd/0x750 net/ipv4/tcp_ipv4.c:2254
tcp_v6_destroy_sock+0x11/0x20 net/ipv6/tcp_ipv6.c:1969
inet_csk_destroy_sock+0x196/0x440 net/ipv4/inet_connection_sock.c:1157
tcp_done+0x23b/0x340 net/ipv4/tcp.c:4649
tcp_rcv_state_process+0x40e7/0x4990 net/ipv4/tcp_input.c:6624
tcp_v6_do_rcv+0x3fc/0x13c0 net/ipv6/tcp_ipv6.c:1525
tcp_v6_rcv+0x2e8e/0x3830 net/ipv6/tcp_ipv6.c:1759
ip6_protocol_deliver_rcu+0x2db/0x1950 net/ipv6/ip6_input.c:439
ip6_input_finish+0x14c/0x2c0 net/ipv6/ip6_input.c:484
NF_HOOK include/linux/netfilter.h:302 [inline]
NF_HOOK include/linux/netfilter.h:296 [inline]
ip6_input+0x9c/0xd0 net/ipv6/ip6_input.c:493
dst_input include/net/dst.h:455 [inline]
ip6_rcv_finish+0x193/0x2c0 net/ipv6/ip6_input.c:79
ip_sabotage_in net/bridge/br_netfilter_hooks.c:874 [inline]
ip_sabotage_in+0x1fa/0x260 net/bridge/br_netfilter_hooks.c:865
nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline]
nf_hook_slow+0xc5/0x1f0 net/netfilter/core.c:614
nf_hook.constprop.0+0x3ac/0x650 include/linux/netfilter.h:257
NF_HOOK include/linux/netfilter.h:300 [inline]
ipv6_rcv+0x9e/0x380 net/ipv6/ip6_input.c:309
__netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5485
__netif_receive_skb+0x1f/0x1c0 net/core/dev.c:5599
netif_receive_skb_internal net/core/dev.c:5685 [inline]
netif_receive_skb+0x12f/0x8d0 net/core/dev.c:5744
NF_HOOK include/linux/netfilter.h:302 [inline]
NF_HOOK include/linux/netfilter.h:296 [inline]
br_pass_frame_up+0x303/0x410 net/bridge/br_input.c:68
br_handle_frame_finish+0x909/0x1aa0 net/bridge/br_input.c:199
br_nf_hook_thresh+0x2f8/0x3d0 net/bridge/br_netfilter_hooks.c:1041
br_nf_pre_routing_finish_ipv6+0x695/0xef0 net/bridge/br_netfilter_ipv6.c:207
NF_HOOK include/linux/netfilter.h:302 [inline]
br_nf_pre_routing_ipv6+0x417/0x7c0 net/bridge/br_netfilter_ipv6.c:237
br_nf_pre_routing+0x1496/0x1fe0 net/bridge/br_netfilter_hooks.c:507
nf_hook_entry_hookfn include/linux/netfilter.h:142 [inline]
nf_hook_bridge_pre net/bridge/br_input.c:255 [inline]
br_handle_frame+0x9c9/0x12d0 net/bridge/br_input.c:399
__netif_receive_skb_core+0x9fe/0x38f0 net/core/dev.c:5379
__netif_receive_skb_one_core+0xae/0x180 net/core/dev.c:5483
__netif_receive_skb+0x1f/0x1c0 net/core/dev.c:5599
process_backlog+0x3a0/0x7c0 net/core/dev.c:5927
__napi_poll+0xb3/0x6d0 net/core/dev.c:6494
napi_poll net/core/dev.c:6561 [inline]
net_rx_action+0x9c1/0xd90 net/core/dev.c:6672
__do_softirq+0x1d0/0x9c8 kernel/softirq.c:571
Fixes:
|
||
|
0d24148bd2 |
inet: ping: fix recent breakage
Blamed commit broke the assumption used by ping sendmsg() that
allocated skb would have MAX_HEADER bytes in skb->head.
This patch changes the way ping works, by making sure
the skb head contains space for the icmp header,
and adjusting ping_getfrag() which was desperate
about going past the icmp header :/
This is adopting what UDP does, mostly.
syzbot is able to crash a host using both kfence and following repro in a loop.
fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_ICMPV6)
connect(fd, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0),
inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28
sendmsg(fd, {msg_name=NULL, msg_namelen=0, msg_iov=[
{iov_base="\200\0\0\0\23\0\0\0\0\0\0\0\0\0"..., iov_len=65496}],
msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0
When kfence triggers, skb->head only has 64 bytes, immediately followed
by struct skb_shared_info (no extra headroom based on ksize(ptr))
Then icmpv6_push_pending_frames() is overwriting first bytes
of skb_shinfo(skb), making nr_frags bigger than MAX_SKB_FRAGS,
and/or setting shinfo->gso_size to a non zero value.
If nr_frags is mangled, a crash happens in skb_release_data()
If gso_size is mangled, we have the following report:
lo: caps=(0x00000516401d7c69, 0x00000516401d7c69)
WARNING: CPU: 0 PID: 7548 at net/core/dev.c:3239 skb_warn_bad_offload+0x119/0x230 net/core/dev.c:3239
Modules linked in:
CPU: 0 PID: 7548 Comm: syz-executor268 Not tainted 6.0.0-syzkaller-02754-g557f050166e5 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
RIP: 0010:skb_warn_bad_offload+0x119/0x230 net/core/dev.c:3239
Code: 70 03 00 00 e8 58 c3 24 fa 4c 8d a5 e8 00 00 00 e8 4c c3 24 fa 4c 89 e9 4c 89 e2 4c 89 f6 48 c7 c7 00 53 f5 8a e8 13 ac e7 01 <0f> 0b 5b 5d 41 5c 41 5d 41 5e e9 28 c3 24 fa e8 23 c3 24 fa 48 89
RSP: 0018:ffffc9000366f3e8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88807a9d9d00 RCX: 0000000000000000
RDX: ffff8880780c0000 RSI: ffffffff8160f6f8 RDI: fffff520006cde6f
RBP: ffff888079952000 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000400 R11: 0000000000000000 R12: ffff8880799520e8
R13: ffff88807a9da070 R14: ffff888079952000 R15: 0000000000000000
FS: 0000555556be6300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020010000 CR3: 000000006eb7b000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
gso_features_check net/core/dev.c:3521 [inline]
netif_skb_features+0x83e/0xb90 net/core/dev.c:3554
validate_xmit_skb+0x2b/0xf10 net/core/dev.c:3659
__dev_queue_xmit+0x998/0x3ad0 net/core/dev.c:4248
dev_queue_xmit include/linux/netdevice.h:3008 [inline]
neigh_hh_output include/net/neighbour.h:530 [inline]
neigh_output include/net/neighbour.h:544 [inline]
ip6_finish_output2+0xf97/0x1520 net/ipv6/ip6_output.c:134
__ip6_finish_output net/ipv6/ip6_output.c:195 [inline]
ip6_finish_output+0x690/0x1160 net/ipv6/ip6_output.c:206
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip6_output+0x1ed/0x540 net/ipv6/ip6_output.c:227
dst_output include/net/dst.h:445 [inline]
ip6_local_out+0xaf/0x1a0 net/ipv6/output_core.c:161
ip6_send_skb+0xb7/0x340 net/ipv6/ip6_output.c:1966
ip6_push_pending_frames+0xdd/0x100 net/ipv6/ip6_output.c:1986
icmpv6_push_pending_frames+0x2af/0x490 net/ipv6/icmp.c:303
ping_v6_sendmsg+0xc44/0x1190 net/ipv6/ping.c:190
inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:734
____sys_sendmsg+0x712/0x8c0 net/socket.c:2482
___sys_sendmsg+0x110/0x1b0 net/socket.c:2536
__sys_sendmsg+0xf3/0x1c0 net/socket.c:2565
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f21aab42b89
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 41 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff1729d038 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f21aab42b89
RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
RBP: 0000000000000000 R08: 000000000000000d R09: 000000000000000d
R10: 000000000000000d R11: 0000000000000246 R12: 00007fff1729d050
R13: 00000000000f4240 R14: 0000000000021dd1 R15: 00007fff1729d044
</TASK>
Fixes:
|
||
|
87445f369c |
ipv6: ping: fix wrong checksum for large frames
For a given ping datagram, ping_getfrag() is called once
per skb fragment.
A large datagram requiring more than one page fragment
is currently getting the checksum of the last fragment,
instead of the cumulative one.
After this patch, "ping -s 35000 ::1" is working correctly.
Fixes:
|
||
|
197173db99 |
treewide: use get_random_bytes() when possible
The prandom_bytes() function has been a deprecated inline wrapper around get_random_bytes() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> # powerpc Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
||
|
a251c17aa5 |
treewide: use get_random_u32() when possible
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
||
|
7e3cf0843f |
treewide: use get_random_{u8,u16}() when possible, part 1
Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value, simply use the get_random_{u8,u16}() functions, which are faster than wasting the additional bytes from a 32-bit value. This was done mechanically with this coccinelle script: @@ expression E; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u16; typedef __be16; typedef __le16; typedef u8; @@ ( - (get_random_u32() & 0xffff) + get_random_u16() | - (get_random_u32() & 0xff) + get_random_u8() | - (get_random_u32() % 65536) + get_random_u16() | - (get_random_u32() % 256) + get_random_u8() | - (get_random_u32() >> 16) + get_random_u16() | - (get_random_u32() >> 24) + get_random_u8() | - (u16)get_random_u32() + get_random_u16() | - (u8)get_random_u32() + get_random_u8() | - (__be16)get_random_u32() + (__be16)get_random_u16() | - (__le16)get_random_u32() + (__le16)get_random_u16() | - prandom_u32_max(65536) + get_random_u16() | - prandom_u32_max(256) + get_random_u8() | - E->inet_id = get_random_u32() + E->inet_id = get_random_u16() ) @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u16; identifier v; @@ - u16 v = get_random_u32(); + u16 v = get_random_u16(); @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u8; identifier v; @@ - u8 v = get_random_u32(); + u8 v = get_random_u8(); @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u16; u16 v; @@ - v = get_random_u32(); + v = get_random_u16(); @@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u8; u8 v; @@ - v = get_random_u32(); + v = get_random_u8(); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Examine limits @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value < 256: coccinelle.RESULT = cocci.make_ident("get_random_u8") elif value < 65536: coccinelle.RESULT = cocci.make_ident("get_random_u16") else: print("Skipping large mask of %s" % (literal)) cocci.include_match(False) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; identifier add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + (RESULT() & LITERAL) Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
||
|
81895a65ec |
treewide: use prandom_u32_max() when possible, part 1
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
||
|
61b91eb33a |
ipv4: Handle attempt to delete multipath route when fib_info contains an nh reference
Gwangun Jung reported a slab-out-of-bounds access in fib_nh_match: fib_nh_match+0xf98/0x1130 linux-6.0-rc7/net/ipv4/fib_semantics.c:961 fib_table_delete+0x5f3/0xa40 linux-6.0-rc7/net/ipv4/fib_trie.c:1753 inet_rtm_delroute+0x2b3/0x380 linux-6.0-rc7/net/ipv4/fib_frontend.c:874 Separate nexthop objects are mutually exclusive with the legacy multipath spec. Fix fib_nh_match to return if the config for the to be deleted route contains a multipath spec while the fib_info is using a nexthop object. Fixes: |
||
|
e52f7c1ddf |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in the left-over fixes before the net-next pull-request. Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c |
||
|
2a4187f440 |
once: rename _SLOW to _SLEEPABLE
The _SLOW designation wasn't really descriptive of anything. This is
meant to be called from process context when it's possible to sleep. So
name this more aptly _SLEEPABLE, which better fits its intended use.
Fixes:
|
||
|
a08d97a193 |
Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-10-03 We've added 143 non-merge commits during the last 27 day(s) which contain a total of 151 files changed, 8321 insertions(+), 1402 deletions(-). The main changes are: 1) Add kfuncs for PKCS#7 signature verification from BPF programs, from Roberto Sassu. 2) Add support for struct-based arguments for trampoline based BPF programs, from Yonghong Song. 3) Fix entry IP for kprobe-multi and trampoline probes under IBT enabled, from Jiri Olsa. 4) Batch of improvements to veristat selftest tool in particular to add CSV output, a comparison mode for CSV outputs and filtering, from Andrii Nakryiko. 5) Add preparatory changes needed for the BPF core for upcoming BPF HID support, from Benjamin Tissoires. 6) Support for direct writes to nf_conn's mark field from tc and XDP BPF program types, from Daniel Xu. 7) Initial batch of documentation improvements for BPF insn set spec, from Dave Thaler. 8) Add a new BPF_MAP_TYPE_USER_RINGBUF map which provides single-user-space-producer / single-kernel-consumer semantics for BPF ring buffer, from David Vernet. 9) Follow-up fixes to BPF allocator under RT to always use raw spinlock for the BPF hashtab's bucket lock, from Hou Tao. 10) Allow creating an iterator that loops through only the resources of one task/thread instead of all, from Kui-Feng Lee. 11) Add support for kptrs in the per-CPU arraymap, from Kumar Kartikeya Dwivedi. 12) Add a new kfunc helper for nf to set src/dst NAT IP/port in a newly allocated CT entry which is not yet inserted, from Lorenzo Bianconi. 13) Remove invalid recursion check for struct_ops for TCP congestion control BPF programs, from Martin KaFai Lau. 14) Fix W^X issue with BPF trampoline and BPF dispatcher, from Song Liu. 15) Fix percpu_counter leakage in BPF hashtab allocation error path, from Tetsuo Handa. 16) Various cleanups in BPF selftests to use preferred ASSERT_* macros, from Wang Yufen. 17) Add invocation for cgroup/connect{4,6} BPF programs for ICMP pings, from YiFei Zhu. 18) Lift blinding decision under bpf_jit_harden = 1 to bpf_capable(), from Yauheni Kaliuta. 19) Various libbpf fixes and cleanups including a libbpf NULL pointer deref, from Xin Liu. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (143 commits) net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c Documentation: bpf: Add implementation notes documentations to table of contents bpf, docs: Delete misformatted table. selftests/xsk: Fix double free bpftool: Fix error message of strerror libbpf: Fix overrun in netlink attribute iteration selftests/bpf: Fix spelling mistake "unpriviledged" -> "unprivileged" samples/bpf: Fix typo in xdp_router_ipv4 sample bpftool: Remove unused struct event_ring_info bpftool: Remove unused struct btf_attach_point bpf, docs: Add TOC and fix formatting. bpf, docs: Add Clang note about BPF_ALU bpf, docs: Move Clang notes to a separate file bpf, docs: Linux byteswap note bpf, docs: Move legacy packet instructions to a separate file selftests/bpf: Check -EBUSY for the recurred bpf_setsockopt(TCP_CONGESTION) bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself bpf: Refactor bpf_setsockopt(TCP_CONGESTION) handling into another function bpf: Move the "cdg" tcp-cc check to the common sol_tcp_sockopt() bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline ... ==================== Link: https://lore.kernel.org/r/20221003194915.11847-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
62c07983be |
once: add DO_ONCE_SLOW() for sleepable contexts
Christophe Leroy reported a ~80ms latency spike happening at first TCP connect() time. This is because __inet_hash_connect() uses get_random_once() to populate a perturbation table which became quite big after commit |
||
|
5eddb24901 |
gro: add support of (hw)gro packets to gro stack
Current GRO stack only supports incoming packets containing one frame/MSS. This patch changes GRO to accept packets that are already GRO. HW-GRO (aka RSC for some vendors) is very often limited in presence of interleaved packets. Linux SW GRO stack can complete the job and provide larger GRO packets, thus reducing rate of ACK packets and cpu overhead. This also means BIG TCP can still be used, even if HW-GRO/RSC was able to cook ~64 KB GRO packets. v2: fix logic in tcp_gro_receive() Only support TCP for the moment (Paolo) Co-Developed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
b86fca800a |
net: Add helper function to parse netlink msg of ip_tunnel_parm
Add ip_tunnel_netlink_parms to parse netlink msg of ip_tunnel_parm. Reduces duplicate code, no actual functional changes. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
537dd2d9fb |
net: Add helper function to parse netlink msg of ip_tunnel_encap
Add ip_tunnel_netlink_encap_parms to parse netlink msg of ip_tunnel_encap. Reduces duplicate code, no actual functional changes. Signed-off-by: Liu Jian <liujian56@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
42e8e6d906 |
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says: ==================== 1) Refactor selftests to use an array of structs in xfrm_fill_key(). From Gautam Menghani. 2) Drop an unused argument from xfrm_policy_match. From Hongbin Wang. 3) Support collect metadata mode for xfrm interfaces. From Eyal Birger. 4) Add netlink extack support to xfrm. From Sabrina Dubroca. Please note, there is a merge conflict in: include/net/dst_metadata.h between commit: |
||
|
0bafedc536 |
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says: ==================== pull request (net): ipsec 2022-09-29 1) Use the inner instead of the outer protocol for GSO on inter address family tunnels. This fixes the GSO case for address family tunnels. From Sabrina Dubroca. 2) Reset ipcomp_scratches with NULL when freed, otherwise it holds obsolete address. From Khalid Masum. 3) Reinject transport-mode packets through workqueue instead of a tasklet. The tasklet might take too long to finish. From Liu Jian. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
f4ce91ce12 |
tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited
This commit fixes a bug in the tracking of max_packets_out and
is_cwnd_limited. This bug can cause the connection to fail to remember
that is_cwnd_limited is true, causing the connection to fail to grow
cwnd when it should, causing throughput to be lower than it should be.
The following event sequence is an example that triggers the bug:
(a) The connection is cwnd_limited, but packets_out is not at its
peak due to TSO deferral deciding not to send another skb yet.
In such cases the connection can advance max_packets_seq and set
tp->is_cwnd_limited to true and max_packets_out to a small
number.
(b) Then later in the round trip the connection is pacing-limited (not
cwnd-limited), and packets_out is larger. In such cases the
connection would raise max_packets_out to a bigger number but
(unexpectedly) flip tp->is_cwnd_limited from true to false.
This commit fixes that bug.
One straightforward fix would be to separately track (a) the next
window after max_packets_out reaches a maximum, and (b) the next
window after tp->is_cwnd_limited is set to true. But this would
require consuming an extra u32 sequence number.
Instead, to save space we track only the most important
information. Specifically, we track the strongest available signal of
the degree to which the cwnd is fully utilized:
(1) If the connection is cwnd-limited then we remember that fact for
the current window.
(2) If the connection not cwnd-limited then we track the maximum
number of outstanding packets in the current window.
In particular, note that the new logic cannot trigger the buggy
(a)/(b) sequence above because with the new logic a condition where
tp->packets_out > tp->max_packets_out can only trigger an update of
tp->is_cwnd_limited if tp->is_cwnd_limited is false.
This first showed up in a testing of a BBRv2 dev branch, but this
buggy behavior highlighted a general issue with the
tcp_cwnd_validate() logic that can cause cwnd to fail to increase at
the proper rate for any TCP congestion control, including Reno or
CUBIC.
Fixes:
|
||
|
061ff04071 |
bpf: tcp: Stop bpf_setsockopt(TCP_CONGESTION) in init ops to recur itself
When a bad bpf prog '.init' calls bpf_setsockopt(TCP_CONGESTION, "itself"), it will trigger this loop: .init => bpf_setsockopt(tcp_cc) => .init => bpf_setsockopt(tcp_cc) ... ... => .init => bpf_setsockopt(tcp_cc). It was prevented by the prog->active counter before but the prog->active detection cannot be used in struct_ops as explained in the earlier patch of the set. In this patch, the second bpf_setsockopt(tcp_cc) is not allowed in order to break the loop. This is done by using a bit of an existing 1 byte hole in tcp_sock to check if there is on-going bpf_setsockopt(TCP_CONGESTION) in this tcp_sock. Note that this essentially limits only the first '.init' can call bpf_setsockopt(TCP_CONGESTION) to pick a fallback cc (eg. peer does not support ECN) and the second '.init' cannot fallback to another cc. This applies even the second bpf_setsockopt(TCP_CONGESTION) will not cause a loop. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220929070407.965581-5-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
6ee5532052 |
xfrm: ipcomp: add extack to ipcomp{4,6}_init_state
And the shared helper ipcomp_init_state. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> |
||
|
25ec92cd04 |
xfrm: tunnel: add extack to ipip_init_state, xfrm6_tunnel_init_state
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> |
||
|
67c44f93c9 |
xfrm: esp: add extack to esp_init_state, esp6_init_state
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> |
||
|
ef87a4f84b |
xfrm: ah: add extack to ah_init_state, ah6_init_state
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> |
||
|
e1e10b44cf |
xfrm: pass extack down to xfrm_type ->init_state
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> |
||
|
5456262d2b |
net: Fix incorrect address comparison when searching for a bind2 bucket
The v6_rcv_saddr and rcv_saddr are inside a union in the
'struct inet_bind2_bucket'. When searching a bucket by following the
bhash2 hashtable chain, eg. inet_bind2_bucket_match, it is only using
the sk->sk_family and there is no way to check if the inet_bind2_bucket
has a v6 or v4 address in the union. This leads to an uninit-value
KMSAN report in [0] and also potentially incorrect matches.
This patch fixes it by adding a family member to the inet_bind2_bucket
and then tests 'sk->sk_family != tb->family' before matching
the sk's address to the tb's address.
Cc: Joanne Koong <joannelkoong@gmail.com>
Fixes:
|
||
|
3242abeb8d |
tcp: export tcp_sendmsg_fastopen
It will be used to support TCP FastOpen with MPTCP in the following commit. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Dmytro Shytyi <dmytro@shytyi.net> Signed-off-by: Dmytro Shytyi <dmytro@shytyi.net> Signed-off-by: Benjamin Hesmans <benjamin.hesmans@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
e7d2b51016 |
net: shrink struct ubuf_info
We can benefit from a smaller struct ubuf_info, so leave only mandatory fields and let users to decide how they want to extend it. Convert MSG_ZEROCOPY to struct ubuf_info_msgzc and remove duplicated fields. This reduces the size from 48 bytes to just 16. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
2a8a7c0eaa |
netfilter: nft_fib: Fix for rpath check with VRF devices
Analogous to commit |
||
|
31f1fbcb34 |
udp: Refactor udp_read_skb()
Delete the unnecessary while loop in udp_read_skb() for readability. Additionally, since recv_actor() cannot return a value greater than skb->len (see sk_psock_verdict_recv()), remove the redundant check. Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Link: https://lore.kernel.org/r/343b5d8090a3eb764068e9f1d392939e2b423747.1663909008.git.peilin.ye@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
0140a7168f |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/freescale/fec.h |
||
|
db39dfdc1c |
udp: Use WARN_ON_ONCE() in udp_read_skb()
Prevent udp_read_skb() from flooding the syslog. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Link: https://lore.kernel.org/r/20220921005915.2697-1-yepeilin.cs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
72f5c89804 |
netfilter: rpfilter: Remove unused variable 'ret'.
Commit
|
||
|
d1e5e6408b |
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking up the socket. While running a number of small workloads on the same host, they penalise each other and cause performance degradation. The root cause might be a single workload that consumes much more resources than the others. It often happens on a cloud service where different workloads share the same computing resource. On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash entries), after running iperf3 in different netns, creating 24Mi sockets without data transfer in the root netns causes about 10% performance regression for the iperf3's connection. thash_entries sockets length Gbps 524288 1 1 50.7 24Mi 48 45.1 It is basically related to the length of the list of each hash bucket. For testing purposes to see how performance drops along the length, I set 131072 (1Mi / 8) to thash_entries, and here's the result. thash_entries sockets length Gbps 131072 1 1 50.7 1Mi 8 49.9 2Mi 16 48.9 4Mi 32 47.3 8Mi 64 44.6 16Mi 128 40.6 24Mi 192 36.3 32Mi 256 32.5 40Mi 320 27.0 48Mi 384 25.0 To resolve the socket lookup degradation, we introduce an optional per-netns hash table for TCP, but it's just ehash, and we still share the global bhash, bhash2 and lhash2. With a smaller ehash, we can look up non-listener sockets faster and isolate such noisy neighbours. In addition, we can reduce lock contention. We can control the ehash size by a new sysctl knob. However, depending on workloads, it will require very sensitive tuning, so we disable the feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover, we can fall back to using the global ehash in case we fail to allocate enough memory for a new ehash. The maximum size is 16Mi, which is large enough that even if we have 48Mi sockets, the average list length is 3, and regression would be less than 1%. We can check the current ehash size by another read-only sysctl knob, net.ipv4.tcp_ehash_entries. A negative value means the netns shares the global ehash (per-netns ehash is disabled or failed to allocate memory). # dmesg | cut -d ' ' -f 5- | grep "established hash" TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage) # sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries # sysctl net.ipv4.tcp_child_ehash_entries net.ipv4.tcp_child_ehash_entries = 0 # disabled by default # ip netns add test1 # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = -524288 # share the global ehash # sysctl -w net.ipv4.tcp_child_ehash_entries=100 net.ipv4.tcp_child_ehash_entries = 100 # ip netns add test2 # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets When more than two processes in the same netns create per-netns ehash concurrently with different sizes, we need to guarantee the size in one of the following ways: 1) Share the global ehash and create per-netns ehash First, unshare() with tcp_child_ehash_entries==0. It creates dedicated netns sysctl knobs where we can safely change tcp_child_ehash_entries and clone()/unshare() to create a per-netns ehash. 2) Control write on sysctl by BPF We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on sysctl knobs. Note that the global ehash allocated at the boot time is spread over available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate pages for each per-netns ehash depending on the current process's NUMA policy. By default, the allocation is done in the local node only, so the per-netns hash table could fully reside on a random node. Thus, depending on the NUMA policy the netns is created with and the CPU the current thread is running on, we could see some performance differences for highly optimised networking applications. Note also that the default values of two sysctl knobs depend on the ehash size and should be tuned carefully: tcp_max_tw_buckets : tcp_child_ehash_entries / 2 tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128) As a bonus, we can dismantle netns faster. Currently, while destroying netns, we call inet_twsk_purge(), which walks through the global ehash. It can be potentially big because it can have many sockets other than TIME_WAIT in all netns. Splitting ehash changes that situation, where it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets in each netns. With regard to this, we do not free the per-netns ehash in inet_twsk_kill() to avoid UAF while iterating the per-netns ehash in inet_twsk_purge(). Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to keep it protocol-family-independent. In the future, we could optimise ehash lookup/iteration further by removing netns comparison for the per-netns ehash. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
edc12f032a |
tcp: Save unnecessary inet_twsk_purge() calls.
While destroying netns, we call inet_twsk_purge() in tcp_sk_exit_batch() and tcpv6_net_exit_batch() for AF_INET and AF_INET6. These commands trigger the kernel to walk through the potentially big ehash twice even though the netns has no TIME_WAIT sockets. # ip netns add test # ip netns del test or # unshare -n /bin/true >/dev/null When tw_refcount is 1, we need not call inet_twsk_purge() at least for the net. We can save such unneeded iterations if all netns in net_exit_list have no TIME_WAIT sockets. This change eliminates the tax by the additional unshare() described in the next patch to guarantee the per-netns ehash size. Tested: # mount -t debugfs none /sys/kernel/debug/ # echo cleanup_net > /sys/kernel/debug/tracing/set_ftrace_filter # echo inet_twsk_purge >> /sys/kernel/debug/tracing/set_ftrace_filter # echo function > /sys/kernel/debug/tracing/current_tracer # cat ./add_del_unshare.sh for i in `seq 1 40` do (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) & done wait; # ./add_del_unshare.sh Before the patch: # cat /sys/kernel/debug/tracing/trace_pipe kworker/u128:0-8 [031] ...1. 174.162765: cleanup_net <-process_one_work kworker/u128:0-8 [031] ...1. 174.240796: inet_twsk_purge <-cleanup_net kworker/u128:0-8 [032] ...1. 174.244759: inet_twsk_purge <-tcp_sk_exit_batch kworker/u128:0-8 [034] ...1. 174.290861: cleanup_net <-process_one_work kworker/u128:0-8 [039] ...1. 175.245027: inet_twsk_purge <-cleanup_net kworker/u128:0-8 [046] ...1. 175.290541: inet_twsk_purge <-tcp_sk_exit_batch kworker/u128:0-8 [037] ...1. 175.321046: cleanup_net <-process_one_work kworker/u128:0-8 [024] ...1. 175.941633: inet_twsk_purge <-cleanup_net kworker/u128:0-8 [025] ...1. 176.242539: inet_twsk_purge <-tcp_sk_exit_batch After: # cat /sys/kernel/debug/tracing/trace_pipe kworker/u128:0-8 [038] ...1. 428.116174: cleanup_net <-process_one_work kworker/u128:0-8 [038] ...1. 428.262532: cleanup_net <-process_one_work kworker/u128:0-8 [030] ...1. 429.292645: cleanup_net <-process_one_work Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
4461568aa4 |
tcp: Access &tcp_hashinfo via net.
We will soon introduce an optional per-netns ehash. This means we cannot use tcp_hashinfo directly in most places. Instead, access it via net->ipv4.tcp_death_row.hashinfo. The access will be valid only while initialising tcp_hashinfo itself and creating/destroying each netns. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
429e42c1c5 |
tcp: Set NULL to sk->sk_prot->h.hashinfo.
We will soon introduce an optional per-netns ehash. This means we cannot use the global sk->sk_prot->h.hashinfo to fetch a TCP hashinfo. Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get a proper hashinfo from net->ipv4.tcp_death_row.hashinfo. Note that we need not use sk->sk_prot->h.hashinfo if DCCP is disabled. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
e9bd0cca09 |
tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.
We will soon introduce an optional per-netns ehash and access hash tables via net->ipv4.tcp_death_row->hashinfo instead of &tcp_hashinfo in most places. It could harm the fast path because dereferences of two fields in net and tcp_death_row might incur two extra cache line misses. To save one dereference, let's place tcp_death_row back in netns_ipv4 and fetch hashinfo via net->ipv4.tcp_death_row"."hashinfo. Note tcp_death_row was initially placed in netns_ipv4, and commit |
||
|
08eaef9040 |
tcp: Clean up some functions.
This patch adds no functional change and cleans up some functions that the following patches touch around so that we make them tidy and easy to review/revert. The changes are - Keep reverse christmas tree order - Remove unnecessary init of port in inet_csk_find_open_port() - Use req_to_sk() once in reqsk_queue_unlink() - Use sock_net(sk) once in tcp_time_wait() and tcp_v[46]_connect() Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
b07a9b26e2 |
ipmr: Always call ip{,6}_mr_forward() from RCU read-side critical section
These functions expect to be called from RCU read-side critical section,
but this only happens when invoked from the data path via
ip{,6}_mr_input(). They can also be invoked from process context in
response to user space adding a multicast route which resolves a cache
entry with queued packets [1][2].
Fix by adding missing rcu_read_lock() / rcu_read_unlock() in these call
paths.
[1]
WARNING: suspicious RCU usage
6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted
-----------------------------
net/ipv4/ipmr.c:84 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by smcrouted/246:
#0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip_mroute_setsockopt+0x11c/0x1420
stack backtrace:
CPU: 0 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x91/0xb9
vif_dev_read+0xbf/0xd0
ipmr_queue_xmit+0x135/0x1ab0
ip_mr_forward+0xe7b/0x13d0
ipmr_mfc_add+0x1a06/0x2ad0
ip_mroute_setsockopt+0x5c1/0x1420
do_ip_setsockopt+0x23d/0x37f0
ip_setsockopt+0x56/0x80
raw_setsockopt+0x219/0x290
__sys_setsockopt+0x236/0x4d0
__x64_sys_setsockopt+0xbe/0x160
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
[2]
WARNING: suspicious RCU usage
6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387 Not tainted
-----------------------------
net/ipv6/ip6mr.c:69 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by smcrouted/246:
#0: ffffffff862389b0 (rtnl_mutex){+.+.}-{3:3}, at: ip6_mroute_setsockopt+0x6b9/0x2630
stack backtrace:
CPU: 1 PID: 246 Comm: smcrouted Not tainted 6.0.0-rc3-custom-15969-g049d233c8bcc-dirty #1387
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x91/0xb9
vif_dev_read+0xbf/0xd0
ip6mr_forward2.isra.0+0xc9/0x1160
ip6_mr_forward+0xef0/0x13f0
ip6mr_mfc_add+0x1ff2/0x31f0
ip6_mroute_setsockopt+0x1825/0x2630
do_ipv6_setsockopt+0x462/0x4440
ipv6_setsockopt+0x105/0x140
rawv6_setsockopt+0xd8/0x690
__sys_setsockopt+0x236/0x4d0
__x64_sys_setsockopt+0xbe/0x160
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes:
|
||
|
db4192a754 |
tcp: read multiple skbs in tcp_read_skb()
Before we switched to ->read_skb(), ->read_sock() was passed with
desc.count=1, which technically indicates we only read one skb per
->sk_data_ready() call. However, for TCP, this is not true.
TCP at least has sk_rcvlowat which intentionally holds skb's in
receive queue until this watermark is reached. This means when
->sk_data_ready() is invoked there could be multiple skb's in the
queue, therefore we have to read multiple skbs in tcp_read_skb()
instead of one.
Fixes:
|
||
|
9662895186 |
tcp: Use WARN_ON_ONCE() in tcp_read_skb()
Prevent tcp_read_skb() from flooding the syslog. Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
896f07c07d |
bpf: Use 0 instead of NOT_INIT for btf_struct_access() writes
Returning a bpf_reg_type only makes sense in the context of a BPF_READ. For writes, prefer to explicitly return 0 for clarity. Note that is non-functional change as it just so happened that NOT_INIT == 0. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/01772bc1455ae16600796ac78c6cc9fff34f95ff.1662568410.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
0ffe241253 |
bpf: Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
Usually when a TCP/UDP connection is initiated, we can bind the socket to a specific IP attached to an interface in a cgroup/connect hook. But for pings, this is impossible, as the hook is not being called. This adds the hook invocation to unprivileged ICMP ping (i.e. ping sockets created with SOCK_DGRAM IPPROTO_ICMP(V6) as opposed to SOCK_RAW. Logic is mirrored from UDP sockets where the hook is invoked during pre_connect, after a check for suficiently sized addr_len. Signed-off-by: YiFei Zhu <zhuyifei@google.com> Link: https://lore.kernel.org/r/5764914c252fad4cd134fb6664c6ede95f409412.1662682323.git.zhuyifei@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> |
||
|
ceef59b549 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says: ==================== The following set contains changes for your *net-next* tree: - make conntrack ignore packets that are delayed (containing data already acked). The current behaviour to flag them as INVALID causes more harm than good, let them pass so peer can send an immediate ACK for the most recent sequence number. - make conntrack recognize when both peers have sent 'invalid' FINs: This helps cleaning out stale connections faster for those cases where conntrack is no longer in sync with the actual connection state. - Now that DECNET is gone, we don't need to reserve space for DECNET related information. - compact common 'find a free port number for the new inbound connection' code and move it to a helper, then cap number of tries the new helper will make until it gives up. - replace various instances of strlcpy with strscpy, from Wolfram Sang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
9f8f1933dc |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/freescale/fec.h |
||
|
c92c271710 |
netfilter: nat: move repetitive nat port reserve loop to a helper
Almost all nat helpers reserve an expecation port the same way: Try the port inidcated by the peer, then move to next port if that port is already in use. We can squash this into a helper. Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> |
||
|
2786bcff28 |
Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-09-05 The following pull-request contains BPF updates for your *net-next* tree. We've added 106 non-merge commits during the last 18 day(s) which contain a total of 159 files changed, 5225 insertions(+), 1358 deletions(-). There are two small merge conflicts, resolve them as follows: 1) tools/testing/selftests/bpf/DENYLIST.s390x Commit |
||
|
686dc2db2a |
tcp: fix early ETIMEDOUT after spurious non-SACK RTO
Fix a bug reported and analyzed by Nagaraj Arankal, where the handling
of a spurious non-SACK RTO could cause a connection to fail to clear
retrans_stamp, causing a later RTO to very prematurely time out the
connection with ETIMEDOUT.
Here is the buggy scenario, expanding upon Nagaraj Arankal's excellent
report:
(*1) Send one data packet on a non-SACK connection
(*2) Because no ACK packet is received, the packet is retransmitted
and we enter CA_Loss; but this retransmission is spurious.
(*3) The ACK for the original data is received. The transmitted packet
is acknowledged. The TCP timestamp is before the retrans_stamp,
so tcp_may_undo() returns true, and tcp_try_undo_loss() returns
true without changing state to Open (because tcp_is_sack() is
false), and tcp_process_loss() returns without calling
tcp_try_undo_recovery(). Normally after undoing a CA_Loss
episode, tcp_fastretrans_alert() would see that the connection
has returned to CA_Open and fall through and call
tcp_try_to_open(), which would set retrans_stamp to 0. However,
for non-SACK connections we hold the connection in CA_Loss, so do
not fall through to call tcp_try_to_open() and do not set
retrans_stamp to 0. So retrans_stamp is (erroneously) still
non-zero.
At this point the first "retransmission event" has passed and
been recovered from. Any future retransmission is a completely
new "event". However, retrans_stamp is erroneously still
set. (And we are still in CA_Loss, which is correct.)
(*4) After 16 minutes (to correspond with tcp_retries2=15), a new data
packet is sent. Note: No data is transmitted between (*3) and
(*4) and we disabled keep alives.
The socket's timeout SHOULD be calculated from this point in
time, but instead it's calculated from the prior "event" 16
minutes ago (step (*2)).
(*5) Because no ACK packet is received, the packet is retransmitted.
(*6) At the time of the 2nd retransmission, the socket returns
ETIMEDOUT, prematurely, because retrans_stamp is (erroneously)
too far in the past (set at the time of (*2)).
This commit fixes this bug by ensuring that we reuse in
tcp_try_undo_loss() the same careful logic for non-SACK connections
that we have in tcp_try_undo_recovery(). To avoid duplicating logic,
we factor out that logic into a new
tcp_is_non_sack_preventing_reopen() helper and call that helper from
both undo functions.
Fixes:
|
||
|
fd969f25fe |
bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt()
This patch changes bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt() and remove the duplicated code. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002925.2895416-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
273b7f0fb4 |
bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
This patch changes bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt(). It removes the duplicated code from bpf_getsockopt(SOL_TCP). Before this patch, there were some optnames available to bpf_setsockopt(SOL_TCP) but missing in bpf_getsockopt(SOL_TCP). For example, TCP_NODELAY, TCP_MAXSEG, TCP_KEEPIDLE, TCP_KEEPINTVL, and a few more. It surprises users from time to time. This patch automatically closes this gap without duplicating more code. bpf_getsockopt(TCP_SAVED_SYN) does not free the saved_syn, so it stays in sol_tcp_sockopt(). For string name value like TCP_CONGESTION, bpf expects it is always null terminated, so sol_tcp_sockopt() decrements optlen by one before calling do_tcp_getsockopt() and the 'if (optlen < saved_optlen) memset(..,0,..);' in __bpf_getsockopt() will always do a null termination. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002918.2894511-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
1985320c54 |
bpf: net: Avoid do_ip_getsockopt() taking sk lock when called from bpf
Similar to the earlier commit that changed sk_setsockopt() to use sockopt_{lock,release}_sock() such that it can avoid taking lock when called from bpf. This patch also changes do_ip_getsockopt() to use sockopt_{lock,release}_sock() such that a latter patch can make bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002834.2891514-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
728f064cd7 |
bpf: net: Change do_ip_getsockopt() to take the sockptr_t argument
Similar to the earlier patch that changes sk_getsockopt() to take the sockptr_t argument. This patch also changes do_ip_getsockopt() to take the sockptr_t argument such that a latter patch can make bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt(). Note on the change in ip_mc_gsfget(). This function is to return an array of sockaddr_storage in optval. This function is shared between ip_get_mcast_msfilter() and compat_ip_get_mcast_msfilter(). However, the sockaddr_storage is stored at different offset of the optval because of the difference between group_filter and compat_group_filter. Thus, a new 'ss_offset' argument is added to ip_mc_gsfget(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002828.2890585-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
d51bbff2ab |
bpf: net: Avoid do_tcp_getsockopt() taking sk lock when called from bpf
Similar to the earlier commit that changed sk_setsockopt() to use sockopt_{lock,release}_sock() such that it can avoid taking lock when called from bpf. This patch also changes do_tcp_getsockopt() to use sockopt_{lock,release}_sock() such that a latter patch can make bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002821.2889765-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
34704ef024 |
bpf: net: Change do_tcp_getsockopt() to take the sockptr_t argument
Similar to the earlier patch that changes sk_getsockopt() to take the sockptr_t argument . This patch also changes do_tcp_getsockopt() to take the sockptr_t argument such that a latter patch can make bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt(). Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20220902002815.2889332-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
e7506d344b |
rxrpc fixes
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmMQjRQACgkQ+7dXa6fL C2uGMBAAmb6+9iaxeAj6kE+/P4LCzUgfiIi0mH5QdK5JmCWDkhOvH1XCqHJwPIVb w/z5OGQm8CpaVb0nRC3pXMxdrG2OAmE6q6vYnJly30XRYdmv4tQ40SVwsq+OUmZ/ dTdSgoFUYEb7WPwfGIdwpCQS0hmrXXDz5yRwq7DPmrcMEj374AKdADeJEQhvjd86 NkhAp9bANNCwxIEouUntHtfZM+x3zFhzPh+giV+54WOKPLbqUYiG/AKo729oIgNf dGApdKjvfxbK/dx0fqfYXk19RpbJgyacrGrl5GQ+tZ1qkdL0PB1B1Q/nh/4COoTi Q7bTrZ6HyXGpQCMIwEY43dkdnnEdfaWLeXw4EqU0oHABckpCrzkv+HsV2kAPrEUS wsLPXEuNY4Lszbidc4+NKGIg82RQWrPjtxmslys6YdyO12cmOiyH51RePkeUvkKY C4erE+Kj7tNM58BHGN1TYsGhgHdAta5DWn+008Uc5KQh1WFJW+Wfk+VHtmNwSj4f PiPcO1TM9Sp42jRizlPQE8DF2KGSs13pwxtomvYpeufZKjZL29OnfMySH+bXfRU0 JDzyUbeY1jMayADV3ovXQsP4KIIbKzL/m1LxBCtZ+R/zcxdij87pRIzFV8ZnnsZ1 QBfWobozK1NNFofMES6LjocoHF0ZT3H8khpaYTOlZA0MjSwXM5o= =GW48 -----END PGP SIGNATURE----- Merge tag 'rxrpc-fixes-20220901' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc fixes Here are some fixes for AF_RXRPC: (1) Fix the handling of ICMP/ICMP6 packets. This is a problem due to rxrpc being switched to acting as a UDP tunnel, thereby allowing it to steal the packets before they go through the UDP Rx queue. UDP tunnels can't get ICMP/ICMP6 packets, however. This patch adds an additional encap hook so that they can. (2) Fix the encryption routines in rxkad to handle packets that have more than three parts correctly. The problem is that ->nr_frags doesn't count the initial fragment, so the sglist ends up too short. (3) Fix a problem with destruction of the local endpoint potentially getting repeated. (4) Fix the calculation of the time at which to resend. jiffies_to_usecs() gives microseconds, not nanoseconds. (5) Fix AFS to work out when callback promises and locks expire based on the time an op was issued rather than the time the first reply packet arrives. We don't know how long the server took between calculating the expiry interval and transmitting the reply. (6) Given (5), rxrpc_get_reply_time() is no longer used, so remove it. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
3261400639 |
tcp: TX zerocopy should not sense pfmemalloc status
We got a recent syzbot report [1] showing a possible misuse of pfmemalloc page status in TCP zerocopy paths. Indeed, for pages coming from user space or other layers, using page_is_pfmemalloc() is moot, and possibly could give false positives. There has been attempts to make page_is_pfmemalloc() more robust, but not using it in the first place in this context is probably better, removing cpu cycles. Note to stable teams : You need to backport |
||
|
60ad1100d5 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
tools/testing/selftests/net/.gitignore sort the net-next version and use it Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
ac56a0b48d |
rxrpc: Fix ICMP/ICMP6 error handling
Because rxrpc pretends to be a tunnel on top of a UDP/UDP6 socket, allowing
it to siphon off UDP packets early in the handling of received UDP packets
thereby avoiding the packet going through the UDP receive queue, it doesn't
get ICMP packets through the UDP ->sk_error_report() callback. In fact, it
doesn't appear that there's any usable option for getting hold of ICMP
packets.
Fix this by adding a new UDP encap hook to distribute error messages for
UDP tunnels. If the hook is set, then the tunnel driver will be able to
see ICMP packets. The hook provides the offset into the packet of the UDP
header of the original packet that caused the notification.
An alternative would be to call the ->error_handler() hook - but that
requires that the skbuff be cloned (as ip_icmp_error() or ipv6_cmp_error()
do, though isn't really necessary or desirable in rxrpc's case is we want
to parse them there and then, not queue them).
Changes
=======
ver #3)
- Fixed an uninitialised variable.
ver #2)
- Fixed some missing CONFIG_AF_RXRPC_IPV6 conditionals.
Fixes:
|
||
|
79e3602caa |
tcp: make global challenge ack rate limitation per net-ns and default disabled
Because per host rate limiting has been proven problematic (side channel attacks can be based on it), per host rate limiting of challenge acks ideally should be per netns and turned off by default. This is a long due followup of following commits: |
||
|
8c70521238 |
tcp: annotate data-race around challenge_timestamp
challenge_timestamp can be read an written by concurrent threads.
This was expected, but we need to annotate the race to avoid potential issues.
Following patch moves challenge_timestamp and challenge_count
to per-netns storage to provide better isolation.
Fixes:
|
||
|
0e4d354762 |
net-next: Fix IP_UNICAST_IF option behavior for connected sockets
The IP_UNICAST_IF socket option is used to set the outgoing interface for outbound packets. The IP_UNICAST_IF socket option was added as it was needed by the Wine project, since no other existing option (SO_BINDTODEVICE socket option, IP_PKTINFO socket option or the bind function) provided the needed characteristics needed by the IP_UNICAST_IF socket option. [1] The IP_UNICAST_IF socket option works well for unconnected sockets, that is, the interface specified by the IP_UNICAST_IF socket option is taken into consideration in the route lookup process when a packet is being sent. However, for connected sockets, the outbound interface is chosen when connecting the socket, and in the route lookup process which is done when a packet is being sent, the interface specified by the IP_UNICAST_IF socket option is being ignored. This inconsistent behavior was reported and discussed in an issue opened on systemd's GitHub project [2]. Also, a bug report was submitted in the kernel's bugzilla [3]. To understand the problem in more detail, we can look at what happens for UDP packets over IPv4 (The same analysis was done separately in the referenced systemd issue). When a UDP packet is sent the udp_sendmsg function gets called and the following happens: 1. The oif member of the struct ipcm_cookie ipc (which stores the output interface of the packet) is initialized by the ipcm_init_sk function to inet->sk.sk_bound_dev_if (the device set by the SO_BINDTODEVICE socket option). 2. If the IP_PKTINFO socket option was set, the oif member gets overridden by the call to the ip_cmsg_send function. 3. If no output interface was selected yet, the interface specified by the IP_UNICAST_IF socket option is used. 4. If the socket is connected and no destination address is specified in the send function, the struct ipcm_cookie ipc is not taken into consideration and the cached route, that was calculated in the connect function is being used. Thus, for a connected socket, the IP_UNICAST_IF sockopt isn't taken into consideration. This patch corrects the behavior of the IP_UNICAST_IF socket option for connect()ed sockets by taking into consideration the IP_UNICAST_IF sockopt when connecting the socket. In order to avoid reconnecting the socket, this option is still ignored when applied on an already connected socket until connect() is called again by the Richard Gobert. Change the __ip4_datagram_connect function, which is called during socket connection, to take into consideration the interface set by the IP_UNICAST_IF socket option, in a similar way to what is done in the udp_sendmsg function. [1] https://lore.kernel.org/netdev/1328685717.4736.4.camel@edumazet-laptop/T/ [2] https://github.com/systemd/systemd/issues/11935#issuecomment-618691018 [3] https://bugzilla.kernel.org/show_bug.cgi?id=210255 Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220829111554.GA1771@debian Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
eb55dc09b5 |
ip: fix triggering of 'icmp redirect'
__mkroute_input() uses fib_validate_source() to trigger an icmp redirect.
My understanding is that fib_validate_source() is used to know if the src
address and the gateway address are on the same link. For that,
fib_validate_source() returns 1 (same link) or 0 (not the same network).
__mkroute_input() is the only user of these positive values, all other
callers only look if the returned value is negative.
Since the below patch, fib_validate_source() didn't return anymore 1 when
both addresses are on the same network, because the route lookup returns
RT_SCOPE_LINK instead of RT_SCOPE_HOST. But this is, in fact, right.
Let's adapat the test to return 1 again when both addresses are on the same
link.
CC: stable@vger.kernel.org
Fixes:
|
||
|
84e5a0f208 |
bpf, net: Avoid loading module when calling bpf_setsockopt(TCP_CONGESTION)
When bpf prog changes tcp-cc by calling bpf_setsockopt(TCP_CONGESTION), it should not try to load module which may be a blocking operation. This details was correct in the v1 [0] but missed by mistake in the later revision in commit |
||
|
47cf88993c |
net: unify alloclen calculation for paged requests
Consolidate alloclen and pagedlen calculation for zerocopy and normal paged requests. The current non-zerocopy paged version can a bit overallocate and unnecessary copy a small chunk of data into the linear part. Cc: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/netdev/CA+FuTSf0+cJ9_N_xrHmCGX_KoVCWcE0YQBdtgEkzGvcLMSv7Qw@mail.gmail.com/ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b0e4edb7b91f171c7119891d3c61040b8c56596e.1661428921.git.asml.silence@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
|
9c5d03d362 |
genetlink: start to validate reserved header bytes
We had historically not checked that genlmsghdr.reserved is 0 on input which prevents us from using those precious bytes in the future. One use case would be to extend the cmd field, which is currently just 8 bits wide and 256 is not a lot of commands for some core families. To make sure that new families do the right thing by default put the onus of opting out of validation on existing families. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel) Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
26dbd66eab |
esp: choose the correct inner protocol for GSO on inter address family tunnels
Commit |
||
|
2e085ec0e2 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel borkmann says: ==================== The following pull-request contains BPF updates for your *net* tree. We've added 11 non-merge commits during the last 14 day(s) which contain a total of 13 files changed, 61 insertions(+), 24 deletions(-). The main changes are: 1) Fix BPF verifier's precision tracking around BPF ring buffer, from Kumar Kartikeya Dwivedi. 2) Fix regression in tunnel key infra when passing FLOWI_FLAG_ANYSRC, from Eyal Birger. 3) Fix insufficient permissions for bpf_sys_bpf() helper, from YiFei Zhu. 4) Fix splat from hitting BUG when purging effective cgroup programs, from Pu Lehui. 5) Fix range tracking for array poke descriptors, from Daniel Borkmann. 6) Fix corrupted packets for XDP_SHARED_UMEM in aligned mode, from Magnus Karlsson. 7) Fix NULL pointer splat in BPF sockmap sk_msg_recvmsg(), from Liu Jian. 8) Add READ_ONCE() to bpf_jit_limit when reading from sysctl, from Kuniyuki Iwashima. 9) Add BPF selftest lru_bug check to s390x deny list, from Daniel Müller. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
880b0dd94f |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/mellanox/mlx5/core/en_fs.c |
||
|
35ffb66547 |
net: gro: skb_gro_header helper function
Introduce a simple helper function to replace a common pattern. When accessing the GRO header, we fetch the pointer from frag0, then test its validity and fetch it from the skb when necessary. This leads to the pattern skb_gro_header_fast -> skb_gro_header_hard -> skb_gro_header_slow recurring many times throughout GRO code. This patch replaces these patterns with a single inlined function call, improving code readability. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220823071034.GA56142@debian Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
|
28044fc1d4 |
net: Add a bhash2 table hashed by port and address
The current bind hashtable (bhash) is hashed by port only. In the socket bind path, we have to check for bind conflicts by traversing the specified port's inet_bind_bucket while holding the hashbucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()). In instances where there are tons of sockets hashed to the same port at different addresses, the bind conflict check is time-intensive and can cause softirq cpu lockups, as well as stops new tcp connections since __inet_inherit_port() also contests for the spinlock. This patch adds a second bind table, bhash2, that hashes by port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6). Searching the bhash2 table leads to significantly faster conflict resolution and less time holding the hashbucket spinlock. Please note a few things: * There can be the case where the a socket's address changes after it has been bound. There are two cases where this happens: 1) The case where there is a bind() call on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will assign the socket an address when it handles the connect() 2) In inet_sk_reselect_saddr(), which is called when rebuilding the sk header and a few pre-conditions are met (eg rerouting fails). In these two cases, we need to update the bhash2 table by removing the entry for the old address, and add a new entry reflecting the updated address. * The bhash2 table must have its own lock, even though concurrent accesses on the same port are protected by the bhash lock. Bhash2 must have its own lock to protect against cases where sockets on different ports hash to different bhash hashbuckets but to the same bhash2 hashbucket. This brings up a few stipulations: 1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock will always be acquired after the bhash lock and released before the bhash lock is released. 2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always acquired+released before another bhash2 lock is acquired+released. * The bhash table cannot be superseded by the bhash2 table because for bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket bound to that port must be checked for a potential conflict. The bhash table is the only source of port->socket associations. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
a5612ca10d |
net: Fix data-races around sysctl_devconf_inherit_init_net.
While reading sysctl_devconf_inherit_init_net, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes:
|
||
|
657b991afb |
net: Fix data-races around sysctl_max_skb_frags.
While reading sysctl_max_skb_frags, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes:
|
||
|
7de6d09f51 |
net: Fix data-races around sysctl_optmem_max.
While reading sysctl_optmem_max, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes:
|
||
|
1227c1771d |
net: Fix data-races around sysctl_[rw]mem_(max|default).
While reading sysctl_[rw]mem_(max|default), they can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes:
|
||
|
aacd467c0a |
tcp: annotate data-race around tcp_md5sig_pool_populated
tcp_md5sig_pool_populated can be read while another thread changes its value. The race has no consequence because allocations are protected with tcp_md5sig_mutex. This patch adds READ_ONCE() and WRITE_ONCE() to document the race and silence KCSAN. Reported-by: Abhishek Shah <abhishek.shah@columbia.edu> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
01e454f243 |
ipv4: move from strlcpy with unused retval to strscpy
Follow the advice of the below link and prefer 'strscpy' in this subsystem. Conversion is 1:1 because the return value is not used. Generated by a coccinelle script. Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/ Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Link: https://lore.kernel.org/r/20220818210219.8467-1-wsa+renesas@sang-engineering.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
268603d79c |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
ee7f1e1302 |
bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()
After the prep work in the previous patches, this patch removes the dup code from bpf_setsockopt(SOL_IP) and reuses the implementation in do_ip_setsockopt(). The existing optname white-list is refactored into a new function sol_ip_setsockopt(). NOTE, the current bpf_setsockopt(IP_TOS) is quite different from the the do_ip_setsockopt(IP_TOS). For example, it does not take the INET_ECN_MASK into the account for tcp and also does not adjust sk->sk_priority. It looks like the current bpf_setsockopt(IP_TOS) was referencing the IPV6_TCLASS implementation instead of IP_TOS. This patch tries to rectify that by using the do_ip_setsockopt(IP_TOS). While this is a behavior change, the do_ip_setsockopt(IP_TOS) behavior is arguably what the user is expecting. At least, the INET_ECN_MASK bits should be masked out for tcp. Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061826.4180990-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
0c751f7071 |
bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()
After the prep work in the previous patches, this patch removes all the dup code from bpf_setsockopt(SOL_TCP) and reuses the do_tcp_setsockopt(). The existing optname white-list is refactored into a new function sol_tcp_setsockopt(). The sol_tcp_setsockopt() also calls the bpf_sol_tcp_setsockopt() to handle the TCP_BPF_XXX specific optnames. bpf_setsockopt(TCP_SAVE_SYN) now also allows a value 2 to save the eth header also and it comes for free from do_tcp_setsockopt(). Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061819.4180146-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
1df055d3c7 |
bpf: net: Change do_ip_setsockopt() to use the sockopt's lock_sock() and capable()
Similar to the earlier patch that avoids sk_setsockopt() from taking sk lock and doing capable test when called by bpf. This patch changes do_ip_setsockopt() to use the sockopt_{lock,release}_sock() and sockopt_[ns_]capable(). Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061737.4176402-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
cb388e7ee3 |
bpf: net: Change do_tcp_setsockopt() to use the sockopt's lock_sock() and capable()
Similar to the earlier patch that avoids sk_setsockopt() from taking sk lock and doing capable test when called by bpf. This patch changes do_tcp_setsockopt() to use the sockopt_{lock,release}_sock() and sockopt_[ns_]capable(). Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220817061730.4176021-1-kafai@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
7ec9fce4b3 |
ip_tunnel: Respect tunnel key's "flow_flags" in IP tunnels
Commit |
||
|
2e23acd99e |
tcp: handle pure FIN case correctly
When skb->len==0, the recv_actor() returns 0 too, but we also use 0
for error conditions. This patch amends this by propagating the errors
to tcp_read_skb() so that we can distinguish skb->len==0 case from
error cases.
Fixes:
|
||
|
a8688821f3 |
tcp: refactor tcp_read_skb() a bit
As tcp_read_skb() only reads one skb at a time, the while loop is unnecessary, we can turn it into an if. This also simplifies the code logic. Cc: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
c457985aaa |
tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
tcp_cleanup_rbuf() retrieves the skb from sk_receive_queue, it
assumes the skb is not yet dequeued. This is no longer true for
tcp_read_skb() case where we dequeue the skb first.
Fix this by introducing a helper __tcp_cleanup_rbuf() which does
not require any skb and calling it in tcp_read_skb().
Fixes:
|
||
|
e9c6e79760 |
tcp: fix sock skb accounting in tcp_read_skb()
Before commit |