lineage-22.0
233 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Greg Kroah-Hartman
|
66e91da883 |
This is the 5.10.210 stable release
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmXYTLkACgkQONu9yGCS aT4+fhAAqqR/Cvx53ZKMQ8GZTCudAZnr/Dz6kWYwxhhhIbQjDpCaf9mgsrEDaQS2 ancSZjzYaOUIXq/IsthXxQIUhiZbuM3iuSEi7+odWgSYdkFyzuUt8MWLBGSaB5Er ojn+APtq7vPXTSnp7uMwqMC3/BHCKkeYIjRVevhhHBKG5d3lzkV1xU8NcvMkLaly CIRxpWXD3w2b7K0GEbb/zN1GQEHDCQcxjuaJoe/5FKGJkqd3T31eyiJTRumCCMcz j8vkGkYmcMJpWf04iLgVA1p13I5/HGrXdEBI/GutN8IABIC3Cp42jW8phHYKW5ZM a4R25LZG5buND1Ubpq+EDrYn3EaPek5XRki0w8ZAXfNa3rYc+N6mQjkzNSOzhJ/5 VNsn3EAE1Dwtar5Z3ASe9ugDbh+0bgx85PbfaADK88V+qWb3DVr1TBWmDNu2vfVP rv4I0EKu9r3vOE8aNMEBuhAVkIK3mEQUxwab6RKNrMby/5Uwa+ugrrUtQd8V+T1S j6r6v7u7aZ8mhYO7d6WSvAKL85lCWGbs3WRIKCJZmDRyqWrWW9tVWRN9wrZ2QnRr iaCQKk8P474P7/j1zwnmih8l4wS1oszveNziWwd0fi1Nn/WQYM+JKYQvpuQijmQ+ J9jLyWo7a59zffIE6mzJdNwFy9hlw9X+VnJmExk/Q88Z7Bt5wPQ= =laYd -----END PGP SIGNATURE----- Merge 5.10.210 into android12-5.10-lts Changes in 5.10.210 usb: cdns3: Fixes for sparse warnings usb: cdns3: fix uvc failure work since sg support enabled usb: cdns3: fix incorrect calculation of ep_buf_size when more than one config usb: cdns3: fix iso transfer error when mult is not zero usb: cdns3: Fix uvc fail when DMA cross 4k boundery since sg enabled PCI: mediatek: Clear interrupt status before dispatching handler units: change from 'L' to 'UL' units: add the HZ macros serial: sc16is7xx: set safe default SPI clock frequency spi: introduce SPI_MODE_X_MASK macro serial: sc16is7xx: add check for unsupported SPI modes during probe iio: adc: ad7091r: Set alert bit in config register iio: adc: ad7091r: Allow users to configure device events iio: adc: ad7091r: Enable internal vref if external vref is not supplied dmaengine: fix NULL pointer in channel unregistration function iio:adc:ad7091r: Move exports into IIO_AD7091R namespace. ext4: allow for the last group to be marked as trimmed crypto: api - Disallow identical driver names PM: hibernate: Enforce ordering during image compression/decompression hwrng: core - Fix page fault dead lock on mmap-ed hwrng crypto: s390/aes - Fix buffer overread in CTR mode rpmsg: virtio: Free driver_override when rpmsg_remove() bus: mhi: host: Drop chan lock before queuing buffers parisc/firmware: Fix F-extend for PDC addresses async: Split async_schedule_node_domain() async: Introduce async_schedule_dev_nocall() arm64: dts: qcom: sdm845: fix USB wakeup interrupt types arm64: dts: qcom: sdm845: fix USB DP/DM HS PHY interrupts lsm: new security_file_ioctl_compat() hook scripts/get_abi: fix source path leak mmc: core: Use mrq.sbc in close-ended ffu mmc: mmc_spi: remove custom DMA mapped buffers rtc: Adjust failure return code for cmos_set_alarm() nouveau/vmm: don't set addr on the fail path to avoid warning ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path rename(): fix the locking of subdirectories block: Remove special-casing of compound pages stddef: Introduce DECLARE_FLEX_ARRAY() helper smb3: Replace smb2pdu 1-element arrays with flex-arrays mm: vmalloc: introduce array allocation functions KVM: use __vcalloc for very large allocations net/smc: fix illegal rmb_desc access in SMC-D connection dump tcp: make sure init the accept_queue's spinlocks once bnxt_en: Wait for FLR to complete during probe vlan: skip nested type that is not IFLA_VLAN_QOS_MAPPING llc: make llc_ui_sendmsg() more robust against bonding changes llc: Drop support for ETH_P_TR_802_2. net/rds: Fix UBSAN: array-index-out-of-bounds in rds_cmsg_recv tracing: Ensure visibility when inserting an element into tracing_map afs: Hide silly-rename files from userspace tcp: Add memory barrier to tcp_push() netlink: fix potential sleeping issue in mqueue_flush_file ipv6: init the accept_queue's spinlocks in inet6_create net/mlx5: DR, Use the right GVMI number for drop action net/mlx5e: fix a double-free in arfs_create_groups netfilter: nf_tables: restrict anonymous set and map names to 16 bytes netfilter: nf_tables: validate NFPROTO_* family net: mvpp2: clear BM pool before initialization selftests: netdevsim: fix the udp_tunnel_nic test fjes: fix memleaks in fjes_hw_setup net: fec: fix the unhandled context fault from smmu btrfs: ref-verify: free ref cache before clearing mount opt btrfs: tree-checker: fix inline ref size in error messages btrfs: don't warn if discard range is not aligned to sector btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args btrfs: don't abort filesystem when attempting to snapshot deleted subvolume rbd: don't move requests to the running list on errors exec: Fix error handling in begin_new_exec() wifi: iwlwifi: fix a memory corruption netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain netfilter: nf_tables: reject QUEUE/DROP verdict parameters gpiolib: acpi: Ignore touchpad wakeup on GPD G1619-04 drm: Don't unref the same fb many times by mistake due to deadlock handling drm/bridge: nxp-ptn3460: fix i2c_master_send() error checking drm/tidss: Fix atomic_flush check drm/bridge: nxp-ptn3460: simplify some error checking PM: sleep: Use dev_printk() when possible PM: sleep: Avoid calling put_device() under dpm_list_mtx PM: core: Remove unnecessary (void *) conversions PM: sleep: Fix possible deadlocks in core system-wide PM code fs/pipe: move check to pipe_has_watch_queue() pipe: wakeup wr_wait after setting max_usage ARM: dts: samsung: exynos4210-i9100: Unconditionally enable LDO12 arm64: dts: qcom: sc7180: Use pdc interrupts for USB instead of GIC interrupts arm64: dts: qcom: sc7180: fix USB wakeup interrupt types media: mtk-jpeg: Fix use after free bug due to error path handling in mtk_jpeg_dec_device_run mm: use __pfn_to_section() instead of open coding it mm/sparsemem: fix race in accessing memory_section->usage btrfs: remove err variable from btrfs_delete_subvolume btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted drm: panel-simple: add missing bus flags for Tianma tm070jvhg[30/33] drm/exynos: fix accidental on-stack copy of exynos_drm_plane drm/exynos: gsc: minor fix for loop iteration in gsc_runtime_resume gpio: eic-sprd: Clear interrupt after set the interrupt type spi: bcm-qspi: fix SFDP BFPT read by usig mspi read mips: Call lose_fpu(0) before initializing fcr31 in mips_set_personality_nan tick/sched: Preserve number of idle sleeps across CPU hotplug events x86/entry/ia32: Ensure s32 is sign extended to s64 powerpc/mm: Fix null-pointer dereference in pgtable_cache_add drivers/perf: pmuv3: don't expose SW_INCR event in sysfs powerpc: Fix build error due to is_valid_bugaddr() powerpc/mm: Fix build failures due to arch_reserved_kernel_pages() x86/boot: Ignore NMIs during very early boot powerpc: pmd_move_must_withdraw() is only needed for CONFIG_TRANSPARENT_HUGEPAGE powerpc/lib: Validate size for vector operations x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file debugobjects: Stop accessing objects after releasing hash bucket lock regulator: core: Only increment use_count when enable_count changes audit: Send netlink ACK before setting connection in auditd_set ACPI: video: Add quirk for the Colorful X15 AT 23 Laptop PNP: ACPI: fix fortify warning ACPI: extlog: fix NULL pointer dereference check PM / devfreq: Synchronize devfreq_monitor_[start/stop] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events FS:JFS:UBSAN:array-index-out-of-bounds in dbAdjTree UBSAN: array-index-out-of-bounds in dtSplitRoot jfs: fix slab-out-of-bounds Read in dtSearch jfs: fix array-index-out-of-bounds in dbAdjTree jfs: fix uaf in jfs_evict_inode pstore/ram: Fix crash when setting number of cpus to an odd number crypto: stm32/crc32 - fix parsing list of devices afs: fix the usage of read_seqbegin_or_lock() in afs_lookup_volume_rcu() afs: fix the usage of read_seqbegin_or_lock() in afs_find_server*() rxrpc_find_service_conn_rcu: fix the usage of read_seqbegin_or_lock() jfs: fix array-index-out-of-bounds in diNewExt s390/ptrace: handle setting of fpc register correctly KVM: s390: fix setting of fpc register SUNRPC: Fix a suspicious RCU usage warning ecryptfs: Reject casefold directory inodes ext4: fix inconsistent between segment fstrim and full fstrim ext4: unify the type of flexbg_size to unsigned int ext4: remove unnecessary check from alloc_flex_gd() ext4: avoid online resizing failures due to oversized flex bg wifi: rt2x00: restart beacon queue when hardware reset selftests/bpf: satisfy compiler by having explicit return in btf test selftests/bpf: Fix pyperf180 compilation failure with clang18 scsi: lpfc: Fix possible file string name overflow when updating firmware PCI: Add no PM reset quirk for NVIDIA Spectrum devices bonding: return -ENOMEM instead of BUG in alb_upper_dev_walk scsi: arcmsr: Support new PCI device IDs 1883 and 1886 ARM: dts: imx7d: Fix coresight funnel ports ARM: dts: imx7s: Fix lcdif compatible ARM: dts: imx7s: Fix nand-controller #size-cells wifi: ath9k: Fix potential array-index-out-of-bounds read in ath9k_htc_txstatus() bpf: Add map and need_defer parameters to .map_fd_put_ptr() scsi: libfc: Don't schedule abort twice scsi: libfc: Fix up timeout error in fc_fcp_rec_error() bpf: Set uattr->batch.count as zero before batched update or deletion ARM: dts: rockchip: fix rk3036 hdmi ports node ARM: dts: imx25/27-eukrea: Fix RTC node name ARM: dts: imx: Use flash@0,0 pattern ARM: dts: imx27: Fix sram node ARM: dts: imx1: Fix sram node ionic: pass opcode to devcmd_wait block/rnbd-srv: Check for unlikely string overflow ARM: dts: imx25: Fix the iim compatible string ARM: dts: imx25/27: Pass timing0 ARM: dts: imx27-apf27dev: Fix LED name ARM: dts: imx23-sansa: Use preferred i2c-gpios properties ARM: dts: imx23/28: Fix the DMA controller node name net: dsa: mv88e6xxx: Fix mv88e6352_serdes_get_stats error path block: prevent an integer overflow in bvec_try_merge_hw_page md: Whenassemble the array, consult the superblock of the freshest device arm64: dts: qcom: msm8996: Fix 'in-ports' is a required property arm64: dts: qcom: msm8998: Fix 'out-ports' is a required property wifi: rtl8xxxu: Add additional USB IDs for RTL8192EU devices wifi: rtlwifi: rtl8723{be,ae}: using calculate_bit_shift() wifi: cfg80211: free beacon_ies when overridden from hidden BSS Bluetooth: qca: Set both WIDEBAND_SPEECH and LE_STATES quirks for QCA2066 Bluetooth: L2CAP: Fix possible multiple reject send i40e: Fix VF disable behavior to block all traffic f2fs: fix to check return value of f2fs_reserve_new_block() ALSA: hda: Refer to correct stream index at loops ASoC: doc: Fix undefined SND_SOC_DAPM_NOPM argument fast_dput(): handle underflows gracefully RDMA/IPoIB: Fix error code return in ipoib_mcast_join drm/amd/display: Fix tiled display misalignment f2fs: fix write pointers on zoned device after roll forward drm/drm_file: fix use of uninitialized variable drm/framebuffer: Fix use of uninitialized variable drm/mipi-dsi: Fix detach call without attach media: stk1160: Fixed high volume of stk1160_dbg messages media: rockchip: rga: fix swizzling for RGB formats PCI: add INTEL_HDA_ARL to pci_ids.h ALSA: hda: Intel: add HDA_ARL PCI ID support ALSA: hda: intel-dspcfg: add filters for ARL-S and ARL drm/exynos: Call drm_atomic_helper_shutdown() at shutdown/unbind time IB/ipoib: Fix mcast list locking media: ddbridge: fix an error code problem in ddb_probe drm/msm/dpu: Ratelimit framedone timeout msgs clk: hi3620: Fix memory leak in hi3620_mmc_clk_init() clk: mmp: pxa168: Fix memory leak in pxa168_clk_init() watchdog: it87_wdt: Keep WDTCTRL bit 3 unmodified for IT8784/IT8786 drm/amdgpu: Let KFD sync with VM fences drm/amdgpu: Drop 'fence' check in 'to_amdgpu_amdkfd_fence()' leds: trigger: panic: Don't register panic notifier if creating the trigger failed um: Fix naming clash between UML and scheduler um: Don't use vfprintf() for os_info() um: net: Fix return type of uml_net_start_xmit() i3c: master: cdns: Update maximum prescaler value for i2c clock xen/gntdev: Fix the abuse of underlying struct page in DMA-buf import mfd: ti_am335x_tscadc: Fix TI SoC dependencies PCI: Only override AMD USB controller if required PCI: switchtec: Fix stdev_release() crash after surprise hot remove usb: hub: Replace hardcoded quirk value with BIT() macro tty: allow TIOCSLCKTRMIOS with CAP_CHECKPOINT_RESTORE fs/kernfs/dir: obey S_ISGID PCI/AER: Decode Requester ID when no error info found libsubcmd: Fix memory leak in uniq() virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings blk-mq: fix IO hang from sbitmap wakeup race ceph: fix deadlock or deadcode of misusing dget() drm/amd/powerplay: Fix kzalloc parameter 'ATOM_Tonga_PPM_Table' in 'get_platform_power_management_table()' drm/amdgpu: Release 'adev->pm.fw' before return in 'amdgpu_device_need_post()' perf: Fix the nr_addr_filters fix wifi: cfg80211: fix RCU dereference in __cfg80211_bss_update drm: using mul_u32_u32() requires linux/math64.h scsi: isci: Fix an error code problem in isci_io_request_build() scsi: core: Introduce enum scsi_disposition scsi: core: Move scsi_host_busy() out of host lock for waking up EH handler ip6_tunnel: use dev_sw_netstats_rx_add() ip6_tunnel: make sure to pull inner header in __ip6_tnl_rcv() net-zerocopy: Refactor frag-is-remappable test. tcp: add sanity checks to rx zerocopy ixgbe: Remove non-inclusive language ixgbe: Refactor returning internal error codes ixgbe: Refactor overtemp event handling ixgbe: Fix an error handling path in ixgbe_read_iosf_sb_reg_x550() ipv6: Ensure natural alignment of const ipv6 loopback and router addresses llc: call sock_orphan() at release time netfilter: nf_log: replace BUG_ON by WARN_ON_ONCE when putting logger netfilter: nft_ct: sanitize layer 3 and 4 protocol number in custom expectations net: ipv4: fix a memleak in ip_setup_cork af_unix: fix lockdep positive in sk_diag_dump_icons() net: sysfs: Fix /sys/class/net/<iface> path HID: apple: Add support for the 2021 Magic Keyboard HID: apple: Add 2021 magic keyboard FN key mapping bonding: remove print in bond_verify_device_path uapi: stddef.h: Fix __DECLARE_FLEX_ARRAY for C++ PM: sleep: Fix error handling in dpm_prepare() dmaengine: fsl-dpaa2-qdma: Fix the size of dma pools dmaengine: ti: k3-udma: Report short packet errors dmaengine: fsl-qdma: Fix a memory leak related to the status queue DMA dmaengine: fsl-qdma: Fix a memory leak related to the queue command DMA phy: renesas: rcar-gen3-usb2: Fix returning wrong error code dmaengine: fix is_slave_direction() return false when DMA_DEV_TO_DEV phy: ti: phy-omap-usb2: Fix NULL pointer dereference for SRP drm/msm/dp: return correct Colorimetry for DP_TEST_DYNAMIC_RANGE_CEA case net: stmmac: xgmac: fix handling of DPP safety error for DMA channels selftests: net: avoid just another constant wait tunnels: fix out of bounds access when building IPv6 PMTU error atm: idt77252: fix a memleak in open_card_ubr0 hwmon: (aspeed-pwm-tacho) mutex for tach reading hwmon: (coretemp) Fix out-of-bounds memory access hwmon: (coretemp) Fix bogus core_id to attr name mapping inet: read sk->sk_family once in inet_recv_error() rxrpc: Fix response to PING RESPONSE ACKs to a dead call tipc: Check the bearer type before calling tipc_udp_nl_bearer_add() ppp_async: limit MRU to 64K netfilter: nft_compat: reject unused compat flag netfilter: nft_compat: restrict match/target protocol to u16 netfilter: nft_ct: reject direction for ct id netfilter: nft_set_pipapo: store index in scratch maps netfilter: nft_set_pipapo: add helper to release pcpu scratch area netfilter: nft_set_pipapo: remove scratch_aligned pointer scsi: core: Move scsi_host_busy() out of host lock if it is for per-command blk-iocost: Fix an UBSAN shift-out-of-bounds warning net/af_iucv: clean up a try_then_request_module() USB: serial: qcserial: add new usb-id for Dell Wireless DW5826e USB: serial: option: add Fibocom FM101-GL variant USB: serial: cp210x: add ID for IMST iM871A-USB usb: host: xhci-plat: Add support for XHCI_SG_TRB_CACHE_SIZE_QUIRK hrtimer: Report offline hrtimer enqueue Input: i8042 - fix strange behavior of touchpad on Clevo NS70PU Input: atkbd - skip ATKBD_CMD_SETLEDS when skipping ATKBD_CMD_GETID vhost: use kzalloc() instead of kmalloc() followed by memset() clocksource: Skip watchdog check for large watchdog intervals net: stmmac: xgmac: use #define for string constants net: stmmac: xgmac: fix a typo of register name in DPP safety handling netfilter: nft_set_rbtree: skip end interval element from gc btrfs: forbid creating subvol qgroups btrfs: do not ASSERT() if the newly created subvolume already got read btrfs: forbid deleting live subvol qgroup btrfs: send: return EOPNOTSUPP on unknown flags of: unittest: Fix compile in the non-dynamic case net: openvswitch: limit the number of recursions from action sets spi: ppc4xx: Drop write-only variable ASoC: rt5645: Fix deadlock in rt5645_jack_detect_work() net: sysfs: Fix /sys/class/net/<iface> path for statistics MIPS: Add 'memory' clobber to csum_ipv6_magic() inline assembler i40e: Fix waiting for queues of all VSIs to be disabled tracing/trigger: Fix to return error if failed to alloc snapshot mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again ALSA: hda/realtek: Fix the external mic not being recognised for Acer Swift 1 SF114-32 ALSA: hda/realtek: Enable Mute LED on HP Laptop 14-fq0xxx HID: wacom: generic: Avoid reporting a serial of '0' to userspace HID: wacom: Do not register input devices until after hid_hw_start usb: ucsi_acpi: Fix command completion handling USB: hub: check for alternate port before enabling A_ALT_HNP_SUPPORT usb: f_mass_storage: forbid async queue when shutdown happen media: ir_toy: fix a memleak in irtoy_tx powerpc/kasan: Fix addr error caused by page alignment i2c: i801: Remove i801_set_block_buffer_mode i2c: i801: Fix block process call transactions modpost: trim leading spaces when processing source files list scsi: Revert "scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock" lsm: fix the logic in security_inode_getsecctx() firewire: core: correct documentation of fw_csr_string() kernel API kbuild: Fix changing ELF file type for output of gen_btf for big endian nfc: nci: free rx_data_reassembly skb on NCI device cleanup net: hsr: remove WARN_ONCE() in send_hsr_supervision_frame() xen-netback: properly sync TX responses ALSA: hda/realtek: Enable headset mic on Vaio VJFE-ADL binder: signal epoll threads of self-work misc: fastrpc: Mark all sessions as invalid in cb_remove ext4: fix double-free of blocks due to wrong extents moved_len tracing: Fix wasted memory in saved_cmdlines logic staging: iio: ad5933: fix type mismatch regression iio: magnetometer: rm3100: add boundary check for the value read from RM3100_REG_TMRC iio: accel: bma400: Fix a compilation problem media: rc: bpf attach/detach requires write permission hv_netvsc: Fix race condition between netvsc_probe and netvsc_remove ring-buffer: Clean ring_buffer_poll_wait() error return serial: max310x: set default value when reading clock ready bit serial: max310x: improve crystal stable clock detection x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6 x86/mm/ident_map: Use gbpages only where full GB page should be mapped. mmc: slot-gpio: Allow non-sleeping GPIO ro ALSA: hda/conexant: Add quirk for SWS JS201D nilfs2: fix data corruption in dsync block recovery for small block sizes nilfs2: fix hang in nilfs_lookup_dirty_data_buffers() crypto: ccp - Fix null pointer dereference in __sev_platform_shutdown_locked nfp: use correct macro for LengthSelect in BAR config nfp: flower: prevent re-adding mac index for bonded port wifi: mac80211: reload info pointer in ieee80211_tx_dequeue() irqchip/irq-brcmstb-l2: Add write memory barrier before exit irqchip/gic-v3-its: Fix GICv4.1 VPE affinity update s390/qeth: Fix potential loss of L3-IP@ in case of network issues ceph: prevent use-after-free in encode_cap_msg() of: property: fix typo in io-channels can: j1939: Fix UAF in j1939_sk_match_filter during setsockopt(SO_J1939_FILTER) pmdomain: core: Move the unused cleanup to a _sync initcall tracing: Inform kmemleak of saved_cmdlines allocation Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d" bus: moxtet: Add spi device table PCI: dwc: endpoint: Fix dw_pcie_ep_raise_msix_irq() alignment support mips: Fix max_mapnr being uninitialized on early stages crypto: lib/mpi - Fix unexpected pointer access in mpi_ec_init serial: Add rs485_supported to uart_port serial: 8250_exar: Fill in rs485_supported serial: 8250_exar: Set missing rs485_supported flag scripts/decode_stacktrace.sh: silence stderr messages from addr2line/nm scripts/decode_stacktrace.sh: support old bash version scripts: decode_stacktrace: demangle Rust symbols scripts/decode_stacktrace.sh: optionally use LLVM utilities netfilter: ipset: fix performance regression in swap operation netfilter: ipset: Missing gc cancellations fixed hrtimer: Ignore slack time for RT tasks in schedule_hrtimeout_range() Revert "arm64: Stash shadow stack pointer in the task struct on interrupt" net: prevent mss overflow in skb_segment() sched/membarrier: reduce the ability to hammer on sys_membarrier nilfs2: fix potential bug in end_buffer_async_write nilfs2: replace WARN_ONs for invalid DAT metadata block requests dm: limit the number of targets and parameter size area PM: runtime: add devm_pm_runtime_enable helper PM: runtime: Have devm_pm_runtime_enable() handle pm_runtime_dont_use_autosuspend() drm/msm/dsi: Enable runtime PM netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval() net: bcmgenet: Fix EEE implementation PCI: dwc: Fix a 64bit bug in dw_pcie_ep_raise_msix_irq() Linux 5.10.210 Change-Id: I5e7327f58dd6abd26ac2b1e328a81c1010d1147c Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
Lukas Schauer
|
162ae0e78b |
pipe: wakeup wr_wait after setting max_usage
[ Upstream commit e95aada4cb93d42e25c30a0ef9eb2923d9711d4a ] Commit |
||
Max Kellermann
|
b6f27626f5 |
fs/pipe: move check to pipe_has_watch_queue()
[ Upstream commit b4bd6b4bac8edd61eb8f7b836969d12c0c6af165 ] This declutters the code by reducing the number of #ifdefs and makes the watch_queue checks simpler. This has no runtime effect; the machine code is identical. Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Message-Id: <20230921075755.1378787-2-max.kellermann@ionos.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: e95aada4cb93 ("pipe: wakeup wr_wait after setting max_usage") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Greg Kroah-Hartman
|
2de0a17df4 |
This is the 5.10.120 stable release
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmKdoegACgkQONu9yGCS aT6ytRAAjBTL+El1JeJ5W14PjtQEl45XhEqJ300SuF9Ob0ZBgooPbBa3rwPBw05T xpyT4/vLGmG87s2+KTUkAtfX8lhoWLbE+xx4zrOx49plbFVYYvqugzvzliXvZ8Zb 2h1aP0SG8hrIFHMsN+qOZnmY3k9m8Z7rNZu/Eyq8zQ/z0SCmQ0EExq3PiKskrfyb 4AcC5Y77UfUl9r7loyFVAPj5z5AOyE+d/5biFdLsJgHa+qHmoYTMKYy53BF8aD3v jIOZ4JzYZ+ybtkGSrPtXmay02c15TqWXgzlYRjpYQm75C69yFugxFt5usBsVAZmK JraaBEr13EdZhBvove2cV0ZK8afJfwoo2NuTBuxw52ZmEDvifkb7ESwPa/1kbAqI 267e+yTtqRoge2STMXqU4J3GNbMNf9vmOq4x6NaaPF+Q2K05Z9xO64yh0uxhYx7i n/EoT2ESOrWguhICf+Gets58dmY6jbWNzlCKFJnbXiuYvx1AxcS1nzP0OrzeWsq8 qii9QS8VouzLnKGXanRZKmjznxSWMyQ3UjA/u4pqL4ISsDy25ICxeSemLvtHIdSU Ksd+RQgL39e+V9ZXUZ+KQ6zm6a4gKXHsqxuq1UO4xzxYNnR7sgvDS4ACquLRae+N ISNrz0LP0bDDXHFK4ElU0LBs5jwdYRm67TSVSAOVjCl5B5tB8zk= =o88P -----END PGP SIGNATURE----- Merge 5.10.120 into android12-5.10-lts Changes in 5.10.120 pinctrl: sunxi: fix f1c100s uart2 function percpu_ref_init(): clean ->percpu_count_ref on failure net: af_key: check encryption module availability consistency nfc: pn533: Fix buggy cleanup order net: ftgmac100: Disable hardware checksum on AST2600 i2c: ismt: Provide a DMA buffer for Interrupt Cause Logging drivers: i2c: thunderx: Allow driver to work with ACPI defined TWSI controllers netfilter: nf_tables: disallow non-stateful expression in sets earlier pipe: make poll_usage boolean and annotate its access pipe: Fix missing lock in pipe_resize_ring() cfg80211: set custom regdomain after wiphy registration assoc_array: Fix BUG_ON during garbage collect io_uring: don't re-import iovecs from callbacks io_uring: fix using under-expanded iters net: ipa: compute proper aggregation limit xfs: detect overflows in bmbt records xfs: show the proper user quota options xfs: fix the forward progress assertion in xfs_iwalk_run_callbacks xfs: fix an ABBA deadlock in xfs_rename xfs: Fix CIL throttle hang when CIL space used going backwards drm/i915: Fix -Wstringop-overflow warning in call to intel_read_wm_latency() exfat: check if cluster num is valid lib/crypto: add prompts back to crypto libraries crypto: drbg - prepare for more fine-grained tracking of seeding state crypto: drbg - track whether DRBG was seeded with !rng_is_initialized() crypto: drbg - move dynamic ->reseed_threshold adjustments to __drbg_seed() crypto: drbg - make reseeding from get_random_bytes() synchronous netfilter: nf_tables: sanitize nft_set_desc_concat_parse() netfilter: conntrack: re-fetch conntrack after insertion KVM: PPC: Book3S HV: fix incorrect NULL check on list iterator x86/kvm: Alloc dummy async #PF token outside of raw spinlock x86, kvm: use correct GFP flags for preemption disabled KVM: x86: avoid calling x86 emulator without a decoded instruction crypto: caam - fix i.MX6SX entropy delay value crypto: ecrdsa - Fix incorrect use of vli_cmp zsmalloc: fix races between asynchronous zspage free and page migration Bluetooth: hci_qca: Use del_timer_sync() before freeing ARM: dts: s5pv210: Correct interrupt name for bluetooth in Aries dm integrity: fix error code in dm_integrity_ctr() dm crypt: make printing of the key constant-time dm stats: add cond_resched when looping over entries dm verity: set DM_TARGET_IMMUTABLE feature flag raid5: introduce MD_BROKEN HID: multitouch: Add support for Google Whiskers Touchpad HID: multitouch: add quirks to enable Lenovo X12 trackpoint tpm: Fix buffer access in tpm2_get_tpm_pt() tpm: ibmvtpm: Correct the return value in tpm_ibmvtpm_probe() docs: submitting-patches: Fix crossref to 'The canonical patch format' NFS: Memory allocation failures are not server fatal errors NFSD: Fix possible sleep during nfsd4_release_lockowner() bpf: Fix potential array overflow in bpf_trampoline_get_progs() bpf: Enlarge offset check value to INT_MAX in bpf_skb_{load,store}_bytes Linux 5.10.120 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I48c0d649a50bd16ad719b2cb9f0ffccd0a74519a |
||
David Howells
|
8fbd54ab06 |
pipe: Fix missing lock in pipe_resize_ring()
commit 189b0ddc245139af81198d1a3637cac74f96e13a upstream.
pipe_resize_ring() needs to take the pipe->rd_wait.lock spinlock to
prevent post_one_notification() from trying to insert into the ring
whilst the ring is being replaced.
The occupancy check must be done after the lock is taken, and the lock
must be taken after the new ring is allocated.
The bug can lead to an oops looking something like:
BUG: KASAN: use-after-free in post_one_notification.isra.0+0x62e/0x840
Read of size 4 at addr ffff88801cc72a70 by task poc/27196
...
Call Trace:
post_one_notification.isra.0+0x62e/0x840
__post_watch_notification+0x3b7/0x650
key_create_or_update+0xb8b/0xd20
__do_sys_add_key+0x175/0x340
__x64_sys_add_key+0xbe/0x140
do_syscall_64+0x5c/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Reported by Selim Enes Karaduman @Enesdex working with Trend Micro Zero
Day Initiative.
Fixes:
|
||
Kuniyuki Iwashima
|
cd720fad8b |
pipe: make poll_usage boolean and annotate its access
commit f485922d8fe4e44f6d52a5bb95a603b7c65554bb upstream. Patch series "Fix data-races around epoll reported by KCSAN." This series suppresses a false positive KCSAN's message and fixes a real data-race. This patch (of 2): pipe_poll() runs locklessly and assigns 1 to poll_usage. Once poll_usage is set to 1, it never changes in other places. However, concurrent writes of a value trigger KCSAN, so let's make KCSAN happy. BUG: KCSAN: data-race in pipe_poll / pipe_poll write to 0xffff8880042f6678 of 4 bytes by task 174 on cpu 3: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) write to 0xffff8880042f6678 of 4 bytes by task 177 on cpu 1: pipe_poll (fs/pipe.c:656) ep_item_poll.isra.0 (./include/linux/poll.h:88 fs/eventpoll.c:853) do_epoll_wait (fs/eventpoll.c:1692 fs/eventpoll.c:1806 fs/eventpoll.c:2234) __x64_sys_epoll_wait (fs/eventpoll.c:2246 fs/eventpoll.c:2241 fs/eventpoll.c:2241) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 177 Comm: epoll_race Not tainted 5.17.0-58927-gf443e374ae13 #6 Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014 Link: https://lkml.kernel.org/r/20220322002653.33865-1-kuniyu@amazon.co.jp Link: https://lkml.kernel.org/r/20220322002653.33865-2-kuniyu@amazon.co.jp Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kuniyuki Iwashima <kuni1840@gmail.com> Cc: "Soheil Hassas Yeganeh" <soheil@google.com> Cc: "Sridhar Samudrala" <sridhar.samudrala@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Greg Kroah-Hartman
|
5287773dba |
This is the 5.10.106 stable release
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmIx4yUACgkQONu9yGCS aT5q2A/+OlbFfkWLW5T0p54BIaXQGduXCtKyG81gZeJJPjQH0Meh6rtzea3n9/2/ ESgOBWmoGCX1I9dq8oWgE+krdg4Bu1pSsyWELlVBD5YeovsrdMJokfbZ5AW7wioO ca/sZpaysp0TiluGSwkU2eYLOeaUNvSUrLPoAK/BjZ4W/INXKQyY3gaG9SqhSEiY ag13I9BLvltepYWQbgq2MJzI5vKq3DHiQSC4/wDA1J02rfa8+C3pTSYZVKES6+6/ NVend811EpI/XQeQhsTP4wVtOYODJyVrXFIcV2n4MIsw80MhlomJjI/3BYiIYTf4 aYqwdGn0CZTQDCAC/HkNJbttcSw5wyJHGDG3e/QQArQ4L0Tu2ah9pXygIkMs9RXj ClUlR2sE0HQBtWrMxz9qT5Yo0F33ghU40VaDGUpX9QNs0Tat5hqGpj3KNVutxrk5 D5U5ueXA8DysmjWv8+oawUoRCuYa4NfPPHA9LExkn+oatu/GdCT6j3Ih5pxOOorw uWijcyvElOvSRvOzBbrLcHmbqV9WIM3cO2ob5I+HbV/agCoytnZS487KY4wp/ZRQ hd/gc03hIJiyJxIWAFosDAHJYsSnyVHTMDqDNF2EQ/Myto0Tf4idAjThCAl8pE1K WmcmqArZei0PfFfa68jOtd5CTE/dBQHwamSaQx0Rf9UkxN7sOko= =PEUE -----END PGP SIGNATURE----- Merge 5.10.106 into android12-5.10-lts Changes in 5.10.106 ARM: boot: dts: bcm2711: Fix HVS register range clk: qcom: gdsc: Add support to update GDSC transition delay HID: vivaldi: fix sysfs attributes leak arm64: dts: armada-3720-turris-mox: Add missing ethernet0 alias tipc: fix kernel panic when enabling bearer mISDN: Remove obsolete PIPELINE_DEBUG debugging information mISDN: Fix memory leak in dsp_pipeline_build() virtio-blk: Don't use MAX_DISCARD_SEGMENTS if max_discard_seg is zero isdn: hfcpci: check the return value of dma_set_mask() in setup_hw() net: qlogic: check the return value of dma_alloc_coherent() in qed_vf_hw_prepare() esp: Fix BEET mode inter address family tunneling on GSO qed: return status of qed_iov_get_link drm/sun4i: mixer: Fix P010 and P210 format numbers net: dsa: mt7530: fix incorrect test in mt753x_phylink_validate() ARM: dts: aspeed: Fix AST2600 quad spi group i40e: stop disabling VFs due to PF error responses ice: stop disabling VFs due to PF error responses ice: Align macro names to the specification ice: Remove unnecessary checker loop ice: Rename a couple of variables ice: Fix curr_link_speed advertised speed ethernet: Fix error handling in xemaclite_of_probe tipc: fix incorrect order of state message data sanity check net: ethernet: ti: cpts: Handle error for clk_enable net: ethernet: lpc_eth: Handle error for clk_enable ax25: Fix NULL pointer dereference in ax25_kill_by_device net/mlx5: Fix size field in bufferx_reg struct net/mlx5: Fix a race on command flush flow net/mlx5e: Lag, Only handle events from highest priority multipath entry NFC: port100: fix use-after-free in port100_send_complete selftests: pmtu.sh: Kill tcpdump processes launched by subshell. gpio: ts4900: Do not set DAT and OE together gianfar: ethtool: Fix refcount leak in gfar_get_ts_info net: phy: DP83822: clear MISR2 register to disable interrupts sctp: fix kernel-infoleak for SCTP sockets net: bcmgenet: Don't claim WOL when its not available selftests/bpf: Add test for bpf_timer overwriting crash spi: rockchip: Fix error in getting num-cs property spi: rockchip: terminate dma transmission when slave abort net-sysfs: add check for netdevice being present to speed_show hwmon: (pmbus) Clear pmbus fault/warning bits after read gpio: Return EPROBE_DEFER if gc->to_irq is NULL Revert "xen-netback: remove 'hotplug-status' once it has served its purpose" Revert "xen-netback: Check for hotplug-status existence before watching" ipv6: prevent a possible race condition with lifetimes tracing: Ensure trace buffer is at least 4096 bytes large selftest/vm: fix map_fixed_noreplace test failure selftests/memfd: clean up mapping in mfd_fail_write ARM: Spectre-BHB: provide empty stub for non-config fuse: fix pipe buffer lifetime for direct_io staging: rtl8723bs: Fix access-point mode deadlock staging: gdm724x: fix use after free in gdm_lte_rx() net: macb: Fix lost RX packet wakeup race in NAPI receive mmc: meson: Fix usage of meson_mmc_post_req() riscv: Fix auipc+jalr relocation range checks arm64: dts: marvell: armada-37xx: Remap IO space to bus address 0x0 virtio: unexport virtio_finalize_features virtio: acknowledge all features before access watch_queue, pipe: Free watchqueue state after clearing pipe ring watch_queue: Fix to release page in ->release() watch_queue: Fix to always request a pow-of-2 pipe ring size watch_queue: Fix the alloc bitmap size to reflect notes allocated watch_queue: Free the alloc bitmap when the watch_queue is torn down watch_queue: Fix lack of barrier/sync/lock between post and read watch_queue: Make comment about setting ->defunct more accurate x86/boot: Fix memremap of setup_indirect structures x86/boot: Add setup_indirect support in early_memremap_is_setup_data() x86/traps: Mark do_int3() NOKPROBE_SYMBOL ext4: add check to prevent attempting to resize an fs with sparse_super2 ARM: fix Thumb2 regression with Spectre BHB watch_queue: Fix filter limit check Linux 5.10.106 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: Ic7943bdf8c771bff4a95fcf0585ec9c24057cb5b |
||
David Howells
|
ec03510e0a |
watch_queue: Fix lack of barrier/sync/lock between post and read
commit 2ed147f015af2b48f41c6f0b6746aa9ea85c19f3 upstream.
There's nothing to synchronise post_one_notification() versus
pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the
reader only takes pipe->mutex which cannot bar notification posting as
that may need to be made from contexts that cannot sleep.
Fix this by setting pipe->head with a barrier in post_one_notification()
and reading pipe->head with a barrier in pipe_read().
If that's not sufficient, the rd_wait.lock will need to be taken,
possibly in a ->confirm() op so that it only applies to notifications.
The lock would, however, have to be dropped before copy_page_to_iter()
is invoked.
Fixes:
|
||
David Howells
|
d729d4e99f |
watch_queue, pipe: Free watchqueue state after clearing pipe ring
commit db8facfc9fafacefe8a835416a6b77c838088f8b upstream.
In free_pipe_info(), free the watchqueue state after clearing the pipe
ring as each pipe ring descriptor has a release function, and in the
case of a notification message, this is watch_queue_pipe_buf_release()
which tries to mark the allocation bitmap that was previously released.
Fix this by moving the put of the pipe's ref on the watch queue to after
the ring has been cleared. We still need to call watch_queue_clear()
before doing that to make sure that the pipe is disconnected from any
notification sources first.
Fixes:
|
||
Greg Kroah-Hartman
|
4b20d2de0b |
Revert "pipe: avoid unnecessary EPOLLET wakeups under normal loads"
This reverts commit
|
||
Greg Kroah-Hartman
|
b6e7497caf |
Revert "pipe: do FASYNC notifications for every pipe IO, not just state changes"
This reverts commit
|
||
Linus Torvalds
|
3b2018f9c9 |
pipe: do FASYNC notifications for every pipe IO, not just state changes
commit fe67f4dd8daa252eb9aa7acb61555f3cc3c1ce4c upstream. It turns out that the SIGIO/FASYNC situation is almost exactly the same as the EPOLLET case was: user space really wants to be notified after every operation. Now, in a perfect world it should be sufficient to only notify user space on "state transitions" when the IO state changes (ie when a pipe goes from unreadable to readable, or from unwritable to writable). User space should then do as much as possible - fully emptying the buffer or what not - and we'll notify it again the next time the state changes. But as with EPOLLET, we have at least one case (stress-ng) where the kernel sent SIGIO due to the pipe being marked for asynchronous notification, but the user space signal handler then didn't actually necessarily read it all before returning (it read more than what was written, but since there could be multiple writes, it could leave data pending). The user space code then expected to get another SIGIO for subsequent writes - even though the pipe had been readable the whole time - and would only then read more. This is arguably a user space bug - and Colin King already fixed the stress-ng code in question - but the kernel regression rules are clear: it doesn't matter if kernel people think that user space did something silly and wrong. What matters is that it used to work. So if user space depends on specific historical kernel behavior, it's a regression when that behavior changes. It's on us: we were silly to have that non-optimal historical behavior, and our old kernel behavior was what user space was tested against. Because of how the FASYNC notification was tied to wakeup behavior, this was first broken by commits |
||
Linus Torvalds
|
e91da23c1b |
pipe: avoid unnecessary EPOLLET wakeups under normal loads
commit 3b844826b6c6affa80755254da322b017358a2f4 upstream. I had forgotten just how sensitive hackbench is to extra pipe wakeups, and commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers") ended up causing a quite noticeable regression on larger machines. Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's not clear that this matters in real life all that much, but as Mel points out, it's used often enough when comparing kernels and so the performance regression shows up like a sore thumb. It's easy enough to fix at least for the common cases where pipes are used purely for data transfer, and you never have any exciting poll usage at all. So set a special 'poll_usage' flag when there is polling activity, and make the ugly "EPOLLET has crazy legacy expectations" semantics explicit to only that case. I would love to limit it to just the broken EPOLLET case, but the pipe code can't see the difference between epoll and regular select/poll, so any non-read/write waiting will trigger the extra wakeup behavior. That is sufficient for at least the hackbench case. Apart from making the odd extra wakeup cases more explicitly about EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe write, not at the first write chunk. That is actually much saner semantics (as much as you can call any of the legacy edge-triggered expectations for EPOLLET "sane") since it means that you know the wakeup will happen once the write is done, rather than possibly in the middle of one. [ For stable people: I'm putting a "Fixes" tag on this, but I leave it up to you to decide whether you actually want to backport it or not. It likely has no impact outside of synthetic benchmarks - Linus ] Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/ Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers") Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Sandeep Patil <sspatil@android.com> Tested-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Alex Xu (Hello71)
|
6b5a3d2c2b |
pipe: increase minimum default pipe size to 2 pages
commit 46c4c9d1beb7f5b4cec4dd90e7728720583ee348 upstream.
This program always prints 4096 and hangs before the patch, and always
prints 8192 and exits successfully after:
int main()
{
int pipefd[2];
for (int i = 0; i < 1025; i++)
if (pipe(pipefd) == -1)
return 1;
size_t bufsz = fcntl(pipefd[1], F_GETPIPE_SZ);
printf("%zd\n", bufsz);
char *buf = calloc(bufsz, 1);
write(pipefd[1], buf, bufsz);
read(pipefd[0], buf, bufsz-1);
write(pipefd[1], buf, 1);
}
Note that you may need to increase your RLIMIT_NOFILE before running the
program.
Fixes:
|
||
Linus Torvalds
|
27aa7171fe |
pipe: make pipe writes always wake up readers
commit 3a34b13a88caeb2800ab44a4918f230041b37dd9 upstream. Since commit |
||
Johannes Berg
|
e857271389 |
fs/pipe: allow sendfile() to pipe again
commit f8ad8187c3b536ee2b10502a8340c014204a1af0 upstream. After commit |
||
Linus Torvalds
|
5b697f86f9 |
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fix from Al Viro: "Fixes an obvious bug (memory leak introduced in 5.8)" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: pipe: Fix memory leaks in create_pipe_files() |
||
Linus Torvalds
|
472e5b056f |
pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by using a non-specific "pipe_wait()" that waited for any pipe event to happen, which depended on all pipe IO being entirely serialized by the pipe lock. So by checking the state you were waiting for, and then adding yourself to the wait queue before dropping the lock, you were guaranteed to see all the wakeups. Strictly speaking, the actual wakeups were not done under the lock, but the pipe_wait() model still worked, because since the waiter held the lock when checking whether it should sleep, it would always see the current state, and the wakeup was always done after updating the state. However, commit |
||
Qian Cai
|
8a018eb55e |
pipe: Fix memory leaks in create_pipe_files()
Calling pipe2() with O_NOTIFICATION_PIPE could results in memory leaks unless watch_queue_init() is successful. In case of watch_queue_init() failure in pipe2() we are left with inode and pipe_inode_info instances that need to be freed. That failure exit has been introduced in commit |
||
Linus Torvalds
|
6c32978414 |
Notifications over pipes + Keyring notifications
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7U/i8ACgkQ+7dXa6fL C2u2eg/+Oy6ybq0hPovYVkFI9WIG7ZCz7w9Q6BEnfYMqqn3dnfJxKQ3l4pnQEOWw f4QfvpvevsYfMtOJkYcG6s66rQgbFdqc5TEyBBy0QNp3acRolN7IXkcopvv9xOpQ JxedpbFG1PTFLWjvBpyjlrUPouwLzq2FXAf1Ox0ZIMw6165mYOMWoli1VL8dh0A0 Ai7JUB0WrvTNbrwhV413obIzXT/rPCdcrgbQcgrrLPex8lQ47ZAE9bq6k4q5HiwK KRzEqkQgnzId6cCNTFBfkTWsx89zZunz7jkfM5yx30MvdAtPSxvvpfIPdZRZkXsP E2K9Fk1/6OQZTC0Op3Pi/bt+hVG/mD1p0sQUDgo2MO3qlSS+5mMkR8h3mJEgwK12 72P4YfOJkuAy2z3v4lL0GYdUDAZY6i6G8TMxERKu/a9O3VjTWICDOyBUS6F8YEAK C7HlbZxAEOKTVK0BTDTeEUBwSeDrBbvH6MnRlZCG5g1Fos2aWP0udhjiX8IfZLO7 GN6nWBvK1fYzfsUczdhgnoCzQs3suoDo04HnsTPGJ8De52T4x2RsjV+gPx0nrNAq eWChl1JvMWsY2B3GLnl9XQz4NNN+EreKEkk+PULDGllrArrPsp5Vnhb9FJO1PVCU hMDJHohPiXnKbc8f4Bd78OhIvnuoGfJPdM5MtNe2flUKy2a2ops= =YTGf -----END PGP SIGNATURE----- Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull notification queue from David Howells: "This adds a general notification queue concept and adds an event source for keys/keyrings, such as linking and unlinking keys and changing their attributes. Thanks to Debarshi Ray, we do have a pull request to use this to fix a problem with gnome-online-accounts - as mentioned last time: https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47 Without this, g-o-a has to constantly poll a keyring-based kerberos cache to find out if kinit has changed anything. [ There are other notification pending: mount/sb fsinfo notifications for libmount that Karel Zak and Ian Kent have been working on, and Christian Brauner would like to use them in lxc, but let's see how this one works first ] LSM hooks are included: - A set of hooks are provided that allow an LSM to rule on whether or not a watch may be set. Each of these hooks takes a different "watched object" parameter, so they're not really shareable. The LSM should use current's credentials. [Wanted by SELinux & Smack] - A hook is provided to allow an LSM to rule on whether or not a particular message may be posted to a particular queue. This is given the credentials from the event generator (which may be the system) and the watch setter. [Wanted by Smack] I've provided SELinux and Smack with implementations of some of these hooks. WHY === Key/keyring notifications are desirable because if you have your kerberos tickets in a file/directory, your Gnome desktop will monitor that using something like fanotify and tell you if your credentials cache changes. However, we also have the ability to cache your kerberos tickets in the session, user or persistent keyring so that it isn't left around on disk across a reboot or logout. Keyrings, however, cannot currently be monitored asynchronously, so the desktop has to poll for it - not so good on a laptop. This facility will allow the desktop to avoid the need to poll. DESIGN DECISIONS ================ - The notification queue is built on top of a standard pipe. Messages are effectively spliced in. The pipe is opened with a special flag: pipe2(fds, O_NOTIFICATION_PIPE); The special flag has the same value as O_EXCL (which doesn't seem like it will ever be applicable in this context)[?]. It is given up front to make it a lot easier to prohibit splice&co from accessing the pipe. [?] Should this be done some other way? I'd rather not use up a new O_* flag if I can avoid it - should I add a pipe3() system call instead? The pipe is then configured:: ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter); Messages are then read out of the pipe using read(). - It should be possible to allow write() to insert data into the notification pipes too, but this is currently disabled as the kernel has to be able to insert messages into the pipe *without* holding pipe->mutex and the code to make this work needs careful auditing. - sendfile(), splice() and vmsplice() are disabled on notification pipes because of the pipe->mutex issue and also because they sometimes want to revert what they just did - but one or more notification messages might've been interleaved in the ring. - The kernel inserts messages with the wait queue spinlock held. This means that pipe_read() and pipe_write() have to take the spinlock to update the queue pointers. - Records in the buffer are binary, typed and have a length so that they can be of varying size. This allows multiple heterogeneous sources to share a common buffer; there are 16 million types available, of which I've used just a few, so there is scope for others to be used. Tags may be specified when a watchpoint is created to help distinguish the sources. - Records are filterable as types have up to 256 subtypes that can be individually filtered. Other filtration is also available. - Notification pipes don't interfere with each other; each may be bound to a different set of watches. Any particular notification will be copied to all the queues that are currently watching for it - and only those that are watching for it. - When recording a notification, the kernel will not sleep, but will rather mark a queue as having lost a message if there's insufficient space. read() will fabricate a loss notification message at an appropriate point later. - The notification pipe is created and then watchpoints are attached to it, using one of: keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); watch_mount(AT_FDCWD, "/", 0, fd, 0x02); watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03); where in both cases, fd indicates the queue and the number after is a tag between 0 and 255. - Watches are removed if either the notification pipe is destroyed or the watched object is destroyed. In the latter case, a message will be generated indicating the enforced watch removal. Things I want to avoid: - Introducing features that make the core VFS dependent on the network stack or networking namespaces (ie. usage of netlink). - Dumping all this stuff into dmesg and having a daemon that sits there parsing the output and distributing it as this then puts the responsibility for security into userspace and makes handling namespaces tricky. Further, dmesg might not exist or might be inaccessible inside a container. - Letting users see events they shouldn't be able to see. TESTING AND MANPAGES ==================== - The keyutils tree has a pipe-watch branch that has keyctl commands for making use of notifications. Proposed manual pages can also be found on this branch, though a couple of them really need to go to the main manpages repository instead. If the kernel supports the watching of keys, then running "make test" on that branch will cause the testing infrastructure to spawn a monitoring process on the side that monitors a notifications pipe for all the key/keyring changes induced by the tests and they'll all be checked off to make sure they happened. https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch - A test program is provided (samples/watch_queue/watch_test) that can be used to monitor for keyrings, mount and superblock events. Information on the notifications is simply logged to stdout" * tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: smack: Implement the watch_key and post_notification hooks selinux: Implement the watch_key security hook keys: Make the KEY_NEED_* perms an enum rather than a mask pipe: Add notification lossage handling pipe: Allow buffers to be marked read-whole-or-error for notifications Add sample notification program watch_queue: Add a key/keyring notification facility security: Add hooks to rule on setting a watch pipe: Add general notification queue support pipe: Add O_NOTIFICATION_PIPE security: Add a hook for the point of notification insertion uapi: General notification queue definitions |
||
Christoph Hellwig
|
c928f642c2 |
fs: rename pipe_buf ->steal to ->try_steal
And replace the arcane return value convention with a simple bool where true means success and false means failure. [AV: braino fix folded in] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
Christoph Hellwig
|
b8d9e7f241 |
fs: make the pipe_buf_operations ->confirm operation optional
Just return 0 for success if it is not present. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
Christoph Hellwig
|
f6dd975583 |
pipe: merge anon_pipe_buf*_ops
All the op vectors are exactly the same, they are just used to encode packet or nomerge behavior. There already is a flag for the packet behavior, so just add a new one to allow for merging. Inverting it vs the previous nomerge special casing actually allows for much nicer code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
David Howells
|
e7d553d69c |
pipe: Add notification lossage handling
Add handling for loss of notifications by having read() insert a loss-notification message after it has read the pipe buffer that was last in the ring when the loss occurred. Lossage can come about either by running out of notification descriptors or by running out of space in the pipe ring. Signed-off-by: David Howells <dhowells@redhat.com> |
||
David Howells
|
8cfba76383 |
pipe: Allow buffers to be marked read-whole-or-error for notifications
Allow a buffer to be marked such that read() must return the entire buffer in one go or return ENOBUFS. Multiple buffers can be amalgamated into a single read, but a short read will occur if the next "whole" buffer won't fit. This is useful for watch queue notifications to make sure we don't split a notification across multiple reads, especially given that we need to fabricate an overrun record under some circumstances - and that isn't in the buffers. Signed-off-by: David Howells <dhowells@redhat.com> |
||
David Howells
|
c73be61ced |
pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a standard pipe. Notifications are 'spliced' into the pipe and then read out. splice(), vmsplice() and sendfile() are forbidden on pipes used for notifications as post_one_notification() cannot take pipe->mutex. This means that notifications could be posted in between individual pipe buffers, making iov_iter_revert() difficult to effect. The way the notification queue is used is: (1) An application opens a pipe with a special flag and indicates the number of messages it wishes to be able to queue at once (this can only be set once): pipe2(fds, O_NOTIFICATION_PIPE); ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); (2) The application then uses poll() and read() as normal to extract data from the pipe. read() will return multiple notifications if the buffer is big enough, but it will not split a notification across buffers - rather it will return a short read or EMSGSIZE. Notification messages include a length in the header so that the caller can split them up. Each message has a header that describes it: struct watch_notification { __u32 type:24; __u32 subtype:8; __u32 info; }; The type indicates the source (eg. mount tree changes, superblock events, keyring changes, block layer events) and the subtype indicates the event type (eg. mount, unmount; EIO, EDQUOT; link, unlink). The info field indicates a number of things, including the entry length, an ID assigned to a watchpoint contributing to this buffer and type-specific flags. Supplementary data, such as the key ID that generated an event, can be attached in additional slots. The maximum message size is 127 bytes. Messages may not be padded or aligned, so there is no guarantee, for example, that the notification type will be on a 4-byte bounary. Signed-off-by: David Howells <dhowells@redhat.com> |
||
Roman Gushchin
|
f4b00eab50 |
mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page()
Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page() to better reflect what they are actually doing: 1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge the current memcg 2) set or clear the PageKmemcg flag Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
6551d5c56e |
pipe: make sure to wake up everybody when the last reader/writer closes
Andrei Vagin reported that commit |
||
Linus Torvalds
|
0ddad21d3e |
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting for readers and writers, avoiding a nasty thundering herd problem when there are lots of readers waiting for data on a pipe (or, less commonly, lots of writers waiting for a pipe to have space). While this isn't a common occurrence in the traditional "use a pipe as a data transport" case, where you typically only have a single reader and a single writer process, there is one common special case: using a pipe as a source of "locking tokens" rather than for data communication. In particular, the GNU make jobserver code ends up using a pipe as a way to limit parallelism, where each job consumes a token by reading a byte from the jobserver pipe, and releases the token by writing a byte back to the pipe. This pattern is fairly traditional on Unix, and works very well, but will waste a lot of time waking up a lot of processes when only a single reader needs to be woken up when a writer releases a new token. A simplified test-case of just this pipe interaction is to create 64 processes, and then pass a single token around between them (this test-case also intentionally passes another token that gets ignored to test the "wake up next" logic too, in case anybody wonders about it): #include <unistd.h> int main(int argc, char **argv) { int fd[2], counters[2]; pipe(fd); counters[0] = 0; counters[1] = -1; write(fd[1], counters, sizeof(counters)); /* 64 processes */ fork(); fork(); fork(); fork(); fork(); fork(); do { int i; read(fd[0], &i, sizeof(i)); if (i < 0) continue; counters[0] = i+1; write(fd[1], counters, (1+(i & 1)) *sizeof(int)); } while (counters[0] < 1000000); return 0; } and in a perfect world, passing that token around should only cause one context switch per transfer, when the writer of a token causes a directed wakeup of just a single reader. But with the "writer wakes all readers" model we traditionally had, on my test box the above case causes more than an order of magnitude more scheduling: instead of the expected ~1M context switches, "perf stat" shows 231,852.37 msec task-clock # 15.857 CPUs utilized 11,250,961 context-switches # 0.049 M/sec 616,304 cpu-migrations # 0.003 M/sec 1,648 page-faults # 0.007 K/sec 1,097,903,998,514 cycles # 4.735 GHz 120,781,778,352 instructions # 0.11 insn per cycle 27,997,056,043 branches # 120.754 M/sec 283,581,233 branch-misses # 1.01% of all branches 14.621273891 seconds time elapsed 0.018243000 seconds user 3.611468000 seconds sys before this commit. After this commit, I get 5,229.55 msec task-clock # 3.072 CPUs utilized 1,212,233 context-switches # 0.232 M/sec 103,951 cpu-migrations # 0.020 M/sec 1,328 page-faults # 0.254 K/sec 21,307,456,166 cycles # 4.074 GHz 12,947,819,999 instructions # 0.61 insn per cycle 2,881,985,678 branches # 551.096 M/sec 64,267,015 branch-misses # 2.23% of all branches 1.702148350 seconds time elapsed 0.004868000 seconds user 0.110786000 seconds sys instead. Much better. [ Note! This kernel improvement seems to be very good at triggering a race condition in the make jobserver (in GNU make 4.2.1) for me. It's a long known bug that was fixed back in June 2017 by GNU make commit b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to avoid hangs."). But there wasn't a new release of GNU make until 4.3 on Jan 19 2020, so a number of distributions may still have the buggy version. Some have backported the fix to their 4.2.1 release, though, and even without the fix it's quite timing-dependent whether the bug actually is hit. ] Josh Triplett says: "I've been hammering on your pipe fix patch (switching to exclusive wait queues) for a month or so, on several different systems, and I've run into no issues with it. The patch *substantially* improves parallel build times on large (~100 CPU) systems, both with parallel make and with other things that use make's pipe-based jobserver. All current distributions (including stable and long-term stable distributions) have versions of GNU make that no longer have the jobserver bug" Tested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Jan Stancek
|
0dd1e3773a |
pipe: fix empty pipe check in pipe_write()
LTP pipeio_1 test is hanging with v5.5-rc2-385-gb8e382a185eb,
with read side observing empty pipe and sleeping and write
side running out of space and then sleeping as well. In this
scenario there are 5 writers and 1 reader.
Problem is that after pipe_write() reacquires pipe lock, it
re-checks for empty pipe with potentially stale 'head' and
doesn't wake up read side anymore. pipe->tail can advance
beyond 'head', because there are multiple writers.
Use pipe->head for empty pipe check after reacquiring lock
to observe current state.
Testing: With patch, LTP pipeio_1 ran successfully in loop for 1 hour.
Without patch it hanged within a minute.
Fixes:
|
||
Linus Torvalds
|
d1c6a2aa02 |
pipe: simplify signal handling in pipe_read() and add comments
There's no need to separately check for signals while inside the locked region, since we're going to do "wait_event_interruptible()" right afterwards anyway, and the error handling is much simpler there. The check for whether we had already read anything was also redundant, since we no longer do the odd merging of reads when there are pending writers. But perhaps more importantly, this adds commentary about why we still need to wake up possible writers even though we didn't read any data, and why we can skip all the finishing touches now if we get a signal (or had a signal pending) while waiting for more data. [ This is a split-out cleanup from my "make pipe IO use exclusive wait queues" thing, which I can't apply because it triggers a nasty bug in the GNU make jobserver - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
85190d15f4 |
pipe: don't use 'pipe_wait() for basic pipe IO
pipe_wait() may be simple, but since it relies on the pipe lock, it means that we have to do the wakeup while holding the lock. That's unfortunate, because the very first thing the waked entity will want to do is to get the pipe lock for itself. So get rid of the pipe_wait() usage by simply releasing the pipe lock, doing the wakeup (if required) and then using wait_event_interruptible() to wait on the right condition instead. wait_event_interruptible() handles races on its own by comparing the wakeup condition before and after adding itself to the wait queue, so you can use an optimistic unlocked condition for it. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
a28c8b9db8 |
pipe: remove 'waiting_writers' merging logic
This code is ancient, and goes back to when we only had a single page for the pipe buffers. The exact history is hidden in the mists of time (ie "before git", and in fact predates the BK repository too). At that long-ago point in time, it actually helped to try to merge big back-and-forth pipe reads and writes, and not limit pipe reads to the single pipe buffer in length just because that was all we had at a time. However, since then we've expanded the pipe buffers to multiple pages, and this logic really doesn't seem to make sense. And a lot of it is somewhat questionable (ie "hmm, the user asked for a non-blocking read, but we see that there's a writer pending, so let's wait anyway to get the extra data that the writer will have"). But more importantly, it makes the "go to sleep" logic much less obvious, and considering the wakeup issues we've had, I want to make for less of those kinds of things. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
f467a6a664 |
pipe: fix and clarify pipe read wakeup logic
This is the read side version of the previous commit: it simplifies the logic to only wake up waiting writers when necessary, and makes sure to use a synchronous wakeup. This time not so much for GNU make jobserver reasons (that pipe never fills up), but simply to get the writer going quickly again. A bit less verbose commentary this time, if only because I assume that the write side commentary isn't going to be ignored if you touch this code. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
1b6b26ae70 |
pipe: fix and clarify pipe write wakeup logic
The pipe rework ends up having been extra painful, partly becaused of actual bugs with ordering and caching of the pipe state, but also because of subtle performance issues. In particular, the pipe rework caused the kernel build to inexplicably slow down. The reason turns out to be that the GNU make jobserver (which limits the parallelism of the build) uses a pipe to implement a "token" system: a parallel submake will read a character from the pipe to get the job token before starting a new job, and will write a character back to the pipe when it is done. The overall job limit is thus easily controlled by just writing the appropriate number of initial token characters into the pipe. But to work well, that really means that the old behavior of write wakeups being synchronous (WF_SYNC) is very important - when the pipe writer wakes up a reader, we want the reader to actually get scheduled immediately. Otherwise you lose the parallelism of the build. The pipe rework lost that synchronous wakeup on write, and we had clearly all forgotten the reasons and rules for it. This rewrites the pipe write wakeup logic to do the required Wsync wakeups, but also clarifies the logic and avoids extraneous wakeups. It also ends up addign a number of comments about what oit does and why, so that we hopefully don't end up forgetting about this next time we change this code. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
ad910e36da |
pipe: fix poll/select race introduced by the pipe rework
The kernel wait queues have a basic rule to them: you add yourself to
the wait-queue first, and then you check the things that you're going to
wait on. That avoids the races with the event you're waiting for.
The same goes for poll/select logic: the "poll_wait()" goes first, and
then you check the things you're polling for.
Of course, if you use locking, the ordering doesn't matter since the
lock will serialize with anything that changes the state you're looking
at. That's not the case here, though.
So move the poll_wait() first in pipe_poll(), before you start looking
at the pipe state.
Fixes:
|
||
Linus Torvalds
|
da73fcd8cf |
Merge branch 'pipe-rework' (patches from David Howells)
Merge two fixes for the pipe rework from David Howells: "Here are a couple of patches to fix bugs syzbot found in the pipe changes: - An assertion check will sometimes trip when polling a pipe because the ring size and indices used are approximate and may be being changed simultaneously. An equivalent approximate calculation was done previously, but without the assertion check, so I've just dropped the check. To make it accurate, the pipe mutex would need to be taken or the spin lock could be used - but usage of the spinlock would need to be rolled out into splice, iov_iter and other places for that. - The index mask and the max_usage values cannot be cached across pipe_wait() as F_SETPIPE_SZ could have been called during the wait. This can cause pipe_write() to break" * pipe-rework: pipe: Fix missing mask update after pipe_wait() pipe: Remove assertion from pipe_poll() |
||
David Howells
|
8f868d68d3 |
pipe: Fix missing mask update after pipe_wait()
Fix pipe_write() to not cache the ring index mask and max_usage as their
values are invalidated by calling pipe_wait() because the latter
function drops the pipe lock, thereby allowing F_SETPIPE_SZ change them.
Without this, pipe_write() may subsequently miscalculate the array
indices and pipe fullness, leading to an oops like the following:
BUG: KASAN: slab-out-of-bounds in pipe_write+0xc25/0xe10 fs/pipe.c:481
Write of size 8 at addr ffff8880771167a8 by task syz-executor.3/7987
...
CPU: 1 PID: 7987 Comm: syz-executor.3 Not tainted 5.4.0-rc2-syzkaller #0
...
Call Trace:
pipe_write+0xc25/0xe10 fs/pipe.c:481
call_write_iter include/linux/fs.h:1895 [inline]
new_sync_write+0x3fd/0x7e0 fs/read_write.c:483
__vfs_write+0x94/0x110 fs/read_write.c:496
vfs_write+0x18a/0x520 fs/read_write.c:558
ksys_write+0x105/0x220 fs/read_write.c:611
__do_sys_write fs/read_write.c:623 [inline]
__se_sys_write fs/read_write.c:620 [inline]
__x64_sys_write+0x6e/0xb0 fs/read_write.c:620
do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is not a problem for pipe_read() as the mask is recalculated on
each pass of the loop, after pipe_wait() has been called.
Fixes:
|
||
David Howells
|
8c7b8c34ae |
pipe: Remove assertion from pipe_poll()
An assertion check was added to pipe_poll() to make sure that the ring
occupancy isn't seen to overflow the ring size. However, since no locks
are held when the three values are read, it is possible for F_SETPIPE_SZ
to intervene and muck up the calculation, thereby causing the oops.
Fix this by simply removing the assertion and accepting that the
calculation might be approximate.
Note that the previous code also had a similar issue, though there was
no assertion check, since the occupancy counter and the ring size were
not read with a lock held, so it's possible that the poll check might
have malfunctioned then too.
Also wake up all the waiters so that they can reissue their checks if
there was a competing read or write.
Fixes:
|
||
Linus Torvalds
|
6a965666b7 |
Pipework for general notification queue
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl3O0OoACgkQ+7dXa6fL C2tAwA//VH9Y81azemXFdflDF90sSH3TCASlKHVYHbBNAkH/QP5F00G4BEM4nNqH F3x7qcU9vzfGdumF1pc90Yt6XSYlsQEGF+xMyMw/VS2wKs40yv+b/doVbzOWbN9C NfrklgHeuuBk+JzU2llDisVqKRTLt4SmDpYu1ZdcchUQFZCCl3BpgdSEC+xXrHay +KlRPVNMSd2kXMCDuSWrr71lVNdCTdf3nNC5p1i780+VrgpIBIG/jmiNdCcd7PLH 1aesPlr8UZY3+bmRtqe587fVRAhT2qA2xibKtyf9R0hrDtUKR4NSnpPmaeIjb26e LhVntcChhYxQqzy/T4ScTDNVjpSlwi6QMo5DwAwzNGf2nf/v5/CZ+vGYDVdXRFHj tgH1+8eDpHsi7jJp6E4cmZjiolsUx/ePDDTrQ4qbdDMO7fmIV6YQKFAMTLJepLBY qnJVqoBq3qn40zv6tVZmKgWiXQ65jEkBItZhEUmcQRBiSbBDPweIdEzx/mwzkX7U 1gShGdut6YP4GX7BnOhkiQmzucS85mgkUfG43+mBfYXb+4zNTEjhhkqhEduz2SQP xnjHxEM+MTGCj3PozIpJxNKzMTEceYY7cAUdNEMDQcHog7OCnIdGBIc7BPnsN8yA CPzntwP4mmLfK3weq3PIGC6d9xfc9PpmiR9docxQOvE6sk2Ifeo= =FKC7 -----END PGP SIGNATURE----- Merge tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull pipe rework from David Howells: "This is my set of preparatory patches for building a general notification queue on top of pipes. It makes a number of significant changes: - It removes the nr_exclusive argument from __wake_up_sync_key() as this is always 1. This prepares for the next step: - Adds wake_up_interruptible_sync_poll_locked() so that poll can be woken up from a function that's holding the poll waitqueue spinlock. - Change the pipe buffer ring to be managed in terms of unbounded head and tail indices rather than bounded index and length. This means that reading the pipe only needs to modify one index, not two. - A selection of helper functions are provided to query the state of the pipe buffer, plus a couple to apply updates to the pipe indices. - The pipe ring is allowed to have kernel-reserved slots. This allows many notification messages to be spliced in by the kernel without allowing userspace to pin too many pages if it writes to the same pipe. - Advance the head and tail indices inside the pipe waitqueue lock and use wake_up_interruptible_sync_poll_locked() to poke poll without having to take the lock twice. - Rearrange pipe_write() to preallocate the buffer it is going to write into and then drop the spinlock. This allows kernel notifications to then be added the ring whilst it is filling the buffer it allocated. The read side is stalled because the pipe mutex is still held. - Don't wake up readers on a pipe if there was already data in it when we added more. - Don't wake up writers on a pipe if the ring wasn't full before we removed a buffer" * tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: pipe: Remove sync on wake_ups pipe: Increase the writer-wakeup threshold to reduce context-switch count pipe: Check for ring full inside of the spinlock in pipe_write() pipe: Remove redundant wakeup from pipe_write() pipe: Rearrange sequence in pipe_write() to preallocate slot pipe: Conditionalise wakeup in pipe_read() pipe: Advance tail pointer inside of wait spinlock in pipe_read() pipe: Allow pipes to have kernel-reserved slots pipe: Use head and tail pointers for the ring, not cursor and length Add wake_up_interruptible_sync_poll_locked() Remove the nr_exclusive argument from __wake_up_sync_key() pipe: Reduce #inclusion of pipe_fs_i.h |
||
Linus Torvalds
|
d8e464ecc1 |
vfs: mark pipes and sockets as stream-like file descriptors
In commit
|
||
David Howells
|
3c0edea9b2 | pipe: Remove sync on wake_ups | ||
David Howells
|
cefa80ced5 |
pipe: Increase the writer-wakeup threshold to reduce context-switch count
Increase the threshold at which the reader sends a wake event to the writers in the queue such that the queue must be half empty before the wake is issued rather than the wake being issued when just a single slot available. This reduces the number of context switches in the tests significantly, without altering the amount of work achieved. With my pipe-bench program, there's a 20% reduction versus an unpatched kernel. Suggested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David Howells <dhowells@redhat.com> |
||
David Howells
|
8df441294d |
pipe: Check for ring full inside of the spinlock in pipe_write()
Make pipe_write() check to see if the ring has become full between it taking the pipe mutex, checking the ring status and then taking the spinlock. This can happen if a notification is written into the pipe as that happens without the pipe mutex. Signed-off-by: David Howells <dhowells@redhat.com> |
||
David Howells
|
7e25a73f1a |
pipe: Remove redundant wakeup from pipe_write()
Remove a redundant wakeup from pipe_write(). Signed-off-by: David Howells <dhowells@redhat.com> |
||
David Howells
|
a194dfe6e6 |
pipe: Rearrange sequence in pipe_write() to preallocate slot
Rearrange the sequence in pipe_write() so that the allocation of the new buffer, the allocation of a ring slot and the attachment to the ring is done under the pipe wait spinlock and then the lock is dropped and the buffer can be filled. The data copy needs to be done with the spinlock unheld and irqs enabled, so the lock needs to be dropped first. However, the reader can't progress as we're holding pipe->mutex. We also need to drop the lock as that would impact others looking at the pipe waitqueue, such as poll(), the consumer and a future kernel message writer. We just abandon the preallocated slot if we get a copy error. Future writes may continue it and a future read will eventually recycle it. Signed-off-by: David Howells <dhowells@redhat.com> |
||
David Howells
|
8446487feb |
pipe: Conditionalise wakeup in pipe_read()
Only do a wakeup in pipe_read() if we made space in a completely full buffer. The producer shouldn't be waiting on pipe->wait otherwise. Signed-off-by: David Howells <dhowells@redhat.com> |
||
David Howells
|
b667b86734 |
pipe: Advance tail pointer inside of wait spinlock in pipe_read()
Advance the pipe ring tail pointer inside of wait spinlock in pipe_read() so that the pipe can be written into with kernel notifications from contexts where pipe->mutex cannot be taken. Signed-off-by: David Howells <dhowells@redhat.com> |
||
David Howells
|
6718b6f855 |
pipe: Allow pipes to have kernel-reserved slots
Split pipe->ring_size into two numbers: (1) pipe->ring_size - indicates the hard size of the pipe ring. (2) pipe->max_usage - indicates the maximum number of pipe ring slots that userspace orchestrated events can fill. This allows for a pipe that is both writable by the general kernel notification facility and by userspace, allowing plenty of ring space for notifications to be added whilst preventing userspace from being able to pin too much unswappable kernel space. Signed-off-by: David Howells <dhowells@redhat.com> |
||
David Howells
|
8cefc107ca |
pipe: Use head and tail pointers for the ring, not cursor and length
Convert pipes to use head and tail pointers for the buffer ring rather than pointer and length as the latter requires two atomic ops to update (or a combined op) whereas the former only requires one. (1) The head pointer is the point at which production occurs and points to the slot in which the next buffer will be placed. This is equivalent to pipe->curbuf + pipe->nrbufs. The head pointer belongs to the write-side. (2) The tail pointer is the point at which consumption occurs. It points to the next slot to be consumed. This is equivalent to pipe->curbuf. The tail pointer belongs to the read-side. (3) head and tail are allowed to run to UINT_MAX and wrap naturally. They are only masked off when the array is being accessed, e.g.: pipe->bufs[head & mask] This means that it is not necessary to have a dead slot in the ring as head == tail isn't ambiguous. (4) The ring is empty if "head == tail". A helper, pipe_empty(), is provided for this. (5) The occupancy of the ring is "head - tail". A helper, pipe_occupancy(), is provided for this. (6) The number of free slots in the ring is "pipe->ring_size - occupancy". A helper, pipe_space_for_user() is provided to indicate how many slots userspace may use. (7) The ring is full if "head - tail >= pipe->ring_size". A helper, pipe_full(), is provided for this. Signed-off-by: David Howells <dhowells@redhat.com> |