6dde3d399b
880 Commits
Author | SHA1 | Message | Date
---|---|---|---
Greg Kroah-Hartman
|
c9b484c69d |
This is the 6.1.68 stable release
Merge 6.1.68 into android14-6.1-lts
Changes in 6.1.68
vdpa/mlx5: preserve CVQ vringh index
hrtimers: Push pending hrtimers away from outgoing CPU earlier
i2c: designware: Fix corrupted memory seen in the ISR
netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test
zstd: Fix array-index-out-of-bounds UBSAN warning
tg3: Move the [rt]x_dropped counters to tg3_napi
tg3: Increment tx_dropped in tg3_tso_bug()
kconfig: fix memory leak from range properties
drm/amdgpu: correct chunk_ptr to a pointer to chunk.
x86: Introduce ia32_enabled()
x86/coco: Disable 32-bit emulation by default on TDX and SEV
x86/entry: Convert INT 0x80 emulation to IDTENTRY
x86/entry: Do not allow external 0x80 interrupts
x86/tdx: Allow 32-bit emulation by default
dt: dt-extract-compatibles: Handle cfile arguments in generator function
dt: dt-extract-compatibles: Don't follow symlinks when walking tree
platform/x86: asus-wmi: Move i8042 filter install to shared asus-wmi code
of: dynamic: Fix of_reconfig_get_state_change() return value documentation
platform/x86: wmi: Skip blocks with zero instances
ipv6: fix potential NULL deref in fib6_add()
octeontx2-pf: Add missing mutex lock in otx2_get_pauseparam
octeontx2-af: Check return value of nix_get_nixlf before using nixlf
hv_netvsc: rndis_filter needs to select NLS
r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
r8152: Add RTL8152_INACCESSIBLE checks to more loops
r8152: Add RTL8152_INACCESSIBLE to r8156b_wait_loading_flash()
r8152: Add RTL8152_INACCESSIBLE to r8153_pre_firmware_1()
r8152: Add RTL8152_INACCESSIBLE to r8153_aldps_en()
mlxbf-bootctl: correctly identify secure boot with development keys
platform/mellanox: Add null pointer checks for devm_kasprintf()
platform/mellanox: Check devm_hwmon_device_register_with_groups() return value
arcnet: restoring support for multiple Sohard Arcnet cards
octeontx2-pf: consider both Rx and Tx packet stats for adaptive interrupt coalescing
net: stmmac: fix FPE events losing
xsk: Skip polling event check for unbound socket
octeontx2-af: fix a use-after-free in rvu_npa_register_reporters
i40e: Fix unexpected MFS warning message
iavf: validate tx_coalesce_usecs even if rx_coalesce_usecs is zero
net: bnxt: fix a potential use-after-free in bnxt_init_tc
tcp: fix mid stream window clamp.
ionic: fix snprintf format length warning
ionic: Fix dim work handling in split interrupt mode
ipv4: ip_gre: Avoid skb_pull() failure in ipgre_xmit()
net: atlantic: Fix NULL dereference of skb pointer in
net: hns: fix wrong head when modify the tx feature when sending packets
net: hns: fix fake link up on xge port
octeontx2-af: Adjust Tx credits when MCS external bypass is disabled
octeontx2-af: Fix mcs sa cam entries size
octeontx2-af: Fix mcs stats register address
octeontx2-af: Add missing mcs flr handler call
octeontx2-af: Update Tx link register range
dt-bindings: interrupt-controller: Allow #power-domain-cells
netfilter: nft_exthdr: add boolean DCCP option matching
netfilter: nf_tables: fix 'exist' matching on bigendian arches
netfilter: nf_tables: bail out on mismatching dynset and set expressions
netfilter: nf_tables: validate family when identifying table via handle
netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
tcp: do not accept ACK of bytes we never sent
bpf: sockmap, updating the sg structure should also update curr
psample: Require 'CAP_NET_ADMIN' when joining "packets" group
drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
mm/damon/sysfs: eliminate potential uninitialized variable warning
tee: optee: Fix supplicant based device enumeration
RDMA/hns: Fix unnecessary err return when using invalid congest control algorithm
RDMA/irdma: Do not modify to SQD on error
RDMA/irdma: Add wait for suspend on SQD
arm64: dts: rockchip: Expand reg size of vdec node for RK3328
arm64: dts: rockchip: Expand reg size of vdec node for RK3399
ASoC: fsl_sai: Fix no frame sync clock issue on i.MX8MP
RDMA/rtrs-srv: Do not unconditionally enable irq
RDMA/rtrs-clt: Start hb after path_up
RDMA/rtrs-srv: Check return values while processing info request
RDMA/rtrs-srv: Free srv_mr iu only when always_invalidate is true
RDMA/rtrs-srv: Destroy path files after making sure no IOs in-flight
RDMA/rtrs-clt: Fix the max_send_wr setting
RDMA/rtrs-clt: Remove the warnings for req in_use check
RDMA/bnxt_re: Correct module description string
RDMA/irdma: Refactor error handling in create CQP
RDMA/irdma: Fix UAF in irdma_sc_ccq_get_cqe_info()
hwmon: (acpi_power_meter) Fix 4.29 MW bug
ASoC: codecs: lpass-tx-macro: set active_decimator correct default value
hwmon: (nzxt-kraken2) Fix error handling path in kraken2_probe()
ASoC: wm_adsp: fix memleak in wm_adsp_buffer_populate
RDMA/core: Fix umem iterator when PAGE_SIZE is greater then HCA pgsz
RDMA/irdma: Avoid free the non-cqp_request scratch
drm/bridge: tc358768: select CONFIG_VIDEOMODE_HELPERS
arm64: dts: imx8mq: drop usb3-resume-missing-cas from usb
arm64: dts: imx8mp: imx8mq: Add parkmode-disable-ss-quirk on DWC3
ARM: dts: imx6ul-pico: Describe the Ethernet PHY clock
tracing: Fix a warning when allocating buffered events fails
scsi: be2iscsi: Fix a memleak in beiscsi_init_wrb_handle()
ARM: imx: Check return value of devm_kasprintf in imx_mmdc_perf_init
ARM: dts: imx7: Declare timers compatible with fsl,imx6dl-gpt
ARM: dts: imx28-xea: Pass the 'model' property
riscv: fix misaligned access handling of C.SWSP and C.SDSP
md: introduce md_ro_state
md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly()
iommu: Avoid more races around device probe
rethook: Use __rcu pointer for rethook::handler
kprobes: consistent rcu api usage for kretprobe holder
ASoC: amd: yc: Fix non-functional mic on ASUS E1504FA
io_uring/af_unix: disable sending io_uring over sockets
nvme-pci: Add sleep quirk for Kingston drives
io_uring: fix mutex_unlock with unreferenced ctx
ALSA: usb-audio: Add Pioneer DJM-450 mixer controls
ALSA: pcm: fix out-of-bounds in snd_pcm_state_names
ALSA: hda/realtek: Enable headset on Lenovo M90 Gen5
ALSA: hda/realtek: add new Framework laptop to quirks
ALSA: hda/realtek: Add Framework laptop 16 to quirks
ring-buffer: Test last update in 32bit version of __rb_time_read()
nilfs2: fix missing error check for sb_set_blocksize call
nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage()
cgroup_freezer: cgroup_freezing: Check if not frozen
checkstack: fix printed address
tracing: Always update snapshot buffer size
tracing: Disable snapshot buffer when stopping instance tracers
tracing: Fix incomplete locking when disabling buffered events
tracing: Fix a possible race when disabling buffered events
packet: Move reference count in packet_sock to atomic_long_t
r8169: fix rtl8125b PAUSE frames blasting when suspended
regmap: fix bogus error on regcache_sync success
platform/surface: aggregator: fix recv_buf() return value
hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write
mm: fix oops when filemap_map_pmd() without prealloc_pte
powercap: DTPM: Fix missing cpufreq_cpu_put() calls
md/raid6: use valid sector values to determine if an I/O should wait on the reshape
arm64: dts: mediatek: mt7622: fix memory node warning check
arm64: dts: mediatek: mt8183-kukui-jacuzzi: fix dsi unnecessary cells properties
arm64: dts: mediatek: cherry: Fix interrupt cells for MT6360 on I2C7
arm64: dts: mediatek: mt8173-evb: Fix regulator-fixed node names
arm64: dts: mediatek: mt8195: Fix PM suspend/resume with venc clocks
arm64: dts: mediatek: mt8183: Fix unit address for scp reserved memory
arm64: dts: mediatek: mt8183: Move thermal-zones to the root node
arm64: dts: mediatek: mt8183-evb: Fix unit_address_vs_reg warning on ntc
binder: fix memory leaks of spam and pending work
coresight: etm4x: Make etm4_remove_dev() return void
coresight: etm4x: Remove bogous __exit annotation for some functions
hwtracing: hisi_ptt: Add dummy callback pmu::read()
misc: mei: client.c: return negative error code in mei_cl_write
misc: mei: client.c: fix problem of return '-EOVERFLOW' in mei_cl_write
LoongArch: BPF: Don't sign extend memory load operand
LoongArch: BPF: Don't sign extend function return value
ring-buffer: Force absolute timestamp on discard of event
tracing: Set actual size after ring buffer resize
tracing: Stop current tracer when resizing buffer
parisc: Reduce size of the bug_table on 64-bit kernel by half
parisc: Fix asm operand number out of range build error in bug table
arm64: dts: mediatek: add missing space before {
arm64: dts: mt8183: kukui: Fix underscores in node names
perf: Fix perf_event_validate_size()
x86/sev: Fix kernel crash due to late update to read-only ghcb_version
gpiolib: sysfs: Fix error handling on failed export
drm/amdgpu: fix memory overflow in the IB test
drm/amd/amdgpu: Fix warnings in amdgpu/amdgpu_display.c
drm/amdgpu: correct the amdgpu runtime dereference usage count
drm/amdgpu: Update ras eeprom support for smu v13_0_0 and v13_0_10
drm/amdgpu: Add EEPROM I2C address support for ip discovery
drm/amdgpu: Remove redundant I2C EEPROM address
drm/amdgpu: Decouple RAS EEPROM addresses from chips
drm/amdgpu: Add support for RAS table at 0x40000
drm/amdgpu: Remove second moot switch to set EEPROM I2C address
drm/amdgpu: Return from switch early for EEPROM I2C address
drm/amdgpu: simplify amdgpu_ras_eeprom.c
drm/amdgpu: Add I2C EEPROM support on smu v13_0_6
drm/amdgpu: Update EEPROM I2C address for smu v13_0_0
usb: gadget: f_hid: fix report descriptor allocation
serial: 8250_dw: Add ACPI ID for Granite Rapids-D UART
parport: Add support for Brainboxes IX/UC/PX parallel cards
cifs: Fix non-availability of dedup breaking generic/304
Revert "xhci: Loosen RPM as default policy to cover for AMD xHC 1.1"
smb: client: fix potential NULL deref in parse_dfs_referrals()
usb: typec: class: fix typec_altmode_put_partner to put plugs
ARM: PL011: Fix DMA support
serial: sc16is7xx: address RX timeout interrupt errata
serial: 8250: 8250_omap: Clear UART_HAS_RHR_IT_DIS bit
serial: 8250: 8250_omap: Do not start RX DMA on THRI interrupt
serial: 8250_omap: Add earlycon support for the AM654 UART controller
devcoredump: Send uevent once devcd is ready
x86/CPU/AMD: Check vendor in the AMD microcode callback
USB: gadget: core: adjust uevent timing on gadget unbind
cifs: Fix flushing, invalidation and file size with copy_file_range()
cifs: Fix flushing, invalidation and file size with FICLONE
MIPS: kernel: Clear FPU states when setting up kernel threads
KVM: s390/mm: Properly reset no-dat
KVM: SVM: Update EFER software model on CR0 trap for SEV-ES
MIPS: Loongson64: Reserve vgabios memory on boot
MIPS: Loongson64: Handle more memory types passed from firmware
MIPS: Loongson64: Enable DMA noncoherent support
netfilter: nft_set_pipapo: skip inactive elements during set walk
riscv: Kconfig: Add select ARM_AMBA to SOC_STARFIVE
drm/i915/display: Drop check for doublescan mode in modevalid
drm/i915/lvds: Use REG_BIT() & co.
drm/i915/sdvo: stop caching has_hdmi_monitor in struct intel_sdvo
drm/i915: Skip some timing checks on BXT/GLK DSI transcoders
Linux 6.1.68
Change-Id: I0a824071a80b24dc4a2e0077f305b7cac42235b8
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
Mike Kravetz
|
574a6db80f |
hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write
commit 187da0f8250aa94bd96266096aef6f694e0b4cd2 upstream.
The routine __vma_private_lock tests for the existence of a reserve map associated with a private hugetlb mapping. A pointer to the reserve map is in vma->vm_private_data.
__vma_private_lock was checking the pointer for NULL. However, it is possible that the low bits of the pointer could be used as flags. In such instances, vm_private_data is not NULL and not a valid pointer. This results in the null-ptr-deref reported by syzbot:
general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
CPU: 0 PID: 5048 Comm: syz-executor139 Not tainted 6.6.0-rc7-syzkaller-00142-g888cf78c29e2 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
RIP: 0010:__lock_acquire+0x109/0x5de0 kernel/locking/lockdep.c:5004
...
Call Trace:
<TASK>
lock_acquire kernel/locking/lockdep.c:5753 [inline]
lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5718
down_write+0x93/0x200 kernel/locking/rwsem.c:1573
hugetlb_vma_lock_write mm/hugetlb.c:300 [inline]
hugetlb_vma_lock_write+0xae/0x100 mm/hugetlb.c:291
__hugetlb_zap_begin+0x1e9/0x2b0 mm/hugetlb.c:5447
hugetlb_zap_begin include/linux/hugetlb.h:258 [inline]
unmap_vmas+0x2f4/0x470 mm/memory.c:1733
exit_mmap+0x1ad/0xa60 mm/mmap.c:3230
__mmput+0x12a/0x4d0 kernel/fork.c:1349
mmput+0x62/0x70 kernel/fork.c:1371
exit_mm kernel/exit.c:567 [inline]
do_exit+0x9ad/0x2a20 kernel/exit.c:861
__do_sys_exit kernel/exit.c:991 [inline]
__se_sys_exit kernel/exit.c:989 [inline]
__x64_sys_exit+0x42/0x50 kernel/exit.c:989
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Mask off low bit flags before checking for NULL pointer. In addition, the reserve map only 'belongs' to the OWNER (parent in parent/child relationships) so also check for the OWNER flag.
Link: https://lkml.kernel.org/r/20231114012033.259600-1-mike.kravetz@oracle.com
Reported-by: syzbot+6ada951e7c0f7bc8a71e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/00000000000078d1e00608d7878b@google.com/
Fixes: bf4916922c60 ("hugetlbfs: extend hugetlb_vma_lock to private VMAs")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: Edward Adam Davis <eadavis@qq.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tom Rix <trix@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
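The fix described above amounts to treating vm_private_data as a tagged pointer. A minimal sketch of the corrected check, reconstructed from the description rather than quoted from the upstream diff, assuming the existing mm/hugetlb.c helpers get_vma_private_data(), HPAGE_RESV_MASK, HPAGE_RESV_OWNER and is_vma_resv_set():

```c
/* Sketch: vm_private_data is a tagged pointer, so mask the flag bits first. */
static bool __vma_private_lock(struct vm_area_struct *vma)
{
	/*
	 * The low bits of vma->vm_private_data can hold HPAGE_RESV_* flags,
	 * so a plain NULL check is not enough.  Only treat the value as a
	 * reserve map if something remains after masking the flags off, and
	 * only the OWNER vma actually owns that reserve map.
	 */
	return !(vma->vm_flags & VM_MAYSHARE) &&
	       (get_vma_private_data(vma) & ~HPAGE_RESV_MASK) &&
	       is_vma_resv_set(vma, HPAGE_RESV_OWNER);
}
```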
||
Greg Kroah-Hartman
|
2cd386b08b |
Merge 6.1.61 into android14-6.1-lts
Changes in 6.1.61
KVM: x86/pmu: Truncate counter value to allowed width on write
mmc: core: Align to common busy polling behaviour for mmc ioctls
mmc: block: ioctl: do write error check for spi
mmc: core: Fix error propagation for some ioctl commands
ASoC: codecs: wcd938x: Convert to platform remove callback returning void
ASoC: codecs: wcd938x: Simplify with dev_err_probe
ASoC: codecs: wcd938x: fix regulator leaks on probe errors
ASoC: codecs: wcd938x: fix runtime PM imbalance on remove
pinctrl: qcom: lpass-lpi: fix concurrent register updates
mcb: Return actual parsed size when reading chameleon table
mcb-lpc: Reallocate memory region to avoid memory overlapping
virtio_balloon: Fix endless deflation and inflation on arm64
virtio-mmio: fix memory leak of vm_dev
virtio-crypto: handle config changed by work queue
virtio_pci: fix the common cfg map size
vsock/virtio: initialize the_virtio_vsock before using VQs
vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE
arm64: dts: rockchip: Add i2s0-2ch-bus-bclk-off pins to RK3399
arm64: dts: rockchip: Fix i2s0 pin conflict on ROCK Pi 4 boards
mm: fix vm_brk_flags() to not bail out while holding lock
hugetlbfs: clear resv_map pointer if mmap fails
mm/page_alloc: correct start page when guard page debug is enabled
mm/migrate: fix do_pages_move for compat pointers
hugetlbfs: extend hugetlb_vma_lock to private VMAs
maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
nfsd: lock_rename() needs both directories to live on the same fs
drm/i915/pmu: Check if pmu is closed before stopping event
drm/amd: Disable ASPM for VI w/ all Intel systems
drm/dp_mst: Fix NULL deref in get_mst_branch_device_by_guid_helper()
ARM: OMAP: timer32K: fix all kernel-doc warnings
firmware/imx-dsp: Fix use_after_free in imx_dsp_setup_channels()
clk: ti: Fix missing omap4 mcbsp functional clock and aliases
clk: ti: Fix missing omap5 mcbsp functional clock and aliases
r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx
r8169: fix the KCSAN reported data-race in rtl_tx while reading TxDescArray[entry].opts1
r8169: fix the KCSAN reported data race in rtl_rx while reading desc->opts1
iavf: initialize waitqueues before starting watchdog_task
i40e: Fix I40E_FLAG_VF_VLAN_PRUNING value
treewide: Spelling fix in comment
igb: Fix potential memory leak in igb_add_ethtool_nfc_entry
neighbour: fix various data-races
igc: Fix ambiguity in the ethtool advertising
net: ethernet: adi: adin1110: Fix uninitialized variable
net: ieee802154: adf7242: Fix some potential buffer overflow in adf7242_stats_show()
net: usb: smsc95xx: Fix uninit-value access in smsc95xx_read_reg
r8152: Increase USB control msg timeout to 5000ms as per spec
r8152: Run the unload routine if we have errors during probe
r8152: Cancel hw_phy_work if we have an error in probe
r8152: Release firmware if we have an error in probe
tcp: fix wrong RTO timeout when received SACK reneging
gtp: uapi: fix GTPA_MAX
gtp: fix fragmentation needed check with gso
i40e: Fix wrong check for I40E_TXR_FLAGS_WB_ON_ITR
drm/logicvc: Kconfig: select REGMAP and REGMAP_MMIO
iavf: in iavf_down, disable queues when removing the driver
scsi: sd: Introduce manage_shutdown device flag
blk-throttle: check for overflow in calculate_bytes_allowed
kasan: print the original fault addr when access invalid shadow
io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
iio: afe: rescale: Accept only offset channels
iio: exynos-adc: request second interupt only when touchscreen mode is used
iio: adc: xilinx-xadc: Don't clobber preset voltage/temperature thresholds
iio: adc: xilinx-xadc: Correct temperature offset/scale for UltraScale
i2c: muxes: i2c-mux-pinctrl: Use of_get_i2c_adapter_by_node()
i2c: muxes: i2c-mux-gpmux: Use of_get_i2c_adapter_by_node()
i2c: muxes: i2c-demux-pinctrl: Use of_get_i2c_adapter_by_node()
i2c: stm32f7: Fix PEC handling in case of SMBUS transfers
i2c: aspeed: Fix i2c bus hang in slave read
tracing/kprobes: Fix the description of variable length arguments
misc: fastrpc: Reset metadata buffer to avoid incorrect free
misc: fastrpc: Free DMA handles for RPC calls with no arguments
misc: fastrpc: Clean buffers on remote invocation failures
misc: fastrpc: Unmap only if buffer is unmapped from DSP
nvmem: imx: correct nregs for i.MX6ULL
nvmem: imx: correct nregs for i.MX6SLL
nvmem: imx: correct nregs for i.MX6UL
x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
x86/cpu: Add model number for Intel Arrow Lake mobile processor
perf/core: Fix potential NULL deref
sparc32: fix a braino in fault handling in csum_and_copy_..._user()
clk: Sanitize possible_parent_show to Handle Return Value of of_clk_get_parent_name
platform/x86: Add s2idle quirk for more Lenovo laptops
ext4: add two helper functions extent_logical_end() and pa_logical_end()
ext4: fix BUG in ext4_mb_new_inode_pa() due to overflow
ext4: avoid overlapping preallocations due to overflow
objtool/x86: add missing embedded_insn check
Linux 6.1.61
Note, this merge point also reverts commit
|
||
Rik van Riel
|
b1b2750de1 |
hugetlbfs: extend hugetlb_vma_lock to private VMAs
commit bf4916922c60f43efaa329744b3eef539aa6a2b2 upstream.
Extend the locking scheme used to protect shared hugetlb mappings from
truncate vs page fault races, in order to protect private hugetlb mappings
(with resv_map) against MADV_DONTNEED.
Add a read-write semaphore to the resv_map data structure, and use that
from the hugetlb_vma_(un)lock_* functions, in preparation for closing the
race between MADV_DONTNEED and page faults.
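A rough sketch of the locking change this describes, assuming mm/hugetlb.c helpers along the lines of __vma_shareable_lock(), __vma_private_lock() and vma_resv_map(); the rw_semaphore placement inside struct resv_map is illustrative rather than the exact upstream layout:

```c
/* Sketch: private hugetlb mappings now lock the semaphore in their resv_map. */
struct resv_map {
	struct kref refs;
	spinlock_t lock;
	struct list_head regions;
	/* ... existing fields ... */
	struct rw_semaphore rw_sema;	/* added: fault vs. MADV_DONTNEED/truncate */
};

void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
	if (__vma_shareable_lock(vma)) {
		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

		down_read(&vma_lock->rw_sema);	/* shared mappings: unchanged */
	} else if (__vma_private_lock(vma)) {
		struct resv_map *resv_map = vma_resv_map(vma);

		down_read(&resv_map->rw_sema);	/* private mappings: new path */
	}
}
```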
Link: https://lkml.kernel.org/r/20231006040020.3677377-3-riel@surriel.com
Fixes:
|
||
Rik van Riel
|
0aa7b24c06 |
hugetlbfs: clear resv_map pointer if mmap fails
commit 92fe9dcbe4e109a7ce6bab3e452210a35b0ab493 upstream.
Patch series "hugetlbfs: close race between MADV_DONTNEED and page fault", v7.
Malloc libraries, like jemalloc and tcmalloc, take decisions on when to
call madvise independently from the code in the main application.
This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.
Usually this is harmless, because we always have some 4kB pages sitting
around to satisfy a page fault. However, with hugetlbfs systems often
allocate only the exact number of huge pages that the application wants.
Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:
CPU 1                               CPU 2
MADV_DONTNEED
unmap page
shoot down TLB entry
                                    page fault
                                    fail to allocate a huge page
                                    killed with SIGBUS
free page
Fix that race by extending the hugetlb_vma_lock locking scheme to also
cover private hugetlb mappings (with resv_map), and pulling the locking
from __unmap_hugepage_final_range into helper functions called from
zap_page_range_single. This ensures page faults stay locked out of the
MADV_DONTNEED VMA until the huge pages have actually been freed.
This patch (of 3):
Hugetlbfs leaves a dangling pointer in the VMA if mmap fails. This has
not been a problem so far, but other code in this patch series tries to
follow that pointer.
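As a sketch of what clearing the dangling pointer looks like on the mmap error path (illustrative, based on the description above; the surrounding hugetlb_reserve_pages() error handling is elided):

```c
	/* Sketch: hugetlb_reserve_pages() error path after a failed mmap. */
	if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
		kref_put(&resv_map->refs, resv_map_release);
		/*
		 * The reserve map reference is gone; clear vm_private_data so
		 * later code in this series cannot follow a dangling pointer.
		 */
		set_vma_private_data(vma, 0);
	}
	return false;
```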
Link: https://lkml.kernel.org/r/20231006040020.3677377-1-riel@surriel.com
Link: https://lkml.kernel.org/r/20231006040020.3677377-2-riel@surriel.com
Fixes:
|
||
Greg Kroah-Hartman
|
50874c58d8 |
Merge 6.1.47 into android14-6.1-lts
Changes in 6.1.47
mmc: sdhci-f-sdh30: Replace with sdhci_pltfm
cpuidle: psci: Extend information in log about OSI/PC mode
cpuidle: psci: Move enabling OSI mode after power domains creation
zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks
zsmalloc: fix races between modifications of fullness and isolated
selftests: forwarding: tc_actions: cleanup temporary files when test is aborted
selftests: forwarding: tc_actions: Use ncat instead of nc
net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
net/smc: Fix setsockopt and sysctl to specify same buffer size again
net: phy: at803x: Use devm_regulator_get_enable_optional()
net: phy: at803x: fix the wol setting functions
drm/amdgpu: fix calltrace warning in amddrm_buddy_fini
drm/amdgpu: Fix integer overflow in amdgpu_cs_pass1
drm/amdgpu: fix memory leak in mes self test
ASoC: Intel: sof_sdw: add quirk for MTL RVP
ASoC: Intel: sof_sdw: add quirk for LNL RVP
PCI: tegra194: Fix possible array out of bounds access
ASoC: SOF: amd: Add pci revision id check
drm/stm: ltdc: fix late dereference check
drm: rcar-du: remove R-Car H3 ES1.* workarounds
ASoC: amd: vangogh: Add check for acp config flags in vangogh platform
ARM: dts: imx6dl: prtrvt, prtvt7, prti6q, prtwd2: fix USB related warnings
ASoC: Intel: sof_sdw_rt_sdca_jack_common: test SOF_JACK_JDSRC in _exit
ASoC: Intel: sof_sdw: Add support for Rex soundwire
iopoll: Call cpu_relax() in busy loops
ASoC: SOF: Intel: fix SoundWire/HDaudio mutual exclusion
dma-remap: use kvmalloc_array/kvfree for larger dma memory remap
accel/habanalabs: add pci health check during heartbeat
HID: logitech-hidpp: Add USB and Bluetooth IDs for the Logitech G915 TKL Keyboard
iommu/amd: Introduce Disable IRTE Caching Support
drm/amdgpu: install stub fence into potential unused fence pointers
drm/amd/display: Apply 60us prefetch for DCFCLK <= 300Mhz
RDMA/mlx5: Return the firmware result upon destroying QP/RQ
drm/amd/display: Skip DPP DTO update if root clock is gated
drm/amd/display: Enable dcn314 DPP RCO
ASoC: SOF: core: Free the firmware trace before calling snd_sof_shutdown()
HID: intel-ish-hid: ipc: Add Arrow Lake PCI device ID
ALSA: hda/realtek: Add quirks for ROG ALLY CS35l41 audio
smb: client: fix warning in cifs_smb3_do_mount()
cifs: fix session state check in reconnect to avoid use-after-free issue
serial: stm32: Ignore return value of uart_remove_one_port() in .remove()
led: qcom-lpg: Fix resource leaks in for_each_available_child_of_node() loops
media: v4l2-mem2mem: add lock to protect parameter num_rdy
media: camss: set VFE bpl_alignment to 16 for sdm845 and sm8250
usb: gadget: u_serial: Avoid spinlock recursion in __gs_console_push
usb: gadget: uvc: queue empty isoc requests if no video buffer is available
media: platform: mediatek: vpu: fix NULL ptr dereference
thunderbolt: Read retimer NVM authentication status prior tb_retimer_set_inbound_sbtx()
usb: chipidea: imx: don't request QoS for imx8ulp
usb: chipidea: imx: add missing USB PHY DPDM wakeup setting
gfs2: Fix possible data races in gfs2_show_options()
pcmcia: rsrc_nonstatic: Fix memory leak in nonstatic_release_resource_db()
thunderbolt: Add Intel Barlow Ridge PCI ID
thunderbolt: Limit Intel Barlow Ridge USB3 bandwidth
firewire: net: fix use after free in fwnet_finish_incoming_packet()
watchdog: sp5100_tco: support Hygon FCH/SCH (Server Controller Hub)
Bluetooth: L2CAP: Fix use-after-free
Bluetooth: btusb: Add MT7922 bluetooth ID for the Asus Ally
ceph: try to dump the msgs when decoding fails
drm/amdgpu: Fix potential fence use-after-free v2
fs/ntfs3: Enhance sanity check while generating attr_list
fs: ntfs3: Fix possible null-pointer dereferences in mi_read()
fs/ntfs3: Mark ntfs dirty when on-disk struct is corrupted
ALSA: hda/realtek: Add quirks for Unis H3C Desktop B760 & Q760
ALSA: hda: fix a possible null-pointer dereference due to data race in snd_hdac_regmap_sync()
ALSA: hda/realtek: Add quirk for ASUS ROG GX650P
ALSA: hda/realtek: Add quirk for ASUS ROG GA402X
ALSA: hda/realtek: Add quirk for ASUS ROG GZ301V
powerpc/kasan: Disable KCOV in KASAN code
Bluetooth: MGMT: Use correct address for memcpy()
ring-buffer: Do not swap cpu_buffer during resize process
igc: read before write to SRRCTL register
drm/amd/display: save restore hdcp state when display is unplugged from mst hub
drm/amd/display: phase3 mst hdcp for multiple displays
drm/amd/display: fix access hdcp_workqueue assert
KVM: arm64: vgic-v4: Make the doorbell request robust w.r.t preemption
ARM: dts: nxp/imx6sll: fix wrong property name in usbphy node
fbdev/hyperv-fb: Do not set struct fb_info.apertures
video/aperture: Only remove sysfb on the default vga pci device
btrfs: move out now unused BG from the reclaim list
btrfs: convert btrfs_block_group::needs_free_space to runtime flag
btrfs: convert btrfs_block_group::seq_zone to runtime flag
btrfs: fix use-after-free of new block group that became unused
virtio-mmio: don't break lifecycle of vm_dev
vduse: Use proper spinlock for IRQ injection
vdpa/mlx5: Fix mr->initialized semantics
vdpa/mlx5: Delete control vq iotlb in destroy_mr only when necessary
cifs: fix potential oops in cifs_oplock_break
i2c: bcm-iproc: Fix bcm_iproc_i2c_isr deadlock issue
i2c: hisi: Only handle the interrupt of the driver's transfer
i2c: tegra: Fix i2c-tegra DMA config option processing
fbdev: mmp: fix value check in mmphw_probe()
powerpc/rtas_flash: allow user copy to flash block cache objects
vdpa: Add features attr to vdpa_nl_policy for nlattr length check
vdpa: Add queue index attr to vdpa_nl_policy for nlattr length check
vdpa: Add max vqp attr to vdpa_nl_policy for nlattr length check
vdpa: Enable strict validation for netlinks ops
tty: n_gsm: fix the UAF caused by race condition in gsm_cleanup_mux
tty: serial: fsl_lpuart: Clear the error flags by writing 1 for lpuart32 platforms
btrfs: fix incorrect splitting in btrfs_drop_extent_map_range
btrfs: fix BUG_ON condition in btrfs_cancel_balance
i2c: designware: Correct length byte validation logic
i2c: designware: Handle invalid SMBus block data response length value
net: xfrm: Fix xfrm_address_filter OOB read
net: af_key: fix sadb_x_filter validation
net: xfrm: Amend XFRMA_SEC_CTX nla_policy structure
xfrm: fix slab-use-after-free in decode_session6
ip6_vti: fix slab-use-after-free in decode_session6
ip_vti: fix potential slab-use-after-free in decode_session6
xfrm: add NULL check in xfrm_update_ae_params
xfrm: add forgotten nla_policy for XFRMA_MTIMER_THRESH
virtio_net: notify MAC address change on device initialization
virtio-net: set queues after driver_ok
net: pcs: Add missing put_device call in miic_create
net: phy: fix IRQ-based wake-on-lan over hibernate / power off
selftests: mirror_gre_changes: Tighten up the TTL test match
drm/panel: simple: Fix AUO G121EAN01 panel timings according to the docs
net: macb: In ZynqMP resume always configure PS GTR for non-wakeup source
octeon_ep: cancel tx_timeout_task later in remove sequence
netfilter: nf_tables: fix false-positive lockdep splat
netfilter: nf_tables: deactivate catchall elements in next generation
ipvs: fix racy memcpy in proc_do_sync_threshold
netfilter: nft_dynset: disallow object maps
net: phy: broadcom: stub c45 read/write for 54810
team: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves
net: openvswitch: reject negative ifindex
iavf: fix FDIR rule fields masks validation
i40e: fix misleading debug logs
net: dsa: mv88e6xxx: Wait for EEPROM done before HW reset
sfc: don't unregister flow_indr if it was never registered
sock: Fix misuse of sk_under_memory_pressure()
net: do not allow gso_size to be set to GSO_BY_FRAGS
qede: fix firmware halt over suspend and resume
ice: Block switchdev mode when ADQ is active and vice versa
bus: ti-sysc: Flush posted write on enable before reset
arm64: dts: qcom: qrb5165-rb5: fix thermal zone conflict
arm64: dts: rockchip: Disable HS400 for eMMC on ROCK Pi 4
arm64: dts: rockchip: Disable HS400 for eMMC on ROCK 4C+
ARM: dts: imx: align LED node names with dtschema
ARM: dts: imx6: phytec: fix RTC interrupt level
arm64: dts: imx8mm: Drop CSI1 PHY reference clock configuration
ARM: dts: imx: Set default tuning step for imx6sx usdhc
arm64: dts: imx93: Fix anatop node size
ASoC: rt5665: add missed regulator_bulk_disable
ASoC: meson: axg-tdm-formatter: fix channel slot allocation
ALSA: hda/realtek: Add quirks for HP G11 Laptops
soc: aspeed: uart-routing: Use __sysfs_match_string
soc: aspeed: socinfo: Add kfree for kstrdup
ALSA: hda/realtek - Remodified 3k pull low procedure
riscv: uaccess: Return the number of bytes effectively not copied
serial: 8250: Fix oops for port->pm on uart_change_pm()
ALSA: usb-audio: Add support for Mythware XA001AU capture and playback interfaces.
cifs: Release folio lock on fscache read hit.
virtio-net: Zero max_tx_vq field for VIRTIO_NET_CTRL_MQ_HASH_CONFIG case
arm64: dts: rockchip: Fix Wifi/Bluetooth on ROCK Pi 4 boards
blk-crypto: dynamically allocate fallback profile
mmc: wbsd: fix double mmc_free_host() in wbsd_init()
mmc: block: Fix in_flight[issue_type] value error
drm/qxl: fix UAF on handle creation
drm/i915/sdvo: fix panel_type initialization
drm/amd: flush any delayed gfxoff on suspend entry
drm/amdgpu: skip fence GFX interrupts disable/enable for S0ix
drm/amdgpu/pm: fix throttle_status for other than MP1 11.0.7
ASoC: amd: vangogh: select CONFIG_SND_AMD_ACP_CONFIG
drm/amd/display: disable RCO for DCN314
zsmalloc: allow only one active pool compaction context
sched/fair: unlink misfit task from cpu overutilized
sched/fair: Remove capacity inversion detection
drm/amd/display: Implement workaround for writing to OTG_PIXEL_RATE_DIV register
hugetlb: do not clear hugetlb dtor until allocating vmemmap
netfilter: set default timeout to 3 secs for sctp shutdown send and recv state
arm64/ptrace: Ensure that SME is set up for target when writing SSVE state
drm/amd/pm: skip the RLC stop when S0i3 suspend for SMU v13.0.4/11
drm/amdgpu: keep irq count in amdgpu_irq_disable_all
af_unix: Fix null-ptr-deref in unix_stream_sendpage().
drm/nouveau/disp: fix use-after-free in error handling of nouveau_connector_create
net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled
mmc: f-sdh30: fix order of function calls in sdhci_f_sdh30_remove
Linux 6.1.47
Change-Id: I7c55c71f43f88a1d44d39c835e3f6e58d4c86279
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
||
Mike Kravetz
|
1b4ce2952b |
hugetlb: do not clear hugetlb dtor until allocating vmemmap
commit 32c877191e022b55fe3a374f3d7e9fb5741c514d upstream.
Patch series "Fix hugetlb free path race with memory errors".
In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
HWPOISON hugepages" the race window was discovered.
https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/
Freeing a hugetlb page back to low level memory allocators is performed
in two steps.
1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
2) Outside lock, allocate vmemmap if necessary and call low level free
Between these two steps, the hugetlb page will appear as a normal
compound page. However, vmemmap for tail pages could be missing.
If a memory error occurs at this time, we could try to update page
flags in non-existent page structs.
A much more detailed description is in the first patch.
The first patch addresses the race window. However, it adds a
hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb page
free operation. This could lead to slowdowns if one is freeing a large
number of hugetlb pages.
The second patch optimizes the update_and_free_pages_bulk routine to only
take the lock once in bulk operations.
The second patch is technically not a bug fix, but includes a Fixes tag
and Cc stable to avoid a performance regression. It can be combined with
the first, but was done separately to make reviewing easier.
This patch (of 2):
Freeing a hugetlb page and releasing base pages back to the underlying
allocator such as buddy or cma is performed in two steps:
- remove_hugetlb_folio() is called to remove the folio from hugetlb
lists, get a ref on the page and remove hugetlb destructor. This
all must be done under the hugetlb lock. After this call, the page
can be treated as a normal compound page or a collection of base
size pages.
- update_and_free_hugetlb_folio() is called to allocate vmemmap if
needed and the free routine of the underlying allocator is called
on the resulting page. We can not hold the hugetlb lock here.
One issue with this scheme is that a memory error could occur between
these two steps. In this case, the memory error handling code treats
the old hugetlb page as a normal compound page or collection of base
pages. It will then try to SetPageHWPoison(page) on the page with an
error. If the page with error is a tail page without vmemmap, a write
error will occur when trying to set the flag.
Address this issue by modifying remove_hugetlb_folio() and
update_and_free_hugetlb_folio() such that the hugetlb destructor is not
cleared until after allocating vmemmap. Since clearing the destructor
requires holding the hugetlb lock, the clearing is done in
remove_hugetlb_folio() if the vmemmap is present. This saves a
lock/unlock cycle. Otherwise, destructor is cleared in
update_and_free_hugetlb_folio() after allocating vmemmap.
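A simplified sketch of the ordering this describes; clear_hugetlb_destructor() and hugetlb_vmemmap_optimized() are placeholder names used for illustration, while hugetlb_vmemmap_restore() is the existing restore helper:

```c
/* Sketch of the new ordering; details differ from the real mm/hugetlb.c code. */
static void remove_hugetlb_folio(struct hstate *h, struct folio *folio)
{
	lockdep_assert_held(&hugetlb_lock);
	list_del(&folio->lru);
	h->nr_huge_pages--;
	/* Safe to drop the destructor now only if all tail structs exist. */
	if (!hugetlb_vmemmap_optimized(folio))
		clear_hugetlb_destructor(h, folio);
}

static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio)
{
	/* May sleep, so hugetlb_lock is not held here. */
	if (hugetlb_vmemmap_restore(h, &folio->page))
		return;				/* error handling elided */

	/* vmemmap is back: clear the destructor under the lock, then free. */
	spin_lock_irq(&hugetlb_lock);
	clear_hugetlb_destructor(h, folio);
	spin_unlock_irq(&hugetlb_lock);

	__free_pages(&folio->page, huge_page_order(h));
}
```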
Note that this will leave hugetlb pages in a state where they are marked
free (by hugetlb specific page flag) and have a ref count. This is not
a normal state. The only code that would notice is the memory error
code, and it is set up to retry in such a case.
A subsequent patch will create a routine to do bulk processing of
vmemmap allocation. This will eliminate a lock/unlock cycle for each
hugetlb page in the case where we are freeing a large number of pages.
Link: https://lkml.kernel.org/r/20230711220942.43706-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20230711220942.43706-2-mike.kravetz@oracle.com
Fixes:
|
||
Matthew Wilcox (Oracle)
|
66cbbe6b31 |
FROMGIT: mm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()
Handle a little more of the page fault path outside the mmap sem. The hugetlb path doesn't need to check whether the VMA is anonymous; the VM_HUGETLB flag is only set on hugetlbfs VMAs. There should be no performance change from the previous commit; this is simply a step to ease bisection of any problems. Link: https://lkml.kernel.org/r/20230724185410.1124082-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 51db5e8974cafee10b2252efa78f89af7d60cd11 https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable) Bug: 293665307 Change-Id: I300c7105fa3530e8eb05862cb3f66b7adac99420 Signed-off-by: Suren Baghdasaryan <surenb@google.com> |
||
Suren Baghdasaryan
|
ad18923856 |
FROMGIT: mm: replace mmap with vma write lock assertions when operating on a vma
Vma write lock assertion always includes mmap write lock assertion and additional vma lock checks when per-VMA locks are enabled. Replace weaker mmap_assert_write_locked() assertions with stronger vma_assert_write_locked() ones when we are operating on a vma which is expected to be locked. Link: https://lkml.kernel.org/r/20230804152724.3090321-4-surenb@google.com Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 928a31b91cf64aa99a8999dcd66bec0ad02f64ef https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable) Bug: 293665307 Change-Id: I861db0510612f571f2ca44e0a9d7e01274d4eb36 Signed-off-by: Suren Baghdasaryan <surenb@google.com> |
||
Suren Baghdasaryan
|
bf16383ebd |
UPSTREAM: mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
To simplify the usage of VM_LOCKED_CLEAR_MASK in vm_flags_clear(), replace it with VM_LOCKED_MASK bitmask and convert all users. Link: https://lkml.kernel.org/r/20230126193752.297968-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Sebastian Reichel <sebastian.reichel@collabora.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit e430a95a04efc557bc4ff9b3035c7c85aee5d63f) Bug: 161210518 Change-Id: I17bbcc01a133511dbfaf3d82fbc4b25ecdd0b376 Signed-off-by: Suren Baghdasaryan <surenb@google.com> |
||
Peter Xu
|
f042ee354c |
mm/hugetlb: fix uffd wr-protection for CoW optimization path
commit 60d5b473d61be61ac315e544fcd6a8234a79500e upstream.
This patch fixes an issue that a hugetlb uffd-wr-protected mapping can be
writable even with uffd-wp bit set. It only happens with hugetlb private
mappings, when someone firstly wr-protects a missing pte (which will
install a pte marker), then a write to the same page without any prior
access to the page.
Userfaultfd-wp trap for hugetlb was implemented in hugetlb_fault() before
reaching hugetlb_wp() to avoid taking more locks that userfault won't
need. However there's one CoW optimization path that can trigger
hugetlb_wp() inside hugetlb_no_page(), which will bypass the trap.
This patch skips hugetlb_wp() for CoW and retries the fault if uffd-wp bit
is detected. The new path will only trigger in the CoW optimization path
because generic hugetlb_fault() (e.g. when a present pte was
wr-protected) will resolve the uffd-wp bit already. Also make sure
anonymous UNSHARE won't be affected and can still be resolved, IOW only
skip CoW not CoR.
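In code, the described behaviour is a small early bail-out in hugetlb_wp(); a hedged sketch (argument handling elided):

```c
	/*
	 * Sketch: at the top of hugetlb_wp().  Never CoW a uffd-wp protected
	 * pte from the hugetlb_no_page() optimization path; returning 0 makes
	 * the fault retry so hugetlb_fault() can deliver the wr-protect event
	 * first.  Unshare (CoR) requests are still allowed through.
	 */
	if (!unshare && huge_pte_uffd_wp(huge_ptep_get(ptep)))
		return 0;
```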
This patch will be needed for v5.19+ hence copy stable.
[peterx@redhat.com: v2]
Link: https://lkml.kernel.org/r/ZBzOqwF2wrHgBVZb@x1n
[peterx@redhat.com: v3]
Link: https://lkml.kernel.org/r/20230324142620.2344140-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230321191840.1897940-1-peterx@redhat.com
Fixes:
|
||
Peter Xu
|
3b8ede6665 |
mm/hugetlb: pre-allocate pgtable pages for uffd wr-protects
commit fed15f1345dc8a7fc8baa81e8b55c3ba010d7f4b upstream.
Userfaultfd-wp uses pte markers to mark wr-protected pages for both shmem
and hugetlb. Shmem has pre-allocation ready for markers, but hugetlb path
was overlooked.
Doing so by calling huge_pte_alloc() if the initial pgtable walk fails to
find the huge ptep. It's possible that huge_pte_alloc() can fail with
high memory pressure, in that case stop the loop immediately and fail
silently. This is not the most ideal solution but it matches with what we
do with shmem meanwhile it avoids the splat in dmesg.
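A sketch of the walk-and-allocate step this describes, inside the hugetlb_change_protection() loop (illustrative; uffd_wp stands for the "wr-protect requested" condition):

```c
	/* Sketch: inside the hugetlb_change_protection() loop. */
	ptep = huge_pte_offset(mm, address, psize);
	if (!ptep) {
		if (!uffd_wp) {
			/* Nothing mapped and no marker wanted: skip ahead. */
			address |= last_addr_mask;
			continue;
		}
		/* uffd-wp needs a pgtable so a pte marker can be installed. */
		ptep = huge_pte_alloc(mm, vma, address, psize);
		if (!ptep)
			break;	/* OOM: stop quietly, like the shmem path */
	}
```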
Link: https://lkml.kernel.org/r/20230104225207.1066932-2-peterx@redhat.com
Fixes:
|
||
David Hildenbrand
|
8d6a675cd7 |
mm/hugetlb: fix uffd-wp handling for migration entries in hugetlb_change_protection()
commit 44f86392bdd165da7e43d3c772aeb1e128ffd6c8 upstream.
We have to update the uffd-wp SWP PTE bit independent of the type of
migration entry. Currently, if we're unlucky and we want to install/clear
the uffd-wp bit just while we're migrating a read-only mapped hugetlb
page, we would miss to set/clear the uffd-wp bit.
Further, if we're processing a readable-exclusive migration entry and
neither want to set or clear the uffd-wp bit, we could currently end up
losing the uffd-wp bit. Note that the same would hold for writable
migrating entries, however, having a writable migration entry with the
uffd-wp bit set would already mean that something went wrong.
Note that the change from !is_readable_migration_entry ->
writable_migration_entry is harmless and actually cleaner, as raised by
Miaohe Lin and discussed in [1].
[1] https://lkml.kernel.org/r/90dd6a93-4500-e0de-2bf0-bf522c311b0c@huawei.com
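A sketch of the migration-entry branch after this change (illustrative, but it follows the structure the message describes: the uffd-wp update is applied to the swap pte regardless of entry type):

```c
	/* Sketch: migration-entry case in hugetlb_change_protection(). */
	if (unlikely(is_hugetlb_entry_migration(pte))) {
		swp_entry_t entry = pte_to_swp_entry(pte);
		pte_t newpte = pte;

		if (is_writable_migration_entry(entry)) {
			entry = make_readable_migration_entry(swp_offset(entry));
			newpte = swp_entry_to_pte(entry);
		}

		/* Update the uffd-wp bit for every migration entry type. */
		if (uffd_wp)
			newpte = pte_swp_mkuffd_wp(newpte);
		else if (uffd_wp_resolve)
			newpte = pte_swp_clear_uffd_wp(newpte);

		if (!pte_same(pte, newpte))
			set_huge_pte_at(mm, address, ptep, newpte);
	}
```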
Link: https://lkml.kernel.org/r/20221222205511.675832-3-david@redhat.com
Fixes:
|
||
David Hildenbrand
|
6062c992e9 |
mm/hugetlb: fix PTE marker handling in hugetlb_change_protection()
commit 0e678153f5be7e6c8d28835f5a678618da4b7a9c upstream.
Patch series "mm/hugetlb: uffd-wp fixes for hugetlb_change_protection()".
Playing with virtio-mem and background snapshots (using uffd-wp) on
hugetlb in QEMU, I managed to trigger a VM_BUG_ON(). Looking into the
details, hugetlb_change_protection() seems to not handle uffd-wp correctly
in all cases.
Patch #1 fixes my test case. I don't have reproducers for patch #2, as it
requires running into migration entries.
I did not yet check in detail if !hugetlb code requires similar care.
This patch (of 2):
There are two problematic cases when stumbling over a PTE marker in
hugetlb_change_protection():
(1) We protect an uffd-wp PTE marker a second time using uffd-wp: we will
end up in the "!huge_pte_none(pte)" case and mess up the PTE marker.
(2) We unprotect a uffd-wp PTE marker: we will similarly end up in the
"!huge_pte_none(pte)" case even though we cleared the PTE, because
the "pte" variable is stale. We'll mess up the PTE marker.
For example, if we later stumble over such a "wrongly modified" PTE marker,
we'll treat it like a present PTE that maps some garbage page.
This can, for example, be triggered by mapping a memfd backed by huge
pages, registering uffd-wp, uffd-wp'ing an unmapped page and (a)
uffd-wp'ing it a second time; or (b) uffd-unprotecting it; or (c)
unregistering uffd-wp. Then, if we trigger fallocate(FALLOC_FL_PUNCH_HOLE)
on that file range, we will run into a VM_BUG_ON:
[ 195.039560] page:00000000ba1f2987 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x0
[ 195.039565] flags: 0x7ffffc0001000(reserved|node=0|zone=0|lastcpupid=0x1fffff)
[ 195.039568] raw: 0007ffffc0001000 ffffe742c0000008 ffffe742c0000008 0000000000000000
[ 195.039569] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[ 195.039569] page dumped because: VM_BUG_ON_PAGE(compound && !PageHead(page))
[ 195.039573] ------------[ cut here ]------------
[ 195.039574] kernel BUG at mm/rmap.c:1346!
[ 195.039579] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 195.039581] CPU: 7 PID: 4777 Comm: qemu-system-x86 Not tainted 6.0.12-200.fc36.x86_64 #1
[ 195.039583] Hardware name: LENOVO 20WNS1F81N/20WNS1F81N, BIOS N35ET50W (1.50 ) 09/15/2022
[ 195.039584] RIP: 0010:page_remove_rmap+0x45b/0x550
[ 195.039588] Code: [...]
[ 195.039589] RSP: 0018:ffffbc03c3633ba8 EFLAGS: 00010292
[ 195.039591] RAX: 0000000000000040 RBX: ffffe742c0000000 RCX: 0000000000000000
[ 195.039592] RDX: 0000000000000002 RSI: ffffffff8e7aac1a RDI: 00000000ffffffff
[ 195.039592] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffbc03c3633a08
[ 195.039593] R10: 0000000000000003 R11: ffffffff8f146328 R12: ffff9b04c42754b0
[ 195.039594] R13: ffffffff8fcc6328 R14: ffffbc03c3633c80 R15: ffff9b0484ab9100
[ 195.039595] FS: 00007fc7aaf68640(0000) GS:ffff9b0bbf7c0000(0000) knlGS:0000000000000000
[ 195.039596] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 195.039597] CR2: 000055d402c49110 CR3: 0000000159392003 CR4: 0000000000772ee0
[ 195.039598] PKRU: 55555554
[ 195.039599] Call Trace:
[ 195.039600] <TASK>
[ 195.039602] __unmap_hugepage_range+0x33b/0x7d0
[ 195.039605] unmap_hugepage_range+0x55/0x70
[ 195.039608] hugetlb_vmdelete_list+0x77/0xa0
[ 195.039611] hugetlbfs_fallocate+0x410/0x550
[ 195.039612] ? _raw_spin_unlock_irqrestore+0x23/0x40
[ 195.039616] vfs_fallocate+0x12e/0x360
[ 195.039618] __x64_sys_fallocate+0x40/0x70
[ 195.039620] do_syscall_64+0x58/0x80
[ 195.039623] ? syscall_exit_to_user_mode+0x17/0x40
[ 195.039624] ? do_syscall_64+0x67/0x80
[ 195.039626] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 195.039628] RIP: 0033:0x7fc7b590651f
[ 195.039653] Code: [...]
[ 195.039654] RSP: 002b:00007fc7aaf66e70 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[ 195.039655] RAX: ffffffffffffffda RBX: 0000558ef4b7f370 RCX: 00007fc7b590651f
[ 195.039656] RDX: 0000000018000000 RSI: 0000000000000003 RDI: 000000000000000c
[ 195.039657] RBP: 0000000008000000 R08: 0000000000000000 R09: 0000000000000073
[ 195.039658] R10: 0000000008000000 R11: 0000000000000293 R12: 0000000018000000
[ 195.039658] R13: 00007fb8bbe00000 R14: 000000000000000c R15: 0000000000001000
[ 195.039661] </TASK>
Fix it by not going into the "!huge_pte_none(pte)" case if we stumble over
an exclusive marker. spin_unlock() + continue would get the job done.
However, instead, make it clearer that there are no fall-through
statements: we process each case (hwpoison, migration, marker, !none,
none) and then unlock the page table to continue with the next PTE. Let's
avoid "continue" statements and use a single spin_unlock() at the end.
Link: https://lkml.kernel.org/r/20221222205511.675832-1-david@redhat.com
Link: https://lkml.kernel.org/r/20221222205511.675832-2-david@redhat.com
Fixes:
|
||
James Houghton
|
63f71b8609 |
hugetlb: unshare some PMDs when splitting VMAs
commit b30c14cd61025eeea2f2e8569606cd167ba9ad2d upstream.
PMD sharing can only be done in PUD_SIZE-aligned pieces of VMAs; however,
it is possible that HugeTLB VMAs are split without unsharing the PMDs
first.
Without this fix, it is possible to hit the uffd-wp-related WARN_ON_ONCE
in hugetlb_change_protection [1]. The key there is that
hugetlb_unshare_all_pmds will not attempt to unshare PMDs in
non-PUD_SIZE-aligned sections of the VMA.
It might seem ideal to unshare in hugetlb_vm_op_open, but we need to
unshare in both the new and old VMAs, so unsharing in hugetlb_vm_op_split
seems natural.
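A sketch of the split-time unsharing this describes, assuming a range-based helper named hugetlb_unshare_pmds() factored out of hugetlb_unshare_all_pmds():

```c
/* Sketch of the split-time unsharing (illustrative, not the literal diff). */
static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
{
	if (addr & ~(huge_page_mask(hstate_vma(vma))))
		return -EINVAL;

	/*
	 * PMD sharing only works on PUD_SIZE-aligned ranges.  If this split
	 * breaks that alignment, unshare the PMDs in the surrounding
	 * PUD_SIZE interval now, for both halves of the VMA.
	 */
	if (addr & ~PUD_MASK) {
		unsigned long floor = addr & PUD_MASK;
		unsigned long ceil = floor + PUD_SIZE;

		if (floor >= vma->vm_start && ceil <= vma->vm_end)
			hugetlb_unshare_pmds(vma, floor, ceil);
	}

	return 0;
}
```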
[1]: https://lore.kernel.org/linux-mm/CADrL8HVeOkj0QH5VZZbRzybNE8CG-tEGFshnA+bG9nMgcWtBSg@mail.gmail.com/
Link: https://lkml.kernel.org/r/20230104231910.1464197-1-jthoughton@google.com
Fixes:
|
||
Mike Kravetz
|
17183187dc |
hugetlb: really allocate vma lock for all sharable vmas
commit e700898fa075c69b3ae02b702ab57fb75e1a82ec upstream. Commit |
||
Mike Kravetz
|
04ada095dc |
hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
tables associated with the address range. For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final. However,
__unmap_hugepage_range_final assumes the passed vma is about to be removed
and deletes the vma_lock to prevent pmd sharing as the vma is on the way
out. In the case of madvise(MADV_DONTNEED) the vma remains, but the
missing vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.
This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap. Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing. Subsequent faults on this vma were confused as
VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
not set in new pages added to the page table. This resulted in pages that
appeared anonymous in a VM_SHARED vma and triggered the BUG.
Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
call from unmap_vmas(). This is used to indicate the 'final' unmapping of
a hugetlb vma. When called via MADV_DONTNEED, this flag is not set and
the vma_lock is not deleted.
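A sketch of how the new flag is consumed in __unmap_hugepage_range_final(), as described above (illustrative; locking details simplified):

```c
/* Sketch: only tear the vma_lock down when the vma itself is going away. */
void __unmap_hugepage_range_final(struct mmu_gather *tlb,
		struct vm_area_struct *vma, unsigned long start,
		unsigned long end, struct page *ref_page,
		zap_flags_t zap_flags)
{
	hugetlb_vma_lock_write(vma);
	i_mmap_lock_write(vma->vm_file->f_mapping);

	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);

	if (zap_flags & ZAP_FLAG_UNMAP) {
		/* final unmap from unmap_vmas(): the lock may go with the vma */
		__hugetlb_vma_unlock_write_free(vma);
	} else {
		/* MADV_DONTNEED: the vma stays, so keep its vma_lock alive */
		hugetlb_vma_unlock_write(vma);
	}

	i_mmap_unlock_write(vma->vm_file->f_mapping);
}
```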
[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
Fixes:
|
||
Mike Kravetz
|
7fb0728a9b |
hugetlb: fix __prep_compound_gigantic_page page flag setting
Commit |
||
James Houghton
|
8625147caf |
hugetlbfs: don't delete error page from pagecache
This change is very similar to the change that was made for shmem [1], and
it solves the same problem but for HugeTLBFS instead.
Currently, when poison is found in a HugeTLB page, the page is removed
from the page cache. That means that attempting to map or read that
hugepage in the future will result in a new hugepage being allocated
instead of notifying the user that the page was poisoned. As [1] states,
this is effectively memory corruption.
The fix is to leave the page in the page cache. If the user attempts to
use a poisoned HugeTLB page with a syscall, the syscall will fail with
EIO, the same error code that shmem uses. For attempts to map the page,
the thread will get a BUS_MCEERR_AR SIGBUS.
[1]: commit
|
||
Mike Kravetz
|
612b8a3170 |
hugetlb: fix memory leak associated with vma_lock structure
The hugetlb vma_lock structure hangs off the vm_private_data pointer of
sharable hugetlb vmas. The structure is vma specific and can not be
shared between vmas. At fork and various other times, vmas are duplicated
via vm_area_dup(). When this happens, the pointer in the newly created
vma must be cleared and the structure reallocated. Two hugetlb specific
routines deal with this hugetlb_dup_vma_private and hugetlb_vm_op_open.
Both routines are called for newly created vmas. hugetlb_dup_vma_private
would always clear the pointer and hugetlb_vm_op_open would allocate the
new vma_lock structure. This did not work in the case of this calling
sequence pointed out in [1].
move_vma
  copy_vma
    new_vma = vm_area_dup(vma);
    new_vma->vm_ops->open(new_vma); --> new_vma has its own vma lock.
  is_vm_hugetlb_page(vma)
    clear_vma_resv_huge_pages
      hugetlb_dup_vma_private --> vma->vm_private_data is set to NULL
When clearing hugetlb_dup_vma_private we actually leak the associated
vma_lock structure.
The vma_lock structure contains a pointer to the associated vma. This
information can be used in hugetlb_dup_vma_private and hugetlb_vm_op_open
to ensure we only clear the vm_private_data of newly created (copied)
vmas. In such cases, the vma->vma_lock->vma field will not point to the
vma.
Update hugetlb_dup_vma_private and hugetlb_vm_op_open to not clear
vm_private_data if vma->vma_lock->vma == vma. Also, log a warning if
hugetlb_vm_op_open ever encounters the case where vma_lock has already
been correctly allocated for the vma.
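A sketch of the ownership test this describes in hugetlb_dup_vma_private() (illustrative):

```c
/* Sketch: only clear vm_private_data when it was copied from another vma. */
static void hugetlb_dup_vma_private(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_MAYSHARE) {
		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

		/* The back-pointer tells us whether this lock is really ours. */
		if (vma_lock && vma_lock->vma != vma)
			vma->vm_private_data = NULL;	/* copied pointer */
	} else {
		vma->vm_private_data = NULL;		/* reserves not inherited */
	}
}
```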
[1] https://lore.kernel.org/linux-mm/5154292a-4c55-28cd-0935-82441e512fc3@huawei.com/
Link: https://lkml.kernel.org/r/20221019201957.34607-1-mike.kravetz@oracle.com
Fixes:
|
||
Rik van Riel
|
12df140f0b |
mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages
The h->*_huge_pages counters are protected by the hugetlb_lock, but
alloc_huge_page has a corner case where it can decrement the counter
outside of the lock.
This could lead to a corrupted value of h->resv_huge_pages, which we have
observed on our systems.
Take the hugetlb_lock before decrementing h->resv_huge_pages to avoid a
potential race.
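The fix is purely about lock placement; an illustrative before/after sketch (the real condition guarding the decrement in alloc_huge_page() is elided):

```c
	/* Before (sketch): the counter update raced with other lock holders. */
	h->resv_huge_pages--;
	spin_lock_irq(&hugetlb_lock);
	/* ... */
	spin_unlock_irq(&hugetlb_lock);

	/* After (sketch): only touch the counter while holding hugetlb_lock. */
	spin_lock_irq(&hugetlb_lock);
	h->resv_huge_pages--;
	/* ... */
	spin_unlock_irq(&hugetlb_lock);
```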
Link: https://lkml.kernel.org/r/20221017202505.0e6a4fcd@imladris.surriel.com
Fixes:
|
||
Linus Torvalds
|
5e714bf171 |
- Alistair Popple has a series which addresses a race which causes page refcounting errors in ZONE_DEVICE pages.
- Peter Xu fixes some userfaultfd test harness instability.
- Various other patches in MM, mainly fixes.
Merge tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- fix a race which causes page refcounting errors in ZONE_DEVICE pages (Alistair Popple)
- fix userfaultfd test harness instability (Peter Xu)
- various other patches in MM, mainly fixes
* tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (29 commits)
highmem: fix kmap_to_page() for kmap_local_page() addresses
mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page
mm/selftest: uffd: explain the write missing fault check
mm/hugetlb: use hugetlb_pte_stable in migration race check
mm/hugetlb: fix race condition of uffd missing/minor handling
zram: always expose rw_page
LoongArch: update local TLB if PTE entry exists
mm: use update_mmu_tlb() on the second thread
kasan: fix array-bounds warnings in tests
hmm-tests: add test for migrate_device_range()
nouveau/dmem: evict device private memory during release
nouveau/dmem: refactor nouveau_dmem_fault_copy_one()
mm/migrate_device.c: add migrate_device_range()
mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()
mm/memremap.c: take a pgmap reference on page allocation
mm: free device private pages have zero refcount
mm/memory.c: fix race when faulting a device private page
mm/damon: use damon_sz_region() in appropriate place
mm/damon: move sz_damon_region to damon_sz_region
lib/test_meminit: add checks for the allocation functions
... |
||
Peter Xu
|
f9bf6c03ec |
mm/hugetlb: use hugetlb_pte_stable in migration race check
After hugetlb_pte_stable() introduced, we can also rewrite the migration race condition against page allocation to use the new helper too. Link: https://lkml.kernel.org/r/20221004193400.110155-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peter Xu
|
2ea7ff1e39 |
mm/hugetlb: fix race condition of uffd missing/minor handling
Patch series "mm/hugetlb: Fix selftest failures with write check", v3. Currently akpm mm-unstable fails with uffd hugetlb private mapping test randomly on a write check. The initial bisection of that points to the recent pmd unshare series, but it turns out there's no direction relationship with the series but only some timing change caused the race to start trigger. The race should be fixed in patch 1. Patch 2 is a trivial cleanup on the similar race with hugetlb migrations, patch 3 comment on the write check so when anyone read it again it'll be clear why it's there. This patch (of 3): After the recent rework patchset of hugetlb locking on pmd sharing, kselftest for userfaultfd sometimes fails on hugetlb private tests with unexpected write fault checks. It turns out there's nothing wrong within the locking series regarding this matter, but it could have changed the timing of threads so it can trigger an old bug. The real bug is when we call hugetlb_no_page() we're not with the pgtable lock. It means we're reading the pte values lockless. It's perfectly fine in most cases because before we do normal page allocations we'll take the lock and check pte_same() again. However before that, there are actually two paths on userfaultfd missing/minor handling that may directly move on with the fault process without checking the pte values. It means for these two paths we may be generating an uffd message based on an unstable pte, while an unstable pte can legally be anything as long as the modifier holds the pgtable lock. One example, which is also what happened in the failing kselftest and caused the test failure, is that for private mappings wr-protection changes can happen on one page. While hugetlb_change_protection() generally requires pte being cleared before being changed, then there can be a race condition like: thread 1 thread 2 -------- -------- UFFDIO_WRITEPROTECT hugetlb_fault hugetlb_change_protection pgtable_lock() huge_ptep_modify_prot_start pte==NULL hugetlb_no_page generate uffd missing event even if page existed!! huge_ptep_modify_prot_commit pgtable_unlock() Fix this by rechecking the pte after pgtable lock for both userfaultfd missing & minor fault paths. This bug should have been around starting from uffd hugetlb introduced, so attaching a Fixes to the commit. Also attach another Fixes to the minor support commit for easier tracking. Note that userfaultfd is actually fine with false positives (e.g. caused by pte changed), but not wrong logical events (e.g. caused by reading a pte during changing). The latter can confuse the userspace, so the strictness is very much preferred. E.g., MISSING event should never happen on the page after UFFDIO_COPY has correctly installed the page and returned. Link: https://lkml.kernel.org/r/20221004193400.110155-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20221004193400.110155-2-peterx@redhat.com Fixes: |
||
Peter Xu
|
515778e2d7 |
mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in
When PTE_MARKER_UFFD_WP not configured, it's still possible to reach pte
marker code and trigger a warning. Add a few CONFIG_PTE_MARKER_UFFD_WP
ifdefs to make sure the code won't be reached when not compiled in.
Link: https://lkml.kernel.org/r/YzeR+R6b4bwBlBHh@x1n
Fixes:
|
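The shape of the fix above is a plain compile-time guard around the marker code; a minimal sketch (with an invented marker encoding, purely for illustration) might be:

```c
/* Sketch of the compile-time guard pattern described above. The config macro
 * and helper below are illustrative stand-ins, not the exact kernel hunks. */
#include <stdbool.h>
#include <stdio.h>

/* #define CONFIG_PTE_MARKER_UFFD_WP 1   -- set by Kconfig in the kernel */

static bool pte_marker_uffd_wp(unsigned long pte)
{
#ifdef CONFIG_PTE_MARKER_UFFD_WP
    return pte & 0x1;      /* placeholder decoding of a wp marker bit */
#else
    (void)pte;
    return false;          /* feature compiled out: never report a marker */
#endif
}

int main(void)
{
    printf("uffd-wp marker seen: %d\n", pte_marker_uffd_wp(0x1));
    return 0;
}
```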
||
Andrew Morton
|
acfac37851 |
mm/hugetlb.c: make __hugetlb_vma_unlock_write_put() static
Reported-by: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Linus Torvalds
|
1440f57602 |
Five hotfixes - three for nilfs2, two for MM. Four are cc:stable, one is
not. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0YhtwAKCRDdBJ7gKXxA juJLAQDCa0g8sfe9cTw3PT1gRnn8gWLHEkMgUWVC/aBaqYFGeQEAta+g8muv9Tpd qODv0JARH4cwONKEA24Oql+A5RnI6gQ= =QZnW -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "Five hotfixes - three for nilfs2, two for MM. For are cc:stable, one is not" * tag 'mm-hotfixes-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: nilfs2: fix leak of nilfs_root in case of writer thread creation failure nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level() nilfs2: fix use-after-free bug of struct nilfs_root mm/damon/core: initialize damon_target->list in damon_new_target() mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page |
||
Baolin Wang
|
fac35ba763 |
mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page
On some architectures (like ARM64), CONT-PTE/PMD size hugetlb is supported, which means not only PMD/PUD size hugetlb (2M and 1G) but also CONT-PTE/PMD sizes (64K and 32M) are available when a 4K page size is specified. So when looking up a CONT-PTE size hugetlb page by follow_page(), follow_page_pte() will use pte_offset_map_lock() to get the pte entry lock for it. However, this pte entry lock is incorrect for CONT-PTE size hugetlb: we should use huge_pte_lock() to get the correct lock, which is mm->page_table_lock. That means the pte entry of a CONT-PTE size hugetlb is unstable under the lock taken in follow_page_pte(); we can still migrate or poison the pte entry of the CONT-PTE size hugetlb even though it is nominally under the 'pte lock', which can cause potential races. For example, suppose thread A is looking up a CONT-PTE size hugetlb page via the move_pages() syscall under the lock, while another thread B migrates the CONT-PTE hugetlb page at the same time; thread A then gets an incorrect page, and if thread A also wants to do page migration, a data inconsistency error occurs. We have the same issue for CONT-PMD size hugetlb in follow_huge_pmd(). To fix the above issues, rename follow_huge_pmd() to follow_huge_pmd_pte() to handle both PMD and PTE level size hugetlb, and use huge_pte_lock() there to get the correct pte entry lock so that the pte entry stays stable. Mike said: Support for CONT_PMD/_PTE was added with |
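A rough user-space analogue of the lock-selection bug and fix: an entry that spans several page-table slots must be guarded by the coarser per-mm lock rather than the per-PTE-page split lock. The types and helper below are stand-ins, not the kernel's huge_pte_lock()/follow_page_pte().

```c
/* User-space analogue of the lock-selection fix: entries that span several
 * page-table slots (CONT-PTE/PMD hugetlb) must be protected by the coarser
 * per-mm lock, not by the per-PTE-page split lock. Names are stand-ins. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct mm {
    pthread_mutex_t page_table_lock;   /* coarse lock, covers CONT entries  */
    pthread_mutex_t pte_split_lock;    /* fine-grained lock for normal PTEs */
};

static pthread_mutex_t *huge_pte_lockptr(struct mm *mm, bool is_cont_hugetlb)
{
    /* the old code took the split lock unconditionally; for CONT-PTE/PMD
     * hugetlb the stable lock is the per-mm page_table_lock */
    return is_cont_hugetlb ? &mm->page_table_lock : &mm->pte_split_lock;
}

int main(void)
{
    struct mm mm = {
        PTHREAD_MUTEX_INITIALIZER,
        PTHREAD_MUTEX_INITIALIZER,
    };
    pthread_mutex_t *ptl = huge_pte_lockptr(&mm, true);

    pthread_mutex_lock(ptl);      /* pte entry is now stable for the lookup */
    puts("looked up CONT-PTE hugetlb page under the correct lock");
    pthread_mutex_unlock(ptl);
    return 0;
}
```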
||
Mike Kravetz
|
bbff39cc6c |
hugetlb: allocate vma lock for all sharable vmas
The hugetlb vma lock was originally designed to synchronize pmd sharing. As such, it was only necessary to allocate the lock for vmas that were capable of pmd sharing. Later in the development cycle, it was discovered that it could also be used to simplify fault/truncation races as described in [1]. However, a subsequent change to allocate the lock for all vmas that use the page cache was never made. A fault/truncation race could leave pages in a file past i_size until the file is removed. Remove the previous restriction and allocate lock for all VM_MAYSHARE vmas. Warn in the unlikely event of allocation failure. [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com Fixes: "hugetlb: clean up code checking for fault/truncation races" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
ecfbd73387 |
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb file truncation/hole punch code may need to back out and take locks in order in the routine hugetlb_unmap_file_folio(). This code could race with vma freeing as pointed out in [1] and result in accessing a stale vma pointer. To address this, take the vma_lock when clearing the vma_lock->vma pointer. [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/ [mike.kravetz@oracle.com: address build issues] Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
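The ordering described above can be sketched in user-space C: the back-pointer is cleared only while the lock's own rwsem is held in write mode, so a racing reader sees either a valid vma pointer or NULL, never a freed one. Types and names are simplified stand-ins for the kernel structures.

```c
/* User-space analogue of "clear the back-pointer only while holding the
 * lock's own rwsem in write mode". Types are simplified stand-ins. */
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

struct vma_lock {
    pthread_rwlock_t rw_sema;
    void *vma;                      /* back-pointer being cleared */
};

static void clear_vma_pointer(struct vma_lock *lock)
{
    pthread_rwlock_wrlock(&lock->rw_sema);
    lock->vma = NULL;               /* safe: no reader holds the lock now */
    pthread_rwlock_unlock(&lock->rw_sema);
}

int main(void)
{
    int dummy_vma;
    struct vma_lock lock = { PTHREAD_RWLOCK_INITIALIZER, &dummy_vma };

    clear_vma_pointer(&lock);
    printf("vma pointer after clear: %p\n", lock.vma);
    return 0;
}
```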
||
Mike Kravetz
|
131a79b474 |
hugetlb: fix vma lock handling during split vma and range unmapping
Patch series "hugetlb: fixes for new vma lock series". In review of the series "hugetlb: Use new vma lock for huge pmd sharing synchronization", Miaohe Lin pointed out two key issues: 1) There is a race in the routine hugetlb_unmap_file_folio when locks are dropped and reacquired in the correct order [1]. 2) With the switch to using vma lock for fault/truncate synchronization, we need to make sure lock exists for all VM_MAYSHARE vmas, not just vmas capable of pmd sharing. These two issues are addressed here. In addition, having a vma lock present in all VM_MAYSHARE vmas, uncovered some issues around vma splitting. Those are also addressed. [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/ This patch (of 3): The hugetlb vma lock hangs off the vm_private_data field and is specific to the vma. When vm_area_dup() is called as part of vma splitting, the vma lock pointer is copied to the new vma. This will result in issues such as double freeing of the structure. Update the hugetlb open vm_ops to allocate a new vma lock for the new vma. The routine __unmap_hugepage_range_final unconditionally unset VM_MAYSHARE to prevent subsequent pmd sharing. hugetlb_vma_lock_free attempted to anticipate this by checking both VM_MAYSHARE and VM_SHARED. However, if only VM_MAYSHARE was set we would miss the free. With the introduction of the vma lock, a vma can not participate in pmd sharing if vm_private_data is NULL. Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final, free the vma lock to prevent sharing. Also, update the sharing code to make sure vma lock is indeed a condition for pmd sharing. hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas. Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com Fixes: "hugetlb: add vma based lock for pmd sharing" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Xin Hao
|
8346d69d8b |
mm/hugetlb: add available_huge_pages() func
In hugetlb.c there are several places which compare the values of 'h->free_huge_pages' and 'h->resv_huge_pages', it looks a bit messy, so add a new available_huge_pages() function to do these. Link: https://lkml.kernel.org/r/20220922021929.98961-1-xhao@linux.alibaba.com Signed-off-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
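A sketch of the kind of helper this patch adds -- one predicate that answers "is there a free huge page not already claimed by a reservation?" -- assuming a reduced stand-in for struct hstate (the exact kernel expression may differ):

```c
/* Sketch of the helper described above, using a reduced stand-in for the
 * kernel's struct hstate; only the two relevant counters are modelled. */
#include <stdbool.h>
#include <stdio.h>

struct hstate {
    unsigned long free_huge_pages;
    unsigned long resv_huge_pages;
};

static bool available_huge_pages(const struct hstate *h)
{
    return h->free_huge_pages > h->resv_huge_pages;
}

int main(void)
{
    struct hstate h = { .free_huge_pages = 4, .resv_huge_pages = 4 };

    /* every free page is reserved, so nothing is available for new users */
    printf("available: %s\n", available_huge_pages(&h) ? "yes" : "no");
    return 0;
}
```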
||
Liu Shixin
|
958f32ce83 |
mm: hugetlb: fix UAF in hugetlb_handle_userfault
The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
and reacquired after handle_userfault(), but reacquiring the vma_lock can
lead to a UAF [1,2] due to the following race:
hugetlb_fault
  hugetlb_no_page
    /* unlock vma_lock */
    hugetlb_handle_userfault
      handle_userfault
        /* unlock mm->mmap_lock */
                                        vm_mmap_pgoff
                                          do_mmap
                                            mmap_region
                                              munmap_vma_range
                                                /* clean old vma */
        /* lock vma_lock again  <--- UAF */
      /* unlock vma_lock */
Since the vma_lock will unlock immediately after
hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes:
|
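A user-space analogue of the resulting control flow: the userfault handler releases the locks itself before calling out and never re-acquires them, so nothing touches the possibly-freed vma afterwards. The function names below are illustrative stubs, not the kernel routines.

```c
/* User-space analogue of the fix: once the callout (handle_userfault) may
 * lead to the vma being torn down, the handler releases the lock itself and
 * never re-acquires it, instead of the caller re-locking a possibly-freed
 * object. Names below are illustrative, not the kernel functions. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t fault_mutex = PTHREAD_MUTEX_INITIALIZER;

static int handle_userfault_stub(void)
{
    /* while we are here, another thread may unmap and free the vma */
    return 0;
}

static int hugetlb_handle_userfault(void)
{
    /* drop every lock that hangs off the vma BEFORE calling out ... */
    pthread_mutex_unlock(&fault_mutex);

    /* ... and return without touching the vma or its locks again */
    return handle_userfault_stub();
}

int main(void)
{
    pthread_mutex_lock(&fault_mutex);       /* taken by the fault path */
    printf("uffd result: %d\n", hugetlb_handle_userfault());
    return 0;
}
```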
||
Mike Kravetz
|
2b21624fc2 |
hugetlb: freeze allocated pages before creating hugetlb pages
When creating hugetlb pages, the hugetlb code must first allocate contiguous pages from a low level allocator such as buddy, cma or memblock. The pages returned from these low level allocators are ref counted. This creates potential issues with other code taking speculative references on these pages before they can be transformed to a hugetlb page. This issue has been addressed with methods and code such as that provided in [1]. Recent discussions about vmemmap freeing [2] have indicated that it would be beneficial to freeze all sub pages, including the head page of pages returned from low level allocators before converting to a hugetlb page. This helps avoid races if we want to replace the page containing vmemmap for the head page. There have been proposals to change at least the buddy allocator to return frozen pages as described at [3]. If such a change is made, it can be employed by the hugetlb code. However, as mentioned above hugetlb uses several low level allocators so each would need to be modified to return frozen pages. For now, we can manually freeze the returned pages. This is done in two places: 1) alloc_buddy_huge_page, only the returned head page is ref counted. We freeze the head page, retrying once in the VERY rare case where there may be an inflated ref count. 2) prep_compound_gigantic_page, for gigantic pages the current code freezes all pages except the head page. New code will simply freeze the head page as well. In a few other places, code checks for inflated ref counts on newly allocated hugetlb pages. With the modifications to freeze after allocating, this code can be removed. After hugetlb pages are freshly allocated, they are often added to the hugetlb free lists. Since these pages were previously ref counted, this was done via put_page() which would end up calling the hugetlb destructor: free_huge_page. With changes to freeze pages, we simply call free_huge_page directly to add the pages to the free list. In a few other places, freshly allocated hugetlb pages were immediately put into use, and the expectation was they were already ref counted. In these cases, we must manually ref count the page. [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/ [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/ [3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/ [mike.kravetz@oracle.com: fix NULL pointer dereference] Link: https://lkml.kernel.org/r/20220921202702.106069-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20220916214638.155744-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
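The freezing step can be illustrated with C11 atomics: atomically move the refcount from its expected value to zero so no speculative reference can slip in, retrying once for the rare transient reference. This is an analogue of page_ref_freeze(), not the kernel code.

```c
/* User-space analogue of freezing a freshly allocated page's refcount:
 * atomically move it from the expected value (1) to 0 so no speculative
 * reference can sneak in, retrying once in case a transient ref exists. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static bool page_ref_freeze(atomic_int *refcount, int expected)
{
    int old = expected;

    /* succeeds only if nobody else holds a reference right now */
    return atomic_compare_exchange_strong(refcount, &old, 0);
}

int main(void)
{
    atomic_int head_ref = 1;        /* allocator returned the page with one ref */
    bool frozen = page_ref_freeze(&head_ref, 1);

    if (!frozen)                    /* VERY rare: a speculative ref exists */
        frozen = page_ref_freeze(&head_ref, 1);     /* retry once          */

    printf("frozen=%d ref=%d\n", frozen, atomic_load(&head_ref));
    return 0;
}
```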
||
Mike Kravetz
|
fa27759af4 |
hugetlb: clean up code checking for fault/truncation races
With the new hugetlb vma lock in place, it can also be used to handle page fault races with file truncation. The lock is taken at the beginning of the code fault path in read mode. During truncation, it is taken in write mode for each vma which has the file mapped. The file's size (i_size) is modified before taking the vma lock to unmap. How are races handled? The page fault code checks i_size early in processing after taking the vma lock. If the fault is beyond i_size, the fault is aborted. If the fault is not beyond i_size the fault will continue and a new page will be added to the file. It could be that truncation code modifies i_size after the check in fault code. That is OK, as truncation code will soon remove the page. The truncation code will wait until the fault is finished, as it must obtain the vma lock in write mode. This patch cleans up/removes late checks in the fault paths that try to back out pages racing with truncation. As noted above, we just let the truncation code remove the pages. [mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning] Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
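The ordering described above can be sketched in user-space terms: truncation shrinks i_size first and then takes each vma lock in write mode, while the fault path checks i_size right after taking the lock in read mode and aborts if the index is past EOF. This is only an analogue; the real kernel reads the size via i_size_read() and uses the hugetlb vma lock, not a bare rwlock.

```c
/* User-space analogue of the fault/truncate protocol described above.
 * Single-threaded demo; the kernel protects i_size with i_size_read/write. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t vma_lock = PTHREAD_RWLOCK_INITIALIZER;
static long i_size = 16;                 /* file size in huge pages */

static bool fault_in_page(long index)
{
    bool ok;

    pthread_rwlock_rdlock(&vma_lock);
    ok = index < i_size;                 /* early check, under the lock      */
    /* if ok, a new page may be added; truncation will remove it later       */
    pthread_rwlock_unlock(&vma_lock);
    return ok;
}

static void truncate_to(long new_size)
{
    i_size = new_size;                   /* shrink i_size first ...          */
    pthread_rwlock_wrlock(&vma_lock);    /* ... then wait out any fault      */
    /* unmap and remove pages past new_size here */
    pthread_rwlock_unlock(&vma_lock);
}

int main(void)
{
    truncate_to(8);
    printf("fault at index 12 allowed: %d\n", fault_in_page(12));
    return 0;
}
```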
||
Mike Kravetz
|
40549ba8f8 |
hugetlb: use new vma_lock for pmd sharing synchronization
The new hugetlb vma lock is used to address this race:

Faulting thread                                 Unsharing thread
...                                             ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...

The vma_lock is used as follows:
- During fault processing. The lock is acquired in read mode before doing a page table lock and allocation (huge_pte_alloc). The lock is held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is called.

Lock ordering issues come into play when unmapping a page from all vmas mapping the page. The i_mmap_rwsem must be held to search for the vmas, and the vma lock must be held before calling unmap which will call huge_pmd_unshare. This is done today in:
- try_to_migrate_one and try_to_unmap_ for page migration and memory error handling. In these routines we 'try' to obtain the vma lock and fail to unmap if unsuccessful. Calling routines already deal with the failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch. This routine also tries to acquire the vma lock. If it fails, it skips the unmapping. However, we can not have file truncation or hole punch fail because of contention. After hugetlb_vmdelete_list, truncation and hole punch call remove_inode_hugepages. remove_inode_hugepages checks for mapped pages and calls hugetlb_unmap_file_page to unmap them. hugetlb_unmap_file_page is designed to drop locks and reacquire in the correct order to guarantee unmap success.

Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
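The core rule can be shown with an ordinary rwlock: the fault path holds the per-vma lock in read mode for as long as it uses the ptep it looked up, and unsharing requires the same lock in write mode, so the ptep cannot be pulled out from under a faulting thread. The snippet is a user-space analogue, not the kernel's hugetlb_vma_lock_read/write helpers.

```c
/* User-space analogue of the locking rule stated above: readers (faults) use
 * the ptep only while holding the lock in read mode; the unshare path takes
 * it in write mode, excluding all fault users before invalidating the ptep. */
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

static pthread_rwlock_t hugetlb_vma_lock = PTHREAD_RWLOCK_INITIALIZER;
static int shared_pmd_page;                 /* stand-in for the shared pmd  */
static int *pmd = &shared_pmd_page;         /* the ptep points into this    */

static void fault_path(void)
{
    pthread_rwlock_rdlock(&hugetlb_vma_lock);
    int *ptep = pmd;                        /* huge_pte_offset()/alloc()    */
    printf("fault uses ptep=%p\n", (void *)ptep);
    pthread_rwlock_unlock(&hugetlb_vma_lock);  /* done with ptep            */
}

static void huge_pmd_unshare_path(void)
{
    pthread_rwlock_wrlock(&hugetlb_vma_lock);  /* excludes all fault users  */
    pmd = NULL;                                /* ptep would now be invalid */
    pthread_rwlock_unlock(&hugetlb_vma_lock);
}

int main(void)
{
    fault_path();
    huge_pmd_unshare_path();
    return 0;
}
```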
||
Mike Kravetz
|
8d9bfb2608 |
hugetlb: add vma based lock for pmd sharing
Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for synchronization use by vmas that could be involved in pmd sharing. This data structure contains a rw semaphore that is the primary tool used for synchronization. This new structure is ref counted, so that it can exist when NOT attached to a vma. This is only helpful in resolving lock ordering issues where code may need to obtain the vma_lock while there are no guarantees the vma may go away. By obtaining a ref on the structure, it can be guaranteed that at least the rw semaphore will not go away. Only add infrastructure for the new lock here. Actual use will be added in subsequent patches. [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release] Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
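A reduced sketch of the data structure being introduced: a refcounted object carrying an rw semaphore, hung off the vma's private data so it can outlive the vma while references are held. Field and function names below are illustrative, not the exact kernel definitions.

```c
/* Reduced sketch of the data structure described above: a refcounted object
 * holding an rwsem, so it can outlive the vma while a reference is held. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct hugetlb_vma_lock {
    atomic_int refs;               /* lets the lock outlive the vma      */
    pthread_rwlock_t rw_sema;      /* the actual synchronization object  */
    void *vma;                     /* back-pointer to the owning vma     */
};

static struct hugetlb_vma_lock *vma_lock_alloc(void *vma)
{
    struct hugetlb_vma_lock *lock = malloc(sizeof(*lock));

    if (!lock)
        return NULL;               /* caller just warns and carries on */
    atomic_init(&lock->refs, 1);
    pthread_rwlock_init(&lock->rw_sema, NULL);
    lock->vma = vma;
    return lock;
}

static void vma_lock_put(struct hugetlb_vma_lock *lock)
{
    if (atomic_fetch_sub(&lock->refs, 1) == 1) {   /* last reference gone */
        pthread_rwlock_destroy(&lock->rw_sema);
        free(lock);
    }
}

int main(void)
{
    struct hugetlb_vma_lock *lock = vma_lock_alloc(NULL);

    if (lock)
        vma_lock_put(lock);
    return 0;
}
```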
||
Mike Kravetz
|
12710fd696 |
hugetlb: rename vma_shareable() and refactor code
Rename the routine vma_shareable to vma_addr_pmd_shareable as it is checking a specific address within the vma. Refactor code to check if an aligned range is shareable as this will be needed in a subsequent patch. Link: https://lkml.kernel.org/r/20220914221810.95771-6-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
7e1813d48d |
hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
remove_huge_page removes a hugetlb page from the page cache. Change to hugetlb_delete_from_page_cache as it is a more descriptive name. huge_add_to_page_cache is global in scope, but only deals with hugetlb pages. For consistency and clarity, rename to hugetlb_add_to_page_cache. Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: James Houghton <jthoughton@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
3a47c54f09 |
hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
Commit
|
||
Mike Kravetz
|
188a39725a |
hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
Patch series "hugetlb: Use new vma lock for huge pmd sharing synchronization", v2. hugetlb fault scalability regressions have recently been reported [1]. This is not the first such report, as regressions were also noted when commit |
||
XU pengfei
|
3259914f8c |
mm/hugetlb: remove unnecessary 'NULL' values from pointer
These pointer variables are assigned from an allocation first and only checked afterwards, so there is no need to initialize them to NULL beforehand. Link: https://lkml.kernel.org/r/20220914012113.6271-1-xupengfei@nfschina.com Signed-off-by: XU pengfei <xupengfei@nfschina.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Muchun Song
|
a4a00b451e |
mm: hugetlb: eliminate memory-less nodes handling
The memory-notify-based approach aims to handle memory-less nodes; however, it just adds code complexity, as pointed out by David in thread [1]. The handling of memory-less nodes was introduced by commit |
||
Muchun Song
|
b958d4d08f |
mm: hugetlb: simplify per-node sysfs creation and removal
Patch series "simplify handling of per-node sysfs creation and removal",
v4.
This patch (of 2):
The following commit offloaded per-node sysfs creation and removal to a
kworker and did not say why it is needed. And it also said "I don't know
that this is absolutely required". It seems like the author was not sure
as well. Since it only complicates the code, this patch will revert the
changes to simplify the code.
|
||
Cheng Li
|
14455eabd8 |
mm: use nth_page instead of mem_map_offset mem_map_next
To handle the discontiguous case, mem_map_next() has a parameter named
`offset`. As a function caller, one would be confused why "get next
entry" needs a parameter named "offset". The other drawback of
mem_map_next() is that the callers must take care of the mapping between
the "iter" and "offset" parameters, otherwise we may get a hole or duplication
during iteration. So we use nth_page instead of mem_map_next.
And replace mem_map_offset with nth_page() per Matthew's comments.
Link: https://lkml.kernel.org/r/1662708669-9395-1-git-send-email-lic121@chinatelecom.cn
Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
Fixes:
|
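The reason an nth_page()-style lookup is preferred can be shown with a self-contained toy: when the page array is split into sections (SPARSEMEM without VMEMMAP), plain pointer arithmetic walks off the end of a section, while recomputing the section for every index stays correct. Everything below (section size, struct layout, the nth_page stand-in) is made up for illustration and is not the kernel implementation.

```c
/* Standalone illustration of why nth_page() exists: with a sectioned page
 * array, "head + i" can cross a section boundary, while an nth_page()-style
 * lookup recomputes the right section for every index. Sizes are made up. */
#include <stdio.h>

#define PAGES_PER_SECTION 4

struct page { int idx; };

/* two sections that are NOT adjacent in memory */
static struct page section0[PAGES_PER_SECTION];
static struct page section1[PAGES_PER_SECTION];
static struct page *sections[] = { section0, section1 };

static struct page *nth_page(struct page *head, int n)
{
    int pfn = head->idx + n;                       /* page_to_pfn(head) + n */
    return &sections[pfn / PAGES_PER_SECTION][pfn % PAGES_PER_SECTION];
}

int main(void)
{
    for (int i = 0; i < 2 * PAGES_PER_SECTION; i++) {
        struct page *p = (i < PAGES_PER_SECTION)
                             ? &section0[i]
                             : &section1[i - PAGES_PER_SECTION];
        p->idx = i;
    }

    struct page *head = &section0[0];
    /* head + 5 would run past section0; nth_page(head, 5) does not */
    printf("page 5 has idx %d\n", nth_page(head, 5)->idx);
    return 0;
}
```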
||
Li zeming
|
8eeda55fe0 |
mm/hugetlb.c: remove unnecessary initialization of local `err'
Link: https://lkml.kernel.org/r/20220905020918.3552-1-zeming@nfschina.com Signed-off-by: Li zeming <zeming@nfschina.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Andrew Morton
|
6d751329e7 | Merge branch 'mm-hotfixes-stable' into mm-stable | ||
Doug Berger
|
317314527d |
mm/hugetlb: correct demote page offset logic
With gigantic pages it may not be true that struct page structures are
contiguous across the entire gigantic page. The nth_page macro is used
here in place of direct pointer arithmetic to correct for this.
Mike said:
: This error could cause addressing exceptions. However, this is only
: possible in configurations where CONFIG_SPARSEMEM &&
: !CONFIG_SPARSEMEM_VMEMMAP. Such a configuration option is rare and
: unknown to be the default anywhere.
Link: https://lkml.kernel.org/r/20220914190917.3517663-1-opendmb@gmail.com
Fixes:
|
||
Miaohe Lin
|
5e6b1bf1b5 |
hugetlb: remove meaningless BUG_ON(huge_pte_none())
When the code reaches here, an invalid page would already have been accessed if the huge pte were none. So this BUG_ON(huge_pte_none()) is meaningless. Remove it. Link: https://lkml.kernel.org/r/20220901120030.63318-10-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Miaohe Lin
|
a9e1eab241 |
hugetlb: add comment for subtle SetHPageVmemmapOptimized()
The SetHPageVmemmapOptimized() called here seems unnecessary as it's assumed to be set when calling this function. But it's indeed cleared by above set_page_private(page, 0). Add a comment to avoid possible future confusion. Link: https://lkml.kernel.org/r/20220901120030.63318-9-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |