Commit Graph

880 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
c9b484c69d This is the 6.1.68 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmV57F0ACgkQONu9yGCS
 aT5Ihg//f5xvyjEEbZyE7tFaBBgx8ceQCtteRyi+Jw3Hy65/9neETij0t97IhG37
 I89TIAddzNIl51ifl8UYZMWI780HbnW1YdbVLMElbngbmT5rHzIsGpAVCC+SDmMK
 NPWXrqWIw6yTVSbTwqKIqOLlEiLxGjdWnPxjoMXBVyje+EcmANBe+fe9qkLq98XC
 ZgzrRZyriS8QLMMscy/GmdxIyC32nxebdHDwwE6qgYM8GWNfqLLektX798VGFhra
 ByR9bvsJ0PD5m9siCGcx37lVusJDLMjJp4FtMIFTrH63i0sMQm7HKiggJmbCm4lH
 Sgbo4iwvSVa2xf1glPJagE9tiah5b0feLqgrQf/ONO2PdCjcERN47472IcQgRvQ+
 SDYKScZBSp1/Jd063dHiK/u79uxEBFEdisAkPG2MstjCySEDuhvDrV5R0iKDpQBP
 y2FXb4RArqZFrGwS4Zfxx/EQnj3MYJ11a4AE5I0yUGIj7vrFdddayBDBVdwhog84
 QhHPH0F/eC/zSMATYSQSCZTTSZ2UoR8NODXyOryoH5tmXlgxXWKq1oFi5nUnysoP
 SkGDT0dg+kbReQNA+eyj5qTS4lzincIyP2B4Ple9d75zpx1UENlqVm1xvWLccyFt
 3eV/XNRg8dAapsbqvEtW+iev6izutWgcG6p1hToObnbg5uHy6fI=
 =+iTJ
 -----END PGP SIGNATURE-----

Merge 6.1.68 into android14-6.1-lts

Changes in 6.1.68
	vdpa/mlx5: preserve CVQ vringh index
	hrtimers: Push pending hrtimers away from outgoing CPU earlier
	i2c: designware: Fix corrupted memory seen in the ISR
	netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test
	zstd: Fix array-index-out-of-bounds UBSAN warning
	tg3: Move the [rt]x_dropped counters to tg3_napi
	tg3: Increment tx_dropped in tg3_tso_bug()
	kconfig: fix memory leak from range properties
	drm/amdgpu: correct chunk_ptr to a pointer to chunk.
	x86: Introduce ia32_enabled()
	x86/coco: Disable 32-bit emulation by default on TDX and SEV
	x86/entry: Convert INT 0x80 emulation to IDTENTRY
	x86/entry: Do not allow external 0x80 interrupts
	x86/tdx: Allow 32-bit emulation by default
	dt: dt-extract-compatibles: Handle cfile arguments in generator function
	dt: dt-extract-compatibles: Don't follow symlinks when walking tree
	platform/x86: asus-wmi: Move i8042 filter install to shared asus-wmi code
	of: dynamic: Fix of_reconfig_get_state_change() return value documentation
	platform/x86: wmi: Skip blocks with zero instances
	ipv6: fix potential NULL deref in fib6_add()
	octeontx2-pf: Add missing mutex lock in otx2_get_pauseparam
	octeontx2-af: Check return value of nix_get_nixlf before using nixlf
	hv_netvsc: rndis_filter needs to select NLS
	r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
	r8152: Add RTL8152_INACCESSIBLE checks to more loops
	r8152: Add RTL8152_INACCESSIBLE to r8156b_wait_loading_flash()
	r8152: Add RTL8152_INACCESSIBLE to r8153_pre_firmware_1()
	r8152: Add RTL8152_INACCESSIBLE to r8153_aldps_en()
	mlxbf-bootctl: correctly identify secure boot with development keys
	platform/mellanox: Add null pointer checks for devm_kasprintf()
	platform/mellanox: Check devm_hwmon_device_register_with_groups() return value
	arcnet: restoring support for multiple Sohard Arcnet cards
	octeontx2-pf: consider both Rx and Tx packet stats for adaptive interrupt coalescing
	net: stmmac: fix FPE events losing
	xsk: Skip polling event check for unbound socket
	octeontx2-af: fix a use-after-free in rvu_npa_register_reporters
	i40e: Fix unexpected MFS warning message
	iavf: validate tx_coalesce_usecs even if rx_coalesce_usecs is zero
	net: bnxt: fix a potential use-after-free in bnxt_init_tc
	tcp: fix mid stream window clamp.
	ionic: fix snprintf format length warning
	ionic: Fix dim work handling in split interrupt mode
	ipv4: ip_gre: Avoid skb_pull() failure in ipgre_xmit()
	net: atlantic: Fix NULL dereference of skb pointer in
	net: hns: fix wrong head when modify the tx feature when sending packets
	net: hns: fix fake link up on xge port
	octeontx2-af: Adjust Tx credits when MCS external bypass is disabled
	octeontx2-af: Fix mcs sa cam entries size
	octeontx2-af: Fix mcs stats register address
	octeontx2-af: Add missing mcs flr handler call
	octeontx2-af: Update Tx link register range
	dt-bindings: interrupt-controller: Allow #power-domain-cells
	netfilter: nft_exthdr: add boolean DCCP option matching
	netfilter: nf_tables: fix 'exist' matching on bigendian arches
	netfilter: nf_tables: bail out on mismatching dynset and set expressions
	netfilter: nf_tables: validate family when identifying table via handle
	netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
	tcp: do not accept ACK of bytes we never sent
	bpf: sockmap, updating the sg structure should also update curr
	psample: Require 'CAP_NET_ADMIN' when joining "packets" group
	drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
	mm/damon/sysfs: eliminate potential uninitialized variable warning
	tee: optee: Fix supplicant based device enumeration
	RDMA/hns: Fix unnecessary err return when using invalid congest control algorithm
	RDMA/irdma: Do not modify to SQD on error
	RDMA/irdma: Add wait for suspend on SQD
	arm64: dts: rockchip: Expand reg size of vdec node for RK3328
	arm64: dts: rockchip: Expand reg size of vdec node for RK3399
	ASoC: fsl_sai: Fix no frame sync clock issue on i.MX8MP
	RDMA/rtrs-srv: Do not unconditionally enable irq
	RDMA/rtrs-clt: Start hb after path_up
	RDMA/rtrs-srv: Check return values while processing info request
	RDMA/rtrs-srv: Free srv_mr iu only when always_invalidate is true
	RDMA/rtrs-srv: Destroy path files after making sure no IOs in-flight
	RDMA/rtrs-clt: Fix the max_send_wr setting
	RDMA/rtrs-clt: Remove the warnings for req in_use check
	RDMA/bnxt_re: Correct module description string
	RDMA/irdma: Refactor error handling in create CQP
	RDMA/irdma: Fix UAF in irdma_sc_ccq_get_cqe_info()
	hwmon: (acpi_power_meter) Fix 4.29 MW bug
	ASoC: codecs: lpass-tx-macro: set active_decimator correct default value
	hwmon: (nzxt-kraken2) Fix error handling path in kraken2_probe()
	ASoC: wm_adsp: fix memleak in wm_adsp_buffer_populate
	RDMA/core: Fix umem iterator when PAGE_SIZE is greater then HCA pgsz
	RDMA/irdma: Avoid free the non-cqp_request scratch
	drm/bridge: tc358768: select CONFIG_VIDEOMODE_HELPERS
	arm64: dts: imx8mq: drop usb3-resume-missing-cas from usb
	arm64: dts: imx8mp: imx8mq: Add parkmode-disable-ss-quirk on DWC3
	ARM: dts: imx6ul-pico: Describe the Ethernet PHY clock
	tracing: Fix a warning when allocating buffered events fails
	scsi: be2iscsi: Fix a memleak in beiscsi_init_wrb_handle()
	ARM: imx: Check return value of devm_kasprintf in imx_mmdc_perf_init
	ARM: dts: imx7: Declare timers compatible with fsl,imx6dl-gpt
	ARM: dts: imx28-xea: Pass the 'model' property
	riscv: fix misaligned access handling of C.SWSP and C.SDSP
	md: introduce md_ro_state
	md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly()
	iommu: Avoid more races around device probe
	rethook: Use __rcu pointer for rethook::handler
	kprobes: consistent rcu api usage for kretprobe holder
	ASoC: amd: yc: Fix non-functional mic on ASUS E1504FA
	io_uring/af_unix: disable sending io_uring over sockets
	nvme-pci: Add sleep quirk for Kingston drives
	io_uring: fix mutex_unlock with unreferenced ctx
	ALSA: usb-audio: Add Pioneer DJM-450 mixer controls
	ALSA: pcm: fix out-of-bounds in snd_pcm_state_names
	ALSA: hda/realtek: Enable headset on Lenovo M90 Gen5
	ALSA: hda/realtek: add new Framework laptop to quirks
	ALSA: hda/realtek: Add Framework laptop 16 to quirks
	ring-buffer: Test last update in 32bit version of __rb_time_read()
	nilfs2: fix missing error check for sb_set_blocksize call
	nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage()
	cgroup_freezer: cgroup_freezing: Check if not frozen
	checkstack: fix printed address
	tracing: Always update snapshot buffer size
	tracing: Disable snapshot buffer when stopping instance tracers
	tracing: Fix incomplete locking when disabling buffered events
	tracing: Fix a possible race when disabling buffered events
	packet: Move reference count in packet_sock to atomic_long_t
	r8169: fix rtl8125b PAUSE frames blasting when suspended
	regmap: fix bogus error on regcache_sync success
	platform/surface: aggregator: fix recv_buf() return value
	hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write
	mm: fix oops when filemap_map_pmd() without prealloc_pte
	powercap: DTPM: Fix missing cpufreq_cpu_put() calls
	md/raid6: use valid sector values to determine if an I/O should wait on the reshape
	arm64: dts: mediatek: mt7622: fix memory node warning check
	arm64: dts: mediatek: mt8183-kukui-jacuzzi: fix dsi unnecessary cells properties
	arm64: dts: mediatek: cherry: Fix interrupt cells for MT6360 on I2C7
	arm64: dts: mediatek: mt8173-evb: Fix regulator-fixed node names
	arm64: dts: mediatek: mt8195: Fix PM suspend/resume with venc clocks
	arm64: dts: mediatek: mt8183: Fix unit address for scp reserved memory
	arm64: dts: mediatek: mt8183: Move thermal-zones to the root node
	arm64: dts: mediatek: mt8183-evb: Fix unit_address_vs_reg warning on ntc
	binder: fix memory leaks of spam and pending work
	coresight: etm4x: Make etm4_remove_dev() return void
	coresight: etm4x: Remove bogous __exit annotation for some functions
	hwtracing: hisi_ptt: Add dummy callback pmu::read()
	misc: mei: client.c: return negative error code in mei_cl_write
	misc: mei: client.c: fix problem of return '-EOVERFLOW' in mei_cl_write
	LoongArch: BPF: Don't sign extend memory load operand
	LoongArch: BPF: Don't sign extend function return value
	ring-buffer: Force absolute timestamp on discard of event
	tracing: Set actual size after ring buffer resize
	tracing: Stop current tracer when resizing buffer
	parisc: Reduce size of the bug_table on 64-bit kernel by half
	parisc: Fix asm operand number out of range build error in bug table
	arm64: dts: mediatek: add missing space before {
	arm64: dts: mt8183: kukui: Fix underscores in node names
	perf: Fix perf_event_validate_size()
	x86/sev: Fix kernel crash due to late update to read-only ghcb_version
	gpiolib: sysfs: Fix error handling on failed export
	drm/amdgpu: fix memory overflow in the IB test
	drm/amd/amdgpu: Fix warnings in amdgpu/amdgpu_display.c
	drm/amdgpu: correct the amdgpu runtime dereference usage count
	drm/amdgpu: Update ras eeprom support for smu v13_0_0 and v13_0_10
	drm/amdgpu: Add EEPROM I2C address support for ip discovery
	drm/amdgpu: Remove redundant I2C EEPROM address
	drm/amdgpu: Decouple RAS EEPROM addresses from chips
	drm/amdgpu: Add support for RAS table at 0x40000
	drm/amdgpu: Remove second moot switch to set EEPROM I2C address
	drm/amdgpu: Return from switch early for EEPROM I2C address
	drm/amdgpu: simplify amdgpu_ras_eeprom.c
	drm/amdgpu: Add I2C EEPROM support on smu v13_0_6
	drm/amdgpu: Update EEPROM I2C address for smu v13_0_0
	usb: gadget: f_hid: fix report descriptor allocation
	serial: 8250_dw: Add ACPI ID for Granite Rapids-D UART
	parport: Add support for Brainboxes IX/UC/PX parallel cards
	cifs: Fix non-availability of dedup breaking generic/304
	Revert "xhci: Loosen RPM as default policy to cover for AMD xHC 1.1"
	smb: client: fix potential NULL deref in parse_dfs_referrals()
	usb: typec: class: fix typec_altmode_put_partner to put plugs
	ARM: PL011: Fix DMA support
	serial: sc16is7xx: address RX timeout interrupt errata
	serial: 8250: 8250_omap: Clear UART_HAS_RHR_IT_DIS bit
	serial: 8250: 8250_omap: Do not start RX DMA on THRI interrupt
	serial: 8250_omap: Add earlycon support for the AM654 UART controller
	devcoredump: Send uevent once devcd is ready
	x86/CPU/AMD: Check vendor in the AMD microcode callback
	USB: gadget: core: adjust uevent timing on gadget unbind
	cifs: Fix flushing, invalidation and file size with copy_file_range()
	cifs: Fix flushing, invalidation and file size with FICLONE
	MIPS: kernel: Clear FPU states when setting up kernel threads
	KVM: s390/mm: Properly reset no-dat
	KVM: SVM: Update EFER software model on CR0 trap for SEV-ES
	MIPS: Loongson64: Reserve vgabios memory on boot
	MIPS: Loongson64: Handle more memory types passed from firmware
	MIPS: Loongson64: Enable DMA noncoherent support
	netfilter: nft_set_pipapo: skip inactive elements during set walk
	riscv: Kconfig: Add select ARM_AMBA to SOC_STARFIVE
	drm/i915/display: Drop check for doublescan mode in modevalid
	drm/i915/lvds: Use REG_BIT() & co.
	drm/i915/sdvo: stop caching has_hdmi_monitor in struct intel_sdvo
	drm/i915: Skip some timing checks on BXT/GLK DSI transcoders
	Linux 6.1.68

Change-Id: I0a824071a80b24dc4a2e0077f305b7cac42235b8
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-01-05 08:40:52 +00:00
Mike Kravetz
574a6db80f hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write
commit 187da0f8250aa94bd96266096aef6f694e0b4cd2 upstream.

The routine __vma_private_lock tests for the existence of a reserve map
associated with a private hugetlb mapping.  A pointer to the reserve map
is in vma->vm_private_data.  __vma_private_lock was checking the pointer
for NULL.  However, it is possible that the low bits of the pointer could
be used as flags.  In such instances, vm_private_data is not NULL and not
a valid pointer.  This results in the null-ptr-deref reported by syzbot:

general protection fault, probably for non-canonical address 0xdffffc000000001d:
 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
CPU: 0 PID: 5048 Comm: syz-executor139 Not tainted 6.6.0-rc7-syzkaller-00142-g88
8cf78c29e2 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 1
0/09/2023
RIP: 0010:__lock_acquire+0x109/0x5de0 kernel/locking/lockdep.c:5004
...
Call Trace:
 <TASK>
 lock_acquire kernel/locking/lockdep.c:5753 [inline]
 lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5718
 down_write+0x93/0x200 kernel/locking/rwsem.c:1573
 hugetlb_vma_lock_write mm/hugetlb.c:300 [inline]
 hugetlb_vma_lock_write+0xae/0x100 mm/hugetlb.c:291
 __hugetlb_zap_begin+0x1e9/0x2b0 mm/hugetlb.c:5447
 hugetlb_zap_begin include/linux/hugetlb.h:258 [inline]
 unmap_vmas+0x2f4/0x470 mm/memory.c:1733
 exit_mmap+0x1ad/0xa60 mm/mmap.c:3230
 __mmput+0x12a/0x4d0 kernel/fork.c:1349
 mmput+0x62/0x70 kernel/fork.c:1371
 exit_mm kernel/exit.c:567 [inline]
 do_exit+0x9ad/0x2a20 kernel/exit.c:861
 __do_sys_exit kernel/exit.c:991 [inline]
 __se_sys_exit kernel/exit.c:989 [inline]
 __x64_sys_exit+0x42/0x50 kernel/exit.c:989
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Mask off low bit flags before checking for NULL pointer.  In addition, the
reserve map only 'belongs' to the OWNER (parent in parent/child
relationships) so also check for the OWNER flag.

Link: https://lkml.kernel.org/r/20231114012033.259600-1-mike.kravetz@oracle.com
Reported-by: syzbot+6ada951e7c0f7bc8a71e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/00000000000078d1e00608d7878b@google.com/
Fixes: bf4916922c60 ("hugetlbfs: extend hugetlb_vma_lock to private VMAs")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: Edward Adam Davis <eadavis@qq.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tom Rix <trix@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-13 18:39:20 +01:00
Greg Kroah-Hartman
2cd386b08b Merge 6.1.61 into android14-6.1-lts
Changes in 6.1.61
	KVM: x86/pmu: Truncate counter value to allowed width on write
	mmc: core: Align to common busy polling behaviour for mmc ioctls
	mmc: block: ioctl: do write error check for spi
	mmc: core: Fix error propagation for some ioctl commands
	ASoC: codecs: wcd938x: Convert to platform remove callback returning void
	ASoC: codecs: wcd938x: Simplify with dev_err_probe
	ASoC: codecs: wcd938x: fix regulator leaks on probe errors
	ASoC: codecs: wcd938x: fix runtime PM imbalance on remove
	pinctrl: qcom: lpass-lpi: fix concurrent register updates
	mcb: Return actual parsed size when reading chameleon table
	mcb-lpc: Reallocate memory region to avoid memory overlapping
	virtio_balloon: Fix endless deflation and inflation on arm64
	virtio-mmio: fix memory leak of vm_dev
	virtio-crypto: handle config changed by work queue
	virtio_pci: fix the common cfg map size
	vsock/virtio: initialize the_virtio_vsock before using VQs
	vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE
	arm64: dts: rockchip: Add i2s0-2ch-bus-bclk-off pins to RK3399
	arm64: dts: rockchip: Fix i2s0 pin conflict on ROCK Pi 4 boards
	mm: fix vm_brk_flags() to not bail out while holding lock
	hugetlbfs: clear resv_map pointer if mmap fails
	mm/page_alloc: correct start page when guard page debug is enabled
	mm/migrate: fix do_pages_move for compat pointers
	hugetlbfs: extend hugetlb_vma_lock to private VMAs
	maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
	nfsd: lock_rename() needs both directories to live on the same fs
	drm/i915/pmu: Check if pmu is closed before stopping event
	drm/amd: Disable ASPM for VI w/ all Intel systems
	drm/dp_mst: Fix NULL deref in get_mst_branch_device_by_guid_helper()
	ARM: OMAP: timer32K: fix all kernel-doc warnings
	firmware/imx-dsp: Fix use_after_free in imx_dsp_setup_channels()
	clk: ti: Fix missing omap4 mcbsp functional clock and aliases
	clk: ti: Fix missing omap5 mcbsp functional clock and aliases
	r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx
	r8169: fix the KCSAN reported data-race in rtl_tx while reading TxDescArray[entry].opts1
	r8169: fix the KCSAN reported data race in rtl_rx while reading desc->opts1
	iavf: initialize waitqueues before starting watchdog_task
	i40e: Fix I40E_FLAG_VF_VLAN_PRUNING value
	treewide: Spelling fix in comment
	igb: Fix potential memory leak in igb_add_ethtool_nfc_entry
	neighbour: fix various data-races
	igc: Fix ambiguity in the ethtool advertising
	net: ethernet: adi: adin1110: Fix uninitialized variable
	net: ieee802154: adf7242: Fix some potential buffer overflow in adf7242_stats_show()
	net: usb: smsc95xx: Fix uninit-value access in smsc95xx_read_reg
	r8152: Increase USB control msg timeout to 5000ms as per spec
	r8152: Run the unload routine if we have errors during probe
	r8152: Cancel hw_phy_work if we have an error in probe
	r8152: Release firmware if we have an error in probe
	tcp: fix wrong RTO timeout when received SACK reneging
	gtp: uapi: fix GTPA_MAX
	gtp: fix fragmentation needed check with gso
	i40e: Fix wrong check for I40E_TXR_FLAGS_WB_ON_ITR
	drm/logicvc: Kconfig: select REGMAP and REGMAP_MMIO
	iavf: in iavf_down, disable queues when removing the driver
	scsi: sd: Introduce manage_shutdown device flag
	blk-throttle: check for overflow in calculate_bytes_allowed
	kasan: print the original fault addr when access invalid shadow
	io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
	iio: afe: rescale: Accept only offset channels
	iio: exynos-adc: request second interupt only when touchscreen mode is used
	iio: adc: xilinx-xadc: Don't clobber preset voltage/temperature thresholds
	iio: adc: xilinx-xadc: Correct temperature offset/scale for UltraScale
	i2c: muxes: i2c-mux-pinctrl: Use of_get_i2c_adapter_by_node()
	i2c: muxes: i2c-mux-gpmux: Use of_get_i2c_adapter_by_node()
	i2c: muxes: i2c-demux-pinctrl: Use of_get_i2c_adapter_by_node()
	i2c: stm32f7: Fix PEC handling in case of SMBUS transfers
	i2c: aspeed: Fix i2c bus hang in slave read
	tracing/kprobes: Fix the description of variable length arguments
	misc: fastrpc: Reset metadata buffer to avoid incorrect free
	misc: fastrpc: Free DMA handles for RPC calls with no arguments
	misc: fastrpc: Clean buffers on remote invocation failures
	misc: fastrpc: Unmap only if buffer is unmapped from DSP
	nvmem: imx: correct nregs for i.MX6ULL
	nvmem: imx: correct nregs for i.MX6SLL
	nvmem: imx: correct nregs for i.MX6UL
	x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
	x86/cpu: Add model number for Intel Arrow Lake mobile processor
	perf/core: Fix potential NULL deref
	sparc32: fix a braino in fault handling in csum_and_copy_..._user()
	clk: Sanitize possible_parent_show to Handle Return Value of of_clk_get_parent_name
	platform/x86: Add s2idle quirk for more Lenovo laptops
	ext4: add two helper functions extent_logical_end() and pa_logical_end()
	ext4: fix BUG in ext4_mb_new_inode_pa() due to overflow
	ext4: avoid overlapping preallocations due to overflow
	objtool/x86: add missing embedded_insn check
	Linux 6.1.61

Note, this merge point also reverts commit
bb20a245df which is commit
24eca2dce0f8d19db808c972b0281298d0bafe99 upstream, as it conflicts with
the previous reverts for ABI issues, AND is an ABI break in itself.  If
it is needed in the future, it can be brought back in an abi-safe way.

Change-Id: I425bfa3be6d65328e23affd52d10b827aea6e44a
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-11-07 08:21:27 +00:00
Rik van Riel
b1b2750de1 hugetlbfs: extend hugetlb_vma_lock to private VMAs
commit bf4916922c60f43efaa329744b3eef539aa6a2b2 upstream.

Extend the locking scheme used to protect shared hugetlb mappings from
truncate vs page fault races, in order to protect private hugetlb mappings
(with resv_map) against MADV_DONTNEED.

Add a read-write semaphore to the resv_map data structure, and use that
from the hugetlb_vma_(un)lock_* functions, in preparation for closing the
race between MADV_DONTNEED and page faults.

Link: https://lkml.kernel.org/r/20231006040020.3677377-3-riel@surriel.com
Fixes: 04ada095dc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-02 09:35:24 +01:00
Rik van Riel
0aa7b24c06 hugetlbfs: clear resv_map pointer if mmap fails
commit 92fe9dcbe4e109a7ce6bab3e452210a35b0ab493 upstream.

Patch series "hugetlbfs: close race between MADV_DONTNEED and page fault", v7.

Malloc libraries, like jemalloc and tcalloc, take decisions on when to
call madvise independently from the code in the main application.

This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.

Usually this is harmless, because we always have some 4kB pages sitting
around to satisfy a page fault.  However, with hugetlbfs systems often
allocate only the exact number of huge pages that the application wants.

Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:

       CPU 1                            CPU 2

       MADV_DONTNEED
       unmap page
       shoot down TLB entry
                                       page fault
                                       fail to allocate a huge page
                                       killed with SIGBUS
       free page

Fix that race by extending the hugetlb_vma_lock locking scheme to also
cover private hugetlb mappings (with resv_map), and pulling the locking
from __unmap_hugepage_final_range into helper functions called from
zap_page_range_single.  This ensures page faults stay locked out of the
MADV_DONTNEED VMA until the huge pages have actually been freed.


This patch (of 3):

Hugetlbfs leaves a dangling pointer in the VMA if mmap fails.  This has
not been a problem so far, but other code in this patch series tries to
follow that pointer.

Link: https://lkml.kernel.org/r/20231006040020.3677377-1-riel@surriel.com
Link: https://lkml.kernel.org/r/20231006040020.3677377-2-riel@surriel.com
Fixes: 04ada095dc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-02 09:35:24 +01:00
Greg Kroah-Hartman
50874c58d8 Merge 6.1.47 into android14-6.1-lts
Changes in 6.1.47
	mmc: sdhci-f-sdh30: Replace with sdhci_pltfm
	cpuidle: psci: Extend information in log about OSI/PC mode
	cpuidle: psci: Move enabling OSI mode after power domains creation
	zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks
	zsmalloc: fix races between modifications of fullness and isolated
	selftests: forwarding: tc_actions: cleanup temporary files when test is aborted
	selftests: forwarding: tc_actions: Use ncat instead of nc
	net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
	net/smc: Fix setsockopt and sysctl to specify same buffer size again
	net: phy: at803x: Use devm_regulator_get_enable_optional()
	net: phy: at803x: fix the wol setting functions
	drm/amdgpu: fix calltrace warning in amddrm_buddy_fini
	drm/amdgpu: Fix integer overflow in amdgpu_cs_pass1
	drm/amdgpu: fix memory leak in mes self test
	ASoC: Intel: sof_sdw: add quirk for MTL RVP
	ASoC: Intel: sof_sdw: add quirk for LNL RVP
	PCI: tegra194: Fix possible array out of bounds access
	ASoC: SOF: amd: Add pci revision id check
	drm/stm: ltdc: fix late dereference check
	drm: rcar-du: remove R-Car H3 ES1.* workarounds
	ASoC: amd: vangogh: Add check for acp config flags in vangogh platform
	ARM: dts: imx6dl: prtrvt, prtvt7, prti6q, prtwd2: fix USB related warnings
	ASoC: Intel: sof_sdw_rt_sdca_jack_common: test SOF_JACK_JDSRC in _exit
	ASoC: Intel: sof_sdw: Add support for Rex soundwire
	iopoll: Call cpu_relax() in busy loops
	ASoC: SOF: Intel: fix SoundWire/HDaudio mutual exclusion
	dma-remap: use kvmalloc_array/kvfree for larger dma memory remap
	accel/habanalabs: add pci health check during heartbeat
	HID: logitech-hidpp: Add USB and Bluetooth IDs for the Logitech G915 TKL Keyboard
	iommu/amd: Introduce Disable IRTE Caching Support
	drm/amdgpu: install stub fence into potential unused fence pointers
	drm/amd/display: Apply 60us prefetch for DCFCLK <= 300Mhz
	RDMA/mlx5: Return the firmware result upon destroying QP/RQ
	drm/amd/display: Skip DPP DTO update if root clock is gated
	drm/amd/display: Enable dcn314 DPP RCO
	ASoC: SOF: core: Free the firmware trace before calling snd_sof_shutdown()
	HID: intel-ish-hid: ipc: Add Arrow Lake PCI device ID
	ALSA: hda/realtek: Add quirks for ROG ALLY CS35l41 audio
	smb: client: fix warning in cifs_smb3_do_mount()
	cifs: fix session state check in reconnect to avoid use-after-free issue
	serial: stm32: Ignore return value of uart_remove_one_port() in .remove()
	led: qcom-lpg: Fix resource leaks in for_each_available_child_of_node() loops
	media: v4l2-mem2mem: add lock to protect parameter num_rdy
	media: camss: set VFE bpl_alignment to 16 for sdm845 and sm8250
	usb: gadget: u_serial: Avoid spinlock recursion in __gs_console_push
	usb: gadget: uvc: queue empty isoc requests if no video buffer is available
	media: platform: mediatek: vpu: fix NULL ptr dereference
	thunderbolt: Read retimer NVM authentication status prior tb_retimer_set_inbound_sbtx()
	usb: chipidea: imx: don't request QoS for imx8ulp
	usb: chipidea: imx: add missing USB PHY DPDM wakeup setting
	gfs2: Fix possible data races in gfs2_show_options()
	pcmcia: rsrc_nonstatic: Fix memory leak in nonstatic_release_resource_db()
	thunderbolt: Add Intel Barlow Ridge PCI ID
	thunderbolt: Limit Intel Barlow Ridge USB3 bandwidth
	firewire: net: fix use after free in fwnet_finish_incoming_packet()
	watchdog: sp5100_tco: support Hygon FCH/SCH (Server Controller Hub)
	Bluetooth: L2CAP: Fix use-after-free
	Bluetooth: btusb: Add MT7922 bluetooth ID for the Asus Ally
	ceph: try to dump the msgs when decoding fails
	drm/amdgpu: Fix potential fence use-after-free v2
	fs/ntfs3: Enhance sanity check while generating attr_list
	fs: ntfs3: Fix possible null-pointer dereferences in mi_read()
	fs/ntfs3: Mark ntfs dirty when on-disk struct is corrupted
	ALSA: hda/realtek: Add quirks for Unis H3C Desktop B760 & Q760
	ALSA: hda: fix a possible null-pointer dereference due to data race in snd_hdac_regmap_sync()
	ALSA: hda/realtek: Add quirk for ASUS ROG GX650P
	ALSA: hda/realtek: Add quirk for ASUS ROG GA402X
	ALSA: hda/realtek: Add quirk for ASUS ROG GZ301V
	powerpc/kasan: Disable KCOV in KASAN code
	Bluetooth: MGMT: Use correct address for memcpy()
	ring-buffer: Do not swap cpu_buffer during resize process
	igc: read before write to SRRCTL register
	drm/amd/display: save restore hdcp state when display is unplugged from mst hub
	drm/amd/display: phase3 mst hdcp for multiple displays
	drm/amd/display: fix access hdcp_workqueue assert
	KVM: arm64: vgic-v4: Make the doorbell request robust w.r.t preemption
	ARM: dts: nxp/imx6sll: fix wrong property name in usbphy node
	fbdev/hyperv-fb: Do not set struct fb_info.apertures
	video/aperture: Only remove sysfb on the default vga pci device
	btrfs: move out now unused BG from the reclaim list
	btrfs: convert btrfs_block_group::needs_free_space to runtime flag
	btrfs: convert btrfs_block_group::seq_zone to runtime flag
	btrfs: fix use-after-free of new block group that became unused
	virtio-mmio: don't break lifecycle of vm_dev
	vduse: Use proper spinlock for IRQ injection
	vdpa/mlx5: Fix mr->initialized semantics
	vdpa/mlx5: Delete control vq iotlb in destroy_mr only when necessary
	cifs: fix potential oops in cifs_oplock_break
	i2c: bcm-iproc: Fix bcm_iproc_i2c_isr deadlock issue
	i2c: hisi: Only handle the interrupt of the driver's transfer
	i2c: tegra: Fix i2c-tegra DMA config option processing
	fbdev: mmp: fix value check in mmphw_probe()
	powerpc/rtas_flash: allow user copy to flash block cache objects
	vdpa: Add features attr to vdpa_nl_policy for nlattr length check
	vdpa: Add queue index attr to vdpa_nl_policy for nlattr length check
	vdpa: Add max vqp attr to vdpa_nl_policy for nlattr length check
	vdpa: Enable strict validation for netlinks ops
	tty: n_gsm: fix the UAF caused by race condition in gsm_cleanup_mux
	tty: serial: fsl_lpuart: Clear the error flags by writing 1 for lpuart32 platforms
	btrfs: fix incorrect splitting in btrfs_drop_extent_map_range
	btrfs: fix BUG_ON condition in btrfs_cancel_balance
	i2c: designware: Correct length byte validation logic
	i2c: designware: Handle invalid SMBus block data response length value
	net: xfrm: Fix xfrm_address_filter OOB read
	net: af_key: fix sadb_x_filter validation
	net: xfrm: Amend XFRMA_SEC_CTX nla_policy structure
	xfrm: fix slab-use-after-free in decode_session6
	ip6_vti: fix slab-use-after-free in decode_session6
	ip_vti: fix potential slab-use-after-free in decode_session6
	xfrm: add NULL check in xfrm_update_ae_params
	xfrm: add forgotten nla_policy for XFRMA_MTIMER_THRESH
	virtio_net: notify MAC address change on device initialization
	virtio-net: set queues after driver_ok
	net: pcs: Add missing put_device call in miic_create
	net: phy: fix IRQ-based wake-on-lan over hibernate / power off
	selftests: mirror_gre_changes: Tighten up the TTL test match
	drm/panel: simple: Fix AUO G121EAN01 panel timings according to the docs
	net: macb: In ZynqMP resume always configure PS GTR for non-wakeup source
	octeon_ep: cancel tx_timeout_task later in remove sequence
	netfilter: nf_tables: fix false-positive lockdep splat
	netfilter: nf_tables: deactivate catchall elements in next generation
	ipvs: fix racy memcpy in proc_do_sync_threshold
	netfilter: nft_dynset: disallow object maps
	net: phy: broadcom: stub c45 read/write for 54810
	team: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves
	net: openvswitch: reject negative ifindex
	iavf: fix FDIR rule fields masks validation
	i40e: fix misleading debug logs
	net: dsa: mv88e6xxx: Wait for EEPROM done before HW reset
	sfc: don't unregister flow_indr if it was never registered
	sock: Fix misuse of sk_under_memory_pressure()
	net: do not allow gso_size to be set to GSO_BY_FRAGS
	qede: fix firmware halt over suspend and resume
	ice: Block switchdev mode when ADQ is active and vice versa
	bus: ti-sysc: Flush posted write on enable before reset
	arm64: dts: qcom: qrb5165-rb5: fix thermal zone conflict
	arm64: dts: rockchip: Disable HS400 for eMMC on ROCK Pi 4
	arm64: dts: rockchip: Disable HS400 for eMMC on ROCK 4C+
	ARM: dts: imx: align LED node names with dtschema
	ARM: dts: imx6: phytec: fix RTC interrupt level
	arm64: dts: imx8mm: Drop CSI1 PHY reference clock configuration
	ARM: dts: imx: Set default tuning step for imx6sx usdhc
	arm64: dts: imx93: Fix anatop node size
	ASoC: rt5665: add missed regulator_bulk_disable
	ASoC: meson: axg-tdm-formatter: fix channel slot allocation
	ALSA: hda/realtek: Add quirks for HP G11 Laptops
	soc: aspeed: uart-routing: Use __sysfs_match_string
	soc: aspeed: socinfo: Add kfree for kstrdup
	ALSA: hda/realtek - Remodified 3k pull low procedure
	riscv: uaccess: Return the number of bytes effectively not copied
	serial: 8250: Fix oops for port->pm on uart_change_pm()
	ALSA: usb-audio: Add support for Mythware XA001AU capture and playback interfaces.
	cifs: Release folio lock on fscache read hit.
	virtio-net: Zero max_tx_vq field for VIRTIO_NET_CTRL_MQ_HASH_CONFIG case
	arm64: dts: rockchip: Fix Wifi/Bluetooth on ROCK Pi 4 boards
	blk-crypto: dynamically allocate fallback profile
	mmc: wbsd: fix double mmc_free_host() in wbsd_init()
	mmc: block: Fix in_flight[issue_type] value error
	drm/qxl: fix UAF on handle creation
	drm/i915/sdvo: fix panel_type initialization
	drm/amd: flush any delayed gfxoff on suspend entry
	drm/amdgpu: skip fence GFX interrupts disable/enable for S0ix
	drm/amdgpu/pm: fix throttle_status for other than MP1 11.0.7
	ASoC: amd: vangogh: select CONFIG_SND_AMD_ACP_CONFIG
	drm/amd/display: disable RCO for DCN314
	zsmalloc: allow only one active pool compaction context
	sched/fair: unlink misfit task from cpu overutilized
	sched/fair: Remove capacity inversion detection
	drm/amd/display: Implement workaround for writing to OTG_PIXEL_RATE_DIV register
	hugetlb: do not clear hugetlb dtor until allocating vmemmap
	netfilter: set default timeout to 3 secs for sctp shutdown send and recv state
	arm64/ptrace: Ensure that SME is set up for target when writing SSVE state
	drm/amd/pm: skip the RLC stop when S0i3 suspend for SMU v13.0.4/11
	drm/amdgpu: keep irq count in amdgpu_irq_disable_all
	af_unix: Fix null-ptr-deref in unix_stream_sendpage().
	drm/nouveau/disp: fix use-after-free in error handling of nouveau_connector_create
	net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled
	mmc: f-sdh30: fix order of function calls in sdhci_f_sdh30_remove
	Linux 6.1.47

Change-Id: I7c55c71f43f88a1d44d39c835e3f6e58d4c86279
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-09-13 19:35:46 +00:00
Mike Kravetz
1b4ce2952b hugetlb: do not clear hugetlb dtor until allocating vmemmap
commit 32c877191e022b55fe3a374f3d7e9fb5741c514d upstream.

Patch series "Fix hugetlb free path race with memory errors".

In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
HWPOISON hugepages" the race window was discovered.
https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/

Freeing a hugetlb page back to low level memory allocators is performed
in two steps.
1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
2) Outside lock, allocate vmemmap if necessary and call low level free
Between these two steps, the hugetlb page will appear as a normal
compound page.  However, vmemmap for tail pages could be missing.
If a memory error occurs at this time, we could try to update page
flags non-existant page structs.

A much more detailed description is in the first patch.

The first patch addresses the race window.  However, it adds a
hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb page
free operation.  This could lead to slowdowns if one is freeing a large
number of hugetlb pages.

The second path optimizes the update_and_free_pages_bulk routine to only
take the lock once in bulk operations.

The second patch is technically not a bug fix, but includes a Fixes tag
and Cc stable to avoid a performance regression.  It can be combined with
the first, but was done separately make reviewing easier.


This patch (of 2):

Freeing a hugetlb page and releasing base pages back to the underlying
allocator such as buddy or cma is performed in two steps:
- remove_hugetlb_folio() is called to remove the folio from hugetlb
  lists, get a ref on the page and remove hugetlb destructor.  This
  all must be done under the hugetlb lock.  After this call, the page
  can be treated as a normal compound page or a collection of base
  size pages.
- update_and_free_hugetlb_folio() is called to allocate vmemmap if
  needed and the free routine of the underlying allocator is called
  on the resulting page.  We can not hold the hugetlb lock here.

One issue with this scheme is that a memory error could occur between
these two steps.  In this case, the memory error handling code treats
the old hugetlb page as a normal compound page or collection of base
pages.  It will then try to SetPageHWPoison(page) on the page with an
error.  If the page with error is a tail page without vmemmap, a write
error will occur when trying to set the flag.

Address this issue by modifying remove_hugetlb_folio() and
update_and_free_hugetlb_folio() such that the hugetlb destructor is not
cleared until after allocating vmemmap.  Since clearing the destructor
requires holding the hugetlb lock, the clearing is done in
remove_hugetlb_folio() if the vmemmap is present.  This saves a
lock/unlock cycle.  Otherwise, destructor is cleared in
update_and_free_hugetlb_folio() after allocating vmemmap.

Note that this will leave hugetlb pages in a state where they are marked
free (by hugetlb specific page flag) and have a ref count.  This is not
a normal state.  The only code that would notice is the memory error
code, and it is set up to retry in such a case.

A subsequent patch will create a routine to do bulk processing of
vmemmap allocation.  This will eliminate a lock/unlock cycle for each
hugetlb page in the case where we are freeing a large number of pages.

Link: https://lkml.kernel.org/r/20230711220942.43706-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20230711220942.43706-2-mike.kravetz@oracle.com
Fixes: ad2fa3717b ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-23 17:52:41 +02:00
Matthew Wilcox (Oracle)
66cbbe6b31 FROMGIT: mm: move FAULT_FLAG_VMA_LOCK check from handle_mm_fault()
Handle a little more of the page fault path outside the mmap sem.  The
hugetlb path doesn't need to check whether the VMA is anonymous; the
VM_HUGETLB flag is only set on hugetlbfs VMAs.  There should be no
performance change from the previous commit; this is simply a step to ease
bisection of any problems.

Link: https://lkml.kernel.org/r/20230724185410.1124082-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 51db5e8974cafee10b2252efa78f89af7d60cd11
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)

Bug: 293665307
Change-Id: I300c7105fa3530e8eb05862cb3f66b7adac99420
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 09:59:51 -07:00
Suren Baghdasaryan
ad18923856 FROMGIT: mm: replace mmap with vma write lock assertions when operating on a vma
Vma write lock assertion always includes mmap write lock assertion and
additional vma lock checks when per-VMA locks are enabled. Replace
weaker mmap_assert_write_locked() assertions with stronger
vma_assert_write_locked() ones when we are operating on a vma which
is expected to be locked.

Link: https://lkml.kernel.org/r/20230804152724.3090321-4-surenb@google.com
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 928a31b91cf64aa99a8999dcd66bec0ad02f64ef
https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)

Bug: 293665307
Change-Id: I861db0510612f571f2ca44e0a9d7e01274d4eb36
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-08-16 16:55:02 +00:00
Suren Baghdasaryan
bf16383ebd UPSTREAM: mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
To simplify the usage of VM_LOCKED_CLEAR_MASK in vm_flags_clear(), replace
it with VM_LOCKED_MASK bitmask and convert all users.

Link: https://lkml.kernel.org/r/20230126193752.297968-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit e430a95a04efc557bc4ff9b3035c7c85aee5d63f)

Bug: 161210518
Change-Id: I17bbcc01a133511dbfaf3d82fbc4b25ecdd0b376
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:57 +00:00
Peter Xu
f042ee354c mm/hugetlb: fix uffd wr-protection for CoW optimization path
commit 60d5b473d61be61ac315e544fcd6a8234a79500e upstream.

This patch fixes an issue that a hugetlb uffd-wr-protected mapping can be
writable even with uffd-wp bit set.  It only happens with hugetlb private
mappings, when someone firstly wr-protects a missing pte (which will
install a pte marker), then a write to the same page without any prior
access to the page.

Userfaultfd-wp trap for hugetlb was implemented in hugetlb_fault() before
reaching hugetlb_wp() to avoid taking more locks that userfault won't
need.  However there's one CoW optimization path that can trigger
hugetlb_wp() inside hugetlb_no_page(), which will bypass the trap.

This patch skips hugetlb_wp() for CoW and retries the fault if uffd-wp bit
is detected.  The new path will only trigger in the CoW optimization path
because generic hugetlb_fault() (e.g.  when a present pte was
wr-protected) will resolve the uffd-wp bit already.  Also make sure
anonymous UNSHARE won't be affected and can still be resolved, IOW only
skip CoW not CoR.

This patch will be needed for v5.19+ hence copy stable.

[peterx@redhat.com: v2]
  Link: https://lkml.kernel.org/r/ZBzOqwF2wrHgBVZb@x1n
[peterx@redhat.com: v3]
  Link: https://lkml.kernel.org/r/20230324142620.2344140-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230321191840.1897940-1-peterx@redhat.com
Fixes: 166f3ecc0d ("mm/hugetlb: hook page faults for uffd write protection")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-04-13 16:55:36 +02:00
Peter Xu
3b8ede6665 mm/hugetlb: pre-allocate pgtable pages for uffd wr-protects
commit fed15f1345dc8a7fc8baa81e8b55c3ba010d7f4b upstream.

Userfaultfd-wp uses pte markers to mark wr-protected pages for both shmem
and hugetlb.  Shmem has pre-allocation ready for markers, but hugetlb path
was overlooked.

Doing so by calling huge_pte_alloc() if the initial pgtable walk fails to
find the huge ptep.  It's possible that huge_pte_alloc() can fail with
high memory pressure, in that case stop the loop immediately and fail
silently.  This is not the most ideal solution but it matches with what we
do with shmem meanwhile it avoids the splat in dmesg.

Link: https://lkml.kernel.org/r/20230104225207.1066932-2-peterx@redhat.com
Fixes: 60dfaad65a ("mm/hugetlb: allow uffd wr-protect none ptes")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: James Houghton <jthoughton@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: <stable@vger.kernel.org>	[5.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-24 07:24:36 +01:00
David Hildenbrand
8d6a675cd7 mm/hugetlb: fix uffd-wp handling for migration entries in hugetlb_change_protection()
commit 44f86392bdd165da7e43d3c772aeb1e128ffd6c8 upstream.

We have to update the uffd-wp SWP PTE bit independent of the type of
migration entry.  Currently, if we're unlucky and we want to install/clear
the uffd-wp bit just while we're migrating a read-only mapped hugetlb
page, we would miss to set/clear the uffd-wp bit.

Further, if we're processing a readable-exclusive migration entry and
neither want to set or clear the uffd-wp bit, we could currently end up
losing the uffd-wp bit.  Note that the same would hold for writable
migrating entries, however, having a writable migration entry with the
uffd-wp bit set would already mean that something went wrong.

Note that the change from !is_readable_migration_entry ->
writable_migration_entry is harmless and actually cleaner, as raised by
Miaohe Lin and discussed in [1].

[1] https://lkml.kernel.org/r/90dd6a93-4500-e0de-2bf0-bf522c311b0c@huawei.com

Link: https://lkml.kernel.org/r/20221222205511.675832-3-david@redhat.com
Fixes: 60dfaad65a ("mm/hugetlb: allow uffd wr-protect none ptes")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-24 07:24:35 +01:00
David Hildenbrand
6062c992e9 mm/hugetlb: fix PTE marker handling in hugetlb_change_protection()
commit 0e678153f5be7e6c8d28835f5a678618da4b7a9c upstream.

Patch series "mm/hugetlb: uffd-wp fixes for hugetlb_change_protection()".

Playing with virtio-mem and background snapshots (using uffd-wp) on
hugetlb in QEMU, I managed to trigger a VM_BUG_ON().  Looking into the
details, hugetlb_change_protection() seems to not handle uffd-wp correctly
in all cases.

Patch #1 fixes my test case.  I don't have reproducers for patch #2, as it
requires running into migration entries.

I did not yet check in detail yet if !hugetlb code requires similar care.


This patch (of 2):

There are two problematic cases when stumbling over a PTE marker in
hugetlb_change_protection():

(1) We protect an uffd-wp PTE marker a second time using uffd-wp: we will
    end up in the "!huge_pte_none(pte)" case and mess up the PTE marker.

(2) We unprotect a uffd-wp PTE marker: we will similarly end up in the
    "!huge_pte_none(pte)" case even though we cleared the PTE, because
    the "pte" variable is stale. We'll mess up the PTE marker.

For example, if we later stumble over such a "wrongly modified" PTE marker,
we'll treat it like a present PTE that maps some garbage page.

This can, for example, be triggered by mapping a memfd backed by huge
pages, registering uffd-wp, uffd-wp'ing an unmapped page and (a)
uffd-wp'ing it a second time; or (b) uffd-unprotecting it; or (c)
unregistering uffd-wp. Then, ff we trigger fallocate(FALLOC_FL_PUNCH_HOLE)
on that file range, we will run into a VM_BUG_ON:

[  195.039560] page:00000000ba1f2987 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x0
[  195.039565] flags: 0x7ffffc0001000(reserved|node=0|zone=0|lastcpupid=0x1fffff)
[  195.039568] raw: 0007ffffc0001000 ffffe742c0000008 ffffe742c0000008 0000000000000000
[  195.039569] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[  195.039569] page dumped because: VM_BUG_ON_PAGE(compound && !PageHead(page))
[  195.039573] ------------[ cut here ]------------
[  195.039574] kernel BUG at mm/rmap.c:1346!
[  195.039579] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[  195.039581] CPU: 7 PID: 4777 Comm: qemu-system-x86 Not tainted 6.0.12-200.fc36.x86_64 #1
[  195.039583] Hardware name: LENOVO 20WNS1F81N/20WNS1F81N, BIOS N35ET50W (1.50 ) 09/15/2022
[  195.039584] RIP: 0010:page_remove_rmap+0x45b/0x550
[  195.039588] Code: [...]
[  195.039589] RSP: 0018:ffffbc03c3633ba8 EFLAGS: 00010292
[  195.039591] RAX: 0000000000000040 RBX: ffffe742c0000000 RCX: 0000000000000000
[  195.039592] RDX: 0000000000000002 RSI: ffffffff8e7aac1a RDI: 00000000ffffffff
[  195.039592] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffbc03c3633a08
[  195.039593] R10: 0000000000000003 R11: ffffffff8f146328 R12: ffff9b04c42754b0
[  195.039594] R13: ffffffff8fcc6328 R14: ffffbc03c3633c80 R15: ffff9b0484ab9100
[  195.039595] FS:  00007fc7aaf68640(0000) GS:ffff9b0bbf7c0000(0000) knlGS:0000000000000000
[  195.039596] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  195.039597] CR2: 000055d402c49110 CR3: 0000000159392003 CR4: 0000000000772ee0
[  195.039598] PKRU: 55555554
[  195.039599] Call Trace:
[  195.039600]  <TASK>
[  195.039602]  __unmap_hugepage_range+0x33b/0x7d0
[  195.039605]  unmap_hugepage_range+0x55/0x70
[  195.039608]  hugetlb_vmdelete_list+0x77/0xa0
[  195.039611]  hugetlbfs_fallocate+0x410/0x550
[  195.039612]  ? _raw_spin_unlock_irqrestore+0x23/0x40
[  195.039616]  vfs_fallocate+0x12e/0x360
[  195.039618]  __x64_sys_fallocate+0x40/0x70
[  195.039620]  do_syscall_64+0x58/0x80
[  195.039623]  ? syscall_exit_to_user_mode+0x17/0x40
[  195.039624]  ? do_syscall_64+0x67/0x80
[  195.039626]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  195.039628] RIP: 0033:0x7fc7b590651f
[  195.039653] Code: [...]
[  195.039654] RSP: 002b:00007fc7aaf66e70 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[  195.039655] RAX: ffffffffffffffda RBX: 0000558ef4b7f370 RCX: 00007fc7b590651f
[  195.039656] RDX: 0000000018000000 RSI: 0000000000000003 RDI: 000000000000000c
[  195.039657] RBP: 0000000008000000 R08: 0000000000000000 R09: 0000000000000073
[  195.039658] R10: 0000000008000000 R11: 0000000000000293 R12: 0000000018000000
[  195.039658] R13: 00007fb8bbe00000 R14: 000000000000000c R15: 0000000000001000
[  195.039661]  </TASK>

Fix it by not going into the "!huge_pte_none(pte)" case if we stumble over
an exclusive marker.  spin_unlock() + continue would get the job done.

However, instead, make it clearer that there are no fall-through
statements: we process each case (hwpoison, migration, marker, !none,
none) and then unlock the page table to continue with the next PTE.  Let's
avoid "continue" statements and use a single spin_unlock() at the end.

Link: https://lkml.kernel.org/r/20221222205511.675832-1-david@redhat.com
Link: https://lkml.kernel.org/r/20221222205511.675832-2-david@redhat.com
Fixes: 60dfaad65a ("mm/hugetlb: allow uffd wr-protect none ptes")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-24 07:24:35 +01:00
James Houghton
63f71b8609 hugetlb: unshare some PMDs when splitting VMAs
commit b30c14cd61025eeea2f2e8569606cd167ba9ad2d upstream.

PMD sharing can only be done in PUD_SIZE-aligned pieces of VMAs; however,
it is possible that HugeTLB VMAs are split without unsharing the PMDs
first.

Without this fix, it is possible to hit the uffd-wp-related WARN_ON_ONCE
in hugetlb_change_protection [1].  The key there is that
hugetlb_unshare_all_pmds will not attempt to unshare PMDs in
non-PUD_SIZE-aligned sections of the VMA.

It might seem ideal to unshare in hugetlb_vm_op_open, but we need to
unshare in both the new and old VMAs, so unsharing in hugetlb_vm_op_split
seems natural.

[1]: https://lore.kernel.org/linux-mm/CADrL8HVeOkj0QH5VZZbRzybNE8CG-tEGFshnA+bG9nMgcWtBSg@mail.gmail.com/

Link: https://lkml.kernel.org/r/20230104231910.1464197-1-jthoughton@google.com
Fixes: 6dfeaff93b ("hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp")
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-24 07:24:33 +01:00
Mike Kravetz
17183187dc hugetlb: really allocate vma lock for all sharable vmas
commit e700898fa075c69b3ae02b702ab57fb75e1a82ec upstream.

Commit bbff39cc6c ("hugetlb: allocate vma lock for all sharable vmas")
removed the pmd sharable checks in the vma lock helper routines.  However,
it left the functional version of helper routines behind #ifdef
CONFIG_ARCH_WANT_HUGE_PMD_SHARE.  Therefore, the vma lock is not being
used for sharable vmas on architectures that do not support pmd sharing.
On these architectures, a potential fault/truncation race is exposed that
could leave pages in a hugetlb file past i_size until the file is removed.

Move the functional vma lock helpers outside the ifdef, and remove the
non-functional stubs.  Since the vma lock is not just for pmd sharing,
rename the routine __vma_shareable_flags_pmd.

Link: https://lkml.kernel.org/r/20221212235042.178355-1-mike.kravetz@oracle.com
Fixes: bbff39cc6c ("hugetlb: allocate vma lock for all sharable vmas")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-07 11:11:55 +01:00
Mike Kravetz
04ada095dc hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
tables associated with the address range.  For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final.  However,
__unmap_hugepage_range_final assumes the passed vma is about to be removed
and deletes the vma_lock to prevent pmd sharing as the vma is on the way
out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
missing vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing.  Subsequent faults on this vma were confused as
VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
not set in new pages added to the page table.  This resulted in pages that
appeared anonymous in a VM_SHARED vma and triggered the BUG.

Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
call from unmap_vmas().  This is used to indicate the 'final' unmapping of
a hugetlb vma.  When called via MADV_DONTNEED, this flag is not set and
the vm_lock is not deleted.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
Fixes: 90e7e7f5ef ("mm: enable MADV_DONTNEED for hugetlb mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30 14:49:40 -08:00
Mike Kravetz
7fb0728a9b hugetlb: fix __prep_compound_gigantic_page page flag setting
Commit 2b21624fc2 ("hugetlb: freeze allocated pages before creating
hugetlb pages") changed the order page flags were cleared and set in the
head page.  It moved the __ClearPageReserved after __SetPageHead. 
However, there is a check to make sure __ClearPageReserved is never done
on a head page.  If CONFIG_DEBUG_VM_PGFLAGS is enabled, the following BUG
will be hit when creating a hugetlb gigantic page:

    page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
    ------------[ cut here ]------------
    kernel BUG at include/linux/page-flags.h:500!
    Call Trace will differ depending on whether hugetlb page is created
    at boot time or run time.

Make sure to __ClearPageReserved BEFORE __SetPageHead.

Link: https://lkml.kernel.org/r/20221118195249.178319-1-mike.kravetz@oracle.com
Fixes: 2b21624fc2 ("hugetlb: freeze allocated pages before creating hugetlb pages")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: Tarun Sahu <tsahu@linux.ibm.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:44 -08:00
James Houghton
8625147caf hugetlbfs: don't delete error page from pagecache
This change is very similar to the change that was made for shmem [1], and
it solves the same problem but for HugeTLBFS instead.

Currently, when poison is found in a HugeTLB page, the page is removed
from the page cache.  That means that attempting to map or read that
hugepage in the future will result in a new hugepage being allocated
instead of notifying the user that the page was poisoned.  As [1] states,
this is effectively memory corruption.

The fix is to leave the page in the page cache.  If the user attempts to
use a poisoned HugeTLB page with a syscall, the syscall will fail with
EIO, the same error code that shmem uses.  For attempts to map the page,
the thread will get a BUS_MCEERR_AR SIGBUS.

[1]: commit a760542666 ("mm: shmem: don't truncate page if memory failure happens")

Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08 15:57:22 -08:00
Mike Kravetz
612b8a3170 hugetlb: fix memory leak associated with vma_lock structure
The hugetlb vma_lock structure hangs off the vm_private_data pointer of
sharable hugetlb vmas.  The structure is vma specific and can not be
shared between vmas.  At fork and various other times, vmas are duplicated
via vm_area_dup().  When this happens, the pointer in the newly created
vma must be cleared and the structure reallocated.  Two hugetlb specific
routines deal with this hugetlb_dup_vma_private and hugetlb_vm_op_open. 
Both routines are called for newly created vmas.  hugetlb_dup_vma_private
would always clear the pointer and hugetlb_vm_op_open would allocate the
new vms_lock structure.  This did not work in the case of this calling
sequence pointed out in [1].

  move_vma
    copy_vma
      new_vma = vm_area_dup(vma);
      new_vma->vm_ops->open(new_vma); --> new_vma has its own vma lock.
    is_vm_hugetlb_page(vma)
      clear_vma_resv_huge_pages
        hugetlb_dup_vma_private --> vma->vm_private_data is set to NULL

When clearing hugetlb_dup_vma_private we actually leak the associated
vma_lock structure.

The vma_lock structure contains a pointer to the associated vma.  This
information can be used in hugetlb_dup_vma_private and hugetlb_vm_op_open
to ensure we only clear the vm_private_data of newly created (copied)
vmas.  In such cases, the vma->vma_lock->vma field will not point to the
vma.

Update hugetlb_dup_vma_private and hugetlb_vm_op_open to not clear
vm_private_data if vma->vma_lock->vma == vma.  Also, log a warning if
hugetlb_vm_op_open ever encounters the case where vma_lock has already
been correctly allocated for the vma.

[1] https://lore.kernel.org/linux-mm/5154292a-4c55-28cd-0935-82441e512fc3@huawei.com/

Link: https://lkml.kernel.org/r/20221019201957.34607-1-mike.kravetz@oracle.com
Fixes: 131a79b474 ("hugetlb: fix vma lock handling during split vma and range unmapping")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20 21:27:23 -07:00
Rik van Riel
12df140f0b mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages
The h->*_huge_pages counters are protected by the hugetlb_lock, but
alloc_huge_page has a corner case where it can decrement the counter
outside of the lock.

This could lead to a corrupted value of h->resv_huge_pages, which we have
observed on our systems.

Take the hugetlb_lock before decrementing h->resv_huge_pages to avoid a
potential race.

Link: https://lkml.kernel.org/r/20221017202505.0e6a4fcd@imladris.surriel.com
Fixes: a88c769548 ("mm: hugetlb: fix hugepage memory leak caused by wrong reserve count")
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Glen McCready <gkmccready@meta.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-20 21:27:23 -07:00
Linus Torvalds
5e714bf171 - Alistair Popple has a series which addresses a race which causes page
refcounting errors in ZONE_DEVICE pages.
 
 - Peter Xu fixes some userfaultfd test harness instability.
 
 - Various other patches in MM, mainly fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0j6igAKCRDdBJ7gKXxA
 jnGxAP99bV39ZtOsoY4OHdZlWU16BUjKuf/cb3bZlC2G849vEwD+OKlij86SG20j
 MGJQ6TfULJ8f1dnQDd6wvDfl3FMl7Qc=
 =tbdp
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - fix a race which causes page refcounting errors in ZONE_DEVICE pages
   (Alistair Popple)

 - fix userfaultfd test harness instability (Peter Xu)

 - various other patches in MM, mainly fixes

* tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (29 commits)
  highmem: fix kmap_to_page() for kmap_local_page() addresses
  mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page
  mm/selftest: uffd: explain the write missing fault check
  mm/hugetlb: use hugetlb_pte_stable in migration race check
  mm/hugetlb: fix race condition of uffd missing/minor handling
  zram: always expose rw_page
  LoongArch: update local TLB if PTE entry exists
  mm: use update_mmu_tlb() on the second thread
  kasan: fix array-bounds warnings in tests
  hmm-tests: add test for migrate_device_range()
  nouveau/dmem: evict device private memory during release
  nouveau/dmem: refactor nouveau_dmem_fault_copy_one()
  mm/migrate_device.c: add migrate_device_range()
  mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()
  mm/memremap.c: take a pgmap reference on page allocation
  mm: free device private pages have zero refcount
  mm/memory.c: fix race when faulting a device private page
  mm/damon: use damon_sz_region() in appropriate place
  mm/damon: move sz_damon_region to damon_sz_region
  lib/test_meminit: add checks for the allocation functions
  ...
2022-10-14 12:28:43 -07:00
Peter Xu
f9bf6c03ec mm/hugetlb: use hugetlb_pte_stable in migration race check
After hugetlb_pte_stable() introduced, we can also rewrite the migration
race condition against page allocation to use the new helper too.

Link: https://lkml.kernel.org/r/20221004193400.110155-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12 18:51:50 -07:00
Peter Xu
2ea7ff1e39 mm/hugetlb: fix race condition of uffd missing/minor handling
Patch series "mm/hugetlb: Fix selftest failures with write check", v3.

Currently akpm mm-unstable fails with uffd hugetlb private mapping test
randomly on a write check.

The initial bisection of that points to the recent pmd unshare series, but
it turns out there's no direction relationship with the series but only
some timing change caused the race to start trigger.

The race should be fixed in patch 1.  Patch 2 is a trivial cleanup on the
similar race with hugetlb migrations, patch 3 comment on the write check
so when anyone read it again it'll be clear why it's there.


This patch (of 3):

After the recent rework patchset of hugetlb locking on pmd sharing,
kselftest for userfaultfd sometimes fails on hugetlb private tests with
unexpected write fault checks.

It turns out there's nothing wrong within the locking series regarding
this matter, but it could have changed the timing of threads so it can
trigger an old bug.

The real bug is when we call hugetlb_no_page() we're not with the pgtable
lock.  It means we're reading the pte values lockless.  It's perfectly
fine in most cases because before we do normal page allocations we'll take
the lock and check pte_same() again.  However before that, there are
actually two paths on userfaultfd missing/minor handling that may directly
move on with the fault process without checking the pte values.

It means for these two paths we may be generating an uffd message based on
an unstable pte, while an unstable pte can legally be anything as long as
the modifier holds the pgtable lock.

One example, which is also what happened in the failing kselftest and
caused the test failure, is that for private mappings wr-protection
changes can happen on one page.  While hugetlb_change_protection()
generally requires pte being cleared before being changed, then there can
be a race condition like:

        thread 1                              thread 2
        --------                              --------

      UFFDIO_WRITEPROTECT                     hugetlb_fault
        hugetlb_change_protection
          pgtable_lock()
          huge_ptep_modify_prot_start
                                              pte==NULL
                                              hugetlb_no_page
                                                generate uffd missing event
                                                even if page existed!!
          huge_ptep_modify_prot_commit
          pgtable_unlock()

Fix this by rechecking the pte after pgtable lock for both userfaultfd
missing & minor fault paths.

This bug should have been around starting from uffd hugetlb introduced, so
attaching a Fixes to the commit.  Also attach another Fixes to the minor
support commit for easier tracking.

Note that userfaultfd is actually fine with false positives (e.g.  caused
by pte changed), but not wrong logical events (e.g.  caused by reading a
pte during changing).  The latter can confuse the userspace, so the
strictness is very much preferred.  E.g., MISSING event should never
happen on the page after UFFDIO_COPY has correctly installed the page and
returned.

Link: https://lkml.kernel.org/r/20221004193400.110155-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221004193400.110155-2-peterx@redhat.com
Fixes: 1a1aad8a9b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Fixes: 7677f7fd8b ("userfaultfd: add minor fault registration mode")
Signed-off-by: Peter Xu <peterx@redhat.com>
Co-developed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12 18:51:50 -07:00
Peter Xu
515778e2d7 mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in
When PTE_MARKER_UFFD_WP not configured, it's still possible to reach pte
marker code and trigger an warning. Add a few CONFIG_PTE_MARKER_UFFD_WP
ifdefs to make sure the code won't be reached when not compiled in.

Link: https://lkml.kernel.org/r/YzeR+R6b4bwBlBHh@x1n
Fixes: b1f9e87686 ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: <syzbot+2b9b4f0895be09a6dec3@syzkaller.appspotmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Edward Liaw <edliaw@google.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12 15:56:46 -07:00
Andrew Morton
acfac37851 mm/hugetlb.c: make __hugetlb_vma_unlock_write_put() static
Reported-by: kernel test robot <lkp@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12 15:56:45 -07:00
Linus Torvalds
1440f57602 Five hotfixes - three for nilfs2, two for MM. For are cc:stable, one is
not.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0YhtwAKCRDdBJ7gKXxA
 juJLAQDCa0g8sfe9cTw3PT1gRnn8gWLHEkMgUWVC/aBaqYFGeQEAta+g8muv9Tpd
 qODv0JARH4cwONKEA24Oql+A5RnI6gQ=
 =QZnW
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc hotfixes from Andrew Morton:
 "Five hotfixes - three for nilfs2, two for MM. For are cc:stable, one
  is not"

* tag 'mm-hotfixes-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  nilfs2: fix leak of nilfs_root in case of writer thread creation failure
  nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level()
  nilfs2: fix use-after-free bug of struct nilfs_root
  mm/damon/core: initialize damon_target->list in damon_new_target()
  mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page
2022-10-12 11:16:58 -07:00
Baolin Wang
fac35ba763 mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page
On some architectures (like ARM64), it can support CONT-PTE/PMD size
hugetlb, which means it can support not only PMD/PUD size hugetlb (2M and
1G), but also CONT-PTE/PMD size(64K and 32M) if a 4K page size specified.

So when looking up a CONT-PTE size hugetlb page by follow_page(), it will
use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size
hugetlb in follow_page_pte().  However this pte entry lock is incorrect
for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get
the correct lock, which is mm->page_table_lock.

That means the pte entry of the CONT-PTE size hugetlb under current pte
lock is unstable in follow_page_pte(), we can continue to migrate or
poison the pte entry of the CONT-PTE size hugetlb, which can cause some
potential race issues, even though they are under the 'pte lock'.

For example, suppose thread A is trying to look up a CONT-PTE size hugetlb
page by move_pages() syscall under the lock, however antoher thread B can
migrate the CONT-PTE hugetlb page at the same time, which will cause
thread A to get an incorrect page, if thread A also wants to do page
migration, then data inconsistency error occurs.

Moreover we have the same issue for CONT-PMD size hugetlb in
follow_huge_pmd().

To fix above issues, rename the follow_huge_pmd() as follow_huge_pmd_pte()
to handle PMD and PTE level size hugetlb, which uses huge_pte_lock() to
get the correct pte entry lock to make the pte entry stable.

Mike said:

Support for CONT_PMD/_PTE was added with bb9dd3df8e ("arm64: hugetlb:
refactor find_num_contig()").  Patch series "Support for contiguous pte
hugepages", v4.  However, I do not believe these code paths were
executed until migration support was added with 5480280d3f ("arm64/mm:
enable HugeTLB migration for contiguous bit HugeTLB pages") I would go
with 5480280d3f for the Fixes: targe.

Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.1662017562.git.baolin.wang@linux.alibaba.com
Fixes: 5480280d3f ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-11 19:05:44 -07:00
Mike Kravetz
bbff39cc6c hugetlb: allocate vma lock for all sharable vmas
The hugetlb vma lock was originally designed to synchronize pmd sharing. 
As such, it was only necessary to allocate the lock for vmas that were
capable of pmd sharing.  Later in the development cycle, it was discovered
that it could also be used to simplify fault/truncation races as described
in [1].  However, a subsequent change to allocate the lock for all vmas
that use the page cache was never made.  A fault/truncation race could
leave pages in a file past i_size until the file is removed.

Remove the previous restriction and allocate lock for all VM_MAYSHARE
vmas.  Warn in the unlikely event of allocation failure.

[1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t

Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com
Fixes: "hugetlb: clean up code checking for fault/truncation races"
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-07 14:28:40 -07:00
Mike Kravetz
ecfbd73387 hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb file truncation/hole punch code may need to back out and take
locks in order in the routine hugetlb_unmap_file_folio().  This code could
race with vma freeing as pointed out in [1] and result in accessing a
stale vma pointer.  To address this, take the vma_lock when clearing the
vma_lock->vma pointer.

[1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/

[mike.kravetz@oracle.com: address build issues]
  Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey
Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com
Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization"
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-07 14:28:40 -07:00
Mike Kravetz
131a79b474 hugetlb: fix vma lock handling during split vma and range unmapping
Patch series "hugetlb: fixes for new vma lock series".

In review of the series "hugetlb: Use new vma lock for huge pmd sharing
synchronization", Miaohe Lin pointed out two key issues:

1) There is a race in the routine hugetlb_unmap_file_folio when locks
   are dropped and reacquired in the correct order [1].

2) With the switch to using vma lock for fault/truncate synchronization,
   we need to make sure lock exists for all VM_MAYSHARE vmas, not just
   vmas capable of pmd sharing.

These two issues are addressed here.  In addition, having a vma lock
present in all VM_MAYSHARE vmas, uncovered some issues around vma
splitting.  Those are also addressed.

[1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/


This patch (of 3):

The hugetlb vma lock hangs off the vm_private_data field and is specific
to the vma.  When vm_area_dup() is called as part of vma splitting, the
vma lock pointer is copied to the new vma.  This will result in issues
such as double freeing of the structure.  Update the hugetlb open vm_ops
to allocate a new vma lock for the new vma.

The routine __unmap_hugepage_range_final unconditionally unset VM_MAYSHARE
to prevent subsequent pmd sharing.  hugetlb_vma_lock_free attempted to
anticipate this by checking both VM_MAYSHARE and VM_SHARED.  However, if
only VM_MAYSHARE was set we would miss the free.  With the introduction of
the vma lock, a vma can not participate in pmd sharing if vm_private_data
is NULL.  Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final,
free the vma lock to prevent sharing.  Also, update the sharing code to
make sure vma lock is indeed a condition for pmd sharing. 
hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas.

Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
Fixes: "hugetlb: add vma based lock for pmd sharing"
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-07 14:28:40 -07:00
Xin Hao
8346d69d8b mm/hugetlb: add available_huge_pages() func
In hugetlb.c there are several places which compare the values of
'h->free_huge_pages' and 'h->resv_huge_pages', it looks a bit messy, so
add a new available_huge_pages() function to do these.

Link: https://lkml.kernel.org/r/20220922021929.98961-1-xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:35 -07:00
Liu Shixin
958f32ce83 mm: hugetlb: fix UAF in hugetlb_handle_userfault
The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
and reacquire them again after handle_userfault(), but reacquire the
vma_lock could lead to UAF[1,2] due to the following race,

hugetlb_fault
  hugetlb_no_page
    /*unlock vma_lock */
    hugetlb_handle_userfault
      handle_userfault
        /* unlock mm->mmap_lock*/
                                           vm_mmap_pgoff
                                             do_mmap
                                               mmap_region
                                                 munmap_vma_range
                                                   /* clean old vma */
        /* lock vma_lock again  <--- UAF */
    /* unlock vma_lock */

Since the vma_lock will unlock immediately after
hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.

[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a9b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:32 -07:00
Mike Kravetz
2b21624fc2 hugetlb: freeze allocated pages before creating hugetlb pages
When creating hugetlb pages, the hugetlb code must first allocate
contiguous pages from a low level allocator such as buddy, cma or
memblock.  The pages returned from these low level allocators are ref
counted.  This creates potential issues with other code taking speculative
references on these pages before they can be transformed to a hugetlb
page.  This issue has been addressed with methods and code such as that
provided in [1].

Recent discussions about vmemmap freeing [2] have indicated that it would
be beneficial to freeze all sub pages, including the head page of pages
returned from low level allocators before converting to a hugetlb page. 
This helps avoid races if we want to replace the page containing vmemmap
for the head page.

There have been proposals to change at least the buddy allocator to return
frozen pages as described at [3].  If such a change is made, it can be
employed by the hugetlb code.  However, as mentioned above hugetlb uses
several low level allocators so each would need to be modified to return
frozen pages.  For now, we can manually freeze the returned pages.  This
is done in two places:

1) alloc_buddy_huge_page, only the returned head page is ref counted.
   We freeze the head page, retrying once in the VERY rare case where
   there may be an inflated ref count.
2) prep_compound_gigantic_page, for gigantic pages the current code
   freezes all pages except the head page.  New code will simply freeze
   the head page as well.

In a few other places, code checks for inflated ref counts on newly
allocated hugetlb pages.  With the modifications to freeze after
allocating, this code can be removed.

After hugetlb pages are freshly allocated, they are often added to the
hugetlb free lists.  Since these pages were previously ref counted, this
was done via put_page() which would end up calling the hugetlb destructor:
free_huge_page.  With changes to freeze pages, we simply call
free_huge_page directly to add the pages to the free list.

In a few other places, freshly allocated hugetlb pages were immediately
put into use, and the expectation was they were already ref counted.  In
these cases, we must manually ref count the page.

[1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/
[2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
[3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/

[mike.kravetz@oracle.com: fix NULL pointer dereference]
  Link: https://lkml.kernel.org/r/20220921202702.106069-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20220916214638.155744-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:31 -07:00
Mike Kravetz
fa27759af4 hugetlb: clean up code checking for fault/truncation races
With the new hugetlb vma lock in place, it can also be used to handle page
fault races with file truncation.  The lock is taken at the beginning of
the code fault path in read mode.  During truncation, it is taken in write
mode for each vma which has the file mapped.  The file's size (i_size) is
modified before taking the vma lock to unmap.

How are races handled?

The page fault code checks i_size early in processing after taking the vma
lock.  If the fault is beyond i_size, the fault is aborted.  If the fault
is not beyond i_size the fault will continue and a new page will be added
to the file.  It could be that truncation code modifies i_size after the
check in fault code.  That is OK, as truncation code will soon remove the
page.  The truncation code will wait until the fault is finished, as it
must obtain the vma lock in write mode.

This patch cleans up/removes late checks in the fault paths that try to
back out pages racing with truncation.  As noted above, we just let the
truncation code remove the pages.

[mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
  Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:17 -07:00
Mike Kravetz
40549ba8f8 hugetlb: use new vma_lock for pmd sharing synchronization
The new hugetlb vma lock is used to address this race:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...

The vma_lock is used as follows:
- During fault processing. The lock is acquired in read mode before
  doing a page table lock and allocation (huge_pte_alloc).  The lock is
  held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
  called.

Lock ordering issues come into play when unmapping a page from all
vmas mapping the page.  The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare.  This is done today in:
- try_to_migrate_one and try_to_unmap_ for page migration and memory
  error handling.  In these routines we 'try' to obtain the vma lock and
  fail to unmap if unsuccessful.  Calling routines already deal with the
  failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch.  This routine
  also tries to acquire the vma lock.  If it fails, it skips the
  unmapping.  However, we can not have file truncation or hole punch
  fail because of contention.  After hugetlb_vmdelete_list, truncation
  and hole punch call remove_inode_hugepages.  remove_inode_hugepages
  checks for mapped pages and call hugetlb_unmap_file_page to unmap them.
  hugetlb_unmap_file_page is designed to drop locks and reacquire in the
  correct order to guarantee unmap success.

Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:17 -07:00
Mike Kravetz
8d9bfb2608 hugetlb: add vma based lock for pmd sharing
Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for
synchronization use by vmas that could be involved in pmd sharing.  This
data structure contains a rw semaphore that is the primary tool used for
synchronization.

This new structure is ref counted, so that it can exist when NOT attached
to a vma.  This is only helpful in resolving lock ordering issues where
code may need to obtain the vma_lock while there are no guarantees the vma
may go away.  By obtaining a ref on the structure, it can be guaranteed
that at least the rw semaphore will not go away.

Only add infrastructure for the new lock here.  Actual use will be added
in subsequent patches.

[mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release]
  Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey
Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:17 -07:00
Mike Kravetz
12710fd696 hugetlb: rename vma_shareable() and refactor code
Rename the routine vma_shareable to vma_addr_pmd_shareable as it is
checking a specific address within the vma.  Refactor code to check if an
aligned range is shareable as this will be needed in a subsequent patch.

Link: https://lkml.kernel.org/r/20220914221810.95771-6-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:16 -07:00
Mike Kravetz
7e1813d48d hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
remove_huge_page removes a hugetlb page from the page cache.  Change to
hugetlb_delete_from_page_cache as it is a more descriptive name. 
huge_add_to_page_cache is global in scope, but only deals with hugetlb
pages.  For consistency and clarity, rename to hugetlb_add_to_page_cache.

Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:16 -07:00
Mike Kravetz
3a47c54f09 hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  However, this has been shown to cause
performance/scaling issues.  Revert the code and go back to only taking
the semaphore in huge_pmd_share during the fault path.

Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.

NOTE: Reverting this code does expose the following race condition.

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

It is unknown if the above race was ever experienced by a user.  It was
discovered via code inspection when initially addressed.

In subsequent patches, a new synchronization mechanism will be added to
coordinate pmd sharing and eliminate this race.

Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:16 -07:00
Mike Kravetz
188a39725a hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
Patch series "hugetlb: Use new vma lock for huge pmd sharing
synchronization", v2.

hugetlb fault scalability regressions have recently been reported [1]. 
This is not the first such report, as regressions were also noted when
commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

The regression and benefit of this patch series is not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running,
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"

			48 sample Avg
next-20220913		next-20220913			next-20220913
unmodified	revert i_mmap_sema locking	vma sema locking, this series
-----------------------------------------------------------------------------
498150 KB/s		501934 KB/s			504793 KB/s

The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings.  To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
   Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
   Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory.  The shared
mapping size was 250GB.  For comparison, a single instance of the program
was run.  Then, multiple instances were run in parallel to introduce
lock contention.  Changing the locking scheme results in a significant
performance benefit.

test		instances	unmodified	revert		vma
--------------------------------------------------------------------------
faults per sec	1		393043		395680		389932
faults per sec  24		 71405		 81191		 79048
forks per sec   1		  2802		  2747		  2725
forks per sec   24		   439		   536		   500
Combined faults 24		  1621		 68070		 53662
Combined forks  24		   358		    67		   142

Combined test is when running both faulting program and forking program
simultaneously.

Patches 1 and 2 of this series revert c0d0381ade and 87bf91d39b which
depends on c0d0381ade.  Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share.  With c0d0381ade reverted, this race is exposed:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

Reverting 87bf91d39b exposes races in page fault/file truncation.  When
the new vma lock is put to use in patch 8, this will handle the fault/file
truncation races.  This is explained in patch 9 where code associated with
these races is cleaned up.

Patches 3 - 5 restructure existing code in preparation for using the new
vma lock (rw semaphore) for pmd sharing synchronization.  The idea is that
this semaphore will be held in read mode for the duration of fault
processing, and held in write mode for unmap operations which may call
huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required to
synchronize huge pmd sharing.  However it is only required in the fault
path when setting up sharing, and will be acquired in huge_pmd_share().

Patch 6 adds the new vma lock and all supporting routines, but does not
actually change code to use the new lock.

Patch 7 refactors code in preparation for using the new lock.  And, patch
8 finally adds code to make use of this new vma lock.  Unfortunately, the
fault code and truncate/hole punch code would naturally take locks in the
opposite order which could lead to deadlock.  Since the performance of
page faults is more important, the truncation/hole punch code is modified
to back out and take locks in the correct order if necessary.

[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/


This patch (of 9):

Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  The use of i_mmap_rwsem to prevent
fault/truncate races depends on this.  However, this has been shown to
cause performance/scaling issues.  As a result, that code will be
reverted.  Since the use i_mmap_rwsem to address page fault/truncate races
depends on this, it must also be reverted.

In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.

Link: https://lkml.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20220914221810.95771-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:16 -07:00
XU pengfei
3259914f8c mm/hugetlb: remove unnecessary 'NULL' values from pointer
Pointer variables allocate memory first, and then judge.  There is no need
to initialize the assignment.

Link: https://lkml.kernel.org/r/20220914012113.6271-1-xupengfei@nfschina.com
Signed-off-by: XU pengfei <xupengfei@nfschina.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:15 -07:00
Muchun Song
a4a00b451e mm: hugetlb: eliminate memory-less nodes handling
The memory-notify-based approach aims to handle meory-less nodes, however,
it just adds the complexity of code as pointed by David in thread [1]. 
The handling of memory-less nodes is introduced by commit 4faf8d950e
("hugetlb: handle memory hot-plug events").  >From its commit message, we
cannot find any necessity of handling this case.  So, we can simply
register/unregister sysfs entries in register_node/unregister_node to
simlify the code.

BTW, hotplug callback added because in hugetlb_register_all_nodes() we
register sysfs nodes only for N_MEMORY nodes, seeing commit 9b5e5d0fdc,
which said it was a preparation for handling memory-less nodes via memory
hotplug.  Since we want to remove memory hotplug, so make sure we only
register per-node sysfs for online (N_ONLINE) nodes in
hugetlb_register_all_nodes().

https://lore.kernel.org/linux-mm/60933ffc-b850-976c-78a0-0ee6e0ea9ef0@redhat.com/ [1]
Link: https://lkml.kernel.org/r/20220914072603.60293-3-songmuchun@bytedance.com
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:15 -07:00
Muchun Song
b958d4d08f mm: hugetlb: simplify per-node sysfs creation and removal
Patch series "simplify handling of per-node sysfs creation and removal",
v4.


This patch (of 2):

The following commit offload per-node sysfs creation and removal to a
kworker and did not say why it is needed.  And it also said "I don't know
that this is absolutely required".  It seems like the author was not sure
as well.  Since it only complicates the code, this patch will revert the
changes to simplify the code.

  39da08cb07 ("hugetlb: offload per node attribute registrations")

We could use memory hotplug notifier to do per-node sysfs creation and
removal instead of inserting those operations to node registration and
unregistration.  Then, it can reduce the code coupling between node.c and
hugetlb.c.  Also, it can simplify the code.

Link: https://lkml.kernel.org/r/20220914072603.60293-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20220914072603.60293-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:15 -07:00
Cheng Li
14455eabd8 mm: use nth_page instead of mem_map_offset mem_map_next
To handle the discontiguous case, mem_map_next() has a parameter named
`offset`.  As a function caller, one would be confused why "get next
entry" needs a parameter named "offset".  The other drawback of
mem_map_next() is that the callers must take care of the map between
parameter "iter" and "offset", otherwise we may get an hole or duplication
during iteration.  So we use nth_page instead of mem_map_next.

And replace mem_map_offset with nth_page() per Matthew's comments.

Link: https://lkml.kernel.org/r/1662708669-9395-1-git-send-email-lic121@chinatelecom.cn
Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
Fixes: 69d177c2fc ("hugetlbfs: handle pages higher order than MAX_ORDER")
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:08 -07:00
Li zeming
8eeda55fe0 mm/hugetlb.c: remove unnecessary initialization of local `err'
Link: https://lkml.kernel.org/r/20220905020918.3552-1-zeming@nfschina.com
Signed-off-by: Li zeming <zeming@nfschina.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:02:55 -07:00
Andrew Morton
6d751329e7 Merge branch 'mm-hotfixes-stable' into mm-stable 2022-09-26 13:13:15 -07:00
Doug Berger
317314527d mm/hugetlb: correct demote page offset logic
With gigantic pages it may not be true that struct page structures are
contiguous across the entire gigantic page.  The nth_page macro is used
here in place of direct pointer arithmetic to correct for this.

Mike said:

: This error could cause addressing exceptions.  However, this is only
: possible in configurations where CONFIG_SPARSEMEM &&
: !CONFIG_SPARSEMEM_VMEMMAP.  Such a configuration option is rare and
: unknown to be the default anywhere.

Link: https://lkml.kernel.org/r/20220914190917.3517663-1-opendmb@gmail.com
Fixes: 8531fc6f52 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 12:14:34 -07:00
Miaohe Lin
5e6b1bf1b5 hugetlb: remove meaningless BUG_ON(huge_pte_none())
When code reaches here, invalid page would have been accessed if huge pte
is none. So this BUG_ON(huge_pte_none()) is meaningless. Remove it.

Link: https://lkml.kernel.org/r/20220901120030.63318-10-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:26:10 -07:00
Miaohe Lin
a9e1eab241 hugetlb: add comment for subtle SetHPageVmemmapOptimized()
The SetHPageVmemmapOptimized() called here seems unnecessary as it's
assumed to be set when calling this function. But it's indeed cleared
by above set_page_private(page, 0). Add a comment to avoid possible
future confusion.

Link: https://lkml.kernel.org/r/20220901120030.63318-9-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:26:10 -07:00