android_kernel_xiaomi_sm8450

Author	SHA1	Message	Date
Collin Fijalkovich	28b4b1588e	FROMLIST: mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs Transparent huge pages are supported for read-only non-shmem files, but are only used for vmas with VM_DENYWRITE. This condition ensures that file THPs are protected from writes while an application is running (ETXTBSY). Any existing file THPs are then dropped from the page cache when a file is opened for write in do_dentry_open(). Since sys_mmap ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas produced by execve(). Systems that make heavy use of shared libraries (e.g. Android) are unable to apply VM_DENYWRITE through the dynamic linker, preventing them from benefiting from the resultant reduced contention on the TLB. This patch reduces the constraint on file THPs allowing use with any executable mapping from a file not opened for write (see inode_is_open_for_write()). It also introduces additional conditions to ensure that files opened for write will never be backed by file THPs. Restricting the use of THPs to executable mappings eliminates the risk that a read-only file later opened for write would encounter significant latencies due to page cache truncation. The ld linker flag '-z max-page-size=(hugepage size)' can be used to produce executables with the necessary layout. The dynamic linker must map these file's segments at a hugepage size aligned vma for the mapping to be backed with THPs. Comparison of the performance characteristics of 4KB and 2MB-backed libraries follows; the Android dex2oat tool was used to AOT compile an example application on a single ARM core. 4KB Pages: ========== count event_name # count / runtime 598,995,035,942 cpu-cycles # 1.800861 GHz 81,195,620,851 raw-stall-frontend # 244.112 M/sec 347,754,466,597 iTLB-loads # 1.046 G/sec 2,970,248,900 iTLB-load-misses # 0.854122% miss rate Total test time: 332.854998 seconds. 2MB Pages: ========== count event_name # count / runtime 592,872,663,047 cpu-cycles # 1.800358 GHz 76,485,624,143 raw-stall-frontend # 232.261 M/sec 350,478,413,710 iTLB-loads # 1.064 G/sec 803,233,322 iTLB-load-misses # 0.229182% miss rate Total test time: 329.826087 seconds A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use: /apex/com.android.art/lib64/libart.so FilePmdMapped: 4096 kB /apex/com.android.art/lib64/libart-compiler.so FilePmdMapped: 2048 kB Bug: 158135888 Link: https://lore.kernel.org/patchwork/patch/1408266/ Reviewed-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com> Change-Id: I75c693a4b4e7526d374ef2c010bde3094233eef2	2021-04-28 18:41:35 +00:00
Greg Kroah-Hartman	d8c7f0a3cd	Merge 5.10.20 into android12-5.10 Changes in 5.10.20 vmlinux.lds.h: add DWARF v5 sections vdpa/mlx5: fix param validation in mlx5_vdpa_get_config() debugfs: be more robust at handling improper input in debugfs_lookup() debugfs: do not attempt to create a new file before the filesystem is initalized scsi: libsas: docs: Remove notify_ha_event() scsi: qla2xxx: Fix mailbox Ch erroneous error kdb: Make memory allocations more robust w1: w1_therm: Fix conversion result for negative temperatures PCI: qcom: Use PHY_REFCLK_USE_PAD only for ipq8064 PCI: Decline to resize resources if boot config must be preserved virt: vbox: Do not use wait_event_interruptible when called from kernel context bfq: Avoid false bfq queue merging ALSA: usb-audio: Fix PCM buffer allocation in non-vmalloc mode MIPS: vmlinux.lds.S: add missing PAGE_ALIGNED_DATA() section vmlinux.lds.h: Define SANTIZER_DISCARDS with CONFIG_GCOV_KERNEL=y random: fix the RNDRESEEDCRNG ioctl ALSA: pcm: Call sync_stop at disconnection ALSA: pcm: Assure sync with the pending stop operation at suspend ALSA: pcm: Don't call sync_stop if it hasn't been stopped drm/i915/gt: One more flush for Baytrail clear residuals ath10k: Fix error handling in case of CE pipe init failure Bluetooth: btqcomsmd: Fix a resource leak in error handling paths in the probe function Bluetooth: hci_uart: Fix a race for write_work scheduling Bluetooth: Fix initializing response id after clearing struct arm64: dts: renesas: beacon kit: Fix choppy Bluetooth Audio arm64: dts: renesas: beacon: Fix audio-1.8V pin enable ARM: dts: exynos: correct PMIC interrupt trigger level on Artik 5 ARM: dts: exynos: correct PMIC interrupt trigger level on Monk ARM: dts: exynos: correct PMIC interrupt trigger level on Rinato ARM: dts: exynos: correct PMIC interrupt trigger level on Spring ARM: dts: exynos: correct PMIC interrupt trigger level on Arndale Octa ARM: dts: exynos: correct PMIC interrupt trigger level on Odroid XU3 family arm64: dts: exynos: correct PMIC interrupt trigger level on TM2 arm64: dts: exynos: correct PMIC interrupt trigger level on Espresso memory: mtk-smi: Fix PM usage counter unbalance in mtk_smi ops Bluetooth: hci_qca: Fix memleak in qca_controller_memdump staging: vchiq: Fix bulk userdata handling staging: vchiq: Fix bulk transfers on 64-bit builds arm64: dts: qcom: msm8916-samsung-a5u: Fix iris compatible net: stmmac: dwmac-meson8b: fix enabling the timing-adjustment clock bpf: Add bpf_patch_call_args prototype to include/linux/bpf.h bpf: Avoid warning when re-casting __bpf_call_base into __bpf_call_base_args firmware: arm_scmi: Fix call site of scmi_notification_exit arm64: dts: allwinner: A64: properly connect USB PHY to port 0 arm64: dts: allwinner: H6: properly connect USB PHY to port 0 arm64: dts: allwinner: Drop non-removable from SoPine/LTS SD card arm64: dts: allwinner: H6: Allow up to 150 MHz MMC bus frequency arm64: dts: allwinner: A64: Limit MMC2 bus frequency to 150 MHz arm64: dts: qcom: msm8916-samsung-a2015: Fix sensors cpufreq: brcmstb-avs-cpufreq: Free resources in error path cpufreq: brcmstb-avs-cpufreq: Fix resource leaks in ->remove() arm64: dts: rockchip: rk3328: Add clock_in_out property to gmac2phy node ACPICA: Fix exception code class checks usb: gadget: u_audio: Free requests only after callback arm64: dts: qcom: sdm845-db845c: Fix reset-pin of ov8856 node soc: qcom: socinfo: Fix an off by one in qcom_show_pmic_model() soc: ti: pm33xx: Fix some resource leak in the error handling paths of the probe function staging: media: atomisp: Fix size_t format specifier in hmm_alloc() debug statemenet Bluetooth: drop HCI device reference before return Bluetooth: Put HCI device if inquiry procedure interrupts memory: ti-aemif: Drop child node when jumping out loop ARM: dts: Configure missing thermal interrupt for 4430 usb: dwc2: Do not update data length if it is 0 on inbound transfers usb: dwc2: Abort transaction after errors with unknown reason usb: dwc2: Make "trimming xfer length" a debug message staging: rtl8723bs: wifi_regd.c: Fix incorrect number of regulatory rules x86/MSR: Filter MSR writes through X86_IOC_WRMSR_REGS ioctl too arm64: dts: renesas: beacon: Fix EEPROM compatible value can: mcp251xfd: mcp251xfd_probe(): fix errata reference ARM: dts: armada388-helios4: assign pinctrl to LEDs ARM: dts: armada388-helios4: assign pinctrl to each fan arm64: dts: armada-3720-turris-mox: rename u-boot mtd partition to a53-firmware opp: Correct debug message in _opp_add_static_v2() Bluetooth: btusb: Fix memory leak in btusb_mtk_wmt_recv soc: qcom: ocmem: don't return NULL in of_get_ocmem arm64: dts: msm8916: Fix reserved and rfsa nodes unit address arm64: dts: meson: fix broken wifi node for Khadas VIM3L iwlwifi: mvm: set enabled in the PPAG command properly ARM: s3c: fix fiq for clang IAS optee: simplify i2c access staging: wfx: fix possible panic with re-queued frames ARM: at91: use proper asm syntax in pm_suspend ath10k: Fix suspicious RCU usage warning in ath10k_wmi_tlv_parse_peer_stats_info() ath10k: Fix lockdep assertion warning in ath10k_sta_statistics ath11k: fix a locking bug in ath11k_mac_op_start() soc: aspeed: snoop: Add clock control logic iwlwifi: mvm: fix the type we use in the PPAG table validity checks iwlwifi: mvm: store PPAG enabled/disabled flag properly iwlwifi: mvm: send stored PPAG command instead of local iwlwifi: mvm: assign SAR table revision to the command later iwlwifi: mvm: don't check if CSA event is running before removing bpf_lru_list: Read double-checked variable once without lock iwlwifi: pnvm: set the PNVM again if it was already loaded iwlwifi: pnvm: increment the pointer before checking the TLV ath9k: fix data bus crash when setting nf_override via debugfs selftests/bpf: Convert test_xdp_redirect.sh to bash ibmvnic: Set to CLOSED state even on error bnxt_en: reverse order of TX disable and carrier off bnxt_en: Fix devlink info's stored fw.psid version format. xen/netback: fix spurious event detection for common event case dpaa2-eth: fix memory leak in XDP_REDIRECT net: phy: consider that suspend2ram may cut off PHY power net/mlx5e: Don't change interrupt moderation params when DIM is enabled net/mlx5e: Change interrupt moderation channel params also when channels are closed net/mlx5: Fix health error state handling net/mlx5e: Replace synchronize_rcu with synchronize_net net/mlx5e: kTLS, Use refcounts to free kTLS RX priv context net/mlx5: Disable devlink reload for multi port slave device net/mlx5: Disallow RoCE on multi port slave device net/mlx5: Disallow RoCE on lag device net/mlx5: Disable devlink reload for lag devices net/mlx5e: CT: manage the lifetime of the ct entry object net/mlx5e: Check tunnel offload is required before setting SWP mac80211: fix potential overflow when multiplying to u32 integers libbpf: Ignore non function pointer member in struct_ops bpf: Fix an unitialized value in bpf_iter bpf, devmap: Use GFP_KERNEL for xdp bulk queue allocation bpf: Fix bpf_fib_lookup helper MTU check for SKB ctx selftests: mptcp: fix ACKRX debug message tcp: fix SO_RCVLOWAT related hangs under mem pressure net: axienet: Handle deferred probe on clock properly cxgb4/chtls/cxgbit: Keeping the max ofld immediate data size same in cxgb4 and ulds b43: N-PHY: Fix the update of coef for the PHY revision >= 3case bpf: Clear subreg_def for global function return values ibmvnic: add memory barrier to protect long term buffer ibmvnic: skip send_request_unmap for timeout reset net: dsa: felix: perform teardown in reverse order of setup net: dsa: felix: don't deinitialize unused ports net: phy: mscc: adding LCPLL reset to VSC8514 net: amd-xgbe: Reset the PHY rx data path when mailbox command timeout net: amd-xgbe: Fix NETDEV WATCHDOG transmit queue timeout warning net: amd-xgbe: Reset link when the link never comes back net: amd-xgbe: Fix network fluctuations when using 1G BELFUSE SFP net: mvneta: Remove per-cpu queue mapping for Armada 3700 net: enetc: fix destroyed phylink dereference during unbind tty: convert tty_ldisc_ops 'read()' function to take a kernel pointer tty: implement read_iter fbdev: aty: SPARC64 requires FB_ATY_CT drm/gma500: Fix error return code in psb_driver_load() gma500: clean up error handling in init drm/fb-helper: Add missed unlocks in setcmap_legacy() drm/panel: mantix: Tweak init sequence drm/vc4: hdmi: Take into account the clock doubling flag in atomic_check crypto: sun4i-ss - linearize buffers content must be kept crypto: sun4i-ss - fix kmap usage crypto: arm64/aes-ce - really hide slower algos when faster ones are enabled hwrng: ingenic - Fix a resource leak in an error handling path media: allegro: Fix use after free on error kcsan: Rewrite kcsan_prandom_u32_max() without prandom_u32_state() drm: rcar-du: Fix PM reference leak in rcar_cmm_enable() drm: rcar-du: Fix crash when using LVDS1 clock for CRTC drm: rcar-du: Fix the return check of of_parse_phandle and of_find_device_by_node drm/amdgpu: Fix macro name _AMDGPU_TRACE_H_ in preprocessor if condition MIPS: c-r4k: Fix section mismatch for loongson2_sc_init MIPS: lantiq: Explicitly compare LTQ_EBU_PCC_ISTAT against 0 drm/virtio: make sure context is created in gem open drm/fourcc: fix Amlogic format modifier masks media: ipu3-cio2: Build only for x86 media: i2c: ov5670: Fix PIXEL_RATE minimum value media: imx: Unregister csc/scaler only if registered media: imx: Fix csc/scaler unregister media: mtk-vcodec: fix error return code in vdec_vp9_decode() media: camss: missing error code in msm_video_register() media: vsp1: Fix an error handling path in the probe function media: em28xx: Fix use-after-free in em28xx_alloc_urbs media: media/pci: Fix memleak in empress_init media: tm6000: Fix memleak in tm6000_start_stream media: aspeed: fix error return code in aspeed_video_setup_video() ASoC: cs42l56: fix up error handling in probe ASoC: qcom: qdsp6: Move frontend AIFs to q6asm-dai evm: Fix memleak in init_desc crypto: bcm - Rename struct device_private to bcm_device_private sched/fair: Avoid stale CPU util_est value for schedutil in task dequeue drm/sun4i: tcon: fix inverted DCLK polarity media: imx7: csi: Fix regression for parallel cameras on i.MX6UL media: imx7: csi: Fix pad link validation media: ti-vpe: cal: fix write to unallocated memory MIPS: properly stop .eh_frame generation MIPS: Compare __SYNC_loongson3_war against 0 drm/tegra: Fix reference leak when pm_runtime_get_sync() fails drm/amdgpu: toggle on DF Cstate after finishing xgmi injection bsg: free the request before return error code macintosh/adb-iop: Use big-endian autopoll mask drm/amd/display: Fix 10/12 bpc setup in DCE output bit depth reduction. drm/amd/display: Fix HDMI deep color output for DCE 6-11. media: software_node: Fix refcounts in software_node_get_next_child() media: lmedm04: Fix misuse of comma media: vidtv: psi: fix missing crc for PMT media: atomisp: Fix a buffer overflow in debug code media: qm1d1c0042: fix error return code in qm1d1c0042_init() media: cx25821: Fix a bug when reallocating some dma memory media: mtk-vcodec: fix argument used when DEBUG is defined media: pxa_camera: declare variable when DEBUG is defined media: uvcvideo: Accept invalid bFormatIndex and bFrameIndex values sched/eas: Don't update misfit status if the task is pinned f2fs: compress: fix potential deadlock ASoC: qcom: lpass-cpu: Remove bit clock state check ASoC: SOF: Intel: hda: cancel D0i3 work during runtime suspend perf/arm-cmn: Fix PMU instance naming perf/arm-cmn: Move IRQs when migrating context mtd: parser: imagetag: fix error codes in bcm963xx_parse_imagetag_partitions() crypto: talitos - Work around SEC6 ERRATA (AES-CTR mode data size error) crypto: talitos - Fix ctr(aes) on SEC1 drm/nouveau: bail out of nouveau_channel_new if channel init fails mm: proc: Invalidate TLB after clearing soft-dirty page state ata: ahci_brcm: Add back regulators management ASoC: cpcap: fix microphone timeslot mask ASoC: codecs: add missing max_register in regmap config mtd: parsers: afs: Fix freeing the part name memory in failure f2fs: fix to avoid inconsistent quota data drm/amdgpu: Prevent shift wrapping in amdgpu_read_mask() f2fs: fix a wrong condition in __submit_bio ASoC: qcom: Fix typo error in HDMI regmap config callbacks KVM: nSVM: Don't strip host's C-bit from guest's CR3 when reading PDPTRs drm/mediatek: Check if fb is null Drivers: hv: vmbus: Avoid use-after-free in vmbus_onoffer_rescind() ASoC: Intel: sof_sdw: add missing TGL_HDMI quirk for Dell SKU 0A5E ASoC: Intel: sof_sdw: add missing TGL_HDMI quirk for Dell SKU 0A3E locking/lockdep: Avoid unmatched unlock ASoC: qcom: lpass: Fix i2s ctl register bit map ASoC: rt5682: Fix panic in rt5682_jack_detect_handler happening during system shutdown ASoC: SOF: debug: Fix a potential issue on string buffer termination btrfs: clarify error returns values in __load_free_space_cache btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidapge KVM: x86: Restore all 64 bits of DR6 and DR7 during RSM on x86-64 s390/zcrypt: return EIO when msg retry limit reached drm/vc4: hdmi: Move hdmi reset to bind drm/vc4: hdmi: Fix register offset with longer CEC messages drm/vc4: hdmi: Fix up CEC registers drm/vc4: hdmi: Restore cec physical address on reconnect drm/vc4: hdmi: Compute the CEC clock divider from the clock rate drm/vc4: hdmi: Update the CEC clock divider on HSM rate change drm/lima: fix reference leak in lima_pm_busy drm/dp_mst: Don't cache EDIDs for physical ports hwrng: timeriomem - Fix cooldown period calculation crypto: ecdh_helper - Ensure 'len >= secret.len' in decode_key() io_uring: fix possible deadlock in io_uring_poll nvmet-tcp: fix receive data digest calculation for multiple h2cdata PDUs nvmet-tcp: fix potential race of tcp socket closing accept_work nvme-multipath: set nr_zones for zoned namespaces nvmet: remove extra variable in identify ns nvmet: set status to 0 in case for invalid nsid ASoC: SOF: sof-pci-dev: add missing Up-Extreme quirk ima: Free IMA measurement buffer on error ima: Free IMA measurement buffer after kexec syscall ASoC: simple-card-utils: Fix device module clock fs/jfs: fix potential integer overflow on shift of a int jffs2: fix use after free in jffs2_sum_write_data() ubifs: Fix memleak in ubifs_init_authentication ubifs: replay: Fix high stack usage, again ubifs: Fix error return code in alloc_wbufs() irqchip/imx: IMX_INTMUX should not default to y, unconditionally smp: Process pending softirqs in flush_smp_call_function_from_idle() drm/amdgpu/display: remove hdcp_srm sysfs on device removal capabilities: Don't allow writing ambiguous v3 file capabilities HSI: Fix PM usage counter unbalance in ssi_hw_init power: supply: cpcap: Add missing IRQF_ONESHOT to fix regression clk: meson: clk-pll: fix initializing the old rate (fallback) for a PLL clk: meson: clk-pll: make "ret" a signed integer clk: meson: clk-pll: propagate the error from meson_clk_pll_set_rate() selftests/powerpc: Make the test check in eeh-basic.sh posix compliant regulator: qcom-rpmh-regulator: add pm8009-1 chip revision arm64: dts: qcom: qrb5165-rb5: fix pm8009 regulators quota: Fix memory leak when handling corrupted quota file i2c: iproc: handle only slave interrupts which are enabled i2c: iproc: update slave isr mask (ISR_MASK_SLAVE) i2c: iproc: handle master read request spi: cadence-quadspi: Abort read if dummy cycles required are too many clk: sunxi-ng: h6: Fix CEC clock clk: renesas: r8a779a0: Remove non-existent S2 clock clk: renesas: r8a779a0: Fix parent of CBFUSA clock HID: core: detect and skip invalid inputs to snto32() RDMA/siw: Fix handling of zero-sized Read and Receive Queues. dmaengine: fsldma: Fix a resource leak in the remove function dmaengine: fsldma: Fix a resource leak in an error handling path of the probe function dmaengine: owl-dma: Fix a resource leak in the remove function dmaengine: hsu: disable spurious interrupt mfd: bd9571mwv: Use devm_mfd_add_devices() power: supply: cpcap-charger: Fix missing power_supply_put() power: supply: cpcap-battery: Fix missing power_supply_put() power: supply: cpcap-charger: Fix power_supply_put on null battery pointer fdt: Properly handle "no-map" field in the memory region of/fdt: Make sure no-map does not remove already reserved regions RDMA/rtrs: Extend ibtrs_cq_qp_create RDMA/rtrs-srv: Release lock before call into close_sess RDMA/rtrs-srv: Use sysfs_remove_file_self for disconnect RDMA/rtrs-clt: Set mininum limit when create QP RDMA/rtrs: Call kobject_put in the failure path RDMA/rtrs-srv: Fix missing wr_cqe RDMA/rtrs-clt: Refactor the failure cases in alloc_clt RDMA/rtrs-srv: Init wr_cnt as 1 power: reset: at91-sama5d2_shdwc: fix wkupdbc mask rtc: s5m: select REGMAP_I2C dmaengine: idxd: set DMA channel to be private power: supply: fix sbs-charger build, needs REGMAP_I2C clocksource/drivers/ixp4xx: Select TIMER_OF when needed clocksource/drivers/mxs_timer: Add missing semicolon when DEBUG is defined spi: imx: Don't print error on -EPROBEDEFER RDMA/mlx5: Use the correct obj_id upon DEVX TIR creation IB/mlx5: Add mutex destroy call to cap_mask_mutex mutex clk: sunxi-ng: h6: Fix clock divider range on some clocks platform/chrome: cros_ec_proto: Use EC_HOST_EVENT_MASK not BIT platform/chrome: cros_ec_proto: Add LID and BATTERY to default mask regulator: axp20x: Fix reference cout leak watch_queue: Drop references to /dev/watch_queue certs: Fix blacklist flag type confusion regulator: s5m8767: Fix reference count leak spi: atmel: Put allocated master before return regulator: s5m8767: Drop regulators OF node reference power: supply: axp20x_usb_power: Init work before enabling IRQs power: supply: smb347-charger: Fix interrupt usage if interrupt is unavailable regulator: core: Avoid debugfs: Directory ... already present! error isofs: release buffer head before return watchdog: intel-mid_wdt: Postpone IRQ handler registration till SCU is ready auxdisplay: ht16k33: Fix refresh rate handling objtool: Fix error handling for STD/CLD warnings objtool: Fix retpoline detection in asm code objtool: Fix ".cold" section suffix check for newer versions of GCC scsi: lpfc: Fix ancient double free iommu: Switch gather->end to the inclusive end IB/umad: Return EIO in case of when device disassociated IB/umad: Return EPOLLERR in case of when device disassociated KVM: PPC: Make the VMX instruction emulation routines static powerpc/47x: Disable 256k page size powerpc/time: Enable sched clock for irqtime mmc: owl-mmc: Fix a resource leak in an error handling path and in the remove function mmc: sdhci-sprd: Fix some resource leaks in the remove function mmc: usdhi6rol0: Fix a resource leak in the error handling path of the probe mmc: renesas_sdhi_internal_dmac: Fix DMA buffer alignment from 8 to 128-bytes ARM: 9046/1: decompressor: Do not clear SCTLR.nTLSMD for ARMv7+ cores i2c: qcom-geni: Store DMA mapping data in geni_i2c_dev struct amba: Fix resource leak for drivers without .remove iommu: Move iotlb_sync_map out from __iommu_map iommu: Properly pass gfp_t in _iommu_map() to avoid atomic sleeping IB/mlx5: Return appropriate error code instead of ENOMEM IB/cm: Avoid a loop when device has 255 ports tracepoint: Do not fail unregistering a probe due to memory failure rtc: zynqmp: depend on HAS_IOMEM perf tools: Fix DSO filtering when not finding a map for a sampled address perf vendor events arm64: Fix Ampere eMag event typo RDMA/rxe: Fix coding error in rxe_recv.c RDMA/rxe: Fix coding error in rxe_rcv_mcast_pkt RDMA/rxe: Correct skb on loopback path spi: stm32: properly handle 0 byte transfer mfd: altera-sysmgr: Fix physical address storing more mfd: wm831x-auxadc: Prevent use after free in wm831x_auxadc_read_irq() powerpc/pseries/dlpar: handle ibm, configure-connector delay status powerpc/8xx: Fix software emulation interrupt clk: qcom: gcc-msm8998: Fix Alpha PLL type for all GPLLs kunit: tool: fix unit test cleanup handling kselftests: dmabuf-heaps: Fix Makefile's inclusion of the kernel's usr/include dir RDMA/hns: Fixed wrong judgments in the goto branch RDMA/siw: Fix calculation of tx_valid_cpus size RDMA/hns: Fix type of sq_signal_bits RDMA/hns: Disable RQ inline by default clk: divider: fix initialization with parent_hw spi: pxa2xx: Fix the controller numbering for Wildcat Point powerpc/uaccess: Avoid might_fault() when user access is enabled powerpc/kuap: Restore AMR after replaying soft interrupts regulator: qcom-rpmh: fix pm8009 ldo7 clk: aspeed: Fix APLL calculate formula from ast2600-A2 selftests/ftrace: Update synthetic event syntax errors perf symbols: Use (long) for iterator for bfd symbols regulator: bd718x7, bd71828, Fix dvs voltage levels spi: dw: Avoid stack content exposure spi: Skip zero-length transfers in spi_transfer_one_message() printk: avoid prb_first_valid_seq() where possible perf symbols: Fix return value when loading PE DSO nfsd: register pernet ops last, unregister first svcrdma: Hold private mutex while invoking rdma_accept() ceph: fix flush_snap logic after putting caps RDMA/hns: Fixes missing error code of CMDQ RDMA/ucma: Fix use-after-free bug in ucma_create_uevent RDMA/rtrs-srv: Fix stack-out-of-bounds RDMA/rtrs: Only allow addition of path to an already established session RDMA/rtrs-srv: fix memory leak by missing kobject free RDMA/rtrs-srv-sysfs: fix missing put_device RDMA/rtrs-srv: Do not pass a valid pointer to PTR_ERR() Input: sur40 - fix an error code in sur40_probe() perf record: Fix continue profiling after draining the buffer perf intel-pt: Fix missing CYC processing in PSB perf intel-pt: Fix premature IPC perf intel-pt: Fix IPC with CYC threshold perf test: Fix unaligned access in sample parsing test Input: elo - fix an error code in elo_connect() sparc64: only select COMPAT_BINFMT_ELF if BINFMT_ELF is set sparc: fix led.c driver when PROC_FS is not enabled Input: zinitix - fix return type of zinitix_init_touch() ARM: 9065/1: OABI compat: fix build when EPOLL is not enabled misc: eeprom_93xx46: Fix module alias to enable module autoprobe phy: rockchip-emmc: emmc_phy_init() always return 0 phy: cadence-torrent: Fix error code in cdns_torrent_phy_probe() misc: eeprom_93xx46: Add module alias to avoid breaking support for non device tree users PCI: rcar: Always allocate MSI addresses in 32bit space soundwire: cadence: fix ACK/NAK handling pwm: rockchip: Enable APB clock during register access while probing pwm: rockchip: rockchip_pwm_probe(): Remove superfluous clk_unprepare() pwm: rockchip: Eliminate potential race condition when probing PCI: xilinx-cpm: Fix reference count leak on error path VMCI: Use set_page_dirty_lock() when unregistering guest memory PCI: Align checking of syscall user config accessors mei: hbm: call mei_set_devstate() on hbm stop response drm/msm: Fix MSM_INFO_GET_IOVA with carveout drm/msm/dsi: Correct io_start for MSM8994 (20nm PHY) drm/msm/mdp5: Fix wait-for-commit for cmd panels drm/msm: Fix race of GPU init vs timestamp power management. drm/msm: Fix races managing the OOB state for timestamp vs timestamps. drm/msm/dp: trigger unplug event in msm_dp_display_disable vfio/iommu_type1: Populate full dirty when detach non-pinned group vfio/iommu_type1: Fix some sanity checks in detach group vfio-pci/zdev: fix possible segmentation fault issue ext4: fix potential htree index checksum corruption phy: USB_LGM_PHY should depend on X86 coresight: etm4x: Skip accessing TRCPDCR in save/restore nvmem: core: Fix a resource leak on error in nvmem_add_cells_from_of() nvmem: core: skip child nodes not matching binding soundwire: bus: use sdw_update_no_pm when initializing a device soundwire: bus: use sdw_write_no_pm when setting the bus scale registers soundwire: export sdw_write/read_no_pm functions soundwire: bus: fix confusion on device used by pm_runtime misc: fastrpc: fix incorrect usage of dma_map_sgtable remoteproc/mediatek: acknowledge watchdog IRQ after handled regmap: sdw: use _no_pm functions in regmap_read/write ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it mailbox: sprd: correct definition of SPRD_OUTBOX_FIFO_FULL device-dax: Fix default return code of range_parse() PCI: pci-bridge-emul: Fix array overruns, improve safety PCI: cadence: Fix DMA range mapping early return error i40e: Fix flow for IPv6 next header (extension header) i40e: Add zero-initialization of AQ command structures i40e: Fix overwriting flow control settings during driver loading i40e: Fix addition of RX filters after enabling FW LLDP agent i40e: Fix VFs not created Take mmap lock in cacheflush syscall nios2: fixed broken sys_clone syscall i40e: Fix add TC filter for IPv6 octeontx2-af: Fix an off by one in rvu_dbg_qsize_write() pwm: iqs620a: Fix overflow and optimize calculations vfio/type1: Use follow_pte() ice: report correct max number of TCs ice: Account for port VLAN in VF max packet size calculation ice: Fix state bits on LLDP mode switch ice: update the number of available RSS queues net: stmmac: fix CBS idleslope and sendslope calculation net/mlx4_core: Add missed mlx4_free_cmd_mailbox() PCI: rockchip: Make 'ep-gpios' DT property optional vxlan: move debug check after netdev unregister wireguard: device: do not generate ICMP for non-IP packets wireguard: kconfig: use arm chacha even with no neon ocfs2: fix a use after free on error mm: memcontrol: fix NR_ANON_THPS accounting in charge moving mm: memcontrol: fix slub memory accounting mm/memory.c: fix potential pte_unmap_unlock pte error mm/hugetlb: fix potential double free in hugetlb_register_node() error path mm/hugetlb: suppress wrong warning info when alloc gigantic page mm/compaction: fix misbehaviors of fast_find_migrateblock() r8169: fix jumbo packet handling on RTL8168e NFSv4: Fixes for nfs4_bitmask_adjust() KVM: SVM: Intercept INVPCID when it's disabled to inject #UD KVM: x86/mmu: Expand collapsible SPTE zap for TDP MMU to ZONE_DEVICE and HugeTLB pages arm64: Add missing ISB after invalidating TLB in __primary_switch i2c: brcmstb: Fix brcmstd_send_i2c_cmd condition i2c: exynos5: Preserve high speed master code mm,thp,shmem: make khugepaged obey tmpfs mount flags mm: fix memory_failure() handling of dax-namespace metadata mm/rmap: fix potential pte_unmap on an not mapped pte proc: use kvzalloc for our kernel buffer csky: Fix a size determination in gpr_get() scsi: bnx2fc: Fix Kconfig warning & CNIC build errors scsi: sd: sd_zbc: Don't pass GFP_NOIO to kvcalloc block: reopen the device in blkdev_reread_part ide/falconide: Fix module unload scsi: sd: Fix Opal support blk-settings: align max_sectors on "logical_block_size" boundary soundwire: intel: fix possible crash when no device is detected ACPI: property: Fix fwnode string properties matching ACPI: configfs: add missing check after configfs_register_default_group() cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known HID: logitech-dj: add support for keyboard events in eQUAD step 4 Gaming HID: wacom: Ignore attempts to overwrite the touch_max value from HID Input: raydium_ts_i2c - do not send zero length Input: xpad - add support for PowerA Enhanced Wired Controller for Xbox Series X\|S Input: joydev - prevent potential read overflow in ioctl Input: i8042 - add ASUS Zenbook Flip to noselftest list media: mceusb: Fix potential out-of-bounds shift USB: serial: option: update interface mapping for ZTE P685M usb: musb: Fix runtime PM race in musb_queue_resume_work usb: dwc3: gadget: Fix setting of DEPCFG.bInterval_m1 usb: dwc3: gadget: Fix dep->interval for fullspeed interrupt USB: serial: ftdi_sio: fix FTX sub-integer prescaler USB: serial: pl2303: fix line-speed handling on newer chips USB: serial: mos7840: fix error code in mos7840_write() USB: serial: mos7720: fix error code in mos7720_write() phy: lantiq: rcu-usb2: wait after clock enable ALSA: fireface: fix to parse sync status register of latter protocol ALSA: hda: Add another CometLake-H PCI ID ALSA: hda/hdmi: Drop bogus check at closing a stream ALSA: hda/realtek: modify EAPD in the ALC886 ALSA: hda/realtek: Quirk for HP Spectre x360 14 amp setup MIPS: Ingenic: Disable HPTLB for D0 XBurst CPUs too MIPS: Support binutils configured with --enable-mips-fix-loongson3-llsc=yes MIPS: VDSO: Use CLANG_FLAGS instead of filtering out '--target=' Revert "MIPS: Octeon: Remove special handling of CONFIG_MIPS_ELF_APPENDED_DTB=y" Revert "bcache: Kill btree_io_wq" bcache: Give btree_io_wq correct semantics again bcache: Move journal work to new flush wq Revert "drm/amd/display: Update NV1x SR latency values" drm/amd/display: Add FPU wrappers to dcn21_validate_bandwidth() drm/amd/display: Remove Assert from dcn10_get_dig_frontend drm/amd/display: Add vupdate_no_lock interrupts for DCN2.1 drm/amdkfd: Fix recursive lock warnings drm/amdgpu: Set reference clock to 100Mhz on Renoir (v2) drm/nouveau/kms: handle mDP connectors drm/modes: Switch to 64bit maths to avoid integer overflow drm/sched: Cancel and flush all outstanding jobs before finish. drm/panel: kd35t133: allow using non-continuous dsi clock drm/rockchip: Require the YTR modifier for AFBC ASoC: siu: Fix build error by a wrong const prefix selinux: fix inconsistency between inode_getxattr and inode_listsecurity erofs: initialized fields can only be observed after bit is set tpm_tis: Fix check_locality for correct locality acquisition tpm_tis: Clean up locality release KEYS: trusted: Fix incorrect handling of tpm_get_random() KEYS: trusted: Fix migratable=1 failing KEYS: trusted: Reserve TPM for seal and unseal operations btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node btrfs: do not warn if we can't find the reloc root when looking up backref btrfs: add asserts for deleting backref cache nodes btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root btrfs: fix reloc root leak with 0 ref reloc roots on recovery btrfs: splice remaining dirty_bg's onto the transaction dirty bg list btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself btrfs: account for new extents being deleted in total_bytes_pinned btrfs: fix extent buffer leak on failure to copy root drm/i915/gt: Flush before changing register state drm/i915/gt: Correct surface base address for renderclear crypto: arm64/sha - add missing module aliases crypto: aesni - prevent misaligned buffers on the stack crypto: michael_mic - fix broken misalignment handling crypto: sun4i-ss - checking sg length is not sufficient crypto: sun4i-ss - IV register does not work on A10 and A13 crypto: sun4i-ss - handle BigEndian for cipher crypto: sun4i-ss - initialize need_fallback soc: samsung: exynos-asv: don't defer early on not-supported SoCs soc: samsung: exynos-asv: handle reading revision register error seccomp: Add missing return in non-void function arm64: ptrace: Fix seccomp of traced syscall -1 (NO_SYSCALL) misc: rtsx: init of rts522a add OCP power off when no card is present drivers/misc/vmw_vmci: restrict too big queue size in qp_host_alloc_queue pstore: Fix typo in compression option name dts64: mt7622: fix slow sd card access arm64: dts: agilex: fix phy interface bit shift for gmac1 and gmac2 staging/mt7621-dma: mtk-hsdma.c->hsdma-mt7621.c staging: gdm724x: Fix DMA from stack staging: rtl8188eu: Add Edimax EW-7811UN V2 to device table floppy: reintroduce O_NDELAY fix media: i2c: max9286: fix access to unallocated memory media: ir_toy: add another IR Droid device media: ipu3-cio2: Fix mbus_code processing in cio2_subdev_set_fmt() media: marvell-ccic: power up the device on mclk enable media: smipcie: fix interrupt handling and IR timeout x86/virt: Eat faults on VMXOFF in reboot flows x86/reboot: Force all cpus to exit VMX root if VMX is supported x86/fault: Fix AMD erratum #91 errata fixup for user code x86/entry: Fix instrumentation annotation powerpc/prom: Fix "ibm,arch-vec-5-platform-support" scan rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers rcu/nocb: Perform deferred wake up before last idle's need_resched() check kprobes: Fix to delay the kprobes jump optimization arm64: Extend workaround for erratum 1024718 to all versions of Cortex-A55 iommu/arm-smmu-qcom: Fix mask extraction for bootloader programmed SMRs arm64: kexec_file: fix memory leakage in create_dtb() when fdt_open_into() fails arm64: uprobe: Return EOPNOTSUPP for AARCH32 instruction probing arm64 module: set plt* section addresses to 0x0 arm64: spectre: Prevent lockdep splat on v4 mitigation enable path riscv: Disable KSAN_SANITIZE for vDSO watchdog: qcom: Remove incorrect usage of QCOM_WDT_ENABLE_IRQ watchdog: mei_wdt: request stop on unregister coresight: etm4x: Handle accesses to TRCSTALLCTLR mtd: spi-nor: sfdp: Fix last erase region marking mtd: spi-nor: sfdp: Fix wrong erase type bitmask for overlaid region mtd: spi-nor: core: Fix erase type discovery for overlaid region mtd: spi-nor: core: Add erase size check for erase command initialization mtd: spi-nor: hisi-sfc: Put child node np on error path fs/affs: release old buffer head on error path seq_file: document how per-entry resources are managed. x86: fix seq_file iteration for pat/memtype.c mm: memcontrol: fix swap undercounting in cgroup2 mm: memcontrol: fix get_active_memcg return value hugetlb: fix update_and_free_page contig page struct assumption hugetlb: fix copy_huge_page_from_user contig page struct assumption mm/vmscan: restore zone_reclaim_mode ABI mm, compaction: make fast_isolate_freepages() stay within zone KVM: nSVM: fix running nested guests when npt=0 nvmem: qcom-spmi-sdam: Fix uninitialized pdev pointer module: Ignore _GLOBAL_OFFSET_TABLE_ when warning for undefined symbols mmc: sdhci-esdhc-imx: fix kernel panic when remove module mmc: sdhci-pci-o2micro: Bug fix for SDR104 HW tuning failure powerpc/32: Preserve cr1 in exception prolog stack check to fix build error powerpc/kexec_file: fix FDT size estimation for kdump kernel powerpc/32s: Add missing call to kuep_lock on syscall entry spmi: spmi-pmic-arb: Fix hw_irq overflow mei: fix transfer over dma with extended header mei: me: emmitsburg workstation DID mei: me: add adler lake point S DID mei: me: add adler lake point LP DID gpio: pcf857x: Fix missing first interrupt mfd: gateworks-gsc: Fix interrupt type printk: fix deadlock when kernel panic exfat: fix shift-out-of-bounds in exfat_fill_super() zonefs: Fix file size of zones in full condition kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE thermal: cpufreq_cooling: freq_qos_update_request() returns < 0 on error cpufreq: qcom-hw: drop devm_xxx() calls from init/exit hooks cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument cpufreq: intel_pstate: Get per-CPU max freq via MSR_HWP_CAPABILITIES if available proc: don't allow async path resolution of /proc/thread-self components s390/vtime: fix inline assembly clobber list virtio/s390: implement virtio-ccw revision 2 correctly um: mm: check more comprehensively for stub changes um: defer killing userspace on page table update failures irqchip/loongson-pch-msi: Use bitmap_zalloc() to allocate bitmap f2fs: fix out-of-repair __setattr_copy() f2fs: enforce the immutable flag on open files f2fs: flush data when enabling checkpoint back sparc32: fix a user-triggerable oops in clear_user() spi: fsl: invert spisel_boot signal on MPC8309 spi: spi-synquacer: fix set_cs handling gfs2: fix glock confusion in function signal_our_withdraw gfs2: Don't skip dlm unlock if glock has an lvb gfs2: Lock imbalance on error path in gfs2_recover_one gfs2: Recursive gfs2_quota_hold in gfs2_iomap_end dm: fix deadlock when swapping to encrypted device dm table: fix iterate_devices based device capability checks dm table: fix DAX iterate_devices based device capability checks dm table: fix zoned iterate_devices based device capability checks dm writecache: fix performance degradation in ssd mode dm writecache: return the exact table values that were set dm writecache: fix writing beyond end of underlying device when shrinking dm era: Recover committed writeset after crash dm era: Update in-core bitset after committing the metadata dm era: Verify the data block size hasn't changed dm era: Fix bitset memory leaks dm era: Use correct value size in equality function of writeset tree dm era: Reinitialize bitset cache before digesting a new writeset dm era: only resize metadata in preresume drm/i915: Reject 446-480MHz HDMI clock on GLK kgdb: fix to kill breakpoints on initmem after boot ipv6: silence compilation warning for non-IPV6 builds net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending wireguard: selftests: test multiple parallel streams wireguard: queueing: get rid of per-peer ring buffers net: sched: fix police ext initialization net: qrtr: Fix memory leak in qrtr_tun_open net_sched: fix RTNL deadlock again caused by request_module() ARM: dts: aspeed: Add LCLK to lpc-snoop Linux 5.10.20 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I3fbcecd9413ce212dac68d5cc800c9457feba56a	2021-03-07 12:33:33 +01:00
Rik van Riel	a7fbcb3b56	mm,thp,shmem: make khugepaged obey tmpfs mount flags [ Upstream commit cd89fb06509903f942a0ffe97ffa63034671ed0c ] Currently if thp enabled=[madvise], mounting a tmpfs filesystem with huge=always and mmapping files from that tmpfs does not result in khugepaged collapsing those mappings, despite the mount flag indicating that it should. Fix that by breaking up the blocks of tests in hugepage_vma_check a little bit, and testing things in the correct order. Link: https://lkml.kernel.org/r/20201124194925.623931-4-riel@surriel.com Fixes: `c2231020ea` ("mm: thp: register mm for khugepaged when merging vma for shmem") Signed-off-by: Rik van Riel <riel@surriel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xu Yu <xuyu@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-03-04 11:38:20 +01:00
Will Deacon	2d5a172ce8	BACKPORT: FROMGIT: mm: Avoid modifying vmf.address in __collapse_huge_page_swapin() In preparation for const-ifying the anonymous struct field of 'struct vm_fault', rework __collapse_huge_page_swapin() to avoid continuously updating vmf.address and instead populate a new 'struct vm_fault' on the stack for each page being processed. Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org> Change-Id: Ibabb7b324a2a7debd43e154516c85aaae327b841 Bug: 171278850 (cherry picked from commit 2b635dd372f6c8f27644c662bb48d10376ce561a https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/faultaround) [vinmenon: changes for speculative page fault] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>	2021-02-11 12:15:32 +00:00
Laurent Dufour	32507b6ff2	FROMLIST: mm: cache some VMA fields in the vm_fault structure When handling speculative page fault, the vma->vm_flags and vma->vm_page_prot fields are read once the page table lock is released. So there is no more guarantee that these fields would not change in our back. They will be saved in the vm_fault structure before the VMA is checked for changes. This patch also set the fields in hugetlb_no_page() and __collapse_huge_page_swapin even if it is not need for the callee. Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Change-Id: I9821f02ea32ef220b57b8bfd817992bbf71bbb1d Link: https://lore.kernel.org/lkml/1523975611-15978-13-git-send-email-ldufour@linux.vnet.ibm.com/ Bug: 161210518 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>	2021-01-22 18:00:15 +00:00
Laurent Dufour	9cfe16897f	FROMLIST: mm: protect VMA modifications using VMA sequence count The VMA sequence count has been introduced to allow fast detection of VMA modification when running a page fault handler without holding the mmap_sem. This patch provides protection against the VMA modification done in : - madvise() - mpol_rebind_policy() - vma_replace_policy() - change_prot_numa() - mlock(), munlock() - mprotect() - mmap_region() - collapse_huge_page() - userfaultd registering services In addition, VMA fields which will be read during the speculative fault path needs to be written using WRITE_ONCE to prevent write to be split and intermediate values to be pushed to other CPUs. Change-Id: Ic36046b7254e538b6baf7144c50ae577ee7f2074 Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Link: https://lore.kernel.org/lkml/1523975611-15978-10-git-send-email-ldufour@linux.vnet.ibm.com/ Bug: 161210518 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>	2021-01-22 17:59:47 +00:00
Jann Horn	4d45e75a99	mm: remove the now-unnecessary mmget_still_valid() hack The preceding patches have ensured that core dumping properly takes the mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all its users. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-10-16 11:11:22 -07:00
Vijay Balakrishna	4aab2be098	mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged When memory is hotplug added or removed the min_free_kbytes should be recalculated based on what is expected by khugepaged. Currently after hotplug, min_free_kbytes will be set to a lower default and higher default set when THP enabled is lost. This change restores min_free_kbytes as expected for THP consumers. [vijayb@linux.microsoft.com: v5] Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com Fixes: `f000565adb` ("thp: set recommended min free kbytes") Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Allen Pais <apais@microsoft.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-10-11 10:31:11 -07:00
Hugh Dickins	033b5d7755	mm/khugepaged: fix filemap page_to_pgoff(page) != offset There have been elusive reports of filemap_fault() hitting its VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built with CONFIG_READ_ONLY_THP_FOR_FS=y. Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged without NUMA reuses the same huge page after collapse_file() failed (whereas NUMA targets its allocation to the respective node each time). And most of us were usually testing with CONFIG_NUMA=y kernels. collapse_file(old start) new_page = khugepaged_alloc_page(hpage) __SetPageLocked(new_page) new_page->index = start // hpage->index=old offset new_page->mapping = mapping xas_store(&xas, new_page) filemap_fault page = find_get_page(mapping, offset) // if offset falls inside hpage then // compound_head(page) == hpage lock_page_maybe_drop_mmap() __lock_page(page) // collapse fails xas_store(&xas, old page) new_page->mapping = NULL unlock_page(new_page) collapse_file(new start) new_page = khugepaged_alloc_page(hpage) __SetPageLocked(new_page) new_page->index = start // hpage->index=new offset new_page->mapping = mapping // mapping becomes valid again // since compound_head(page) == hpage // page_to_pgoff(page) got changed VM_BUG_ON_PAGE(page_to_pgoff(page) != offset) An initial patch replaced __SetPageLocked() by lock_page(), which did fix the race which Suren illustrates above. But testing showed that it's not good enough: if the racing task's __lock_page() gets delayed long after its find_get_page(), then it may follow collapse_file(new start)'s successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE. It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a check and retry (as is done for mapping), with similar relaxations in find_lock_entry() and pagecache_get_page(): but it's not obvious what else might get caught out; and khugepaged non-NUMA appears to be unique in exposing a page to page cache, then revoking, without going through a full cycle of freeing before reuse. Instead, non-NUMA khugepaged_prealloc_page() release the old page if anyone else has a reference to it (1% of cases when I tested). Although never reported on huge tmpfs, I believe its find_lock_entry() has been at similar risk; but huge tmpfs does not rely on khugepaged for its normal working nearly so much as READ_ONLY_THP_FOR_FS does. Reported-by: Denis Lisov <dennis.lissov@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569 Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org Reported-by: Qian Cai <cai@lca.pw> Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw Reported-and-analyzed-by: Suren Baghdasaryan <surenb@google.com> Fixes: `87c460a0bd` ("mm/khugepaged: collapse_shmem() without freezing new_page") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org # v4.9+ Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-10-10 15:52:54 -07:00
David Howells	e5a59d308f	mm/khugepaged.c: fix khugepaged's request size in collapse_file collapse_file() in khugepaged passes PAGE_SIZE as the number of pages to be read to page_cache_sync_readahead(). The intent was probably to read a single page. Fix it to use the number of pages to the end of the window instead. Fixes: `99cb0dbd47` ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Yang Shi <shy828301@gmail.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Eric Biggers <ebiggers@google.com> Link: https://lkml.kernel.org/r/20200903140844.14194-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-09-05 12:14:30 -07:00
Hugh Dickins	f3f99d63a8	khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter() syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in __khugepaged_enter(): yes, when one thread is about to dump core, has set core_state, and is waiting for others, another might do something calling __khugepaged_enter(), which now crashes because I lumped the core_state test (known as "mmget_still_valid") into khugepaged_test_exit(). I still think it's best to lump them together, so just in this exceptional case, check mm->mm_users directly instead of khugepaged_test_exit(). Fixes: `bbe98f9cad` ("khugepaged: khugepaged_test_exit() check mmget_still_valid()") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Eric Dumazet <edumazet@google.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008141503370.18085@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-21 09:52:53 -07:00
Joonsoo Kim	b518154e59	mm/vmscan: protect the workingset on anonymous LRU In current implementation, newly created or swap-in anonymous page is started on active list. Growing active list results in rebalancing active/inactive list so old pages on active list are demoted to inactive list. Hence, the page on active list isn't protected at all. Following is an example of this situation. Assume that 50 hot pages on active list. Numbers denote the number of pages on active/inactive list (active \| inactive). 1. 50 hot pages on active list 50(h) \| 0 2. workload: 50 newly created (used-once) pages 50(uo) \| 50(h) 3. workload: another 50 newly created (used-once) pages 50(uo) \| 50(uo), swap-out 50(h) This patch tries to fix this issue. Like as file LRU, newly created or swap-in anonymous pages will be inserted to the inactive list. They are promoted to active list if enough reference happens. This simple modification changes the above example as following. 1. 50 hot pages on active list 50(h) \| 0 2. workload: 50 newly created (used-once) pages 50(h) \| 50(uo) 3. workload: another 50 newly created (used-once) pages 50(h) \| 50(uo), swap-out 50(uo) As you can see, hot pages on active list would be protected. Note that, this implementation has a drawback that the page cannot be promoted and will be swapped-out if re-access interval is greater than the size of inactive list but less than the size of total(active+inactive). To solve this potential issue, following patch will apply workingset detection similar to the one that's already applied to file LRU. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-12 10:57:55 -07:00
Hugh Dickins	bbe98f9cad	khugepaged: khugepaged_test_exit() check mmget_still_valid() Move collapse_huge_page()'s mmget_still_valid() check into khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP only, and earned its mmget_still_valid() check because it inserts a huge pmd entry in place of the page table's pmd entry; whereas collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp() merely clears the page table's pmd entry. But core dumping without mmap lock must have been as open to mistaking a racily cleared pmd entry for a page table at physical page 0, as exit_mmap() was. And we certainly have no interest in mapping as a THP once dumping core. Fixes: `59ea6d06cf` ("coredump: fix race condition between collapse_huge_page() and core dumping") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-07 11:33:29 -07:00
Hugh Dickins	18e77600f7	khugepaged: retract_page_tables() remember to test exit Only once have I seen this scenario (and forgot even to notice what forced the eventual crash): a sequence of "BUG: Bad page map" alerts from vm_normal_page(), from zap_pte_range() servicing exit_mmap(); pmd:00000000, pte values corresponding to data in physical page 0. The pte mappings being zapped in this case were supposed to be from a huge page of ext4 text (but could as well have been shmem): my belief is that it was racing with collapse_file()'s retract_page_tables(), found pmd pointing to a page table, locked it, but pmd had become 0 by the time start_pte was decided. In most cases, that possibility is excluded by holding mmap lock; but exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged checks khugepaged_test_exit() after acquiring mmap lock: khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so, for example. But retract_page_tables() did not: fix that. The fix is for retract_page_tables() to check khugepaged_test_exit(), after acquiring mmap lock, before doing anything to the page table. Getting the mmap lock serializes with __mmput(), which briefly takes and drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on mm_users makes sure we don't touch the page table once exit_mmap() might reach it, since exit_mmap() will be proceeding without mmap lock, not expecting anyone to be racing with it. Fixes: `f3f0e1d215` ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-07 11:33:29 -07:00
Hugh Dickins	119a5fc161	khugepaged: collapse_pte_mapped_thp() protect the pmd lock When retract_page_tables() removes a page table to make way for a huge pmd, it holds huge page lock, i_mmap_lock_write, mmap_write_trylock and pmd lock; but when collapse_pte_mapped_thp() does the same (to handle the case when the original mmap_write_trylock had failed), only mmap_write_trylock and pmd lock are held. That's not enough. One machine has twice crashed under load, with "BUG: spinlock bad magic" and GPF on 6b6b6b6b6b6b6b6b. Examining the second crash, page_vma_mapped_walk_done()'s spin_unlock of pvmw->ptl (serving page_referenced() on a file THP, that had found a page table at *pmd) discovers that the page table page and its lock have already been freed by the time it comes to unlock. Follow the example of retract_page_tables(), but we only need one of huge page lock or i_mmap_lock_write to secure against this: because it's the narrower lock, and because it simplifies collapse_pte_mapped_thp() to know the hpage earlier, choose to rely on huge page lock here. Fixes: `27e1f82731` ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021213070.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-07 11:33:29 -07:00
Hugh Dickins	723a80dafe	khugepaged: collapse_pte_mapped_thp() flush the right range pmdp_collapse_flush() should be given the start address at which the huge page is mapped, haddr: it was given addr, which at that point has been used as a local variable, incremented to the end address of the extent. Found by source inspection while chasing a hugepage locking bug, which I then could not explain by this. At first I thought this was very bad; then saw that all of the page translations that were not flushed would actually still point to the right pages afterwards, so harmless; then realized that I know nothing of how different architectures and models cache intermediate paging structures, so maybe it matters after all - particularly since the page table concerned is immediately freed. Much easier to fix than to think about. Fixes: `27e1f82731` ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021204390.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-08-07 11:33:29 -07:00
Kirill A. Shutemov	594cced14a	khugepaged: fix null-pointer dereference due to race khugepaged has to drop mmap lock several times while collapsing a page. The situation can change while the lock is dropped and we need to re-validate that the VMA is still in place and the PMD is still subject for collapse. But we miss one corner case: while collapsing an anonymous pages the VMA could be replaced with file VMA. If the file VMA doesn't have any private pages we get NULL pointer dereference: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] anon_vma_lock_write include/linux/rmap.h:120 [inline] collapse_huge_page mm/khugepaged.c:1110 [inline] khugepaged_scan_pmd mm/khugepaged.c:1349 [inline] khugepaged_scan_mm_slot mm/khugepaged.c:2110 [inline] khugepaged_do_scan mm/khugepaged.c:2193 [inline] khugepaged+0x3bba/0x5a10 mm/khugepaged.c:2238 The fix is to make sure that the VMA is anonymous in hugepage_vma_revalidate(). The helper is only used for collapsing anonymous pages. Fixes: `99cb0dbd47` ("mm,thp: add read-only THP support for (non-shmem) FS") Reported-by: syzbot+ed318e8b790ca72c5ad0@syzkaller.appspotmail.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200722121439.44328-1-kirill.shutemov@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-07-24 12:42:41 -07:00
Michel Lespinasse	c1e8d7c6a7	mmap locking API: convert mmap_sem comments Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:14 -07:00
Michel Lespinasse	3e4e28c5a8	mmap locking API: convert mmap_sem API comments Convert comments that reference old mmap_sem APIs to reference corresponding new mmap locking APIs instead. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:14 -07:00
Michel Lespinasse	d8ed45c5dc	mmap locking API: use coccinelle to convert mmap_sem rwsem call sites This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock \| -down_write +mmap_write_lock \| -down_write_killable +mmap_write_lock_killable \| -down_write_trylock +mmap_write_trylock \| -up_write +mmap_write_unlock \| -downgrade_write +mmap_write_downgrade \| -down_read +mmap_read_lock \| -down_read_killable +mmap_read_lock_killable \| -down_read_trylock +mmap_read_trylock \| -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:14 -07:00
Johannes Weiner	6058eaec81	mm: fold and remove lru_cache_add_anon() and lru_cache_add_file() They're the same function, and for the purpose of all callers they are equivalent to lru_cache_add(). [akpm@linux-foundation.org: fix it for local_lock changes] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:48 -07:00
Johannes Weiner	d9eb1ea2bf	mm: memcontrol: delete unused lrucare handling Swapin faults were the last event to charge pages after they had already been put on the LRU list. Now that we charge directly on swapin, the lrucare portion of the charge code is unused. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:48 -07:00
Johannes Weiner	9d82c69438	mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API With the page->mapping requirement gone from memcg, we can charge anon and file-thp pages in one single step, right after they're allocated. This removes two out of three API calls - especially the tricky commit step that needed to happen at just the right time between when the page is "set up" and when it's "published" - somewhat vague and fluid concepts that varied by page type. All we need is a freshly allocated page and a memcg context to charge. v2: prevent double charges on pre-allocated hugepages in khugepaged [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL] Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:48 -07:00
Johannes Weiner	be5d0a74c6	mm: memcontrol: switch to native NR_ANON_MAPPED counter Memcg maintains a private MEMCG_RSS counter. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the same way we do for NR_FILE_MAPPED. With the previous patch removing MEMCG_CACHE and the private NR_SHMEM counter, this patch finally eliminates the need to have page->mapping set up at charge time. However, we need to have page->mem_cgroup set up by the time rmap runs and does the accounting, so switch the commit and the rmap callbacks around. v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo) Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:47 -07:00
Johannes Weiner	0d1c20722a	mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM counters Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM. The page is already locked in these places, so page->mem_cgroup is stable; we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure it's set up in time. Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private NR_SHMEM accounting sites. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:47 -07:00
Johannes Weiner	3fba69a56e	mm: memcontrol: drop @compound parameter from memcg charging API The memcg charging API carries a boolean @compound parameter that tells whether the page we're dealing with is a hugepage. mem_cgroup_commit_charge() has another boolean @lrucare that indicates whether the page needs LRU locking or not while charging. The majority of callsites know those parameters at compile time, which results in a lot of naked "false, false" argument lists. This makes for cryptic code and is a breeding ground for subtle mistakes. Thankfully, the huge page state can be inferred from the page itself and doesn't need to be passed along. This is safe because charging completes before the page is published and somebody may split it. Simplify the callsites by removing @compound, and let memcg infer the state by using hpage_nr_pages() unconditionally. That function does PageTransHuge() to identify huge pages, which also helpfully asserts that nobody passes in tail pages by accident. The following patches will introduce a new charging API, best not to carry over unnecessary weight. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:47 -07:00
Kirill A. Shutemov	71a2c112a0	khugepaged: introduce 'max_ptes_shared' tunable 'max_ptes_shared' specifies how many pages can be shared across multiple processes. Exceeding the number would block the collapse:: /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared A higher value may increase memory footprint for some workloads. By default, at least half of pages has to be not shared. [colin.king@canonical.com: fix several spelling mistakes] Link: http://lkml.kernel.org/r/20200420084241.65433-1-colin.king@canonical.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200416160026.16538-9-kirill.shutemov@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:46 -07:00
Kirill A. Shutemov	5503fbf2b0	khugepaged: allow to collapse PTE-mapped compound pages We can collapse PTE-mapped compound pages. We only need to avoid handling them more than once: lock/unlock page only once if it's present in the PMD range multiple times as it handled on compound level. The same goes for LRU isolation and putback. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200416160026.16538-7-kirill.shutemov@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:46 -07:00
Kirill A. Shutemov	9445689f3b	khugepaged: allow to collapse a page shared across fork The page can be included into collapse as long as it doesn't have extra pins (from GUP or otherwise). Logic to check the refcount is moved to a separate function. For pages in swap cache, add compound_nr(page) to the expected refcount, in order to handle the compound page case. This is in preparation for the following patch. VM_BUG_ON_PAGE() was removed from __collapse_huge_page_copy() as the invariant it checks is no longer valid: the source can be mapped multiple times now. [yang.shi@linux.alibaba.com: remove error message when checking external pins] Link: http://lkml.kernel.org/r/1589317383-9595-1-git-send-email-yang.shi@linux.alibaba.com [cai@lca.pw: fix set-but-not-used warning] Link: http://lkml.kernel.org/r/20200521145644.GA6367@ovpn-112-192.phx2.redhat.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200416160026.16538-6-kirill.shutemov@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:46 -07:00
Kirill A. Shutemov	ae2c5d8042	khugepaged: drain LRU add pagevec after swapin collapse_huge_page() tries to swap in pages that are part of the PMD range. Just swapped in page goes though LRU add cache. The cache gets extra reference on the page. The extra reference can lead to the collapse fail: the following __collapse_huge_page_isolate() would check refcount and abort collapse seeing unexpected refcount. The fix is to drain local LRU add cache in __collapse_huge_page_swapin() if we successfully swapped in any pages. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200416160026.16538-5-kirill.shutemov@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:46 -07:00
Kirill A. Shutemov	a980df33e9	khugepaged: drain all LRU caches before scanning pages Having a page in LRU add cache offsets page refcount and gives false-negative on PageLRU(). It reduces collapse success rate. Drain all LRU add caches before scanning. It happens relatively rare and should not disturb the system too much. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200416160026.16538-4-kirill.shutemov@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:46 -07:00
Kirill A. Shutemov	ffe945e633	khugepaged: do not stop collapse if less than half PTEs are referenced __collapse_huge_page_swapin() checks the number of referenced PTE to decide if the memory range is hot enough to justify swapin. We have few problems with the approach: - It is way too late: we can do the check much earlier and safe time. khugepaged_scan_pmd() already knows if we have any pages to swap in and number of referenced page. - It stops collapse altogether if there's not enough referenced pages, not only swappingin. Fix it by making the right check early. We also can avoid additional page table scanning if khugepaged_scan_pmd() haven't found any swap entries. Fixes: `0db501f7a3` ("mm, thp: convert from optimistic swapin collapsing to conservative") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200416160026.16538-3-kirill.shutemov@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-03 20:09:46 -07:00
Hugh Dickins	2f33a70602	mm,thp: stop leaking unreleased file pages When collapse_file() calls try_to_release_page(), it has already isolated the page: so if releasing buffers happens to fail (as it sometimes does), remember to putback_lru_page(): otherwise that page is left unreclaimable and unfreeable, and the file extent uncollapsible. Fixes: `99cb0dbd47` ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> [5.4+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005231837500.1766@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-05-28 11:35:40 -07:00
Peter Xu	e1e267c792	khugepaged: skip collapse if uffd-wp detected Don't collapse the huge PMD if there is any userfault write protected small PTEs. The problem is that the write protection is in small page granularity and there's no way to keep all these write protection information if the small pages are going to be merged into a huge PMD. The same thing needs to be considered for swap entries and migration entries. So do the check as well disregarding khugepaged_max_ptes_swap. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-12-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-04-07 10:43:39 -07:00
Huang Ying	9de4f22a60	mm: code cleanup for MADV_FREE Some comments for MADV_FREE is revised and added to help people understand the MADV_FREE code, especially the page flag, PG_swapbacked. This makes page_is_file_cache() isn't consistent with its comments. So the function is renamed to page_is_file_lru() to make them consistent again. All these are put in one patch as one logical change. Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-04-07 10:43:38 -07:00
Matthew Wilcox (Oracle)	396bcc5299	mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE Commit `e496cf3d78` ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE") notes that it should be reverted when the PowerPC problem was fixed. The commit fixing the PowerPC problem (`953c66c2b2`) did not revert the commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an oversight, so remove the Kconfig symbol and undo the work of commit `e496cf3d78`. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-04-07 10:43:38 -07:00
Anshuman Khandual	222100eed2	mm/vma: make is_vma_temporary_stack() available for general use Currently the declaration and definition for is_vma_temporary_stack() are scattered. Lets make is_vma_temporary_stack() helper available for general use and also drop the declaration from (include/linux/huge_mm.h) which is no longer required. While at this, rename this as vma_is_temporary_stack() in line with existing helpers. This should not cause any functional change. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-04-02 09:35:29 -07:00
Anshuman Khandual	b44437723c	mm/vma: move VM_NO_KHUGEPAGED into generic header Patch series "mm/vma: some more minor changes", v2. The motivation here is to consolidate VMA flags and helpers in generic memory header and reduce code duplication when ever applicable. If there are other possible similar instances which might be missing here, please do let me me know. I will be happy to incorporate them. This patch (of 3): Move VM_NO_KHUGEPAGED into generic header (include/linux/mm.h). This just makes sure that no VMA flag is scattered in individual function files any longer. While at this, fix an old comment which is no longer valid. This should not cause any functional change. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1582782965-3274-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-04-02 09:35:29 -07:00
Song Liu	75f360696c	mm/thp: flush file for !is_shmem PageDirty() case in collapse_file() For non-shmem file THPs, khugepaged only collapses read only .text mapping (VM_DENYWRITE). These pages should not be dirty except the case where the file hasn't been flushed since first write. Call filemap_flush() in collapse_file() to accelerate the write back in such cases. Link: http://lkml.kernel.org/r/20191106060930.2571389-3-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-12-01 12:59:09 -08:00
Song Liu	4655e5e5f3	mm,thp: recheck each page before collapsing file THP In collapse_file(), for !is_shmem case, current check cannot guarantee the locked page is up-to-date. Specifically, xas_unlock_irq() should not be called before lock_page() and get_page(); and it is necessary to recheck PageUptodate() after locking the page. With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text may contain corrupted data. This is because khugepaged mistakenly collapses some not up-to-date sub pages into a huge page, and assumes the huge page is up-to-date. This will NOT corrupt data in the disk, because the page is read-only and never written back. Fix this by properly checking PageUptodate() after locking the page. This check replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);". Also, move PageDirty() check after locking the page. Current khugepaged should not try to collapse dirty file THP, because it is limited to read-only .text. The only case we hit a dirty page here is when the page hasn't been written since write. Bail out and retry when this happens. syzbot reported bug on previous version of this patch. Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com Fixes: `99cb0dbd47` ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Song Liu <songliubraving@fb.com> Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-15 18:34:00 -08:00
Ville Syrjälä	ec649c9d45	mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y I got some khugepaged spew on a 32bit x86: BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged INFO: lockdep is turned off. CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206 Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203 07/08/2009 Call Trace: dump_stack+0x66/0x8e ___might_sleep.cold.96+0x95/0xa6 __might_sleep+0x2e/0x80 collapse_huge_page.isra.51+0x5ac/0x1360 khugepaged+0x9a9/0x20f0 kthread+0xf5/0x110 ret_from_fork+0x2e/0x38 Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic() vs. mmu_notifier_invalidate_range_start(). Let's do the naive approach and just reorder the two operations. Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com Fixes: `810e24e009` ("mm/mmu_notifiers: annotate with might_sleep()") Signed-off-by: Ville Syrjl <ville.syrjala@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-06 08:47:50 -08:00
Song Liu	27e1f82731	khugepaged: enable collapse pmd for pte-mapped THP khugepaged needs exclusive mmap_sem to access page table. When it fails to lock mmap_sem, the page will fault in as pte-mapped THP. As the page is already a THP, khugepaged will not handle this pmd again. This patch enables the khugepaged to retry collapse the page table. struct mm_slot (in khugepaged.c) is extended with an array, containing addresses of pte-mapped THPs. We use array here for simplicity. We can easily replace it with more advanced data structures when needed. In khugepaged_scan_mm_slot(), if the mm contains pte-mapped THP, we try to collapse the page table. Since collapse may happen at an later time, some pages may already fault in. collapse_pte_mapped_thp() is added to properly handle these pages. collapse_pte_mapped_thp() also double checks whether all ptes in this pmd are mapping to the same THP. This is necessary because some subpage of the THP may be replaced, for example by uprobe. In such cases, it is not possible to collapse the pmd. [kirill.shutemov@linux.intel.com: add comments for retract_page_tables()] Link: http://lkml.kernel.org/r/20190816145443.6ard3iilytc6jlgv@box Link: http://lkml.kernel.org/r/20190815164525.1848545-6-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-24 15:54:11 -07:00
Song Liu	09d91cda0e	mm,thp: avoid writes to file with THP in pagecache In previous patch, an application could put part of its text section in THP via madvise(). These THPs will be protected from writes when the application is still running (TXTBSY). However, after the application exits, the file is available for writes. This patch avoids writes to file THP by dropping page cache for the file when the file is open for write. A new counter nr_thps is added to struct address_space. In do_dentry_open(), if the file is open for write and nr_thps is non-zero, we drop page cache for the whole file. Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-24 15:54:11 -07:00
Song Liu	99cb0dbd47	mm,thp: add read-only THP support for (non-shmem) FS This patch is (hopefully) the first step to enable THP for non-shmem filesystems. This patch enables an application to put part of its text sections to THP via madvise, for example: madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE); We tried to reuse the logic for THP on tmpfs. Currently, write is not supported for non-shmem THP. khugepaged will only process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is execve(). This requirement limits non-shmem THP to text sections. The next patch will handle writes, which would only happen when the all the vmas with VM_DENYWRITE are unmapped. An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this feature. [songliubraving@fb.com: fix build without CONFIG_SHMEM] Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com [songliubraving@fb.com: fix double unlock in collapse_file()] Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-24 15:54:11 -07:00
Song Liu	579c571e2e	khugepaged: rename collapse_shmem() and khugepaged_scan_shmem() Next patch will add khugepaged support of non-shmem files. This patch renames these two functions to reflect the new functionality: collapse_shmem() => collapse_file() khugepaged_scan_shmem() => khugepaged_scan_file() Link: http://lkml.kernel.org/r/20190801184244.3169074-6-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-24 15:54:11 -07:00
Matthew Wilcox (Oracle)	4101196b19	mm: page cache: store only head pages in i_pages Transparent Huge Pages are currently stored in i_pages as pointers to consecutive subpages. This patch changes that to storing consecutive pointers to the head page in preparation for storing huge pages more efficiently in i_pages. Large parts of this are "inspired" by Kirill's patch https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/ Kirill and Huang Ying contributed several fixes. [willy@infradead.org: use compound_nr, squish uninit-var warning] Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Kirill Shutemov <kirill@shutemov.name> Reviewed-by: Song Liu <songliubraving@fb.com> Tested-by: Song Liu <songliubraving@fb.com> Tested-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Tested-by: Qian Cai <cai@lca.pw> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-09-24 15:54:08 -07:00
Matt Fleming	a55c7454a8	sched/topology: Improve load balancing on AMD EPYC systems SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init() for any sched domains with a NUMA distance greater than 2 hops (RECLAIM_DISTANCE). The idea being that it's expensive to balance across domains that far apart. However, as is rather unfortunately explained in: commit `32e45ff43e` ("mm: increase RECLAIM_DISTANCE to 30") the value for RECLAIM_DISTANCE is based on node distance tables from 2011-era hardware. Current AMD EPYC machines have the following NUMA node distances: node distances: node 0 1 2 3 4 5 6 7 0: 10 16 16 16 32 32 32 32 1: 16 10 16 16 32 32 32 32 2: 16 16 10 16 32 32 32 32 3: 16 16 16 10 32 32 32 32 4: 32 32 32 32 10 16 16 16 5: 32 32 32 32 16 10 16 16 6: 32 32 32 32 16 16 10 16 7: 32 32 32 32 16 16 16 10 where 2 hops is 32. The result is that the scheduler fails to load balance properly across NUMA nodes on different sockets -- 2 hops apart. For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4 (CPUs 32-39) like so, $ numactl -C 0-7,32-39 ./spinner 16 causes all threads to fork and remain on node 0 until the active balancer kicks in after a few seconds and forcibly moves some threads to node 4. Override node_reclaim_distance for AMD Zen. Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suravee.Suthikulpanit@amd.com Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas.Lendacky@amd.com Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>	2019-09-03 09:17:37 +02:00
Linus Torvalds	69bf4b6b54	Revert "mm: page cache: store only head pages in i_pages" This reverts commit `5fd4ca2d84`. Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in __delete_from_swap_cache() to trigger: page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363 anon flags: 0x17fffe00080034(uptodate\|lru\|active\|swapbacked) raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689 raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000 page dumped because: VM_BUG_ON_PAGE(entry != page) page->mem_cgroup:ffff978433ace000 ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:170! invalid opcode: 0000 [#1] SMP NOPTI CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019 RIP: 0010:__delete_from_swap_cache+0x20d/0x240 Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f RSP: 0018:ffffa982036e7980 EFLAGS: 00010046 RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900 RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535 R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120 R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000 FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0 Call Trace: delete_from_swap_cache+0x46/0xa0 try_to_free_swap+0xbc/0x110 swap_writepage+0x13/0x70 pageout.isra.0+0x13c/0x350 shrink_page_list+0xc14/0xdf0 shrink_inactive_list+0x1e5/0x3c0 shrink_node_memcg+0x202/0x760 shrink_node+0xe0/0x470 balance_pgdat+0x2d1/0x510 kswapd+0x220/0x420 kthread+0xfb/0x130 ret_from_fork+0x22/0x40 and it's not immediately obvious why it happens. It's too late in the rc cycle to do anything but revert for now. Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/ Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Suggested-by: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-07-05 19:55:18 -07:00
Andrea Arcangeli	59ea6d06cf	coredump: fix race condition between collapse_huge_page() and core dumping When fixing the race conditions between the coredump and the mmap_sem holders outside the context of the process, we focused on mmget_not_zero()/get_task_mm() callers in `04f5866e41` ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping"), but those aren't the only cases where the mmap_sem can be taken outside of the context of the process as Michal Hocko noticed while backporting that commit to older -stable kernels. If mmgrab() is called in the context of the process, but then the mm_count reference is transferred outside the context of the process, that can also be a problem if the mmap_sem has to be taken for writing through that mm_count reference. khugepaged registration calls mmgrab() in the context of the process, but the mmap_sem for writing is taken later in the context of the khugepaged kernel thread. collapse_huge_page() after taking the mmap_sem for writing doesn't modify any vma, so it's not obvious that it could cause a problem to the coredump, but it happens to modify the pmd in a way that breaks an invariant that pmd_trans_huge_lock() relies upon. collapse_huge_page() needs the mmap_sem for writing just to block concurrent page faults that call pmd_trans_huge_lock(). Specifically the invariant that "!pmd_trans_huge()" cannot become a "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs. The coredump will call __get_user_pages() without mmap_sem for reading, which eventually can invoke a lockless page fault which will need a functional pmd_trans_huge_lock(). So collapse_huge_page() needs to use mmget_still_valid() to check it's not running concurrently with the coredump... as long as the coredump can invoke page faults without holding the mmap_sem for reading. This has "Fixes: khugepaged" to facilitate backporting, but in my view it's more a bug in the coredump code that will eventually have to be rewritten to stop invoking page faults without the mmap_sem for reading. So the long term plan is still to drop all mmget_still_valid(). Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com Fixes: `ba76149f47` ("thp: khugepaged") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-06-13 17:34:56 -10:00
Jérôme Glisse	7269f99993	mm/mmu_notifier: use correct mmu_notifier events for each invalidation This updates each existing invalidation to use the correct mmu notifier event that represent what is happening to the CPU page table. See the patch which introduced the events to see the rational behind this. Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-05-14 09:47:49 -07:00

1 2 3

113 Commits