Commit Graph

624 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
3ca4271578 Reapply "Merge tag 'android14-6.1.75_r00' into android14-6.1"
This reverts commit 6bad1052c2, it is the
LTS merge that had to previously get reverted due to being merged too
early.

Cc: Todd Kjos <tkjos@google.com>
Change-Id: I31b7d660bd833cf022ac4870f6d01e723fda5182
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-04-02 19:49:12 +00:00
Zi Yan
176b8fe524 FROMLIST: mm/migrate: set swap entry values of THP tail pages properly.
The tail pages in a THP can have swap entry information stored in their
private field. When migrating to a new page, all tail pages of the new
page need to update ->private to avoid future data corruption.

This fix is stable-only, since after commit 07e09c483cbe ("mm/huge_memory:
work on folio->swap instead of page->private when splitting folio"),
subpages of a swapcached THP no longer requires the maintenance.

Adding THPs to the swapcache was introduced in commit
38d8b4e6bd ("mm, THP, swap: delay splitting THP during swap out"),
where each subpage of a THP added to the swapcache had its own swapcache
entry and required the ->private field to point to the correct swapcache
entry. Later, when THP migration functionality was implemented in commit
616b837153 ("mm: thp: enable thp migration in generic path"),
it initially did not handle the subpages of swapcached THPs, failing to
update their ->private fields or replace the subpage pointers in the
swapcache. Subsequently, commit e71769ae52 ("mm: enable thp migration
for shmem thp") addressed the swapcache update aspect. This patch fixes
the update of subpage ->private fields.

Bug: 324818390
Fixes: 616b837153 ("mm: thp: enable thp migration in generic path")
Link: https://lore.kernel.org/linux-mm/20240306155217.118467-1-zi.yan@sent.com/
Reported-and-tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Change-Id: Ia4603cd58b76dc6ff46a2c53a735942a87221419
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Closes: https://lore.kernel.org/linux-mm/1707814102-22682-1-git-send-email-quic_charante@quicinc.com/
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
2024-03-21 16:07:20 +00:00
Todd Kjos
6bad1052c2 Revert "Merge tag 'android14-6.1.75_r00' into android14-6.1"
This reverts commit 1dbafe61e3.

Reason for revert: Too early. Needs to wait until 2024-03-27

Change-Id: I769b944bd089aa2278659ec87f7ba4ac4e74ee4a
Signed-off-by: Todd Kjos <tkjos@google.com>
2024-03-07 21:18:27 +00:00
Greg Kroah-Hartman
e1b12db2de Merge 6.1.72 into android14-6.1-lts
Changes in 6.1.72
	keys, dns: Fix missing size check of V1 server-list header
	block: Don't invalidate pagecache for invalid falloc modes
	ALSA: hda/realtek: enable SND_PCI_QUIRK for hp pavilion 14-ec1xxx series
	ALSA: hda/realtek: fix mute/micmute LEDs for a HP ZBook
	ALSA: hda/realtek: Fix mute and mic-mute LEDs for HP ProBook 440 G6
	mptcp: prevent tcp diag from closing listener subflows
	Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"
	drm/mgag200: Fix gamma lut not initialized for G200ER, G200EV, G200SE
	cifs: cifs_chan_is_iface_active should be called with chan_lock held
	cifs: do not depend on release_iface for maintaining iface_list
	KVM: x86/pmu: fix masking logic for MSR_CORE_PERF_GLOBAL_CTRL
	wifi: iwlwifi: pcie: don't synchronize IRQs from IRQ
	drm/bridge: ti-sn65dsi86: Never store more than msg->size bytes in AUX xfer
	netfilter: use skb_ip_totlen and iph_totlen
	netfilter: nf_tables: set transport offset from mac header for netdev/egress
	nfc: llcp_core: Hold a ref to llcp_local->dev when holding a ref to llcp_local
	octeontx2-af: Fix marking couple of structure as __packed
	drm/i915/dp: Fix passing the correct DPCD_REV for drm_dp_set_phy_test_pattern
	ice: Fix link_down_on_close message
	ice: Shut down VSI with "link-down-on-close" enabled
	i40e: Fix filter input checks to prevent config with invalid values
	igc: Report VLAN EtherType matching back to user
	igc: Check VLAN TCI mask
	igc: Check VLAN EtherType mask
	ASoC: fsl_rpmsg: Fix error handler with pm_runtime_enable
	ASoC: mediatek: mt8186: fix AUD_PAD_TOP register and offset
	mlxbf_gige: fix receive packet race condition
	net: sched: em_text: fix possible memory leak in em_text_destroy()
	r8169: Fix PCI error on system resume
	can: raw: add support for SO_MARK
	net-timestamp: extend SOF_TIMESTAMPING_OPT_ID to HW timestamps
	net: annotate data-races around sk->sk_tsflags
	net: annotate data-races around sk->sk_bind_phc
	net: Implement missing getsockopt(SO_TIMESTAMPING_NEW)
	selftests: bonding: do not set port down when adding to bond
	ARM: sun9i: smp: Fix array-index-out-of-bounds read in sunxi_mc_smp_init
	sfc: fix a double-free bug in efx_probe_filters
	net: bcmgenet: Fix FCS generation for fragmented skbuffs
	netfilter: nft_immediate: drop chain reference counter on error
	net: Save and restore msg_namelen in sock_sendmsg
	i40e: fix use-after-free in i40e_aqc_add_filters()
	ASoC: meson: g12a-toacodec: Validate written enum values
	ASoC: meson: g12a-tohdmitx: Validate written enum values
	ASoC: meson: g12a-toacodec: Fix event generation
	ASoC: meson: g12a-tohdmitx: Fix event generation for S/PDIF mux
	i40e: Restore VF MSI-X state during PCI reset
	igc: Fix hicredit calculation
	net/qla3xxx: fix potential memleak in ql_alloc_buffer_queues
	net/smc: fix invalid link access in dumping SMC-R connections
	octeontx2-af: Always configure NIX TX link credits based on max frame size
	octeontx2-af: Re-enable MAC TX in otx2_stop processing
	asix: Add check for usbnet_get_endpoints
	net: ravb: Wait for operating mode to be applied
	bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()
	net: Implement missing SO_TIMESTAMPING_NEW cmsg support
	selftests: secretmem: floor the memory size to the multiple of page_size
	cpu/SMT: Create topology_smt_thread_allowed()
	cpu/SMT: Make SMT control more robust against enumeration failures
	srcu: Fix callbacks acceleration mishandling
	bpf, x64: Fix tailcall infinite loop
	bpf, x86: Simplify the parsing logic of structure parameters
	bpf, x86: save/restore regs with BPF_DW size
	net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
	udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
	splice, net: Add a splice_eof op to file-ops and socket-ops
	ipv4, ipv6: Use splice_eof() to flush
	udp: introduce udp->udp_flags
	udp: move udp->no_check6_tx to udp->udp_flags
	udp: move udp->no_check6_rx to udp->udp_flags
	udp: move udp->gro_enabled to udp->udp_flags
	udp: move udp->accept_udp_{l4|fraglist} to udp->udp_flags
	udp: lockless UDP_ENCAP_L2TPINUDP / UDP_GRO
	udp: annotate data-races around udp->encap_type
	wifi: iwlwifi: yoyo: swap cdb and jacket bits values
	arm64: dts: qcom: sdm845: align RPMh regulator nodes with bindings
	arm64: dts: qcom: sdm845: Fix PSCI power domain names
	fbdev: imsttfb: Release framebuffer and dealloc cmap on error path
	fbdev: imsttfb: fix double free in probe()
	bpf: decouple prune and jump points
	bpf: remove unnecessary prune and jump points
	bpf: Remove unused insn_cnt argument from visit_[func_call_]insn()
	bpf: clean up visit_insn()'s instruction processing
	bpf: Support new 32bit offset jmp instruction
	bpf: handle ldimm64 properly in check_cfg()
	bpf: fix precision backtracking instruction iteration
	blk-mq: make sure active queue usage is held for bio_integrity_prep()
	net/mlx5: Increase size of irq name buffer
	s390/mm: add missing arch_set_page_dat() call to vmem_crst_alloc()
	s390/cpumf: support user space events for counting
	f2fs: clean up i_compress_flag and i_compress_level usage
	f2fs: convert to use bitmap API
	f2fs: assign default compression level
	f2fs: set the default compress_level on ioctl
	selftests: mptcp: fix fastclose with csum failure
	selftests: mptcp: set FAILING_LINKS in run_tests
	media: camss: sm8250: Virtual channels for CSID
	media: qcom: camss: Fix set CSI2_RX_CFG1_VC_MODE when VC is greater than 3
	ext4: convert move_extent_per_page() to use folios
	khugepage: replace try_to_release_page() with filemap_release_folio()
	memory-failure: convert truncate_error_page() to use folio
	mm: merge folio_has_private()/filemap_release_folio() call pairs
	mm, netfs, fscache: stop read optimisation when folio removed from pagecache
	filemap: add a per-mapping stable writes flag
	block: update the stable_writes flag in bdev_add
	smb: client: fix missing mode bits for SMB symlinks
	net: dpaa2-eth: rearrange variable in dpaa2_eth_get_ethtool_stats
	dpaa2-eth: recycle the RX buffer only after all processing done
	ethtool: don't propagate EOPNOTSUPP from dumps
	bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
	firmware: arm_scmi: Fix frequency truncation by promoting multiplier type
	ALSA: hda/realtek: Add quirk for Lenovo Yoga Pro 7
	genirq/affinity: Remove the 'firstvec' parameter from irq_build_affinity_masks
	genirq/affinity: Pass affinity managed mask array to irq_build_affinity_masks
	genirq/affinity: Don't pass irq_affinity_desc array to irq_build_affinity_masks
	genirq/affinity: Rename irq_build_affinity_masks as group_cpus_evenly
	genirq/affinity: Move group_cpus_evenly() into lib/
	lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly
	mm/memory_hotplug: add missing mem_hotplug_lock
	mm/memory_hotplug: fix error handling in add_memory_resource()
	net: sched: call tcf_ct_params_free to free params in tcf_ct_init
	netfilter: flowtable: allow unidirectional rules
	netfilter: flowtable: cache info of last offload
	net/sched: act_ct: offload UDP NEW connections
	net/sched: act_ct: Fix promotion of offloaded unreplied tuple
	netfilter: flowtable: GC pushes back packets to classic path
	net/sched: act_ct: Take per-cb reference to tcf_ct_flow_table
	octeontx2-af: Fix pause frame configuration
	octeontx2-af: Support variable number of lmacs
	btrfs: fix qgroup_free_reserved_data int overflow
	btrfs: mark the len field in struct btrfs_ordered_sum as unsigned
	ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
	firewire: ohci: suppress unexpected system reboot in AMD Ryzen machines and ASM108x/VT630x PCIe cards
	x86/kprobes: fix incorrect return address calculation in kprobe_emulate_call_indirect
	i2c: core: Fix atomic xfer check for non-preempt config
	mm: fix unmap_mapping_range high bits shift bug
	drm/amdgpu: skip gpu_info fw loading on navi12
	drm/amd/display: add nv12 bounding box
	mmc: meson-mx-sdhc: Fix initialization frozen issue
	mmc: rpmb: fixes pause retune on all RPMB partitions.
	mmc: core: Cancel delayed work before releasing host
	mmc: sdhci-sprd: Fix eMMC init failure after hw reset
	genirq/affinity: Only build SMP-only helper functions on SMP kernels
	f2fs: compress: fix to assign compress_level for lz4 correctly
	net/sched: act_ct: additional checks for outdated flows
	net/sched: act_ct: Always fill offloading tuple iifidx
	bpf: Fix a verifier bug due to incorrect branch offset comparison with cpu=v4
	bpf: syzkaller found null ptr deref in unix_bpf proto add
	media: qcom: camss: Comment CSID dt_id field
	smb3: Replace smb2pdu 1-element arrays with flex-arrays
	Revert "interconnect: qcom: sm8250: Enable sync_state"
	Linux 6.1.72

Change-Id: Id00eb2ae1159d4d5fa0ef914e672c5669cbf5b0a
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-01-14 13:26:13 +00:00
Greg Kroah-Hartman
bb47960a9d Merge branch 'android14-6.1' into branch 'android14-6.1-lts'
This merges all of the latest changes in 'android14-6.1' into
'android14-6.1-lts' to get it to pass TH again due to new symbols being
added.  Included in here are the following commits:

* a41a4ee370 ANDROID: Update the ABI symbol list
* 0801d8a89d ANDROID: mm: export dump_tasks symbol.
* 7c91752f5d FROMLIST: scsi: ufs: Remove the ufshcd_hba_exit() call from ufshcd_async_scan()
* 28154afe74 FROMLIST: scsi: ufs: Simplify power management during async scan
* febcf1429f ANDROID: gki_defconfig: Set CONFIG_IDLE_INJECT and CONFIG_CPU_IDLE_THERMAL into y
* bc4d82ee40 ANDROID: KMI workaround for CONFIG_NETFILTER_FAMILY_BRIDGE
* 227b55a7a3 ANDROID: dma-buf: don't re-purpose kobject as work_struct
* c1b1201d39 BACKPORT: FROMLIST: dma-buf: Move sysfs work out of DMA-BUF export path
* 928b3b5dde UPSTREAM: netfilter: nf_tables: skip set commit for deleted/destroyed sets
* 031f804149 ANDROID: KVM: arm64: Avoid BUG-ing from the host abort path
* c5dc4b4b3d ANDROID: Update the ABI symbol list
* 5070b3b594 UPSTREAM: ipv4: igmp: fix refcnt uaf issue when receiving igmp query packet
* 02aa72665c UPSTREAM: nvmet-tcp: Fix a possible UAF in queue intialization setup
* d6554d1262 FROMGIT: usb: dwc3: gadget: Handle EP0 request dequeuing properly
* 29544d4157 ANDROID: ABI: Update symbol list for imx
* 02f444ba07 UPSTREAM: io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
* ec46fe0ac7 UPSTREAM: bpf: Fix prog_array_map_poke_run map poke update
* 98b0e4cf09 BACKPORT: xhci: track port suspend state correctly in unsuccessful resume cases
* ac90f08292 ANDROID: Update the ABI symbol list
* ef67750d99 ANDROID: sched: Export symbols for vendor modules
* 934a40576e UPSTREAM: usb: dwc3: core: add support for disabling High-speed park mode
* 8a597e7a2d ANDROID: KVM: arm64: Don't prepopulate MMIO regions for host stage-2
* ed9b660cd1 BACKPORT: FROMGIT fork: use __mt_dup() to duplicate maple tree in dup_mmap()
* 3743b40f65 FROMGIT: maple_tree: preserve the tree attributes when destroying maple tree
* 1bec2dd52e FROMGIT: maple_tree: update check_forking() and bench_forking()
* e57d333531 FROMGIT: maple_tree: skip other tests when BENCH is enabled
* c79ca61edc FROMGIT: maple_tree: update the documentation of maple tree
* 7befa7bbc9 FROMGIT: maple_tree: add test for mtree_dup()
* f73f881af4 FROMGIT: radix tree test suite: align kmem_cache_alloc_bulk() with kernel behavior.
* eb5048ea90 FROMGIT: maple_tree: introduce interfaces __mt_dup() and mtree_dup()
* dc9323545b FROMGIT: maple_tree: introduce {mtree,mas}_lock_nested()
* 4ddcdc519b FROMGIT: maple_tree: add mt_free_one() and mt_attr() helpers
* c52d48818b UPSTREAM: maple_tree: introduce __mas_set_range()
* 066d57de87 ANDROID: GKI: Enable symbols for v4l2 in async and fwnode
* e74417834e ANDROID: Update the ABI symbol list
* 15a93de464 ANDROID: KVM: arm64: Fix hyp event alignment
* 717d1f8f91 ANDROID: KVM: arm64: Fix host_smc print typo
* 8fc25d7862 FROMGIT: f2fs: do not return EFSCORRUPTED, but try to run online repair
* 99288e911a ANDROID: KVM: arm64: Document module_change_host_prot_range
* 4d99e41ce1 FROMGIT: PM / devfreq: Synchronize devfreq_monitor_[start/stop]
* 6c8f710857 FROMGIT: arch/mm/fault: fix major fault accounting when retrying under per-VMA lock
* 4a518d8633 UPSTREAM: mm: handle write faults to RO pages under the VMA lock
* c1da94fa44 UPSTREAM: mm: handle read faults under the VMA lock
* 6541fffd92 UPSTREAM: mm: handle COW faults under the VMA lock
* c7fa581a79 UPSTREAM: mm: handle shared faults under the VMA lock
* 95af8a80bb BACKPORT: mm: call wp_page_copy() under the VMA lock
* b43b26b4cd UPSTREAM: mm: make lock_folio_maybe_drop_mmap() VMA lock aware
* 9c4bc457ab UPSTREAM: mm/memory.c: fix mismerge
* 7d50253c27 ANDROID: Export functions to be used with dma_map_ops in modules
* 37e0a5b868 BACKPORT: FROMGIT: erofs: enable sub-page compressed block support
* f466d52164 FROMGIT: erofs: refine z_erofs_transform_plain() for sub-page block support
* a18efa4e4a FROMGIT: erofs: fix ztailpacking for subpage compressed blocks
* 0c6a18c75b BACKPORT: FROMGIT: erofs: fix up compacted indexes for block size < 4096
* d7bb85f1cb FROMGIT: erofs: record `pclustersize` in bytes instead of pages
* 9d259220ac FROMGIT: erofs: support I/O submission for sub-page compressed blocks
* 8a49ea9441 FROMGIT: erofs: fix lz4 inplace decompression
* bdc5d268ba FROMGIT: erofs: fix memory leak on short-lived bounced pages
* 0d329bbe5c BACKPORT: erofs: tidy up z_erofs_do_read_page()
* dc94c3cc6b UPSTREAM: erofs: move preparation logic into z_erofs_pcluster_begin()
* 7751567a71 BACKPORT: erofs: avoid obsolete {collector,collection} terms
* d0dbf74792 BACKPORT: erofs: simplify z_erofs_read_fragment()
* 4067dd9969 UPSTREAM: erofs: get rid of the remaining kmap_atomic()
* 365ca16da2 UPSTREAM: erofs: simplify z_erofs_transform_plain()
* 187d034575 BACKPORT: erofs: adapt managed inode operations into folios
* 3d93182661 UPSTREAM: erofs: avoid on-stack pagepool directly passed by arguments
* 5c1827383a UPSTREAM: erofs: allocate extra bvec pages directly instead of retrying
* bed20ed1d3 UPSTREAM: erofs: clean up z_erofs_pcluster_readmore()
* 5e861fa97e UPSTREAM: erofs: remove the member readahead from struct z_erofs_decompress_frontend
* 66595bb17c UPSTREAM: erofs: fold in z_erofs_decompress()
* 88a1939504 UPSTREAM: erofs: enable large folios for iomap mode
* 2c085909e7 ANDROID: Update the ABI symbol list
* d16a15fde5 UPSTREAM: USB: gadget: core: adjust uevent timing on gadget unbind
* d3006fb944 ANDROID: ABI: Update oplus symbol list
* bc97d5019a ANDROID: vendor_hooks: Add hooks for rt_mutex steal
* 401a2769d9 UPSTREAM: dm verity: don't perform FEC for failed readahead IO
* 30bca9e278 UPSTREAM: netfilter: nft_set_pipapo: skip inactive elements during set walk
* 44702d8fa1 FROMLIST: mm: migrate high-order folios in swap cache correctly
* 613d8368e3 ANDROID: fuse-bpf: Follow mounts in lookups

Change-Id: I49d28ad030d7840490441ce6a7936b5e1047913e
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-01-11 08:06:52 +00:00
David Howells
bceff380f3 mm: merge folio_has_private()/filemap_release_folio() call pairs
[ Upstream commit 0201ebf274a306a6ebb95e5dc2d6a0a27c737cac ]

Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.

This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache.  The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.

The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.

The second patch is the actual fix.  Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set.  folio_needs_release() is altered to
add a check for this.

This patch (of 2):

Make filemap_release_folio() check folio_has_private().  Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.

There are a couple of sites in mm/vscan.c that this can't so easily be
done.  In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.

In shrink_active_list(), we don't have have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.

A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory.  This will acquire
additional parts to the condition in a future patch.

After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.

Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 1898efcdbed3 ("block: update the stable_writes flag in bdev_add")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-01-10 17:10:31 +01:00
Charan Teja Kalla
be72d197b2 mm: migrate high-order folios in swap cache correctly
commit fc346d0a70a13d52fe1c4bc49516d83a42cd7c4c upstream.

Large folios occupy N consecutive entries in the swap cache instead of
using multi-index entries like the page cache.  However, if a large folio
is re-added to the LRU list, it can be migrated.  The migration code was
not aware of the difference between the swap cache and the page cache and
assumed that a single xas_store() would be sufficient.

This leaves potentially many stale pointers to the now-migrated folio in
the swap cache, which can lead to almost arbitrary data corruption in the
future.  This can also manifest as infinite loops with the RCU read lock
held.

[willy@infradead.org: modifications to the changelog & tweaked the fix]
Fixes: 3417013e0d ("mm/migrate: Add folio_migrate_mapping()")
Link: https://lkml.kernel.org/r/20231214045841.961776-1-willy@infradead.org
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-05 15:18:39 +01:00
Charan Teja Kalla
44702d8fa1 FROMLIST: mm: migrate high-order folios in swap cache correctly
Large folios occupy N consecutive entries in the swap cache instead of
using multi-index entries like the page cache.  However, if a large folio
is re-added to the LRU list, it can be migrated.  The migration code was
not aware of the difference between the swap cache and the page cache and
assumed that a single xas_store() would be sufficient.

This leaves potentially many stale pointers to the now-migrated folio in
the swap cache, which can lead to almost arbitrary data corruption in the
future.  This can also manifest as infinite loops with the RCU read lock
held.

Bug: 315281107
Change-Id: I455f964a9f21c13089890073777388236b6669d7
[willy@infradead.org: modifications to the changelog & tweaked the fix]
Fixes: 3417013e0d ("mm/migrate: Add folio_migrate_mapping()")
Link: https://lkml.kernel.org/r/20231214045841.961776-1-willy@infradead.org
Link: https://lore.kernel.org/linux-mm/20231214045841.961776-1-willy@infradead.org/
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
2023-12-21 00:41:24 +00:00
Greg Kroah-Hartman
2cd386b08b Merge 6.1.61 into android14-6.1-lts
Changes in 6.1.61
	KVM: x86/pmu: Truncate counter value to allowed width on write
	mmc: core: Align to common busy polling behaviour for mmc ioctls
	mmc: block: ioctl: do write error check for spi
	mmc: core: Fix error propagation for some ioctl commands
	ASoC: codecs: wcd938x: Convert to platform remove callback returning void
	ASoC: codecs: wcd938x: Simplify with dev_err_probe
	ASoC: codecs: wcd938x: fix regulator leaks on probe errors
	ASoC: codecs: wcd938x: fix runtime PM imbalance on remove
	pinctrl: qcom: lpass-lpi: fix concurrent register updates
	mcb: Return actual parsed size when reading chameleon table
	mcb-lpc: Reallocate memory region to avoid memory overlapping
	virtio_balloon: Fix endless deflation and inflation on arm64
	virtio-mmio: fix memory leak of vm_dev
	virtio-crypto: handle config changed by work queue
	virtio_pci: fix the common cfg map size
	vsock/virtio: initialize the_virtio_vsock before using VQs
	vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE
	arm64: dts: rockchip: Add i2s0-2ch-bus-bclk-off pins to RK3399
	arm64: dts: rockchip: Fix i2s0 pin conflict on ROCK Pi 4 boards
	mm: fix vm_brk_flags() to not bail out while holding lock
	hugetlbfs: clear resv_map pointer if mmap fails
	mm/page_alloc: correct start page when guard page debug is enabled
	mm/migrate: fix do_pages_move for compat pointers
	hugetlbfs: extend hugetlb_vma_lock to private VMAs
	maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
	nfsd: lock_rename() needs both directories to live on the same fs
	drm/i915/pmu: Check if pmu is closed before stopping event
	drm/amd: Disable ASPM for VI w/ all Intel systems
	drm/dp_mst: Fix NULL deref in get_mst_branch_device_by_guid_helper()
	ARM: OMAP: timer32K: fix all kernel-doc warnings
	firmware/imx-dsp: Fix use_after_free in imx_dsp_setup_channels()
	clk: ti: Fix missing omap4 mcbsp functional clock and aliases
	clk: ti: Fix missing omap5 mcbsp functional clock and aliases
	r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx
	r8169: fix the KCSAN reported data-race in rtl_tx while reading TxDescArray[entry].opts1
	r8169: fix the KCSAN reported data race in rtl_rx while reading desc->opts1
	iavf: initialize waitqueues before starting watchdog_task
	i40e: Fix I40E_FLAG_VF_VLAN_PRUNING value
	treewide: Spelling fix in comment
	igb: Fix potential memory leak in igb_add_ethtool_nfc_entry
	neighbour: fix various data-races
	igc: Fix ambiguity in the ethtool advertising
	net: ethernet: adi: adin1110: Fix uninitialized variable
	net: ieee802154: adf7242: Fix some potential buffer overflow in adf7242_stats_show()
	net: usb: smsc95xx: Fix uninit-value access in smsc95xx_read_reg
	r8152: Increase USB control msg timeout to 5000ms as per spec
	r8152: Run the unload routine if we have errors during probe
	r8152: Cancel hw_phy_work if we have an error in probe
	r8152: Release firmware if we have an error in probe
	tcp: fix wrong RTO timeout when received SACK reneging
	gtp: uapi: fix GTPA_MAX
	gtp: fix fragmentation needed check with gso
	i40e: Fix wrong check for I40E_TXR_FLAGS_WB_ON_ITR
	drm/logicvc: Kconfig: select REGMAP and REGMAP_MMIO
	iavf: in iavf_down, disable queues when removing the driver
	scsi: sd: Introduce manage_shutdown device flag
	blk-throttle: check for overflow in calculate_bytes_allowed
	kasan: print the original fault addr when access invalid shadow
	io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
	iio: afe: rescale: Accept only offset channels
	iio: exynos-adc: request second interupt only when touchscreen mode is used
	iio: adc: xilinx-xadc: Don't clobber preset voltage/temperature thresholds
	iio: adc: xilinx-xadc: Correct temperature offset/scale for UltraScale
	i2c: muxes: i2c-mux-pinctrl: Use of_get_i2c_adapter_by_node()
	i2c: muxes: i2c-mux-gpmux: Use of_get_i2c_adapter_by_node()
	i2c: muxes: i2c-demux-pinctrl: Use of_get_i2c_adapter_by_node()
	i2c: stm32f7: Fix PEC handling in case of SMBUS transfers
	i2c: aspeed: Fix i2c bus hang in slave read
	tracing/kprobes: Fix the description of variable length arguments
	misc: fastrpc: Reset metadata buffer to avoid incorrect free
	misc: fastrpc: Free DMA handles for RPC calls with no arguments
	misc: fastrpc: Clean buffers on remote invocation failures
	misc: fastrpc: Unmap only if buffer is unmapped from DSP
	nvmem: imx: correct nregs for i.MX6ULL
	nvmem: imx: correct nregs for i.MX6SLL
	nvmem: imx: correct nregs for i.MX6UL
	x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
	x86/cpu: Add model number for Intel Arrow Lake mobile processor
	perf/core: Fix potential NULL deref
	sparc32: fix a braino in fault handling in csum_and_copy_..._user()
	clk: Sanitize possible_parent_show to Handle Return Value of of_clk_get_parent_name
	platform/x86: Add s2idle quirk for more Lenovo laptops
	ext4: add two helper functions extent_logical_end() and pa_logical_end()
	ext4: fix BUG in ext4_mb_new_inode_pa() due to overflow
	ext4: avoid overlapping preallocations due to overflow
	objtool/x86: add missing embedded_insn check
	Linux 6.1.61

Note, this merge point also reverts commit
bb20a245df which is commit
24eca2dce0f8d19db808c972b0281298d0bafe99 upstream, as it conflicts with
the previous reverts for ABI issues, AND is an ABI break in itself.  If
it is needed in the future, it can be brought back in an abi-safe way.

Change-Id: I425bfa3be6d65328e23affd52d10b827aea6e44a
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-11-07 08:21:27 +00:00
Gregory Price
c9b066f692 mm/migrate: fix do_pages_move for compat pointers
commit 229e2253766c7cdfe024f1fe280020cc4711087c upstream.

do_pages_move does not handle compat pointers for the page list.
correctly.  Add in_compat_syscall check and appropriate get_user fetch
when iterating the page list.

It makes the syscall in compat mode (32-bit userspace, 64-bit kernel)
work the same way as the native 32-bit syscall again, restoring the
behavior before my broken commit 5b1b561ba7 ("mm: simplify
compat_sys_move_pages").

More specifically, my patch moved the parsing of the 'pages' array from
the main entry point into do_pages_stat(), which left the syscall
working correctly for the 'stat' operation (nodes = NULL), while the
'move' operation (nodes != NULL) is now missing the conversion and
interprets 'pages' as an array of 64-bit pointers instead of the
intended 32-bit userspace pointers.

It is possible that nobody noticed this bug because the few
applications that actually call move_pages are unlikely to run in
compat mode because of their large memory requirements, but this
clearly fixes a user-visible regression and should have been caught by
ltp.

Link: https://lkml.kernel.org/r/20231003144857.752952-1-gregory.price@memverge.com
Fixes: 5b1b561ba7 ("mm: simplify compat_sys_move_pages")
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-02 09:35:24 +01:00
Peifeng Li
c3d26e2b5a ANDROID: vendor_hooks: Add hooks for lookaround
Add hooks for support lookaround in memory reclamation.

- android_vh_test_clear_look_around_ref
- android_vh_check_folio_look_around_ref
- android_vh_look_around_migrate_folio
- android_vh_look_around

Bug: 292051411

Signed-off-by: Peifeng Li <lipeifeng@oppo.com>
Change-Id: I9a606ae71d2f1303df3b02403b30bc8fdc9d06dd
(cherry picked from commit f50f24e781738c8e5aa9f285d8726202f33107d6)
[huzhanyuan: changed page to folio where appropriate]
2023-08-02 21:57:15 +00:00
Greg Kroah-Hartman
dafc2fae4d This is the 6.1.13 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmP2A8MACgkQONu9yGCS
 aT7GrhAAky2nTRG9J0oPxh5Eu7wNKmjqDWNj9c6it3iGHpb+tfOY+LfPXHmWz0kX
 NoaNYGZGD8SDbkmwrSOmFB1Q/0OZ4/aIwM7Kwcw72UJVvrlsKx1HwiJjXKk809ZL
 bVlLUQzFTwyVIYcvjXQ8CuBHwBinLc3qkcyYGgbS8bseR4pDuxwoToDwAxk1d/0j
 ozWuzUKhSdYHYIUrk3papUro2UpF+Kb7KFpNiVo2wMaZM7en2XK3khCt8TuojH6c
 DXL+KZ/HbB8Ig1PWLaw2/6o4ispNy6bz7CJx6oDiOILR+le8xZA5WTdkXT3ovjyr
 LxutmPTTw6PxextIyVRblJWzXNcjdlV552U4gnnngcWn6wg4D4otqYnYvTaAUc+u
 sQnwrlQFxB2KfFKLNepGAy7klQJsYP3eadjDgGXP9TSmuUvUYRaNr6h0XukbyYkc
 kx2+Tw51NMKEqhgnaiKDN8AZEDTuLu5F4+NrUertxlb3PWeRRMRYVGJ1uw0KJg6t
 d5eniCB00SaaqN6M68u/hRYRi3gnwIsU7DitEpqejqwzskMpgegMFvebmCwORiq3
 D+FD4EHOlztIToXhmEOXp0cz8fs27MuWmq4GkSwXvJuq+id5cQFdDN5GeLgNdAvH
 Kiu/Y+DY6ObW31tAQ1Jjp20L2RaWWvubrCBGeIqiDzUWmCohsks=
 =TXvc
 -----END PGP SIGNATURE-----

Merge 6.1.13 into android14-6.1

Changes in 6.1.13
	mptcp: sockopt: make 'tcp_fastopen_connect' generic
	mptcp: fix locking for setsockopt corner-case
	mptcp: deduplicate error paths on endpoint creation
	mptcp: fix locking for in-kernel listener creation
	btrfs: move the auto defrag code to defrag.c
	btrfs: lock the inode in shared mode before starting fiemap
	ASoC: amd: yc: Add DMI support for new acer/emdoor platforms
	ASoC: SOF: sof-audio: start with the right widget type
	ALSA: usb-audio: Add FIXED_RATE quirk for JBL Quantum610 Wireless
	ASoC: Intel: sof_rt5682: always set dpcm_capture for amplifiers
	ASoC: Intel: sof_cs42l42: always set dpcm_capture for amplifiers
	ASoC: Intel: sof_nau8825: always set dpcm_capture for amplifiers
	ASoC: Intel: sof_ssp_amp: always set dpcm_capture for amplifiers
	selftests/bpf: Verify copy_register_state() preserves parent/live fields
	ALSA: hda: Do not unset preset when cleaning up codec
	ASoC: amd: yc: Add Xiaomi Redmi Book Pro 15 2022 into DMI table
	bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself
	ASoC: cs42l56: fix DT probe
	tools/virtio: fix the vringh test for virtio ring changes
	vdpa: ifcvf: Do proper cleanup if IFCVF init fails
	net/rose: Fix to not accept on connected socket
	selftest: net: Improve IPV6_TCLASS/IPV6_HOPLIMIT tests apparmor compatibility
	net: stmmac: do not stop RX_CLK in Rx LPI state for qcs404 SoC
	powerpc/64: Fix perf profiling asynchronous interrupt handlers
	fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work()
	drm/nouveau/devinit/tu102-: wait for GFW_BOOT_PROGRESS == COMPLETED
	net: ethernet: mtk_eth_soc: Avoid truncating allocation
	net: sched: sch: Bounds check priority
	s390/decompressor: specify __decompress() buf len to avoid overflow
	nvme-fc: fix a missing queue put in nvmet_fc_ls_create_association
	nvme: clear the request_queue pointers on failure in nvme_alloc_admin_tag_set
	nvme: clear the request_queue pointers on failure in nvme_alloc_io_tag_set
	drm/amd/display: Add missing brackets in calculation
	drm/amd/display: Adjust downscaling limits for dcn314
	drm/amd/display: Unassign does_plane_fit_in_mall function from dcn3.2
	drm/amd/display: Reset DMUB mailbox SW state after HW reset
	drm/amdgpu: enable HDP SD for gfx 11.0.3
	drm/amdgpu: Enable vclk dclk node for gc11.0.3
	drm/amd/display: Properly handle additional cases where DCN is not supported
	platform/x86: touchscreen_dmi: Add Chuwi Vi8 (CWI501) DMI match
	ceph: move mount state enum to super.h
	ceph: blocklist the kclient when receiving corrupted snap trace
	selftests: mptcp: userspace: fix v4-v6 test in v6.1
	of: reserved_mem: Have kmemleak ignore dynamically allocated reserved mem
	kasan: fix Oops due to missing calls to kasan_arch_is_ready()
	mm: shrinkers: fix deadlock in shrinker debugfs
	aio: fix mremap after fork null-deref
	vmxnet3: move rss code block under eop descriptor
	fbdev: Fix invalid page access after closing deferred I/O devices
	drm: Disable dynamic debug as broken
	drm/amd/amdgpu: fix warning during suspend
	drm/amd/display: Fail atomic_check early on normalize_zpos error
	drm/vmwgfx: Stop accessing buffer objects which failed init
	drm/vmwgfx: Do not drop the reference to the handle too soon
	mmc: jz4740: Work around bug on JZ4760(B)
	mmc: meson-gx: fix SDIO mode if cap_sdio_irq isn't set
	mmc: sdio: fix possible resource leaks in some error paths
	mmc: mmc_spi: fix error handling in mmc_spi_probe()
	ALSA: hda: Fix codec device field initializan
	ALSA: hda/conexant: add a new hda codec SN6180
	ALSA: hda/realtek - fixed wrong gpio assigned
	ALSA: hda/realtek: fix mute/micmute LEDs don't work for a HP platform.
	ALSA: hda/realtek: Enable mute/micmute LEDs and speaker support for HP Laptops
	ata: ahci: Add Tiger Lake UP{3,4} AHCI controller
	ata: libata-core: Disable READ LOG DMA EXT for Samsung MZ7LH
	sched/psi: Fix use-after-free in ep_remove_wait_queue()
	hugetlb: check for undefined shift on 32 bit architectures
	nilfs2: fix underflow in second superblock position calculations
	mm/MADV_COLLAPSE: set EAGAIN on unexpected page refcount
	mm/filemap: fix page end in filemap_get_read_batch
	mm/migrate: fix wrongly apply write bit after mkdirty on sparc64
	gpio: sim: fix a memory leak
	freezer,umh: Fix call_usermode_helper_exec() vs SIGKILL
	coredump: Move dump_emit_page() to kill unused warning
	Revert "mm: Always release pages to the buddy allocator in memblock_free_late()."
	net: Fix unwanted sign extension in netdev_stats_to_stats64()
	revert "squashfs: harden sanity check in squashfs_read_xattr_id_table"
	drm/vc4: crtc: Increase setup cost in core clock calculation to handle extreme reduced blanking
	drm/vc4: Fix YUV plane handling when planes are in different buffers
	drm/i915/gen11: Wa_1408615072/Wa_1407596294 should be on GT list
	ice: fix lost multicast packets in promisc mode
	ixgbe: allow to increase MTU to 3K with XDP enabled
	i40e: add double of VLAN header when computing the max MTU
	net: bgmac: fix BCM5358 support by setting correct flags
	net: ethernet: ti: am65-cpsw: Add RX DMA Channel Teardown Quirk
	sctp: sctp_sock_filter(): avoid list_entry() on possibly empty list
	net/sched: tcindex: update imperfect hash filters respecting rcu
	ice: xsk: Fix cleaning of XDP_TX frames
	dccp/tcp: Avoid negative sk_forward_alloc by ipv6_pinfo.pktoptions.
	net/usb: kalmia: Don't pass act_len in usb_bulk_msg error path
	net/sched: act_ctinfo: use percpu stats
	net: openvswitch: fix possible memory leak in ovs_meter_cmd_set()
	net: stmmac: fix order of dwmac5 FlexPPS parametrization sequence
	bnxt_en: Fix mqprio and XDP ring checking logic
	tracing: Make trace_define_field_ext() static
	net: stmmac: Restrict warning on disabling DMA store and fwd mode
	net: use a bounce buffer for copying skb->mark
	tipc: fix kernel warning when sending SYN message
	net: mpls: fix stale pointer if allocation fails during device rename
	igb: conditionalize I2C bit banging on external thermal sensor support
	igb: Fix PPS input and output using 3rd and 4th SDP
	ixgbe: add double of VLAN header when computing the max MTU
	ipv6: Fix datagram socket connection with DSCP.
	ipv6: Fix tcp socket connection with DSCP.
	mm/gup: add folio to list when folio_isolate_lru() succeed
	mm: extend max struct page size for kmsan
	i40e: Add checking for null for nlmsg_find_attr()
	net/sched: tcindex: search key must be 16 bits
	nvme-tcp: stop auth work after tearing down queues in error recovery
	nvme-rdma: stop auth work after tearing down queues in error recovery
	KVM: x86/pmu: Disable vPMU support on hybrid CPUs (host PMUs)
	kvm: initialize all of the kvm_debugregs structure before sending it to userspace
	perf/x86: Refuse to export capabilities for hybrid PMUs
	alarmtimer: Prevent starvation by small intervals and SIG_IGN
	nvme-pci: refresh visible attrs for cmb attributes
	ASoC: SOF: Intel: hda-dai: fix possible stream_tag leak
	net: sched: sch: Fix off by one in htb_activate_prios()
	Linux 6.1.13

Change-Id: I8a1e4175939c14f726c545001061b95462566386
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-02-22 12:32:41 +00:00
Peter Xu
54806cb751 mm/migrate: fix wrongly apply write bit after mkdirty on sparc64
commit 96a9c287e25d690fd9623b5133703b8e310fbed1 upstream.

Nick Bowler reported another sparc64 breakage after the young/dirty
persistent work for page migration (per "Link:" below).  That's after a
similar report [2].

It turns out page migration was overlooked, and it wasn't failing before
because page migration was not enabled in the initial report test
environment.

David proposed another way [2] to fix this from sparc64 side, but that
patch didn't land somehow.  Neither did I check whether there's any other
arch that has similar issues.

Let's fix it for now as simple as moving the write bit handling to be
after dirty, like what we did before.

Note: this is based on mm-unstable, because the breakage was since 6.1 and
we're at a very late stage of 6.2 (-rc8), so I assume for this specific
case we should target this at 6.3.

[1] https://lore.kernel.org/all/20221021160603.GA23307@u164.east.ru/
[2] https://lore.kernel.org/all/20221212130213.136267-1-david@redhat.com/

Link: https://lkml.kernel.org/r/20230216153059.256739-1-peterx@redhat.com
Fixes: 2e3468778d ("mm: remember young/dirty bit for page migrations")
Link: https://lore.kernel.org/all/CADyTPExpEqaJiMGoV+Z6xVgL50ZoMJg49B10LcZ=8eg19u34BA@mail.gmail.com/
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Nick Bowler <nbowler@draconx.ca>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Nick Bowler <nbowler@draconx.ca>
Cc: <regressions@lists.linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-22 12:59:49 +01:00
Charan Teja Reddy
731835eae3 ANDROID: implement wrapper for reverse migration
Reverse migration is used to do the balancing the occupancy of memory
zones in a node in the system whose imabalance may be caused by
migration of pages to other zones by an operation, eg: hotremove and
then hotadding the same memory. In this case there is a lot of free
memory in newly hotadd memory which can be filled up by the previous
migrated pages(as part of offline/hotremove) thus may free up some
pressure in other zones of the node.

Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/

Bug: 201263307
Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
Signed-off-by: Sukadev Bhattiprolu <quic_sukadev@quicinc.com>
2023-01-03 18:56:33 +00:00
Baolin Wang
03e5f82ea6 mm: migrate: fix return value if all subpages of THPs are migrated successfully
During THP migration, if THPs are not migrated but they are split and all
subpages are migrated successfully, migrate_pages() will still return the
number of THP pages that were not migrated.  This will confuse the callers
of migrate_pages().  For example, the longterm pinning will failed though
all pages are migrated successfully.

Thus we should return 0 to indicate that all pages are migrated in this
case

Link: https://lkml.kernel.org/r/de386aa864be9158d2f3b344091419ea7c38b2f7.1666599848.git.baolin.wang@linux.alibaba.com
Fixes: b5bade978e ("mm: migrate: fix the return value of migrate_pages()")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-28 13:37:22 -07:00
Alistair Popple
16ce101db8 mm/memory.c: fix race when faulting a device private page
Patch series "Fix several device private page reference counting issues",
v2

This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages.  These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems.  However
without these fixes it is possible to crash the kernel from userspace. 
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting. 
In modules such as Nouveau it is also possible to trigger some of these
issues by explicitly closing the device file-descriptor prior to the task
exiting and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code. 
Unfortunately I lack hardware to test either of those so any help there
would be appreciated.  The changes mimic what is done in for both Nouveau
and hmm-tests though so I doubt they will cause problems.


This patch (of 8):

When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called.  However no
reference is taken on the faulting page.  Therefore a concurrent migration
of the device private page can free the page and possibly the underlying
pgmap.  This results in a race which can crash the kernel due to the
migrate_to_ram() function pointer becoming invalid.  It also means drivers
can't reliably read the zone_device_data field because the page may have
been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl to
ensure it has not been freed.  Unfortunately the elevated reference count
will cause the migration required to handle the fault to fail.  To avoid
this failure pass the faulting page into the migrate_vma functions so that
if an elevated reference count is found it can be checked to see if it's
expected or not.

[mpe@ellerman.id.au: fix build]
  Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-12 18:51:49 -07:00
Matthew Wilcox (Oracle)
29eea9b5a9 mm: convert page_get_anon_vma() to folio_get_anon_vma()
With all callers now passing in a folio, rename the function and convert
all callers.  Removes a couple of calls to compound_head() and a reference
to page->mapping.

Link: https://lkml.kernel.org/r/20220902194653.1739778-55-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:02:54 -07:00
Matthew Wilcox (Oracle)
c33db29231 migrate: convert unmap_and_move_huge_page() to use folios
Saves several calls to compound_head() and removes a couple of uses of
page->lru.

Link: https://lkml.kernel.org/r/20220902194653.1739778-52-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:02:54 -07:00
Matthew Wilcox (Oracle)
682a71a1b6 migrate: convert __unmap_and_move() to use folios
Removes a lot of calls to compound_head().  Also remove a VM_BUG_ON that
can never trigger as the PageAnon bit is the bottom bit of page->mapping.

Link: https://lkml.kernel.org/r/20220902194653.1739778-51-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:02:53 -07:00
Haiyue Wang
f7091ed64e mm: fix the handling Non-LRU pages returned by follow_page
The handling Non-LRU pages returned by follow_page() jumps directly, it
doesn't call put_page() to handle the reference count, since 'FOLL_GET'
flag for follow_page() has get_page() called.  Fix the zone device page
check by handling the page reference count correctly before returning.

And as David reviewed, "device pages are never PageKsm pages".  Drop this
zone device page check for break_ksm().

Since the zone device page can't be a transparent huge page, so drop the
redundant zone device page check for split_huge_pages_pid().  (by Miaohe)

Link: https://lkml.kernel.org/r/20220823135841.934465-3-haiyue.wang@intel.com
Fixes: 3218f8712d ("mm: handling Non-LRU pages returned by vm_normal_pages")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:28 -07:00
Aneesh Kumar K.V
467b171af8 mm/demotion: update node_is_toptier to work with memory tiers
With memory tier support we can have memory only NUMA nodes in the top
tier from which we want to avoid promotion tracking NUMA faults.  Update
node_is_toptier to work with memory tiers.  All NUMA nodes are by default
top tier nodes.  With lower(slower) memory tiers added we consider all
memory tiers above a memory tier having CPU NUMA nodes as a top memory
tier

[sj@kernel.org: include missed header file, memory-tiers.h]
  Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org
[akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h]
[aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers]
  Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:12 -07:00
Aneesh Kumar K.V
6c542ab757 mm/demotion: build demotion targets based on explicit memory tiers
This patch switch the demotion target building logic to use memory tiers
instead of NUMA distance.  All N_MEMORY NUMA nodes will be placed in the
default memory tier and additional memory tiers will be added by drivers
like dax kmem.

This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs.  The closest
node in the immediately following memory tier is used as a demotion
target.

Since we are now only building demotion target for N_MEMORY NUMA nodes the
CPU hotplug calls are removed in this patch.

Link: https://lkml.kernel.org/r/20220818131042.113280-6-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:12 -07:00
Aneesh Kumar K.V
9195244022 mm/demotion: move memory demotion related code
This moves memory demotion related code to mm/memory-tiers.c.  No
functional change in this patch.

Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:11 -07:00
Baolin Wang
7047b5a40b mm: migrate: do not retry 10 times for the subpages of fail-to-migrate THP
If THP is failed to migrate due to -ENOSYS or -ENOMEM case, the THP will
be split, and the subpages of fail-to-migrate THP will be tried to migrate
again, so we should not account the retry counter in the second loop,
since we already accounted 'nr_thp_failed' in the first loop.

Moreover we also do not need retry 10 times for -EAGAIN case for the
subpages of fail-to-migrate THP in the second loop, since we already
regarded the THP as migration failure, and save some migration time (for
the worst case, will try 512 * 10 times) according to previous discussion
[1].

[1] https://lore.kernel.org/linux-mm/87r13a7n04.fsf@yhuang6-desk2.ccr.corp.intel.com/

Link: https://lkml.kernel.org/r/20220817081408.513338-9-ying.huang@intel.com
Tested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:07 -07:00
Huang Ying
077309bc1e migrate_pages(): fix failure counting for retry
After 10 retries, we will give up and the remaining pages will be counted
as failure in nr_failed and nr_thp_failed.  We should count the failure in
nr_failed_pages too.  This is done in this patch.

Link: https://lkml.kernel.org/r/20220817081408.513338-8-ying.huang@intel.com
Fixes: 5984fabb6e ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:07 -07:00
Huang Ying
e6fa8a79fe migrate_pages(): fix failure counting for THP splitting
If THP is failed to be migrated, it may be split and retry.  But after
splitting, the head page will be left in "from" list, although THP
migration failure has been counted already.  If the head page is failed to
be migrated too, the failure will be counted twice incorrectly.  So this
is fixed in this patch via moving the head page of THP after splitting to
"thp_split_pages" too.

Link: https://lkml.kernel.org/r/20220817081408.513338-7-ying.huang@intel.com
Fixes: 5984fabb6e ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:07 -07:00
Huang Ying
577be05c89 migrate_pages(): fix failure counting for THP on -ENOSYS
If THP or hugetlbfs page migration isn't supported, unmap_and_move() or
unmap_and_move_huge_page() will return -ENOSYS.  For THP, splitting will
be tried, but if splitting doesn't succeed, the THP will be left in "from"
list wrongly.  If some other pages are retried, the THP migration failure
will counted again.  This is fixed via moving the failure THP from "from"
to "ret_pages".

Another issue of the original code is that the unsupported failure
processing isn't consistent between THP and hugetlbfs page.  Make them
consistent in this patch to make the code easier to be understood too.

Link: https://lkml.kernel.org/r/20220817081408.513338-6-ying.huang@intel.com
Fixes: 5984fabb6e ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:07 -07:00
Huang Ying
5fc30916b5 migrate_pages(): fix failure counting for THP subpages retrying
If THP is failed to be migrated for -ENOSYS and -ENOMEM, the THP will be
split into thp_split_pages, and after other pages are migrated, pages in
thp_split_pages will be migrated with no_subpage_counting == true, because
its failure have been counted already.  If some pages in thp_split_pages
are retried during migration, we should not count their failure if
no_subpage_counting == true too.  This is done this patch to fix the
failure counting for THP subpages retrying.

Link: https://lkml.kernel.org/r/20220817081408.513338-5-ying.huang@intel.com
Fixes: 5984fabb6e ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:06 -07:00
Huang Ying
fbed53b477 migrate_pages(): fix THP failure counting for -ENOMEM
In unmap_and_move(), if the new THP cannot be allocated, -ENOMEM will be
returned, and migrate_pages() will try to split the THP unless "reason" is
MR_NUMA_MISPLACED (that is, nosplit == true).  But when nosplit == true,
the THP migration failure will not be counted.

This is incorrect, so in this patch, the THP migration failure will be
counted for -ENOMEM regardless of nosplit is true or false.  The nr_failed
counting isn't fixed because it's not used.  Added some comments for it
per Baolin's suggestion.

Link: https://lkml.kernel.org/r/20220817081408.513338-4-ying.huang@intel.com
Fixes: 5984fabb6e ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:06 -07:00
Huang Ying
9c62ff005f migrate_pages(): remove unnecessary list_safe_reset_next()
Before commit b5bade978e ("mm: migrate: fix the return value of
migrate_pages()"), the tail pages of THP will be put in the "from"
list directly.  So one of the loop cursors (page2) needs to be reset,
as is done in try_split_thp() via list_safe_reset_next().  But after
the commit, the tail pages of THP will be put in a dedicated
list (thp_split_pages).  That is, the "from" list will not be changed
during splitting.  So, it's unnecessary to call list_safe_reset_next()
anymore.

This is a code cleanup, no functionality changes are expected.

Link: https://lkml.kernel.org/r/20220817081408.513338-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:06 -07:00
Huang Ying
a7504ed14f migrate: fix syscall move_pages() return value for failure
Patch series "migrate_pages(): fix several bugs in error path", v3.

During review the code of migrate_pages() and build a test program for
it.  Several bugs in error path are identified and fixed in this
series.

Most patches are tested via

- Apply error-inject.patch in Linux kernel
- Compile test-migrate.c (with -lnuma)
- Test with test-migrate.sh

error-inject.patch, test-migrate.c, and test-migrate.sh are as below.
It turns out that error injection is an important tool to fix bugs in
error path.


This patch (of 8):

The return value of move_pages() syscall is incorrect when counting
the remaining pages to be migrated.  For example, for the following
test program,

"
 #define _GNU_SOURCE

 #include <stdbool.h>
 #include <stdio.h>
 #include <string.h>
 #include <stdlib.h>
 #include <errno.h>

 #include <fcntl.h>
 #include <sys/uio.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <unistd.h>
 #include <numaif.h>
 #include <numa.h>

 #ifndef MADV_FREE
 #define MADV_FREE	8		/* free pages only if memory pressure */
 #endif

 #define ONE_MB		(1024 * 1024)
 #define MAP_SIZE	(16 * ONE_MB)
 #define THP_SIZE	(2 * ONE_MB)
 #define THP_MASK	(THP_SIZE - 1)

 #define ERR_EXIT_ON(cond, msg)					\
	 do {							\
		 int __cond_in_macro = (cond);			\
		 if (__cond_in_macro)				\
			 error_exit(__cond_in_macro, (msg));	\
	 } while (0)

 void error_msg(int ret, int nr, int *status, const char *msg)
 {
	 int i;

	 fprintf(stderr, "Error: %s, ret : %d, error: %s\n",
		 msg, ret, strerror(errno));

	 if (!nr)
		 return;
	 fprintf(stderr, "status: ");
	 for (i = 0; i < nr; i++)
		 fprintf(stderr, "%d ", status[i]);
	 fprintf(stderr, "\n");
 }

 void error_exit(int ret, const char *msg)
 {
	 error_msg(ret, 0, NULL, msg);
	 exit(1);
 }

 int page_size;

 bool do_vmsplice;
 bool do_thp;

 static int pipe_fds[2];
 void *addr;
 char *pn;
 char *pn1;
 void *pages[2];
 int status[2];

 void prepare()
 {
	 int ret;
	 struct iovec iov;

	 if (addr) {
		 munmap(addr, MAP_SIZE);
		 close(pipe_fds[0]);
		 close(pipe_fds[1]);
	 }

	 ret = pipe(pipe_fds);
	 ERR_EXIT_ON(ret, "pipe");

	 addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	 ERR_EXIT_ON(addr == MAP_FAILED, "mmap");
	 if (do_thp) {
		 ret = madvise(addr, MAP_SIZE, MADV_HUGEPAGE);
		 ERR_EXIT_ON(ret, "advise hugepage");
	 }

	 pn = (char *)(((unsigned long)addr + THP_SIZE) & ~THP_MASK);
	 pn1 = pn + THP_SIZE;
	 pages[0] = pn;
	 pages[1] = pn1;
	 *pn = 1;

	 if (do_vmsplice) {
		 iov.iov_base = pn;
		 iov.iov_len = page_size;
		 ret = vmsplice(pipe_fds[1], &iov, 1, 0);
		 ERR_EXIT_ON(ret < 0, "vmsplice");
	 }

	 status[0] = status[1] = 1024;
 }

 void test_migrate()
 {
	 int ret;
	 int nodes[2] = { 1, 1 };
	 pid_t pid = getpid();

	 prepare();
	 ret = move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE_ALL);
	 error_msg(ret, 1, status, "move 1 page");

	 prepare();
	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
	 error_msg(ret, 2, status, "move 2 pages, page 1 not mapped");

	 prepare();
	 *pn1 = 1;
	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
	 error_msg(ret, 2, status, "move 2 pages");

	 prepare();
	 *pn1 = 1;
	 nodes[1] = 0;
	 ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL);
	 error_msg(ret, 2, status, "move 2 pages, page 1 to node 0");
 }

 int main(int argc, char *argv[])
 {
	 numa_run_on_node(0);
	 page_size = getpagesize();

	 test_migrate();

	 fprintf(stderr, "\nMake page 0 cannot be migrated:\n");
	 do_vmsplice = true;
	 test_migrate();

	 fprintf(stderr, "\nTest THP:\n");
	 do_thp = true;
	 do_vmsplice = false;
	 test_migrate();

	 fprintf(stderr, "\nTHP: make page 0 cannot be migrated:\n");
	 do_vmsplice = true;
	 test_migrate();

	 return 0;
 }
"

The output of the current kernel is,

"
Error: move 1 page, ret : 0, error: Success
status: 1
Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
status: 1 -14
Error: move 2 pages, ret : 0, error: Success
status: 1 1
Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
status: 1 0

Make page 0 cannot be migrated:
Error: move 1 page, ret : 0, error: Success
status: 1024
Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
status: 1024 -14
Error: move 2 pages, ret : 0, error: Success
status: 1024 1024
Error: move 2 pages, page 1 to node 0, ret : 1, error: Success
status: 1024 1024
"

While the expected output is,

"
Error: move 1 page, ret : 0, error: Success
status: 1
Error: move 2 pages, page 1 not mapped, ret : 0, error: Success
status: 1 -14
Error: move 2 pages, ret : 0, error: Success
status: 1 1
Error: move 2 pages, page 1 to node 0, ret : 0, error: Success
status: 1 0

Make page 0 cannot be migrated:
Error: move 1 page, ret : 1, error: Success
status: 1024
Error: move 2 pages, page 1 not mapped, ret : 1, error: Success
status: 1024 -14
Error: move 2 pages, ret : 1, error: Success
status: 1024 1024
Error: move 2 pages, page 1 to node 0, ret : 2, error: Success
status: 1024 1024
"

Fix this via correcting the remaining pages counting.  With the fix,
the output for the test program as above is expected.

Link: https://lkml.kernel.org/r/20220817081408.513338-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220817081408.513338-2-ying.huang@intel.com
Fixes: 5984fabb6e ("mm: move_pages: report the number of non-attempted pages")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:06 -07:00
Peter Xu
2e3468778d mm: remember young/dirty bit for page migrations
When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable, and marking the page as old in the new page table
using either pte_mkold() or pmd_mkold(), and keeping the pte clean.

That's fine from functional-wise, but that's not friendly to page reclaim
because the moving page can be actively accessed within the procedure. 
Not to mention hardware setting the young bit can bring quite some
overhead on some systems, e.g.  x86_64 needs a few hundreds nanoseconds to
set the bit.  The same slowdown problem to dirty bits when the memory is
first written after page migration happened.

Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated.  To achieve it, define a new set
of bits in the migration swap offset field to cache the A/D bits for old
pte.  Then when removing/recovering the migration entry, we can recover
the A/D bits even if the page changed.

One thing to mention is that here we used max_swapfile_size() to detect
how many swp offset bits we have, and we'll only enable this feature if we
know the swp offset is big enough to store both the PFN value and the A/D
bits.  Otherwise the A/D bits are dropped like before.

Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:05 -07:00
Huang Ying
33024536ba memory tiering: hot page selection with hint page fault latency
Patch series "memory tiering: hot page selection", v4.

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory nodes need to be identified. 
Essentially, the original NUMA balancing implementation selects the mostly
recently accessed (MRU) pages to promote.  But this isn't a perfect
algorithm to identify the hot pages.  Because the pages with quite low
access frequency may be accessed eventually given the NUMA balancing page
table scanning period could be quite long (e.g.  60 seconds).  So in this
patchset, we implement a new hot page identification algorithm based on
the latency between NUMA balancing page table scanning and hint page
fault.  Which is a kind of mostly frequently accessed (MFU) algorithm.

In NUMA balancing memory tiering mode, if there are hot pages in slow
memory node and cold pages in fast memory node, we need to promote/demote
hot/cold pages between the fast and cold memory nodes.

A choice is to promote/demote as fast as possible.  But the CPU cycles and
memory bandwidth consumed by the high promoting/demoting throughput will
hurt the latency of some workload because of accessing inflating and slow
memory bandwidth contention.

A way to resolve this issue is to restrict the max promoting/demoting
throughput.  It will take longer to finish the promoting/demoting.  But
the workload latency will be better.  This is implemented in this patchset
as the page promotion rate limit mechanism.

The promotion hot threshold is workload and system configuration
dependent.  So in this patchset, a method to adjust the hot threshold
automatically is implemented.  The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit.

We used the pmbench memory accessing benchmark tested the patchset on a
2-socket server system with DRAM and PMEM installed.  The test results are
as follows,

		pmbench score		promote rate
		 (accesses/s)			MB/s
		-------------		------------
base		  146887704.1		       725.6
hot selection     165695601.2		       544.0
rate limit	  162814569.8		       165.2
auto adjustment	  170495294.0                  136.9

From the results above,

With hot page selection patch [1/3], the pmbench score increases about
12.8%, and promote rate (overhead) decreases about 25.0%, compared with
base kernel.

With rate limit patch [2/3], pmbench score decreases about 1.7%, and
promote rate decreases about 69.6%, compared with hot page selection
patch.

With threshold auto adjustment patch [3/3], pmbench score increases about
4.7%, and promote rate decrease about 17.1%, compared with rate limit
patch.

Baolin helped to test the patchset with MySQL on a machine which contains
1 DRAM node (30G) and 1 PMEM node (126G).

sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120

The tps can be improved about 5%.


This patch (of 3):

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory node need to be identified.  Essentially,
the original NUMA balancing implementation selects the mostly recently
accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
identify the hot pages.  Because the pages with quite low access frequency
may be accessed eventually given the NUMA balancing page table scanning
period could be quite long (e.g.  60 seconds).  The most frequently
accessed (MFU) algorithm is better.

So, in this patch we implemented a better hot page selection algorithm. 
Which is based on NUMA balancing page table scanning and hint page fault
as follows,

- When the page tables of the processes are scanned to change PTE/PMD
  to be PROT_NONE, the current time is recorded in struct page as scan
  time.

- When the page is accessed, hint page fault will occur.  The scan
  time is gotten from the struct page.  And The hint page fault
  latency is defined as

    hint page fault time - scan time

The shorter the hint page fault latency of a page is, the higher the
probability of their access frequency to be higher.  So the hint page
fault latency is a better estimation of the page hot/cold.

It's hard to find some extra space in struct page to hold the scan time. 
Fortunately, we can reuse some bits used by the original NUMA balancing.

NUMA balancing uses some bits in struct page to store the page accessing
CPU and PID (referring to page_cpupid_xchg_last()).  Which is used by the
multi-stage node selection algorithm to avoid to migrate pages shared
accessed by the NUMA nodes back and forth.  But for pages in the slow
memory node, even if they are shared accessed by multiple NUMA nodes, as
long as the pages are hot, they need to be promoted to the fast memory
node.  So the accessing CPU and PID information are unnecessary for the
slow memory pages.  We can reuse these bits in struct page to record the
scan time.  For the fast memory pages, these bits are used as before.

For the hot threshold, the default value is 1 second, which works well in
our performance test.  All pages with hint page fault latency < hot
threshold will be considered hot.

It's hard for users to determine the hot threshold.  So we don't provide a
kernel ABI to set it, just provide a debugfs interface for advanced users
to experiment.  We will continue to work on a hot threshold automatic
adjustment mechanism.

The downside of the above method is that the response time to the workload
hot spot changing may be much longer.  For example,

- A previous cold memory area becomes hot

- The hint page fault will be triggered.  But the hint page fault
  latency isn't shorter than the hot threshold.  So the pages will
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this, if there are enough free space in the fast memory node,
the hot threshold will not be used, all pages will be promoted upon the
hint page fault for fast response.

Thanks Zhong Jiang reported and tested the fix for a bug when disabling
memory tiering mode dynamically.

Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: osalvador <osalvador@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:25:54 -07:00
Haiyue Wang
8315682148 mm: migration: fix the FOLL_GET failure on following huge page
Not all huge page APIs support FOLL_GET option, so move_pages() syscall
will fail to get the page node information for some huge pages.

Like x86 on linux 5.19 with 1GB huge page API follow_huge_pud(), it will
return NULL page for FOLL_GET when calling move_pages() syscall with the
NULL 'nodes' parameter, the 'status' parameter has '-2' error in array.

Note: follow_huge_pud() now supports FOLL_GET in linux 6.0.
      Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev

But these huge page APIs don't support FOLL_GET:
  1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c
  2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c
     It will cause WARN_ON_ONCE for FOLL_GET.
  3. follow_huge_pgd() in mm/hugetlb.c

This is an temporary solution to mitigate the side effect of the race
condition fix by calling follow_page() with FOLL_GET set for huge pages.

After supporting follow huge page by FOLL_GET is done, this fix can be
reverted safely.

Link: https://lkml.kernel.org/r/20220823135841.934465-2-haiyue.wang@intel.com
Link: https://lkml.kernel.org/r/20220812084921.409142-1-haiyue.wang@intel.com
Fixes: 4cd614841c ("mm: migration: fix possible do_pages_stat_array racing with memory offline")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:25:52 -07:00
Linus Torvalds
6614a3c316 - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
 
 - Some kmemleak fixes from Patrick Wang and Waiman Long
 
 - DAMON updates from SeongJae Park
 
 - memcg debug/visibility work from Roman Gushchin
 
 - vmalloc speedup from Uladzislau Rezki
 
 - more folio conversion work from Matthew Wilcox
 
 - enhancements for coherent device memory mapping from Alex Sierra
 
 - addition of shared pages tracking and CoW support for fsdax, from
   Shiyang Ruan
 
 - hugetlb optimizations from Mike Kravetz
 
 - Mel Gorman has contributed some pagealloc changes to improve latency
   and realtime behaviour.
 
 - mprotect soft-dirty checking has been improved by Peter Xu
 
 - Many other singleton patches all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
 jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
 SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
 =w/UH
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Most of the MM queue. A few things are still pending.

  Liam's maple tree rework didn't make it. This has resulted in a few
  other minor patch series being held over for next time.

  Multi-gen LRU still isn't merged as we were waiting for mapletree to
  stabilize. The current plan is to merge MGLRU into -mm soon and to
  later reintroduce mapletree, with a view to hopefully getting both
  into 6.1-rc1.

  Summary:

   - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
     Lin, Yang Shi, Anshuman Khandual and Mike Rapoport

   - Some kmemleak fixes from Patrick Wang and Waiman Long

   - DAMON updates from SeongJae Park

   - memcg debug/visibility work from Roman Gushchin

   - vmalloc speedup from Uladzislau Rezki

   - more folio conversion work from Matthew Wilcox

   - enhancements for coherent device memory mapping from Alex Sierra

   - addition of shared pages tracking and CoW support for fsdax, from
     Shiyang Ruan

   - hugetlb optimizations from Mike Kravetz

   - Mel Gorman has contributed some pagealloc changes to improve
     latency and realtime behaviour.

   - mprotect soft-dirty checking has been improved by Peter Xu

   - Many other singleton patches all over the place"

 [ XFS merge from hell as per Darrick Wong in

   https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]

* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
  tools/testing/selftests/vm/hmm-tests.c: fix build
  mm: Kconfig: fix typo
  mm: memory-failure: convert to pr_fmt()
  mm: use is_zone_movable_page() helper
  hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
  hugetlbfs: cleanup some comments in inode.c
  hugetlbfs: remove unneeded header file
  hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
  hugetlbfs: use helper macro SZ_1{K,M}
  mm: cleanup is_highmem()
  mm/hmm: add a test for cross device private faults
  selftests: add soft-dirty into run_vmtests.sh
  selftests: soft-dirty: add test for mprotect
  mm/mprotect: fix soft-dirty check in can_change_pte_writable()
  mm: memcontrol: fix potential oom_lock recursion deadlock
  mm/gup.c: fix formatting in check_and_migrate_movable_page()
  xfs: fail dax mount if reflink is enabled on a partition
  mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
  userfaultfd: don't fail on unrecognized features
  hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
  ...
2022-08-05 16:32:45 -07:00
Matthew Wilcox (Oracle)
9d0ddc0cb5 fs: Remove aops->migratepage()
With all users converted to migrate_folio(), remove this operation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
b890ec2a2c hugetlb: Convert to migrate_folio
This involves converting migrate_huge_page_move_mapping().  We also need a
folio variant of hugetlb_set_page_subpool(), but that's for a later patch.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
2ec810d596 mm/migrate: Add filemap_migrate_folio()
There is nothing iomap-specific about iomap_migratepage(), and it fits
a pattern used by several other filesystems, so move it to mm/migrate.c,
convert it to be filemap_migrate_folio() and convert the iomap filesystems
to use it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
541846502f mm/migrate: Convert migrate_page() to migrate_folio()
Convert all callers to pass a folio.  Most have the folio
already available.  Switch all users from aops->migratepage to
aops->migrate_folio.  Also turn the documentation into kerneldoc.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
108ca83581 mm/migrate: Convert expected_page_refs() to folio_expected_refs()
Now that both callers have a folio, convert this function to
take a folio & rename it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
67235182a4 mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
Use a folio throughout __buffer_migrate_folio(), add kernel-doc for
buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their
declarations to buffer.h and switch all filesystems that have wired
them up.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
2be7fa10c0 mm/migrate: Convert writeout() to take a folio
Use a folio throughout this function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
8faa8ef5dd mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()
Use a folio throughout.  migrate_page() will be converted to
migrate_folio() later.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
5490da4f06 fs: Add aops->migrate_folio
Provide a folio-based replacement for aops->migratepage.  Update the
documentation to document migrate_folio instead of migratepage.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
68f2736a85 mm: Convert all PageMovable users to movable_operations
These drivers are rather uncomfortably hammered into the
address_space_operations hole.  They aren't filesystems and don't behave
like filesystems.  They just need their own movable_operations structure,
which we can point to directly from page->mapping.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Alex Sierra
3218f8712d mm: handling Non-LRU pages returned by vm_normal_pages
With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
device-managed anonymous pages that are not LRU pages.  Although they
behave like normal pages for purposes of mapping in CPU page, and for COW.
They do not support LRU lists, NUMA migration or THP.

Callers to follow_page() currently don't expect ZONE_DEVICE pages,
however, with DEVICE_COHERENT we might now return ZONE_DEVICE.  Check for
ZONE_DEVICE pages in applicable users of follow_page() as well.

Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>	[v2]
Reviewed-by: Alistair Popple <apopple@nvidia.com>	[v6]
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-17 17:14:28 -07:00
Miaohe Lin
ad1ac596e8 mm/migration: fix potential pte_unmap on an not mapped pte
__migration_entry_wait and migration_entry_wait_on_locked assume pte is
always mapped from caller.  But this is not the case when it's called from
migration_entry_wait_huge and follow_huge_pmd.  Add a hugetlbfs variant
that calls hugetlb_migration_entry_wait(ptep == NULL) to fix this issue.

Link: https://lkml.kernel.org/r/20220530113016.16663-5-linmiaohe@huawei.com
Fixes: 30dad30922 ("mm: migration: add migrate_entry_wait_huge()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03 18:08:37 -07:00
Miaohe Lin
7ce82f4c3f mm/migration: return errno when isolate_huge_page failed
We might fail to isolate huge page due to e.g.  the page is under
migration which cleared HPageMigratable.  We should return errno in this
case rather than always return 1 which could confuse the user, i.e.  the
caller might think all of the memory is migrated while the hugetlb page is
left behind.  We make the prototype of isolate_huge_page consistent with
isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page
to isolate_hugetlb as suggested by Muchun to improve the readability.

Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
Fixes: e8db67eb0d ("mm: migrate: move_pages() supports thp migration")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: Huang Ying <ying.huang@intel.com>
Reported-by: kernel test robot <lkp@intel.com> (build error)
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03 18:08:37 -07:00
Miaohe Lin
160088b3b6 mm/migration: remove unneeded lock page and PageMovable check
When non-lru movable page was freed from under us, __ClearPageMovable must
have been done.  So we can remove unneeded lock page and PageMovable check
here.  Also free_pages_prepare() will clear PG_isolated for us, so we can
further remove ClearPageIsolated as suggested by David.

Link: https://lkml.kernel.org/r/20220530113016.16663-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03 18:08:37 -07:00
Matthew Wilcox (Oracle)
b653db7735 mm: Clear page->private when splitting or migrating a page
In our efforts to remove uses of PG_private, we have found folios with
the private flag clear and folio->private not-NULL.  That is the root
cause behind 642d51fb07 ("ceph: check folio PG_private bit instead
of folio->private").  It can also affect a few other filesystems that
haven't yet reported a problem.

compaction_alloc() can return a page with uninitialised page->private,
and rather than checking all the callers of migrate_pages(), just zero
page->private after calling get_new_page().  Similarly, the tail pages
from split_huge_page() may also have an uninitialised page->private.

Reported-by: Xiubo Li <xiubli@redhat.com>
Tested-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-23 12:21:44 -04:00