android_kernel_samsung_sm8650

Author	SHA1	Message	Date
Greg Kroah-Hartman	3ca4271578	Reapply "Merge tag 'android14-6.1.75_r00' into android14-6.1" This reverts commit `6bad1052c2`, it is the LTS merge that had to previously get reverted due to being merged too early. Cc: Todd Kjos <tkjos@google.com> Change-Id: I31b7d660bd833cf022ac4870f6d01e723fda5182 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2024-04-02 19:49:12 +00:00
Zi Yan	176b8fe524	FROMLIST: mm/migrate: set swap entry values of THP tail pages properly. The tail pages in a THP can have swap entry information stored in their private field. When migrating to a new page, all tail pages of the new page need to update ->private to avoid future data corruption. This fix is stable-only, since after commit 07e09c483cbe ("mm/huge_memory: work on folio->swap instead of page->private when splitting folio"), subpages of a swapcached THP no longer requires the maintenance. Adding THPs to the swapcache was introduced in commit `38d8b4e6bd` ("mm, THP, swap: delay splitting THP during swap out"), where each subpage of a THP added to the swapcache had its own swapcache entry and required the ->private field to point to the correct swapcache entry. Later, when THP migration functionality was implemented in commit `616b837153` ("mm: thp: enable thp migration in generic path"), it initially did not handle the subpages of swapcached THPs, failing to update their ->private fields or replace the subpage pointers in the swapcache. Subsequently, commit `e71769ae52` ("mm: enable thp migration for shmem thp") addressed the swapcache update aspect. This patch fixes the update of subpage ->private fields. Bug: 324818390 Fixes: `616b837153` ("mm: thp: enable thp migration in generic path") Link: https://lore.kernel.org/linux-mm/20240306155217.118467-1-zi.yan@sent.com/ Reported-and-tested-by: Charan Teja Kalla <quic_charante@quicinc.com> Change-Id: Ia4603cd58b76dc6ff46a2c53a735942a87221419 Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/linux-mm/1707814102-22682-1-git-send-email-quic_charante@quicinc.com/ Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>	2024-03-21 16:07:20 +00:00
Todd Kjos	6bad1052c2	Revert "Merge tag 'android14-6.1.75_r00' into android14-6.1" This reverts commit `1dbafe61e3`. Reason for revert: Too early. Needs to wait until 2024-03-27 Change-Id: I769b944bd089aa2278659ec87f7ba4ac4e74ee4a Signed-off-by: Todd Kjos <tkjos@google.com>	2024-03-07 21:18:27 +00:00
Greg Kroah-Hartman	e1b12db2de	Merge 6.1.72 into android14-6.1-lts Changes in 6.1.72 keys, dns: Fix missing size check of V1 server-list header block: Don't invalidate pagecache for invalid falloc modes ALSA: hda/realtek: enable SND_PCI_QUIRK for hp pavilion 14-ec1xxx series ALSA: hda/realtek: fix mute/micmute LEDs for a HP ZBook ALSA: hda/realtek: Fix mute and mic-mute LEDs for HP ProBook 440 G6 mptcp: prevent tcp diag from closing listener subflows Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()" drm/mgag200: Fix gamma lut not initialized for G200ER, G200EV, G200SE cifs: cifs_chan_is_iface_active should be called with chan_lock held cifs: do not depend on release_iface for maintaining iface_list KVM: x86/pmu: fix masking logic for MSR_CORE_PERF_GLOBAL_CTRL wifi: iwlwifi: pcie: don't synchronize IRQs from IRQ drm/bridge: ti-sn65dsi86: Never store more than msg->size bytes in AUX xfer netfilter: use skb_ip_totlen and iph_totlen netfilter: nf_tables: set transport offset from mac header for netdev/egress nfc: llcp_core: Hold a ref to llcp_local->dev when holding a ref to llcp_local octeontx2-af: Fix marking couple of structure as __packed drm/i915/dp: Fix passing the correct DPCD_REV for drm_dp_set_phy_test_pattern ice: Fix link_down_on_close message ice: Shut down VSI with "link-down-on-close" enabled i40e: Fix filter input checks to prevent config with invalid values igc: Report VLAN EtherType matching back to user igc: Check VLAN TCI mask igc: Check VLAN EtherType mask ASoC: fsl_rpmsg: Fix error handler with pm_runtime_enable ASoC: mediatek: mt8186: fix AUD_PAD_TOP register and offset mlxbf_gige: fix receive packet race condition net: sched: em_text: fix possible memory leak in em_text_destroy() r8169: Fix PCI error on system resume can: raw: add support for SO_MARK net-timestamp: extend SOF_TIMESTAMPING_OPT_ID to HW timestamps net: annotate data-races around sk->sk_tsflags net: annotate data-races around sk->sk_bind_phc net: Implement missing getsockopt(SO_TIMESTAMPING_NEW) selftests: bonding: do not set port down when adding to bond ARM: sun9i: smp: Fix array-index-out-of-bounds read in sunxi_mc_smp_init sfc: fix a double-free bug in efx_probe_filters net: bcmgenet: Fix FCS generation for fragmented skbuffs netfilter: nft_immediate: drop chain reference counter on error net: Save and restore msg_namelen in sock_sendmsg i40e: fix use-after-free in i40e_aqc_add_filters() ASoC: meson: g12a-toacodec: Validate written enum values ASoC: meson: g12a-tohdmitx: Validate written enum values ASoC: meson: g12a-toacodec: Fix event generation ASoC: meson: g12a-tohdmitx: Fix event generation for S/PDIF mux i40e: Restore VF MSI-X state during PCI reset igc: Fix hicredit calculation net/qla3xxx: fix potential memleak in ql_alloc_buffer_queues net/smc: fix invalid link access in dumping SMC-R connections octeontx2-af: Always configure NIX TX link credits based on max frame size octeontx2-af: Re-enable MAC TX in otx2_stop processing asix: Add check for usbnet_get_endpoints net: ravb: Wait for operating mode to be applied bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters() net: Implement missing SO_TIMESTAMPING_NEW cmsg support selftests: secretmem: floor the memory size to the multiple of page_size cpu/SMT: Create topology_smt_thread_allowed() cpu/SMT: Make SMT control more robust against enumeration failures srcu: Fix callbacks acceleration mishandling bpf, x64: Fix tailcall infinite loop bpf, x86: Simplify the parsing logic of structure parameters bpf, x86: save/restore regs with BPF_DW size net: Declare MSG_SPLICE_PAGES internal sendmsg() flag udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES splice, net: Add a splice_eof op to file-ops and socket-ops ipv4, ipv6: Use splice_eof() to flush udp: introduce udp->udp_flags udp: move udp->no_check6_tx to udp->udp_flags udp: move udp->no_check6_rx to udp->udp_flags udp: move udp->gro_enabled to udp->udp_flags udp: move udp->accept_udp_{l4\|fraglist} to udp->udp_flags udp: lockless UDP_ENCAP_L2TPINUDP / UDP_GRO udp: annotate data-races around udp->encap_type wifi: iwlwifi: yoyo: swap cdb and jacket bits values arm64: dts: qcom: sdm845: align RPMh regulator nodes with bindings arm64: dts: qcom: sdm845: Fix PSCI power domain names fbdev: imsttfb: Release framebuffer and dealloc cmap on error path fbdev: imsttfb: fix double free in probe() bpf: decouple prune and jump points bpf: remove unnecessary prune and jump points bpf: Remove unused insn_cnt argument from visit_[func_call_]insn() bpf: clean up visit_insn()'s instruction processing bpf: Support new 32bit offset jmp instruction bpf: handle ldimm64 properly in check_cfg() bpf: fix precision backtracking instruction iteration blk-mq: make sure active queue usage is held for bio_integrity_prep() net/mlx5: Increase size of irq name buffer s390/mm: add missing arch_set_page_dat() call to vmem_crst_alloc() s390/cpumf: support user space events for counting f2fs: clean up i_compress_flag and i_compress_level usage f2fs: convert to use bitmap API f2fs: assign default compression level f2fs: set the default compress_level on ioctl selftests: mptcp: fix fastclose with csum failure selftests: mptcp: set FAILING_LINKS in run_tests media: camss: sm8250: Virtual channels for CSID media: qcom: camss: Fix set CSI2_RX_CFG1_VC_MODE when VC is greater than 3 ext4: convert move_extent_per_page() to use folios khugepage: replace try_to_release_page() with filemap_release_folio() memory-failure: convert truncate_error_page() to use folio mm: merge folio_has_private()/filemap_release_folio() call pairs mm, netfs, fscache: stop read optimisation when folio removed from pagecache filemap: add a per-mapping stable writes flag block: update the stable_writes flag in bdev_add smb: client: fix missing mode bits for SMB symlinks net: dpaa2-eth: rearrange variable in dpaa2_eth_get_ethtool_stats dpaa2-eth: recycle the RX buffer only after all processing done ethtool: don't propagate EOPNOTSUPP from dumps bpf, sockmap: af_unix stream sockets need to hold ref for pair sock firmware: arm_scmi: Fix frequency truncation by promoting multiplier type ALSA: hda/realtek: Add quirk for Lenovo Yoga Pro 7 genirq/affinity: Remove the 'firstvec' parameter from irq_build_affinity_masks genirq/affinity: Pass affinity managed mask array to irq_build_affinity_masks genirq/affinity: Don't pass irq_affinity_desc array to irq_build_affinity_masks genirq/affinity: Rename irq_build_affinity_masks as group_cpus_evenly genirq/affinity: Move group_cpus_evenly() into lib/ lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly mm/memory_hotplug: add missing mem_hotplug_lock mm/memory_hotplug: fix error handling in add_memory_resource() net: sched: call tcf_ct_params_free to free params in tcf_ct_init netfilter: flowtable: allow unidirectional rules netfilter: flowtable: cache info of last offload net/sched: act_ct: offload UDP NEW connections net/sched: act_ct: Fix promotion of offloaded unreplied tuple netfilter: flowtable: GC pushes back packets to classic path net/sched: act_ct: Take per-cb reference to tcf_ct_flow_table octeontx2-af: Fix pause frame configuration octeontx2-af: Support variable number of lmacs btrfs: fix qgroup_free_reserved_data int overflow btrfs: mark the len field in struct btrfs_ordered_sum as unsigned ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg() firewire: ohci: suppress unexpected system reboot in AMD Ryzen machines and ASM108x/VT630x PCIe cards x86/kprobes: fix incorrect return address calculation in kprobe_emulate_call_indirect i2c: core: Fix atomic xfer check for non-preempt config mm: fix unmap_mapping_range high bits shift bug drm/amdgpu: skip gpu_info fw loading on navi12 drm/amd/display: add nv12 bounding box mmc: meson-mx-sdhc: Fix initialization frozen issue mmc: rpmb: fixes pause retune on all RPMB partitions. mmc: core: Cancel delayed work before releasing host mmc: sdhci-sprd: Fix eMMC init failure after hw reset genirq/affinity: Only build SMP-only helper functions on SMP kernels f2fs: compress: fix to assign compress_level for lz4 correctly net/sched: act_ct: additional checks for outdated flows net/sched: act_ct: Always fill offloading tuple iifidx bpf: Fix a verifier bug due to incorrect branch offset comparison with cpu=v4 bpf: syzkaller found null ptr deref in unix_bpf proto add media: qcom: camss: Comment CSID dt_id field smb3: Replace smb2pdu 1-element arrays with flex-arrays Revert "interconnect: qcom: sm8250: Enable sync_state" Linux 6.1.72 Change-Id: Id00eb2ae1159d4d5fa0ef914e672c5669cbf5b0a Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2024-01-14 13:26:13 +00:00
Greg Kroah-Hartman	bb47960a9d	Merge branch 'android14-6.1' into branch 'android14-6.1-lts' This merges all of the latest changes in 'android14-6.1' into 'android14-6.1-lts' to get it to pass TH again due to new symbols being added. Included in here are the following commits: * `a41a4ee370` ANDROID: Update the ABI symbol list * `0801d8a89d` ANDROID: mm: export dump_tasks symbol. * `7c91752f5d` FROMLIST: scsi: ufs: Remove the ufshcd_hba_exit() call from ufshcd_async_scan() * `28154afe74` FROMLIST: scsi: ufs: Simplify power management during async scan * `febcf1429f` ANDROID: gki_defconfig: Set CONFIG_IDLE_INJECT and CONFIG_CPU_IDLE_THERMAL into y * `bc4d82ee40` ANDROID: KMI workaround for CONFIG_NETFILTER_FAMILY_BRIDGE * `227b55a7a3` ANDROID: dma-buf: don't re-purpose kobject as work_struct * `c1b1201d39` BACKPORT: FROMLIST: dma-buf: Move sysfs work out of DMA-BUF export path * `928b3b5dde` UPSTREAM: netfilter: nf_tables: skip set commit for deleted/destroyed sets * `031f804149` ANDROID: KVM: arm64: Avoid BUG-ing from the host abort path * `c5dc4b4b3d` ANDROID: Update the ABI symbol list * `5070b3b594` UPSTREAM: ipv4: igmp: fix refcnt uaf issue when receiving igmp query packet * `02aa72665c` UPSTREAM: nvmet-tcp: Fix a possible UAF in queue intialization setup * `d6554d1262` FROMGIT: usb: dwc3: gadget: Handle EP0 request dequeuing properly * `29544d4157` ANDROID: ABI: Update symbol list for imx * `02f444ba07` UPSTREAM: io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid * `ec46fe0ac7` UPSTREAM: bpf: Fix prog_array_map_poke_run map poke update * `98b0e4cf09` BACKPORT: xhci: track port suspend state correctly in unsuccessful resume cases * `ac90f08292` ANDROID: Update the ABI symbol list * `ef67750d99` ANDROID: sched: Export symbols for vendor modules * `934a40576e` UPSTREAM: usb: dwc3: core: add support for disabling High-speed park mode * `8a597e7a2d` ANDROID: KVM: arm64: Don't prepopulate MMIO regions for host stage-2 * `ed9b660cd1` BACKPORT: FROMGIT fork: use __mt_dup() to duplicate maple tree in dup_mmap() * `3743b40f65` FROMGIT: maple_tree: preserve the tree attributes when destroying maple tree * `1bec2dd52e` FROMGIT: maple_tree: update check_forking() and bench_forking() * `e57d333531` FROMGIT: maple_tree: skip other tests when BENCH is enabled * `c79ca61edc` FROMGIT: maple_tree: update the documentation of maple tree * `7befa7bbc9` FROMGIT: maple_tree: add test for mtree_dup() * `f73f881af4` FROMGIT: radix tree test suite: align kmem_cache_alloc_bulk() with kernel behavior. * `eb5048ea90` FROMGIT: maple_tree: introduce interfaces __mt_dup() and mtree_dup() * `dc9323545b` FROMGIT: maple_tree: introduce {mtree,mas}_lock_nested() * `4ddcdc519b` FROMGIT: maple_tree: add mt_free_one() and mt_attr() helpers * `c52d48818b` UPSTREAM: maple_tree: introduce __mas_set_range() * `066d57de87` ANDROID: GKI: Enable symbols for v4l2 in async and fwnode * `e74417834e` ANDROID: Update the ABI symbol list * `15a93de464` ANDROID: KVM: arm64: Fix hyp event alignment * `717d1f8f91` ANDROID: KVM: arm64: Fix host_smc print typo * `8fc25d7862` FROMGIT: f2fs: do not return EFSCORRUPTED, but try to run online repair * `99288e911a` ANDROID: KVM: arm64: Document module_change_host_prot_range * `4d99e41ce1` FROMGIT: PM / devfreq: Synchronize devfreq_monitor_[start/stop] * `6c8f710857` FROMGIT: arch/mm/fault: fix major fault accounting when retrying under per-VMA lock * `4a518d8633` UPSTREAM: mm: handle write faults to RO pages under the VMA lock * `c1da94fa44` UPSTREAM: mm: handle read faults under the VMA lock * `6541fffd92` UPSTREAM: mm: handle COW faults under the VMA lock * `c7fa581a79` UPSTREAM: mm: handle shared faults under the VMA lock * `95af8a80bb` BACKPORT: mm: call wp_page_copy() under the VMA lock * `b43b26b4cd` UPSTREAM: mm: make lock_folio_maybe_drop_mmap() VMA lock aware * `9c4bc457ab` UPSTREAM: mm/memory.c: fix mismerge * `7d50253c27` ANDROID: Export functions to be used with dma_map_ops in modules * `37e0a5b868` BACKPORT: FROMGIT: erofs: enable sub-page compressed block support * `f466d52164` FROMGIT: erofs: refine z_erofs_transform_plain() for sub-page block support * `a18efa4e4a` FROMGIT: erofs: fix ztailpacking for subpage compressed blocks * `0c6a18c75b` BACKPORT: FROMGIT: erofs: fix up compacted indexes for block size < 4096 * `d7bb85f1cb` FROMGIT: erofs: record `pclustersize` in bytes instead of pages * `9d259220ac` FROMGIT: erofs: support I/O submission for sub-page compressed blocks * `8a49ea9441` FROMGIT: erofs: fix lz4 inplace decompression * `bdc5d268ba` FROMGIT: erofs: fix memory leak on short-lived bounced pages * `0d329bbe5c` BACKPORT: erofs: tidy up z_erofs_do_read_page() * `dc94c3cc6b` UPSTREAM: erofs: move preparation logic into z_erofs_pcluster_begin() * `7751567a71` BACKPORT: erofs: avoid obsolete {collector,collection} terms * `d0dbf74792` BACKPORT: erofs: simplify z_erofs_read_fragment() * `4067dd9969` UPSTREAM: erofs: get rid of the remaining kmap_atomic() * `365ca16da2` UPSTREAM: erofs: simplify z_erofs_transform_plain() * `187d034575` BACKPORT: erofs: adapt managed inode operations into folios * `3d93182661` UPSTREAM: erofs: avoid on-stack pagepool directly passed by arguments * `5c1827383a` UPSTREAM: erofs: allocate extra bvec pages directly instead of retrying * `bed20ed1d3` UPSTREAM: erofs: clean up z_erofs_pcluster_readmore() * `5e861fa97e` UPSTREAM: erofs: remove the member readahead from struct z_erofs_decompress_frontend * `66595bb17c` UPSTREAM: erofs: fold in z_erofs_decompress() * `88a1939504` UPSTREAM: erofs: enable large folios for iomap mode * `2c085909e7` ANDROID: Update the ABI symbol list * `d16a15fde5` UPSTREAM: USB: gadget: core: adjust uevent timing on gadget unbind * `d3006fb944` ANDROID: ABI: Update oplus symbol list * `bc97d5019a` ANDROID: vendor_hooks: Add hooks for rt_mutex steal * `401a2769d9` UPSTREAM: dm verity: don't perform FEC for failed readahead IO * `30bca9e278` UPSTREAM: netfilter: nft_set_pipapo: skip inactive elements during set walk * `44702d8fa1` FROMLIST: mm: migrate high-order folios in swap cache correctly * `613d8368e3` ANDROID: fuse-bpf: Follow mounts in lookups Change-Id: I49d28ad030d7840490441ce6a7936b5e1047913e Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2024-01-11 08:06:52 +00:00
David Howells	bceff380f3	mm: merge folio_has_private()/filemap_release_folio() call pairs [ Upstream commit 0201ebf274a306a6ebb95e5dc2d6a0a27c737cac ] Patch series "mm, netfs, fscache: Stop read optimisation when folio removed from pagecache", v7. This fixes an optimisation in fscache whereby we don't read from the cache for a particular file until we know that there's data there that we don't have in the pagecache. The problem is that I'm no longer using PG_fscache (aka PG_private_2) to indicate that the page is cached and so I don't get a notification when a cached page is dropped from the pagecache. The first patch merges some folio_has_private() and filemap_release_folio() pairs and introduces a helper, folio_needs_release(), to indicate if a release is required. The second patch is the actual fix. Following Willy's suggestions[1], it adds an AS_RELEASE_ALWAYS flag to an address_space that will make filemap_release_folio() always call ->release_folio(), even if PG_private/PG_private_2 aren't set. folio_needs_release() is altered to add a check for this. This patch (of 2): Make filemap_release_folio() check folio_has_private(). Then, in most cases, where a call to folio_has_private() is immediately followed by a call to filemap_release_folio(), we can get rid of the test in the pair. There are a couple of sites in mm/vscan.c that this can't so easily be done. In shrink_folio_list(), there are actually three cases (something different is done for incompletely invalidated buffers), but filemap_release_folio() elides two of them. In shrink_active_list(), we don't have have the folio lock yet, so the check allows us to avoid locking the page unnecessarily. A wrapper function to check if a folio needs release is provided for those places that still need to do it in the mm/ directory. This will acquire additional parts to the condition in a future patch. After this, the only remaining caller of folio_has_private() outside of mm/ is a check in fuse. Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Xiubo Li <xiubli@redhat.com> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 1898efcdbed3 ("block: update the stable_writes flag in bdev_add") Signed-off-by: Sasha Levin <sashal@kernel.org>	2024-01-10 17:10:31 +01:00
Charan Teja Kalla	be72d197b2	mm: migrate high-order folios in swap cache correctly commit fc346d0a70a13d52fe1c4bc49516d83a42cd7c4c upstream. Large folios occupy N consecutive entries in the swap cache instead of using multi-index entries like the page cache. However, if a large folio is re-added to the LRU list, it can be migrated. The migration code was not aware of the difference between the swap cache and the page cache and assumed that a single xas_store() would be sufficient. This leaves potentially many stale pointers to the now-migrated folio in the swap cache, which can lead to almost arbitrary data corruption in the future. This can also manifest as infinite loops with the RCU read lock held. [willy@infradead.org: modifications to the changelog & tweaked the fix] Fixes: `3417013e0d` ("mm/migrate: Add folio_migrate_mapping()") Link: https://lkml.kernel.org/r/20231214045841.961776-1-willy@infradead.org Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-01-05 15:18:39 +01:00
Charan Teja Kalla	44702d8fa1	FROMLIST: mm: migrate high-order folios in swap cache correctly Large folios occupy N consecutive entries in the swap cache instead of using multi-index entries like the page cache. However, if a large folio is re-added to the LRU list, it can be migrated. The migration code was not aware of the difference between the swap cache and the page cache and assumed that a single xas_store() would be sufficient. This leaves potentially many stale pointers to the now-migrated folio in the swap cache, which can lead to almost arbitrary data corruption in the future. This can also manifest as infinite loops with the RCU read lock held. Bug: 315281107 Change-Id: I455f964a9f21c13089890073777388236b6669d7 [willy@infradead.org: modifications to the changelog & tweaked the fix] Fixes: `3417013e0d` ("mm/migrate: Add folio_migrate_mapping()") Link: https://lkml.kernel.org/r/20231214045841.961776-1-willy@infradead.org Link: https://lore.kernel.org/linux-mm/20231214045841.961776-1-willy@infradead.org/ Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>	2023-12-21 00:41:24 +00:00
Greg Kroah-Hartman	2cd386b08b	Merge 6.1.61 into android14-6.1-lts Changes in 6.1.61 KVM: x86/pmu: Truncate counter value to allowed width on write mmc: core: Align to common busy polling behaviour for mmc ioctls mmc: block: ioctl: do write error check for spi mmc: core: Fix error propagation for some ioctl commands ASoC: codecs: wcd938x: Convert to platform remove callback returning void ASoC: codecs: wcd938x: Simplify with dev_err_probe ASoC: codecs: wcd938x: fix regulator leaks on probe errors ASoC: codecs: wcd938x: fix runtime PM imbalance on remove pinctrl: qcom: lpass-lpi: fix concurrent register updates mcb: Return actual parsed size when reading chameleon table mcb-lpc: Reallocate memory region to avoid memory overlapping virtio_balloon: Fix endless deflation and inflation on arm64 virtio-mmio: fix memory leak of vm_dev virtio-crypto: handle config changed by work queue virtio_pci: fix the common cfg map size vsock/virtio: initialize the_virtio_vsock before using VQs vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE arm64: dts: rockchip: Add i2s0-2ch-bus-bclk-off pins to RK3399 arm64: dts: rockchip: Fix i2s0 pin conflict on ROCK Pi 4 boards mm: fix vm_brk_flags() to not bail out while holding lock hugetlbfs: clear resv_map pointer if mmap fails mm/page_alloc: correct start page when guard page debug is enabled mm/migrate: fix do_pages_move for compat pointers hugetlbfs: extend hugetlb_vma_lock to private VMAs maple_tree: add GFP_KERNEL to allocations in mas_expected_entries() nfsd: lock_rename() needs both directories to live on the same fs drm/i915/pmu: Check if pmu is closed before stopping event drm/amd: Disable ASPM for VI w/ all Intel systems drm/dp_mst: Fix NULL deref in get_mst_branch_device_by_guid_helper() ARM: OMAP: timer32K: fix all kernel-doc warnings firmware/imx-dsp: Fix use_after_free in imx_dsp_setup_channels() clk: ti: Fix missing omap4 mcbsp functional clock and aliases clk: ti: Fix missing omap5 mcbsp functional clock and aliases r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx r8169: fix the KCSAN reported data-race in rtl_tx while reading TxDescArray[entry].opts1 r8169: fix the KCSAN reported data race in rtl_rx while reading desc->opts1 iavf: initialize waitqueues before starting watchdog_task i40e: Fix I40E_FLAG_VF_VLAN_PRUNING value treewide: Spelling fix in comment igb: Fix potential memory leak in igb_add_ethtool_nfc_entry neighbour: fix various data-races igc: Fix ambiguity in the ethtool advertising net: ethernet: adi: adin1110: Fix uninitialized variable net: ieee802154: adf7242: Fix some potential buffer overflow in adf7242_stats_show() net: usb: smsc95xx: Fix uninit-value access in smsc95xx_read_reg r8152: Increase USB control msg timeout to 5000ms as per spec r8152: Run the unload routine if we have errors during probe r8152: Cancel hw_phy_work if we have an error in probe r8152: Release firmware if we have an error in probe tcp: fix wrong RTO timeout when received SACK reneging gtp: uapi: fix GTPA_MAX gtp: fix fragmentation needed check with gso i40e: Fix wrong check for I40E_TXR_FLAGS_WB_ON_ITR drm/logicvc: Kconfig: select REGMAP and REGMAP_MMIO iavf: in iavf_down, disable queues when removing the driver scsi: sd: Introduce manage_shutdown device flag blk-throttle: check for overflow in calculate_bytes_allowed kasan: print the original fault addr when access invalid shadow io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid iio: afe: rescale: Accept only offset channels iio: exynos-adc: request second interupt only when touchscreen mode is used iio: adc: xilinx-xadc: Don't clobber preset voltage/temperature thresholds iio: adc: xilinx-xadc: Correct temperature offset/scale for UltraScale i2c: muxes: i2c-mux-pinctrl: Use of_get_i2c_adapter_by_node() i2c: muxes: i2c-mux-gpmux: Use of_get_i2c_adapter_by_node() i2c: muxes: i2c-demux-pinctrl: Use of_get_i2c_adapter_by_node() i2c: stm32f7: Fix PEC handling in case of SMBUS transfers i2c: aspeed: Fix i2c bus hang in slave read tracing/kprobes: Fix the description of variable length arguments misc: fastrpc: Reset metadata buffer to avoid incorrect free misc: fastrpc: Free DMA handles for RPC calls with no arguments misc: fastrpc: Clean buffers on remote invocation failures misc: fastrpc: Unmap only if buffer is unmapped from DSP nvmem: imx: correct nregs for i.MX6ULL nvmem: imx: correct nregs for i.MX6SLL nvmem: imx: correct nregs for i.MX6UL x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility x86/cpu: Add model number for Intel Arrow Lake mobile processor perf/core: Fix potential NULL deref sparc32: fix a braino in fault handling in csum_and_copy_..._user() clk: Sanitize possible_parent_show to Handle Return Value of of_clk_get_parent_name platform/x86: Add s2idle quirk for more Lenovo laptops ext4: add two helper functions extent_logical_end() and pa_logical_end() ext4: fix BUG in ext4_mb_new_inode_pa() due to overflow ext4: avoid overlapping preallocations due to overflow objtool/x86: add missing embedded_insn check Linux 6.1.61 Note, this merge point also reverts commit `bb20a245df` which is commit 24eca2dce0f8d19db808c972b0281298d0bafe99 upstream, as it conflicts with the previous reverts for ABI issues, AND is an ABI break in itself. If it is needed in the future, it can be brought back in an abi-safe way. Change-Id: I425bfa3be6d65328e23affd52d10b827aea6e44a Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2023-11-07 08:21:27 +00:00
Gregory Price	c9b066f692	mm/migrate: fix do_pages_move for compat pointers commit 229e2253766c7cdfe024f1fe280020cc4711087c upstream. do_pages_move does not handle compat pointers for the page list. correctly. Add in_compat_syscall check and appropriate get_user fetch when iterating the page list. It makes the syscall in compat mode (32-bit userspace, 64-bit kernel) work the same way as the native 32-bit syscall again, restoring the behavior before my broken commit `5b1b561ba7` ("mm: simplify compat_sys_move_pages"). More specifically, my patch moved the parsing of the 'pages' array from the main entry point into do_pages_stat(), which left the syscall working correctly for the 'stat' operation (nodes = NULL), while the 'move' operation (nodes != NULL) is now missing the conversion and interprets 'pages' as an array of 64-bit pointers instead of the intended 32-bit userspace pointers. It is possible that nobody noticed this bug because the few applications that actually call move_pages are unlikely to run in compat mode because of their large memory requirements, but this clearly fixes a user-visible regression and should have been caught by ltp. Link: https://lkml.kernel.org/r/20231003144857.752952-1-gregory.price@memverge.com Fixes: `5b1b561ba7` ("mm: simplify compat_sys_move_pages") Signed-off-by: Gregory Price <gregory.price@memverge.com> Reported-by: Arnd Bergmann <arnd@arndb.de> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-11-02 09:35:24 +01:00
Peifeng Li	c3d26e2b5a	ANDROID: vendor_hooks: Add hooks for lookaround Add hooks for support lookaround in memory reclamation. - android_vh_test_clear_look_around_ref - android_vh_check_folio_look_around_ref - android_vh_look_around_migrate_folio - android_vh_look_around Bug: 292051411 Signed-off-by: Peifeng Li <lipeifeng@oppo.com> Change-Id: I9a606ae71d2f1303df3b02403b30bc8fdc9d06dd (cherry picked from commit f50f24e781738c8e5aa9f285d8726202f33107d6) [huzhanyuan: changed page to folio where appropriate]	2023-08-02 21:57:15 +00:00
Greg Kroah-Hartman	dafc2fae4d	This is the 6.1.13 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmP2A8MACgkQONu9yGCS aT7GrhAAky2nTRG9J0oPxh5Eu7wNKmjqDWNj9c6it3iGHpb+tfOY+LfPXHmWz0kX NoaNYGZGD8SDbkmwrSOmFB1Q/0OZ4/aIwM7Kwcw72UJVvrlsKx1HwiJjXKk809ZL bVlLUQzFTwyVIYcvjXQ8CuBHwBinLc3qkcyYGgbS8bseR4pDuxwoToDwAxk1d/0j ozWuzUKhSdYHYIUrk3papUro2UpF+Kb7KFpNiVo2wMaZM7en2XK3khCt8TuojH6c DXL+KZ/HbB8Ig1PWLaw2/6o4ispNy6bz7CJx6oDiOILR+le8xZA5WTdkXT3ovjyr LxutmPTTw6PxextIyVRblJWzXNcjdlV552U4gnnngcWn6wg4D4otqYnYvTaAUc+u sQnwrlQFxB2KfFKLNepGAy7klQJsYP3eadjDgGXP9TSmuUvUYRaNr6h0XukbyYkc kx2+Tw51NMKEqhgnaiKDN8AZEDTuLu5F4+NrUertxlb3PWeRRMRYVGJ1uw0KJg6t d5eniCB00SaaqN6M68u/hRYRi3gnwIsU7DitEpqejqwzskMpgegMFvebmCwORiq3 D+FD4EHOlztIToXhmEOXp0cz8fs27MuWmq4GkSwXvJuq+id5cQFdDN5GeLgNdAvH Kiu/Y+DY6ObW31tAQ1Jjp20L2RaWWvubrCBGeIqiDzUWmCohsks= =TXvc -----END PGP SIGNATURE----- Merge 6.1.13 into android14-6.1 Changes in 6.1.13 mptcp: sockopt: make 'tcp_fastopen_connect' generic mptcp: fix locking for setsockopt corner-case mptcp: deduplicate error paths on endpoint creation mptcp: fix locking for in-kernel listener creation btrfs: move the auto defrag code to defrag.c btrfs: lock the inode in shared mode before starting fiemap ASoC: amd: yc: Add DMI support for new acer/emdoor platforms ASoC: SOF: sof-audio: start with the right widget type ALSA: usb-audio: Add FIXED_RATE quirk for JBL Quantum610 Wireless ASoC: Intel: sof_rt5682: always set dpcm_capture for amplifiers ASoC: Intel: sof_cs42l42: always set dpcm_capture for amplifiers ASoC: Intel: sof_nau8825: always set dpcm_capture for amplifiers ASoC: Intel: sof_ssp_amp: always set dpcm_capture for amplifiers selftests/bpf: Verify copy_register_state() preserves parent/live fields ALSA: hda: Do not unset preset when cleaning up codec ASoC: amd: yc: Add Xiaomi Redmi Book Pro 15 2022 into DMI table bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself ASoC: cs42l56: fix DT probe tools/virtio: fix the vringh test for virtio ring changes vdpa: ifcvf: Do proper cleanup if IFCVF init fails net/rose: Fix to not accept on connected socket selftest: net: Improve IPV6_TCLASS/IPV6_HOPLIMIT tests apparmor compatibility net: stmmac: do not stop RX_CLK in Rx LPI state for qcs404 SoC powerpc/64: Fix perf profiling asynchronous interrupt handlers fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work() drm/nouveau/devinit/tu102-: wait for GFW_BOOT_PROGRESS == COMPLETED net: ethernet: mtk_eth_soc: Avoid truncating allocation net: sched: sch: Bounds check priority s390/decompressor: specify __decompress() buf len to avoid overflow nvme-fc: fix a missing queue put in nvmet_fc_ls_create_association nvme: clear the request_queue pointers on failure in nvme_alloc_admin_tag_set nvme: clear the request_queue pointers on failure in nvme_alloc_io_tag_set drm/amd/display: Add missing brackets in calculation drm/amd/display: Adjust downscaling limits for dcn314 drm/amd/display: Unassign does_plane_fit_in_mall function from dcn3.2 drm/amd/display: Reset DMUB mailbox SW state after HW reset drm/amdgpu: enable HDP SD for gfx 11.0.3 drm/amdgpu: Enable vclk dclk node for gc11.0.3 drm/amd/display: Properly handle additional cases where DCN is not supported platform/x86: touchscreen_dmi: Add Chuwi Vi8 (CWI501) DMI match ceph: move mount state enum to super.h ceph: blocklist the kclient when receiving corrupted snap trace selftests: mptcp: userspace: fix v4-v6 test in v6.1 of: reserved_mem: Have kmemleak ignore dynamically allocated reserved mem kasan: fix Oops due to missing calls to kasan_arch_is_ready() mm: shrinkers: fix deadlock in shrinker debugfs aio: fix mremap after fork null-deref vmxnet3: move rss code block under eop descriptor fbdev: Fix invalid page access after closing deferred I/O devices drm: Disable dynamic debug as broken drm/amd/amdgpu: fix warning during suspend drm/amd/display: Fail atomic_check early on normalize_zpos error drm/vmwgfx: Stop accessing buffer objects which failed init drm/vmwgfx: Do not drop the reference to the handle too soon mmc: jz4740: Work around bug on JZ4760(B) mmc: meson-gx: fix SDIO mode if cap_sdio_irq isn't set mmc: sdio: fix possible resource leaks in some error paths mmc: mmc_spi: fix error handling in mmc_spi_probe() ALSA: hda: Fix codec device field initializan ALSA: hda/conexant: add a new hda codec SN6180 ALSA: hda/realtek - fixed wrong gpio assigned ALSA: hda/realtek: fix mute/micmute LEDs don't work for a HP platform. ALSA: hda/realtek: Enable mute/micmute LEDs and speaker support for HP Laptops ata: ahci: Add Tiger Lake UP{3,4} AHCI controller ata: libata-core: Disable READ LOG DMA EXT for Samsung MZ7LH sched/psi: Fix use-after-free in ep_remove_wait_queue() hugetlb: check for undefined shift on 32 bit architectures nilfs2: fix underflow in second superblock position calculations mm/MADV_COLLAPSE: set EAGAIN on unexpected page refcount mm/filemap: fix page end in filemap_get_read_batch mm/migrate: fix wrongly apply write bit after mkdirty on sparc64 gpio: sim: fix a memory leak freezer,umh: Fix call_usermode_helper_exec() vs SIGKILL coredump: Move dump_emit_page() to kill unused warning Revert "mm: Always release pages to the buddy allocator in memblock_free_late()." net: Fix unwanted sign extension in netdev_stats_to_stats64() revert "squashfs: harden sanity check in squashfs_read_xattr_id_table" drm/vc4: crtc: Increase setup cost in core clock calculation to handle extreme reduced blanking drm/vc4: Fix YUV plane handling when planes are in different buffers drm/i915/gen11: Wa_1408615072/Wa_1407596294 should be on GT list ice: fix lost multicast packets in promisc mode ixgbe: allow to increase MTU to 3K with XDP enabled i40e: add double of VLAN header when computing the max MTU net: bgmac: fix BCM5358 support by setting correct flags net: ethernet: ti: am65-cpsw: Add RX DMA Channel Teardown Quirk sctp: sctp_sock_filter(): avoid list_entry() on possibly empty list net/sched: tcindex: update imperfect hash filters respecting rcu ice: xsk: Fix cleaning of XDP_TX frames dccp/tcp: Avoid negative sk_forward_alloc by ipv6_pinfo.pktoptions. net/usb: kalmia: Don't pass act_len in usb_bulk_msg error path net/sched: act_ctinfo: use percpu stats net: openvswitch: fix possible memory leak in ovs_meter_cmd_set() net: stmmac: fix order of dwmac5 FlexPPS parametrization sequence bnxt_en: Fix mqprio and XDP ring checking logic tracing: Make trace_define_field_ext() static net: stmmac: Restrict warning on disabling DMA store and fwd mode net: use a bounce buffer for copying skb->mark tipc: fix kernel warning when sending SYN message net: mpls: fix stale pointer if allocation fails during device rename igb: conditionalize I2C bit banging on external thermal sensor support igb: Fix PPS input and output using 3rd and 4th SDP ixgbe: add double of VLAN header when computing the max MTU ipv6: Fix datagram socket connection with DSCP. ipv6: Fix tcp socket connection with DSCP. mm/gup: add folio to list when folio_isolate_lru() succeed mm: extend max struct page size for kmsan i40e: Add checking for null for nlmsg_find_attr() net/sched: tcindex: search key must be 16 bits nvme-tcp: stop auth work after tearing down queues in error recovery nvme-rdma: stop auth work after tearing down queues in error recovery KVM: x86/pmu: Disable vPMU support on hybrid CPUs (host PMUs) kvm: initialize all of the kvm_debugregs structure before sending it to userspace perf/x86: Refuse to export capabilities for hybrid PMUs alarmtimer: Prevent starvation by small intervals and SIG_IGN nvme-pci: refresh visible attrs for cmb attributes ASoC: SOF: Intel: hda-dai: fix possible stream_tag leak net: sched: sch: Fix off by one in htb_activate_prios() Linux 6.1.13 Change-Id: I8a1e4175939c14f726c545001061b95462566386 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2023-02-22 12:32:41 +00:00
Peter Xu	54806cb751	mm/migrate: fix wrongly apply write bit after mkdirty on sparc64 commit 96a9c287e25d690fd9623b5133703b8e310fbed1 upstream. Nick Bowler reported another sparc64 breakage after the young/dirty persistent work for page migration (per "Link:" below). That's after a similar report [2]. It turns out page migration was overlooked, and it wasn't failing before because page migration was not enabled in the initial report test environment. David proposed another way [2] to fix this from sparc64 side, but that patch didn't land somehow. Neither did I check whether there's any other arch that has similar issues. Let's fix it for now as simple as moving the write bit handling to be after dirty, like what we did before. Note: this is based on mm-unstable, because the breakage was since 6.1 and we're at a very late stage of 6.2 (-rc8), so I assume for this specific case we should target this at 6.3. [1] https://lore.kernel.org/all/20221021160603.GA23307@u164.east.ru/ [2] https://lore.kernel.org/all/20221212130213.136267-1-david@redhat.com/ Link: https://lkml.kernel.org/r/20230216153059.256739-1-peterx@redhat.com Fixes: `2e3468778d` ("mm: remember young/dirty bit for page migrations") Link: https://lore.kernel.org/all/CADyTPExpEqaJiMGoV+Z6xVgL50ZoMJg49B10LcZ=8eg19u34BA@mail.gmail.com/ Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Nick Bowler <nbowler@draconx.ca> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Nick Bowler <nbowler@draconx.ca> Cc: <regressions@lists.linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-02-22 12:59:49 +01:00
Charan Teja Reddy	731835eae3	ANDROID: implement wrapper for reverse migration Reverse migration is used to do the balancing the occupancy of memory zones in a node in the system whose imabalance may be caused by migration of pages to other zones by an operation, eg: hotremove and then hotadding the same memory. In this case there is a lot of free memory in newly hotadd memory which can be filled up by the previous migrated pages(as part of offline/hotremove) thus may free up some pressure in other zones of the node. Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/ Bug: 201263307 Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com> Signed-off-by: Sukadev Bhattiprolu <quic_sukadev@quicinc.com>	2023-01-03 18:56:33 +00:00
Baolin Wang	03e5f82ea6	mm: migrate: fix return value if all subpages of THPs are migrated successfully During THP migration, if THPs are not migrated but they are split and all subpages are migrated successfully, migrate_pages() will still return the number of THP pages that were not migrated. This will confuse the callers of migrate_pages(). For example, the longterm pinning will failed though all pages are migrated successfully. Thus we should return 0 to indicate that all pages are migrated in this case Link: https://lkml.kernel.org/r/de386aa864be9158d2f3b344091419ea7c38b2f7.1666599848.git.baolin.wang@linux.alibaba.com Fixes: `b5bade978e` ("mm: migrate: fix the return value of migrate_pages()") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-28 13:37:22 -07:00
Alistair Popple	16ce101db8	mm/memory.c: fix race when faulting a device private page Patch series "Fix several device private page reference counting issues", v2 This series aims to fix a number of page reference counting issues in drivers dealing with device private ZONE_DEVICE pages. These result in use-after-free type bugs, either from accessing a struct page which no longer exists because it has been removed or accessing fields within the struct page which are no longer valid because the page has been freed. During normal usage it is unlikely these will cause any problems. However without these fixes it is possible to crash the kernel from userspace. These crashes can be triggered either by unloading the kernel module or unbinding the device from the driver prior to a userspace task exiting. In modules such as Nouveau it is also possible to trigger some of these issues by explicitly closing the device file-descriptor prior to the task exiting and then accessing device private memory. This involves some minor changes to both PowerPC and AMD GPU code. Unfortunately I lack hardware to test either of those so any help there would be appreciated. The changes mimic what is done in for both Nouveau and hmm-tests though so I doubt they will cause problems. This patch (of 8): When the CPU tries to access a device private page the migrate_to_ram() callback associated with the pgmap for the page is called. However no reference is taken on the faulting page. Therefore a concurrent migration of the device private page can free the page and possibly the underlying pgmap. This results in a race which can crash the kernel due to the migrate_to_ram() function pointer becoming invalid. It also means drivers can't reliably read the zone_device_data field because the page may have been freed with memunmap_pages(). Close the race by getting a reference on the page while holding the ptl to ensure it has not been freed. Unfortunately the elevated reference count will cause the migration required to handle the fault to fail. To avoid this failure pass the faulting page into the migrate_vma functions so that if an elevated reference count is found it can be checked to see if it's expected or not. [mpe@ellerman.id.au: fix build] Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Lyude Paul <lyude@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-12 18:51:49 -07:00
Matthew Wilcox (Oracle)	29eea9b5a9	mm: convert page_get_anon_vma() to folio_get_anon_vma() With all callers now passing in a folio, rename the function and convert all callers. Removes a couple of calls to compound_head() and a reference to page->mapping. Link: https://lkml.kernel.org/r/20220902194653.1739778-55-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-03 14:02:54 -07:00
Matthew Wilcox (Oracle)	c33db29231	migrate: convert unmap_and_move_huge_page() to use folios Saves several calls to compound_head() and removes a couple of uses of page->lru. Link: https://lkml.kernel.org/r/20220902194653.1739778-52-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-03 14:02:54 -07:00
Matthew Wilcox (Oracle)	682a71a1b6	migrate: convert __unmap_and_move() to use folios Removes a lot of calls to compound_head(). Also remove a VM_BUG_ON that can never trigger as the PageAnon bit is the bottom bit of page->mapping. Link: https://lkml.kernel.org/r/20220902194653.1739778-51-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-03 14:02:53 -07:00
Haiyue Wang	f7091ed64e	mm: fix the handling Non-LRU pages returned by follow_page The handling Non-LRU pages returned by follow_page() jumps directly, it doesn't call put_page() to handle the reference count, since 'FOLL_GET' flag for follow_page() has get_page() called. Fix the zone device page check by handling the page reference count correctly before returning. And as David reviewed, "device pages are never PageKsm pages". Drop this zone device page check for break_ksm(). Since the zone device page can't be a transparent huge page, so drop the redundant zone device page check for split_huge_pages_pid(). (by Miaohe) Link: https://lkml.kernel.org/r/20220823135841.934465-3-haiyue.wang@intel.com Fixes: `3218f8712d` ("mm: handling Non-LRU pages returned by vm_normal_pages") Signed-off-by: Haiyue Wang <haiyue.wang@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:28 -07:00
Aneesh Kumar K.V	467b171af8	mm/demotion: update node_is_toptier to work with memory tiers With memory tier support we can have memory only NUMA nodes in the top tier from which we want to avoid promotion tracking NUMA faults. Update node_is_toptier to work with memory tiers. All NUMA nodes are by default top tier nodes. With lower(slower) memory tiers added we consider all memory tiers above a memory tier having CPU NUMA nodes as a top memory tier [sj@kernel.org: include missed header file, memory-tiers.h] Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org [akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h] [aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers] Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:12 -07:00
Aneesh Kumar K.V	6c542ab757	mm/demotion: build demotion targets based on explicit memory tiers This patch switch the demotion target building logic to use memory tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed in the default memory tier and additional memory tiers will be added by drivers like dax kmem. This patch builds the demotion target for a NUMA node by looking at all memory tiers below the tier to which the NUMA node belongs. The closest node in the immediately following memory tier is used as a demotion target. Since we are now only building demotion target for N_MEMORY NUMA nodes the CPU hotplug calls are removed in this patch. Link: https://lkml.kernel.org/r/20220818131042.113280-6-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:12 -07:00
Aneesh Kumar K.V	9195244022	mm/demotion: move memory demotion related code This moves memory demotion related code to mm/memory-tiers.c. No functional change in this patch. Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Wei Xu <weixugc@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Bharata B Rao <bharata@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hesham Almatary <hesham.almatary@huawei.com> Cc: Jagdish Gediya <jvgediya.oss@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:11 -07:00
Baolin Wang	7047b5a40b	mm: migrate: do not retry 10 times for the subpages of fail-to-migrate THP If THP is failed to migrate due to -ENOSYS or -ENOMEM case, the THP will be split, and the subpages of fail-to-migrate THP will be tried to migrate again, so we should not account the retry counter in the second loop, since we already accounted 'nr_thp_failed' in the first loop. Moreover we also do not need retry 10 times for -EAGAIN case for the subpages of fail-to-migrate THP in the second loop, since we already regarded the THP as migration failure, and save some migration time (for the worst case, will try 512 * 10 times) according to previous discussion [1]. [1] https://lore.kernel.org/linux-mm/87r13a7n04.fsf@yhuang6-desk2.ccr.corp.intel.com/ Link: https://lkml.kernel.org/r/20220817081408.513338-9-ying.huang@intel.com Tested-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:07 -07:00
Huang Ying	077309bc1e	migrate_pages(): fix failure counting for retry After 10 retries, we will give up and the remaining pages will be counted as failure in nr_failed and nr_thp_failed. We should count the failure in nr_failed_pages too. This is done in this patch. Link: https://lkml.kernel.org/r/20220817081408.513338-8-ying.huang@intel.com Fixes: `5984fabb6e` ("mm: move_pages: report the number of non-attempted pages") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:07 -07:00
Huang Ying	e6fa8a79fe	migrate_pages(): fix failure counting for THP splitting If THP is failed to be migrated, it may be split and retry. But after splitting, the head page will be left in "from" list, although THP migration failure has been counted already. If the head page is failed to be migrated too, the failure will be counted twice incorrectly. So this is fixed in this patch via moving the head page of THP after splitting to "thp_split_pages" too. Link: https://lkml.kernel.org/r/20220817081408.513338-7-ying.huang@intel.com Fixes: `5984fabb6e` ("mm: move_pages: report the number of non-attempted pages") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:07 -07:00
Huang Ying	577be05c89	migrate_pages(): fix failure counting for THP on -ENOSYS If THP or hugetlbfs page migration isn't supported, unmap_and_move() or unmap_and_move_huge_page() will return -ENOSYS. For THP, splitting will be tried, but if splitting doesn't succeed, the THP will be left in "from" list wrongly. If some other pages are retried, the THP migration failure will counted again. This is fixed via moving the failure THP from "from" to "ret_pages". Another issue of the original code is that the unsupported failure processing isn't consistent between THP and hugetlbfs page. Make them consistent in this patch to make the code easier to be understood too. Link: https://lkml.kernel.org/r/20220817081408.513338-6-ying.huang@intel.com Fixes: `5984fabb6e` ("mm: move_pages: report the number of non-attempted pages") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:07 -07:00
Huang Ying	5fc30916b5	migrate_pages(): fix failure counting for THP subpages retrying If THP is failed to be migrated for -ENOSYS and -ENOMEM, the THP will be split into thp_split_pages, and after other pages are migrated, pages in thp_split_pages will be migrated with no_subpage_counting == true, because its failure have been counted already. If some pages in thp_split_pages are retried during migration, we should not count their failure if no_subpage_counting == true too. This is done this patch to fix the failure counting for THP subpages retrying. Link: https://lkml.kernel.org/r/20220817081408.513338-5-ying.huang@intel.com Fixes: `5984fabb6e` ("mm: move_pages: report the number of non-attempted pages") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:06 -07:00
Huang Ying	fbed53b477	migrate_pages(): fix THP failure counting for -ENOMEM In unmap_and_move(), if the new THP cannot be allocated, -ENOMEM will be returned, and migrate_pages() will try to split the THP unless "reason" is MR_NUMA_MISPLACED (that is, nosplit == true). But when nosplit == true, the THP migration failure will not be counted. This is incorrect, so in this patch, the THP migration failure will be counted for -ENOMEM regardless of nosplit is true or false. The nr_failed counting isn't fixed because it's not used. Added some comments for it per Baolin's suggestion. Link: https://lkml.kernel.org/r/20220817081408.513338-4-ying.huang@intel.com Fixes: `5984fabb6e` ("mm: move_pages: report the number of non-attempted pages") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:06 -07:00
Huang Ying	9c62ff005f	migrate_pages(): remove unnecessary list_safe_reset_next() Before commit `b5bade978e` ("mm: migrate: fix the return value of migrate_pages()"), the tail pages of THP will be put in the "from" list directly. So one of the loop cursors (page2) needs to be reset, as is done in try_split_thp() via list_safe_reset_next(). But after the commit, the tail pages of THP will be put in a dedicated list (thp_split_pages). That is, the "from" list will not be changed during splitting. So, it's unnecessary to call list_safe_reset_next() anymore. This is a code cleanup, no functionality changes are expected. Link: https://lkml.kernel.org/r/20220817081408.513338-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:06 -07:00
Huang Ying	a7504ed14f	migrate: fix syscall move_pages() return value for failure Patch series "migrate_pages(): fix several bugs in error path", v3. During review the code of migrate_pages() and build a test program for it. Several bugs in error path are identified and fixed in this series. Most patches are tested via - Apply error-inject.patch in Linux kernel - Compile test-migrate.c (with -lnuma) - Test with test-migrate.sh error-inject.patch, test-migrate.c, and test-migrate.sh are as below. It turns out that error injection is an important tool to fix bugs in error path. This patch (of 8): The return value of move_pages() syscall is incorrect when counting the remaining pages to be migrated. For example, for the following test program, " #define _GNU_SOURCE #include <stdbool.h> #include <stdio.h> #include <string.h> #include <stdlib.h> #include <errno.h> #include <fcntl.h> #include <sys/uio.h> #include <sys/mman.h> #include <sys/types.h> #include <unistd.h> #include <numaif.h> #include <numa.h> #ifndef MADV_FREE #define MADV_FREE 8 /* free pages only if memory pressure / #endif #define ONE_MB (1024 1024) #define MAP_SIZE (16 * ONE_MB) #define THP_SIZE (2 * ONE_MB) #define THP_MASK (THP_SIZE - 1) #define ERR_EXIT_ON(cond, msg) \ do { \ int __cond_in_macro = (cond); \ if (__cond_in_macro) \ error_exit(__cond_in_macro, (msg)); \ } while (0) void error_msg(int ret, int nr, int status, const char msg) { int i; fprintf(stderr, "Error: %s, ret : %d, error: %s\n", msg, ret, strerror(errno)); if (!nr) return; fprintf(stderr, "status: "); for (i = 0; i < nr; i++) fprintf(stderr, "%d ", status[i]); fprintf(stderr, "\n"); } void error_exit(int ret, const char msg) { error_msg(ret, 0, NULL, msg); exit(1); } int page_size; bool do_vmsplice; bool do_thp; static int pipe_fds[2]; void addr; char pn; char pn1; void pages[2]; int status[2]; void prepare() { int ret; struct iovec iov; if (addr) { munmap(addr, MAP_SIZE); close(pipe_fds[0]); close(pipe_fds[1]); } ret = pipe(pipe_fds); ERR_EXIT_ON(ret, "pipe"); addr = mmap(NULL, MAP_SIZE, PROT_READ \| PROT_WRITE, MAP_PRIVATE \| MAP_ANONYMOUS, -1, 0); ERR_EXIT_ON(addr == MAP_FAILED, "mmap"); if (do_thp) { ret = madvise(addr, MAP_SIZE, MADV_HUGEPAGE); ERR_EXIT_ON(ret, "advise hugepage"); } pn = (char )(((unsigned long)addr + THP_SIZE) & ~THP_MASK); pn1 = pn + THP_SIZE; pages[0] = pn; pages[1] = pn1; pn = 1; if (do_vmsplice) { iov.iov_base = pn; iov.iov_len = page_size; ret = vmsplice(pipe_fds[1], &iov, 1, 0); ERR_EXIT_ON(ret < 0, "vmsplice"); } status[0] = status[1] = 1024; } void test_migrate() { int ret; int nodes[2] = { 1, 1 }; pid_t pid = getpid(); prepare(); ret = move_pages(pid, 1, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 1, status, "move 1 page"); prepare(); ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 2, status, "move 2 pages, page 1 not mapped"); prepare(); pn1 = 1; ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 2, status, "move 2 pages"); prepare(); pn1 = 1; nodes[1] = 0; ret = move_pages(pid, 2, pages, nodes, status, MPOL_MF_MOVE_ALL); error_msg(ret, 2, status, "move 2 pages, page 1 to node 0"); } int main(int argc, char argv[]) { numa_run_on_node(0); page_size = getpagesize(); test_migrate(); fprintf(stderr, "\nMake page 0 cannot be migrated:\n"); do_vmsplice = true; test_migrate(); fprintf(stderr, "\nTest THP:\n"); do_thp = true; do_vmsplice = false; test_migrate(); fprintf(stderr, "\nTHP: make page 0 cannot be migrated:\n"); do_vmsplice = true; test_migrate(); return 0; } " The output of the current kernel is, " Error: move 1 page, ret : 0, error: Success status: 1 Error: move 2 pages, page 1 not mapped, ret : 0, error: Success status: 1 -14 Error: move 2 pages, ret : 0, error: Success status: 1 1 Error: move 2 pages, page 1 to node 0, ret : 0, error: Success status: 1 0 Make page 0 cannot be migrated: Error: move 1 page, ret : 0, error: Success status: 1024 Error: move 2 pages, page 1 not mapped, ret : 1, error: Success status: 1024 -14 Error: move 2 pages, ret : 0, error: Success status: 1024 1024 Error: move 2 pages, page 1 to node 0, ret : 1, error: Success status: 1024 1024 " While the expected output is, " Error: move 1 page, ret : 0, error: Success status: 1 Error: move 2 pages, page 1 not mapped, ret : 0, error: Success status: 1 -14 Error: move 2 pages, ret : 0, error: Success status: 1 1 Error: move 2 pages, page 1 to node 0, ret : 0, error: Success status: 1 0 Make page 0 cannot be migrated: Error: move 1 page, ret : 1, error: Success status: 1024 Error: move 2 pages, page 1 not mapped, ret : 1, error: Success status: 1024 -14 Error: move 2 pages, ret : 1, error: Success status: 1024 1024 Error: move 2 pages, page 1 to node 0, ret : 2, error: Success status: 1024 1024 " Fix this via correcting the remaining pages counting. With the fix, the output for the test program as above is expected. Link: https://lkml.kernel.org/r/20220817081408.513338-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220817081408.513338-2-ying.huang@intel.com Fixes: `5984fabb6e` ("mm: move_pages: report the number of non-attempted pages") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:06 -07:00
Peter Xu	2e3468778d	mm: remember young/dirty bit for page migrations When page migration happens, we always ignore the young/dirty bit settings in the old pgtable, and marking the page as old in the new page table using either pte_mkold() or pmd_mkold(), and keeping the pte clean. That's fine from functional-wise, but that's not friendly to page reclaim because the moving page can be actively accessed within the procedure. Not to mention hardware setting the young bit can bring quite some overhead on some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit. The same slowdown problem to dirty bits when the memory is first written after page migration happened. Actually we can easily remember the A/D bit configuration and recover the information after the page is migrated. To achieve it, define a new set of bits in the migration swap offset field to cache the A/D bits for old pte. Then when removing/recovering the migration entry, we can recover the A/D bits even if the page changed. One thing to mention is that here we used max_swapfile_size() to detect how many swp offset bits we have, and we'll only enable this feature if we know the swp offset is big enough to store both the PFN value and the A/D bits. Otherwise the A/D bits are dropped like before. Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:05 -07:00
Huang Ying	33024536ba	memory tiering: hot page selection with hint page fault latency Patch series "memory tiering: hot page selection", v4. To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory nodes need to be identified. Essentially, the original NUMA balancing implementation selects the mostly recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages. Because the pages with quite low access frequency may be accessed eventually given the NUMA balancing page table scanning period could be quite long (e.g. 60 seconds). So in this patchset, we implement a new hot page identification algorithm based on the latency between NUMA balancing page table scanning and hint page fault. Which is a kind of mostly frequently accessed (MFU) algorithm. In NUMA balancing memory tiering mode, if there are hot pages in slow memory node and cold pages in fast memory node, we need to promote/demote hot/cold pages between the fast and cold memory nodes. A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workload because of accessing inflating and slow memory bandwidth contention. A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patchset as the page promotion rate limit mechanism. The promotion hot threshold is workload and system configuration dependent. So in this patchset, a method to adjust the hot threshold automatically is implemented. The basic idea is to control the number of the candidate promotion pages to match the promotion rate limit. We used the pmbench memory accessing benchmark tested the patchset on a 2-socket server system with DRAM and PMEM installed. The test results are as follows, pmbench score promote rate (accesses/s) MB/s ------------- ------------ base 146887704.1 725.6 hot selection 165695601.2 544.0 rate limit 162814569.8 165.2 auto adjustment 170495294.0 136.9 From the results above, With hot page selection patch [1/3], the pmbench score increases about 12.8%, and promote rate (overhead) decreases about 25.0%, compared with base kernel. With rate limit patch [2/3], pmbench score decreases about 1.7%, and promote rate decreases about 69.6%, compared with hot page selection patch. With threshold auto adjustment patch [3/3], pmbench score increases about 4.7%, and promote rate decrease about 17.1%, compared with rate limit patch. Baolin helped to test the patchset with MySQL on a machine which contains 1 DRAM node (30G) and 1 PMEM node (126G). sysbench /usr/share/sysbench/oltp_read_write.lua \ ...... --tables=200 \ --table-size=1000000 \ --report-interval=10 \ --threads=16 \ --time=120 The tps can be improved about 5%. This patch (of 3): To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory node need to be identified. Essentially, the original NUMA balancing implementation selects the mostly recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages. Because the pages with quite low access frequency may be accessed eventually given the NUMA balancing page table scanning period could be quite long (e.g. 60 seconds). The most frequently accessed (MFU) algorithm is better. So, in this patch we implemented a better hot page selection algorithm. Which is based on NUMA balancing page table scanning and hint page fault as follows, - When the page tables of the processes are scanned to change PTE/PMD to be PROT_NONE, the current time is recorded in struct page as scan time. - When the page is accessed, hint page fault will occur. The scan time is gotten from the struct page. And The hint page fault latency is defined as hint page fault time - scan time The shorter the hint page fault latency of a page is, the higher the probability of their access frequency to be higher. So the hint page fault latency is a better estimation of the page hot/cold. It's hard to find some extra space in struct page to hold the scan time. Fortunately, we can reuse some bits used by the original NUMA balancing. NUMA balancing uses some bits in struct page to store the page accessing CPU and PID (referring to page_cpupid_xchg_last()). Which is used by the multi-stage node selection algorithm to avoid to migrate pages shared accessed by the NUMA nodes back and forth. But for pages in the slow memory node, even if they are shared accessed by multiple NUMA nodes, as long as the pages are hot, they need to be promoted to the fast memory node. So the accessing CPU and PID information are unnecessary for the slow memory pages. We can reuse these bits in struct page to record the scan time. For the fast memory pages, these bits are used as before. For the hot threshold, the default value is 1 second, which works well in our performance test. All pages with hint page fault latency < hot threshold will be considered hot. It's hard for users to determine the hot threshold. So we don't provide a kernel ABI to set it, just provide a debugfs interface for advanced users to experiment. We will continue to work on a hot threshold automatic adjustment mechanism. The downside of the above method is that the response time to the workload hot spot changing may be much longer. For example, - A previous cold memory area becomes hot - The hint page fault will be triggered. But the hint page fault latency isn't shorter than the hot threshold. So the pages will not be promoted. - When the memory area is scanned again, maybe after a scan period, the hint page fault latency measured will be shorter than the hot threshold and the pages will be promoted. To mitigate this, if there are enough free space in the fast memory node, the hot threshold will not be used, all pages will be promoted upon the hint page fault for fast response. Thanks Zhong Jiang reported and tested the fix for a bug when disabling memory tiering mode dynamically. Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: osalvador <osalvador@suse.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-11 20:25:54 -07:00
Haiyue Wang	8315682148	mm: migration: fix the FOLL_GET failure on following huge page Not all huge page APIs support FOLL_GET option, so move_pages() syscall will fail to get the page node information for some huge pages. Like x86 on linux 5.19 with 1GB huge page API follow_huge_pud(), it will return NULL page for FOLL_GET when calling move_pages() syscall with the NULL 'nodes' parameter, the 'status' parameter has '-2' error in array. Note: follow_huge_pud() now supports FOLL_GET in linux 6.0. Link: https://lore.kernel.org/all/20220714042420.1847125-3-naoya.horiguchi@linux.dev But these huge page APIs don't support FOLL_GET: 1. follow_huge_pud() in arch/s390/mm/hugetlbpage.c 2. follow_huge_addr() in arch/ia64/mm/hugetlbpage.c It will cause WARN_ON_ONCE for FOLL_GET. 3. follow_huge_pgd() in mm/hugetlb.c This is an temporary solution to mitigate the side effect of the race condition fix by calling follow_page() with FOLL_GET set for huge pages. After supporting follow huge page by FOLL_GET is done, this fix can be reverted safely. Link: https://lkml.kernel.org/r/20220823135841.934465-2-haiyue.wang@intel.com Link: https://lkml.kernel.org/r/20220812084921.409142-1-haiyue.wang@intel.com Fixes: `4cd614841c` ("mm: migration: fix possible do_pages_stat_array racing with memory offline") Signed-off-by: Haiyue Wang <haiyue.wang@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-11 20:25:52 -07:00
Linus Torvalds	6614a3c316	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/ SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE= =w/UH -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...	2022-08-05 16:32:45 -07:00
Matthew Wilcox (Oracle)	9d0ddc0cb5	fs: Remove aops->migratepage() With all users converted to migrate_folio(), remove this operation. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)	b890ec2a2c	hugetlb: Convert to migrate_folio This involves converting migrate_huge_page_move_mapping(). We also need a folio variant of hugetlb_set_page_subpool(), but that's for a later patch. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>	2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)	2ec810d596	mm/migrate: Add filemap_migrate_folio() There is nothing iomap-specific about iomap_migratepage(), and it fits a pattern used by several other filesystems, so move it to mm/migrate.c, convert it to be filemap_migrate_folio() and convert the iomap filesystems to use it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>	2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)	541846502f	mm/migrate: Convert migrate_page() to migrate_folio() Convert all callers to pass a folio. Most have the folio already available. Switch all users from aops->migratepage to aops->migrate_folio. Also turn the documentation into kerneldoc. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com>	2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)	108ca83581	mm/migrate: Convert expected_page_refs() to folio_expected_refs() Now that both callers have a folio, convert this function to take a folio & rename it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)	67235182a4	mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio() Use a folio throughout __buffer_migrate_folio(), add kernel-doc for buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their declarations to buffer.h and switch all filesystems that have wired them up. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)	2be7fa10c0	mm/migrate: Convert writeout() to take a folio Use a folio throughout this function. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)	8faa8ef5dd	mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio() Use a folio throughout. migrate_page() will be converted to migrate_folio() later. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)	5490da4f06	fs: Add aops->migrate_folio Provide a folio-based replacement for aops->migratepage. Update the documentation to document migrate_folio instead of migratepage. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)	68f2736a85	mm: Convert all PageMovable users to movable_operations These drivers are rather uncomfortably hammered into the address_space_operations hole. They aren't filesystems and don't behave like filesystems. They just need their own movable_operations structure, which we can point to directly from page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>	2022-08-02 12:34:03 -04:00
Alex Sierra	3218f8712d	mm: handling Non-LRU pages returned by vm_normal_pages With DEVICE_COHERENT, we'll soon have vm_normal_pages() return device-managed anonymous pages that are not LRU pages. Although they behave like normal pages for purposes of mapping in CPU page, and for COW. They do not support LRU lists, NUMA migration or THP. Callers to follow_page() currently don't expect ZONE_DEVICE pages, however, with DEVICE_COHERENT we might now return ZONE_DEVICE. Check for ZONE_DEVICE pages in applicable users of follow_page() as well. Link: https://lkml.kernel.org/r/20220715150521.18165-5-alex.sierra@amd.com Signed-off-by: Alex Sierra <alex.sierra@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> [v2] Reviewed-by: Alistair Popple <apopple@nvidia.com> [v6] Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:14:28 -07:00
Miaohe Lin	ad1ac596e8	mm/migration: fix potential pte_unmap on an not mapped pte __migration_entry_wait and migration_entry_wait_on_locked assume pte is always mapped from caller. But this is not the case when it's called from migration_entry_wait_huge and follow_huge_pmd. Add a hugetlbfs variant that calls hugetlb_migration_entry_wait(ptep == NULL) to fix this issue. Link: https://lkml.kernel.org/r/20220530113016.16663-5-linmiaohe@huawei.com Fixes: `30dad30922` ("mm: migration: add migrate_entry_wait_huge()") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Howells <dhowells@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:37 -07:00
Miaohe Lin	7ce82f4c3f	mm/migration: return errno when isolate_huge_page failed We might fail to isolate huge page due to e.g. the page is under migration which cleared HPageMigratable. We should return errno in this case rather than always return 1 which could confuse the user, i.e. the caller might think all of the memory is migrated while the hugetlb page is left behind. We make the prototype of isolate_huge_page consistent with isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page to isolate_hugetlb as suggested by Muchun to improve the readability. Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com Fixes: `e8db67eb0d` ("mm: migrate: move_pages() supports thp migration") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Huang Ying <ying.huang@intel.com> Reported-by: kernel test robot <lkp@intel.com> (build error) Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:37 -07:00
Miaohe Lin	160088b3b6	mm/migration: remove unneeded lock page and PageMovable check When non-lru movable page was freed from under us, __ClearPageMovable must have been done. So we can remove unneeded lock page and PageMovable check here. Also free_pages_prepare() will clear PG_isolated for us, so we can further remove ClearPageIsolated as suggested by David. Link: https://lkml.kernel.org/r/20220530113016.16663-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Howells <dhowells@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-03 18:08:37 -07:00
Matthew Wilcox (Oracle)	b653db7735	mm: Clear page->private when splitting or migrating a page In our efforts to remove uses of PG_private, we have found folios with the private flag clear and folio->private not-NULL. That is the root cause behind `642d51fb07` ("ceph: check folio PG_private bit instead of folio->private"). It can also affect a few other filesystems that haven't yet reported a problem. compaction_alloc() can return a page with uninitialised page->private, and rather than checking all the callers of migrate_pages(), just zero page->private after calling get_new_page(). Similarly, the tail pages from split_huge_page() may also have an uninitialised page->private. Reported-by: Xiubo Li <xiubli@redhat.com> Tested-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>	2022-06-23 12:21:44 -04:00

1 2 3 4 5 ...

624 Commits