android_kernel_samsung_sm8650

Author	SHA1	Message	Date
xiaosong.ma	bbc9d3bc0b	ANDROID: vendor_hooks: mm: Add tune_swappiness vendor hook in get_swappiness() Add hook in get_swappiness() for customized swappiness when lru_gen is enabled. Bug: 299548382 Test: buid pass Change-Id: If15cb4f71fda6c0b24359f8dc439a090a5434dc9 Signed-off-by: xiaosong.ma <xiaosong.ma@unisoc.com>	2023-09-15 16:06:46 +00:00
Kalesh Singh	0500235e3f	ANDROID: vendor_hook: Add vendor hook to decide scan abort policy Allow vendor hook to enable checking of the high water marks to decide if reclaim should continue scanning. Bug: 224956008 Change-Id: I63fe1fd386e7599451c2df0a04c8440b4fc142fc Signed-off-by: Kalesh Singh <kaleshsingh@google.com>	2023-09-12 23:08:17 +00:00
Tangquan Zheng	8e6550add2	ANDROID: vendor_hooks: Add tune swappiness hook in get_scan_count() Add hook in get_scan_count() for customized swappiness. Partial cherry-pick of aosp/2119426. Bug: 297985476 Change-Id: I9d4074cf1a4097ff2a96be04646a01624cbd8dc3 Signed-off-by: Tangquan Zheng <zhengtangquan@oppo.com>	2023-08-31 17:38:17 +00:00
Suren Baghdasaryan	3c187b4a12	BACKPORT: FROMGIT: mm: enable page walking API to lock vmas during the walk walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock, operate correctly after introduction of per-vma locks. With per-vma locks page faults can be handled under vma lock without taking mmap_lock at all, so write locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change a concurrent page can be faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Suggested-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 2ebc368f59eedcef0de7c832fe1d62935cd3a7ff https: //git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable) [surenb: changed locking in break_ksm since it's done differently, skipped the change in the missing __ksm_del_vma(), skipped the change in the missing walk_page_range_vma(), removed unused local variables] Bug: 293665307 Change-Id: Iede9eaa950ea59a268a2e74a8d3022162f0bbd80 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2023-08-16 16:55:02 +00:00
Charan Teja Kalla	f86c79eb86	FROMGIT: Multi-gen LRU: skip CMA pages when they are not eligible This patch is based on the commit 5da226dbfce3("mm: skip CMA pages when they are not available") which skips cma pages reclaim when they are not eligible for the current allocation context. In mglru, such pages are added to the tail of the immediate generation to maintain better LRU order, which is unlike the case of conventional LRU where such pages are directly added to the head of the LRU list(akin to adding to head of the youngest generation in mglru). No observable issue without this patch on MGLRU, but logically it make sense to skip the CMA page reclaim when those pages can't be satisfied for the current allocation context. Link: https://lkml.kernel.org/r/1691568344-13475-1-git-send-email-quic_charante@quicinc.com Change-Id: I586415b3e3a92da23f3e79b9d63802a2ced03432 Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 75d52d9304ef5b268eb798b0c679815290a0fc83 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable) Bug: 288383787 Bug: 291719697 Signed-off-by: Kalesh Singh <kaleshsingh@google.com>	2023-08-15 19:57:01 +00:00
Zhaoyang Huang	7ae1e02abb	UPSTREAM: mm: skip CMA pages when they are not available This patch fixes unproductive reclaiming of CMA pages by skipping them when they are not available for current context. It arises from the below OOM issue, which was caused by a large proportion of MIGRATE_CMA pages among free pages. [ 36.172486] [03-19 10:05:52.172] ActivityManager: page allocation failure: order:0, mode:0xc00(GFP_NOIO), nodemask=(null),cpuset=foreground,mems_allowed=0 [ 36.189447] [03-19 10:05:52.189] DMA32: 04kB 4478kB (C) 21716kB (C) 12432kB (C) 13664kB (C) 70128kB (C) 22256kB (C) 3512kB (C) 01024kB 02048kB 04096kB = 35848kB [ 36.193125] [03-19 10:05:52.193] Normal: 2314kB (UMEH) 498kB (MEH) 1416kB (H) 1332kB (H) 864kB (H) 2128kB (H) 0256kB 1512kB (H) 01024kB 02048kB 04096kB = 3236kB ... [ 36.234447] [03-19 10:05:52.234] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC) [ 36.234455] [03-19 10:05:52.234] cache: ext4_io_end, object size: 64, buffer size: 64, default order: 0, min order: 0 [ 36.234459] [03-19 10:05:52.234] node 0: slabs: 53,objs: 3392, free: 0 This change further decreases the chance for wrong OOMs in the presence of a lot of CMA memory. [david@redhat.com: changelog addition] Link: https://lkml.kernel.org/r/1685501461-19290-1-git-send-email-zhaoyang.huang@unisoc.com Change-Id: I84f1145c38b5ff7b825f2122b33bc55997931bd7 Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: ke.wang <ke.wang@unisoc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 5da226dbfce3a2f44978c2c7cf88166e69a6788b) Bug: 288383787 Bug: 291719697 Signed-off-by: Kalesh Singh <kaleshsingh@google.com>	2023-08-15 19:57:01 +00:00
Jiewen Wang	dbb09068c1	ANDROID: vendor_hooks: Add tune scan type hook in get_scan_count() Add hook in get_scan_count() for oem to wield customized reclamation strategy Bug: 294180281 Change-Id: Ic54d35128e458661fc2b641809f5371b1d9a488e Signed-off-by: Jiewen Wang <jiewen.wang@vivo.com>	2023-08-07 18:11:35 +00:00
Kalesh Singh	5e1d25ac2a	FROMGIT: BACKPORT: Multi-gen LRU: Fix can_swap in lru_gen_look_around() walk->can_swap might be invalid since it's not guaranteed to be initialized for the particular lruvec. Instead deduce it from the folio type (anon/file). Link: https://lkml.kernel.org/r/20230802025606.346758-3-kaleshsingh@google.com Fixes: `018ee47f14` ("mm: multi-gen LRU: exploit locality in rmap") Change-Id: I1ae78011d4972d87bac9f2db8c56352cdb7a9be6 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek] Tested-by: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Barrett <steven@liquorix.net> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit fdf19e8c8f1cdcee4eccf4c98a875f44f39d8b9d https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable) Bug: 288383787 Bug: 291719697 [ Kalesh Singh - Fix trivial conflict in lru_gen_look_around() ] Signed-off-by: Kalesh Singh <kaleshsingh@google.com>	2023-08-03 20:45:58 +00:00
Kalesh Singh	addf1a9a65	FROMGIT: Multi-gen LRU: Avoid race in inc_min_seq() inc_max_seq() will try to inc_min_seq() if nr_gens == MAX_NR_GENS. This is because the generations are reused (the last oldest now empty generation will become the next youngest generation). inc_min_seq() is retried until successful, dropping the lru_lock and yielding the CPU on each failure, and retaking the lock before trying again: while (!inc_min_seq(lruvec, type, can_swap)) { spin_unlock_irq(&lruvec->lru_lock); cond_resched(); spin_lock_irq(&lruvec->lru_lock); } However, the initial condition that required incrementing the min_seq (nr_gens == MAX_NR_GENS) is not retested. This can change by another call to inc_max_seq() from run_aging() with force_scan=true from the debugfs interface. Since the eviction stalls when the nr_gens == MIN_NR_GENS, avoid unnecessarily incrementing the min_seq by rechecking the number of generations before each attempt. This issue was uncovered in previous discussion on the list by Yu Zhao and Aneesh Kumar [1]. [1] https://lore.kernel.org/linux-mm/CAOUHufbO7CaVm=xjEb1avDhHVvnC8pJmGyKcFf2iY_dpf+zR3w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20230802025606.346758-2-kaleshsingh@google.com Fixes: `d6c3af7d8a` ("mm: multi-gen LRU: debugfs interface") Change-Id: I89e84ef2927eb1b0091f1be28bd03eb04dee4c57 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek] Tested-by: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Barrett <steven@liquorix.net> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 250dbd10306126b06415afda8adfc27b2b780428 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable) Bug: 288383787 Bug: 291719697 Signed-off-by: Kalesh Singh <kaleshsingh@google.com>	2023-08-03 20:45:58 +00:00
Kalesh Singh	a7adb98897	FROMGIT: Multi-gen LRU: Fix per-zone reclaim MGLRU has a LRU list for each zone for each type (anon/file) in each generation: long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES]; The min_seq (oldest generation) can progress independently for each type but the max_seq (youngest generation) is shared for both anon and file. This is to maintain a common frame of reference. In order for eviction to advance the min_seq of a type, all the per-zone lists in the oldest generation of that type must be empty. The eviction logic only considers pages from eligible zones for eviction or promotion. scan_folios() { ... for (zone = sc->reclaim_idx; zone >= 0; zone--) { ... sort_folio(); // Promote ... isolate_folio(); // Evict } ... } Consider the system has the movable zone configured and default 4 generations. The current state of the system is as shown below (only illustrating one type for simplicity): Type: ANON Zone DMA32 Normal Movable Device Gen 0 0 0 4GB 0 Gen 1 0 1GB 1MB 0 Gen 2 1MB 4GB 1MB 0 Gen 3 1MB 1MB 1MB 0 Now consider there is a GFP_KERNEL allocation request (eligible zone index <= Normal), evict_folios() will return without doing any work since there are no pages to scan in the eligible zones of the oldest generation. Reclaim won't make progress until triggered from a ZONE_MOVABLE allocation request; which may not happen soon if there is a lot of free memory in the movable zone. This can lead to OOM kills, although there is 1GB pages in the Normal zone of Gen 1 that we have not yet tried to reclaim. This issue is not seen in the conventional active/inactive LRU since there are no per-zone lists. If there are no (not enough) folios to scan in the eligible zones, move folios from ineligible zone (zone_index > reclaim_index) to the next generation. This allows for the progression of min_seq and reclaiming from the next generation (Gen 1). Qualcomm, Mediatek and raspberrypi [1] discovered this issue independently. [1] https://github.com/raspberrypi/linux/issues/5395 Link: https://lkml.kernel.org/r/20230802025606.346758-1-kaleshsingh@google.com Fixes: `ac35a49023` ("mm: multi-gen LRU: minimal implementation") Change-Id: I5bbf44bd7ffe42f4347df4be59a75c1603c9b947 Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Reported-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek] Tested-by: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Barrett <steven@liquorix.net> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 1462260adc41c5974362cb54ff577c2a15b8c7b2 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable) Bug: 288383787 Bug: 291719697 Signed-off-by: Kalesh Singh <kaleshsingh@google.com>	2023-08-03 20:45:58 +00:00
Peifeng Li	c3d26e2b5a	ANDROID: vendor_hooks: Add hooks for lookaround Add hooks for support lookaround in memory reclamation. - android_vh_test_clear_look_around_ref - android_vh_check_folio_look_around_ref - android_vh_look_around_migrate_folio - android_vh_look_around Bug: 292051411 Signed-off-by: Peifeng Li <lipeifeng@oppo.com> Change-Id: I9a606ae71d2f1303df3b02403b30bc8fdc9d06dd (cherry picked from commit f50f24e781738c8e5aa9f285d8726202f33107d6) [huzhanyuan: changed page to folio where appropriate]	2023-08-02 21:57:15 +00:00
Todd Kjos	d645236cfd	ANDROID: fix kernelci build failure in vmscan.c Vendor hooks added in vmscan.c directly referenced a vendor-specific field which is only defined if CONFIG_ANDROID_VENDOR_OEM_DATA is enabled. A kernelci config wich CONFIG_ANDROID_VENDOR_OEM_DATA disabled and CONFIG_ANDROID_VENDOR_HOOKS enabled has a build-break due to the undefined field. Fixes: `3e2dc32f59` ("ANDROID: mm: create vendor hooks for memory reclaim") Change-Id: Id7d31af9cf5752eba5ba27c5d31a288230f29114 Signed-off-by: Todd Kjos <tkjos@google.com>	2023-06-09 19:30:11 +00:00
Liujie Xie	3efffff553	ANDROID: Allow vendor module to reclaim a memcg Export try_to_free_mem_cgroup_pages function to allow vendor modules to reclaim a memory cgroup. Bug: 192052083 Signed-off-by: Liujie Xie <xieliujie@oppo.com> (cherry picked from commit a8385d61f27b57d98fb6245a23477c6ed5db4a7c) (cherry picked from commit 1ed025b9a1c8dc1420ccf1a656797b85eacd2bdb) Change-Id: Iec6ef50f5c71c62d0c9aa6de90e56a143dac61c1 Signed-off-by: lvwenhuan <lvwenhuan@oppo.com>	2023-06-08 00:54:10 +00:00
Kalesh Singh	b0375cb69c	BACKPORT: mm: Multi-gen LRU: remove wait_event_killable() Android 14 and later default to MGLRU [1] and field telemetry showed occasional long tail latency (>100ms) in the reclaim path. Tracing revealed priority inversion in the reclaim path. In try_to_inc_max_seq(), when high priority tasks were blocked on wait_event_killable(), the preemption of the low priority task to call wake_up_all() caused those high priority tasks to wait longer than necessary. In general, this problem is not different from others of its kind, e.g., one caused by mutex_lock(). However, it is specific to MGLRU because it introduced the new wait queue lruvec->mm_state.wait. The purpose of this new wait queue is to avoid the thundering herd problem. If many direct reclaimers rush into try_to_inc_max_seq(), only one can succeed, i.e., the one to wake up the rest, and the rest who failed might cause premature OOM kills if they do not wait. So far there is no evidence supporting this scenario, based on how often the wait has been hit. And this begs the question how useful the wait queue is in practice. Based on Minchan's recommendation, which is in line with his commit `6d4675e601` ("mm: don't be stuck to rmap lock on reclaim path") and the rest of the MGLRU code which also uses trylock when possible, remove the wait queue. [1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf Link: https://lkml.kernel.org/r/20230413214326.2147568-1-kaleshsingh@google.com Fixes: `bd74fdaea1` ("mm: multi-gen LRU: support page table walks") Change-Id: I911f3968fd1adb25171279cc5b6f48ccb7efc8de Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Suggested-by: Minchan Kim <minchan@kernel.org> Reported-by: Wei Wang <wvw@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 7f63cf2d9b9bbe7b90f808927558a66ff737d399) Bug: 277906484 [ Kalesh Singh - Fix conflict in mm/vmscan.c (lru_gen_del_mm) ] Signed-off-by: Kalesh Singh <kaleshsingh@google.com>	2023-06-07 14:25:07 +00:00
Dezhi Huang	3e2dc32f59	ANDROID: mm: create vendor hooks for memory reclaim we try to adjust page reclaim operations based on the running task and kernel memory pressure. Thus, we want to create some vendor hooks into kernel6.1. Firstly, we add ADNRROID_VENDOR_DATA into the struct scan_control, special operations would be performed based on this special scan option. We measure the importance of the current process in the system and obtain its weight, which is recorded in ANDROID_VENDOR_DATA. The hook function: trace_android_vh_modify_scan_control is added inside of the function modify_scan_control() to adjust reclaim operations based on memory pressure. The hook function: trace_android_vh_should_continue_reclaim is added inside of the function shrink_node() to decide if page_reclaim would continue or not based on memory pressure. The hook function: trace_android_vh_file_is_tiny_bypass is added into the function prepare_scan_count() to decide if the file pages should be skipped in condition to file refualts and memory pressure. Bug: 279793370 Change-Id: I1efe9d3e866f37b0295c7cd94ec8ca0117a9bd4a Signed-off-by: Dezhi Huang <huangdezhi@hihonor.com>	2023-06-05 16:31:49 +00:00
Kalesh Singh	500484f5be	BACKPORT: FROMGIT: Multi-gen LRU: fix workingset accounting On Android app cycle workloads, MGLRU showed a significant reduction in workingset refaults although pgpgin/pswpin remained relatively unchanged. This indicated MGLRU may be undercounting workingset refaults. This has impact on userspace programs, like Android's LMKD, that monitor workingset refault statistics to detect thrashing. It was found that refaults were only accounted if the MGLRU shadow entry was for a recently evicted folio. However, recently evicted folios should be accounted as workingset activation, and refaults should be accounted regardless of recency. Fix MGLRU's workingset refault and activation accounting to more closely match that of the conventional active/inactive LRU. Link: https://lkml.kernel.org/r/20230523205922.3852731-1-kaleshsingh@google.com Fixes: `ac35a49023` ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Reported-by: Charan Teja Kalla <quic_charante@quicinc.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 02ad728453d2ddb09d7ce5e59854ebb27544d488 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable) Bug: 284043217 [ Kalesh Singh - Fix conflicts in mm/workingset.c ] Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Change-Id: I6d42cca9064e66099fbbc20aa2143961f84b2003	2023-05-27 00:38:36 +00:00
Liujie Xie	6f3353ca09	ANDROID: vendor_hooks: Add hook in shrink_node_memcgs Add vendor hook in shrink_node_memcgs to adjust whether to skip memory reclamation of memcg. Bug: 226482420 Signed-off-by: Liujie Xie <xieliujie@oppo.com> (cherry picked from commit b7ea1c49876197a3b5f17f7bb2699c5594f0b57e) Change-Id: I925856353e63c5a821027de4f8476c833e21b982 Signed-off-by: lvwenhuan <lvwenhuan@oppo.com>	2023-05-25 21:44:09 +00:00
Dezhi Huang	94b540c38d	ANDROID: mm: create vendor hooks for do_shrink_slab() The hook function: trace_android_vh_do_shrink_slab is added inside of the function do_shrink_slab() to changed the numbers of page to be reclaimed from kernel. Bug: 279793370 Change-Id: I7c0b955be97f841c69bc99a152b59ed9823707ed Signed-off-by: Dezhi Huang <huangdezhi@hihonor.com>	2023-05-24 21:12:43 +00:00
Dezhi Huang	da4e60efe1	ANDROID: mm: create vendor hooks for shrink_slab() Trace_android_vh_shrink_slab_bypass is added in the beginning of the function shrink_slab() to bypass kernel page reclaim in some conditons. Bug: 279793370 Change-Id: I6d5c8be28addf43d6fc9d07b5133135641590c3a Signed-off-by: Dezhi Huang <huangdezhi@hihonor.com>	2023-05-24 21:12:43 +00:00
Charan Teja Reddy	88153d9a99	ANDROID: vmscan: Support multiple kswapd threads per node Page replacement is handled in the Linux Kernel in one of two ways: 1) Asynchronously via kswapd 2) Synchronously, via direct reclaim At page allocation time the allocating task is immediately given a page from the zone free list allowing it to go right back to work doing whatever it was doing; Probably directly or indirectly executing business logic. Just prior to satisfying the allocation, free pages is checked to see if it has reached the zone low watermark and if so, kswapd is awakened. Kswapd will start scanning pages looking for inactive pages to evict to make room for new page allocations. The work of kswapd allows tasks to continue allocating memory from their respective zone free list without incurring any delay. When the demand for free pages exceeds the rate that kswapd tasks can supply them, page allocation works differently. Once the allocating task finds that the number of free pages is at or below the zone min watermark, the task will no longer pull pages from the free list. Instead, the task will run the same CPU-bound routines as kswapd to satisfy its own allocation by scanning and evicting pages. This is called a direct reclaim. The time spent performing a direct reclaim can be substantial, often taking tens to hundreds of milliseconds for small order0 allocations to half a second or more for order9 huge-page allocations. In fact, kswapd is not actually required on a linux system. It exists for the sole purpose of optimizing performance by preventing direct reclaims. When memory shortfall is sufficient to trigger direct reclaims, they can occur in any task that is running on the system. A single aggressive memory allocating task can set the stage for collateral damage to occur in small tasks that rarely allocate additional memory. Consider the impact of injecting an additional 100ms of latency when nscd allocates memory to facilitate caching of a DNS query. The presence of direct reclaims 10 years ago was a fairly reliable indicator that too much was being asked of a Linux system. Kswapd was likely wasting time scanning pages that were ineligible for eviction. Adding RAM or reducing the working set size would usually make the problem go away. Since then hardware has evolved to bring a new struggle for kswapd. Storage speeds have increased by orders of magnitude while CPU clock speeds stayed the same or even slowed down in exchange for more cores per package. This presents a throughput problem for a single threaded kswapd that will get worse with each generation of new hardware. Test Details NOTE: The tests below were run with shadow entries disabled. See the associated patch and cover letter for details The tests below were designed with the assumption that a kswapd bottleneck is best demonstrated using filesystem reads. This way, the inactive list will be full of clean pages, simplifying the analysis and allowing kswapd to achieve the highest possible steal rate. Maximum steal rates for kswapd are likely to be the same or lower for any other mix of page types on the system. Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores, 756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has an XFS file system mounted separately as /d0 through /d7. SSD drives require multiple concurrent streams to show their potential, so I created eleven 250GB zero-filled files on each drive so that I could test with parallel reads. The test script runs in multiple stages. At each stage, the number of dd tasks run concurrently is increased by 2. I did not include all of the test output for brevity. During each stage dd tasks are launched to read from each drive in a round robin fashion until the specified number of tasks for the stage has been reached. Then iostat, vmstat and top are started in the background with 10 second intervals. After five minutes, all of the dd tasks are killed and the iostat, vmstat and top output is parsed in order to report the following: CPU consumption - sy - aggregate kernel mode CPU consumption from vmstat output. The value doesn't tend to fluctuate much so I just grab the highest value. Each sample is averaged over 10 seconds - dd_cpu - for all of the dd tasks averaged across the top samples since there is a lot of variation. Throughput - in Kbytes - Command is iostat -x -d 10 -g total This first test performs reads using O_DIRECT in order to show the maximum throughput that can be obtained using these drives. It also demonstrates how rapidly throughput scales as the number of dd tasks are increased. The dd command for this test looks like this: Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M Test #1: Direct IO dd sy dd_cpu throughput 6 0 2.33 14726026.40 10 1 2.95 19954974.80 16 1 2.63 24419689.30 22 1 2.63 25430303.20 28 1 2.91 26026513.20 34 1 2.53 26178618.00 40 1 2.18 26239229.20 46 1 1.91 26250550.40 52 1 1.69 26251845.60 58 1 1.54 26253205.60 64 1 1.43 26253780.80 70 1 1.31 26254154.80 76 1 1.21 26253660.80 82 1 1.12 26254214.80 88 1 1.07 26253770.00 90 1 1.04 26252406.40 Throughput was close to peak with only 22 dd tasks. Very little system CPU was consumed as expected as the drives DMA directly into the user address space when using direct IO. In this next test, the iflag=direct option is removed and we only run the test until the pgscan_kswapd from /proc/vmstat starts to increment. At that point metrics are parsed and reported and the pagecache contents are dropped prior to the next test. Lather, rinse, repeat. Test #2: standard file system IO, no page replacement dd sy dd_cpu throughput 6 2 28.78 5134316.40 10 3 31.40 8051218.40 16 5 34.73 11438106.80 22 7 33.65 14140596.40 28 8 31.24 16393455.20 34 10 29.88 18219463.60 40 11 28.33 19644159.60 46 11 25.05 20802497.60 52 13 26.92 22092370.00 58 13 23.29 22884881.20 64 14 23.12 23452248.80 70 15 22.40 23916468.00 76 16 22.06 24328737.20 82 17 20.97 24718693.20 88 16 18.57 25149404.40 90 16 18.31 25245565.60 Each read has to pause after the buffer in kernel space is populated while those pages are added to the pagecache and copied into the user address space. For this reason, more parallel streams are required to achieve peak throughput. The copy operation consumes substantially more CPU than direct IO as expected. The next test measures throughput after kswapd starts running. This is the same test only we wait for kswapd to wake up before we start collecting metrics. The script actually keeps track of a few things that were not mentioned earlier. It tracks direct reclaims and page scans by watching the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the same way it is tracked for dd. Since the test is 100% reads, you can assume that the page steal rate for kswapd and direct reclaims is almost identical to the scan rate. Test #3: 1 kswapd thread per node dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct 10 4 26.07 28.56 27.03 7355924.40 0 459316976 0 16 7 34.94 69.33 69.66 10867895.20 0 872661643 0 22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334 28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142 34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515 40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710 46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973 52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458 58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540 64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337 70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436 76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384 82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008 88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208 90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901 In the previous test where kswapd was not involved, the system-wide kernel mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA node), kswapd can only be responsible for a little over 4% of the increase. The rest is likely caused by 51,618 direct reclaims that scanned 1.2 billion pages over the five minute time period of the test. Same test, more kswapd tasks: Test #4: 4 kswapd threads per node dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct 10 5 27.09 16.65 14.17 7842605.60 0 459105291 0 16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515 22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0 28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0 34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0 40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0 46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0 52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0 58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0 64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821 70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159 76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763 82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704 88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202 90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615 By increasing the number of kswapd threads, throughput increased by ~50% while kernel mode CPU utilization decreased or stayed the same, likely due to a decrease in the number of parallel tasks at any given time doing page replacement. Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com> Bug: 201263306 Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com [charante@codeaurora.org]: Changes made to select number of kswapds through uapi Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> [quic_vjitta@quicinc.com]: Changes made to move multiple kswapd threads logic to vendor hooks Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com> (cherry picked from commit 0d61a651e4dd3c61d1658cc92e0b0450c8374738) Change-Id: I8425cab7f40cbeaf65af0ea118c1a9ac7da0930e [quic_vjitta@quicinc.com]: Resolved minor merge conflicts Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>	2023-04-26 17:01:51 +00:00
Vijayanand Jitta	d167f5b990	ANDROID: mm: Export kswapd function To support multiple kswap threads vendor modules need access to kswapd function. So, export it. Bug: 201263306 Change-Id: I442612710835f39836a295e9d1936f86826ab960 Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com> (cherry picked from commit 12972dd7bfa306aa07c92966c4efe7b1c0c5e043)	2023-04-26 17:01:51 +00:00
T.J. Alumbaugh	451d7c42ea	UPSTREAM: mm: multi-gen LRU: simplify lru_gen_look_around() Update the folio generation in place with or without current->reclaim_state->mm_walk. The LRU lock is held for longer, if mm_walk is NULL and the number of folios to update is more than PAGEVEC_SIZE. This causes a measurable regression from the LRU lock contention during a microbencmark. But a tiny regression is not worth the complexity. Link: https://lkml.kernel.org/r/20230118001827.1040870-8-talumbau@google.com Change-Id: I9ce18b4f4062e6c1c13c98ece9422478eb8e1846 Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit abf086721a2f1e6897c57796f7268df1b194c750) Bug: 274865848 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-04-12 16:02:15 +00:00
T.J. Alumbaugh	fae7f9ea58	UPSTREAM: mm: multi-gen LRU: improve walk_pmd_range() Improve readability of walk_pmd_range() and walk_pmd_range_locked(). Link: https://lkml.kernel.org/r/20230118001827.1040870-7-talumbau@google.com Change-Id: Ia084fbf53fe989673b7804ca8ca520af12d7d52a Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit b5ff4133617d0eced35b685da0bd0929dd9fabb7) Bug: 274865848 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-04-12 16:02:15 +00:00
T.J. Alumbaugh	24307a538b	UPSTREAM: mm: multi-gen LRU: improve lru_gen_exit_memcg() Add warnings and poison ->next. Link: https://lkml.kernel.org/r/20230118001827.1040870-6-talumbau@google.com Change-Id: I53de9e04c1ae941e122b33cd45d2bbb5f34aae0c Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 37cc99979d04cca677c0ad5c0acd1149ec165d1b) Bug: 274865848 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-04-12 16:02:15 +00:00
T.J. Alumbaugh	e1cf082319	UPSTREAM: mm: multi-gen LRU: section for memcg LRU Move memcg LRU code into a dedicated section. Improve the design doc to outline its architecture. Link: https://lkml.kernel.org/r/20230118001827.1040870-5-talumbau@google.com Change-Id: Id252e420cff7a858acb098cf2b3642da5c40f602 Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 36c7b4db7c942ae9e1b111f0c6b468c8b2e33842) Bug: 274865848 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-04-12 16:02:15 +00:00
T.J. Alumbaugh	282363eb6f	UPSTREAM: mm: multi-gen LRU: section for Bloom filters Move Bloom filters code into a dedicated section. Improve the design doc to explain Bloom filter usage and connection between aging and eviction in their use. Link: https://lkml.kernel.org/r/20230118001827.1040870-4-talumbau@google.com Change-Id: I73e866f687c1ed9f5c8538086aa39408b79897db Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit ccbbbb85945d8f0255aa9dbc1b617017e2294f2c) Bug: 274865848 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-04-12 16:02:15 +00:00
T.J. Alumbaugh	4d8cf6f6f0	UPSTREAM: mm: multi-gen LRU: section for rmap/PT walk feedback Add a section for lru_gen_look_around() in the code and the design doc. Link: https://lkml.kernel.org/r/20230118001827.1040870-3-talumbau@google.com Change-Id: I5097af63f61b3b69ec2abee6cdbdc33c296df213 Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit db19a43d9b3a8876552f00f656008206ef9a5efa) Bug: 274865848 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-04-12 16:02:15 +00:00
T.J. Alumbaugh	014c372cc3	UPSTREAM: mm: multi-gen LRU: section for working set protection Patch series "mm: multi-gen LRU: improve". This patch series improves a few MGLRU functions, collects related functions, and adds additional documentation. This patch (of 7): Add a section for working set protection in the code and the design doc. The admin doc already contains its usage. Link: https://lkml.kernel.org/r/20230118001827.1040870-1-talumbau@google.com Link: https://lkml.kernel.org/r/20230118001827.1040870-2-talumbau@google.com Change-Id: I65599075fd42951db7739a2ab7cee78516e157b3 Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 7b8144e63d84716f16a1b929e0c7e03ae5c4d5c1) Bug: 274865848 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-04-12 16:02:15 +00:00
Yu Zhao	6ddfdb3d53	UPSTREAM: mm: add vma_has_recency() Add vma_has_recency() to indicate whether a VMA may exhibit temporal locality that the LRU algorithm relies on. This function returns false for VMAs marked by VM_SEQ_READ or VM_RAND_READ. While the former flag indicates linear access, i.e., a special case of spatial locality, both flags indicate a lack of temporal locality, i.e., the reuse of an area within a relatively small duration. "Recency" is chosen over "locality" to avoid confusion between temporal and spatial localities. Before this patch, the active/inactive LRU only ignored the accessed bit from VMAs marked by VM_SEQ_READ. After this patch, the active/inactive LRU and MGLRU share the same logic: they both ignore the accessed bit if vma_has_recency() returns false. For the active/inactive LRU, the following fio test showed a [6, 8]% increase in IOPS when randomly accessing mapped files under memory pressure. kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo) kb=$((kb - 810241024)) modprobe brd rd_nr=1 rd_size=$kb dd if=/dev/zero of=/dev/ram0 bs=1M mkfs.ext4 /dev/ram0 mount /dev/ram0 /mnt/ swapoff -a fio --name=test --directory=/mnt/ --ioengine=mmap --numjobs=8 \ --size=8G --rw=randrw --time_based --runtime=10m \ --group_reporting The discussion that led to this patch is here [1]. Additional test results are available in that thread. [1] https://lore.kernel.org/r/Y31s%2FK8T85jh05wH@google.com/ Link: https://lkml.kernel.org/r/20221230215252.2628425-1-yuzhao@google.com Change-Id: I291dcb795197659e40e46539cd32b857677c34ad Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Righi <andrea.righi@canonical.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michael Larabel <Michael@MichaelLarabel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 8788f6781486769d9598dcaedc3fe0eb12fc3e59) Bug: 274865848 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-04-12 16:02:15 +00:00
Kalesh Singh	ae678a47ee	ANDROID: MGLRU: Avoid reactivation of anon pages on swap full Avoid anon reclaim if swapping full since this reactivates the pages. Bug: 261619133 Bug: 276521916 Change-Id: Ia3af7fe8d5b29405830a812e73f95d11a0f8ee3a Signed-off-by: Kalesh Singh <kaleshsingh@google.com>	2023-04-05 07:19:07 +00:00
Yu Zhao	3792ff78b1	UPSTREAM: mm: multi-gen LRU: avoid futile retries Recall that the per-node memcg LRU has two generations and they alternate when the last memcg (of a given node) is moved from one to the other. Each generation is also sharded into multiple bins to improve scalability. A reclaimer starts with a random bin (in the old generation) and, if it fails, it will retry, i.e., to try the rest of the bins. If a reclaimer fails with the last memcg, it should move this memcg to the young generation first, which causes the generations to alternate, and then retry. Otherwise, the retries will be futile because all other bins are empty. Link: https://lkml.kernel.org/r/20230213075322.1416966-1-yuzhao@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bug: 274865848 (cherry picked from commit 9f550d78b40da21b4da515db4c37d8d7b12aa1a6) Change-Id: Ie92535676b005ec9e7987632b742fdde8d54436f Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-03-30 10:23:50 +00:00
Yu Zhao	599cea335f	UPSTREAM: mm: multi-gen LRU: simplify arch_has_hw_pte_young() check Scanning page tables when hardware does not set the accessed bit has no real use cases. Link: https://lkml.kernel.org/r/20221222041905.2431096-9-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bug: 274865848 (cherry picked from commit f386e9314025ea99dae639ed2032560a92081430) Change-Id: I84d97ab665b4e3bb862a9bc7d72f50dea7191a6b Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-03-30 10:23:50 +00:00
Yu Zhao	6573287e54	BACKPORT: mm: multi-gen LRU: clarify scan_control flags Among the flags in scan_control: 1. sc->may_swap, which indicates swap constraint due to memsw.max, is supported as usual. 2. sc->proactive, which indicates reclaim by memory.reclaim, may not opportunistically skip the aging path, since it is considered less latency sensitive. 3. !(sc->gfp_mask & __GFP_IO), which indicates IO constraint, lowers swappiness to prioritize file LRU, since clean file folios are more likely to exist. 4. sc->may_writepage and sc->may_unmap, which indicates opportunistic reclaim, are rejected, since unmapped clean folios are already prioritized. Scanning for more of them is likely futile and can cause high reclaim latency when there is a large number of memcgs. The rest are handled by the existing code. Link: https://lkml.kernel.org/r/20221222041905.2431096-8-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bug: 274865848 (cherry picked from commit e9d4e1ee788097484606c32122f146d802a9c5fb) [TJ: Resolved conflict with older function signature for min_cgroup_below_min, and over `cdded86118` ("ANDROID: MGLRU: Don't skip anon reclaim if swap low")] Change-Id: Ic2e779eaf4e91a3921831b4e2fa10c740dc59d50 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-03-30 10:23:50 +00:00
Yu Zhao	014f97baad	BACKPORT: mm: multi-gen LRU: per-node lru_gen_folio lists For each node, memcgs are divided into two generations: the old and the young. For each generation, memcgs are randomly sharded into multiple bins to improve scalability. For each bin, an RCU hlist_nulls is virtually divided into three segments: the head, the tail and the default. An onlining memcg is added to the tail of a random bin in the old generation. The eviction starts at the head of a random bin in the old generation. The per-node memcg generation counter, whose reminder (mod 2) indexes the old generation, is incremented when all its bins become empty. There are four operations: 1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in its current generation (old or young) and updates its "seg" to "head"; 2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in its current generation (old or young) and updates its "seg" to "tail"; 3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in the old generation, updates its "gen" to "old" and resets its "seg" to "default"; 4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin in the young generation, updates its "gen" to "young" and resets its "seg" to "default". The events that trigger the above operations are: 1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD; 2. The first attempt to reclaim an memcg below low, which triggers MEMCG_LRU_TAIL; 3. The first attempt to reclaim an memcg below reclaimable size threshold, which triggers MEMCG_LRU_TAIL; 4. The second attempt to reclaim an memcg below reclaimable size threshold, which triggers MEMCG_LRU_YOUNG; 5. Attempting to reclaim an memcg below min, which triggers MEMCG_LRU_YOUNG; 6. Finishing the aging on the eviction path, which triggers MEMCG_LRU_YOUNG; 7. Offlining an memcg, which triggers MEMCG_LRU_OLD. Note that memcg LRU only applies to global reclaim, and the round-robin incrementing of their max_seq counters ensures the eventual fairness to all eligible memcgs. For memcg reclaim, it still relies on mem_cgroup_iter(). Link: https://lkml.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bug: 274865848 (cherry picked from commit e4dde56cd208674ce899b47589f263499e5b8cdc) [TJ: Resolved conflicts with older function signatures for min_cgroup_below_min / min_cgroup_below_low and includes] Change-Id: Idc8a0f635e035d72dd911f807d1224cb47cbd655 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-03-30 10:23:50 +00:00
Yu Zhao	8d0d562dd3	UPSTREAM: mm: multi-gen LRU: shuffle should_run_aging() Move should_run_aging() next to its only caller left. Link: https://lkml.kernel.org/r/20221222041905.2431096-6-yuzhao@google.com Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bug: 274865848 (cherry picked from commit 77d4459a4a1a472b7309e475f962dda87d950abd) Signed-off-by: T.J. Mercier <tjmercier@google.com> Change-Id: I3b0383fe16b93a783b4d8c0b3a0b325160392576 Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-03-30 10:23:50 +00:00
Yu Zhao	5e9fed8a03	BACKPORT: mm: multi-gen LRU: remove aging fairness safeguard Recall that the aging produces the youngest generation: first it scans for accessed folios and updates their gen counters; then it increments lrugen->max_seq. The current aging fairness safeguard for kswapd uses two passes to ensure the fairness to multiple eligible memcgs. On the first pass, which is shared with the eviction, it checks whether all eligible memcgs are low on cold folios. If so, it requires a second pass, on which it ages all those memcgs at the same time. With memcg LRU, the aging, while ensuring eventual fairness, will run when necessary. Therefore the current aging fairness safeguard for kswapd will not be needed. Note that memcg LRU only applies to global reclaim. For memcg reclaim, the aging can be unfair to different memcgs, i.e., their lrugen->max_seq can be incremented at different paces. Link: https://lkml.kernel.org/r/20221222041905.2431096-5-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bug: 274865848 (cherry picked from commit 7348cc91821b0cb24dfb00e578047f68299a50ab) [TJ: Resolved conflicts with older function signatures for min_cgroup_below_min / min_cgroup_below_low] Change-Id: I6e36ecfbaaefbc0a56d9a9d5d7cbe404ed7f57a5 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-03-30 10:23:50 +00:00
Yu Zhao	ce1cc5d887	UPSTREAM: mm: multi-gen LRU: remove eviction fairness safeguard Recall that the eviction consumes the oldest generation: first it bucket-sorts folios whose gen counters were updated by the aging and reclaims the rest; then it increments lrugen->min_seq. The current eviction fairness safeguard for global reclaim has a dilemma: when there are multiple eligible memcgs, should it continue or stop upon meeting the reclaim goal? If it continues, it overshoots and increases direct reclaim latency; if it stops, it loses fairness between memcgs it has taken memory away from and those it has yet to. With memcg LRU, the eviction, while ensuring eventual fairness, will stop upon meeting its goal. Therefore the current eviction fairness safeguard for global reclaim will not be needed. Note that memcg LRU only applies to global reclaim. For memcg reclaim, the eviction will continue, even if it is overshooting. This becomes unconditional due to code simplification. Link: https://lkml.kernel.org/r/20221222041905.2431096-4-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bug: 274865848 (cherry picked from commit a579086c99ed70cc4bfc104348dbe3dd8f2787e6) Change-Id: I08ac1b3c90e29cafd0566785aaa4bcdb5db7d22c Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-03-30 10:23:50 +00:00
Yu Zhao	0f410148ae	UPSTREAM: mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[] lru_gen_folio will be chained into per-node lists by the coming lrugen->list. Link: https://lkml.kernel.org/r/20221222041905.2431096-3-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bug: 274865848 (cherry picked from commit 6df1b2212950aae2b2188c6645ea18e2a9e3fdd5) Change-Id: I09f53e0fb2cd6b8b3adbb8a80b15dc5efbeae857 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-03-30 10:23:50 +00:00
Yu Zhao	3f963725af	UPSTREAM: mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio Patch series "mm: multi-gen LRU: memcg LRU", v3. Overview ======== An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs, since each node and memcg combination has an LRU of folios (see mem_cgroup_lruvec()). Its goal is to improve the scalability of global reclaim, which is critical to system-wide memory overcommit in data centers. Note that memcg reclaim is currently out of scope. Its memory bloat is a pointer to each lruvec and negligible to each pglist_data. In terms of traversing memcgs during global reclaim, it improves the best-case complexity from O(n) to O(1) and does not affect the worst-case complexity O(n). Therefore, on average, it has a sublinear complexity in contrast to the current linear complexity. The basic structure of an memcg LRU can be understood by an analogy to the active/inactive LRU (of folios): 1. It has the young and the old (generations), i.e., the counterparts to the active and the inactive; 2. The increment of max_seq triggers promotion, i.e., the counterpart to activation; 3. Other events trigger similar operations, e.g., offlining an memcg triggers demotion, i.e., the counterpart to deactivation. In terms of global reclaim, it has two distinct features: 1. Sharding, which allows each thread to start at a random memcg (in the old generation) and improves parallelism; 2. Eventual fairness, which allows direct reclaim to bail out at will and reduces latency without affecting fairness over some time. The commit message in patch 6 details the workflow: https://lore.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com/ The following is a simple test to quickly verify its effectiveness. Test design: 1. Create multiple memcgs. 2. Each memcg contains a job (fio). 3. All jobs access the same amount of memory randomly. 4. The system does not experience global memory pressure. 5. Periodically write to the root memory.reclaim. Desired outcome: 1. All memcgs have similar pgsteal counts, i.e., stddev(pgsteal) over mean(pgsteal) is close to 0%. 2. The total pgsteal is close to the total requested through memory.reclaim, i.e., sum(pgsteal) over sum(requested) is close to 100%. Actual outcome [1]: MGLRU off MGLRU on stddev(pgsteal) / mean(pgsteal) 75% 20% sum(pgsteal) / sum(requested) 425% 95% #################################################################### MEMCGS=128 for ((memcg = 0; memcg < $MEMCGS; memcg++)); do mkdir /sys/fs/cgroup/memcg$memcg done start() { echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs fio -name=memcg$memcg --numjobs=1 --ioengine=mmap \ --filename=/dev/zero --size=1920M --rw=randrw \ --rate=64m,64m --random_distribution=random \ --fadvise_hint=0 --time_based --runtime=10h \ --group_reporting --minimal } for ((memcg = 0; memcg < $MEMCGS; memcg++)); do start & done sleep 600 for ((i = 0; i < 600; i++)); do echo 256m >/sys/fs/cgroup/memory.reclaim sleep 6 done for ((memcg = 0; memcg < $MEMCGS; memcg++)); do grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat done #################################################################### [1]: This was obtained from running the above script (touches less than 256GB memory) on an EPYC 7B13 with 512GB DRAM for over an hour. This patch (of 8): The new name lru_gen_folio will be more distinct from the coming lru_gen_memcg. Link: https://lkml.kernel.org/r/20221222041905.2431096-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20221222041905.2431096-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Bug: 274865848 (cherry picked from commit 391655fe08d1f942359a11148aa9aaf3f99d6d6f) Change-Id: I7df67e0e2435ba28f10eaa57d28d98b61a9210a6 Signed-off-by: T.J. Mercier <tjmercier@google.com>	2023-03-30 10:23:50 +00:00
Kalesh Singh	cdded86118	ANDROID: MGLRU: Don't skip anon reclaim if swap low MGLRU tries to avoid doing unnecessary anon reclaim work if swap is low by checking if the available swap is less than the MIN_BATCH_SIZE (256kB). This can lead to unintened consequences where PSI pressure is less and LMKD doesn't wake up in time to avoid file cache thrashing. Remove this check to preserve the old bahavior. This can be improved later on once we have a low swap notification event from the kernel. Bug: 268574308 Change-Id: Id381316931a9cf6e7ea8b1ea7800c77f176c9892 Signed-off-by: Kalesh Singh <kaleshsingh@google.com>	2023-02-22 17:17:07 +00:00
Greg Kroah-Hartman	dafc2fae4d	This is the 6.1.13 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmP2A8MACgkQONu9yGCS aT7GrhAAky2nTRG9J0oPxh5Eu7wNKmjqDWNj9c6it3iGHpb+tfOY+LfPXHmWz0kX NoaNYGZGD8SDbkmwrSOmFB1Q/0OZ4/aIwM7Kwcw72UJVvrlsKx1HwiJjXKk809ZL bVlLUQzFTwyVIYcvjXQ8CuBHwBinLc3qkcyYGgbS8bseR4pDuxwoToDwAxk1d/0j ozWuzUKhSdYHYIUrk3papUro2UpF+Kb7KFpNiVo2wMaZM7en2XK3khCt8TuojH6c DXL+KZ/HbB8Ig1PWLaw2/6o4ispNy6bz7CJx6oDiOILR+le8xZA5WTdkXT3ovjyr LxutmPTTw6PxextIyVRblJWzXNcjdlV552U4gnnngcWn6wg4D4otqYnYvTaAUc+u sQnwrlQFxB2KfFKLNepGAy7klQJsYP3eadjDgGXP9TSmuUvUYRaNr6h0XukbyYkc kx2+Tw51NMKEqhgnaiKDN8AZEDTuLu5F4+NrUertxlb3PWeRRMRYVGJ1uw0KJg6t d5eniCB00SaaqN6M68u/hRYRi3gnwIsU7DitEpqejqwzskMpgegMFvebmCwORiq3 D+FD4EHOlztIToXhmEOXp0cz8fs27MuWmq4GkSwXvJuq+id5cQFdDN5GeLgNdAvH Kiu/Y+DY6ObW31tAQ1Jjp20L2RaWWvubrCBGeIqiDzUWmCohsks= =TXvc -----END PGP SIGNATURE----- Merge 6.1.13 into android14-6.1 Changes in 6.1.13 mptcp: sockopt: make 'tcp_fastopen_connect' generic mptcp: fix locking for setsockopt corner-case mptcp: deduplicate error paths on endpoint creation mptcp: fix locking for in-kernel listener creation btrfs: move the auto defrag code to defrag.c btrfs: lock the inode in shared mode before starting fiemap ASoC: amd: yc: Add DMI support for new acer/emdoor platforms ASoC: SOF: sof-audio: start with the right widget type ALSA: usb-audio: Add FIXED_RATE quirk for JBL Quantum610 Wireless ASoC: Intel: sof_rt5682: always set dpcm_capture for amplifiers ASoC: Intel: sof_cs42l42: always set dpcm_capture for amplifiers ASoC: Intel: sof_nau8825: always set dpcm_capture for amplifiers ASoC: Intel: sof_ssp_amp: always set dpcm_capture for amplifiers selftests/bpf: Verify copy_register_state() preserves parent/live fields ALSA: hda: Do not unset preset when cleaning up codec ASoC: amd: yc: Add Xiaomi Redmi Book Pro 15 2022 into DMI table bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself ASoC: cs42l56: fix DT probe tools/virtio: fix the vringh test for virtio ring changes vdpa: ifcvf: Do proper cleanup if IFCVF init fails net/rose: Fix to not accept on connected socket selftest: net: Improve IPV6_TCLASS/IPV6_HOPLIMIT tests apparmor compatibility net: stmmac: do not stop RX_CLK in Rx LPI state for qcs404 SoC powerpc/64: Fix perf profiling asynchronous interrupt handlers fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work() drm/nouveau/devinit/tu102-: wait for GFW_BOOT_PROGRESS == COMPLETED net: ethernet: mtk_eth_soc: Avoid truncating allocation net: sched: sch: Bounds check priority s390/decompressor: specify __decompress() buf len to avoid overflow nvme-fc: fix a missing queue put in nvmet_fc_ls_create_association nvme: clear the request_queue pointers on failure in nvme_alloc_admin_tag_set nvme: clear the request_queue pointers on failure in nvme_alloc_io_tag_set drm/amd/display: Add missing brackets in calculation drm/amd/display: Adjust downscaling limits for dcn314 drm/amd/display: Unassign does_plane_fit_in_mall function from dcn3.2 drm/amd/display: Reset DMUB mailbox SW state after HW reset drm/amdgpu: enable HDP SD for gfx 11.0.3 drm/amdgpu: Enable vclk dclk node for gc11.0.3 drm/amd/display: Properly handle additional cases where DCN is not supported platform/x86: touchscreen_dmi: Add Chuwi Vi8 (CWI501) DMI match ceph: move mount state enum to super.h ceph: blocklist the kclient when receiving corrupted snap trace selftests: mptcp: userspace: fix v4-v6 test in v6.1 of: reserved_mem: Have kmemleak ignore dynamically allocated reserved mem kasan: fix Oops due to missing calls to kasan_arch_is_ready() mm: shrinkers: fix deadlock in shrinker debugfs aio: fix mremap after fork null-deref vmxnet3: move rss code block under eop descriptor fbdev: Fix invalid page access after closing deferred I/O devices drm: Disable dynamic debug as broken drm/amd/amdgpu: fix warning during suspend drm/amd/display: Fail atomic_check early on normalize_zpos error drm/vmwgfx: Stop accessing buffer objects which failed init drm/vmwgfx: Do not drop the reference to the handle too soon mmc: jz4740: Work around bug on JZ4760(B) mmc: meson-gx: fix SDIO mode if cap_sdio_irq isn't set mmc: sdio: fix possible resource leaks in some error paths mmc: mmc_spi: fix error handling in mmc_spi_probe() ALSA: hda: Fix codec device field initializan ALSA: hda/conexant: add a new hda codec SN6180 ALSA: hda/realtek - fixed wrong gpio assigned ALSA: hda/realtek: fix mute/micmute LEDs don't work for a HP platform. ALSA: hda/realtek: Enable mute/micmute LEDs and speaker support for HP Laptops ata: ahci: Add Tiger Lake UP{3,4} AHCI controller ata: libata-core: Disable READ LOG DMA EXT for Samsung MZ7LH sched/psi: Fix use-after-free in ep_remove_wait_queue() hugetlb: check for undefined shift on 32 bit architectures nilfs2: fix underflow in second superblock position calculations mm/MADV_COLLAPSE: set EAGAIN on unexpected page refcount mm/filemap: fix page end in filemap_get_read_batch mm/migrate: fix wrongly apply write bit after mkdirty on sparc64 gpio: sim: fix a memory leak freezer,umh: Fix call_usermode_helper_exec() vs SIGKILL coredump: Move dump_emit_page() to kill unused warning Revert "mm: Always release pages to the buddy allocator in memblock_free_late()." net: Fix unwanted sign extension in netdev_stats_to_stats64() revert "squashfs: harden sanity check in squashfs_read_xattr_id_table" drm/vc4: crtc: Increase setup cost in core clock calculation to handle extreme reduced blanking drm/vc4: Fix YUV plane handling when planes are in different buffers drm/i915/gen11: Wa_1408615072/Wa_1407596294 should be on GT list ice: fix lost multicast packets in promisc mode ixgbe: allow to increase MTU to 3K with XDP enabled i40e: add double of VLAN header when computing the max MTU net: bgmac: fix BCM5358 support by setting correct flags net: ethernet: ti: am65-cpsw: Add RX DMA Channel Teardown Quirk sctp: sctp_sock_filter(): avoid list_entry() on possibly empty list net/sched: tcindex: update imperfect hash filters respecting rcu ice: xsk: Fix cleaning of XDP_TX frames dccp/tcp: Avoid negative sk_forward_alloc by ipv6_pinfo.pktoptions. net/usb: kalmia: Don't pass act_len in usb_bulk_msg error path net/sched: act_ctinfo: use percpu stats net: openvswitch: fix possible memory leak in ovs_meter_cmd_set() net: stmmac: fix order of dwmac5 FlexPPS parametrization sequence bnxt_en: Fix mqprio and XDP ring checking logic tracing: Make trace_define_field_ext() static net: stmmac: Restrict warning on disabling DMA store and fwd mode net: use a bounce buffer for copying skb->mark tipc: fix kernel warning when sending SYN message net: mpls: fix stale pointer if allocation fails during device rename igb: conditionalize I2C bit banging on external thermal sensor support igb: Fix PPS input and output using 3rd and 4th SDP ixgbe: add double of VLAN header when computing the max MTU ipv6: Fix datagram socket connection with DSCP. ipv6: Fix tcp socket connection with DSCP. mm/gup: add folio to list when folio_isolate_lru() succeed mm: extend max struct page size for kmsan i40e: Add checking for null for nlmsg_find_attr() net/sched: tcindex: search key must be 16 bits nvme-tcp: stop auth work after tearing down queues in error recovery nvme-rdma: stop auth work after tearing down queues in error recovery KVM: x86/pmu: Disable vPMU support on hybrid CPUs (host PMUs) kvm: initialize all of the kvm_debugregs structure before sending it to userspace perf/x86: Refuse to export capabilities for hybrid PMUs alarmtimer: Prevent starvation by small intervals and SIG_IGN nvme-pci: refresh visible attrs for cmb attributes ASoC: SOF: Intel: hda-dai: fix possible stream_tag leak net: sched: sch: Fix off by one in htb_activate_prios() Linux 6.1.13 Change-Id: I8a1e4175939c14f726c545001061b95462566386 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2023-02-22 12:32:41 +00:00
Qi Zheng	86e3baf6a6	mm: shrinkers: fix deadlock in shrinker debugfs commit badc28d4924bfed73efc93f716a0c3aa3afbdf6f upstream. The debugfs_remove_recursive() is invoked by unregister_shrinker(), which is holding the write lock of shrinker_rwsem. It will waits for the handler of debugfs file complete. The handler also needs to hold the read lock of shrinker_rwsem to do something. So it may cause the following deadlock: CPU0 CPU1 debugfs_file_get() shrinker_debugfs_count_show()/shrinker_debugfs_scan_write() unregister_shrinker() --> down_write(&shrinker_rwsem); debugfs_remove_recursive() // wait for (A) --> wait_for_completion(); // wait for (B) --> down_read_killable(&shrinker_rwsem) debugfs_file_put() -- (A) up_write() -- (B) The down_read_killable() can be killed, so that the above deadlock can be recovered. But it still requires an extra kill action, otherwise it will block all subsequent shrinker-related operations, so it's better to fix it. [akpm@linux-foundation.org: fix CONFIG_SHRINKER_DEBUG=n stub] Link: https://lkml.kernel.org/r/20230202105612.64641-1-zhengqi.arch@bytedance.com Fixes: `5035ebc644` ("mm: shrinkers: introduce debugfs interface for memory shrinkers") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-02-22 12:59:46 +01:00
Greg Kroah-Hartman	c747c01851	This is the 6.1.11 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmPkywAACgkQONu9yGCS aT42Kw/9FFrdwv29yND651dPIglYKgO0Oz27/LFNGqst1A/G1ITzfs/94NSRr+9j uvwmBLbC+n/OXYavliBVWlPaYUCLqoFSfR+q953yz/UT0803E8BUvQ8NN8O7lsg7 hfbWJaASxt5puy2pBFypeWM+OXoVOvUBj3VhbgtUwwcYLPuYafj9rCAytdIIf5fr RKWBLfx7As4OJ+Hb3KNkolTkFDTfV5+zqCAc9Ko474d1bpRnF15UdQN8Kkinr2+O YNGTvDT8jR8eAk/9PiCNrG7DEMSKaczP8n/ap6PikD/KnK7ShtCLwZztLnmu65g1 vZG+cnEda8FuY3Ms03UrHhKqzMzBY/vslzBNMBTNmDsr+b7ilhffAYXPKS8s7xrg bJjmfzfITFAjXrml25enVO0V9RtTxv6E07U7SnDrLsvE2KBFZfUR/3Xl70bVBb0S db60kmEoq3XHHtoVySOHlfihVHSy02V9dlFcLOYMQsDHsGVsRXOR87g6d7+rJS3h hYWz5YxMLJUr2qn2836DPBnX9Ix0VjDx+X2fB4bNYzKc1dMlgzbpYrhk9LEOUDsx emJuqZskjkLby9Bw36N3eHW3fKPOFrwpYwPWYJHdWx1mmFSNdV6MdfEtZXpuEkFJ iFyJPeeODGadoiznnXTaBFfhozRj+B6FXrY6pkF+WMoSt8ZlZpM= =vu7j -----END PGP SIGNATURE----- Merge 6.1.11 into android14-6.1 Changes in 6.1.11 firewire: fix memory leak for payload of request subaction to IEC 61883-1 FCP region bus: sunxi-rsb: Fix error handling in sunxi_rsb_init() arm64: dts: imx8m-venice: Remove incorrect 'uart-has-rtscts' arm64: dts: freescale: imx8dxl: fix sc_pwrkey's property name linux,keycode ASoC: amd: acp-es8336: Drop reference count of ACPI device after use ASoC: Intel: bytcht_es8316: Drop reference count of ACPI device after use ASoC: Intel: bytcr_rt5651: Drop reference count of ACPI device after use ASoC: Intel: bytcr_rt5640: Drop reference count of ACPI device after use ASoC: Intel: bytcr_wm5102: Drop reference count of ACPI device after use ASoC: Intel: sof_es8336: Drop reference count of ACPI device after use ASoC: Intel: avs: Implement PCI shutdown bpf: Fix off-by-one error in bpf_mem_cache_idx() bpf: Fix a possible task gone issue with bpf_send_signal[_thread]() helpers ALSA: hda/via: Avoid potential array out-of-bound in add_secret_dac_path() bpf: Fix to preserve reg parent/live fields when copying range info selftests/filesystems: grant executable permission to run_fat_tests.sh ASoC: SOF: ipc4-mtrace: prevent underflow in sof_ipc4_priority_mask_dfs_write() bpf: Add missing btf_put to register_btf_id_dtor_kfuncs media: v4l2-ctrls-api.c: move ctrl->is_new = 1 to the correct line bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listener arm64: dts: imx8mm: Fix pad control for UART1_DTE_RX arm64: dts: imx8mm-verdin: Do not power down eth-phy drm/vc4: hdmi: make CEC adapter name unique drm/ssd130x: Init display before the SSD130X_DISPLAY_ON command scsi: Revert "scsi: core: map PQ=1, PDT=other values to SCSI_SCAN_TARGET_PRESENT" bpf: Fix the kernel crash caused by bpf_setsockopt(). ALSA: memalloc: Workaround for Xen PV vhost/net: Clear the pending messages when the backend is removed copy_oldmem_kernel() - WRITE is "data source", not destination WRITE is "data source", not destination... READ is "data destination", not source... zcore: WRITE is "data source", not destination... memcpy_real(): WRITE is "data source", not destination... fix iov_iter_bvec() "direction" argument fix 'direction' argument of iov_iter_{init,bvec}() fix "direction" argument of iov_iter_kvec() use less confusing names for iov_iter direction initializers vhost-scsi: unbreak any layout for response ice: Prevent set_channel from changing queues while RDMA active qede: execute xdp_do_flush() before napi_complete_done() virtio-net: execute xdp_do_flush() before napi_complete_done() dpaa_eth: execute xdp_do_flush() before napi_complete_done() dpaa2-eth: execute xdp_do_flush() before napi_complete_done() skb: Do mix page pool and page referenced frags in GRO sfc: correctly advertise tunneled IPv6 segmentation net: phy: dp83822: Fix null pointer access on DP83825/DP83826 devices net: wwan: t7xx: Fix Runtime PM initialization block, bfq: replace 0/1 with false/true in bic apis block, bfq: fix uaf for bfqq in bic_set_bfqq() netrom: Fix use-after-free caused by accept on already connected socket fscache: Use wait_on_bit() to wait for the freeing of relinquished volume platform/x86/amd/pmf: update to auto-mode limits only after AMT event platform/x86/amd/pmf: Add helper routine to update SPS thermals platform/x86/amd/pmf: Fix to update SPS default pprof thermals platform/x86/amd/pmf: Add helper routine to check pprof is balanced platform/x86/amd/pmf: Fix to update SPS thermals when power supply change platform/x86/amd/pmf: Ensure mutexes are initialized before use platform/x86: thinkpad_acpi: Fix thinklight LED brightness returning 255 drm/i915/guc: Fix locking when searching for a hung request drm/i915: Fix request ref counting during error capture & debugfs dump drm/i915: Fix up locking around dumping requests lists drm/i915/adlp: Fix typo for reference clock net/tls: tls_is_tx_ready() checked list_entry ALSA: firewire-motu: fix unreleased lock warning in hwdep device netfilter: br_netfilter: disable sabotage_in hook after first suppression block: ublk: extending queue_size to fix overflow kunit: fix kunit_test_init_section_suites(...) squashfs: harden sanity check in squashfs_read_xattr_id_table maple_tree: should get pivots boundary by type sctp: do not check hb_timer.expires when resetting hb_timer net: phy: meson-gxl: Add generic dummy stubs for MMD register access drm/panel: boe-tv101wum-nl6: Ensure DSI writes succeed during disable ip/ip6_gre: Fix changing addr gen mode not generating IPv6 link local address ip/ip6_gre: Fix non-point-to-point tunnel not generating IPv6 link local address riscv: kprobe: Fixup kernel panic when probing an illegal position igc: return an error if the mac type is unknown in igc_ptp_systim_to_hwtstamp() octeontx2-af: Fix devlink unregister can: j1939: fix errant WARN_ON_ONCE in j1939_session_deactivate can: raw: fix CAN FD frame transmissions over CAN XL devices can: mcp251xfd: mcp251xfd_ring_set_ringparam(): assign missing tx_obj_num_coalesce_irq ata: libata: Fix sata_down_spd_limit() when no link speed is reported selftests: net: udpgso_bench_rx: Fix 'used uninitialized' compiler warning selftests: net: udpgso_bench_rx/tx: Stop when wrong CLI args are provided selftests: net: udpgso_bench: Fix racing bug between the rx/tx programs selftests: net: udpgso_bench_tx: Cater for pending datagrams zerocopy benchmarking virtio-net: Keep stop() to follow mirror sequence of open() net: openvswitch: fix flow memory leak in ovs_flow_cmd_new efi: fix potential NULL deref in efi_mem_reserve_persistent rtc: sunplus: fix format string for printing resource certs: Fix build error when PKCS#11 URI contains semicolon kbuild: modinst: Fix build error when CONFIG_MODULE_SIG_KEY is a PKCS#11 URI i2c: designware-pci: Add new PCI IDs for AMD NAVI GPU i2c: mxs: suppress probe-deferral error message scsi: target: core: Fix warning on RT kernels x86/aperfmperf: Erase stale arch_freq_scale values when disabling frequency invariance readings perf/x86/intel: Add Emerald Rapids perf/x86/intel/cstate: Add Emerald Rapids scsi: iscsi_tcp: Fix UAF during logout when accessing the shost ipaddress scsi: iscsi_tcp: Fix UAF during login when accessing the shost ipaddress i2c: rk3x: fix a bunch of kernel-doc warnings Revert "gfs2: stop using generic_writepages in gfs2_ail1_start_one" x86/build: Move '-mindirect-branch-cs-prefix' out of GCC-only block platform/x86: dell-wmi: Add a keymap for KEY_MUTE in type 0x0010 table platform/x86: hp-wmi: Handle Omen Key event platform/x86: gigabyte-wmi: add support for B450M DS3H WIFI-CF platform/x86/amd: pmc: Disable IRQ1 wakeup for RN/CZN net/x25: Fix to not accept on connected socket drm/amd/display: Fix timing not changning when freesync video is enabled bcache: Silence memcpy() run-time false positive warnings iio: adc: stm32-dfsdm: fill module aliases usb: dwc3: qcom: enable vbus override when in OTG dr-mode usb: gadget: f_fs: Fix unbalanced spinlock in __ffs_ep0_queue_wait vc_screen: move load of struct vc_data pointer in vcs_read() to avoid UAF fbcon: Check font dimension limits cgroup/cpuset: Fix wrong check in update_parent_subparts_cpumask() hv_netvsc: Fix missed pagebuf entries in netvsc_dma_map/unmap() ARM: dts: imx7d-smegw01: Fix USB host over-current polarity net: qrtr: free memory on error path in radix_tree_insert() can: isotp: split tx timer into transmission and timeout can: isotp: handle wait_event_interruptible() return values watchdog: diag288_wdt: do not use stack buffers for hardware data watchdog: diag288_wdt: fix __diag288() inline assembly ALSA: hda/realtek: Add Acer Predator PH315-54 ALSA: hda/realtek: fix mute/micmute LEDs, speaker don't work for a HP platform ASoC: codecs: wsa883x: correct playback min/max rates ASoC: SOF: sof-audio: unprepare when swidget->use_count > 0 ASoC: SOF: sof-audio: skip prepare/unprepare if swidget is NULL ASoC: SOF: keep prepare/unprepare widgets in sink path efi: Accept version 2 of memory attributes table rtc: efi: Enable SET/GET WAKEUP services as optional iio: hid: fix the retval in accel_3d_capture_sample iio: hid: fix the retval in gyro_3d_capture_sample iio: adc: xilinx-ams: fix devm_krealloc() return value check iio: adc: berlin2-adc: Add missing of_node_put() in error path iio: imx8qxp-adc: fix irq flood when call imx8qxp_adc_read_raw() iio:adc:twl6030: Enable measurements of VUSB, VBAT and others iio: light: cm32181: Fix PM support on system with 2 I2C resources iio: imu: fxos8700: fix ACCEL measurement range selection iio: imu: fxos8700: fix incomplete ACCEL and MAGN channels readback iio: imu: fxos8700: fix IMU data bits returned to user space iio: imu: fxos8700: fix map label of channel type to MAGN sensor iio: imu: fxos8700: fix swapped ACCEL and MAGN channels readback iio: imu: fxos8700: fix incorrect ODR mode readback iio: imu: fxos8700: fix failed initialization ODR mode assignment iio: imu: fxos8700: remove definition FXOS8700_CTRL_ODR_MIN iio: imu: fxos8700: fix MAGN sensor scale and unit nvmem: brcm_nvram: Add check for kzalloc nvmem: sunxi_sid: Always use 32-bit MMIO reads nvmem: qcom-spmi-sdam: fix module autoloading parisc: Fix return code of pdc_iodc_print() parisc: Replace hardcoded value with PRIV_USER constant in ptrace.c parisc: Wire up PTRACE_GETREGS/PTRACE_SETREGS for compat case riscv: disable generation of unwind tables Revert "mm: kmemleak: alloc gray object for reserved region with direct map" mm: multi-gen LRU: fix crash during cgroup migration mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath() usb: gadget: f_uac2: Fix incorrect increment of bNumEndpoints usb: typec: ucsi: Don't attempt to resume the ports before they exist usb: gadget: udc: do not clear gadget driver.bus kernel/irq/irqdomain.c: fix memory leak with using debugfs_lookup() HV: hv_balloon: fix memory leak with using debugfs_lookup() x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses fpga: m10bmc-sec: Fix probe rollback fpga: stratix10-soc: Fix return value check in s10_ops_write_init() mm/uffd: fix pte marker when fork() without fork event mm/swapfile: add cond_resched() in get_swap_pages() mm/khugepaged: fix ->anon_vma race mm, mremap: fix mremap() expanding for vma's with vm_ops->close() mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups highmem: round down the address passed to kunmap_flush_on_unmap() ia64: fix build error due to switch case label appearing next to declaration Squashfs: fix handling and sanity checking of xattr_ids count maple_tree: fix mas_empty_area_rev() lower bound validation migrate: hugetlb: check for hugetlb shared PMD in node migration dma-buf: actually set signaling bit for private stub fences serial: stm32: Merge hard IRQ and threaded IRQ handling into single IRQ handler drm/i915: Avoid potential vm use-after-free drm/i915: Fix potential bit_17 double-free drm/amd: Fix initialization for nbio 4.3.0 drm/amd/pm: drop unneeded dpm features disablement for SMU 13.0.4/11 drm/amdgpu: update wave data type to 3 for gfx11 nvmem: core: initialise nvmem->id early nvmem: core: remove nvmem_config wp_gpio nvmem: core: fix cleanup after dev_set_name() nvmem: core: fix registration vs use race nvmem: core: fix device node refcounting nvmem: core: fix cell removal on error nvmem: core: fix return value phy: qcom-qmp-combo: fix runtime suspend serial: 8250_dma: Fix DMA Rx completion race serial: 8250_dma: Fix DMA Rx rearm race platform/x86/amd: pmc: add CONFIG_SERIO dependency ASoC: SOF: sof-audio: prepare_widgets: Check swidget for NULL on sink failure iio:adc:twl6030: Enable measurement of VAC powerpc/64s/radix: Fix crash with unaligned relocated kernel powerpc/64s: Fix local irq disable when PMIs are disabled powerpc/imc-pmu: Revert nest_init_lock to being a mutex fs/ntfs3: Validate attribute data and valid sizes ovl: Use "buf" flexible array for memcpy() destination f2fs: initialize locks earlier in f2fs_fill_super() fbdev: smscufx: fix error handling code in ufx_usb_probe f2fs: fix to do sanity check on i_extra_isize in is_alive() wifi: brcmfmac: Check the count value of channel spec to prevent out-of-bounds reads gfs2: Cosmetic gfs2_dinode_{in,out} cleanup gfs2: Always check inode size of inline inodes bpf: Skip invalid kfunc call in backtrack_insn Linux 6.1.11 Change-Id: I69722bc9711b91f2fca18de59746ada373f64c5e Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2023-02-09 13:29:55 +00:00
Yu Zhao	0444802231	mm: multi-gen LRU: fix crash during cgroup migration commit de08eaa6156405f2e9369f06ba5afae0e4ab3b62 upstream. lru_gen_migrate_mm() assumes lru_gen_add_mm() runs prior to itself. This isn't true for the following scenario: CPU 1 CPU 2 clone() cgroup_can_fork() cgroup_procs_write() cgroup_post_fork() task_lock() lru_gen_migrate_mm() task_unlock() task_lock() lru_gen_add_mm() task_unlock() And when the above happens, kernel crashes because of linked list corruption (mm_struct->lru_gen.list). Link: https://lore.kernel.org/r/20230115134651.30028-1-msizanoen@qtmlabs.xyz/ Link: https://lkml.kernel.org/r/20230116034405.2960276-1-yuzhao@google.com Fixes: `bd74fdaea1` ("mm: multi-gen LRU: support page table walks") Signed-off-by: Yu Zhao <yuzhao@google.com> Reported-by: msizanoen <msizanoen@qtmlabs.xyz> Tested-by: msizanoen <msizanoen@qtmlabs.xyz> Cc: <stable@vger.kernel.org> [6.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-02-09 11:28:20 +01:00
Greg Kroah-Hartman	403d73d693	Merge `97ee9d1c16` ("Merge tag 'block-6.1-2022-12-02' of git://git.kernel.dk/linux") into android-mainline Steps on the way to 6.1-rc8 Change-Id: Ia137727ec3fcd0eef3de5deb65133113ab3f63ea Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2022-12-04 11:47:42 +00:00
Juergen Gross	4aaf269c76	mm: introduce arch_has_hw_nonleaf_pmd_young() When running as a Xen PV guests commit `eed9a328aa` ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in pmdp_test_and_clear_young(): BUG: unable to handle page fault for address: ffff8880083374d0 #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065 Oops: 0003 [#1] PREEMPT SMP NOPTI CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1 RIP: e030:pmdp_test_and_clear_young+0x25/0x40 This happens because the Xen hypervisor can't emulate direct writes to page table entries other than PTEs. This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young() similar to arch_has_hw_pte_young() and test that instead of CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG. Link: https://lkml.kernel.org/r/20221123064510.16225-1-jgross@suse.com Fixes: `eed9a328aa` ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") Signed-off-by: Juergen Gross <jgross@suse.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: Yu Zhao <yuzhao@google.com> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: David Hildenbrand <david@redhat.com> [core changes] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-11-30 14:49:41 -08:00
Greg Kroah-Hartman	e00ac7661f	Linux 6.1-rc7 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmOD10QeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGGz4H+wZRlo4Bd7/zKaYZ Zy40ylCfnsMsm2dD+fPdYsr1Izxh3PQ7ithk21TL1PWSXLGcHZzJCjF2cN//z02S 0IZE59Zc5vsLUPvnTZHdHBD4p+pKe5dk3dUP2AlanoxptumXpdZtLBs74R0yv6Fa Js1+2qxOQqC/BuWjW2SoJJzvcpMEtIXPCpizdkme95aGw+TfzuRw3Nn+lgJ0TTIr 1N98IbTz+7bqj9PvrFuVgbkMyJHwquVFc/r/vq6Yv+B6Mn/N1CrKW2VPxW299F3Z kvvjosSilrzzQsG7MYQWES6q6JdAQD0VOZDRI11WhwXbKsQobZ6hjgK1wOmInasS fTsOVmA= =PkFi -----END PGP SIGNATURE----- Merge 6.1-rc7 into android-mainline Linux 6.1-rc7 Change-Id: I5273a4225425a0f7f770c0e361df67d1abdcd5f6 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2022-11-28 16:47:55 +00:00
Aneesh Kumar K.V	81a70c21d9	mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1 balance_dirty_pages doesn't do the required dirty throttling on cgroupv1. See commit `9badce000e` ("cgroup, writeback: don't enable cgroup writeback on traditional hierarchies"). Instead, the kernel depends on writeback throttling in shrink_folio_list to achieve the same goal. With large memory systems, the flusher may not be able to writeback quickly enough such that we will start finding pages in the shrink_folio_list already in writeback. Hence for cgroupv1 let's do a reclaim throttle after waking up the flusher. The below test which used to fail on a 256GB system completes till the the file system is full with this change. root@lp2:/sys/fs/cgroup/memory# mkdir test root@lp2:/sys/fs/cgroup/memory# cd test/ root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M Killed Link: https://lkml.kernel.org/r/20221118070603.84081-1-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: zefan li <lizefan.x@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-11-22 18:50:45 -08:00
Yu Zhao	359a5e1416	mm: multi-gen LRU: retry folios written back while isolated The page reclaim isolates a batch of folios from the tail of one of the LRU lists and works on those folios one by one. For a suitable swap-backed folio, if the swap device is async, it queues that folio for writeback. After the page reclaim finishes an entire batch, it puts back the folios it queued for writeback to the head of the original LRU list. In the meantime, the page writeback flushes the queued folios also by batches. Its batching logic is independent from that of the page reclaim. For each of the folios it writes back, the page writeback calls folio_rotate_reclaimable() which tries to rotate a folio to the tail. folio_rotate_reclaimable() only works for a folio after the page reclaim has put it back. If an async swap device is fast enough, the page writeback can finish with that folio while the page reclaim is still working on the rest of the batch containing it. In this case, that folio will remain at the head and the page reclaim will not retry it before reaching there. This patch adds a retry to evict_folios(). After evict_folios() has finished an entire batch and before it puts back folios it cannot free immediately, it retries those that may have missed the rotation. Before this patch, ~60% of folios swapped to an Intel Optane missed folio_rotate_reclaimable(). After this patch, ~99% of missed folios were reclaimed upon retry. This problem affects relatively slow async swap devices like Samsung 980 Pro much less and does not affect sync swap devices like zram or zswap at all. Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com Fixes: `ac35a49023` ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: "Yin, Fengwei" <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-11-22 18:50:43 -08:00
Johannes Weiner	f53af4285d	mm: vmscan: fix extreme overreclaim and swap floods During proactive reclaim, we sometimes observe severe overreclaim, with several thousand times more pages reclaimed than requested. This trace was obtained from shrink_lruvec() during such an instance: prio:0 anon_cost:1141521 file_cost:7767 nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190) nr=[7161123 345 578 1111] While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it by swapping. These requests take over a minute, during which the write() to memory.reclaim is unkillably stuck inside the kernel. Digging into the source, this is caused by the proportional reclaim bailout logic. This code tries to resolve a fundamental conflict: to reclaim roughly what was requested, while also aging all LRUs fairly and in accordance to their size, swappiness, refault rates etc. The way it attempts fairness is that once the reclaim goal has been reached, it stops scanning the LRUs with the smaller remaining scan targets, and adjusts the remainder of the bigger LRUs according to how much of the smaller LRUs was scanned. It then finishes scanning that remainder regardless of the reclaim goal. This works fine if priority levels are low and the LRU lists are comparable in size. However, in this instance, the cgroup that is targeted by proactive reclaim has almost no files left - they've already been squeezed out by proactive reclaim earlier - and the remaining anon pages are hot. Anon rotations cause the priority level to drop to 0, which results in reclaim targeting all of anon (a lot) and all of file (almost nothing). By the time reclaim decides to bail, it has scanned most or all of the file target, and therefor must also scan most or all of the enormous anon target. This target is thousands of times larger than the reclaim goal, thus causing the overreclaim. The bailout code hasn't changed in years, why is this failing now? The most likely explanations are two other recent changes in anon reclaim: 1. Before the series starting with commit `5df741963d` ("mm: fix LRU balancing effect of new transparent huge pages"), the VM was overall relatively reluctant to swap at all, even if swap was configured. This means the LRU balancing code didn't come into play as often as it does now, and mostly in high pressure situations where pronounced swap activity wouldn't be as surprising. 2. For historic reasons, shrink_lruvec() loops on the scan targets of all LRU lists except the active anon one, meaning it would bail if the only remaining pages to scan were active anon - even if there were a lot of them. Before the series starting with commit `ccc5dc6734` ("mm/vmscan: make active/inactive ratio as 1:1 for anon lru"), most anon pages would live on the active LRU; the inactive one would contain only a handful of preselected reclaim candidates. After the series, anon gets aged similarly to file, and the inactive list is the default for new anon pages as well, making it often the much bigger list. As a result, the VM is now more likely to actually finish large anon targets than before. Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the larger LRU lists is made before bailing out on a met reclaim goal. This fixes the extreme overreclaim problem. Fairness is more subtle and harder to evaluate. No obvious misbehavior was observed on the test workload, in any case. Conceptually, fairness should primarily be a cumulative effect from regular, lower priority scans. Once the VM is in trouble and needs to escalate scan targets to make forward progress, fairness needs to take a backseat. This is also acknowledged by the myriad exceptions in get_scan_count(). This patch makes fairness decrease gradually, as it keeps fairness work static over increasing priority levels with growing scan targets. This should make more sense - although we may have to re-visit the exact values. Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-11-22 18:50:41 -08:00

1 2 3 4 5 ...

1227 Commits