Commit Graph

19898 Commits

Author SHA1 Message Date
Suren Baghdasaryan
78c6875e2f UPSTREAM: mm: change per-VMA lock statistics to be disabled by default
Change CONFIG_PER_VMA_LOCK_STATS to be disabled by default, as most users
don't need it.  Add configuration help to clarify its usage.

Link: https://lkml.kernel.org/r/20230428173533.18158-1-surenb@google.com
Fixes: 52f238653e45 ("mm: introduce per-VMA lock statistics")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 6152e53d9671b0ccc21c1bca842617b32ccfc5d8)

Bug: 161210518
Change-Id: Ibd57999a415b5433ae3b99365ea50526a35452d1
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:02 +00:00
Suren Baghdasaryan
23fcd3167e UPSTREAM: mm/mmap: free vm_area_struct without call_rcu in exit_mmap
call_rcu() can take a long time when callback offloading is enabled.  Its
use in vm_area_free() can cause regressions in the exit path when
multiple VMAs are being freed.

Because exit_mmap() is called only after the last mm user drops its
refcount, the page fault handlers can't be racing with it.  Any other
possible user like oom-reaper or process_mrelease are already synchronized
using mmap_lock.  Therefore exit_mmap() can free VMAs directly, without
the use of call_rcu().

Expose __vm_area_free() and use it from exit_mmap() to avoid possible
call_rcu() floods and performance regressions caused by it.

Link: https://lkml.kernel.org/r/20230227173632.3292573-33-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 0d2ebf9c3f7822e7ba3e4792ea3b6b19aa2da34a)

Bug: 161210518
Change-Id: I4fbf3ef38fdb22a3c80dcc61125ec21d2c426100
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:02 +00:00
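
For illustration, a minimal sketch of the split described above, assuming the
fields and helpers named in the commit message; details of the real 6.1
backport may differ:

  /*
   * Sketch: with CONFIG_PER_VMA_LOCK, vm_area_free() defers the actual free
   * through call_rcu() so lock-free readers (page fault handlers) never see
   * a freed VMA.  __vm_area_free() frees immediately and is what exit_mmap()
   * can call, because no page fault can race with it once the last mm user
   * is gone.
   */
  void __vm_area_free(struct vm_area_struct *vma)
  {
          free_anon_vma_name(vma);
          kmem_cache_free(vm_area_cachep, vma);
  }

  #ifdef CONFIG_PER_VMA_LOCK
  static void vm_area_free_rcu_cb(struct rcu_head *head)
  {
          struct vm_area_struct *vma = container_of(head,
                          struct vm_area_struct, vm_rcu);
          __vm_area_free(vma);
  }
  #endif

  void vm_area_free(struct vm_area_struct *vma)
  {
  #ifdef CONFIG_PER_VMA_LOCK
          call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);    /* deferred free */
  #else
          __vm_area_free(vma);                            /* immediate free */
  #endif
  }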
Suren Baghdasaryan
ebbbcdfeaf UPSTREAM: mm: introduce per-VMA lock statistics
Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra statistics
about handling page fault under VMA lock.

Link: https://lkml.kernel.org/r/20230227173632.3292573-29-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 52f238653e452e0fda61e880f263a173d219acd1)

Bug: 161210518
Change-Id: I1bc9ab9bc0307af26e0c51ba12f9ad561af5b6c8
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:01 +00:00
Suren Baghdasaryan
4e4c6989ae UPSTREAM: mm: prevent userfaults to be handled under per-vma lock
Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
handling under VMA lock and retry holding mmap_lock.  This can be handled
more gracefully in the future.

Link: https://lkml.kernel.org/r/20230227173632.3292573-28-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 444eeb17437a0ef526c606e9141a415d3b7dfddd)

Bug: 161210518
Change-Id: I383603d637497ea9917ad08908530f91052a17cc
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:01 +00:00
Suren Baghdasaryan
6e306e82ac UPSTREAM: mm: prevent do_swap_page from handling page faults under VMA lock
Due to the possibility of do_swap_page dropping mmap_lock, abort fault
handling under VMA lock and retry holding mmap_lock.  This can be handled
more gracefully in the future.

Link: https://lkml.kernel.org/r/20230227173632.3292573-27-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 17c05f18e54158a3eed0c22c85b7a756b63dcc01)

Bug: 161210518
Change-Id: I047f4d0e0ca3b3bf9505e5cda2da768c88bed20e
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:01 +00:00
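
Both this commit and the userfaultfd one above follow the same bail-out
pattern; a minimal sketch of the idea (simplified, not the exact upstream
diff), assuming the caller retries the fault under mmap_lock when it sees
VM_FAULT_RETRY:

  if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
          /*
           * This path may drop mmap_lock or sleep in ways the per-VMA lock
           * cannot cover, so give up and let the caller retry the fault
           * while holding mmap_lock.
           */
          return VM_FAULT_RETRY;
  }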
Suren Baghdasaryan
c06661eab5 UPSTREAM: mm: fall back to mmap_lock if vma->anon_vma is not yet set
When vma->anon_vma is not set, the page fault handler will set it by either
reusing anon_vma of an adjacent VMA if VMAs are compatible or by
allocating a new one.  find_mergeable_anon_vma() walks VMA tree to find a
compatible adjacent VMA and that requires not only the faulting VMA to be
stable but also the tree structure and other VMAs inside that tree.
Therefore locking just the faulting VMA is not enough for this search.
Fall back to taking mmap_lock when vma->anon_vma is not set.  This
situation happens only on the first page fault and should not affect
overall performance.

Link: https://lkml.kernel.org/r/20230227173632.3292573-25-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 2ac0af1b66e3b66307f53b1cc446514308ec466d)

Bug: 161210518
Change-Id: Iafacad5bda7bb138b290f38421a22d828051b067
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:01 +00:00
Suren Baghdasaryan
5949b78f6c UPSTREAM: mm: introduce lock_vma_under_rcu to be used from arch-specific code
Introduce the lock_vma_under_rcu() function to look up and lock a VMA during page
fault handling.  When the VMA is not found, cannot be locked, or changes after
being locked, the function returns NULL.  The lookup is performed under
RCU protection to prevent the found VMA from being destroyed before the
VMA lock is acquired.  VMA lock statistics are updated according to the
results.  For now only anonymous VMAs can be searched this way.  In other
cases the function returns NULL.

Link: https://lkml.kernel.org/r/20230227173632.3292573-24-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 50ee32537206140e4cf6e47024be29a84d458d49)

Bug: 161210518
Change-Id: I4872bb04f5c8a515e4b31bc36c95e15b62cbd0da
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:01 +00:00
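
A condensed sketch of what such a lookup can look like, folding in the
detached-flag and anon_vma checks from the neighbouring commits in this
series; names follow the commit messages, but the body is simplified rather
than the exact backported code:

  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
                                            unsigned long address)
  {
          MA_STATE(mas, &mm->mm_mt, address, address);
          struct vm_area_struct *vma;

          rcu_read_lock();
          vma = mas_walk(&mas);                   /* maple tree lookup */
          if (!vma)
                  goto inval;

          /* Only anonymous VMAs with an anon_vma are supported for now. */
          if (!vma_is_anonymous(vma) || !vma->anon_vma)
                  goto inval;

          if (!vma_start_read(vma))               /* try-lock vs. writers */
                  goto inval;

          /* Re-validate: the VMA may have changed or been detached while
           * the lock was being acquired. */
          if (vma->detached ||
              address < vma->vm_start || address >= vma->vm_end) {
                  vma_end_read(vma);
                  goto inval;
          }

          rcu_read_unlock();
          return vma;
  inval:
          rcu_read_unlock();
          return NULL;
  }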
Suren Baghdasaryan
35ffa4830e BACKPORT: mm: introduce vma detached flag
The per-VMA locking mechanism will search for a VMA under RCU protection and
then, after locking it, has to ensure it was not removed from the VMA tree
after we found it.  To make this check efficient, introduce a
vma->detached flag to mark VMAs which were removed from the VMA tree.

Link: https://lkml.kernel.org/r/20230227173632.3292573-23-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 457f67be5910a2b5f1fda8af06bfe4d3492a0a4f)
[surenb: vma_complete does not exist in 6.1, therefore patch is adjusted
to mark VMAs detached directly in vma_expand and __vma_adjust]

Bug: 161210518
Change-Id: Id1f31733cb7a36f3f1294b2be83cf3b87ba3f812
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:00 +00:00
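
A minimal sketch of the flag itself, assuming the field described above; the
real helpers in the backport may add locking assertions:

  /*
   * Sketch: mark a VMA as removed from (or re-inserted into) the VMA tree.
   * Detaching happens while the VMA is write-locked, so a reader that later
   * acquires the per-VMA read lock can trust the flag.
   */
  static inline void vma_mark_detached(struct vm_area_struct *vma,
                                       bool detached)
  {
          vma->detached = detached;
  }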
Suren Baghdasaryan
3c6748cd51 UPSTREAM: mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
Page fault handlers might need to fire MMU notifications while a new
notifier is being registered.  Modify mm_take_all_locks to write-lock all
VMAs and prevent this race with page fault handlers that would hold VMA
locks.  VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
locking order as in page fault handlers.

Link: https://lkml.kernel.org/r/20230227173632.3292573-22-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit eeff9a5d47f89bc641034fea05501c8a6de131cb)

Bug: 161210518
Change-Id: I4176bf0e1b07f03dfc1ac7dd37d7941d5a1dbc02
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:00 +00:00
Suren Baghdasaryan
9cc64c7fb9 UPSTREAM: mm: conditionally write-lock VMA in free_pgtables
Normally free_pgtables needs to lock affected VMAs except for the case
when VMAs were isolated under VMA write-lock.  munmap() does just that,
isolating while holding appropriate locks and then downgrading mmap_lock
and dropping per-VMA locks before freeing page tables.  Add a parameter to
free_pgtables() for such a scenario.

Link: https://lkml.kernel.org/r/20230227173632.3292573-20-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 98e51a2239d9d419d819cd61a2e720ebf19a8b0a)

Bug: 161210518
Change-Id: I3c9177cce187526407754baf7641d3741ca7b0cb
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:00 +00:00
Suren Baghdasaryan
5f1e1ab919 UPSTREAM: mm: write-lock VMAs before removing them from VMA tree
Write-locking VMAs before isolating them ensures that page fault handlers
don't operate on isolated VMAs.

[surenb@google.com: mm/nommu: remove unnecessary VMA locking]
  Link: https://lkml.kernel.org/r/20230301190457.1498985-1-surenb@google.com
  Link: https://lore.kernel.org/all/Y%2F8CJQGNuMUTdLwP@localhost/
Link: https://lkml.kernel.org/r/20230227173632.3292573-19-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 73046fd00b069ffd198eda099dae966e152fae39)

Bug: 161210518
Change-Id: Ia742da40896e6bc4e8150911596f80dca5ef3e12
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:00 +00:00
Suren Baghdasaryan
24ecdbc5e2 UPSTREAM: mm/mremap: write-lock VMA while remapping it to a new address range
Write-lock the VMA before copying it, and write-lock the new VMA when
copy_vma() produces one.

Link: https://lkml.kernel.org/r/20230227173632.3292573-18-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit d6ac235de4ba6dc659eebb5f4e5ba0a8523d8424)

Bug: 161210518
Change-Id: I38b5c5689380754a366223caff30e1ac4aaf7cc4
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:25:00 +00:00
Suren Baghdasaryan
2554cb4775 FROMLIST: mm/mmap: write-lock VMAs affected by VMA expansion
vma_expand changes VMA boundaries and might result in freeing an adjacent
VMA. Write-lock affected VMAs to prevent concurrent page faults.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Link: https://lore.kernel.org/all/20230109205336.3665937-22-surenb@google.com/
[surenb: using older v1 of patchset due to __vma_adjust() being removed
in 6.2-rc4]
[surenb: lock next earlier when removing it like we do in v3:
https://lore.kernel.org/all/20230216051750.3125598-18-surenb@google.com/]

Bug: 161210518
Change-Id: I31aff80996b4ad646bdd6861ff6479c8eb2a690a
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:59 +00:00
Suren Baghdasaryan
57b3f8a5ab FROMLIST: mm/mmap: write-lock VMAs in vma_adjust
vma_adjust modifies a VMA and possibly its neighbors. Write-lock them
before making the modifications.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Link: https://lore.kernel.org/all/20230109205336.3665937-21-surenb@google.com/
[surenb: using older v1 of patchset due to __vma_adjust() being removed
in 6.2-rc4]
[surenb: minor fixes in next_next locking inside __vma_adjust]

Bug: 161210518
Change-Id: I9ab2f88c82a7071fe2f1a14c51a2e6f1b6196681
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:59 +00:00
Suren Baghdasaryan
998ec9f54d FROMLIST: mm/mmap: write-lock VMAs before merging, splitting or expanding them
Decisions about whether VMAs can be merged, split or expanded must be
made while VMAs are protected from the changes which can affect that
decision. For example, vma_merge() uses vma->anon_vma when deciding
whether the VMA can be merged, while the page fault handler changes
vma->anon_vma during a COW operation.
Write-lock all VMAs which might be affected by a merge or split operation
before deciding how such operations should be performed.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Link: https://lore.kernel.org/all/20230216051750.3125598-17-surenb@google.com/
[surenb: using older v3 of patchset due to missing __vma_adjust()
refactoring in 6.2-rc4 which introduced vma_prepare()]

Bug: 161210518
Change-Id: I56d84aa67366a1988fc81296da7164ad7f89a5c0
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:59 +00:00
Suren Baghdasaryan
d73ebe031c UPSTREAM: mm/khugepaged: write-lock VMA while collapsing a huge page
Protect VMA from concurrent page fault handler while collapsing a huge
page.  Page fault handler needs a stable PMD to use PTL and relies on
per-VMA lock to prevent concurrent PMD changes.  pmdp_collapse_flush(),
set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
not be detected by a page fault handler without proper locking.

Before this patch, page tables can be walked under any one of the
mmap_lock, the mapping lock, and the anon_vma lock; so when khugepaged
unlinks and frees page tables, it must ensure that all of those either are
locked or don't exist.  This patch adds a fourth lock under which page
tables can be traversed, and so khugepaged must also lock out that one.

[surenb@google.com: vm_lock/i_mmap_rwsem inversion in retract_page_tables]
  Link: https://lkml.kernel.org/r/20230303213250.3555716-1-surenb@google.com
[surenb@google.com: build fix]
  Link: https://lkml.kernel.org/r/CAJuCfpFjWhtzRE1X=J+_JjgJzNKhq-=JT8yTBSTHthwp0pqWZw@mail.gmail.com
Link: https://lkml.kernel.org/r/20230227173632.3292573-16-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 55fd6fccad3172c0feaaa817f0a1283629ff183e)

Bug: 161210518
Change-Id: I6c3cddd7861dd03fe496c4de20f284dc692c8654
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:59 +00:00
Suren Baghdasaryan
3771808d64 FROMLIST: mm/mmap: move VMA locking before vma_adjust_trans_huge call
vma_adjust_trans_huge() modifies the VMA, and such modifications should
be done after the VMA is marked as being written. Therefore move the VMA
flag modifications before vma_adjust_trans_huge() so that the VMA is
marked before all these modifications.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Link: https://lore.kernel.org/all/20230216051750.3125598-15-surenb@google.com/
[surenb: using older v3 of patchset due to missing __vma_adjust()
refactoring in 6.2-rc4 which introduced vma_prepare()]

Bug: 161210518
Change-Id: I650162fd85fabee00a8a05ddb32318e654270cb1
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:59 +00:00
Suren Baghdasaryan
a9ea3113d4 UPSTREAM: mm: add per-VMA lock and helper functions to control it
Introduce per-VMA locking.  The lock implementation relies on per-vma
and per-mm sequence counters to note exclusive locking:

  - read lock - (implemented by vma_start_read) requires the vma
    (vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ.
    If they match then there must be a vma exclusive lock held somewhere.
  - read unlock - (implemented by vma_end_read) is a trivial vma->lock
    unlock.
  - write lock - (vma_start_write) requires the mmap_lock to be held
    exclusively and the current mm counter is assigned to the vma counter.
    This will allow multiple vmas to be locked under a single mmap_lock
    write lock (e.g. during vma merging). The vma counter is modified
    under exclusive vma lock.
  - write unlock - (vma_end_write_all) is a batch release of all vma
    locks held. It doesn't pair with a specific vma_start_write! It is
    done before exclusive mmap_lock is released by incrementing mm
    sequence counter (mm_lock_seq).
  - write downgrade - if the mmap_lock is downgraded to the read lock, all
    vma write locks are released as well (effectively the same as write
    unlock).

Link: https://lkml.kernel.org/r/20230227173632.3292573-13-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 5e31275cc997f8ec5d9e8d65fe9840ebed89db19)

Bug: 161210518
Change-Id: I5e0db53a4b5562e59dd031fabbae4f97acc1bce1
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:59 +00:00
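
A minimal sketch of the scheme summarized in the list above, assuming the
field names used there (vma->vm_lock_seq, mm->mm_lock_seq and a per-VMA
rwsem vma->lock); the real helpers add memory-ordering comments, lockdep
annotations and statistics:

  /* Read lock: fails if a writer marked this VMA in the current
   * mmap_write_lock section (sequence numbers equal). */
  static inline bool vma_start_read(struct vm_area_struct *vma)
  {
          if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
                  return false;                   /* write-locked */
          if (!down_read_trylock(&vma->lock))
                  return false;
          /* Re-check: a writer may have locked the VMA after the check. */
          if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
                  up_read(&vma->lock);
                  return false;
          }
          return true;
  }

  static inline void vma_end_read(struct vm_area_struct *vma)
  {
          up_read(&vma->lock);                    /* trivial unlock */
  }

  /* Write lock: requires mmap_lock held for writing; copying the mm
   * sequence number into the VMA keeps readers out until
   * vma_end_write_all(). */
  static inline void vma_start_write(struct vm_area_struct *vma)
  {
          mmap_assert_write_locked(vma->vm_mm);
          if (vma->vm_lock_seq == vma->vm_mm->mm_lock_seq)
                  return;                         /* already write-locked */
          down_write(&vma->lock);
          vma->vm_lock_seq = vma->vm_mm->mm_lock_seq;
          up_write(&vma->lock);
  }

  /* Batch write unlock: done when the exclusive mmap_lock is released. */
  static inline void vma_end_write_all(struct mm_struct *mm)
  {
          mmap_assert_write_locked(mm);
          WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
  }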
Suren Baghdasaryan
04f73ad5b4 UPSTREAM: mm: introduce CONFIG_PER_VMA_LOCK
Patch series "Per-VMA locks", v4.

LWN article describing the feature: https://lwn.net/Articles/906852/

Per-vma locks idea that was discussed during SPF [1] discussion at LSF/MM
last year [2], which concluded with suggestion that “a reader/writer
semaphore could be put into the VMA itself; that would have the effect of
using the VMA as a sort of range lock.  There would still be contention at
the VMA level, but it would be an improvement.” This patchset implements
this suggested approach.

When handling page faults we look up the VMA that contains the faulting
page under RCU protection and try to acquire its lock.  If that fails we
fall back to using mmap_lock, similar to how SPF handled this situation.

One notable way the implementation deviates from the proposal is the way
VMAs are read-locked.  During some mm updates, multiple VMAs need to be
locked until the end of the update (e.g.  vma_merge, split_vma, etc).
Tracking all the locked VMAs, avoiding recursive locks, and figuring out when
it's safe to unlock previously locked VMAs would make the code more
complex.  So, instead of the usual lock/unlock pattern, the proposed
solution marks a VMA as locked and provides an efficient way to:

1. Identify locked VMAs.

2. Unlock all locked VMAs in bulk.

We also postpone unlocking the locked VMAs until the end of the update,
when we do mmap_write_unlock.  Potentially this keeps a VMA locked for
longer than is absolutely necessary but it results in a big reduction of
code complexity.

Read-locking a VMA is done using two sequence numbers - one in the
vm_area_struct and one in the mm_struct.  VMA is considered read-locked
when these sequence numbers are equal.  To read-lock a VMA we set the
sequence number in vm_area_struct to be equal to the sequence number in
mm_struct.  To unlock all VMAs we increment mm_struct's seq number.  This
allows for an efficient way to track locked VMAs and to drop the locks on
all VMAs at the end of the update.

The patchset implements per-VMA locking only for anonymous pages which are
not in swap and avoids userfaultfd, as its implementation is more
complex.  Additional support for file-backed page faults, swapped and user
pages can be added incrementally.

Performance benchmarks show similar although slightly smaller benefits as
with SPF patchset (~75% of SPF benefits).  Still, with lower complexity
this approach might be more desirable.

Since RFC was posted in September 2022, two separate Google teams outside
of Android evaluated the patchset and confirmed positive results.  Here
are the known use cases where per-VMA locks show benefits:

Android:

Launch times of apps with a high number of threads (~100) improve by up to
20%.  Each thread mmaps several areas upon startup (stack and thread-local
storage (TLS), thread signal stack, indirect ref table), which requires
taking mmap_lock in write mode.  Page faults take mmap_lock in read mode.
During app launch, thread creation and the page faults establishing the
active working set happen in parallel, and that causes lock contention
between mm writers and readers even if updates and page faults are
happening in different VMAs.  Per-vma locks prevent this contention by
providing a more granular lock.

Google Fibers:

We have several dynamically sized thread pools that spawn new threads
under increased load and reduce their number when idling. For example,
Google's in-process scheduling/threading framework, UMCG/Fibers, is backed
by such a thread pool. When idling, only a small number of idle worker
threads are available; when a spike of incoming requests arrives, each
request is handled in its own "fiber", which is a work item posted onto a
UMCG worker thread; quite often these spikes lead to a number of new
threads spawning. Each new thread needs to allocate and register an RSEQ
section on its TLS, then register itself with the kernel as a UMCG worker
thread, and only after that it can be considered by the in-process
UMCG/Fiber scheduler as available to do useful work. In short, during an
incoming workload spike new threads have to be spawned, and they perform
several syscalls (RSEQ registration, UMCG worker registration, memory
allocations) before they can actually start doing useful work. Removing
any bottlenecks on this thread startup path will greatly improve our
services' latencies when faced with request/workload spikes.

At high scale, mmap_lock contention during thread creation and stack page
faults leads to user-visible multi-second serving latencies in a similar
pattern to Android app startup.  Per-VMA locking patchset has been run
successfully in limited experiments with user-facing production workloads.
In these experiments, we observed that the peak thread creation rate was
high enough that thread creation is no longer a bottleneck.

TCP zerocopy receive:

From the point of view of TCP zerocopy receive, the per-vma lock patch is
massively beneficial.

In today's implementation, a process with N threads where N - 1 are
performing zerocopy receive and 1 thread is performing madvise() with the
write lock taken (e.g.  needs to change vm_flags) will result in all N - 1
receive threads blocking until the madvise is done.  Conversely, on a busy
process receiving a lot of data, a madvise operation that does need to
take the mmap lock in write mode will need to wait for all of the receives
to be done - a lose:lose proposition.  Per-VMA locking _removes_ by
definition this source of contention entirely.

There are other benefits for receive as well, chiefly a reduction in
cacheline bouncing across receiving threads for locking/unlocking the
single mmap lock.  On an RPC style synthetic workload with 4KB RPCs:

1a) The find+lock+unlock VMA path in the base case, without the
    per-vma lock patchset, is about 0.7% of cycles as measured by perf.

1b) mmap_read_lock + mmap_read_unlock in the base case is about 0.5%
    cycles overall - most of this is within the TCP read hotpath (a small
    fraction is 'other' usage in the system).

2a) The find+lock+unlock VMA path, with the per-vma patchset and a
    trivial patch written to take advantage of it in TCP, is about 0.4% of
    cycles (down from 0.7% above)

2b) mmap_read_lock + mmap_read_unlock in the per-vma patchset is <
    0.1% cycles and is out of the TCP read hotpath entirely (down from
    0.5% before, the remaining usage is the 'other' usage in the system).
    So, in addition to entirely removing an onerous source of contention,
    it also reduces the CPU cycles of TCP receive zerocopy by about 0.5%+
    (compared to overall cycles in perf) for the 'small' RPC scenario.

In https://lkml.kernel.org/r/87fsaqouyd.fsf_-_@stealth, Punit
demonstrated throughput improvements of as much as 188% from this
patchset.

This patch (of 25):

This configuration variable will be used to build the support for VMA
locking during page fault handling.

This is enabled on supported architectures with SMP and MMU set.

The architecture support is needed since the page fault handler is called
from the architecture's page faulting code which needs modifications to
handle faults under VMA lock.

Link: https://lkml.kernel.org/r/20230227173632.3292573-1-surenb@google.com
Link: https://lkml.kernel.org/r/20230227173632.3292573-10-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 0b6cc04f3db3604c1485049bc9582523c2b44b75)

Bug: 161210518
Change-Id: I787e1d28194655fb717d38718b2b839ef4e6226c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:58 +00:00
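
A sketch of how an architecture's fault handler can use the per-VMA lock as
described above (try the VMA lock first, fall back to mmap_lock); this is a
simplified, architecture-neutral illustration, not the exact upstream code:

  static vm_fault_t do_page_fault_sketch(struct mm_struct *mm,
                                         unsigned long addr,
                                         unsigned int flags,
                                         struct pt_regs *regs)
  {
          struct vm_area_struct *vma;
          vm_fault_t fault;

          vma = lock_vma_under_rcu(mm, addr);
          if (vma) {
                  fault = handle_mm_fault(vma, addr,
                                          flags | FAULT_FLAG_VMA_LOCK, regs);
                  vma_end_read(vma);
                  if (!(fault & VM_FAULT_RETRY))
                          return fault;   /* handled without mmap_lock */
          }

          /* Fallback: the classic path under mmap_lock in read mode. */
          mmap_read_lock(mm);
          vma = find_vma(mm, addr);
          if (!vma || addr < vma->vm_start) {
                  mmap_read_unlock(mm);
                  return VM_FAULT_SIGSEGV;
          }
          fault = handle_mm_fault(vma, addr, flags, regs);
          mmap_read_unlock(mm);
          return fault;
  }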
Suren Baghdasaryan
ef8351241d UPSTREAM: mm: introduce vm_flags_reset_once to replace WRITE_ONCE vm_flags updates
Provide vm_flags_reset_once() and replace the vm_flags updates which used
WRITE_ONCE() to prevent compiler optimizations.

Link: https://lkml.kernel.org/r/20230201000116.1333160-1-surenb@google.com
Fixes: 0cce31a0aa0e ("mm: replace vma->vm_flags direct modifications with modifier calls")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 601c3c29dbeb049862faa00917f2daf094a71028)

Bug: 161210518
Change-Id: Ied961a1bfbdc25b79268ba04515960c664052d61
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:58 +00:00
Suren Baghdasaryan
75977e5919 UPSTREAM: mm: export dump_mm()
mmap_assert_write_locked() is used in vm_flags modifiers.  Because
mmap_assert_write_locked() uses dump_mm() and vm_flags are sometimes
modified from inside a module, it's necessary to export dump_mm()
function.

Link: https://lkml.kernel.org/r/20230126193752.297968-8-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit c2fdc235300a027adc04a41b383bd78ab5da56f4)

Bug: 161210518
Change-Id: I78d82d04c26c9ae3bcd118e281d2ac8531e1ad81
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:58 +00:00
Suren Baghdasaryan
2ff3b23c7f UPSTREAM: mm: introduce __vm_flags_mod and use it in untrack_pfn
There are scenarios when vm_flags can be modified without exclusive
mmap_lock, such as:
- after VMA was isolated and mmap_lock was downgraded or dropped
- in exit_mmap when there are no other mm users and locking is unnecessary
Introduce __vm_flags_mod to avoid assertions when the caller takes
responsibility for the required locking.
Pass a hint to untrack_pfn to conditionally use __vm_flags_mod for
flags modification to avoid assertion.

Link: https://lkml.kernel.org/r/20230126193752.297968-7-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 68f48381d7fdd1cbb9d88c37a4dfbb98ac78226d)

Bug: 161210518
Change-Id: I6ba44b03cde4c9b96d80423d41accab1effb71ac
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:58 +00:00
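
A hedged sketch of the distinction being introduced (asserting vs.
non-asserting modifier), assuming a plain vm_flags field; the real helpers
wrap the field access and also interact with the per-VMA lock:

  /* Lockless variant: no assertion; the caller guarantees nobody else can
   * observe the VMA (e.g. exit_mmap(), or the VMA was already isolated and
   * mmap_lock downgraded or dropped). */
  static inline void __vm_flags_mod(struct vm_area_struct *vma,
                                    vm_flags_t set, vm_flags_t clear)
  {
          vma->vm_flags = (vma->vm_flags | set) & ~clear;
  }

  /* Normal variant: assert that mmap_lock is held for writing. */
  static inline void vm_flags_mod(struct vm_area_struct *vma,
                                  vm_flags_t set, vm_flags_t clear)
  {
          mmap_assert_write_locked(vma->vm_mm);
          __vm_flags_mod(vma, set, clear);
  }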
Suren Baghdasaryan
5dd0547a3e UPSTREAM: mm: replace vma->vm_flags direct modifications with modifier calls
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 1c71222e5f2393b5ea1a41795c67589eea7e3490)

Bug: 161210518
Change-Id: Ifc352b487db109adab17dd33a83f5c7e68c0bbc6
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:57 +00:00
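
The conversion pattern, for illustration (VM_DONTEXPAND and VM_MAYWRITE are
just example flags):

  /* Before: direct modification, invisible to locking/tracking. */
  vma->vm_flags |= VM_DONTEXPAND;
  vma->vm_flags &= ~VM_MAYWRITE;

  /* After: modifier calls, which can assert the right lock and mark the
   * VMA as being written before the flags change. */
  vm_flags_set(vma, VM_DONTEXPAND);
  vm_flags_clear(vma, VM_MAYWRITE);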
Suren Baghdasaryan
bf16383ebd UPSTREAM: mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK
To simplify the usage of VM_LOCKED_CLEAR_MASK in vm_flags_clear(), replace
it with VM_LOCKED_MASK bitmask and convert all users.

Link: https://lkml.kernel.org/r/20230126193752.297968-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit e430a95a04efc557bc4ff9b3035c7c85aee5d63f)

Bug: 161210518
Change-Id: I17bbcc01a133511dbfaf3d82fbc4b25ecdd0b376
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-07 14:24:57 +00:00
Jaewon Kim
a390414140 ANDROID: vendor_hooks: add hooks for extra memory
Add vendor hooks for extra memory. If there is extra memory, it can
be accounted like other memory stats. One of the use cases could be
cleancache: if some RAM is used for cleancache, its free, cached, and
total sizes could be reported through these vendor hooks.

Bug: 283896254

Change-Id: Iad7330310528581f09842f45860f05dc84823f41
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
2023-06-07 01:06:25 +00:00
xiaofeng
508ca06639 ANDROID: vendor_hooks:vendor hook for control memory dirty rate
When IO pressure increases or the system performs dirty page
balancing, the frame rate of the foreground application may become
unstable. Therefore, a hook point is added to throttle buffered IO
at its source.

Bug: 262189942
Change-Id: I5214d611a388c5e8d87dc44ffde86ead1834ddff
Signed-off-by: xiaofeng <xiaofeng5@xiaomi.com>
2023-06-06 23:03:20 +00:00
Liam R. Howlett
2ea053d317 FROMGIT: userfaultfd: fix regression in userfaultfd_unmap_prep()
Android reported a performance regression in the userfaultfd unmap path.
A closer inspection of the userfaultfd_unmap_prep() change showed that a
second tree walk would be necessary in the reworked code.

Fix the regression by passing each VMA that will be unmapped through to
the userfaultfd_unmap_prep() function as they are added to the unmap list,
instead of re-walking the tree for the VMA.

Link: https://lkml.kernel.org/r/20230601015402.2819343-1-Liam.Howlett@oracle.com
Fixes: 69dbe6daf1 ("userfaultfd: use maple tree iterator to iterate VMAs")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit de53cc0be1c8b47d595682932beb3c11be9e4e5a
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)

Bug: 274059236
Change-Id: Ia189a5e98ffe86c4ca5ac3b686ada5f51826f2ed
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-06 20:05:25 +00:00
Liam R. Howlett
2f5f352e6a FROMGIT: BACKPORT: mm: avoid rewalk in mmap_region
If the iterator has moved to the previous entry, then step forward one
range, back to the gap.

Link: https://lkml.kernel.org/r/20230518145544.1722059-36-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit d3f028c7599ea2297dd630e1a6acaf4915c769d3
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)

Bug: 274059236
Change-Id: Ic45e095c728095d41647a704a287596d03489cdf
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-06 20:05:25 +00:00
Liam R. Howlett
5ff9438fe1 FROMGIT: BACKPORT: mm/mmap: change do_vmi_align_munmap() for maple tree iterator changes
The maple tree iterator clean up is incompatible with the way
do_vmi_align_munmap() expects it to behave.  Update the expected behaviour
to match now, since the change will work with the current iterator.

Link: https://lkml.kernel.org/r/20230518145544.1722059-23-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit a4d5b9fbaf42d668c1b5c7f231f79776a9419a91
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
[surenb: adjust for missing vma_iter_load]

Bug: 274059236
Change-Id: Id05ab617a3539f885a32c7d3031098a8c005fff8
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-06 20:05:25 +00:00
Liam R. Howlett
aede79b81e ANDROID: mm: Fix __vma_adjust() writes for the maple tree
Only write to the maple tree when necessary, which should only occur
when the VMA changes.  In the __vma_adjust() case, that is either the vma
when it is expanded, the next vma when the boundary expands into 'vma',
the write of 'insert', or the vma expanding/shrinking for shift_arg_pages().

The mas_preallocate() setup should track the intended write to ensure
the correct number of nodes are preallocated for the pending write.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: 61b337f650
[surenb: __vma_adjust was removed in 6.3, therefore these fixes are
not applicable upstream anymore. The patch was obtained from the
author's tree]

Bug: 274059236
Change-Id: I69d68a5b4ff11c40985f7b03b31eec4bb24dcbb6
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-06 20:05:25 +00:00
Liam R. Howlett
b802573f44 FROMLIST: BACKPORT: mm: Set up vma iterator for vma_iter_prealloc() calls
Set the correct limits for vma_iter_prealloc() calls so that the maple
tree can be smarter about how many nodes are needed.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

Link: https://lore.kernel.org/lkml/20230601021605.2823123-11-Liam.Howlett@oracle.com/
[surenb: remove vma_iter-related changes not present in 6.1 kernel]

Bug: 274059236
Change-Id: I05d1989e35b2e72b9346743f290da66739b3ee59
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-06 20:05:25 +00:00
Liam R. Howlett
e9fdabfc2a FROMLIST: BACKPORT: mm: Change do_vmi_align_munmap() side tree index
The majority of the calls to munmap a VMA are for a single vma.  The
maple tree is able to store a single entry at 0, with a size of 1 as a
pointer and avoid any allocations.  Change do_vmi_align_munmap() to
store the VMAs being munmap()'ed into a tree indexed by the count.  This
will leverage the ability to store the first entry without a node
allocation.

Storing the entries into a tree by the count and not the vma start and
end means changing the functions which iterate over the entries.  Update
unmap_vmas() and free_pgtables() to take a maple state and a tree end
address to support this functionality.

Passing through the same maple state to unmap_vmas() and free_pgtables()
means the state needs to be reset between calls.  This happens in the
static unmap_region() and exit_mmap().

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>

Link: https://lore.kernel.org/lkml/20230601021605.2823123-5-Liam.Howlett@oracle.com/
[surenb: skip changes passing maple state to unmap_vmas() and
free_pgtables()]

Bug: 274059236
Change-Id: If38cfecd51da884bcfdbdfdfbf955a0b338d3d60
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-06 20:05:25 +00:00
Liam R. Howlett
25bed2fdbc UPSTREAM: mm/mmap: remove preallocation from do_mas_align_munmap()
In preparation for passing the vma state through split, the pre-allocation
that occurs before the split has to be moved to after.  Since the
preallocation would then live right next to the store, just call store
instead of preallocating.  This effectively restores the potential error
path of splitting and not munmap'ing which pre-dates the maple tree.

Link: https://lkml.kernel.org/r/20230120162650.984577-12-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

(cherry picked from commit 0378c0a0e9e463b9e31b94fbbbc10f94b34225b6)

Bug: 274059236
Change-Id: I3539fb3a08043dae1bc8aaa6c7f285711a0b5548
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-06-06 20:05:25 +00:00
Sooyong Suk
aee36dd530 ANDROID: mm: add vendor hooks in madvise for swap entry
Add vendor hooks in madvise for swap entry
- android_vh_madvise_pageout_swap_entry
- android_vh_madvise_swapin_walk_pmd_entry
- android_vh_process_madvise_end

Bug: 284059805

Change-Id: Ic389244e343737a583286c20cadb6774efd8890c
Signed-off-by: Sooyong Suk <s.suk@samsung.com>
2023-06-05 23:12:28 +00:00
Peter Collingbourne
131714e34b FROMLIST: mm: Call arch_swap_restore() from unuse_pte()
We would like to move away from requiring architectures to restore
metadata from swap in the set_pte_at() implementation, as this is not only
error-prone but adds complexity to the arch-specific code. This requires
us to call arch_swap_restore() before calling swap_free() whenever pages
are restored from swap. We are currently doing so everywhere except in
unuse_pte(); do so there as well.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I68276653e612d64cde271ce1b5a99ae05d6bbc4f
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/all/20230523004312.1807357-3-pcc@google.com/
Change-Id: I68276653e612d64cde271ce1b5a99ae05d6bbc4f
Bug: 274890466
2023-06-05 21:53:19 +00:00
Peter Collingbourne
3805b879f5 FROMLIST: mm: Call arch_swap_restore() from do_swap_page()
Commit c145e0b47c ("mm: streamline COW logic in do_swap_page()") moved
the call to swap_free() before the call to set_pte_at(), which meant that
the MTE tags could end up being freed before set_pte_at() had a chance
to restore them. Fix it by adding a call to the arch_swap_restore() hook
before the call to swap_free().

Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965
Cc: <stable@vger.kernel.org> # 6.1
Fixes: c145e0b47c ("mm: streamline COW logic in do_swap_page()")
Reported-by: Qun-wei Lin (林群崴) <Qun-wei.Lin@mediatek.com>
Closes: https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/all/20230523004312.1807357-2-pcc@google.com/
Change-Id: I6470efa669e8bd2f841049b8c61020c510678965
Bug: 274890466
2023-06-05 21:53:19 +00:00
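
A sketch of the ordering constraint described in this and the previous
commit: architecture metadata (e.g. arm64 MTE tags) must be restored from
swap before the swap slot is freed. Simplified fragment of a swap-in path:

  /* The swap slot still holds the metadata, so restore it first. */
  arch_swap_restore(entry, folio);
  /* Only now may the slot be freed and potentially reused. */
  swap_free(entry);
  /* The PTE is installed later; set_pte_at() no longer has to be the place
   * where metadata is restored. */
  set_pte_at(vma->vm_mm, address, ptep, pte);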
xiaofeng
025b5a487b ANDROID: vendor_hooks:vendor hook for __alloc_pages_slowpath.
Add a vendor hook in __alloc_pages_slowpath() ahead of
__alloc_pages_direct_reclaim() and warn_alloc().

Bug: 243629905
Change-Id: Ieacc6cf79823c0bfacfdeec9afb55ed66f40d0b0
Signed-off-by: xiaofeng <xiaofeng5@xiaomi.com>
2023-06-05 16:38:22 +00:00
Dezhi Huang
3e2dc32f59 ANDROID: mm: create vendor hooks for memory reclaim
We try to adjust page reclaim operations based on the running task
and kernel memory pressure, so we want to add some vendor hooks to
kernel 6.1.

First, we add ANDROID_VENDOR_DATA to struct scan_control; special
operations are performed based on this special scan option. We measure
the importance of the current process in the system and obtain its
weight, which is recorded in ANDROID_VENDOR_DATA.

The hook trace_android_vh_modify_scan_control is added inside
modify_scan_control() to adjust reclaim operations based on memory
pressure.

The hook trace_android_vh_should_continue_reclaim is added inside
shrink_node() to decide whether page reclaim should continue based on
memory pressure.

The hook trace_android_vh_file_is_tiny_bypass is added in
prepare_scan_count() to decide whether file pages should be skipped,
depending on file refaults and memory pressure.

Bug: 279793370
Change-Id: I1efe9d3e866f37b0295c7cd94ec8ca0117a9bd4a
Signed-off-by: Dezhi Huang <huangdezhi@hihonor.com>
2023-06-05 16:31:49 +00:00
Zhenhua Huang
78fe8913d1 UPSTREAM: mm,kfence: decouple kfence from page granularity mapping judgement
Kfence only needs its pool to be mapped at page granularity if it is
inited early. The previous judgement was a bit over-protective. In [1],
Mark suggested to "just map the KFENCE region a page granularity". So
decouple it from the judgement and do page granularity mapping for the
kfence pool only. Note that late init of the kfence pool still requires
page granularity mapping.

Page granularity mapping in theory costs more memory (2M per 1GB) on the
arm64 platform. Here is what I tested on QEMU (emulated 1GB RAM) with
gki_defconfig, with rodata protection turned off:
Before:
[root@liebao ]# cat /proc/meminfo
MemTotal:         999484 kB
After:
[root@liebao ]# cat /proc/meminfo
MemTotal:        1001480 kB

To implement this, also relocate the kfence pool allocation to before the
linear mapping is set up: arm64_kfence_alloc_pool() allocates the physical
address, and __kfence_pool is set after the linear mapping is set up.

LINK: [1] https://lore.kernel.org/linux-arm-kernel/Y+IsdrvDNILA59UN@FVFF77S0Q05N/
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/1679066974-690-1-git-send-email-quic_zhenhuah@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>

BUG: 284812202
Change-Id: I8e7c565d3f4d6349a028a6a060259d62cf5beee7
(cherry picked from commit bfa7965b33ab79fc3b2f8adc14704075fe2416cd)
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
2023-05-31 17:22:42 +00:00
Tetsuo Handa
8035e57ec7 UPSTREAM: mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
commit 1007843a91909a4995ee78a538f62d8665705b66 upstream.

syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.

One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.

  CPU0
  ----
  __build_all_zonelists() {
    write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
    // e.g. timer interrupt handler runs at this moment
      some_timer_func() {
        kmalloc(GFP_ATOMIC) {
          __alloc_pages_slowpath() {
            read_seqbegin(&zonelist_update_seq) {
              // spins forever because zonelist_update_seq.seqcount is odd
            }
          }
        }
      }
    // e.g. timer interrupt handler finishes
    write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
  }

This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests.  But Michal Hocko does not know whether we should go with this
approach.

Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.

  CPU0                                   CPU1
  ----                                   ----
  pty_write() {
    tty_insert_flip_string_and_push_buffer() {
                                         __build_all_zonelists() {
                                           write_seqlock(&zonelist_update_seq);
                                           build_zonelists() {
                                             printk() {
                                               vprintk() {
                                                 vprintk_default() {
                                                   vprintk_emit() {
                                                     console_unlock() {
                                                       console_flush_all() {
                                                         console_emit_next_record() {
                                                           con->write() = serial8250_console_write() {
      spin_lock_irqsave(&port->lock, flags);
      tty_insert_flip_string() {
        tty_insert_flip_string_fixed_flag() {
          __tty_buffer_request_room() {
            tty_buffer_alloc() {
              kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
                __alloc_pages_slowpath() {
                  zonelist_iter_begin() {
                    read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
                                                             spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
                    }
                  }
                }
              }
            }
          }
        }
      }
      spin_unlock_irqrestore(&port->lock, flags);
                                                             // message is printed to console
                                                             spin_unlock_irqrestore(&port->lock, flags);
                                                           }
                                                         }
                                                       }
                                                     }
                                                   }
                                                 }
                                               }
                                             }
                                           }
                                           write_sequnlock(&zonelist_update_seq);
                                         }
    }
  }

This deadlock scenario can be eliminated by

  preventing interrupt context from calling kmalloc(GFP_ATOMIC)

and

  preventing printk() from calling console_flush_all()

while zonelist_update_seq.seqcount is odd.

Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by

  disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)

and

  disabling synchronous printk() in order to avoid console_flush_all()

.

As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced.  Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC)
inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...

Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp
Fixes: 3d36424b3b ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
  Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Change-Id: Ifc0c6ed9be6d36166367811ad412bedc66ed713e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Patrick Daly <quic_pdaly@quicinc.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b528537d13)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-05-31 16:27:26 +00:00
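
A sketch of the writer-side pattern the fix describes: disable local
interrupts and defer printk() for the short window in which
zonelist_update_seq is held. Simplified from the description above:

  static void rebuild_zonelists_sketch(void)
  {
          unsigned long flags;

          /* Keep IRQ handlers on this CPU from doing GFP_ATOMIC allocations
           * that would spin in zonelist_iter_begin() while the seqcount is
           * odd. */
          local_irq_save(flags);
          /* Keep printk() from flushing to consoles (and taking port->lock)
           * while the seqlock is held. */
          printk_deferred_enter();

          write_seqlock(&zonelist_update_seq);
          /* ... rebuild the zonelists; any printk() here is deferred ... */
          write_sequnlock(&zonelist_update_seq);

          printk_deferred_exit();
          local_irq_restore(flags);
  }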
Mel Gorman
fa3ef799ad UPSTREAM: mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
commit 4d73ba5fa710fe7d432e0b271e6fecd252aef66e upstream.

A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
taking an excessive amount of time for large amounts of memory.  Further
testing of huge page allocation showed that the cost is linear, i.e.  if
allocating 1G pages in batches of 10 then the time to allocate nr_hugepages from
10->20->30->etc increases linearly even though 10 pages are allocated at
each step.  Profiles indicated that much of the time is spent checking the
validity within already existing huge pages and then attempting a
migration that fails after isolating the range, draining pages and a whole
lot of other useless work.

Commit eb14d4eefd ("mm,page_alloc: drop unnecessary checks from
pfn_range_valid_contig") removed two checks, one which ignored huge pages
for contiguous allocations as huge pages can sometimes migrate.  While
there may be value on migrating a 2M page to satisfy a 1G allocation, it's
potentially expensive if the 1G allocation fails and it's pointless to try
moving a 1G page for a new 1G allocation or scan the tail pages for valid
PFNs.

Reintroduce the PageHuge check and assume any contiguous region with
hugetlbfs pages is unsuitable for a new 1G allocation.

The hpagealloc test allocates huge pages in batches and reports the
average latency per page over time.  This test happens just after boot
when fragmentation is not an issue.  Units are in milliseconds.

hpagealloc
                               6.3.0-rc6              6.3.0-rc6              6.3.0-rc6
                                 vanilla   hugeallocrevert-v1r1   hugeallocsimple-v1r2
Min       Latency       26.42 (   0.00%)        5.07 (  80.82%)       18.94 (  28.30%)
1st-qrtle Latency      356.61 (   0.00%)        5.34 (  98.50%)       19.85 (  94.43%)
2nd-qrtle Latency      697.26 (   0.00%)        5.47 (  99.22%)       20.44 (  97.07%)
3rd-qrtle Latency      972.94 (   0.00%)        5.50 (  99.43%)       20.81 (  97.86%)
Max-1     Latency       26.42 (   0.00%)        5.07 (  80.82%)       18.94 (  28.30%)
Max-5     Latency       82.14 (   0.00%)        5.11 (  93.78%)       19.31 (  76.49%)
Max-10    Latency      150.54 (   0.00%)        5.20 (  96.55%)       19.43 (  87.09%)
Max-90    Latency     1164.45 (   0.00%)        5.53 (  99.52%)       20.97 (  98.20%)
Max-95    Latency     1223.06 (   0.00%)        5.55 (  99.55%)       21.06 (  98.28%)
Max-99    Latency     1278.67 (   0.00%)        5.57 (  99.56%)       22.56 (  98.24%)
Max       Latency     1310.90 (   0.00%)        8.06 (  99.39%)       26.62 (  97.97%)
Amean     Latency      678.36 (   0.00%)        5.44 *  99.20%*       20.44 *  96.99%*

                   6.3.0-rc6   6.3.0-rc6   6.3.0-rc6
                     vanilla   revert-v1   hugeallocfix-v2
Duration User           0.28        0.27        0.30
Duration System       808.66       17.77       35.99
Duration Elapsed      830.87       18.08       36.33

The vanilla kernel performs poorly, taking up to 1.3 seconds to allocate a
huge page and almost 10 minutes in total to run the test.  Reverting the
problematic commit reduces the worst case to 8ms, and with this patch it
is 26ms.  This patch fixes the main issue by skipping huge pages but
leaves the page_count() check out because a page with an elevated count
can potentially migrate.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=217022
Link: https://lkml.kernel.org/r/20230414141429.pwgieuwluxwez3rj@techsingularity.net
Fixes: eb14d4eefd ("mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig")
Change-Id: I552f0631f15e41038219e207c994fa7702b269fa
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Yuanxi Liu <y.liu@naruida.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 059f24aff6)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-05-31 16:27:26 +00:00
Alexander Potapenko
f800df6e1f UPSTREAM: mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
commit 47ebd0310e89c087f56e58c103c44b72a2f6b216 upstream.

As reported by Dipanjan Das, when KMSAN is used together with kernel fault
injection (or, generally, even without the latter), calls to kcalloc() or
__vmap_pages_range_noflush() may fail, leaving the metadata mappings for
the virtual mapping in an inconsistent state.  When these metadata
mappings are accessed later, the kernel crashes.

To address the problem, we return a non-zero error code from
kmsan_vmap_pages_range_noflush() in the case of any allocation/mapping
failure inside it, and make vmap_pages_range_noflush() return an error if
KMSAN fails to allocate the metadata.

This patch also removes KMSAN_WARN_ON() from vmap_pages_range_noflush(),
as these allocation failures are not fatal anymore.
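
As a rough sketch of the propagation pattern described above (simplified;
the argument lists mirror the vmalloc helpers but the real function bodies
differ):

    /* The caller now checks the KMSAN return value instead of warning and
     * continuing with inconsistent metadata mappings. */
    int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
                                 pgprot_t prot, struct page **pages,
                                 unsigned int page_shift)
    {
            int ret;

            ret = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
                                                 page_shift);
            if (ret)
                    return ret;

            return __vmap_pages_range_noflush(addr, end, prot, pages,
                                              page_shift);
    }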

Link: https://lkml.kernel.org/r/20230413131223.4135168-1-glider@google.com
Fixes: b073d7f8ae ("mm: kmsan: maintain KMSAN metadata for page operations")
Change-Id: I2a50da1c7cc438a30026b2b18d425fff2ea349b6
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
  Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
Reviewed-by: Marco Elver <elver@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit bd6f3421a5)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-05-31 15:20:12 +00:00
Alexander Potapenko
843caf6daa UPSTREAM: mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
commit fdea03e12aa2a44a7bb34144208be97fc25dfd90 upstream.

Similarly to kmsan_vmap_pages_range_noflush(), kmsan_ioremap_page_range()
must also properly handle allocation/mapping failures.  In the case of
such, it must clean up the already created metadata mappings and return an
error code, so that the error can be propagated to ioremap_page_range().
Without doing so, KMSAN may silently fail to bring the metadata for the
page range into a consistent state, which will result in user-visible
crashes when trying to access them.
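
A sketch of the unwind-on-failure pattern this implies (the helper names
below are purely illustrative, not the actual KMSAN internals):

    int kmsan_ioremap_page_range(unsigned long start, unsigned long end,
                                 phys_addr_t phys_addr, pgprot_t prot,
                                 unsigned int page_shift)
    {
            unsigned long addr;
            int err;

            /* Metadata is mapped page by page, so a failure part-way
             * through must undo whatever was already mapped before the
             * error is returned to ioremap_page_range(). */
            for (addr = start; addr < end; addr += PAGE_SIZE) {
                    err = kmsan_map_metadata_page(addr, prot); /* hypothetical */
                    if (err) {
                            kmsan_unmap_metadata_range(start, addr); /* hypothetical */
                            return err;
                    }
            }
            return 0;
    }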

Link: https://lkml.kernel.org/r/20230413131223.4135168-2-glider@google.com
Fixes: b073d7f8ae ("mm: kmsan: maintain KMSAN metadata for page operations")
Change-Id: Iae12299853f5f39b473c509d0ad63ac20d0425e7
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
  Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/
Reviewed-by: Marco Elver <elver@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 433a7ecaed)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-05-30 17:15:55 +00:00
Naoya Horiguchi
ac51e1f090 UPSTREAM: mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO
commit 4737edbbdd4958ae29ca6a310a6a2fa4e0684b01 upstream.

split_huge_page_to_list() WARNs when called for huge zero pages, which
sounds to me too harsh because it does not imply a kernel bug; it just
notifies the event to admins.  On the other hand, the warning is
considered critical by syzkaller and makes its testing less efficient,
which seems to me harmful.

So replace the VM_WARN_ON_ONCE_FOLIO with pr_warn_ratelimited.
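
In other words, the check in split_huge_page_to_list() now looks roughly
like this (a sketch; the exact helper and message may differ between
kernel versions):

    if (is_huge_zero_page(&folio->page)) {
            pr_warn_ratelimited("Called split_huge_page for huge zero page\n");
            return -EBUSY;
    }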

Link: https://lkml.kernel.org/r/20230406082004.2185420-1-naoya.horiguchi@linux.dev
Fixes: 478d134e95 ("mm/huge_memory: do not overkill when splitting huge_zero_page")
Change-Id: Ib41a08bf87cc55ce240a63eddf5609aa7c8976ef
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: syzbot+07a218429c8d19b1fb25@syzkaller.appspotmail.com
  Link: https://lore.kernel.org/lkml/000000000000a6f34a05e6efcd01@google.com/
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Xu Yu <xuyu@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e8a7bdb6f7)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-05-30 17:15:55 +00:00
David Hildenbrand
12132bd611 UPSTREAM: mm/userfaultfd: fix uffd-wp handling for THP migration entries
commit 24bf08c4376be417f16ceb609188b16f461b0443 upstream.

Looks like what we fixed for hugetlb in commit 44f86392bdd1 ("mm/hugetlb:
fix uffd-wp handling for migration entries in
hugetlb_change_protection()") similarly applies to THP.

Setting/clearing uffd-wp on THP migration entries is not implemented
properly.  Further, while removing migration PMDs considers the uffd-wp
bit, inserting migration PMDs does not consider the uffd-wp bit.

We have to set/clear independently of the migration entry type in
change_huge_pmd() and properly copy the uffd-wp bit in
set_pmd_migration_entry().
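
A sketch of the change_huge_pmd() side (assuming the generic pmd swap
uffd-wp helpers; the write->read migration entry handling is omitted and
the actual diff also touches set_pmd_migration_entry()):

    /* Apply the uffd-wp request to a PMD migration entry regardless of
     * the migration entry type. */
    if (is_swap_pmd(*pmd)) {
            pmd_t newpmd = *pmd;

            if (uffd_wp)
                    newpmd = pmd_swp_mkuffd_wp(newpmd);
            else if (uffd_wp_resolve)
                    newpmd = pmd_swp_clear_uffd_wp(newpmd);

            if (!pmd_same(*pmd, newpmd))
                    set_pmd_at(mm, addr, pmd, newpmd);
    }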

Verified, using a simple reproducer that triggers migration of a THP,
that set_pmd_migration_entry() no longer loses the uffd-wp bit.

Link: https://lkml.kernel.org/r/20230405160236.587705-2-david@redhat.com
Fixes: f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
Change-Id: I263a9fd8a6695f546fe5c5279a439f4f1c151c48
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit cc647e05db)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-05-30 17:15:55 +00:00
Peter Xu
ab721b09b1 UPSTREAM: mm/khugepaged: check again on anon uffd-wp during isolation
commit dd47ac428c3f5f3bcabe845f36be870fe6c20784 upstream.

Khugepaged collapses an anonymous THP in two rounds of scans.  The 2nd
round is done in __collapse_huge_page_isolate() after
hpage_collapse_scan_pmd(), during which all the locks are released
temporarily.  It means the pgtable can change during this phase before the
2nd round starts.

It's logically possible some ptes got wr-protected during this phase, and
we can erroneously collapse a THP without noticing some ptes are
wr-protected by userfault.  e1e267c792 wanted to avoid it but it only
did that for the 1st phase, not the 2nd phase.

Since __collapse_huge_page_isolate() happens after a round of small page
swapins, we don't need to worry about any !present ptes - if one existed,
khugepaged would already have bailed out.  So we only need to check
present ptes with the uffd-wp bit set there.
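
Roughly, the added check in __collapse_huge_page_isolate() looks like this
(a sketch inside the existing pte walk; the scan result code follows
khugepaged's naming convention):

    /* A present pte that is still uffd-wp protected means collapsing
     * would lose the write-protect marker, so abort this isolation. */
    if (pte_uffd_wp(pteval)) {
            result = SCAN_PTE_UFFD_WP;
            goto out;
    }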

This is something I found but never had a reproducer for.  I thought it
was what caused a bug in Muhammad's recent pagemap ioctl work, but it
turns out that was a userspace bug instead.  However, this still seems to
be a real bug even with a very small race window, so it is still worth
fixing and copying to stable.

Link: https://lkml.kernel.org/r/20230405155120.3608140-1-peterx@redhat.com
Fixes: e1e267c792 ("khugepaged: skip collapse if uffd-wp detected")
Change-Id: Iab7f0ac5b9b6d055485ca244b2fa1e13f0dbc570
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 519dbe737f)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-05-30 14:32:04 +00:00
Kalesh Singh
500484f5be BACKPORT: FROMGIT: Multi-gen LRU: fix workingset accounting
On Android app cycle workloads, MGLRU showed a significant reduction in
workingset refaults although pgpgin/pswpin remained relatively unchanged.
This indicated MGLRU may be undercounting workingset refaults.

This has an impact on userspace programs, like Android's LMKD, that
monitor workingset refault statistics to detect thrashing.

It was found that refaults were only accounted if the MGLRU shadow entry
was for a recently evicted folio.  However, recently evicted folios should
be accounted as workingset activation, and refaults should be accounted
regardless of recency.

Fix MGLRU's workingset refault and activation accounting to more closely
match that of the conventional active/inactive LRU.
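
A sketch of the accounting rule described above (the function and flag
below are illustrative, not the exact mm/workingset.c code): count a
refault for every matching shadow entry, and count an activation only when
the folio was evicted recently.

    static void lru_gen_account_refault(struct lruvec *lruvec,
                                        struct folio *folio,
                                        bool recently_evicted) /* illustrative */
    {
            int type = folio_is_file_lru(folio);
            int delta = folio_nr_pages(folio);

            /* Refaults are counted regardless of recency... */
            mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);

            /* ...while activation is only counted for recent evictions. */
            if (recently_evicted) {
                    folio_set_active(folio);
                    folio_set_workingset(folio);
                    mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type,
                                     delta);
            }
    }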

Link: https://lkml.kernel.org/r/20230523205922.3852731-1-kaleshsingh@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 02ad728453d2ddb09d7ce5e59854ebb27544d488 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 284043217
[ Kalesh Singh - Fix conflicts in mm/workingset.c ]
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I6d42cca9064e66099fbbc20aa2143961f84b2003
2023-05-27 00:38:36 +00:00
Liujie Xie
6f3353ca09 ANDROID: vendor_hooks: Add hook in shrink_node_memcgs
Add a vendor hook in shrink_node_memcgs() to decide whether
to skip memory reclamation for a memcg.
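
A sketch of how such a vendor hook is typically wired up (the prototype
here is illustrative; the real declaration lives under
include/trace/hooks/ and the real parameters may differ):

    DECLARE_HOOK(android_vh_shrink_node_memcgs,
            TP_PROTO(struct mem_cgroup *memcg, bool *skip),
            TP_ARGS(memcg, skip));

    /* In shrink_node_memcgs(), before reclaiming from a memcg: */
    bool skip = false;

    trace_android_vh_shrink_node_memcgs(memcg, &skip);
    if (skip)
            continue;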

Bug: 226482420
Signed-off-by: Liujie Xie <xieliujie@oppo.com>
(cherry picked from commit b7ea1c49876197a3b5f17f7bb2699c5594f0b57e)

Change-Id: I925856353e63c5a821027de4f8476c833e21b982
Signed-off-by: lvwenhuan <lvwenhuan@oppo.com>
2023-05-25 21:44:09 +00:00
Liujie Xie
573ba7b6e6 ANDROID: vendor_hooks: Add hooks for memory when debug
Add vendor hooks for recording memory usage.

Vendor modules allocate and manage memory themselves.

This memory might not be included in kernel memory
statistics. Also, detailed references and vendor-specific
information are managed only inside the modules. When
problems such as memory leaks occur, this information
should be shown in real time.

Bug: 182443489
Bug: 234407991
Bug: 277799025

Signed-off-by: Liujie Xie <xieliujie@oppo.com>
Change-Id: I62d8bb2b6650d8b187b433f97eb833ef0b784df1
Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
2023-05-25 21:06:40 +00:00
Dezhi Huang
94b540c38d ANDROID: mm: create vendor hooks for do_shrink_slab()
The hook function trace_android_vh_do_shrink_slab is added inside
do_shrink_slab() to change the number of pages to be reclaimed by
the kernel.
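
A sketch of the pattern (the hook prototype is illustrative; only the hook
name comes from this commit):

    DECLARE_HOOK(android_vh_do_shrink_slab,
            TP_PROTO(struct shrinker *shrinker, struct shrink_control *sc,
                     long *freeable),
            TP_ARGS(shrinker, sc, freeable));

    /* In do_shrink_slab(), after computing how many objects are freeable,
     * let the vendor module adjust the reclaim target: */
    trace_android_vh_do_shrink_slab(shrinker, shrinkctl, &freeable);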

Bug: 279793370
Change-Id: I7c0b955be97f841c69bc99a152b59ed9823707ed
Signed-off-by: Dezhi Huang <huangdezhi@hihonor.com>
2023-05-24 21:12:43 +00:00