While testing the split PMD path with lockdep enabled I got an "Invalid
wait context" error caused by split_huge_page_to_list() trying to lock
anon_vma->rwsem while inside an RCU read section. The issue is due to
move_pages_pte() calling split_folio() under the RCU read lock. Fix this by
unmapping the PTEs and exiting the RCU read section before splitting the folio
and then retrying. The same retry pattern is used when locking the folio
or anon_vma in this function. After splitting the large folio we unlock
and release it because after the split the old folio might not be the one
that contains the src_addr.
Link: https://lkml.kernel.org/r/20240102233256.1077959-1-surenb@google.com
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 982ae058b2f08f576e4f3d4055f8916ba789f3d4)
Bug: 274911254
Change-Id: I382c6631d821b0ed26d9b15afa78a417dafaeb2e
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Implement the uABI of UFFDIO_MOVE ioctl.
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [2].
We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads).
More details of the usecase are explained in [2]. Furthermore,
UFFDIO_MOVE enables moving swapped-out pages without touching them within
the same vma. Today, it can only be done by mremap, however it forces
splitting the vma.
[1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
[2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/
Update for the ioctl_userfaultfd(2) manpage:
UFFDIO_MOVE
(Since Linux xxx) Move a continuous memory chunk into the
userfault registered range and optionally wake up the blocked
thread. The source and destination addresses and the number of
bytes to move are specified by the src, dst, and len fields of
the uffdio_move structure pointed to by argp:
struct uffdio_move {
__u64 dst; /* Destination of move */
__u64 src; /* Source of move */
__u64 len; /* Number of bytes to move */
__u64 mode; /* Flags controlling behavior of move */
__s64 move; /* Number of bytes moved, or negated error */
};
The following value may be bitwise ORed in mode to change the
behavior of the UFFDIO_MOVE operation:
UFFDIO_MOVE_MODE_DONTWAKE
Do not wake up the thread that waits for page-fault
resolution
UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
Allow holes in the source virtual range that is being moved.
When not specified, the holes will result in ENOENT error.
When specified, the holes will be accounted as successfully
moved memory. This is mostly useful to move hugepage aligned
virtual regions without knowing if there are transparent
hugepages in the regions or not, but preventing the risk of
having to split the hugepage during the operation.
The move field is used by the kernel to return the number of
bytes that was actually moved, or an error (a negated errno-
style value). If the value returned in move doesn't match the
value that was specified in len, the operation fails with the
error EAGAIN. The move field is output-only; it is not read by
the UFFDIO_MOVE operation.
The operation may fail for various reasons. Usually, remapping of
pages that are not exclusive to the given process fails; once KSM
has deduplicated pages or fork() has COW-shared pages with child
processes, they are no longer exclusive. Further, the kernel might
only perform lightweight checks for detecting whether the pages are
exclusive, and return -EBUSY in case that check fails.
To make the operation more likely to succeed, KSM should be
disabled, fork() should be avoided or MADV_DONTFORK should be
configured for the source VMA before fork().
This ioctl(2) operation returns 0 on success. In this case, the
entire area was moved. On error, -1 is returned and errno is
set to indicate the error. Possible errors include:
EAGAIN The number of bytes moved (i.e., the value returned in
the move field) does not equal the value that was
specified in the len field.
EINVAL Either dst or len was not a multiple of the system page
size, or the range specified by src and len or dst and len
was invalid.
EINVAL An invalid bit was specified in the mode field.
ENOENT
The source virtual memory range has unmapped holes and
UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.
EEXIST
The destination virtual memory range is fully or partially
mapped.
EBUSY
The pages in the source virtual memory range are either
pinned or not exclusive to the process. The kernel might
only perform lightweight checks for detecting whether the
pages are exclusive. To make the operation more likely to
succeed, KSM should be disabled, fork() should be avoided
or MADV_DONTFORK should be configured for the source virtual
memory area before fork().
ENOMEM Allocating memory needed for the operation failed.
ESRCH
The target process has exited at the time of a UFFDIO_MOVE
operation.
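As an illustration of the return convention described above, the following
user-space sketch classifies the outcome of one UFFDIO_MOVE call from the
len and move fields, and advances the request so the caller can retry after
a partial move. It is a minimal sketch and not part of this patch: struct
uffdio_move_sketch is a local mirror of the struct quoted above, and
classify_move() and advance_after_partial() are hypothetical helper names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local mirror of struct uffdio_move as quoted above (illustration only). */
struct uffdio_move_sketch {
	uint64_t dst;
	uint64_t src;
	uint64_t len;
	uint64_t mode;
	int64_t move;
};

/*
 * Hypothetical helper: map the kernel-filled move field to a result.
 * Returns 0 if the whole range was moved, the negated errno-style value
 * if move holds an error, or -EAGAIN for a partial move.
 */
static int classify_move(const struct uffdio_move_sketch *m)
{
	if (m->move < 0)
		return (int)m->move;
	if ((uint64_t)m->move == m->len)
		return 0;
	return -EAGAIN;
}

/* Hypothetical helper: skip the bytes already moved before retrying. */
static void advance_after_partial(struct uffdio_move_sketch *m)
{
	assert(m->move >= 0 && (uint64_t)m->move <= m->len);
	m->dst += (uint64_t)m->move;
	m->src += (uint64_t)m->move;
	m->len -= (uint64_t)m->move;
	m->move = 0;
}
```

A caller would typically loop: issue the ioctl, stop on 0 or on a real
error, and on -EAGAIN call advance_after_partial() and try again.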
Link: https://lkml.kernel.org/r/20231206103702.3873743-3-surenb@google.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit adef440691bab824e39c1b17382322d195e1fab0)
Conflicts:
mm/huge_memory.c
mm/userfaultfd.c
1. Add vma parameter in mmu_notifier_range_init() calls.
2. Replace folio_move_anon_rmap() with page_move_anon_rmap().
3. Remove vma parameter in pmd_mkwrite() calls.
4. Replace pte_offset_map_nolock() with pte_offset_map()+pte_lockptr()
combo.
5. Remove VM_SHADOW_STACK in vma_move_compatible().
6. Replace pmdp_get_lockless() with pmd_read_atomic().
Bug: 274911254
Change-Id: I1116f15a395f1a8bac176906f7f9c2411e59dc54
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Patch series "userfaultfd move option", v6.
This patch series introduces UFFDIO_MOVE feature to userfaultfd, which has
long been implemented and maintained by Andrea in his local tree [1], but
was not upstreamed due to lack of use cases where this approach would be
better than allocating a new page and copying the contents. Previous
upstreaming attempts can be found at [6] and [7].
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [3]. We see over 40% reduction (on a Google pixel 6 device) in
the compacting thread's completion time by using UFFDIO_MOVE vs.
UFFDIO_COPY. This was measured using a benchmark that emulates a heap
compaction implementation using userfaultfd (to allow concurrent accesses
by application threads). More details of the usecase are explained in
[3].
Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
touching them within the same vma. Today, it can only be done by mremap,
however it forces splitting the vma.
TODOs for follow-up improvements:
- cross-mm support. Known differences from single-mm and missing pieces:
- memcg recharging (might need to isolate pages in the process)
- mm counters
- cross-mm deposit table moves
- cross-mm test
- document the address space where src and dest reside in struct
uffdio_move
- TLB flush batching. Will require extensive changes to PTL locking in
move_pages_pte(). OTOH that might let us reuse parts of mremap code.
This patch (of 5):
So far, folio_move_anon_rmap() has only been used to move a folio to a
different anon_vma after fork(), whereby the root anon_vma stayed
unchanged. For that, it was sufficient to hold the folio lock when
calling folio_move_anon_rmap().
However, we want to make use of folio_move_anon_rmap() to move folios
between VMAs that have a different root anon_vma. As folio_referenced()
performs an RMAP walk without holding the folio lock but only holding the
anon_vma in read mode, holding the folio lock is insufficient.
When moving to an anon_vma with a different root anon_vma, we'll have to
hold both, the folio lock and the anon_vma lock in write mode.
Consequently, whenever we succeeded in folio_lock_anon_vma_read() to
read-lock the anon_vma, we have to re-check if the mapping was changed in
the meantime. If that was the case, we have to retry.
Note that folio_move_anon_rmap() must only be called if the anon page is
exclusive to a process, and must not be called on KSM folios.
This is a preparation for UFFDIO_MOVE, which will hold the folio lock, the
anon_vma lock in write mode, and the mmap_lock in read mode.
Link: https://lkml.kernel.org/r/20231206103702.3873743-1-surenb@google.com
Link: https://lkml.kernel.org/r/20231206103702.3873743-2-surenb@google.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: kernel-team@android.com
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 880a99b60d467eefd96322e27b0a8c0b805dfa43)
Bug: 274911254
Change-Id: Iad9619c0273e050af26356f66ae9fc88b56d68bd
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
The current implementation of the mark_victim tracepoint provides only the
process ID (pid) of the victim process. This limitation poses challenges
for userspace tools requiring real-time OOM analysis and intervention.
Although this information is available from the kernel logs, it's not
the appropriate format to provide OOM notifications. In Android, BPF
programs are used with the mark_victim trace events to notify userspace of
an OOM kill. For consistency, update the trace event to include the same
information about the OOMed victim as the kernel logs.
- UID
In Android each installed application has a unique UID. Including
the `uid` assists in correlating OOM events with specific apps.
- Process Name (comm)
Enables identification of the affected process.
- OOM Score
Will allow userspace to get additional insight into the relative kill
priority of the OOM victim. In Android, the oom_score_adj is used to
categorize app state (foreground, background, etc.), which aids in
analyzing user-perceptible impacts of OOM events [1].
- Total VM, RSS Stats, and pgtables
Amount of memory used by the victim that will, potentially, be freed up
by killing it.
[1] 246dc8fc95:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=188-283
Signed-off-by: Carlos Galo <carlosgalo@google.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 331214192
(cherry picked from commit 72ba14deb40a9e9668ec5e66a341ed657e5215c2)
[ carlosgalo: Manually added struct cred change in mark_oom_victim function ]
Link: https://lore.kernel.org/all/20240223173258.174828-1-carlosgalo@google.com/
Change-Id: I24f503ceca04b83f8abf42fcd04a3409e17be6b5
This reverts commit 6b4c816d17.
Reason for revert: b/331214192
Change-Id: I9f4f56de7d65cee19c7015b0cb1bda339d82a5f5
Signed-off-by: Carlos Galo <carlosgalo@google.com>
Export two functions to help memory reclaim.
Bug: 323406883
Change-Id: I099d414c9b3648224ab077b9929c6622b2d4228a
Signed-off-by: Minchan Kim <minchan@google.com>
This patch adds two exported functions to set/get reclaim parameters.
Bug: 323406883
Change-Id: I8c29073dba3e77cb5db7f45b640518deae04b8a9
Signed-off-by: Minchan Kim <minchan@google.com>
__alloc_pages_direct_reclaim() is called from slowpath allocation where
high atomic reserves can be unreserved after there is progress in
reclaim and yet no suitable page is found. Later should_reclaim_retry()
gets called from slow path allocation to decide if the reclaim needs to be
retried before the OOM kill path is taken.
should_reclaim_retry() checks the available (reclaimable + free pages)
memory against the min wmark levels of a zone and returns:
a) true, if it is above the min wmark so that slow path allocation will
do the reclaim retries.
b) false, otherwise, so that slowpath allocation takes the OOM kill path.
should_reclaim_retry() can also unreserve the high atomic reserves **but
only after all the reclaim retries are exhausted.**
In a case where there is almost no reclaimable memory and the free pages
consist mostly of high atomic reserves that the allocation context can't
use, the available memory falls below the min wmark levels, hence false
is returned from should_reclaim_retry(), leading the allocation request
to take the OOM kill path. This can turn into an early OOM kill if the
high atomic reserves are holding a lot of free memory and unreserving
them is not attempted.
(early)OOM is encountered on a VM with the below state:
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
local_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 7752kB
Per the above log, the ~7MB of free memory that exists in the high atomic
reserves is not freed up before falling back to the OOM kill path.
Fix it by trying to unreserve the high atomic reserves in
should_reclaim_retry() before __alloc_pages_direct_reclaim() can fall back
to the OOM kill path.
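The decision described above can be modelled in user space as follows. This
is a deliberately simplified sketch, not the kernel code: struct zone_sketch
and the helper names are hypothetical, lowmem_reserve and allocation order
are ignored, and the request is assumed to be one that cannot use the high
atomic reserves.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical model of one zone; all values in kB. */
struct zone_sketch {
	long free_kb;
	long reclaimable_kb;
	long reserved_highatomic_kb;
	long min_wmark_kb;
};

/* Free plus reclaimable memory that a non-highatomic request may count on. */
static long available_kb(const struct zone_sketch *z)
{
	return z->free_kb + z->reclaimable_kb - z->reserved_highatomic_kb;
}

static bool above_min_wmark(const struct zone_sketch *z)
{
	return available_kb(z) > z->min_wmark_kb;
}

/*
 * Model of the fixed decision: before reporting "no retry" (which leads
 * to the OOM kill path), release the high atomic reserves and ask for a
 * retry if anything was released.
 */
static bool should_retry_sketch(struct zone_sketch *z)
{
	assert(z->reserved_highatomic_kb >= 0);
	if (above_min_wmark(z))
		return true;
	if (z->reserved_highatomic_kb > 0) {
		z->reserved_highatomic_kb = 0;
		return true;
	}
	return false;
}
```

With the values from the log above (free 7728kB, reserved_highatomic
8192kB, min 804kB, and, as an assumption, the roughly 52kB on the anon and
file LRU lists counted as reclaimable), the model starts below the min
wmark and only gets above it once the reserves are released.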
Bug: 332219324
Link: https://lkml.kernel.org/r/1700823445-27531-1-git-send-email-quic_charante@quicinc.com
Fixes: 0aaa29a56e ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
(cherry picked from commit ac3f3b0a55518056bc80ed32a41931c99e1f7d81)
Change-Id: I432d4ac4864d401a4413f6b2ef902625766f8070
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Highatomic reserves are set to roughly 1% of zone for maximum and a
pageblock size for minimum. Encountered a system with the below
configuration:
Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB
reserved_highatomic:8192KB managed:49224kB
On such systems, even a single pageblock sets the highatomic reserves to
~8% of the zone memory. This high value can easily exert pressure on
the zone.
Per discussion with Michal and Mel, it is not very useful to reserve
memory for highatomic allocations on such small systems[1]. Since the
minimum size for high atomic reserves is always going to be a pageblock
size, don't reserve memory for high atomic allocations if 1% of the zone
managed pages is below the pageblock size. Thanks Michal for this
suggestion[2].
Since no memory is then reserved for high atomic allocations, this patch
can be reverted if the respective allocation failures are seen.
[1] https://lore.kernel.org/linux-mm/20231117161956.d3yjdxhhm4rhl7h2@techsingularity.net/
[2] https://lore.kernel.org/linux-mm/ZVYRJMUitykepLRy@tiehlicka/
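The rule described above boils down to a single comparison. The sketch
below is a user-space illustration only: zone_reserves_highatomic() is a
hypothetical name, and the numbers in the usage note assume 4kB pages and
a 4MB pageblock (1024 pages).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: reserve for high atomic allocations only if 1% of
 * the zone's managed pages is at least one pageblock.
 */
static bool zone_reserves_highatomic(unsigned long managed_pages,
				     unsigned long pageblock_nr_pages)
{
	assert(pageblock_nr_pages > 0);
	return managed_pages / 100 >= pageblock_nr_pages;
}
```

Under those assumptions the zone above has 49224kB / 4kB = 12306 managed
pages, 1% of which is 123 pages, well below the 1024-page pageblock, so
nothing would be reserved for it.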
Bug: 332219324
Link: https://lkml.kernel.org/r/c3a2a48e2cfe08176a80eaf01c110deb9e918055.1700821416.git.quic_charante@quicinc.com
Change-Id: Id059b63bd6ee68b3a2cd1c4b44613234a42d0a46
(cherry picked from commit 9cd20f3fe045af95a8fe7a12328b21bfd2f3b8bf)
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: page_alloc: fixes for high atomic reserve
calculations", v3.
The state of the system where the issue was exposed, as shown in the OOM
kill logs:
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB local_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7752kB
From the above, it is seen that ~16% of the zone's managed memory (8MB) is
reserved for high atomic reserves, against the expectation of 1%
reserves, which is fixed in the 1st patch.
The 2nd patch avoids reserving the high atomic page blocks if 1% of the
zone memory size is below a pageblock size.
This patch (of 2):
reserve_highatomic_pageblock() aims to reserve 1% of the managed pages
of a zone, which is used for high-order atomic allocations.
It uses the below calculation to reserve:
static void reserve_highatomic_pageblock(struct page *page, ....) {
.......
max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
if (zone->nr_reserved_highatomic >= max_managed)
goto out;
zone->nr_reserved_highatomic += pageblock_nr_pages;
set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
out:
....
}
Since we always add 1% of the zone's managed page count to
pageblock_nr_pages, the minimum turns out to be 2 pageblocks, as
nr_reserved_highatomic is incremented/decremented in pageblock sizes.
Encountered a system (actually a VM running on the Linux kernel) with the
below zone configuration:
Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB
reserved_highatomic:8192KB managed:49224kB
The existing calculation reserves 8MB (with a pageblock size of 4MB),
i.e. 16% of the zone managed memory. Reserving such a high amount of
memory can easily exert memory pressure on the system and may thus lead
to unnecessary reclaims until the high atomic reserves are unreserved.
Since high atomic reserves are managed in pageblock size granules, as
MIGRATE_HIGHATOMIC is set for such pageblocks, fix the calculation for
high atomic reserves so that the minimum is the pageblock size and the
maximum is approximately 1% of the zone managed pages.
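The arithmetic can be checked with a small user-space model. This is a
sketch under stated assumptions, not the kernel code: the helper names are
hypothetical, new_max_managed() is one assumed way to express "1% rounded
up to whole pageblocks", and the numbers in the usage note assume 4kB
pages, so that the 4MB pageblock is 1024 pages and the 49224kB zone has
12306 managed pages.

```c
#include <assert.h>

/* Round x up to a multiple of a (a > 0). */
static unsigned long round_up_to(unsigned long x, unsigned long a)
{
	assert(a > 0);
	return (x + a - 1) / a * a;
}

/* Limit as computed by the snippet quoted above. */
static unsigned long old_max_managed(unsigned long managed, unsigned long pb)
{
	return managed / 100 + pb;
}

/* Assumed form of the fixed limit: 1% rounded up to whole pageblocks. */
static unsigned long new_max_managed(unsigned long managed, unsigned long pb)
{
	return round_up_to(managed / 100, pb);
}

/* Pages reserved once the ">= max_managed" check stops further reserving. */
static unsigned long reserved_at_limit(unsigned long max_managed,
				       unsigned long pb)
{
	unsigned long reserved = 0;

	assert(pb > 0);
	while (reserved < max_managed)
		reserved += pb;
	return reserved;
}
```

For the zone above, the old limit is 123 + 1024 = 1147 pages, which lets
two pageblocks (2048 pages, 8MB) be reserved before the check trips. The
rounded limit is 1024 pages, which stops after one pageblock (4MB).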
Bug: 332219324
Link: https://lkml.kernel.org/r/cover.1700821416.git.quic_charante@quicinc.com
Link: https://lkml.kernel.org/r/1660034138397b82a0a8b6ae51cbe96bd583d89e.1700821416.git.quic_charante@quicinc.com
Change-Id: Icc15fb88ef6166f691f5aa14311bc45bff972b99
(cherry picked from commit d68e39fc45f70e35eb74df2128d315c1d91e4dc4)
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 6bad1052c2, it is the
LTS merge that had to previously get reverted due to being merged too
early.
Cc: Todd Kjos <tkjos@google.com>
Change-Id: I31b7d660bd833cf022ac4870f6d01e723fda5182
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
The tail pages in a THP can have swap entry information stored in their
private field. When migrating to a new page, all tail pages of the new
page need to update ->private to avoid future data corruption.
This fix is stable-only, since after commit 07e09c483cbe ("mm/huge_memory:
work on folio->swap instead of page->private when splitting folio"),
subpages of a swapcached THP no longer require the maintenance.
Adding THPs to the swapcache was introduced in commit
38d8b4e6bd ("mm, THP, swap: delay splitting THP during swap out"),
where each subpage of a THP added to the swapcache had its own swapcache
entry and required the ->private field to point to the correct swapcache
entry. Later, when THP migration functionality was implemented in commit
616b837153 ("mm: thp: enable thp migration in generic path"),
it initially did not handle the subpages of swapcached THPs, failing to
update their ->private fields or replace the subpage pointers in the
swapcache. Subsequently, commit e71769ae52 ("mm: enable thp migration
for shmem thp") addressed the swapcache update aspect. This patch fixes
the update of subpage ->private fields.
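The idea of the fix can be shown with a user-space model. This is only a
sketch of the technique, not the kernel change itself: struct page_sketch
is a hypothetical stand-in for struct page that carries only the private
field, and copy_subpage_private() is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the field discussed above. */
struct page_sketch {
	unsigned long private;
};

/*
 * Copy the per-subpage ->private values of a swapcached THP from the old
 * pages to the new pages, covering the head page and every tail page.
 */
static void copy_subpage_private(struct page_sketch *newp,
				 const struct page_sketch *oldp, size_t nr)
{
	size_t i;

	assert(newp && oldp);
	for (i = 0; i < nr; i++)
		newp[i].private = oldp[i].private;
}
```

Updating only index 0 would leave the tail pages of the new THP with stale
values, which is the situation the patch addresses.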
Bug: 324818390
Fixes: 616b837153 ("mm: thp: enable thp migration in generic path")
Link: https://lore.kernel.org/linux-mm/20240306155217.118467-1-zi.yan@sent.com/
Reported-and-tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Change-Id: Ia4603cd58b76dc6ff46a2c53a735942a87221419
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Closes: https://lore.kernel.org/linux-mm/1707814102-22682-1-git-send-email-quic_charante@quicinc.com/
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
alloc_contig_migrate_range has all the information needed to understand
the latency of big contiguous allocations. For example, how many pages are
migrated and how many times they needed to be unmapped from page tables.
This patch adds a trace event to collect the allocation statistics. In
the field, it was quite useful for understanding CMA allocation latency.
[akpm@linux-foundation.org: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled]
Link: https://lkml.kernel.org/r/20240228051127.2859472-1-richardycc@google.com
Signed-off-by: Richard Chang <richardycc@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Martin Liu <liumartin@google.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 315897534
(cherry picked from commit c8b36003121834cb77fcaf8a1ce0a454d7a97891
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable)
[richardycc: slight modification for android change 0de2f42977]
Change-Id: If6c3cd106201fd13683d1dd5afdfa62a48a4dd3b
Signed-off-by: Richard Chang <richardycc@google.com>
This reverts commit 1dbafe61e3.
Reason for revert: Too early. Needs to wait until 2024-03-27
Change-Id: I769b944bd089aa2278659ec87f7ba4ac4e74ee4a
Signed-off-by: Todd Kjos <tkjos@google.com>
Backmerge the latest android14-6.1 changes into the lts branch to keep
up to date. Contains the following commits:
* 3578913b2e UPSTREAM: net/rose: Fix Use-After-Free in rose_ioctl
* 8fbed1ea00 UPSTREAM: ida: Fix crash in ida_free when the bitmap is empty
* 6ce5bb744e ANDROID: GKI: Update symbol list for mtk
* 7cbad58851 Reapply "perf: Disallow mis-matched inherited group reads"
* 067a03c44e ANDROID: GKI: Add Pasa symbol list
* b6be1a36f7 FROMGIT: mm: memcg: don't periodically flush stats when memcg is disabled
* d0e2d333f9 ANDROID: Update the ABI symbol list
* 10558542a1 ANDROID: sched: export update_misfit_status symbol
* a0b3b39898 ANDROID: GKI: Add ASR KMI symbol list
* 599710db0f FROMGIT: usb: dwc3: gadget: Fix NULL pointer dereference in dwc3_gadget_suspend
* 9265fa90c1 FROMLIST: usb: core: Prevent null pointer dereference in update_port_device_state
* 2730733d54 ANDROID: gki_defconfig: Enable CONFIG_NVME_MULTIPATH
* 4f668f5682 BACKPORT: irqchip/gic-v3: Work around affinity issues on ASR8601
* 473a871315 BACKPORT: irqchip/gic-v3: Improve affinity helper
* 6c32acf537 UPSTREAM: sched/fair: Limit sched slice duration
* 7088d250bf ANDROID: Update the ABI symbol list
* c249740414 ANDROID: idle_inject: Export function symbols
* 990d341477 ANDROID: Update the ABI symbol list
* be92a6a1b4 ANDROID: GKI: Remove CONFIG_MEDIA_CEC_RC
* fa9ac43f16 BACKPORT: usb: host: xhci: Avoid XHCI resume delay if SSUSB device is not present
* f27fc6ba23 Merge "Merge tag 'android14-6.1.68_r00' into branch 'android14-6.1'" into android14-6.1
|\
| * 0177cfb2a2 Merge tag 'android14-6.1.68_r00' into branch 'android14-6.1'
* c96cea1a3c ANDROID: Update the ABI symbol list
* c2fbc12180 ANDROID: uid_sys_stats: Drop CONFIG_UID_SYS_STATS_DEBUG logic
* 90bd30bdef ANDROID: Update the ABI symbol list
* 3280560843 ANDROID: Update the ABI symbol list
* 427210e440 UPSTREAM: usb: gadget: uvc: Remove nested locking
* 9267e267be ANDROID: uid_sys_stats: Fully initialize uid_entry_tmp value
* 2d3f0c9d41 ANDROID: Roll back some code to fix system_server registers psi trigger failed.
* bd77c97c76 UPSTREAM: usb: gadget: uvc: Fix use are free during STREAMOFF
* 21c71a7d0e ANDROID: GKI: Add symbol list for Nothing
* aba5a3fe09 ANDROID: Enable CONFIG_LAZY_RCU in x86 gki_defconfig
* 204160394a ANDROID: fuse-bpf: Fix the issue of abnormal lseek system calls
* 947708f1ff ANDROID: ABI: Update symbol list for imx
* 7eedea7abf BACKPORT: PM: sleep: Fix possible deadlocks in core system-wide PM code
* e1a20dd9ff UPSTREAM: async: Introduce async_schedule_dev_nocall()
* e4b0e14f83 UPSTREAM: async: Split async_schedule_node_domain()
* 6b4c816d17 FROMGIT: BACKPORT: mm: update mark_victim tracepoints fields
* d97ea65296 ANDROID: Enable CONFIG_LAZY_RCU in arm64 gki_defconfig
* 90d68cedd1 FROMLIST: rcu: Provide a boot time parameter to control lazy RCU
* a079cc5876 ANDROID: rcu: Add a minimum time for marking boot as completed
* ffe09c06a8 UPSTREAM: rcu: Disable laziness if lazy-tracking says so
* d07488d26e UPSTREAM: rcu: Track laziness during boot and suspend
* 4316bd568b UPSTREAM: net: Use call_rcu_hurry() for dst_release()
* b9427245f0 UPSTREAM: workqueue: Make queue_rcu_work() use call_rcu_hurry()
* 72fdf7f606 UPSTREAM: percpu-refcount: Use call_rcu_hurry() for atomic switch
* ced65a053b UPSTREAM: io_uring: use call_rcu_hurry if signaling an eventfd
* 84c8157d06 UPSTREAM: rcu: Update synchronize_rcu_mult() comment for call_rcu_hurry()
* 3751416eeb UPSTREAM: scsi/scsi_error: Use call_rcu_hurry() instead of call_rcu()
* 52193e9489 UPSTREAM: rcu/rcutorture: Use call_rcu_hurry() where needed
* 83f8ba569f UPSTREAM: rcu/rcuscale: Use call_rcu_hurry() for async reader test
* 9b625f4978 UPSTREAM: rcu/sync: Use call_rcu_hurry() instead of call_rcu
* c570c8fea3 BACKPORT: rcu: Shrinker for lazy rcu
* 4957579439 UPSTREAM: rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
* 66a832fe38 UPSTREAM: rcu: Make call_rcu() lazy to save power
* 4fb09fb4f7 UPSTREAM: rcu: Fix missing nocb gp wake on rcu_barrier()
* 64c59ad2c3 UPSTREAM: rcu: Fix late wakeup when flush of bypass cblist happens
* 0799ace265 ANDROID: Update the ABI symbol list
* 65db2f8ed3 ANDROID: GKI: add GKI symbol list for Exynosauto SoC
* cfe8cce4e8 UPSTREAM: coresight: tmc: Don't enable TMC when it's not ready.
* 899194d7e9 UPSTREAM: netfilter: nf_tables: bail out on mismatching dynset and set expressions
* e6712ed4f0 ANDROID: ABI: Update oplus symbol list
* 24bb8fc82e ANDROID: vendor_hooks: add hooks in driver/android/binder.c
* 55930b39ca ANDROID: GKI: Update honda symbol list for xt_LOG
* 3160b69e20 ANDROID: GKI: Update honda symbol list for ebt filter
* 4dc7f98815 ANDROID: GKI: Update honda symbol list for ebtables
* 39a0823340 ANDROID: GKI: Update honda symbol list for net scheduler
* dd0098bdb4 ANDROID: GKI: Update honda symbol list for led-trigger
* 66a20ed4b8 ANDROID: GKI: Add initial symbol list for honda
* 28dbe4d613 ANDROID: GKI: add symbols to ABI
* 97100e867e FROMGIT: usb: dwc: ep0: Update request status in dwc3_ep0_stall_restart
* 36248a15a7 FROMGIT: usb: dwc3: set pm runtime active before resume common
Change-Id: I8d9586a94c3182cd365d1e3b651a7552c7c9949b
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
The root memcg is onlined even when memcg is disabled. When it's onlined
a 2 second periodic stat flush is started, but no stat flushing is
required when memcg is disabled because there can be no child memcgs.
Most calls to flush memcg stats are avoided when memcg is disabled as a
result of the mem_cgroup_disabled check added in 7d7ef0a4686a ("mm: memcg:
restore subtree stats flushing"), but the periodic flushing started in
mem_cgroup_css_online is not. Skip it.
Link: https://lkml.kernel.org/r/20240126211927.1171338-1-tjmercier@google.com
Fixes: aa48e47e39 ("memcg: infrastructure to flush memcg stats")
Change-Id: Iae6aeb3091d349898ea4987a784a971d9b3c97f7
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Li <chrisl@kernel.org>
Reported-by: Minchan Kim <minchan@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 7e9bccbe57812f888f51d46d7cdbc6327eee24f3
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/
mm-unstable)
Signed-off-by: T.J. Mercier <tjmercier@google.com>
The current implementation of the mark_victim tracepoint provides only the
process ID (pid) of the victim process. This limitation poses challenges
for userspace tools that need additional information about the OOM victim.
The association between pid and the additional data may be lost after the
kill, making it difficult for userspace to correlate the OOM event with
the specific process.
In order to mitigate this limitation, add the following fields:
- UID
   In Android each installed application has a unique UID. Including
   the `uid` assists in correlating OOM events with specific apps.
- Process Name (comm)
   Enables identification of the affected process.
- OOM Score
   Gives userspace additional insight into the relative kill
   priority of the OOM victim.
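As a rough illustration of what a consumer might see once these fields exist, the sketch below formats one victim record in userspace C. The struct layout, the field name `oom_score_adj`, the output format string and `format_victim()` are all assumptions made for the example; the real tracepoint definition may differ.

```c
#include <stdio.h>
#include <stddef.h>

#define COMM_LEN 16     /* assumed buffer size for the process name */

/* Hypothetical record carrying the pid plus the three added fields. */
struct oom_victim_event {
	int pid;
	unsigned int uid;       /* unique per installed app on Android */
	char comm[COMM_LEN];    /* process name */
	short oom_score_adj;    /* assumed representation of the OOM score */
};

/* Formats one record into buf; returns snprintf()'s result. */
int format_victim(char *buf, size_t len, const struct oom_victim_event *ev)
{
	return snprintf(buf, len, "pid=%d comm=%s uid=%u oom_score_adj=%hd",
			ev->pid, ev->comm, ev->uid, ev->oom_score_adj);
}
```

Because every field is captured when the victim is marked, a tool reading the record does not need to look up the pid after the process is gone.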
Link: https://lkml.kernel.org/r/20240111210539.636607-1-carlosgalo@google.com
Change-Id: Icc3ed013a9dfff9bb09f1d7588757e6028c17069
Signed-off-by: Carlos Galo <carlosgalo@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 649ffb4cbb90a7f60f17dd74e57d814e762ea01d mm-unstable)
[ carlosgalo: Manually added struct cred change in mark_oom_victim function ]
Bug: 315560026
Change-Id: I81fb6f3447f432100ad4cd25e22db23768003388
Signed-off-by: Carlos Galo <carlosgalo@google.com>
This syncs up the -lts branch with the changes in the non-lts branch,
specifically needed for the ABI symbol updates to allow the build
servers to keep running properly.
Included in here are commits:
* df1cdb0a70 ANDROID: Update the pixel symbol list
* 66cd99ccdb BACKPORT: UPSTREAM: phy: qcom-qmp: Introduce Kconfig symbols for discrete drivers
* a70d3b7bdd ANDROID: GKI: add symbols of vendor hooks to ABI for swapping in ahead
* d4db0d5d08 ANDROID: GKI: add vendor hooks for swapping in ahead
* fd40c1d901 ANDROID: add 16k targets for Microdroid kernel
* 82bf9e7625 FROMGIT: BACKPORT: mm/cma: fix placement of trace_cma_alloc_start/finish
* 800cac4b33 FROMGIT: wifi: nl80211: Extend del pmksa support for SAE and OWE security
Change-Id: I94352b7351253b88af675cc7749bde2936dd91c7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Changes in 6.1.72
keys, dns: Fix missing size check of V1 server-list header
block: Don't invalidate pagecache for invalid falloc modes
ALSA: hda/realtek: enable SND_PCI_QUIRK for hp pavilion 14-ec1xxx series
ALSA: hda/realtek: fix mute/micmute LEDs for a HP ZBook
ALSA: hda/realtek: Fix mute and mic-mute LEDs for HP ProBook 440 G6
mptcp: prevent tcp diag from closing listener subflows
Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"
drm/mgag200: Fix gamma lut not initialized for G200ER, G200EV, G200SE
cifs: cifs_chan_is_iface_active should be called with chan_lock held
cifs: do not depend on release_iface for maintaining iface_list
KVM: x86/pmu: fix masking logic for MSR_CORE_PERF_GLOBAL_CTRL
wifi: iwlwifi: pcie: don't synchronize IRQs from IRQ
drm/bridge: ti-sn65dsi86: Never store more than msg->size bytes in AUX xfer
netfilter: use skb_ip_totlen and iph_totlen
netfilter: nf_tables: set transport offset from mac header for netdev/egress
nfc: llcp_core: Hold a ref to llcp_local->dev when holding a ref to llcp_local
octeontx2-af: Fix marking couple of structure as __packed
drm/i915/dp: Fix passing the correct DPCD_REV for drm_dp_set_phy_test_pattern
ice: Fix link_down_on_close message
ice: Shut down VSI with "link-down-on-close" enabled
i40e: Fix filter input checks to prevent config with invalid values
igc: Report VLAN EtherType matching back to user
igc: Check VLAN TCI mask
igc: Check VLAN EtherType mask
ASoC: fsl_rpmsg: Fix error handler with pm_runtime_enable
ASoC: mediatek: mt8186: fix AUD_PAD_TOP register and offset
mlxbf_gige: fix receive packet race condition
net: sched: em_text: fix possible memory leak in em_text_destroy()
r8169: Fix PCI error on system resume
can: raw: add support for SO_MARK
net-timestamp: extend SOF_TIMESTAMPING_OPT_ID to HW timestamps
net: annotate data-races around sk->sk_tsflags
net: annotate data-races around sk->sk_bind_phc
net: Implement missing getsockopt(SO_TIMESTAMPING_NEW)
selftests: bonding: do not set port down when adding to bond
ARM: sun9i: smp: Fix array-index-out-of-bounds read in sunxi_mc_smp_init
sfc: fix a double-free bug in efx_probe_filters
net: bcmgenet: Fix FCS generation for fragmented skbuffs
netfilter: nft_immediate: drop chain reference counter on error
net: Save and restore msg_namelen in sock_sendmsg
i40e: fix use-after-free in i40e_aqc_add_filters()
ASoC: meson: g12a-toacodec: Validate written enum values
ASoC: meson: g12a-tohdmitx: Validate written enum values
ASoC: meson: g12a-toacodec: Fix event generation
ASoC: meson: g12a-tohdmitx: Fix event generation for S/PDIF mux
i40e: Restore VF MSI-X state during PCI reset
igc: Fix hicredit calculation
net/qla3xxx: fix potential memleak in ql_alloc_buffer_queues
net/smc: fix invalid link access in dumping SMC-R connections
octeontx2-af: Always configure NIX TX link credits based on max frame size
octeontx2-af: Re-enable MAC TX in otx2_stop processing
asix: Add check for usbnet_get_endpoints
net: ravb: Wait for operating mode to be applied
bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()
net: Implement missing SO_TIMESTAMPING_NEW cmsg support
selftests: secretmem: floor the memory size to the multiple of page_size
cpu/SMT: Create topology_smt_thread_allowed()
cpu/SMT: Make SMT control more robust against enumeration failures
srcu: Fix callbacks acceleration mishandling
bpf, x64: Fix tailcall infinite loop
bpf, x86: Simplify the parsing logic of structure parameters
bpf, x86: save/restore regs with BPF_DW size
net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
splice, net: Add a splice_eof op to file-ops and socket-ops
ipv4, ipv6: Use splice_eof() to flush
udp: introduce udp->udp_flags
udp: move udp->no_check6_tx to udp->udp_flags
udp: move udp->no_check6_rx to udp->udp_flags
udp: move udp->gro_enabled to udp->udp_flags
udp: move udp->accept_udp_{l4|fraglist} to udp->udp_flags
udp: lockless UDP_ENCAP_L2TPINUDP / UDP_GRO
udp: annotate data-races around udp->encap_type
wifi: iwlwifi: yoyo: swap cdb and jacket bits values
arm64: dts: qcom: sdm845: align RPMh regulator nodes with bindings
arm64: dts: qcom: sdm845: Fix PSCI power domain names
fbdev: imsttfb: Release framebuffer and dealloc cmap on error path
fbdev: imsttfb: fix double free in probe()
bpf: decouple prune and jump points
bpf: remove unnecessary prune and jump points
bpf: Remove unused insn_cnt argument from visit_[func_call_]insn()
bpf: clean up visit_insn()'s instruction processing
bpf: Support new 32bit offset jmp instruction
bpf: handle ldimm64 properly in check_cfg()
bpf: fix precision backtracking instruction iteration
blk-mq: make sure active queue usage is held for bio_integrity_prep()
net/mlx5: Increase size of irq name buffer
s390/mm: add missing arch_set_page_dat() call to vmem_crst_alloc()
s390/cpumf: support user space events for counting
f2fs: clean up i_compress_flag and i_compress_level usage
f2fs: convert to use bitmap API
f2fs: assign default compression level
f2fs: set the default compress_level on ioctl
selftests: mptcp: fix fastclose with csum failure
selftests: mptcp: set FAILING_LINKS in run_tests
media: camss: sm8250: Virtual channels for CSID
media: qcom: camss: Fix set CSI2_RX_CFG1_VC_MODE when VC is greater than 3
ext4: convert move_extent_per_page() to use folios
khugepage: replace try_to_release_page() with filemap_release_folio()
memory-failure: convert truncate_error_page() to use folio
mm: merge folio_has_private()/filemap_release_folio() call pairs
mm, netfs, fscache: stop read optimisation when folio removed from pagecache
filemap: add a per-mapping stable writes flag
block: update the stable_writes flag in bdev_add
smb: client: fix missing mode bits for SMB symlinks
net: dpaa2-eth: rearrange variable in dpaa2_eth_get_ethtool_stats
dpaa2-eth: recycle the RX buffer only after all processing done
ethtool: don't propagate EOPNOTSUPP from dumps
bpf, sockmap: af_unix stream sockets need to hold ref for pair sock
firmware: arm_scmi: Fix frequency truncation by promoting multiplier type
ALSA: hda/realtek: Add quirk for Lenovo Yoga Pro 7
genirq/affinity: Remove the 'firstvec' parameter from irq_build_affinity_masks
genirq/affinity: Pass affinity managed mask array to irq_build_affinity_masks
genirq/affinity: Don't pass irq_affinity_desc array to irq_build_affinity_masks
genirq/affinity: Rename irq_build_affinity_masks as group_cpus_evenly
genirq/affinity: Move group_cpus_evenly() into lib/
lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly
mm/memory_hotplug: add missing mem_hotplug_lock
mm/memory_hotplug: fix error handling in add_memory_resource()
net: sched: call tcf_ct_params_free to free params in tcf_ct_init
netfilter: flowtable: allow unidirectional rules
netfilter: flowtable: cache info of last offload
net/sched: act_ct: offload UDP NEW connections
net/sched: act_ct: Fix promotion of offloaded unreplied tuple
netfilter: flowtable: GC pushes back packets to classic path
net/sched: act_ct: Take per-cb reference to tcf_ct_flow_table
octeontx2-af: Fix pause frame configuration
octeontx2-af: Support variable number of lmacs
btrfs: fix qgroup_free_reserved_data int overflow
btrfs: mark the len field in struct btrfs_ordered_sum as unsigned
ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()
firewire: ohci: suppress unexpected system reboot in AMD Ryzen machines and ASM108x/VT630x PCIe cards
x86/kprobes: fix incorrect return address calculation in kprobe_emulate_call_indirect
i2c: core: Fix atomic xfer check for non-preempt config
mm: fix unmap_mapping_range high bits shift bug
drm/amdgpu: skip gpu_info fw loading on navi12
drm/amd/display: add nv12 bounding box
mmc: meson-mx-sdhc: Fix initialization frozen issue
mmc: rpmb: fixes pause retune on all RPMB partitions.
mmc: core: Cancel delayed work before releasing host
mmc: sdhci-sprd: Fix eMMC init failure after hw reset
genirq/affinity: Only build SMP-only helper functions on SMP kernels
f2fs: compress: fix to assign compress_level for lz4 correctly
net/sched: act_ct: additional checks for outdated flows
net/sched: act_ct: Always fill offloading tuple iifidx
bpf: Fix a verifier bug due to incorrect branch offset comparison with cpu=v4
bpf: syzkaller found null ptr deref in unix_bpf proto add
media: qcom: camss: Comment CSID dt_id field
smb3: Replace smb2pdu 1-element arrays with flex-arrays
Revert "interconnect: qcom: sm8250: Enable sync_state"
Linux 6.1.72
Change-Id: Id00eb2ae1159d4d5fa0ef914e672c5669cbf5b0a
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Merge 6.1.71 into android14-6.1-lts
Changes in 6.1.71
ksmbd: replace one-element arrays with flexible-array members
ksmbd: set SMB2_SESSION_FLAG_ENCRYPT_DATA when enforcing data encryption for this share
ksmbd: use F_SETLK when unlocking a file
ksmbd: Fix resource leak in smb2_lock()
ksmbd: Convert to use sysfs_emit()/sysfs_emit_at() APIs
ksmbd: Implements sess->rpc_handle_list as xarray
ksmbd: fix typo, syncronous->synchronous
ksmbd: Remove duplicated codes
ksmbd: update Kconfig to note Kerberos support and fix indentation
ksmbd: Fix spelling mistake "excceed" -> "exceeded"
ksmbd: Fix parameter name and comment mismatch
ksmbd: remove unused is_char_allowed function
ksmbd: delete asynchronous work from list
ksmbd: set NegotiateContextCount once instead of every inc
ksmbd: avoid duplicate negotiate ctx offset increments
ksmbd: remove unused compression negotiate ctx packing
fs: introduce lock_rename_child() helper
ksmbd: fix racy issue from using ->d_parent and ->d_name
ksmbd: fix uninitialized pointer read in ksmbd_vfs_rename()
ksmbd: fix uninitialized pointer read in smb2_create_link()
ksmbd: call putname after using the last component
ksmbd: fix posix_acls and acls dereferencing possible ERR_PTR()
ksmbd: add mnt_want_write to ksmbd vfs functions
ksmbd: remove unused ksmbd_tree_conn_share function
ksmbd: use kzalloc() instead of __GFP_ZERO
ksmbd: return a literal instead of 'err' in ksmbd_vfs_kern_path_locked()
ksmbd: Change the return value of ksmbd_vfs_query_maximal_access to void
ksmbd: use kvzalloc instead of kvmalloc
ksmbd: Replace the ternary conditional operator with min()
ksmbd: Use struct_size() helper in ksmbd_negotiate_smb_dialect()
ksmbd: Replace one-element array with flexible-array member
ksmbd: Fix unsigned expression compared with zero
ksmbd: check if a mount point is crossed during path lookup
ksmbd: switch to use kmemdup_nul() helper
ksmbd: add support for read compound
ksmbd: fix wrong interim response on compound
ksmbd: fix `force create mode' and `force directory mode'
ksmbd: Fix one kernel-doc comment
ksmbd: add missing calling smb2_set_err_rsp() on error
ksmbd: remove experimental warning
ksmbd: remove unneeded mark_inode_dirty in set_info_sec()
ksmbd: fix passing freed memory 'aux_payload_buf'
ksmbd: return invalid parameter error response if smb2 request is invalid
ksmbd: check iov vector index in ksmbd_conn_write()
ksmbd: fix race condition with fp
ksmbd: fix race condition from parallel smb2 logoff requests
ksmbd: fix race condition from parallel smb2 lock requests
ksmbd: fix race condition between tree conn lookup and disconnect
ksmbd: fix wrong error response status by using set_smb2_rsp_status()
ksmbd: fix Null pointer dereferences in ksmbd_update_fstate()
ksmbd: fix potential double free on smb2_read_pipe() error path
ksmbd: Remove unused field in ksmbd_user struct
ksmbd: reorganize ksmbd_iov_pin_rsp()
ksmbd: fix kernel-doc comment of ksmbd_vfs_setxattr()
ksmbd: fix recursive locking in vfs helpers
ksmbd: fix missing RDMA-capable flag for IPoIB device in ksmbd_rdma_capable_netdev()
ksmbd: add support for surrogate pair conversion
ksmbd: no need to wait for binded connection termination at logoff
ksmbd: fix kernel-doc comment of ksmbd_vfs_kern_path_locked()
ksmbd: prevent memory leak on error return
ksmbd: fix possible deadlock in smb2_open
ksmbd: separately allocate ci per dentry
ksmbd: move oplock handling after unlock parent dir
ksmbd: release interim response after sending status pending response
ksmbd: move setting SMB2_FLAGS_ASYNC_COMMAND and AsyncId
ksmbd: don't update ->op_state as OPLOCK_STATE_NONE on error
ksmbd: set epoch in create context v2 lease
ksmbd: set v2 lease capability
ksmbd: downgrade RWH lease caching state to RH for directory
ksmbd: send v2 lease break notification for directory
ksmbd: lazy v2 lease break on smb2_write()
ksmbd: avoid duplicate opinfo_put() call on error of smb21_lease_break_ack()
ksmbd: fix wrong allocation size update in smb2_open()
ARM: dts: Fix occasional boot hang for am3 usb
usb: fotg210-hcd: delete an incorrect bounds test
spi: Introduce spi_get_device_match_data() helper
iio: imu: adis16475: add spi_device_id table
nfsd: separate nfsd_last_thread() from nfsd_put()
nfsd: call nfsd_last_thread() before final nfsd_put()
linux/export: Ensure natural alignment of kcrctab array
spi: Reintroduce spi_set_cs_timing()
spi: Add APIs in spi core to set/get spi->chip_select and spi->cs_gpiod
spi: atmel: Fix clock issue when using devices with different polarities
block: renumber QUEUE_FLAG_HW_WC
ksmbd: fix slab-out-of-bounds in smb_strndup_from_utf16()
platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe
mm/filemap: avoid buffered read/write race to read inconsistent data
mm: migrate high-order folios in swap cache correctly
mm/memory-failure: cast index to loff_t before shifting it
mm/memory-failure: check the mapcount of the precise page
ring-buffer: Fix wake ups when buffer_percent is set to 100
tracing: Fix blocked reader of snapshot buffer
ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()
netfilter: nf_tables: skip set commit for deleted/destroyed sets
ring-buffer: Fix slowpath of interrupted event
NFSD: fix possible oops when nfsd/pool_stats is closed.
spi: Constify spi parameters of chip select APIs
device property: Allow const parameter to dev_fwnode()
kallsyms: Make module_kallsyms_on_each_symbol generally available
tracing/kprobes: Fix symbol counting logic by looking at modules as well
Revert "platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe"
Linux 6.1.71
Change-Id: I7bc16d981b90e8e0b633628438f79fce898ad15a
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Merge 6.1.70 into android14-6.1-lts
Changes in 6.1.70
kasan: disable kasan_non_canonical_hook() for HW tags
bpf: Fix prog_array_map_poke_run map poke update
HID: i2c-hid: acpi: Unify ACPI ID tables format
HID: i2c-hid: Add IDEA5002 to i2c_hid_acpi_blacklist[]
drm/amd/display: fix hw rotated modes when PSR-SU is enabled
ARM: dts: dra7: Fix DRA7 L3 NoC node register size
ARM: OMAP2+: Fix null pointer dereference and memory leak in omap_soc_device_init
reset: Fix crash when freeing non-existent optional resets
s390/vx: fix save/restore of fpu kernel context
wifi: iwlwifi: pcie: add another missing bh-disable for rxq->lock
wifi: mac80211: check if the existing link config remains unchanged
wifi: mac80211: mesh: check element parsing succeeded
wifi: mac80211: mesh_plink: fix matches_local logic
Revert "net/mlx5e: fix double free of encap_header in update funcs"
Revert "net/mlx5e: fix double free of encap_header"
net/mlx5e: Fix slab-out-of-bounds in mlx5_query_nic_vport_mac_list()
net/mlx5: Introduce and use opcode getter in command interface
net/mlx5: Prevent high-rate FW commands from populating all slots
net/mlx5: Re-organize mlx5_cmd struct
net/mlx5e: Fix a race in command alloc flow
net/mlx5e: fix a potential double-free in fs_udp_create_groups
net/mlx5: Fix fw tracer first block check
net/mlx5e: Correct snprintf truncation handling for fw_version buffer
net/mlx5e: Correct snprintf truncation handling for fw_version buffer used by representors
net: mscc: ocelot: fix eMAC TX RMON stats for bucket 256-511 and above
octeontx2-pf: Fix graceful exit during PFC configuration failure
net: Return error from sk_stream_wait_connect() if sk_wait_event() fails
net: sched: ife: fix potential use-after-free
ethernet: atheros: fix a memleak in atl1e_setup_ring_resources
net/rose: fix races in rose_kill_by_device()
Bluetooth: Fix deadlock in vhci_send_frame
Bluetooth: hci_event: shut up a false-positive warning
net: mana: select PAGE_POOL
net: check vlan filter feature in vlan_vids_add_by_dev() and vlan_vids_del_by_dev()
afs: Fix the dynamic root's d_delete to always delete unused dentries
afs: Fix dynamic root lookup DNS check
net: check dev->gso_max_size in gso_features_check()
keys, dns: Allow key types (eg. DNS) to be reclaimed immediately on expiry
afs: Fix overwriting of result of DNS query
afs: Fix use-after-free due to get/remove race in volume tree
ASoC: hdmi-codec: fix missing report for jack initial status
ASoC: fsl_sai: Fix channel swap issue on i.MX8MP
i2c: aspeed: Handle the coalesced stop conditions with the start conditions.
x86/xen: add CPU dependencies for 32-bit build
pinctrl: at91-pio4: use dedicated lock class for IRQ
gpiolib: cdev: add gpio_device locking wrapper around gpio_ioctl()
nvme-pci: fix sleeping function called from interrupt context
drm/i915/mtl: limit second scaler vertical scaling in ver >= 14
drm/i915: Relocate intel_atomic_setup_scalers()
drm/i915: Fix intel_atomic_setup_scalers() plane_state handling
drm/i915/dpt: Only do the POT stride remap when using DPT
drm/i915/mtl: Add MTL for remapping CCS FBs
drm/i915: Fix ADL+ tiled plane stride when the POT stride is smaller than the original
interconnect: Treat xlate() returning NULL node as an error
iio: imu: inv_mpu6050: fix an error code problem in inv_mpu6050_read_raw
interconnect: qcom: sm8250: Enable sync_state
Input: ipaq-micro-keys - add error handling for devm_kmemdup
scsi: bnx2fc: Fix skb double free in bnx2fc_rcv()
iio: common: ms_sensors: ms_sensors_i2c: fix humidity conversion time table
iio: adc: ti_am335x_adc: Fix return value check of tiadc_request_dma()
iio: triggered-buffer: prevent possible freeing of wrong buffer
ALSA: usb-audio: Increase delay in MOTU M quirk
usb-storage: Add quirk for incorrect WP on Kingston DT Ultimate 3.0 G3
wifi: cfg80211: Add my certificate
wifi: cfg80211: fix certs build to not depend on file order
USB: serial: ftdi_sio: update Actisense PIDs constant names
USB: serial: option: add Quectel EG912Y module support
USB: serial: option: add Foxconn T99W265 with new baseline
USB: serial: option: add Quectel RM500Q R13 firmware support
ALSA: hda/realtek: Add quirk for ASUS ROG GV302XA
Bluetooth: hci_event: Fix not checking if HCI_OP_INQUIRY has been sent
Bluetooth: af_bluetooth: Fix Use-After-Free in bt_sock_recvmsg
Bluetooth: L2CAP: Send reject on command corrupted request
Bluetooth: MGMT/SMP: Fix address type when using SMP over BREDR/LE
Bluetooth: Add more enc key size check
net: usb: ax88179_178a: avoid failed operations when device is disconnected
Input: soc_button_array - add mapping for airplane mode button
net: 9p: avoid freeing uninit memory in p9pdu_vreadf
net: rfkill: gpio: set GPIO direction
net: ks8851: Fix TX stall caused by TX buffer overrun
dt-bindings: nvmem: mxs-ocotp: Document fsl,ocotp
smb: client: fix OOB in cifsd when receiving compounded resps
smb: client: fix potential OOB in cifs_dump_detail()
smb: client: fix OOB in SMB2_query_info_init()
smb: client: fix OOB in smbCalcSize()
drm/i915: Reject async flips with bigjoiner
9p: prevent read overrun in protocol dump tracepoint
RISC-V: Fix do_notify_resume / do_work_pending prototype
loop: do not enforce max_loop hard limit by (new) default
dm thin metadata: Fix ABBA deadlock by resetting dm_bufio_client
Revert "drm/amd/display: Do not set DRR on pipe commit"
btrfs: zoned: no longer count fresh BG region as zone unusable
ubifs: fix possible dereference after free
ublk: move ublk_cancel_dev() out of ub->mutex
selftests: mptcp: join: fix subflow_send_ack lookup
Revert "scsi: aacraid: Reply queue mapping to CPUs based on IRQ affinity"
scsi: core: Always send batch on reset or error handling command
tracing / synthetic: Disable events after testing in synth_event_gen_test_init()
dm-integrity: don't modify bio's immutable bio_vec in integrity_metadata()
pinctrl: starfive: jh7100: ignore disabled device tree nodes
bus: ti-sysc: Flush posted write only after srst_udelay
gpio: dwapb: mask/unmask IRQ when disable/enale it
lib/vsprintf: Fix %pfwf when current node refcount == 0
thunderbolt: Fix memory leak in margining_port_remove()
KVM: arm64: vgic: Simplify kvm_vgic_destroy()
KVM: arm64: vgic: Add a non-locking primitive for kvm_vgic_vcpu_destroy()
KVM: arm64: vgic: Force vcpu vgic teardown on vcpu destroy
x86/alternatives: Sync core before enabling interrupts
mm/damon/core: make damon_start() waits until kdamond_fn() starts
fuse: share lookup state between submount and its parent
wifi: cfg80211: fix CQM for non-range use
wifi: nl80211: fix deadlock in nl80211_set_cqm_rssi (6.6.x)
loop: deprecate autoloading callback loop_probe()
Linux 6.1.70
Change-Id: I72bfbd39ae932d290b13d6fdde8e6684a84ec9e1
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Add vendor hooks to capture demand paging during app launch, so that
the same pages can be brought in ahead of time on the next launch.
Bug: 315913896
Signed-off-by: Lianjun Huang <huanglianjun@xiaomi.com>
Signed-off-by: Lianjun Huang <huanglianjun@xiaomi.corp-partner.google.com>
Change-Id: I2698fefd347745fb4ff84b111caedbb3bb365ce3
This reverts commit a2eefda9e3.
The issue is fixed properly in 6.1.70, so the change is no longer needed
here and keeping it would cause merge issues.
Change-Id: Ie80acf8e96dbcedd4a5d61701db8cbd3871258e2
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Changes in 6.1.69
perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table
r8152: add USB device driver for config selection
r8152: add vendor/device ID pair for D-Link DUB-E250
r8152: add vendor/device ID pair for ASUS USB-C2500
powerpc/ftrace: Fix stack teardown in ftrace_no_trace
ext4: fix warning in ext4_dio_write_end_io()
ksmbd: fix memory leak in smb2_lock()
afs: Fix refcount underflow from error handling race
HID: lenovo: Restrict detection of patched firmware only to USB cptkbd
net/mlx5e: Fix possible deadlock on mlx5e_tx_timeout_work
net: ipv6: support reporting otherwise unknown prefix flags in RTM_NEWPREFIX
qca_debug: Prevent crash on TX ring changes
qca_debug: Fix ethtool -G iface tx behavior
qca_spi: Fix reset behavior
bnxt_en: Clear resource reservation during resume
bnxt_en: Save ring error counters across reset
bnxt_en: Fix wrong return value check in bnxt_close_nic()
bnxt_en: Fix HWTSTAMP_FILTER_ALL packet timestamp logic
atm: solos-pci: Fix potential deadlock on &cli_queue_lock
atm: solos-pci: Fix potential deadlock on &tx_queue_lock
net: vlan: introduce skb_vlan_eth_hdr()
net: fec: correct queue selection
octeontx2-af: fix a use-after-free in rvu_nix_register_reporters
octeontx2-pf: Fix promisc mcam entry action
octeontx2-af: Update RSS algorithm index
atm: Fix Use-After-Free in do_vcc_ioctl
net/rose: Fix Use-After-Free in rose_ioctl
iavf: Introduce new state machines for flow director
iavf: Handle ntuple on/off based on new state machines for flow director
qed: Fix a potential use-after-free in qed_cxt_tables_alloc
net: Remove acked SYN flag from packet in the transmit queue correctly
net: ena: Destroy correct number of xdp queues upon failure
net: ena: Fix xdp drops handling due to multibuf packets
net: ena: Fix XDP redirection error
stmmac: dwmac-loongson: Make sure MDIO is initialized before use
sign-file: Fix incorrect return values check
vsock/virtio: Fix unsigned integer wrap around in virtio_transport_has_space()
dpaa2-switch: fix size of the dma_unmap
dpaa2-switch: do not ask for MDB, VLAN and FDB replay
net: stmmac: Handle disabled MDIO busses from devicetree
appletalk: Fix Use-After-Free in atalk_ioctl
net: atlantic: fix double free in ring reinit logic
cred: switch to using atomic_long_t
fuse: dax: set fc->dax to NULL in fuse_dax_conn_free()
ALSA: hda/hdmi: add force-connect quirk for NUC5CPYB
ALSA: hda/hdmi: add force-connect quirks for ASUSTeK Z170 variants
ALSA: hda/realtek: Apply mute LED quirk for HP15-db
Revert "PCI: acpiphp: Reassign resources on bridge if necessary"
PCI: loongson: Limit MRRS to 256
ksmbd: fix wrong name of SMB2_CREATE_ALLOCATION_SIZE
drm/mediatek: Add spinlock for setting vblank event in atomic_begin
x86/hyperv: Fix the detection of E820_TYPE_PRAM in a Gen2 VM
usb: aqc111: check packet for fixup for true limit
stmmac: dwmac-loongson: Add architecture dependency
blk-throttle: fix lockdep warning of "cgroup_mutex or RCU read lock required!"
blk-cgroup: bypass blkcg_deactivate_policy after destroying
bcache: avoid oversize memory allocation by small stripe_size
bcache: remove redundant assignment to variable cur_idx
bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc()
bcache: avoid NULL checking to c->root in run_cache_set()
nbd: fold nbd config initialization into nbd_alloc_config()
nvme-auth: set explanation code for failure2 msgs
nvme: catch errors from nvme_configure_metadata()
selftests/bpf: fix bpf_loop_bench for new callback verification scheme
LoongArch: Add dependency between vmlinuz.efi and vmlinux.efi
LoongArch: Implement constant timer shutdown interface
platform/x86: intel_telemetry: Fix kernel doc descriptions
HID: glorious: fix Glorious Model I HID report
HID: add ALWAYS_POLL quirk for Apple kb
nbd: pass nbd_sock to nbd_read_reply() instead of index
HID: hid-asus: reset the backlight brightness level on resume
HID: multitouch: Add quirk for HONOR GLO-GXXX touchpad
asm-generic: qspinlock: fix queued_spin_value_unlocked() implementation
net: usb: qmi_wwan: claim interface 4 for ZTE MF290
arm64: add dependency between vmlinuz.efi and Image
HID: hid-asus: add const to read-only outgoing usb buffer
perf: Fix perf_event_validate_size() lockdep splat
btrfs: do not allow non subvolume root targets for snapshot
soundwire: stream: fix NULL pointer dereference for multi_link
ext4: prevent the normalized size from exceeding EXT_MAX_BLOCKS
arm64: mm: Always make sw-dirty PTEs hw-dirty in pte_modify
team: Fix use-after-free when an option instance allocation fails
drm/amdgpu/sdma5.2: add begin/end_use ring callbacks
dmaengine: stm32-dma: avoid bitfield overflow assertion
mm/mglru: fix underprotected page cache
mm/shmem: fix race in shmem_undo_range w/THP
btrfs: free qgroup reserve when ORDERED_IOERR is set
btrfs: don't clear qgroup reserved bit in release_folio
drm/amdgpu: fix tear down order in amdgpu_vm_pt_free
drm/amd/display: Disable PSR-SU on Parade 0803 TCON again
drm/i915: Fix remapped stride with CCS on ADL+
smb: client: fix OOB in receive_encrypted_standard()
smb: client: fix NULL deref in asn1_ber_decoder()
smb: client: fix OOB in smb2_query_reparse_point()
ring-buffer: Fix memory leak of free page
tracing: Update snapshot buffer on resize if it is allocated
ring-buffer: Do not update before stamp when switching sub-buffers
ring-buffer: Have saved event hold the entire event
ring-buffer: Fix writing to the buffer with max_data_size
ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs
ring-buffer: Do not try to put back write_stamp
ring-buffer: Have rb_time_cmpxchg() set the msb counter too
net: tls, update curr on splice as well
r8152: avoid to change cfg for all devices
r8152: remove rtl_vendor_mode function
r8152: fix the autosuspend doesn't work
Linux 6.1.69
Change-Id: I695d1d50ca8c00ff505505918bdc59ce9d29d479
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
The current placement of trace_cma_alloc_start/finish misses the fail
cases: !cma || !cma->count || !cma->bitmap.
trace_cma_alloc_finish is also not emitted for the failure case
where bitmap_count > bitmap_maxno.
Fix these missed cases by moving the start event before the failure
checks and moving the finish event to the out label.
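The placement described above can be sketched as a small userspace model: the start event fires before any failure check, and the finish event fires at a single out label that every exit path reaches. `cma_model`, `cma_model_alloc()` and the two counters are hypothetical stand-ins for the CMA area, cma_alloc() and the two trace events; the allocation itself is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

int start_events;       /* stands in for trace_cma_alloc_start */
int finish_events;      /* stands in for trace_cma_alloc_finish */

/* Hypothetical, simplified view of a CMA area. */
struct cma_model {
	size_t count;
	bool has_bitmap;
	size_t bitmap_maxno;
};

bool cma_model_alloc(const struct cma_model *cma, size_t bitmap_count)
{
	bool ok = false;

	start_events++;         /* emitted before the failure checks */

	if (!cma || !cma->count || !cma->has_bitmap)
		goto out;
	if (bitmap_count > cma->bitmap_maxno)
		goto out;

	ok = true;              /* the real allocation work would go here */
out:
	finish_events++;        /* emitted on every exit path */
	return ok;
}
```

In this model the two counters always stay equal, which is the property the fix restores for the failure cases.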
Link: https://lkml.kernel.org/r/20240110012234.3793639-1-kaleshsingh@google.com
Fixes: 7bc1aec5e2 ("mm: cma: add trace events for CMA alloc perf testing")
Change-Id: I61153fe078da4f9f3338147f1fbb7697a5554078
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 3b08ab9a811caebe1327f25f51557f95200d94bf https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 315897033
[ Remove ret arg from trace_cma_alloc_finish - Kalesh Singh ]
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
This merges all of the latest changes in 'android14-6.1' into
'android14-6.1-lts' to get it to pass TH again due to new symbols being
added. Included in here are the following commits:
* a41a4ee370 ANDROID: Update the ABI symbol list
* 0801d8a89d ANDROID: mm: export dump_tasks symbol.
* 7c91752f5d FROMLIST: scsi: ufs: Remove the ufshcd_hba_exit() call from ufshcd_async_scan()
* 28154afe74 FROMLIST: scsi: ufs: Simplify power management during async scan
* febcf1429f ANDROID: gki_defconfig: Set CONFIG_IDLE_INJECT and CONFIG_CPU_IDLE_THERMAL into y
* bc4d82ee40 ANDROID: KMI workaround for CONFIG_NETFILTER_FAMILY_BRIDGE
* 227b55a7a3 ANDROID: dma-buf: don't re-purpose kobject as work_struct
* c1b1201d39 BACKPORT: FROMLIST: dma-buf: Move sysfs work out of DMA-BUF export path
* 928b3b5dde UPSTREAM: netfilter: nf_tables: skip set commit for deleted/destroyed sets
* 031f804149 ANDROID: KVM: arm64: Avoid BUG-ing from the host abort path
* c5dc4b4b3d ANDROID: Update the ABI symbol list
* 5070b3b594 UPSTREAM: ipv4: igmp: fix refcnt uaf issue when receiving igmp query packet
* 02aa72665c UPSTREAM: nvmet-tcp: Fix a possible UAF in queue intialization setup
* d6554d1262 FROMGIT: usb: dwc3: gadget: Handle EP0 request dequeuing properly
* 29544d4157 ANDROID: ABI: Update symbol list for imx
* 02f444ba07 UPSTREAM: io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
* ec46fe0ac7 UPSTREAM: bpf: Fix prog_array_map_poke_run map poke update
* 98b0e4cf09 BACKPORT: xhci: track port suspend state correctly in unsuccessful resume cases
* ac90f08292 ANDROID: Update the ABI symbol list
* ef67750d99 ANDROID: sched: Export symbols for vendor modules
* 934a40576e UPSTREAM: usb: dwc3: core: add support for disabling High-speed park mode
* 8a597e7a2d ANDROID: KVM: arm64: Don't prepopulate MMIO regions for host stage-2
* ed9b660cd1 BACKPORT: FROMGIT fork: use __mt_dup() to duplicate maple tree in dup_mmap()
* 3743b40f65 FROMGIT: maple_tree: preserve the tree attributes when destroying maple tree
* 1bec2dd52e FROMGIT: maple_tree: update check_forking() and bench_forking()
* e57d333531 FROMGIT: maple_tree: skip other tests when BENCH is enabled
* c79ca61edc FROMGIT: maple_tree: update the documentation of maple tree
* 7befa7bbc9 FROMGIT: maple_tree: add test for mtree_dup()
* f73f881af4 FROMGIT: radix tree test suite: align kmem_cache_alloc_bulk() with kernel behavior.
* eb5048ea90 FROMGIT: maple_tree: introduce interfaces __mt_dup() and mtree_dup()
* dc9323545b FROMGIT: maple_tree: introduce {mtree,mas}_lock_nested()
* 4ddcdc519b FROMGIT: maple_tree: add mt_free_one() and mt_attr() helpers
* c52d48818b UPSTREAM: maple_tree: introduce __mas_set_range()
* 066d57de87 ANDROID: GKI: Enable symbols for v4l2 in async and fwnode
* e74417834e ANDROID: Update the ABI symbol list
* 15a93de464 ANDROID: KVM: arm64: Fix hyp event alignment
* 717d1f8f91 ANDROID: KVM: arm64: Fix host_smc print typo
* 8fc25d7862 FROMGIT: f2fs: do not return EFSCORRUPTED, but try to run online repair
* 99288e911a ANDROID: KVM: arm64: Document module_change_host_prot_range
* 4d99e41ce1 FROMGIT: PM / devfreq: Synchronize devfreq_monitor_[start/stop]
* 6c8f710857 FROMGIT: arch/mm/fault: fix major fault accounting when retrying under per-VMA lock
* 4a518d8633 UPSTREAM: mm: handle write faults to RO pages under the VMA lock
* c1da94fa44 UPSTREAM: mm: handle read faults under the VMA lock
* 6541fffd92 UPSTREAM: mm: handle COW faults under the VMA lock
* c7fa581a79 UPSTREAM: mm: handle shared faults under the VMA lock
* 95af8a80bb BACKPORT: mm: call wp_page_copy() under the VMA lock
* b43b26b4cd UPSTREAM: mm: make lock_folio_maybe_drop_mmap() VMA lock aware
* 9c4bc457ab UPSTREAM: mm/memory.c: fix mismerge
* 7d50253c27 ANDROID: Export functions to be used with dma_map_ops in modules
* 37e0a5b868 BACKPORT: FROMGIT: erofs: enable sub-page compressed block support
* f466d52164 FROMGIT: erofs: refine z_erofs_transform_plain() for sub-page block support
* a18efa4e4a FROMGIT: erofs: fix ztailpacking for subpage compressed blocks
* 0c6a18c75b BACKPORT: FROMGIT: erofs: fix up compacted indexes for block size < 4096
* d7bb85f1cb FROMGIT: erofs: record `pclustersize` in bytes instead of pages
* 9d259220ac FROMGIT: erofs: support I/O submission for sub-page compressed blocks
* 8a49ea9441 FROMGIT: erofs: fix lz4 inplace decompression
* bdc5d268ba FROMGIT: erofs: fix memory leak on short-lived bounced pages
* 0d329bbe5c BACKPORT: erofs: tidy up z_erofs_do_read_page()
* dc94c3cc6b UPSTREAM: erofs: move preparation logic into z_erofs_pcluster_begin()
* 7751567a71 BACKPORT: erofs: avoid obsolete {collector,collection} terms
* d0dbf74792 BACKPORT: erofs: simplify z_erofs_read_fragment()
* 4067dd9969 UPSTREAM: erofs: get rid of the remaining kmap_atomic()
* 365ca16da2 UPSTREAM: erofs: simplify z_erofs_transform_plain()
* 187d034575 BACKPORT: erofs: adapt managed inode operations into folios
* 3d93182661 UPSTREAM: erofs: avoid on-stack pagepool directly passed by arguments
* 5c1827383a UPSTREAM: erofs: allocate extra bvec pages directly instead of retrying
* bed20ed1d3 UPSTREAM: erofs: clean up z_erofs_pcluster_readmore()
* 5e861fa97e UPSTREAM: erofs: remove the member readahead from struct z_erofs_decompress_frontend
* 66595bb17c UPSTREAM: erofs: fold in z_erofs_decompress()
* 88a1939504 UPSTREAM: erofs: enable large folios for iomap mode
* 2c085909e7 ANDROID: Update the ABI symbol list
* d16a15fde5 UPSTREAM: USB: gadget: core: adjust uevent timing on gadget unbind
* d3006fb944 ANDROID: ABI: Update oplus symbol list
* bc97d5019a ANDROID: vendor_hooks: Add hooks for rt_mutex steal
* 401a2769d9 UPSTREAM: dm verity: don't perform FEC for failed readahead IO
* 30bca9e278 UPSTREAM: netfilter: nft_set_pipapo: skip inactive elements during set walk
* 44702d8fa1 FROMLIST: mm: migrate high-order folios in swap cache correctly
* 613d8368e3 ANDROID: fuse-bpf: Follow mounts in lookups
Change-Id: I49d28ad030d7840490441ce6a7936b5e1047913e
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 9eab0421fa94a3dde0d1f7e36ab3294fc306c99d upstream.
The bug happens when the highest bit of holebegin is 1. Suppose holebegin is
0x8000000111111000: after the shift, hba would be 0xfff8000000111111, and
vma_interval_tree_foreach would then fail the lookup or return the wrong
result.
Example of a failing call sequence:
- mmap(..., offset=0x8000000111111000)
  |- syscall(mmap, ... unsigned long, off):
     |- ksys_mmap_pgoff( ... , off >> PAGE_SHIFT);
  Here pgoff is correctly shifted to 0x8000000111111, but passing
  0x8000000111111000 as holebegin to unmap then gives a bad result,
  as shown below:
- unmap_mapping_range(..., loff_t const holebegin)
  |- pgoff_t hba = holebegin >> PAGE_SHIFT;
     /* hba = 0xfff8000000111111 unexpectedly */
The issue happens in heterogeneous computing, where the device (e.g.
GPU) and host share the same virtual address space.
A simple workflow pattern that hits the issue is:
/* host */
1. userspace first mmaps a file-backed VA range with a specified offset.
   e.g. (offset=0x800..., mmap return: va_a)
2. write some data to the corresponding sys page
   e.g. (va_a = 0xAABB)
/* device */
3. gpu workload touches VA, triggers a gpu fault and notifies the host.
/* host */
4. received the gpu fault notification, then it will:
   4.1 unmap host pages and also take care of the cpu tlb
       (use unmap_mapping_range with offset=0x800...)
   4.2 migrate sys page to device
   4.3 set up device page table and resolve device fault.
/* device */
5. gpu workload continued, it accessed va_a and got 0xAABB.
6. gpu workload continued, it wrote 0xBBCC to va_a.
/* host */
7. userspace accesses va_a, as expected, it will:
   7.1 trigger cpu vm fault.
   7.2 driver handling fault to migrate gpu local page to host.
8. userspace then could correctly get 0xBBCC from va_a
9. done
But if step 4.1 hits the bug described in this patch, userspace never
triggers a cpu fault and still gets the old value: 0xAABB.
Making holebegin unsigned first fixes the bug.
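The difference can be reproduced in userspace with a small sketch. It assumes 4 KiB pages (PAGE_SHIFT of 12) and uses int64_t as a stand-in for loff_t; the helper names are hypothetical. Right-shifting a negative signed value is implementation-defined in C; GCC and Clang perform an arithmetic shift, which is what produces the sign-extended result quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Buggy form: the signed value is shifted, so the sign bit is smeared. */
uint64_t hba_signed_shift(int64_t holebegin)
{
	return (uint64_t)(holebegin >> PAGE_SHIFT);
}

/* Fixed form: convert to unsigned first, then shift in zeros. */
uint64_t hba_unsigned_shift(int64_t holebegin)
{
	return (uint64_t)holebegin >> PAGE_SHIFT;
}

/* Self-check with the offset used in the description. */
void hba_self_check(void)
{
	int64_t holebegin = (int64_t)0x8000000111111000ULL;

	assert(hba_unsigned_shift(holebegin) == 0x0008000000111111ULL);
}
```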
Link: https://lkml.kernel.org/r/20231220052839.26970-1-jiajun.xie.sh@gmail.com
Signed-off-by: Jiajun Xie <jiajun.xie.sh@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f42ce5f087eb69e47294ababd2e7e6f88a82d308 ]
In add_memory_resource(), creation of memory block devices occurs after
successful call to arch_add_memory(). However, creation of memory block
devices could fail. In that case, arch_remove_memory() is called to
perform necessary cleanup.
Currently, with or without altmap support, arch_remove_memory() is always
passed altmap set to NULL during error handling. This leads to
freeing of struct pages using free_pages(), even though the allocation
might have been performed with altmap support via
altmap_alloc_block_buf().
Fix the error handling by passing altmap in arch_remove_memory(). This
ensures the following:
* When altmap is disabled, deallocation of the struct pages array occurs
via free_pages().
* When altmap is enabled, deallocation occurs via vmem_altmap_free().
Link: https://lkml.kernel.org/r/20231120145354.308999-3-sumanthk@linux.ibm.com
Fixes: a08a2ae346 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: <stable@vger.kernel.org> [5.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 001002e73712cdf6b8d9a103648cda3040ad7647 ]
From Documentation/core-api/memory-hotplug.rst:
When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock
in write mode to serialise memory hotplug (e.g. access to global/zone
variables).
mhp_(de)init_memmap_on_memory() functions can change zone stats and
struct page content, but they are currently called w/o the
mem_hotplug_lock.
When a memory block is being offlined and kmemleak goes through each
populated zone, the following theoretical race condition could occur:
CPU 0:                                       | CPU 1:
memory_offline()                             |
-> offline_pages()                           |
   -> mem_hotplug_begin()                    |
   ...                                       |
   -> mem_hotplug_done()                     |
                                             | kmemleak_scan()
                                             | -> get_online_mems()
                                             |    ...
-> mhp_deinit_memmap_on_memory()             |
  [not protected by mem_hotplug_begin/done()]|
  Marks memory section as offline,           |   Retrieves zone_start_pfn
  poisons vmemmap struct pages and updates   |   and struct page members.
  the zone related data                      |
                                             |    ...
                                             | -> put_online_mems()
Fix this by ensuring mem_hotplug_lock is taken before performing
mhp_init_memmap_on_memory(). Also ensure that
mhp_deinit_memmap_on_memory() holds the lock.
online/offline_pages() are currently only called from
memory_block_online/offline(), so it is safe to move the locking there.
Link: https://lkml.kernel.org/r/20231120145354.308999-2-sumanthk@linux.ibm.com
Fixes: a08a2ae346 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Cc: <stable@vger.kernel.org> [5.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 762321dab9a72760bf9aec48362f932717c9424d ]
folio_wait_stable waits for writeback to finish before modifying the
contents of a folio again, e.g. to support checksumming of the data
in the block integrity code.
Currently this behavior is controlled by the SB_I_STABLE_WRITES flag
on the super_block, which means it is uniform for the entire file system.
This is wrong for the block device pseudofs, which is shared by all
block devices, and for file systems that can use multiple devices, like XFS
with the RT subvolume or btrfs (although btrfs currently reimplements
folio_wait_stable anyway).
Add a per-address_space AS_STABLE_WRITES flag to control the behavior
in a more fine grained way. The existing SB_I_STABLE_WRITES is kept
to initialize AS_STABLE_WRITES to the existing default which covers
most cases.
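The relationship between the two flags can be sketched as follows. This is a userspace model under stated assumptions: the structs, bit values and helper names (all suffixed _model) are hypothetical and only mirror the idea that the superblock flag seeds a per-mapping flag, which can then be changed for one mapping without touching the others.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the kernel's actual encodings differ. */
#define SB_I_STABLE_WRITES_MODEL 0x1u
#define AS_STABLE_WRITES_MODEL   0x1u

struct sb_model { unsigned int s_iflags; };
struct as_model { unsigned int flags; };

/* The superblock flag only provides the initial default. */
void as_init_model(struct as_model *mapping, const struct sb_model *sb)
{
	mapping->flags = 0;
	if (sb->s_iflags & SB_I_STABLE_WRITES_MODEL)
		mapping->flags |= AS_STABLE_WRITES_MODEL;
}

bool as_stable_writes_model(const struct as_model *mapping)
{
	return mapping->flags & AS_STABLE_WRITES_MODEL;
}

void as_clear_stable_writes_model(struct as_model *mapping)
{
	mapping->flags &= ~AS_STABLE_WRITES_MODEL;
}

/* Two mappings on one superblock can end up with different settings. */
void as_stable_self_check(void)
{
	struct sb_model sb = { .s_iflags = SB_I_STABLE_WRITES_MODEL };
	struct as_model a, b;

	as_init_model(&a, &sb);
	as_init_model(&b, &sb);
	as_clear_stable_writes_model(&b);
	assert(as_stable_writes_model(&a));
	assert(!as_stable_writes_model(&b));
}
```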
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231025141020.192413-2-hch@lst.de
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Stable-dep-of: 1898efcdbed3 ("block: update the stable_writes flag in bdev_add")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b4fa966f03b7401ceacd4ffd7227197afb2b8376 ]
Fscache has an optimisation by which reads from the cache are skipped
until we know that (a) there's data there to be read and (b) that data
isn't entirely covered by pages resident in the netfs pagecache. This is
done with two flags manipulated by fscache_note_page_release():
    if (...
        test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) &&
        test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags))
            clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags);
where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to
indicate that netfslib should download from the server or clear the page
instead.
The fscache_note_page_release() function is intended to be called from
->releasepage() - but that only gets called if PG_private or PG_private_2
is set - and currently the former is at the discretion of the network
filesystem and the latter is only set whilst a page is being written to
the cache, so sometimes we miss clearing the optimisation.
Fix this by following Willy's suggestion[1] and adding an address_space
flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call
->release_folio() if it's set, even if PG_private or PG_private_2 aren't
set.
Note that this would require folio_test_private() and page_has_private() to
become more complicated. To avoid that, in the places[*] where these are
used to conditionalise calls to filemap_release_folio() and
try_to_release_page(), the tests are removed, those functions are just
called unconditionally, and the test is performed there instead.
[*] There are some exceptions in vmscan.c where the check guards more than
just a call to the releaser. I've added a function, folio_needs_release()
to wrap all the checks for that.
AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from
fscache and cleared in ->evict_inode() before truncate_inode_pages_final()
is called.
Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared
and the optimisation cancelled if a cachefiles object already contains data
when we open it.
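The combined check described above can be modelled in a few lines. This is a sketch under assumptions, not the kernel function: struct folio_model, struct aspace_model and folio_needs_release_model() are hypothetical, and the boolean fields stand in for the PG_private/PG_private_2 test and the AS_RELEASE_ALWAYS flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an address_space carrying AS_RELEASE_ALWAYS. */
struct aspace_model {
	bool release_always;
};

/* Hypothetical stand-in for a folio. */
struct folio_model {
	bool has_private;		/* PG_private or PG_private_2 set */
	struct aspace_model *mapping;	/* may be NULL */
};

/* Release is needed if private data exists or the mapping demands it. */
bool folio_needs_release_model(const struct folio_model *folio)
{
	const struct aspace_model *mapping = folio->mapping;

	return folio->has_private || (mapping && mapping->release_always);
}

void folio_release_self_check(void)
{
	struct aspace_model cached = { .release_always = true };
	struct folio_model plain = { .has_private = false, .mapping = NULL };
	struct folio_model in_cache = { .has_private = false, .mapping = &cached };

	assert(!folio_needs_release_model(&plain));
	assert(folio_needs_release_model(&in_cache));
}
```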
[dwysocha@redhat.com: call folio_mapping() inside folio_needs_release()]
Link: 902c990e31
Link: https://lkml.kernel.org/r/20230628104852.3391651-3-dhowells@redhat.com
Fixes: 1f67e6d0b1 ("fscache: Provide a function to note the release of a page")
Fixes: 047487c947 ("cachefiles: Implement the I/O routines")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Tested-by: SeongJae Park <sj@kernel.org>
Cc: Daire Byrne <daire.byrne@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 1898efcdbed3 ("block: update the stable_writes flag in bdev_add")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0201ebf274a306a6ebb95e5dc2d6a0a27c737cac ]
Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.
This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache. The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.
The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.
The second patch is the actual fix. Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set. folio_needs_release() is altered to
add a check for this.
This patch (of 2):
Make filemap_release_folio() check folio_has_private(). Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.
There are a couple of sites in mm/vmscan.c where this can't so easily be
done. In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.
In shrink_active_list(), we don't have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.
A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory. This will acquire
additional parts to the condition in a future patch.
After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.
Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 1898efcdbed3 ("block: update the stable_writes flag in bdev_add")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ac5efa782041670b63a05c36d92d02a80e50bb63 ]
Replace try_to_release_page() with filemap_release_folio(). This change
is in preparation for the removal of the try_to_release_page() wrapper.
Link: https://lkml.kernel.org/r/20221118073055.55694-4-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 1898efcdbed3 ("block: update the stable_writes flag in bdev_add")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 64ab3195ea077eaeedc8b382939c3dc5ca56f369 ]
Replace some calls with their folio equivalents. This change removes 4
calls to compound_head() and is in preparation for the removal of the
try_to_release_page() wrapper.
Link: https://lkml.kernel.org/r/20221118073055.55694-3-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 1898efcdbed3 ("block: update the stable_writes flag in bdev_add")
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit c79c5a0a00a9457718056b588f312baadf44e471 upstream.
A process may map only some of the pages in a folio, and might be missed
if it maps the poisoned page but not the head page. Or it might be
unnecessarily hit if it maps the head page, but not the poisoned page.
Link: https://lkml.kernel.org/r/20231218135837.3310403-3-willy@infradead.org
Fixes: 7af446a841 ("HWPOISON, hugetlb: enable error handling path for hugepage")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 39ebd6dce62d8cfe3864e16148927a139f11bc9a upstream.
On 32-bit systems, we'll lose the top bits of index because arithmetic
will be performed in unsigned long instead of unsigned long long. This
affects files over 4GB in size.
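A small sketch of the truncation, assuming 4 KiB pages and using uint32_t to model a 32-bit unsigned long; the helper names are hypothetical. Page index 0x100000 corresponds to byte offset 4 GiB, which does not fit in 32 bits unless the index is widened before the shift.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Shift performed in 32-bit arithmetic: the top bits are lost. */
uint64_t offset_narrow(uint32_t index)
{
	return index << PAGE_SHIFT;
}

/* Widen first, then shift: the full 64-bit offset survives. */
uint64_t offset_wide(uint32_t index)
{
	return (uint64_t)index << PAGE_SHIFT;
}

void offset_self_check(void)
{
	assert(offset_narrow(0x00100000u) == 0);
	assert(offset_wide(0x00100000u) == 0x100000000ULL);
}
```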
Link: https://lkml.kernel.org/r/20231218135837.3310403-4-willy@infradead.org
Fixes: 6100e34b25 ("mm, memory_failure: Teach memory_failure() about dev_pagemap pages")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fc346d0a70a13d52fe1c4bc49516d83a42cd7c4c upstream.
Large folios occupy N consecutive entries in the swap cache instead of
using multi-index entries like the page cache. However, if a large folio
is re-added to the LRU list, it can be migrated. The migration code was
not aware of the difference between the swap cache and the page cache and
assumed that a single xas_store() would be sufficient.
This leaves potentially many stale pointers to the now-migrated folio in
the swap cache, which can lead to almost arbitrary data corruption in the
future. This can also manifest as infinite loops with the RCU read lock
held.
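The difference between the two layouts can be illustrated with a plain array standing in for the swap cache. This is only a model under assumptions: the array, the helper names and the use of void pointers are hypothetical, and the real fix operates on the XArray rather than a C array.

```c
#include <assert.h>
#include <stddef.h>

/* Buggy pattern: a single store, as if one multi-index entry existed. */
void replace_first_slot(void **slots, size_t start, void *newfolio)
{
	slots[start] = newfolio;
}

/* Needed for the swap cache: update each of the N consecutive entries. */
void replace_all_slots(void **slots, size_t start, size_t nr, void *newfolio)
{
	for (size_t i = 0; i < nr; i++)
		slots[start + i] = newfolio;
}

void swap_slots_self_check(void)
{
	int oldf, newf;
	void *slots[4] = { &oldf, &oldf, &oldf, &oldf };

	replace_first_slot(slots, 0, &newf);
	assert(slots[1] == (void *)&oldf);	/* stale pointer left behind */

	replace_all_slots(slots, 0, 4, &newf);
	for (size_t i = 0; i < 4; i++)
		assert(slots[i] == (void *)&newf);
}
```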
[willy@infradead.org: modifications to the changelog & tweaked the fix]
Fixes: 3417013e0d ("mm/migrate: Add folio_migrate_mapping()")
Link: https://lkml.kernel.org/r/20231214045841.961776-1-willy@infradead.org
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Closes: https://lkml.kernel.org/r/1700569840-17327-1-git-send-email-quic_charante@quicinc.com
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e2c27b803bb664748e090d99042ac128b3f88d92 upstream.
The following concurrency may cause the data read to be inconsistent with
the data on disk:
         cpu1                           cpu2
------------------------------|------------------------------
                               // Buffered write 2048 from 0
                               ext4_buffered_write_iter
                                generic_perform_write
                                 copy_page_from_iter_atomic
                                 ext4_da_write_end
                                  ext4_da_do_write_end
                                   block_write_end
                                    __block_commit_write
                                     folio_mark_uptodate
// Buffered read 4096 from 0          smp_wmb()
ext4_file_read_iter                   set_bit(PG_uptodate, folio_flags)
 generic_file_read_iter            i_size_write // 2048
  filemap_read                     unlock_page(page)
   filemap_get_pages
    filemap_get_read_batch
    folio_test_uptodate(folio)
     ret = test_bit(PG_uptodate, folio_flags)
     if (ret)
      smp_rmb();
      // Ensure that the data in page 0-2048 is up-to-date.
                               // New buffered write 2048 from 2048
                               ext4_buffered_write_iter
                                generic_perform_write
                                 copy_page_from_iter_atomic
                                 ext4_da_write_end
                                  ext4_da_do_write_end
                                   block_write_end
                                    __block_commit_write
                                     folio_mark_uptodate
                                      smp_wmb()
                                      set_bit(PG_uptodate, folio_flags)
                                   i_size_write // 4096
                                   unlock_page(page)
   isize = i_size_read(inode) // 4096
   // Read the latest isize 4096, but without smp_rmb(), there may be
   // Load-Load disorder resulting in the data in the 2048-4096 range
   // in the page is not up-to-date.
   copy_page_to_iter
   // copyout 4096
In the concurrency above, we read the updated i_size, but there is no read
barrier to ensure that the data in the page is the same as the i_size at
this point, so we may copy the unsynchronized page out. Hence add the
missing read memory barrier to fix this.
This is a Load-Load reordering issue, which only occurs on some weak
mem-ordering architectures (e.g. ARM64, ALPHA), but not on strong
mem-ordering architectures (e.g. X86). Theoretically the problem is not
limited to ext4: filesystems that call filemap_read() without holding the
inode lock (e.g. btrfs, f2fs, ubifs ...) will have this problem, while
filesystems that hold the inode lock (e.g. xfs, nfs) won't have this
problem.
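The ordering requirement can be sketched in userspace with C11 fences standing in for the kernel barriers. This is a model under assumptions: struct file_model, model_write() and model_read() are hypothetical, the release fence plays the role of smp_wmb() on the write side, and the acquire fence plays the role of the added smp_rmb() between reading the size and copying the data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096

/* Hypothetical one-page file: data plus a published size. */
struct file_model {
	char page[PAGE_BYTES];
	_Atomic size_t i_size;
};

/* Writer: fill in the data first, then publish the new size. */
void model_write(struct file_model *f, size_t pos, const char *buf, size_t len)
{
	assert(pos + len <= PAGE_BYTES);
	memcpy(f->page + pos, buf, len);
	atomic_thread_fence(memory_order_release);	/* stands in for smp_wmb() */
	atomic_store_explicit(&f->i_size, pos + len, memory_order_relaxed);
}

/* Reader: load the size, fence, and only then copy that many bytes. */
size_t model_read(struct file_model *f, char *out)
{
	size_t isize = atomic_load_explicit(&f->i_size, memory_order_relaxed);

	atomic_thread_fence(memory_order_acquire);	/* stands in for smp_rmb() */
	memcpy(out, f->page, isize);
	return isize;
}
```

Without the acquire fence, a weakly ordered CPU may perform the data loads before the size load, so the copy could return bytes that predate the size it observed.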
Link: https://lkml.kernel.org/r/20231213062324.739009-1-libaokun1@huawei.com
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: yangerkun <yangerkun@huawei.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Merge 6.1.68 into android14-6.1-lts
Changes in 6.1.68
vdpa/mlx5: preserve CVQ vringh index
hrtimers: Push pending hrtimers away from outgoing CPU earlier
i2c: designware: Fix corrupted memory seen in the ISR
netfilter: ipset: fix race condition between swap/destroy and kernel side add/del/test
zstd: Fix array-index-out-of-bounds UBSAN warning
tg3: Move the [rt]x_dropped counters to tg3_napi
tg3: Increment tx_dropped in tg3_tso_bug()
kconfig: fix memory leak from range properties
drm/amdgpu: correct chunk_ptr to a pointer to chunk.
x86: Introduce ia32_enabled()
x86/coco: Disable 32-bit emulation by default on TDX and SEV
x86/entry: Convert INT 0x80 emulation to IDTENTRY
x86/entry: Do not allow external 0x80 interrupts
x86/tdx: Allow 32-bit emulation by default
dt: dt-extract-compatibles: Handle cfile arguments in generator function
dt: dt-extract-compatibles: Don't follow symlinks when walking tree
platform/x86: asus-wmi: Move i8042 filter install to shared asus-wmi code
of: dynamic: Fix of_reconfig_get_state_change() return value documentation
platform/x86: wmi: Skip blocks with zero instances
ipv6: fix potential NULL deref in fib6_add()
octeontx2-pf: Add missing mutex lock in otx2_get_pauseparam
octeontx2-af: Check return value of nix_get_nixlf before using nixlf
hv_netvsc: rndis_filter needs to select NLS
r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
r8152: Add RTL8152_INACCESSIBLE checks to more loops
r8152: Add RTL8152_INACCESSIBLE to r8156b_wait_loading_flash()
r8152: Add RTL8152_INACCESSIBLE to r8153_pre_firmware_1()
r8152: Add RTL8152_INACCESSIBLE to r8153_aldps_en()
mlxbf-bootctl: correctly identify secure boot with development keys
platform/mellanox: Add null pointer checks for devm_kasprintf()
platform/mellanox: Check devm_hwmon_device_register_with_groups() return value
arcnet: restoring support for multiple Sohard Arcnet cards
octeontx2-pf: consider both Rx and Tx packet stats for adaptive interrupt coalescing
net: stmmac: fix FPE events losing
xsk: Skip polling event check for unbound socket
octeontx2-af: fix a use-after-free in rvu_npa_register_reporters
i40e: Fix unexpected MFS warning message
iavf: validate tx_coalesce_usecs even if rx_coalesce_usecs is zero
net: bnxt: fix a potential use-after-free in bnxt_init_tc
tcp: fix mid stream window clamp.
ionic: fix snprintf format length warning
ionic: Fix dim work handling in split interrupt mode
ipv4: ip_gre: Avoid skb_pull() failure in ipgre_xmit()
net: atlantic: Fix NULL dereference of skb pointer in
net: hns: fix wrong head when modify the tx feature when sending packets
net: hns: fix fake link up on xge port
octeontx2-af: Adjust Tx credits when MCS external bypass is disabled
octeontx2-af: Fix mcs sa cam entries size
octeontx2-af: Fix mcs stats register address
octeontx2-af: Add missing mcs flr handler call
octeontx2-af: Update Tx link register range
dt-bindings: interrupt-controller: Allow #power-domain-cells
netfilter: nft_exthdr: add boolean DCCP option matching
netfilter: nf_tables: fix 'exist' matching on bigendian arches
netfilter: nf_tables: bail out on mismatching dynset and set expressions
netfilter: nf_tables: validate family when identifying table via handle
netfilter: xt_owner: Fix for unsafe access of sk->sk_socket
tcp: do not accept ACK of bytes we never sent
bpf: sockmap, updating the sg structure should also update curr
psample: Require 'CAP_NET_ADMIN' when joining "packets" group
drop_monitor: Require 'CAP_SYS_ADMIN' when joining "events" group
mm/damon/sysfs: eliminate potential uninitialized variable warning
tee: optee: Fix supplicant based device enumeration
RDMA/hns: Fix unnecessary err return when using invalid congest control algorithm
RDMA/irdma: Do not modify to SQD on error
RDMA/irdma: Add wait for suspend on SQD
arm64: dts: rockchip: Expand reg size of vdec node for RK3328
arm64: dts: rockchip: Expand reg size of vdec node for RK3399
ASoC: fsl_sai: Fix no frame sync clock issue on i.MX8MP
RDMA/rtrs-srv: Do not unconditionally enable irq
RDMA/rtrs-clt: Start hb after path_up
RDMA/rtrs-srv: Check return values while processing info request
RDMA/rtrs-srv: Free srv_mr iu only when always_invalidate is true
RDMA/rtrs-srv: Destroy path files after making sure no IOs in-flight
RDMA/rtrs-clt: Fix the max_send_wr setting
RDMA/rtrs-clt: Remove the warnings for req in_use check
RDMA/bnxt_re: Correct module description string
RDMA/irdma: Refactor error handling in create CQP
RDMA/irdma: Fix UAF in irdma_sc_ccq_get_cqe_info()
hwmon: (acpi_power_meter) Fix 4.29 MW bug
ASoC: codecs: lpass-tx-macro: set active_decimator correct default value
hwmon: (nzxt-kraken2) Fix error handling path in kraken2_probe()
ASoC: wm_adsp: fix memleak in wm_adsp_buffer_populate
RDMA/core: Fix umem iterator when PAGE_SIZE is greater then HCA pgsz
RDMA/irdma: Avoid free the non-cqp_request scratch
drm/bridge: tc358768: select CONFIG_VIDEOMODE_HELPERS
arm64: dts: imx8mq: drop usb3-resume-missing-cas from usb
arm64: dts: imx8mp: imx8mq: Add parkmode-disable-ss-quirk on DWC3
ARM: dts: imx6ul-pico: Describe the Ethernet PHY clock
tracing: Fix a warning when allocating buffered events fails
scsi: be2iscsi: Fix a memleak in beiscsi_init_wrb_handle()
ARM: imx: Check return value of devm_kasprintf in imx_mmdc_perf_init
ARM: dts: imx7: Declare timers compatible with fsl,imx6dl-gpt
ARM: dts: imx28-xea: Pass the 'model' property
riscv: fix misaligned access handling of C.SWSP and C.SDSP
md: introduce md_ro_state
md: don't leave 'MD_RECOVERY_FROZEN' in error path of md_set_readonly()
iommu: Avoid more races around device probe
rethook: Use __rcu pointer for rethook::handler
kprobes: consistent rcu api usage for kretprobe holder
ASoC: amd: yc: Fix non-functional mic on ASUS E1504FA
io_uring/af_unix: disable sending io_uring over sockets
nvme-pci: Add sleep quirk for Kingston drives
io_uring: fix mutex_unlock with unreferenced ctx
ALSA: usb-audio: Add Pioneer DJM-450 mixer controls
ALSA: pcm: fix out-of-bounds in snd_pcm_state_names
ALSA: hda/realtek: Enable headset on Lenovo M90 Gen5
ALSA: hda/realtek: add new Framework laptop to quirks
ALSA: hda/realtek: Add Framework laptop 16 to quirks
ring-buffer: Test last update in 32bit version of __rb_time_read()
nilfs2: fix missing error check for sb_set_blocksize call
nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage()
cgroup_freezer: cgroup_freezing: Check if not frozen
checkstack: fix printed address
tracing: Always update snapshot buffer size
tracing: Disable snapshot buffer when stopping instance tracers
tracing: Fix incomplete locking when disabling buffered events
tracing: Fix a possible race when disabling buffered events
packet: Move reference count in packet_sock to atomic_long_t
r8169: fix rtl8125b PAUSE frames blasting when suspended
regmap: fix bogus error on regcache_sync success
platform/surface: aggregator: fix recv_buf() return value
hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write
mm: fix oops when filemap_map_pmd() without prealloc_pte
powercap: DTPM: Fix missing cpufreq_cpu_put() calls
md/raid6: use valid sector values to determine if an I/O should wait on the reshape
arm64: dts: mediatek: mt7622: fix memory node warning check
arm64: dts: mediatek: mt8183-kukui-jacuzzi: fix dsi unnecessary cells properties
arm64: dts: mediatek: cherry: Fix interrupt cells for MT6360 on I2C7
arm64: dts: mediatek: mt8173-evb: Fix regulator-fixed node names
arm64: dts: mediatek: mt8195: Fix PM suspend/resume with venc clocks
arm64: dts: mediatek: mt8183: Fix unit address for scp reserved memory
arm64: dts: mediatek: mt8183: Move thermal-zones to the root node
arm64: dts: mediatek: mt8183-evb: Fix unit_address_vs_reg warning on ntc
binder: fix memory leaks of spam and pending work
coresight: etm4x: Make etm4_remove_dev() return void
coresight: etm4x: Remove bogous __exit annotation for some functions
hwtracing: hisi_ptt: Add dummy callback pmu::read()
misc: mei: client.c: return negative error code in mei_cl_write
misc: mei: client.c: fix problem of return '-EOVERFLOW' in mei_cl_write
LoongArch: BPF: Don't sign extend memory load operand
LoongArch: BPF: Don't sign extend function return value
ring-buffer: Force absolute timestamp on discard of event
tracing: Set actual size after ring buffer resize
tracing: Stop current tracer when resizing buffer
parisc: Reduce size of the bug_table on 64-bit kernel by half
parisc: Fix asm operand number out of range build error in bug table
arm64: dts: mediatek: add missing space before {
arm64: dts: mt8183: kukui: Fix underscores in node names
perf: Fix perf_event_validate_size()
x86/sev: Fix kernel crash due to late update to read-only ghcb_version
gpiolib: sysfs: Fix error handling on failed export
drm/amdgpu: fix memory overflow in the IB test
drm/amd/amdgpu: Fix warnings in amdgpu/amdgpu_display.c
drm/amdgpu: correct the amdgpu runtime dereference usage count
drm/amdgpu: Update ras eeprom support for smu v13_0_0 and v13_0_10
drm/amdgpu: Add EEPROM I2C address support for ip discovery
drm/amdgpu: Remove redundant I2C EEPROM address
drm/amdgpu: Decouple RAS EEPROM addresses from chips
drm/amdgpu: Add support for RAS table at 0x40000
drm/amdgpu: Remove second moot switch to set EEPROM I2C address
drm/amdgpu: Return from switch early for EEPROM I2C address
drm/amdgpu: simplify amdgpu_ras_eeprom.c
drm/amdgpu: Add I2C EEPROM support on smu v13_0_6
drm/amdgpu: Update EEPROM I2C address for smu v13_0_0
usb: gadget: f_hid: fix report descriptor allocation
serial: 8250_dw: Add ACPI ID for Granite Rapids-D UART
parport: Add support for Brainboxes IX/UC/PX parallel cards
cifs: Fix non-availability of dedup breaking generic/304
Revert "xhci: Loosen RPM as default policy to cover for AMD xHC 1.1"
smb: client: fix potential NULL deref in parse_dfs_referrals()
usb: typec: class: fix typec_altmode_put_partner to put plugs
ARM: PL011: Fix DMA support
serial: sc16is7xx: address RX timeout interrupt errata
serial: 8250: 8250_omap: Clear UART_HAS_RHR_IT_DIS bit
serial: 8250: 8250_omap: Do not start RX DMA on THRI interrupt
serial: 8250_omap: Add earlycon support for the AM654 UART controller
devcoredump: Send uevent once devcd is ready
x86/CPU/AMD: Check vendor in the AMD microcode callback
USB: gadget: core: adjust uevent timing on gadget unbind
cifs: Fix flushing, invalidation and file size with copy_file_range()
cifs: Fix flushing, invalidation and file size with FICLONE
MIPS: kernel: Clear FPU states when setting up kernel threads
KVM: s390/mm: Properly reset no-dat
KVM: SVM: Update EFER software model on CR0 trap for SEV-ES
MIPS: Loongson64: Reserve vgabios memory on boot
MIPS: Loongson64: Handle more memory types passed from firmware
MIPS: Loongson64: Enable DMA noncoherent support
netfilter: nft_set_pipapo: skip inactive elements during set walk
riscv: Kconfig: Add select ARM_AMBA to SOC_STARFIVE
drm/i915/display: Drop check for doublescan mode in modevalid
drm/i915/lvds: Use REG_BIT() & co.
drm/i915/sdvo: stop caching has_hdmi_monitor in struct intel_sdvo
drm/i915: Skip some timing checks on BXT/GLK DSI transcoders
Linux 6.1.68
Change-Id: I0a824071a80b24dc4a2e0077f305b7cac42235b8
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
directly replacing the entries of VMAs in the new maple tree can result in
better performance. __mt_dup() uses DFS pre-order to duplicate the maple
tree, so it is efficient.
The average time complexity of __mt_dup() is O(n), where n is the number
of VMAs. The proof of the time complexity is provided in the commit log
that introduces __mt_dup(). After duplicating the maple tree, each
element is traversed and replaced (ignoring the cases of deletion, which
are rare). Since it is only a replacement operation for each element,
this process is also O(n).
Analyzing the exact time complexity of the previous algorithm is
challenging because each insertion can involve appending to a node,
pushing data to adjacent nodes, or even splitting nodes. The frequency of
each action is difficult to calculate. The worst-case scenario for a
single insertion is when the tree undergoes splitting at every level. If
we consider each insertion as the worst-case scenario, we can determine
that the upper bound of the time complexity is O(n*log(n)), although this
is a loose upper bound. However, based on the test data, it appears that
the actual time complexity is likely to be O(n).
As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
fails, there will be a portion of VMAs that have not been duplicated in
the maple tree. To handle this, we mark the failure point with
XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop
releasing VMAs that have not been duplicated after this point.
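The failure handling described above can be modelled in user space. The sketch below is not the kernel code: a flat array stands in for the maple tree, memcpy() for __mt_dup(), and every toy_* name is hypothetical. It only illustrates the idea of marking the failure point with a sentinel (the role XA_ZERO_ENTRY plays) and stopping teardown there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
struct toy_vma { int id; };

/* Sentinel playing the role of XA_ZERO_ENTRY: nothing valid from here on. */
static struct toy_vma toy_zero_entry;
#define TOY_ZERO_ENTRY (&toy_zero_entry)

/*
 * Duplicate the slot array in one pass, then replace each slot with a
 * fresh copy of the VMA. fail_at simulates an allocation failure at that
 * index (pass n for "never fail"). Returns 0 on success, -1 on failure.
 */
static int toy_dup(struct toy_vma **dst, struct toy_vma *const *src,
		   size_t n, size_t fail_at)
{
	assert(dst && src);
	/* After this copy, dst still points at the parent's VMAs. */
	memcpy(dst, src, n * sizeof(*dst));
	for (size_t i = 0; i < n; i++) {
		struct toy_vma *copy =
			(i == fail_at) ? NULL : malloc(sizeof(*copy));
		if (!copy) {
			dst[i] = TOY_ZERO_ENTRY; /* mark the failure point */
			return -1;
		}
		*copy = *src[i];
		dst[i] = copy; /* plain replacement, no rebalancing */
	}
	return 0;
}

/*
 * Teardown: free duplicated entries and stop at the marker, because
 * everything after it still belongs to the parent.
 * Returns the number of entries freed.
 */
static size_t toy_teardown(struct toy_vma **slots, size_t n)
{
	size_t freed = 0;

	for (size_t i = 0; i < n; i++) {
		if (slots[i] == TOY_ZERO_ENTRY)
			break;
		free(slots[i]);
		slots[i] = NULL;
		freed++;
	}
	return freed;
}
```

With three entries and a simulated failure at index 1, toy_teardown() frees only the first entry and leaves the parent's remaining entries untouched.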
There is a "spawn" in byte-unixbench[1], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different numbers of VMAs.
Below are the test results. The first row shows the number of VMAs. The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and this patchset, respectively. The
test results were obtained with CPU binding to avoid scheduler load
balancing that could cause unstable results. There are still some
fluctuations in the test results, but at least they are better than the
original performance.
21     121   221    421    821    1621   3221   6421   12821  25621  51221
112100 76261 54227  34035  20195  11112  6017   3161   1606   802    393
114558 83067 65008  45824  28751  16072  8922   4747   2436   1233   599
2.19%  8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
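The fourth row is not described in the text; it appears to be the relative gain of the third row over the second. A small helper, assuming that reading (the function name is made up for illustration), reproduces the figures:

```c
#include <assert.h>

/*
 * Relative improvement in percent of "after" over "before", where both
 * are fork() counts per ten seconds for the same number of VMAs.
 */
static double improvement_pct(long before, long after)
{
	assert(before > 0);
	return 100.0 * (double)(after - before) / (double)before;
}
```

For the 121-VMA column, improvement_pct(76261, 83067) is about 8.92, which matches the table.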
[1] https://github.com/kdlucas/byte-unixbench/tree/master
Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit d2406291483775ecddaee929231a39c70c08fda2
https://git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
[surenb: open-coded vma_iter_clear_gfp(), vma_iter_bulk_store();
replaced vma_next() with mas_find()]
Bug: 308042511
Change-Id: I42d6620e8ce6a0b16211c231a9b72ba16ba9c0d2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
I think this is a pretty rare occurrence, but for consistency handle
faults with the VMA lock held the same way that we handle other faults
with the VMA lock held.
Link: https://lkml.kernel.org/r/20231006195318.4087158-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4a68fef16df9d88d528094116f8bbd2dbfa62089)
Bug: 293665307
Change-Id: I69cec218c8a1fe14df3268722e6b1be6dffe7978
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Most file-backed faults are already handled through ->map_pages(), but if
we need to do I/O we'll come this way. Since filemap_fault() is now safe
to be called under the VMA lock, we can handle these faults under the VMA
lock now.
Link: https://lkml.kernel.org/r/20231006195318.4087158-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 12214eba1992642eee5813a9cc9f626e5b2d1815)
Bug: 293665307
Change-Id: Iee48af98b866d88d88ec01143eb26389ab373b6b
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
If the page is not currently present in the page tables, we need to call
the page fault handler to find out which page we're supposed to COW, so we
need to both check that there is already an anon_vma and that the fault
handler doesn't need the mmap_lock.
Link: https://lkml.kernel.org/r/20231006195318.4087158-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4de8c93a4751e10737b6af65db42c743228c67a6)
Bug: 293665307
Change-Id: If749a6f8fcf69d83bbf872c1d45865d1b1b77ea0
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
There are many implementations of ->fault and some of them depend on
mmap_lock being held. All vm_ops that implement ->map_pages() end up
calling filemap_fault(), which I have audited to be sure it does not rely
on mmap_lock. So (for now) key off ->map_pages existing as a flag to
indicate that it's safe to call ->fault while only holding the vma lock.
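A user-space model of that check might look like the following. The types, flag values, and function name are toy stand-ins, and dropping the VMA lock before returning RETRY is an assumption of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel types and flags; the values are illustrative. */
#define TOY_FAULT_FLAG_VMA_LOCK 0x1u
#define TOY_VM_FAULT_RETRY      0x2u

struct toy_vm_ops {
	int (*fault)(void);
	int (*map_pages)(void);
};

struct toy_vma {
	const struct toy_vm_ops *vm_ops;
	bool vma_read_locked;
};

struct toy_vm_fault {
	struct toy_vma *vma;
	unsigned int flags;
};

/* Placeholder handler so the ops tables have something to point at. */
static int toy_noop(void)
{
	return 0;
}

/*
 * Returns 0 if ->fault may be called with the locks currently held,
 * or TOY_VM_FAULT_RETRY if the caller must retry under mmap_lock.
 */
static unsigned int toy_can_call_fault(struct toy_vm_fault *vmf)
{
	struct toy_vma *vma;

	assert(vmf && vmf->vma && vmf->vma->vm_ops);
	vma = vmf->vma;

	/* ->map_pages present, or mmap_lock held: ->fault is safe. */
	if (vma->vm_ops->map_pages || !(vmf->flags & TOY_FAULT_FLAG_VMA_LOCK))
		return 0;

	/* Assumed behaviour: release the VMA lock before asking for a retry. */
	vma->vma_read_locked = false;
	return TOY_VM_FAULT_RETRY;
}
```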
Link: https://lkml.kernel.org/r/20231006195318.4087158-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4ed4379881aa62588aba6442a9f362a8cf7624e6)
Bug: 293665307
Change-Id: Ifb5ab3df5d05fb182d0cb52820fa24e28e2d6496
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
It is usually safe to call wp_page_copy() under the VMA lock. The only
unsafe situation is when no anon_vma has been allocated for this VMA, and
we have to look at adjacent VMAs to determine if their anon_vma can be
shared. Since this happens only for the first COW of a page in this VMA,
the majority of calls to wp_page_copy() do not need to fall back to the
mmap_lock.
Add vmf_anon_prepare() as an alternative to anon_vma_prepare() which will
return RETRY if we currently hold the VMA lock and need to allocate an
anon_vma. This lets us drop the check in do_wp_page().
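The control flow can be sketched in user space as below. This is a model, not the kernel implementation: all toy_* names are hypothetical, the slow path is a stub, and releasing the VMA lock on the RETRY path is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FAULT_FLAG_VMA_LOCK 0x1u

enum toy_fault_result { TOY_OK = 0, TOY_RETRY, TOY_OOM };

struct toy_anon_vma { int unused; };

struct toy_vma {
	struct toy_anon_vma *anon_vma;
	bool vma_read_locked;
};

struct toy_vm_fault {
	struct toy_vma *vma;
	unsigned int flags;
};

static struct toy_anon_vma toy_shared_anon_vma;

/* Stub for the slow path that may look at adjacent VMAs; needs mmap_lock. */
static int toy_anon_vma_prepare(struct toy_vma *vma)
{
	vma->anon_vma = &toy_shared_anon_vma;
	return 0;
}

static enum toy_fault_result toy_vmf_anon_prepare(struct toy_vm_fault *vmf)
{
	struct toy_vma *vma;

	assert(vmf && vmf->vma);
	vma = vmf->vma;

	/* Common case: an anon_vma already exists, nothing to do. */
	if (vma->anon_vma)
		return TOY_OK;

	/* First COW in this VMA while holding only the VMA lock: retry. */
	if (vmf->flags & TOY_FAULT_FLAG_VMA_LOCK) {
		vma->vma_read_locked = false; /* assumed: lock is dropped */
		return TOY_RETRY;
	}

	/* mmap_lock is held, so the slow path is allowed. */
	if (toy_anon_vma_prepare(vma))
		return TOY_OOM;
	return TOY_OK;
}
```

Only the first COW in a VMA takes the RETRY path; later faults find anon_vma set and return immediately.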
Link: https://lkml.kernel.org/r/20231006195318.4087158-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 164b06f238b986317131e6b61b2f22aabcbc2cc0)
[surenb: resolved merge conflicts due to folio/page differences]
Bug: 293665307
Change-Id: I39bdc247b375bd3dae8078b52c60fd4ce12e1850
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Patch series "Handle more faults under the VMA lock", v2.
At this point, we're handling the majority of file-backed page faults
under the VMA lock, using the ->map_pages entry point. This patch set
attempts to expand that for the following situations:
 - We have to do a read. This could be because we've hit the point in
   the readahead window where we need to kick off the next readahead,
   or because the page is simply not present in cache.
 - We're handling a write fault. Most applications don't do I/O by writes
   to shared mmaps for very good reasons, but some do, and it'd be nice
   to not make that slow unnecessarily.
 - We're doing a COW of a private mapping (both PTE already present
   and PTE not-present). These are two different codepaths and I handle
   both of them in this patch set.
There is no support in this patch set for drivers to mark themselves as
being VMA lock friendly; they could implement the ->map_pages
vm_operation, but if they do, they would be the first. This is probably
something we want to change at some point in the future, and I've marked
where to make that change in the code.
There is very little performance change in the benchmarks we've run;
mostly because the vast majority of page faults are handled through the
other paths. I still think this patch series is useful for workloads that
may take these paths more often, and just for cleaning up the fault path
in general (it's now clearer why we have to retry in these cases).
This patch (of 6):
Drop the VMA lock instead of the mmap_lock if that's the one which
is held.
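As a rough user-space illustration of that rule (the types, flag value, and function name are toy stand-ins, not the kernel's definitions), the choice of which lock to release depends only on how the fault was entered:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FAULT_FLAG_VMA_LOCK 0x1u

struct toy_mm { bool mmap_read_locked; };

struct toy_vma {
	struct toy_mm *mm;
	bool vma_read_locked;
};

struct toy_vm_fault {
	struct toy_vma *vma;
	unsigned int flags;
};

/* Release whichever lock this fault was entered with. */
static void toy_release_fault_lock(struct toy_vm_fault *vmf)
{
	assert(vmf && vmf->vma && vmf->vma->mm);
	if (vmf->flags & TOY_FAULT_FLAG_VMA_LOCK)
		vmf->vma->vma_read_locked = false;      /* per-VMA lock */
	else
		vmf->vma->mm->mmap_read_locked = false; /* mmap_lock */
}
```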
Link: https://lkml.kernel.org/r/20231006195318.4087158-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231006195318.4087158-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 5d74b2ab2c15d596c470bae6626f345d5575a9d0)
Bug: 293665307
Change-Id: Ife2d11ab12fb428868cd44751784cf731fbffe62
Signed-off-by: Suren Baghdasaryan <surenb@google.com>