68a63a077e
19424 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Greg Kroah-Hartman
|
45ea58f9db |
Revert "memcg: drop kmem.limit_in_bytes"
This reverts commit
|
||
Kalesh Singh
|
017a058053 |
Multi-gen LRU: avoid race in inc_min_seq()
commit bb5e7f234eacf34b65be67ebb3613e3b8cf11b87 upstream.
inc_max_seq() will try to inc_min_seq() if nr_gens == MAX_NR_GENS. This
is because the generations are reused (the last oldest now empty
generation will become the next youngest generation).
inc_min_seq() is retried until successful, dropping the lru_lock
and yielding the CPU on each failure, and retaking the lock before
trying again:
while (!inc_min_seq(lruvec, type, can_swap)) {
spin_unlock_irq(&lruvec->lru_lock);
cond_resched();
spin_lock_irq(&lruvec->lru_lock);
}
However, the initial condition that required incrementing the min_seq
(nr_gens == MAX_NR_GENS) is not retested. This can change by another
call to inc_max_seq() from run_aging() with force_scan=true from the
debugfs interface.
Since the eviction stalls when the nr_gens == MIN_NR_GENS, avoid
unnecessarily incrementing the min_seq by rechecking the number of
generations before each attempt.
This issue was uncovered in previous discussion on the list by Yu Zhao
and Aneesh Kumar [1].
[1] https://lore.kernel.org/linux-mm/CAOUHufbO7CaVm=xjEb1avDhHVvnC8pJmGyKcFf2iY_dpf+zR3w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20230802025606.346758-2-kaleshsingh@google.com
Fixes:
|
||
Muchun Song
|
84a212a72c |
mm: hugetlb_vmemmap: fix a race between vmemmap pmd split
commit 3ce2c24cb68f228590a053d6058a5901cd31af61 upstream.
The local variable @page in __split_vmemmap_huge_pmd() to obtain a pmd
page without holding page_table_lock may possiblely get the page table
page instead of a huge pmd page.
The effect may be in set_pte_at() since we may pass an invalid page
struct, if set_pte_at() wants to access the page struct (e.g.
CONFIG_PAGE_TABLE_CHECK is enabled), it may crash the kernel.
So fix it. And inline __split_vmemmap_huge_pmd() since it only has one
user.
Link: https://lkml.kernel.org/r/20230707033859.16148-1-songmuchun@bytedance.com
Fixes:
|
||
Michal Hocko
|
21ef9e1120 |
memcg: drop kmem.limit_in_bytes
commit 86327e8eb94c52eca4f93cfece2e29d1bf52acbf upstream. kmem.limit_in_bytes (v1 way to limit kernel memory usage) has been deprecated since |
||
Kalesh Singh
|
f367915961 |
Multi-gen LRU: fix per-zone reclaim
commit 669281ee7ef731fb5204df9d948669bf32a5e68d upstream.
MGLRU has a LRU list for each zone for each type (anon/file) in each
generation:
long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
The min_seq (oldest generation) can progress independently for each
type but the max_seq (youngest generation) is shared for both anon and
file. This is to maintain a common frame of reference.
In order for eviction to advance the min_seq of a type, all the per-zone
lists in the oldest generation of that type must be empty.
The eviction logic only considers pages from eligible zones for
eviction or promotion.
scan_folios() {
...
for (zone = sc->reclaim_idx; zone >= 0; zone--) {
...
sort_folio(); // Promote
...
isolate_folio(); // Evict
}
...
}
Consider the system has the movable zone configured and default 4
generations. The current state of the system is as shown below
(only illustrating one type for simplicity):
Type: ANON
Zone DMA32 Normal Movable Device
Gen 0 0 0 4GB 0
Gen 1 0 1GB 1MB 0
Gen 2 1MB 4GB 1MB 0
Gen 3 1MB 1MB 1MB 0
Now consider there is a GFP_KERNEL allocation request (eligible zone
index <= Normal), evict_folios() will return without doing any work
since there are no pages to scan in the eligible zones of the oldest
generation. Reclaim won't make progress until triggered from a ZONE_MOVABLE
allocation request; which may not happen soon if there is a lot of free
memory in the movable zone. This can lead to OOM kills, although there
is 1GB pages in the Normal zone of Gen 1 that we have not yet tried to
reclaim.
This issue is not seen in the conventional active/inactive LRU since
there are no per-zone lists.
If there are no (not enough) folios to scan in the eligible zones, move
folios from ineligible zone (zone_index > reclaim_index) to the next
generation. This allows for the progression of min_seq and reclaiming
from the next generation (Gen 1).
Qualcomm, Mediatek and raspberrypi [1] discovered this issue independently.
[1] https://github.com/raspberrypi/linux/issues/5395
Link: https://lkml.kernel.org/r/20230802025606.346758-1-kaleshsingh@google.com
Fixes:
|
||
Yu Zhao
|
a73d04c460 |
mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
commit 6df1b2212950aae2b2188c6645ea18e2a9e3fdd5 upstream. lru_gen_folio will be chained into per-node lists by the coming lrugen->list. Link: https://lkml.kernel.org/r/20221222041905.2431096-3-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Joel Fernandes (Google)
|
4245ca8f40 |
mm/vmalloc: add a safer version of find_vm_area() for debug
commit 0818e739b5c061b0251c30152380600fb9b84c0c upstream.
It is unsafe to dump vmalloc area information when trying to do so from
some contexts. Add a safer trylock version of the same function to do a
best-effort VMA finding and use it from vmalloc_dump_obj().
[applied test robot feedback on unused function fix.]
[applied Uladzislau feedback on locking.]
Link: https://lkml.kernel.org/r/20230904180806.1002832-1-joel@joelfernandes.org
Fixes:
|
||
Zqiang
|
3f7a4e88e4 |
rcu: dump vmalloc memory info safely
commit c83ad36a18c02c0f51280b50272327807916987f upstream.
Currently, for double invoke call_rcu(), will dump rcu_head objects memory
info, if the objects is not allocated from the slab allocator, the
vmalloc_dump_obj() will be invoke and the vmap_area_lock spinlock need to
be held, since the call_rcu() can be invoked in interrupt context,
therefore, there is a possibility of spinlock deadlock scenarios.
And in Preempt-RT kernel, the rcutorture test also trigger the following
lockdep warning:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 1
3 locks held by swapper/0/1:
#0: ffffffffb534ee80 (fullstop_mutex){+.+.}-{4:4}, at: torture_init_begin+0x24/0xa0
#1: ffffffffb5307940 (rcu_read_lock){....}-{1:3}, at: rcu_torture_init+0x1ec7/0x2370
#2: ffffffffb536af40 (vmap_area_lock){+.+.}-{3:3}, at: find_vmap_area+0x1f/0x70
irq event stamp: 565512
hardirqs last enabled at (565511): [<ffffffffb379b138>] __call_rcu_common+0x218/0x940
hardirqs last disabled at (565512): [<ffffffffb5804262>] rcu_torture_init+0x20b2/0x2370
softirqs last enabled at (399112): [<ffffffffb36b2586>] __local_bh_enable_ip+0x126/0x170
softirqs last disabled at (399106): [<ffffffffb43fef59>] inet_register_protosw+0x9/0x1d0
Preemption disabled at:
[<ffffffffb58040c3>] rcu_torture_init+0x1f13/0x2370
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.5.0-rc4-rt2-yocto-preempt-rt+ #15
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x68/0xb0
dump_stack+0x14/0x20
__might_resched+0x1aa/0x280
? __pfx_rcu_torture_err_cb+0x10/0x10
rt_spin_lock+0x53/0x130
? find_vmap_area+0x1f/0x70
find_vmap_area+0x1f/0x70
vmalloc_dump_obj+0x20/0x60
mem_dump_obj+0x22/0x90
__call_rcu_common+0x5bf/0x940
? debug_smp_processor_id+0x1b/0x30
call_rcu_hurry+0x14/0x20
rcu_torture_init+0x1f82/0x2370
? __pfx_rcu_torture_leak_cb+0x10/0x10
? __pfx_rcu_torture_leak_cb+0x10/0x10
? __pfx_rcu_torture_init+0x10/0x10
do_one_initcall+0x6c/0x300
? debug_smp_processor_id+0x1b/0x30
kernel_init_freeable+0x2b9/0x540
? __pfx_kernel_init+0x10/0x10
kernel_init+0x1f/0x150
ret_from_fork+0x40/0x50
? __pfx_kernel_init+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
The previous patch fixes this by using the deadlock-safe best-effort
version of find_vm_area. However, in case of failure print the fact that
the pointer was a vmalloc pointer so that we print at least something.
Link: https://lkml.kernel.org/r/20230904180806.1002832-2-joel@joelfernandes.org
Fixes:
|
||
Abel Wu
|
0f50641222 |
net-memcg: Fix scope of sockmem pressure indicators
[ Upstream commit ac8a52962164a50e693fa021d3564d7745b83a7f ]
Now there are two indicators of socket memory pressure sit inside
struct mem_cgroup, socket_pressure and tcpmem_pressure, indicating
memory reclaim pressure in memcg->memory and ->tcpmem respectively.
When in legacy mode (cgroupv1), the socket memory is charged into
->tcpmem which is independent of ->memory, so socket_pressure has
nothing to do with socket's pressure at all. Things could be worse
by taking socket_pressure into consideration in legacy mode, as a
pressure in ->memory can lead to premature reclamation/throttling
in socket.
While for the default mode (cgroupv2), the socket memory is charged
into ->memory, and ->tcpmem/->tcpmem_pressure are simply not used.
So {socket,tcpmem}_pressure are only used in default/legacy mode
respectively for indicating socket memory pressure. This patch fixes
the pieces of code that make mixed use of both.
Fixes:
|
||
Christian Brauner
|
c13e6edbad |
tmpfs: verify {g,u}id mount options correctly
[ Upstream commit 0200679fc7953177941e41c2a4241d0b6c2c5de8 ]
A while ago we received the following report:
"The other outstanding issue I noticed comes from the fact that
fsconfig syscalls may occur in a different userns than that which
called fsopen. That means that resolving the uid/gid via
current_user_ns() can save a kuid that isn't mapped in the associated
namespace when the filesystem is finally mounted. This means that it
is possible for an unprivileged user to create files owned by any
group in a tmpfs mount (since we can set the SUID bit on the tmpfs
directory), or a tmpfs that is owned by any user, including the root
group/user."
The contract for {g,u}id mount options and {g,u}id values in general set
from userspace has always been that they are translated according to the
caller's idmapping. In so far, tmpfs has been doing the correct thing.
But since tmpfs is mountable in unprivileged contexts it is also
necessary to verify that the resulting {k,g}uid is representable in the
namespace of the superblock to avoid such bugs as above.
The new mount api's cross-namespace delegation abilities are already
widely used. After having talked to a bunch of userspace this is the
most faithful solution with minimal regression risks. I know of one
users - systemd - that makes use of the new mount api in this way and
they don't set unresolable {g,u}ids. So the regression risk is minimal.
Link: https://lore.kernel.org/lkml/CALxfFW4BXhEwxR0Q5LSkg-8Vb4r2MONKCcUCVioehXQKr35eHg@mail.gmail.com
Fixes:
|
||
Yin Fengwei
|
bd20e20c4d |
madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check
commit 0e0e9bd5f7b9d40fd03b70092367247d52da1db0 upstream. Commit |
||
Miaohe Lin
|
bdc544a87d |
mm: memory-failure: fix unexpected return value in soft_offline_page()
commit e2c1ab070fdc81010ec44634838d24fce9ff9e53 upstream.
When page_handle_poison() fails to handle the hugepage or free page in
retry path, soft_offline_page() will return 0 while -EBUSY is expected in
this case.
Consequently the user will think soft_offline_page succeeds while it in
fact failed. So the user will not try again later in this case.
Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com
Fixes:
|
||
Alexandre Ghiti
|
07fad410aa |
mm: add a call to flush_cache_vmap() in vmap_pfn()
commit a50420c79731fc5cf27ad43719c1091e842a2606 upstream. flush_cache_vmap() must be called after new vmalloc mappings are installed in the page table in order to allow architectures to make sure the new mapping is visible. It could lead to a panic since on some architectures (like powerpc), the page table walker could see the wrong pte value and trigger a spurious page fault that can not be resolved (see commit |
||
Hugh Dickins
|
d13f3a63d2 |
shmem: fix smaps BUG sleeping while atomic
commit e5548f85b4527c4c803b7eae7887c10bf8f90c97 upstream.
smaps_pte_hole_lookup() is calling shmem_partial_swap_usage() with page
table lock held: but shmem_partial_swap_usage() does cond_resched_rcu() if
need_resched(): "BUG: sleeping function called from invalid context".
Since shmem_partial_swap_usage() is designed to count across a range, but
smaps_pte_hole_lookup() only calls it for a single page slot, just break
out of the loop on the last or only page, before checking need_resched().
Link: https://lkml.kernel.org/r/6fe3b3ec-abdf-332f-5c23-6a3b3a3b11a9@google.com
Fixes:
|
||
Mike Kravetz
|
1b4ce2952b |
hugetlb: do not clear hugetlb dtor until allocating vmemmap
commit 32c877191e022b55fe3a374f3d7e9fb5741c514d upstream.
Patch series "Fix hugetlb free path race with memory errors".
In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
HWPOISON hugepages" the race window was discovered.
https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/
Freeing a hugetlb page back to low level memory allocators is performed
in two steps.
1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
2) Outside lock, allocate vmemmap if necessary and call low level free
Between these two steps, the hugetlb page will appear as a normal
compound page. However, vmemmap for tail pages could be missing.
If a memory error occurs at this time, we could try to update page
flags non-existant page structs.
A much more detailed description is in the first patch.
The first patch addresses the race window. However, it adds a
hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb page
free operation. This could lead to slowdowns if one is freeing a large
number of hugetlb pages.
The second path optimizes the update_and_free_pages_bulk routine to only
take the lock once in bulk operations.
The second patch is technically not a bug fix, but includes a Fixes tag
and Cc stable to avoid a performance regression. It can be combined with
the first, but was done separately make reviewing easier.
This patch (of 2):
Freeing a hugetlb page and releasing base pages back to the underlying
allocator such as buddy or cma is performed in two steps:
- remove_hugetlb_folio() is called to remove the folio from hugetlb
lists, get a ref on the page and remove hugetlb destructor. This
all must be done under the hugetlb lock. After this call, the page
can be treated as a normal compound page or a collection of base
size pages.
- update_and_free_hugetlb_folio() is called to allocate vmemmap if
needed and the free routine of the underlying allocator is called
on the resulting page. We can not hold the hugetlb lock here.
One issue with this scheme is that a memory error could occur between
these two steps. In this case, the memory error handling code treats
the old hugetlb page as a normal compound page or collection of base
pages. It will then try to SetPageHWPoison(page) on the page with an
error. If the page with error is a tail page without vmemmap, a write
error will occur when trying to set the flag.
Address this issue by modifying remove_hugetlb_folio() and
update_and_free_hugetlb_folio() such that the hugetlb destructor is not
cleared until after allocating vmemmap. Since clearing the destructor
requires holding the hugetlb lock, the clearing is done in
remove_hugetlb_folio() if the vmemmap is present. This saves a
lock/unlock cycle. Otherwise, destructor is cleared in
update_and_free_hugetlb_folio() after allocating vmemmap.
Note that this will leave hugetlb pages in a state where they are marked
free (by hugetlb specific page flag) and have a ref count. This is not
a normal state. The only code that would notice is the memory error
code, and it is set up to retry in such a case.
A subsequent patch will create a routine to do bulk processing of
vmemmap allocation. This will eliminate a lock/unlock cycle for each
hugetlb page in the case where we are freeing a large number of pages.
Link: https://lkml.kernel.org/r/20230711220942.43706-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20230711220942.43706-2-mike.kravetz@oracle.com
Fixes:
|
||
Sergey Senozhatsky
|
5274bf1f74 |
zsmalloc: allow only one active pool compaction context
commit d2658f2052c7db6ec0a79977205f8cf1cb9effc2 upstream. zsmalloc pool can be compacted concurrently by many contexts, e.g. cc1 handle_mm_fault() do_anonymous_page() __alloc_pages_slowpath() try_to_free_pages() do_try_to_free_pages( lru_gen_shrink_node() shrink_slab() do_shrink_slab() zs_shrinker_scan() zs_compact() Pool compaction is currently (basically) single-threaded as it is performed under pool->lock. Having multiple compaction threads results in unnecessary contention, as each thread competes for pool->lock. This, in turn, affects all zsmalloc operations such as zs_malloc(), zs_map_object(), zs_free(), etc. Introduce the pool->compaction_in_progress atomic variable, which ensures that only one compaction context can run at a time. This reduces overall pool->lock contention in (corner) cases when many contexts attempt to shrink zspool simultaneously. Link: https://lkml.kernel.org/r/20230418074639.1903197-1-senozhatsky@chromium.org Fixes: c0547d0b6a4b ("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Andrew Yang
|
f872672edd |
zsmalloc: fix races between modifications of fullness and isolated
[ Upstream commit 4b5d1e47b69426c0f7491d97d73ad0152d02d437 ]
We encountered many kernel exceptions of VM_BUG_ON(zspage->isolated ==
0) in dec_zspage_isolation() and BUG_ON(!pages[1]) in zs_unmap_object()
lately. This issue only occurs when migration and reclamation occur at
the same time.
With our memory stress test, we can reproduce this issue several times
a day. We have no idea why no one else encountered this issue. BTW,
we switched to the new kernel version with this defect a few months
ago.
Since fullness and isolated share the same unsigned int, modifications of
them should be protected by the same lock.
[andrew.yang@mediatek.com: move comment]
Link: https://lkml.kernel.org/r/20230727062910.6337-1-andrew.yang@mediatek.com
Link: https://lkml.kernel.org/r/20230721063705.11455-1-andrew.yang@mediatek.com
Fixes:
|
||
Nhat Pham
|
802b34e992 |
zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks
[ Upstream commit c0547d0b6a4b637db05406b90ba82e1b2e71de56 ] Currently, zsmalloc has a hierarchy of locks, which includes a pool-level migrate_lock, and a lock for each size class. We have to obtain both locks in the hotpath in most cases anyway, except for zs_malloc. This exception will no longer exist when we introduce a LRU into the zs_pool for the new writeback functionality - we will need to obtain a pool-level lock to synchronize LRU handling even in zs_malloc. In preparation for zsmalloc writeback, consolidate these locks into a single pool-level lock, which drastically reduces the complexity of synchronization in zsmalloc. We have also benchmarked the lock consolidation to see the performance effect of this change on zram. First, we ran a synthetic FS workload on a server machine with 36 cores (same machine for all runs), using fs_mark -d ../zram1mnt -s 100000 -n 2500 -t 32 -k before and after for btrfs and ext4 on zram (FS usage is 80%). Here is the result (unit is file/second): With lock consolidation (btrfs): Average: 13520.2, Median: 13531.0, Stddev: 137.5961482019028 Without lock consolidation (btrfs): Average: 13487.2, Median: 13575.0, Stddev: 309.08283679298665 With lock consolidation (ext4): Average: 16824.4, Median: 16839.0, Stddev: 89.97388510006668 Without lock consolidation (ext4) Average: 16958.0, Median: 16986.0, Stddev: 194.7370021336469 As you can see, we observe a 0.3% regression for btrfs, and a 0.9% regression for ext4. This is a small, barely measurable difference in my opinion. For a more realistic scenario, we also tries building the kernel on zram. Here is the time it takes (in seconds): With lock consolidation (btrfs): real Average: 319.6, Median: 320.0, Stddev: 0.8944271909999159 user Average: 6894.2, Median: 6895.0, Stddev: 25.528415540334656 sys Average: 521.4, Median: 522.0, Stddev: 1.51657508881031 Without lock consolidation (btrfs): real Average: 319.8, Median: 320.0, Stddev: 0.8366600265340756 user Average: 6896.6, Median: 6899.0, Stddev: 16.04057355583023 sys Average: 520.6, Median: 521.0, Stddev: 1.140175425099138 With lock consolidation (ext4): real Average: 320.0, Median: 319.0, Stddev: 1.4142135623730951 user Average: 6896.8, Median: 6878.0, Stddev: 28.621670111997307 sys Average: 521.2, Median: 521.0, Stddev: 1.7888543819998317 Without lock consolidation (ext4) real Average: 319.6, Median: 319.0, Stddev: 0.8944271909999159 user Average: 6886.2, Median: 6887.0, Stddev: 16.93221781102523 sys Average: 520.4, Median: 520.0, Stddev: 1.140175425099138 The difference is entirely within the noise of a typical run on zram. This hardly justifies the complexity of maintaining both the pool lock and the class lock. In fact, for writeback, we would need to introduce yet another lock to prevent data races on the pool's LRU, further complicating the lock handling logic. IMHO, it is just better to collapse all of these into a single pool-level lock. Link: https://lkml.kernel.org/r/20221128191616.1261026-4-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 4b5d1e47b694 ("zsmalloc: fix races between modifications of fullness and isolated") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Roman Gushchin
|
33d9490b27 |
mm: kmem: fix a NULL pointer dereference in obj_stock_flush_required()
commit 3b8abb3239530c423c0b97e42af7f7e856e1ee96 upstream.
KCSAN found an issue in obj_stock_flush_required():
stock->cached_objcg can be reset between the check and dereference:
==================================================================
BUG: KCSAN: data-race in drain_all_stock / drain_obj_stock
write to 0xffff888237c2a2f8 of 8 bytes by task 19625 on cpu 0:
drain_obj_stock+0x408/0x4e0 mm/memcontrol.c:3306
refill_obj_stock+0x9c/0x1e0 mm/memcontrol.c:3340
obj_cgroup_uncharge+0xe/0x10 mm/memcontrol.c:3408
memcg_slab_free_hook mm/slab.h:587 [inline]
__cache_free mm/slab.c:3373 [inline]
__do_kmem_cache_free mm/slab.c:3577 [inline]
kmem_cache_free+0x105/0x280 mm/slab.c:3602
__d_free fs/dcache.c:298 [inline]
dentry_free fs/dcache.c:375 [inline]
__dentry_kill+0x422/0x4a0 fs/dcache.c:621
dentry_kill+0x8d/0x1e0
dput+0x118/0x1f0 fs/dcache.c:913
__fput+0x3bf/0x570 fs/file_table.c:329
____fput+0x15/0x20 fs/file_table.c:349
task_work_run+0x123/0x160 kernel/task_work.c:179
resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
exit_to_user_mode_loop+0xcf/0xe0 kernel/entry/common.c:171
exit_to_user_mode_prepare+0x6a/0xa0 kernel/entry/common.c:203
__syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:296
do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff888237c2a2f8 of 8 bytes by task 19632 on cpu 1:
obj_stock_flush_required mm/memcontrol.c:3319 [inline]
drain_all_stock+0x174/0x2a0 mm/memcontrol.c:2361
try_charge_memcg+0x6d0/0xd10 mm/memcontrol.c:2703
try_charge mm/memcontrol.c:2837 [inline]
mem_cgroup_charge_skmem+0x51/0x140 mm/memcontrol.c:7290
sock_reserve_memory+0xb1/0x390 net/core/sock.c:1025
sk_setsockopt+0x800/0x1e70 net/core/sock.c:1525
udp_lib_setsockopt+0x99/0x6c0 net/ipv4/udp.c:2692
udp_setsockopt+0x73/0xa0 net/ipv4/udp.c:2817
sock_common_setsockopt+0x61/0x70 net/core/sock.c:3668
__sys_setsockopt+0x1c3/0x230 net/socket.c:2271
__do_sys_setsockopt net/socket.c:2282 [inline]
__se_sys_setsockopt net/socket.c:2279 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2279
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0xffff8881382d52c0 -> 0xffff888138893740
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 19632 Comm: syz-executor.0 Not tainted 6.3.0-rc2-syzkaller-00387-g534293368afa #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
Fix it by using READ_ONCE()/WRITE_ONCE() for all accesses to
stock->cached_objcg.
Link: https://lkml.kernel.org/r/20230502160839.361544-1-roman.gushchin@linux.dev
Fixes:
|
||
Arnd Bergmann
|
a4336343ea |
kasan: add kasan_tag_mismatch prototype
commit fb646a4cd3f0ff27d19911bef7b6622263723df6 upstream. The kasan sw-tags implementation contains one function that is only called from assembler and has no prototype in a header. This causes a W=1 warning: mm/kasan/sw_tags.c:171:6: warning: no previous prototype for 'kasan_tag_mismatch' [-Wmissing-prototypes] 171 | void kasan_tag_mismatch(unsigned long addr, unsigned long access_info, Add a prototype in the local header to get a clean build. Link: https://lkml.kernel.org/r/20230509145735.9263-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Liam R. Howlett
|
a02c6dc0ef |
mm/mmap: Fix extra maple tree write
based on commit 0503ea8f5ba73eb3ab13a81c1eefbaf51405385a upstream. This was inadvertently fixed during the removal of __vma_adjust(). When __vma_adjust() is adjusting next with a negative value (pushing vma->vm_end lower), there would be two writes to the maple tree. The first write is unnecessary and uses all allocated nodes in the maple state. The second write is necessary but will need to allocate nodes since the first write has used the allocated nodes. This may be a problem as it may not be safe to allocate at this time, such as a low memory situation. Fix the issue by avoiding the first write and only write the adjusted "next" VMA. Reported-by: John Hsu <John.Hsu@mediatek.com> Link: https://lore.kernel.org/lkml/9cb8c599b1d7f9c1c300d1a334d5eb70ec4d7357.camel@mediatek.com/ Cc: stable@vger.kernel.org Cc: linux-mm@kvack.org Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Roberto Sassu
|
1f34bf8b44 |
shmem: use ramfs_kill_sb() for kill_sb method of ramfs-based tmpfs
commit 36ce9d76b0a93bae799e27e4f5ac35478c676592 upstream.
As the ramfs-based tmpfs uses ramfs_init_fs_context() for the
init_fs_context method, which allocates fc->s_fs_info, use ramfs_kill_sb()
to free it and avoid a memory leak.
Link: https://lkml.kernel.org/r/20230607161523.2876433-1-roberto.sassu@huaweicloud.com
Fixes:
|
||
Ryan Roberts
|
23fbff67b0 |
mm/damon/ops-common: atomically test and clear young on ptes and pmds
commit c11d34fa139e4b0fb4249a30f37b178353533fa1 upstream.
It is racy to non-atomically read a pte, then clear the young bit, then
write it back as this could discard dirty information. Further, it is bad
practice to directly set a pte entry within a table. Instead clearing
young must go through the arch-provided helper,
ptep_test_and_clear_young() to ensure it is modified atomically and to
give the arch code visibility and allow it to check (and potentially
modify) the operation.
Link: https://lkml.kernel.org/r/20230602092949.545577-3-ryan.roberts@arm.com
Fixes:
|
||
Suren Baghdasaryan
|
e0d7a96b27 |
mm/mmap: Fix VM_LOCKED check in do_vmi_align_munmap()
6.1 backport of the patch [1] uses 'next' vma instead of 'split' vma.
Fix the mistake.
[1] commit 606c812eb1d5 ("mm/mmap: Fix error path in do_vmi_align_munmap()")
Fixes:
|
||
Peter Collingbourne
|
50fb32197f |
mm: call arch_swap_restore() from do_swap_page()
commit 6dca4ac6fc91fd41ea4d6c4511838d37f4e0eab2 upstream. Commit |
||
Max Filippov
|
6b2849b3e0 |
xtensa: fix lock_mm_and_find_vma in case VMA not found
commit 03f889378f33aa9a9d8e5f49ba94134cf6158090 upstream. MMU version of lock_mm_and_find_vma releases the mm lock before returning when VMA is not found. Do the same in noMMU version. This fixes hang on an attempt to handle protection fault. Fixes: d85a143b69ab ("xtensa: fix NOMMU build with lock_mm_and_find_vma() conversion") Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Linus Torvalds
|
323846590c |
xtensa: fix NOMMU build with lock_mm_and_find_vma() conversion
commit d85a143b69abb4d7544227e26d12c4c7735ab27d upstream. It turns out that xtensa has a really odd configuration situation: you can do a no-MMU config, but still have the page fault code enabled. Which doesn't sound all that sensible, but it turns out that xtensa can have protection faults even without the MMU, and we have this: config PFAULT bool "Handle protection faults" if EXPERT && !MMU default y help Handle protection faults. MMU configurations must enable it. noMMU configurations may disable it if used memory map never generates protection faults or faults are always fatal. If unsure, say Y. which completely violated my expectations of the page fault handling. End result: Guenter reports that the xtensa no-MMU builds all fail with arch/xtensa/mm/fault.c: In function ‘do_page_fault’: arch/xtensa/mm/fault.c:133:8: error: implicit declaration of function ‘lock_mm_and_find_vma’ because I never exposed the new lock_mm_and_find_vma() function for the no-MMU case. Doing so is simple enough, and fixes the problem. Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net> Fixes: a050ba1e7422 ("mm/fault: convert remaining simple cases to lock_mm_and_find_vma()") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Linus Torvalds
|
e6bbad7571 |
mm: always expand the stack with the mmap write lock held
commit 8d7071af890768438c14db6172cc8f9f4d04e184 upstream This finishes the job of always holding the mmap write lock when extending the user stack vma, and removes the 'write_locked' argument from the vm helper functions again. For some cases, we just avoid expanding the stack at all: drivers and page pinning really shouldn't be extending any stacks. Let's see if any strange users really wanted that. It's worth noting that architectures that weren't converted to the new lock_mm_and_find_vma() helper function are left using the legacy "expand_stack()" function, but it has been changed to drop the mmap_lock and take it for writing while expanding the vma. This makes it fairly straightforward to convert the remaining architectures. As a result of dropping and re-taking the lock, the calling conventions for this function have also changed, since the old vma may no longer be valid. So it will now return the new vma if successful, and NULL - and the lock dropped - if the area could not be extended. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [6.1: Patch drivers/iommu/io-pgfault.c instead] Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Liam R. Howlett
|
6a6b5616c3 |
mm: make find_extend_vma() fail if write lock not held
commit f440fa1ac955e2898893f9301568435eb5cdfc4b upstream. Make calls to extend_vma() and find_extend_vma() fail if the write lock is required. To avoid making this a flag-day event, this still allows the old read-locking case for the trivial situations, and passes in a flag to say "is it write-locked". That way write-lockers can say "yes, I'm being careful", and legacy users will continue to work in all the common cases until they have been fully converted to the new world order. Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Ben Hutchings
|
1f4197f050 |
arm/mm: Convert to using lock_mm_and_find_vma()
commit 8b35ca3e45e35a26a21427f35d4093606e93ad0a upstream. arm has an additional check for address < FIRST_USER_ADDRESS before expanding the stack. Since FIRST_USER_ADDRESS is defined everywhere (generally as 0), move that check to the generic expand_downwards(). Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Linus Torvalds
|
755aa1bc6a |
mm: make the page fault mmap locking killable
commit eda0047296a16d65a7f2bc60a408f70d178b2014 upstream. This is done as a separate patch from introducing the new lock_mm_and_find_vma() helper, because while it's an obvious change, it's not what x86 used to do in this area. We already abort the page fault on fatal signals anyway, so why should we wait for the mmap lock only to then abort later? With the new helper function that returns without the lock held on failure anyway, this is particularly easy and straightforward. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Linus Torvalds
|
d6a5c7a1a6 |
mm: introduce new 'lock_mm_and_find_vma()' page fault helper
commit c2508ec5a58db67093f4fb8bf89a9a7c53a109e9 upstream. .. and make x86 use it. This basically extracts the existing x86 "find and expand faulting vma" code, but extends it to also take the mmap lock for writing in case we actually do need to expand the vma. We've historically short-circuited that case, and have some rather ugly special logic to serialize the stack segment expansion (since we only hold the mmap lock for reading) that doesn't match the normal VM locking. That slight violation of locking worked well, right up until it didn't: the maple tree code really does want proper locking even for simple extension of an existing vma. So extract the code for "look up the vma of the fault" from x86, fix it up to do the necessary write locking, and make it available as a helper function for other architectures that can use the common helper. Note: I say "common helper", but it really only handles the normal stack-grows-down case. Which is all architectures except for PA-RISC and IA64. So some rare architectures can't use the helper, but if they care they'll just need to open-code this logic. It's also worth pointing out that this code really would like to have an optimistic "mmap_upgrade_trylock()" to make it quicker to go from a read-lock (for the common case) to taking the write lock (for having to extend the vma) in the normal single-threaded situation where there is no other locking activity. But that _is_ all the very uncommon special case, so while it would be nice to have such an operation, it probably doesn't matter in reality. I did put in the skeleton code for such a possible future expansion, even if it only acts as pseudo-documentation for what we're doing. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [6.1: Ignore CONFIG_PER_VMA_LOCK context] Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Tony Luck
|
84f077802e |
mm, hwpoison: when copy-on-write hits poison, take page offline
commit d302c2398ba269e788a4f37ae57c07a7fcabaa42 upstream. Cannot call memory_failure() directly from the fault handler because mmap_lock (and others) are held. It is important, but not urgent, to mark the source page as h/w poisoned and unmap it from other tasks. Use memory_failure_queue() to request a call to memory_failure() for the page with the error. Also provide a stub version for CONFIG_MEMORY_FAILURE=n Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Shuai Xue <xueshuai@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Due to missing commits e591ef7d96d6e ("mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage") 5033091de814a ("mm/hwpoison: introduce per-memory_block hwpoison counter") The impact of e591ef7d96d6e is its introduction of an additional flag in __get_huge_page_for_hwpoison() that serves as an indication a hwpoisoned hugetlb page should have its migratable bit cleared. The impact of 5033091de814a is contexual. Resolve by ignoring both missing commits. - jane] Signed-off-by: Jane Chu <jane.chu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Tony Luck
|
4af5960d7c |
mm, hwpoison: try to recover from copy-on write faults
commit a873dfe1032a132bf89f9e19a6ac44f5a0b78754 upstream. Patch series "Copy-on-write poison recovery", v3. Part 1 deals with the process that triggered the copy on write fault with a store to a shared read-only page. That process is send a SIGBUS with the usual machine check decoration to specify the virtual address of the lost page, together with the scope. Part 2 sets up to asynchronously take the page with the uncorrected error offline to prevent additional machine check faults. H/t to Miaohe Lin <linmiaohe@huawei.com> and Shuai Xue <xueshuai@linux.alibaba.com> for pointing me to the existing function to queue a call to memory_failure(). On x86 there is some duplicate reporting (because the error is also signalled by the memory controller as well as by the core that triggered the machine check). Console logs look like this: This patch (of 2): If the kernel is copying a page as the result of a copy-on-write fault and runs into an uncorrectable error, Linux will crash because it does not have recovery code for this case where poison is consumed by the kernel. It is easy to set up a test case. Just inject an error into a private page, fork(2), and have the child process write to the page. I wrapped that neatly into a test at: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git just enable ACPI error injection and run: # ./einj_mem-uc -f copy-on-write Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() on architectures where that is available (currently x86 and powerpc). When an error is detected during the page copy, return VM_FAULT_HWPOISON to caller of wp_page_copy(). This propagates up the call stack. Both x86 and powerpc have code in their fault handler to deal with this code by sending a SIGBUS to the application. Note that this patch avoids a system crash and signals the process that triggered the copy-on-write action. It does not take any action for the memory error that is still in the shared page. To handle that a call to memory_failure() is needed. But this cannot be done from wp_page_copy() because it holds mmap_lock(). Perhaps the architecture fault handlers can deal with this loose end in a subsequent patch? On Intel/x86 this loose end will often be handled automatically because the memory controller provides an additional notification of the h/w poison in memory, the handler for this will call memory_failure(). This isn't a 100% solution. If there are multiple errors, not all may be logged in this way. [tony.luck@intel.com: add call to kmsan_unpoison_memory(), per Miaohe Lin] Link: https://lkml.kernel.org/r/20221031201029.102123-2-tony.luck@intel.com Link: https://lkml.kernel.org/r/20221021200120.175753-1-tony.luck@intel.com Link: https://lkml.kernel.org/r/20221021200120.175753-2-tony.luck@intel.com Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Igned-off-by: Jane Chu <jane.chu@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
David Woodhouse
|
42a018a796 |
mm/mmap: Fix error return in do_vmi_align_munmap()
commit 6c26bd4384da24841bac4f067741bbca18b0fb74 upstream, If mas_store_gfp() in the gather loop failed, the 'error' variable that ultimately gets returned was not being set. In many cases, its original value of -ENOMEM was still in place, and that was fine. But if VMAs had been split at the start or end of the range, then 'error' could be zero. Change to the 'error = foo(); if (error) goto â¦' idiom to fix the bug. Also clean up a later case which avoided the same bug by *explicitly* setting error = -ENOMEM right before calling the function that might return -ENOMEM. In a final cosmetic change, move the 'Point of no return' comment to *after* the goto. That's been in the wrong place since the preallocation was removed, and this new error path was added. Fixes: 606c812eb1d5 ("mm/mmap: Fix error path in do_vmi_align_munmap()") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Cc: stable@vger.kernel.org Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Liam R. Howlett
|
a149174ff8 |
mm/mmap: Fix error path in do_vmi_align_munmap()
commit 606c812eb1d5b5fb0dd9e330ca94b52d7c227830 upstream
The error unrolling was leaving the VMAs detached in many cases and
leaving the locked_vm statistic altered, and skipping the unrolling
entirely in the case of the vma tree write failing.
Fix the error path by re-attaching the detached VMAs and adding the
necessary goto for the failed vma tree write, and fix the locked_vm
statistic by only updating after the vma tree write succeeds.
Fixes:
|
||
Roberto Sassu
|
1a2793a25a |
memfd: check for non-NULL file_seals in memfd_create() syscall
[ Upstream commit 935d44acf621aa0688fef8312dec3e5940f38f4e ]
Ensure that file_seals is non-NULL before using it in the memfd_create()
syscall. One situation in which memfd_file_seals_ptr() could return a
NULL pointer when CONFIG_SHMEM=n, oopsing the kernel.
Link: https://lkml.kernel.org/r/20230607132427.2867435-1-roberto.sassu@huaweicloud.com
Fixes:
|
||
Alexei Starovoitov
|
2e7ad879e1 |
mm: Fix copy_from_user_nofault().
commit d319f344561de23e810515d109c7278919bff7b0 upstream. There are several issues with copy_from_user_nofault(): - access_ok() is designed for user context only and for that reason it has WARN_ON_IN_IRQ() which triggers when bpf, kprobe, eprobe and perf on ppc are calling it from irq. - it's missing nmi_uaccess_okay() which is a nop on all architectures except x86 where it's required. The comment in arch/x86/mm/tlb.c explains the details why it's necessary. Calling copy_from_user_nofault() from bpf, [ke]probe without this check is not safe. - __copy_from_user_inatomic() under CONFIG_HARDENED_USERCOPY is calling check_object_size()->__check_object_size()->check_heap_object()->find_vmap_area()->spin_lock() which is not safe to do from bpf, [ke]probe and perf due to potential deadlock. Fix all three issues. At the end the copy_from_user_nofault() becomes equivalent to copy_from_user_nmi() from safety point of view with a difference in the return value. Reported-by: Hsin-Wei Hung <hsinweih@uci.edu> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Florian Lehner <dev@der-flo.net> Tested-by: Hsin-Wei Hung <hsinweih@uci.edu> Tested-by: Florian Lehner <dev@der-flo.net> Link: https://lore.kernel.org/r/20230410174345.4376-2-dev@der-flo.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Javier Honduvilla Coto <javierhonduco@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Nhat Pham
|
447f325497 |
zswap: do not shrink if cgroup may not zswap
commit 0bdf0efa180a9cb1361cbded4e2260a49306ac89 upstream.
Before storing a page, zswap first checks if the number of stored pages
exceeds the limit specified by memory.zswap.max, for each cgroup in the
hierarchy. If this limit is reached or exceeded, then zswap shrinking is
triggered and short-circuits the store attempt.
However, since the zswap's LRU is not memcg-aware, this can create the
following pathological behavior: the cgroup whose zswap limit is 0 will
evict pages from other cgroups continually, without lowering its own zswap
usage. This means the shrinking will continue until the need for swap
ceases or the pool becomes empty.
As a result of this, we observe a disproportionate amount of zswap
writeback and a perpetually small zswap pool in our experiments, even
though the pool limit is never hit.
More generally, a cgroup might unnecessarily evict pages from other
cgroups before we drive the memcg back below its limit.
This patch fixes the issue by rejecting zswap store attempt without
shrinking the pool when obj_cgroup_may_zswap() returns false.
[akpm@linux-foundation.org: fix return of unintialized value]
[akpm@linux-foundation.org: s/ENOSPC/ENOMEM/]
Link: https://lkml.kernel.org/r/20230530222440.2777700-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230530232435.3097106-1-nphamcs@gmail.com
Fixes:
|
||
Ruihan Li
|
df9bc25d13 |
mm: page_table_check: Ensure user pages are not slab pages
commit 44d0fb387b53e56c8a050bac5c7d460e21eb226f upstream.
The current uses of PageAnon in page table check functions can lead to
type confusion bugs between struct page and slab [1], if slab pages are
accidentally mapped into the user space. This is because slab reuses the
bits in struct page to store its internal states, which renders PageAnon
ineffective on slab pages.
Since slab pages are not expected to be mapped into the user space, this
patch adds BUG_ON(PageSlab(page)) checks to make sure that slab pages
are not inadvertently mapped. Otherwise, there must be some bugs in the
kernel.
Reported-by: syzbot+fcf1a817ceb50935ce99@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/000000000000258e5e05fae79fc1@google.com/ [1]
Fixes:
|
||
Ruihan Li
|
08378f0314 |
mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM
commit 81a31a860bb61d54eb688af2568d9332ed9b8942 upstream. Without EXCLUSIVE_SYSTEM_RAM, users are allowed to map arbitrary physical memory regions into the userspace via /dev/mem. At the same time, pages may change their properties (e.g., from anonymous pages to named pages) while they are still being mapped in the userspace, leading to "corruption" detected by the page table check. To avoid these false positives, this patch makes PAGE_TABLE_CHECK depends on EXCLUSIVE_SYSTEM_RAM. This dependency is understandable because PAGE_TABLE_CHECK is a hardening technique but /dev/mem without STRICT_DEVMEM (i.e., !EXCLUSIVE_SYSTEM_RAM) is itself a security problem. Even with EXCLUSIVE_SYSTEM_RAM, I/O pages may be still allowed to be mapped via /dev/mem. However, these pages are always considered as named pages, so they won't break the logic used in the page table check. Cc: <stable@vger.kernel.org> # 5.17 Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/r/20230515130958.32471-4-lrh2000@pku.edu.cn Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Domenico Cerasuolo
|
2cab13f500 |
mm: fix zswap writeback race condition
commit 04fc7816089c5a32c29a04ec94b998e219dfb946 upstream. The zswap writeback mechanism can cause a race condition resulting in memory corruption, where a swapped out page gets swapped in with data that was written to a different page. The race unfolds like this: 1. a page with data A and swap offset X is stored in zswap 2. page A is removed off the LRU by zpool driver for writeback in zswap-shrink work, data for A is mapped by zpool driver 3. user space program faults and invalidates page entry A, offset X is considered free 4. kswapd stores page B at offset X in zswap (zswap could also be full, if so, page B would then be IOed to X, then skip step 5.) 5. entry A is replaced by B in tree->rbroot, this doesn't affect the local reference held by zswap-shrink work 6. zswap-shrink work writes back A at X, and frees zswap entry A 7. swapin of slot X brings A in memory instead of B The fix: Once the swap page cache has been allocated (case ZSWAP_SWAPCACHE_NEW), zswap-shrink work just checks that the local zswap_entry reference is still the same as the one in the tree. If it's not the same it means that it's either been invalidated or replaced, in both cases the writeback is aborted because the local entry contains stale data. Reproducer: I originally found this by running `stress` overnight to validate my work on the zswap writeback mechanism, it manifested after hours on my test machine. The key to make it happen is having zswap writebacks, so whatever setup pumps /sys/kernel/debug/zswap/written_back_pages should do the trick. In order to reproduce this faster on a vm, I setup a system with ~100M of available memory and a 500M swap file, then running `stress --vm 1 --vm-bytes 300000000 --vm-stride 4000` makes it happen in matter of tens of minutes. One can speed things up even more by swinging /sys/module/zswap/parameters/max_pool_percent up and down between, say, 20 and 1; this makes it reproduce in tens of seconds. It's crucial to set `--vm-stride` to something other than 4096 otherwise `stress` won't realize that memory has been corrupted because all pages would have the same data. Link: https://lkml.kernel.org/r/20230503151200.19707-1-cerasuolodomenico@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Chris Li (Google) <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Lorenzo Stoakes
|
6b5b755463 |
mm/mempolicy: correctly update prev when policy is equal on mbind
commit 00ca0f2e86bf40b016a646e6323a8941a09cf106 upstream. The refactoring in commit f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator") introduces a subtle bug which arises when attempting to apply a new NUMA policy across a range of VMAs in mbind_range(). The refactoring passes a **prev pointer to keep track of the previous VMA in order to reduce duplication, and in all but one case it keeps this correctly updated. The bug arises when a VMA within the specified range has an equivalent policy as determined by mpol_equal() - which unlike other cases, does not update prev. This can result in a situation where, later in the iteration, a VMA is found whose policy does need to change. At this point, vma_merge() is invoked with prev pointing to a VMA which is before the previous VMA. Since vma_merge() discovers the curr VMA by looking for the one immediately after prev, it will now be in a situation where this VMA is incorrect and the merge will not proceed correctly. This is checked in the VM_WARN_ON() invariant case with end > curr->vm_end, which, if a merge is possible, results in a warning (if CONFIG_DEBUG_VM is specified). I note that vma_merge() performs these invariant checks only after merge_prev/merge_next are checked, which is debatable as it hides this issue if no merge is possible even though a buggy situation has arisen. The solution is simply to update the prev pointer even when policies are equal. This caused a bug to arise in the 6.2.y stable tree, and this patch resolves this bug. Link: https://lkml.kernel.org/r/83f1d612acb519d777bebf7f3359317c4e7f4265.1682866629.git.lstoakes@gmail.com Fixes: f4e9e0e69468 ("mm/mempolicy: fix use-after-free of VMA iterator") Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202304292203.44ddeff6-oliver.sang@intel.com Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Mark Rutland
|
da4c747730 |
kasan: hw_tags: avoid invalid virt_to_page()
commit 29083fd84da576bfb3563d044f98d38e6b338f00 upstream.
When booting with 'kasan.vmalloc=off', a kernel configured with support
for KASAN_HW_TAGS will explode at boot time due to bogus use of
virt_to_page() on a vmalloc adddress. With CONFIG_DEBUG_VIRTUAL selected
this will be reported explicitly, and with or without CONFIG_DEBUG_VIRTUAL
the kernel will dereference a bogus address:
| ------------[ cut here ]------------
| virt_to_phys used for non-linear address: (____ptrval____) (0xffff800008000000)
| WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x78/0x80
| Modules linked in:
| CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.3.0-rc3-00073-g83865133300d-dirty #4
| Hardware name: linux,dummy-virt (DT)
| pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : __virt_to_phys+0x78/0x80
| lr : __virt_to_phys+0x78/0x80
| sp : ffffcd076afd3c80
| x29: ffffcd076afd3c80 x28: 0068000000000f07 x27: ffff800008000000
| x26: fffffbfff0000000 x25: fffffbffff000000 x24: ff00000000000000
| x23: ffffcd076ad3c000 x22: fffffc0000000000 x21: ffff800008000000
| x20: ffff800008004000 x19: ffff800008000000 x18: ffff800008004000
| x17: 666678302820295f x16: ffffffffffffffff x15: 0000000000000004
| x14: ffffcd076b009e88 x13: 0000000000000fff x12: 0000000000000003
| x11: 00000000ffffefff x10: c0000000ffffefff x9 : 0000000000000000
| x8 : 0000000000000000 x7 : 205d303030303030 x6 : 302e30202020205b
| x5 : ffffcd076b41d63f x4 : ffffcd076afd3827 x3 : 0000000000000000
| x2 : 0000000000000000 x1 : ffffcd076afd3a30 x0 : 000000000000004f
| Call trace:
| __virt_to_phys+0x78/0x80
| __kasan_unpoison_vmalloc+0xd4/0x478
| __vmalloc_node_range+0x77c/0x7b8
| __vmalloc_node+0x54/0x64
| init_IRQ+0x94/0xc8
| start_kernel+0x194/0x420
| __primary_switched+0xbc/0xc4
| ---[ end trace 0000000000000000 ]---
| Unable to handle kernel paging request at virtual address 03fffacbe27b8000
| Mem abort info:
| ESR = 0x0000000096000004
| EC = 0x25: DABT (current EL), IL = 32 bits
| SET = 0, FnV = 0
| EA = 0, S1PTW = 0
| FSC = 0x04: level 0 translation fault
| Data abort info:
| ISV = 0, ISS = 0x00000004
| CM = 0, WnR = 0
| swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000041bc5000
| [03fffacbe27b8000] pgd=0000000000000000, p4d=0000000000000000
| Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 6.3.0-rc3-00073-g83865133300d-dirty #4
| Hardware name: linux,dummy-virt (DT)
| pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : __kasan_unpoison_vmalloc+0xe4/0x478
| lr : __kasan_unpoison_vmalloc+0xd4/0x478
| sp : ffffcd076afd3ca0
| x29: ffffcd076afd3ca0 x28: 0068000000000f07 x27: ffff800008000000
| x26: 0000000000000000 x25: 03fffacbe27b8000 x24: ff00000000000000
| x23: ffffcd076ad3c000 x22: fffffc0000000000 x21: ffff800008000000
| x20: ffff800008004000 x19: ffff800008000000 x18: ffff800008004000
| x17: 666678302820295f x16: ffffffffffffffff x15: 0000000000000004
| x14: ffffcd076b009e88 x13: 0000000000000fff x12: 0000000000000001
| x11: 0000800008000000 x10: ffff800008000000 x9 : ffffb2f8dee00000
| x8 : 000ffffb2f8dee00 x7 : 205d303030303030 x6 : 302e30202020205b
| x5 : ffffcd076b41d63f x4 : ffffcd076afd3827 x3 : 0000000000000000
| x2 : 0000000000000000 x1 : ffffcd076afd3a30 x0 : ffffb2f8dee00000
| Call trace:
| __kasan_unpoison_vmalloc+0xe4/0x478
| __vmalloc_node_range+0x77c/0x7b8
| __vmalloc_node+0x54/0x64
| init_IRQ+0x94/0xc8
| start_kernel+0x194/0x420
| __primary_switched+0xbc/0xc4
| Code: d34cfc08 aa1f03fa 8b081b39 d503201f (f9400328)
| ---[ end trace 0000000000000000 ]---
| Kernel panic - not syncing: Attempted to kill the idle task!
This is because init_vmalloc_pages() erroneously calls virt_to_page() on
a vmalloc address, while virt_to_page() is only valid for addresses in
the linear/direct map. Since init_vmalloc_pages() expects virtual
addresses in the vmalloc range, it must use vmalloc_to_page() rather
than virt_to_page().
We call init_vmalloc_pages() from __kasan_unpoison_vmalloc(), where we
check !is_vmalloc_or_module_addr(), suggesting that we might encounter a
non-vmalloc address. Luckily, this never happens. By design, we only
call __kasan_unpoison_vmalloc() on pointers in the vmalloc area, and I
have verified that we don't violate that expectation. Given that,
is_vmalloc_or_module_addr() must always be true for any legitimate
argument to __kasan_unpoison_vmalloc().
Correct init_vmalloc_pages() to use vmalloc_to_page(), and remove the
redundant and misleading use of is_vmalloc_or_module_addr() in
__kasan_unpoison_vmalloc().
Link: https://lkml.kernel.org/r/20230418164212.1775741-1-mark.rutland@arm.com
Fixes:
|
||
Jan Kara
|
8d67449f90 |
mm: do not reclaim private data from pinned page
commit d824ec2a154677f63c56cc71ffe4578274f6e32e upstream. If the page is pinned, there's no point in trying to reclaim it. Furthermore if the page is from the page cache we don't want to reclaim fs-private data from the page because the pinning process may be writing to the page at any time and reclaiming fs private info on a dirty page can upset the filesystem (see link below). Link: https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz Link: https://lkml.kernel.org/r/20230428124140.30166-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Liam R. Howlett
|
862ea63fad |
mm/mempolicy: fix use-after-free of VMA iterator
commit f4e9e0e69468583c2c6d9d5c7bfc975e292bf188 upstream.
set_mempolicy_home_node() iterates over a list of VMAs and calls
mbind_range() on each VMA, which also iterates over the singular list of
the VMA passed in and potentially splits the VMA. Since the VMA iterator
is not passed through, set_mempolicy_home_node() may now point to a stale
node in the VMA tree. This can result in a UAF as reported by syzbot.
Avoid the stale maple tree node by passing the VMA iterator through to the
underlying call to split_vma().
mbind_range() is also overly complicated, since there are two calling
functions and one already handles iterating over the VMAs. Simplify
mbind_range() to only handle merging and splitting of the VMAs.
Align the new loop in do_mbind() and existing loop in
set_mempolicy_home_node() to use the reduced mbind_range() function. This
allows for a single location of the range calculation and avoids
constantly looking up the previous VMA (since this is a loop over the
VMAs).
Link: https://lore.kernel.org/linux-mm/000000000000c93feb05f87e24ad@google.com/
Fixes:
|
||
Tetsuo Handa
|
b528537d13 |
mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
commit 1007843a91909a4995ee78a538f62d8665705b66 upstream.
syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.
One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
CPU0
----
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
// e.g. timer interrupt handler runs at this moment
some_timer_func() {
kmalloc(GFP_ATOMIC) {
__alloc_pages_slowpath() {
read_seqbegin(&zonelist_update_seq) {
// spins forever because zonelist_update_seq.seqcount is odd
}
}
}
}
// e.g. timer interrupt handler finishes
write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
}
This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko does not know whether we should go with this
approach.
Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.
CPU0 CPU1
---- ----
pty_write() {
tty_insert_flip_string_and_push_buffer() {
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq);
build_zonelists() {
printk() {
vprintk() {
vprintk_default() {
vprintk_emit() {
console_unlock() {
console_flush_all() {
console_emit_next_record() {
con->write() = serial8250_console_write() {
spin_lock_irqsave(&port->lock, flags);
tty_insert_flip_string() {
tty_insert_flip_string_fixed_flag() {
__tty_buffer_request_room() {
tty_buffer_alloc() {
kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
__alloc_pages_slowpath() {
zonelist_iter_begin() {
read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
}
}
}
}
}
}
}
}
spin_unlock_irqrestore(&port->lock, flags);
// message is printed to console
spin_unlock_irqrestore(&port->lock, flags);
}
}
}
}
}
}
}
}
}
write_sequnlock(&zonelist_update_seq);
}
}
}
This deadlock scenario can be eliminated by
preventing interrupt context from calling kmalloc(GFP_ATOMIC)
and
preventing printk() from calling console_flush_all()
while zonelist_update_seq.seqcount is odd.
Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by
disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
and
disabling synchronous printk() in order to avoid console_flush_all()
.
As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC)
inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...
Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp
Fixes:
|
||
Liam R. Howlett
|
7e6631f782 |
mm/mmap: regression fix for unmapped_area{_topdown}
commit 58c5d0d6d522112577c7eeb71d382ea642ed7be4 upstream.
The maple tree limits the gap returned to a window that specifically fits
what was asked. This may not be optimal in the case of switching search
directions or a gap that does not satisfy the requested space for other
reasons. Fix the search by retrying the operation and limiting the search
window in the rare occasion that a conflict occurs.
Link: https://lkml.kernel.org/r/20230414185919.4175572-1-Liam.Howlett@oracle.com
Fixes:
|
||
Mel Gorman
|
059f24aff6 |
mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
commit 4d73ba5fa710fe7d432e0b271e6fecd252aef66e upstream. A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is taking an excessive amount of time for large amounts of memory. Further testing allocating huge pages that the cost is linear i.e. if allocating 1G pages in batches of 10 then the time to allocate nr_hugepages from 10->20->30->etc increases linearly even though 10 pages are allocated at each step. Profiles indicated that much of the time is spent checking the validity within already existing huge pages and then attempting a migration that fails after isolating the range, draining pages and a whole lot of other useless work. Commit |
||
Alexander Potapenko
|
bd6f3421a5 |
mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
commit 47ebd0310e89c087f56e58c103c44b72a2f6b216 upstream.
As reported by Dipanjan Das, when KMSAN is used together with kernel fault
injection (or, generally, even without the latter), calls to kcalloc() or
__vmap_pages_range_noflush() may fail, leaving the metadata mappings for
the virtual mapping in an inconsistent state. When these metadata
mappings are accessed later, the kernel crashes.
To address the problem, we return a non-zero error code from
kmsan_vmap_pages_range_noflush() in the case of any allocation/mapping
failure inside it, and make vmap_pages_range_noflush() return an error if
KMSAN fails to allocate the metadata.
This patch also removes KMSAN_WARN_ON() from vmap_pages_range_noflush(),
as these allocation failures are not fatal anymore.
Link: https://lkml.kernel.org/r/20230413131223.4135168-1-glider@google.com
Fixes:
|