Steps on the way to 6.1-rc1
Resolves merge conflicts in:
block/blk-crypto-fallback.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I8bc1919753d4664d1d37df82a6e1b82f7b591192
Steps on the way to 6.1-rc1
Resolves merge conflicts in:
include/uapi/linux/fuse.h
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I9c642d47e57e26958f55f8e9ef6920594631531b
The mali GPU device needs to have a custom get_unmapped_area() file
operation [1] in order to align GPU VA memory to a 2MB boundary for
64-bit processes. Since the GPU driver is a module, we need to export
vm_unmapped_area() to call it from the device's get_unmapped_area() file
op.
[1] https://android.googlesource.com/kernel/google-modules/gpu/+/refs/heads/android-gs-raviole-mainline/mali_kbase/thirdparty/mali_kbase_mmap.c#237
Bug: 254386546
Signed-off-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I08a1247e215a46fa0ca1a0d61f63be4fa8375ee4
Steps on the way to 6.1-rc1
resolves conflicts in:
include/linux/nodemask.h
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I05f94636e62c86dbab41bdcee21f2449f341cf2d
Steps on the way to 6.1-rc1
Resolves conflicts in:
fs/exec.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia649390f891780ed03530adcf0fac9560635e58f
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmNHYD0ACgkQSfxwEqXe
A655AA//dJK0PdRghqrKQsl18GOCffV5TUw5i1VbJQbI9d8anfxNjVUQiNGZi4et
qUwZ8OqVXxYx1Z1UDgUE39PjEDSG9/cCvOpMUWqN20/+6955WlNZjwA7Fk6zjvlM
R30fz5CIJns9RFvGT4SwKqbVLXIMvfg/wDENUN+8sxt36+VD2gGol7J2JJdngEhM
lW+zqzi0ABqYy5so4TU2kixpKmpC08rqFvQbD1GPid+50+JsOiIqftDErt9Eg1Mg
MqYivoFCvbAlxxxRh3+UHBd7ZpJLtp1UFEOl2Rf00OXO+ZclLCAQAsTczucIWK9M
8LCZjb7d4lPJv9RpXFAl3R1xvfc+Uy2ga5KeXvufZtc5G3aMUKPuIU7k28ZyblVS
XXsXEYhjTSd0tgi3d0JlValrIreSuj0z2QGT5pVcC9utuAqAqRIlosiPmgPlzXjr
Us4jXaUhOIPKI+Musv/fqrxsTQziT0jgVA3Njlt4cuAGm/EeUbLUkMWwKXjZLTsv
vDsBhEQFmyZqxWu4pYo534VX2mQWTaKRV1SUVVhQEHm57b00EAiZohoOvweB09SR
4KiJapikoopmW4oAUFotUXUL1PM6yi+MXguTuc1SEYuLz/tCFtK8DJVwNpfnWZpE
lZKvXyJnHq2Sgod/hEZq58PMvT6aNzTzSg7YzZy+VabxQGOO5mc=
=M+mV
-----END PGP SIGNATURE-----
Merge tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull more random number generator updates from Jason Donenfeld:
"This time with some large scale treewide cleanups.
The intent of this pull is to clean up the way callers fetch random
integers. The current rules for doing this right are:
- If you want a secure or an insecure random u64, use get_random_u64()
- If you want a secure or an insecure random u32, use get_random_u32()
The old function prandom_u32() has been deprecated for a while
now and is just a wrapper around get_random_u32(). Same for
get_random_int().
- If you want a secure or an insecure random u16, use get_random_u16()
- If you want a secure or an insecure random u8, use get_random_u8()
- If you want secure or insecure random bytes, use get_random_bytes().
The old function prandom_bytes() has been deprecated for a while
now and has long been a wrapper around get_random_bytes()
- If you want a non-uniform random u32, u16, or u8 bounded by a
certain open interval maximum, use prandom_u32_max()
I say "non-uniform", because it doesn't do any rejection sampling
or divisions. Hence, it stays within the prandom_*() namespace, not
the get_random_*() namespace.
I'm currently investigating a "uniform" function for 6.2. We'll see
what comes of that.
By applying these rules uniformly, we get several benefits:
- By using prandom_u32_max() with an upper-bound that the compiler
can prove at compile-time is ≤65536 or ≤256, internally
get_random_u16() or get_random_u8() is used, which wastes fewer
batched random bytes, and hence has higher throughput.
- By using prandom_u32_max() instead of %, when the upper-bound is
not a constant, division is still avoided, because
prandom_u32_max() uses a faster multiplication-based trick instead.
- By using get_random_u16() or get_random_u8() in cases where the
return value is intended to indeed be a u16 or a u8, we waste fewer
batched random bytes, and hence have higher throughput.
This series was originally done by hand while I was on an airplane
without Internet. Later, Kees and I worked on retroactively figuring
out what could be done with Coccinelle and what had to be done
manually, and then we split things up based on that.
So while this touches a lot of files, the actual amount of code that's
hand fiddled is comfortably small"
* tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
prandom: remove unused functions
treewide: use get_random_bytes() when possible
treewide: use get_random_u32() when possible
treewide: use get_random_{u8,u16}() when possible, part 2
treewide: use get_random_{u8,u16}() when possible, part 1
treewide: use prandom_u32_max() when possible, part 2
treewide: use prandom_u32_max() when possible, part 1
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmNLD/EACgkQ4CHKc/GJ
qRCnhQf+Oj0qB9bdy+MmgirN0/VHFmTQbNSYUd/gzGmfcAHpxIE9KG0V9+y9I2wG
Nh6WgUKwX1IEKQ37X+VT/XsIe9VcALcn5LjxD/J4cL71CREa/0HGQbBavt9GuDsC
zkUwxYx6iAtGfK/PK9jE2eHIzxzfZ6kEkFsMaS+jP/8iLnE9trAhQ1o6vG15EFPA
MHjJ3+y7AsUE7SYHKL+8WLA+QR443SlHN0u327KkA2kKpjsj+hqQdiPfHqOArBbo
vw2DI14tcELGtruo5zHMVT9TcXWV7hcJ6yTTnaKxI+WCbgsEpPQKevTmc7q9P0H4
hLgQEElRuzBrXUCIBPVboNuTgGNjLQ==
=cVwd
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.1-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab hotfix from Vlastimil Babka:
"A single fix for the common-kmalloc series, for warnings on mips and
sparc64 reported by Guenter Roeck"
* tag 'slab-for-6.1-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: use kmalloc_node() for off slab freelist_idx_t array allocation
After commit d6a71648db ("mm/slab: kmalloc: pass requests larger than
order-1 page to page allocator"), SLAB passes large ( > PAGE_SIZE * 2)
requests to buddy like SLUB does.
SLAB has been using kmalloc caches to allocate freelist_idx_t array for
off slab caches. But after the commit, freelist_size can be bigger than
KMALLOC_MAX_CACHE_SIZE.
Instead of using pointer to kmalloc cache, use kmalloc_node() and only
check if the kmalloc cache is off slab during calculate_slab_order().
If freelist_size > KMALLOC_MAX_CACHE_SIZE, no looping condition happens
as it allocates freelist_idx_t array directly from buddy.
Link: https://lore.kernel.org/all/20221014205818.GA1428667@roeck-us.net/
Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Fixes: d6a71648db ("mm/slab: kmalloc: pass requests larger than order-1 page to page allocator")
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
refcounting errors in ZONE_DEVICE pages.
- Peter Xu fixes some userfaultfd test harness instability.
- Various other patches in MM, mainly fixes.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0j6igAKCRDdBJ7gKXxA
jnGxAP99bV39ZtOsoY4OHdZlWU16BUjKuf/cb3bZlC2G849vEwD+OKlij86SG20j
MGJQ6TfULJ8f1dnQDd6wvDfl3FMl7Qc=
=tbdp
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- fix a race which causes page refcounting errors in ZONE_DEVICE pages
(Alistair Popple)
- fix userfaultfd test harness instability (Peter Xu)
- various other patches in MM, mainly fixes
* tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (29 commits)
highmem: fix kmap_to_page() for kmap_local_page() addresses
mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page
mm/selftest: uffd: explain the write missing fault check
mm/hugetlb: use hugetlb_pte_stable in migration race check
mm/hugetlb: fix race condition of uffd missing/minor handling
zram: always expose rw_page
LoongArch: update local TLB if PTE entry exists
mm: use update_mmu_tlb() on the second thread
kasan: fix array-bounds warnings in tests
hmm-tests: add test for migrate_device_range()
nouveau/dmem: evict device private memory during release
nouveau/dmem: refactor nouveau_dmem_fault_copy_one()
mm/migrate_device.c: add migrate_device_range()
mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()
mm/memremap.c: take a pgmap reference on page allocation
mm: free device private pages have zero refcount
mm/memory.c: fix race when faulting a device private page
mm/damon: use damon_sz_region() in appropriate place
mm/damon: move sz_damon_region to damon_sz_region
lib/test_meminit: add checks for the allocation functions
...
kmap_to_page() is used to get the page for a virtual address which may
be kmap'ed. Unfortunately, kmap_local_page() stores mappings in a
thread local array separate from kmap(). These mappings were not
checked by the call.
Check the kmap_local_page() mappings and return the page if found.
Because it is intended to remove kmap_to_page() add a warn on once to
the kmap checks to flag potential issues early.
NOTE Due to 32bit x86 use of kmap local in iomap atmoic, KMAP_LOCAL does
not require HIGHMEM to be set. Therefore the support calls required a
new KMAP_LOCAL section to fix 0day build errors.
[akpm@linux-foundation.org: fix warning]
Link: https://lkml.kernel.org/r/20221006040555.1502679-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: kernel test robot <lkp@intel.com>
Cc: "Fabio M. De Francesco" <fmdefrancesco@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
PGFREE and PGALLOC represent the number of freed and allocated pages. So
the page order must be considered.
Link: https://lkml.kernel.org/r/20221006101540.40686-1-laoar.shao@gmail.com
Fixes: 44042b4498 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After hugetlb_pte_stable() introduced, we can also rewrite the migration
race condition against page allocation to use the new helper too.
Link: https://lkml.kernel.org/r/20221004193400.110155-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/hugetlb: Fix selftest failures with write check", v3.
Currently akpm mm-unstable fails with uffd hugetlb private mapping test
randomly on a write check.
The initial bisection of that points to the recent pmd unshare series, but
it turns out there's no direction relationship with the series but only
some timing change caused the race to start trigger.
The race should be fixed in patch 1. Patch 2 is a trivial cleanup on the
similar race with hugetlb migrations, patch 3 comment on the write check
so when anyone read it again it'll be clear why it's there.
This patch (of 3):
After the recent rework patchset of hugetlb locking on pmd sharing,
kselftest for userfaultfd sometimes fails on hugetlb private tests with
unexpected write fault checks.
It turns out there's nothing wrong within the locking series regarding
this matter, but it could have changed the timing of threads so it can
trigger an old bug.
The real bug is when we call hugetlb_no_page() we're not with the pgtable
lock. It means we're reading the pte values lockless. It's perfectly
fine in most cases because before we do normal page allocations we'll take
the lock and check pte_same() again. However before that, there are
actually two paths on userfaultfd missing/minor handling that may directly
move on with the fault process without checking the pte values.
It means for these two paths we may be generating an uffd message based on
an unstable pte, while an unstable pte can legally be anything as long as
the modifier holds the pgtable lock.
One example, which is also what happened in the failing kselftest and
caused the test failure, is that for private mappings wr-protection
changes can happen on one page. While hugetlb_change_protection()
generally requires pte being cleared before being changed, then there can
be a race condition like:
thread 1 thread 2
-------- --------
UFFDIO_WRITEPROTECT hugetlb_fault
hugetlb_change_protection
pgtable_lock()
huge_ptep_modify_prot_start
pte==NULL
hugetlb_no_page
generate uffd missing event
even if page existed!!
huge_ptep_modify_prot_commit
pgtable_unlock()
Fix this by rechecking the pte after pgtable lock for both userfaultfd
missing & minor fault paths.
This bug should have been around starting from uffd hugetlb introduced, so
attaching a Fixes to the commit. Also attach another Fixes to the minor
support commit for easier tracking.
Note that userfaultfd is actually fine with false positives (e.g. caused
by pte changed), but not wrong logical events (e.g. caused by reading a
pte during changing). The latter can confuse the userspace, so the
strictness is very much preferred. E.g., MISSING event should never
happen on the page after UFFDIO_COPY has correctly installed the page and
returned.
Link: https://lkml.kernel.org/r/20221004193400.110155-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20221004193400.110155-2-peterx@redhat.com
Fixes: 1a1aad8a9b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Fixes: 7677f7fd8b ("userfaultfd: add minor fault registration mode")
Signed-off-by: Peter Xu <peterx@redhat.com>
Co-developed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As message in commit 7df6769743 ("mm/memory.c: Update local TLB if PTE
entry exists") said, we should update local TLB only on the second thread.
So in the do_anonymous_page() here, we should use update_mmu_tlb()
instead of update_mmu_cache() on the second thread.
As David pointed out, this is a performance improvement, not a
correctness fix.
Link: https://lkml.kernel.org/r/20220929112318.32393-2-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Bibo Mao <maobibo@loongson.cn>
Cc: Chris Zankel <chris@zankel.net>
Cc: Huacai Chen <chenhuacai@loongson.cn>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
GCC's -Warray-bounds option detects out-of-bounds accesses to
statically-sized allocations in krealloc out-of-bounds tests.
Use OPTIMIZER_HIDE_VAR to suppress the warning.
Also change kmalloc_memmove_invalid_size to use OPTIMIZER_HIDE_VAR
instead of a volatile variable.
Link: https://lkml.kernel.org/r/e94399242d32e00bba6fd0d9ec4c897f188128e8.1664215688.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Device drivers can use the migrate_vma family of functions to migrate
existing private anonymous mappings to device private pages. These pages
are backed by memory on the device with drivers being responsible for
copying data to and from device memory.
Device private pages are freed via the pgmap->page_free() callback when
they are unmapped and their refcount drops to zero. Alternatively they
may be freed indirectly via migration back to CPU memory in response to a
pgmap->migrate_to_ram() callback called whenever the CPU accesses an
address mapped to a device private page.
In other words drivers cannot control the lifetime of data allocated on
the devices and must wait until these pages are freed from userspace.
This causes issues when memory needs to reclaimed on the device, either
because the device is going away due to a ->release() callback or because
another user needs to use the memory.
Drivers could use the existing migrate_vma functions to migrate data off
the device. However this would require them to track the mappings of each
page which is both complicated and not always possible. Instead drivers
need to be able to migrate device pages directly so they can free up
device memory.
To allow that this patch introduces the migrate_device family of functions
which are functionally similar to migrate_vma but which skips the initial
lookup based on mapping.
Link: https://lkml.kernel.org/r/868116aab70b0c8ee467d62498bb2cf0ef907295.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
migrate_device_coherent_page() reuses the existing migrate_vma family of
functions to migrate a specific page without providing a valid mapping or
vma. This looks a bit odd because it means we are calling migrate_vma_*()
without setting a valid vma, however it was considered acceptable at the
time because the details were internal to migrate_device.c and there was
only a single user.
One of the reasons the details could be kept internal was that this was
strictly for migrating device coherent memory. Such memory can be copied
directly by the CPU without intervention from a driver. However this
isn't true for device private memory, and a future change requires similar
functionality for device private memory. So refactor the code into
something more sensible for migrating device memory without a vma.
Link: https://lkml.kernel.org/r/c7b2ff84e9b33d022cf4a40f87d051f281a16d8f.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ZONE_DEVICE pages have a struct dev_pagemap which is allocated by a
driver. When the struct page is first allocated by the kernel in
memremap_pages() a reference is taken on the associated pagemap to ensure
it is not freed prior to the pages being freed.
Prior to 27674ef6c7 ("mm: remove the extra ZONE_DEVICE struct page
refcount") pages were considered free and returned to the driver when the
reference count dropped to one. However the pagemap reference was not
dropped until the page reference count hit zero. This would occur as part
of the final put_page() in memunmap_pages() which would wait for all pages
to be freed prior to returning.
When the extra refcount was removed the pagemap reference was no longer
being dropped in put_page(). Instead memunmap_pages() was changed to
explicitly drop the pagemap references. This means that memunmap_pages()
can complete even though pages are still mapped by the kernel which can
lead to kernel crashes, particularly if a driver frees the pagemap.
To fix this drivers should take a pagemap reference when allocating the
page. This reference can then be returned when the page is freed.
Link: https://lkml.kernel.org/r/12d155ec727935ebfbb4d639a03ab374917ea51b.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Fixes: 27674ef6c7 ("mm: remove the extra ZONE_DEVICE struct page refcount")
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since 27674ef6c7 ("mm: remove the extra ZONE_DEVICE struct page
refcount") device private pages have no longer had an extra reference
count when the page is in use. However before handing them back to the
owning device driver we add an extra reference count such that free pages
have a reference count of one.
This makes it difficult to tell if a page is free or not because both free
and in use pages will have a non-zero refcount. Instead we should return
pages to the drivers page allocator with a zero reference count. Kernel
code can then safely use kernel functions such as get_page_unless_zero().
Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Fix several device private page reference counting issues",
v2
This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages. These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.
During normal usage it is unlikely these will cause any problems. However
without these fixes it is possible to crash the kernel from userspace.
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting.
In modules such as Nouveau it is also possible to trigger some of these
issues by explicitly closing the device file-descriptor prior to the task
exiting and then accessing device private memory.
This involves some minor changes to both PowerPC and AMD GPU code.
Unfortunately I lack hardware to test either of those so any help there
would be appreciated. The changes mimic what is done in for both Nouveau
and hmm-tests though so I doubt they will cause problems.
This patch (of 8):
When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called. However no
reference is taken on the faulting page. Therefore a concurrent migration
of the device private page can free the page and possibly the underlying
pgmap. This results in a race which can crash the kernel due to the
migrate_to_ram() function pointer becoming invalid. It also means drivers
can't reliably read the zone_device_data field because the page may have
been freed with memunmap_pages().
Close the race by getting a reference on the page while holding the ptl to
ensure it has not been freed. Unfortunately the elevated reference count
will cause the migration required to handle the fault to fail. To avoid
this failure pass the faulting page into the migrate_vma functions so that
if an elevated reference count is found it can be checked to see if it's
expected or not.
[mpe@ellerman.id.au: fix build]
Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In many places we can use damon_sz_region() to instead of "r->ar.end -
r->ar.start".
Link: https://lkml.kernel.org/r/20220927001946.85375-2-xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rename sz_damon_region() to damon_sz_region(), and move it to
"include/linux/damon.h", because in many places, we can to use this func.
Link: https://lkml.kernel.org/r/20220927001946.85375-1-xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is an optimization to reduce stackdepot pressure.
struct mmu_gather contains 7 1-bit fields packed into a 32-bit unsigned
int value. The remaining 25 bits remain uninitialized and are never used,
but KMSAN updates the origin for them in zap_pXX_range() in mm/memory.c,
thus creating very long origin chains. This is technically correct, but
consumes too much memory.
Unpoisoning the whole structure will prevent creating such chains.
Link: https://lkml.kernel.org/r/20220905122452.2258262-20-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit c462ac288f ("mm: Introduce arch_validate_flags()") added a late
check in mmap_region() to let architectures validate vm_flags. The check
needs to happen after calling ->mmap() as the flags can potentially be
modified during this callback.
If arch_validate_flags() check fails we unmap and free the vma. However,
the error path fails to undo the ->mmap() call that previously succeeded
and depending on the specific ->mmap() implementation this translates to
reference increments, memory allocations and other operations what will
not be cleaned up.
There are several places (mainly device drivers) where this is an issue.
However, one specific example is bpf_map_mmap() which keeps count of the
mappings in map->writecnt. The count is incremented on ->mmap() and then
decremented on vm_ops->close(). When arch_validate_flags() fails this
count is off since bpf_map_mmap_close() is never called.
One can reproduce this issue in arm64 devices with MTE support. Here the
vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been set
previously. From userspace then is enough to pass the PROT_MTE flag to
mmap() syscall to trigger the arch_validate_flags() failure.
The following program reproduces this issue:
#include <stdio.h>
#include <unistd.h>
#include <linux/unistd.h>
#include <linux/bpf.h>
#include <sys/mman.h>
int main(void)
{
union bpf_attr attr = {
.map_type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(long long),
.max_entries = 256,
.map_flags = BPF_F_MMAPABLE,
};
int fd;
fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0);
return 0;
}
By manually adding some log statements to the vm_ops callbacks we can
confirm that when passing PROT_MTE to mmap() the map->writecnt is off upon
->release():
With PROT_MTE flag:
root@debian:~# ./bpf-test
[ 111.263874] bpf_map_write_active_inc: map=9 writecnt=1
[ 111.288763] bpf_map_release: map=9 writecnt=1
Without PROT_MTE flag:
root@debian:~# ./bpf-test
[ 157.816912] bpf_map_write_active_inc: map=10 writecnt=1
[ 157.830442] bpf_map_write_active_dec: map=10 writecnt=0
[ 157.832396] bpf_map_release: map=10 writecnt=0
This patch fixes the above issue by calling vm_ops->close() when the
arch_validate_flags() check fails, after this we can proceed to unmap and
free the vma on the error path.
Link: https://lkml.kernel.org/r/20220930003844.1210987-1-cmllamas@google.com
Fixes: c462ac288f ("mm: Introduce arch_validate_flags()")
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When PTE_MARKER_UFFD_WP not configured, it's still possible to reach pte
marker code and trigger an warning. Add a few CONFIG_PTE_MARKER_UFFD_WP
ifdefs to make sure the code won't be reached when not compiled in.
Link: https://lkml.kernel.org/r/YzeR+R6b4bwBlBHh@x1n
Fixes: b1f9e87686 ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: <syzbot+2b9b4f0895be09a6dec3@syzkaller.appspotmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Edward Liaw <edliaw@google.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the brk VMA is the last vma in a maple node and meets the rare criteria
that it can be expanded, then preallocation is necessary to avoid a
potential fs_reclaim circular lock issue on low resources.
At the same time use the actual vma start address (unaligned) when calling
vma_adjust_trans_huge().
Link: https://lkml.kernel.org/r/20221011160624.1253454-1-Liam.Howlett@oracle.com
Fixes: 2e7ce7d354 (mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap())
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The anon vma was not unlinked and the file was not closed in the failure
path when the machine runs out of memory during the maple tree
modification. This caused a memory leak of the anon vma chain and vma
since neither would be freed.
Link: https://lkml.kernel.org/r/20221011203621.1446507-1-Liam.Howlett@oracle.com
Fixes: 524e00b36e ("mm: remove rb tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Tested-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When we successfully find a pageblock in fast_find_migrateblock(), the
block will be set skip-flag through set_pageblock_skip(). However, when
entering isolate_migratepages_block(), the whole pageblock will be skipped
due to the branch 'if (!valid_page && IS_ALIGNED(low_pfn,
pageblock_nr_pages))'. Eventually we will goto isolate_abort and isolate
nothing. That makes fast_find_migrateblock useless.
In this patch, when we find a suitable pageblock in
fast_find_migrateblock, we do noting but let isolate_migratepages_block to
set skip flag to the pageblock after scan it. Normally, we would isolate
some pages from the fast-find block.
I use mmtest/thpscale-madvhugepage test it. Here is the result:
baseline patch
Amean fault-both-1 1331.66 ( 0.00%) 1261.04 * 5.30%*
Amean fault-both-3 1383.95 ( 0.00%) 1191.69 * 13.89%*
Amean fault-both-5 1568.13 ( 0.00%) 1445.20 * 7.84%*
Amean fault-both-7 1819.62 ( 0.00%) 1555.13 * 14.54%*
Amean fault-both-12 1106.96 ( 0.00%) 1149.43 * -3.84%*
Amean fault-both-18 2196.93 ( 0.00%) 1875.77 * 14.62%*
Amean fault-both-24 2642.69 ( 0.00%) 2671.21 * -1.08%*
Amean fault-both-30 2901.89 ( 0.00%) 2857.32 * 1.54%*
Amean fault-both-32 3747.00 ( 0.00%) 3479.23 * 7.15%*
Link: https://lkml.kernel.org/r/20220713062009.597255-1-zhouchuyi@bytedance.com
Fixes: 70b44595ea ("mm, compaction: use free lists to quickly locate a migration source")
Signed-off-by: zhouchuyi <zhouchuyi@bytedance.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
not.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0YhtwAKCRDdBJ7gKXxA
juJLAQDCa0g8sfe9cTw3PT1gRnn8gWLHEkMgUWVC/aBaqYFGeQEAta+g8muv9Tpd
qODv0JARH4cwONKEA24Oql+A5RnI6gQ=
=QZnW
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"Five hotfixes - three for nilfs2, two for MM. For are cc:stable, one
is not"
* tag 'mm-hotfixes-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
nilfs2: fix leak of nilfs_root in case of writer thread creation failure
nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level()
nilfs2: fix use-after-free bug of struct nilfs_root
mm/damon/core: initialize damon_target->list in damon_new_target()
mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page
'struct damon_target' creation function, 'damon_new_target()' is not
initializing its '->list' field, unlike other DAMON structs creator
functions such as 'damon_new_region()'. Normal users of
'damon_new_target()' initializes the field by adding the target to DAMON
context's targets list, but some code could access the uninitialized
field.
This commit avoids the case by initializing the field in
'damon_new_target()'.
Link: https://lkml.kernel.org/r/20221002193130.8227-1-sj@kernel.org
Fixes: f23b8eee18 ("mm/damon/core: implement region-based sampling")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>