[ Upstream commit 87b5a5c209405cb6b57424cdfa226a6dbd349232 ]
end key should be equal to start unless NFT_SET_EXT_KEY_END is present.
It's possible to add elements that only have a start key
("{ 1.0.0.0 . 2.0.0.0 }") without an interval end.
Insertion treats this via:
	if (nft_set_ext_exists(ext, NFT_SET_EXT_KEY_END))
		end = (const u8 *)nft_set_ext_key_end(ext)->data;
	else
		end = start;
but removal side always uses nft_set_ext_key_end().
This is wrong and leads to garbage remaining in the set after removal;
the next lookup/insert attempt will give:
BUG: KASAN: slab-use-after-free in pipapo_get+0x8eb/0xb90
Read of size 1 at addr ffff888100d50586 by task nft-pipapo_uaf_/1399
Call Trace:
kasan_report+0x105/0x140
pipapo_get+0x8eb/0xb90
nft_pipapo_insert+0x1dc/0x1710
nf_tables_newsetelem+0x31f5/0x4e00
..
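A minimal userspace sketch of the idea, not the kernel code: both the
insertion and the removal path derive the end key through one helper, so
an element without an explicit end is treated as the range [start, start]
on both sides. The struct elem and the helper elem_key_end() are
hypothetical stand-ins for the nft_set_ext accessors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a set element; not the kernel's nft_set_ext. */
struct elem {
	const unsigned char *start;	/* start key, always present */
	const unsigned char *end;	/* explicit end key, or NULL if absent */
};

/* Return the end key, falling back to the start key when none is set. */
static const unsigned char *elem_key_end(const struct elem *e)
{
	assert(e != NULL && e->start != NULL);
	return e->end ? e->end : e->start;
}
```

Calling the same helper from insert and remove keeps the two paths from
disagreeing about which range an element covers.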
Bug: 293587745
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Reported-by: lonial con <kongln9170@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 90c3955beb)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I51a423aaa2c31c4df89776505b602aa2c1523b82
Running the following will run scripts/checkpatch.pl on a
patch of HEAD
tools/bazel run //common:checkpatch
or a given Git SHA1:
tools/bazel run //common:checkpatch -- --git_sha1 ...
For additional flags, see
tools/bazel run //common:checkpatch -- --help
For details, see
build/kernel/kleaf/docs/checkpatch.md
in your source tree.
Test: TH
Bug: 259995152
Change-Id: Iaad8fd69508cf9be11340166aafbb84930d4805c
Signed-off-by: Yifan Hong <elsk@google.com>
(cherry picked from commit 7dbf26568fcccde88470e7a25c07f0c7229e85f1)
Avichal Rakesh reported a kernel panic that occurred when the UVC
gadget driver was removed from a gadget's configuration. The panic
involves a somewhat complicated interaction between the kernel driver
and a userspace component (as described in the Link tag below), but
the analysis did make one thing clear: The Gadget core should
accommodate gadget drivers calling usb_gadget_deactivate() as part of
their unbind procedure.
Currently this doesn't work. gadget_unbind_driver() calls
driver->unbind() while holding the udc->connect_lock mutex, and
usb_gadget_deactivate() attempts to acquire that mutex, which will
result in a deadlock.
The simple fix is for gadget_unbind_driver() to release the mutex when
invoking the ->unbind() callback. There is no particular reason for
it to be holding the mutex at that time, and the mutex isn't held
while the ->bind() callback is invoked. So we'll drop the mutex
before performing the unbind callback and reacquire it afterward.
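The locking pattern can be modelled in userspace roughly as follows. This
is only a sketch: toy_lock is a hypothetical non-recursive lock whose
trylock failure stands in for the deadlock, and toy_unbind() plays the
role of a driver callback that needs the same lock, as
usb_gadget_deactivate() does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock; taking it twice would deadlock a real mutex. */
struct toy_lock {
	bool held;
};

static bool toy_trylock(struct toy_lock *l)
{
	if (l->held)
		return false;	/* a real mutex_lock() would block forever */
	l->held = true;
	return true;
}

static void toy_unlock(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Hypothetical driver callback that needs the lock itself. */
static bool toy_unbind(struct toy_lock *l)
{
	if (!toy_trylock(l))
		return false;
	toy_unlock(l);
	return true;
}

/* Unbind path; drop_lock selects the old (false) or new (true) behavior. */
static bool toy_unbind_driver(struct toy_lock *l, bool drop_lock)
{
	bool ok = toy_trylock(l);

	if (!ok)
		return false;
	if (drop_lock)
		toy_unlock(l);			/* release before the callback */
	ok = toy_unbind(l);
	if (drop_lock)
		ok = toy_trylock(l) && ok;	/* reacquire afterward */
	toy_unlock(l);
	return ok;
}
```

With drop_lock set to false the callback cannot take the lock, which
corresponds to the deadlock described above; with drop_lock set to true
the callback succeeds.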
We'll also add a couple of comments to usb_gadget_activate() and
usb_gadget_deactivate(). Because they run in process context they
must not be called from a gadget driver's ->disconnect() callback,
which (according to the kerneldoc for struct usb_gadget_driver in
include/linux/usb/gadget.h) may run in interrupt context. This may
help prevent similar bugs from arising in the future.
Reported-and-tested-by: Avichal Rakesh <arakesh@google.com>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Fixes: 286d9975a838 ("usb: gadget: udc: core: Prevent soft_connect_store() race")
Link: https://lore.kernel.org/linux-usb/4d7aa3f4-22d9-9f5a-3d70-1bd7148ff4ba@google.com/
Cc: Badhri Jagan Sridharan <badhri@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/48b2f1f1-0639-46bf-bbfc-98cb05a24914@rowland.harvard.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bug: 291976100
Change-Id: Icff01d8e88f041af4bda8726242de9cd518a247a
(cherry picked from commit 65dadb2beeb7360232b09ebc4585b54475dfee06)
Signed-off-by: Avichal Rakesh <arakesh@google.com>
Update symbols to symbol list externed by oppo memory group.
ABI DIFFERENCES HAVE BEEN DETECTED!
1 variable symbol(s) added
'unsigned long zero_pfn'
Bug: 292051411
Change-Id: I913c01c7671729bf33b78a218c61cfb94628fb0e
Signed-off-by: huzhanyuan <huzhanyuan@oppo.com>
__GFP_CMA was added but was never added to gfpflag_names. Add it so that
it shows up in %pGg printk output.
Bug: 295271520
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Change-Id: I155fdcc0e2c18db390b5166ba8d2b93c793caae6
slab-out-of-bounds happens if the xhci platform drivers don't define
the extra_priv_size in their xhci_driver_overrides structure. Move the
xhci_vendor_ops structure into the main xhci structure so that
extra_priv_size no longer affects xhci_vendor_get_ops, which is what
causes the slab-out-of-bounds error.
Fixes: 90ab8e7f98 ("ANDROID: usb: host: add xhci hooks for USB offload")
Bug: 293869685
Bug: 194461020
Test: build and boot pass
Change-Id: Id17fdfbfd3e8edcc89a05c9c2f553ffab494215e
Signed-off-by: Howard Yen <howardyen@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
(cherry picked from commit 34f6c9c3088b13884567429e3c2ceb08d2235b5b)
(cherry picked from commit 00666b8e3e6ed6ba82fd23d8c83390c30f426469)
Pixel is using these symbols in its USB driver implementation.
3 function symbol(s) added
'int xhci_address_device(struct usb_hcd*, struct usb_device*)'
'int xhci_bus_resume(struct usb_hcd*)'
'int xhci_bus_suspend(struct usb_hcd*)'
Bug: 277396090
Bug: 287008367
Change-Id: Id89097ab094e0582560383793c91278c88cb078f
Signed-off-by: André Draszik <draszik@google.com>
We expect a file page access after dropping caches should be a major
fault, but sometimes it's still a minor fault. That's because a file page
can't be dropped if it's in a per-cpu pagevec. Drain all pages from the
per-cpu pagevecs to the LRU list before trying to drop caches.
Link: https://lkml.kernel.org/r/20230630092203.16080-1-andrew.yang@mediatek.com
Change-Id: I9b03c53e39b87134d5ddd0c40ac9b36cf4d190cd
Signed-off-by: Andrew Yang <andrew.yang@mediatek.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 285794522
(cherry picked from commit a481c6fdf3e4fdf31bda91098dfbf46098037e76
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
uid_sys_stats tries to acquire a lock when any task exits to do some
bookkeeping in a common data structure. If the lock is contended, it
allocates and schedules a work to do the work later to avoid task exit
latency.
In a stress test which creates many tasks exiting, the workqueue can be
overwhelmed by the number of works being scheduled and allocates more
worker threads to handle the queue. The growth of the number of threads is
effectively unbounded and can exhaust the process table. This causes
denial of service to userspace trying to fork().
Instead of allocating a new work each time, create a linked list of the
update stats deferred work and have a single work to drain the linked
list. The linked list is implemented using an atomic_long_t.
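A rough userspace sketch of such a list, using C11 atomics in place of the
kernel's atomic_long_t. The struct deferred_update, its fields, and the
function names are hypothetical; the real driver differs in detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical deferred update; the real driver's fields differ. */
struct deferred_update {
	struct deferred_update *next;
	int uid;
};

/* List head; an atomic pointer stands in for the atomic_long_t. */
static _Atomic(struct deferred_update *) pending_head;

/* Push one update onto the list without taking a lock. */
static void defer_update(struct deferred_update *u)
{
	struct deferred_update *old;

	assert(u != NULL);
	old = atomic_load(&pending_head);
	do {
		u->next = old;	/* old is refreshed when the exchange fails */
	} while (!atomic_compare_exchange_weak(&pending_head, &old, u));
}

/* Detach the whole list in one step and count the entries drained. */
static int drain_updates(void)
{
	struct deferred_update *u =
		atomic_exchange(&pending_head, (struct deferred_update *)NULL);
	int n = 0;

	for (; u != NULL; u = u->next)
		n++;	/* the real work would apply the update here */
	return n;
}
```

Because the drain takes the entire list with a single exchange, one work
item can handle any number of deferred updates.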
Bug: 294468796
Fixes: 5586278c0f ("ANDROID: uid_sys_stats: defer process_notifier work if uid_lock is contended")
Change-Id: I15f20f4f69ea66a452bdf815c4ef3a0da3edfd36
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Add a hook in get_scan_count() so that OEMs can apply a customized
reclamation strategy.
Bug: 294180281
Change-Id: Ic54d35128e458661fc2b641809f5371b1d9a488e
Signed-off-by: Jiewen Wang <jiewen.wang@vivo.com>
inc_max_seq() will try to inc_min_seq() if nr_gens == MAX_NR_GENS. This
is because the generations are reused (the last oldest now empty
generation will become the next youngest generation).
inc_min_seq() is retried until successful, dropping the lru_lock
and yielding the CPU on each failure, and retaking the lock before
trying again:
	while (!inc_min_seq(lruvec, type, can_swap)) {
		spin_unlock_irq(&lruvec->lru_lock);
		cond_resched();
		spin_lock_irq(&lruvec->lru_lock);
	}
However, the initial condition that required incrementing the min_seq
(nr_gens == MAX_NR_GENS) is not retested. This can change by another
call to inc_max_seq() from run_aging() with force_scan=true from the
debugfs interface.
Since the eviction stalls when the nr_gens == MIN_NR_GENS, avoid
unnecessarily incrementing the min_seq by rechecking the number of
generations before each attempt.
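The recheck can be illustrated with a toy model. This is a sketch under
stated assumptions, not the kernel loop: toy_lruvec, its fail_count field
and the helper names are hypothetical, and MAX_NR_GENS is assumed to be 4.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_GENS 4	/* assumed value for this sketch */

/* Hypothetical, much reduced stand-in for the lruvec state. */
struct toy_lruvec {
	unsigned long max_seq;
	unsigned long min_seq;
	int fail_count;		/* attempts that fail before one succeeds */
};

static int toy_nr_gens(const struct toy_lruvec *l)
{
	assert(l->max_seq >= l->min_seq);
	return (int)(l->max_seq - l->min_seq + 1);
}

/* Fails fail_count times, then advances min_seq. */
static bool toy_inc_min_seq(struct toy_lruvec *l)
{
	if (l->fail_count > 0) {
		l->fail_count--;
		return false;
	}
	l->min_seq++;
	return true;
}

/* Retest the generation count before every attempt; return the attempts. */
static int toy_make_room(struct toy_lruvec *l)
{
	int attempts = 0;

	while (toy_nr_gens(l) == MAX_NR_GENS) {
		attempts++;
		if (toy_inc_min_seq(l))
			break;
		/* the real code drops and retakes the lock here, so the
		 * generation count may have changed by the next check */
	}
	return attempts;
}
```

Once the generation count is below MAX_NR_GENS the loop makes no further
attempt, so min_seq is not advanced unnecessarily.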
This issue was uncovered in previous discussion on the list by Yu Zhao
and Aneesh Kumar [1].
[1] https://lore.kernel.org/linux-mm/CAOUHufbO7CaVm=xjEb1avDhHVvnC8pJmGyKcFf2iY_dpf+zR3w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20230802025606.346758-2-kaleshsingh@google.com
Fixes: d6c3af7d8a ("mm: multi-gen LRU: debugfs interface")
Change-Id: I89e84ef2927eb1b0091f1be28bd03eb04dee4c57
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 250dbd10306126b06415afda8adfc27b2b780428 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 288383787
Bug: 291719697
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
MGLRU has a LRU list for each zone for each type (anon/file) in each
generation:
long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
The min_seq (oldest generation) can progress independently for each
type but the max_seq (youngest generation) is shared for both anon and
file. This is to maintain a common frame of reference.
In order for eviction to advance the min_seq of a type, all the per-zone
lists in the oldest generation of that type must be empty.
The eviction logic only considers pages from eligible zones for
eviction or promotion.
	scan_folios() {
		...
		for (zone = sc->reclaim_idx; zone >= 0; zone--) {
			...
			sort_folio();		// Promote
			...
			isolate_folio();	// Evict
		}
		...
	}
Consider a system that has the movable zone configured and the default 4
generations. The current state of the system is as shown below
(only illustrating one type for simplicity):
	Type: ANON
	Zone     DMA32    Normal    Movable    Device
	Gen 0        0         0        4GB         0
	Gen 1        0       1GB        1MB         0
	Gen 2      1MB       4GB        1MB         0
	Gen 3      1MB       1MB        1MB         0
Now consider there is a GFP_KERNEL allocation request (eligible zone
index <= Normal), evict_folios() will return without doing any work
since there are no pages to scan in the eligible zones of the oldest
generation. Reclaim won't make progress until triggered from a ZONE_MOVABLE
allocation request; which may not happen soon if there is a lot of free
memory in the movable zone. This can lead to OOM kills, although there
is 1GB of pages in the Normal zone of Gen 1 that we have not yet tried to
reclaim.
This issue is not seen in the conventional active/inactive LRU since
there are no per-zone lists.
If there are no (not enough) folios to scan in the eligible zones, move
folios from ineligible zone (zone_index > reclaim_index) to the next
generation. This allows for the progression of min_seq and reclaiming
from the next generation (Gen 1).
Qualcomm, Mediatek and raspberrypi [1] discovered this issue independently.
[1] https://github.com/raspberrypi/linux/issues/5395
Link: https://lkml.kernel.org/r/20230802025606.346758-1-kaleshsingh@google.com
Fixes: ac35a49023 ("mm: multi-gen LRU: minimal implementation")
Change-Id: I5bbf44bd7ffe42f4347df4be59a75c1603c9b947
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reported-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> [mediatek]
Tested-by: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Barrett <steven@liquorix.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 1462260adc41c5974362cb54ff577c2a15b8c7b2 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 288383787
Bug: 291719697
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
If you have trouble reading this new file format, please refresh your
prebuilt version of STG with repo sync.
Bug: 294213765
Change-Id: I4d7ee716231956c5f4da1343cc0db5170aaaa3b1
Signed-off-by: Giuliano Procida <gprocida@google.com>
When handling deduplicated compressed data, there can be multiple
decompressed extents pointing to the same compressed data in one shot.
In such cases, the bvecs which belong to the longest extent will be
selected as the primary bvecs for real decompressors to decode and the
other duplicated bvecs will be directly copied from the primary bvecs.
Previously, only relative offsets of the longest extent were checked to
decompress the primary bvecs. On rare occasions, it can be incorrect
if there are several extents with the same start relative offset.
As a result, some short bvecs could be selected for decompression and
then cause data corruption.
For example, as Shijie Sun reported off-list, considering the following
extents of a file:
117: 903345.. 915250 | 11905 : 385024.. 389120 | 4096
...
119: 919729.. 930323 | 10594 : 385024.. 389120 | 4096
...
124: 968881.. 980786 | 11905 : 385024.. 389120 | 4096
The start relative offset is the same: 2225, but extent 119 (919729..
930323) is shorter than the others.
Let's restrict the bvec length in addition to the start offset if bvecs
are not full.
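The numbers in the example can be checked with a small sketch. It assumes
that the start relative offset is the start position modulo a 4096-byte
page, which reproduces the 2225 quoted above; the helper names are
hypothetical and not taken from erofs.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed page size for this sketch */

/* Offset of a decompressed extent's start within its page. */
static unsigned long start_offset_in_page(unsigned long start)
{
	return start % TOY_PAGE_SIZE;
}

/* Length of the decompressed extent from start to end. */
static unsigned long extent_len(unsigned long start, unsigned long end)
{
	assert(end >= start);
	return end - start;
}
```

Extents 117, 119 and 124 all yield the same start offset, while
extent_len() gives 11905, 10594 and 11905, so the offset alone cannot tell
the short extent from the long ones.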
Reported-by: Shijie Sun <sunshijie@xiaomi.com>
Fixes: 5c2a64252c ("erofs: introduce partial-referenced pclusters")
Tested-by: Shijie Sun <sunshijie@xiaomi.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230719065459.60083-1-hsiangkao@linux.alibaba.com
(cherry picked from commit 7d15c91a75aae55767f368e8abbabd7cedf4ec94
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git dev)
Bug: 293245292
Change-Id: Ic8ded9b2d3592ffd0863f4f0d2ac4ae6a1821a1b
Signed-off-by: sunshijie <sunshijie@xiaomi.corp-partner.google.com>
ABGR64_12 is a reversed RGB format with alpha channel last,
12 bits per component like ABGR32,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
Bug: 293213303
Change-Id: Idc4e1100c9e2134a48b594151e3398f6436b010d
(cherry picked from commit 302b988ca03d83da0a7e006a57efda646c30f978)
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
BGR48_12 is a reversed RGB format with 12 bits per component like BGR24,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
Bug: 293213303
Change-Id: I27d14a33c8e2b4847a63ea05b285786766949ebf
(cherry picked from commit da0b7a400e4f39726c3c383f377fb51dbd8b0c71)
[Jindong: Fixed conflicts in .rst file and v4l2-ioctl.c]
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
YUV48_12 is a YUV format with 12-bits per component like YUV24,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
[hverkuil: replaced a . by ,]
Bug: 293213303
Change-Id: I12e6f02b99918a429224320da2127d6b4d777584
(cherry picked from commit 99c954967762976b15265ea383354095e1ed1efa)
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
Y212 is a YUV format with 12-bits per component like YUYV,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
Add the missing v4l2 format info of Y212.
Bug: 293213303
Change-Id: Ibdf9bb3a3f1eb895da9eca52d115e08b656b5153
(cherry picked from commit a178dd3bbecc3e26dfc2c72b6fe64d9bf7749de2)
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
Y012 is a luma-only format with 12-bits per pixel,
expanded to 16bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
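The sample layout described here can be sketched as follows. The function
names are hypothetical and not part of V4L2; the sketch only shows how one
12-bit sample maps onto two bytes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Store one 12-bit sample as two bytes: value in the 12 high bits,
 * zeros in the 4 low bits, low byte first (little endian).
 */
static void pack12_le(uint16_t sample, uint8_t out[2])
{
	uint16_t word;

	assert(sample <= 0x0FFF);
	word = (uint16_t)(sample << 4);
	out[0] = (uint8_t)(word & 0xFF);
	out[1] = (uint8_t)(word >> 8);
}

/* Recover the 12-bit sample from its two-byte little endian form. */
static uint16_t unpack12_le(const uint8_t in[2])
{
	uint16_t word = (uint16_t)(in[0] | (in[1] << 8));

	return (uint16_t)(word >> 4);
}
```

For example, the sample 0xABC is stored as the bytes 0xC0, 0xAB.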
Bug: 293213303
Change-Id: I1a8f73162932e0760aabbe44525d7c74ace9f7bd
(cherry picked from commit a490ea68444084ec0368c019e11ee4a7e5c8bb13)
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
P012 is a YUV format with 12-bits per component with interleaved UV,
like NV12, expanded to 16 bits.
Data in the 12 high bits, zeros in the 4 low bits,
arranged in little endian order.
P012M has two non-contiguous planes.
Bug: 293213303
Change-Id: I1fbfa7c445bc682766f479cca07eb8cb16cbb44f
(cherry picked from commit aa1080404200694aace5989f99664ca75e73b03d)
Signed-off-by: Ming Qian <ming.qian@nxp.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
Create input symbol files to generate GKI modules header
under include/config. By placing files in this generated
directory, the default filters that ignore certain files
will work without any special handling required, and the
files will also be available to inspect after the build
for debugging purposes.
abi_gki_protected_exports: Input for gki_module_protected_exports.h
From :- ${objtree}/abi_gki_protected_exports
To :- include/config/abi_gki_protected_exports
all_kmi_symbols: Input for gki_module_unprotected.h
- Rename to abi_gki_kmi_symbols
From :- all_kmi_symbols
To :- include/config/abi_gki_kmi_symbols
Bug: 286529877
Test: TH
Test: Manual verification of the generated files
Change-Id: Iafa10631e7712a8e1e87a2f56cfd614de6b1053a
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
create_open would always take its parent directory's bpf for the created
object. Modify to use the bpf stored in fuse_dentry which is set by
lookup.
Bug: 291705489
Test: fuse_test passes, adb push file /sdcard/Android/data works
Signed-off-by: Paul Lawrence <paullawrence@google.com>
Change-Id: I0a1ea2a291a8fdf67923f1827176b2ea96bd4c2d
Store the results of a negative lookup in the fuse_dentry so later
opcodes can use them to create files
Bug: 291705489
Test: fuse_test passes
Signed-off-by: Paul Lawrence <paullawrence@google.com>
Change-Id: I725e714a1d6ce43f24431d07c24e96349ef1a55c
fuse_iget_backing returns an inode or NULL, not an ERR_PTR. So check
for NULL instead.
Also make sure we put the inode if d_splice_alias fails
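Why an IS_ERR()-style check misses a NULL return can be shown with a
userspace model of the kernel's error-pointer convention. The toy_ names
are hypothetical; only the range check mirrors how the kernel encodes
errors in pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095	/* mirrors the kernel's MAX_ERRNO */

/* Encode a negative errno value in a pointer, like ERR_PTR(). */
static void *toy_err_ptr(long err)
{
	assert(err < 0 && err >= -TOY_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* True only for pointers in the top TOY_MAX_ERRNO addresses, like IS_ERR(). */
static bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-TOY_MAX_ERRNO);
}
```

Since toy_is_err(NULL) is false, a caller that only tests for an error
pointer would treat a NULL return as a valid object.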
Bug: 293349757
Test: fuse_test runs
Signed-off-by: Paul Lawrence <paullawrence@google.com>
Change-Id: I1eadad32f80bab6730e461412b4b7ab4d6c56bf2
This adds passthrough only support for ioctls with fuse-bpf.
compat_ioctls will return -ENOTTY.
Bug: 279519292
Test: F2fsMiscTest#testAtomicWrite
Change-Id: Ia3052e465d87dc1d15ae13955fba8a7f93bc387b
Signed-off-by: Daniel Rosenberg <drosen@google.com>
mbind() calls down into vma_replace_policy() without taking the per-VMA
locks, replaces the VMA's vma->vm_policy pointer, and frees the old
policy. That's bad; a concurrent page fault might still be using the
old policy (in vma_alloc_folio()), resulting in use-after-free.
Normally this will manifest as a use-after-free read first, but it can
result in memory corruption, including because vma_alloc_folio() can
call mpol_cond_put() on the freed policy, which conditionally changes
the policy's refcount member.
This bug is specific to CONFIG_NUMA, but it does also affect non-NUMA
systems as long as the kernel was built with CONFIG_NUMA.
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 293665307
(cherry picked from commit 6c21e066f9256ea1df6f88768f6ae1080b7cf509)
Change-Id: I2e3a4ee8bad97457ee3e127694f0609e7a240a2f
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
lock_vma_under_rcu() tries to guarantee that __anon_vma_prepare() can't
be called in the VMA-locked page fault path by ensuring that
vma->anon_vma is set.
However, this check happens before the VMA is locked, which means a
concurrent move_vma() can concurrently call unlink_anon_vmas(), which
disassociates the VMA's anon_vma.
This means we can get UAF in the following scenario:
  THREAD 1                   THREAD 2
  ========                   ========
  <page fault>
    lock_vma_under_rcu()
      rcu_read_lock()
      mas_walk()
      check vma->anon_vma
                             mremap() syscall
                               move_vma()
                                vma_start_write()
                                 unlink_anon_vmas()
                             <syscall end>
    handle_mm_fault()
      __handle_mm_fault()
        handle_pte_fault()
          do_pte_missing()
            do_anonymous_page()
              anon_vma_prepare()
                __anon_vma_prepare()
                  find_mergeable_anon_vma()
                    mas_walk() [looks up VMA X]
                             munmap() syscall (deletes VMA X)
                    reusable_anon_vma() [called on freed VMA X]
This is a security bug if you can hit it, although an attacker would
have to win two races at once where the first race window is only a few
instructions wide.
This patch is based on some previous discussion with Linus Torvalds on
the security list.
Cc: stable@vger.kernel.org
Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 293665307
(cherry picked from commit 657b5146955eba331e01b9a6ae89ce2e716ba306)
[surenb: removed vma_is_tcp() call not present in 6.1]
Change-Id: I4bd91e1db337ff35eb7c1d436f4372944556dd7d
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
GIC-700 erratum 2941627 may cause the GIC-700 to miss SPI wake
requests when SPIs are deactivated while targeting a
sleeping CPU, i.e. a CPU for which the redistributor:
GICR_WAKER.ProcessorSleep == 1
This runtime situation can happen if an SPI that has been
activated on a core is retargeted to a different core, it
becomes pending and the target core subsequently enters a
power state quiescing the respective redistributor.
When this situation is hit, the de-activation carried out
on the core that activated the SPI (through either ICC_EOIR1_EL1
or ICC_DIR_EL1 register writes) does not trigger a wake
request for the sleeping GIC redistributor even if the SPI
is pending.
Work around the erratum by de-activating the SPI using the
distributor GICD_ICACTIVER register if the runtime
conditions require it (i.e. the IRQ was retargeted between
activation and de-activation).
Bug: 292459437
Change-Id: Ide915b8c925a631a7fc9ccebca19d9175def162e
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230704155034.148262-1-lpieralisi@kernel.org
(cherry picked from commit 6fe5c68ee6a1aae0ef291a56001e7888de547fa2 https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git irq/irqchip-fixes)
Signed-off-by: Carlos Galo <carlosgalo@google.com>
Lockdep is certainly right to complain about
(&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f
but task is already holding lock:
(&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
Invert those to the usual ordering.
Fixes: 33313a747e81 ("mm: lock newly mapped VMA which can be modified after it becomes visible")
Cc: stable@vger.kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1c7873e3364570ec89343ff4877e0f27a7b21a61)
Change-Id: I85f9cfb6ee8f3d9fefda5518c5637a7dff64bac3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
When forking a child process, the parent write-protects anonymous pages
and COW-shares them with the child being forked using copy_present_pte().
We must not take any concurrent page faults on the source vma's as they
are being processed, as we expect both the vma and the pte's behind it
to be stable. For example, the anon_vma_fork() expects the parent's
vma->anon_vma to not change during the vma copy.
A concurrent page fault on a page newly marked read-only by the page
copy might trigger wp_page_copy() and an anon_vma_prepare(vma) on the
source vma, defeating the anon_vma_clone() that wasn't done because the
parent vma originally didn't have an anon_vma, but we now might end up
copying a pte entry for a page that has one.
Before the per-vma lock based changes, the mmap_lock guaranteed
exclusion with concurrent page faults. But now we need to do a
vma_start_write() to make sure no concurrent faults happen on this vma
while it is being processed.
This fix can potentially regress some fork-heavy workloads. Kernel
build time did not show noticeable regression on a 56-core machine while
a stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit fb49c455323ff8319a123dd312be9082c49a23a5)
Change-Id: Ic5aa9dc51a888b5b0319ec4ec6d2941424573ca0
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
mmap_region adds a newly created VMA into VMA tree and might modify it
afterwards before dropping the mmap_lock. This poses a problem for page
faults handled under per-VMA locks because they don't take the mmap_lock
and can stumble on this VMA while it's still being modified. Currently
this does not pose a problem since post-addition modifications are done
only for file-backed VMAs, which are not handled under per-VMA lock.
However, once support for handling file-backed page faults with per-VMA
locks is added, this will become a race.
Fix this by write-locking the VMA before inserting it into the VMA tree.
Other places where a new VMA is added into VMA tree do not modify it
after the insertion, so do not need the same locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 33313a747e81af9f31d0d45de78c9397fa3655eb)
Change-Id: I3bb6a7bc8dd579e11f9c18cbc8e4a6e7279bbfb2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
With recent changes necessitating mmap_lock to be held for write while
expanding a stack, per-VMA locks should follow the same rules and be
write-locked to prevent page faults into the VMA being expanded. Add
the necessary locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c137381f71aec755fbf47cd4e9bd4dce752c054c)
Change-Id: I3e6a8c89c1fb7c0669e1232176bb04ea6b09bc0a
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
In commit 8d7071af8907 ("mm: always expand the stack with the mmap write
lock held"), find_extend_vma() was no longer being used in the tree, so
it was removed. Unfortunately some GKI external module is using this,
so bring it back to allow things to continue to work.
Bug: 161946584
Fixes: 8d7071af8907 ("mm: always expand the stack with the mmap write lock held")
Change-Id: I6f1fb1fd8193625fe3dac0bbc5b0aff653b3d879
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8d7071af890768438c14db6172cc8f9f4d04e184 upstream
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.
It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[6.1: Patch drivers/iommu/io-pgfault.c instead]
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[surenb: change in io-pgfault.c was done in iommu-sva.c]
Change-Id: Icdcdded08d7ad4eda8fae1120a3c8b3d957516c1
(cherry picked from commit 8d7071af890768438c14db6172cc8f9f4d04e184)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit f313c51d26aa87e69633c9b46efb37a930faca71 upstream.
This is a small step towards a model where GUP itself would not expand
the stack, and any user that needs GUP to not look up existing mappings,
but actually expand on them, would have to do so manually before-hand,
and with the mm lock held for writing.
It turns out that execve() already did almost exactly that, except it
didn't take the mm lock at all (it's single-threaded so no locking
technically needed, but it could cause lockdep errors). And it only did
it for the CONFIG_STACK_GROWSUP case, since in that case GUP has
obviously never expanded the stack downwards.
So just make that CONFIG_STACK_GROWSUP case do the right thing with
locking, and enable it generally. This will eventually help GUP, and in
the meantime avoids a special case and the lockdep issue.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[6.1 Minor context from still having FOLL_FORCE flags set]
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I24c652740dcfc674b0aef8e09ef72f09ad61254c
(cherry picked from commit f313c51d26aa87e69633c9b46efb37a930faca71)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>