In some cases a VMA with padding representation may be split, and
therefore the padding flags must be updated accordingly.
There are 3 cases to handle:
Given:
| DDDDPPPP |
where:
- D represents 1 page of data;
- P represents 1 page of padding;
- | represents the boundaries (start/end) of the VMA
1) Split exactly at the padding boundary
| DDDDPPPP | --> | DDDD | PPPP |
- Remove padding flags from the first VMA.
- The second VMA is all padding
2) Split within the padding area
| DDDDPPPP | --> | DDDDPP | PP |
- Subtract the length of the second VMA from the first VMA's
padding.
- The second VMA is all padding, adjust its padding length (flags)
3) Split within the data area
| DDDDPPPP | --> | DD | DDPPPP |
- Remove padding flags from the first VMA.
- The second VMA has the same padding as before the split.
To simplify the semantics, merging of padding VMAs is not allowed.
If a split produces a VMA that is entirely padding, show_[s]maps()
only outputs the padding VMA entry (as the data entry is of length 0).
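As a rough standalone illustration of the three cases, here is a minimal
C sketch; the struct and the page-granular 'pad' field are stand-ins, as
the real representation encodes the padding length in VMA flags and the
split happens in the kernel's VMA-split path:

    #include <assert.h>

    struct vma_stub {
            unsigned long start;    /* first page index covered by the VMA */
            unsigned long end;      /* one past the last page index */
            unsigned long pad;      /* trailing padding, in pages */
    };

    /* Split 'first' at page index 'addr'; 'second' takes over [addr, end). */
    static void split_pad(struct vma_stub *first, struct vma_stub *second,
                          unsigned long addr)
    {
            unsigned long data_end = first->end - first->pad;

            second->start = addr;
            second->end = first->end;
            /* Case 3 hands all padding to the second VMA; in cases 1 and 2
             * the second VMA is entirely padding. */
            second->pad = addr < data_end ? first->pad : second->end - addr;

            first->end = addr;
            /* Only case 2 leaves padding below the split point. */
            first->pad = addr > data_end ? addr - data_end : 0;
    }

    int main(void)
    {
            /* Case 2: | DDDDPPPP | --> | DDDDPP | PP | */
            struct vma_stub a = { .start = 0, .end = 8, .pad = 4 }, b;

            split_pad(&a, &b, 6);
            assert(a.end == 6 && a.pad == 2);           /* DDDDPP */
            assert(b.pad == b.end - b.start);           /* all padding */
            return 0;
    }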
Bug: 330117029
Bug: 327600007
Bug: 330767927
Bug: 328266487
Bug: 329803029
Change-Id: Ie2628ced5512e2c7f8af25fabae1f38730c8bb1a
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
directly replacing the entries of VMAs in the new maple tree can result in
better performance. __mt_dup() uses DFS pre-order to duplicate the maple
tree, so it is efficient.
The average time complexity of __mt_dup() is O(n), where n is the number
of VMAs. The proof of the time complexity is provided in the commit log
that introduces __mt_dup(). After duplicating the maple tree, each
element is traversed and replaced (ignoring the cases of deletion, which
are rare). Since it is only a replacement operation for each element,
this process is also O(n).
Analyzing the exact time complexity of the previous algorithm is
challenging because each insertion can involve appending to a node,
pushing data to adjacent nodes, or even splitting nodes. The frequency of
each action is difficult to calculate. The worst-case scenario for a
single insertion is when the tree undergoes splitting at every level. If
we consider each insertion as the worst-case scenario, we can determine
that the upper bound of the time complexity is O(n*log(n)), although this
is a loose upper bound. However, based on the test data, it appears that
the actual time complexity is likely to be O(n).
As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
fails, there will be a portion of VMAs that have not been duplicated in
the maple tree. To handle this, we mark the failure point with
XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop
releasing VMAs that have not been duplicated after this point.
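An abridged sketch of the pattern (error handling and VMA setup trimmed;
per the bracketed note below, the 6.1 backport open-codes
vma_iter_clear_gfp() and vma_iter_bulk_store()):

    retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
    if (unlikely(retval))
            goto out;

    for_each_vma(vmi, mpnt) {
            tmp = vm_area_dup(mpnt);        /* may fail under memory pressure */
            if (!tmp)
                    goto fail;
            /* ... anon_vma, file and page-table setup ... */
            vma_iter_bulk_store(&vmi, tmp); /* replace the old entry in place */
    }
    /* ... */
    fail:
            /* Mark the failure point; exit_mmap() checks xa_is_zero() and
             * stops freeing here, since entries past this point were never
             * duplicated. */
            mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
            mas_store(&vmi.mas, XA_ZERO_ENTRY);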
There is a "spawn" in byte-unixbench[1], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different numbers of VMAs.
Below are the test results. The first row shows the number of VMAs. The
second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and this patchset, respectively. The
test results were obtained with CPU binding to avoid scheduler load
balancing that could cause unstable results. There are still some
fluctuations in the test results, but at least they are better than the
original performance.
VMAs             21     121     221     421     821    1621    3221    6421   12821   25621   51221
next-20231006 112100   76261   54227   34035   20195   11112    6017    3161    1606     802     393
patchset      114558   83067   65008   45824   28751   16072    8922    4747    2436    1233     599
gain           2.19%   8.92%  19.88%  34.64%  42.37%  44.64%  48.28%  50.17%  51.68%  53.74%  52.42%
[1] https://github.com/kdlucas/byte-unixbench/tree/master
Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit d2406291483775ecddaee929231a39c70c08fda2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
[surenb: open-coded vma_iter_clear_gfp(), vma_iter_bulk_store();
replaced vma_next() with mas_find()]
Bug: 308042511
Change-Id: I42d6620e8ce6a0b16211c231a9b72ba16ba9c0d2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Changes in 6.1.62
ASoC: simple-card: fixup asoc_simple_probe() error handling
coresight: tmc-etr: Disable warnings for allocation failures
ASoC: tlv320adc3xxx: BUG: Correct micbias setting
net: sched: cls_u32: Fix allocation size in u32_init()
irqchip/riscv-intc: Mark all INTC nodes as initialized
irqchip/stm32-exti: add missing DT IRQ flag translation
dmaengine: ste_dma40: Fix PM disable depth imbalance in d40_probe
powerpc/85xx: Fix math emulation exception
Input: synaptics-rmi4 - handle reset delay when using SMBus transport
fbdev: atyfb: only use ioremap_uc() on i386 and ia64
fs/ntfs3: Add check in ni_update_parent()
fs/ntfs3: Write immediately updated ntfs state
fs/ntfs3: Use kvmalloc instead of kmalloc(... __GFP_NOWARN)
fs/ntfs3: Fix possible NULL-ptr-deref in ni_readpage_cmpr()
fs/ntfs3: Fix NULL pointer dereference on error in attr_allocate_frame()
fs/ntfs3: Fix directory element type detection
fs/ntfs3: Avoid possible memory leak
spi: npcm-fiu: Fix UMA reads when dummy.nbytes == 0
netfilter: nfnetlink_log: silence bogus compiler warning
efi: fix memory leak in krealloc failure handling
ASoC: rt5650: fix the wrong result of key button
ASoC: codecs: tas2780: Fix log of failed reset via I2C.
drm/ttm: Reorder sys manager cleanup step
fbdev: omapfb: fix some error codes
fbdev: uvesafb: Call cn_del_callback() at the end of uvesafb_exit()
scsi: mpt3sas: Fix in error path
drm/amdgpu: Unset context priority is now invalid
gpu/drm: Eliminate DRM_SCHED_PRIORITY_UNSET
LoongArch: Export symbol invalid_pud_table for modules building
LoongArch: Replace kmap_atomic() with kmap_local_page() in copy_user_highpage()
netfilter: nf_tables: audit log object reset once per table
platform/mellanox: mlxbf-tmfifo: Fix a warning message
drm/amdgpu: Reserve fences for VM update
net: chelsio: cxgb4: add an error code check in t4_load_phy_fw
r8152: Check for unplug in rtl_phy_patch_request()
r8152: Check for unplug in r8153b_ups_en() / r8153c_ups_en()
powerpc/mm: Fix boot crash with FLATMEM
io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed
ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock
power: supply: core: Use blocking_notifier_call_chain to avoid RCU complaint
perf evlist: Avoid frequency mode for the dummy event
x86: KVM: SVM: always update the x2avic msr interception
mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer
mmap: fix error paths with dup_anon_vma()
ALSA: usb-audio: add quirk flag to enable native DSD for McIntosh devices
PCI: Prevent xHCI driver from claiming AMD VanGogh USB3 DRD device
usb: storage: set 1.50 as the lower bcdDevice for older "Super Top" compatibility
usb: typec: tcpm: Fix NULL pointer dereference in tcpm_pd_svdm()
usb: raw-gadget: properly handle interrupted requests
tty: n_gsm: fix race condition in status line change on dead connections
tty: 8250: Remove UC-257 and UC-431
tty: 8250: Add support for additional Brainboxes UC cards
tty: 8250: Add support for Brainboxes UP cards
tty: 8250: Add support for Intashield IS-100
tty: 8250: Fix port count of PX-257
tty: 8250: Fix up PX-803/PX-857
tty: 8250: Add support for additional Brainboxes PX cards
tty: 8250: Add support for Intashield IX cards
tty: 8250: Add Brainboxes Oxford Semiconductor-based quirks
misc: pci_endpoint_test: Add deviceID for J721S2 PCIe EP device support
ALSA: hda: intel-dsp-config: Fix JSL Chromebook quirk detection
ASoC: SOF: sof-pci-dev: Fix community key quirk detection
Linux 6.1.62
Change-Id: I2f696c88b48e82eb0d925a26ce6716693595d421
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 824135c46b00df7fb369ec7f1f8607427bbebeb0 upstream.
When the calling function fails after the dup_anon_vma(), the
duplication of the anon_vma is not being undone. Add the necessary
unlink_anon_vmas() call to the error paths that are missing them.
This issue showed up during inspection of the error path in vma_merge()
for an unrelated vma iterator issue.
Users may experience increased memory usage, which may be problematic as
the failure would likely be caused by a low memory situation.
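The shape of the fix, sketched: the upstream patch has dup_anon_vma()
report the cloned anon_vma through an extra output argument so the late
error exits can undo the duplication:

    error = dup_anon_vma(dst, src, &anon_dup);
    if (error)
            goto anon_vma_fail;
    if (vma_iter_prealloc(vmi))
            goto prealloc_fail;
    /* ... */
    prealloc_fail:
            /* Previously missing: undo the anon_vma duplication. */
            if (anon_dup)
                    unlink_anon_vmas(anon_dup);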
Link: https://lkml.kernel.org/r/20230929183041.2835469-3-Liam.Howlett@oracle.com
Fixes: d4af56c5c7 ("mm: start tracking VMAs with maple tree")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 6.1.61
KVM: x86/pmu: Truncate counter value to allowed width on write
mmc: core: Align to common busy polling behaviour for mmc ioctls
mmc: block: ioctl: do write error check for spi
mmc: core: Fix error propagation for some ioctl commands
ASoC: codecs: wcd938x: Convert to platform remove callback returning void
ASoC: codecs: wcd938x: Simplify with dev_err_probe
ASoC: codecs: wcd938x: fix regulator leaks on probe errors
ASoC: codecs: wcd938x: fix runtime PM imbalance on remove
pinctrl: qcom: lpass-lpi: fix concurrent register updates
mcb: Return actual parsed size when reading chameleon table
mcb-lpc: Reallocate memory region to avoid memory overlapping
virtio_balloon: Fix endless deflation and inflation on arm64
virtio-mmio: fix memory leak of vm_dev
virtio-crypto: handle config changed by work queue
virtio_pci: fix the common cfg map size
vsock/virtio: initialize the_virtio_vsock before using VQs
vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE
arm64: dts: rockchip: Add i2s0-2ch-bus-bclk-off pins to RK3399
arm64: dts: rockchip: Fix i2s0 pin conflict on ROCK Pi 4 boards
mm: fix vm_brk_flags() to not bail out while holding lock
hugetlbfs: clear resv_map pointer if mmap fails
mm/page_alloc: correct start page when guard page debug is enabled
mm/migrate: fix do_pages_move for compat pointers
hugetlbfs: extend hugetlb_vma_lock to private VMAs
maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
nfsd: lock_rename() needs both directories to live on the same fs
drm/i915/pmu: Check if pmu is closed before stopping event
drm/amd: Disable ASPM for VI w/ all Intel systems
drm/dp_mst: Fix NULL deref in get_mst_branch_device_by_guid_helper()
ARM: OMAP: timer32K: fix all kernel-doc warnings
firmware/imx-dsp: Fix use_after_free in imx_dsp_setup_channels()
clk: ti: Fix missing omap4 mcbsp functional clock and aliases
clk: ti: Fix missing omap5 mcbsp functional clock and aliases
r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx
r8169: fix the KCSAN reported data-race in rtl_tx while reading TxDescArray[entry].opts1
r8169: fix the KCSAN reported data race in rtl_rx while reading desc->opts1
iavf: initialize waitqueues before starting watchdog_task
i40e: Fix I40E_FLAG_VF_VLAN_PRUNING value
treewide: Spelling fix in comment
igb: Fix potential memory leak in igb_add_ethtool_nfc_entry
neighbour: fix various data-races
igc: Fix ambiguity in the ethtool advertising
net: ethernet: adi: adin1110: Fix uninitialized variable
net: ieee802154: adf7242: Fix some potential buffer overflow in adf7242_stats_show()
net: usb: smsc95xx: Fix uninit-value access in smsc95xx_read_reg
r8152: Increase USB control msg timeout to 5000ms as per spec
r8152: Run the unload routine if we have errors during probe
r8152: Cancel hw_phy_work if we have an error in probe
r8152: Release firmware if we have an error in probe
tcp: fix wrong RTO timeout when received SACK reneging
gtp: uapi: fix GTPA_MAX
gtp: fix fragmentation needed check with gso
i40e: Fix wrong check for I40E_TXR_FLAGS_WB_ON_ITR
drm/logicvc: Kconfig: select REGMAP and REGMAP_MMIO
iavf: in iavf_down, disable queues when removing the driver
scsi: sd: Introduce manage_shutdown device flag
blk-throttle: check for overflow in calculate_bytes_allowed
kasan: print the original fault addr when access invalid shadow
io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
iio: afe: rescale: Accept only offset channels
iio: exynos-adc: request second interrupt only when touchscreen mode is used
iio: adc: xilinx-xadc: Don't clobber preset voltage/temperature thresholds
iio: adc: xilinx-xadc: Correct temperature offset/scale for UltraScale
i2c: muxes: i2c-mux-pinctrl: Use of_get_i2c_adapter_by_node()
i2c: muxes: i2c-mux-gpmux: Use of_get_i2c_adapter_by_node()
i2c: muxes: i2c-demux-pinctrl: Use of_get_i2c_adapter_by_node()
i2c: stm32f7: Fix PEC handling in case of SMBUS transfers
i2c: aspeed: Fix i2c bus hang in slave read
tracing/kprobes: Fix the description of variable length arguments
misc: fastrpc: Reset metadata buffer to avoid incorrect free
misc: fastrpc: Free DMA handles for RPC calls with no arguments
misc: fastrpc: Clean buffers on remote invocation failures
misc: fastrpc: Unmap only if buffer is unmapped from DSP
nvmem: imx: correct nregs for i.MX6ULL
nvmem: imx: correct nregs for i.MX6SLL
nvmem: imx: correct nregs for i.MX6UL
x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
x86/cpu: Add model number for Intel Arrow Lake mobile processor
perf/core: Fix potential NULL deref
sparc32: fix a braino in fault handling in csum_and_copy_..._user()
clk: Sanitize possible_parent_show to Handle Return Value of of_clk_get_parent_name
platform/x86: Add s2idle quirk for more Lenovo laptops
ext4: add two helper functions extent_logical_end() and pa_logical_end()
ext4: fix BUG in ext4_mb_new_inode_pa() due to overflow
ext4: avoid overlapping preallocations due to overflow
objtool/x86: add missing embedded_insn check
Linux 6.1.61
Note, this merge point also reverts commit
bb20a245df which is commit
24eca2dce0f8d19db808c972b0281298d0bafe99 upstream, as it conflicts with
the previous reverts for ABI issues, AND is an ABI break in itself. If
it is needed in the future, it can be brought back in an abi-safe way.
Change-Id: I425bfa3be6d65328e23affd52d10b827aea6e44a
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit e0f81ab1e4f42ffece6440dc78f583eb352b9a71 upstream.
Calling vm_brk_flags() with flags set other than VM_EXEC will exit the
function without releasing the mmap_write_lock.
Just do the sanity check before the lock is acquired. This doesn't fix an
actual issue since no caller sets a flag other than VM_EXEC.
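Sketch of the reordering:

    int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
    {
            struct mm_struct *mm = current->mm;
            /* ... */
            /* Until this fix, the check sat after mmap_write_lock_killable()
             * and the early return leaked the write lock. */
            if ((flags & (~VM_EXEC)) != 0)
                    return -EINVAL;

            if (mmap_write_lock_killable(mm))
                    return -EINTR;
            /* ... */
    }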
Link: https://lkml.kernel.org/r/20230929171937.work.697-kees@kernel.org
Fixes: 2e7ce7d354 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()")
Signed-off-by: Sebastian Ott <sebott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
vma_prepare() is currently the central place where vmas are being locked
before vma_complete() applies changes to them. While this is convenient,
it also obscures vma locking and makes it harder to follow the locking
rules. Move vma locking out of vma_prepare() and take vma locks
explicitly at the locations where vmas are being modified. Move vma
locking out of dup_anon_vma() and replace it with an assertion to further
clarify the locking pattern inside vma_merge().
Link: https://lkml.kernel.org/r/20230804152724.3090321-7-surenb@google.com
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit b1985ca5e7e6464d205a98a78cca229224346c21
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
[surenb: skip changes in vma_prepare() which does not exist, skip
changes in vma_merge() since required locks are already in __vma_adjust(),
skip change in dup_anon_vma() since required locks are already in place,
skip unnecessary lock in do_brk_flags()]
Bug: 293665307
Change-Id: I99261aa1db3bec73795e63c333768bc68da8045c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
While it's not strictly necessary to lock a newly created vma before
adding it into the vma tree (as long as no further changes are performed
to it), it seems like a good policy to lock it and prevent accidental
changes after it becomes visible to the page faults. Lock the vma before
adding it into the vma tree.
Link: https://lkml.kernel.org/r/20230804152724.3090321-6-surenb@google.com
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit c3249c06c48dda30f93e62b57773d5ed409d4f77
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
[surenb: resolved conflicts due to changes in vma_merge() and
__vma_adjust()]
Bug: 293665307
Change-Id: I4ee0d2abcc8a3f45545f470f1bf7f0be728d6f44
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Despite its name, mm_drop_all_locks() does not drop _all_ locks; the mmap
lock is held write-locked by the caller, and the caller is responsible for
dropping the mmap lock at a later point (which will also release the VMA
locks).
Calling vma_end_write_all() here is dangerous because the caller might
have write-locked a VMA with the expectation that it will stay
write-locked until the mmap_lock is released, as usual.
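For context, per-VMA write locks in this scheme are sequence-based, which
is why releasing them is a single global operation; a sketch of the helper
as of this era:

    static inline void vma_end_write_all(struct mm_struct *mm)
    {
            mmap_assert_write_locked(mm);
            /* A VMA is write-locked iff vma->vm_lock_seq == mm->mm_lock_seq,
             * so bumping mm_lock_seq unlocks every write-locked VMA at once. */
            WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
    }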
This _almost_ becomes a problem in the following scenario:
An anonymous VMA A and an SGX VMA B are mapped adjacent to each other.
Userspace calls munmap() on a range starting at the start address of A and
ending in the middle of B.
Hypothetical call graph with additional notes in brackets:
do_vmi_align_munmap
  [begin first for_each_vma_range loop]
  vma_start_write [on VMA A]
  vma_mark_detached [on VMA A]
  __split_vma [on VMA B]
    sgx_vma_open [== new->vm_ops->open]
      sgx_encl_mm_add
        __mmu_notifier_register [luckily THIS CAN'T ACTUALLY HAPPEN]
          mm_take_all_locks
          mm_drop_all_locks
            vma_end_write_all [drops VMA lock taken on VMA A before]
  vma_start_write [on VMA B]
  vma_mark_detached [on VMA B]
  [end first for_each_vma_range loop]
  vma_iter_clear_gfp [removes VMAs from maple tree]
  mmap_write_downgrade
  unmap_region
  mmap_read_unlock
In this hypothetical scenario, while do_vmi_align_munmap() thinks it still
holds a VMA write lock on VMA A, the VMA write lock has actually been
invalidated inside __split_vma().
The call from sgx_encl_mm_add() to __mmu_notifier_register() can't
actually happen here, as far as I understand, because we are duplicating
an existing SGX VMA, but sgx_encl_mm_add() only calls
__mmu_notifier_register() for the first SGX VMA created in a given
process. So this could only happen in fork(), not on munmap(). But in my
view it is just pure luck that this can't happen.
Also, we wouldn't actually have any bad consequences from this in
do_vmi_align_munmap(), because by the time the bug drops the lock on VMA
A, we've already marked VMA A as detached, which makes it completely
ineligible for any VMA-locked page faults. But again, that's just pure
luck.
So remove the vma_end_write_all(), so that VMA write locks are only ever
released on mmap_write_unlock() or mmap_write_downgrade().
Also add comments to document the locking rules established by this patch.
Link: https://lkml.kernel.org/r/20230720193436.454247-1-jannh@google.com
Fixes: eeff9a5d47f8 ("mm/mmap: prevent pagefault handler from racing with mmu_notifier registration")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 28ed252b44fb2f1efaef1287eea267d54e79f7d5
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-unstable)
Bug: 293665307
Change-Id: Ic0b28229d175e3125de1ef274282fbf43b556db7
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
When VMAs are merged, dup_anon_vma() is called with `dst` pointing to the
VMA that is being expanded to cover the area previously occupied by
another VMA. This currently happens while `dst` is not write-locked.
This means that, in the `src->anon_vma && !dst->anon_vma` case, as soon as
the assignment `dst->anon_vma = src->anon_vma` has happened, concurrent
page faults can happen on `dst` under the per-VMA lock. This is already
icky in itself, since such page faults can now install pages into `dst`
that are attached to an `anon_vma` that is not yet tied back to the
`anon_vma` with an `anon_vma_chain`. But if `anon_vma_clone()` fails due
to an out-of-memory error, things get much worse: `anon_vma_clone()` then
reverts `dst->anon_vma` back to NULL, and `dst` remains completely
unconnected to the `anon_vma`, even though we can have pages in the area
covered by `dst` that point to the `anon_vma`.
This means the `anon_vma` of such pages can be freed while the pages are
still mapped into userspace, which leads to UAF when a helper like
folio_lock_anon_vma_read() tries to look up the anon_vma of such a page.
This theoretically is a security bug, but I believe it is really hard to
actually trigger as an unprivileged user because it requires that you can
make an order-0 GFP_KERNEL allocation fail, and the page allocator tries
pretty hard to prevent that.
I think doing the vma_start_write() call inside dup_anon_vma() is the most
straightforward fix for now.
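Sketch of that fix (the 6.1 backport, which lacks dup_anon_vma(), adds
the call directly before the anon_vma assignment instead, per the
bracketed note below):

    if (src->anon_vma && !dst->anon_vma) {
            /* Block concurrent per-VMA-lock faults on dst before publishing
             * an anon_vma that anon_vma_clone() may roll back on OOM. */
            vma_start_write(dst);
            dst->anon_vma = src->anon_vma;
            return anon_vma_clone(dst, src);
    }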
For a kernel-assisted reproducer, see the notes section of the patch mail.
Link: https://lkml.kernel.org/r/20230721034643.616851-1-jannh@google.com
Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it")
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit d8ab9f7b644a2c9b64de405c1953c905ff219dc9)
[surenb: since dup_anon_vma() is missing, add vma_start_write() directly
before anon_vma is assigned]
Bug: 293665307
Change-Id: I1b44e6278e464157e666cc5dbdb0fcc29bcf665e
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
based on commit 0503ea8f5ba73eb3ab13a81c1eefbaf51405385a upstream.
This was inadvertently fixed during the removal of __vma_adjust().
When __vma_adjust() is adjusting next with a negative value (pushing
vma->vm_end lower), there would be two writes to the maple tree. The
first write is unnecessary and uses all allocated nodes in the maple
state. The second write is necessary but will need to allocate nodes
since the first write has used the allocated nodes. This may be a
problem as it may not be safe to allocate at this time, such as a low
memory situation. Fix the issue by avoiding the first write and only
write the adjusted "next" VMA.
Reported-by: John Hsu <John.Hsu@mediatek.com>
Link: https://lore.kernel.org/lkml/9cb8c599b1d7f9c1c300d1a334d5eb70ec4d7357.camel@mediatek.com/
Cc: stable@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a02c6dc0ef
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git linux-6.1.y)
Bug: 295269894
Change-Id: I1a4bdc080d4ee92dbe06dc788961532d0c85fd7c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Lockdep is certainly right to complain about
(&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f
but task is already holding lock:
(&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
Invert those to the usual ordering.
Fixes: 33313a747e81 ("mm: lock newly mapped VMA which can be modified after it becomes visible")
Cc: stable@vger.kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1c7873e3364570ec89343ff4877e0f27a7b21a61)
Change-Id: I85f9cfb6ee8f3d9fefda5518c5637a7dff64bac3
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
mmap_region adds a newly created VMA into VMA tree and might modify it
afterwards before dropping the mmap_lock. This poses a problem for page
faults handled under per-VMA locks because they don't take the mmap_lock
and can stumble on this VMA while it's still being modified. Currently
this does not pose a problem since post-addition modifications are done
only for file-backed VMAs, which are not handled under per-VMA lock.
However, once support for handling file-backed page faults with per-VMA
locks is added, this will become a race.
Fix this by write-locking the VMA before inserting it into the VMA tree.
Other places where a new VMA is added into VMA tree do not modify it
after the insertion, so do not need the same locking.
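Sketch of the resulting order in mmap_region() (iterator helpers vary
between kernel versions):

    /* Lock the new VMA before it becomes reachable by lockless lookup. */
    vma_start_write(vma);
    mas_store_prealloc(&mas, vma);  /* VMA is now visible in the tree */
    /* ... post-insertion setup of file-backed mappings ... */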
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 33313a747e81af9f31d0d45de78c9397fa3655eb)
Change-Id: I3bb6a7bc8dd579e11f9c18cbc8e4a6e7279bbfb2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
With recent changes necessitating mmap_lock to be held for write while
expanding a stack, per-VMA locks should follow the same rules and be
write-locked to prevent page faults into the VMA being expanded. Add
the necessary locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c137381f71aec755fbf47cd4e9bd4dce752c054c)
Change-Id: I3e6a8c89c1fb7c0669e1232176bb04ea6b09bc0a
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
In commit 8d7071af8907 ("mm: always expand the stack with the mmap write
lock held"), find_extend_vma() was no longer being used in the tree, so
it was removed. Unfortunately some GKI external module is using this,
so bring it back to allow things to continue to work.
Bug: 161946584
Fixes: 8d7071af8907 ("mm: always expand the stack with the mmap write lock held")
Change-Id: I6f1fb1fd8193625fe3dac0bbc5b0aff653b3d879
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8d7071af890768438c14db6172cc8f9f4d04e184 upstream
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.
It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.
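A sketch of the new convention:

    struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma, *prev;

            /* Trade the read lock for the write lock; the caller's old vma
             * pointer must be treated as stale from here on. */
            mmap_read_unlock(mm);
            if (mmap_write_lock_killable(mm))
                    return NULL;

            vma = find_vma_prev(mm, addr, &prev);
            /* ... expand the stack VMA as appropriate under the write lock;
             * on failure, mmap_write_unlock() and return NULL ... */

            mmap_write_downgrade(mm);
            return vma;
    }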
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[6.1: Patch drivers/iommu/io-pgfault.c instead]
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[surenb: change in io-pgfault.c was done in iommu-sva.c]
Change-Id: Icdcdded08d7ad4eda8fae1120a3c8b3d957516c1
(cherry picked from commit 8d7071af890768438c14db6172cc8f9f4d04e184)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit f440fa1ac955e2898893f9301568435eb5cdfc4b upstream.
Make calls to extend_vma() and find_extend_vma() fail if the write lock
is required.
To avoid making this a flag-day event, this still allows the old
read-locking case for the trivial situations, and passes in a flag to
say "is it write-locked". That way write-lockers can say "yes, I'm
being careful", and legacy users will continue to work in all the common
cases until they have been fully converted to the new world order.
Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: If12d2d68429b6d71393f02d5ed7e6939c3cd5405
(cherry picked from commit f440fa1ac955e2898893f9301568435eb5cdfc4b)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8b35ca3e45e35a26a21427f35d4093606e93ad0a upstream.
arm has an additional check for address < FIRST_USER_ADDRESS before
expanding the stack. Since FIRST_USER_ADDRESS is defined everywhere
(generally as 0), move that check to the generic expand_downwards().
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: Ie1090f587090ef16de4bce224bbc52334bfe78fa
(cherry picked from commit 8b35ca3e45e35a26a21427f35d4093606e93ad0a)
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 6c26bd4384da24841bac4f067741bbca18b0fb74 upstream.
If mas_store_gfp() in the gather loop failed, the 'error' variable that
ultimately gets returned was not being set. In many cases, its original
value of -ENOMEM was still in place, and that was fine. But if VMAs had
been split at the start or end of the range, then 'error' could be zero.
Change to the 'error = foo(); if (error) goto …' idiom to fix the bug.
Also clean up a later case which avoided the same bug by *explicitly*
setting error = -ENOMEM right before calling the function that might
return -ENOMEM.
In a final cosmetic change, move the 'Point of no return' comment to
*after* the goto. That's been in the wrong place since the preallocation
was removed, and this new error path was added.
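The shape of the fix described above:

    /* Before: a zero 'error' left over from an earlier split could be
     * returned even though the store failed. */
    if (mas_store_gfp(&mas_detach, vma, GFP_KERNEL))
            goto munmap_gather_failed;

    /* After: the returned error is always what the failing call produced. */
    error = mas_store_gfp(&mas_detach, vma, GFP_KERNEL);
    if (error)
            goto munmap_gather_failed;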
Fixes: 606c812eb1d5 ("mm/mmap: Fix error path in do_vmi_align_munmap()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 42a018a796)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I5da7b1e126968e174e733d45ff24439089de60af
commit 606c812eb1d5b5fb0dd9e330ca94b52d7c227830 upstream
The error unrolling was leaving the VMAs detached in many cases and
leaving the locked_vm statistic altered, and skipping the unrolling
entirely in the case of the vma tree write failing.
Fix the error path by re-attaching the detached VMAs and adding the
necessary goto for the failed vma tree write, and fix the locked_vm
statistic by only updating after the vma tree write succeeds.
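Sketch of the repaired error path in the upstream version (the bracketed
backport notes below describe how this is adapted for 6.1):

    userfaultfd_error:
    munmap_gather_failed:
    end_split_failed:
            /* Walk the side tree of gathered VMAs and re-attach them. */
            mas_set(&mas_detach, 0);
            mas_for_each(&mas_detach, next, end)
                    vma_mark_detached(next, false);
            __mt_destroy(&mt_detach);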
Fixes: 763ecb0350 ("mm: remove the vma linked list")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ dwmw2: Strictly, the original patch wasn't *re-attaching* the
detached VMAs. They *were* still attached but just had
the 'detached' flag set, which is an optimisation. Which
doesn't exist in 6.3, so drop that. Also drop the call
to vma_start_write() which came in with the per-VMA
locking in 6.4. ]
[ dwmw2 (6.1): It's do_mas_align_munmap() here. And has two call
sites for the now-removed munmap_sidetree() function.
Inline them both rather than trying to backport various
dependencies with potentially subtle interactions. ]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[surenb: added needed vma_start_write and vma_mark_detached calls]
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I1e42347ecf9eb46077739a267ac00264f94fa59a
based on commit 0503ea8f5ba73eb3ab13a81c1eefbaf51405385a upstream.
This was inadvertently fixed during the removal of __vma_adjust().
When __vma_adjust() is adjusting next with a negative value (pushing
vma->vm_end lower), there would be two writes to the maple tree. The
first write is unnecessary and uses all allocated nodes in the maple
state. The second write is necessary but will need to allocate nodes
since the first write has used the allocated nodes. This may be a
problem as it may not be safe to allocate at this time, such as a low
memory situation. Fix the issue by avoiding the first write and only
write the adjusted "next" VMA.
Reported-by: John Hsu <John.Hsu@mediatek.com>
Link: https://lore.kernel.org/lkml/9cb8c599b1d7f9c1c300d1a334d5eb70ec4d7357.camel@mediatek.com/
Cc: stable@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8d7071af890768438c14db6172cc8f9f4d04e184 upstream
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.
It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[6.1: Patch drivers/iommu/io-pgfault.c instead]
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f440fa1ac955e2898893f9301568435eb5cdfc4b upstream.
Make calls to extend_vma() and find_extend_vma() fail if the write lock
is required.
To avoid making this a flag-day event, this still allows the old
read-locking case for the trivial situations, and passes in a flag to
say "is it write-locked". That way write-lockers can say "yes, I'm
being careful", and legacy users will continue to work in all the common
cases until they have been fully converted to the new world order.
Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8b35ca3e45e35a26a21427f35d4093606e93ad0a upstream.
arm has an additional check for address < FIRST_USER_ADDRESS before
expanding the stack. Since FIRST_USER_ADDRESS is defined everywhere
(generally as 0), move that check to the generic expand_downwards().
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6c26bd4384da24841bac4f067741bbca18b0fb74 upstream.
If mas_store_gfp() in the gather loop failed, the 'error' variable that
ultimately gets returned was not being set. In many cases, its original
value of -ENOMEM was still in place, and that was fine. But if VMAs had
been split at the start or end of the range, then 'error' could be zero.
Change to the 'error = foo(); if (error) goto …' idiom to fix the bug.
Also clean up a later case which avoided the same bug by *explicitly*
setting error = -ENOMEM right before calling the function that might
return -ENOMEM.
In a final cosmetic change, move the 'Point of no return' comment to
*after* the goto. That's been in the wrong place since the preallocation
was removed, and this new error path was added.
Fixes: 606c812eb1d5 ("mm/mmap: Fix error path in do_vmi_align_munmap()")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: stable@vger.kernel.org
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 606c812eb1d5b5fb0dd9e330ca94b52d7c227830 upstream
The error unrolling was leaving the VMAs detached in many cases and
leaving the locked_vm statistic altered, and skipping the unrolling
entirely in the case of the vma tree write failing.
Fix the error path by re-attaching the detached VMAs and adding the
necessary goto for the failed vma tree write, and fix the locked_vm
statistic by only updating after the vma tree write succeeds.
Fixes: 763ecb0350 ("mm: remove the vma linked list")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ dwmw2: Strictly, the original patch wasn't *re-attaching* the
detached VMAs. They *were* still attached but just had
the 'detached' flag set, which is an optimisation. Which
doesn't exist in 6.3, so drop that. Also drop the call
to vma_start_write() which came in with the per-VMA
locking in 6.4. ]
[ dwmw2 (6.1): It's do_mas_align_munmap() here. And has two call
sites for the now-removed munmap_sidetree() function.
Inline them both rather than trying to backport various
dependencies with potentially subtle interactions. ]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 52ace503ecf894ec2f63b8137f181868ea61d95a.
The issue that required the revert is fixed by:
0257d9908d38 ("maple_tree: make maple state reusable after mas_empty_area()")
Bug: 281094761
Change-Id: I97b45525689097d0c1369f81a994d50f0662c9c2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
call_rcu() can take a long time when callback offloading is enabled. Its
use in the vm_area_free can cause regressions in the exit path when
multiple VMAs are being freed.
Because exit_mmap() is called only after the last mm user drops its
refcount, the page fault handlers can't be racing with it. Any other
possible users, like the oom-reaper or process_mrelease, are already synchronized
using mmap_lock. Therefore exit_mmap() can free VMAs directly, without
the use of call_rcu().
Expose __vm_area_free() and use it from exit_mmap() to avoid possible
call_rcu() floods and performance regressions caused by it.
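Sketch of the split (field and callback names as in the per-VMA lock
series):

    void __vm_area_free(struct vm_area_struct *vma)
    {
            free_anon_vma_name(vma);
            kmem_cache_free(vm_area_cachep, vma);
    }

    void vm_area_free(struct vm_area_struct *vma)
    {
    #ifdef CONFIG_PER_VMA_LOCK
            /* Fault-visible paths still wait out an RCU grace period... */
            call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb);
    #else
            __vm_area_free(vma);
    #endif
    }
    /* ...while exit_mmap() now calls __vm_area_free() directly. */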
Link: https://lkml.kernel.org/r/20230227173632.3292573-33-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0d2ebf9c3f7822e7ba3e4792ea3b6b19aa2da34a)
Bug: 161210518
Change-Id: I4fbf3ef38fdb22a3c80dcc61125ec21d2c426100
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
The per-VMA locking mechanism will search for a VMA under RCU protection
and then, after locking it, has to ensure it was not removed from the VMA
tree after we found it. To make this check efficient, introduce a
vma->detached flag to mark VMAs which were removed from the VMA tree.
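Sketch of the lookup-side check this enables (shaped after the upstream
lock_vma_under_rcu()):

    vma = mas_walk(&mas);
    if (!vma || !vma_start_read(vma))
            goto inval;
    /* The VMA may have been removed from the tree between the lookup and
     * taking the lock; the flag makes that check O(1). */
    if (vma->detached) {
            vma_end_read(vma);
            goto inval;
    }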
Link: https://lkml.kernel.org/r/20230227173632.3292573-23-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 457f67be5910a2b5f1fda8af06bfe4d3492a0a4f)
[surenb: vma_complete does not exist in 6.1, therefore patch is adjusted
to mark VMAs detached directly in vma_expand and __vma_adjust]
Bug: 161210518
Change-Id: Id1f31733cb7a36f3f1294b2be83cf3b87ba3f812
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Page fault handlers might need to fire MMU notifications while a new
notifier is being registered. Modify mm_take_all_locks to write-lock all
VMAs and prevent this race with page fault handlers that would hold VMA
locks. VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
locking order as in page fault handlers.
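Sketch of the added loop:

    mmap_assert_write_locked(mm);
    /* ... */
    /* Write-lock every VMA first, matching the fault-path lock order
     * (VMA lock -> i_mmap_rwsem -> anon_vma). */
    for_each_vma(vmi, vma) {
            if (signal_pending(current))
                    goto out_unlock;
            vma_start_write(vma);
    }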
Link: https://lkml.kernel.org/r/20230227173632.3292573-22-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit eeff9a5d47f89bc641034fea05501c8a6de131cb)
Bug: 161210518
Change-Id: I4176bf0e1b07f03dfc1ac7dd37d7941d5a1dbc02
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Normally free_pgtables needs to lock affected VMAs except for the case
when VMAs were isolated under VMA write-lock. munmap() does just that,
isolating while holding appropriate locks and then downgrading mmap_lock
and dropping per-VMA locks before freeing page tables. Add a parameter to
free_pgtables() for this scenario.
Link: https://lkml.kernel.org/r/20230227173632.3292573-20-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 98e51a2239d9d419d819cd61a2e720ebf19a8b0a)
Bug: 161210518
Change-Id: I3c9177cce187526407754baf7641d3741ca7b0cb
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Write-lock the VMA before copying it, and write-lock the new VMA when
copy_vma() produces one.
Link: https://lkml.kernel.org/r/20230227173632.3292573-18-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Laurent Dufour <laurent.dufour@fr.ibm.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit d6ac235de4ba6dc659eebb5f4e5ba0a8523d8424)
Bug: 161210518
Change-Id: I38b5c5689380754a366223caff30e1ac4aaf7cc4
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
vma_expand changes VMA boundaries and might result in freeing an adjacent
VMA. Write-lock affected VMAs to prevent concurrent page faults.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/all/20230109205336.3665937-22-surenb@google.com/
[surenb: using older v1 of patchset due to __vma_adjust() being removed
in 6.2-rc4]
[surenb: lock next earlier when removing it like we do in v3:
https://lore.kernel.org/all/20230216051750.3125598-18-surenb@google.com/]
Bug: 161210518
Change-Id: I31aff80996b4ad646bdd6861ff6479c8eb2a690a
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
vma_adjust modifies a VMA and possibly its neighbors. Write-lock them
before making the modifications.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/all/20230109205336.3665937-21-surenb@google.com/
[surenb: using older v1 of patchset due to __vma_adjust() being removed
in 6.2-rc4]
[surenb: minor fixes in next_next locking inside __vma_adjust]
Bug: 161210518
Change-Id: I9ab2f88c82a7071fe2f1a14c51a2e6f1b6196681
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Decisions about whether VMAs can be merged, split or expanded must be
made while VMAs are protected from the changes which can affect that
decision. For example, vma_merge() uses vma->anon_vma in its decision
whether the VMA can be merged. Meanwhile, the page fault handler changes
vma->anon_vma during a COW operation.
Write-lock all VMAs which might be affected by a merge or split operation
before making decision how such operations should be performed.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/all/20230216051750.3125598-17-surenb@google.com/
[surenb: using older v3 of patchset due to missing __vma_adjust()
refactoring in 6.2-rc4 which introduced vma_prepare()]
Bug: 161210518
Change-Id: I56d84aa67366a1988fc81296da7164ad7f89a5c0
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
vma_adjust_trans_huge() modifies the VMA and such modifications should
be done after the VMA is marked as being written. Therefore, move the VMA
flag modifications before vma_adjust_trans_huge() so that the VMA is
marked before all these modifications.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/all/20230216051750.3125598-15-surenb@google.com/
[surenb: using older v3 of patchset due to missing __vma_adjust()
refactoring in 6.2-rc4 which introduced vma_prepare()]
Bug: 161210518
Change-Id: I650162fd85fabee00a8a05ddb32318e654270cb1
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
There are scenarios when vm_flags can be modified without exclusive
mmap_lock, such as:
- after VMA was isolated and mmap_lock was downgraded or dropped
- in exit_mmap when there are no other mm users and locking is unnecessary
Introduce __vm_flags_mod to avoid assertions when the caller takes
responsibility for the required locking.
Pass a hint to untrack_pfn to conditionally use __vm_flags_mod for
flags modification to avoid the assertion.
Link: https://lkml.kernel.org/r/20230126193752.297968-7-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 68f48381d7fdd1cbb9d88c37a4dfbb98ac78226d)
Bug: 161210518
Change-Id: I6ba44b03cde4c9b96d80423d41accab1effb71ac
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.
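The conversion pattern, illustrated (VM_SOFTDIRTY/VM_MAYWRITE are just
example flags):

    vma->vm_flags |= VM_SOFTDIRTY;      /* before: untracked direct write */
    vm_flags_set(vma, VM_SOFTDIRTY);    /* after: write-locks the VMA, then sets */

    vma->vm_flags &= ~VM_MAYWRITE;      /* before */
    vm_flags_clear(vma, VM_MAYWRITE);   /* after */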
[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 1c71222e5f2393b5ea1a41795c67589eea7e3490)
Bug: 161210518
Change-Id: Ifc352b487db109adab17dd33a83f5c7e68c0bbc6
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
To simplify the usage of VM_LOCKED_CLEAR_MASK in vm_flags_clear(), replace
it with VM_LOCKED_MASK bitmask and convert all users.
Link: https://lkml.kernel.org/r/20230126193752.297968-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit e430a95a04efc557bc4ff9b3035c7c85aee5d63f)
Bug: 161210518
Change-Id: I17bbcc01a133511dbfaf3d82fbc4b25ecdd0b376
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Android reported a performance regression in the userfaultfd unmap path.
A closer inspection of the userfaultfd_unmap_prep() change showed that a
second tree walk would be necessary in the reworked code.
Fix the regression by passing each VMA that will be unmapped through to
the userfaultfd_unmap_prep() function as they are added to the unmap list,
instead of re-walking the tree for the VMA.
Link: https://lkml.kernel.org/r/20230601015402.2819343-1-Liam.Howlett@oracle.com
Fixes: 69dbe6daf1 ("userfaultfd: use maple tree iterator to iterate VMAs")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit de53cc0be1c8b47d595682932beb3c11be9e4e5a
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
Bug: 274059236
Change-Id: Ia189a5e98ffe86c4ca5ac3b686ada5f51826f2ed
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
If the iterator has moved to the previous entry, then step forward one
range, back to the gap.
Link: https://lkml.kernel.org/r/20230518145544.1722059-36-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit d3f028c7599ea2297dd630e1a6acaf4915c769d3
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
Bug: 274059236
Change-Id: Ic45e095c728095d41647a704a287596d03489cdf
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
The maple tree iterator clean up is incompatible with the way
do_vmi_align_munmap() expects it to behave. Update the expected behaviour
now, since the updated code also works with the current iterator.
Link: https://lkml.kernel.org/r/20230518145544.1722059-23-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: David Binderman <dcb314@hotmail.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit a4d5b9fbaf42d668c1b5c7f231f79776a9419a91
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable)
[surenb: adjust for missing vma_iter_load]
Bug: 274059236
Change-Id: Id05ab617a3539f885a32c7d3031098a8c005fff8
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Only write when necessary to the maple tree. This should only occur
when the VMA changes. In the __vma_adjust() case, it is either the vma
when it is expanded, the next vma when the boundary expands into 'vma',
writing the 'insert', or when vma expands/shrinks for shift_arg_pages().
The mas_preallocate() setup should track the intended write to ensure
the correct number of nodes are preallocated for the pending write.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: 61b337f650
[surenb: __vma_adjust was removed in 6.3, therefore these fixes are
not applicable upstream anymore. The patch was obtained from the
author's tree]
Bug: 274059236
Change-Id: I69d68a5b4ff11c40985f7b03b31eec4bb24dcbb6
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Set the correct limits for vma_iter_prealloc() calls so that the maple
tree can be smarter about how many nodes are needed.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: https://lore.kernel.org/lkml/20230601021605.2823123-11-Liam.Howlett@oracle.com/
[surenb: remove vma_iter-related changes not present in 6.1 kernel]
Bug: 274059236
Change-Id: I05d1989e35b2e72b9346743f290da66739b3ee59
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
The majority of munmap calls operate on a single VMA. The maple tree
is able to store a single entry at index 0 as a root pointer, avoiding
any node allocations. Change do_vmi_align_munmap() to
store the VMAs being munmap()'ed into a tree indexed by the count. This
will leverage the ability to store the first entry without a node
allocation.
Storing the entries into a tree by the count and not the vma start and
end means changing the functions which iterate over the entries. Update
unmap_vmas() and free_pgtables() to take a maple state and a tree end
address to support this functionality.
Passing through the same maple state to unmap_vmas() and free_pgtables()
means the state needs to be reset between calls. This happens in the
static unmap_region() and exit_mmap().
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: https://lore.kernel.org/lkml/20230601021605.2823123-5-Liam.Howlett@oracle.com/
[surenb: skip changes passing maple state to unmap_vmas() and
free_pgtables()]
Bug: 274059236
Change-Id: If38cfecd51da884bcfdbdfdfbf955a0b338d3d60
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
In preparation for passing the vma state through split, the pre-allocation
that occurs before the split has to be moved to after. Since the
preallocation would then live right next to the store, just call store
instead of preallocating. This effectively restores the potential error
path of splitting and not munmap'ing which pre-dates the maple tree.
Link: https://lkml.kernel.org/r/20230120162650.984577-12-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0378c0a0e9e463b9e31b94fbbbc10f94b34225b6)
Bug: 274059236
Change-Id: I3539fb3a08043dae1bc8aaa6c7f285711a0b5548
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Add a hook in mmap_region() to record the VMA and address information
of monitored processes.
Bug: 198385827
Change-Id: I0bde29113b47ca7f4a9f5d42a54188e791ca3b7e
Signed-off-by: Jiewen Wang <jiewen.wang@vivo.com>
(cherry picked from commit 73c9d4a9d575107b90a6d9f415fa56f963264d06)