This patch adds two exported functions to set/get reclaim parameters.
Bug: 323406883
Change-Id: I8c29073dba3e77cb5db7f45b640518deae04b8a9
Signed-off-by: Minchan Kim <minchan@google.com>
__alloc_pages_direct_reclaim() is called from slowpath allocation where
high atomic reserves can be unreserved after there is a progress in
reclaim and yet no suitable page is found. Later should_reclaim_retry()
gets called from slow path allocation to decide if the reclaim needs to be
retried before OOM kill path is taken.
should_reclaim_retry() checks the available(reclaimable + free pages)
memory against the min wmark levels of a zone and returns:
a) true, if it is above the min wmark so that slow path allocation will
do the reclaim retries.
b) false, thus slowpath allocation takes oom kill path.
should_reclaim_retry() can also unreserves the high atomic reserves **but
only after all the reclaim retries are exhausted.**
In a case where there are almost none reclaimable memory and free pages
contains mostly the high atomic reserves but allocation context can't use
these high atomic reserves, makes the available memory below min wmark
levels hence false is returned from should_reclaim_retry() leading the
allocation request to take OOM kill path. This can turn into a early oom
kill if high atomic reserves are holding lot of free memory and
unreserving of them is not attempted.
(early)OOM is encountered on a VM with the below state:
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
local_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 7752kB
Per above log, the free memory of ~7MB exist in the high atomic reserves
is not freed up before falling back to oom kill path.
Fix it by trying to unreserve the high atomic reserves in
should_reclaim_retry() before __alloc_pages_direct_reclaim() can fallback
to oom kill path.
Bug: 332219324
Link: https://lkml.kernel.org/r/1700823445-27531-1-git-send-email-quic_charante@quicinc.com
Fixes: 0aaa29a56e ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
(cherry picked from commit ac3f3b0a55518056bc80ed32a41931c99e1f7d81)
Change-Id: I432d4ac4864d401a4413f6b2ef902625766f8070
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Reported-by: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Highatomic reserves are set to roughly 1% of zone for maximum and a
pageblock size for minimum. Encountered a system with the below
configuration:
Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB
reserved_highatomic:8192KB managed:49224kB
On such systems, even a single pageblock makes highatomic reserves are set
to ~8% of the zone memory. This high value can easily exert pressure on
the zone.
Per discussion with Michal and Mel, it is not much useful to reserve the
memory for highatomic allocations on such small systems[1]. Since the
minimum size for high atomic reserves is always going to be a pageblock
size and if 1% of zone managed pages is going to be below pageblock size,
don't reserve memory for high atomic allocations. Thanks Michal for this
suggestion[2].
Since no memory is being reserved for high atomic allocations and if
respective allocation failures are seen, this patch can be reverted.
[1] https://lore.kernel.org/linux-mm/20231117161956.d3yjdxhhm4rhl7h2@techsingularity.net/
[2] https://lore.kernel.org/linux-mm/ZVYRJMUitykepLRy@tiehlicka/
Bug: 332219324
Link: https://lkml.kernel.org/r/c3a2a48e2cfe08176a80eaf01c110deb9e918055.1700821416.git.quic_charante@quicinc.com
Change-Id: Id059b63bd6ee68b3a2cd1c4b44613234a42d0a46
(cherry picked from commit 9cd20f3fe045af95a8fe7a12328b21bfd2f3b8bf)
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: page_alloc: fixes for high atomic reserve
caluculations", v3.
The state of the system where the issue exposed shown in oom kill logs:
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kBlocal_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7752kB
From the above, it is seen that ~16MB of memory reserved for high atomic
reserves against the expectation of 1% reserves which is fixed in the 1st
patch.
Don't reserve the high atomic page blocks if 1% of zone memory size is
below a pageblock size.
This patch (of 2):
reserve_highatomic_pageblock() aims to reserve the 1% of the managed pages
of a zone, which is used for the high order atomic allocations.
It uses the below calculation to reserve:
static void reserve_highatomic_pageblock(struct page *page, ....) {
.......
max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
if (zone->nr_reserved_highatomic >= max_managed)
goto out;
zone->nr_reserved_highatomic += pageblock_nr_pages;
set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
out:
....
}
Since we are always appending the 1% of zone managed pages count to
pageblock_nr_pages, the minimum it is turning into 2 pageblocks as the
nr_reserved_highatomic is incremented/decremented in pageblock sizes.
Encountered a system(actually a VM running on the Linux kernel) with the
below zone configuration:
Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB
reserved_highatomic:8192KB managed:49224kB
The existing calculations making it to reserve the 8MB(with pageblock size
of 4MB) i.e. 16% of the zone managed memory. Reserving such high amount
of memory can easily exert memory pressure in the system thus may lead
into unnecessary reclaims till unreserving of high atomic reserves.
Since high atomic reserves are managed in pageblock size granules, as
MIGRATE_HIGHATOMIC is set for such pageblock, fix the calculations for
high atomic reserves as, minimum is pageblock size , maximum is
approximately 1% of the zone managed pages.
Bug: 332219324
Link: https://lkml.kernel.org/r/cover.1700821416.git.quic_charante@quicinc.com
Link: https://lkml.kernel.org/r/1660034138397b82a0a8b6ae51cbe96bd583d89e.1700821416.git.quic_charante@quicinc.com
Change-Id: Icc15fb88ef6166f691f5aa14311bc45bff972b99
(cherry picked from commit d68e39fc45f70e35eb74df2128d315c1d91e4dc4)
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
alloc_contig_migrate_range has every information to be able to understand
big contiguous allocation latency. For example, how many pages are
migrated, how many times they were needed to unmap from page tables.
This patch adds the trace event to collect the allocation statistics. In
the field, it was quite useful to understand CMA allocation latency.
[akpm@linux-foundation.org: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled]
Link: https://lkml.kernel.org/r/20240228051127.2859472-1-richardycc@google.com
Signed-off-by: Richard Chang <richardycc@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org.
Cc: Martin Liu <liumartin@google.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 315897534
(cherry picked from commit c8b36003121834cb77fcaf8a1ce0a454d7a97891
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable)
[richardycc: slight modification for android change 0de2f42977]
Change-Id: If6c3cd106201fd13683d1dd5afdfa62a48a4dd3b
Signed-off-by: Richard Chang <richardycc@google.com>
This catches the android14-6.1-lts branch up with the latest changes in
the android14-6.1 branch, including a number of important symbols being
added for tracking.
This includes the following commits:
* 2d8a5ddebb ANDROID: Update the ABI symbol list
* ddf142e5a8 ANDROID: netlink: add netlink poll and hooks
* c9b5c232e7 ANDROID: Update the ABI symbol list
* 3c9cb9c06f ANDROID: GKI: Update symbol list for mtk
* 5723833390 ANDROID: mm: lru_cache_disable skips lru cache drainnig
* 0de2f42977 ANDROID: mm: cma: introduce __cma_alloc API
* db9d7ba706 ANDROID: Update the ABI representation
* 6b972d6047 BACKPORT: fscrypt: support crypto data unit size less than filesystem block size
* 72bdb74622 UPSTREAM: netfilter: nf_tables: remove catchall element in GC sync path
* 924116f1b8 ANDROID: GKI: Update oplus symbol list
* 0ad2a3cd4d ANDROID: vendor_hooks: export tracepoint symbol trace_mm_vmscan_kswapd_wake
* 6465e29536 BACKPORT: HID: input: map battery system charging
* cfdfc17a46 ANDROID: fuse-bpf: Ignore readaheads unless they go to the daemon
* 354b1b716c FROMGIT: f2fs: skip adding a discard command if exists
* ccbea4f458 UPSTREAM: f2fs: clean up zones when not successfully unmounted
* 88cccede6d UPSTREAM: f2fs: use finish zone command when closing a zone
* b2d3a555d3 UPSTREAM: f2fs: check zone write pointer points to the end of zone
* c9e29a0073 UPSTREAM: f2fs: close unused open zones while mounting
* e92b866e22 UPSTREAM: f2fs: maintain six open zones for zoned devices
* 088f228370 ANDROID: update symbol for unisoc whitelist
* aa71a02cf3 ANDROID: vendor_hooks: mm: add hook to count the number pages allocated for each slab
* 4326c78f84 ANDROID: Update the ABI symbol list
* eb67f58322 ANDROID: sched: Add trace_android_rvh_set_user_nice_locked
* 855511173d UPSTREAM: ASoC: soc-compress: Fix deadlock in soc_compr_open_fe
* 6cb2109589 BACKPORT: ASoC: add snd_soc_card_mutex_lock/unlock()
* edfef8fdc9 BACKPORT: ASoC: expand snd_soc_dpcm_mutex_lock/unlock()
* 52771d9792 BACKPORT: ASoC: expand snd_soc_dapm_mutex_lock/unlock()
Change-Id: I81dd834d6a7b6a32fae56cdc3ebd6a29f0decb80
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
This patch enhances the CMA API with support for failfast mode,
utilizing the __GFP_NORETRY flag. This mode is specifically designed
for high-order bulk allocation scenarios, enabling the CMA API to
avoid prolonged stalls resulting from blocking pages such as those
undergoing page writeback or page locking. Instead of stalling, the
API will continue searching for readily migratable pages across
different pageblocks.
The original patch link:
Link: https://lore.kernel.org/linux-mm/YAnM5PbNJZlk%2F%2FiX@google.com/T/#m36b144ff81fe0a8f0ecaf6813de4819ecc41f8fe
Bug: 308881290
Change-Id: I1c623f17fb49c26005aaffc17330cf820ce6585c
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 3390547fec36527ed15dd213ee55d397f83ffa46)
This syncs up the android14-6.1-lts branch with many changes that have
happened in the android14-6.1 branch.
Included in here are the following commits:
d2c0f4c450 FROMLIST: devcoredump: Send uevent once devcd is ready
95307ec5c8 FROMLIST: iommu: Avoid more races around device probe
5c8e593916 ANDROID: Update the ABI symbol list
94ddfc9ce4 FROMLIST: ufs: core: clear cmd if abort success in mcq mode
8f46c34931 BACKPORT: wifi: cfg80211: Allow AP/P2PGO to indicate port authorization to peer STA/P2PClient
b3ccd8f092 BACKPORT: wifi: cfg80211: OWE DH IE handling offload
daa7a3d95d ANDROID: KVM: arm64: mount procfs for pKVM module loading
1b639e97b8 ANDROID: GKI: Update symbol list for mtk
b496cc3115 ANDROID: fuse-bpf: Add NULL pointer check in fuse_release_in
8431e524d6 UPSTREAM: serial: 8250_port: Check IRQ data before use
825c17428a ANDROID: KVM: arm64: Fix error path in pkvm_mem_abort()
22e9166465 ANDROID: abi_gki_aarch64_qcom: Update symbol list
ca06bb1e93 ANDROID: GKI: add allowed list for Exynosauto SoC
fb91717581 ANDROID: Update the ABI symbol list
ec3c9a1702 ANDROID: sched: Add vendor hook for util_fits_cpu
c47043d65f ANDROID: update symbol for unisoc vendor_hooks
6e881bf034 ANDROID: vendor_hooks: mm: add hook to count the number pages allocated for each slab
a59b32866c UPSTREAM: usb: gadget: udc: Handle gadget_connect failure during bind operation
7a33209b36 ANDROID: Update the ABI symbol list
69b689971a ANDROID: softirq: Add EXPORT_SYMBOL_GPL for softirq and tasklet
48c6c901fe ANDROID: fs/passthrough: Fix compatibility with R/O file system
abcd4c51e7 FROMLIST: usb: typec: tcpm: Fix sink caps op current check
9d6ac9dc6a UPSTREAM: scsi: ufs: core: Add advanced RPMB support where UFSHCI 4.0 does not support EHS length in UTRD
beea09533d ANDROID: ABI: Update symbol list for MediatTek
5683c2b460 ANDROID: vendor_hooks: Add hook for mmc queue
43a07d84da Revert "proc: allow pid_revalidate() during LOOKUP_RCU"
230d34da33 UPSTREAM: scsi: ufs: ufs-qcom: Clear qunipro_g4_sel for HW major version > 5
0920d4de75 ANDROID: GKI: Update symbols to symbol list
019393a917 ANDROID: vendor_hook: Add hook to tune readaround size
0c859c2180 ANDROID: add for tuning readahead size
a8206e3023 ANDROID: vendor_hooks: Add hooks to avoid key threads stalled in memory allocations
ad9947dc8d ANDROID: GKI: Update oplus symbol list
71f3b61ee4 ANDROID: vendor_hooks: add hooks for adjust kvmalloc_node alloc_flags
fef66e8544 UPSTREAM: netfilter: ipset: add the missing IP_SET_HASH_WITH_NET0 macro for ip_set_hash_netportnet.c
7bec8a8180 ANDROID: ABI: Update symbol list for imx
af888bd2a1 ANDROID: abi_gki_aarch64_qcom: Add __netif_rx
131de563f6 ANDROID: ABI: Update sony symbol list and stg
977770ec27 ANDROID: mmc: Add vendor hooks for sdcard failure diagnostics
f8d3b450e9 ANDROID: Update symbol list for mtk
2799026a28 UPSTREAM: scsi: ufs: mcq: Fix the search/wrap around logic
7350783e67 UPSTREAM: scsi: ufs: core: Fix ufshcd_inc_sq_tail() function bug
ff5fa0b7bd FROMLIST: ufs: core: Expand MCQ queue slot to DeviceQueueDepth + 1
4bbeaddf9a Revert "ANDROID: KVM: arm64: Don't allocate from handle_host_mem_abort"
9d38c0bc65 BACKPORT: usb: typec: altmodes/displayport: Signal hpd when configuring pin assignment
e329419f70 UPSTREAM: mm: multi-gen LRU: don't spin during memcg release
3d1b582966 UPSTREAM: vringh: don't use vringh_kiov_advance() in vringh_iov_xfer()
Change-Id: I89958775659103a11ae82cac8d6b4764251da3c5
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Changes in 6.1.61
KVM: x86/pmu: Truncate counter value to allowed width on write
mmc: core: Align to common busy polling behaviour for mmc ioctls
mmc: block: ioctl: do write error check for spi
mmc: core: Fix error propagation for some ioctl commands
ASoC: codecs: wcd938x: Convert to platform remove callback returning void
ASoC: codecs: wcd938x: Simplify with dev_err_probe
ASoC: codecs: wcd938x: fix regulator leaks on probe errors
ASoC: codecs: wcd938x: fix runtime PM imbalance on remove
pinctrl: qcom: lpass-lpi: fix concurrent register updates
mcb: Return actual parsed size when reading chameleon table
mcb-lpc: Reallocate memory region to avoid memory overlapping
virtio_balloon: Fix endless deflation and inflation on arm64
virtio-mmio: fix memory leak of vm_dev
virtio-crypto: handle config changed by work queue
virtio_pci: fix the common cfg map size
vsock/virtio: initialize the_virtio_vsock before using VQs
vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE
arm64: dts: rockchip: Add i2s0-2ch-bus-bclk-off pins to RK3399
arm64: dts: rockchip: Fix i2s0 pin conflict on ROCK Pi 4 boards
mm: fix vm_brk_flags() to not bail out while holding lock
hugetlbfs: clear resv_map pointer if mmap fails
mm/page_alloc: correct start page when guard page debug is enabled
mm/migrate: fix do_pages_move for compat pointers
hugetlbfs: extend hugetlb_vma_lock to private VMAs
maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
nfsd: lock_rename() needs both directories to live on the same fs
drm/i915/pmu: Check if pmu is closed before stopping event
drm/amd: Disable ASPM for VI w/ all Intel systems
drm/dp_mst: Fix NULL deref in get_mst_branch_device_by_guid_helper()
ARM: OMAP: timer32K: fix all kernel-doc warnings
firmware/imx-dsp: Fix use_after_free in imx_dsp_setup_channels()
clk: ti: Fix missing omap4 mcbsp functional clock and aliases
clk: ti: Fix missing omap5 mcbsp functional clock and aliases
r8169: fix the KCSAN reported data-race in rtl_tx() while reading tp->cur_tx
r8169: fix the KCSAN reported data-race in rtl_tx while reading TxDescArray[entry].opts1
r8169: fix the KCSAN reported data race in rtl_rx while reading desc->opts1
iavf: initialize waitqueues before starting watchdog_task
i40e: Fix I40E_FLAG_VF_VLAN_PRUNING value
treewide: Spelling fix in comment
igb: Fix potential memory leak in igb_add_ethtool_nfc_entry
neighbour: fix various data-races
igc: Fix ambiguity in the ethtool advertising
net: ethernet: adi: adin1110: Fix uninitialized variable
net: ieee802154: adf7242: Fix some potential buffer overflow in adf7242_stats_show()
net: usb: smsc95xx: Fix uninit-value access in smsc95xx_read_reg
r8152: Increase USB control msg timeout to 5000ms as per spec
r8152: Run the unload routine if we have errors during probe
r8152: Cancel hw_phy_work if we have an error in probe
r8152: Release firmware if we have an error in probe
tcp: fix wrong RTO timeout when received SACK reneging
gtp: uapi: fix GTPA_MAX
gtp: fix fragmentation needed check with gso
i40e: Fix wrong check for I40E_TXR_FLAGS_WB_ON_ITR
drm/logicvc: Kconfig: select REGMAP and REGMAP_MMIO
iavf: in iavf_down, disable queues when removing the driver
scsi: sd: Introduce manage_shutdown device flag
blk-throttle: check for overflow in calculate_bytes_allowed
kasan: print the original fault addr when access invalid shadow
io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
iio: afe: rescale: Accept only offset channels
iio: exynos-adc: request second interupt only when touchscreen mode is used
iio: adc: xilinx-xadc: Don't clobber preset voltage/temperature thresholds
iio: adc: xilinx-xadc: Correct temperature offset/scale for UltraScale
i2c: muxes: i2c-mux-pinctrl: Use of_get_i2c_adapter_by_node()
i2c: muxes: i2c-mux-gpmux: Use of_get_i2c_adapter_by_node()
i2c: muxes: i2c-demux-pinctrl: Use of_get_i2c_adapter_by_node()
i2c: stm32f7: Fix PEC handling in case of SMBUS transfers
i2c: aspeed: Fix i2c bus hang in slave read
tracing/kprobes: Fix the description of variable length arguments
misc: fastrpc: Reset metadata buffer to avoid incorrect free
misc: fastrpc: Free DMA handles for RPC calls with no arguments
misc: fastrpc: Clean buffers on remote invocation failures
misc: fastrpc: Unmap only if buffer is unmapped from DSP
nvmem: imx: correct nregs for i.MX6ULL
nvmem: imx: correct nregs for i.MX6SLL
nvmem: imx: correct nregs for i.MX6UL
x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
x86/cpu: Add model number for Intel Arrow Lake mobile processor
perf/core: Fix potential NULL deref
sparc32: fix a braino in fault handling in csum_and_copy_..._user()
clk: Sanitize possible_parent_show to Handle Return Value of of_clk_get_parent_name
platform/x86: Add s2idle quirk for more Lenovo laptops
ext4: add two helper functions extent_logical_end() and pa_logical_end()
ext4: fix BUG in ext4_mb_new_inode_pa() due to overflow
ext4: avoid overlapping preallocations due to overflow
objtool/x86: add missing embedded_insn check
Linux 6.1.61
Note, this merge point also reverts commit
bb20a245df which is commit
24eca2dce0f8d19db808c972b0281298d0bafe99 upstream, as it conflicts with
the previous reverts for ABI issues, AND is an ABI break in itself. If
it is needed in the future, it can be brought back in an abi-safe way.
Change-Id: I425bfa3be6d65328e23affd52d10b827aea6e44a
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
memory allocations
We add these hooks to avoid key threads blocked in memory allocation
path.
-android_vh_free_unref_page_bypass ----We create a memory pool for the
key threads. This hook determines whether a page should be free to the
pool or to buddy freelist. It works with a existing hook
`android_vh_alloc_pages_reclaim_bypass`, which takes pages out of the
pool.
-android_vh_kvmalloc_node_use_vmalloc ----For key threads, we perfer
not to run into direct reclaim. So we clear __GFP_DIRECT_RECLAIM flag.
For threads which are not that important, we perfer use vmalloc.
-android_vh_should_alloc_pages_retry ----Before key threads run into
direct reclaim, we want to retry with a lower watermark.
-android_vh_unreserve_highatomic_bypass ----We want to keep more
highatomic pages when unreserve them to avoid highatomic allocation
failures.
-android_vh_rmqueue_bulk_bypass ----We found sometimes when key threads
run into rmqueue_bulk, it took several milliseconds spinning at
zone->lock or filling per-cpu pages. We use this hook to take pages from
the mempool mentioned above, rather than grab zone->lock and fill a
batch of pages to per-cpu.
Bug: 288216516
Change-Id: I1656032d6819ca627723341987b6094775bc345f
Signed-off-by: Oven <liyangouwen1@oppo.com>
commit 61e21cf2d2c3cc5e60e8d0a62a77e250fccda62c upstream.
When guard page debug is enabled and set_page_guard returns success, we
miss to forward page to point to start of next split range and we will do
split unexpectedly in page range without target page. Move start page
update before set_page_guard to fix this.
As we split to wrong target page, then splited pages are not able to merge
back to original order when target page is put back and splited pages
except target page is not usable. To be specific:
Consider target page is the third page in buddy page with order 2.
| buddy-2 | Page | Target | Page |
After break down to target page, we will only set first page to Guard
because of bug.
| Guard | Page | Target | Page |
When we try put_page_back_buddy with target page, the buddy page of target
if neither guard nor buddy, Then it's not able to construct original page
with order 2
| Guard | Page | buddy-0 | Page |
All pages except target page is not in free list and is not usable.
Link: https://lkml.kernel.org/r/20230927094401.68205-1-shikemeng@huaweicloud.com
Fixes: 06be6ff3d2 ("mm,hwpoison: rework soft offline for free pages")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmUlrb4ACgkQONu9yGCS
aT4b+hAAgvFC6P+XmyyNXJ9ISHLkgSlcIAdatb+qeOCUtdiWHqfxIha13FdnCdhL
WS2c/O9ORfAzjFwnYWF6LBwH8ArxRSkAXrGCMuCkEFBP3cG/j2HD+XLAAYEuBjjb
sf1fw8e8VSgaPEOnwXie5rTfAY4VnZKEtZjAxjyIQnJKVVKfxQRb8CyaWDPzPD0Z
tL/iABt7UWNHZayHTHsh0YhF2UhXtOjHinWigEarcZQEvOB2qRQtFl71cnqosi+t
3ZRZzepH7/Fx3v6/H/6PNq+GSI/ZzhOiCQolVV5YcMGHXsW9cP6arjLUxco5pzpk
pEg0vdMq47JOZYQ2pIewG4t7+NLmFIxCRFnKQVbxeFNSY9c1jhd8g5lhx9YEXwjT
BzMtV5DnZoaoMdq2P1STw/+RVYrDI1Lm6jqfgw/D27b7LzQ13VsGM9BJ1rCs8Hm7
UhWyjwFcgo0vhpfML1RF0RtT9Mo5SOnpGPfpbFdjg8jdXlGknNH0QsH+EY/BpF8l
h77P5BvoNIjsIN3B1YunfXtFXhx3h0sI8zZrqHR+zhOeWGsXcqQ5mZ/lYdYKkKuH
R8LRB7shPndF4xdRX0uRXwomcXhs+60eA5xEvE9u0CqqdpXfQN5oTwixfCm2C8MS
O5Fc7hfvK11XtR3ja+y3KRhiNG3YsfW2PXnlOfZxMZ6iPqXtA/o=
=5/pn
-----END PGP SIGNATURE-----
Merge 6.1.57 into android14-6.1-lts
Changes in 6.1.57
spi: zynqmp-gqspi: fix clock imbalance on probe failure
ASoC: soc-utils: Export snd_soc_dai_is_dummy() symbol
ASoC: tegra: Fix redundant PLLA and PLLA_OUT0 updates
mptcp: rename timer related helper to less confusing names
mptcp: fix dangling connection hang-up
mptcp: annotate lockless accesses to sk->sk_err
mptcp: move __mptcp_error_report in protocol.c
mptcp: process pending subflow error on close
ata,scsi: do not issue START STOP UNIT on resume
scsi: sd: Differentiate system and runtime start/stop management
scsi: sd: Do not issue commands to suspended disks on shutdown
scsi: core: Improve type safety of scsi_rescan_device()
scsi: Do not attempt to rescan suspended devices
ata: libata-scsi: Fix delayed scsi_rescan_device() execution
NFS: Cleanup unused rpc_clnt variable
NFS: rename nfs_client_kset to nfs_kset
NFSv4: Fix a state manager thread deadlock regression
mm/memory: add vm_normal_folio()
mm/mempolicy: convert queue_pages_pmd() to queue_folios_pmd()
mm/mempolicy: convert queue_pages_pte_range() to queue_folios_pte_range()
mm/mempolicy: convert migrate_page_add() to migrate_folio_add()
mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
mm/page_alloc: always remove pages from temporary list
mm/page_alloc: leave IRQs enabled for per-cpu page allocations
mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list
ring-buffer: remove obsolete comment for free_buffer_page()
ring-buffer: Fix bytes info in per_cpu buffer stats
btrfs: use struct qstr instead of name and namelen pairs
btrfs: setup qstr from dentrys using fscrypt helper
btrfs: use struct fscrypt_str instead of struct qstr
Revert "NFSv4: Retry LOCK on OLD_STATEID during delegation return"
arm64: Avoid repeated AA64MMFR1_EL1 register read on pagefault path
net: add sysctl accept_ra_min_rtr_lft
net: change accept_ra_min_rtr_lft to affect all RA lifetimes
net: release reference to inet6_dev pointer
arm64: cpufeature: Fix CLRBHB and BC detection
drm/amd/display: Adjust the MST resume flow
iommu/arm-smmu-v3: Set TTL invalidation hint better
iommu/arm-smmu-v3: Avoid constructing invalid range commands
rbd: move rbd_dev_refresh() definition
rbd: decouple header read-in from updating rbd_dev->header
rbd: decouple parent info read-in from updating rbd_dev
rbd: take header_rwsem in rbd_dev_refresh() only when updating
block: fix use-after-free of q->q_usage_counter
hwmon: (nzxt-smart2) Add device id
hwmon: (nzxt-smart2) add another USB ID
i40e: fix the wrong PTP frequency calculation
scsi: zfcp: Fix a double put in zfcp_port_enqueue()
iommu/vt-d: Avoid memory allocation in iommu_suspend()
vringh: don't use vringh_kiov_advance() in vringh_iov_xfer()
net: ethernet: mediatek: disable irq before schedule napi
mptcp: userspace pm allow creating id 0 subflow
qed/red_ll2: Fix undefined behavior bug in struct qed_ll2_info
Bluetooth: hci_codec: Fix leaking content of local_codecs
Bluetooth: hci_sync: Fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
wifi: mwifiex: Fix tlv_buf_left calculation
md/raid5: release batch_last before waiting for another stripe_head
PCI: qcom: Fix IPQ8074 enumeration
net: replace calls to sock->ops->connect() with kernel_connect()
net: prevent rewrite of msg_name in sock_sendmsg()
drm/amd: Fix detection of _PR3 on the PCIe root port
drm/amd: Fix logic error in sienna_cichlid_update_pcie_parameters()
arm64: Add Cortex-A520 CPU part definition
arm64: errata: Add Cortex-A520 speculative unprivileged load workaround
HID: sony: Fix a potential memory leak in sony_probe()
ubi: Refuse attaching if mtd's erasesize is 0
erofs: fix memory leak of LZMA global compressed deduplication
wifi: iwlwifi: dbg_ini: fix structure packing
wifi: iwlwifi: mvm: Fix a memory corruption issue
wifi: cfg80211: hold wiphy lock in auto-disconnect
wifi: cfg80211: move wowlan disable under locks
wifi: cfg80211: add a work abstraction with special semantics
wifi: cfg80211: fix cqm_config access race
wifi: cfg80211: add missing kernel-doc for cqm_rssi_work
wifi: mwifiex: Fix oob check condition in mwifiex_process_rx_packet
leds: Drop BUG_ON check for LED_COLOR_ID_MULTI
bpf: Fix tr dereferencing
regulator: mt6358: Drop *_SSHUB regulators
regulator: mt6358: Use linear voltage helpers for single range regulators
regulator: mt6358: split ops for buck and linear range LDO regulators
Bluetooth: Delete unused hci_req_prepare_suspend() declaration
Bluetooth: ISO: Fix handling of listen for unicast
drivers/net: process the result of hdlc_open() and add call of hdlc_close() in uhdlc_close()
wifi: mt76: mt76x02: fix MT76x0 external LNA gain handling
perf/x86/amd/core: Fix overflow reset on hotplug
regmap: rbtree: Fix wrong register marked as in-cache when creating new node
wifi: mac80211: fix potential key use-after-free
perf/x86/amd: Do not WARN() on every IRQ
iommu/mediatek: Fix share pgtable for iova over 4GB
regulator/core: regulator_register: set device->class earlier
ima: Finish deprecation of IMA_TRUSTED_KEYRING Kconfig
scsi: target: core: Fix deadlock due to recursive locking
ima: rework CONFIG_IMA dependency block
NFSv4: Fix a nfs4_state_manager() race
bpf: tcp_read_skb needs to pop skb regardless of seq
bpf, sockmap: Do not inc copied_seq when PEEK flag set
bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
modpost: add missing else to the "of" check
net: fix possible store tearing in neigh_periodic_work()
bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookup
neighbour: annotate lockless accesses to n->nud_state
neighbour: switch to standard rcu, instead of rcu_bh
neighbour: fix data-races around n->output
ipv4, ipv6: Fix handling of transhdrlen in __ip{,6}_append_data()
ptp: ocp: Fix error handling in ptp_ocp_device_init
net: dsa: mv88e6xxx: Avoid EEPROM timeout when EEPROM is absent
ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
net: usb: smsc75xx: Fix uninit-value access in __smsc75xx_read_reg
net: nfc: llcp: Add lock when modifying device list
net: ethernet: ti: am65-cpsw: Fix error code in am65_cpsw_nuss_init_tx_chns()
ibmveth: Remove condition to recompute TCP header checksum.
netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp
selftests: netfilter: Test nf_tables audit logging
selftests: netfilter: Extend nft_audit.sh
netfilter: nf_tables: Deduplicate nft_register_obj audit logs
netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
ipv4: Set offload_failed flag in fibmatch results
net: stmmac: dwmac-stm32: fix resume on STM32 MCU
tipc: fix a potential deadlock on &tx->lock
tcp: fix quick-ack counting to count actual ACKs of new data
tcp: fix delayed ACKs for MSS boundary condition
sctp: update transport state when processing a dupcook packet
sctp: update hb timer immediately after users change hb_interval
netlink: split up copies in the ack construction
netlink: Fix potential skb memleak in netlink_ack
netlink: annotate data-races around sk->sk_err
HID: sony: remove duplicate NULL check before calling usb_free_urb()
HID: intel-ish-hid: ipc: Disable and reenable ACPI GPE bit
intel_idle: add Emerald Rapids Xeon support
smb: use kernel_connect() and kernel_bind()
parisc: Fix crash with nr_cpus=1 option
dm zoned: free dmz->ddev array in dmz_put_zoned_devices
RDMA/core: Require admin capabilities to set system parameters
of: dynamic: Fix potential memory leak in of_changeset_action()
IB/mlx4: Fix the size of a buffer in add_port_entries()
gpio: aspeed: fix the GPIO number passed to pinctrl_gpio_set_config()
gpio: pxa: disable pinctrl calls for MMP_GPIO
RDMA/cma: Initialize ib_sa_multicast structure to 0 when join
RDMA/cma: Fix truncation compilation warning in make_cma_ports
RDMA/uverbs: Fix typo of sizeof argument
RDMA/srp: Do not call scsi_done() from srp_abort()
RDMA/siw: Fix connection failure handling
RDMA/mlx5: Fix mutex unlocking on error flow for steering anchor creation
RDMA/mlx5: Fix NULL string error
x86/sev: Use the GHCB protocol when available for SNP CPUID requests
ksmbd: fix race condition between session lookup and expire
ksmbd: fix uaf in smb20_oplock_break_ack
parisc: Restore __ldcw_align for PA-RISC 2.0 processors
ipv6: remove nexthop_fib6_nh_bh()
vrf: Fix lockdep splat in output path
btrfs: fix an error handling path in btrfs_rename()
btrfs: fix fscrypt name leak after failure to join log transaction
netlink: remove the flex array from struct nlmsghdr
btrfs: file_remove_privs needs an exclusive lock in direct io write
ipv6: remove one read_lock()/read_unlock() pair in rt6_check_neigh()
xen/events: replace evtchn_rwlock with RCU
Linux 6.1.57
Change-Id: I2c200264df72a9043d91d31479c91b0d7f94863e
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
__build_all_zonelists() acquires zonelist_update_seq by first disabling
interrupts via local_irq_save() and then acquiring the seqlock with
write_seqlock(). This is troublesome and leads to problems on PREEMPT_RT.
The problem is that the inner spinlock_t becomes a sleeping lock on
PREEMPT_RT and must not be acquired with disabled interrupts.
The API provides write_seqlock_irqsave() which does the right thing in one
step. printk_deferred_enter() has to be invoked in non-migrate-able
context to ensure that deferred printing is enabled and disabled on the
same CPU. This is the case after zonelist_update_seq has been acquired.
There was discussion on the first submission that the order should be:
local_irq_disable();
printk_deferred_enter();
write_seqlock();
to avoid pitfalls like having an unaccounted printk() coming from
write_seqlock_irqsave() before printk_deferred_enter() is invoked. The
only origin of such a printk() can be a lockdep splat because the lockdep
annotation happens after the sequence count is incremented. This is
exceptional and subject to change.
It was also pointed that PREEMPT_RT can be affected by the printk problem
since its write_seqlock_irqsave() does not really disable interrupts.
This isn't the case because PREEMPT_RT's printk implementation differs
from the mainline implementation in two important aspects:
- Printing happens in a dedicated threads and not at during the
invocation of printk().
- In emergency cases where synchronous printing is used, a different
driver is used which does not use tty_port::lock.
Acquire zonelist_update_seq with write_seqlock_irqsave() and then defer
printk output.
Bug: 254441685
Link: https://lkml.kernel.org/r/20230623201517.yw286Knb@linutronix.de
Fixes: 1007843a91909 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit a2ebb51575828209b3e9d6f3c6576f7a7c70d0f6)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: Ib40406942a4ba9ae9d6e2044077043aa1455a0a7
Add a vendor hook for costly order page counting
and other vendor specific functions.
Bug: 174521902
Bug: 172987241
Signed-off-by: Chiawei Wang <chiaweiwang@google.com>
Change-Id: I89206727a462548cc3500b695d85c83ff003eec7
Signed-off-by: Richard Chang <richardycc@google.com>
(cherry picked from commit 369de3780428a17e9afece2f5747f03619d589b6)
Signed-off-by: liangjlee <liangjlee@google.com>
Since Android has pcp list for MIGRATE_CMA[1], it could cause
CMA allocation latency due to not freeing the MIGRATE_ISOLATE
page immediately.
Originally, MIGRATE_ISOLATED page is supposed to go buddy list
with skipping pcp list. Otherwise, the page could be reallocated
from pcp list or staying on the pcp list until the pcp is drained
so that CMA keeps retrying since it couldn't find the freed page
from buddy list. That worked before since the CMA pfnblocks changed
only from MIGRATE_CMA to MIGRATE_ISOLATE and free function logic
in page allocator has checked MIGRATE_ISOLATEness on every CMA
pages using below.
free_unref_page_commit
if (migratetype >= MIGRATE_PCPTYPES)
if(is_migrate_isolate(migratetype))
free_one_page(page);
It worked since enum MIGRATE_CMA was bigger than enum
MIGRATE_PCPTYPES but since [1], the enum MIGRATE_CMA is less than
MIGRATE_PCPTYPES so the logic above doesn't work any more.
It could cause following race
CPU 0 CPU 1
free_unref_page
migratetype = get_pfnblock_migratetype()
set_pcppage_migratetype(MIGRATE_CMA)
cma_alloc
alloc_contig_range
set_migrate_isolate(MIGRATE_ISOLATE)
add the page into pcp list
the page could be reallocated
This patch couldn't fix the race completely due to missing zone->lock
in order-0 page free(for performance reason). However, it's not a new
problem so we need to deal with the issue separately.
[1] ANDROID: mm: add cma pcp list
Bug: 218731671
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ibea20085ce5bfb4b74b83b041f9bda9a380120f9
(cherry picked from commit d9e4b67784866047e8cfb5598cdf1ebc0c71f3d9)
Signed-off-by: Richard Chang <richardycc@google.com>
[ Upstream commit 7b086755fb8cdbb6b3e45a1bbddc00e7f9b1dc03 ]
Commit 4b23a68f95 ("mm/page_alloc: protect PCP lists with a spinlock")
bypasses the pcplist on lock contention and returns the page directly to
the buddy list of the page's migratetype.
For pages that don't have their own pcplist, such as CMA and HIGHATOMIC,
the migratetype is temporarily updated such that the page can hitch a ride
on the MOVABLE pcplist. Their true type is later reassessed when flushing
in free_pcppages_bulk(). However, when lock contention is detected after
the type was already overridden, the bypass will then put the page on the
wrong buddy list.
Once on the MOVABLE buddy list, the page becomes eligible for fallbacks
and even stealing. In the case of HIGHATOMIC, otherwise ineligible
allocations can dip into the highatomic reserves. In the case of CMA, the
page can be lost from the CMA region permanently.
Use a separate pcpmigratetype variable for the pcplist override. Use the
original migratetype when going directly to the buddy. This fixes the bug
and should make the intentions more obvious in the code.
Originally sent here to address the HIGHATOMIC case:
https://lore.kernel.org/lkml/20230821183733.106619-4-hannes@cmpxchg.org/
Changelog updated in response to the CMA-specific bug report.
[mgorman@techsingularity.net: updated changelog]
Link: https://lkml.kernel.org/r/20230911181108.GA104295@cmpxchg.org
Fixes: 4b23a68f95 ("mm/page_alloc: protect PCP lists with a spinlock")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Joe Liu <joe.liu@mediatek.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5749077415994eb02d660b2559b9d8278521e73d ]
The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
allocating from the PCP must not re-enter the allocator from IRQ context.
In each instance where IRQ-reentrancy is possible, the lock is acquired
using pcp_spin_trylock_irqsave() even though IRQs are disabled and
re-entrancy is impossible.
Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the
common case at the cost of some IRQ allocations taking a slower path. If
the PCP lists need to be refilled, the zone lock still needs to disable
IRQs but that will only happen on PCP refill and drain. If an IRQ is
raised when a PCP allocation is in progress, the trylock will fail and
fallback to using the buddy lists directly. Note that this may not be a
universal win if an interrupt-intensive workload also allocates heavily
from interrupt context and contends heavily on the zone->lock as a result.
[mgorman@techsingularity.net: migratetype might be wrong if a PCP was locked]
Link: https://lkml.kernel.org/r/20221122131229.5263-2-mgorman@techsingularity.net
[yuzhao@google.com: reported lockdep issue on IO completion from softirq]
[hughd@google.com: fix list corruption, lock improvements, micro-optimsations]
Link: https://lkml.kernel.org/r/20221118101714.19590-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 7b086755fb8c ("mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c3e58a70425ac6ddaae1529c8146e88b4f7252bb ]
Patch series "Leave IRQs enabled for per-cpu page allocations", v3.
This patch (of 2):
free_unref_page_list() has neglected to remove pages properly from the
list of pages to free since forever. It works by coincidence because
list_add happened to do the right thing adding the pages to just the PCP
lists. However, a later patch added pages to either the PCP list or the
zone list but only properly deleted the page from the list in one path
leading to list corruption and a subsequent failure. As a preparation
patch, always delete the pages from one list properly before adding to
another. On its own, this fixes nothing although it adds a fractional
amount of overhead but is critical to the next patch.
Link: https://lkml.kernel.org/r/20221118101714.19590-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20221118101714.19590-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 7b086755fb8c ("mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list")
Signed-off-by: Sasha Levin <sashal@kernel.org>
To monitor the efficiency of memory relciaming in __alloc_pages_slowpath, add vendor hooks in each stages of __alloc_pages_slowpath including __alloc_pages_may_oom and __alloc_pages_direct_reclaim.
android_vh_mm_alloc_pages_direct_reclaim_enter()
android_vh_mm_alloc_pages_direct_reclaim_exit(unsigned long, int)
android_vh_mm_alloc_pages_may_oom_exit(struct oom_control *, unsigned long)
Bug: 301044280
Change-Id: Ic5b5f1c2ad31b16e7339f539fcf54659e9acaba7
Signed-off-by: John Hsu <john.hsu@mediatek.com>
Add vendor hooks for extra memory. If there is extra memory, this can
be accounted like other memory stats. One of the usecases could be
cleancache. If some of ram memory is used for cleancache, its free,
cache, and total size could be added through these vendor hooks.
Bug: 283896254
Change-Id: Iad7330310528581f09842f45860f05dc84823f41
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
commit 1007843a91909a4995ee78a538f62d8665705b66 upstream.
syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.
One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
CPU0
----
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
// e.g. timer interrupt handler runs at this moment
some_timer_func() {
kmalloc(GFP_ATOMIC) {
__alloc_pages_slowpath() {
read_seqbegin(&zonelist_update_seq) {
// spins forever because zonelist_update_seq.seqcount is odd
}
}
}
}
// e.g. timer interrupt handler finishes
write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
}
This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko does not know whether we should go with this
approach.
Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.
CPU0 CPU1
---- ----
pty_write() {
tty_insert_flip_string_and_push_buffer() {
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq);
build_zonelists() {
printk() {
vprintk() {
vprintk_default() {
vprintk_emit() {
console_unlock() {
console_flush_all() {
console_emit_next_record() {
con->write() = serial8250_console_write() {
spin_lock_irqsave(&port->lock, flags);
tty_insert_flip_string() {
tty_insert_flip_string_fixed_flag() {
__tty_buffer_request_room() {
tty_buffer_alloc() {
kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
__alloc_pages_slowpath() {
zonelist_iter_begin() {
read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
}
}
}
}
}
}
}
}
spin_unlock_irqrestore(&port->lock, flags);
// message is printed to console
spin_unlock_irqrestore(&port->lock, flags);
}
}
}
}
}
}
}
}
}
write_sequnlock(&zonelist_update_seq);
}
}
}
This deadlock scenario can be eliminated by
preventing interrupt context from calling kmalloc(GFP_ATOMIC)
and
preventing printk() from calling console_flush_all()
while zonelist_update_seq.seqcount is odd.
Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by
disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
and
disabling synchronous printk() in order to avoid console_flush_all()
.
As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC)
inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...
Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp
Fixes: 3d36424b3b ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Change-Id: Ifc0c6ed9be6d36166367811ad412bedc66ed713e
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Patrick Daly <quic_pdaly@quicinc.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b528537d13)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 4d73ba5fa710fe7d432e0b271e6fecd252aef66e upstream.
A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
taking an excessive amount of time for large amounts of memory. Further
testing allocating huge pages that the cost is linear i.e. if allocating
1G pages in batches of 10 then the time to allocate nr_hugepages from
10->20->30->etc increases linearly even though 10 pages are allocated at
each step. Profiles indicated that much of the time is spent checking the
validity within already existing huge pages and then attempting a
migration that fails after isolating the range, draining pages and a whole
lot of other useless work.
Commit eb14d4eefd ("mm,page_alloc: drop unnecessary checks from
pfn_range_valid_contig") removed two checks, one which ignored huge pages
for contiguous allocations as huge pages can sometimes migrate. While
there may be value on migrating a 2M page to satisfy a 1G allocation, it's
potentially expensive if the 1G allocation fails and it's pointless to try
moving a 1G page for a new 1G allocation or scan the tail pages for valid
PFNs.
Reintroduce the PageHuge check and assume any contiguous region with
hugetlbfs pages is unsuitable for a new 1G allocation.
The hpagealloc test allocates huge pages in batches and reports the
average latency per page over time. This test happens just after boot
when fragmentation is not an issue. Units are in milliseconds.
hpagealloc
6.3.0-rc6 6.3.0-rc6 6.3.0-rc6
vanilla hugeallocrevert-v1r1 hugeallocsimple-v1r2
Min Latency 26.42 ( 0.00%) 5.07 ( 80.82%) 18.94 ( 28.30%)
1st-qrtle Latency 356.61 ( 0.00%) 5.34 ( 98.50%) 19.85 ( 94.43%)
2nd-qrtle Latency 697.26 ( 0.00%) 5.47 ( 99.22%) 20.44 ( 97.07%)
3rd-qrtle Latency 972.94 ( 0.00%) 5.50 ( 99.43%) 20.81 ( 97.86%)
Max-1 Latency 26.42 ( 0.00%) 5.07 ( 80.82%) 18.94 ( 28.30%)
Max-5 Latency 82.14 ( 0.00%) 5.11 ( 93.78%) 19.31 ( 76.49%)
Max-10 Latency 150.54 ( 0.00%) 5.20 ( 96.55%) 19.43 ( 87.09%)
Max-90 Latency 1164.45 ( 0.00%) 5.53 ( 99.52%) 20.97 ( 98.20%)
Max-95 Latency 1223.06 ( 0.00%) 5.55 ( 99.55%) 21.06 ( 98.28%)
Max-99 Latency 1278.67 ( 0.00%) 5.57 ( 99.56%) 22.56 ( 98.24%)
Max Latency 1310.90 ( 0.00%) 8.06 ( 99.39%) 26.62 ( 97.97%)
Amean Latency 678.36 ( 0.00%) 5.44 * 99.20%* 20.44 * 96.99%*
6.3.0-rc6 6.3.0-rc6 6.3.0-rc6
vanilla revert-v1 hugeallocfix-v2
Duration User 0.28 0.27 0.30
Duration System 808.66 17.77 35.99
Duration Elapsed 830.87 18.08 36.33
The vanilla kernel is poor, taking up to 1.3 second to allocate a huge
page and almost 10 minutes in total to run the test. Reverting the
problematic commit reduces it to 8ms at worst and the patch takes 26ms.
This patch fixes the main issue with skipping huge pages but leaves the
page_count() out because a page with an elevated count potentially can
migrate.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=217022
Link: https://lkml.kernel.org/r/20230414141429.pwgieuwluxwez3rj@techsingularity.net
Fixes: eb14d4eefd ("mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig")
Change-Id: I552f0631f15e41038219e207c994fa7702b269fa
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Yuanxi Liu <y.liu@naruida.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 059f24aff6)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Add vendors hooks for recording memory used
Vendor modules allocate and manages the memory itself.
These memories might not be included in kernel memory
statistics. Also, detailed references and vendor-specific
information are managed only inside modules. When
various problems such as memory leaks occurs, these
information should be showed in real-time.
Bug: 182443489
Bug: 234407991
Bug: 277799025
Signed-off-by: Liujie Xie <xieliujie@oppo.com>
Change-Id: I62d8bb2b6650d8b187b433f97eb833ef0b784df1
Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
Add vendor hook inside of get_page_from_freelist() to check
and modify the watermark in some special situations.
Additional page flag bit will be set for future identification.
Separately, a vendor hook inside of page_add_new_anon_rmap()
is added to set the referenced bit in some situations, e.g.
if the special bit in the page flag mentioned before is set,
we will give this page one more chance before it gets reclaimed.
Bug: 279793368
Change-Id: I363853a050a87201f6f368ccc580485dddd6c6b6
Signed-off-by: Dezhi Huang <huangdezhi@hihonor.com>
spin_trylock may fail due to a parallel drain in rmqueue_pcplist.
In the case, it should retry to allocate with buddy.
It matches with upstream policy.
Fixes: 433445e9a1 ("ANDROID: mm: add cma pcp list")
Change-Id: I07367888d7ede38e09f9d882fc2485baa175fe64
Signed-off-by: Minchan Kim <minchan@google.com>
For CMA allocation, it's really critical to migrate a page but
sometimes it fails. One of the reasons is some driver holds a
page refcount for a long time so VM couldn't migrate the page
at that time.
The concern here is there is no way to find the who hold the
refcount of the page effectively. This patch introduces feature
to keep tracking page's pinner. All get_page sites are vulnerable
to pin a page for a long time but the cost to keep track it would
be significat since get_page is the most frequent kernel operation.
Furthermore, the page could be not user page but kernel page which
is not related to the page migration failure.
Thus, this patch keeps tracks of only migration failed pages to
reduce runtime cost. Once page migration fails in CMA allocation
path, those pages are marked as "migration failure" and every
put_page operation against those pages, callstack of the put
are recorded into page_pinner buffer. Later, admin can see
what pages were failed and who released the refcount since the
failure. It really helps effectively to find out longtime refcount
holder to prevent the page migration.
note: page_pinner doesn't guarantee attributing/unattributing are
atomic if they happen at the same time. It's just best effort so
false-positive could happen.
Bug: 183414571
BUg: 240196534
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I603d0c0122734c377db6b1eb95848a6f734173a0
(cherry picked from commit 898cfbf094a2fc13c67fab5b5d3c916f0139833a)
Add a PCP list for __GFP_CMA allocations so as not to deprive
MIGRATE_MOVABLE allocations quick access to pages on their PCP
lists.
Bug: 158645321
Change-Id: I9831eed113ec9e851b4f651755205ac9cf23b9be
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
[isaacm@codeaurora.org: Resolve merge conflicts related to new mm
features]
Signed-off-by: Isaac J. Manjarres <isaacm@quicinc.com>
quic_sukadev@quicinc.com: Resolve merge conflicts due to earlier patch
dropping gfp_flags;drop BUILD_BUG_ON related to MIGRATETYPE_HIGHATOMIC
since its value changed.
Signed-off-by: Sukadev Bhattiprolu <quic_sukadev@quicinc.com>
CMA pages are designed to be used as fallback for movable allocations
and cannot be used for non-movable allocations. If CMA pages are
utilized poorly, non-movable allocations may end up getting starved if
all regular movable pages are allocated and the only pages left are
CMA. Always using CMA pages first creates unacceptable performance
problems. As a midway alternative, use CMA pages for certain
userspace allocations. The userspace pages can be migrated or dropped
quickly which giving decent utilization.
Additionally, add a fall-backs for failed CMA allocations in rmqueue()
and __rmqueue_pcplist() (the latter addition being driven by a report
by the kernel test robot); these fallbacks were dealt with differently
in the original version of the patch as the rmqueue() call chain has
changed).
Bug: 158645321
Link: https://lore.kernel.org/lkml/cover.1604282969.git.cgoldswo@codeaurora.org/
Change-Id: Iad46f0405b416e29ae788f82b79c9953513a9c9d
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
[cgoldswo@codeaurora.org: Place in bugfixes; remove cma_alloc zone flag]
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
[isaacm@codeaurora.org: Resolve merge conflicts to account for new mm
features]
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
[quic_sukadev@quicinc.com: dropped unused gfp_flags parameter to
__rmqueue_pcplist(), resolved some conflicts]
Signed-off-by: Sukadev Bhattiprolu <quic_sukadev@quicinc.com>
commit 1007843a91909a4995ee78a538f62d8665705b66 upstream.
syzbot is reporting circular locking dependency which involves
zonelist_update_seq seqlock [1], for this lock is checked by memory
allocation requests which do not need to be retried.
One deadlock scenario is kmalloc(GFP_ATOMIC) from an interrupt handler.
CPU0
----
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount odd
// e.g. timer interrupt handler runs at this moment
some_timer_func() {
kmalloc(GFP_ATOMIC) {
__alloc_pages_slowpath() {
read_seqbegin(&zonelist_update_seq) {
// spins forever because zonelist_update_seq.seqcount is odd
}
}
}
}
// e.g. timer interrupt handler finishes
write_sequnlock(&zonelist_update_seq); // makes zonelist_update_seq.seqcount even
}
This deadlock scenario can be easily eliminated by not calling
read_seqbegin(&zonelist_update_seq) from !__GFP_DIRECT_RECLAIM allocation
requests, for retry is applicable to only __GFP_DIRECT_RECLAIM allocation
requests. But Michal Hocko does not know whether we should go with this
approach.
Another deadlock scenario which syzbot is reporting is a race between
kmalloc(GFP_ATOMIC) from tty_insert_flip_string_and_push_buffer() with
port->lock held and printk() from __build_all_zonelists() with
zonelist_update_seq held.
CPU0 CPU1
---- ----
pty_write() {
tty_insert_flip_string_and_push_buffer() {
__build_all_zonelists() {
write_seqlock(&zonelist_update_seq);
build_zonelists() {
printk() {
vprintk() {
vprintk_default() {
vprintk_emit() {
console_unlock() {
console_flush_all() {
console_emit_next_record() {
con->write() = serial8250_console_write() {
spin_lock_irqsave(&port->lock, flags);
tty_insert_flip_string() {
tty_insert_flip_string_fixed_flag() {
__tty_buffer_request_room() {
tty_buffer_alloc() {
kmalloc(GFP_ATOMIC | __GFP_NOWARN) {
__alloc_pages_slowpath() {
zonelist_iter_begin() {
read_seqbegin(&zonelist_update_seq); // spins forever because zonelist_update_seq.seqcount is odd
spin_lock_irqsave(&port->lock, flags); // spins forever because port->lock is held
}
}
}
}
}
}
}
}
spin_unlock_irqrestore(&port->lock, flags);
// message is printed to console
spin_unlock_irqrestore(&port->lock, flags);
}
}
}
}
}
}
}
}
}
write_sequnlock(&zonelist_update_seq);
}
}
}
This deadlock scenario can be eliminated by
preventing interrupt context from calling kmalloc(GFP_ATOMIC)
and
preventing printk() from calling console_flush_all()
while zonelist_update_seq.seqcount is odd.
Since Petr Mladek thinks that __build_all_zonelists() can become a
candidate for deferring printk() [2], let's address this problem by
disabling local interrupts in order to avoid kmalloc(GFP_ATOMIC)
and
disabling synchronous printk() in order to avoid console_flush_all()
.
As a side effect of minimizing duration of zonelist_update_seq.seqcount
being odd by disabling synchronous printk(), latency at
read_seqbegin(&zonelist_update_seq) for both !__GFP_DIRECT_RECLAIM and
__GFP_DIRECT_RECLAIM allocation requests will be reduced. Although, from
lockdep perspective, not calling read_seqbegin(&zonelist_update_seq) (i.e.
do not record unnecessary locking dependency) from interrupt context is
still preferable, even if we don't allow calling kmalloc(GFP_ATOMIC)
inside
write_seqlock(&zonelist_update_seq)/write_sequnlock(&zonelist_update_seq)
section...
Link: https://lkml.kernel.org/r/8796b95c-3da3-5885-fddd-6ef55f30e4d3@I-love.SAKURA.ne.jp
Fixes: 3d36424b3b ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
Link: https://lkml.kernel.org/r/ZCrs+1cDqPWTDFNM@alley [2]
Reported-by: syzbot <syzbot+223c7461c58c58a4cb10@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=223c7461c58c58a4cb10 [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Petr Mladek <pmladek@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Patrick Daly <quic_pdaly@quicinc.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4d73ba5fa710fe7d432e0b271e6fecd252aef66e upstream.
A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
taking an excessive amount of time for large amounts of memory. Further
testing allocating huge pages that the cost is linear i.e. if allocating
1G pages in batches of 10 then the time to allocate nr_hugepages from
10->20->30->etc increases linearly even though 10 pages are allocated at
each step. Profiles indicated that much of the time is spent checking the
validity within already existing huge pages and then attempting a
migration that fails after isolating the range, draining pages and a whole
lot of other useless work.
Commit eb14d4eefd ("mm,page_alloc: drop unnecessary checks from
pfn_range_valid_contig") removed two checks, one which ignored huge pages
for contiguous allocations as huge pages can sometimes migrate. While
there may be value on migrating a 2M page to satisfy a 1G allocation, it's
potentially expensive if the 1G allocation fails and it's pointless to try
moving a 1G page for a new 1G allocation or scan the tail pages for valid
PFNs.
Reintroduce the PageHuge check and assume any contiguous region with
hugetlbfs pages is unsuitable for a new 1G allocation.
The hpagealloc test allocates huge pages in batches and reports the
average latency per page over time. This test happens just after boot
when fragmentation is not an issue. Units are in milliseconds.
hpagealloc
6.3.0-rc6 6.3.0-rc6 6.3.0-rc6
vanilla hugeallocrevert-v1r1 hugeallocsimple-v1r2
Min Latency 26.42 ( 0.00%) 5.07 ( 80.82%) 18.94 ( 28.30%)
1st-qrtle Latency 356.61 ( 0.00%) 5.34 ( 98.50%) 19.85 ( 94.43%)
2nd-qrtle Latency 697.26 ( 0.00%) 5.47 ( 99.22%) 20.44 ( 97.07%)
3rd-qrtle Latency 972.94 ( 0.00%) 5.50 ( 99.43%) 20.81 ( 97.86%)
Max-1 Latency 26.42 ( 0.00%) 5.07 ( 80.82%) 18.94 ( 28.30%)
Max-5 Latency 82.14 ( 0.00%) 5.11 ( 93.78%) 19.31 ( 76.49%)
Max-10 Latency 150.54 ( 0.00%) 5.20 ( 96.55%) 19.43 ( 87.09%)
Max-90 Latency 1164.45 ( 0.00%) 5.53 ( 99.52%) 20.97 ( 98.20%)
Max-95 Latency 1223.06 ( 0.00%) 5.55 ( 99.55%) 21.06 ( 98.28%)
Max-99 Latency 1278.67 ( 0.00%) 5.57 ( 99.56%) 22.56 ( 98.24%)
Max Latency 1310.90 ( 0.00%) 8.06 ( 99.39%) 26.62 ( 97.97%)
Amean Latency 678.36 ( 0.00%) 5.44 * 99.20%* 20.44 * 96.99%*
6.3.0-rc6 6.3.0-rc6 6.3.0-rc6
vanilla revert-v1 hugeallocfix-v2
Duration User 0.28 0.27 0.30
Duration System 808.66 17.77 35.99
Duration Elapsed 830.87 18.08 36.33
The vanilla kernel is poor, taking up to 1.3 second to allocate a huge
page and almost 10 minutes in total to run the test. Reverting the
problematic commit reduces it to 8ms at worst and the patch takes 26ms.
This patch fixes the main issue with skipping huge pages but leaves the
page_count() out because a page with an elevated count potentially can
migrate.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=217022
Link: https://lkml.kernel.org/r/20230414141429.pwgieuwluxwez3rj@techsingularity.net
Fixes: eb14d4eefd ("mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Yuanxi Liu <y.liu@naruida.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 6.1.22
interconnect: qcom: osm-l3: fix icc_onecell_data allocation
interconnect: qcom: sm8450: switch to qcom_icc_rpmh_* function
interconnect: qcom: qcm2290: Fix MASTER_SNOC_BIMC_NRT
perf/core: Fix perf_output_begin parameter is incorrectly invoked in perf_event_bpf_output
perf: fix perf_event_context->time
tracing/hwlat: Replace sched_setaffinity with set_cpus_allowed_ptr
drm/amd/display: Include virtual signal to set k1 and k2 values
drm/amd/display: fix k1 k2 divider programming for phantom streams
drm/amd/display: Remove OTG DIV register write for Virtual signals.
mptcp: refactor passive socket initialization
mptcp: use the workqueue to destroy unaccepted sockets
mptcp: fix UaF in listener shutdown
drm/amd/display: Fix DP MST sinks removal issue
arm64: dts: qcom: sm8450: Mark UFS controller as cache coherent
power: supply: bq24190: Fix use after free bug in bq24190_remove due to race condition
power: supply: da9150: Fix use after free bug in da9150_charger_remove due to race condition
arm64: dts: imx8dxl-evk: Disable hibernation mode of AR8031 for EQOS
arm64: dts: imx8dxl-evk: Fix eqos phy reset gpio
ARM: dts: imx6sll: e70k02: fix usbotg1 pinctrl
ARM: dts: imx6sll: e60k02: fix usbotg1 pinctrl
ARM: dts: imx6sl: tolino-shine2hd: fix usbotg1 pinctrl
arm64: dts: imx8mn: specify #sound-dai-cells for SAI nodes
arm64: dts: imx93: add missing #address-cells and #size-cells to i2c nodes
NFS: Fix /proc/PID/io read_bytes for buffered reads
xsk: Add missing overflow check in xdp_umem_reg
iavf: fix inverted Rx hash condition leading to disabled hash
iavf: fix non-tunneled IPv6 UDP packet type and hashing
iavf: do not track VLAN 0 filters
intel/igbvf: free irq on the error path in igbvf_request_msix()
igbvf: Regard vf reset nack as success
igc: fix the validation logic for taprio's gate list
i2c: imx-lpi2c: check only for enabled interrupt flags
i2c: mxs: ensure that DMA buffers are safe for DMA
i2c: hisi: Only use the completion interrupt to finish the transfer
scsi: scsi_dh_alua: Fix memleak for 'qdata' in alua_activate()
nfsd: don't replace page in rq_pages if it's a continuation of last page
net: dsa: b53: mmap: fix device tree support
net: usb: smsc95xx: Limit packet length to skb->len
efi/libstub: smbios: Use length member instead of record struct size
qed/qed_sriov: guard against NULL derefs from qed_iov_get_vf_info
xirc2ps_cs: Fix use after free bug in xirc2ps_detach
net: phy: Ensure state transitions are processed from phy_stop()
net: mdio: fix owner field for mdio buses registered using device-tree
net: mdio: fix owner field for mdio buses registered using ACPI
net: stmmac: Fix for mismatched host/device DMA address width
thermal/drivers/mellanox: Use generic thermal_zone_get_trip() function
mlxsw: core_thermal: Fix fan speed in maximum cooling state
drm/i915: Print return value on error
drm/i915/fbdev: lock the fbdev obj before vma pin
drm/i915/guc: Rename GuC register state capture node to be more obvious
drm/i915/guc: Fix missing ecodes
drm/i915/gt: perform uc late init after probe error injection
net: qcom/emac: Fix use after free bug in emac_remove due to race condition
net: usb: lan78xx: Limit packet length to skb->len
net/ps3_gelic_net: Fix RX sk_buff length
net/ps3_gelic_net: Use dma_mapping_error
octeontx2-vf: Add missing free for alloc_percpu
bootconfig: Fix testcase to increase max node
keys: Do not cache key in task struct if key is requested from kernel thread
ice: check if VF exists before mode check
iavf: fix hang on reboot with ice
i40e: fix flow director packet filter programming
bpf: Adjust insufficient default bpf_jit_limit
net/mlx5e: Set uplink rep as NETNS_LOCAL
net/mlx5e: Block entering switchdev mode with ns inconsistency
net/mlx5: Fix steering rules cleanup
net/mlx5e: Overcome slow response for first macsec ASO WQE
net/mlx5: Read the TC mapping of all priorities on ETS query
net/mlx5: E-Switch, Fix an Oops in error handling code
net: dsa: tag_brcm: legacy: fix daisy-chained switches
atm: idt77252: fix kmemleak when rmmod idt77252
erspan: do not use skb_mac_header() in ndo_start_xmit()
net/sonic: use dma_mapping_error() for error check
nvme-tcp: fix nvme_tcp_term_pdu to match spec
mlxsw: spectrum_fid: Fix incorrect local port type
hvc/xen: prevent concurrent accesses to the shared ring
ksmbd: add low bound validation to FSCTL_SET_ZERO_DATA
ksmbd: add low bound validation to FSCTL_QUERY_ALLOCATED_RANGES
ksmbd: fix possible refcount leak in smb2_open()
Bluetooth: hci_sync: Resume adv with no RPA when active scan
Bluetooth: hci_core: Detect if an ACL packet is in fact an ISO packet
Bluetooth: btusb: Remove detection of ISO packets over bulk
Bluetooth: ISO: fix timestamped HCI ISO data packet parsing
Bluetooth: Remove "Power-on" check from Mesh feature
gve: Cache link_speed value from device
net: asix: fix modprobe "sysfs: cannot create duplicate filename"
net: dsa: mt7530: move enabling disabling core clock to mt7530_pll_setup()
net: dsa: mt7530: move lowering TRGMII driving to mt7530_setup()
net: dsa: mt7530: move setting ssc_delta to PHY_INTERFACE_MODE_TRGMII case
net: mdio: thunder: Add missing fwnode_handle_put()
drm/amd/display: Set dcn32 caps.seamless_odm
Bluetooth: btqcomsmd: Fix command timeout after setting BD address
Bluetooth: L2CAP: Fix responding with wrong PDU type
Bluetooth: btsdio: fix use after free bug in btsdio_remove due to unfinished work
Bluetooth: mgmt: Fix MGMT add advmon with RSSI command
Bluetooth: HCI: Fix global-out-of-bounds
platform/chrome: cros_ec_chardev: fix kernel data leak from ioctl
entry: Fix noinstr warning in __enter_from_user_mode()
perf/x86/amd/core: Always clear status for idx
entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up
hwmon: fix potential sensor registration fail if of_node is missing
hwmon (it87): Fix voltage scaling for chips with 10.9mV ADCs
scsi: qla2xxx: Synchronize the IOCB count to be in order
scsi: qla2xxx: Perform lockless command completion in abort path
smb3: lower default deferred close timeout to address perf regression
smb3: fix unusable share after force unmount failure
uas: Add US_FL_NO_REPORT_OPCODES for JMicron JMS583Gen 2
thunderbolt: Use scale field when allocating USB3 bandwidth
thunderbolt: Call tb_check_quirks() after initializing adapters
thunderbolt: Add quirk to disable CLx
thunderbolt: Fix memory leak in margining
thunderbolt: Disable interrupt auto clear for rings
thunderbolt: Add missing UNSET_INBOUND_SBTX for retimer access
thunderbolt: Use const qualifier for `ring_interrupt_index`
thunderbolt: Rename shadowed variables bit to interrupt_bit and auto_clear_bit
ASoC: amd: yp: Add OMEN by HP Gaming Laptop 16z-n000 to quirks
ASoC: amd: yc: Add DMI entries to support HP OMEN 16-n0xxx (8A43)
ACPI: x86: Drop quirk for HP Elitebook
ACPI: x86: utils: Add Cezanne to the list for forcing StorageD3Enable
riscv: Bump COMMAND_LINE_SIZE value to 1024
drm/cirrus: NULL-check pipe->plane.state->fb in cirrus_pipe_update()
HID: cp2112: Fix driver not registering GPIO IRQ chip as threaded
ca8210: fix mac_len negative array access
HID: logitech-hidpp: Add support for Logitech MX Master 3S mouse
HID: intel-ish-hid: ipc: Fix potential use-after-free in work function
m68k: mm: Fix systems with memory at end of 32-bit address space
m68k: Only force 030 bus error if PC not in exception table
selftests/bpf: check that modifier resolves after pointer
scsi: target: iscsi: Fix an error message in iscsi_check_key()
scsi: qla2xxx: Add option to disable FC2 Target support
scsi: hisi_sas: Check devm_add_action() return value
scsi: ufs: core: Add soft dependency on governor_simpleondemand
scsi: lpfc: Check kzalloc() in lpfc_sli4_cgn_params_read()
scsi: lpfc: Avoid usage of list iterator variable after loop
scsi: mpi3mr: Driver unload crashes host when enhanced logging is enabled
scsi: mpi3mr: Wait for diagnostic save during controller init
scsi: mpi3mr: NVMe command size greater than 8K fails
scsi: mpi3mr: Bad drive in topology results kernel crash
scsi: storvsc: Handle BlockSize change in Hyper-V VHD/VHDX file
platform/x86: int3472: Add GPIOs to Surface Go 3 Board data
net: usb: cdc_mbim: avoid altsetting toggling for Telit FE990
net: usb: qmi_wwan: add Telit 0x1080 composition
drm/amd/display: Update clock table to include highest clock setting
sh: sanitize the flags on sigreturn
drm/amdgpu: Fix call trace warning and hang when removing amdgpu device
drm/amd: Fix initialization mistake for NBIO 7.3.0
net/sched: act_mirred: better wording on protection against excessive stack growth
act_mirred: use the backlog for nested calls to mirred ingress
cifs: lock chan_lock outside match_session
cifs: append path to open_enter trace event
cifs: do not poll server interfaces too regularly
cifs: empty interface list when server doesn't support query interfaces
cifs: dump pending mids for all channels in DebugData
cifs: print session id while listing open files
cifs: fix dentry lookups in directory handle cache
x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_uabi_buf()
selftests/x86/amx: Add a ptrace test
scsi: core: Add BLIST_SKIP_VPD_PAGES for SKhynix H28U74301AMR
usb: misc: onboard-hub: add support for Microchip USB2517 USB 2.0 hub
usb: dwc2: drd: fix inconsistent mode if role-switch-default-mode="host"
usb: dwc2: fix a devres leak in hw_enable upon suspend resume
usb: gadget: u_audio: don't let userspace block driver unbind
btrfs: zoned: fix btrfs_can_activate_zone() to support DUP profile
Bluetooth: Fix race condition in hci_cmd_sync_clear
efi: sysfb_efi: Fix DMI quirks not working for simpledrm
mm/slab: Fix undefined init_cache_node_node() for NUMA and !SMP
fscrypt: destroy keyring after security_sb_delete()
fsverity: Remove WQ_UNBOUND from fsverity read workqueue
lockd: set file_lock start and end when decoding nlm4 testargs
arm64: dts: imx8mm-nitrogen-r2: fix WM8960 clock name
igb: revert rtnl_lock() that causes deadlock
dm thin: fix deadlock when swapping to thin device
usb: typec: tcpm: fix create duplicate source-capabilities file
usb: typec: tcpm: fix warning when handle discover_identity message
usb: cdns3: Fix issue with using incorrect PCI device function
usb: cdnsp: Fixes issue with redundant Status Stage
usb: cdnsp: changes PCI Device ID to fix conflict with CNDS3 driver
usb: chipdea: core: fix return -EINVAL if request role is the same with current role
usb: chipidea: core: fix possible concurrent when switch role
usb: dwc3: gadget: Add 1ms delay after end transfer command without IOC
usb: ucsi: Fix NULL pointer deref in ucsi_connector_change()
usb: ucsi_acpi: Increase the command completion timeout
mm: kfence: fix using kfence_metadata without initialization in show_object()
kfence: avoid passing -g for test
io_uring/net: avoid sending -ECONNABORTED on repeated connection requests
io_uring/rsrc: fix null-ptr-deref in io_file_bitmap_get()
Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare"
test_maple_tree: add more testing for mas_empty_area()
maple_tree: fix mas_skip_node() end slot detection
ksmbd: fix wrong signingkey creation when encryption is AES256
ksmbd: set FILE_NAMED_STREAMS attribute in FS_ATTRIBUTE_INFORMATION
ksmbd: don't terminate inactive sessions after a few seconds
ksmbd: return STATUS_NOT_SUPPORTED on unsupported smb2.0 dialect
ksmbd: return unsupported error on smb1 mount
wifi: mac80211: fix qos on mesh interfaces
nilfs2: fix kernel-infoleak in nilfs_ioctl_wrap_copy()
drm/bridge: lt8912b: return EPROBE_DEFER if bridge is not found
drm/amd/display: fix wrong index used in dccg32_set_dpstreamclk
drm/meson: fix missing component unbind on bind errors
drm/amdgpu/nv: Apply ASPM quirk on Intel ADL + AMD Navi
drm/i915/active: Fix missing debug object activation
drm/i915: Preserve crtc_state->inherited during state clearing
drm/amdgpu: skip ASIC reset for APUs when go to S4
drm/amdgpu: reposition the gpu reset checking for reuse
riscv: mm: Fix incorrect ASID argument when flushing TLB
riscv: Handle zicsr/zifencei issues between clang and binutils
tee: amdtee: fix race condition in amdtee_open_session
firmware: arm_scmi: Fix device node validation for mailbox transport
arm64: dts: qcom: sc7280: Mark PCIe controller as cache coherent
arm64: dts: qcom: sm8150: Fix the iommu mask used for PCIe controllers
soc: qcom: llcc: Fix slice configuration values for SC8280XP
mm/ksm: fix race with VMA iteration and mm_struct teardown
bus: imx-weim: fix branch condition evaluates to a garbage value
i2c: xgene-slimpro: Fix out-of-bounds bug in xgene_slimpro_i2c_xfer()
dm stats: check for and propagate alloc_percpu failure
dm crypt: add cond_resched() to dmcrypt_write()
dm crypt: avoid accessing uninitialized tasklet
sched/fair: sanitize vruntime of entity being placed
sched/fair: Sanitize vruntime of entity being migrated
drm/amdkfd: introduce dummy cache info for property asic
drm/amdkfd: Fix the warning of array-index-out-of-bounds
drm/amdkfd: add GC 11.0.4 KFD support
drm/amdkfd: Fix the memory overrun
Linux 6.1.22
Change-Id: Id13b4655dbfb59c29a0b8953e5e0cda3703f1879
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit f446883d12b8bfa486f7c98d403054d61d38c989 upstream.
This reverts commit 487a32ec24.
should_skip_kasan_poison() reads the PG_skip_kasan_poison flag from
page->flags. However, this line of code in free_pages_prepare():
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
clears most of page->flags, including PG_skip_kasan_poison, before calling
should_skip_kasan_poison(), which meant that it would never return true as
a result of the page flag being set. Therefore, fix the code to call
should_skip_kasan_poison() before clearing the flags, as we were doing
before the reverted patch.
This fixes a measurable performance regression introduced in the reverted
commit, where munmap() takes longer than intended if HW tags KASAN is
supported and enabled at runtime. Without this patch, we see a
single-digit percentage performance regression in a particular
mmap()-heavy benchmark when enabling HW tags KASAN, and with the patch,
there is no statistically significant performance impact when enabling HW
tags KASAN.
Link: https://lkml.kernel.org/r/20230310042914.3805818-2-pcc@google.com
Fixes: 487a32ec24 ("kasan: drop skip_kasan_poison variable in free_pages_prepare")
Link: https://linux-review.googlesource.com/id/Ic4f13affeebd20548758438bb9ed9ca40e312b79
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: <stable@vger.kernel.org> [6.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For each node, memcgs are divided into two generations: the old and
the young. For each generation, memcgs are randomly sharded into
multiple bins to improve scalability. For each bin, an RCU hlist_nulls
is virtually divided into three segments: the head, the tail and the
default.
An onlining memcg is added to the tail of a random bin in the old
generation. The eviction starts at the head of a random bin in the old
generation. The per-node memcg generation counter, whose reminder (mod
2) indexes the old generation, is incremented when all its bins become
empty.
There are four operations:
1. MEMCG_LRU_HEAD, which moves an memcg to the head of a random bin in
its current generation (old or young) and updates its "seg" to
"head";
2. MEMCG_LRU_TAIL, which moves an memcg to the tail of a random bin in
its current generation (old or young) and updates its "seg" to
"tail";
3. MEMCG_LRU_OLD, which moves an memcg to the head of a random bin in
the old generation, updates its "gen" to "old" and resets its "seg"
to "default";
4. MEMCG_LRU_YOUNG, which moves an memcg to the tail of a random bin
in the young generation, updates its "gen" to "young" and resets
its "seg" to "default".
The events that trigger the above operations are:
1. Exceeding the soft limit, which triggers MEMCG_LRU_HEAD;
2. The first attempt to reclaim an memcg below low, which triggers
MEMCG_LRU_TAIL;
3. The first attempt to reclaim an memcg below reclaimable size
threshold, which triggers MEMCG_LRU_TAIL;
4. The second attempt to reclaim an memcg below reclaimable size
threshold, which triggers MEMCG_LRU_YOUNG;
5. Attempting to reclaim an memcg below min, which triggers
MEMCG_LRU_YOUNG;
6. Finishing the aging on the eviction path, which triggers
MEMCG_LRU_YOUNG;
7. Offlining an memcg, which triggers MEMCG_LRU_OLD.
Note that memcg LRU only applies to global reclaim, and the
round-robin incrementing of their max_seq counters ensures the
eventual fairness to all eligible memcgs. For memcg reclaim, it still
relies on mem_cgroup_iter().
Link: https://lkml.kernel.org/r/20221222041905.2431096-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 274865848
(cherry picked from commit e4dde56cd208674ce899b47589f263499e5b8cdc)
[TJ: Resolved conflicts with older function signatures for
min_cgroup_below_min / min_cgroup_below_low and includes]
Change-Id: Idc8a0f635e035d72dd911f807d1224cb47cbd655
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Changes in 6.1.12
hv_netvsc: Allocate memory in netvsc_dma_map() with GFP_ATOMIC
btrfs: limit device extents to the device size
btrfs: zlib: zero-initialize zlib workspace
ALSA: hda/realtek: Add Positivo N14KP6-TG
ALSA: emux: Avoid potential array out-of-bound in snd_emux_xg_control()
ALSA: hda/realtek: Fix the speaker output on Samsung Galaxy Book2 Pro 360
ALSA: hda/realtek: Enable mute/micmute LEDs on HP Elitebook, 645 G9
ALSA: hda/realtek: Add quirk for ASUS UM3402 using CS35L41
ALSA: hda/realtek: fix mute/micmute LEDs don't work for a HP platform.
Revert "PCI/ASPM: Save L1 PM Substates Capability for suspend/resume"
Revert "PCI/ASPM: Refactor L1 PM Substates Control Register programming"
tracing: Fix poll() and select() do not work on per_cpu trace_pipe and trace_pipe_raw
of/address: Return an error when no valid dma-ranges are found
can: j1939: do not wait 250 ms if the same addr was already claimed
HID: logitech: Disable hi-res scrolling on USB
xfrm: compat: change expression for switch in xfrm_xlate64
IB/hfi1: Restore allocated resources on failed copyout
xfrm/compat: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
IB/IPoIB: Fix legacy IPoIB due to wrong number of queues
xfrm: annotate data-race around use_time
RDMA/irdma: Fix potential NULL-ptr-dereference
RDMA/usnic: use iommu_map_atomic() under spin_lock()
xfrm: fix bug with DSCP copy to v6 from v4 tunnel
of: Make OF framebuffer device names unique
net: phylink: move phy_device_free() to correctly release phy device
bonding: fix error checking in bond_debug_reregister()
net: macb: Perform zynqmp dynamic configuration only for SGMII interface
net: phy: meson-gxl: use MMD access dummy stubs for GXL, internal PHY
ionic: clean interrupt before enabling queue to avoid credit race
ionic: refactor use of ionic_rx_fill()
ionic: missed doorbell workaround
cpufreq: qcom-hw: Fix cpufreq_driver->get() for non-LMH systems
uapi: add missing ip/ipv6 header dependencies for linux/stddef.h
net: microchip: sparx5: fix PTP init/deinit not checking all ports
HID: amd_sfh: if no sensors are enabled, clean up
drm/i915: Don't do the WM0->WM1 copy w/a if WM1 is already enabled
drm/virtio: exbuf->fence_fd unmodified on interrupted wait
cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task
nvidiafb: detect the hardware support before removing console.
ice: Do not use WQ_MEM_RECLAIM flag for workqueue
ice: Fix disabling Rx VLAN filtering with port VLAN enabled
ice: switch: fix potential memleak in ice_add_adv_recipe()
net: dsa: mt7530: don't change PVC_EG_TAG when CPU port becomes VLAN-aware
net: mscc: ocelot: fix VCAP filters not matching on MAC with "protocol 802.1Q"
net/mlx5e: Update rx ring hw mtu upon each rx-fcs flag change
net/mlx5: Bridge, fix ageing of peer FDB entries
net/mlx5e: Fix crash unsetting rx-vlan-filter in switchdev mode
net/mlx5e: IPoIB, Show unknown speed instead of error
net/mlx5: Store page counters in a single array
net/mlx5: Expose SF firmware pages counter
net/mlx5: fw_tracer, Clear load bit when freeing string DBs buffers
net/mlx5: fw_tracer, Zero consumer index when reloading the tracer
net/mlx5: Serialize module cleanup with reload and remove
igc: Add ndo_tx_timeout support
net: ethernet: mtk_eth_soc: fix wrong parameters order in __xdp_rxq_info_reg()
txhash: fix sk->sk_txrehash default
selftests: Fix failing VXLAN VNI filtering test
rds: rds_rm_zerocopy_callback() use list_first_entry()
net: mscc: ocelot: fix all IPv6 getting trapped to CPU when PTP timestamping is used
selftests: forwarding: lib: quote the sysctl values
arm64: dts: rockchip: fix input enable pinconf on rk3399
arm64: dts: rockchip: set sdmmc0 speed to sd-uhs-sdr50 on rock-3a
ALSA: pci: lx6464es: fix a debug loop
riscv: stacktrace: Fix missing the first frame
arm64: dts: mediatek: mt8195: Fix vdosys* compatible strings
ASoC: tas5805m: rework to avoid scheduling while atomic.
ASoC: tas5805m: add missing page switch.
ASoC: fsl_sai: fix getting version from VERID
ASoC: topology: Return -ENOMEM on memory allocation failure
clk: microchip: mpfs-ccc: Use devm_kasprintf() for allocating formatted strings
pinctrl: mediatek: Fix the drive register definition of some Pins
pinctrl: aspeed: Fix confusing types in return value
pinctrl: single: fix potential NULL dereference
spi: dw: Fix wrong FIFO level setting for long xfers
pinctrl: aspeed: Revert "Force to disable the function's signal"
pinctrl: intel: Restore the pins that used to be in Direct IRQ mode
cifs: Fix use-after-free in rdata->read_into_pages()
net: USB: Fix wrong-direction WARNING in plusb.c
mptcp: do not wait for bare sockets' timeout
mptcp: be careful on subflow status propagation on errors
selftests: mptcp: allow more slack for slow test-case
selftests: mptcp: stop tests earlier
btrfs: simplify update of last_dir_index_offset when logging a directory
btrfs: free device in btrfs_close_devices for a single device filesystem
usb: core: add quirk for Alcor Link AK9563 smartcard reader
usb: typec: altmodes/displayport: Fix probe pin assign check
cxl/region: Fix null pointer dereference for resetting decoder
cxl/region: Fix passthrough-decoder detection
clk: ingenic: jz4760: Update M/N/OD calculation algorithm
pinctrl: qcom: sm8450-lpass-lpi: correct swr_rx_data group
drm/amd/pm: add SMU 13.0.7 missing GetPptLimit message mapping
ceph: flush cap releases when the session is flushed
nvdimm: Support sizeof(struct page) > MAX_STRUCT_PAGE_SIZE
riscv: Fixup race condition on PG_dcache_clean in flush_icache_pte
riscv: kprobe: Fixup misaligned load text
powerpc/64s/interrupt: Fix interrupt exit race with security mitigation switch
drm/amdgpu: Use the TGID for trace_amdgpu_vm_update_ptes
tracing: Fix TASK_COMM_LEN in trace event format file
rtmutex: Ensure that the top waiter is always woken up
arm64: dts: meson-gx: Make mmc host controller interrupts level-sensitive
arm64: dts: meson-g12-common: Make mmc host controller interrupts level-sensitive
arm64: dts: meson-axg: Make mmc host controller interrupts level-sensitive
Fix page corruption caused by racy check in __free_pages
arm64: efi: Force the use of SetVirtualAddressMap() on eMAG and Altra Max machines
drm/amd/pm: bump SMU 13.0.0 driver_if header version
drm/amdgpu: Add unique_id support for GC 11.0.1/2
drm/amd/pm: bump SMU 13.0.7 driver_if header version
drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
drm/amdgpu/smu: skip pptable init under sriov
drm/amd/display: properly handling AGP aperture in vm setup
drm/amd/display: fix cursor offset on rotation 180
drm/i915: Move fd_install after last use of fence
drm/i915: Initialize the obj flags for shmem objects
drm/i915: Fix VBT DSI DVO port handling
x86/speculation: Identify processors vulnerable to SMT RSB predictions
KVM: x86: Mitigate the cross-thread return address predictions bug
Documentation/hw-vuln: Add documentation for Cross-Thread Return Predictions
Linux 6.1.12
Change-Id: I4deaf57516f3e7b40e728d473986fa355a11fc37
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 462a8e08e0e6287e5ce13187257edbf24213ed03 upstream.
When we upgraded our kernel, we started seeing some page corruption like
the following consistently:
BUG: Bad page state in process ganesha.nfsd pfn:1304ca
page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca
flags: 0x17ffffc0000000()
raw: 0017ffffc0000000 ffff8a513ffd4c98 ffffeee24b35ec08 0000000000000000
raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
page dumped because: nonzero mapcount
CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P B O 5.10.158-1.nutanix.20221209.el7.x86_64 #1
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
Call Trace:
dump_stack+0x74/0x96
bad_page.cold+0x63/0x94
check_new_page_bad+0x6d/0x80
rmqueue+0x46e/0x970
get_page_from_freelist+0xcb/0x3f0
? _cond_resched+0x19/0x40
__alloc_pages_nodemask+0x164/0x300
alloc_pages_current+0x87/0xf0
skb_page_frag_refill+0x84/0x110
...
Sometimes, it would also show up as corruption in the free list pointer
and cause crashes.
After bisecting the issue, we found the issue started from commit
e320d3012d ("mm/page_alloc.c: fix freeing non-compound pages"):
if (put_page_testzero(page))
free_the_page(page, order);
else if (!PageHead(page))
while (order-- > 0)
free_the_page(page + (1 << order), order);
So the problem is the check PageHead is racy because at this point we
already dropped our reference to the page. So even if we came in with
compound page, the page can already be freed and PageHead can return
false and we will end up freeing all the tail pages causing double free.
Fixes: e320d3012d ("mm/page_alloc.c: fix freeing non-compound pages")
Link: https://lore.kernel.org/lkml/BYAPR02MB448855960A9656EEA81141FC94D99@BYAPR02MB4488.namprd02.prod.outlook.com/
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[The patch is in the mm-unstable tree.]
The implementation of page_alloc poisoning sampling assumed that
tag_clear_highpage resets page tags for __GFP_ZEROTAGS allocations.
However, this is no longer the case since commit 70c248aca9
("mm: kasan: Skip unpoisoning of user pages").
This leads to kernel crashes when MTE-enabled userspace mappings are
used with Hardware Tag-Based KASAN enabled.
Reset page tags for __GFP_ZEROTAGS allocations in post_alloc_hook().
Also clarify and fix related comments.
Fixes: 44383cef54c0 ("kasan: allow sampling page_alloc allocations for HW_TAGS")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Peter Collingbourne <pcc@google.com>
Tested-by: Peter Collingbourne <pcc@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 238286329
Bug: 264310057
Link: https://lore.kernel.org/all/5dbd866714b4839069e2d8469ac45b60953db290.1674592780.git.andreyknvl@google.com/
Change-Id: Iea4234bcf7e35337c8063827b07039583bca9c66
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
[The patch is in mm-stable tree.]
As Hardware Tag-Based KASAN is intended to be used in production, its
performance impact is crucial. As page_alloc allocations tend to be big,
tagging and checking all such allocations can introduce a significant
slowdown.
Add two new boot parameters that allow to alleviate that slowdown:
- kasan.page_alloc.sample, which makes Hardware Tag-Based KASAN tag only
every Nth page_alloc allocation with the order configured by the second
added parameter (default: tag every such allocation).
- kasan.page_alloc.sample.order, which makes sampling enabled by the first
parameter only affect page_alloc allocations with the order equal or
greater than the specified value (default: 3, see below).
The exact performance improvement caused by using the new parameters
depends on their values and the applied workload.
The chosen default value for kasan.page_alloc.sample.order is 3, which
matches both PAGE_ALLOC_COSTLY_ORDER and SKB_FRAG_PAGE_ORDER. This is
done for two reasons:
1. PAGE_ALLOC_COSTLY_ORDER is "the order at which allocations are deemed
costly to service", which corresponds to the idea that only large and
thus costly allocations are supposed to sampled.
2. One of the workloads targeted by this patch is a benchmark that sends
a large amount of data over a local loopback connection. Most multi-page
data allocations in the networking subsystem have the order of
SKB_FRAG_PAGE_ORDER (or PAGE_ALLOC_COSTLY_ORDER).
When running a local loopback test on a testing MTE-enabled device in sync
mode, enabling Hardware Tag-Based KASAN introduces a ~50% slowdown.
Applying this patch and setting kasan.page_alloc.sampling to a value
higher than 1 allows to lower the slowdown. The performance improvement
saturates around the sampling interval value of 10 with the default
sampling page order of 3. This lowers the slowdown to ~20%. The slowdown
in real scenarios involving the network will likely be better.
Enabling page_alloc sampling has a downside: KASAN misses bad accesses to
a page_alloc allocation that has not been tagged. This lowers the value
of KASAN as a security mitigation.
However, based on measuring the number of page_alloc allocations of
different orders during boot in a test build, sampling with the default
kasan.page_alloc.sample.order value affects only ~7% of allocations. The
rest ~93% of allocations are still checked deterministically.
Link: https://lkml.kernel.org/r/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Brand <markbrand@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 238286329
Bug: 264310057
(cherry picked from commit 44383cef54c0ce1201f884d83cc2b367bc5aa4f7 git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-stable)
Change-Id: I85f9eb4e93eeddff8f8d06238f433226affca177
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
This reverts commit a2a9e34d16.
Reason for revert:
Observed frequent boot crashes on a device with sampling KASAN enabled.
Bug: 265863271
Change-Id: Ib7860295065ed7aaa36d9e47d8aaa97918c7bc57
Signed-off-by: Peter Collingbourne <pcc@google.com>
[The patch is in mm-unstable tree.]
As Hardware Tag-Based KASAN is intended to be used in production, its
performance impact is crucial. As page_alloc allocations tend to be big,
tagging and checking all such allocations can introduce a significant
slowdown.
Add two new boot parameters that allow to alleviate that slowdown:
- kasan.page_alloc.sample, which makes Hardware Tag-Based KASAN tag only
every Nth page_alloc allocation with the order configured by the second
added parameter (default: tag every such allocation).
- kasan.page_alloc.sample.order, which makes sampling enabled by the first
parameter only affect page_alloc allocations with the order equal or
greater than the specified value (default: 3, see below).
The exact performance improvement caused by using the new parameters
depends on their values and the applied workload.
The chosen default value for kasan.page_alloc.sample.order is 3, which
matches both PAGE_ALLOC_COSTLY_ORDER and SKB_FRAG_PAGE_ORDER. This is
done for two reasons:
1. PAGE_ALLOC_COSTLY_ORDER is "the order at which allocations are deemed
costly to service", which corresponds to the idea that only large and
thus costly allocations are supposed to sampled.
2. One of the workloads targeted by this patch is a benchmark that sends
a large amount of data over a local loopback connection. Most multi-page
data allocations in the networking subsystem have the order of
SKB_FRAG_PAGE_ORDER (or PAGE_ALLOC_COSTLY_ORDER).
When running a local loopback test on a testing MTE-enabled device in sync
mode, enabling Hardware Tag-Based KASAN introduces a ~50% slowdown.
Applying this patch and setting kasan.page_alloc.sampling to a value
higher than 1 allows to lower the slowdown. The performance improvement
saturates around the sampling interval value of 10 with the default
sampling page order of 3. This lowers the slowdown to ~20%. The slowdown
in real scenarios involving the network will likely be better.
Enabling page_alloc sampling has a downside: KASAN misses bad accesses to
a page_alloc allocation that has not been tagged. This lowers the value
of KASAN as a security mitigation.
However, based on measuring the number of page_alloc allocations of
different orders during boot in a test build, sampling with the default
kasan.page_alloc.sample.order value affects only ~7% of allocations. The
rest ~93% of allocations are still checked deterministically.
Link: https://lkml.kernel.org/r/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Brand <markbrand@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 238286329
Bug: 264310057
Link: https://lore.kernel.org/all/129da0614123bb85ed4dd61ae30842b2dd7c903f.1671471846.git.andreyknvl@google.com
Change-Id: Icc7befe61848021c68a12034f426f1c300181ad6
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reverse migration is used to do the balancing the occupancy of memory
zones in a node in the system whose imabalance may be caused by
migration of pages to other zones by an operation, eg: hotremove and
then hotadding the same memory. In this case there is a lot of free
memory in newly hotadd memory which can be filled up by the previous
migrated pages(as part of offline/hotremove) thus may free up some
pressure in other zones of the node.
Upstream discussion: https://lore.kernel.org/all/ee78c83d-da9b-f6d1-4f66-934b7782acfb@codeaurora.org/
Bug: 201263307
Change-Id: Ib3137dab0db66ecf6858c4077dcadb9dfd0c6b1c
Signed-off-by: Charan Teja Reddy <quic_charante@quicinc.com>
Signed-off-by: Sukadev Bhattiprolu <quic_sukadev@quicinc.com>
When we specify __GFP_NOWARN, we only expect that no warnings will be
issued for current caller. But in the __should_failslab() and
__should_fail_alloc_page(), the local GFP flags alter the global
{failslab|fail_page_alloc}.attr, which is persistent and shared by all
tasks. This is not what we expected, let's fix it.
[akpm@linux-foundation.org: unexport should_fail_ex()]
Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com
Fixes: 3f913fc5f9 ("mm: fix missing handler for __GFP_NOWARN")
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Although page allocation always clears page->private in the first page or
head page of an allocation, it has never made a point of clearing
page->private in the tails (though 0 is often what is already there).
But now commit 71e2d666ef ("mm/huge_memory: do not clobber swp_entry_t
during THP split") issues a warning when page_tail->private is found to be
non-0 (unless it's swapcache).
Change that warning to dump page_tail (which also dumps head), instead of
just the head: so far we have seen dead000000000122, dead000000000003,
dead000000000001 or 0000000000000002 in the raw output for tail private.
We could just delete the warning, but today's consensus appears to want
page->private to be 0, unless there's a good reason for it to be set: so
now clear it in prep_compound_tail() (more general than just for THP; but
not for high order allocation, which makes no pass down the tails).
Link: https://lkml.kernel.org/r/1c4233bb-4e4d-5969-fbd4-96604268a285@google.com
Fixes: 71e2d666ef ("mm/huge_memory: do not clobber swp_entry_t during THP split")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Try to avoid using the left over split page on the next request for a page
by calling __free_pages_ok() with FPI_TO_TAIL. This increases the
potential of defragmenting memory when it's used for a short period of
time.
Link: https://lkml.kernel.org/r/20220531185626.yvlmymbxyoe5vags@revolver
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
refcounting errors in ZONE_DEVICE pages.
- Peter Xu fixes some userfaultfd test harness instability.
- Various other patches in MM, mainly fixes.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0j6igAKCRDdBJ7gKXxA
jnGxAP99bV39ZtOsoY4OHdZlWU16BUjKuf/cb3bZlC2G849vEwD+OKlij86SG20j
MGJQ6TfULJ8f1dnQDd6wvDfl3FMl7Qc=
=tbdp
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- fix a race which causes page refcounting errors in ZONE_DEVICE pages
(Alistair Popple)
- fix userfaultfd test harness instability (Peter Xu)
- various other patches in MM, mainly fixes
* tag 'mm-stable-2022-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (29 commits)
highmem: fix kmap_to_page() for kmap_local_page() addresses
mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page
mm/selftest: uffd: explain the write missing fault check
mm/hugetlb: use hugetlb_pte_stable in migration race check
mm/hugetlb: fix race condition of uffd missing/minor handling
zram: always expose rw_page
LoongArch: update local TLB if PTE entry exists
mm: use update_mmu_tlb() on the second thread
kasan: fix array-bounds warnings in tests
hmm-tests: add test for migrate_device_range()
nouveau/dmem: evict device private memory during release
nouveau/dmem: refactor nouveau_dmem_fault_copy_one()
mm/migrate_device.c: add migrate_device_range()
mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()
mm/memremap.c: take a pgmap reference on page allocation
mm: free device private pages have zero refcount
mm/memory.c: fix race when faulting a device private page
mm/damon: use damon_sz_region() in appropriate place
mm/damon: move sz_damon_region to damon_sz_region
lib/test_meminit: add checks for the allocation functions
...
PGFREE and PGALLOC represent the number of freed and allocated pages. So
the page order must be considered.
Link: https://lkml.kernel.org/r/20221006101540.40686-1-laoar.shao@gmail.com
Fixes: 44042b4498 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since 27674ef6c7 ("mm: remove the extra ZONE_DEVICE struct page
refcount") device private pages have no longer had an extra reference
count when the page is in use. However before handing them back to the
owning device driver we add an extra reference count such that free pages
have a reference count of one.
This makes it difficult to tell if a page is free or not because both free
and in use pages will have a non-zero refcount. Instead we should return
pages to the drivers page allocator with a zero reference count. Kernel
code can then safely use kernel functions such as get_page_unless_zero().
Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>