Commit Graph

300 Commits

Author SHA1 Message Date
Fangzheng Zhang
aa71a02cf3 ANDROID: vendor_hooks: mm: add hook to count the number pages
allocated for each slab

Add the tracing interface on the kmalloc_large allocation path,
which can detect the number of pages allocated by the slab,
and if exceeds a threshold, trigger a panic or other actions.

Bug: 312897430
Change-Id: I5575d0e4f91dab1c6e074f3e907fee8ea9327fd7
Signed-off-by: Fangzheng Zhang <fangzheng.zhang@unisoc.com>
2023-11-30 18:19:39 +00:00
Greg Kroah-Hartman
2950de8b2d Merge 6.1.56 into android14-6.1-lts
Changes in 6.1.56
	NFS: Fix error handling for O_DIRECT write scheduling
	NFS: Fix O_DIRECT locking issues
	NFS: More O_DIRECT accounting fixes for error paths
	NFS: Use the correct commit info in nfs_join_page_group()
	NFS: More fixes for nfs_direct_write_reschedule_io()
	NFS/pNFS: Report EINVAL errors from connect() to the server
	SUNRPC: Mark the cred for revalidation if the server rejects it
	NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server
	NFSv4.1: fix pnfs MDS=DS session trunking
	media: v4l: Use correct dependency for camera sensor drivers
	media: via: Use correct dependency for camera sensor drivers
	netfs: Only call folio_start_fscache() one time for each folio
	dm: fix a race condition in retrieve_deps
	btrfs: improve error message after failure to add delayed dir index item
	btrfs: remove BUG() after failure to insert delayed dir index item
	ext4: replace the traditional ternary conditional operator with with max()/min()
	ext4: move setting of trimmed bit into ext4_try_to_trim_range()
	ext4: do not let fstrim block system suspend
	netfilter: nf_tables: don't skip expired elements during walk
	netfilter: nf_tables: GC transaction API to avoid race with control plane
	netfilter: nf_tables: adapt set backend to use GC transaction API
	netfilter: nft_set_hash: mark set element as dead when deleting from packet path
	netfilter: nf_tables: remove busy mark and gc batch API
	netfilter: nf_tables: don't fail inserts if duplicate has expired
	netfilter: nf_tables: fix GC transaction races with netns and netlink event exit path
	netfilter: nf_tables: GC transaction race with netns dismantle
	netfilter: nf_tables: GC transaction race with abort path
	netfilter: nf_tables: use correct lock to protect gc_list
	netfilter: nf_tables: defer gc run if previous batch is still pending
	netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
	netfilter: nft_set_rbtree: use read spinlock to avoid datapath contention
	netfilter: nft_set_pipapo: call nft_trans_gc_queue_sync() in catchall GC
	netfilter: nft_set_pipapo: stop GC iteration if GC transaction allocation fails
	netfilter: nft_set_hash: try later when GC hits EAGAIN on iteration
	netfilter: nf_tables: fix memleak when more than 255 elements expired
	ASoC: meson: spdifin: start hw on dai probe
	netfilter: nf_tables: disallow element removal on anonymous sets
	bpf: Avoid deadlock when using queue and stack maps from NMI
	ASoC: rt5640: Revert "Fix sleep in atomic context"
	ASoC: rt5640: Fix IRQ not being free-ed for HDA jack detect mode
	ALSA: hda/realtek: Splitting the UX3402 into two separate models
	netfilter: conntrack: fix extension size table
	selftests: tls: swap the TX and RX sockets in some tests
	net/core: Fix ETH_P_1588 flow dissector
	ASoC: hdaudio.c: Add missing check for devm_kstrdup
	ASoC: imx-audmix: Fix return error with devm_clk_get()
	octeon_ep: fix tx dma unmap len values in SG
	iavf: do not process adminq tasks when __IAVF_IN_REMOVE_TASK is set
	ASoC: SOF: core: Only call sof_ops_free() on remove if the probe was successful
	iavf: add iavf_schedule_aq_request() helper
	iavf: schedule a request immediately after add/delete vlan
	i40e: Fix VF VLAN offloading when port VLAN is configured
	netfilter, bpf: Adjust timeouts of non-confirmed CTs in bpf_ct_insert_entry()
	ionic: fix 16bit math issue when PAGE_SIZE >= 64KB
	igc: Fix infinite initialization loop with early XDP redirect
	ipv4: fix null-deref in ipv4_link_failure
	scsi: iscsi_tcp: restrict to TCP sockets
	powerpc/perf/hv-24x7: Update domain value check
	dccp: fix dccp_v4_err()/dccp_v6_err() again
	x86/mm, kexec, ima: Use memblock_free_late() from ima_free_kexec_buffer()
	net: hsr: Properly parse HSRv1 supervisor frames.
	platform/x86: intel_scu_ipc: Check status after timeout in busy_loop()
	platform/x86: intel_scu_ipc: Check status upon timeout in ipc_wait_for_interrupt()
	platform/x86: intel_scu_ipc: Don't override scu in intel_scu_ipc_dev_simple_command()
	platform/x86: intel_scu_ipc: Fail IPC send if still busy
	x86/srso: Fix srso_show_state() side effect
	x86/srso: Fix SBPB enablement for spec_rstack_overflow=off
	net: hns3: add cmdq check for vf periodic service task
	net: hns3: fix GRE checksum offload issue
	net: hns3: only enable unicast promisc when mac table full
	net: hns3: fix fail to delete tc flower rules during reset issue
	net: hns3: add 5ms delay before clear firmware reset irq source
	net: bridge: use DEV_STATS_INC()
	team: fix null-ptr-deref when team device type is changed
	net: rds: Fix possible NULL-pointer dereference
	netfilter: nf_tables: disable toggling dormant table state more than once
	netfilter: ipset: Fix race between IPSET_CMD_CREATE and IPSET_CMD_SWAP
	i915/pmu: Move execlist stats initialization to execlist specific setup
	locking/seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()
	net: ena: Flush XDP packets on error.
	bnxt_en: Flush XDP for bnxt_poll_nitroa0()'s NAPI
	octeontx2-pf: Do xdp_do_flush() after redirects.
	igc: Expose tx-usecs coalesce setting to user
	proc: nommu: /proc/<pid>/maps: release mmap read lock
	proc: nommu: fix empty /proc/<pid>/maps
	cifs: Fix UAF in cifs_demultiplex_thread()
	gpio: tb10x: Fix an error handling path in tb10x_gpio_probe()
	i2c: mux: demux-pinctrl: check the return value of devm_kstrdup()
	i2c: mux: gpio: Add missing fwnode_handle_put()
	i2c: xiic: Correct return value check for xiic_reinit()
	ARM: dts: BCM5301X: Extend RAM to full 256MB for Linksys EA6500 V2
	ARM: dts: samsung: exynos4210-i9100: Fix LCD screen's physical size
	ARM: dts: qcom: msm8974pro-castor: correct inverted X of touchscreen
	ARM: dts: qcom: msm8974pro-castor: correct touchscreen function names
	ARM: dts: qcom: msm8974pro-castor: correct touchscreen syna,nosleep-mode
	f2fs: optimize iteration over sparse directories
	f2fs: get out of a repeat loop when getting a locked data page
	s390/pkey: fix PKEY_TYPE_EP11_AES handling in PKEY_CLR2SECK2 IOCTL
	arm64: dts: qcom: sdm845-db845c: Mark cont splash memory region as reserved
	wifi: ath11k: fix tx status reporting in encap offload mode
	wifi: ath11k: Cleanup mac80211 references on failure during tx_complete
	scsi: qla2xxx: Select qpair depending on which CPU post_cmd() gets called
	scsi: qla2xxx: Use raw_smp_processor_id() instead of smp_processor_id()
	drm/amdkfd: Flush TLB after unmapping for GFX v9.4.3
	drm/amdkfd: Insert missing TLB flush on GFX10 and later
	btrfs: reset destination buffer when read_extent_buffer() gets invalid range
	vfio/mdev: Fix a null-ptr-deref bug for mdev_unregister_parent()
	MIPS: Alchemy: only build mmc support helpers if au1xmmc is enabled
	spi: spi-gxp: BUG: Correct spi write return value
	drm/bridge: ti-sn65dsi83: Do not generate HFP/HBP/HSA and EOT packet
	bus: ti-sysc: Use fsleep() instead of usleep_range() in sysc_reset()
	bus: ti-sysc: Fix missing AM35xx SoC matching
	firmware: arm_scmi: Harden perf domain info access
	firmware: arm_scmi: Fixup perf power-cost/microwatt support
	power: supply: mt6370: Fix missing error code in mt6370_chg_toggle_cfo()
	clk: sprd: Fix thm_parents incorrect configuration
	clk: tegra: fix error return case for recalc_rate
	ARM: dts: omap: correct indentation
	ARM: dts: ti: omap: Fix bandgap thermal cells addressing for omap3/4
	ARM: dts: Unify pwm-omap-dmtimer node names
	ARM: dts: Unify pinctrl-single pin group nodes for omap4
	ARM: dts: ti: omap: motorola-mapphone: Fix abe_clkctrl warning on boot
	bus: ti-sysc: Fix SYSC_QUIRK_SWSUP_SIDLE_ACT handling for uart wake-up
	power: supply: ucs1002: fix error code in ucs1002_get_property()
	firmware: imx-dsp: Fix an error handling path in imx_dsp_setup_channels()
	xtensa: add default definition for XCHAL_HAVE_DIV32
	xtensa: iss/network: make functions static
	xtensa: boot: don't add include-dirs
	xtensa: umulsidi3: fix conditional expression
	xtensa: boot/lib: fix function prototypes
	power: supply: rk817: Fix node refcount leak
	selftests/powerpc: Use CLEAN macro to fix make warning
	selftests/powerpc: Pass make context to children
	selftests/powerpc: Fix emit_tests to work with run_kselftest.sh
	soc: imx8m: Enable OCOTP clock for imx8mm before reading registers
	arm64: dts: imx: Add imx8mm-prt8mm.dtb to build
	firmware: arm_ffa: Don't set the memory region attributes for MEM_LEND
	gpio: pmic-eic-sprd: Add can_sleep flag for PMIC EIC chip
	i2c: npcm7xx: Fix callback completion ordering
	x86/reboot: VMCLEAR active VMCSes before emergency reboot
	ceph: drop messages from MDS when unmounting
	dma-debug: don't call __dma_entry_alloc_check_leak() under free_entries_lock
	bpf: Annotate bpf_long_memcpy with data_race
	spi: sun6i: reduce DMA RX transfer width to single byte
	spi: sun6i: fix race between DMA RX transfer completion and RX FIFO drain
	nvme-fc: Prevent null pointer dereference in nvme_fc_io_getuuid()
	parisc: sba: Fix compile warning wrt list of SBA devices
	parisc: iosapic.c: Fix sparse warnings
	parisc: drivers: Fix sparse warning
	parisc: irq: Make irq_stack_union static to avoid sparse warning
	scsi: qedf: Add synchronization between I/O completions and abort
	scsi: ufs: core: Move __ufshcd_send_uic_cmd() outside host_lock
	scsi: ufs: core: Poll HCS.UCRDY before issuing a UIC command
	selftests/ftrace: Correctly enable event in instance-event.tc
	ring-buffer: Avoid softlockup in ring_buffer_resize()
	btrfs: assert delayed node locked when removing delayed item
	selftests: fix dependency checker script
	ring-buffer: Do not attempt to read past "commit"
	net/smc: bugfix for smcr v2 server connect success statistic
	ata: sata_mv: Fix incorrect string length computation in mv_dump_mem()
	platform/mellanox: mlxbf-bootctl: add NET dependency into Kconfig
	platform/x86: asus-wmi: Support 2023 ROG X16 tablet mode
	thermal/of: add missing of_node_put()
	drm/amd/display: Don't check registers, if using AUX BL control
	drm/amdgpu/soc21: don't remap HDP registers for SR-IOV
	drm/amdgpu/nbio4.3: set proper rmmio_remap.reg_offset for SR-IOV
	drm/amdgpu: Handle null atom context in VBIOS info ioctl
	riscv: errata: fix T-Head dcache.cva encoding
	scsi: pm80xx: Use phy-specific SAS address when sending PHY_START command
	scsi: pm80xx: Avoid leaking tags when processing OPC_INB_SET_CONTROLLER_CONFIG command
	smb3: correct places where ENOTSUPP is used instead of preferred EOPNOTSUPP
	ata: libata-eh: do not clear ATA_PFLAG_EH_PENDING in ata_eh_reset()
	spi: nxp-fspi: reset the FLSHxCR1 registers
	spi: stm32: add a delay before SPI disable
	ASoC: fsl: imx-pcm-rpmsg: Add SNDRV_PCM_INFO_BATCH flag
	spi: intel-pci: Add support for Granite Rapids SPI serial flash
	bpf: Clarify error expectations from bpf_clone_redirect
	ALSA: hda: intel-sdw-acpi: Use u8 type for link index
	ASoC: cs42l42: Ensure a reset pulse meets minimum pulse width.
	ASoC: cs42l42: Don't rely on GPIOD_OUT_LOW to set RESET initially low
	firmware: cirrus: cs_dsp: Only log list of algorithms in debug build
	memblock tests: fix warning: "__ALIGN_KERNEL" redefined
	memblock tests: fix warning ‘struct seq_file’ declared inside parameter list
	ASoC: imx-rpmsg: Set ignore_pmdown_time for dai_link
	media: vb2: frame_vector.c: replace WARN_ONCE with a comment
	NFSv4.1: fix zero value filehandle in post open getattr
	ASoC: SOF: Intel: MTL: Reduce the DSP init timeout
	powerpc/watchpoints: Disable preemption in thread_change_pc()
	powerpc/watchpoint: Disable pagefaults when getting user instruction
	powerpc/watchpoints: Annotate atomic context in more places
	ncsi: Propagate carrier gain/loss events to the NCSI controller
	net: hsr: Add __packed to struct hsr_sup_tlv.
	tsnep: Fix NAPI scheduling
	tsnep: Fix NAPI polling with budget 0
	LoongArch: Set all reserved memblocks on Node#0 at initialization
	fbdev/sh7760fb: Depend on FB=y
	perf build: Define YYNOMEM as YYNOABORT for bison < 3.81
	nvme-pci: factor the iod mempool creation into a helper
	nvme-pci: factor out a nvme_pci_alloc_dev helper
	nvme-pci: do not set the NUMA node of device if it has none
	wifi: ath11k: Don't drop tx_status when peer cannot be found
	scsi: qla2xxx: Fix NULL pointer dereference in target mode
	nvme-pci: always return an ERR_PTR from nvme_pci_alloc_dev
	smack: Record transmuting in smk_transmuted
	smack: Retrieve transmuting information in smack_inode_getsecurity()
	iommu/arm-smmu-v3: Fix soft lockup triggered by arm_smmu_mm_invalidate_range
	x86/sgx: Resolves SECS reclaim vs. page fault for EAUG race
	x86/srso: Add SRSO mitigation for Hygon processors
	KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway
	KVM: SVM: Fix TSC_AUX virtualization setup
	KVM: x86/mmu: Open code leaf invalidation from mmu_notifier
	KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe()
	mptcp: fix bogus receive window shrinkage with multiple subflows
	misc: rtsx: Fix some platforms can not boot and move the l1ss judgment to probe
	Revert "tty: n_gsm: fix UAF in gsm_cleanup_mux"
	serial: 8250_port: Check IRQ data before use
	nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
	netfilter: nf_tables: disallow rule removal from chain binding
	ALSA: hda: Disable power save for solving pop issue on Lenovo ThinkCentre M70q
	LoongArch: Define relocation types for ABI v2.10
	LoongArch: numa: Fix high_memory calculation
	ata: libata-scsi: link ata port and scsi device
	ata: libata-scsi: ignore reserved bits for REPORT SUPPORTED OPERATION CODES
	io_uring/fs: remove sqe->rw_flags checking from LINKAT
	i2c: i801: unregister tco_pdev in i801_probe() error path
	ASoC: amd: yc: Fix non-functional mic on Lenovo 82QF and 82UG
	kernel/sched: Modify initial boot task idle setup
	sched/rt: Fix live lock between select_fallback_rq() and RT push
	netfilter: nf_tables: fix kdoc warnings after gc rework
	Revert "SUNRPC dont update timeout value on connection reset"
	timers: Tag (hr)timer softirq as hotplug safe
	drm/tests: Fix incorrect argument in drm_test_mm_insert_range
	arm64: defconfig: remove CONFIG_COMMON_CLK_NPCM8XX=y
	mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions()
	mm/slab_common: fix slab_caches list corruption after kmem_cache_destroy()
	mm: memcontrol: fix GFP_NOFS recursion in memory.high enforcement
	ring-buffer: Update "shortest_full" in polling
	btrfs: properly report 0 avail for very full file systems
	media: uvcvideo: Fix OOB read
	bpf: Add override check to kprobe multi link attach
	bpf: Fix BTF_ID symbol generation collision
	bpf: Fix BTF_ID symbol generation collision in tools/
	net: thunderbolt: Fix TCPv6 GSO checksum calculation
	fs/smb/client: Reset password pointer to NULL
	ata: libata-core: Fix ata_port_request_pm() locking
	ata: libata-core: Fix port and device removal
	ata: libata-core: Do not register PM operations for SAS ports
	ata: libata-sata: increase PMP SRST timeout to 10s
	drm/i915/gt: Fix reservation address in ggtt_reserve_guc_top
	power: supply: rk817: Add missing module alias
	power: supply: ab8500: Set typing and props
	fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
	drm/amdkfd: Use gpu_offset for user queue's wptr
	drm/meson: fix memory leak on ->hpd_notify callback
	memcg: drop kmem.limit_in_bytes
	mm, memcg: reconsider kmem.limit_in_bytes deprecation
	ASoC: amd: yc: Fix a non-functional mic on Lenovo 82TL
	Linux 6.1.56

Change-Id: Id110614d91d6d60fb6c7622c5af82f219a84a30f
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-10-27 09:17:04 +00:00
Rafael Aquini
a5569bb187 mm/slab_common: fix slab_caches list corruption after kmem_cache_destroy()
commit 46a9ea6681907a3be6b6b0d43776dccc62cad6cf upstream.

After the commit in Fixes:, if a module that created a slab cache does not
release all of its allocated objects before destroying the cache (at rmmod
time), we might end up releasing the kmem_cache object without removing it
from the slab_caches list thus corrupting the list as kmem_cache_destroy()
ignores the return value from shutdown_cache(), which in turn never removes
the kmem_cache object from slabs_list in case __kmem_cache_shutdown() fails
to release all of the cache's slabs.

This is easily observable on a kernel built with CONFIG_DEBUG_LIST=y
as after that ill release the system will immediately trip on list_add,
or list_del, assertions similar to the one shown below as soon as another
kmem_cache gets created, or destroyed:

  [ 1041.213632] list_del corruption. next->prev should be ffff89f596fb5768, but was 52f1e5016aeee75d. (next=ffff89f595a1b268)
  [ 1041.219165] ------------[ cut here ]------------
  [ 1041.221517] kernel BUG at lib/list_debug.c:62!
  [ 1041.223452] invalid opcode: 0000 [#1] PREEMPT SMP PTI
  [ 1041.225408] CPU: 2 PID: 1852 Comm: rmmod Kdump: loaded Tainted: G    B   W  OE      6.5.0 #15
  [ 1041.228244] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc37 05/24/2023
  [ 1041.231212] RIP: 0010:__list_del_entry_valid+0xae/0xb0

Another quick way to trigger this issue, in a kernel with CONFIG_SLUB=y,
is to set slub_debug to poison the released objects and then just run
cat /proc/slabinfo after removing the module that leaks slab objects,
in which case the kernel will panic:

  [   50.954843] general protection fault, probably for non-canonical address 0xa56b6b6b6b6b6b8b: 0000 [#1] PREEMPT SMP PTI
  [   50.961545] CPU: 2 PID: 1495 Comm: cat Kdump: loaded Tainted: G    B   W  OE      6.5.0 #15
  [   50.966808] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc37 05/24/2023
  [   50.972663] RIP: 0010:get_slabinfo+0x42/0xf0

This patch fixes this issue by properly checking shutdown_cache()'s
return value before taking the kmem_cache_release() branch.

Fixes: 0495e337b7 ("mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock")
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06 14:57:03 +02:00
Liujie Xie
573ba7b6e6 ANDROID: vendor_hooks: Add hooks for memory when debug
Add vendors hooks for recording memory used

Vendor modules allocate and manages the memory itself.

These memories might not be included in kernel memory
statistics. Also, detailed references and vendor-specific
information are managed only inside modules. When
various problems such as memory leaks occurs, these
information should be showed in real-time.

Bug: 182443489
Bug: 234407991
Bug: 277799025

Signed-off-by: Liujie Xie <xieliujie@oppo.com>
Change-Id: I62d8bb2b6650d8b187b433f97eb833ef0b784df1
Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com>
2023-05-25 21:06:40 +00:00
Vlastimil Babka
c18c20f162 mm, slab: remove duplicate kernel-doc comment for ksize()
Akira reports:

> "make htmldocs" reports duplicate C declaration of ksize() as follows:

> /linux/Documentation/core-api/mm-api:43: ./mm/slab_common.c:1428: WARNING: Duplicate C declaration, also defined at core-api/mm-api:212.
> Declaration is '.. c:function:: size_t ksize (const void *objp)'.

> This is due to the kernel-doc comment for ksize() declaration added in
> include/linux/slab.h by commit 05a940656e ("slab: Introduce
> kmalloc_size_roundup()").

There is an older kernel-doc comment for ksize() definition in
mm/slab_common.c, which is not only duplicated, but also contradicts the
new one - the additional storage discovered by ksize() should not be
used by callers anymore. Delete the old kernel-doc.

Reported-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/all/d33440f6-40cf-9747-3340-e54ffaf7afb8@gmail.com/
Fixes: 05a940656e ("slab: Introduce kmalloc_size_roundup()")
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-07 17:11:27 +01:00
Kees Cook
328687151b mm/slab_common: Restore passing "caller" for tracing
The "caller" argument was accidentally being ignored in a few places
that were recently refactored. Restore these "caller" arguments, instead
of _RET_IP_.

Fixes: 11e9734bcb ("mm/slab_common: unify NUMA and UMA version of tracepoints")
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-06 21:20:46 +01:00
Vlastimil Babka
eb4940d4ad mm/slab: remove !CONFIG_TRACING variants of kmalloc_[node_]trace()
For !CONFIG_TRACING kernels, the kmalloc() implementation tries (in cases where
the allocation size is build-time constant) to save a function call, by
inlining kmalloc_trace() to a kmem_cache_alloc() call.

However since commit 6edf2576a6 ("mm/slub: enable debugging memory wasting of
kmalloc") this path now fails to pass the original request size to be
eventually recorded (for kmalloc caches with debugging enabled).

We could adjust the code to call __kmem_cache_alloc_node() as the
CONFIG_TRACING variant, but that would as a result inline a call with 5
parameters, bloating the kmalloc() call sites. The cost of extra function
call (to kmalloc_trace()) seems like a lesser evil.

It also appears that the !CONFIG_TRACING variant is incompatible with upcoming
hardening efforts [1] so it's easier if we just remove it now. Kernels with no
tracing are rare these days and the benefit is dubious anyway.

[1] https://lore.kernel.org/linux-mm/20221101222520.never.109-kees@kernel.org/T/#m20ecf14390e406247bde0ea9cce368f469c539ed

Link: https://lore.kernel.org/all/097d8fba-bd10-a312-24a3-a4068c4f424c@suse.cz/
Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-04 14:57:21 +01:00
Lukas Bulwahn
a207620123 mm/slab_common: repair kernel-doc for __ksize()
Commit 445d41d7a7 ("Merge branch 'slab/for-6.1/kmalloc_size_roundup' into
slab/for-next") resolved a conflict of two concurrent changes to __ksize().

However, it did not adjust the kernel-doc comment of __ksize(), while the
name of the argument to __ksize() was renamed.

Hence, ./scripts/ kernel-doc -none mm/slab_common.c warns about it.

Adjust the kernel-doc comment for __ksize() for make W=1 happiness.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-11-03 18:09:45 +01:00
Linus Torvalds
27bc50fc90 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative
   reports (or any positive ones, come to that).
 
 - Also the Maple Tree from Liam R.  Howlett.  An overlapping range-based
   tree for vmas.  It it apparently slight more efficient in its own right,
   but is mainly targeted at enabling work to reduce mmap_lock contention.
 
   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.
 
   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
   This has yet to be addressed due to Liam's unfortunately timed
   vacation.  He is now back and we'll get this fixed up.
 
 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer.  It uses
   clang-generated instrumentation to detect used-unintialized bugs down to
   the single bit level.
 
   KMSAN keeps finding bugs.  New ones, as well as the legacy ones.
 
 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.
 
 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
   file/shmem-backed pages.
 
 - userfaultfd updates from Axel Rasmussen
 
 - zsmalloc cleanups from Alexey Romanov
 
 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
 
 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.
 
 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.
 
 - memcg cleanups from Kairui Song.
 
 - memcg fixes and cleanups from Johannes Weiner.
 
 - Vishal Moola provides more folio conversions
 
 - Zhang Yi removed ll_rw_block() :(
 
 - migration enhancements from Peter Xu
 
 - migration error-path bugfixes from Huang Ying
 
 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths.  For optimizations by PMEM drivers, DRM
   drivers, etc.
 
 - vma merging improvements from Jakub Matěn.
 
 - NUMA hinting cleanups from David Hildenbrand.
 
 - xu xin added aditional userspace visibility into KSM merging activity.
 
 - THP & KSM code consolidation from Qi Zheng.
 
 - more folio work from Matthew Wilcox.
 
 - KASAN updates from Andrey Konovalov.
 
 - DAMON cleanups from Kaixu Xia.
 
 - DAMON work from SeongJae Park: fixes, cleanups.
 
 - hugetlb sysfs cleanups from Muchun Song.
 
 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
 joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
 bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
 =xfWx
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It it apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-unintialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added aditional userspace visibility into KSM merging
   activity.

 - THP & KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
2022-10-10 17:53:04 -07:00
Vlastimil Babka
445d41d7a7 Merge branch 'slab/for-6.1/kmalloc_size_roundup' into slab/for-next
The first two patches from a series by Kees Cook [1] that introduce
kmalloc_size_roundup(). This will allow merging of per-subsystem patches using
the new function and ultimately stop (ab)using ksize() in a way that causes
ongoing trouble for debugging functionality and static checkers.

[1] https://lore.kernel.org/all/20220923202822.2667581-1-keescook@chromium.org/

--
Resolved a conflict of modifying mm/slab.c __ksize() comment with a commit that
unifies __ksize() implementation into mm/slab_common.c
2022-09-29 11:30:55 +02:00
Vlastimil Babka
af961f8059 Merge branch 'slab/for-6.1/slub_debug_waste' into slab/for-next
A patch from Feng Tang that enhances the existing debugfs alloc_traces
file for kmalloc caches with information about how much space is wasted
by allocations that needs less space than the particular kmalloc cache
provides.
2022-09-29 11:28:26 +02:00
Kees Cook
05a940656e slab: Introduce kmalloc_size_roundup()
In the effort to help the compiler reason about buffer sizes, the
__alloc_size attribute was added to allocators. This improves the scope
of the compiler's ability to apply CONFIG_UBSAN_BOUNDS and (in the near
future) CONFIG_FORTIFY_SOURCE. For most allocations, this works well,
as the vast majority of callers are not expecting to use more memory
than what they asked for.

There is, however, one common exception to this: anticipatory resizing
of kmalloc allocations. These cases all use ksize() to determine the
actual bucket size of a given allocation (e.g. 128 when 126 was asked
for). This comes in two styles in the kernel:

1) An allocation has been determined to be too small, and needs to be
   resized. Instead of the caller choosing its own next best size, it
   wants to minimize the number of calls to krealloc(), so it just uses
   ksize() plus some additional bytes, forcing the realloc into the next
   bucket size, from which it can learn how large it is now. For example:

	data = krealloc(data, ksize(data) + 1, gfp);
	data_len = ksize(data);

2) The minimum size of an allocation is calculated, but since it may
   grow in the future, just use all the space available in the chosen
   bucket immediately, to avoid needing to reallocate later. A good
   example of this is skbuff's allocators:

	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
	...
	/* kmalloc(size) might give us more room than requested.
	 * Put skb_shared_info exactly at the end of allocated zone,
	 * to allow max possible filling before reallocation.
	 */
	osize = ksize(data);
        size = SKB_WITH_OVERHEAD(osize);

In both cases, the "how much was actually allocated?" question is answered
_after_ the allocation, where the compiler hinting is not in an easy place
to make the association any more. This mismatch between the compiler's
view of the buffer length and the code's intention about how much it is
going to actually use has already caused problems[1]. It is possible to
fix this by reordering the use of the "actual size" information.

We can serve the needs of users of ksize() and still have accurate buffer
length hinting for the compiler by doing the bucket size calculation
_before_ the allocation. Code can instead ask "how large an allocation
would I get for a given size?".

Introduce kmalloc_size_roundup(), to serve this function so we can start
replacing the "anticipatory resizing" uses of ksize().

[1] https://github.com/ClangBuiltLinux/linux/issues/1599
    https://github.com/KSPP/linux/issues/183

[ vbabka@suse.cz: add SLOB version ]

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-29 11:10:34 +02:00
Kees Cook
9ed9cac185 slab: Remove __malloc attribute from realloc functions
The __malloc attribute should not be applied to "realloc" functions, as
the returned pointer may alias the storage of the prior pointer. Instead
of splitting __malloc from __alloc_size, which would be a huge amount of
churn, just create __realloc_size for the few cases where it is needed.

Thanks to Geert Uytterhoeven <geert@linux-m68k.org> for reporting build
failures with gcc-8 in earlier version which tried to remove the #ifdef.
While the "alloc_size" attribute is available on all GCC versions, I
forgot that it gets disabled explicitly by the kernel in GCC < 9.1 due
to misbehaviors. Add a note to the compiler_attributes.h entry for it.

Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-29 11:05:57 +02:00
Feng Tang
6edf2576a6 mm/slub: enable debugging memory wasting of kmalloc
kmalloc's API family is critical for mm, with one nature that it will
round up the request size to a fixed one (mostly power of 2). Say
when user requests memory for '2^n + 1' bytes, actually 2^(n+1) bytes
could be allocated, so in worst case, there is around 50% memory
space waste.

The wastage is not a big issue for requests that get allocated/freed
quickly, but may cause problems with objects that have longer life
time.

We've met a kernel boot OOM panic (v5.10), and from the dumped slab
info:

    [   26.062145] kmalloc-2k            814056KB     814056KB

From debug we found there are huge number of 'struct iova_magazine',
whose size is 1032 bytes (1024 + 8), so each allocation will waste
1016 bytes. Though the issue was solved by giving the right (bigger)
size of RAM, it is still nice to optimize the size (either use a
kmalloc friendly size or create a dedicated slab for it).

And from lkml archive, there was another crash kernel OOM case [1]
back in 2019, which seems to be related with the similar slab waste
situation, as the log is similar:

    [    4.332648] iommu: Adding device 0000:20:02.0 to group 16
    [    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
    ...
    [    4.857565] kmalloc-2048           59164KB      59164KB

The crash kernel only has 256M memory, and 59M is pretty big here.
(Note: the related code has been changed and optimised in recent
kernel [2], these logs are just picked to demo the problem, also
a patch changing its size to 1024 bytes has been merged)

So add an way to track each kmalloc's memory waste info, and
leverage the existing SLUB debug framework (specifically
SLUB_STORE_USER) to show its call stack of original allocation,
so that user can evaluate the waste situation, identify some hot
spots and optimize accordingly, for a better utilization of memory.

The waste info is integrated into existing interface:
'/sys/kernel/debug/slab/kmalloc-xx/alloc_traces', one example of
'kmalloc-4k' after boot is:

 126 ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe] waste=233856/1856 age=280763/281414/282065 pid=1330 cpus=32 nodes=1
     __kmem_cache_alloc_node+0x11f/0x4e0
     __kmalloc_node+0x4e/0x140
     ixgbe_alloc_q_vector+0xbe/0x830 [ixgbe]
     ixgbe_init_interrupt_scheme+0x2ae/0xc90 [ixgbe]
     ixgbe_probe+0x165f/0x1d20 [ixgbe]
     local_pci_probe+0x78/0xc0
     work_for_cpu_fn+0x26/0x40
     ...

which means in 'kmalloc-4k' slab, there are 126 requests of
2240 bytes which got a 4KB space (wasting 1856 bytes each
and 233856 bytes in total), from ixgbe_alloc_q_vector().

And when system starts some real workload like multiple docker
instances, there could are more severe waste.

[1]. https://lkml.org/lkml/2019/8/12/266
[2]. https://lore.kernel.org/lkml/2920df89-9975-5785-f79b-257d3052dfaf@huawei.com/

[Thanks Hyeonggon for pointing out several bugs about sorting/format]
[Thanks Vlastimil for suggesting way to reduce memory usage of
 orig_size and keep it only for kmalloc objects]

Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-23 12:32:45 +02:00
Vlastimil Babka
3662c13ec6 Merge branch 'slab/for-6.1/common_kmalloc' into slab/for-next
The "common kmalloc v4" series [1] by Hyeonggon Yoo.

- Improves the mm/slab_common.c wrappers to allow deleting duplicated
  code between SLAB and SLUB.
- Large kmalloc() allocations in SLAB are passed to page allocator like
  in SLUB, reducing number of kmalloc caches.
- Removes the {kmem_cache_alloc,kmalloc}_node variants of tracepoints,
  node id parameter added to non-_node variants.
- 8 files changed, 341 insertions(+), 651 deletions(-)

[1] https://lore.kernel.org/all/20220817101826.236819-1-42.hyeyoo@gmail.com/

--
Merge resolves trivial conflict in mm/slub.c with commit 5373b8a09d
("kasan: call kasan_malloc() from __kmalloc_*track_caller()")
2022-09-23 10:32:02 +02:00
Vlastimil Babka
0467ca385f Merge branch 'slab/for-6.1/trivial' into slab/for-next
Trivial fixes and cleanups:
- unneeded variable removals, by ye xingchen
2022-09-23 10:29:53 +02:00
Feng Tang
d71608a877 mm/slab_common: fix possible double free of kmem_cache
When doing slub_debug test, kfence's 'test_memcache_typesafe_by_rcu'
kunit test case cause a use-after-free error:

  BUG: KASAN: use-after-free in kobject_del+0x14/0x30
  Read of size 8 at addr ffff888007679090 by task kunit_try_catch/261

  CPU: 1 PID: 261 Comm: kunit_try_catch Tainted: G    B            N 6.0.0-rc5-next-20220916 #17
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x48
   print_address_description.constprop.0+0x87/0x2a5
   print_report+0x103/0x1ed
   kasan_report+0xb7/0x140
   kobject_del+0x14/0x30
   kmem_cache_destroy+0x130/0x170
   test_exit+0x1a/0x30
   kunit_try_run_case+0xad/0xc0
   kunit_generic_run_threadfn_adapter+0x26/0x50
   kthread+0x17b/0x1b0
   </TASK>

The cause is inside kmem_cache_destroy():

kmem_cache_destroy
    acquire lock/mutex
    shutdown_cache
        schedule_work(kmem_cache_release) (if RCU flag set)
    release lock/mutex
    kmem_cache_release (if RCU flag not set)

In some certain timing, the scheduled work could be run before
the next RCU flag checking, which can then get a wrong value
and lead to double kmem_cache_release().

Fix it by caching the RCU flag inside protected area, just like 'refcnt'

Fixes: 0495e337b7 ("mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock")
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-19 16:27:26 +02:00
Waiman Long
0495e337b7 mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock
A circular locking problem is reported by lockdep due to the following
circular locking dependency.

  +--> cpu_hotplug_lock --> slab_mutex --> kn->active --+
  |                                                     |
  +-----------------------------------------------------+

The forward cpu_hotplug_lock ==> slab_mutex ==> kn->active dependency
happens in

  kmem_cache_destroy():	cpus_read_lock(); mutex_lock(&slab_mutex);
  ==> sysfs_slab_unlink()
      ==> kobject_del()
          ==> kernfs_remove()
	      ==> __kernfs_remove()
	          ==> kernfs_drain(): rwsem_acquire(&kn->dep_map, ...);

The backward kn->active ==> cpu_hotplug_lock dependency happens in

  kernfs_fop_write_iter(): kernfs_get_active();
  ==> slab_attr_store()
      ==> cpu_partial_store()
          ==> flush_all(): cpus_read_lock()

One way to break this circular locking chain is to avoid holding
cpu_hotplug_lock and slab_mutex while deleting the kobject in
sysfs_slab_unlink() which should be equivalent to doing a write_lock
and write_unlock pair of the kn->active virtual lock.

Since the kobject structures are not protected by slab_mutex or the
cpu_hotplug_lock, we can certainly release those locks before doing
the delete operation.

Move sysfs_slab_unlink() and sysfs_slab_release() to the newly
created kmem_cache_release() and call it outside the slab_mutex &
cpu_hotplug_lock critical sections. There will be a slight delay
in the deletion of sysfs files if kmem_cache_release() is called
indirectly from a work function.

Fixes: 5a836bf6b0 ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: David Rientjes <rientjes@google.com>
Link: https://lore.kernel.org/all/YwOImVd+nRUsSAga@hyeyoo/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01 12:10:31 +02:00
Hyeonggon Yoo
d5eff73690 mm/sl[au]b: check if large object is valid in __ksize()
If address of large object is not beginning of folio or size of the
folio is too small, it must be invalid. WARN() and return 0 in such
cases.

Cc: Marco Elver <elver@google.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01 11:44:39 +02:00
Hyeonggon Yoo
8dfa9d5540 mm/slab_common: move declaration of __ksize() to mm/slab.h
__ksize() is only called by KASAN. Remove export symbol and move
declaration to mm/slab.h as we don't want to grow its callers.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01 11:44:39 +02:00
Hyeonggon Yoo
2c1d697fb8 mm/slab_common: drop kmem_alloc & avoid dereferencing fields when not using
Drop kmem_alloc event class, and define kmalloc and kmem_cache_alloc
using TRACE_EVENT() macro.

And then this patch does:
   - Do not pass pointer to struct kmem_cache to trace_kmalloc.
     gfp flag is enough to know if it's accounted or not.
   - Avoid dereferencing s->object_size and s->size when not using kmem_cache_alloc event.
   - Avoid dereferencing s->name in when not using kmem_cache_free event.
   - Adjust s->size to SLOB_UNITS(s->size) * SLOB_UNIT in SLOB

Cc: Vasily Averin <vasily.averin@linux.dev>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01 11:44:26 +02:00
Hyeonggon Yoo
11e9734bcb mm/slab_common: unify NUMA and UMA version of tracepoints
Drop kmem_alloc event class, rename kmem_alloc_node to kmem_alloc, and
remove _node postfix for NUMA version of tracepoints.

This will break some tools that depend on {kmem_cache_alloc,kmalloc}_node,
but at this point maintaining both kmem_alloc and kmem_alloc_node
event classes does not makes sense at all.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01 10:40:27 +02:00
Hyeonggon Yoo
26a40990ba mm/sl[au]b: cleanup kmem_cache_alloc[_node]_trace()
Despite its name, kmem_cache_alloc[_node]_trace() is hook for inlined
kmalloc. So rename it to kmalloc[_node]_trace().

Move its implementation to slab_common.c by using
__kmem_cache_alloc_node(), but keep CONFIG_TRACING=n varients to save a
function call when CONFIG_TRACING=n.

Use __assume_kmalloc_alignment for kmalloc[_node]_trace instead of
__assume_slab_alignement. Generally kmalloc has larger alignment
requirements.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01 10:40:27 +02:00
Hyeonggon Yoo
b140513524 mm/sl[au]b: generalize kmalloc subsystem
Now everything in kmalloc subsystem can be generalized.
Let's do it!

Generalize __do_kmalloc_node(), __kmalloc_node_track_caller(),
kfree(), __ksize(), __kmalloc(), __kmalloc_node() and move them
to slab_common.c.

In the meantime, rename kmalloc_large_node_notrace()
to __kmalloc_large_node() and make it static as it's now only called in
slab_common.c.

[ feng.tang@intel.com: adjust kfence skip list to include
  __kmem_cache_free so that kfence kunit tests do not fail ]

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01 10:38:06 +02:00
Hyeonggon Yoo
d6a71648db mm/slab: kmalloc: pass requests larger than order-1 page to page allocator
There is not much benefit for serving large objects in kmalloc().
Let's pass large requests to page allocator like SLUB for better
maintenance of common code.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24 16:11:41 +02:00
Hyeonggon Yoo
c4cab55752 mm/slab_common: cleanup kmalloc_large()
Now that kmalloc_large() and kmalloc_large_node() do mostly same job,
make kmalloc_large() wrapper of kmalloc_large_node_notrace().

In the meantime, add missing flag fix code in
kmalloc_large_node_notrace().

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24 16:11:41 +02:00
Hyeonggon Yoo
bf37d79102 mm/slab_common: kmalloc_node: pass large requests to page allocator
Now that kmalloc_large_node() is in common code, pass large requests
to page allocator in kmalloc_node() using kmalloc_large_node().

One problem is that currently there is no tracepoint in
kmalloc_large_node(). Instead of simply putting tracepoint in it,
use kmalloc_large_node{,_notrace} depending on its caller to show
useful address for both inlined kmalloc_node() and
__kmalloc_node_track_caller() when large objects are allocated.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24 16:11:41 +02:00
Hyeonggon Yoo
a0c3b94002 mm/slub: move kmalloc_large_node() to slab_common.c
In later patch SLAB will also pass requests larger than order-1 page
to page allocator. Move kmalloc_large_node() to slab_common.c.

Fold kmalloc_large_node_hook() into kmalloc_large_node() as there is
no other caller.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24 16:11:41 +02:00
Hyeonggon Yoo
e4c98d6895 mm/slab_common: fold kmalloc_order_trace() into kmalloc_large()
There is no caller of kmalloc_order_trace() except kmalloc_large().
Fold it into kmalloc_large() and remove kmalloc_order{,_trace}().

Also add tracepoint in kmalloc_large() that was previously
in kmalloc_order_trace().

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24 16:11:40 +02:00
ye xingchen
610f9c00ce mm/slab_common: Remove the unneeded result variable
Return the value from __kmem_cache_shrink() directly instead of storing it
 in another redundant variable.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-23 16:03:05 +02:00
Hyeonggon Yoo
3041808b52 mm/slab_common: move generic bulk alloc/free functions to SLOB
Now that only SLOB use __kmem_cache_{alloc,free}_bulk(), move them to
SLOB. No functional change intended.

Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-07-20 13:30:12 +02:00
Vasily Averin
b347aa7b57 mm/tracing: add 'accounted' entry into output of allocation tracepoints
Slab caches marked with SLAB_ACCOUNT force accounting for every
allocation from this cache even if __GFP_ACCOUNT flag is not passed.
Unfortunately, at the moment this flag is not visible in ftrace output,
and this makes it difficult to analyze the accounted allocations.

This patch adds boolean "accounted" entry into trace output,
and set it to 'true' for calls used __GFP_ACCOUNT flag and
for allocations from caches marked with SLAB_ACCOUNT.
Set it to 'false' if accounting is disabled in configs.

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Link: https://lore.kernel.org/r/c418ed25-65fe-f623-fbf8-1676528859ed@openvz.org
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-07-04 17:11:27 +02:00
Linus Torvalds
98931dd95f Yang Shi has improved the behaviour of khugepaged collapsing of readonly
file-backed transparent hugepages.
 
 Johannes Weiner has arranged for zswap memory use to be tracked and
 managed on a per-cgroup basis.
 
 Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime
 enablement of the recent huge page vmemmap optimization feature.
 
 Baolin Wang contributes a series to fix some issues around hugetlb
 pagetable invalidation.
 
 Zhenwei Pi has fixed some interactions between hwpoisoned pages and
 virtualization.
 
 Tong Tiangen has enabled the use of the presently x86-only
 page_table_check debugging feature on arm64 and riscv.
 
 David Vernet has done some fixup work on the memcg selftests.
 
 Peter Xu has taught userfaultfd to handle write protection faults against
 shmem- and hugetlbfs-backed files.
 
 More DAMON development from SeongJae Park - adding online tuning of the
 feature and support for monitoring of fixed virtual address ranges.  Also
 easier discovery of which monitoring operations are available.
 
 Nadav Amit has done some optimization of TLB flushing during mprotect().
 
 Neil Brown continues to labor away at improving our swap-over-NFS support.
 
 David Hildenbrand has some fixes to anon page COWing versus
 get_user_pages().
 
 Peng Liu fixed some errors in the core hugetlb code.
 
 Joao Martins has reduced the amount of memory consumed by device-dax's
 compound devmaps.
 
 Some cleanups of the arch-specific pagemap code from Anshuman Khandual.
 
 Muchun Song has found and fixed some errors in the TLB flushing of
 transparent hugepages.
 
 Roman Gushchin has done more work on the memcg selftests.
 
 And, of course, many smaller fixes and cleanups.  Notably, the customary
 million cleanup serieses from Miaohe Lin.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo52xQAKCRDdBJ7gKXxA
 jtJFAQD238KoeI9z5SkPMaeBRYSRQmNll85mxs25KapcEgWgGQD9FAb7DJkqsIVk
 PzE+d9hEfirUGdL6cujatwJ6ejYR8Q8=
 =nFe6
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Almost all of MM here. A few things are still getting finished off,
  reviewed, etc.

   - Yang Shi has improved the behaviour of khugepaged collapsing of
     readonly file-backed transparent hugepages.

   - Johannes Weiner has arranged for zswap memory use to be tracked and
     managed on a per-cgroup basis.

   - Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
     runtime enablement of the recent huge page vmemmap optimization
     feature.

   - Baolin Wang contributes a series to fix some issues around hugetlb
     pagetable invalidation.

   - Zhenwei Pi has fixed some interactions between hwpoisoned pages and
     virtualization.

   - Tong Tiangen has enabled the use of the presently x86-only
     page_table_check debugging feature on arm64 and riscv.

   - David Vernet has done some fixup work on the memcg selftests.

   - Peter Xu has taught userfaultfd to handle write protection faults
     against shmem- and hugetlbfs-backed files.

   - More DAMON development from SeongJae Park - adding online tuning of
     the feature and support for monitoring of fixed virtual address
     ranges. Also easier discovery of which monitoring operations are
     available.

   - Nadav Amit has done some optimization of TLB flushing during
     mprotect().

   - Neil Brown continues to labor away at improving our swap-over-NFS
     support.

   - David Hildenbrand has some fixes to anon page COWing versus
     get_user_pages().

   - Peng Liu fixed some errors in the core hugetlb code.

   - Joao Martins has reduced the amount of memory consumed by
     device-dax's compound devmaps.

   - Some cleanups of the arch-specific pagemap code from Anshuman
     Khandual.

   - Muchun Song has found and fixed some errors in the TLB flushing of
     transparent hugepages.

   - Roman Gushchin has done more work on the memcg selftests.

  ... and, of course, many smaller fixes and cleanups. Notably, the
  customary million cleanup serieses from Miaohe Lin"

* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
  mm: kfence: use PAGE_ALIGNED helper
  selftests: vm: add the "settings" file with timeout variable
  selftests: vm: add "test_hmm.sh" to TEST_FILES
  selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
  selftests: vm: add migration to the .gitignore
  selftests/vm/pkeys: fix typo in comment
  ksm: fix typo in comment
  selftests: vm: add process_mrelease tests
  Revert "mm/vmscan: never demote for memcg reclaim"
  mm/kfence: print disabling or re-enabling message
  include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
  include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
  mm: fix a potential infinite loop in start_isolate_page_range()
  MAINTAINERS: add Muchun as co-maintainer for HugeTLB
  zram: fix Kconfig dependency warning
  mm/shmem: fix shmem folio swapoff hang
  cgroup: fix an error handling path in alloc_pagecache_max_30M()
  mm: damon: use HPAGE_PMD_SIZE
  tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
  nodemask.h: fix compilation error with GCC12
  ...
2022-05-26 12:32:41 -07:00
Linus Torvalds
2e17ce1106 slab changes for 5.19
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmKLUYoACgkQ4CHKc/GJ
 qRCMFwf/Tm1cf2JLUANrT58rjkrrj15EtKhnJdm5/yvmsWKps7WKPP4jeUHe+NTO
 NovAGt67lG1l6LMLczZkWckOkWlyYjC42CPDLdxRUkk+zQRb3nRA8Nbt6VTNBOfQ
 0wTLOqXgsNXdSPSVUsKGL8kIAHNQTMX+7TjO6s7CXy/5Qag6r1iZX2HZxASOHxLa
 yYzaJ9pJRZBAMGnzV6L6v0J8KPnjYO0fB68S1qYQTbhoRxchtFF+0AIr1JydGgBI
 9RFUowTrSpJkZtcSjabopvZz4JfCRDP+eAxkyw13feji7MG1FMX74HgDdw+HhzTv
 R2/6iA5WcsmzcXopsfMx8lUP/KIfPw==
 =gnSc
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:

 - Conversion of slub_debug stack traces to stackdepot, allowing more
   useful debugfs-based inspection for e.g. memory leak debugging.
   Allocation and free debugfs info now includes full traces and is
   sorted by the unique trace frequency.

   The stackdepot conversion was already attempted last year but
   reverted by ae14c63a9f. The memory overhead (while not actually
   enabled on boot) has been meanwhile solved by making the large
   stackdepot allocation dynamic. The xfstest issues haven't been
   reproduced on current kernel locally nor in -next, so the slab cache
   layout changes that originally made that bug manifest were probably
   not the root cause.

 - Refactoring of dma-kmalloc caches creation.

 - Trivial cleanups such as removal of unused parameters, fixes and
   clarifications of comments.

 - Hyeonggon Yoo joins as a reviewer.

* tag 'slab-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  MAINTAINERS: add myself as reviewer for slab
  mm/slub: remove unused kmem_cache_order_objects max
  mm: slab: fix comment for __assume_kmalloc_alignment
  mm: slab: fix comment for ARCH_KMALLOC_MINALIGN
  mm/slub: remove unneeded return value of slab_pad_check
  mm/slab_common: move dma-kmalloc caches creation into new_kmalloc_cache()
  mm/slub: remove meaningless node check in ___slab_alloc()
  mm/slub: remove duplicate flag in allocate_slab()
  mm/slub: remove unused parameter in setup_object*()
  mm/slab.c: fix comments
  slab, documentation: add description of debugfs files for SLUB caches
  mm/slub: sort debugfs output by frequency of stack traces
  mm/slub: distinguish and print stack traces in debugfs files
  mm/slub: use stackdepot to save stack trace in objects
  mm/slub: move struct track init out of set_track()
  lib/stackdepot: allow requesting early initialization dynamically
  mm/slub, kunit: Make slub_kunit unaffected by user specified flags
  mm/slab: remove some unused functions
2022-05-25 10:24:04 -07:00
Vlastimil Babka
e001897da6 Merge branches 'slab/for-5.19/stackdepot' and 'slab/for-5.19/refactor' into slab/for-linus 2022-05-23 11:14:32 +02:00
Peter Collingbourne
d949a8155d mm: make minimum slab alignment a runtime property
When CONFIG_KASAN_HW_TAGS is enabled we currently increase the minimum
slab alignment to 16.  This happens even if MTE is not supported in
hardware or disabled via kasan=off, which creates an unnecessary memory
overhead in those cases.  Eliminate this overhead by making the minimum
slab alignment a runtime property and only aligning to 16 if KASAN is
enabled at runtime.

On a DragonBoard 845c (non-MTE hardware) with a kernel built with
CONFIG_KASAN_HW_TAGS, waiting for quiescence after a full Android boot I
see the following Slab measurements in /proc/meminfo (median of 3
reboots):

Before: 169020 kB
After:  167304 kB

[akpm@linux-foundation.org: make slab alignment type `unsigned int' to avoid casting]
Link: https://linux-review.googlesource.com/id/I752e725179b43b144153f4b6f584ceb646473ead
Link: https://lkml.kernel.org/r/20220427195820.1716975-2-pcc@google.com
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:07 -07:00
Marco Elver
2dfe63e61c mm, kfence: support kmem_dump_obj() for KFENCE objects
Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
producing garbage data due to the object not actually being maintained
by SLAB or SLUB.

Fix this by implementing __kfence_obj_info() that copies relevant
information to struct kmem_obj_info when the object was allocated by
KFENCE; this is called by a common kmem_obj_info(), which also calls the
slab/slub/slob specific variant now called __kmem_obj_info().

For completeness, kmem_dump_obj() now displays if the object was
allocated by KFENCE.

Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
Fixes: b89fb5ef0c ("mm, kfence: insert KFENCE hooks for SLUB")
Fixes: d3fb45f370 ("mm, kfence: insert KFENCE hooks for SLAB")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>	[slab]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-15 14:49:55 -07:00
Ohhoon Kwon
33647783de mm/slab_common: move dma-kmalloc caches creation into new_kmalloc_cache()
There are four types of kmalloc_caches: KMALLOC_NORMAL, KMALLOC_CGROUP,
KMALLOC_RECLAIM, and KMALLOC_DMA. While the first three types are
created using new_kmalloc_cache(), KMALLOC_DMA caches are created in a
separate logic. Let KMALLOC_DMA caches be also created using
new_kmalloc_cache(), to enhance readability.

Historically, there were only KMALLOC_NORMAL caches and KMALLOC_DMA
caches in the first place, and they were initialized in two separate
logics. However, when KMALLOC_RECLAIM was introduced in v4.20 via
commit 1291523f2c ("mm, slab/slub: introduce kmalloc-reclaimable
caches") and KMALLOC_CGROUP was introduced in v5.14 via
commit 494c1dfe85 ("mm: memcg/slab: create a new set of kmalloc-cg-<n>
caches"), their creations were merged with KMALLOC_NORMAL's only.
KMALLOC_DMA creation logic should be merged with them, too.

By merging KMALLOC_DMA initialization with other types, the following
two changes might occur:
1. The order dma-kmalloc-<n> caches added in slab_cache list may be
sorted by size. i.e. the order they appear in /proc/slabinfo may change
as well.
2. slab_state will be set to UP after KMALLOC_DMA is created.
In case of slub, freelist randomization is dependent on slab_state>=UP,
and therefore KMALLOC_DMA cache's freelist will not be randomized in
creation, but will be deferred to init_freelist_randomization().

Co-developed-by: JaeSang Yoo <jsyoo5b@gmail.com>
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
Signed-off-by: Ohhoon Kwon <ohkwon1043@gmail.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220410162511.656541-1-ohkwon1043@gmail.com
2022-04-13 09:07:06 +02:00
Oliver Glitta
5cf909c553 mm/slub: use stackdepot to save stack trace in objects
Many stack traces are similar so there are many similar arrays.
Stackdepot saves each unique stack only once.

Replace field addrs in struct track with depot_stack_handle_t handle.  Use
stackdepot to save stack trace.

The benefits are smaller memory overhead and possibility to aggregate
per-cache statistics in the following patch using the stackdepot handle
instead of matching stacks manually.

[ vbabka@suse.cz: rebase to 5.17-rc1 and adjust accordingly ]

This was initially merged as commit 788691464c and reverted by commit
ae14c63a9f due to several issues, that should now be fixed.
The problem of unconditional memory overhead by stackdepot has been
addressed by commit 2dba5eb1c7 ("lib/stackdepot: allow optional init
and stack_table allocation by kvmalloc()"), so the dependency on
stackdepot will result in extra memory usage only when a slab cache
tracking is actually enabled, and not for all CONFIG_SLUB_DEBUG builds.
The build failures on some architectures were also addressed, and the
reported issue with xfs/433 test did not reproduce on 5.17-rc1 with this
patch.

Signed-off-by: Oliver Glitta <glittao@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
2022-04-06 11:03:32 +02:00
Miaohe Lin
7d6b6cc355 mm/slab_common: use helper function is_power_of_2()
Use helper function is_power_of_2() to check if KMALLOC_MIN_SIZE is power
of two. Minor readability improvement.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220217091609.8214-1-linmiaohe@huawei.com
2022-02-21 11:38:12 +01:00
Linus Torvalds
f56caedaf9 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "146 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
  dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
  memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
  userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
  ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
  damon)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
  mm/damon: hide kernel pointer from tracepoint event
  mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
  mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
  mm/damon/dbgfs: remove an unnecessary variable
  mm/damon: move the implementation of damon_insert_region to damon.h
  mm/damon: add access checking for hugetlb pages
  Docs/admin-guide/mm/damon/usage: update for schemes statistics
  mm/damon/dbgfs: support all DAMOS stats
  Docs/admin-guide/mm/damon/reclaim: document statistics parameters
  mm/damon/reclaim: provide reclamation statistics
  mm/damon/schemes: account how many times quota limit has exceeded
  mm/damon/schemes: account scheme actions that successfully applied
  mm/damon: remove a mistakenly added comment for a future feature
  Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
  Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
  Docs/admin-guide/mm/damon/usage: remove redundant information
  Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
  mm/damon: convert macro functions to static inline functions
  mm/damon: modify damon_rand() macro to static inline function
  mm/damon: move damon_rand() definition into damon.h
  ...
2022-01-15 20:37:06 +02:00
Quanfa Fu
0b8f0d8700 mm: fix some comment errors
Link: https://lkml.kernel.org/r/20211101040208.460810-1-fuqf0919@gmail.com
Signed-off-by: Quanfa Fu <fuqf0919@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:31 +02:00
Muchun Song
17c1736775 mm: memcontrol: make cgroup_memory_nokmem static
Commit 494c1dfe85 ("mm: memcg/slab: create a new set of kmalloc-cg-<n>
caches") makes cgroup_memory_nokmem global, however, it is unnecessary
because there is already a function mem_cgroup_kmem_disabled() which
exports it.

Just make it static and replace it with mem_cgroup_kmem_disabled() in
mm/slab_common.c.

Link: https://lkml.kernel.org/r/20211109065418.21693-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:27 +02:00
Marco Elver
bed0a9b591 kasan: add ability to detect double-kmem_cache_destroy()
Because mm/slab_common.c is not instrumented with software KASAN modes,
it is not possible to detect use-after-free of the kmem_cache passed
into kmem_cache_destroy().  In particular, because of the s->refcount--
and subsequent early return if non-zero, KASAN would never be able to
see the double-free via kmem_cache_free(kmem_cache, s).  To be able to
detect a double-kmem_cache_destroy(), check accessibility of the
kmem_cache, and in case of failure return early.

While KASAN_HW_TAGS is able to detect such bugs, by checking
accessibility and returning early we fail more gracefully and also avoid
corrupting reused objects (where tags mismatch).

A recent case of a double-kmem_cache_destroy() was detected by KFENCE:
https://lkml.kernel.org/r/0000000000003f654905c168b09d@google.com, which
was not detectable by software KASAN modes.

Link: https://lkml.kernel.org/r/20211119142219.1519617-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:26 +02:00
Muchun Song
c29b5b3d33 mm: slab: make slab iterator functions static
There is no external users of slab_start/next/stop(), so make them
static.  And the memory.kmem.slabinfo is deprecated, which outputs
nothing now, so move memcg_slab_show() into mm/memcontrol.c and rename
it to mem_cgroup_slab_show to be consistent with other function names.

Link: https://lkml.kernel.org/r/20211109133359.32881-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:25 +02:00
Marco Elver
7302e91f39 mm/slab_common: use WARN() if cache still has objects on destroy
Calling kmem_cache_destroy() while the cache still has objects allocated
is a kernel bug, and will usually result in the entire cache being
leaked.  While the message in kmem_cache_destroy() resembles a warning,
it is currently not implemented using a real WARN().

This is problematic for infrastructure testing the kernel, all of which
rely on the specific format of WARN()s to pick up on bugs.

Some 13 years ago this used to be a simple WARN_ON() in slub, but commit
d629d81957 ("slub: improve kmem_cache_destroy() error message")
changed it into an open-coded warning to avoid confusion with a bug in
slub itself.

Instead, turn the open-coded warning into a real WARN() with the message
preserved, so that test systems can actually identify these issues, and
we get all the other benefits of using a normal WARN().  The warning
message is extended with "when called from <caller-ip>" to make it even
clearer where the fault lies.

For most configurations this is only a cosmetic change, however, note
that WARN() here will now also respect panic_on_warn.

Link: https://lkml.kernel.org/r/20211102170733.648216-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:25 +02:00
Matthew Wilcox (Oracle)
7213230af5 mm: Use struct slab in kmem_obj_info()
All three implementations of slab support kmem_obj_info() which reports
details of an object allocated from the slab allocator.  By using the
slab type instead of the page type, we make it obvious that this can
only be called for slabs.

[ vbabka@suse.cz: also convert the related kmem_valid_obj() to folios ]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
2022-01-06 12:25:51 +01:00
Stephen Kitt
53944f171a mm: remove HARDENED_USERCOPY_FALLBACK
This has served its purpose and is no longer used.  All usercopy
violations appear to have been handled by now, any remaining instances
(or new bugs) will cause copies to be rejected.

This isn't a direct revert of commit 2d891fbc3b ("usercopy: Allow
strict enforcement of whitelists"); since usercopy_fallback is
effectively 0, the fallback handling is removed too.

This also removes the usercopy_fallback module parameter on slab_common.

Link: https://github.com/KSPP/linux/issues/153
Link: https://lkml.kernel.org/r/20210921061149.1091163-1-steve@sk2.org
Signed-off-by: Stephen Kitt <steve@sk2.org>
Suggested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Stanley <joel@jms.id.au>	[defconfig change]
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E . Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:43 -07:00
Sebastian Andrzej Siewior
5a836bf6b0 mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context
flush_all() flushes a specific SLAB cache on each CPU (where the cache
is present). The deactivate_slab()/__free_slab() invocation happens
within IPI handler and is problematic for PREEMPT_RT.

The flush operation is not a frequent operation or a hot path. The
per-CPU flush operation can be moved to within a workqueue.

Because a workqueue handler, unlike IPI handler, does not disable irqs,
flush_slab() now has to disable them for working with the kmem_cache_cpu
fields. deactivate_slab() is safe to call with irqs enabled.

[vbabka@suse.cz: adapt to new SLUB changes]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-04 01:12:23 +02:00
Linus Torvalds
28e92f9903 Merge branch 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:

 - Bitmap parsing support for "all" as an alias for all bits

 - Documentation updates

 - Miscellaneous fixes, including some that overlap into mm and lockdep

 - kvfree_rcu() updates

 - mem_dump_obj() updates, with acks from one of the slab-allocator
   maintainers

 - RCU NOCB CPU updates, including limited deoffloading

 - SRCU updates

 - Tasks-RCU updates

 - Torture-test updates

* 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
  tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
  rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
  rcu: Add missing __releases() annotation
  rcu: Remove obsolete rcu_read_unlock() deadlock commentary
  rcu: Improve comments describing RCU read-side critical sections
  rcu: Create an unrcu_pointer() to remove __rcu from a pointer
  srcu: Early test SRCU polling start
  rcu: Fix various typos in comments
  rcu/nocb: Unify timers
  rcu/nocb: Prepare for fine-grained deferred wakeup
  rcu/nocb: Only cancel nocb timer if not polling
  rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
  rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
  rcu/nocb: Allow de-offloading rdp leader
  rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
  rcu: Don't penalize priority boosting when there is nothing to boost
  rcu: Point to documentation of ordering guarantees
  rcu: Make rcu_gp_cleanup() be noinline for tracing
  rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
  rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
  ...
2021-07-04 12:58:33 -07:00