android_kernel_samsung_sm8650

History

Charan Teja Reddy 88153d9a99 ANDROID: vmscan: Support multiple kswapd threads per node

Page replacement is handled in the Linux Kernel in one of two ways:

1) Asynchronously via kswapd
2) Synchronously, via direct reclaim

At page allocation time the allocating task is immediately given a page from the zone free list allowing it to go right back to work doing whatever it was doing; probably directly or indirectly executing business logic.

Just prior to satisfying the allocation, the free page count is checked to see if it has reached the zone low watermark and if so, kswapd is awakened. Kswapd will start scanning pages looking for inactive pages to evict to make room for new page allocations. The work of kswapd allows tasks to continue allocating memory from their respective zone free list without incurring any delay.

When the demand for free pages exceeds the rate that kswapd tasks can supply them, page allocation works differently. Once the allocating task finds that the number of free pages is at or below the zone min watermark, the task will no longer pull pages from the free list. Instead, the task will run the same CPU-bound routines as kswapd to satisfy its own allocation by scanning and evicting pages. This is called a direct reclaim.

The time spent performing a direct reclaim can be substantial, often taking tens to hundreds of milliseconds for small order0 allocations to half a second or more for order9 huge-page allocations. In fact, kswapd is not actually required on a Linux system. It exists for the sole purpose of optimizing performance by preventing direct reclaims.

When memory shortfall is sufficient to trigger direct reclaims, they can occur in any task that is running on the system. A single aggressive memory allocating task can set the stage for collateral damage to occur in small tasks that rarely allocate additional memory. Consider the impact of injecting an additional 100ms of latency when nscd allocates memory to facilitate caching of a DNS query.
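The watermark behavior described above can be pictured with a small toy model. This is only a sketch, not kernel code: the function name `allocation_path` and the example watermark values are hypothetical, and the real allocator also accounts for allocation order, zone fallback and other factors that are ignored here.

```python
def allocation_path(free_pages, wmark_min, wmark_low):
    """Classify how an allocation is served in this simplified model.

    free_pages, wmark_min and wmark_low are page counts for one zone.
    All names and thresholds are illustrative assumptions.
    """
    if free_pages <= wmark_min:
        # At or below the min watermark: the task reclaims pages itself.
        return "direct_reclaim"
    if free_pages <= wmark_low:
        # Low watermark reached: the page is still handed out from the
        # free list, but kswapd is woken to reclaim in the background.
        return "wake_kswapd"
    # Plenty of free pages: serve straight from the free list.
    return "fast_path"


if __name__ == "__main__":
    for free in (5000, 1500, 800):
        print(free, allocation_path(free, wmark_min=1000, wmark_low=2000))
```

In this model only the "direct_reclaim" outcome delays the allocating task, which is the case kswapd exists to prevent.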
The presence of direct reclaims 10 years ago was a fairly reliable indicator that too much was being asked of a Linux system. Kswapd was likely wasting time scanning pages that were ineligible for eviction. Adding RAM or reducing the working set size would usually make the problem go away. Since then hardware has evolved to bring a new struggle for kswapd. Storage speeds have increased by orders of magnitude while CPU clock speeds stayed the same or even slowed down in exchange for more cores per package. This presents a throughput problem for a single threaded kswapd that will get worse with each generation of new hardware.

Test Details

NOTE: The tests below were run with shadow entries disabled. See the associated patch and cover letter for details.

The tests below were designed with the assumption that a kswapd bottleneck is best demonstrated using filesystem reads. This way, the inactive list will be full of clean pages, simplifying the analysis and allowing kswapd to achieve the highest possible steal rate. Maximum steal rates for kswapd are likely to be the same or lower for any other mix of page types on the system.

Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores, 756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has an XFS file system mounted separately as /d0 through /d7. SSD drives require multiple concurrent streams to show their potential, so I created eleven 250GB zero-filled files on each drive so that I could test with parallel reads.

The test script runs in multiple stages. At each stage, the number of dd tasks run concurrently is increased by 2. I did not include all of the test output for brevity.

During each stage dd tasks are launched to read from each drive in a round robin fashion until the specified number of tasks for the stage has been reached. Then iostat, vmstat and top are started in the background with 10 second intervals.
After five minutes, all of the dd tasks are killed and the iostat, vmstat and top output is parsed in order to report the following:

CPU consumption
- sy - aggregate kernel mode CPU consumption from vmstat output. The value doesn't tend to fluctuate much so I just grab the highest value. Each sample is averaged over 10 seconds
- dd_cpu - for all of the dd tasks averaged across the top samples since there is a lot of variation.

Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total

This first test performs reads using O_DIRECT in order to show the maximum throughput that can be obtained using these drives. It also demonstrates how rapidly throughput scales as the number of dd tasks are increased.

The dd command for this test looks like this:

Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M

Test #1: Direct IO

dd  sy  dd_cpu  throughput
6   0   2.33    14726026.40
10  1   2.95    19954974.80
16  1   2.63    24419689.30
22  1   2.63    25430303.20
28  1   2.91    26026513.20
34  1   2.53    26178618.00
40  1   2.18    26239229.20
46  1   1.91    26250550.40
52  1   1.69    26251845.60
58  1   1.54    26253205.60
64  1   1.43    26253780.80
70  1   1.31    26254154.80
76  1   1.21    26253660.80
82  1   1.12    26254214.80
88  1   1.07    26253770.00
90  1   1.04    26252406.40

Throughput was close to peak with only 22 dd tasks. Very little system CPU was consumed as expected as the drives DMA directly into the user address space when using direct IO.

In this next test, the iflag=direct option is removed and we only run the test until the pgscan_kswapd from /proc/vmstat starts to increment. At that point metrics are parsed and reported and the pagecache contents are dropped prior to the next test. Lather, rinse, repeat.
Test #2: standard file system IO, no page replacement

dd  sy  dd_cpu  throughput
6   2   28.78   5134316.40
10  3   31.40   8051218.40
16  5   34.73   11438106.80
22  7   33.65   14140596.40
28  8   31.24   16393455.20
34  10  29.88   18219463.60
40  11  28.33   19644159.60
46  11  25.05   20802497.60
52  13  26.92   22092370.00
58  13  23.29   22884881.20
64  14  23.12   23452248.80
70  15  22.40   23916468.00
76  16  22.06   24328737.20
82  17  20.97   24718693.20
88  16  18.57   25149404.40
90  16  18.31   25245565.60

Each read has to pause after the buffer in kernel space is populated while those pages are added to the pagecache and copied into the user address space. For this reason, more parallel streams are required to achieve peak throughput. The copy operation consumes substantially more CPU than direct IO as expected.

The next test measures throughput after kswapd starts running. This is the same test only we wait for kswapd to wake up before we start collecting metrics. The script actually keeps track of a few things that were not mentioned earlier. It tracks direct reclaims and page scans by watching the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the same way it is tracked for dd.

Since the test is 100% reads, you can assume that the page steal rate for kswapd and direct reclaims is almost identical to the scan rate.
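The original test script is not part of this commit message, so the following is only a sketch of how scan counters could be watched in /proc/vmstat. The helper names `parse_vmstat` and `scan_deltas` are hypothetical; the only assumption about the file is its usual layout of one "name value" pair per line.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def scan_deltas(before, after, keys=("pgscan_kswapd", "pgscan_direct")):
    """Return how much each selected counter grew between two snapshots."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


if __name__ == "__main__":
    # Made-up snapshots purely for illustration, not measured data.
    first = parse_vmstat("pgscan_kswapd 100\npgscan_direct 0\n")
    second = parse_vmstat("pgscan_kswapd 350\npgscan_direct 40\n")
    print(scan_deltas(first, second))
```

On a Linux system the snapshots would come from reading /proc/vmstat at the start and end of a measurement interval; a non-zero pgscan_kswapd delta indicates that kswapd has started scanning.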
Test #3: 1 kswapd thread per node

dd  sy  dd_cpu  kswapd0  kswapd1  throughput   dr     pgscan_kswapd  pgscan_direct
10  4   26.07   28.56    27.03    7355924.40   0      459316976      0
16  7   34.94   69.33    69.66    10867895.20  0      872661643      0
22  10  36.03   93.99    99.33    13130613.60  489    1037654473     11268334
28  10  30.34   95.90    98.60    14601509.60  671    1182591373     15429142
34  14  34.77   97.50    99.23    16468012.00  10850  1069005644     249839515
40  17  36.32   91.49    97.11    17335987.60  18903  975417728      434467710
46  19  38.40   90.54    91.61    17705394.40  25369  855737040      582427973
52  22  40.88   83.97    83.70    17607680.40  31250  709532935      724282458
58  25  40.89   82.19    80.14    17976905.60  35060  657796473      804117540
64  28  41.77   73.49    75.20    18001910.00  39073  561813658      895289337
70  33  45.51   63.78    64.39    17061897.20  44523  379465571      1020726436
76  36  46.95   57.96    60.32    16964459.60  47717  291299464      1093172384
82  39  47.16   55.43    56.16    16949956.00  49479  247071062      1134163008
88  42  47.41   53.75    47.62    16930911.20  51521  195449924      1180442208
90  43  47.18   51.40    50.59    16864428.00  51618  190758156      1183203901

In the previous test where kswapd was not involved, the system-wide kernel mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA node), kswapd can only be responsible for a little over 4% of the increase. The rest is likely caused by 51,618 direct reclaims that scanned 1.2 billion pages over the five minute time period of the test.
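To put the 90-task row of Test #3 in perspective, the share of page scanning done by direct reclaim can be computed from its two pgscan counters. The helper name `direct_scan_share` is hypothetical and the snippet is only a sketch; the input values are copied from the Test #3 results.

```python
def direct_scan_share(pgscan_kswapd, pgscan_direct):
    """Fraction of all scanned pages that were scanned by direct reclaim."""
    total = pgscan_kswapd + pgscan_direct
    # Guard against an idle interval where nothing was scanned.
    return pgscan_direct / total if total else 0.0


if __name__ == "__main__":
    # Test #3 with 90 dd tasks: pgscan_kswapd and pgscan_direct.
    share = direct_scan_share(190758156, 1183203901)
    print(f"direct reclaim share of scans: {share:.1%}")
```

With these counters, roughly 86% of the scanning in that stage was performed by the allocating tasks themselves rather than by kswapd.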
Same test, more kswapd tasks:

Test #4: 4 kswapd threads per node

dd  sy  dd_cpu  kswapd0  kswapd1  throughput   dr     pgscan_kswapd  pgscan_direct
10  5   27.09   16.65    14.17    7842605.60   0      459105291      0
16  10  37.12   26.02    24.85    11352920.40  15     920527796      358515
22  11  36.94   37.13    35.82    13771869.60  0      1132169011     0
28  13  35.23   48.43    46.86    16089746.00  0      1312902070     0
34  15  33.37   53.02    55.69    18314856.40  0      1476169080     0
40  19  35.90   69.60    64.41    19836126.80  0      1629999149     0
46  22  36.82   88.55    57.20    20740216.40  0      1708478106     0
52  24  34.38   93.76    68.34    21758352.00  0      1794055559     0
58  24  30.51   79.20    82.33    22735594.00  0      1872794397     0
64  26  30.21   97.12    76.73    23302203.60  176    1916593721     4206821
70  33  32.92   92.91    92.87    23776588.00  3575   1817685086     85574159
76  37  31.62   91.20    89.83    24308196.80  4752   1812262569     113981763
82  29  25.53   93.23    92.33    24802791.20  306    2032093122     7350704
88  43  37.12   76.18    77.01    25145694.40  20310  1253204719     487048202
90  42  38.56   73.90    74.57    22516787.60  22774  1193637495     545463615

By increasing the number of kswapd threads, throughput increased by ~50% while kernel mode CPU utilization decreased or stayed the same, likely due to a decrease in the number of parallel tasks at any given time doing page replacement.

Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Bug: 201263306
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes made to select number of kswapds through uapi
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
[quic_vjitta@quicinc.com]: Changes made to move multiple kswapd threads logic to vendor hooks
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
(cherry picked from commit 0d61a651e4dd3c61d1658cc92e0b0450c8374738)
Change-Id: I8425cab7f40cbeaf65af0ea118c1a9ac7da0930e
[quic_vjitta@quicinc.com]: Resolved minor merge conflicts
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
2023-04-26 17:01:51 +00:00
..
accessibility	tty: fix possible null-ptr-defer in spk_ttyio_release	2023-01-24 07:24:37 +01:00
acpi	ACPI: resource: Add Medion S17413 to IRQ override quirk	2023-04-20 12:35:12 +02:00
amba
android	ANDROID: vmscan: Support multiple kswapd threads per node	2023-04-26 17:01:51 +00:00
ata	UPSTREAM: scsi: ata: libata-scsi: Convert to scsi_execute_cmd()	2023-03-15 16:17:14 +00:00
atm	atm: idt77252: fix kmemleak when rmmod idt77252	2023-03-30 12:49:09 +02:00
auxdisplay	auxdisplay: hd44780: Fix potential memory leak in hd44780_remove()	2023-03-11 13:55:16 +01:00
base	Merge 6.1.18 into android14-6.1	2023-03-21 08:22:15 +00:00
bcma
block	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
bluetooth	bluetooth: btbcm: Fix logic error in forming the board name.	2023-04-20 12:35:06 +02:00
bus	bus: imx-weim: fix branch condition evaluates to a garbage value	2023-03-30 12:49:29 +02:00
cdrom
char	tpm/eventlog: Don't abort tpm_read_log on faulty ACPI address	2023-03-17 08:50:30 +01:00
clk	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
clocksource	FROMGIT: clocksource/drivers/timer-mediatek: Split out CPUXGPT timers	2023-04-21 14:54:53 +00:00
comedi	comedi: adv_pci1760: Fix PWM instruction handling	2023-01-24 07:24:35 +01:00
connector
counter	counter: 104-quad-8: Fix Synapse action reported for Index signals	2023-04-13 16:55:31 +02:00
cpufreq	Revert "ANDROID: cpufreq: Add a restricted vendor hook for freq transition"	2023-03-31 18:25:45 +00:00
cpuidle	Merge 6.1.21 into android14-6.1	2023-03-24 08:47:17 +00:00
crypto	crypto: qat - fix out-of-bounds read	2023-03-10 09:34:19 +01:00
cxl	cxl/pci: Handle excessive CDAT length	2023-04-13 16:55:25 +02:00
dax	dax/kmem: Fix leak of memory-hotplug resources	2023-03-10 09:34:25 +01:00
dca
devfreq	PM/devfreq: governor: Add a private governor_data for governor	2023-01-07 11:11:40 +01:00
dio	drivers: dio: fix possible memory leak in dio_init()	2022-12-31 13:32:38 +01:00
dma	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
dma-buf	ANDROID: dma-buf: heaps: dmabuf page pool spinlock should be spinlock_t	2023-04-26 17:01:50 +00:00
edac	EDAC/qcom: Do not pass llcc_driv_data as edac_device_ctl_info's pvt_info	2023-02-01 08:34:40 +01:00
eisa
extcon	extcon: usbc-tusb320: Update state on probe even if no IRQ pending	2022-12-31 13:32:39 +01:00
firewire	firewire: fix memory leak for payload of request subaction to IEC 61883-1 FCP region	2023-02-09 11:27:59 +01:00
firmware	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
fpga	fpga: microchip-spi: rewrite status polling in a time measurable way	2023-03-10 09:33:34 +01:00
fsi	use less confusing names for iov_iter direction initializers	2023-02-09 11:28:04 +01:00
gnss
gpio	Revert "pwm: Make .get_state() callback return an error code"	2023-04-26 10:00:48 +00:00
gpu	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
greybus
hid	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
hsi	HSI: omap_ssi_core: Fix error handling in ssi_init()	2022-12-31 13:32:45 +01:00
hte
hv	Merge 6.1.24 into android14-6.1	2023-04-22 08:52:25 +00:00
hwmon	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
hwspinlock
hwtracing	coresight-etm4: Fix for() loop drvdata->nr_addr_cmp range bug	2023-04-13 16:55:30 +02:00
i2c	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
i3c
idle	Revert "cpuidle, intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE again"	2023-04-06 12:10:58 +02:00
iio	iio: adc: ad7791: fix IRQ flags	2023-04-13 16:55:31 +02:00
infiniband	RDMA/core: Fix GID entry ref leak when create_ah fails	2023-04-20 12:35:10 +02:00
input	Input: goodix - add Lenovo Yoga Book X90F to nine_bytes_report DMI table	2023-04-06 12:10:50 +02:00
interconnect	interconnect: qcom: qcm2290: Fix MASTER_SNOC_BIMC_NRT	2023-03-30 12:48:59 +02:00
iommu	UPSTREAM: iommu: Rename iommu-sva-lib.{c,h}	2023-04-12 02:08:28 +00:00
ipack
irqchip	ANDROID: gic: Add vendor hook for gic-v3 resume	2023-03-20 10:53:38 -07:00
isdn	use less confusing names for iov_iter direction initializers	2023-02-09 11:28:04 +01:00
leds	Revert "pwm: Make .get_state() callback return an error code"	2023-04-26 10:00:48 +00:00
macintosh	macintosh: windfarm: Use unsigned type for 1-bit bitfields	2023-03-17 08:50:31 +01:00
mailbox	ANDROID: virt: gunyah: Move arch_is_gh_guest under RM probe	2023-04-11 15:26:03 +00:00
mcb	mcb: mcb-parse: fix error handing in chameleon_parse_gdd()	2022-12-31 13:32:41 +01:00
md	Merge 6.1.24 into android14-6.1	2023-04-22 08:52:25 +00:00
media	FROMGIT: media: add RealVideo format RV30 and RV40	2023-04-24 10:45:38 +00:00
memory	memory: tegra30-emc: fix interconnect registration race	2023-03-22 13:33:56 +01:00
memstick	memstick/ms_block: Add check for alloc_ordered_workqueue	2022-12-31 13:32:25 +01:00
message	FROMGIT: scsi: core: Change the return type of .eh_timed_out()	2023-03-15 16:17:14 +00:00
mfd	mfd: arizona: Use pm_runtime_resume_and_get() to prevent refcnt leak	2023-03-11 13:55:32 +01:00
misc	UPSTREAM: iommu: Remove SVM_FLAG_SUPERVISOR_MODE support	2023-04-12 02:08:27 +00:00
mmc	Merge 6.1.21 into android14-6.1	2023-03-24 08:47:17 +00:00
most
mtd	ubi: Fix deadlock caused by recursively holding work_sem	2023-04-20 12:35:14 +02:00
mux
net	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
nfc	nfc: st-nci: Fix use after free bug in ndlc_remove due to race condition	2023-03-22 13:33:46 +01:00
ntb
nubus
nvdimm	cxl/pmem: Fix nvdimm registration races	2023-03-10 09:34:20 +01:00
nvme	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
nvmem	nvmem: core: fix return value	2023-02-09 11:28:25 +01:00
of	ANDROID: of: of_reserved_mem: Increase limit for reserved_mem regions	2023-03-22 14:27:16 +00:00
opp	OPP: fix error checking in opp_migrate_dentry()	2023-03-10 09:33:01 +01:00
parisc	parisc: led: Fix potential null-ptr-deref in start_task()	2023-01-07 11:11:55 +01:00
parport
pci	Merge 6.1.24 into android14-6.1	2023-04-22 08:52:25 +00:00
pcmcia
peci
perf	Partially revert "perf/arm-cmn: Optimise DTC counter accesses"	2023-02-01 08:34:49 +01:00
phy	phy: rockchip-typec: Fix unsigned comparison with less than zero	2023-03-11 13:55:40 +01:00
pinctrl	Revert "pinctrl: amd: Disable and mask interrupts on resume"	2023-04-20 12:35:05 +02:00
platform	platform/x86: think-lmi: Clean up display of current_value on Thinkstation	2023-04-13 16:55:22 +02:00
pnp	PNP: fix name memory leak in pnp_alloc_dev()	2022-12-31 13:31:56 +01:00
power	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
powercap	powercap: fix possible name leak in powercap_register_zone()	2023-03-10 09:32:56 +01:00
pps
ps3
ptp	ptp_qoriq: fix memory leak in probe()	2023-04-06 12:10:44 +02:00
pwm	Revert "pwm: Make .get_state() callback return an error code"	2023-04-26 10:00:48 +00:00
rapidio	rapidio: devices: fix missing put_device in mport_cdev_open	2022-12-31 13:32:00 +01:00
ras
regulator	regulator: Handle deferred clk	2023-04-06 12:10:46 +02:00
remoteproc	Merge 6.1.16 into android14-6.1	2023-03-13 15:45:34 +00:00
reset	reset: uniphier-glue: Fix possible null-ptr-deref	2023-02-01 08:34:05 +01:00
rpmsg	rpmsg: glink: Release driver_override	2023-03-10 09:33:45 +01:00
rtc	rtc: allow rtc_read_alarm without read_alarm callback	2023-03-11 13:55:30 +01:00
s390	s390/vfio-ap: fix memory leak in vfio_ap device driver	2023-04-06 12:10:46 +02:00
sbus
scsi	This is the 6.1.25 stable release	2023-04-26 13:13:19 +00:00
sh
siox
slimbus
soc	FROMGIT: soc: qcom: geni-se: Move qcom-geni-se.h to linux/soc/qcom/geni-se.h	2023-04-13 14:26:27 +00:00
soundwire	soundwire: cadence: Drain the RX FIFO after an IO timeout	2023-03-11 13:55:40 +01:00
spi	FROMGIT: soc: qcom: geni-se: Move qcom-geni-se.h to linux/soc/qcom/geni-se.h	2023-04-13 14:26:27 +00:00
spmi
ssb
staging	FROMLIST: staging: greybus: drop loopback test files	2023-04-06 15:46:10 +00:00
target	Merge 6.1.22 into android14-6.1	2023-03-31 08:15:39 +00:00
tc
tee	tee: amdtee: fix race condition in amdtee_open_session	2023-03-30 12:49:29 +02:00
thermal	Merge 6.1.18 into android14-6.1	2023-03-21 08:22:15 +00:00
thunderbolt	thunderbolt: Limit USB3 bandwidth of certain Intel USB4 host routers	2023-04-06 12:10:33 +02:00
tty	Merge 6.1.24 into android14-6.1	2023-04-22 08:52:25 +00:00
ufs	UPSTREAM: scsi: ufs: core: Print trs for pending requests in MCQ mode	2023-04-25 01:29:15 +00:00
uio	uio: uio_dmem_genirq: Fix deadlock between irq config and handling	2022-12-31 13:32:38 +01:00
usb	ANDROID: preserve CRC for xhci symbols	2023-04-26 10:00:48 +00:00
vdpa	vp_vdpa: fix the crash in hot unplug with vp_vdpa	2023-03-22 13:34:03 +01:00
vfio	vfio/type1: restore locked_vm	2023-03-10 09:34:32 +01:00
vhost	vhost-vdpa: free iommu domain after last use during cleanup	2023-03-22 13:33:44 +01:00
video	fbcon: set_con2fb_map needs to set con2fb_map!	2023-04-20 12:35:07 +02:00
virt	ANDROID: virt: gunyah: Move arch_is_gh_guest under RM probe	2023-04-11 15:26:03 +00:00
virtio	Merge 6.1.8 into android14-6.1	2023-01-26 12:13:04 +00:00
vlynq
w1	w1: fix WARNING after calling w1_process()	2023-02-01 08:34:26 +01:00
watchdog	watchdog: sbsa_wdog: Make sure the timeout programming is within the limits	2023-03-11 13:55:24 +01:00
xen	xen/grant-dma-iommu: Implement a dummy probe_device() callback	2023-03-10 09:33:02 +01:00
zorro
Kconfig
Makefile
OWNERS