android_kernel_samsung_sm8650/drivers
Charan Teja Reddy 88153d9a99 ANDROID: vmscan: Support multiple kswapd threads per node
Page replacement is handled in the Linux Kernel in one of two ways:

1) Asynchronously via kswapd
2) Synchronously, via direct reclaim

At page allocation time, the allocating task is immediately given a page
from the zone free list, allowing it to go right back to whatever it was
doing, which is most likely executing business logic, directly or
indirectly.

Just prior to satisfying the allocation, the free page count is checked to
see whether it has reached the zone low watermark; if so, kswapd is
awakened. Kswapd then starts scanning for inactive pages to evict to make
room for new page allocations. The work of kswapd allows tasks to continue
allocating memory from their respective zone free list without incurring
any delay.

When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min watermark,
the task will no longer pull pages from the free list. Instead, the task
will run the same CPU-bound routines as kswapd to satisfy its own
allocation by scanning and evicting pages. This is called a direct reclaim.
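
The two thresholds described above can be summarized in a short sketch.
This is a simplified model, not the kernel's allocator code: the function
and enum names are hypothetical, and the real allocation path considers
more conditions than a single free page count.

```python
from enum import Enum


class Path(Enum):
    FAST = "allocate from the free list"
    WAKE_KSWAPD = "allocate from the free list and wake kswapd"
    DIRECT_RECLAIM = "reclaim pages in the allocating task"


def allocation_path(free_pages, wmark_min, wmark_low):
    """Pick the path an allocation takes, per the description above.

    Hypothetical helper: it models only the min and low watermark checks.
    """
    if free_pages <= wmark_min:
        # At or below min: the task scans and evicts pages itself.
        return Path.DIRECT_RECLAIM
    if free_pages <= wmark_low:
        # Low watermark reached: the task still gets its page, and kswapd
        # is woken to reclaim in the background.
        return Path.WAKE_KSWAPD
    return Path.FAST
```

For example, with wmark_min=100 and wmark_low=200, a zone with 150 free
pages wakes kswapd, while a zone with 100 free pages forces direct reclaim.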

The time spent performing a direct reclaim can be substantial, often
ranging from tens to hundreds of milliseconds for small order-0
allocations to half a second or more for order-9 huge-page allocations.
In fact, kswapd is not actually required on a Linux system. It exists for
the sole purpose of optimizing performance by preventing direct reclaims.

When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.

The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a single
threaded kswapd that will get worse with each generation of new hardware.

Test Details

NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details.

The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.

Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.

The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.

During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:

CPU consumption
- sy - aggregate kernel mode CPU consumption from vmstat output. The value
       doesn't tend to fluctuate much so I just grab the highest value.
       Each sample is averaged over 10 seconds
- dd_cpu - for all of the dd tasks averaged across the top samples since
           there is a lot of variation.

Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total
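
The round-robin placement of dd tasks within a stage can be sketched as
follows. This is an illustration only, not the test script itself: the
function name is hypothetical, and only the /d0 through /d7 mount points
come from the description above.

```python
def assign_drives(num_tasks, num_drives=8):
    """Return the mount point each dd task reads from, in launch order.

    Hypothetical helper: tasks are spread over /d0../d7 round robin.
    """
    return [f"/d{task % num_drives}" for task in range(num_tasks)]


def tasks_per_drive(num_tasks, num_drives=8):
    """Count how many dd tasks land on each mount point."""
    counts = {f"/d{i}": 0 for i in range(num_drives)}
    for mount in assign_drives(num_tasks, num_drives):
        counts[mount] += 1
    return counts
```

When the task count is not a multiple of eight, the lowest-numbered drives
carry one extra reader each.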

This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks is increased.

The dd command for this test looks like this:

Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M

Test #1: Direct IO
dd sy dd_cpu throughput
6  0  2.33   14726026.40
10 1  2.95   19954974.80
16 1  2.63   24419689.30
22 1  2.63   25430303.20
28 1  2.91   26026513.20
34 1  2.53   26178618.00
40 1  2.18   26239229.20
46 1  1.91   26250550.40
52 1  1.69   26251845.60
58 1  1.54   26253205.60
64 1  1.43   26253780.80
70 1  1.31   26254154.80
76 1  1.21   26253660.80
82 1  1.12   26254214.80
88 1  1.07   26253770.00
90 1  1.04   26252406.40

Throughput was close to peak with only 22 dd tasks. As expected, very
little system CPU was consumed, because the drives DMA directly into the
user address space when using direct IO.

In this next test, the iflag=direct option is removed and we only run the
test until the pgscan_kswapd from /proc/vmstat starts to increment. At
that point metrics are parsed and reported and the pagecache contents are
dropped prior to the next test. Lather, rinse, repeat.
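
Detecting the point where pgscan_kswapd starts to increment amounts to
comparing two snapshots of /proc/vmstat. The sketch below is not the
author's script; the function names are hypothetical, and it assumes the
usual "name value" line format of that file.

```python
def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def kswapd_has_scanned(before, after):
    """True once pgscan_kswapd has grown between two snapshots."""
    return after.get("pgscan_kswapd", 0) > before.get("pgscan_kswapd", 0)
```

On a live system each snapshot would come from
parse_vmstat(open("/proc/vmstat").read()), taken a few seconds apart.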

Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6  2  28.78  5134316.40
10 3  31.40  8051218.40
16 5  34.73  11438106.80
22 7  33.65  14140596.40
28 8  31.24  16393455.20
34 10 29.88  18219463.60
40 11 28.33  19644159.60
46 11 25.05  20802497.60
52 13 26.92  22092370.00
58 13 23.29  22884881.20
64 14 23.12  23452248.80
70 15 22.40  23916468.00
76 16 22.06  24328737.20
82 17 20.97  24718693.20
88 16 18.57  25149404.40
90 16 18.31  25245565.60

Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.

The next test measures throughput after kswapd starts running. This is the
same test, except that we wait for kswapd to wake up before we start
collecting metrics. The script also keeps track of a few things that were
not mentioned earlier: it tracks direct reclaims and page scans by watching
the counters in /proc/vmstat, and it tracks CPU consumption for kswapd the
same way it does for dd.

Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.

Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput  dr    pgscan_kswapd pgscan_direct
10 4  26.07  28.56   27.03   7355924.40  0     459316976     0
16 7  34.94  69.33   69.66   10867895.20 0     872661643     0
22 10 36.03  93.99   99.33   13130613.60 489   1037654473    11268334
28 10 30.34  95.90   98.60   14601509.60 671   1182591373    15429142
34 14 34.77  97.50   99.23   16468012.00 10850 1069005644    249839515
40 17 36.32  91.49   97.11   17335987.60 18903 975417728     434467710
46 19 38.40  90.54   91.61   17705394.40 25369 855737040     582427973
52 22 40.88  83.97   83.70   17607680.40 31250 709532935     724282458
58 25 40.89  82.19   80.14   17976905.60 35060 657796473     804117540
64 28 41.77  73.49   75.20   18001910.00 39073 561813658     895289337
70 33 45.51  63.78   64.39   17061897.20 44523 379465571     1020726436
76 36 46.95  57.96   60.32   16964459.60 47717 291299464     1093172384
82 39 47.16  55.43   56.16   16949956.00 49479 247071062     1134163008
88 42 47.41  53.75   47.62   16930911.20 51521 195449924     1180442208
90 43 47.18  51.40   50.59   16864428.00 51618 190758156     1183203901

In the previous test, where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test, CPU consumption
with 90 tasks is 43%. With 52 cores and two kswapd tasks (one per NUMA
node), kswapd can account for at most about 4 percentage points of that
increase (2 of 52 cores). The rest is likely caused by the 51,618 direct
reclaims that scanned 1.2 billion pages over the five-minute test period.
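
The arithmetic behind that estimate can be checked with the figures from
the 90-task rows of Test #2 and Test #3. This is a back-of-the-envelope
sketch; it assumes that sy is a percentage of all 52 cores and that the
kswapd0/kswapd1 columns are percentages of a single core.

```python
CORES = 52
KSWAPD_TASKS = 2  # one per NUMA node

sy_without_kswapd = 16  # Test #2, 90 dd tasks
sy_with_kswapd = 43     # Test #3, 90 dd tasks
increase = sy_with_kswapd - sy_without_kswapd

# Upper bound: both kswapd tasks pinned at 100% of one core each.
kswapd_ceiling = KSWAPD_TASKS * 100 / CORES

# Measured kswapd0 + kswapd1 CPU at 90 dd tasks in Test #3.
kswapd_measured = (51.40 + 50.59) / CORES

# Share of the increase that kswapd cannot explain even at its ceiling.
unexplained = increase - kswapd_ceiling

print(f"increase: {increase} points")
print(f"kswapd ceiling: {kswapd_ceiling:.2f} points")
print(f"kswapd measured: {kswapd_measured:.2f} points")
print(f"left for direct reclaim and other work: {unexplained:.2f} points")
```

Under these assumptions, well over 20 of the 27 added percentage points
have to come from something other than kswapd.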

Same test, more kswapd tasks:

Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput  dr    pgscan_kswapd pgscan_direct
10 5  27.09  16.65   14.17   7842605.60  0     459105291     0
16 10 37.12  26.02   24.85   11352920.40 15    920527796     358515
22 11 36.94  37.13   35.82   13771869.60 0     1132169011    0
28 13 35.23  48.43   46.86   16089746.00 0     1312902070    0
34 15 33.37  53.02   55.69   18314856.40 0     1476169080    0
40 19 35.90  69.60   64.41   19836126.80 0     1629999149    0
46 22 36.82  88.55   57.20   20740216.40 0     1708478106    0
52 24 34.38  93.76   68.34   21758352.00 0     1794055559    0
58 24 30.51  79.20   82.33   22735594.00 0     1872794397    0
64 26 30.21  97.12   76.73   23302203.60 176   1916593721    4206821
70 33 32.92  92.91   92.87   23776588.00 3575  1817685086    85574159
76 37 31.62  91.20   89.83   24308196.80 4752  1812262569    113981763
82 29 25.53  93.23   92.33   24802791.20 306   2032093122    7350704
88 43 37.12  76.18   77.01   25145694.40 20310 1253204719    487048202
90 42 38.56  73.90   74.57   22516787.60 22774 1193637495    545463615

By increasing the number of kswapd threads, throughput increased by up to
~50% (at 88 dd tasks) while kernel mode CPU utilization decreased or stayed
the same, likely due to a decrease in the number of parallel tasks doing
page replacement at any given time.
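
The throughput gain can be recomputed from a few rows of the two tables
above. This is only a sketch; the dictionary and function names are
hypothetical, and the values are copied from the throughput columns of
Test #3 and Test #4.

```python
# Throughput in Kbytes, keyed by number of dd tasks.
one_kswapd_per_node = {64: 18001910.00, 88: 16930911.20, 90: 16864428.00}
four_kswapd_per_node = {64: 23302203.60, 88: 25145694.40, 90: 22516787.60}


def percent_gain(baseline, improved):
    """Relative throughput increase of improved over baseline, in percent."""
    return (improved / baseline - 1.0) * 100.0


gains = {
    dd: percent_gain(one_kswapd_per_node[dd], four_kswapd_per_node[dd])
    for dd in one_kswapd_per_node
}

for dd, gain in sorted(gains.items()):
    print(f"{dd} dd tasks: {gain:.1f}% more throughput")
```

The gain depends on the load level: it is largest at 88 dd tasks and
smaller at 64 and 90.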

Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Bug: 201263306
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes made to select number of kswapds through uapi
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
[quic_vjitta@quicinc.com]: Changes made to move multiple kswapd threads logic to vendor hooks
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
(cherry picked from commit 0d61a651e4dd3c61d1658cc92e0b0450c8374738)

Change-Id: I8425cab7f40cbeaf65af0ea118c1a9ac7da0930e
[quic_vjitta@quicinc.com]: Resolved minor merge conflicts
Signed-off-by: Vijayanand Jitta <quic_vjitta@quicinc.com>
2023-04-26 17:01:51 +00:00
..
accessibility tty: fix possible null-ptr-defer in spk_ttyio_release 2023-01-24 07:24:37 +01:00
acpi ACPI: resource: Add Medion S17413 to IRQ override quirk 2023-04-20 12:35:12 +02:00
amba
android ANDROID: vmscan: Support multiple kswapd threads per node 2023-04-26 17:01:51 +00:00
ata UPSTREAM: scsi: ata: libata-scsi: Convert to scsi_execute_cmd() 2023-03-15 16:17:14 +00:00
atm atm: idt77252: fix kmemleak when rmmod idt77252 2023-03-30 12:49:09 +02:00
auxdisplay auxdisplay: hd44780: Fix potential memory leak in hd44780_remove() 2023-03-11 13:55:16 +01:00
base Merge 6.1.18 into android14-6.1 2023-03-21 08:22:15 +00:00
bcma
block This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
bluetooth bluetooth: btbcm: Fix logic error in forming the board name. 2023-04-20 12:35:06 +02:00
bus bus: imx-weim: fix branch condition evaluates to a garbage value 2023-03-30 12:49:29 +02:00
cdrom
char tpm/eventlog: Don't abort tpm_read_log on faulty ACPI address 2023-03-17 08:50:30 +01:00
clk This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
clocksource FROMGIT: clocksource/drivers/timer-mediatek: Split out CPUXGPT timers 2023-04-21 14:54:53 +00:00
comedi comedi: adv_pci1760: Fix PWM instruction handling 2023-01-24 07:24:35 +01:00
connector
counter counter: 104-quad-8: Fix Synapse action reported for Index signals 2023-04-13 16:55:31 +02:00
cpufreq Revert "ANDROID: cpufreq: Add a restricted vendor hook for freq transition" 2023-03-31 18:25:45 +00:00
cpuidle Merge 6.1.21 into android14-6.1 2023-03-24 08:47:17 +00:00
crypto crypto: qat - fix out-of-bounds read 2023-03-10 09:34:19 +01:00
cxl cxl/pci: Handle excessive CDAT length 2023-04-13 16:55:25 +02:00
dax dax/kmem: Fix leak of memory-hotplug resources 2023-03-10 09:34:25 +01:00
dca
devfreq PM/devfreq: governor: Add a private governor_data for governor 2023-01-07 11:11:40 +01:00
dio drivers: dio: fix possible memory leak in dio_init() 2022-12-31 13:32:38 +01:00
dma This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
dma-buf ANDROID: dma-buf: heaps: dmabuf page pool spinlock should be spinlock_t 2023-04-26 17:01:50 +00:00
edac EDAC/qcom: Do not pass llcc_driv_data as edac_device_ctl_info's pvt_info 2023-02-01 08:34:40 +01:00
eisa
extcon extcon: usbc-tusb320: Update state on probe even if no IRQ pending 2022-12-31 13:32:39 +01:00
firewire firewire: fix memory leak for payload of request subaction to IEC 61883-1 FCP region 2023-02-09 11:27:59 +01:00
firmware This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
fpga fpga: microchip-spi: rewrite status polling in a time measurable way 2023-03-10 09:33:34 +01:00
fsi use less confusing names for iov_iter direction initializers 2023-02-09 11:28:04 +01:00
gnss
gpio Revert "pwm: Make .get_state() callback return an error code" 2023-04-26 10:00:48 +00:00
gpu This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
greybus
hid This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
hsi HSI: omap_ssi_core: Fix error handling in ssi_init() 2022-12-31 13:32:45 +01:00
hte
hv Merge 6.1.24 into android14-6.1 2023-04-22 08:52:25 +00:00
hwmon This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
hwspinlock
hwtracing coresight-etm4: Fix for() loop drvdata->nr_addr_cmp range bug 2023-04-13 16:55:30 +02:00
i2c This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
i3c
idle Revert "cpuidle, intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE *again*" 2023-04-06 12:10:58 +02:00
iio iio: adc: ad7791: fix IRQ flags 2023-04-13 16:55:31 +02:00
infiniband RDMA/core: Fix GID entry ref leak when create_ah fails 2023-04-20 12:35:10 +02:00
input Input: goodix - add Lenovo Yoga Book X90F to nine_bytes_report DMI table 2023-04-06 12:10:50 +02:00
interconnect interconnect: qcom: qcm2290: Fix MASTER_SNOC_BIMC_NRT 2023-03-30 12:48:59 +02:00
iommu UPSTREAM: iommu: Rename iommu-sva-lib.{c,h} 2023-04-12 02:08:28 +00:00
ipack
irqchip ANDROID: gic: Add vendor hook for gic-v3 resume 2023-03-20 10:53:38 -07:00
isdn use less confusing names for iov_iter direction initializers 2023-02-09 11:28:04 +01:00
leds Revert "pwm: Make .get_state() callback return an error code" 2023-04-26 10:00:48 +00:00
macintosh macintosh: windfarm: Use unsigned type for 1-bit bitfields 2023-03-17 08:50:31 +01:00
mailbox ANDROID: virt: gunyah: Move arch_is_gh_guest under RM probe 2023-04-11 15:26:03 +00:00
mcb mcb: mcb-parse: fix error handing in chameleon_parse_gdd() 2022-12-31 13:32:41 +01:00
md Merge 6.1.24 into android14-6.1 2023-04-22 08:52:25 +00:00
media FROMGIT: media: add RealVideo format RV30 and RV40 2023-04-24 10:45:38 +00:00
memory memory: tegra30-emc: fix interconnect registration race 2023-03-22 13:33:56 +01:00
memstick memstick/ms_block: Add check for alloc_ordered_workqueue 2022-12-31 13:32:25 +01:00
message FROMGIT: scsi: core: Change the return type of .eh_timed_out() 2023-03-15 16:17:14 +00:00
mfd mfd: arizona: Use pm_runtime_resume_and_get() to prevent refcnt leak 2023-03-11 13:55:32 +01:00
misc UPSTREAM: iommu: Remove SVM_FLAG_SUPERVISOR_MODE support 2023-04-12 02:08:27 +00:00
mmc Merge 6.1.21 into android14-6.1 2023-03-24 08:47:17 +00:00
most
mtd ubi: Fix deadlock caused by recursively holding work_sem 2023-04-20 12:35:14 +02:00
mux
net This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
nfc nfc: st-nci: Fix use after free bug in ndlc_remove due to race condition 2023-03-22 13:33:46 +01:00
ntb
nubus
nvdimm cxl/pmem: Fix nvdimm registration races 2023-03-10 09:34:20 +01:00
nvme This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
nvmem nvmem: core: fix return value 2023-02-09 11:28:25 +01:00
of ANDROID: of: of_reserved_mem: Increase limit for reserved_mem regions 2023-03-22 14:27:16 +00:00
opp OPP: fix error checking in opp_migrate_dentry() 2023-03-10 09:33:01 +01:00
parisc parisc: led: Fix potential null-ptr-deref in start_task() 2023-01-07 11:11:55 +01:00
parport
pci Merge 6.1.24 into android14-6.1 2023-04-22 08:52:25 +00:00
pcmcia
peci
perf Partially revert "perf/arm-cmn: Optimise DTC counter accesses" 2023-02-01 08:34:49 +01:00
phy phy: rockchip-typec: Fix unsigned comparison with less than zero 2023-03-11 13:55:40 +01:00
pinctrl Revert "pinctrl: amd: Disable and mask interrupts on resume" 2023-04-20 12:35:05 +02:00
platform platform/x86: think-lmi: Clean up display of current_value on Thinkstation 2023-04-13 16:55:22 +02:00
pnp PNP: fix name memory leak in pnp_alloc_dev() 2022-12-31 13:31:56 +01:00
power This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
powercap powercap: fix possible name leak in powercap_register_zone() 2023-03-10 09:32:56 +01:00
pps
ps3
ptp ptp_qoriq: fix memory leak in probe() 2023-04-06 12:10:44 +02:00
pwm Revert "pwm: Make .get_state() callback return an error code" 2023-04-26 10:00:48 +00:00
rapidio rapidio: devices: fix missing put_device in mport_cdev_open 2022-12-31 13:32:00 +01:00
ras
regulator regulator: Handle deferred clk 2023-04-06 12:10:46 +02:00
remoteproc Merge 6.1.16 into android14-6.1 2023-03-13 15:45:34 +00:00
reset reset: uniphier-glue: Fix possible null-ptr-deref 2023-02-01 08:34:05 +01:00
rpmsg rpmsg: glink: Release driver_override 2023-03-10 09:33:45 +01:00
rtc rtc: allow rtc_read_alarm without read_alarm callback 2023-03-11 13:55:30 +01:00
s390 s390/vfio-ap: fix memory leak in vfio_ap device driver 2023-04-06 12:10:46 +02:00
sbus
scsi This is the 6.1.25 stable release 2023-04-26 13:13:19 +00:00
sh
siox
slimbus
soc FROMGIT: soc: qcom: geni-se: Move qcom-geni-se.h to linux/soc/qcom/geni-se.h 2023-04-13 14:26:27 +00:00
soundwire soundwire: cadence: Drain the RX FIFO after an IO timeout 2023-03-11 13:55:40 +01:00
spi FROMGIT: soc: qcom: geni-se: Move qcom-geni-se.h to linux/soc/qcom/geni-se.h 2023-04-13 14:26:27 +00:00
spmi
ssb
staging FROMLIST: staging: greybus: drop loopback test files 2023-04-06 15:46:10 +00:00
target Merge 6.1.22 into android14-6.1 2023-03-31 08:15:39 +00:00
tc
tee tee: amdtee: fix race condition in amdtee_open_session 2023-03-30 12:49:29 +02:00
thermal Merge 6.1.18 into android14-6.1 2023-03-21 08:22:15 +00:00
thunderbolt thunderbolt: Limit USB3 bandwidth of certain Intel USB4 host routers 2023-04-06 12:10:33 +02:00
tty Merge 6.1.24 into android14-6.1 2023-04-22 08:52:25 +00:00
ufs UPSTREAM: scsi: ufs: core: Print trs for pending requests in MCQ mode 2023-04-25 01:29:15 +00:00
uio uio: uio_dmem_genirq: Fix deadlock between irq config and handling 2022-12-31 13:32:38 +01:00
usb ANDROID: preserve CRC for xhci symbols 2023-04-26 10:00:48 +00:00
vdpa vp_vdpa: fix the crash in hot unplug with vp_vdpa 2023-03-22 13:34:03 +01:00
vfio vfio/type1: restore locked_vm 2023-03-10 09:34:32 +01:00
vhost vhost-vdpa: free iommu domain after last use during cleanup 2023-03-22 13:33:44 +01:00
video fbcon: set_con2fb_map needs to set con2fb_map! 2023-04-20 12:35:07 +02:00
virt ANDROID: virt: gunyah: Move arch_is_gh_guest under RM probe 2023-04-11 15:26:03 +00:00
virtio Merge 6.1.8 into android14-6.1 2023-01-26 12:13:04 +00:00
vlynq
w1 w1: fix WARNING after calling w1_process() 2023-02-01 08:34:26 +01:00
watchdog watchdog: sbsa_wdog: Make sure the timeout programming is within the limits 2023-03-11 13:55:24 +01:00
xen xen/grant-dma-iommu: Implement a dummy probe_device() callback 2023-03-10 09:33:02 +01:00
zorro
Kconfig
Makefile
OWNERS