android_kernel_samsung_sm8650/Documentation
Mike Kravetz 79dfc69552 hugetlb: add demote hugetlb page sysfs interfaces
Patch series "hugetlb: add demote/split page functionality", v4.

The concurrent use of multiple hugetlb page sizes on a single system is
becoming more common.  One of the reasons is better TLB support for
gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
being used to back VMs in hosting environments.

When using hugetlb pages to back VMs, it is often desirable to
preallocate hugetlb pools.  This avoids the delay and uncertainty of
allocating hugetlb pages at VM startup.  In addition, preallocating huge
pages minimizes the issue of memory fragmentation that increases the
longer the system is up and running.

In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes.  Over
time, the preallocated pool of smaller hugetlb pages may become depleted
while larger hugetlb pages still remain.  In such situations, it is
desirable to convert larger hugetlb pages to smaller hugetlb pages.

Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages.  For example, to convert 50 GB pages on x86:

  gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
  m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
  echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
  echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages

On an idle system this operation is fairly reliable and results are as
expected.  The number of 2MB pages is increased as expected and the time
of the operation is a second or two.

However, when there is activity on the system the following issues
arise:

1) This process can take quite some time, especially if allocation of
   the smaller pages is not immediate and requires migration/compaction.

2) There is no guarantee that the total size of smaller pages allocated
   will match the size of the larger page which was freed. This is
   because the area freed by the larger page could quickly be
   fragmented.

In a test environment with a load that continually fills the page cache
with clean pages, results such as the following can be observed:

  Unexpected number of 2MB pages allocated: Expected 25600, have 19944
  real    0m42.092s
  user    0m0.008s
  sys     0m41.467s

To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting of a hugetlb page to
pages of a smaller size.  This avoids freeing pages to buddy and then
trying to allocate from buddy.

Page demotion is controlled via sysfs files that reside in the per-hugetlb
page size and per node directories.

 - demote_size
        Target page size for demotion, a smaller huge page size. File
        can be written to chose a smaller huge page size if multiple are
        available.

 - demote
        Writable number of hugetlb pages to be demoted

To demote 50 GB huge pages, one would:

  cat .../hugepages-1048576kB/free_hugepages   /* optional, verify free pages */
  cat .../hugepages-1048576kB/demote_size      /* optional, verify target size */
  echo 50 > .../hugepages-1048576kB/demote

Only hugetlb pages which are free at the time of the request can be
demoted.  Demotion does not add to the complexity of surplus pages and
honors reserved huge pages.  Therefore, when a value is written to the
sysfs demote file, that value is only the maximum number of pages which
will be demoted.  It is possible fewer will actually be demoted.  The
recently introduced per-hstate mutex is used to synchronize demote
operations with other operations that modify hugetlb pools.

Real world use cases
--------------------
The above scenario describes a real world use case where hugetlb pages
are used to back VMs on x86.  Both issues of long allocation times and
not necessarily getting the expected number of smaller huge pages after
a free and allocate cycle have been experienced.  The occurrence of
these issues is dependent on other activity within the host and can not
be predicted.

This patch (of 5):

Two new sysfs files are added to demote hugtlb pages.  These files are
both per-hugetlb page size and per node.  Files are:

  demote_size - The size in Kb that pages are demoted to. (read-write)
  demote - The number of huge pages to demote. (write-only)

By default, demote_size is the next smallest huge page size.  Valid huge
page sizes less than huge page size may be written to this file.  When
huge pages are demoted, they are demoted to this size.

Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.

NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size.  For example, on x86 1GB huge
pages will have demote interfaces.  2MB huge pages will not have demote
interfaces.

This patch does not provide full demote functionality.  It only provides
the sysfs interfaces.

It also provides documentation for the new interfaces.

[mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex]
  Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com

Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: David Rientjes <rientjes@google.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06 13:30:39 -07:00
..
ABI Misc driver patches for 5.15-rc1, second round 2021-09-10 11:31:47 -07:00
accounting
admin-guide hugetlb: add demote hugetlb page sysfs interfaces 2021-11-06 13:30:39 -07:00
arm Documentation: arm: marvell: Add 88F6825 model into list 2021-08-24 13:26:32 -06:00
arm64 Merge remote-tracking branch 'tip/sched/arm64' into for-next/core 2021-08-31 09:10:00 +01:00
block Documentation: block: blk-mq: Fix small typo in multi-queue docs 2021-08-24 13:30:00 -06:00
bpf libbpf: Rename libbpf documentation index file 2021-08-18 08:45:25 -07:00
cdrom
core-api irqchip fixes for 5.15, take #1 2021-09-24 14:11:04 +02:00
cpu-freq cpufreq: Remove ready() callback 2021-09-02 18:04:17 +02:00
crypto
dev-tools Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
devicetree Pin control fixes for the v5.15 series: 2021-10-25 09:47:18 -07:00
doc-guide
driver-api cxl for v5.15 2021-09-09 11:48:27 -07:00
fault-injection Char / Misc driver changes for 5.15-rc1 2021-09-01 08:35:06 -07:00
fb
features RISC-V Patches for the 5.15 Merge Window, Part 2 2021-09-11 14:29:42 -07:00
filesystems Fixed xfstests generic/016 generic/021 generic/022 generic/041 generic/274 generic/423, 2021-10-15 09:58:11 -04:00
firmware_class
firmware-guide docs: firmware-guide: acpi: dsd: graph.rst: replace some characters 2021-07-25 14:35:46 -06:00
fpga fpga: fix spelling mistakes 2021-07-21 19:54:21 -07:00
gpu Merge tag 'amd-drm-fixes-5.15-2021-10-06' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes 2021-10-08 11:40:21 +10:00
hid
hwmon hwmon: (k10temp) Remove residues of current and voltage 2021-09-12 17:56:36 -07:00
i2c Documentation: i2c: add i2c-sysfs into index 2021-08-10 22:58:32 +02:00
ia64
ide
iio
infiniband
input
isdn
kbuild Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
kernel-hacking docs: kernel-hacking: Remove inappropriate text 2021-09-03 15:56:45 -06:00
leds Documentation: leds: standartizing LED names 2021-08-20 10:26:24 +02:00
litmus-tests
livepatch
locking Documentation: locking: fix references 2021-08-24 13:20:39 -06:00
m68k
maintainer
mhi
mips
misc-devices
netlabel
networking mctp: unify sockaddr_mctp types 2021-10-18 13:47:09 +01:00
nios2
nvdimm
openrisc
parisc
PCI pci-v5.15-changes 2021-09-07 19:13:42 -07:00
pcmcia
power Documentation: power: include kernel-doc in Energy Model doc 2021-09-07 21:17:28 +02:00
powerpc powerpc/doc: Fix htmldocs errors 2021-08-27 00:56:34 +10:00
process Merge branch 'gcc-min-version-5.1' (make gcc-5.1 the minimum version) 2021-09-13 10:43:04 -07:00
RCU doc: Update stallwarn.rst with recent changes 2021-07-20 13:36:33 -07:00
riscv
s390
scheduler
scsi
security
sh
sound Yet another set of documentation changes: 2021-09-01 18:49:47 -07:00
sparc
sphinx docs: sphinx-requirements: Move sphinx_rtd_theme to top 2021-08-12 09:15:38 -06:00
sphinx-static
spi
staging
target
timers
trace Tracing updates for 5.15: 2021-09-05 11:50:41 -07:00
translations Merge branch 'gcc-min-version-5.1' (make gcc-5.1 the minimum version) 2021-09-13 10:43:04 -07:00
usb docs: usb: fix malformed table 2021-08-05 12:31:51 +02:00
userspace-api ptp: Document the PTP_CLK_MAGIC ioctl number 2021-10-27 17:02:51 -07:00
virt ARM: 2021-09-07 13:40:51 -07:00
vm Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
w1
watchdog
x86 Another collection of documentation patches, mostly fixes but also includes 2021-09-08 16:28:14 -07:00
xtensa
.gitignore
arch.rst
asm-annotations.rst
atomic_bitops.txt
atomic_t.txt Documentation/atomic_t: Document forward progress expectations 2021-08-04 15:16:47 +02:00
Changes
CodingStyle
conf.py docs: pdfdocs: Fix typo in CJK-language specific font settings 2021-09-06 16:53:39 -06:00
COPYING-logo
docutils.conf
dontdiff
index.rst
Kconfig
logo.gif
Makefile
memory-barriers.txt
SubmittingPatches
watch_queue.rst