6e12c5b7d4
For CMA allocation, it's really critical to migrate a page but sometimes it fails. One of the reasons is some driver holds a page refcount for a long time so VM couldn't migrate the page at that time. The concern here is there is no way to find the who hold the refcount of the page effectively. This patch introduces feature to keep tracking page's pinner. All get_page sites are vulnerable to pin a page for a long time but the cost to keep track it would be significat since get_page is the most frequent kernel operation. Furthermore, the page could be not user page but kernel page which is not related to the page migration failure. So, this patch keeps tracking only get_user_pages/follow_page with (FOLL_GET|PIN friends because they are the very common APIs to pin user pages which could cause migration failure and the less frequent than get_page so runtime cost wouldn't be that big but could cover many cases effectively. This patch also introduces put_user_page API. It aims for attributing "the pinner releases the page from now on" while it release the page refcount. Thus, any user of get_user_pages/follow_page(FOLL_GET) must use put_user_page as pair of those functions. Otherwise, page_pinner will treat them long term pinner as false postive but nothing should affect stability. * $debugfs/page_pinner/threshold It indicates threshold(microsecond) to flag long term pinning. It's configurable(Default is 300000us). Once you write new value to the threshold, old data will clear. * $debugfs/page_pinner/longterm_pinner It shows call sites where the duration of pinning was greater than the threshold. Internally, it uses a static array to keep 4096 elements and overwrites old ones once overflow happens. Therefore, you could lose some information. example) Page pinned ts 76953865787 us count 1 PFN 9856945 Block 9625 type Movable Flags 0x8000000000080014(uptodate|lru|swapbacked) __set_page_pinner+0x34/0xcc try_grab_page+0x19c/0x1a0 follow_page_pte+0x1c0/0x33c follow_page_mask+0xc0/0xc8 __get_user_pages+0x178/0x414 __gup_longterm_locked+0x80/0x148 internal_get_user_pages_fast+0x140/0x174 pin_user_pages_fast+0x24/0x40 CCC BBB AAA __arm64_sys_ioctl+0x94/0xd0 el0_svc_common+0xa4/0x180 do_el0_svc+0x28/0x88 el0_svc+0x14/0x24 note: page_pinner doesn't guarantee attributing/unattributing are atomic if they happen at the same time. It's just best effort so false-positive could happen. Bug: 183414571 Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: Ife37ec360eef993d390b9c131732218a4dfd2f04
169 lines
6.1 KiB
Plaintext
169 lines
6.1 KiB
Plaintext
# SPDX-License-Identifier: GPL-2.0-only
|
|
config PAGE_EXTENSION
|
|
bool "Extend memmap on extra space for more information on page"
|
|
help
|
|
Extend memmap on extra space for more information on page. This
|
|
could be used for debugging features that need to insert extra
|
|
field for every page. This extension enables us to save memory
|
|
by not allocating this extra memory according to boottime
|
|
configuration.
|
|
|
|
config DEBUG_PAGEALLOC
|
|
bool "Debug page memory allocations"
|
|
depends on DEBUG_KERNEL
|
|
depends on !HIBERNATION || ARCH_SUPPORTS_DEBUG_PAGEALLOC && !PPC && !SPARC
|
|
select PAGE_POISONING if !ARCH_SUPPORTS_DEBUG_PAGEALLOC
|
|
help
|
|
Unmap pages from the kernel linear mapping after free_pages().
|
|
Depending on runtime enablement, this results in a small or large
|
|
slowdown, but helps to find certain types of memory corruption.
|
|
|
|
Also, the state of page tracking structures is checked more often as
|
|
pages are being allocated and freed, as unexpected state changes
|
|
often happen for same reasons as memory corruption (e.g. double free,
|
|
use-after-free). The error reports for these checks can be augmented
|
|
with stack traces of last allocation and freeing of the page, when
|
|
PAGE_OWNER is also selected and enabled on boot.
|
|
|
|
For architectures which don't enable ARCH_SUPPORTS_DEBUG_PAGEALLOC,
|
|
fill the pages with poison patterns after free_pages() and verify
|
|
the patterns before alloc_pages(). Additionally, this option cannot
|
|
be enabled in combination with hibernation as that would result in
|
|
incorrect warnings of memory corruption after a resume because free
|
|
pages are not saved to the suspend image.
|
|
|
|
By default this option will have a small overhead, e.g. by not
|
|
allowing the kernel mapping to be backed by large pages on some
|
|
architectures. Even bigger overhead comes when the debugging is
|
|
enabled by DEBUG_PAGEALLOC_ENABLE_DEFAULT or the debug_pagealloc
|
|
command line parameter.
|
|
|
|
config DEBUG_PAGEALLOC_ENABLE_DEFAULT
|
|
bool "Enable debug page memory allocations by default?"
|
|
depends on DEBUG_PAGEALLOC
|
|
help
|
|
Enable debug page memory allocations by default? This value
|
|
can be overridden by debug_pagealloc=off|on.
|
|
|
|
config PAGE_OWNER
|
|
bool "Track page owner"
|
|
depends on DEBUG_KERNEL && STACKTRACE_SUPPORT
|
|
select DEBUG_FS
|
|
select STACKTRACE
|
|
select STACKDEPOT
|
|
select PAGE_EXTENSION
|
|
help
|
|
This keeps track of what call chain is the owner of a page, may
|
|
help to find bare alloc_page(s) leaks. Even if you include this
|
|
feature on your build, it is disabled in default. You should pass
|
|
"page_owner=on" to boot parameter in order to enable it. Eats
|
|
a fair amount of memory if enabled. See tools/vm/page_owner_sort.c
|
|
for user-space helper.
|
|
|
|
If unsure, say N.
|
|
|
|
config PAGE_PINNER
|
|
bool "Track page pinner"
|
|
depends on DEBUG_KERNEL && STACKTRACE_SUPPORT
|
|
select DEBUG_FS
|
|
select STACKTRACE
|
|
select STACKDEPOT
|
|
select PAGE_EXTENSION
|
|
help
|
|
This keeps track of what call chain is the pinner of a page, may
|
|
help to find page migration failures. Even if you include this
|
|
feature in your build, it is disabled by default. You should pass
|
|
"page_pinner=on" to boot parameter in order to enable it. Eats
|
|
a fair amount of memory if enabled.
|
|
|
|
If unsure, say N.
|
|
|
|
config PAGE_POISONING
|
|
bool "Poison pages after freeing"
|
|
help
|
|
Fill the pages with poison patterns after free_pages() and verify
|
|
the patterns before alloc_pages. The filling of the memory helps
|
|
reduce the risk of information leaks from freed data. This does
|
|
have a potential performance impact if enabled with the
|
|
"page_poison=1" kernel boot option.
|
|
|
|
Note that "poison" here is not the same thing as the "HWPoison"
|
|
for CONFIG_MEMORY_FAILURE. This is software poisoning only.
|
|
|
|
If you are only interested in sanitization of freed pages without
|
|
checking the poison pattern on alloc, you can boot the kernel with
|
|
"init_on_free=1" instead of enabling this.
|
|
|
|
If unsure, say N
|
|
|
|
config DEBUG_PAGE_REF
|
|
bool "Enable tracepoint to track down page reference manipulation"
|
|
depends on DEBUG_KERNEL
|
|
depends on TRACEPOINTS
|
|
help
|
|
This is a feature to add tracepoint for tracking down page reference
|
|
manipulation. This tracking is useful to diagnose functional failure
|
|
due to migration failures caused by page reference mismatches. Be
|
|
careful when enabling this feature because it adds about 30 KB to the
|
|
kernel code. However the runtime performance overhead is virtually
|
|
nil until the tracepoints are actually enabled.
|
|
|
|
config DEBUG_RODATA_TEST
|
|
bool "Testcase for the marking rodata read-only"
|
|
depends on STRICT_KERNEL_RWX
|
|
help
|
|
This option enables a testcase for the setting rodata read-only.
|
|
|
|
config ARCH_HAS_DEBUG_WX
|
|
bool
|
|
|
|
config DEBUG_WX
|
|
bool "Warn on W+X mappings at boot"
|
|
depends on ARCH_HAS_DEBUG_WX
|
|
depends on MMU
|
|
select PTDUMP_CORE
|
|
help
|
|
Generate a warning if any W+X mappings are found at boot.
|
|
|
|
This is useful for discovering cases where the kernel is leaving W+X
|
|
mappings after applying NX, as such mappings are a security risk.
|
|
|
|
Look for a message in dmesg output like this:
|
|
|
|
<arch>/mm: Checked W+X mappings: passed, no W+X pages found.
|
|
|
|
or like this, if the check failed:
|
|
|
|
<arch>/mm: Checked W+X mappings: failed, <N> W+X pages found.
|
|
|
|
Note that even if the check fails, your kernel is possibly
|
|
still fine, as W+X mappings are not a security hole in
|
|
themselves, what they do is that they make the exploitation
|
|
of other unfixed kernel bugs easier.
|
|
|
|
There is no runtime or memory usage effect of this option
|
|
once the kernel has booted up - it's a one time check.
|
|
|
|
If in doubt, say "Y".
|
|
|
|
config GENERIC_PTDUMP
|
|
bool
|
|
|
|
config PTDUMP_CORE
|
|
bool
|
|
|
|
config PTDUMP_DEBUGFS
|
|
bool "Export kernel pagetable layout to userspace via debugfs"
|
|
depends on DEBUG_KERNEL
|
|
depends on DEBUG_FS
|
|
depends on GENERIC_PTDUMP
|
|
select PTDUMP_CORE
|
|
help
|
|
Say Y here if you want to show the kernel pagetable layout in a
|
|
debugfs file. This information is only useful for kernel developers
|
|
who are working in architecture specific areas of the kernel.
|
|
It is probably not a good idea to enable this feature in a production
|
|
kernel.
|
|
|
|
If in doubt, say N.
|