License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner, Kate Stewart, and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it, or
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier should be applied to
a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files, created by Philippe Ombredanne. Philippe prepared the
base worksheet and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
The criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
  lines of source.
- Files that already had some variant of a license header in them were
  included (even if <5 lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note", otherwise it was "GPL-2.0". Results of that were:
SPDX license identifier                             # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier                              # files
----------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note                          270
GPL-2.0+ WITH Linux-syscall-note                         169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
LGPL-2.1+ WITH Linux-syscall-note                         15
GPL-1.0+ WITH Linux-syscall-note                          14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
LGPL-2.0+ WITH Linux-syscall-note                          4
LGPL-2.1 WITH Linux-syscall-note                           3
((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
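For illustration, the identifier goes on the first line of each file in the comment style the file type expects; a minimal sketch of the two forms the scripts had to distinguish between (header vs. .c comment style):

	/* SPDX-License-Identifier: GPL-2.0 */		(form used in header files)
	// SPDX-License-Identifier: GPL-2.0		(form used in .c source files)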
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_MM_H
#define _LINUX_MM_H

#include <linux/errno.h>

#ifdef __KERNEL__

#include <linux/mmdebug.h>
#include <linux/gfp.h>
#include <linux/bug.h>
#include <linux/list.h>
#include <linux/mmzone.h>
#include <linux/rbtree.h>
#include <linux/atomic.h>
#include <linux/debug_locks.h>
#include <linux/mm_types.h>
#include <linux/mmap_lock.h>
#include <linux/range.h>
#include <linux/pfn.h>
#include <linux/percpu-refcount.h>
#include <linux/bit_spinlock.h>
#include <linux/shrinker.h>
#include <linux/resource.h>

mm/debug-pagealloc: prepare boottime configurable on/off
Until now, debug-pagealloc has needed extra flags in struct page, so we had
to recompile the whole source tree whenever we decided to use it. This is
really painful, because recompiling takes time and a rebuild is sometimes not
possible when a third-party module depends on the layout of struct page. So
we can't use this good feature in many cases.
Now we have the page extension feature, which allows us to keep extra flags
outside of struct page. This gets rid of the third-party module issue
mentioned above, and it lets us decide at boot time whether extra memory is
needed for the page extension. With these properties, a kernel built with
CONFIG_DEBUG_PAGEALLOC can leave debug-pagealloc disabled at boot time with
low computational overhead. This will help our development process greatly.
This patch is the preparation step to achieve the above goal. debug-pagealloc
originally uses an extra field of struct page, but after this patch it will
use a field of struct page_ext. Because memory for page_ext is allocated
later than the initialization of the page allocator under CONFIG_SPARSEMEM,
the debug-pagealloc feature must be disabled temporarily until page_ext is
initialized. This patch implements this.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
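As a rough user-space sketch of the idea (the names below are illustrative, not the kernel's page_ext API): the debug state lives in a parallel array keyed by page frame number and is allocated only when requested at boot, so struct page itself never grows.

	#include <stdlib.h>
	#include <stdbool.h>

	struct page { unsigned long flags; };		/* layout stays unchanged */
	struct page_ext { unsigned long debug_flags; };	/* parallel metadata */

	static struct page_ext *page_ext_map;		/* allocated only on demand */
	static bool debug_pagealloc_requested;		/* decided from the boot command line */

	/* Allocate the parallel array only if the feature was requested at boot. */
	static int page_ext_init(unsigned long nr_pages)
	{
		if (!debug_pagealloc_requested)
			return 0;
		page_ext_map = calloc(nr_pages, sizeof(*page_ext_map));
		return page_ext_map ? 0 : -1;
	}

	/* Until page_ext_init() has run, callers see NULL and skip the debug checks. */
	static struct page_ext *lookup_ext(unsigned long pfn)
	{
		return page_ext_map ? &page_ext_map[pfn] : NULL;
	}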
#include <linux/page_ext.h>
#include <linux/err.h>
#include <linux/page_ref.h>
#include <linux/memremap.h>
#include <linux/overflow.h>
#include <linux/sizes.h>
#include <linux/android_kabi.h>
#include <linux/android_vendor.h>

struct mempolicy;
struct anon_vma;

mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevant vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
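A toy sketch of the query the interval tree answers (not the kernel's anon_vma_interval_tree code, just the idea): given a page's index, only the vmas whose pgoff range contains that index need to be visited, instead of every vma on the old same_anon_vma list.

	#include <stdio.h>

	/* Stand-in for an avc's interval: the vma's start/last page offsets. */
	struct interval { unsigned long pgoff_start, pgoff_last; };

	/* A real interval tree answers this in O(log n + matches); the point is
	 * that only overlapping vmas are visited. */
	static void visit_vmas_for_index(const struct interval *v, int n, unsigned long index)
	{
		for (int i = 0; i < n; i++)
			if (v[i].pgoff_start <= index && index <= v[i].pgoff_last)
				printf("vma %d covers page index %lu\n", i, index);
	}

	int main(void)
	{
		const struct interval vmas[] = { { 0, 15 }, { 16, 31 }, { 32, 4095 } };

		visit_vmas_for_index(vmas, 3, 20);	/* only the second vma matches */
		return 0;
	}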
struct anon_vma_chain;
struct file_ra_state;

Detach sched.h from mm.h
The first thing mm.h does is include sched.h, solely for the can_do_mlock()
inline function, which dereferences "current". By dealing with can_do_mlock(),
mm.h can be detached from sched.h, which is good. See below for why.
This patch
a) removes the unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() a normal function in mm/mlock.c
c) exports can_do_mlock() so as not to break compilation
d) adds sched.h inclusions back to files that were getting it indirectly
e) adds less-bloated headers (asm/signal.h, jiffies.h) to some files that were
getting them indirectly
Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
they don't need sched.h
b) sched.h stops being dependency for significant number of files:
on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
after patch it's only 3744 (-8.3%).
Cross-compile tested on
all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
alpha alpha-up
arm
i386 i386-up i386-defconfig i386-allnoconfig
ia64 ia64-up
m68k
mips
parisc parisc-up
powerpc powerpc-up
s390 s390-up
sparc sparc-up
sparc64 sparc64-up
um-x86_64
x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
as well as my two usual configs.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
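The mechanics of steps (a)-(c), sketched schematically (function bodies elided; this is not the actual diff):

	/* include/linux/mm.h, before: an inline definition that dereferences
	 * "current", dragging sched.h into every user of mm.h. */

	/* include/linux/mm.h, after: only the declaration remains. */
	int can_do_mlock(void);

	/* mm/mlock.c, after: the definition moves here and the symbol is
	 * exported so that modules calling it still link:
	 *	EXPORT_SYMBOL(can_do_mlock);
	 */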
struct user_struct;
struct writeback_control;

writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
The mechanism for detecting whether an inode should switch its wb
(bdi_writeback) association is now in place. This patch builds the
framework for the actual switching.
This patch adds a new inode flag, I_WB_SWITCHING, which has two
functions. First, the easy one: it ensures that there's only one
switching in progress for a given inode. Second, it's used as a
mechanism to synchronize wb stat updates.
The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
but track the current number of dirty pages and pages under writeback
respectively. As such, when an inode is moved from one wb to another,
the inode's portion of those stats has to be transferred together;
unfortunately, this is a bit tricky as those stat updates are percpu
operations which are performed without holding any lock in some
places.
This patch solves the problem in a similar way as memcg. Each such
lockless stat updates are wrapped in transaction surrounded by
unlocked_inode_to_wb_begin/end(). During normal operation, they map
to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
mapping->tree_lock is grabbed across the transaction.
In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
grace period to pass before actually starting to switch, which
guarantees that all stat update paths are synchronizing against
mapping->tree_lock.
This patch still doesn't implement the actual switching.
v3: Updated on top of the recent cancel_dirty_page() updates.
unlocked_inode_to_wb_begin() now nests inside
mem_cgroup_begin_page_stat() to match the locking order.
v2: The i_wb access transaction will be used for !stat accesses too.
Function names and comments updated accordingly.
s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
s/switch_wb/switch_wbs/
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
struct bdi_writeback;

mm: allow a controlled amount of unfairness in the page lock
commit 5ef64cc8987a9211d3f3667331ba3411a94ddc79 upstream.
Commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic") made
the page locking entirely fair, in that if a waiter came in while the
lock was held, the lock would be transferred to the lockers strictly in
order.
That was intended to finally get rid of the long-reported watchdog
failures that involved the page lock under extreme load, where a process
could end up waiting essentially forever, as other page lockers stole
the lock from under it.
It also improved some benchmarks, but it ended up causing huge
performance regressions on others, simply because fair lock behavior
doesn't end up giving out the lock as aggressively, causing better
worst-case latency, but potentially much worse average latencies and
throughput.
Instead of reverting that change entirely, this introduces a controlled
amount of unfairness, with a sysctl knob to tune it if somebody needs
to. But the default value should hopefully be good for any normal load,
allowing a few rounds of lock stealing, but enforcing the strict
ordering before the lock has been stolen too many times.
There is also a hint from Matthieu Baerts that the fair page coloring
may end up exposing an ABBA deadlock that is hidden by the usual
optimistic lock stealing, and while the unfairness doesn't fix the
fundamental issue (and I'm still looking at that), it avoids it in
practice.
The amount of unfairness can be modified by writing a new value to the
'sysctl_page_lock_unfairness' variable (default value of 5, exposed
through /proc/sys/vm/page_lock_unfairness), but that is hopefully
something we'd use mainly for debugging rather than being necessary for
any deep system tuning.
This whole issue has exposed just how critical the page lock can be, and
how contended it gets under certain loads. And the main contention
doesn't really seem to be anything related to IO (which was the origin
of this lock), but for things like just verifying that the page file
mapping is stable while faulting in the page into a page table.
Link: https://lore.kernel.org/linux-fsdevel/ed8442fd-6f54-dd84-cd4a-941e8b7ee603@MichaelLarabel.com/
Link: https://www.phoronix.com/scan.php?page=article&item=linux-50-59&num=1
Link: https://lore.kernel.org/linux-fsdevel/c560a38d-8313-51fb-b1ec-e904bd8836bc@tessares.net/
Reported-and-tested-by: Michael Larabel <Michael@michaellarabel.com>
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Tested-by: Maximilian Heyne <mheyne@amazon.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
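The knob is an ordinary procfs file, so it can be inspected or tuned from user space; a small sketch that reads the current value from the path quoted above:

	#include <stdio.h>

	int main(void)
	{
		int unfairness = -1;
		FILE *f = fopen("/proc/sys/vm/page_lock_unfairness", "r");

		if (f) {
			if (fscanf(f, "%d", &unfairness) == 1)
				printf("page_lock_unfairness = %d\n", unfairness);	/* default is 5 */
			fclose(f);
		}
		return 0;
	}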
extern int sysctl_page_lock_unfairness;

void init_mm_internals(void);

#ifndef CONFIG_NEED_MULTIPLE_NODES	/* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;

static inline void set_max_mapnr(unsigned long limit)
{
	max_mapnr = limit;
}
#else
static inline void set_max_mapnr(unsigned long limit) { }
#endif

extern atomic_long_t _totalram_pages;
static inline unsigned long totalram_pages(void)
{
	return (unsigned long)atomic_long_read(&_totalram_pages);
}

static inline void totalram_pages_inc(void)
{
	atomic_long_inc(&_totalram_pages);
}

static inline void totalram_pages_dec(void)
{
	atomic_long_dec(&_totalram_pages);
}

static inline void totalram_pages_add(long count)
{
	atomic_long_add(count, &_totalram_pages);
}

static inline void totalram_pages_set(long val)
{
	atomic_long_set(&_totalram_pages, val);
}

extern void * high_memory;
extern int page_cluster;

#ifdef CONFIG_SYSCTL
extern int sysctl_legacy_va_layout;
#else
#define sysctl_legacy_va_layout 0
#endif

mm: mmap: add new /proc tunable for mmap_base ASLR
Address Space Layout Randomization (ASLR) provides a barrier to
exploitation of user-space processes in the presence of security
vulnerabilities by making it more difficult to find desired code/data
which could help an attack. This is done by adding a random offset to
the location of regions in the process address space, with a greater
range of potential offset values corresponding to better protection/a
larger search-space for brute force, but also to greater potential for
fragmentation.
The offset added to the mmap_base address, which provides the basis for
the majority of the mappings for a process, is set once on process exec
in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
which reflect, hopefully, the best compromise for all systems. The
trade-off between increased entropy in the offset value generation and
the corresponding increased variability in address space fragmentation
is not absolute, however, and some platforms may tolerate higher amounts
of entropy. This patch introduces both new Kconfig values and a sysctl
interface which may be used to change the amount of entropy used for
offset generation on a system.
The direct motivation for this change was in response to the
libstagefright vulnerabilities that affected Android, specifically to
information provided by Google's project zero at:
http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
The attack presented therein, by Google's project zero, specifically
targeted the limited randomness used to generate the offset added to the
mmap_base address in order to craft a brute-force-based attack.
Concretely, the attack was against the mediaserver process, which was
limited to respawning every 5 seconds, on an arm device. The hard-coded
8 bits used resulted in an average expected success rate of defeating
the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
piece). With this patch, and an accompanying increase in the entropy
value to 16 bits, the same attack would take an average expected time of
over 45 hours (32768 tries), which makes it both less feasible and more
likely to be noticed.
The introduced Kconfig and sysctl options are limited by per-arch
minimum and maximum values, the minimum of which was chosen to match the
current hard-coded value and the maximum of which was chosen so as to
give the greatest flexibility without generating an invalid mmap_base
address, generally 3-4 bits less than the number of bits in the
user-space accessible virtual address space.
When deciding whether or not to change the default value, a system
developer should consider that mmap_base address could be placed
anywhere up to 2^(value) bits away from the non-randomized location,
which would introduce variable-sized areas above and below the mmap_base
address such that the maximum vm_area_struct size may be reduced,
preventing very large allocations.
This patch (of 4):
ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such a
way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.
Signed-off-by: Daniel Cashman <dcashman@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Borislav Petkov <bp@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
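The quoted attack times follow from simple arithmetic: with n bits of entropy the expected number of guesses is about 2^n / 2, at one guess per 5-second respawn. A small sketch of that calculation:

	#include <stdio.h>

	int main(void)
	{
		const double respawn_seconds = 5.0;	/* the 5-second respawn limit quoted above */

		for (int bits = 8; bits <= 16; bits += 8) {
			double expected_tries = (double)(1UL << bits) / 2.0;	/* half the space on average */
			double hours = expected_tries * respawn_seconds / 3600.0;

			printf("%2d bits: ~%.0f tries, ~%.1f hours\n", bits, expected_tries, hours);
		}
		return 0;	/* ~128 tries / ~0.2 h for 8 bits, ~32768 tries / ~45.5 h for 16 bits */
	}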
#ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS
extern const int mmap_rnd_bits_min;
extern const int mmap_rnd_bits_max;
extern int mmap_rnd_bits __read_mostly;
#endif
#ifdef CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS
extern const int mmap_rnd_compat_bits_min;
extern const int mmap_rnd_compat_bits_max;
extern int mmap_rnd_compat_bits __read_mostly;
#endif

#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/processor.h>

/*
 * Architectures that support memory tagging (assigning tags to memory regions,
 * embedding these tags into addresses that point to these memory regions, and
 * checking that the memory and the pointer tags match on memory accesses)
 * redefine this macro to strip tags from pointers.
 * It's defined as a no-op for architectures that don't support memory tagging.
 */
#ifndef untagged_addr
#define untagged_addr(addr) (addr)
#endif
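As a purely hypothetical illustration (no real architecture's definition is implied), an architecture that keeps an 8-bit tag in the top byte of an unsigned long address could strip it with a mask:

	/* Hypothetical override: clear a tag held in bits 63:56 of the address. */
	#define untagged_addr(addr)	((addr) & ~(0xffUL << 56))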

#ifndef __pa_symbol
#define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
#endif

#ifndef __va_function
#define __va_function(x) (x)
#endif

#ifndef __pa_function
#define __pa_function(x) __pa_symbol(x)
#endif

mm: replace open coded page to virt conversion with page_to_virt()
The open coded conversion from struct page address to virtual address in
lowmem_page_address() involves an intermediate conversion step to pfn
number/physical address. Since the placement of the struct page array
relative to the linear mapping may be completely independent from the
placement of physical RAM (as is that case for arm64 after commit
dfd55ad85e 'arm64: vmemmap: use virtual projection of linear region'),
the conversion to physical address and back again should factor out of
the equation, but unfortunately, the shifting and pointer arithmetic
involved prevent this from happening, and the resulting calculation
essentially subtracts the address of the start of physical memory and
adds it back again, in a way that prevents the compiler from optimizing
it away.
Since the start of physical memory is not a build time constant on arm64,
the resulting conversion involves an unnecessary memory access, which
we would like to get rid of. So replace the open coded conversion with
a call to page_to_virt(), and use the open coded conversion as its
default definition, to be overridden by the architecture, if desired.
The existing arch specific definitions of page_to_virt are all equivalent
to this default definition, so by itself this patch is a no-op.
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
#ifndef page_to_virt
#define page_to_virt(x)	__va(PFN_PHYS(page_to_pfn(x)))
#endif

#ifndef lm_alias
#define lm_alias(x)	__va(__pa_symbol(x))
#endif

/*
 * To prevent common memory management code establishing
 * a zero page mapping on a read fault.
 * This macro should be defined within <asm/pgtable.h>.
 * s390 does this to prevent multiplexing of hardware bits
 * related to the physical page in case of virtualization.
 */
#ifndef mm_forbids_zeropage
#define mm_forbids_zeropage(X)	(0)
#endif

mm: zero reserved and unavailable struct pages
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by
going through __init_single_page().
In some cases these struct pages are accessed even if they do not
contain any data. One example is page_to_pfn() might access page->flags
if this is where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).
One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the
existing memory from pfn 1 (i.e. KVM).
Since struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.
The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)
Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().
===
Here is more detailed example of problem that this patch is addressing:
Run tested on qemu with the following arguments:
-enable-kvm -cpu kvm64 -m 512 -smp 2
This patch reports that there are 98 unavailable pages.
They are: pfn 0 and pfns in range [159, 255].
Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
not reserve [159, 255] ones.
e820__memblock_setup() reports to Linux that the following physical ranges are
available:
[1 , 158]
[256, 130783]
Notice, that exactly unavailable pfns are missing!
Now, let's check what we have in zone 0: [1, 131039]
pfn 0, is not part of the zone, but pfns [1, 158], are.
However, the bigger problem if we do not initialize these struct
pages is with memory hotplug, because that path operates at 2M
boundaries (section_nr) and checks whether a 2M range of pages is
hot-removable. It starts with the first pfn from the zone, rounds it
down to a 2M boundary (struct pages are allocated at 2M boundaries when
vmemmap is created), and checks if that section is hot-removable. In
this case we start with pfn 1 and convert it down to pfn 0. Later the
pfn is converted to struct page, and some fields are checked. Now, if we do not zero
struct pages, we get unpredictable results.
In fact when CONFIG_VM_DEBUG is enabled, and we explicitly set all
vmemmap memory to ones, the following panic is observed with kernel test
without this patch applied:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: is_pageblock_removable_nolock+0x35/0x90
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT
...
task: ffff88001f4e2900 task.stack: ffffc90000314000
RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
Call Trace:
? is_mem_section_removable+0x5a/0xd0
show_mem_removable+0x6b/0xa0
dev_attr_show+0x1b/0x50
sysfs_kf_seq_show+0xa1/0x100
kernfs_seq_show+0x22/0x30
seq_read+0x1ac/0x3a0
kernfs_fop_read+0x36/0x190
? security_file_permission+0x90/0xb0
__vfs_read+0x16/0x30
vfs_read+0x81/0x130
SyS_read+0x44/0xa0
entry_SYSCALL_64_fastpath+0x1f/0xbd
Link: http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
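A toy user-space model of the iterator's reserved-but-not-memory condition (the memblock lists are mimicked with single ranges; this is not the kernel implementation):

	#include <stdio.h>
	#include <string.h>

	struct page { unsigned long flags; };
	struct pfn_range { unsigned long start, end; };	/* [start, end) in pfns */

	/* A pfn that lies in a reserved range but in no memory range has no
	 * other code path that will initialize its struct page. */
	static struct pfn_range memory_rng   = { 1, 159 };
	static struct pfn_range reserved_rng = { 0, 16 };
	static struct page page_array[256];

	static void zero_resv_unavail_model(void)
	{
		for (unsigned long pfn = reserved_rng.start; pfn < reserved_rng.end; pfn++) {
			if (pfn >= memory_rng.start && pfn < memory_rng.end)
				continue;	/* backed by memory: initialized normally */
			memset(&page_array[pfn], 0, sizeof(struct page));	/* mm_zero_struct_page() stand-in */
			printf("zeroed struct page for pfn %lu\n", pfn);
		}
	}

	int main(void)
	{
		zero_resv_unavail_model();	/* with these toy ranges, only pfn 0 is zeroed */
		return 0;
	}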
/*
 * On some architectures it is expensive to call memset() for small sizes.
 * If an architecture decides to implement their own version of
 * mm_zero_struct_page they should wrap the defines below in a #ifndef and
 * define their own version of this macro in <asm/pgtable.h>
*/

#if BITS_PER_LONG == 64
/* This function must be updated when the size of struct page grows above 80
 * or reduces below 56. The idea is that the compiler optimizes out the
 * switch() statement, and only leaves move/store instructions. Also the
 * compiler can combine write statements if they are both assignments and can
 * be reordered; this can result in several of the writes here being dropped.
 */
#define	mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
static inline void __mm_zero_struct_page(struct page *page)
{
	unsigned long *_pp = (void *)page;

	/* Check that struct page is either 56, 64, 72, or 80 bytes */
	BUILD_BUG_ON(sizeof(struct page) & 7);
	BUILD_BUG_ON(sizeof(struct page) < 56);
	BUILD_BUG_ON(sizeof(struct page) > 80);

	switch (sizeof(struct page)) {
	case 80:
		_pp[9] = 0;	/* fallthrough */
	case 72:
		_pp[8] = 0;	/* fallthrough */
	case 64:
		_pp[7] = 0;	/* fallthrough */
	case 56:
		_pp[6] = 0;
		_pp[5] = 0;
		_pp[4] = 0;
		_pp[3] = 0;
		_pp[2] = 0;
		_pp[1] = 0;
		_pp[0] = 0;
	}
}
#else
#define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
#endif

/*
 * Default maximum number of active map areas, this limits the number of vmas
 * per mm struct. Users can overwrite this number by sysctl but there is a
 * problem.
 *
 * When a program's coredump is generated as ELF format, a section is created
 * per vma. In ELF, the number of sections is represented in unsigned short.
 * This means the number of sections should be smaller than 65535 at coredump.
 * Because the kernel adds some informative sections to the image of a program
 * when generating a coredump, we need some margin. The number of extra
 * sections is 1-3 now and depends on arch. We use "5" as a safe margin here.
 *
 * ELF extended numbering allows more than 65535 sections, so the 16-bit bound
 * is not a hard limit any more, although some userspace tools can be
 * surprised by that.
 */
#define MAPCOUNT_ELF_CORE_MARGIN	(5)
#define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)

extern int sysctl_max_map_count;

mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root can't log in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interface (MPI) processes, one per core. And
it is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable rootuser_reserve_pages, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_pages is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_pages defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_pages.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because this mode
forces us to ensure we can fulfill all of the requested memory allocations--
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutely safe. Over 230MB each
was safest in my testing.
* What happens if user_reserve_pages is set to 0?
Note, this only affects overcommit 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, reclaimable slab pages into consideration. In other words,
these are all pages available, then why isn't overcommit suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the user to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
----------   ----   ----   -------------   ----   -------------   --------------
guess        yes    1      5432/5432       no     yes             yes
guess        yes    4      5444/5444       1      yes             yes
guess        no     1      5302/5449       no     yes             yes
guess        no     4      -               crash  no              no
never        yes    1      5460/5460       1      yes             yes
never        yes    4      5460/5460       1      yes             yes
never        no     1      5218/5432       no     no              yes
never        no     4      5203/5448       no     no              yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
----------   ----   ----   -------------   ----   -------------   --------------
guess        yes    1      5419/5419       no     -      yes      8MB   yes
guess        yes    4      5436/5436       1      -      yes      8MB   yes
guess        no     1      5440/5440       *      -      yes      8MB   yes
guess        no     4      -               crash  -      no       8MB   no
* process would successfully mlock, then the oom killer would pick it
never        yes    1      5446/5446       no     10MB   yes      20MB  yes
never        yes    4      5456/5456       no     10MB   yes      20MB  yes
never        no     1      5387/5429       no     128MB  no       8MB   barely
never        no     1      5323/5428       no     226MB  barely   8MB   barely
never        no     1      5323/5428       no     226MB  barely   8MB   barely
never        no     1      5359/5448       no     10MB   no       10MB  barely
never        no     1      5323/5428       no     0MB    no       10MB  barely
never        no     1      5332/5428       no     0MB    no       50MB  yes
never        no     1      5293/5429       no     0MB    no       90MB  yes
never        no     1      5001/5427       no     230MB  yes      338MB yes
never        no     4*     4998/5424       no     230MB  yes      338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
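A rough sketch of the effect on the 32GB example above, assuming the allocating process accounts for nearly all of RAM (the real check lives in __vm_enough_memory() and uses the current process size, not total RAM):

	#include <stdio.h>

	int main(void)
	{
		unsigned long total_kb = 32UL * 1024 * 1024;		/* the 32GB system above */
		unsigned long user_reserve_kbytes = 128 * 1024;		/* default cap added here */

		/* Old behaviour: hold back a flat 3%. New behaviour: clamp that 3%
		 * by user_reserve_kbytes. */
		unsigned long three_pct_kb = total_kb * 3 / 100;
		unsigned long reserve_kb = three_pct_kb < user_reserve_kbytes ?
						three_pct_kb : user_reserve_kbytes;

		printf("3%% = %lu MB, clamped reserve = %lu MB\n",
		       three_pct_kb / 1024, reserve_kb / 1024);	/* ~983 MB vs 128 MB */
		return 0;
	}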
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;
mm: limit growth of 3% hardcoded other user reserve
Add user_reserve_kbytes knob.
Limit the growth of the memory reserved for other user processes to
min(3% current process size, user_reserve_pages). Only about 8MB is
necessary to enable recovery in the default mode, and only a few hundred
MB are required even when overcommit is disabled.
user_reserve_pages defaults to min(3% free pages, 128MB)
I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
then adding the RSS of each.
This only affects OVERCOMMIT_NEVER mode.
Background
1. user reserve
__vm_enough_memory reserves a hardcoded 3% of the current process size for
other applications when overcommit is disabled. This was done so that a
user could recover if they launched a memory hogging process. Without the
reserve, a user would easily run into a message such as:
bash: fork: Cannot allocate memory
2. admin reserve
Additionally, a hardcoded 3% of free memory is reserved for root in both
overcommit 'guess' and 'never' modes. This was intended to prevent a
scenario where root can't log in and perform recovery operations.
Note that this reserve shrinks, and doesn't guarantee a useful reserve.
Motivation
The two hardcoded memory reserves should be updated to account for current
memory sizes.
Also, the admin reserve would be more useful if it didn't shrink too much.
When the current code was originally written, 1GB was considered
"enterprise". Now the 3% reserve can grow to multiple GB on large memory
systems, and it only needs to be a few hundred MB at most to enable a user
or admin to recover a system with an unwanted memory hogging process.
I've found that reducing these reserves is especially beneficial for a
specific type of application load:
* single application system
* one or few processes (e.g. one per core)
* allocating all available memory
* not initializing every page immediately
* long running
I've run scientific clusters with this sort of load. A long running job
sometimes failed many hours (weeks of CPU time) into a calculation. They
weren't initializing all of their memory immediately, and they weren't
using calloc, so I put systems into overcommit 'never' mode. These
clusters run diskless and have no swap.
However, with the current reserves, a user wishing to allocate as much
memory as possible to one process may be prevented from using, for
example, almost 2GB out of 32GB.
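(Roughly: the 3% admin reserve plus the 3% per-process reserve on a 32GB
machine come to about 2 x 0.03 x 32GB, or around 1.9GB, which is where the
"almost 2GB" figure above comes from.)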
The effect is less, but still significant when a user starts a job with
one process per core. I have repeatedly seen a set of processes
requesting the same amount of memory fail because one of them could not
allocate the amount of memory a user would expect to be able to allocate.
For example, Message Passing Interface (MPI) jobs that run one process per
core; the situation is similar for other parallel programming frameworks.
Changing this reserve code will make the overcommit never mode more useful
by allowing applications to allocate nearly all of the available memory.
Also, the new admin_reserve_kbytes will be safer than the current behavior
since the hardcoded 3% of available memory reserve can shrink to something
useless in the case where applications have grabbed all available memory.
Risks
* "bash: fork: Cannot allocate memory"
The downside of the first patch-- which creates a tunable user reserve
that is only used in overcommit 'never' mode--is that an admin can set
it so low that a user may not be able to kill their process, even if
they already have a shell prompt.
Of course, a user can get in the same predicament with the current 3%
reserve--they just have to launch processes until 3% becomes negligible.
* root-cant-log-in problem
The second patch, adding the tunable rootuser_reserve_pages, allows
the admin to shoot themselves in the foot by setting it too small. They
can easily get the system into a state where root-can't-log-in.
However, the new admin_reserve_kbytes will be safer than the current
behavior since the hardcoded 3% of available memory reserve can shrink
to something useless in the case where applications have grabbed all
available memory.
Alternatives
* Memory cgroups provide a more flexible way to limit application memory.
Not everyone wants to set up cgroups or deal with their overhead.
* We could create a fourth overcommit mode which provides smaller reserves.
The size of useful reserves may be drastically different depending
on whether the system is embedded or enterprise.
* Force users to initialize all of their memory or use calloc.
Some users don't want/expect the system to overcommit when they malloc.
Overcommit 'never' mode is for this scenario, and it should work well.
The new user and admin reserve tunables are simple to use, with low
overhead compared to cgroups. The patches preserve current behavior where
3% of memory is less than 128MB, except that the admin reserve doesn't
shrink to an unusable size under pressure. The code allows admins to tune
for embedded and enterprise usage.
FAQ
* How is the root-cant-login problem addressed?
What happens if admin_reserve_pages is set to 0?
Root is free to shoot themselves in the foot by setting
admin_reserve_kbytes too low.
On x86_64, the minimum useful reserve is:
8MB for overcommit 'guess'
128MB for overcommit 'never'
admin_reserve_pages defaults to min(3% free memory, 8MB)
So, anyone switching to 'never' mode needs to adjust
admin_reserve_pages.
* How do you calculate a minimum useful reserve?
A user or the admin needs enough memory to login and perform
recovery operations, which includes, at a minimum:
sshd or login + bash (or some other shell) + top (or ps, kill, etc.)
For overcommit 'guess', we can sum resident set sizes (RSS)
because we only need enough memory to handle what the recovery
programs will typically use. On x86_64 this is about 8MB.
For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS. We use VSZ instead of RSS because this mode
forces us to ensure we can fulfill all of the requested memory allocations,
even if the programs only use a fraction of what they ask for.
On x86_64 this is about 128MB.
When swap is enabled, reserves are useful even when they are as
small as 10MB, regardless of overcommit mode.
When both swap and overcommit are disabled, then the admin should
tune the reserves higher to be absolutely safe. Over 230MB each
was safest in my testing.
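Restating the calculation above compactly (this is just the rule described in
this answer, not a new formula):
    reserve_guess ~= sum of RSS(sshd, login, bash, top)             (~8MB on x86_64)
    reserve_never ~= max VSZ of those programs + sum of their RSS   (~128MB on x86_64)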
* What happens if user_reserve_pages is set to 0?
Note, this only affects overcommit 'never' mode.
Then a user will be able to allocate all available memory minus
admin_reserve_kbytes.
However, they will easily see a message such as:
"bash: fork: Cannot allocate memory"
And they won't be able to recover/kill their application.
The admin should be able to recover the system if
admin_reserve_kbytes is set appropriately.
* What's the difference between overcommit 'guess' and 'never'?
"Guess" allows an allocation if there are enough free + reclaimable
pages. It has a hardcoded 3% of free pages reserved for root.
"Never" allows an allocation if there is enough swap + a configurable
percentage (default is 50) of physical RAM. It has a hardcoded 3% of
free pages reserved for root, like "Guess" mode. It also has a
hardcoded 3% of the current process size reserved for additional
applications.
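As a rough, illustrative sketch of the 'never' check just described (standalone
userspace C that ignores clamping and other kernel details; it is not the
actual __vm_enough_memory implementation):
#include <stdbool.h>
#include <stdio.h>
/* all quantities in pages; ratio is the configurable percentage (default 50) */
static bool never_mode_allows(unsigned long request, unsigned long committed,
			      unsigned long ram, unsigned long swap,
			      unsigned long free, unsigned long total_vm,
			      int ratio, bool is_root)
{
	unsigned long allowed = ram * ratio / 100 + swap;

	if (!is_root)
		allowed -= free / 32;	/* hardcoded 3% of free pages for root */
	allowed -= total_vm / 32;	/* hardcoded 3% of process size for others */

	return committed + request <= allowed;
}
int main(void)
{
	/* e.g. 4GB RAM, no swap, mostly free, a small process asking for 1GB */
	printf("%d\n", never_mode_allows(1UL << 18, 0, 1UL << 20, 0,
					 1UL << 20, 4096, 50, false));
	return 0;
}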
* Why is overcommit 'guess' not suitable even when an app eventually
writes to every page? It takes free pages, file pages, available
swap pages, and reclaimable slab pages into consideration. In other
words, if all of these pages are available, why isn't 'guess' mode suitable?
Because it only looks at the present state of the system. It
does not take into account the memory that other applications have
malloced, but haven't initialized yet. It overcommits the system.
Test Summary
There was little change in behavior in the default overcommit 'guess'
mode with swap enabled before and after the patch. This was expected.
Systems run most predictably (i.e. no oom kills) in overcommit 'never'
mode with swap enabled. This also allowed the most memory to be allocated
to a user application.
Overcommit 'guess' mode without swap is a bad idea. It is easy to
crash the system. None of the other tested combinations crashed.
This matches my experience on the Roadrunner supercomputer.
Without the tunable user reserve, a system in overcommit 'never' mode
and without swap does not allow the user to recover, although the
admin can.
With the new tunable reserves, a system in overcommit 'never' mode
and without swap can be configured to:
1. maximize user-allocatable memory, running close to the edge of
recoverability
2. maximize recoverability, sacrificing allocatable memory to
ensure that a user cannot take down a system
Test Description
Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap
System is booted into multiuser console mode, with unnecessary services
turned off. Caches were dropped before each test.
Hogs are user memtester processes that attempt to allocate all free memory
as reported by /proc/meminfo
In overcommit 'never' mode, memory_ratio=100
Test Results
3.9.0-rc1-mm1
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
guess      | yes  | 1    | 5432/5432     | no    | yes           | yes
guess      | yes  | 4    | 5444/5444     | 1     | yes           | yes
guess      | no   | 1    | 5302/5449     | no    | yes           | yes
guess      | no   | 4    | -             | crash | no            | no
never      | yes  | 1    | 5460/5460     | 1     | yes           | yes
never      | yes  | 4    | 5460/5460     | 1     | yes           | yes
never      | no   | 1    | 5218/5432     | no    | no            | yes
never      | no   | 4    | 5203/5448     | no    | no            | yes
3.9.0-rc1-mm1-tunablereserves
User and Admin Recovery show their respective reserves, if applicable.
Overcommit | Swap | Hogs | MB Got/Wanted | OOMs  | User Recovery | Admin Recovery
---------- | ---- | ---- | ------------- | ----- | ------------- | --------------
guess      | yes  | 1    | 5419/5419     | no    | -     yes     | 8MB   yes
guess      | yes  | 4    | 5436/5436     | 1     | -     yes     | 8MB   yes
guess      | no   | 1    | 5440/5440     | *     | -     yes     | 8MB   yes
guess      | no   | 4    | -             | crash | -     no      | 8MB   no
* process would successfully mlock, then the oom killer would pick it
never      | yes  | 1    | 5446/5446     | no    | 10MB  yes     | 20MB  yes
never      | yes  | 4    | 5456/5456     | no    | 10MB  yes     | 20MB  yes
never      | no   | 1    | 5387/5429     | no    | 128MB no      | 8MB   barely
never      | no   | 1    | 5323/5428     | no    | 226MB barely  | 8MB   barely
never      | no   | 1    | 5323/5428     | no    | 226MB barely  | 8MB   barely
never      | no   | 1    | 5359/5448     | no    | 10MB  no      | 10MB  barely
never      | no   | 1    | 5323/5428     | no    | 0MB   no      | 10MB  barely
never      | no   | 1    | 5332/5428     | no    | 0MB   no      | 50MB  yes
never      | no   | 1    | 5293/5429     | no    | 0MB   no      | 90MB  yes
never      | no   | 1    | 5001/5427     | no    | 230MB yes     | 338MB yes
never      | no   | 4*   | 4998/5424     | no    | 230MB yes     | 338MB yes
* more memtesters were launched, able to allocate approximately another 100MB
Future Work
- Test larger memory systems.
- Test an embedded image.
- Test other architectures.
- Time malloc microbenchmarks.
- Would it be useful to be able to set overcommit policy for
each memory cgroup?
- Some lines are slightly above 80 chars.
Perhaps define a macro to convert between pages and kb?
Other places in the kernel do this.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: make init_user_reserve() static]
Signed-off-by: Andrew Shewmaker <agshew@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 07:08:10 +09:00
|
|
|
|
2014-01-22 08:49:14 +09:00
|
|
|
extern int sysctl_overcommit_memory;
|
|
|
|
extern int sysctl_overcommit_ratio;
|
|
|
|
extern unsigned long sysctl_overcommit_kbytes;
|
|
|
|
|
|
|
|
extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
|
|
|
|
size_t *, loff_t *);
|
|
|
|
extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
|
|
|
|
size_t *, loff_t *);
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
|
|
|
|
|
2008-07-24 13:28:13 +09:00
|
|
|
/* to align the pointer to the (next) page boundary */
|
|
|
|
#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
|
|
|
|
|
2013-07-04 07:02:11 +09:00
|
|
|
/* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */
|
2016-10-08 09:02:04 +09:00
|
|
|
#define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
|
2013-07-04 07:02:11 +09:00
|
|
|
|
2019-01-04 08:29:02 +09:00
|
|
|
#define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
/*
|
|
|
|
* Linux kernel virtual memory manager primitives.
|
|
|
|
* The idea being to have a "virtual" mm in the same way
|
|
|
|
* we have a virtual fs - giving a cleaner interface to the
|
|
|
|
* mm details, and allowing different kinds of memory mappings
|
|
|
|
* (from shared memory to executable loading to arbitrary
|
|
|
|
* mmap() functions).
|
|
|
|
*/
|
|
|
|
|
2018-07-22 07:24:03 +09:00
|
|
|
struct vm_area_struct *vm_area_alloc(struct mm_struct *);
|
2018-07-22 05:48:51 +09:00
|
|
|
struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
|
|
|
|
void vm_area_free(struct vm_area_struct *);
|
2006-12-07 13:32:48 +09:00
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
#ifndef CONFIG_MMU
|
2009-01-08 21:04:47 +09:00
|
|
|
extern struct rb_root nommu_region_tree;
|
|
|
|
extern struct rw_semaphore nommu_region_sem;
|
2005-04-17 07:20:36 +09:00
|
|
|
|
|
|
|
extern unsigned int kobjsize(const void *objp);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/*
|
2008-08-16 19:07:21 +09:00
|
|
|
* vm_flags in vm_area_struct, see mm_types.h.
|
2016-03-18 06:18:53 +09:00
|
|
|
* When changing, update also include/trace/events/mmflags.h
|
2005-04-17 07:20:36 +09:00
|
|
|
*/
|
2012-10-09 08:28:37 +09:00
|
|
|
#define VM_NONE 0x00000000
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
#define VM_READ 0x00000001 /* currently active flags */
|
|
|
|
#define VM_WRITE 0x00000002
|
|
|
|
#define VM_EXEC 0x00000004
|
|
|
|
#define VM_SHARED 0x00000008
|
|
|
|
|
2005-09-22 01:55:39 +09:00
|
|
|
/* mprotect() hardcodes VM_MAYREAD >> 4 == VM_READ, and so for r/w/x bits. */
|
2005-04-17 07:20:36 +09:00
|
|
|
#define VM_MAYREAD 0x00000010 /* limits for mprotect() etc */
|
|
|
|
#define VM_MAYWRITE 0x00000020
|
|
|
|
#define VM_MAYEXEC 0x00000040
|
|
|
|
#define VM_MAYSHARE 0x00000080
|
|
|
|
|
|
|
|
#define VM_GROWSDOWN 0x00000100 /* general info on the segment */
|
2015-09-05 07:46:17 +09:00
|
|
|
#define VM_UFFD_MISSING 0x00000200 /* missing pages tracking */
|
2005-11-29 07:34:23 +09:00
|
|
|
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
|
2005-04-17 07:20:36 +09:00
|
|
|
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
|
2015-09-05 07:46:17 +09:00
|
|
|
#define VM_UFFD_WP 0x00001000 /* wrprotect pages tracking */
|
2005-04-17 07:20:36 +09:00
|
|
|
|
|
|
|
#define VM_LOCKED 0x00002000
|
|
|
|
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
|
|
|
|
|
|
|
|
/* Used by sys_madvise() */
|
|
|
|
#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */
|
|
|
|
#define VM_RAND_READ 0x00010000 /* App will not benefit from clustered reads */
|
|
|
|
|
|
|
|
#define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */
|
|
|
|
#define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */
|
2015-11-06 11:51:36 +09:00
|
|
|
#define VM_LOCKONFAULT 0x00080000 /* Lock the pages covered when they are faulted in */
|
2005-04-17 07:20:36 +09:00
|
|
|
#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
|
2008-07-24 13:27:28 +09:00
|
|
|
#define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */
|
2005-04-17 07:20:36 +09:00
|
|
|
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
|
2017-11-02 00:36:41 +09:00
|
|
|
#define VM_SYNC 0x00800000 /* Synchronous page faults */
|
2012-10-09 08:28:37 +09:00
|
|
|
#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
|
mm,fork: introduce MADV_WIPEONFORK
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
in the child process after fork. This differs from MADV_DONTFORK in one
important way.
If a child process accesses memory that was MADV_WIPEONFORK, it will get
zeroes. The address ranges are still valid, they are just empty.
If a child process accesses memory that was MADV_DONTFORK, it will get a
segmentation fault, since those address ranges are no longer valid in
the child after fork.
Since MADV_DONTFORK also seems to be used to allow very large programs
to fork in systems with strict memory overcommit restrictions, changing
the semantics of MADV_DONTFORK might break existing programs.
MADV_WIPEONFORK only works on private, anonymous VMAs.
The use case is libraries that store or cache information, and want to
know that they need to regenerate it in the child process after fork.
Examples of this would be:
- systemd/pulseaudio API checks (fail after fork) (replacing a getpid
check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)
The security benefits of a forking server having a re-initialized PRNG in
every child process are pretty obvious. However, due to libraries
having all kinds of internal state, and programs getting compiled with
many different versions of each library, it is unreasonable to expect
calling programs to re-initialize everything manually after fork.
A further complication is the proliferation of clone flags, programs
bypassing glibc's functions to call clone directly, and programs calling
unshare, causing the glibc pthread_atfork hook to not get called.
It would be better to have the kernel take care of this automatically.
The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
MADV_WIPEONFORK.
This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
https://man.openbsd.org/minherit.2
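By way of a hedged usage sketch (not part of this patch), a program might use
the new advice as follows; MADV_WIPEONFORK needs a kernel and libc that define it:
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	strcpy(p, "parent-only state");
#ifdef MADV_WIPEONFORK
	madvise(p, 4096, MADV_WIPEONFORK);	/* the child will see zeroes here */
#endif
	if (fork() == 0) {
		printf("child sees first byte = %d\n", p[0]);	/* 0 if wiped */
		_exit(0);
	}
	wait(NULL);
	printf("parent still sees \"%s\"\n", p);	/* parent copy untouched */
	return 0;
}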
[akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07 08:25:15 +09:00
|
|
|
#define VM_WIPEONFORK 0x02000000 /* Wipe VMA contents in child. */
|
2012-10-09 08:28:59 +09:00
|
|
|
#define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */
|
mm: fix fault vs invalidate race for linear mappings
Fix the race between invalidate_inode_pages and do_no_page.
Andrea Arcangeli identified a subtle race between invalidation of pages from
pagecache with userspace mappings, and do_no_page.
The issue is that invalidation has to shoot down all mappings to the page,
before it can be discarded from the pagecache. Between shooting down ptes to
a particular page, and actually dropping the struct page from the pagecache,
do_no_page from any process might fault on that page and establish a new
mapping to the page just before it gets discarded from the pagecache.
The most common case where such invalidation is used is in file truncation.
This case was catered for by doing a sort of open-coded seqlock between the
file's i_size, and its truncate_count.
Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then find the
page if it is within i_size, and then check truncate_count under the page
table lock and back out and retry if it had subsequently been changed (ptl
will serialise against unmapping, and ensure a potentially updated
truncate_count is actually visible).
Complexity and documentation issues aside, the locking protocol fails in the
case where we would like to invalidate pagecache inside i_size. do_no_page
can come in anytime and filemap_nopage is not aware of the invalidation in
progress (as it is when it is outside i_size). The end result is that
dangling (->mapping == NULL) pages that appear to be from a particular file
may be mapped into userspace with nonsense data. Valid mappings to the same
place will see a different page.
Andrea implemented two working fixes, one using a real seqlock, another using
a page->flags bit. He also proposed using the page lock in do_no_page, but
that was initially considered too heavyweight. However, it is not a global or
per-file lock, and the page cacheline is modified in do_no_page to increment
_count and _mapcount anyway, so a further modification should not be a large
performance hit. Scalability is not an issue.
This patch implements this latter approach. ->nopage implementations return
with the page locked if it is possible for their underlying file to be
invalidated (in that case, they must set a special vm_flags bit to indicate
so). do_no_page only unlocks the page after setting up the mapping
completely. invalidation is excluded because it holds the page lock during
invalidation of each page (and ensures that the page is not mapped while
holding the lock).
This also allows significant simplifications in do_no_page, because we have
the page locked in the right place in the pagecache from the start.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 17:46:57 +09:00
|
|
|
|
2013-09-12 06:22:24 +09:00
|
|
|
#ifdef CONFIG_MEM_SOFT_DIRTY
|
|
|
|
# define VM_SOFTDIRTY 0x08000000 /* Not soft dirty clean area */
|
|
|
|
#else
|
|
|
|
# define VM_SOFTDIRTY 0
|
|
|
|
#endif
|
|
|
|
|
mm: introduce VM_MIXEDMAP
This series introduces some important infrastructure work. The overall result
is that:
1. We now support XIP backed filesystems using memory that have no
struct page allocated to them. And patches 6 and 7 actually implement
this for s390.
This is pretty important in a number of cases. As far as I understand,
in the case of virtualisation (eg. s390), each guest may mount a
readonly copy of the same filesystem (eg. the distro). Currently,
guests need to allocate struct pages for this image. So if you have
100 guests, you already need to allocate more memory for the struct
pages than the size of the image. I think. (Carsten?)
For other (eg. embedded) systems, you may have a very large non-
volatile filesystem. If you have to have struct pages for this, then
your RAM consumption will go up proportionally to fs size. Even
though it is just a small proportion, the RAM can be much more costly
eg in terms of power, so every KB less that Linux uses makes it more
attractive to a lot of these guys.
2. VM_MIXEDMAP allows us to support mappings where you actually do want
to refcount _some_ pages in the mapping, but not others, and support
COW on arbitrary (non-linear) mappings. Jared needs this for his NVRAM
filesystem in progress. Future iterations of this filesystem will
most likely want to migrate pages between pagecache and XIP backing,
which is where the requirement for mixed (some refcounted, some not)
comes from.
3. pte_special also has a peripheral usage that I need for my lockless
get_user_pages patch. That was shown to speed up "oltp" on db2 by
10% on a 2 socket system, which is kind of significant because they
scrounge for months to try to find 0.1% improvement on these
workloads. I'm hoping we might finally be faster than AIX on
pSeries with this :). My reference to lockless get_user_pages is not
meant to justify this patchset (which doesn't include lockless gup),
but just to show that pte_special is not some s390 specific thing that
should be hidden in arch code or xip code: I definitely want to use it
on at least x86 and powerpc as well.
This patch:
Introduce a new type of mapping, VM_MIXEDMAP. This is unlike VM_PFNMAP in
that it can support COW mappings of arbitrary ranges including ranges without
struct page *and* ranges with a struct page that we actually want to refcount
(PFNMAP can only support COW in those cases where the un-COW-ed translations
are mapped linearly in the virtual address, and can only support non
refcounted ranges).
VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and not
refcounting !pfn_valid pages (which is not an option for VM_PFNMAP, because it
needs to avoid refcounting pfn_valid pages eg. for /dev/mem mappings).
Signed-off-by: Jared Hulbert <jaredeh@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 18:12:58 +09:00
|
|
|
#define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */
|
2012-10-09 08:28:37 +09:00
|
|
|
#define VM_HUGEPAGE 0x20000000 /* MADV_HUGEPAGE marked this vma */
|
|
|
|
#define VM_NOHUGEPAGE 0x40000000 /* MADV_NOHUGEPAGE marked this vma */
|
2009-09-22 09:01:57 +09:00
|
|
|
#define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2016-02-13 06:02:08 +09:00
|
|
|
#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
|
|
|
|
#define VM_HIGH_ARCH_BIT_0 32 /* bit only usable on 64-bit architectures */
|
|
|
|
#define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit architectures */
|
|
|
|
#define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */
|
|
|
|
#define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */
|
x86,mpx: make mpx depend on x86-64 to free up VMA flag
Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4.
If a child process accesses memory that was MADV_WIPEONFORK, it will get
zeroes. The address ranges are still valid, they are just empty.
If a child process accesses memory that was MADV_DONTFORK, it will get a
segmentation fault, since those address ranges are no longer valid in
the child after fork.
Since MADV_DONTFORK also seems to be used to allow very large programs
to fork in systems with strict memory overcommit restrictions, changing
the semantics of MADV_DONTFORK might break existing programs.
The use case is libraries that store or cache information, and want to
know that they need to regenerate it in the child process after fork.
Examples of this would be:
- systemd/pulseaudio API checks (fail after fork) (replacing a getpid
check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)
The security benefits of a forking server having a re-initialized PRNG in
every child process are pretty obvious. However, due to libraries
having all kinds of internal state, and programs getting compiled with
many different versions of each library, it is unreasonable to expect
calling programs to re-initialize everything manually after fork.
A further complication is the proliferation of clone flags, programs
bypassing glibc's functions to call clone directly, and programs calling
unshare, causing the glibc pthread_atfork hook to not get called.
It would be better to have the kernel take care of this automatically.
The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior
MADV_WIPEONFORK.
This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
https://man.openbsd.org/minherit.2
This patch (of 2):
MPX only seems to be available on 64 bit CPUs, starting with Skylake and
Goldmont. Move VM_MPX into the 64 bit only portion of vma->vm_flags, in
order to free up a VMA flag.
Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Colm MacCártaigh <colm@allcosts.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07 08:25:11 +09:00
|
|
|
#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */
|
2016-02-13 06:02:08 +09:00
|
|
|
#define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
|
|
|
|
#define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
|
|
|
|
#define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
|
|
|
|
#define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
|
x86,mpx: make mpx depend on x86-64 to free up VMA flag
2017-09-07 08:25:11 +09:00
|
|
|
#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
|
2016-02-13 06:02:08 +09:00
|
|
|
#endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
|
|
|
|
|
2018-03-27 18:09:26 +09:00
|
|
|
#ifdef CONFIG_ARCH_HAS_PKEYS
|
2016-02-13 06:02:10 +09:00
|
|
|
# define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
|
|
|
|
# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */
|
2018-03-27 18:09:27 +09:00
|
|
|
# define VM_PKEY_BIT1 VM_HIGH_ARCH_1 /* on x86 and 5-bit value on ppc64 */
|
2016-02-13 06:02:10 +09:00
|
|
|
# define VM_PKEY_BIT2 VM_HIGH_ARCH_2
|
|
|
|
# define VM_PKEY_BIT3 VM_HIGH_ARCH_3
|
2018-03-27 18:09:27 +09:00
|
|
|
#ifdef CONFIG_PPC
|
|
|
|
# define VM_PKEY_BIT4 VM_HIGH_ARCH_4
|
|
|
|
#else
|
|
|
|
# define VM_PKEY_BIT4 0
|
2016-02-13 06:02:10 +09:00
|
|
|
#endif
|
2018-03-27 18:09:26 +09:00
|
|
|
#endif /* CONFIG_ARCH_HAS_PKEYS */
|
|
|
|
|
|
|
|
#if defined(CONFIG_X86)
|
|
|
|
# define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */
|
2012-10-09 08:28:37 +09:00
|
|
|
#elif defined(CONFIG_PPC)
|
|
|
|
# define VM_SAO VM_ARCH_1 /* Strong Access Ordering (powerpc) */
|
|
|
|
#elif defined(CONFIG_PARISC)
|
|
|
|
# define VM_GROWSUP VM_ARCH_1
|
|
|
|
#elif defined(CONFIG_IA64)
|
|
|
|
# define VM_GROWSUP VM_ARCH_1
|
2018-02-24 07:46:41 +09:00
|
|
|
#elif defined(CONFIG_SPARC64)
|
|
|
|
# define VM_SPARC_ADI VM_ARCH_1 /* Uses ADI tag for access control */
|
|
|
|
# define VM_ARCH_CLEAR VM_SPARC_ADI
|
2012-10-09 08:28:37 +09:00
|
|
|
#elif !defined(CONFIG_MMU)
|
|
|
|
# define VM_MAPPED_COPY VM_ARCH_1 /* T if mapped copy of data (nommu mmap) */
|
|
|
|
#endif
|
|
|
|
|
x86,mpx: make mpx depend on x86-64 to free up VMA flag
2017-09-07 08:25:11 +09:00
|
|
|
#if defined(CONFIG_X86_INTEL_MPX)
|
x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
MPX-enabled applications using large swaths of memory can
potentially have large numbers of bounds tables in process
address space to save bounds information. These tables can take
up huge swaths of memory (as much as 80% of the memory on the
system) even if we clean them up aggressively. In the worst-case
scenario, the tables can be 4x the size of the data structure
being tracked. IOW, a 1-page structure can require 4 bounds-table
pages.
Being this huge, our expectation is that folks using MPX are
going to be keen on figuring out how much memory is being
dedicated to it. So we need a way to track memory use for MPX.
If we want to specifically track MPX VMAs we need to be able to
distinguish them from normal VMAs, and keep them from getting
merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
both of those things. With this flag, MPX bounds-table VMAs can
be distinguished from other VMAs, and userspace can also walk
/proc/$pid/smaps to get memory usage for MPX.
In addition to this flag, we also introduce a special ->vm_ops
specific to MPX VMAs (see the patch "add MPX specific mmap
interface"), but currently different ->vm_ops do not by
themselves prevent VMA merging, so we still need this flag.
We understand that VM_ flags are scarce and are open to other
options.
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-11-15 00:18:25 +09:00
|
|
|
/* MPX specific bounds table or bounds directory */
|
2017-10-04 08:14:24 +09:00
|
|
|
# define VM_MPX VM_HIGH_ARCH_4
|
x86,mpx: make mpx depend on x86-64 to free up VMA flag
2017-09-07 08:25:11 +09:00
|
|
|
#else
|
|
|
|
# define VM_MPX VM_NONE
|
x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
2014-11-15 00:18:25 +09:00
|
|
|
#endif
|
|
|
|
|
2012-10-09 08:28:37 +09:00
|
|
|
#ifndef VM_GROWSUP
|
|
|
|
# define VM_GROWSUP VM_NONE
|
|
|
|
#endif
|
|
|
|
|
2010-05-25 06:32:24 +09:00
|
|
|
/* Bits set in the VMA until the stack is in its final location */
|
|
|
|
#define VM_STACK_INCOMPLETE_SETUP (VM_RAND_READ | VM_SEQ_READ)
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
|
|
|
|
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#ifdef CONFIG_STACK_GROWSUP
|
2016-02-03 09:57:46 +09:00
|
|
|
#define VM_STACK VM_GROWSUP
|
2005-04-17 07:20:36 +09:00
|
|
|
#else
|
2016-02-03 09:57:46 +09:00
|
|
|
#define VM_STACK VM_GROWSDOWN
|
2005-04-17 07:20:36 +09:00
|
|
|
#endif
|
|
|
|
|
2016-02-03 09:57:46 +09:00
|
|
|
#define VM_STACK_FLAGS (VM_STACK | VM_STACK_DEFAULT_FLAGS | VM_ACCOUNT)
|
|
|
|
|
mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.
Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism lets pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
New munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
because the current get_user_pages() can't grab PROT_NONE pages and
therefore PROT_NONE pages can't be munlocked.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 12:26:44 +09:00
|
|
|
/*
|
2011-04-28 07:26:45 +09:00
|
|
|
* Special vmas that are non-mergable, non-mlock()able.
|
|
|
|
* Note: mm/huge_memory.c VM_NO_THP depends on this definition.
|
mlock: mlocked pages are unevictable
2008-10-19 12:26:44 +09:00
|
|
|
*/
|
2014-03-04 08:38:27 +09:00
|
|
|
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)
|
mlock: mlocked pages are unevictable
2008-10-19 12:26:44 +09:00
|
|
|
|
2014-04-08 07:37:10 +09:00
|
|
|
/* This mask defines which mm->def_flags a process can inherit from its parent */
|
|
|
|
#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
|
|
|
|
|
2015-11-06 11:51:36 +09:00
|
|
|
/* This mask is used to clear all the VMA flags used by mlock */
|
|
|
|
#define VM_LOCKED_CLEAR_MASK (~(VM_LOCKED | VM_LOCKONFAULT))
|
|
|
|
|
2018-02-22 02:15:50 +09:00
|
|
|
/* Arch-specific flags to clear when updating VM flags on protection change */
|
|
|
|
#ifndef VM_ARCH_CLEAR
|
|
|
|
# define VM_ARCH_CLEAR VM_NONE
|
|
|
|
#endif
|
|
|
|
#define VM_FLAGS_CLEAR (ARCH_VM_PKEY_FLAGS | VM_ARCH_CLEAR)
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
/*
|
|
|
|
* mapping from the currently active vm_flags protection bits (the
|
|
|
|
* low four bits) to a page protection mask..
|
|
|
|
*/
|
|
|
|
extern pgprot_t protection_map[16];
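As an aside, a tiny illustrative example of how an index into this table is
formed from the four low flag bits defined earlier (standalone userspace code,
not the kernel lookup itself):
#include <stdio.h>
#define VM_READ   0x00000001UL
#define VM_WRITE  0x00000002UL
#define VM_EXEC   0x00000004UL
#define VM_SHARED 0x00000008UL
int main(void)
{
	unsigned long vm_flags = VM_READ | VM_WRITE;	/* a private rw mapping */

	/* the low four bits select one of the 16 protection_map[] entries */
	printf("protection_map index: %lu\n", vm_flags & 0x0fUL);	/* -> 3 */
	return 0;
}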
|
|
|
|
|
2007-07-19 17:47:03 +09:00
|
|
|
#define FAULT_FLAG_WRITE 0x01 /* Fault was a write access */
|
2015-02-11 07:09:51 +09:00
|
|
|
#define FAULT_FLAG_MKWRITE 0x02 /* Fault was mkwrite of existing pte */
|
|
|
|
#define FAULT_FLAG_ALLOW_RETRY 0x04 /* Retry fault if blocking */
|
|
|
|
#define FAULT_FLAG_RETRY_NOWAIT 0x08 /* Don't drop mmap_sem and wait when retrying */
|
|
|
|
#define FAULT_FLAG_KILLABLE 0x10 /* The fault task is in SIGKILL killable region */
|
|
|
|
#define FAULT_FLAG_TRIED 0x20 /* Second try */
|
|
|
|
#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
|
2016-02-13 06:02:21 +09:00
|
|
|
#define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
|
2016-02-13 06:02:24 +09:00
|
|
|
#define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
|
mm: make faultaround produce old ptes
Based on Kirill's patch [1].
Currently, faultaround code produces young pte. This can screw up
vmscan behaviour[2], as it makes vmscan think that these pages are hot
and not push them out on first round.
During sparse file access faultaround gets more pages mapped and all of
them are young. Under memory pressure, this makes vmscan swap out anon
pages instead, or to drop other page cache pages which otherwise stay
resident.
Modify faultaround to produce old ptes if sysctl 'want_old_faultaround_pte'
is set, so they can easily be reclaimed under memory pressure.
This can to some extent defeat the purpose of faultaround on machines
without a hardware accessed bit, as it will not help us reduce the
number of minor page faults.
Making the faultaround ptes old results in a unixbench regression for some
architectures [3][4]. But on some architectures like arm64 it is not found
to cause any regression.
unixbench shell8 scores on arm64 v8.2 hardware with CONFIG_ARM64_HW_AFDBM
enabled (5 runs min, max, avg):
Base: (741,748,744)
With this patch: (739,748,743)
So by default produce young ptes and provide a sysctl option to make the
ptes old.
[1] https://marc.info/?l=linux-mm&m=146348837703148
[2] https://lkml.org/lkml/2016/4/18/612
[3] https://marc.info/?l=linux-kernel&m=146582237922378&w=2
[4] https://marc.info/?l=linux-mm&m=146589376909424&w=2
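A hedged sketch of the decision this change introduces (plain C mirroring the
flag described here, not the actual fault-around code):
#include <stdbool.h>
#include <stdio.h>
#define FAULT_FLAG_PREFAULT_OLD 0x400	/* as defined below in this header */
/* should a prefaulted (fault-around) pte be marked young/accessed? */
static bool prefault_pte_young(unsigned int fault_flags)
{
	return !(fault_flags & FAULT_FLAG_PREFAULT_OLD);
}
int main(void)
{
	printf("default: %d, with sysctl set: %d\n",
	       prefault_pte_young(0), prefault_pte_young(FAULT_FLAG_PREFAULT_OLD));
	return 0;
}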
Change-Id: I193185cc953bc33a44fc24963a9df9e555906d95
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Patch-mainline: linux-mm @ Fri, 19 Jan 2018 17:24:54
[vinmenon@codeaurora.org: enable by default since arm works well
with old fault_around ptes + edit the links in commit message to
fix checkpatch issues]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
[swatsrid@codeaurora.org: Fix merge conflicts]
Signed-off-by: Swathi Sridhar <swatsrid@codeaurora.org>
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
2016-05-21 08:58:41 +09:00
|
|
|
#define FAULT_FLAG_PREFAULT_OLD 0x400 /* Make faultaround ptes old */
|
2018-04-17 23:33:24 +09:00
|
|
|
/* Speculative fault, not holding mmap_sem */
|
|
|
|
#define FAULT_FLAG_SPECULATIVE 0x200
|
2007-07-19 17:47:03 +09:00
|
|
|
|
2017-02-23 08:39:50 +09:00
|
|
|
#define FAULT_FLAG_TRACE \
|
|
|
|
{ FAULT_FLAG_WRITE, "WRITE" }, \
|
|
|
|
{ FAULT_FLAG_MKWRITE, "MKWRITE" }, \
|
|
|
|
{ FAULT_FLAG_ALLOW_RETRY, "ALLOW_RETRY" }, \
|
|
|
|
{ FAULT_FLAG_RETRY_NOWAIT, "RETRY_NOWAIT" }, \
|
|
|
|
{ FAULT_FLAG_KILLABLE, "KILLABLE" }, \
|
|
|
|
{ FAULT_FLAG_TRIED, "TRIED" }, \
|
|
|
|
{ FAULT_FLAG_USER, "USER" }, \
|
|
|
|
{ FAULT_FLAG_REMOTE, "REMOTE" }, \
|
|
|
|
{ FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }
|
|
|
|
|
2007-07-19 17:46:59 +09:00
|
|
|
/*
|
2007-07-19 17:47:03 +09:00
|
|
|
* vm_fault is filled by the pagefault handler and passed to the vma's
|
2007-07-19 17:47:05 +09:00
|
|
|
* ->fault function. The vma's ->fault is responsible for returning a bitmask
|
|
|
|
* of VM_FAULT_xxx flags that give details about how the fault was handled.
|
2007-07-19 17:46:59 +09:00
|
|
|
*
|
2016-01-15 08:20:12 +09:00
|
|
|
* MM layer fills up gfp_mask for page allocations but fault handler might
|
|
|
|
* alter it if its implementation requires a different allocation context.
|
|
|
|
*
|
2015-02-11 07:09:51 +09:00
|
|
|
* pgoff should be used in favour of virtual_address, if possible.
|
2007-07-19 17:46:59 +09:00
|
|
|
*/
|
2007-07-19 17:47:03 +09:00
|
|
|
struct vm_fault {
|
2016-12-15 08:06:58 +09:00
|
|
|
struct vm_area_struct *vma; /* Target VMA */
|
2007-07-19 17:47:03 +09:00
|
|
|
unsigned int flags; /* FAULT_FLAG_xxx flags */
|
2016-01-15 08:20:12 +09:00
|
|
|
gfp_t gfp_mask; /* gfp mask to be used for allocations */
|
2007-07-19 17:47:03 +09:00
|
|
|
pgoff_t pgoff; /* Logical page offset based on vma */
|
2016-12-15 08:06:58 +09:00
|
|
|
unsigned long address; /* Faulting virtual address */
|
2018-04-17 23:33:24 +09:00
|
|
|
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
|
|
|
|
unsigned int sequence;
|
|
|
|
pmd_t orig_pmd; /* value of PMD at the time of fault */
|
|
|
|
#endif
|
2016-12-15 08:06:58 +09:00
|
|
|
pmd_t *pmd; /* Pointer to pmd entry matching
|
2016-12-15 08:07:16 +09:00
|
|
|
* the 'address' */
|
mm,fs,dax: change ->pmd_fault to ->huge_fault
Patch series "1G transparent hugepage support for device dax", v2.
The following series implements support for 1G transparent hugepage on
x86 for device dax. The bulk of the code was written by Matthew Wilcox a
while back supporting transparent 1G hugepage for fs DAX. I have
forward ported the relevant bits to 4.10-rc. The current submission has
only the necessary code to support device DAX.
Comments from Dan Williams: So the motivation and intended user of this
functionality mirrors the motivation and users of 1GB page support in
hugetlbfs. Given expected capacities of persistent memory devices an
in-memory database may want to reduce tlb pressure beyond what they can
already achieve with 2MB mappings of a device-dax file. We have
customer feedback to that effect as Willy mentioned in his previous
version of these patches [1].
[1]: https://lkml.org/lkml/2016/1/31/52
Comments from Nilesh @ Oracle:
There are applications which have a process model; and if you assume
10,000 processes attempting to mmap all the 6TB memory available on a
server; we are looking at the following:
processes : 10,000
memory : 6TB
pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB
pmd @ 2M page size: 120,000 / 512 = ~240GB
pud @ 1G page size: 240GB / 512 = ~480MB
As you can see with 2M pages, this system will use up an exorbitant
amount of DRAM to hold the page tables; but the 1G pages finally brings
it down to a reasonable level. Memory sizes will keep increasing; so
this number will keep increasing.
An argument can be made to convert the applications from process model
to thread model, but in the real world that may not be always practical.
Hopefully this helps explain the use case where this is valuable.
This patch (of 3):
In preparation for adding the ability to handle PUD pages, convert
vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The
vm_fault structure is extended to include a union of the different page
table pointers that may be needed, and three flag bits are reserved to
indicate which type of pointer is in the union.
[ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
[dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
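The page-table overhead arithmetic quoted above can be reproduced with a small standalone program; the numbers are illustrative only and mirror the 10,000-process / 6TB example from the commit message.

#include <stdio.h>

int main(void)
{
        unsigned long long procs = 10000ULL;
        unsigned long long mem = 6ULL << 40;                    /* 6TB mapped per process */
        unsigned long long pte = mem / 4096 * 8 * procs;        /* 8 bytes per 4K page */
        unsigned long long pmd = pte / 512;                     /* one pmd covers 512 ptes */
        unsigned long long pud = pmd / 512;                     /* one pud covers 512 pmds */

        printf("pte: ~%lluGB pmd: ~%lluGB pud: ~%lluMB\n",
               pte >> 30, pmd >> 30, pud >> 20);
        return 0;
}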
2017-02-25 07:56:59 +09:00
|
|
|
pud_t *pud; /* Pointer to pud entry matching
|
|
|
|
* the 'address'
|
|
|
|
*/
|
2016-12-15 08:07:16 +09:00
|
|
|
pte_t orig_pte; /* Value of PTE at the time of fault */
|
2007-07-19 17:47:03 +09:00
|
|
|
|
2016-12-15 08:07:18 +09:00
|
|
|
struct page *cow_page; /* Page handler may use for COW fault */
|
|
|
|
struct mem_cgroup *memcg; /* Cgroup cow_page belongs to */
|
2007-07-19 17:47:03 +09:00
|
|
|
struct page *page; /* ->fault handlers should return a
|
2007-07-19 17:47:05 +09:00
|
|
|
* page here, unless VM_FAULT_NOPAGE
|
2007-07-19 17:47:03 +09:00
|
|
|
* is set (which is also implied by
|
2007-07-19 17:47:05 +09:00
|
|
|
* VM_FAULT_ERROR).
|
2007-07-19 17:47:03 +09:00
|
|
|
*/
|
2016-12-15 08:06:58 +09:00
|
|
|
/* These three entries are valid only while holding ptl lock */
|
2016-07-27 07:25:20 +09:00
|
|
|
pte_t *pte; /* Pointer to pte entry matching
|
|
|
|
* the 'address'. NULL if the page
|
|
|
|
* table hasn't been allocated.
|
|
|
|
*/
|
|
|
|
spinlock_t *ptl; /* Page table lock.
|
|
|
|
* Protects pte page table if 'pte'
|
|
|
|
* is not NULL, otherwise pmd.
|
|
|
|
*/
|
2016-07-27 07:25:23 +09:00
|
|
|
pgtable_t prealloc_pte; /* Pre-allocated pte page table.
|
|
|
|
* vm_ops->map_pages() calls
|
|
|
|
* alloc_set_pte() from atomic context.
|
|
|
|
* do_fault_around() pre-allocates
|
|
|
|
* page table to avoid allocation from
|
|
|
|
* atomic context.
|
|
|
|
*/
|
2020-05-05 00:21:02 +09:00
|
|
|
unsigned long vma_flags; /* Speculative Page Fault field */
|
|
|
|
pgprot_t vma_page_prot; /* Speculative Page Fault field */
|
2020-07-06 16:00:01 +09:00
|
|
|
ANDROID_VENDOR_DATA(1);
|
|
|
|
ANDROID_VENDOR_DATA(2);
|
2007-07-19 17:46:59 +09:00
|
|
|
};
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2017-02-25 07:57:08 +09:00
|
|
|
/* page entry size for vm->huge_fault() */
|
|
|
|
enum page_entry_size {
|
|
|
|
PE_SIZE_PTE = 0,
|
|
|
|
PE_SIZE_PMD,
|
|
|
|
PE_SIZE_PUD,
|
|
|
|
};
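As a hedged sketch (not code from this tree), a ->huge_fault implementation would typically dispatch on pe_size and fall back to the 4K path when it cannot map a larger entry; the example_* names below are hypothetical.

static vm_fault_t example_huge_fault(struct vm_fault *vmf,
                                     enum page_entry_size pe_size)
{
        switch (pe_size) {
        case PE_SIZE_PTE:
                return VM_FAULT_FALLBACK;       /* let the normal ->fault path map 4K */
        case PE_SIZE_PMD:
                return example_map_pmd(vmf);    /* hypothetical 2M mapping helper */
        case PE_SIZE_PUD:
                return example_map_pud(vmf);    /* hypothetical 1G mapping helper */
        }
        return VM_FAULT_SIGBUS;
}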
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
/*
|
|
|
|
* These are the virtual MM functions - opening of an area, closing and
|
|
|
|
* unmapping it (needed to keep files on disk up-to-date etc), pointer
|
2018-05-29 21:14:07 +09:00
|
|
|
* to the functions called when a no-page or a wp-page exception occurs.
|
2005-04-17 07:20:36 +09:00
|
|
|
*/
|
|
|
|
struct vm_operations_struct {
|
|
|
|
void (*open)(struct vm_area_struct * area);
|
|
|
|
void (*close)(struct vm_area_struct * area);
|
2017-11-30 09:10:28 +09:00
|
|
|
int (*split)(struct vm_area_struct * area, unsigned long addr);
|
2015-09-05 07:48:04 +09:00
|
|
|
int (*mremap)(struct vm_area_struct * area);
|
2018-04-06 08:25:23 +09:00
|
|
|
vm_fault_t (*fault)(struct vm_fault *vmf);
|
|
|
|
vm_fault_t (*huge_fault)(struct vm_fault *vmf,
|
|
|
|
enum page_entry_size pe_size);
|
2016-12-15 08:06:58 +09:00
|
|
|
void (*map_pages)(struct vm_fault *vmf,
|
2016-07-27 07:25:20 +09:00
|
|
|
pgoff_t start_pgoff, pgoff_t end_pgoff);
|
2018-04-06 08:24:25 +09:00
|
|
|
unsigned long (*pagesize)(struct vm_area_struct * area);
|
2006-06-23 18:03:43 +09:00
|
|
|
|
|
|
|
/* notification that a previously read-only page is about to become
|
|
|
|
* writable, if an error is returned it will cause a SIGBUS */
|
2018-04-06 08:25:23 +09:00
|
|
|
vm_fault_t (*page_mkwrite)(struct vm_fault *vmf);
|
2008-07-24 13:27:05 +09:00
|
|
|
|
2015-04-16 08:15:11 +09:00
|
|
|
/* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
|
2018-04-06 08:25:23 +09:00
|
|
|
vm_fault_t (*pfn_mkwrite)(struct vm_fault *vmf);
|
2015-04-16 08:15:11 +09:00
|
|
|
|
2008-07-24 13:27:05 +09:00
|
|
|
/* called by access_process_vm when get_user_pages() fails, typically
|
|
|
|
* for use by special VMAs that can switch between memory and hardware
|
|
|
|
*/
|
|
|
|
int (*access)(struct vm_area_struct *vma, unsigned long addr,
|
|
|
|
void *buf, int len, int write);
|
2014-05-20 07:58:32 +09:00
|
|
|
|
|
|
|
/* Called by the /proc/PID/maps code to ask the vma whether it
|
|
|
|
* has a special name. Returning non-NULL will also cause this
|
|
|
|
* vma to be dumped unconditionally. */
|
|
|
|
const char *(*name)(struct vm_area_struct *vma);
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
#ifdef CONFIG_NUMA
|
2008-04-28 18:13:14 +09:00
|
|
|
/*
|
|
|
|
* set_policy() op must add a reference to any non-NULL @new mempolicy
|
|
|
|
* to hold the policy upon return. Caller should pass NULL @new to
|
|
|
|
* remove a policy and fall back to surrounding context--i.e. do not
|
|
|
|
* install a MPOL_DEFAULT policy, nor the task or system default
|
|
|
|
* mempolicy.
|
|
|
|
*/
|
2005-04-17 07:20:36 +09:00
|
|
|
int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
|
2008-04-28 18:13:14 +09:00
|
|
|
|
|
|
|
/*
|
|
|
|
* get_policy() op must add reference [mpol_get()] to any policy at
|
|
|
|
* (vma,addr) marked as MPOL_SHARED. The shared policy infrastructure
|
|
|
|
* in mm/mempolicy.c will do this automatically.
|
|
|
|
* get_policy() must NOT add a ref if the policy at (vma,addr) is not
|
|
|
|
* marked as MPOL_SHARED. vma policies are protected by the mmap_sem.
|
|
|
|
* If no [shared/vma] mempolicy exists at the addr, get_policy() op
|
|
|
|
* must return NULL--i.e., do not "fallback" to task or system default
|
|
|
|
* policy.
|
|
|
|
*/
|
2005-04-17 07:20:36 +09:00
|
|
|
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr);
|
|
|
|
#endif
|
2014-12-18 23:48:15 +09:00
|
|
|
/*
|
|
|
|
* Called by vm_normal_page() for special PTEs to find the
|
|
|
|
* page for @addr. This is useful if the default behavior
|
|
|
|
* (using pte_page()) would not find the correct page.
|
|
|
|
*/
|
|
|
|
struct page *(*find_special_page)(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr);
|
2020-05-02 16:41:26 +09:00
|
|
|
|
|
|
|
ANDROID_KABI_RESERVE(1);
|
|
|
|
ANDROID_KABI_RESERVE(2);
|
|
|
|
ANDROID_KABI_RESERVE(3);
|
|
|
|
ANDROID_KABI_RESERVE(4);
|
2005-04-17 07:20:36 +09:00
|
|
|
};
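For illustration only, a minimal vm_operations_struct for a hypothetical driver that implements just ->fault; every other hook is left NULL so the generic code applies its defaults. The exampledrv_* names are assumptions, the rest are real kernel symbols.

static vm_fault_t exampledrv_fault(struct vm_fault *vmf)
{
        struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

        if (!page)
                return VM_FAULT_OOM;
        vmf->page = page;       /* handler hands the page back via vmf->page */
        return 0;
}

static const struct vm_operations_struct exampledrv_vm_ops = {
        .fault = exampledrv_fault,
};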
|
|
|
|
|
2018-04-17 23:33:13 +09:00
|
|
|
static inline void INIT_VMA(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
INIT_LIST_HEAD(&vma->anon_vma_chain);
|
2018-04-17 23:33:14 +09:00
|
|
|
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
|
|
|
|
seqcount_init(&vma->vm_sequence);
|
mm: protect mm_rb tree with a rwlock
This change is inspired by the Peter's proposal patch [1] which was
protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
that particular case, and it is introducing major performance degradation
due to excessive scheduling operations.
To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
is protecting it access using a rwlock. As the mm_rb tree is a O(log n)
search it is safe to protect it using such a lock. The VMA cache is not
protected by the new rwlock and it should not be used without holding the
mmap_sem.
To allow the picked VMA structure to be used once the rwlock is released, a
use count is added to the VMA structure. When the VMA is allocated it is
set to 1. Each time the VMA is picked with the rwlock held its use count
is incremented. Each time the VMA is released it is decremented. When the
use count hits zero, the VMA is no longer in use and should be freed.
This patch is preparing for 2 kinds of VMA access:
- as usual, under the control of the mmap_sem,
- without holding the mmap_sem for the speculative page fault handler.
Accesses done under the control of the mmap_sem don't require grabbing the
rwlock to protect read access to the mm_rb tree, but write accesses must
be done under the protection of the rwlock too. This affects inserting and
removing elements in the RB tree.
The patch is introducing 2 new functions:
- vma_get() to find a VMA based on an address by holding the new rwlock.
- vma_put() to release the VMA when it is no longer used.
These services are designed to be used when access are made to the RB tree
without holding the mmap_sem.
When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
we rely on the WMB done when releasing the rwlock to serialize the write
with the RMB done in a later patch to check for the VMA's validity.
When free_vma is called, the file associated with the VMA is closed
immediately, but the policy and the file structure remain in use until
the VMA's use count reaches 0, which may happen later when exiting an
in-progress speculative page fault.
[1] https://patchwork.kernel.org/patch/5108281/
Change-Id: I9ecc922b8efa4b28975cc6a8e9531284c24ac14e
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:23
[vinmenon@codeaurora.org: fix the return of put_vma]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
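A hedged sketch of the lookup pattern described above; the vma_get()/vma_put() signatures are assumptions based on the commit message, and the wrapper name is hypothetical.

static int example_speculative_lookup(struct mm_struct *mm, unsigned long address)
{
        struct vm_area_struct *vma = vma_get(mm, address);     /* assumed signature */

        if (!vma)
                return -EFAULT;
        /* ... speculative fault work on 'vma', done without mmap_sem ... */
        vma_put(vma);           /* drop the use count taken by vma_get() */
        return 0;
}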
2018-04-17 23:33:23 +09:00
|
|
|
atomic_set(&vma->vm_ref_count, 1);
|
2018-04-17 23:33:14 +09:00
|
|
|
#endif
|
2018-04-17 23:33:13 +09:00
|
|
|
}
|
|
|
|
|
2018-07-27 08:37:25 +09:00
|
|
|
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
|
|
|
|
{
|
mm: fix vma_is_anonymous() false-positives
vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
VMA. This is unreliable as ->mmap may not set ->vm_ops.
False-positive vma_is_anonymous() may lead to crashes:
next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
------------[ cut here ]------------
kernel BUG at mm/memory.c:1422!
invalid opcode: 0000 [#1] SMP KASAN
CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
01/01/2011
RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
Call Trace:
unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
unmap_mapping_range_vma mm/memory.c:2792 [inline]
unmap_mapping_range_tree mm/memory.c:2813 [inline]
unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
unmap_mapping_range+0x48/0x60 mm/memory.c:2880
truncate_pagecache+0x54/0x90 mm/truncate.c:800
truncate_setsize+0x70/0xb0 mm/truncate.c:826
simple_setattr+0xe9/0x110 fs/libfs.c:409
notify_change+0xf13/0x10f0 fs/attr.c:335
do_truncate+0x1ac/0x2b0 fs/open.c:63
do_sys_ftruncate+0x492/0x560 fs/open.c:205
__do_sys_ftruncate fs/open.c:215 [inline]
__se_sys_ftruncate fs/open.c:213 [inline]
__x64_sys_ftruncate+0x59/0x80 fs/open.c:213
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Reproducer:
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>
#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
#define KCOV_ENABLE _IO('c', 100)
#define KCOV_DISABLE _IO('c', 101)
#define COVER_SIZE (1024<<10)
#define KCOV_TRACE_PC 0
#define KCOV_TRACE_CMP 1
int main(int argc, char **argv)
{
int fd;
unsigned long *cover;
system("mount -t debugfs none /sys/kernel/debug");
fd = open("/sys/kernel/debug/kcov", O_RDWR);
ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
munmap(cover, COVER_SIZE * sizeof(unsigned long));
cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
memset(cover, 0, COVER_SIZE * sizeof(unsigned long));
ftruncate(fd, 3UL << 20);
return 0;
}
This can be fixed by assigning anonymous VMAs their own vm_ops and not relying
on it being NULL.
If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs.
Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-27 08:37:35 +09:00
|
|
|
static const struct vm_operations_struct dummy_vm_ops = {};
|
|
|
|
|
2018-08-22 13:53:06 +09:00
|
|
|
memset(vma, 0, sizeof(*vma));
|
2018-07-27 08:37:25 +09:00
|
|
|
vma->vm_mm = mm;
|
|
|
|
vma->vm_ops = &dummy_vm_ops;
|
2018-04-17 23:33:13 +09:00
|
|
|
INIT_VMA(vma);
|
2018-07-27 08:37:25 +09:00
|
|
|
}
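A hedged usage sketch of vma_init() on a stack-allocated vma, following the helper above; the function name is hypothetical.

static void example_stack_vma(struct mm_struct *mm)
{
        struct vm_area_struct vma;

        vma_init(&vma, mm);     /* zeroes the vma, sets vm_mm and the dummy vm_ops */
        /* ... use 'vma' as a short-lived, properly initialized mapping ... */
}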
|
|
|
|
|
|
|
|
static inline void vma_set_anonymous(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
vma->vm_ops = NULL;
|
|
|
|
}
|
|
|
|
|
2019-07-19 07:57:24 +09:00
|
|
|
static inline bool vma_is_anonymous(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
return !vma->vm_ops;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_SHMEM
|
|
|
|
/*
|
|
|
|
* vma_is_shmem() is not inline because it is used only by slow
|
|
|
|
* paths in userfault.
|
|
|
|
*/
|
|
|
|
bool vma_is_shmem(struct vm_area_struct *vma);
|
|
|
|
#else
|
|
|
|
static inline bool vma_is_shmem(struct vm_area_struct *vma) { return false; }
|
|
|
|
#endif
|
|
|
|
|
|
|
|
int vma_is_stack_for_current(struct vm_area_struct *vma);
|
|
|
|
|
mm: do not initialize TLB stack vma's with vma_init()
Commit 2c4541e24c55 ("mm: use vma_init() to initialize VMAs on stack and
data segments") tried to initialize various left-over ad-hoc vma's
"properly", but actually made things worse for the temporary vma's used
for TLB flushing.
vma_init() doesn't actually initialize all of the vma, just a few
fields, so doing something like
- struct vm_area_struct vma = { .vm_mm = tlb->mm, };
+ struct vm_area_struct vma;
+
+ vma_init(&vma, tlb->mm);
was actually very bad: instead of having a nicely initialized vma with
every field but "vm_mm" zeroed, you'd have an entirely uninitialized vma
with only a couple of fields initialized. And they weren't even fields
that the code in question mostly cared about.
The flush_tlb_range() function takes a "struct vma" rather than a
"struct mm_struct", because a few architectures actually care about what
kind of range it is - being able to only do an ITLB flush if it's a
range that doesn't have data accesses enabled, for example. And all the
normal users already have the vma for doing the range invalidation.
But a few people want to call flush_tlb_range() with a range they just
made up, so they also end up using a made-up vma. x86 just has a
special "flush_tlb_mm_range()" function for this, but other
architectures (arm and ia64) do the "use fake vma" thing instead, and
thus got caught up in the vma_init() changes.
At the same time, the TLB flushing code really doesn't care about most
other fields in the vma, so vma_init() is just unnecessary and
pointless.
This fixes things by having an explicit "this is just an initializer for
the TLB flush" initializer macro, which is used by the arm/arm64/ia64
people who mis-use this interface with just a dummy vma.
Fixes: 2c4541e24c55 ("mm: use vma_init() to initialize VMAs on stack and data segments")
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-02 05:43:38 +09:00
|
|
|
/* flush_tlb_range() takes a vma, not a mm, and can care about flags */
|
|
|
|
#define TLB_FLUSH_VMA(mm,flags) { .vm_mm = (mm), .vm_flags = (flags) }
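A hedged sketch of the intended use described in the commit above: architectures that need a vma for flush_tlb_range() build a throwaway one carrying only vm_mm and vm_flags. The wrapper name is hypothetical.

static inline void example_flush_range(struct mm_struct *mm,
                                       unsigned long start, unsigned long end)
{
        struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);

        flush_tlb_range(&vma, start, end);
}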
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
struct mmu_gather;
|
|
|
|
struct inode;
|
|
|
|
|
2019-07-17 08:30:47 +09:00
|
|
|
#if !defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
|
2016-01-16 09:56:52 +09:00
|
|
|
static inline int pmd_devmap(pmd_t pmd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2017-02-25 07:57:02 +09:00
|
|
|
static inline int pud_devmap(pud_t pud)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2017-03-17 00:26:53 +09:00
|
|
|
static inline int pgd_devmap(pgd_t pgd)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2016-01-16 09:56:52 +09:00
|
|
|
#endif
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
/*
|
|
|
|
* FIXME: take this include out, include page-flags.h in
|
|
|
|
* files which need it (119 of them)
|
|
|
|
*/
|
|
|
|
#include <linux/page-flags.h>
|
thp: transparent hugepage core
Lately I've been working to make KVM use hugepages transparently without
the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
see removed:
1) hugepages have to be swappable or the guest physical memory remains
locked in RAM and can't be paged out to swap
2) if a hugepage allocation fails, regular pages should be allocated
instead and mixed in the same vma without any failure and without
userland noticing
3) if some task quits and more hugepages become available in the
buddy, guest physical memory backed by regular pages should be
relocated on hugepages automatically in regions under
madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
not null)
4) avoidance of reservation and maximization of use of hugepages whenever
possible. Reservation (needed to avoid runtime fatal failures) may be ok for
1 machine with 1 database with 1 database cache with 1 database cache size
known at boot time. It's definitely not feasible with a virtualization
hypervisor usage like RHEV-H that runs an unknown number of virtual machines
with an unknown size of each virtual machine with an unknown amount of
pagecache that could be potentially useful in the host for guest not using
O_DIRECT (aka cache=off).
hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization,
because with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
to 19 in case only the hypervisor uses transparent hugepages, and they
decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
linux hypervisor and the linux guest both uses this patch (though the
guest will limit the addition speedup to anonymous regions only for
now...). Even more important is that the tlb miss handler is much slower
on a NPT/EPT guest than for a regular shadow paging or no-virtualization
scenario. So maximizing the amount of virtual memory cached by the TLB
pays off significantly more with NPT/EPT than without (even if there would
be no significant speedup in the tlb-miss runtime).
The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on
regular anonymous vmas. This is what this patch tries to achieve in the
least intrusive possible way. We want hugepages and hugetlb to be used in
a way so that all applications can benefit without changes (as usual we
leverage the KVM virtualization design: by improving the Linux VM at
large, KVM gets the performance boost too).
The most important design choice is: always fallback to 4k allocation if
the hugepage allocation fails! This is the _very_ opposite of some large
pagecache patches that failed with -EIO back then if a 64k (or similar)
allocation failed...
Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split a
hugepage into 512 regular pages and it has to be done with an operation
that can't fail. This way the reliability of the swapping isn't decreased
(no need to allocate memory when we are short on memory to swap) and it's
trivial to plug a split_huge_page* one-liner where needed without
polluting the VM. Over time we can teach mprotect, mremap and friends to
handle pmd_trans_huge natively without calling split_huge_page*. The fact
it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
(instead of the current void) we'd need to rollback the mprotect from the
middle of it (ideally including undoing the split_vma) which would be a
big change and in the very wrong direction (it'd likely be simpler not to
call split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.
The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
incremental and it'll just be a "harmless" addition later if this initial
part is agreed upon. It also should be noted that locking-wise replacing
regular pages with hugepages is going to be very easy if compared to what
I'm doing below in split_huge_page, as it will only happen when
page_count(page) matches page_mapcount(page) if we can take the PG_lock
and mmap_sem in write mode. collapse_huge_page will be a "best effort"
that (unlike split_huge_page) can fail at the minimal sign of trouble and
we can try again later. collapse_huge_page will be similar to how KSM
works and the madvise(MADV_HUGEPAGE) will work similar to
madvise(MADV_MERGEABLE).
The default I like is that transparent hugepages are used at page fault
time. This can be changed with
/sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
to three values "always", "madvise", "never" which mean respectively that
hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
controls if the hugepage allocation should defrag memory aggressively
"always", only inside "madvise" regions, or "never".
The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would always result in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point of
view. In short there is no current existing way to serialize the O_DIRECT
final put_page against split_huge_page_refcount so I had to invent a new
one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
returns so...). And I didn't want to impact all gup/gup_fast users for
now, maybe if we change the gup interface substantially we can avoid this
locking, I admit I didn't think too much about it because changing the gup
unpinning interface would be invasive.
If we ignored O_DIRECT we could stick to the existing compound refcounting
code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
(and any other mmu notifier user) would call it without FOLL_GET (and if
FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
current task mmu notifier list yet). But O_DIRECT is fundamental for
decent performance of virtualized I/O on fast storage so we can't avoid it
to solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.
Swap and oom works fine (well just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young bit
on the pmd, that didn't have a range check but I think KVM will be fine
because the whole point of hugepages is that EPT/NPT will also use a huge
pmd when they notice gup returns pages with PageCompound set, so they
won't care of a range and there's just the pmd young bit to check in that
case.
NOTE: in some cases if the L2 cache is small, this may slowdown and waste
memory during COWs because 4M of memory are accessed in a single fault
instead of 8k (the payoff is that after COW the program can run faster).
So we might want to switch the copy_huge_page (and clear_huge_page too) to
not temporal stores. I also extensively researched ways to avoid this
cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
up to 1M (I can send those patches that fully implemented prefault) but I
concluded they're not worth it and they add a huge additional complexity
and they remove all tlb benefits until the full hugepage has been faulted
in, to save a little bit of memory and some cache during app startup, but
they still don't improve substantially the cache-trashing during startup
if the prefault happens in >4k chunks. One reason is that those 4k pte
entries copied are still mapped on a perfectly cache-colored hugepage, so
the trashing is the worst one can generate in those copies (cow of 4k page
copies aren't so well colored so they trashes less, but again this results
in software running faster after the page fault). Those prefault patches
allowed things like a pte where post-cow pages were local 4k regular anon
pages and the not-yet-cowed pte entries were pointing in the middle of
some hugepage mapped read-only. If it doesn't payoff substantially with
todays hardware it will payoff even less in the future with larger l2
caches, and the prefault logic would blot the VM a lot. If one is
embedded, transparent_hugepage can be disabled during boot with sysfs or
with the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages inside madvise regions) that
will ensure not a single hugepage is allocated at boot time. It is simple
enough to just disable transparent hugepage globally and let transparent
hugepages be allocated selectively by applications in the MADV_HUGEPAGE
region (both at page fault time, and if enabled with the
collapse_huge_page too through the kernel daemon).
This patch supports only hugepages mapped in the pmd, archs that have
smaller hugepages will not fit in this patch alone. Also some archs like
power have certain tlb limits that prevent mixing different page sizes in
the same regions so they will not fit in this framework that requires
"graceful fallback" to basic PAGE_SIZE in case of physical memory
fragmentation. hugetlbfs remains a perfect fit for those because its
software limits happen to match the hardware limits. hugetlbfs also
remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
to be found not fragmented after a certain system uptime and that would be
very expensive to defragment with relocation, so requiring reservation.
hugetlbfs is the "reservation way", the point of transparent hugepages is
not to have any reservation at all and maximizing the use of cache and
hugepages at all times automatically.
Some performance result:
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
ages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988
============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (3UL*1024*1024*1024)
int main()
{
char *p = malloc(SIZE), *p2;
struct timeval before, after;
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset page fault %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
return 0;
}
============
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14 08:46:52 +09:00
|
|
|
#include <linux/huge_mm.h>
|
2005-04-17 07:20:36 +09:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Methods to modify the page usage count.
|
|
|
|
*
|
|
|
|
* What counts for a page usage:
|
|
|
|
* - cache mapping (page->mapping)
|
|
|
|
* - private data (page->private)
|
|
|
|
* - page mapped in a task's page tables, each mapping
|
|
|
|
* is counted separately
|
|
|
|
*
|
|
|
|
* Also, many kernel routines increase the page count before a critical
|
|
|
|
* routine so they can be sure the page doesn't go away from under them.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
2006-09-26 15:31:35 +09:00
|
|
|
* Drop a ref, return true if the refcount fell to zero (the page has no users)
|
2005-04-17 07:20:36 +09:00
|
|
|
*/
|
2006-03-22 17:08:03 +09:00
|
|
|
static inline int put_page_testzero(struct page *page)
|
|
|
|
{
|
2016-03-18 06:19:26 +09:00
|
|
|
VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
|
|
|
|
return page_ref_dec_and_test(page);
|
2006-03-22 17:08:03 +09:00
|
|
|
}
|
2005-04-17 07:20:36 +09:00
|
|
|
|
|
|
|
/*
|
2006-03-22 17:08:03 +09:00
|
|
|
* Try to grab a ref unless the page has a refcount of zero, return false if
|
|
|
|
* that is the case.
|
2013-08-28 17:37:42 +09:00
|
|
|
* This can be called when MMU is off so it must not access
|
|
|
|
* any of the virtual mappings.
|
2005-04-17 07:20:36 +09:00
|
|
|
*/
|
2006-03-22 17:08:03 +09:00
|
|
|
static inline int get_page_unless_zero(struct page *page)
|
|
|
|
{
|
2016-03-18 06:19:26 +09:00
|
|
|
return page_ref_add_unless(page, 1, 0);
|
2006-03-22 17:08:03 +09:00
|
|
|
}
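A hedged sketch of the usual speculative-reference pattern built on the helper above; the function name is hypothetical.

static bool example_try_pin(struct page *page)
{
        if (!get_page_unless_zero(page))
                return false;   /* refcount was already zero, page is being freed */
        /* ... the page cannot be freed from under us here ... */
        put_page(page);
        return true;
}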
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2010-01-27 12:06:39 +09:00
|
|
|
extern int page_is_ram(unsigned long pfn);
|
2015-08-11 12:07:05 +09:00
|
|
|
|
|
|
|
enum {
|
|
|
|
REGION_INTERSECTS,
|
|
|
|
REGION_DISJOINT,
|
|
|
|
REGION_MIXED,
|
|
|
|
};
|
|
|
|
|
2016-01-27 05:57:28 +09:00
|
|
|
int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
|
|
|
|
unsigned long desc);
|
2010-01-27 12:06:39 +09:00
|
|
|
|
2008-02-05 15:28:31 +09:00
|
|
|
/* Support for virtually mapped pages */
|
2008-02-05 15:28:32 +09:00
|
|
|
struct page *vmalloc_to_page(const void *addr);
|
|
|
|
unsigned long vmalloc_to_pfn(const void *addr);
|
2008-02-05 15:28:31 +09:00
|
|
|
|
2008-03-12 16:51:31 +09:00
|
|
|
/*
|
|
|
|
* Determine if an address is within the vmalloc range
|
|
|
|
*
|
|
|
|
* On nommu, vmalloc/vfree wrap through kmalloc/kfree directly, so there
|
|
|
|
* is no special casing required.
|
|
|
|
*/
|
2013-08-23 05:46:07 +09:00
|
|
|
|
|
|
|
#ifdef CONFIG_ENABLE_VMALLOC_SAVING
|
|
|
|
extern bool is_vmalloc_addr(const void *x);
|
|
|
|
#else
|
2016-05-20 09:11:29 +09:00
|
|
|
static inline bool is_vmalloc_addr(const void *x)
|
2008-02-05 15:28:34 +09:00
|
|
|
{
|
2008-03-12 16:51:31 +09:00
|
|
|
#ifdef CONFIG_MMU
|
2008-02-05 15:28:34 +09:00
|
|
|
unsigned long addr = (unsigned long)x;
|
|
|
|
|
|
|
|
return addr >= VMALLOC_START && addr < VMALLOC_END;
|
2008-03-12 16:51:31 +09:00
|
|
|
#else
|
2016-05-20 09:11:29 +09:00
|
|
|
return false;
|
2008-02-24 08:23:37 +09:00
|
|
|
#endif
|
2008-03-12 16:51:31 +09:00
|
|
|
}
|
2013-08-23 05:46:07 +09:00
|
|
|
#endif /* CONFIG_ENABLE_VMALLOC_SAVING */
|
2019-07-12 12:52:08 +09:00
|
|
|
|
|
|
|
#ifndef is_ioremap_addr
|
|
|
|
#define is_ioremap_addr(x) is_vmalloc_addr(x)
|
|
|
|
#endif
|
|
|
|
|
2009-09-23 08:45:49 +09:00
|
|
|
#ifdef CONFIG_MMU
|
|
|
|
extern int is_vmalloc_or_module_addr(const void *x);
|
|
|
|
#else
|
2009-09-24 20:33:32 +09:00
|
|
|
static inline int is_vmalloc_or_module_addr(const void *x)
|
2009-09-23 08:45:49 +09:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#endif
|
2008-02-05 15:28:34 +09:00
|
|
|
|
2017-05-09 07:57:09 +09:00
|
|
|
extern void *kvmalloc_node(size_t size, gfp_t flags, int node);
|
|
|
|
static inline void *kvmalloc(size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
return kvmalloc_node(size, flags, NUMA_NO_NODE);
|
|
|
|
}
|
|
|
|
static inline void *kvzalloc_node(size_t size, gfp_t flags, int node)
|
|
|
|
{
|
|
|
|
return kvmalloc_node(size, flags | __GFP_ZERO, node);
|
|
|
|
}
|
|
|
|
static inline void *kvzalloc(size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
return kvmalloc(size, flags | __GFP_ZERO);
|
|
|
|
}
|
|
|
|
|
2017-05-09 07:57:27 +09:00
|
|
|
static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
|
|
|
|
{
|
2018-05-09 04:55:26 +09:00
|
|
|
size_t bytes;
|
|
|
|
|
|
|
|
if (unlikely(check_mul_overflow(n, size, &bytes)))
|
2017-05-09 07:57:27 +09:00
|
|
|
return NULL;
|
|
|
|
|
2018-05-09 04:55:26 +09:00
|
|
|
return kvmalloc(bytes, flags);
|
2017-05-09 07:57:27 +09:00
|
|
|
}
|
|
|
|
|
2018-06-12 06:35:55 +09:00
|
|
|
static inline void *kvcalloc(size_t n, size_t size, gfp_t flags)
|
|
|
|
{
|
|
|
|
return kvmalloc_array(n, size, flags | __GFP_ZERO);
|
|
|
|
}
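A hedged usage sketch: kvcalloc() falls back to vmalloc for large requests, and kvfree() (declared just below) releases either kind of allocation. The helper name is hypothetical.

static int *example_alloc_table(size_t nr)
{
        int *table = kvcalloc(nr, sizeof(*table), GFP_KERNEL); /* zeroed; kmalloc or vmalloc */

        if (!table)
                return NULL;
        /* ... fill table[0..nr-1]; the caller releases it with kvfree(table) ... */
        return table;
}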
|
|
|
|
|
2014-05-07 03:02:53 +09:00
|
|
|
extern void kvfree(const void *addr);
|
2020-06-05 08:48:21 +09:00
|
|
|
extern void kvfree_sensitive(const void *addr, size_t len);
|
2014-05-07 03:02:53 +09:00
|
|
|
|
2020-05-28 14:20:47 +09:00
|
|
|
/*
|
|
|
|
* Mapcount of a compound page as a whole; does not include mapped sub-pages.
|
|
|
|
*
|
|
|
|
* Must be called only for compound pages or any of their tail sub-pages.
|
|
|
|
*/
|
2016-01-16 09:53:42 +09:00
|
|
|
static inline int compound_mapcount(struct page *page)
|
|
|
|
{
|
2016-05-21 08:58:24 +09:00
|
|
|
VM_BUG_ON_PAGE(!PageCompound(page), page);
|
2016-01-16 09:53:42 +09:00
|
|
|
page = compound_head(page);
|
|
|
|
return atomic_read(compound_mapcount_ptr(page)) + 1;
|
|
|
|
}
|
|
|
|
|
2011-11-03 05:36:59 +09:00
|
|
|
/*
|
|
|
|
* The atomic page->_mapcount, starts from -1: so that transitions
|
|
|
|
* both from it and to it can be tracked, using atomic_inc_and_test
|
|
|
|
* and atomic_add_negative(-1).
|
|
|
|
*/
|
2013-02-23 09:34:59 +09:00
|
|
|
static inline void page_mapcount_reset(struct page *page)
|
2011-11-03 05:36:59 +09:00
|
|
|
{
|
|
|
|
atomic_set(&(page)->_mapcount, -1);
|
|
|
|
}
|
|
|
|
|
2016-01-16 09:54:37 +09:00
|
|
|
int __page_mapcount(struct page *page);
|
|
|
|
|
2020-05-28 14:20:47 +09:00
|
|
|
/*
|
|
|
|
* Mapcount of a 0-order page; for a compound sub-page this includes
|
|
|
|
* compound_mapcount().
|
|
|
|
*
|
|
|
|
* Result is undefined for pages which cannot be mapped into userspace.
|
|
|
|
* For example SLAB or special types of pages. See function page_has_type().
|
|
|
|
* They use this place in struct page differently.
|
|
|
|
*/
|
2011-11-03 05:36:59 +09:00
|
|
|
static inline int page_mapcount(struct page *page)
|
|
|
|
{
|
2016-01-16 09:54:37 +09:00
|
|
|
if (unlikely(PageCompound(page)))
|
|
|
|
return __page_mapcount(page);
|
|
|
|
return atomic_read(&page->_mapcount) + 1;
|
|
|
|
}
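A hedged sketch of a common use of the helper above: treating a page as shared when it is mapped more than once. The function name is hypothetical.

static inline bool example_page_is_shared(struct page *page)
{
        return page_mapcount(page) > 1; /* mapped by more than one pte */
}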
|
|
|
|
|
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
|
|
int total_mapcount(struct page *page);
|
mm: thp: calculate the mapcount correctly for THP pages during WP faults
This will provide full accuracy to the mapcount calculation in
write protect faults, so page pinning will not get broken by
false-positive copy-on-writes.
total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the fully accurate return value for page_mapcount()
when dealing with Transparent Hugepages; however, we only use
page_trans_huge_mapcount() during COW faults, where it is strictly needed,
due to its higher runtime cost.
This also provides, at practically zero cost, the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.
Kirill A. Shutemov discovered the problem with moving the page
anon_vma to the local vma->anon_vma in a previous version of this
patch and another problem in the way page_move_anon_rmap() was called.
Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
previous version, because reuse_swap_page must be a macro to call
page_trans_huge_mapcount from swap.h, so this uses a macro again
instead of an inline function. With this change at least it's a less
dangerous usage than it was before, because "page" is used only once
now, while with the previous code reuse_swap_page(page++) would have
called page_mapcount on page+1 and it would have increased page twice
instead of just once.
Dean Luick noticed an uninitialized variable that could result in a
rmap inefficiency for the non-THP case in a previous version.
Mike Marciniszyn said:
: Our RDMA tests are seeing an issue with memory locking that bisects to
: commit 61f5d698cc97 ("mm: re-enable THP")
:
: The test program registers two rather large MRs (512M) and RDMA
: writes data to a passive peer using the first and RDMA reads it back
: into the second MR and compares that data. The sizes are chosen randomly
: between 0 and 1024 bytes.
:
: The test will get through a few (<= 4 iterations) and then gets a
: compare error.
:
: Tracing indicates the kernel logical addresses associated with the individual
: pages at registration ARE correct, the data in the "RDMA read response only"
: packets ARE correct.
:
: The "corruption" occurs when the packet crosses two pages that are not physically
: contiguous. The second page reads back as zero in the program.
:
: It looks like the user VA at the point of the compare error no longer points to
: the same physical address as was registered.
:
: This patch totally resolves the issue!
Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Josh Collier <josh.d.collier@intel.com>
Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
Cc: <stable@vger.kernel.org> [4.5]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-13 07:42:25 +09:00
|
|
|
int page_trans_huge_mapcount(struct page *page, int *total_mapcount);
|
2016-01-16 09:54:37 +09:00
|
|
|
#else
|
|
|
|
static inline int total_mapcount(struct page *page)
|
|
|
|
{
|
|
|
|
return page_mapcount(page);
|
2011-11-03 05:36:59 +09:00
|
|
|
}
|
|
|
|
static inline int page_trans_huge_mapcount(struct page *page,
|
|
|
|
int *total_mapcount)
|
|
|
|
{
|
|
|
|
int mapcount = page_mapcount(page);
|
|
|
|
if (total_mapcount)
|
|
|
|
*total_mapcount = mapcount;
|
|
|
|
return mapcount;
|
|
|
|
}
|
2016-01-16 09:54:37 +09:00
|
|
|
#endif
|
2011-11-03 05:36:59 +09:00
|
|
|
|
2007-05-07 06:49:41 +09:00
|
|
|
static inline struct page *virt_to_head_page(const void *x)
|
|
|
|
{
|
|
|
|
struct page *page = virt_to_page(x);
|
2015-02-11 07:09:35 +09:00
|
|
|
|
2015-11-07 09:29:54 +09:00
|
|
|
return compound_head(page);
|
2007-05-07 06:49:41 +09:00
|
|
|
}
|
|
|
|
|
2016-01-16 09:52:56 +09:00
|
|
|
void __put_page(struct page *page);
|
|
|
|
|
2006-08-14 15:24:27 +09:00
|
|
|
void put_pages_list(struct list_head *pages);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2006-03-22 17:08:05 +09:00
|
|
|
void split_page(struct page *page, unsigned int order);
|
|
|
|
|
2006-12-07 13:33:32 +09:00
|
|
|
/*
|
|
|
|
* Compound pages have a destructor function. Provide a
|
|
|
|
* prototype for that function and accessor functions.
|
2015-11-07 09:29:50 +09:00
|
|
|
* These are _only_ valid on the head of a compound page.
|
2006-12-07 13:33:32 +09:00
|
|
|
*/
|
2015-11-07 09:29:50 +09:00
|
|
|
typedef void compound_page_dtor(struct page *);
|
|
|
|
|
|
|
|
/* Keep the enum in sync with compound_page_dtors array in mm/page_alloc.c */
|
|
|
|
enum compound_dtor_id {
|
|
|
|
NULL_COMPOUND_DTOR,
|
|
|
|
COMPOUND_PAGE_DTOR,
|
|
|
|
#ifdef CONFIG_HUGETLB_PAGE
|
|
|
|
HUGETLB_PAGE_DTOR,
|
2016-01-16 09:54:17 +09:00
|
|
|
#endif
|
2020-09-23 02:06:43 +09:00
|
|
|
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_GKI_OPT_FEATURES)
|
2016-01-16 09:54:17 +09:00
|
|
|
TRANSHUGE_PAGE_DTOR,
|
2015-11-07 09:29:50 +09:00
|
|
|
#endif
|
|
|
|
NR_COMPOUND_DTORS,
|
|
|
|
};
|
|
|
|
extern compound_page_dtor * const compound_page_dtors[];
|
2006-12-07 13:33:32 +09:00
|
|
|
|
|
|
|
static inline void set_compound_page_dtor(struct page *page,
|
2015-11-07 09:29:50 +09:00
|
|
|
enum compound_dtor_id compound_dtor)
|
2006-12-07 13:33:32 +09:00
|
|
|
{
|
2015-11-07 09:29:50 +09:00
|
|
|
VM_BUG_ON_PAGE(compound_dtor >= NR_COMPOUND_DTORS, page);
|
|
|
|
page[1].compound_dtor = compound_dtor;
|
2006-12-07 13:33:32 +09:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
|
|
|
|
{
|
2015-11-07 09:29:50 +09:00
|
|
|
VM_BUG_ON_PAGE(page[1].compound_dtor >= NR_COMPOUND_DTORS, page);
|
|
|
|
return compound_page_dtors[page[1].compound_dtor];
|
2006-12-07 13:33:32 +09:00
|
|
|
}
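The destructor plumbing above is normally driven by the page allocator. As a hedged sketch (simplified; the example_* functions are hypothetical stand-ins, roughly mirroring what mm/page_alloc.c does), a compound page's destructor is installed when the page is prepared and looked up again when it is freed:

/* Hedged sketch, not the exact mm/page_alloc.c code. */
static void example_prep_compound(struct page *head, unsigned int order)
{
	set_compound_order(head, order);
	set_compound_page_dtor(head, COMPOUND_PAGE_DTOR);
}

static void example_destroy_compound(struct page *head)
{
	compound_page_dtor *dtor = get_compound_page_dtor(head);

	(*dtor)(head);	/* COMPOUND_PAGE_DTOR resolves to free_compound_page() */
}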
|
|
|
|
|
2015-11-07 09:29:57 +09:00
|
|
|
static inline unsigned int compound_order(struct page *page)
|
2007-05-07 06:49:39 +09:00
|
|
|
{
|
2007-05-07 06:49:40 +09:00
|
|
|
if (!PageHead(page))
|
2007-05-07 06:49:39 +09:00
|
|
|
return 0;
|
2015-02-12 08:24:46 +09:00
|
|
|
return page[1].compound_order;
|
2007-05-07 06:49:39 +09:00
|
|
|
}
|
|
|
|
|
2015-11-07 09:29:50 +09:00
|
|
|
static inline void set_compound_order(struct page *page, unsigned int order)
|
2007-05-07 06:49:39 +09:00
|
|
|
{
|
2015-02-12 08:24:46 +09:00
|
|
|
page[1].compound_order = order;
|
2007-05-07 06:49:39 +09:00
|
|
|
}
|
|
|
|
|
2019-09-24 07:34:30 +09:00
|
|
|
/* Returns the number of pages in this potentially compound page. */
|
|
|
|
static inline unsigned long compound_nr(struct page *page)
|
|
|
|
{
|
|
|
|
return 1UL << compound_order(page);
|
|
|
|
}
|
|
|
|
|
2019-09-24 07:34:25 +09:00
|
|
|
/* Returns the number of bytes in this potentially compound page. */
|
|
|
|
static inline unsigned long page_size(struct page *page)
|
|
|
|
{
|
|
|
|
return PAGE_SIZE << compound_order(page);
|
|
|
|
}
|
|
|
|
|
2019-09-24 07:34:28 +09:00
|
|
|
/* Returns the number of bits needed for the number of bytes in a page */
|
|
|
|
static inline unsigned int page_shift(struct page *page)
|
|
|
|
{
|
|
|
|
return PAGE_SHIFT + compound_order(page);
|
|
|
|
}
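The three helpers above are different views of the same quantity, compound_order(); by construction the following identities hold for any page:

/*
 *	compound_nr(page) == 1UL << compound_order(page)
 *	page_size(page)   == compound_nr(page) * PAGE_SIZE
 *	page_size(page)   == 1UL << page_shift(page)
 */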
|
|
|
|
|
2016-01-16 09:54:17 +09:00
|
|
|
void free_compound_page(struct page *page);
|
|
|
|
|
2011-01-21 16:49:56 +09:00
|
|
|
#ifdef CONFIG_MMU
|
2011-01-14 08:46:37 +09:00
|
|
|
/*
|
|
|
|
* Do pte_mkwrite, but only if the vma says VM_WRITE. We do this when
|
|
|
|
* servicing faults for write access. In the normal case, we always want
|
|
|
|
* pte_mkwrite. But get_user_pages can cause write faults for mappings
|
|
|
|
* that do not have writing enabled, when used by access_process_vm.
|
|
|
|
*/
|
2018-04-17 23:33:18 +09:00
|
|
|
static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
|
2011-01-14 08:46:37 +09:00
|
|
|
{
|
2018-04-17 23:33:18 +09:00
|
|
|
if (likely(vma_flags & VM_WRITE))
|
2011-01-14 08:46:37 +09:00
|
|
|
pte = pte_mkwrite(pte);
|
|
|
|
return pte;
|
|
|
|
}
|
mm: introduce vm_ops->map_pages()
Here's a new version of the faultaround patchset. It took a while to tune it
and collect performance data.
The first patch adds a new callback, ->map_pages, to vm_operations_struct.
->map_pages() is called when the VM asks to map easily accessible pages.
The filesystem should find and map pages associated with offsets from
"pgoff" till "max_pgoff". ->map_pages() is called with the page table
locked and must not block. If it's not possible to reach a page without
blocking, the filesystem should skip it. The filesystem should use
do_set_pte() to set up the page table entry. The pointer to the entry
associated with offset "pgoff" is passed in the "pte" field of the vm_fault
structure. Pointers to entries for other offsets should be calculated
relative to "pte". (A hedged sketch of this contract follows this commit
message.)
Currently the VM uses ->map_pages only on the read page fault path. We try
to map FAULT_AROUND_PAGES at a time. FAULT_AROUND_PAGES is 16 for now.
Performance data for different FAULT_AROUND_ORDER is below.
TODO:
- implement ->map_pages() for shmem/tmpfs;
- modify get_user_pages() to be able to use ->map_pages() and implement
mmap(MAP_POPULATE|MAP_NONBLOCK) on top.
=========================================================================
Tested on 4-socket machine (120 threads) with 128GiB of RAM.
A few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
somewhere between 3 and 5. Let's say 4 :)
Linux build (make -j60)
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
minor-faults 283,301,572 247,151,987 212,215,789 204,772,882 199,568,944 194,703,779 193,381,485
time, seconds 151.227629483 153.920996480 151.356125472 150.863792049 150.879207877 151.150764954 151.450962358
Linux rebuild (make -j60)
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
minor-faults 5,396,854 4,148,444 2,855,286 2,577,282 2,361,957 2,169,573 2,112,643
time, seconds 27.404543757 27.559725591 27.030057426 26.855045126 26.678618635 26.974523490 26.761320095
Git test suite (make -j60 test)
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
minor-faults 129,591,823 99,200,751 66,106,718 57,606,410 51,510,808 45,776,813 44,085,515
time, seconds 66.087215026 64.784546905 64.401156567 65.282708668 66.034016829 66.793780811 67.237810413
Two synthetic tests: access every word in file in sequential/random order.
It doesn't improve much after FAULT_AROUND_ORDER == 4.
Sequential access 16GiB file
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
1 thread
minor-faults 4,195,437 2,098,275 525,068 262,251 131,170 32,856 8,282
time, seconds 7.250461742 6.461711074 5.493859139 5.488488147 5.707213983 5.898510832 5.109232856
8 threads
minor-faults 33,557,540 16,892,728 4,515,848 2,366,999 1,423,382 442,732 142,339
time, seconds 16.649304881 9.312555263 6.612490639 6.394316732 6.669827501 6.75078944 6.371900528
32 threads
minor-faults 134,228,222 67,526,810 17,725,386 9,716,537 4,763,731 1,668,921 537,200
time, seconds 49.164430543 29.712060103 12.938649729 10.175151004 11.840094583 9.594081325 9.928461797
60 threads
minor-faults 251,687,988 126,146,952 32,919,406 18,208,804 10,458,947 2,733,907 928,217
time, seconds 86.260656897 49.626551828 22.335007632 17.608243696 16.523119035 16.339489186 16.326390902
120 threads
minor-faults 503,352,863 252,939,677 67,039,168 35,191,827 19,170,091 4,688,357 1,471,862
time, seconds 124.589206333 79.757867787 39.508707872 32.167281632 29.972989292 28.729834575 28.042251622
Random access 1GiB file
1 thread
minor-faults 262,636 132,743 34,369 17,299 8,527 3,451 1,222
time, seconds 15.351890914 16.613802482 16.569227308 15.179220992 16.557356122 16.578247824 15.365266994
8 threads
minor-faults 2,098,948 1,061,871 273,690 154,501 87,110 25,663 7,384
time, seconds 15.040026343 15.096933500 14.474757288 14.289129964 14.411537468 14.296316837 14.395635804
32 threads
minor-faults 8,390,734 4,231,023 1,054,432 528,847 269,242 97,746 26,881
time, seconds 20.430433109 21.585235358 22.115062928 14.872878951 14.880856305 14.883370649 14.821261690
60 threads
minor-faults 15,733,258 7,892,809 1,973,393 988,266 594,789 164,994 51,691
time, seconds 26.577302548 25.692397770 18.728863715 20.153026398 21.619101933 17.745086260 17.613215273
120 threads
minor-faults 31,471,111 15,816,616 3,959,209 1,978,685 1,008,299 264,635 96,010
time, seconds 41.835322703 40.459786095 36.085306105 35.313894834 35.814445675 36.552633793 34.289210594
Touch only one page in page table in 16GiB file
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
1 thread
minor-faults 8,372 8,324 8,270 8,260 8,249 8,239 8,237
time, seconds 0.039892712 0.045369149 0.051846126 0.063681685 0.079095975 0.17652406 0.541213386
8 threads
minor-faults 65,731 65,681 65,628 65,620 65,608 65,599 65,596
time, seconds 0.124159196 0.488600638 0.156854426 0.191901957 0.242631486 0.543569456 1.677303984
32 threads
minor-faults 262,388 262,341 262,285 262,276 262,266 262,257 263,183
time, seconds 0.452421421 0.488600638 0.565020946 0.648229739 0.789850823 1.651584361 5.000361559
60 threads
minor-faults 491,822 491,792 491,723 491,711 491,701 491,691 491,825
time, seconds 0.763288616 0.869620515 0.980727360 1.161732354 1.466915814 3.04041448 9.308612938
120 threads
minor-faults 983,466 983,655 983,366 983,372 983,363 984,083 984,164
time, seconds 1.595846553 1.667902182 2.008959376 2.425380942 2.941368804 5.977807890 18.401846125
This patch (of 2):
Introduce a new vm_ops callback, ->map_pages(), and use it for mapping
easily accessible pages around the fault address.
On a read page fault, if the filesystem provides ->map_pages(), we try to
map up to FAULT_AROUND_PAGES pages around the page fault address, in the
hope of reducing the number of minor page faults.
We call ->map_pages first and use ->fault() as a fallback if the page at
the offset is not ready to be mapped (cold page cache or something).
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 07:37:18 +09:00
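As a rough illustration of the ->map_pages() contract described in the commit message above, a filesystem implementation might take the shape below. This is a hedged sketch against the original interface (vmf->pte points at the entry for vmf->pgoff; entries for other offsets are relative to it); example_find_page_nonblocking() and example_set_pte() are hypothetical stand-ins, and real code would use a non-blocking page-cache lookup and do_set_pte() as the commit message says:

static void example_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	pgoff_t pgoff = vmf->pgoff;
	pte_t *pte = vmf->pte;		/* entry for vmf->pgoff */

	for (; pgoff <= vmf->max_pgoff; pgoff++, pte++) {
		struct page *page;

		if (!pte_none(*pte))	/* already populated, nothing to do */
			continue;

		/* Must not block: skip pages that are not immediately available. */
		page = example_find_page_nonblocking(vma->vm_file->f_mapping, pgoff);
		if (!page)
			continue;

		/* The page table lock is already held here; real code would
		 * install the entry with do_set_pte(). */
		example_set_pte(vma, page, pte);
	}
}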
|
|
|
|
2018-08-24 09:01:36 +09:00
|
|
|
vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
|
2016-07-27 07:25:23 +09:00
|
|
|
struct page *page);
|
2018-08-24 09:01:36 +09:00
|
|
|
vm_fault_t finish_fault(struct vm_fault *vmf);
|
|
|
|
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
|
2011-01-21 16:49:56 +09:00
|
|
|
#endif
|
2011-01-14 08:46:37 +09:00
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
/*
|
|
|
|
* Multiple processes may "see" the same page. E.g. for untouched
|
|
|
|
* mappings of /dev/null, all processes see the same page full of
|
|
|
|
* zeroes, and text pages of executables and shared libraries have
|
|
|
|
* only one copy in memory, at most, normally.
|
|
|
|
*
|
|
|
|
* For the non-reserved pages, page_count(page) denotes a reference count.
|
2005-09-22 01:55:38 +09:00
|
|
|
* page_count() == 0 means the page is free. page->lru is then used for
|
|
|
|
* freelist management in the buddy allocator.
|
2006-09-26 15:31:35 +09:00
|
|
|
* page_count() > 0 means the page has been allocated.
|
2005-04-17 07:20:36 +09:00
|
|
|
*
|
2006-09-26 15:31:35 +09:00
|
|
|
* Pages are allocated by the slab allocator in order to provide memory
|
|
|
|
* to kmalloc and kmem_cache_alloc. In this case, the management of the
|
|
|
|
* page, and the fields in 'struct page' are the responsibility of mm/slab.c
|
|
|
|
* unless a particular usage is carefully commented. (the responsibility of
|
|
|
|
* freeing the kmalloc memory is the caller's, of course).
|
2005-04-17 07:20:36 +09:00
|
|
|
*
|
2006-09-26 15:31:35 +09:00
|
|
|
* A page may be used by anyone else who does a __get_free_page().
|
|
|
|
* In this case, page_count still tracks the references, and should only
|
|
|
|
* be used through the normal accessor functions. The top bits of page->flags
|
|
|
|
* and page->virtual store page management information, but all other fields
|
|
|
|
* are unused and could be used privately, carefully. The management of this
|
|
|
|
* page is the responsibility of the one who allocated it, and those who have
|
|
|
|
* subsequently been given references to it.
|
|
|
|
*
|
|
|
|
* The other pages (we may call them "pagecache pages") are completely
|
2005-04-17 07:20:36 +09:00
|
|
|
* managed by the Linux memory manager: I/O, buffers, swapping etc.
|
|
|
|
* The following discussion applies only to them.
|
|
|
|
*
|
2006-09-26 15:31:35 +09:00
|
|
|
* A pagecache page contains an opaque `private' member, which belongs to the
|
|
|
|
* page's address_space. Usually, this is the address of a circular list of
|
|
|
|
* the page's disk buffers. PG_private must be set to tell the VM to call
|
|
|
|
* into the filesystem to release these pages.
|
2005-04-17 07:20:36 +09:00
|
|
|
*
|
2006-09-26 15:31:35 +09:00
|
|
|
* A page may belong to an inode's memory mapping. In this case, page->mapping
|
|
|
|
* is the pointer to the inode, and page->index is the file offset of the page,
|
2016-04-01 21:29:48 +09:00
|
|
|
* in units of PAGE_SIZE.
|
2005-04-17 07:20:36 +09:00
|
|
|
*
|
2006-09-26 15:31:35 +09:00
|
|
|
* If pagecache pages are not associated with an inode, they are said to be
|
|
|
|
* anonymous pages. These may become associated with the swapcache, and in that
|
|
|
|
* case PG_swapcache is set, and page->private is an offset into the swapcache.
|
2005-04-17 07:20:36 +09:00
|
|
|
*
|
2006-09-26 15:31:35 +09:00
|
|
|
* In either case (swapcache or inode backed), the pagecache itself holds one
|
|
|
|
* reference to the page. Setting PG_private should also increment the
|
|
|
|
* refcount. Each user mapping also has a reference to the page.
|
2005-04-17 07:20:36 +09:00
|
|
|
*
|
2006-09-26 15:31:35 +09:00
|
|
|
* The pagecache pages are stored in a per-mapping radix tree, which is
|
2018-04-11 08:36:56 +09:00
|
|
|
* rooted at mapping->i_pages, and indexed by offset.
|
2006-09-26 15:31:35 +09:00
|
|
|
* Where 2.4 and early 2.6 kernels kept dirty/clean pages in per-address_space
|
|
|
|
* lists, we instead now tag pages as dirty/writeback in the radix tree.
|
2005-04-17 07:20:36 +09:00
|
|
|
*
|
2006-09-26 15:31:35 +09:00
|
|
|
* All pagecache pages may be subject to I/O:
|
2005-04-17 07:20:36 +09:00
|
|
|
* - inode pages may need to be read from disk,
|
|
|
|
* - inode pages which have been modified and are MAP_SHARED may need
|
2006-09-26 15:31:35 +09:00
|
|
|
* to be written back to the inode on disk,
|
|
|
|
* - anonymous pages (including MAP_PRIVATE file mappings) which have been
|
|
|
|
* modified may need to be swapped out to swap space and (later) to be read
|
|
|
|
* back into memory.
|
2005-04-17 07:20:36 +09:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The zone field is never updated after free_area_init_core()
|
|
|
|
* sets it, so none of the operations on it need to be atomic.
|
|
|
|
*/
|
2005-06-23 16:07:40 +09:00
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | ... | FLAGS | */
|
2005-11-06 01:25:53 +09:00
|
|
|
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
|
|
|
|
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
|
2013-10-07 19:29:20 +09:00
|
|
|
#define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH)
|
2018-12-28 17:30:57 +09:00
|
|
|
#define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
|
2005-06-23 16:07:40 +09:00
|
|
|
/*
|
2011-03-31 10:57:33 +09:00
|
|
|
* Define the bit shifts to access each section. For non-existent
|
2005-06-23 16:07:40 +09:00
|
|
|
* sections we define the shift as 0; that plus a 0 mask ensures
|
|
|
|
* the compiler will optimise away reference to them.
|
|
|
|
*/
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
|
|
|
|
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
|
|
|
|
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
|
2013-10-07 19:29:20 +09:00
|
|
|
#define LAST_CPUPID_PGSHIFT (LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
|
2018-12-28 17:30:57 +09:00
|
|
|
#define KASAN_TAG_PGSHIFT (KASAN_TAG_PGOFF * (KASAN_TAG_WIDTH != 0))
|
2005-06-23 16:07:40 +09:00
|
|
|
|
2010-10-27 06:21:37 +09:00
|
|
|
/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
|
|
|
|
#ifdef NODE_NOT_IN_PAGE_FLAGS
|
2006-12-07 13:31:45 +09:00
|
|
|
#define ZONEID_SHIFT (SECTIONS_SHIFT + ZONES_SHIFT)
|
2007-02-10 18:43:14 +09:00
|
|
|
#define ZONEID_PGOFF ((SECTIONS_PGOFF < ZONES_PGOFF)? \
|
|
|
|
SECTIONS_PGOFF : ZONES_PGOFF)
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
#else
|
2006-12-07 13:31:45 +09:00
|
|
|
#define ZONEID_SHIFT (NODES_SHIFT + ZONES_SHIFT)
|
2007-02-10 18:43:14 +09:00
|
|
|
#define ZONEID_PGOFF ((NODES_PGOFF < ZONES_PGOFF)? \
|
|
|
|
NODES_PGOFF : ZONES_PGOFF)
|
2006-12-07 13:31:45 +09:00
|
|
|
#endif
|
|
|
|
|
2007-02-10 18:43:14 +09:00
|
|
|
#define ZONEID_PGSHIFT (ZONEID_PGOFF * (ZONEID_SHIFT != 0))
|
2005-06-23 16:07:40 +09:00
|
|
|
|
2008-04-28 18:12:48 +09:00
|
|
|
#if SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
|
|
|
|
#error SECTIONS_WIDTH+NODES_WIDTH+ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
|
2005-06-23 16:07:40 +09:00
|
|
|
#endif
|
|
|
|
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
|
|
|
|
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
|
|
|
|
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
|
numa: use LAST_CPUPID_SHIFT to calculate LAST_CPUPID_MASK
LAST_CPUPID_MASK is calculated using LAST_CPUPID_WIDTH. However,
LAST_CPUPID_WIDTH itself can be 0 (when LAST_CPUPID_NOT_IN_PAGE_FLAGS is
set). In such a case LAST_CPUPID_MASK turns out to be 0.
With the recent commit 1ae71d0319 ("mm: numa: bugfix for
LAST_CPUPID_NOT_IN_PAGE_FLAGS"), if LAST_CPUPID_MASK is 0,
page_cpupid_xchg_last() and page_cpupid_reset_last() cause
page->_last_cpupid to be set to 0.
This causes a performance regression. It's almost as if numa_balancing is
off.
Fix LAST_CPUPID_MASK by using LAST_CPUPID_SHIFT instead of
LAST_CPUPID_WIDTH.
Some performance numbers and perf stats with and without the fix.
(3.14-rc6)
----------
numa01
Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
12,27,462 cs [100.00%]
2,41,957 migrations [100.00%]
1,68,01,713 faults [100.00%]
7,99,35,29,041 cache-misses
98,808 migrate:mm_migrate_pages [100.00%]
1407.690148814 seconds time elapsed
numa02
Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
63,065 cs [100.00%]
14,364 migrations [100.00%]
2,08,118 faults [100.00%]
25,32,59,404 cache-misses
12 migrate:mm_migrate_pages [100.00%]
63.840827219 seconds time elapsed
(3.14-rc6 with fix)
-------------------
numa01
Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa01':
9,68,911 cs [100.00%]
1,01,414 migrations [100.00%]
88,38,697 faults [100.00%]
4,42,92,51,042 cache-misses
4,25,060 migrate:mm_migrate_pages [100.00%]
685.965331189 seconds time elapsed
numa02
Performance counter stats for '/usr/bin/time -f %e %S %U %c %w -o start_bench.out -a ./numa02':
17,543 cs [100.00%]
2,962 migrations [100.00%]
1,17,843 faults [100.00%]
11,80,61,644 cache-misses
12,358 migrate:mm_migrate_pages [100.00%]
20.380132343 seconds time elapsed
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-08 07:37:57 +09:00
|
|
|
#define LAST_CPUPID_MASK ((1UL << LAST_CPUPID_SHIFT) - 1)
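To make the fix above concrete, a short hedged note on the arithmetic:

/*
 * With LAST_CPUPID_NOT_IN_PAGE_FLAGS, LAST_CPUPID_WIDTH is 0, so the old
 * definition ((1UL << LAST_CPUPID_WIDTH) - 1) evaluated to 0 and every
 * page_cpupid_xchg_last()/page_cpupid_reset_last() wiped the stored value.
 * LAST_CPUPID_SHIFT still covers the full cpupid field in that
 * configuration, so ((1UL << LAST_CPUPID_SHIFT) - 1) is a proper
 * non-zero mask.
 */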
|
2018-12-28 17:30:57 +09:00
|
|
|
#define KASAN_TAG_MASK ((1UL << KASAN_TAG_WIDTH) - 1)
|
2006-12-07 13:31:45 +09:00
|
|
|
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)
|
2005-06-23 16:07:40 +09:00
|
|
|
|
2011-07-26 09:11:51 +09:00
|
|
|
static inline enum zone_type page_zonenum(const struct page *page)
|
2005-04-17 07:20:36 +09:00
|
|
|
{
|
2005-06-23 16:07:40 +09:00
|
|
|
return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
|
2005-04-17 07:20:36 +09:00
|
|
|
}
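The other fields packed into page->flags are recovered with the same shift-and-mask pattern. For instance, a hedged sketch of a section lookup (the real helper lives elsewhere in this header, under the sparsemem configuration):

static inline unsigned long example_page_to_section(const struct page *page)
{
	return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
}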
|
|
|
|
|
2016-01-16 09:56:17 +09:00
|
|
|
#ifdef CONFIG_ZONE_DEVICE
|
|
|
|
static inline bool is_zone_device_page(const struct page *page)
|
|
|
|
{
|
|
|
|
return page_zonenum(page) == ZONE_DEVICE;
|
|
|
|
}
|
2018-10-27 07:07:52 +09:00
|
|
|
extern void memmap_init_zone_device(struct zone *, unsigned long,
|
|
|
|
unsigned long, struct dev_pagemap *);
|
2016-01-16 09:56:17 +09:00
|
|
|
#else
|
|
|
|
static inline bool is_zone_device_page(const struct page *page)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2017-09-09 08:11:46 +09:00
|
|
|
#endif
|
2017-09-09 08:11:43 +09:00
|
|
|
|
2018-05-17 03:46:08 +09:00
|
|
|
#ifdef CONFIG_DEV_PAGEMAP_OPS
|
|
|
|
void __put_devmap_managed_page(struct page *page);
|
|
|
|
DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
|
|
|
|
static inline bool put_devmap_managed_page(struct page *page)
|
|
|
|
{
|
|
|
|
if (!static_branch_unlikely(&devmap_managed_key))
|
|
|
|
return false;
|
|
|
|
if (!is_zone_device_page(page))
|
|
|
|
return false;
|
|
|
|
switch (page->pgmap->type) {
|
|
|
|
case MEMORY_DEVICE_PRIVATE:
|
|
|
|
case MEMORY_DEVICE_FS_DAX:
|
|
|
|
__put_devmap_managed_page(page);
|
|
|
|
return true;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
#else /* CONFIG_DEV_PAGEMAP_OPS */
|
|
|
|
static inline bool put_devmap_managed_page(struct page *page)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2019-07-17 08:30:44 +09:00
|
|
|
#endif /* CONFIG_DEV_PAGEMAP_OPS */
|
2018-05-17 03:46:08 +09:00
|
|
|
|
2017-09-09 08:12:32 +09:00
|
|
|
static inline bool is_device_private_page(const struct page *page)
|
|
|
|
{
|
2019-07-17 08:30:44 +09:00
|
|
|
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
|
|
|
|
IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
|
|
|
|
is_zone_device_page(page) &&
|
|
|
|
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
|
2017-09-09 08:12:32 +09:00
|
|
|
}
|
PCI/P2PDMA: Support peer-to-peer memory
Some PCI devices may have memory mapped in a BAR space that's intended for
use in peer-to-peer transactions. To enable such transactions the memory
must be registered with ZONE_DEVICE pages so it can be used by DMA
interfaces in existing drivers.
Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:
struct pci_dev *pci_p2pmem_find[_many]();
int pci_p2pdma_distance[_many]();
void *pci_alloc_p2pmem();
The new interface requires a driver to collect a list of client devices
involved in the transaction then call pci_p2pmem_find() to obtain any
suitable P2P memory. Alternatively, if the caller knows a device which
provides P2P memory, they can use pci_p2pdma_distance() to determine if it
is usable. With a suitable p2pmem device, memory can then be allocated
with pci_alloc_p2pmem() for use in DMA transactions.
Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but can significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCIe switch by a small number of lanes
which would maximize the number of lanes available to connect to NVMe
devices.
The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI bridge. This is because we
have no way of knowing whether peer-to-peer routing between PCIe Root Ports
is supported (PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P
transfers that go through the RC are limited to only reducing DRAM usage
and, in some cases, coding convenience. The PCI-SIG may be exploring
adding a new capability bit to advertise whether this is possible for
future hardware.
This commit includes significant rework and feedback from Christoph
Hellwig.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
[bhelgaas: fold in fix from Keith Busch <keith.busch@intel.com>:
https://lore.kernel.org/linux-pci/20181012155920.15418-1-keith.busch@intel.com,
to address comment from Dan Carpenter <dan.carpenter@oracle.com>, fold in
https://lore.kernel.org/linux-pci/20181017160510.17926-1-logang@deltatee.com]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-10-05 06:27:35 +09:00
|
|
|
|
|
|
|
static inline bool is_pci_p2pdma_page(const struct page *page)
|
|
|
|
{
|
2019-07-17 08:30:44 +09:00
|
|
|
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
|
|
|
|
IS_ENABLED(CONFIG_PCI_P2PDMA) &&
|
|
|
|
is_zone_device_page(page) &&
|
|
|
|
page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
|
PCI/P2PDMA: Support peer-to-peer memory
Some PCI devices may have memory mapped in a BAR space that's intended for
use in peer-to-peer transactions. To enable such transactions the memory
must be registered with ZONE_DEVICE pages so it can be used by DMA
interfaces in existing drivers.
Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:
struct pci_dev *pci_p2pmem_find[_many]();
int pci_p2pdma_distance[_many]();
void *pci_alloc_p2pmem();
The new interface requires a driver to collect a list of client devices
involved in the transaction then call pci_p2pmem_find() to obtain any
suitable P2P memory. Alternatively, if the caller knows a device which
provides P2P memory, they can use pci_p2pdma_distance() to determine if it
is usable. With a suitable p2pmem device, memory can then be allocated
with pci_alloc_p2pmem() for use in DMA transactions.
Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but can significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCIe switch by a small number of lanes
which would maximize the number of lanes available to connect to NVMe
devices.
The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI bridge. This is because we
have no way of knowing whether peer-to-peer routing between PCIe Root Ports
is supported (PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P
transfers that go through the RC are limited to only reducing DRAM usage
and, in some cases, coding convenience. The PCI-SIG may be exploring
adding a new capability bit to advertise whether this is possible for
future hardware.
This commit includes significant rework and feedback from Christoph
Hellwig.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
[bhelgaas: fold in fix from Keith Busch <keith.busch@intel.com>:
https://lore.kernel.org/linux-pci/20181012155920.15418-1-keith.busch@intel.com,
to address comment from Dan Carpenter <dan.carpenter@oracle.com>, fold in
https://lore.kernel.org/linux-pci/20181017160510.17926-1-logang@deltatee.com]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-10-05 06:27:35 +09:00
|
|
|
}
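For context on the PCI/P2PDMA commit message quoted above, a driver-side usage sketch might look roughly like the following. The call sequence (find a provider acceptable to all client devices, then allocate P2P memory from it) comes from the commit message; the argument lists are simplified assumptions rather than the exact exported signatures:

static int example_setup_p2p(struct device **clients, int num_clients, void **buf)
{
	struct pci_dev *provider;

	/* Find P2P memory usable by every client device (assumed signature). */
	provider = pci_p2pmem_find_many(clients, num_clients);
	if (!provider)
		return -ENODEV;		/* fall back to regular system memory */

	*buf = pci_alloc_p2pmem(provider, SZ_64K);	/* size is illustrative */
	if (!*buf)
		return -ENOMEM;

	return 0;
}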
|
2017-09-09 08:11:46 +09:00
|
|
|
|
2019-04-12 02:06:20 +09:00
|
|
|
/* 127: arbitrary random number, small enough to assemble well */
|
|
|
|
#define page_ref_zero_or_close_to_overflow(page) \
|
|
|
|
((unsigned int) page_ref_count(page) + 127u <= 127u)
|
|
|
|
|
2016-01-16 09:56:55 +09:00
|
|
|
static inline void get_page(struct page *page)
|
|
|
|
{
|
|
|
|
page = compound_head(page);
|
|
|
|
/*
|
|
|
|
* Getting a normal page or the head of a compound page
|
2016-05-20 09:10:49 +09:00
|
|
|
* requires already having an elevated page->_refcount.
|
2016-01-16 09:56:55 +09:00
|
|
|
*/
|
2019-04-12 02:06:20 +09:00
|
|
|
VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
|
2016-03-18 06:19:26 +09:00
|
|
|
page_ref_inc(page);
|
2016-01-16 09:56:55 +09:00
|
|
|
}
|
|
|
|
|
2019-04-12 02:14:59 +09:00
|
|
|
static inline __must_check bool try_get_page(struct page *page)
|
|
|
|
{
|
|
|
|
page = compound_head(page);
|
|
|
|
if (WARN_ON_ONCE(page_ref_count(page) <= 0))
|
|
|
|
return false;
|
2016-03-18 06:19:26 +09:00
|
|
|
page_ref_inc(page);
|
2019-04-12 02:14:59 +09:00
|
|
|
return true;
|
2016-01-16 09:56:55 +09:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void put_page(struct page *page)
|
|
|
|
{
|
|
|
|
page = compound_head(page);
|
|
|
|
|
2017-09-09 08:11:46 +09:00
|
|
|
/*
|
2018-05-17 03:46:08 +09:00
|
|
|
* For devmap managed pages we need to catch refcount transition from
|
|
|
|
* 2 to 1: when the refcount reaches one it means the page is free and we
|
|
|
|
* need to inform the device driver through callback. See
|
2017-09-09 08:11:46 +09:00
|
|
|
* include/linux/memremap.h and HMM for details.
|
|
|
|
*/
|
2018-05-17 03:46:08 +09:00
|
|
|
if (put_devmap_managed_page(page))
|
2017-09-09 08:11:46 +09:00
|
|
|
return;
|
|
|
|
|
2016-01-16 09:56:55 +09:00
|
|
|
if (put_page_testzero(page))
|
|
|
|
__put_page(page);
|
|
|
|
}
|
|
|
|
|
mm: introduce put_user_page*(), placeholder versions
A discussion of the overall problem is below.
As mentioned in patch 0001, the steps to fix the problem are:
1) Provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.
Overview
========
Some kernel components (file systems, device drivers) need to access
memory that is specified via process virtual address. For a long time,
the API to achieve that was get_user_pages ("GUP") and its variations.
However, GUP has critical limitations that have been overlooked; in
particular, GUP does not interact correctly with filesystems in all
situations. That means that file-backed memory + GUP is a recipe for
potential problems, some of which have already occurred in the field.
GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
code to get the struct page behind a virtual address and to let storage
hardware perform a direct copy to or from that page. This is a
short-lived access pattern, and as such, the window for a concurrent
writeback of a GUP'd page was small enough that there were not (we think)
any reported problems. Also, userspace was expected to understand and
accept that Direct IO was not synchronized with memory-mapped access to
that data, nor with any process address space changes such as munmap(),
mremap(), etc.
Over the years, more GUP uses have appeared (virtualization, device
drivers, RDMA) that can keep the pages they get via GUP for a long period
of time (seconds, minutes, hours, days, ...). This long-term pinning
makes an underlying design problem more obvious.
In fact, there are a number of key problems inherent to GUP:
Interactions with file systems
==============================
File systems expect to be able to write back data, both to reclaim pages,
and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
write access to the file memory pages means that such hardware can dirty
the pages, without the filesystem being aware. This can, in some cases
(depending on filesystem, filesystem options, block device, block device
options, and other variables), lead to data corruption, and also to kernel
bugs of the form:
kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
backtrace:
ext4_writepage
__writepage
write_cache_pages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
...which is due to the file system asserting that there are still buffer
heads attached:
({ \
BUG_ON(!PagePrivate(page)); \
((struct buffer_head *)page_private(page)); \
})
Dave Chinner's description of this is very clear:
"The fundamental issue is that ->page_mkwrite must be called on every
write access to a clean file backed page, not just the first one.
How long the GUP reference lasts is irrelevant, if the page is clean
and you need to dirty it, you must call ->page_mkwrite before it is
marked writeable and dirtied. Every. Time."
This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.
Long term GUP
=============
Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
writeable mapping is created), and the pages are file-backed. That can
lead to filesystem corruption. What happens is that when a file-backed
page is being written back, it is first mapped read-only in all of the CPU
page tables; the file system then assumes that nobody can write to the
page, and that the page content is therefore stable. Unfortunately, the
GUP callers generally do not monitor changes to the CPU page tables; they
instead assume that the following pattern is safe (it's not):
get_user_pages()
Hardware can keep a reference to those pages for a very long time,
and write to it at any time. Because "hardware" here means "devices
that are not a CPU", this activity occurs without any interaction with
the kernel's file system code.
for each page
set_page_dirty
put_page()
In fact, the GUP documentation even recommends that pattern.
Anyway, the file system assumes that the page is stable (nothing is
writing to the page), and that is a problem: stable page content is
necessary for many filesystem actions during writeback, such as checksum,
encryption, RAID striping, etc. Furthermore, filesystem features like COW
(copy on write) or snapshot also rely on being able to use a new page
as memory for that memory range inside the file.
Corruption during write back is clearly possible here. To solve that, one
idea is to identify pages that have active GUP, so that we can use a
bounce page to write stable data to the filesystem. The filesystem would
work on the bounce page, while any of the active GUP might write to the
original page. This would avoid the stable page violation problem, but
note that it is only part of the overall solution, because other problems
remain.
Other filesystem features that need to replace the page with a new one can
be inhibited for pages that are GUP-pinned. This will, however, alter and
limit some of those filesystem features. The only fix for that would be
to require GUP users to monitor and respond to CPU page table updates.
Subsystems such as ODP and HMM do this, for example. This aspect of the
problem is still under discussion.
Direct IO
=========
Direct IO can cause corruption, if userspace does Direct-IO that writes to
a range of virtual addresses that are mmap'd to a file. The pages written
to are file-backed pages that can be under write back, while the Direct IO
is taking place. Here, Direct IO races with a write back: it calls GUP
before page_mkclean() has replaced the CPU pte with a read-only entry.
The race window is pretty small, which is probably why years have gone by
before we noticed this problem: Direct IO is generally very quick, and
tends to finish up before the filesystem gets around to do anything with
the page contents. However, it's still a real problem. The solution is
to never let GUP return pages that are under write back, but instead,
force GUP to take a write fault on those pages. That way, GUP will
properly synchronize with the active write back. This does not change the
required GUP behavior, it just avoids that race.
Details
=======
Introduces put_user_page(), which simply calls put_page(). This provides
a way to update all get_user_pages*() callers, so that they call
put_user_page(), instead of put_page().
Also introduces put_user_pages(), and a few dirty/locked variations, as a
replacement for release_pages(), and also as a replacement for open-coded
loops that release multiple pages. These may be used for subsequent
performance improvements, via batching of pages to be released.
This is the first step of fixing a problem (also described in [1] and [2])
with interactions between get_user_pages ("gup") and filesystems.
Problem description: let's start with a bug report. Below, is what
happens sometimes, under memory pressure, when a driver pins some pages
via gup, and then marks those pages dirty, and releases them. Note that
the gup documentation actually recommends that pattern. The problem is
that the filesystem may do a writeback while the pages were gup-pinned,
and then the filesystem believes that the pages are clean. So, when the
driver later marks the pages as dirty, that conflicts with the
filesystem's page tracking and results in a BUG(), like this one that I
experienced:
kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
backtrace:
ext4_writepage
__writepage
write_cache_pages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
...which is due to the file system asserting that there are still buffer
heads attached:
({ \
BUG_ON(!PagePrivate(page)); \
((struct buffer_head *)page_private(page)); \
})
Dave Chinner's description of this is very clear:
"The fundamental issue is that ->page_mkwrite must be called on
every write access to a clean file backed page, not just the first
one. How long the GUP reference lasts is irrelevant, if the page is
clean and you need to dirty it, you must call ->page_mkwrite before it
is marked writeable and dirtied. Every. Time."
This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.
The steps to fix it are:
1) (This patch): provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.
[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [docs]
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:19:08 +09:00
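To make the conversion described in the commit message above concrete: call sites that used to drop their gup references with put_page() switch to the gup-aware helper. A hedged sketch of the per-page pattern (error handling omitted):

static void example_release_pinned(struct page **pages, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		set_page_dirty_lock(pages[i]);
		put_user_page(pages[i]);	/* was: put_page(pages[i]); */
	}
}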
|
|
|
/**
|
|
|
|
* put_user_page() - release a gup-pinned page
|
|
|
|
* @page: pointer to page to be released
|
|
|
|
*
|
|
|
|
* Pages that were pinned via get_user_pages*() must be released via
|
|
|
|
* either put_user_page(), or one of the put_user_pages*() routines
|
|
|
|
* below. This is so that eventually, pages that are pinned via
|
|
|
|
* get_user_pages*() can be separately tracked and uniquely handled. In
|
|
|
|
* particular, interactions with RDMA and filesystems need special
|
|
|
|
* handling.
|
|
|
|
*
|
|
|
|
* put_user_page() and put_page() are not interchangeable, despite this early
|
|
|
|
* implementation that makes them look the same. put_user_page() calls must
|
|
|
|
* be perfectly matched up with get_user_page() calls.
|
|
|
|
*/
|
|
|
|
static inline void put_user_page(struct page *page)
|
|
|
|
{
|
|
|
|
put_page(page);
|
|
|
|
}
|
|
|
|
|
mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
Patch series "mm/gup: add make_dirty arg to put_user_pages_dirty_lock()",
v3.
There are about 50+ patches in my tree [2], and I'll be sending out the
remaining ones in a few more groups:
* The block/bio related changes (Jerome mostly wrote those, but I've had
to move stuff around extensively, and add a little code)
* mm/ changes
* other subsystem patches
* an RFC that shows the current state of the tracking patch set. That
can only be applied after all call sites are converted, but it's good to
get an early look at it.
This is part of a tree-wide conversion, as described in fc1d8e7cca2d ("mm:
introduce put_user_page*(), placeholder versions").
This patch (of 3):
Provide more capable variation of put_user_pages_dirty_lock(), and delete
put_user_pages_dirty(). This is based on the following:
1. Lots of call sites become simpler if a bool is passed into
put_user_page*(), instead of making the call site choose which
put_user_page*() variant to call.
2. Christoph Hellwig's observation that set_page_dirty_lock() is
usually correct, and set_page_dirty() is usually a bug, or at least
questionable, within a put_user_page*() calling chain.
This leads to the following API choices:
* put_user_pages_dirty_lock(page, npages, make_dirty)
* There is no put_user_pages_dirty(). You have to
hand code that, in the rare case that it's
required.
[jhubbard@nvidia.com: remove unused variable in siw_free_plist()]
Link: http://lkml.kernel.org/r/20190729074306.10368-1-jhubbard@nvidia.com
Link: http://lkml.kernel.org/r/20190724044537.10458-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 07:35:04 +09:00
|
|
|
void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
|
|
|
|
bool make_dirty);
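Given the declaration above, the per-page loop from the earlier sketch collapses into a single call when the pages should also be marked dirty; a hedged usage example:

static void example_release_pinned_dirty(struct page **pages, unsigned long npages)
{
	/* Marks each page dirty via set_page_dirty_lock(), then releases it. */
	put_user_pages_dirty_lock(pages, npages, true);
}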
|
|
|
|
|
mm: introduce put_user_page*(), placeholder versions
A discussion of the overall problem is below.
As mentioned in patch 0001, the steps to fix the problem are:
1) Provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.
Overview
========
Some kernel components (file systems, device drivers) need to access
memory that is specified via process virtual address. For a long time,
the API to achieve that was get_user_pages ("GUP") and its variations.
However, GUP has critical limitations that have been overlooked; in
particular, GUP does not interact correctly with filesystems in all
situations. That means that file-backed memory + GUP is a recipe for
potential problems, some of which have already occurred in the field.
GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
code to get the struct page behind a virtual address and to let storage
hardware perform a direct copy to or from that page. This is a
short-lived access pattern, and as such, the window for a concurrent
writeback of a GUP'd page was small enough that there were not (we think)
any reported problems. Also, userspace was expected to understand and
accept that Direct IO was not synchronized with memory-mapped access to
that data, nor with any process address space changes such as munmap(),
mremap(), etc.
Over the years, more GUP uses have appeared (virtualization, device
drivers, RDMA) that can keep the pages they get via GUP for a long period
of time (seconds, minutes, hours, days, ...). This long-term pinning
makes an underlying design problem more obvious.
In fact, there are a number of key problems inherent to GUP:
Interactions with file systems
==============================
File systems expect to be able to write back data, both to reclaim pages,
and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
write access to the file memory pages means that such hardware can dirty
the pages, without the filesystem being aware. This can, in some cases
(depending on filesystem, filesystem options, block device, block device
options, and other variables), lead to data corruption, and also to kernel
bugs of the form:
kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
backtrace:
ext4_writepage
__writepage
write_cache_pages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
...which is due to the file system asserting that there are still buffer
heads attached:
({ \
BUG_ON(!PagePrivate(page)); \
((struct buffer_head *)page_private(page)); \
})
Dave Chinner's description of this is very clear:
"The fundamental issue is that ->page_mkwrite must be called on every
write access to a clean file backed page, not just the first one.
How long the GUP reference lasts is irrelevant, if the page is clean
and you need to dirty it, you must call ->page_mkwrite before it is
marked writeable and dirtied. Every. Time."
This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device, do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.
Long term GUP
=============
Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
writeable mapping is created), and the pages are file-backed. That can
lead to filesystem corruption. What happens is that when a file-backed
page is being written back, it is first mapped read-only in all of the CPU
page tables; the file system then assumes that nobody can write to the
page, and that the page content is therefore stable. Unfortunately, the
GUP callers generally do not monitor changes to the CPU page tables; they
instead assume that the following pattern is safe (it's not):
get_user_pages()
Hardware can keep a reference to those pages for a very long time,
and write to them at any time. Because "hardware" here means "devices
that are not a CPU", this activity occurs without any interaction with
the kernel's file system code.
for each page
set_page_dirty
put_page()
In fact, the GUP documentation even recommends that pattern.
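Spelled out as code, the pattern looks roughly like the hedged sketch below
(only the release half is shown; the pages are assumed to have been pinned
earlier by one of the get_user_pages*() variants, and the names are
illustrative):
static void example_dirty_and_release(struct page **pages, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++) {
		set_page_dirty(pages[i]);	/* may race with writeback */
		put_page(pages[i]);		/* plain refcount drop, not GUP-aware */
	}
}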
Anyway, the file system assumes that the page is stable (nothing is
writing to the page), and that is a problem: stable page content is
necessary for many filesystem actions during writeback, such as checksum,
encryption, RAID striping, etc. Furthermore, filesystem features like COW
(copy on write) or snapshot also rely on being able to use a new page as
the memory for that range inside the file.
Corruption during write back is clearly possible here. To solve that, one
idea is to identify pages that have active GUP, so that we can use a
bounce page to write stable data to the filesystem. The filesystem would
work on the bounce page, while any of the active GUP might write to the
original page. This would avoid the stable page violation problem, but
note that it is only part of the overall solution, because other problems
remain.
Other filesystem features that need to replace the page with a new one can
be inhibited for pages that are GUP-pinned. This will, however, alter and
limit some of those filesystem features. The only fix for that would be
to require GUP users to monitor and respond to CPU page table updates.
Subsystems such as ODP and HMM do this, for example. This aspect of the
problem is still under discussion.
Direct IO
=========
Direct IO can cause corruption, if userspace does Direct-IO that writes to
a range of virtual addresses that are mmap'd to a file. The pages written
to are file-backed pages that can be under write back, while the Direct IO
is taking place. Here, Direct IO races with a write back: it calls GUP
before page_mkclean() has replaced the CPU pte with a read-only entry.
The race window is pretty small, which is probably why years have gone by
before we noticed this problem: Direct IO is generally very quick, and
tends to finish up before the filesystem gets around to doing anything with
the page contents. However, it's still a real problem. The solution is
to never let GUP return pages that are under write back, but instead,
force GUP to take a write fault on those pages. That way, GUP will
properly synchronize with the active write back. This does not change the
required GUP behavior, it just avoids that race.
Details
=======
Introduces put_user_page(), which simply calls put_page(). This provides
a way to update all get_user_pages*() callers, so that they call
put_user_page(), instead of put_page().
Also introduces put_user_pages(), and a few dirty/locked variations, as a
replacement for release_pages(), and also as a replacement for open-coded
loops that release multiple pages. These may be used for subsequent
performance improvements, via batching of pages to be released.
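As a hedged sketch of what such a conversion might look like at a typical
call site (the surrounding function is illustrative), the open-coded loop
becomes a single batched call:
static void example_unpin_pages(struct page **pages, unsigned long npages)
{
	/* was: for (i = 0; i < npages; i++) put_page(pages[i]); */
	put_user_pages(pages, npages);
}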
This is the first step of fixing a problem (also described in [1] and [2])
with interactions between get_user_pages ("gup") and filesystems.
Problem description: let's start with a bug report. Below, is what
happens sometimes, under memory pressure, when a driver pins some pages
via gup, and then marks those pages dirty, and releases them. Note that
the gup documentation actually recommends that pattern. The problem is
that the filesystem may do a writeback while the pages were gup-pinned,
and then the filesystem believes that the pages are clean. So, when the
driver later marks the pages as dirty, that conflicts with the
filesystem's page tracking and results in a BUG(), like this one that I
experienced:
kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
backtrace:
ext4_writepage
__writepage
write_cache_pages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
...which is due to the file system asserting that there are still buffer
heads attached:
({ \
BUG_ON(!PagePrivate(page)); \
((struct buffer_head *)page_private(page)); \
})
Dave Chinner's description of this is very clear:
"The fundamental issue is that ->page_mkwrite must be called on
every write access to a clean file backed page, not just the first
one. How long the GUP reference lasts is irrelevant, if the page is
clean and you need to dirty it, you must call ->page_mkwrite before it
is marked writeable and dirtied. Every. Time."
This is just one symptom of the larger design problem: real filesystems
that actually write to a backing device, do not actually support
get_user_pages() being called on their pages, and letting hardware write
directly to those pages--even though that pattern has been going on since
about 2005 or so.
The steps to fix it are:
1) (This patch): provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem.
[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [docs]
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:19:08 +09:00
|
|
|
void put_user_pages(struct page **pages, unsigned long npages);
|
|
|
|
|
2013-02-23 09:35:21 +09:00
|
|
|
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
|
|
|
|
#define SECTION_IN_PAGE_FLAGS
|
|
|
|
#endif
|
|
|
|
|
2006-12-07 13:31:45 +09:00
|
|
|
/*
|
2013-09-12 06:22:35 +09:00
|
|
|
* The identification function is mainly used by the buddy allocator for
|
|
|
|
* determining if two pages could be buddies. We are not really identifying
|
|
|
|
* the zone since we could be using the section number id if we do not have
|
|
|
|
* node id available in page flags.
|
|
|
|
* We only guarantee that it will return the same value for two combinable
|
|
|
|
* pages in a zone.
|
2006-12-07 13:31:45 +09:00
|
|
|
*/
|
2006-06-23 18:03:01 +09:00
|
|
|
static inline int page_zone_id(struct page *page)
|
|
|
|
{
|
2006-12-07 13:31:45 +09:00
|
|
|
return (page->flags >> ZONEID_PGSHIFT) & ZONEID_MASK;
|
2005-06-23 16:07:40 +09:00
|
|
|
}
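For illustration, a hedged sketch of the only guarantee the comment above
promises; the helper name is hypothetical:
static inline bool example_pages_could_be_buddies(struct page *a, struct page *b)
{
	/* Combinable pages must at least agree on their zone id. */
	return page_zone_id(a) == page_zone_id(b);
}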
|
|
|
|
|
2006-12-07 13:31:45 +09:00
|
|
|
#ifdef NODE_NOT_IN_PAGE_FLAGS
|
2011-07-26 09:11:51 +09:00
|
|
|
extern int page_to_nid(const struct page *page);
|
2006-12-07 13:31:45 +09:00
|
|
|
#else
|
2011-07-26 09:11:51 +09:00
|
|
|
static inline int page_to_nid(const struct page *page)
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
{
|
2018-04-06 08:22:47 +09:00
|
|
|
struct page *p = (struct page *)page;
|
|
|
|
|
|
|
|
return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
}
|
2006-12-07 13:31:45 +09:00
|
|
|
#endif
|
|
|
|
|
2012-11-12 18:06:20 +09:00
|
|
|
#ifdef CONFIG_NUMA_BALANCING
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline int cpu_pid_to_cpupid(int cpu, int pid)
|
2012-11-12 18:06:20 +09:00
|
|
|
{
|
2013-10-07 19:29:20 +09:00
|
|
|
return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
|
2012-11-12 18:06:20 +09:00
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline int cpupid_to_pid(int cpupid)
|
2012-11-12 18:06:20 +09:00
|
|
|
{
|
2013-10-07 19:29:20 +09:00
|
|
|
return cpupid & LAST__PID_MASK;
|
2012-11-12 18:06:20 +09:00
|
|
|
}
|
2013-10-07 19:29:07 +09:00
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline int cpupid_to_cpu(int cpupid)
|
2013-10-07 19:29:07 +09:00
|
|
|
{
|
2013-10-07 19:29:20 +09:00
|
|
|
return (cpupid >> LAST__PID_SHIFT) & LAST__CPU_MASK;
|
2013-10-07 19:29:07 +09:00
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline int cpupid_to_nid(int cpupid)
|
2013-10-07 19:29:07 +09:00
|
|
|
{
|
2013-10-07 19:29:20 +09:00
|
|
|
return cpu_to_node(cpupid_to_cpu(cpupid));
|
2013-10-07 19:29:07 +09:00
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline bool cpupid_pid_unset(int cpupid)
|
2012-11-12 18:06:20 +09:00
|
|
|
{
|
2013-10-07 19:29:20 +09:00
|
|
|
return cpupid_to_pid(cpupid) == (-1 & LAST__PID_MASK);
|
2013-10-07 19:29:07 +09:00
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline bool cpupid_cpu_unset(int cpupid)
|
2013-10-07 19:29:07 +09:00
|
|
|
{
|
2013-10-07 19:29:20 +09:00
|
|
|
return cpupid_to_cpu(cpupid) == (-1 & LAST__CPU_MASK);
|
2013-10-07 19:29:07 +09:00
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:21 +09:00
|
|
|
static inline bool __cpupid_match_pid(pid_t task_pid, int cpupid)
|
|
|
|
{
|
|
|
|
return (task_pid & LAST__PID_MASK) == cpupid_to_pid(cpupid);
|
|
|
|
}
|
|
|
|
|
|
|
|
#define cpupid_match_pid(task, cpupid) __cpupid_match_pid(task->pid, cpupid)
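A hedged sketch of the encode/decode roundtrip these helpers implement; the
function name is illustrative:
static inline void example_cpupid_roundtrip(int cpu, int pid)
{
	int cpupid = cpu_pid_to_cpupid(cpu, pid);

	/* Only the bits covered by LAST__CPU_MASK and LAST__PID_MASK are
	 * stored, so the decoded values equal the masked originals. */
	WARN_ON(cpupid_to_cpu(cpupid) != (cpu & LAST__CPU_MASK));
	WARN_ON(cpupid_to_pid(cpupid) != (pid & LAST__PID_MASK));
}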
|
2013-10-07 19:29:20 +09:00
|
|
|
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
|
|
|
|
static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
|
2013-10-07 19:29:07 +09:00
|
|
|
{
|
mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS
When doing some NUMA tests on powerpc, I triggered an oops. I found it
was caused by using page->_last_cpupid. It should be initialized as
"-1 & LAST_CPUPID_MASK", not "-1". Otherwise, in task_numa_fault(), we
will miss the check (last_cpupid == (-1 & LAST_CPUPID_MASK)), and
finally cause an oops in task_numa_group(), since the number of online
cpus is less than the number of possible cpus. This happens with
CONFIG_SPARSE_VMEMMAP disabled.
Call trace:
SMP NR_CPUS=64 NUMA PowerNV
Modules linked in:
CPU: 24 PID: 804 Comm: systemd-udevd Not tainted3.13.0-rc1+ #32
task: c000001e2746aa80 ti: c000001e32c50000 task.ti:c000001e32c50000
REGS: c000001e32c53510 TRAP: 0300 Not tainted(3.13.0-rc1+)
MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR:28024424 XER: 20000000
CFAR: c000000000009324 DAR: 7265717569726857 DSISR:40000000 SOFTE: 1
NIP .task_numa_fault+0x1470/0x2370
LR .task_numa_fault+0x1468/0x2370
Call Trace:
.task_numa_fault+0x1468/0x2370 (unreliable)
.do_numa_page+0x480/0x4a0
.handle_mm_fault+0x4ec/0xc90
.do_page_fault+0x3a8/0x890
handle_page_fault+0x10/0x30
Instruction dump:
3c82fefb 3884b138 48d9cff1 60000000 48000574 3c62fefb3863af78 3c82fefb
3884b138 48d9cfd5 60000000 e93f0100 <812902e4> 7d2907b45529063e 7d2a07b4
---[ end trace 15f2510da5ae07cf ]---
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 08:38:39 +09:00
|
|
|
return xchg(&page->_last_cpupid, cpupid & LAST_CPUPID_MASK);
|
2013-10-07 19:29:07 +09:00
|
|
|
}
|
2013-10-07 19:29:20 +09:00
|
|
|
|
|
|
|
static inline int page_cpupid_last(struct page *page)
|
|
|
|
{
|
|
|
|
return page->_last_cpupid;
|
|
|
|
}
|
|
|
|
static inline void page_cpupid_reset_last(struct page *page)
|
2013-10-07 19:29:07 +09:00
|
|
|
{
|
mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGS
When doing some NUMA tests on powerpc, I triggered an oops. I found it
was caused by using page->_last_cpupid. It should be initialized as
"-1 & LAST_CPUPID_MASK", not "-1". Otherwise, in task_numa_fault(), we
will miss the check (last_cpupid == (-1 & LAST_CPUPID_MASK)), and
finally cause an oops in task_numa_group(), since the number of online
cpus is less than the number of possible cpus. This happens with
CONFIG_SPARSE_VMEMMAP disabled.
Call trace:
SMP NR_CPUS=64 NUMA PowerNV
Modules linked in:
CPU: 24 PID: 804 Comm: systemd-udevd Not tainted3.13.0-rc1+ #32
task: c000001e2746aa80 ti: c000001e32c50000 task.ti:c000001e32c50000
REGS: c000001e32c53510 TRAP: 0300 Not tainted(3.13.0-rc1+)
MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR:28024424 XER: 20000000
CFAR: c000000000009324 DAR: 7265717569726857 DSISR:40000000 SOFTE: 1
NIP .task_numa_fault+0x1470/0x2370
LR .task_numa_fault+0x1468/0x2370
Call Trace:
.task_numa_fault+0x1468/0x2370 (unreliable)
.do_numa_page+0x480/0x4a0
.handle_mm_fault+0x4ec/0xc90
.do_page_fault+0x3a8/0x890
handle_page_fault+0x10/0x30
Instruction dump:
3c82fefb 3884b138 48d9cff1 60000000 48000574 3c62fefb3863af78 3c82fefb
3884b138 48d9cfd5 60000000 e93f0100 <812902e4> 7d2907b45529063e 7d2a07b4
---[ end trace 15f2510da5ae07cf ]---
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04 08:38:39 +09:00
|
|
|
page->_last_cpupid = -1 & LAST_CPUPID_MASK;
|
2012-11-12 18:06:20 +09:00
|
|
|
}
|
|
|
|
#else
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline int page_cpupid_last(struct page *page)
|
2013-02-23 09:34:32 +09:00
|
|
|
{
|
2013-10-07 19:29:20 +09:00
|
|
|
return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
|
2013-02-23 09:34:32 +09:00
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
extern int page_cpupid_xchg_last(struct page *page, int cpupid);
|
2013-02-23 09:34:32 +09:00
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline void page_cpupid_reset_last(struct page *page)
|
2013-02-23 09:34:32 +09:00
|
|
|
{
|
2016-05-20 09:13:53 +09:00
|
|
|
page->flags |= LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT;
|
2013-02-23 09:34:32 +09:00
|
|
|
}
|
2013-10-07 19:29:20 +09:00
|
|
|
#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
|
|
|
|
#else /* !CONFIG_NUMA_BALANCING */
|
|
|
|
static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
|
2012-11-12 18:06:20 +09:00
|
|
|
{
|
2013-10-07 19:29:20 +09:00
|
|
|
return page_to_nid(page); /* XXX */
|
2012-11-12 18:06:20 +09:00
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline int page_cpupid_last(struct page *page)
|
2012-11-12 18:06:20 +09:00
|
|
|
{
|
2013-10-07 19:29:20 +09:00
|
|
|
return page_to_nid(page); /* XXX */
|
2012-11-12 18:06:20 +09:00
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline int cpupid_to_nid(int cpupid)
|
2013-10-07 19:29:07 +09:00
|
|
|
{
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline int cpupid_to_pid(int cpupid)
|
2013-10-07 19:29:07 +09:00
|
|
|
{
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline int cpupid_to_cpu(int cpupid)
|
2013-10-07 19:29:07 +09:00
|
|
|
{
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline int cpu_pid_to_cpupid(int nid, int pid)
|
|
|
|
{
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool cpupid_pid_unset(int cpupid)
|
2013-10-07 19:29:07 +09:00
|
|
|
{
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2013-10-07 19:29:20 +09:00
|
|
|
static inline void page_cpupid_reset_last(struct page *page)
|
2012-11-12 18:06:20 +09:00
|
|
|
{
|
|
|
|
}
|
2013-10-07 19:29:21 +09:00
|
|
|
|
|
|
|
static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2013-10-07 19:29:20 +09:00
|
|
|
#endif /* CONFIG_NUMA_BALANCING */
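A hedged sketch of how a NUMA-balancing fault path might use the
last-cpupid helpers above, independent of whether the value lives in
page->flags or in _last_cpupid; the function name is illustrative:
static inline int example_record_numa_fault(struct page *page, int this_cpupid)
{
	/* Record who touched the page last; the previous value lets the
	 * caller detect whether the page is being shared across nodes. */
	return page_cpupid_xchg_last(page, this_cpupid);
}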
|
2012-11-12 18:06:20 +09:00
|
|
|
|
2018-12-28 17:30:57 +09:00
|
|
|
#ifdef CONFIG_KASAN_SW_TAGS
|
kasan: fix per-page tags for non-page_alloc pages
commit cf10bd4c4aff8dd64d1aa7f2a529d0c672bc16af upstream.
To allow performing tag checks on page_alloc addresses obtained via
page_address(), tag-based KASAN modes store tags for page_alloc
allocations in page->flags.
Currently, the default tag value stored in page->flags is 0x00.
Therefore, page_address() returns a 0x00ffff... address for pages that
were not allocated via page_alloc.
This might cause problems. A particular case we encountered is a
conflict with KFENCE. If a KFENCE-allocated slab object is being freed
via kfree(page_address(page) + offset), the address passed to kfree()
will get tagged with 0x00 (as slab pages keep the default per-page
tags). This leads to the is_kfence_address() check failing, and a KFENCE
object ending up in the normal slab freelist, which causes memory
corruption.
This patch changes the way KASAN stores tags in page flags: they are now
stored xor'ed with 0xff. This way, KASAN doesn't need to initialize
per-page flags for every created page, which might be slow.
With this change, page_address() returns natively-tagged (with 0xff)
pointers for pages that didn't have tags set explicitly.
This patch fixes the encountered conflict with KFENCE and prevents more
similar issues that can occur in the future.
Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com
Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-25 13:37:20 +09:00
|
|
|
|
|
|
|
/*
|
|
|
|
* KASAN per-page tags are stored xor'ed with 0xff. This allows to avoid
|
|
|
|
* setting tags for all pages to native kernel tag value 0xff, as the default
|
|
|
|
* value 0x00 maps to 0xff.
|
|
|
|
*/
|
|
|
|
|
2018-12-28 17:30:57 +09:00
|
|
|
static inline u8 page_kasan_tag(const struct page *page)
|
|
|
|
{
|
kasan: fix per-page tags for non-page_alloc pages
commit cf10bd4c4aff8dd64d1aa7f2a529d0c672bc16af upstream.
To allow performing tag checks on page_alloc addresses obtained via
page_address(), tag-based KASAN modes store tags for page_alloc
allocations in page->flags.
Currently, the default tag value stored in page->flags is 0x00.
Therefore, page_address() returns a 0x00ffff... address for pages that
were not allocated via page_alloc.
This might cause problems. A particular case we encountered is a
conflict with KFENCE. If a KFENCE-allocated slab object is being freed
via kfree(page_address(page) + offset), the address passed to kfree()
will get tagged with 0x00 (as slab pages keep the default per-page
tags). This leads to the is_kfence_address() check failing, and a KFENCE
object ending up in the normal slab freelist, which causes memory
corruption.
This patch changes the way KASAN stores tags in page flags: they are now
stored xor'ed with 0xff. This way, KASAN doesn't need to initialize
per-page flags for every created page, which might be slow.
With this change, page_address() returns natively-tagged (with 0xff)
pointers for pages that didn't have tags set explicitly.
This patch fixes the encountered conflict with KFENCE and prevents more
similar issues that can occur in the future.
Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com
Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-25 13:37:20 +09:00
|
|
|
u8 tag;
|
|
|
|
|
|
|
|
tag = (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK;
|
|
|
|
tag ^= 0xff;
|
|
|
|
|
|
|
|
return tag;
|
2018-12-28 17:30:57 +09:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void page_kasan_tag_set(struct page *page, u8 tag)
|
|
|
|
{
|
kasan: fix per-page tags for non-page_alloc pages
commit cf10bd4c4aff8dd64d1aa7f2a529d0c672bc16af upstream.
To allow performing tag checks on page_alloc addresses obtained via
page_address(), tag-based KASAN modes store tags for page_alloc
allocations in page->flags.
Currently, the default tag value stored in page->flags is 0x00.
Therefore, page_address() returns a 0x00ffff... address for pages that
were not allocated via page_alloc.
This might cause problems. A particular case we encountered is a
conflict with KFENCE. If a KFENCE-allocated slab object is being freed
via kfree(page_address(page) + offset), the address passed to kfree()
will get tagged with 0x00 (as slab pages keep the default per-page
tags). This leads to the is_kfence_address() check failing, and a KFENCE
object ending up in the normal slab freelist, which causes memory
corruption.
This patch changes the way KASAN stores tags in page flags: they are now
stored xor'ed with 0xff. This way, KASAN doesn't need to initialize
per-page flags for every created page, which might be slow.
With this change, page_address() returns natively-tagged (with 0xff)
pointers for pages that didn't have tags set explicitly.
This patch fixes the encountered conflict with KFENCE and prevents more
similar issues that can occur in the future.
Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com
Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-25 13:37:20 +09:00
|
|
|
tag ^= 0xff;
|
2018-12-28 17:30:57 +09:00
|
|
|
page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT);
|
|
|
|
page->flags |= (tag & KASAN_TAG_MASK) << KASAN_TAG_PGSHIFT;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void page_kasan_tag_reset(struct page *page)
|
|
|
|
{
|
|
|
|
page_kasan_tag_set(page, 0xff);
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline u8 page_kasan_tag(const struct page *page)
|
|
|
|
{
|
|
|
|
return 0xff;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void page_kasan_tag_set(struct page *page, u8 tag) { }
|
|
|
|
static inline void page_kasan_tag_reset(struct page *page) { }
|
|
|
|
#endif
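A hedged sketch of the tag storage roundtrip described in the fix above,
assuming CONFIG_KASAN_SW_TAGS is enabled; the function name is illustrative:
static inline void example_kasan_tag_roundtrip(struct page *page, u8 tag)
{
	page_kasan_tag_set(page, tag);
	WARN_ON(page_kasan_tag(page) != tag);	/* set/get are inverses */

	page_kasan_tag_reset(page);
	WARN_ON(page_kasan_tag(page) != 0xff);	/* reset yields the native tag */
}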
|
|
|
|
|
2011-07-26 09:11:51 +09:00
|
|
|
static inline struct zone *page_zone(const struct page *page)
|
2006-12-07 13:31:45 +09:00
|
|
|
{
|
|
|
|
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
|
|
|
|
}
|
|
|
|
|
mm, vmstat: add infrastructure for per-node vmstats
Patchset: "Move LRU page reclaim from zones to nodes v9"
This series moves LRUs from the zones to the node. While this is a
current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details.
Some of the broad motivations for this are;
1. The residency of a page partially depends on what zone the page was
allocated from. This is partially combatted by the fair zone allocation
policy but that is a partial solution that introduces overhead in the
page allocator paths.
2. Currently, reclaim on node 0 behaves slightly different to node 1. For
example, direct reclaim scans in zonelist order and reclaims even if
the zone is over the high watermark regardless of the age of pages
in that LRU. Kswapd on the other hand starts reclaim on the highest
unbalanced zone. A difference in the distribution of file/anon pages,
due to when they were allocated, can result in a difference in ageing.
While the fair zone allocation policy mitigates some of the problems
here, the page reclaim results on a multi-zone node will always be
different to a single-zone node.
3. kswapd and the page allocator scan zones in the opposite order to
avoid interfering with each other but it's sensitive to timing. In the
ideal case this mitigates the page allocator using pages that were
allocated very recently. When kswapd is allocating from lower zones then
it's great, but during the rebalancing of the highest zone, the page
allocator and kswapd interfere with each other. It's worse
if the highest zone is small and difficult to balance.
4. slab shrinkers are node-based which makes it harder to identify the exact
relationship between slab reclaim and LRU reclaim.
The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have lower highmem:lowmem
ratios than we worried about in the past.
Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.
The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.
pagealloc
---------
This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested up to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
Min total-odr0-1 490.00 ( 0.00%) 457.00 ( 6.73%)
Min total-odr0-2 347.00 ( 0.00%) 329.00 ( 5.19%)
Min total-odr0-4 288.00 ( 0.00%) 273.00 ( 5.21%)
Min total-odr0-8 251.00 ( 0.00%) 239.00 ( 4.78%)
Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%)
Min total-odr0-32 223.00 ( 0.00%) 211.00 ( 5.38%)
Min total-odr0-64 217.00 ( 0.00%) 208.00 ( 4.15%)
Min total-odr0-128 214.00 ( 0.00%) 204.00 ( 4.67%)
Min total-odr0-256 250.00 ( 0.00%) 230.00 ( 8.00%)
Min total-odr0-512 271.00 ( 0.00%) 269.00 ( 0.74%)
Min total-odr0-1024 291.00 ( 0.00%) 282.00 ( 3.09%)
Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%)
Min total-odr0-4096 311.00 ( 0.00%) 309.00 ( 0.64%)
Min total-odr0-8192 316.00 ( 0.00%) 314.00 ( 0.63%)
Min total-odr0-16384 317.00 ( 0.00%) 315.00 ( 0.63%)
Min total-odr1-1 742.00 ( 0.00%) 712.00 ( 4.04%)
Min total-odr1-2 562.00 ( 0.00%) 530.00 ( 5.69%)
Min total-odr1-4 457.00 ( 0.00%) 433.00 ( 5.25%)
Min total-odr1-8 411.00 ( 0.00%) 381.00 ( 7.30%)
Min total-odr1-16 381.00 ( 0.00%) 356.00 ( 6.56%)
Min total-odr1-32 372.00 ( 0.00%) 346.00 ( 6.99%)
Min total-odr1-64 372.00 ( 0.00%) 343.00 ( 7.80%)
Min total-odr1-128 375.00 ( 0.00%) 351.00 ( 6.40%)
Min total-odr1-256 379.00 ( 0.00%) 351.00 ( 7.39%)
Min total-odr1-512 385.00 ( 0.00%) 355.00 ( 7.79%)
Min total-odr1-1024 386.00 ( 0.00%) 358.00 ( 7.25%)
Min total-odr1-2048 390.00 ( 0.00%) 362.00 ( 7.18%)
Min total-odr1-4096 390.00 ( 0.00%) 362.00 ( 7.18%)
Min total-odr1-8192 388.00 ( 0.00%) 363.00 ( 6.44%)
This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v8
User 189.19 191.80
System 2604.45 2533.56
Elapsed 2855.30 2786.39
The vmstats also showed that the fair zone allocation policy was definitely
removed as can be seen here;
4.7.0-rc3 4.7.0-rc3
mmotm-20160623 nodelru-v8
DMA32 allocs 28794729769 0
Normal allocs 48432501431 77227309877
Movable allocs 0 0
tiobench on ext4
----------------
tiobench is a benchmark that artificially benefits if old pages remain resident
while new pages get reclaimed. The fair zone allocation policy mitigates this
problem so pages age fairly. While the benchmark has problems, it is important
that tiobench performance remains constant as it implies that page aging
problems that the fair zone allocation policy fixes are not re-introduced.
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
Min PotentialReadSpeed 89.65 ( 0.00%) 90.21 ( 0.62%)
Min SeqRead-MB/sec-1 82.68 ( 0.00%) 82.01 ( -0.81%)
Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.07 ( -0.95%)
Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.92 ( -0.28%)
Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.19 ( 0.43%)
Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.22 ( -0.03%)
Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.88 ( 0.00%)
Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.92 ( -3.16%)
Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.34 ( -6.29%)
Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.60 ( -0.62%)
Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.90 ( 5.56%)
Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 76.85 ( 0.58%)
Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.54 ( -0.77%)
Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 80.13 ( 0.10%)
Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 73.20 ( 0.44%)
Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 76.44 ( 0.70%)
Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.14 ( -3.39%)
Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.03 ( 0.98%)
Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.98 ( -6.67%)
Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%)
Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.93 ( 1.09%)
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 approx-v9
User 645.72 525.90
System 403.85 331.75
Elapsed 6795.36 6783.67
This shows that the series has little or no impact on tiobench, which is
desirable, and a reduction in system CPU usage. It indicates that the fair
zone allocation policy was removed in a manner that didn't reintroduce
one class of page aging bug. There were only minor differences in overall
reclaim activity
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v8
Minor Faults 645838 647465
Major Faults 573 640
Swap Ins 0 0
Swap Outs 0 0
DMA allocs 0 0
DMA32 allocs 46041453 44190646
Normal allocs 78053072 79887245
Movable allocs 0 0
Allocation stalls 24 67
Stall zone DMA 0 0
Stall zone DMA32 0 0
Stall zone Normal 0 2
Stall zone HighMem 0 0
Stall zone Movable 0 65
Direct pages scanned 10969 30609
Kswapd pages scanned 93375144 93492094
Kswapd pages reclaimed 93372243 93489370
Direct pages reclaimed 10969 30609
Kswapd efficiency 99% 99%
Kswapd velocity 13741.015 13781.934
Direct efficiency 100% 100%
Direct velocity 1.614 4.512
Percentage direct scans 0% 0%
kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 4 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).
pgbench read-only large configuration on ext4
---------------------------------------------
pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe
pgbench Transactions
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v8
Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%)
Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%)
Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%)
Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%)
Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%)
Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%)
Negligible differences again. As with tiobench, overall reclaim activity
was comparable.
bonnie++ on ext4
----------------
No interesting performance difference, negligible differences on reclaim
stats.
paralleldd on ext4
------------------
This workload uses varying numbers of dd instances to read large amounts of
data from disk.
4.7.0-rc3 4.7.0-rc3
mmotm-20160623 nodelru-v9
Amean Elapsd-1 186.04 ( 0.00%) 189.41 ( -1.82%)
Amean Elapsd-3 192.27 ( 0.00%) 191.38 ( 0.46%)
Amean Elapsd-5 185.21 ( 0.00%) 182.75 ( 1.33%)
Amean Elapsd-7 183.71 ( 0.00%) 182.11 ( 0.87%)
Amean Elapsd-12 180.96 ( 0.00%) 181.58 ( -0.35%)
Amean Elapsd-16 181.36 ( 0.00%) 183.72 ( -1.30%)
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v9
User 1548.01 1552.44
System 8609.71 8515.08
Elapsed 3587.10 3594.54
There is little or no change in performance but some drop in system CPU usage.
4.7.0-rc3 4.7.0-rc3
mmotm-20160623 nodelru-v9
Minor Faults 362662 367360
Major Faults 1204 1143
Swap Ins 22 0
Swap Outs 2855 1029
DMA allocs 0 0
DMA32 allocs 31409797 28837521
Normal allocs 46611853 49231282
Movable allocs 0 0
Direct pages scanned 0 0
Kswapd pages scanned 40845270 40869088
Kswapd pages reclaimed 40830976 40855294
Direct pages reclaimed 0 0
Kswapd efficiency 99% 99%
Kswapd velocity 11386.711 11369.769
Direct efficiency 100% 100%
Direct velocity 0.000 0.000
Percentage direct scans 0% 0%
Page writes by reclaim 2855 1029
Page writes file 0 0
Page writes anon 2855 1029
Page reclaim immediate 771 1628
Sector Reads 293312636 293536360
Sector Writes 18213568 18186480
Page rescued immediate 0 0
Slabs scanned 128257 132747
Direct inode steals 181 56
Kswapd inode steals 59 1131
It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in the number of inodes reclaimed, but the workload has few active inodes,
so it is likely a timing artifact.
stutter
-------
stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.
stutter
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v8
Min mmap 16.6283 ( 0.00%) 13.4258 ( 19.26%)
1st-qrtle mmap 54.7570 ( 0.00%) 34.9121 ( 36.24%)
2nd-qrtle mmap 57.3163 ( 0.00%) 46.1147 ( 19.54%)
3rd-qrtle mmap 58.9976 ( 0.00%) 47.1882 ( 20.02%)
Max-90% mmap 59.7433 ( 0.00%) 47.4453 ( 20.58%)
Max-93% mmap 60.1298 ( 0.00%) 47.6037 ( 20.83%)
Max-95% mmap 73.4112 ( 0.00%) 82.8719 (-12.89%)
Max-99% mmap 92.8542 ( 0.00%) 88.8870 ( 4.27%)
Max mmap 1440.6569 ( 0.00%) 121.4201 ( 91.57%)
Mean mmap 59.3493 ( 0.00%) 42.2991 ( 28.73%)
Best99%Mean mmap 57.2121 ( 0.00%) 41.8207 ( 26.90%)
Best95%Mean mmap 55.9113 ( 0.00%) 39.9620 ( 28.53%)
Best90%Mean mmap 55.6199 ( 0.00%) 39.3124 ( 29.32%)
Best50%Mean mmap 53.2183 ( 0.00%) 33.1307 ( 37.75%)
Best10%Mean mmap 45.9842 ( 0.00%) 20.4040 ( 55.63%)
Best5%Mean mmap 43.2256 ( 0.00%) 17.9654 ( 58.44%)
Best1%Mean mmap 32.9388 ( 0.00%) 16.6875 ( 49.34%)
This shows a number of improvements with the worst-case outlier greatly
improved.
Some of the vmstats are interesting
4.7.0-rc4 4.7.0-rc4
mmotm-20160623 nodelru-v8
Swap Ins 163 502
Swap Outs 0 0
DMA allocs 0 0
DMA32 allocs 618719206 1381662383
Normal allocs 891235743 564138421
Movable allocs 0 0
Allocation stalls 2603 1
Direct pages scanned 216787 2
Kswapd pages scanned 50719775 41778378
Kswapd pages reclaimed 41541765 41777639
Direct pages reclaimed 209159 0
Kswapd efficiency 81% 99%
Kswapd velocity 16859.554 14329.059
Direct efficiency 96% 0%
Direct velocity 72.061 0.001
Percentage direct scans 0% 0%
Page writes by reclaim 6215049 0
Page writes file 6215049 0
Page writes anon 0 0
Page reclaim immediate 70673 90
Sector Reads 81940800 81680456
Sector Writes 100158984 98816036
Page rescued immediate 0 0
Slabs scanned 1366954 22683
While this is not guaranteed in all cases, this particular test showed
a large reduction in direct reclaim activity. It's also worth noting
that no page writes were issued from reclaim context.
This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
that area.
1. Reclaim/compaction is going to be affected because the amount of reclaim is
no longer targeted at a specific zone. Compaction works on a per-zone basis
so there is no guarantee that reclaiming a few THPs' worth of pages will
have a positive impact on compaction success rates.
2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
are called is now different. This may or may not be a problem but if it
is, it'll be because shrinkers are not called enough and some balancing
is required.
3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
distributed between zones and the fair zone allocation policy used to do
something very similar for anon. The distribution is now different, but not
necessarily in a way that matters; it's still worth bearing in mind.
VM statistic counters for reclaim decisions are zone-based. If the kernel
is to reclaim on a per-node basis then we need to track per-node
statistics but there is no infrastructure for that. The most notable
change is that the old node_page_state is renamed to
sum_zone_node_page_state. The new node_page_state takes a pglist_data and
uses per-node stats but none exist yet. There is some renaming such as
vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical
patch with no functional change. There is a lot of similarity between the
node and zone helpers which is unfortunate but there was no obvious way of
reusing the code and maintaining type safety.
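As a hedged sketch of the renamed accessor (NR_FREE_PAGES is used purely as
an example item and the wrapper function is illustrative):
static unsigned long example_node_free_pages(int nid)
{
	/* The old node_page_state() is now sum_zone_node_page_state(): it
	 * sums the per-zone counters of every zone on the node. */
	return sum_zone_node_page_state(nid, NR_FREE_PAGES);
}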
Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-29 07:45:24 +09:00
|
|
|
static inline pg_data_t *page_pgdat(const struct page *page)
|
|
|
|
{
|
|
|
|
return NODE_DATA(page_to_nid(page));
|
|
|
|
}
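A hedged sketch relating the two lookups above; the helper name is
illustrative:
static inline bool example_zone_matches_pgdat(const struct page *page)
{
	/* A page's zone is one of the node_zones of its pg_data_t. */
	return page_zone(page)->zone_pgdat == page_pgdat(page);
}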
|
|
|
|
|
2013-02-23 09:35:21 +09:00
|
|
|
#ifdef SECTION_IN_PAGE_FLAGS
|
2011-05-25 09:12:32 +09:00
|
|
|
static inline void set_page_section(struct page *page, unsigned long section)
|
|
|
|
{
|
|
|
|
page->flags &= ~(SECTIONS_MASK << SECTIONS_PGSHIFT);
|
|
|
|
page->flags |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
|
|
|
|
}
|
|
|
|
|
2011-08-18 01:40:33 +09:00
|
|
|
static inline unsigned long page_to_section(const struct page *page)
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
{
|
|
|
|
return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
|
|
|
|
}
|
2008-04-28 18:12:43 +09:00
|
|
|
#endif
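A hedged sketch of the section roundtrip (only meaningful when
SECTION_IN_PAGE_FLAGS is defined); the function name is illustrative:
static inline void example_section_roundtrip(struct page *page, unsigned long pfn)
{
	set_page_section(page, pfn_to_section_nr(pfn));

	/* pfn_to_page() relies on reading the same section number back. */
	WARN_ON(page_to_section(page) != pfn_to_section_nr(pfn));
}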
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
|
2006-09-26 15:31:13 +09:00
|
|
|
static inline void set_page_zone(struct page *page, enum zone_type zone)
|
2005-06-23 16:07:40 +09:00
|
|
|
{
|
|
|
|
page->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
|
|
|
|
page->flags |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
|
|
|
|
}
|
2006-09-26 15:31:13 +09:00
|
|
|
|
2005-06-23 16:07:40 +09:00
|
|
|
static inline void set_page_node(struct page *page, unsigned long node)
|
|
|
|
{
|
|
|
|
page->flags &= ~(NODES_MASK << NODES_PGSHIFT);
|
|
|
|
page->flags |= (node & NODES_MASK) << NODES_PGSHIFT;
|
2005-04-17 07:20:36 +09:00
|
|
|
}
|
2006-12-07 13:31:45 +09:00
|
|
|
|
2006-09-26 15:31:13 +09:00
|
|
|
static inline void set_page_links(struct page *page, enum zone_type zone,
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
unsigned long node, unsigned long pfn)
|
2005-04-17 07:20:36 +09:00
|
|
|
{
|
2005-06-23 16:07:40 +09:00
|
|
|
set_page_zone(page, zone);
|
|
|
|
set_page_node(page, node);
|
2013-02-23 09:35:21 +09:00
|
|
|
#ifdef SECTION_IN_PAGE_FLAGS
|
[PATCH] sparsemem memory model
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-23 16:07:54 +09:00
|
|
|
set_page_section(page, pfn_to_section_nr(pfn));
|
2011-05-25 09:12:32 +09:00
|
|
|
#endif
|
2005-04-17 07:20:36 +09:00
|
|
|
}
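For illustration, a minimal sketch of the bit-packing pattern that set_page_zone(), set_page_node() and set_page_section() follow; the ZONES_MASK/ZONES_PGSHIFT macro names are the usual mm.h conventions and are assumed here rather than quoted from this file:

static inline void set_page_zone_sketch(struct page *page, enum zone_type zone)
{
	/* clear the old zone bits in page->flags, then store the new zone */
	page->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
	page->flags |= (zone & ZONES_MASK) << ZONES_PGSHIFT;
}

The node and section setters use the same pattern with their own shift/mask pair, which is how the sparsemem changelog's "sharing of these bits" between page_zone() and the section lookup works in practice.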
|
|
|
|
|
2015-10-02 07:37:02 +09:00
|
|
|
#ifdef CONFIG_MEMCG
|
|
|
|
static inline struct mem_cgroup *page_memcg(struct page *page)
|
|
|
|
{
|
|
|
|
return page->mem_cgroup;
|
|
|
|
}
|
2016-07-29 07:45:10 +09:00
|
|
|
static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
|
|
|
|
{
|
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held());
|
|
|
|
return READ_ONCE(page->mem_cgroup);
|
|
|
|
}
|
2015-10-02 07:37:02 +09:00
|
|
|
#else
|
|
|
|
static inline struct mem_cgroup *page_memcg(struct page *page)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
2016-07-29 07:45:10 +09:00
|
|
|
static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
|
|
|
|
{
|
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held());
|
|
|
|
return NULL;
|
|
|
|
}
|
2015-10-02 07:37:02 +09:00
|
|
|
#endif
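As a usage note, page_memcg_rcu() is only meaningful inside an RCU read-side critical section, which is exactly what the WARN_ON_ONCE above checks. A hedged caller sketch (the surrounding function name is invented for illustration):

static void inspect_page_memcg(struct page *page)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
	memcg = page_memcg_rcu(page);
	/* memcg may be NULL and is only stable until rcu_read_unlock() */
	rcu_read_unlock();
}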
|
|
|
|
|
2006-06-30 17:55:32 +09:00
|
|
|
/*
|
|
|
|
* Some inline functions in vmstat.h depend on page_zone()
|
|
|
|
*/
|
|
|
|
#include <linux/vmstat.h>
|
|
|
|
|
2011-07-26 09:11:51 +09:00
|
|
|
static __always_inline void *lowmem_page_address(const struct page *page)
|
2005-04-17 07:20:36 +09:00
|
|
|
{
|
mm: replace open coded page to virt conversion with page_to_virt()
The open coded conversion from struct page address to virtual address in
lowmem_page_address() involves an intermediate conversion step to pfn
number/physical address. Since the placement of the struct page array
relative to the linear mapping may be completely independent from the
placement of physical RAM (as is that case for arm64 after commit
dfd55ad85e 'arm64: vmemmap: use virtual projection of linear region'),
the conversion to physical address and back again should factor out of
the equation, but unfortunately, the shifting and pointer arithmetic
involved prevent this from happening, and the resulting calculation
essentially subtracts the address of the start of physical memory and
adds it back again, in a way that prevents the compiler from optimizing
it away.
Since the start of physical memory is not a build time constant on arm64,
the resulting conversion involves an unnecessary memory access, which
we would like to get rid of. So replace the open coded conversion with
a call to page_to_virt(), and use the open coded conversion as its
default definition, to be overridden by the architecture, if desired.
The existing arch specific definitions of page_to_virt are all equivalent
to this default definition, so by itself this patch is a no-op.
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-04-19 01:04:57 +09:00
|
|
|
return page_to_virt(page);
|
2005-04-17 07:20:36 +09:00
|
|
|
}
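Per the commit message above, the default page_to_virt() is expected to be the old open-coded conversion, roughly as below (a sketch of the fallback definition, not necessarily the exact macro in this tree):

#ifndef page_to_virt
#define page_to_virt(x)	__va(PFN_PHYS(page_to_pfn(x)))
#endif

Architectures such as arm64 can then override page_to_virt() with a translation that avoids the round trip through the physical address.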
|
|
|
|
|
|
|
|
#if defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL)
|
|
|
|
#define HASHED_PAGE_VIRTUAL
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#if defined(WANT_PAGE_VIRTUAL)
|
2014-01-22 08:48:47 +09:00
|
|
|
static inline void *page_address(const struct page *page)
|
|
|
|
{
|
|
|
|
return page->virtual;
|
|
|
|
}
|
|
|
|
static inline void set_page_address(struct page *page, void *address)
|
|
|
|
{
|
|
|
|
page->virtual = address;
|
|
|
|
}
|
2005-04-17 07:20:36 +09:00
|
|
|
#define page_address_init() do { } while(0)
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#if defined(HASHED_PAGE_VIRTUAL)
|
2011-08-17 21:45:09 +09:00
|
|
|
void *page_address(const struct page *page);
|
2005-04-17 07:20:36 +09:00
|
|
|
void set_page_address(struct page *page, void *virtual);
|
|
|
|
void page_address_init(void);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#if !defined(HASHED_PAGE_VIRTUAL) && !defined(WANT_PAGE_VIRTUAL)
|
|
|
|
#define page_address(page) lowmem_page_address(page)
|
|
|
|
#define set_page_address(page, address) do { } while(0)
|
|
|
|
#define page_address_init() do { } while(0)
|
|
|
|
#endif
|
|
|
|
|
2015-04-16 08:14:53 +09:00
|
|
|
extern void *page_rmapping(struct page *page);
|
|
|
|
extern struct anon_vma *page_anon_vma(struct page *page);
|
2013-02-23 09:34:35 +09:00
|
|
|
extern struct address_space *page_mapping(struct page *page);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2012-08-01 08:44:47 +09:00
|
|
|
extern struct address_space *__page_file_mapping(struct page *);
|
|
|
|
|
|
|
|
static inline
|
|
|
|
struct address_space *page_file_mapping(struct page *page)
|
|
|
|
{
|
|
|
|
if (unlikely(PageSwapCache(page)))
|
|
|
|
return __page_file_mapping(page);
|
|
|
|
|
|
|
|
return page->mapping;
|
|
|
|
}
|
|
|
|
|
mm, swap: use offset of swap entry as key of swap cache
This patch is to improve the performance of swap cache operations when
the type of the swap device is not 0. Originally, the whole swap entry
value is used as the key of the swap cache, even though there is one
radix tree for each swap device. If the type of the swap device is not
0, the height of the radix tree of the swap cache will be increased
unnecessarily, especially on 64-bit architectures. For example, for a 1GB
swap device on the x86_64 architecture, the height of the radix tree of
the swap cache is 11. But if the offset of the swap entry is used as
the key of the swap cache, the height of the radix tree of the swap
cache is 4. The increased height causes unnecessary radix tree
descending and increased cache footprint.
This patch reduces the height of the radix tree of the swap cache by
using the offset of the swap entry instead of the whole swap entry value
as the key of the swap cache. In a 32-process sequential swap-out test
case on a Xeon E5 v3 system with a RAM disk as swap, the lock contention
for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
when the type of the swap device is 1.
Use the whole swap entry as key,
perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,
Use the swap offset as key,
perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,
Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08 09:00:21 +09:00
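To make the key change concrete, a hedged sketch of a swap cache lookup after this patch; swap_address_space(), swp_offset() and find_get_page() are existing helpers, but this particular wrapper is illustrative only:

static struct page *swap_cache_lookup_sketch(swp_entry_t entry)
{
	struct address_space *mapping = swap_address_space(entry);

	/* the key is now the offset within the swap device, not entry.val */
	return find_get_page(mapping, swp_offset(entry));
}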
|
|
|
extern pgoff_t __page_file_index(struct page *page);
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
/*
|
|
|
|
* Return the pagecache index of the passed page. Regular pagecache pages
|
2016-10-08 09:00:21 +09:00
|
|
|
* use ->index whereas swapcache pages use swp_offset(->private)
|
2005-04-17 07:20:36 +09:00
|
|
|
*/
|
|
|
|
static inline pgoff_t page_index(struct page *page)
|
|
|
|
{
|
|
|
|
if (unlikely(PageSwapCache(page)))
|
2016-10-08 09:00:21 +09:00
|
|
|
return __page_file_index(page);
|
2005-04-17 07:20:36 +09:00
|
|
|
return page->index;
|
|
|
|
}
|
|
|
|
|
2016-05-20 09:12:00 +09:00
|
|
|
bool page_mapped(struct page *page);
|
mm: migrate: support non-lru movable page migration
Until now we have allowed migration only for LRU pages, and that was enough
to make high-order pages. But recently, embedded systems (e.g., webOS,
Android) use lots of non-movable pages (e.g., zram, GPU memory), so we
have seen several reports about trouble with small high-order allocations.
To fix the problem, there have been several efforts (e.g., enhancing the
compaction algorithm, SLUB fallback to 0-order pages, reserved memory,
vmalloc and so on), but if there are lots of non-movable pages in the
system, those solutions are void in the long run.
So, this patch adds a facility to make non-movable pages movable. For
that, it introduces migration-related functions in
address_space_operations as well as some page flags.
If a driver wants to make its own pages movable, it should define three
functions, which are function pointers of struct address_space_operations
(an illustrative sketch follows this changelog).
1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
What the VM expects of a driver's isolate_page function is to return *true*
if the driver isolates the page successfully. On returning true, the VM
marks the page as PG_isolated so that concurrent isolation on several CPUs
skips the page. If a driver cannot isolate the page, it should return
*false*.
Once a page is successfully isolated, the VM uses the page.lru fields, so
the driver shouldn't expect the values in those fields to be preserved.
2. int (*migratepage) (struct address_space *mapping,
struct page *newpage, struct page *oldpage, enum migrate_mode);
After isolation, the VM calls the driver's migratepage with the isolated
page. The job of migratepage is to move the content of the old page to the
new page and to set up the fields of struct page newpage. Keep in mind
that you should indicate to the VM that the oldpage is no longer movable,
via __ClearPageMovable() under page_lock, if you migrated the oldpage
successfully, and return 0. If the driver cannot migrate the page at the
moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page
migration in a short time because it interprets -EAGAIN as "temporary
migration failure". On returning any error other than -EAGAIN, the VM will
give up the page migration without retrying.
The driver shouldn't touch the page.lru field the VM uses in these
functions.
3. void (*putback_page)(struct page *);
If migration fails on an isolated page, the VM should return the isolated
page to the driver, so the VM calls the driver's putback_page with the
page whose migration failed. In this function, the driver should put the
isolated page back into its own data structure.
4. non-lru movable page flags
There are two page flags for supporting non-lru movable page.
* PG_movable
Driver should use the below function to make page movable under
page_lock.
void __SetPageMovable(struct page *page, struct address_space *mapping)
It takes an address_space argument, used to register the migration family
of functions which will be called by the VM. Strictly speaking, PG_movable
is not a real flag of struct page. Rather, the VM reuses the lower bits of
page->mapping to represent it:
#define PAGE_MAPPING_MOVABLE 0x2
page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
so the driver shouldn't access page->mapping directly. Instead, the driver
should use page_mapping(), which masks off the low two bits of
page->mapping so it can get the right struct address_space.
For testing for a non-lru movable page, the VM supports the __PageMovable
function. However, it doesn't guarantee identification of a non-lru
movable page because the page->mapping field is unified with other
variables in struct page. Also, if the driver releases the page after
isolation by the VM, page->mapping doesn't have a stable value although it
has PAGE_MAPPING_MOVABLE (look at __ClearPageMovable). But __PageMovable
is a cheap way to tell whether a page is LRU or non-lru movable once the
page has been isolated, because LRU pages can never have
PAGE_MAPPING_MOVABLE in page->mapping. It is also good for just peeking to
test for non-lru movable pages before the more expensive check with
lock_page during pfn scanning to select a victim.
For a guaranteed non-lru movable page test, the VM provides the PageMovable
function. Unlike __PageMovable, PageMovable validates page->mapping and
mapping->a_ops->isolate_page under lock_page. The lock_page prevents
sudden destruction of page->mapping.
A driver using __SetPageMovable should clear the flag via
__ClearPageMovable under page_lock before releasing the page.
* PG_isolated
To prevent concurrent isolation among several CPUs, the VM marks an
isolated page as PG_isolated under lock_page, so if a CPU encounters a
PG_isolated non-lru movable page, it can skip it. The driver doesn't need
to manipulate the flag because the VM will set/clear it automatically.
Keep in mind that if the driver sees a PG_isolated page, it means the page
has been isolated by the VM, so it shouldn't touch the page.lru field.
PG_isolated is an alias of the PG_reclaim flag, so the driver shouldn't
use the flag for its own purposes.
[opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-27 07:23:05 +09:00
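A hedged sketch of how a driver might wire up the three callbacks described in the changelog above; the my_* names are invented and the bodies are reduced to comments:

static bool my_isolate_page(struct page *page, isolate_mode_t mode)
{
	/* take the page off the driver's own lists; true means isolated */
	return true;
}

static int my_migratepage(struct address_space *mapping,
			  struct page *newpage, struct page *oldpage,
			  enum migrate_mode mode)
{
	/* copy oldpage's content into newpage, then retire oldpage */
	__ClearPageMovable(oldpage);	/* under page_lock, as described above */
	return 0;			/* or -EAGAIN for a temporary failure */
}

static void my_putback_page(struct page *page)
{
	/* migration failed: put the page back onto the driver's lists */
}

static const struct address_space_operations my_movable_aops = {
	.isolate_page	= my_isolate_page,
	.migratepage	= my_migratepage,
	.putback_page	= my_putback_page,
};

Pages are then registered with __SetPageMovable(page, mapping) while holding the page lock, so the VM can find these operations through page->mapping.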
|
|
|
struct address_space *page_mapping(struct page *page);
|
mm: fix races between swapoff and flush dcache
Thanks to commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks"), after swapoff the address_space associated with the swap
device will be freed. So page_mapping() users which may touch the
address_space need some kind of mechanism to prevent the address_space
from being freed while it is being accessed.
The dcache flushing functions (flush_dcache_page(), etc) in architecture
specific code may access the address_space of swap device for anonymous
pages in swap cache via page_mapping() function. But in some cases
there are no mechanisms to prevent the swap device from being swapoff,
for example,
CPU1 CPU2
__get_user_pages() swapoff()
flush_dcache_page()
mapping = page_mapping()
... exit_swap_address_space()
... kvfree(spaces)
mapping_mapped(mapping)
The address space may be accessed after being freed.
But according to cachetlb.txt and Russell King, flush_dcache_page() only
cares about file cache pages; for anonymous pages, flush_anon_page()
should be used. The implementation of flush_dcache_page() in all
architectures follows this too. They will check whether page_mapping() is
NULL and whether mapping_mapped() is true to determine whether to flush
the dcache immediately. And they will use the interval tree
(mapping->i_mmap) to find all user space mappings, while mapping_mapped()
and mapping->i_mmap aren't used by anonymous pages in the swap cache at
all.
So, to fix the race between swapoff and flush dcache, __page_mapping()
is added to return the address_space for file cache pages and NULL
otherwise. All page_mapping() invocations in flush dcache functions are
replaced with page_mapping_file().
[akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-06 08:24:39 +09:00
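Given that description, page_mapping_file() can be pictured as the following hedged sketch (the real helper is defined elsewhere in the tree; only the behaviour the changelog describes is shown):

static inline struct address_space *page_mapping_file_sketch(struct page *page)
{
	/* anonymous pages in the swap cache have no file mapping to flush */
	if (unlikely(PageSwapCache(page)))
		return NULL;
	return page_mapping(page);
}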
|
|
|
struct address_space *page_mapping_file(struct page *page);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2015-08-22 06:11:51 +09:00
|
|
|
/*
|
|
|
|
* Return true only if the page has been allocated with
|
|
|
|
* ALLOC_NO_WATERMARKS and the low watermark was not
|
|
|
|
* met implying that the system is under some pressure.
|
|
|
|
*/
|
|
|
|
static inline bool page_is_pfmemalloc(struct page *page)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Page index cannot be this large so this must be
|
|
|
|
* a pfmemalloc page.
|
|
|
|
*/
|
|
|
|
return page->index == -1UL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Only to be called by the page allocator on a freshly allocated
|
|
|
|
* page.
|
|
|
|
*/
|
|
|
|
static inline void set_page_pfmemalloc(struct page *page)
|
|
|
|
{
|
|
|
|
page->index = -1UL;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void clear_page_pfmemalloc(struct page *page)
|
|
|
|
{
|
|
|
|
page->index = 0;
|
|
|
|
}
|
|
|
|
|
2009-01-07 07:38:59 +09:00
|
|
|
/*
|
|
|
|
* Can be called by the pagefault handler when it gets a VM_FAULT_OOM.
|
|
|
|
*/
|
|
|
|
extern void pagefault_out_of_memory(void);
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)
|
|
|
|
|
2011-03-23 08:30:46 +09:00
|
|
|
/*
|
2011-05-25 09:11:16 +09:00
|
|
|
* Flags passed to show_mem() and show_free_areas() to suppress output in
|
2011-03-23 08:30:46 +09:00
|
|
|
* various contexts.
|
|
|
|
*/
|
2013-04-30 07:06:11 +09:00
|
|
|
#define SHOW_MEM_FILTER_NODES (0x0001u) /* disallowed nodes */
|
2011-03-23 08:30:46 +09:00
|
|
|
|
2017-02-23 08:46:16 +09:00
|
|
|
extern void show_free_areas(unsigned int flags, nodemask_t *nodemask);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2019-09-24 07:32:59 +09:00
|
|
|
#ifdef CONFIG_MMU
|
2016-01-16 09:57:22 +09:00
|
|
|
extern bool can_do_mlock(void);
|
2019-09-24 07:32:59 +09:00
|
|
|
#else
|
|
|
|
static inline bool can_do_mlock(void) { return false; }
|
|
|
|
#endif
|
2005-04-17 07:20:36 +09:00
|
|
|
extern int user_shm_lock(size_t, struct user_struct *);
|
|
|
|
extern void user_shm_unlock(size_t, struct user_struct *);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Parameter block passed down to zap_pte_range in exceptional cases.
|
|
|
|
*/
|
|
|
|
struct zap_details {
|
|
|
|
struct address_space *check_mapping; /* Check page->mapping if set */
|
|
|
|
pgoff_t first_index; /* Lowest page->index to unmap */
|
|
|
|
pgoff_t last_index; /* Highest page->index to unmap */
|
mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
[ Upstream commit 22061a1ffabdb9c3385de159c5db7aac3a4df1cc ]
There is a race between THP unmapping and truncation, when truncate sees
pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
it, but before its page_remove_rmap() gets to decrement
compound_mapcount: generating false "BUG: Bad page cache" reports that
the page is still mapped when deleted. This commit fixes that, but not
in the way I hoped.
The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
instead of unmap_mapping_range() in truncate_cleanup_page(): it has
often been an annoyance that we usually call unmap_mapping_range() with
no pages locked, but here it is applied to a single locked page.
try_to_unmap() looks more suitable for a single locked page.
However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
it is used to insert THP migration entries, but not used to unmap THPs.
Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
needs are different, I'm too ignorant of the DAX cases, and couldn't
decide how far to go for anon+swap. Set that aside.
The second attempt took a different tack: make no change in truncate.c,
but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
clearing it initially, then pmd_clear() between page_remove_rmap() and
unlocking at the end. Nice. But powerpc blows that approach out of the
water, with its serialize_against_pte_lookup(), and interesting pgtable
usage. It would need serious help to get working on powerpc (with a
minor optimization issue on s390 too). Set that aside.
Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
delay, after unmapping in truncate_cleanup_page()? Perhaps, but although
that's likely to reduce or eliminate the number of incidents, it would
give less assurance of whether we had identified the problem correctly.
This successful iteration introduces "unmap_mapping_page(page)" instead
of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
with an addition to details. Then zap_pmd_range() watches for this
case, and does spin_unlock(pmd_lock) if so - just like
page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but
safe.
Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
assert its interface; but currently that's only used to make sure that
page->mapping is stable, and zap_pmd_range() doesn't care if the page is
locked or not. Along these lines, in invalidate_inode_pages2_range()
move the initial unmap_mapping_range() out from under page lock, before
then calling unmap_mapping_page() under page lock if still mapped.
Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
Fixes: fc127da085c2 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Note on stable backport: fixed up call to truncate_cleanup_page()
in truncate_inode_pages_range(). Use hpage_nr_pages() in
unmap_mapping_page().
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-16 10:24:03 +09:00
|
|
|
struct page *single_page; /* Locked page to be unmapped */
|
2005-04-17 07:20:36 +09:00
|
|
|
};
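A hedged sketch of how the new single_page field is intended to be consumed, pieced together from the commit message above; unmap_mapping_range_tree() is the internal mm/memory.c helper that log refers to, and the exact bookkeeping may differ:

void unmap_mapping_page_sketch(struct page *page)
{
	struct address_space *mapping = page->mapping;
	struct zap_details details = { };

	VM_BUG_ON(!PageLocked(page));	/* keeps page->mapping stable */

	details.check_mapping = mapping;
	details.single_page = page;	/* zap_pmd_range() watches for this */
	details.first_index = page->index;
	details.last_index = page->index + hpage_nr_pages(page) - 1;

	i_mmap_lock_write(mapping);
	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
		unmap_mapping_range_tree(&mapping->i_mmap, &details);
	i_mmap_unlock_write(mapping);
}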
|
|
|
|
|
2018-04-17 23:33:21 +09:00
|
|
|
struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
|
|
|
|
pte_t pte, unsigned long vma_flags);
|
|
|
|
static inline struct page *vm_normal_page(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr, pte_t pte)
|
|
|
|
{
|
|
|
|
return _vm_normal_page(vma, addr, pte, vma->vm_flags);
|
|
|
|
}
|
|
|
|
|
2016-04-29 08:18:35 +09:00
|
|
|
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
|
|
|
|
pmd_t pmd);
|
mm: introduce pte_special pte bit
s390 for one, cannot implement VM_MIXEDMAP with pfn_valid, due to their memory
model (which is more dynamic than most). Instead, they had proposed to
implement it with an additional path through vm_normal_page(), using a bit in
the pte to determine whether or not the page should be refcounted:
vm_normal_page()
{
...
if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
if (vma->vm_flags & VM_MIXEDMAP) {
#ifdef s390
if (!mixedmap_refcount_pte(pte))
return NULL;
#else
if (!pfn_valid(pfn))
return NULL;
#endif
goto out;
}
...
}
This is fine, however if we are allowed to use a bit in the pte to determine
refcountedness, we can use that to _completely_ replace all the vma based
schemes. So instead of adding more cases to the already complex vma-based
scheme, we can have a clearly separate and simple pte-based scheme (and get
slightly better code generation in the process):
vm_normal_page()
{
#ifdef s390
if (!mixedmap_refcount_pte(pte))
return NULL;
return pte_page(pte);
#else
...
#endif
}
And finally, we may rather make this concept usable by any architecture rather
than making it s390 only, so implement a new type of pte state for this.
Unfortunately the old vma based code must stay, because some architectures may
not be able to spare pte bits. This makes vm_normal_page a little bit more
ugly than we would like, but the two cases are clearly separate.
So introduce a pte_special pte state, and use it in mm/memory.c. It is
currently a noop for all architectures, so this doesn't actually result in any
compiled code changes to mm/memory.o.
BTW:
I haven't put vm_normal_page() into arch code as-per an earlier suggestion.
The reason is that, regardless of where vm_normal_page is actually
implemented, the *abstraction* is still exactly the same. Also, while it
depends on whether the architecture has pte_special or not, those are the
only two possible cases, and it really isn't an arch-specific function --
the role of the arch code should be to provide primitive functions and
accessors with which to build the core code; pte_special does that. We do
not want architectures to know or care about vm_normal_page itself, and
we definitely don't want them being able to invent something new there
out of sight of mm/ code. If we made vm_normal_page an arch function, then
we have to make vm_insert_mixed (next patch) an arch function too. So I
don't think moving it to arch code fundamentally improves any abstractions,
while it does practically make the code more difficult to follow, for both
mm and arch developers, and easier to misuse.
[akpm@linux-foundation.org: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 18:13:00 +09:00
|
|
|
|
2018-05-29 21:14:07 +09:00
|
|
|
void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
|
|
|
|
unsigned long size);
|
2012-03-06 03:38:09 +09:00
|
|
|
void zap_page_range(struct vm_area_struct *vma, unsigned long address,
|
2018-05-29 21:14:07 +09:00
|
|
|
unsigned long size);
|
2012-05-07 05:54:06 +09:00
|
|
|
void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
|
|
|
|
unsigned long start, unsigned long end);
|
2008-02-05 15:29:01 +09:00
|
|
|
|
2018-12-28 17:38:09 +09:00
|
|
|
struct mmu_notifier_range;
|
|
|
|
|
2008-07-24 13:27:10 +09:00
|
|
|
void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
|
2005-04-20 05:29:16 +09:00
|
|
|
unsigned long end, unsigned long floor, unsigned long ceiling);
|
2005-04-17 07:20:36 +09:00
|
|
|
int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
|
|
|
|
struct vm_area_struct *vma);
|
2021-02-05 19:07:11 +09:00
|
|
|
int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
|
|
|
|
struct mmu_notifier_range *range, pte_t **ptepp,
|
|
|
|
pmd_t **pmdpp, spinlock_t **ptlp);
|
2020-12-16 13:47:23 +09:00
|
|
|
int follow_pte(struct mm_struct *mm, unsigned long address,
|
2021-02-05 19:07:11 +09:00
|
|
|
pte_t **ptepp, spinlock_t **ptlp);
|
2009-06-17 07:32:35 +09:00
|
|
|
int follow_pfn(struct vm_area_struct *vma, unsigned long address,
|
|
|
|
unsigned long *pfn);
|
2008-12-20 06:47:27 +09:00
|
|
|
int follow_phys(struct vm_area_struct *vma, unsigned long address,
|
|
|
|
unsigned int flags, unsigned long *prot, resource_size_t *phys);
|
2008-07-24 13:27:05 +09:00
|
|
|
int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
|
|
|
|
void *buf, int len, int write);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2018-04-17 23:33:14 +09:00
|
|
|
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
|
|
|
|
static inline void vm_write_begin(struct vm_area_struct *vma)
|
|
|
|
{
|
2022-11-16 03:40:41 +09:00
|
|
|
/*
|
|
|
|
* Isolated vma might be freed without exclusive mmap_lock but
|
|
|
|
* speculative page fault handler still needs to know it was changed.
|
|
|
|
*/
|
|
|
|
if (!RB_EMPTY_NODE(&vma->vm_rb))
|
|
|
|
WARN_ON_ONCE(!rwsem_is_locked(&(vma->vm_mm)->mmap_sem));
|
2021-01-15 23:22:40 +09:00
|
|
|
/*
|
|
|
|
* The read side never spins and preemption
|
|
|
|
* disablement is not required.
|
|
|
|
*/
|
2018-04-17 23:33:14 +09:00
|
|
|
raw_write_seqcount_begin(&vma->vm_sequence);
|
|
|
|
}
|
2021-01-15 23:22:40 +09:00
|
|
|
static inline void vm_write_end(struct vm_area_struct *vma)
|
2018-04-17 23:33:14 +09:00
|
|
|
{
|
|
|
|
raw_write_seqcount_end(&vma->vm_sequence);
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline void vm_write_begin(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
static inline void vm_write_end(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
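For context, the write-side markers above pair with a read side in the speculative fault path. A hedged sketch of that pattern (only the seqcount calls are existing API; the helper name is invented):

static bool vma_snapshot_is_stable(struct vm_area_struct *vma)
{
	unsigned int seq;

	seq = raw_read_seqcount(&vma->vm_sequence);
	/* ... speculatively read the vma fields needed for the fault ... */
	return !read_seqcount_retry(&vma->vm_sequence, seq);
}

If the retry check fails, the speculative handler falls back to the regular, lock-protected fault path.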
|
|
|
|
|
2013-09-13 07:13:56 +09:00
|
|
|
extern void truncate_pagecache(struct inode *inode, loff_t new);
|
2010-06-04 18:30:04 +09:00
|
|
|
extern void truncate_setsize(struct inode *inode, loff_t newsize);
|
vfs: fix data corruption when blocksize < pagesize for mmaped data
->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to fail the page fault if there is not enough space
available, the user exceeds quota, or a similar problem happens, rather than
silently discarding data later when writepage is called.
However VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example when
blocksize = 1024, pagesize = 4096 the following is problematic:
ftruncate(fd, 0);
pwrite(fd, buf, 1024, 0);
map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
map[0] = 'a'; ----> page_mkwrite() for index 0 is called
ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
mremap(map, 1024, 10000, 0);
map[4095] = 'a'; ----> no page_mkwrite() called
At the moment ->page_mkwrite() is called, the filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size, which is generally undesirable. But later at
->writepage() time, we also need to store data at offset 4095, but we
don't have a block allocated for it.
This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-02 10:49:18 +09:00
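A hedged usage sketch for the helper declared below, following the scenario in the commit message; the filesystem hook and size variables are illustrative:

static void my_fs_extend_isize(struct inode *inode, loff_t newsize)
{
	loff_t oldsize = inode->i_size;

	i_size_write(inode, newsize);
	if (newsize > oldsize)
		pagecache_isize_extended(inode, oldsize, newsize);
}

This gives ->page_mkwrite() another chance to run for the partially mapped page that straddled the old i_size, so blocks beyond the old size get allocated before writepage.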
|
|
|
void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
|
2012-03-29 06:42:40 +09:00
|
|
|
void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
|
2009-09-16 18:50:12 +09:00
|
|
|
int truncate_inode_page(struct address_space *mapping, struct page *page);
|
2009-09-16 18:50:13 +09:00
|
|
|
int generic_error_remove_page(struct address_space *mapping, struct page *page);
|
2009-09-16 18:50:13 +09:00
|
|
|
int invalidate_inode_page(struct page *page);
|
|
|
|
|
2006-01-06 17:11:44 +09:00
|
|
|
#ifdef CONFIG_MMU
|
2018-08-24 09:01:36 +09:00
|
|
|
extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
|
|
|
|
unsigned long address, unsigned int flags);
|
2018-04-17 23:33:24 +09:00
|
|
|
|
|
|
|
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
|
|
|
|
extern int __handle_speculative_fault(struct mm_struct *mm,
|
|
|
|
unsigned long address,
|
2018-04-17 23:33:28 +09:00
|
|
|
unsigned int flags,
|
|
|
|
struct vm_area_struct **vma);
|
2018-04-17 23:33:24 +09:00
|
|
|
static inline int handle_speculative_fault(struct mm_struct *mm,
|
|
|
|
unsigned long address,
|
2018-04-17 23:33:28 +09:00
|
|
|
unsigned int flags,
|
|
|
|
struct vm_area_struct **vma)
|
2018-04-17 23:33:24 +09:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Try speculative page fault for multithreaded user space task only.
|
|
|
|
*/
|
2018-04-17 23:33:28 +09:00
|
|
|
if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1) {
|
|
|
|
*vma = NULL;
|
2018-04-17 23:33:24 +09:00
|
|
|
return VM_FAULT_RETRY;
|
2018-04-17 23:33:28 +09:00
|
|
|
}
|
|
|
|
return __handle_speculative_fault(mm, address, flags, vma);
|
2018-04-17 23:33:24 +09:00
|
|
|
}
|
2018-04-17 23:33:28 +09:00
|
|
|
extern bool can_reuse_spf_vma(struct vm_area_struct *vma,
|
|
|
|
unsigned long address);
|
2018-04-17 23:33:24 +09:00
|
|
|
#else
|
|
|
|
static inline int handle_speculative_fault(struct mm_struct *mm,
|
|
|
|
unsigned long address,
|
2018-04-17 23:33:28 +09:00
|
|
|
unsigned int flags,
|
|
|
|
struct vm_area_struct **vma)
|
2018-04-17 23:33:24 +09:00
|
|
|
{
|
|
|
|
return VM_FAULT_RETRY;
|
|
|
|
}
|
2018-04-17 23:33:28 +09:00
|
|
|
static inline bool can_reuse_spf_vma(struct vm_area_struct *vma,
|
|
|
|
unsigned long address)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
2018-04-17 23:33:24 +09:00
|
|
|
#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
|
|
|
|
|
2011-07-27 19:17:11 +09:00
|
|
|
extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
|
2016-01-16 09:57:04 +09:00
|
|
|
unsigned long address, unsigned int fault_flags,
|
|
|
|
bool *unlocked);
|
2021-06-16 10:24:03 +09:00
|
|
|
void unmap_mapping_page(struct page *page);
|
2018-02-01 09:17:36 +09:00
|
|
|
void unmap_mapping_pages(struct address_space *mapping,
|
|
|
|
pgoff_t start, pgoff_t nr, bool even_cows);
|
|
|
|
void unmap_mapping_range(struct address_space *mapping,
|
|
|
|
loff_t const holebegin, loff_t const holelen, int even_cows);
|
2006-01-06 17:11:44 +09:00
|
|
|
#else
|
2018-08-24 09:01:36 +09:00
|
|
|
static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
|
2016-07-27 07:25:18 +09:00
|
|
|
unsigned long address, unsigned int flags)
|
2006-01-06 17:11:44 +09:00
|
|
|
{
|
|
|
|
/* should never happen if there's no MMU */
|
|
|
|
BUG();
|
|
|
|
return VM_FAULT_SIGBUS;
|
|
|
|
}
|
2011-07-27 19:17:11 +09:00
|
|
|
static inline int fixup_user_fault(struct task_struct *tsk,
|
|
|
|
struct mm_struct *mm, unsigned long address,
|
2016-01-16 09:57:04 +09:00
|
|
|
unsigned int fault_flags, bool *unlocked)
|
2011-07-27 19:17:11 +09:00
|
|
|
{
|
|
|
|
/* should never happen if there's no MMU */
|
|
|
|
BUG();
|
|
|
|
return -EFAULT;
|
|
|
|
}
|
2021-06-16 10:24:03 +09:00
|
|
|
static inline void unmap_mapping_page(struct page *page) { }
|
2018-02-01 09:17:36 +09:00
|
|
|
static inline void unmap_mapping_pages(struct address_space *mapping,
|
|
|
|
pgoff_t start, pgoff_t nr, bool even_cows) { }
|
|
|
|
static inline void unmap_mapping_range(struct address_space *mapping,
|
|
|
|
loff_t const holebegin, loff_t const holelen, int even_cows) { }
|
2006-01-06 17:11:44 +09:00
|
|
|
#endif
|
[PATCH] fix get_user_pages bug
Checking pte_dirty instead of pte_write in __follow_page is problematic
for s390, and for copy_one_pte which leaves dirty when clearing write.
So revert __follow_page to check pte_write as before, and make
do_wp_page pass back a special extra VM_FAULT_WRITE bit to say it has
done its full job: once get_user_pages receives this value, it no longer
requires pte_write in __follow_page.
But most callers of handle_mm_fault, in the various architectures, have
switch statements which do not expect this new case. To avoid changing
them all in a hurry, make an inline wrapper function (using the old
name) that masks off the new bit, and use the extended interface with
double underscores.
Yes, we do have a call to do_wp_page from do_swap_page, but no need to
change that: in the rare case it's needed, another do_wp_page will follow.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
[ Cleanups by Nick Piggin ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-03 19:24:01 +09:00
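The wrapper described in that 2005 log was essentially the following (a historical sketch; the present-day prototypes above take different arguments):

static inline int handle_mm_fault_2005(struct mm_struct *mm,
		struct vm_area_struct *vma, unsigned long address,
		int write_access)
{
	/* old callers with switch statements never see the new bit */
	return __handle_mm_fault(mm, vma, address, write_access) &
				(~VM_FAULT_WRITE);
}

Only get_user_pages() called the double-underscore interface directly and acted on VM_FAULT_WRITE.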
|
|
|
|
2018-02-01 09:17:36 +09:00
|
|
|
static inline void unmap_shared_mapping_range(struct address_space *mapping,
|
|
|
|
loff_t const holebegin, loff_t const holelen)
|
|
|
|
{
|
|
|
|
unmap_mapping_range(mapping, holebegin, holelen, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
|
|
|
|
void *buf, int len, unsigned int gup_flags);
|
2011-03-14 04:49:20 +09:00
|
|
|
extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
|
2016-10-13 09:20:19 +09:00
|
|
|
void *buf, int len, unsigned int gup_flags);
|
2016-11-23 03:06:50 +09:00
|
|
|
extern int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
|
|
|
|
unsigned long addr, void *buf, int len, unsigned int gup_flags);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2016-02-13 06:01:54 +09:00
|
|
|
long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
|
|
|
|
unsigned long start, unsigned long nr_pages,
|
2016-10-13 09:20:17 +09:00
|
|
|
unsigned int gup_flags, struct page **pages,
|
2016-12-15 08:06:52 +09:00
|
|
|
struct vm_area_struct **vmas, int *locked);
|
mm/gup: Remove the macro overload API migration helpers from the get_user*() APIs
The pkeys changes brought about a truly hideous set of macros in:
cde70140fed8 ("mm/gup: Overload get_user_pages() functions")
... which macros are (ab-)using the fact that __VA_ARGS__ can be used
to shift parameter positions in macro arguments without breaking the
build and so can be used to call separate C functions depending on
the number of arguments of the macro.
This allowed easy migration of these 3 GUP APIs, as both these variants
worked at the C level:
old:
ret = get_user_pages(current, current->mm, address, 1, 1, 0, &page, NULL);
new:
ret = get_user_pages(address, 1, 1, 0, &page, NULL);
... while we also generated a (functionally harmless but noticeable) build
time warning if the old API was used. As there are over 300 uses of these
APIs, this trick eased the migration of the API and avoided excessive
migration pain in linux-next.
Now, with its work done, get rid of all of that complication and ugliness:
3 files changed, 16 insertions(+), 140 deletions(-)
... where the linecount of the migration hack was further inflated by the
fact that there are NOMMU variants of these GUP APIs as well.
Much of the conversion was done in linux-next over the past couple of months,
and Linus recently removed all remaining old API uses from the upstream tree
in the following upstream commit:
cb107161df3c ("Convert straggling drivers to new six-argument get_user_pages()")
There was one more old-API usage in mm/gup.c, in the CONFIG_HAVE_GENERIC_RCU_GUP
code path that ARM, ARM64 and PowerPC use.
After this commit any old API usage will break the build.
[ Also fixed a PowerPC/HAVE_GENERIC_RCU_GUP warning reported by Stephen Rothwell. ]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-04 17:24:58 +09:00
|
|
|
long get_user_pages(unsigned long start, unsigned long nr_pages,
|
2016-10-13 09:20:16 +09:00
|
|
|
unsigned int gup_flags, struct page **pages,
|
2016-02-13 06:01:55 +09:00
|
|
|
struct vm_area_struct **vmas);
|
2016-04-04 17:24:58 +09:00
|
|
|
long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
|
2016-10-13 09:20:14 +09:00
|
|
|
unsigned int gup_flags, struct page **pages, int *locked);
|
2016-04-04 17:24:58 +09:00
|
|
|
long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
			     struct page **pages, unsigned int gup_flags);
int get_user_pages_fast(unsigned long start, int nr_pages,
			unsigned int gup_flags, struct page **pages);
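A hedged usage sketch for the fast-GUP entry point above (the surrounding helper is hypothetical; the pinned page is released with put_page()):
/* Pin the page backing 'addr' for writing, use it, then release it. */
static int touch_user_page(unsigned long addr)
{
	struct page *page;
	int ret;

	ret = get_user_pages_fast(addr & PAGE_MASK, 1, FOLL_WRITE, &page);
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	/* ... map and access the page here ... */

	put_page(page);		/* drop the reference taken by GUP */
	return 0;
}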
int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
			struct task_struct *task, bool bypass_rlim);
/* Container for pinned pfns / pages */
struct frame_vector {
	unsigned int nr_allocated;	/* Number of frames we have space for */
	unsigned int nr_frames;	/* Number of frames stored in ptrs array */
	bool got_ref;		/* Did we pin pages by getting page ref? */
	bool is_pfns;		/* Does array contain pages or pfns? */
	void *ptrs[0];		/* Array of pinned pfns / pages. Use
				 * pfns_vector_pages() or pfns_vector_pfns()
				 * for access */
};

struct frame_vector *frame_vector_create(unsigned int nr_frames);
void frame_vector_destroy(struct frame_vector *vec);
int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
		     unsigned int gup_flags, struct frame_vector *vec);
void put_vaddr_frames(struct frame_vector *vec);
int frame_vector_to_pages(struct frame_vector *vec);
void frame_vector_to_pfns(struct frame_vector *vec);
static inline unsigned int frame_vector_count(struct frame_vector *vec)
{
	return vec->nr_frames;
}

static inline struct page **frame_vector_pages(struct frame_vector *vec)
{
	if (vec->is_pfns) {
		int err = frame_vector_to_pages(vec);

		if (err)
			return ERR_PTR(err);
	}
	return (struct page **)(vec->ptrs);
}

static inline unsigned long *frame_vector_pfns(struct frame_vector *vec)
{
	if (!vec->is_pfns)
		frame_vector_to_pfns(vec);
	return (unsigned long *)(vec->ptrs);
}
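For orientation, a hedged sketch of how the frame_vector API above is typically driven (error handling trimmed; the caller and the gup_flags choice are illustrative, not taken from a specific in-tree user):
/* Pin nr user pages starting at 'start' and hand back the frame vector. */
static struct frame_vector *pin_user_range(unsigned long start,
					   unsigned int nr)
{
	struct frame_vector *vec = frame_vector_create(nr);
	int ret;

	if (!vec)
		return ERR_PTR(-ENOMEM);

	ret = get_vaddr_frames(start, nr, FOLL_WRITE, vec);
	if (ret < 0 || frame_vector_count(vec) < nr) {
		put_vaddr_frames(vec);
		frame_vector_destroy(vec);
		return ERR_PTR(ret < 0 ? ret : -EFAULT);
	}
	/* frame_vector_pages() converts pfns to struct page pointers if needed */
	return vec;
}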
struct kvec;
int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
		     struct page **pages);
int get_kernel_page(unsigned long start, int write, struct page **pages);
struct page *get_dump_page(unsigned long addr);

extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
extern void do_invalidatepage(struct page *page, unsigned int offset,
			      unsigned int length);

void __set_page_dirty(struct page *, struct address_space *, int warn);
int __set_page_dirty_nobuffers(struct page *page);
int __set_page_dirty_no_writeback(struct page *page);
int redirty_page_for_writepage(struct writeback_control *wbc,
			       struct page *page);
void account_page_dirtied(struct page *page, struct address_space *mapping);
memcg: add per cgroup dirty page accounting
When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
per memcg memory.stat cgroupfs file. The most recent past attempt at
this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback. It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).
The ability to move page accounting between memcg
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter. The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.
Typical update operation:
memcg = mem_cgroup_begin_page_stat(page)
if (TestSetPageDirty()) {
[...]
mem_cgroup_update_page_stat(memcg)
}
mem_cgroup_end_page_stat(memcg)
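Fleshing the pseudocode out slightly, a hedged C sketch of the same pattern (the argument list of mem_cgroup_update_page_stat() and the surrounding helper are assumptions based on the names in this message, not verified signatures):
static void set_page_dirty_accounted(struct page *page)
{
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_begin_page_stat(page);	/* serialize vs. memcg move */
	if (!TestSetPageDirty(page)) {
		/* page went clean -> dirty: bump global and per-memcg stats */
		__inc_zone_page_state(page, NR_FILE_DIRTY);
		mem_cgroup_update_page_stat(memcg, MEM_CGROUP_STAT_DIRTY, 1);
	}
	mem_cgroup_end_page_stat(memcg);
}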
Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
rcu_read_lock()
- With CONFIG_MEMCG and inter memcg task movement, it's
rcu_read_lock() + spin_lock_irqsave()
A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat(), which returns the memcg that is
later needed by mem_cgroup_update_page_stat().
Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
__mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
__delete_from_page_cache(), replace_page_cache_page(),
invalidate_complete_page2(), and __remove_mapping().
text data bss dec hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
+192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
+773 text bytes
Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for
all metrics, they're all wall clock or cycle counts. The read and write
fault benchmarks just measure fault time, they do not include I/O time.
* CONFIG_MEMCG not set:
baseline patched
kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)
* CONFIG_MEMCG=y root_memcg:
baseline patched
kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)
* CONFIG_MEMCG=y non-root_memcg:
baseline patched
kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)
As expected anon page faults are not affected by this patch.
tj: Updated to apply on top of the recent cancel_dirty_page() changes.
Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
void account_page_cleaned(struct page *page, struct address_space *mapping,
			  struct bdi_writeback *wb);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
void __cancel_dirty_page(struct page *page);
static inline void cancel_dirty_page(struct page *page)
{
	/* Avoid atomic ops, locking, etc. when not actually needed. */
	if (PageDirty(page))
		__cancel_dirty_page(page);
}

int clear_page_dirty_for_io(struct page *page);

int get_cmdline(struct task_struct *task, char *buffer, int buflen);
extern unsigned long move_page_tables(struct vm_area_struct *vma,
		unsigned long old_addr, struct vm_area_struct *new_vma,
		unsigned long new_addr, unsigned long len,
		bool need_rmap_locks);
extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
			      unsigned long end, pgprot_t newprot,
			      int dirty_accountable, int prot_numa);
extern int mprotect_fixup(struct vm_area_struct *vma,
			  struct vm_area_struct **pprev, unsigned long start,
			  unsigned long end, unsigned long newflags);

/*
 * doesn't attempt to fault and will return short.
 */
int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
			  struct page **pages);
/*
 * per-process(per-mm_struct) statistics.
 */
static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
{
	long val = atomic_long_read(&mm->rss_stat.count[member]);

#ifdef SPLIT_RSS_COUNTING
	/*
	 * counter is updated in asynchronous manner and may go to minus.
	 * But it's never be expected number for users.
	 */
	if (val < 0)
		val = 0;
#endif
	return (unsigned long)val;
}

void mm_trace_rss_stat(struct mm_struct *mm, int member, long count,
		       long value);
UPSTREAM: mm: emit tracepoint when RSS changes
Useful to track how RSS is changing per TGID to detect spikes in RSS and
memory hogs. Several Android teams have been using this patch in various
kernel trees for half a year now. Many reported to me it is really
useful so I'm posting it upstream.
Initial patch developed by Tim Murray. Changes I made from original patch:
o Prevent any additional space consumed by mm_struct.
Regarding the concern that the RSS may change too often and thus flood the
traces - note that there is already some "hysteresis" here. That
is, we update the counter only if we receive 64 page faults due to
SPLIT_RSS_ACCOUNTING. However, during zapping or copying of pte range,
the RSS is updated immediately which can become noisy/flooding. In a
previous discussion, we agreed that BPF or ftrace can be used to rate
limit the signal if this becomes an issue.
Also note that I added wrappers to trace_rss_stat to prevent compiler
errors where linux/mm.h is included from tracing code, causing errors
such as:
CC kernel/trace/power-traces.o
In file included from ./include/trace/define_trace.h:102,
from ./include/trace/events/kmem.h:342,
from ./include/linux/mm.h:31,
from ./include/linux/ring_buffer.h:5,
from ./include/linux/trace_events.h:6,
from ./include/trace/events/power.h:12,
from kernel/trace/power-traces.c:15:
./include/trace/trace_events.h:113:22: error: field ‘ent’ has incomplete type
struct trace_entry ent; \
Change-Id: I26e72db29ec89a305c29a50159279339067f4983
Link: http://lore.kernel.org/r/20190903200905.198642-1-joel@joelfernandes.org
Acked-by: Michal Hocko <mhocko@suse.com>
Co-developed-by: Tim Murray <timmurray@google.com>
Signed-off-by: Tim Murray <timmurray@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Joel Fernandes <joelaf@google.com>
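A hedged sketch of what such an out-of-line wrapper looks like (kept in a .c file so that linux/mm.h itself never has to include the trace headers; the tracepoint name and argument list here are assumptions, not verified against the tree):
/* mm/memory.c (sketch): thin wrapper so mm.h need not see trace headers. */
void mm_trace_rss_stat(struct mm_struct *mm, int member, long count,
		       long value)
{
	trace_rss_stat(mm, member, count);
}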
static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
{
	long count = atomic_long_add_return(value, &mm->rss_stat.count[member]);

	mm_trace_rss_stat(mm, member, count, value);
}
static inline void inc_mm_counter(struct mm_struct *mm, int member)
{
	long count = atomic_long_inc_return(&mm->rss_stat.count[member]);

	mm_trace_rss_stat(mm, member, count, 1);
}
static inline void dec_mm_counter(struct mm_struct *mm, int member)
{
	long count = atomic_long_dec_return(&mm->rss_stat.count[member]);

	mm_trace_rss_stat(mm, member, count, -1);
}
/* Optimized variant when page is already known not to be PageAnon */
static inline int mm_counter_file(struct page *page)
{
	if (PageSwapBacked(page))
		return MM_SHMEMPAGES;
	return MM_FILEPAGES;
}

static inline int mm_counter(struct page *page)
{
	if (PageAnon(page))
		return MM_ANONPAGES;
	return mm_counter_file(page);
}
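As a hedged illustration of how these helpers compose (the wrapper function is hypothetical; only mm_counter(), inc_mm_counter() and dec_mm_counter() come from the declarations above):
/* Hypothetical helper: adjust the right per-mm counter for one mapping. */
static void account_one_mapping(struct mm_struct *mm, struct page *page,
				bool mapped)
{
	/* mm_counter() picks MM_ANONPAGES, MM_SHMEMPAGES or MM_FILEPAGES. */
	int member = mm_counter(page);

	if (mapped)
		inc_mm_counter(mm, member);	/* page gained a PTE in this mm */
	else
		dec_mm_counter(mm, member);	/* page lost a PTE in this mm */
}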
static inline unsigned long get_mm_rss(struct mm_struct *mm)
{
	return get_mm_counter(mm, MM_FILEPAGES) +
		get_mm_counter(mm, MM_ANONPAGES) +
		get_mm_counter(mm, MM_SHMEMPAGES);
}
static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
{
	return max(mm->hiwater_rss, get_mm_rss(mm));
}

static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
{
	return max(mm->hiwater_vm, mm->total_vm);
}

static inline void update_hiwater_rss(struct mm_struct *mm)
{
	unsigned long _rss = get_mm_rss(mm);

	if ((mm)->hiwater_rss < _rss)
		(mm)->hiwater_rss = _rss;
}

static inline void update_hiwater_vm(struct mm_struct *mm)
{
	if (mm->hiwater_vm < mm->total_vm)
		mm->hiwater_vm = mm->total_vm;
}

static inline void reset_mm_hiwater_rss(struct mm_struct *mm)
{
	mm->hiwater_rss = get_mm_rss(mm);
}

static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
					 struct mm_struct *mm)
{
	unsigned long hiwater_rss = get_mm_hiwater_rss(mm);

	if (*maxrss < hiwater_rss)
		*maxrss = hiwater_rss;
}
#if defined(SPLIT_RSS_COUNTING)
void sync_mm_rss(struct mm_struct *mm);
#else
static inline void sync_mm_rss(struct mm_struct *mm)
{
}
#endif

#ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
static inline int pte_devmap(pte_t pte)
{
	return 0;
}
#endif
int vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);

extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
			       spinlock_t **ptl);
static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
				    spinlock_t **ptl)
{
	pte_t *ptep;
	__cond_lock(*ptl, ptep = __get_locked_pte(mm, addr, ptl));
	return ptep;
}
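A hedged usage sketch for get_locked_pte() (the caller is hypothetical; note the pte is returned mapped and locked, so it must be released with pte_unmap_unlock(), defined further down in this header):
static bool addr_has_present_pte(struct mm_struct *mm, unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte = get_locked_pte(mm, addr, &ptl);
	bool present;

	if (!pte)
		return false;		/* page table allocation failed */
	present = pte_present(*pte);
	pte_unmap_unlock(pte, ptl);
	return present;
}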
#ifdef __PAGETABLE_P4D_FOLDED
static inline int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
						unsigned long address)
{
	return 0;
}
#else
int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);
#endif
#if defined(__PAGETABLE_PUD_FOLDED) || !defined(CONFIG_MMU)
static inline int __pud_alloc(struct mm_struct *mm, p4d_t *p4d,
						unsigned long address)
{
	return 0;
}
static inline void mm_inc_nr_puds(struct mm_struct *mm) {}
static inline void mm_dec_nr_puds(struct mm_struct *mm) {}

#else
int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address);

static inline void mm_inc_nr_puds(struct mm_struct *mm)
{
	if (mm_pud_folded(mm))
		return;
	atomic_long_add(PTRS_PER_PUD * sizeof(pud_t), &mm->pgtables_bytes);
}

static inline void mm_dec_nr_puds(struct mm_struct *mm)
{
	if (mm_pud_folded(mm))
		return;
	atomic_long_sub(PTRS_PER_PUD * sizeof(pud_t), &mm->pgtables_bytes);
}
#endif
#if defined(__PAGETABLE_PMD_FOLDED) || !defined(CONFIG_MMU)
static inline int __pmd_alloc(struct mm_struct *mm, pud_t *pud,
						unsigned long address)
{
	return 0;
}
mm: account pmd page tables to the process
Dave noticed that unprivileged process can allocate significant amount of
memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
memory cgroup. The trick is to allocate a lot of PMD page tables. Linux
kernel doesn't account PMD tables to the process, only PTE.
The use-cases below use few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low. oom_score for the process will be 0.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#define PUD_SIZE (1UL << 30)
#define PMD_SIZE (1UL << 21)
#define NR_PUD 130000
int main(void)
{
	char *addr = NULL;
	unsigned long i;

	prctl(PR_SET_THP_DISABLE);
	for (i = 0; i < NR_PUD ; i++) {
		addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
				MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			break;
		}
		*addr = 'x';
		munmap(addr, PMD_SIZE);
		mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
				MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
		if (addr == MAP_FAILED)
			perror("re-mmap"), exit(1);
	}
	printf("PID %d consumed %lu KiB in PMD page tables\n",
			getpid(), i * 4096 >> 10);
	return pause();
}
The patch addresses the issue by accounting PMD tables to the process the
same way we account PTEs.
The main places where PMD tables are accounted are __pmd_alloc() and
free_pmd_range(). But there are a few corner cases:
- HugeTLB can share PMD page tables. The patch handles this by accounting
the table to all processes who share it.
- x86 PAE pre-allocates a few PMD tables on fork.
- Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the sanity
check on exit(2).
Accounting only happens on configurations where the PMD page table level is
present (PMD is not folded). As with nr_ptes we use a per-mm counter. The
counter value is used to calculate the baseline for the badness score by the
oom-killer.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
static inline void mm_inc_nr_pmds(struct mm_struct *mm) {}
static inline void mm_dec_nr_pmds(struct mm_struct *mm) {}

#else
int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
static inline void mm_inc_nr_pmds(struct mm_struct *mm)
{
	if (mm_pmd_folded(mm))
		return;
	atomic_long_add(PTRS_PER_PMD * sizeof(pmd_t), &mm->pgtables_bytes);
}
static inline void mm_dec_nr_pmds(struct mm_struct *mm)
{
	if (mm_pmd_folded(mm))
		return;
	atomic_long_sub(PTRS_PER_PMD * sizeof(pmd_t), &mm->pgtables_bytes);
}
#endif
#ifdef CONFIG_MMU
static inline void mm_pgtables_bytes_init(struct mm_struct *mm)
{
	atomic_long_set(&mm->pgtables_bytes, 0);
}

static inline unsigned long mm_pgtables_bytes(const struct mm_struct *mm)
{
	return atomic_long_read(&mm->pgtables_bytes);
}

static inline void mm_inc_nr_ptes(struct mm_struct *mm)
{
	atomic_long_add(PTRS_PER_PTE * sizeof(pte_t), &mm->pgtables_bytes);
}

static inline void mm_dec_nr_ptes(struct mm_struct *mm)
{
	atomic_long_sub(PTRS_PER_PTE * sizeof(pte_t), &mm->pgtables_bytes);
}

#else

static inline void mm_pgtables_bytes_init(struct mm_struct *mm) {}
static inline unsigned long mm_pgtables_bytes(const struct mm_struct *mm)
{
	return 0;
}

static inline void mm_inc_nr_ptes(struct mm_struct *mm) {}
static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
#endif
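The pmd-accounting changelog earlier notes that these per-mm page-table counters feed the oom-killer's badness baseline; a hedged sketch of that calculation (simplified and not the exact upstream oom_badness() code) looks like:
/* Sketch: approximate memory footprint the oom-killer starts from. */
static unsigned long oom_footprint(struct mm_struct *mm)
{
	return get_mm_rss(mm) +				/* resident pages   */
	       get_mm_counter(mm, MM_SWAPENTS) +	/* swap entries     */
	       mm_pgtables_bytes(mm) / PAGE_SIZE;	/* page-table pages */
}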
mm: treewide: remove unused address argument from pte_alloc functions
Patch series "Add support for fast mremap".
This series speeds up the mremap(2) syscall by copying page tables at
the PMD level even for non-THP systems. There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle architecture related in the future that may make the scheme not
work. Also we find that there is no point in passing the 'address' to
pte_alloc since its unused. This patch therefore removes this argument
tree-wide resulting in a nice negative diff as well. Also ensuring
along the way that the enabled architectures do not do anything funky
with the 'address' argument that goes unnoticed by the optimization.
Build and boot tested on x86-64. Build tested on arm64. The config
enablement patch for arm64 will be posted in the future after more
testing.
The changes were obtained by applying the following Coccinelle script.
(thanks Julia for answering all Coccinelle questions!).
Following fix ups were done manually:
* Removal of address argument from pte_fragment_alloc
* Removal of pte_alloc_one_fast definitions from m68k and microblaze.
// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.
virtual patch
@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@
fn(...
- , T2 E2
)
{ ... }
@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)
@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)
@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
fn(...
-, E2
)
@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@
(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)
Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 08:28:34 +09:00
int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
int __pte_alloc_kernel(pmd_t *pmd);

/*
 * The following ifdef needed to get the 4level-fixup.h header to work.
 * Remove it when 4level-fixup.h has been removed.
 */
#if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK)

#ifndef __ARCH_HAS_5LEVEL_HACK
static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
		unsigned long address)
{
	return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address)) ?
		NULL : p4d_offset(pgd, address);
}

static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d,
		unsigned long address)
{
	return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
		NULL : pud_offset(p4d, address);
}
#endif /* !__ARCH_HAS_5LEVEL_HACK */

static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
{
	return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))?
		NULL: pmd_offset(pud, address);
}
#endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
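For context, a hedged sketch of how these allocation helpers are typically chained when walking down to a PMD for a faulting address (modelled loosely on the fault path; the wrapper function is hypothetical and error handling is reduced to returning NULL):
/* Sketch: allocate the intermediate levels down to a PMD for 'address'. */
static pmd_t *walk_to_pmd(struct mm_struct *mm, unsigned long address)
{
	pgd_t *pgd = pgd_offset(mm, address);	/* top level always exists */
	p4d_t *p4d;
	pud_t *pud;

	p4d = p4d_alloc(mm, pgd, address);	/* may allocate, NULL on OOM */
	if (!p4d)
		return NULL;
	pud = pud_alloc(mm, p4d, address);
	if (!pud)
		return NULL;
	return pmd_alloc(mm, pud, address);
}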
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
void __init ptlock_cache_init(void);
extern bool ptlock_alloc(struct page *page);
extern void ptlock_free(struct page *page);

static inline spinlock_t *ptlock_ptr(struct page *page)
{
	return page->ptl;
}
#else /* ALLOC_SPLIT_PTLOCKS */
static inline void ptlock_cache_init(void)
{
}

static inline bool ptlock_alloc(struct page *page)
{
	return true;
}

static inline void ptlock_free(struct page *page)
{
}

static inline spinlock_t *ptlock_ptr(struct page *page)
{
	return &page->ptl;
}
#endif /* ALLOC_SPLIT_PTLOCKS */

static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return ptlock_ptr(pmd_page(*pmd));
}
static inline bool ptlock_init(struct page *page)
{
	/*
	 * prep_new_page() initialize page->private (and therefore page->ptl)
	 * with 0. Make sure nobody took it in use in between.
	 *
	 * It can happen if arch try to use slab for page table allocation:
	 * slab code uses page->slab_cache, which share storage with page->ptl.
	 */
	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
	if (!ptlock_alloc(page))
		return false;
	spin_lock_init(ptlock_ptr(page));
	return true;
}

#else /* !USE_SPLIT_PTE_PTLOCKS */
[PATCH] mm: split page table lock
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
a many-threaded application which concurrently initializes different parts of
a large anonymous area.
This patch corrects that, by using a separate spinlock per page table page, to
guard the page table entries in that page, instead of using the mm's single
page_table_lock. (But even then, page_table_lock is still used to guard page
table allocation, and anon_vma allocation.)
In this implementation, the spinlock is tucked inside the struct page of the
page table page: with a BUILD_BUG_ON in case it overflows - which it would in
the case of 32-bit PA-RISC with spinlock debugging enabled.
Splitting the lock is not quite for free: another cacheline access. Ideally,
I suppose we would use split ptlock only for multi-threaded processes on
multi-cpu machines; but deciding that dynamically would have its own costs.
So for now enable it by config, at some number of cpus - since the Kconfig
language doesn't support inequalities, let preprocessor compare that with
NR_CPUS. But I don't think it's worth being user-configurable: for good
testing of both split and unsplit configs, split now at 4 cpus, and perhaps
change that to 8 later.
There is a benefit even for singly threaded processes: kswapd can be attacking
one part of the mm while another part is busy faulting.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/*
 * We use mm->page_table_lock to guard all pagetable pages of the mm.
 */
static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return &mm->page_table_lock;
}
static inline void ptlock_cache_init(void) {}
static inline bool ptlock_init(struct page *page) { return true; }
static inline void ptlock_free(struct page *page) {}
#endif /* USE_SPLIT_PTE_PTLOCKS */
static inline void pgtable_init(void)
{
	ptlock_cache_init();
	pgtable_cache_init();
}
static inline bool pgtable_pte_page_ctor(struct page *page)
{
	if (!ptlock_init(page))
		return false;
	__SetPageTable(page);
	inc_zone_page_state(page, NR_PAGETABLE);
	return true;
}

static inline void pgtable_pte_page_dtor(struct page *page)
{
	ptlock_free(page);
	__ClearPageTable(page);
	dec_zone_page_state(page, NR_PAGETABLE);
}
#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
({							\
	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
	pte_t *__pte = pte_offset_map(pmd, address);	\
	*(ptlp) = __ptl;				\
	spin_lock(__ptl);				\
	__pte;						\
})

#define pte_unmap_unlock(pte, ptl)	do {		\
	spin_unlock(ptl);				\
	pte_unmap(pte);					\
} while (0)
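A hedged usage sketch of the locking pair above (the walker function and the pmd it is handed are hypothetical; the pattern itself - map and lock, inspect, then unmap and unlock - is what the macros are for):
/* Sketch: count present PTEs in one page table under its split lock. */
static int count_present_ptes(struct mm_struct *mm, pmd_t *pmd,
			      unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *start_pte, *pte;
	int present = 0;

	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	for (pte = start_pte; addr < end; addr += PAGE_SIZE, pte++)
		if (pte_present(*pte))
			present++;
	pte_unmap_unlock(start_pte, ptl);	/* unmap the table, drop the lock */
	return present;
}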
#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
#define pte_alloc_map(mm, pmd, address) \
(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
#define pte_alloc_map_lock(mm, pmd, address, ptlp) \
|
	(pte_alloc(mm, pmd) ?				\
		NULL : pte_offset_map_lock(mm, pmd, address, ptlp))

#define pte_alloc_kernel(pmd, address)			\
	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
		NULL: pte_offset_kernel(pmd, address))

#if USE_SPLIT_PMD_PTLOCKS

static struct page *pmd_to_page(pmd_t *pmd)
{
	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
	return virt_to_page((void *)((unsigned long) pmd & mask));
}

static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return ptlock_ptr(pmd_to_page(pmd));
}

static inline bool pgtable_pmd_page_ctor(struct page *page)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	page->pmd_huge_pte = NULL;
#endif
	return ptlock_init(page);
}

static inline void pgtable_pmd_page_dtor(struct page *page)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
#endif
	ptlock_free(page);
}

#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)

#else

static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return &mm->page_table_lock;
}

static inline bool pgtable_pmd_page_ctor(struct page *page) { return true; }
static inline void pgtable_pmd_page_dtor(struct page *page) {}

#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)

#endif

static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
{
	spinlock_t *ptl = pmd_lockptr(mm, pmd);
	spin_lock(ptl);
	return ptl;
}

/*
 * No scalability reason to split PUD locks yet, but follow the same pattern
 * as the PMD locks to make it easier if we decide to. The VM should not be
 * considered ready to switch to split PUD locks yet; there may be places
 * which need to be converted from page_table_lock.
 */
static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
{
	return &mm->page_table_lock;
}

static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
{
	spinlock_t *ptl = pud_lockptr(mm, pud);

	spin_lock(ptl);
	return ptl;
}

extern void __init pagecache_init(void);
extern void free_area_init(unsigned long * zones_size);
mm/page_alloc: Introduce free_area_init_core_hotplug
Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core(). But there is
some code that we do not really need to run when we are coming from such
path.
free_area_init_core() performs the following actions:
1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
when creating hash tables.
3) Account the number of managed_pages per zone, subtracting dma_reserved and
memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zone
When called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.
Action #2 is pointless as the zones do not have any pages since either the
node was freed, or we are re-using it; either way, all zones belonging to
this node should have 0 pages. For the same reason, action #3 always
results in managed_pages being 0.
Action #5 and #6 are performed later on when onlining the pages:
online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
online_pages()->move_pfn_range_to_zone()->memmap_init_zone()
This patch does two things:
First, it moves the node/zone initialization into their own functions, which
allows us to create a small version of free_area_init_core, where we only
perform:
1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure data
These two functions are: pgdat_init_internals() and zone_init_internals().
The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():
Currently, we call free_area_init_node() from the memhotplug path. In
there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.
Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point to calculate
this now.
Actually, we re-set these values to 0 later on with the calls to:
reset_node_managed_pages()
reset_node_present_pages()
The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()
Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
__paginginit with __init, so their code gets freed up.
[osalvador@techadventures.net: fix section usage]
Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 13:53:43 +09:00
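A minimal sketch of the split described above follows; the helper bodies are simplified and the real functions take more arguments and initialise many more fields.
/* Sketch only: simplified versions of the helpers named in the message. */
static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
{
	/* Action #1: pgdat-internal state (spinlocks, waitqueues, ...). */
	pgdat_resize_init(pgdat);
	init_waitqueue_head(&pgdat->kswapd_wait);
}

static void __meminit zone_init_internals(struct zone *zone, enum zone_type idx,
					  int nid, unsigned long remaining_pages)
{
	/* Action #4: basic zone bookkeeping. */
	zone->managed_pages = remaining_pages;
	zone->name = zone_names[idx];
	zone->zone_pgdat = NODE_DATA(nid);
}

/*
 * Hotplug variant: only actions #1 and #4.  Page accounting, freelists and
 * memmap initialisation happen later, when the pages are onlined.
 */
void __ref free_area_init_core_hotplug(int nid)
{
	enum zone_type z;
	pg_data_t *pgdat = NODE_DATA(nid);

	pgdat_init_internals(pgdat);
	for (z = 0; z < MAX_NR_ZONES; z++)
		zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
}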
extern void __init free_area_init_node(int nid, unsigned long * zones_size,
		unsigned long zone_start_pfn, unsigned long *zholes_size);
extern void free_initmem(void);

/*
 * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
 * into the buddy system. The freed pages will be poisoned with pattern
 * "poison" if it's within range [0, UCHAR_MAX].
 * Return pages freed into the buddy system.
 */
mm: change signature of free_reserved_area() to fix building warnings
Change signature of free_reserved_area() according to Russell King's
suggestion to fix following build warnings:
arch/arm/mm/init.c: In function 'mem_init':
arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
^
In file included from include/linux/mman.h:4:0,
from arch/arm/mm/init.c:15:
include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
extern unsigned long free_reserved_area(unsigned long start, unsigned long end,
mm/page_alloc.c: In function 'free_reserved_area':
>> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
In file included from arch/mips/include/asm/page.h:49:0,
from include/linux/mmzone.h:20,
from include/linux/gfp.h:4,
from include/linux/mm.h:8,
from mm/page_alloc.c:18:
arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
mm/page_alloc.c: In function 'free_area_init_nodes':
mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]
Also address some minor code review comments.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-04 07:02:48 +09:00
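With the pointer-based signature (the corrected declaration follows below), the arm call site quoted in the warning compiles without casts, roughly:
/* Illustrative call, taken from the warning above; poison value unchanged. */
free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);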
extern unsigned long free_reserved_area(void *start, void *end,
					int poison, const char *s);

#ifdef CONFIG_HIGHMEM
/*
 * Free a highmem page into the buddy system, adjusting totalhigh_pages
 * and totalram_pages.
 */
extern void free_highmem_page(struct page *page);
#endif

extern void adjust_managed_page_count(struct page *page, long count);
extern void mem_init_print_info(const char *str);

extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);

/* Free the reserved page into the buddy system, so it gets managed. */
static inline void __free_reserved_page(struct page *page)
{
	ClearPageReserved(page);
	init_page_count(page);
	__free_page(page);
}

static inline void free_reserved_page(struct page *page)
{
	__free_reserved_page(page);
	adjust_managed_page_count(page, 1);
}

static inline void mark_page_reserved(struct page *page)
{
	SetPageReserved(page);
	adjust_managed_page_count(page, -1);
}

/*
 * Default method to free all the __init memory into the buddy system.
 * The freed pages will be poisoned with pattern "poison" if it's within
 * range [0, UCHAR_MAX].
 * Return pages freed into the buddy system.
 */
static inline unsigned long free_initmem_default(int poison)
{
	extern char __init_begin[], __init_end[];

	return free_reserved_area(&__init_begin, &__init_end,
				  poison, "unused kernel");
}

static inline unsigned long get_num_physpages(void)
{
	int nid;
	unsigned long phys_pages = 0;

	for_each_online_node(nid)
		phys_pages += node_present_pages(nid);

	return phys_pages;
}

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
[PATCH] Introduce mechanism for registering active regions of memory
At a basic level, architectures define structures to record where active
ranges of page frames are located. Once located, the code to calculate zone
sizes and holes in each architecture is very similar. Some of this zone and
hole sizing code is difficult to read for no good reason. This set of patches
eliminates the similar-looking architecture-specific code.
The patches introduce a mechanism where architectures register where the
active ranges of page frames are with add_active_range(). When all areas have
been discovered, free_area_init_nodes() is called to initialise the pgdat and
zones. The zone sizes and holes are then calculated in an architecture
independent manner.
Patch 1 introduces the mechanism for registering and initialising PFN ranges
Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
It adjusts the watermarks slightly
Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
gensparse_defconfig and defconfig. Bob Picco has also tested and debugged on
IA64. Jack Steiner successfully boot tested on a mammoth SGI IA64-based
machine. These were on patches against 2.6.17-rc1 and release 3 of these
patches but there have been no ia64-changes since release 3.
There are differences in the zone sizes for x86_64 as the arch-specific code
for x86_64 accounts the kernel image and the starting mem_maps as memory holes
but the architecture-independent code accounts the memory as present.
The big benefit of this set of patches is a sizable reduction of
architecture-specific code, some of which is very hairy. There should be a
greater reduction when other architectures use the same mechanisms for zone
and hole sizing but I lack the hardware to test on.
Additional credit;
Dave Hansen for the initial suggestion and comments on early patches
Andy Whitcroft for reviewing early versions and catching numerous
errors
Tony Luck for testing and debugging on IA64
Bob Picco for fixing bugs related to pfn registration, reviewing a
number of patch revisions, providing a number of suggestions
on future direction and testing heavily
Jack Steiner and Robin Holt for testing on IA64 and clarifying
issues related to memory holes
Yasunori for testing on IA64
Andi Kleen for reviewing and feeding back about x86_64
Christian Kujau for providing valuable information related to ACPI
problems on x86_64 and testing potential fixes
This patch:
Define the structure to represent an active range of page frames within a node
in an architecture independent manner. Architectures are expected to register
active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
free_area_init_nodes() passing the PFNs of the end of each zone.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 17:49:43 +09:00
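A minimal sketch of that registration flow, for a hypothetical architecture with two memory banks; names other than add_active_range() and free_area_init_nodes() are illustrative.
/* Sketch only: hypothetical arch code using the mechanism described above. */
static void __init example_arch_paging_init(void)
{
	unsigned long max_zone_pfns[MAX_NR_ZONES] = { 0 };

	/* Register each active PFN range with the node that owns it. */
	add_active_range(0, 0x00000, 0x20000);	/* node 0, first bank  */
	add_active_range(1, 0x80000, 0xa0000);	/* node 1, second bank */

	/* Tell the core where each zone ends; zone sizes and holes are
	 * then calculated in an architecture-independent manner. */
	max_zone_pfns[ZONE_DMA]    = 0x01000;
	max_zone_pfns[ZONE_NORMAL] = 0xa0000;
	free_area_init_nodes(max_zone_pfns);
}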
/*
 * With CONFIG_HAVE_MEMBLOCK_NODE_MAP set, an architecture may initialise its
 * zones, allocate the backing mem_map and account for memory holes in a more
 * architecture independent manner. This is a substitute for creating the
 * zone_sizes[] and zholes_size[] arrays and passing them to
 * free_area_init_node()
 *
 * An architecture is expected to register range of page frames backed by
 * physical memory with memblock_add[_node]() before calling
 * free_area_init_nodes() passing in the PFN each zone ends at. At a basic
 * usage, an architecture is expected to do something like
 *
 * unsigned long max_zone_pfns[MAX_NR_ZONES] = {max_dma, max_normal_pfn,
 *						 max_highmem_pfn};
 * for_each_valid_physical_page_range()
 *	memblock_add_node(base, size, nid)
 * free_area_init_nodes(max_zone_pfns);
 *
 * free_bootmem_with_active_regions() calls free_bootmem_node() for each
 * registered physical page range. Similarly
 * sparse_memory_present_with_active_regions() calls memory_present() for
 * each range when SPARSEMEM is enabled.
 *
 * See mm/page_alloc.c for more information on each function exposed by
 * CONFIG_HAVE_MEMBLOCK_NODE_MAP.
 */
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
x86, numa: Implement pfn -> nid mapping granularity check
SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use
sections array to map pfn to nid which is limited in granularity. If
NUMA nodes are laid out such that the mapping cannot be accurate, boot
will fail triggering BUG_ON() in mminit_verify_page_links().
On 32bit, it's 512MiB w/ PAE and SPARSEMEM. This seems to have been
granular enough until commit 2706a0bf7b (x86, NUMA: Enable
CONFIG_AMD_NUMA on 32bit too). Apparently, there is a machine which
aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT. This
led to the following BUG_ON().
On node 0 totalpages: 2096615
DMA zone: 32 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 3927 pages, LIFO batch:0
Normal zone: 1740 pages used for memmap
Normal zone: 220978 pages, LIFO batch:31
HighMem zone: 16405 pages used for memmap
HighMem zone: 1853533 pages, LIFO batch:31
BUG: Int 6: CR2 (null)
EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc
EBX f2400000 EDX 00000006 ECX (null) EAX 00000001
err (null) EIP c16209aa CS 00000060 flg 00010002
Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
(null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null)
f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null)
Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17
Call Trace:
[<c136b1e5>] ? early_fault+0x2e/0x2e
[<c16209aa>] ? mminit_verify_page_links+0x12/0x42
[<c1620613>] ? memmap_init_zone+0xaf/0x10c
[<c1620929>] ? free_area_init_node+0x2b9/0x2e3
[<c1607e99>] ? free_area_init_nodes+0x3f2/0x451
[<c1601d80>] ? paging_init+0x112/0x118
[<c15f578d>] ? setup_arch+0x791/0x82f
[<c15f43d9>] ? start_kernel+0x6a/0x257
This patch implements node_map_pfn_alignment(), which determines the
maximum internode alignment, and updates numa_register_memblks() to
reject the NUMA configuration if the alignment exceeds the pfn -> nid mapping
granularity of the memory model as determined by PAGES_PER_SECTION.
This makes the problematic machine boot w/ flatmem by rejecting the
NUMA config and provides protection against crazy NUMA configurations.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org
LKML-Reference: <20110628174613.GP478@escobedo.osrc.amd.com>
Reported-and-Tested-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Conny Seidel <conny.seidel@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2011-07-12 16:45:34 +09:00
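The rejection described above boils down to a check along these lines in numa_register_memblks(); this is a sketch, and the exact message and surrounding error handling differ.
/* Sketch of the granularity check, not the literal code. */
unsigned long pfn_align = node_map_pfn_alignment();

if (pfn_align && pfn_align < PAGES_PER_SECTION) {
	pr_warn("Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
		PFN_PHYS(pfn_align) >> 20, PFN_PHYS(PAGES_PER_SECTION) >> 20);
	return -EINVAL;
}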
unsigned long node_map_pfn_alignment(void);
x86: Fix checking of SRAT when node 0 ram is not from 0
Found one system that boots from socket1 instead of socket0; SRAT gets rejected...
[ 0.000000] SRAT: Node 1 PXM 0 0-a0000
[ 0.000000] SRAT: Node 1 PXM 0 100000-80000000
[ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
[ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
[ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
[ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
[ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
[ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
[ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
[ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
...
[ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040
[ 0.000000] NUMA: Using 20 for the hash shift.
[ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
[ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
[ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
[ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
[ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
[ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
[ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
[ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
[ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
[ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
[ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
[ 0.000000] SRAT: SRAT not used.
The early_node_map is not sorted because node 0, with a non-zero start, comes first,
so sort it right away after all regions are registered.
This also fixes a regression introduced by 8716273c (x86: Export srat physical topology).
-v2: make it more solid to handle cross node case like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
-v3: update comments.
Reported-and-tested-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4B2579D2.3010201@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-12-16 10:59:02 +09:00
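The sort described above can be expressed as something like the sketch below; the structure and variable names are assumed from the early_node_map era and may not match the actual patch exactly.
/* Sketch only: order early_node_map[] by start_pfn after registration. */
static int __init cmp_node_active_region(const void *a, const void *b)
{
	const struct node_active_region *ra = a, *rb = b;

	if (ra->start_pfn < rb->start_pfn)
		return -1;
	return ra->start_pfn > rb->start_pfn;
}

static void __init sort_node_map(void)
{
	sort(early_node_map, (size_t)nr_nodemap_entries,
	     sizeof(struct node_active_region),
	     cmp_node_active_region, NULL);
}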
unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
				unsigned long end_pfn);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
						unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
			unsigned long *start_pfn, unsigned long *end_pfn);
extern unsigned long find_min_pfn_with_active_regions(void);
extern void free_bootmem_with_active_regions(int nid,
						unsigned long max_low_pfn);
extern void sparse_memory_present_with_active_regions(int nid);
mm: clean up for early_pfn_to_nid()
What's happening is that the assertion in mm/page_alloc.c:move_freepages()
is triggering:
BUG_ON(page_zone(start_page) != page_zone(end_page));
Once I knew this is what was happening, I added some annotations:
if (unlikely(page_zone(start_page) != page_zone(end_page))) {
printk(KERN_ERR "move_freepages: Bogus zones: "
"start_page[%p] end_page[%p] zone[%p]\n",
start_page, end_page, zone);
printk(KERN_ERR "move_freepages: "
"start_zone[%p] end_zone[%p]\n",
page_zone(start_page), page_zone(end_page));
printk(KERN_ERR "move_freepages: "
"start_pfn[0x%lx] end_pfn[0x%lx]\n",
page_to_pfn(start_page), page_to_pfn(end_page));
printk(KERN_ERR "move_freepages: "
"start_nid[%d] end_nid[%d]\n",
page_to_nid(start_page), page_to_nid(end_page));
...
And here's what I got:
move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
move_freepages: start_nid[1] end_nid[0]
My memory layout on this box is:
[ 0.000000] Zone PFN ranges:
[ 0.000000] Normal 0x00000000 -> 0x0081ff5d
[ 0.000000] Movable zone start PFN for each node
[ 0.000000] early_node_map[8] active PFN ranges
[ 0.000000] 0: 0x00000000 -> 0x00020000
[ 0.000000] 1: 0x00800000 -> 0x0081f7ff
[ 0.000000] 1: 0x0081f800 -> 0x0081fe50
[ 0.000000] 1: 0x0081fed1 -> 0x0081fed8
[ 0.000000] 1: 0x0081feda -> 0x0081fedb
[ 0.000000] 1: 0x0081fedd -> 0x0081fee5
[ 0.000000] 1: 0x0081fee7 -> 0x0081ff51
[ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d
So it's a block move in that 0x81f600-->0x81f7ff region which triggers
the problem.
This patch:
Declaration of early_pfn_to_nid() is scattered over per-arch include
files, and it seems it's complicated to know when the declaration is used.
I think it makes fix-for-memmap-init not easy.
This patch moves all declaration to include/linux/mm.h
After this,
if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-> Use static definition in include/linux/mm.h
else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-> Use generic definition in mm/page_alloc.c
else
-> per-arch back end function will be called.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Miller <davem@davemlloft.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-19 07:48:32 +09:00
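The dispatch this describes looks roughly like the sketch below; the configuration names are taken from the message itself and the exact header text differs.
/* Sketch of the three cases listed in the message, not the literal header. */
#if !defined(CONFIG_NODES_POPULATES_NODE_MAP) && \
    !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
/* No node map and no arch hook: everything is node 0. */
static inline int early_pfn_to_nid(unsigned long pfn)
{
	return 0;
}
#elif !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
/* Generic definition is provided by mm/page_alloc.c. */
extern int early_pfn_to_nid(unsigned long pfn);
#else
/* Per-arch back end supplies the mapping. */
extern int early_pfn_to_nid(unsigned long pfn);
#endif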
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
|
|
|
|
2011-12-09 03:22:09 +09:00
|
|
|
#if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
|
2009-02-19 07:48:32 +09:00
|
|
|
!defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
|
2015-07-01 06:56:55 +09:00
|
|
|
static inline int __early_pfn_to_nid(unsigned long pfn,
|
|
|
|
struct mminit_pfnnid_cache *state)
|
2009-02-19 07:48:32 +09:00
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
/* please see mm/page_alloc.c */
|
|
|
|
extern int __meminit early_pfn_to_nid(unsigned long pfn);
|
|
|
|
/* there is a per-arch backend function. */
|
2015-07-01 06:56:55 +09:00
|
|
|
extern int __meminit __early_pfn_to_nid(unsigned long pfn,
|
|
|
|
struct mminit_pfnnid_cache *state);
|
2009-02-19 07:48:32 +09:00
|
|
|
#endif
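As a rough illustration of the second case described in the early_pfn_to_nid() commit message above (the generic definition in mm/page_alloc.c), the exported function might simply wrap the declared __early_pfn_to_nid() backend with a boot-time cache. This is only a hedged sketch: the cache placement and the first_online_node fallback are assumptions for illustration, not the exact kernel code.

static struct mminit_pfnnid_cache early_pfnnid_cache __meminitdata;

int __meminit early_pfn_to_nid(unsigned long pfn)
{
	int nid;

	/* ask the memblock/arch backend, caching the last matched range */
	nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
	if (nid < 0)
		nid = first_online_node;	/* pfn in no range: pick a safe node */

	return nid;
}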
|
|
|
|
|
2018-10-31 07:07:44 +09:00
|
|
|
#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
|
mm: zero reserved and unavailable struct pages
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by
going through __init_single_page().
In some cases these struct pages are accessed even if they do not
contain any data. One example: page_to_pfn() might access page->flags
if this is where section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).
One example of such memory: trim_low_memory_range() unconditionally
reserves from pfn 0, but e820__memblock_setup() might provide the
existing memory only from pfn 1 (e.g. under KVM).
Since struct pages are zeroed in __init_single_page(), and not during
allocation time, we must zero such struct pages explicitly.
The patch involves adding a new memblock iterator:
for_each_resv_unavail_range(i, p_start, p_end)
Which iterates through reserved && !memory lists, and we zero struct pages
explicitly by calling mm_zero_struct_page().
===
Here is more detailed example of problem that this patch is addressing:
Run tested on qemu with the following arguments:
-enable-kvm -cpu kvm64 -m 512 -smp 2
This patch reports that there are 98 unavailable pages.
They are: pfn 0 and pfns in range [159, 255].
Note, trim_low_memory_range() reserves only pfns in range [0, 15], it does
not reserve [159, 255] ones.
e820__memblock_setup() reports to Linux that the following physical ranges are
available:
[1 , 158]
[256, 130783]
Notice that exactly the unavailable pfns are missing!
Now, let's check what we have in zone 0: [1, 131039]
pfn 0 is not part of the zone, but pfns [1, 158] are.
However, the bigger problem if we do not initialize these struct pages is
with memory hotplug, because that path operates at 2M boundaries
(section_nr) and checks whether a 2M range of pages is hot-removable. It
starts with the first pfn of the zone, rounds it down to a 2M boundary
(struct pages are allocated at 2M boundaries when vmemmap is created), and
checks whether that section is hot-removable. In this case it starts with
pfn 1 and rounds it down to pfn 0. Later the pfn is converted to a struct
page and some of its fields are checked. If we do not zero the struct
pages, we get unpredictable results.
In fact, when CONFIG_DEBUG_VM is enabled and we explicitly set all vmemmap
memory to ones, the following panic is observed in a kernel test without
this patch applied:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: is_pageblock_removable_nolock+0x35/0x90
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT
...
task: ffff88001f4e2900 task.stack: ffffc90000314000
RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
Call Trace:
? is_mem_section_removable+0x5a/0xd0
show_mem_removable+0x6b/0xa0
dev_attr_show+0x1b/0x50
sysfs_kf_seq_show+0xa1/0x100
kernfs_seq_show+0x22/0x30
seq_read+0x1ac/0x3a0
kernfs_fop_read+0x36/0x190
? security_file_permission+0x90/0xb0
__vfs_read+0x16/0x30
vfs_read+0x81/0x130
SyS_read+0x44/0xa0
entry_SYSCALL_64_fastpath+0x1f/0xbd
Link: http://lkml.kernel.org/r/20171013173214.27300-7-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Tested-by: Bob Picco <bob.picco@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-16 10:36:31 +09:00
|
|
|
void zero_resv_unavail(void);
|
|
|
|
#else
|
|
|
|
static inline void zero_resv_unavail(void) {}
|
|
|
|
#endif
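A minimal sketch of how zero_resv_unavail() might use the iterator named in the commit message above. Only for_each_resv_unavail_range() and mm_zero_struct_page() come from that description; the loop bounds, the counter, and the log line are assumptions for illustration.

void __init zero_resv_unavail(void)
{
	phys_addr_t start, end;
	unsigned long pfn;
	u64 i, zeroed = 0;

	/* walk ranges that are in memblock.reserved but not memblock.memory */
	for_each_resv_unavail_range(i, &start, &end) {
		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
			mm_zero_struct_page(pfn_to_page(pfn));
			zeroed++;
		}
	}

	if (zeroed)
		pr_info("Zeroed %llu struct pages in unavailable ranges\n",
			(unsigned long long)zeroed);
}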
|
|
|
|
|
2006-09-27 17:49:56 +09:00
|
|
|
extern void set_dma_reserve(unsigned long new_dma_reserve);
|
2017-12-29 16:53:57 +09:00
|
|
|
extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
|
2020-09-26 13:19:28 +09:00
|
|
|
enum meminit_context, struct vmem_altmap *);
|
2009-06-17 07:32:48 +09:00
|
|
|
extern void setup_per_zone_wmarks(void);
|
vmscan: Support multiple kswapd threads per node
Page replacement is handled in the Linux Kernel in one of two ways:
1) Asynchronously via kswapd
2) Synchronously, via direct reclaim
At page allocation time the allocating task is immediately given a page
from the zone free list, allowing it to go right back to whatever it was
doing, probably directly or indirectly executing business logic.
Just prior to satisfying the allocation, the free page count is checked to
see if it has reached the zone low watermark and, if so, kswapd is awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.
When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min watermark,
the task will no longer pull pages from the free list. Instead, the task
will run the same CPU-bound routines as kswapd to satisfy its own
allocation by scanning and evicting pages. This is called a direct reclaim.
The time spent performing a direct reclaim can be substantial, ranging
from tens to hundreds of milliseconds for small order-0 allocations to
half a second or more for order-9 huge-page allocations. In fact, kswapd is
not actually required on a Linux system. It exists for the sole purpose of
optimizing performance by preventing direct reclaims.
When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.
The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a single
threaded kswapd that will get worse with each generation of new hardware.
Test Details
NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details
The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.
Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.
The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.
During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:
CPU consumption
- sy - aggregate kernel mode CPU consumption from vmstat output. The value
doesn't tend to fluctuate much so I just grab the highest value.
Each sample is averaged over 10 seconds
- dd_cpu - for all of the dd tasks averaged across the top samples since
there is a lot of variation.
Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total
This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks are increased.
The dd command for this test looks like this:
Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M
Test #1: Direct IO
dd sy dd_cpu throughput
6 0 2.33 14726026.40
10 1 2.95 19954974.80
16 1 2.63 24419689.30
22 1 2.63 25430303.20
28 1 2.91 26026513.20
34 1 2.53 26178618.00
40 1 2.18 26239229.20
46 1 1.91 26250550.40
52 1 1.69 26251845.60
58 1 1.54 26253205.60
64 1 1.43 26253780.80
70 1 1.31 26254154.80
76 1 1.21 26253660.80
82 1 1.12 26254214.80
88 1 1.07 26253770.00
90 1 1.04 26252406.40
Throughput was close to peak with only 22 dd tasks. Very little system CPU
was consumed as expected as the drives DMA directly into the user address
space when using direct IO.
In this next test, the iflag=direct option is removed and we only run the
test until the pgscan_kswapd from /proc/vmstat starts to increment. At
that point metrics are parsed and reported and the pagecache contents are
dropped prior to the next test. Lather, rinse, repeat.
Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6 2 28.78 5134316.40
10 3 31.40 8051218.40
16 5 34.73 11438106.80
22 7 33.65 14140596.40
28 8 31.24 16393455.20
34 10 29.88 18219463.60
40 11 28.33 19644159.60
46 11 25.05 20802497.60
52 13 26.92 22092370.00
58 13 23.29 22884881.20
64 14 23.12 23452248.80
70 15 22.40 23916468.00
76 16 22.06 24328737.20
82 17 20.97 24718693.20
88 16 18.57 25149404.40
90 16 18.31 25245565.60
Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.
The next test measures throughput after kswapd starts running. This is the
same test only we wait for kswapd to wake up before we start collecting
metrics. The script actually keeps track of a few things that were not
mentioned earlier. It tracks direct reclaims and page scans by watching
the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the
same way it is tracked for dd.
Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.
Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901
In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption
with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA
node), kswapd can only be responsible for a little over 4% of the increase.
The rest is likely caused by 51,618 direct reclaims that scanned 1.2
billion pages over the five minute time period of the test.
Same test, more kswapd tasks:
Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.09 16.65 14.17 7842605.60 0 459105291 0
16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515
22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0
28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0
34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0
40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0
46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0
52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0
58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0
64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821
70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159
76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763
82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704
88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202
90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615
By increasing the number of kswapd threads, throughput increased by ~50%
while kernel mode CPU utilization decreased or stayed the same, likely due
to a decrease in the number of parallel tasks at any given time doing page
replacement.
Change-Id: I966d4a9c33bad188b3409f7ceea1df205a63c3bd
Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Patch-mainline: linux-mm @ Mon, 2 Apr 2018 09:24:22
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes done to ensure QGKI compliance.
Signed-off-by: Charan Teja Kalla <charante@codeaurora.org>
2018-04-02 18:24:22 +09:00
|
|
|
extern void update_kswapd_threads(void);
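To picture the mechanism described in the multiple-kswapd commit above: starting kswapd_threads workers per node could look roughly like the loop below. The "kswapd%d:%d" naming and the kswapd_tasks array are hypothetical illustrations, not the patch's exact code; kthread_run(), kswapd() and NODE_DATA() are existing kernel interfaces.

void kswapd_run(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	int i;

	for (i = 0; i < kswapd_threads; i++) {
		struct task_struct *t;

		/* one reclaim worker per slot, all sharing the node's pgdat */
		t = kthread_run(kswapd, pgdat, "kswapd%d:%d", nid, i);
		if (IS_ERR(t)) {
			pr_err("Failed to start kswapd%d:%d\n", nid, i);
			break;
		}
		pgdat->kswapd_tasks[i] = t;	/* assumed per-node array */
	}
}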
|
2011-05-25 09:11:32 +09:00
|
|
|
extern int __meminit init_per_zone_wmark_min(void);
|
2005-04-17 07:20:36 +09:00
|
|
|
extern void mem_init(void);
|
2009-01-08 21:04:47 +09:00
|
|
|
extern void __init mmap_init(void);
|
2017-02-23 08:46:16 +09:00
|
|
|
extern void show_mem(unsigned int flags, nodemask_t *nodemask);
|
2016-03-18 06:19:05 +09:00
|
|
|
extern long si_mem_available(void);
|
2005-04-17 07:20:36 +09:00
|
|
|
extern void si_meminfo(struct sysinfo * val);
|
|
|
|
extern void si_meminfo_node(struct sysinfo *val, int nid);
|
2016-10-08 08:59:15 +09:00
|
|
|
#ifdef __HAVE_ARCH_RESERVED_KERNEL_PAGES
|
|
|
|
extern unsigned long arch_reserved_kernel_pages(void);
|
|
|
|
#endif
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2017-02-23 08:46:10 +09:00
|
|
|
extern __printf(3, 4)
|
|
|
|
void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...);
|
2011-05-25 09:12:16 +09:00
|
|
|
|
2005-06-22 09:14:47 +09:00
|
|
|
extern void setup_per_cpu_pageset(void);
|
|
|
|
|
2009-09-22 09:01:16 +09:00
|
|
|
extern void zone_pcp_update(struct zone *zone);
|
2012-08-01 08:43:32 +09:00
|
|
|
extern void zone_pcp_reset(struct zone *zone);
|
2009-09-22 09:01:16 +09:00
|
|
|
|
2013-02-23 09:34:42 +09:00
|
|
|
/* page_alloc.c */
|
2018-04-02 18:24:22 +09:00
|
|
|
extern int kswapd_threads;
|
2013-02-23 09:34:42 +09:00
|
|
|
extern int min_free_kbytes;
|
mm: reclaim small amounts of memory when an external fragmentation event occurs
An external fragmentation event was previously described as
When the page allocator fragments memory, it records the event using
the mm_page_alloc_extfrag event. If the fallback_order is smaller
than a pageblock order (order-9 on 64-bit x86) then it's considered
an event that will cause external fragmentation issues in the future.
The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.
This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation-causing
event occurs. The boosting will stall allocations that would decrease
free memory below the boosted low watermark, and kswapd is woken, if the
calling context allows it, to reclaim an amount of memory relative to the
size of the high watermark and the watermark_boost_factor until the boost
is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
from reclaim context during this operation to avoid excessive system
disruption in the name of fragmentation avoidance. Care is taken so that
kswapd will do normal reclaim work if the system is really low on memory.
This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
4.20-rc3 extfrag events < order 9: 804694
4.20-rc3+patch: 408912 (49% reduction)
4.20-rc3+patch1-4: 18421 (98% reduction)
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%)
Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%*
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%)
Note that external fragmentation-causing events are massively reduced by
this patch, whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased, but that
is only because THP allocations were successful with the patch applied.
1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9: 291392
4.20-rc3+patch: 191187 (34% reduction)
4.20-rc3+patch1-4: 13464 (95% reduction)
thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%)
Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%)
Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%)
Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%*
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%)
As before, massive reduction in external fragmentation events, some jitter
on latencies and an increase in THP allocation success rates.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------
4.20-rc3 extfrag events < order 9: 215698
4.20-rc3+patch: 200210 (7% reduction)
4.20-rc3+patch1-4: 14263 (93% reduction)
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%)
Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%)
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%)
There is a 93% reduction in fragmentation causing events, there is a big
reduction in the huge page fault latency and allocation success rate is
higher.
2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9: 166352
4.20-rc3+patch: 147463 (11% reduction)
4.20-rc3+patch1-4: 11095 (93% reduction)
thpfioscale Fault Latencies
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%*
Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%)
4.20.0-rc3 4.20.0-rc3
lowzone-v5r8 boost-v5r8
Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%)
There is a large reduction in fragmentation events with some jitter around
the latencies and success rates. As before, the high THP allocation
success rate does mean the system is under a lot of pressure. However, as
the fragmentation events are reduced, it would be expected that the
long-term allocation success rate would be higher.
Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 17:35:52 +09:00
|
|
|
extern int watermark_boost_factor;
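A hedged sketch of how a zone watermark boost could be applied on a fragmentation event, following the description above (an amount relative to the high watermark and watermark_boost_factor, bumped per event). The helper name and the use of high_wmark_pages() are assumptions made for this illustration.

static void boost_watermark(struct zone *zone)
{
	unsigned long max_boost;

	if (!watermark_boost_factor)
		return;

	/* cap the boost at watermark_boost_factor/10000 of the high watermark */
	max_boost = mult_frac(high_wmark_pages(zone),
			      watermark_boost_factor, 10000);
	max_boost = max(pageblock_nr_pages, max_boost);

	/* bump by one pageblock per fragmentation event, up to the cap */
	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
				    max_boost);
}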
|
2016-03-18 06:19:14 +09:00
|
|
|
extern int watermark_scale_factor;
|
2013-02-23 09:34:42 +09:00
|
|
|
|
2009-01-08 21:04:47 +09:00
|
|
|
/* nommu.c */
|
2009-04-03 08:56:32 +09:00
|
|
|
extern atomic_long_t mmap_pages_allocated;
|
nommu: fix shared mmap after truncate shrinkage problems
Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen
over the end of a truncation. The problem is that
ramfs_nommu_check_mappings() checks the reduced file size against the
VMA tree, but not the vm_region tree.
The following sequence of events can cause the problem:
fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600);
ftruncate(fd, 32 * 1024);
a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
munmap(a, 32 * 1024);
ftruncate(fd, 16 * 1024);
c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b'
sees that the vm_region from 'a' is covering the region it wants and so
shares it, pinning it in memory.
Mapping 'a' then goes away and the file is truncated to the end of VMA
'b'. However, the region allocated by 'a' is still in effect, and has
_not_ been reduced.
Mapping 'c' is then created, and because there's a vm_region covering the
desired region, get_unmapped_area() is _not_ called to repeat the check,
and the mapping is granted, even though the pages from the latter half of
the mapping have been discarded.
However:
d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
Mapping 'd' should work, and should end up sharing the region allocated by
'a'.
To deal with this, we shrink the vm_region struct during the truncation,
lest do_mmap_pgoff() take it as licence to share the full region
automatically without calling the get_unmapped_area() file op again.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16 10:01:39 +09:00
|
|
|
extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
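The event sequence quoted in the commit message above can be written as a small self-contained user-space reproducer; this is only an illustration of the scenario on a NOMMU ramfs mount, with error handling kept minimal.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/x", O_RDWR | O_TRUNC | O_CREAT, 0600);
	void *a, *b, *c;

	if (fd < 0)
		return 1;
	ftruncate(fd, 32 * 1024);

	a = mmap(NULL, 32 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	b = mmap(NULL, 16 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	munmap(a, 32 * 1024);

	ftruncate(fd, 16 * 1024);

	/* before the fix, this mapping was granted even though the tail
	 * pages past 16KB had already been discarded */
	c = mmap(NULL, 32 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	printf("b=%p c=%p\n", b, c);

	munmap(b, 16 * 1024);
	if (c != MAP_FAILED)
		munmap(c, 32 * 1024);
	close(fd);
	return 0;
}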
|
2009-01-08 21:04:47 +09:00
|
|
|
|
2012-10-09 08:31:25 +09:00
|
|
|
/* interval_tree.c */
|
|
|
|
void vma_interval_tree_insert(struct vm_area_struct *node,
|
2017-09-09 08:15:08 +09:00
|
|
|
struct rb_root_cached *root);
|
2012-10-09 08:31:35 +09:00
|
|
|
void vma_interval_tree_insert_after(struct vm_area_struct *node,
|
|
|
|
struct vm_area_struct *prev,
|
2017-09-09 08:15:08 +09:00
|
|
|
struct rb_root_cached *root);
|
2012-10-09 08:31:25 +09:00
|
|
|
void vma_interval_tree_remove(struct vm_area_struct *node,
|
2017-09-09 08:15:08 +09:00
|
|
|
struct rb_root_cached *root);
|
|
|
|
struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root,
|
2012-10-09 08:31:25 +09:00
|
|
|
unsigned long start, unsigned long last);
|
|
|
|
struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node,
|
|
|
|
unsigned long start, unsigned long last);
|
|
|
|
|
|
|
|
#define vma_interval_tree_foreach(vma, root, start, last) \
|
|
|
|
for (vma = vma_interval_tree_iter_first(root, start, last); \
|
|
|
|
vma; vma = vma_interval_tree_iter_next(vma, start, last))
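As a usage illustration of the iterator declared above, a file-backed rmap walk might visit every vma overlapping a given page offset roughly as follows. The mapping and pgoff inputs and the locking shown are assumptions for the example, not a quote of any particular caller.

	struct vm_area_struct *vma;

	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
		/* translate the file offset into a user address in this vma */
		unsigned long address = vma->vm_start +
			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);

		/* ... inspect or update the pte mapping 'address' in 'vma' ... */
	}
	i_mmap_unlock_read(mapping);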
|
2005-04-17 07:20:36 +09:00
|
|
|
|
mm anon rmap: replace same_anon_vma linked list with an interval tree.
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevant vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vmas that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 08:31:39 +09:00
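The pre/post update pattern described in the commit message above can be pictured like this. The helpers are the ones named in that message; the specific vma field updates, the new_start/new_pgoff values, and the locking shown are illustrative assumptions only.

	anon_vma_lock_write(vma->anon_vma);
	anon_vma_interval_tree_pre_update_vma(vma);	/* take avcs off the tree */

	vma->vm_start = new_start;			/* assumed update */
	vma->vm_pgoff = new_pgoff;

	anon_vma_interval_tree_post_update_vma(vma);	/* re-index the avcs */
	anon_vma_unlock_write(vma->anon_vma);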
|
|
|
void anon_vma_interval_tree_insert(struct anon_vma_chain *node,
|
2017-09-09 08:15:08 +09:00
|
|
|
struct rb_root_cached *root);
|
2012-10-09 08:31:39 +09:00
|
|
|
void anon_vma_interval_tree_remove(struct anon_vma_chain *node,
|
2017-09-09 08:15:08 +09:00
|
|
|
struct rb_root_cached *root);
|
|
|
|
struct anon_vma_chain *
|
|
|
|
anon_vma_interval_tree_iter_first(struct rb_root_cached *root,
|
|
|
|
unsigned long start, unsigned long last);
|
2012-10-09 08:31:39 +09:00
|
|
|
struct anon_vma_chain *anon_vma_interval_tree_iter_next(
|
|
|
|
struct anon_vma_chain *node, unsigned long start, unsigned long last);
|
2012-10-09 08:31:45 +09:00
|
|
|
#ifdef CONFIG_DEBUG_VM_RB
|
|
|
|
void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
|
|
|
|
#endif
|
2012-10-09 08:31:39 +09:00
|
|
|
|
|
|
|
#define anon_vma_interval_tree_foreach(avc, root, start, last) \
|
|
|
|
for (avc = anon_vma_interval_tree_iter_first(root, start, last); \
|
|
|
|
avc; avc = anon_vma_interval_tree_iter_next(avc, start, last))
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
/* mmap.c */
|
2007-08-23 06:01:28 +09:00
|
|
|
extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
|
mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk
The rmap_walk can access vm_page_prot (and potentially vm_flags in the
pte/pmd manipulations). So it's not safe to wait for the caller to update
vm_page_prot/vm_flags after vma_merge has returned, having potentially
removed the "next" vma and extended the "current" vma over the
next->vm_start,vm_end range but still with the "current" vma's
vm_page_prot, once the rmap locks have been released.
The vm_page_prot/vm_flags must be transferred from the "next" vma to the
current vma while vma_merge still holds the rmap locks.
The side effect of this race condition is pte corruption during migration,
as remove_migration_ptes, when run on an address of the "next" vma that
got removed, used the vm_page_prot of the current vma.
migrate mprotect
------------ -------------
migrating in "next" vma
vma_merge() # removes "next" vma and
# extends "current" vma
# current vma is not with
# vm_page_prot updated
remove_migration_ptes
read vm_page_prot of current "vma"
establish pte with wrong permissions
vm_set_page_prot(vma) # too late!
change_protection in the old vma range
only, next range is not updated
This caused segmentation faults and potentially memory corruption in
heavy mprotect loads with some light page migration caused by compaction
in the background.
Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge,
which confirms that case 8 is the only buggy one where the race can
trigger; in all other vma_merge cases the above cannot happen.
This fix removes the oddness factor from case 8 and it converts it from:
AAAA
PPPPNNNNXXXX -> PPPPNNNNNNNN
to:
AAAA
PPPPNNNNXXXX -> PPPPXXXXXXXX
XXXX has the right vma properties for the whole merged vma returned by
vma_adjust, so it solves the problem fully. It has the added benefit
that the callers could stop updating vma properties when vma_merge
succeeds; however, the callers are not updated by this patch (there are
bits like VM_SOFTDIRTY that still need special care for the whole range,
as the vma merging ignores them, but as long as they're not processed by
rmap walks and instead they're accessed with the mmap_sem at least for
reading, they are fine not to be updated within vma_adjust before
releasing the rmap_locks).
Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Aditya Mandaleeka <adityam@microsoft.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-08 09:01:28 +09:00
|
|
|
extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
|
|
|
|
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
|
2018-04-17 23:33:16 +09:00
|
|
|
struct vm_area_struct *expand, bool keep_locked);
|
2016-10-08 09:01:28 +09:00
|
|
|
static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
|
|
|
|
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
|
|
|
|
{
|
2018-04-17 23:33:16 +09:00
|
|
|
return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
|
2016-10-08 09:01:28 +09:00
|
|
|
}
|
2018-04-17 23:33:16 +09:00
|
|
|
|
|
|
|
extern struct vm_area_struct *__vma_merge(struct mm_struct *mm,
|
|
|
|
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
|
|
|
|
unsigned long vm_flags, struct anon_vma *anon, struct file *file,
|
|
|
|
pgoff_t pgoff, struct mempolicy *mpol, struct vm_userfaultfd_ctx uff,
|
|
|
|
const char __user *user, bool keep_locked);
|
|
|
|
|
|
|
|
static inline struct vm_area_struct *vma_merge(struct mm_struct *mm,
|
2005-04-17 07:20:36 +09:00
|
|
|
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
|
2018-04-17 23:33:16 +09:00
|
|
|
unsigned long vm_flags, struct anon_vma *anon, struct file *file,
|
|
|
|
pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff,
|
|
|
|
const char __user *user)
|
|
|
|
{
|
|
|
|
return __vma_merge(mm, prev, addr, end, vm_flags, anon, file, off,
|
|
|
|
pol, uff, user, false);
|
|
|
|
}
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
|
2017-02-25 07:58:47 +09:00
|
|
|
extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
|
|
|
|
unsigned long addr, int new_below);
|
|
|
|
extern int split_vma(struct mm_struct *, struct vm_area_struct *,
|
|
|
|
unsigned long addr, int new_below);
|
2005-04-17 07:20:36 +09:00
|
|
|
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
|
|
|
|
extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
|
|
|
|
struct rb_node **, struct rb_node *);
|
2005-10-30 10:15:57 +09:00
|
|
|
extern void unlink_file_vma(struct vm_area_struct *);
|
2005-04-17 07:20:36 +09:00
|
|
|
extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
|
2012-10-09 08:31:50 +09:00
|
|
|
unsigned long addr, unsigned long len, pgoff_t pgoff,
|
|
|
|
bool *need_rmap_locks);
|
2005-04-17 07:20:36 +09:00
|
|
|
extern void exit_mmap(struct mm_struct *);
|
2008-04-29 17:01:36 +09:00
|
|
|
|
2014-10-10 07:27:29 +09:00
|
|
|
static inline int check_data_rlimit(unsigned long rlim,
|
|
|
|
unsigned long new,
|
|
|
|
unsigned long start,
|
|
|
|
unsigned long end_data,
|
|
|
|
unsigned long start_data)
|
|
|
|
{
|
|
|
|
if (rlim < RLIM_INFINITY) {
|
|
|
|
if (((new - start) + (end_data - start_data)) > rlim)
|
|
|
|
return -ENOSPC;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
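For context, a hedged sketch of the kind of call site this helper is written for, a brk-style path (new_brk and the out label are placeholders; rlimit() and the mm_struct fields are the usual kernel ones):
	/* Refuse to move the break if data segment plus brk growth would exceed RLIMIT_DATA. */
	if (check_data_rlimit(rlimit(RLIMIT_DATA), new_brk, mm->start_brk,
			      mm->end_data, mm->start_data))
		goto out;	/* leave the break where it is */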
|
|
|
|
|
2008-07-29 07:46:26 +09:00
|
|
|
extern int mm_take_all_locks(struct mm_struct *mm);
|
|
|
|
extern void mm_drop_all_locks(struct mm_struct *mm);
|
|
|
|
|
2011-05-27 08:25:46 +09:00
|
|
|
extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
|
|
|
|
extern struct file *get_mm_exe_file(struct mm_struct *mm);
|
2016-08-23 23:20:38 +09:00
|
|
|
extern struct file *get_task_exe_file(struct task_struct *task);
|
2008-04-29 17:01:36 +09:00
|
|
|
|
2016-01-15 08:22:07 +09:00
|
|
|
extern bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long npages);
|
|
|
|
extern void vm_stat_account(struct mm_struct *, vm_flags_t, long npages);
|
|
|
|
|
2016-09-05 22:33:05 +09:00
|
|
|
extern bool vma_is_special_mapping(const struct vm_area_struct *vma,
|
|
|
|
const struct vm_special_mapping *sm);
|
2014-03-18 07:22:02 +09:00
|
|
|
extern struct vm_area_struct *_install_special_mapping(struct mm_struct *mm,
|
|
|
|
unsigned long addr, unsigned long len,
|
2014-05-20 07:58:33 +09:00
|
|
|
unsigned long flags,
|
|
|
|
const struct vm_special_mapping *spec);
|
|
|
|
/* This is an obsolete alternative to _install_special_mapping. */
|
2007-02-09 07:20:41 +09:00
|
|
|
extern int install_special_mapping(struct mm_struct *mm,
|
|
|
|
unsigned long addr, unsigned long len,
|
|
|
|
unsigned long flags, struct page **pages);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2019-09-24 07:38:37 +09:00
|
|
|
unsigned long randomize_stack_top(unsigned long stack_top);
|
2022-05-14 20:59:30 +09:00
|
|
|
unsigned long randomize_page(unsigned long start, unsigned long range);
|
2019-09-24 07:38:37 +09:00
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
|
|
|
|
|
2007-07-16 15:38:26 +09:00
|
|
|
extern unsigned long mmap_region(struct file *file, unsigned long addr,
|
2017-02-25 07:58:22 +09:00
|
|
|
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
|
|
|
|
struct list_head *uf);
|
2015-09-10 07:39:29 +09:00
|
|
|
extern unsigned long do_mmap(struct file *file, unsigned long addr,
|
2013-02-23 09:32:37 +09:00
|
|
|
unsigned long len, unsigned long prot, unsigned long flags,
|
2017-02-25 07:58:22 +09:00
|
|
|
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
|
|
|
|
struct list_head *uf);
|
2018-10-27 07:08:50 +09:00
|
|
|
extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
|
|
|
|
struct list_head *uf, bool downgrade);
|
2017-02-25 07:58:22 +09:00
|
|
|
extern int do_munmap(struct mm_struct *, unsigned long, size_t,
|
|
|
|
struct list_head *uf);
|
mm/madvise: pass task and mm to do_madvise
Patch series "introduce memory hinting API for external process", v7.
Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting APIs. With
that, an application could give hints to the kernel about which memory
ranges are preferred to be reclaimed. However, on some platforms (e.g.,
Android), the information required to make the hinting decision is not
known to the app. Instead, it is known to a centralized userspace daemon
(e.g., ActivityManagerService), and that daemon must be able to initiate
reclaim on its own without any app involvement.
To address the concern, this patch introduces a new syscall -
process_madvise(2). Basically, it's the same as the madvise(2) syscall
but with some differences.
1. It needs the pidfd of the target process to provide the hint.
2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
moment. Other madvise hints will be opened up when there are explicit
requests from the community, to prevent unexpected bugs we couldn't support.
3. Only privileged processes can act on another process's
address space.
For more detail of the new API, please see "mm: introduce external memory
hinting API" description in this patchset.
This patch (of 7):
In upcoming patches, do_madvise will be called from external process
context, so we shouldn't assume "current" is always the hinted process's
task_struct.
Furthermore, we must not access mm_struct via task->mm, but obtain it
via access_mm() once (in the following patch) and only use that pointer
[1], so pass it to do_madvise() as well. Note the vma->vm_mm pointers
are safe, so we can use them further down the call stack.
And let's pass *current* and current->mm as arguments of do_madvise, so
existing behavior doesn't change while the next patch is prepared for
easier review.
Note: io_madvise passes NULL as the target_task argument of do_madvise
because it can't know who the target is.
[1] http://lore.kernel.org/r/CAG48ez27=pwm5m_N_988xT1huO7g7h6arTQL44zev6TD-h-7Tg@mail.gmail.com
[vbabka@suse.cz: changelog tweak]
[minchan@kernel.org: use current->mm for io_uring]
Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
[akpm@linux-foundation.org: fix it for upstream changes]
[akpm@linux-foundation.org: whoops]
[rdunlap@infradead.org: add missing includes]
Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <linux-man@vger.kernel.org>
From: Minchan Kim <minchan@kernel.org>
Subject: mm/madvise: introduce process_madvise() syscall: an external memory hinting API
There is a use case where System Management Software (SMS) wants to give a
memory hint like MADV_[COLD|PAGEOUT] to other processes; in the
case of Android, it is the ActivityManagerService.
The information required to make the reclaim decision is not known to
the app. Instead, it is known to the centralized userspace
daemon(ActivityManagerService), and that daemon must be able to
initiate reclaim on its own without any app involvement.
To solve the issue, this patch introduces a new syscall
process_madvise(2). It uses pidfd of an external process to give the
hint.
int process_madvise(int pidfd, void *addr, size_t length, int advice,
unsigned long flags);
Since it could affect another process's address range, only a privileged
process (CAP_SYS_PTRACE), or one that otherwise has the right to ptrace
the target process (e.g., by being the same UID), could use it successfully.
The flag argument is reserved for future use if we need to extend the
API.
I think supporting in process_madvise all the hints madvise has (or will
support) is rather risky, because we are not sure all hints make sense
from an external process, and the implementation of a hint may rely on
the caller being in the current context, so it could be error-prone.
Thus, I just limited the hints to MADV_[COLD|PAGEOUT] in this patch.
If someone wants to add other hints, we could hear the use case and
review it for each hint. That's safer for maintenance than introducing
a buggy syscall that's hard to fix later.
Q.1 - Why does any external entity have better knowledge?
Quote from Sandeep
"For Android, every application (including the special SystemServer)
are forked from Zygote. The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.
After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.
In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.
So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.
Besides, we can never rely on applications to clean things up
themselves. We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.
So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.
- ssp
Q.2 - How is the race (i.e., object validation) handled between giving a
hint from an external process and the target process acting on its own
address space?
process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called. If the
target process can run between the time the process_madvise process
inspects the target process address space and the time that
process_madvise is actually called, process_madvise may operate on
memory regions that the calling process does not expect. It's the
responsibility of the process calling process_madvise to close this
race condition. For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called. Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process. Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm. The suggested API itself does not provide synchronization; the
same also applies to other APIs like move_pages and process_vm_write.
The race isn't really a problem though. Why is it so wrong to require
that callers do their own synchronization in some manner? Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something. Think about mmap. It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before. That's where we need synchronization by using another API or a
design from the user side. It shouldn't be part of the API itself. If
someone needs more fine-grained synchronization than the process level,
there were two ideas suggested - cookie[2] and anon-fd[3]. Both are
applicable via the last reserved argument of the API, but I don't
think they're necessary right now since we already have ways to prevent
the race, so I don't want to add additional complexity with a more
fine-grained optimization model.
To keep the API extendable, it reserves an unsigned long as the last
argument so we could support it in the future if someone really needs it.
Q.3 - Why doesn't ptrace work?
Injecting an madvise in the target process using ptrace would not work
for us because such injected madvise would have to be executed by the
target process, which means that process would have to be runnable and
that creates the risk of the abovementioned race and hinting a wrong
VMA. Furthermore, we want to apply the hint in the caller's context, not
the callee's, because the callee is usually limited in cpuset/cgroups or
even in a frozen state, so it can't act by itself quickly enough, which
causes more thrashing/killing. It also doesn't work if the target process
is ptraced (e.g., strace, debugger, minidump) because a process can have
at most one ptracer.
[1] https://developer.android.com/topic/performance/memory"
[2] process_getinfo for getting the cookie which is updated whenever
vma of process address layout are changed - Daniel Colascione -
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
[3] anonymous fd which is used for the object(i.e., address range)
validation - Michal Hocko -
https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: <linux-man@vger.kernel.org>
From: Minchan Kim <minchan@kernel.org>
Subject: fix process_madvise build break for arm64
0-day reported build break from process_madvise on ARM64.
aarch64-linux-ld: arch/arm64/kernel/head.o: relocation R_AARCH64_ABS32 against `_kernel_offset_le_lo32' can not be used when making a shared object
aarch64-linux-ld: arch/arm64/kernel/efi-entry.stub.o: relocation R_AARCH64_ABS32 against `__efistub_stext_offset' can not be used when making a shared object
arch/arm64/kernel/head.o: In function `kimage_vaddr':
(.idmap.text+0x0): dangerous relocation: unsupported relocation
arch/arm64/kernel/head.o: In function `__primary_switch':
(.idmap.text+0x378): dangerous relocation: unsupported relocation
(.idmap.text+0x380): dangerous relocation: unsupported relocation
>> arch/arm64/kernel/sys32.o:(.rodata+0xdb8): undefined reference to `__arm64_process_madvise'
This patch should fix it.
Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
From: Minchan Kim <minchan@kernel.org>
Subject: mm: fix build error for mips of process_madvise
The kbuild test robot reported a build break of process_madvise for mips[1].
This patch should fix it.
[1] https://lore.kernel.org/linux-mm/202005080716.cUcbCQ3i%25lkp@intel.com/
Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-introduce-external-memory-hinting-api-fix-2-fix
the compat bit comes later
Cc: Minchan Kim <minchan@kernel.org>
From: Minchan Kim <minchan@kernel.org>
Subject: mm/madvise: check fatal signal pending of target process
Bail out to prevent unnecessary CPU overhead if the target process has a
pending fatal signal during a (MADV_COLD|MADV_PAGEOUT) operation.
Link: http://lkml.kernel.org/r/20200302193630.68771-4-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: <linux-man@vger.kernel.org>
From: Minchan Kim <minchan@kernel.org>
Subject: pid: move pidfd_get_pid() to pid.c
The process_madvise syscall needs the pidfd_get_pid function to translate a
pidfd to a pid, so this patch moves the function to kernel/pid.c.
Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jann Horn <jannh@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: <linux-man@vger.kernel.org>
From: Minchan Kim <minchan@kernel.org>
Subject: mm/madvise: support both pid and pidfd for process_madvise
There is a demand[1] to support pid as well as pidfd for process_madvise,
to avoid the unnecessary syscall needed to get a pidfd if the user has
control of the target process (i.e., they could guarantee the process is
not gone or its pid is not reused).
This patch aims to support both options, like waitid(2). So, the
syscall is currently:
int process_madvise(idtype_t idtype, id_t id, void *addr,
size_t length, int advice, unsigned long flags);
The idtype argument is idtype_t for the userspace library and currently
it supports P_PID and P_PIDFD.
[1] https://lore.kernel.org/linux-mm/9d849087-3359-c4ab-fbec-859e8186c509@virtuozzo.com/
Link: http://lkml.kernel.org/r/20200302193630.68771-6-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Brauner <christian@brauner.io>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: <linux-man@vger.kernel.org>
From: Oleksandr Natalenko <oleksandr@redhat.com>
Subject: mm/madvise: allow KSM hints for remote API
It all began with the fact that KSM works only on memory that is marked by
madvise(). And the only way to get around that is to either:
* use LD_PRELOAD; or
* patch the kernel with something like UKSM or PKSM.
(I skip the ptrace can of worms here intentionally.)
To overcome this restriction, let's employ a new remote madvise API. This
can be used by some small userspace helper daemon that will do the
auto-KSM job for us.
I think of two major consumers of remote KSM hints:
* hosts, that run containers, especially similar ones and especially in
a trusted environment, sharing the same runtime like Node.js;
* heavy applications, that can be run in multiple instances, not
limited to opensource ones like Firefox, but also those that cannot be
modified since they are binary-only and, maybe, statically linked.
Speaking of statistics, more numbers can be found in the very first
submission related to this one [1]. For my current setup with
two Firefox instances I get 100 to 200 MiB saved for the second instance
depending on the number of tabs.
1 FF instance with 15 tabs:
$ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
410
2 FF instances, second one has 12 tabs (all the tabs are different):
$ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
592
At the very moment I do not have specific numbers for containerised
workload, but those should be comparable in case the containers share
similar/same runtime.
[1] https://lore.kernel.org/patchwork/patch/1012142/
Link: http://lkml.kernel.org/r/20200302193630.68771-8-minchan@kernel.org
Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian@brauner.io>
Cc: Daniel Colascione <dancol@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <linux-man@vger.kernel.org>
From: Minchan Kim <minchan@kernel.org>
Subject: mm: support vector address ranges for process_madvise
This patch extends process_madvise(2) so that a) it supports vector
address ranges in a single system call and b) the vector address ranges
work for the local process as well as an external process.
An Android app has thousands of vmas due to zygote, so it's a total waste
of CPU and power if we have to call the syscall one by one for each vma.
(Testing a 2000-vma syscall vs a 1-vector syscall showed a 15%
performance improvement; I think it would be bigger in real practice
because the test ran in a very cache-friendly environment.)
Another potential use case for the vector range is to amortize the cost of
TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations. In the
future, we could find more use cases for other advice values, so let's
make this part of the API now, since we are introducing a new syscall at
this moment. With that, an existing madvise(2) user could replace it with
process_madvise(2) with their own pid if they want batch address range
support.
So finally, the API is as follows,
ssize_t process_madvise(idtype_t idtype, id_t id,
const struct iovec *iovec, unsigned long vlen,
int advice, unsigned long flags);
DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about address ranges of an external process as well as the
local process. It applies the advice to the address ranges of the process
described by iovec and vlen. The goal of such advice is to improve system
or application performance.
The idtype and id arguments select the target process to be advised as
follows:
idtype == P_PID
select the process whose process ID matches id
idtype == P_PIDFD
select the process referred to by the PID file descriptor
specified in id. (See pidfd_open(2) for further information.)
The pointer iovec points to an array of iovec structures, defined in
<sys/uio.h> as:
struct iovec {
void *iov_base; /* starting address */
size_t iov_len; /* number of bytes to be advised */
};
Each iovec element describes an address range beginning at the address
iov_base and extending for iov_len bytes.
The vlen argument represents the number of elements in iovec.
The advice is indicated in the advice argument, which, at this moment,
must be one of the following if the target process specified by idtype
and id is external:
MADV_COLD
MADV_PAGEOUT
MADV_MERGEABLE
MADV_UNMERGEABLE
Permission to provide a hint to external process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
process_madvise() supports every advice madvise(2) has if the target
process is in the same thread group as the calling process, so a user
could use process_madvise(2) as an extension of madvise(2) with support
for vector address ranges.
RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This return value may be less than the total number of requested
bytes if an error occurred. The caller should check the return value
to determine whether a partial advise occurred.
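As a usage illustration only, here is a hedged sketch of a caller following the prototype given above (the pid, addresses, and lengths are invented; no particular libc wrapper is implied, and process_madvise is assumed to be declared with the signature shown earlier):
	#include <stdio.h>
	#include <sys/mman.h>	/* MADV_PAGEOUT */
	#include <sys/uio.h>	/* struct iovec */
	#include <sys/wait.h>	/* P_PID / idtype_t */

	/* Ask the kernel to page out two ranges of the process with pid 1234. */
	static void hint_pageout(void)
	{
		struct iovec ranges[2] = {
			{ .iov_base = (void *)0x7f0000000000, .iov_len = 4096 },
			{ .iov_base = (void *)0x7f0000200000, .iov_len = 2 * 4096 },
		};
		ssize_t ret = process_madvise(P_PID, 1234, ranges, 2, MADV_PAGEOUT, 0);

		if (ret < 0)
			perror("process_madvise");
	}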
Link: http://lkml.kernel.org/r/20200423145215.72666-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: John Dias <joaodias@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
From: Minchan Kim <minchan@kernel.org>
Subject: mm: support compat_sys_process_madvise
This patch adds compat syscall support for process_madvise.
Link: http://lkml.kernel.org/r/20200423195835.GA46847@google.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Dias <joaodias@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: SeongJae Park <sj38.park@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
From: Randy Dunlap <rdunlap@infradead.org>
Subject: mm-support-vector-address-ranges-for-process_madvise-fix-fix
fix process_madvise prototype
Cc: Minchan Kim <minchan@kernel.org>
From: Zheng Bin <zhengbin13@huawei.com>
Subject: mm/madvise: make function 'do_process_madvise' static
Fix sparse warnings:
mm/madvise.c:1233:9: warning: symbol 'do_process_madvise' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20200429014030.41147-1-zhengbin13@huawei.com
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
From: Minchan Kim <minchan@kernel.org>
Subject: mm: fix s390 compat build error
Nathan reported build error with sys_compat_process_madvise.
This patch should fix it.
Link: http://lkml.kernel.org/r/20200429012421.GA132200@google.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com> [build]
From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-support-vector-address-ranges-for-process_madvise-fix-fix-fix-fix-fix
add compat_sys_process_madvise to mips syscall table
Conflicts:
fs/io_uring.c
mm/madvise.c
Change-Id: I89b92904043c6d7fbf9747746d20b823dbc20410
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Git-commit: 82f576dd0298d675df9b19ac0638d79b5ca79e59
Git-Repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[charante@codeaurora.org: Fixed merge conflicts]
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
2020-05-22 13:28:19 +09:00
|
|
|
extern int do_madvise(struct task_struct *target_task, struct mm_struct *mm,
|
|
|
|
unsigned long start, size_t len_in, int behavior);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2015-09-10 07:39:29 +09:00
|
|
|
static inline unsigned long
|
|
|
|
do_mmap_pgoff(struct file *file, unsigned long addr,
|
|
|
|
unsigned long len, unsigned long prot, unsigned long flags,
|
2017-02-25 07:58:22 +09:00
|
|
|
unsigned long pgoff, unsigned long *populate,
|
|
|
|
struct list_head *uf)
|
2015-09-10 07:39:29 +09:00
|
|
|
{
|
2017-02-25 07:58:22 +09:00
|
|
|
return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate, uf);
|
2015-09-10 07:39:29 +09:00
|
|
|
}
|
|
|
|
|
2013-02-23 09:32:37 +09:00
|
|
|
#ifdef CONFIG_MMU
|
|
|
|
extern int __mm_populate(unsigned long addr, unsigned long len,
|
|
|
|
int ignore_errors);
|
|
|
|
static inline void mm_populate(unsigned long addr, unsigned long len)
|
|
|
|
{
|
|
|
|
/* Ignore errors */
|
|
|
|
(void) __mm_populate(addr, len, 1);
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline void mm_populate(unsigned long addr, unsigned long len) {}
|
|
|
|
#endif
|
|
|
|
|
2012-04-21 07:35:40 +09:00
|
|
|
/* These take the mm semaphore themselves */
|
2016-05-28 07:57:31 +09:00
|
|
|
extern int __must_check vm_brk(unsigned long, unsigned long);
|
powerpc: do not make the entire heap executable
On 32-bit powerpc the ELF PLT sections of binaries (built with
--bss-plt, or with a toolchain which defaults to it) look like this:
[17] .sbss NOBITS 0002aff8 01aff8 000014 00 WA 0 0 4
[18] .plt NOBITS 0002b00c 01aff8 000084 00 WAX 0 0 4
[19] .bss NOBITS 0002b090 01aff8 0000a4 00 WA 0 0 4
Which results in an ELF load header:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000
This is all correct, the load region containing the PLT is marked as
executable. Note that the PLT starts at 0002b00c but the file mapping
ends at 0002aff8, so the PLT falls in the 0 fill section described by
the load header, and after a page boundary.
Unfortunately the generic ELF loader ignores the X bit in the load
headers when it creates the 0 filled non-file backed mappings. It
assumes all of these mappings are RW BSS sections, which is not the case
for PPC.
gcc/ld has an option (--secure-plt) to not do this, this is said to
incur a small performance penalty.
Currently, to support 32-bit binaries with the PLT in BSS, the kernel maps
the *entire brk area* with executable rights for all binaries, even
--secure-plt ones.
Stop doing that.
Teach the ELF loader to check the X bit in the relevant load header and
create 0 filled anonymous mappings that are executable if the load
header requests that.
Test program showing the difference in /proc/$PID/maps:
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
	char buf[16*1024];
	char *p = malloc(123); /* make "[heap]" mapping appear */
	int fd = open("/proc/self/maps", O_RDONLY);
	int len = read(fd, buf, sizeof(buf));
	write(1, buf, len);
	printf("%p\n", p);
	return 0;
}
Compiled using: gcc -mbss-plt -m32 -Os test.c -otest
Unpatched ppc64 kernel:
00100000-00120000 r-xp 00000000 00:00 0 [vdso]
0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
10690000-106c0000 rwxp 00000000 00:00 0 [heap]
f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack]
0x10690008
Patched ppc64 kernel:
00100000-00120000 r-xp 00000000 00:00 0 [vdso]
0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
10180000-101b0000 rw-p 00000000 00:00 0 [heap]
^^^^ this has changed
f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
ff860000-ff890000 rw-p 00000000 00:00 0 [stack]
0x10180008
The patch was originally posted in 2012 by Jason Gunthorpe
and apparently ignored:
https://lkml.org/lkml/2012/9/30/138
Lightly run-tested.
Link: http://lkml.kernel.org/r/20161215131950.23054-1-dvlasenk@redhat.com
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-23 08:45:16 +09:00
|
|
|
extern int __must_check vm_brk_flags(unsigned long, unsigned long, unsigned long);
|
2012-04-21 10:57:04 +09:00
|
|
|
extern int vm_munmap(unsigned long, size_t);
|
2016-05-24 08:25:30 +09:00
|
|
|
extern unsigned long __must_check vm_mmap(struct file *, unsigned long,
|
2012-04-21 09:13:58 +09:00
|
|
|
unsigned long, unsigned long,
|
|
|
|
unsigned long, unsigned long);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2012-12-12 09:01:49 +09:00
|
|
|
struct vm_unmapped_area_info {
|
|
|
|
#define VM_UNMAPPED_AREA_TOPDOWN 1
|
|
|
|
unsigned long flags;
|
|
|
|
unsigned long length;
|
|
|
|
unsigned long low_limit;
|
|
|
|
unsigned long high_limit;
|
|
|
|
unsigned long align_mask;
|
|
|
|
unsigned long align_offset;
|
|
|
|
};
|
|
|
|
|
|
|
|
extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
|
|
|
|
extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Search for an unmapped address range.
|
|
|
|
*
|
|
|
|
* We are looking for a range that:
|
|
|
|
* - does not intersect with any VMA;
|
|
|
|
* - is contained within the [low_limit, high_limit) interval;
|
|
|
|
* - is at least the desired size.
|
|
|
|
* - satisfies (begin_addr & align_mask) == (align_offset & align_mask)
|
|
|
|
*/
|
|
|
|
static inline unsigned long
|
|
|
|
vm_unmapped_area(struct vm_unmapped_area_info *info)
|
|
|
|
{
|
2015-04-16 08:14:47 +09:00
|
|
|
if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
|
2012-12-12 09:01:49 +09:00
|
|
|
return unmapped_area_topdown(info);
|
2015-04-16 08:14:47 +09:00
|
|
|
else
|
|
|
|
return unmapped_area(info);
|
2012-12-12 09:01:49 +09:00
|
|
|
}
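For illustration, a hedged sketch of how an arch_get_unmapped_area()-style caller might fill the structure for a bottom-up search (the limits and variables below are placeholders, not any particular architecture's policy):
	struct vm_unmapped_area_info info;

	info.flags = 0;				/* bottom-up; set VM_UNMAPPED_AREA_TOPDOWN otherwise */
	info.length = len;			/* size of the mapping being placed */
	info.low_limit = mm->mmap_base;		/* placeholder lower bound */
	info.high_limit = TASK_SIZE;		/* placeholder upper bound */
	info.align_mask = 0;			/* no extra alignment constraint */
	info.align_offset = 0;
	addr = vm_unmapped_area(&info);		/* a usable address, or an error value on failure */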
|
|
|
|
|
2011-07-26 09:12:23 +09:00
|
|
|
/* truncate.c */
|
2005-04-17 07:20:36 +09:00
|
|
|
extern void truncate_inode_pages(struct address_space *, loff_t);
|
2006-01-06 17:10:36 +09:00
|
|
|
extern void truncate_inode_pages_range(struct address_space *,
|
|
|
|
loff_t lstart, loff_t lend);
|
2014-04-04 06:47:49 +09:00
|
|
|
extern void truncate_inode_pages_final(struct address_space *);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
|
|
|
/* generic vm_area_ops exported for stackable file systems */
|
2018-06-08 09:08:00 +09:00
|
|
|
extern vm_fault_t filemap_fault(struct vm_fault *vmf);
|
2016-12-15 08:06:58 +09:00
|
|
|
extern void filemap_map_pages(struct vm_fault *vmf,
|
2016-07-27 07:25:20 +09:00
|
|
|
pgoff_t start_pgoff, pgoff_t end_pgoff);
|
2018-06-08 09:08:00 +09:00
|
|
|
extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
|
|
|
/* mm/page-writeback.c */
|
2017-07-06 04:26:48 +09:00
|
|
|
int __must_check write_one_page(struct page *page);
|
2009-02-19 07:48:18 +09:00
|
|
|
void task_dirty_inc(struct task_struct *tsk);
|
2005-04-17 07:20:36 +09:00
|
|
|
|
|
|
|
/* readahead.c */
|
2013-10-08 15:47:59 +09:00
|
|
|
#define VM_READAHEAD_PAGES (SZ_512K / PAGE_SIZE)
|
2005-04-17 07:20:36 +09:00
|
|
|
|
|
|
|
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
|
2005-11-07 17:59:28 +09:00
|
|
|
pgoff_t offset, unsigned long nr_to_read);
|
2007-07-19 17:48:08 +09:00
|
|
|
|
|
|
|
void page_cache_sync_readahead(struct address_space *mapping,
|
|
|
|
struct file_ra_state *ra,
|
|
|
|
struct file *filp,
|
|
|
|
pgoff_t offset,
|
|
|
|
unsigned long size);
|
|
|
|
|
|
|
|
void page_cache_async_readahead(struct address_space *mapping,
|
|
|
|
struct file_ra_state *ra,
|
|
|
|
struct file *filp,
|
|
|
|
struct page *pg,
|
|
|
|
pgoff_t offset,
|
|
|
|
unsigned long size);
|
|
|
|
|
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs like gid_t buffer[NGROUPS_MAX],
which is 256kB, or stack strings with MAX_ARG_STRLEN.
This becomes especially dangerous for suid binaries with the default
unlimited stack size limit, because those applications can be
tricked into consuming a large portion of the stack, and a single glibc
call could jump over the guard page. These attacks are not theoretical,
unfortunately.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somewhat inherent, but it should reduce the attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation-wise, first delete all the old code for the stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 20:03:24 +09:00
|
|
|
extern unsigned long stack_guard_gap;
|
2011-05-25 09:11:44 +09:00
|
|
|
/* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
|
2005-10-30 10:16:20 +09:00
|
|
|
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
|
2011-05-25 09:11:44 +09:00
|
|
|
|
|
|
|
/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */
|
|
|
|
extern int expand_downwards(struct vm_area_struct *vma,
|
|
|
|
unsigned long address);
|
2010-08-25 03:44:18 +09:00
|
|
|
#if VM_GROWSUP
|
2005-10-30 10:16:20 +09:00
|
|
|
extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);
|
2010-08-25 03:44:18 +09:00
|
|
|
#else
|
2015-01-07 06:00:05 +09:00
|
|
|
#define expand_upwards(vma, address) (0)
|
2005-11-19 06:16:42 +09:00
|
|
|
#endif
|
2005-04-17 07:20:36 +09:00
|
|
|
|
|
|
|
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
|
|
|
|
extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
|
|
|
|
extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
|
|
|
|
struct vm_area_struct **pprev);
|
|
|
|
|
|
|
|
/* Look up the first VMA which intersects the interval start_addr..end_addr-1,
|
|
|
|
NULL if none. Assume start_addr < end_addr. */
|
|
|
|
static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr)
|
|
|
|
{
|
|
|
|
struct vm_area_struct * vma = find_vma(mm,start_addr);
|
|
|
|
|
|
|
|
if (vma && end_addr <= vma->vm_start)
|
|
|
|
vma = NULL;
|
|
|
|
return vma;
|
|
|
|
}
|
|
|
|
|
2017-06-19 20:03:24 +09:00
|
|
|
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
unsigned long vm_start = vma->vm_start;
|
|
|
|
|
|
|
|
if (vma->vm_flags & VM_GROWSDOWN) {
|
|
|
|
vm_start -= stack_guard_gap;
|
|
|
|
if (vm_start > vma->vm_start)
|
|
|
|
vm_start = 0;
|
|
|
|
}
|
|
|
|
return vm_start;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
unsigned long vm_end = vma->vm_end;
|
|
|
|
|
|
|
|
if (vma->vm_flags & VM_GROWSUP) {
|
|
|
|
vm_end += stack_guard_gap;
|
|
|
|
if (vm_end < vma->vm_end)
|
|
|
|
vm_end = -PAGE_SIZE;
|
|
|
|
}
|
|
|
|
return vm_end;
|
|
|
|
}
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
static inline unsigned long vma_pages(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
|
|
|
|
}
|
|
|
|
|
2012-01-11 08:11:23 +09:00
|
|
|
/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
|
|
|
|
static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
|
|
|
|
unsigned long vm_start, unsigned long vm_end)
|
|
|
|
{
|
|
|
|
struct vm_area_struct *vma = find_vma(mm, vm_start);
|
|
|
|
|
|
|
|
if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
|
|
|
|
vma = NULL;
|
|
|
|
|
|
|
|
return vma;
|
|
|
|
}
|
|
|
|
|
2018-10-06 07:51:29 +09:00
|
|
|
static inline bool range_in_vma(struct vm_area_struct *vma,
|
|
|
|
unsigned long start, unsigned long end)
|
|
|
|
{
|
|
|
|
return (vma && vma->vm_start <= start && end <= vma->vm_end);
|
|
|
|
}
|
|
|
|
|
2010-08-27 00:00:34 +09:00
|
|
|
#ifdef CONFIG_MMU
|
2006-07-27 05:39:49 +09:00
|
|
|
pgprot_t vm_get_page_prot(unsigned long vm_flags);
|
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set. If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.
Here's a simple code snippet to demonstrate the bug:
char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
assert(*m == '\0');     /* new PTE allows write access */
assert(!soft_dirty(m));
*m = 'x';               /* should dirty the page */
assert(soft_dirty(m));  /* fails */
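The snippet assumes a soft_dirty() helper; a hedged sketch of one is below (it reads the soft-dirty flag, bit 55 of the corresponding /proc/self/pagemap entry, for the page containing the given address; the helper name and the missing error handling are purely illustrative):
	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	static int soft_dirty(void *addr)
	{
		uint64_t entry = 0;
		long pagesize = sysconf(_SC_PAGESIZE);
		int fd = open("/proc/self/pagemap", O_RDONLY);

		/* pagemap holds one 64-bit entry per virtual page; bit 55 is the soft-dirty bit. */
		pread(fd, &entry, sizeof(entry),
		      ((uintptr_t)addr / pagesize) * sizeof(entry));
		close(fd);
		return (int)((entry >> 55) & 1);
	}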
With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared. Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.
As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect. An analogous bug was fixed in mmap by
commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 07:55:46 +09:00
|
|
|
void vma_set_page_prot(struct vm_area_struct *vma);
|
2010-08-27 00:00:34 +09:00
|
|
|
#else
|
|
|
|
static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
|
|
|
|
{
|
|
|
|
return __pgprot(0);
|
|
|
|
}
|
2014-10-14 07:55:46 +09:00
|
|
|
static inline void vma_set_page_prot(struct vm_area_struct *vma)
|
|
|
|
{
|
|
|
|
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
|
|
|
|
}
|
2010-08-27 00:00:34 +09:00
|
|
|
#endif
|
|
|
|
|
2013-12-06 03:38:22 +09:00
|
|
|
#ifdef CONFIG_NUMA_BALANCING
|
2012-10-25 21:16:32 +09:00
|
|
|
unsigned long change_prot_numa(struct vm_area_struct *vma,
|
2012-10-25 21:16:32 +09:00
|
|
|
unsigned long start, unsigned long end);
|
|
|
|
#endif
|
|
|
|
|
2005-10-30 10:16:33 +09:00
|
|
|
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
|
|
|
|
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
|
|
|
|
unsigned long pfn, unsigned long size, pgprot_t);
|
2005-12-01 02:35:19 +09:00
|
|
|
int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
|
2019-05-14 09:21:56 +09:00
|
|
|
int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
|
|
|
|
unsigned long num);
|
|
|
|
int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
|
|
|
|
unsigned long num);
|
2018-10-27 07:04:29 +09:00
|
|
|
vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
|
2007-02-12 17:51:36 +09:00
|
|
|
unsigned long pfn);
|
2018-10-27 07:04:13 +09:00
|
|
|
vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
|
2015-12-30 13:12:20 +09:00
|
|
|
unsigned long pfn, pgprot_t pgprot);
|
2018-10-27 07:04:10 +09:00
|
|
|
vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
|
2016-01-16 09:56:40 +09:00
|
|
|
pfn_t pfn);
|
2018-06-08 09:04:29 +09:00
|
|
|
vm_fault_t vmf_insert_mixed_mkwrite(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr, pfn_t pfn);
|
2013-04-17 05:45:37 +09:00
|
|
|
int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
|
|
|
|
|
2018-04-06 08:25:23 +09:00
|
|
|
static inline vm_fault_t vmf_insert_page(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr, struct page *page)
|
|
|
|
{
|
|
|
|
int err = vm_insert_page(vma, addr, page);
|
|
|
|
|
|
|
|
if (err == -ENOMEM)
|
|
|
|
return VM_FAULT_OOM;
|
|
|
|
if (err < 0 && err != -EBUSY)
|
|
|
|
return VM_FAULT_SIGBUS;
|
|
|
|
|
|
|
|
return VM_FAULT_NOPAGE;
|
|
|
|
}
|
|
|
|
|
2020-11-02 10:08:00 +09:00
|
|
|
#ifndef io_remap_pfn_range
|
|
|
|
static inline int io_remap_pfn_range(struct vm_area_struct *vma,
|
|
|
|
unsigned long addr, unsigned long pfn,
|
|
|
|
unsigned long size, pgprot_t prot)
|
|
|
|
{
|
|
|
|
return remap_pfn_range(vma, addr, pfn, size, pgprot_decrypted(prot));
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-05-19 08:08:47 +09:00
|
|
|
static inline vm_fault_t vmf_error(int err)
|
|
|
|
{
|
|
|
|
if (err == -ENOMEM)
|
|
|
|
return VM_FAULT_OOM;
|
|
|
|
return VM_FAULT_SIGBUS;
|
|
|
|
}
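As a hedged illustration (example_fault() and example_lookup_page() are hypothetical, not part of this header), a ->fault handler can use vmf_error() to translate an errno returned by a helper into a vm_fault_t:

static vm_fault_t example_fault(struct vm_fault *vmf)
{
	struct page *page;
	int err = example_lookup_page(vmf->pgoff, &page);

	if (err)
		return vmf_error(err);	/* -ENOMEM -> VM_FAULT_OOM, else VM_FAULT_SIGBUS */
	vmf->page = page;
	return 0;
}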
|
|
|
|
|
2018-10-27 07:10:28 +09:00
|
|
|
struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
|
|
|
|
unsigned int foll_flags);
|
2013-02-23 09:35:56 +09:00
|
|
|
|
2005-10-30 10:16:33 +09:00
|
|
|
#define FOLL_WRITE 0x01 /* check pte is writable */
|
|
|
|
#define FOLL_TOUCH 0x02 /* mark page accessed */
|
|
|
|
#define FOLL_GET 0x04 /* do get_page on page */
|
2009-09-22 09:03:26 +09:00
|
|
|
#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
|
2009-09-22 09:03:31 +09:00
|
|
|
#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
|
2011-03-23 08:30:51 +09:00
|
|
|
#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
|
|
|
|
* and return without waiting upon it */
|
2015-04-15 07:44:37 +09:00
|
|
|
#define FOLL_POPULATE 0x40 /* fault in page */
|
2011-01-14 08:46:55 +09:00
|
|
|
#define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */
|
2011-01-30 12:15:48 +09:00
|
|
|
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
|
2012-10-06 04:36:27 +09:00
|
|
|
#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
|
mm,ksm: FOLL_MIGRATION do migration_entry_wait
In "ksm: remove old stable nodes more thoroughly" I said that I'd never
seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing,
but it soon appeared once I tried fuller tests on the whole series.
It turned out to be due to the KSM page migration itself: unmerge_and_
remove_all_rmap_items() failed to locate and replace all the KSM pages,
because of that hiatus in page migration when old pte has been replaced
by migration entry, but not yet by new pte. follow_page() finds no page
at that instant, but a KSM page reappears shortly after, without a
fault.
Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
for KSM's break_cow(). I'd have preferred to avoid another flag, and do
it every time, in case someone else makes the same easy mistake; but did
not find another transgressor (the common get_user_pages() is of course
safe), and cannot be sure that every follow_page() caller is prepared to
sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
already sleep there, since anon_vma locking was changed to mutex, but
maybe that's somehow excluded.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 09:36:07 +09:00
|
|
|
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
|
2014-09-18 02:51:48 +09:00
|
|
|
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
|
2015-11-06 11:51:36 +09:00
|
|
|
#define FOLL_MLOCK 0x1000 /* lock present pages */
|
2016-02-13 06:01:54 +09:00
|
|
|
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
|
2016-10-14 05:07:36 +09:00
|
|
|
#define FOLL_COW 0x4000 /* internal GUP flag */
|
2018-05-11 15:11:44 +09:00
|
|
|
#define FOLL_ANON 0x8000 /* don't do file mappings */
|
mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM
Patch series "Add FOLL_LONGTERM to GUP fast and use it".
HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages. These pages can be held for a significant time. But
get_user_pages_fast() does not protect against mapping FS DAX pages.
Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks. XDP has also
shown interest in using this functionality.[1]
In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.
[1] https://lkml.org/lkml/2019/3/19/939
"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move. I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem. Then I think we can change the flag to a
better name.
Secondly, it depends on how often you are registering memory. I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance. I don't have the numbers as the
tests for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.
Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they rework the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.
As an aside, Jason pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.
This patch (of 7):
This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast(). Some callers would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.
Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.
This patch does not change any functionality. In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked. However, callers of get_user_pages_fast()
were not "protected".
FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.
NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages. This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.
As a side effect some of the interfaces are cleaned up but this is not the
primary purpose of the series.
In review[1] it was asked:
<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?
A couple of points.
First "longterm" is a relative thing and at this point is probably a
misnomer. This is really flagging a pin which is going to be given to
hardware and can't move. I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem. Then I think we can
change the flag to a better name.
Second, it depends on how often you are registering memory. I have spoken
with some RDMA users who consider MR in the performance path... For the
overall application performance. I don't have the numbers as the tests
for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.
Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they rework the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.
As an aside, Jason pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.
</quote>
[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
[ira.weiny@intel.com: v3]
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:17:03 +09:00
|
|
|
#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
|
2019-09-24 07:38:25 +09:00
|
|
|
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
|
2019-05-14 09:17:03 +09:00
|
|
|
|
|
|
|
/*
|
|
|
|
* NOTE on FOLL_LONGTERM:
|
|
|
|
*
|
|
|
|
* FOLL_LONGTERM indicates that the page will be held for an indefinite time
|
|
|
|
* period _often_ under userspace control. This is contrasted with
|
|
|
|
* iov_iter_get_pages() where usages are transient.
|
|
|
|
*
|
|
|
|
* FIXME: For pages which are part of a filesystem, mappings are subject to the
|
|
|
|
* lifetime enforced by the filesystem and we need guarantees that longterm
|
|
|
|
* users like RDMA and V4L2 only establish mappings which coordinate usage with
|
|
|
|
* the filesystem. Ideas for this coordination include revoking the longterm
|
|
|
|
* pin, delaying writeback, bounce buffer page writeback, etc. As FS DAX was
|
|
|
|
* added after the problem with filesystems was found, FS DAX VMAs are
|
|
|
|
* specifically failed. Filesystem pages are still subject to bugs and use of
|
|
|
|
* FOLL_LONGTERM should be avoided on those pages.
|
|
|
|
*
|
|
|
|
* FIXME: Also NOTE that FOLL_LONGTERM is not supported in every GUP call.
|
|
|
|
* Currently only get_user_pages() and get_user_pages_fast() support this flag
|
|
|
|
* and calls to get_user_pages_[un]locked are specifically not allowed. This
|
|
|
|
* is due to an incompatibility with the FS DAX check and
|
|
|
|
* FAULT_FLAG_ALLOW_RETRY
|
|
|
|
*
|
|
|
|
* In the CMA case: longterm pins in a CMA region would unnecessarily fragment
|
|
|
|
* that region. And so CMA attempts to migrate the page before pinning when
|
|
|
|
* FOLL_LONGTERM is specified.
|
|
|
|
*/
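As a hedged sketch of the usage described in the note above (example_pin_user_buffer() is hypothetical, not part of this header), a driver taking a long-lived pin passes FOLL_LONGTERM to get_user_pages_fast() and drops the references with put_page() once the hardware is done with them:

static int example_pin_user_buffer(unsigned long uaddr, int nr_pages,
				   struct page **pages)
{
	int pinned = get_user_pages_fast(uaddr, nr_pages,
					 FOLL_WRITE | FOLL_LONGTERM, pages);

	if (pinned < 0)
		return pinned;		/* fault, or e.g. FS DAX rejection */
	if (pinned != nr_pages) {
		while (pinned--)	/* partial pin: release and fail */
			put_page(pages[pinned]);
		return -EFAULT;
	}
	return 0;
}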
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2018-08-24 09:01:36 +09:00
|
|
|
static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
|
2017-06-03 06:46:46 +09:00
|
|
|
{
|
|
|
|
if (vm_fault & VM_FAULT_OOM)
|
|
|
|
return -ENOMEM;
|
|
|
|
if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
|
|
|
|
return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
|
|
|
|
if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
|
|
|
|
return -EFAULT;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-07-12 12:58:43 +09:00
|
|
|
typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data);
|
2007-05-07 06:48:54 +09:00
|
|
|
extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
|
|
|
|
unsigned long size, pte_fn_t fn, void *data);
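A minimal sketch of a pte_fn_t callback for apply_to_page_range() (example_count_present() and its counter are hypothetical): the callback is invoked once per PTE in the range, and returning a nonzero value stops the walk with that error.

static int example_count_present(pte_t *pte, unsigned long addr, void *data)
{
	unsigned long *count = data;

	if (pte_present(*pte))
		(*count)++;
	return 0;
}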
|
|
|
|
|
2005-04-17 07:20:36 +09:00
|
|
|
|
2016-03-16 06:56:27 +09:00
|
|
|
#ifdef CONFIG_PAGE_POISONING
|
|
|
|
extern bool page_poisoning_enabled(void);
|
|
|
|
extern void kernel_poison_pages(struct page *page, int numpages, int enable);
|
|
|
|
#else
|
|
|
|
static inline bool page_poisoning_enabled(void) { return false; }
|
|
|
|
static inline void kernel_poison_pages(struct page *page, int numpages,
|
|
|
|
int enable) { }
|
|
|
|
#endif
|
|
|
|
|
mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
Patch series "add init_on_alloc/init_on_free boot options", v10.
Provide init_on_alloc and init_on_free boot options.
These are aimed at preventing possible information leaks and making the
control-flow bugs that depend on uninitialized values more deterministic.
Enabling either of the options guarantees that the memory returned by the
page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
isn't supported at the moment, as its emulation of kmem caches complicates
handling of SLAB_TYPESAFE_BY_RCU caches correctly.
Enabling init_on_free also guarantees that pages and heap objects are
initialized right after they're freed, so it won't be possible to access
stale data by using a dangling pointer.
As suggested by Michal Hocko, right now we don't let the heap users
disable initialization for certain allocations. There's not enough
evidence that doing so can speed up real-life cases, and introducing ways
to opt-out may result in things going out of control.
This patch (of 2):
The new options are needed to prevent possible information leaks and make
control-flow bugs that depend on uninitialized values more deterministic.
This is expected to be on-by-default on Android and Chrome OS. And it
gives the opportunity for anyone else to use it under distros too via the
boot args. (The init_on_free feature is regularly requested by folks
where memory forensics is included in their threat models.)
init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
objects with zeroes. Initialization is done at allocation time at the
places where checks for __GFP_ZERO are performed.
init_on_free=1 makes the kernel initialize freed pages and heap objects
with zeroes upon their deletion. This helps to ensure sensitive data
doesn't leak via use-after-free accesses.
Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
returns zeroed memory. The two exceptions are slab caches with
constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
zero-initialized to preserve their semantics.
Both init_on_alloc and init_on_free default to zero, but those defaults
can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
CONFIG_INIT_ON_FREE_DEFAULT_ON.
If either SLUB poisoning or page poisoning is enabled, those options take
precedence over init_on_alloc and init_on_free: initialization is only
applied to unpoisoned allocations.
Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
is within the standard error.
The new features are also going to pave the way for hardware memory
tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
hooks to set the tags for heap objects. With MTE, tagging will have the
same cost as memory initialization.
Although init_on_free is rather costly, there are paranoid use-cases where
in-memory data lifetime is desired to be minimized. There are various
arguments for/against the realism of the associated threat models, but
given that we'll need the infrastructure for MTE anyway, and there are
people who want wipe-on-free behavior no matter what the performance cost,
it seems reasonable to include it in this series.
[glider@google.com: v8]
Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
[glider@google.com: v9]
Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
[glider@google.com: v10]
Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts]
Acked-by: James Morris <jamorris@linux.microsoft.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Sandeep Patil <sspatil@android.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Jann Horn <jannh@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 12:59:19 +09:00
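For example, both behaviours described above can be requested together on the kernel command line (boot-args usage, using the option names introduced by this patch):

	init_on_alloc=1 init_on_free=1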
|
|
|
#ifdef CONFIG_INIT_ON_ALLOC_DEFAULT_ON
|
|
|
|
DECLARE_STATIC_KEY_TRUE(init_on_alloc);
|
|
|
|
#else
|
|
|
|
DECLARE_STATIC_KEY_FALSE(init_on_alloc);
|
|
|
|
#endif
|
|
|
|
static inline bool want_init_on_alloc(gfp_t flags)
|
|
|
|
{
|
|
|
|
if (static_branch_unlikely(&init_on_alloc) &&
|
|
|
|
!page_poisoning_enabled())
|
|
|
|
return true;
|
|
|
|
return flags & __GFP_ZERO;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef CONFIG_INIT_ON_FREE_DEFAULT_ON
|
|
|
|
DECLARE_STATIC_KEY_TRUE(init_on_free);
|
|
|
|
#else
|
|
|
|
DECLARE_STATIC_KEY_FALSE(init_on_free);
|
|
|
|
#endif
|
|
|
|
static inline bool want_init_on_free(void)
|
|
|
|
{
|
|
|
|
return static_branch_unlikely(&init_on_free) &&
|
|
|
|
!page_poisoning_enabled();
|
|
|
|
}
|
|
|
|
|
mm, debug_pagealloc: don't rely on static keys too early
commit 8e57f8acbbd121ecfb0c9dc13b8b030f86c6bd3b upstream.
Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
debugging") has introduced a static key to reduce overhead when
debug_pagealloc is compiled in but not enabled. It relied on the
assumption that jump_label_init() is called before parse_early_param()
as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
it is safe to enable the static key.
However, it turns out multiple architectures call parse_early_param()
earlier from their setup_arch(). x86 also calls jump_label_init() even
earlier, so no issue was found while testing the commit, but same is not
true for e.g. ppc64 and s390 where the kernel would not boot with
debug_pagealloc=on as found by our QA.
To fix this without tricky changes to init code of multiple
architectures, this patch partially reverts the static key conversion
from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
code) of debug_pagealloc_enabled() will again test a simple bool
variable. Fastpath mm code is converted to a new
debug_pagealloc_enabled_static() variant that relies on the static key,
which is enabled in a well-defined point in mm_init() where it's
guaranteed that jump_label_init() has been called, regardless of
architecture.
[sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-14 09:29:20 +09:00
|
|
|
#ifdef CONFIG_DEBUG_PAGEALLOC
|
|
|
|
extern void init_debug_pagealloc(void);
|
2019-07-12 12:55:06 +09:00
|
|
|
#else
|
2020-01-14 09:29:20 +09:00
|
|
|
static inline void init_debug_pagealloc(void) {}
|
2019-07-12 12:55:06 +09:00
|
|
|
#endif
|
2020-01-14 09:29:20 +09:00
|
|
|
extern bool _debug_pagealloc_enabled_early;
|
|
|
|
DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);
|
2014-12-13 09:55:52 +09:00
|
|
|
|
|
|
|
static inline bool debug_pagealloc_enabled(void)
|
2020-01-14 09:29:20 +09:00
|
|
|
{
|
|
|
|
return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
|
|
|
|
_debug_pagealloc_enabled_early;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* For use in fast paths after init_debug_pagealloc() has run, or when a
|
|
|
|
* false negative result is not harmful when called too early.
|
|
|
|
*/
|
|
|
|
static inline bool debug_pagealloc_enabled_static(void)
|
2014-12-13 09:55:52 +09:00
|
|
|
{
|
2019-07-12 12:55:06 +09:00
|
|
|
if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return static_branch_unlikely(&_debug_pagealloc_enabled);
|
2014-12-13 09:55:52 +09:00
|
|
|
}
|
|
|
|
|
2019-04-26 09:11:35 +09:00
|
|
|
#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP)
|
|
|
|
extern void __kernel_map_pages(struct page *page, int numpages, int enable);
|
|
|
|
|
mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabled
commit c87cbc1f007c4b46165f05ceca04e1973cda0b9c upstream.
Commit cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
fixed memory hotplug with debug_pagealloc enabled, where onlining a page
goes through page freeing, which removes the direct mapping. Some arches
don't like when the page is not mapped in the first place, so
generic_online_page() maps it first. This is somewhat wasteful, but
better than special casing page freeing fast paths.
The commit however missed that DEBUG_PAGEALLOC configured doesn't mean
it's actually enabled. One has to test debug_pagealloc_enabled() since
031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime
configurable"), or alternatively debug_pagealloc_enabled_static() since
8e57f8acbbd1 ("mm, debug_pagealloc: don't rely on static keys too early"),
but this is not done.
As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
will crash:
Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 0000000000000000 TEID: 0000000000000483
Fault in home space mode while using kernel ASCE.
AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
Oops: 0004 ilc:2 [#1] SMP
CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
0000000000000001 0000000000000000 0000000000000002 0000000000000100
0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
>0000001ecd281b9e: 94fb5006 ni 6(%r5),251
0000001ecd281ba2: 41505008 la %r5,8(%r5)
0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
0000001ecd281bac: 1a07 ar %r0,%r7
0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
Call Trace:
[<0000001ecd281b9e>] __kernel_map_pages+0x166/0x188
[<0000001ecd4d9516>] online_pages_range+0xf6/0x128
[<0000001ecd2a8186>] walk_system_ram_range+0x7e/0xd8
[<0000001ecda28aae>] online_pages+0x2fe/0x3f0
[<0000001ecd7d02a6>] memory_subsys_online+0x8e/0xc0
[<0000001ecd7add42>] device_online+0x5a/0xc8
[<0000001ecd7d0430>] state_store+0x88/0x118
[<0000001ecd5b9f62>] kernfs_fop_write+0xc2/0x200
[<0000001ecd5064b6>] vfs_write+0x176/0x1e0
[<0000001ecd50676a>] ksys_write+0xa2/0x100
[<0000001ecda315d4>] system_call+0xd8/0x2c8
Fix this by checking debug_pagealloc_enabled_static() before calling
kernel_map_pages(). Backports for kernels before 5.5 should use
debug_pagealloc_enabled() instead. Also add comments.
Fixes: cd02cf1aceea ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-06 15:28:42 +09:00
|
|
|
/*
|
|
|
|
* When called in DEBUG_PAGEALLOC context, the call should most likely be
|
|
|
|
* guarded by debug_pagealloc_enabled() or debug_pagealloc_enabled_static()
|
|
|
|
*/
|
2014-12-13 09:55:52 +09:00
|
|
|
static inline void
|
|
|
|
kernel_map_pages(struct page *page, int numpages, int enable)
|
|
|
|
{
|
|
|
|
__kernel_map_pages(page, numpages, enable);
|
|
|
|
}
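As a hedged sketch of the guarded-call pattern referred to in the comment above (example_online_page() is hypothetical), fast-path callers check debug_pagealloc_enabled_static() before touching the direct mapping:

static inline void example_online_page(struct page *page)
{
	if (debug_pagealloc_enabled_static())
		kernel_map_pages(page, 1, 1);	/* map the page before it is used */
}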
|
2008-02-20 09:47:44 +09:00
|
|
|
#ifdef CONFIG_HIBERNATION
|
|
|
|
extern bool kernel_page_present(struct page *page);
|
2016-03-16 06:54:21 +09:00
|
|
|
#endif /* CONFIG_HIBERNATION */
|
2019-04-26 09:11:35 +09:00
|
|
|
#else /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
|
2005-04-17 07:20:36 +09:00
|
|
|
static inline void
|
2006-10-11 17:21:30 +09:00
|
|
|
kernel_map_pages(struct page *page, int numpages, int enable) {}
|
2008-02-20 09:47:44 +09:00
|
|
|
#ifdef CONFIG_HIBERNATION
|
|
|
|
static inline bool kernel_page_present(struct page *page) { return true; }
|
2016-03-16 06:54:21 +09:00
|
|
|
#endif /* CONFIG_HIBERNATION */
|
2019-04-26 09:11:35 +09:00
|
|
|
#endif /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
|
2005-04-17 07:20:36 +09:00
|
|
|
|
arm64,ia64,ppc,s390,sh,tile,um,x86,mm: remove default gate area
The core mm code will provide a default gate area based on
FIXADDR_USER_START and FIXADDR_USER_END if
!defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR).
This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit
UML, and x86_32 have their own code just to disable it. arm, 32-bit UML,
and x86_64 have gate areas, but they have their own implementations.
This gets rid of the default and moves the code into ia64.
This should save some code on architectures without a gate area: it's now
possible to inline the gate_area functions in the default case.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Nathan Lynch <nathan_lynch@mentor.com>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [in principle]
Acked-by: Richard Weinberger <richard@nod.at> [for um]
Acked-by: Will Deacon <will.deacon@arm.com> [for arm64]
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Nathan Lynch <Nathan_Lynch@mentor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-09 06:23:40 +09:00
|
|
|
#ifdef __HAVE_ARCH_GATE_AREA
|
2011-03-14 04:49:15 +09:00
|
|
|
extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
|
2014-08-09 06:23:40 +09:00
|
|
|
extern int in_gate_area_no_mm(unsigned long addr);
|
|
|
|
extern int in_gate_area(struct mm_struct *mm, unsigned long addr);
|
2005-04-17 07:20:36 +09:00
|
|
|
#else
|
2014-08-09 06:23:40 +09:00
|
|
|
static inline struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
static inline int in_gate_area_no_mm(unsigned long addr) { return 0; }
|
|
|
|
static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
2005-04-17 07:20:36 +09:00
|
|
|
#endif /* __HAVE_ARCH_GATE_AREA */
|
|
|
|
|
2016-07-29 07:44:43 +09:00
|
|
|
extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
|
|
|
|
|
2013-04-30 07:07:22 +09:00
|
|
|
#ifdef CONFIG_SYSCTL
|
|
|
|
extern int sysctl_drop_caches;
|
2009-09-24 07:57:19 +09:00
|
|
|
int drop_caches_sysctl_handler(struct ctl_table *, int,
|
2006-01-08 18:00:39 +09:00
|
|
|
void __user *, size_t *, loff_t *);
|
2013-04-30 07:07:22 +09:00
|
|
|
#endif
|
|
|
|
|
2015-02-13 07:58:54 +09:00
|
|
|
void drop_slab(void);
|
|
|
|
void drop_slab_node(int nid);
|
2006-01-08 18:00:39 +09:00
|
|
|
|
2006-02-21 11:28:07 +09:00
|
|
|
#ifndef CONFIG_MMU
|
|
|
|
#define randomize_va_space 0
|
|
|
|
#else
|
2006-02-17 07:41:58 +09:00
|
|
|
extern int randomize_va_space;
|
2006-02-21 11:28:07 +09:00
|
|
|
#endif
|
2006-02-17 07:41:58 +09:00
|
|
|
|
2007-07-27 02:41:13 +09:00
|
|
|
const char * arch_vma_name(struct vm_area_struct *vma);
|
2019-07-17 08:26:30 +09:00
|
|
|
#ifdef CONFIG_MMU
|
2008-01-30 21:33:18 +09:00
|
|
|
void print_vma_addr(char *prefix, unsigned long rip);
|
2019-07-17 08:26:30 +09:00
|
|
|
#else
|
|
|
|
static inline void print_vma_addr(char *prefix, unsigned long rip)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif
|
[PATCH] vdso: randomize the i386 vDSO by moving it into a vma
Move the i386 VDSO down into a vma and thus randomize it.
Besides the security implications, this feature also helps debuggers, which
can COW a vma-backed VDSO just like a normal DSO and can thus do
single-stepping and other debugging features.
It's good for hypervisors (Xen, VMWare) too, which typically live in the same
high-mapped address space as the VDSO, hence whenever the VDSO is used, they
get lots of guest pagefaults and have to fix such guest accesses up - which
slows things down instead of speeding things up (the primary purpose of the
VDSO).
There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support
for older glibcs that still rely on a prelinked high-mapped VDSO. Newer
distributions (using glibc 2.3.3 or later) can turn this option off. Turning
it off is also recommended for security reasons: attackers cannot use the
predictable high-mapped VDSO page as syscall trampoline anymore.
There is a new vdso=[0|1] boot option as well, and a runtime
/proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned
on/off.
(This version of the VDSO-randomization patch also has working ELF
coredumping, the previous patch crashed in the coredumping code.)
This code is a combined work of the exec-shield VDSO randomization
code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell
started this patch and I completed it.
[akpm@osdl.org: cleanups]
[akpm@osdl.org: compile fix]
[akpm@osdl.org: compile fix 2]
[akpm@osdl.org: compile fix 3]
[akpm@osdl.org: revert MAXMEM change]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Cc: Gerd Hoffmann <kraxel@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 18:53:50 +09:00
|
|
|
|
2018-08-18 07:49:21 +09:00
|
|
|
void *sparse_buffer_alloc(unsigned long size);
|
2019-07-19 07:58:11 +09:00
|
|
|
struct page * __populate_section_memmap(unsigned long pfn,
|
|
|
|
unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
|
2007-10-16 17:24:14 +09:00
|
|
|
pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
|
2017-03-09 23:24:07 +09:00
|
|
|
p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
|
|
|
|
pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
|
2007-10-16 17:24:14 +09:00
|
|
|
pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
|
|
|
|
pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node);
|
2007-10-16 17:24:13 +09:00
|
|
|
void *vmemmap_alloc_block(unsigned long size, int node);
|
2016-01-16 09:56:22 +09:00
|
|
|
struct vmem_altmap;
|
2017-12-29 16:53:58 +09:00
|
|
|
void *vmemmap_alloc_block_buf(unsigned long size, int node);
|
|
|
|
void *altmap_alloc_block_buf(unsigned long size, struct vmem_altmap *altmap);
|
2007-10-16 17:24:13 +09:00
|
|
|
void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
|
2013-04-30 07:07:50 +09:00
|
|
|
int vmemmap_populate_basepages(unsigned long start, unsigned long end,
|
|
|
|
int node);
|
2017-12-29 16:53:54 +09:00
|
|
|
int vmemmap_populate(unsigned long start, unsigned long end, int node,
|
|
|
|
struct vmem_altmap *altmap);
|
2008-04-12 17:19:24 +09:00
|
|
|
void vmemmap_populate_print_last(void);
|
2013-02-23 09:33:08 +09:00
|
|
|
#ifdef CONFIG_MEMORY_HOTPLUG
|
2017-12-29 16:53:56 +09:00
|
|
|
void vmemmap_free(unsigned long start, unsigned long end,
|
|
|
|
struct vmem_altmap *altmap);
|
2013-02-23 09:33:08 +09:00
|
|
|
#endif
|
2013-02-23 09:33:00 +09:00
|
|
|
void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
|
2017-10-28 10:30:38 +09:00
|
|
|
unsigned long nr_pages);
|
2009-09-16 18:50:15 +09:00
|
|
|
|
2009-12-16 20:19:57 +09:00
|
|
|
enum mf_flags {
|
|
|
|
MF_COUNT_INCREASED = 1 << 0,
|
2011-12-14 02:27:58 +09:00
|
|
|
MF_ACTION_REQUIRED = 1 << 1,
|
2012-07-12 02:20:47 +09:00
|
|
|
MF_MUST_KILL = 1 << 2,
|
2013-07-10 18:27:01 +09:00
|
|
|
MF_SOFT_OFFLINE = 1 << 3,
|
2009-12-16 20:19:57 +09:00
|
|
|
};
|
2017-07-10 08:14:01 +09:00
|
|
|
extern int memory_failure(unsigned long pfn, int flags);
|
|
|
|
extern void memory_failure_queue(unsigned long pfn, int flags);
|
2009-12-16 20:19:58 +09:00
|
|
|
extern int unpoison_memory(unsigned long pfn);
|
2015-06-25 08:56:48 +09:00
|
|
|
extern int get_hwpoison_page(struct page *page);
|
2016-01-16 09:54:07 +09:00
|
|
|
#define put_hwpoison_page(page) put_page(page)
|
2009-09-16 18:50:15 +09:00
|
|
|
extern int sysctl_memory_failure_early_kill;
|
|
|
|
extern int sysctl_memory_failure_recovery;
|
2009-12-16 20:20:00 +09:00
|
|
|
extern void shake_page(struct page *p, int access);
|
2018-04-06 08:24:32 +09:00
|
|
|
extern atomic_long_t num_poisoned_pages __read_mostly;
|
2009-12-16 20:20:00 +09:00
|
|
|
extern int soft_offline_page(struct page *page, int flags);
|
2009-09-16 18:50:15 +09:00
|
|
|
|
2015-06-25 08:57:30 +09:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Error handlers for various types of pages.
|
|
|
|
*/
|
2015-06-25 08:57:33 +09:00
|
|
|
enum mf_result {
|
2015-06-25 08:57:30 +09:00
|
|
|
MF_IGNORED, /* Error: cannot be handled */
|
|
|
|
MF_FAILED, /* Error: handling failed */
|
|
|
|
MF_DELAYED, /* Will be handled later */
|
|
|
|
MF_RECOVERED, /* Successfully recovered */
|
|
|
|
};
|
|
|
|
|
|
|
|
enum mf_action_page_type {
|
|
|
|
MF_MSG_KERNEL,
|
|
|
|
MF_MSG_KERNEL_HIGH_ORDER,
|
|
|
|
MF_MSG_SLAB,
|
|
|
|
MF_MSG_DIFFERENT_COMPOUND,
|
|
|
|
MF_MSG_POISONED_HUGE,
|
|
|
|
MF_MSG_HUGE,
|
|
|
|
MF_MSG_FREE_HUGE,
|
2018-04-06 08:23:05 +09:00
|
|
|
MF_MSG_NON_PMD_HUGE,
|
2015-06-25 08:57:30 +09:00
|
|
|
MF_MSG_UNMAP_FAILED,
|
|
|
|
MF_MSG_DIRTY_SWAPCACHE,
|
|
|
|
MF_MSG_CLEAN_SWAPCACHE,
|
|
|
|
MF_MSG_DIRTY_MLOCKED_LRU,
|
|
|
|
MF_MSG_CLEAN_MLOCKED_LRU,
|
|
|
|
MF_MSG_DIRTY_UNEVICTABLE_LRU,
|
|
|
|
MF_MSG_CLEAN_UNEVICTABLE_LRU,
|
|
|
|
MF_MSG_DIRTY_LRU,
|
|
|
|
MF_MSG_CLEAN_LRU,
|
|
|
|
MF_MSG_TRUNCATED_LRU,
|
|
|
|
MF_MSG_BUDDY,
|
|
|
|
MF_MSG_BUDDY_2ND,
|
2018-07-14 13:50:21 +09:00
|
|
|
MF_MSG_DAX,
|
2015-06-25 08:57:30 +09:00
|
|
|
MF_MSG_UNKNOWN,
|
|
|
|
};
|
|
|
|
|
2011-01-14 08:46:47 +09:00
|
|
|
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
|
|
|
|
extern void clear_huge_page(struct page *page,
|
mm: hugetlb: clear target sub-page last when clearing huge page
Huge pages help to reduce the TLB miss rate, but they have a higher cache
footprint, which may sometimes cause issues. For example, when clearing a
huge page on an x86_64 platform, the cache footprint is 2M. But on a
Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M of LLC
(last level cache). That is, on average, there are 2.5M of LLC for each
core and 1.25M for each thread.
If the cache pressure is heavy when clearing the huge page, and we clear
the huge page from beginning to end, it is possible that the beginning of
the huge page is evicted from the cache after we finish clearing the end
of the huge page. The application may then access the beginning of the
huge page right after clearing it.
To help with this situation, this patch changes the order in which
sub-pages are cleared. In quite a few situations we can get the address
that the application will access after we clear the huge page, for
example, in a page fault handler. Instead of clearing the huge page from
beginning to end, we clear the sub-pages farthest from the sub-page to be
accessed first, and clear the sub-page to be accessed last. This makes
the sub-page to be accessed the most cache-hot, and the sub-pages around
it more cache-hot too. If we cannot know the address the application
will access, the beginning of the huge page is assumed to be the address
the application will access.
With this patch, the throughput increases ~28.3% in the vm-scalability
anon-w-seq test case with 72 processes on a 2-socket Xeon E5 v3 2699
system (36 cores, 72 threads). The test case creates 72 processes; each
process mmaps a big anonymous memory area and writes to it from beginning
to end. For each process, the other processes can be seen as other
workload that generates heavy cache pressure. At the same time, the
cache miss rate is reduced from ~33.4% to ~31.7%, the IPC (instructions
per cycle) increases from 0.56 to 0.74, and the time spent in user space
is reduced by ~7.9%.
Christopher Lameter suggested clearing bytes inside a sub-page from end
to beginning too, but tests show no visible performance difference,
maybe because the size of a page is small compared with the cache size.
Thanks Andi Kleen to propose to use address to access to determine the
order of sub-pages to clear.
The hugetlbfs access address could be improved, will do that in another
patch.
[ying.huang@intel.com: improve readability of clear_huge_page()]
Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com
Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com
Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-07 08:25:04 +09:00
|
|
|
unsigned long addr_hint,
|
2011-01-14 08:46:47 +09:00
|
|
|
unsigned int pages_per_huge_page);
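To make the ordering concrete, here is a small, self-contained sketch of
the idea described in the commit message above: sub-pages are processed in
decreasing distance from the target sub-page, so the target is touched
last and stays cache-hot. This is an illustration only, not the kernel's
implementation; process_subpages_ordered() and touch_subpage() are
made-up names, and the real code operates on struct page and clears
actual memory.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for clearing (or copying) one base page of the huge page. */
static void touch_subpage(int idx)
{
	printf("processing sub-page %d\n", idx);
}

/*
 * Process sub-pages in order of decreasing distance from the target,
 * leaving the target sub-page for last so it is the most cache-hot.
 */
static void process_subpages_ordered(int nr_subpages, int target)
{
	int left = 0, right = nr_subpages - 1;

	while (left <= right) {
		int dl = abs(target - left);	/* distance of left candidate */
		int dr = abs(right - target);	/* distance of right candidate */

		if (dl >= dr) {
			if (left != target)
				touch_subpage(left);
			left++;
		} else {
			if (right != target)
				touch_subpage(right);
			right--;
		}
	}
	touch_subpage(target);			/* target sub-page last */
}

int main(void)
{
	/* A real 2M huge page has 512 4K sub-pages; use 8 to keep the output short. */
	process_subpages_ordered(8, 3);		/* faulting address falls in sub-page 3 */
	return 0;
}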
|
|
|
|
extern void copy_user_huge_page(struct page *dst, struct page *src,
|
mm, huge page: copy target sub-page last when copy huge page
Huge pages help to reduce the TLB miss rate, but they have a larger cache
footprint, which can sometimes cause problems. For example, when copying
a huge page on an x86_64 platform, the cache footprint is 4M. But on a
Xeon E5 v3 2699 CPU there are 18 cores, 36 threads, and only 45M of LLC
(last level cache). That is, on average, there is 2.5M of LLC per core
and 1.25M per thread.
If cache contention is heavy while copying the huge page, and we copy it
from beginning to end, the beginning of the huge page may already have
been evicted from the cache by the time we finish copying the end, and
it is quite possible that the application will access the beginning of
the huge page right after it has been copied.
In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing
huge page"), to keep the cache lines of the target subpage hot, the order
in which clear_huge_page() clears the subpages of a huge page was changed
so that the subpage furthest from the target subpage is cleared first and
the target subpage last. The same ordering change helps huge page copying
too, and that is what this patch implements. Because the ordering
algorithm has been put into a separate function, the implementation is
quite simple.
The patch is a generic optimization which should benefit quite a few
workloads, not just one specific use case. To demonstrate the performance
benefit, we tested it with vm-scalability running on transparent huge
pages.
With this patch, throughput increases by ~16.6% in the vm-scalability
anon-cow-seq test case with 36 processes on a 2-socket Xeon E5 v3 2699
system (36 cores, 72 threads). The test case sets
/sys/kernel/mm/transparent_hugepage/enabled to "always", mmap()s a big
anonymous memory area and populates it, then forks 36 child processes,
each of which writes to the area from beginning to end, causing
copy-on-write. For each child process, the other child processes act as
workloads generating heavy cache pressure. At the same time, the IPC
(instructions per cycle) increased from 0.63 to 0.78, and the time spent
in user space is reduced by ~7.2%.
Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18 07:45:49 +09:00
|
|
|
unsigned long addr_hint,
|
|
|
|
struct vm_area_struct *vma,
|
2011-01-14 08:46:47 +09:00
|
|
|
unsigned int pages_per_huge_page);
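Continuing the illustration above: the reason the copy path was easy to
convert is that the ordering can live in one helper that takes a
per-sub-page callback, so clearing and copying differ only in the callback
they pass. This is again a hedged sketch with invented names
(for_each_subpage_ordered(), clear_op(), copy_op()), not the mainline
helper.

#include <stdio.h>
#include <stdlib.h>

/* Same far-to-near walk as before, but generic over the per-sub-page work. */
static void for_each_subpage_ordered(int nr_subpages, int target,
				     void (*op)(int idx, void *arg), void *arg)
{
	int left = 0, right = nr_subpages - 1;

	while (left <= right) {
		if (abs(target - left) >= abs(right - target)) {
			if (left != target)
				op(left, arg);
			left++;
		} else {
			if (right != target)
				op(right, arg);
			right--;
		}
	}
	op(target, arg);	/* target sub-page last */
}

static void clear_op(int idx, void *arg)
{
	printf("clear sub-page %d\n", idx);
}

static void copy_op(int idx, void *arg)
{
	printf("copy sub-page %d\n", idx);
}

int main(void)
{
	for_each_subpage_ordered(8, 2, clear_op, NULL);	/* clearing path */
	for_each_subpage_ordered(8, 2, copy_op, NULL);	/* copying path reuses the same order */
	return 0;
}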
|
2017-02-23 08:42:49 +09:00
|
|
|
extern long copy_huge_page_from_user(struct page *dst_page,
|
|
|
|
const void __user *usr_src,
|
2017-02-23 08:42:58 +09:00
|
|
|
unsigned int pages_per_huge_page,
|
|
|
|
bool allow_pagefault);
|
2011-01-14 08:46:47 +09:00
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
|
|
|
|
|
2012-01-11 08:07:28 +09:00
|
|
|
#ifdef CONFIG_DEBUG_PAGEALLOC
|
|
|
|
extern unsigned int _debug_guardpage_minorder;
|
2019-07-12 12:55:06 +09:00
|
|
|
DECLARE_STATIC_KEY_FALSE(_debug_guardpage_enabled);
|
2012-01-11 08:07:28 +09:00
|
|
|
|
|
|
|
static inline unsigned int debug_guardpage_minorder(void)
|
|
|
|
{
|
|
|
|
return _debug_guardpage_minorder;
|
|
|
|
}
|
|
|
|
|
mm/debug-pagealloc: prepare boottime configurable on/off
Until now, debug-pagealloc needed extra flags in struct page, so the
whole source tree had to be recompiled whenever we decided to use it.
This is really painful, because recompiling takes time and a rebuild is
sometimes not possible because third party modules depend on the layout
of struct page. So we couldn't use this useful feature in many cases.
Now we have the page extension feature, which allows us to keep extra
flags outside of struct page. This gets rid of the third party module
issue mentioned above, and it also lets us decide at boot time whether we
need the extra memory for the page extension. With these properties, a
kernel built with CONFIG_DEBUG_PAGEALLOC can avoid the debug-pagealloc
overhead at boot time with low computational cost. This will help our
development process greatly.
This patch is the preparation step to achieve the above goal.
debug-pagealloc originally used an extra field of struct page; after this
patch it uses a field of struct page_ext. Because memory for page_ext is
allocated later than the initialization of the page allocator under
CONFIG_SPARSEMEM, the debug-pagealloc feature must be disabled
temporarily until page_ext is initialized. This patch implements that.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 09:55:49 +09:00
|
|
|
static inline bool debug_guardpage_enabled(void)
|
|
|
|
{
|
2019-07-12 12:55:06 +09:00
|
|
|
return static_branch_unlikely(&_debug_guardpage_enabled);
|
mm/debug-pagealloc: prepare boottime configurable on/off
2014-12-13 09:55:49 +09:00
|
|
|
}
|
|
|
|
|
2012-01-11 08:07:28 +09:00
|
|
|
static inline bool page_is_guard(struct page *page)
|
|
|
|
{
|
mm/debug-pagealloc: prepare boottime configurable on/off
2014-12-13 09:55:49 +09:00
|
|
|
if (!debug_guardpage_enabled())
|
|
|
|
return false;
|
|
|
|
|
2019-07-12 12:55:13 +09:00
|
|
|
return PageGuard(page);
|
2012-01-11 08:07:28 +09:00
|
|
|
}
|
|
|
|
#else
|
|
|
|
static inline unsigned int debug_guardpage_minorder(void) { return 0; }
|
mm/debug-pagealloc: prepare boottime configurable on/off
2014-12-13 09:55:49 +09:00
|
|
|
static inline bool debug_guardpage_enabled(void) { return false; }
|
2012-01-11 08:07:28 +09:00
|
|
|
static inline bool page_is_guard(struct page *page) { return false; }
|
|
|
|
#endif /* CONFIG_DEBUG_PAGEALLOC */
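As a hedged illustration of how the helpers above are meant to be used
(hypothetical caller, not a real allocator path): code that walks pages
can ask page_is_guard() before treating a page as an ordinary free page,
and the check costs nothing when CONFIG_DEBUG_PAGEALLOC is off because of
the stubs above.

/* Hypothetical caller, illustration only. */
static bool page_usable_for_merge(struct page *page)
{
	/*
	 * Guard pages are debug artifacts injected by debug_pagealloc;
	 * skip them instead of treating them as normal free pages.
	 * With CONFIG_DEBUG_PAGEALLOC=n this folds to "return true".
	 */
	if (page_is_guard(page))
		return false;
	return true;
}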
|
|
|
|
|
2013-04-30 07:08:01 +09:00
|
|
|
#if MAX_NUMNODES > 1
|
|
|
|
void __init setup_nr_node_ids(void);
|
|
|
|
#else
|
|
|
|
static inline void setup_nr_node_ids(void) {}
|
|
|
|
#endif
|
|
|
|
|
2019-09-24 07:38:19 +09:00
|
|
|
extern int memcmp_pages(struct page *page1, struct page *page2);
|
|
|
|
|
|
|
|
static inline int pages_identical(struct page *page1, struct page *page2)
|
|
|
|
{
|
|
|
|
return !memcmp_pages(page1, page2);
|
|
|
|
}
|
|
|
|
|
mm: make faultaround produce old ptes
Based on Kirill's patch [1].
Currently, the faultaround code produces young ptes. This can screw up
vmscan behaviour [2], as it makes vmscan think that these pages are hot
and should not be pushed out on the first round.
During sparse file access, faultaround gets more pages mapped, and all of
them are young. Under memory pressure, this makes vmscan swap out anon
pages instead, or drop other page cache pages which would otherwise stay
resident.
Modify faultaround to produce old ptes if the sysctl
'want_old_faultaround_pte' is set, so the pages can easily be reclaimed
under memory pressure.
This can to some extent defeat the purpose of faultaround on machines
without a hardware accessed bit, as it will no longer help reduce the
number of minor page faults.
Making the faultaround ptes old results in a unixbench regression on some
architectures [3][4], but on others, such as arm64, no regression is
seen.
unixbench shell8 scores on arm64 v8.2 hardware with CONFIG_ARM64_HW_AFDBM
enabled (5 runs: min, max, avg):
Base: (741, 748, 744)
With this patch: (739, 748, 743)
So by default keep producing young ptes and provide a sysctl option to
make the ptes old.
[1] https://marc.info/?l=linux-mm&m=146348837703148
[2] https://lkml.org/lkml/2016/4/18/612
[3] https://marc.info/?l=linux-kernel&m=146582237922378&w=2
[4] https://marc.info/?l=linux-mm&m=146589376909424&w=2
Change-Id: I193185cc953bc33a44fc24963a9df9e555906d95
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Patch-mainline: linux-mm @ Fri, 19 Jan 2018 17:24:54
[vinmenon@codeaurora.org: enable by default since arm works well
with old fault_around ptes + edit the links in commit message to
fix checkpatch issues]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
[swatsrid@codeaurora.org: Fix merge conflicts]
Signed-off-by: Swathi Sridhar <swatsrid@codeaurora.org>
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
2016-05-21 08:58:41 +09:00
|
|
|
extern int want_old_faultaround_pte;
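A minimal sketch of how a fault-around path could consult this knob
(hypothetical helper and naming, not the mainline implementation):
speculatively mapped neighbours start out old when the sysctl is set,
while the page at the actual fault address keeps its accessed bit.

/* Hypothetical helper, illustration only. */
static inline pte_t faultaround_adjust_pte(pte_t entry, bool is_fault_addr)
{
	/* Only the speculatively mapped neighbours are made old. */
	if (want_old_faultaround_pte && !is_fault_addr)
		entry = pte_mkold(entry);
	return entry;
}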
|
|
|
|
|
vmscan: Support multiple kswapd threads per node
Page replacement is handled in the Linux Kernel in one of two ways:
1) Asynchronously via kswapd
2) Synchronously, via direct reclaim
At page allocation time the allocating task is immediately given a page
from the zone free list, allowing it to go right back to whatever it was
doing, probably directly or indirectly executing business logic.
Just prior to satisfying the allocation, the number of free pages is
checked to see if it has reached the zone low watermark and, if so,
kswapd is awakened. Kswapd starts scanning for inactive pages to evict in
order to make room for new page allocations. The work of kswapd allows
tasks to continue allocating memory from their respective zone free lists
without incurring any delay.
When the demand for free pages exceeds the rate at which the kswapd tasks
can supply them, page allocation works differently. Once the allocating
task finds that the number of free pages is at or below the zone min
watermark, it no longer pulls pages from the free list. Instead, the task
runs the same CPU-bound routines as kswapd to satisfy its own allocation
by scanning and evicting pages. This is called a direct reclaim.
The time spent performing a direct reclaim can be substantial, often tens
to hundreds of milliseconds for small order-0 allocations and half a
second or more for order-9 huge-page allocations. In fact, kswapd is not
actually required on a Linux system. It exists for the sole purpose of
optimizing performance by preventing direct reclaims.
When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.
The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a
single-threaded kswapd that will get worse with each generation of new
hardware.
Test Details
NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details
The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.
Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.
The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.
During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:
CPU consumption
- sy - aggregate kernel mode CPU consumption from vmstat output. The value
doesn't tend to fluctuate much so I just grab the highest value.
Each sample is averaged over 10 seconds
- dd_cpu - for all of the dd tasks averaged across the top samples since
there is a lot of variation.
Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total
This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks are increased.
The dd command for this test looks like this:
Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M
Test #1: Direct IO
dd sy dd_cpu throughput
6 0 2.33 14726026.40
10 1 2.95 19954974.80
16 1 2.63 24419689.30
22 1 2.63 25430303.20
28 1 2.91 26026513.20
34 1 2.53 26178618.00
40 1 2.18 26239229.20
46 1 1.91 26250550.40
52 1 1.69 26251845.60
58 1 1.54 26253205.60
64 1 1.43 26253780.80
70 1 1.31 26254154.80
76 1 1.21 26253660.80
82 1 1.12 26254214.80
88 1 1.07 26253770.00
90 1 1.04 26252406.40
Throughput was close to peak with only 22 dd tasks. Very little system CPU
was consumed as expected as the drives DMA directly into the user address
space when using direct IO.
In this next test, the iflag=direct option is removed and we only run the
test until the pgscan_kswapd from /proc/vmstat starts to increment. At
that point metrics are parsed and reported and the pagecache contents are
dropped prior to the next test. Lather, rinse, repeat.
Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6 2 28.78 5134316.40
10 3 31.40 8051218.40
16 5 34.73 11438106.80
22 7 33.65 14140596.40
28 8 31.24 16393455.20
34 10 29.88 18219463.60
40 11 28.33 19644159.60
46 11 25.05 20802497.60
52 13 26.92 22092370.00
58 13 23.29 22884881.20
64 14 23.12 23452248.80
70 15 22.40 23916468.00
76 16 22.06 24328737.20
82 17 20.97 24718693.20
88 16 18.57 25149404.40
90 16 18.31 25245565.60
Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.
The next test measures throughput after kswapd starts running. This is the
same test only we wait for kswapd to wake up before we start collecting
metrics. The script actually keeps track of a few things that were not
mentioned earlier. It tracks direct reclaims and page scans by watching
the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the
same way it is tracked for dd.
Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.
Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901
In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption
with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA
node), kswapd can only be responsible for a little over 4% of the increase.
The rest is likely caused by 51,618 direct reclaims that scanned 1.2
billion pages over the five minute time period of the test.
Same test, more kswapd tasks:
Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.09 16.65 14.17 7842605.60 0 459105291 0
16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515
22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0
28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0
34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0
40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0
46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0
52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0
58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0
64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821
70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159
76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763
82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704
88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202
90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615
By increasing the number of kswapd threads, throughput increased by ~50%
while kernel mode CPU utilization decreased or stayed the same, likely
due to a decrease in the number of tasks doing page replacement in
parallel at any given time.
Change-Id: I966d4a9c33bad188b3409f7ceea1df205a63c3bd
Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Patch-mainline: linux-mm @ Mon, 2 Apr 2018 09:24:22
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes done to ensure QGKI compliance.
Signed-off-by: Charan Teja Kalla <charante@codeaurora.org>
2018-04-02 18:24:22 +09:00
|
|
|
#ifndef CONFIG_MULTIPLE_KSWAPD
|
|
|
|
static inline void update_kswapd_threads_node(int nid) {}
|
|
|
|
static inline int multi_kswapd_run(int nid) { return 0; }
|
|
|
|
static inline void multi_kswapd_stop(int nid) {}
|
|
|
|
static inline void multi_kswapd_cpu_online(pg_data_t *pgdat,
|
|
|
|
const struct cpumask *mask) {}
|
|
|
|
#endif /* CONFIG_MULTIPLE_KSWAPD */
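The watermark behaviour described in the commit message above boils down
to a simple per-allocation decision; the sketch below restates it in code
(simplified, with invented names, not the page allocator's actual logic):
below the low watermark background reclaim is woken, and at or below the
min watermark the allocating task falls into direct reclaim.

#include <stdio.h>

enum reclaim_action { RECLAIM_NONE, RECLAIM_KSWAPD, RECLAIM_DIRECT };

/* Simplified restatement of the low/min watermark decision. */
static enum reclaim_action reclaim_needed(unsigned long free_pages,
					  unsigned long min_wmark,
					  unsigned long low_wmark)
{
	if (free_pages <= min_wmark)
		return RECLAIM_DIRECT;	/* the allocating task scans and evicts itself */
	if (free_pages <= low_wmark)
		return RECLAIM_KSWAPD;	/* wake background reclaim; no delay for the task */
	return RECLAIM_NONE;
}

int main(void)
{
	printf("%d\n", reclaim_needed(1000, 256, 512));	/* RECLAIM_NONE */
	printf("%d\n", reclaim_needed(400, 256, 512));	/* RECLAIM_KSWAPD */
	printf("%d\n", reclaim_needed(100, 256, 512));	/* RECLAIM_DIRECT */
	return 0;
}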
|
Merge android11-5.4.134+ (dca02b1) into msm-5.4
* refs/heads/tmp-dca02b1:
ANDROID: GKI: Update abi_gki_aarch64_cuttlefish
ANDROID: GKI: Update abi_gki_aarch64_exynos
ANDROID: GKI: Update android/abi_gki_aarch64_sonywalkman
BACKPORT: blk-mq: fix is_flush_rq
BACKPORT: blk-mq: clearing flush request reference in tags->rqs[]
BACKPORT: blk-mq: clear stale request in tags->rq[] before freeing one request pool
BACKPORT: blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter
ANDROID: gki_defconfig: set DEFAULT_MMAP_MIN_ADDR=32768
ANDROID: GKI: upate .xml file for new symbol addtions
ANDROID: xt_quota2: set usersize in xt_match registration object
ANDROID: xt_quota2: clear quota2_log message before sending
ANDROID: xt_quota2: remove trailing junk which might have a digit in it
UPSTREAM: io_uring: Fix current->fs handling in io_sq_wq_submit_work()
ANDROID: ABI: Update allowed list for QCOM
UPSTREAM: arm64: vdso: Avoid ISB after reading from cntvct_el0
ANDROID: GKI: Disable X86_MCE drivers
ANDROID: GKI: Add FCNT KMI symbol list
ANDROID: fuse: Allocate zeroed memory for canonical path
ANDROID: ABI: Update allowed list for Microsoft
ANDROID: GKI: add padding to struct hid_device
ANDROID: Update android/abi_gki_aarch64.xml
ANDROID: Update android/abi_gki_aarch64_goldfish
ANDROID: generate_initcall_order.pl: Use two dash long options for llvm-nm
Linux 5.4.134
seq_file: disallow extremely large seq buffer allocations
misc: alcor_pci: fix inverted branch condition
scsi: scsi_dh_alua: Fix signedness bug in alua_rtpg()
MIPS: vdso: Invalid GIC access through VDSO
mips: disable branch profiling in boot/decompress.o
mips: always link byteswap helpers into decompressor
scsi: be2iscsi: Fix an error handling path in beiscsi_dev_probe()
firmware: turris-mox-rwtm: fail probing when firmware does not support hwrng
firmware: turris-mox-rwtm: report failures better
firmware: turris-mox-rwtm: fix reply status decoding function
thermal/drivers/rcar_gen3_thermal: Fix coefficient calculations
ARM: dts: imx6q-dhcom: Add gpios pinctrl for i2c bus recovery
ARM: dts: imx6q-dhcom: Fix ethernet plugin detection problems
ARM: dts: imx6q-dhcom: Fix ethernet reset time properties
ARM: dts: am437x: align ti,pindir-d0-out-d1-in property with dt-shema
ARM: dts: am335x: align ti,pindir-d0-out-d1-in property with dt-shema
memory: fsl_ifc: fix leak of private memory on probe failure
memory: fsl_ifc: fix leak of IO mapping on probe failure
reset: bail if try_module_get() fails
ARM: dts: BCM5301X: Fixup SPI binding
firmware: arm_scmi: Reset Rx buffer to max size during async commands
firmware: tegra: Fix error return code in tegra210_bpmp_init()
ARM: dts: r8a7779, marzen: Fix DU clock names
arm64: dts: renesas: v3msk: Fix memory size
rtc: fix snprintf() checking in is_rtc_hctosys()
memory: pl353: Fix error return code in pl353_smc_probe()
reset: brcmstb: Add missing MODULE_DEVICE_TABLE
memory: atmel-ebi: add missing of_node_put for loop iteration
ARM: dts: exynos: fix PWM LED max brightness on Odroid XU4
ARM: dts: exynos: fix PWM LED max brightness on Odroid HC1
ARM: dts: exynos: fix PWM LED max brightness on Odroid XU/XU3
ARM: exynos: add missing of_node_put for loop iteration
reset: a10sr: add missing of_match_table reference
ARM: dts: gemini-rut1xx: remove duplicate ethernet node
hexagon: use common DISCARDS macro
NFSv4/pNFS: Don't call _nfs4_pnfs_v3_ds_connect multiple times
ALSA: isa: Fix error return code in snd_cmi8330_probe()
nvme-tcp: can't set sk_user_data without write_lock
virtio_net: move tx vq operation under tx queue lock
pwm: imx1: Don't disable clocks at device remove time
x86/fpu: Limit xstate copy size in xstateregs_set()
PCI: iproc: Support multi-MSI only on uniprocessor kernel
PCI: iproc: Fix multi-MSI base vector number allocation
ubifs: Set/Clear I_LINKABLE under i_lock for whiteout inode
nfs: fix acl memory leak of posix_acl_create()
watchdog: aspeed: fix hardware timeout calculation
um: fix error return code in winch_tramp()
um: fix error return code in slip_open()
NFSv4: Initialise connection to the server in nfs4_alloc_client()
power: supply: rt5033_battery: Fix device tree enumeration
PCI/sysfs: Fix dsm_label_utf16s_to_utf8s() buffer overrun
f2fs: add MODULE_SOFTDEP to ensure crc32 is included in the initramfs
x86/signal: Detect and prevent an alternate signal stack overflow
virtio_console: Assure used length from device is limited
virtio_net: Fix error handling in virtnet_restore()
virtio-blk: Fix memory leak among suspend/resume procedure
ACPI: video: Add quirk for the Dell Vostro 3350
ACPI: AMBA: Fix resource name in /proc/iomem
pwm: tegra: Don't modify HW state in .remove callback
pwm: img: Fix PM reference leak in img_pwm_enable()
power: supply: ab8500: add missing MODULE_DEVICE_TABLE
power: supply: charger-manager: add missing MODULE_DEVICE_TABLE
NFS: nfs_find_open_context() may only select open files
ceph: remove bogus checks and WARN_ONs from ceph_set_page_dirty
orangefs: fix orangefs df output.
PCI: tegra: Add missing MODULE_DEVICE_TABLE
x86/fpu: Return proper error codes from user access functions
watchdog: iTCO_wdt: Account for rebooting on second timeout
watchdog: imx_sc_wdt: fix pretimeout
watchdog: Fix possible use-after-free by calling del_timer_sync()
watchdog: sc520_wdt: Fix possible use-after-free in wdt_turnoff()
watchdog: Fix possible use-after-free in wdt_startup()
PCI/P2PDMA: Avoid pci_get_slot(), which may sleep
ARM: 9087/1: kprobes: test-thumb: fix for LLVM_IAS=1
power: reset: gpio-poweroff: add missing MODULE_DEVICE_TABLE
power: supply: max17042: Do not enforce (incorrect) interrupt trigger type
power: supply: ab8500: Avoid NULL pointers
pwm: spear: Don't modify HW state in .remove callback
power: supply: sc2731_charger: Add missing MODULE_DEVICE_TABLE
power: supply: sc27xx: Add missing MODULE_DEVICE_TABLE
lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
i2c: core: Disable client irq on reboot/shutdown
intel_th: Wait until port is in reset before programming it
staging: rtl8723bs: fix macro value for 2.4Ghz only device
ALSA: usb-audio: scarlett2: Fix 6i6 Gen 2 line out descriptions
ALSA: hda: Add IRQ check for platform_get_irq()
backlight: lm3630a: Fix return code of .update_status() callback
ASoC: Intel: kbl_da7219_max98357a: shrink platform_id below 20 characters
powerpc/boot: Fixup device-tree on little endian
usb: gadget: hid: fix error return code in hid_bind()
usb: gadget: f_hid: fix endianness issue with descriptors
ALSA: usb-audio: scarlett2: Fix scarlett2_*_ctl_put() return values
ALSA: usb-audio: scarlett2: Fix data_mutex lock
ALSA: usb-audio: scarlett2: Fix 18i8 Gen 2 PCM Input count
ALSA: bebob: add support for ToneWeal FW66
Input: hideep - fix the uninitialized use in hideep_nvm_unlock()
s390/mem_detect: fix tprot() program check new psw handling
s390/mem_detect: fix diag260() program check new psw handling
s390/ipl_parm: fix program check new psw handling
s390/processor: always inline stap() and __load_psw_mask()
ASoC: soc-core: Fix the error return code in snd_soc_of_parse_audio_routing()
gpio: pca953x: Add support for the On Semi pca9655
selftests/powerpc: Fix "no_handler" EBB selftest
ALSA: ppc: fix error return code in snd_pmac_probe()
gpio: zynq: Check return value of pm_runtime_get_sync
iommu/arm-smmu: Fix arm_smmu_device refcount leak in address translation
iommu/arm-smmu: Fix arm_smmu_device refcount leak when arm_smmu_rpm_get fails
powerpc/ps3: Add dma_mask to ps3_dma_region
ALSA: sb: Fix potential double-free of CSP mixer elements
selftests: timers: rtcpie: skip test if default RTC device does not exist
s390/sclp_vt220: fix console name to match device
serial: tty: uartlite: fix console setup
ASoC: img: Fix PM reference leak in img_i2s_in_probe()
mfd: cpcap: Fix cpcap dmamask not set warnings
mfd: da9052/stmpe: Add and modify MODULE_DEVICE_TABLE
scsi: qedi: Fix null ref during abort handling
scsi: iscsi: Fix shost->max_id use
scsi: iscsi: Fix conn use after free during resets
scsi: iscsi: Add iscsi_cls_conn refcount helpers
scsi: megaraid_sas: Handle missing interrupts while re-enabling IRQs
scsi: megaraid_sas: Early detection of VD deletion through RaidMap update
scsi: megaraid_sas: Fix resource leak in case of probe failure
fs/jfs: Fix missing error code in lmLogInit()
scsi: scsi_dh_alua: Check for negative result value
tty: serial: 8250: serial_cs: Fix a memory leak in error handling path
ALSA: ac97: fix PM reference leak in ac97_bus_remove()
scsi: core: Cap scsi_host cmd_per_lun at can_queue
scsi: lpfc: Fix crash when lpfc_sli4_hba_setup() fails to initialize the SGLs
scsi: lpfc: Fix "Unexpected timeout" error in direct attach topology
scsi: hisi_sas: Propagate errors in interrupt_init_v1_hw()
w1: ds2438: fixing bug that would always get page0
Revert "ALSA: bebob/oxfw: fix Kconfig entry for Mackie d.2 Pro"
ALSA: usx2y: Don't call free_pages_exact() with NULL address
iio: magn: bmc150: Balance runtime pm + use pm_runtime_resume_and_get()
iio: gyro: fxa21002c: Balance runtime pm + use pm_runtime_resume_and_get().
misc: alcor_pci: fix null-ptr-deref when there is no PCI bridge
misc/libmasm/module: Fix two use after free in ibmasm_init_one
tty: serial: fsl_lpuart: fix the potential risk of division or modulo by zero
srcu: Fix broken node geometry after early ssp init
dmaengine: fsl-qdma: check dma_set_mask return value
net: moxa: Use devm_platform_get_and_ioremap_resource()
fbmem: Do not delete the mode that is still in use
cgroup: verify that source is a string
tracing: Do not reference char * as a string in histograms
scsi: core: Fix bad pointer dereference when ehandler kthread is invalid
KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is enabled
KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
Revert "media: subdev: disallow ioctl for saa6588/davinci"
Linux 5.4.133
smackfs: restrict bytes count in smk_set_cipso()
jfs: fix GPF in diFree
pinctrl: mcp23s08: Fix missing unlock on error in mcp23s08_irq()
media: uvcvideo: Fix pixel format change for Elgato Cam Link 4K
media: gspca/sunplus: fix zero-length control requests
media: gspca/sq905: fix control-request direction
media: zr364xx: fix memory leak in zr364xx_start_readpipe
media: dtv5100: fix control-request directions
media: subdev: disallow ioctl for saa6588/davinci
PCI: aardvark: Implement workaround for the readback value of VEND_ID
PCI: aardvark: Fix checking for PIO Non-posted Request
PCI: Leave Apple Thunderbolt controllers on for s2idle or standby
dm btree remove: assign new_root only when removal succeeds
coresight: tmc-etf: Fix global-out-of-bounds in tmc_update_etf_buffer()
ipack/carriers/tpci200: Fix a double free in tpci200_pci_probe
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
tracing: Simplify & fix saved_tgids logic
rq-qos: fix missed wake-ups in rq_qos_throttle try two
seq_buf: Fix overflow in seq_buf_putmem_hex()
extcon: intel-mrfld: Sync hardware and software state on init
nvmem: core: add a missing of_node_put
power: supply: ab8500: Fix an old bug
ubifs: Fix races between xattr_{set|get} and listxattr operations
thermal/drivers/int340x/processor_thermal: Fix tcc setting
ipmi/watchdog: Stop watchdog timer when the current action is 'none'
qemu_fw_cfg: Make fw_cfg_rev_attr a proper kobj_attribute
ASoC: tegra: Set driver_name=tegra for all machine drivers
MIPS: fix "mipsel-linux-ld: decompress.c:undefined reference to `memmove'"
fpga: stratix10-soc: Add missing fpga_mgr_free() call
clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
cpu/hotplug: Cure the cpusets trainwreck
ata: ahci_sunxi: Disable DIPM
mmc: core: Allow UHS-I voltage switch for SDSC cards if supported
mmc: core: clear flags before allowing to retune
mmc: sdhci: Fix warning message when accessing RPMB in HS400 mode
drm/arm/malidp: Always list modifiers
drm/msm/mdp4: Fix modifier support enabling
drm/tegra: Don't set allow_fb_modifiers explicitly
drm/amd/display: Reject non-zero src_y and src_x for video planes
pinctrl/amd: Add device HID for new AMD GPIO controller
drm/amd/display: fix incorrrect valid irq check
drm/rockchip: dsi: remove extra component_del() call
drm/radeon: Add the missed drm_gem_object_put() in radeon_user_framebuffer_create()
drm/amdgpu: Update NV SIMD-per-CU to 2
powerpc/barrier: Avoid collision with clang's __lwsync macro
powerpc/mm: Fix lockup on kernel exec fault
perf bench: Fix 2 memory sanitizer warnings
crypto: ccp - Annotate SEV Firmware file names
fscrypt: don't ignore minor_hash when hash is 0
MIPS: set mips32r5 for virt extensions
MIPS: loongsoon64: Reserve memory below starting pfn to prevent Oops
sctp: add size validation when walking chunks
sctp: validate from_addr_param return
Bluetooth: btusb: fix bt fiwmare downloading failure issue for qca btsoc.
Bluetooth: Shutdown controller after workqueues are flushed or cancelled
Bluetooth: Fix the HCI to MGMT status conversion table
Bluetooth: btusb: Fixed too many in-token issue for Mediatek Chip.
RDMA/cma: Fix rdma_resolve_route() memory leak
net: ip: avoid OOM kills with large UDP sends over loopback
media, bpf: Do not copy more entries than user space requested
wireless: wext-spy: Fix out-of-bounds warning
sfc: error code if SRIOV cannot be disabled
sfc: avoid double pci_remove of VFs
iwlwifi: pcie: fix context info freeing
iwlwifi: pcie: free IML DMA memory allocation
iwlwifi: mvm: don't change band on bound PHY contexts
RDMA/rxe: Don't overwrite errno from ib_umem_get()
vsock: notify server to shutdown when client has pending signal
atm: nicstar: register the interrupt handler in the right place
atm: nicstar: use 'dma_free_coherent' instead of 'kfree'
MIPS: add PMD table accounting into MIPS'pmd_alloc_one
rtl8xxxu: Fix device info for RTL8192EU devices
drm/amdkfd: Walk through list with dqm lock hold
net: sched: fix error return code in tcf_del_walker()
net: fix mistake path for netdev_features_strings
mt76: mt7615: fix fixed-rate tx status reporting
bpf: Fix up register-based shifts in interpreter to silence KUBSAN
cw1200: add missing MODULE_DEVICE_TABLE
wl1251: Fix possible buffer overflow in wl1251_cmd_scan
wlcore/wl12xx: Fix wl12xx get_mac error if device is in ELP
xfrm: Fix error reporting in xfrm_state_construct.
drm/amd/display: Verify Gamma & Degamma LUT sizes in amdgpu_dm_atomic_check
r8169: avoid link-up interrupt issue on RTL8106e if user enables ASPM
selinux: use __GFP_NOWARN with GFP_NOWAIT in the AVC
fjes: check return value after calling platform_get_resource()
drm/amdkfd: use allowed domain for vmbo validation
drm/amd/display: Set DISPCLK_MAX_ERRDET_CYCLES to 7
drm/amd/display: Release MST resources on switch from MST to SST
drm/amd/display: Update scaling settings on modeset
net: micrel: check return value after calling platform_get_resource()
net: mvpp2: check return value after calling platform_get_resource()
net: bcmgenet: check return value after calling platform_get_resource()
virtio_net: Remove BUG() to avoid machine dead
ice: set the value of global config lock timeout longer
pinctrl: mcp23s08: fix race condition in irq handler
dm space maps: don't reset space map allocation cursor when committing
RDMA/cxgb4: Fix missing error code in create_qp()
ipv6: use prandom_u32() for ID generation
clk: tegra: Ensure that PLLU configuration is applied properly
clk: renesas: r8a77995: Add ZA2 clock
drm/bridge: cdns: Fix PM reference leak in cdns_dsi_transfer()
igb: handle vlan types with checker enabled
e100: handle eeprom as little endian
udf: Fix NULL pointer dereference in udf_symlink function
drm/sched: Avoid data corruptions
drm/virtio: Fix double free on probe failure
reiserfs: add check for invalid 1st journal block
drm/mediatek: Fix PM reference leak in mtk_crtc_ddp_hw_init()
net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RT
atm: nicstar: Fix possible use-after-free in nicstar_cleanup()
mISDN: fix possible use-after-free in HFC_cleanup()
atm: iphase: fix possible use-after-free in ia_module_exit()
hugetlb: clear huge pte during flush function on mips platform
drm/amd/display: fix use_max_lb flag for 420 pixel formats
net: pch_gbe: Use proper accessors to BE data in pch_ptp_match()
drm/vc4: fix argument ordering in vc4_crtc_get_margins()
drm/amd/amdgpu/sriov disable all ip hw status by default
drm/zte: Don't select DRM_KMS_FB_HELPER
drm/mxsfb: Don't select DRM_KMS_FB_HELPER
ANDROID: GKI: fix up crc change in ip.h
Linux 5.4.132
iommu/dma: Fix compile warning in 32-bit builds
scsi: core: Retry I/O for Notify (Enable Spinup) Required error
mmc: vub3000: fix control-request direction
mmc: block: Disable CMDQ on the ioctl path
block: return the correct bvec when checking for gaps
scsi: target: cxgbit: Unmap DMA buffer before calling target_execute_cmd()
perf llvm: Return -ENOMEM when asprintf() fails
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
mm/z3fold: fix potential memory leak in z3fold_destroy_pool()
mm/huge_memory.c: don't discard hugepage if other processes are mapping it
vfio/pci: Handle concurrent vma faults
arm64: dts: marvell: armada-37xx: Fix reg for standard variant of UART
serial: mvebu-uart: correctly calculate minimal possible baudrate
serial: mvebu-uart: do not allow changing baudrate when uartclk is not available
powerpc: Offline CPU in stop_this_cpu()
leds: ktd2692: Fix an error handling path
leds: as3645a: Fix error return code in as3645a_parse_node()
configfs: fix memleak in configfs_release_bin_file
ASoC: atmel-i2s: Fix usage of capture and playback at the same time
extcon: max8997: Add missing modalias string
extcon: sm5502: Drop invalid register write in sm5502_reg_data
phy: ti: dm816x: Fix the error handling path in 'dm816x_usb_phy_probe()
phy: uniphier-pcie: Fix updating phy parameters
soundwire: stream: Fix test for DP prepare complete
scsi: mpt3sas: Fix error return value in _scsih_expander_add()
mtd: rawnand: marvell: add missing clk_disable_unprepare() on error in marvell_nfc_resume()
of: Fix truncation of memory sizes on 32-bit platforms
ASoC: cs42l42: Correct definition of CS42L42_ADC_PDN_MASK
iio: prox: isl29501: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: light: vcnl4035: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
serial: 8250: Actually allow UPF_MAGIC_MULTIPLIER baud rates
staging: mt7621-dts: fix pci address for PCI memory range
staging: rtl8712: fix memory leak in rtl871x_load_fw_cb
staging: rtl8712: remove redundant check in r871xu_drv_init
staging: gdm724x: check for overflow in gdm_lte_netif_rx()
staging: gdm724x: check for buffer overflow in gdm_lte_multi_sdu_pkt()
iio: magn: rm3100: Fix alignment of buffer in iio_push_to_buffers_with_timestamp()
iio: adc: ti-ads8688: Fix alignment of buffer in iio_push_to_buffers_with_timestamp()
iio: adc: mxs-lradc: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: adc: hx711: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: adc: at91-sama5d2: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: at91-sama5d2_adc: remove usage of iio_priv_to_dev() helper
eeprom: idt_89hpesx: Restore printing the unsupported fwnode name
eeprom: idt_89hpesx: Put fwnode in matching case during ->probe()
usb: dwc2: Don't reset the core after setting turnaround time
usb: gadget: f_fs: Fix setting of device and driver data cross-references
ASoC: mediatek: mtk-btcvsd: Fix an error handling path in 'mtk_btcvsd_snd_probe()'
iommu/dma: Fix IOVA reserve dma ranges
s390: appldata depends on PROC_SYSCTL
visorbus: fix error return code in visorchipset_init()
fsi/sbefifo: Fix reset timeout
fsi/sbefifo: Clean up correct FIFO when receiving reset request from SBE
fsi: occ: Don't accept response from un-initialized OCC
fsi: scom: Reset the FSI2PIB engine for any error
fsi: core: Fix return of error values on failures
scsi: FlashPoint: Rename si_flags field
leds: lm3692x: Put fwnode in any case during ->probe()
leds: lm36274: cosmetic: rename lm36274_data to chip
leds: lm3532: select regmap I2C API
tty: nozomi: Fix the error handling path of 'nozomi_card_init()'
firmware: stratix10-svc: Fix a resource leak in an error handling path
char: pcmcia: error out if 'num_bytes_read' is greater than 4 in set_protocol()
mtd: partitions: redboot: seek fis-index-block in the right node
Input: hil_kbd - fix error return code in hil_dev_connect()
ASoC: rsnd: tidyup loop on rsnd_adg_clk_query()
backlight: lm3630a_bl: Put fwnode in error case during ->probe()
ASoC: hisilicon: fix missing clk_disable_unprepare() on error in hi6210_i2s_startup()
ASoC: rk3328: fix missing clk_disable_unprepare() on error in rk3328_platform_probe()
iio: potentiostat: lmp91000: Fix alignment of buffer in iio_push_to_buffers_with_timestamp()
iio: cros_ec_sensors: Fix alignment of buffer in iio_push_to_buffers_with_timestamp()
iio: light: tcs3472: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: light: tcs3414: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: light: isl29125: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: magn: bmc150: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: magn: hmc5843: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: prox: as3935: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: prox: pulsed-light: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: prox: srf08: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: humidity: am2315: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: gyro: bmg160: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: adc: vf610: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: adc: ti-ads1015: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: accel: stk8ba50: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: accel: stk8312: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: accel: mxc4005: Fix overread of data and alignment issue.
iio:accel:mxc4005: Drop unnecessary explicit casts in regmap_bulk_read calls
iio: accel: kxcjk-1013: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: accel: hid: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: accel: bma220: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: accel: bma180: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
iio: adis16400: do not return ints in irq handlers
iio: adis_buffer: do not return ints in irq handlers
mwifiex: re-fix for unaligned accesses
tty: nozomi: Fix a resource leak in an error handling function
rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
staging: fbtft: Rectify GPIO handling
MIPS: Fix PKMAP with 32-bit MIPS huge page support
RDMA/mlx5: Don't access NULL-cleared mpi pointer
net: sched: fix warning in tcindex_alloc_perfect_hash
net: lwtunnel: handle MTU calculation in forwading
writeback: fix obtain a reference to a freeing memcg css
clk: si5341: Update initialization magic
clk: si5341: Avoid divide errors due to bogus register contents
clk: actions: Fix bisp_factor_table based clocks on Owl S500 SoC
clk: actions: Fix SD clocks factor table on Owl S500 SoC
clk: actions: Fix UART clock dividers on Owl S500 SoC
Bluetooth: Fix handling of HCI_LE_Advertising_Set_Terminated event
Bluetooth: mgmt: Fix slab-out-of-bounds in tlv_data_is_valid
Revert "be2net: disable bh with spin_lock in be_process_mcc"
gve: Fix swapped vars when fetching max queues
bpfilter: Specify the log level for the kmsg message
e1000e: Check the PCIm state
ipv6: fix out-of-bound access in ip6_parse_tlv()
ibmvnic: free tx_pool if tso_pool alloc fails
Revert "ibmvnic: remove duplicate napi_schedule call in open function"
i40e: Fix autoneg disabling for non-10GBaseT links
i40e: Fix error handling in i40e_vsi_open
bpf: Do not change gso_size during bpf_skb_change_proto()
ipv6: exthdrs: do not blindly use init_net
net: bcmgenet: Fix attaching to PYH failed on RPi 4B
mac80211: remove iwlwifi specific workaround NDPs of null_response
ieee802154: hwsim: avoid possible crash in hwsim_del_edge_nl()
ieee802154: hwsim: Fix memory leak in hwsim_add_one
tc-testing: fix list handling
net/ipv4: swap flow ports when validating source
vxlan: add missing rcu_read_lock() in neigh_reduce()
pkt_sched: sch_qfq: fix qfq_change_class() error path
tls: prevent oversized sendfile() hangs by ignoring MSG_MORE
net: sched: add barrier to ensure correct ordering for lockless qdisc
vrf: do not push non-ND strict packets with a source LLA through packet taps again
net: ethernet: ezchip: fix error handling
net: ethernet: ezchip: fix UAF in nps_enet_remove
net: ethernet: aeroflex: fix UAF in greth_of_remove
samples/bpf: Fix the error return code of xdp_redirect's main()
RDMA/rxe: Fix qp reference counting for atomic ops
netfilter: nft_tproxy: restrict support to TCP and UDP transport protocols
netfilter: nft_osf: check for TCP packet before further processing
netfilter: nft_exthdr: check for IPv6 packet before further processing
RDMA/mlx5: Don't add slave port to unaffiliated list
netlabel: Fix memory leak in netlbl_mgmt_add_common
ath10k: Fix an error code in ath10k_add_interface()
brcmsmac: mac80211_if: Fix a resource leak in an error handling path
brcmfmac: correctly report average RSSI in station info
brcmfmac: fix setting of station info chains bitmask
ssb: Fix error return code in ssb_bus_scan()
wcn36xx: Move hal_buf allocation to devm_kmalloc in probe
ieee802154: hwsim: Fix possible memory leak in hwsim_subscribe_all_others
wireless: carl9170: fix LEDS build errors & warnings
ath10k: add missing error return code in ath10k_pci_probe()
ath10k: go to path err_unsupported when chip id is not supported
tools/bpftool: Fix error return code in do_batch()
drm: qxl: ensure surf.data is ininitialized
RDMA/rxe: Fix failure during driver load
RDMA/core: Sanitize WQ state received from the userspace
net/sched: act_vlan: Fix modify to allow 0
ehea: fix error return code in ehea_restart_qps()
drm/rockchip: dsi: move all lane config except LCDC mux to bind()
drm/rockchip: cdn-dp-core: add missing clk_disable_unprepare() on error in cdn_dp_grf_write()
net: ftgmac100: add missing error return code in ftgmac100_probe()
clk: meson: g12a: fix gp0 and hifi ranges
pinctrl: renesas: r8a77990: JTAG pins do not have pull-down capabilities
pinctrl: renesas: r8a7796: Add missing bias for PRESET# pin
net: pch_gbe: Propagate error from devm_gpio_request_one()
net: mvpp2: Put fwnode in error case during ->probe()
video: fbdev: imxfb: Fix an error message
xfrm: xfrm_state_mtu should return at least 1280 for ipv6
dax: fix ENOMEM handling in grab_mapping_entry()
ocfs2: fix snprintf() checking
cpufreq: Make cpufreq_online() call driver->offline() on errors
ACPI: bgrt: Fix CFI violation
ACPI: Use DEVICE_ATTR_<RW|RO|WO> macros
blk-wbt: make sure throttle is enabled properly
blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
extcon: extcon-max8997: Fix IRQ freeing at error path
ACPI: sysfs: Fix a buffer overrun problem with description_show()
crypto: nx - Fix RCU warning in nx842_OF_upd_status
spi: spi-sun6i: Fix chipselect/clock bug
sched/uclamp: Fix uclamp_tg_restrict()
sched/rt: Fix Deadline utilization tracking during policy change
sched/rt: Fix RT utilization tracking during policy change
btrfs: clear log tree recovering status if starting transaction fails
regulator: hi655x: Fix pass wrong pointer to config.driver_data
KVM: nVMX: Ensure 64-bit shift when checking VMFUNC bitmap
hwmon: (max31790) Fix fan speed reporting for fan7..12
hwmon: (max31722) Remove non-standard ACPI device IDs
media: s5p-g2d: Fix a memory leak on ctx->fh.m2m_ctx
arm64/mm: Fix ttbr0 values stored in struct thread_info for software-pan
arm64: consistently use reserved_pg_dir
mmc: usdhi6rol0: fix error return code in usdhi6_probe()
crypto: omap-sham - Fix PM reference leak in omap sham ops
crypto: nitrox - fix unchecked variable in nitrox_register_interrupts
media: siano: Fix out-of-bounds warnings in smscore_load_firmware_family2()
m68k: atari: Fix ATARI_KBD_CORE kconfig unmet dependency warning
media: gspca/gl860: fix zero-length control requests
media: tc358743: Fix error return code in tc358743_probe_of()
media: au0828: fix a NULL vs IS_ERR() check
media: exynos4-is: Fix a use after free in isp_video_release
pata_ep93xx: fix deferred probing
media: rc: i2c: Fix an error message
crypto: ccp - Fix a resource leak in an error handling path
evm: fix writing <securityfs>/evm overflow
pata_octeon_cf: avoid WARN_ON() in ata_host_activate()
kbuild: Fix objtool dependency for 'OBJECT_FILES_NON_STANDARD_<obj> := n'
kbuild: run the checker after the compiler
sched/uclamp: Fix locking around cpu_util_update_eff()
sched/uclamp: Fix wrong implementation of cpu.uclamp.min
media: I2C: change 'RST' to "RSET" to fix multiple build errors
pata_rb532_cf: fix deferred probing
sata_highbank: fix deferred probing
crypto: ux500 - Fix error return code in hash_hw_final()
crypto: ixp4xx - dma_unmap the correct address
media: s5p_cec: decrement usage count if disabled
writeback, cgroup: increment isw_nr_in_flight before grabbing an inode
ia64: mca_drv: fix incorrect array size calculation
kthread_worker: fix return value when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
block: fix discard request merge
cifs: fix missing spinlock around update to ses->status
HID: wacom: Correct base usage for capacitive ExpressKey status bits
ACPI: tables: Add custom DSDT file as makefile prerequisite
clocksource: Retry clock read if long delays detected
PCI: hv: Add check for hyperv_initialized in init_hv_pci_drv()
EDAC/Intel: Do not load EDAC driver when running as a guest
nvmet-fc: do not check for invalid target port in nvmet_fc_handle_fcp_rqst()
platform/x86: toshiba_acpi: Fix missing error code in toshiba_acpi_setup_keyboard()
block: fix race between adding/removing rq qos and normal IO
ACPI: resources: Add checks for ACPI IRQ override
ACPI: bus: Call kobject_put() in acpi_init() error path
ACPICA: Fix memory leak caused by _CID repair function
fs: dlm: fix memory leak when fenced
random32: Fix implicit truncation warning in prandom_seed_state()
fs: dlm: cancel work sync othercon
block_dump: remove block_dump feature in mark_inode_dirty()
ACPI: EC: Make more Asus laptops use ECDT _GPE
lib: vsprintf: Fix handling of number field widths in vsscanf
hv_utils: Fix passing zero to 'PTR_ERR' warning
ACPI: processor idle: Fix up C-state latency if not ordered
EDAC/ti: Add missing MODULE_DEVICE_TABLE
HID: do not use down_interruptible() when unbinding devices
media: Fix Media Controller API config checks
regulator: da9052: Ensure enough delay time for .set_voltage_time_sel
regulator: mt6358: Fix vdram2 .vsel_mask
KVM: s390: get rid of register asm usage
locking/lockdep: Avoid finding wrong lock dep path in check_irq_usage()
locking/lockdep: Fix the dep path printing for backwards BFS
btrfs: disable build on platforms having page size 256K
btrfs: abort transaction if we fail to update the delayed inode
btrfs: fix error handling in __btrfs_update_delayed_inode
KVM: PPC: Book3S HV: Fix TLB management on SMT8 POWER9 and POWER10 processors
drivers/perf: fix the missed ida_simple_remove() in ddr_perf_probe()
hwmon: (max31790) Fix pwmX_enable attributes
hwmon: (max31790) Report correct current pwm duty cycles
media: imx-csi: Skip first few frames from a BT.656 source
media: siano: fix device register error path
media: dvb_net: avoid speculation from net slot
crypto: shash - avoid comparing pointers to exported functions under CFI
mmc: via-sdmmc: add a check against NULL pointer dereference
mmc: sdhci-sprd: use sdhci_sprd_writew
memstick: rtsx_usb_ms: fix UAF
media: dvb-usb: memory leak in cinergyt2_fe_attach
Makefile: fix GDB warning with CONFIG_RELR
media: st-hva: Fix potential NULL pointer dereferences
media: bt8xx: Fix a missing check bug in bt878_probe
media: v4l2-core: Avoid the dangling pointer in v4l2_fh_release
media: em28xx: Fix possible memory leak of em28xx struct
sched/fair: Fix ascii art by replacing tabs
crypto: qat - remove unused macro in FW loader
crypto: qat - check return code of qat_hal_rd_rel_reg()
media: imx: imx7_mipi_csis: Fix logging of only error event counters
media: pvrusb2: fix warning in pvr2_i2c_core_done
media: cobalt: fix race condition in setting HPD
media: cpia2: fix memory leak in cpia2_usb_probe
media: sti: fix obj-$(config) targets
crypto: nx - add missing MODULE_DEVICE_TABLE
hwrng: exynos - Fix runtime PM imbalance on error
regulator: uniphier: Add missing MODULE_DEVICE_TABLE
spi: omap-100k: Fix the length judgment problem
spi: spi-topcliff-pch: Fix potential double free in pch_spi_process_messages()
spi: spi-loopback-test: Fix 'tx_buf' might be 'rx_buf'
media: exynos-gsc: fix pm_runtime_get_sync() usage count
media: sti/bdisp: fix pm_runtime_get_sync() usage count
media: s5p-jpeg: fix pm_runtime_get_sync() usage count
media: mtk-vcodec: fix PM runtime get logic
media: sh_vou: fix pm_runtime_get_sync() usage count
media: s5p: fix pm_runtime_get_sync() usage count
media: mtk-mdp: fix pm_runtime_get_sync() usage count
spi: Make of_register_spi_device also set the fwnode
fuse: reject internal errno
fuse: check connected before queueing on fpq->io
fuse: ignore PG_workingset after stealing
evm: Refuse EVM_ALLOW_METADATA_WRITES only if an HMAC key is loaded
evm: Execute evm_inode_init_security() only when an HMAC key is loaded
powerpc/stacktrace: Fix spurious "stale" traces in raise_backtrace_ipi()
seq_buf: Make trace_seq_putmem_hex() support data longer than 8
tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
tracing/histograms: Fix parsing of "sym-offset" modifier
rsi: fix AP mode with WPA failure due to encrypted EAPOL
rsi: Assign beacon rate settings to the correct rate_info descriptor field
ssb: sdio: Don't overwrite const buffer if block_write fails
ath9k: Fix kernel NULL pointer dereference during ath_reset_internal()
serial_cs: remove wrong GLOBETROTTER.cis entry
serial_cs: Add Option International GSM-Ready 56K/ISDN modem
serial: sh-sci: Stop dmaengine transfer in sci_stop_tx()
serial: mvebu-uart: fix calculation of clock divisor
iio: ltr501: ltr501_read_ps(): add missing endianness conversion
iio: ltr501: ltr559: fix initialization of LTR501_ALS_CONTR
iio: ltr501: mark register holding upper 8 bits of ALS_DATA{0,1} and PS_DATA as volatile, too
iio: light: tcs3472: do not free unallocated IRQ
rtc: stm32: Fix unbalanced clk_disable_unprepare() on probe error path
s390/cio: dont call css_wait_for_slow_path() inside a lock
KVM: PPC: Book3S HV: Workaround high stack usage with clang
perf/smmuv3: Don't trample existing events with global filter
SUNRPC: Wake up the privileged task first
SUNRPC: Fix the batch tasks count wraparound.
mac80211: remove iwlwifi specific workaround that broke sta NDP tx
can: peak_pciefd: pucan_handle_status(): fix a potential starvation issue in TX path
can: j1939: j1939_sk_init(): set SOCK_RCU_FREE to call sk_destruct() after RCU is done
can: gw: synchronize rcu operations before removing gw job entry
can: bcm: delay release of struct bcm_op after synchronize_rcu()
ext4: use ext4_grp_locked_error in mb_find_extent
ext4: fix avefreec in find_group_orlov
ext4: remove check for zero nr_to_scan in ext4_es_scan()
ext4: correct the cache_nr in tracepoint ext4_es_shrink_exit
ext4: return error code when ext4_fill_flex_info() fails
ext4: fix kernel infoleak via ext4_extent_header
ext4: cleanup in-core orphan list if ext4_truncate() failed to get a transaction handle
btrfs: clear defrag status of a root if starting transaction fails
btrfs: send: fix invalid path for unlink operations after parent orphanization
ARM: dts: at91: sama5d4: fix pinctrl muxing
arm_pmu: Fix write counter incorrect in ARMv7 big-endian mode
Input: joydev - prevent use of not validated data in JSIOCSBTNMAP ioctl
iov_iter_fault_in_readable() should do nothing in xarray case
copy_page_to_iter(): fix ITER_DISCARD case
ntfs: fix validity check for file name attribute
xhci: solve a double free problem while doing s4
usb: typec: Add the missed altmode_id_remove() in typec_register_altmode()
usb: dwc3: Fix debugfs creation flow
USB: cdc-acm: blacklist Heimann USB Appset device
usb: gadget: eem: fix echo command packet response issue
net: can: ems_usb: fix use-after-free in ems_usb_disconnect()
Input: usbtouchscreen - fix control-request directions
media: dvb-usb: fix wrong definition
ALSA: hda/realtek: Apply LED fixup for HP Dragonfly G1, too
ALSA: hda/realtek: Fix bass speaker DAC mapping for Asus UM431D
ALSA: hda/realtek: Improve fixup for HP Spectre x360 15-df0xxx
ALSA: hda/realtek: Add another ALC236 variant support
ALSA: intel8x0: Fix breakage at ac97 clock measurement
ALSA: usb-audio: scarlett2: Fix wrong resume call
ALSA: usb-audio: Fix OOB access at proc output
ALSA: usb-audio: fix rate on Ozone Z90 USB headset
Linux 5.4.131
xen/events: reset active flag for lateeoi events later
KVM: SVM: Call SEV Guest Decommission if ASID binding fails
s390/stack: fix possible register corruption with stack switch helper
KVM: SVM: Periodically schedule when unregistering regions on destroy
Linux 5.4.130
RDMA/mlx5: Block FDB rules when not in switchdev mode
gpio: AMD8111 and TQMX86 require HAS_IOPORT_MAP
drm/nouveau: fix dma_address check for CPU/GPU sync
scsi: sr: Return appropriate error code when disk is ejected
x86/efi: remove unused variables
Linux 5.4.129
certs: Move load_system_certificate_list to a common function
certs: Add EFI_CERT_X509_GUID support for dbx entries
x86/efi: move common keyring handler functions to new file
certs: Add wrapper function to check blacklisted binary hash
mm, futex: fix shared futex pgoff on shmem huge page
mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()
mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
mm: page_vma_mapped_walk(): get vma_address_end() earlier
mm: page_vma_mapped_walk(): use goto instead of while (1)
mm: page_vma_mapped_walk(): add a level of indentation
mm: page_vma_mapped_walk(): crossing page table boundary
mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block
mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd
mm: page_vma_mapped_walk(): settle PageHuge on entry
mm: page_vma_mapped_walk(): use page for pvmw->page
mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
mm/thp: fix page_address_in_vma() on file THP tails
mm/thp: fix vma_address() if virtual address below file offset
mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
mm/thp: make is_huge_zero_pmd() safe and quicker
mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
mm, thp: use head page in __migration_entry_wait()
mm/rmap: use page_not_mapped in try_to_unmap()
mm/rmap: remove unneeded semicolon in page_not_mapped()
mm: add VM_WARN_ON_ONCE_PAGE() macro
kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
kthread_worker: split code for canceling the delayed work timer
i2c: robotfuzz-osif: fix control-request directions
KVM: do not allow mapping valid but non-reference-counted pages
nilfs2: fix memory leak in nilfs_sysfs_delete_device_group
pinctrl: stm32: fix the reported number of GPIO lines per bank
net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY
net: ll_temac: Add memory-barriers for TX BD access
PCI: Add AMD RS690 quirk to enable 64-bit DMA
recordmcount: Correct st_shndx handling
net: qed: Fix memcpy() overflow of qed_dcbx_params()
KVM: selftests: Fix kvm_check_cap() assertion
r8169: Avoid memcpy() over-reading of ETH_SS_STATS
sh_eth: Avoid memcpy() over-reading of ETH_SS_STATS
r8152: Avoid memcpy() over-reading of ETH_SS_STATS
net/packet: annotate accesses to po->ifindex
net/packet: annotate accesses to po->bind
net: caif: fix memory leak in ldisc_open
net: phy: dp83867: perform soft reset and retain established link
inet: annotate data races around sk->sk_txhash
ping: Check return value of function 'ping_queue_rcv_skb'
net: ethtool: clear heap allocations for ethtool function
mac80211: drop multicast fragments
net: ipv4: Remove unneeded BUG() function
dmaengine: mediatek: use GFP_NOWAIT instead of GFP_ATOMIC in prep_dma
dmaengine: mediatek: do not issue a new desc if one is still current
dmaengine: mediatek: free the proper desc in desc_free handler
dmaengine: rcar-dmac: Fix PM reference leak in rcar_dmac_probe()
cfg80211: call cfg80211_leave_ocb when switching away from OCB
mac80211_hwsim: drop pending frames on stop
mac80211: remove warning in ieee80211_get_sband()
dmaengine: zynqmp_dma: Fix PM reference leak in zynqmp_dma_alloc_chan_resources()
Revert "PCI: PM: Do not read power state in pci_enable_device_flags()"
spi: spi-nxp-fspi: move the register operation after the clock enable
MIPS: generic: Update node names to avoid unit addresses
arm64: link with -z norelro for LLD or aarch64-elf
kbuild: add CONFIG_LD_IS_LLD
mmc: meson-gx: use memcpy_to/fromio for dram-access-quirk
ARM: 9081/1: fix gcc-10 thumb2-kernel regression
drm/radeon: wait for moving fence after pinning
drm/nouveau: wait for moving fence after pinning v2
Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell."
Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue."
module: limit enabling module.sig_enforce
Revert "clocksource/drivers/timer-ti-dm: Handle dra7 timer wrap errata i940"
Linux 5.4.128
usb: dwc3: core: fix kernel panic when do reboot
usb: dwc3: debugfs: Add and remove endpoint dirs dynamically
clocksource/drivers/timer-ti-dm: Handle dra7 timer wrap errata i940
clocksource/drivers/timer-ti-dm: Prepare to handle dra7 timer wrap issue
clocksource/drivers/timer-ti-dm: Add clockevent and clocksource support
ARM: OMAP: replace setup_irq() by request_irq()
KVM: arm/arm64: Fix KVM_VGIC_V3_ADDR_TYPE_REDIST read
tools headers UAPI: Sync linux/in.h copy with the kernel sources
net: fec_ptp: add clock rate zero check
net: stmmac: disable clocks in stmmac_remove_config_dt()
mm/slub.c: include swab.h
mm/slub: fix redzoning for small allocations
mm/slub: clarify verification reporting
net: bridge: fix vlan tunnel dst refcnt when egressing
net: bridge: fix vlan tunnel dst null pointer dereference
net: ll_temac: Fix TX BD buffer overwrite
net: ll_temac: Make sure to free skb when it is completely used
drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue.
drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell.
cfg80211: avoid double free of PMSR request
cfg80211: make certificate generation more robust
dmaengine: pl330: fix wrong usage of spinlock flags in dma_cyclc
x86/fpu: Reset state for all signal restore failures
x86/pkru: Write hardware init value to PKRU when xstate is init
x86/process: Check PF_KTHREAD and not current->mm for kernel threads
ARCv2: save ABI registers across signal handling
KVM: x86: Immediately reset the MMU context when the SMM flag is cleared
PCI: Work around Huawei Intelligent NIC VF FLR erratum
PCI: Add ACS quirk for Broadcom BCM57414 NIC
PCI: aardvark: Fix kernel panic during PIO transfer
PCI: aardvark: Don't rely on jiffies while holding spinlock
PCI: Mark some NVIDIA GPUs to avoid bus reset
PCI: Mark TI C667X to avoid bus reset
tracing: Do not increment trace_clock_global() by one
tracing: Do not stop recording comms if the trace file is being read
tracing: Do not stop recording cmdlines when tracing is off
usb: core: hub: Disable autosuspend for Cypress CY7C65632
can: mcba_usb: fix memory leak in mcba_usb
can: j1939: fix Use-after-Free, hold skb ref while in use
can: bcm/raw/isotp: use per module netdevice notifier
can: bcm: fix infoleak in struct bcm_msg_head
hwmon: (scpi-hwmon) show the negative temperature properly
radeon: use memcpy_to/fromio for UVD fw upload
pinctrl: ralink: rt2880: avoid erroring out in calls if pin is already enabled
spi: stm32-qspi: Always wait BUSY bit to be cleared in stm32_qspi_wait_cmd()
ASoC: rt5659: Fix the lost powers for the HDA header
regulator: bd70528: Fix off-by-one for buck123 .n_voltages setting
net: ethernet: fix potential use-after-free in ec_bhf_remove
icmp: don't send out ICMP messages with a source address of 0.0.0.0
bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path
bnxt_en: Rediscover PHY capabilities after firmware reset
cxgb4: fix wrong shift.
net: cdc_eem: fix tx fixup skb leak
net: hamradio: fix memory leak in mkiss_close
be2net: Fix an error handling path in 'be_probe()'
net/af_unix: fix a data-race in unix_dgram_sendmsg / unix_release_sock
net: ipv4: fix memory leak in ip_mc_add1_src
net: fec_ptp: fix issue caused by refactor the fec_devtype
net: usb: fix possible use-after-free in smsc75xx_bind
lantiq: net: fix duplicated skb in rx descriptor ring
net: cdc_ncm: switch to eth%d interface naming
ptp: improve max_adj check against unreasonable values
net: qrtr: fix OOB Read in qrtr_endpoint_post
netxen_nic: Fix an error handling path in 'netxen_nic_probe()'
qlcnic: Fix an error handling path in 'qlcnic_probe()'
net: make get_net_ns return error if NET_NS is disabled
net: stmmac: dwmac1000: Fix extended MAC address registers definition
alx: Fix an error handling path in 'alx_probe()'
sch_cake: Fix out of bounds when parsing TCP options and header
netfilter: synproxy: Fix out of bounds when parsing TCP options
net/mlx5e: Block offload of outer header csum for UDP tunnels
net/mlx5e: allow TSO on VXLAN over VLAN topologies
net/mlx5: Consider RoCE cap before init RDMA resources
net/mlx5e: Fix page reclaim for dead peer hairpin
net/mlx5e: Remove dependency in IPsec initialization flows
net/sched: act_ct: handle DNAT tuple collision
rtnetlink: Fix regression in bridge VLAN configuration
udp: fix race between close() and udp_abort()
net: lantiq: disable interrupt before scheduling NAPI
net: rds: fix memory leak in rds_recvmsg
vrf: fix maximum MTU
net: ipv4: fix memory leak in netlbl_cipsov4_add_std
batman-adv: Avoid WARN_ON timing related checks
kvm: LAPIC: Restore guard to prevent illegal APIC register access
mm/memory-failure: make sure wait for page writeback in memory_failure
afs: Fix an IS_ERR() vs NULL check
dmaengine: stedma40: add missing iounmap() on error in d40_probe()
dmaengine: QCOM_HIDMA_MGMT depends on HAS_IOMEM
dmaengine: ALTERA_MSGDMA depends on HAS_IOMEM
Linux 5.4.127
fib: Return the correct errno code
net: Return the correct errno code
net/x25: Return the correct errno code
rtnetlink: Fix missing error code in rtnl_bridge_notify()
drm/amd/display: Allow bandwidth validation for 0 streams.
net: ipconfig: Don't override command-line hostnames or domains
nvme-loop: check for NVME_LOOP_Q_LIVE in nvme_loop_destroy_admin_queue()
nvme-loop: clear NVME_LOOP_Q_LIVE when nvme_loop_configure_admin_queue() fails
nvme-loop: reset queue count to 1 in nvme_loop_destroy_io_queues()
scsi: scsi_devinfo: Add blacklist entry for HPE OPEN-V
scsi: qedf: Do not put host in qedf_vport_create() unconditionally
ethernet: myri10ge: Fix missing error code in myri10ge_probe()
scsi: target: core: Fix warning on realtime kernels
gfs2: Fix use-after-free in gfs2_glock_shrink_scan
riscv: Use -mno-relax when using lld linker
HID: gt683r: add missing MODULE_DEVICE_TABLE
gfs2: Prevent direct-I/O write fallback errors from getting lost
ARM: OMAP2+: Fix build warning when mmc_omap is not built
drm/tegra: sor: Do not leak runtime PM reference
HID: usbhid: fix info leak in hid_submit_ctrl
HID: Add BUS_VIRTUAL to hid_connect logging
HID: multitouch: set Stylus suffix for Stylus-application devices, too
HID: quirks: Add quirk for Lenovo optical mouse
HID: hid-sensor-hub: Return error for hid_set_field() failure
HID: hid-input: add mapping for emoji picker key
HID: quirks: Set INCREMENT_USAGE_ON_DUPLICATE for Saitek X65
net: ieee802154: fix null deref in parse dev addr
Revert "RDMA/ipoib: Fix warning caused by destroying non-initial netns"
Linux 5.4.126
proc: only require mm_struct for writing
tracing: Correct the length check which causes memory corruption
ftrace: Do not blindly read the ip address in ftrace_bug()
scsi: core: Only put parent device if host state differs from SHOST_CREATED
scsi: core: Put .shost_dev in failure path if host state changes to RUNNING
scsi: core: Fix failure handling of scsi_add_host_with_dma()
scsi: core: Fix error handling of scsi_host_alloc()
NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
NFSv4: Fix second deadlock in nfs4_evict_inode()
NFS: Fix use-after-free in nfs4_init_client()
kvm: fix previous commit for 32-bit builds
perf session: Correct buffer copying when peeking events
NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()
NFS: Fix a potential NULL dereference in nfs_get_client()
IB/mlx5: Fix initializing CQ fragments buffer
KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message
sched/fair: Make sure to update tg contrib for blocked load
perf: Fix data race between pin_count increment/decrement
vmlinux.lds.h: Avoid orphan section with !SMP
RDMA/mlx4: Do not map the core_clock page to user space unless enabled
RDMA/ipoib: Fix warning caused by destroying non-initial netns
usb: typec: mux: Fix copy-paste mistake in typec_mux_match
regulator: max77620: Use device_set_of_node_from_dev()
regulator: core: resolve supply for boot-on/always-on regulators
usb: fix various gadget panics on 10gbps cabling
usb: fix various gadgets null ptr deref on 10gbps cabling.
usb: gadget: eem: fix wrong eem header operation
USB: serial: cp210x: fix alternate function for CP2102N QFN20
USB: serial: quatech2: fix control-request directions
USB: serial: omninet: add device id for Zyxel Omni 56K Plus
USB: serial: ftdi_sio: add NovaTech OrionMX product ID
usb: gadget: f_fs: Ensure io_completion_wq is idle during unbind
usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path
usb: typec: wcove: Use LE to CPU conversion when accessing msg->header
usb: musb: fix MUSB_QUIRK_B_DISCONNECT_99 handling
usb: dwc3: ep0: fix NULL pointer exception
usb: pd: Set PD_T_SINK_WAIT_CAP to 310ms
usb: f_ncm: only first packet of aggregate needs to start timer
USB: f_ncm: ncm_bitrate (speed) is unsigned
cgroup1: don't allow '\n' in renaming
btrfs: promote debugging asserts to full-fledged checks in validate_super
btrfs: return value from btrfs_mark_extent_written() in case of error
staging: rtl8723bs: Fix uninitialized variables
kvm: avoid speculation-based attacks from out-of-range memslot accesses
drm: Lock pointer access in drm_master_release()
drm: Fix use-after-free read in drm_getunique()
spi: bcm2835: Fix out-of-bounds access with more than 4 slaves
x86/boot: Add .text.* to setup.ld
i2c: mpc: implement erratum A-004447 workaround
i2c: mpc: Make use of i2c_recover_bus()
spi: Cleanup on failure of initial setup
spi: Don't have controller clean up spi device before driver unbind
powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P1010 i2c controllers
powerpc/fsl: set fsl,i2c-erratum-a004447 flag for P2041 i2c controllers
nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME
bnx2x: Fix missing error code in bnx2x_iov_init_one()
dm verity: fix require_signatures module_param permissions
MIPS: Fix kernel hang under FUNCTION_GRAPH_TRACER and PREEMPT_TRACER
nvme-fabrics: decode host pathing error for connect
net: dsa: microchip: enable phy errata workaround on 9567
net: appletalk: cops: Fix data race in cops_probe1
net: macb: ensure the device is available before accessing GEMGXL control registers
scsi: target: qla2xxx: Wait for stop_phase1 at WWN removal
scsi: hisi_sas: Drop free_irq() of devm_request_irq() allocated irq
scsi: vmw_pvscsi: Set correct residual data length
scsi: bnx2fc: Return failure if io_req is already in ABTS processing
RDS tcp loopback connection can hang
net/qla3xxx: fix schedule while atomic in ql_sem_spinlock
wq: handle VM suspension in stall detection
cgroup: disable controllers at parse time
net: mdiobus: get rid of a BUG_ON()
netlink: disable IRQs for netlink_lock_table()
bonding: init notify_work earlier to avoid uninitialized use
isdn: mISDN: netjet: Fix crash in nj_probe:
spi: sprd: Add missing MODULE_DEVICE_TABLE
ASoC: sti-sas: add missing MODULE_DEVICE_TABLE
vfio-ccw: Serialize FSM IDLE state with I/O completion
ASoC: Intel: bytcr_rt5640: Add quirk for the Lenovo Miix 3-830 tablet
ASoC: Intel: bytcr_rt5640: Add quirk for the Glavey TM800A550L tablet
usb: cdns3: Fix runtime PM imbalance on error
net/nfc/rawsock.c: fix a permission check bug
spi: Fix spi device unregister flow
ASoC: max98088: fix ni clock divider calculation
proc: Track /proc/$pid/attr/ opener mm_struct
ANDROID: GKI: update .xml file
ANDROID: restore abi breakage in usbnet.h
Linux 5.4.125
neighbour: allow NUD_NOARP entries to be forced GCed
i2c: qcom-geni: Suspend and resume the bus during SYSTEM_SLEEP_PM ops
xen-pciback: redo VF placement in the virtual topology
lib/lz4: explicitly support in-place decompression
x86/kvm: Disable all PV features on crash
x86/kvm: Disable kvmclock on all CPUs on shutdown
x86/kvm: Teardown PV features on boot CPU as well
KVM: arm64: Fix debug register indexing
KVM: SVM: Truncate GPR value for DR and CR accesses in !64-bit mode
btrfs: fix unmountable seed device after fstrim
mm/filemap: fix storing to a THP shadow entry
XArray: add xas_split
XArray: add xa_get_order
mm: add thp_order
bnxt_en: Remove the setting of dev_port.
mm, hugetlb: fix simple resv_huge_pages underflow on UFFDIO_COPY
btrfs: fixup error handling in fixup_inode_link_counts
btrfs: return errors from btrfs_del_csums in cleanup_ref_head
btrfs: fix error handling in btrfs_del_csums
btrfs: mark ordered extent and inode with error if we fail to finish
x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing
drm/amdgpu: make sure we unpin the UVD BO
drm/amdgpu: Don't query CE and UE errors
nfc: fix NULL ptr dereference in llcp_sock_getname() after failed connect
ocfs2: fix data corruption by fallocate
pid: take a reference when initializing `cad_pid`
usb: dwc2: Fix build in peripheral-only mode
ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed
ARM: dts: imx6q-dhcom: Add PU,VDD1P1,VDD2P5 regulators
ARM: dts: imx6dl-yapp4: Fix RGMII connection to QCA8334 switch
ALSA: hda: Fix for mute key LED for HP Pavilion 15-CK0xx
ALSA: timer: Fix master timer notification
HID: multitouch: require Finger field to mark Win8 reports as MT
HID: magicmouse: fix NULL-deref on disconnect
HID: i2c-hid: Skip ELAN power-on command after reset
net: caif: fix memory leak in cfusbl_device_notify
net: caif: fix memory leak in caif_device_notify
net: caif: add proper error handling
net: caif: added cfserl_release function
Bluetooth: use correct lock to prevent UAF of hdev object
Bluetooth: fix the erroneous flush_work() order
tipc: fix unique bearer names sanity check
tipc: add extack messages for bearer/media failure
bus: ti-sysc: Fix flakey idling of uarts and stop using swsup_sidle_act
ARM: dts: imx: emcon-avari: Fix nxp,pca8574 #gpio-cells
ARM: dts: imx7d-pico: Fix the 'tuning-step' property
ARM: dts: imx7d-meerkat96: Fix the 'tuning-step' property
arm64: dts: zii-ultra: fix 12V_MAIN voltage
arm64: dts: ls1028a: fix memory node
i40e: add correct exception tracing for XDP
i40e: optimize for XDP_REDIRECT in xsk path
i2c: qcom-geni: Add shutdown callback for i2c
ice: Allow all LLDP packets from PF to Tx
ice: Fix VFR issues for AVF drivers that expect ATQLEN cleared
ice: write register with correct offset
ipv6: Fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions
ixgbevf: add correct exception tracing for XDP
ieee802154: fix error return code in ieee802154_llsec_getparams()
ieee802154: fix error return code in ieee802154_add_iface()
netfilter: nfnetlink_cthelper: hit EBUSY on updates if size mismatches
netfilter: nft_ct: skip expectations for confirmed conntrack
ACPICA: Clean up context mutex during object deletion
net/sched: act_ct: Fix ct template allocation for zone 0
HID: i2c-hid: fix format string mismatch
HID: pidff: fix error return code in hid_pidff_init()
ipvs: ignore IP_VS_SVC_F_HASHED flag when adding service
vfio/platform: fix module_put call in error flow
samples: vfio-mdev: fix error handling in mdpy_fb_probe()
vfio/pci: zap_vma_ptes() needs MMU
vfio/pci: Fix error return code in vfio_ecap_init()
efi: cper: fix snprintf() use in cper_dimm_err_location()
efi: Allow EFI_MEMORY_XP and EFI_MEMORY_RO both to be cleared
netfilter: conntrack: unregister ipv4 sockopts on error unwind
hwmon: (dell-smm-hwmon) Fix index values
nl80211: validate key indexes for cfg80211_registered_device
ALSA: usb: update old-style static const declaration
net: usb: cdc_ncm: don't spew notifications
btrfs: tree-checker: do not error out if extent ref hash doesn't match
ANDROID: GKI: Preserve abi change in ieee80211_data_to_8023_exthdr()
Linux 5.4.124
usb: core: reduce power-on-good delay time of root hub
neighbour: Prevent race condition in neighbour subsystem
net: hso: bail out on interrupt URB allocation failure
Revert "Revert "ALSA: usx2y: Fix potential NULL pointer dereference""
net: hns3: check the return of skb_checksum_help()
drivers/net/ethernet: clean up unused assignments
i915: fix build warning in intel_dp_get_link_status()
drm/i915/display: fix compiler warning about array overrun
MIPS: ralink: export rt_sysc_membase for rt2880_wdt.c
MIPS: alchemy: xxs1500: add gpio-au1000.h header file
sch_dsmark: fix a NULL deref in qdisc_reset()
net: ethernet: mtk_eth_soc: Fix packet statistics support for MT7628/88
ALSA: usb-audio: scarlett2: snd_scarlett_gen2_controls_create() can be static
ipv6: record frag_max_size in atomic fragments in input path
net: lantiq: fix memory corruption in RX ring
scsi: libsas: Use _safe() loop in sas_resume_port()
ixgbe: fix large MTU request from VF
bpf: Set mac_len in bpf_skb_change_head
ASoC: cs35l33: fix an error code in probe()
staging: emxx_udc: fix loop in _nbu2ss_nuke()
cxgb4: avoid accessing registers when clearing filters
gve: Correct SKB queue index validation.
gve: Upgrade memory barrier in poll routine
gve: Add NULL pointer checks when freeing irqs.
gve: Update mgmt_msix_idx if num_ntfy changes
gve: Check TX QPL was actually assigned
mld: fix panic in mld_newpack()
bnxt_en: Include new P5 HV definition in VF check.
net: bnx2: Fix error return code in bnx2_init_board()
net: hso: check for allocation failure in hso_create_bulk_serial_device()
net: sched: fix tx action reschedule issue with stopped queue
net: sched: fix tx action rescheduling issue during deactivation
net: sched: fix packet stuck problem for lockless qdisc
tls splice: check SPLICE_F_NONBLOCK instead of MSG_DONTWAIT
openvswitch: meter: fix race when getting now_ms.
net: mdio: octeon: Fix some double free issues
net: mdio: thunder: Fix a double free issue in the .remove function
net: fec: fix the potential memory leak in fec_enet_init()
net: really orphan skbs tied to closing sk
vfio-ccw: Check initialized flag in cp_init()
ASoC: cs42l42: Regmap must use_single_read/write
net: dsa: fix error code getting shifted with 4 in dsa_slave_get_sset_count
net: netcp: Fix an error message
drm/amd/amdgpu: fix a potential deadlock in gpu reset
drm/amdgpu: Fix a use-after-free
drm/amd/amdgpu: fix refcount leak
drm/amd/display: Disconnect non-DP with no EDID
SMB3: incorrect file id in requests compounded with open
platform/x86: touchscreen_dmi: Add info for the Mediacom Winpad 7.0 W700 tablet
platform/x86: intel_punit_ipc: Append MODULE_DEVICE_TABLE for ACPI
platform/x86: hp-wireless: add AMD's hardware id to the supported list
btrfs: do not BUG_ON in link_to_fixup_dir
openrisc: Define memory barrier mb
scsi: BusLogic: Fix 64-bit system enumeration error for BusLogic
btrfs: return whole extents in fiemap
brcmfmac: properly check for bus register errors
Revert "brcmfmac: add a check for the status of usb_register"
net: liquidio: Add missing null pointer checks
Revert "net: liquidio: fix a NULL pointer dereference"
media: gspca: properly check for errors in po1030_probe()
Revert "media: gspca: Check the return value of write_bridge for timeout"
media: gspca: mt9m111: Check write_bridge for timeout
Revert "media: gspca: mt9m111: Check write_bridge for timeout"
media: dvb: Add check on sp8870_readreg return
Revert "media: dvb: Add check on sp8870_readreg"
ASoC: cs43130: handle errors in cs43130_probe() properly
Revert "ASoC: cs43130: fix a NULL pointer dereference"
libertas: register sysfs groups properly
Revert "libertas: add checks for the return value of sysfs_create_group"
dmaengine: qcom_hidma: comment platform_driver_register call
Revert "dmaengine: qcom_hidma: Check for driver register failure"
isdn: mISDN: correctly handle ph_info allocation failure in hfcsusb_ph_info
Revert "isdn: mISDN: Fix potential NULL pointer dereference of kzalloc"
ath6kl: return error code in ath6kl_wmi_set_roam_lrssi_cmd()
Revert "ath6kl: return error code in ath6kl_wmi_set_roam_lrssi_cmd()"
isdn: mISDNinfineon: check/cleanup ioremap failure correctly in setup_io
Revert "isdn: mISDNinfineon: fix potential NULL pointer dereference"
Revert "ALSA: usx2y: Fix potential NULL pointer dereference"
Revert "ALSA: gus: add a check of the status of snd_ctl_add"
char: hpet: add checks after calling ioremap
Revert "char: hpet: fix a missing check of ioremap"
net: caif: remove BUG_ON(dev == NULL) in caif_xmit
Revert "net/smc: fix a NULL pointer dereference"
net: fujitsu: fix potential null-ptr-deref
Revert "net: fujitsu: fix a potential NULL pointer dereference"
serial: max310x: unregister uart driver in case of failure and abort
Revert "serial: max310x: pass return value of spi_register_driver"
Revert "ALSA: sb: fix a missing check of snd_ctl_add"
Revert "media: usb: gspca: add a missed check for goto_low_power"
gpio: cadence: Add missing MODULE_DEVICE_TABLE
platform/x86: hp_accel: Avoid invoking _INI to speed up resume
perf jevents: Fix getting maximum number of fds
i2c: sh_mobile: Use new clock calculation formulas for RZ/G2E
i2c: i801: Don't generate an interrupt on bus reset
i2c: s3c2410: fix possible NULL pointer deref on read message after write
net: dsa: sja1105: error out on unsupported PHY mode
net: dsa: fix a crash if ->get_sset_count() fails
net: dsa: mt7530: fix VLAN traffic leaks
spi: spi-fsl-dspi: Fix a resource leak in an error handling path
tipc: skb_linearize the head skb when reassembling msgs
tipc: wait and exit until all work queues are done
Revert "net:tipc: Fix a double free in tipc_sk_mcast_rcv"
net/mlx4: Fix EEPROM dump support
net/mlx5e: Fix nullptr in add_vlan_push_action()
net/mlx5e: Fix multipath lag activation
drm/meson: fix shutdown crash when component not probed
NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config
NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce()
NFS: Fix an Oopsable condition in __nfs_pageio_add_request()
NFS: fix an incorrect limit in filelayout_decode_layout()
fs/nfs: Use fatal_signal_pending instead of signal_pending
Bluetooth: cmtp: fix file refcount when cmtp_attach_device fails
spi: spi-geni-qcom: Fix use-after-free on unbind
net: usb: fix memory leak in smsc75xx_bind
usb: gadget: udc: renesas_usb3: Fix a race in usb3_start_pipen()
usb: dwc3: gadget: Properly track pending and queued SG
thermal/drivers/intel: Initialize RW trip to THERMAL_TEMP_INVALID
USB: serial: pl2303: add device id for ADLINK ND-6530 GC
USB: serial: ftdi_sio: add IDs for IDS GmbH Products
USB: serial: option: add Telit LE910-S1 compositions 0x7010, 0x7011
USB: serial: ti_usb_3410_5052: add startech.com device id
serial: rp2: use 'request_firmware' instead of 'request_firmware_nowait'
serial: sh-sci: Fix off-by-one error in FIFO threshold register setting
serial: tegra: Fix a mask operation that is always true
USB: usbfs: Don't WARN about excessively large memory allocations
USB: trancevibrator: fix control-request direction
serial: 8250_pci: handle FL_NOIRQ board flag
serial: 8250_pci: Add support for new HPE serial device
iio: adc: ad7793: Add missing error code in ad7793_setup()
iio: adc: ad7124: Fix potential overflow due to non sequential channel numbers
iio: adc: ad7124: Fix unbalanced regulator enable/disable on error
iio: adc: ad7768-1: Fix too small buffer passed to iio_push_to_buffers_with_timestamp()
iio: gyro: fxas21002c: balance runtime power in error path
staging: iio: cdc: ad7746: avoid overwrite of num_channels
mei: request autosuspend after sending rx flow control
thunderbolt: dma_port: Fix NVM read buffer bounds and offset issue
misc/uss720: fix memory leak in uss720_probe
serial: core: fix suspicious security_locked_down() call
Documentation: seccomp: Fix user notification documentation
kgdb: fix gcc-11 warnings harder
selftests/gpio: Fix build when source tree is read only
selftests/gpio: Move include of lib.mk up
selftests/gpio: Use TEST_GEN_PROGS_EXTENDED
drm/amdgpu/vcn2.5: add cancel_delayed_work_sync before power gate
drm/amdgpu/vcn2.0: add cancel_delayed_work_sync before power gate
drm/amdgpu/vcn1: add cancel_delayed_work_sync before power gate
dm snapshot: properly fix a crash when an origin has no snapshots
ath10k: Validate first subframe of A-MSDU before processing the list
ath10k: Fix TKIP Michael MIC verification for PCIe
ath10k: drop MPDU which has discard flag set by firmware for SDIO
ath10k: drop fragments with multicast DA for SDIO
ath10k: drop fragments with multicast DA for PCIe
ath10k: add CCMP PN replay protection for fragmented frames for PCIe
mac80211: extend protection against mixed key and fragment cache attacks
mac80211: do not accept/forward invalid EAPOL frames
mac80211: prevent attacks on TKIP/WEP as well
mac80211: check defrag PN against current frame
mac80211: add fragment cache to sta_info
mac80211: drop A-MSDUs on old ciphers
cfg80211: mitigate A-MSDU aggregation attacks
mac80211: properly handle A-MSDUs that start with an RFC 1042 header
mac80211: prevent mixed key and fragment cache attacks
mac80211: assure all fragments are encrypted
net: hso: fix control-request directions
proc: Check /proc/$pid/attr/ writes against file opener
perf scripts python: exported-sql-viewer.py: Fix warning display
perf scripts python: exported-sql-viewer.py: Fix Array TypeError
perf scripts python: exported-sql-viewer.py: Fix copy to clipboard from Top Calls by elapsed Time report
perf intel-pt: Fix transaction abort handling
perf intel-pt: Fix sample instruction bytes
iommu/vt-d: Fix sysfs leak in alloc_iommu()
NFSv4: Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return()
cifs: set server->cipher_type to AES-128-CCM for SMB3.0
ALSA: usb-audio: scarlett2: Improve driver startup messages
ALSA: usb-audio: scarlett2: Fix device hang with ehci-pci
ALSA: hda/realtek: Headphone volume is controlled by Front mixer
ANDROID: GKI: update .xml file due to merge with `android11-5.4`
Linux 5.4.123
NFC: nci: fix memory leak in nci_allocate_device
perf unwind: Set userdata for all __report_module() paths
perf unwind: Fix separate debug info files when using elfutils' libdw's unwinder
usb: dwc3: gadget: Enable suspend events
bpf: No need to simulate speculative domain for immediates
bpf: Fix mask direction swap upon off reg sign change
bpf: Wrap aux data inside bpf_sanitize_info container
ANDROID: GKI: add thermal_zone_get_slope() to the .xml file
Linux 5.4.122
Bluetooth: SMP: Fail if remote and local public keys are identical
video: hgafb: correctly handle card detect failure during probe
nvmet: use new ana_log_size instead of the old one
Bluetooth: L2CAP: Fix handling LE modes by L2CAP_OPTIONS
ext4: fix error handling in ext4_end_enable_verity()
nvme-multipath: fix double initialization of ANA state
tty: vt: always invoke vc->vc_sw->con_resize callback
vt: Fix character height handling with VT_RESIZEX
vgacon: Record video mode changes with VT_RESIZEX
video: hgafb: fix potential NULL pointer dereference
qlcnic: Add null check after calling netdev_alloc_skb
leds: lp5523: check return value of lp5xx_read and jump to cleanup code
ics932s401: fix broken handling of errors when word reading fails
net: rtlwifi: properly check for alloc_workqueue() failure
scsi: ufs: handle cleanup correctly on devm_reset_control_get error
net: stmicro: handle clk_prepare() failure during init
ethernet: sun: niu: fix missing checks of niu_pci_eeprom_read()
Revert "niu: fix missing checks of niu_pci_eeprom_read"
Revert "qlcnic: Avoid potential NULL pointer dereference"
Revert "rtlwifi: fix a potential NULL pointer dereference"
Revert "media: rcar_drif: fix a memory disclosure"
cdrom: gdrom: initialize global variable at init time
cdrom: gdrom: deallocate struct gdrom_unit fields in remove_gdrom
Revert "gdrom: fix a memory leak bug"
Revert "scsi: ufs: fix a missing check of devm_reset_control_get"
Revert "ecryptfs: replace BUG_ON with error handling code"
Revert "video: imsttfb: fix potential NULL pointer dereferences"
Revert "hwmon: (lm80) fix a missing check of bus read in lm80 probe"
Revert "leds: lp5523: fix a missing check of return value of lp55xx_read"
Revert "net: stmicro: fix a missing check of clk_prepare"
Revert "video: hgafb: fix potential NULL pointer dereference"
dm snapshot: fix crash with transient storage and zero chunk size
xen-pciback: reconfigure also from backend watch handler
mmc: sdhci-pci-gli: increase 1.8V regulator wait
drm/amdgpu: update sdma golden setting for Navi12
drm/amdgpu: update gc golden setting for Navi12
drm/amdgpu: disable 3DCGCG on picasso/raven1 to avoid compute hang
Revert "serial: mvebu-uart: Fix to avoid a potential NULL pointer dereference"
rapidio: handle create_workqueue() failure
Revert "rapidio: fix a NULL pointer dereference when create_workqueue() fails"
uio_hv_generic: Fix a memory leak in error handling paths
ALSA: hda/realtek: Add fixup for HP Spectre x360 15-df0xxx
ALSA: hda/realtek: Add fixup for HP OMEN laptop
ALSA: hda/realtek: Fix silent headphone output on ASUS UX430UA
ALSA: hda/realtek: Add some CLOVE SSIDs of ALC293
ALSA: hda/realtek: reset eapd coeff to default value for alc287
ALSA: firewire-lib: fix check for the size of isochronous packet payload
Revert "ALSA: sb8: add a check for request_region"
ALSA: hda: fixup headset for ASUS GU502 laptop
ALSA: bebob/oxfw: fix Kconfig entry for Mackie d.2 Pro
ALSA: usb-audio: Validate MS endpoint descriptors
ALSA: firewire-lib: fix calculation for size of IR context payload
ALSA: dice: fix stream format at middle sampling rate for Alesis iO 26
ALSA: line6: Fix racy initialization of LINE6 MIDI
ALSA: intel8x0: Don't update period unless prepared
ALSA: dice: fix stream format for TC Electronic Konnekt Live at high sampling transfer frequency
cifs: fix memory leak in smb2_copychunk_range
btrfs: avoid RCU stalls while running delayed iputs
locking/mutex: clear MUTEX_FLAGS if wait_list is empty due to signal
nvmet: reset ns->file when open fails
ptrace: make ptrace() fail if the tracee changed its pid unexpectedly
RDMA/uverbs: Fix a NULL vs IS_ERR() bug
platform/x86: dell-smbios-wmi: Fix oops on rmmod dell_smbios
platform/mellanox: mlxbf-tmfifo: Fix a memory barrier issue
RDMA/core: Don't access cm_id after its destruction
RDMA/mlx5: Recover from fatal event in dual port mode
scsi: qla2xxx: Fix error return code in qla82xx_write_flash_dword()
scsi: ufs: core: Increase the usable queue depth
RDMA/rxe: Clear all QP fields if creation failed
RDMA/siw: Release xarray entry
RDMA/siw: Properly check send and receive CQ pointers
openrisc: Fix a memory leak
firmware: arm_scpi: Prevent the ternary sign expansion bug
Linux 5.4.121
scripts: switch explicitly to Python 3
treewide: Fix most shebang lines
KVM: arm64: Initialize VCPU mdcr_el2 before loading it
ipv6: remove extra dev_hold() for fallback tunnels
ip6_tunnel: sit: proper dev_{hold|put} in ndo_[un]init methods
sit: proper dev_{hold|put} in ndo_[un]init methods
ip6_gre: proper dev_{hold|put} in ndo_[un]init methods
net: stmmac: Do not enable RX FIFO overflow interrupts
lib: stackdepot: turn depot_lock spinlock to raw_spinlock
block: reexpand iov_iter after read/write
ALSA: hda: generic: change the DAC ctl name for LO+SPK or LO+HP
gpiolib: acpi: Add quirk to ignore EC wakeups on Dell Venue 10 Pro 5055
drm/amd/display: Fix two cursor duplication when using overlay
bridge: Fix possible races between assigning rx_handler_data and setting IFF_BRIDGE_PORT bit
scsi: target: tcmu: Return from tcmu_handle_completions() if cmd_id not found
ceph: fix fscache invalidation
scsi: lpfc: Fix illegal memory access on Abort IOCBs
riscv: Workaround mcount name prior to clang-13
scripts/recordmcount.pl: Fix RISC-V regex for clang
ARM: 9075/1: kernel: Fix interrupted SMC calls
um: Disable CONFIG_GCOV with MODULES
um: Mark all kernel symbols as local
Input: silead - add workaround for x86 BIOS-es which bring the chip up in a stuck state
Input: elants_i2c - do not bind to i2c-hid compatible ACPI instantiated devices
ACPI / hotplug / PCI: Fix reference count leak in enable_slot()
ARM: 9066/1: ftrace: pause/unpause function graph tracer in cpu_suspend()
dmaengine: dw-edma: Fix crash on loading/unloading driver
PCI: thunder: Fix compile testing
virtio_net: Do not pull payload in skb->head
xsk: Simplify detection of empty and full rings
pinctrl: ingenic: Improve unreachable code generation
isdn: capi: fix mismatched prototypes
cxgb4: Fix the -Wmisleading-indentation warning
usb: sl811-hcd: improve misleading indentation
kgdb: fix gcc-11 warning on indentation
x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes
ANDROID: GKI: genksyms fixup for efed9a3337e3 ("kyber: fix out of bounds access when preempted")
Revert "PM: runtime: Fix unpaired parent child_count for force_resume"
Revert "mm: fix struct page layout on 32-bit systems"
Linux 5.4.120
ASoC: rsnd: check all BUSIF status when error
nvme: do not try to reconfigure APST when the controller is not live
clk: exynos7: Mark aclk_fsys1_200 as critical
netfilter: conntrack: Make global sysctls readonly in non-init netns
kobject_uevent: remove warning in init_uevent_argv()
usb: typec: tcpm: Fix error while calculating PPS out values
ARM: 9027/1: head.S: explicitly map DT even if it lives in the first physical section
ARM: 9020/1: mm: use correct section size macro to describe the FDT virtual address
ARM: 9012/1: move device tree mapping out of linear region
ARM: 9011/1: centralize phys-to-virt conversion of DT/ATAGS address
f2fs: fix error handling in f2fs_end_enable_verity()
thermal/core/fair share: Lock the thermal zone while looping over instances
MIPS: Avoid handcoded DIVU in `__div64_32' altogether
MIPS: Avoid DIVU in `__div64_32' if result would be zero
MIPS: Reinstate platform `__div64_32' handler
FDDI: defxx: Make MMIO the configuration default except for EISA
mm: fix struct page layout on 32-bit systems
KVM: x86: Cancel pvclock_gtod_work on module removal
cdc-wdm: untangle a circular dependency between callback and softint
iio: tsl2583: Fix division by a zero lux_val
iio: gyro: mpu3050: Fix reported temperature value
xhci: Add reset resume quirk for AMD xhci controller.
xhci: Do not use GFP_KERNEL in (potentially) atomic context
usb: dwc3: gadget: Return success always for kick transfer in ep queue
usb: core: hub: fix race condition about TRSMRCY of resume
usb: dwc2: Fix gadget DMA unmap direction
usb: xhci: Increase timeout for HC halt
usb: dwc3: pci: Enable usb2-gadget-lpm-disable for Intel Merrifield
usb: dwc3: omap: improve extcon initialization
iomap: fix sub-page uptodate handling
blk-mq: Swap two calls in blk_mq_exit_queue()
nbd: Fix NULL pointer in flush_workqueue
kyber: fix out of bounds access when preempted
ACPI: scan: Fix a memory leak in an error handling path
hwmon: (occ) Fix poll rate limiting
usb: fotg210-hcd: Fix an error message
iio: proximity: pulsedlight: Fix runtime PM imbalance on error
drm/i915: Avoid div-by-zero on gen2
drm/radeon/dpm: Disable sclk switching on Oland when two 4K 60Hz monitors are connected
mm/hugetlb: fix F_SEAL_FUTURE_WRITE
userfaultfd: release page in error path to avoid BUG_ON
squashfs: fix divide error in calculate_skip()
hfsplus: prevent corruption in shrinking truncate
powerpc/64s: Fix crashes when toggling entry flush barrier
powerpc/64s: Fix crashes when toggling stf barrier
ARC: mm: PAE: use 40-bit physical page mask
ARC: entry: fix off-by-one error in syscall number validation
i40e: Fix PHY type identifiers for 2.5G and 5G adapters
i40e: fix the restart auto-negotiation after FEC modified
i40e: Fix use-after-free in i40e_client_subtask()
netfilter: nftables: avoid overflows in nft_hash_buckets()
kernel: kexec_file: fix error return code of kexec_calculate_store_digests()
sched/fair: Fix unfairness caused by missing load decay
sched: Fix out-of-bound access in uclamp
can: m_can: m_can_tx_work_queue(): fix tx_skb race condition
netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check
smc: disallow TCP_ULP in smc_setsockopt()
net: fix nla_strcmp to handle more then one trailing null character
ksm: fix potential missing rmap_item for stable_node
mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page()
mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()
khugepaged: fix wrong result value for trace_mm_collapse_huge_page_isolate()
drm/radeon: Avoid power table parsing memory leaks
drm/radeon: Fix off-by-one power_state index heap overwrite
netfilter: xt_SECMARK: add new revision to fix structure layout
sctp: fix a SCTP_MIB_CURRESTAB leak in sctp_sf_do_dupcook_b
ethernet: enic: Fix a use after free bug in enic_hard_start_xmit
sunrpc: Fix misplaced barrier in call_decode
RISC-V: Fix error code returned by riscv_hartid_to_cpuid()
sctp: do asoc update earlier in sctp_sf_do_dupcook_a
net: hns3: disable phy loopback setting in hclge_mac_start_phy
net: hns3: use netif_tx_disable to stop the transmit queue
net: hns3: fix for vxlan gpe tx checksum bug
net: hns3: add check for HNS3_NIC_STATE_INITED in hns3_reset_notify_up_enet()
net: hns3: initialize the message content in hclge_get_link_mode()
net: hns3: fix incorrect configuration for igu_egu_hw_err
rtc: ds1307: Fix wday settings for rx8130
ceph: fix inode leak on getattr error in __fh_to_dentry
rtc: fsl-ftm-alarm: add MODULE_TABLE()
NFSv4.2: fix handling of sr_eof in SEEK's reply
pNFS/flexfiles: fix incorrect size check in decode_nfs_fh()
PCI: endpoint: Fix missing destroy_workqueue()
NFS: Deal correctly with attribute generation counter overflow
NFSv4.2: Always flush out writes in nfs42_proc_fallocate()
rpmsg: qcom_glink_native: fix error return code of qcom_glink_rx_data()
ARM: 9064/1: hw_breakpoint: Do not directly check the event's overflow_handler hook
PCI: Release OF node in pci_scan_device()'s error path
PCI: iproc: Fix return value of iproc_msi_irq_domain_alloc()
f2fs: fix a redundant call to f2fs_balance_fs if an error occurs
thermal: thermal_of: Fix error return code of thermal_of_populate_bind_params()
ASoC: rt286: Make RT286_SET_GPIO_* readable and writable
ia64: module: fix symbolizer crash on fdescr
bnxt_en: Add PCI IDs for Hyper-V VF devices.
net: ethernet: mtk_eth_soc: fix RX VLAN offload
iavf: remove duplicate free resources calls
powerpc/iommu: Annotate nested lock for lockdep
qtnfmac: Fix possible buffer overflow in qtnf_event_handle_external_auth
wl3501_cs: Fix out-of-bounds warnings in wl3501_mgmt_join
wl3501_cs: Fix out-of-bounds warnings in wl3501_send_pkt
drm/amd/display: fixed divide by zero kernel crash during dsc enablement
powerpc/pseries: Stop calling printk in rtas_stop_self()
samples/bpf: Fix broken tracex1 due to kprobe argument change
net: sched: taprio: prevent cycle_time == 0 in parse_taprio_schedule
ethtool: ioctl: Fix out-of-bounds warning in store_link_ksettings_for_user()
ASoC: rt286: Generalize support for ALC3263 codec
powerpc/smp: Set numa node before updating mask
flow_dissector: Fix out-of-bounds warning in __skb_flow_bpf_to_target()
sctp: Fix out-of-bounds warning in sctp_process_asconf_param()
ALSA: hda/hdmi: fix race in handling acomp ELD notification at resume
kconfig: nconf: stop endless search loops
selftests: Set CC to clang in lib.mk if LLVM is set
drm/amd/display: Force vsync flip when reconfiguring MPCC
iommu/amd: Remove performance counter pre-initialization test
Revert "iommu/amd: Fix performance counter initialization"
ASoC: rsnd: call rsnd_ssi_master_clk_start() from rsnd_ssi_init()
cuse: prevent clone
mt76: mt76x0: disable GTK offloading
pinctrl: samsung: use 'int' for register masks in Exynos
mac80211: clear the beacon's CRC after channel switch
i2c: Add I2C_AQ_NO_REP_START adapter quirk
ASoC: Intel: bytcr_rt5640: Add quirk for the Chuwi Hi8 tablet
ip6_vti: proper dev_{hold|put} in ndo_[un]init methods
Bluetooth: check for zapped sk before connecting
net: bridge: when suppression is enabled exclude RARP packets
Bluetooth: initialize skb_queue_head at l2cap_chan_create()
Bluetooth: Set CONF_NOT_COMPLETE as l2cap_chan default
ALSA: bebob: enable to deliver MIDI messages for multiple ports
ALSA: rme9652: don't disable if not enabled
ALSA: hdspm: don't disable if not enabled
ALSA: hdsp: don't disable if not enabled
i2c: bail out early when RDWR parameters are wrong
ASoC: rsnd: core: Check convert rate in rsnd_hw_params
net: stmmac: Set FIFO sizes for ipq806x
ASoC: Intel: bytcr_rt5640: Enable jack-detect support on Asus T100TAF
tipc: convert dest node's address to network order
fs: dlm: fix debugfs dump
PM: runtime: Fix unpaired parent child_count for force_resume
KVM: x86/mmu: Remove the defunct update_pte() paging hook
tpm, tpm_tis: Reserve locality in tpm_tis_resume()
tpm, tpm_tis: Extend locality handling to TPM2 in tpm_tis_gen_interrupt()
tpm: fix error return code in tpm2_get_cc_attrs_tbl()
Revert "smp: Fix smp_call_function_single_async prototype"
Revert "usb: typec: tcpm: Address incorrect values of tcpm psy for fixed supply"
Revert "usb: typec: tcpm: Address incorrect values of tcpm psy for pps supply"
Revert "usb: typec: tcpm: update power supply once partner accepts"
Revert "spi: Fix use-after-free with devm_spi_alloc_*"
Linux 5.4.119
Revert "fdt: Properly handle "no-map" field in the memory region"
Revert "of/fdt: Make sure no-map does not remove already reserved regions"
sctp: delay auto_asconf init until binding the first addr
Revert "net/sctp: fix race condition in sctp_destroy_sock"
smp: Fix smp_call_function_single_async prototype
net: Only allow init netns to set default tcp cong to a restricted algo
mm/memory-failure: unnecessary amount of unmapping
mm/sparse: add the missing sparse_buffer_fini() in error branch
kfifo: fix ternary sign extension bugs
net: nfc: digital: Fix a double free in digital_tg_recv_dep_req
net: bridge: mcast: fix broken length + header check for MRDv6 Adv.
RDMA/bnxt_re: Fix a double free in bnxt_qplib_alloc_res
RDMA/siw: Fix a use after free in siw_alloc_mr
net: emac/emac-mac: Fix a use after free in emac_mac_tx_buf_send
bnxt_en: Fix RX consumer index logic in the error path.
selftests: net: mirror_gre_vlan_bridge_1q: Make an FDB entry static
net: geneve: modify IP header check in geneve6_xmit_skb and geneve_xmit_skb
arm64: dts: uniphier: Change phy-mode to RGMII-ID to enable delay pins for RTL8211E
ARM: dts: uniphier: Change phy-mode to RGMII-ID to enable delay pins for RTL8211E
bnxt_en: fix ternary sign extension bug in bnxt_show_temp()
powerpc/52xx: Fix an invalid ASM expression ('addi' used instead of 'add')
ath10k: Fix ath10k_wmi_tlv_op_pull_peer_stats_info() unlock without lock
ath9k: Fix error check in ath9k_hw_read_revisions() for PCI devices
net: phy: intel-xway: enable integrated led functions
net: renesas: ravb: Fix a stuck issue when a lot of frames are received
net: davinci_emac: Fix incorrect masking of tx and rx error channel
ALSA: usb: midi: don't return -ENOMEM when usb_urb_ep_type_check fails
RDMA/i40iw: Fix error unwinding when i40iw_hmc_sd_one fails
RDMA/cxgb4: add missing qpid increment
gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check
vsock/vmci: log once the failed queue pair allocation
mwl8k: Fix a double Free in mwl8k_probe_hw
i2c: sh7760: fix IRQ error path
rtlwifi: 8821ae: upgrade PHY and RF parameters
powerpc/pseries: extract host bridge from pci_bus prior to bus removal
MIPS: pci-legacy: stop using of_pci_range_to_resource
perf beauty: Fix fsconfig generator
drm/i915/gvt: Fix error code in intel_gvt_init_device()
ASoC: ak5558: correct reset polarity
powerpc/xive: Fix xmon command "dxi"
i2c: sh7760: add IRQ check
i2c: jz4780: add IRQ check
i2c: emev2: add IRQ check
i2c: cadence: add IRQ check
i2c: sprd: fix reference leak when pm_runtime_get_sync fails
i2c: omap: fix reference leak when pm_runtime_get_sync fails
i2c: imx-lpi2c: fix reference leak when pm_runtime_get_sync fails
i2c: img-scb: fix reference leak when pm_runtime_get_sync fails
RDMA/srpt: Fix error return code in srpt_cm_req_recv()
net: thunderx: Fix unintentional sign extension issue
cxgb4: Fix unintentional sign extension issues
IB/hfi1: Fix error return code in parse_platform_config()
RDMA/qedr: Fix error return code in qedr_iw_connect()
KVM: PPC: Book3S HV P9: Restore host CTRL SPR after guest exit
mt7601u: fix always true expression
mac80211: bail out if cipher schemes are invalid
powerpc: iommu: fix build when neither PCI or IBMVIO is set
powerpc/perf: Fix PMU constraint check for EBB events
powerpc/64s: Fix pte update for kernel memory on radix
liquidio: Fix unintended sign extension of a left shift of a u16
ASoC: simple-card: fix possible uninitialized single_cpu local variable
ALSA: usb-audio: Add error checks for usb_driver_claim_interface() calls
mips: bmips: fix syscon-reboot nodes
net: hns3: Limiting the scope of vector_ring_chain variable
nfc: pn533: prevent potential memory corruption
bug: Remove redundant condition check in report_bug
ALSA: core: remove redundant spin_lock pair in snd_card_disconnect
powerpc: Fix HAVE_HARDLOCKUP_DETECTOR_ARCH build configuration
inet: use bigger hash table for IP ID generation
powerpc/prom: Mark identical_pvr_fixup as __init
powerpc/fadump: Mark fadump_calculate_reserve_size as __init
net: lapbether: Prevent racing when checking whether the netif is running
perf symbols: Fix dso__fprintf_symbols_by_name() to return the number of printed chars
HID: plantronics: Workaround for double volume key presses
drivers/block/null_blk/main: Fix a double free in null_init.
sched/debug: Fix cgroup_path[] serialization
x86/events/amd/iommu: Fix sysfs type mismatch
HSI: core: fix resource leaks in hsi_add_client_from_dt()
nvme-pci: don't simple map sgl when sgls are disabled
mfd: stm32-timers: Avoid clearing auto reload register
scsi: ibmvfc: Fix invalid state machine BUG_ON()
scsi: sni_53c710: Add IRQ check
scsi: sun3x_esp: Add IRQ check
scsi: jazz_esp: Add IRQ check
scsi: hisi_sas: Fix IRQ checks
clk: uniphier: Fix potential infinite loop
clk: qcom: a53-pll: Add missing MODULE_DEVICE_TABLE
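Note: the missing MODULE_DEVICE_TABLE fixes in this and later releases (a53-pll here, mtd spinand and the ak4458/ak5558 codecs further down) restore module autoloading: without the table the match data is never exported to modprobe. Schematic example with an invented compatible string:

  #include <linux/mod_devicetable.h>
  #include <linux/module.h>

  static const struct of_device_id foo_of_match[] = {
          { .compatible = "vendor,foo-ctrl" },    /* hypothetical */
          { /* sentinel */ }
  };
  MODULE_DEVICE_TABLE(of, foo_of_match);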
clk: zynqmp: move zynqmp_pll_set_mode out of round_rate callback
vfio/mdev: Do not allow a mdev_type to have a NULL parent pointer
media: v4l2-ctrls.c: fix race condition in hdl->requests list
nvme: retrigger ANA log update if group descriptor isn't found
nvmet-tcp: fix incorrect locking in state_change sk callback
nvme-tcp: block BH in sk state_change sk callback
ata: libahci_platform: fix IRQ check
sata_mv: add IRQ checks
pata_ixp4xx_cf: fix IRQ check
pata_arasan_cf: fix IRQ check
x86/kprobes: Fix to check non boostable prefixes correctly
drm/amdkfd: fix build error with AMD_IOMMU_V2=m
media: m88rs6000t: avoid potential out-of-bounds reads on arrays
media: platform: sunxi: sun6i-csi: fix error return code of sun6i_video_start_streaming()
media: aspeed: fix clock handling logic
media: omap4iss: return error code when omap4iss_get() failed
media: vivid: fix assignment of dev->fbuf_out_flags
soc: aspeed: fix a ternary sign expansion bug
xen-blkback: fix compatibility bug with single page rings
ttyprintk: Add TTY hangup callback.
usb: dwc2: Fix hibernation between host and device modes.
usb: dwc2: Fix host mode hibernation exit with remote wakeup flow.
Drivers: hv: vmbus: Increase wait time for VMbus unload
x86/platform/uv: Fix !KEXEC build failure
platform/x86: pmc_atom: Match all Beckhoff Automation baytrail boards with critclk_systems DMI table
usbip: vudc: fix missing unlock on error in usbip_sockfd_store()
node: fix device cleanups in error handling code
firmware: qcom-scm: Fix QCOM_SCM configuration
serial: core: return early on unsupported ioctls
tty: fix return value for unsupported ioctls
tty: actually undefine superseded ASYNC flags
USB: cdc-acm: fix TIOCGSERIAL implementation
USB: cdc-acm: fix unprivileged TIOCCSERIAL
usb: gadget: r8a66597: Add missing null check on return from platform_get_resource
spi: fsl-lpspi: Fix PM reference leak in lpspi_prepare_xfer_hardware()
cpufreq: armada-37xx: Fix determining base CPU frequency
cpufreq: armada-37xx: Fix driver cleanup when registration failed
clk: mvebu: armada-37xx-periph: Fix workaround for switching from L1 to L0
clk: mvebu: armada-37xx-periph: Fix switching CPU freq from 250 Mhz to 1 GHz
cpufreq: armada-37xx: Fix the AVS value for load L1
clk: mvebu: armada-37xx-periph: remove .set_parent method for CPU PM clock
cpufreq: armada-37xx: Fix setting TBG parent for load levels
crypto: qat - Fix a double free in adf_create_ring
ACPI: CPPC: Replace cppc_attr with kobj_attribute
soc: qcom: mdt_loader: Detect truncated read of segments
soc: qcom: mdt_loader: Validate that p_filesz < p_memsz
spi: Fix use-after-free with devm_spi_alloc_*
PM / devfreq: Use more accurate returned new_freq as resume_freq
staging: greybus: uart: fix unprivileged TIOCCSERIAL
staging: rtl8192u: Fix potential infinite loop
irqchip/gic-v3: Fix OF_BAD_ADDR error handling
mtd: rawnand: gpmi: Fix a double free in gpmi_nand_init
m68k: mvme147,mvme16x: Don't wipe PCC timer config bits
soundwire: stream: fix memory leak in stream config error path
memory: pl353: fix mask of ECC page_size config register
USB: gadget: udc: fix wrong pointer passed to IS_ERR() and PTR_ERR()
usb: gadget: aspeed: fix dma map failure
crypto: qat - fix error path in adf_isr_resource_alloc()
phy: marvell: ARMADA375_USBCLUSTER_PHY should not default to y, unconditionally
soundwire: bus: Fix device found flag correctly
bus: qcom: Put child node before return
mtd: require write permissions for locking and badblock ioctls
fotg210-udc: Complete OUT requests on short packets
fotg210-udc: Don't DMA more than the buffer can take
fotg210-udc: Mask GRP2 interrupts we don't handle
fotg210-udc: Remove a dubious condition leading to fotg210_done
fotg210-udc: Fix EP0 IN requests bigger than two packets
fotg210-udc: Fix DMA on EP0 for length > max packet size
crypto: qat - ADF_STATUS_PF_RUNNING should be set after adf_dev_init
crypto: qat - don't release uninitialized resources
usb: gadget: pch_udc: Check for DMA mapping error
usb: gadget: pch_udc: Check if driver is present before calling ->setup()
usb: gadget: pch_udc: Replace cpu_to_le32() by lower_32_bits()
x86/microcode: Check for offline CPUs before requesting new microcode
arm64: dts: renesas: r8a77980: Fix vin4-7 endpoint binding
spi: stm32: drop devres version of spi_register_master
arm64: dts: qcom: sm8150: fix number of pins in 'gpio-ranges'
mtd: rawnand: qcom: Return actual error code instead of -ENODEV
mtd: Handle possible -EPROBE_DEFER from parse_mtd_partitions()
mtd: rawnand: brcmnand: fix OOB R/W with Hamming ECC
mtd: rawnand: fsmc: Fix error code in fsmc_nand_probe()
regmap: set debugfs_name to NULL after it is freed
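Note: this entry, like many of the "Fix a double free" lines elsewhere in the list, reduces to one idiom: once a pointer has been kfree()d on an error or re-init path, clear it so a repeated cleanup cannot free it again. Sketch with placeholder names:

  #include <linux/slab.h>

  struct foo_priv {
          char *name;
  };

  static void foo_release_name(struct foo_priv *priv)
  {
          kfree(priv->name);
          priv->name = NULL;      /* a second cleanup pass now sees NULL and skips */
  }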
usb: typec: tcpci: Check ROLE_CONTROL while interpreting CC_STATUS
serial: stm32: fix tx_empty condition
serial: stm32: fix incorrect characters on console
ARM: dts: exynos: correct PMIC interrupt trigger level on Snow
ARM: dts: exynos: correct PMIC interrupt trigger level on SMDK5250
ARM: dts: exynos: correct PMIC interrupt trigger level on Odroid X/U3 family
ARM: dts: exynos: correct PMIC interrupt trigger level on Midas family
ARM: dts: exynos: correct MUIC interrupt trigger level on Midas family
ARM: dts: exynos: correct fuel gauge interrupt trigger level on Midas family
memory: gpmc: fix out of bounds read and dereference on gpmc_cs[]
usb: gadget: pch_udc: Revert d3cb25a12138 completely
ovl: fix missing revert_creds() on error path
Revert "i3c master: fix missing destroy_workqueue() on error in i3c_master_register"
KVM: Stop looking for coalesced MMIO zones if the bus is destroyed
KVM: nVMX: Truncate bits 63:32 of VMCS field on nested check in !64-bit
KVM: s390: split kvm_s390_real_to_abs
s390: fix detection of vector enhancements facility 1 vs. vector packed decimal facility
KVM: s390: fix guarded storage control register handling
KVM: s390: split kvm_s390_logical_to_effective
ALSA: hda/realtek: ALC285 Thinkpad jack pin quirk is unreachable
ALSA: hda/realtek: Remove redundant entry for ALC861 Haier/Uniwill devices
ALSA: hda/realtek: Re-order ALC662 quirk table entries
ALSA: hda/realtek: Re-order remaining ALC269 quirk table entries
ALSA: hda/realtek: Re-order ALC269 Lenovo quirk table entries
ALSA: hda/realtek: Re-order ALC269 Sony quirk table entries
ALSA: hda/realtek: Re-order ALC269 ASUS quirk table entries
ALSA: hda/realtek: Re-order ALC269 Dell quirk table entries
ALSA: hda/realtek: Re-order ALC269 Acer quirk table entries
ALSA: hda/realtek: Re-order ALC269 HP quirk table entries
ALSA: hda/realtek: Re-order ALC882 Clevo quirk table entries
ALSA: hda/realtek: Re-order ALC882 Sony quirk table entries
ALSA: hda/realtek: Re-order ALC882 Acer quirk table entries
drm/amd/display: Reject non-zero src_y and src_x for video planes
drm/radeon: fix copy of uninitialized variable back to userspace
drm/panfrost: Don't try to map pages that are already mapped
drm/panfrost: Clear MMU irqs before handling the fault
rtw88: Fix array overrun in rtw_get_tx_power_params()
cfg80211: scan: drop entry from hidden_list on overflow
ipw2x00: potential buffer overflow in libipw_wx_set_encodeext()
md: Fix missing unused status line of /proc/mdstat
md: md_open returns -EBUSY when entering racing area
md: factor out a mddev_find_locked helper from mddev_find
md: split mddev_find
md-cluster: fix use-after-free issue when removing rdev
md/bitmap: wait for external bitmap writes to complete during tear down
misc: vmw_vmci: explicitly initialize vmci_datagram payload
misc: vmw_vmci: explicitly initialize vmci_notify_bm_set_msg struct
misc: lis3lv02d: Fix false-positive WARN on various HP models
iio:accel:adis16201: Fix wrong axis assignment that prevents loading
PCI: Allow VPD access for QLogic ISP2722
FDDI: defxx: Bail out gracefully with unassigned PCI resource for CSR
MIPS: pci-rt2880: fix slot 0 configuration
MIPS: pci-mt7620: fix PLL lock check
ASoC: Intel: kbl_da7219_max98927: Fix kabylake_ssp_fixup function
ASoC: samsung: tm2_wm5110: check of of_parse return value
usb: xhci-mtk: improve bandwidth scheduling with TT
usb: xhci-mtk: remove or operator for setting schedule parameters
usb: typec: tcpm: update power supply once partner accepts
usb: typec: tcpm: Address incorrect values of tcpm psy for pps supply
usb: typec: tcpm: Address incorrect values of tcpm psy for fixed supply
staging: fwserial: fix TIOCSSERIAL permission check
tty: moxa: fix TIOCSSERIAL permission check
staging: fwserial: fix TIOCSSERIAL jiffies conversions
USB: serial: ti_usb_3410_5052: fix TIOCSSERIAL permission check
staging: greybus: uart: fix TIOCSSERIAL jiffies conversions
USB: serial: usb_wwan: fix TIOCSSERIAL jiffies conversions
tty: amiserial: fix TIOCSSERIAL permission check
tty: moxa: fix TIOCSSERIAL jiffies conversions
Revert "USB: cdc-acm: fix rounding error in TIOCSSERIAL"
net/nfc: fix use-after-free llcp_sock_bind/connect
bluetooth: eliminate the potential race condition when removing the HCI controller
hsr: use netdev_err() instead of WARN_ONCE()
Bluetooth: verify AMP hci_chan before amp_destroy
ANDROID: GKI: restore a part of "struct mmc_host"
Revert "mmc: block: Issue a cache flush only when it's enabled"
Linux 5.4.118
dm rq: fix double free of blk_mq_tag_set in dev remove after table load fails
dm integrity: fix missing goto in bitmap_flush_interval error handling
dm space map common: fix division bug in sm_ll_find_free_block()
dm persistent data: packed struct should have an aligned() attribute too
tracing: Restructure trace_clock_global() to never block
tracing: Map all PIDs to command lines
rsi: Use resume_noirq for SDIO
tty: fix memory leak in vc_deallocate
usb: dwc2: Fix session request interrupt handler
usb: dwc3: gadget: Fix START_TRANSFER link state check
usb: gadget/function/f_fs string table fix for multiple languages
usb: gadget: Fix double free of device descriptor pointers
usb: gadget: dummy_hcd: fix gpf in gadget_setup
media: staging/intel-ipu3: Fix race condition during set_fmt
media: staging/intel-ipu3: Fix set_fmt error handling
media: staging/intel-ipu3: Fix memory leak in imu_fmt
media: dvb-usb: Fix memory leak at error in dvb_usb_device_init()
media: dvb-usb: Fix use-after-free access
media: dvbdev: Fix memory leak in dvb_media_device_free()
ext4: fix error code in ext4_commit_super
ext4: do not set SB_ACTIVE in ext4_orphan_cleanup()
ext4: fix check to prevent false positive report of incorrect used inodes
kbuild: update config_data.gz only when the content of .config is changed
x86/cpu: Initialize MSR_TSC_AUX if RDTSCP *or* RDPID is supported
Revert 337f13046ff0 ("futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op")
jffs2: check the validity of dstlen in jffs2_zlib_compress()
Fix misc new gcc warnings
security: commoncap: fix -Wstringop-overread warning
fuse: fix write deadlock
dm raid: fix inconclusive reshape layout on fast raid4/5/6 table reload sequences
md/raid1: properly indicate failure when ending a failed write request
crypto: rng - fix crypto_rng_reset() refcounting when !CRYPTO_STATS
tpm: vtpm_proxy: Avoid reading host log when using a virtual device
tpm: efi: Use local variable for calculating final log size
intel_th: pci: Add Alder Lake-M support
powerpc: fix EDEADLOCK redefinition error in uapi/asm/errno.h
powerpc/eeh: Fix EEH handling for hugepages in ioremap space.
jffs2: Fix kasan slab-out-of-bounds problem
Input: ili210x - add missing negation for touch indication on ili210x
NFSv4: Don't discard segments marked for return in _pnfs_return_layout()
NFS: Don't discard pNFS layout segments that are marked for return
ACPI: GTDT: Don't corrupt interrupt mappings on watchdog probe failure
openvswitch: fix stack OOB read while fragmenting IPv4 packets
mlxsw: spectrum_mr: Update egress RIF list before route's action
f2fs: fix to avoid out-of-bounds memory access
ubifs: Only check replay with inode type to judge if inode linked
virtiofs: fix memory leak in virtio_fs_probe()
Makefile: Move -Wno-unused-but-set-variable out of GCC only block
arm64/vdso: Discard .note.gnu.property sections in vDSO
btrfs: fix race when picking most recent mod log operation for an old root
ALSA: hda/realtek: Add quirk for Intel Clevo PCx0Dx
ALSA: hda/realtek: fix static noise on ALC285 Lenovo laptops
ALSA: hda/realtek: fix mic boost on Intel NUC 8
ALSA: hda/realtek: GA503 use same quirks as GA401
ALSA: usb-audio: Add dB range mapping for Sennheiser Communications Headset PC 8
ALSA: usb-audio: More constifications
ALSA: usb-audio: Explicitly set up the clock selector
ALSA: sb: Fix two use after free in snd_sb_qsound_build
ALSA: hda/conexant: Re-order CX5066 quirk table entries
ALSA: emu8000: Fix a use after free in snd_emu8000_create_mixer
s390/archrandom: add parameter check for s390_arch_random_generate
scsi: libfc: Fix a format specifier
mfd: arizona: Fix runtime PM imbalance on error
scsi: lpfc: Remove unsupported mbox PORT_CAPABILITIES logic
scsi: lpfc: Fix error handling for mailboxes completed in MBX_POLL mode
scsi: lpfc: Fix crash when a REG_RPI mailbox fails triggering a LOGO response
drm/amdgpu: fix NULL pointer dereference
amdgpu: avoid incorrect %hu format string
drm/amdkfd: Fix cat debugfs hang_hws file causes system crash bug
drm/msm/mdp5: Do not multiply vclk line count by 100
drm/msm/mdp5: Configure PP_SYNC_HEIGHT to double the vtotal
sched/fair: Ignore percpu threads for imbalance pulls
media: gscpa/stv06xx: fix memory leak
media: dvb-usb: fix memory leak in dvb_usb_adapter_init
media: platform: sti: Fix runtime PM imbalance in regs_show
media: i2c: adv7842: fix possible use-after-free in adv7842_remove()
media: i2c: tda1997: Fix possible use-after-free in tda1997x_remove()
media: i2c: adv7511-v4l2: fix possible use-after-free in adv7511_remove()
media: adv7604: fix possible use-after-free in adv76xx_remove()
media: tc358743: fix possible use-after-free in tc358743_remove()
power: supply: s3c_adc_battery: fix possible use-after-free in s3c_adc_bat_remove()
power: supply: generic-adc-battery: fix possible use-after-free in gab_remove()
clk: socfpga: arria10: Fix memory leak of socfpga_clk on error return
media: vivid: update EDID
media: em28xx: fix memory leak
scsi: scsi_dh_alua: Remove check for ASC 24h in alua_rtpg()
scsi: smartpqi: Add new PCI IDs
scsi: smartpqi: Correct request leakage during reset operations
ata: ahci: Disable SXS for Hisilicon Kunpeng920
mmc: sdhci-pci: Add PCI IDs for Intel LKF
scsi: qla2xxx: Fix use after free in bsg
drm/vkms: fix misuse of WARN_ON
scsi: qla2xxx: Always check the return value of qla24xx_get_isp_stats()
drm/amd/display: fix dml prefetch validation
drm/amd/display: Fix UBSAN warning for not a valid value for type '_Bool'
drm/amdgpu: Fix asic reset regression issue introduced by 8f211fe8ac7c4f
drm/amdkfd: Fix UBSAN shift-out-of-bounds warning
drm/amdgpu: mask the xgmi number of hops reported from psp to kfd
power: supply: Use IRQF_ONESHOT
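Note: "power: supply: Use IRQF_ONESHOT" (and the later gpiolib/acpi entry) reflect a rule enforced by the IRQ core: a threaded interrupt requested with no primary handler must pass IRQF_ONESHOT, otherwise request_threaded_irq() rejects it with -EINVAL. Hedged sketch:

  #include <linux/device.h>
  #include <linux/interrupt.h>

  static irqreturn_t foo_irq_thread(int irq, void *data)
  {
          /* slow handling runs in thread context */
          return IRQ_HANDLED;
  }

  static int foo_request(struct device *dev, int irq, void *data)
  {
          return devm_request_threaded_irq(dev, irq, NULL, foo_irq_thread,
                                           IRQF_ONESHOT,  /* mandatory with a NULL primary handler */
                                           "foo-evt", data);
  }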
media: gspca/sq905.c: fix uninitialized variable
media: media/saa7164: fix saa7164_encoder_register() memory leak bugs
extcon: arizona: Fix various races on driver unbind
extcon: arizona: Fix some issues when HPDET IRQ fires after the jack has been unplugged
power: supply: bq27xxx: fix power_avg for newer ICs
media: imx: capture: Return -EPIPE from __capture_legacy_try_fmt()
media: drivers: media: pci: sta2x11: fix Kconfig dependency on GPIOLIB
media: ite-cir: check for receive overflow
scsi: target: pscsi: Fix warning in pscsi_complete_cmd()
scsi: lpfc: Fix pt2pt connection does not recover after LOGO
scsi: lpfc: Fix incorrect dbde assignment when building target abts wqe
drm/amd/display: Don't optimize bandwidth before disabling planes
drm/amd/display: Check for DSC support instead of ASIC revision
drm/qxl: release shadow on shutdown
drm: Added orientation quirk for OneGX1 Pro
btrfs: convert logic BUG_ON()'s in replace_path to ASSERT()'s
platform/x86: intel_pmc_core: Don't use global pmcdev in quirks
crypto: omap-aes - Fix PM reference leak on omap-aes.c
crypto: stm32/cryp - Fix PM reference leak on stm32-cryp.c
crypto: stm32/hash - Fix PM reference leak on stm32-hash.c
phy: phy-twl4030-usb: Fix possible use-after-free in twl4030_usb_remove()
intel_th: Consistency and off-by-one fix
tty: n_gsm: check error while registering tty devices
usb: core: hub: Fix PM reference leak in usb_port_resume()
usb: musb: fix PM reference leak in musb_irq_work()
spi: qup: fix PM reference leak in spi_qup_remove()
spi: omap-100k: Fix reference leak to master
spi: dln2: Fix reference leak to master
xhci: fix potential array out of bounds with several interrupters
xhci: check control context is valid before dereferencing it.
usb: xhci-mtk: support quirk to disable usb2 lpm
perf/arm_pmu_platform: Fix error handling
tee: optee: do not check memref size on return from Secure World
x86/build: Propagate $(CLANG_FLAGS) to $(REALMODE_FLAGS)
PCI: PM: Do not read power state in pci_enable_device_flags()
usb: xhci: Fix port minor revision
usb: dwc3: gadget: Ignore EP queue requests during bus reset
usb: gadget: f_uac1: validate input parameters
usb: gadget: f_uac2: validate input parameters
genirq/matrix: Prevent allocation counter corruption
usb: webcam: Invalid size of Processing Unit Descriptor
usb: gadget: uvc: add bInterval checking for HS mode
crypto: qat - fix unmap invalid dma address
crypto: api - check for ERR pointers in crypto_destroy_tfm()
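Note: for the crypto_destroy_tfm() entry, the underlying point is that crypto allocation helpers return ERR_PTR values rather than NULL, so teardown code has to tolerate both. A generic defensive pattern (not the exact upstream change):

  #include <linux/err.h>
  #include <linux/slab.h>

  struct foo_ctx {
          void *key;
  };

  static void foo_free_ctx(struct foo_ctx *ctx)
  {
          if (IS_ERR_OR_NULL(ctx))
                  return;         /* never allocated, or an ERR_PTR from a failed alloc */

          kfree(ctx->key);
          kfree(ctx);
  }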
spi: ath79: remove spi-master setup and cleanup assignment
spi: ath79: always call chipselect function
staging: wimax/i2400m: fix byte-order issue
bus: ti-sysc: Probe for l4_wkup and l4_cfg interconnect devices first
fbdev: zero-fill colormap in fbcmap.c
posix-timers: Preserve return value in clock_adjtime32()
intel_th: pci: Add Rocket Lake CPU support
btrfs: fix metadata extent leak after failure to create subvolume
cifs: Return correct error code from smb2_get_enc_key
irqchip/gic-v3: Do not enable irqs when handling spurious interrupts
modules: inherit TAINT_PROPRIETARY_MODULE
modules: return licensing information from find_symbol
modules: rename the licence field in struct symsearch to license
modules: unexport __module_address
modules: unexport __module_text_address
modules: mark each_symbol_section static
modules: mark find_symbol static
modules: mark ref_module static
mmc: core: Fix hanging on I/O during system suspend for removable cards
mmc: core: Set read only for SD cards with permanent write protect bit
mmc: core: Do a power cycle when the CMD11 fails
mmc: block: Issue a cache flush only when it's enabled
mmc: block: Update ext_csd.cache_ctrl if it was written
mmc: sdhci-pci: Fix initialization of some SD cards for Intel BYT-based controllers
mmc: sdhci: Check for reset prior to DMA address unmap
mmc: uniphier-sd: Fix a resource leak in the remove function
mmc: uniphier-sd: Fix an error handling path in uniphier_sd_probe()
scsi: mpt3sas: Block PCI config access from userspace during reset
scsi: qla2xxx: Fix crash in qla2xxx_mqueuecommand()
spi: spi-ti-qspi: Free DMA resources
erofs: add unsupported inode i_format check
mtd: rawnand: atmel: Update ecc_stats.corrected counter
mtd: spinand: core: add missing MODULE_DEVICE_TABLE()
ecryptfs: fix kernel panic with null dev_name
arm64: dts: mt8173: fix property typo of 'phys' in dsi node
arm64: dts: marvell: armada-37xx: add syscon compatible to NB clk node
ARM: 9056/1: decompressor: fix BSS size calculation for LLVM ld.lld
ftrace: Handle commands when closing set_ftrace_filter file
ACPI: custom_method: fix a possible memory leak
ACPI: custom_method: fix potential use-after-free issue
s390/disassembler: increase ebpf disasm buffer size
ANDROID: GKI: Update the .xml file after android11-5.4 merge
Linux 5.4.117
vfio: Depend on MMU
perf/core: Fix unconditional security_locked_down() call
ovl: allow upperdir inside lowerdir
scsi: ufs: Unlock on a couple error paths
platform/x86: thinkpad_acpi: Correct thermal sensor allocation
USB: Add reset-resume quirk for WD19's Realtek Hub
USB: Add LPM quirk for Lenovo ThinkPad USB-C Dock Gen2 Ethernet
ALSA: usb-audio: Add MIDI quirk for Vox ToneLab EX
perf ftrace: Fix access to pid in array when setting a pid filter
perf data: Fix error return code in perf_data__create_dir()
iwlwifi: Fix softirq/hardirq disabling in iwl_pcie_gen2_enqueue_hcmd()
avoid __memcat_p link failure
bpf: Fix leakage of uninitialized bpf stack under speculation
bpf: Fix masking negation logic upon negative dst register
iwlwifi: Fix softirq/hardirq disabling in iwl_pcie_enqueue_hcmd()
igb: Enable RSS for Intel I211 Ethernet Controller
net: usb: ax88179_178a: initialize local variables before use
ACPI: x86: Call acpi_boot_table_init() after acpi_table_upgrade()
ACPI: tables: x86: Reserve memory occupied by ACPI tables
mips: Do not include hi and lo in clobber list for R6
Linux 5.4.116
bpf: Update selftests to reflect new error states
bpf: Tighten speculative pointer arithmetic mask
bpf: Move sanitize_val_alu out of op switch
bpf: Refactor and streamline bounds check into helper
bpf: Improve verifier error messages for users
bpf: Rework ptr_limit into alu_limit and add common error path
bpf: Ensure off_reg has no mixed signed bounds for all types
bpf: Move off_reg into sanitize_ptr_alu
Linux 5.4.115
USB: CDC-ACM: fix poison/unpoison imbalance
net: hso: fix NULL-deref on disconnect regression
x86/crash: Fix crash_setup_memmap_entries() out-of-bounds access
ia64: tools: remove duplicate definition of ia64_mf() on ia64
ia64: fix discontig.c section mismatches
csky: change a Kconfig symbol name to fix e1000 build error
cavium/liquidio: Fix duplicate argument
xen-netback: Check for hotplug-status existence before watching
s390/entry: save the caller of psw_idle
net: geneve: check skb is large enough for IPv4/IPv6 header
ARM: dts: Fix swapped mmc order for omap3
HID: wacom: Assign boolean values to a bool variable
HID: alps: fix error return code in alps_input_configured()
HID: google: add don USB id
perf auxtrace: Fix potential NULL pointer dereference
perf/x86/kvm: Fix Broadwell Xeon stepping in isolation_ucodes[]
perf/x86/intel/uncore: Remove uncore extra PCI dev HSWEP_PCI_PCU_3
locking/qrwlock: Fix ordering in queued_write_lock_slowpath()
arm64: dts: allwinner: Revert SD card CD GPIO for Pine64-LTS
pinctrl: lewisburg: Update number of pins in community
gpio: omap: Save and restore sysconfig
s390/ptrace: return -ENOSYS when invalid syscall is supplied
ANDROID: clang: update to 12.0.5
Linux 5.4.114
net: phy: marvell: fix detection of PHY on Topaz switches
ARM: 9071/1: uprobes: Don't hook on thumb instructions
r8169: don't advertise pause in jumbo mode
r8169: tweak max read request size for newer chips also in jumbo mtu mode
r8169: improve rtl_jumbo_config
r8169: fix performance regression related to PCIe max read request size
r8169: simplify setting PCI_EXP_DEVCTL_NOSNOOP_EN
r8169: remove fiddling with the PCIe max read request size
arm64: dts: allwinner: Fix SD card CD GPIO for SOPine systems
ARM: footbridge: fix PCI interrupt mapping
gro: ensure frag0 meets IP header alignment
ibmvnic: remove duplicate napi_schedule call in open function
ibmvnic: remove duplicate napi_schedule call in do_reset function
ibmvnic: avoid calling napi_disable() twice
i40e: fix the panic when running bpf in xdpdrv mode
net: ip6_tunnel: Unregister catch-all devices
net: sit: Unregister catch-all devices
net: davicom: Fix regulator not turned off on failed probe
netfilter: nft_limit: avoid possible divide error in nft_limit_init
net: macb: fix the restore of cmp registers
netfilter: arp_tables: add pre_exit hook for table unregister
netfilter: bridge: add pre_exit hooks for ebtable unregistration
libnvdimm/region: Fix nvdimm_has_flush() to handle ND_REGION_ASYNC
netfilter: conntrack: do not print icmpv6 as unknown via /proc
scsi: libsas: Reset num_scatter if libata marks qc as NODATA
riscv: Fix spelling mistake "SPARSMEM" to "SPARSEMEM"
vfio/pci: Add missing range check in vfio_pci_mmap
arm64: alternatives: Move length validation in alternative_{insn, endif}
arm64: fix inline asm in load_unaligned_zeropad()
readdir: make sure to verify directory entry for legacy interfaces too
dm verity fec: fix misaligned RS roots IO
HID: wacom: set EV_KEY and EV_ABS only for non-HID_GENERIC type of devices
Input: i8042 - fix Pegatron C15B ID entry
Input: s6sy761 - fix coordinate read bit shift
virt_wifi: Return micros for BSS TSF values
mac80211: clear sta->fast_rx when STA removed from 4-addr VLAN
pcnet32: Use pci_resource_len to validate PCI resource
net: ieee802154: forbid monitor for add llsec seclevel
net: ieee802154: stop dump llsec seclevels for monitors
net: ieee802154: forbid monitor for del llsec devkey
net: ieee802154: forbid monitor for add llsec devkey
net: ieee802154: stop dump llsec devkeys for monitors
net: ieee802154: forbid monitor for del llsec dev
net: ieee802154: forbid monitor for add llsec dev
net: ieee802154: stop dump llsec devs for monitors
net: ieee802154: forbid monitor for del llsec key
net: ieee802154: forbid monitor for add llsec key
net: ieee802154: stop dump llsec keys for monitors
scsi: scsi_transport_srp: Don't block target in SRP_PORT_LOST state
ASoC: fsl_esai: Fix TDM slot setup for I2S mode
drm/msm: Fix a5xx/a6xx timestamps
ARM: omap1: fix building with clang IAS
ARM: keystone: fix integer overflow warning
neighbour: Disregard DEAD dst in neigh_update
ASoC: max98373: Added 30ms turn on/off time delay
arc: kernel: Return -EFAULT if copy_to_user() fails
lockdep: Add a missing initialization hint to the "INFO: Trying to register non-static key" message
ARM: dts: Fix moving mmc devices with aliases for omap4 & 5
ARM: dts: Drop duplicate sha2md5_fck to fix clk_disable race
dmaengine: dw: Make it dependent to HAS_IOMEM
gpio: sysfs: Obey valid_mask
Input: nspire-keypad - enable interrupts only when opened
net/sctp: fix race condition in sctp_destroy_sock
scsi: qla2xxx: Fix fabric scan hang
scsi: qla2xxx: Fix stuck login session using prli_pend_timer
scsi: qla2xxx: Add a shadow variable to hold disc_state history of fcport
scsi: qla2xxx: Retry PLOGI on FC-NVMe PRLI failure
scsi: qla2xxx: Fix device connect issues in P2P configuration
scsi: qla2xxx: Dual FCP-NVMe target port support
Revert "scsi: qla2xxx: Fix stuck login session using prli_pend_timer"
Revert "scsi: qla2xxx: Retry PLOGI on FC-NVMe PRLI failure"
Linux 5.4.113
xen/events: fix setting irq affinity
perf map: Tighten snprintf() string precision to pass gcc check on some 32-bit arches
perf tools: Use %zd for size_t printf formats on 32-bit
perf tools: Use %define api.pure full instead of %pure-parser
driver core: Fix locking bug in deferred_probe_timeout_work_func()
netfilter: x_tables: fix compat match/target pad out-of-bound write
block: don't ignore REQ_NOWAIT for direct IO
riscv,entry: fix misaligned base for excp_vect_table
idr test suite: Create anchor before launching throbber
idr test suite: Take RCU read lock in idr_find_test_1
radix tree test suite: Register the main thread with the RCU library
block: only update parent bi_status when bio fail
drm/tegra: dc: Don't set PLL clock to 0Hz
gfs2: report "already frozen/thawed" errors
drm/imx: imx-ldb: fix out of bounds array access warning
KVM: arm64: Disable guest access to trace filter controls
KVM: arm64: Hide system instruction access to Trace registers
interconnect: core: fix error return code of icc_link_destroy()
Revert "UPSTREAM: scsi: ufs: Avoid busy-waiting by eliminating tag conflicts"
Revert "UPSTREAM: scsi: ufs: Use blk_{get,put}_request() to allocate and free TMFs"
Revert "UPSTREAM: scsi: ufs: core: Fix task management request completion timeout"
Revert "UPSTREAM: scsi: ufs: core: Fix wrong Task Tag used in task management request UPIUs"
Revert "net: xfrm: Localize sequence counter per network namespace"
UPSTREAM: scsi: ufs: core: Fix wrong Task Tag used in task management request UPIUs
UPSTREAM: scsi: ufs: core: Fix task management request completion timeout
UPSTREAM: scsi: ufs: Use blk_{get,put}_request() to allocate and free TMFs
UPSTREAM: scsi: ufs: Avoid busy-waiting by eliminating tag conflicts
Linux 5.4.112
Revert "cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting cifs_sb->prepath."
net: ieee802154: stop dump llsec params for monitors
net: ieee802154: forbid monitor for del llsec seclevel
net: ieee802154: forbid monitor for set llsec params
net: ieee802154: fix nl802154 del llsec devkey
net: ieee802154: fix nl802154 add llsec key
net: ieee802154: fix nl802154 del llsec dev
net: ieee802154: fix nl802154 del llsec key
net: ieee802154: nl-mac: fix check on panid
net: mac802154: Fix general protection fault
drivers: net: fix memory leak in peak_usb_create_dev
drivers: net: fix memory leak in atusb_probe
net: tun: set tun->dev->addr_len during TUNSETLINK processing
cfg80211: remove WARN_ON() in cfg80211_sme_connect
net: sched: bump refcount for new action in ACT replace mode
dt-bindings: net: ethernet-controller: fix typo in NVMEM
clk: socfpga: fix iomem pointer cast on 64-bit
RAS/CEC: Correct ce_add_elem()'s returned values
RDMA/addr: Be strict with gid size
RDMA/cxgb4: check for ipv6 address properly while destroying listener
net/mlx5: Fix PBMC register mapping
net/mlx5: Fix placement of log_max_flow_counter
net: hns3: clear VF down state bit before request link status
openvswitch: fix send of uninitialized stack memory in ct limit reply
net: openvswitch: conntrack: simplify the return expression of ovs_ct_limit_get_default_limit()
perf inject: Fix repipe usage
s390/cpcmd: fix inline assembly register clobbering
workqueue: Move the position of debug_work_activate() in __queue_work()
clk: fix invalid usage of list cursor in unregister
clk: fix invalid usage of list cursor in register
net: macb: restore cmp registers on resume path
scsi: ufs: core: Fix wrong Task Tag used in task management request UPIUs
scsi: ufs: core: Fix task management request completion timeout
scsi: ufs: Use blk_{get,put}_request() to allocate and free TMFs
scsi: ufs: Avoid busy-waiting by eliminating tag conflicts
scsi: ufs: Fix irq return code
net: udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...);
drm/msm: Set drvdata to NULL when msm_drm_init() fails
i40e: Fix display statistics for veb_tc
soc/fsl: qbman: fix conflicting alignment attributes
net/rds: Fix a use after free in rds_message_map_pages
net/mlx5: Don't request more than supported EQs
net/mlx5e: Fix ethtool indication of connector type
ASoC: sunxi: sun4i-codec: fill ASoC card owner
net: phy: broadcom: Only advertise EEE for supported modes
nfp: flower: ignore duplicate merge hints from FW
net/ncsi: Avoid channel_monitor hrtimer deadlock
ARM: dts: imx6: pbab01: Set vmmc supply for both SD interfaces
net:tipc: Fix a double free in tipc_sk_mcast_rcv
cxgb4: avoid collecting SGE_QBASE regs during traffic
gianfar: Handle error code at MAC address change
can: bcm/raw: fix msg_namelen values depending on CAN_REQUIRED_SIZE
arm64: dts: imx8mm/q: Fix pad control of SD1_DATA0
sch_red: fix off-by-one checks in red_check_params()
amd-xgbe: Update DMA coherency values
hostfs: fix memory handling in follow_link()
hostfs: Use kasprintf() instead of fixed buffer formatting
i40e: Fix kernel oops when i40e driver removes VF's
i40e: Added Asym_Pause to supported link modes
xfrm: Fix NULL pointer dereference on policy lookup
ASoC: wm8960: Fix wrong bclk and lrclk with pll enabled for some chips
ASoC: SOF: Intel: HDA: fix core status verification
ASoC: SOF: Intel: hda: remove unnecessary parentheses
esp: delete NETIF_F_SCTP_CRC bit from features for esp offload
net: xfrm: Localize sequence counter per network namespace
regulator: bd9571mwv: Fix AVS and DVFS voltage range
xfrm: interface: fix ipv4 pmtu check to honor ip header df
net: dsa: lantiq_gswip: Configure all remaining GSWIP_MII_CFG bits
net: dsa: lantiq_gswip: Don't use PHY auto polling
virtio_net: Add XDP meta data support
i2c: turn recovery error on init to debug
usbip: synchronize event handler with sysfs code paths
usbip: vudc synchronize sysfs code paths
usbip: stub-dev synchronize sysfs code paths
usbip: add sysfs_lock to synchronize sysfs code paths
net: let skb_orphan_partial wake-up waiters.
net-ipv6: bugfix - raw & sctp - switch to ipv6_can_nonlocal_bind()
net: hsr: Reset MAC header for Tx path
mac80211: fix TXQ AC confusion
net: sched: sch_teql: fix null-pointer dereference
i40e: Fix sparse error: 'vsi->netdev' could be null
i40e: Fix sparse warning: missing error code 'err'
net: ensure mac header is set in virtio_net_hdr_to_skb()
bpf, sockmap: Fix sk->prot unhash op reset
ethernet/netronome/nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx
net: hso: fix null-ptr-deref during tty device unregistration
ice: Cleanup fltr list in case of allocation issues
ice: Fix for dereference of NULL pointer
ice: Increase control queue timeout
batman-adv: initialize "struct batadv_tvlv_tt_vlan_data"->reserved field
ARM: dts: turris-omnia: configure LED[2]/INTn pin as interrupt pin
parisc: avoid a warning on u8 cast for cmpxchg on u8 pointers
parisc: parisc-agp requires SBA IOMMU driver
fs: direct-io: fix missing sdio->boundary
ocfs2: fix deadlock between setattr and dio_end_io_write
nds32: flush_dcache_page: use page_mapping_file to avoid races with swapoff
ia64: fix user_stack_pointer() for ptrace()
gcov: re-fix clang-11+ support
drm/i915: Fix invalid access to ACPI _DSM objects
net: dsa: lantiq_gswip: Let GSWIP automatically set the xMII clock
net: ipv6: check for validity before dereferencing cfg->fc_nlinfo.nlh
xen/evtchn: Change irq_info lock to raw_spinlock_t
nfc: Avoid endless loops caused by repeated llcp_sock_connect()
nfc: fix memory leak in llcp_sock_connect()
nfc: fix refcount leak in llcp_sock_connect()
nfc: fix refcount leak in llcp_sock_bind()
ASoC: intel: atom: Stop advertising non working S24LE support
ALSA: hda/realtek: Fix speaker amp setup on Acer Aspire E1
ALSA: aloop: Fix initialization of controls
counter: stm32-timer-cnt: fix ceiling misalignment with reload register
Linux 5.4.111
init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM
init/Kconfig: make COMPILE_TEST depend on !S390
nvme-mpath: replace direct_make_request with generic_make_request
bpf, x86: Validate computation of branch displacements for x86-32
bpf, x86: Validate computation of branch displacements for x86-64
cifs: Silently ignore unknown oplock break handle
cifs: revalidate mapping when we open files for SMB1 POSIX
ia64: fix format strings for err_inject
ia64: mca: allocate early mca with GFP_ATOMIC
scsi: target: pscsi: Clean up after failure in pscsi_map_sg()
x86/build: Turn off -fcf-protection for realmode targets
platform/x86: thinkpad_acpi: Allow the FnLock LED to change state
netfilter: conntrack: Fix gre tunneling over ipv6
drm/msm: Ratelimit invalid-fence message
drm/msm/adreno: a5xx_power: Don't apply A540 lm_setup to other GPUs
mac80211: choose first enabled channel for monitor
mISDN: fix crash in fritzpci
net: pxa168_eth: Fix a potential data race in pxa168_eth_remove
net/mlx5e: Enforce minimum value check for ICOSQ size
bpf, x86: Use kvmalloc_array instead kmalloc_array in bpf_jit_comp
platform/x86: intel-hid: Support Lenovo ThinkPad X1 Tablet Gen 2
bus: ti-sysc: Fix warning on unbind if reset is not deasserted
ARM: dts: am33xx: add aliases for mmc interfaces
ANDROID: GKI: update .xml file
Revert "net: introduce CAN specific pointer in the struct net_device"
Linux 5.4.110
drivers: video: fbcon: fix NULL dereference in fbcon_cursor()
staging: rtl8192e: Change state information from u16 to u8
staging: rtl8192e: Fix incorrect source in memcpy()
usb: dwc2: Prevent core suspend when port connection flag is 0
usb: dwc2: Fix HPRT0.PrtSusp bit setting for HiKey 960 board.
usb: gadget: udc: amd5536udc_pci fix null-ptr-dereference
USB: cdc-acm: fix use-after-free after probe failure
USB: cdc-acm: fix double free on probe failure
USB: cdc-acm: downgrade message to debug
USB: cdc-acm: untangle a circular dependency between callback and softint
cdc-acm: fix BREAK rx code path adding necessary calls
usb: xhci-mtk: fix broken streams issue on 0.96 xHCI
usb: musb: Fix suspend with devices connected for a64
USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem
usbip: vhci_hcd fix shift out-of-bounds in vhci_hub_control()
firewire: nosy: Fix a use-after-free bug in nosy_ioctl()
extcon: Fix error handling in extcon_dev_register
extcon: Add stubs for extcon_register_notifier_all() functions
pinctrl: rockchip: fix restore error in resume
vfio/nvlink: Add missing SPAPR_TCE_IOMMU depends
reiserfs: update reiserfs_xattrs_initialized() condition
drm/amdgpu: check alignment on CPU page for bo map
drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings()
mm: fix race by making init_zero_pfn() early_initcall
tracing: Fix stack trace event size
PM: runtime: Fix ordering in pm_runtime_get_suppliers()
PM: runtime: Fix race getting/putting suppliers at probe
xtensa: move coprocessor_flush to the .text section
ALSA: hda/realtek: call alc_update_headset_mode() in hp_automute_hook
ALSA: hda/realtek: fix a determine_headset_type issue for a Dell AIO
ALSA: hda: Add missing sanity checks in PM prepare/complete callbacks
ALSA: hda: Re-add dropped snd_power_change_state() calls
ALSA: usb-audio: Apply sample rate quirk to Logitech Connect
bpf: Remove MTU check in __bpf_skb_max_len
net: wan/lmc: unregister device when no matching device is found
appletalk: Fix skb allocation size in loopback case
net: ethernet: aquantia: Handle error cleanup of start on open
ath10k: hold RCU lock when calling ieee80211_find_sta_by_ifaddr()
brcmfmac: clear EAP/association status bits on linkdown events
can: tcan4x5x: fix max register value
net: introduce CAN specific pointer in the struct net_device
can: dev: move driver related infrastructure into separate subdir
flow_dissector: fix TTL and TOS dissection on IPv4 fragments
net: mvpp2: fix interrupt mask/unmask skip condition
ext4: do not iput inode under running transaction in ext4_rename()
locking/ww_mutex: Simplify use_ww_ctx & ww_ctx handling
thermal/core: Add NULL pointer check before using cooling device stats
ASoC: rt5659: Update MCLK rate in set_sysclk()
staging: comedi: cb_pcidas64: fix request_irq() warn
staging: comedi: cb_pcidas: fix request_irq() warn
scsi: qla2xxx: Fix broken #endif placement
scsi: st: Fix a use after free in st_open()
vhost: Fix vhost_vq_reset()
powerpc: Force inlining of cpu_has_feature() to avoid build failure
NFSD: fix error handling in NFSv4.0 callbacks
ASoC: cs42l42: Always wait at least 3ms after reset
ASoC: cs42l42: Fix mixer volume control
ASoC: cs42l42: Fix channel width support
ASoC: cs42l42: Fix Bitclock polarity inversion
ASoC: es8316: Simplify adc_pga_gain_tlv table
ASoC: sgtl5000: set DAP_AVC_CTRL register to correct default value on probe
ASoC: rt5651: Fix dac- and adc- vol-tlv values being off by a factor of 10
ASoC: rt5640: Fix dac- and adc- vol-tlv values being off by a factor of 10
iomap: Fix negative assignment to unsigned sis->pages in iomap_swapfile_activate
rpc: fix NULL dereference on kmalloc failure
fs: nfsd: fix kconfig dependency warning for NFSD_V4
ext4: fix bh ref count on error paths
ext4: shrink race window in ext4_should_retry_alloc()
module: harden ELF info handling
module: avoid *goto*s in module_sig_check()
module: merge repetitive strings in module_sig_check()
modsign: print module name along with error message
ipv6: weaken the v4mapped source check
selinux: vsock: Set SID for socket returned by accept()
Revert "can: dev: Move device back to init netns on owning netns delete"
Linux 5.4.109
xen-blkback: don't leak persistent grants from xen_blkbk_map()
can: peak_usb: Revert "can: peak_usb: add forgotten supported devices"
ext4: add reclaim checks to xattr code
mac80211: fix double free in ibss_leave
net: qrtr: fix a kernel-infoleak in qrtr_recvmsg()
net: dsa: b53: VLAN filtering is global to all users
can: dev: Move device back to init netns on owning netns delete
x86/mem_encrypt: Correct physical address calculation in __set_clr_pte_enc()
locking/mutex: Fix non debug version of mutex_lock_io_nested()
scsi: mpt3sas: Fix error return code of mpt3sas_base_attach()
scsi: qedi: Fix error return code of qedi_alloc_global_queues()
scsi: Revert "qla2xxx: Make sure that aborted commands are freed"
block: recalculate segment count for multi-segment discards correctly
perf auxtrace: Fix auxtrace queue conflict
ACPI: scan: Use unique number for instance_no
ACPI: scan: Rearrange memory allocation in acpi_device_add()
Revert "netfilter: x_tables: Update remaining dereference to RCU"
netfilter: x_tables: Use correct memory barriers.
Revert "netfilter: x_tables: Switch synchronization to RCU"
bpf: Don't do bpf_cgroup_storage_set() for kuprobe/tp programs
RDMA/cxgb4: Fix adapter LE hash errors while destroying ipv6 listening server
PM: EM: postpone creating the debugfs dir till fs_initcall
net/mlx5e: Fix error path for ethtool set-priv-flag
PM: runtime: Defer suspending suppliers
arm64: kdump: update ppos when reading elfcorehdr
drm/msm: fix shutdown hook in case GPU components failed to bind
libbpf: Fix BTF dump of pointer-to-array-of-struct
selftests: forwarding: vxlan_bridge_1d: Fix vxlan ecn decapsulate value
net: stmmac: dwmac-sun8i: Provide TX and RX fifo sizes
r8152: limit the RX buffer size of RTL8153A for USB 2.0
net: cdc-phonet: fix data-interface release on probe failure
octeontx2-af: fix infinite loop in unmapping NPC counter
octeontx2-af: Fix irq free in rvu teardown
libbpf: Use SOCK_CLOEXEC when opening the netlink socket
nfp: flower: fix pre_tun mask id allocation
mac80211: fix rate mask reset
can: m_can: m_can_rx_peripheral(): fix RX being blocked by errors
can: m_can: m_can_do_rx_poll(): fix extraneous msg loss warning
can: c_can: move runtime PM enable/disable to c_can_platform
can: c_can_pci: c_can_pci_remove(): fix use-after-free
can: kvaser_pciefd: Always disable bus load reporting
can: flexcan: flexcan_chip_freeze(): fix chip freeze for missing bitrate
can: peak_usb: add forgotten supported devices
tcp: relookup sock for RST+ACK packets handled by obsolete req sock
netfilter: ctnetlink: fix dump of the expect mask attribute
selftests/bpf: Set gopt opt_class to 0 if get tunnel opt failed
ftgmac100: Restart MAC HW once
net/qlcnic: Fix a use after free in qlcnic_83xx_get_minidump_template
e1000e: Fix error handling in e1000_set_d0_lplu_state_82571
e1000e: add rtnl_lock() to e1000_reset_task
igc: Fix Supported Pause Frame Link Setting
igc: Fix Pause Frame Advertising
net: dsa: bcm_sf2: Qualify phydev->dev_flags based on port
net: sched: validate stab values
macvlan: macvlan_count_rx() needs to be aware of preemption
ipv6: fix suspicious RCU usage warning
net/mlx5e: Don't match on Geneve options in case option masks are all zero
libbpf: Fix INSTALL flag order
veth: Store queue_mapping independently of XDP prog presence
bus: omap_l3_noc: mark l3 irqs as IRQF_NO_THREAD
dm ioctl: fix out of bounds array access when no devices
dm verity: fix DM_VERITY_OPTS_MAX value
integrity: double check iint_cache was initialized
ARM: dts: at91-sama5d27_som1: fix phy address to 7
arm64: dts: ls1043a: mark crypto engine dma coherent
arm64: dts: ls1012a: mark crypto engine dma coherent
arm64: dts: ls1046a: mark crypto engine dma coherent
ACPI: video: Add missing callback back for Sony VPCEH3U1E
gcov: fix clang-11+ support
kasan: fix per-page tags for non-page_alloc pages
squashfs: fix xattr id and id lookup sanity checks
squashfs: fix inode lookup sanity checks
platform/x86: intel-vbtn: Stop reporting SW_DOCK events
netsec: restore phy power state after controller reset
ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign
ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls
block: Suppress uevent for hidden device when removed
nfs: we don't support removing system.nfs4_acl
nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a
nvme-fc: return NVME_SC_HOST_ABORTED_CMD when a command has been aborted
nvme: add NVME_REQ_CANCELLED flag in nvme_cancel_request()
drm/radeon: fix AGP dependency
drm/amdgpu: fb BO should be ttm_bo_type_device
drm/amd/display: Revert dram_clock_change_latency for DCN2.1
regulator: qcom-rpmh: Correct the pmic5_hfsmps515 buck
u64_stats,lockdep: Fix u64_stats_init() vs lockdep
habanalabs: Call put_pid() when releasing control device
sparc64: Fix opcode filtering in handling of no fault loads
irqchip/ingenic: Add support for the JZ4760
cifs: change noisy error message to FYI
atm: idt77252: fix null-ptr-dereference
atm: uPD98402: fix incorrect allocation
net: davicom: Use platform_get_irq_optional()
net: wan: fix error return code of uhdlc_init()
net: hisilicon: hns: fix error return code of hns_nic_clear_all_rx_fetch()
NFS: Correct size calculation for create reply length
nfs: fix PNFS_FLEXFILE_LAYOUT Kconfig default
gpiolib: acpi: Add missing IRQF_ONESHOT
cpufreq: blacklist Arm Vexpress platforms in cpufreq-dt-platdev
cifs: ask for more credit on async read/write code paths
gianfar: fix jumbo packets+napi+rx overrun crash
sun/niu: fix wrong RXMAC_BC_FRM_CNT_COUNT count
net: intel: iavf: fix error return code of iavf_init_get_resources()
net: tehuti: fix error return code in bdx_probe()
ixgbe: Fix memleak in ixgbe_configure_clsu32
ALSA: hda: ignore invalid NHLT table
Revert "r8152: adjust the settings about MAC clock speed down for RTL8153"
atm: lanai: don't run lanai_dev_close if not open
atm: eni: don't release what was never initialized
powerpc/4xx: Fix build errors from mfdcr()
net: fec: ptp: avoid register access when ipg clock is disabled
hugetlbfs: hugetlb_fault_mutex_hash() cleanup
ANDROID: refresh ABI XML to new version
ANDROID: refresh ABI XML
ANDROID: fix up ext4 build from 5.4.108
Linux 5.4.108
cifs: Fix preauth hash corruption
x86/apic/of: Fix CPU devicetree-node lookups
genirq: Disable interrupts for force threaded handlers
firmware/efi: Fix a use after bug in efi_mem_reserve_persistent
efi: use 32-bit alignment for efi_guid_t literals
ext4: fix potential error in ext4_do_update_inode
ext4: do not try to set xattr into ea_inode if value is empty
ext4: find old entry again if failed to rename whiteout
x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()
x86: Move TS_COMPAT back to asm/thread_info.h
kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()
x86/ioapic: Ignore IRQ2 again
perf/x86/intel: Fix a crash caused by zero PEBS status
PCI: rpadlpar: Fix potential drc_name corruption in store functions
counter: stm32-timer-cnt: fix ceiling write max value
iio: hid-sensor-temperature: Fix issues of timestamp channel
iio: hid-sensor-prox: Fix scale not correct issue
iio: hid-sensor-humidity: Fix alignment issue of timestamp channel
iio: adc: ad7949: fix wrong ADC result due to incorrect bit mask
iio: gyro: mpu3050: Fix error handling in mpu3050_trigger_handler
iio: adis16400: Fix an error code in adis16400_initial_setup()
iio:adc:qcom-spmi-vadc: add default scale to LR_MUX2_BAT_ID channel
iio:adc:stm32-adc: Add HAS_IOMEM dependency
usb: typec: tcpm: Invoke power_supply_changed for tcpm-source-psy-
usb: gadget: configfs: Fix KASAN use-after-free
USB: replace hardcoded maximum usb string length by definition
usbip: Fix incorrect double assignment to udc->ud.tcp_rx
usb-storage: Add quirk to defeat Kindle's automatic unload
nvme-rdma: fix possible hang when failing to set io queues
counter: stm32-timer-cnt: Report count function when SLAVE_MODE_DISABLED
scsi: myrs: Fix a double free in myrs_cleanup()
scsi: lpfc: Fix some error codes in debugfs
riscv: Correct SPARSEMEM configuration
kbuild: Fix <linux/version.h> for empty SUBLEVEL or PATCHLEVEL again
net/qrtr: fix __netdev_alloc_skb call
sunrpc: fix refcount leak for rpc auth modules
vfio: IOMMU_API should be selected
svcrdma: disable timeouts on rdma backchannel
NFSD: Repair misuse of sv_lock in 5.10.16-rt30.
nfsd: Don't keep looking up unhashed files in the nfsd file cache
nvmet: don't check iosqes,iocqes for discovery controllers
nvme-tcp: fix a NULL deref when receiving a 0-length r2t PDU
nvme-tcp: fix possible hang when failing to set io queues
nvme: fix Write Zeroes limitations
afs: Stop listxattr() from listing "afs.*" attributes
ASoC: simple-card-utils: Do not handle device clock
ASoC: SOF: intel: fix wrong poll bits in dsp power down
ASoC: SOF: Intel: unregister DMIC device on probe error
ASoC: fsl_ssi: Fix TDM slot setup for I2S mode
btrfs: fix slab cache flags for free space tree bitmap
btrfs: fix race when cloning extent buffer during rewind of an old root
ARM: 9044/1: vfp: use undef hook for VFP support detection
ARM: 9030/1: entry: omit FP emulation for UND exceptions taken in kernel mode
s390/vtime: fix increased steal time accounting
Revert "PM: runtime: Update device status before letting suppliers suspend"
ALSA: hda/realtek: Apply headset-mic quirks for Xiaomi Redmibook Air
ALSA: hda: generic: Fix the micmute led init state
ALSA: hda/realtek: apply pin quirk for XiaomiNotebook Pro
ALSA: dice: fix null pointer dereference when node is disconnected
ASoC: ak5558: Add MODULE_DEVICE_TABLE
ASoC: ak4458: Add MODULE_DEVICE_TABLE
Linux 5.4.107
net: dsa: b53: Support setting learning on port
net: dsa: tag_mtk: fix 802.1ad VLAN egress
crypto: x86/aes-ni-xts - use direct calls to and 4-way stride
crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: x86 - Regularize glue function prototypes
fuse: fix live lock in fuse_iget()
drm/i915/gvt: Fix vfio_edid issue for BXT/APL
drm/i915/gvt: Fix port number for BDW on EDID region setup
drm/i915/gvt: Fix virtual display setup for BXT/APL
drm/i915/gvt: Fix mmio handler break on BXT/APL.
drm/i915/gvt: Set SNOOP for PAT3 on BXT/APL to workaround GPU BB hang
btrfs: scrub: Don't check free space before marking a block group RO
bpf, selftests: Fix up some test_verifier cases for unprivileged
bpf: Add sanity check for upper ptr_limit
bpf: Simplify alu_limit masking for pointer arithmetic
bpf: Fix off-by-one for area size in creating mask to left
bpf: Prohibit alu ops for pointer types not defining ptr_limit
KVM: arm64: nvhe: Save the SPE context early
Linux 5.4.106
xen/events: avoid handling the same event on two cpus at the same time
xen/events: don't unmask an event channel when an eoi is pending
xen/events: reset affinity of 2-level event when tearing it down
KVM: arm64: Reject VM creation when the default IPA size is unsupported
KVM: arm64: Ensure I-cache isolation between vcpus of a same VM
nvme: release namespace head reference on error
nvme: unlink head after removing last namespace
KVM: arm64: Fix exclusive limit for IPA size
x86/unwind/orc: Disable KASAN checking in the ORC unwinder, part 2
binfmt_misc: fix possible deadlock in bm_register_write
powerpc/64s: Fix instruction encoding for lis in ppc_function_entry()
sched/membarrier: fix missing local execution of ipi_sync_rq_state()
zram: fix return value on writeback_store
include/linux/sched/mm.h: use rcu_dereference in in_vfork()
stop_machine: mark helpers __always_inline
hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
arm64: mm: use a 48-bit ID map when possible on 52-bit VA builds
configfs: fix a use-after-free in __configfs_open_file
block: rsxx: fix error return code of rsxx_pci_probe()
NFSv4.2: fix return value of _nfs4_get_security_label()
NFS: Don't gratuitously clear the inode cache when lookup failed
NFS: Don't revalidate the directory permissions on a lookup failure
SUNRPC: Set memalloc_nofs_save() for sync tasks
arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory
sh_eth: fix TRSCER mask for R7S72100
staging: comedi: pcl818: Fix endian problem for AI command data
staging: comedi: pcl711: Fix endian problem for AI command data
staging: comedi: me4000: Fix endian problem for AI command data
staging: comedi: dmm32at: Fix endian problem for AI command data
staging: comedi: das800: Fix endian problem for AI command data
staging: comedi: das6402: Fix endian problem for AI command data
staging: comedi: adv_pci1710: Fix endian problem for AI command data
staging: comedi: addi_apci_1500: Fix endian problem for command sample
staging: comedi: addi_apci_1032: Fix endian problem for COS sample
staging: rtl8192e: Fix possible buffer overflow in _rtl92e_wx_set_scan
staging: rtl8712: Fix possible buffer overflow in r8712_sitesurvey_cmd
staging: ks7010: prevent buffer overflow in ks_wlan_set_scan()
staging: rtl8188eu: fix potential memory corruption in rtw_check_beacon_data()
staging: rtl8712: unterminated string leads to read overflow
staging: rtl8188eu: prevent ->ssid overflow in rtw_wx_set_scan()
staging: rtl8192u: fix ->ssid overflow in r8192_wx_set_scan()
misc: fastrpc: restrict user apps from sending kernel RPC messages
misc/pvpanic: Export module FDT device table
usbip: fix vudc usbip_sockfd_store races leading to gpf
usbip: fix vhci_hcd attach_store() races leading to gpf
usbip: fix stub_dev usbip_sockfd_store() races leading to gpf
usbip: fix vudc to check for stream socket
usbip: fix vhci_hcd to check for stream socket
usbip: fix stub_dev to check for stream socket
USB: serial: cp210x: add some more GE USB IDs
USB: serial: cp210x: add ID for Acuity Brands nLight Air Adapter
USB: serial: ch341: add new Product ID
USB: serial: io_edgeport: fix memory leak in edge_startup
xhci: Fix repeated xhci wake after suspend due to uncleared internal wake state
usb: xhci: Fix ASMedia ASM1042A and ASM3242 DMA addressing
xhci: Improve detection of device initiated wake signal.
usb: xhci: do not perform Soft Retry for some xHCI hosts
usb: renesas_usbhs: Clear PIPECFG for re-enabling pipe with other EPNUM
USB: usblp: fix a hang in poll() if disconnected
usb: dwc3: qcom: Honor wakeup enabled/disabled state
usb: dwc3: qcom: Add missing DWC3 OF node refcount decrement
usb: gadget: f_uac1: stop playback on function disable
usb: gadget: f_uac2: always increase endpoint max_packet_size by one audio slot
USB: gadget: u_ether: Fix a configfs return code
Goodix Fingerprint device is not a modem
mmc: cqhci: Fix random crash when remove mmc module/card
mmc: core: Fix partition switch time for eMMC
software node: Fix node registration
s390/dasd: fix hanging IO request during DASD driver unbind
s390/dasd: fix hanging DASD driver unbind
arm64: kasan: fix page_alloc tagging with DEBUG_VIRTUAL
Revert 95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file capabilities")
ALSA: usb-audio: Apply the control quirk to Plantronics headsets
ALSA: usb-audio: Fix "cannot get freq eq" errors on Dell AE515 sound bar
ALSA: hda: Avoid spurious unsol event handling during S3/S4
ALSA: hda: Flush pending unsolicited events before suspend
ALSA: hda: Drop the BATCH workaround for AMD controllers
ALSA: hda/ca0132: Add Sound BlasterX AE-5 Plus support
ALSA: hda/hdmi: Cancel pending works before suspend
ALSA: usb: Add Plantronics C320-M USB ctrl msg delay quirk
scsi: target: core: Prevent underflow for service actions
scsi: target: core: Add cmd length set before cmd complete
scsi: libiscsi: Fix iscsi_prep_scsi_cmd_pdu() error handling
sysctl.c: fix underflow value setting risk in vm_table
s390/smp: __smp_rescan_cpus() - move cpumask away from stack
i40e: Fix memory leak in i40e_probe
PCI: Fix pci_register_io_range() memory leak
kbuild: clamp SUBLEVEL to 255
PCI: mediatek: Add missing of_node_put() to fix reference leak
PCI: xgene-msi: Fix race in installing chained irq handler
Input: applespi - don't wait for responses to commands indefinitely.
sparc64: Use arch_validate_flags() to validate ADI flag
sparc32: Limit memblock allocation to low memory
iommu/amd: Fix performance counter initialization
powerpc/64: Fix stack trace not displaying final frame
HID: logitech-dj: add support for the new lightspeed connection iteration
powerpc/perf: Record counter overflow always if SAMPLE_IP is unset
powerpc: improve handling of unrecoverable system reset
spi: stm32: make spurious and overrun interrupts visible
powerpc/pci: Add ppc_md.discover_phbs()
Platform: OLPC: Fix probe error handling
mmc: mediatek: fix race condition between msdc_request_timeout and irq
mmc: mxs-mmc: Fix a resource leak in an error handling path in 'mxs_mmc_probe()'
udf: fix silent AED tagLocation corruption
i2c: rcar: optimize cacheline to minimize HW race condition
i2c: rcar: faster irq code to minimize HW race condition
net: phy: fix save wrong speed and duplex problem if autoneg is on
net: enetc: initialize RFS/RSS memories for unused ports too
net: hns3: fix error mask definition of flow director
media: rc: compile rc-cec.c into rc-core
media: v4l: vsp1: Fix bru null pointer access
media: v4l: vsp1: Fix uif null pointer access
media: usbtv: Fix deadlock on suspend
sh_eth: fix TRSCER mask for R7S9210
qxl: Fix uninitialised struct field head.surface_id
s390/crypto: return -EFAULT if copy_to_user() fails
s390/cio: return -EFAULT if copy_to_user() fails
drm: meson_drv add shutdown function
drm/shmem-helper: Don't remove the offset in vm_area_struct pgoff
drm/shmem-helper: Check for purged buffers in fault handler
drm/compat: Clear bounce structures
bnxt_en: reliably allocate IRQ table on reset to avoid crash
s390/cio: return -EFAULT if copy_to_user() fails again
net: hns3: fix bug when calculating the TCAM table info
net: hns3: fix query vlan mask value error for flow director
perf traceevent: Ensure read cmdlines are null terminated.
selftests: forwarding: Fix race condition in mirror installation
net: stmmac: fix watchdog timeout during suspend/resume stress test
net: stmmac: stop each tx channel independently
ixgbe: fail to create xfrm offload of IPsec tunnel mode SA
net: qrtr: fix error return code of qrtr_sendmsg()
net: davicom: Fix regulator not turned off on driver removal
net: davicom: Fix regulator not turned off on failed probe
net: lapbether: Remove netif_start_queue / netif_stop_queue
cipso,calipso: resolve a number of problems with the DOI refcounts
netdevsim: init u64 stats for 32bit hardware
net: usb: qmi_wwan: allow qmimux add/del with master up
net: sched: avoid duplicates in classes dump
nexthop: Do not flush blackhole nexthops when loopback goes down
net: stmmac: fix incorrect DMA channel intr enable setting of EQoS v4.10
net/mlx4_en: update moderation when config reset
net: enetc: don't overwrite the RSS indirection table when initializing
Revert "mm, slub: consider rest of partial list if acquire_slab() fails"
cifs: return proper error code in statfs(2)
mount: fix mounting of detached mounts onto targets that reside on shared mounts
powerpc/603: Fix protection of user pages mapped with PROT_NONE
mt76: dma: do not report truncated frames to mac80211
ibmvnic: always store valid MAC address
samples, bpf: Add missing munmap in xdpsock
selftests/bpf: Mask bpf_csum_diff() return value to 16 bits in test_verifier
selftests/bpf: No need to drop the packet when there is no geneve opt
netfilter: x_tables: gpf inside xt_find_revision()
netfilter: nf_nat: undo erroneous tcp edemux lookup
tcp: add sanity tests to TCP_QUEUE_SEQ
can: tcan4x5x: tcan4x5x_init(): fix initialization - clear MRAM before entering Normal Mode
can: flexcan: invoke flexcan_chip_freeze() to enter freeze mode
can: flexcan: enable RX FIFO after FRZ/HALT valid
can: flexcan: assert FRZ bit in flexcan_chip_freeze()
can: skb: can_skb_set_owner(): fix ref counting if socket was closed before setting skb ownership
sh_eth: fix TRSCER mask for SH771x
net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0
net: check if protocol extracted by virtio_net_hdr_set_proto is correct
net: Fix gro aggregation for udp encaps with zero csum
ath9k: fix transmitting to stations in dynamic SMPS mode
ethernet: alx: fix order of calls on resume
powerpc/pseries: Don't enforce MSI affinity with kdump
uapi: nfnetlink_cthelper.h: fix userspace compilation error
Linux 5.4.105
nvme-pci: add quirks for Lexar 256GB SSD
nvme-pci: mark Seagate Nytro XM1440 as QUIRK_NO_NS_DESC_LIST.
HID: i2c-hid: Add I2C_HID_QUIRK_NO_IRQ_AFTER_RESET for ITE8568 EC on Voyo Winpad A15
mmc: sdhci-of-dwcmshc: set SDHCI_QUIRK2_PRESET_VALUE_BROKEN
drm/msm/a5xx: Remove overwriting A5XX_PC_DBG_ECO_CNTL register
misc: eeprom_93xx46: Add quirk to support Microchip 93LC46B eeprom
PCI: Add function 1 DMA alias quirk for Marvell 9215 SATA controller
ASoC: Intel: bytcr_rt5640: Add quirk for ARCHOS Cesium 140
ACPI: video: Add DMI quirk for GIGABYTE GB-BXBT-2807
media: cx23885: add more quirks for reset DMA on some AMD IOMMU
HID: mf: add support for 0079:1846 Mayflash/Dragonrise USB Gamecube Adapter
platform/x86: acer-wmi: Add ACER_CAP_KBD_DOCK quirk for the Aspire Switch 10E SW3-016
platform/x86: acer-wmi: Add support for SW_TABLET_MODE on Switch devices
platform/x86: acer-wmi: Add ACER_CAP_SET_FUNCTION_MODE capability flag
platform/x86: acer-wmi: Add new force_caps module parameter
platform/x86: acer-wmi: Cleanup accelerometer device handling
platform/x86: acer-wmi: Cleanup ACER_CAP_FOO defines
mwifiex: pcie: skip cancel_work_sync() on reset failure path
iommu/amd: Fix sleeping in atomic in increase_address_space()
ACPICA: Fix race in generic_serial_bus (I2C) and GPIO op_region parameter handling
dm table: fix zoned iterate_devices based device capability checks
dm table: fix DAX iterate_devices based device capability checks
dm table: fix iterate_devices based device capability checks
net: dsa: add GRO support via gro_cells
ANDROID: GKI: update .xml file due to new symbol additions.
Revert "crypto - shash: reduce minimum alignment of shash_desc structure"
Linux 5.4.104
r8169: fix resuming from suspend on RTL8105e if machine runs on battery
rsxx: Return -EFAULT if copy_to_user() fails
ftrace: Have recordmcount use w8 to read relp->r_info in arm64_is_fake_mcount
ALSA: hda: intel-nhlt: verify config type
IB/mlx5: Add missing error code
RDMA/rxe: Fix missing kconfig dependency on CRYPTO
ALSA: ctxfi: cthw20k2: fix mask on conf to allow 4 bits
usbip: tools: fix build error for multiple definition
crypto - shash: reduce minimum alignment of shash_desc structure
arm64: ptrace: Fix seccomp of traced syscall -1 (NO_SYSCALL)
drm/amdgpu: fix parameter error of RREG32_PCIE() in amdgpu_regs_pcie
dm verity: fix FEC for RS roots unaligned to block size
dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size
PM: runtime: Update device status before letting suppliers suspend
btrfs: fix warning when creating a directory with smack enabled
btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata
btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
btrfs: fix raid6 qstripe kmap
btrfs: raid56: simplify tracking of Q stripe presence
tpm, tpm_tis: Decorate tpm_get_timeouts() with request_locality()
tpm, tpm_tis: Decorate tpm_tis_gen_interrupt() with request_locality()
ANDROID: GKI: hack up fs/sysfs/file.c to prevent GENKSYMS change
Revert "sched/features: Fix hrtick reprogramming"
Linux 5.4.103
ALSA: hda/realtek: Apply dual codec quirks for MSI Godlike X570 board
ALSA: hda/realtek: Add quirk for Intel NUC 10
ALSA: hda/realtek: Add quirk for Clevo NH55RZQ
media: v4l: ioctl: Fix memory leak in video_usercopy
swap: fix swapfile read/write offset
zsmalloc: account the number of compacted pages correctly
xen-netback: respect gnttab_map_refs()'s return value
Xen/gnttab: handle p2m update errors on a per-slot basis
scsi: iscsi: Verify lengths on passthrough PDUs
scsi: iscsi: Ensure sysfs attributes are limited to PAGE_SIZE
sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
scsi: iscsi: Restrict sessions and handles to admin capabilities
ASoC: Intel: bytcr_rt5640: Add quirk for the Acer One S1002 tablet
ASoC: Intel: bytcr_rt5651: Add quirk for the Jumper EZpad 7 tablet
ASoC: Intel: bytcr_rt5640: Add quirk for the Voyo Winpad A15 tablet
ASoC: Intel: bytcr_rt5640: Add quirk for the Estar Beauty HD MID 7316R tablet
sched/features: Fix hrtick reprogramming
parisc: Bump 64-bit IRQ stack size to 64 KB
perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[]
btrfs: fix error handling in commit_fs_roots
ASoC: Intel: Add DMI quirk table to soc_intel_is_byt_cr()
nvme-tcp: add clean action for failed reconnection
nvme-rdma: add clean action for failed reconnection
nvme-core: add cancel tagset helpers
f2fs: fix to set/clear I_LINKABLE under i_lock
f2fs: handle unallocated section and zone on pinned/atgc
media: uvcvideo: Allow entities with no pads
drm/amd/display: Guard against NULL pointer deref when get_i2c_info fails
PCI: Add a REBAR size quirk for Sapphire RX 5600 XT Pulse
drm/amdgpu: Add check to prevent IH overflow
crypto: tcrypt - avoid signed overflow in byte count
drm/hisilicon: Fix use-after-free
brcmfmac: Add DMI nvram filename quirk for Voyo winpad A15 tablet
brcmfmac: Add DMI nvram filename quirk for Predia Basic tablet
staging: bcm2835-audio: Replace unsafe strcpy() with strscpy()
staging: most: sound: add sanity check for function argument
Bluetooth: Fix null pointer dereference in amp_read_loc_assoc_final_data
x86/build: Treat R_386_PLT32 relocation as R_386_PC32
ath10k: fix wmi mgmt tx queue full due to race condition
pktgen: fix misuse of BUG_ON() in pktgen_thread_worker()
Bluetooth: hci_h5: Set HCI_QUIRK_SIMULTANEOUS_DISCOVERY for btrtl
wlcore: Fix command execute failure 19 for wl12xx
vt/consolemap: do font sum unsigned
x86/reboot: Add Zotac ZBOX CI327 nano PCI reboot quirk
staging: fwserial: Fix error handling in fwserial_create
rsi: Move card interrupt handling to RX thread
rsi: Fix TX EAPOL packet handling against iwlwifi AP
drm/virtio: use kvmalloc for large allocations
MIPS: Drop 32-bit asm string functions
dt-bindings: net: btusb: DT fix s/interrupt-name/interrupt-names/
dt-bindings: ethernet-controller: fix fixed-link specification
net: fix dev_ifsioc_locked() race condition
net: ag71xx: remove unnecessary MTU reservation
net: bridge: use switchdev for port flags set through sysfs too
mm/hugetlb.c: fix unnecessary address expansion of pmd sharing
nbd: handle device refs for DESTROY_ON_DISCONNECT properly
net: fix up truesize of cloned skb in skb_prepare_for_shift()
smackfs: restrict bytes count in smackfs write functions
net/af_iucv: remove WARN_ONCE on malformed RX packets
xfs: Fix assert failure in xfs_setattr_size()
media: v4l2-ctrls.c: fix shift-out-of-bounds in std_validate
erofs: fix shift-out-of-bounds of blkszbits
media: mceusb: sanity check for prescaler value
udlfb: Fix memory leak in dlfb_usb_probe
JFS: more checks for invalid superblock
MIPS: VDSO: Use CLANG_FLAGS instead of filtering out '--target='
arm64 module: set plt* section addresses to 0x0
nvme-pci: fix error unwind in nvme_map_data
nvme-pci: refactor nvme_unmap_data
Input: elantech - fix protocol errors for some trackpoints in SMBus mode
net: usb: qmi_wwan: support ZTE P685M modem
ANDROID: GKI: update .xml file due to new symbol additions.
ANDROID: Adding kprobes build configs for Cuttlefish
ANDROID: GKI: bring back icmpv6_send
Linux 5.4.102
ARM: dts: aspeed: Add LCLK to lpc-snoop
net: qrtr: Fix memory leak in qrtr_tun_open
dm era: Update in-core bitset after committing the metadata
net: sched: fix police ext initialization
net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending
ipv6: silence compilation warning for non-IPV6 builds
ipv6: icmp6: avoid indirect call for icmpv6_send()
xfrm: interface: use icmp_ndo_send helper
sunvnet: use icmp_ndo_send helper
gtp: use icmp_ndo_send helper
icmp: allow icmpv6_ndo_send to work with CONFIG_IPV6=n
icmp: introduce helper for nat'd source address in network device context
drm/i915: Reject 446-480MHz HDMI clock on GLK
dm era: only resize metadata in preresume
dm era: Reinitialize bitset cache before digesting a new writeset
dm era: Use correct value size in equality function of writeset tree
dm era: Fix bitset memory leaks
dm era: Verify the data block size hasn't changed
dm era: Recover committed writeset after crash
dm writecache: fix writing beyond end of underlying device when shrinking
dm: fix deadlock when swapping to encrypted device
gfs2: Recursive gfs2_quota_hold in gfs2_iomap_end
gfs2: Don't skip dlm unlock if glock has an lvb
spi: spi-synquacer: fix set_cs handling
sparc32: fix a user-triggerable oops in clear_user()
f2fs: fix out-of-repair __setattr_copy()
um: mm: check more comprehensively for stub changes
virtio/s390: implement virtio-ccw revision 2 correctly
s390/vtime: fix inline assembly clobber list
cpufreq: intel_pstate: Get per-CPU max freq via MSR_HWP_CAPABILITIES if available
printk: fix deadlock when kernel panic
gpio: pcf857x: Fix missing first interrupt
spmi: spmi-pmic-arb: Fix hw_irq overflow
powerpc/32s: Add missing call to kuep_lock on syscall entry
mmc: sdhci-esdhc-imx: fix kernel panic when remove module
module: Ignore _GLOBAL_OFFSET_TABLE_ when warning for undefined symbols
media: smipcie: fix interrupt handling and IR timeout
arm64: Extend workaround for erratum 1024718 to all versions of Cortex-A55
hugetlb: fix copy_huge_page_from_user contig page struct assumption
hugetlb: fix update_and_free_page contig page struct assumption
x86: fix seq_file iteration for pat/memtype.c
seq_file: document how per-entry resources are managed.
fs/affs: release old buffer head on error path
mtd: spi-nor: hisi-sfc: Put child node np on error path
mtd: spi-nor: core: Add erase size check for erase command initialization
mtd: spi-nor: core: Fix erase type discovery for overlaid region
mtd: spi-nor: sfdp: Fix wrong erase type bitmask for overlaid region
mtd: spi-nor: sfdp: Fix last erase region marking
watchdog: mei_wdt: request stop on unregister
watchdog: qcom: Remove incorrect usage of QCOM_WDT_ENABLE_IRQ
arm64: uprobe: Return EOPNOTSUPP for AARCH32 instruction probing
arm64: kexec_file: fix memory leakage in create_dtb() when fdt_open_into() fails
floppy: reintroduce O_NDELAY fix
rcu/nocb: Perform deferred wake up before last idle's need_resched() check
rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
powerpc/prom: Fix "ibm,arch-vec-5-platform-support" scan
x86/reboot: Force all cpus to exit VMX root if VMX is supported
x86/virt: Eat faults on VMXOFF in reboot flows
media: ipu3-cio2: Fix mbus_code processing in cio2_subdev_set_fmt()
staging: rtl8188eu: Add Edimax EW-7811UN V2 to device table
staging: gdm724x: Fix DMA from stack
staging/mt7621-dma: mtk-hsdma.c->hsdma-mt7621.c
dts64: mt7622: fix slow sd card access
pstore: Fix typo in compression option name
drivers/misc/vmw_vmci: restrict too big queue size in qp_host_alloc_queue
misc: rtsx: init of rts522a add OCP power off when no card is present
seccomp: Add missing return in non-void function
crypto: sun4i-ss - initialize need_fallback
crypto: sun4i-ss - handle BigEndian for cipher
crypto: sun4i-ss - checking sg length is not sufficient
crypto: aesni - prevent misaligned buffers on the stack
crypto: arm64/sha - add missing module aliases
btrfs: fix extent buffer leak on failure to copy root
btrfs: splice remaining dirty_bg's onto the transaction dirty bg list
btrfs: fix reloc root leak with 0 ref reloc roots on recovery
btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root
KEYS: trusted: Fix migratable=1 failing
tpm_tis: Clean up locality release
tpm_tis: Fix check_locality for correct locality acquisition
erofs: initialized fields can only be observed after bit is set
drm/sched: Cancel and flush all outstanding jobs before finish.
drm/nouveau/kms: handle mDP connectors
drm/amdgpu: Set reference clock to 100Mhz on Renoir (v2)
drm/amd/display: Add vupdate_no_lock interrupts for DCN2.1
bcache: Move journal work to new flush wq
bcache: Give btree_io_wq correct semantics again
Revert "bcache: Kill btree_io_wq"
ALSA: hda/realtek: modify EAPD in the ALC886
ALSA: hda: Add another CometLake-H PCI ID
USB: serial: mos7720: fix error code in mos7720_write()
USB: serial: mos7840: fix error code in mos7840_write()
USB: serial: ftdi_sio: fix FTX sub-integer prescaler
usb: dwc3: gadget: Fix dep->interval for fullspeed interrupt
usb: dwc3: gadget: Fix setting of DEPCFG.bInterval_m1
usb: musb: Fix runtime PM race in musb_queue_resume_work
USB: serial: option: update interface mapping for ZTE P685M
media: mceusb: Fix potential out-of-bounds shift
Input: i8042 - add ASUS Zenbook Flip to noselftest list
Input: joydev - prevent potential read overflow in ioctl
Input: xpad - add support for PowerA Enhanced Wired Controller for Xbox Series X|S
Input: raydium_ts_i2c - do not send zero length
HID: wacom: Ignore attempts to overwrite the touch_max value from HID
HID: logitech-dj: add support for keyboard events in eQUAD step 4 Gaming
ACPI: configfs: add missing check after configfs_register_default_group()
ACPI: property: Fix fwnode string properties matching
blk-settings: align max_sectors on "logical_block_size" boundary
scsi: bnx2fc: Fix Kconfig warning & CNIC build errors
mm/rmap: fix potential pte_unmap on an not mapped pte
i2c: brcmstb: Fix brcmstd_send_i2c_cmd condition
arm64: Add missing ISB after invalidating TLB in __primary_switch
r8169: fix jumbo packet handling on RTL8168e
mm/compaction: fix misbehaviors of fast_find_migrateblock()
mm/hugetlb: fix potential double free in hugetlb_register_node() error path
mm/memory.c: fix potential pte_unmap_unlock pte error
ocfs2: fix a use after free on error
vxlan: move debug check after netdev unregister
net/mlx4_core: Add missed mlx4_free_cmd_mailbox()
vfio/type1: Use follow_pte()
i40e: Fix add TC filter for IPv6
i40e: Fix VFs not created
i40e: Fix addition of RX filters after enabling FW LLDP agent
i40e: Fix overwriting flow control settings during driver loading
i40e: Add zero-initialization of AQ command structures
i40e: Fix flow for IPv6 next header (extension header)
regmap: sdw: use _no_pm functions in regmap_read/write
nvmem: core: skip child nodes not matching binding
nvmem: core: Fix a resource leak on error in nvmem_add_cells_from_of()
ext4: fix potential htree index checksum corruption
vfio/iommu_type1: Fix some sanity checks in detach group
drm/msm/mdp5: Fix wait-for-commit for cmd panels
drm/msm/dsi: Correct io_start for MSM8994 (20nm PHY)
mei: hbm: call mei_set_devstate() on hbm stop response
PCI: Align checking of syscall user config accessors
VMCI: Use set_page_dirty_lock() when unregistering guest memory
pwm: rockchip: rockchip_pwm_probe(): Remove superfluous clk_unprepare()
soundwire: cadence: fix ACK/NAK handling
misc: eeprom_93xx46: Add module alias to avoid breaking support for non device tree users
phy: rockchip-emmc: emmc_phy_init() always return 0
misc: eeprom_93xx46: Fix module alias to enable module autoprobe
sparc64: only select COMPAT_BINFMT_ELF if BINFMT_ELF is set
Input: elo - fix an error code in elo_connect()
perf test: Fix unaligned access in sample parsing test
perf intel-pt: Fix premature IPC
perf intel-pt: Fix missing CYC processing in PSB
Input: sur40 - fix an error code in sur40_probe()
RDMA/hns: Fixes missing error code of CMDQ
nfsd: register pernet ops last, unregister first
clk: aspeed: Fix APLL calculate formula from ast2600-A2
regulator: qcom-rpmh: fix pm8009 ldo7
spi: pxa2xx: Fix the controller numbering for Wildcat Point
RDMA/hns: Fix type of sq_signal_bits
RDMA/siw: Fix calculation of tx_valid_cpus size
RDMA/hns: Fixed wrong judgments in the goto branch
clk: qcom: gcc-msm8998: Fix Alpha PLL type for all GPLLs
powerpc/8xx: Fix software emulation interrupt
powerpc/pseries/dlpar: handle ibm,configure-connector delay status
mfd: wm831x-auxadc: Prevent use after free in wm831x_auxadc_read_irq()
spi: stm32: properly handle 0 byte transfer
RDMA/rxe: Correct skb on loopback path
RDMA/rxe: Fix coding error in rxe_rcv_mcast_pkt
RDMA/rxe: Fix coding error in rxe_recv.c
perf vendor events arm64: Fix Ampere eMag event typo
perf tools: Fix DSO filtering when not finding a map for a sampled address
tracepoint: Do not fail unregistering a probe due to memory failure
IB/cm: Avoid a loop when device has 255 ports
IB/mlx5: Return appropriate error code instead of ENOMEM
amba: Fix resource leak for drivers without .remove
i2c: qcom-geni: Store DMA mapping data in geni_i2c_dev struct
ARM: 9046/1: decompressor: Do not clear SCTLR.nTLSMD for ARMv7+ cores
mmc: renesas_sdhi_internal_dmac: Fix DMA buffer alignment from 8 to 128-bytes
mmc: usdhi6rol0: Fix a resource leak in the error handling path of the probe
mmc: sdhci-sprd: Fix some resource leaks in the remove function
powerpc/47x: Disable 256k page size
KVM: PPC: Make the VMX instruction emulation routines static
IB/umad: Return EPOLLERR in case of when device disassociated
IB/umad: Return EIO in case of when device disassociated
objtool: Fix ".cold" section suffix check for newer versions of GCC
objtool: Fix error handling for STD/CLD warnings
auxdisplay: ht16k33: Fix refresh rate handling
isofs: release buffer head before return
regulator: core: Avoid debugfs: Directory ... already present! error
regulator: s5m8767: Drop regulators OF node reference
spi: atmel: Put allocated master before return
regulator: s5m8767: Fix reference count leak
certs: Fix blacklist flag type confusion
regulator: axp20x: Fix reference cout leak
clk: sunxi-ng: h6: Fix clock divider range on some clocks
RDMA/mlx5: Use the correct obj_id upon DEVX TIR creation
clocksource/drivers/mxs_timer: Add missing semicolon when DEBUG is defined
clocksource/drivers/ixp4xx: Select TIMER_OF when needed
rtc: s5m: select REGMAP_I2C
power: reset: at91-sama5d2_shdwc: fix wkupdbc mask
of/fdt: Make sure no-map does not remove already reserved regions
fdt: Properly handle "no-map" field in the memory region
mfd: bd9571mwv: Use devm_mfd_add_devices()
dmaengine: hsu: disable spurious interrupt
dmaengine: owl-dma: Fix a resource leak in the remove function
dmaengine: fsldma: Fix a resource leak in an error handling path of the probe function
dmaengine: fsldma: Fix a resource leak in the remove function
RDMA/siw: Fix handling of zero-sized Read and Receive Queues.
HID: core: detect and skip invalid inputs to snto32()
clk: sunxi-ng: h6: Fix CEC clock
spi: cadence-quadspi: Abort read if dummy cycles required are too many
i2c: iproc: handle master read request
i2c: iproc: update slave isr mask (ISR_MASK_SLAVE)
i2c: iproc: handle only slave interrupts which are enabled
quota: Fix memory leak when handling corrupted quota file
selftests/powerpc: Make the test check in eeh-basic.sh posix compliant
clk: meson: clk-pll: propagate the error from meson_clk_pll_set_rate()
clk: meson: clk-pll: make "ret" a signed integer
clk: meson: clk-pll: fix initializing the old rate (fallback) for a PLL
HSI: Fix PM usage counter unbalance in ssi_hw_init
capabilities: Don't allow writing ambiguous v3 file capabilities
ubifs: Fix error return code in alloc_wbufs()
ubifs: Fix memleak in ubifs_init_authentication
jffs2: fix use after free in jffs2_sum_write_data()
fs/jfs: fix potential integer overflow on shift of a int
ASoC: simple-card-utils: Fix device module clock
ima: Free IMA measurement buffer after kexec syscall
ima: Free IMA measurement buffer on error
crypto: ecdh_helper - Ensure 'len >= secret.len' in decode_key()
hwrng: timeriomem - Fix cooldown period calculation
btrfs: clarify error returns values in __load_free_space_cache
ASoC: SOF: debug: Fix a potential issue on string buffer termination
Drivers: hv: vmbus: Avoid use-after-free in vmbus_onoffer_rescind()
f2fs: fix a wrong condition in __submit_bio
drm/amdgpu: Prevent shift wrapping in amdgpu_read_mask()
f2fs: fix to avoid inconsistent quota data
mtd: parsers: afs: Fix freeing the part name memory in failure
ASoC: cpcap: fix microphone timeslot mask
ata: ahci_brcm: Add back regulators management
drm/nouveau: bail out of nouveau_channel_new if channel init fails
crypto: talitos - Work around SEC6 ERRATA (AES-CTR mode data size error)
mtd: parser: imagetag: fix error codes in bcm963xx_parse_imagetag_partitions()
sched/eas: Don't update misfit status if the task is pinned
media: uvcvideo: Accept invalid bFormatIndex and bFrameIndex values
media: pxa_camera: declare variable when DEBUG is defined
media: cx25821: Fix a bug when reallocating some dma memory
media: qm1d1c0042: fix error return code in qm1d1c0042_init()
media: lmedm04: Fix misuse of comma
media: software_node: Fix refcounts in software_node_get_next_child()
drm/amd/display: Fix HDMI deep color output for DCE 6-11.
drm/amd/display: Fix 10/12 bpc setup in DCE output bit depth reduction.
bsg: free the request before return error code
MIPS: properly stop .eh_frame generation
drm/sun4i: tcon: fix inverted DCLK polarity
crypto: bcm - Rename struct device_private to bcm_device_private
evm: Fix memleak in init_desc
ASoC: cs42l56: fix up error handling in probe
media: aspeed: fix error return code in aspeed_video_setup_video()
media: tm6000: Fix memleak in tm6000_start_stream
media: media/pci: Fix memleak in empress_init
media: em28xx: Fix use-after-free in em28xx_alloc_urbs
media: vsp1: Fix an error handling path in the probe function
media: camss: missing error code in msm_video_register()
media: imx: Fix csc/scaler unregister
media: imx: Unregister csc/scaler only if registered
media: i2c: ov5670: Fix PIXEL_RATE minimum value
MIPS: lantiq: Explicitly compare LTQ_EBU_PCC_ISTAT against 0
MIPS: c-r4k: Fix section mismatch for loongson2_sc_init
drm/amdgpu: Fix macro name _AMDGPU_TRACE_H_ in preprocessor if condition
crypto: arm64/aes-ce - really hide slower algos when faster ones are enabled
crypto: sun4i-ss - fix kmap usage
crypto: sun4i-ss - linearize buffers content must be kept
drm/fb-helper: Add missed unlocks in setcmap_legacy()
gma500: clean up error handling in init
drm/gma500: Fix error return code in psb_driver_load()
fbdev: aty: SPARC64 requires FB_ATY_CT
net: mvneta: Remove per-cpu queue mapping for Armada 3700
net: amd-xgbe: Fix network fluctuations when using 1G BELFUSE SFP
net: amd-xgbe: Reset link when the link never comes back
net: amd-xgbe: Fix NETDEV WATCHDOG transmit queue timeout warning
net: amd-xgbe: Reset the PHY rx data path when mailbox command timeout
ibmvnic: skip send_request_unmap for timeout reset
ibmvnic: add memory barrier to protect long term buffer
b43: N-PHY: Fix the update of coef for the PHY revision >= 3case
cxgb4/chtls/cxgbit: Keeping the max ofld immediate data size same in cxgb4 and ulds
net: axienet: Handle deferred probe on clock properly
tcp: fix SO_RCVLOWAT related hangs under mem pressure
bpf: Fix bpf_fib_lookup helper MTU check for SKB ctx
mac80211: fix potential overflow when multiplying to u32 integers
xen/netback: fix spurious event detection for common event case
bnxt_en: reverse order of TX disable and carrier off
ibmvnic: Set to CLOSED state even on error
ath9k: fix data bus crash when setting nf_override via debugfs
bpf_lru_list: Read double-checked variable once without lock
soc: aspeed: snoop: Add clock control logic
ARM: s3c: fix fiq for clang IAS
arm64: dts: msm8916: Fix reserved and rfsa nodes unit address
Bluetooth: btusb: Fix memory leak in btusb_mtk_wmt_recv
arm64: dts: armada-3720-turris-mox: rename u-boot mtd partition to a53-firmware
ARM: dts: armada388-helios4: assign pinctrl to each fan
ARM: dts: armada388-helios4: assign pinctrl to LEDs
staging: rtl8723bs: wifi_regd.c: Fix incorrect number of regulatory rules
usb: dwc2: Make "trimming xfer length" a debug message
usb: dwc2: Abort transaction after errors with unknown reason
usb: dwc2: Do not update data length if it is 0 on inbound transfers
ARM: dts: Configure missing thermal interrupt for 4430
memory: ti-aemif: Drop child node when jumping out loop
Bluetooth: Put HCI device if inquiry procedure interrupts
Bluetooth: drop HCI device reference before return
usb: gadget: u_audio: Free requests only after callback
ACPICA: Fix exception code class checks
cpufreq: brcmstb-avs-cpufreq: Fix resource leaks in ->remove()
cpufreq: brcmstb-avs-cpufreq: Free resources in error path
arm64: dts: allwinner: A64: Limit MMC2 bus frequency to 150 MHz
arm64: dts: allwinner: H6: Allow up to 150 MHz MMC bus frequency
arm64: dts: allwinner: Drop non-removable from SoPine/LTS SD card
arm64: dts: allwinner: H6: properly connect USB PHY to port 0
arm64: dts: allwinner: A64: properly connect USB PHY to port 0
bpf: Avoid warning when re-casting __bpf_call_base into __bpf_call_base_args
bpf: Add bpf_patch_call_args prototype to include/linux/bpf.h
memory: mtk-smi: Fix PM usage counter unbalance in mtk_smi ops
arm64: dts: exynos: correct PMIC interrupt trigger level on Espresso
arm64: dts: exynos: correct PMIC interrupt trigger level on TM2
ARM: dts: exynos: correct PMIC interrupt trigger level on Odroid XU3 family
ARM: dts: exynos: correct PMIC interrupt trigger level on Arndale Octa
ARM: dts: exynos: correct PMIC interrupt trigger level on Spring
ARM: dts: exynos: correct PMIC interrupt trigger level on Rinato
ARM: dts: exynos: correct PMIC interrupt trigger level on Monk
ARM: dts: exynos: correct PMIC interrupt trigger level on Artik 5
Bluetooth: Fix initializing response id after clearing struct
Bluetooth: hci_uart: Fix a race for write_work scheduling
Bluetooth: btqcomsmd: Fix a resource leak in error handling paths in the probe function
ath10k: Fix error handling in case of CE pipe init failure
random: fix the RNDRESEEDCRNG ioctl
MIPS: vmlinux.lds.S: add missing PAGE_ALIGNED_DATA() section
ALSA: usb-audio: Fix PCM buffer allocation in non-vmalloc mode
bfq: Avoid false bfq queue merging
virt: vbox: Do not use wait_event_interruptible when called from kernel context
PCI: Decline to resize resources if boot config must be preserved
PCI: qcom: Use PHY_REFCLK_USE_PAD only for ipq8064
kdb: Make memory allocations more robust
debugfs: do not attempt to create a new file before the filesystem is initalized
debugfs: be more robust at handling improper input in debugfs_lookup()
kvm: x86: replace kvm_spec_ctrl_test_value with runtime test on the host
vmlinux.lds.h: add DWARF v5 sections
Linux 5.4.101
scripts/recordmcount.pl: support big endian for ARCH sh
cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting cifs_sb->prepath.
cxgb4: Add new T6 PCI device id 0x6092
NET: usb: qmi_wwan: Adding support for Cinterion MV31
KVM: Use kvm_pfn_t for local PFN variable in hva_to_pfn_remapped()
mm: provide a saner PTE walking API for modules
KVM: do not assume PTE is writable after follow_pfn
mm: simplify follow_pte{,pmd}
mm: unexport follow_pte_pmd
scripts: set proper OpenSSL include dir also for sign-file
scripts: use pkg-config to locate libcrypto
arm64: tegra: Add power-domain for Tegra210 HDA
ntfs: check for valid standard information attribute
usb: quirks: add quirk to start video capture on ELMO L-12F document camera reliable
USB: quirks: sort quirk entries
HID: make arrays usage and value to be the same
bpf: Fix truncation handling for mod32 dst reg wrt zero
Linux 5.4.100
btrfs: fix backport of 2175bf57dc952 in 5.4.95
media: pwc: Use correct device for DMA
xen-blkback: fix error handling in xen_blkbk_map()
xen-scsiback: don't "handle" error by BUG()
xen-netback: don't "handle" error by BUG()
xen-blkback: don't "handle" error by BUG()
xen/arm: don't ignore return errors from set_phys_to_machine
Xen/gntdev: correct error checking in gntdev_map_grant_pages()
Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()
Xen/x86: also check kernel mapping in set_foreign_p2m_mapping()
Xen/x86: don't bail early from clear_foreign_p2m_mapping()
net: bridge: Fix a warning when del bridge sysfs
net: qrtr: Fix port ID for control messages
KVM: SEV: fix double locking due to incorrect backport
ANDROID: GKI: Fix up .xml file after merge with android11-5.4
ANDROID: GKI: Fix up .xml file after merge with android11-5.4
ANDROID: GKI: fix up .xml file after big merge with android11-5.4
Linux 5.4.99
ovl: expand warning in ovl_d_real()
net/qrtr: restrict user-controlled length in qrtr_tun_write_iter()
net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
vsock: fix locking in vsock_shutdown()
vsock/virtio: update credit only if socket is not closed
net: watchdog: hold device global xmit lock during tx disable
net/vmw_vsock: improve locking in vsock_connect_timeout()
net: fix iteration for sctp transport seq_files
net: gro: do not keep too many GRO packets in napi->rx_list
net: dsa: call teardown method on probe failure
udp: fix skb_copy_and_csum_datagram with odd segment sizes
rxrpc: Fix clearance of Tx/Rx ring when releasing a call
usb: dwc3: ulpi: Replace CPU-based busyloop with Protocol-based one
usb: dwc3: ulpi: fix checkpatch warning
h8300: fix PREEMPTION build, TI_PRE_COUNT undefined
i2c: stm32f7: fix configuration of the digital filter
clk: sunxi-ng: mp: fix parent rate change flag check
drm/sun4i: dw-hdmi: Fix max. frequency for H6
drm/sun4i: Fix H6 HDMI PHY configuration
drm/sun4i: tcon: set sync polarity for tcon1 channel
firmware_loader: align .builtin_fw to 8
net: hns3: add a check for queue_id in hclge_reset_vf_queue()
x86/build: Disable CET instrumentation in the kernel for 32-bit too
netfilter: conntrack: skip identical origin tuple in same zone only
ibmvnic: Clear failover_pending if unable to schedule
net: stmmac: set TxQ mode back to DCB after disabling CBS
selftests: txtimestamp: fix compilation issue
net: enetc: initialize the RFS and RSS memories
xen/netback: avoid race in xenvif_rx_ring_slots_available()
netfilter: flowtable: fix tcp and udp header checksum update
netfilter: nftables: fix possible UAF over chains from packet path in netns
netfilter: xt_recent: Fix attempt to update deleted entry
bpf: Check for integer overflow when using roundup_pow_of_two()
drm/vc4: hvs: Fix buffer overflow with the dlist handling
mt76: dma: fix a possible memory leak in mt76_add_fragment()
lkdtm: don't move ctors to .rodata
vmlinux.lds.h: Create section for protection against instrumentation
ARM: kexec: fix oops after TLB are invalidated
ARM: ensure the signal page contains defined contents
ARM: dts: lpc32xx: Revert set default clock rate of HCLK PLL
bfq-iosched: Revert "bfq: Fix computation of shallow depth"
riscv: virt_addr_valid must check the address belongs to linear mapping
drm/amd/display: Decrement refcount of dc_sink before reassignment
drm/amd/display: Free atomic state after drm_atomic_commit
drm/amd/display: Fix dc_sink kref count in emulated_link_detect
drm/amd/display: Add more Clock Sources to DCN2.1
nvme-pci: ignore the subsysem NQN on Phison E16
ovl: skip getxattr of security labels
cap: fix conversions on getxattr
ovl: perform vfs_getxattr() with mounter creds
platform/x86: hp-wmi: Disable tablet-mode reporting by default
ARM: OMAP2+: Fix suspcious RCU usage splats for omap_enter_idle_coupled
arm64: dts: qcom: sdm845: Reserve LPASS clocks in gcc
arm64: dts: rockchip: Fix PCIe DT properties on rk3399
cgroup: fix psi monitor for root cgroup
arm/xen: Don't probe xenbus as part of an early initcall
tracing: Check length before giving out the filter buffer
tracing: Do not count ftrace events in top level enable output
gpio: ep93xx: Fix single irqchip with multi gpiochips
gpio: ep93xx: fix BUG_ON port F usage
Linux 5.4.98
squashfs: add more sanity checks in xattr id lookup
squashfs: add more sanity checks in inode lookup
squashfs: add more sanity checks in id lookup
Fix unsynchronized access to sev members through svm_register_enc_region
bpf: Fix 32 bit src register truncation on div/mod
regulator: Fix lockdep warning resolving supplies
blk-cgroup: Use cond_resched() when destroy blkgs
i2c: mediatek: Move suspend and resume handling to NOIRQ phase
SUNRPC: Handle 0 length opaque XDR object data properly
SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
iwlwifi: mvm: guard against device removal in reprobe
iwlwifi: mvm: invalidate IDs of internal stations at mvm start
iwlwifi: pcie: fix context info memory leak
iwlwifi: pcie: add a NULL check in iwl_pcie_txq_unmap
iwlwifi: mvm: take mutex for calling iwl_mvm_get_sync_time()
iwlwifi: mvm: skip power command when unbinding vif during CSA
ASoC: ak4458: correct reset polarity
pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
chtls: Fix potential resource leak
ASoC: Intel: Skylake: Zero snd_ctl_elem_value
mac80211: 160MHz with extended NSS BW in CSA
regulator: core: avoid regulator_resolve_supply() race condition
af_key: relax availability checks for skb size calculation
tracing/kprobe: Fix to support kretprobe events on unloaded modules
UPSTREAM: usb: xhci-mtk: break loop when find the endpoint to drop
UPSTREAM: usb: xhci-mtk: skip dropping bandwidth of unchecked endpoints
Linux 5.4.97
usb: host: xhci: mvebu: make USB 3.0 PHY optional for Armada 3720
net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
net: dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
net: ip_tunnel: fix mtu calculation
neighbour: Prevent a dead entry from updating gc_list
igc: Report speed and duplex as unknown when device is runtime suspended
md: Set prev_flush_start and flush_bio in an atomic way
iommu/vt-d: Do not use flush-queue when caching-mode is on
Input: xpad - sync supported devices with fork on GitHub
iwlwifi: mvm: don't send RFH_QUEUE_CONFIG_CMD with no queues
x86/apic: Add extra serialization for non-serializing MSRs
x86/build: Disable CET instrumentation in the kernel
mm: thp: fix MADV_REMOVE deadlock on shmem THP
mm, compaction: move high_pfn to the for loop scope
mm: hugetlb: remove VM_BUG_ON_PAGE from page_huge_active
mm: hugetlb: fix a race between isolating and freeing page
mm: hugetlb: fix a race between freeing and dissolving the page
mm: hugetlbfs: fix cannot migrate the fallocated HugeTLB page
ARM: footbridge: fix dc21285 PCI configuration accessors
KVM: x86: Update emulator context mode if SYSENTER xfers to 64-bit mode
KVM: SVM: Treat SVM as unsupported when running as an SEV guest
nvme-pci: avoid the deepest sleep state on Kingston A2000 SSDs
drm/amd/display: Revert "Fix EDID parsing after resume from suspend"
mmc: core: Limit retries when analyse of SDIO tuples fails
smb3: fix crediting for compounding when only one request in flight
smb3: Fix out-of-bounds bug in SMB2_negotiate()
cifs: report error instead of invalid when revalidating a dentry fails
xhci: fix bounce buffer usage for non-sg list case
genirq/msi: Activate Multi-MSI early when MSI_FLAG_ACTIVATE_EARLY is set
libnvdimm/dimm: Avoid race between probe and available_slots_show()
kretprobe: Avoid re-registration of the same kretprobe earlier
fgraph: Initialize tracing_graph_pause at task creation
mac80211: fix station rate table updates on assoc
ovl: fix dentry leak in ovl_get_redirect
usb: host: xhci-plat: add priv quirk for skip PHY initialization
usb: xhci-mtk: break loop when find the endpoint to drop
usb: xhci-mtk: skip dropping bandwidth of unchecked endpoints
usb: xhci-mtk: fix unreleased bandwidth data
usb: dwc3: fix clock issue during resume in OTG mode
usb: dwc2: Fix endpoint direction check in ep_from_windex
usb: renesas_usbhs: Clear pipe running flag in usbhs_pkt_pop()
USB: usblp: don't call usb_set_interface if there's a single alt
USB: gadget: legacy: fix an error code in eth_bind()
memblock: do not start bottom-up allocations with kernel_end
nvmet-tcp: fix out-of-bounds access when receiving multiple h2cdata PDUs
ARM: dts: sun7i: a20: bananapro: Fix ethernet phy-mode
r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
net: mvpp2: TCAM entry enable should be written after SRAM data
net: lapb: Copy the skb before sending a packet
net/mlx5: Fix leak upon failure of rule creation
i40e: Revert "i40e: don't report link up for a VF who hasn't enabled queues"
igc: check return value of ret_val in igc_config_fc_after_link_up
igc: set the default return value to -IGC_ERR_NVM in igc_write_nvm_srwr
arm64: dts: ls1046a: fix dcfg address range
rxrpc: Fix deadlock around release of dst cached on udp tunnel
um: virtio: free vu_dev only with the contained struct device
bpf, cgroup: Fix problematic bounds check
bpf, cgroup: Fix optlen WARN_ON_ONCE toctou
arm64: dts: rockchip: fix vopl iommu irq on px30
arm64: dts: amlogic: meson-g12: Set FL-adj property value
Input: i8042 - unbreak Pegatron C15B
arm64: dts: qcom: c630: keep both touchpad devices enabled
USB: serial: option: Adding support for Cinterion MV31
USB: serial: cp210x: add new VID/PID for supporting Teraoka AD2000
USB: serial: cp210x: add pid/vid for WSDA-200-USB
Linux 5.4.96
workqueue: Restrict affinity change to rescuer
kthread: Extract KTHREAD_IS_PER_CPU
objtool: Don't fail on missing symbol table
drm/amd/display: Change function decide_dp_link_settings to avoid infinite looping
drm/amd/display: Update dram_clock_change_latency for DCN2.1
selftests/powerpc: Only test lwm/stmw on big endian
nvme: check the PRINFO bit before deciding the host buffer length
udf: fix the problem that the disc content is not displayed
ALSA: hda: Add Cometlake-R PCI ID
scsi: ibmvfc: Set default timeout to avoid crash during migration
mac80211: fix fast-rx encryption check
ASoC: SOF: Intel: hda: Resume codec to do jack detection
scsi: fnic: Fix memleak in vnic_dev_init_devcmd2
scsi: libfc: Avoid invoking response handler twice if ep is already completed
scsi: scsi_transport_srp: Don't block target in failfast state
x86: __always_inline __{rd,wr}msr()
platform/x86: intel-vbtn: Support for tablet mode on Dell Inspiron 7352
platform/x86: touchscreen_dmi: Add swap-x-y quirk for Goodix touchscreen on Estar Beauty HD tablet
phy: cpcap-usb: Fix warning for missing regulator_disable
net_sched: gen_estimator: support large ewma log
btrfs: backref, use correct count to resolve normal data refs
btrfs: backref, only search backref entries from leaves of the same root
btrfs: backref, don't add refs from shared block when resolving normal backref
btrfs: backref, only collect file extent items matching backref offset
tcp: make TCP_USER_TIMEOUT accurate for zero window probes
arm64: Do not pass tagged addresses to __is_lm_address()
arm64: Fix kernel address detection of __is_lm_address()
ACPI: thermal: Do not call acpi_thermal_check() directly
Revert "Revert "block: end bio with BLK_STS_AGAIN in case of non-mq devs and REQ_NOWAIT""
ibmvnic: Ensure that CRQ entry read are correctly ordered
net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP
net: dsa: bcm_sf2: put device node before return
Linux 5.4.95
tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN
team: protect features update by RCU to avoid deadlock
ASoC: topology: Fix memory corruption in soc_tplg_denum_create_values()
NFC: fix possible resource leak
NFC: fix resource leak when target index is invalid
rxrpc: Fix memory leak in rxrpc_lookup_local
iommu/vt-d: Don't dereference iommu_device if IOMMU_API is not built
iommu/vt-d: Gracefully handle DMAR units with no supported address widths
selftests: forwarding: Specify interface when invoking mausezahn
nvme-multipath: Early exit if no path is available
can: dev: prevent potential information leak in can_fill_info()
net/mlx5e: Reduce tc unsupported key print level
net/mlx5e: E-switch, Fix rate calculation for overflow
net/mlx5: Fix memory leak on flow table creation error flow
igc: fix link speed advertising
i40e: acquire VSI pointer only after VF is initialized
mac80211: pause TX while changing interface type
iwlwifi: pcie: reschedule in long-running memory reads
iwlwifi: pcie: use jiffies for memory read spin time limit
pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
ASoC: Intel: Skylake: skl-topology: Fix OOPs ib skl_tplg_complete
RDMA/cxgb4: Fix the reported max_recv_sge value
firmware: imx: select SOC_BUS to fix firmware build
ARM: dts: imx6qdl-kontron-samx6i: fix i2c_lcd/cam default status
arm64: dts: ls1028a: fix the offset of the reset register
xfrm: Fix wraparound in xfrm_policy_addr_delta()
selftests: xfrm: fix test return value override issue in xfrm_policy.sh
xfrm: fix disable_xfrm sysctl when used on xfrm interfaces
xfrm: Fix oops in xfrm_replay_advance_bmp
netfilter: nft_dynset: add timeout extension to template
ARM: imx: build suspend-imx6.S with arm instruction set
xen-blkfront: allow discard-* nodes to be optional
tee: optee: replace might_sleep with cond_resched
drm/i915: Check for all subplatform bits
drm/nouveau/svm: fail NOUVEAU_SVM_INIT ioctl on unsupported devices
mt7601u: fix rx buffer refcounting
mt7601u: fix kernel crash unplugging the device
arm64: dts: broadcom: Fix USB DMA address translation for Stingray
leds: trigger: fix potential deadlock with libata
xen: Fix XenStore initialisation for XS_LOCAL
KVM: Forbid the use of tagged userspace addresses for memslots
KVM: x86: get smi pending status correctly
KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration
KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()
KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[]
btrfs: fix possible free space tree corruption with online conversion
drivers: soc: atmel: add null entry at the end of at91_soc_allowed_list[]
drivers: soc: atmel: Avoid calling at91_soc_init on non AT91 SoCs
PM: hibernate: flush swap writer after marking
s390/vfio-ap: No need to disable IRQ after queue reset
net: usb: qmi_wwan: added support for Thales Cinterion PLSx3 modem family
wext: fix NULL-ptr-dereference with cfg80211's lack of commit()
ARM: dts: imx6qdl-gw52xx: fix duplicate regulator naming
media: rc: ensure that uevent can be read directly after rc device register
ALSA: hda/via: Apply the workaround generically for Clevo machines
ALSA: hda/realtek: Enable headset of ASUS B1400CEPE with ALC256
kernel: kexec: remove the lock operation of system_transition_mutex
ACPI: sysfs: Prefer "compatible" modalias
nbd: freeze the queue while we're adding connections
IPv6: reply ICMP error if the first fragment don't include all headers
ICMPv6: Add ICMPv6 Parameter Problem, code 3 definition
ANDROID: arm64: mm: ensure that memstart_addr and physvirt_offset remain in sync
Revert "arm64: mm: use single quantity to represent the PA to VA translation"
Linux 5.4.94
fs: fix lazytime expiration handling in __writeback_single_inode()
writeback: Drop I_DIRTY_TIME_EXPIRE
dm integrity: conditionally disable "recalculate" feature
tools: Factor HOSTCC, HOSTLD, HOSTAR definitions
SMB3.1.1: do not log warning message if server doesn't populate salt
arm64: mm: use single quantity to represent the PA to VA translation
tracing: Fix race in trace_open and buffer resize call
io_uring: Fix current->fs handling in io_sq_wq_submit_work()
HID: wacom: Correct NULL dereference on AES pen proximity
futex: Handle faults correctly for PI futexes
futex: Simplify fixup_pi_state_owner()
futex: Use pi_state_update_owner() in put_pi_state()
rtmutex: Remove unused argument from rt_mutex_proxy_unlock()
futex: Provide and use pi_state_update_owner()
futex: Replace pointless printk in fixup_owner()
futex: Ensure the correct return value from futex_lock_pi()
Revert "mm/slub: fix a memory leak in sysfs_slab_add()"
gpio: mvebu: fix pwm .get_state period calculation
ANDROID: GKI: api preservation of struct inet_connection_sock
ANDROID: GKI: update .xml file in android11-5.4-lts
Linux 5.4.93
tcp: fix TCP_USER_TIMEOUT with zero window
tcp: do not mess with cloned skbs in tcp_add_backlog()
net: dsa: b53: fix an off by one in checking "vlan->vid"
net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
net: mscc: ocelot: allow offloading of bridge on top of LAG
ipv6: set multicast flag on the multicast route
net_sched: reject silly cell_log in qdisc_get_rtab()
net_sched: avoid shift-out-of-bounds in tcindex_set_parms()
ipv6: create multicast route with RTPROT_KERNEL
udp: mask TOS bits in udp_v4_early_demux()
kasan: fix incorrect arguments passing in kasan_add_zero_shadow
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too
lightnvm: fix memory leak when submit fails
sh_eth: Fix power down vs. is_opened flag ordering
net: dsa: mv88e6xxx: also read STU state in mv88e6250_g1_vtu_getnext
sh: dma: fix kconfig dependency for G2_DMA
netfilter: rpfilter: mask ecn bits before fib lookup
x86/cpu/amd: Set __max_die_per_package on AMD
pinctrl: ingenic: Fix JZ4760 support
driver core: Extend device_is_dependent()
xhci: tegra: Delay for disabling LFPS detector
xhci: make sure TRB is fully written before giving it to the controller
usb: bdc: Make bdc pci driver depend on BROKEN
usb: udc: core: Use lock when write to soft_connect
usb: gadget: aspeed: fix stop dma register setting.
USB: ehci: fix an interrupt calltrace error
ehci: fix EHCI host controller initialization sequence
serial: mvebu-uart: fix tx lost characters at power off
stm class: Fix module init return on allocation failure
intel_th: pci: Add Alder Lake-P support
x86/mmx: Use KFPU_387 for MMX string operations
x86/topology: Make __max_die_per_package available unconditionally
x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
irqchip/mips-cpu: Set IPI domain parent chip
cifs: do not fail __smb_send_rqst if non-fatal signals are pending
iio: ad5504: Fix setting power-down state
can: peak_usb: fix use after free bugs
can: vxcan: vxcan_xmit: fix use after free bug
can: dev: can_restart: fix use after free bug
selftests: net: fib_tests: remove duplicate log test
platform/x86: intel-vbtn: Drop HP Stream x360 Convertible PC 11 from allow-list
i2c: octeon: check correct size of maximum RECV_LEN packet
powerpc: Fix alignment bug within the init sections
scsi: megaraid_sas: Fix MEGASAS_IOC_FIRMWARE regression
pinctrl: aspeed: g6: Fix PWMG0 pinctrl setting
powerpc: Use the common INIT_DATA_SECTION macro in vmlinux.lds.S
drm/nouveau/kms/nv50-: fix case where notifier buffer is at offset 0
drm/nouveau/mmu: fix vram heap sizing
drm/nouveau/i2c/gm200: increase width of aux semaphore owner fields
drm/nouveau/privring: ack interrupts the same way as RM
drm/nouveau/bios: fix issue shadowing expansion ROMs
drm/amd/display: Fix to be able to stop crc calculation
drm/amdgpu/psp: fix psp gfx ctrl cmds
riscv: defconfig: enable gpio support for HiFive Unleashed
dts: phy: fix missing mdio device and probe failure of vsc8541-01 device
x86/xen: Add xen_no_vector_callback option to test PCI INTX delivery
xen: Fix event channel callback via INTX/GSI
arm64: make atomic helpers __always_inline
clk: tegra30: Add hda clock default rates to clock driver
HID: Ignore battery for Elan touchscreen on ASUS UX550
HID: logitech-dj: add the G602 receiver
riscv: Fix sifive serial driver
riscv: Fix kernel time_init()
scsi: sd: Suppress spurious errors when WRITE SAME is being disabled
scsi: qedi: Correct max length of CHAP secret
scsi: ufs: Correct the LUN used in eh_device_reset_handler() callback
dm integrity: select CRYPTO_SKCIPHER
HID: multitouch: Enable multi-input for Synaptics pointstick/touchpad device
ASoC: Intel: haswell: Add missing pm_ops
drm/i915/gt: Prevent use of engine->wa_ctx after error
drm/syncobj: Fix use-after-free
drm/atomic: put state on error path
dm integrity: fix a crash if "recalculate" used without "internal_hash"
dm: avoid filesystem lookup in dm_get_dev_t()
mmc: sdhci-xenon: fix 1.8v regulator stabilization
mmc: core: don't initialize block size from ext_csd if not present
btrfs: send: fix invalid clone operations when cloning from the same file and root
btrfs: don't clear ret in btrfs_start_dirty_block_groups
btrfs: fix lockdep splat in btrfs_recover_relocation
btrfs: don't get an EINTR during drop_snapshot for reloc
ACPI: scan: Make acpi_bus_get_device() clear return pointer on error
ALSA: hda/via: Add minimum mute flag
ALSA: seq: oss: Fix missing error check in snd_seq_oss_synth_make_info()
platform/x86: ideapad-laptop: Disable touchpad_switch for ELAN0634
platform/x86: i2c-multi-instantiate: Don't create platform device for INT3515 ACPI nodes
i2c: bpmp-tegra: Ignore unknown I2C_M flags
Linux 5.4.92
spi: cadence: cache reference clock rate during probe
mac80211: check if atf has been disabled in __ieee80211_schedule_txq
mac80211: do not drop tx nulldata packets on encrypted links
tipc: fix NULL deref in tipc_link_xmit()
net, sctp, filter: remap copy_from_user failure error
rxrpc: Fix handling of an unsupported token type in rxrpc_read()
net: avoid 32 x truesize under-estimation for tiny skbs
net: sit: unregister_netdevice on newlink's error path
net: stmmac: Fixed mtu channged by cache aligned
rxrpc: Call state should be read with READ_ONCE() under some circumstances
net: dcb: Accept RTM_GETDCB messages carrying set-like DCB commands
net: dcb: Validate netlink message in DCB handler
esp: avoid unneeded kmap_atomic call
rndis_host: set proper input size for OID_GEN_PHYSICAL_MEDIUM request
net: mvpp2: Remove Pause and Asym_Pause support
mlxsw: core: Increase critical threshold for ASIC thermal zone
mlxsw: core: Add validation of transceiver temperature thresholds
net: ipv6: Validate GSO SKB before finish IPv6 processing
net: skbuff: disambiguate argument and member for skb_list_walk_safe helper
net: introduce skb_list_walk_safe for skb segment walking
netxen_nic: fix MSI/MSI-x interrupts
udp: Prevent reuseport_select_sock from reading uninitialized socks
bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
bpf: Don't leak memory in bpf getsockopt when optlen == 0
nfsd4: readdirplus shouldn't return parent of export
spi: npcm-fiu: Disable clock in probe error path
spi: npcm-fiu: simplify the return expression of npcm_fiu_probe()
scsi: lpfc: Make lpfc_defer_acc_rsp static
scsi: lpfc: Make function lpfc_defer_pt2pt_acc static
elfcore: fix building with clang
xen/privcmd: allow fetching resource sizes
compiler.h: Raise minimum version of GCC to 5.1 for arm64
usb: ohci: Make distrust_firmware param default to false
Linux 5.4.91
netfilter: nft_compat: remove flush counter optimization
netfilter: nf_nat: Fix memleak in nf_nat_init
netfilter: conntrack: fix reading nf_conntrack_buckets
ALSA: firewire-tascam: Fix integer overflow in midi_port_work()
ALSA: fireface: Fix integer overflow in transmit_midi_msg()
dm: eliminate potential source of excessive kernel log noise
net: sunrpc: interpret the return value of kstrtou32 correctly
iommu/vt-d: Fix unaligned addresses for intel_flush_svm_range_dev()
mm, slub: consider rest of partial list if acquire_slab() fails
drm/i915/dsi: Use unconditional msleep for the panel_on_delay when there is no reset-deassert MIPI-sequence
IB/mlx5: Fix error unwinding when set_has_smi_cap fails
RDMA/mlx5: Fix wrong free of blue flame register on error
bnxt_en: Improve stats context resource accounting with RDMA driver loaded.
RDMA/usnic: Fix memleak in find_free_vf_and_create_qp_grp
RDMA/restrack: Don't treat as an error allocation ID wrapping
ext4: fix superblock checksum failure when setting password salt
NFS: nfs_igrab_and_active must first reference the superblock
NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
pNFS: Stricter ordering of layoutget and layoutreturn
pNFS: Mark layout for return if return-on-close was not sent
pNFS: We want return-on-close to complete when evicting the inode
NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
nvme-tcp: fix possible data corruption with bio merges
ASoC: Intel: fix error code cnl_set_dsp_D0()
ASoC: meson: axg-tdmin: fix axg skew offset
ASoC: meson: axg-tdm-interface: fix loopback
dump_common_audit_data(): fix racy accesses to ->d_name
perf intel-pt: Fix 'CPU too large' error
ARM: picoxcell: fix missing interrupt-parent properties
drm/msm: Call msm_init_vram before binding the gpu
ACPI: scan: add stub acpi_create_platform_device() for !CONFIG_ACPI
usb: typec: Fix copy paste error for NVIDIA alt-mode description
drm/amdgpu: fix a GPU hang issue when remove device
nvmet-rdma: Fix list_del corruption on queue establishment failure
nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN
selftests: fix the return value for UDP GRO test
net: ethernet: fs_enet: Add missing MODULE_LICENSE
misdn: dsp: select CONFIG_BITREVERSE
arch/arc: add copy_user_page() to <asm/page.h> to fix build error on ARC
bfq: Fix computation of shallow depth
lib/raid6: Let $(UNROLL) rules work with macOS userland
hwmon: (pwm-fan) Ensure that calculation doesn't discard big period values
habanalabs: Fix memleak in hl_device_reset
habanalabs: register to pci shutdown callback
ethernet: ucc_geth: fix definition and size of ucc_geth_tx_global_pram
regulator: bd718x7: Add enable times
btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
netfilter: ipset: fixes possible oops in mtype_resize
ARC: build: move symlink creation to arch/arc/Makefile to avoid race
ARC: build: add boot_targets to PHONY
ARC: build: add uImage.lzma to the top-level target
ARC: build: remove non-existing bootpImage from KBUILD_IMAGE
dm integrity: fix flush with external metadata device
cifs: fix interrupted close commands
smb3: remove unused flag passed into close functions
ext4: don't leak old mountpoint samples
ext4: fix bug for rename with RENAME_WHITEOUT
drm/i915/backlight: fix CPU mode backlight takeover on LPT
btrfs: tree-checker: check if chunk item end overflows
r8152: Add Lenovo Powered USB-C Travel Hub
dm integrity: fix the maximum number of arguments
dm snapshot: flush merged data before committing metadata
dm raid: fix discard limits for raid1
mm/hugetlb: fix potential missing huge page size info
ACPI: scan: Harden acpi_device_add() against device ID overflows
RDMA/ocrdma: Fix use after free in ocrdma_dealloc_ucontext_pd()
MIPS: relocatable: fix possible boot hangup with KASLR enabled
MIPS: boot: Fix unaligned access with CONFIG_MIPS_RAW_APPENDED_DTB
mips: lib: uncached: fix non-standard usage of variable 'sp'
mips: fix Section mismatch in reference
tracing/kprobes: Do the notrace functions check without kprobes on ftrace
x86/hyperv: check cpu mask after interrupt has been disabled
ASoC: dapm: remove widget from dirty list on free
btrfs: prevent NULL pointer dereference in extent_io_tree_panic
kbuild: enforce -Werror=return-type
Linux 5.4.90
regmap: debugfs: Fix a reversed if statement in regmap_debugfs_init()
net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
block: fix use-after-free in disk_part_iter_next
KVM: arm64: Don't access PMCR_EL0 when no PMU is available
net: mvpp2: disable force link UP during port init procedure
regulator: qcom-rpmh-regulator: correct hfsmps515 definition
wan: ds26522: select CONFIG_BITREVERSE
regmap: debugfs: Fix a memory leak when calling regmap_attach_dev
net/mlx5e: Fix two double free cases
net/mlx5e: Fix memleak in mlx5e_create_l2_table_groups
bpftool: Fix compilation failure for net.o with older glibc
iommu/intel: Fix memleak in intel_irq_remapping_alloc
lightnvm: select CONFIG_CRC32
block: rsxx: select CONFIG_CRC32
wil6210: select CONFIG_CRC32
qed: select CONFIG_CRC32
dmaengine: xilinx_dma: fix mixed_enum_type coverity warning
dmaengine: xilinx_dma: fix incompatible param warning in _child_probe()
dmaengine: xilinx_dma: check dma_async_device_register return value
dmaengine: mediatek: mtk-hsdma: Fix a resource leak in the error handling path of the probe function
i2c: i801: Fix the i2c-mux gpiod_lookup_table not being properly terminated
spi: stm32: FIFO threshold level - fix align packet size
cpufreq: powernow-k8: pass policy rather than use cpufreq_cpu_get()
can: kvaser_pciefd: select CONFIG_CRC32
can: m_can: m_can_class_unregister(): remove erroneous m_can_clk_stop()
can: tcan4x5x: fix bittiming const, use common bittiming from m_can driver
dmaengine: dw-edma: Fix use after free in dw_edma_alloc_chunk()
i2c: sprd: use a specific timeout to avoid system hang up issue
ARM: OMAP2+: omap_device: fix idling of devices during probe
HID: wacom: Fix memory leakage caused by kfifo_alloc
iio: imu: st_lsm6dsx: fix edge-trigger interrupts
vmlinux.lds.h: Add PGO and AutoFDO input sections
exfat: Month timestamp metadata accidentally incremented
x86/resctrl: Don't move a task to the same resource group
x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR
chtls: Fix chtls resources release sequence
chtls: Added a check to avoid NULL pointer dereference
chtls: Replace skb_dequeue with skb_peek
chtls: Fix panic when route to peer not configured
chtls: Remove invalid set_tcb call
chtls: Fix hardware tid leak
net/mlx5e: ethtool, Fix restriction of autoneg with 56G
net/mlx5: Use port_num 1 instead of 0 when delete a RoCE address
net: dsa: lantiq_gswip: Exclude RMII from modes that report 1 GbE
s390/qeth: fix L2 header access in qeth_l3_osa_features_check()
nexthop: Unlink nexthop group entry in error path
nexthop: Fix off-by-one error in error path
octeontx2-af: fix memory leak of lmac and lmac->name
net: ip: always refragment ip defragmented packets
net: fix pmtu check in nopmtudisc mode
tools: selftests: add test for changing routes with PTMU exceptions
net: ipv6: fib: flush exceptions when purging route
net/sonic: Fix some resource leaks in error handling paths
net: vlan: avoid leaks on register_vlan_dev() failures
net: stmmac: dwmac-sun8i: Balance internal PHY power
net: stmmac: dwmac-sun8i: Balance internal PHY resource references
net: hns3: fix a phy loopback fail issue
net: hns3: fix the number of queues actually used by ARQ
net: cdc_ncm: correct overhead in delayed_ndp_size
vfio iommu: Add dma available capability
x86/asm/32: Add ENDs to some functions and relabel with SYM_CODE_*
Linux 5.4.89
scsi: target: Fix XCOPY NAA identifier lookup
KVM: x86: fix shift out of bounds reported by UBSAN
x86/mtrr: Correct the range check before performing MTRR type lookups
netfilter: nft_dynset: report EOPNOTSUPP on missing set feature
netfilter: xt_RATEEST: reject non-null terminated string from userspace
netfilter: ipset: fix shift-out-of-bounds in htable_bits()
netfilter: x_tables: Update remaining dereference to RCU
drm/i915: clear the gpu reloc batch
dmabuf: fix use-after-free of dmabuf's file->f_inode
Revert "device property: Keep secondary firmware node secondary by type"
btrfs: send: fix wrong file path when there is an inode with a pending rmdir
ALSA: hda/realtek: Add two "Intel Reference board" SSID in the ALC256.
ALSA: hda/realtek: Enable mute and micmute LED on HP EliteBook 850 G7
ALSA: hda/realtek - Fix speaker volume control on Lenovo C940
ALSA: hda/conexant: add a new hda codec CX11970
ALSA: hda/via: Fix runtime PM for Clevo W35xSS
kvm: check tlbs_dirty directly
x86/mm: Fix leak of pmd ptlock
USB: serial: keyspan_pda: remove unused variable
usb: gadget: configfs: Fix use-after-free issue with udc_name
usb: gadget: configfs: Preserve function ordering after bind failure
usb: gadget: Fix spinlock lockup on usb_function_deactivate
USB: gadget: legacy: fix return error code in acm_ms_bind()
usb: gadget: u_ether: Fix MTU size mismatch with RX packet size
usb: gadget: function: printer: Fix a memory leak for interface descriptor
usb: gadget: f_uac2: reset wMaxPacketSize
usb: gadget: select CONFIG_CRC32
ALSA: usb-audio: Fix UBSAN warnings for MIDI jacks
USB: usblp: fix DMA to stack
USB: yurex: fix control-URB timeout handling
USB: serial: option: add Quectel EM160R-GL
USB: serial: option: add LongSung M5710 module support
USB: serial: iuu_phoenix: fix DMA from stack
usb: uas: Add PNY USB Portable SSD to unusual_uas
usb: usbip: vhci_hcd: protect shift size
USB: xhci: fix U1/U2 handling for hardware with XHCI_INTEL_HOST quirk set
usb: chipidea: ci_hdrc_imx: add missing put_device() call in usbmisc_get_init_data()
usb: dwc3: ulpi: Use VStsDone to detect PHY regs access completion
USB: cdc-wdm: Fix use after free in service_outstanding_interrupt().
USB: cdc-acm: blacklist another IR Droid device
usb: gadget: enable super speed plus
staging: mt7621-dma: Fix a resource leak in an error handling path
powerpc: Handle .text.{hot,unlikely}.* in linker script
crypto: asym_tpm: correct zero out potential secrets
crypto: ecdh - avoid buffer overflow in ecdh_set_secret()
video: hyperv_fb: Fix the mmap() regression for v5.4.y and older
Bluetooth: revert: hci_h5: close serdev device and free hu in h5_close
kbuild: don't hardcode depmod path
net/sched: sch_taprio: ensure to reset/destroy all child qdiscs
ionic: account for vlan tag len in rx buffer len
vhost_net: fix ubuf refcount incorrectly when sendmsg fails
net: usb: qmi_wwan: add Quectel EM160R-GL
CDC-NCM: remove "connected" log message
net: dsa: lantiq_gswip: Fix GSWIP_MII_CFG(p) register access
net: dsa: lantiq_gswip: Enable GSWIP_MII_CFG_EN also for internal PHYs
r8169: work around power-saving bug on some chip versions
net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
erspan: fix version 1 check in gre_parse_header()
net: hns: fix return value check in __lb_other_process()
net: sched: prevent invalid Scell_log shift count
ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
net: mvpp2: fix pkt coalescing int-threshold configuration
tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
net: ethernet: ti: cpts: fix ethtool output when no ptp_clock registered
net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc
net-sysfs: take the rtnl lock when storing xps_rxqs
net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc
net-sysfs: take the rtnl lock when storing xps_cpus
net: ethernet: Fix memleak in ethoc_probe
net/ncsi: Use real net-device for response handler
virtio_net: Fix recursive call to cpus_read_lock()
qede: fix offload for IPIP tunnel packets
net: ethernet: mvneta: Fix error handling in mvneta_probe
ibmvnic: continue fatal error reset after passive init
net: mvpp2: Fix GoP port 3 Networking Complex Control configurations
atm: idt77252: call pci_disable_device() on error path
ethernet: ucc_geth: set dev->max_mtu to 1518
ethernet: ucc_geth: fix use-after-free in ucc_geth_remove()
net: systemport: set dev->max_mtu to UMAC_MAX_MTU_SIZE
net: mvpp2: prs: fix PPPoE with ipv6 packet parse
net: mvpp2: Add TCAM entry to drop flow control pause frames
iavf: fix double-release of rtnl_lock
i40e: Fix Error I40E_AQ_RC_EINVAL when removing VFs
proc: fix lookup in /proc/net subdirectories after setns(2)
proc: change ->nlink under proc_subdir_lock
depmod: handle the case of /sbin/depmod without /sbin in PATH
lib/genalloc: fix the overflow when size is too big
scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
scsi: ufs: Fix wrong print message in dev_err()
workqueue: Kick a worker based on the actual activation of delayed works
Revert "rwsem: Implement down_read_killable_nested"
Revert "rwsem: Implement down_read_interruptible"
Revert "perf: Use new infrastructure to fix deadlocks in execve"
Revert "perf: Break deadlock involving exec_update_mutex"
Revert "exec: Add exec_update_mutex to replace cred_guard_mutex"
Revert "kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve"
Revert "proc: Use new infrastructure to fix deadlocks in execve"
Revert "proc: io_accounting: Use new infrastructure to fix deadlocks in execve"
Revert "exec: Fix a deadlock in strace"
Revert "exec: Transform exec_update_mutex into a rw_semaphore"
Revert "Revert "exec: Fix a deadlock in strace""
Revert "Revert "perf: Use new infrastructure to fix deadlocks in execve""
Revert "Revert "proc: io_accounting: Use new infrastructure to fix deadlocks in execve""
Revert "Revert "proc: Use new infrastructure to fix deadlocks in execve""
Revert "Revert "kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve""
Revert "Revert "exec: Add exec_update_mutex to replace cred_guard_mutex""
Linux 5.4.88
mwifiex: Fix possible buffer overflows in mwifiex_cmd_802_11_ad_hoc_start
exec: Transform exec_update_mutex into a rw_semaphore
rwsem: Implement down_read_interruptible
rwsem: Implement down_read_killable_nested
perf: Break deadlock involving exec_update_mutex
fuse: fix bad inode
iio:imu:bmi160: Fix alignment and data leak issues
kdev_t: always inline major/minor helper functions
dmaengine: at_hdmac: add missing kfree() call in at_dma_xlate()
dmaengine: at_hdmac: add missing put_device() call in at_dma_xlate()
dmaengine: at_hdmac: Substitute kzalloc with kmalloc
Revert "mtd: spinand: Fix OOB read"
Revert "drm/amd/display: Fix memory leaks in S3 resume"
Linux 5.4.87
dm verity: skip verity work if I/O error when system is shutting down
ALSA: pcm: Clear the full allocated memory at hw_params
tick/sched: Remove bogus boot "safety" check
um: ubd: Submit all data segments atomically
fs/namespace.c: WARN if mnt_count has become negative
module: delay kobject uevent until after module init call
f2fs: avoid race condition for shrinker count
NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode
i3c master: fix missing destroy_workqueue() on error in i3c_master_register
powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe()
rtc: pl031: fix resource leak in pl031_probe
quota: Don't overflow quota file offsets
module: set MODULE_STATE_GOING state when a module fails to load
rtc: sun6i: Fix memleak in sun6i_rtc_clk_init
fcntl: Fix potential deadlock in send_sig{io, urg}()
bfs: don't use WARNING: string when it's just info.
ALSA: rawmidi: Access runtime->avail always in spinlock
ALSA: seq: Use bool for snd_seq_queue internal flags
f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
media: gp8psk: initialize stats at power control logic
misc: vmw_vmci: fix kernel info-leak by initializing dbells in vmci_ctx_get_chkpt_doorbells()
reiserfs: add check for an invalid ih_entry_count
Bluetooth: hci_h5: close serdev device and free hu in h5_close
scsi: cxgb4i: Fix TLS dependency
cgroup: Fix memory leak when parsing multiple source parameters
of: fix linker-section match-table corruption
null_blk: Fix zone size initialization
tools headers UAPI: Sync linux/const.h with the kernel headers
uapi: move constants from <linux/kernel.h> to <linux/const.h>
scsi: block: Fix a race in the runtime power management code
jffs2: Fix NULL pointer dereference in rp_size fs option parsing
jffs2: Allow setting rp_size to zero during remounting
powerpc/bitops: Fix possible undefined behaviour with fls() and fls64()
KVM: x86: reinstate vendor-agnostic check on SPEC_CTRL cpuid bits
KVM: SVM: relax conditions for allowing MSR_IA32_SPEC_CTRL accesses
KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL
ext4: don't remount read-only with errors=continue on reboot
btrfs: fix race when defragmenting leads to unnecessary IO
vfio/pci: Move dummy_resources_list init in vfio_pci_probe()
fscrypt: remove kernel-internal constants from UAPI header
fscrypt: add fscrypt_is_nokey_name()
f2fs: prevent creating duplicate encrypted filenames
ubifs: prevent creating duplicate encrypted filenames
ext4: prevent creating duplicate encrypted filenames
thermal/drivers/cpufreq_cooling: Update cpufreq_state only if state has changed
md/raid10: initialize r10_bio->read_slot before use.
net/sched: sch_taprio: reset child qdiscs before freeing them
Conflicts:
Documentation/devicetree/bindings
Documentation/devicetree/bindings/net/btusb.txt
Documentation/devicetree/bindings/net/ethernet-controller.yaml
arch/arm/kernel/smccc-call.S
arch/arm64/kernel/cpufeature.c
block/blk-pm.c
drivers/dma-buf/dma-buf.c
drivers/iommu/arm-smmu.c
drivers/md/dm-verity-target.c
drivers/spmi/spmi-pmic-arb.c
drivers/staging/exfat/exfat_super.c
drivers/usb/core/hub.c
drivers/usb/dwc3/core.c
drivers/usb/dwc3/debugfs.c
drivers/usb/dwc3/gadget.c
drivers/usb/gadget/function/f_fs.c
drivers/usb/gadget/function/f_uac2.c
drivers/usb/gadget/function/u_audio.c
fs/fuse/fuse_i.h
include/linux/mm.h
kernel/cpu.c
kernel/sched/fair.c
kernel/workqueue.c
mm/huge_memory.c
net/ipv4/tcp_timer.c
net/qrtr/qrtr.c
net/qrtr/tun.c
security/selinux/avc.c
Fixed build errors.
Change-Id: I8c05a8523ac57cedf52589a41ec4c582fd512a26
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
From include/linux/mm.h (tail of the file, blame-view residue removed):

/**
 * seal_check_future_write - Check for F_SEAL_FUTURE_WRITE flag and handle it
 * @seals: the seals to check
 * @vma: the vma to operate on
 *
 * Check whether F_SEAL_FUTURE_WRITE is set; if so, do proper check/handling on
 * the vma flags. Return 0 if check pass, or <0 for errors.
 */
static inline int seal_check_future_write(int seals, struct vm_area_struct *vma)
{
	if (seals & F_SEAL_FUTURE_WRITE) {
		/*
		 * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
		 * "future write" seal active.
		 */
		if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE))
			return -EPERM;

		/*
		 * Since an F_SEAL_FUTURE_WRITE sealed memfd can be mapped as
		 * MAP_SHARED and read-only, take care to not allow mprotect to
		 * revert protections on such mappings. Do this only for shared
		 * mappings. For private mappings, don't need to mask
		 * VM_MAYWRITE as we still want them to be COW-writable.
		 */
		if (vma->vm_flags & VM_SHARED)
			vma->vm_flags &= ~(VM_MAYWRITE);
	}

	return 0;
}

#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
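For context, seal_check_future_write() is intended to run from a filesystem's ->mmap hook before the mapping is otherwise set up, so that a shared writable mapping of an F_SEAL_FUTURE_WRITE-sealed file is rejected (or has VM_MAYWRITE cleared) up front. The sketch below is illustrative only: examplefs_mmap() and examplefs_inode_seals() are hypothetical names standing in for however a memfd-style filesystem actually stores its seals; the real in-tree callers keep the seals in their own inode info structures.

#include <linux/mm.h>
#include <linux/fs.h>

/*
 * Hypothetical ->mmap implementation showing where the helper fits.
 * examplefs_inode_seals() is assumed to return the F_SEAL_* bits the
 * filesystem tracks for this inode.
 */
static int examplefs_mmap(struct file *file, struct vm_area_struct *vma)
{
	int seals = examplefs_inode_seals(file_inode(file));
	int ret;

	/* Reject or downgrade the mapping before any further setup. */
	ret = seal_check_future_write(seals, vma);
	if (ret)
		return ret;

	/* Normal mmap setup continues here. */
	file_accessed(file);
	vma->vm_ops = &generic_file_vm_ops;
	return 0;
}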