13740 Commits

Author SHA1 Message Date
0154ec71d5 Merge 5.2-rc4 into char-misc-next
We want the char/misc driver fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-09 09:11:21 +02:00
cb1aaebea8 docs: fix broken documentation links
Mostly due to x86 and acpi conversion, several documentation
links are still pointing to the old file. Fix them.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Reviewed-by: Wolfram Sang <wsa@the-dreams.de>
Reviewed-by: Sven Van Asbroeck <TheSven73@gmail.com>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-08 13:42:13 -06:00
6d7c3cde93 mm/hmm: fix use after free with struct hmm in the mmu notifiers
mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
system will continue to reference hmm->mn until the srcu grace period
expires, resulting in use-after-free races like this:

         CPU0                                     CPU1
                                               __mmu_notifier_invalidate_range_start()
                                                 srcu_read_lock
                                                 hlist_for_each ()
                                                   // mn == hmm->mn
hmm_mirror_unregister()
  hmm_put()
    hmm_free()
      mmu_notifier_unregister_no_release()
         hlist_del_init_rcu(hmm-mn->list)
			                           mn->ops->invalidate_range_start(mn, range);
					             mm_get_hmm()
      mm->hmm = NULL;
      kfree(hmm)
                                                     mutex_lock(&hmm->lock);

Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
existing. Get the now-safe hmm struct through container_of and directly
check kref_get_unless_zero to lock it against free.
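
The "take a reference unless the count already hit zero" step described above can be illustrated in user space. This is only a minimal sketch using C11 atomics: struct obj, obj_get_unless_zero() and obj_put() are hypothetical names, not the kernel's kref API, and the sketch assumes the memory itself stays valid while readers may still look at it (the role the SRCU-deferred kfree plays in the commit).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object guarded by a reference count. */
struct obj {
	atomic_int refcount;
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns false once the last reference has been dropped, so a
 * late reader never revives an object that is being torn down.
 */
bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool obj_put(struct obj *o)
{
	int prev = atomic_fetch_sub(&o->refcount, 1);

	assert(prev > 0);	/* dropping below zero would be a bug */
	return prev == 1;
}
```

A reader that gets false from obj_get_unless_zero() must treat the object as gone and back off, which is the behaviour the notifier callbacks need during teardown.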

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Philip Yang <Philip.Yang@amd.com>
2019-06-07 10:47:24 -03:00
9b1ae605c8 mm/hmm: Only set FAULT_FLAG_ALLOW_RETRY for non-blocking
Don't set this flag by default in hmm_vma_do_fault. It is set
conditionally just a few lines below. Setting it unconditionally can lead
to handle_mm_fault doing a non-blocking fault, returning -EBUSY and
unlocking mmap_sem unexpectedly.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-06 16:31:41 -03:00
789c2af88f mm/hmm: support automatic NUMA balancing
While a page is being migrated by NUMA balancing, HMM fails to detect this
condition and still returns the old page. The application will use the
newly migrated page, but the driver passes the old page's physical address
to the GPU, which crashes the application later.

Use pte_protnone(pte) to detect this condition, so that hmm_vma_do_fault
will allocate a new page.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-06 16:31:41 -03:00
085ea25064 mm/hmm: clean up some coding style and comments
There are no functional changes, just some coding style clean ups and
minor comment changes.

Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-06 16:31:40 -03:00
1c2308f0f0 mm/hmm.c: suppress compilation warnings when CONFIG_HUGETLB_PAGE is not set
gcc reports that several variables are defined but not used.

For the first hunk, without CONFIG_HUGETLB_PAGE the entire if block is
already protected by pud_huge(), which is forced to 0.  None of the stuff under the
ifdef causes compilation problems as it is already stubbed out in the
header files.

For the second hunk the dummy huge_page_shift macro doesn't touch the
argument, so just inline the argument.

Link: http://lkml.kernel.org/r/20190522195151.GA23955@ziepe.ca
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2019-06-06 16:29:59 -03:00
b886d83c5b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation version 2 of the license

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 315 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:17 +02:00
55716d2643 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428
Based on 1 normalized pattern(s):

  this file is released under the gplv2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 68 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:16 +02:00
b2139ce04f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 403
Based on 1 normalized pattern(s):

  this software may be redistributed and or modified under the terms
  of the gnu general public license gpl version 2 as published by the
  free software foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190112.039124428@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:13 +02:00
4505153954 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 333
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 59 temple place suite 330 boston ma 02111
  1307 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 136 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.384967451@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:06 +02:00
1439f94c54 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 263
Based on 1 normalized pattern(s):

  this software may be redistributed and or modified under the terms
  of the gnu general public license gpl version 2 only as published by
  the free software foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141333.676969322@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:30:28 +02:00
64ae0e71c6 mm/zsmalloc.c: remove unused variable
The variable 'entry' is no longer used and the compiler rightly complains
that it should be removed.

../mm/zsmalloc.c: In function `zs_pool_stat_create':
../mm/zsmalloc.c:648:17: warning: unused variable `entry' [-Wunused-variable]
  struct dentry *entry;
                 ^~~~~

Rework to remove the unused variable.

Link: http://lkml.kernel.org/r/20190604065826.26064-1-anders.roxell@linaro.org
Fixes: 4268509a36a7 ("zsmalloc: no need to check return value of debugfs_create functions")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 16:36:49 +02:00
2d146b924e backing-dev: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

And as the return value does not matter at all, no need to save the
dentry in struct backing_dev_info, so delete it.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 15:49:07 +02:00
c4e41349a8 mm: cleancache: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: linux-mm@kvack.org
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 15:49:07 +02:00
2fcc6e202a hwpoison-inject: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: linux-mm@kvack.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 15:39:40 +02:00
282401df90 mm: kmemleak: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-mm@kvack.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 15:39:39 +02:00
4268509a36 zsmalloc: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 15:39:39 +02:00
de2fadf566 zswap: ignore debugfs_create_dir() return value
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Seth Jennings <sjenning@redhat.com>
Cc: linux-mm@kvack.org
Acked-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-03 15:39:39 +02:00
31e67340cc mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
In a rare case, flush_tlb_kernel_range() could be called with a start
higher than the end.

In vm_remove_mappings(), in case page_address() returns 0 for all pages
(for example they were all in highmem), _vm_unmap_aliases() will be
called with start = ULONG_MAX, end = 0 and flush = 1.

If at the same time, the vmalloc purge operation is triggered by something
else while the current operation is between remove_vm_area() and
_vm_unmap_aliases(), then the vm mapping just removed will be already
purged. In this case the call of vm_unmap_aliases() may not find any other
mappings to flush and so end up flushing start = ULONG_MAX, end = 0. So
only set flush = true if we find something in the direct mapping that we
need to flush, and this way this can't happen.
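
The idea of requesting a flush only when a real address contributed to the range can be sketched in user space. This is a hedged illustration, not the code in mm/vmalloc.c: struct flush_range, collect_flush_range() and SKETCH_PAGE_SIZE are hypothetical names, and a zero entry in the input stands in for a page whose page_address() is 0.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for illustration */

struct flush_range {
	unsigned long start;
	unsigned long end;
	bool flush;
};

/*
 * Accumulate the range covered by the non-zero addresses in addrs[].
 * A zero entry models a page with no direct-map address (e.g. highmem).
 */
struct flush_range collect_flush_range(const unsigned long *addrs, size_t n)
{
	struct flush_range r = { .start = ULONG_MAX, .end = 0, .flush = false };

	assert(addrs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		unsigned long addr = addrs[i];

		if (!addr)
			continue;
		if (addr < r.start)
			r.start = addr;
		if (addr + SKETCH_PAGE_SIZE > r.end)
			r.end = addr + SKETCH_PAGE_SIZE;
		/* Request a flush only once a real address was seen. */
		r.flush = true;
	}
	return r;
}
```

When every entry is zero the range keeps its initial start = ULONG_MAX, end = 0 values, but flush stays false, so a caller that checks the flag never acts on the inverted range.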

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Meelis Roos <mroos@linux.ee>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
Link: https://lkml.kernel.org/r/20190527211058.2729-3-rick.p.edgecombe@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:47:25 +02:00
8e41f8726d mm/vmalloc: Fix calculation of direct map addr range
The calculation of the direct map address range to flush was wrong.
This could cause the RO direct map alias to not get flushed. Today
this shouldn't be a problem because this flush is only needed on x86
right now and the spurious fault handler will fix cached RO->RW
translations. In the future though, it could cause the permissions
to remain RO in the TLB for the direct map alias, and then the page
would return from the page allocator to some other component as RO
and cause a crash.

So fix the address range calculation so the flush will include the
direct map range.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Meelis Roos <mroos@linux.ee>
Cc: Nadav Amit <namit@vmware.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
Link: https://lkml.kernel.org/r/20190527211058.2729-2-rick.p.edgecombe@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:47:25 +02:00
e577c8b64d mm, compaction: make sure we isolate a valid PFN
When we have holes in a normal memory zone, we could end up having
cached_migrate_pfns which may not necessarily be valid, under heavy memory
pressure with swapping enabled (via __reset_isolation_suitable(),
triggered by kswapd).

Later if we fail to find a page via fast_isolate_freepages(), we may end
up using the migrate_pfn we started the search with, as a valid page.  This
could lead to NULL pointer dereferences like the one below, due to an
invalid mem_section pointer.

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
 Mem abort info:
   ESR = 0x96000004
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9
 [0000000000000008] pgd=0000000000000000
 Internal error: Oops: 96000004 [#1] SMP
 ...
 CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
 Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018
 pstate: 60000005 (nZCv daif -PAN -UAO)
 pc : set_pfnblock_flags_mask+0x58/0xe8
 lr : compaction_alloc+0x300/0x950
 [...]
 Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5)
 Call trace:
  set_pfnblock_flags_mask+0x58/0xe8
  compaction_alloc+0x300/0x950
  migrate_pages+0x1a4/0xbb0
  compact_zone+0x750/0xde8
  compact_zone_order+0xd8/0x118
  try_to_compact_pages+0xb4/0x290
  __alloc_pages_direct_compact+0x84/0x1e0
  __alloc_pages_nodemask+0x5e0/0xe18
  alloc_pages_vma+0x1cc/0x210
  do_huge_pmd_anonymous_page+0x108/0x7c8
  __handle_mm_fault+0xdd4/0x1190
  handle_mm_fault+0x114/0x1c0
  __get_user_pages+0x198/0x3c0
  get_user_pages_unlocked+0xb4/0x1d8
  __gfn_to_pfn_memslot+0x12c/0x3b8
  gfn_to_pfn_prot+0x4c/0x60
  kvm_handle_guest_abort+0x4b0/0xcd8
  handle_exit+0x140/0x1b8
  kvm_arch_vcpu_ioctl_run+0x260/0x768
  kvm_vcpu_ioctl+0x490/0x898
  do_vfs_ioctl+0xc4/0x898
  ksys_ioctl+0x8c/0xa0
  __arm64_sys_ioctl+0x28/0x38
  el0_svc_common+0x74/0x118
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc
 Code: f8607840 f100001f 8b011401 9a801020 (f9400400)
 ---[ end trace af6a35219325a9b6 ]---

The issue was reported on an arm64 server with 128GB with holes in the
zone (e.g., [32GB@4GB, 96GB@544GB]), with a swap device enabled, while
running 100 KVM guest instances.

This patch fixes the issue by ensuring that the page belongs to a valid
PFN when we fall back to using the lower limit of the scan range upon
failure in fast_isolate_freepages().
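
The validity check before falling back can be sketched in user space. This is an illustration under stated assumptions, not the compaction code: struct sketch_span, sketch_pfn_valid() and sketch_pick_fallback() are hypothetical names, and the spans used in the usage note are derived from the layout quoted above assuming 4 KiB pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one populated PFN span: [start_pfn, end_pfn). */
struct sketch_span {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* True when 'pfn' falls inside one of the populated spans. */
bool sketch_pfn_valid(const struct sketch_span *spans, size_t n,
		      unsigned long pfn)
{
	for (size_t i = 0; i < n; i++) {
		if (pfn >= spans[i].start_pfn && pfn < spans[i].end_pfn)
			return true;
	}
	return false;
}

/*
 * Use 'low_pfn' as the fallback only when it is backed by memory.
 * Returns false (and leaves *out untouched) when it lies in a hole.
 */
bool sketch_pick_fallback(const struct sketch_span *spans, size_t n,
			  unsigned long low_pfn, unsigned long *out)
{
	assert(out != NULL);

	if (!sketch_pfn_valid(spans, n, low_pfn))
		return false;
	*out = low_pfn;
	return true;
}
```

With spans { 0x100000, 0x900000 } and { 0x8800000, 0xA000000 }, a fallback PFN of 0x900000 sits in the hole between the two and is rejected instead of being handed on as if it were a valid page.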

Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com
Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:32 -07:00
0600597c85 kasan: initialize tag to 0xff in __kasan_kmalloc
When building with -Wuninitialized and CONFIG_KASAN_SW_TAGS unset, Clang
warns:

mm/kasan/common.c:484:40: warning: variable 'tag' is uninitialized when
used here [-Wuninitialized]
        kasan_unpoison_shadow(set_tag(object, tag), size);
                                              ^~~

set_tag ignores tag in this configuration but clang doesn't realize it at
this point in its pipeline, as it points to arch_kasan_set_tag as being
the point where it is used, which will later be expanded to (void
*)(object) without a use of tag.  Initialize tag to 0xff, as it removes
this warning and doesn't change the meaning of the code.

Link: https://github.com/ClangBuiltLinux/linux/issues/465
Link: http://lkml.kernel.org/r/20190502163057.6603-1-natechancellor@gmail.com
Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode")
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
bb9f6f63f3 z3fold: fix scheduling while atomic
kmem_cache_alloc() may be called from z3fold_alloc() in atomic context, so
we need to pass the correct gfp flags to avoid a "scheduling while atomic" bug.

Link: http://lkml.kernel.org/r/20190523153245.119dfeed55927e8755250ddd@gmail.com
Fixes: 7c2b8baa61fe5 ("mm/z3fold.c: add structure for buddy handles")
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
df17277b2a mm/gup: continue VM_FAULT_RETRY processing even for pre-faults
When get_user_pages*() is called with pages = NULL, the processing of
VM_FAULT_RETRY terminates early without actually retrying to fault-in all
the pages.

If the pages in the requested range belong to a VMA that has userfaultfd
registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
has populated the page, but for the gup pre-fault case there's no actual
retry and the caller will get no pages although they are present.

This issue was uncovered when running post-copy memory restore in CRIU
after d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
copy_fpstate_to_sigframe() fails").

After this change, the copying of FPU state to the sigframe switched from
copy_to_user() variants which caused a real page fault to get_user_pages()
with pages parameter set to NULL.

In post-copy mode of CRIU, the destination memory is managed with
userfaultfd and lack of the retry for pre-fault case in get_user_pages()
causes a crash of the restored process.

Making the pre-fault behavior of get_user_pages() the same as the "normal"
one fixes the issue.

Link: http://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com
Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Andrei Vagin <avagin@gmail.com> [https://travis-ci.org/avagin/linux/builds/533184940]
Tested-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
3e85899637 memcg: make it work on sparse non-0-node systems
We have a single node system with node 0 disabled:
  Scanning NUMA topology in Northbridge 24
  Number of physical nodes 2
  Skipping disabled node 0
  Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
  NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]

This causes crashes in memcg when system boots:
  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  #PF error: [normal kernel read fault]
...
  RIP: 0010:list_lru_add+0x94/0x170
...
  Call Trace:
   d_lru_add+0x44/0x50
   dput.part.34+0xfc/0x110
   __fput+0x108/0x230
   task_work_run+0x9f/0xc0
   exit_to_usermode_loop+0xf5/0x100

It is reproducible as far back as 4.12.  I did not try older kernels.  You have
to have a new enough systemd, e.g.  241 (the reason is unknown -- was not
investigated).  Cannot be reproduced with systemd 234.

The system crashes because the size of lru array is never updated in
memcg_update_all_list_lrus and the reads are past the zero-sized array,
causing dereferences of random memory.

The root cause is the list_lru_memcg_aware checks in the list_lru code.  The
test in list_lru_memcg_aware is broken: it assumes node 0 is always
present, but that is not true on some systems, as can be seen above.

So fix this by avoiding checks on node 0.  Remember the memcg-awareness by
a bool flag in struct list_lru.
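
The difference between inferring a property from node 0 and recording it in a flag can be sketched as follows. This is a simplified illustration only: struct sketch_lru and both helper functions are hypothetical and do not mirror the real struct list_lru layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_NODES 4	/* assumed node count for illustration */

/* Hypothetical, simplified stand-in for a per-node LRU container. */
struct sketch_lru {
	void *node_lists[SKETCH_MAX_NODES];	/* NULL when a node is absent */
	bool memcg_aware;			/* recorded once at init time */
};

/* Fragile variant: infers awareness from node 0, which may be absent. */
bool aware_by_node0(const struct sketch_lru *lru)
{
	assert(lru != NULL);
	return lru->node_lists[0] != NULL;
}

/* Robust variant: reads the flag stored when the LRU was set up. */
bool aware_by_flag(const struct sketch_lru *lru)
{
	assert(lru != NULL);
	return lru->memcg_aware;
}
```

On a system where only node 1 is populated, the node-0 probe reports false even though the LRU was set up as memcg-aware, while the stored flag gives the right answer regardless of topology.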

Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz
Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists")
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
bc81426f5b prctl_set_mm: downgrade mmap_sem to read lock
The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap
semaphore taken.") added synchronization of reading argument/environment
boundaries under mmap_sem.  Later commit 88aa7cc688d4 ("mm: introduce
arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided
the coarse use of mmap_sem in similar situations.  But there still
remained two places that (mis)use mmap_sem.

get_cmdline should also use arg_lock instead of mmap_sem when it reads the
boundaries.

The second place that should use arg_lock is in prctl_set_mm.  By
protecting the boundaries fields with the arg_lock, we can downgrade
mmap_sem to reader lock (analogous to what we already do in
prctl_set_mm_map).

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com
Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Co-developed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
3806b04144 mm/vmalloc.c: fix typo in comment
Reported-by: Nicholas Joll <najoll@posteo.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
46aeb7e6c1 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 225
Based on 1 normalized pattern(s):

  subject to the gnu public license version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528171440.319650492@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:29:56 -07:00
c942fddf87 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157
Based on 3 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version [author] [kishon] [vijay] [abraham]
  [i] [kishon]@[ti] [com] this program is distributed in the hope that
  it will be useful but without any warranty without even the implied
  warranty of merchantability or fitness for a particular purpose see
  the gnu general public license for more details

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version [author] [graeme] [gregory]
  [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
  [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
  [hk] [hemahk]@[ti] [com] this program is distributed in the hope
  that it will be useful but without any warranty without even the
  implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1105 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:37 -07:00
2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
f8eac9011b signal: Remove task parameter from force_sig_mceerr
All of the callers pass current into force_sig_mceerr so remove the
task parameter to make this obvious.

This also makes it clear that force_sig_mceerr passes current
into force_sig_info.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2019-05-27 09:36:28 -05:00
3d7081822f uaccess: Add non-pagefault user-space read functions
Add probe_user_read(), strncpy_from_unsafe_user() and
strnlen_unsafe_user(), which allow callers to access user-space
memory in IRQ context.

The current probe_kernel_read() and strncpy_from_unsafe() are
not usable for user-space memory, because they set
KERNEL_DS while accessing data. On some architectures the user address
space and the kernel address space can co-exist, but on others
they cannot. In that case, setting KERNEL_DS means the given
address is treated as a kernel-space address.
Also, strnlen_user() is only available from user context since
it can sleep if page faults are enabled.

To access user-space memory without page faults, we need
these new functions, which set USER_DS while accessing
the data.

Link: http://lkml.kernel.org/r/155789869802.26965.4940338412595759063.stgit@devnote2

Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2019-05-25 23:04:42 -04:00
ea8157ab2a zsfold: Convert zsfold to use the new mount API
Convert the z3fold filesystem to the new internal mount API as the old one
will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-25 18:06:01 -04:00
8e9231f819 vfs: Convert zsmalloc to use the new mount API
Convert the zsmalloc filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Minchan Kim <minchan@kernel.org>
cc: Nitin Gupta <ngupta@vflare.org>
cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
cc: linux-mm@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 18:00:07 -04:00
1f58bb18f6 mount_pseudo(): drop 'name' argument, switch to d_make_root()
Once upon a time we used to set ->d_name of e.g. pipefs root
so that d_path() on pipes would work.  These days it's
completely pointless - dentries of pipes are not even connected
to pipefs root.  However, mount_pseudo() had set the root
dentry name (passed as the second argument) and callers
kept inventing names to pass to it.  Including those that
didn't *have* any non-root dentries to start with...

All of that had been pointless for about 8 years now; it's
time to get rid of that cargo-culting...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25 17:59:24 -04:00
418a3ab1e7 mm/balloon_compaction: List interfaces
Introduce interfaces for ballooning enqueueing and dequeueing of a list
of pages. These interfaces reduce the overhead of storing and restoring
IRQs by batching the operations. In addition they do not panic if the
list of pages is empty.

Cc: Jason Wang <jasowang@redhat.com>
Cc: linux-mm@kvack.org
Cc: virtualization@lists.linux-foundation.org
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 20:19:17 +02:00
8607a96520 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your optional any later version of the license

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520075212.713472955@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:37:54 +02:00
ec8f24b7fa treewide: Add SPDX license identifier - Makefile/Kconfig
Add SPDX license identifiers to all Make/Kconfig files which:

 - Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:46 +02:00
09c434b8a0 treewide: Add SPDX license identifier for more missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have MODULE_LICENSE("GPL*") inside which was used in the initial
   scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
4de1e3a8ec z3fold: don't bother with dentry_operations
Don't bother with dentry_operations as no dentry is ever allocated.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-21 08:22:17 +01:00
48b48750c3 zsmalloc: don't bother with dentry_operations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-20 14:09:46 +01:00
cb6f8739fb Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:
 "A few final bits:

   - large changes to vmalloc, yielding large performance benefits

   - tweak the console-flush-on-panic code

   - a few fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  panic: add an option to replay all the printk message in buffer
  initramfs: don't free a non-existent initrd
  fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
  mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
  mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
  mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
  mm/vmalloc.c: keep track of free blocks for vmap allocation
2019-05-19 12:15:32 -07:00
1335d9a1fb Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Ingo Molnar:
 "This fixes a particularly thorny munmap() bug with MPX, plus fixes a
  host build environment assumption in objtool"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Allow AR to be overridden with HOSTAR
  x86/mpx, mm/core: Fix recursive munmap() corruption
2019-05-19 10:23:24 -07:00
60fce36afa mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
syzbot reported the following error from a tree with a head commit of
baf76f0c58ae ("slip: make slhc_free() silently accept an error pointer")

  BUG: unable to handle kernel paging request at ffffea0003348000
  #PF error: [normal kernel read fault]
  PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
  RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
  RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
  Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
  ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 <4d> 8b 2c 24 31 ff 49
  c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
  RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
  RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
  RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
  RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
  R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
  FS:  00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
  Call Trace:
   fast_isolate_around mm/compaction.c:1243 [inline]
   fast_isolate_freepages mm/compaction.c:1418 [inline]
   isolate_freepages mm/compaction.c:1438 [inline]
   compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550

There is no reproducer and it is difficult to hit -- 1 crash every few
days.  The issue is very similar to the fix in commit 6b0868c820ff
("mm/compaction.c: correct zone boundary handling when resetting pageblock
skip hints").  When isolating free pages around a target pageblock, the
boundary handling is off by one and can stray into the next pageblock.
Triggering the syzbot error requires that the end of pageblock is section
or zone aligned, and that the next section is unpopulated.

A more subtle consequence of the bug is that pageblocks were being
improperly used as migration targets which potentially hurts fragmentation
avoidance in the long-term one page at a time.

A debugging patch revealed that it's definitely possible to stray outside
of a pageblock which is not intended.  While syzbot cannot be used to
verify this patch, it was confirmed that the debugging warning no longer
triggers with this patch applied.  It has also been confirmed that the THP
allocation stress tests are not degraded by this patch.

Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org> # v5.1+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-18 15:52:26 -07:00
a6cf4e0fe3 mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
This macro adds some debug code to check that vmap allocations
happen in ascending order.

By default this option is set to 0 and is not active.  Activating it
requires recompiling the kernel: set it to 1 and rebuild the
kernel.

[urezki@gmail.com: v4]
  Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-18 15:52:26 -07:00
bb850f4dae mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
This macro adds some debug code to check that the augmented tree is
maintained correctly, meaning that every node contains a valid
subtree_max_size value.

By default this option is set to 0 and is not active.  Activating it
requires recompiling the kernel: set it to 1 and rebuild the
kernel.

[urezki@gmail.com: v4]
  Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-18 15:52:26 -07:00
68ad4a3304 mm/vmalloc.c: keep track of free blocks for vmap allocation
Patch series "improve vmap allocation", v3.

Objective
---------

Please have a look for the description at:

  https://lkml.org/lkml/2018/10/19/786

but let me also summarize it a bit here as well.

The current implementation has O(N) complexity. Requests with different
permissive parameters can lead to long allocation times. When I say
"long" I mean milliseconds.

Description
-----------

This approach organizes the KVA memory layout into free areas of the
1-ULONG_MAX range, i.e.  an allocation is done by looking up free areas
instead of finding a hole between two busy blocks.  This allows a
lower number of objects to represent the free space and therefore a
less fragmented memory allocator, because free blocks are always as large
as possible.

It uses an augmented tree where all free areas are sorted in ascending
order of va->va_start address, paired with a linked list that provides
O(1) access to prev/next elements.

Since the tree is augmented, we also maintain the "subtree_max_size" of
each VA, which reflects the maximum available free block in its left or
right sub-tree.  Knowing that, we can easily traverse toward the lowest
(left-most path) free area.

Allocation: ~O(log(N)) complexity.  It is a sequential allocation method
and therefore tends to maximize locality.  The search continues until the
first block large enough to satisfy the requested parameters is found.
Bigger areas are split.
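
The lowest-match lookup described above can be sketched roughly as
follows.  This is only an illustration under simplifying assumptions: a
plain binary search tree stands in for the red-black tree, alignment and
vstart are ignored, and struct free_area and find_lowest_match() are
hypothetical names rather than the code in the patch.

```c
#include <stddef.h>

/* Hypothetical free-area node; the layout is an assumption made for
 * illustration and does not mirror the kernel's struct vmap_area. */
struct free_area {
	unsigned long va_start;
	unsigned long va_end;           /* exclusive */
	unsigned long subtree_max_size; /* largest free block in this subtree */
	struct free_area *left, *right; /* tree ordered by va_start */
};

static unsigned long area_size(const struct free_area *va)
{
	return va->va_end - va->va_start;
}

static unsigned long subtree_max(const struct free_area *va)
{
	return va ? va->subtree_max_size : 0;
}

/* Return the lowest-address free area of at least 'size' bytes, or NULL. */
static struct free_area *find_lowest_match(struct free_area *node,
					   unsigned long size)
{
	while (node) {
		if (subtree_max(node->left) >= size)
			node = node->left;   /* a lower address can fit */
		else if (area_size(node) >= size)
			return node;         /* nothing lower fits, this one does */
		else if (subtree_max(node->right) >= size)
			node = node->right;  /* only higher addresses can fit */
		else
			return NULL;         /* no block in this subtree fits */
	}
	return NULL;
}
```

Because each step either returns or descends one level, the walk is
bounded by the tree height, which is where the ~O(log(N)) figure comes
from when the tree is balanced.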

I copy and paste here the description of how the area is split, since I
described it in https://lkml.org/lkml/2018/10/19/786

<snip>

A free block can be split in three different ways.  Their names are
FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e.  they
correspond to how the requested size and alignment fit into a free block.

FL_FIT_TYPE - in this case a free block is just removed from the free
list/tree because it fully fits.  Compared with the current design there is
extra work to update the rb-tree.

LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit.  In this case we
just cut the free block.  It is as fast as the current design.  Most of
the vmalloc allocations end up in this case, because the edge is
always aligned to 1.

NE_FIT_TYPE - a much less common case.  It happens when the
requested size and alignment fit neither the left nor the right edge, i.e.
the allocation lies between them.  In this case, during splitting we have
to build a remaining left free area and place it back into the free
list/tree.

Compared with the current design there are two extra steps.  First, we
have to allocate a new vmap_area structure.  Second, we have to insert
that remaining free block into the address-sorted list/tree.

To optimize the first step there is a cache of free_vmap objects.
Instead of allocating from slab we just take an object from the cache and
reuse it.

The second step is already well optimized.  Since we know the starting
point in the tree, we do not search from the top.  Instead the traversal
begins from the rb-tree node we split.
<snip>
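
The classification quoted above can be sketched as a small standalone
function.  This is a hedged illustration only: classify_fit() and the
NOTHING_FIT value are names assumed here for the example, and the
function takes plain addresses instead of kernel structures.

```c
/* Fit types named in the text, plus NOTHING_FIT (an assumption of this
 * sketch) for a request that does not lie inside the free block. */
enum fit_type {
	NOTHING_FIT,
	FL_FIT_TYPE, /* request covers the whole free block */
	LE_FIT_TYPE, /* request starts at the left edge */
	RE_FIT_TYPE, /* request ends at the right edge */
	NE_FIT_TYPE  /* request sits strictly inside the block */
};

/* Classify request [nva_start, nva_start + size) against the free block
 * [va_start, va_end).  Both ranges use an exclusive end. */
static enum fit_type classify_fit(unsigned long va_start,
				  unsigned long va_end,
				  unsigned long nva_start,
				  unsigned long size)
{
	unsigned long nva_end = nva_start + size;

	/* Reject empty requests and address wrap-around. */
	if (size == 0 || nva_end < nva_start)
		return NOTHING_FIT;

	/* The request must lie entirely inside the free block. */
	if (nva_start < va_start || nva_end > va_end)
		return NOTHING_FIT;

	if (va_start == nva_start)
		return va_end == nva_end ? FL_FIT_TYPE : LE_FIT_TYPE;
	if (va_end == nva_end)
		return RE_FIT_TYPE;
	return NE_FIT_TYPE;
}
```

Only the NE_FIT_TYPE result leaves two free remainders, which is why it
is the one case that needs an additional object for the split-off part.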

De-allocation: ~O(log(N)) complexity.  An area is not inserted straight
away into the tree/list; instead we identify the spot first, checking
whether it can be merged with its neighbors.  The list provides O(1)
access to prev/next, so this check is fast.  In summary: if merged,
large coalesced areas are created; if not, the area is just linked in,
making more fragments.
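
The merge-or-link decision can be sketched with an address-sorted doubly
linked list alone.  This is a simplified illustration, not the patch's
code: the spot is found here by a linear walk instead of the rb-tree, the
released range is assumed not to overlap any existing one, and
struct free_range and release_range() are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical free range [start, end) kept in an address-sorted list. */
struct free_range {
	unsigned long start, end;
	struct free_range *prev, *next;
};

/* Give [start, end) back to the list at *head, merging with the previous
 * and/or next range when they are adjacent.  Returns the node that now
 * holds the range, or NULL if a new node could not be allocated. */
static struct free_range *release_range(struct free_range **head,
					unsigned long start,
					unsigned long end)
{
	struct free_range *prev = NULL, *next = *head, *node = NULL;

	/* Identify the spot: prev ends below us, next starts above us. */
	while (next && next->start < start) {
		prev = next;
		next = next->next;
	}

	/* Merge with the left neighbor if it ends where we start. */
	if (prev && prev->end == start) {
		prev->end = end;
		node = prev;
	}

	/* Merge with the right neighbor if it starts where we end. */
	if (next && next->start == end) {
		if (node) {
			/* We bridged both neighbors: absorb next into prev. */
			node->end = next->end;
			node->next = next->next;
			if (next->next)
				next->next->prev = node;
			free(next);
		} else {
			next->start = start;
			node = next;
		}
	}
	if (node)
		return node;

	/* No neighbor was adjacent: link a new fragment in address order. */
	node = malloc(sizeof(*node));
	if (!node)
		return NULL;
	node->start = start;
	node->end = end;
	node->prev = prev;
	node->next = next;
	if (prev)
		prev->next = node;
	else
		*head = node;
	if (next)
		next->prev = node;
	return node;
}
```

Once the spot is known, each merge check is a single comparison against
prev or next, which is the O(1) part mentioned above.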

There is one more thing that I should mention here.  After a VA node is
modified, its subtree_max_size is updated if it was/is the biggest area in
its left or right sub-tree.  Apart from that, the update can also be
propagated back to upper levels to fix the tree.  For more details please
have a look at the __augment_tree_propagate_from() function and its
description.
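
The upward fix-up can be sketched as below.  This is only an illustration
of the idea and not the __augment_tree_propagate_from() implementation:
struct va_node, compute_subtree_max() and propagate_from() are
hypothetical names, and parent pointers are stored explicitly.

```c
#include <stddef.h>

/* Hypothetical tree node with an explicit parent pointer. */
struct va_node {
	unsigned long va_start, va_end;  /* free range [va_start, va_end) */
	unsigned long subtree_max_size;
	struct va_node *parent, *left, *right;
};

/* Largest free block among the node itself and its two subtrees. */
static unsigned long compute_subtree_max(const struct va_node *n)
{
	unsigned long max = n->va_end - n->va_start;

	if (n->left && n->left->subtree_max_size > max)
		max = n->left->subtree_max_size;
	if (n->right && n->right->subtree_max_size > max)
		max = n->right->subtree_max_size;
	return max;
}

/* After 'n' was resized, recompute subtree_max_size from 'n' upward.
 * Assumes every ancestor was consistent before the change, so the walk
 * can stop at the first node whose value does not change. */
static void propagate_from(struct va_node *n)
{
	while (n) {
		unsigned long max = compute_subtree_max(n);

		if (n->subtree_max_size == max)
			break;
		n->subtree_max_size = max;
		n = n->parent;
	}
}
```

The early exit keeps the common case cheap: a change that does not alter
a node's subtree maximum cannot affect any of its ancestors.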

Tests and stressing
-------------------

I use the "test_vmalloc.sh" test driver available under
"tools/testing/selftests/vm/" since the 5.1-rc1 kernel.  Just run "sudo
./test_vmalloc.sh" to find out how to use it.

Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
Regarding the last one, I do not have physical access to a NUMA system,
therefore I emulated it.  The stress tests ran for days.

If you run the test driver in "stress mode", you also need the patch that
is in Andrew's tree but not in Linux 5.1-rc1.  So, please apply it:

http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c

After massive testing, I have not identified any problems like memory
leaks, crashes or kernel panics.  I find it stable, but more testing would
be good.

Performance analysis
--------------------

I have used two systems to test.  One is an i5-3320M CPU @ 2.60GHz and
the other is a HiKey960 (arm64) board.  The i5-3320M runs a 4.20 kernel,
whereas the Hikey960 uses a 4.15 kernel.  Both systems could run 5.1-rc1
as well, but those results were not ready at the time I am writing this.

Currently the suite consists of 8 tests.  Three of them correspond
to the different types of splitting (to compare with default); see the 3
types above.  The other 5 do allocations under different conditions.

a) sudo ./test_vmalloc.sh performance

When the test driver is run in "performance" mode, it runs all available
tests pinned to the first online CPU, executing them sequentially.  We
do this in order to get stable and repeatable results.  Take a look at the
time difference in "long_busy_list_alloc_test".  It is not surprising
because the worst case is O(N).

# i5-3320M
How many cycles all tests took:
CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles

# See detailed table with results here:
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt

# Hikey960 8x CPUs
How many cycles all tests took:
CPU0=3478683207 cycles vs CPU0=463767978 cycles

# See detailed table with results here:
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt

b) time sudo ./test_vmalloc.sh test_repeat_count=1

With this configuration, all tests are run on all available online CPUs.
Before running, each CPU shuffles its test execution order.  This gives
random allocation behaviour.  So it is a rough comparison, but it does
give a good picture.

# i5-3320M
<default>            vs            <patched>
real    101m22.813s                real    0m56.805s
user    0m0.011s                   user    0m0.015s
sys     0m5.076s                   sys     0m0.023s

# See detailed table with results here:
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt

# Hikey960 8x CPUs
<default>            vs            <patched>
real    unknown                    real    4m25.214s
user    unknown                    user    0m0.011s
sys     unknown                    sys     0m0.670s

I did not manage to complete this test on the "default Hikey960" kernel
version.  After 24 hours it was still running, therefore I had to cancel
it.  That is why real/user/sys are "unknown".

This patch (of 3):

Currently an allocation of a new vmap area is done by iterating over the
busy list (complexity O(n)) until a suitable hole is found between two busy
areas.  Therefore each new allocation causes the list to grow.  Due to
an over-fragmented list and different permissive parameters, an allocation
can take a long time.  For example, on embedded devices it takes
milliseconds.

This patch organizes the KVA memory layout into free areas of the
1-ULONG_MAX range.  It uses an augmented red-black tree that keeps blocks
sorted by their offsets, paired with a linked list keeping the free space
in order of increasing addresses.

Nodes are augmented with the size of the maximum available free block in
their left or right sub-tree.  This allows making a decision and
traversing toward the block that will fit and will have the lowest start
address, i.e.  it is sequential allocation.

Allocation: to allocate a new block, the tree is searched until the
lowest (left-most) block is found that is large enough to encompass the
requested size, alignment and vstart point.  If the block is bigger than
the requested size, it is split.

De-allocation: when a busy vmap area is freed it can either be merged or
inserted into the tree.  The red-black tree allows the spot to be found
efficiently, whereas the linked list provides constant-time access to the
previous and next blocks to check if merging can be done.  When a
de-allocated memory chunk is merged, a large coalesced area is created.

Complexity: ~O(log(N))

[urezki@gmail.com: v3]
  Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
[urezki@gmail.com: v4]
  Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-18 15:52:26 -07:00
7878c231da slab: remove /proc/slab_allocators
It turned out that DEBUG_SLAB_LEAK is still broken even after recent
rescue efforts: when there is a large number of objects like
kmemleak_object, which is normal on a debug kernel,

  # grep kmemleak /proc/slabinfo
  kmemleak_object   2243606 3436210 ...

reading /proc/slab_allocators could easily loop forever while processing
the kmemleak_object cache, and any additional freeing or allocating of
objects will trigger a reprocessing. To make the situation worse,
soft-lockups could easily happen in this situation, which will call
printk() and allocate more kmemleak objects, guaranteeing an infinite
loop.

Also, since it seems no one had noticed when it was totally broken
more than 2 years ago - see the commit fcf88917dd43 ("slab: fix a crash
by reading /proc/slab_allocators"), probably nobody cares about it
anymore due to the decline of SLAB. Just remove it entirely.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-16 15:51:55 -07:00