ed8a555323
275 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Gustavo A. R. Silva
|
dbac61a3f2 |
mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereference
The NULL check at line 1226: if (!pgdat), implies that pointer pgdat might be NULL. rollback_node_hotadd() dereferences this pointer. Add NULL check to avoid a potential NULL pointer dereference. Addresses-Coverity-ID: 1369133 Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
4932381ee2 |
mm, memory_hotplug: move movable_node to the hotplug proper
movable_node_is_enabled is defined in memblock proper while it is initialized from the memory hotplug proper. This is quite messy and it makes a dependency between the two so move movable_node along with the helper functions to memory_hotplug. To make it more entertaining the kernel parameter is ignored unless CONFIG_HAVE_MEMBLOCK_NODE_MAP=y because we do not have the node information for each memblock otherwise. So let's warn when the option is disabled. Link: http://lkml.kernel.org/r/20170529114141.536-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Kani Toshimitsu <toshi.kani@hpe.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
f70029bbaa |
mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
Commit
|
||
Michal Hocko
|
57c0a17238 |
mm, memory_hotplug: drop artificial restriction on online/offline
Patch series "remove CONFIG_MOVABLE_NODE".
I am continuing to clean up the memory hotplug code and
CONFIG_MOVABLE_NODE seems dubious at best. The following two patches
simply removes the flag and make it de-facto always enabled.
The current semantic of the config option is twofold 1) it automatically
binds hotplugable nodes to have memory in zone_movable by default when
movable_node is enabled 2) forbids memory hotplug to online all the
memory as movable when !CONFIG_MOVABLE_NODE.
The later restriction is quite dubious because there is no clear cut of
how much normal memory do we need for a reasonable system operation. A
single memory block which is sufficient to allow further movable onlines
is far from sufficient (e.g a node with >2GB and memblocks 128MB will
fill up this zone with struct pages leaving nothing for other
allocations). Removing the config option will not only reduce the
configuration space it also removes quite some code.
The semantic of the movable_node command line parameter is preserved.
The first patch removes the restriction mentioned above and the second
one simply removes all the CONFIG_MOVABLE_NODE related stuff. The last
patch moves movable_node flag handling to memory_hotplug proper where it
belongs.
[1] http://lkml.kernel.org/r/20170524122411.25212-1-mhocko@kernel.org
This patch (of 3):
Commit
|
||
Vlastimil Babka
|
04ec6264f2 |
mm, page_alloc: pass preferred nid instead of zonelist to allocator
The main allocator function __alloc_pages_nodemask() takes a zonelist pointer as one of its parameters. All of its callers directly or indirectly obtain the zonelist via node_zonelist() using a preferred node id and gfp_mask. We can make the code a bit simpler by doing the zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node id instead (gfp_mask is already another parameter). There are some code size benefits thanks to removal of inlined node_zonelist(): bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952) This will also make things simpler if we proceed with converting cpusets to zonelists. Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
559bfc7d1b |
mm, memory_hotplug: remove unused cruft after memory hotplug rework
zone_for_memory doesn't have any user anymore as well as the whole zone shifting infrastructure so drop them all. This shouldn't introduce any functional changes. Link: http://lkml.kernel.org/r/20170515085827.16474-15-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
cdf72f2504 |
mm, memory_hotplug: fix the section mismatch warning
Tobias has reported following section mismatches introduced by "mm, memory_hotplug: do not associate hotadded memory to zones until online". WARNING: mm/built-in.o(.text+0x5a1c2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone() The function move_pfn_range_to_zone() references the function __meminit memmap_init_zone(). This is often because move_pfn_range_to_zone lacks a __meminit annotation or the annotation of memmap_init_zone is wrong. WARNING: mm/built-in.o(.text+0x5a25b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone() The function move_pfn_range_to_zone() references the function __meminit init_currently_empty_zone(). This is often because move_pfn_range_to_zone lacks a __meminit annotation or the annotation of init_currently_empty_zone is wrong. WARNING: vmlinux.o(.text+0x188aa2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone() The function move_pfn_range_to_zone() references the function __meminit memmap_init_zone(). This is often because move_pfn_range_to_zone lacks a __meminit annotation or the annotation of memmap_init_zone is wrong. WARNING: vmlinux.o(.text+0x188b3b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone() The function move_pfn_range_to_zone() references the function __meminit init_currently_empty_zone(). This is often because move_pfn_range_to_zone lacks a __meminit annotation or the annotation of init_currently_empty_zone is wrong. Both memmap_init_zone and init_currently_empty_zone are marked __meminit but move_pfn_range_to_zone is used outside of __meminit sections (e.g. devm_memremap_pages) so we have to hide it from the checker by __ref annotation. Link: http://lkml.kernel.org/r/20170515085827.16474-14-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tobias Regnery <tobias.regnery@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
3d79a728f9 |
mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory
arch_add_memory gets for_device argument which then controls whether we want to create memblocks for created memory sections. Simplify the logic by telling whether we want memblocks directly rather than going through pointless negation. This also makes the api easier to understand because it is clear what we want rather than nothing telling for_device which can mean anything. This shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
c246a213f5 |
mm, memory_hotplug: do not assume ZONE_NORMAL is default kernel zone
Heiko Carstens has noticed that he can generate overlapping zones for ZONE_DMA and ZONE_NORMAL: DMA [mem 0x0000000000000000-0x000000007fffffff] Normal [mem 0x0000000080000000-0x000000017fffffff] $ cat /sys/devices/system/memory/block_size_bytes 10000000 $ cat /sys/devices/system/memory/memory5/valid_zones DMA $ echo 0 > /sys/devices/system/memory/memory5/online $ cat /sys/devices/system/memory/memory5/valid_zones Normal $ echo 1 > /sys/devices/system/memory/memory5/online Normal $ cat /proc/zoneinfo Node 0, zone DMA spanned 524288 <----- present 458752 managed 455078 start_pfn: 0 <----- Node 0, zone Normal spanned 720896 present 589824 managed 571648 start_pfn: 327680 <----- The reason is that we assume that the default zone for kernel onlining is ZONE_NORMAL. This was a simplification introduced by the memory hotplug rework and it is easily fixable by checking the range overlap in the zone order and considering the first matching zone as the default one. If there is no such zone then assume ZONE_NORMAL as we have been doing so far. Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online" Link: http://lkml.kernel.org/r/20170601083746.4924-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
a69578a154 |
mm, memory_hotplug: fix MMOP_ONLINE_KEEP behavior
Heiko Carstens has noticed that the MMOP_ONLINE_KEEP is broken currently $ grep . memory3?/valid_zones memory34/valid_zones:Normal Movable memory35/valid_zones:Normal Movable memory36/valid_zones:Normal Movable memory37/valid_zones:Normal Movable $ echo online_movable > memory34/state $ grep . memory3?/valid_zones memory34/valid_zones:Movable memory35/valid_zones:Movable memory36/valid_zones:Movable memory37/valid_zones:Movable $ echo online > memory36/state $ grep . memory3?/valid_zones memory34/valid_zones:Movable memory36/valid_zones:Normal memory37/valid_zones:Movable so we have effectively punched a hole into the movable zone. The problem is that move_pfn_range() check for MMOP_ONLINE_KEEP is wrong. It only checks whether the given range is already part of the movable zone which is not the case here as only memory34 is in the zone. Fix this by using allow_online_pfn_range(..., MMOP_ONLINE_KERNEL) if that is false then we can be sure that movable onlining is the right thing to do. Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online" Link: http://lkml.kernel.org/r/20170601083746.4924-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
f1dd2cd13c |
mm, memory_hotplug: do not associate hotadded memory to zones until online
The current memory hotplug implementation relies on having all the struct pages associate with a zone/node during the physical hotplug phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the vast majority of cases this means that they are added to ZONE_NORMAL. This has been so since |
||
Michal Hocko
|
2d070eab2e |
mm: consider zone which is not fully populated to have holes
__pageblock_pfn_to_page has two users currently, set_zone_contiguous which checks whether the given zone contains holes and pageblock_pfn_to_page which then carefully returns a first valid page from the given pfn range for the given zone. This doesn't handle zones which are not fully populated though. Memory pageblocks can be offlined or might not have been onlined yet. In such a case the zone should be considered to have holes otherwise pfn walkers can touch and play with offline pages. Current callers of pageblock_pfn_to_page in compaction seem to work properly right now because they only isolate PageBuddy (isolate_freepages_block) or PageLRU resp. __PageMovable (isolate_migratepages_block) which will be always false for these pages. It would be safer to skip these pages altogether, though. In order to do this patch adds a new memory section state (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or in online_pages_range during the memory hotplug. Similarly offline_mem_sections clears the bit and it is called when the memory range is offlined. pfn_to_online_page helper is then added which check the mem section and only returns a page if it is onlined already. Use the new helper in __pageblock_pfn_to_page and skip the whole page block in such a case. [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil), mark sections online after all struct pages are initialized in online_pages_range (Vlastimil)] Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
9037a99343 |
mm, memory_hotplug: split up register_one_node()
Memory hotplug (add_memory_resource) has to reinitialize node infrastructure if the node is offline (one which went through the complete add_memory(); remove_memory() cycle). That involves node registration to the kobj infrastructure (register_node), the proper association with cpus (register_cpu_under_node) and finally creation of node<->memblock symlinks (link_mem_sections). The last part requires to know node_start_pfn and node_spanned_pages which we currently have but a leter patch will postpone this initialization to the onlining phase which happens later. In fact we do not need to rely on the early pgdat initialization even now because the currently hot added pfn range is currently known. Split register_one_node into core which does all the common work for the boot time NUMA initialization and the hotplug (__register_one_node). register_one_node keeps the full initialization while hotplug calls __register_one_node and manually calls link_mem_sections for the proper range. This shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170515085827.16474-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
1b862aecfb |
mm, memory_hotplug: get rid of is_zone_device_section
Device memory hotplug hooks into regular memory hotplug only half way. It needs memory sections to track struct pages but there is no need/desire to associate those sections with memory blocks and export them to the userspace via sysfs because they cannot be onlined anyway. This is currently expressed by for_device argument to arch_add_memory which then makes sure to associate the given memory range with ZONE_DEVICE. register_new_memory then relies on is_zone_device_section to distinguish special memory hotplug from the regular one. While this works now, later patches in this series want to move __add_zone outside of arch_add_memory path so we have to come up with something else. Add want_memblock down the __add_pages path and use it to control whether the section->memblock association should be done. arch_add_memory then just trivially want memblock for everything but for_device hotplug. remove_memory_section doesn't need is_zone_device_section either. We can simply skip all the memblock specific cleanup if there is no memblock for the given section. This shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
c8f9565716 |
mm, memory_hotplug: use node instead of zone in can_online_high_movable
The primary purpose of this helper is to query the node state so use the node id directly. This is a preparatory patch for later changes. This shouldn't introduce any functional change Link: http://lkml.kernel.org/r/20170515085827.16474-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Michal Hocko
|
dc0bbf3b7f |
mm: remove return value from init_currently_empty_zone
Patch series "mm: make movable onlining suck less", v4. Movable onlining is a real hack with many downsides - mainly reintroduction of lowmem/highmem issues we used to have on 32b systems - but it is the only way to make the memory hotremove more reliable which is something that people are asking for. The current semantic of memory movable onlinening is really cumbersome, however. The main reason for this is that the udev driven approach is basically unusable because udev races with the memory probing while only the last memory block or the one adjacent to the existing zone_movable are allowed to be onlined movable. In short the criterion for the successful online_movable changes under udev's feet. A reliable udev approach would require a 2 phase approach where the first successful movable online would have to check all the previous blocks and online them in descending order. This is hard to be considered sane. This patchset aims at making the onlining semantic more usable. First of all it allows to online memory movable as long as it doesn't clash with the existing ZONE_NORMAL. That means that ZONE_NORMAL and ZONE_MOVABLE cannot overlap. Currently I preserve the original ordering semantic so the zone always precedes the movable zone but I have plans to remove this restriction in future because it is not really necessary. First 3 patches are cleanups which should be ready to be merged right away (unless I have missed something subtle of course). Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path. Patch 5 deals with implicit assumptions of register_one_node on pgdat initialization. Patches 6-10 deal with offline holes in the zone for pfn walkers. I hope I got all of them right but people familiar with compaction should double check this. Patch 11 is the core of the change. In order to make it easier to review I have tried it to be as minimalistic as possible and the large code removal is moved to patch 14. Patch 12 is a trivial follow up cleanup. Patch 13 fixes sparse warnings and finally patch 14 removes the unused code. I have tested the patches in kvm: # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ... and then probed the additional memory by (qemu) object_add memory-backend-ram,id=mem1,size=1G (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 Then I have used this simple script to probe the memory block by hand # cat probe_memblock.sh #!/bin/sh BLOCK_NR=$1 # echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe # for i in $(seq 10); do sh probe_memblock.sh $i; done # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Normal Movable /sys/devices/system/memory/memory35/valid_zones:Normal Movable /sys/devices/system/memory/memory36/valid_zones:Normal Movable /sys/devices/system/memory/memory37/valid_zones:Normal Movable /sys/devices/system/memory/memory38/valid_zones:Normal Movable /sys/devices/system/memory/memory39/valid_zones:Normal Movable The main difference to the original implementation is that all new memblocks can be both online_kernel and online_movable initially because there is no clash obviously. For the comparison the original implementation would have /sys/devices/system/memory/memory33/valid_zones:Normal /sys/devices/system/memory/memory34/valid_zones:Normal /sys/devices/system/memory/memory35/valid_zones:Normal /sys/devices/system/memory/memory36/valid_zones:Normal /sys/devices/system/memory/memory37/valid_zones:Normal /sys/devices/system/memory/memory38/valid_zones:Normal /sys/devices/system/memory/memory39/valid_zones:Normal Movable Now # echo online_movable > /sys/devices/system/memory/memory34/state # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable /sys/devices/system/memory/memory35/valid_zones:Movable /sys/devices/system/memory/memory36/valid_zones:Movable /sys/devices/system/memory/memory37/valid_zones:Movable /sys/devices/system/memory/memory38/valid_zones:Movable /sys/devices/system/memory/memory39/valid_zones:Movable Block 33 can still be online both kernel and movable while all the remaining can be only movable. /proc/zonelist says Node 0, zone Normal pages free 0 min 0 low 0 high 0 spanned 0 present 0 -- Node 0, zone Movable pages free 32753 min 85 low 117 high 149 spanned 32768 present 32768 A new memblock at a lower address will result in a new memblock (32) which will still allow both Normal and Movable. # sh probe_memblock.sh 0 # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable /sys/devices/system/memory/memory35/valid_zones:Movable and online_kernel will convert it to the zone normal properly while 33 can be still onlined both ways. # echo online_kernel > /sys/devices/system/memory/memory32/state # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable /sys/devices/system/memory/memory35/valid_zones:Movable /proc/zoneinfo will now tell Node 0, zone Normal pages free 65441 min 165 low 230 high 295 spanned 65536 present 65536 -- Node 0, zone Movable pages free 32740 min 82 low 114 high 146 spanned 32768 present 32768 so both zones have one memblock spanned and present. Onlining 39 should associate this block to the movable zone # echo online > /sys/devices/system/memory/memory39/state /proc/zoneinfo will now tell Node 0, zone Normal pages free 32765 min 80 low 112 high 144 spanned 32768 present 32768 -- Node 0, zone Movable pages free 65501 min 160 low 225 high 290 spanned 196608 present 65536 so we will have a movable zone which spans 6 memblocks, 2 present and 4 representing a hole. Offlining both movable blocks will lead to the zone with no present pages which is the expected behavior I believe. # echo offline > /sys/devices/system/memory/memory39/state # echo offline > /sys/devices/system/memory/memory34/state # grep -A6 "Movable\|Normal" /proc/zoneinfo Node 0, zone Normal pages free 32735 min 90 low 122 high 154 spanned 32768 present 32768 -- Node 0, zone Movable pages free 0 min 0 low 0 high 0 spanned 196608 present 0 As a bonus we will get a nice cleanup in the memory hotplug codebase. This patch (of 16): init_currently_empty_zone doesn't have any error to return yet it is still an int and callers try to be defensive and try to handle potential error. Remove this nonsense and simplify all callers. This patch shouldn't have any visible effect Link: http://lkml.kernel.org/r/20170515085827.16474-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mel Gorman
|
e716f2eb24 |
mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx
kswapd is woken to reclaim a node based on a failed allocation request
from any eligible zone. Once reclaiming in balance_pgdat(), it will
continue reclaiming until there is an eligible zone available for the
zone it was woken for. kswapd tracks what zone it was recently woken
for in pgdat->kswapd_classzone_idx. If it has not been woken recently,
this zone will be 0.
However, the decision on whether to sleep is made on
kswapd_classzone_idx which is 0 without a recent wakeup request and that
classzone does not account for lowmem reserves. This allows kswapd to
sleep when a low small zone such as ZONE_DMA is balanced for a GFP_DMA
request even if a stream of allocations cannot use that zone. While
kswapd may be woken again shortly in the near future there are two
consequences -- the pgdat bits that control congestion are cleared
prematurely and direct reclaim is more likely as kswapd slept
prematurely.
This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
invalid index) when there has been no recent wakeups. If there are no
wakeups, it'll decide whether to sleep based on the highest possible
zone available (MAX_NR_ZONES - 1). It then becomes critical that the
"pgdat balanced" decisions during reclaim and when deciding to sleep are
the same. If there is a mismatch, kswapd can stay awake continually
trying to balance tiny zones.
simoop was used to evaluate it again. Two of the preparation patches
regressed the workload so they are included as the second set of
results. Otherwise this patch looks artifically excellent
4.11.0-rc1 4.11.0-rc1 4.11.0-rc1
vanilla clear-v2 keepawake-v2
Amean p50-Read 21670074.18 ( 0.00%) 19786774.76 ( 8.69%) 22668332.52 ( -4.61%)
Amean p95-Read 25456267.64 ( 0.00%) 24101956.27 ( 5.32%) 26738688.00 ( -5.04%)
Amean p99-Read 29369064.73 ( 0.00%) 27691872.71 ( 5.71%) 30991404.52 ( -5.52%)
Amean p50-Write 1390.30 ( 0.00%) 1011.91 ( 27.22%) 924.91 ( 33.47%)
Amean p95-Write 412901.57 ( 0.00%) 34874.98 ( 91.55%) 1362.62 ( 99.67%)
Amean p99-Write 6668722.09 ( 0.00%) 575449.60 ( 91.37%) 16854.04 ( 99.75%)
Amean p50-Allocation 78714.31 ( 0.00%) 84246.26 ( -7.03%) 74729.74 ( 5.06%)
Amean p95-Allocation 175533.51 ( 0.00%) 400058.43 (-127.91%) 101609.74 ( 42.11%)
Amean p99-Allocation 247003.02 ( 0.00%) 10905600.00 (-4315.17%) 125765.57 ( 49.08%)
With this patch on top, write and allocation latencies are massively
improved. The read latencies are slightly impaired but it's worth
noting that this is mostly due to the IO scheduler and not directly
related to reclaim. The vmstats are a bit of a mix but the relevant
ones are as follows;
4.10.0-rc7 4.10.0-rc7 4.10.0-rc7
mmots-20170209 clear-v1r25keepawake-v1r25
Swap Ins 0 0 0
Swap Outs 0 608 0
Direct pages scanned 6910672
|
||
Heiko Carstens
|
55adc1d05d |
mm: add private lock to serialize memory hotplug operations
Commit |
||
Ingo Molnar
|
174cd4b1e5 |
sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Nathan Fontenot
|
dc18d706a4 |
memory-hotplug: use dev_online for memhp_auto_online
Commit
|
||
zhong jiang
|
d6d8c8a482 |
mm/memory_hotplug.c: fix overflow in test_pages_in_a_zone()
When mainline introduced commit |
||
Yisheng Xie
|
0efadf48bc |
mm/hotplug: enable memory hotplug for non-lru movable pages
We had considered all of the non-lru pages as unmovable before commit
|
||
Andrew Morton
|
997126bbc5 |
mm/memory_hotplug.c: unexport __remove_pages()
It has no modular callers. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Dan Williams
|
3fc2192410 |
mm: validate device_hotplug is held for memory hotplug
mem_hotplug_begin() assumes that it can set mem_hotplug.active_writer and run the hotplug process without racing another thread. Validate this assumption with a lockdep assertion. Link: http://lkml.kernel.org/r/148693886229.16345.1770484669403334689.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Yasuaki Ishimatsu
|
ddffe98d16 |
mm/memory_hotplug: set magic number to page->freelist instead of page->lru.next
To identify that pages of page table are allocated from bootmem
allocator, magic number sets to page->lru.next.
But page->lru list is initialized in reserve_bootmem_region(). So when
calling free_pagetable(), the function cannot find the magic number of
pages. And free_pagetable() frees the pages by free_reserved_page() not
put_page_bootmem().
But if the pages are allocated from bootmem allocator and used as page
table, the pages have private flag. So before freeing the pages, we
should clear the private flag by put_page_bootmem().
Before applying the commit
|
||
Toshi Kani
|
a96dfddbcc |
base/memory, hotplug: fix a kernel oops in show_valid_zones()
Reading a sysfs "memoryN/valid_zones" file leads to the following oops
when the first page of a range is not backed by struct page.
show_valid_zones() assumes that 'start_pfn' is always valid for
page_zone().
BUG: unable to handle kernel paging request at ffffea017a000000
IP: show_valid_zones+0x6f/0x160
This issue may happen on x86-64 systems with 64GiB or more memory since
their memory block size is bumped up to 2GiB. [1] An example of such
systems is desribed below. 0x3240000000 is only aligned by 1GiB and
this memory block starts from 0x3200000000, which is not backed by
struct page.
BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable
Since test_pages_in_a_zone() already checks holes, fix this issue by
extending this function to return 'valid_start' and 'valid_end' for a
given range. show_valid_zones() then proceeds with the valid range.
[1] 'Commit
|
||
Toshi Kani
|
deb88a2a19 |
mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
Patch series "fix a kernel oops when reading sysfs valid_zones", v2. A sysfs memory file is created for each 2GiB memory block on x86-64 when the system has 64GiB or more memory. [1] When the start address of a memory block is not backed by struct page, i.e. a memory range is not aligned by 2GiB, reading its 'valid_zones' attribute file leads to a kernel oops. This issue was observed on multiple x86-64 systems with more than 64GiB of memory. This patch-set fixes this issue. Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not test the start section. Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone() to return valid [start, end). Note for stable kernels: The memory block size change was made by commit |
||
Yasuaki Ishimatsu
|
8a1f780e7f |
memory_hotplug: make zone_can_shift() return a boolean value
online_{kernel|movable} is used to change the memory zone to
ZONE_{NORMAL|MOVABLE} and online the memory.
To check that memory zone can be changed, zone_can_shift() is used.
Currently the function returns minus integer value, plus integer
value and 0. When the function returns minus or plus integer value,
it means that the memory zone can be changed to ZONE_{NORNAL|MOVABLE}.
But when the function returns 0, there are two meanings.
One of the meanings is that the memory zone does not need to be changed.
For example, when memory is in ZONE_NORMAL and onlined by online_kernel
the memory zone does not need to be changed.
Another meaning is that the memory zone cannot be changed. When memory
is in ZONE_NORMAL and onlined by online_movable, the memory zone may
not be changed to ZONE_MOVALBE due to memory online limitation(see
Documentation/memory-hotplug.txt). In this case, memory must not be
onlined.
The patch changes the return type of zone_can_shift() so that memory
online operation fails when memory zone cannot be changed as follows:
Before applying patch:
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal
<snip>
node_scanned 0
spanned 8388608
present 7864320
managed 7864320
# echo online_movable > memory4097/state
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal
<snip>
node_scanned 0
spanned 8388608
present 8388608
managed 8388608
online_movable operation succeeded. But memory is onlined as
ZONE_NORMAL, not ZONE_MOVABLE.
After applying patch:
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal
<snip>
node_scanned 0
spanned 8388608
present 7864320
managed 7864320
# echo online_movable > memory4097/state
bash: echo: write error: Invalid argument
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal
<snip>
node_scanned 0
spanned 8388608
present 7864320
managed 7864320
online_movable operation failed because of failure of changing
the memory zone from ZONE_NORMAL to ZONE_MOVABLE
Fixes:
|
||
Reza Arbab
|
39fa104d5b |
mm: remove x86-only restriction of movable_node
In commit
|
||
Linus Torvalds
|
9db4f36e82 |
mm: remove unused variable in memory hotplug
When I removed the per-zone bitlock hashed waitqueues in commit
|
||
Linus Torvalds
|
9dcb8b685f |
mm: remove per-zone hashtable of bitlock waitqueues
The per-zone waitqueues exist because of a scalability issue with the page waitqueues on some NUMA machines, but it turns out that they hurt normal loads, and now with the vmalloced stacks they also end up breaking gfs2 that uses a bit_wait on a stack object: wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE) where 'gh' can be a reference to the local variable 'mount_gh' on the stack of fill_super(). The reason the per-zone hash table breaks for this case is that there is no "zone" for virtual allocations, and trying to look up the physical page to get at it will fail (with a BUG_ON()). It turns out that I actually complained to the mm people about the per-zone hash table for another reason just a month ago: the zone lookup also hurts the regular use of "unlock_page()" a lot, because the zone lookup ends up forcing several unnecessary cache misses and generates horrible code. As part of that earlier discussion, we had a much better solution for the NUMA scalability issue - by just making the page lock have a separate contention bit, the waitqueue doesn't even have to be looked at for the normal case. Peter Zijlstra already has a patch for that, but let's see if anybody even notices. In the meantime, let's fix the actual gfs2 breakage by simplifying the bitlock waitqueues and removing the per-zone issue. Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Tested-by: Bob Peterson <rpeterso@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Gerald Schaefer
|
082d5b6b60 |
mm/hugetlb: check for reserved hugepages during memory offline
In dissolve_free_huge_pages(), free hugepages will be dissolved without
making sure that there are enough of them left to satisfy hugepage
reservations.
Fix this by adding a return value to dissolve_free_huge_pages() and
checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may
lead to the situation where dissolve_free_huge_page() returns an error
and all free hugepages that were dissolved before that error are lost,
while the memory block still cannot be set offline.
Fixes:
|
||
Li Zhong
|
231e97e2b8 |
mem-hotplug: use nodes that contain memory as mask in new_node_page()
|
||
Li Zhong
|
9bb627be47 |
mem-hotplug: don't clear the only node in new_node_page()
Commit |
||
Reza Arbab
|
5830169f47 |
mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats
The following oops occurs after a pgdat is hotadded: Unable to handle kernel paging request for data at address 0x00c30001 Faulting instruction address: 0xc00000000022f8f4 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2048 NUMA pSeries Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.8.0-rc1-device #110 task: c000000000ef3080 task.stack: c000000000f6c000 NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000 REGS: c000000000f6fa50 TRAP: 0300 Tainted: G W (4.8.0-rc1-device) MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 84002028 XER: 20000000 CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0 NIP refresh_cpu_vm_stats+0x1a4/0x2f0 LR refresh_cpu_vm_stats+0x1f8/0x2f0 Call Trace: refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable) Add per_cpu_nodestats initialization to the hotplug codepath. Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Xishi Qiu
|
394e31d2ce |
mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
If we offline a node, alloc the new page from a nearest neighbor node instead of the current node or other remote nodes, because re-migrate is a waste of time and the distance of the remote nodes is often very large. Also use GFP_HIGHUSER_MOVABLE to alloc new page if the zone is movable zone or highmem zone. Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mel Gorman
|
38087d9b03 |
mm, vmscan: simplify the logic deciding whether kswapd sleeps
kswapd goes through some complex steps trying to figure out if it should stay awake based on the classzone_idx and the requested order. It is unnecessarily complex and passes in an invalid classzone_idx to balance_pgdat(). What matters most of all is whether a larger order has been requsted and whether kswapd successfully reclaimed at the previous order. This patch irons out the logic to check just that and the end result is less headache inducing. Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Mel Gorman
|
599d0c954f |
mm, vmscan: move LRU lists to node
This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Reza Arbab
|
df429ac039 |
memory-hotplug: more general validation of zone during online
When memory is onlined, we are only able to rezone from ZONE_MOVABLE to ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE. To be more flexible, use the following criteria instead; to online memory from zone X into zone Y, * Any zones between X and Y must be unused. * If X is lower than Y, the onlined memory must lie at the end of X. * If X is higher than Y, the onlined memory must lie at the start of X. Add zone_can_shift() to make this determination. Link: http://lkml.kernel.org/r/1462816419-4479-3-git-send-email-arbab@linux.vnet.ibm.com Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Reviewd-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Zhang Zhen <zhenzhang.zhang@huawei.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Reza Arbab
|
e51e6c8f80 |
memory-hotplug: add move_pfn_range()
Add move_pfn_range(), a wrapper to call move_pfn_range_left() or move_pfn_range_right(). No functional change. This will be utilized by a later patch. Link: http://lkml.kernel.org/r/1462816419-4479-2-git-send-email-arbab@linux.vnet.ibm.com Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Banman <abanman@sgi.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Zhang Zhen <zhenzhang.zhang@huawei.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Linus Torvalds
|
7ded384a12 |
mm: fix section mismatch warning
The register_page_bootmem_info_node() function needs to be marked __init
in order to avoid a new warning introduced by commit
|
||
Yang Shi
|
f65e91df25 |
mm: use early_pfn_to_nid in register_page_bootmem_info_node
register_page_bootmem_info_node() is invoked in mem_init(), so it will be called before page_alloc_init_late() if DEFERRED_STRUCT_PAGE_INIT is enabled. But, pfn_to_nid() depends on memmap which won't be fully setup until page_alloc_init_late() is done, so replace pfn_to_nid() by early_pfn_to_nid(). Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Vitaly Kuznetsov
|
86dd995d63 |
memory_hotplug: introduce memhp_default_state= command line parameter
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE specifies the default value for the memory hotplug onlining policy. Add a command line parameter to make it possible to override the default. It may come handy for debug and testing purposes. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Vrabel <david.vrabel@citrix.com> Cc: David Rientjes <rientjes@google.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Lennart Poettering <lennart@poettering.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Vitaly Kuznetsov
|
8604d9e534 |
memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
This patchset continues the work I started with commit
|
||
Yaowei Bai
|
c98940f6fa |
mm/memory_hotplug: is_mem_section_removable() can return bool
Make is_mem_section_removable() return bool to improve readability due to this particular function only using either one or zero as its return value. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Joe Perches
|
756a025f00 |
mm: coalesce split strings
Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Chen Yucong
|
e33e33b4d1 |
mm, memory hotplug: print debug message in the proper way for online_pages
online_pages() simply returns an error value if memory_notify(MEM_GOING_ONLINE, &arg) return a value that is not what we want for successfully onlining target pages. This patch arms to print more failure information like offline_pages() in online_pages. This patch also converts printk(KERN_<LEVEL>) to pr_<level>(), and moves __offline_pages() to not print failure information with KERN_INFO according to David Rientjes's suggestion[1]. [1] https://lkml.org/lkml/2016/2/24/1094 Signed-off-by: Chen Yucong <slaoub@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Joonsoo Kim
|
fe896d1878 |
mm: introduce page reference manipulation functions
The success of CMA allocation largely depends on the success of migration and key factor of it is page reference count. Until now, page reference is manipulated by direct calling atomic functions so we cannot follow up who and where manipulate it. Then, it is hard to find actual reason of CMA allocation failure. CMA allocation should be guaranteed to succeed so finding offending place is really important. In this patch, call sites where page reference is manipulated are converted to introduced wrapper function. This is preparation step to add tracepoint to each page reference manipulation function. With this facility, we can easily find reason of CMA allocation failure. There is no functional change in this patch. In addition, this patch also converts reference read sites. It will help a second step that renames page._count to something else and prevents later attempt to direct access to it (Suggested by Andrew). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Vlastimil Babka
|
e888ca3545 |
mm, memory hotplug: small cleanup in online_pages()
We can reuse the nid we've determined instead of repeated pfn_to_nid() usages. Also zone_to_nid() should be a bit cheaper in general than pfn_to_nid(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Vlastimil Babka
|
698b1b3064 |
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts: - kswapd balancing a zone after a high-order allocation failure - direct compaction to satisfy a high-order allocation, including THP page fault attemps - khugepaged trying to collapse a hugepage - manually from /proc The purpose of compaction is two-fold. The obvious purpose is to satisfy a (pending or future) high-order allocation, and is easy to evaluate. The other purpose is to keep overal memory fragmentation low and help the anti-fragmentation mechanism. The success wrt the latter purpose is more The current situation wrt the purposes has a few drawbacks: - compaction is invoked only when a high-order page or hugepage is not available (or manually). This might be too late for the purposes of keeping memory fragmentation low. - direct compaction increases latency of allocations. Again, it would be better if compaction was performed asynchronously to keep fragmentation low, before the allocation itself comes. - (a special case of the previous) the cost of compaction during THP page faults can easily offset the benefits of THP. - kswapd compaction appears to be complex, fragile and not working in some scenarios. It could also end up compacting for a high-order allocation request when it should be reclaiming memory for a later order-0 request. To improve the situation, we should be able to benefit from an equivalent of kswapd, but for compaction - i.e. a background thread which responds to fragmentation and the need for high-order allocations (including hugepages) somewhat proactively. One possibility is to extend the responsibilities of kswapd, which could however complicate its design too much. It should be better to let kswapd handle reclaim, as order-0 allocations are often more critical than high-order ones. Another possibility is to extend khugepaged, but this kthread is a single instance and tied to THP configs. This patch goes with the option of a new set of per-node kthreads called kcompactd, and lays the foundations, without introducing any new tunables. The lifecycle mimics kswapd kthreads, including the memory hotplug hooks. For compaction, kcompactd uses the standard compaction_suitable() and ompact_finished() criteria and the deferred compaction functionality. Unlike direct compaction, it uses only sync compaction, as there's no allocation latency to minimize. This patch doesn't yet add a call to wakeup_kcompactd. The kswapd compact/reclaim loop for high-order pages will be replaced by waking up kcompactd in the next patch with the description of what's wrong with the old approach. Waking up of the kcompactd threads is also tied to kswapd activity and follows these rules: - we don't want to affect any fastpaths, so wake up kcompactd only from the slowpath, as it's done for kswapd - if kswapd is doing reclaim, it's more important than compaction, so don't invoke kcompactd until kswapd goes to sleep - the target order used for kswapd is passed to kcompactd Future possible future uses for kcompactd include the ability to wake up kcompactd on demand in special situations, such as when hugepages are not available (currently not done due to __GFP_NO_KSWAPD) or when a fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also possible to perform periodic compaction with kcompactd. [arnd@arndb.de: fix build errors with kcompactd] [paul.gortmaker@windriver.com: don't use modular references for non modular code] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |