-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmVmG20ACgkQONu9yGCS
aT6dzg/7BnCP2SpVmgEaD7FdPvGO/A6O5VrC9zu3sQE6g2gAwirZhdgE8NRn+ggm
WSQ1kIA+HEcY23FKpq46pBED4P1irudiW7DkLw8nyOGp+XLb4wGkF5lBBP5z+B2P
ga2RgwqKvYWeDaUW4n1Uy7m2Cz+wqCg/EvnITo40glSWPh20gM532/CSnA5akoje
9mjZYZ0rKHKTZGu65aNScNR7XnXHIivJU6C1jF6L9N1+Xn679nUHKQP4KM/RcjpX
g1WQMWFC3mGIn5IX28W1wvKS320D5HLmTLnLqJvFpJN9+13DUnUoXcX469zvQoxJ
GL3S94goWN/0BPOgr5KcKvTj00b4O+EWhQuQt+x8NLdydzRQuyFu2UpLNhIKKSou
sT+BcxzeuqJhEh1tZItcZkZBptpLEkb0ezT11u5McnU5FjPzzzP8CtEetKKmEaBU
AUoEP/lQQlVyk1I6xAeuzu53smncNQt6CqnXJxYXOBGgJ2txAM5kroMKXPin5C8k
BCpUIqghhKmBd1hwuKyaOBKF99eLKKZsuvXppoPD0Yz7/Nq5TgdBw0qbNt2iLr05
XSM7WIIeCBROaV+ZiVxgtcXDR51FpMr7CLTbkBQ6IgLwircHeHSK7rQn7kFO3fCg
OezhWAuh72qDZ2PCJ84fj21IhZ49a5oCLbUdBew+KzZervVpSo0=
=eW67
-----END PGP SIGNATURE-----
Merge 5.10.202 into android12-5.10-lts
Changes in 5.10.202
locking/ww_mutex/test: Fix potential workqueue corruption
perf/core: Bail out early if the request AUX area is out of bound
clocksource/drivers/timer-imx-gpt: Fix potential memory leak
clocksource/drivers/timer-atmel-tcb: Fix initialization on SAM9 hardware
x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size
wifi: mac80211_hwsim: fix clang-specific fortify warning
wifi: mac80211: don't return unset power in ieee80211_get_tx_power()
bpf: Detect IP == ksym.end as part of BPF program
wifi: ath9k: fix clang-specific fortify warnings
wifi: ath10k: fix clang-specific fortify warning
net: annotate data-races around sk->sk_tx_queue_mapping
net: annotate data-races around sk->sk_dst_pending_confirm
wifi: ath10k: Don't touch the CE interrupt registers after power up
Bluetooth: btusb: Add date->evt_skb is NULL check
Bluetooth: Fix double free in hci_conn_cleanup
platform/x86: thinkpad_acpi: Add battery quirk for Thinkpad X120e
drm/komeda: drop all currently held locks if deadlock happens
drm/msm/dp: skip validity check for DP CTS EDID checksum
drm/amd: Fix UBSAN array-index-out-of-bounds for SMU7
drm/amd: Fix UBSAN array-index-out-of-bounds for Polaris and Tonga
drm/amdgpu: Fix potential null pointer derefernce
drm/panel: fix a possible null pointer dereference
drm/panel/panel-tpo-tpg110: fix a possible null pointer dereference
drm/panel: st7703: Pick different reset sequence
drm/amdgpu: Fix a null pointer access when the smc_rreg pointer is NULL
selftests/efivarfs: create-read: fix a resource leak
ASoC: soc-card: Add storage for PCI SSID
crypto: pcrypt - Fix hungtask for PADATA_RESET
RDMA/hfi1: Use FIELD_GET() to extract Link Width
fs/jfs: Add check for negative db_l2nbperpage
fs/jfs: Add validity check for db_maxag and db_agpref
jfs: fix array-index-out-of-bounds in dbFindLeaf
jfs: fix array-index-out-of-bounds in diAlloc
HID: lenovo: Detect quirk-free fw on cptkbd and stop applying workaround
ARM: 9320/1: fix stack depot IRQ stack filter
ALSA: hda: Fix possible null-ptr-deref when assigning a stream
PCI: tegra194: Use FIELD_GET()/FIELD_PREP() with Link Width fields
atm: iphase: Do PCI error checks on own line
scsi: libfc: Fix potential NULL pointer dereference in fc_lport_ptp_setup()
misc: pci_endpoint_test: Add Device ID for R-Car S4-8 PCIe controller
HID: Add quirk for Dell Pro Wireless Keyboard and Mouse KM5221W
exfat: support handle zero-size directory
tty: vcc: Add check for kstrdup() in vcc_probe()
usb: gadget: f_ncm: Always set current gadget in ncm_bind()
9p/trans_fd: Annotate data-racy writes to file::f_flags
i2c: sun6i-p2wi: Prevent potential division by zero
media: gspca: cpia1: shift-out-of-bounds in set_flicker
media: vivid: avoid integer overflow
gfs2: ignore negated quota changes
gfs2: fix an oops in gfs2_permission
media: cobalt: Use FIELD_GET() to extract Link Width
media: imon: fix access to invalid resource for the second interface
drm/amd/display: Avoid NULL dereference of timing generator
kgdb: Flush console before entering kgdb on panic
ASoC: ti: omap-mcbsp: Fix runtime PM underflow warnings
drm/amdgpu: fix software pci_unplug on some chips
pwm: Fix double shift bug
wifi: iwlwifi: Use FW rate for non-data frames
xhci: turn cancelled td cleanup to its own function
SUNRPC: ECONNRESET might require a rebind
SUNRPC: Add an IS_ERR() check back to where it was
NFSv4.1: fix SP4_MACH_CRED protection for pnfs IO
SUNRPC: Fix RPC client cleaned up the freed pipefs dentries
gfs2: Silence "suspicious RCU usage in gfs2_permission" warning
ipvlan: add ipvlan_route_v6_outbound() helper
tty: Fix uninit-value access in ppp_sync_receive()
net: hns3: fix variable may not initialized problem in hns3_init_mac_addr()
net: hns3: fix VF reset fail issue
tipc: Fix kernel-infoleak due to uninitialized TLV value
ppp: limit MRU to 64K
xen/events: fix delayed eoi list handling
ptp: annotate data-race around q->head and q->tail
bonding: stop the device in bond_setup_by_slave()
net: ethernet: cortina: Fix max RX frame define
net: ethernet: cortina: Handle large frames
net: ethernet: cortina: Fix MTU max setting
netfilter: nf_conntrack_bridge: initialize err to 0
net: stmmac: fix rx budget limit check
net/mlx5e: fix double free of encap_header
net/mlx5_core: Clean driver version and name
net/mlx5e: Check return value of snprintf writing to fw_version buffer for representors
macvlan: Don't propagate promisc change to lower dev in passthru
tools/power/turbostat: Fix a knl bug
cifs: spnego: add ';' in HOST_KEY_LEN
cifs: fix check of rc in function generate_smb3signingkey
media: venus: hfi: add checks to perform sanity on queue pointers
powerpc/perf: Fix disabling BHRB and instruction sampling
randstruct: Fix gcc-plugin performance mode to stay in group
bpf: Fix check_stack_write_fixed_off() to correctly spill imm
bpf: Fix precision tracking for BPF_ALU | BPF_TO_BE | BPF_END
scsi: mpt3sas: Fix loop logic
scsi: megaraid_sas: Increase register read retry rount from 3 to 30 for selected registers
x86/cpu/hygon: Fix the CPU topology evaluation for real
KVM: x86: hyper-v: Don't auto-enable stimer on write from user-space
KVM: x86: Ignore MSR_AMD64_TW_CFG access
audit: don't take task_lock() in audit_exe_compare() code path
audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()
tty/sysrq: replace smp_processor_id() with get_cpu()
hvc/xen: fix console unplug
hvc/xen: fix error path in xen_hvc_init() to always register frontend driver
PCI/sysfs: Protect driver's D3cold preference from user space
watchdog: move softlockup_panic back to early_param
ACPI: resource: Do IRQ override on TongFang GMxXGxx
arm64: Restrict CPU_BIG_ENDIAN to GNU as or LLVM IAS 15.x or newer
parisc/pdc: Add width field to struct pdc_model
clk: qcom: ipq8074: drop the CLK_SET_RATE_PARENT flag from PLL clocks
clk: qcom: ipq6018: drop the CLK_SET_RATE_PARENT flag from PLL clocks
mmc: vub300: fix an error code
mmc: sdhci_am654: fix start loop index for TAP value parsing
PCI/ASPM: Fix L1 substate handling in aspm_attr_store_common()
arm64: dts: qcom: ipq6018: Fix hwlock index for SMEM
PM: hibernate: Use __get_safe_page() rather than touching the list
PM: hibernate: Clean up sync_read handling in snapshot_write_next()
rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects
btrfs: don't arbitrarily slow down delalloc if we're committing
firmware: qcom_scm: use 64-bit calling convention only when client is 64-bit
ima: detect changes to the backing overlay file
wifi: ath11k: fix temperature event locking
wifi: ath11k: fix dfs radar event locking
wifi: ath11k: fix htt pktlog locking
mmc: meson-gx: Remove setting of CMD_CFG_ERROR
genirq/generic_chip: Make irq_remove_generic_chip() irqdomain aware
PCI: keystone: Don't discard .remove() callback
PCI: keystone: Don't discard .probe() callback
jbd2: fix potential data lost in recovering journal raced with synchronizing fs bdev
quota: explicitly forbid quota files from being encrypted
kernel/reboot: emergency_restart: Set correct system_state
i2c: core: Run atomic i2c xfer when !preemptible
mcb: fix error handling for different scenarios when parsing
dmaengine: stm32-mdma: correct desc prep when channel running
mm/cma: use nth_page() in place of direct struct page manipulation
mm/memory_hotplug: use pfn math in place of direct struct page manipulation
mtd: cfi_cmdset_0001: Byte swap OTP info
i3c: master: cdns: Fix reading status register
parisc: Prevent booting 64-bit kernels on PA1.x machines
parisc/pgtable: Do not drop upper 5 address bits of physical address
xhci: Enable RPM on controllers that support low-power states
ALSA: info: Fix potential deadlock at disconnection
ALSA: hda/realtek - Add Dell ALC295 to pin fall back table
ALSA: hda/realtek - Enable internal speaker of ASUS K6500ZC
serial: meson: remove redundant initialization of variable id
tty: serial: meson: retrieve port FIFO size from DT
serial: meson: Use platform_get_irq() to get the interrupt
tty: serial: meson: fix hard LOCKUP on crtscts mode
cpufreq: stats: Fix buffer overflow detection in trans_stats()
Bluetooth: btusb: Add Realtek RTL8852BE support ID 0x0cb8:0xc559
bluetooth: Add device 0bda:887b to device tables
bluetooth: Add device 13d3:3571 to device tables
Bluetooth: btusb: Add RTW8852BE device 13d3:3570 to device tables
Bluetooth: btusb: Add 0bda:b85b for Fn-Link RTL8852BE
PCI: exynos: Don't discard .remove() callback
arm64: dts: qcom: ipq6018: switch TCSR mutex to MMIO
arm64: dts: qcom: ipq6018: Fix tcsr_mutex register size
Revert ncsi: Propagate carrier gain/loss events to the NCSI controller
lsm: fix default return value for vm_enough_memory
lsm: fix default return value for inode_getsecctx
i2c: designware: Disable TX_EMPTY irq while waiting for block length byte
net: dsa: lan9303: consequently nested-lock physical MDIO
net: phylink: initialize carrier state at creation
i2c: i801: fix potential race in i801_block_transaction_byte_by_byte
f2fs: avoid format-overflow warning
media: lirc: drop trailing space from scancode transmit
media: sharp: fix sharp encoding
media: venus: hfi_parser: Add check to keep the number of codecs within range
media: venus: hfi: fix the check to handle session buffer requirement
media: venus: hfi: add checks to handle capabilities from firmware
nfsd: fix file memleak on client_opens_release
mm: kmem: drop __GFP_NOFAIL when allocating objcg vectors
media: qcom: camss: Fix vfe_get() error jump
Revert "net: r8169: Disable multicast filter for RTL8168H and RTL8107E"
ext4: apply umask if ACL support is disabled
ext4: correct offset of gdb backup in non meta_bg group to update_backups
ext4: correct return value of ext4_convert_meta_bg
ext4: correct the start block of counting reserved clusters
ext4: remove gdb backup copy for meta bg in setup_new_flex_group_blocks
drm/amd/pm: Handle non-terminated overdrive commands.
drm/amdgpu: fix error handling in amdgpu_bo_list_get()
drm/amd/display: Change the DMCUB mailbox memory location from FB to inbox
io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
tracing: Have trace_event_file have ref counters
netfilter: nftables: update table flags from the commit phase
netfilter: nf_tables: fix table flag updates
netfilter: nf_tables: disable toggling dormant table state more than once
interconnect: qcom: Add support for mask-based BCMs
Linux 5.10.202
Change-Id: I762bcd4848d9b87cbb4efe4104fe1685999dc0f7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 1640a0ef80f6d572725f5b0330038c18e98ea168 upstream.
When dealing with hugetlb pages, manipulating struct page pointers
directly can get to wrong struct page, since struct page is not guaranteed
to be contiguous on SPARSEMEM without VMEMMAP. Use pfn calculation to
handle it properly.
Without the fix, a wrong number of page might be skipped. Since skip cannot be
negative, scan_movable_page() will end early and might miss a movable page with
-ENOENT. This might fail offline_pages(). No bug is reported. The fix comes
from code inspection.
Link: https://lkml.kernel.org/r/20230913201248.452081-4-zi.yan@sent.com
Fixes: eeb0efd071 ("mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 786dee864804f8e851cf0f258df2ccbb4ee03d80 upstream.
When offlining memory the system can attempt to migrate a lot of pages, if
there are problems with migration this can flood the logs. Printing all
the data hogs the CPU and cause some RT threads to run for a long time,
which may have some bad consequences.
Rate limit the page migration warnings in order to avoid this.
Link: https://lkml.kernel.org/r/20210505140542.24935-1-georgi.djakov@linaro.org
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <pjy@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.185
lib: cleanup kstrto*() usage
kernel.h: split out kstrtox() and simple_strtox() to a separate header
test_firmware: Use kstrtobool() instead of strtobool()
test_firmware: prevent race conditions by a correct implementation of locking
test_firmware: fix a memory leak with reqs buffer
power: supply: ab8500: Fix external_power_changed race
power: supply: sc27xx: Fix external_power_changed race
power: supply: bq27xxx: Use mod_delayed_work() instead of cancel() + schedule()
ARM: dts: vexpress: add missing cache properties
tools: gpio: fix debounce_period_us output of lsgpio
power: supply: Ratelimit no data debug output
platform/x86: asus-wmi: Ignore WMI events with codes 0x7B, 0xC0
regulator: Fix error checking for debugfs_create_dir
irqchip/gic-v3: Disable pseudo NMIs on Mediatek devices w/ firmware issues
power: supply: Fix logic checking if system is running from battery
btrfs: scrub: try harder to mark RAID56 block groups read-only
btrfs: handle memory allocation failure in btrfs_csum_one_bio
ASoC: soc-pcm: test if a BE can be prepared
parisc: Improve cache flushing for PCXL in arch_sync_dma_for_cpu()
parisc: Flush gatt writes and adjust gatt mask in parisc_agp_mask_memory()
MIPS: Alchemy: fix dbdma2
mips: Move initrd_start check after initrd address sanitisation.
ASoC: dwc: move DMA init to snd_soc_dai_driver probe()
xen/blkfront: Only check REQ_FUA for writes
drm:amd:amdgpu: Fix missing buffer object unlock in failure path
irqchip/gic: Correctly validate OF quirk descriptors
io_uring: hold uring mutex around poll removal
epoll: ep_autoremove_wake_function should use list_del_init_careful
ocfs2: fix use-after-free when unmounting read-only filesystem
ocfs2: check new file size on fallocate call
nios2: dts: Fix tse_mac "max-frame-size" property
nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
kexec: support purgatories with .text.hot sections
x86/purgatory: remove PGO flags
powerpc/purgatory: remove PGO flags
nouveau: fix client work fence deletion race
RDMA/uverbs: Restrict usage of privileged QKEYs
net: usb: qmi_wwan: add support for Compal RXM-G1
ALSA: hda/realtek: Add a quirk for Compaq N14JP6
Remove DECnet support from kernel
USB: serial: option: add Quectel EM061KGL series
serial: lantiq: add missing interrupt ack
usb: dwc3: gadget: Reset num TRBs before giving back the request
RDMA/rtrs: Fix the last iu->buf leak in err path
spi: fsl-dspi: avoid SCK glitches with continuous transfers
netfilter: nfnetlink: skip error delivery on batch in case of ENOMEM
net: enetc: correct the indexes of highest and 2nd highest TCs
ping6: Fix send to link-local addresses with VRF.
net/sched: cls_u32: Fix reference counter leak leading to overflow
RDMA/rxe: Remove the unused variable obj
RDMA/rxe: Removed unused name from rxe_task struct
RDMA/rxe: Fix the use-before-initialization error of resp_pkts
iavf: remove mask from iavf_irq_enable_queues()
octeontx2-af: fixed resource availability check
RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functions
RDMA/cma: Always set static rate to 0 for RoCE
IB/uverbs: Fix to consider event queue closing also upon non-blocking mode
IB/isert: Fix dead lock in ib_isert
IB/isert: Fix possible list corruption in CMA handler
IB/isert: Fix incorrect release of isert connection
ipvlan: fix bound dev checking for IPv6 l3s mode
sctp: fix an error code in sctp_sf_eat_auth()
igb: fix nvm.ops.read() error handling
drm/nouveau: don't detect DSM for non-NVIDIA device
drm/nouveau/dp: check for NULL nv_connector->native_mode
drm/nouveau: add nv_encoder pointer check for NULL
ext4: drop the call to ext4_error() from ext4_get_group_info()
net/sched: cls_api: Fix lockup on flushing explicitly created chain
net: lapbether: only support ethernet devices
net: tipc: resize nlattr array to correct size
selftests/ptp: Fix timestamp printf format for PTP_SYS_OFFSET
afs: Fix vlserver probe RTT handling
cgroup: always put cset in cgroup_css_set_put_fork
rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
neighbour: Remove unused inline function neigh_key_eq16()
net: Remove unused inline function dst_hold_and_use()
net: Remove DECnet leftovers from flow.h.
neighbour: delete neigh_lookup_nodev as not used
batman-adv: Switch to kstrtox.h for kstrtou64
mmc: block: ensure error propagation for non-blk
mm/memory_hotplug: extend offline_and_remove_memory() to handle more than one memory block
nilfs2: reject devices with insufficient block count
media: dvbdev: Fix memleak in dvb_register_device
media: dvbdev: fix error logic at dvb_register_device()
media: dvb-core: Fix use-after-free due to race at dvb_register_device()
drm/i915/dg1: Wait for pcode/uncore handshake at startup
drm/i915/gen11+: Only load DRAM information from pcode
um: Fix build w/o CONFIG_PM_SLEEP
Linux 5.10.185
Change-Id: I05ba9c2e38c013c553c9f89e2a6b71ec9bdb0bd3
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
commit 8dc4bb58a146655eb057247d7c9d19e73928715b upstream.
virtio-mem soon wants to use offline_and_remove_memory() memory that
exceeds a single Linux memory block (memory_block_size_bytes()). Let's
remove that restriction.
Let's remember the old state and try to restore that if anything goes
wrong. While re-onlining can, in general, fail, it's highly unlikely to
happen (usually only when a notifier fails to allocate memory, and these
are rather rare).
This will be used by virtio-mem to offline+remove memory ranges that are
bigger than a single memory block - for example, with a device block
size of 1 GiB (e.g., gigantic pages in the hypervisor) and a Linux memory
block size of 128MB.
While we could compress the state into 2 bit, using 8 bit is much
easier.
This handling is similar, but different to acpi_scan_try_to_offline():
a) We don't try to offline twice. I am not sure if this CONFIG_MEMCG
optimization is still relevant - it should only apply to ZONE_NORMAL
(where we have no guarantees). If relevant, we can always add it.
b) acpi_scan_try_to_offline() simply onlines all memory in case
something goes wrong. It doesn't restore previous online type. Let's do
that, so we won't overwrite what e.g., user space configured.
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20201112133815.13332-28-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.168
firewire: fix memory leak for payload of request subaction to IEC 61883-1 FCP region
bus: sunxi-rsb: Fix error handling in sunxi_rsb_init()
bpf: Fix incorrect state pruning for <8B spill/fill
powerpc/imc-pmu: Revert nest_init_lock to being a mutex
bpf: Fix a possible task gone issue with bpf_send_signal[_thread]() helpers
ALSA: hda/via: Avoid potential array out-of-bound in add_secret_dac_path()
bpf: Support <8-byte scalar spill and refill
bpf: Fix to preserve reg parent/live fields when copying range info
bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listener
arm64: dts: imx8mm: Fix pad control for UART1_DTE_RX
drm/vc4: hdmi: make CEC adapter name unique
scsi: Revert "scsi: core: map PQ=1, PDT=other values to SCSI_SCAN_TARGET_PRESENT"
vhost/net: Clear the pending messages when the backend is removed
WRITE is "data source", not destination...
READ is "data destination", not source...
fix iov_iter_bvec() "direction" argument
fix "direction" argument of iov_iter_kvec()
virtio-net: execute xdp_do_flush() before napi_complete_done()
sfc: correctly advertise tunneled IPv6 segmentation
net: phy: dp83822: Fix null pointer access on DP83825/DP83826 devices
netrom: Fix use-after-free caused by accept on already connected socket
netfilter: br_netfilter: disable sabotage_in hook after first suppression
squashfs: harden sanity check in squashfs_read_xattr_id_table
net: phy: meson-gxl: Add generic dummy stubs for MMD register access
igc: return an error if the mac type is unknown in igc_ptp_systim_to_hwtstamp()
can: j1939: fix errant WARN_ON_ONCE in j1939_session_deactivate
ata: libata: Fix sata_down_spd_limit() when no link speed is reported
selftests: net: udpgso_bench_rx: Fix 'used uninitialized' compiler warning
selftests: net: udpgso_bench_rx/tx: Stop when wrong CLI args are provided
selftests: net: udpgso_bench: Fix racing bug between the rx/tx programs
selftests: net: udpgso_bench_tx: Cater for pending datagrams zerocopy benchmarking
virtio-net: Keep stop() to follow mirror sequence of open()
net: openvswitch: fix flow memory leak in ovs_flow_cmd_new
efi: fix potential NULL deref in efi_mem_reserve_persistent
qede: add netpoll support for qede driver
qede: execute xdp_do_flush() before napi_complete_done()
i2c: mxs: suppress probe-deferral error message
scsi: target: core: Fix warning on RT kernels
scsi: iscsi_tcp: Fix UAF during login when accessing the shost ipaddress
i2c: rk3x: fix a bunch of kernel-doc warnings
platform/x86: dell-wmi: Add a keymap for KEY_MUTE in type 0x0010 table
net/x25: Fix to not accept on connected socket
iio: adc: stm32-dfsdm: fill module aliases
usb: dwc3: dwc3-qcom: Fix typo in the dwc3 vbus override API
usb: dwc3: qcom: enable vbus override when in OTG dr-mode
usb: gadget: f_fs: Fix unbalanced spinlock in __ffs_ep0_queue_wait
vc_screen: move load of struct vc_data pointer in vcs_read() to avoid UAF
Input: i8042 - move __initconst to fix code styling warning
Input: i8042 - merge quirk tables
Input: i8042 - add TUXEDO devices to i8042 quirk tables
Input: i8042 - add Clevo PCX0DX to i8042 quirk table
fbcon: Check font dimension limits
net: qrtr: free memory on error path in radix_tree_insert()
watchdog: diag288_wdt: do not use stack buffers for hardware data
watchdog: diag288_wdt: fix __diag288() inline assembly
ALSA: hda/realtek: Add Acer Predator PH315-54
efi: Accept version 2 of memory attributes table
iio: hid: fix the retval in accel_3d_capture_sample
iio: adc: berlin2-adc: Add missing of_node_put() in error path
iio:adc:twl6030: Enable measurements of VUSB, VBAT and others
iio: imu: fxos8700: fix ACCEL measurement range selection
iio: imu: fxos8700: fix incomplete ACCEL and MAGN channels readback
iio: imu: fxos8700: fix IMU data bits returned to user space
iio: imu: fxos8700: fix map label of channel type to MAGN sensor
iio: imu: fxos8700: fix swapped ACCEL and MAGN channels readback
iio: imu: fxos8700: fix incorrect ODR mode readback
iio: imu: fxos8700: fix failed initialization ODR mode assignment
iio: imu: fxos8700: remove definition FXOS8700_CTRL_ODR_MIN
iio: imu: fxos8700: fix MAGN sensor scale and unit
nvmem: qcom-spmi-sdam: fix module autoloading
parisc: Fix return code of pdc_iodc_print()
parisc: Wire up PTRACE_GETREGS/PTRACE_SETREGS for compat case
riscv: disable generation of unwind tables
mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses
fpga: stratix10-soc: Fix return value check in s10_ops_write_init()
mm/swapfile: add cond_resched() in get_swap_pages()
Squashfs: fix handling and sanity checking of xattr_ids count
drm/i915: Fix potential bit_17 double-free
nvmem: core: initialise nvmem->id early
nvmem: core: fix cell removal on error
serial: 8250_dma: Fix DMA Rx completion race
serial: 8250_dma: Fix DMA Rx rearm race
fbdev: smscufx: fix error handling code in ufx_usb_probe
f2fs: fix to do sanity check on i_extra_isize in is_alive()
wifi: brcmfmac: Check the count value of channel spec to prevent out-of-bounds reads
nvmem: core: Fix a conflict between MTD and NVMEM on wp-gpios property
bpf: Do not reject when the stack read size is different from the tracked scalar size
iio:adc:twl6030: Enable measurement of VAC
mm/migration: return errno when isolate_huge_page failed
migrate: hugetlb: check for hugetlb shared PMD in node migration
btrfs: limit device extents to the device size
btrfs: zlib: zero-initialize zlib workspace
ALSA: hda/realtek: Add Positivo N14KP6-TG
ALSA: emux: Avoid potential array out-of-bound in snd_emux_xg_control()
ALSA: hda/realtek: Fix the speaker output on Samsung Galaxy Book2 Pro 360
tracing: Fix poll() and select() do not work on per_cpu trace_pipe and trace_pipe_raw
of/address: Return an error when no valid dma-ranges are found
can: j1939: do not wait 250 ms if the same addr was already claimed
xfrm: compat: change expression for switch in xfrm_xlate64
IB/hfi1: Restore allocated resources on failed copyout
xfrm/compat: prevent potential spectre v1 gadget in xfrm_xlate32_attr()
IB/IPoIB: Fix legacy IPoIB due to wrong number of queues
RDMA/usnic: use iommu_map_atomic() under spin_lock()
xfrm: fix bug with DSCP copy to v6 from v4 tunnel
bonding: fix error checking in bond_debug_reregister()
net: phy: meson-gxl: use MMD access dummy stubs for GXL, internal PHY
ionic: clean interrupt before enabling queue to avoid credit race
uapi: add missing ip/ipv6 header dependencies for linux/stddef.h
ice: Do not use WQ_MEM_RECLAIM flag for workqueue
net: mscc: ocelot: fix VCAP filters not matching on MAC with "protocol 802.1Q"
net/mlx5e: IPoIB, Show unknown speed instead of error
net/mlx5: fw_tracer, Clear load bit when freeing string DBs buffers
net/mlx5: fw_tracer, Zero consumer index when reloading the tracer
rds: rds_rm_zerocopy_callback() use list_first_entry()
selftests: forwarding: lib: quote the sysctl values
ALSA: pci: lx6464es: fix a debug loop
pinctrl: aspeed: Fix confusing types in return value
pinctrl: single: fix potential NULL dereference
spi: dw: Fix wrong FIFO level setting for long xfers
pinctrl: intel: Restore the pins that used to be in Direct IRQ mode
cifs: Fix use-after-free in rdata->read_into_pages()
net: USB: Fix wrong-direction WARNING in plusb.c
btrfs: free device in btrfs_close_devices for a single device filesystem
usb: core: add quirk for Alcor Link AK9563 smartcard reader
usb: typec: altmodes/displayport: Fix probe pin assign check
ceph: flush cap releases when the session is flushed
riscv: Fixup race condition on PG_dcache_clean in flush_icache_pte
arm64: dts: meson-gx: Make mmc host controller interrupts level-sensitive
arm64: dts: meson-g12-common: Make mmc host controller interrupts level-sensitive
arm64: dts: meson-axg: Make mmc host controller interrupts level-sensitive
Fix page corruption caused by racy check in __free_pages
Linux 5.10.168
Change-Id: I98d1e73edfaab3ce45c15283ae0964527d5e547e
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
[ Upstream commit 7ce82f4c3f3ead13a9d9498768e3b1a79975c4d8 ]
We might fail to isolate huge page due to e.g. the page is under
migration which cleared HPageMigratable. We should return errno in this
case rather than always return 1 which could confuse the user, i.e. the
caller might think all of the memory is migrated while the hugetlb page is
left behind. We make the prototype of isolate_huge_page consistent with
isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page
to isolate_hugetlb as suggested by Muchun to improve the readability.
Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
Fixes: e8db67eb0d ("mm: migrate: move_pages() supports thp migration")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Suggested-by: Huang Ying <ying.huang@intel.com>
Reported-by: kernel test robot <lkp@intel.com> (build error)
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: 73bdf65ea748 ("migrate: hugetlb: check for hugetlb shared PMD in node migration")
Signed-off-by: Sasha Levin <sashal@kernel.org>
If offline_pages failed after lru_cache_disable(), it forgot to do
lru_cache_enable() in error path. So we would have lru cache disabled
permanently in this case.
Link: https://lkml.kernel.org/r/20210821094246.10149-3-linmiaohe@huawei.com
Fixes: d479960e44f2 ("mm: disable LRU pagevec during the migration temporarily")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 946746d1ad921e5f493b536533dda02ea22ca609)
[connor: move after appropriate label for 5.10]
Bug: 187129171
Signed-off-by: Connor O'Brien <connoro@google.com>
Change-Id: I45e16bfe4812dab86e2e678138599fcaf2b0cd8d
If add_memory_subsection() is called with a size of
memory_block_size_bytes, it calls into add_memory(), which declares
the region as system ram, and adds it to the buddy allocator. This
is inconsistent with the behavior of add_memory_subsection() for
other sizes, for which it does not add the memory to buddy and
instead reserves it for the caller's private use.
Bug: 210008865
Fixes: 417ac617ea ("ANDROID: mm/memory_hotplug: implement {add/remove}_memory_subsection")
Change-Id: Iefb69b0b4e96af670d0e65c325a9538d14b460e3
Signed-off-by: Patrick Daly <quic_pdaly@quicinc.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmFLB58ACgkQONu9yGCS
aT7uAhAAraX1qVdfkq3g4w9jaURkiR/Z1LbPqjMswIojApmcXV3e0mUtEWxBBEJT
o/uId9KUr/OrfAN++DO+9iLmPIjZHW+49I+CeHcDS95PdeWSKxZ3HBPUqK8uX8tU
QdPjh2PVL7Kkzbgi65RWeTOERHLlEj6qo21xu4W9QuwmZZojEB8xVP9BB/U6p84Q
KYPX+zyGUo9NgsaVTwOXxZzyT8JgcfEUKg0F4nHeNJxEh106dN2XgZpq+GvB7Hq7
koDy/dg2I4hS++Ds/Fjz9wQrgcvw3WSo3pUZzyTS2zfrcefLjqDVWzSY/1Ttd4b9
B7Lw7WiEgbX75EFXX8RgCrmNSsNW8pnFyR2URoOfFD6ckJNj/XCPVV+tfiSfAnH5
vlOQOicjtr/yFeOfhre8U4pTBWXk9BYscJyzNp/wScaExHXXkI+HYi92cbbTWKCU
/ig1RmIqTATdFAXjukHUqt6QzI1iqPtTQCGd99AhaBGq0Hb8OK2HponzBOpQvAHb
xaEMSL9YsJhoAux+n+R95FQKCk2KrjgX8Bczyuj2OAL5jeST10fWrYe6DflSta5K
9fNWmyjegpQEcmtDidQ7HH81Fy793S/34R8FQ4y1zPEi1A0yH//FO2lA8dS4Rdvo
ho7l7W+Hd/Ut67P0b7OFz2znw0T4OqMF6Il30q88pOfcis2TfNs=
=2XgB
-----END PGP SIGNATURE-----
Merge 5.10.68 into android12-5.10-lts
Changes in 5.10.68
drm/bridge: lt9611: Fix handling of 4k panels
btrfs: fix upper limit for max_inline for page size 64K
io_uring: ensure symmetry in handling iter types in loop_rw_iter()
xen: reset legacy rtc flag for PV domU
bnx2x: Fix enabling network interfaces without VFs
arm64/sve: Use correct size when reinitialising SVE state
PM: base: power: don't try to use non-existing RTC for storing data
PCI: Add AMD GPU multi-function power dependencies
drm/amd/amdgpu: Increase HWIP_MAX_INSTANCE to 10
drm/etnaviv: return context from etnaviv_iommu_context_get
drm/etnaviv: put submit prev MMU context when it exists
drm/etnaviv: stop abusing mmu_context as FE running marker
drm/etnaviv: keep MMU context across runtime suspend/resume
drm/etnaviv: exec and MMU state is lost when resetting the GPU
drm/etnaviv: fix MMU context leak on GPU reset
drm/etnaviv: reference MMU context when setting up hardware state
drm/etnaviv: add missing MMU context put when reaping MMU mapping
s390/sclp: fix Secure-IPL facility detection
x86/pat: Pass valid address to sanitize_phys()
x86/mm: Fix kern_addr_valid() to cope with existing but not present entries
tipc: fix an use-after-free issue in tipc_recvmsg
ethtool: Fix rxnfc copy to user buffer overflow
net/{mlx5|nfp|bnxt}: Remove unnecessary RTNL lock assert
net-caif: avoid user-triggerable WARN_ON(1)
ptp: dp83640: don't define PAGE0
dccp: don't duplicate ccid when cloning dccp sock
net/l2tp: Fix reference count leak in l2tp_udp_recv_core
r6040: Restore MDIO clock frequency after MAC reset
tipc: increase timeout in tipc_sk_enqueue()
drm/rockchip: cdn-dp-core: Make cdn_dp_core_resume __maybe_unused
perf machine: Initialize srcline string member in add_location struct
net/mlx5: FWTrace, cancel work on alloc pd error flow
net/mlx5: Fix potential sleeping in atomic context
nvme-tcp: fix io_work priority inversion
events: Reuse value read using READ_ONCE instead of re-reading it
net: ipa: initialize all filter table slots
gen_compile_commands: fix missing 'sys' package
vhost_net: fix OoB on sendmsg() failure.
net/af_unix: fix a data-race in unix_dgram_poll
net: dsa: destroy the phylink instance on any error in dsa_slave_phy_setup
x86/uaccess: Fix 32-bit __get_user_asm_u64() when CC_HAS_ASM_GOTO_OUTPUT=y
tcp: fix tp->undo_retrans accounting in tcp_sacktag_one()
selftest: net: fix typo in altname test
qed: Handle management FW error
udp_tunnel: Fix udp_tunnel_nic work-queue type
dt-bindings: arm: Fix Toradex compatible typo
ibmvnic: check failover_pending in login response
KVM: PPC: Book3S HV: Tolerate treclaim. in fake-suspend mode changing registers
bnxt_en: make bnxt_free_skbs() safe to call after bnxt_free_mem()
net: hns3: pad the short tunnel frame before sending to hardware
net: hns3: change affinity_mask to numa node range
net: hns3: disable mac in flr process
net: hns3: fix the timing issue of VF clearing interrupt sources
mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()
dt-bindings: mtd: gpmc: Fix the ECC bytes vs. OOB bytes equation
mfd: db8500-prcmu: Adjust map to reality
PCI: Add ACS quirks for NXP LX2xx0 and LX2xx2 platforms
fuse: fix use after free in fuse_read_interrupt()
PCI: tegra194: Fix handling BME_CHGED event
PCI: tegra194: Fix MSI-X programming
PCI: tegra: Fix OF node reference leak
mfd: Don't use irq_create_mapping() to resolve a mapping
PCI: rcar: Fix runtime PM imbalance in rcar_pcie_ep_probe()
tracing/probes: Reject events which have the same name of existing one
PCI: cadence: Use bitfield for *quirk_retrain_flag* instead of bool
PCI: cadence: Add quirk flag to set minimum delay in LTSSM Detect.Quiet state
PCI: j721e: Add PCIe support for J7200
PCI: j721e: Add PCIe support for AM64
PCI: Add ACS quirks for Cavium multi-function devices
watchdog: Start watchdog in watchdog_set_last_hw_keepalive only if appropriate
octeontx2-af: Add additional register check to rvu_poll_reg()
Set fc_nlinfo in nh_create_ipv4, nh_create_ipv6
net: usb: cdc_mbim: avoid altsetting toggling for Telit LN920
block, bfq: honor already-setup queue merges
PCI: ibmphp: Fix double unmap of io_mem
ethtool: Fix an error code in cxgb2.c
NTB: Fix an error code in ntb_msit_probe()
NTB: perf: Fix an error code in perf_setup_inbuf()
s390/bpf: Fix optimizing out zero-extensions
s390/bpf: Fix 64-bit subtraction of the -0x80000000 constant
s390/bpf: Fix branch shortening during codegen pass
mfd: axp20x: Update AXP288 volatile ranges
backlight: ktd253: Stabilize backlight
PCI: of: Don't fail devm_pci_alloc_host_bridge() on missing 'ranges'
PCI: iproc: Fix BCMA probe resource handling
netfilter: Fix fall-through warnings for Clang
netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
KVM: arm64: Restrict IPA size to maximum 48 bits on 4K and 16K page size
PCI: Fix pci_dev_str_match_path() alloc while atomic bug
mfd: tqmx86: Clear GPIO IRQ resource when no IRQ is set
tracing/boot: Fix a hist trigger dependency for boot time tracing
mtd: mtdconcat: Judge callback existence based on the master
mtd: mtdconcat: Check _read, _write callbacks existence before assignment
KVM: arm64: Fix read-side race on updates to vcpu reset state
KVM: arm64: Handle PSCI resets before userspace touches vCPU state
PCI: Sync __pci_register_driver() stub for CONFIG_PCI=n
mtd: rawnand: cafe: Fix a resource leak in the error handling path of 'cafe_nand_probe()'
ARC: export clear_user_page() for modules
perf unwind: Do not overwrite FEATURE_CHECK_LDFLAGS-libunwind-{x86,aarch64}
perf bench inject-buildid: Handle writen() errors
gpio: mpc8xxx: Fix a resources leak in the error handling path of 'mpc8xxx_probe()'
gpio: mpc8xxx: Use 'devm_gpiochip_add_data()' to simplify the code and avoid a leak
net: dsa: tag_rtl4_a: Fix egress tags
selftests: mptcp: clean tmp files in simult_flows
net: hso: add failure handler for add_net_device
net: dsa: b53: Fix calculating number of switch ports
net: dsa: b53: Set correct number of ports in the DSA struct
netfilter: socket: icmp6: fix use-after-scope
fq_codel: reject silly quantum parameters
qlcnic: Remove redundant unlock in qlcnic_pinit_from_rom
ip_gre: validate csum_start only on pull
net: dsa: b53: Fix IMP port setup on BCM5301x
bnxt_en: fix stored FW_PSID version masks
bnxt_en: Fix asic.rev in devlink dev info command
bnxt_en: log firmware debug notifications
bnxt_en: Consolidate firmware reset event logging.
bnxt_en: Convert to use netif_level() helpers.
bnxt_en: Improve logging of error recovery settings information.
bnxt_en: Fix possible unintended driver initiated error recovery
mfd: lpc_sch: Partially revert "Add support for Intel Quark X1000"
mfd: lpc_sch: Rename GPIOBASE to prevent build error
net: renesas: sh_eth: Fix freeing wrong tx descriptor
x86/mce: Avoid infinite loop for copy from user recovery
bnxt_en: Fix error recovery regression
net: dsa: bcm_sf2: Fix array overrun in bcm_sf2_num_active_ports()
Linux 5.10.68
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I542f48f8de516dcabce91d3d399583483aba0da7
commit 7cf209ba8a86410939a24cb1aeb279479a7e0ca6 upstream.
Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory"
These are all cleanups and one fix previously sent as part of [1]:
[PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory
groups.
These patches make sense even without the other series, therefore I pulled
them out to make the other series easier to digest.
[1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com
This patch (of 4):
Checkpatch complained on a follow-up patch that we are using "unsigned"
here, which defaults to "unsigned int" and checkpatch is correct.
As we will search for a fitting zone using the wrong pfn, we might end
up onlining memory to one of the special kernel zones, such as ZONE_DMA,
which can end badly as the onlined memory does not satisfy properties of
these zones.
Use "unsigned long" instead, just as we do in other places when handling
PFNs. This can bite us once we have physical addresses in the range of
multiple TB.
Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com
Fixes: e5e6893026 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
alloc_contig_range is supposed to work on max(MAX_ORDER_NR_PAGES,
or pageblock_nr_pages) granularity aligned range. If it fails
at a page and return error to user, user doesn't know what page
makes the allocation failure and keep retrying another allocation
with new range including the failed page and encountered error
again and again until it could escape the out of the granularity
block. Instead, let's make CMA aware of what pfn was troubled in
previous trial and then continue to work new pageblock out of the
failed page so it doesn't see the repeated error repeatedly.
Currently, this option works for only __GFP_NORETRY case for
safe for existing CMA users.
Bug: 192475091
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I0959c9df3d4b36408a68920abbb4d52d31026079
Currently, pcplists are drained during set_migratetype_isolate() which
means once per pageblock processed start_isolate_page_range(). This is
somewhat wasteful. Moreover, the callers might need different guarantees,
and the draining is currently prone to races and does not guarantee that
no page from isolated pageblock will end up on the pcplist after the
drain.
Better guarantees are added by later patches and require explicit actions
by page isolation users that need them. Thus it makes sense to move the
current imperfect draining to the callers also as a preparation step.
Link: https://lkml.kernel.org/r/20201111092812.11329-7-vbabka@suse.cz
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 7612921f2376d51d020ae2f06ffb7da40422b75b)
Change-Id: I10fc574024606c499ddda325d188d181aff7ceec
A memory section may contain a mix of memory present at bootup, and
memory added dynamically via add_memory_subsection(). Fix
remove_memory_subsection to not return an error for this situation.
Bug: 190151165i
Fixes: 417ac617ea ("ANDROID: mm/memory_hotplug: implement {add/remove}_memory_subsection")
Change-Id: I20314fe136d6e5b56a9275be7e2d130d18bd79a5
Signed-off-by: Patrick Daly <pdaly@codeaurora.org>
When offlining memory the system can attempt to migrate a lot of pages, if
there are problems with migration this can flood the logs. Printing all
the data hogs the CPU and cause some RT threads to run for a long time,
which may have some bad consequences.
Rate limit the page migration warnings in order to avoid this.
Link: https://lkml.kernel.org/r/20210505140542.24935-1-georgi.djakov@linaro.org
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bug: 187917022
(cherry picked from commit 6be9af0e0f77 https: //github.com/hnaz/linux-mm.git v5.13-rc1-mmots-2021-05-10-22-15)
Change-Id: I6754d380f8c31cf9b699c242edacf593807ef2b1
Signed-off-by: Chris Goldsworthy <quic_cgoldswo@quicinc.com>
Use the correct printk length specifier [%llx] for u64 variables.
This fixes several warnings of the following type:
mm/memory_hotplug.c: In function ‘add_memory_subsection’:
./include/linux/kern_levels.h:5:18: warning: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘u64’ {aka ‘long long unsigned int’} [-Wformat=]
mm/memory_hotplug.c:1144:25: note: format string is defined here
pr_err("%s: start 0x%lx size 0x%lx not aligned to subsection size\n",
~~^
%llx
Bug: 183339614
Fixes: 417ac617ea (ANDROID: mm/memory_hotplug: implement {add/remove}_memory_subsection)
Reported-by: kernelci.org bot <bot@kernelci.org>
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Change-Id: Iee89be07cb40513661e336dd0671f65b1161e830
Patch series "arch, mm: improve robustness of direct map manipulation", v7.
During recent discussion about KVM protected memory, David raised a
concern about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC
scope [1].
Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
possible that __kernel_map_pages() would fail, but since this function is
void, the failure will go unnoticed.
Moreover, there's lack of consistency of __kernel_map_pages() semantics
across architectures as some guard this function with #ifdef
DEBUG_PAGEALLOC, some refuse to update the direct map if page allocation
debugging is disabled at run time and some allow modifying the direct map
regardless of DEBUG_PAGEALLOC settings.
This set straightens this out by restoring dependency of
__kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
accordingly.
Since currently the only user of __kernel_map_pages() outside
DEBUG_PAGEALLOC is hibernation, it is updated to make direct map accesses
there more explicit.
[1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a40348269d@redhat.com
This patch (of 4):
When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel
direct mapping after free_pages(). The pages than need to be mapped back
before they could be used. Theese mapping operations use
__kernel_map_pages() guarded with with debug_pagealloc_enabled().
The only place that calls __kernel_map_pages() without checking whether
DEBUG_PAGEALLOC is enabled is the hibernation code that presumes
availability of this function when ARCH_HAS_SET_DIRECT_MAP is set. Still,
on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is not
enabled but set_direct_map_invalid_noflush() may render some pages not
present in the direct map and hibernation code won't be able to save such
pages.
To make page allocation debugging and hibernation interaction more robust,
the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be
made more explicit.
Start with combining the guard condition and the call to
__kernel_map_pages() into debug_pagealloc_map_pages() and
debug_pagealloc_unmap_pages() functions to emphasize that
__kernel_map_pages() should not be called without DEBUG_PAGEALLOC and use
these new functions to map/unmap pages when page allocation debugging is
enabled.
Link: https://lkml.kernel.org/r/20201109192128.960-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20201109192128.960-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 77bc7fd607dee2ffb28daff6d0dd8ae42af61ea8
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git akpm)
Bug: 182930667
Signed-off-by: Alexander Potapenko <glider@google.com>
Change-Id: I9f0dac574bc3a7ea7d88bff051b77eca19610ce9
LRU pagevec holds refcount of pages until the pagevec are drained.
It could prevent migration since the refcount of the page is greater
than the expection in migration logic. To mitigate the issue,
callers of migrate_pages drains LRU pagevec via migrate_prep or
lru_add_drain_all before migrate_pages call.
However, it's not enough because pages coming into pagevec after the
draining call still could stay at the pagevec so it could keep
preventing page migration. Since some callers of migrate_pages have
retrial logic with LRU draining, the page would migrate at next trail
but it is still fragile in that it doesn't close the fundamental race
between upcoming LRU pages into pagvec and migration so the migration
failure could cause contiguous memory allocation failure in the end.
To close the race, this patch disables lru caches(i.e, pagevec)
during ongoing migration until migrate is done.
Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode(it is about a fallback to a
sync migration) with below debug code.
int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..
if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
dump_page(page, "fail to migrate");
}
The test was repeating android apps launching with cma allocation
in background every five seconds. Total cma allocation count was
about 500 during the testing. With this patch, the dump_page count
was reduced from 400 to 30.
The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure. This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once. This is also in
line with pcp allocator cache which are disabled for the offlining as
well.
Bug: 180018981
Link: https://lore.kernel.org/linux-mm/20210319175127.886124-1-minchan@kernel.org/
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I838c63d11ca49a8734d8b37a7d5272ab6b802f9f
commit d15dfd31384ba3cb93150e5f87661a76fa419f74 upstream.
In a system supporting MTE, the linear map must allow reading/writing
allocation tags by setting the memory type as Normal Tagged. Currently,
this is only handled for memory present at boot. Hotplugged memory uses
Normal non-Tagged memory.
Introduce pgprot_mhp() for hotplugged memory and use it in
add_memory_resource(). The arm64 code maps pgprot_mhp() to
pgprot_tagged().
Note that ZONE_DEVICE memory should not be mapped as Tagged and
therefore setting the memory type in arch_add_memory() is not feasible.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Fixes: 0178dc7613 ("arm64: mte: Use Normal Tagged attributes for the linear map")
Reported-by: Patrick Daly <pdaly@codeaurora.org>
Tested-by: Patrick Daly <pdaly@codeaurora.org>
Link: https://lore.kernel.org/r/1614745263-27827-1-git-send-email-pdaly@codeaurora.org
Cc: <stable@vger.kernel.org> # 5.10.x
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20210309122601.5543-1-catalin.marinas@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
LRU pagevec holds refcount of pages until the pagevec are drained.
It could prevent migration since the refcount of the page is greater
than the expection in migration logic. To mitigate the issue,
callers of migrate_pages drains LRU pagevec via migrate_prep or
lru_add_drain_all before migrate_pages call.
However, it's not enough because pages coming into pagevec after the
draining call still could stay at the pagevec so it could keep
preventing page migration. Since some callers of migrate_pages have
retrial logic with LRU draining, the page would migrate at next trail
but it is still fragile in that it doesn't close the fundamental race
between upcoming LRU pages into pagvec and migration so the migration
failure could cause contiguous memory allocation failure in the end.
To close the race, this patch disables lru caches(i.e, pagevec)
during ongoing migration until migrate is done.
Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode(it is about a fallback to a
sync migration) with below debug code.
int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..
if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
printk(KERN_ERR, "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
dump_page(page, "fail to migrate");
}
The test was repeating android apps launching with cma allocation
in background every five seconds. Total cma allocation count was
about 500 during the testing. With this patch, the dump_page count
was reduced from 400 to 30.
The new interface is also useful for memory hotplug which currently
drains lru pcp caches after each migration failure. This is rather
suboptimal as it has to disrupt others running during the operation.
With the new interface the operation happens only once. This is also in
line with pcp allocator cache which are disabled for the offlining as
well.
Bug: 180018981
Link: https://lore.kernel.org/linux-mm/20210310161429.399432-2-minchan@kernel.org/
[minchan: Resolved conflict in mm/memory_hotplug.c, mm/swap.c]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: Ie1e09cf26e3105b674a9aed4ac65070efee608af
In a system supporting MTE, the linear map must allow reading/writing
allocation tags by setting the memory type as Normal Tagged. Currently,
this is only handled for memory present at boot. Hotplugged memory uses
Normal non-Tagged memory.
Introduce pgprot_mhp() for hotplugged memory and use it in
add_memory_resource(). The arm64 code maps pgprot_mhp() to
pgprot_tagged().
Note that ZONE_DEVICE memory should not be mapped as Tagged and
therefore setting the memory type in arch_add_memory() is not feasible.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Fixes: 0178dc7613 ("arm64: mte: Use Normal Tagged attributes for the linear map")
Reported-by: Patrick Daly <pdaly@codeaurora.org>
Tested-by: Patrick Daly <pdaly@codeaurora.org>
Link: https://lore.kernel.org/r/1614745263-27827-1-git-send-email-pdaly@codeaurora.org
Cc: <stable@vger.kernel.org> # 5.10.x
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20210309122601.5543-1-catalin.marinas@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit d15dfd31384ba3cb93150e5f87661a76fa419f74
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/fixes)
Signed-off-by: Will Deacon <willdeacon@google.com>
Bug: 181697456
Change-Id: Id1077c49637d4791fee781dd4c013b112e77ae0d
When removing memory subsections, we need to ensure that subsections are
not in use or online before we try to remove them. Previously, we used
test_pages_isolated() which works only if pages were onlined and added to
a zone before. However, we should allow subsection addition and removal
without them being added to buddy i.e. onlined. test_pages_isolated() would
not help in such cases. Instead, check for valid offlined sections before
the subsections within can be allowed to be removed.
Bug: 170460867
Fixes: 417ac617ea ("ANDROID: mm/memory_hotplug: implement {add/remove}_memory_subsection")
Change-Id: I09b46582cd21e9e5370ebd5434209530edb89e58
Signed-off-by: Sudarshan Rajagopalan <sudaraja@codeaurora.org>
Memory hotplugging is allowed only for multiples of section sizes.
Section size could be huge (ex. 1GB for arm64 targets) thus limiting
to add/remove lower chunks of memory. This patch allows drivers to add
memory of subsection sizes which are then added to memblock. This does
not create a separate memblock device nodes for newly added subsections
until the whole of the memblock section is added.
Bug: 170460867
Change-Id: I15749b5320340cba4d526e7ddb26a9cd6029c690
Signed-off-by: Sudarshan Rajagopalan <sudaraja@codeaurora.org>
commit dc2da7b45ffe954a0090f5d0310ed7b0b37d2bd2 upstream.
VMware observed a performance regression during memmap init on their
platform, and bisected to commit 73a6e474cb ("mm: memmap_init:
iterate over memblock regions rather that check each PFN") causing it.
Before the commit:
[0.033176] Normal zone: 1445888 pages used for memmap
[0.033176] Normal zone: 89391104 pages, LIFO batch:63
[0.035851] ACPI: PM-Timer IO Port: 0x448
With commit
[0.026874] Normal zone: 1445888 pages used for memmap
[0.026875] Normal zone: 89391104 pages, LIFO batch:63
[2.028450] ACPI: PM-Timer IO Port: 0x448
The root cause is the current memmap defer init doesn't work as expected.
Before, memmap_init_zone() was used to do memmap init of one whole zone,
to initialize all low zones of one numa node, but defer memmap init of
the last zone in that numa node. However, since commit 73a6e474cb,
function memmap_init() is adapted to iterater over memblock regions
inside one zone, then call memmap_init_zone() to do memmap init for each
region.
E.g, on VMware's system, the memory layout is as below, there are two
memory regions in node 2. The current code will mistakenly initialize the
whole 1st region [mem 0xab00000000-0xfcffffffff], then do memmap defer to
iniatialize only one memmory section on the 2nd region [mem
0x10000000000-0x1033fffffff]. In fact, we only expect to see that there's
only one memory section's memmap initialized. That's why more time is
costed at the time.
[ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[ 0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
[ 0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
[ 0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
[ 0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]
Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
down the real zone end pfn so that defer_init() can use it to judge
whether defer need be taken in zone wide.
Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
Fixes: commit 73a6e474cb ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reported-by: Rahul Gopakumar <gopakumarr@vmware.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]
Since commit 369ea8242c ("mm/rmap: update to new mmu_notifier semantic
v2"), the code to check the secondary MMU's page table access bit is
broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
secondary MMU's page table before the check. More specifically for those
secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start() like kvm.
However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the
absence of TTU_IGNORE_ACCESS and it explicitly performs the page table
access check before trying to unmap the page. So, at worst the reclaim
will miss accesses in a very short window if we remove page table access
check in unmapping code.
There is an unintented consequence of !(TTU_IGNORE_ACCESS) for the memcg
reclaim. From memcg reclaim the page_referenced() only account the
accesses from the processes which are in the same memcg of the target page
but the unmapping code is considering accesses from all the processes, so,
decreasing the effectiveness of memcg reclaim.
The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
code.
Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
Fixes: 369ea8242c ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
The core-mm has a default __weak implementation of phys_to_target_node()
to mirror the weak definition of memory_add_physaddr_to_nid(). That
symbol is exported for modules. However, while the export in
mm/memory_hotplug.c exported the symbol in the configuration cases of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=y
...and:
CONFIG_NUMA_KEEP_MEMINFO=n
CONFIG_MEMORY_HOTPLUG=y
...it failed to export the symbol in the case of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=n
Not only is that broken, but Christoph points out that the kernel should
not be exporting any __weak symbol, which means that
memory_add_physaddr_to_nid() example that phys_to_target_node() copied
is broken too.
Rework the definition of phys_to_target_node() and
memory_add_physaddr_to_nid() to not require weak symbols. Move to the
common arch override design-pattern of an asm header defining a symbol
to replace the default implementation.
The only common header that all memory_add_physaddr_to_nid() producing
architectures implement is asm/sparsemem.h. In fact, powerpc already
defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
Double-down on that observation and define phys_to_target_node() where
necessary in asm/sparsemem.h. An alternate consideration that was
discarded was to put this override in asm/numa.h, but that entangles
with the definition of MAX_NUMNODES relative to the inclusion of
linux/nodemask.h, and requires powerpc to grow a new header.
The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
now that the symbol is properly exported / stubbed in all combinations
of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.
[dan.j.williams@intel.com: v4]
Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
[dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: a035b6bf86 ("mm/memory_hotplug: introduce default phys_to_target_node() implementation")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To calculate the correct node to migrate the page for hotplug, we need to
check node id of the page. Wrapper for alloc_migration_target() exists
for this purpose.
However, Vlastimil informs that all migration source pages come from a
single node. In this case, we don't need to check the node id for each
page and we don't need to re-set the target nodemask for each page by
using the wrapper. Set up the migration_target_control once and use it
for all pages.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As we no longer shuffle via generic_online_page() and when undoing
isolation, we can simplify the comment.
We now effectively shuffle only once (properly) when onlining new memory.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Link: https://lkml.kernel.org/r/20201005121534.15649-6-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At boot time, or when doing memory hot-add operations, if the links in
sysfs can't be created, the system is still able to run, so just report
the error in the kernel log rather than BUG_ON and potentially make system
unusable because the callpath can be called with locks held.
Since the number of memory blocks managed could be high, the messages are
rate limited.
As a consequence, link_mem_sections() has no status to report anymore.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200915094143.79181-4-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"mem" in the name already indicates the root, similar to
release_mem_region() and devm_request_mem_region(). Make it implicit.
The only single caller always passes iomem_resource, other parents are not
applicable.
Suggested-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some add_memory*() users add memory in small, contiguous memory blocks.
Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
This can quickly result in a lot of memory resources, whereby the actual
resource boundaries are not of interest (e.g., it might be relevant for
DIMMs, exposed via /proc/iomem to user space). We really want to merge
added resources in this scenario where possible.
Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
either created within add_memory*() or passed via add_memory_resource()
shall be marked mergeable and merged with applicable siblings.
To implement that, we need a kernel/resource interface to mark selected
System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
merging.
Note: We really want to merge after the whole operation succeeded, not
directly when adding a resource to the resource tree (it would break
add_memory_resource() and require splitting resources again when the
operation failed - e.g., due to -ENOMEM).
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Julien Grall <julien@xen.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Leonardo Bras <leobras.c@gmail.com>
Cc: Libor Pechacek <lpechacek@suse.cz>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We soon want to pass flags, e.g., to mark added System RAM resources.
mergeable. Prepare for that.
This patch is based on a similar patch by Oscar Salvador:
https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.de
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Juergen Gross <jgross@suse.com> # Xen related part
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Libor Pechacek <lpechacek@suse.cz>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Leonardo Bras <leobras.c@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Julien Grall <julien@xen.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
IORESOURCE_MEM_DRIVER_MANAGED currently uses an unused PnP bit, which is
always set to 0 by hardware. This is far from beautiful (and confusing),
and the bit only applies to SYSRAM. So let's move it out of the
bus-specific (PnP) defined bits.
We'll add another SYSRAM specific bit soon. If we ever need more bits for
other purposes, we can steal some from "desc", or reshuffle/regroup what
we have.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Julien Grall <julien@xen.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Leonardo Bras <leobras.c@gmail.com>
Cc: Libor Pechacek <lpechacek@suse.cz>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Liu <wei.liu@kernel.org>
Link: https://lkml.kernel.org/r/20200911103459.10306-3-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "selective merging of system ram resources", v4.
Some add_memory*() users add memory in small, contiguous memory blocks.
Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
This can quickly result in a lot of memory resources, whereby the actual
resource boundaries are not of interest (e.g., it might be relevant for
DIMMs, exposed via /proc/iomem to user space). We really want to merge
added resources in this scenario where possible.
Resources are effectively stored in a list-based tree. Having a lot of
resources not only wastes memory, it also makes traversing that tree more
expensive, and makes /proc/iomem explode in size (e.g., requiring
kexec-tools to manually merge resources when creating a kdump header. The
current kexec-tools resource count limit does not allow for more than
~100GB of memory with a memory block size of 128MB on x86-64).
Let's allow to selectively merge system ram resources by specifying a new
flag for add_memory*(). Patch #5 contains a /proc/iomem example. Only
tested with virtio-mem.
This patch (of 8):
Let's make sure splitting a resource on memory hotunplug will never fail.
This will become more relevant once we merge selected System RAM resources
- then, we'll trigger that case more often on memory hotunplug.
In general, this function is already unlikely to fail. When we remove
memory, we free up quite a lot of metadata (memmap, page tables, memory
block device, etc.). The only reason it could really fail would be when
injecting allocation errors.
All other error cases inside release_mem_region_adjustable() seem to be
sanity checks if the function would be abused in different context - let's
add WARN_ON_ONCE() in these cases so we can catch them.
[natechancellor@gmail.com: fix use of ternary condition in release_mem_region_adjustable]
Link: https://lkml.kernel.org/r/20200922060748.2452056-1-natechancellor@gmail.com
Link: https://github.com/ClangBuiltLinux/linux/issues/1159
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Wei Yang <richardw.yang@linux.intel.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Julien Grall <julien@xen.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Leonardo Bras <leobras.c@gmail.com>
Cc: Libor Pechacek <lpechacek@suse.cz>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Roger Pau Monn <roger.pau@citrix.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Liu <wei.liu@kernel.org>
Link: https://lkml.kernel.org/r/20200911103459.10306-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, it can happen that pages are allocated (and freed) via the
buddy before we finished basic memory onlining.
For example, pages are exposed to the buddy and can be allocated before we
actually mark the sections online. Allocated pages could suddenly fail
pfn_to_online_page() checks. We had similar issues with pcp handling,
when pages are allocated+freed before we reach zone_pcp_update() in
online_pages() [1].
Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
impossible. Once done with the heavy lifting, use
undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
freelist, marking them ready for allocation. Similar to offline_pages(),
we have to manually adjust zone->nr_isolate_pageblock.
[1] https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200819175957.28465-11-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
un-isolate the pages after memory onlining is complete. Let's allow
passing in the migratetype.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't allow to offline memory with holes, all boot memory is online,
and all hotplugged memory cannot have holes.
We can now simplify onlining of pages. As we only allow to online/offline
full sections and sections always span full MAX_ORDER_NR_PAGES, we can
just process MAX_ORDER - 1 pages without further special handling.
The number of onlined pages simply corresponds to the number of pages we
were requested to online.
While at it, refine the comment regarding the callback not exposing all
pages to the buddy.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200819175957.28465-8-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We make sure that we cannot have any memory holes right at the beginning
of offline_pages() and we only support to online/offline full sections.
Both, sections and pageblocks are a power of two in size, and sections
always span full pageblocks.
We can directly calculate the number of isolated pageblocks from nr_pages.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200819175957.28465-6-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We make sure that we cannot have any memory holes right at the beginning
of offline_pages(). We no longer need walk_system_ram_range() and can
call test_pages_isolated() and __offline_isolated_pages() directly.
offlined_pages always corresponds to nr_pages, so we can simplify that.
[akpm@linux-foundation.org: patch conflict resolution]
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200819175957.28465-4-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Already two people (including me) tried to offline subsections, because
the function looks like it can deal with it. But we really can only
online/offline full sections that are properly aligned (e.g., we can only
mark full sections online/offline via SECTION_IS_ONLINE).
Add a simple safety net to document the restriction now. Current users
(core and powernv/memtrace) respect these restrictions.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200819175957.28465-3-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/memory_hotplug: online_pages()/offline_pages() cleanups", v2.
These are a bunch of cleanups for online_pages()/offline_pages() and
related code, mostly getting rid of memory hole handling that is no longer
necessary. There is only a single walk_system_ram_range() call left in
offline_pages(), to make sure we don't have any memory holes. I had some
of these patches lying around for a longer time but didn't have time to
polish them.
In addition, the last patch marks all pageblocks of memory to get onlined
MIGRATE_ISOLATE, so pages that have just been exposed to the buddy cannot
get allocated before onlining is complete. Once heavy lifting is done,
the pageblocks are set to MIGRATE_MOVABLE, such that allocations are
possible.
I played with DIMMs and virtio-mem on x86-64 and didn't spot any
surprises. I verified that the numer of isolated pageblocks is correctly
handled when onlining/offlining.
This patch (of 10):
There is only a single user, offline_pages(). Let's inline, to make
it look more similar to online_pages().
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Charan Teja Reddy <charante@codeaurora.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20200819175957.28465-1-david@redhat.com
Link: https://lkml.kernel.org/r/20200819175957.28465-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation to set a fallback value for dev_dax->target_node, introduce
generic fallback helpers for phys_to_target_node()
A generic implementation based on node-data or memblock was proposed, but
as noted by Mike:
"Here again, I would prefer to add a weak default for
phys_to_target_node() because the "generic" implementation is not really
generic.
The fallback to reserved ranges is x86 specfic because on x86 most of
the reserved areas is not in memblock.memory. AFAIK, no other
architecture does this."
The info message in the generic memory_add_physaddr_to_nid()
implementation is fixed up to properly reflect that
memory_add_physaddr_to_nid() communicates "online" node info and
phys_to_target_node() indicates "target / to-be-onlined" node info.
[akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build]
Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jia He <justin.he@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In register_mem_sect_under_node() the system_state's value is checked to
detect whether the call is made during boot time or during an hot-plug
operation. Unfortunately, that check against SYSTEM_BOOTING is wrong
because regular memory is registered at SYSTEM_SCHEDULING state. In
addition, memory hot-plug operation can be triggered at this system
state by the ACPI [1]. So checking against the system state is not
enough.
The consequence is that on system with interleaved node's ranges like this:
Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]
This can be seen on PowerPC LPAR after multiple memory hot-plug and
hot-unplug operations are done. At the next reboot the node's memory
ranges can be interleaved and since the call to link_mem_sections() is
made in topology_init() while the system is in the SYSTEM_SCHEDULING
state, the node's id is not checked, and the sections registered to
multiple nodes:
$ ls -l /sys/devices/system/memory/memory21/node*
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
In that case, the system is able to boot but if later one of theses
memory blocks is hot-unplugged and then hot-plugged, the sysfs
inconsistency is detected and this is triggering a BUG_ON():
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c
This patch addresses the root cause by not relying on the system_state
value to detect whether the call is due to a hot-plug operation. An
extra parameter is added to link_mem_sections() detailing whether the
operation is due to a hot-plug operation.
[1] According to Oscar Salvador, using this qemu command line, ACPI
memory hotplug operations are raised at SYSTEM_SCHEDULING state:
$QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \
-m size=$MEM,slots=255,maxmem=4294967296k \
-numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \
-object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \
-object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \
-object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \
-object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \
-object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \
-object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \
-object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \
Fixes: 4fbce63391 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: fix memory to node bad links in sysfs", v3.
Sometimes, firmware may expose interleaved memory layout like this:
Early memory node ranges
node 1: [mem 0x0000000000000000-0x000000011fffffff]
node 2: [mem 0x0000000120000000-0x000000014fffffff]
node 1: [mem 0x0000000150000000-0x00000001ffffffff]
node 0: [mem 0x0000000200000000-0x000000048fffffff]
node 2: [mem 0x0000000490000000-0x00000007ffffffff]
In that case, we can see memory blocks assigned to multiple nodes in
sysfs:
$ ls -l /sys/devices/system/memory/memory21
total 0
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
-rw-r--r-- 1 root root 65536 Aug 24 05:27 online
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
-r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
drwxr-xr-x 2 root root 0 Aug 24 05:27 power
-r--r--r-- 1 root root 65536 Aug 24 05:27 removable
-rw-r--r-- 1 root root 65536 Aug 24 05:27 state
lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
-rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
-r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
The same applies in the node's directory with a memory21 link in both
the node1 and node2's directory.
This is wrong but doesn't prevent the system to run. However when
later, one of these memory blocks is hot-unplugged and then hot-plugged,
the system is detecting an inconsistency in the sysfs layout and a
BUG_ON() is raised:
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
Call Trace:
add_memory_resource+0x23c/0x340 (unreliable)
__add_memory+0x5c/0xf0
dlpar_add_lmb+0x1b4/0x500
dlpar_memory+0x1f8/0xb80
handle_dlpar_errorlog+0xc0/0x190
dlpar_store+0x198/0x4a0
kobj_attr_store+0x30/0x50
sysfs_kf_write+0x64/0x90
kernfs_fop_write+0x1b0/0x290
vfs_write+0xe8/0x290
ksys_write+0xdc/0x130
system_call_exception+0x160/0x270
system_call_common+0xf0/0x27c
This has been seen on PowerPC LPAR.
The root cause of this issue is that when node's memory is registered,
the range used can overlap another node's range, thus the memory block
is registered to multiple nodes in sysfs.
There are two issues here:
(a) The sysfs memory and node's layouts are broken due to these
multiple links
(b) The link errors in link_mem_sections() should not lead to a system
panic.
To address (a) register_mem_sect_under_node should not rely on the
system state to detect whether the link operation is triggered by a hot
plug operation or not. This is addressed by the patches 1 and 2 of this
series.
Issue (b) will be addressed separately.
This patch (of 2):
The memmap_context enum is used to detect whether a memory operation is
due to a hot-add operation or happening at boot time.
Make it general to the hotplug operation and rename it as
meminit_context.
There is no functional change introduced by this patch
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race during page offline that can lead to infinite loop:
a page never ends up on a buddy list and __offline_pages() keeps
retrying infinitely or until a termination signal is received.
Thread#1 - a new process:
load_elf_binary
begin_new_exec
exec_mmap
mmput
exit_mmap
tlb_finish_mmu
tlb_flush_mmu
release_pages
free_unref_page_list
free_unref_page_prepare
set_pcppage_migratetype(page, migratetype);
// Set page->index migration type below MIGRATE_PCPTYPES
Thread#2 - hot-removes memory
__offline_pages
start_isolate_page_range
set_migratetype_isolate
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
Set migration type to MIGRATE_ISOLATE-> set
drain_all_pages(zone);
// drain per-cpu page lists to buddy allocator.
Thread#1 - continue
free_unref_page_commit
migratetype = get_pcppage_migratetype(page);
// get old migration type
list_add(&page->lru, &pcp->lists[migratetype]);
// add new page to already drained pcp list
Thread#2
Never drains pcp again, and therefore gets stuck in the loop.
The fix is to try to drain per-cpu lists again after
check_pages_isolated_cb() fails.
Fixes: c52e75935f ("mm: remove extra drain pages on pcp list")
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200903140032.380431-1-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20200904151448.100489-2-pasha.tatashin@soleen.com
Link: http://lkml.kernel.org/r/20200904070235.GA15277@dhcp22.suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.
[akpm@linux-foundation.org: fix mm/migrate.c]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>