lineage-22.0
431 Commits
Author | SHA1 | Message
---|---|---
Zichun Zheng | 98a66e87c1 | ANDROID: Export symbols to do reverse mapping within memcg in kernel modules.
Export the symbols below to do reverse mapping within memcg: root_mem_cgroup page_referenced Bug: 296526618 Change-Id: Ia9c5876bd97d3f13c92b28af2ca5e74b3f91bd5a Signed-off-by: Zichun Zheng <zhengzichun@oppo.com> |
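A minimal sketch of how a vendor module might use the two exported symbols; the module and helper here are hypothetical, and only `root_mem_cgroup` plus the 5.10 `page_referenced()` signature come from the change itself.

```c
#include <linux/module.h>
#include <linux/memcontrol.h>
#include <linux/rmap.h>

/* Hypothetical helper in an out-of-tree module: ask the rmap how many
 * recently-referenced mappings a page has, scoped to the root memcg. */
static int vendor_check_referenced(struct page *page)
{
	unsigned long vm_flags;

	return page_referenced(page, 0 /* page not locked */,
			       root_mem_cgroup, &vm_flags);
}
MODULE_LICENSE("GPL");
```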
Lee Jones | e36eef3783 | Revert "Revert "mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse""
This reverts commit 4f35cec76058557d9eaec0d501d03c7657eb56b4 and does so in an abi-safe way. This is done by adding the new fields only to the end of the structure and this structure is only passed around to other functions as a pointer, the internal structure layout is only touched by the core kernel, so adding it to the end is safe. Update ABI using The Button: Leaf changes summary: 1 artifact changed Changed leaf types summary: 1 leaf type changed Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 0 Added function Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable 'struct anon_vma at rmap.h:33:1' changed: type size changed from 832 to 960 (in bits) 2 data member insertions: 'unsigned long int num_children', at offset 832 (in bits) at rmap.h:74:1 'unsigned long int num_active_vmas', at offset 896 (in bits) at rmap.h:76:1 5406 impacted interfaces Bug: 260678056 Bug: 253167854 Change-Id: Ib1d45625cbc2e0b21330ca3dc2aa7aff34666d31 Signed-off-by: Lee Jones <joneslee@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> |
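A sketch of the layout trick the message describes: existing members of `struct anon_vma` keep their offsets and the two restored counters are appended at the end, which is why modules that only pass the struct around by pointer keep working (member list abbreviated; offsets taken from the ABI diff above).

```c
struct anon_vma {
	struct anon_vma *root;
	struct rw_semaphore rwsem;
	atomic_t refcount;
	unsigned degree;		/* old field kept so existing offsets don't move */
	struct anon_vma *parent;
	struct rb_root_cached rb_root;

	/* restored fix, appended only at the end for ABI safety */
	unsigned long num_children;	/* offset 832 bits in the ABI report */
	unsigned long num_active_vmas;	/* offset 896 bits in the ABI report */
};
```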
Minchan Kim | c35cda5280 | BACKPORT: mm: don't be stuck to rmap lock on reclaim path
The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended under memory pressure if processes keep working on their vmas(e.g., fork, mmap, munmap). It makes reclaim path stuck. In our real workload traces, we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it makes other processes entering direct reclaim, which were also stuck on the lock. This patch makes lru aging path try_lock mode like shink_page_list so the reclaim context will keep working with next lru pages without being stuck. if it found the rmap lock contended, it rotates the page back to head of lru in both active/inactive lrus to make them consistent behavior, which is basic starting point rather than adding more heristic. Since this patch introduces a new "contended" field as out-param along with try_lock in-param in rmap_walk_control, it's not immutable any longer if the try_lock is set so remove const keywords on rmap related functions. Since rmap walking is already expensive operation, I doubt the const would help sizable benefit( And we didn't have it until 5.17). In a heavy app workload in Android, trace shows following statistics. It almost removes rmap lock contention from reclaim path. Martin Liu reported: Before: max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function 1632 0 1631 151.542173 31672 209 page_lock_anon_vma_read 601 0 601 145.544681 28817 198 rmap_walk_file After: max_dur(ms) min_dur(ms) max-min(dur)ms avg_dur(ms) sum_dur(ms) count blocked_function NaN NaN NaN NaN NaN 0.0 NaN 0 0 0 0.127645 1 12 rmap_walk_file [minchan@kernel.org: add comment, per Matthew] Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: John Dias <joaodias@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: folio->page (cherry picked from commit 6d4675e601357834dadd2ba1d803f6484596015c) Bug: 239681156 Bug: 252333201 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I0c63e0291120c8a1b5f2d83b8a7b210cb56c27a2 Signed-off-by: chenxin <chenxinxin@xiaomi.corp-partner.google.com> |
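The interface change boils down to two new members of `rmap_walk_control` (abbreviated sketch after the upstream commit this backports; callers set `try_lock`, the walker reports back through `contended`).

```c
struct rmap_walk_control {
	void *arg;
	/*
	 * Reclaim opts in to trylock so it never sleeps on i_mmap_rwsem or
	 * anon_vma->root->rwsem; on contention the page is rotated back to
	 * the LRU head instead.
	 */
	bool try_lock;
	/* Out-param: set when try_lock was requested and the lock was contended. */
	bool contended;

	bool (*rmap_one)(struct page *page, struct vm_area_struct *vma,
			 unsigned long addr, void *arg);
	int (*done)(struct page *page);
	struct anon_vma *(*anon_lock)(struct page *page,
				      struct rmap_walk_control *rwc);
	bool (*invalid_vma)(struct vm_area_struct *vma, void *arg);
};
```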
Peifeng Li | f50f24e781 | ANDROID: vendor_hooks: Add hooks for lookaround
Add hooks for support lookaround in memory reclamation. - android_vh_test_clear_look_around_ref - android_vh_check_page_look_around_ref - android_vh_look_around_migrate_page - android_vh_look_around Bug: 241079328 Signed-off-by: Peifeng Li <lipeifeng@oppo.com> Change-Id: I9a606ae71d2f1303df3b02403b30bc8fdc9d06dd |
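For readers unfamiliar with ACK vendor hooks, the pattern behind these hook commits looks roughly like this; the hook name is taken from the list above, but the TP_PROTO arguments and the handler are illustrative guesses rather than the exact upstream definition.

```c
/* Hook point declared in a trace/hooks/*.h header: */
DECLARE_HOOK(android_vh_test_clear_look_around_ref,
	TP_PROTO(struct page *page),
	TP_ARGS(page));

/* A vendor module attaches its handler at load time: */
static void look_around_ref_handler(void *data, struct page *page)
{
	/* vendor-specific look-around bookkeeping */
}

static int __init vendor_mod_init(void)
{
	return register_trace_android_vh_test_clear_look_around_ref(
			look_around_ref_handler, NULL);
}
```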
Peifeng Li | 1f8f6d59a2 | ANDROID: vendor_hook: Add hook to not be stuck on rmap lock in kswapd or direct_reclaim
Add hooks to support trylock in rmaplock when reclaiming in kswapd or direct_reclaim, in order to avoid wait lock for a long time. - android_vh_handle_failed_page_trylock - android_vh_page_trylock_set - android_vh_page_trylock_clear - android_vh_page_trylock_get_result - android_vh_do_page_trylock Bug: 240003372 Signed-off-by: Peifeng Li <lipeifeng@oppo.com> Change-Id: I0f605b35ae41f15b3ca7bc72cd5f003175c318a5 |
Peifeng Li | 3f775b9367 | ANDROID: vendor_hooks: account page-mapcount
Support five hooks as follows to account the amount of multi-mapped pages in kernel: - android_vh_show_mapcount_pages - android_vh_do_traversal_lruvec - android_vh_update_page_mapcount - android_vh_add_page_to_lrulist - android_vh_del_page_from_lrulist Bug: 236578020 Signed-off-by: Peifeng Li <lipeifeng@oppo.com> Change-Id: Ia2c7015aab442be7dbb496b8b630b9dff59ab935 |
Greg Kroah-Hartman | 9c2a5eef8f | Merge tag 'android12-5.10.117_r00' into 'android12-5.10'
This is the merge of the upstream LTS release of 5.10.117 into the android12-5.10 branch. It contains the following commits: |
Bing Han | 1aa26f0017 | ANDROID: vendor_hook: Add hook in page_referenced_one()
Add android_vh_page_referenced_one_end at the end of function page_referenced_one to update the status that whether the page need to be reclaimed to a specified swap location. Bug: 234214858 Signed-off-by: Bing Han <bing.han@transsion.com> Change-Id: Ia06a229956328ef776da5d163708dcb011a327fb |
Greg Kroah-Hartman | 5dadf6321c | This is the 5.10.111 stable release
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmJXHgUACgkQONu9yGCS aT7BohAAx7alIKg1d4gbIHhO6eimWLWLj95ncyeq6xtNT+qKqdYgp+w8xKAJ8QLG sG9sbGcoWYkgOLcSy4rztgh9HBQuGvY6vLFqygRw5HXN1iAirYlr7DJCCKRc1pPZ E5ASOzbkfmBw9HI/w41up5vosSkNAf1qqbL9lJxfx11ms5t7s/11gYg+xSH61NUI gBE4GyJSq91p161F4ql+dJqYrU+gAIY9zKVSAqB97z9D3d01tZkr4LGNjbqtu7Kb 3d+vjiKfMda09X16US3nx9PaxikfQn5IB8JA9mpWgI+Q7H6R9Ri+rQxnv/ghpEPc U9BvK9p7+zYu6dyNUZYbGCsHAQ3WFoatJPO+JTxXllJ99ORrN85WfvFMWZq49f3k XxYMbECcLJfsYUJycKcPJJfGFLfxw2cDfJmzNJEvzX9KK6ObZxSeYeVHjrdC8XwA WZlt1zNObE2IyH3pkqSyxKubnpu4Z0UdxDdIeowsdI9iD7oGlhtfzhTVa+JbxnuY HHtHIKweeyYeUTRIKe1w/ZE24LQjF0fpy9M2ZGxzy6YQTqFsjGNzsfcPBWWFOXqp XGTgaKoIqA+1ov2nVGk8j1BOTHKqYx4gwhb5Y58kX8hvHQGr6d3eoyMKui7wPT5f 9RjU/9+eZ1DT8LLnUZFNJHIrhwIUCR/wEY1Y9gPi58RE8Wj9TlY= =3r9Y -----END PGP SIGNATURE----- Merge 5.10.111 into android12-5.10-lts Changes in 5.10.111 ubifs: Rectify space amount budget for mkdir/tmpfile operations gfs2: Check for active reservation in gfs2_release gfs2: Fix gfs2_release for non-writers regression gfs2: gfs2_setattr_size error path fix rtc: wm8350: Handle error for wm8350_register_irq KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs KVM: x86/emulator: Emulate RDPID only if it is enabled in guest drm: Add orientation quirk for GPD Win Max ath5k: fix OOB in ath5k_eeprom_read_pcal_info_5111 drm/amd/display: Add signal type check when verify stream backends same drm/amd/amdgpu/amdgpu_cs: fix refcount leak of a dma_fence obj usb: gadget: tegra-xudc: Do not program SPARAM usb: gadget: tegra-xudc: Fix control endpoint's definitions ptp: replace snprintf with sysfs_emit powerpc: dts: t104xrdb: fix phy type for FMAN 4/5 ath11k: fix kernel panic during unload/load ath11k modules ath11k: mhi: use mhi_sync_power_up() bpf: Make dst_port field in struct bpf_sock 16-bit wide scsi: mvsas: Replace snprintf() with sysfs_emit() scsi: bfa: Replace snprintf() with sysfs_emit() power: supply: axp20x_battery: properly report current when discharging mt76: dma: initialize skip_unmap in mt76_dma_rx_fill cfg80211: don't add non transmitted BSS to 6GHz scanned channels libbpf: Fix build issue with llvm-readelf ipv6: make mc_forwarding atomic powerpc: Set crashkernel offset to mid of RMA region drm/amdgpu: Fix recursive locking warning PCI: aardvark: Fix support for MSI interrupts iommu/arm-smmu-v3: fix event handling soft lockup usb: ehci: add pci device support for Aspeed platforms PCI: endpoint: Fix alignment fault error in copy tests tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH. 
PCI: pciehp: Add Qualcomm quirk for Command Completed erratum power: supply: axp288-charger: Set Vhold to 4.4V iwlwifi: mvm: Correctly set fragmented EBS ipv4: Invalidate neighbour for broadcast address upon address addition dm ioctl: prevent potential spectre v1 gadget dm: requeue IO if mapping table not yet available drm/amdkfd: make CRAT table missing message informational only scsi: pm8001: Fix pm80xx_pci_mem_copy() interface scsi: pm8001: Fix pm8001_mpi_task_abort_resp() scsi: pm8001: Fix task leak in pm8001_send_abort_all() scsi: pm8001: Fix tag leaks on error scsi: pm8001: Fix memory leak in pm8001_chip_fw_flash_update_req() mt76: mt7615: Fix assigning negative values to unsigned variable scsi: aha152x: Fix aha152x_setup() __setup handler return value scsi: hisi_sas: Free irq vectors in order for v3 HW net/smc: correct settings of RMB window update limit mips: ralink: fix a refcount leak in ill_acc_of_setup() macvtap: advertise link netns via netlink tuntap: add sanity checks about msg_controllen in sendmsg Bluetooth: Fix not checking for valid hdev on bt_dev_{info,warn,err,dbg} Bluetooth: use memset avoid memory leaks bnxt_en: Eliminate unintended link toggle during FW reset PCI: endpoint: Fix misused goto label MIPS: fix fortify panic when copying asm exception handlers powerpc/secvar: fix refcount leak in format_show() scsi: libfc: Fix use after free in fc_exch_abts_resp() can: isotp: set default value for N_As to 50 micro seconds net: account alternate interface name memory net: limit altnames to 64k total net: sfp: add 2500base-X quirk for Lantech SFP module usb: dwc3: omap: fix "unbalanced disables for smps10_out1" on omap5evm xtensa: fix DTC warning unit_address_format MIPS: ingenic: correct unit node address Bluetooth: Fix use after free in hci_send_acl netlabel: fix out-of-bounds memory accesses ceph: fix memory leak in ceph_readdir when note_last_dentry returns error init/main.c: return 1 from handled __setup() functions minix: fix bug when opening a file with O_DIRECT clk: si5341: fix reported clk_rate when output divider is 2 staging: vchiq_core: handle NULL result of find_service_by_handle phy: amlogic: meson8b-usb2: Use dev_err_probe() staging: wfx: fix an error handling in wfx_init_common() w1: w1_therm: fixes w1_seq for ds28ea00 sensors NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify() NFSv4: Protect the state recovery thread against direct reclaim xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32 clk: ti: Preserve node in ti_dt_clocks_register() clk: Enforce that disjoints limits are invalid SUNRPC/call_alloc: async tasks mustn't block waiting for memory SUNRPC/xprt: async tasks mustn't block waiting for memory SUNRPC: remove scheduling boost for "SWAPPER" tasks. NFS: swap IO handling is slightly different for O_DIRECT IO NFS: swap-out must always use STABLE writes. 
x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy serial: samsung_tty: do not unlock port->lock for uart_write_wakeup() virtio_console: eliminate anonymous module_init & module_exit jfs: prevent NULL deref in diFree SUNRPC: Fix socket waits for write buffer space NFS: nfsiod should not block forever in mempool_alloc() NFS: Avoid writeback threads getting stuck in mempool_alloc() parisc: Fix CPU affinity for Lasi, WAX and Dino chips parisc: Fix patch code locking and flushing mm: fix race between MADV_FREE reclaim and blkdev direct IO read Revert "hv: utils: add PTP_1588_CLOCK to Kconfig to fix build" drm/amdgpu: fix off by one in amdgpu_gfx_kiq_acquire() Drivers: hv: vmbus: Fix potential crash on module unload Revert "NFSv4: Handle the special Linux file open access mode" NFSv4: fix open failure with O_ACCMODE flag scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one() net/tls: fix slab-out-of-bounds bug in decrypt_internal ice: Clear default forwarding VSI during VSI release net: ipv4: fix route with nexthop object delete warning net: stmmac: Fix unset max_speed difference between DT and non-DT platforms drm/imx: imx-ldb: Check for null pointer after calling kmemdup drm/imx: Fix memory leak in imx_pd_connector_get_modes bnxt_en: reserve space inside receive page for skb_shared_info sfc: Do not free an empty page_ring RDMA/mlx5: Don't remove cache MRs when a delay is needed IB/rdmavt: add lock to call to rvt_error_qp to prevent a race condition dpaa2-ptp: Fix refcount leak in dpaa2_ptp_probe ice: Set txq_teid to ICE_INVAL_TEID on ring creation ice: Do not skip not enabled queues in ice_vc_dis_qs_msg ipv6: Fix stats accounting in ip6_pkt_drop ice: synchronize_rcu() when terminating rings net: openvswitch: don't send internal clone attribute to the userspace. 
net: openvswitch: fix leak of nested actions rxrpc: fix a race in rxrpc_exit_net() net: phy: mscc-miim: reject clause 45 register accesses qede: confirm skb is allocated before using spi: bcm-qspi: fix MSPI only access with bcm_qspi_exec_mem_op() bpf: Support dual-stack sockets in bpf_tcp_check_syncookie drbd: Fix five use after free bugs in get_initial_state io_uring: don't touch scm_fp_list after queueing skb SUNRPC: Handle ENOMEM in call_transmit_status() SUNRPC: Handle low memory situations in call_status() SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec() iommu/omap: Fix regression in probe for NULL pointer dereference perf: arm-spe: Fix perf report --mem-mode perf tools: Fix perf's libperf_print callback perf session: Remap buf if there is no space for event arm64: Add part number for Arm Cortex-A78AE Revert "mmc: sdhci-xenon: fix annoying 1.8V regulator warning" mmc: mmci: stm32: correctly check all elements of sg list mmc: renesas_sdhi: don't overwrite TAP settings when HS400 tuning is complete lz4: fix LZ4_decompress_safe_partial read out of bound mmmremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0) mm/mempolicy: fix mpol_new leak in shared_policy_replace io_uring: fix race between timeout flush and removal x86/pm: Save the MSR validity status at context setup x86/speculation: Restore speculation related MSRs during S3 resume btrfs: fix qgroup reserve overflow the qgroup limit btrfs: prevent subvol with swapfile from being deleted arm64: patch_text: Fixup last cpu should be master RDMA/hfi1: Fix use-after-free bug for mm struct gpio: Restrict usage of GPIO chip irq members before initialization ata: sata_dwc_460ex: Fix crash due to OOB write perf: qcom_l2_pmu: fix an incorrect NULL check on list iterator irqchip/gic-v3: Fix GICR_CTLR.RWP polling drm/amdgpu/smu10: fix SoC/fclk units in auto mode drm/nouveau/pmu: Add missing callbacks for Tegra devices drm/amdkfd: Create file descriptor after client is added to smi_clients list perf build: Don't use -ffat-lto-objects in the python feature test when building with clang-13 perf python: Fix probing for some clang command line options tools build: Filter out options and warnings not supported by clang tools build: Use $(shell ) instead of `` to get embedded libperl's ccopts dmaengine: Revert "dmaengine: shdma: Fix runtime PM imbalance on error" ubsan: remove CONFIG_UBSAN_OBJECT_SIZE mm: don't skip swap entry even if zap_details specified cgroup: Use open-time credentials for process migraton perm checks selftests/cgroup: Fix build on older distros selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644 selftests: cgroup: Test open-time credential usage for migration checks selftests: cgroup: Test open-time cgroup namespace usage for migration checks arm64: module: remove (NOLOAD) from linker script Drivers: hv: vmbus: Replace smp_store_mb() with virt_store_mb() irqchip/gic, gic-v3: Prevent GSI to SGI translations mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit Linux 5.10.111 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I9b4c1d30ae226b865494df03d871db2a2b9281c7 |
Mauricio Faria de Oliveira | 3c3fbfa6dd | mm: fix race between MADV_FREE reclaim and blkdev direct IO read
commit 6c8e2a256915a223f6289f651d6b926cd7135c9e upstream. Problem: ======= Userspace might read the zero-page instead of actual data from a direct IO read on a block device if the buffers have been called madvise(MADV_FREE) on earlier (this is discussed below) due to a race between page reclaim on MADV_FREE and blkdev direct IO read. - Race condition: ============== During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks if the page is not dirty, then discards its rmap PTE(s) (vs. remap back if the page is dirty). However, after try_to_unmap_one() returns to shrink_page_list(), it might keep the page _anyway_ if page_ref_freeze() fails (it expects exactly _one_ page reference, from the isolation for page reclaim). Well, blkdev_direct_IO() gets references for all pages, and on READ operations it only sets them dirty _later_. So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct IO read from block devices, and page reclaim happens during __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() returns, but BEFORE the pages are set dirty, the situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data! A synthetic reproducer is provided. - Page faults: =========== If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't happen, because that faults-in all pages as writeable, so do_anonymous_page() sets up a new page/rmap/PTE, and that is used by direct IO. The userspace reads don't fault as the PTE is there (thus zero-page is not used/setup). But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE is no longer there; the subsequent page faults can't help: The data-read from the block device probably won't generate faults due to DMA (no MMU) but even in the case it wouldn't use DMA, that happens on different virtual addresses (not user-mapped addresses) because `struct bio_vec` stores `struct page` to figure addresses out (which are different from user-mapped addresses) for the read. Thus userspace reads (to user-mapped addresses) still fault, then do_anonymous_page() gets another `struct page` that would address/ map to other memory than the `struct page` used by `struct bio_vec` for the read. (The original `struct page` is not available, since it wasn't freed, as page_ref_freeze() failed due to more page refs. And even if it were available, its data cannot be trusted anymore.) Solution: ======== One solution is to check for the expected page reference count in try_to_unmap_one(). There should be one reference from the isolation (that is also checked in shrink_page_list() with page_ref_freeze()) plus one or more references from page mapping(s) (put in discard: label). Further references mean that rmap/PTE cannot be unmapped/nuked. (Note: there might be more than one reference from mapping due to fork()/clone() without CLONE_VM, which use the same `struct page` for references, until the copy-on-write page gets copied.) So, additional page references (e.g., from direct IO read) now prevent the rmap/PTE from being unmapped/dropped; similarly to the page is not freed per shrink_page_list()/page_ref_freeze()). - Races and Barriers: ================== The new check in try_to_unmap_one() should be safe in races with bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's done under the PTE lock. 
The fast path doesn't take the lock, but it checks if the PTE has changed and if so, it drops the reference and leaves the page for the slow path (which does take that lock). The fast path requires synchronization w/ full memory barrier: it writes the page reference count first then it reads the PTE later, while try_to_unmap() writes PTE first then it reads page refcount. And a second barrier is needed, as the page dirty flag should not be read before the page reference count (as in __remove_mapping()). (This can be a load memory barrier only; no writes are involved.) Call stack/comments: - try_to_unmap_one() - page_vma_mapped_walk() - map_pte() # see pte_offset_map_lock(): pte_offset_map() spin_lock() - ptep_get_and_clear() # write PTE - smp_mb() # (new barrier) GUP fast path - page_ref_count() # (new check) read refcount - page_vma_mapped_walk_done() # see pte_unmap_unlock(): pte_unmap() spin_unlock() - bio_iov_iter_get_pages() - __bio_iov_iter_get_pages() - iov_iter_get_pages() - get_user_pages_fast() - internal_get_user_pages_fast() # fast path - lockless_pages_from_mm() - gup_{pgd,p4d,pud,pmd,pte}_range() ptep = pte_offset_map() # not _lock() pte = ptep_get_lockless(ptep) page = pte_page(pte) try_grab_compound_head(page) # inc refcount # (RMW/barrier # on success) if (pte_val(pte) != pte_val(*ptep)) # read PTE put_compound_head(page) # dec refcount # go slow path # slow path - __gup_longterm_unlocked() - get_user_pages_unlocked() - __get_user_pages_locked() - __get_user_pages() - follow_{page,p4d,pud,pmd}_mask() - follow_page_pte() ptep = pte_offset_map_lock() pte = *ptep page = vm_normal_page(pte) try_grab_page(page) # inc refcount pte_unmap_unlock() - Huge Pages: ========== Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) thus should reach shrink_page_list() -> split_huge_page_to_list() before try_to_unmap[_one](), so it deals with normal pages only. (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens, which should not or be rare, the page refcount should be greater than mapcount: the head page is referenced by tail pages. That also prevents checking the head `page` then incorrectly call page_remove_rmap(subpage) for a tail page, that isn't even in the shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), as it might happen today in this unlikely scenario.) MADV_FREE'd buffers: =================== So, back to the "if MADV_FREE pages are used as buffers" note. The case is arguable, and subject to multiple interpretations. The madvise(2) manual page on the MADV_FREE advice value says: 1) 'After a successful MADV_FREE ... data will be lost when the kernel frees the pages.' 2) 'the free operation will be canceled if the caller writes into the page' / 'subsequent writes ... will succeed and then [the] kernel cannot free those dirtied pages' 3) 'If there is no subsequent write, the kernel can free the pages at any time.' Thoughts, questions, considerations... respectively: 1) Since the kernel didn't actually free the page (page_ref_freeze() failed), should the data not have been lost? (on userspace read.) 2) Should writes performed by the direct IO read be able to cancel the free operation? - Should the direct IO read be considered as 'the caller' too, as it's been requested by 'the caller'? 
- Should the bio technique to dirty pages on return to userspace (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) be considered in another/special way here? 3) Should an upcoming write from a previously requested direct IO read be considered as a subsequent write, so the kernel should not free the pages? (as it's known at the time of page reclaim.) And lastly: Technically, the last point would seem a reasonable consideration and balance, as the madvise(2) manual page apparently (and fairly) seem to assume that 'writes' are memory access from the userspace process (not explicitly considering writes from the kernel or its corner cases; again, fairly).. plus the kernel fix implementation for the corner case of the largely 'non-atomic write' encompassed by a direct IO read operation, is relatively simple; and it helps. Reproducer: ========== @ test.c (simplified, but works) #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main() { int fd, i; char *buf; fd = open(DEV, O_RDONLY | O_DIRECT); buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) buf[i] = 1; // init to non-zero madvise(buf, BUF_SIZE, MADV_FREE); read(fd, buf, BUF_SIZE); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) printf("%p: 0x%x\n", &buf[i], buf[i]); return 0; } @ block/fops.c (formerly fs/block_dev.c) +#include <linux/swap.h> ... ... __blkdev_direct_IO[_simple](...) { ... + if (!strcmp(current->comm, "good")) + shrink_all_memory(ULONG_MAX); + ret = bio_iov_iter_get_pages(...); + + if (!strcmp(current->comm, "bad")) + shrink_all_memory(ULONG_MAX); ... } @ shell # NUM_PAGES=4 # PAGE_SIZE=$(getconf PAGE_SIZE) # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES} # DEV=$(losetup -f --show test.img) # gcc -DDEV=\"$DEV\" \ -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \ -DPAGE_SIZE=${PAGE_SIZE} \ test.c -o test # od -tx1 $DEV 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a * 0040000 # mv test good # ./good 0x7f7c10418000: 0x79 0x7f7c10419000: 0x79 0x7f7c1041a000: 0x79 0x7f7c1041b000: 0x79 # mv good bad # ./bad 0x7fa1b8050000: 0x0 0x7fa1b8051000: 0x0 0x7fa1b8052000: 0x0 0x7fa1b8053000: 0x0 Note: the issue is consistent on v5.17-rc3, but it's intermittent with the support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c]. - v5.17-rc3: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x0 # free | grep Swap Swap: 0 0 0 - v4.5: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 2702 0x0 1298 0x79 # swapoff -av swapoff /swap # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 Ceph/TCMalloc: ============= For documentation purposes, the use case driving the analysis/fix is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to release unused memory to the system from the mmap'ed page heap (might be committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. 
Note: TCMalloc switched back to MADV_DONTNEED a few commits after the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue just 'disappeared' on Ceph on later Ubuntu releases but is still present in the kernel, and can be hit by other use cases. The observed issue seems to be the old Ceph bug #22464 [1], where checksum mismatches are observed (and instrumentation with buffer dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). The issue in Ceph was reasonably deemed a kernel bug (comment #50) and mostly worked around with a retry mechanism, but other parts of Ceph could still hit that (rocksdb). Anyway, it's less likely to be hit again as TCMalloc switched out of MADV_FREE by default. (Some kernel versions/reports from the Ceph bug, and relation with the MADV_FREE introduction/changes; TCMalloc versions not checked.) - 4.4 good - 4.5 (madv_free: introduction) - 4.9 bad - 4.10 good? maybe a swapless system - 4.12 (madv_free: no longer free instantly on swapless systems) - 4.13 bad [1] https://tracker.ceph.com/issues/22464 Thanks: ====== Several people contributed to analysis/discussions/tests/reproducers in the first stages when drilling down on ceph/tcmalloc/linux kernel: - Dan Hill - Dan Streetman - Dongdong Tao - Gavin Guo - Gerald Yang - Heitor Alves de Siqueira - Ioanna Alifieraki - Jay Vosburgh - Matthew Ruffell - Ponnuvel Palaniyappan Reviews, suggestions, corrections, comments: - Minchan Kim - Yu Zhao - Huang, Ying - John Hubbard - Christoph Hellwig [mfo@canonical.com: v4] Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com Fixes: |
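The heart of the fix, slightly simplified from the upstream commit: in the lazyfree branch of `try_to_unmap_one()`, only discard the PTE when the page's reference count is exactly the isolation reference plus its mappings; otherwise remap it.

```c
if (!PageSwapBacked(page)) {		/* MADV_FREE (lazyfree) page */
	int ref_count, map_count;

	smp_mb();			/* pairs with the GUP fast path */
	ref_count = page_ref_count(page);
	map_count = page_mapcount(page);
	smp_rmb();			/* read refcount before the dirty bit */

	/* Only the isolation ref plus the rmap refs: safe to discard. */
	if (ref_count == 1 + map_count && !PageDirty(page)) {
		dec_mm_counter(mm, MM_ANONPAGES);
		goto discard;
	}

	/* Extra references (e.g. in-flight direct IO): remap the PTE. */
	set_pte_at(mm, address, pvmw.pte, pteval);
	SetPageSwapBacked(page);
	ret = false;
	page_vma_mapped_walk_done(&pvmw);
	break;
}
```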
Greg Kroah-Hartman | 639159686b | Merge branch 'android12-5.10' into android12-5.10-lts
Sync up with android12-5.10 for the following commits: |
Jiewen Wang | 955f917251 | ANDROID: vendor_hooks: Add hook in try_to_unmap_one()
Add hook in try_to_unmap_one() to trace this function for debug memory swap bugs. Bug: 198385827 Change-Id: I1fdbe60e09bb491b949e06a07133710453ecca03 Signed-off-by: Jiewen Wang <jiewen.wang@vivo.com> |
Greg Kroah-Hartman | 194be71cc6 | Linux 5.10.47
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAmDcbDgACgkQ3qZv95d3 LNwUuQ//VDlmBPk/3w1FYvg9N9q/t1GHkVJXmD8TY/ClLdJtgxPYeoRu1VNLR/xf Y2kwZEF07yMA88RME56Zwt3p+LBbacrp5MoNdzEA48kb7auGBPk1HIscBg2PXC+C AnlC/O4/NAW6Okb+lLFL7XFM4xrlDBkNr5yTz2HmQSQC3JFfov0FcrON3KKTL5Bi aeyWjhn1NnhkKCDaKUl7kKlCQ2x7buu/YmvJK2OdGmuLZVywzto76RS+Xx3X0CnK pPkmciZSS7Gxi6UJel/zza0UwlKg5+IhFzfYVt0nsTFMjk/4QStAoevu7TFnbVD3 yM7nzJpAQmVLs/X9sC0rgg9rCBUyp9d4ddba8bCUqaxpPQfMObWI8S5F8tfzBnK/ h8P5xfs8u4O1AzpRr+YSN2Hbvak47e/4c5UOvvYj6z8NaIEb7DGbaUv/JE5YRVZ0 ZEVZ1auEpHcSVAGz6DUwuwzc8Rk0NdskES5DD53QxNLXDoF1CVdsD44xUuJH1/HA //3S1SWxvwF9UQ+w+sk/Z6pzUj+CdFigou3QzwB+vAZ04n0JawRSuqyijkCqFOP5 88iCMgxZ5qAYJ1TzQ6gV7cA0tSbteBF/HERNmdadyGvse9KtxkaBAfFmvIpd9D8J fTepYVuneP9cGqaGDUtsqHzM5YLrIkCSxABjgIvpNTt0eJL5c2k= =TKhf -----END PGP SIGNATURE----- Merge 5.10.47 into android12-5.10-lts Changes in 5.10.47 module: limit enabling module.sig_enforce Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue." Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell." drm: add a locked version of drm_is_current_master drm/nouveau: wait for moving fence after pinning v2 drm/radeon: wait for moving fence after pinning drm/amdgpu: wait for moving fence after pinning ARM: 9081/1: fix gcc-10 thumb2-kernel regression mmc: meson-gx: use memcpy_to/fromio for dram-access-quirk MIPS: generic: Update node names to avoid unit addresses arm64: Ignore any DMA offsets in the max_zone_phys() calculation arm64: Force NO_BLOCK_MAPPINGS if crashkernel reservation is required spi: spi-nxp-fspi: move the register operation after the clock enable Revert "PCI: PM: Do not read power state in pci_enable_device_flags()" drm/vc4: hdmi: Move the HSM clock enable to runtime_pm drm/vc4: hdmi: Make sure the controller is powered in detect x86/entry: Fix noinstr fail in __do_fast_syscall_32() x86/xen: Fix noinstr fail in exc_xen_unknown_trap() locking/lockdep: Improve noinstr vs errors perf/x86/lbr: Remove cpuc->lbr_xsave allocation from atomic context perf/x86/intel/lbr: Zero the xstate buffer on allocation dmaengine: zynqmp_dma: Fix PM reference leak in zynqmp_dma_alloc_chan_resourc() dmaengine: stm32-mdma: fix PM reference leak in stm32_mdma_alloc_chan_resourc() dmaengine: xilinx: dpdma: Add missing dependencies to Kconfig dmaengine: xilinx: dpdma: Limit descriptor IDs to 16 bits mac80211: remove warning in ieee80211_get_sband() mac80211_hwsim: drop pending frames on stop cfg80211: call cfg80211_leave_ocb when switching away from OCB dmaengine: rcar-dmac: Fix PM reference leak in rcar_dmac_probe() dmaengine: mediatek: free the proper desc in desc_free handler dmaengine: mediatek: do not issue a new desc if one is still current dmaengine: mediatek: use GFP_NOWAIT instead of GFP_ATOMIC in prep_dma net: ipv4: Remove unneed BUG() function mac80211: drop multicast fragments net: ethtool: clear heap allocations for ethtool function inet: annotate data race in inet_send_prepare() and inet_dgram_connect() ping: Check return value of function 'ping_queue_rcv_skb' net: annotate data race in sock_error() inet: annotate date races around sk->sk_txhash net/packet: annotate data race in packet_sendmsg() net: phy: dp83867: perform soft reset and retain established link riscv32: Use medany C model for modules net: caif: fix memory leak in ldisc_open net/packet: annotate accesses to po->bind net/packet: annotate accesses to po->ifindex r8152: Avoid memcpy() over-reading of ETH_SS_STATS sh_eth: Avoid memcpy() over-reading of 
ETH_SS_STATS r8169: Avoid memcpy() over-reading of ETH_SS_STATS KVM: selftests: Fix kvm_check_cap() assertion net: qed: Fix memcpy() overflow of qed_dcbx_params() mac80211: reset profile_periodicity/ema_ap mac80211: handle various extensible elements correctly recordmcount: Correct st_shndx handling PCI: Add AMD RS690 quirk to enable 64-bit DMA net: ll_temac: Add memory-barriers for TX BD access net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY perf/x86: Track pmu in per-CPU cpu_hw_events pinctrl: stm32: fix the reported number of GPIO lines per bank i2c: i801: Ensure that SMBHSTSTS_INUSE_STS is cleared when leaving i801_access gpiolib: cdev: zero padding during conversion to gpioline_info_changed scsi: sd: Call sd_revalidate_disk() for ioctl(BLKRRPART) nilfs2: fix memory leak in nilfs_sysfs_delete_device_group s390/stack: fix possible register corruption with stack switch helper KVM: do not allow mapping valid but non-reference-counted pages i2c: robotfuzz-osif: fix control-request directions ceph: must hold snap_rwsem when filling inode for async create kthread_worker: split code for canceling the delayed work timer kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate() x86/fpu: Make init_fpstate correct with optimized XSAVE mm: add VM_WARN_ON_ONCE_PAGE() macro mm/rmap: remove unneeded semicolon in page_not_mapped() mm/rmap: use page_not_mapped in try_to_unmap() mm, thp: use head page in __migration_entry_wait() mm/thp: fix __split_huge_pmd_locked() on shmem migration entry mm/thp: make is_huge_zero_pmd() safe and quicker mm/thp: try_to_unmap() use TTU_SYNC for safe splitting mm/thp: fix vma_address() if virtual address below file offset mm/thp: fix page_address_in_vma() on file THP tails mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split mm: page_vma_mapped_walk(): use page for pvmw->page mm: page_vma_mapped_walk(): settle PageHuge on entry mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block mm: page_vma_mapped_walk(): crossing page table boundary mm: page_vma_mapped_walk(): add a level of indentation mm: page_vma_mapped_walk(): use goto instead of while (1) mm: page_vma_mapped_walk(): get vma_address_end() earlier mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() mm, futex: fix shared futex pgoff on shmem huge page KVM: SVM: Call SEV Guest Decommission if ASID binding fails swiotlb: manipulate orig_addr when tlb_addr has offset netfs: fix test for whether we can skip read when writing beyond EOF Revert "drm: add a locked version of drm_is_current_master" certs: Add EFI_CERT_X509_GUID support for dbx entries certs: Move load_system_certificate_list to a common function certs: Add ability to preload revocation certs integrity: Load mokx variables into the blacklist keyring Linux 5.10.47 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com> Change-Id: I68f731ad78a5db003c41093e4faf59f6f9f2e446 |
Jue Wang | 38cda6b5ab | mm/thp: fix page_address_in_vma() on file THP tails
commit 31657170deaf1d8d2f6a1955fbc6fa9d228be036 upstream.
Anon THP tails were already supported, but memory-failure may need to
use page_address_in_vma() on file THP tails, which its page->mapping
check did not permit: fix it.
hughd adds: no current usage is known to hit the issue, but this does
fix a subtle trap in a general helper: best fixed in stable sooner than
later.
Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
Fixes:
Hugh Dickins | 37ffe9f4d7 | mm/thp: fix vma_address() if virtual address below file offset
commit 494334e43c16d63b878536a26505397fce6ff3a2 upstream. Running certain tests with a DEBUG_VM kernel would crash within hours, on the total_mapcount BUG() in split_huge_page_to_list(), while trying to free up some memory by punching a hole in a shmem huge page: split's try_to_unmap() was unable to find all the mappings of the page (which, on a !DEBUG_VM kernel, would then keep the huge page pinned in memory). When that BUG() was changed to a WARN(), it would later crash on the VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in mm/internal.h:vma_address(), used by rmap_walk_file() for try_to_unmap(). vma_address() is usually correct, but there's a wraparound case when the vm_start address is unusually low, but vm_pgoff not so low: vma_address() chooses max(start, vma->vm_start), but that decides on the wrong address, because start has become almost ULONG_MAX. Rewrite vma_address() to be more careful about vm_pgoff; move the VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can be safely used from page_mapped_in_vma() and page_address_in_vma() too. Add vma_address_end() to apply similar care to end address calculation, in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one(); though it raises a question of whether callers would do better to supply pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch. An irritation is that their apparent generality breaks down on KSM pages, which cannot be located by the page->index that page_to_pgoff() uses: as commit |
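The rewritten helper, roughly as it lands upstream: compute the address from `vm_pgoff`, and return `-EFAULT` instead of tripping a VM_BUG_ON when the page is not mapped by the vma.

```c
static inline unsigned long
vma_address(struct page *page, struct vm_area_struct *vma)
{
	pgoff_t pgoff = page_to_pgoff(page);
	unsigned long address;

	if (pgoff >= vma->vm_pgoff) {
		address = vma->vm_start +
			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		/* Check for address beyond vma (or wrapped through 0) */
		if (address < vma->vm_start || address >= vma->vm_end)
			address = -EFAULT;
	} else if (PageHead(page) &&
		   pgoff + compound_nr(page) - 1 >= vma->vm_pgoff) {
		/* Test above avoids possibility of wrap to 0 on 32-bit */
		address = vma->vm_start;
	} else {
		address = -EFAULT;
	}
	return address;
}
```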
Hugh Dickins | 66be14a926 | mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
commit 732ed55823fc3ad998d43b86bf771887bcc5ec67 upstream.
Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
(!unmap_success): with dump_page() showing mapcount:1, but then its raw
struct page output showing _mapcount ffffffff i.e. mapcount 0.
And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
all indicative of some mapcount difficulty in development here perhaps.
But the !CONFIG_DEBUG_VM path handles the failures correctly and
silently.
I believe the problem is that once a racing unmap has cleared pte or
pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
from try_to_unmap() before the racing task has reached decrementing
mapcount.
Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
TTU_SYNC to the options, and passing that from unmap_page().
When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
for both: the slight overhead added should rarely matter, except perhaps
if splitting sparsely-populated multiply-mapped shmem. Once confident
that bugs are fixed, TTU_SYNC here can be removed, and the race
tolerated.
Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
Fixes:
Miaohe Lin | bfd90b56d7 | mm/rmap: use page_not_mapped in try_to_unmap()
[ Upstream commit b7e188ec98b1644ff70a6d3624ea16aadc39f5e0 ] page_mapcount_is_zero() calculates accurately how many mappings a hugepage has in order to check against 0 only. This is a waste of cpu time. We can do this via page_not_mapped() to save some possible atomic_read cycles. Remove the function page_mapcount_is_zero() as it's not used anymore and move page_not_mapped() above try_to_unmap() to avoid identifier undeclared compilation error. Link: https://lkml.kernel.org/r/20210130084904.35307-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
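The resulting helper and its hook-up in `try_to_unmap()` are tiny (simplified sketch of the 5.10 code after this patch):

```c
/* Walk-termination check: stop the rmap walk once the page is unmapped. */
static int page_not_mapped(struct page *page)
{
	return !page_mapped(page);
}

/* Simplified: how try_to_unmap() wires it up as the .done callback. */
bool try_to_unmap(struct page *page, enum ttu_flags flags)
{
	struct rmap_walk_control rwc = {
		.rmap_one  = try_to_unmap_one,
		.arg       = (void *)flags,
		.done      = page_not_mapped,
		.anon_lock = page_lock_anon_vma_read,
	};

	if (flags & TTU_RMAP_LOCKED)
		rmap_walk_locked(page, &rwc);
	else
		rmap_walk(page, &rwc);

	return !page_mapcount(page);
}
```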
Miaohe Lin | ff81af8259 | mm/rmap: remove unneeded semicolon in page_not_mapped()
[ Upstream commit e0af87ff7afcde2660be44302836d2d5618185af ] Remove extra semicolon without any functional change intended. Link: https://lkml.kernel.org/r/20210127093425.39640-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
Laurent Dufour | a1dbf20e8e | FROMLIST: mm: introduce __page_add_new_anon_rmap()
When dealing with speculative page fault handler, we may race with VMA being split or merged. In this case the vma->vm_start and vm->vm_end fields may not match the address the page fault is occurring. This can only happens when the VMA is split but in that case, the anon_vma pointer of the new VMA will be the same as the original one, because in __split_vma the new->anon_vma is set to src->anon_vma when *new = *vma. So even if the VMA boundaries are not correct, the anon_vma pointer is still valid. If the VMA has been merged, then the VMA in which it has been merged must have the same anon_vma pointer otherwise the merge can't be done. So in all the case we know that the anon_vma is valid, since we have checked before starting the speculative page fault that the anon_vma pointer is valid for this VMA and since there is an anon_vma this means that at one time a page has been backed and that before the VMA is cleaned, the page table lock would have to be grab to clean the PTE, and the anon_vma field is checked once the PTE is locked. This patch introduce a new __page_add_new_anon_rmap() service which doesn't check for the VMA boundaries, and create a new inline one which do the check. When called from a page fault handler, if this is not a speculative one, there is a guarantee that vm_start and vm_end match the faulting address, so this check is useless. In the context of the speculative page fault handler, this check may be wrong but anon_vma is still valid as explained above. Change-Id: I72c47830181579f8c9618df879077d321653b5f1 Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Link: https://lore.kernel.org/lkml/1523975611-15978-17-git-send-email-ldufour@linux.vnet.ibm.com/ Bug: 161210518 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> |
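A sketch of the resulting split (simplified): the `__`-prefixed worker skips the boundary assertion, and a thin inline wrapper keeps it for the ordinary page-fault path.

```c
/* Does the real work; makes no assumption that @address still falls inside
 * @vma's boundaries, which may be stale under speculative fault handling. */
void __page_add_new_anon_rmap(struct page *page, struct vm_area_struct *vma,
			      unsigned long address, bool compound);

/* Ordinary (non-speculative) faults: boundaries are stable, keep the check. */
static inline void page_add_new_anon_rmap(struct page *page,
					  struct vm_area_struct *vma,
					  unsigned long address, bool compound)
{
	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
	__page_add_new_anon_rmap(page, vma, address, compound);
}
```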
Shakeel Butt | dd156e3fca | mm/rmap: always do TTU_IGNORE_ACCESS
[ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ] Since commit |
Mike Kravetz | 336bf30eb7 | hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]
LTP: starting move_pages12
BUG: unable to handle page fault for address: ffffffffffffffe0
...
RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
Call Trace:
rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
migrate_pages+0x1005/0x1fb0
move_pages_and_store_status.isra.47+0xd7/0x1a0
__x64_sys_move_pages+0xa5c/0x1100
do_syscall_64+0x5f/0x310
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
anon pages as the anon_vma_lock should be held in this case. Further
analysis suggested that i_mmap_rwsem was not required to he held at all
when calling try_to_unmap for anon pages as an anon page could never be
part of a shared pmd mapping.
Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
keep mapping valid while dropping page lock.
This patch does the following:
- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
calling try_to_unmap.
- Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
will now simply do a 'trylock' while still holding the page lock. If
the trylock fails, it will return NULL. This could impact the
callers:
- migration calling code will receive -EAGAIN and retry up to the
hard coded limit (10).
- memory error code will treat the page as BUSY. This will force
killing (SIGKILL) instead of SIGBUS any mapping tasks.
Do note that this change in behavior only happens when there is a
race. None of the standard kernel testing suites actually hit this
race, but it is possible.
[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
Fixes:
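After this change the locking helper is deliberately simple (sketch): with the page lock already held, only a trylock of `i_mmap_rwsem` is attempted, and the caller must cope with `NULL`.

```c
struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
{
	struct address_space *mapping = page_mapping(hpage);

	if (!mapping)
		return mapping;

	/* The caller holds the page lock, so sleeping on the rwsem here
	 * would invert the lock order; a failed trylock returns NULL and
	 * the caller retries (migration) or treats the page as busy. */
	if (i_mmap_trylock_write(mapping))
		return mapping;

	return NULL;
}
```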
Matthew Wilcox (Oracle) | 5eaf35ab12 | mm/rmap: fix assumptions of THP size
Ask the page what size it is instead of assuming it's PMD size. Do this for anon pages as well as file pages for when someone decides to support that. Leave the assumption alone for pages which are PMD mapped; we don't currently grow THPs beyond PMD size, so we don't need to change this code yet. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sjpark@amazon.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-9-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
Alistair Popple | ad7df764b7 | mm/rmap: fixup copying of soft dirty and uffd ptes
During memory migration a pte is temporarily replaced with a migration swap pte. Some pte bits from the existing mapping such as the soft-dirty and uffd write-protect bits are preserved by copying these to the temporary migration swap pte. However these bits are not stored at the same location for swap and non-swap ptes. Therefore testing these bits requires using the appropriate helper function for the given pte type. Unfortunately several code locations were found where the wrong helper function is being used to test soft_dirty and uffd_wp bits which leads to them getting incorrectly set or cleared during page-migration. Fix these by using the correct tests based on pte type. Fixes: |
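The class of bug being fixed, sketched: which accessor is valid depends on whether the source pte is present or already a swap/migration entry, and the bits must be re-applied with the matching `_swp_` helpers when the migration entry is built.

```c
bool soft_dirty, uffd_wp;

if (pte_present(pte)) {
	soft_dirty = pte_soft_dirty(pte);	/* present pte accessors */
	uffd_wp    = pte_uffd_wp(pte);
} else {
	soft_dirty = pte_swp_soft_dirty(pte);	/* swap/migration entry accessors */
	uffd_wp    = pte_swp_uffd_wp(pte);
}

/* When writing the migration swap pte, use the swap-side setters: */
swp_pte = swp_entry_to_pte(entry);
if (soft_dirty)
	swp_pte = pte_swp_mksoft_dirty(swp_pte);
if (uffd_wp)
	swp_pte = pte_swp_mkuffd_wp(swp_pte);
```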
Qian Cai | 9c1177b62a | mm/rmap: annotate a data race at tlb_flush_batched
mm->tlb_flush_batched could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6: try_to_unmap_one+0x59a/0x1ab0 set_tlb_ubc_flush_pending at mm/rmap.c:635 (inlined by) try_to_unmap_one at mm/rmap.c:1538 rmap_walk_anon+0x296/0x650 rmap_walk+0xdf/0x100 try_to_unmap+0x18a/0x2f0 shrink_page_list+0xef6/0x2870 shrink_inactive_list+0x316/0x880 shrink_lruvec+0x8dc/0x1380 shrink_node+0x317/0xd80 balance_pgdat+0x652/0xd90 kswapd+0x396/0x8d0 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4: flush_tlb_batched_pending+0x29/0x90 flush_tlb_batched_pending at mm/rmap.c:682 change_p4d_range+0x5dd/0x1030 change_pte_range at mm/mprotect.c:44 (inlined by) change_pmd_range at mm/mprotect.c:212 (inlined by) change_pud_range at mm/mprotect.c:240 (inlined by) change_p4d_range at mm/mprotect.c:260 change_protection+0x222/0x310 change_prot_numa+0x3e/0x60 task_numa_work+0x219/0x350 task_work_run+0xed/0x140 prepare_exit_to_usermode+0x2cc/0x2e0 ret_from_intr+0x32/0x42 Reported by Kernel Concurrency Sanitizer on: CPU: 4 PID: 6364 Comm: mtest01 Tainted: G W L 5.5.0-next-20200210+ #5 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 flush_tlb_batched_pending() is under PTL but the write is not, but mm->tlb_flush_batched is only a bool type, so the value is unlikely to be shattered. Thus, mark it as an intentional data race by using the data race macro. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Link: http://lkml.kernel.org/r/1581450783-8262-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
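The annotation itself is a one-line change on the reader side; roughly, in `flush_tlb_batched_pending()`:

```c
void flush_tlb_batched_pending(struct mm_struct *mm)
{
	/*
	 * The writer in set_tlb_ubc_flush_pending() runs under the PTL, this
	 * reader does not; the field is a plain bool, so a torn value is not
	 * possible. data_race() tells KCSAN the race is intentional.
	 */
	if (data_race(mm->tlb_flush_batched)) {
		flush_tlb_mm(mm);

		/* Don't let the clear be reordered before the flush. */
		barrier();
		mm->tlb_flush_batched = false;
	}
}
```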
Matthew Wilcox (Oracle) | 6c357848b4 | mm: replace hpage_nr_pages with thp_nr_pages
The thp prefix is more frequently used than hpage and we should be consistent between the various functions. [akpm@linux-foundation.org: fix mm/migrate.c] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
Mike Kravetz | 34ae204f18 | hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem
Commit |
Michel Lespinasse | c1e8d7c6a7 | mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
Johannes Weiner | 468c398233 | mm: memcontrol: switch to native NR_ANON_THPS counter
With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the native NR_ANON_THPS accounting sites. [hannes@cmpxchg.org: fixes] Link: http://lkml.kernel.org/r/20200512121750.GA397968@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> [build-tested] Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-12-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
Johannes Weiner | be5d0a74c6 | mm: memcontrol: switch to native NR_ANON_MAPPED counter
Memcg maintains a private MEMCG_RSS counter. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the same way we do for NR_FILE_MAPPED. With the previous patch removing MEMCG_CACHE and the private NR_SHMEM counter, this patch finally eliminates the need to have page->mapping set up at charge time. However, we need to have page->mem_cgroup set up by the time rmap runs and does the accounting, so switch the commit and the rmap callbacks around. v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo) Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
Palmer Dabbelt | 4708f31885 | mm: prevent a warning when casting void* -> enum
I recently build the RISC-V port with LLVM trunk, which has introduced a new warning when casting from a pointer to an enum of a smaller size. This patch simply casts to a long in the middle to stop the warning. I'd be surprised this is the only one in the kernel, but it's the only one I saw. Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
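The shape of the fix (illustrative; the surrounding function is a stand-in): rmap passes its `ttu_flags` to walk callbacks through a `void *`, and casting through a pointer-width integer first keeps newer clang quiet.

```c
static bool unmap_one_sketch(struct page *page, struct vm_area_struct *vma,
			     unsigned long address, void *arg)
{
	/* was: (enum ttu_flags)arg -- warns about casting void* to a
	 * smaller enum; widening through 'long' first avoids it. */
	enum ttu_flags flags = (enum ttu_flags)(long)arg;

	return !(flags & TTU_IGNORE_MLOCK);
}
```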
Peter Xu | f45ec5ff16 | userfaultfd: wp: support swap and page migration
For either swap and page migration, we all use the bit 2 of the entry to identify whether this entry is uffd write-protected. It plays a similar role as the existing soft dirty bit in swap entries but only for keeping the uffd-wp tracking for a specific PTE/PMD. Something special here is that when we want to recover the uffd-wp bit from a swap/migration entry to the PTE bit we'll also need to take care of the _PAGE_RW bit and make sure it's cleared, otherwise even with the _PAGE_UFFD_WP bit we can't trap it at all. In change_pte_range() we do nothing for uffd if the PTE is a swap entry. That can lead to data mismatch if the page that we are going to write protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch also applies/removes the uffd-wp bit even for the swap entries. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
Matthew Wilcox (Oracle) | 396bcc5299 | mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE
Commit |
Li Xinhai | 23ab76bf90 | Revert "mm/rmap.c: reuse mergeable anon_vma as parent when fork"
This reverts commit |
Mike Kravetz | c0d0381ade | hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2. While discussing the issue with huge_pte_offset [1], I remembered that there were more outstanding hugetlb races. These issues are: 1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become invalid via a call to huge_pmd_unshare by another thread. 2) hugetlbfs page faults can race with truncation causing invalid global reserve counts and state. A previous attempt was made to use i_mmap_rwsem in this manner as described at [2]. However, those patches were reverted starting with [3] due to locking issues. To effectively use i_mmap_rwsem to address the above issues it needs to be held (in read mode) during page fault processing. However, during fault processing we need to lock the page we will be adding. Lock ordering requires we take page lock before i_mmap_rwsem. Waiting until after taking the page lock is too late in the fault process for the synchronization we want to do. To address this lock ordering issue, the following patches change the lock ordering for hugetlb pages. This is not too invasive as hugetlbfs processing is done separate from core mm in many places. However, I don't really like this idea. Much ugliness is contained in the new routine hugetlb_page_mapping_lock_write() of patch 1. The only other way I can think of to address these issues is by catching all the races. After catching a race, cleanup, backout, retry ... etc, as needed. This can get really ugly, especially for huge page reservations. At one time, I started writing some of the reservation backout code for page faults and it got so ugly and complicated I went down the path of adding synchronization to avoid the races. Any other suggestions would be welcome. [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/ [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/ [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/ [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/ This patch (of 2): While looking at BUGs associated with invalid huge page map counts, it was discovered and observed that a huge pte pointer could become 'invalid' and point to another task's page table. Consider the following: A task takes a page fault on a shared hugetlbfs file and calls huge_pte_alloc to get a ptep. Suppose the returned ptep points to a shared pmd. Now, another task truncates the hugetlbfs file. As part of truncation, it unmaps everyone who has the file mapped. If the range being truncated is covered by a shared pmd, huge_pmd_unshare will be called. For all but the last user of the shared pmd, huge_pmd_unshare will clear the pud pointing to the pmd. If the task in the middle of the page fault is not the last user, the ptep returned by huge_pte_alloc now points to another task's page table or worse. This leads to bad things such as incorrect page map/reference counts or invalid memory references. To fix, expand the use of i_mmap_rwsem as follows: - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called. huge_pmd_share is only called via huge_pte_alloc, so callers of huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers of huge_pte_alloc continue to hold the semaphore until finished with the ptep. - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called. 
One problem with this scheme is that it requires taking i_mmap_rwsem before taking the page lock during page faults. This is not the order specified in the rest of mm code. Handling of hugetlbfs pages is mostly isolated today. Therefore, we use this alternative locking order for PageHuge() pages. mapping->i_mmap_rwsem hugetlb_fault_mutex (hugetlbfs specific page fault mutex) page->flags PG_locked (lock_page) To help with lock ordering issues, hugetlb_page_mapping_lock_write() is introduced to write lock the i_mmap_rwsem associated with a page. In most cases it is easy to get address_space via vma->vm_file->f_mapping. However, in the case of migration or memory errors for anon pages we do not have an associated vma. A new routine _get_hugetlb_page_mapping() will use anon_vma to get address_space in these cases. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
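A minimal sketch of the fault-path ordering described in the entry above, assuming a simplified hugetlb fault handler. The helpers used (i_mmap_lock_read(), huge_pte_alloc(), hstate_vma()) are real kernel interfaces, but their exact signatures vary across kernel versions and this is an illustration, not the verbatim upstream code.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/hugetlb.h>

static vm_fault_t hugetlb_fault_sketch(struct vm_area_struct *vma,
					unsigned long address)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	struct hstate *h = hstate_vma(vma);
	pte_t *ptep;

	/*
	 * Take i_mmap_rwsem in read mode *before* huge_pte_alloc() so a
	 * concurrent huge_pmd_unshare() (which needs the semaphore in
	 * write mode) cannot clear the pud and invalidate the returned
	 * ptep underneath us.
	 */
	i_mmap_lock_read(mapping);

	ptep = huge_pte_alloc(vma->vm_mm, address, huge_page_size(h));
	if (!ptep) {
		i_mmap_unlock_read(mapping);
		return VM_FAULT_OOM;
	}

	/* ... fault handling that dereferences ptep happens here ... */

	/* Keep the semaphore held until we are completely done with ptep. */
	i_mmap_unlock_read(mapping);
	return 0;
}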
Anshuman Khandual
|
222100eed2 |
mm/vma: make is_vma_temporary_stack() available for general use
Currently the declaration and definition for is_vma_temporary_stack() are scattered. Let's make the is_vma_temporary_stack() helper available for general use and also drop the declaration from (include/linux/huge_mm.h) which is no longer required. While at it, rename this as vma_is_temporary_stack() in line with existing helpers. This should not cause any functional change; a sketch of the renamed helper follows this entry.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
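A sketch of the renamed helper. The body mirrors the long-standing is_vma_temporary_stack() check (a stack VMA whose setup, e.g. during execve(), is not yet complete); it is illustrative rather than verbatim.

#include <linux/mm.h>

static inline bool vma_is_temporary_stack(struct vm_area_struct *vma)
{
	int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);

	if (!maybe_stack)
		return false;

	/* VM_STACK_INCOMPLETE_SETUP marks a stack that is still being set up. */
	if ((vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
						VM_STACK_INCOMPLETE_SETUP)
		return true;

	return false;
}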
John Hubbard
|
47e29d32af |
mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, because each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned.

This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there.

A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used).

This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount. A sketch of the scheme follows this entry.

Documentation/core-api/pin_user_pages.rst is updated accordingly.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
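An illustrative outline of how a GUP pin is recorded under the scheme above, assuming a kernel that already carries the hpage_pinned_refcount field and the GUP_PIN_COUNTING_BIAS constant from this series. This is not the verbatim upstream helper; it only shows the two counting paths.

#include <linux/mm.h>

/* 'page' is assumed to be the head page of the compound (or an order-0 page). */
static void grab_page_refs_sketch(struct page *page, int refs)
{
	if (PageCompound(page) && compound_order(page) > 1) {
		/*
		 * Exact pin counting: the counter lives in the 3rd struct
		 * page of the compound (page[2].hpage_pinned_refcount), so
		 * _refcount only takes one real reference per pin.
		 */
		page_ref_add(page, refs);
		atomic_add(refs, &page[2].hpage_pinned_refcount);
	} else {
		/*
		 * Order-0 and order-1 pages keep the old scheme: overload
		 * _refcount with a large bias (GUP_PIN_COUNTING_BIAS == 1024)
		 * per pin, which is what overflows too easily for huge pages.
		 */
		page_ref_add(page, refs * GUP_PIN_COUNTING_BIAS);
	}
}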
Kirill A. Shutemov
|
f1fe80d4ae |
mm, thp: do not queue fully unmapped pages for deferred split
Adding fully unmapped pages into deferred split queue is not productive: these pages are about to be freed or they are pinned and cannot be split anyway.

Link: http://lkml.kernel.org/r/20190913091849.11151-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Yang Shi
|
30c4638285 |
mm/rmap.c: use VM_BUG_ON_PAGE() in __page_check_anon_rmap()
__page_check_anon_rmap() just calls two BUG_ON()s protected by an #ifdef CONFIG_DEBUG_VM block; the #ifdef can be eliminated by using VM_BUG_ON_PAGE() instead (see the sketch after this entry).

Link: http://lkml.kernel.org/r/1573157346-111316-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
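The pattern this cleanup applies, sketched with one of the checks in __page_check_anon_rmap(); the exact condition shown is illustrative.

/* Before: the assertion needs an explicit #ifdef and gives no page context. */
#ifdef CONFIG_DEBUG_VM
	BUG_ON(page_to_pgoff(page) != linear_page_index(vma, address));
#endif

/* After: VM_BUG_ON_PAGE() compiles away when CONFIG_DEBUG_VM is off and
 * dumps the offending struct page when the assertion fires. */
	VM_BUG_ON_PAGE(page_to_pgoff(page) != linear_page_index(vma, address), page);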
Miles Chen
|
091e429954 |
mm/rmap.c: fix outdated comment in page_get_anon_vma()
Replace DESTROY_BY_RCU with SLAB_TYPESAFE_BY_RCU because
SLAB_DESTROY_BY_RCU has been renamed to SLAB_TYPESAFE_BY_RCU by commit
|
||
Wei Yang
|
4e4a9eb921 |
mm/rmap.c: reuse mergeable anon_vma as parent when fork
In __anon_vma_prepare(), we will try to find anon_vma if it is possible to reuse it. While on fork, the logic is different. Since commit |
||
Wei Yang
|
47b390d23b |
mm/rmap.c: don't reuse anon_vma if we just want a copy
Before commit |
||
Ben Dooks
|
444f84fd2a |
mm: include <linux/huge_mm.h> for is_vma_temporary_stack
Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack to fix the following sparse warning:

mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Song Liu
|
99cb0dbd47 |
mm,thp: add read-only THP support for (non-shmem) FS
This patch is (hopefully) the first step to enable THP for non-shmem filesystems.

This patch enables an application to put part of its text sections to THP via madvise, for example:

    madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

We tried to reuse the logic for THP on tmpfs.

Currently, write is not supported for non-shmem THP. khugepaged will only process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is execve(). This requirement limits non-shmem THP to text sections.

The next patch will handle writes, which would only happen when all the vmas with VM_DENYWRITE are unmapped.

An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this feature.

[songliubraving@fb.com: fix build without CONFIG_SHMEM]
Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com
[songliubraving@fb.com: fix double unlock in collapse_file()]
Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com
Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
d8c6546b1a |
mm: introduce compound_nr()
Replace 1 << compound_order(page) with compound_nr(page). Minor improvements in readability.

Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
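What the conversion described above looks like at a call site (variable name illustrative):

/* Before */
unsigned long nr_pages = 1 << compound_order(page);

/* After: same value, clearer intent. */
unsigned long nr_pages = compound_nr(page);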
Matthew Wilcox (Oracle)
|
a50b854e07 |
mm: introduce page_size()
Patch series "Make working with compound pages easier", v2. These three patches add three helpers and convert the appropriate places to use them. This patch (of 3): It's unnecessarily hard to find out the size of a potentially huge page. Replace 'PAGE_SIZE << compound_order(page)' with page_size(page). Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
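A sketch of the helper and a call site; the body mirrors the expression named above (PAGE_SIZE << compound_order(page)) and is illustrative rather than verbatim.

/* Returns the number of bytes in this potentially compound page. */
static inline unsigned long page_size(struct page *page)
{
	return PAGE_SIZE << compound_order(page);
}

/* Caller side: works for both base pages and huge pages. */
unsigned long bytes = page_size(page);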
YueHaibing
|
1f18b29669 |
mm/rmap.c: remove set but not used variable 'cstart'
Fixes gcc '-Wunused-but-set-variable' warning:
mm/rmap.c: In function page_mkclean_one:
mm/rmap.c:906:17: warning: variable cstart set but not used [-Wunused-but-set-variable]
It is not used any more since
commit
|
||
Ralph Campbell
|
1de13ee592 |
mm/hmm: fix bad subpage pointer in try_to_unmap_one
When migrating an anonymous private page to a ZONE_DEVICE private page,
the source page->mapping and page->index fields are copied to the
destination ZONE_DEVICE struct page and the page_mapcount() is
increased. This is so rmap_walk() can be used to unmap and migrate the
page back to system memory.
However, try_to_unmap_one() computes the subpage pointer from a swap pte,
which yields an invalid page pointer, and a kernel panic results, such
as:
BUG: unable to handle page fault for address: ffffea1fffffffc8
Currently, only single pages can be migrated to device private memory so
no subpage computation is needed and it can be set to "page".
[rcampbell@nvidia.com: add comment]
Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
Fixes:
|
||
Huang Shijie
|
059d8442ea |
mm/rmap.c: use the pra.mapcount to do the check
We already have pra.mapcount, so there is no need to call page_mapped(), which may do some complicated computation for a compound page (see the sketch after this entry).

Link: http://lkml.kernel.org/r/20190404054828.2731-1-sjhuang@iluvatar.ai
Signed-off-by: Huang Shijie <sjhuang@iluvatar.ai>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
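A sketch of the change in page_referenced(), assuming pra.mapcount was already initialised from total_mapcount(page); field and helper names follow the commit but the fragment is illustrative, not verbatim.

	struct page_referenced_arg pra = {
		.mapcount = total_mapcount(page),
		.memcg = memcg,
	};

	/*
	 * Before: if (!page_mapped(page)) return 0;
	 * After: reuse the snapshot already taken into pra.mapcount instead
	 * of recomputing the mapped state for a possibly compound page.
	 */
	if (!pra.mapcount)
		return 0;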
Jérôme Glisse
|
7269f99993 |
mm/mmu_notifier: use correct mmu_notifier events for each invalidation
This updates each existing invalidation to use the correct mmu notifier event that represents what is happening to the CPU page table. See the patch which introduced the events for the rationale behind this. An illustrative example of one such conversion follows this entry.

Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
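An illustrative fragment of one such conversion in an rmap unmap path: the caller announces MMU_NOTIFY_CLEAR (page table entries being cleared) rather than the catch-all MMU_NOTIFY_UNMAP. The range arguments are illustrative; the signature matches the one introduced by the "contextual information" patch below.

	struct mmu_notifier_range range;

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				address, address + PAGE_SIZE);
	mmu_notifier_invalidate_range_start(&range);
	/* ... clear the pte and update rmap state ... */
	mmu_notifier_invalidate_range_end(&range);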
Jérôme Glisse
|
6f4f13e8d9 |
mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table updates can happen for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...).

Users of the mmu notifier API track changes to the CPU page table and take specific action for them. While the current API only provides the range of virtual addresses affected by the change, it does not say why the change is happening.

This patchset does the initial mechanical conversion of all the places that call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is known (most invalidations happen against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection.

MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier should assume that every mapping for the range is going away when that event happens. A later patch converts the mm call paths to use more appropriate events for each call.

This is done as 2 patches so that no call site is forgotten, especially as it uses the following coccinelle patch:

%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4)
{ ... }

@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)

@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...>
}

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...>
}

@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...>
}
---------------------------------------------------------------------->%

Applied with:

spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place

A short before/after example of a converted call site follows this entry.

Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
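What the mechanical conversion produces at a typical call site where the vma is known; the arguments are illustrative and follow directly from the coccinelle patch above.

/* Before */
	mmu_notifier_range_init(&range, vma->vm_mm, start, end);

/* After: default event, no flags, and the vma passed down. */
	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
				start, end);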