Commit Graph

431 Commits

Author SHA1 Message Date
Zichun Zheng
98a66e87c1 ANDROID: Export symbols to do reverse mapping within memcg in kernel modules.
Export the symbols below to do reverse mapping within memcg:
  root_mem_cgroup
  page_referenced

Bug: 296526618
Change-Id: Ia9c5876bd97d3f13c92b28af2ca5e74b3f91bd5a
Signed-off-by: Zichun Zheng <zhengzichun@oppo.com>
2023-08-23 12:33:26 +00:00
Lee Jones
e36eef3783 Revert "Revert "mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse""
This reverts commit 4f35cec76058557d9eaec0d501d03c7657eb56b4 and does so
in an abi-safe way.

This is done by adding the new fields only to the end of the structure
and this structure is only passed around to other functions as a
pointer, the internal structure layout is only touched by the core
kernel, so adding it to the end is safe.

Update ABI using The Button:

Leaf changes summary: 1 artifact changed
Changed leaf types summary: 1 leaf type changed
Removed/Changed/Added functions summary: 0 Removed, 0 Changed, 0 Added function
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable

'struct anon_vma at rmap.h:33:1' changed:
  type size changed from 832 to 960 (in bits)
  2 data member insertions:
    'unsigned long int num_children', at offset 832 (in bits) at rmap.h:74:1
    'unsigned long int num_active_vmas', at offset 896 (in bits) at rmap.h:76:1
  5406 impacted interfaces

Bug: 260678056
Bug: 253167854
Change-Id: Ib1d45625cbc2e0b21330ca3dc2aa7aff34666d31
Signed-off-by: Lee Jones <joneslee@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-05-11 12:39:32 +00:00
Minchan Kim
c35cda5280 BACKPORT: mm: don't be stuck to rmap lock on reclaim path
The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended
under memory pressure if processes keep working on their vmas(e.g., fork,
mmap, munmap).  It makes reclaim path stuck.  In our real workload traces,
we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it
makes other processes entering direct reclaim, which were also stuck on
the lock.

This patch makes lru aging path try_lock mode like shink_page_list so the
reclaim context will keep working with next lru pages without being stuck.
if it found the rmap lock contended, it rotates the page back to head of
lru in both active/inactive lrus to make them consistent behavior, which
is basic starting point rather than adding more heristic.

Since this patch introduces a new "contended" field as out-param along
with try_lock in-param in rmap_walk_control, it's not immutable any longer
if the try_lock is set so remove const keywords on rmap related functions.
Since rmap walking is already expensive operation, I doubt the const
would help sizable benefit( And we didn't have it until 5.17).

In a heavy app workload in Android, trace shows following statistics.  It
almost removes rmap lock contention from reclaim path.

Martin Liu reported:

Before:

   max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
         1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
          601            0             601   145.544681        28817    198  rmap_walk_file

After:

   max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
          NaN          NaN              NaN          NaN          NaN    0.0             NaN
            0            0                0     0.127645            1     12  rmap_walk_file

[minchan@kernel.org: add comment, per Matthew]
  Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Conflicts:
	folio->page

(cherry picked from commit 6d4675e601357834dadd2ba1d803f6484596015c)
Bug: 239681156
Bug: 252333201
Signed-off-by: Minchan Kim <minchan@google.com>
Change-Id: I0c63e0291120c8a1b5f2d83b8a7b210cb56c27a2
Signed-off-by: chenxin <chenxinxin@xiaomi.corp-partner.google.com>
2022-10-11 16:33:36 +00:00
Peifeng Li
f50f24e781 ANDROID: vendor_hooks: Add hooks for lookaround
Add hooks for support lookaround in memory reclamation.

- android_vh_test_clear_look_around_ref
- android_vh_check_page_look_around_ref
- android_vh_look_around_migrate_page
- android_vh_look_around
Bug: 241079328

Signed-off-by: Peifeng Li <lipeifeng@oppo.com>
Change-Id: I9a606ae71d2f1303df3b02403b30bc8fdc9d06dd
2022-09-13 19:07:41 +08:00
Peifeng Li
1f8f6d59a2 ANDROID: vendor_hook: Add hook to not be stuck ro rmap lock in kswapd or direct_reclaim
Add hooks to support trylock in rmaplock when reclaiming in kswapd or
direct_reclaim, in order to avoid wait lock for a long time.

- android_vh_handle_failed_page_trylock
- android_vh_page_trylock_set
- android_vh_page_trylock_clear
- android_vh_page_trylock_get_result
- android_vh_do_page_trylock

Bug: 240003372

Signed-off-by: Peifeng Li <lipeifeng@oppo.com>
Change-Id: I0f605b35ae41f15b3ca7bc72cd5f003175c318a5
2022-08-14 23:08:07 +08:00
Peifeng Li
3f775b9367 ANDROID: vendor_hooks: account page-mapcount
Support five hooks as follows to account
the amount of multi-mapped pages in kernel:

- android_vh_show_mapcount_pages
- android_vh_do_traversal_lruvec
- android_vh_update_page_mapcount
- android_vh_add_page_to_lrulist
- android_vh_del_page_from_lrulist

Bug: 236578020
Signed-off-by: Peifeng Li <lipeifeng@oppo.com>
Change-Id: Ia2c7015aab442be7dbb496b8b630b9dff59ab935
2022-08-03 20:10:45 +00:00
Greg Kroah-Hartman
9c2a5eef8f Merge tag 'android12-5.10.117_r00' into 'android12-5.10'
This is the merge of the upstream LTS release of 5.10.117 into the
android12-5.10 branch.

It contains the following commits:

fdd06dc6b0 ANDROID: GKI: db845c: Update symbols list and ABI
0974b8411a Merge 5.10.117 into android12-5.10-lts
7686a5c2a8 Linux 5.10.117
937c6b0e3e SUNRPC: Fix fall-through warnings for Clang
29f077d070 io_uring: always use original task when preparing req identity
1444e0568b usb: gadget: uvc: allow for application to cleanly shutdown
42505e3622 usb: gadget: uvc: rename function to be more consistent
002e7223dc ping: fix address binding wrt vrf
d9a1e82bf6 arm[64]/memremap: don't abuse pfn_valid() to ensure presence of linear map
49750c5e9a net: phy: Fix race condition on link status change
e68b60ae29 SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()
dbe6974a39 SUNRPC: Don't call connect() more than once on a TCP socket
47541ed4d4 SUNRPC: Prevent immediate close+reconnect
2ab569edd8 SUNRPC: Clean up scheduling of autoclose
85844ea29f drm/vmwgfx: Initialize drm_mode_fb_cmd2
7e849dbe60 cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
6aa239d82e net: atlantic: always deep reset on pm op, fixing up my null deref regression
6158df4fa5 i40e: i40e_main: fix a missing check on list iterator
819796024c drm/nouveau/tegra: Stop using iommu_present()
e06605af8b ceph: fix setting of xattrs on async created inodes
86db01f373 serial: 8250_mtk: Fix register address for XON/XOFF character
84ad84e495 serial: 8250_mtk: Fix UART_EFR register address
f8d8440f13 slimbus: qcom: Fix IRQ check in qcom_slim_probe
d7b7c5532a USB: serial: option: add Fibocom MA510 modem
2ba0034e36 USB: serial: option: add Fibocom L610 modem
319b312edb USB: serial: qcserial: add support for Sierra Wireless EM7590
994395f356 USB: serial: pl2303: add device id for HP LM930 Display
8276a3dbe2 usb: typec: tcpci_mt6360: Update for BMC PHY setting
54979aa49e usb: typec: tcpci: Don't skip cleanup in .remove() on error
7335a6b11d usb: cdc-wdm: fix reading stuck on device close
6d47eceaf3 tty: n_gsm: fix mux activation issues in gsm_config()
69139a45b8 tty/serial: digicolor: fix possible null-ptr-deref in digicolor_uart_probe()
5a73581116 firmware_loader: use kernel credentials when reading firmware
d254309aab tcp: resalt the secret every 10 seconds
3abbfac1ab net: sfp: Add tx-fault workaround for Huawei MA5671A SFP ONT
48f1dd67a8 net: emaclite: Don't advertise 1000BASE-T and do auto negotiation
5c09dbdfd4 s390: disable -Warray-bounds
03ebc6fd5c ASoC: ops: Validate input values in snd_soc_put_volsw_range()
31606a73ba ASoC: max98090: Generate notifications on changes for custom control
ce154bd3bc ASoC: max98090: Reject invalid values in custom control put()
5ecaaaeb2c hwmon: (f71882fg) Fix negative temperature
88091c0275 gfs2: Fix filesystem block deallocation for short writes
fccf4bf3f2 tls: Fix context leak on tls_device_down
161c4edeca net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
d5e1b41bf7 net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
e417a8fcea net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
9012209f43 net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
abe35bf3be net/sched: act_pedit: really ensure the skb is writable
b816ed53f3 s390/lcs: fix variable dereferenced before check
4d3c6d7418 s390/ctcm: fix potential memory leak
5497f87edc s390/ctcm: fix variable dereferenced before check
cc71c9f17c selftests: vm: Makefile: rename TARGETS to VMTARGETS
ce12e5ff8d hwmon: (ltq-cputemp) restrict it to SOC_XWAY
ceb3db723f dim: initialize all struct fields
8b1b8fc819 ionic: fix missing pci_release_regions() on error in ionic_probe()
2cb8689f45 nfs: fix broken handling of the softreval mount option
49c10784b9 mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
79432d2237 net: sfc: fix memory leak due to ptp channel
bdb8d4aed1 sfc: Use swap() instead of open coding it
33c93f6e55 netlink: do not reset transport header in netlink_recvmsg()
9e40f2c513 drm/nouveau: Fix a potential theorical leak in nouveau_get_backlight_name()
54f26fc07e ipv4: drop dst in multicast routing path
c07a84492f net: mscc: ocelot: avoid corrupting hardware counters when moving VCAP filters
abb237c544 net: mscc: ocelot: restrict tc-trap actions to VCAP IS2 lookup 0
f9674c52a1 net: mscc: ocelot: fix VCAP IS2 filters matching on both lookups
c1184d2888 net: mscc: ocelot: fix last VCAP IS1/IS2 filter persisting in hardware when deleted
e2cdde89d2 net: Fix features skip in for_each_netdev_feature()
c420d66047 mac80211: Reset MBSSID parameters upon connection
9cbf2a7d5d hwmon: (tmp401) Add OF device ID table
85eba08be2 iwlwifi: iwl-dbg: Use del_timer_sync() before freeing
a6a73781b4 batman-adv: Don't skb_split skbuffs with frag_list
0577ff1c69 Merge 5.10.116 into android12-5.10-lts
3f70116e5f Merge 5.10.115 into android12-5.10-lts
07a4d3649a Linux 5.10.116
d1ac096f88 mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic()
c6cbf5431a mm: hugetlb: fix missing cache flush in copy_huge_page_from_user()
308ff6a6e7 mm: fix missing cache flush for all tail pages of compound page
185fa5984d Bluetooth: Fix the creation of hdev->name
9ff4a6b806 arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
dfb55dcf9d nfp: bpf: silence bitwise vs. logical OR warning
f89f76f4b0 drm/amd/display/dc/gpio/gpio_service: Pass around correct dce_{version, environment} types
efd1429fa9 block: drbd: drbd_nl: Make conversion to 'enum drbd_ret_code' explicit
a71658c7db regulator: consumer: Add missing stubs to regulator/consumer.h
7648f42d1a MIPS: Use address-of operator on section symbols
2ed28105c6 ANDROID: GKI: update the abi .xml file due to hex_to_bin() changes
ee8877df71 Revert "tcp: ensure to use the most recently sent skb when filling the rate sample"
6273d79c86 Merge 5.10.114 into android12-5.10-lts
e61686bb77 Linux 5.10.115
8528806abe mmc: rtsx: add 74 Clocks in power on flow
e1ab92302b PCI: aardvark: Fix reading MSI interrupt number
49143c9ed2 PCI: aardvark: Clear all MSIs at setup
7676a5b99f dm: interlock pending dm_io and dm_wait_for_bios_completion
a439819f47 block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern
a22d66eb51 rcu: Apply callbacks processing time limit only on softirq
40fb3812d9 rcu: Fix callbacks processing time limit retaining cond_resched()
43dbc3edad KVM: LAPIC: Enable timer posted-interrupt only when mwait/hlt is advertised
9c8474fa34 KVM: x86/mmu: avoid NULL-pointer dereference on page freeing bugs
a474ee5ece KVM: x86: Do not change ICR on write to APIC_SELF_IPI
64e3e16dbc x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume
5f884e0c2e net/mlx5: Fix slab-out-of-bounds while reading resource dump menu
599fc32e74 kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU
0a960a3672 net: igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
4fd45ef704 btrfs: always log symlinks in full mode
687167eef9 smsc911x: allow using IRQ0
b280877eab selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer
a9fd5d6cd5 bnxt_en: Fix unnecessary dropping of RX packets
72e4fc1a4e bnxt_en: Fix possible bnxt_open() failure caused by wrong RFS flag
9ac9f07f0f selftests: mirror_gre_bridge_1q: Avoid changing PVID while interface is operational
475237e807 hinic: fix bug of wq out of bound access
1b9f1f455d net: emaclite: Add error handling for of_address_to_resource()
8459485db7 net: cpsw: add missing of_node_put() in cpsw_probe_dt()
4eee980950 net: stmmac: dwmac-sun8i: add missing of_node_put() in sun8i_dwmac_register_mdio_mux()
2347e9c922 net: dsa: mt7530: add missing of_node_put() in mt7530_setup()
1092656cc4 net: ethernet: mediatek: add missing of_node_put() in mtk_sgmii_init()
408fb2680e NFSv4: Don't invalidate inode attributes on delegation return
c1b480e6be RDMA/siw: Fix a condition race issue in MPA request processing
5bf2a45e33 selftests/seccomp: Don't call read() on TTY from background pgrp
3ea0b44c01 net/mlx5: Avoid double clear or set of sync reset requested
2455331591 net/mlx5e: Fix the calling of update_buffer_lossy() API
e07c13fbdd net/mlx5e: CT: Fix queued up restore put() executing after relevant ft release
d8338a7a09 net/mlx5e: Don't match double-vlan packets if cvlan is not set
c7f87ad115 net/mlx5e: Fix trust state reset in reload
87f0d9a518 ASoC: dmaengine: Restore NULL prepare_slave_config() callback
ad87f8498e hwmon: (adt7470) Fix warning on module removal
997b8605e8 gpio: pca953x: fix irq_stat not updated when irq is disabled (irq_mask not set)
879b075a9a NFC: netlink: fix sleep in atomic bug when firmware download timeout
1961c5a688 nfc: nfcmrvl: main: reorder destructive operations in nfcmrvl_nci_unregister_dev to avoid bugs
8a9e7c64f4 nfc: replace improper check device_is_registered() in netlink related functions
11adc9ab3e can: grcan: only use the NAPI poll budget for RX
4df5e498e0 can: grcan: grcan_probe(): fix broken system id check for errata workaround needs
dd973c0185 can: grcan: use ofdev->dev when allocating DMA memory
45bdcb5ca4 can: isotp: remove re-binding of bound socket
13959b9117 can: grcan: grcan_close(): fix deadlock
6c7c0e131e s390/dasd: Fix read inconsistency for ESE DASD devices
6e02c0413a s390/dasd: Fix read for ESE with blksize < 4k
ecc8396827 s390/dasd: prevent double format of tracks for ESE devices
30e008ab3f s390/dasd: fix data corruption for ESE devices
d53d47fadd ASoC: meson: Fix event generation for AUI CODEC mux
93a1f0755e ASoC: meson: Fix event generation for G12A tohdmi mux
e8b08e2f17 ASoC: meson: Fix event generation for AUI ACODEC mux
954d55170f ASoC: wm8958: Fix change notifications for DSP controls
f45359824a ASoC: da7219: Fix change notifications for tone generator frequency
e6e61aab49 genirq: Synchronize interrupt thread startup
dcf1150f2e net: stmmac: disable Split Header (SPH) for Intel platforms
68f35987d4 firewire: core: extend card->lock in fw_core_handle_bus_reset
629b4003a7 firewire: remove check of list iterator against head past the loop body
e757ff4bbc firewire: fix potential uaf in outbound_phy_packet_callback()
70d25d4fba Revert "SUNRPC: attempt AF_LOCAL connect on setup"
466721d767 drm/amd/display: Avoid reading audio pattern past AUDIO_CHANNELS_COUNT
2e6f3d665a iommu/vt-d: Calculate mask for non-aligned flushes
fbb7c61e76 KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id
b085afe226 gpiolib: of: fix bounds check for 'gpio-reserved-ranges'
2b7cb072d0 mmc: core: Set HS clock speed before sending HS CMD13
66651d7199 mmc: sdhci-msm: Reset GCC_SDCC_BCR register for SDHC
2906c73632 ALSA: fireworks: fix wrong return count shorter than expected by 4 bytes
03ab174805 ALSA: hda/realtek: Add quirk for Yoga Duet 7 13ITL6 speakers
a196f277c5 parisc: Merge model and model name into one line in /proc/cpuinfo
326f02f172 MIPS: Fix CP0 counter erratum detection for R4k CPUs
681997eca1 Revert "ipv6: make ip6_rt_gc_expire an atomic_t"
141fbd343b Revert "oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup"
ca9b002a16 Merge 5.10.113 into android12-5.10-lts
f64cd19a00 Merge branch 'android12-5.10' into `android12-5.10-lts`
f40e35e79c Linux 5.10.114
2d74f61787 perf symbol: Remove arch__symbols__fixup_end()
bf98302e68 tty: n_gsm: fix software flow control handling
95b267271a tty: n_gsm: fix incorrect UA handling
70b045d9ae tty: n_gsm: fix reset fifo race condition
320a24c4ef tty: n_gsm: fix wrong command frame length field encoding
935f314b6f tty: n_gsm: fix wrong command retry handling
17b86db43c tty: n_gsm: fix missing explicit ldisc flush
a2baa907c2 tty: n_gsm: fix wrong DLCI release order
705925e693 tty: n_gsm: fix insufficient txframe size
842a9bbbef netfilter: nft_socket: only do sk lookups when indev is available
7346e54dbf tty: n_gsm: fix malformed counter for out of frame data
d19613895e tty: n_gsm: fix wrong signal octet encoding in convergence layer type 2
26f127f6d9 tty: n_gsm: fix mux cleanup after unregister tty device
f26c271492 tty: n_gsm: fix decoupled mux resource
47132f9f7f tty: n_gsm: fix restart handling via CLD command
b3c88d46db perf symbol: Update symbols__fixup_end()
3d0a3168a3 perf symbol: Pass is_kallsyms to symbols__fixup_end()
2ab14625b8 x86/cpu: Load microcode during restore_processor_state()
795afbe8b4 thermal: int340x: Fix attr.show callback prototype
11d16498d7 net: ethernet: stmmac: fix write to sgmii_adapter_base
236dd62230 drm/i915: Fix SEL_FETCH_PLANE_*(PIPE_B+) register addresses
78d4dccf16 kasan: prevent cpu_quarantine corruption when CPU offline and cache shrink occur at same time
5fef6df273 zonefs: Clear inode information flags on inode creation
92ed64a920 zonefs: Fix management of open zones
42e8ec3b4b powerpc/perf: Fix 32bit compile
ac3d077043 drivers: net: hippi: Fix deadlock in rr_close()
5399e7b80c cifs: destage any unwritten data to the server before calling copychunk_write
80fc45377f x86: __memcpy_flushcache: fix wrong alignment if size > 2^32
585ef03c9e ext4: fix bug_on in start_this_handle during umount filesystem
07da0be588 ASoC: wm8731: Disable the regulator when probing fails
1b1747ad7e ASoC: Intel: soc-acpi: correct device endpoints for max98373
aa138efd2b tcp: fix F-RTO may not work correctly when receiving DSACK
9d56e369bd Revert "ibmvnic: Add ethtool private flag for driver-defined queue limits"
96904c8289 ibmvnic: fix miscellaneous checks
17f71272ef ixgbe: ensure IPsec VF<->PF compatibility
c33d717e06 net: fec: add missing of_node_put() in fec_enet_init_stop_mode()
9591967ac4 bnx2x: fix napi API usage sequence
1781beb879 tls: Skip tls_append_frag on zero copy size
77b922683e drm/amd/display: Fix memory leak in dcn21_clock_source_create
18068e0527 drm/amdkfd: Fix GWS queue count
c0396f5e5b net: dsa: lantiq_gswip: Don't set GSWIP_MII_CFG_RMII_CLK
1204386e26 net: phy: marvell10g: fix return value on error
e974c730f0 net: bcmgenet: hide status block before TX timestamping
ee71b47da5 clk: sunxi: sun9i-mmc: check return value after calling platform_get_resource()
8dacbef4fe bus: sunxi-rsb: Fix the return value of sunxi_rsb_device_create()
9f29f6f8da tcp: make sure treq->af_specific is initialized
8a9d6ca360 tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
720b6ced85 ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode
41661b4c1a ip6_gre: Make o_seqno start from 0 in native mode
7b187fbd7e ip_gre: Make o_seqno start from 0 in native mode
83d128daff net/smc: sync err code when tcp connection was refused
9eb25e00f5 net: hns3: add return value for mailbox handling in PF
929c30c02d net: hns3: add validity check for message data length
e3ec78d82d net: hns3: modify the return code of hclge_get_ring_chain_from_mbx
06a40e7105 cpufreq: fix memory leak in sun50i_cpufreq_nvmem_probe
fb172e93f8 pinctrl: pistachio: fix use of irq_of_parse_and_map()
8f042884af arm64: dts: imx8mn-ddr4-evk: Describe the 32.768 kHz PMIC clock
73c35379db ARM: dts: imx6ull-colibri: fix vqmmc regulator
61a89d0a5b sctp: check asoc strreset_chunk in sctp_generate_reconf_event
41d6ac687d wireguard: device: check for metadata_dst with skb_valid_dst()
3c464db03c tcp: ensure to use the most recently sent skb when filling the rate sample
ce4c3f7087 pinctrl: stm32: Keep pinctrl block clock enabled when LEVEL IRQ requested
0c60271df0 tcp: md5: incorrect tcp_header_len for incoming connections
f4dad5a48d pinctrl: rockchip: fix RK3308 pinmux bits
9ef33d23f8 bpf, lwt: Fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook
6ac03e6ddd netfilter: nft_set_rbtree: overlap detection with element re-addition after deletion
72ae15d5ce net: dsa: Add missing of_node_put() in dsa_port_link_register_of
14cc2044c1 memory: renesas-rpc-if: Fix HF/OSPI data transfer in Manual Mode
690c1bc4bf pinctrl: stm32: Do not call stm32_gpio_get() for edge triggered IRQs in EOI
6f2bf9c5dd mtd: fix 'part' field data corruption in mtd_info
4da421035b mtd: rawnand: Fix return value check of wait_for_completion_timeout
94ca69b702 pinctrl: mediatek: moore: Fix build error
123b7e0388 ipvs: correctly print the memory size of ip_vs_conn_tab
f4446f2136 ARM: dts: logicpd-som-lv: Fix wrong pinmuxing on OMAP35
4a526cc29c ARM: dts: am3517-evm: Fix misc pinmuxing
b622bca852 ARM: dts: Fix mmc order for omap3-gta04
9419d27fe1 phy: ti: Add missing pm_runtime_disable() in serdes_am654_probe
9e00a6e1fd phy: mapphone-mdm6600: Fix PM error handling in phy_mdm6600_probe
eb659608e6 ARM: dts: at91: sama5d4_xplained: fix pinctrl phandle name
bb524f5a95 ARM: dts: at91: Map MCLK for wm8731 on at91sam9g20ek
4691ce8f28 phy: ti: omap-usb2: Fix error handling in omap_usb2_enable_clocks
76d1591a38 bus: ti-sysc: Make omap3 gpt12 quirk handling SoC specific
1b9855bf31 ARM: OMAP2+: Fix refcount leak in omap_gic_of_init
93cc8f184e phy: samsung: exynos5250-sata: fix missing device put in probe error paths
3ca7491570 phy: samsung: Fix missing of_node_put() in exynos_sata_phy_probe
8f7644ac24 ARM: dts: imx6qdl-apalis: Fix sgtl5000 detection issue
23b0711fcd USB: Fix xhci event ring dequeue pointer ERDP update issue
712302aed1 mtd: rawnand: fix ecc parameters for mt7622
207c7af341 iio:imu:bmi160: disable regulator in error path
70d2df257e arm64: dts: meson: remove CPU opps below 1GHz for SM1 boards
2d320609be arm64: dts: meson: remove CPU opps below 1GHz for G12B boards
c4fb41bdf4 video: fbdev: udlfb: properly check endpoint type
0967830e72 iocost: don't reset the inuse weight of under-weighted debtors
ad604cbd1d x86/pci/xen: Disable PCI/MSI[-X] masking for XEN_HVM guests
8fcce58c59 riscv: patch_text: Fixup last cpu should be master
51477d3b38 hex2bin: fix access beyond string end
616d354fb9 hex2bin: make the function hex_to_bin constant-time
1633cb2d4a pinctrl: samsung: fix missing GPIOLIB on ARM64 Exynos config
bdc3ad9251 arch_topology: Do not set llc_sibling if llc_id is invalid
aaee3f6617 serial: 8250: Correct the clock for EndRun PTP/1588 PCIe device
662f945a20 serial: 8250: Also set sticky MCR bits in console restoration
8be962c89d serial: imx: fix overrun interrupts in DMA mode
d22d92230f usb: phy: generic: Get the vbus supply
b820764c64 usb: cdns3: Fix issue for clear halt endpoint
bd7f84708e usb: dwc3: gadget: Return proper request status
a633b8c341 usb: dwc3: core: Only handle soft-reset in DCTL
5fa59bb867 usb: dwc3: core: Fix tx/rx threshold settings
140801d3fb usb: dwc3: Try usb-role-switch first in dwc3_drd_init
4dd5feb279 usb: gadget: configfs: clear deactivation flag in configfs_composite_unbind()
6c3da0e19c usb: gadget: uvc: Fix crash when encoding data for usb request
fb1fe1a455 usb: typec: ucsi: Fix role swapping
06826eb063 usb: typec: ucsi: Fix reuse of completion structure
7b510d4bb4 usb: misc: fix improper handling of refcount in uss720_probe()
bb8ecca2dd iio: imu: inv_icm42600: Fix I2C init possible nack
ca2b54b6ad iio: magnetometer: ak8975: Fix the error handling in ak8975_power_on()
1060604fc7 iio: dac: ad5446: Fix read_raw not returning set value
6ff33c01be iio: dac: ad5592r: Fix the missing return value.
06ada9487f xhci: increase usb U3 -> U0 link resume timeout from 100ms to 500ms
e1be000166 xhci: stop polling roothubs after shutdown
2eb6c86891 xhci: Enable runtime PM on second Alderlake controller
63eda431b2 USB: serial: option: add Telit 0x1057, 0x1058, 0x1075 compositions
e9971dac69 USB: serial: option: add support for Cinterion MV32-WA/MV32-WB
34ff5455ee USB: serial: cp210x: add PIDs for Kamstrup USB Meter Reader
729a81ae10 USB: serial: whiteheat: fix heap overflow in WHITEHEAT_GET_DTR_RTS
008ba29f33 USB: quirks: add STRING quirk for VCOM device
ac6ad0ef83 USB: quirks: add a Realtek card reader
8ba02cebb7 usb: mtu3: fix USB 3.0 dual-role-switch from device to host
549209caab lightnvm: disable the subsystem
54c028cfc4 floppy: disable FDRAWCMD by default
de64d941a7 Merge 5.10.112 into android12-5.10-lts
54af9dd2b9 Linux 5.10.113
7992fdb045 Revert "net: micrel: fix KS8851_MLL Kconfig"
8bedbc8f7f block/compat_ioctl: fix range check in BLKGETSIZE
fea24b07ed staging: ion: Prevent incorrect reference counting behavour
dccee748af spi: atmel-quadspi: Fix the buswidth adjustment between spi-mem and controller
572761645b jbd2: fix a potential race while discarding reserved buffers after an abort
50aac44273 can: isotp: stop timeout monitoring when no first frame was sent
e1e96e3727 ext4: force overhead calculation if the s_overhead_cluster makes no sense
4789149b9e ext4: fix overhead calculation to account for the reserved gdt blocks
0c54b09376 ext4, doc: fix incorrect h_reserved size
22c450d39f ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
75ac724684 ext4: fix use-after-free in ext4_search_dir
a46b3d8498 ext4: fix symlink file size not match to file content
f6038d43b2 ext4: fix fallocate to use file_modified to update permissions consistently
19590bbc69 perf report: Set PERF_SAMPLE_DATA_SRC bit for Arm SPE event
e012f9d1af powerpc/perf: Fix power9 event alternatives
0a2cef65b3 drm/vc4: Use pm_runtime_resume_and_get to fix pm_runtime_get_sync() usage
f8f8b3124b KVM: PPC: Fix TCE handling for VFIO
405d984274 drm/panel/raspberrypi-touchscreen: Initialise the bridge in prepare
231381f521 drm/panel/raspberrypi-touchscreen: Avoid NULL deref if not initialised
51d9cbbb0f perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled
88fcfd6ee6 sched/pelt: Fix attach_entity_load_avg() corner case
c55327bc37 arm_pmu: Validate single/group leader events
5580b974a8 ARC: entry: fix syscall_trace_exit argument
7082650eb8 e1000e: Fix possible overflow in LTR decoding
43a2a3734a ASoC: soc-dapm: fix two incorrect uses of list iterator
54e6180c8c gpio: Request interrupts after IRQ is initialized
0837ff17d0 openvswitch: fix OOB access in reserve_sfa_size()
19f6dcb1f0 xtensa: fix a7 clobbering in coprocessor context load/store
f399ab11dd xtensa: patch_text: Fixup last cpu should be master
ba2716da23 net: atlantic: invert deep par in pm functions, preventing null derefs
358a3846f6 dma: at_xdmac: fix a missing check on list iterator
cf23a960c5 ata: pata_marvell: Check the 'bmdma_addr' beforing reading
9ca66d7914 mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
ed5d4efb4d oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
6b932920b9 mm, hugetlb: allow for "high" userspace addresses
50cbc583fa EDAC/synopsys: Read the error count from the correct register
7ec6e06ee4 nvme-pci: disable namespace identifiers for Qemu controllers
316bd86c22 nvme: add a quirk to disable namespace identifiers
76101c8e0c stat: fix inconsistency between struct stat and struct compat_stat
bf28bba304 scsi: qedi: Fix failed disconnect handling
a284cca3d8 net: macb: Restart tx only if queue pointer is lagging
9581e07b54 drm/msm/mdp5: check the return of kzalloc()
8d71edabb0 dpaa_eth: Fix missing of_node_put in dpaa_get_ts_info()
b3afe5a7fd brcmfmac: sdio: Fix undefined behavior due to shift overflowing the constant
202748f441 mt76: Fix undefined behavior due to shift overflowing the constant
0de9c104d0 net: atlantic: Avoid out-of-bounds indexing
5bef9fc38f cifs: Check the IOCB_DIRECT flag, not O_DIRECT
e129c55153 vxlan: fix error return code in vxlan_fdb_append
8e7ea11364 arm64: dts: imx: Fix imx8*-var-som touchscreen property sizes
cd227ac03f ALSA: usb-audio: Fix undefined behavior due to shift overflowing the constant
490815f0b5 platform/x86: samsung-laptop: Fix an unsigned comparison which can never be negative
cb17b56a9b reset: tegra-bpmp: Restore Handle errors in BPMP response
d513ea9b7e ARM: vexpress/spc: Avoid negative array index when !SMP
052e4a661f arm64: mm: fix p?d_leaf()
18ff7a2efa arm64/mm: Remove [PUD|PMD]_TABLE_BIT from [pud|pmd]_bad()
3bf8ca3501 selftests: mlxsw: vxlan_flooding: Prevent flooding of unwanted packets
520aab8b72 dmaengine: idxd: add RO check for wq max_transfer_size write
9a3c026dc3 dmaengine: idxd: add RO check for wq max_batch_size write
f593f49fcd net: stmmac: Use readl_poll_timeout_atomic() in atomic state
3d55b19574 netlink: reset network and mac headers in netlink_dump()
49516e6ed9 ipv6: make ip6_rt_gc_expire an atomic_t
078d839f11 l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu
0ac8f83d8f net/sched: cls_u32: fix possible leak in u32_init_knode()
93366275be ip6_gre: Fix skb_under_panic in __gre6_xmit()
200f96ebb3 ip6_gre: Avoid updating tunnel->tun_hlen in __gre6_xmit()
8fb76adb89 net/packet: fix packet_sock xmit return value checking
a499cb5f3e net/smc: Fix sock leak when release after smc_shutdown()
60592f16a4 rxrpc: Restore removed timer deletion
fc7116a79a igc: Fix BUG: scheduling while atomic
46b0e4f998 igc: Fix infinite loop in release_swfw_sync
c075c3ea03 esp: limit skb_page_frag_refill use to a single page
3f7914dbea spi: spi-mtk-nor: initialize spi controller after resume
f714abf28f dmaengine: mediatek:Fix PM usage reference leak of mtk_uart_apdma_alloc_chan_resources
9bc949a181 dmaengine: imx-sdma: Fix error checking in sdma_event_remap
12aa8021c7 ASoC: codecs: wcd934x: do not switch off SIDO Buck when codec is in use
b6f474cd30 ASoC: msm8916-wcd-digital: Check failure for devm_snd_soc_register_component
608fc58858 ASoC: atmel: Remove system clock tree configuration for at91sam9g20ek
d29c78d3f9 dm: fix mempool NULL pointer race when completing IO
cf9b195464 ALSA: hda/realtek: Add quirk for Clevo NP70PNP
8ce3820fc9 ALSA: usb-audio: Clear MIDI port active flag after draining
43ce33a68e net/sched: cls_u32: fix netns refcount changes in u32_change()
04dd45d977 gfs2: assign rgrp glock before compute_bitstructs
378061c9b8 perf tools: Fix segfault accessing sample_id xyarray
5e8446e382 tracing: Dump stacktrace trigger to the corresponding instance
69848f9488 mm: page_alloc: fix building error on -Werror=array-compare
08ad7a770e etherdevice: Adjust ether_addr* prototypes to silence -Wstringop-overead
904c5c08bb ANDROID: fix up gpio change in 5.10.111
5dadf6321c Merge 5.10.111 into android12-5.10-lts
1052f9bce6 Linux 5.10.112
5c62d3bf14 ax25: Fix UAF bugs in ax25 timers
f934fa478d ax25: Fix NULL pointer dereferences in ax25 timers
145ea8d213 ax25: fix NPD bug in ax25_disconnect
a4942c6fea ax25: fix UAF bug in ax25_send_control()
b20a5ab0f5 ax25: Fix refcount leaks caused by ax25_cb_del()
57cc15f5fd ax25: fix UAF bugs of net_device caused by rebinding operation
5ddae8d064 ax25: fix reference count leaks of ax25_dev
5ea00fc606 ax25: add refcount in ax25_dev to avoid UAF bugs
361288633b scsi: iscsi: Fix unbound endpoint error handling
129db30599 scsi: iscsi: Fix endpoint reuse regression
26f827e095 dma-direct: avoid redundant memory sync for swiotlb
9a5a4d23e2 timers: Fix warning condition in __run_timers()
84837f43e5 i2c: pasemi: Wait for write xfers to finish
89496d80bf smp: Fix offline cpu check in flush_smp_call_function_queue()
cd02b2687d dm integrity: fix memory corruption when tag_size is less than digest size
0a312ec66a ARM: davinci: da850-evm: Avoid NULL pointer dereference
0806f19305 tick/nohz: Use WARN_ON_ONCE() to prevent console saturation
0275c75955 genirq/affinity: Consider that CPUs on nodes can be unbalanced
1fcfe37d17 drm/amdgpu: Enable gfxoff quirk on MacBook Pro
68ae52efa1 drm/amd/display: don't ignore alpha property on pre-multiplied mode
a263712ba8 ipv6: fix panic when forwarding a pkt with no in6 dev
659214603b nl80211: correctly check NL80211_ATTR_REG_ALPHA2 size
912797e54c ALSA: pcm: Test for "silence" field in struct "pcm_format_data"
48d070ca5e ALSA: hda/realtek: add quirk for Lenovo Thinkpad X12 speakers
163e162471 ALSA: hda/realtek: Add quirk for Clevo PD50PNT
5e4dd17998 btrfs: mark resumed async balance as writing
1d2eda18f6 btrfs: fix root ref counts in error handling in btrfs_get_root_ref
9b7ec35253 ath9k: Fix usage of driver-private space in tx_info
0f65cedae5 ath9k: Properly clear TX status area before reporting to mac80211
cc21ae9326 gcc-plugins: latent_entropy: use /dev/urandom
c089ffc846 memory: renesas-rpc-if: fix platform-device leak in error path
342454231e KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded
06c348fde5 mm: kmemleak: take a full lowmem check in kmemleak_*_phys()
20ed94f818 mm: fix unexpected zeroed page mapping with zram swap
192e507ef8 mm, page_alloc: fix build_zonerefs_node()
000b3921b4 perf/imx_ddr: Fix undefined behavior due to shift overflowing the constant
ca24c5e8f0 drivers: net: slip: fix NPD bug in sl_tx_timeout()
e8cf1e4d95 scsi: megaraid_sas: Target with invalid LUN ID is deleted during scan
5b7ce74b6b scsi: mvsas: Add PCI ID of RocketRaid 2640
4b44cd5840 drm/amd/display: Fix allocate_mst_payload assert on resume
34ea097fb6 drm/amd/display: Revert FEC check in validation
fa5ee7c423 myri10ge: fix an incorrect free for skb in myri10ge_sw_tso
d90df6da50 net: usb: aqc111: Fix out-of-bounds accesses in RX fixup
9c12fcf1d8 net: axienet: setup mdio unconditionally
b643807a73 tlb: hugetlb: Add more sizes to tlb_remove_huge_tlb_entry
98973d2bdd arm64: alternatives: mark patch_alternative() as `noinstr`
2462faffbf regulator: wm8994: Add an off-on delay for WM8994 variant
aa8cdedaf7 gpu: ipu-v3: Fix dev_dbg frequency output
150fe861c5 ata: libata-core: Disable READ LOG DMA EXT for Samsung 840 EVOs
1ff5359afa net: micrel: fix KS8851_MLL Kconfig
d3478709ed scsi: ibmvscsis: Increase INITIAL_SRP_LIMIT to 1024
b9a110fa75 scsi: lpfc: Fix queue failures when recovering from PCI parity error
aec36b98a1 scsi: target: tcmu: Fix possible page UAF
4366679805 Drivers: hv: vmbus: Prevent load re-ordering when reading ring buffer
1d7a5aae88 drm/amdkfd: Check for potential null return of kmalloc_array()
e5afacc826 drm/amdgpu/vcn: improve vcn dpg stop procedure
d2e0931e6d drm/amdkfd: Fix Incorrect VMIDs passed to HWS
7fc0610ad8 drm/amd/display: Update VTEM Infopacket definition
6906e05cf3 drm/amd/display: FEC check in timing validation
756c61c168 drm/amd/display: fix audio format not updated after edid updated
76e086ce7b btrfs: do not warn for free space inode in cow_file_range
217190dc66 btrfs: fix fallocate to use file_modified to update permissions consistently
9b5d1b3413 drm/amd: Add USBC connector ID
6f9c06501d net: bcmgenet: Revert "Use stronger register read/writes to assure ordering"
504c15f07f dm mpath: only use ktime_get_ns() in historical selector
4e166a4118 cifs: potential buffer overflow in handling symlinks
67677050ce nfc: nci: add flush_workqueue to prevent uaf
bfba9722cf perf tools: Fix misleading add event PMU debug message
280f721edc testing/selftests/mqueue: Fix mq_perf_tests to free the allocated cpu set
eb8873b324 sctp: Initialize daddr on peeled off socket
45226fac4d scsi: iscsi: Fix conn cleanup and stop race during iscsid restart
73805795c9 scsi: iscsi: Fix offload conn cleanup when iscsid restarts
699bd835c3 scsi: iscsi: Move iscsi_ep_disconnect()
46f37a34a5 scsi: iscsi: Fix in-kernel conn failure handling
8125738967 scsi: iscsi: Rel ref after iscsi_lookup_endpoint()
22608545b8 scsi: iscsi: Use system_unbound_wq for destroy_work
4029a1e992 scsi: iscsi: Force immediate failure during shutdown
17d14456f6 scsi: iscsi: Stop queueing during ep_disconnect
da9cf24aa7 scsi: pm80xx: Enable upper inbound, outbound queues
e08d269712 scsi: pm80xx: Mask and unmask upper interrupt vectors 32-63
35b91e49bc net/smc: Fix NULL pointer dereference in smc_pnet_find_ib()
98a7f6c4ad drm/msm/dsi: Use connector directly in msm_dsi_manager_connector_init()
5f78ad9383 drm/msm: Fix range size vs end confusion
5513f9a0b0 cfg80211: hold bss_lock while updating nontrans_list
a44938950e net/sched: taprio: Check if socket flags are valid
08d5e3e954 net: ethernet: stmmac: fix altr_tse_pcs function when using a fixed-link
2ad9d890d8 net: dsa: felix: suppress -EPROBE_DEFER errors
f2cc341fcc net/sched: fix initialization order when updating chain 0 head
7a7cf84148 mlxsw: i2c: Fix initialization error flow
43e58e119a net: mdio: Alphabetically sort header inclusion
9709c8b5cd gpiolib: acpi: use correct format characters
d67c900f19 veth: Ensure eth header is in skb's linear part
845f44ce3d net/sched: flower: fix parsing of ethertype following VLAN header
85ee17ca21 SUNRPC: Fix the svc_deferred_event trace class
af12dd7123 media: rockchip/rga: do proper error checking in probe
5637129712 firmware: arm_scmi: Fix sorting of retrieved clock rates
16c628b0c6 memory: atmel-ebi: Fix missing of_node_put in atmel_ebi_probe
cb66641f81 drm/msm: Add missing put_task_struct() in debugfs path
921fdc45a0 btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups()
5d131318bb ACPI: processor idle: Check for architectural support for LPI
503934df31 cpuidle: PSCI: Move the `has_lpi` check to the beginning of the function
cfa98ffc42 hamradio: remove needs_free_netdev to avoid UAF
80a4df1464 hamradio: defer 6pack kfree after unregister_netdev
f0c31f192f drm/amdkfd: Use drm_priv to pass VM from KFD to amdgpu
6c8e5cb264 Linux 5.10.111
d36febbcd5 powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit
5c672073bc mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning
5973f7507a irqchip/gic, gic-v3: Prevent GSI to SGI translations
000e09462f Drivers: hv: vmbus: Replace smp_store_mb() with virt_store_mb()
e1f540b752 arm64: module: remove (NOLOAD) from linker script
919823bd67 selftests: cgroup: Test open-time cgroup namespace usage for migration checks
637eca44b8 selftests: cgroup: Test open-time credential usage for migration checks
9dd39d2c65 selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644
e74da71e66 selftests/cgroup: Fix build on older distros
4665722d36 cgroup: Use open-time credentials for process migraton perm checks
f089471d1b mm: don't skip swap entry even if zap_details specified
58823a9b09 ubsan: remove CONFIG_UBSAN_OBJECT_SIZE
03b39bbbec dmaengine: Revert "dmaengine: shdma: Fix runtime PM imbalance on error"
40e00885a6 tools build: Use $(shell ) instead of `` to get embedded libperl's ccopts
75c8558d41 tools build: Filter out options and warnings not supported by clang
6374faf49e perf python: Fix probing for some clang command line options
79abc219ba perf build: Don't use -ffat-lto-objects in the python feature test when building with clang-13
82e4395014 drm/amdkfd: Create file descriptor after client is added to smi_clients list
326b408e7e drm/nouveau/pmu: Add missing callbacks for Tegra devices
786ae8de3a drm/amdgpu/smu10: fix SoC/fclk units in auto mode
ff24114bb0 irqchip/gic-v3: Fix GICR_CTLR.RWP polling
451214b266 perf: qcom_l2_pmu: fix an incorrect NULL check on list iterator
fc629224aa ata: sata_dwc_460ex: Fix crash due to OOB write
7e88a50704 gpio: Restrict usage of GPIO chip irq members before initialization
5f54364ff6 RDMA/hfi1: Fix use-after-free bug for mm struct
8bb4168291 arm64: patch_text: Fixup last cpu should be master
a044bca8ef btrfs: prevent subvol with swapfile from being deleted
82ae73ac96 btrfs: fix qgroup reserve overflow the qgroup limit
fc4bdaed4d x86/speculation: Restore speculation related MSRs during S3 resume
8c9e26c890 x86/pm: Save the MSR validity status at context setup
2827328e64 io_uring: fix race between timeout flush and removal
f7e183b0a7 mm/mempolicy: fix mpol_new leak in shared_policy_replace
7d659cb176 mmmremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0)
6adc01a7aa lz4: fix LZ4_decompress_safe_partial read out of bound
8b6f04b4c9 mmc: renesas_sdhi: don't overwrite TAP settings when HS400 tuning is complete
029b417073 mmc: mmci: stm32: correctly check all elements of sg list
41a519c05b Revert "mmc: sdhci-xenon: fix annoying 1.8V regulator warning"
9de98470db arm64: Add part number for Arm Cortex-A78AE
4604b5738d perf session: Remap buf if there is no space for event
362ced3769 perf tools: Fix perf's libperf_print callback
65210fac63 perf: arm-spe: Fix perf report --mem-mode
bd905fed87 iommu/omap: Fix regression in probe for NULL pointer dereference
b3c00be2ff SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec()
9a45e08636 SUNRPC: Handle low memory situations in call_status()
132cbe2f18 SUNRPC: Handle ENOMEM in call_transmit_status()
aed30a2054 io_uring: don't touch scm_fp_list after queueing skb
594205b493 drbd: Fix five use after free bugs in get_initial_state
970a6bb729 bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
6c17f4ef3c spi: bcm-qspi: fix MSPI only access with bcm_qspi_exec_mem_op()
8928239e5e qede: confirm skb is allocated before using
b7893388bb net: phy: mscc-miim: reject clause 45 register accesses
08ff0e74fa rxrpc: fix a race in rxrpc_exit_net()
5ae05b5eb5 net: openvswitch: fix leak of nested actions
42ab401d22 net: openvswitch: don't send internal clone attribute to the userspace.
e54ea8fc51 ice: synchronize_rcu() when terminating rings
e3dd1202ab ipv6: Fix stats accounting in ip6_pkt_drop
ffce126c95 ice: Do not skip not enabled queues in ice_vc_dis_qs_msg
b003fc4913 ice: Set txq_teid to ICE_INVAL_TEID on ring creation
ebd1e3458d dpaa2-ptp: Fix refcount leak in dpaa2_ptp_probe
43c2d7890e IB/rdmavt: add lock to call to rvt_error_qp to prevent a race condition
3a57babfb6 RDMA/mlx5: Don't remove cache MRs when a delay is needed
d8992b393f sfc: Do not free an empty page_ring
0ac74169eb bnxt_en: reserve space inside receive page for skb_shared_info
f8b0ef0a58 drm/imx: Fix memory leak in imx_pd_connector_get_modes
25bc9fd4c8 drm/imx: imx-ldb: Check for null pointer after calling kmemdup
02ab4abe5b net: stmmac: Fix unset max_speed difference between DT and non-DT platforms
63ea57478a net: ipv4: fix route with nexthop object delete warning
4be6ed0310 ice: Clear default forwarding VSI during VSI release
589154d0f1 net/tls: fix slab-out-of-bounds bug in decrypt_internal
c5f77b5953 scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one()
45b9932b4d NFSv4: fix open failure with O_ACCMODE flag
c688705a39 Revert "NFSv4: Handle the special Linux file open access mode"
cf580d2e38 Drivers: hv: vmbus: Fix potential crash on module unload
0c122eb3a1 drm/amdgpu: fix off by one in amdgpu_gfx_kiq_acquire()
84e5dfc05f Revert "hv: utils: add PTP_1588_CLOCK to Kconfig to fix build"
3c3fbfa6dd mm: fix race between MADV_FREE reclaim and blkdev direct IO read
1753a49e26 parisc: Fix patch code locking and flushing
f7c3522030 parisc: Fix CPU affinity for Lasi, WAX and Dino chips
c74e2f6ecc NFS: Avoid writeback threads getting stuck in mempool_alloc()
34681aeddc NFS: nfsiod should not block forever in mempool_alloc()
7a506fabcf SUNRPC: Fix socket waits for write buffer space
b9c5ac0a15 jfs: prevent NULL deref in diFree
c69b442125 virtio_console: eliminate anonymous module_init & module_exit
3309b32217 serial: samsung_tty: do not unlock port->lock for uart_write_wakeup()
9cb90f9ad5 x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
b3882e78aa NFS: swap-out must always use STABLE writes.
d4170a2821 NFS: swap IO handling is slightly different for O_DIRECT IO
4b6f122bdf SUNRPC: remove scheduling boost for "SWAPPER" tasks.
f4fc47e71e SUNRPC/xprt: async tasks mustn't block waiting for memory
f9244d31e0 SUNRPC/call_alloc: async tasks mustn't block waiting for memory
e2b2542f74 clk: Enforce that disjoints limits are invalid
1e9b5538cf clk: ti: Preserve node in ti_dt_clocks_register()
a2a0e04f64 xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32
4a2544ce24 NFSv4: Protect the state recovery thread against direct reclaim
9b9feec97c NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify()
2e16895d06 w1: w1_therm: fixes w1_seq for ds28ea00 sensors
93498c6e77 staging: wfx: fix an error handling in wfx_init_common()
8f1d24f85f phy: amlogic: meson8b-usb2: Use dev_err_probe()
aa0b729678 staging: vchiq_core: handle NULL result of find_service_by_handle
be4ecca958 clk: si5341: fix reported clk_rate when output divider is 2
c9cf6baabf minix: fix bug when opening a file with O_DIRECT
8d9efd4434 init/main.c: return 1 from handled __setup() functions
f442978612 ceph: fix memory leak in ceph_readdir when note_last_dentry returns error
d745512d54 netlabel: fix out-of-bounds memory accesses
2cc803804e Bluetooth: Fix use after free in hci_send_acl
789621df19 MIPS: ingenic: correct unit node address
61e25021e6 xtensa: fix DTC warning unit_address_format
f6b9550f53 usb: dwc3: omap: fix "unbalanced disables for smps10_out1" on omap5evm
a4dd3e9e5a net: sfp: add 2500base-X quirk for Lantech SFP module
278b652f0a net: limit altnames to 64k total
423e7107f6 net: account alternate interface name memory
74c4d50255 can: isotp: set default value for N_As to 50 micro seconds
1d7effe5ff scsi: libfc: Fix use after free in fc_exch_abts_resp()
02222bf4f0 powerpc/secvar: fix refcount leak in format_show()
fd416c3f5a MIPS: fix fortify panic when copying asm exception handlers
7c657c0694 PCI: endpoint: Fix misused goto label
79cfc0052f bnxt_en: Eliminate unintended link toggle during FW reset
9567d54e70 Bluetooth: use memset avoid memory leaks
f9b183f133 Bluetooth: Fix not checking for valid hdev on bt_dev_{info,warn,err,dbg}
647b35aaf4 tuntap: add sanity checks about msg_controllen in sendmsg
797b4ea951 macvtap: advertise link netns via netlink
142ae7d4f2 mips: ralink: fix a refcount leak in ill_acc_of_setup()
f2565cb40e net/smc: correct settings of RMB window update limit
224903cc60 scsi: hisi_sas: Free irq vectors in order for v3 HW
f49ffaa85d scsi: aha152x: Fix aha152x_setup() __setup handler return value
91ee8a14ef mt76: mt7615: Fix assigning negative values to unsigned variable
d83574666b scsi: pm8001: Fix memory leak in pm8001_chip_fw_flash_update_req()
a0bb65eadb scsi: pm8001: Fix tag leaks on error
2051044d79 scsi: pm8001: Fix task leak in pm8001_send_abort_all()
3bd9a28798 scsi: pm8001: Fix pm8001_mpi_task_abort_resp()
ef969095c4 scsi: pm8001: Fix pm80xx_pci_mem_copy() interface
fe4b6d5a0d drm/amdkfd: make CRAT table missing message informational only
2f2f017ea8 dm: requeue IO if mapping table not yet available
71c8df33fd dm ioctl: prevent potential spectre v1 gadget
f655b724b4 ipv4: Invalidate neighbour for broadcast address upon address addition
bae03957e8 iwlwifi: mvm: Correctly set fragmented EBS
9538563d31 power: supply: axp288-charger: Set Vhold to 4.4V
c66cc04043 PCI: pciehp: Add Qualcomm quirk for Command Completed erratum
b1b27b0e8d tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.
b02a1a6502 PCI: endpoint: Fix alignment fault error in copy tests
4820847e8b usb: ehci: add pci device support for Aspeed platforms
0b9cf0b599 iommu/arm-smmu-v3: fix event handling soft lockup
e07e420a00 PCI: aardvark: Fix support for MSI interrupts
6694b8643b drm/amdgpu: Fix recursive locking warning
ea21eaea7f powerpc: Set crashkernel offset to mid of RMA region
fb5ac62fbe ipv6: make mc_forwarding atomic
5baf92a2c4 libbpf: Fix build issue with llvm-readelf
26a1e4739e cfg80211: don't add non transmitted BSS to 6GHz scanned channels
9a56e2b271 mt76: dma: initialize skip_unmap in mt76_dma_rx_fill
b42b6d0ec3 power: supply: axp20x_battery: properly report current when discharging
de9505936c scsi: bfa: Replace snprintf() with sysfs_emit()
ed7db95920 scsi: mvsas: Replace snprintf() with sysfs_emit()
995f517888 bpf: Make dst_port field in struct bpf_sock 16-bit wide
339bd0b55e ath11k: mhi: use mhi_sync_power_up()
c6a815f5ab ath11k: fix kernel panic during unload/load ath11k modules
e4d2d72013 powerpc: dts: t104xrdb: fix phy type for FMAN 4/5
02e2ee8619 ptp: replace snprintf with sysfs_emit
9ea17b9f1d usb: gadget: tegra-xudc: Fix control endpoint's definitions
07971b818e usb: gadget: tegra-xudc: Do not program SPARAM
927beb05aa drm/amd/amdgpu/amdgpu_cs: fix refcount leak of a dma_fence obj
85313d9bc7 drm/amd/display: Add signal type check when verify stream backends same
9d7d83d039 ath5k: fix OOB in ath5k_eeprom_read_pcal_info_5111
850c4351e8 drm: Add orientation quirk for GPD Win Max
a24479c5e9 KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
66b0fa6b22 KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
2e52a29470 rtc: wm8350: Handle error for wm8350_register_irq
0777fe98a4 gfs2: gfs2_setattr_size error path fix
f349d7f9ee gfs2: Fix gfs2_release for non-writers regression
3f53715fd5 gfs2: Check for active reservation in gfs2_release
2dc49f58a2 ubifs: Rectify space amount budget for mkdir/tmpfile operations

Update the .xml file with the following needed changes that came in from
the -lts branch to handle ABI issues with LTS security fixes:

Leaf changes summary: 3 artifacts changed
Changed leaf types summary: 2 leaf types changed
Removed/Changed/Added functions summary: 0 Removed, 1 Changed, 0 Added function
Removed/Changed/Added variables summary: 0 Removed, 0 Changed, 0 Added variable

1 function with some sub-type change:

  [C] 'function int hex_to_bin(char)' at hexdump.c:53:1 has some sub-type changes:
    parameter 1 of type 'char' changed:
      type name changed from 'char' to 'unsigned char'
      type size hasn't changed

'struct gpio_chip at driver.h:362:1' changed (indirectly):
  type size hasn't changed
  there are data member changes:
    type 'struct gpio_irq_chip' of 'gpio_chip::irq' changed:
      type size hasn't changed
      there are data member changes:
        data member u64 android_kabi_reserved1 at offset 2304 (in bits) became anonymous data member 'union {bool initialized; struct {u64 android_kabi_reserved1;}; union {};}'
      1265 impacted interfaces
  1265 impacted interfaces

'struct gpio_irq_chip at driver.h:32:1' changed:
  details were reported earlier

Change-Id: Iface7385c5d82fbcdaeb92fda79ac3cd1835d323
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2022-07-27 11:21:05 +02:00
Bing Han
1aa26f0017 ANDROID: vendor_hook: Add hook in page_referenced_one()
Add android_vh_page_referenced_one_end at the end of function
page_referenced_one to update the status that whether the page
need to be reclaimed to a specified swap location.

Bug: 234214858
Signed-off-by: Bing Han <bing.han@transsion.com>
Change-Id: Ia06a229956328ef776da5d163708dcb011a327fb
2022-06-30 03:00:23 +00:00
Greg Kroah-Hartman
5dadf6321c This is the 5.10.111 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmJXHgUACgkQONu9yGCS
 aT7BohAAx7alIKg1d4gbIHhO6eimWLWLj95ncyeq6xtNT+qKqdYgp+w8xKAJ8QLG
 sG9sbGcoWYkgOLcSy4rztgh9HBQuGvY6vLFqygRw5HXN1iAirYlr7DJCCKRc1pPZ
 E5ASOzbkfmBw9HI/w41up5vosSkNAf1qqbL9lJxfx11ms5t7s/11gYg+xSH61NUI
 gBE4GyJSq91p161F4ql+dJqYrU+gAIY9zKVSAqB97z9D3d01tZkr4LGNjbqtu7Kb
 3d+vjiKfMda09X16US3nx9PaxikfQn5IB8JA9mpWgI+Q7H6R9Ri+rQxnv/ghpEPc
 U9BvK9p7+zYu6dyNUZYbGCsHAQ3WFoatJPO+JTxXllJ99ORrN85WfvFMWZq49f3k
 XxYMbECcLJfsYUJycKcPJJfGFLfxw2cDfJmzNJEvzX9KK6ObZxSeYeVHjrdC8XwA
 WZlt1zNObE2IyH3pkqSyxKubnpu4Z0UdxDdIeowsdI9iD7oGlhtfzhTVa+JbxnuY
 HHtHIKweeyYeUTRIKe1w/ZE24LQjF0fpy9M2ZGxzy6YQTqFsjGNzsfcPBWWFOXqp
 XGTgaKoIqA+1ov2nVGk8j1BOTHKqYx4gwhb5Y58kX8hvHQGr6d3eoyMKui7wPT5f
 9RjU/9+eZ1DT8LLnUZFNJHIrhwIUCR/wEY1Y9gPi58RE8Wj9TlY=
 =3r9Y
 -----END PGP SIGNATURE-----

Merge 5.10.111 into android12-5.10-lts

Changes in 5.10.111
	ubifs: Rectify space amount budget for mkdir/tmpfile operations
	gfs2: Check for active reservation in gfs2_release
	gfs2: Fix gfs2_release for non-writers regression
	gfs2: gfs2_setattr_size error path fix
	rtc: wm8350: Handle error for wm8350_register_irq
	KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs
	KVM: x86/emulator: Emulate RDPID only if it is enabled in guest
	drm: Add orientation quirk for GPD Win Max
	ath5k: fix OOB in ath5k_eeprom_read_pcal_info_5111
	drm/amd/display: Add signal type check when verify stream backends same
	drm/amd/amdgpu/amdgpu_cs: fix refcount leak of a dma_fence obj
	usb: gadget: tegra-xudc: Do not program SPARAM
	usb: gadget: tegra-xudc: Fix control endpoint's definitions
	ptp: replace snprintf with sysfs_emit
	powerpc: dts: t104xrdb: fix phy type for FMAN 4/5
	ath11k: fix kernel panic during unload/load ath11k modules
	ath11k: mhi: use mhi_sync_power_up()
	bpf: Make dst_port field in struct bpf_sock 16-bit wide
	scsi: mvsas: Replace snprintf() with sysfs_emit()
	scsi: bfa: Replace snprintf() with sysfs_emit()
	power: supply: axp20x_battery: properly report current when discharging
	mt76: dma: initialize skip_unmap in mt76_dma_rx_fill
	cfg80211: don't add non transmitted BSS to 6GHz scanned channels
	libbpf: Fix build issue with llvm-readelf
	ipv6: make mc_forwarding atomic
	powerpc: Set crashkernel offset to mid of RMA region
	drm/amdgpu: Fix recursive locking warning
	PCI: aardvark: Fix support for MSI interrupts
	iommu/arm-smmu-v3: fix event handling soft lockup
	usb: ehci: add pci device support for Aspeed platforms
	PCI: endpoint: Fix alignment fault error in copy tests
	tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH.
	PCI: pciehp: Add Qualcomm quirk for Command Completed erratum
	power: supply: axp288-charger: Set Vhold to 4.4V
	iwlwifi: mvm: Correctly set fragmented EBS
	ipv4: Invalidate neighbour for broadcast address upon address addition
	dm ioctl: prevent potential spectre v1 gadget
	dm: requeue IO if mapping table not yet available
	drm/amdkfd: make CRAT table missing message informational only
	scsi: pm8001: Fix pm80xx_pci_mem_copy() interface
	scsi: pm8001: Fix pm8001_mpi_task_abort_resp()
	scsi: pm8001: Fix task leak in pm8001_send_abort_all()
	scsi: pm8001: Fix tag leaks on error
	scsi: pm8001: Fix memory leak in pm8001_chip_fw_flash_update_req()
	mt76: mt7615: Fix assigning negative values to unsigned variable
	scsi: aha152x: Fix aha152x_setup() __setup handler return value
	scsi: hisi_sas: Free irq vectors in order for v3 HW
	net/smc: correct settings of RMB window update limit
	mips: ralink: fix a refcount leak in ill_acc_of_setup()
	macvtap: advertise link netns via netlink
	tuntap: add sanity checks about msg_controllen in sendmsg
	Bluetooth: Fix not checking for valid hdev on bt_dev_{info,warn,err,dbg}
	Bluetooth: use memset avoid memory leaks
	bnxt_en: Eliminate unintended link toggle during FW reset
	PCI: endpoint: Fix misused goto label
	MIPS: fix fortify panic when copying asm exception handlers
	powerpc/secvar: fix refcount leak in format_show()
	scsi: libfc: Fix use after free in fc_exch_abts_resp()
	can: isotp: set default value for N_As to 50 micro seconds
	net: account alternate interface name memory
	net: limit altnames to 64k total
	net: sfp: add 2500base-X quirk for Lantech SFP module
	usb: dwc3: omap: fix "unbalanced disables for smps10_out1" on omap5evm
	xtensa: fix DTC warning unit_address_format
	MIPS: ingenic: correct unit node address
	Bluetooth: Fix use after free in hci_send_acl
	netlabel: fix out-of-bounds memory accesses
	ceph: fix memory leak in ceph_readdir when note_last_dentry returns error
	init/main.c: return 1 from handled __setup() functions
	minix: fix bug when opening a file with O_DIRECT
	clk: si5341: fix reported clk_rate when output divider is 2
	staging: vchiq_core: handle NULL result of find_service_by_handle
	phy: amlogic: meson8b-usb2: Use dev_err_probe()
	staging: wfx: fix an error handling in wfx_init_common()
	w1: w1_therm: fixes w1_seq for ds28ea00 sensors
	NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify()
	NFSv4: Protect the state recovery thread against direct reclaim
	xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32
	clk: ti: Preserve node in ti_dt_clocks_register()
	clk: Enforce that disjoints limits are invalid
	SUNRPC/call_alloc: async tasks mustn't block waiting for memory
	SUNRPC/xprt: async tasks mustn't block waiting for memory
	SUNRPC: remove scheduling boost for "SWAPPER" tasks.
	NFS: swap IO handling is slightly different for O_DIRECT IO
	NFS: swap-out must always use STABLE writes.
	x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy
	serial: samsung_tty: do not unlock port->lock for uart_write_wakeup()
	virtio_console: eliminate anonymous module_init & module_exit
	jfs: prevent NULL deref in diFree
	SUNRPC: Fix socket waits for write buffer space
	NFS: nfsiod should not block forever in mempool_alloc()
	NFS: Avoid writeback threads getting stuck in mempool_alloc()
	parisc: Fix CPU affinity for Lasi, WAX and Dino chips
	parisc: Fix patch code locking and flushing
	mm: fix race between MADV_FREE reclaim and blkdev direct IO read
	Revert "hv: utils: add PTP_1588_CLOCK to Kconfig to fix build"
	drm/amdgpu: fix off by one in amdgpu_gfx_kiq_acquire()
	Drivers: hv: vmbus: Fix potential crash on module unload
	Revert "NFSv4: Handle the special Linux file open access mode"
	NFSv4: fix open failure with O_ACCMODE flag
	scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one()
	net/tls: fix slab-out-of-bounds bug in decrypt_internal
	ice: Clear default forwarding VSI during VSI release
	net: ipv4: fix route with nexthop object delete warning
	net: stmmac: Fix unset max_speed difference between DT and non-DT platforms
	drm/imx: imx-ldb: Check for null pointer after calling kmemdup
	drm/imx: Fix memory leak in imx_pd_connector_get_modes
	bnxt_en: reserve space inside receive page for skb_shared_info
	sfc: Do not free an empty page_ring
	RDMA/mlx5: Don't remove cache MRs when a delay is needed
	IB/rdmavt: add lock to call to rvt_error_qp to prevent a race condition
	dpaa2-ptp: Fix refcount leak in dpaa2_ptp_probe
	ice: Set txq_teid to ICE_INVAL_TEID on ring creation
	ice: Do not skip not enabled queues in ice_vc_dis_qs_msg
	ipv6: Fix stats accounting in ip6_pkt_drop
	ice: synchronize_rcu() when terminating rings
	net: openvswitch: don't send internal clone attribute to the userspace.
	net: openvswitch: fix leak of nested actions
	rxrpc: fix a race in rxrpc_exit_net()
	net: phy: mscc-miim: reject clause 45 register accesses
	qede: confirm skb is allocated before using
	spi: bcm-qspi: fix MSPI only access with bcm_qspi_exec_mem_op()
	bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
	drbd: Fix five use after free bugs in get_initial_state
	io_uring: don't touch scm_fp_list after queueing skb
	SUNRPC: Handle ENOMEM in call_transmit_status()
	SUNRPC: Handle low memory situations in call_status()
	SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec()
	iommu/omap: Fix regression in probe for NULL pointer dereference
	perf: arm-spe: Fix perf report --mem-mode
	perf tools: Fix perf's libperf_print callback
	perf session: Remap buf if there is no space for event
	arm64: Add part number for Arm Cortex-A78AE
	Revert "mmc: sdhci-xenon: fix annoying 1.8V regulator warning"
	mmc: mmci: stm32: correctly check all elements of sg list
	mmc: renesas_sdhi: don't overwrite TAP settings when HS400 tuning is complete
	lz4: fix LZ4_decompress_safe_partial read out of bound
	mmmremap.c: avoid pointless invalidate_range_start/end on mremap(old_size=0)
	mm/mempolicy: fix mpol_new leak in shared_policy_replace
	io_uring: fix race between timeout flush and removal
	x86/pm: Save the MSR validity status at context setup
	x86/speculation: Restore speculation related MSRs during S3 resume
	btrfs: fix qgroup reserve overflow the qgroup limit
	btrfs: prevent subvol with swapfile from being deleted
	arm64: patch_text: Fixup last cpu should be master
	RDMA/hfi1: Fix use-after-free bug for mm struct
	gpio: Restrict usage of GPIO chip irq members before initialization
	ata: sata_dwc_460ex: Fix crash due to OOB write
	perf: qcom_l2_pmu: fix an incorrect NULL check on list iterator
	irqchip/gic-v3: Fix GICR_CTLR.RWP polling
	drm/amdgpu/smu10: fix SoC/fclk units in auto mode
	drm/nouveau/pmu: Add missing callbacks for Tegra devices
	drm/amdkfd: Create file descriptor after client is added to smi_clients list
	perf build: Don't use -ffat-lto-objects in the python feature test when building with clang-13
	perf python: Fix probing for some clang command line options
	tools build: Filter out options and warnings not supported by clang
	tools build: Use $(shell ) instead of `` to get embedded libperl's ccopts
	dmaengine: Revert "dmaengine: shdma: Fix runtime PM imbalance on error"
	ubsan: remove CONFIG_UBSAN_OBJECT_SIZE
	mm: don't skip swap entry even if zap_details specified
	cgroup: Use open-time credentials for process migraton perm checks
	selftests/cgroup: Fix build on older distros
	selftests: cgroup: Make cg_create() use 0755 for permission instead of 0644
	selftests: cgroup: Test open-time credential usage for migration checks
	selftests: cgroup: Test open-time cgroup namespace usage for migration checks
	arm64: module: remove (NOLOAD) from linker script
	Drivers: hv: vmbus: Replace smp_store_mb() with virt_store_mb()
	irqchip/gic, gic-v3: Prevent GSI to SGI translations
	mm/sparsemem: fix 'mem_section' will never be NULL gcc 12 warning
	powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit
	Linux 5.10.111

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I9b4c1d30ae226b865494df03d871db2a2b9281c7
2022-04-21 14:27:41 +02:00
Mauricio Faria de Oliveira
3c3fbfa6dd mm: fix race between MADV_FREE reclaim and blkdev direct IO read
commit 6c8e2a256915a223f6289f651d6b926cd7135c9e upstream.

Problem:
=======

Userspace might read the zero-page instead of actual data from a direct IO
read on a block device if the buffers have been called madvise(MADV_FREE)
on earlier (this is discussed below) due to a race between page reclaim on
MADV_FREE and blkdev direct IO read.

- Race condition:
  ==============

During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
if the page is dirty).

However, after try_to_unmap_one() returns to shrink_page_list(), it might
keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
_one_ page reference, from the isolation for page reclaim).

Well, blkdev_direct_IO() gets references for all pages, and on READ
operations it only sets them dirty _later_.

So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
IO read from block devices, and page reclaim happens during
__blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
returns, but BEFORE the pages are set dirty, the situation happens.

The direct IO read eventually completes.  Now, when userspace reads the
buffers, the PTE is no longer there and the page fault handler
do_anonymous_page() services that with the zero-page, NOT the data!

A synthetic reproducer is provided.

- Page faults:
  ===========

If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
happen, because that faults-in all pages as writeable, so
do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
direct IO.  The userspace reads don't fault as the PTE is there (thus
zero-page is not used/setup).

But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
is no longer there; the subsequent page faults can't help:

The data-read from the block device probably won't generate faults due to
DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
different virtual addresses (not user-mapped addresses) because `struct
bio_vec` stores `struct page` to figure addresses out (which are different
from user-mapped addresses) for the read.

Thus userspace reads (to user-mapped addresses) still fault, then
do_anonymous_page() gets another `struct page` that would address/ map to
other memory than the `struct page` used by `struct bio_vec` for the read.
(The original `struct page` is not available, since it wasn't freed, as
page_ref_freeze() failed due to more page refs.  And even if it were
available, its data cannot be trusted anymore.)

Solution:
========

One solution is to check for the expected page reference count in
try_to_unmap_one().

There should be one reference from the isolation (that is also checked in
shrink_page_list() with page_ref_freeze()) plus one or more references
from page mapping(s) (put in discard: label).  Further references mean
that rmap/PTE cannot be unmapped/nuked.

(Note: there might be more than one reference from mapping due to
fork()/clone() without CLONE_VM, which use the same `struct page` for
references, until the copy-on-write page gets copied.)

So, additional page references (e.g., from direct IO read) now prevent the
rmap/PTE from being unmapped/dropped; similarly to the page is not freed
per shrink_page_list()/page_ref_freeze()).

- Races and Barriers:
  ==================

The new check in try_to_unmap_one() should be safe in races with
bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
done under the PTE lock.

The fast path doesn't take the lock, but it checks if the PTE has changed
and if so, it drops the reference and leaves the page for the slow path
(which does take that lock).

The fast path requires synchronization w/ full memory barrier: it writes
the page reference count first then it reads the PTE later, while
try_to_unmap() writes PTE first then it reads page refcount.

And a second barrier is needed, as the page dirty flag should not be read
before the page reference count (as in __remove_mapping()).  (This can be
a load memory barrier only; no writes are involved.)

Call stack/comments:

- try_to_unmap_one()
  - page_vma_mapped_walk()
    - map_pte()			# see pte_offset_map_lock():
        pte_offset_map()
        spin_lock()

  - ptep_get_and_clear()	# write PTE
  - smp_mb()			# (new barrier) GUP fast path
  - page_ref_count()		# (new check) read refcount

  - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
      pte_unmap()
      spin_unlock()

- bio_iov_iter_get_pages()
  - __bio_iov_iter_get_pages()
    - iov_iter_get_pages()
      - get_user_pages_fast()
        - internal_get_user_pages_fast()

          # fast path
          - lockless_pages_from_mm()
            - gup_{pgd,p4d,pud,pmd,pte}_range()
                ptep = pte_offset_map()		# not _lock()
                pte = ptep_get_lockless(ptep)

                page = pte_page(pte)
                try_grab_compound_head(page)	# inc refcount
                                            	# (RMW/barrier
                                             	#  on success)

                if (pte_val(pte) != pte_val(*ptep)) # read PTE
                        put_compound_head(page) # dec refcount
                        			# go slow path

          # slow path
          - __gup_longterm_unlocked()
            - get_user_pages_unlocked()
              - __get_user_pages_locked()
                - __get_user_pages()
                  - follow_{page,p4d,pud,pmd}_mask()
                    - follow_page_pte()
                        ptep = pte_offset_map_lock()
                        pte = *ptep
                        page = vm_normal_page(pte)
                        try_grab_page(page)	# inc refcount
                        pte_unmap_unlock()

- Huge Pages:
  ==========

Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
(aka lazyfree) pages are PageAnon() && !PageSwapBacked()
(madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
thus should reach shrink_page_list() -> split_huge_page_to_list() before
try_to_unmap[_one](), so it deals with normal pages only.

(And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
which should not or be rare, the page refcount should be greater than
mapcount: the head page is referenced by tail pages.  That also prevents
checking the head `page` then incorrectly call page_remove_rmap(subpage)
for a tail page, that isn't even in the shrink_page_list()'s page_list (an
effect of split huge pmd/pmvw), as it might happen today in this unlikely
scenario.)

MADV_FREE'd buffers:
===================

So, back to the "if MADV_FREE pages are used as buffers" note.  The case
is arguable, and subject to multiple interpretations.

The madvise(2) manual page on the MADV_FREE advice value says:

1) 'After a successful MADV_FREE ... data will be lost when
   the kernel frees the pages.'
2) 'the free operation will be canceled if the caller writes
   into the page' / 'subsequent writes ... will succeed and
   then [the] kernel cannot free those dirtied pages'
3) 'If there is no subsequent write, the kernel can free the
   pages at any time.'

Thoughts, questions, considerations... respectively:

1) Since the kernel didn't actually free the page (page_ref_freeze()
   failed), should the data not have been lost? (on userspace read.)
2) Should writes performed by the direct IO read be able to cancel
   the free operation?
   - Should the direct IO read be considered as 'the caller' too,
     as it's been requested by 'the caller'?
   - Should the bio technique to dirty pages on return to userspace
     (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
     be considered in another/special way here?
3) Should an upcoming write from a previously requested direct IO
   read be considered as a subsequent write, so the kernel should
   not free the pages? (as it's known at the time of page reclaim.)

And lastly:

Technically, the last point would seem a reasonable consideration and
balance, as the madvise(2) manual page apparently (and fairly) seem to
assume that 'writes' are memory access from the userspace process (not
explicitly considering writes from the kernel or its corner cases; again,
fairly)..  plus the kernel fix implementation for the corner case of the
largely 'non-atomic write' encompassed by a direct IO read operation, is
relatively simple; and it helps.

Reproducer:
==========

@ test.c (simplified, but works)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main() {
		int fd, i;
		char *buf;

		fd = open(DEV, O_RDONLY | O_DIRECT);

		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
			buf[i] = 1; // init to non-zero

		madvise(buf, BUF_SIZE, MADV_FREE);

		read(fd, buf, BUF_SIZE);

		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
			printf("%p: 0x%x\n", &buf[i], buf[i]);

		return 0;
	}

@ block/fops.c (formerly fs/block_dev.c)

	+#include <linux/swap.h>
	...
	... __blkdev_direct_IO[_simple](...)
	{
	...
	+	if (!strcmp(current->comm, "good"))
	+		shrink_all_memory(ULONG_MAX);
	+
         	ret = bio_iov_iter_get_pages(...);
	+
	+	if (!strcmp(current->comm, "bad"))
	+		shrink_all_memory(ULONG_MAX);
	...
	}

@ shell

        # NUM_PAGES=4
        # PAGE_SIZE=$(getconf PAGE_SIZE)

        # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
        # DEV=$(losetup -f --show test.img)

        # gcc -DDEV=\"$DEV\" \
              -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
              -DPAGE_SIZE=${PAGE_SIZE} \
               test.c -o test

        # od -tx1 $DEV
        0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
        *
        0040000

        # mv test good
        # ./good
        0x7f7c10418000: 0x79
        0x7f7c10419000: 0x79
        0x7f7c1041a000: 0x79
        0x7f7c1041b000: 0x79

        # mv good bad
        # ./bad
        0x7fa1b8050000: 0x0
        0x7fa1b8051000: 0x0
        0x7fa1b8052000: 0x0
        0x7fa1b8053000: 0x0

Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
support of MADV_FREE on v4.5 (60%-70% error; needs swap).  [wrap
do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].

- v5.17-rc3:

        # for i in {1..1000}; do ./good; done \
            | cut -d: -f2 | sort | uniq -c
           4000  0x79

        # mv good bad
        # for i in {1..1000}; do ./bad; done \
            | cut -d: -f2 | sort | uniq -c
           4000  0x0

        # free | grep Swap
        Swap:             0           0           0

- v4.5:

        # for i in {1..1000}; do ./good; done \
            | cut -d: -f2 | sort | uniq -c
           4000  0x79

        # mv good bad
        # for i in {1..1000}; do ./bad; done \
            | cut -d: -f2 | sort | uniq -c
           2702  0x0
           1298  0x79

        # swapoff -av
        swapoff /swap

        # for i in {1..1000}; do ./bad; done \
            | cut -d: -f2 | sort | uniq -c
           4000  0x79

Ceph/TCMalloc:
=============

For documentation purposes, the use case driving the analysis/fix is Ceph
on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
release unused memory to the system from the mmap'ed page heap (might be
committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
-> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
TCMalloc_SystemCommit() -> do nothing.

Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
just 'disappeared' on Ceph on later Ubuntu releases but is still present
in the kernel, and can be hit by other use cases.

The observed issue seems to be the old Ceph bug #22464 [1], where checksum
mismatches are observed (and instrumentation with buffer dumps shows
zero-pages read from mmap'ed/MADV_FREE'd page ranges).

The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
mostly worked around with a retry mechanism, but other parts of Ceph could
still hit that (rocksdb).  Anyway, it's less likely to be hit again as
TCMalloc switched out of MADV_FREE by default.

(Some kernel versions/reports from the Ceph bug, and relation with
the MADV_FREE introduction/changes; TCMalloc versions not checked.)
- 4.4 good
- 4.5 (madv_free: introduction)
- 4.9 bad
- 4.10 good? maybe a swapless system
- 4.12 (madv_free: no longer free instantly on swapless systems)
- 4.13 bad

[1] https://tracker.ceph.com/issues/22464

Thanks:
======

Several people contributed to analysis/discussions/tests/reproducers in
the first stages when drilling down on ceph/tcmalloc/linux kernel:

- Dan Hill
- Dan Streetman
- Dongdong Tao
- Gavin Guo
- Gerald Yang
- Heitor Alves de Siqueira
- Ioanna Alifieraki
- Jay Vosburgh
- Matthew Ruffell
- Ponnuvel Palaniyappan

Reviews, suggestions, corrections, comments:

- Minchan Kim
- Yu Zhao
- Huang, Ying
- John Hubbard
- Christoph Hellwig

[mfo@canonical.com: v4]
  Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com

Fixes: 802a3a92ad ("mm: reclaim MADV_FREE pages")
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Dan Hill <daniel.hill@canonical.com>
Cc: Dan Streetman <dan.streetman@canonical.com>
Cc: Dongdong Tao <dongdong.tao@canonical.com>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Gerald Yang <gerald.yang@canonical.com>
Cc: Heitor Alves de Siqueira <halves@canonical.com>
Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
Cc: <stable@vger.kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[mfo: backport: replace folio/test_flag with page/flag equivalents;
 real Fixes: 854e9ed09d ("mm: support madvise(MADV_FREE)") in v4.]
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-04-13 21:01:03 +02:00
Greg Kroah-Hartman
639159686b Merge branch 'android12-5.10' into android12-5.10-lts
Sync up with android12-5.10 for the following commits:

6c3417436a ANDROID: kernel: fix module info for debug_kinfo
774f1bd29c ANDROID: Disable CFI on restricted vendor hooks in TRACE_HEADER_MULTI_READ
90c60a51f5 UPSTREAM: f2fs: guarantee to write dirty data when enabling checkpoint back
8cf5bb6946 UPSTREAM: mm: memblock: fix section mismatch warning again
f00a543047 FROMLIST: usb: gadget: u_audio: EP-OUT bInterval in fback frequency
3f26745cae FROMLIST: usb: gadget: f_uac2: fixed EP-IN wMaxPacketSize
ab9ceb4334 FROMLIST: usb: gadget: f_uac2: Add missing companion descriptor for feedback EP
35afadf0da FROMLIST: thermal: Fix a NULL pointer dereference
3d371f087c UPSTREAM: usb: gadget: f_uac2: fixup feedback endpoint stop
406a51b486 UPSTREAM: usb: gadget: u_audio: add real feedback implementation
d33287acf3 UPSTREAM: usb: gadget: f_uac2: add adaptive sync support for capture
c71892dd9e UPSTREAM: usb: gadget: f_uac2/u_audio: add feedback endpoint support
a844dfbbcb UPSTREAM: usb: gadget: u_audio: convert to strscpy
955f917251 ANDROID: vendor_hooks: Add hook in try_to_unmap_one()
878e0caa77 ANDROID: vendor_hooks: Add hook in mmap_region()
b0778aaff4 ANDROID: vendor_hooks: export android_vh_kmalloc_slab
94fbab9d6c ANDROID: vendor_hooks: Add hook in kmalloc_slab()
73839b71c8 FROMGIT: usb: dwc3: Decouple USB 2.0 L1 & L2 events
2c68b9071d ANDROID: GKI: update xiaomi symbol list
8da32d526d ANDROID: cfi: explicitly clear diag in __cfi_slowpath

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I71c4d03d373d9cef0c467ceea52b62ada755e701
2021-09-13 18:12:07 +02:00
Jiewen Wang
955f917251 ANDROID: vendor_hooks: Add hook in try_to_unmap_one()
Add hook in try_to_unmap_one() to trace this function for debug memory
swap bugs.

Bug: 198385827
Change-Id: I1fdbe60e09bb491b949e06a07133710453ecca03
Signed-off-by: Jiewen Wang <jiewen.wang@vivo.com>
2021-09-06 17:00:04 +08:00
Greg Kroah-Hartman
194be71cc6 Linux 5.10.47
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE4n5dijQDou9mhzu83qZv95d3LNwFAmDcbDgACgkQ3qZv95d3
 LNwUuQ//VDlmBPk/3w1FYvg9N9q/t1GHkVJXmD8TY/ClLdJtgxPYeoRu1VNLR/xf
 Y2kwZEF07yMA88RME56Zwt3p+LBbacrp5MoNdzEA48kb7auGBPk1HIscBg2PXC+C
 AnlC/O4/NAW6Okb+lLFL7XFM4xrlDBkNr5yTz2HmQSQC3JFfov0FcrON3KKTL5Bi
 aeyWjhn1NnhkKCDaKUl7kKlCQ2x7buu/YmvJK2OdGmuLZVywzto76RS+Xx3X0CnK
 pPkmciZSS7Gxi6UJel/zza0UwlKg5+IhFzfYVt0nsTFMjk/4QStAoevu7TFnbVD3
 yM7nzJpAQmVLs/X9sC0rgg9rCBUyp9d4ddba8bCUqaxpPQfMObWI8S5F8tfzBnK/
 h8P5xfs8u4O1AzpRr+YSN2Hbvak47e/4c5UOvvYj6z8NaIEb7DGbaUv/JE5YRVZ0
 ZEVZ1auEpHcSVAGz6DUwuwzc8Rk0NdskES5DD53QxNLXDoF1CVdsD44xUuJH1/HA
 //3S1SWxvwF9UQ+w+sk/Z6pzUj+CdFigou3QzwB+vAZ04n0JawRSuqyijkCqFOP5
 88iCMgxZ5qAYJ1TzQ6gV7cA0tSbteBF/HERNmdadyGvse9KtxkaBAfFmvIpd9D8J
 fTepYVuneP9cGqaGDUtsqHzM5YLrIkCSxABjgIvpNTt0eJL5c2k=
 =TKhf
 -----END PGP SIGNATURE-----

Merge 5.10.47 into android12-5.10-lts

Changes in 5.10.47
	module: limit enabling module.sig_enforce
	Revert "drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue."
	Revert "drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell."
	drm: add a locked version of drm_is_current_master
	drm/nouveau: wait for moving fence after pinning v2
	drm/radeon: wait for moving fence after pinning
	drm/amdgpu: wait for moving fence after pinning
	ARM: 9081/1: fix gcc-10 thumb2-kernel regression
	mmc: meson-gx: use memcpy_to/fromio for dram-access-quirk
	MIPS: generic: Update node names to avoid unit addresses
	arm64: Ignore any DMA offsets in the max_zone_phys() calculation
	arm64: Force NO_BLOCK_MAPPINGS if crashkernel reservation is required
	spi: spi-nxp-fspi: move the register operation after the clock enable
	Revert "PCI: PM: Do not read power state in pci_enable_device_flags()"
	drm/vc4: hdmi: Move the HSM clock enable to runtime_pm
	drm/vc4: hdmi: Make sure the controller is powered in detect
	x86/entry: Fix noinstr fail in __do_fast_syscall_32()
	x86/xen: Fix noinstr fail in exc_xen_unknown_trap()
	locking/lockdep: Improve noinstr vs errors
	perf/x86/lbr: Remove cpuc->lbr_xsave allocation from atomic context
	perf/x86/intel/lbr: Zero the xstate buffer on allocation
	dmaengine: zynqmp_dma: Fix PM reference leak in zynqmp_dma_alloc_chan_resourc()
	dmaengine: stm32-mdma: fix PM reference leak in stm32_mdma_alloc_chan_resourc()
	dmaengine: xilinx: dpdma: Add missing dependencies to Kconfig
	dmaengine: xilinx: dpdma: Limit descriptor IDs to 16 bits
	mac80211: remove warning in ieee80211_get_sband()
	mac80211_hwsim: drop pending frames on stop
	cfg80211: call cfg80211_leave_ocb when switching away from OCB
	dmaengine: rcar-dmac: Fix PM reference leak in rcar_dmac_probe()
	dmaengine: mediatek: free the proper desc in desc_free handler
	dmaengine: mediatek: do not issue a new desc if one is still current
	dmaengine: mediatek: use GFP_NOWAIT instead of GFP_ATOMIC in prep_dma
	net: ipv4: Remove unneed BUG() function
	mac80211: drop multicast fragments
	net: ethtool: clear heap allocations for ethtool function
	inet: annotate data race in inet_send_prepare() and inet_dgram_connect()
	ping: Check return value of function 'ping_queue_rcv_skb'
	net: annotate data race in sock_error()
	inet: annotate date races around sk->sk_txhash
	net/packet: annotate data race in packet_sendmsg()
	net: phy: dp83867: perform soft reset and retain established link
	riscv32: Use medany C model for modules
	net: caif: fix memory leak in ldisc_open
	net/packet: annotate accesses to po->bind
	net/packet: annotate accesses to po->ifindex
	r8152: Avoid memcpy() over-reading of ETH_SS_STATS
	sh_eth: Avoid memcpy() over-reading of ETH_SS_STATS
	r8169: Avoid memcpy() over-reading of ETH_SS_STATS
	KVM: selftests: Fix kvm_check_cap() assertion
	net: qed: Fix memcpy() overflow of qed_dcbx_params()
	mac80211: reset profile_periodicity/ema_ap
	mac80211: handle various extensible elements correctly
	recordmcount: Correct st_shndx handling
	PCI: Add AMD RS690 quirk to enable 64-bit DMA
	net: ll_temac: Add memory-barriers for TX BD access
	net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY
	perf/x86: Track pmu in per-CPU cpu_hw_events
	pinctrl: stm32: fix the reported number of GPIO lines per bank
	i2c: i801: Ensure that SMBHSTSTS_INUSE_STS is cleared when leaving i801_access
	gpiolib: cdev: zero padding during conversion to gpioline_info_changed
	scsi: sd: Call sd_revalidate_disk() for ioctl(BLKRRPART)
	nilfs2: fix memory leak in nilfs_sysfs_delete_device_group
	s390/stack: fix possible register corruption with stack switch helper
	KVM: do not allow mapping valid but non-reference-counted pages
	i2c: robotfuzz-osif: fix control-request directions
	ceph: must hold snap_rwsem when filling inode for async create
	kthread_worker: split code for canceling the delayed work timer
	kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
	x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate()
	x86/fpu: Make init_fpstate correct with optimized XSAVE
	mm: add VM_WARN_ON_ONCE_PAGE() macro
	mm/rmap: remove unneeded semicolon in page_not_mapped()
	mm/rmap: use page_not_mapped in try_to_unmap()
	mm, thp: use head page in __migration_entry_wait()
	mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
	mm/thp: make is_huge_zero_pmd() safe and quicker
	mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
	mm/thp: fix vma_address() if virtual address below file offset
	mm/thp: fix page_address_in_vma() on file THP tails
	mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
	mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
	mm: page_vma_mapped_walk(): use page for pvmw->page
	mm: page_vma_mapped_walk(): settle PageHuge on entry
	mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd
	mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block
	mm: page_vma_mapped_walk(): crossing page table boundary
	mm: page_vma_mapped_walk(): add a level of indentation
	mm: page_vma_mapped_walk(): use goto instead of while (1)
	mm: page_vma_mapped_walk(): get vma_address_end() earlier
	mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
	mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()
	mm, futex: fix shared futex pgoff on shmem huge page
	KVM: SVM: Call SEV Guest Decommission if ASID binding fails
	swiotlb: manipulate orig_addr when tlb_addr has offset
	netfs: fix test for whether we can skip read when writing beyond EOF
	Revert "drm: add a locked version of drm_is_current_master"
	certs: Add EFI_CERT_X509_GUID support for dbx entries
	certs: Move load_system_certificate_list to a common function
	certs: Add ability to preload revocation certs
	integrity: Load mokx variables into the blacklist keyring
	Linux 5.10.47

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I68f731ad78a5db003c41093e4faf59f6f9f2e446
2021-06-30 19:38:46 +02:00
Jue Wang
38cda6b5ab mm/thp: fix page_address_in_vma() on file THP tails
commit 31657170deaf1d8d2f6a1955fbc6fa9d228be036 upstream.

Anon THP tails were already supported, but memory-failure may need to
use page_address_in_vma() on file THP tails, which its page->mapping
check did not permit: fix it.

hughd adds: no current usage is known to hit the issue, but this does
fix a subtle trap in a general helper: best fixed in stable sooner than
later.

Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
Fixes: 800d8c63b2 ("shmem: add huge pages support")
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:27 -04:00
Hugh Dickins
37ffe9f4d7 mm/thp: fix vma_address() if virtual address below file offset
commit 494334e43c16d63b878536a26505397fce6ff3a2 upstream.

Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

When that BUG() was changed to a WARN(), it would later crash on the
VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
mm/internal.h:vma_address(), used by rmap_walk_file() for
try_to_unmap().

vma_address() is usually correct, but there's a wraparound case when the
vm_start address is unusually low, but vm_pgoff not so low:
vma_address() chooses max(start, vma->vm_start), but that decides on the
wrong address, because start has become almost ULONG_MAX.

Rewrite vma_address() to be more careful about vm_pgoff; move the
VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
be safely used from page_mapped_in_vma() and page_address_in_vma() too.

Add vma_address_end() to apply similar care to end address calculation,
in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
though it raises a question of whether callers would do better to supply
pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.

An irritation is that their apparent generality breaks down on KSM
pages, which cannot be located by the page->index that page_to_pgoff()
uses: as commit 4b0ece6fa0 ("mm: migrate: fix remove_migration_pte()
for ksm pages") once discovered.  I dithered over the best thing to do
about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both
vma_address() and vma_address_end(); though the only place in danger of
using it on them was try_to_unmap_one().

Sidenote: vma_address() and vma_address_end() now use compound_nr() on a
head page, instead of thp_size(): to make the right calculation on a
hugetlbfs page, whether or not THPs are configured.  try_to_unmap() is
used on hugetlbfs pages, but perhaps the wrong calculation never
mattered.

Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
Fixes: a8fa41ad2f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:27 -04:00
Hugh Dickins
66be14a926 mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
commit 732ed55823fc3ad998d43b86bf771887bcc5ec67 upstream.

Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
(!unmap_success): with dump_page() showing mapcount:1, but then its raw
struct page output showing _mapcount ffffffff i.e.  mapcount 0.

And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
all indicative of some mapcount difficulty in development here perhaps.
But the !CONFIG_DEBUG_VM path handles the failures correctly and
silently.

I believe the problem is that once a racing unmap has cleared pte or
pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
from try_to_unmap() before the racing task has reached decrementing
mapcount.

Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
TTU_SYNC to the options, and passing that from unmap_page().

When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
for both: the slight overhead added should rarely matter, except perhaps
if splitting sparsely-populated multiply-mapped shmem.  Once confident
that bugs are fixed, TTU_SYNC here can be removed, and the race
tolerated.

Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
Fixes: fec89c109f ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:27 -04:00
Miaohe Lin
bfd90b56d7 mm/rmap: use page_not_mapped in try_to_unmap()
[ Upstream commit b7e188ec98b1644ff70a6d3624ea16aadc39f5e0 ]

page_mapcount_is_zero() calculates accurately how many mappings a hugepage
has in order to check against 0 only.  This is a waste of cpu time.  We
can do this via page_not_mapped() to save some possible atomic_read
cycles.  Remove the function page_mapcount_is_zero() as it's not used
anymore and move page_not_mapped() above try_to_unmap() to avoid
identifier undeclared compilation error.

Link: https://lkml.kernel.org/r/20210130084904.35307-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:26 -04:00
Miaohe Lin
ff81af8259 mm/rmap: remove unneeded semicolon in page_not_mapped()
[ Upstream commit e0af87ff7afcde2660be44302836d2d5618185af ]

Remove extra semicolon without any functional change intended.

Link: https://lkml.kernel.org/r/20210127093425.39640-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-30 08:47:26 -04:00
Laurent Dufour
a1dbf20e8e FROMLIST: mm: introduce __page_add_new_anon_rmap()
When dealing with speculative page fault handler, we may race with VMA
being split or merged. In this case the vma->vm_start and vm->vm_end
fields may not match the address the page fault is occurring.

This can only happens when the VMA is split but in that case, the
anon_vma pointer of the new VMA will be the same as the original one,
because in __split_vma the new->anon_vma is set to src->anon_vma when
*new = *vma.

So even if the VMA boundaries are not correct, the anon_vma pointer is
still valid.

If the VMA has been merged, then the VMA in which it has been merged
must have the same anon_vma pointer otherwise the merge can't be done.

So in all the case we know that the anon_vma is valid, since we have
checked before starting the speculative page fault that the anon_vma
pointer is valid for this VMA and since there is an anon_vma this
means that at one time a page has been backed and that before the VMA
is cleaned, the page table lock would have to be grab to clean the
PTE, and the anon_vma field is checked once the PTE is locked.

This patch introduce a new __page_add_new_anon_rmap() service which
doesn't check for the VMA boundaries, and create a new inline one
which do the check.

When called from a page fault handler, if this is not a speculative one,
there is a guarantee that vm_start and vm_end match the faulting address,
so this check is useless. In the context of the speculative page fault
handler, this check may be wrong but anon_vma is still valid as explained
above.

Change-Id: I72c47830181579f8c9618df879077d321653b5f1
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Link: https://lore.kernel.org/lkml/1523975611-15978-17-git-send-email-ldufour@linux.vnet.ibm.com/
Bug: 161210518
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2021-01-22 18:00:48 +00:00
Shakeel Butt
dd156e3fca mm/rmap: always do TTU_IGNORE_ACCESS
[ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]

Since commit 369ea8242c ("mm/rmap: update to new mmu_notifier semantic
v2"), the code to check the secondary MMU's page table access bit is
broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
secondary MMU's page table before the check.  More specifically for those
secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start() like kvm.

However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the
absence of TTU_IGNORE_ACCESS and it explicitly performs the page table
access check before trying to unmap the page.  So, at worst the reclaim
will miss accesses in a very short window if we remove page table access
check in unmapping code.

There is an unintented consequence of !(TTU_IGNORE_ACCESS) for the memcg
reclaim.  From memcg reclaim the page_referenced() only account the
accesses from the processes which are in the same memcg of the target page
but the unmapping code is considering accesses from all the processes, so,
decreasing the effectiveness of memcg reclaim.

The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
code.

Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
Fixes: 369ea8242c ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:55 +01:00
Mike Kravetz
336bf30eb7 hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]

  LTP: starting move_pages12
  BUG: unable to handle page fault for address: ffffffffffffffe0
  ...
  RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
  Call Trace:
    rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
    try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
    migrate_pages+0x1005/0x1fb0
    move_pages_and_store_status.isra.47+0xd7/0x1a0
    __x64_sys_move_pages+0xa5c/0x1100
    do_syscall_64+0x5f/0x310
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization.  Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem.  This is wrong for
anon pages as the anon_vma_lock should be held in this case.  Further
analysis suggested that i_mmap_rwsem was not required to he held at all
when calling try_to_unmap for anon pages as an anon page could never be
part of a shared pmd mapping.

Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong.  There is no way to
keep mapping valid while dropping page lock.

This patch does the following:

 - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
   calling try_to_unmap.

 - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
   will now simply do a 'trylock' while still holding the page lock. If
   the trylock fails, it will return NULL. This could impact the
   callers:

    - migration calling code will receive -EAGAIN and retry up to the
      hard coded limit (10).

    - memory error code will treat the page as BUSY. This will force
      killing (SIGKILL) instead of SIGBUS any mapping tasks.

   Do note that this change in behavior only happens when there is a
   race. None of the standard kernel testing suites actually hit this
   race, but it is possible.

[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

Fixes: c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-14 11:26:04 -08:00
Matthew Wilcox (Oracle)
5eaf35ab12 mm/rmap: fix assumptions of THP size
Ask the page what size it is instead of assuming it's PMD size.  Do this
for anon pages as well as file pages for when someone decides to support
that.  Leave the assumption alone for pages which are PMD mapped; we don't
currently grow THPs beyond PMD size, so we don't need to change this code
yet.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Link: https://lkml.kernel.org/r/20200908195539.25896-9-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:15 -07:00
Alistair Popple
ad7df764b7 mm/rmap: fixup copying of soft dirty and uffd ptes
During memory migration a pte is temporarily replaced with a migration
swap pte.  Some pte bits from the existing mapping such as the soft-dirty
and uffd write-protect bits are preserved by copying these to the
temporary migration swap pte.

However these bits are not stored at the same location for swap and
non-swap ptes.  Therefore testing these bits requires using the
appropriate helper function for the given pte type.

Unfortunately several code locations were found where the wrong helper
function is being used to test soft_dirty and uffd_wp bits which leads to
them getting incorrectly set or cleared during page-migration.

Fix these by using the correct tests based on pte type.

Fixes: a5430dda8a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Fixes: 8c3328f1f3 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
Fixes: f45ec5ff16 ("userfaultfd: wp: support swap and page migration")
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-05 12:14:30 -07:00
Qian Cai
9c1177b62a mm/rmap: annotate a data race at tlb_flush_batched
mm->tlb_flush_batched could be accessed concurrently as noticed by
KCSAN,

 BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one

 write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6:
  try_to_unmap_one+0x59a/0x1ab0
  set_tlb_ubc_flush_pending at mm/rmap.c:635
  (inlined by) try_to_unmap_one at mm/rmap.c:1538
  rmap_walk_anon+0x296/0x650
  rmap_walk+0xdf/0x100
  try_to_unmap+0x18a/0x2f0
  shrink_page_list+0xef6/0x2870
  shrink_inactive_list+0x316/0x880
  shrink_lruvec+0x8dc/0x1380
  shrink_node+0x317/0xd80
  balance_pgdat+0x652/0xd90
  kswapd+0x396/0x8d0
  kthread+0x1e0/0x200
  ret_from_fork+0x27/0x50

 read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4:
  flush_tlb_batched_pending+0x29/0x90
  flush_tlb_batched_pending at mm/rmap.c:682
  change_p4d_range+0x5dd/0x1030
  change_pte_range at mm/mprotect.c:44
  (inlined by) change_pmd_range at mm/mprotect.c:212
  (inlined by) change_pud_range at mm/mprotect.c:240
  (inlined by) change_p4d_range at mm/mprotect.c:260
  change_protection+0x222/0x310
  change_prot_numa+0x3e/0x60
  task_numa_work+0x219/0x350
  task_work_run+0xed/0x140
  prepare_exit_to_usermode+0x2cc/0x2e0
  ret_from_intr+0x32/0x42

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 4 PID: 6364 Comm: mtest01 Tainted: G        W    L 5.5.0-next-20200210+ #5
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

flush_tlb_batched_pending() is under PTL but the write is not, but
mm->tlb_flush_batched is only a bool type, so the value is unlikely to be
shattered.  Thus, mark it as an intentional data race by using the data
race macro.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Link: http://lkml.kernel.org/r/1581450783-8262-1-git-send-email-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14 19:56:57 -07:00
Matthew Wilcox (Oracle)
6c357848b4 mm: replace hpage_nr_pages with thp_nr_pages
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14 19:56:56 -07:00
Mike Kravetz
34ae204f18 hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem
Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
in at least read mode.  This is because the explicit locking in
huge_pmd_share (called by huge_pte_alloc) was removed.  When restructuring
the code, the call to huge_pte_alloc in the else block at the beginning of
hugetlb_fault was missed.

Unfortunately, that else clause is exercised when there is no page table
entry.  This will likely lead to a call to huge_pmd_share.  If
huge_pmd_share thinks pmd sharing is possible, it will traverse the
mapping tree (i_mmap) without holding i_mmap_rwsem.  If someone else is
modifying the tree, bad things such as addressing exceptions or worse
could happen.

Simply remove the else clause.  It should have been removed previously.
The code following the else will call huge_pte_alloc with the appropriate
locking.

To prevent this type of issue in the future, add routines to assert that
i_mmap_rwsem is held, and call these routines in huge pmd sharing
routines.

Fixes: c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A.Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
Michel Lespinasse
c1e8d7c6a7 mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Johannes Weiner
468c398233 mm: memcontrol: switch to native NR_ANON_THPS counter
With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a
small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the
native NR_ANON_THPS accounting sites.

[hannes@cmpxchg.org: fixes]
  Link: http://lkml.kernel.org/r/20200512121750.GA397968@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>	[build-tested]
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-12-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Johannes Weiner
be5d0a74c6 mm: memcontrol: switch to native NR_ANON_MAPPED counter
Memcg maintains a private MEMCG_RSS counter.  This divergence from the
generic VM accounting means unnecessary code overhead, and creates a
dependency for memcg that page->mapping is set up at the time of charging,
so that page types can be told apart.

Convert the generic accounting sites to mod_lruvec_page_state and friends
to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED.  We use
lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
same way we do for NR_FILE_MAPPED.

With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
counter, this patch finally eliminates the need to have page->mapping set
up at charge time.  However, we need to have page->mem_cgroup set up by
the time rmap runs and does the accounting, so switch the commit and the
rmap callbacks around.

v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Palmer Dabbelt
4708f31885 mm: prevent a warning when casting void* -> enum
I recently build the RISC-V port with LLVM trunk, which has introduced a
new warning when casting from a pointer to an enum of a smaller size.
This patch simply casts to a long in the middle to stop the warning.  I'd
be surprised this is the only one in the kernel, but it's the only one I
saw.

Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:41 -07:00
Peter Xu
f45ec5ff16 userfaultfd: wp: support swap and page migration
For either swap and page migration, we all use the bit 2 of the entry to
identify whether this entry is uffd write-protected.  It plays a similar
role as the existing soft dirty bit in swap entries but only for keeping
the uffd-wp tracking for a specific PTE/PMD.

Something special here is that when we want to recover the uffd-wp bit
from a swap/migration entry to the PTE bit we'll also need to take care of
the _PAGE_RW bit and make sure it's cleared, otherwise even with the
_PAGE_UFFD_WP bit we can't trap it at all.

In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
That can lead to data mismatch if the page that we are going to write
protect is swapped out when sending the UFFDIO_WRITEPROTECT.  This patch
also applies/removes the uffd-wp bit even for the swap entries.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:39 -07:00
Matthew Wilcox (Oracle)
396bcc5299 mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE
Commit e496cf3d78 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
notes that it should be reverted when the PowerPC problem was fixed.  The
commit fixing the PowerPC problem (953c66c2b2) did not revert the
commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
CONFIG_TRANSPARENT_HUGEPAGE.  Checking with Kirill and Aneesh, this was an
oversight, so remove the Kconfig symbol and undo the work of commit
e496cf3d78.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:38 -07:00
Li Xinhai
23ab76bf90 Revert "mm/rmap.c: reuse mergeable anon_vma as parent when fork"
This reverts commit 4e4a9eb921 ("mm/rmap.c: reuse mergeable
anon_vma as parent when fork").

In dup_mmap(), anon_vma_fork() is called for attaching anon_vma and
parameter 'tmp' (i.e., the new vma of child) has same ->vm_next and
->vm_prev as its parent vma.  That causes the anon_vma used by parent been
mistakenly shared by child (In anon_vma_clone(), the code added by that
commit will do this reuse work).

Besides this issue, the design of reusing anon_vma from vma which has gone
through fork should be avoided ([1]).  So, this patch reverts that commit
and maintains the consistent logic of reusing anon_vma for
fork/split/merge vma.

Reusing anon_vma within the process is fine.  But if a vma has gone
through fork(), then that vma's anon_vma should not be shared with its
neighbor vma.  As explained in [1], when vma gone through fork(), the
check for list_is_singular(vma->anon_vma_chain) will be false, and
don't share anon_vma.

With current issue, one example can clarify more.  Parent process do
below two steps:

1. p_vma_1 is created and p_anon_vma_1 is prepared;

2. p_vma_2 is created and share p_anon_vma_1; (this is allowed,
   becaues p_vma_1 didn't gothrough fork()); parent process do fork():

3. c_vma_1 is dup from p_vma_1, and has its own c_anon_vma_1
   prepared; at this point, c_vma_1->anon_vma_chain has two items, one
   for p_anon_vma_1 and one for c_anon_vma_1;

4. c_vma_2 is dup from p_vma_2, it is not allowed to share
   c_anon_vma_1, because

c_vma_1->anon_vma_chain has two items.
[1] commit d0e9fe1758 ("Simplify and comment on anon_vma re-use for
    anon_vma_prepare()") explains the test of "list_is_singular()".

Fixes: 4e4a9eb921 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork")
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1581150928-3214-3-git-send-email-lixinhai.lxh@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:37 -07:00
Mike Kravetz
c0d0381ade hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

While discussing the issue with huge_pte_offset [1], I remembered that
there were more outstanding hugetlb races.  These issues are:

1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
   invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid global
   reserve counts and state.

A previous attempt was made to use i_mmap_rwsem in this manner as
described at [2].  However, those patches were reverted starting with [3]
due to locking issues.

To effectively use i_mmap_rwsem to address the above issues it needs to be
held (in read mode) during page fault processing.  However, during fault
processing we need to lock the page we will be adding.  Lock ordering
requires we take page lock before i_mmap_rwsem.  Waiting until after
taking the page lock is too late in the fault process for the
synchronization we want to do.

To address this lock ordering issue, the following patches change the lock
ordering for hugetlb pages.  This is not too invasive as hugetlbfs
processing is done separate from core mm in many places.  However, I don't
really like this idea.  Much ugliness is contained in the new routine
hugetlb_page_mapping_lock_write() of patch 1.

The only other way I can think of to address these issues is by catching
all the races.  After catching a race, cleanup, backout, retry ...  etc,
as needed.  This can get really ugly, especially for huge page
reservations.  At one time, I started writing some of the reservation
backout code for page faults and it got so ugly and complicated I went
down the path of adding synchronization to avoid the races.  Any other
suggestions would be welcome.

[1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
[2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
[3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
[4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
[5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

This patch (of 2):

While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table.  Consider the following:

A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
shared pmd.

Now, another task truncates the hugetlbfs file.  As part of truncation, it
unmaps everyone who has the file mapped.  If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd.  If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse.  This leads to bad things such as incorrect page
map/reference counts or invalid memory references.

To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
  huge_pmd_share is only called via huge_pte_alloc, so callers of
  huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
  of huge_pte_alloc continue to hold the semaphore until finished with
  the ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

One problem with this scheme is that it requires taking i_mmap_rwsem
before taking the page lock during page faults.  This is not the order
specified in the rest of mm code.  Handling of hugetlbfs pages is mostly
isolated today.  Therefore, we use this alternative locking order for
PageHuge() pages.

         mapping->i_mmap_rwsem
           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
             page->flags PG_locked (lock_page)

To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
introduced to write lock the i_mmap_rwsem associated with a page.

In most cases it is easy to get address_space via vma->vm_file->f_mapping.
However, in the case of migration or memory errors for anon pages we do
not have an associated vma.  A new routine _get_hugetlb_page_mapping()
will use anon_vma to get address_space in these cases.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:32 -07:00
Anshuman Khandual
222100eed2 mm/vma: make is_vma_temporary_stack() available for general use
Currently the declaration and definition for is_vma_temporary_stack() are
scattered.  Lets make is_vma_temporary_stack() helper available for
general use and also drop the declaration from (include/linux/huge_mm.h)
which is no longer required.  While at this, rename this as
vma_is_temporary_stack() in line with existing helpers.  This should not
cause any functional change.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:29 -07:00
John Hubbard
47e29d32af mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
scheme tends to overflow too easily, each tail page increments the head
page->_refcount by GUP_PIN_COUNTING_BIAS (1024).  That limits the number
of huge pages that can be pinned.

This patch removes that limitation, by using an exact form of pin counting
for compound pages of order > 1.  The "order > 1" is required because this
approach uses the 3rd struct page in the compound page, and order 1
compound pages only have two pages, so that won't work there.

A new struct page field, hpage_pinned_refcount, has been added, replacing
a padding field in the union (so no new space is used).

This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_dma_pinned() comment block.  That is
because these compound pages have extra space for tracking things, so they
get exact pin counts instead of overloading page->_refcount.

Documentation/core-api/pin_user_pages.rst is updated accordingly.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:27 -07:00
Kirill A. Shutemov
f1fe80d4ae mm, thp: do not queue fully unmapped pages for deferred split
Adding fully unmapped pages into deferred split queue is not productive:
these pages are about to be freed or they are pinned and cannot be split
anyway.

Link: http://lkml.kernel.org/r/20190913091849.11151-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:09 -08:00
Yang Shi
30c4638285 mm/rmap.c: use VM_BUG_ON_PAGE() in __page_check_anon_rmap()
The __page_check_anon_rmap() just calls two BUG_ON()s protected by
CONFIG_DEBUG_VM, the #ifdef could be eliminated by using VM_BUG_ON_PAGE().

Link: http://lkml.kernel.org/r/1573157346-111316-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:19 -08:00
Miles Chen
091e429954 mm/rmap.c: fix outdated comment in page_get_anon_vma()
Replace DESTROY_BY_RCU with SLAB_TYPESAFE_BY_RCU because
SLAB_DESTROY_BY_RCU has been renamed to SLAB_TYPESAFE_BY_RCU by commit
5f0d5a3ae7 ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU")

Link: http://lkml.kernel.org/r/20191017093554.22562-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:19 -08:00
Wei Yang
4e4a9eb921 mm/rmap.c: reuse mergeable anon_vma as parent when fork
In __anon_vma_prepare(), we will try to find anon_vma if it is possible to
reuse it.  While on fork, the logic is different.

Since commit 5beb493052 ("mm: change anon_vma linking to fix
multi-process server scalability issue"), function anon_vma_clone() tries
to allocate new anon_vma for child process.  But the logic here will
allocate a new anon_vma for each vma, even in parent this vma is mergeable
and share the same anon_vma with its sibling.  This may do better for
scalability issue, while it is not necessary to do so especially after
interval tree is used.

Commit 7a3ef208e6 ("mm: prevent endless growth of anon_vma hierarchy")
tries to reuse some anon_vma by counting child anon_vma and attached vmas.
While for those mergeable anon_vmas, we can just reuse it and not
necessary to go through the logic.

After this change, kernel build test reduces 20% anon_vma allocation.

Do the same kernel build test, it shows run time in sys reduced 11.6%.

Origin:

real    2m50.467s
user    17m52.002s
sys     1m51.953s

real    2m48.662s
user    17m55.464s
sys     1m50.553s

real    2m51.143s
user    17m59.687s
sys     1m53.600s

Patched:

real	2m39.933s
user	17m1.835s
sys	1m38.802s

real	2m39.321s
user	17m1.634s
sys	1m39.206s

real	2m39.575s
user	17m1.420s
sys	1m38.845s

Link: http://lkml.kernel.org/r/20191011072256.16275-2-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:19 -08:00
Wei Yang
47b390d23b mm/rmap.c: don't reuse anon_vma if we just want a copy
Before commit 7a3ef208e6 ("mm: prevent endless growth of anon_vma
hierarchy"), anon_vma_clone() doesn't change dst->anon_vma.  While after
this commit, anon_vma_clone() will try to reuse an exist one on forking.

But this commit go a little bit further for the case not forking.
anon_vma_clone() is called from __vma_split(), __split_vma(), copy_vma()
and anon_vma_fork().  For the first three places, the purpose here is
get a copy of src and we don't expect to touch dst->anon_vma even it is
NULL.

While after that commit, it is possible to reuse an anon_vma when
dst->anon_vma is NULL.  This is not we intend to have.

This patch stops reuse of anon_vma for non-fork cases.

Link: http://lkml.kernel.org/r/20191011072256.16275-1-richardw.yang@linux.intel.com
Fixes: 7a3ef208e6 ("mm: prevent endless growth of anon_vma hierarchy")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:19 -08:00
Ben Dooks
444f84fd2a mm: include <linux/huge_mm.h> for is_vma_temporary_stack
Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack
to fix the following sparse warning:

  mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Reviewed-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19 06:32:32 -04:00
Song Liu
99cb0dbd47 mm,thp: add read-only THP support for (non-shmem) FS
This patch is (hopefully) the first step to enable THP for non-shmem
filesystems.

This patch enables an application to put part of its text sections to THP
via madvise, for example:

    madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

We tried to reuse the logic for THP on tmpfs.

Currently, write is not supported for non-shmem THP.  khugepaged will only
process vma with VM_DENYWRITE.  sys_mmap() ignores VM_DENYWRITE requests
(see ksys_mmap_pgoff).  The only way to create vma with VM_DENYWRITE is
execve().  This requirement limits non-shmem THP to text sections.

The next patch will handle writes, which would only happen when the all
the vmas with VM_DENYWRITE are unmapped.

An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
feature.

[songliubraving@fb.com: fix build without CONFIG_SHMEM]
  Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com
[songliubraving@fb.com: fix double unlock in collapse_file()]
  Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com
Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Matthew Wilcox (Oracle)
d8c6546b1a mm: introduce compound_nr()
Replace 1 << compound_order(page) with compound_nr(page).  Minor
improvements in readability.

Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:08 -07:00
Matthew Wilcox (Oracle)
a50b854e07 mm: introduce page_size()
Patch series "Make working with compound pages easier", v2.

These three patches add three helpers and convert the appropriate
places to use them.

This patch (of 3):

It's unnecessarily hard to find out the size of a potentially huge page.
Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).

Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:08 -07:00
YueHaibing
1f18b29669 mm/rmap.c: remove set but not used variable 'cstart'
Fixes gcc '-Wunused-but-set-variable' warning:

mm/rmap.c: In function page_mkclean_one:
mm/rmap.c:906:17: warning: variable cstart set but not used [-Wunused-but-set-variable]

It is not used any more since
commit cdb07bdea2 ("mm/rmap.c: remove redundant variable cend")

Link: http://lkml.kernel.org/r/20190724141453.38536-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:08 -07:00
Ralph Campbell
1de13ee592 mm/hmm: fix bad subpage pointer in try_to_unmap_one
When migrating an anonymous private page to a ZONE_DEVICE private page,
the source page->mapping and page->index fields are copied to the
destination ZONE_DEVICE struct page and the page_mapcount() is
increased.  This is so rmap_walk() can be used to unmap and migrate the
page back to system memory.

However, try_to_unmap_one() computes the subpage pointer from a swap pte
which computes an invalid page pointer and a kernel panic results such
as:

  BUG: unable to handle page fault for address: ffffea1fffffffc8

Currently, only single pages can be migrated to device private memory so
no subpage computation is needed and it can be set to "page".

[rcampbell@nvidia.com: add comment]
  Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
Fixes: a5430dda8a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13 16:06:52 -07:00
Huang Shijie
059d8442ea mm/rmap.c: use the pra.mapcount to do the check
We have the pra.mapcount already, and there is no need to call the
page_mapped() which may do some complicated computing for compound page.

Link: http://lkml.kernel.org/r/20190404054828.2731-1-sjhuang@iluvatar.ai
Signed-off-by: Huang Shijie <sjhuang@iluvatar.ai>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Jérôme Glisse
7269f99993 mm/mmu_notifier: use correct mmu_notifier events for each invalidation
This updates each existing invalidation to use the correct mmu notifier
event that represent what is happening to the CPU page table.  See the
patch which introduced the events to see the rational behind this.

Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Jérôme Glisse
6f4f13e8d9 mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).

Users of mmu notifier API track changes to the CPU page table and take
specific action for them.  While current API only provide range of virtual
address affected by the change, not why the changes is happening.

This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma).  Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.

The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens.  A latter patch do convert mm call path to use a more appropriate
events for each call.

This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:

%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }

@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)

@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%

Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place

Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00