Commit Graph

410 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
875057880e This is the 5.10.222 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmaY9zYACgkQONu9yGCS
 aT6v5g//WMifSZz85CUFaqgs65rwVfhTMpYtUeL5LiDuy+SMou6ViV3A93FpTkmj
 FJBvrr2y0bn8Y5Dp/fwYj10XUz+THZte/yEVnPh/NkV107FZD3fKa6GTnJY7H/XY
 4SoOGfPB4yfx+MpN6ZpLsu4cAt6FW8P+QfKOxBEboGkJSGpjEbGYFMtyZAMjknia
 QE8cKQ3LnMrQzHIizil5dZVlYaiMgJtlKTtUeVI1ixmaGDb3rCsnCVvMRvZnW95V
 aSgyJNrNix7a5tRgYwZHZp4t3p9iT2lyIFM3/y7TKcglVCMPw4nbsDdLNNq11qrk
 RdTdScR+9eKyJsEGVYOhXZFUFzOgHW22xyx0CCZmDMeu08WPNl4vhGewnndQy3yd
 6jdTRYDrU6SQNQ0AjRZXcdmfopIQxetHE7ZEKvbgBW6+u9oySYU8phPCNkma2JWr
 O2eY5AOF8zgPAdAzvF9Bt/qTlwLNjP0zczoIRX7HSvV03Nh9cQvgzKdSCfuPDU4a
 FX7mlokgweYa7WoWGPkzOlgMaJZksqstDnhbuwONoMPrNFTUjgm429K87iPdwzqC
 Yv4uDrpFXgkhfD4Aoks4wDpE2LgBKWz5Wnpo+WW4fjcrXtcIV2tTD9FkMjBv3ECv
 A8TTWsXxQtm3V54R4h7fAXg9KnZBuIYYDnB2u1317ZdaDkZRuPQ=
 =X2/A
 -----END PGP SIGNATURE-----

Merge 5.10.222 into android12-5.10-lts

Changes in 5.10.222
	Compiler Attributes: Add __uninitialized macro
	drm/lima: fix shared irq handling on driver remove
	media: dvb: as102-fe: Fix as10x_register_addr packing
	media: dvb-usb: dib0700_devices: Add missing release_firmware()
	IB/core: Implement a limit on UMAD receive List
	scsi: qedf: Make qedf_execute_tmf() non-preemptible
	crypto: aead,cipher - zeroize key buffer after use
	drm/amdgpu: Initialize timestamp for some legacy SOCs
	drm/amd/display: Check index msg_id before read or write
	drm/amd/display: Check pipe offset before setting vblank
	drm/amd/display: Skip finding free audio for unknown engine_id
	media: dw2102: Don't translate i2c read into write
	sctp: prefer struct_size over open coded arithmetic
	firmware: dmi: Stop decoding on broken entry
	Input: ff-core - prefer struct_size over open coded arithmetic
	net: dsa: mv88e6xxx: Correct check for empty list
	media: dvb-frontends: tda18271c2dd: Remove casting during div
	media: s2255: Use refcount_t instead of atomic_t for num_channels
	media: dvb-frontends: tda10048: Fix integer overflow
	i2c: i801: Annotate apanel_addr as __ro_after_init
	powerpc/64: Set _IO_BASE to POISON_POINTER_DELTA not 0 for CONFIG_PCI=n
	orangefs: fix out-of-bounds fsid access
	kunit: Fix timeout message
	powerpc/xmon: Check cpu id in commands "c#", "dp#" and "dx#"
	bpf: Avoid uninitialized value in BPF_CORE_READ_BITFIELD
	jffs2: Fix potential illegal address access in jffs2_free_inode
	s390/pkey: Wipe sensitive data on failure
	UPSTREAM: tcp: fix DSACK undo in fast recovery to call tcp_try_to_open()
	tcp_metrics: validate source addr length
	wifi: wilc1000: fix ies_len type in connect path
	bonding: Fix out-of-bounds read in bond_option_arp_ip_targets_set()
	selftests: fix OOM in msg_zerocopy selftest
	selftests: make order checking verbose in msg_zerocopy selftest
	inet_diag: Initialize pad field in struct inet_diag_req_v2
	nilfs2: fix inode number range checks
	nilfs2: add missing check for inode numbers on directory entries
	mm: optimize the redundant loop of mm_update_owner_next()
	mm: avoid overflows in dirty throttling logic
	Bluetooth: qca: Fix BT enable failure again for QCA6390 after warm reboot
	can: kvaser_usb: Explicitly initialize family in leafimx driver_info struct
	fsnotify: Do not generate events for O_PATH file descriptors
	Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"
	drm/nouveau: fix null pointer dereference in nouveau_connector_get_modes
	drm/amdgpu/atomfirmware: silence UBSAN warning
	mtd: rawnand: Bypass a couple of sanity checks during NAND identification
	bnx2x: Fix multiple UBSAN array-index-out-of-bounds
	bpf, sockmap: Fix sk->sk_forward_alloc warn_on in sk_stream_kill_queues
	ima: Avoid blocking in RCU read-side critical section
	media: dw2102: fix a potential buffer overflow
	i2c: pnx: Fix potential deadlock warning from del_timer_sync() call in isr
	ALSA: hda/realtek: Enable headset mic of JP-IK LEAP W502 with ALC897
	nvme-multipath: find NUMA path only for online numa-node
	nvme: adjust multiples of NVME_CTRL_PAGE_SIZE in offset
	platform/x86: touchscreen_dmi: Add info for GlobalSpace SolT IVW 11.6" tablet
	platform/x86: touchscreen_dmi: Add info for the EZpad 6s Pro
	nvmet: fix a possible leak when destroy a ctrl during qp establishment
	kbuild: fix short log for AS in link-vmlinux.sh
	nilfs2: fix incorrect inode allocation from reserved inodes
	mm: prevent derefencing NULL ptr in pfn_section_valid()
	filelock: fix potential use-after-free in posix_lock_inode
	fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
	vfs: don't mod negative dentry count when on shrinker list
	tcp: fix incorrect undo caused by DSACK of TLP retransmit
	octeontx2-af: Fix incorrect value output on error path in rvu_check_rsrc_availability()
	net: lantiq_etop: add blank line after declaration
	net: ethernet: lantiq_etop: fix double free in detach
	ppp: reject claimed-as-LCP but actually malformed packets
	ethtool: netlink: do not return SQI value if link is down
	udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().
	net/sched: Fix UAF when resolving a clash
	s390: Mark psw in __load_psw_mask() as __unitialized
	ARM: davinci: Convert comma to semicolon
	octeontx2-af: fix detection of IP layer
	tcp: use signed arithmetic in tcp_rtx_probe0_timed_out()
	tcp: avoid too many retransmit packets
	net: ks8851: Fix potential TX stall after interface reopen
	USB: serial: option: add Telit generic core-dump composition
	USB: serial: option: add Telit FN912 rmnet compositions
	USB: serial: option: add Fibocom FM350-GL
	USB: serial: option: add support for Foxconn T99W651
	USB: serial: option: add Netprisma LCUK54 series modules
	USB: serial: option: add Rolling RW350-GL variants
	USB: serial: mos7840: fix crash on resume
	USB: Add USB_QUIRK_NO_SET_INTF quirk for START BP-850k
	usb: gadget: configfs: Prevent OOB read/write in usb_string_copy()
	USB: core: Fix duplicate endpoint bug by clearing reserved bits in the descriptor
	hpet: Support 32-bit userspace
	nvmem: meson-efuse: Fix return value of nvmem callbacks
	ALSA: hda/realtek: Enable Mute LED on HP 250 G7
	ALSA: hda/realtek: Limit mic boost on VAIO PRO PX
	libceph: fix race between delayed_work() and ceph_monc_stop()
	wireguard: allowedips: avoid unaligned 64-bit memory accesses
	wireguard: queueing: annotate intentional data race in cpu round robin
	wireguard: send: annotate intentional data race in checking empty queue
	x86/retpoline: Move a NOENDBR annotation to the SRSO dummy return thunk
	efi: ia64: move IA64-only declarations to new asm/efi.h header
	ipv6: annotate data-races around cnf.disable_ipv6
	ipv6: prevent NULL dereference in ip6_output()
	bpf: Allow reads from uninit stack
	nilfs2: fix kernel bug on rename operation of broken directory
	i2c: rcar: bring hardware to known state when probing
	i2c: mark HostNotify target address as used
	i2c: rcar: Add R-Car Gen4 support
	i2c: rcar: reset controller is mandatory for Gen3+
	i2c: rcar: introduce Gen4 devices
	i2c: rcar: ensure Gen3+ reset does not disturb local targets
	i2c: rcar: clear NO_RXDMA flag after resetting
	i2c: rcar: fix error code in probe()
	Linux 5.10.222

Change-Id: I39dedaef039a49c1b8b53dd83b83d481593ffb95
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-07-20 13:33:30 +00:00
Jan Kara
145faa3d03 Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"
commit 30139c702048f1097342a31302cbd3d478f50c63 upstream.

Patch series "mm: Avoid possible overflows in dirty throttling".

Dirty throttling logic assumes dirty limits in page units fit into
32-bits.  This patch series makes sure this is true (see patch 2/2 for
more details).


This patch (of 2):

This reverts commit 9319b647902cbd5cc884ac08a8a6d54ce111fc78.

The commit is broken in several ways.  Firstly, the removed (u64) cast
from the multiplication will introduce a multiplication overflow on 32-bit
archs if wb_thresh * bg_thresh >= 1<<32 (which is actually common - the
default settings with 4GB of RAM will trigger this).  Secondly, the
div64_u64() is unnecessarily expensive on 32-bit archs.  We have
div64_ul() in case we want to be safe & cheap.  Thirdly, if dirty
thresholds are larger than 1<<32 pages, then dirty balancing is going to
blow up in many other spectacular ways anyway so trying to fix one
possible overflow is just moot.

Link: https://lkml.kernel.org/r/20240621144017.30993-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20240621144246.11148-1-jack@suse.cz
Fixes: 9319b647902c ("mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-By: Zach O'Keefe <zokeefe@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-18 13:05:43 +02:00
Jan Kara
7a49389771 mm: avoid overflows in dirty throttling logic
commit 385d838df280eba6c8680f9777bfa0d0bfe7e8b2 upstream.

The dirty throttling logic is interspersed with assumptions that dirty
limits in PAGE_SIZE units fit into 32-bit (so that various multiplications
fit into 64-bits).  If limits end up being larger, we will hit overflows,
possible divisions by 0 etc.  Fix these problems by never allowing so
large dirty limits as they have dubious practical value anyway.  For
dirty_bytes / dirty_background_bytes interfaces we can just refuse to set
so large limits.  For dirty_ratio / dirty_background_ratio it isn't so
simple as the dirty limit is computed from the amount of available memory
which can change due to memory hotplug etc.  So when converting dirty
limits from ratios to numbers of pages, we just don't allow the result to
exceed UINT_MAX.

This is root-only triggerable problem which occurs when the operator
sets dirty limits to >16 TB.

Link: https://lkml.kernel.org/r/20240621144246.11148-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-By: Zach O'Keefe <zokeefe@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-07-18 13:05:42 +02:00
Greg Kroah-Hartman
66e91da883 This is the 5.10.210 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmXYTLkACgkQONu9yGCS
 aT4+fhAAqqR/Cvx53ZKMQ8GZTCudAZnr/Dz6kWYwxhhhIbQjDpCaf9mgsrEDaQS2
 ancSZjzYaOUIXq/IsthXxQIUhiZbuM3iuSEi7+odWgSYdkFyzuUt8MWLBGSaB5Er
 ojn+APtq7vPXTSnp7uMwqMC3/BHCKkeYIjRVevhhHBKG5d3lzkV1xU8NcvMkLaly
 CIRxpWXD3w2b7K0GEbb/zN1GQEHDCQcxjuaJoe/5FKGJkqd3T31eyiJTRumCCMcz
 j8vkGkYmcMJpWf04iLgVA1p13I5/HGrXdEBI/GutN8IABIC3Cp42jW8phHYKW5ZM
 a4R25LZG5buND1Ubpq+EDrYn3EaPek5XRki0w8ZAXfNa3rYc+N6mQjkzNSOzhJ/5
 VNsn3EAE1Dwtar5Z3ASe9ugDbh+0bgx85PbfaADK88V+qWb3DVr1TBWmDNu2vfVP
 rv4I0EKu9r3vOE8aNMEBuhAVkIK3mEQUxwab6RKNrMby/5Uwa+ugrrUtQd8V+T1S
 j6r6v7u7aZ8mhYO7d6WSvAKL85lCWGbs3WRIKCJZmDRyqWrWW9tVWRN9wrZ2QnRr
 iaCQKk8P474P7/j1zwnmih8l4wS1oszveNziWwd0fi1Nn/WQYM+JKYQvpuQijmQ+
 J9jLyWo7a59zffIE6mzJdNwFy9hlw9X+VnJmExk/Q88Z7Bt5wPQ=
 =laYd
 -----END PGP SIGNATURE-----

Merge 5.10.210 into android12-5.10-lts

Changes in 5.10.210
	usb: cdns3: Fixes for sparse warnings
	usb: cdns3: fix uvc failure work since sg support enabled
	usb: cdns3: fix incorrect calculation of ep_buf_size when more than one config
	usb: cdns3: fix iso transfer error when mult is not zero
	usb: cdns3: Fix uvc fail when DMA cross 4k boundery since sg enabled
	PCI: mediatek: Clear interrupt status before dispatching handler
	units: change from 'L' to 'UL'
	units: add the HZ macros
	serial: sc16is7xx: set safe default SPI clock frequency
	spi: introduce SPI_MODE_X_MASK macro
	serial: sc16is7xx: add check for unsupported SPI modes during probe
	iio: adc: ad7091r: Set alert bit in config register
	iio: adc: ad7091r: Allow users to configure device events
	iio: adc: ad7091r: Enable internal vref if external vref is not supplied
	dmaengine: fix NULL pointer in channel unregistration function
	iio:adc:ad7091r: Move exports into IIO_AD7091R namespace.
	ext4: allow for the last group to be marked as trimmed
	crypto: api - Disallow identical driver names
	PM: hibernate: Enforce ordering during image compression/decompression
	hwrng: core - Fix page fault dead lock on mmap-ed hwrng
	crypto: s390/aes - Fix buffer overread in CTR mode
	rpmsg: virtio: Free driver_override when rpmsg_remove()
	bus: mhi: host: Drop chan lock before queuing buffers
	parisc/firmware: Fix F-extend for PDC addresses
	async: Split async_schedule_node_domain()
	async: Introduce async_schedule_dev_nocall()
	arm64: dts: qcom: sdm845: fix USB wakeup interrupt types
	arm64: dts: qcom: sdm845: fix USB DP/DM HS PHY interrupts
	lsm: new security_file_ioctl_compat() hook
	scripts/get_abi: fix source path leak
	mmc: core: Use mrq.sbc in close-ended ffu
	mmc: mmc_spi: remove custom DMA mapped buffers
	rtc: Adjust failure return code for cmos_set_alarm()
	nouveau/vmm: don't set addr on the fail path to avoid warning
	ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path
	rename(): fix the locking of subdirectories
	block: Remove special-casing of compound pages
	stddef: Introduce DECLARE_FLEX_ARRAY() helper
	smb3: Replace smb2pdu 1-element arrays with flex-arrays
	mm: vmalloc: introduce array allocation functions
	KVM: use __vcalloc for very large allocations
	net/smc: fix illegal rmb_desc access in SMC-D connection dump
	tcp: make sure init the accept_queue's spinlocks once
	bnxt_en: Wait for FLR to complete during probe
	vlan: skip nested type that is not IFLA_VLAN_QOS_MAPPING
	llc: make llc_ui_sendmsg() more robust against bonding changes
	llc: Drop support for ETH_P_TR_802_2.
	net/rds: Fix UBSAN: array-index-out-of-bounds in rds_cmsg_recv
	tracing: Ensure visibility when inserting an element into tracing_map
	afs: Hide silly-rename files from userspace
	tcp: Add memory barrier to tcp_push()
	netlink: fix potential sleeping issue in mqueue_flush_file
	ipv6: init the accept_queue's spinlocks in inet6_create
	net/mlx5: DR, Use the right GVMI number for drop action
	net/mlx5e: fix a double-free in arfs_create_groups
	netfilter: nf_tables: restrict anonymous set and map names to 16 bytes
	netfilter: nf_tables: validate NFPROTO_* family
	net: mvpp2: clear BM pool before initialization
	selftests: netdevsim: fix the udp_tunnel_nic test
	fjes: fix memleaks in fjes_hw_setup
	net: fec: fix the unhandled context fault from smmu
	btrfs: ref-verify: free ref cache before clearing mount opt
	btrfs: tree-checker: fix inline ref size in error messages
	btrfs: don't warn if discard range is not aligned to sector
	btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args
	btrfs: don't abort filesystem when attempting to snapshot deleted subvolume
	rbd: don't move requests to the running list on errors
	exec: Fix error handling in begin_new_exec()
	wifi: iwlwifi: fix a memory corruption
	netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain
	netfilter: nf_tables: reject QUEUE/DROP verdict parameters
	gpiolib: acpi: Ignore touchpad wakeup on GPD G1619-04
	drm: Don't unref the same fb many times by mistake due to deadlock handling
	drm/bridge: nxp-ptn3460: fix i2c_master_send() error checking
	drm/tidss: Fix atomic_flush check
	drm/bridge: nxp-ptn3460: simplify some error checking
	PM: sleep: Use dev_printk() when possible
	PM: sleep: Avoid calling put_device() under dpm_list_mtx
	PM: core: Remove unnecessary (void *) conversions
	PM: sleep: Fix possible deadlocks in core system-wide PM code
	fs/pipe: move check to pipe_has_watch_queue()
	pipe: wakeup wr_wait after setting max_usage
	ARM: dts: samsung: exynos4210-i9100: Unconditionally enable LDO12
	arm64: dts: qcom: sc7180: Use pdc interrupts for USB instead of GIC interrupts
	arm64: dts: qcom: sc7180: fix USB wakeup interrupt types
	media: mtk-jpeg: Fix use after free bug due to error path handling in mtk_jpeg_dec_device_run
	mm: use __pfn_to_section() instead of open coding it
	mm/sparsemem: fix race in accessing memory_section->usage
	btrfs: remove err variable from btrfs_delete_subvolume
	btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted
	drm: panel-simple: add missing bus flags for Tianma tm070jvhg[30/33]
	drm/exynos: fix accidental on-stack copy of exynos_drm_plane
	drm/exynos: gsc: minor fix for loop iteration in gsc_runtime_resume
	gpio: eic-sprd: Clear interrupt after set the interrupt type
	spi: bcm-qspi: fix SFDP BFPT read by usig mspi read
	mips: Call lose_fpu(0) before initializing fcr31 in mips_set_personality_nan
	tick/sched: Preserve number of idle sleeps across CPU hotplug events
	x86/entry/ia32: Ensure s32 is sign extended to s64
	powerpc/mm: Fix null-pointer dereference in pgtable_cache_add
	drivers/perf: pmuv3: don't expose SW_INCR event in sysfs
	powerpc: Fix build error due to is_valid_bugaddr()
	powerpc/mm: Fix build failures due to arch_reserved_kernel_pages()
	x86/boot: Ignore NMIs during very early boot
	powerpc: pmd_move_must_withdraw() is only needed for CONFIG_TRANSPARENT_HUGEPAGE
	powerpc/lib: Validate size for vector operations
	x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel
	perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file
	debugobjects: Stop accessing objects after releasing hash bucket lock
	regulator: core: Only increment use_count when enable_count changes
	audit: Send netlink ACK before setting connection in auditd_set
	ACPI: video: Add quirk for the Colorful X15 AT 23 Laptop
	PNP: ACPI: fix fortify warning
	ACPI: extlog: fix NULL pointer dereference check
	PM / devfreq: Synchronize devfreq_monitor_[start/stop]
	ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events
	FS:JFS:UBSAN:array-index-out-of-bounds in dbAdjTree
	UBSAN: array-index-out-of-bounds in dtSplitRoot
	jfs: fix slab-out-of-bounds Read in dtSearch
	jfs: fix array-index-out-of-bounds in dbAdjTree
	jfs: fix uaf in jfs_evict_inode
	pstore/ram: Fix crash when setting number of cpus to an odd number
	crypto: stm32/crc32 - fix parsing list of devices
	afs: fix the usage of read_seqbegin_or_lock() in afs_lookup_volume_rcu()
	afs: fix the usage of read_seqbegin_or_lock() in afs_find_server*()
	rxrpc_find_service_conn_rcu: fix the usage of read_seqbegin_or_lock()
	jfs: fix array-index-out-of-bounds in diNewExt
	s390/ptrace: handle setting of fpc register correctly
	KVM: s390: fix setting of fpc register
	SUNRPC: Fix a suspicious RCU usage warning
	ecryptfs: Reject casefold directory inodes
	ext4: fix inconsistent between segment fstrim and full fstrim
	ext4: unify the type of flexbg_size to unsigned int
	ext4: remove unnecessary check from alloc_flex_gd()
	ext4: avoid online resizing failures due to oversized flex bg
	wifi: rt2x00: restart beacon queue when hardware reset
	selftests/bpf: satisfy compiler by having explicit return in btf test
	selftests/bpf: Fix pyperf180 compilation failure with clang18
	scsi: lpfc: Fix possible file string name overflow when updating firmware
	PCI: Add no PM reset quirk for NVIDIA Spectrum devices
	bonding: return -ENOMEM instead of BUG in alb_upper_dev_walk
	scsi: arcmsr: Support new PCI device IDs 1883 and 1886
	ARM: dts: imx7d: Fix coresight funnel ports
	ARM: dts: imx7s: Fix lcdif compatible
	ARM: dts: imx7s: Fix nand-controller #size-cells
	wifi: ath9k: Fix potential array-index-out-of-bounds read in ath9k_htc_txstatus()
	bpf: Add map and need_defer parameters to .map_fd_put_ptr()
	scsi: libfc: Don't schedule abort twice
	scsi: libfc: Fix up timeout error in fc_fcp_rec_error()
	bpf: Set uattr->batch.count as zero before batched update or deletion
	ARM: dts: rockchip: fix rk3036 hdmi ports node
	ARM: dts: imx25/27-eukrea: Fix RTC node name
	ARM: dts: imx: Use flash@0,0 pattern
	ARM: dts: imx27: Fix sram node
	ARM: dts: imx1: Fix sram node
	ionic: pass opcode to devcmd_wait
	block/rnbd-srv: Check for unlikely string overflow
	ARM: dts: imx25: Fix the iim compatible string
	ARM: dts: imx25/27: Pass timing0
	ARM: dts: imx27-apf27dev: Fix LED name
	ARM: dts: imx23-sansa: Use preferred i2c-gpios properties
	ARM: dts: imx23/28: Fix the DMA controller node name
	net: dsa: mv88e6xxx: Fix mv88e6352_serdes_get_stats error path
	block: prevent an integer overflow in bvec_try_merge_hw_page
	md: Whenassemble the array, consult the superblock of the freshest device
	arm64: dts: qcom: msm8996: Fix 'in-ports' is a required property
	arm64: dts: qcom: msm8998: Fix 'out-ports' is a required property
	wifi: rtl8xxxu: Add additional USB IDs for RTL8192EU devices
	wifi: rtlwifi: rtl8723{be,ae}: using calculate_bit_shift()
	wifi: cfg80211: free beacon_ies when overridden from hidden BSS
	Bluetooth: qca: Set both WIDEBAND_SPEECH and LE_STATES quirks for QCA2066
	Bluetooth: L2CAP: Fix possible multiple reject send
	i40e: Fix VF disable behavior to block all traffic
	f2fs: fix to check return value of f2fs_reserve_new_block()
	ALSA: hda: Refer to correct stream index at loops
	ASoC: doc: Fix undefined SND_SOC_DAPM_NOPM argument
	fast_dput(): handle underflows gracefully
	RDMA/IPoIB: Fix error code return in ipoib_mcast_join
	drm/amd/display: Fix tiled display misalignment
	f2fs: fix write pointers on zoned device after roll forward
	drm/drm_file: fix use of uninitialized variable
	drm/framebuffer: Fix use of uninitialized variable
	drm/mipi-dsi: Fix detach call without attach
	media: stk1160: Fixed high volume of stk1160_dbg messages
	media: rockchip: rga: fix swizzling for RGB formats
	PCI: add INTEL_HDA_ARL to pci_ids.h
	ALSA: hda: Intel: add HDA_ARL PCI ID support
	ALSA: hda: intel-dspcfg: add filters for ARL-S and ARL
	drm/exynos: Call drm_atomic_helper_shutdown() at shutdown/unbind time
	IB/ipoib: Fix mcast list locking
	media: ddbridge: fix an error code problem in ddb_probe
	drm/msm/dpu: Ratelimit framedone timeout msgs
	clk: hi3620: Fix memory leak in hi3620_mmc_clk_init()
	clk: mmp: pxa168: Fix memory leak in pxa168_clk_init()
	watchdog: it87_wdt: Keep WDTCTRL bit 3 unmodified for IT8784/IT8786
	drm/amdgpu: Let KFD sync with VM fences
	drm/amdgpu: Drop 'fence' check in 'to_amdgpu_amdkfd_fence()'
	leds: trigger: panic: Don't register panic notifier if creating the trigger failed
	um: Fix naming clash between UML and scheduler
	um: Don't use vfprintf() for os_info()
	um: net: Fix return type of uml_net_start_xmit()
	i3c: master: cdns: Update maximum prescaler value for i2c clock
	xen/gntdev: Fix the abuse of underlying struct page in DMA-buf import
	mfd: ti_am335x_tscadc: Fix TI SoC dependencies
	PCI: Only override AMD USB controller if required
	PCI: switchtec: Fix stdev_release() crash after surprise hot remove
	usb: hub: Replace hardcoded quirk value with BIT() macro
	tty: allow TIOCSLCKTRMIOS with CAP_CHECKPOINT_RESTORE
	fs/kernfs/dir: obey S_ISGID
	PCI/AER: Decode Requester ID when no error info found
	libsubcmd: Fix memory leak in uniq()
	virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings
	blk-mq: fix IO hang from sbitmap wakeup race
	ceph: fix deadlock or deadcode of misusing dget()
	drm/amd/powerplay: Fix kzalloc parameter 'ATOM_Tonga_PPM_Table' in 'get_platform_power_management_table()'
	drm/amdgpu: Release 'adev->pm.fw' before return in 'amdgpu_device_need_post()'
	perf: Fix the nr_addr_filters fix
	wifi: cfg80211: fix RCU dereference in __cfg80211_bss_update
	drm: using mul_u32_u32() requires linux/math64.h
	scsi: isci: Fix an error code problem in isci_io_request_build()
	scsi: core: Introduce enum scsi_disposition
	scsi: core: Move scsi_host_busy() out of host lock for waking up EH handler
	ip6_tunnel: use dev_sw_netstats_rx_add()
	ip6_tunnel: make sure to pull inner header in __ip6_tnl_rcv()
	net-zerocopy: Refactor frag-is-remappable test.
	tcp: add sanity checks to rx zerocopy
	ixgbe: Remove non-inclusive language
	ixgbe: Refactor returning internal error codes
	ixgbe: Refactor overtemp event handling
	ixgbe: Fix an error handling path in ixgbe_read_iosf_sb_reg_x550()
	ipv6: Ensure natural alignment of const ipv6 loopback and router addresses
	llc: call sock_orphan() at release time
	netfilter: nf_log: replace BUG_ON by WARN_ON_ONCE when putting logger
	netfilter: nft_ct: sanitize layer 3 and 4 protocol number in custom expectations
	net: ipv4: fix a memleak in ip_setup_cork
	af_unix: fix lockdep positive in sk_diag_dump_icons()
	net: sysfs: Fix /sys/class/net/<iface> path
	HID: apple: Add support for the 2021 Magic Keyboard
	HID: apple: Add 2021 magic keyboard FN key mapping
	bonding: remove print in bond_verify_device_path
	uapi: stddef.h: Fix __DECLARE_FLEX_ARRAY for C++
	PM: sleep: Fix error handling in dpm_prepare()
	dmaengine: fsl-dpaa2-qdma: Fix the size of dma pools
	dmaengine: ti: k3-udma: Report short packet errors
	dmaengine: fsl-qdma: Fix a memory leak related to the status queue DMA
	dmaengine: fsl-qdma: Fix a memory leak related to the queue command DMA
	phy: renesas: rcar-gen3-usb2: Fix returning wrong error code
	dmaengine: fix is_slave_direction() return false when DMA_DEV_TO_DEV
	phy: ti: phy-omap-usb2: Fix NULL pointer dereference for SRP
	drm/msm/dp: return correct Colorimetry for DP_TEST_DYNAMIC_RANGE_CEA case
	net: stmmac: xgmac: fix handling of DPP safety error for DMA channels
	selftests: net: avoid just another constant wait
	tunnels: fix out of bounds access when building IPv6 PMTU error
	atm: idt77252: fix a memleak in open_card_ubr0
	hwmon: (aspeed-pwm-tacho) mutex for tach reading
	hwmon: (coretemp) Fix out-of-bounds memory access
	hwmon: (coretemp) Fix bogus core_id to attr name mapping
	inet: read sk->sk_family once in inet_recv_error()
	rxrpc: Fix response to PING RESPONSE ACKs to a dead call
	tipc: Check the bearer type before calling tipc_udp_nl_bearer_add()
	ppp_async: limit MRU to 64K
	netfilter: nft_compat: reject unused compat flag
	netfilter: nft_compat: restrict match/target protocol to u16
	netfilter: nft_ct: reject direction for ct id
	netfilter: nft_set_pipapo: store index in scratch maps
	netfilter: nft_set_pipapo: add helper to release pcpu scratch area
	netfilter: nft_set_pipapo: remove scratch_aligned pointer
	scsi: core: Move scsi_host_busy() out of host lock if it is for per-command
	blk-iocost: Fix an UBSAN shift-out-of-bounds warning
	net/af_iucv: clean up a try_then_request_module()
	USB: serial: qcserial: add new usb-id for Dell Wireless DW5826e
	USB: serial: option: add Fibocom FM101-GL variant
	USB: serial: cp210x: add ID for IMST iM871A-USB
	usb: host: xhci-plat: Add support for XHCI_SG_TRB_CACHE_SIZE_QUIRK
	hrtimer: Report offline hrtimer enqueue
	Input: i8042 - fix strange behavior of touchpad on Clevo NS70PU
	Input: atkbd - skip ATKBD_CMD_SETLEDS when skipping ATKBD_CMD_GETID
	vhost: use kzalloc() instead of kmalloc() followed by memset()
	clocksource: Skip watchdog check for large watchdog intervals
	net: stmmac: xgmac: use #define for string constants
	net: stmmac: xgmac: fix a typo of register name in DPP safety handling
	netfilter: nft_set_rbtree: skip end interval element from gc
	btrfs: forbid creating subvol qgroups
	btrfs: do not ASSERT() if the newly created subvolume already got read
	btrfs: forbid deleting live subvol qgroup
	btrfs: send: return EOPNOTSUPP on unknown flags
	of: unittest: Fix compile in the non-dynamic case
	net: openvswitch: limit the number of recursions from action sets
	spi: ppc4xx: Drop write-only variable
	ASoC: rt5645: Fix deadlock in rt5645_jack_detect_work()
	net: sysfs: Fix /sys/class/net/<iface> path for statistics
	MIPS: Add 'memory' clobber to csum_ipv6_magic() inline assembler
	i40e: Fix waiting for queues of all VSIs to be disabled
	tracing/trigger: Fix to return error if failed to alloc snapshot
	mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again
	ALSA: hda/realtek: Fix the external mic not being recognised for Acer Swift 1 SF114-32
	ALSA: hda/realtek: Enable Mute LED on HP Laptop 14-fq0xxx
	HID: wacom: generic: Avoid reporting a serial of '0' to userspace
	HID: wacom: Do not register input devices until after hid_hw_start
	usb: ucsi_acpi: Fix command completion handling
	USB: hub: check for alternate port before enabling A_ALT_HNP_SUPPORT
	usb: f_mass_storage: forbid async queue when shutdown happen
	media: ir_toy: fix a memleak in irtoy_tx
	powerpc/kasan: Fix addr error caused by page alignment
	i2c: i801: Remove i801_set_block_buffer_mode
	i2c: i801: Fix block process call transactions
	modpost: trim leading spaces when processing source files list
	scsi: Revert "scsi: fcoe: Fix potential deadlock on &fip->ctlr_lock"
	lsm: fix the logic in security_inode_getsecctx()
	firewire: core: correct documentation of fw_csr_string() kernel API
	kbuild: Fix changing ELF file type for output of gen_btf for big endian
	nfc: nci: free rx_data_reassembly skb on NCI device cleanup
	net: hsr: remove WARN_ONCE() in send_hsr_supervision_frame()
	xen-netback: properly sync TX responses
	ALSA: hda/realtek: Enable headset mic on Vaio VJFE-ADL
	binder: signal epoll threads of self-work
	misc: fastrpc: Mark all sessions as invalid in cb_remove
	ext4: fix double-free of blocks due to wrong extents moved_len
	tracing: Fix wasted memory in saved_cmdlines logic
	staging: iio: ad5933: fix type mismatch regression
	iio: magnetometer: rm3100: add boundary check for the value read from RM3100_REG_TMRC
	iio: accel: bma400: Fix a compilation problem
	media: rc: bpf attach/detach requires write permission
	hv_netvsc: Fix race condition between netvsc_probe and netvsc_remove
	ring-buffer: Clean ring_buffer_poll_wait() error return
	serial: max310x: set default value when reading clock ready bit
	serial: max310x: improve crystal stable clock detection
	x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6
	x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
	mmc: slot-gpio: Allow non-sleeping GPIO ro
	ALSA: hda/conexant: Add quirk for SWS JS201D
	nilfs2: fix data corruption in dsync block recovery for small block sizes
	nilfs2: fix hang in nilfs_lookup_dirty_data_buffers()
	crypto: ccp - Fix null pointer dereference in __sev_platform_shutdown_locked
	nfp: use correct macro for LengthSelect in BAR config
	nfp: flower: prevent re-adding mac index for bonded port
	wifi: mac80211: reload info pointer in ieee80211_tx_dequeue()
	irqchip/irq-brcmstb-l2: Add write memory barrier before exit
	irqchip/gic-v3-its: Fix GICv4.1 VPE affinity update
	s390/qeth: Fix potential loss of L3-IP@ in case of network issues
	ceph: prevent use-after-free in encode_cap_msg()
	of: property: fix typo in io-channels
	can: j1939: Fix UAF in j1939_sk_match_filter during setsockopt(SO_J1939_FILTER)
	pmdomain: core: Move the unused cleanup to a _sync initcall
	tracing: Inform kmemleak of saved_cmdlines allocation
	Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"
	bus: moxtet: Add spi device table
	PCI: dwc: endpoint: Fix dw_pcie_ep_raise_msix_irq() alignment support
	mips: Fix max_mapnr being uninitialized on early stages
	crypto: lib/mpi - Fix unexpected pointer access in mpi_ec_init
	serial: Add rs485_supported to uart_port
	serial: 8250_exar: Fill in rs485_supported
	serial: 8250_exar: Set missing rs485_supported flag
	scripts/decode_stacktrace.sh: silence stderr messages from addr2line/nm
	scripts/decode_stacktrace.sh: support old bash version
	scripts: decode_stacktrace: demangle Rust symbols
	scripts/decode_stacktrace.sh: optionally use LLVM utilities
	netfilter: ipset: fix performance regression in swap operation
	netfilter: ipset: Missing gc cancellations fixed
	hrtimer: Ignore slack time for RT tasks in schedule_hrtimeout_range()
	Revert "arm64: Stash shadow stack pointer in the task struct on interrupt"
	net: prevent mss overflow in skb_segment()
	sched/membarrier: reduce the ability to hammer on sys_membarrier
	nilfs2: fix potential bug in end_buffer_async_write
	nilfs2: replace WARN_ONs for invalid DAT metadata block requests
	dm: limit the number of targets and parameter size area
	PM: runtime: add devm_pm_runtime_enable helper
	PM: runtime: Have devm_pm_runtime_enable() handle pm_runtime_dont_use_autosuspend()
	drm/msm/dsi: Enable runtime PM
	netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval()
	net: bcmgenet: Fix EEE implementation
	PCI: dwc: Fix a 64bit bug in dw_pcie_ep_raise_msix_irq()
	Linux 5.10.210

Change-Id: I5e7327f58dd6abd26ac2b1e328a81c1010d1147c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-04-10 07:10:03 +00:00
Zach O'Keefe
81e7d2530d mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again
commit 9319b647902cbd5cc884ac08a8a6d54ce111fc78 upstream.

(struct dirty_throttle_control *)->thresh is an unsigned long, but is
passed as the u32 divisor argument to div_u64().  On architectures where
unsigned long is 64 bytes, the argument will be implicitly truncated.

Use div64_u64() instead of div_u64() so that the value used in the "is
this a safe division" check is the same as the divisor.

Also, remove redundant cast of the numerator to u64, as that should happen
implicitly.

This would be difficult to exploit in memcg domain, given the ratio-based
arithmetic domain_drity_limits() uses, but is much easier in global
writeback domain with a BDI_CAP_STRICTLIMIT-backing device, using e.g.
vm.dirty_bytes=(1<<32)*PAGE_SIZE so that dtc->thresh == (1<<32)

Link: https://lkml.kernel.org/r/20240118181954.1415197-1-zokeefe@google.com
Fixes: f6789593d5 ("mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()")
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: Maxim Patlasov <MPatlasov@parallels.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23 08:42:24 +01:00
Yang Yang
1b6f2f6e29 ANDROID: vendor_hooks: add hook to balance_dirty_pages()
Add vendor hook in order to track which process cause dirty page write
back pressure.

Bug: 188096764
Change-Id: I890299c97d7a8cf791f20d16d8d53b4615679b9e
Signed-off-by: Yang Yang <yang.yang@vivo.com>
2021-05-20 19:38:42 +00:00
Linus Torvalds
876195e1c8 mm: make wait_on_page_writeback() wait for multiple pending writebacks
commit c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 upstream.

Ever since commit 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common()
logic") we've had some very occasional reports of BUG_ON(PageWriteback)
in write_cache_pages(), which we thought we already fixed in commit
073861ed77 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").

But syzbot just reported another one, even with that commit in place.

And it turns out that there's a simpler way to trigger the BUG_ON() than
the one Hugh found with page re-use.  It all boils down to the fact that
the page writeback is ostensibly serialized by the page lock, but that
isn't actually really true.

Yes, the people _setting_ writeback all do so under the page lock, but
the actual clearing of the bit - and waking up any waiters - happens
without any page lock.

This gives us this fairly simple race condition:

  CPU1 = end previous writeback
  CPU2 = start new writeback under page lock
  CPU3 = write_cache_pages()

  CPU1          CPU2            CPU3
  ----          ----            ----

  end_page_writeback()
    test_clear_page_writeback(page)
    ... delayed...

                lock_page();
                set_page_writeback()
                unlock_page()

                                lock_page()
                                wait_on_page_writeback();

    wake_up_page(page, PG_writeback);
    .. wakes up CPU3 ..

                                BUG_ON(PageWriteback(page));

where the BUG_ON() happens because we woke up the PG_writeback bit
becasue of the _previous_ writeback, but a new one had already been
started because the clearing of the bit wasn't actually atomic wrt the
actual wakeup or serialized by the page lock.

The reason this didn't use to happen was that the old logic in waiting
on a page bit would just loop if it ever saw the bit set again.

The nice proper fix would probably be to get rid of the whole "wait for
writeback to clear, and then set it" logic in the writeback path, and
replace it with an atomic "wait-to-set" (ie the same as we have for page
locking: we set the page lock bit with a single "lock_page()", not with
"wait for lock bit to clear and then set it").

However, out current model for writeback is that the waiting for the
writeback bit is done by the generic VFS code (ie write_cache_pages()),
but the actual setting of the writeback bit is done much later by the
filesystem ".writepages()" function.

IOW, to make the writeback bit have that same kind of "wait-to-set"
behavior as we have for page locking, we'd have to change our roughly
~50 different writeback functions.  Painful.

Instead, just make "wait_on_page_writeback()" loop on the very unlikely
situation that the PG_writeback bit is still set, basically re-instating
the old behavior.  This is very non-optimal in case of contention, but
since we only ever set the bit under the page lock, that situation is
controlled.

Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
Fixes: 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic")
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:22 +01:00
Hugh Dickins
073861ed77 mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)
Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
no longer an ext4 page at all.

The problem is that PageWriteback is not accompanied by a page reference
(as the NOTE at the end of test_clear_page_writeback() acknowledges): as
soon as TestClearPageWriteback has been done, that page could be removed
from page cache, freed, and reused for something else by the time that
wake_up_page() is reached.

https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
check; but I'm paranoid about even looking at an unreferenced struct page,
lest its memory might itself have already been reused or hotremoved (and
wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

Then on crashing a second time, realized there's a stronger reason against
that approach.  If my testing just occasionally crashes on that check,
when the page is reused for part of a compound page, wouldn't it be much
more common for the page to get reused as an order-0 page before reaching
wake_up_page()?  And on rare occasions, might that reused page already be
marked PageWriteback by its new user, and already be waited upon?  What
would that look like?

It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
in write_cache_pages() (though I have never seen that crash myself).

Matthew Wilcox explaining this to himself:
 "page is allocated, added to page cache, dirtied, writeback starts,

  --- thread A ---
  filesystem calls end_page_writeback()
        test_clear_page_writeback()
  --- context switch to thread B ---
  truncate_inode_pages_range() finds the page, it doesn't have writeback set,
  we delete it from the page cache.  Page gets reallocated, dirtied, writeback
  starts again.  Then we call write_cache_pages(), see
  PageWriteback() set, call wait_on_page_writeback()
  --- context switch back to thread A ---
  wake_up_page(page, PG_writeback);
  ... thread B is woken, but because the wakeup was for the old use of
  the page, PageWriteback is still set.

  Devious"

And prior to 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic")
this would have been much less likely: before that, wake_page_function()'s
non-exclusive case would stop walking and not wake if it found Writeback
already set again; whereas now the non-exclusive case proceeds to wake.

I have not thought of a fix that does not add a little overhead: the
simplest fix is for end_page_writeback() to get_page() before calling
test_clear_page_writeback(), then put_page() after wake_up_page().

Was there a chance of missed wakeups before, since a page freed before
reaching wake_up_page() would have PageWaiters cleared?  I think not,
because each waiter does hold a reference on the page.  This bug comes
when the old use of the page, the one we do TestClearPageWriteback on,
had *no* waiters, so no additional page reference beyond the page cache
(and whoever racily freed it).  The reuse of the page has a waiter
holding a reference, and its own PageWriteback set; but the belated
wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
Reported-by: Qian Cai <cai@lca.pw>
Fixes: 2a9127fcf2 ("mm: rewrite wait_on_page_bit_common() logic")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-11-24 15:23:19 -08:00
Matthew Wilcox (Oracle)
8854a6a724 mm/page-writeback: support tail pages in wait_for_stable_page
page->mapping is undefined for tail pages, so operate exclusively on the
head page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Link: https://lkml.kernel.org/r/20200908195539.25896-11-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:15 -07:00
Christoph Hellwig
f56753ac2a bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities.  Also remove the pointless wrappers
to just check the flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
823423ef55 bdi: invert BDI_CAP_NO_ACCT_WB
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
make the checks more obvious.  Also remove the pointless
bdi_cap_account_writeback wrapper that just obsfucates the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
1cb039f3dc bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
also is writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
David Hildenbrand
0a18e60788 mm: remove vm_total_pages
The global variable "vm_total_pages" is a relic from older days.  There is
only a single user that reads the variable - build_all_zonelists() - and
the first thing it does is update it.

Use a local variable in build_all_zonelists() instead and remove the
global variable.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:28 -07:00
Ethon Paul
e0857cf5ac mm/page-writeback: fix a typo in comment "effictive"->"effective"
There is a typo in comment, fix it.

Signed-off-by: Ethon Paul <ethp@qq.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200411003513.14613-1-ethp@qq.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:24 -07:00
Linus Torvalds
cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
NeilBrown
8d92890bd6 mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead
After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds.  If the COMMIT fails, the page will be
re-written.

These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count.  This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g.  releasepage() used to
send a COMMIT).  However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a8
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.

So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.

A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).

Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput.  With this
patch, there are fewer writes which are each larger on average.

Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero.  static trace points and warning printks which mentioned this
counter no longer report it.

[akpm@linux-foundation.org: re-layout comment]
[akpm@linux-foundation.org: fix printk warning]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Acked-by: Michal Hocko <mhocko@suse.com>	[mm]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
NeilBrown
a37b0715dd mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled.  The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics.  This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi
for an NFS filesystem or loop block device etc) do not affect the
throttling of the daemon writing to the "final" bdi.

This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi.  This ensures it will always be
able to have some pages in flight, and so will not deadlock.

In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.

So this patch
 - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
 - removes the 25% bonus that that flag gives, and
 - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
   global and the local free-run thresholds are exceeded.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads.  This patch does *not* change the behvaiour
for real-time threads, so it is now different from the behaviour of nfsd
and loop tasks.  I don't know what is wanted for realtime.

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
Chao Yu
28659cc8cc mm/page-writeback.c: remove unused variable
Commit 64081362e8 ("mm/page-writeback.c: fix range_cyclic writeback
vs writepages deadlock") left unused variable, remove it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200528033740.17269-1-yuchao0@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
Christoph Hellwig
32927393dc sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:07:40 -04:00
Claudio Imbrenda
f28d43636d mm/gup/writeback: add callbacks for inaccessible pages
With the introduction of protected KVM guests on s390 there is now a
concept of inaccessible pages.  These pages need to be made accessible
before the host can access them.

While cpu accesses will trigger a fault that can be resolved, I/O accesses
will just fail.  We need to add a callback into architecture code for
places that will do I/O, namely when writeback is started or when a page
reference is taken.

This is not only to enable paging, file backing etc, it is also necessary
to protect the host against a malicious user space.  For example a bad
QEMU could simply start direct I/O on such protected memory.  We do not
want userspace to be able to trigger I/O errors and thus the logic is
"whenever somebody accesses that page (gup) or does I/O, make sure that
this page can be accessed".  When the guest tries to access that page we
will wait in the page fault handler for writeback to have finished and for
the page_ref to be the expected value.

On s390x the function is not supposed to fail, so it is ok to use a
WARN_ON on failure.  If we ever need some more finegrained handling we can
tackle this when we know the details.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200306132537.783769-3-imbrenda@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:27 -07:00
Matthew Wilcox (Oracle)
184b4fef58 mm/page-writeback.c: use VM_BUG_ON_PAGE in clear_page_dirty_for_io
Dumping the page information in this circumstance helps for debugging.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-7-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:26 -07:00
Mauricio Faria de Oliveira
cc7b8f6245 mm/page-writeback.c: write_cache_pages(): deduplicate identical checks
There used to be a 'retry' label in between the two (identical) checks
when first introduced in commit f446daaea9 ("mm: implement writeback
livelock avoidance using page tagging"), and later modified/updated in
commit 6e6938b6d3 ("writeback: introduce .tagged_writepages for the
WB_SYNC_NONE sync stage").

The label has been removed in commit 64081362e8 ("mm/page-writeback.c:
fix range_cyclic writeback vs writepages deadlock"), and the (identical)
checks are now present / performed immediately one after another.

So, remove/deduplicate the latter check, moving tag_pages_for_writeback()
into the former check before the 'tag' variable assignment, so it's clear
that it's not used in this (similarly-named) function call but only later
in pagevec_lookup_range_tag().

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Link: http://lkml.kernel.org/r/20200218221716.1648-1-mfo@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:26 -07:00
Wen Yang
0a5d1a7f64 mm/page-writeback.c: improve arithmetic divisions
Use div64_ul() instead of do_div() if the divisor is unsigned long, to
avoid truncation to 32-bit on 64-bit platforms.

Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.com
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-13 18:19:02 -08:00
Wen Yang
d3ac946ec9 mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divide
The two variables 'numerator' and 'denominator', though they are
declared as long, they should actually be unsigned long (according to
the implementation of the fprop_fraction_percpu() function)

And do_div() does a 64-by-32 division, while the divisor 'denominator'
is unsigned long, thus 64-bit on 64-bit platforms.  Hence the proper
function to call is div64_ul().

Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.com
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-13 18:19:02 -08:00
Wen Yang
6d9e8c651d mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio()
Patch series "use div64_ul() instead of div_u64() if the divisor is
unsigned long".

We were first inspired by commit b0ab99e773 ("sched: Fix possible divide
by zero in avg_atom () calculation"), then refer to the recently analyzed
mm code, we found this suspicious place.

 201                 if (min) {
 202                         min *= this_bw;
 203                         do_div(min, tot_bw);
 204                 }

And we also disassembled and confirmed it:

  /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
  0xffffffff811c37da <__wb_calc_thresh+234>:      xor    %r10d,%r10d
  0xffffffff811c37dd <__wb_calc_thresh+237>:      test   %rax,%rax
  0xffffffff811c37e0 <__wb_calc_thresh+240>:      je 0xffffffff811c3800 <__wb_calc_thresh+272>
  /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
  0xffffffff811c37e2 <__wb_calc_thresh+242>:      imul   %r8,%rax
  /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
  0xffffffff811c37e6 <__wb_calc_thresh+246>:      mov    %r9d,%r10d    ---> truncates it to 32 bits here
  0xffffffff811c37e9 <__wb_calc_thresh+249>:      xor    %edx,%edx
  0xffffffff811c37eb <__wb_calc_thresh+251>:      div    %r10
  0xffffffff811c37ee <__wb_calc_thresh+254>:      imul   %rbx,%rax
  0xffffffff811c37f2 <__wb_calc_thresh+258>:      shr    $0x2,%rax
  0xffffffff811c37f6 <__wb_calc_thresh+262>:      mul    %rcx
  0xffffffff811c37f9 <__wb_calc_thresh+265>:      shr    $0x2,%rdx
  0xffffffff811c37fd <__wb_calc_thresh+269>:      mov    %rdx,%r10

This series uses div64_ul() instead of div_u64() if the divisor is
unsigned long, to avoid truncation to 32-bit on 64-bit platforms.

This patch (of 3):

The variables 'min' and 'max' are unsigned long and do_div truncates
them to 32 bits, which means it can test non-zero and be truncated to
zero for division.  Fix this issue by using div64_ul() instead.

Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
Fixes: 693108a8a6 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-13 18:19:02 -08:00
Tejun Heo
97b27821b4 writeback, memcg: Implement foreign dirty flushing
There's an inherent mismatch between memcg and writeback.  The former
trackes ownership per-page while the latter per-inode.  This was a
deliberate design decision because honoring per-page ownership in the
writeback path is complicated, may lead to higher CPU and IO overheads
and deemed unnecessary given that write-sharing an inode across
different cgroups isn't a common use-case.

Combined with inode majority-writer ownership switching, this works
well enough in most cases but there are some pathological cases.  For
example, let's say there are two cgroups A and B which keep writing to
different but confined parts of the same inode.  B owns the inode and
A's memory is limited far below B's.  A's dirty ratio can rise enough
to trigger balance_dirty_pages() sleeps but B's can be low enough to
avoid triggering background writeback.  A will be slowed down without
a way to make writeback of the dirty pages happen.

This patch implements foreign dirty recording and foreign mechanism so
that when a memcg encounters a condition as above it can trigger
flushes on bdi_writebacks which can clean its pages.  Please see the
comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
details.

A reproducer follows.

write-range.c::

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>

  static const char *usage = "write-range FILE START SIZE\n";

  int main(int argc, char **argv)
  {
	  int fd;
	  unsigned long start, size, end, pos;
	  char *endp;
	  char buf[4096];

	  if (argc < 4) {
		  fprintf(stderr, usage);
		  return 1;
	  }

	  fd = open(argv[1], O_WRONLY);
	  if (fd < 0) {
		  perror("open");
		  return 1;
	  }

	  start = strtoul(argv[2], &endp, 0);
	  if (*endp != '\0') {
		  fprintf(stderr, usage);
		  return 1;
	  }

	  size = strtoul(argv[3], &endp, 0);
	  if (*endp != '\0') {
		  fprintf(stderr, usage);
		  return 1;
	  }

	  end = start + size;

	  while (1) {
		  for (pos = start; pos < end; ) {
			  long bread, bwritten = 0;

			  if (lseek(fd, pos, SEEK_SET) < 0) {
				  perror("lseek");
				  return 1;
			  }

			  bread = read(0, buf, sizeof(buf) < end - pos ?
					       sizeof(buf) : end - pos);
			  if (bread < 0) {
				  perror("read");
				  return 1;
			  }
			  if (bread == 0)
				  return 0;

			  while (bwritten < bread) {
				  long this;

				  this = write(fd, buf + bwritten,
					       bread - bwritten);
				  if (this < 0) {
					  perror("write");
					  return 1;
				  }

				  bwritten += this;
				  pos += bwritten;
			  }
		  }
	  }
  }

repro.sh::

  #!/bin/bash

  set -e
  set -x

  sysctl -w vm.dirty_expire_centisecs=300000
  sysctl -w vm.dirty_writeback_centisecs=300000
  sysctl -w vm.dirtytime_expire_seconds=300000
  echo 3 > /proc/sys/vm/drop_caches

  TEST=/sys/fs/cgroup/test
  A=$TEST/A
  B=$TEST/B

  mkdir -p $A $B
  echo "+memory +io" > $TEST/cgroup.subtree_control
  echo $((1<<30)) > $A/memory.high
  echo $((32<<30)) > $B/memory.high

  rm -f testfile
  touch testfile
  fallocate -l 4G testfile

  echo "Starting B"

  (echo $BASHPID > $B/cgroup.procs
   pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &

  echo "Waiting 10s to ensure B claims the testfile inode"
  sleep 5
  sync
  sleep 5
  sync
  echo "Starting A"

  (echo $BASHPID > $A/cgroup.procs
   pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))

v2: Added comments explaining why the specific intervals are being used.

v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
    flushing while avoding possible livelocks.

v4: Use get_jiffies_64() and time_before/after64() instead of raw
    jiffies_64 and arthimetic comparisons as suggested by Jan.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27 09:22:38 -06:00
Christoph Hellwig
ac1c3e49a9 mm: remove the account_page_dirtied export
account_page_dirtied() is only used by our set_page_dirty() helpers and
should not be used anywhere else.

Link: http://lkml.kernel.org/r/20190605183702.30572-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:42 -07:00
Thomas Gleixner
457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Yafang Shao
19343b5bdd mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
Recently there have been some hung tasks on our server due to
wait_on_page_writeback(), and we want to know the details of this
PG_writeback, i.e.  this page is writing back to which device.  But it is
not so convenient to get the details.

I think it would be better to introduce a tracepoint for diagnosing the
writeback details.

Link: http://lkml.kernel.org/r/1556274402-19018-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:51 -07:00
Mike Rapoport
a862f68a8b docs/core-api/mm: fix return value descriptions in mm/
Many kernel-doc comments in mm/ have the return value descriptions
either misformatted or omitted at all which makes kernel-doc script
unhappy:

$ make V=1 htmldocs
...
./mm/util.c:36: info: Scanning doc for kstrdup
./mm/util.c:41: warning: No description found for return value of 'kstrdup'
./mm/util.c:57: info: Scanning doc for kstrdup_const
./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
./mm/util.c:75: info: Scanning doc for kstrndup
./mm/util.c:83: warning: No description found for return value of 'kstrndup'
...

Fixing the formatting and adding the missing return value descriptions
eliminates ~100 such warnings.

Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:20 -08:00
Brian Foster
3fa750dcf2 mm/page-writeback.c: don't break integrity writeback on ->writepage() error
write_cache_pages() is used in both background and integrity writeback
scenarios by various filesystems.  Background writeback is mostly
concerned with cleaning a certain number of dirty pages based on various
mm heuristics.  It may not write the full set of dirty pages or wait for
I/O to complete.  Integrity writeback is responsible for persisting a set
of dirty pages before the writeback job completes.  For example, an
fsync() call must perform integrity writeback to ensure data is on disk
before the call returns.

write_cache_pages() unconditionally breaks out of its processing loop in
the event of a ->writepage() error.  This is fine for background
writeback, which had no strict requirements and will eventually come
around again.  This can cause problems for integrity writeback on
filesystems that might need to clean up state associated with failed page
writeouts.  For example, XFS performs internal delayed allocation
accounting before returning a ->writepage() error, where applicable.  If
the current writeback happens to be associated with an unmount and
write_cache_pages() completes the writeback prematurely due to error, the
filesystem is unmounted in an inconsistent state if dirty+delalloc pages
still exist.

To handle this problem, update write_cache_pages() to always process the
full set of pages for integrity writeback regardless of ->writepage()
errors.  Save the first encountered error and return it to the caller once
complete.  This facilitates XFS (or any other fs that expects integrity
writeback to process the entire set of dirty pages) to clean up its
internal state completely in the event of persistent mapping errors.
Background writeback continues to exit on the first error encountered.

[akpm@linux-foundation.org: fix typo in comment]
Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:49 -08:00
Linus Torvalds
dad4f140ed Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax
Pull XArray conversion from Matthew Wilcox:
 "The XArray provides an improved interface to the radix tree data
  structure, providing locking as part of the API, specifying GFP flags
  at allocation time, eliminating preloading, less re-walking the tree,
  more efficient iterations and not exposing RCU-protected pointers to
  its users.

  This patch set

   1. Introduces the XArray implementation

   2. Converts the pagecache to use it

   3. Converts memremap to use it

  The page cache is the most complex and important user of the radix
  tree, so converting it was most important. Converting the memremap
  code removes the only other user of the multiorder code, which allows
  us to remove the radix tree code that supported it.

  I have 40+ followup patches to convert many other users of the radix
  tree over to the XArray, but I'd like to get this part in first. The
  other conversions haven't been in linux-next and aren't suitable for
  applying yet, but you can see them in the xarray-conv branch if you're
  interested"

* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
  radix tree: Remove multiorder support
  radix tree test: Convert multiorder tests to XArray
  radix tree tests: Convert item_delete_rcu to XArray
  radix tree tests: Convert item_kill_tree to XArray
  radix tree tests: Move item_insert_order
  radix tree test suite: Remove multiorder benchmarking
  radix tree test suite: Remove __item_insert
  memremap: Convert to XArray
  xarray: Add range store functionality
  xarray: Move multiorder_check to in-kernel tests
  xarray: Move multiorder_shrink to kernel tests
  xarray: Move multiorder account test in-kernel
  radix tree test suite: Convert iteration test to XArray
  radix tree test suite: Convert tag_tagged_items to XArray
  radix tree: Remove radix_tree_clear_tags
  radix tree: Remove radix_tree_maybe_preload_order
  radix tree: Remove split/join code
  radix tree: Remove radix_tree_update_node_t
  page cache: Finish XArray conversion
  dax: Convert page fault handlers to XArray
  ...
2018-10-28 11:35:40 -07:00
Dave Chinner
64081362e8 mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
We've recently seen a workload on XFS filesystems with a repeatable
deadlock between background writeback and a multi-process application
doing concurrent writes and fsyncs to a small range of a file.

range_cyclic
writeback		Process 1		Process 2

xfs_vm_writepages
  write_cache_pages
    writeback_index = 2
    cycled = 0
    ....
    find page 2 dirty
    lock Page 2
    ->writepage
      page 2 writeback
      page 2 clean
      page 2 added to bio
    no more pages
			write()
			locks page 1
			dirties page 1
			locks page 2
			dirties page 1
			fsync()
			....
			xfs_vm_writepages
			write_cache_pages
			  start index 0
			  find page 1 towrite
			  lock Page 1
			  ->writepage
			    page 1 writeback
			    page 1 clean
			    page 1 added to bio
			  find page 2 towrite
			  lock Page 2
			  page 2 is writeback
			  <blocks>
						write()
						locks page 1
						dirties page 1
						fsync()
						....
						xfs_vm_writepages
						write_cache_pages
						  start index 0

    !done && !cycled
      sets index to 0, restarts lookup
    find page 1 dirty
						  find page 1 towrite
						  lock Page 1
						  page 1 is writeback
						  <blocks>

    lock Page 1
    <blocks>

DEADLOCK because:

	- process 1 needs page 2 writeback to complete to make
	  enough progress to issue IO pending for page 1
	- writeback needs page 1 writeback to complete so process 2
	  can progress and unlock the page it is blocked on, then it
	  can issue the IO pending for page 2
	- process 2 can't make progress until process 1 issues IO
	  for page 1

The underlying cause of the problem here is that range_cyclic writeback is
processing pages in descending index order as we hold higher index pages
in a structure controlled from above write_cache_pages().  The
write_cache_pages() caller needs to be able to submit these pages for IO
before write_cache_pages restarts writeback at mapping index 0 to avoid
wcp inverting the page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no private
context held across write_cache_pages() - filesystems using this
infrastructure always submit pages in ->writepage immediately and so there
is no problem with range_cyclic going back to mapping index 0.

However:
	mpage_writepages() has a private bio context,
	exofs_writepages() has page_collect
	fuse_writepages() has fuse_fill_wb_data
	nfs_writepages() has nfs_pageio_descriptor
	xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under writeback
in their private structures until write_cache_pages() returns, and hence
they are all susceptible to this deadlock.

Also worth noting is that ext4 has it's own bastardised version of
write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
at the code long enough to understand that it has a similar retry loop for
range_cyclic writeback reaching the end of the file and then promptly ran
away before my eyes bled too much.  I'll leave it for the ext4 developers
to determine if their code is actually has this deadlock and how to fix it
if it has.

There's a few ways I can see avoid this deadlock.  There's probably more,
but these are the first I've though of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from
writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to
indicate range cyclic has hit EOF. writepages implementations can
then flush the current context and call wpc again to continue. i.e.
lift the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it
skips pages it can't lock without blocking. It will already do this
for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
blocking as per pages under writeback.

I don't think #1 is an option - range_cyclic prevents frequently
dirtied lower file offset from starving background writeback of
rarely touched higher file offsets.

#2 is simple, and I don't think it will have any impact on
performance as going back to the start of the file implies an
immediate seek. We'll have exactly the same number of seeks if we
switch writeback to another inode, and then come back to this one
later and restart from index 0.

#2a is pretty much "status quo without the deadlock". Moving the
retry loop up into the wcp caller means we can issue IO on the
pending pages before calling wcp again, and so avoid locking or
waiting on pages in the wrong order. I'm not convinced we need to do
this given that we get the same thing from #2 on the next writeback
call from the writeback infrastructure.

#3 is really just a band-aid - it doesn't fix the access/wait
inversion problem, just prevents it from becoming a deadlock
situation. I'd prefer we fix the inversion, not sweep it under the
carpet like this.

#3a is really an optimisation that just so happens to include the
band-aid fix of #3.

So it seems that the simplest way to fix this issue is to implement
solution #2

Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:38:14 -07:00
Matthew Wilcox
ff9c745b81 mm: Convert page-writeback to XArray
Includes moving mapping_tagged() to fs.h as a static inline, and
changing it to return bool.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-10-21 10:46:36 -04:00
Mukesh Ojha
13ba17bee1 notifier: Remove notifier header file wherever not used
The conversion of the hotplug notifiers to a state machine left the
notifier.h includes around in some places. Remove them.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
2018-08-30 12:56:40 +02:00
Greg Thelen
dcfe4df3d5 mm/page-writeback.c: update stale account_page_redirty() comment
Commit 93f78d8828 ("writeback: move backing_dev_info->bdi_stat[] into
bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in
account_page_redirty().  Update comment to track that change.

  BDI_DIRTIED => WB_DIRTIED
  BDI_WRITTEN => WB_WRITTEN

Link: http://lkml.kernel.org/r/20180625171526.173483-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Greg Thelen
2e898e4c0a writeback: safer lock nesting
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>	[v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:35 -07:00
Matthew Wilcox
b93b016313 page cache: use xa_lock
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
  Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
Michal Hocko
90daf3062f Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical"
This reverts commit 0f6d24f878 ("mm/page-writeback.c: print a warning
if the vm dirtiness settings are illogical") because it causes false
positive warnings during OOM situations as noticed by Tetsuo Handa:

  Node 0 active_anon:3525940kB inactive_anon:8372kB active_file:216kB inactive_file:1872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2504kB dirty:52kB writeback:0kB shmem:8660kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 636928kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
  Node 0 DMA free:14848kB min:284kB low:352kB high:420kB active_anon:992kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 2687 3645 3645
  Node 0 DMA32 free:53004kB min:49608kB low:62008kB high:74408kB active_anon:2712648kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2773132kB mlocked:0kB kernel_stack:96kB pagetables:5096kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 0 958 958
  Node 0 Normal free:17140kB min:17684kB low:22104kB high:26524kB active_anon:812300kB inactive_anon:8372kB active_file:1228kB inactive_file:1868kB unevictable:0kB writepending:52kB present:1048576kB managed:981224kB mlocked:0kB kernel_stack:3520kB pagetables:8552kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0
  [...]
  Out of memory: Kill process 8459 (a.out) score 999 or sacrifice child
  Killed process 8459 (a.out) total-vm:4180kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 8459 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  vm direct limit must be set greater than background limit.

The problem is that both thresh and bg_thresh will be 0 if
available_memory is less than 4 pages when evaluating
global_dirtyable_memory.

While this might be worked around the whole point of the warning is
dubious at best.  We do rely on admins to do sensible things when
changing tunable knobs.  Dirty memory writeback knobs are not any
special in that regards so revert the warning rather than adding more
hacks to work this around.

Debugged by Yafang Shao.

Link: http://lkml.kernel.org/r/20171127091939.tahb77nznytcxw55@dhcp22.suse.cz
Fixes: 0f6d24f878 ("mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:43 -08:00
Kees Cook
bca237a52c block/laptop_mode: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: linux-block@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:46:44 -08:00
Wang Long
2bce774e82 writeback: remove unused function parameter
The parameter `struct bdi_writeback *wb` is not been used in the
function body.  Remove it.

Link: http://lkml.kernel.org/r/1509685485-15278-1-git-send-email-wanglong19@meituan.com
Signed-off-by: Wang Long <wanglong19@meituan.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:07 -08:00
Mel Gorman
8667982014 mm, pagevec: remove cold parameter for pagevecs
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot.  As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Jan Kara
736304f324 mm: speed up cancel_dirty_page() for clean pages
Patch series "Speed up page cache truncation", v1.

When rebasing our enterprise distro to a newer kernel (from 4.4 to 4.12)
we have noticed a regression in bonnie++ benchmark when deleting files.
Eventually we have tracked this down to a fact that page cache
truncation got slower by about 10%.  There were both gains and losses in
the above interval of kernels but we have been able to identify that
commit 83929372f6 ("filemap: prepare find and delete operations for
huge pages") caused about 10% regression on its own.

After some investigation it didn't seem easily possible to fix the
regression while maintaining the THP in page cache functionality so
we've decided to optimize the page cache truncation path instead to make
up for the change.  This series is a result of that effort.

Patch 1 is an easy speedup of cancel_dirty_page().  Patches 2-6 refactor
page cache truncation code so that it is easier to batch radix tree
operations.  Patch 7 implements batching of deletes from the radix tree
which more than makes up for the original regression.

This patch (of 7):

cancel_dirty_page() does quite some work even for clean pages (fetching
of mapping, locking of memcg, atomic bit op on page flags) so it
accounts for ~2.5% of cost of truncation of a clean page.  That is not
much but still dumb for something we don't need at all.  Check whether a
page is actually dirty and avoid any work if not.

Link: http://lkml.kernel.org/r/20171010151937.26984-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Kees Cook
9823e51bfd mm/page-writeback.c: convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer
to all timer callbacks, switch to using the new timer_setup() and
from_timer() to pass the timer pointer explicitly.

Link: http://lkml.kernel.org/r/20171016225913.GA99214@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:05 -08:00
Jan Kara
67fd707f46 mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.  Just drop the argument.

Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara
2b9775ae42 mm: use pagevec_lookup_range_tag() in write_cache_pages()
Use pagevec_lookup_range_tag() in write_cache_pages() as it is
interested only in pages from given range.  Remove unnecessary code
resulting from this.

Link: http://lkml.kernel.org/r/20171009151359.31984-12-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Yafang Shao
0f6d24f878 mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical
The vm direct limit setting must be set greater than vm background limit
setting.  Otherwise print a warning to help the operator to figure out
that the vm dirtiness settings is in illogical state.

Link: http://lkml.kernel.org/r/1506592464-30962-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Tahsin Erdogan
4c578dce58 mm/page-writeback.c: remove unused parameter from balance_dirty_pages()
"mapping" parameter to balance_dirty_pages() is not used anymore.

Fixes: dfb8ae5678 ("writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback")
Link: http://lkml.kernel.org/r/20170927221311.23263-1-tahsin@google.com
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:02 -08:00
Yafang Shao
515c24c13c mm/page-writeback.c: make changes of dirty_writeback_centisecs take effect immediately
This patch is the followup of the prvious patch:
[writeback: schedule periodic writeback with sysctl].

There's another issue to fix.
For example,
- When the tunable was set to one hour and is reset to one second, the
  new setting will not take effect for up to one hour.

Kicking the flusher threads immediately fixes it.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-14 09:14:35 -06:00
Yafang Shao
94af584692 writeback: schedule periodic writeback with sysctl
After disable periodic writeback by writing 0 to
dirty_writeback_centisecs, the handler wb_workfn() will not be
entered again until the dirty background limit reaches or
sync syscall is executed or no enough free memory available or
vmscan is triggered.

So the periodic writeback can't be enabled by writing a non-zero
value to dirty_writeback_centisecs.
As it can be disabled by sysctl, it should be able to enable by
sysctl as well.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-09 08:27:40 -06:00