Commit Graph

68099 Commits

Author SHA1 Message Date
Alessio Balsini
8a0e4c2b94 FROMLIST: fuse: Fix crediantials leak in passthrough read_iter
If the system doesn't have enough memory when fuse_passthrough_read_iter
is requested in asynchronous IO, an error is directly returned without
restoring the caller's credentials.
Fix by always ensuring credentials are restored.

Fixes: aa29f32988 ("FROMLIST: fuse: Use daemon creds in passthrough mode")
Link: https://lore.kernel.org/lkml/YB0qPHVORq7bJy6G@google.com/
Reported-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Alessio Balsini <balsini@google.com>
Change-Id: I4aff43f5dd8ddab2cc8871cd9f81438963ead5b6
2021-02-05 14:15:09 +00:00
Lokesh Gidra
6a6bc06393 UPSTREAM: userfaultfd: add user-mode only option to unprivileged_userfaultfd sysctl knob
With this change, when the knob is set to 0, it allows unprivileged users
to call userfaultfd, like when it is set to 1, but with the restriction
that page faults from only user-mode can be handled.  In this mode, an
unprivileged user (without SYS_CAP_PTRACE capability) must pass
UFFD_USER_MODE_ONLY to userfaultd or the API will fail with EPERM.

This enables administrators to reduce the likelihood that an attacker with
access to userfaultfd can delay faulting kernel code to widen timing
windows for other exploits.

The default value of this knob is changed to 0.  This is required for
correct functioning of pipe mutex.  However, this will fail postcopy live
migration, which will be unnoticeable to the VM guests.  To avoid this,
set 'vm.userfault = 1' in /sys/sysctl.conf.

The main reason this change is desirable as in the short term is that the
Android userland will behave as with the sysctl set to zero.  So without
this commit, any Linux binary using userfaultfd to manage its memory would
behave differently if run within the Android userland.  For more details,
refer to Andrea's reply [1].

[1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/

Link: https://lkml.kernel.org/r/20201120030411.2690816-3-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: <calin@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d0d4730ac2e404a5b0da9a87ef38c73e51cb1664)

Bug: 160737021
Bug: 169683130
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Change-Id: I08b7080b49ca626c5ab41bb2621fa21fa9a928a2
2021-02-05 13:14:08 +00:00
Lokesh Gidra
b8af1f96cc UPSTREAM: userfaultfd: add UFFD_USER_MODE_ONLY
Patch series "Control over userfaultfd kernel-fault handling", v6.

This patch series is split from [1].  The other series enables SELinux
support for userfaultfd file descriptors so that its creation and movement
can be controlled.

It has been demonstrated on various occasions that suspending kernel code
execution for an arbitrary amount of time at any access to userspace
memory (copy_from_user()/copy_to_user()/...) can be exploited to change
the intended behavior of the kernel.  For instance, handling page faults
in kernel-mode using userfaultfd has been exploited in [2, 3].  Likewise,
FUSE, which is similar to userfaultfd in this respect, has been exploited
in [4, 5] for similar outcome.

This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object.  It then adds a 'user-mode only' option to the
unprivileged_userfaultfd sysctl knob to require unprivileged callers to
use this new flag.

The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to enhance
security vulnerabilities by lengthening the race window in kernel code.

[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808

This patch (of 2):

userfaultfd handles page faults from both user and kernel code.  Add a new
UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
userfaultfd object refuse to handle faults from kernel mode, treating
these faults as if SIGBUS were always raised, causing the kernel code to
fail with EFAULT.

A future patch adds a knob allowing administrators to give some processes
the ability to create userfaultfd file objects only if they pass
UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
exploit userfaultfd's ability to delay kernel page faults to open timing
windows for future exploits.

Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <calin@google.com>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 37cd0575b8510159992d279c530c05f872990b02)

Bug: 160737021
Bug: 169683130
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Change-Id: I19ff309b616c7a4a247e8c8427a87caffb1b2df9
2021-02-05 13:14:00 +00:00
Daniel Colascione
dbc935c62b UPSTREAM: userfaultfd: use secure anon inodes for userfaultfd
This change gives userfaultfd file descriptors a real security
context, allowing policy to act on them.

Signed-off-by: Daniel Colascione <dancol@google.com>
[LG: Remove owner inode from userfaultfd_ctx]
[LG: Use anon_inode_getfd_secure() in userfaultfd syscall]
[LG: Use inode of file in userfaultfd_read() in resolve_userfault_fork()]
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
(cherry picked from commit b537900f1598b67bcb8acac20da73c6e26ebbf99)
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Bug: 160737021
Bug: 169683130
Change-Id: Ib2973ca3650a8defe15eded13294a3fb25356b9d
2021-02-05 11:04:21 +00:00
Daniel Colascione
d7848abe40 UPSTREAM: fs: add LSM-supporting anon-inode interface
This change adds a new function, anon_inode_getfd_secure, that creates
anonymous-node file with individual non-S_PRIVATE inode to which security
modules can apply policy. Existing callers continue using the original
singleton-inode kind of anonymous-inode file. We can transition anonymous
inode users to the new kind of anonymous inode in individual patches for
the sake of bisection and review.

The new function accepts an optional context_inode parameter that callers
can use to provide additional contextual information to security modules.
For example, in case of userfaultfd, the created inode is a 'logical child'
of the context_inode (userfaultfd inode of the parent process) in the sense
that it provides the security context required during creation of the child
process' userfaultfd inode.

Signed-off-by: Daniel Colascione <dancol@google.com>
[LG: Delete obsolete comments to alloc_anon_inode()]
[LG: Add context_inode description in comments to anon_inode_getfd_secure()]
[LG: Remove definition of anon_inode_getfile_secure() as there are no callers]
[LG: Make __anon_inode_getfile() static]
[LG: Use correct error cast in __anon_inode_getfile()]
[LG: Fix error handling in __anon_inode_getfile()]
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
(cherry picked from commit e7e832ce6fa769f800cd7eaebdb0459ad31e0416)
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Bug: 160737021
Bug: 169683130
Change-Id: I3061c599f2951368914a2ca9f56ea60387d42a1d
2021-02-05 11:03:56 +00:00
Greg Kroah-Hartman
cf5b2483a5 Merge 5.10.13 into android12-5.10
Changes in 5.10.13
	iwlwifi: provide gso_type to GSO packets
	nbd: freeze the queue while we're adding connections
	tty: avoid using vfs_iocb_iter_write() for redirected console writes
	ACPI: sysfs: Prefer "compatible" modalias
	ACPI: thermal: Do not call acpi_thermal_check() directly
	kernel: kexec: remove the lock operation of system_transition_mutex
	ALSA: hda/realtek: Enable headset of ASUS B1400CEPE with ALC256
	ALSA: hda/via: Apply the workaround generically for Clevo machines
	parisc: Enable -mlong-calls gcc option by default when !CONFIG_MODULES
	media: cec: add stm32 driver
	media: cedrus: Fix H264 decoding
	media: hantro: Fix reset_raw_fmt initialization
	media: rc: fix timeout handling after switch to microsecond durations
	media: rc: ite-cir: fix min_timeout calculation
	media: rc: ensure that uevent can be read directly after rc device register
	ARM: dts: tbs2910: rename MMC node aliases
	ARM: dts: ux500: Reserve memory carveouts
	ARM: dts: imx6qdl-gw52xx: fix duplicate regulator naming
	wext: fix NULL-ptr-dereference with cfg80211's lack of commit()
	x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled
	ASoC: AMD Renoir - refine DMI entries for some Lenovo products
	Revert "drm/amdgpu/swsmu: drop set_fan_speed_percent (v2)"
	drm/nouveau/kms/gk104-gp1xx: Fix > 64x64 cursors
	drm/i915: Always flush the active worker before returning from the wait
	drm/i915/gt: Always try to reserve GGTT address 0x0
	drivers/nouveau/kms/nv50-: Reject format modifiers for cursor planes
	bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES
	net: usb: qmi_wwan: added support for Thales Cinterion PLSx3 modem family
	s390: uv: Fix sysfs max number of VCPUs reporting
	s390/vfio-ap: No need to disable IRQ after queue reset
	PM: hibernate: flush swap writer after marking
	x86/entry: Emit a symbol for register restoring thunk
	efi/apple-properties: Reinstate support for boolean properties
	crypto: marvel/cesa - Fix tdma descriptor on 64-bit
	drivers: soc: atmel: Avoid calling at91_soc_init on non AT91 SoCs
	drivers: soc: atmel: add null entry at the end of at91_soc_allowed_list[]
	btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
	btrfs: fix possible free space tree corruption with online conversion
	KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[]
	KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()
	KVM: arm64: Filter out v8.1+ events on v8.0 HW
	KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit
	KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX
	KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration
	KVM: x86: get smi pending status correctly
	KVM: Forbid the use of tagged userspace addresses for memslots
	io_uring: fix wqe->lock/completion_lock deadlock
	xen: Fix XenStore initialisation for XS_LOCAL
	leds: trigger: fix potential deadlock with libata
	arm64: dts: broadcom: Fix USB DMA address translation for Stingray
	mt7601u: fix kernel crash unplugging the device
	mt76: mt7663s: fix rx buffer refcounting
	mt7601u: fix rx buffer refcounting
	iwlwifi: Fix IWL_SUBDEVICE_NO_160 macro to use the correct bit.
	drm/i915/gt: Clear CACHE_MODE prior to clearing residuals
	drm/i915/pmu: Don't grab wakeref when enabling events
	net/mlx5e: Fix IPSEC stats
	ARM: dts: imx6qdl-kontron-samx6i: fix pwms for lcd-backlight
	drm/nouveau/svm: fail NOUVEAU_SVM_INIT ioctl on unsupported devices
	drm/vc4: Correct lbm size and calculation
	drm/vc4: Correct POS1_SCL for hvs5
	drm/nouveau/dispnv50: Restore pushing of all data.
	drm/i915: Check for all subplatform bits
	drm/i915/selftest: Fix potential memory leak
	uapi: fix big endian definition of ipv6_rpl_sr_hdr
	KVM: Documentation: Fix spec for KVM_CAP_ENABLE_CAP_VM
	tee: optee: replace might_sleep with cond_resched
	xen-blkfront: allow discard-* nodes to be optional
	blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue
	clk: imx: fix Kconfig warning for i.MX SCU clk
	clk: mmp2: fix build without CONFIG_PM
	clk: qcom: gcc-sm250: Use floor ops for sdcc clks
	ARM: imx: build suspend-imx6.S with arm instruction set
	ARM: zImage: atags_to_fdt: Fix node names on added root nodes
	netfilter: nft_dynset: add timeout extension to template
	Revert "RDMA/mlx5: Fix devlink deadlock on net namespace deletion"
	Revert "block: simplify set_init_blocksize" to regain lost performance
	xfrm: Fix oops in xfrm_replay_advance_bmp
	xfrm: fix disable_xfrm sysctl when used on xfrm interfaces
	selftests: xfrm: fix test return value override issue in xfrm_policy.sh
	xfrm: Fix wraparound in xfrm_policy_addr_delta()
	arm64: dts: ls1028a: fix the offset of the reset register
	ARM: imx: fix imx8m dependencies
	ARM: dts: imx6qdl-kontron-samx6i: fix i2c_lcd/cam default status
	ARM: dts: imx6qdl-sr-som: fix some cubox-i platforms
	arm64: dts: imx8mp: Correct the gpio ranges of gpio3
	firmware: imx: select SOC_BUS to fix firmware build
	RDMA/cxgb4: Fix the reported max_recv_sge value
	ASoC: dt-bindings: lpass: Fix and common up lpass dai ids
	ASoC: qcom: Fix incorrect volatile registers
	ASoC: qcom: Fix broken support to MI2S TERTIARY and QUATERNARY
	ASoC: qcom: lpass-ipq806x: fix bitwidth regmap field
	spi: altera: Fix memory leak on error path
	ASoC: Intel: Skylake: skl-topology: Fix OOPs ib skl_tplg_complete
	powerpc/64s: prevent recursive replay_soft_interrupts causing superfluous interrupt
	pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
	pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
	ASoC: SOF: Intel: soundwire: fix select/depend unmet dependencies
	ASoC: qcom: lpass: Fix out-of-bounds DAI ID lookup
	iwlwifi: pcie: avoid potential PNVM leaks
	iwlwifi: pnvm: don't skip everything when not reloading
	iwlwifi: pnvm: don't try to load after failures
	iwlwifi: pcie: set LTR on more devices
	iwlwifi: pcie: use jiffies for memory read spin time limit
	iwlwifi: pcie: reschedule in long-running memory reads
	mac80211: pause TX while changing interface type
	ice: fix FDir IPv6 flexbyte
	ice: Implement flow for IPv6 next header (extension header)
	ice: update dev_addr in ice_set_mac_address even if HW filter exists
	ice: Don't allow more channels than LAN MSI-X available
	ice: Fix MSI-X vector fallback logic
	i40e: acquire VSI pointer only after VF is initialized
	igc: fix link speed advertising
	net/mlx5: Fix memory leak on flow table creation error flow
	net/mlx5e: E-switch, Fix rate calculation for overflow
	net/mlx5e: free page before return
	net/mlx5e: Reduce tc unsupported key print level
	net/mlx5: Maintain separate page trees for ECPF and PF functions
	net/mlx5e: Disable hw-tc-offload when MLX5_CLS_ACT config is disabled
	net/mlx5e: Fix CT rule + encap slow path offload and deletion
	net/mlx5e: Correctly handle changing the number of queues when the interface is down
	net/mlx5e: Revert parameters on errors when changing trust state without reset
	net/mlx5e: Revert parameters on errors when changing MTU and LRO state without reset
	net/mlx5: CT: Fix incorrect removal of tuple_nat_node from nat rhashtable
	can: dev: prevent potential information leak in can_fill_info()
	ACPI/IORT: Do not blindly trust DMA masks from firmware
	of/device: Update dma_range_map only when dev has valid dma-ranges
	iommu/amd: Use IVHD EFR for early initialization of IOMMU features
	iommu/vt-d: Correctly check addr alignment in qi_flush_dev_iotlb_pasid()
	nvme-multipath: Early exit if no path is available
	selftests: forwarding: Specify interface when invoking mausezahn
	rxrpc: Fix memory leak in rxrpc_lookup_local
	NFC: fix resource leak when target index is invalid
	NFC: fix possible resource leak
	ASoC: mediatek: mt8183-da7219: ignore TDM DAI link by default
	ASoC: mediatek: mt8183-mt6358: ignore TDM DAI link by default
	ASoC: topology: Properly unregister DAI on removal
	ASoC: topology: Fix memory corruption in soc_tplg_denum_create_values()
	scsi: qla2xxx: Fix description for parameter ql2xenforce_iocb_limit
	team: protect features update by RCU to avoid deadlock
	tcp: make TCP_USER_TIMEOUT accurate for zero window probes
	tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN
	vsock: fix the race conditions in multi-transport support
	Linux 5.10.13

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I75f419b25f24da559e446d62f75ce6bb9b0a5396
2021-02-05 10:38:34 +01:00
Trond Myklebust
d46c0d64db pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
[ Upstream commit 1bcf34fdac5f8c2fcd16796495db75744612ca27 ]

When we're scheduling a layoutreturn, we need to ignore any further
incoming layouts with sequence ids that are going to be affected by the
layout return.

Fixes: 44ea8dfce0 ("NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-03 23:28:47 +01:00
Trond Myklebust
dba0d4b150 pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
[ Upstream commit 814b84971388cd5fb182f2e914265b3827758455 ]

If the server returns a new stateid that does not match the one in our
cache, then pnfs_layout_process() will leak the layout segments returned
by pnfs_mark_layout_stateid_invalid().

Fixes: 9888d837f3 ("pNFS: Force a retry of LAYOUTGET if the stateid doesn't match our cache")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-03 23:28:47 +01:00
Maxim Mikityanskiy
f39005edf5 Revert "block: simplify set_init_blocksize" to regain lost performance
commit 8dc932d3e8afb65e12eba7495f046c83884c49bf upstream.

The cited commit introduced a serious regression with SATA write speed,
as found by bisecting. This patch reverts this commit, which restores
write speed back to the values observed before this commit.

The performance tests were done on a Helios4 NAS (2nd batch) with 4 HDDs
(WD8003FFBX) using dd (bs=1M count=2000). "Direct" is a test with a
single HDD, the rest are different RAID levels built over the first
partitions of 4 HDDs. Test results are in MB/s, R is read, W is write.

                | Direct | RAID0 | RAID10 f2 | RAID10 n2 | RAID6
----------------+--------+-------+-----------+-----------+--------
9011495c94    | R:256  | R:313 | R:276     | R:313     | R:323
(before faulty) | W:254  | W:253 | W:195     | W:204     | W:117
----------------+--------+-------+-----------+-----------+--------
5ff9f19231    | R:257  | R:398 | R:312     | R:344     | R:391
(faulty commit) | W:154  | W:122 | W:67.7    | W:66.6    | W:67.2
----------------+--------+-------+-----------+-----------+--------
5.10.10         | R:256  | R:401 | R:312     | R:356     | R:375
unpatched       | W:149  | W:123 | W:64      | W:64.1    | W:61.5
----------------+--------+-------+-----------+-----------+--------
5.10.10         | R:255  | R:396 | R:312     | R:340     | R:393
patched         | W:247  | W:274 | W:220     | W:225     | W:121

Applying this patch doesn't hurt read performance, while improves the
write speed by 1.5x - 3.5x (more impact on RAID tests). The write speed
is restored back to the state before the faulty commit, and even a bit
higher in RAID tests (which aren't HDD-bound on this device) - that is
likely related to other optimizations done between the faulty commit and
5.10.10 which also improved the read speed.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Fixes: 5ff9f19231 ("block: simplify set_init_blocksize")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03 23:28:45 +01:00
Pavel Begunkov
bc79ff0b1a io_uring: fix wqe->lock/completion_lock deadlock
commit 907d1df30a51cc1a1d25414a00cde0494b83df7b upstream.

Joseph reports following deadlock:

CPU0:
...
io_kill_linked_timeout  // &ctx->completion_lock
io_commit_cqring
__io_queue_deferred
__io_queue_async_work
io_wq_enqueue
io_wqe_enqueue  // &wqe->lock

CPU1:
...
__io_uring_files_cancel
io_wq_cancel_cb
io_wqe_cancel_pending_work  // &wqe->lock
io_cancel_task_cb  // &ctx->completion_lock

Only __io_queue_deferred() calls queue_async_work() while holding
ctx->completion_lock, enqueue drained requests via io_req_task_queue()
instead.

Cc: stable@vger.kernel.org # 5.9+
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03 23:28:41 +01:00
Josef Bacik
2175bf57dc btrfs: fix possible free space tree corruption with online conversion
commit 2f96e40212d435b328459ba6b3956395eed8fa9f upstream.

While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.

This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same.  During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups.  While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.

Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method.  We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.

CC: stable@vger.kernel.org # 4.9+
Fixes: a5ed918285 ("Btrfs: implement the free space B-tree")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03 23:28:40 +01:00
Su Yue
f343bf1aaf btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
commit c41ec4529d3448df8998950d7bada757a1b321cf upstream.

This effectively reverts commit d5c8238849 ("btrfs: convert
data_seqcount to seqcount_mutex_t").

While running fstests on 32 bits test box, many tests failed because of
warnings in dmesg. One of those warnings (btrfs/003):

  [66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
  [66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G           O      5.11.0-rc4-custom+ #5
  [66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
  [66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
  [66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
  [66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
  [66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
  [66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
  [66.441485] Call Trace:
  [66.441510]  btrfs_relocate_chunk+0xb1/0x100 [btrfs]
  [66.441529]  ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
  [66.441562]  btrfs_balance+0x8ed/0x13b0 [btrfs]
  [66.441586]  ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
  [66.441619]  ? __this_cpu_preempt_check+0xf/0x11
  [66.441643]  btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
  [66.441664]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [66.441683]  btrfs_ioctl+0x414/0x2ae0 [btrfs]
  [66.441700]  ? __lock_acquire+0x35f/0x2650
  [66.441717]  ? lockdep_hardirqs_on+0x87/0x120
  [66.441720]  ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
  [66.441724]  ? call_rcu+0x2d3/0x530
  [66.441731]  ? __might_fault+0x41/0x90
  [66.441736]  ? kvm_sched_clock_read+0x15/0x50
  [66.441740]  ? sched_clock+0x8/0x10
  [66.441745]  ? sched_clock_cpu+0x13/0x180
  [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [66.441768]  __ia32_sys_ioctl+0x165/0x8a0
  [66.441773]  ? __this_cpu_preempt_check+0xf/0x11
  [66.441785]  ? __might_fault+0x89/0x90
  [66.441791]  __do_fast_syscall_32+0x54/0x80
  [66.441796]  do_fast_syscall_32+0x32/0x70
  [66.441801]  do_SYSENTER_32+0x15/0x20
  [66.441805]  entry_SYSENTER_32+0x9f/0xf2
  [66.441808] EIP: 0xab7b5549
  [66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
  [66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
  [66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
  [66.441833] irq event stamp: 42579
  [66.441835] hardirqs last  enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
  [66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
  [66.441840] softirqs last  enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
  [66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60

  ========================================================================
  btrfs_remove_chunk+0x58b/0x7b0:
  __seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
  (inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
  (inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
  ========================================================================

The warning is produced by lockdep_assert_held() in
__seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
And "olumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock
fs_info->chunk_mutex held already.

After adding some debug prints, the cause was found that many
__alloc_device() are called with NULL @fs_info (during scanning ioctl).
Inside the function, btrfs_device_data_ordered_init() is expanded to
seqcount_mutex_init().  In this scenario, its second
parameter info->chunk_mutex  is &NULL->chunk_mutex which equals
to offsetof(struct btrfs_fs_info, chunk_mutex) unexpectedly. Thus,
seqcount_mutex_init() is called in wrong way. And later
btrfs_device_get/set helpers trigger lockdep warnings.

The device and filesystem object lifetimes are different and we'd have
to synchronize initialization of the btrfs_device::data_seqcount with
the fs_info, possibly using some additional synchronization. It would
still not prevent concurrent access to the seqcount lock when it's used
for read and initialization.

Commit d5c8238849 ("btrfs: convert data_seqcount to seqcount_mutex_t")
does not mention a particular problem being fixed so revert should not
cause any harm and we'll get the lockdep warning fixed.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139
Reported-by: Erhard F <erhard_f@mailbox.org>
Fixes: d5c8238849 ("btrfs: convert data_seqcount to seqcount_mutex_t")
CC: stable@vger.kernel.org # 5.10
CC: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03 23:28:40 +01:00
Alistair Delva
cf8f7947f2 ANDROID: Add filp_open_block() for zram
We currently plan to disallow use of filp_open() from drivers in GKI,
however the ZRAM driver still needs it. Add a new GKI-only variant of
filp_open() which only permits a block device to be opened, which can
be exported instead. This keeps ZRAM working but cuts down on drivers
that attempt to open and write files in kernel mode.

Bug: 179220339
Change-Id: Id696b4aaf204b0499ce0a1b6416648670236e570
Signed-off-by: Alistair Delva <adelva@google.com>
2021-02-03 18:27:38 +00:00
Greg Kroah-Hartman
39564d70ad This is the 5.10.12 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmAVV8MACgkQONu9yGCS
 aT5EyxAAkIKzfWDIgtxBBui9zNb4af0nik/4Fv+0ynvvMFIJ+9OEh8vrrzASze3E
 6w8E5c1TxP2iXiW0/NQqU2UWmdVzO85zAeGMjZGSgzn4AtbZrBd8FIk3g5aNzGEJ
 xuqlVm+VOmdQ30Lr+yIOE/xwGDhGy+4cCMBQqGdMWk3Bnsk2QHBzSzyLZOJiK8M1
 9qTyMvtUdIVDFw5rqWQgtfNkcCfk7dMfjmD1bFVSFiJCnJbHE2Yr8y2MscSeLZ1V
 csBmg6K/JgEZFJFVamFKfGkAKQp2nI6YIUm3K0oJhp9BYYECJaH0irnkrT5F8rU8
 RBvxW+9E+SOmrHoEo9RTfGDnvU0hOrZolmPmj71puT6vHzw/S2npoAanWX+nWD6j
 dVTT77TKaSovmqp7+Lt9djsb3E9WzKHlIBJIcgcy/uyMpsllmHt6GROYBIa5gFJk
 LZY6zFrG9l04RYICBuuD6XNcqP56H/WnhBB8us3X5ui5x/3fI+RFBhf/UOXzxUnB
 KcBzRLCUFugvPdKeXGmjn0FCrj1vpj1/cbqLbDvETq9nF8qp/sXjPHbDpvNHyBOR
 MpzFgWnNrg2pYlJHidxpj2gog8jvEEdtOHeVW16HpVsvwMClJVcgaBF3US5mT8Zy
 nNohKtYPx6XjdddDb41NZsWxPHizN7FGnFeJOTZpH0YjNpTNS6c=
 =etoA
 -----END PGP SIGNATURE-----

Merge 5.10.12 into android12-5.10

Changes in 5.10.12
	gpio: mvebu: fix pwm .get_state period calculation
	Revert "mm/slub: fix a memory leak in sysfs_slab_add()"
	futex: Ensure the correct return value from futex_lock_pi()
	futex: Replace pointless printk in fixup_owner()
	futex: Provide and use pi_state_update_owner()
	rtmutex: Remove unused argument from rt_mutex_proxy_unlock()
	futex: Use pi_state_update_owner() in put_pi_state()
	futex: Simplify fixup_pi_state_owner()
	futex: Handle faults correctly for PI futexes
	HID: wacom: Correct NULL dereference on AES pen proximity
	HID: multitouch: Apply MT_QUIRK_CONFIDENCE quirk for multi-input devices
	media: Revert "media: videobuf2: Fix length check for single plane dmabuf queueing"
	media: v4l2-subdev.h: BIT() is not available in userspace
	RDMA/vmw_pvrdma: Fix network_hdr_type reported in WC
	iwlwifi: dbg: Don't touch the tlv data
	kernel/io_uring: cancel io_uring before task works
	io_uring: inline io_uring_attempt_task_drop()
	io_uring: add warn_once for io_uring_flush()
	io_uring: stop SQPOLL submit on creator's death
	io_uring: fix null-deref in io_disable_sqo_submit
	io_uring: do sqo disable on install_fd error
	io_uring: fix false positive sqo warning on flush
	io_uring: fix uring_flush in exit_files() warning
	io_uring: fix skipping disabling sqo on exec
	io_uring: dont kill fasync under completion_lock
	io_uring: fix sleeping under spin in __io_clean_op
	objtool: Don't fail on missing symbol table
	mm/page_alloc: add a missing mm_page_alloc_zone_locked() tracepoint
	mm: fix a race on nr_swap_pages
	tools: Factor HOSTCC, HOSTLD, HOSTAR definitions
	printk: fix buffer overflow potential for print_text()
	printk: fix string termination for record_print_text()
	Linux 5.10.12

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6d96ec78494ebbc0daf4fdecfc13e522c6bd6b42
2021-01-30 14:29:02 +01:00
Pavel Begunkov
d92d00861e io_uring: fix sleeping under spin in __io_clean_op
[ Upstream commit 9d5c8190683a462dbc787658467a0da17011ea5f ]

[   27.629441] BUG: sleeping function called from invalid context
	at fs/file.c:402
[   27.631317] in_atomic(): 1, irqs_disabled(): 1, non_block: 0,
	pid: 1012, name: io_wqe_worker-0
[   27.633220] 1 lock held by io_wqe_worker-0/1012:
[   27.634286]  #0: ffff888105e26c98 (&ctx->completion_lock)
	{....}-{2:2}, at: __io_req_complete.part.102+0x30/0x70
[   27.649249] Call Trace:
[   27.649874]  dump_stack+0xac/0xe3
[   27.650666]  ___might_sleep+0x284/0x2c0
[   27.651566]  put_files_struct+0xb8/0x120
[   27.652481]  __io_clean_op+0x10c/0x2a0
[   27.653362]  __io_cqring_fill_event+0x2c1/0x350
[   27.654399]  __io_req_complete.part.102+0x41/0x70
[   27.655464]  io_openat2+0x151/0x300
[   27.656297]  io_issue_sqe+0x6c/0x14e0
[   27.660991]  io_wq_submit_work+0x7f/0x240
[   27.662890]  io_worker_handle_work+0x501/0x8a0
[   27.664836]  io_wqe_worker+0x158/0x520
[   27.667726]  kthread+0x134/0x180
[   27.669641]  ret_from_fork+0x1f/0x30

Instead of cleaning files on overflow, return back overflow cancellation
into io_uring_cancel_files(). Previously it was racy to clean
REQ_F_OVERFLOW flag, but we got rid of it, and can do it through
repetitive attempts targeting all matching requests.

Cc: stable@vger.kernel.org # 5.9+
Reported-by: Abaci <abaci@linux.alibaba.com>
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:19 +01:00
Pavel Begunkov
7bccd1c191 io_uring: dont kill fasync under completion_lock
[ Upstream commit 4aa84f2ffa81f71e15e5cffc2cc6090dbee78f8e ]

      CPU0                    CPU1
       ----                    ----
  lock(&new->fa_lock);
                               local_irq_disable();
                               lock(&ctx->completion_lock);
                               lock(&new->fa_lock);
  <Interrupt>
    lock(&ctx->completion_lock);

 *** DEADLOCK ***

Move kill_fasync() out of io_commit_cqring() to io_cqring_ev_posted(),
so it doesn't hold completion_lock while doing it. That saves from the
reported deadlock, and it's just nice to shorten the locking time and
untangle nested locks (compl_lock -> wq_head::lock).

Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+91ca3f25bd7f795f019c@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:19 +01:00
Pavel Begunkov
186725a80c io_uring: fix skipping disabling sqo on exec
[ Upstream commit 0b5cd6c32b14413bf87e10ee62be3162588dcbe6 ]

If there are no requests at the time __io_uring_task_cancel() is called,
tctx_inflight() returns zero and and it terminates not getting a chance
to go through __io_uring_files_cancel() and do
io_disable_sqo_submit(). And we absolutely want them disabled by the
time cancellation ends.

Cc: stable@vger.kernel.org # 5.5+
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: d9d05217cb69 ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:19 +01:00
Pavel Begunkov
54b4c4f9ab io_uring: fix uring_flush in exit_files() warning
[ Upstream commit 4325cb498cb743dacaa3edbec398c5255f476ef6 ]

WARNING: CPU: 1 PID: 11100 at fs/io_uring.c:9096
	io_uring_flush+0x326/0x3a0 fs/io_uring.c:9096
RIP: 0010:io_uring_flush+0x326/0x3a0 fs/io_uring.c:9096
Call Trace:
 filp_close+0xb4/0x170 fs/open.c:1280
 close_files fs/file.c:401 [inline]
 put_files_struct fs/file.c:416 [inline]
 put_files_struct+0x1cc/0x350 fs/file.c:413
 exit_files+0x7e/0xa0 fs/file.c:433
 do_exit+0xc22/0x2ae0 kernel/exit.c:820
 do_group_exit+0x125/0x310 kernel/exit.c:922
 get_signal+0x3e9/0x20a0 kernel/signal.c:2770
 arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
 handle_signal_work kernel/entry/common.c:147 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:201
 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
 syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

An SQPOLL ring creator task may have gotten rid of its file note during
exit and called io_disable_sqo_submit(), but the io_uring is still left
referenced through fdtable, which will be put during close_files() and
cause a false positive warning.

First split the warning into two for more clarity when is hit, and the
add sqo_dead check to handle the described case.

Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+a32b546d58dde07875a1@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:19 +01:00
Pavel Begunkov
0682759126 io_uring: fix false positive sqo warning on flush
[ Upstream commit 6b393a1ff1746a1c91bd95cbb2d79b104d8f15ac ]

WARNING: CPU: 1 PID: 9094 at fs/io_uring.c:8884
	io_disable_sqo_submit+0x106/0x130 fs/io_uring.c:8884
Call Trace:
 io_uring_flush+0x28b/0x3a0 fs/io_uring.c:9099
 filp_close+0xb4/0x170 fs/open.c:1280
 close_fd+0x5c/0x80 fs/file.c:626
 __do_sys_close fs/open.c:1299 [inline]
 __se_sys_close fs/open.c:1297 [inline]
 __x64_sys_close+0x2f/0xa0 fs/open.c:1297
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

io_uring's final close() may be triggered by any task not only the
creator. It's well handled by io_uring_flush() including SQPOLL case,
though a warning in io_disable_sqo_submit() will fallaciously fire by
moving this warning out to the only call site that matters.

Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+2f5d1785dc624932da78@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:19 +01:00
Pavel Begunkov
8cb6f4da83 io_uring: do sqo disable on install_fd error
[ Upstream commit 06585c497b55045ec21aa8128e340f6a6587351c ]

WARNING: CPU: 0 PID: 8494 at fs/io_uring.c:8717
	io_ring_ctx_wait_and_kill+0x4f2/0x600 fs/io_uring.c:8717
Call Trace:
 io_uring_release+0x3e/0x50 fs/io_uring.c:8759
 __fput+0x283/0x920 fs/file_table.c:280
 task_work_run+0xdd/0x190 kernel/task_work.c:140
 tracehook_notify_resume include/linux/tracehook.h:189 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
 exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201
 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
 syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

failed io_uring_install_fd() is a special case, we don't do
io_ring_ctx_wait_and_kill() directly but defer it to fput, though still
need to io_disable_sqo_submit() before.

note: it doesn't fix any real problem, just a warning. That's because
sqring won't be available to the userspace in this case and so SQPOLL
won't submit anything.

Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+9c9c35374c0ecac06516@syzkaller.appspotmail.com
Fixes: d9d05217cb69 ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:18 +01:00
Pavel Begunkov
0e3562e3b2 io_uring: fix null-deref in io_disable_sqo_submit
[ Upstream commit b4411616c26f26c4017b8fa4d3538b1a02028733 ]

general protection fault, probably for non-canonical address
	0xdffffc0000000022: 0000 [#1] KASAN: null-ptr-deref
	in range [0x0000000000000110-0x0000000000000117]
RIP: 0010:io_ring_set_wakeup_flag fs/io_uring.c:6929 [inline]
RIP: 0010:io_disable_sqo_submit+0xdb/0x130 fs/io_uring.c:8891
Call Trace:
 io_uring_create fs/io_uring.c:9711 [inline]
 io_uring_setup+0x12b1/0x38e0 fs/io_uring.c:9739
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

io_disable_sqo_submit() might be called before user rings were
allocated, don't do io_ring_set_wakeup_flag() in those cases.

Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+ab412638aeb652ded540@syzkaller.appspotmail.com
Fixes: d9d05217cb69 ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:18 +01:00
Pavel Begunkov
a63d915757 io_uring: stop SQPOLL submit on creator's death
[ Upstream commit d9d05217cb6990b9a56e13b56e7a1b71e2551f6c ]

When the creator of SQPOLL io_uring dies (i.e. sqo_task), we don't want
its internals like ->files and ->mm to be poked by the SQPOLL task, it
have never been nice and recently got racy. That can happen when the
owner undergoes destruction and SQPOLL tasks tries to submit new
requests in parallel, and so calls io_sq_thread_acquire*().

That patch halts SQPOLL submissions when sqo_task dies by introducing
sqo_dead flag. Once set, the SQPOLL task must not do any submission,
which is synchronised by uring_lock as well as the new flag.

The tricky part is to make sure that disabling always happens, that
means either the ring is discovered by creator's do_exit() -> cancel,
or if the final close() happens before it's done by the creator. The
last is guaranteed by the fact that for SQPOLL the creator task and only
it holds exactly one file note, so either it pins up to do_exit() or
removed by the creator on the final put in flush. (see comments in
uring_flush() around file->f_count == 2).

One more place that can trigger io_sq_thread_acquire_*() is
__io_req_task_submit(). Shoot off requests on sqo_dead there, even
though actually we don't need to. That's because cancellation of
sqo_task should wait for the request before going any further.

note 1: io_disable_sqo_submit() does io_ring_set_wakeup_flag() so the
caller would enter the ring to get an error, but it still doesn't
guarantee that the flag won't be cleared.

note 2: if final __userspace__ close happens not from the creator
task, the file note will pin the ring until the task dies.

Cc: stable@vger.kernel.org # 5.5+
Fixed: b1b6b5a30dce8 ("kernel/io_uring: cancel io_uring before task works")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:18 +01:00
Pavel Begunkov
da67631a33 io_uring: add warn_once for io_uring_flush()
[ Upstream commit 6b5733eb638b7068ab7cb34e663b55a1d1892d85]

files_cancel() should cancel all relevant requests and drop file notes,
so we should never have file notes after that, including on-exit fput
and flush. Add a WARN_ONCE to be sure.

Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:18 +01:00
Pavel Begunkov
18f31594ee io_uring: inline io_uring_attempt_task_drop()
[ Upstream commit 4f793dc40bc605b97624fd36baf085b3c35e8bfd ]

A simple preparation change inlining io_uring_attempt_task_drop() into
io_uring_flush().

Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:18 +01:00
Pavel Begunkov
7bf3fb6243 kernel/io_uring: cancel io_uring before task works
[ Upstream commit b1b6b5a30dce872f500dc43f067cba8e7f86fc7d ]

For cancelling io_uring requests it needs either to be able to run
currently enqueued task_works or having it shut down by that moment.
Otherwise io_uring_cancel_files() may be waiting for requests that won't
ever complete.

Go with the first way and do cancellations before setting PF_EXITING and
so before putting the task_work infrastructure into a transition state
where task_work_run() would better not be called.

Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-30 13:55:18 +01:00
Jaegeuk Kim
9c694f73e7 FROMGIT: f2fs: flush data when enabling checkpoint back
During checkpoint=disable period, f2fs bypasses all the synchronous IOs such as
sync and fsync. So, when enabling it back, we must flush all of them in order
to keep the data persistent. Otherwise, suddern power-cut right after enabling
checkpoint will cause data loss.

Bug: 171063590
Fixes: 4354994f09 ("f2fs: checkpoint disabling")
Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(cherry picked from commit 8d52dbb373579b48f5758dd0cdd2ac0fb4e5be7f git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev)
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
Change-Id: Iaca2d6fc1841fffa8677d5d592732c94241fb3fb
2021-01-28 18:04:28 +00:00
Laura Abbott
7d212a5102 FROMLIST: fs/buffer.c: Revoke LRU when trying to drop buffers
When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, however this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratible for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these types of failures make CMA significantly
less reliable, especially under high filesystem usage.

Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indicates to
prevent dropping.

Note: a bug reported by "kernel test robot" lead to a switch from
using xas_for_each() to xa_for_each().

Bug: 174118021
Link: https://lore.kernel.org/linux-fsdevel/cover.1611642038.git.cgoldswo@codeaurora.org/
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Cc: Matthew Wilcox <willy@infradead.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Change-Id: I561fa3ac7e8874e27d4ad8e1d62ab62e18dd419c
2021-01-28 09:49:23 +00:00
Greg Kroah-Hartman
ba152773be This is the 5.10.11 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmARRrcACgkQONu9yGCS
 aT7n8xAA0lDCM7V1v/EAYw+7ZN1HNXCZhkOZzwrbFz2SSRnu0UGRbA6B9tcdHWA4
 FX3ZgNcEaut3HgT4TYa3xZfhq6tB4w1vfHXyvcuDlgVv+0+LGbPaZiH/pq6wIE00
 s3buqbtasRIQw4UEokos6ZJCiXHr9EmbCauQYViSf8YWR5tDjbIaqUB2TI/Cw4Io
 xYYqpnHJ51KItNvOPaFoUWYA6zrD7R963aAf+vmvBfDrgcs0UgH5OTO0XY6FCYKR
 ZgVMX/1pioWTAF8TMq2pzU55FP7SlohlcISC8NQrygq8JrKHVmd9c5U/MTd4ytaz
 nBaEizEA8Ewz3eTSSED173jJ1OMA06prfx1NQ4fO90U11eTj+IG9qkLePsB7UBe8
 7FEJ8lTKgWDDwC7UzctHaI7mIyzAX5EiItRjjlln++bnQZNjjd/TBVGqXu4FAGBP
 0yBamdBrNVLVmjJQ7Wudt2iZU/gMkt9R+frq24IZLjW+5XB3L3G3S3vAm+8IYthV
 OVh23SfD/w2kPJcvw4z8ZYeOigyDd7CgoLeVwrdH6jM9N1gEEElkt6gsJugPhnxH
 odRRfUaRp+3j6M6ZCf8ai2QzsE0l3Uu+mJHdRaWiXOMmbsIuDFrYW/iNlLyHy8i8
 EZ01/chEyjs7ek8TNdkbhpHow3wI3Uvpczt/BG7OxH/F+DpTmkE=
 =tvv2
 -----END PGP SIGNATURE-----

Merge 5.10.11 into android12-5.10

Changes in 5.10.11
	scsi: target: tcmu: Fix use-after-free of se_cmd->priv
	mtd: rawnand: gpmi: fix dst bit offset when extracting raw payload
	mtd: rawnand: nandsim: Fix the logic when selecting Hamming soft ECC engine
	i2c: tegra: Wait for config load atomically while in ISR
	i2c: bpmp-tegra: Ignore unknown I2C_M flags
	platform/x86: i2c-multi-instantiate: Don't create platform device for INT3515 ACPI nodes
	platform/x86: ideapad-laptop: Disable touchpad_switch for ELAN0634
	ALSA: seq: oss: Fix missing error check in snd_seq_oss_synth_make_info()
	ALSA: hda/realtek - Limit int mic boost on Acer Aspire E5-575T
	ALSA: hda/via: Add minimum mute flag
	crypto: xor - Fix divide error in do_xor_speed()
	dm crypt: fix copy and paste bug in crypt_alloc_req_aead
	ACPI: scan: Make acpi_bus_get_device() clear return pointer on error
	btrfs: don't get an EINTR during drop_snapshot for reloc
	btrfs: do not double free backref nodes on error
	btrfs: fix lockdep splat in btrfs_recover_relocation
	btrfs: don't clear ret in btrfs_start_dirty_block_groups
	btrfs: send: fix invalid clone operations when cloning from the same file and root
	fs: fix lazytime expiration handling in __writeback_single_inode()
	pinctrl: ingenic: Fix JZ4760 support
	mmc: core: don't initialize block size from ext_csd if not present
	mmc: sdhci-of-dwcmshc: fix rpmb access
	mmc: sdhci-xenon: fix 1.8v regulator stabilization
	mmc: sdhci-brcmstb: Fix mmc timeout errors on S5 suspend
	dm: avoid filesystem lookup in dm_get_dev_t()
	dm integrity: fix a crash if "recalculate" used without "internal_hash"
	dm integrity: conditionally disable "recalculate" feature
	drm/atomic: put state on error path
	drm/syncobj: Fix use-after-free
	drm/amdgpu: remove gpu info firmware of green sardine
	drm/amd/display: DCN2X Find Secondary Pipe properly in MPO + ODM Case
	drm/i915/gt: Prevent use of engine->wa_ctx after error
	drm/i915: Check for rq->hwsp validity after acquiring RCU lock
	ASoC: Intel: haswell: Add missing pm_ops
	ASoC: rt711: mutex between calibration and power state changes
	SUNRPC: Handle TCP socket sends with kernel_sendpage() again
	HID: multitouch: Enable multi-input for Synaptics pointstick/touchpad device
	HID: sony: select CONFIG_CRC32
	dm integrity: select CRYPTO_SKCIPHER
	x86/hyperv: Fix kexec panic/hang issues
	scsi: ufs: Relax the condition of UFSHCI_QUIRK_SKIP_MANUAL_WB_FLUSH_CTRL
	scsi: ufs: Correct the LUN used in eh_device_reset_handler() callback
	scsi: qedi: Correct max length of CHAP secret
	scsi: scsi_debug: Fix memleak in scsi_debug_init()
	scsi: sd: Suppress spurious errors when WRITE SAME is being disabled
	riscv: Fix kernel time_init()
	riscv: Fix sifive serial driver
	riscv: Enable interrupts during syscalls with M-Mode
	HID: logitech-dj: add the G602 receiver
	HID: Ignore battery for Elan touchscreen on ASUS UX550
	clk: tegra30: Add hda clock default rates to clock driver
	ALSA: hda/tegra: fix tegra-hda on tegra30 soc
	riscv: cacheinfo: Fix using smp_processor_id() in preemptible
	arm64: make atomic helpers __always_inline
	xen: Fix event channel callback via INTX/GSI
	x86/xen: Add xen_no_vector_callback option to test PCI INTX delivery
	x86/xen: Fix xen_hvm_smp_init() when vector callback not available
	dts: phy: fix missing mdio device and probe failure of vsc8541-01 device
	dts: phy: add GPIO number and active state used for phy reset
	riscv: defconfig: enable gpio support for HiFive Unleashed
	drm/amdgpu/psp: fix psp gfx ctrl cmds
	drm/amd/display: disable dcn10 pipe split by default
	HID: logitech-hidpp: Add product ID for MX Ergo in Bluetooth mode
	drm/amd/display: Fix to be able to stop crc calculation
	drm/nouveau/bios: fix issue shadowing expansion ROMs
	drm/nouveau/privring: ack interrupts the same way as RM
	drm/nouveau/i2c/gm200: increase width of aux semaphore owner fields
	drm/nouveau/mmu: fix vram heap sizing
	drm/nouveau/kms/nv50-: fix case where notifier buffer is at offset 0
	io_uring: flush timeouts that should already have expired
	libperf tests: If a test fails return non-zero
	libperf tests: Fail when failing to get a tracepoint id
	RISC-V: Set current memblock limit
	RISC-V: Fix maximum allowed phsyical memory for RV32
	x86/xen: fix 'nopvspin' build error
	nfsd: Fixes for nfsd4_encode_read_plus_data()
	nfsd: Don't set eof on a truncated READ_PLUS
	gpiolib: cdev: fix frame size warning in gpio_ioctl()
	pinctrl: aspeed: g6: Fix PWMG0 pinctrl setting
	pinctrl: mediatek: Fix fallback call path
	RDMA/ucma: Do not miss ctx destruction steps in some cases
	btrfs: print the actual offset in btrfs_root_name
	scsi: megaraid_sas: Fix MEGASAS_IOC_FIRMWARE regression
	scsi: ufs: ufshcd-pltfrm depends on HAS_IOMEM
	scsi: ufs: Fix tm request when non-fatal error happens
	crypto: omap-sham - Fix link error without crypto-engine
	bpf: Prevent double bpf_prog_put call from bpf_tracing_prog_attach
	powerpc: Use the common INIT_DATA_SECTION macro in vmlinux.lds.S
	powerpc: Fix alignment bug within the init sections
	arm64: entry: remove redundant IRQ flag tracing
	bpf: Reject too big ctx_size_in for raw_tp test run
	drm/amdkfd: Fix out-of-bounds read in kdf_create_vcrat_image_cpu()
	RDMA/umem: Avoid undefined behavior of rounddown_pow_of_two()
	RDMA/cma: Fix error flow in default_roce_mode_store
	printk: ringbuffer: fix line counting
	printk: fix kmsg_dump_get_buffer length calulations
	iov_iter: fix the uaccess area in copy_compat_iovec_from_user
	i2c: octeon: check correct size of maximum RECV_LEN packet
	drm/vc4: Unify PCM card's driver_name
	platform/x86: intel-vbtn: Drop HP Stream x360 Convertible PC 11 from allow-list
	platform/x86: hp-wmi: Don't log a warning on HPWMI_RET_UNKNOWN_COMMAND errors
	gpio: sifive: select IRQ_DOMAIN_HIERARCHY rather than depend on it
	ALSA: hda: Balance runtime/system PM if direct-complete is disabled
	xsk: Clear pool even for inactive queues
	selftests: net: fib_tests: remove duplicate log test
	can: dev: can_restart: fix use after free bug
	can: vxcan: vxcan_xmit: fix use after free bug
	can: peak_usb: fix use after free bugs
	perf evlist: Fix id index for heterogeneous systems
	i2c: sprd: depend on COMMON_CLK to fix compile tests
	iio: common: st_sensors: fix possible infinite loop in st_sensors_irq_thread
	iio: ad5504: Fix setting power-down state
	drivers: iio: temperature: Add delay after the addressed reset command in mlx90632.c
	iio: adc: ti_am335x_adc: remove omitted iio_kfifo_free()
	counter:ti-eqep: remove floor
	powerpc/64s: fix scv entry fallback flush vs interrupt
	cifs: do not fail __smb_send_rqst if non-fatal signals are pending
	irqchip/mips-cpu: Set IPI domain parent chip
	x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
	x86/topology: Make __max_die_per_package available unconditionally
	x86/mmx: Use KFPU_387 for MMX string operations
	x86/setup: don't remove E820_TYPE_RAM for pfn 0
	proc_sysctl: fix oops caused by incorrect command parameters
	mm: memcg/slab: optimize objcg stock draining
	mm: memcg: fix memcg file_dirty numa stat
	mm: fix numa stats for thp migration
	io_uring: iopoll requests should also wake task ->in_idle state
	io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
	io_uring: fix short read retries for non-reg files
	intel_th: pci: Add Alder Lake-P support
	stm class: Fix module init return on allocation failure
	serial: mvebu-uart: fix tx lost characters at power off
	ehci: fix EHCI host controller initialization sequence
	USB: ehci: fix an interrupt calltrace error
	usb: gadget: aspeed: fix stop dma register setting.
	USB: gadget: dummy-hcd: Fix errors in port-reset handling
	usb: udc: core: Use lock when write to soft_connect
	usb: bdc: Make bdc pci driver depend on BROKEN
	usb: cdns3: imx: fix writing read-only memory issue
	usb: cdns3: imx: fix can't create core device the second time issue
	xhci: make sure TRB is fully written before giving it to the controller
	xhci: tegra: Delay for disabling LFPS detector
	drivers core: Free dma_range_map when driver probe failed
	driver core: Fix device link device name collision
	driver core: Extend device_is_dependent()
	drm/i915: s/intel_dp_sink_dpms/intel_dp_set_power/
	drm/i915: Only enable DFP 4:4:4->4:2:0 conversion when outputting YCbCr 4:4:4
	x86/entry: Fix noinstr fail
	x86/cpu/amd: Set __max_die_per_package on AMD
	cls_flower: call nla_ok() before nla_next()
	netfilter: rpfilter: mask ecn bits before fib lookup
	tools: gpio: fix %llu warning in gpio-event-mon.c
	tools: gpio: fix %llu warning in gpio-watch.c
	drm/i915/hdcp: Update CP property in update_pipe
	sh: dma: fix kconfig dependency for G2_DMA
	sh: Remove unused HAVE_COPY_THREAD_TLS macro
	locking/lockdep: Cure noinstr fail
	ASoC: SOF: Intel: fix page fault at probe if i915 init fails
	octeontx2-af: Fix missing check bugs in rvu_cgx.c
	net: dsa: mv88e6xxx: also read STU state in mv88e6250_g1_vtu_getnext
	selftests/powerpc: Fix exit status of pkey tests
	sh_eth: Fix power down vs. is_opened flag ordering
	nvme-pci: refactor nvme_unmap_data
	nvme-pci: fix error unwind in nvme_map_data
	cachefiles: Drop superfluous readpages aops NULL check
	lightnvm: fix memory leak when submit fails
	skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too
	kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
	kasan: fix incorrect arguments passing in kasan_add_zero_shadow
	tcp: fix TCP socket rehash stats mis-accounting
	net_sched: gen_estimator: support large ewma log
	udp: mask TOS bits in udp_v4_early_demux()
	ipv6: create multicast route with RTPROT_KERNEL
	net_sched: avoid shift-out-of-bounds in tcindex_set_parms()
	net_sched: reject silly cell_log in qdisc_get_rtab()
	ipv6: set multicast flag on the multicast route
	net: mscc: ocelot: allow offloading of bridge on top of LAG
	net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
	net: dsa: b53: fix an off by one in checking "vlan->vid"
	tcp: do not mess with cloned skbs in tcp_add_backlog()
	tcp: fix TCP_USER_TIMEOUT with zero window
	net: mscc: ocelot: Fix multicast to the CPU port
	net: core: devlink: use right genl user_ptr when handling port param get/set
	pinctrl: qcom: Allow SoCs to specify a GPIO function that's not 0
	pinctrl: qcom: No need to read-modify-write the interrupt status
	pinctrl: qcom: Properly clear "intr_ack_high" interrupts when unmasking
	pinctrl: qcom: Don't clear pending interrupts when enabling
	x86/sev: Fix nonistr violation
	tty: implement write_iter
	tty: fix up hung_up_tty_write() conversion
	net: systemport: free dev before on error path
	x86/sev-es: Handle string port IO to kernel memory properly
	tcp: Fix potential use-after-free due to double kfree()
	ASoC: SOF: Intel: hda: Avoid checking jack on system suspend
	drm/i915/hdcp: Get conn while content_type changed
	bpf: Local storage helpers should check nullness of owner ptr passed
	kernfs: implement ->read_iter
	kernfs: implement ->write_iter
	kernfs: wire up ->splice_read and ->splice_write
	interconnect: imx8mq: Use icc_sync_state
	fs/pipe: allow sendfile() to pipe again
	Commit 9bb48c82aced ("tty: implement write_iter") converted the tty layer to use write_iter. Fix the redirected_tty_write declaration also in n_tty and change the comparisons to use write_iter instead of write. also in n_tty and change the comparisons to use write_iter instead of write.
	mm: fix initialization of struct page for holes in memory layout
	Revert "mm: fix initialization of struct page for holes in memory layout"
	Linux 5.10.11

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I502239d06f9bcd68b59149376f3c796c64de5942
2021-01-27 12:12:33 +01:00
Johannes Berg
e857271389 fs/pipe: allow sendfile() to pipe again
commit f8ad8187c3b536ee2b10502a8340c014204a1af0 upstream.

After commit 36e2c7421f ("fs: don't allow splice read/write
without explicit ops") sendfile() could no longer send data
from a real file to a pipe, breaking for example certain cgit
setups (e.g. when running behind fcgiwrap), because in this
case cgit will try to do exactly this: sendfile() to a pipe.

Fix this by using iter_file_splice_write for the splice_write
method of pipes, as suggested by Christoph.

Cc: stable@vger.kernel.org
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:29 +01:00
Christoph Hellwig
0b6672fd77 kernfs: wire up ->splice_read and ->splice_write
commit f2d6c2708bd84ca953fa6b6ca5717e79eb0140c7 upstream.

Wire up the splice_read and splice_write methods to the default
helpers using ->read_iter and ->write_iter now that those are
implemented for kernfs.  This restores support to use splice and
sendfile on kernfs files.

Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Reported-by: Siddharth Gupta <sidgup@codeaurora.org>
Tested-by: Siddharth Gupta <sidgup@codeaurora.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210120204631.274206-4-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:29 +01:00
Christoph Hellwig
11167454e9 kernfs: implement ->write_iter
commit cc099e0b399889c6485c88368b19824b087c9f8c upstream.

Switch kernfs to implement the write_iter method instead of plain old
write to prepare to supporting splice and sendfile again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210120204631.274206-3-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:29 +01:00
Christoph Hellwig
6ce10b6481 kernfs: implement ->read_iter
commit 4eaad21a6ac9865df7f31983232ed5928450458d upstream.

Switch kernfs to implement the read_iter method instead of plain old
read to prepare to supporting splice and sendfile again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210120204631.274206-2-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:29 +01:00
Takashi Iwai
76e2b0b65d cachefiles: Drop superfluous readpages aops NULL check
commit db58465f1121086b524be80be39d1fedbe5387f3 upstream.

After the recent actions to convert readpages aops to readahead, the
NULL checks of readpages aops in cachefiles_read_or_alloc_page() may
hit falsely.  More badly, it's an ASSERT() call, and this panics.

Drop the superfluous NULL checks for fixing this regression.

[DH: Note that cachefiles never actually used readpages, so this check was
 never actually necessary]

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208883
BugLink: https://bugzilla.opensuse.org/show_bug.cgi?id=1175245
Fixes: 9ae326a690 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:22 +01:00
Pavel Begunkov
2df15ef2a9 io_uring: fix short read retries for non-reg files
commit 9a173346bd9e16ab19c7addb8862d95a5cea9feb upstream.

Sockets and other non-regular files may actually expect short reads to
happen, don't retry reads for them. Because non-reg files don't set
FMODE_BUF_RASYNC and so it won't do second/retry do_read, we can filter
out those cases after first do_read() attempt with ret>0.

Cc: stable@vger.kernel.org # 5.9+
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:15 +01:00
Jens Axboe
f3ac7a5996 io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
commit 607ec89ed18f49ca59689572659b9c0076f1991f upstream.

IORING_OP_CLOSE is special in terms of cancelation, since it has an
intermediate state where we've removed the file descriptor but hasn't
closed the file yet. For that reason, it's currently marked with
IO_WQ_WORK_NO_CANCEL to prevent cancelation. This ensures that the op
is always run even if canceled, to prevent leaving us with a live file
but an fd that is gone. However, with SQPOLL, since a cancel request
doesn't carry any resources on behalf of the request being canceled, if
we cancel before any of the close op has been run, we can end up with
io-wq not having the ->files assigned. This can result in the following
oops reported by Joseph:

BUG: kernel NULL pointer dereference, address: 00000000000000d8
PGD 800000010b76f067 P4D 800000010b76f067 PUD 10b462067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1788 Comm: io_uring-sq Not tainted 5.11.0-rc4 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:__lock_acquire+0x19d/0x18c0
Code: 00 00 8b 1d fd 56 dd 08 85 db 0f 85 43 05 00 00 48 c7 c6 98 7b 95 82 48 c7 c7 57 96 93 82 e8 9a bc f5 ff 0f 0b e9 2b 05 00 00 <48> 81 3f c0 ca 67 8a b8 00 00 00 00 41 0f 45 c0 89 04 24 e9 81 fe
RSP: 0018:ffffc90001933828 EFLAGS: 00010002
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000d8
RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff888106e8a140 R15: 00000000000000d8
FS:  0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d8 CR3: 0000000106efa004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 lock_acquire+0x31a/0x440
 ? close_fd_get_file+0x39/0x160
 ? __lock_acquire+0x647/0x18c0
 _raw_spin_lock+0x2c/0x40
 ? close_fd_get_file+0x39/0x160
 close_fd_get_file+0x39/0x160
 io_issue_sqe+0x1334/0x14e0
 ? lock_acquire+0x31a/0x440
 ? __io_free_req+0xcf/0x2e0
 ? __io_free_req+0x175/0x2e0
 ? find_held_lock+0x28/0xb0
 ? io_wq_submit_work+0x7f/0x240
 io_wq_submit_work+0x7f/0x240
 io_wq_cancel_cb+0x161/0x580
 ? io_wqe_wake_worker+0x114/0x360
 ? io_uring_get_socket+0x40/0x40
 io_async_find_and_cancel+0x3b/0x140
 io_issue_sqe+0xbe1/0x14e0
 ? __lock_acquire+0x647/0x18c0
 ? __io_queue_sqe+0x10b/0x5f0
 __io_queue_sqe+0x10b/0x5f0
 ? io_req_prep+0xdb/0x1150
 ? mark_held_locks+0x6d/0xb0
 ? mark_held_locks+0x6d/0xb0
 ? io_queue_sqe+0x235/0x4b0
 io_queue_sqe+0x235/0x4b0
 io_submit_sqes+0xd7e/0x12a0
 ? _raw_spin_unlock_irq+0x24/0x30
 ? io_sq_thread+0x3ae/0x940
 io_sq_thread+0x207/0x940
 ? do_wait_intr_irq+0xc0/0xc0
 ? __ia32_sys_io_uring_enter+0x650/0x650
 kthread+0x134/0x180
 ? kthread_create_worker_on_cpu+0x90/0x90
 ret_from_fork+0x1f/0x30

Fix this by moving the IO_WQ_WORK_NO_CANCEL until _after_ we've modified
the fdtable. Canceling before this point is totally fine, and running
it in the io-wq context _after_ that point is also fine.

For 5.12, we'll handle this internally and get rid of the no-cancel
flag, as IORING_OP_CLOSE is the only user of it.

Cc: stable@vger.kernel.org
Fixes: b5dba59e0c ("io_uring: add support for IORING_OP_CLOSE")
Reported-by: "Abaci <abaci@linux.alibaba.com>"
Reviewed-and-tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:15 +01:00
Jens Axboe
ca75872dd9 io_uring: iopoll requests should also wake task ->in_idle state
commit c93cc9e16d88e0f5ea95d2d65d58a8a4dab258bc upstream.

If we're freeing/finishing iopoll requests, ensure we check if the task
is in idling in terms of cancelation. Otherwise we could end up waiting
forever in __io_uring_task_cancel() if the task has active iopoll
requests that need cancelation.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:15 +01:00
Xiaoming Ni
cb5fe25c82 proc_sysctl: fix oops caused by incorrect command parameters
commit 697edcb0e4eadc41645fe88c991fe6a206b1a08d upstream.

The process_sysctl_arg() does not check whether val is empty before
invoking strlen(val).  If the command line parameter () is incorrectly
configured and val is empty, oops is triggered.

For example:
  "hung_task_panic=1" is incorrectly written as "hung_task_panic", oops is
  triggered. The call stack is as follows:
    Kernel command line: .... hung_task_panic
    ......
    Call trace:
    __pi_strlen+0x10/0x98
    parse_args+0x278/0x344
    do_sysctl_args+0x8c/0xfc
    kernel_init+0x5c/0xf4
    ret_from_fork+0x10/0x30

To fix it, check whether "val" is empty when "phram" is a sysctl field.
Error codes are returned in the failure branch, and error logs are
generated by parse_args().

Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com
Fixes: 3db978d480 ("kernel/sysctl: support setting sysctl parameters from kernel command line")
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org>	[5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:14 +01:00
Ronnie Sahlberg
2edf2c9f3e cifs: do not fail __smb_send_rqst if non-fatal signals are pending
commit 214a5ea081e77346e4963dd6d20c5539ff8b6ae6 upstream.

RHBZ 1848178

The original intent of returning an error in this function
in the patch:
  "CIFS: Mask off signals when sending SMB packets"
was to avoid interrupting packet send in the middle of
sending the data (and thus breaking an SMB connection),
but we also don't want to fail the request for non-fatal
signals even before we have had a chance to try to
send it (the reported problem could be reproduced e.g.
by exiting a child process when the parent process was in
the midst of calling futimens to update a file's timestamps).

In addition, since the signal may remain pending when we enter the
sending loop, we may end up not sending the whole packet before
TCP buffers become full. In this case the code returns -EINTR
but what we need here is to return -ERESTARTSYS instead to
allow system calls to be restarted.

Fixes: b30c74c73c ("CIFS: Mask off signals when sending SMB packets")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:13 +01:00
Josef Bacik
dbba7a38b0 btrfs: print the actual offset in btrfs_root_name
[ Upstream commit 71008734d27f2276fcef23a5e546d358430f2d52 ]

We're supposed to print the root_key.offset in btrfs_root_name in the
case of a reloc root, not the objectid.  Fix this helper to take the key
so we have access to the offset when we need it.

Fixes: 457f1864b5 ("btrfs: pretty print leaked root name")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27 11:55:06 +01:00
Trond Myklebust
6533681890 nfsd: Don't set eof on a truncated READ_PLUS
[ Upstream commit b68f0cbd3f95f2df81e525c310a41fc73c2ed0d3 ]

If the READ_PLUS operation was truncated due to an error, then ensure we
clear the 'eof' flag.

Fixes: 9f0b5792f0 ("NFSD: Encode a full READ_PLUS reply")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27 11:55:05 +01:00
Trond Myklebust
de82ec8e5e nfsd: Fixes for nfsd4_encode_read_plus_data()
[ Upstream commit 72d78717c6d06adf65d2e3dccc96d9e9dc978593 ]

Ensure that we encode the data payload + padding, and that we truncate
the preallocated buffer to the actual read size.

Fixes: 528b84934e ("NFSD: Add READ_PLUS data support")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27 11:55:04 +01:00
Marcelo Diop-Gonzalez
2ca824c793 io_uring: flush timeouts that should already have expired
[ Upstream commit f010505b78a4fa8d5b6480752566e7313fb5ca6e ]

Right now io_flush_timeouts() checks if the current number of events
is equal to ->timeout.target_seq, but this will miss some timeouts if
there have been more than 1 event added since the last time they were
flushed (possible in io_submit_flush_completions(), for example). Fix
it by recording the last sequence at which timeouts were flushed so
that the number of events seen can be compared to the number of events
needed without overflow.

Signed-off-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27 11:55:03 +01:00
Eric Biggers
13ef6bccab fs: fix lazytime expiration handling in __writeback_single_inode()
commit 1e249cb5b7fc09ff216aa5a12f6c302e434e88f9 upstream.

When lazytime is enabled and an inode is being written due to its
in-memory updated timestamps having expired, either due to a sync() or
syncfs() system call or due to dirtytime_expire_interval having elapsed,
the VFS needs to inform the filesystem so that the filesystem can copy
the inode's timestamps out to the on-disk data structures.

This is done by __writeback_single_inode() calling
mark_inode_dirty_sync(), which then calls ->dirty_inode(I_DIRTY_SYNC).

However, this occurs after __writeback_single_inode() has already
cleared the dirty flags from ->i_state.  This causes two bugs:

- mark_inode_dirty_sync() redirties the inode, causing it to remain
  dirty.  This wastefully causes the inode to be written twice.  But
  more importantly, it breaks cases where sync_filesystem() is expected
  to clean dirty inodes.  This includes the FS_IOC_REMOVE_ENCRYPTION_KEY
  ioctl (as reported at
  https://lore.kernel.org/r/20200306004555.GB225345@gmail.com), as well
  as possibly filesystem freezing (freeze_super()).

- Since ->i_state doesn't contain I_DIRTY_TIME when ->dirty_inode() is
  called from __writeback_single_inode() for lazytime expiration,
  xfs_fs_dirty_inode() ignores the notification.  (XFS only cares about
  lazytime expirations, and it assumes that i_state will contain
  I_DIRTY_TIME during those.)  Therefore, lazy timestamps aren't
  persisted by sync(), syncfs(), or dirtytime_expire_interval on XFS.

Fix this by moving the call to mark_inode_dirty_sync() to earlier in
__writeback_single_inode(), before the dirty flags are cleared from
i_state.  This makes filesystems be properly notified of the timestamp
expiration, and it avoids incorrectly redirtying the inode.

This fixes xfstest generic/580 (which tests
FS_IOC_REMOVE_ENCRYPTION_KEY) when run on ext4 or f2fs with lazytime
enabled.  It also fixes the new lazytime xfstest I've proposed, which
reproduces the above-mentioned XFS bug
(https://lore.kernel.org/r/20210105005818.92978-1-ebiggers@kernel.org).

Alternatively, we could call ->dirty_inode(I_DIRTY_SYNC) directly.  But
due to the introduction of I_SYNC_QUEUED, mark_inode_dirty_sync() is the
right thing to do because mark_inode_dirty_sync() now knows not to move
the inode to a writeback list if it is currently queued for sync.

Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
Cc: stable@vger.kernel.org
Depends-on: 5afced3bf2 ("writeback: Avoid skipping inode writeback")
Link: https://lore.kernel.org/r/20210112190253.64307-2-ebiggers@kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:54:53 +01:00
Filipe Manana
adc11110d1 btrfs: send: fix invalid clone operations when cloning from the same file and root
commit 518837e65068c385dddc0a87b3e577c8be7c13b1 upstream.

When an incremental send finds an extent that is shared, it checks which
file extent items in the range refer to that extent, and for those it
emits clone operations, while for others it emits regular write operations
to avoid corruption at the destination (as described and fixed by commit
d906d49fc5 ("Btrfs: send, fix file corruption due to incorrect cloning
operations")).

However when the root we are cloning from is the send root, we are cloning
from the inode currently being processed and the source file range has
several extent items that partially point to the desired extent, with an
offset smaller than the offset in the file extent item for the range we
want to clone into, it can cause the algorithm to issue a clone operation
that starts at the current eof of the file being processed in the receiver
side, in which case the receiver will fail, with EINVAL, when attempting
to execute the clone operation.

Example reproducer:

  $ cat test-send-clone.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  # Create our test file with a single and large extent (1M) and with
  # different content for different file ranges that will be reflinked
  # later.
  xfs_io -f \
         -c "pwrite -S 0xab 0 128K" \
         -c "pwrite -S 0xcd 128K 128K" \
         -c "pwrite -S 0xef 256K 256K" \
         -c "pwrite -S 0x1a 512K 512K" \
         $MNT/foobar

  btrfs subvolume snapshot -r $MNT $MNT/snap1
  btrfs send -f /tmp/snap1.send $MNT/snap1

  # Now do a series of changes to our file such that we end up with
  # different parts of the extent reflinked into different file offsets
  # and we overwrite a large part of the extent too, so no file extent
  # items refer to that part that was overwritten. This used to confuse
  # the algorithm used by the kernel to figure out which file ranges to
  # clone, making it attempt to clone from a source range starting at
  # the current eof of the file, resulting in the receiver to fail since
  # it is an invalid clone operation.
  #
  xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
         -c "reflink $MNT/foobar 0K 512K 256K" \
         -c "reflink $MNT/foobar 512K 128K 256K" \
         -c "pwrite -S 0x73 384K 640K" \
         $MNT/foobar

  btrfs subvolume snapshot -r $MNT $MNT/snap2
  btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

  echo -e "\nFile digest in the original filesystem:"
  md5sum $MNT/snap2/foobar

  # Now unmount the filesystem, create a new one, mount it and try to
  # apply both send streams to recreate both snapshots.
  umount $DEV

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  btrfs receive -f /tmp/snap1.send $MNT
  btrfs receive -f /tmp/snap2.send $MNT

  # Must match what we got in the original filesystem of course.
  echo -e "\nFile digest in the new filesystem:"
  md5sum $MNT/snap2/foobar

  umount $MNT

When running the reproducer, the incremental send operation fails due to
an invalid clone operation:

  $ ./test-send-clone.sh
  wrote 131072/131072 bytes at offset 0
  128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
  wrote 131072/131072 bytes at offset 131072
  128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
  wrote 262144/262144 bytes at offset 262144
  256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
  wrote 524288/524288 bytes at offset 524288
  512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  linked 983040/983040 bytes at offset 1048576
  960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
  linked 262144/262144 bytes at offset 524288
  256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
  linked 262144/262144 bytes at offset 131072
  256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
  wrote 655360/655360 bytes at offset 393216
  640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2

  File digest in the original filesystem:
  9c13c61cb0b9f5abf45344375cb04dfa  /mnt/sdi/snap2/foobar
  At subvol snap1
  At snapshot snap2
  ERROR: failed to clone extents to foobar: Invalid argument

  File digest in the new filesystem:
  132f0396da8f48d2e667196bff882cfc  /mnt/sdi/snap2/foobar

The clone operation is invalid because its source range starts at the
current eof of the file in the receiver, causing the receiver to get
an EINVAL error from the clone operation when attempting it.

For the example above, what happens is the following:

1) When processing the extent at file offset 1M, the algorithm checks that
   the extent is shared and can be (fully or partially) found at file
   offset 0.

   At this point the file has a size (and eof) of 1M at the receiver;

2) It finds that our extent item at file offset 1M has a data offset of
   64K and, since the file extent item at file offset 0 has a data offset
   of 0, it issues a clone operation, from the same file and root, that
   has a source range offset of 64K, destination offset of 1M and a length
   of 64K, since the extent item at file offset 0 refers only to the first
   128K of the shared extent.

   After this clone operation, the file size (and eof) at the receiver is
   increased from 1M to 1088K (1M + 64K);

3) Now there's still 896K (960K - 64K) of data left to clone or write, so
   it checks for the next file extent item, which starts at file offset
   128K. This file extent item has a data offset of 0 and a length of
   256K, so a clone operation with a source range offset of 256K, a
   destination offset of 1088K (1M + 64K) and length of 128K is issued.

   After this operation the file size (and eof) at the receiver increases
   from 1088K to 1216K (1088K + 128K);

4) Now there's still 768K (896K - 128K) of data left to clone or write, so
   it checks for the next file extent item, located at file offset 384K.
   This file extent item points to a different extent, not the one we want
   to clone, with a length of 640K. So we issue a write operation into the
   file range 1216K (1088K + 128K, end of the last clone operation), with
   a length of 640K and with a data matching the one we can find for that
   range in send root.

   After this operation, the file size (and eof) at the receiver increases
   from 1216K to 1856K (1216K + 640K);

5) Now there's still 128K (768K - 640K) of data left to clone or write, so
   we look into the file extent item, which is for file offset 1M and it
   points to the extent we want to clone, with a data offset of 64K and a
   length of 960K.

   However this matches the file offset we started with, the start of the
   range to clone into. So we can't for sure find any file extent item
   from here onwards with the rest of the data we want to clone, yet we
   proceed and since the file extent item points to the shared extent,
   with a data offset of 64K, we issue a clone operation with a source
   range starting at file offset 1856K, which matches the file extent
   item's offset, 1M, plus the amount of data cloned and written so far,
   which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
   operation is invalid since the source range offset matches the current
   eof of the file in the receiver. We should have stopped looking for
   extents to clone at this point and instead fallback to write, which
   would simply the contain the data in the file range from 1856K to
   1856K + 128K.

So fix this by stopping the loop that looks for file ranges to clone at
clone_range() when we reach the current eof of the file being processed,
if we are cloning from the same file and using the send root as the clone
root. This ensures any data not yet cloned will be sent to the receiver
through a write operation.

A test case for fstests will follow soon.

Reported-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
Fixes: 11f2069c11 ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:54:53 +01:00
Josef Bacik
018abb5089 btrfs: don't clear ret in btrfs_start_dirty_block_groups
commit 34d1eb0e599875064955a74712f08ff14c8e3d5f upstream.

If we fail to update a block group item in the loop we'll break, however
we'll do btrfs_run_delayed_refs and lose our error value in ret, and
thus not clean up properly.  Fix this by only running the delayed refs
if there was no failure.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:54:53 +01:00
Josef Bacik
14e17e90bf btrfs: fix lockdep splat in btrfs_recover_relocation
commit fb286100974e7239af243bc2255a52f29442f9c8 upstream.

While testing the error paths of relocation I hit the following lockdep
splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.10.0-rc6+ #217 Not tainted
  ------------------------------------------------------
  mount/779 is trying to acquire lock:
  ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340

  but task is already holding lock:
  ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-root-00){++++}-{3:3}:
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x27/0x100
	 btrfs_read_lock_root_node+0x31/0x40
	 btrfs_search_slot+0x462/0x8f0
	 btrfs_update_root+0x55/0x2b0
	 btrfs_drop_snapshot+0x398/0x750
	 clean_dirty_subvols+0xdf/0x120
	 btrfs_recover_relocation+0x534/0x5a0
	 btrfs_start_pre_rw_mount+0xcb/0x170
	 open_ctree+0x151f/0x1726
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x380
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x433/0xc10
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (sb_internal#2){.+.+}-{0:0}:
	 start_transaction+0x444/0x700
	 insert_balance_item.isra.0+0x37/0x320
	 btrfs_balance+0x354/0xf40
	 btrfs_ioctl_balance+0x2cf/0x380
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1120/0x1e10
	 lock_acquire+0x116/0x370
	 __mutex_lock+0x7e/0x7b0
	 btrfs_recover_balance+0x2f0/0x340
	 open_ctree+0x1095/0x1726
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x380
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x433/0xc10
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-root-00);
				 lock(sb_internal#2);
				 lock(btrfs-root-00);
    lock(&fs_info->balance_mutex);

   *** DEADLOCK ***

  2 locks held by mount/779:
   #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
   #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100

  stack backtrace:
  CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   ? trace_call_bpf+0x139/0x260
   __lock_acquire+0x1120/0x1e10
   lock_acquire+0x116/0x370
   ? btrfs_recover_balance+0x2f0/0x340
   __mutex_lock+0x7e/0x7b0
   ? btrfs_recover_balance+0x2f0/0x340
   ? btrfs_recover_balance+0x2f0/0x340
   ? rcu_read_lock_sched_held+0x3f/0x80
   ? kmem_cache_alloc_trace+0x2c4/0x2f0
   ? btrfs_get_64+0x5e/0x100
   btrfs_recover_balance+0x2f0/0x340
   open_ctree+0x1095/0x1726
   btrfs_mount_root.cold+0x12/0xea
   ? rcu_read_lock_sched_held+0x3f/0x80
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   vfs_kern_mount.part.0+0x71/0xb0
   btrfs_mount+0x10d/0x380
   ? __kmalloc_track_caller+0x2f2/0x320
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   ? capable+0x3a/0x60
   path_mount+0x433/0xc10
   __x64_sys_mount+0xe3/0x120
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is straightforward to fix, simply release the path before we setup
the balance_ctl.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:54:53 +01:00
Josef Bacik
5169a289fc btrfs: do not double free backref nodes on error
commit 49ecc679ab48b40ca799bf94b327d5284eac9e46 upstream.

Zygo reported the following KASAN splat:

  BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
  Read of size 8 at addr ffff888112402950 by task btrfs/28836

  CPU: 0 PID: 28836 Comm: btrfs Tainted: G        W         5.10.0-e35f27394290-for-next+ #23
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  Call Trace:
   dump_stack+0xbc/0xf9
   ? btrfs_backref_cleanup_node+0x18a/0x420
   print_address_description.constprop.8+0x21/0x210
   ? record_print_text.cold.34+0x11/0x11
   ? btrfs_backref_cleanup_node+0x18a/0x420
   ? btrfs_backref_cleanup_node+0x18a/0x420
   kasan_report.cold.10+0x20/0x37
   ? btrfs_backref_cleanup_node+0x18a/0x420
   __asan_load8+0x69/0x90
   btrfs_backref_cleanup_node+0x18a/0x420
   btrfs_backref_release_cache+0x83/0x1b0
   relocate_block_group+0x394/0x780
   ? merge_reloc_roots+0x4a0/0x4a0
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   ? check_flags.part.50+0x6c/0x1e0
   ? btrfs_relocate_chunk+0x120/0x120
   ? kmem_cache_alloc_trace+0xa06/0xcb0
   ? _copy_from_user+0x83/0xc0
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   ? __kasan_check_read+0x11/0x20
   ? check_chain_key+0x1f4/0x2f0
   ? __asan_loadN+0xf/0x20
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? kvm_sched_clock_read+0x18/0x30
   ? check_chain_key+0x1f4/0x2f0
   ? lock_downgrade+0x3f0/0x3f0
   ? handle_mm_fault+0xad6/0x2150
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags+0x26/0x30
   ? lock_is_held_type+0xc3/0xf0
   ? syscall_enter_from_user_mode+0x1b/0x60
   ? do_syscall_64+0x13/0x80
   ? rcu_read_lock_sched_held+0xa1/0xd0
   ? __kasan_check_read+0x11/0x20
   ? __fget_light+0xae/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f4c4bdfe427

  Allocated by task 28836:
   kasan_save_stack+0x21/0x50
   __kasan_kmalloc.constprop.18+0xbe/0xd0
   kasan_kmalloc+0x9/0x10
   kmem_cache_alloc_trace+0x410/0xcb0
   btrfs_backref_alloc_node+0x46/0xf0
   btrfs_backref_add_tree_node+0x60d/0x11d0
   build_backref_tree+0xc5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 28836:
   kasan_save_stack+0x21/0x50
   kasan_set_track+0x20/0x30
   kasan_set_free_info+0x1f/0x30
   __kasan_slab_free+0xf3/0x140
   kasan_slab_free+0xe/0x10
   kfree+0xde/0x200
   btrfs_backref_error_cleanup+0x452/0x530
   build_backref_tree+0x1a5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This occurred because we freed our backref node in
btrfs_backref_error_cleanup(), but then tried to free it again in
btrfs_backref_release_cache().  This is because
btrfs_backref_release_cache() will cycle through all of the
cache->leaves nodes and free them up.  However
btrfs_backref_error_cleanup() freed the backref node with
btrfs_backref_free_node(), which simply kfree()d the backref node
without unlinking it from the cache.  Change this to a
btrfs_backref_drop_node(), which does the appropriate cleanup and
removes the node from the cache->leaves list, so when we go to free the
remaining cache we don't trip over items we've already dropped.

Fixes: 75bfb9aff4 ("Btrfs: cleanup error handling in build_backref_tree")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:54:52 +01:00
Josef Bacik
9e2fc8f10c btrfs: don't get an EINTR during drop_snapshot for reloc
commit 18d3bff411c8d46d40537483bdc0b61b33ce0371 upstream.

This was partially fixed by f3e3d9cc35 ("btrfs: avoid possible signal
interruption of btrfs_drop_snapshot() on relocation tree"), however it
missed a spot when we restart a trans handle because we need to end the
transaction.  The fix is the same, simply use btrfs_join_transaction()
instead of btrfs_start_transaction() when deleting reloc roots.

Fixes: f3e3d9cc35 ("btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:54:52 +01:00
Alessio Balsini
32c13ae5ff FROMLIST: fuse: Introduce passthrough for mmap
Enabling FUSE passthrough for mmap-ed operations not only affects
performance, but has also been shown as mandatory for the correct
functioning of FUSE passthrough.
yanwu noticed [1] that a FUSE file with passthrough enabled may suffer
data inconsistencies if the same file is also accessed with mmap. What
happens is that read/write operations are directly applied to the lower
file system (and its cache), while mmap-ed operations are affecting the
FUSE cache.

Extend the FUSE passthrough implementation to also handle memory-mapped
FUSE file, to both fix the cache inconsistencies and extend the
passthrough performance benefits to mmap-ed operations.

[1] https://lore.kernel.org/lkml/20210119110654.11817-1-wu-yan@tcl.com/

Bug: 168023149
Link: https://lore.kernel.org/lkml/20210125153057.3623715-9-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: Ifad4698b0380f6e004c487940ac6907b9a9f2964
Signed-off-by: Alessio Balsini <balsini@google.com>
2021-01-26 19:07:13 +00:00
Alessio Balsini
aa29f32988 FROMLIST: fuse: Use daemon creds in passthrough mode
When using FUSE passthrough, read/write operations are directly
forwarded to the lower file system file through VFS, but there is no
guarantee that the process that is triggering the request has the right
permissions to access the lower file system. This would cause the
read/write access to fail.

In passthrough file systems, where the FUSE daemon is responsible for
the enforcement of the lower file system access policies, often happens
that the process dealing with the FUSE file system doesn't have access
to the lower file system.
Being the FUSE daemon in charge of implementing the FUSE file
operations, that in the case of read/write operations usually simply
results in the copy of memory buffers from/to the lower file system
respectively, these operations are executed with the FUSE daemon
privileges.

This patch adds a reference to the FUSE daemon credentials, referenced
at FUSE_DEV_IOC_PASSTHROUGH_OPEN ioctl() time so that they can be used
to temporarily raise the user credentials when accessing lower file
system files in passthrough.
The process accessing the FUSE file with passthrough enabled temporarily
receives the privileges of the FUSE daemon while performing read/write
operations. Similar behavior is implemented in overlayfs.
These privileges will be reverted as soon as the IO operation completes.
This feature does not provide any higher security privileges to those
processes accessing the FUSE file system with passthrough enabled. This
is because it is still the FUSE daemon responsible for enabling or not
the passthrough feature at file open time, and should enable the feature
only after appropriate access policy checks.

Bug: 168023149
Link: https://lore.kernel.org/lkml/20210125153057.3623715-8-balsini@android.com/
Signed-off-by: Alessio Balsini <balsini@android.com>
Change-Id: Idb4f03a2ce7c536691e5eaf8fadadfcf002e1677
Signed-off-by: Alessio Balsini <balsini@google.com>
2021-01-26 17:20:10 +00:00