Commit Graph

356 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
df23049a96 This is the 5.10.176 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmQa9NgACgkQONu9yGCS
 aT4Iew//X/3+Bpiu+FyaYe0NZ4I95rQvNh4fG6wXCFd/PVbCRpxVOAKQ91GnkU+D
 iMeuGBPqkpPhHvesRybsq0u8GmJ+fJj58+fgy1ABI7UzkWihzNDu1n2RntYmuRvl
 TEEsAIS+6/lhVKosDhyYcXAL5eT8F06zFOI9HspWRe+lYoRBIQyykcLgZQwt5mBX
 qyKAFkvhH0Z77ATiID5alRkVArgi/t3qBUANTrJ7LqOlhtY42EOS0Sp7wpZWskqI
 7Mpb6pfODsOq5d+6zNvZzdrtMaKRBal0Inxj2+zLEYdSv+xbTqp4Cb6UI18gJTA7
 zsvItAzTRxp+7KiZVS2HP3uMRRV4lQ5HxgMJhSsONHSSRh7ndhkW7NQq/o/dRFm2
 IgVf1beHk2pE+LN0Plf2oQCOMV8h/vQRZLCejoQEbFy6oNQ6bA4btJaXZnfluqDb
 KXONyDqXZ3uX3DSrKO4pCNCTsm5JhinkFHhO125kjSkPp/k2YWXdnBftQT1mWPYf
 dbWu1z/E+3qvObedwNn+icuu/MUznZMTYwDOD31tJp+1iEBgeBQWI+IRaIaWbDyD
 dxSoV8cScNZz+X4M70EFlwJMYL/VcIzDljeH2EA3CImDycDH0tspo6z8Z+xFhsrg
 D1wshmaT9XkSEJ92xDMw82B/1noOati75HpkUW1W/PKTqvjH/uU=
 =/t/A
 -----END PGP SIGNATURE-----

Merge 5.10.176 into android12-5.10-lts

Changes in 5.10.176
	xfrm: Allow transport-mode states with AF_UNSPEC selector
	drm/panfrost: Don't sync rpm suspension after mmu flushing
	cifs: Move the in_send statistic to __smb_send_rqst()
	drm/meson: fix 1px pink line on GXM when scaling video overlay
	clk: HI655X: select REGMAP instead of depending on it
	docs: Correct missing "d_" prefix for dentry_operations member d_weak_revalidate
	scsi: mpt3sas: Fix NULL pointer access in mpt3sas_transport_port_add()
	ALSA: hda: Match only Intel devices with CONTROLLER_IN_GPU()
	netfilter: nft_nat: correct length for loading protocol registers
	netfilter: nft_masq: correct length for loading protocol registers
	netfilter: nft_redir: correct length for loading protocol registers
	netfilter: nft_redir: correct value of inet type `.maxattrs`
	scsi: core: Fix a comment in function scsi_host_dev_release()
	scsi: core: Fix a procfs host directory removal regression
	tcp: tcp_make_synack() can be called from process context
	nfc: pn533: initialize struct pn533_out_arg properly
	ipvlan: Make skb->skb_iif track skb->dev for l3s mode
	i40e: Fix kernel crash during reboot when adapter is in recovery mode
	net/smc: fix NULL sndbuf_desc in smc_cdc_tx_handler()
	qed/qed_dev: guard against a possible division by zero
	net: tunnels: annotate lockless accesses to dev->needed_headroom
	net: phy: smsc: bail out in lan87xx_read_status if genphy_read_status fails
	nfc: st-nci: Fix use after free bug in ndlc_remove due to race condition
	net/smc: fix deadlock triggered by cancel_delayed_work_syn()
	net: usb: smsc75xx: Limit packet length to skb->len
	drm/bridge: Fix returned array size name for atomic_get_input_bus_fmts kdoc
	null_blk: Move driver into its own directory
	block: null_blk: Fix handling of fake timeout request
	nvme: fix handling single range discard request
	nvmet: avoid potential UAF in nvmet_req_complete()
	block: sunvdc: add check for mdesc_grab() returning NULL
	ice: xsk: disable txq irq before flushing hw
	net: dsa: mv88e6xxx: fix max_mtu of 1492 on 6165, 6191, 6220, 6250, 6290
	ipv4: Fix incorrect table ID in IOCTL path
	net: usb: smsc75xx: Move packet length check to prevent kernel panic in skb_pull
	net/iucv: Fix size of interrupt data
	selftests: net: devlink_port_split.py: skip test if no suitable device available
	qed/qed_mng_tlv: correctly zero out ->min instead of ->hour
	ethernet: sun: add check for the mdesc_grab()
	hwmon: (adt7475) Display smoothing attributes in correct order
	hwmon: (adt7475) Fix masking of hysteresis registers
	hwmon: (xgene) Fix use after free bug in xgene_hwmon_remove due to race condition
	hwmon: (ina3221) return prober error code
	hwmon: (ucd90320) Add minimum delay between bus accesses
	hwmon: tmp512: drop of_match_ptr for ID table
	hwmon: (adm1266) Set `can_sleep` flag for GPIO chip
	media: m5mols: fix off-by-one loop termination error
	mmc: atmel-mci: fix race between stop command and start of next command
	jffs2: correct logic when creating a hole in jffs2_write_begin
	ext4: fail ext4_iget if special inode unallocated
	ext4: fix task hung in ext4_xattr_delete_inode
	drm/amdkfd: Fix an illegal memory access
	sh: intc: Avoid spurious sizeof-pointer-div warning
	drm/amd/display: fix shift-out-of-bounds in CalculateVMAndRowBytes
	ext4: fix possible double unlock when moving a directory
	tty: serial: fsl_lpuart: skip waiting for transmission complete when UARTCTRL_SBK is asserted
	serial: 8250_em: Fix UART port type
	firmware: xilinx: don't make a sleepable memory allocation from an atomic context
	interconnect: fix mem leak when freeing nodes
	tracing: Make splice_read available again
	tracing: Check field value in hist_field_name()
	tracing: Make tracepoint lockdep check actually test something
	cifs: Fix smb2_set_path_size()
	KVM: nVMX: add missing consistency checks for CR0 and CR4
	ALSA: hda: intel-dsp-config: add MTL PCI id
	ALSA: hda/realtek: Fix the speaker output on Samsung Galaxy Book2 Pro
	drm/shmem-helper: Remove another errant put in error path
	mptcp: avoid setting TCP_CLOSE state twice
	ftrace: Fix invalid address access in lookup_rec() when index is 0
	mm/userfaultfd: propagate uffd-wp bit when PTE-mapping the huge zeropage
	mmc: sdhci_am654: lower power-on failed message severity
	fbdev: stifb: Provide valid pixelclock and add fb_check_var() checks
	cpuidle: psci: Iterate backwards over list in psci_pd_remove()
	x86/mce: Make sure logged MCEs are processed after sysfs update
	x86/mm: Fix use of uninitialized buffer in sme_enable()
	drm/i915: Don't use stolen memory for ring buffers with LLC
	drm/i915/active: Fix misuse of non-idle barriers as fence trackers
	io_uring: avoid null-ptr-deref in io_arm_poll_handler
	s390/ipl: add missing intersection check to ipl_report handling
	PCI: Unify delay handling for reset and resume
	PCI/DPC: Await readiness of secondary bus after reset
	xfs: don't assert fail on perag references on teardown
	xfs: purge dquots after inode walk fails during quotacheck
	xfs: don't leak btree cursor when insrec fails after a split
	xfs: remove XFS_PREALLOC_SYNC
	xfs: fallocate() should call file_modified()
	xfs: set prealloc flag in xfs_alloc_file_space()
	xfs: use setattr_copy to set vfs inode attributes
	fs: add mode_strip_sgid() helper
	fs: move S_ISGID stripping into the vfs_*() helpers
	attr: add in_group_or_capable()
	fs: move should_remove_suid()
	attr: add setattr_should_drop_sgid()
	attr: use consistent sgid stripping checks
	fs: use consistent setgid checks in is_sxid()
	xfs: remove xfs_setattr_time() declaration
	HID: core: Provide new max_buffer_size attribute to over-ride the default
	HID: uhid: Over-ride the default maximum data buffer value with our own
	Linux 5.10.176

Change-Id: Icd45189f4182c749d1758c13e18705abb4ea9c5a
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-03-24 16:03:04 +00:00
Amir Goldstein
0e9dbde96c attr: use consistent sgid stripping checks
commit ed5a7047d2011cb6b2bf84ceb6680124cc6a7d95 upstream.

[backported to 5.10.y, prior to idmapped mounts]

Currently setgid stripping in file_remove_privs()'s should_remove_suid()
helper is inconsistent with other parts of the vfs. Specifically, it only
raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
inode isn't in the caller's groups and the caller isn't privileged over the
inode although we require this already in setattr_prepare() and
setattr_copy() and so all filesystem implement this requirement implicitly
because they have to use setattr_{prepare,copy}() anyway.

But the inconsistency shows up in setgid stripping bugs for overlayfs in
xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
generic/687). For example, we test whether suid and setgid stripping works
correctly when performing various write-like operations as an unprivileged
user (fallocate, reflink, write, etc.):

echo "Test 1 - qa_user, non-exec file $verb"
setup_testfile
chmod a+rws $junk_file
commit_and_check "$qa_user" "$verb" 64k 64k

The test basically creates a file with 6666 permissions. While the file has
the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
regular filesystem like xfs what will happen is:

sys_fallocate()
-> vfs_fallocate()
   -> xfs_file_fallocate()
      -> file_modified()
         -> __file_remove_privs()
            -> dentry_needs_remove_privs()
               -> should_remove_suid()
            -> __remove_privs()
               newattrs.ia_valid = ATTR_FORCE | kill;
               -> notify_change()
                  -> setattr_copy()

In should_remove_suid() we can see that ATTR_KILL_SUID is raised
unconditionally because the file in the test has S_ISUID set.

But we also see that ATTR_KILL_SGID won't be set because while the file
is S_ISGID it is not S_IXGRP (see above) which is a condition for
ATTR_KILL_SGID being raised.

So by the time we call notify_change() we have attr->ia_valid set to
ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
ATTR_KILL_SUID is set and does:

ia_valid = attr->ia_valid |= ATTR_MODE
attr->ia_mode = (inode->i_mode & ~S_ISUID);

which means that when we call setattr_copy() later we will definitely
update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.

Now we call into the filesystem's ->setattr() inode operation which will
end up calling setattr_copy(). Since ATTR_MODE is set we will hit:

if (ia_valid & ATTR_MODE) {
        umode_t mode = attr->ia_mode;
        vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
        if (!vfsgid_in_group_p(vfsgid) &&
            !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                mode &= ~S_ISGID;
        inode->i_mode = mode;
}

and since the caller in the test is neither capable nor in the group of the
inode the S_ISGID bit is stripped.

But assume the file isn't suid then ATTR_KILL_SUID won't be raised which
has the consequence that neither the setgid nor the suid bits are stripped
even though it should be stripped because the inode isn't in the caller's
groups and the caller isn't privileged over the inode.

If overlayfs is in the mix things become a bit more complicated and the bug
shows up more clearly. When e.g., ovl_setattr() is hit from
ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
ATTR_KILL_SGID might be raised but because the check in notify_change() is
questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
stripped the S_ISGID bit isn't removed even though it should be stripped:

sys_fallocate()
-> vfs_fallocate()
   -> ovl_fallocate()
      -> file_remove_privs()
         -> dentry_needs_remove_privs()
            -> should_remove_suid()
         -> __remove_privs()
            newattrs.ia_valid = ATTR_FORCE | kill;
            -> notify_change()
               -> ovl_setattr()
                  // TAKE ON MOUNTER'S CREDS
                  -> ovl_do_notify_change()
                     -> notify_change()
                  // GIVE UP MOUNTER'S CREDS
     // TAKE ON MOUNTER'S CREDS
     -> vfs_fallocate()
        -> xfs_file_fallocate()
           -> file_modified()
              -> __file_remove_privs()
                 -> dentry_needs_remove_privs()
                    -> should_remove_suid()
                 -> __remove_privs()
                    newattrs.ia_valid = attr_force | kill;
                    -> notify_change()

The fix for all of this is to make file_remove_privs()'s
should_remove_suid() helper to perform the same checks as we already
require in setattr_prepare() and setattr_copy() and have notify_change()
not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
first place because the caller must calculate the flags via
should_remove_suid() anyway which would raise ATTR_KILL_SGID.

While we're at it we move should_remove_suid() from inode.c to attr.c
where it belongs with the rest of the iattr helpers. Especially since it
returns ATTR_KILL_S{G,U}ID flags. We also rename it to
setattr_should_drop_suidgid() to better reflect that it indicates both
setuid and setgid bit removal and also that it returns attr flags.

Running xfstests with this doesn't report any regressions. We should really
try and use consistent checks.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-22 13:30:08 +01:00
Greg Kroah-Hartman
8596b99884 Merge 5.10.162 into android12-5.10-lts
Changes in 5.10.162
	kernel: provide create_io_thread() helper
	iov_iter: add helper to save iov_iter state
	saner calling conventions for unlazy_child()
	fs: add support for LOOKUP_CACHED
	fix handling of nd->depth on LOOKUP_CACHED failures in try_to_unlazy*
	Make sure nd->path.mnt and nd->path.dentry are always valid pointers
	fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED
	tools headers UAPI: Sync openat2.h with the kernel sources
	net: provide __sys_shutdown_sock() that takes a socket
	net: add accept helper not installing fd
	signal: Add task_sigpending() helper
	fs: make do_renameat2() take struct filename
	file: Rename __close_fd_get_file close_fd_get_file
	fs: provide locked helper variant of close_fd_get_file()
	entry: Add support for TIF_NOTIFY_SIGNAL
	task_work: Use TIF_NOTIFY_SIGNAL if available
	x86: Wire up TIF_NOTIFY_SIGNAL
	arc: add support for TIF_NOTIFY_SIGNAL
	arm64: add support for TIF_NOTIFY_SIGNAL
	m68k: add support for TIF_NOTIFY_SIGNAL
	nios32: add support for TIF_NOTIFY_SIGNAL
	parisc: add support for TIF_NOTIFY_SIGNAL
	powerpc: add support for TIF_NOTIFY_SIGNAL
	mips: add support for TIF_NOTIFY_SIGNAL
	s390: add support for TIF_NOTIFY_SIGNAL
	um: add support for TIF_NOTIFY_SIGNAL
	sh: add support for TIF_NOTIFY_SIGNAL
	openrisc: add support for TIF_NOTIFY_SIGNAL
	csky: add support for TIF_NOTIFY_SIGNAL
	hexagon: add support for TIF_NOTIFY_SIGNAL
	microblaze: add support for TIF_NOTIFY_SIGNAL
	arm: add support for TIF_NOTIFY_SIGNAL
	xtensa: add support for TIF_NOTIFY_SIGNAL
	alpha: add support for TIF_NOTIFY_SIGNAL
	c6x: add support for TIF_NOTIFY_SIGNAL
	h8300: add support for TIF_NOTIFY_SIGNAL
	ia64: add support for TIF_NOTIFY_SIGNAL
	nds32: add support for TIF_NOTIFY_SIGNAL
	riscv: add support for TIF_NOTIFY_SIGNAL
	sparc: add support for TIF_NOTIFY_SIGNAL
	ia64: don't call handle_signal() unless there's actually a signal queued
	ARC: unbork 5.11 bootup: fix snafu in _TIF_NOTIFY_SIGNAL handling
	alpha: fix TIF_NOTIFY_SIGNAL handling
	task_work: remove legacy TWA_SIGNAL path
	kernel: remove checking for TIF_NOTIFY_SIGNAL
	coredump: Limit what can interrupt coredumps
	kernel: allow fork with TIF_NOTIFY_SIGNAL pending
	entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
	arch: setup PF_IO_WORKER threads like PF_KTHREAD
	arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread()
	x86/process: setup io_threads more like normal user space threads
	kernel: stop masking signals in create_io_thread()
	kernel: don't call do_exit() for PF_IO_WORKER threads
	task_work: add helper for more targeted task_work canceling
	io_uring: import 5.15-stable io_uring
	signal: kill JOBCTL_TASK_WORK
	task_work: unconditionally run task_work from get_signal()
	net: remove cmsg restriction from io_uring based send/recvmsg calls
	Revert "proc: don't allow async path resolution of /proc/thread-self components"
	Revert "proc: don't allow async path resolution of /proc/self components"
	eventpoll: add EPOLL_URING_WAKE poll wakeup flag
	eventfd: provide a eventfd_signal_mask() helper
	io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups
	Linux 5.10.162

Change-Id: I50a7b8bc8d38fac612113281b218cf5323b0af5e
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2023-02-01 16:13:18 +00:00
Jens Axboe
5683caa735 fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED
[ Upstream commit 99668f618062816ca7ba639b007eb145b9d3d41e ]

Now that we support non-blocking path resolution internally, expose it
via openat2() in the struct open_how ->resolve flags. This allows
applications using openat2() to limit path resolution to the extent that
it is already cached.

If the lookup cannot be satisfied in a non-blocking manner, openat2(2)
will return -1/-EAGAIN.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-04 11:39:17 +01:00
Greg Kroah-Hartman
d483eed85f ANDROID: GKI: set vfs-only exports into their own namespace
We have namespaces, so use them for all vfs-exported namespaces so that
filesystems can use them, but not anything else.

Some in-kernel drivers that do direct filesystem accesses (because they
serve up files) are also allowed access to these symbols to keep 'make
allmodconfig' builds working properly, but it is not needed for Android
kernel images.

Bug: 157965270
Bug: 210074446
Cc: Matthias Maennich <maennich@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iaf6140baf3a18a516ab2d5c3966235c42f3f70de
2022-01-11 09:30:47 +01:00
Greg Kroah-Hartman
2df0fb4a4b This is the 5.10.50 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmDu+1UACgkQONu9yGCS
 aT7jQRAAuLDi7ejk3JUameYFMzVXGAUE6yPs392/lWJzey7IBf+2uLqz4FzqqUHp
 U1GkEKJVaCacEfi0+rpi7BxNFljUdZdg/F/P68ARtAWPvwqAeJ4QIh5u3A682UUO
 1M5h6e5/oY9F4kQIb5Kot04avqOeR6lTqrkA8jeP5h43ngyLWuS2d+5oOGmbCukS
 UgEaCC6CiKjcN51UUTj/fXMQ0X4IDHP5pD8rWwH0IvK0i7gduvk744un8LVB6aW1
 rNV88C3BEFFtkPQh2XySnXM5Ok8kYlhFoTDsqlpeAX7pA8hiUPYBoRzTg0MJtPZn
 N1L/Yqhvxmn5xs9HAw7mDOo8E8NWXzsT5FvZVaBeiCgtdKmcPszylXqmSt1oiOb0
 /EmkCWmlbG/3qWql24+LU4XP36iVPx32HQxAgg2XbnlNU5o0E1y2F98p6p/3JSWX
 NAjHtmg/MxueFQ+w8bDzhO8YzYn1dIU3V3qaXRvtpODrmaSYW+bwCyPtSjXe3/vL
 604zb3dOg9+tD/gKqfRb/UPMu24nNll8M/gnSRci05/thmIxwtYudPwoLNSejDqr
 e+a8vejISfIyp41XrpYQbUeKs1WOA+A7vgx6CZrT791afiT+6UgC/ecQfg1NFxhs
 8ayWpocaIszxyXxVGro1rfwZeQmTlbTCZ5wVdpn9sDPZfI7epts=
 =FCrA
 -----END PGP SIGNATURE-----

Merge 5.10.50 into android12-5.10-lts

Changes in 5.10.50
	Bluetooth: hci_qca: fix potential GPF
	Bluetooth: btqca: Don't modify firmware contents in-place
	Bluetooth: Remove spurious error message
	ALSA: usb-audio: fix rate on Ozone Z90 USB headset
	ALSA: usb-audio: Fix OOB access at proc output
	ALSA: firewire-motu: fix stream format for MOTU 8pre FireWire
	ALSA: usb-audio: scarlett2: Fix wrong resume call
	ALSA: intel8x0: Fix breakage at ac97 clock measurement
	ALSA: hda/realtek: fix mute/micmute LEDs for HP ProBook 450 G8
	ALSA: hda/realtek: fix mute/micmute LEDs for HP ProBook 445 G8
	ALSA: hda/realtek: fix mute/micmute LEDs for HP ProBook 630 G8
	ALSA: hda/realtek: Add another ALC236 variant support
	ALSA: hda/realtek: fix mute/micmute LEDs for HP EliteBook x360 830 G8
	ALSA: hda/realtek: Improve fixup for HP Spectre x360 15-df0xxx
	ALSA: hda/realtek: Fix bass speaker DAC mapping for Asus UM431D
	ALSA: hda/realtek: Apply LED fixup for HP Dragonfly G1, too
	ALSA: hda/realtek: fix mute/micmute LEDs for HP EliteBook 830 G8 Notebook PC
	media: dvb-usb: fix wrong definition
	Input: usbtouchscreen - fix control-request directions
	net: can: ems_usb: fix use-after-free in ems_usb_disconnect()
	usb: gadget: eem: fix echo command packet response issue
	usb: renesas-xhci: Fix handling of unknown ROM state
	USB: cdc-acm: blacklist Heimann USB Appset device
	usb: dwc3: Fix debugfs creation flow
	usb: typec: Add the missed altmode_id_remove() in typec_register_altmode()
	xhci: solve a double free problem while doing s4
	gfs2: Fix underflow in gfs2_page_mkwrite
	gfs2: Fix error handling in init_statfs
	ntfs: fix validity check for file name attribute
	selftests/lkdtm: Avoid needing explicit sub-shell
	copy_page_to_iter(): fix ITER_DISCARD case
	iov_iter_fault_in_readable() should do nothing in xarray case
	Input: joydev - prevent use of not validated data in JSIOCSBTNMAP ioctl
	crypto: nx - Fix memcpy() over-reading in nonce
	crypto: ccp - Annotate SEV Firmware file names
	arm_pmu: Fix write counter incorrect in ARMv7 big-endian mode
	ARM: dts: ux500: Fix LED probing
	ARM: dts: at91: sama5d4: fix pinctrl muxing
	btrfs: send: fix invalid path for unlink operations after parent orphanization
	btrfs: compression: don't try to compress if we don't have enough pages
	btrfs: clear defrag status of a root if starting transaction fails
	ext4: cleanup in-core orphan list if ext4_truncate() failed to get a transaction handle
	ext4: fix kernel infoleak via ext4_extent_header
	ext4: fix overflow in ext4_iomap_alloc()
	ext4: return error code when ext4_fill_flex_info() fails
	ext4: correct the cache_nr in tracepoint ext4_es_shrink_exit
	ext4: remove check for zero nr_to_scan in ext4_es_scan()
	ext4: fix avefreec in find_group_orlov
	ext4: use ext4_grp_locked_error in mb_find_extent
	can: bcm: delay release of struct bcm_op after synchronize_rcu()
	can: gw: synchronize rcu operations before removing gw job entry
	can: isotp: isotp_release(): omit unintended hrtimer restart on socket release
	can: j1939: j1939_sk_init(): set SOCK_RCU_FREE to call sk_destruct() after RCU is done
	can: peak_pciefd: pucan_handle_status(): fix a potential starvation issue in TX path
	mac80211: remove iwlwifi specific workaround that broke sta NDP tx
	SUNRPC: Fix the batch tasks count wraparound.
	SUNRPC: Should wake up the privileged task firstly.
	bus: mhi: Wait for M2 state during system resume
	mm/gup: fix try_grab_compound_head() race with split_huge_page()
	perf/smmuv3: Don't trample existing events with global filter
	KVM: nVMX: Handle split-lock #AC exceptions that happen in L2
	KVM: PPC: Book3S HV: Workaround high stack usage with clang
	KVM: x86/mmu: Treat NX as used (not reserved) for all !TDP shadow MMUs
	KVM: x86/mmu: Use MMU's role to detect CR4.SMEP value in nested NPT walk
	s390/cio: dont call css_wait_for_slow_path() inside a lock
	s390: mm: Fix secure storage access exception handling
	f2fs: Prevent swap file in LFS mode
	clk: agilex/stratix10/n5x: fix how the bypass_reg is handled
	clk: agilex/stratix10: remove noc_clk
	clk: agilex/stratix10: fix bypass representation
	rtc: stm32: Fix unbalanced clk_disable_unprepare() on probe error path
	iio: frequency: adf4350: disable reg and clk on error in adf4350_probe()
	iio: light: tcs3472: do not free unallocated IRQ
	iio: ltr501: mark register holding upper 8 bits of ALS_DATA{0,1} and PS_DATA as volatile, too
	iio: ltr501: ltr559: fix initialization of LTR501_ALS_CONTR
	iio: ltr501: ltr501_read_ps(): add missing endianness conversion
	iio: accel: bma180: Fix BMA25x bandwidth register values
	serial: mvebu-uart: fix calculation of clock divisor
	serial: sh-sci: Stop dmaengine transfer in sci_stop_tx()
	serial_cs: Add Option International GSM-Ready 56K/ISDN modem
	serial_cs: remove wrong GLOBETROTTER.cis entry
	ath9k: Fix kernel NULL pointer dereference during ath_reset_internal()
	ssb: sdio: Don't overwrite const buffer if block_write fails
	rsi: Assign beacon rate settings to the correct rate_info descriptor field
	rsi: fix AP mode with WPA failure due to encrypted EAPOL
	tracing/histograms: Fix parsing of "sym-offset" modifier
	tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
	seq_buf: Make trace_seq_putmem_hex() support data longer than 8
	powerpc/stacktrace: Fix spurious "stale" traces in raise_backtrace_ipi()
	loop: Fix missing discard support when using LOOP_CONFIGURE
	evm: Execute evm_inode_init_security() only when an HMAC key is loaded
	evm: Refuse EVM_ALLOW_METADATA_WRITES only if an HMAC key is loaded
	fuse: Fix crash in fuse_dentry_automount() error path
	fuse: Fix crash if superblock of submount gets killed early
	fuse: Fix infinite loop in sget_fc()
	fuse: ignore PG_workingset after stealing
	fuse: check connected before queueing on fpq->io
	fuse: reject internal errno
	thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
	spi: Make of_register_spi_device also set the fwnode
	Add a reference to ucounts for each cred
	staging: media: rkvdec: fix pm_runtime_get_sync() usage count
	media: marvel-ccic: fix some issues when getting pm_runtime
	media: mdk-mdp: fix pm_runtime_get_sync() usage count
	media: s5p: fix pm_runtime_get_sync() usage count
	media: am437x: fix pm_runtime_get_sync() usage count
	media: sh_vou: fix pm_runtime_get_sync() usage count
	media: mtk-vcodec: fix PM runtime get logic
	media: s5p-jpeg: fix pm_runtime_get_sync() usage count
	media: sunxi: fix pm_runtime_get_sync() usage count
	media: sti/bdisp: fix pm_runtime_get_sync() usage count
	media: exynos4-is: fix pm_runtime_get_sync() usage count
	media: exynos-gsc: fix pm_runtime_get_sync() usage count
	spi: spi-loopback-test: Fix 'tx_buf' might be 'rx_buf'
	spi: spi-topcliff-pch: Fix potential double free in pch_spi_process_messages()
	spi: omap-100k: Fix the length judgment problem
	regulator: uniphier: Add missing MODULE_DEVICE_TABLE
	sched/core: Initialize the idle task with preemption disabled
	hwrng: exynos - Fix runtime PM imbalance on error
	crypto: nx - add missing MODULE_DEVICE_TABLE
	media: sti: fix obj-$(config) targets
	media: cpia2: fix memory leak in cpia2_usb_probe
	media: cobalt: fix race condition in setting HPD
	media: hevc: Fix dependent slice segment flags
	media: pvrusb2: fix warning in pvr2_i2c_core_done
	media: imx: imx7_mipi_csis: Fix logging of only error event counters
	crypto: qat - check return code of qat_hal_rd_rel_reg()
	crypto: qat - remove unused macro in FW loader
	crypto: qce: skcipher: Fix incorrect sg count for dma transfers
	arm64: perf: Convert snprintf to sysfs_emit
	sched/fair: Fix ascii art by relpacing tabs
	media: i2c: ov2659: Use clk_{prepare_enable,disable_unprepare}() to set xvclk on/off
	media: bt878: do not schedule tasklet when it is not setup
	media: em28xx: Fix possible memory leak of em28xx struct
	media: hantro: Fix .buf_prepare
	media: cedrus: Fix .buf_prepare
	media: v4l2-core: Avoid the dangling pointer in v4l2_fh_release
	media: bt8xx: Fix a missing check bug in bt878_probe
	media: st-hva: Fix potential NULL pointer dereferences
	crypto: hisilicon/sec - fixup 3des minimum key size declaration
	Makefile: fix GDB warning with CONFIG_RELR
	media: dvd_usb: memory leak in cinergyt2_fe_attach
	memstick: rtsx_usb_ms: fix UAF
	mmc: sdhci-sprd: use sdhci_sprd_writew
	mmc: via-sdmmc: add a check against NULL pointer dereference
	spi: meson-spicc: fix a wrong goto jump for avoiding memory leak.
	spi: meson-spicc: fix memory leak in meson_spicc_probe
	crypto: shash - avoid comparing pointers to exported functions under CFI
	media: dvb_net: avoid speculation from net slot
	media: siano: fix device register error path
	media: imx-csi: Skip first few frames from a BT.656 source
	hwmon: (max31790) Report correct current pwm duty cycles
	hwmon: (max31790) Fix pwmX_enable attributes
	drivers/perf: fix the missed ida_simple_remove() in ddr_perf_probe()
	KVM: PPC: Book3S HV: Fix TLB management on SMT8 POWER9 and POWER10 processors
	btrfs: fix error handling in __btrfs_update_delayed_inode
	btrfs: abort transaction if we fail to update the delayed inode
	btrfs: sysfs: fix format string for some discard stats
	btrfs: don't clear page extent mapped if we're not invalidating the full page
	btrfs: disable build on platforms having page size 256K
	locking/lockdep: Fix the dep path printing for backwards BFS
	lockding/lockdep: Avoid to find wrong lock dep path in check_irq_usage()
	KVM: s390: get rid of register asm usage
	regulator: mt6358: Fix vdram2 .vsel_mask
	regulator: da9052: Ensure enough delay time for .set_voltage_time_sel
	media: Fix Media Controller API config checks
	ACPI: video: use native backlight for GA401/GA502/GA503
	HID: do not use down_interruptible() when unbinding devices
	EDAC/ti: Add missing MODULE_DEVICE_TABLE
	ACPI: processor idle: Fix up C-state latency if not ordered
	hv_utils: Fix passing zero to 'PTR_ERR' warning
	lib: vsprintf: Fix handling of number field widths in vsscanf
	Input: goodix - platform/x86: touchscreen_dmi - Move upside down quirks to touchscreen_dmi.c
	platform/x86: touchscreen_dmi: Add an extra entry for the upside down Goodix touchscreen on Teclast X89 tablets
	platform/x86: touchscreen_dmi: Add info for the Goodix GT912 panel of TM800A550L tablets
	ACPI: EC: Make more Asus laptops use ECDT _GPE
	block_dump: remove block_dump feature in mark_inode_dirty()
	blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter
	blk-mq: clear stale request in tags->rq[] before freeing one request pool
	fs: dlm: cancel work sync othercon
	random32: Fix implicit truncation warning in prandom_seed_state()
	open: don't silently ignore unknown O-flags in openat2()
	drivers: hv: Fix missing error code in vmbus_connect()
	fs: dlm: fix memory leak when fenced
	ACPICA: Fix memory leak caused by _CID repair function
	ACPI: bus: Call kobject_put() in acpi_init() error path
	ACPI: resources: Add checks for ACPI IRQ override
	block: fix race between adding/removing rq qos and normal IO
	platform/x86: asus-nb-wmi: Revert "Drop duplicate DMI quirk structures"
	platform/x86: asus-nb-wmi: Revert "add support for ASUS ROG Zephyrus G14 and G15"
	platform/x86: toshiba_acpi: Fix missing error code in toshiba_acpi_setup_keyboard()
	nvme-pci: fix var. type for increasing cq_head
	nvmet-fc: do not check for invalid target port in nvmet_fc_handle_fcp_rqst()
	EDAC/Intel: Do not load EDAC driver when running as a guest
	PCI: hv: Add check for hyperv_initialized in init_hv_pci_drv()
	cifs: improve fallocate emulation
	ACPI: EC: trust DSDT GPE for certain HP laptop
	clocksource: Retry clock read if long delays detected
	clocksource: Check per-CPU clock synchronization when marked unstable
	tpm_tis_spi: add missing SPI device ID entries
	ACPI: tables: Add custom DSDT file as makefile prerequisite
	HID: wacom: Correct base usage for capacitive ExpressKey status bits
	cifs: fix missing spinlock around update to ses->status
	mailbox: qcom: Use PLATFORM_DEVID_AUTO to register platform device
	block: fix discard request merge
	kthread_worker: fix return value when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
	ia64: mca_drv: fix incorrect array size calculation
	writeback, cgroup: increment isw_nr_in_flight before grabbing an inode
	spi: Allow to have all native CSs in use along with GPIOs
	spi: Avoid undefined behaviour when counting unused native CSs
	media: venus: Rework error fail recover logic
	media: s5p_cec: decrement usage count if disabled
	media: hantro: do a PM resume earlier
	crypto: ixp4xx - dma_unmap the correct address
	crypto: ixp4xx - update IV after requests
	crypto: ux500 - Fix error return code in hash_hw_final()
	sata_highbank: fix deferred probing
	pata_rb532_cf: fix deferred probing
	media: I2C: change 'RST' to "RSET" to fix multiple build errors
	sched/uclamp: Fix wrong implementation of cpu.uclamp.min
	sched/uclamp: Fix locking around cpu_util_update_eff()
	kbuild: Fix objtool dependency for 'OBJECT_FILES_NON_STANDARD_<obj> := n'
	pata_octeon_cf: avoid WARN_ON() in ata_host_activate()
	evm: fix writing <securityfs>/evm overflow
	x86/elf: Use _BITUL() macro in UAPI headers
	crypto: sa2ul - Fix leaks on failure paths with sa_dma_init()
	crypto: sa2ul - Fix pm_runtime enable in sa_ul_probe()
	crypto: ccp - Fix a resource leak in an error handling path
	media: rc: i2c: Fix an error message
	pata_ep93xx: fix deferred probing
	locking/lockdep: Reduce LOCKDEP dependency list
	media: rkvdec: Fix .buf_prepare
	media: exynos4-is: Fix a use after free in isp_video_release
	media: au0828: fix a NULL vs IS_ERR() check
	media: tc358743: Fix error return code in tc358743_probe_of()
	media: gspca/gl860: fix zero-length control requests
	m68k: atari: Fix ATARI_KBD_CORE kconfig unmet dependency warning
	media: siano: Fix out-of-bounds warnings in smscore_load_firmware_family2()
	regulator: fan53880: Fix vsel_mask setting for FAN53880_BUCK
	crypto: nitrox - fix unchecked variable in nitrox_register_interrupts
	crypto: omap-sham - Fix PM reference leak in omap sham ops
	crypto: x86/curve25519 - fix cpu feature checking logic in mod_exit
	crypto: sm2 - remove unnecessary reset operations
	crypto: sm2 - fix a memory leak in sm2
	mmc: usdhi6rol0: fix error return code in usdhi6_probe()
	arm64: consistently use reserved_pg_dir
	arm64/mm: Fix ttbr0 values stored in struct thread_info for software-pan
	media: subdev: remove VIDIOC_DQEVENT_TIME32 handling
	media: s5p-g2d: Fix a memory leak on ctx->fh.m2m_ctx
	hwmon: (lm70) Use device_get_match_data()
	hwmon: (lm70) Revert "hwmon: (lm70) Add support for ACPI"
	hwmon: (max31722) Remove non-standard ACPI device IDs
	hwmon: (max31790) Fix fan speed reporting for fan7..12
	KVM: nVMX: Sync all PGDs on nested transition with shadow paging
	KVM: nVMX: Ensure 64-bit shift when checking VMFUNC bitmap
	KVM: nVMX: Don't clobber nested MMU's A/D status on EPTP switch
	KVM: x86/mmu: Fix return value in tdp_mmu_map_handle_target_level()
	perf/arm-cmn: Fix invalid pointer when access dtc object sharing the same IRQ number
	KVM: arm64: Don't zero the cycle count register when PMCR_EL0.P is set
	regulator: hi655x: Fix pass wrong pointer to config.driver_data
	btrfs: clear log tree recovering status if starting transaction fails
	x86/sev: Make sure IRQs are disabled while GHCB is active
	x86/sev: Split up runtime #VC handler for correct state tracking
	sched/rt: Fix RT utilization tracking during policy change
	sched/rt: Fix Deadline utilization tracking during policy change
	sched/uclamp: Fix uclamp_tg_restrict()
	lockdep: Fix wait-type for empty stack
	lockdep/selftests: Fix selftests vs PROVE_RAW_LOCK_NESTING
	spi: spi-sun6i: Fix chipselect/clock bug
	crypto: nx - Fix RCU warning in nx842_OF_upd_status
	psi: Fix race between psi_trigger_create/destroy
	media: v4l2-async: Clean v4l2_async_notifier_add_fwnode_remote_subdev
	media: video-mux: Skip dangling endpoints
	PM / devfreq: Add missing error code in devfreq_add_device()
	ACPI: PM / fan: Put fan device IDs into separate header file
	block: avoid double io accounting for flush request
	nvme-pci: look for StorageD3Enable on companion ACPI device instead
	ACPI: sysfs: Fix a buffer overrun problem with description_show()
	mark pstore-blk as broken
	clocksource/drivers/timer-ti-dm: Save and restore timer TIOCP_CFG
	extcon: extcon-max8997: Fix IRQ freeing at error path
	ACPI: APEI: fix synchronous external aborts in user-mode
	blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
	blk-wbt: make sure throttle is enabled properly
	ACPI: Use DEVICE_ATTR_<RW|RO|WO> macros
	ACPI: bgrt: Fix CFI violation
	cpufreq: Make cpufreq_online() call driver->offline() on errors
	blk-mq: update hctx->dispatch_busy in case of real scheduler
	ocfs2: fix snprintf() checking
	dax: fix ENOMEM handling in grab_mapping_entry()
	mm/debug_vm_pgtable/basic: add validation for dirtiness after write protect
	mm/debug_vm_pgtable/basic: iterate over entire protection_map[]
	mm/debug_vm_pgtable: ensure THP availability via has_transparent_hugepage()
	swap: fix do_swap_page() race with swapoff
	mm/shmem: fix shmem_swapin() race with swapoff
	mm: memcg/slab: properly set up gfp flags for objcg pointer array
	mm: page_alloc: refactor setup_per_zone_lowmem_reserve()
	mm/page_alloc: fix counting of managed_pages
	xfrm: xfrm_state_mtu should return at least 1280 for ipv6
	drm/bridge/sii8620: fix dependency on extcon
	drm/bridge: Fix the stop condition of drm_bridge_chain_pre_enable()
	drm/amd/dc: Fix a missing check bug in dm_dp_mst_detect()
	drm/ast: Fix missing conversions to managed API
	video: fbdev: imxfb: Fix an error message
	net: mvpp2: Put fwnode in error case during ->probe()
	net: pch_gbe: Propagate error from devm_gpio_request_one()
	pinctrl: renesas: r8a7796: Add missing bias for PRESET# pin
	pinctrl: renesas: r8a77990: JTAG pins do not have pull-down capabilities
	drm/vmwgfx: Mark a surface gpu-dirty after the SVGA3dCmdDXGenMips command
	drm/vmwgfx: Fix cpu updates of coherent multisample surfaces
	net: qrtr: ns: Fix error return code in qrtr_ns_init()
	clk: meson: g12a: fix gp0 and hifi ranges
	net: ftgmac100: add missing error return code in ftgmac100_probe()
	drm: rockchip: set alpha_en to 0 if it is not used
	drm/rockchip: cdn-dp-core: add missing clk_disable_unprepare() on error in cdn_dp_grf_write()
	drm/rockchip: dsi: move all lane config except LCDC mux to bind()
	drm/rockchip: lvds: Fix an error handling path
	drm/rockchip: cdn-dp: fix sign extension on an int multiply for a u64 result
	mptcp: fix pr_debug in mptcp_token_new_connect
	mptcp: generate subflow hmac after mptcp_finish_join()
	RDMA/srp: Fix a recently introduced memory leak
	RDMA/rtrs-clt: Check state of the rtrs_clt_sess before reading its stats
	RDMA/rtrs: Do not reset hb_missed_max after re-connection
	RDMA/rtrs-srv: Fix memory leak of unfreed rtrs_srv_stats object
	RDMA/rtrs-srv: Fix memory leak when having multiple sessions
	RDMA/rtrs-clt: Check if the queue_depth has changed during a reconnection
	RDMA/rtrs-clt: Fix memory leak of not-freed sess->stats and stats->pcpu_stats
	ehea: fix error return code in ehea_restart_qps()
	clk: tegra30: Use 300MHz for video decoder by default
	xfrm: remove the fragment check for ipv6 beet mode
	net/sched: act_vlan: Fix modify to allow 0
	RDMA/core: Sanitize WQ state received from the userspace
	drm/pl111: depend on CONFIG_VEXPRESS_CONFIG
	RDMA/rxe: Fix failure during driver load
	drm/pl111: Actually fix CONFIG_VEXPRESS_CONFIG depends
	drm/vc4: hdmi: Fix error path of hpd-gpios
	clk: vc5: fix output disabling when enabling a FOD
	drm: qxl: ensure surf.data is ininitialized
	tools/bpftool: Fix error return code in do_batch()
	ath10k: go to path err_unsupported when chip id is not supported
	ath10k: add missing error return code in ath10k_pci_probe()
	wireless: carl9170: fix LEDS build errors & warnings
	ieee802154: hwsim: Fix possible memory leak in hwsim_subscribe_all_others
	clk: imx8mq: remove SYS PLL 1/2 clock gates
	wcn36xx: Move hal_buf allocation to devm_kmalloc in probe
	ssb: Fix error return code in ssb_bus_scan()
	brcmfmac: fix setting of station info chains bitmask
	brcmfmac: correctly report average RSSI in station info
	brcmfmac: Fix a double-free in brcmf_sdio_bus_reset
	brcmsmac: mac80211_if: Fix a resource leak in an error handling path
	cw1200: Revert unnecessary patches that fix unreal use-after-free bugs
	ath11k: Fix an error handling path in ath11k_core_fetch_board_data_api_n()
	ath10k: Fix an error code in ath10k_add_interface()
	ath11k: send beacon template after vdev_start/restart during csa
	netlabel: Fix memory leak in netlbl_mgmt_add_common
	RDMA/mlx5: Don't add slave port to unaffiliated list
	netfilter: nft_exthdr: check for IPv6 packet before further processing
	netfilter: nft_osf: check for TCP packet before further processing
	netfilter: nft_tproxy: restrict support to TCP and UDP transport protocols
	RDMA/rxe: Fix qp reference counting for atomic ops
	selftests/bpf: Whitelist test_progs.h from .gitignore
	xsk: Fix missing validation for skb and unaligned mode
	xsk: Fix broken Tx ring validation
	bpf: Fix libelf endian handling in resolv_btfids
	RDMA/rtrs-srv: Set minimal max_send_wr and max_recv_wr
	samples/bpf: Fix Segmentation fault for xdp_redirect command
	samples/bpf: Fix the error return code of xdp_redirect's main()
	mt76: fix possible NULL pointer dereference in mt76_tx
	mt76: mt7615: fix NULL pointer dereference in tx_prepare_skb()
	net: ethernet: aeroflex: fix UAF in greth_of_remove
	net: ethernet: ezchip: fix UAF in nps_enet_remove
	net: ethernet: ezchip: fix error handling
	vrf: do not push non-ND strict packets with a source LLA through packet taps again
	net: sched: add barrier to ensure correct ordering for lockless qdisc
	tls: prevent oversized sendfile() hangs by ignoring MSG_MORE
	netfilter: nf_tables_offload: check FLOW_DISSECTOR_KEY_BASIC in VLAN transfer logic
	pkt_sched: sch_qfq: fix qfq_change_class() error path
	xfrm: Fix xfrm offload fallback fail case
	iwlwifi: increase PNVM load timeout
	rtw88: 8822c: fix lc calibration timing
	vxlan: add missing rcu_read_lock() in neigh_reduce()
	ip6_tunnel: fix GRE6 segmentation
	net/ipv4: swap flow ports when validating source
	net: ti: am65-cpsw-nuss: Fix crash when changing number of TX queues
	tc-testing: fix list handling
	ieee802154: hwsim: Fix memory leak in hwsim_add_one
	ieee802154: hwsim: avoid possible crash in hwsim_del_edge_nl()
	bpf: Fix null ptr deref with mixed tail calls and subprogs
	drm/msm: Fix error return code in msm_drm_init()
	drm/msm/dpu: Fix error return code in dpu_mdss_init()
	mac80211: remove iwlwifi specific workaround NDPs of null_response
	net: bcmgenet: Fix attaching to PYH failed on RPi 4B
	ipv6: exthdrs: do not blindly use init_net
	can: j1939: j1939_sk_setsockopt(): prevent allocation of j1939 filter for optlen == 0
	bpf: Do not change gso_size during bpf_skb_change_proto()
	i40e: Fix error handling in i40e_vsi_open
	i40e: Fix autoneg disabling for non-10GBaseT links
	i40e: Fix missing rtnl locking when setting up pf switch
	Revert "ibmvnic: remove duplicate napi_schedule call in open function"
	ibmvnic: set ltb->buff to NULL after freeing
	ibmvnic: free tx_pool if tso_pool alloc fails
	RDMA/cma: Protect RMW with qp_mutex
	net: macsec: fix the length used to copy the key for offloading
	net: phy: mscc: fix macsec key length
	net: atlantic: fix the macsec key length
	ipv6: fix out-of-bound access in ip6_parse_tlv()
	e1000e: Check the PCIm state
	net: dsa: sja1105: fix NULL pointer dereference in sja1105_reload_cbs()
	bpfilter: Specify the log level for the kmsg message
	RDMA/cma: Fix incorrect Packet Lifetime calculation
	gve: Fix swapped vars when fetching max queues
	Revert "be2net: disable bh with spin_lock in be_process_mcc"
	Bluetooth: mgmt: Fix slab-out-of-bounds in tlv_data_is_valid
	Bluetooth: Fix not sending Set Extended Scan Response
	Bluetooth: Fix Set Extended (Scan Response) Data
	Bluetooth: Fix handling of HCI_LE_Advertising_Set_Terminated event
	clk: actions: Fix UART clock dividers on Owl S500 SoC
	clk: actions: Fix SD clocks factor table on Owl S500 SoC
	clk: actions: Fix bisp_factor_table based clocks on Owl S500 SoC
	clk: actions: Fix AHPPREDIV-H-AHB clock chain on Owl S500 SoC
	clk: qcom: clk-alpha-pll: fix CAL_L write in alpha_pll_fabia_prepare
	clk: si5341: Wait for DEVICE_READY on startup
	clk: si5341: Avoid divide errors due to bogus register contents
	clk: si5341: Check for input clock presence and PLL lock on startup
	clk: si5341: Update initialization magic
	writeback: fix obtain a reference to a freeing memcg css
	net: lwtunnel: handle MTU calculation in forwading
	net: sched: fix warning in tcindex_alloc_perfect_hash
	net: tipc: fix FB_MTU eat two pages
	RDMA/mlx5: Don't access NULL-cleared mpi pointer
	RDMA/core: Always release restrack object
	MIPS: Fix PKMAP with 32-bit MIPS huge page support
	staging: fbtft: Rectify GPIO handling
	staging: fbtft: Don't spam logs when probe is deferred
	ASoC: rt5682: Disable irq on shutdown
	rcu: Invoke rcu_spawn_core_kthreads() from rcu_spawn_gp_kthread()
	serial: fsl_lpuart: don't modify arbitrary data on lpuart32
	serial: fsl_lpuart: remove RTSCTS handling from get_mctrl()
	serial: 8250_omap: fix a timeout loop condition
	tty: nozomi: Fix a resource leak in an error handling function
	mwifiex: re-fix for unaligned accesses
	iio: adis_buffer: do not return ints in irq handlers
	iio: adis16400: do not return ints in irq handlers
	iio: adis16475: do not return ints in irq handlers
	iio: accel: bma180: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: accel: bma220: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: accel: hid: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: accel: kxcjk-1013: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: accel: mxc4005: Fix overread of data and alignment issue.
	iio: accel: stk8312: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: accel: stk8ba50: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: adc: ti-ads1015: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: adc: vf610: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: gyro: bmg160: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: humidity: am2315: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: prox: srf08: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: prox: pulsed-light: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: prox: as3935: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: magn: hmc5843: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: magn: bmc150: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: light: isl29125: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: light: tcs3414: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: light: tcs3472: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: chemical: atlas: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: cros_ec_sensors: Fix alignment of buffer in iio_push_to_buffers_with_timestamp()
	iio: potentiostat: lmp91000: Fix alignment of buffer in iio_push_to_buffers_with_timestamp()
	ASoC: rk3328: fix missing clk_disable_unprepare() on error in rk3328_platform_probe()
	ASoC: hisilicon: fix missing clk_disable_unprepare() on error in hi6210_i2s_startup()
	backlight: lm3630a_bl: Put fwnode in error case during ->probe()
	ASoC: rsnd: tidyup loop on rsnd_adg_clk_query()
	Input: hil_kbd - fix error return code in hil_dev_connect()
	perf scripting python: Fix tuple_set_u64()
	mtd: partitions: redboot: seek fis-index-block in the right node
	mtd: rawnand: arasan: Ensure proper configuration for the asserted target
	staging: mmal-vchiq: Fix incorrect static vchiq_instance.
	char: pcmcia: error out if 'num_bytes_read' is greater than 4 in set_protocol()
	firmware: stratix10-svc: Fix a resource leak in an error handling path
	tty: nozomi: Fix the error handling path of 'nozomi_card_init()'
	leds: class: The -ENOTSUPP should never be seen by user space
	leds: lm3532: select regmap I2C API
	leds: lm36274: Put fwnode in error case during ->probe()
	leds: lm3692x: Put fwnode in any case during ->probe()
	leds: lm3697: Don't spam logs when probe is deferred
	leds: lp50xx: Put fwnode in error case during ->probe()
	scsi: FlashPoint: Rename si_flags field
	scsi: iscsi: Flush block work before unblock
	mfd: mp2629: Select MFD_CORE to fix build error
	mfd: rn5t618: Fix IRQ trigger by changing it to level mode
	fsi: core: Fix return of error values on failures
	fsi: scom: Reset the FSI2PIB engine for any error
	fsi: occ: Don't accept response from un-initialized OCC
	fsi/sbefifo: Clean up correct FIFO when receiving reset request from SBE
	fsi/sbefifo: Fix reset timeout
	visorbus: fix error return code in visorchipset_init()
	iommu/amd: Fix extended features logging
	s390/irq: select HAVE_IRQ_EXIT_ON_IRQ_STACK
	s390: enable HAVE_IOREMAP_PROT
	s390: appldata depends on PROC_SYSCTL
	selftests: splice: Adjust for handler fallback removal
	iommu/dma: Fix IOVA reserve dma ranges
	ASoC: max98373-sdw: use first_hw_init flag on resume
	ASoC: rt1308-sdw: use first_hw_init flag on resume
	ASoC: rt5682-sdw: use first_hw_init flag on resume
	ASoC: rt700-sdw: use first_hw_init flag on resume
	ASoC: rt711-sdw: use first_hw_init flag on resume
	ASoC: rt715-sdw: use first_hw_init flag on resume
	ASoC: rt5682: fix getting the wrong device id when the suspend_stress_test
	ASoC: rt5682-sdw: set regcache_cache_only false before reading RT5682_DEVICE_ID
	ASoC: mediatek: mtk-btcvsd: Fix an error handling path in 'mtk_btcvsd_snd_probe()'
	usb: gadget: f_fs: Fix setting of device and driver data cross-references
	usb: dwc2: Don't reset the core after setting turnaround time
	eeprom: idt_89hpesx: Put fwnode in matching case during ->probe()
	eeprom: idt_89hpesx: Restore printing the unsupported fwnode name
	thunderbolt: Bond lanes only when dual_link_port != NULL in alloc_dev_default()
	iio: adc: at91-sama5d2: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: adc: hx711: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: adc: mxs-lradc: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: adc: ti-ads8688: Fix alignment of buffer in iio_push_to_buffers_with_timestamp()
	iio: magn: rm3100: Fix alignment of buffer in iio_push_to_buffers_with_timestamp()
	iio: light: vcnl4000: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	ASoC: fsl_spdif: Fix error handler with pm_runtime_enable
	staging: gdm724x: check for buffer overflow in gdm_lte_multi_sdu_pkt()
	staging: gdm724x: check for overflow in gdm_lte_netif_rx()
	staging: rtl8712: fix error handling in r871xu_drv_init
	staging: rtl8712: fix memory leak in rtl871x_load_fw_cb
	coresight: core: Fix use of uninitialized pointer
	staging: mt7621-dts: fix pci address for PCI memory range
	serial: 8250: Actually allow UPF_MAGIC_MULTIPLIER baud rates
	iio: light: vcnl4035: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	iio: prox: isl29501: Fix buffer alignment in iio_push_to_buffers_with_timestamp()
	ASoC: cs42l42: Correct definition of CS42L42_ADC_PDN_MASK
	of: Fix truncation of memory sizes on 32-bit platforms
	mtd: rawnand: marvell: add missing clk_disable_unprepare() on error in marvell_nfc_resume()
	habanalabs: Fix an error handling path in 'hl_pci_probe()'
	scsi: mpt3sas: Fix error return value in _scsih_expander_add()
	soundwire: stream: Fix test for DP prepare complete
	phy: uniphier-pcie: Fix updating phy parameters
	phy: ti: dm816x: Fix the error handling path in 'dm816x_usb_phy_probe()
	extcon: sm5502: Drop invalid register write in sm5502_reg_data
	extcon: max8997: Add missing modalias string
	powerpc/powernv: Fix machine check reporting of async store errors
	ASoC: atmel-i2s: Fix usage of capture and playback at the same time
	configfs: fix memleak in configfs_release_bin_file
	ASoC: Intel: sof_sdw: add SOF_RT715_DAI_ID_FIX for AlderLake
	ASoC: fsl_spdif: Fix unexpected interrupt after suspend
	leds: as3645a: Fix error return code in as3645a_parse_node()
	leds: ktd2692: Fix an error handling path
	selftests/ftrace: fix event-no-pid on 1-core machine
	serial: 8250: 8250_omap: Disable RX interrupt after DMA enable
	serial: 8250: 8250_omap: Fix possible interrupt storm on K3 SoCs
	powerpc: Offline CPU in stop_this_cpu()
	powerpc/papr_scm: Properly handle UUID types and API
	powerpc/64s: Fix copy-paste data exposure into newly created tasks
	powerpc/papr_scm: Make 'perf_stats' invisible if perf-stats unavailable
	ALSA: firewire-lib: Fix 'amdtp_domain_start()' when no AMDTP_OUT_STREAM stream is found
	serial: mvebu-uart: do not allow changing baudrate when uartclk is not available
	serial: mvebu-uart: correctly calculate minimal possible baudrate
	arm64: dts: marvell: armada-37xx: Fix reg for standard variant of UART
	vfio/pci: Handle concurrent vma faults
	mm/pmem: avoid inserting hugepage PTE entry with fsdax if hugepage support is disabled
	mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK
	mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled()
	mm/huge_memory.c: don't discard hugepage if other processes are mapping it
	mm/hugetlb: use helper huge_page_order and pages_per_huge_page
	mm/hugetlb: remove redundant check in preparing and destroying gigantic page
	hugetlb: remove prep_compound_huge_page cleanup
	include/linux/huge_mm.h: remove extern keyword
	mm/z3fold: fix potential memory leak in z3fold_destroy_pool()
	mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page
	lib/math/rational.c: fix divide by zero
	selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
	selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
	selftests/vm/pkeys: refill shadow register after implicit kernel write
	perf llvm: Return -ENOMEM when asprintf() fails
	csky: fix syscache.c fallthrough warning
	csky: syscache: Fixup duplicate cache flush
	exfat: handle wrong stream entry size in exfat_readdir()
	scsi: fc: Correct RHBA attributes length
	scsi: target: cxgbit: Unmap DMA buffer before calling target_execute_cmd()
	mailbox: qcom-ipcc: Fix IPCC mbox channel exhaustion
	fscrypt: don't ignore minor_hash when hash is 0
	fscrypt: fix derivation of SipHash keys on big endian CPUs
	tpm: Replace WARN_ONCE() with dev_err_once() in tpm_tis_status()
	erofs: fix error return code in erofs_read_superblock()
	block: return the correct bvec when checking for gaps
	io_uring: fix blocking inline submission
	mmc: block: Disable CMDQ on the ioctl path
	mmc: vub3000: fix control-request direction
	media: exynos4-is: remove a now unused integer
	scsi: core: Retry I/O for Notify (Enable Spinup) Required error
	crypto: qce - fix error return code in qce_skcipher_async_req_handle()
	s390: preempt: Fix preempt_count initialization
	cred: add missing return error code when set_cred_ucounts() failed
	iommu/dma: Fix compile warning in 32-bit builds
	powerpc/preempt: Don't touch the idle task's preempt_count during hotplug
	Linux 5.10.50

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iec4eab24ea8eb5a6d79739a1aec8432d93a8f82c
2021-07-14 17:35:23 +02:00
Christian Brauner
019d04f914 open: don't silently ignore unknown O-flags in openat2()
[ Upstream commit cfe80306a0dd6d363934913e47c3f30d71b721e5 ]

The new openat2() syscall verifies that no unknown O-flag values are
set and returns an error to userspace if they are while the older open
syscalls like open() and openat() simply ignore unknown flag values:

  #define O_FLAG_CURRENTLY_INVALID (1 << 31)
  struct open_how how = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID,
          .resolve = 0,
  };

  /* fails */
  fd = openat2(-EBADF, "/dev/null", &how, sizeof(how));

  /* succeeds */
  fd = openat(-EBADF, "/dev/null", O_RDONLY | O_FLAG_CURRENTLY_INVALID);

However, openat2() silently truncates the upper 32 bits meaning:

  #define O_FLAG_CURRENTLY_INVALID_LOWER32 (1 << 31)
  #define O_FLAG_CURRENTLY_INVALID_UPPER32 (1 << 40)

  struct open_how how_lowe32 = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_LOWER32,
  };

  struct open_how how_upper32 = {
          .flags = O_RDONLY | O_FLAG_CURRENTLY_INVALID_UPPER32,
  };

  /* fails */
  fd = openat2(-EBADF, "/dev/null", &how_lower32, sizeof(how_lower32));

  /* succeeds */
  fd = openat2(-EBADF, "/dev/null", &how_upper32, sizeof(how_upper32));

Fix this by preventing the immediate truncation in build_open_flags().

There's a snafu here though stripping FMODE_* directly from flags would
cause the upper 32 bits to be truncated as well due to integer promotion
rules since FMODE_* is unsigned int, O_* are signed ints (yuck).

In addition, struct open_flags currently defines flags to be 32 bit
which is reasonable. If we simply were to bump it to 64 bit we would
need to change a lot of code preemptively which doesn't seem worth it.
So simply add a compile-time check verifying that all currently known
O_* flags are within the 32 bit range and fail to build if they aren't
anymore.

This change shouldn't regress old open syscalls since they silently
truncate any unknown values anyway. It is a tiny semantic change for
openat2() but it is very unlikely people pass ing > 32 bit unknown flags
and the syscall is relatively new too.

Link: https://lore.kernel.org/r/20210528092417.3942079-3-brauner@kernel.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reported-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14 16:55:59 +02:00
Kuan-Ying Lee
a7a3b31d58 ANDROID: syscall_check: add vendor hook for open syscall
Through this vendor hook, we can get the timing to check
current running task for the validation of its credential
and open operation.

Bug: 191291287

Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Change-Id: Ia644ceb02dbc230ee1d25cad3630c2c3f908e41a
2021-07-09 13:48:44 +00:00
Collin Fijalkovich
28b4b1588e FROMLIST: mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs
Transparent huge pages are supported for read-only non-shmem files,
but are only used for vmas with VM_DENYWRITE. This condition ensures that
file THPs are protected from writes while an application is running
(ETXTBSY).  Any existing file THPs are then dropped from the page cache
when a file is opened for write in do_dentry_open(). Since sys_mmap
ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
produced by execve().

Systems that make heavy use of shared libraries (e.g. Android) are unable
to apply VM_DENYWRITE through the dynamic linker, preventing them from
benefiting from the resultant reduced contention on the TLB.

This patch reduces the constraint on file THPs allowing use with any
executable mapping from a file not opened for write (see
inode_is_open_for_write()). It also introduces additional conditions to
ensure that files opened for write will never be backed by file THPs.

Restricting the use of THPs to executable mappings eliminates the risk that
a read-only file later opened for write would encounter significant
latencies due to page cache truncation.

The ld linker flag '-z max-page-size=(hugepage size)' can be used to
produce executables with the necessary layout. The dynamic linker must
map these file's segments at a hugepage size aligned vma for the mapping to
be backed with THPs.

Comparison of the performance characteristics of 4KB and 2MB-backed
libraries follows; the Android dex2oat tool was used to AOT compile an
example application on a single ARM core.

4KB Pages:
==========

count              event_name            # count / runtime
598,995,035,942    cpu-cycles            # 1.800861 GHz
 81,195,620,851    raw-stall-frontend    # 244.112 M/sec
347,754,466,597    iTLB-loads            # 1.046 G/sec
  2,970,248,900    iTLB-load-misses      # 0.854122% miss rate

Total test time: 332.854998 seconds.

2MB Pages:
==========

count              event_name            # count / runtime
592,872,663,047    cpu-cycles            # 1.800358 GHz
 76,485,624,143    raw-stall-frontend    # 232.261 M/sec
350,478,413,710    iTLB-loads            # 1.064 G/sec
    803,233,322    iTLB-load-misses      # 0.229182% miss rate

Total test time: 329.826087 seconds

A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:

/apex/com.android.art/lib64/libart.so
FilePmdMapped:      4096 kB

/apex/com.android.art/lib64/libart-compiler.so
FilePmdMapped:      2048 kB

Bug: 158135888
Link: https://lore.kernel.org/patchwork/patch/1408266/

Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Collin Fijalkovich <cfijalkovich@google.com>
Change-Id: I75c693a4b4e7526d374ef2c010bde3094233eef2
2021-04-28 18:41:35 +00:00
Alistair Delva
cf8f7947f2 ANDROID: Add filp_open_block() for zram
We currently plan to disallow use of filp_open() from drivers in GKI,
however the ZRAM driver still needs it. Add a new GKI-only variant of
filp_open() which only permits a block device to be opened, which can
be exported instead. This keeps ZRAM working but cuts down on drivers
that attempt to open and write files in kernel mode.

Bug: 179220339
Change-Id: Id696b4aaf204b0499ce0a1b6416648670236e570
Signed-off-by: Alistair Delva <adelva@google.com>
2021-02-03 18:27:38 +00:00
Aleksa Sarai
aa606ebab1 openat2: reject RESOLVE_BENEATH|RESOLVE_IN_ROOT
commit 398840f8bb935d33c64df4ec4fed77a7d24c267d upstream.

This was an oversight in the original implementation, as it makes no
sense to specify both scoping flags to the same openat2(2) invocation
(before this patch, the result of such an invocation was equivalent to
RESOLVE_IN_ROOT being ignored).

This is a userspace-visible ABI change, but the only user of openat2(2)
at the moment is LXC which doesn't specify both flags and so no
userspace programs will break as a result.

Fixes: fddb5d430a ("open: introduce openat2(2) syscall")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <stable@vger.kernel.org> # v5.6+
Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:54:24 +01:00
Kees Cook
633fb6ac39 exec: move S_ISREG() check earlier
The execve(2)/uselib(2) syscalls have always rejected non-regular files.
Recently, it was noticed that a deadlock was introduced when trying to
execute pipes, as the S_ISREG() test was happening too late.  This was
fixed in commit 73601ea5b7 ("fs/open.c: allow opening only regular files
during execve()"), but it was added after inode_permission() had already
run, which meant LSMs could see bogus attempts to execute non-regular
files.

Move the test into the other inode type checks (which already look for
other pathological conditions[1]).  Since there is no need to use
FMODE_EXEC while we still have access to "acc_mode", also switch the test
to MAY_EXEC.

Also include a comment with the redundant S_ISREG() checks at the end of
execve(2)/uselib(2) to note that they are present to avoid any mistakes.

My notes on the call path, and related arguments, checks, etc:

do_open_execat()
    struct open_flags open_exec_flags = {
        .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
        .acc_mode = MAY_EXEC,
        ...
    do_filp_open(dfd, filename, open_flags)
        path_openat(nameidata, open_flags, flags)
            file = alloc_empty_file(open_flags, current_cred());
            do_open(nameidata, file, open_flags)
                may_open(path, acc_mode, open_flag)
		    /* new location of MAY_EXEC vs S_ISREG() test */
                    inode_permission(inode, MAY_OPEN | acc_mode)
                        security_inode_permission(inode, acc_mode)
                vfs_open(path, file)
                    do_dentry_open(file, path->dentry->d_inode, open)
                        /* old location of FMODE_EXEC vs S_ISREG() test */
                        security_file_open(f)
                        open()

[1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:01 -07:00
Linus Torvalds
e1ec517e18 Merge branch 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull init and set_fs() cleanups from Al Viro:
 "Christoph's 'getting rid of ksys_...() uses under KERNEL_DS' series"

* 'hch.init_path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (50 commits)
  init: add an init_dup helper
  init: add an init_utimes helper
  init: add an init_stat helper
  init: add an init_mknod helper
  init: add an init_mkdir helper
  init: add an init_symlink helper
  init: add an init_link helper
  init: add an init_eaccess helper
  init: add an init_chmod helper
  init: add an init_chown helper
  init: add an init_chroot helper
  init: add an init_chdir helper
  init: add an init_rmdir helper
  init: add an init_unlink helper
  init: add an init_umount helper
  init: add an init_mount helper
  init: mark create_dev as __init
  init: mark console_on_rootfs as __init
  init: initialize ramdisk_execute_command at compile time
  devtmpfs: refactor devtmpfsd()
  ...
2020-08-07 09:40:34 -07:00
Christoph Hellwig
eb9d7d390e init: add an init_eaccess helper
Add a simple helper to check if a file exists based on kernel space file
name and switch the early init code over to it.  Note that this
theoretically changes behavior as it always is based on the effective
permissions.  But during early init that doesn't make a difference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig
1097742efc init: add an init_chmod helper
Add a simple helper to chmod with a kernel space file name and switch
the early init code over to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:53 +02:00
Christoph Hellwig
b873498f99 init: add an init_chown helper
Add a simple helper to chown with a kernel space file name and switch
the early init code over to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig
4b7ca5014c init: add an init_chroot helper
Add a simple helper to chroot with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_chroot.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig
db63f1e315 init: add an init_chdir helper
Add a simple helper to chdir with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_chdir.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-31 08:17:52 +02:00
Christoph Hellwig
b25ba7c3c9 fs: remove ksys_fchmod
Fold it into the only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-31 08:16:01 +02:00
Christoph Hellwig
166e07c37c fs: remove ksys_open
Just open code it in the two callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-31 08:16:00 +02:00
Christoph Hellwig
9e96c8c0e9 fs: add a vfs_fchmod helper
Add a helper for struct file based chmode operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:04 +02:00
Christoph Hellwig
c04011fe8c fs: add a vfs_fchown helper
Add a helper for struct file based chown operations.  To be used by
the initramfs code soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16 15:33:00 +02:00
Christian Brauner
60997c3d45
close_range: add CLOSE_RANGE_UNSHARE
One of the use-cases of close_range() is to drop file descriptors just before
execve(). This would usually be expressed in the sequence:

unshare(CLONE_FILES);
close_range(3, ~0U);

as pointed out by Linus it might be desirable to have this be a part of
close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
maximum number of file descriptors to copy from the old struct files. When the
user requests that all file descriptors are supposed to be closed via
close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
to do any of the heavy fput() work for everything above min.

The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
fact currently share our file descriptor table we create a new private copy.
We then close all fds in the requested range and finally after we're done we
install the new fd table.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-17 00:07:38 +02:00
Christian Brauner
278a5fbaed
open: add close_range()
This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.

I was contacted by FreeBSD as they wanted to have the same close_range()
syscall as we proposed here. We've coordinated this and in the meantime, Kyle
was fast enough to merge close_range() into FreeBSD already in April:
https://reviews.freebsd.org/D21627
https://svnweb.freebsd.org/base?view=revision&revision=359836
and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2])
once its merged in Linux too. Python is in the process of switching to
close_range() on FreeBSD and they are waiting on us to merge this to switch on
Linux as well: https://bugs.python.org/issue38061

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating file
descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
        int dir_fd;
        DIR *dir;
        struct dirent *direntp;

        dir = opendir("/proc/self/fd");
        if (!dir)
                return -1;
        dir_fd = dirfd(dir);
        while ((direntp = readdir(dir))) {
                int fd;
                if (strcmp(direntp->d_name, ".") == 0)
                        continue;
                if (strcmp(direntp->d_name, "..") == 0)
                        continue;
                fd = atoi(direntp->d_name);
                if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                        continue;
                close(fd);
        }
        closedir(dir);
        return 0;
}

to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():    ~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():   ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the files_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  files_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to the fdtable.
  If my reasoning is correct then there's really no point in fancyfying
  __close_range(): We just need to rcu-dereference the fdtable of the
  calling task once to cap the max_fd value correctly and then go on
  calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: 9e4f2f3a6b/Modules/_posixsubprocess.c (L220)
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: 5238e95759/src/basic/fd-util.c (L217)
[5]: ddf4b77e11/src/lxc/start.c (L236)
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
     Note that this is an internal implementation that is not exported.
     Currently, libc seems to not provide an exported version of this
     because of missing kernel support to do this.
     Note, in a recent patch series Florian made grantpt() a nop thereby
     removing the code referenced here.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: 5f47c0613e/src/libstd/sys/unix/process2.rs (L303-L308)
     Rust's solution is slightly different but is equally unperformant.
     Rust calls getdtablesize() which is a glibc library function that
     simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
     goes on to call close() on each fd. That's obviously overkill for most
     tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
     OPEN_MAX.
     Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
     to 1024. Even in this case, there's a very high chance that in the
     common case Rust is calling the close() syscall 1021 times pointlessly
     if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kyle Evans <self@kyle-evans.net>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
2020-06-17 00:05:19 +02:00
Linus Torvalds
94709049fb Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "A few little subsystems and a start of a lot of MM patches.

  Subsystems affected by this patch series: squashfs, ocfs2, parisc,
  vfs. With mm subsystems: slab-generic, slub, debug, pagecache, gup,
  swap, memcg, pagemap, memory-failure, vmalloc, kasan"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits)
  kasan: move kasan_report() into report.c
  mm/mm_init.c: report kasan-tag information stored in page->flags
  ubsan: entirely disable alignment checks under UBSAN_TRAP
  kasan: fix clang compilation warning due to stack protector
  x86/mm: remove vmalloc faulting
  mm: remove vmalloc_sync_(un)mappings()
  x86/mm/32: implement arch_sync_kernel_mappings()
  x86/mm/64: implement arch_sync_kernel_mappings()
  mm/ioremap: track which page-table levels were modified
  mm/vmalloc: track which page-table levels were modified
  mm: add functions to track page directory modifications
  s390: use __vmalloc_node in stack_alloc
  powerpc: use __vmalloc_node in alloc_vm_stack
  arm64: use __vmalloc_node in arch_alloc_vmap_stack
  mm: remove vmalloc_user_node_flags
  mm: switch the test_vmalloc module to use __vmalloc_node
  mm: remove __vmalloc_node_flags_caller
  mm: remove both instances of __vmalloc_node_flags
  mm: remove the prot argument to __vmalloc_node
  mm: remove the pgprot argument to __vmalloc
  ...
2020-06-02 12:21:36 -07:00
Jeff Layton
735e4ae5ba vfs: track per-sb writeback errors and report them to syncfs
Patch series "vfs: have syncfs() return error when there are writeback
errors", v6.

Currently, syncfs does not return errors when one of the inodes fails to
be written back.  It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.

The basic idea in this set is to track writeback errors at the
superblock level, so that we can quickly and easily check whether
something bad happened without having to fsync each file individually.
syncfs is then changed to reliably report writeback errors after they
occur, much in the same fashion as fsync does now.

This patch (of 2):

Usually we suggest that applications call fsync when they want to ensure
that all data written to the file has made it to the backing store, but
that can be inefficient when there are a lot of open files.

Calling syncfs on the filesystem can be more efficient in some
situations, but the error reporting doesn't currently work the way most
people expect.  If a single inode on a filesystem reports a writeback
error, syncfs won't necessarily return an error.  syncfs only returns an
error if __sync_blockdev fails, and on some filesystems that's a no-op.

It would be better if syncfs reported an error if there were any
writeback failures.  Then applications could call syncfs to see if there
are any errors on any open files, and could then call fsync on all of
the other descriptors to figure out which one failed.

This patch adds a new errseq_t to struct super_block, and has
mapping_set_error also record writeback errors there.

To report those errors, we also need to keep an errseq_t in struct file
to act as a cursor.  This patch adds a dedicated field for that purpose,
which slots nicely into 4 bytes of padding at the end of struct file on
x86_64.

An earlier version of this patch used an O_PATH file descriptor to cue
the kernel that the open file should track the superblock error and not
the inode's writeback error.

I think that API is just too weird though.  This is simpler and should
make syncfs error reporting "just work" even if someone is multiplexing
fsync and syncfs on the same fds.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andres Freund <andres@anarazel.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20200428135155.19223-1-jlayton@kernel.org
Link: http://lkml.kernel.org/r/20200428135155.19223-2-jlayton@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:05 -07:00
Miklos Szeredi
c8ffd8bcdd vfs: add faccessat2 syscall
POSIX defines faccessat() as having a fourth "flags" argument, while the
linux syscall doesn't have it.  Glibc tries to emulate AT_EACCESS and
AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.

Add a new faccessat(2) syscall with the added flags argument and implement
both flags.

The value of AT_EACCESS is defined in glibc headers to be the same as
AT_REMOVEDIR.  Use this value for the kernel interface as well, together
with the explanatory comment.

Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
be useful and is trivial to implement.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:25 +02:00
Miklos Szeredi
9470451505 vfs: split out access_override_creds()
Split out a helper that overrides the credentials in preparation for
actually doing the access check.

This prepares for the next patch that optionally disables the creds
override.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-05-14 16:44:24 +02:00
Linus Torvalds
9c577491b9 Merge branch 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pathwalk sanitizing from Al Viro:
 "Massive pathwalk rewrite and cleanups.

  Several iterations have been posted; hopefully this thing is getting
  readable and understandable now. Pretty much all parts of pathname
  resolutions are affected...

  The branch is identical to what has sat in -next, except for commit
  message in "lift all calls of step_into() out of follow_dotdot/
  follow_dotdot_rcu", crediting Qian Cai for reporting the bug; only
  commit message changed there."

* 'work.dotdot1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (69 commits)
  lookup_open(): don't bother with fallbacks to lookup+create
  atomic_open(): no need to pass struct open_flags anymore
  open_last_lookups(): move complete_walk() into do_open()
  open_last_lookups(): lift O_EXCL|O_CREAT handling into do_open()
  open_last_lookups(): don't abuse complete_walk() when all we want is unlazy
  open_last_lookups(): consolidate fsnotify_create() calls
  take post-lookup part of do_last() out of loop
  link_path_walk(): sample parent's i_uid and i_mode for the last component
  __nd_alloc_stack(): make it return bool
  reserve_stack(): switch to __nd_alloc_stack()
  pick_link(): take reserving space on stack into a new helper
  pick_link(): more straightforward handling of allocation failures
  fold path_to_nameidata() into its only remaining caller
  pick_link(): pass it struct path already with normal refcounting rules
  fs/namei.c: kill follow_mount()
  non-RCU analogue of the previous commit
  helper for mount rootwards traversal
  follow_dotdot(): be lazy about changing nd->path
  follow_dotdot_rcu(): be lazy about changing nd->path
  follow_dotdot{,_rcu}(): massage loops
  ...
2020-04-02 12:30:08 -07:00
Al Viro
d9a9f4849f cifs_atomic_open(): fix double-put on late allocation failure
several iterations of ->atomic_open() calling conventions ago, we
used to need fput() if ->atomic_open() failed at some point after
successful finish_open().  Now (since 2016) it's not needed -
struct file carries enough state to make fput() work regardless
of the point in struct file lifecycle and discarding it on
failure exits in open() got unified.  Unfortunately, I'd missed
the fact that we had an instance of ->atomic_open() (cifs one)
that used to need that fput(), as well as the stale comment in
finish_open() demanding such late failure handling.  Trivially
fixed...

Fixes: fe9ec8291f "do_last(): take fput() on error after opening to out:"
Cc: stable@kernel.org # v4.7+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-12 18:25:20 -04:00
Al Viro
31d1726d72 make build_open_flags() treat O_CREAT | O_EXCL as implying O_NOFOLLOW
O_CREAT | O_EXCL means "-EEXIST if we run into a trailing symlink".
As it is, we might or might not have LOOKUP_FOLLOW in op->intent
in that case - that depends upon having O_NOFOLLOW in open flags.
It doesn't matter, since we won't be checking it in that case -
do_last() bails out earlier.

However, making sure it's not set (i.e. acting as if we had an explicit
O_NOFOLLOW) makes the behaviour more explicit and allows to reorder the
check for O_CREAT | O_EXCL in do_last() with the call of step_into()
immediately following it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-27 14:43:56 -05:00
Jens Axboe
35cb6d54c1 fs: make build_open_flags() available internally
This is a prep patch for supporting non-blocking open from io_uring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-20 17:01:53 -07:00
Aleksa Sarai
fddb5d430a open: introduce openat2(2) syscall
/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.

In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.

Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).

/* Syscall Prototype. */
  /*
   * open_how is an extensible structure (similar in interface to
   * clone3(2) or sched_setattr(2)). The size parameter must be set to
   * sizeof(struct open_how), to allow for future extensions. All future
   * extensions will be appended to open_how, with their zero value
   * acting as a no-op default.
   */
  struct open_how { /* ... */ };

  int openat2(int dfd, const char *pathname,
              struct open_how *how, size_t size);

/* Description. */
The initial version of 'struct open_how' contains the following fields:

  flags
    Used to specify openat(2)-style flags. However, any unknown flag
    bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
    will result in -EINVAL. In addition, this field is 64-bits wide to
    allow for more O_ flags than currently permitted with openat(2).

  mode
    The file mode for O_CREAT or O_TMPFILE.

    Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

  resolve
    Restrict path resolution (in contrast to O_* flags they affect all
    path components). The current set of flags are as follows (at the
    moment, all of the RESOLVE_ flags are implemented as just passing
    the corresponding LOOKUP_ flag).

    RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
    RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
    RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
    RESOLVE_BENEATH       => LOOKUP_BENEATH
    RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT

open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it). It also only contains u64s (even
though ->mode arguably should be a u16) to avoid having padding fields
which are never used in the future.

Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).

After input from Florian Weimer, the new open_how and flag definitions
are inside a separate header from uapi/linux/fcntl.h, to avoid problems
that glibc has with importing that header.

/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.

In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).

/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).

Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.

Another possible avenue of future work would be some kind of
CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
which openat2(2) flags and fields are supported by the current kernel
(to avoid userspace having to go through several guesses to figure it
out).

[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014bb8 ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
[6]: https://youtu.be/ggD-eb3yPVs

Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-18 09:19:18 -05:00
Linus Torvalds
2be7d348fe Revert "vfs: properly and reliably lock f_pos in fdget_pos()"
This reverts commit 0be0ee7181.

I was hoping it would be benign to switch over entirely to FMODE_STREAM,
and we'd have just a couple of small fixups we'd need, but it looks like
we're not quite there yet.

While it worked fine on both my desktop and laptop, they are fairly
similar in other respects, and run mostly the same loads.  Kenneth
Crudup reports that it seems to break both his vmware installation and
the KDE upower service.  In both cases apparently leading to timeouts
due to waitinmg for the f_pos lock.

There are a number of character devices in particular that definitely
want stream-like behavior, but that currently don't get marked as
streams, and as a result get the exclusion between concurrent
read()/write() on the same file descriptor.  Which doesn't work well for
them.

The most obvious example if this is /dev/console and /dev/tty, which use
console_fops and tty_fops respectively (and ptmx_fops for the pty master
side).  It may be that it's just this that causes problems, but we
clearly weren't ready yet.

Because there's a number of other likely common cases that don't have
llseek implementations and would seem to act as stream devices:

  /dev/fuse		(fuse_dev_operations)
  /dev/mcelog		(mce_chrdev_ops)
  /dev/mei0		(mei_fops)
  /dev/net/tun		(tun_fops)
  /dev/nvme0		(nvme_dev_fops)
  /dev/tpm0		(tpm_fops)
  /proc/self/ns/mnt	(ns_file_operations)
  /dev/snd/pcm*		(snd_pcm_f_ops[])

and while some of these could be trivially automatically detected by the
vfs layer when the character device is opened by just noticing that they
have no read or write operations either, it often isn't that obvious.

Some character devices most definitely do use the file position, even if
they don't allow seeking: the firmware update code, for example, uses
simple_read_from_buffer() that does use f_pos, but doesn't allow seeking
back and forth.

We'll revisit this when there's a better way to detect the problem and
fix it (possibly with a coccinelle script to do more of the FMODE_STREAM
annotations).

Reported-by: Kenneth R. Crudup <kenny@panix.com>
Cc: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-26 11:34:06 -08:00
Linus Torvalds
0be0ee7181 vfs: properly and reliably lock f_pos in fdget_pos()
fdget_pos() is used by file operations that will read and update f_pos:
things like "read()", "write()" and "lseek()" (but not, for example,
"pread()/pwrite" that get their file positions elsewhere).

However, it had two separate escape clauses for this, because not
everybody wants or needs serialization of the file position.

The first and most obvious case is the "file descriptor doesn't have a
position at all", ie a stream-like file.  Except we didn't actually use
FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
was that FMODE_STREAM didn't exist back in the days, but also that we
didn't want to mark all the special cases, so we only marked the ones
that _required_ position atomicity according to POSIX - regular files
and directories.

The case one was intentionally lazy, but now that we _do_ have
FMODE_STREAM we could and should just use it.  With the change to use
FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
the code to set it is deleted.

Any cases where we don't want the serialization because the driver (or
subsystem) doesn't use the file position should just be updated to do
"stream_open()".  We've done that for all the obvious and common
situations, we may need a few more.  Quoting Kirill Smelkov in the
original FMODE_STREAM thread (see link below for full email):

 "And I appreciate if people could help at least somehow with "getting
  rid of mixed case entirely" (i.e. always lock f_pos_lock on
  !FMODE_STREAM), because this transition starts to diverge from my
  particular use-case too far. To me it makes sense to do that
  transition as follows:

   - convert nonseekable_open -> stream_open via stream_open.cocci;
   - audit other nonseekable_open calls and convert left users that
     truly don't depend on position to stream_open;
   - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
     will cover pipes and sockets), or maybe convert pipes and sockets
     to FMODE_STREAM manually;
   - extend stream_open.cocci to analyze file_operations that use
     no_llseek or noop_llseek, but do not use nonseekable_open or
     alloc_file_pseudo. This might find files that have stream semantic
     but are opened differently;
   - extend stream_open.cocci to analyze file_operations whose
     .read/.write do not use ppos at all (independently of how file was
     opened);
   - ...
   - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
     !FMODE_STREAM;
   - gather bug reports for deadlocked read/write and convert missed
     cases to FMODE_STREAM, probably extending stream_open.cocci along
     the road to catch similar cases

  i.e. always take f_pos_lock unless a file is explicitly marked as
  being stream, and try to find and cover all files that are streams"

We have not done the "extend stream_open.cocci to analyze
alloc_file_pseudo" as well, but the previous commit did manually handle
the case of pipes and sockets.

The other case where we can avoid locking f_pos is the "this file
descriptor only has a single user and it is us, and thus there is no
need to lock it".

The second test was correct, although a bit subtle and worth just
re-iterating here.  There are two kinds of other sources of references
to the same file descriptor: file descriptors that have been explicitly
shared across fork() or with dup(), and file tables having elevated
reference counts due to threading (or explicit file sharing with
clone()).

The first case would have incremented the file count explicitly, and in
the second case the previous __fdget() would have incremented it for us
and set the FDPUT_FPUT flag.

But in both cases the file count would be greater than one, so the
"file_count(file) > 1" test catches both situations.  Also note that if
file_count is 1, that also means that no other thread can have access to
the file table, so there also cannot be races with concurrent calls to
dup()/fork()/clone() that would increment the file count any other way.

Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-25 10:01:53 -08:00
Denis Efremov
7159d54418 fs: remove unlikely() from WARN_ON() condition
"unlikely(WARN_ON(x))" is excessive. WARN_ON() already uses unlikely()
internally.

Link: http://lkml.kernel.org/r/20190829165025.15750-5-efremov@linux.com
Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 10:10:30 -07:00
Song Liu
09d91cda0e mm,thp: avoid writes to file with THP in pagecache
In previous patch, an application could put part of its text section in
THP via madvise().  These THPs will be protected from writes when the
application is still running (TXTBSY).  However, after the application
exits, the file is available for writes.

This patch avoids writes to file THP by dropping page cache for the file
when the file is open for write.  A new counter nr_thps is added to struct
address_space.  In do_dentry_open(), if the file is open for write and
nr_thps is non-zero, we drop page cache for the whole file.

Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Linus Torvalds
d7852fbd0f access: avoid the RCU grace period for the temporary subjective credentials
It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
work because it installs a temporary credential that gets allocated and
freed for each system call.

The allocation and freeing overhead is mostly benign, but because
credentials can be accessed under the RCU read lock, the freeing
involves a RCU grace period.

Which is not a huge deal normally, but if you have a lot of access()
calls, this causes a fair amount of seconday damage: instead of having a
nice alloc/free patterns that hits in hot per-CPU slab caches, you have
all those delayed free's, and on big machines with hundreds of cores,
the RCU overhead can end up being enormous.

But it turns out that all of this is entirely unnecessary.  Exactly
because access() only installs the credential as the thread-local
subjective credential, the temporary cred pointer doesn't actually need
to be RCU free'd at all.  Once we're done using it, we can just free it
synchronously and avoid all the RCU overhead.

So add a 'non_rcu' flag to 'struct cred', which can be set by users that
know they only use it in non-RCU context (there are other potential
users for this).  We can make it a union with the rcu freeing list head
that we need for the RCU case, so this doesn't need any extra storage.

Note that this also makes 'get_current_cred()' clear the new non_rcu
flag, in case we have filesystems that take a long-term reference to the
cred and then expect the RCU delayed freeing afterwards.  It's not
entirely clear that this is required, but it makes for clear semantics:
the subjective cred remains non-RCU as long as you only access it
synchronously using the thread-local accessors, but you _can_ use it as
a generic cred if you want to.

It is possible that we should just remove the whole RCU markings for
->cred entirely.  Only ->real_cred is really supposed to be accessed
through RCU, and the long-term cred copies that nfs uses might want to
explicitly re-enable RCU freeing if required, rather than have
get_current_cred() do it implicitly.

But this is a "minimal semantic changes" change for the immediate
problem.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Glauber <jglauber@marvell.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-24 10:12:09 -07:00
Thomas Gleixner
457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Kirill Smelkov
438ab720c6 vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files
This amends commit 10dce8af34 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:

Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().

Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
because they are called by vfs_read/vfs_write & friends before
file_operations .read/.write .

Note 2: if file backend uses new-style .read_iter/.write_iter, position
is still passed into there as non-pointer kiocb.ki_pos . Currently
stream_open.cocci (semantic patch added by 10dce8af34) ignores files
whose file_operations has *_iter methods.

Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
2019-05-06 17:46:52 +03:00
Kirill Smelkov
10dce8af34 fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
Commit 9c225f2655 ("vfs: atomic f_pos accesses as per POSIX") added
locking for file.f_pos access and in particular made concurrent read and
write not possible - now both those functions take f_pos lock for the
whole run, and so if e.g. a read is blocked waiting for data, write will
deadlock waiting for that read to complete.

This caused regression for stream-like files where previously read and
write could run simultaneously, but after that patch could not do so
anymore. See e.g. commit 581d21a2d0 ("xenbus: fix deadlock on writes
to /proc/xen/xenbus") which fixes such regression for particular case of
/proc/xen/xenbus.

The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
safety for read/write/lseek and added the locking to file descriptors of
all regular files. In 2014 that thread-safety problem was not new as it
was already discussed earlier in 2006.

However even though 2006'th version of Linus's patch was adding f_pos
locking "only for files that are marked seekable with FMODE_LSEEK (thus
avoiding the stream-like objects like pipes and sockets)", the 2014
version - the one that actually made it into the tree as 9c225f2655 -
is doing so irregardless of whether a file is seekable or not.

See

    https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
    https://lwn.net/Articles/180387
    https://lwn.net/Articles/180396

for historic context.

The reason that it did so is, probably, that there are many files that
are marked non-seekable, but e.g. their read implementation actually
depends on knowing current position to correctly handle the read. Some
examples:

	kernel/power/user.c		snapshot_read
	fs/debugfs/file.c		u32_array_read
	fs/fuse/control.c		fuse_conn_waiting_read + ...
	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
	arch/s390/hypfs/inode.c		hypfs_read_iter
	...

Despite that, many nonseekable_open users implement read and write with
pure stream semantics - they don't depend on passed ppos at all. And for
those cases where read could wait for something inside, it creates a
situation similar to xenbus - the write could be never made to go until
read is done, and read is waiting for some, potentially external, event,
for potentially unbounded time -> deadlock.

Besides xenbus, there are 14 such places in the kernel that I've found
with semantic patch (see below):

	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()

In addition to the cases above another regression caused by f_pos
locking is that now FUSE filesystems that implement open with
FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
stream-like files - for the same reason as above e.g. read can deadlock
write locking on file.f_pos in the kernel.

FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990f7 ("fuse:
implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
write routines not depending on current position at all, and with both
read and write being potentially blocking operations:

See

    https://github.com/libfuse/osspd
    https://lwn.net/Articles/308445

    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
    https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510

Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
"somewhat pipe-like files ..." with read handler not using offset.
However that test implements only read without write and cannot exercise
the deadlock scenario:

    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
    https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216

I've actually hit the read vs write deadlock for real while implementing
my FUSE filesystem where there is /head/watch file, for which open
creates separate bidirectional socket-like stream in between filesystem
and its user with both read and write being later performed
simultaneously. And there it is semantically not easy to split the
stream into two separate read-only and write-only channels:

    https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169

Let's fix this regression. The plan is:

1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
   doing so would break many in-kernel nonseekable_open users which
   actually use ppos in read/write handlers.

2. Add stream_open() to kernel to open stream-like non-seekable file
   descriptors. Read and write on such file descriptors would never use
   nor change ppos. And with that property on stream-like files read and
   write will be running without taking f_pos lock - i.e. read and write
   could be running simultaneously.

3. With semantic patch search and convert to stream_open all in-kernel
   nonseekable_open users for which read and write actually do not
   depend on ppos and where there is no other methods in file_operations
   which assume @offset access.

4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
   steam_open if that bit is present in filesystem open reply.

   It was tempting to change fs/fuse/ open handler to use stream_open
   instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
   grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
   and in particular GVFS which actually uses offset in its read and
   write handlers

	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481

   so if we would do such a change it will break a real user.

5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
   from v3.14+ (the kernel where 9c225f2655 first appeared).

   This will allow to patch OSSPD and other FUSE filesystems that
   provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
   in their open handler and this way avoid the deadlock on all kernel
   versions. This should work because fs/fuse/ ignores unknown open
   flags returned from a filesystem and so passing FOPEN_STREAM to a
   kernel that is not aware of this flag cannot hurt. In turn the kernel
   that is not aware of FOPEN_STREAM will be < v3.14 where just
   FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
   write deadlock.

This patch adds stream_open, converts /proc/xen/xenbus to it and adds
semantic patch to automatically locate in-kernel places that are either
required to be converted due to read vs write deadlock, or that are just
safe to be converted because read and write do not use ppos and there
are no other funky methods in file_operations.

Regarding semantic patch I've verified each generated change manually -
that it is correct to convert - and each other nonseekable_open instance
left - that it is either not correct to convert there, or that it is not
converted due to current stream_open.cocci limitations.

The script also does not convert files that should be valid to convert,
but that currently have .llseek = noop_llseek or generic_file_llseek for
unknown reason despite file being opened with nonseekable_open (e.g.
drivers/input/mousedev.c)

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Yongzhi Pan <panyongzhi@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Nikolaus Rath <Nikolaus@rath.org>
Cc: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-06 07:01:55 -10:00
Tetsuo Handa
73601ea5b7 fs/open.c: allow opening only regular files during execve()
syzbot is hitting lockdep warning [1] due to trying to open a fifo
during an execve() operation.  But we don't need to open non regular
files during an execve() operation, for all files which we will need are
the executable file itself and the interpreter programs like /bin/sh and
ld-linux.so.2 .

Since the manpage for execve(2) says that execve() returns EACCES when
the file or a script interpreter is not a regular file, and the manpage
for uselib(2) says that uselib() can return EACCES, and we use
FMODE_EXEC when opening for execve()/uselib(), we can bail out if a non
regular file is requested with FMODE_EXEC set.

Since this deadlock followed by khungtaskd warnings is trivially
reproducible by a local unprivileged user, and syzbot's frequent crash
due to this deadlock defers finding other bugs, let's workaround this
deadlock until we get a chance to find a better solution.

[1] https://syzkaller.appspot.com/bug?id=b5095bfec44ec84213bac54742a82483aad578ce

Link: http://lkml.kernel.org/r/1552044017-7890-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Reported-by: syzbot <syzbot+e93a80c1bb7c5c56e522461c149f8bf55eab1b2b@syzkaller.appspotmail.com>
Fixes: 8924feff66 ("splice: lift pipe_lock out of splice_to_pipe()")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>	[4.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Linus Torvalds
d9a185f8b4 overlayfs update for 4.19
This contains two new features:
 
  1) Stack file operations: this allows removal of several hacks from the
     VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.
 
  2) Metadata only copy-up: when file is on lower layer and only metadata is
     modified (except size) then only copy up the metadata and continue to
     use the data from the lower file.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
 PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
 fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
 =QDXH
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:
 "This contains two new features:

   - Stack file operations: this allows removal of several hacks from
     the VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.

   - Metadata only copy-up: when file is on lower layer and only
     metadata is modified (except size) then only copy up the metadata
     and continue to use the data from the lower file"

* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
  ovl: Enable metadata only feature
  ovl: Do not do metacopy only for ioctl modifying file attr
  ovl: Do not do metadata only copy-up for truncate operation
  ovl: add helper to force data copy-up
  ovl: Check redirect on index as well
  ovl: Set redirect on upper inode when it is linked
  ovl: Set redirect on metacopy files upon rename
  ovl: Do not set dentry type ORIGIN for broken hardlinks
  ovl: Add an inode flag OVL_CONST_INO
  ovl: Treat metacopy dentries as type OVL_PATH_MERGE
  ovl: Check redirects for metacopy files
  ovl: Move some dir related ovl_lookup_single() code in else block
  ovl: Do not expose metacopy only dentry from d_real()
  ovl: Open file with data except for the case of fsync
  ovl: Add helper ovl_inode_realdata()
  ovl: Store lower data inode in ovl_inode
  ovl: Fix ovl_getattr() to get number of blocks from lower
  ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
  ovl: Copy up meta inode data from lowest data inode
  ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
  ...
2018-08-21 18:19:09 -07:00
Miklos Szeredi
8cf9ee5061 Revert "vfs: do get_write_access() on upper layer of overlayfs"
This reverts commit 4d0c5ba2ff.

We now get write access on both overlay and underlying layers so this patch
is no longer needed for correct operation.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
4ab30319fd Revert "vfs: add flags to d_real()"
This reverts commit 495e642939.

No user of "flags" argument of d_real() remain.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
6742cee043 Revert "ovl: don't allow writing ioctl on lower layer"
This reverts commit 7c6893e3c9.

Overlayfs no longer relies on the vfs for checking writability of files.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
a6518f73e6 vfs: don't open real
Let overlayfs do its thing when opening a file.

This enables stacking and fixes the corner case when a file is opened for
read, modified through a writable open, and data is read from the read-only
file.  After this patch the read-only open will not return stale data even
in this case.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-18 15:44:42 +02:00
Miklos Szeredi
d3b1084dfd vfs: make open_with_fake_path() not contribute to nr_files
Stacking file operations in overlay will store an extra open file for each
overlay file opened.

The overhead is just that of "struct file" which is about 256bytes, because
overlay already pins an extra dentry and inode when the file is open, which
add up to a much larger overhead.

For fear of breaking working setups, don't start accounting the extra file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:40 +02:00
Al Viro
2abc77af89 new helper: open_with_fake_path()
open a file by given inode, faking ->f_path.  Use with shitloads
of caution - at the very least you'd damn better make sure that
some dentry alias of that inode is pinned down by the path in
question.  Again, this is no general-purpose interface and I hope
it will eventually go away.  Right now overlayfs wants something
like that, but nothing else should.

Any out-of-tree code with bright idea of using this one *will*
eventually get hurt, with zero notice and great delight on my part.
I refuse to use EXPORT_SYMBOL_GPL(), especially in situations when
it's really EXPORT_SYMBOL_DONT_USE_IT(), but don't take that export
as "you are welcome to use it".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 11:18:42 -04:00
Al Viro
64e1ac4d46 ->atomic_open(): return 0 in all success cases
FMODE_OPENED can be used to distingusish "successful open" from the
"called finish_no_open(), do it yourself" cases.  Since finish_no_open()
has been adjusted, no changes in the instances were actually needed.
The caller has been adjusted.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:21 -04:00