Commit Graph

5601 Commits

Author SHA1 Message Date
Tejun Heo
70748bba55 blk-iocost: fix weight updates of inner active iocgs
commit e9f4eee9a0023ba22db9560d4cc6ee63f933dae8 upstream.

When the weight of an active iocg is updated, weight_updated() is called
which in turn calls __propagate_weights() to update the active and inuse
weights so that the effective hierarchical weights are update accordingly.

The current implementation is incorrect for inner active nodes. For an
active leaf iocg, inuse can be any value between 1 and active and the
difference represents how much the iocg is donating. When weight is updated,
as long as inuse is clamped between 1 and the new weight, we're alright and
this is what __propagate_weights() currently implements.

However, that's not how an active inner node's inuse is set. An inner node's
inuse is solely determined by the ratio between the sums of inuse's and
active's of its children - ie. they're results of propagating the leaves'
active and inuse weights upwards. __propagate_weights() incorrectly applies
the same clamping as for a leaf when an active inner node's weight is
updated. Consider a hierarchy which looks like the following with saturating
workloads in AA and BB.

     R
   /   \
  A     B
  |     |
 AA     BB

1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.

2. echo 200 > A/io.weight

3. __propagate_weights() update A's active to 200 and leave inuse at 100 as
   it's already between 1 and the new active, making A:active=200,
   A:inuse=100. As R's active_sum is updated along with A's active,
   A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
   hwi's remain unchanged at 0.5.

4. The weight of A is now twice that of B but AA and BB still have the same
   hwi of 0.5 and thus are doing the same amount of IOs.

Fix it by making __propgate_weights() always calculate the inuse of an
active inner iocg based on the ratio of child_inuse_sum to child_active_sum.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Schatzberg <dschatzberg@fb.com>
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-19 10:13:11 +02:00
Ming Lei
d8ef677e32 FROMGIT: block: avoid double io accounting for flush request
For flush request, rq->end_io() may be called two times, one is from
timeout handling(blk_mq_check_expired()), another is from normal
completion(__blk_mq_end_request()).

Move blk_account_io_flush() after flush_rq->ref drops to zero, so
io accounting can be done just once for flush request.

Fixes: b686631865 ("block: add iostat counters for flush requests")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 188199752
(cherry picked from commit 773cd5fb22e7c61c65c7528a3e1cb5bcbc1408ea git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Change-Id: I0cbcb35fce831daab99145d2401a6fdeb5d575b4
2021-05-14 13:50:36 -07:00
Bart Van Assche
e4d47d9a03 FROMLIST: blk-mq: Swap two calls in blk_mq_exit_queue()
If a tag set is shared across request queues (e.g. SCSI LUNs) then the
block layer core keeps track of the number of active request queues in
tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that
atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make
sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is
cleared by blk_mq_del_queue_tag_set().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca30 ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Bug: 181696921
Link: https://lore.kernel.org/linux-block/d1b1f123-9b47-8805-0b86-2fa9f6b19bb5@kernel.dk/T/#ma8a01d264c59074d3f9b69f833ffe408592cf837
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Change-Id: I3f9ca00fbefec97d1e3c9df85365be0265b8cec6
2021-05-14 10:17:49 -07:00
Bart Van Assche
a1fbf0ead8 Revert "BACKPORT: bio: limit bio max size"
Revert most of commit 25a6f60b717d ("BACKPORT: bio: limit bio max size")
because it has been reported to cause data corruption and because it has
been reverted upstream. See also
https://lore.kernel.org/linux-block/1620571445.2k94orj8ee.none@localhost/T/#t

Cc: Changheun Lee <nanich.lee@samsung.com>
Cc: Jaegeuk Kim <jaegeuk@google.com>
Bug: 182716953
Change-Id: I79bf39d1d1a0c13cb30ff4b0f1da2b43e1c817f0
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2021-05-11 09:34:37 -07:00
Jeffle Xu
be6f5cf52c UPSTREAM: block: fix inflight statistics of part0
The inflight of partition 0 doesn't include inflight IOs to all
sub-partitions, since currently mq calculates inflight of specific
partition by simply camparing the value of the partition pointer.

Thus the following case is possible:

$ cat /sys/block/vda/inflight
       0        0
$ cat /sys/block/vda/vda1/inflight
       0      128

While single queue device (on a previous version, e.g. v3.10) has no
this issue:

$cat /sys/block/sda/sda3/inflight
       0       33
$cat /sys/block/sda/inflight
       0       33

Partition 0 should be specially handled since it represents the whole
disk. This issue is introduced since commit bf0ddaba65 ("blk-mq: fix
sysfs inflight counter").

Besides, this patch can also fix the inflight statistics of part 0 in
/proc/diskstats. Before this patch, the inflight statistics of part 0
doesn't include that of sub partitions. (I have marked the 'inflight'
field with asterisk.)

$cat /proc/diskstats
 259       0 nvme0n1 45974469 0 367814768 6445794 1 0 1 0 *0* 111062 6445794 0 0 0 0 0 0
 259       2 nvme0n1p1 45974058 0 367797952 6445727 0 0 0 0 *33* 111001 6445727 0 0 0 0 0 0

This is introduced since commit f299b7c7a9 ("blk-mq: provide internal
in-flight variant").

Fixes: bf0ddaba65 ("blk-mq: fix sysfs inflight counter")
Fixes: f299b7c7a9 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: adapt for 5.11 partition change]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 187355247
Change-Id: I378b2cb7312a5e47d5e2ec7301dc392e6e7336d0
(cherry picked from commit b0d97557ebfc9d5ba5f2939339a9fdd267abafeb)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2021-05-07 10:42:41 -07:00
Changheun Lee
9458fa0dda BACKPORT: bio: limit bio max size
bio size can grow up to 4GB when muli-page bvec is enabled.
but sometimes it would lead to inefficient behaviors.
in case of large chunk direct I/O, - 32MB chunk read in user space -
all pages for 32MB would be merged to a bio structure if the pages
physical addresses are contiguous. it makes some delay to submit
until merge complete. bio max size should be limited to a proper size.

When 32MB chunk read with direct I/O option is coming from userspace,
kernel behavior is below now in do_direct_IO() loop. it's timeline.

 | bio merge for 32MB. total 8,192 pages are merged.
 | total elapsed time is over 2ms.
 |------------------ ... ----------------------->|
                                                 | 8,192 pages merged a bio.
                                                 | at this time, first bio submit is done.
                                                 | 1 bio is split to 32 read request and issue.
                                                 |--------------->
                                                  |--------------->
                                                   |--------------->
                                                              ......
                                                                   |--------------->
                                                                    |--------------->|
                          total 19ms elapsed to complete 32MB read done from device. |

If bio max size is limited with 1MB, behavior is changed below.

 | bio merge for 1MB. 256 pages are merged for each bio.
 | total 32 bio will be made.
 | total elapsed time is over 2ms. it's same.
 | but, first bio submit timing is fast. about 100us.
 |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
      | 256 pages merged a bio.
      | at this time, first bio submit is done.
      | and 1 read request is issued for 1 bio.
      |--------------->
           |--------------->
                |--------------->
                                      ......
                                                 |--------------->
                                                  |--------------->|
        total 17ms elapsed to complete 32MB read done from device. |

As a result, read request issue timing is faster if bio max size is limited.
Current kernel behavior with multipage bvec, super large bio can be created.
And it lead to delay first I/O request issue.

Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210503095203.29076-1-nanich.lee@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 182716953
(cherry picked from commit cd2c7545ae1beac3b6aae033c7f31193b3255946)
Change-Id: Ie3876daa495535dc7f856ed9a281e65d72a437c1
Signed-off-by: Bart Van Assche <bvanassche@google.com>
2021-05-07 07:14:07 -07:00
Greg Kroah-Hartman
0907114be2 Merge 5.10.33 into android12-5.10
Changes in 5.10.33
	vhost-vdpa: protect concurrent access to vhost device iotlb
	gpio: omap: Save and restore sysconfig
	KEYS: trusted: Fix TPM reservation for seal/unseal
	vdpa/mlx5: Set err = -ENOMEM in case dma_map_sg_attrs fails
	pinctrl: lewisburg: Update number of pins in community
	block: return -EBUSY when there are open partitions in blkdev_reread_part
	pinctrl: core: Show pin numbers for the controllers with base = 0
	arm64: dts: allwinner: Revert SD card CD GPIO for Pine64-LTS
	bpf: Permits pointers on stack for helper calls
	bpf: Allow variable-offset stack access
	bpf: Refactor and streamline bounds check into helper
	bpf: Tighten speculative pointer arithmetic mask
	locking/qrwlock: Fix ordering in queued_write_lock_slowpath()
	perf/x86/intel/uncore: Remove uncore extra PCI dev HSWEP_PCI_PCU_3
	perf/x86/kvm: Fix Broadwell Xeon stepping in isolation_ucodes[]
	perf auxtrace: Fix potential NULL pointer dereference
	perf map: Fix error return code in maps__clone()
	HID: google: add don USB id
	HID: alps: fix error return code in alps_input_configured()
	HID cp2112: fix support for multiple gpiochips
	HID: wacom: Assign boolean values to a bool variable
	soc: qcom: geni: shield geni_icc_get() for ACPI boot
	dmaengine: xilinx: dpdma: Fix descriptor issuing on video group
	dmaengine: xilinx: dpdma: Fix race condition in done IRQ
	ARM: dts: Fix swapped mmc order for omap3
	net: geneve: check skb is large enough for IPv4/IPv6 header
	dmaengine: tegra20: Fix runtime PM imbalance on error
	s390/entry: save the caller of psw_idle
	arm64: kprobes: Restore local irqflag if kprobes is cancelled
	xen-netback: Check for hotplug-status existence before watching
	cavium/liquidio: Fix duplicate argument
	kasan: fix hwasan build for gcc
	csky: change a Kconfig symbol name to fix e1000 build error
	ia64: fix discontig.c section mismatches
	ia64: tools: remove duplicate definition of ia64_mf() on ia64
	x86/crash: Fix crash_setup_memmap_entries() out-of-bounds access
	net: hso: fix NULL-deref on disconnect regression
	USB: CDC-ACM: fix poison/unpoison imbalance
	Linux 5.10.33

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I638db3c919ad938eaaaac3d687175252edcd7990
2021-04-29 13:57:47 +02:00
Christoph Hellwig
fc2454cc0c block: return -EBUSY when there are open partitions in blkdev_reread_part
[ Upstream commit 68e6582e8f2dc32fd2458b9926564faa1fb4560e ]

The switch to go through blkdev_get_by_dev means we now ignore the
return value from bdev_disk_changed in __blkdev_get.  Add a manual
check to restore the old semantics.

Fixes: 4601b4b130de ("block: reopen the device in blkdev_reread_part")
Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210421160502.447418-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-28 13:39:59 +02:00
Greg Kroah-Hartman
ab8b108b0a Merge 5.10.31 into android12-5.10
Changes in 5.10.31
	interconnect: core: fix error return code of icc_link_destroy()
	gfs2: Flag a withdraw if init_threads() fails
	KVM: arm64: Hide system instruction access to Trace registers
	KVM: arm64: Disable guest access to trace filter controls
	drm/imx: imx-ldb: fix out of bounds array access warning
	gfs2: report "already frozen/thawed" errors
	ftrace: Check if pages were allocated before calling free_pages()
	tools/kvm_stat: Add restart delay
	drm/tegra: dc: Don't set PLL clock to 0Hz
	gpu: host1x: Use different lock classes for each client
	XArray: Fix splitting to non-zero orders
	block: only update parent bi_status when bio fail
	radix tree test suite: Register the main thread with the RCU library
	idr test suite: Take RCU read lock in idr_find_test_1
	idr test suite: Create anchor before launching throbber
	null_blk: fix command timeout completion handling
	io_uring: don't mark S_ISBLK async work as unbounded
	riscv,entry: fix misaligned base for excp_vect_table
	block: don't ignore REQ_NOWAIT for direct IO
	netfilter: x_tables: fix compat match/target pad out-of-bound write
	perf map: Tighten snprintf() string precision to pass gcc check on some 32-bit arches
	net: sfp: relax bitrate-derived mode check
	net: sfp: cope with SFPs that set both LOS normal and LOS inverted
	xen/events: fix setting irq affinity
	Linux 5.10.31

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I19a7cfbdaab23e578dd82c552aea86d367c2f40f
2021-04-16 16:01:44 +02:00
Yufen Yu
1d2310d95f block: only update parent bi_status when bio fail
[ Upstream commit 3edf5346e4f2ce2fa0c94651a90a8dda169565ee ]

For multiple split bios, if one of the bio is fail, the whole
should return error to application. But we found there is a race
between bio_integrity_verify_fn and bio complete, which return
io success to application after one of the bio fail. The race as
following:

split bio(READ)          kworker

nvme_complete_rq
blk_update_request //split error=0
  bio_endio
    bio_integrity_endio
      queue_work(kintegrityd_wq, &bip->bip_work);

                         bio_integrity_verify_fn
                         bio_endio //split bio
                          __bio_chain_endio
                             if (!parent->bi_status)

                               <interrupt entry>
                               nvme_irq
                                 blk_update_request //parent error=7
                                 req_bio_endio
                                    bio->bi_status = 7 //parent bio
                               <interrupt exit>

                               parent->bi_status = 0
                        parent->bi_end_io() // return bi_status=0

The bio has been split as two: split and parent. When split
bio completed, it depends on kworker to do endio, while
bio_integrity_verify_fn have been interrupted by parent bio
complete irq handler. Then, parent bio->bi_status which have
been set in irq handler will overwrite by kworker.

In fact, even without the above race, we also need to conside
the concurrency beteen mulitple split bio complete and update
the same parent bi_status. Normally, multiple split bios will
be issued to the same hctx and complete from the same irq
vector. But if we have updated queue map between multiple split
bios, these bios may complete on different hw queue and different
irq vector. Then the concurrency update parent bi_status may
cause the final status error.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210331115359.1125679-1-yuyufen@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-16 11:43:21 +02:00
Matthias Maennich
2e4b322b06 ANDROID: Add OWNERS files referring to the respective android-mainline OWNERS
This was generated with
  $ build/synchronize_owners common-mainline/ android-mainline common12-5.10/

Bug: 184248201
Signed-off-by: Matthias Maennich <maennich@google.com>
Change-Id: I5e56eb34fcbb5a2a013dd03bc9dcc4f159fb90de
2021-04-03 14:11:30 +00:00
Greg Kroah-Hartman
b9a61f9a56 Merge 5.10.27 into android12-5.10
Changes in 5.10.27
	mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument
	mm/memcg: set memcg when splitting page
	mt76: fix tx skb error handling in mt76_dma_tx_queue_skb
	net: stmmac: fix dma physical address of descriptor when display ring
	net: fec: ptp: avoid register access when ipg clock is disabled
	powerpc/4xx: Fix build errors from mfdcr()
	atm: eni: dont release is never initialized
	atm: lanai: dont run lanai_dev_close if not open
	Revert "r8152: adjust the settings about MAC clock speed down for RTL8153"
	ALSA: hda: ignore invalid NHLT table
	ixgbe: Fix memleak in ixgbe_configure_clsu32
	scsi: ufs: ufs-qcom: Disable interrupt in reset path
	blk-cgroup: Fix the recursive blkg rwstat
	net: tehuti: fix error return code in bdx_probe()
	net: intel: iavf: fix error return code of iavf_init_get_resources()
	sun/niu: fix wrong RXMAC_BC_FRM_CNT_COUNT count
	gianfar: fix jumbo packets+napi+rx overrun crash
	cifs: ask for more credit on async read/write code paths
	gfs2: fix use-after-free in trans_drain
	cpufreq: blacklist Arm Vexpress platforms in cpufreq-dt-platdev
	gpiolib: acpi: Add missing IRQF_ONESHOT
	nfs: fix PNFS_FLEXFILE_LAYOUT Kconfig default
	NFS: Correct size calculation for create reply length
	net: hisilicon: hns: fix error return code of hns_nic_clear_all_rx_fetch()
	net: wan: fix error return code of uhdlc_init()
	net: davicom: Use platform_get_irq_optional()
	net: enetc: set MAC RX FIFO to recommended value
	atm: uPD98402: fix incorrect allocation
	atm: idt77252: fix null-ptr-dereference
	cifs: change noisy error message to FYI
	irqchip/ingenic: Add support for the JZ4760
	kbuild: add image_name to no-sync-config-targets
	kbuild: dummy-tools: fix inverted tests for gcc
	umem: fix error return code in mm_pci_probe()
	sparc64: Fix opcode filtering in handling of no fault loads
	habanalabs: Call put_pid() when releasing control device
	staging: rtl8192e: fix kconfig dependency on CRYPTO
	u64_stats,lockdep: Fix u64_stats_init() vs lockdep
	kselftest: arm64: Fix exit code of sve-ptrace
	regulator: qcom-rpmh: Correct the pmic5_hfsmps515 buck
	block: Fix REQ_OP_ZONE_RESET_ALL handling
	drm/amd/display: Revert dram_clock_change_latency for DCN2.1
	drm/amdgpu: fb BO should be ttm_bo_type_device
	drm/radeon: fix AGP dependency
	nvme: simplify error logic in nvme_validate_ns()
	nvme: add NVME_REQ_CANCELLED flag in nvme_cancel_request()
	nvme-fc: set NVME_REQ_CANCELLED in nvme_fc_terminate_exchange()
	nvme-fc: return NVME_SC_HOST_ABORTED_CMD when a command has been aborted
	nvme-core: check ctrl css before setting up zns
	nvme-rdma: Fix a use after free in nvmet_rdma_write_data_done
	nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a
	nfs: we don't support removing system.nfs4_acl
	block: Suppress uevent for hidden device when removed
	mm/fork: clear PASID for new mm
	ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls
	ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign
	static_call: Pull some static_call declarations to the type headers
	static_call: Allow module use without exposing static_call_key
	static_call: Fix the module key fixup
	static_call: Fix static_call_set_init()
	KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish
	btrfs: fix sleep while in non-sleep context during qgroup removal
	selinux: don't log MAC_POLICY_LOAD record on failed policy load
	selinux: fix variable scope issue in live sidtab conversion
	netsec: restore phy power state after controller reset
	platform/x86: intel-vbtn: Stop reporting SW_DOCK events
	psample: Fix user API breakage
	z3fold: prevent reclaim/free race for headless pages
	squashfs: fix inode lookup sanity checks
	squashfs: fix xattr id and id lookup sanity checks
	hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
	kasan: fix per-page tags for non-page_alloc pages
	gcov: fix clang-11+ support
	ACPI: video: Add missing callback back for Sony VPCEH3U1E
	ACPICA: Always create namespace nodes using acpi_ns_create_node()
	arm64: stacktrace: don't trace arch_stack_walk()
	arm64: dts: ls1046a: mark crypto engine dma coherent
	arm64: dts: ls1012a: mark crypto engine dma coherent
	arm64: dts: ls1043a: mark crypto engine dma coherent
	ARM: dts: at91: sam9x60: fix mux-mask for PA7 so it can be set to A, B and C
	ARM: dts: at91: sam9x60: fix mux-mask to match product's datasheet
	ARM: dts: at91-sama5d27_som1: fix phy address to 7
	integrity: double check iint_cache was initialized
	drm/etnaviv: Use FOLL_FORCE for userptr
	drm/amd/pm: workaround for audio noise issue
	drm/amdgpu/display: restore AUX_DPHY_TX_CONTROL for DCN2.x
	drm/amdgpu: Add additional Sienna Cichlid PCI ID
	drm/i915: Fix the GT fence revocation runtime PM logic
	dm verity: fix DM_VERITY_OPTS_MAX value
	dm ioctl: fix out of bounds array access when no devices
	bus: omap_l3_noc: mark l3 irqs as IRQF_NO_THREAD
	ARM: OMAP2+: Fix smartreflex init regression after dropping legacy data
	soc: ti: omap-prm: Fix occasional abort on reset deassert for dra7 iva
	veth: Store queue_mapping independently of XDP prog presence
	bpf: Change inode_storage's lookup_elem return value from NULL to -EBADF
	libbpf: Fix INSTALL flag order
	net/mlx5e: RX, Mind the MPWQE gaps when calculating offsets
	net/mlx5e: When changing XDP program without reset, take refs for XSK RQs
	net/mlx5e: Don't match on Geneve options in case option masks are all zero
	ipv6: fix suspecious RCU usage warning
	drop_monitor: Perform cleanup upon probe registration failure
	macvlan: macvlan_count_rx() needs to be aware of preemption
	net: sched: validate stab values
	net: dsa: bcm_sf2: Qualify phydev->dev_flags based on port
	igc: reinit_locked() should be called with rtnl_lock
	igc: Fix Pause Frame Advertising
	igc: Fix Supported Pause Frame Link Setting
	igc: Fix igc_ptp_rx_pktstamp()
	e1000e: add rtnl_lock() to e1000_reset_task
	e1000e: Fix error handling in e1000_set_d0_lplu_state_82571
	net/qlcnic: Fix a use after free in qlcnic_83xx_get_minidump_template
	net: phy: broadcom: Add power down exit reset state delay
	ftgmac100: Restart MAC HW once
	clk: qcom: gcc-sc7180: Use floor ops for the correct sdcc1 clk
	net: ipa: terminate message handler arrays
	net: qrtr: fix a kernel-infoleak in qrtr_recvmsg()
	flow_dissector: fix byteorder of dissected ICMP ID
	selftests/bpf: Set gopt opt_class to 0 if get tunnel opt failed
	netfilter: ctnetlink: fix dump of the expect mask attribute
	net: hdlc_x25: Prevent racing between "x25_close" and "x25_xmit"/"x25_rx"
	net: phylink: Fix phylink_err() function name error in phylink_major_config
	tipc: better validate user input in tipc_nl_retrieve_key()
	tcp: relookup sock for RST+ACK packets handled by obsolete req sock
	can: isotp: isotp_setsockopt(): only allow to set low level TX flags for CAN-FD
	can: isotp: TX-path: ensure that CAN frame flags are initialized
	can: peak_usb: add forgotten supported devices
	can: flexcan: flexcan_chip_freeze(): fix chip freeze for missing bitrate
	can: kvaser_pciefd: Always disable bus load reporting
	can: c_can_pci: c_can_pci_remove(): fix use-after-free
	can: c_can: move runtime PM enable/disable to c_can_platform
	can: m_can: m_can_do_rx_poll(): fix extraneous msg loss warning
	can: m_can: m_can_rx_peripheral(): fix RX being blocked by errors
	mac80211: fix rate mask reset
	mac80211: Allow HE operation to be longer than expected.
	selftests/net: fix warnings on reuseaddr_ports_exhausted
	nfp: flower: fix unsupported pre_tunnel flows
	nfp: flower: add ipv6 bit to pre_tunnel control message
	nfp: flower: fix pre_tun mask id allocation
	ftrace: Fix modify_ftrace_direct.
	drm/msm/dsi: fix check-before-set in the 7nm dsi_pll code
	ionic: linearize tso skb with too many frags
	net/sched: cls_flower: fix only mask bit check in the validate_ct_state
	netfilter: nftables: report EOPNOTSUPP on unsupported flowtable flags
	netfilter: nftables: allow to update flowtable flags
	netfilter: flowtable: Make sure GC works periodically in idle system
	libbpf: Fix error path in bpf_object__elf_init()
	libbpf: Use SOCK_CLOEXEC when opening the netlink socket
	ARM: dts: imx6ull: fix ubi filesystem mount failed
	ipv6: weaken the v4mapped source check
	octeontx2-af: Formatting debugfs entry rsrc_alloc.
	octeontx2-af: Modify default KEX profile to extract TX packet fields
	octeontx2-af: Remove TOS field from MKEX TX
	octeontx2-af: Fix irq free in rvu teardown
	octeontx2-pf: Clear RSS enable flag on interace down
	octeontx2-af: fix infinite loop in unmapping NPC counter
	net: check all name nodes in __dev_alloc_name
	net: cdc-phonet: fix data-interface release on probe failure
	igb: check timestamp validity
	r8152: limit the RX buffer size of RTL8153A for USB 2.0
	net: stmmac: dwmac-sun8i: Provide TX and RX fifo sizes
	selinux: vsock: Set SID for socket returned by accept()
	selftests: forwarding: vxlan_bridge_1d: Fix vxlan ecn decapsulate value
	libbpf: Fix BTF dump of pointer-to-array-of-struct
	bpf: Fix umd memory leak in copy_process()
	can: isotp: tx-path: zero initialize outgoing CAN frames
	drm/msm: fix shutdown hook in case GPU components failed to bind
	drm/msm: Fix suspend/resume on i.MX5
	arm64: kdump: update ppos when reading elfcorehdr
	PM: runtime: Defer suspending suppliers
	net/mlx5: Add back multicast stats for uplink representor
	net/mlx5e: Allow to match on MPLS parameters only for MPLS over UDP
	net/mlx5e: Offload tuple rewrite for non-CT flows
	net/mlx5e: Fix error path for ethtool set-priv-flag
	PM: EM: postpone creating the debugfs dir till fs_initcall
	net: bridge: don't notify switchdev for local FDB addresses
	octeontx2-af: Fix memory leak of object buf
	xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on MEMORY_HOTPLUG
	RDMA/cxgb4: Fix adapter LE hash errors while destroying ipv6 listening server
	bpf: Don't do bpf_cgroup_storage_set() for kuprobe/tp programs
	net: Consolidate common blackhole dst ops
	net, bpf: Fix ip6ip6 crash with collect_md populated skbs
	igb: avoid premature Rx buffer reuse
	net: axienet: Properly handle PCS/PMA PHY for 1000BaseX mode
	net: axienet: Fix probe error cleanup
	net: phy: introduce phydev->port
	net: phy: broadcom: Avoid forward for bcm54xx_config_clock_delay()
	net: phy: broadcom: Set proper 1000BaseX/SGMII interface mode for BCM54616S
	net: phy: broadcom: Fix RGMII delays for BCM50160 and BCM50610M
	Revert "netfilter: x_tables: Switch synchronization to RCU"
	netfilter: x_tables: Use correct memory barriers.
	dm table: Fix zoned model check and zone sectors check
	mm/mmu_notifiers: ensure range_end() is paired with range_start()
	Revert "netfilter: x_tables: Update remaining dereference to RCU"
	ACPI: scan: Rearrange memory allocation in acpi_device_add()
	ACPI: scan: Use unique number for instance_no
	perf auxtrace: Fix auxtrace queue conflict
	perf synthetic events: Avoid write of uninitialized memory when generating PERF_RECORD_MMAP* records
	io_uring: fix provide_buffers sign extension
	block: recalculate segment count for multi-segment discards correctly
	scsi: Revert "qla2xxx: Make sure that aborted commands are freed"
	scsi: qedi: Fix error return code of qedi_alloc_global_queues()
	scsi: mpt3sas: Fix error return code of mpt3sas_base_attach()
	smb3: fix cached file size problems in duplicate extents (reflink)
	cifs: Adjust key sizes and key generation routines for AES256 encryption
	locking/mutex: Fix non debug version of mutex_lock_io_nested()
	x86/mem_encrypt: Correct physical address calculation in __set_clr_pte_enc()
	mm/memcg: fix 5.10 backport of splitting page memcg
	fs/cachefiles: Remove wait_bit_key layout dependency
	ch_ktls: fix enum-conversion warning
	can: dev: Move device back to init netns on owning netns delete
	r8169: fix DMA being used after buffer free if WoL is enabled
	net: dsa: b53: VLAN filtering is global to all users
	mac80211: fix double free in ibss_leave
	ext4: add reclaim checks to xattr code
	fs/ext4: fix integer overflow in s_log_groups_per_flex
	Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"
	Revert "net: bonding: fix error return code of bond_neigh_init()"
	nvme: fix the nsid value to print in nvme_validate_or_alloc_ns
	can: peak_usb: Revert "can: peak_usb: add forgotten supported devices"
	xen-blkback: don't leak persistent grants from xen_blkbk_map()
	Linux 5.10.27

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7eafe976fd6bf33db6db4adb8ebf2ff087294a23
2021-04-02 15:25:50 +02:00
David Jeffery
fc062d21c0 block: recalculate segment count for multi-segment discards correctly
[ Upstream commit a958937ff166fc60d1c3a721036f6ff41bfa2821 ]

When a stacked block device inserts a request into another block device
using blk_insert_cloned_request, the request's nr_phys_segments field gets
recalculated by a call to blk_recalc_rq_segments in
blk_cloned_rq_check_limits. But blk_recalc_rq_segments does not know how to
handle multi-segment discards. For disk types which can handle
multi-segment discards like nvme, this results in discard requests which
claim a single segment when it should report several, triggering a warning
in nvme and causing nvme to fail the discard from the invalid state.

 WARNING: CPU: 5 PID: 191 at drivers/nvme/host/core.c:700 nvme_setup_discard+0x170/0x1e0 [nvme_core]
 ...
 nvme_setup_cmd+0x217/0x270 [nvme_core]
 nvme_loop_queue_rq+0x51/0x1b0 [nvme_loop]
 __blk_mq_try_issue_directly+0xe7/0x1b0
 blk_mq_request_issue_directly+0x41/0x70
 ? blk_account_io_start+0x40/0x50
 dm_mq_queue_rq+0x200/0x3e0
 blk_mq_dispatch_rq_list+0x10a/0x7d0
 ? __sbitmap_queue_get+0x25/0x90
 ? elv_rb_del+0x1f/0x30
 ? deadline_remove_request+0x55/0xb0
 ? dd_dispatch_request+0x181/0x210
 __blk_mq_do_dispatch_sched+0x144/0x290
 ? bio_attempt_discard_merge+0x134/0x1f0
 __blk_mq_sched_dispatch_requests+0x129/0x180
 blk_mq_sched_dispatch_requests+0x30/0x60
 __blk_mq_run_hw_queue+0x47/0xe0
 __blk_mq_delay_run_hw_queue+0x15b/0x170
 blk_mq_sched_insert_requests+0x68/0xe0
 blk_mq_flush_plug_list+0xf0/0x170
 blk_finish_plug+0x36/0x50
 xlog_cil_committed+0x19f/0x290 [xfs]
 xlog_cil_process_committed+0x57/0x80 [xfs]
 xlog_state_do_callback+0x1e0/0x2a0 [xfs]
 xlog_ioend_work+0x2f/0x80 [xfs]
 process_one_work+0x1b6/0x350
 worker_thread+0x53/0x3e0
 ? process_one_work+0x350/0x350
 kthread+0x11b/0x140
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30

This patch fixes blk_recalc_rq_segments to be aware of devices which can
have multi-segment discards. It calculates the correct discard segment
count by counting the number of bio as each discard bio is considered its
own segment.

Fixes: 1e739730c5 ("block: optionally merge discontiguous discard bios into a single request")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Link: https://lore.kernel.org/r/20210211143807.GA115624@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30 14:32:06 +02:00
Daniel Wagner
07feac84ef block: Suppress uevent for hidden device when removed
[ Upstream commit 9ec491447b90ad6a4056a9656b13f0b3a1e83043 ]

register_disk() suppress uevents for devices with the GENHD_FL_HIDDEN
but enables uevents at the end again in order to announce disk after
possible partitions are created.

When the device is removed the uevents are still on and user land sees
'remove' messages for devices which were never 'add'ed to the system.

  KERNEL[95481.571887] remove   /devices/virtual/nvme-fabrics/ctl/nvme5/nvme0c5n1 (block)

Let's suppress the uevents for GENHD_FL_HIDDEN by not enabling the
uevents at all.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Link: https://lore.kernel.org/r/20210311151917.136091-1-dwagner@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30 14:31:52 +02:00
Damien Le Moal
d27b0964ad block: Fix REQ_OP_ZONE_RESET_ALL handling
[ Upstream commit faa44c69daf9ccbd5b8a1aee13e0e0d037c0be17 ]

Similarly to a single zone reset operation (REQ_OP_ZONE_RESET), execute
REQ_OP_ZONE_RESET_ALL operations with REQ_SYNC set.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30 14:31:51 +02:00
Xunlei Pang
71b996c9b8 blk-cgroup: Fix the recursive blkg rwstat
[ Upstream commit 4f44657d74873735e93a50eb25014721a66aac19 ]

The current blkio.throttle.io_service_bytes_recursive doesn't
work correctly.

As an example, for the following blkcg hierarchy:
 (Made 1GB READ in test1, 512MB READ in test2)
     test
    /    \
 test1   test2

$ head -n 1 test/test1/blkio.throttle.io_service_bytes_recursive
8:0 Read 1073684480
$ head -n 1 test/test2/blkio.throttle.io_service_bytes_recursive
8:0 Read 537448448
$ head -n 1 test/blkio.throttle.io_service_bytes_recursive
8:0 Read 537448448

Clearly, above data of "test" reflects "test2" not "test1"+"test2".

Do the correct summary in blkg_rwstat_recursive_sum().

Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30 14:31:48 +02:00
Greg Kroah-Hartman
3ccfc59f82 This is the 5.10.24 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmBSKS0ACgkQONu9yGCS
 aT7Ngg//c4C1WnWC0sNWzP3xT2paCkLnUUyjQTmrkbPvLtr2DvehW5Bvp/32pGiS
 8mDMoTLq3QxNrfrU6SY3KavZRC9Pa+migAsVmuujygQwNphqv95/XxnFemFEAYTl
 b8b5OJPyomzMMEwHzx1Tr+7/d58czrqXo97QI0lmaDlHl+9JKTg2SMX9AkHkU8pK
 zYjbtzdhd9UZCTdVYY1ZFkQ1ik1iAWo3Xv0G2aMeQQpuGcZIh/Y66xBuyH+8g+Yz
 3mInhPQvhkb+c+m4ZJ9NhOUVEW4Fl0fq0mVrrYkfHqXe0D36Vj/yYvO/yTSBqb4+
 XQ5PLXX3KFVDFl1id94unXGgP3c0zBe30JZPqKdpSET+PzOlGiZTxMCfjPeTgu/Z
 7xc2qSX1zn273HMTRrT1daO4/NXQ85kE04mZMzq7cqDpum7ltfKrEMum/Gma+dJz
 Knn47oZHbSW4Er/WcAwHSeZpxvD7AahG/GlsQRy+IVPu/jMXJHmo2/Nv1fLJWp+G
 7VVWRXug69hywGr7hFiT3USG2C5g5cV3/dEO8NFFjGKRa5CbLbQD6B3+Dz3dXyBH
 jE3MGIoqoNk+SvJOAf2ogu7SS6wLynZWOchmAVvIQ4QEzcP2jroeFHKD49MYxDUE
 dKcq0dtfMc4nUaUZ/XRfWtS9fSm+T4XonmvEY4yXnAyfZ0aeEM8=
 =FdFm
 -----END PGP SIGNATURE-----

Merge 5.10.24 into android12-5.10-lts

Changes in 5.10.24
	uapi: nfnetlink_cthelper.h: fix userspace compilation error
	powerpc/perf: Fix handling of privilege level checks in perf interrupt context
	powerpc/pseries: Don't enforce MSI affinity with kdump
	ethernet: alx: fix order of calls on resume
	crypto: mips/poly1305 - enable for all MIPS processors
	ath9k: fix transmitting to stations in dynamic SMPS mode
	net: Fix gro aggregation for udp encaps with zero csum
	net: check if protocol extracted by virtio_net_hdr_set_proto is correct
	net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0
	net: l2tp: reduce log level of messages in receive path, add counter instead
	can: skb: can_skb_set_owner(): fix ref counting if socket was closed before setting skb ownership
	can: flexcan: assert FRZ bit in flexcan_chip_freeze()
	can: flexcan: enable RX FIFO after FRZ/HALT valid
	can: flexcan: invoke flexcan_chip_freeze() to enter freeze mode
	can: tcan4x5x: tcan4x5x_init(): fix initialization - clear MRAM before entering Normal Mode
	tcp: Fix sign comparison bug in getsockopt(TCP_ZEROCOPY_RECEIVE)
	tcp: add sanity tests to TCP_QUEUE_SEQ
	netfilter: nf_nat: undo erroneous tcp edemux lookup
	netfilter: x_tables: gpf inside xt_find_revision()
	net: always use icmp{,v6}_ndo_send from ndo_start_xmit
	net: phy: fix save wrong speed and duplex problem if autoneg is on
	selftests/bpf: Use the last page in test_snprintf_btf on s390
	selftests/bpf: No need to drop the packet when there is no geneve opt
	selftests/bpf: Mask bpf_csum_diff() return value to 16 bits in test_verifier
	samples, bpf: Add missing munmap in xdpsock
	libbpf: Clear map_info before each bpf_obj_get_info_by_fd
	ibmvnic: Fix possibly uninitialized old_num_tx_queues variable warning.
	ibmvnic: always store valid MAC address
	mt76: dma: do not report truncated frames to mac80211
	powerpc/603: Fix protection of user pages mapped with PROT_NONE
	mount: fix mounting of detached mounts onto targets that reside on shared mounts
	cifs: return proper error code in statfs(2)
	Revert "mm, slub: consider rest of partial list if acquire_slab() fails"
	docs: networking: drop special stable handling
	net: dsa: tag_rtl4_a: fix egress tags
	sh_eth: fix TRSCER mask for SH771x
	net: enetc: don't overwrite the RSS indirection table when initializing
	net: enetc: take the MDIO lock only once per NAPI poll cycle
	net: enetc: fix incorrect TPID when receiving 802.1ad tagged packets
	net: enetc: don't disable VLAN filtering in IFF_PROMISC mode
	net: enetc: force the RGMII speed and duplex instead of operating in inband mode
	net: enetc: remove bogus write to SIRXIDR from enetc_setup_rxbdr
	net: enetc: keep RX ring consumer index in sync with hardware
	net: ethernet: mtk-star-emac: fix wrong unmap in RX handling
	net/mlx4_en: update moderation when config reset
	net: stmmac: fix incorrect DMA channel intr enable setting of EQoS v4.10
	nexthop: Do not flush blackhole nexthops when loopback goes down
	net: sched: avoid duplicates in classes dump
	net: mscc: ocelot: properly reject destination IP keys in VCAP IS1
	net: dsa: sja1105: fix SGMII PCS being forced to SPEED_UNKNOWN instead of SPEED_10
	net: usb: qmi_wwan: allow qmimux add/del with master up
	netdevsim: init u64 stats for 32bit hardware
	cipso,calipso: resolve a number of problems with the DOI refcounts
	net: stmmac: Fix VLAN filter delete timeout issue in Intel mGBE SGMII
	stmmac: intel: Fixes clock registration error seen for multiple interfaces
	net: lapbether: Remove netif_start_queue / netif_stop_queue
	net: davicom: Fix regulator not turned off on failed probe
	net: davicom: Fix regulator not turned off on driver removal
	net: enetc: allow hardware timestamping on TX queues with tc-etf enabled
	net: qrtr: fix error return code of qrtr_sendmsg()
	s390/qeth: fix memory leak after failed TX Buffer allocation
	r8169: fix r8168fp_adjust_ocp_cmd function
	ixgbe: fail to create xfrm offload of IPsec tunnel mode SA
	tools/resolve_btfids: Fix build error with older host toolchains
	perf build: Fix ccache usage in $(CC) when generating arch errno table
	net: stmmac: stop each tx channel independently
	net: stmmac: fix watchdog timeout during suspend/resume stress test
	net: stmmac: fix wrongly set buffer2 valid when sph unsupport
	ethtool: fix the check logic of at least one channel for RX/TX
	net: phy: make mdio_bus_phy_suspend/resume as __maybe_unused
	selftests: forwarding: Fix race condition in mirror installation
	mlxsw: spectrum_ethtool: Add an external speed to PTYS register
	perf traceevent: Ensure read cmdlines are null terminated.
	perf report: Fix -F for branch & mem modes
	net: hns3: fix query vlan mask value error for flow director
	net: hns3: fix bug when calculating the TCAM table info
	s390/cio: return -EFAULT if copy_to_user() fails again
	bnxt_en: reliably allocate IRQ table on reset to avoid crash
	gpiolib: acpi: Add ACPI_GPIO_QUIRK_ABSOLUTE_NUMBER quirk
	gpiolib: acpi: Allow to find GpioInt() resource by name and index
	gpio: pca953x: Set IRQ type when handle Intel Galileo Gen 2
	gpio: fix gpio-device list corruption
	drm/compat: Clear bounce structures
	drm/amd/display: Add a backlight module option
	drm/amdgpu/display: use GFP_ATOMIC in dcn21_validate_bandwidth_fp()
	drm/amd/display: Fix nested FPU context in dcn21_validate_bandwidth()
	drm/amd/pm: bug fix for pcie dpm
	drm/amdgpu/display: simplify backlight setting
	drm/amdgpu/display: don't assert in set backlight function
	drm/amdgpu/display: handle aux backlight in backlight_get_brightness
	drm/shmem-helper: Check for purged buffers in fault handler
	drm/shmem-helper: Don't remove the offset in vm_area_struct pgoff
	drm: Use USB controller's DMA mask when importing dmabufs
	drm: meson_drv add shutdown function
	drm/shmem-helpers: vunmap: Don't put pages for dma-buf
	drm/i915: Wedge the GPU if command parser setup fails
	s390/cio: return -EFAULT if copy_to_user() fails
	s390/crypto: return -EFAULT if copy_to_user() fails
	qxl: Fix uninitialised struct field head.surface_id
	sh_eth: fix TRSCER mask for R7S9210
	media: usbtv: Fix deadlock on suspend
	media: rkisp1: params: fix wrong bits settings
	media: v4l: vsp1: Fix uif null pointer access
	media: v4l: vsp1: Fix bru null pointer access
	media: rc: compile rc-cec.c into rc-core
	cifs: fix credit accounting for extra channel
	net: hns3: fix error mask definition of flow director
	s390/qeth: don't replace a fully completed async TX buffer
	s390/qeth: remove QETH_QDIO_BUF_HANDLED_DELAYED state
	s390/qeth: improve completion of pending TX buffers
	s390/qeth: fix notification for pending buffers during teardown
	net: dsa: implement a central TX reallocation procedure
	net: dsa: tag_ksz: don't allocate additional memory for padding/tagging
	net: dsa: trailer: don't allocate additional memory for padding/tagging
	net: dsa: tag_qca: let DSA core deal with TX reallocation
	net: dsa: tag_ocelot: let DSA core deal with TX reallocation
	net: dsa: tag_mtk: let DSA core deal with TX reallocation
	net: dsa: tag_lan9303: let DSA core deal with TX reallocation
	net: dsa: tag_edsa: let DSA core deal with TX reallocation
	net: dsa: tag_brcm: let DSA core deal with TX reallocation
	net: dsa: tag_dsa: let DSA core deal with TX reallocation
	net: dsa: tag_gswip: let DSA core deal with TX reallocation
	net: dsa: tag_ar9331: let DSA core deal with TX reallocation
	net: dsa: tag_mtk: fix 802.1ad VLAN egress
	enetc: Fix unused var build warning for CONFIG_OF
	net: enetc: initialize RFS/RSS memories for unused ports too
	ath11k: peer delete synchronization with firmware
	ath11k: start vdev if a bss peer is already created
	ath11k: fix AP mode for QCA6390
	i2c: rcar: faster irq code to minimize HW race condition
	i2c: rcar: optimize cacheline to minimize HW race condition
	scsi: ufs: WB is only available on LUN #0 to #7
	udf: fix silent AED tagLocation corruption
	iommu/vt-d: Clear PRQ overflow only when PRQ is empty
	mmc: mxs-mmc: Fix a resource leak in an error handling path in 'mxs_mmc_probe()'
	mmc: mediatek: fix race condition between msdc_request_timeout and irq
	mmc: sdhci-iproc: Add ACPI bindings for the RPi
	Platform: OLPC: Fix probe error handling
	powerpc/pci: Add ppc_md.discover_phbs()
	spi: stm32: make spurious and overrun interrupts visible
	powerpc: improve handling of unrecoverable system reset
	powerpc/perf: Record counter overflow always if SAMPLE_IP is unset
	HID: logitech-dj: add support for the new lightspeed connection iteration
	powerpc/64: Fix stack trace not displaying final frame
	iommu/amd: Fix performance counter initialization
	clk: qcom: gdsc: Implement NO_RET_PERIPH flag
	sparc32: Limit memblock allocation to low memory
	sparc64: Use arch_validate_flags() to validate ADI flag
	Input: applespi - don't wait for responses to commands indefinitely.
	PCI: xgene-msi: Fix race in installing chained irq handler
	PCI: mediatek: Add missing of_node_put() to fix reference leak
	drivers/base: build kunit tests without structleak plugin
	PCI/LINK: Remove bandwidth notification
	ext4: don't try to processed freed blocks until mballoc is initialized
	kbuild: clamp SUBLEVEL to 255
	PCI: Fix pci_register_io_range() memory leak
	i40e: Fix memory leak in i40e_probe
	kasan: fix memory corruption in kasan_bitops_tags test
	s390/smp: __smp_rescan_cpus() - move cpumask away from stack
	drivers/base/memory: don't store phys_device in memory blocks
	sysctl.c: fix underflow value setting risk in vm_table
	scsi: libiscsi: Fix iscsi_prep_scsi_cmd_pdu() error handling
	scsi: target: core: Add cmd length set before cmd complete
	scsi: target: core: Prevent underflow for service actions
	clk: qcom: gpucc-msm8998: Add resets, cxc, fix flags on gpu_gx_gdsc
	mmc: sdhci: Update firmware interface API
	ARM: 9029/1: Make iwmmxt.S support Clang's integrated assembler
	ARM: assembler: introduce adr_l, ldr_l and str_l macros
	ARM: efistub: replace adrl pseudo-op with adr_l macro invocation
	ALSA: usb: Add Plantronics C320-M USB ctrl msg delay quirk
	ALSA: hda/hdmi: Cancel pending works before suspend
	ALSA: hda/conexant: Add quirk for mute LED control on HP ZBook G5
	ALSA: hda/ca0132: Add Sound BlasterX AE-5 Plus support
	ALSA: hda: Drop the BATCH workaround for AMD controllers
	ALSA: hda: Flush pending unsolicited events before suspend
	ALSA: hda: Avoid spurious unsol event handling during S3/S4
	ALSA: usb-audio: Fix "cannot get freq eq" errors on Dell AE515 sound bar
	ALSA: usb-audio: Apply the control quirk to Plantronics headsets
	ALSA: usb-audio: Disable USB autosuspend properly in setup_disable_autosuspend()
	ALSA: usb-audio: fix NULL ptr dereference in usb_audio_probe
	ALSA: usb-audio: fix use after free in usb_audio_disconnect
	Revert 95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file capabilities")
	block: Discard page cache of zone reset target range
	block: Try to handle busy underlying device on discard
	arm64: kasan: fix page_alloc tagging with DEBUG_VIRTUAL
	arm64: mte: Map hotplugged memory as Normal Tagged
	arm64: perf: Fix 64-bit event counter read truncation
	s390/dasd: fix hanging DASD driver unbind
	s390/dasd: fix hanging IO request during DASD driver unbind
	software node: Fix node registration
	xen/events: reset affinity of 2-level event when tearing it down
	mmc: mmci: Add MMC_CAP_NEED_RSP_BUSY for the stm32 variants
	mmc: core: Fix partition switch time for eMMC
	mmc: cqhci: Fix random crash when remove mmc module/card
	cifs: do not send close in compound create+close requests
	Goodix Fingerprint device is not a modem
	USB: gadget: udc: s3c2410_udc: fix return value check in s3c2410_udc_probe()
	USB: gadget: u_ether: Fix a configfs return code
	usb: gadget: f_uac2: always increase endpoint max_packet_size by one audio slot
	usb: gadget: f_uac1: stop playback on function disable
	usb: dwc3: qcom: Add missing DWC3 OF node refcount decrement
	usb: dwc3: qcom: add URS Host support for sdm845 ACPI boot
	usb: dwc3: qcom: add ACPI device id for sc8180x
	usb: dwc3: qcom: Honor wakeup enabled/disabled state
	USB: usblp: fix a hang in poll() if disconnected
	usb: renesas_usbhs: Clear PIPECFG for re-enabling pipe with other EPNUM
	usb: xhci: do not perform Soft Retry for some xHCI hosts
	xhci: Improve detection of device initiated wake signal.
	usb: xhci: Fix ASMedia ASM1042A and ASM3242 DMA addressing
	xhci: Fix repeated xhci wake after suspend due to uncleared internal wake state
	USB: serial: io_edgeport: fix memory leak in edge_startup
	USB: serial: ch341: add new Product ID
	USB: serial: cp210x: add ID for Acuity Brands nLight Air Adapter
	USB: serial: cp210x: add some more GE USB IDs
	usbip: fix stub_dev to check for stream socket
	usbip: fix vhci_hcd to check for stream socket
	usbip: fix vudc to check for stream socket
	usbip: fix stub_dev usbip_sockfd_store() races leading to gpf
	usbip: fix vhci_hcd attach_store() races leading to gpf
	usbip: fix vudc usbip_sockfd_store races leading to gpf
	Revert "serial: max310x: rework RX interrupt handling"
	misc/pvpanic: Export module FDT device table
	misc: fastrpc: restrict user apps from sending kernel RPC messages
	staging: rtl8192u: fix ->ssid overflow in r8192_wx_set_scan()
	staging: rtl8188eu: prevent ->ssid overflow in rtw_wx_set_scan()
	staging: rtl8712: unterminated string leads to read overflow
	staging: rtl8188eu: fix potential memory corruption in rtw_check_beacon_data()
	staging: ks7010: prevent buffer overflow in ks_wlan_set_scan()
	staging: rtl8712: Fix possible buffer overflow in r8712_sitesurvey_cmd
	staging: rtl8192e: Fix possible buffer overflow in _rtl92e_wx_set_scan
	staging: comedi: addi_apci_1032: Fix endian problem for COS sample
	staging: comedi: addi_apci_1500: Fix endian problem for command sample
	staging: comedi: adv_pci1710: Fix endian problem for AI command data
	staging: comedi: das6402: Fix endian problem for AI command data
	staging: comedi: das800: Fix endian problem for AI command data
	staging: comedi: dmm32at: Fix endian problem for AI command data
	staging: comedi: me4000: Fix endian problem for AI command data
	staging: comedi: pcl711: Fix endian problem for AI command data
	staging: comedi: pcl818: Fix endian problem for AI command data
	sh_eth: fix TRSCER mask for R7S72100
	cpufreq: qcom-hw: fix dereferencing freed memory 'data'
	cpufreq: qcom-hw: Fix return value check in qcom_cpufreq_hw_cpu_init()
	arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory
	SUNRPC: Set memalloc_nofs_save() for sync tasks
	NFS: Don't revalidate the directory permissions on a lookup failure
	NFS: Don't gratuitously clear the inode cache when lookup failed
	NFSv4.2: fix return value of _nfs4_get_security_label()
	block: rsxx: fix error return code of rsxx_pci_probe()
	nvme-fc: fix racing controller reset and create association
	configfs: fix a use-after-free in __configfs_open_file
	arm64: mm: use a 48-bit ID map when possible on 52-bit VA builds
	perf/core: Flush PMU internal buffers for per-CPU events
	perf/x86/intel: Set PERF_ATTACH_SCHED_CB for large PEBS and LBR
	hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
	powerpc/64s/exception: Clean up a missed SRR specifier
	seqlock,lockdep: Fix seqcount_latch_init()
	stop_machine: mark helpers __always_inline
	include/linux/sched/mm.h: use rcu_dereference in in_vfork()
	zram: fix return value on writeback_store
	linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*
	sched/membarrier: fix missing local execution of ipi_sync_rq_state()
	efi: stub: omit SetVirtualAddressMap() if marked unsupported in RT_PROP table
	powerpc/64s: Fix instruction encoding for lis in ppc_function_entry()
	powerpc: Fix inverted SET_FULL_REGS bitop
	powerpc: Fix missing declaration of [en/dis]able_kernel_vsx()
	binfmt_misc: fix possible deadlock in bm_register_write
	x86/unwind/orc: Disable KASAN checking in the ORC unwinder, part 2
	x86/sev-es: Introduce ip_within_syscall_gap() helper
	x86/sev-es: Check regs->sp is trusted before adjusting #VC IST stack
	x86/entry: Move nmi entry/exit into common code
	x86/sev-es: Correctly track IRQ states in runtime #VC handler
	x86/sev-es: Use __copy_from_user_inatomic()
	x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls
	KVM: x86: Ensure deadline timer has truly expired before posting its IRQ
	KVM: kvmclock: Fix vCPUs > 64 can't be online/hotpluged
	KVM: arm64: Fix range alignment when walking page tables
	KVM: arm64: Avoid corrupting vCPU context register in guest exit
	KVM: arm64: nvhe: Save the SPE context early
	KVM: arm64: Reject VM creation when the default IPA size is unsupported
	KVM: arm64: Fix exclusive limit for IPA size
	mm/userfaultfd: fix memory corruption due to writeprotect
	mm/madvise: replace ptrace attach requirement for process_madvise
	KVM: arm64: Ensure I-cache isolation between vcpus of a same VM
	mm/page_alloc.c: refactor initialization of struct page for holes in memory layout
	xen/events: don't unmask an event channel when an eoi is pending
	xen/events: avoid handling the same event on two cpus at the same time
	KVM: arm64: Fix nVHE hyp panic host context restore
	RDMA/umem: Use ib_dma_max_seg_size instead of dma_get_max_seg_size
	Linux 5.10.24

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie53a3c1963066a18d41357b6be41cff00690bd40
2021-03-19 09:42:56 +01:00
Shin'ichiro Kawasaki
a534778492 block: Discard page cache of zone reset target range
commit e5113505904ea1c1c0e1f92c1cfa91fbf4da1694 upstream.

When zone reset ioctl and data read race for a same zone on zoned block
devices, the data read leaves stale page cache even though the zone
reset ioctl zero clears all the zone data on the device. To avoid
non-zero data read from the stale page cache after zone reset, discard
page cache of reset target zones in blkdev_zone_mgmt_ioctl(). Introduce
the helper function blkdev_truncate_zone_range() to discard the page
cache. Ensure the page cache discarded by calling the helper function
before and after zone reset in same manner as fallocate does.

This patch can be applied back to the stable kernel version v5.10.y.
Rework is needed for older stable kernels.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 3ed05a987e ("blk-zoned: implement ioctls")
Cc: <stable@vger.kernel.org> # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210311072546.678999-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-17 17:06:27 +01:00
Greg Kroah-Hartman
d8c7f0a3cd Merge 5.10.20 into android12-5.10
Changes in 5.10.20
	vmlinux.lds.h: add DWARF v5 sections
	vdpa/mlx5: fix param validation in mlx5_vdpa_get_config()
	debugfs: be more robust at handling improper input in debugfs_lookup()
	debugfs: do not attempt to create a new file before the filesystem is initalized
	scsi: libsas: docs: Remove notify_ha_event()
	scsi: qla2xxx: Fix mailbox Ch erroneous error
	kdb: Make memory allocations more robust
	w1: w1_therm: Fix conversion result for negative temperatures
	PCI: qcom: Use PHY_REFCLK_USE_PAD only for ipq8064
	PCI: Decline to resize resources if boot config must be preserved
	virt: vbox: Do not use wait_event_interruptible when called from kernel context
	bfq: Avoid false bfq queue merging
	ALSA: usb-audio: Fix PCM buffer allocation in non-vmalloc mode
	MIPS: vmlinux.lds.S: add missing PAGE_ALIGNED_DATA() section
	vmlinux.lds.h: Define SANTIZER_DISCARDS with CONFIG_GCOV_KERNEL=y
	random: fix the RNDRESEEDCRNG ioctl
	ALSA: pcm: Call sync_stop at disconnection
	ALSA: pcm: Assure sync with the pending stop operation at suspend
	ALSA: pcm: Don't call sync_stop if it hasn't been stopped
	drm/i915/gt: One more flush for Baytrail clear residuals
	ath10k: Fix error handling in case of CE pipe init failure
	Bluetooth: btqcomsmd: Fix a resource leak in error handling paths in the probe function
	Bluetooth: hci_uart: Fix a race for write_work scheduling
	Bluetooth: Fix initializing response id after clearing struct
	arm64: dts: renesas: beacon kit: Fix choppy Bluetooth Audio
	arm64: dts: renesas: beacon: Fix audio-1.8V pin enable
	ARM: dts: exynos: correct PMIC interrupt trigger level on Artik 5
	ARM: dts: exynos: correct PMIC interrupt trigger level on Monk
	ARM: dts: exynos: correct PMIC interrupt trigger level on Rinato
	ARM: dts: exynos: correct PMIC interrupt trigger level on Spring
	ARM: dts: exynos: correct PMIC interrupt trigger level on Arndale Octa
	ARM: dts: exynos: correct PMIC interrupt trigger level on Odroid XU3 family
	arm64: dts: exynos: correct PMIC interrupt trigger level on TM2
	arm64: dts: exynos: correct PMIC interrupt trigger level on Espresso
	memory: mtk-smi: Fix PM usage counter unbalance in mtk_smi ops
	Bluetooth: hci_qca: Fix memleak in qca_controller_memdump
	staging: vchiq: Fix bulk userdata handling
	staging: vchiq: Fix bulk transfers on 64-bit builds
	arm64: dts: qcom: msm8916-samsung-a5u: Fix iris compatible
	net: stmmac: dwmac-meson8b: fix enabling the timing-adjustment clock
	bpf: Add bpf_patch_call_args prototype to include/linux/bpf.h
	bpf: Avoid warning when re-casting __bpf_call_base into __bpf_call_base_args
	firmware: arm_scmi: Fix call site of scmi_notification_exit
	arm64: dts: allwinner: A64: properly connect USB PHY to port 0
	arm64: dts: allwinner: H6: properly connect USB PHY to port 0
	arm64: dts: allwinner: Drop non-removable from SoPine/LTS SD card
	arm64: dts: allwinner: H6: Allow up to 150 MHz MMC bus frequency
	arm64: dts: allwinner: A64: Limit MMC2 bus frequency to 150 MHz
	arm64: dts: qcom: msm8916-samsung-a2015: Fix sensors
	cpufreq: brcmstb-avs-cpufreq: Free resources in error path
	cpufreq: brcmstb-avs-cpufreq: Fix resource leaks in ->remove()
	arm64: dts: rockchip: rk3328: Add clock_in_out property to gmac2phy node
	ACPICA: Fix exception code class checks
	usb: gadget: u_audio: Free requests only after callback
	arm64: dts: qcom: sdm845-db845c: Fix reset-pin of ov8856 node
	soc: qcom: socinfo: Fix an off by one in qcom_show_pmic_model()
	soc: ti: pm33xx: Fix some resource leak in the error handling paths of the probe function
	staging: media: atomisp: Fix size_t format specifier in hmm_alloc() debug statemenet
	Bluetooth: drop HCI device reference before return
	Bluetooth: Put HCI device if inquiry procedure interrupts
	memory: ti-aemif: Drop child node when jumping out loop
	ARM: dts: Configure missing thermal interrupt for 4430
	usb: dwc2: Do not update data length if it is 0 on inbound transfers
	usb: dwc2: Abort transaction after errors with unknown reason
	usb: dwc2: Make "trimming xfer length" a debug message
	staging: rtl8723bs: wifi_regd.c: Fix incorrect number of regulatory rules
	x86/MSR: Filter MSR writes through X86_IOC_WRMSR_REGS ioctl too
	arm64: dts: renesas: beacon: Fix EEPROM compatible value
	can: mcp251xfd: mcp251xfd_probe(): fix errata reference
	ARM: dts: armada388-helios4: assign pinctrl to LEDs
	ARM: dts: armada388-helios4: assign pinctrl to each fan
	arm64: dts: armada-3720-turris-mox: rename u-boot mtd partition to a53-firmware
	opp: Correct debug message in _opp_add_static_v2()
	Bluetooth: btusb: Fix memory leak in btusb_mtk_wmt_recv
	soc: qcom: ocmem: don't return NULL in of_get_ocmem
	arm64: dts: msm8916: Fix reserved and rfsa nodes unit address
	arm64: dts: meson: fix broken wifi node for Khadas VIM3L
	iwlwifi: mvm: set enabled in the PPAG command properly
	ARM: s3c: fix fiq for clang IAS
	optee: simplify i2c access
	staging: wfx: fix possible panic with re-queued frames
	ARM: at91: use proper asm syntax in pm_suspend
	ath10k: Fix suspicious RCU usage warning in ath10k_wmi_tlv_parse_peer_stats_info()
	ath10k: Fix lockdep assertion warning in ath10k_sta_statistics
	ath11k: fix a locking bug in ath11k_mac_op_start()
	soc: aspeed: snoop: Add clock control logic
	iwlwifi: mvm: fix the type we use in the PPAG table validity checks
	iwlwifi: mvm: store PPAG enabled/disabled flag properly
	iwlwifi: mvm: send stored PPAG command instead of local
	iwlwifi: mvm: assign SAR table revision to the command later
	iwlwifi: mvm: don't check if CSA event is running before removing
	bpf_lru_list: Read double-checked variable once without lock
	iwlwifi: pnvm: set the PNVM again if it was already loaded
	iwlwifi: pnvm: increment the pointer before checking the TLV
	ath9k: fix data bus crash when setting nf_override via debugfs
	selftests/bpf: Convert test_xdp_redirect.sh to bash
	ibmvnic: Set to CLOSED state even on error
	bnxt_en: reverse order of TX disable and carrier off
	bnxt_en: Fix devlink info's stored fw.psid version format.
	xen/netback: fix spurious event detection for common event case
	dpaa2-eth: fix memory leak in XDP_REDIRECT
	net: phy: consider that suspend2ram may cut off PHY power
	net/mlx5e: Don't change interrupt moderation params when DIM is enabled
	net/mlx5e: Change interrupt moderation channel params also when channels are closed
	net/mlx5: Fix health error state handling
	net/mlx5e: Replace synchronize_rcu with synchronize_net
	net/mlx5e: kTLS, Use refcounts to free kTLS RX priv context
	net/mlx5: Disable devlink reload for multi port slave device
	net/mlx5: Disallow RoCE on multi port slave device
	net/mlx5: Disallow RoCE on lag device
	net/mlx5: Disable devlink reload for lag devices
	net/mlx5e: CT: manage the lifetime of the ct entry object
	net/mlx5e: Check tunnel offload is required before setting SWP
	mac80211: fix potential overflow when multiplying to u32 integers
	libbpf: Ignore non function pointer member in struct_ops
	bpf: Fix an unitialized value in bpf_iter
	bpf, devmap: Use GFP_KERNEL for xdp bulk queue allocation
	bpf: Fix bpf_fib_lookup helper MTU check for SKB ctx
	selftests: mptcp: fix ACKRX debug message
	tcp: fix SO_RCVLOWAT related hangs under mem pressure
	net: axienet: Handle deferred probe on clock properly
	cxgb4/chtls/cxgbit: Keeping the max ofld immediate data size same in cxgb4 and ulds
	b43: N-PHY: Fix the update of coef for the PHY revision >= 3case
	bpf: Clear subreg_def for global function return values
	ibmvnic: add memory barrier to protect long term buffer
	ibmvnic: skip send_request_unmap for timeout reset
	net: dsa: felix: perform teardown in reverse order of setup
	net: dsa: felix: don't deinitialize unused ports
	net: phy: mscc: adding LCPLL reset to VSC8514
	net: amd-xgbe: Reset the PHY rx data path when mailbox command timeout
	net: amd-xgbe: Fix NETDEV WATCHDOG transmit queue timeout warning
	net: amd-xgbe: Reset link when the link never comes back
	net: amd-xgbe: Fix network fluctuations when using 1G BELFUSE SFP
	net: mvneta: Remove per-cpu queue mapping for Armada 3700
	net: enetc: fix destroyed phylink dereference during unbind
	tty: convert tty_ldisc_ops 'read()' function to take a kernel pointer
	tty: implement read_iter
	fbdev: aty: SPARC64 requires FB_ATY_CT
	drm/gma500: Fix error return code in psb_driver_load()
	gma500: clean up error handling in init
	drm/fb-helper: Add missed unlocks in setcmap_legacy()
	drm/panel: mantix: Tweak init sequence
	drm/vc4: hdmi: Take into account the clock doubling flag in atomic_check
	crypto: sun4i-ss - linearize buffers content must be kept
	crypto: sun4i-ss - fix kmap usage
	crypto: arm64/aes-ce - really hide slower algos when faster ones are enabled
	hwrng: ingenic - Fix a resource leak in an error handling path
	media: allegro: Fix use after free on error
	kcsan: Rewrite kcsan_prandom_u32_max() without prandom_u32_state()
	drm: rcar-du: Fix PM reference leak in rcar_cmm_enable()
	drm: rcar-du: Fix crash when using LVDS1 clock for CRTC
	drm: rcar-du: Fix the return check of of_parse_phandle and of_find_device_by_node
	drm/amdgpu: Fix macro name _AMDGPU_TRACE_H_ in preprocessor if condition
	MIPS: c-r4k: Fix section mismatch for loongson2_sc_init
	MIPS: lantiq: Explicitly compare LTQ_EBU_PCC_ISTAT against 0
	drm/virtio: make sure context is created in gem open
	drm/fourcc: fix Amlogic format modifier masks
	media: ipu3-cio2: Build only for x86
	media: i2c: ov5670: Fix PIXEL_RATE minimum value
	media: imx: Unregister csc/scaler only if registered
	media: imx: Fix csc/scaler unregister
	media: mtk-vcodec: fix error return code in vdec_vp9_decode()
	media: camss: missing error code in msm_video_register()
	media: vsp1: Fix an error handling path in the probe function
	media: em28xx: Fix use-after-free in em28xx_alloc_urbs
	media: media/pci: Fix memleak in empress_init
	media: tm6000: Fix memleak in tm6000_start_stream
	media: aspeed: fix error return code in aspeed_video_setup_video()
	ASoC: cs42l56: fix up error handling in probe
	ASoC: qcom: qdsp6: Move frontend AIFs to q6asm-dai
	evm: Fix memleak in init_desc
	crypto: bcm - Rename struct device_private to bcm_device_private
	sched/fair: Avoid stale CPU util_est value for schedutil in task dequeue
	drm/sun4i: tcon: fix inverted DCLK polarity
	media: imx7: csi: Fix regression for parallel cameras on i.MX6UL
	media: imx7: csi: Fix pad link validation
	media: ti-vpe: cal: fix write to unallocated memory
	MIPS: properly stop .eh_frame generation
	MIPS: Compare __SYNC_loongson3_war against 0
	drm/tegra: Fix reference leak when pm_runtime_get_sync() fails
	drm/amdgpu: toggle on DF Cstate after finishing xgmi injection
	bsg: free the request before return error code
	macintosh/adb-iop: Use big-endian autopoll mask
	drm/amd/display: Fix 10/12 bpc setup in DCE output bit depth reduction.
	drm/amd/display: Fix HDMI deep color output for DCE 6-11.
	media: software_node: Fix refcounts in software_node_get_next_child()
	media: lmedm04: Fix misuse of comma
	media: vidtv: psi: fix missing crc for PMT
	media: atomisp: Fix a buffer overflow in debug code
	media: qm1d1c0042: fix error return code in qm1d1c0042_init()
	media: cx25821: Fix a bug when reallocating some dma memory
	media: mtk-vcodec: fix argument used when DEBUG is defined
	media: pxa_camera: declare variable when DEBUG is defined
	media: uvcvideo: Accept invalid bFormatIndex and bFrameIndex values
	sched/eas: Don't update misfit status if the task is pinned
	f2fs: compress: fix potential deadlock
	ASoC: qcom: lpass-cpu: Remove bit clock state check
	ASoC: SOF: Intel: hda: cancel D0i3 work during runtime suspend
	perf/arm-cmn: Fix PMU instance naming
	perf/arm-cmn: Move IRQs when migrating context
	mtd: parser: imagetag: fix error codes in bcm963xx_parse_imagetag_partitions()
	crypto: talitos - Work around SEC6 ERRATA (AES-CTR mode data size error)
	crypto: talitos - Fix ctr(aes) on SEC1
	drm/nouveau: bail out of nouveau_channel_new if channel init fails
	mm: proc: Invalidate TLB after clearing soft-dirty page state
	ata: ahci_brcm: Add back regulators management
	ASoC: cpcap: fix microphone timeslot mask
	ASoC: codecs: add missing max_register in regmap config
	mtd: parsers: afs: Fix freeing the part name memory in failure
	f2fs: fix to avoid inconsistent quota data
	drm/amdgpu: Prevent shift wrapping in amdgpu_read_mask()
	f2fs: fix a wrong condition in __submit_bio
	ASoC: qcom: Fix typo error in HDMI regmap config callbacks
	KVM: nSVM: Don't strip host's C-bit from guest's CR3 when reading PDPTRs
	drm/mediatek: Check if fb is null
	Drivers: hv: vmbus: Avoid use-after-free in vmbus_onoffer_rescind()
	ASoC: Intel: sof_sdw: add missing TGL_HDMI quirk for Dell SKU 0A5E
	ASoC: Intel: sof_sdw: add missing TGL_HDMI quirk for Dell SKU 0A3E
	locking/lockdep: Avoid unmatched unlock
	ASoC: qcom: lpass: Fix i2s ctl register bit map
	ASoC: rt5682: Fix panic in rt5682_jack_detect_handler happening during system shutdown
	ASoC: SOF: debug: Fix a potential issue on string buffer termination
	btrfs: clarify error returns values in __load_free_space_cache
	btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidapge
	KVM: x86: Restore all 64 bits of DR6 and DR7 during RSM on x86-64
	s390/zcrypt: return EIO when msg retry limit reached
	drm/vc4: hdmi: Move hdmi reset to bind
	drm/vc4: hdmi: Fix register offset with longer CEC messages
	drm/vc4: hdmi: Fix up CEC registers
	drm/vc4: hdmi: Restore cec physical address on reconnect
	drm/vc4: hdmi: Compute the CEC clock divider from the clock rate
	drm/vc4: hdmi: Update the CEC clock divider on HSM rate change
	drm/lima: fix reference leak in lima_pm_busy
	drm/dp_mst: Don't cache EDIDs for physical ports
	hwrng: timeriomem - Fix cooldown period calculation
	crypto: ecdh_helper - Ensure 'len >= secret.len' in decode_key()
	io_uring: fix possible deadlock in io_uring_poll
	nvmet-tcp: fix receive data digest calculation for multiple h2cdata PDUs
	nvmet-tcp: fix potential race of tcp socket closing accept_work
	nvme-multipath: set nr_zones for zoned namespaces
	nvmet: remove extra variable in identify ns
	nvmet: set status to 0 in case for invalid nsid
	ASoC: SOF: sof-pci-dev: add missing Up-Extreme quirk
	ima: Free IMA measurement buffer on error
	ima: Free IMA measurement buffer after kexec syscall
	ASoC: simple-card-utils: Fix device module clock
	fs/jfs: fix potential integer overflow on shift of a int
	jffs2: fix use after free in jffs2_sum_write_data()
	ubifs: Fix memleak in ubifs_init_authentication
	ubifs: replay: Fix high stack usage, again
	ubifs: Fix error return code in alloc_wbufs()
	irqchip/imx: IMX_INTMUX should not default to y, unconditionally
	smp: Process pending softirqs in flush_smp_call_function_from_idle()
	drm/amdgpu/display: remove hdcp_srm sysfs on device removal
	capabilities: Don't allow writing ambiguous v3 file capabilities
	HSI: Fix PM usage counter unbalance in ssi_hw_init
	power: supply: cpcap: Add missing IRQF_ONESHOT to fix regression
	clk: meson: clk-pll: fix initializing the old rate (fallback) for a PLL
	clk: meson: clk-pll: make "ret" a signed integer
	clk: meson: clk-pll: propagate the error from meson_clk_pll_set_rate()
	selftests/powerpc: Make the test check in eeh-basic.sh posix compliant
	regulator: qcom-rpmh-regulator: add pm8009-1 chip revision
	arm64: dts: qcom: qrb5165-rb5: fix pm8009 regulators
	quota: Fix memory leak when handling corrupted quota file
	i2c: iproc: handle only slave interrupts which are enabled
	i2c: iproc: update slave isr mask (ISR_MASK_SLAVE)
	i2c: iproc: handle master read request
	spi: cadence-quadspi: Abort read if dummy cycles required are too many
	clk: sunxi-ng: h6: Fix CEC clock
	clk: renesas: r8a779a0: Remove non-existent S2 clock
	clk: renesas: r8a779a0: Fix parent of CBFUSA clock
	HID: core: detect and skip invalid inputs to snto32()
	RDMA/siw: Fix handling of zero-sized Read and Receive Queues.
	dmaengine: fsldma: Fix a resource leak in the remove function
	dmaengine: fsldma: Fix a resource leak in an error handling path of the probe function
	dmaengine: owl-dma: Fix a resource leak in the remove function
	dmaengine: hsu: disable spurious interrupt
	mfd: bd9571mwv: Use devm_mfd_add_devices()
	power: supply: cpcap-charger: Fix missing power_supply_put()
	power: supply: cpcap-battery: Fix missing power_supply_put()
	power: supply: cpcap-charger: Fix power_supply_put on null battery pointer
	fdt: Properly handle "no-map" field in the memory region
	of/fdt: Make sure no-map does not remove already reserved regions
	RDMA/rtrs: Extend ibtrs_cq_qp_create
	RDMA/rtrs-srv: Release lock before call into close_sess
	RDMA/rtrs-srv: Use sysfs_remove_file_self for disconnect
	RDMA/rtrs-clt: Set mininum limit when create QP
	RDMA/rtrs: Call kobject_put in the failure path
	RDMA/rtrs-srv: Fix missing wr_cqe
	RDMA/rtrs-clt: Refactor the failure cases in alloc_clt
	RDMA/rtrs-srv: Init wr_cnt as 1
	power: reset: at91-sama5d2_shdwc: fix wkupdbc mask
	rtc: s5m: select REGMAP_I2C
	dmaengine: idxd: set DMA channel to be private
	power: supply: fix sbs-charger build, needs REGMAP_I2C
	clocksource/drivers/ixp4xx: Select TIMER_OF when needed
	clocksource/drivers/mxs_timer: Add missing semicolon when DEBUG is defined
	spi: imx: Don't print error on -EPROBEDEFER
	RDMA/mlx5: Use the correct obj_id upon DEVX TIR creation
	IB/mlx5: Add mutex destroy call to cap_mask_mutex mutex
	clk: sunxi-ng: h6: Fix clock divider range on some clocks
	platform/chrome: cros_ec_proto: Use EC_HOST_EVENT_MASK not BIT
	platform/chrome: cros_ec_proto: Add LID and BATTERY to default mask
	regulator: axp20x: Fix reference cout leak
	watch_queue: Drop references to /dev/watch_queue
	certs: Fix blacklist flag type confusion
	regulator: s5m8767: Fix reference count leak
	spi: atmel: Put allocated master before return
	regulator: s5m8767: Drop regulators OF node reference
	power: supply: axp20x_usb_power: Init work before enabling IRQs
	power: supply: smb347-charger: Fix interrupt usage if interrupt is unavailable
	regulator: core: Avoid debugfs: Directory ... already present! error
	isofs: release buffer head before return
	watchdog: intel-mid_wdt: Postpone IRQ handler registration till SCU is ready
	auxdisplay: ht16k33: Fix refresh rate handling
	objtool: Fix error handling for STD/CLD warnings
	objtool: Fix retpoline detection in asm code
	objtool: Fix ".cold" section suffix check for newer versions of GCC
	scsi: lpfc: Fix ancient double free
	iommu: Switch gather->end to the inclusive end
	IB/umad: Return EIO in case of when device disassociated
	IB/umad: Return EPOLLERR in case of when device disassociated
	KVM: PPC: Make the VMX instruction emulation routines static
	powerpc/47x: Disable 256k page size
	powerpc/time: Enable sched clock for irqtime
	mmc: owl-mmc: Fix a resource leak in an error handling path and in the remove function
	mmc: sdhci-sprd: Fix some resource leaks in the remove function
	mmc: usdhi6rol0: Fix a resource leak in the error handling path of the probe
	mmc: renesas_sdhi_internal_dmac: Fix DMA buffer alignment from 8 to 128-bytes
	ARM: 9046/1: decompressor: Do not clear SCTLR.nTLSMD for ARMv7+ cores
	i2c: qcom-geni: Store DMA mapping data in geni_i2c_dev struct
	amba: Fix resource leak for drivers without .remove
	iommu: Move iotlb_sync_map out from __iommu_map
	iommu: Properly pass gfp_t in _iommu_map() to avoid atomic sleeping
	IB/mlx5: Return appropriate error code instead of ENOMEM
	IB/cm: Avoid a loop when device has 255 ports
	tracepoint: Do not fail unregistering a probe due to memory failure
	rtc: zynqmp: depend on HAS_IOMEM
	perf tools: Fix DSO filtering when not finding a map for a sampled address
	perf vendor events arm64: Fix Ampere eMag event typo
	RDMA/rxe: Fix coding error in rxe_recv.c
	RDMA/rxe: Fix coding error in rxe_rcv_mcast_pkt
	RDMA/rxe: Correct skb on loopback path
	spi: stm32: properly handle 0 byte transfer
	mfd: altera-sysmgr: Fix physical address storing more
	mfd: wm831x-auxadc: Prevent use after free in wm831x_auxadc_read_irq()
	powerpc/pseries/dlpar: handle ibm, configure-connector delay status
	powerpc/8xx: Fix software emulation interrupt
	clk: qcom: gcc-msm8998: Fix Alpha PLL type for all GPLLs
	kunit: tool: fix unit test cleanup handling
	kselftests: dmabuf-heaps: Fix Makefile's inclusion of the kernel's usr/include dir
	RDMA/hns: Fixed wrong judgments in the goto branch
	RDMA/siw: Fix calculation of tx_valid_cpus size
	RDMA/hns: Fix type of sq_signal_bits
	RDMA/hns: Disable RQ inline by default
	clk: divider: fix initialization with parent_hw
	spi: pxa2xx: Fix the controller numbering for Wildcat Point
	powerpc/uaccess: Avoid might_fault() when user access is enabled
	powerpc/kuap: Restore AMR after replaying soft interrupts
	regulator: qcom-rpmh: fix pm8009 ldo7
	clk: aspeed: Fix APLL calculate formula from ast2600-A2
	selftests/ftrace: Update synthetic event syntax errors
	perf symbols: Use (long) for iterator for bfd symbols
	regulator: bd718x7, bd71828, Fix dvs voltage levels
	spi: dw: Avoid stack content exposure
	spi: Skip zero-length transfers in spi_transfer_one_message()
	printk: avoid prb_first_valid_seq() where possible
	perf symbols: Fix return value when loading PE DSO
	nfsd: register pernet ops last, unregister first
	svcrdma: Hold private mutex while invoking rdma_accept()
	ceph: fix flush_snap logic after putting caps
	RDMA/hns: Fixes missing error code of CMDQ
	RDMA/ucma: Fix use-after-free bug in ucma_create_uevent
	RDMA/rtrs-srv: Fix stack-out-of-bounds
	RDMA/rtrs: Only allow addition of path to an already established session
	RDMA/rtrs-srv: fix memory leak by missing kobject free
	RDMA/rtrs-srv-sysfs: fix missing put_device
	RDMA/rtrs-srv: Do not pass a valid pointer to PTR_ERR()
	Input: sur40 - fix an error code in sur40_probe()
	perf record: Fix continue profiling after draining the buffer
	perf intel-pt: Fix missing CYC processing in PSB
	perf intel-pt: Fix premature IPC
	perf intel-pt: Fix IPC with CYC threshold
	perf test: Fix unaligned access in sample parsing test
	Input: elo - fix an error code in elo_connect()
	sparc64: only select COMPAT_BINFMT_ELF if BINFMT_ELF is set
	sparc: fix led.c driver when PROC_FS is not enabled
	Input: zinitix - fix return type of zinitix_init_touch()
	ARM: 9065/1: OABI compat: fix build when EPOLL is not enabled
	misc: eeprom_93xx46: Fix module alias to enable module autoprobe
	phy: rockchip-emmc: emmc_phy_init() always return 0
	phy: cadence-torrent: Fix error code in cdns_torrent_phy_probe()
	misc: eeprom_93xx46: Add module alias to avoid breaking support for non device tree users
	PCI: rcar: Always allocate MSI addresses in 32bit space
	soundwire: cadence: fix ACK/NAK handling
	pwm: rockchip: Enable APB clock during register access while probing
	pwm: rockchip: rockchip_pwm_probe(): Remove superfluous clk_unprepare()
	pwm: rockchip: Eliminate potential race condition when probing
	PCI: xilinx-cpm: Fix reference count leak on error path
	VMCI: Use set_page_dirty_lock() when unregistering guest memory
	PCI: Align checking of syscall user config accessors
	mei: hbm: call mei_set_devstate() on hbm stop response
	drm/msm: Fix MSM_INFO_GET_IOVA with carveout
	drm/msm/dsi: Correct io_start for MSM8994 (20nm PHY)
	drm/msm/mdp5: Fix wait-for-commit for cmd panels
	drm/msm: Fix race of GPU init vs timestamp power management.
	drm/msm: Fix races managing the OOB state for timestamp vs timestamps.
	drm/msm/dp: trigger unplug event in msm_dp_display_disable
	vfio/iommu_type1: Populate full dirty when detach non-pinned group
	vfio/iommu_type1: Fix some sanity checks in detach group
	vfio-pci/zdev: fix possible segmentation fault issue
	ext4: fix potential htree index checksum corruption
	phy: USB_LGM_PHY should depend on X86
	coresight: etm4x: Skip accessing TRCPDCR in save/restore
	nvmem: core: Fix a resource leak on error in nvmem_add_cells_from_of()
	nvmem: core: skip child nodes not matching binding
	soundwire: bus: use sdw_update_no_pm when initializing a device
	soundwire: bus: use sdw_write_no_pm when setting the bus scale registers
	soundwire: export sdw_write/read_no_pm functions
	soundwire: bus: fix confusion on device used by pm_runtime
	misc: fastrpc: fix incorrect usage of dma_map_sgtable
	remoteproc/mediatek: acknowledge watchdog IRQ after handled
	regmap: sdw: use _no_pm functions in regmap_read/write
	ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it
	mailbox: sprd: correct definition of SPRD_OUTBOX_FIFO_FULL
	device-dax: Fix default return code of range_parse()
	PCI: pci-bridge-emul: Fix array overruns, improve safety
	PCI: cadence: Fix DMA range mapping early return error
	i40e: Fix flow for IPv6 next header (extension header)
	i40e: Add zero-initialization of AQ command structures
	i40e: Fix overwriting flow control settings during driver loading
	i40e: Fix addition of RX filters after enabling FW LLDP agent
	i40e: Fix VFs not created
	Take mmap lock in cacheflush syscall
	nios2: fixed broken sys_clone syscall
	i40e: Fix add TC filter for IPv6
	octeontx2-af: Fix an off by one in rvu_dbg_qsize_write()
	pwm: iqs620a: Fix overflow and optimize calculations
	vfio/type1: Use follow_pte()
	ice: report correct max number of TCs
	ice: Account for port VLAN in VF max packet size calculation
	ice: Fix state bits on LLDP mode switch
	ice: update the number of available RSS queues
	net: stmmac: fix CBS idleslope and sendslope calculation
	net/mlx4_core: Add missed mlx4_free_cmd_mailbox()
	PCI: rockchip: Make 'ep-gpios' DT property optional
	vxlan: move debug check after netdev unregister
	wireguard: device: do not generate ICMP for non-IP packets
	wireguard: kconfig: use arm chacha even with no neon
	ocfs2: fix a use after free on error
	mm: memcontrol: fix NR_ANON_THPS accounting in charge moving
	mm: memcontrol: fix slub memory accounting
	mm/memory.c: fix potential pte_unmap_unlock pte error
	mm/hugetlb: fix potential double free in hugetlb_register_node() error path
	mm/hugetlb: suppress wrong warning info when alloc gigantic page
	mm/compaction: fix misbehaviors of fast_find_migrateblock()
	r8169: fix jumbo packet handling on RTL8168e
	NFSv4: Fixes for nfs4_bitmask_adjust()
	KVM: SVM: Intercept INVPCID when it's disabled to inject #UD
	KVM: x86/mmu: Expand collapsible SPTE zap for TDP MMU to ZONE_DEVICE and HugeTLB pages
	arm64: Add missing ISB after invalidating TLB in __primary_switch
	i2c: brcmstb: Fix brcmstd_send_i2c_cmd condition
	i2c: exynos5: Preserve high speed master code
	mm,thp,shmem: make khugepaged obey tmpfs mount flags
	mm: fix memory_failure() handling of dax-namespace metadata
	mm/rmap: fix potential pte_unmap on an not mapped pte
	proc: use kvzalloc for our kernel buffer
	csky: Fix a size determination in gpr_get()
	scsi: bnx2fc: Fix Kconfig warning & CNIC build errors
	scsi: sd: sd_zbc: Don't pass GFP_NOIO to kvcalloc
	block: reopen the device in blkdev_reread_part
	ide/falconide: Fix module unload
	scsi: sd: Fix Opal support
	blk-settings: align max_sectors on "logical_block_size" boundary
	soundwire: intel: fix possible crash when no device is detected
	ACPI: property: Fix fwnode string properties matching
	ACPI: configfs: add missing check after configfs_register_default_group()
	cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known
	HID: logitech-dj: add support for keyboard events in eQUAD step 4 Gaming
	HID: wacom: Ignore attempts to overwrite the touch_max value from HID
	Input: raydium_ts_i2c - do not send zero length
	Input: xpad - add support for PowerA Enhanced Wired Controller for Xbox Series X|S
	Input: joydev - prevent potential read overflow in ioctl
	Input: i8042 - add ASUS Zenbook Flip to noselftest list
	media: mceusb: Fix potential out-of-bounds shift
	USB: serial: option: update interface mapping for ZTE P685M
	usb: musb: Fix runtime PM race in musb_queue_resume_work
	usb: dwc3: gadget: Fix setting of DEPCFG.bInterval_m1
	usb: dwc3: gadget: Fix dep->interval for fullspeed interrupt
	USB: serial: ftdi_sio: fix FTX sub-integer prescaler
	USB: serial: pl2303: fix line-speed handling on newer chips
	USB: serial: mos7840: fix error code in mos7840_write()
	USB: serial: mos7720: fix error code in mos7720_write()
	phy: lantiq: rcu-usb2: wait after clock enable
	ALSA: fireface: fix to parse sync status register of latter protocol
	ALSA: hda: Add another CometLake-H PCI ID
	ALSA: hda/hdmi: Drop bogus check at closing a stream
	ALSA: hda/realtek: modify EAPD in the ALC886
	ALSA: hda/realtek: Quirk for HP Spectre x360 14 amp setup
	MIPS: Ingenic: Disable HPTLB for D0 XBurst CPUs too
	MIPS: Support binutils configured with --enable-mips-fix-loongson3-llsc=yes
	MIPS: VDSO: Use CLANG_FLAGS instead of filtering out '--target='
	Revert "MIPS: Octeon: Remove special handling of CONFIG_MIPS_ELF_APPENDED_DTB=y"
	Revert "bcache: Kill btree_io_wq"
	bcache: Give btree_io_wq correct semantics again
	bcache: Move journal work to new flush wq
	Revert "drm/amd/display: Update NV1x SR latency values"
	drm/amd/display: Add FPU wrappers to dcn21_validate_bandwidth()
	drm/amd/display: Remove Assert from dcn10_get_dig_frontend
	drm/amd/display: Add vupdate_no_lock interrupts for DCN2.1
	drm/amdkfd: Fix recursive lock warnings
	drm/amdgpu: Set reference clock to 100Mhz on Renoir (v2)
	drm/nouveau/kms: handle mDP connectors
	drm/modes: Switch to 64bit maths to avoid integer overflow
	drm/sched: Cancel and flush all outstanding jobs before finish.
	drm/panel: kd35t133: allow using non-continuous dsi clock
	drm/rockchip: Require the YTR modifier for AFBC
	ASoC: siu: Fix build error by a wrong const prefix
	selinux: fix inconsistency between inode_getxattr and inode_listsecurity
	erofs: initialized fields can only be observed after bit is set
	tpm_tis: Fix check_locality for correct locality acquisition
	tpm_tis: Clean up locality release
	KEYS: trusted: Fix incorrect handling of tpm_get_random()
	KEYS: trusted: Fix migratable=1 failing
	KEYS: trusted: Reserve TPM for seal and unseal operations
	btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node
	btrfs: do not warn if we can't find the reloc root when looking up backref
	btrfs: add asserts for deleting backref cache nodes
	btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root
	btrfs: fix reloc root leak with 0 ref reloc roots on recovery
	btrfs: splice remaining dirty_bg's onto the transaction dirty bg list
	btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself
	btrfs: account for new extents being deleted in total_bytes_pinned
	btrfs: fix extent buffer leak on failure to copy root
	drm/i915/gt: Flush before changing register state
	drm/i915/gt: Correct surface base address for renderclear
	crypto: arm64/sha - add missing module aliases
	crypto: aesni - prevent misaligned buffers on the stack
	crypto: michael_mic - fix broken misalignment handling
	crypto: sun4i-ss - checking sg length is not sufficient
	crypto: sun4i-ss - IV register does not work on A10 and A13
	crypto: sun4i-ss - handle BigEndian for cipher
	crypto: sun4i-ss - initialize need_fallback
	soc: samsung: exynos-asv: don't defer early on not-supported SoCs
	soc: samsung: exynos-asv: handle reading revision register error
	seccomp: Add missing return in non-void function
	arm64: ptrace: Fix seccomp of traced syscall -1 (NO_SYSCALL)
	misc: rtsx: init of rts522a add OCP power off when no card is present
	drivers/misc/vmw_vmci: restrict too big queue size in qp_host_alloc_queue
	pstore: Fix typo in compression option name
	dts64: mt7622: fix slow sd card access
	arm64: dts: agilex: fix phy interface bit shift for gmac1 and gmac2
	staging/mt7621-dma: mtk-hsdma.c->hsdma-mt7621.c
	staging: gdm724x: Fix DMA from stack
	staging: rtl8188eu: Add Edimax EW-7811UN V2 to device table
	floppy: reintroduce O_NDELAY fix
	media: i2c: max9286: fix access to unallocated memory
	media: ir_toy: add another IR Droid device
	media: ipu3-cio2: Fix mbus_code processing in cio2_subdev_set_fmt()
	media: marvell-ccic: power up the device on mclk enable
	media: smipcie: fix interrupt handling and IR timeout
	x86/virt: Eat faults on VMXOFF in reboot flows
	x86/reboot: Force all cpus to exit VMX root if VMX is supported
	x86/fault: Fix AMD erratum #91 errata fixup for user code
	x86/entry: Fix instrumentation annotation
	powerpc/prom: Fix "ibm,arch-vec-5-platform-support" scan
	rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
	rcu/nocb: Perform deferred wake up before last idle's need_resched() check
	kprobes: Fix to delay the kprobes jump optimization
	arm64: Extend workaround for erratum 1024718 to all versions of Cortex-A55
	iommu/arm-smmu-qcom: Fix mask extraction for bootloader programmed SMRs
	arm64: kexec_file: fix memory leakage in create_dtb() when fdt_open_into() fails
	arm64: uprobe: Return EOPNOTSUPP for AARCH32 instruction probing
	arm64 module: set plt* section addresses to 0x0
	arm64: spectre: Prevent lockdep splat on v4 mitigation enable path
	riscv: Disable KSAN_SANITIZE for vDSO
	watchdog: qcom: Remove incorrect usage of QCOM_WDT_ENABLE_IRQ
	watchdog: mei_wdt: request stop on unregister
	coresight: etm4x: Handle accesses to TRCSTALLCTLR
	mtd: spi-nor: sfdp: Fix last erase region marking
	mtd: spi-nor: sfdp: Fix wrong erase type bitmask for overlaid region
	mtd: spi-nor: core: Fix erase type discovery for overlaid region
	mtd: spi-nor: core: Add erase size check for erase command initialization
	mtd: spi-nor: hisi-sfc: Put child node np on error path
	fs/affs: release old buffer head on error path
	seq_file: document how per-entry resources are managed.
	x86: fix seq_file iteration for pat/memtype.c
	mm: memcontrol: fix swap undercounting in cgroup2
	mm: memcontrol: fix get_active_memcg return value
	hugetlb: fix update_and_free_page contig page struct assumption
	hugetlb: fix copy_huge_page_from_user contig page struct assumption
	mm/vmscan: restore zone_reclaim_mode ABI
	mm, compaction: make fast_isolate_freepages() stay within zone
	KVM: nSVM: fix running nested guests when npt=0
	nvmem: qcom-spmi-sdam: Fix uninitialized pdev pointer
	module: Ignore _GLOBAL_OFFSET_TABLE_ when warning for undefined symbols
	mmc: sdhci-esdhc-imx: fix kernel panic when remove module
	mmc: sdhci-pci-o2micro: Bug fix for SDR104 HW tuning failure
	powerpc/32: Preserve cr1 in exception prolog stack check to fix build error
	powerpc/kexec_file: fix FDT size estimation for kdump kernel
	powerpc/32s: Add missing call to kuep_lock on syscall entry
	spmi: spmi-pmic-arb: Fix hw_irq overflow
	mei: fix transfer over dma with extended header
	mei: me: emmitsburg workstation DID
	mei: me: add adler lake point S DID
	mei: me: add adler lake point LP DID
	gpio: pcf857x: Fix missing first interrupt
	mfd: gateworks-gsc: Fix interrupt type
	printk: fix deadlock when kernel panic
	exfat: fix shift-out-of-bounds in exfat_fill_super()
	zonefs: Fix file size of zones in full condition
	kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE
	thermal: cpufreq_cooling: freq_qos_update_request() returns < 0 on error
	cpufreq: qcom-hw: drop devm_xxx() calls from init/exit hooks
	cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument
	cpufreq: intel_pstate: Get per-CPU max freq via MSR_HWP_CAPABILITIES if available
	proc: don't allow async path resolution of /proc/thread-self components
	s390/vtime: fix inline assembly clobber list
	virtio/s390: implement virtio-ccw revision 2 correctly
	um: mm: check more comprehensively for stub changes
	um: defer killing userspace on page table update failures
	irqchip/loongson-pch-msi: Use bitmap_zalloc() to allocate bitmap
	f2fs: fix out-of-repair __setattr_copy()
	f2fs: enforce the immutable flag on open files
	f2fs: flush data when enabling checkpoint back
	sparc32: fix a user-triggerable oops in clear_user()
	spi: fsl: invert spisel_boot signal on MPC8309
	spi: spi-synquacer: fix set_cs handling
	gfs2: fix glock confusion in function signal_our_withdraw
	gfs2: Don't skip dlm unlock if glock has an lvb
	gfs2: Lock imbalance on error path in gfs2_recover_one
	gfs2: Recursive gfs2_quota_hold in gfs2_iomap_end
	dm: fix deadlock when swapping to encrypted device
	dm table: fix iterate_devices based device capability checks
	dm table: fix DAX iterate_devices based device capability checks
	dm table: fix zoned iterate_devices based device capability checks
	dm writecache: fix performance degradation in ssd mode
	dm writecache: return the exact table values that were set
	dm writecache: fix writing beyond end of underlying device when shrinking
	dm era: Recover committed writeset after crash
	dm era: Update in-core bitset after committing the metadata
	dm era: Verify the data block size hasn't changed
	dm era: Fix bitset memory leaks
	dm era: Use correct value size in equality function of writeset tree
	dm era: Reinitialize bitset cache before digesting a new writeset
	dm era: only resize metadata in preresume
	drm/i915: Reject 446-480MHz HDMI clock on GLK
	kgdb: fix to kill breakpoints on initmem after boot
	ipv6: silence compilation warning for non-IPV6 builds
	net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending
	wireguard: selftests: test multiple parallel streams
	wireguard: queueing: get rid of per-peer ring buffers
	net: sched: fix police ext initialization
	net: qrtr: Fix memory leak in qrtr_tun_open
	net_sched: fix RTNL deadlock again caused by request_module()
	ARM: dts: aspeed: Add LCLK to lpc-snoop
	Linux 5.10.20

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I3fbcecd9413ce212dac68d5cc800c9457feba56a
2021-03-07 12:33:33 +01:00
Mikulas Patocka
556c513e6b blk-settings: align max_sectors on "logical_block_size" boundary
commit 97f433c3601a24d3513d06f575a389a2ca4e11e4 upstream.

We get I/O errors when we run md-raid1 on the top of dm-integrity on the
top of ramdisk.
device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8048, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8147, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8246, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8345, 0xbb

The ramdisk device has logical_block_size 512 and max_sectors 255. The
dm-integrity device uses logical_block_size 4096 and it doesn't affect the
"max_sectors" value - thus, it inherits 255 from the ramdisk. So, we have
a device with max_sectors not aligned on logical_block_size.

The md-raid device sees that the underlying leg has max_sectors 255 and it
will split the bios on 255-sector boundary, making the bios unaligned on
logical_block_size.

In order to fix the bug, we round down max_sectors to logical_block_size.

Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:38:22 +01:00
Christoph Hellwig
cc88a819a1 block: reopen the device in blkdev_reread_part
[ Upstream commit 4601b4b130de2329fe06df80ed5d77265f2058e5 ]

Historically the BLKRRPART ioctls called into the now defunct ->revalidate
method, which caused the sd driver to check if any media is present.
When the ->revalidate method was removed this revalidation was lost,
leading to lots of I/O errors when using the eject command.  Fix this by
reopening the device to rescan the partitions, and thus calling the
revalidation logic in the sd driver.

Fixes: 471bd0af54 ("sd: use bdev_check_media_change")
Reported--by: Tom Seewald <tseewald@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Tom Seewald <tseewald@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:38:21 +01:00
Pan Bian
1c7b7d476e bsg: free the request before return error code
[ Upstream commit 0f7b4bc6bb1e57c48ef14f1818df947c1612b206 ]

Free the request rq before returning error code.

Fixes: 972248e911 ("scsi: bsg-lib: handle bidi requests without block layer help")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04 11:37:42 +01:00
Jan Kara
89e3d1a85d bfq: Avoid false bfq queue merging
commit 41e76c85660c022c6bf5713bfb6c21e64a487cec upstream.

bfq_setup_cooperator() uses bfqd->in_serv_last_pos so detect whether it
makes sense to merge current bfq queue with the in-service queue.
However if the in-service queue is freshly scheduled and didn't dispatch
any requests yet, bfqd->in_serv_last_pos is stale and contains value
from the previously scheduled bfq queue which can thus result in a bogus
decision that the two queues should be merged. This bug can be observed
for example with the following fio jobfile:

[global]
direct=0
ioengine=sync
invalidate=1
size=1g
rw=read

[reader]
numjobs=4
directory=/mnt

where the 4 processes will end up in the one shared bfq queue although
they do IO to physically very distant files (for some reason I was able to
observe this only with slice_idle=1ms setting).

Fix the problem by invalidating bfqd->in_serv_last_pos when switching
in-service queue.

Fixes: 058fdecc6d ("block, bfq: fix in-service-queue check for queue merging")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04 11:37:18 +01:00
Eric Biggers
25b3bc709c UPSTREAM: block/keyslot-manager: introduce devm_blk_ksm_init()
Add a resource-managed variant of blk_ksm_init() so that drivers don't
have to worry about calling blk_ksm_destroy().

Note that the implementation uses a custom devres action to call
blk_ksm_destroy() rather than switching the two allocations to be
directly devres-managed, e.g. with devm_kmalloc().  This is because we
need to keep zeroing the memory containing the keyslots when it is
freed, and also because we want to continue using kvmalloc() (and there
is no devm_kvmalloc()).

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20210121082155.111333-2-ebiggers@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>

(cherry picked from commit 5851d3b042b694839d2241fbb3200ce958135cdf)
Bug: 161256007
Change-Id: I3cda7fc5f7509f8d967882a88ca30ddc6f6a0426
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-23 08:10:56 +01:00
Eric Biggers
537d3bb974 ANDROID: dm: sync inline crypto support with patches going upstream
Replace the following patches with upstream versions
(well, almost upstream; as of 2021-02-12 they are queued for 5.12 at
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next):

	ANDROID-dm-add-support-for-passing-through-inline-crypto-support.patch
	ANDROID-dm-enable-may_passthrough_inline_crypto-on-some-targets.patch
	ANDROID-block-Introduce-passthrough-keyslot-manager.patch

Also, resolve conflicts with the following non-upstream patches for
hardware-wrapped key support.  Notably, we need to handle the field
blk_keyslot_manager::features in a few places:

	ANDROID-block-add-hardware-wrapped-key-support.patch
	ANDROID-dm-add-support-for-passing-through-derive_raw_secret.patch

Finally, update non-upstream device-mapper targets (dm-bow and
dm-default-key) to use the new way of specifying inline crypto
passthrough support (DM_TARGET_PASSES_CRYPTO) rather than the old way
(may_passthrough_inline_crypto).  These changes should be folded into:

	ANDROID-dm-bow-Add-dm-bow-feature.patch
	ANDROID-dm-add-dm-default-key-target-for-metadata-encryption.patch

Test: tested on db845c; verified that inline crypto support gets passed
      through over dm-linear.
Bug: 162257830
Change-Id: I5e3dea1aa09fc1215c90857b5b51d9e3720ef7db
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-19 10:48:51 +00:00
Greg Kroah-Hartman
b129c98dc6 Merge 5.10.17 into android12-5.10
Changes in 5.10.17
	objtool: Fix seg fault with Clang non-section symbols
	Revert "dts: phy: add GPIO number and active state used for phy reset"
	gpio: mxs: GPIO_MXS should not default to y unconditionally
	gpio: ep93xx: fix BUG_ON port F usage
	gpio: ep93xx: Fix single irqchip with multi gpiochips
	tracing: Do not count ftrace events in top level enable output
	tracing: Check length before giving out the filter buffer
	drm/i915: Fix overlay frontbuffer tracking
	arm/xen: Don't probe xenbus as part of an early initcall
	cgroup: fix psi monitor for root cgroup
	Revert "drm/amd/display: Update NV1x SR latency values"
	drm/i915/tgl+: Make sure TypeC FIA is powered up when initializing it
	drm/dp_mst: Don't report ports connected if nothing is attached to them
	dmaengine: move channel device_node deletion to driver
	tmpfs: disallow CONFIG_TMPFS_INODE64 on s390
	tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha
	soc: ti: omap-prm: Fix boot time errors for rst_map_012 bits 0 and 1
	arm64: dts: rockchip: Fix PCIe DT properties on rk3399
	arm64: dts: qcom: sdm845: Reserve LPASS clocks in gcc
	ARM: OMAP2+: Fix suspcious RCU usage splats for omap_enter_idle_coupled
	arm64: dts: rockchip: remove interrupt-names property from rk3399 vdec node
	platform/x86: hp-wmi: Disable tablet-mode reporting by default
	arm64: dts: rockchip: Disable display for NanoPi R2S
	ovl: perform vfs_getxattr() with mounter creds
	cap: fix conversions on getxattr
	ovl: skip getxattr of security labels
	scsi: lpfc: Fix EEH encountering oops with NVMe traffic
	x86/split_lock: Enable the split lock feature on another Alder Lake CPU
	nvme-pci: ignore the subsysem NQN on Phison E16
	drm/amd/display: Fix DPCD translation for LTTPR AUX_RD_INTERVAL
	drm/amd/display: Add more Clock Sources to DCN2.1
	drm/amd/display: Release DSC before acquiring
	drm/amd/display: Fix dc_sink kref count in emulated_link_detect
	drm/amd/display: Free atomic state after drm_atomic_commit
	drm/amd/display: Decrement refcount of dc_sink before reassignment
	riscv: virt_addr_valid must check the address belongs to linear mapping
	bfq-iosched: Revert "bfq: Fix computation of shallow depth"
	ARM: dts: lpc32xx: Revert set default clock rate of HCLK PLL
	kallsyms: fix nonconverging kallsyms table with lld
	ARM: ensure the signal page contains defined contents
	ARM: kexec: fix oops after TLB are invalidated
	ubsan: implement __ubsan_handle_alignment_assumption
	Revert "lib: Restrict cpumask_local_spread to houskeeping CPUs"
	x86/efi: Remove EFI PGD build time checks
	lkdtm: don't move ctors to .rodata
	KVM: x86: cleanup CR3 reserved bits checks
	cgroup-v1: add disabled controller check in cgroup1_parse_param()
	dmaengine: idxd: fix misc interrupt completion
	ath9k: fix build error with LEDS_CLASS=m
	mt76: dma: fix a possible memory leak in mt76_add_fragment()
	drm/vc4: hvs: Fix buffer overflow with the dlist handling
	dmaengine: idxd: check device state before issue command
	bpf: Unbreak BPF_PROG_TYPE_KPROBE when kprobe is called via do_int3
	bpf: Check for integer overflow when using roundup_pow_of_two()
	netfilter: xt_recent: Fix attempt to update deleted entry
	selftests: netfilter: fix current year
	netfilter: nftables: fix possible UAF over chains from packet path in netns
	netfilter: flowtable: fix tcp and udp header checksum update
	xen/netback: avoid race in xenvif_rx_ring_slots_available()
	net: hdlc_x25: Return meaningful error code in x25_open
	net: ipa: set error code in gsi_channel_setup()
	hv_netvsc: Reset the RSC count if NVSP_STAT_FAIL in netvsc_receive()
	net: enetc: initialize the RFS and RSS memories
	selftests: txtimestamp: fix compilation issue
	net: stmmac: set TxQ mode back to DCB after disabling CBS
	ibmvnic: Clear failover_pending if unable to schedule
	netfilter: conntrack: skip identical origin tuple in same zone only
	scsi: scsi_debug: Fix a memory leak
	x86/build: Disable CET instrumentation in the kernel for 32-bit too
	net: dsa: felix: implement port flushing on .phylink_mac_link_down
	net: hns3: add a check for queue_id in hclge_reset_vf_queue()
	net: hns3: add a check for tqp_index in hclge_get_ring_chain_from_mbx()
	net: hns3: add a check for index in hclge_get_rss_key()
	firmware_loader: align .builtin_fw to 8
	drm/sun4i: tcon: set sync polarity for tcon1 channel
	drm/sun4i: dw-hdmi: always set clock rate
	drm/sun4i: Fix H6 HDMI PHY configuration
	drm/sun4i: dw-hdmi: Fix max. frequency for H6
	clk: sunxi-ng: mp: fix parent rate change flag check
	i2c: stm32f7: fix configuration of the digital filter
	h8300: fix PREEMPTION build, TI_PRE_COUNT undefined
	scripts: set proper OpenSSL include dir also for sign-file
	x86/pci: Create PCI/MSI irqdomain after x86_init.pci.arch_init()
	arm64: mte: Allow PTRACE_PEEKMTETAGS access to the zero page
	rxrpc: Fix clearance of Tx/Rx ring when releasing a call
	udp: fix skb_copy_and_csum_datagram with odd segment sizes
	net: dsa: call teardown method on probe failure
	cpufreq: ACPI: Extend frequency tables to cover boost frequencies
	cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there
	net: gro: do not keep too many GRO packets in napi->rx_list
	net: fix iteration for sctp transport seq_files
	net/vmw_vsock: fix NULL pointer dereference
	net/vmw_vsock: improve locking in vsock_connect_timeout()
	net: watchdog: hold device global xmit lock during tx disable
	bridge: mrp: Fix the usage of br_mrp_port_switchdev_set_state
	switchdev: mrp: Remove SWITCHDEV_ATTR_ID_MRP_PORT_STAT
	vsock/virtio: update credit only if socket is not closed
	vsock: fix locking in vsock_shutdown()
	net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
	net/qrtr: restrict user-controlled length in qrtr_tun_write_iter()
	ovl: expand warning in ovl_d_real()
	kcov, usb: only collect coverage from __usb_hcd_giveback_urb in softirq
	Linux 5.10.17

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id0300681f52b51d3f466f1e66ec3a6c25f65f4d3
2021-02-18 11:21:01 +01:00
Lin Feng
d93178df8f bfq-iosched: Revert "bfq: Fix computation of shallow depth"
[ Upstream commit 388c705b95f23f317fa43e6abf9ff07b583b721a ]

This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.

bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core
sbitmap_get_shallow, which uses just the number to limit the scan depth of
each bitmap word, formula:
scan_percentage_for_each_word = shallow_depth / (1 << sbimap->shift) * 100%

That means the comments's percentiles 50%, 75%, 18%, 37% of bfq are correct.
But after commit patch 'bfq: Fix computation of shallow depth', we use
sbitmap.depth instead, as a example in following case:

sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
The resulsts of computed bfqd->word_depths[] are {128, 192, 48, 96}, and
three of the numbers exceed core dirver's 'sbitmap_word.depth=64' limit
nothing.

Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-17 11:02:24 +01:00
Greg Kroah-Hartman
a6310f1034 This is the 5.10.16 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmAnzLcACgkQONu9yGCS
 aT5FUQ/+LUBYHpWjyV1wrnjwf3AAtcUnZGtPUOEsv9d9lAcituistag2zHXive9g
 K7HGria7BVcARnAtdcOLWB7ur9Vj+Ch1XVOVhSdI8EgGPslxoWKxmM03FQtSjQak
 OYZAHc/A/mrTtG+rYROx4gp+jxaiaUx8e/zleFgNeN1GU9/owR2H8+d/a2L3bnzN
 mgaYG4/0GTy1JfDXwsmiNa376dIViPAkukjS8AV+dPZKFag+TmcE0d/qTtDlmiQO
 gSboV/8FzwKgUIxjOt6Rw6AniCfGTew/Dy/NkRiGB4ge5+aMZe78+IZ6xzRlbVix
 d1/+7Iviy40pTWOZdRxwefAj0/MS9zZeVrDSA/Ips24EfD/0qxq9QEa3cEXvQkZF
 ih5AX9obPBxHRsFwn7x9siP3ZW1W2jaEYzrXxIWBJxFDVRRh3/DMo5rljSkUWxzS
 8dBpxfNiRMggsbgKPBtuV5+4Dzdbx5Dn1sbaMgT9pU1f+U0LH0KjIU1evuCFqUo6
 C/Y61pDjc8GotBFuKjcbCYBMWpAJ/UwqRn4HrMBRMN+ZOpBQr/2RLaM8ROla8H3W
 GrhADQlDuHForKHRuiuBpaUxZGLeZw2dpZClrV0WwzHLLV0KsQC0+xE9ge0/GPtQ
 rnJPxYiKg2WJctVBlH2i5uLw6s25+dq4ufSZBmr2AOg8u0YccU4=
 =BFeH
 -----END PGP SIGNATURE-----

Merge 5.10.16 into android12-5.10

Changes in 5.10.16
	io_uring: simplify io_task_match()
	io_uring: add a {task,files} pair matching helper
	io_uring: don't iterate io_uring_cancel_files()
	io_uring: pass files into kill timeouts/poll
	io_uring: always batch cancel in *cancel_files()
	io_uring: fix files cancellation
	io_uring: account io_uring internal files as REQ_F_INFLIGHT
	io_uring: if we see flush on exit, cancel related tasks
	io_uring: fix __io_uring_files_cancel() with TASK_UNINTERRUPTIBLE
	io_uring: replace inflight_wait with tctx->wait
	io_uring: fix cancellation taking mutex while TASK_UNINTERRUPTIBLE
	io_uring: fix flush cqring overflow list while TASK_INTERRUPTIBLE
	io_uring: fix list corruption for splice file_get
	io_uring: fix sqo ownership false positive warning
	io_uring: reinforce cancel on flush during exit
	io_uring: drop mm/files between task_work_submit
	gpiolib: cdev: clear debounce period if line set to output
	powerpc/64/signal: Fix regression in __kernel_sigtramp_rt64() semantics
	af_key: relax availability checks for skb size calculation
	regulator: core: avoid regulator_resolve_supply() race condition
	ASoC: wm_adsp: Fix control name parsing for multi-fw
	drm/nouveau/nvif: fix method count when pushing an array
	mac80211: 160MHz with extended NSS BW in CSA
	ASoC: Intel: Skylake: Zero snd_ctl_elem_value
	chtls: Fix potential resource leak
	pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
	pNFS/NFSv4: Improve rejection of out-of-order layouts
	ALSA: hda: intel-dsp-config: add PCI id for TGL-H
	ASoC: ak4458: correct reset polarity
	ASoC: Intel: sof_sdw: set proper flags for Dell TGL-H SKU 0A5E
	iwlwifi: mvm: skip power command when unbinding vif during CSA
	iwlwifi: mvm: take mutex for calling iwl_mvm_get_sync_time()
	iwlwifi: pcie: add a NULL check in iwl_pcie_txq_unmap
	iwlwifi: pcie: fix context info memory leak
	iwlwifi: mvm: invalidate IDs of internal stations at mvm start
	iwlwifi: pcie: add rules to match Qu with Hr2
	iwlwifi: mvm: guard against device removal in reprobe
	iwlwifi: queue: bail out on invalid freeing
	SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
	SUNRPC: Handle 0 length opaque XDR object data properly
	i2c: mediatek: Move suspend and resume handling to NOIRQ phase
	blk-cgroup: Use cond_resched() when destroy blkgs
	regulator: Fix lockdep warning resolving supplies
	bpf: Fix verifier jmp32 pruning decision logic
	bpf: Fix 32 bit src register truncation on div/mod
	bpf: Fix verifier jsgt branch analysis on max bound
	drm/i915: Fix ICL MG PHY vswing handling
	drm/i915: Skip vswing programming for TBT
	nilfs2: make splice write available again
	Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
	squashfs: avoid out of bounds writes in decompressors
	squashfs: add more sanity checks in id lookup
	squashfs: add more sanity checks in inode lookup
	squashfs: add more sanity checks in xattr id lookup
	Linux 5.10.16

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie3d667eb0c90288b118c756a33c70c8ceb097405
2021-02-13 14:19:38 +01:00
Baolin Wang
fb8f9b2f7d blk-cgroup: Use cond_resched() when destroy blkgs
[ Upstream commit 6c635caef410aa757befbd8857c1eadde5cc22ed ]

On !PREEMPT kernel, we can get below softlockup when doing stress
testing with creating and destroying block cgroup repeatly. The
reason is it may take a long time to acquire the queue's lock in
the loop of blkcg_destroy_blkgs(), or the system can accumulate a
huge number of blkgs in pathological cases. We can add a need_resched()
check on each loop and release locks and do cond_resched() if true
to avoid this issue, since the blkcg_destroy_blkgs() is not called
from atomic contexts.

[ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
[ 4757.010698] Call trace:
[ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
[ 4757.010701]  cgwb_release_workfn+0x104/0x158
[ 4757.010702]  process_one_work+0x1bc/0x3f0
[ 4757.010704]  worker_thread+0x164/0x468
[ 4757.010705]  kthread+0x108/0x138

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-02-13 13:55:13 +01:00
Greg Kroah-Hartman
cf5b2483a5 Merge 5.10.13 into android12-5.10
Changes in 5.10.13
	iwlwifi: provide gso_type to GSO packets
	nbd: freeze the queue while we're adding connections
	tty: avoid using vfs_iocb_iter_write() for redirected console writes
	ACPI: sysfs: Prefer "compatible" modalias
	ACPI: thermal: Do not call acpi_thermal_check() directly
	kernel: kexec: remove the lock operation of system_transition_mutex
	ALSA: hda/realtek: Enable headset of ASUS B1400CEPE with ALC256
	ALSA: hda/via: Apply the workaround generically for Clevo machines
	parisc: Enable -mlong-calls gcc option by default when !CONFIG_MODULES
	media: cec: add stm32 driver
	media: cedrus: Fix H264 decoding
	media: hantro: Fix reset_raw_fmt initialization
	media: rc: fix timeout handling after switch to microsecond durations
	media: rc: ite-cir: fix min_timeout calculation
	media: rc: ensure that uevent can be read directly after rc device register
	ARM: dts: tbs2910: rename MMC node aliases
	ARM: dts: ux500: Reserve memory carveouts
	ARM: dts: imx6qdl-gw52xx: fix duplicate regulator naming
	wext: fix NULL-ptr-dereference with cfg80211's lack of commit()
	x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled
	ASoC: AMD Renoir - refine DMI entries for some Lenovo products
	Revert "drm/amdgpu/swsmu: drop set_fan_speed_percent (v2)"
	drm/nouveau/kms/gk104-gp1xx: Fix > 64x64 cursors
	drm/i915: Always flush the active worker before returning from the wait
	drm/i915/gt: Always try to reserve GGTT address 0x0
	drivers/nouveau/kms/nv50-: Reject format modifiers for cursor planes
	bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES
	net: usb: qmi_wwan: added support for Thales Cinterion PLSx3 modem family
	s390: uv: Fix sysfs max number of VCPUs reporting
	s390/vfio-ap: No need to disable IRQ after queue reset
	PM: hibernate: flush swap writer after marking
	x86/entry: Emit a symbol for register restoring thunk
	efi/apple-properties: Reinstate support for boolean properties
	crypto: marvel/cesa - Fix tdma descriptor on 64-bit
	drivers: soc: atmel: Avoid calling at91_soc_init on non AT91 SoCs
	drivers: soc: atmel: add null entry at the end of at91_soc_allowed_list[]
	btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
	btrfs: fix possible free space tree corruption with online conversion
	KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[]
	KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()
	KVM: arm64: Filter out v8.1+ events on v8.0 HW
	KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit
	KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX
	KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration
	KVM: x86: get smi pending status correctly
	KVM: Forbid the use of tagged userspace addresses for memslots
	io_uring: fix wqe->lock/completion_lock deadlock
	xen: Fix XenStore initialisation for XS_LOCAL
	leds: trigger: fix potential deadlock with libata
	arm64: dts: broadcom: Fix USB DMA address translation for Stingray
	mt7601u: fix kernel crash unplugging the device
	mt76: mt7663s: fix rx buffer refcounting
	mt7601u: fix rx buffer refcounting
	iwlwifi: Fix IWL_SUBDEVICE_NO_160 macro to use the correct bit.
	drm/i915/gt: Clear CACHE_MODE prior to clearing residuals
	drm/i915/pmu: Don't grab wakeref when enabling events
	net/mlx5e: Fix IPSEC stats
	ARM: dts: imx6qdl-kontron-samx6i: fix pwms for lcd-backlight
	drm/nouveau/svm: fail NOUVEAU_SVM_INIT ioctl on unsupported devices
	drm/vc4: Correct lbm size and calculation
	drm/vc4: Correct POS1_SCL for hvs5
	drm/nouveau/dispnv50: Restore pushing of all data.
	drm/i915: Check for all subplatform bits
	drm/i915/selftest: Fix potential memory leak
	uapi: fix big endian definition of ipv6_rpl_sr_hdr
	KVM: Documentation: Fix spec for KVM_CAP_ENABLE_CAP_VM
	tee: optee: replace might_sleep with cond_resched
	xen-blkfront: allow discard-* nodes to be optional
	blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue
	clk: imx: fix Kconfig warning for i.MX SCU clk
	clk: mmp2: fix build without CONFIG_PM
	clk: qcom: gcc-sm250: Use floor ops for sdcc clks
	ARM: imx: build suspend-imx6.S with arm instruction set
	ARM: zImage: atags_to_fdt: Fix node names on added root nodes
	netfilter: nft_dynset: add timeout extension to template
	Revert "RDMA/mlx5: Fix devlink deadlock on net namespace deletion"
	Revert "block: simplify set_init_blocksize" to regain lost performance
	xfrm: Fix oops in xfrm_replay_advance_bmp
	xfrm: fix disable_xfrm sysctl when used on xfrm interfaces
	selftests: xfrm: fix test return value override issue in xfrm_policy.sh
	xfrm: Fix wraparound in xfrm_policy_addr_delta()
	arm64: dts: ls1028a: fix the offset of the reset register
	ARM: imx: fix imx8m dependencies
	ARM: dts: imx6qdl-kontron-samx6i: fix i2c_lcd/cam default status
	ARM: dts: imx6qdl-sr-som: fix some cubox-i platforms
	arm64: dts: imx8mp: Correct the gpio ranges of gpio3
	firmware: imx: select SOC_BUS to fix firmware build
	RDMA/cxgb4: Fix the reported max_recv_sge value
	ASoC: dt-bindings: lpass: Fix and common up lpass dai ids
	ASoC: qcom: Fix incorrect volatile registers
	ASoC: qcom: Fix broken support to MI2S TERTIARY and QUATERNARY
	ASoC: qcom: lpass-ipq806x: fix bitwidth regmap field
	spi: altera: Fix memory leak on error path
	ASoC: Intel: Skylake: skl-topology: Fix OOPs ib skl_tplg_complete
	powerpc/64s: prevent recursive replay_soft_interrupts causing superfluous interrupt
	pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
	pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
	ASoC: SOF: Intel: soundwire: fix select/depend unmet dependencies
	ASoC: qcom: lpass: Fix out-of-bounds DAI ID lookup
	iwlwifi: pcie: avoid potential PNVM leaks
	iwlwifi: pnvm: don't skip everything when not reloading
	iwlwifi: pnvm: don't try to load after failures
	iwlwifi: pcie: set LTR on more devices
	iwlwifi: pcie: use jiffies for memory read spin time limit
	iwlwifi: pcie: reschedule in long-running memory reads
	mac80211: pause TX while changing interface type
	ice: fix FDir IPv6 flexbyte
	ice: Implement flow for IPv6 next header (extension header)
	ice: update dev_addr in ice_set_mac_address even if HW filter exists
	ice: Don't allow more channels than LAN MSI-X available
	ice: Fix MSI-X vector fallback logic
	i40e: acquire VSI pointer only after VF is initialized
	igc: fix link speed advertising
	net/mlx5: Fix memory leak on flow table creation error flow
	net/mlx5e: E-switch, Fix rate calculation for overflow
	net/mlx5e: free page before return
	net/mlx5e: Reduce tc unsupported key print level
	net/mlx5: Maintain separate page trees for ECPF and PF functions
	net/mlx5e: Disable hw-tc-offload when MLX5_CLS_ACT config is disabled
	net/mlx5e: Fix CT rule + encap slow path offload and deletion
	net/mlx5e: Correctly handle changing the number of queues when the interface is down
	net/mlx5e: Revert parameters on errors when changing trust state without reset
	net/mlx5e: Revert parameters on errors when changing MTU and LRO state without reset
	net/mlx5: CT: Fix incorrect removal of tuple_nat_node from nat rhashtable
	can: dev: prevent potential information leak in can_fill_info()
	ACPI/IORT: Do not blindly trust DMA masks from firmware
	of/device: Update dma_range_map only when dev has valid dma-ranges
	iommu/amd: Use IVHD EFR for early initialization of IOMMU features
	iommu/vt-d: Correctly check addr alignment in qi_flush_dev_iotlb_pasid()
	nvme-multipath: Early exit if no path is available
	selftests: forwarding: Specify interface when invoking mausezahn
	rxrpc: Fix memory leak in rxrpc_lookup_local
	NFC: fix resource leak when target index is invalid
	NFC: fix possible resource leak
	ASoC: mediatek: mt8183-da7219: ignore TDM DAI link by default
	ASoC: mediatek: mt8183-mt6358: ignore TDM DAI link by default
	ASoC: topology: Properly unregister DAI on removal
	ASoC: topology: Fix memory corruption in soc_tplg_denum_create_values()
	scsi: qla2xxx: Fix description for parameter ql2xenforce_iocb_limit
	team: protect features update by RCU to avoid deadlock
	tcp: make TCP_USER_TIMEOUT accurate for zero window probes
	tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN
	vsock: fix the race conditions in multi-transport support
	Linux 5.10.13

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I75f419b25f24da559e446d62f75ce6bb9b0a5396
2021-02-05 10:38:34 +01:00
Ming Lei
20786fdd2f blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue
commit 2569063c7140c65a0d0ad075e95ddfbcda9ba3c0 upstream.

In case of blk_mq_is_sbitmap_shared(), we should test QUEUE_FLAG_HCTX_ACTIVE against
q->queue_flags instead of BLK_MQ_S_TAG_ACTIVE.

So fix it.

Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Fixes: f1b49fdc1c ("blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-03 23:28:44 +01:00
Asutosh Das
e48a262853 FROMLIST: block: bsg: resume platform device before accessing
It may happen that the underlying device's runtime-pm is
not controlled by block-pm. So it's possible that when
commands are sent to the device, it's suspended and may not
be resumed by blk-pm. Hence explicitly resume the parent
which is the platform device.

Bug: 178653131
Link: https://lore.kernel.org/linux-arm-msm/b1db5394aa3f6cf44cd9adb9c8d569caa0c9e4f5.1611803264.git.asutoshd@codeaurora.org/T/#u
Change-Id: Iead4102ea609d339549694164771e47debb58a00
Signed-off-by: Asutosh Das <asutoshd@codeaurora.org>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bao D. Nguyen <nguyenb@codeaurora.org>
2021-01-28 20:23:35 +00:00
Greg Kroah-Hartman
88e2d5fd10 This is the 5.10.9 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmAHFpcACgkQONu9yGCS
 aT4Vhw/+JLscHnfK//hbS6Nx95MY95VMzy+p2ccADXRy3O/5nr0HwGKnXTKB4Bg+
 05S3Hv9ZU/XSszLWvgFQ0Z0peU241ASPz1uLTgtpziBT5plXa5eJULBZ+WknWMef
 dNKpvKPphpEbQ0yz6o/4sbNAdiI9BzyGCOicQ2dl9nY7R/JA9YHquUD7iHMnvbs+
 yxwwawNHVwszUT/fJT3iFzOAehHGAttHdf3z/bGPS1ogy2S7J5IluJgTAibd3P7G
 5o7OUUA5ujEtjBLIkA61fqeL2Qaci83Ff/8KEPEfF1JeLBbMHYcLHnz3RAwBaLZh
 nlM4smyTeekcnHIzyRGw16OmpoYwY3MQAt+UFLCzKhlnscB0UqCNkA9zQA9k/taw
 cy7/fe5hWFU9DRv4uTUT2H1tkP+pNQ5eIaejPHMtld5JlYXoDN4RyQq7sAyMQgBj
 CXADStYSR/f5sWWgRbRs1F7E0lrePsVpjOcqHXxbsS+52yN2CZSKazlOIJ9xArfM
 cTzzLUuYbMZoHjIDdMMkjA41VMmyJ+BKrqEgzu3LsJQs57o/ckjnQx4VV5YiHhci
 v35OL8oa9IZi8WQikB9bx2WZRWUChOGKwMNeeUwEFD4Zmye1OtyyHuzYQf9QSjRv
 zbf1owwsg3xnfkvLcfru8mNMgJkgG8RpuNNVPO8boWZ4pgPu2tk=
 =5K55
 -----END PGP SIGNATURE-----

Merge 5.10.9 into android12-5.10

Changes in 5.10.9
	btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
	btrfs: prevent NULL pointer dereference in extent_io_tree_panic
	ALSA: hda/realtek: fix right sounds and mute/micmute LEDs for HP machines
	ALSA: doc: Fix reference to mixart.rst
	ASoC: AMD Renoir - add DMI entry for Lenovo ThinkPad X395
	ASoC: dapm: remove widget from dirty list on free
	x86/hyperv: check cpu mask after interrupt has been disabled
	drm/amdgpu: add green_sardine device id (v2)
	drm/amdgpu: fix DRM_INFO flood if display core is not supported (bug 210921)
	Revert "drm/amd/display: Fixed Intermittent blue screen on OLED panel"
	drm/amdgpu: add new device id for Renior
	drm/i915: Allow the sysadmin to override security mitigations
	drm/i915/gt: Limit VFE threads based on GT
	drm/i915/backlight: fix CPU mode backlight takeover on LPT
	drm/bridge: sii902x: Refactor init code into separate function
	dt-bindings: display: sii902x: Add supply bindings
	drm/bridge: sii902x: Enable I/O and core VCC supplies if present
	tracing/kprobes: Do the notrace functions check without kprobes on ftrace
	tools/bootconfig: Add tracing_on support to helper scripts
	ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR
	ext4: fix wrong list_splice in ext4_fc_cleanup
	ext4: fix bug for rename with RENAME_WHITEOUT
	cifs: check pointer before freeing
	cifs: fix interrupted close commands
	riscv: Drop a duplicated PAGE_KERNEL_EXEC
	riscv: return -ENOSYS for syscall -1
	riscv: Fixup CONFIG_GENERIC_TIME_VSYSCALL
	riscv: Fix KASAN memory mapping.
	mips: fix Section mismatch in reference
	mips: lib: uncached: fix non-standard usage of variable 'sp'
	MIPS: boot: Fix unaligned access with CONFIG_MIPS_RAW_APPENDED_DTB
	MIPS: Fix malformed NT_FILE and NT_SIGINFO in 32bit coredumps
	MIPS: relocatable: fix possible boot hangup with KASLR enabled
	RDMA/ocrdma: Fix use after free in ocrdma_dealloc_ucontext_pd()
	ACPI: scan: Harden acpi_device_add() against device ID overflows
	xen/privcmd: allow fetching resource sizes
	compiler.h: Raise minimum version of GCC to 5.1 for arm64
	mm/vmalloc.c: fix potential memory leak
	mm/hugetlb: fix potential missing huge page size info
	mm/process_vm_access.c: include compat.h
	dm raid: fix discard limits for raid1
	dm snapshot: flush merged data before committing metadata
	dm integrity: fix flush with external metadata device
	dm integrity: fix the maximum number of arguments
	dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
	dm crypt: do not wait for backlogged crypto request completion in softirq
	dm crypt: do not call bio_endio() from the dm-crypt tasklet
	dm crypt: defer decryption to a tasklet if interrupts disabled
	stmmac: intel: change all EHL/TGL to auto detect phy addr
	r8152: Add Lenovo Powered USB-C Travel Hub
	btrfs: tree-checker: check if chunk item end overflows
	ext4: don't leak old mountpoint samples
	io_uring: don't take files/mm for a dead task
	io_uring: drop mm and files after task_work_run
	ARC: build: remove non-existing bootpImage from KBUILD_IMAGE
	ARC: build: add uImage.lzma to the top-level target
	ARC: build: add boot_targets to PHONY
	ARC: build: move symlink creation to arch/arc/Makefile to avoid race
	ARM: omap2: pmic-cpcap: fix maximum voltage to be consistent with defaults on xt875
	ath11k: fix crash caused by NULL rx_channel
	netfilter: ipset: fixes possible oops in mtype_resize
	ath11k: qmi: try to allocate a big block of DMA memory first
	btrfs: fix async discard stall
	btrfs: merge critical sections of discard lock in workfn
	btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
	regulator: bd718x7: Add enable times
	ethernet: ucc_geth: fix definition and size of ucc_geth_tx_global_pram
	ARM: dts: ux500/golden: Set display max brightness
	habanalabs: adjust pci controller init to new firmware
	habanalabs/gaudi: retry loading TPC f/w on -EINTR
	habanalabs: register to pci shutdown callback
	staging: spmi: hisi-spmi-controller: Fix some error handling paths
	spi: altera: fix return value for altera_spi_txrx()
	habanalabs: Fix memleak in hl_device_reset
	hwmon: (pwm-fan) Ensure that calculation doesn't discard big period values
	lib/raid6: Let $(UNROLL) rules work with macOS userland
	kconfig: remove 'kvmconfig' and 'xenconfig' shorthands
	spi: fix the divide by 0 error when calculating xfer waiting time
	io_uring: drop file refs after task cancel
	bfq: Fix computation of shallow depth
	arch/arc: add copy_user_page() to <asm/page.h> to fix build error on ARC
	misdn: dsp: select CONFIG_BITREVERSE
	net: ethernet: fs_enet: Add missing MODULE_LICENSE
	selftests: fix the return value for UDP GRO test
	nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN
	nvme: avoid possible double fetch in handling CQE
	nvmet-rdma: Fix list_del corruption on queue establishment failure
	drm/amd/display: fix sysfs amdgpu_current_backlight_pwm NULL pointer issue
	drm/amdgpu: fix a GPU hang issue when remove device
	drm/amd/pm: fix the failure when change power profile for renoir
	drm/amdgpu: fix potential memory leak during navi12 deinitialization
	usb: typec: Fix copy paste error for NVIDIA alt-mode description
	iommu/vt-d: Fix lockdep splat in sva bind()/unbind()
	ACPI: scan: add stub acpi_create_platform_device() for !CONFIG_ACPI
	drm/msm: Call msm_init_vram before binding the gpu
	ARM: picoxcell: fix missing interrupt-parent properties
	poll: fix performance regression due to out-of-line __put_user()
	rcu-tasks: Move RCU-tasks initialization to before early_initcall()
	bpf: Simplify task_file_seq_get_next()
	bpf: Save correct stopping point in file seq iteration
	x86/sev-es: Fix SEV-ES OUT/IN immediate opcode vc handling
	cfg80211: select CONFIG_CRC32
	nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context
	iommu/vt-d: Update domain geometry in iommu_ops.at(de)tach_dev
	net/mlx5e: CT: Use per flow counter when CT flow accounting is enabled
	net/mlx5: Fix passing zero to 'PTR_ERR'
	net/mlx5: E-Switch, fix changing vf VLANID
	blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED
	mm: fix clear_refs_write locking
	mm: don't play games with pinned pages in clear_page_refs
	mm: don't put pinned pages into the swap cache
	perf intel-pt: Fix 'CPU too large' error
	dump_common_audit_data(): fix racy accesses to ->d_name
	ASoC: meson: axg-tdm-interface: fix loopback
	ASoC: meson: axg-tdmin: fix axg skew offset
	ASoC: Intel: fix error code cnl_set_dsp_D0()
	nvmet-rdma: Fix NULL deref when setting pi_enable and traddr INADDR_ANY
	nvme: don't intialize hwmon for discovery controllers
	nvme-tcp: fix possible data corruption with bio merges
	nvme-tcp: Fix warning with CONFIG_DEBUG_PREEMPT
	NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
	pNFS: We want return-on-close to complete when evicting the inode
	pNFS: Mark layout for return if return-on-close was not sent
	pNFS: Stricter ordering of layoutget and layoutreturn
	NFS: Adjust fs_context error logging
	NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
	NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
	NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
	NFS: nfs_delegation_find_inode_server must first reference the superblock
	NFS: nfs_igrab_and_active must first reference the superblock
	scsi: ufs: Fix possible power drain during system suspend
	ext4: fix superblock checksum failure when setting password salt
	RDMA/restrack: Don't treat as an error allocation ID wrapping
	RDMA/usnic: Fix memleak in find_free_vf_and_create_qp_grp
	bnxt_en: Improve stats context resource accounting with RDMA driver loaded.
	RDMA/mlx5: Fix wrong free of blue flame register on error
	IB/mlx5: Fix error unwinding when set_has_smi_cap fails
	umount(2): move the flag validity checks first
	dm zoned: select CONFIG_CRC32
	drm/i915/dsi: Use unconditional msleep for the panel_on_delay when there is no reset-deassert MIPI-sequence
	drm/i915/icl: Fix initing the DSI DSC power refcount during HW readout
	drm/i915/gt: Restore clear-residual mitigations for Ivybridge, Baytrail
	mm, slub: consider rest of partial list if acquire_slab() fails
	riscv: Trace irq on only interrupt is enabled
	iommu/vt-d: Fix unaligned addresses for intel_flush_svm_range_dev()
	net: sunrpc: interpret the return value of kstrtou32 correctly
	selftests: netfilter: Pass family parameter "-f" to conntrack tool
	dm: eliminate potential source of excessive kernel log noise
	ALSA: fireface: Fix integer overflow in transmit_midi_msg()
	ALSA: firewire-tascam: Fix integer overflow in midi_port_work()
	netfilter: conntrack: fix reading nf_conntrack_buckets
	netfilter: nf_nat: Fix memleak in nf_nat_init
	Linux 5.10.9

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I609e501511889081e03d2d18ee7e1be95406f396
2021-01-19 18:49:54 +01:00
John Garry
847c76518c blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED
[ Upstream commit 02f938e9fed1681791605ca8b96c2d9da9355f6a ]

Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives
something like:

root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3

Add the decoding for that flag.

Fixes: 32bc15afed ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19 18:27:29 +01:00
Jan Kara
7fdaca86fc bfq: Fix computation of shallow depth
[ Upstream commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a ]

BFQ computes number of tags it allows to be allocated for each request type
based on tag bitmap. However it uses 1 << bitmap.shift as number of
available tags which is wrong. 'shift' is just an internal bitmap value
containing logarithm of how many bits bitmap uses in each bitmap word.
Thus number of tags allowed for some request types can be far to low.
Use proper bitmap.depth which has the number of tags instead.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19 18:27:26 +01:00
Greg Kroah-Hartman
f11e1751ad Merge 5.10.8 into android12-5.10
Changes in 5.10.8
	powerpc/32s: Fix RTAS machine check with VMAP stack
	io_uring: synchronise IOPOLL on task_submit fail
	io_uring: limit {io|sq}poll submit locking scope
	io_uring: patch up IOPOLL overflow_flush sync
	RDMA/hns: Avoid filling sl in high 3 bits of vlan_id
	iommu/arm-smmu-qcom: Initialize SCTLR of the bypass context
	drm/panfrost: Don't corrupt the queue mutex on open/close
	io_uring: Fix return value from alloc_fixed_file_ref_node
	scsi: ufs: Fix -Wsometimes-uninitialized warning
	btrfs: skip unnecessary searches for xattrs when logging an inode
	btrfs: fix deadlock when cloning inline extent and low on free metadata space
	btrfs: shrink delalloc pages instead of full inodes
	net: cdc_ncm: correct overhead in delayed_ndp_size
	net: hns3: fix incorrect handling of sctp6 rss tuple
	net: hns3: fix the number of queues actually used by ARQ
	net: hns3: fix a phy loopback fail issue
	net: stmmac: dwmac-sun8i: Fix probe error handling
	net: stmmac: dwmac-sun8i: Balance internal PHY resource references
	net: stmmac: dwmac-sun8i: Balance internal PHY power
	net: stmmac: dwmac-sun8i: Balance syscon (de)initialization
	net: vlan: avoid leaks on register_vlan_dev() failures
	net/sonic: Fix some resource leaks in error handling paths
	net: bareudp: add missing error handling for bareudp_link_config()
	ptp: ptp_ines: prevent build when HAS_IOMEM is not set
	net: ipv6: fib: flush exceptions when purging route
	tools: selftests: add test for changing routes with PTMU exceptions
	net: fix pmtu check in nopmtudisc mode
	net: ip: always refragment ip defragmented packets
	chtls: Fix hardware tid leak
	chtls: Remove invalid set_tcb call
	chtls: Fix panic when route to peer not configured
	chtls: Avoid unnecessary freeing of oreq pointer
	chtls: Replace skb_dequeue with skb_peek
	chtls: Added a check to avoid NULL pointer dereference
	chtls: Fix chtls resources release sequence
	octeontx2-af: fix memory leak of lmac and lmac->name
	nexthop: Fix off-by-one error in error path
	nexthop: Unlink nexthop group entry in error path
	nexthop: Bounce NHA_GATEWAY in FDB nexthop groups
	s390/qeth: fix deadlock during recovery
	s390/qeth: fix locking for discipline setup / removal
	s390/qeth: fix L2 header access in qeth_l3_osa_features_check()
	net: dsa: lantiq_gswip: Exclude RMII from modes that report 1 GbE
	net/mlx5: Use port_num 1 instead of 0 when delete a RoCE address
	net/mlx5e: ethtool, Fix restriction of autoneg with 56G
	net/mlx5e: In skb build skip setting mark in switchdev mode
	net/mlx5: Check if lag is supported before creating one
	scsi: lpfc: Fix variable 'vport' set but not used in lpfc_sli4_abts_err_handler()
	ionic: start queues before announcing link up
	HID: wacom: Fix memory leakage caused by kfifo_alloc
	fanotify: Fix sys_fanotify_mark() on native x86-32
	ARM: OMAP2+: omap_device: fix idling of devices during probe
	i2c: sprd: use a specific timeout to avoid system hang up issue
	dmaengine: dw-edma: Fix use after free in dw_edma_alloc_chunk()
	selftests/bpf: Clarify build error if no vmlinux
	can: tcan4x5x: fix bittiming const, use common bittiming from m_can driver
	can: m_can: m_can_class_unregister(): remove erroneous m_can_clk_stop()
	can: kvaser_pciefd: select CONFIG_CRC32
	spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending
	cpufreq: powernow-k8: pass policy rather than use cpufreq_cpu_get()
	spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
	spi: stm32: FIFO threshold level - fix align packet size
	i2c: i801: Fix the i2c-mux gpiod_lookup_table not being properly terminated
	i2c: mediatek: Fix apdma and i2c hand-shake timeout
	bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET
	interconnect: imx: Add a missing of_node_put after of_device_is_available
	interconnect: qcom: fix rpmh link failures
	dmaengine: mediatek: mtk-hsdma: Fix a resource leak in the error handling path of the probe function
	dmaengine: milbeaut-xdmac: Fix a resource leak in the error handling path of the probe function
	dmaengine: xilinx_dma: check dma_async_device_register return value
	dmaengine: xilinx_dma: fix incompatible param warning in _child_probe()
	dmaengine: xilinx_dma: fix mixed_enum_type coverity warning
	arm64: mm: Fix ARCH_LOW_ADDRESS_LIMIT when !CONFIG_ZONE_DMA
	qed: select CONFIG_CRC32
	phy: dp83640: select CONFIG_CRC32
	wil6210: select CONFIG_CRC32
	block: rsxx: select CONFIG_CRC32
	lightnvm: select CONFIG_CRC32
	zonefs: select CONFIG_CRC32
	iommu/vt-d: Fix misuse of ALIGN in qi_flush_piotlb()
	iommu/intel: Fix memleak in intel_irq_remapping_alloc
	bpftool: Fix compilation failure for net.o with older glibc
	nvme-tcp: Fix possible race of io_work and direct send
	net/mlx5e: Fix memleak in mlx5e_create_l2_table_groups
	net/mlx5e: Fix two double free cases
	regmap: debugfs: Fix a memory leak when calling regmap_attach_dev
	wan: ds26522: select CONFIG_BITREVERSE
	arm64: cpufeature: remove non-exist CONFIG_KVM_ARM_HOST
	regulator: qcom-rpmh-regulator: correct hfsmps515 definition
	net: mvpp2: disable force link UP during port init procedure
	drm/i915/dp: Track pm_qos per connector
	net: mvneta: fix error message when MTU too large for XDP
	selftests: fib_nexthops: Fix wrong mausezahn invocation
	KVM: arm64: Don't access PMCR_EL0 when no PMU is available
	xsk: Fix race in SKB mode transmit with shared cq
	xsk: Rollback reservation at NETDEV_TX_BUSY
	block/rnbd-clt: avoid module unload race with close confirmation
	can: isotp: isotp_getname(): fix kernel information leak
	block: fix use-after-free in disk_part_iter_next
	net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
	regmap: debugfs: Fix a reversed if statement in regmap_debugfs_init()
	drm/panfrost: Remove unused variables in panfrost_job_close()
	tools headers UAPI: Sync linux/fscrypt.h with the kernel sources
	Linux 5.10.8

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib8272ec9f47a3c3813509bcacece3b16137332e1
2021-01-19 09:33:21 +01:00
Ming Lei
481097d661 block: fix use-after-free in disk_part_iter_next
commit aebf5db917055b38f4945ed6d621d9f07a44ff30 upstream.

Make sure that bdgrab() is done on the 'block_device' instance before
referring to it for avoiding use-after-free.

Cc: <stable@vger.kernel.org>
Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:17:05 +01:00
Greg Kroah-Hartman
7eadb0006a Merge 5.10.7 into android12-5.10
Changes in 5.10.7
	i40e: Fix Error I40E_AQ_RC_EINVAL when removing VFs
	iavf: fix double-release of rtnl_lock
	net/sched: sch_taprio: ensure to reset/destroy all child qdiscs
	net: mvpp2: Add TCAM entry to drop flow control pause frames
	net: mvpp2: prs: fix PPPoE with ipv6 packet parse
	net: systemport: set dev->max_mtu to UMAC_MAX_MTU_SIZE
	ethernet: ucc_geth: fix use-after-free in ucc_geth_remove()
	ethernet: ucc_geth: set dev->max_mtu to 1518
	ionic: account for vlan tag len in rx buffer len
	atm: idt77252: call pci_disable_device() on error path
	net: mvpp2: Fix GoP port 3 Networking Complex Control configurations
	net: stmmac: dwmac-meson8b: ignore the second clock input
	ibmvnic: fix login buffer memory leak
	ibmvnic: continue fatal error reset after passive init
	net: ethernet: mvneta: Fix error handling in mvneta_probe
	qede: fix offload for IPIP tunnel packets
	virtio_net: Fix recursive call to cpus_read_lock()
	net/ncsi: Use real net-device for response handler
	net: ethernet: Fix memleak in ethoc_probe
	net-sysfs: take the rtnl lock when storing xps_cpus
	net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc
	net-sysfs: take the rtnl lock when storing xps_rxqs
	net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc
	net: ethernet: ti: cpts: fix ethtool output when no ptp_clock registered
	tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
	e1000e: Only run S0ix flows if shutdown succeeded
	e1000e: bump up timeout to wait when ME un-configures ULP mode
	Revert "e1000e: disable s0ix entry and exit flows for ME systems"
	e1000e: Export S0ix flags to ethtool
	bnxt_en: Check TQM rings for maximum supported value.
	net: mvpp2: fix pkt coalescing int-threshold configuration
	bnxt_en: Fix AER recovery.
	ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
	net: sched: prevent invalid Scell_log shift count
	net: hns: fix return value check in __lb_other_process()
	erspan: fix version 1 check in gre_parse_header()
	net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
	bareudp: set NETIF_F_LLTX flag
	bareudp: Fix use of incorrect min_headroom size
	vhost_net: fix ubuf refcount incorrectly when sendmsg fails
	r8169: work around power-saving bug on some chip versions
	net: dsa: lantiq_gswip: Enable GSWIP_MII_CFG_EN also for internal PHYs
	net: dsa: lantiq_gswip: Fix GSWIP_MII_CFG(p) register access
	CDC-NCM: remove "connected" log message
	ibmvnic: fix: NULL pointer dereference.
	net: usb: qmi_wwan: add Quectel EM160R-GL
	selftests: mlxsw: Set headroom size of correct port
	stmmac: intel: Add PCI IDs for TGL-H platform
	selftests/vm: fix building protection keys test
	block: add debugfs stanza for QUEUE_FLAG_NOWAIT
	workqueue: Kick a worker based on the actual activation of delayed works
	scsi: ufs: Fix wrong print message in dev_err()
	scsi: ufs-pci: Fix restore from S4 for Intel controllers
	scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
	scsi: ufs-pci: Fix recovery from hibernate exit errors for Intel controllers
	scsi: ufs-pci: Enable UFSHCD_CAP_RPM_AUTOSUSPEND for Intel controllers
	scsi: block: Introduce BLK_MQ_REQ_PM
	scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
	scsi: ide: Mark power management requests with RQF_PM instead of RQF_PREEMPT
	scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
	scsi: core: Only process PM requests if rpm_status != RPM_ACTIVE
	local64.h: make <asm/local64.h> mandatory
	lib/genalloc: fix the overflow when size is too big
	depmod: handle the case of /sbin/depmod without /sbin in PATH
	scsi: ufs: Clear UAC for FFU and RPMB LUNs
	kbuild: don't hardcode depmod path
	Bluetooth: revert: hci_h5: close serdev device and free hu in h5_close
	scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT
	scsi: block: Do not accept any requests while suspended
	crypto: ecdh - avoid buffer overflow in ecdh_set_secret()
	crypto: asym_tpm: correct zero out potential secrets
	powerpc: Handle .text.{hot,unlikely}.* in linker script
	Staging: comedi: Return -EFAULT if copy_to_user() fails
	staging: mt7621-dma: Fix a resource leak in an error handling path
	usb: gadget: enable super speed plus
	USB: cdc-acm: blacklist another IR Droid device
	USB: cdc-wdm: Fix use after free in service_outstanding_interrupt().
	usb: typec: intel_pmc_mux: Configure HPD first for HPD+IRQ request
	usb: dwc3: meson-g12a: disable clk on error handling path in probe
	usb: dwc3: gadget: Restart DWC3 gadget when enabling pullup
	usb: dwc3: gadget: Clear wait flag on dequeue
	usb: dwc3: ulpi: Use VStsDone to detect PHY regs access completion
	usb: dwc3: ulpi: Replace CPU-based busyloop with Protocol-based one
	usb: dwc3: ulpi: Fix USB2.0 HS/FS/LS PHY suspend regression
	usb: chipidea: ci_hdrc_imx: add missing put_device() call in usbmisc_get_init_data()
	USB: xhci: fix U1/U2 handling for hardware with XHCI_INTEL_HOST quirk set
	usb: usbip: vhci_hcd: protect shift size
	usb: uas: Add PNY USB Portable SSD to unusual_uas
	USB: serial: iuu_phoenix: fix DMA from stack
	USB: serial: option: add LongSung M5710 module support
	USB: serial: option: add Quectel EM160R-GL
	USB: yurex: fix control-URB timeout handling
	USB: usblp: fix DMA to stack
	ALSA: usb-audio: Fix UBSAN warnings for MIDI jacks
	usb: gadget: select CONFIG_CRC32
	USB: Gadget: dummy-hcd: Fix shift-out-of-bounds bug
	usb: gadget: f_uac2: reset wMaxPacketSize
	usb: gadget: function: printer: Fix a memory leak for interface descriptor
	usb: gadget: u_ether: Fix MTU size mismatch with RX packet size
	USB: gadget: legacy: fix return error code in acm_ms_bind()
	usb: gadget: Fix spinlock lockup on usb_function_deactivate
	usb: gadget: configfs: Preserve function ordering after bind failure
	usb: gadget: configfs: Fix use-after-free issue with udc_name
	USB: serial: keyspan_pda: remove unused variable
	hwmon: (amd_energy) fix allocation of hwmon_channel_info config
	mm: make wait_on_page_writeback() wait for multiple pending writebacks
	x86/mm: Fix leak of pmd ptlock
	KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte()
	KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE
	kvm: check tlbs_dirty directly
	KVM: x86/mmu: Ensure TDP MMU roots are freed after yield
	x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR
	x86/resctrl: Don't move a task to the same resource group
	blk-iocost: fix NULL iocg deref from racing against initialization
	ALSA: hda/via: Fix runtime PM for Clevo W35xSS
	ALSA: hda/conexant: add a new hda codec CX11970
	ALSA: hda/realtek - Fix speaker volume control on Lenovo C940
	ALSA: hda/realtek: Add mute LED quirk for more HP laptops
	ALSA: hda/realtek: Enable mute and micmute LED on HP EliteBook 850 G7
	ALSA: hda/realtek: Add two "Intel Reference board" SSID in the ALC256.
	iommu/vt-d: Move intel_iommu info from struct intel_svm to struct intel_svm_dev
	btrfs: qgroup: don't try to wait flushing if we're already holding a transaction
	btrfs: send: fix wrong file path when there is an inode with a pending rmdir
	Revert "device property: Keep secondary firmware node secondary by type"
	dmabuf: fix use-after-free of dmabuf's file->f_inode
	arm64: link with -z norelro for LLD or aarch64-elf
	drm/i915: clear the shadow batch
	drm/i915: clear the gpu reloc batch
	bcache: fix typo from SUUP to SUPP in features.h
	bcache: check unsupported feature sets for bcache register
	bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket
	net/mlx5e: Fix SWP offsets when vlan inserted by driver
	ARM: dts: OMAP3: disable AES on N950/N9
	netfilter: x_tables: Update remaining dereference to RCU
	netfilter: ipset: fix shift-out-of-bounds in htable_bits()
	netfilter: xt_RATEEST: reject non-null terminated string from userspace
	netfilter: nft_dynset: report EOPNOTSUPP on missing set feature
	dmaengine: idxd: off by one in cleanup code
	x86/mtrr: Correct the range check before performing MTRR type lookups
	KVM: x86: fix shift out of bounds reported by UBSAN
	xsk: Fix memory leak for failed bind
	rtlwifi: rise completion at the last step of firmware callback
	scsi: target: Fix XCOPY NAA identifier lookup
	Linux 5.10.7

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I1a7c195af35831fe362b027fe013c0c7e4dc20ea
2021-01-13 10:29:42 +01:00
Tejun Heo
cafc6e70a6 blk-iocost: fix NULL iocg deref from racing against initialization
commit d16baa3f1453c14d680c5fee01cd122a22d0e0ce upstream.

When initializing iocost for a queue, its rqos should be registered before
the blkcg policy is activated to allow policy data initiailization to lookup
the associated ioc. This unfortunately means that the rqos methods can be
called on bios before iocgs are attached to all existing blkgs.

While the race is theoretically possible on ioc_rqos_throttle(), it mostly
happened in ioc_rqos_merge() due to the difference in how they lookup ioc.
The former determines it from the passed in @rqos and then bails before
dereferencing iocg if the looked up ioc is disabled, which most likely is
the case if initialization is still in progress. The latter looked up ioc by
dereferencing the possibly NULL iocg making it a lot more prone to actually
triggering the bug.

* Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look
  up ioc for consistency.

* Make ioc_rqos_throttle() and ioc_rqos_merge() test for NULL iocg before
  dereferencing it.

* Explain the danger of NULL iocgs in blk_iocost_init().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jonathan Lemon <bsd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:23 +01:00
Alan Stern
d55d15a332 scsi: block: Do not accept any requests while suspended
[ Upstream commit 52abca64fd9410ea6c9a3a74eab25663b403d7da ]

blk_queue_enter() accepts BLK_MQ_REQ_PM requests independent of the runtime
power management state. Now that SCSI domain validation no longer depends
on this behavior, modify the behavior of blk_queue_enter() as follows:

   - Do not accept any requests while suspended.

   - Only process power management requests while suspending or resuming.

Submitting BLK_MQ_REQ_PM requests to a device that is runtime suspended
causes runtime-suspended devices not to resume as they should. The request
which should cause a runtime resume instead gets issued directly, without
resuming the device first. Of course the device can't handle it properly,
the I/O fails, and the device remains suspended.

The problem is fixed by checking that the queue's runtime-PM status isn't
RPM_SUSPENDED before allowing a request to be issued, and queuing a
runtime-resume request if it is.  In particular, the inline
blk_pm_request_resume() routine is renamed blk_pm_resume_queue() and the
code is unified by merging the surrounding checks into the routine.  If the
queue isn't set up for runtime PM, or there currently is no restriction on
allowed requests, the request is allowed.  Likewise if the BLK_MQ_REQ_PM
flag is set and the status isn't RPM_SUSPENDED.  Otherwise a runtime resume
is queued and the request is blocked until conditions are more suitable.

[ bvanassche: modified commit message and removed Cc: stable because
  without the previous patches from this series this patch would break
  parallel SCSI domain validation + introduced queue_rpm_status() ]

Link: https://lore.kernel.org/r/20201209052951.16136-9-bvanassche@acm.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-and-tested-by: Martin Kepplinger <martin.kepplinger@puri.sm>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-12 20:18:17 +01:00
Bart Van Assche
782c9ef2ac scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT
[ Upstream commit a4d34da715e3cb7e0741fe603dcd511bed067e00 ]

Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no longer
used by any kernel code.

Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Martin Kepplinger <martin.kepplinger@puri.sm>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-12 20:18:17 +01:00
Bart Van Assche
8ed46b329d scsi: block: Introduce BLK_MQ_REQ_PM
[ Upstream commit 0854bcdcdec26aecdc92c303816f349ee1fba2bc ]

Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation
functions set RQF_PM. This is the first step towards removing
BLK_MQ_REQ_PREEMPT.

Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Can Guo <cang@codeaurora.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-12 20:18:15 +01:00
Andres Freund
bfb39e6d67 block: add debugfs stanza for QUEUE_FLAG_NOWAIT
[ Upstream commit dc30432605bbbd486dfede3852ea4d42c40a84b4 ]

This was missed in 021a24460d. Leads to the numeric value of
QUEUE_FLAG_NOWAIT (i.e. 29) showing up in
/sys/kernel/debug/block/*/state.

Fixes: 021a24460d
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-12 20:18:14 +01:00
Greg Kroah-Hartman
9cf2ceaffd Merge 5.10.5 into android12-5.10
Changes in 5.10.5
	net/sched: sch_taprio: reset child qdiscs before freeing them
	mptcp: fix security context on server socket
	ethtool: fix error paths in ethnl_set_channels()
	ethtool: fix string set id check
	md/raid10: initialize r10_bio->read_slot before use.
	drm/amd/display: Add get_dig_frontend implementation for DCEx
	io_uring: close a small race gap for files cancel
	jffs2: Allow setting rp_size to zero during remounting
	jffs2: Fix NULL pointer dereference in rp_size fs option parsing
	spi: dw-bt1: Fix undefined devm_mux_control_get symbol
	opp: fix memory leak in _allocate_opp_table
	opp: Call the missing clk_put() on error
	scsi: block: Fix a race in the runtime power management code
	mm/hugetlb: fix deadlock in hugetlb_cow error path
	mm: memmap defer init doesn't work as expected
	lib/zlib: fix inflating zlib streams on s390
	io_uring: don't assume mm is constant across submits
	io_uring: use bottom half safe lock for fixed file data
	io_uring: add a helper for setting a ref node
	io_uring: fix io_sqe_files_unregister() hangs
	uapi: move constants from <linux/kernel.h> to <linux/const.h>
	tools headers UAPI: Sync linux/const.h with the kernel headers
	cgroup: Fix memory leak when parsing multiple source parameters
	zlib: move EXPORT_SYMBOL() and MODULE_LICENSE() out of dfltcc_syms.c
	scsi: cxgb4i: Fix TLS dependency
	Bluetooth: hci_h5: close serdev device and free hu in h5_close
	fbcon: Disable accelerated scrolling
	reiserfs: add check for an invalid ih_entry_count
	misc: vmw_vmci: fix kernel info-leak by initializing dbells in vmci_ctx_get_chkpt_doorbells()
	media: gp8psk: initialize stats at power control logic
	f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
	ALSA: seq: Use bool for snd_seq_queue internal flags
	ALSA: rawmidi: Access runtime->avail always in spinlock
	bfs: don't use WARNING: string when it's just info.
	ext4: check for invalid block size early when mounting a file system
	fcntl: Fix potential deadlock in send_sig{io, urg}()
	io_uring: check kthread stopped flag when sq thread is unparked
	rtc: sun6i: Fix memleak in sun6i_rtc_clk_init
	module: set MODULE_STATE_GOING state when a module fails to load
	quota: Don't overflow quota file offsets
	rtc: pl031: fix resource leak in pl031_probe
	powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe()
	i3c master: fix missing destroy_workqueue() on error in i3c_master_register
	NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode
	f2fs: avoid race condition for shrinker count
	f2fs: fix race of pending_pages in decompression
	module: delay kobject uevent until after module init call
	powerpc/64: irq replay remove decrementer overflow check
	fs/namespace.c: WARN if mnt_count has become negative
	watchdog: rti-wdt: fix reference leak in rti_wdt_probe
	um: random: Register random as hwrng-core device
	um: ubd: Submit all data segments atomically
	NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
	ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode fails
	drm/amd/display: updated wm table for Renoir
	tick/sched: Remove bogus boot "safety" check
	s390: always clear kernel stack backchain before calling functions
	io_uring: remove racy overflow list fast checks
	ALSA: pcm: Clear the full allocated memory at hw_params
	dm verity: skip verity work if I/O error when system is shutting down
	ext4: avoid s_mb_prefetch to be zero in individual scenarios
	device-dax: Fix range release
	Linux 5.10.5

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2b481bfac06bafdef2cf3cc1ac2c2a4ddf9913dc
2021-01-10 12:19:03 +01:00
Bart Van Assche
092898b070 scsi: block: Fix a race in the runtime power management code
commit fa4d0f1992a96f6d7c988ef423e3127e613f6ac9 upstream.

With the current implementation the following race can happen:

 * blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
   blk_mq_unfreeze_queue().

 * blk_queue_enter() calls blk_queue_pm_only() and that function returns
   true.

 * blk_queue_enter() calls blk_pm_request_resume() and that function does
   not call pm_request_resume() because the queue runtime status is
   RPM_ACTIVE.

 * blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.

Fix this race by changing the queue runtime status into RPM_SUSPENDING
before switching q_usage_counter to atomic mode.

Link: https://lore.kernel.org/r/20201209052951.16136-2-bvanassche@acm.org
Fixes: 986d413b7c ("blk-mq: Enable support for runtime power management")
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Stanley Chu <stanley.chu@mediatek.com>
Co-developed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06 14:56:50 +01:00
Greg Kroah-Hartman
ca0c76873c Merge 5.10-rc7 into android-mainline
Linux 5.10-rc7

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie61b3510311a825ee57bee12610e25bc1500b350
2020-12-09 08:09:26 +01:00
Linus Torvalds
be1515bad7 block-5.10-2020-12-05
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/L/n0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptxfD/9Dvasr1LIF9jTL5nU3iwwCAKVneI0VXsvn
 x9xLv+3AlvyhKJSpDxjyYrrsNwo2r2Hay/3Yix079wMyDNcjuUeaDVscQ2Ed6WXI
 slOhefyd4sgZf0kpB7TkwSrU1idoSEhOgcbZSiPLnlb071rYbJm5M6W15j9knUW0
 HfomlryBgTX6b0et+YHvboTQ+31ZWTpZdMk9XEvQhCVynPOwU858rRKnFZUQSJ04
 LGKXcA5xchxD8HVImwGv6QFTtFSzoyT3g68fEXEM9rPoEM8T0utVFPfF7o8L4Oui
 nLxpBThwN18xwcc2iei8Fko2H6RgAmaD4cR4rgZaesa4QavPtT8N+JebMX47Tvhy
 BBUbD/gRMPt1nO9ufPuLDyYBda2Ne5Z1DkBSIuYgZrreiOBdYGaJDbGIJ9uGwi+j
 u4QvooSMoXb/7XGojxaJCjPDlNyhOb+fbZaOJDU49xQR3k01B6vteDnxGBlvbYHn
 xuC82LTpTkWQT65ciaFAsW0L2lE6zlgkWWP3ahzH/qAg0HvODT9LDPNxvUpgZqUT
 zoUHua3jAY2JBudUMLlacD+/ZS/T7krMl2XLSvyKaOIAyAlaluxK/uykQ+Z3sWgF
 XGYQ/yv/4mQ1rTozqRFKndRPN++FCRzIwcHk8iBQCRlh8yLB/gVBr/Q6SgJrdnEj
 WWwc1tRmwA==
 =o4GN
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-12-05' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
 "Single fix for an issue with chunk_sectors and stacked devices"

* tag 'block-5.10-2020-12-05' of git://git.kernel.dk/linux-block:
  block: use gcd() to fix chunk_sectors limit stacking
2020-12-05 14:45:30 -08:00
Linus Torvalds
b3298500b2 - Fix DM's bio splitting changes that were made during v5.9.
Restores splitting in terms of varied per-target ti->max_io_len
   rather than use block core's single stacked 'chunk_sectors' limit.
 
 - Like DM crypt, update DM integrity to not use crypto drivers that
   have CRYPTO_ALG_ALLOCATES_MEMORY set.
 
 - Fix DM writecache target's argument parsing and status display.
 
 - Remove needless BUG() from dm writecache's persistent_memory_claim()
 
 - Remove old gcc workaround in DM cache target's block_div() for ARM
   link errors now that gcc >= 4.9 is required.
 
 - Fix RCU locking in dm_blk_report_zones and dm_dax_zero_page_range.
 
 - Remove old, and now frowned upon, BUG_ON(in_interrupt()) in
   dm_table_event().
 
 - Remove invalid sparse annotations from dm_prepare_ioctl() and
   dm_unprepare_ioctl().
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl/KonYTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWgSJCACNYEndubrROJZL+FOUQixzZQiOfphw
 9Brb/XbdXWXIv7F+JV85E6olOqz7JjTGrO91uD5kwHEtVhDx5zT/GCm+5FoBrLa/
 FuTphPRWNimZSU1umJe2AG9hOiDpPJJUe/wwj3QkBH2TeEHwBHblB8BkRFzxnP+p
 0dGybrQBMtrH3GO65YG7qaASeBPl1+G3mVHfzViyhk1uoZL1y9pKbzPK60TkHcsa
 VCGTPke5Ri3hvd85hmpDcXmyxjxZfCA8Jc/DrQ+DDEwakHoJFwlSzP7fqwHnpKHT
 RDL4iOID54SViSGqzNcxlGtr/EHyN9Mom2d4Nnb0cgsRG4woCeJWJZMM
 =+o1m
 -----END PGP SIGNATURE-----

Merge tag 'for-5.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM's bio splitting changes that were made during v5.9. This
   restores splitting in terms of varied per-target ti->max_io_len
   rather than use block core's single stacked 'chunk_sectors' limit.

 - Like DM crypt, update DM integrity to not use crypto drivers that
   have CRYPTO_ALG_ALLOCATES_MEMORY set.

 - Fix DM writecache target's argument parsing and status display.

 - Remove needless BUG() from dm writecache's persistent_memory_claim()

 - Remove old gcc workaround in DM cache target's block_div() for ARM
   link errors now that gcc >= 4.9 is required.

 - Fix RCU locking in dm_blk_report_zones and dm_dax_zero_page_range.

 - Remove old, and now frowned upon, BUG_ON(in_interrupt()) in
   dm_table_event().

 - Remove invalid sparse annotations from dm_prepare_ioctl() and
   dm_unprepare_ioctl().

* tag 'for-5.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: remove invalid sparse __acquires and __releases annotations
  dm: fix double RCU unlock in dm_dax_zero_page_range() error path
  dm: fix IO splitting
  dm writecache: remove BUG() and fail gracefully instead
  dm table: Remove BUG_ON(in_interrupt())
  dm: fix bug with RCU locking in dm_blk_report_zones
  Revert "dm cache: fix arm link errors with inline"
  dm writecache: fix the maximum number of arguments
  dm writecache: advance the number of arguments when reporting max_age
  dm integrity: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
2020-12-04 13:28:39 -08:00
Mike Snitzer
3ee16db390 dm: fix IO splitting
Commit 882ec4e609 ("dm table: stack 'chunk_sectors' limit to account
for target-specific splitting") caused a couple regressions:
1) Using lcm_not_zero() when stacking chunk_sectors was a bug because
   chunk_sectors must reflect the most limited of all devices in the
   IO stack.
2) DM targets that set max_io_len but that do _not_ provide an
   .iterate_devices method no longer had there IO split properly.

And commit 5091cdec56 ("dm: change max_io_len() to use
blk_max_size_offset()") also caused a regression where DM no longer
supported varied (per target) IO splitting. The implication being the
potential for severely reduced performance for IO stacks that use a DM
target like dm-cache to hide performance limitations of a slower
device (e.g. one that requires 4K IO splitting).

Coming full circle: Fix all these issues by discontinuing stacking
chunk_sectors up using ti->max_io_len in dm_calculate_queue_limits(),
add optional chunk_sectors override argument to blk_max_size_offset()
and update DM's max_io_len() to pass ti->max_io_len to its
blk_max_size_offset() call.

Passing in an optional chunk_sectors override to blk_max_size_offset()
allows for code reuse of block's centralized calculation for max IO
size based on provided offset and split boundary.

Fixes: 882ec4e609 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting")
Fixes: 5091cdec56 ("dm: change max_io_len() to use blk_max_size_offset()")
Cc: stable@vger.kernel.org
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
2020-12-04 14:53:15 -05:00
Mike Snitzer
7e7986f9d3 block: use gcd() to fix chunk_sectors limit stacking
commit 22ada802ed ("block: use lcm_not_zero() when stacking
chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must
reflect the most limited of all devices in the IO stack.

Otherwise malformed IO may result. E.g.: prior to this fix,
->chunk_sectors = lcm_not_zero(8, 128) would result in
blk_max_size_offset() splitting IO at 128 sectors rather than the
required more restrictive 8 sectors.

And since commit 07d098e6bb ("block: allow 'chunk_sectors' to be
non-power-of-2") care must be taken to properly stack chunk_sectors to
be compatible with the possibility that a non-power-of-2 chunk_sectors
may be stacked. This is why gcd() is used instead of reverting back
to using min_not_zero().

Fixes: 22ada802ed ("block: use lcm_not_zero() when stacking chunk_sectors")
Fixes: 07d098e6bb ("block: allow 'chunk_sectors' to be non-power-of-2")
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: John Dorminy <jdorminy@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 11:02:55 -07:00
Greg Kroah-Hartman
b19651bfcc Merge c84e1efae0 ("Merge tag 'asm-generic-fixes-5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic") into android-mainline
Steps on the way to 5.10-rc5

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I644783003a83186a34cdbb753aa492f4350f49ee
2020-11-28 08:23:09 +01:00
Greg Kroah-Hartman
26d2906f96 Merge 27bba9c532 ("Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi") into android-mainline
Steps on the way to 5.10-rc5

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: If85acec5178f1d8317f5ee781ae91ac27df55464
2020-11-22 10:00:49 +01:00
Eric Biggers
47a846536e block/keyslot-manager: prevent crash when num_slots=1
If there is only one keyslot, then blk_ksm_init() computes
slot_hashtable_size=1 and log_slot_ht_size=0.  This causes
blk_ksm_find_keyslot() to crash later because it uses
hash_ptr(key, log_slot_ht_size) to find the hash bucket containing the
key, and hash_ptr() doesn't support the bits == 0 case.

Fix this by making the hash table always have at least 2 buckets.

Tested by running:

    kvm-xfstests -c ext4 -g encrypt -m inlinecrypt \
                 -o blk-crypto-fallback.num_keyslots=1

Fixes: 1b26283970 ("block: Keyslot Manager for Inline Encryption")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-20 11:52:52 -07:00
Christoph Hellwig
b7131ee0ba blk-cgroup: fix a hd_struct leak in blkcg_fill_root_iostats
disk_get_part needs to be paired with a disk_put_part.

Cc: stable@vger.kernel.org
Fixes: ef45fe470e ("blk-cgroup: show global disk stats in root cgroup io.stat")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-14 11:17:34 -07:00
Greg Kroah-Hartman
73936cf2dd Merge f01c30de86 ("Merge tag 'vfs-5.10-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux") into android-mainline
Steps on the way to 5.10-rc4

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iba36d2244c4229d44e3d391ed23dc25c6022f917
2020-11-14 10:29:08 +01:00
Ming Lei
9f16a66733 block: mark flush request as IDLE when it is really finished
For avoiding use-after-free on flush request, we call its .end_io() from
both timeout code path and __blk_mq_end_request().

When flush request's ref doesn't drop to zero, it is still used, we
can't mark it as IDLE, so fix it by marking IDLE when its refcount drops
to zero really.

Fixes: 65ff5cd045 ("blk-mq: mark flush request as IDLE in flush_end_io()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-13 14:24:16 -07:00
Christoph Hellwig
7e890c37c2 block: add a return value to set_capacity_revalidate_and_notify
Return if the function ended up sending an uevent or not.

Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-12 13:59:04 -07:00
Greg Kroah-Hartman
b0748cf3e1 Merge c2dc4c073f ("Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost") into android-mainline
Steps on the way to 5.10-rc2

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I686e55205b113f69b9ea8a22c56d86a572e8c603
2020-11-02 11:59:56 +01:00
Greg Kroah-Hartman
b28ad9eb46 Revert "Revert "iov_iter: transparently handle compat iovecs in import_iovec""
This reverts commit 836219141f.

Bug: 171539436
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia401cbfe62bf5d9e03afab9f9fc6086bd5b34bbf
2020-11-02 09:27:36 +01:00
Ming Lei
65ff5cd045 blk-mq: mark flush request as IDLE in flush_end_io()
Mark flush request as IDLE in its .end_io(), aligning it with how normal
requests behave. The flush request stays in in-flight tags if we're not
using an IO scheduler, so we need to change its state into IDLE.
Otherwise, we will hang in blk_mq_tagset_wait_completed_request() during
error recovery because flush the request state is kept as COMPLETED.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Chao Leng <lengchao@huawei.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-30 08:33:49 -06:00
Greg Kroah-Hartman
338f8e80b4 Merge 0746c4a9f3 ("Merge branch 'i2c/for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux") into android-mainline
Steps on the way to 5.10-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iec426c6de4a59a517e5fa575a9424b883d958f08
2020-10-28 21:20:34 +01:00
Naohiro Aota
4977d121bc block: advance iov_iter on bio_add_hw_page failure
When the bio's size reaches max_append_sectors, bio_add_hw_page returns
0 then __bio_iov_append_get_pages returns -EINVAL. This is an expected
result of building a small enough bio not to be split in the IO path.
However, iov_iter is not advanced in this case, causing the same pages
are filled for the bio again and again.

Fix the case by properly advancing the iov_iter for already processed
pages.

Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org # 5.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-28 07:51:02 -06:00
Gabriel Krisman Bertazi
f255c19b3a blk-cgroup: Pre-allocate tree node on blkg_conf_prep
Similarly to commit 457e490f2b ("blkcg: allocate struct blkcg_gq
outside request queue spinlock"), blkg_create can also trigger
occasional -ENOMEM failures at the radix insertion because any
allocation inside blkg_create has to be non-blocking, making it more
likely to fail.  This causes trouble for userspace tools trying to
configure io weights who need to deal with this condition.

This patch reduces the occurrence of -ENOMEMs on this path by preloading
the radix tree element on a GFP_KERNEL context, such that we guarantee
the later non-blocking insertion won't fail.

A similar solution exists in blkcg_init_queue for the same situation.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-26 07:57:47 -06:00
Gabriel Krisman Bertazi
52abfcbd57 blk-cgroup: Fix memleak on error path
If new_blkg allocation raced with blk_policy change and
blkg_lookup_check fails, new_blkg is leaked.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-26 07:57:46 -06:00
Greg Kroah-Hartman
05d2a661fd Merge 54a4c789ca ("Merge tag 'docs/v5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media") into android-mainline
Steps on the way to 5.10-rc1

Resolves conflicts in:
	fs/userfaultfd.c

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie3fe3c818f1f6565cfd4fa551de72d2b72ef60af
2020-10-26 09:23:33 +01:00
Greg Kroah-Hartman
cf785fec62 Merge 4815519ed0 ("Merge tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm") into android-mainline
Steps on the way to 5.10-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Icc39c31375d6446d72331f8ace728a44199bbc5e
2020-10-25 16:27:51 +01:00
Greg Kroah-Hartman
aedaedc749 Merge 7cd4ecd917 ("Merge tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block") into android-mainline
Steps on the way to 5.10-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie4e2b65874248082be3c0e05b1bb591f9f46128c
2020-10-25 11:56:29 +01:00
Linus Torvalds
d769139081 block-5.10-2020-10-24
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+UQjkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvN9D/9iHU7Vgi8J3SiLrHYUiDtMSI5VnEmSBo6K
 Ej/wbbrk4tm2UYi550krOk0dMaHxWD3XWSTlpnrE0sUfjs69G676yzxrlnPt50f5
 XbcMc0YOHZfffeu9xXykUO8Q2918PTPC08eLaTK1I8lhKAuuTFCT/syGYu+prfd7
 AogyuczaDok8nqJEK9QNr0iaEUbe17GQwmvpWyjHl/qfKhWvV2r6jCZZf6pzQj2c
 zv3kbiT3u6xw9OEuhY0sgpTEfhAHEXbNIln6Ob4qVgxmOjwgiZdU/QXyw1i2s6pc
 ks7e28P43r3VfNYGBfr/hQCeAJT9gOeUG5yBiQr7ooX6uNPL6GOCG7DO/g5y2thQ
 NkV4hub/FjYWbSmRzDlJGj1fWn4L+3r/O8g5nMr+F1L3JYeaW0hOyStqBQ4O74Cj
 04tvWQ8ndXdPQrm/iDhM6KxfCvR5TC6k4fy9XPpRW8JOxauhIwTZQJyEQUnXTH3v
 pwv3IxRmuWGa3mrJZ5kGhsNAEGHdZCL5soLI+BXAD2MUW2IB5v2HpD/z1bvWL/51
 uYiVIt/2LxgLkF7BXP40PnY0qqTsOwGxdd6wQhi5Jn9Et+JkmAAR6cVwXx4AhuQg
 FT5mq7ZTQBZrErQu4Mr1k3UyqBFm4MB+mbJhWrVWnUnnyA6pcr1NUsUTz5JcyrWz
 jWI7T1Si7w==
 =dFJi
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Christoph
     - rdma error handling fixes (Chao Leng)
     - fc error handling and reconnect fixes (James Smart)
     - fix the qid displace when tracing ioctl command (Keith Busch)
     - don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni)
     - fix MTDT for passthru (Logan Gunthorpe)
     - blacklist Write Same on more devices (Kai-Heng Feng)
     - fix an uninitialized work struct (zhenwei pi)"

 - lightnvm out-of-bounds fix (Colin)

 - SG allocation leak fix (Doug)

 - rnbd fixes (Gioh, Guoqing, Jack)

 - zone error translation fixes (Keith)

 - kerneldoc markup fix (Mauro)

 - zram lockdep fix (Peter)

 - Kill unused io_context members (Yufen)

 - NUMA memory allocation cleanup (Xianting)

 - NBD config wakeup fix (Xiubo)

* tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits)
  block: blk-mq: fix a kernel-doc markup
  nvme-fc: shorten reconnect delay if possible for FC
  nvme-fc: wait for queues to freeze before calling update_hr_hw_queues
  nvme-fc: fix error loop in create_hw_io_queues
  nvme-fc: fix io timeout to abort I/O
  null_blk: use zone status for max active/open
  nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru
  nvmet: cleanup nvmet_passthru_map_sg()
  nvmet: limit passthru MTDS by BIO_MAX_PAGES
  nvmet: fix uninitialized work for zero kato
  nvme-pci: disable Write Zeroes on Sandisk Skyhawk
  nvme: use queuedata for nvme_req_qid
  nvme-rdma: fix crash due to incorrect cqe
  nvme-rdma: fix crash when connect rejected
  block: remove unused members for io_context
  blk-mq: remove the calling of local_memory_node()
  zram: Fix __zram_bvec_{read,write}() locking order
  skd_main: remove unused including <linux/version.h>
  sgl_alloc_order: fix memory leak
  lightnvm: fix out-of-bounds write to array devices->info[]
  ...
2020-10-24 12:46:42 -07:00
Greg Kroah-Hartman
35cda5af62 Revert "block: grant IOPRIO_CLASS_RT to CAP_SYS_NICE"
This reverts commit 9d3a39a5f1 as that
commit causes a bunch of SELinux errors.

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I73dc15c4ecddcf1950ca7fca2cc107be27da8f3f
2020-10-24 17:30:14 +02:00
Greg Kroah-Hartman
c62fe038fe Merge 3ad11d7ac8 ("Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block") into android-mainline
Steps on the way to 5.10-rc1

Resolves conflicts in:
	include/linux/blk-crypto.h

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4012850c2e4b804d9e87e90b8e03a3b9ce21b5e7
2020-10-24 17:29:43 +02:00
Greg Kroah-Hartman
072f3a5b28 Merge a7863b3423 ("blk-iocost: update iocost_monitor.py") into android-mainline
Bisection on the way to 5.10-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib96c933a2903e2f30edc0ca13929a318537f77d8
2020-10-24 14:40:14 +02:00
Greg Kroah-Hartman
6c45f08d88 Merge 0460375517 ("blk-iocost: restore inuse update tracepoints") into android-mailine
Bisection on the way to 5.10-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I3f67860c970bea3019c374ba34e051ca78282a92
2020-10-24 13:14:50 +02:00
Greg Kroah-Hartman
b5a717365a Merge c421a3eb2e ("blk-iocost: revamp debt handling") into android-mainline
Bisection on the way to 5.10-rc1

Change-Id: I601c029848bafea95f6a40682264ca5ce76c8b8d
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2020-10-24 12:09:26 +02:00
Greg Kroah-Hartman
dcc7dd95ad Merge 97eb19751f ("blk-iocost: add absolute usage stat") into android-mainline
Bisection on the way to 5.10-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic074e5010e8b4480d50bb84b56c3cf64c3afaaeb
2020-10-24 12:07:50 +02:00
Greg Kroah-Hartman
e92900328d Merge 1f06959bd2 ("block: remove the unused q argument to part_in_flight and part_in_flight_rw") into android-mailine
Bisection on the way to 5.10-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2cd6d1d4deabeb7f9a17cf558aebd589449a37da
2020-10-24 10:17:17 +02:00
Mauro Carvalho Chehab
24f7bb8863 block: blk-mq: fix a kernel-doc markup
Fix a typo:
	blk_mq_run_hw_queue -> blk_mq_run_hw_queues

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-23 12:20:17 -06:00
Greg Kroah-Hartman
836219141f Revert "iov_iter: transparently handle compat iovecs in import_iovec"
This reverts commit 89cd35c58b.

Bug: 171539436
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I44ac1575af118ddbc3ee22a70f81cba282f5372a
2020-10-23 10:21:28 +02:00
Greg Kroah-Hartman
87b0d10a70 Merge 85ed13e78d ("Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs") into android-common
Steps in the way to 5.10-rc1

Change-Id: Ida014e494f8c8b8cc44209df17d8b07b3bd94f1a
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2020-10-23 10:13:30 +02:00
Xianting Tian
576e85c5e9 blk-mq: remove the calling of local_memory_node()
We don't need to check whether the node is memoryless numa node before
calling allocator interface. SLUB(and SLAB,SLOB) relies on the page
allocator to pick a node. Page allocator should deal with memoryless
nodes just fine. It has zonelists constructed for each possible nodes.
And it will automatically fall back into a node which is closest to the
requested node. As long as __GFP_THISNODE is not enforced of course.

The code comments of kmem_cache_alloc_node() of SLAB also showed this:
 * Fallback to other node is possible if __GFP_THISNODE is not set.

blk-mq code doesn't set __GFP_THISNODE, so we can remove the calling
of local_memory_node().

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-20 07:08:17 -06:00
Mauro Carvalho Chehab
5cd3ddc186 docs: bio: fix a kerneldoc markup
Fix this warning:

	./block/bio.c:1098: WARNING: Inline emphasis start-string without end-string.

The thing is that *iter is not a valid markup.

That seems to be a typo:
	*iter -> @iter

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2020-10-15 07:49:48 +02:00
Mauro Carvalho Chehab
5b874af627 block: bio: fix a warning at the kernel-doc markups
Using "@bio's parent" causes the following waring:
	./block/bio.c:10: WARNING: Inline emphasis start-string without end-string.

The main problem here is that this would be converted into:

	**bio**'s parent

By kernel-doc, which is not a valid notation. It would be
possible to use, instead, this kernel-doc markup:

	``bio's`` parent

Yet, here, is probably simpler to just use an altenative language:

	the parent of @bio

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2020-10-15 07:49:47 +02:00
Linus Torvalds
4815519ed0 - Improve DM core's bio splitting to use blk_max_size_offset(). Also
fix bio splitting for bios that were deferred to the worker thread
   due to a DM device being suspended.
 
 - Remove DM core's special handling of NVMe devices now that block
   core has internalized efficiencies drivers previously needed to
   be concerned about (via now removed direct_make_request).
 
 - Fix request-based DM to not bounce through indirect dm_submit_bio;
   instead have block core make direct call to blk_mq_submit_bio().
 
 - Various DM core cleanups to simplify and improve code.
 
 - Update DM cryot to not use drivers that set
   CRYPTO_ALG_ALLOCATES_MEMORY.
 
 - Fix DM raid's raid1 and raid10 discard limits for the purposes of
   linux-stable. But then remove DM raid's discard limits settings now
   that MD raid can efficiently handle large discards.
 
 - A couple small cleanups across various targets.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl+Fx1gTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWk5iB/9pONYmtfQ5oBx4jg/PU8cVYYIfOtwS
 ZtItFbw7T9bkHVZ8d4hDr5LTq898cADuRD5edlR82gDOcXkiJlb5PqU39RoOTVvF
 Xz87sWzHdGAK7rdnCMAc2hiX3oQOje9o7NxGeGQ/uPaNU+U/vJS0AZtEAwltocBd
 j9MGESddBC636Gzbg5C0c0frikXd0am6qp6SCYJNpP5I0G2beHk2YX5Jqt9c7zMk
 8kyQend5b5RvkPNWTAjkVfWUsIjwYHh6MF48ZoGvD0X3lWjIBiwyxC0UX5hSXq63
 kB+nqxbXcvQLEBtJuDZ2bjyvrwzCVLpmfgLgzxOOU8fI5Q2U0zpsPaa0
 =6YDu
 -----END PGP SIGNATURE-----

Merge tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Improve DM core's bio splitting to use blk_max_size_offset(). Also
   fix bio splitting for bios that were deferred to the worker thread
   due to a DM device being suspended.

 - Remove DM core's special handling of NVMe devices now that block core
   has internalized efficiencies drivers previously needed to be
   concerned about (via now removed direct_make_request).

 - Fix request-based DM to not bounce through indirect dm_submit_bio;
   instead have block core make direct call to blk_mq_submit_bio().

 - Various DM core cleanups to simplify and improve code.

 - Update DM cryot to not use drivers that set
   CRYPTO_ALG_ALLOCATES_MEMORY.

 - Fix DM raid's raid1 and raid10 discard limits for the purposes of
   linux-stable. But then remove DM raid's discard limits settings now
   that MD raid can efficiently handle large discards.

 - A couple small cleanups across various targets.

* tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: fix request-based DM to not bounce through indirect dm_submit_bio
  dm: remove special-casing of bio-based immutable singleton target on NVMe
  dm: export dm_copy_name_and_uuid
  dm: fix comment in __dm_suspend()
  dm: fold dm_process_bio() into dm_submit_bio()
  dm: fix missing imposition of queue_limits from dm_wq_work() thread
  dm snap persistent: simplify area_io()
  dm thin metadata: Remove unused local variable when create thin and snap
  dm raid: remove unnecessary discard limits for raid10
  dm raid: fix discard limits for raid1 and raid10
  dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
  dm: use dm_table_get_device_name() where appropriate in targets
  dm table: make 'struct dm_table' definition accessible to all of DM core
  dm: eliminate need for start_io_acct() forward declaration
  dm: simplify __process_abnormal_io()
  dm: push use of on-stack flush_bio down to __send_empty_flush()
  dm: optimize max_io_len() by inlining max_io_len_target_boundary()
  dm: push md->immutable_target optimization down to __process_bio()
  dm: change max_io_len() to use blk_max_size_offset()
  dm table: stack 'chunk_sectors' limit to account for target-specific splitting
2020-10-14 15:05:38 -07:00
Keith Busch
3b481d9135 block: add zone specific block statuses
A zoned device with limited resources to open or activate zones may
return an error when the host exceeds those limits. The same command may
be successful if retried later, but the host needs to wait for specific
zone states before it should expect a retry to succeed. Have the block
layer provide an appropriate status for these conditions so applications
can distinuguish this error for special handling.

Cc: linux-api@vger.kernel.org
Cc: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-13 15:05:05 -06:00
Linus Torvalds
7cd4ecd917 drivers-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EYWYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsCgD/9Izy/mbiQMmcBPBuQFds2b2SwPAoB4RVcU
 NU7pcI3EbAlcj7xDF08Z74Sr6MKyg+JhGid15iw47o+qFq6cxDKiESYLIrFmb70R
 lUDkPr9J4OLNDSZ6hpM4sE6Qg9bzDPhRbAceDQRtVlqjuQdaOS2qZAjNG4qjO8by
 3PDO7XHCW+X4HhXiu2PDCKuwyDlHxggYzhBIFZNf58US2BU8+tLn2gvTSvmTb27F
 w0s5WU1Q5Q0W9RLrp4YTQi4SIIOq03BTSqpRjqhomIzhSQMieH95XNKGRitLjdap
 2mFNJ+5I+DTB/TW2BDBrBRXnoV/QNBJsR0DDFnUZsHEejjXKEVt5BRCpSQC9A0WW
 XUyVE1K+3GwgIxSI8tjPtyPEGzzhnqJjzHPq4LJLGlQje95v9JZ6bpODB7HHtZQt
 rbNp8IoVQ0n01nIvkkt/vnzCE9VFbWFFQiiu5/+x26iKZXW0pAF9Dnw46nFHoYZi
 llYvbKDcAUhSdZI8JuqnSnKhi7sLRNPnApBxs52mSX8qaE91sM2iRFDewYXzaaZG
 NjijYCcUtopUvojwxYZaLnIpnKWG4OZqGTNw1IdgzUtfdxoazpg6+4wAF9vo7FEP
 AePAUTKrfkGBm95uAP4bRvXBzS9UhXJvBrFW3grzRZybMj617F01yAR4N0xlMXeN
 jMLrGe7sWA==
 =xE9E
 -----END PGP SIGNATURE-----

Merge tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Here are the driver updates for 5.10.

  A few SCSI updates in here too, in coordination with Martin as they
  depend on core block changes for the shared tag bitmap.

  This contains:

   - NVMe pull requests via Christoph:
      - fix keep alive timer modification (Amit Engel)
      - order the PCI ID list more sensibly (Andy Shevchenko)
      - cleanup the open by controller helper (Chaitanya Kulkarni)
      - use an xarray for the CSE log lookup (Chaitanya Kulkarni)
      - support ZNS in nvmet passthrough mode (Chaitanya Kulkarni)
      - fix nvme_ns_report_zones (Christoph Hellwig)
      - add a sanity check to nvmet-fc (James Smart)
      - fix interrupt allocation when too many polled queues are
        specified (Jeffle Xu)
      - small nvmet-tcp optimization (Mark Wunderlich)
      - fix a controller refcount leak on init failure (Chaitanya
        Kulkarni)
      - misc cleanups (Chaitanya Kulkarni)
      - major refactoring of the scanning code (Christoph Hellwig)

   - MD updates via Song:
      - Bug fixes in bitmap code, from Zhao Heming
      - Fix a work queue check, from Guoqing Jiang
      - Fix raid5 oops with reshape, from Song Liu
      - Clean up unused code, from Jason Yan
      - Discard improvements, from Xiao Ni
      - raid5/6 page offset support, from Yufen Yu

   - Shared tag bitmap for SCSI/hisi_sas/null_blk (John, Kashyap,
     Hannes)

   - null_blk open/active zone limit support (Niklas)

   - Set of bcache updates (Coly, Dongsheng, Qinglang)"

* tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (78 commits)
  md/raid5: fix oops during stripe resizing
  md/bitmap: fix memory leak of temporary bitmap
  md: fix the checking of wrong work queue
  md/bitmap: md_bitmap_get_counter returns wrong blocks
  md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks
  md/raid0: remove unused function is_io_in_chunk_boundary()
  nvme-core: remove extra condition for vwc
  nvme-core: remove extra variable
  nvme: remove nvme_identify_ns_list
  nvme: refactor nvme_validate_ns
  nvme: move nvme_validate_ns
  nvme: query namespace identifiers before adding the namespace
  nvme: revalidate zone bitmaps in nvme_update_ns_info
  nvme: remove nvme_update_formats
  nvme: update the known admin effects
  nvme: set the queue limits in nvme_update_ns_info
  nvme: remove the 0 lba_shift check in nvme_update_ns_info
  nvme: clean up the check for too large logic block sizes
  nvme: freeze the queue over ->lba_shift updates
  nvme: factor out a nvme_configure_metadata helper
  ...
2020-10-13 13:04:41 -07:00
Linus Torvalds
3ad11d7ac8 block-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
 u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
 fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
 e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
 WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
 c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
 FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
 YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
 ZwfpDh2+Tg==
 =LzyE
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Series of merge handling cleanups (Baolin, Christoph)

 - Series of blk-throttle fixes and cleanups (Baolin)

 - Series cleaning up BDI, seperating the block device from the
   backing_dev_info (Christoph)

 - Removal of bdget() as a generic API (Christoph)

 - Removal of blkdev_get() as a generic API (Christoph)

 - Cleanup of is-partition checks (Christoph)

 - Series reworking disk revalidation (Christoph)

 - Series cleaning up bio flags (Christoph)

 - bio crypt fixes (Eric)

 - IO stats inflight tweak (Gabriel)

 - blk-mq tags fixes (Hannes)

 - Buffer invalidation fixes (Jan)

 - Allow soft limits for zone append (Johannes)

 - Shared tag set improvements (John, Kashyap)

 - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

 - DM no-wait support (Mike, Konstantin)

 - Request allocation improvements (Ming)

 - Allow md/dm/bcache to use IO stat helpers (Song)

 - Series improving blk-iocost (Tejun)

 - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
   Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
  block: fix uapi blkzoned.h comments
  blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
  blk-mq: get rid of the dead flush handle code path
  block: get rid of unnecessary local variable
  block: fix comment and add lockdep assert
  blk-mq: use helper function to test hw stopped
  block: use helper function to test queue register
  block: remove redundant mq check
  block: invoke blk_mq_exit_sched no matter whether have .exit_sched
  percpu_ref: don't refer to ref->data if it isn't allocated
  block: ratelimit handle_bad_sector() message
  blk-throttle: Re-use the throtl_set_slice_end()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Move service tree validation out of the throtl_rb_first()
  blk-throttle: Move the list operation after list validation
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
  blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
  block: Remove redundant 'return' statement
  ...
2020-10-13 12:12:44 -07:00
Linus Torvalds
85ed13e78d Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat iovec cleanups from Al Viro:
 "Christoph's series around import_iovec() and compat variant thereof"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  security/keys: remove compat_keyctl_instantiate_key_iov
  mm: remove compat_process_vm_{readv,writev}
  fs: remove compat_sys_vmsplice
  fs: remove the compat readv/writev syscalls
  fs: remove various compat readv/writev helpers
  iov_iter: transparently handle compat iovecs in import_iovec
  iov_iter: refactor rw_copy_check_uvector and import_iovec
  iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
  compat.h: fix a spelling error in <linux/compat.h>
2020-10-12 16:35:51 -07:00
Greg Kroah-Hartman
8de87f4d1c Linux 5.9
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl+DdgYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGB1sH/0B4REolPPxiQsPL
 P45+a2POUU0SQK8NjkcYskQmJ6DnPKkJSXeUU8rzuI2rty5bWEzEBxdUSh064ZaM
 DGxkWrdhYNaLJuceNLlanBk3vrM3KwGkaJFSKLgsEXkmH86gl33ptA8nlQVbfc+4
 FK4hRZs2J6Y5YRsuzPO8nzzHWs1CThZEcwNPOeIZrAnv+/13zyP/piZ+R/2gKYwy
 RVrVrzSTkTGJXotn6J17Sa02+CGv5UqdxOZFv7jimYJOmFT7KnGngNrJbh8aYWk7
 vPTRUvGozh6NyYxtRI9LKGZy1yQ05Cl7N927CnFYzD235/eT7zaRDZe1e1JaIE8Z
 r2yxGCM=
 =ZIYP
 -----END PGP SIGNATURE-----

Merge 5.9 into android-mainline

Linux 5.9

Change-Id: Ic4308a3e2a4015058efdac52bd51794b604c8435
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2020-10-12 09:12:41 +02:00
Yang Yang
47ce030b7a blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
blk_exit_queue will free elevator_data, while blk_mq_run_work_fn
will access it. Move cancel of hctx->run_work to the front of
blk_exit_queue to avoid use-after-free.

Fixes: 1b97871b50 ("blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release")
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 12:46:28 -06:00
Yufen Yu
c728152413 blk-mq: get rid of the dead flush handle code path
After commit 923218f616 ("blk-mq: don't allocate driver tag upfront
for flush rq"), blk_mq_submit_bio() will call blk_insert_flush()
directly to handle flush request rather than blk_mq_sched_insert_request()
in the case of elevator.

Then, all flush request either have set RQF_FLUSH_SEQ flag when call
blk_mq_sched_insert_request(), or have inserted into hctx->dispatch.
So, remove the dead code path.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 12:35:39 -06:00
Yufen Yu
0546858c59 block: get rid of unnecessary local variable
Since whole elevator register is protectd by sysfs_lock, we
don't need extras 'has_elevator'. Just use q->elevator directly.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 12:34:06 -06:00
Yufen Yu
f0c6ae09db block: fix comment and add lockdep assert
After commit b89f625e28 ("block: don't release queue's sysfs
lock during switching elevator"), whole elevator register and
unregister function are covered by sysfs_lock. So, remove wrong
comment and add lockdep assert.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 12:34:06 -06:00
Yufen Yu
0841031ab9 blk-mq: use helper function to test hw stopped
We have introduced helper function blk_mq_hctx_stopped() to test
BLK_MQ_S_STOPPED.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 12:34:06 -06:00
Yufen Yu
75e6c00fc7 block: use helper function to test queue register
We have defined common interface blk_queue_registered() to
test QUEUE_FLAG_REGISTERED. Just use it.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 12:34:06 -06:00
Yufen Yu
6251b754f5 block: remove redundant mq check
elv_support_iosched() will check queue_is_mq() for us. So, remove
the redundant check to clean code.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 12:34:06 -06:00
Yufen Yu
dd1c372d65 block: invoke blk_mq_exit_sched no matter whether have .exit_sched
We will register debugfs for scheduler no matter whether it have
defined callback funciton .exit_sched. So, blk_mq_exit_sched()
is always needed to unregister debugfs. Also, q->elevator should
be set as NULL after exiting scheduler.

For now, since all register scheduler have defined .exit_sched,
it will not cause any actual problem. But It will be more reasonable
to do this change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 12:34:06 -06:00
Linus Torvalds
583090b1b8 block5.9-2020-10-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9/uU0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnQvD/wNEBP6d4ISx2/I6sDon9SKJgiY3CLF7x3f
 F//GHMYP9+ZzoLdQRlebGiP6c5PVRL6ExJUVNT+Wc4h5jOuThuxy63j/zvv/RSFw
 WH9lFiTG44zjbWjp3sCDOuIlHnCTsqA4zYb6os62q3v4SzenW/TA65C+yLn823AF
 1VKeVvcoHDu3bvLwtLmAyqZAm2iJH02yKdclKgyaLSKdaGGPX2MJ4tW3GxqzA71i
 7R/qer8KqYXSdJdghGI5eFycLnv/TE/bky02TlE+qUhIFwIhDNyo69IQzlMSQXmw
 ECaAxMJYvzh6ruztkdJP0wOjYEryLY1oCusQEseB9M//qMlue/4Mi2D3bX5Ni1g4
 blQQbIi1gu1J/fZrFtW7G/qHxDvT8oA5cFSv5e/72QRIghvavV6cvEP3s9Uu9v9l
 3pA2LcErEgVellzvAe9q192mPpAUgR42VlUyYi7P74By+m7pWob2jWR0WsSbXqNk
 pVhhW3s02hIf9HUAwJkqH46Y3FZmbpTBQvYByFnQh1VSRzmx69zZxs4SrKJTJq9L
 Id83gBW+r1cuJ8QuZUX4D3ttIGuaZ7J8IdSY4JUBJPMOavbykb6YiWtZ4W5IW5R/
 VYcuVTmJr37hcSBHJLw3FmlEN4IH/2QX+mrtJvCEWgeJACo3TVpv0QGw+gD1V5iS
 EQzTCgctTg==
 =THH6
 -----END PGP SIGNATURE-----

Merge tag 'block5.9-2020-10-08' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes that should go into this release:

   - NVMe controller error path reference fix (Chaitanya)

   - Fix regression with IBM partitions on non-dasd devices (Christoph)

   - Fix a missing clear in the compat CDROM packet structure (Peilin)"

* tag 'block5.9-2020-10-08' of git://git.kernel.dk/linux-block:
  partitions/ibm: fix non-DASD devices
  nvme-core: put ctrl ref when module ref get fail
  block/scsi-ioctl: Fix kernel-infoleak in scsi_put_cdrom_generic_arg()
2020-10-08 18:48:34 -07:00
Tetsuo Handa
f4ac712e4f block: ratelimit handle_bad_sector() message
syzbot is reporting unkillable task [1], for the caller is failing to
handle a corrupted filesystem image which attempts to access beyond
the end of the device. While we need to fix the caller, flooding the
console with handle_bad_sector() message is unlikely useful.

[1] https://syzkaller.appspot.com/bug?id=f1f49fb971d7a3e01bd8ab8cff2ff4572ccf3092

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 10:16:59 -06:00
Baolin Wang
1da30f952a blk-throttle: Re-use the throtl_set_slice_end()
Re-use throtl_set_slice_end() to remove duplicate code.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 08:01:38 -06:00
Baolin Wang
29379674bd blk-throttle: Open code __throtl_de/enqueue_tg()
The __throtl_de/enqueue_tg() functions are only be called by
throtl_de/enqueue_tg(), thus we can just open code them to
make code more readable.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 08:01:38 -06:00
Baolin Wang
2397611ac8 blk-throttle: Move service tree validation out of the throtl_rb_first()
The throtl_schedule_next_dispatch() will validate if the service queue
is empty before calling update_min_dispatch_time(), and the
update_min_dispatch_time() will call throtl_rb_first(), which will
validate service queue again.

Thus we can move the service queue validation out of the
throtl_rb_first() to remove the redundant validation in the fast path.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 08:01:38 -06:00
Baolin Wang
b7b609de5a blk-throttle: Move the list operation after list validation
We should move the list operation after validation.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 08:01:38 -06:00
Baolin Wang
5b7048b897 blk-throttle: Fix IO hang for a corner case
It can not scale up in throtl_adjusted_limit() if we set bps or iops is
1, which will cause IO hang when enable low limit. Thus we should treat
1 as a illegal value to avoid this issue.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 08:01:38 -06:00
Baolin Wang
b185efa78b blk-throttle: Avoid tracking latency if low limit is invalid
The IO latency tracking is only for LOW limit, so we should add a
validation to avoid redundant latency tracking if the LOW limit
is not valid.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 08:01:37 -06:00
Baolin Wang
7901601aef blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
We only update the tg->last_finish_time when the low limitaion is
enabled, so we can move the tg->last_finish_time validation a little
forward to avoid getting the unnecessary current time stamp if the
the low limitation is not enabled.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 08:01:37 -06:00
Baolin Wang
4247d9c8ba blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
The throtl_downgrade_state() is always used to change to LIMIT_LOW
limitation, thus remove the latter meaningless parameter which
indicates the limitation index.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 08:01:37 -06:00
Baolin Wang
fa1c3eaf4d block: Remove redundant 'return' statement
Remove redundant 'return' statement for 'void' functions.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 07:59:48 -06:00
Mike Snitzer
681cc5e866 dm: fix request-based DM to not bounce through indirect dm_submit_bio
It is unnecessary to force request-based DM to call into bio-based
dm_submit_bio (via indirect disk->fops->submit_bio) only to have it then
call blk_mq_submit_bio().

Fix this by establishing a request-based DM block_device_operations
(dm_rq_blk_dops, which doesn't have .submit_bio) and update
dm_setup_md_queue() to set md->disk->fops to it for
DM_TYPE_REQUEST_BASED.

Remove DM_TYPE_REQUEST_BASED conditional in dm_submit_bio and unexport
blk_mq_submit_bio.

Fixes: c62b37d96b ("block: move ->make_request_fn to struct block_device_operations")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-10-07 18:08:51 -04:00
Christoph Hellwig
7370997d48 partitions/ibm: fix non-DASD devices
Don't error out if the dasd_biodasdinfo symbol is not available.

Cc: stable@vger.kernel.org
Fixes: 26d7e28e38 ("s390/dasd: remove ioctl_by_bdev calls")
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-07 07:55:35 -06:00
Gabriel Krisman Bertazi
a926c7afff block: Consider only dispatched requests for inflight statistic
According to Documentation/block/stat.rst, inflight should not include
I/O requests that are in the queue but not yet dispatched to the device,
but blk-mq identifies as inflight any request that has a tag allocated,
which, for queues without elevator, happens at request allocation time
and before it is queued in the ctx (default case in blk_mq_submit_bio).

In addition, current behavior is different for queues with elevator from
queues without it, since for the former the driver tag is allocated at
dispatch time.  A more precise approach would be to only consider
requests with state MQ_RQ_IN_FLIGHT.

This effectively reverts commit 6131837b1d ("blk-mq: count allocated
but not started requests in iostats inflight") to consolidate blk-mq
behavior with itself (elevator case) and with original documentation,
but it differs from the behavior used by the legacy path.

This version differs from v1 by using blk_mq_rq_state to access the
state attribute.  Avoid using blk_mq_request_started, which was
suggested, since we don't want to include MQ_RQ_COMPLETE.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-06 14:36:35 -06:00
Christoph Hellwig
eda5cc997a block: move blk_mq_sched_try_merge to blk-merge.c
Move blk_mq_sched_try_merge to blk-merge.c, which allows to mark
a lot of the merge infrastructure static there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-06 07:29:53 -06:00
Christoph Hellwig
d59da41998 block: remove the unused blk_integrity_merge_bio export
Also move the definition from the public blkdev.h to the private
block/blk.h header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-06 07:29:53 -06:00
Christoph Hellwig
92cf2fd156 block: remove the unused blk_integrity_merge_rq export
Also move the definition from the public blkdev.h to the private
block/blk.h header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-06 07:29:53 -06:00
Greg Kroah-Hartman
114c58d041 Linux 5.9-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl96VQIeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGDS0H/0cyFcn/c8PYitdj
 C5p8/1KUmXOJsHeRjZ6rZPExVL4bp88/xxMkzLspD6INJsoqfyrj1wbJfHIrYSYt
 6Ps5NoWl9woJlyD4H7cyoXmGfb0DiVRjCWU7VK7fwxQy4W97A/eLgNEj1WMZ5kzC
 FWT4bI7Kwo7tdcBeU52RkzZfeW59HU8667CO3FXr8yPySlUSZMoWeypQfCbHnqVG
 WOEAJbbxklPzUbABcTF23lUMl1saOZ30T+hlXddolHq4clvhKf2b3lLbTCUcM1Yj
 P+I7/K3bDdbpmaYNqNs8LCEiVhZBGK7CJViJvWAus/U7ccMNlJO4xJQV5C7oq85K
 DSXVIYU=
 =RqRn
 -----END PGP SIGNATURE-----

Merge 5.9-rc8 into android-mainline

Linux 5.9-rc8

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4a03857522443f5d032a0cc1ecdced3675255d5e
2020-10-06 09:55:29 +02:00
Eric Biggers
cf785af193 block: warn if !__GFP_DIRECT_RECLAIM in bio_crypt_set_ctx()
bio_crypt_set_ctx() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

For now this assumption is still fine, since no callers violate it.
Making bio_crypt_set_ctx() able to fail would add unneeded complexity.

However, if a caller didn't use __GFP_DIRECT_RECLAIM, it would be very
hard to notice the bug.  Make it easier by adding a WARN_ON_ONCE().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Satya Tangirala <satyat@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-05 10:47:43 -06:00
Eric Biggers
93f221ae08 block: make blk_crypto_rq_bio_prep() able to fail
blk_crypto_rq_bio_prep() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

However, blk_crypto_rq_bio_prep() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c.

This case isn't currently reachable with a bio that actually has an
encryption context.  However, it's fragile to rely on this.  Just make
blk_crypto_rq_bio_prep() able to fail.

Suggested-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-05 10:47:43 -06:00
Eric Biggers
07560151db block: make bio_crypt_clone() able to fail
bio_crypt_clone() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

However, bio_crypt_clone() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via
kcryptd_io_read() in drivers/md/dm-crypt.c.

Neither case is currently reachable with a bio that actually has an
encryption context.  However, it's fragile to rely on this.  Just make
bio_crypt_clone() able to fail, analogous to bio_integrity_clone().

Reported-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Satya Tangirala <satyat@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-05 10:47:43 -06:00
Christoph Hellwig
10ed16662d block: add a bdget_part helper
All remaining callers of bdget() outside of fs/block_dev.c want to get a
reference to the struct block_device for a given struct hd_struct.  Add
a helper just for that and then mark bdget static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-05 10:38:33 -06:00
Christoph Hellwig
89cd35c58b iov_iter: transparently handle compat iovecs in import_iovec
Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.

This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-03 00:02:13 -04:00
Gustavo A. R. Silva
f5ace5ef37 block: scsi_ioctl: Avoid the use of one-element arrays
One-element arrays are being deprecated[1]. Replace the one-element array
with a simple object of type compat_caddr_t: 'compat_caddr_t unused'[2],
once it seems this field is actually never used.

Also, update struct cdrom_generic_command in UAPI by adding an
anonimous union to avoid using the one-element array _reserved_.

[1] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays
[2] https://github.com/KSPP/linux/issues/86

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/lkml/5f76f5d0.qJ4t%2FHWuRzSW7bTa%25lkp@intel.com/
Build-tested-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 17:58:52 -06:00
Linus Torvalds
f016a54052 block-5.9-2020-10-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl93Z28QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpucFEACjn38JQGjFxcT9034e4rTys3kPFcvC6yik
 8BZI33rYeuX3GAkuOAUeAoK5k8EfZBhjgHKX0DTaW4RZbggZC4fT9vVEKsRz1Ee2
 E0xLc1jUoUqQ397H+AhOHnVHylQJqUzy6dywyz7QHTH/fWmemKqvZLZrA/ujDkhS
 AxiKI+/E6DxYByi9mgOfSCCQSZVEUTS0Z9S9+fcKAJ9VSiJNu3d3UWFkcrCECmb8
 ChBgNuf/qpAT0lW6/L3eGv+qzDCgYw7VTEtGEONEJKLm84wYdcGWEFr3pNHTkxl6
 ZXHyfVno1DctGpiDEE84FYBvBW7lKogwJVJkh8niEOm9vkXUJYrSAJvuTyw9KRHJ
 wEse1Y3+uMhPLFmIkFMMayn/ErzddD64WGN7CJLMsiXs3z08cFNmLLU57nvrC3um
 AC0rJ10eYMxEQkJuTAoMOWzz3zjhwDxNZL1v/aUr73Tag5uFSoj3esJMKKAdjH82
 OYl6SB6rTcvnTcnaja0AzWCy5dSV1sbGWxc2PuEcobNkmrht24KsQk8Enw1YsnRa
 aLmrh8a6Ya8rbv3L9A1Uz51QXMAwtZJ/43l6nWwppuxntR1/ufZo8e4qt0XNqp/s
 4NJPoHHE4iqpw2+BnZjlzuomUQAStMew4h91J5d2QJZe+sl5+KMDvquW4uIUU4vr
 FBvHbrn1fA==
 =p7wt
 -----END PGP SIGNATURE-----

Merge tag 'block-5.9-2020-10-02' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
 "Single fix for a ->commit_rqs failure case"

* tag 'block-5.9-2020-10-02' of git://git.kernel.dk/linux-block:
  blk-mq: call commit_rqs while list empty but error happen
2020-10-02 14:34:52 -07:00
Peilin Ye
6d53a9fe5a block/scsi-ioctl: Fix kernel-infoleak in scsi_put_cdrom_generic_arg()
scsi_put_cdrom_generic_arg() is copying uninitialized stack memory to
userspace, since the compiler may leave a 3-byte hole in the middle of
`cgc32`. Fix it by adding a padding field to `struct
compat_cdrom_generic_command`.

Cc: stable@vger.kernel.org
Fixes: f3ee6e63a9 ("compat_ioctl: move CDROM_SEND_PACKET handling into scsi")
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: syzbot+85433a479a646a064ab3@syzkaller.appspotmail.com
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 12:01:47 -06:00
Mike Snitzer
1471308fb5 Merge remote-tracking branch 'jens/for-5.10/block' into dm-5.10
DM depends on these block 5.10 commits:

22ada802ed block: use lcm_not_zero() when stacking chunk_sectors
07d098e6bb block: allow 'chunk_sectors' to be non-power-of-2
021a24460d block: add QUEUE_FLAG_NOWAIT
6abc49468e dm: add support for REQ_NOWAIT and enable it for linear target

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-09-29 16:31:35 -04:00
yangerkun
76cffccd60 block-mq: fix comments in blk_mq_queue_tag_busy_iter
'f5bbbbe4d635 ("blk-mq: sync the update nr_hw_queues with
blk_mq_queue_tag_busy_iter")' introduce a bug what we may sleep between
rcu lock. Then '530ca2c9bd69 ("blk-mq: Allow blocking queue tag iter
callbacks")' fix it by get request_queue's ref. And 'a9a808084d6a ("block:
Remove the synchronize_rcu() call from __blk_mq_update_nr_hw_queues()")'
remove the synchronize_rcu in __blk_mq_update_nr_hw_queues. We need
update the confused comments in blk_mq_queue_tag_busy_iter.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-29 08:11:00 -06:00
yangerkun
632bfb6323 blk-mq: call commit_rqs while list empty but error happen
Blk-mq should call commit_rqs once 'bd.last != true' and no more
request will come(so virtscsi can kick the virtqueue, e.g.). We already
do that in 'blk_mq_dispatch_rq_list/blk_mq_try_issue_list_directly' while
list not empty and 'queued > 0'. However, we can seen the same scene
once the last request in list call queue_rq and return error like
BLK_STS_IOERR which will not requeue the request, and lead that list
empty but need call commit_rqs too(Or the request for virtscsi will stay
timeout until other request kick virtqueue).

We found this problem by do fsstress test with offline/online virtscsi
device repeat quickly.

Fixes: d666ba98f8 ("blk-mq: add mq_ops->commit_rqs()")
Reported-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-29 08:10:17 -06:00
Xianting Tian
8229cca8c3 blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
We found blk_mq_alloc_rq_maps() takes more time in kernel space when
testing nvme device hot-plugging. The test and anlysis as below.

Debug code,
1, blk_mq_alloc_rq_maps():
        u64 start, end;
        depth = set->queue_depth;
        start = ktime_get_ns();
        pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
                        current->pid, current->comm, current->nvcsw, current->nivcsw,
                        set->queue_depth, set->nr_hw_queues);
        do {
                err = __blk_mq_alloc_rq_maps(set);
                if (!err)
                        break;

                set->queue_depth >>= 1;
                if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
                        err = -ENOMEM;
                        break;
                }
        } while (set->queue_depth);
        end = ktime_get_ns();
        pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
                        current->pid, current->comm,
                        current->nvcsw, current->nivcsw, end - start);

2, __blk_mq_alloc_rq_maps():
        u64 start, end;
        for (i = 0; i < set->nr_hw_queues; i++) {
                start = ktime_get_ns();
                if (!__blk_mq_alloc_rq_map(set, i))
                        goto out_unwind;
                end = ktime_get_ns();
                pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
        }

Test nvme hot-plugging with above debug code, we found it totally cost more
than 3ms in kernel space without being scheduled out when alloc rqs for all
16 hw queues with depth 1023, each hw queue cost about 140-250us. The cost
time will be increased with hw queue number and queue depth increasing. And
in an extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try
"queue_depth >>= 1", more time will be consumed.
	[  428.428771] nvme nvme0: pci function 10000:01:00.0
	[  428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
	[  428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
	[  428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
	[  432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
	[  432.593404] hw queue 0 init cost time 22883 ns
	[  432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
	[  432.595953] nvme nvme0: 16/0/0 default/read/poll queues
	[  432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
	[  432.596203] hw queue 0 init cost time 242630 ns
	[  432.596441] hw queue 1 init cost time 235913 ns
	[  432.596659] hw queue 2 init cost time 216461 ns
	[  432.596877] hw queue 3 init cost time 215851 ns
	[  432.597107] hw queue 4 init cost time 228406 ns
	[  432.597336] hw queue 5 init cost time 227298 ns
	[  432.597564] hw queue 6 init cost time 224633 ns
	[  432.597785] hw queue 7 init cost time 219954 ns
	[  432.597937] hw queue 8 init cost time 150930 ns
	[  432.598082] hw queue 9 init cost time 143496 ns
	[  432.598231] hw queue 10 init cost time 147261 ns
	[  432.598397] hw queue 11 init cost time 164522 ns
	[  432.598542] hw queue 12 init cost time 143401 ns
	[  432.598692] hw queue 13 init cost time 148934 ns
	[  432.598841] hw queue 14 init cost time 147194 ns
	[  432.598991] hw queue 15 init cost time 148942 ns
	[  432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
	[  432.602611]  nvme0n1: p1

So use this patch to trigger schedule between each hw queue init, to avoid
other threads getting stuck. It is not in atomic context when executing
__blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-28 09:01:51 -06:00
Greg Kroah-Hartman
fb3b36d52f Merge a1bffa4874 ("Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi") into 'android-mainline'
Fixes up a merge issue in:
	net/ipv6/route.c
on the way to a 5.9-rc7 release.

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4eb508eb3761b95ad8f39dd79f03b3352481ceaf
2020-09-27 13:56:43 +02:00
Linus Torvalds
a1bffa4874 SCSI fixes on 20200926
Three fixes: one in drivers (lpfc) and two for zoned block devices.
 The latter also impinges on the block layer but only to introduce a
 new block API for setting the zone model rather than fiddling with the
 queue directly in the zoned block driver.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCX29mRyYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishabnAP48vMYD
 /cjyGAJfq/0k/U/t6pRPc5tUm89LOWcOJz0SjwD/YXcQNz7mx8MxnypAV1jbWXR7
 iyWkPMYVc4EJh7oTARE=
 =SQhI
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Three fixes: one in drivers (lpfc) and two for zoned block devices.

  The latter also impinges on the block layer but only to introduce a
  new block API for setting the zone model rather than fiddling with the
  queue directly in the zoned block driver"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sd: sd_zbc: Fix ZBC disk initialization
  scsi: sd: sd_zbc: Fix handling of host-aware ZBC disks
  scsi: lpfc: Fix initial FLOGI failure due to BBSCN not supported
2020-09-26 11:18:37 -07:00
Tejun Heo
bec02dbbaf iocost: consider iocgs with active delays for debt forgiveness
An iocg may have 0 debt but non-zero delay. The current debt forgiveness
logic doesn't act on such iocgs. This can lead to unexpected behaviors - an
iocg with a little bit of debt will have its delay canceled through debt
forgiveness but one w/o any debt but active delay will have to wait out
until its delay decays out.

This patch updates the debt handling logic so that it treats delays the same
as debts. If either debt or delay is active, debt forgiveness logic kicks in
and acts on both the same way.

Also, avoid turning the debt and delay directly to zero as that can confuse
state transitions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:35:02 -06:00
Tejun Heo
c5a6561b8d iocost: add iocg_forgive_debt tracepoint
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:35:02 -06:00
Tejun Heo
c7af2a003a iocost: reimplement debt forgiveness using average usage
Debt forgiveness logic was counting the number of consecutive !busy periods
as the trigger condition. While this usually works, it can easily be thrown
off by temporary fluctuations especially on configurations w/ short periods.

This patch reimplements debt forgiveness so that:

* Use the average usage over the forgiveness period instead of counting
  consecutive periods.

* Debt is reduced at around the target rate (1/2 every 100ms) regardless of
  ioc period duration.

* Usage threshold is raised to 50%. Combined with the preceding changes and
  the switch to average usage, this makes debt forgivness a lot more
  effective at reducing the amount of unnecessary idleness.

* Constants are renamed with DFGV_ prefix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:35:02 -06:00
Tejun Heo
d95178410b iocost: recalculate delay after debt reduction
Debt sets the initial delay duration which is decayed over time. The current
debt reduction halved the debt but didn't change the delay. It prevented
future debts from increasing delay but didn't do anything to lower the
existing delay, limiting the mechanism's ability to reduce unnecessary
idling.

Reset iocg->delay to 0 after debt reduction so that iocg_kick_waitq()
recalculates new delay value based on the reduced debt amount.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:35:02 -06:00
Tejun Heo
33a1fe6d82 iocost: replace nr_shortages cond in ioc_forgive_debts() with busy_level one
Debt reduction was blocked if any iocg was short on budget in the past
period to avoid reducing debts while some iocgs are saturated. However, this
ends up unnecessarily blocking debt reduction due to temporary local
imbalances when the device is generally being underutilized, while also
failing to block when the underlying device is overwhelmed and the usage
becomes low from high latency.

Given that debt accumulation mostly happens with swapout bursts which can
significantly deteriorate the underlying device's latency response, the
current logic is not great.

Let's replace it with ioc->busy_level based condition so that we block debt
reduction when the underlying device is being saturated. ioc_forgive_debts()
call is moved after busy_level determination.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:35:02 -06:00
Tejun Heo
ab8df828b5 iocost: factor out ioc_forgive_debts()
Debt reduction logic is going to be improved and expanded. Factor it out
into ioc_forgive_debts() and generalize the comment a bit. No functional
change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:35:02 -06:00
Mike Snitzer
021a24460d block: add QUEUE_FLAG_NOWAIT
Add QUEUE_FLAG_NOWAIT to allow a block device to advertise support for
REQ_NOWAIT. Bio-based devices may set QUEUE_FLAG_NOWAIT where
applicable.

Update QUEUE_FLAG_MQ_DEFAULT to include QUEUE_FLAG_NOWAIT.  Also
update submit_bio_checks() to verify it is set for REQ_NOWAIT bios.

Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:20:03 -06:00
Christoph Hellwig
8a63a86e1f block: use bd_partno in bdevname
No need to go through the hd_struct to find the partition number.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:18:58 -06:00
Christoph Hellwig
fa01b1e973 block: add a bdev_is_partition helper
Add a littler helper to make the somewhat arcane bd_contains checks a
little more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:18:57 -06:00
Jens Axboe
ac8f7a0264 Merge branch 'for-5.10/block' into for-5.10/drivers
* for-5.10/block: (140 commits)
  bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
  bdi: invert BDI_CAP_NO_ACCT_WB
  bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
  mm: use SWP_SYNCHRONOUS_IO more intelligently
  bdi: remove BDI_CAP_SYNCHRONOUS_IO
  bdi: remove BDI_CAP_CGROUP_WRITEBACK
  block: lift setting the readahead size into the block layer
  md: update the optimal I/O size on reshape
  bdi: initialize ->ra_pages and ->io_pages in bdi_init
  aoe: set an optimal I/O size
  bcache: inherit the optimal I/O size
  drbd: remove dead code in device_to_statistics
  fs: remove the unused SB_I_MULTIROOT flag
  block: mark blkdev_get static
  PM: mm: cleanup swsusp_swap_check
  mm: split swap_type_of
  PM: rewrite is_hibernate_resume_dev to not require an inode
  mm: cleanup claim_swapfile
  ocfs2: cleanup o2hb_region_dev_store
  dasd: cleanup dasd_scan_partitions
  ...
2020-09-24 13:44:39 -06:00
Christoph Hellwig
1cb039f3dc bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
also is writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
ed7b6b4f6e bdi: remove BDI_CAP_CGROUP_WRITEBACK
Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
c2e4cd57cf block: lift setting the readahead size into the block layer
Drivers shouldn't really mess with the readahead size, as that is a VM
concept.  Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk.  Also set
bdi->io_pages there as well by applying the same scheme based on
max_sectors.  To ensure the limits work well for stacking drivers a
new helper is added to update the readahead limits from the block
limits, which is also called from disk_stack_limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
55b2598e84 bdi: initialize ->ra_pages and ->io_pages in bdi_init
Set up a readahead size by default, as very few users have a good
reason to change it.  This means code, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
478162821d block: cleanup blkdev_bszset
Use blkdev_get_by_dev instead of bdgrab + blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 10:43:19 -06:00
Christoph Hellwig
9301fe7343 block: cleanup partition scanning in register_disk
Use blkdev_get_by_dev instead of open coding it using bdget_disk +
blkdev_get, and split the code to read the partition table into a
separate helper to make it a little more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 10:43:19 -06:00
Christoph Hellwig
38430f0876 block: move the NEED_PART_SCAN flag to struct gendisk
We can only scan for partitions on the whole disk, so move the flag
from struct block_device to struct gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 10:43:18 -06:00
Mike Snitzer
07d098e6bb block: allow 'chunk_sectors' to be non-power-of-2
It is possible, albeit more unlikely, for a block device to have a non
power-of-2 for chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors,
which results in a full-stripe size of 1280K. This causes the RAID6's
io_opt to be advertised as 1280K, and a stacked device _could_ then be
made to use a blocksize, aka chunk_sectors, that matches non power-of-2
io_opt of underlying RAID6 -- resulting in stacked device's
chunk_sectors being a non power-of-2).

Update blk_queue_chunk_sectors() and blk_max_size_offset() to
accommodate drivers that need a non power-of-2 chunk_sectors.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 10:38:14 -06:00
Mike Snitzer
22ada802ed block: use lcm_not_zero() when stacking chunk_sectors
Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using
lcm_not_zero() rather than min_not_zero() -- otherwise the final
'chunk_sectors' could result in sub-optimal alignment of IO to
component devices in the IO stack.

Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size'
then it is a bug in the driver and the device should be flagged as
'misaligned'.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 10:38:12 -06:00
Christoph Hellwig
0385971754 block: fix bmd->is_null_mapped initialization
bmd is allocated using kmalloc in bio_alloc_map_data, so make sure
is_null_mapped is properly initialized to false for the !null_mapped
case.

Fixes: f3256075ba ("block: remove the BIO_NULL_MAPPED flag")
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 09:18:39 -06:00
Julia Lawall
f952eefe74 block: drop double zeroing
sg_init_table zeroes its first argument, so the allocation of that argument
doesn't have to.

the semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression x;
@@

x =
- kzalloc
+ kmalloc
 (...)
...
sg_init_table(x,...)
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-23 09:18:13 -06:00
Damien Le Moal
27ba3e8ff3 scsi: sd: sd_zbc: Fix handling of host-aware ZBC disks
When CONFIG_BLK_DEV_ZONED is disabled, allow using host-aware ZBC disks as
regular disks. In this case, ensure that command completion is correctly
executed by changing sd_zbc_complete() to return good_bytes instead of 0
and causing a hang during device probe (endless retries).

When CONFIG_BLK_DEV_ZONED is enabled and a host-aware disk is detected to
have partitions, it will be used as a regular disk. In this case, make sure
to not do anything in sd_zbc_revalidate_zones() as that triggers warnings.

Since all these different cases result in subtle settings of the disk queue
zoned model, introduce the block layer helper function
blk_queue_set_zoned() to generically implement setting up the effective
zoned model according to the disk type, the presence of partitions on the
disk and CONFIG_BLK_DEV_ZONED configuration.

Link: https://lore.kernel.org/r/20200915073347.832424-2-damien.lemoal@wdc.com
Fixes: b72053072c ("block: allow partitions on host aware zone devices")
Cc: <stable@vger.kernel.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-09-15 20:08:14 -04:00
Baolin Wang
87fbeb8813 blk-throttle: Avoid checking bps/iops limitation if bps or iops is unlimited
Do not need check the bps or iops limitation if bps or iops is unlimited.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14 19:36:54 -06:00
Baolin Wang
4599ea49d4 blk-throttle: Avoid calculating bps/iops limitation repeatedly
The tg_may_dispatch() will call tg_with_in_bps_limit() and
tg_with_in_iops_limit() to check if we can dispatch a bio or
not, which will calculate bps/iops limitation multiple times.
But tg_may_dispatch() is always called under queue lock, which
means the bps/iops limitation will not change in tg_may_dispatch().

So we can calculate the bps/iops limitation only once, and pass
them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to
avoid calculating bps/iops limitation repeatedly.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14 19:36:54 -06:00
Baolin Wang
e675df2adc blk-throttle: Define readable macros instead of static variables
The 'throtl_grp_quantum' and 'throtl_quantum' are both read-only
variables, thus better to use readable macros instead of static
variables, which can also save some spaces for .bss area.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14 19:36:54 -06:00
Baolin Wang
ff8b22c0f2 blk-throttle: Use readable READ/WRITE macros
Use readable READ/WRITE macros instead of magic numbers.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14 19:36:54 -06:00
Baolin Wang
b53b072c4b blk-throttle: Fix some comments' typos
Fix some comments' typos.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14 19:36:54 -06:00
Tejun Heo
aa67db24b6 iocost: fix infinite loop bug in adjust_inuse_and_calc_cost()
adjust_inuse_and_calc_cost() is responsible for reducing the amount of
donated weights dynamically in period as the budget runs low. Because we
don't want to do full donation calculation in period, we keep latching up
inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the
resulting hweight_inuse is satisfactory.

Unfortunately, the adj_step calculation was reading the active weight before
acquiring ioc->lock. Because the current thread could have lost race to
activate the iocg to another thread before entering this function, it may
read the active weight as zero before acquiring ioc->lock. When this
happens, the adj_step is calculated as zero and the incremental adjustment
loop becomes an infinite one.

Fix it by fetching the active weight after acquiring ioc->lock.

Fixes: b0853ab4a2 ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-14 17:25:39 -06:00
Greg Kroah-Hartman
9c16520a33 Linux 5.9-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl9epdgeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG9IMH/jHCRSbcsIXHuQHn
 xcRLlhrDHfXoBza7auHfPWx2+9DZsmaSJs/SEiTGNag0Bi7jBcWcwBpsep7iVG/+
 WiftD5uOMhZigyuvfMFrt0mjr2Kr3wg5p58lwMBeBdm8iL5uKV8ehKsh05/Fral2
 6hu3jP8L0PCZMpF+sZ7s2jlhfVUMmjA8VzXZCvgQtmhoraHiF3mzfkcSMxnHwBPO
 HLo+TDDm49u+LbVsJT7+cSTiWxuUJCbix9Q4PCTx/BGg4ezYsjc6v0BnYRaYtrrA
 1uYiT6PVBEUkYYBHKQlD3N2KnUmbKx7dGUF4t+peTg5/JiocAJMNi1N9Qzvv7N6Q
 CqTiuio=
 =q+kJ
 -----END PGP SIGNATURE-----

Merge 5.9-rc5 into android-mainline

Linux 5.9-rc5

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia4878d9973d7f9adf94f38e9c5703112c605ca56
2020-09-14 09:49:34 +02:00
Tejun Heo
769b628de0 blk-iocost: fix divide-by-zero in transfer_surpluses()
Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but
all hweight calculations round up and thus it may end up >= WEIGHT_ONE
triggering divide-by-zero and other issues. Bound the value to avoid
surprises.

Fixes: e08d02aa5f ("blk-iocost: implement Andy's method for donation weight updates")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11 16:41:47 -06:00
Song Liu
7b26410b05 block: introduce part_[begin|end]_io_acct
These functions can be used to enable iostat for partitions on devices
like md, bcache.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11 16:41:30 -06:00
Linus Torvalds
7b8731d958 - Fix a regression in bdev partition locking (Christoph)
- NVMe pull request from Christoph:
 	- cancel async events before freeing them (David Milburn)
 	- revert a broken race fix (James Smart)
 	- fix command processing during resets (Sagi Grimberg)
 
 - Fix a kyber crash with requeued flushes (Omar)
 
 - Fix __bio_try_merge_page() same_page error for no merging (Ritesh)
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9boNoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpm++D/9oEC1RazLFXwZD7rtXUMQ0bWRmbyM77Qtq
 P7wn0poSSvHT6fNyd9ytf9STlTXeJz81Gk4jTRiau1HKAhc9GudYEzYFw0baNN82
 AX5dO1Gt2vww+k4XAHCM0l0k2/IOgQg8d2hDJBt68bnDIW/T1T3GORqS5Ki0dw9R
 EYVFbBePZTyUIAxDWnSKtNRR3TpMrfZfi9AAUpwGkKVcCZkHD4SlrNPGKd0ckD5Z
 GnHdJtWjb5mIgVHMbHgWjcIjKhC7BTrL+sCqdBJ55NvfWXZ20QoKKDSx5BWl6rMI
 g/eMAJjoYJ6Ih13sjIbrC7fHZBXzPRTRfqKBq8fM6oytD0cO9ZcUfpBeqiCWOyrT
 SU3C1MkkqeskDGNXhjOq8lFWeyQlUgBg0rXIDDeFNusUB3QOZa3T7oirqZlfZsOi
 G7WVd4/aftr+qB8GVl1HmLCg7U3rO2q6EuJ+aJDGh07TuiFi5qaPwRzmRcykKs62
 UJ15W9JaNEHdGQs5rim7evz9qLCTyQqrwF7nDFBpM8hsraPPCNbwGoUbXLACtXGR
 htjr5nxEoOEJs9SKZCWl9jXzvyoMkqLp4j6soVS7cZKUJU1qxMhf68FGylbHitEq
 Pe1z7dG/3Pq/zV77aGTt1J40tB43tHr3gOSQ2swwjxqvYIjlvbP4xnl6SIHvLlof
 blntc17XWQ==
 =J16G
 -----END PGP SIGNATURE-----

Merge tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Fix a regression in bdev partition locking (Christoph)

 - NVMe pull request from Christoph:
      - cancel async events before freeing them (David Milburn)
      - revert a broken race fix (James Smart)
      - fix command processing during resets (Sagi Grimberg)

 - Fix a kyber crash with requeued flushes (Omar)

 - Fix __bio_try_merge_page() same_page error for no merging (Ritesh)

* tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block:
  block: Set same_page to false in __bio_try_merge_page if ret is false
  nvme-fabrics: allow to queue requests for live queues
  block: only call sched requeue_request() for scheduled requests
  nvme-tcp: cancel async events before freeing event struct
  nvme-rdma: cancel async events before freeing event struct
  nvme-fc: cancel async events before freeing event struct
  nvme: Revert: Fix controller creation races with teardown flow
  block: restore a specific error code in bdev_del_partition
2020-09-11 11:55:28 -07:00
Ming Lei
285008501c blk-mq: always allow reserved allocation in hctx_may_queue
NVMe shares tagset between fabric queue and admin queue or between
connect_q and NS queue, so hctx_may_queue() can be called to allocate
request for these queues.

Tags can be reserved in these tagset. Before error recovery, there is
often lots of in-flight requests which can't be completed, and new
reserved request may be needed in error recovery path. However,
hctx_may_queue() can always return false because there is too many
in-flight requests which can't be completed during error handling.
Finally, nothing can proceed.

Fix this issue by always allowing reserved tag allocation in
hctx_may_queue(). This is reasonable because reserved tags are supposed
to always be available.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: David Milburn <dmilburn@redhat.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11 05:26:19 -06:00
Tian Tao
84ed2573c5 block: remove duplicate include statement in scsi_ioctl.c
scsi/sg.h is included more than once, Remove the one that isn't
necessary.

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-11 05:23:37 -06:00
Xianting Tian
192f1c6bc2 blkcg: add plugging support for punt bio
The test and the explaination of the patch as bellow.

Before test we added more debug code in blkg_async_bio_workfn():
	int count = 0
	if (bios.head && bios.head->bi_next) {
		need_plug = true;
		blk_start_plug(&plug);
	}
	while ((bio = bio_list_pop(&bios))) {
		/*io_punt is a sysctl user interface to control the print*/
		if(io_punt) {
			printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
				current->comm, current->pid, bio->bi_iter.bi_sector,
				(bio->bi_iter.bi_size)>>9, count++, need_plug);
		}
		submit_bio(bio);
	}
	if (need_plug)
		blk_finish_plug(&plug);

Steps that need to be set to trigger *PUNT* io before testing:
	mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
	mount -t cgroup2 nodev /cgroup2
	mkdir /cgroup2/cg3
	echo "+io" > /cgroup2/cgroup.subtree_control
	echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s
	echo $$ > /cgroup2/cg3/cgroup.procs

Then use dd command to test btrfs PUNT io in current shell:
	dd if=/dev/zero of=/btrfs/file bs=64K count=100000

Test hardware environment as below:
	[root@localhost btrfs]# lscpu
	Architecture:          x86_64
	CPU op-mode(s):        32-bit, 64-bit
	Byte Order:            Little Endian
	CPU(s):                32
	On-line CPU(s) list:   0-31
	Thread(s) per core:    2
	Core(s) per socket:    8
	Socket(s):             2
	NUMA node(s):          2
	Vendor ID:             GenuineIntel

With above debug code, test command and test environment, I did the
tests under 3 different system loads, which are triggered by stress:
1, Run 64 threads by command "stress -c 64 &"
	[53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
	[53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
	[53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
	[53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
	[53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
	[53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
	... ...
	[53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
	[53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
	[53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
	[53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
	[53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
	[53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
	[53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1

2, Run 32 threads by command "stress -c 32 &"
	[50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
	[50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
	[50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
	[50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
	[50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
	[50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
	... ...
	[50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
	[50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
	[50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
	[50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
	[50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
	[50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
	[50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
	[50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1

3, Don't run thread by stress
	[50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
	[50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
	[50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
	[50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
	[50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
	[50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
	[50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
	[50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
	[50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
	[50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
	[50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
	[50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
	[50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
	[50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
	[50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
	[50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
	[50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
	[50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
	[50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
	[50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
	[50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0

Analysis of above 3 test results with different system load:
>From above test, we can see more and more continuous bios can be plugged
with system load increasing. When run "stress -c 64 &", 310 continuous
bios are plugged; When run "stress -c 32 &", 260 continuous bios are
plugged; When don't run stress, at most only 2 continuous bios are
plugged, in most cases, bio_list only contains one single bio.

How to explain above phenomenon:
We know, in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it will
queue a work to workqueue blkcg_punt_bio_wq. But when the workqueue is
scheduled, it depends on the system load.  When system load is low, the
workqueue will be quickly scheduled, and the bio in bio_list will be
quickly processed in blkg_async_bio_workfn(), so there is less chance
that the same io submit thread can add multiple continuous bios to
bio_list before workqueue is scheduled to run. The analysis aligned with
above test "3".
When system load is high, there is some delay before the workqueue can
be scheduled to run, the higher the system load the greater the delay.
So there is more chance that the same io submit thread can add multiple
continuous bios to bio_list. Then when the workqueue is scheduled to run,
there are more continuous bios in bio_list, which will be processed in
blkg_async_bio_workfn(). The analysis aligned with above test "1" and "2".

According to test, we can get io performance improved with the patch,
especially when system load is higher. Another optimazition is to use
the plug only when bio_list contains at least 2 bios.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10 09:56:34 -06:00
Christoph Hellwig
95f6f3a46f block: add a bdev_check_media_change helper
Like check_disk_changed, except that it does not call ->revalidate_disk
but leaves that to the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-10 09:32:30 -06:00
Ritesh Harjani
2cd896a5e8 block: Set same_page to false in __bio_try_merge_page if ret is false
If we hit the UINT_MAX limit of bio->bi_iter.bi_size and so we are anyway
not merging this page in this bio, then it make sense to make same_page
also as false before returning.

Without this patch, we hit below WARNING in iomap.
This mostly happens with very large memory system and / or after tweaking
vm dirty threshold params to delay writeback of dirty data.

WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
 CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G        W         5.8.0-rc3 #6
 Call Trace:
  __remove_mapping+0x154/0x320 (unreliable)
  iomap_releasepage+0x80/0x180
  try_to_release_page+0x94/0xe0
  invalidate_inode_page+0xc8/0x110
  invalidate_mapping_pages+0x1dc/0x540
  generic_fadvise+0x3c8/0x450
  xfs_file_fadvise+0x2c/0xe0 [xfs]
  vfs_fadvise+0x3c/0x60
  ksys_fadvise64_64+0x68/0xe0
  sys_fadvise64+0x28/0x40
  system_call_exception+0xf8/0x1c0
  system_call_common+0xf0/0x278

Fixes: cc90bc6842 ("block: fix "check bi_size overflow before merge"")
Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-09 08:18:45 -06:00
Omar Sandoval
e8a8a18505 block: only call sched requeue_request() for scheduled requests
Yang Yang reported the following crash caused by requeueing a flush
request in Kyber:

  [    2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
  ...
  [    2.517468] pc : clear_bit+0x18/0x2c
  [    2.517502] lr : sbitmap_queue_clear+0x40/0x228
  [    2.517503] sp : ffffff800832bc60 pstate : 00c00145
  ...
  [    2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
  [    2.517602] Call trace:
  [    2.517606]  clear_bit+0x18/0x2c
  [    2.517619]  kyber_finish_request+0x74/0x80
  [    2.517627]  blk_mq_requeue_request+0x3c/0xc0
  [    2.517637]  __scsi_queue_insert+0x11c/0x148
  [    2.517640]  scsi_softirq_done+0x114/0x130
  [    2.517643]  blk_done_softirq+0x7c/0xb0
  [    2.517651]  __do_softirq+0x208/0x3bc
  [    2.517657]  run_ksoftirqd+0x34/0x60
  [    2.517663]  smpboot_thread_fn+0x1c4/0x2c0
  [    2.517667]  kthread+0x110/0x120
  [    2.517669]  ret_from_fork+0x10/0x18

This happens because Kyber doesn't track flush requests, so
kyber_finish_request() reads a garbage domain token. Only call the
scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for
the finish_request() hook in blk_mq_free_request()). Now that we're
handling it in blk-mq, also remove the check from BFQ.

Reported-by: Yang Yang <yang.yang@vivo.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-08 17:40:46 -06:00
Christoph Hellwig
fc93fe1453 block: make QUEUE_SYSFS_BIT_FNS more useful
Switch to the naming used by the other entries so that we can use the
QUEUE_RW_ENTRY helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-08 09:01:10 -06:00
Christoph Hellwig
3562614705 block: add helper macros for queue sysfs entries
Add two helpers macros to avoid boilerplate code for the queue sysfs
entries.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-08 09:01:10 -06:00
Christoph Hellwig
88ce2a530c block: restore a specific error code in bdev_del_partition
mdadm relies on the fact that deleting an invalid partition returns
-ENXIO or -ENOTTY to detect if a block device is a partition or a
whole device.

Fixes: 08fc1ab6d7 ("block: fix locking in bdev_del_partition")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-08 08:18:24 -06:00
Baolin Wang
ddfb8b0bed block: Remove unused blk_mq_sched_free_hctx_data()
Now we usually free the hctx->sched_data by e->type->ops.exit_hctx(),
and no users will use blk_mq_sched_free_hctx_data() function.
Remove it.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-07 20:11:15 -06:00
Jan Kara
384d87ef2c block: Do not discard buffers under a mounted filesystem
Discarding blocks and buffers under a mounted filesystem is hardly
anything admin wants to do. Usually it will confuse the filesystem and
sometimes the loss of buffer_head state (including b_private field) can
even cause crashes like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
Call Trace:
 __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
 jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
 kjournald2+0xbd/0x270 [jbd2]

So if we don't have block device open with O_EXCL already, claim the
block device while we truncate buffer cache. This makes sure any
exclusive block device user (such as filesystem) cannot operate on the
device while we are discarding buffer cache.

Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-07 20:10:55 -06:00
Greg Kroah-Hartman
3d3ef2a059 Linux 5.9-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl9VerweHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhc4H/iHD6qLdB36gZB6K
 oc2nJyrqyWitv4ti2Mnt5PA7o4wX4l6nnr1QvoaJ4BRs5Ja1czRvb2XDmdzqAoIA
 xITGoafqaAeDfxQ91bWrJsVN0pCRKiGwddXlU7TWmqw/riAkfOqi6GYKvav4biJH
 +n1mUPQb1M2IbRFsqkAS+ebKHq3CWaRvzKOEneS88nGlL5u31S9NAru8Ru/fkxRn
 6CwGcs1XRaBPYaZAhdfIb0NuatUlpkhPC9yhNS9up6SqrWmK3m65vmFVng6H0eCF
 fwn1jVztboY/XcNAi5sM9ExpQCql6WLQEEktVikqRDojC8fVtSx6W55tPt7qeaoO
 Z6t4/DA=
 =bcA4
 -----END PGP SIGNATURE-----

Merge 5.9-rc4 into android-mainline

Linux 5.9-rc4

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I3d041935cae5e8f3421edcdee4892f17e2c776ad
2020-09-07 09:24:58 +02:00
Linus Torvalds
8075fc3b11 block-5.9-2020-09-04
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9SWMMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphIcD/488Q7rXb2eABp1fGs4gu+VFOCLogeHL8xh
 5xHNiOPnZG2SGr8DQJY/7EX2kE65rbZi8/g+2N6anovI2nduRu0tzSra7fRgzbys
 ZQC1CUel0MbCd7e8OaEfg108PSHNxBf1PqDcE7zCeyZ0DIs3s4vK/bQtmzzxZHgU
 wNw4OIP9gOdqgjowb6GGHo9SLN4GT8rZ0jZVPLa7GwFsvxCTwv/7lHO8rqeSeuCu
 5H6i3M/rSbtTXPLHf4Fy97x9WmBmdgu4epTXiwbOxaagpx3lm/7n1P3CpavR+Gcq
 O5VGIIzazxPwnZl9y/6rZFLGYqcj38RxUvC8KtK6tDXxEu/BDJa1d6hXI03SyXAO
 ZAiEpQTKOkJE3R8ewUDrXLvl3p6FvwZVZ5SIFwUb+0JFrVQYwrgfoRJtzb5SIUan
 T9/bSYge7lFRI92FZRIqhvk8rsEBRdu7N/rQCyGf6GuZ0vRXWRAqN7T02iDn3czX
 pdGAepU5ymw8CwyUiNNnkY0DUaQLBIO9tCA9epxLwdroQ95vJtMPRBX1STQ65GVk
 XvMFAJqDAehQ/nP5xO60cWGZHyL7L/ccpofZlA/ytgAIZRa85GvhrdVy7yc6DKto
 wu6h2tkX9+ldoUjVbn/60T+Ft3QUTlfAuDfherkNoFNB/G5i1pzOHbwvL7B3czr3
 ZMjoNiOIqA==
 =8fvz
 -----END PGP SIGNATURE-----

Merge tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A bit larger than usual this week, mostly due to the NVMe fixes
  arriving late for -rc3 and hence didn't make last weeks pull request.

   - NVMe:
        - instance leak and io boundary fixes from Keith
        - fc locking fix from Christophe
        - various tcp/rdma reset during traffic fixes from Sagi
        - pci use-after-free fix from Tong
        - tcp target null deref fix from Ziye

   - Locking fix for partition removal (Christoph)

   - Ensure bdi->io_pages is always set (me)

   - Fixup for hd struct reference (Ming)

   - Fix for zero length bvecs (Ming)

   - Two small blk-iocost fixes (Tejun)"

* tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block:
  block: allow for_each_bvec to support zero len bvec
  blk-stat: make q->stats->lock irqsafe
  blk-iocost: ioc_pd_free() shouldn't assume irq disabled
  block: fix locking in bdev_del_partition
  block: release disk reference in hd_struct_free_work
  block: ensure bdi->io_pages is always initialized
  nvme-pci: cancel nvme device request before disabling
  nvme: only use power of two io boundaries
  nvme: fix controller instance leak
  nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
  nvme: Fix NULL dereference for pci nvme controllers
  nvme-rdma: fix reset hang if controller died in the middle of a reset
  nvme-rdma: fix timeout handler
  nvme-rdma: serialize controller teardown sequences
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-tcp: fix timeout handler
  nvme-tcp: serialize controller teardown sequences
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
2020-09-04 13:04:51 -07:00
Kashyap Desai
b445547ec1 blk-mq, elevator: Count requests per hctx to improve performance
High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
contention is possible for mq-deadline and bfq IO schedulers
when nr_hw_queues is more than one.

It is because kblockd work queue can submit IO from all online CPUs
(through blk_mq_run_hw_queues()) even though only one hctx has pending
commands.

The elevator callback .has_work for mq-deadline and bfq scheduler considers
pending work if there are any IOs on request queue but it does not account
hctx context.

Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering
the elevator even though there are no requests queued.

[jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap]

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03 15:20:47 -06:00
John Garry
f1b49fdc1c blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap
For when using a shared sbitmap, no longer should the number of active
request queues per hctx be relied on for when judging how to share the tag
bitmap.

Instead maintain the number of active request queues per tag_set, and make
the judgement based on that.

Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03 15:20:47 -06:00
John Garry
bccf5e26d9 blk-mq: Record nr_active_requests per queue for when using shared sbitmap
The per-hctx nr_active value can no longer be used to fairly assign a share
of tag depth per request queue for when using a shared sbitmap, as it does
not consider that the tags are shared tags over all hctx's.

For this case, record the nr_active_requests per request_queue, and make
the judgement based on that value.

Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03 15:20:47 -06:00
John Garry
a0235d230f blk-mq: Relocate hctx_may_queue()
blk-mq.h and blk-mq-tag.h include on each other, which is less than ideal.

Locate hctx_may_queue() to blk-mq.h, as it is not really tag specific code.

In this way, we can drop the blk-mq-tag.h include of blk-mq.h

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03 15:20:47 -06:00
John Garry
32bc15afed blk-mq: Facilitate a shared sbitmap per tagset
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.

In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec060
("blk-mq: drain I/O when all CPUs in a hctx are offline").

However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.

In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.

However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092 ("scsi:
 core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.

To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.

Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.

This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03 15:20:47 -06:00
John Garry
222a5ae03c blk-mq: Use pointers for blk_mq_tags bitmap tags
Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03 15:20:47 -06:00
John Garry
1c0706a70a blk-mq: Pass flags for tag init/free
Pass hctx/tagset flags argument down to blk_mq_init_tags() and
blk_mq_free_tags() for selective init/free.

For now, make it include the alloc policy flag, which can be evaluated
when needed (in blk_mq_init_tags()).

Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03 15:20:46 -06:00
Hannes Reinecke
4d063237b9 blk-mq: Free tags in blk_mq_init_tags() upon error
Since the tags are allocated in blk_mq_init_tags(), it's better practice
to free in that same function upon error, rather than a callee which is to
init the bitmap tags (blk_mq_init_tags()).

[jpg: Split from an earlier patch with a new commit message]

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03 15:20:46 -06:00
Hannes Reinecke
655ac30094 blk-mq: Rename blk_mq_update_tag_set_depth()
The function does not set the depth, but rather transitions from
shared to non-shared queues and vice versa.

So rename it to blk_mq_update_tag_set_shared() to better reflect
its purpose.

[jpg: take out some unrelated changes in blk_mq_init_bitmap_tags()]

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03 15:20:46 -06:00
Ming Lei
51db1c37ee blk-mq: Rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED
BLK_MQ_F_TAG_SHARED actually means that tags is shared among request
queues, all of which should belong to LUNs attached to same HBA.

So rename it to make the point explicitly.

[jpg: rebase a few times, add rnbd-clt.c change]

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03 15:20:46 -06:00
Christoph Hellwig
b8086d3f5a block: use revalidate_disk_size in set_capacity_revalidate_and_notify
Only virtio_blk and xen-blkfront set the revalidate argument to true,
and both do not implement the ->revalidate_disk method.  So switch
to the helper that just updates the size instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02 08:00:07 -06:00
Christoph Hellwig
f4ad06f2bb block: rename bd_invalidated
Replace bd_invalidate with a new BDEV_NEED_PART_SCAN flag in a bd_flags
variable to better describe the condition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-02 08:00:02 -06:00
Baolin Wang
265600b7b6 block: Remove a duplicative condition
Remove a duplicative condition to remove below cppcheck warnings:

"warning: Redundant condition: sched_allow_merge. '!A || (A && B)' is
equivalent to '!A || B' [redundantCondition]"

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:48:06 -06:00
Tejun Heo
0460375517 blk-iocost: restore inuse update tracepoints
Update and restore the inuse update tracepoints.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:33 -06:00
Ritika Srivastava
8327cce5ff block: better deal with the delayed not supported case in blk_cloned_rq_check_limits
If WRITE_ZERO/WRITE_SAME operation is not supported by the storage,
blk_cloned_rq_check_limits() will return IO error which will cause
device-mapper to fail the paths.

Instead, if the queue limit is set to 0, return BLK_STS_NOTSUPP.
BLK_STS_NOTSUPP will be ignored by device-mapper and will not fail the
paths.

Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:33 -06:00
Tejun Heo
f0bf84a5df blk-iocost: add three debug stat - cost.wait, indebt and indelay
These are really cheap to collect and can be useful in debugging iocost
behavior. Add them as debug stats for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:33 -06:00
Tejun Heo
ac33e91e2d blk-iocost: implement vtime loss compensation
When an iocg accumulates too much vtime or gets deactivated, we throw away
some vtime, which lowers the overall device utilization. As the exact amount
which is being thrown away is known, we can compensate by accelerating the
vrate accordingly so that the extra vtime generated in the current period
matches what got lost.

This significantly improves work conservation when involving high weight
cgroups with intermittent and bursty IO patterns.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:33 -06:00
Ritika Srivastava
143d2600fa block: Return blk_status_t instead of errno codes
Replace returning legacy errno codes with blk_status_t in
blk_cloned_rq_check_limits().

Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:33 -06:00
Tejun Heo
dda1315f18 blk-iocost: halve debts if device stays idle
A low weight iocg can amass a large amount of debt, for example, when
anonymous memory gets reclaimed aggressively. If the system has a lot of
memory paired with a slow IO device, the debt can span multiple seconds or
more. If there are no other subsequent IO issuers, the in-debt iocg may end
up blocked paying its debt while the IO device is idle.

This patch implements a mechanism to protect against such pathological
cases. If the device has been sufficiently idle for a substantial amount of
time, the debts are halved. The criteria are on the conservative side as we
want to resolve the rare extreme cases without impacting regular operation
by forgiving debts too readily.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:33 -06:00
Khazhismel Kumykov
9d3a39a5f1 block: grant IOPRIO_CLASS_RT to CAP_SYS_NICE
CAP_SYS_ADMIN is too broad, and ionice fits into CAP_SYS_NICE's grouping.

Retain CAP_SYS_ADMIN permission for backwards compatibility.

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:33 -06:00
Tejun Heo
c421a3eb2e blk-iocost: revamp debt handling
Debt handling had several issues.

* How much inuse a debtor carries wasn't clearly defined. inuse would be
  driven down over time from not issuing IOs but it'd be better to clamp it
  to minimum immediately once in debt.

* How much can be paid off was determined by hweight_inuse. As inuse was
  driven down, the payment amount would fall together regardless of the
  debtor's active weight. This means that the debtors were punished harshly.

* ioc_rqos_merge() wasn't calling blkcg_schedule_throttle() after
  iocg_kick_delay().

This patch revamps debt handling so that

* Debt handling owns inuse for iocgs in debt and keeps them at zero.

* Payment amount is determined by hweight_active. This is more deterministic
  and safer than hweight_inuse but still far from ideal in that it doesn't
  factor in possible donations from other iocgs for debt payments. This
  likely needs further improvements in the future.

* iocg_rqos_merge() now calls blkcg_schedule_throttle() as necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
97eb19751f blk-iocost: add absolute usage stat
Currently, iocost doesn't collect or expose any statistics punting off all
monitoring duties to drgn based iocost_monitor.py. While it works for some
scenarios, there are some usability and data availability challenges. For
example, accurate per-cgroup usage information can't be tracked by vtime
progression at all and the number available in iocg->usages[] are really
short-term snapshots used for control heuristics with possibly significant
errors.

This patch implements per-cgroup absolute usage stat counter and exposes it
through io.stat along with the current vrate. Usage stat collection and
flushing employ the same method as cgroup rstat on the active iocg's and the
only hot path overhead is preemption toggling and adding to a percpu
counter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
5160a5a53c blk-iocost: implement delay adjustment hysteresis
Curently, iocost syncs the delay duration to the outstanding debt amount,
which seemed enough to protect the system from anon memory hogs. However,
that was mostly because the delay calcuation was using hweight_inuse which
quickly converges towards zero under debt for delay duration calculation,
often pusnishing debtors overly harshly for longer than deserved.

The previous patch fixed the delay calcuation and now the protection against
anonymous memory hogs isn't enough because the effect of delay is indirect
and non-linear and a huge amount of future debt can accumulate abruptly
while unthrottled.

This patch implements delay hysteresis so that delay is decayed
exponentially over time instead of getting cleared immediately as debt is
paid off. While the overall behavior is similar to the blk-cgroup
implementation used by blk-iolatency, a lot of the details are different and
due to the empirical nature of the mechanism, it's challenging to adapt the
mechanism for one controller without negatively impacting the other.

As the delay is gradually decayed now, there's no point in running it from
its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and
the dedicated hrtimer is removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
b0853ab4a2 blk-iocost: revamp in-period donation snapbacks
When the margin drops below the minimum on a donating iocg, donation is
immediately canceled in full. There are a couple shortcomings with the
current behavior.

* It's abrupt. A small temporary budget deficit can lead to a wide swing in
  weight allocation and a large surplus.

* It's open coded in the issue path but not implemented for the merge path.
  A series of merges at a low inuse can make the iocg incur debts and stall
  incorrectly.

This patch reimplements in-period donation snapbacks so that

* inuse adjustment and cost calculations are factored into
  adjust_inuse_and_calc_cost() which is called from both the issue and merge
  paths.

* Snapbacks are more gradual. It occurs in quarter steps.

* A snapback triggers if the margin goes below the low threshold and is
  lower than the budget at the time of the last adjustment.

* For the above, __propagate_weights() stores the margin in
  iocg->saved_margin. Move iocg->last_inuse storing together into
  __propagate_weights() for consistency.

* Full snapback is guaranteed when there are waiters.

* With precise donation and gradual snapbacks, inuse adjustments are now a
  lot more effective and the value of scaling inuse on weight changes isn't
  clear. Removed inuse scaling from weight_update().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
da437b95db blk-iocost: grab ioc->lock for debt handling
Currently, debt handling requires only iocg->waitq.lock. In the future, we
want to adjust and propagate inuse changes depending on debt status. Let's
grab ioc->lock in debt handling paths in preparation.

* Because ioc->lock nests outside iocg->waitq.lock, the decision to grab
  ioc->lock needs to be made before entering the critical sections.

* Add and use iocg_[un]lock() which handles the conditional double locking.

* Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when
  the caller grabbed both locks.

This patch is prepatory and the comments contain references to future
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
f1de2439ec blk-iocost: revamp donation amount determination
iocost has various safety nets to combat inuse adjustment calculation
inaccuracies. With Andy's method implemented in transfer_surpluses(), inuse
adjustment calculations are now accurate and we can make donation amount
determinations accurate too.

* Stop keeping track of past usage history and using the maximum. Act on the
  immediate usage information.

* Remove donation constraints defined by SURPLUS_* constants. Donate
  whatever isn't used.

* Determine the donation amount so that the iocg will end up with
  MARGIN_TARGET_PCT budget at the end of the coming period assuming the same
  usage as the previous period. TARGET is set at 50% of period, which is the
  previous maximum. This provides smooth convergence for most repetitive IO
  patterns.

* Apply donation logic early at 20% budget. There's no risk in doing so as
  the calculation is based on the delta between the current budget and the
  target budget at the end of the coming period.

* Remove preemptive iocg activation for zero cost IOs. As donation can reach
  near zero now, the mere activation doesn't provide any protection anymore.
  In the unlikely case that this becomes a problem, the right solution is
  assigning appropriate costs for such IOs.

This significantly improves the donation determination logic while also
simplifying it. Now all donations are immediate, exact and smooth.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
7ca5b2e60b blk-iocost: streamline vtime margin and timer slack handling
The margin handling was pretty inconsistent.

* ioc->margin_us and ioc->inuse_margin_vtime were used as vtime margin
  thresholds. However, the two are in different units with the former
  requiring conversion to vtime on use.

* iocg_kick_waitq() was using a quarter of WAITQ_TIMER_MARGIN_PCT of
  period_us as the timer slack - ~1.2%. While iocg_kick_delay() was using a
  quarter of ioc->margin_us - ~12.5%. There aren't strong reasons to use
  different values for the two.

This patch cleans up margin and timer slack handling:

* vtime margins are now recorded in ioc->margins.{min, max} on period
  duration changes and used consistently.

* Timer slack is now 1% of period_us and recorded in ioc->timer_slack_ns and
  used consistently for iocg_kick_waitq() and iocg_kick_delay().

The only functional change is shortening of timer slack. No meaningful
visible change is expected.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
e08d02aa5f blk-iocost: implement Andy's method for donation weight updates
iocost implements work conservation by reducing iocg->inuse and propagating
the adjustment upwards proportionally. However, while I knew the target
absolute hierarchical proportion - adjusted hweight_inuse, I couldn't figure
out how to determine the iocg->inuse adjustment to achieve that and
approximated the adjustment by scaling iocg->inuse using the proportion of
the needed hweight_inuse changes.

When nested, these scalings aren't accurate even when adjusting a single
node as the donating node also receives the benefit of the donated portion.
When multiple nodes are donating as they often do, they can be wildly wrong.

iocost employed various safety nets to combat the inaccuracies. There are
ample buffers in determining how much to donate, the adjustments are
conservative and gradual. While it can achieve a reasonable level of work
conservation in simple scenarios, the inaccuracies can easily add up leading
to significant loss of total work. This in turn makes it difficult to
closely cap vrate as vrate adjustment is needed to compensate for the loss
of work. The combination of inaccurate donation calculations and vrate
adjustments can lead to wide fluctuations and clunky overall behaviors.

Andy Newell devised a method to calculate the needed ->inuse updates to
achieve the target hweight_inuse's. The method is compatible with the
proportional inuse adjustment propagation which allows all hot path
operations to be local to each iocg.

To roughly summarize, Andy's method divides the tree into donating and
non-donating parts, calculates global donation rate which is used to
determine the target hweight_inuse for each node, and then derives per-level
proportions. There's non-trivial amount of math involved. Please refer to
the following pdfs for detailed descriptions.

  https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
  https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
  https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN

This patch implements Andy's method in transfer_surpluses(). This makes the
donation calculations accurate per cycle and enables further improvements in
other parts of the donation logic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
ce95570acf blk-iocost: make ioc_now->now and ioc->period_at 64bit
They are in microseconds and wrap in around 1.2 hours with u32. While
unlikely, confusions from wraparounds are still possible. We aren't saving
anything meaningful by keeping these u32. Let's make them u64.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
93f7d2db80 blk-iocost: restructure surplus donation logic
The way the surplus donation logic is structured isn't great. There are two
separate paths for starting/increasing donations and decreasing them making
the logic harder to follow and is prone to unnecessary behavior differences.

In preparation for improved donation handling, this patch restructures the
code so that

* All donors - new, increasing and decreasing - are funneled through the
  same code path.

* The target donation calculation is factored into hweight_after_donation()
  which is called once from the same spot for all possible donors.

* Actual inuse adjustment is factored into trasnfer_surpluses().

This change introduces a few behavior differences - e.g. donation amount
reduction now uses the max usage of the recent three periods just like new
and increasing donations, and inuse now gets adjusted upwards the same way
it gets downwards. These differences are unlikely to have severely negative
implications and the whole logic will be revamped soon.

This patch also removes two tracepoints. The existing TPs don't quite fit
the new implementation. A later patch will update and reinstate them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
bd0adb91a6 blk-iocost: use WEIGHT_ONE based fixed point number for weights
To improve weight donations, we want to able to scale inuse with a greater
accuracy and down below 1. Let's make non-hierarchical weights to use
WEIGHT_ONE based fixed point numbers too like hierarchical ones.

This doesn't cause any behavior changes yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
065655c862 blk-iocost: decouple vrate adjustment from surplus transfers
Budget donations are inaccurate and could take multiple periods to converge.
To prevent triggering vrate adjustments while surplus transfers were
catching up, vrate adjustment was suppressed if donations were increasing,
which was indicated by non-zero nr_surpluses.

This entangling won't be necessary with the scheduled rewrite of donation
mechanism which will make it precise and immediate. Let's decouple the two
in preparation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
fe20cdb516 blk-iocost: s/HWEIGHT_WHOLE/WEIGHT_ONE/g
We're gonna use HWEIGHT_WHOLE for regular weights too. Let's rename it to
WEIGHT_ONE.

Pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
8692d2db8e blk-iocost: replace iocg->has_surplus with ->surplus_list
Instead of marking iocgs with surplus with a flag and filtering for them
while walking all active iocgs, build a surpluses list. This doesn't make
much difference now but will help implementing improved donation logic which
will iterate iocgs with surplus multiple times.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Tejun Heo
1aa50d020c blk-iocost: calculate iocg->usages[] from iocg->local_stat.usage_us
Currently, iocg->usages[] which are used to guide inuse adjustments are
calculated from vtime deltas. This, however, assumes that the hierarchical
inuse weight at the time of calculation held for the entire period, which
often isn't true and can lead to significant errors.

Now that we have absolute usage information collected, we can derive
iocg->usages[] from iocg->local_stat.usage_us so that inuse adjustment
decisions are made based on actual absolute usage. The calculated usage is
clamped between 1 and WEIGHT_ONE and WEIGHT_ONE is also used to signal
saturation regardless of the current hierarchical inuse weight.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:32 -06:00
Christoph Hellwig
1f06959bd2 block: remove the unused q argument to part_in_flight and part_in_flight_rw
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:31 -06:00
Tejun Heo
7b84b49e38 blk-iocost: make iocg_kick_waitq() call iocg_kick_delay() after paying debt
iocg_kick_waitq() is the function which pays debt and iocg_kick_delay()
updates the actual delay status accordingly. If iocg_kick_delay() is not
called after iocg_kick_delay() updated debt, unnecessarily large delays can
be applied temporarily.

Let's make sure such conditions don't occur by making iocg_kick_waitq()
always call iocg_kick_delay() after paying debt.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:31 -06:00
Tejun Heo
6ef20f787b blk-iocost: move iocg_kick_delay() above iocg_kick_waitq()
We'll make iocg_kick_waitq() call iocg_kick_delay(). Reorder them in
preparation. This is pure code reorganization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:31 -06:00
Tejun Heo
db84a72af6 blk-iocost: clamp inuse and skip noops in __propagate_weights()
__propagate_weights() currently expects the callers to clamp inuse within
[1, active], which is needlessly fragile. The inuse adjustment logic is
going to be revamped, in preparation, let's make __propagate_weights() clamp
inuse on entry.

Also, make it avoid weight updates altogether if neither active or inuse is
changed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:31 -06:00
Tejun Heo
00410f1b09 blk-iocost: rename propagate_active_weights() to propagate_weights()
It already propagates two weights - active and inuse - and there will be
another soon. Let's drop the confusing misnomers. Rename
[__]propagate_active_weights() to [__]propagate_weights() and
commit_active_weights() to commit_weights().

This is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:31 -06:00
Tejun Heo
5e124f7432 blk-iocost: use local[64]_t for percpu stat
blk-iocost has been reading percpu stat counters from remote cpus which on
some archs can lead to torn reads in really rare occassions. Use local[64]_t
for those counters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:31 -06:00
Christoph Hellwig
8328eb2836 block: remove the disk argument to delete_partition
We can trivially derive the gendisk from the hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 19:38:25 -06:00
Christoph Hellwig
f93af2a494 block: cleanup __alloc_disk_node
Use early returns and goto-based unwinding to simplify the flow a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Christoph Hellwig
7cf34d97ab block: remove the discard_alignment field from struct hd_struct
The alignment offset is only used in slow path callers, so just calculate
it on the fly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Christoph Hellwig
7b8917f5e2 block: remove the alignment_offset field from struct hd_struct
The alignment offset is only used in slow path callers, so just calculate
it on the fly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Xianting Tian
e44a6a2359 blk-mq: use BLK_MQ_NO_TAG for no tag
Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Baolin Wang
cdfcef9ee8 block: Remove blk_mq_attempt_merge() function
The small blk_mq_attempt_merge() function is only called by
__blk_mq_sched_bio_merge(), just open code it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Baolin Wang
7d7ca7c526 block: Add a new helper to attempt to merge a bio
There are lots of duplicated code when trying to merge a bio from
plug list and sw queue, we can introduce a new helper to attempt
to merge a bio, which can simplify the blk_bio_list_merge()
and blk_attempt_plug_merge().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Baolin Wang
bdc6a287bc block: Move blk_mq_bio_list_merge() into blk-merge.c
Move the blk_mq_bio_list_merge() into blk-merge.c and
rename it as a generic name.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Baolin Wang
8e756373d7 block: Move bio merge related functions into blk-merge.c
It's better to move bio merge related functions into blk-merge.c,
which contains all merge related functions.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Danny Lin
339b5a25c2 blk-wbt: Remove obsolete multiqueue I/O scheduling comment
This comment was added before the multiqueue I/O scheduler framework
was introduced; multiqueue has support for I/O scheduling now, so this
obsolete comment can be removed.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Christoph Hellwig
3310eebafe block: remove the BIO_USER_MAPPED flag
Just check if there is private data, in which case the bio must have
originated from bio_copy_user_iov.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:26 -06:00
Christoph Hellwig
7589ad6729 block: remove __blk_rq_map_user_iov
Just duplicate a small amount of code in the low-level map into the bio
and copy to the bio routines, leading to much easier to follow and
maintain code, and better shared error handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:25 -06:00
Christoph Hellwig
7b63c052a5 block: remove __blk_rq_unmap_user
Open code __blk_rq_unmap_user in the two callers.  Both never pass a NULL
bio, and one of them can use an existing local variable instead of the bio
flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:25 -06:00
Christoph Hellwig
f3256075ba block: remove the BIO_NULL_MAPPED flag
We can simply use a boolean flag in the bio_map_data data structure
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:25 -06:00
Christoph Hellwig
c2b4bb8cb3 block: fix locking for struct block_device size updates
Two different callers use two different mutexes for updating the
block device size, which obviously doesn't help to actually protect
against concurrent updates from the different callers.  In addition
one of the locks, bd_mutex is rather prone to deadlocks with other
parts of the block stack that use it for high level synchronization.

Switch to using a new spinlock protecting just the size updates, as
that is all we need, and make sure everyone does the update through
the proper helper.

This fixes a bug reported with the nvme revalidating disks during a
hot removal operation, which can currently deadlock on bd_mutex.

Reported-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:25 -06:00
Jens Axboe
a98278ecfb Merge branch 'block-5.9' into for-5.10/block
* block-5.9:
  blk-stat: make q->stats->lock irqsafe
  blk-iocost: ioc_pd_free() shouldn't assume irq disabled
  block: fix locking in bdev_del_partition
  block: release disk reference in hd_struct_free_work
  block: ensure bdi->io_pages is always initialized
  nvme-pci: cancel nvme device request before disabling
  nvme: only use power of two io boundaries
  nvme: fix controller instance leak
  nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
  nvme: Fix NULL dereference for pci nvme controllers
  nvme-rdma: fix reset hang if controller died in the middle of a reset
  nvme-rdma: fix timeout handler
  nvme-rdma: serialize controller teardown sequences
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-tcp: fix timeout handler
  nvme-tcp: serialize controller teardown sequences
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
2020-09-01 16:49:20 -06:00
Tejun Heo
e11d80a849 blk-stat: make q->stats->lock irqsafe
blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe lock
which triggers a lockdep splat because q->stats->lock isn't irqsafe. Let's
make it irqsafe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: cd006509b0 ("blk-iocost: account for IO size when testing latencies")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:48:46 -06:00
Tejun Heo
5aeac7c4b1 blk-iocost: ioc_pd_free() shouldn't assume irq disabled
ioc_pd_free() grabs irq-safe ioc->lock without ensuring that irq is disabled
when it can be called with irq disabled or enabled. This has a small chance
of causing A-A deadlocks and triggers lockdep splats. Use irqsave operations
instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:48:44 -06:00
Christoph Hellwig
08fc1ab6d7 block: fix locking in bdev_del_partition
We need to hold the whole device bd_mutex to protect against
other thread concurrently deleting out partition before we get
to it, and thus causing a use after free.

Fixes: cddae808ae ("block: pass a hd_struct to delete_partition")
Reported-by: syzbot+6448f3c229bc52b82f69@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 08:35:35 -06:00
Ming Lei
cafe01ef8f block: release disk reference in hd_struct_free_work
Commit e8c7d14ac6 ("block: revert back to synchronous request_queue removal")
stops to release request queue from wq context because that commit
supposed all blk_put_queue() is called in context which is allowed
to sleep. However, this assumption isn't true because we release disk's
reference in partition's percpu_ref's ->release() which doesn't allow
to sleep, because the ->release() is run via call_rcu().

Fixes this issue by moving put disk reference into hd_struct_free_work()

Fixes: e8c7d14ac6 ("block: revert back to synchronous request_queue removal")
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 08:34:08 -06:00
Jens Axboe
de1b0ee490 block: ensure bdi->io_pages is always initialized
If a driver leaves the limit settings as the defaults, then we don't
initialize bdi->io_pages. This means that file systems may need to
work around bdi->io_pages == 0, which is somewhat messy.

Initialize the default value just like we do for ->ra_pages.

Cc: stable@vger.kernel.org
Fixes: 9491ae4aad ("mm: don't cap request size based on read-ahead setting")
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 08:00:14 -06:00
Greg Kroah-Hartman
f022c0602c Merge 5.9-rc3 into android-mainline
Linux 5.9-rc3

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic7758bc57a7d91861657388ddd015db5c5db5480
2020-08-31 19:51:25 +02:00
Linus Torvalds
c41c3ec4a2 io_uring-5.9-2020-08-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9CwtMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsehEAC4ReB53LLbZxqcmoA2RNs9yz1I4DM2PU6z
 C+NSGGEnAFHQAhLbfCAzxbtQa6x/m64zoLd+8zHZNAeanJXarszcgSuqhXQFlEfX
 7Jz/vdXGdu7Q4zgkLuO3FxleDoPoUC5qOSFHWYtMu6KvHLOkmc9DvdSUsFMDSThX
 6RsoaQY2gDOD/pwtm8Cqmy89nLZdFoyxadXyk/lzxLodjeRZOwoVc+YM8YWmrXZ0
 mKEEuO4uBWxUUmoyAwUABNqWWAkwTDEhrYCiiG81DkAa1Cu0mRXodN0xycr72cLZ
 Ik2OlnTLCE6B0UXsBu2c0+qXGArWsvDyhEEkwF+O+Ump4IBIr72EmgZb+o2nnkXo
 Uu4X/r0qeQ6XD+vBTHcE6oPUjJhV6uEXXon5aesE+vh277ILmHgMyjJKaSiJcY/E
 efM5SuPRq2kuROKWLKiLJnpuJ/9ZTU/4nk4k1pOlWWOVGLHien0sWBBzQ+iWr6mm
 eRl5EkI3JoahqIrNFz0+qF3DwKPVfu+B02/EzA8OXoYHIRV9KMS5eWX5hK12aZ3i
 4AT3xuAanfcNs4qBAScOfHQxQu9U5Z7Mu4JQJ58xdsJd+UWBnbznUmSLob9KKk+c
 X8AvAcYhb684F87VCmaCzDlIPMb46OYxLBgI6sz7L0xdc7i8TCeeEDbQCN1HixZ3
 SNtKzalNXA==
 =fAwK
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Sagi:
       - nvme completion rework from Christoph and Chao that mostly came
         from a bit of divergence of how we classify errors related to
         pathing/retry etc.
       - nvmet passthru fixes from Chaitanya
       - minor nvmet fixes from Amit and I
       - mpath round-robin path selection fix from Martin
       - ignore noiob for zoned devices from Keith
       - minor nvme-fc fix from Tianjia"

 - BFQ cgroup leak fix (Dmitry)

 - block layer MAINTAINERS addition (Geert)

 - fix null_blk FUA checking (Hou)

 - get_max_io_size() size fix (Keith)

 - fix block page_is_mergeable() for compound pages (Matthew)

 - discard granularity fixes (Ming)

 - IO scheduler ordering fix (Ming)

 - misc fixes

* tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block: (31 commits)
  null_blk: fix passing of REQ_FUA flag in null_handle_rq
  nvmet: Disable keep-alive timer when kato is cleared to 0h
  nvme: redirect commands on dying queue
  nvme: just check the status code type in nvme_is_path_error
  nvme: refactor command completion
  nvme: rename and document nvme_end_request
  nvme: skip noiob for zoned devices
  nvme-pci: fix PRP pool size
  nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth
  nvme: Use spin_lock_irq() when taking the ctrl->lock
  nvmet: call blk_mq_free_request() directly
  nvmet: fix oops in pt cmd execution
  nvmet: add ns tear down label for pt-cmd handling
  nvme: multipath: round-robin: eliminate "fallback" variable
  nvme: multipath: round-robin: fix single non-optimized path case
  nvme-fc: Fix wrong return value in __nvme_fc_init_request()
  nvmet-passthru: Reject commands with non-sgl flags set
  nvmet: fix a memory leak
  blkcg: fix memleak for iolatency
  MAINTAINERS: Add missing header files to BLOCK LAYER section
  ...
2020-08-24 11:53:15 -07:00
Gustavo A. R. Silva
df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Yufen Yu
27029b4b18 blkcg: fix memleak for iolatency
Normally, blkcg_iolatency_exit() will free related memory in iolatency
when cleanup queue. But if blk_throtl_init() return error and queue init
fail, blkcg_iolatency_exit() will not do that for us. Then it cause
memory leak.

Fixes: d706751215 ("block: introduce blk-iolatency io controller")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 17:14:27 -06:00
Keith Busch
e4b469c66f block: fix get_max_io_size()
A previous commit aligning splits to physical block sizes inadvertently
modified one return case such that that it now returns 0 length splits
when the number of sectors doesn't exceed the physical offset. This
later hits a BUG in bio_split(). Restore the previous working behavior.

Fixes: 9cc5169cd4 ("block: Improve physical block alignment of split bios")
Reported-by: Eric Deal <eric.deal@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 17:09:22 -06:00
Ming Lei
db03f88fae blk-mq: insert request not through ->queue_rq into sw/scheduler queue
c616cbee97 ("blk-mq: punt failed direct issue to dispatch list") supposed
to add request which has been through ->queue_rq() to the hw queue dispatch
list, however it adds request running out of budget or driver tag to hw queue
too. This way basically bypasses request merge, and causes too many request
dispatched to LLD, and system% is unnecessary increased.

Fixes this issue by adding request not through ->queue_rq into sw/scheduler
queue, and this way is safe because no ->queue_rq is called on this request
yet.

High %system can be observed on Azure storvsc device, and even soft lock
is observed. This patch reduces %system during heavy sequential IO,
meantime decreases soft lockup risk.

Fixes: c616cbee97 ("blk-mq: punt failed direct issue to dispatch list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 17:09:22 -06:00
Dmitry Monakhov
2de791ab49 bfq: fix blkio cgroup leakage v4
Changes from v1:
    - update commit description with proper ref-accounting justification

commit db37a34c56 ("block, bfq: get a ref to a group when adding it to a service tree")
introduce leak forbfq_group and blkcg_gq objects because of get/put
imbalance.
In fact whole idea of original commit is wrong because bfq_group entity
can not dissapear under us because it is referenced by child bfq_queue's
entities from here:
 -> bfq_init_entity()
    ->bfqg_and_blkg_get(bfqg);
    ->entity->parent = bfqg->my_entity

 -> bfq_put_queue(bfqq)
    FINAL_PUT
    ->bfqg_and_blkg_put(bfqq_group(bfqq))
    ->kmem_cache_free(bfq_pool, bfqq);

So parent entity can not disappear while child entity is in tree,
and child entities already has proper protection.
This patch revert commit db37a34c56 ("block, bfq: get a ref to a group when adding it to a service tree")

bfq_group leak trace caused by bad commit:
-> blkg_alloc
   -> bfq_pq_alloc
     -> bfqg_get (+1)
->bfq_activate_bfqq
  ->bfq_activate_requeue_entity
    -> __bfq_activate_entity
       ->bfq_get_entity
         ->bfqg_and_blkg_get (+1)  <==== : Note1
->bfq_del_bfqq_busy
  ->bfq_deactivate_entity+0x53/0xc0 [bfq]
    ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
      -> bfq_forget_entity(is_in_service = true)
	 entity->on_st_or_in_serv = false   <=== :Note2
	 if (is_in_service)
	     return;  ==> do not touch reference
-> blkcg_css_offline
 -> blkcg_destroy_blkgs
  -> blkg_destroy
   -> bfq_pd_offline
    -> __bfq_deactivate_entity
         if (!entity->on_st_or_in_serv) /* true, because (Note2)
		return false;
 -> bfq_pd_free
    -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
So bfq_group and blkcg_gq  will leak forever, see test-case below.

##TESTCASE_BEGIN:
#!/bin/bash

max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler

grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
    mkdir -p /sys/fs/cgroup/blkio/a
    echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
    dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
    echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
    rmdir /sys/fs/cgroup/blkio/a
    grep blkio /proc/cgroups
done
##TESTCASE_END:

Fixes: db37a34c56 ("block, bfq: get a ref to a group when adding it to a service tree")
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-18 07:48:08 -07:00
Matthew Wilcox (Oracle)
d81665198b block: Fix page_is_mergeable() for compound pages
If we pass in an offset which is larger than PAGE_SIZE, then
page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
leading to a large number of bio_vecs being used.  Use a slightly more
obvious test that the two pages are compatible with each other.

Fixes: 52d52d1c98 ("block: only allow contiguous page structs in a bio_vec")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-17 19:35:53 -07:00
Ming Lei
943b40c832 block: respect queue limit of max discard segment
When queue_max_discard_segments(q) is 1, blk_discard_mergable() will
return false for discard request, then normal request merge is applied.
However, only queue_max_segments() is checked, so max discard segment
limit isn't respected.

Check max discard segment limit in the request merge code for fixing
the issue.

Discard request failure of virtio_blk is fixed.

Fixes: 6984046608 ("block: fix the DISCARD request merge")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-17 06:59:41 -07:00
Ming Lei
d7d8535f37 blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART
SCHED_RESTART code path is relied to re-run queue for dispatch requests
in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding
requests to hctx->dispatch.

memory barriers have to be used for ordering the following two pair of OPs:

1) adding requests to hctx->dispatch and checking SCHED_RESTART in
blk_mq_dispatch_rq_list()

2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
in blk_mq_sched_restart().

Without the added memory barrier, either:

1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime
blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in
dispatch side

or

2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue
in dispatch side, meantime checking if there is request in
hctx->dispatch from blk_mq_sched_restart() is missed.

IO hang in ltp/fs_fill test is reported by kernel test robot:

	https://lkml.org/lkml/2020/7/26/77

Turns out it is caused by the above out-of-order OPs. And the IO hang
can't be observed any more after applying this patch.

Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-17 06:57:49 -07:00
Greg Kroah-Hartman
df7e491a37 Merge 4b6c093e21 ("Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block") into android-mainline
Steps on the way to 5.9-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I904678b5c31139b25fb49ee67cbacb4721a3c7bc
2020-08-17 09:19:35 +02:00
Xu Wang
03ef5941a0 bsg-lib: convert comma to semicolon
Replace a comma between expression statements by a semicolon.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-16 20:07:12 -07:00
Randy Dunlap
26bfeb2662 block: blk-mq.c: fix @at_head kernel-doc warning
Fix a kernel-doc warning in block/blk-mq.c:

../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert'

Fixes: 01e99aeca3 ("blk-mq: insert passthrough request into hctx->dispatch directly")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: André Almeida <andrealmeid@collabora.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-16 16:50:28 -07:00
Linus Torvalds
4b6c093e21 block-5.9-2020-08-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl83DGUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplTYEACs5kPPpSVdzVsOeHj0LkfQXiSlli8dbPBo
 NB2+TIyr9NxJgxn8B4x+5/4DZgJaCoHOeyyzOocXQmvWGwWOTkrfX/OSyQOlRB5z
 dpzqF0Huhw31MSQEiwA/8lo3omBmat9cMzMa5PJYPghMGfqyQDzVJk1lIX51a1th
 oE01eBpNNsDK0OTwKrl6Rx2/OuFZnA0P3lQwgPZSLnDM6Hq+xeHTdx2LNSyE2QFv
 GzYl4dFoXg3NReLv9D57b7hE6Dc95NcCDDeU7Y3cE7XPksKMA/TkVYOD20ysJ31l
 9uzscvvcm2UugN2r0d/B35lf6NWmOG24SmkLMKTtExPGHOCQIbDAlSP/QQ4zz9pQ
 2yA+eImpQnRsCzPbGcnBzwEF3yX5+lQYmFWac+0AHDiWEWkb8e3MzNSWPZrsN+cD
 7U7c5Zw6zDEtl/naJccuZZPgQGbZgFJ/P6Wo6l5ywIPtE7wzv4MUe4eUxZhitL9M
 0ZP6WIQd8oNQdNoCYVQDwPdYJYMq7uUQFUo40vaSfntZxVKZQao7cvUHwmzVzNlZ
 v5UazETAx+4Eg6MNwfjKp+kt3rr6Xul7K9Nzn6R/cVacIU349FovUshm7WieoAUu
 niZ40gXltxj7NDwHj3p/dqesW5Nhv/qk6hlVWoi9vdmh8vAVBy/fedQfocvKrFJy
 prCI1h1UOQ==
 =10Pr
 -----END PGP SIGNATURE-----

Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes on the block side of things:

   - Discard granularity fix (Coly)

   - rnbd cleanups (Guoqing)

   - md error handling fix (Dan)

   - md sysfs fix (Junxiao)

   - Fix flush request accounting, which caused an IO slowdown for some
     configurations (Ming)

   - Properly propagate loop flag for partition scanning (Lennart)"

* tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
  block: fix double account of flush request's driver tag
  loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
  rnbd: no need to set bi_end_io in rnbd_bio_map_kern
  rnbd: remove rnbd_dev_submit_io
  md-cluster: Fix potential error pointer dereference in resize_bitmaps()
  block: check queue's limits.discard_granularity in __blkdev_issue_discard()
  md: get sysfs entry after redundancy attr group create
2020-08-15 20:36:42 -07:00
Ming Lei
c1e2b8422b block: fix double account of flush request's driver tag
In case of none scheduler, we share data request's driver tag for
flush request, so have to mark the flush request as INFLIGHT for
avoiding double account of this driver tag.

Fixes: 568f270065 ("blk-mq: centralise related handling into blk_mq_get_driver_tag")
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-11 13:53:32 -06:00
Eric Biggers
6b78d00f74 ANDROID: block/keyslot-manager: fix checkpatch warning
Fix a checkpatch warning that I fixed when applying
ANDROID-dm-add-support-for-passing-through-inline-crypto-support.patch
to android12-5.4 (http://aosp/1394279).  This gets android-mainline in
sync.

This should be folded into
ANDROID-dm-add-support-for-passing-through-inline-crypto-support.patch.

Bug: 162257830
Change-Id: I083209d174630220e3b6d148d7317600ce040a84
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-08-11 09:44:23 -07:00
Greg Kroah-Hartman
d7b0856eac Merge 00e4db5125 ("Merge tag 'perf-tools-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux") into android-mainline
Tiny steps on the way to 5.9-rc1.

Fixes conflicts in:
	fs/f2fs/inline.c

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I16d863ae44a51156499458e8c3486587cbe2babe
2020-08-11 10:57:46 +02:00
Linus Torvalds
97d052ea3f A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in various
     situations caused by the lockdep additions to seqcount to validate that
     the write side critical sections are non-preemptible.
 
   - The seqcount associated lock debug addons which were blocked by the
     above fallout.
 
     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict per
     CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot
     validate that the lock is held.
 
     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored and
     write_seqcount_begin() has a lockdep assertion to validate that the
     lock is held.
 
     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API is
     unchanged and determines the type at compile time with the help of
     _Generic which is possible now that the minimal GCC version has been
     moved up.
 
     Adding this lockdep coverage unearthed a handful of seqcount bugs which
     have been addressed already independent of this.
 
     While generaly useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if the
     writers are serialized by an associated lock, which leads to the well
     known reader preempts writer livelock. RT prevents this by storing the
     associated lock pointer independent of lockdep in the seqcount and
     changing the reader side to block on the lock when a reader detects
     that a writer is in the write side critical section.
 
  - Conversion of seqcount usage sites to associated types and initializers.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8xmPYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTuQEACyzQCjU8PgehPp9oMqWzaX2fcVyuZO
 QU2yw6gmz2oTz3ZHUNwdW8UnzGh2OWosK3kDruoD9FtSS51lER1/ISfSPCGfyqxC
 KTjOcB1Kvxwq/3LcCx7Zi3ZxWApat74qs3EhYhKtEiQ2Y9xv9rLq8VV1UWAwyxq0
 eHpjlIJ6b6rbt+ARslaB7drnccOsdK+W/roNj4kfyt+gezjBfojGRdMGQNMFcpnv
 shuTC+vYurAVIiVA/0IuizgHfwZiXOtVpjVoEWaxg6bBH6HNuYMYzdSa/YrlDkZs
 n/aBI/Xkvx+Eacu8b1Zwmbzs5EnikUK/2dMqbzXKUZK61eV4hX5c2xrnr1yGWKTs
 F/juh69Squ7X6VZyKVgJ9RIccVueqwR2EprXWgH3+RMice5kjnXH4zURp0GHALxa
 DFPfB6fawcH3Ps87kcRFvjgm6FBo0hJ1AxmsW1dY4ACFB9azFa2euW+AARDzHOy2
 VRsUdhL9CGwtPjXcZ/9Rhej6fZLGBXKr8uq5QiMuvttp4b6+j9FEfBgD4S6h8csl
 AT2c2I9LcbWqyUM9P4S7zY/YgOZw88vHRuDH7tEBdIeoiHfrbSBU7EQ9jlAKq/59
 f+Htu2Io281c005g7DEeuCYvpzSYnJnAitj5Lmp/kzk2Wn3utY1uIAVszqwf95Ul
 81ppn2KlvzUK8g==
 =7Gj+
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Thomas Gleixner:
 "A set of locking fixes and updates:

   - Untangle the header spaghetti which causes build failures in
     various situations caused by the lockdep additions to seqcount to
     validate that the write side critical sections are non-preemptible.

   - The seqcount associated lock debug addons which were blocked by the
     above fallout.

     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict
     per CPU seqcounts. As the lock is not part of the seqcount, lockdep
     cannot validate that the lock is held.

     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored
     and write_seqcount_begin() has a lockdep assertion to validate that
     the lock is held.

     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API
     is unchanged and determines the type at compile time with the help
     of _Generic which is possible now that the minimal GCC version has
     been moved up.

     Adding this lockdep coverage unearthed a handful of seqcount bugs
     which have been addressed already independent of this.

     While generally useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if
     the writers are serialized by an associated lock, which leads to
     the well known reader preempts writer livelock. RT prevents this by
     storing the associated lock pointer independent of lockdep in the
     seqcount and changing the reader side to block on the lock when a
     reader detects that a writer is in the write side critical section.

   - Conversion of seqcount usage sites to associated types and
     initializers"

* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  locking/seqlock, headers: Untangle the spaghetti monster
  locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
  x86/headers: Remove APIC headers from <asm/smp.h>
  seqcount: More consistent seqprop names
  seqcount: Compress SEQCNT_LOCKNAME_ZERO()
  seqlock: Fold seqcount_LOCKNAME_init() definition
  seqlock: Fold seqcount_LOCKNAME_t definition
  seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
  hrtimer: Use sequence counter with associated raw spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  vfs: Use sequence counter with associated spinlock
  timekeeping: Use sequence counter with associated raw spinlock
  xfrm: policy: Use sequence counters with associated lock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  sched: tasks: Use sequence counter with associated spinlock
  ...
2020-08-10 19:07:44 -07:00
Greg Kroah-Hartman
2c136de302 Merge 86cfccb669 ("Merge tag 'dlm-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm") into android-mainline
Steps along the way to 5.9-rc1

Fixed conflicts in:
	drivers/scsi/ufs/Kconfig
	drivers/scsi/ufs/ufshcd-crypto.c
	drivers/scsi/ufs/ufshcd.h
	drivers/staging/android/ion/ion.c
	drivers/staging/android/ion/ion_heap.c
	include/linux/ion.h

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia2602190d5960b7ad1beaf49a00489d49f144a4e
2020-08-07 16:19:28 +02:00
Greg Kroah-Hartman
f04a11ac99 Merge 7b4ea9456d ("Revert "x86/mm/64: Do not sync vmalloc/ioremap mappings"") into android-mainline
Steps on the way to 5.9-rc1

Resolves conflicts in:
	drivers/irqchip/qcom-pdc.c
	include/linux/device.h
	net/xfrm/xfrm_state.c
	security/lsm_audit.c

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4aeb3d04f4717714a421721eb3ce690c099bb30a
2020-08-07 16:01:35 +02:00
Linus Torvalds
dfdf16ecfd SCSI misc on 20200806
This series consists of the usual driver updates (ufs, qla2xxx, tcmu,
 lpfc, hpsa, zfcp, scsi_debug) and minor bug fixes.  We also have a
 huge docbook fix update like most other subsystems and no major update
 to the core (the few non trivial updates are either minor fixes or
 removing an unused feature [scsi_sdb_cache]).
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXyxq1yYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishSoAAQChZ4i8
 ZqYW3pL33JO3fA8vdjvLuyC489Hj4wzIsl3/bQEAxYyM6BSLvMoLWR2Plq/JmTLm
 4W/LDptarpTiDI3NuDc=
 =4b0W
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This consists of the usual driver updates (ufs, qla2xxx, tcmu, lpfc,
  hpsa, zfcp, scsi_debug) and minor bug fixes.

  We also have a huge docbook fix update like most other subsystems and
  no major update to the core (the few non trivial updates are either
  minor fixes or removing an unused feature [scsi_sdb_cache])"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (307 commits)
  scsi: scsi_transport_srp: Sanitize scsi_target_block/unblock sequences
  scsi: ufs-mediatek: Apply DELAY_AFTER_LPM quirk to Micron devices
  scsi: ufs: Introduce device quirk "DELAY_AFTER_LPM"
  scsi: virtio-scsi: Correctly handle the case where all LUNs are unplugged
  scsi: scsi_debug: Implement tur_ms_to_ready parameter
  scsi: scsi_debug: Fix request sense
  scsi: lpfc: Fix typo in comment for ULP
  scsi: ufs-mediatek: Prevent LPM operation on undeclared VCC
  scsi: iscsi: Do not put host in iscsi_set_flashnode_param()
  scsi: hpsa: Correct ctrl queue depth
  scsi: target: tcmu: Make TMR notification optional
  scsi: target: tcmu: Implement tmr_notify callback
  scsi: target: tcmu: Fix and simplify timeout handling
  scsi: target: tcmu: Factor out new helper ring_insert_padding
  scsi: target: tcmu: Do not queue aborted commands
  scsi: target: tcmu: Use priv pointer in se_cmd
  scsi: target: Add tmr_notify backend function
  scsi: target: Modify core_tmr_abort_task()
  scsi: target: iscsi: Fix inconsistent debug message
  scsi: target: iscsi: Fix login error when receiving
  ...
2020-08-06 16:50:07 -07:00
Greg Kroah-Hartman
d332fba061 Merge b34133fec8 ("Merge tag 'perf-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip") into android-mainline
Steps on the way to 5.9-rc1

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib901b7670a0313d91a57bce86c9587a25020cc3d
2020-08-06 20:25:54 +02:00
Eric Biggers
98777a7eb7 Merge commit 382625d0d4 ("Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block") into android-mainline
Conflicts:
	drivers/md/dm-bow.c
	drivers/md/dm-default-key.c
	drivers/md/dm.c
	fs/crypto/inline_crypt.c

Replace bdev->bd_queue with bdev_get_queue(bdev).

Bug: 129280212
Bug: 160883801
Bug: 160885805
Bug: 162257830
Change-Id: I9b0b295472080dfc0990dcb769205e68d706ce0e
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-08-06 10:07:17 -07:00
Coly Li
b35fd7422c block: check queue's limits.discard_granularity in __blkdev_issue_discard()
If create a loop device with a backing NVMe SSD, current loop device
driver doesn't correctly set its  queue's limits.discard_granularity and
leaves it as 0. If a discard request at LBA 0 on this loop device, in
__blkdev_issue_discard() the calculated req_sects will be 0, and a zero
length discard request will trigger a BUG() panic in generic block layer
code at block/blk-mq.c:563.

[  955.565006][   C39] ------------[ cut here ]------------
[  955.559660][   C39] invalid opcode: 0000 [#1] SMP NOPTI
[  955.622171][   C39] CPU: 39 PID: 248 Comm: ksoftirqd/39 Tainted: G            E     5.8.0-default+ #40
[  955.622171][   C39] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE160M-2.70]- 07/17/2020
[  955.622175][   C39] RIP: 0010:blk_mq_end_request+0x107/0x110
[  955.622177][   C39] Code: 48 8b 03 e9 59 ff ff ff 48 89 df 5b 5d 41 5c e9 9f ed ff ff 48 8b 35 98 3c f4 00 48 83 c7 10 48 83 c6 19 e8 cb 56 c9 ff eb cb <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 54
[  955.622179][   C39] RSP: 0018:ffffb1288701fe28 EFLAGS: 00010202
[  955.749277][   C39] RAX: 0000000000000001 RBX: ffff956fffba5080 RCX: 0000000000004003
[  955.749278][   C39] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[  955.749279][   C39] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  955.749279][   C39] R10: ffffb1288701fd28 R11: 0000000000000001 R12: ffffffffa8e05160
[  955.749280][   C39] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffffa7ad3a1e
[  955.749281][   C39] FS:  0000000000000000(0000) GS:ffff95bfbda00000(0000) knlGS:0000000000000000
[  955.749282][   C39] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  955.749282][   C39] CR2: 00007f6f0ef766a8 CR3: 0000005a37012002 CR4: 00000000007606e0
[  955.749283][   C39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  955.749284][   C39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  955.749284][   C39] PKRU: 55555554
[  955.749285][   C39] Call Trace:
[  955.749290][   C39]  blk_done_softirq+0x99/0xc0
[  957.550669][   C39]  __do_softirq+0xd3/0x45f
[  957.550677][   C39]  ? smpboot_thread_fn+0x2f/0x1e0
[  957.550679][   C39]  ? smpboot_thread_fn+0x74/0x1e0
[  957.550680][   C39]  ? smpboot_thread_fn+0x14e/0x1e0
[  957.550684][   C39]  run_ksoftirqd+0x30/0x60
[  957.550687][   C39]  smpboot_thread_fn+0x149/0x1e0
[  957.886225][   C39]  ? sort_range+0x20/0x20
[  957.886226][   C39]  kthread+0x137/0x160
[  957.886228][   C39]  ? kthread_park+0x90/0x90
[  957.886231][   C39]  ret_from_fork+0x22/0x30
[  959.117120][   C39] ---[ end trace 3dacdac97e2ed164 ]---

This is the procedure to reproduce the panic,
  # modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1
  # losetup -f /dev/nvme0n1 --direct-io=on
  # blkdiscard /dev/loop0 -o 0 -l 0x200

This patch fixes the issue by checking q->limits.discard_granularity in
__blkdev_issue_discard() before composing the discard bio. If the value
is 0, then prints a warning oops information and returns -EOPNOTSUPP to
the caller to indicate that this buggy device driver doesn't support
discard request.

Fixes: 9b15d109a6 ("block: improve discard bio alignment in __blkdev_issue_discard()")
Fixes: c52abf5630 ("loop: Better discard support for block devices")
Reported-and-suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Evan Green <evgreen@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-05 17:15:47 -06:00
Linus Torvalds
060a72a268 for-5.9/block-merge-20200804
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8pjAEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphuHEAC5hNAi99HLktAQ7qZy4cBqnGNKCrguFszq
 Kxiecp3Nrb9EnAPNWYG+QMO0kD9z8quML85beBaJNxN1PlOk9pawqFAd4ziFncFI
 ruZwIMP+oH/0OmPUA2a4ymrqu+rpyFvfsDL2RKJ9dirAt9fuv9W0RZM5g7Oz83Xi
 cNdPRn0tOhK0DTPxL4M1/NR2OutSgvKDfA5Et3IrDFl7+bJAEFqmSO8wOSdZtvFp
 KcR4O/DXnr5Wl6cPvzlvooQze8vGGJkXAyIKaC9cuBm/nlzMCBGG8kE0v3kRJ8Sc
 uSSFkC+P+OlktY4JwXN+mCacDUdVBiiL/uUs1zel6HmociBgh67mgyJ6AfQtGZry
 yVl9mj44qWZjAzCODv5KnuxlH+gBacdmjcQqwxsZ2P477gfNkxmBXgHeWdfzO9A/
 zTUXaBDXg3VdYxQfD8zTWPkCwXYp+YG3SRb9pfrIWIiYuz2UECZTvl/8Upnacz2B
 POTf+6vcNDlILCtboVE0mKEYR0ckxqrbs0NQloQdmVOfXNyhLml9OrXmwJIffVtE
 pZ9g428c5bm44lIOiB2eW+QPsXo0s8GxqIrMtxzKsJ3WgFefwLiVDLJBqEt78jRJ
 RvpGUxrMLgWFubowH8yDmWV+Fp0NpqcqF+GU45z8nGC3OTS+i0ZvUFYgLM6a2uOf
 sv4bzDPDBg==
 =uMth
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block

Pull block stacking updates from Jens Axboe:
 "The stacking related fixes depended on both the core block and drivers
  branches, so here's a topic branch with that change.

  Outside of that, a late fix from Johannes for zone revalidation"

* tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block:
  block: don't do revalidate zones on invalid devices
  block: remove blk_queue_stack_limits
  block: remove bdev_stack_limits
  block: inherit the zoned characteristics in blk_stack_limits
2020-08-05 11:12:34 -07:00
Linus Torvalds
e0fc99e21e for-5.9/drivers-20200803
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8od3oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppkpD/9D+XqD9qYcYTj+ShVCc5+3RtMG5ZiAAX0y
 l4QXomentn/1Y0UYXFGJH7JLZWrKYT0QiktLtfpe5pmTqRUkckTIyJQlsHb+K6Dz
 lFjtywRK9pcFYgiWIUg80wlJKrTa8QdnrlS/Esn4YITKGRbgMIdFvq2jymXC+1ho
 RgodlgzcBUREgHSLo0H3cqEKA53fQiJhKC6CbFrFdrkpf2yUpcTfEDtpSwuIuPj3
 2AUed1qXUtNjdHciCn3N37OuHqXKAA9noXAWfg9Gx/5zfGUNX9QJvlsny1AopgS0
 jJvPSDVAhu/qRLHW6q/ZOT0JAlHegguuTAOtgMh2cMpAS5sumCAtltxVcI7Qnx41
 HalMpTefXsVoBo0gfjqldnIPt34ZNj5aH5GYaH/wPpSg6VkTVBJK8GuQDBvg27qT
 w+U/T6EzuqniWXh/P3COhfrMCR9ueUOY1qWCRwzomlpeIfBhCzidt2wUqIxX1TOA
 Q0Ltf0eERDevsZbE+tIm+VAAg98kHehcS2t8lfFYFO6/PKu2iJpJt/HtJbZNBE+W
 rm96E4qXRiy1UuL7D9vBkaWsbnosuNHgGQXx57GlokQU+2IGBmOxV52XHiSxxpXd
 AS1ZTd56ItmID8VaU09Pbf7ZFbiCgdEAxIbUFzaCuvo+lxryHFphIUARNi/zPnNT
 UC2OzunCqA==
 =oADH
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - NVMe:
      - ZNS support (Aravind, Keith, Matias, Niklas)
      - Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David,
        Dongli, Max, Sagi)

 - null_blk zone capacity support (Aravind)

 - MD:
      - raid5/6 fixes (ChangSyun)
      - Warning fixes (Damien)
      - raid5 stripe fixes (Guoqing, Song, Yufen)
      - sysfs deadlock fix (Junxiao)
      - raid10 deadlock fix (Vitaly)

 - struct_size conversions (Gustavo)

 - Set of bcache updates/fixes (Coly)

* tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits)
  md/raid5: Allow degraded raid6 to do rmw
  md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
  raid5: don't duplicate code for different paths in handle_stripe
  raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show
  md: print errno in super_written
  md/raid5: remove the redundant setting of STRIPE_HANDLE
  md: register new md sysfs file 'uuid' read-only
  md: fix max sectors calculation for super 1.0
  nvme-loop: remove extra variable in create ctrl
  nvme-loop: set ctrl state connecting after init
  nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths
  nvme-multipath: fix logic for non-optimized paths
  nvme-rdma: fix controller reset hang during traffic
  nvme-tcp: fix controller reset hang during traffic
  nvmet: introduce the passthru Kconfig option
  nvmet: introduce the passthru configfs interface
  nvmet: Add passthru enable/disable helpers
  nvmet: add passthru code to process commands
  nvme: export nvme_find_get_ns() and nvme_put_ns()
  nvme: introduce nvme_ctrl_get_by_path()
  ...
2020-08-05 10:51:40 -07:00
Linus Torvalds
99ea1521a0 Remove uninitialized_var() macro for v5.9-rc1
- Clean up non-trivial uses of uninitialized_var()
 - Update documentation and checkpatch for uninitialized_var() removal
 - Treewide removal of uninitialized_var()
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oYLQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsfjEACvf0D3WL3H7sLHtZ2HeMwOgAzq
 il08t6vUscINQwiIIK3Be43ok3uQ1Q+bj8sr2gSYTwunV2IYHFferzgzhyMMno3o
 XBIGd1E+v1E4DGBOiRXJvacBivKrfvrdZ7AWiGlVBKfg2E0fL1aQbe9AYJ6eJSbp
 UGqkBkE207dugS5SQcwrlk1tWKUL089lhDAPd7iy/5RK76OsLRCJFzIerLHF2ZK2
 BwvA+NWXVQI6pNZ0aRtEtbbxwEU4X+2J/uaXH5kJDszMwRrgBT2qoedVu5LXFPi8
 +B84IzM2lii1HAFbrFlRyL/EMueVFzieN40EOB6O8wt60Y4iCy5wOUzAdZwFuSTI
 h0xT3JI8BWtpB3W+ryas9cl9GoOHHtPA8dShuV+Y+Q2bWe1Fs6kTl2Z4m4zKq56z
 63wQCdveFOkqiCLZb8s6FhnS11wKtAX4czvXRXaUPgdVQS1Ibyba851CRHIEY+9I
 AbtogoPN8FXzLsJn7pIxHR4ADz+eZ0dQ18f2hhQpP6/co65bYizNP5H3h+t9hGHG
 k3r2k8T+jpFPaddpZMvRvIVD8O2HvJZQTyY6Vvneuv6pnQWtr2DqPFn2YooRnzoa
 dbBMtpon+vYz6OWokC5QNWLqHWqvY9TmMfcVFUXE4AFse8vh4wJ8jJCNOFVp8On+
 drhmmImUr1YylrtVOw==
 =xHmk
 -----END PGP SIGNATURE-----

Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull uninitialized_var() macro removal from Kees Cook:
 "This is long overdue, and has hidden too many bugs over the years. The
  series has several "by hand" fixes, and then a trivial treewide
  replacement.

   - Clean up non-trivial uses of uninitialized_var()

   - Update documentation and checkpatch for uninitialized_var() removal

   - Treewide removal of uninitialized_var()"

* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  compiler: Remove uninitialized_var() macro
  treewide: Remove uninitialized_var() usage
  checkpatch: Remove awareness of uninitialized_var() macro
  mm/debug_vm_pgtable: Remove uninitialized_var() usage
  f2fs: Eliminate usage of uninitialized_var() macro
  media: sur40: Remove uninitialized_var() usage
  KVM: PPC: Book3S PR: Remove uninitialized_var() usage
  clk: spear: Remove uninitialized_var() usage
  clk: st: Remove uninitialized_var() usage
  spi: davinci: Remove uninitialized_var() usage
  ide: Remove uninitialized_var() usage
  rtlwifi: rtl8192cu: Remove uninitialized_var() usage
  b43: Remove uninitialized_var() usage
  drbd: Remove uninitialized_var() usage
  x86/mm/numa: Remove uninitialized_var() usage
  docs: deprecated.rst: Add uninitialized_var()
2020-08-04 13:49:43 -07:00
Linus Torvalds
cdc8fcb499 for-5.9/io_uring-20200802
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7asQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplrCD/0S17kio+k4cOJDGwl88WoJw+QiYmM5019k
 decZ1JymQvV1HXRmlcZiEAu0hHDD0FoovSRrw7II3gw3GouETmYQM62f6ZTpDeMD
 CED/fidnfULAkPaI6h+bj3jyI0cEuujG/R47rGSQEkIIr3RttqKZUzVkB9KN+KMw
 +OBuXZtMIoFFEVJ91qwC2dm2qHLqOn1/5MlT59knso/xbPOYOXsFQpGiACJqF97x
 6qSSI8uGE+HZqvL2OLWPDBbLEJhrq+dzCgxln5VlvLele4UcRhOdonUb7nUwEKCe
 zwvtXzz16u1D1b8bJL4Kg5bGqyUAQUCSShsfBJJxh6vTTULiHyCX5sQaai1OEB16
 4dpBL9E+nOUUix4wo9XBY0/KIYaPWg5L1CoEwkAXqkXPhFvNUucsC0u6KvmzZR3V
 1OogVTjl6GhS8uEVQjTKNshkTIC9QHEMXDUOHtINDCb/sLU+ANXU5UpvsuzZ9+kt
 KGc4mdyCwaKBq4YW9sVwhhq/RHLD4AUtWZiUVfOE+0cltCLJUNMbQsJ+XrcYaQnm
 W4zz22Rep+SJuQNVcCW/w7N2zN3yB6gC1qeroSLvzw4b5el2TdFp+BcgVlLHK+uh
 xjsGNCq++fyzNk7vvMZ5hVq4JGXYjza7AiP5HlQ8nqdiPUKUPatWCBqUm9i9Cz/B
 n+0dlYbRwQ==
 =2vmy
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Lots of cleanups in here, hardening the code and/or making it easier
  to read and fixing bugs, but a core feature/change too adding support
  for real async buffered reads. With the latter in place, we just need
  buffered write async support and we're done relying on kthreads for
  the fast path. In detail:

   - Cleanup how memory accounting is done on ring setup/free (Bijan)

   - sq array offset calculation fixup (Dmitry)

   - Consistently handle blocking off O_DIRECT submission path (me)

   - Support proper async buffered reads, instead of relying on kthread
     offload for that. This uses the page waitqueue to drive retries
     from task_work, like we handle poll based retry. (me)

   - IO completion optimizations (me)

   - Fix race with accounting and ring fd install (me)

   - Support EPOLLEXCLUSIVE (Jiufei)

   - Get rid of the io_kiocb unionizing, made possible by shrinking
     other bits (Pavel)

   - Completion side cleanups (Pavel)

   - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

   - Request environment grabbing cleanups (Pavel)

   - File and socket read/write cleanups (Pavel)

   - Improve kiocb_set_rw_flags() (Pavel)

   - Tons of fixes and cleanups (Pavel)

   - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"

* tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
  io_uring: flip if handling after io_setup_async_rw
  fs: optimise kiocb_set_rw_flags()
  io_uring: don't touch 'ctx' after installing file descriptor
  io_uring: get rid of atomic FAA for cq_timeouts
  io_uring: consolidate *_check_overflow accounting
  io_uring: fix stalled deferred requests
  io_uring: fix racy overflow count reporting
  io_uring: deduplicate __io_complete_rw()
  io_uring: de-unionise io_kiocb
  io-wq: update hash bits
  io_uring: fix missing io_queue_linked_timeout()
  io_uring: mark ->work uninitialised after cleanup
  io_uring: deduplicate io_grab_files() calls
  io_uring: don't do opcode prep twice
  io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
  io_uring: batch put_task_struct()
  tasks: add put_task_struct_many()
  io_uring: return locked and pinned page accounting
  io_uring: don't miscount pinned memory
  io_uring: don't open-code recv kbuf managment
  ...
2020-08-03 13:01:22 -07:00
Linus Torvalds
382625d0d4 for-5.9/block-20200802
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec
 yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/
 FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS
 x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB
 b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut
 8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh
 SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi
 yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv
 GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw
 ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb
 +BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR
 RzAUdZFbWw==
 =abJG
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Good amount of cleanups and tech debt removals in here, and as a
  result, the diffstat shows a nice net reduction in code.

   - Softirq completion cleanups (Christoph)

   - Stop using ->queuedata (Christoph)

   - Cleanup bd claiming (Christoph)

   - Use check_events, moving away from the legacy media change
     (Christoph)

   - Use inode i_blkbits consistently (Christoph)

   - Remove old unused writeback congestion bits (Christoph)

   - Cleanup/unify submission path (Christoph)

   - Use bio_uninit consistently, instead of bio_disassociate_blkg
     (Christoph)

   - sbitmap cleared bits handling (John)

   - Request merging blktrace event addition (Jan)

   - sysfs add/remove race fixes (Luis)

   - blk-mq tag fixes/optimizations (Ming)

   - Duplicate words in comments (Randy)

   - Flush deferral cleanup (Yufen)

   - IO context locking/retry fixes (John)

   - struct_size() usage (Gustavo)

   - blk-iocost fixes (Chengming)

   - blk-cgroup IO stats fixes (Boris)

   - Various little fixes"

* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
  block: blk-timeout: delete duplicated word
  block: blk-mq-sched: delete duplicated word
  block: blk-mq: delete duplicated word
  block: genhd: delete duplicated words
  block: elevator: delete duplicated word and fix typos
  block: bio: delete duplicated words
  block: bfq-iosched: fix duplicated word
  iocost_monitor: start from the oldest usage index
  iocost: Fix check condition of iocg abs_vdebt
  block: Remove callback typedefs for blk_mq_ops
  block: Use non _rcu version of list functions for tag_set_list
  blk-cgroup: show global disk stats in root cgroup io.stat
  blk-cgroup: make iostat functions visible to stat printing
  block: improve discard bio alignment in __blkdev_issue_discard()
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  block: defer flush request no matter whether we have elevator
  block: make blk_timeout_init() static
  block: remove retry loop in ioc_release_fn()
  block: remove unnecessary ioc nested locking
  block: integrate bd_start_claiming into __blkdev_get
  ...
2020-08-03 11:57:03 -07:00
Johannes Thumshirn
1a1206dc4c block: don't do revalidate zones on invalid devices
When we loose a device for whatever reason while (re)scanning zones, we
trip over a NULL pointer in blk_revalidate_zone_cb, like in the following
log:

sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB)
sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed
sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08
sd 0:0:0:0: [sda] Sense Key : 0xb [current]
sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6
sda: failed to revalidate zones
sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B)
sda: detected capacity change from 14000519643136 to 0
==================================================================
BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550
Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58

CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Workqueue: events_unbound async_run_entry_fn
Call Trace:
 dump_stack+0x7d/0xb0
 ? blk_revalidate_zone_cb+0x1b7/0x550
 kasan_report.cold+0x5/0x37
 ? blk_revalidate_zone_cb+0x1b7/0x550
 check_memory_region+0x145/0x1a0
 blk_revalidate_zone_cb+0x1b7/0x550
 sd_zbc_parse_report+0x1f1/0x370
 ? blk_req_zone_write_trylock+0x200/0x200
 ? sectors_to_logical+0x60/0x60
 ? blk_req_zone_write_trylock+0x200/0x200
 ? blk_req_zone_write_trylock+0x200/0x200
 sd_zbc_report_zones+0x3c4/0x5e0
 ? sd_dif_config_host+0x500/0x500
 blk_revalidate_disk_zones+0x231/0x44d
 ? _raw_write_lock_irqsave+0xb0/0xb0
 ? blk_queue_free_zone_bitmaps+0xd0/0xd0
 sd_zbc_read_zones+0x8cf/0x11a0
 sd_revalidate_disk+0x305c/0x64e0
 ? __device_add_disk+0x776/0xf20
 ? read_capacity_16.part.0+0x1080/0x1080
 ? blk_alloc_devt+0x250/0x250
 ? create_object.isra.0+0x595/0xa20
 ? kasan_unpoison_shadow+0x33/0x40
 sd_probe+0x8dc/0xcd2
 really_probe+0x20e/0xaf0
 __driver_attach_async_helper+0x249/0x2d0
 async_run_entry_fn+0xbe/0x560
 process_one_work+0x764/0x1290
 ? _raw_read_unlock_irqrestore+0x30/0x30
 worker_thread+0x598/0x12f0
 ? __kthread_parkme+0xc6/0x1b0
 ? schedule+0xed/0x2c0
 ? process_one_work+0x1290/0x1290
 kthread+0x36b/0x440
 ? kthread_create_worker_on_cpu+0xa0/0xa0
 ret_from_fork+0x22/0x30
==================================================================

When the device is already gone we end up with the following scenario:
The device's capacity is 0 and thus the number of zones will be 0 as well. When
allocating the bitmap for the conventional zones, we then trip over a NULL
pointer.

So if we encounter a zoned block device with a 0 capacity, don't dare to
revalidate the zones sizes.

Fixes: 6c6b354914 ("block: set the zone size in blk_revalidate_disk_zones atomically")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-03 09:24:04 -06:00
Randy Dunlap
d958e343bd block: blk-timeout: delete duplicated word
Drop the repeated word "request".
Change to the correct kernel-doc notation for function name separtor.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-31 16:29:47 -06:00
Randy Dunlap
c4aecaa256 block: blk-mq-sched: delete duplicated word
Drop the repeated word "to".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-31 16:29:47 -06:00
Randy Dunlap
70f15a4fd9 block: blk-mq: delete duplicated word
Drop the repeated word "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-31 16:29:47 -06:00
Randy Dunlap
0d20dcc277 block: genhd: delete duplicated words
Drop the repeated word "to" in multiple places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-31 16:29:47 -06:00
Randy Dunlap
5b8f65e1f9 block: elevator: delete duplicated word and fix typos
Drop the repeated word "the".
Fix typos of "features" and "specified".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-31 16:29:47 -06:00
Randy Dunlap
3cf1488917 block: bio: delete duplicated words
Drop the repeated words "a" and "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-31 16:29:47 -06:00
Randy Dunlap
f06678af91 block: bfq-iosched: fix duplicated word
Change "at at" to "at a".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-31 16:29:47 -06:00
Chengming Zhou
d9012a59db iocost: Fix check condition of iocg abs_vdebt
We shouldn't skip iocg when its abs_vdebt is not zero.

Fixes: 0b80f9866e ("iocost: protect iocg->abs_vdebt with iocg->waitq.lock")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-30 11:45:12 -06:00
Ahmed S. Darwish
67b7b641ca iocost: Use sequence counter with associated spinlock
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.

Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.

If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.

Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20200720155530.1173732-21-a.darwish@linutronix.de
2020-07-29 16:14:28 +02:00
Daniel Wagner
08c875cbf4 block: Use non _rcu version of list functions for tag_set_list
tag_set_list is only accessed under the tag_set_lock lock. There is
no need for using the _rcu list functions.

The _rcu list function were introduced to allow read access to the
tag_set_list protected under RCU, see 705cda97ee ("blk-mq: Make it
safe to use RCU to iterate over blk_mq_tag_set.tag_list") and
05b7941394 ("Revert "blk-mq: don't handle TAG_SHARED in restart"").
Those changes got reverted later but the cleanup commit missed a
couple of places to undo the changes.

Fixes: 97889f9ac2 ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()"
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-28 09:37:09 -06:00
Alan Stern
8f38f8e0a3 scsi: block: pm: Simplify resume handling
Commit 05d18ae1cc ("scsi: pm: Balance pm_only counter of request queue
during system resume") fixed a problem in the block layer's runtime-PM
code: blk_set_runtime_active() failed to call blk_clear_pm_only().
However, the commit's implementation was awkward; it forced the SCSI
system-resume handler to choose whether to call blk_post_runtime_resume()
or blk_set_runtime_active(), depending on whether or not the SCSI device
had previously been runtime suspended.

This patch simplifies the situation considerably by adding the missing
function call directly into blk_set_runtime_active() (under the condition
that the queue is not already in the RPM_ACTIVE state).  This allows the
SCSI routine to revert back to its original form.  Furthermore, making this
change reveals that blk_post_runtime_resume() (in its success pathway) does
exactly the same thing as blk_set_runtime_active().  The duplicate code is
easily removed by making one routine call the other.

No functional changes are intended.

Link: https://lore.kernel.org/r/20200706151436.GA702867@rowland.harvard.edu
CC: Can Guo <cang@codeaurora.org>
CC: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-07-24 22:09:55 -04:00
Christoph Hellwig
b9b1a5d715 block: remove blk_queue_stack_limits
This function is just a tiny wrapper around blk_stack_limits.  Open code
it int the two callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-20 15:38:52 -06:00
Christoph Hellwig
9efa82ef2b block: remove bdev_stack_limits
This function is just a tiny wrapper around blk_stack_limit and has
two callers.  Simplify the stack a bit by open coding it in the two
callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-20 15:38:52 -06:00
Christoph Hellwig
3093a47972 block: inherit the zoned characteristics in blk_stack_limits
Lift the code from device mapper into blk_stack_limits to inherity
the stacking limitations.  This ensures we do the right thing for
all stacked zoned block devices.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-20 15:38:52 -06:00
Jens Axboe
4f43d64807 Merge branch 'for-5.9/drivers' into for-5.9/block-merge
* for-5.9/drivers: (38 commits)
  block: add max_active_zones to blk-sysfs
  block: add max_open_zones to blk-sysfs
  s390/dasd: Use struct_size() helper
  s390/dasd: fix inability to use DASD with DIAG driver
  md-cluster: fix wild pointer of unlock_all_bitmaps()
  md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes
  md: fix deadlock causing by sysfs_notify
  md: improve io stats accounting
  md: raid0/linear: fix dereference before null check on pointer mddev
  rsxx: switch from 'pci_free_consistent()' to 'dma_free_coherent()'
  nvme: remove ns->disk checks
  nvme-pci: use standard block status symbolic names
  nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size()
  nvme-pci: add a blank line after declarations
  nvme-pci: fix some comments issues
  nvme-pci: remove redundant segment validation
  nvme: document quirked Intel models
  nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs
  nvme: support for zoned namespaces
  nvme: support for multiple Command Sets Supported and Effects log pages
  ...
2020-07-20 15:38:27 -06:00
Jens Axboe
9caaa66c91 Merge branch 'for-5.9/block' into for-5.9/block-merge
* for-5.9/block: (124 commits)
  blk-cgroup: show global disk stats in root cgroup io.stat
  blk-cgroup: make iostat functions visible to stat printing
  block: improve discard bio alignment in __blkdev_issue_discard()
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  block: defer flush request no matter whether we have elevator
  block: make blk_timeout_init() static
  block: remove retry loop in ioc_release_fn()
  block: remove unnecessary ioc nested locking
  block: integrate bd_start_claiming into __blkdev_get
  block: use bd_prepare_to_claim directly in the loop driver
  block: refactor bd_start_claiming
  block: simplify the restart case in __blkdev_get
  Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait."
  block: always remove partitions from blk_drop_partitions()
  block: relax jiffies rounding for timeouts
  blk-mq: remove redundant validation in __blk_mq_end_request()
  blk-mq: Remove unnecessary local variable
  writeback: remove bdi->congested_fn
  writeback: remove struct bdi_writeback_congested
  writeback: remove {set,clear}_wb_congested
  ...
2020-07-20 15:38:23 -06:00
Boris Burkov
ef45fe470e blk-cgroup: show global disk stats in root cgroup io.stat
In order to improve consistency and usability in cgroup stat accounting,
we would like to support the root cgroup's io.stat.

Since the root cgroup has processes doing io even if the system has no
explicitly created cgroups, we need to be careful to avoid overhead in
that case.  For that reason, the rstat algorithms don't handle the root
cgroup, so just turning the file on wouldn't give correct statistics.

To get around this, we simulate flushing the iostat struct by filling it
out directly from global disk stats. The result is a root cgroup io.stat
file consistent with both /proc/diskstats and io.stat.

Note that in order to collect the disk stats, we needed to iterate over
devices. To facilitate that, we had to change the linkage of a disk_type
to external so that it can be used from blk-cgroup.c to iterate over
disks.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 20:18:00 -06:00
Boris Burkov
cd1fc4b98f blk-cgroup: make iostat functions visible to stat printing
Previously, the code which printed io.stat only needed access to the
generic rstat flushing code, but since we plan to write some more
specific code for preparing root cgroup stats, we need to manipulate
iostat structs directly. Since declaring static functions ahead does not
seem like common practice in this file, simply move the iostat functions
up. We only plan to use blkg_iostat_set, but it seems better to keep them
all together.

Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 20:17:59 -06:00
Coly Li
9b15d109a6 block: improve discard bio alignment in __blkdev_issue_discard()
This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.

Current discard bio split algorithm in __blkdev_issue_discard() may have
non-discarded fregment on device even the discard bio LBA and size are
both aligned to device's discard granularity size.

Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
  with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
  /dev/sdb, fill data into the first 50GB by,
        # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
        # blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
        # sg_get_lba_status /dev/sdb -m 1048 --lba=0
  descriptor LBA: 0x0000000000000000  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000000000800  blocks: 16773120  deallocated
  descriptor LBA: 0x0000000000fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000017ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000001800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000001fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000027ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000002800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000002fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000037ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000003800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000003fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000047ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000004800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000004fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005000000  blocks: 8386560  deallocated
  descriptor LBA: 0x00000000057ff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000005800000  blocks: 8386560  deallocated
  descriptor LBA: 0x0000000005fff800  blocks: 2048  mapped (or unknown)
  descriptor LBA: 0x0000000006000000  blocks: 6291456  deallocated
  descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
are perfect aligned to the discard granularity, from the above list
these are many 1MB (2048 sectors) internal fragments exist unexpectedly.

The problem is in __blkdev_issue_discard(), an improper algorithm causes
an improper bio size which is not aligned.

 25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 26                 sector_t nr_sects, gfp_t gfp_mask, int flags,
 27                 struct bio **biop)
 28 {
 29         struct request_queue *q = bdev_get_queue(bdev);
   [snipped]
 56
 57         while (nr_sects) {
 58                 sector_t req_sects = min_t(sector_t, nr_sects,
 59                                 bio_allowed_max_sectors(q));
 60
 61                 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
 62
 63                 bio = blk_next_bio(bio, 0, gfp_mask);
 64                 bio->bi_iter.bi_sector = sector;
 65                 bio_set_dev(bio, bdev);
 66                 bio_set_op_attrs(bio, op, 0);
 67
 68                 bio->bi_iter.bi_size = req_sects << 9;
 69                 sector += req_sects;
 70                 nr_sects -= req_sects;
   [snipped]
 79         }
 80
 81         *biop = bio;
 82         return 0;
 83 }
 84 EXPORT_SYMBOL(__blkdev_issue_discard);

At line 58-59, to discard a 50GB range, req_sects is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, req_sects never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors fragment in every 4 or 8 GB range.

If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UNIT_MAX, then all consequent split bios inside device
driver are (almostly) aligned to discard_granularity of the device
queue. The 2048 sectors still-mapped fragment will disappear.

This patch introduces bio_aligned_discard_max_sectors() to return the
the value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.

But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 fragment around every 4GB
range. Therefore, to calculate req_sects, firstly the start LBA of
discard range is checked (including partition offset), if it is not
aligned to discard granularity, the first split location should make
sure following bio has bi_sector aligned to discard granularity. Then
there won't be still-mapped fragment in the middle of the discard range.

The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with same
command line mentiond previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000  blocks: 106954752  deallocated
descriptor LBA: 0x0000000006600000  blocks: 0  deallocated

We an see there is no 2048 sectors segment anymore, everything is clean.

Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 07:15:10 -06:00
Yufen Yu
b5718d6c00 block: defer flush request no matter whether we have elevator
Commit 7520872c0c ("block: don't defer flushes on blk-mq + scheduling")
tried to fix deadlock for cycled wait between flush requests and data
request into flush_data_in_flight. The former holded all driver tags
and wait for data request completion, but the latter can not complete
for waiting free driver tags.

After commit 923218f616 ("blk-mq: don't allocate driver tag upfront
for flush rq"), flush requests will not get driver tag before queuing
into flush queue.

* With elevator, flush request just get sched_tags before inserting
  flush queue. It will not get driver tag until issue them to driver.
  data request on list fq->flush_data_in_flight will complete in
  the end.

* Without elevator, each flush request will get a driver tag when
  allocate request. Then data request on fq->flush_data_in_flight
  don't worry about lacking driver tag.

In both of these cases, cycled wait cannot be true. So we may allow
to defer flush request.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 07:14:28 -06:00
Wei Yongjun
943c4d9074 block: make blk_timeout_init() static
The sparse tool complains as follows:

block/blk-timeout.c:93:12: warning:
 symbol 'blk_timeout_init' was not declared. Should it be static?

Function blk_timeout_init() is not used outside of blk-timeout.c, so
mark it static.

Fixes: 9054650fac ("block: relax jiffies rounding for timeouts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-17 07:13:42 -06:00
Kees Cook
3f649ab728 treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
	xargs perl -pi -e \
		's/\buninitialized_var\(([^\)]+)\)/\1/g;
		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16 12:35:15 -07:00
John Ogness
ab96bbab46 block: remove retry loop in ioc_release_fn()
The reverse-order double lock dance in ioc_release_fn() is using a
retry loop. This is a problem on PREEMPT_RT because it could preempt
the task that would release q->queue_lock and thus live lock in the
retry loop.

RCU is already managing the freeing of the request queue and icq. If
the trylock fails, use RCU to guarantee that the request queue and
icq are not freed and re-acquire the locks in the correct order,
allowing forward progress.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-16 10:22:15 -06:00
John Ogness
a43f085f87 block: remove unnecessary ioc nested locking
The legacy CFQ IO scheduler could call put_io_context() in its exit_icq()
elevator callback. This led to a lockdep warning, which was fixed in
commit d8c66c5d59 ("block: fix lockdep warning on io_context release
put_io_context()") by using a nested subclass for the ioc spinlock.
However, with commit f382fb0bce ("block: remove legacy IO schedulers")
the CFQ IO scheduler no longer exists.

The BFQ IO scheduler also implements the exit_icq() elevator callback but
does not call put_io_context().

The nested subclass for the ioc spinlock is no longer needed. Since it
existed as an exception and no longer applies, remove the nested subclass
usage.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-16 10:22:15 -06:00
Niklas Cassel
659bf827ba block: add max_active_zones to blk-sysfs
Add a new max_active zones definition in the sysfs documentation.
This definition will be common for all devices utilizing the zoned block
device support in the kernel.

Export max_active_zones according to this new definition for NVMe Zoned
Namespace devices, ZAC ATA devices (which are treated as SCSI devices by
the kernel), and ZBC SCSI devices.

Add the new max_active_zones member to struct request_queue, rather
than as a queue limit, since this property cannot be split across stacking
drivers.

For SCSI devices, even though max active zones is not part of the ZBC/ZAC
spec, export max_active_zones as 0, signifying "no limit".

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-15 14:26:11 -06:00
Niklas Cassel
e15864f8ea block: add max_open_zones to blk-sysfs
Add a new max_open_zones definition in the sysfs documentation.
This definition will be common for all devices utilizing the zoned block
device support in the kernel.

Export max open zones according to this new definition for NVMe Zoned
Namespace devices, ZAC ATA devices (which are treated as SCSI devices by
the kernel), and ZBC SCSI devices.

Add the new max_open_zones member to struct request_queue, rather
than as a queue limit, since this property cannot be split across stacking
drivers.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-15 14:26:11 -06:00
Jens Axboe
e791ee6885 Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait."
This reverts commit 826f2f48da.

Qian Cai reports that this commit causes stalls with swap. Revert until
the reason can be figured out.

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-15 09:33:37 -06:00
Ming Lei
d0f0f1b4c5 block: always remove partitions from blk_drop_partitions()
In theory, when GENHD_FL_NO_PART_SCAN is set, no partitions can be created
on one disk. However, ioctl(BLKPG, BLKPG_ADD_PARTITION) doesn't check
GENHD_FL_NO_PART_SCAN, so partitions still can be added even though
GENHD_FL_NO_PART_SCAN is set.

So far blk_drop_partitions() only removes partitions when disk_part_scan_enabled()
return true. This way can make ghost partition on loop device after changing/clearing
FD in case that PARTSCAN is disabled, such as partitions can be added
via 'parted' on loop disk even though GENHD_FL_NO_PART_SCAN is set.

Fix this issue by always removing partitions in blk_drop_partitions(), and
this way is correct because the current code supposes that no partitions
can be added in case of GENHD_FL_NO_PART_SCAN.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-15 09:23:42 -06:00
Jens Axboe
9054650fac block: relax jiffies rounding for timeouts
In doing high IOPS testing, blk-mq is generally pretty well optimized.
There are a few things that stuck out as using more CPU than what is
really warranted, and one thing is the round_jiffies_up() that we do
twice for each request. That accounts for about 0.8% of the CPU in
my testing.

We can make this cheaper by avoiding an integer division, by just adding
a rough HZ mask that we can AND with instead. The timeouts are only on a
second granularity already, we don't have to be that accurate here and
this patch barely changes that. All we care about is nice grouping.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-15 09:23:35 -06:00
Greg Kroah-Hartman
6c201957b6 Linux 5.8-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8LnhoeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGIOAH/jkWTYOGhrHEYy8A
 lOPz5JszK6XbG4+KPYsNZpZmB9pHO3mlnGVFzL1+nKG4RwheYW+sfjICVRMWiJeY
 3N1ZvxisgK8E3F/VGOhIOLLQkddUns9VD1F6okYTfnIMmMwynI9Qd2Rk4MSusNv6
 wxeAHS1+mgZo2F29N1zkssSUClwBvEe6q797Oeym0n1QSF5o7H9Ufj+pucaiZhLy
 1VHIO7Sdo0oK4Bu2mOe5ucsz6KaXPaav8fRVc9ZOKwm/oA3ffbrZH3HO8z7O8Y56
 ilZciWpFyaFmMVIB+6RaRv5FdJDG2lZtXr3S9Orqr/R77Y9/maYgB2P6c1toDbMC
 dEtOvh0=
 =VlHj
 -----END PGP SIGNATURE-----

Merge 5.8-rc5 into android-mainline

Linux 5.8-rc5

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I0604003a73d4691fe7e3f78cb51d6605f5ac8f4a
2020-07-13 14:55:20 +02:00
Linus Torvalds
d33db70274 block-5.8-2020-07-10
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8Ij/kQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprHQD/4rS7rwnzbE0HAYqnxG40Pkzcedsj8DMBSY
 whlPe016D4tqGuqbudm2crVO8TKPRVhluw+VL85xgDh5Hg09uljjpjCh9q4K8GxF
 moTNlhrHkg7SC8P6wEUBR4oPA0g8NQ0CLMk9CB+2YmahcyBV1neZKH3oItZniSw5
 8GwHDRI0o+FfhpgwuCMGvThQ2q4OLnwVa6rxVFaVgniCB/vjcuRQGLlGTTLIon21
 7jRWr950XcbBu0g4fp5jP7bXXEwr/fQDYRk3LNGS3Ku+v0sCzWscpkOrQZxH/xLd
 l41QQK0dP8dG7GphYGupnlGenqHhLpW+9hrG3nJTt+Y2V16/wdJoFfKgi5wHj1lP
 ltSnBo+0/V2M3IfzNnLu0khzRl3//65dffZQIJznMqMy7L+ggZfpDm3MKUdxRrmc
 +yyL5NJmvg1p8CQdky8A22yrIzK6NAyS/rxg1rzdy5PuF+y9z91vcARxorrB0t31
 bM89PYLDnUkwT0kGErdU1TtqvW8OEE2kzjJf/sfoRdY9w+ZlD/vzlC8axso1FrjX
 ep8BBHH7oHPy0q8gYYVUUWdsJSi0DZEHwML7lb+CDIlfE0UVBtOuvLYDfZUrckOg
 Ahs3Mc7odJ2Am9qESUZQnG5AhkYmVLYhtB5NGeaOMuzDQnfZPrYYrUdvZ6kN7pww
 c4h7ZSPVUQ==
 =BZ6W
 -----END PGP SIGNATURE-----

Merge tag 'block-5.8-2020-07-10' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Fix for inflight accounting, which affects only dm (Ming)

 - Fix documentation error for bfq (Yufen)

 - Fix memory leak for nbd (Zheng)

* tag 'block-5.8-2020-07-10' of git://git.kernel.dk/linux-block:
  nbd: Fix memory leak in nbd_add_socket
  blk-mq: consider non-idle request as "inflight" in blk_mq_rq_inflight()
  docs: block: update and fix tiny error for bfq
2020-07-10 09:55:46 -07:00
Baolin Wang
87890092ee blk-mq: remove redundant validation in __blk_mq_end_request()
We've already validated the 'q->elevator' before calling
->ops.completed_request() in blk_mq_sched_completed_request(), thus no
need to validate rq->internal_tag again. Rmove it.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-10 07:58:33 -06:00
Baolin Wang
106e71c512 blk-mq: Remove unnecessary local variable
Remove unnecessary local variable 'ret' in blk_mq_dispatch_hctx_list().

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-10 07:58:09 -06:00
Christoph Hellwig
8c911f3d4c writeback: remove struct bdi_writeback_congested
We never set any congested bits in the group writeback instances of it.
And for the simpler bdi-wide case a simple scalar field is all that
that is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 17:05:53 -06:00
Christoph Hellwig
a564e23f0f md: switch to ->check_events for media change notifications
md is the last driver using the legacy media_changed method.  Switch
it over to (not so) new ->clear_events approach, which also removes the
need for the ->revalidate_disk method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: remove unused 'bdops' variable in disk_clear_events()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 16:19:47 -06:00
Ming Lei
568f270065 blk-mq: centralise related handling into blk_mq_get_driver_tag
Move .nr_active update and request assignment into blk_mq_get_driver_tag(),
all are good to do during getting driver tag.

Meantime blk-flush related code is simplified and flush request needn't
to update the request table manually any more.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-08 16:06:42 -06:00