commit e9f4eee9a0023ba22db9560d4cc6ee63f933dae8 upstream.
When the weight of an active iocg is updated, weight_updated() is called
which in turn calls __propagate_weights() to update the active and inuse
weights so that the effective hierarchical weights are update accordingly.
The current implementation is incorrect for inner active nodes. For an
active leaf iocg, inuse can be any value between 1 and active and the
difference represents how much the iocg is donating. When weight is updated,
as long as inuse is clamped between 1 and the new weight, we're alright and
this is what __propagate_weights() currently implements.
However, that's not how an active inner node's inuse is set. An inner node's
inuse is solely determined by the ratio between the sums of inuse's and
active's of its children - ie. they're results of propagating the leaves'
active and inuse weights upwards. __propagate_weights() incorrectly applies
the same clamping as for a leaf when an active inner node's weight is
updated. Consider a hierarchy which looks like the following with saturating
workloads in AA and BB.
R
/ \
A B
| |
AA BB
1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.
2. echo 200 > A/io.weight
3. __propagate_weights() update A's active to 200 and leave inuse at 100 as
it's already between 1 and the new active, making A:active=200,
A:inuse=100. As R's active_sum is updated along with A's active,
A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
hwi's remain unchanged at 0.5.
4. The weight of A is now twice that of B but AA and BB still have the same
hwi of 0.5 and thus are doing the same amount of IOs.
Fix it by making __propgate_weights() always calculate the inuse of an
active inner iocg based on the ratio of child_inuse_sum to child_active_sum.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Schatzberg <dschatzberg@fb.com>
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For flush request, rq->end_io() may be called two times, one is from
timeout handling(blk_mq_check_expired()), another is from normal
completion(__blk_mq_end_request()).
Move blk_account_io_flush() after flush_rq->ref drops to zero, so
io accounting can be done just once for flush request.
Fixes: b686631865 ("block: add iostat counters for flush requests")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 188199752
(cherry picked from commit 773cd5fb22e7c61c65c7528a3e1cb5bcbc1408ea git://git.kernel.dk/linux-block/ for-5.14/block)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Change-Id: I0cbcb35fce831daab99145d2401a6fdeb5d575b4
If a tag set is shared across request queues (e.g. SCSI LUNs) then the
block layer core keeps track of the number of active request queues in
tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that
atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make
sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is
cleared by blk_mq_del_queue_tag_set().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca30 ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Bug: 181696921
Link: https://lore.kernel.org/linux-block/d1b1f123-9b47-8805-0b86-2fa9f6b19bb5@kernel.dk/T/#ma8a01d264c59074d3f9b69f833ffe408592cf837
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Change-Id: I3f9ca00fbefec97d1e3c9df85365be0265b8cec6
Revert most of commit 25a6f60b717d ("BACKPORT: bio: limit bio max size")
because it has been reported to cause data corruption and because it has
been reverted upstream. See also
https://lore.kernel.org/linux-block/1620571445.2k94orj8ee.none@localhost/T/#t
Cc: Changheun Lee <nanich.lee@samsung.com>
Cc: Jaegeuk Kim <jaegeuk@google.com>
Bug: 182716953
Change-Id: I79bf39d1d1a0c13cb30ff4b0f1da2b43e1c817f0
Signed-off-by: Bart Van Assche <bvanassche@google.com>
The inflight of partition 0 doesn't include inflight IOs to all
sub-partitions, since currently mq calculates inflight of specific
partition by simply camparing the value of the partition pointer.
Thus the following case is possible:
$ cat /sys/block/vda/inflight
0 0
$ cat /sys/block/vda/vda1/inflight
0 128
While single queue device (on a previous version, e.g. v3.10) has no
this issue:
$cat /sys/block/sda/sda3/inflight
0 33
$cat /sys/block/sda/inflight
0 33
Partition 0 should be specially handled since it represents the whole
disk. This issue is introduced since commit bf0ddaba65 ("blk-mq: fix
sysfs inflight counter").
Besides, this patch can also fix the inflight statistics of part 0 in
/proc/diskstats. Before this patch, the inflight statistics of part 0
doesn't include that of sub partitions. (I have marked the 'inflight'
field with asterisk.)
$cat /proc/diskstats
259 0 nvme0n1 45974469 0 367814768 6445794 1 0 1 0 *0* 111062 6445794 0 0 0 0 0 0
259 2 nvme0n1p1 45974058 0 367797952 6445727 0 0 0 0 *33* 111001 6445727 0 0 0 0 0 0
This is introduced since commit f299b7c7a9 ("blk-mq: provide internal
in-flight variant").
Fixes: bf0ddaba65 ("blk-mq: fix sysfs inflight counter")
Fixes: f299b7c7a9 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: adapt for 5.11 partition change]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 187355247
Change-Id: I378b2cb7312a5e47d5e2ec7301dc392e6e7336d0
(cherry picked from commit b0d97557ebfc9d5ba5f2939339a9fdd267abafeb)
Signed-off-by: Bart Van Assche <bvanassche@google.com>
bio size can grow up to 4GB when muli-page bvec is enabled.
but sometimes it would lead to inefficient behaviors.
in case of large chunk direct I/O, - 32MB chunk read in user space -
all pages for 32MB would be merged to a bio structure if the pages
physical addresses are contiguous. it makes some delay to submit
until merge complete. bio max size should be limited to a proper size.
When 32MB chunk read with direct I/O option is coming from userspace,
kernel behavior is below now in do_direct_IO() loop. it's timeline.
| bio merge for 32MB. total 8,192 pages are merged.
| total elapsed time is over 2ms.
|------------------ ... ----------------------->|
| 8,192 pages merged a bio.
| at this time, first bio submit is done.
| 1 bio is split to 32 read request and issue.
|--------------->
|--------------->
|--------------->
......
|--------------->
|--------------->|
total 19ms elapsed to complete 32MB read done from device. |
If bio max size is limited with 1MB, behavior is changed below.
| bio merge for 1MB. 256 pages are merged for each bio.
| total 32 bio will be made.
| total elapsed time is over 2ms. it's same.
| but, first bio submit timing is fast. about 100us.
|--->|--->|--->|---> ... -->|--->|--->|--->|--->|
| 256 pages merged a bio.
| at this time, first bio submit is done.
| and 1 read request is issued for 1 bio.
|--------------->
|--------------->
|--------------->
......
|--------------->
|--------------->|
total 17ms elapsed to complete 32MB read done from device. |
As a result, read request issue timing is faster if bio max size is limited.
Current kernel behavior with multipage bvec, super large bio can be created.
And it lead to delay first I/O request issue.
Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210503095203.29076-1-nanich.lee@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bug: 182716953
(cherry picked from commit cd2c7545ae1beac3b6aae033c7f31193b3255946)
Change-Id: Ie3876daa495535dc7f856ed9a281e65d72a437c1
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Changes in 5.10.33
vhost-vdpa: protect concurrent access to vhost device iotlb
gpio: omap: Save and restore sysconfig
KEYS: trusted: Fix TPM reservation for seal/unseal
vdpa/mlx5: Set err = -ENOMEM in case dma_map_sg_attrs fails
pinctrl: lewisburg: Update number of pins in community
block: return -EBUSY when there are open partitions in blkdev_reread_part
pinctrl: core: Show pin numbers for the controllers with base = 0
arm64: dts: allwinner: Revert SD card CD GPIO for Pine64-LTS
bpf: Permits pointers on stack for helper calls
bpf: Allow variable-offset stack access
bpf: Refactor and streamline bounds check into helper
bpf: Tighten speculative pointer arithmetic mask
locking/qrwlock: Fix ordering in queued_write_lock_slowpath()
perf/x86/intel/uncore: Remove uncore extra PCI dev HSWEP_PCI_PCU_3
perf/x86/kvm: Fix Broadwell Xeon stepping in isolation_ucodes[]
perf auxtrace: Fix potential NULL pointer dereference
perf map: Fix error return code in maps__clone()
HID: google: add don USB id
HID: alps: fix error return code in alps_input_configured()
HID cp2112: fix support for multiple gpiochips
HID: wacom: Assign boolean values to a bool variable
soc: qcom: geni: shield geni_icc_get() for ACPI boot
dmaengine: xilinx: dpdma: Fix descriptor issuing on video group
dmaengine: xilinx: dpdma: Fix race condition in done IRQ
ARM: dts: Fix swapped mmc order for omap3
net: geneve: check skb is large enough for IPv4/IPv6 header
dmaengine: tegra20: Fix runtime PM imbalance on error
s390/entry: save the caller of psw_idle
arm64: kprobes: Restore local irqflag if kprobes is cancelled
xen-netback: Check for hotplug-status existence before watching
cavium/liquidio: Fix duplicate argument
kasan: fix hwasan build for gcc
csky: change a Kconfig symbol name to fix e1000 build error
ia64: fix discontig.c section mismatches
ia64: tools: remove duplicate definition of ia64_mf() on ia64
x86/crash: Fix crash_setup_memmap_entries() out-of-bounds access
net: hso: fix NULL-deref on disconnect regression
USB: CDC-ACM: fix poison/unpoison imbalance
Linux 5.10.33
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I638db3c919ad938eaaaac3d687175252edcd7990
[ Upstream commit 68e6582e8f2dc32fd2458b9926564faa1fb4560e ]
The switch to go through blkdev_get_by_dev means we now ignore the
return value from bdev_disk_changed in __blkdev_get. Add a manual
check to restore the old semantics.
Fixes: 4601b4b130de ("block: reopen the device in blkdev_reread_part")
Reported-by: Karel Zak <kzak@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210421160502.447418-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.10.31
interconnect: core: fix error return code of icc_link_destroy()
gfs2: Flag a withdraw if init_threads() fails
KVM: arm64: Hide system instruction access to Trace registers
KVM: arm64: Disable guest access to trace filter controls
drm/imx: imx-ldb: fix out of bounds array access warning
gfs2: report "already frozen/thawed" errors
ftrace: Check if pages were allocated before calling free_pages()
tools/kvm_stat: Add restart delay
drm/tegra: dc: Don't set PLL clock to 0Hz
gpu: host1x: Use different lock classes for each client
XArray: Fix splitting to non-zero orders
block: only update parent bi_status when bio fail
radix tree test suite: Register the main thread with the RCU library
idr test suite: Take RCU read lock in idr_find_test_1
idr test suite: Create anchor before launching throbber
null_blk: fix command timeout completion handling
io_uring: don't mark S_ISBLK async work as unbounded
riscv,entry: fix misaligned base for excp_vect_table
block: don't ignore REQ_NOWAIT for direct IO
netfilter: x_tables: fix compat match/target pad out-of-bound write
perf map: Tighten snprintf() string precision to pass gcc check on some 32-bit arches
net: sfp: relax bitrate-derived mode check
net: sfp: cope with SFPs that set both LOS normal and LOS inverted
xen/events: fix setting irq affinity
Linux 5.10.31
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I19a7cfbdaab23e578dd82c552aea86d367c2f40f
[ Upstream commit 3edf5346e4f2ce2fa0c94651a90a8dda169565ee ]
For multiple split bios, if one of the bio is fail, the whole
should return error to application. But we found there is a race
between bio_integrity_verify_fn and bio complete, which return
io success to application after one of the bio fail. The race as
following:
split bio(READ) kworker
nvme_complete_rq
blk_update_request //split error=0
bio_endio
bio_integrity_endio
queue_work(kintegrityd_wq, &bip->bip_work);
bio_integrity_verify_fn
bio_endio //split bio
__bio_chain_endio
if (!parent->bi_status)
<interrupt entry>
nvme_irq
blk_update_request //parent error=7
req_bio_endio
bio->bi_status = 7 //parent bio
<interrupt exit>
parent->bi_status = 0
parent->bi_end_io() // return bi_status=0
The bio has been split as two: split and parent. When split
bio completed, it depends on kworker to do endio, while
bio_integrity_verify_fn have been interrupted by parent bio
complete irq handler. Then, parent bio->bi_status which have
been set in irq handler will overwrite by kworker.
In fact, even without the above race, we also need to conside
the concurrency beteen mulitple split bio complete and update
the same parent bi_status. Normally, multiple split bios will
be issued to the same hctx and complete from the same irq
vector. But if we have updated queue map between multiple split
bios, these bios may complete on different hw queue and different
irq vector. Then the concurrency update parent bi_status may
cause the final status error.
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210331115359.1125679-1-yuyufen@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.10.27
mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg and add nr_pages argument
mm/memcg: set memcg when splitting page
mt76: fix tx skb error handling in mt76_dma_tx_queue_skb
net: stmmac: fix dma physical address of descriptor when display ring
net: fec: ptp: avoid register access when ipg clock is disabled
powerpc/4xx: Fix build errors from mfdcr()
atm: eni: dont release is never initialized
atm: lanai: dont run lanai_dev_close if not open
Revert "r8152: adjust the settings about MAC clock speed down for RTL8153"
ALSA: hda: ignore invalid NHLT table
ixgbe: Fix memleak in ixgbe_configure_clsu32
scsi: ufs: ufs-qcom: Disable interrupt in reset path
blk-cgroup: Fix the recursive blkg rwstat
net: tehuti: fix error return code in bdx_probe()
net: intel: iavf: fix error return code of iavf_init_get_resources()
sun/niu: fix wrong RXMAC_BC_FRM_CNT_COUNT count
gianfar: fix jumbo packets+napi+rx overrun crash
cifs: ask for more credit on async read/write code paths
gfs2: fix use-after-free in trans_drain
cpufreq: blacklist Arm Vexpress platforms in cpufreq-dt-platdev
gpiolib: acpi: Add missing IRQF_ONESHOT
nfs: fix PNFS_FLEXFILE_LAYOUT Kconfig default
NFS: Correct size calculation for create reply length
net: hisilicon: hns: fix error return code of hns_nic_clear_all_rx_fetch()
net: wan: fix error return code of uhdlc_init()
net: davicom: Use platform_get_irq_optional()
net: enetc: set MAC RX FIFO to recommended value
atm: uPD98402: fix incorrect allocation
atm: idt77252: fix null-ptr-dereference
cifs: change noisy error message to FYI
irqchip/ingenic: Add support for the JZ4760
kbuild: add image_name to no-sync-config-targets
kbuild: dummy-tools: fix inverted tests for gcc
umem: fix error return code in mm_pci_probe()
sparc64: Fix opcode filtering in handling of no fault loads
habanalabs: Call put_pid() when releasing control device
staging: rtl8192e: fix kconfig dependency on CRYPTO
u64_stats,lockdep: Fix u64_stats_init() vs lockdep
kselftest: arm64: Fix exit code of sve-ptrace
regulator: qcom-rpmh: Correct the pmic5_hfsmps515 buck
block: Fix REQ_OP_ZONE_RESET_ALL handling
drm/amd/display: Revert dram_clock_change_latency for DCN2.1
drm/amdgpu: fb BO should be ttm_bo_type_device
drm/radeon: fix AGP dependency
nvme: simplify error logic in nvme_validate_ns()
nvme: add NVME_REQ_CANCELLED flag in nvme_cancel_request()
nvme-fc: set NVME_REQ_CANCELLED in nvme_fc_terminate_exchange()
nvme-fc: return NVME_SC_HOST_ABORTED_CMD when a command has been aborted
nvme-core: check ctrl css before setting up zns
nvme-rdma: Fix a use after free in nvmet_rdma_write_data_done
nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a
nfs: we don't support removing system.nfs4_acl
block: Suppress uevent for hidden device when removed
mm/fork: clear PASID for new mm
ia64: fix ia64_syscall_get_set_arguments() for break-based syscalls
ia64: fix ptrace(PTRACE_SYSCALL_INFO_EXIT) sign
static_call: Pull some static_call declarations to the type headers
static_call: Allow module use without exposing static_call_key
static_call: Fix the module key fixup
static_call: Fix static_call_set_init()
KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish
btrfs: fix sleep while in non-sleep context during qgroup removal
selinux: don't log MAC_POLICY_LOAD record on failed policy load
selinux: fix variable scope issue in live sidtab conversion
netsec: restore phy power state after controller reset
platform/x86: intel-vbtn: Stop reporting SW_DOCK events
psample: Fix user API breakage
z3fold: prevent reclaim/free race for headless pages
squashfs: fix inode lookup sanity checks
squashfs: fix xattr id and id lookup sanity checks
hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
kasan: fix per-page tags for non-page_alloc pages
gcov: fix clang-11+ support
ACPI: video: Add missing callback back for Sony VPCEH3U1E
ACPICA: Always create namespace nodes using acpi_ns_create_node()
arm64: stacktrace: don't trace arch_stack_walk()
arm64: dts: ls1046a: mark crypto engine dma coherent
arm64: dts: ls1012a: mark crypto engine dma coherent
arm64: dts: ls1043a: mark crypto engine dma coherent
ARM: dts: at91: sam9x60: fix mux-mask for PA7 so it can be set to A, B and C
ARM: dts: at91: sam9x60: fix mux-mask to match product's datasheet
ARM: dts: at91-sama5d27_som1: fix phy address to 7
integrity: double check iint_cache was initialized
drm/etnaviv: Use FOLL_FORCE for userptr
drm/amd/pm: workaround for audio noise issue
drm/amdgpu/display: restore AUX_DPHY_TX_CONTROL for DCN2.x
drm/amdgpu: Add additional Sienna Cichlid PCI ID
drm/i915: Fix the GT fence revocation runtime PM logic
dm verity: fix DM_VERITY_OPTS_MAX value
dm ioctl: fix out of bounds array access when no devices
bus: omap_l3_noc: mark l3 irqs as IRQF_NO_THREAD
ARM: OMAP2+: Fix smartreflex init regression after dropping legacy data
soc: ti: omap-prm: Fix occasional abort on reset deassert for dra7 iva
veth: Store queue_mapping independently of XDP prog presence
bpf: Change inode_storage's lookup_elem return value from NULL to -EBADF
libbpf: Fix INSTALL flag order
net/mlx5e: RX, Mind the MPWQE gaps when calculating offsets
net/mlx5e: When changing XDP program without reset, take refs for XSK RQs
net/mlx5e: Don't match on Geneve options in case option masks are all zero
ipv6: fix suspecious RCU usage warning
drop_monitor: Perform cleanup upon probe registration failure
macvlan: macvlan_count_rx() needs to be aware of preemption
net: sched: validate stab values
net: dsa: bcm_sf2: Qualify phydev->dev_flags based on port
igc: reinit_locked() should be called with rtnl_lock
igc: Fix Pause Frame Advertising
igc: Fix Supported Pause Frame Link Setting
igc: Fix igc_ptp_rx_pktstamp()
e1000e: add rtnl_lock() to e1000_reset_task
e1000e: Fix error handling in e1000_set_d0_lplu_state_82571
net/qlcnic: Fix a use after free in qlcnic_83xx_get_minidump_template
net: phy: broadcom: Add power down exit reset state delay
ftgmac100: Restart MAC HW once
clk: qcom: gcc-sc7180: Use floor ops for the correct sdcc1 clk
net: ipa: terminate message handler arrays
net: qrtr: fix a kernel-infoleak in qrtr_recvmsg()
flow_dissector: fix byteorder of dissected ICMP ID
selftests/bpf: Set gopt opt_class to 0 if get tunnel opt failed
netfilter: ctnetlink: fix dump of the expect mask attribute
net: hdlc_x25: Prevent racing between "x25_close" and "x25_xmit"/"x25_rx"
net: phylink: Fix phylink_err() function name error in phylink_major_config
tipc: better validate user input in tipc_nl_retrieve_key()
tcp: relookup sock for RST+ACK packets handled by obsolete req sock
can: isotp: isotp_setsockopt(): only allow to set low level TX flags for CAN-FD
can: isotp: TX-path: ensure that CAN frame flags are initialized
can: peak_usb: add forgotten supported devices
can: flexcan: flexcan_chip_freeze(): fix chip freeze for missing bitrate
can: kvaser_pciefd: Always disable bus load reporting
can: c_can_pci: c_can_pci_remove(): fix use-after-free
can: c_can: move runtime PM enable/disable to c_can_platform
can: m_can: m_can_do_rx_poll(): fix extraneous msg loss warning
can: m_can: m_can_rx_peripheral(): fix RX being blocked by errors
mac80211: fix rate mask reset
mac80211: Allow HE operation to be longer than expected.
selftests/net: fix warnings on reuseaddr_ports_exhausted
nfp: flower: fix unsupported pre_tunnel flows
nfp: flower: add ipv6 bit to pre_tunnel control message
nfp: flower: fix pre_tun mask id allocation
ftrace: Fix modify_ftrace_direct.
drm/msm/dsi: fix check-before-set in the 7nm dsi_pll code
ionic: linearize tso skb with too many frags
net/sched: cls_flower: fix only mask bit check in the validate_ct_state
netfilter: nftables: report EOPNOTSUPP on unsupported flowtable flags
netfilter: nftables: allow to update flowtable flags
netfilter: flowtable: Make sure GC works periodically in idle system
libbpf: Fix error path in bpf_object__elf_init()
libbpf: Use SOCK_CLOEXEC when opening the netlink socket
ARM: dts: imx6ull: fix ubi filesystem mount failed
ipv6: weaken the v4mapped source check
octeontx2-af: Formatting debugfs entry rsrc_alloc.
octeontx2-af: Modify default KEX profile to extract TX packet fields
octeontx2-af: Remove TOS field from MKEX TX
octeontx2-af: Fix irq free in rvu teardown
octeontx2-pf: Clear RSS enable flag on interace down
octeontx2-af: fix infinite loop in unmapping NPC counter
net: check all name nodes in __dev_alloc_name
net: cdc-phonet: fix data-interface release on probe failure
igb: check timestamp validity
r8152: limit the RX buffer size of RTL8153A for USB 2.0
net: stmmac: dwmac-sun8i: Provide TX and RX fifo sizes
selinux: vsock: Set SID for socket returned by accept()
selftests: forwarding: vxlan_bridge_1d: Fix vxlan ecn decapsulate value
libbpf: Fix BTF dump of pointer-to-array-of-struct
bpf: Fix umd memory leak in copy_process()
can: isotp: tx-path: zero initialize outgoing CAN frames
drm/msm: fix shutdown hook in case GPU components failed to bind
drm/msm: Fix suspend/resume on i.MX5
arm64: kdump: update ppos when reading elfcorehdr
PM: runtime: Defer suspending suppliers
net/mlx5: Add back multicast stats for uplink representor
net/mlx5e: Allow to match on MPLS parameters only for MPLS over UDP
net/mlx5e: Offload tuple rewrite for non-CT flows
net/mlx5e: Fix error path for ethtool set-priv-flag
PM: EM: postpone creating the debugfs dir till fs_initcall
net: bridge: don't notify switchdev for local FDB addresses
octeontx2-af: Fix memory leak of object buf
xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on MEMORY_HOTPLUG
RDMA/cxgb4: Fix adapter LE hash errors while destroying ipv6 listening server
bpf: Don't do bpf_cgroup_storage_set() for kuprobe/tp programs
net: Consolidate common blackhole dst ops
net, bpf: Fix ip6ip6 crash with collect_md populated skbs
igb: avoid premature Rx buffer reuse
net: axienet: Properly handle PCS/PMA PHY for 1000BaseX mode
net: axienet: Fix probe error cleanup
net: phy: introduce phydev->port
net: phy: broadcom: Avoid forward for bcm54xx_config_clock_delay()
net: phy: broadcom: Set proper 1000BaseX/SGMII interface mode for BCM54616S
net: phy: broadcom: Fix RGMII delays for BCM50160 and BCM50610M
Revert "netfilter: x_tables: Switch synchronization to RCU"
netfilter: x_tables: Use correct memory barriers.
dm table: Fix zoned model check and zone sectors check
mm/mmu_notifiers: ensure range_end() is paired with range_start()
Revert "netfilter: x_tables: Update remaining dereference to RCU"
ACPI: scan: Rearrange memory allocation in acpi_device_add()
ACPI: scan: Use unique number for instance_no
perf auxtrace: Fix auxtrace queue conflict
perf synthetic events: Avoid write of uninitialized memory when generating PERF_RECORD_MMAP* records
io_uring: fix provide_buffers sign extension
block: recalculate segment count for multi-segment discards correctly
scsi: Revert "qla2xxx: Make sure that aborted commands are freed"
scsi: qedi: Fix error return code of qedi_alloc_global_queues()
scsi: mpt3sas: Fix error return code of mpt3sas_base_attach()
smb3: fix cached file size problems in duplicate extents (reflink)
cifs: Adjust key sizes and key generation routines for AES256 encryption
locking/mutex: Fix non debug version of mutex_lock_io_nested()
x86/mem_encrypt: Correct physical address calculation in __set_clr_pte_enc()
mm/memcg: fix 5.10 backport of splitting page memcg
fs/cachefiles: Remove wait_bit_key layout dependency
ch_ktls: fix enum-conversion warning
can: dev: Move device back to init netns on owning netns delete
r8169: fix DMA being used after buffer free if WoL is enabled
net: dsa: b53: VLAN filtering is global to all users
mac80211: fix double free in ibss_leave
ext4: add reclaim checks to xattr code
fs/ext4: fix integer overflow in s_log_groups_per_flex
Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"
Revert "net: bonding: fix error return code of bond_neigh_init()"
nvme: fix the nsid value to print in nvme_validate_or_alloc_ns
can: peak_usb: Revert "can: peak_usb: add forgotten supported devices"
xen-blkback: don't leak persistent grants from xen_blkbk_map()
Linux 5.10.27
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I7eafe976fd6bf33db6db4adb8ebf2ff087294a23
[ Upstream commit a958937ff166fc60d1c3a721036f6ff41bfa2821 ]
When a stacked block device inserts a request into another block device
using blk_insert_cloned_request, the request's nr_phys_segments field gets
recalculated by a call to blk_recalc_rq_segments in
blk_cloned_rq_check_limits. But blk_recalc_rq_segments does not know how to
handle multi-segment discards. For disk types which can handle
multi-segment discards like nvme, this results in discard requests which
claim a single segment when it should report several, triggering a warning
in nvme and causing nvme to fail the discard from the invalid state.
WARNING: CPU: 5 PID: 191 at drivers/nvme/host/core.c:700 nvme_setup_discard+0x170/0x1e0 [nvme_core]
...
nvme_setup_cmd+0x217/0x270 [nvme_core]
nvme_loop_queue_rq+0x51/0x1b0 [nvme_loop]
__blk_mq_try_issue_directly+0xe7/0x1b0
blk_mq_request_issue_directly+0x41/0x70
? blk_account_io_start+0x40/0x50
dm_mq_queue_rq+0x200/0x3e0
blk_mq_dispatch_rq_list+0x10a/0x7d0
? __sbitmap_queue_get+0x25/0x90
? elv_rb_del+0x1f/0x30
? deadline_remove_request+0x55/0xb0
? dd_dispatch_request+0x181/0x210
__blk_mq_do_dispatch_sched+0x144/0x290
? bio_attempt_discard_merge+0x134/0x1f0
__blk_mq_sched_dispatch_requests+0x129/0x180
blk_mq_sched_dispatch_requests+0x30/0x60
__blk_mq_run_hw_queue+0x47/0xe0
__blk_mq_delay_run_hw_queue+0x15b/0x170
blk_mq_sched_insert_requests+0x68/0xe0
blk_mq_flush_plug_list+0xf0/0x170
blk_finish_plug+0x36/0x50
xlog_cil_committed+0x19f/0x290 [xfs]
xlog_cil_process_committed+0x57/0x80 [xfs]
xlog_state_do_callback+0x1e0/0x2a0 [xfs]
xlog_ioend_work+0x2f/0x80 [xfs]
process_one_work+0x1b6/0x350
worker_thread+0x53/0x3e0
? process_one_work+0x350/0x350
kthread+0x11b/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x22/0x30
This patch fixes blk_recalc_rq_segments to be aware of devices which can
have multi-segment discards. It calculates the correct discard segment
count by counting the number of bio as each discard bio is considered its
own segment.
Fixes: 1e739730c5 ("block: optionally merge discontiguous discard bios into a single request")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Link: https://lore.kernel.org/r/20210211143807.GA115624@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9ec491447b90ad6a4056a9656b13f0b3a1e83043 ]
register_disk() suppress uevents for devices with the GENHD_FL_HIDDEN
but enables uevents at the end again in order to announce disk after
possible partitions are created.
When the device is removed the uevents are still on and user land sees
'remove' messages for devices which were never 'add'ed to the system.
KERNEL[95481.571887] remove /devices/virtual/nvme-fabrics/ctl/nvme5/nvme0c5n1 (block)
Let's suppress the uevents for GENHD_FL_HIDDEN by not enabling the
uevents at all.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Link: https://lore.kernel.org/r/20210311151917.136091-1-dwagner@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit faa44c69daf9ccbd5b8a1aee13e0e0d037c0be17 ]
Similarly to a single zone reset operation (REQ_OP_ZONE_RESET), execute
REQ_OP_ZONE_RESET_ALL operations with REQ_SYNC set.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4f44657d74873735e93a50eb25014721a66aac19 ]
The current blkio.throttle.io_service_bytes_recursive doesn't
work correctly.
As an example, for the following blkcg hierarchy:
(Made 1GB READ in test1, 512MB READ in test2)
test
/ \
test1 test2
$ head -n 1 test/test1/blkio.throttle.io_service_bytes_recursive
8:0 Read 1073684480
$ head -n 1 test/test2/blkio.throttle.io_service_bytes_recursive
8:0 Read 537448448
$ head -n 1 test/blkio.throttle.io_service_bytes_recursive
8:0 Read 537448448
Clearly, above data of "test" reflects "test2" not "test1"+"test2".
Do the correct summary in blkg_rwstat_recursive_sum().
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmBSKS0ACgkQONu9yGCS
aT7Ngg//c4C1WnWC0sNWzP3xT2paCkLnUUyjQTmrkbPvLtr2DvehW5Bvp/32pGiS
8mDMoTLq3QxNrfrU6SY3KavZRC9Pa+migAsVmuujygQwNphqv95/XxnFemFEAYTl
b8b5OJPyomzMMEwHzx1Tr+7/d58czrqXo97QI0lmaDlHl+9JKTg2SMX9AkHkU8pK
zYjbtzdhd9UZCTdVYY1ZFkQ1ik1iAWo3Xv0G2aMeQQpuGcZIh/Y66xBuyH+8g+Yz
3mInhPQvhkb+c+m4ZJ9NhOUVEW4Fl0fq0mVrrYkfHqXe0D36Vj/yYvO/yTSBqb4+
XQ5PLXX3KFVDFl1id94unXGgP3c0zBe30JZPqKdpSET+PzOlGiZTxMCfjPeTgu/Z
7xc2qSX1zn273HMTRrT1daO4/NXQ85kE04mZMzq7cqDpum7ltfKrEMum/Gma+dJz
Knn47oZHbSW4Er/WcAwHSeZpxvD7AahG/GlsQRy+IVPu/jMXJHmo2/Nv1fLJWp+G
7VVWRXug69hywGr7hFiT3USG2C5g5cV3/dEO8NFFjGKRa5CbLbQD6B3+Dz3dXyBH
jE3MGIoqoNk+SvJOAf2ogu7SS6wLynZWOchmAVvIQ4QEzcP2jroeFHKD49MYxDUE
dKcq0dtfMc4nUaUZ/XRfWtS9fSm+T4XonmvEY4yXnAyfZ0aeEM8=
=FdFm
-----END PGP SIGNATURE-----
Merge 5.10.24 into android12-5.10-lts
Changes in 5.10.24
uapi: nfnetlink_cthelper.h: fix userspace compilation error
powerpc/perf: Fix handling of privilege level checks in perf interrupt context
powerpc/pseries: Don't enforce MSI affinity with kdump
ethernet: alx: fix order of calls on resume
crypto: mips/poly1305 - enable for all MIPS processors
ath9k: fix transmitting to stations in dynamic SMPS mode
net: Fix gro aggregation for udp encaps with zero csum
net: check if protocol extracted by virtio_net_hdr_set_proto is correct
net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0
net: l2tp: reduce log level of messages in receive path, add counter instead
can: skb: can_skb_set_owner(): fix ref counting if socket was closed before setting skb ownership
can: flexcan: assert FRZ bit in flexcan_chip_freeze()
can: flexcan: enable RX FIFO after FRZ/HALT valid
can: flexcan: invoke flexcan_chip_freeze() to enter freeze mode
can: tcan4x5x: tcan4x5x_init(): fix initialization - clear MRAM before entering Normal Mode
tcp: Fix sign comparison bug in getsockopt(TCP_ZEROCOPY_RECEIVE)
tcp: add sanity tests to TCP_QUEUE_SEQ
netfilter: nf_nat: undo erroneous tcp edemux lookup
netfilter: x_tables: gpf inside xt_find_revision()
net: always use icmp{,v6}_ndo_send from ndo_start_xmit
net: phy: fix save wrong speed and duplex problem if autoneg is on
selftests/bpf: Use the last page in test_snprintf_btf on s390
selftests/bpf: No need to drop the packet when there is no geneve opt
selftests/bpf: Mask bpf_csum_diff() return value to 16 bits in test_verifier
samples, bpf: Add missing munmap in xdpsock
libbpf: Clear map_info before each bpf_obj_get_info_by_fd
ibmvnic: Fix possibly uninitialized old_num_tx_queues variable warning.
ibmvnic: always store valid MAC address
mt76: dma: do not report truncated frames to mac80211
powerpc/603: Fix protection of user pages mapped with PROT_NONE
mount: fix mounting of detached mounts onto targets that reside on shared mounts
cifs: return proper error code in statfs(2)
Revert "mm, slub: consider rest of partial list if acquire_slab() fails"
docs: networking: drop special stable handling
net: dsa: tag_rtl4_a: fix egress tags
sh_eth: fix TRSCER mask for SH771x
net: enetc: don't overwrite the RSS indirection table when initializing
net: enetc: take the MDIO lock only once per NAPI poll cycle
net: enetc: fix incorrect TPID when receiving 802.1ad tagged packets
net: enetc: don't disable VLAN filtering in IFF_PROMISC mode
net: enetc: force the RGMII speed and duplex instead of operating in inband mode
net: enetc: remove bogus write to SIRXIDR from enetc_setup_rxbdr
net: enetc: keep RX ring consumer index in sync with hardware
net: ethernet: mtk-star-emac: fix wrong unmap in RX handling
net/mlx4_en: update moderation when config reset
net: stmmac: fix incorrect DMA channel intr enable setting of EQoS v4.10
nexthop: Do not flush blackhole nexthops when loopback goes down
net: sched: avoid duplicates in classes dump
net: mscc: ocelot: properly reject destination IP keys in VCAP IS1
net: dsa: sja1105: fix SGMII PCS being forced to SPEED_UNKNOWN instead of SPEED_10
net: usb: qmi_wwan: allow qmimux add/del with master up
netdevsim: init u64 stats for 32bit hardware
cipso,calipso: resolve a number of problems with the DOI refcounts
net: stmmac: Fix VLAN filter delete timeout issue in Intel mGBE SGMII
stmmac: intel: Fixes clock registration error seen for multiple interfaces
net: lapbether: Remove netif_start_queue / netif_stop_queue
net: davicom: Fix regulator not turned off on failed probe
net: davicom: Fix regulator not turned off on driver removal
net: enetc: allow hardware timestamping on TX queues with tc-etf enabled
net: qrtr: fix error return code of qrtr_sendmsg()
s390/qeth: fix memory leak after failed TX Buffer allocation
r8169: fix r8168fp_adjust_ocp_cmd function
ixgbe: fail to create xfrm offload of IPsec tunnel mode SA
tools/resolve_btfids: Fix build error with older host toolchains
perf build: Fix ccache usage in $(CC) when generating arch errno table
net: stmmac: stop each tx channel independently
net: stmmac: fix watchdog timeout during suspend/resume stress test
net: stmmac: fix wrongly set buffer2 valid when sph unsupport
ethtool: fix the check logic of at least one channel for RX/TX
net: phy: make mdio_bus_phy_suspend/resume as __maybe_unused
selftests: forwarding: Fix race condition in mirror installation
mlxsw: spectrum_ethtool: Add an external speed to PTYS register
perf traceevent: Ensure read cmdlines are null terminated.
perf report: Fix -F for branch & mem modes
net: hns3: fix query vlan mask value error for flow director
net: hns3: fix bug when calculating the TCAM table info
s390/cio: return -EFAULT if copy_to_user() fails again
bnxt_en: reliably allocate IRQ table on reset to avoid crash
gpiolib: acpi: Add ACPI_GPIO_QUIRK_ABSOLUTE_NUMBER quirk
gpiolib: acpi: Allow to find GpioInt() resource by name and index
gpio: pca953x: Set IRQ type when handle Intel Galileo Gen 2
gpio: fix gpio-device list corruption
drm/compat: Clear bounce structures
drm/amd/display: Add a backlight module option
drm/amdgpu/display: use GFP_ATOMIC in dcn21_validate_bandwidth_fp()
drm/amd/display: Fix nested FPU context in dcn21_validate_bandwidth()
drm/amd/pm: bug fix for pcie dpm
drm/amdgpu/display: simplify backlight setting
drm/amdgpu/display: don't assert in set backlight function
drm/amdgpu/display: handle aux backlight in backlight_get_brightness
drm/shmem-helper: Check for purged buffers in fault handler
drm/shmem-helper: Don't remove the offset in vm_area_struct pgoff
drm: Use USB controller's DMA mask when importing dmabufs
drm: meson_drv add shutdown function
drm/shmem-helpers: vunmap: Don't put pages for dma-buf
drm/i915: Wedge the GPU if command parser setup fails
s390/cio: return -EFAULT if copy_to_user() fails
s390/crypto: return -EFAULT if copy_to_user() fails
qxl: Fix uninitialised struct field head.surface_id
sh_eth: fix TRSCER mask for R7S9210
media: usbtv: Fix deadlock on suspend
media: rkisp1: params: fix wrong bits settings
media: v4l: vsp1: Fix uif null pointer access
media: v4l: vsp1: Fix bru null pointer access
media: rc: compile rc-cec.c into rc-core
cifs: fix credit accounting for extra channel
net: hns3: fix error mask definition of flow director
s390/qeth: don't replace a fully completed async TX buffer
s390/qeth: remove QETH_QDIO_BUF_HANDLED_DELAYED state
s390/qeth: improve completion of pending TX buffers
s390/qeth: fix notification for pending buffers during teardown
net: dsa: implement a central TX reallocation procedure
net: dsa: tag_ksz: don't allocate additional memory for padding/tagging
net: dsa: trailer: don't allocate additional memory for padding/tagging
net: dsa: tag_qca: let DSA core deal with TX reallocation
net: dsa: tag_ocelot: let DSA core deal with TX reallocation
net: dsa: tag_mtk: let DSA core deal with TX reallocation
net: dsa: tag_lan9303: let DSA core deal with TX reallocation
net: dsa: tag_edsa: let DSA core deal with TX reallocation
net: dsa: tag_brcm: let DSA core deal with TX reallocation
net: dsa: tag_dsa: let DSA core deal with TX reallocation
net: dsa: tag_gswip: let DSA core deal with TX reallocation
net: dsa: tag_ar9331: let DSA core deal with TX reallocation
net: dsa: tag_mtk: fix 802.1ad VLAN egress
enetc: Fix unused var build warning for CONFIG_OF
net: enetc: initialize RFS/RSS memories for unused ports too
ath11k: peer delete synchronization with firmware
ath11k: start vdev if a bss peer is already created
ath11k: fix AP mode for QCA6390
i2c: rcar: faster irq code to minimize HW race condition
i2c: rcar: optimize cacheline to minimize HW race condition
scsi: ufs: WB is only available on LUN #0 to #7
udf: fix silent AED tagLocation corruption
iommu/vt-d: Clear PRQ overflow only when PRQ is empty
mmc: mxs-mmc: Fix a resource leak in an error handling path in 'mxs_mmc_probe()'
mmc: mediatek: fix race condition between msdc_request_timeout and irq
mmc: sdhci-iproc: Add ACPI bindings for the RPi
Platform: OLPC: Fix probe error handling
powerpc/pci: Add ppc_md.discover_phbs()
spi: stm32: make spurious and overrun interrupts visible
powerpc: improve handling of unrecoverable system reset
powerpc/perf: Record counter overflow always if SAMPLE_IP is unset
HID: logitech-dj: add support for the new lightspeed connection iteration
powerpc/64: Fix stack trace not displaying final frame
iommu/amd: Fix performance counter initialization
clk: qcom: gdsc: Implement NO_RET_PERIPH flag
sparc32: Limit memblock allocation to low memory
sparc64: Use arch_validate_flags() to validate ADI flag
Input: applespi - don't wait for responses to commands indefinitely.
PCI: xgene-msi: Fix race in installing chained irq handler
PCI: mediatek: Add missing of_node_put() to fix reference leak
drivers/base: build kunit tests without structleak plugin
PCI/LINK: Remove bandwidth notification
ext4: don't try to processed freed blocks until mballoc is initialized
kbuild: clamp SUBLEVEL to 255
PCI: Fix pci_register_io_range() memory leak
i40e: Fix memory leak in i40e_probe
kasan: fix memory corruption in kasan_bitops_tags test
s390/smp: __smp_rescan_cpus() - move cpumask away from stack
drivers/base/memory: don't store phys_device in memory blocks
sysctl.c: fix underflow value setting risk in vm_table
scsi: libiscsi: Fix iscsi_prep_scsi_cmd_pdu() error handling
scsi: target: core: Add cmd length set before cmd complete
scsi: target: core: Prevent underflow for service actions
clk: qcom: gpucc-msm8998: Add resets, cxc, fix flags on gpu_gx_gdsc
mmc: sdhci: Update firmware interface API
ARM: 9029/1: Make iwmmxt.S support Clang's integrated assembler
ARM: assembler: introduce adr_l, ldr_l and str_l macros
ARM: efistub: replace adrl pseudo-op with adr_l macro invocation
ALSA: usb: Add Plantronics C320-M USB ctrl msg delay quirk
ALSA: hda/hdmi: Cancel pending works before suspend
ALSA: hda/conexant: Add quirk for mute LED control on HP ZBook G5
ALSA: hda/ca0132: Add Sound BlasterX AE-5 Plus support
ALSA: hda: Drop the BATCH workaround for AMD controllers
ALSA: hda: Flush pending unsolicited events before suspend
ALSA: hda: Avoid spurious unsol event handling during S3/S4
ALSA: usb-audio: Fix "cannot get freq eq" errors on Dell AE515 sound bar
ALSA: usb-audio: Apply the control quirk to Plantronics headsets
ALSA: usb-audio: Disable USB autosuspend properly in setup_disable_autosuspend()
ALSA: usb-audio: fix NULL ptr dereference in usb_audio_probe
ALSA: usb-audio: fix use after free in usb_audio_disconnect
Revert 95ebabde382c ("capabilities: Don't allow writing ambiguous v3 file capabilities")
block: Discard page cache of zone reset target range
block: Try to handle busy underlying device on discard
arm64: kasan: fix page_alloc tagging with DEBUG_VIRTUAL
arm64: mte: Map hotplugged memory as Normal Tagged
arm64: perf: Fix 64-bit event counter read truncation
s390/dasd: fix hanging DASD driver unbind
s390/dasd: fix hanging IO request during DASD driver unbind
software node: Fix node registration
xen/events: reset affinity of 2-level event when tearing it down
mmc: mmci: Add MMC_CAP_NEED_RSP_BUSY for the stm32 variants
mmc: core: Fix partition switch time for eMMC
mmc: cqhci: Fix random crash when remove mmc module/card
cifs: do not send close in compound create+close requests
Goodix Fingerprint device is not a modem
USB: gadget: udc: s3c2410_udc: fix return value check in s3c2410_udc_probe()
USB: gadget: u_ether: Fix a configfs return code
usb: gadget: f_uac2: always increase endpoint max_packet_size by one audio slot
usb: gadget: f_uac1: stop playback on function disable
usb: dwc3: qcom: Add missing DWC3 OF node refcount decrement
usb: dwc3: qcom: add URS Host support for sdm845 ACPI boot
usb: dwc3: qcom: add ACPI device id for sc8180x
usb: dwc3: qcom: Honor wakeup enabled/disabled state
USB: usblp: fix a hang in poll() if disconnected
usb: renesas_usbhs: Clear PIPECFG for re-enabling pipe with other EPNUM
usb: xhci: do not perform Soft Retry for some xHCI hosts
xhci: Improve detection of device initiated wake signal.
usb: xhci: Fix ASMedia ASM1042A and ASM3242 DMA addressing
xhci: Fix repeated xhci wake after suspend due to uncleared internal wake state
USB: serial: io_edgeport: fix memory leak in edge_startup
USB: serial: ch341: add new Product ID
USB: serial: cp210x: add ID for Acuity Brands nLight Air Adapter
USB: serial: cp210x: add some more GE USB IDs
usbip: fix stub_dev to check for stream socket
usbip: fix vhci_hcd to check for stream socket
usbip: fix vudc to check for stream socket
usbip: fix stub_dev usbip_sockfd_store() races leading to gpf
usbip: fix vhci_hcd attach_store() races leading to gpf
usbip: fix vudc usbip_sockfd_store races leading to gpf
Revert "serial: max310x: rework RX interrupt handling"
misc/pvpanic: Export module FDT device table
misc: fastrpc: restrict user apps from sending kernel RPC messages
staging: rtl8192u: fix ->ssid overflow in r8192_wx_set_scan()
staging: rtl8188eu: prevent ->ssid overflow in rtw_wx_set_scan()
staging: rtl8712: unterminated string leads to read overflow
staging: rtl8188eu: fix potential memory corruption in rtw_check_beacon_data()
staging: ks7010: prevent buffer overflow in ks_wlan_set_scan()
staging: rtl8712: Fix possible buffer overflow in r8712_sitesurvey_cmd
staging: rtl8192e: Fix possible buffer overflow in _rtl92e_wx_set_scan
staging: comedi: addi_apci_1032: Fix endian problem for COS sample
staging: comedi: addi_apci_1500: Fix endian problem for command sample
staging: comedi: adv_pci1710: Fix endian problem for AI command data
staging: comedi: das6402: Fix endian problem for AI command data
staging: comedi: das800: Fix endian problem for AI command data
staging: comedi: dmm32at: Fix endian problem for AI command data
staging: comedi: me4000: Fix endian problem for AI command data
staging: comedi: pcl711: Fix endian problem for AI command data
staging: comedi: pcl818: Fix endian problem for AI command data
sh_eth: fix TRSCER mask for R7S72100
cpufreq: qcom-hw: fix dereferencing freed memory 'data'
cpufreq: qcom-hw: Fix return value check in qcom_cpufreq_hw_cpu_init()
arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory
SUNRPC: Set memalloc_nofs_save() for sync tasks
NFS: Don't revalidate the directory permissions on a lookup failure
NFS: Don't gratuitously clear the inode cache when lookup failed
NFSv4.2: fix return value of _nfs4_get_security_label()
block: rsxx: fix error return code of rsxx_pci_probe()
nvme-fc: fix racing controller reset and create association
configfs: fix a use-after-free in __configfs_open_file
arm64: mm: use a 48-bit ID map when possible on 52-bit VA builds
perf/core: Flush PMU internal buffers for per-CPU events
perf/x86/intel: Set PERF_ATTACH_SCHED_CB for large PEBS and LBR
hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
powerpc/64s/exception: Clean up a missed SRR specifier
seqlock,lockdep: Fix seqcount_latch_init()
stop_machine: mark helpers __always_inline
include/linux/sched/mm.h: use rcu_dereference in in_vfork()
zram: fix return value on writeback_store
linux/compiler-clang.h: define HAVE_BUILTIN_BSWAP*
sched/membarrier: fix missing local execution of ipi_sync_rq_state()
efi: stub: omit SetVirtualAddressMap() if marked unsupported in RT_PROP table
powerpc/64s: Fix instruction encoding for lis in ppc_function_entry()
powerpc: Fix inverted SET_FULL_REGS bitop
powerpc: Fix missing declaration of [en/dis]able_kernel_vsx()
binfmt_misc: fix possible deadlock in bm_register_write
x86/unwind/orc: Disable KASAN checking in the ORC unwinder, part 2
x86/sev-es: Introduce ip_within_syscall_gap() helper
x86/sev-es: Check regs->sp is trusted before adjusting #VC IST stack
x86/entry: Move nmi entry/exit into common code
x86/sev-es: Correctly track IRQ states in runtime #VC handler
x86/sev-es: Use __copy_from_user_inatomic()
x86/entry: Fix entry/exit mismatch on failed fast 32-bit syscalls
KVM: x86: Ensure deadline timer has truly expired before posting its IRQ
KVM: kvmclock: Fix vCPUs > 64 can't be online/hotpluged
KVM: arm64: Fix range alignment when walking page tables
KVM: arm64: Avoid corrupting vCPU context register in guest exit
KVM: arm64: nvhe: Save the SPE context early
KVM: arm64: Reject VM creation when the default IPA size is unsupported
KVM: arm64: Fix exclusive limit for IPA size
mm/userfaultfd: fix memory corruption due to writeprotect
mm/madvise: replace ptrace attach requirement for process_madvise
KVM: arm64: Ensure I-cache isolation between vcpus of a same VM
mm/page_alloc.c: refactor initialization of struct page for holes in memory layout
xen/events: don't unmask an event channel when an eoi is pending
xen/events: avoid handling the same event on two cpus at the same time
KVM: arm64: Fix nVHE hyp panic host context restore
RDMA/umem: Use ib_dma_max_seg_size instead of dma_get_max_seg_size
Linux 5.10.24
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie53a3c1963066a18d41357b6be41cff00690bd40
commit e5113505904ea1c1c0e1f92c1cfa91fbf4da1694 upstream.
When zone reset ioctl and data read race for a same zone on zoned block
devices, the data read leaves stale page cache even though the zone
reset ioctl zero clears all the zone data on the device. To avoid
non-zero data read from the stale page cache after zone reset, discard
page cache of reset target zones in blkdev_zone_mgmt_ioctl(). Introduce
the helper function blkdev_truncate_zone_range() to discard the page
cache. Ensure the page cache discarded by calling the helper function
before and after zone reset in same manner as fallocate does.
This patch can be applied back to the stable kernel version v5.10.y.
Rework is needed for older stable kernels.
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 3ed05a987e ("blk-zoned: implement ioctls")
Cc: <stable@vger.kernel.org> # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210311072546.678999-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.20
vmlinux.lds.h: add DWARF v5 sections
vdpa/mlx5: fix param validation in mlx5_vdpa_get_config()
debugfs: be more robust at handling improper input in debugfs_lookup()
debugfs: do not attempt to create a new file before the filesystem is initalized
scsi: libsas: docs: Remove notify_ha_event()
scsi: qla2xxx: Fix mailbox Ch erroneous error
kdb: Make memory allocations more robust
w1: w1_therm: Fix conversion result for negative temperatures
PCI: qcom: Use PHY_REFCLK_USE_PAD only for ipq8064
PCI: Decline to resize resources if boot config must be preserved
virt: vbox: Do not use wait_event_interruptible when called from kernel context
bfq: Avoid false bfq queue merging
ALSA: usb-audio: Fix PCM buffer allocation in non-vmalloc mode
MIPS: vmlinux.lds.S: add missing PAGE_ALIGNED_DATA() section
vmlinux.lds.h: Define SANTIZER_DISCARDS with CONFIG_GCOV_KERNEL=y
random: fix the RNDRESEEDCRNG ioctl
ALSA: pcm: Call sync_stop at disconnection
ALSA: pcm: Assure sync with the pending stop operation at suspend
ALSA: pcm: Don't call sync_stop if it hasn't been stopped
drm/i915/gt: One more flush for Baytrail clear residuals
ath10k: Fix error handling in case of CE pipe init failure
Bluetooth: btqcomsmd: Fix a resource leak in error handling paths in the probe function
Bluetooth: hci_uart: Fix a race for write_work scheduling
Bluetooth: Fix initializing response id after clearing struct
arm64: dts: renesas: beacon kit: Fix choppy Bluetooth Audio
arm64: dts: renesas: beacon: Fix audio-1.8V pin enable
ARM: dts: exynos: correct PMIC interrupt trigger level on Artik 5
ARM: dts: exynos: correct PMIC interrupt trigger level on Monk
ARM: dts: exynos: correct PMIC interrupt trigger level on Rinato
ARM: dts: exynos: correct PMIC interrupt trigger level on Spring
ARM: dts: exynos: correct PMIC interrupt trigger level on Arndale Octa
ARM: dts: exynos: correct PMIC interrupt trigger level on Odroid XU3 family
arm64: dts: exynos: correct PMIC interrupt trigger level on TM2
arm64: dts: exynos: correct PMIC interrupt trigger level on Espresso
memory: mtk-smi: Fix PM usage counter unbalance in mtk_smi ops
Bluetooth: hci_qca: Fix memleak in qca_controller_memdump
staging: vchiq: Fix bulk userdata handling
staging: vchiq: Fix bulk transfers on 64-bit builds
arm64: dts: qcom: msm8916-samsung-a5u: Fix iris compatible
net: stmmac: dwmac-meson8b: fix enabling the timing-adjustment clock
bpf: Add bpf_patch_call_args prototype to include/linux/bpf.h
bpf: Avoid warning when re-casting __bpf_call_base into __bpf_call_base_args
firmware: arm_scmi: Fix call site of scmi_notification_exit
arm64: dts: allwinner: A64: properly connect USB PHY to port 0
arm64: dts: allwinner: H6: properly connect USB PHY to port 0
arm64: dts: allwinner: Drop non-removable from SoPine/LTS SD card
arm64: dts: allwinner: H6: Allow up to 150 MHz MMC bus frequency
arm64: dts: allwinner: A64: Limit MMC2 bus frequency to 150 MHz
arm64: dts: qcom: msm8916-samsung-a2015: Fix sensors
cpufreq: brcmstb-avs-cpufreq: Free resources in error path
cpufreq: brcmstb-avs-cpufreq: Fix resource leaks in ->remove()
arm64: dts: rockchip: rk3328: Add clock_in_out property to gmac2phy node
ACPICA: Fix exception code class checks
usb: gadget: u_audio: Free requests only after callback
arm64: dts: qcom: sdm845-db845c: Fix reset-pin of ov8856 node
soc: qcom: socinfo: Fix an off by one in qcom_show_pmic_model()
soc: ti: pm33xx: Fix some resource leak in the error handling paths of the probe function
staging: media: atomisp: Fix size_t format specifier in hmm_alloc() debug statemenet
Bluetooth: drop HCI device reference before return
Bluetooth: Put HCI device if inquiry procedure interrupts
memory: ti-aemif: Drop child node when jumping out loop
ARM: dts: Configure missing thermal interrupt for 4430
usb: dwc2: Do not update data length if it is 0 on inbound transfers
usb: dwc2: Abort transaction after errors with unknown reason
usb: dwc2: Make "trimming xfer length" a debug message
staging: rtl8723bs: wifi_regd.c: Fix incorrect number of regulatory rules
x86/MSR: Filter MSR writes through X86_IOC_WRMSR_REGS ioctl too
arm64: dts: renesas: beacon: Fix EEPROM compatible value
can: mcp251xfd: mcp251xfd_probe(): fix errata reference
ARM: dts: armada388-helios4: assign pinctrl to LEDs
ARM: dts: armada388-helios4: assign pinctrl to each fan
arm64: dts: armada-3720-turris-mox: rename u-boot mtd partition to a53-firmware
opp: Correct debug message in _opp_add_static_v2()
Bluetooth: btusb: Fix memory leak in btusb_mtk_wmt_recv
soc: qcom: ocmem: don't return NULL in of_get_ocmem
arm64: dts: msm8916: Fix reserved and rfsa nodes unit address
arm64: dts: meson: fix broken wifi node for Khadas VIM3L
iwlwifi: mvm: set enabled in the PPAG command properly
ARM: s3c: fix fiq for clang IAS
optee: simplify i2c access
staging: wfx: fix possible panic with re-queued frames
ARM: at91: use proper asm syntax in pm_suspend
ath10k: Fix suspicious RCU usage warning in ath10k_wmi_tlv_parse_peer_stats_info()
ath10k: Fix lockdep assertion warning in ath10k_sta_statistics
ath11k: fix a locking bug in ath11k_mac_op_start()
soc: aspeed: snoop: Add clock control logic
iwlwifi: mvm: fix the type we use in the PPAG table validity checks
iwlwifi: mvm: store PPAG enabled/disabled flag properly
iwlwifi: mvm: send stored PPAG command instead of local
iwlwifi: mvm: assign SAR table revision to the command later
iwlwifi: mvm: don't check if CSA event is running before removing
bpf_lru_list: Read double-checked variable once without lock
iwlwifi: pnvm: set the PNVM again if it was already loaded
iwlwifi: pnvm: increment the pointer before checking the TLV
ath9k: fix data bus crash when setting nf_override via debugfs
selftests/bpf: Convert test_xdp_redirect.sh to bash
ibmvnic: Set to CLOSED state even on error
bnxt_en: reverse order of TX disable and carrier off
bnxt_en: Fix devlink info's stored fw.psid version format.
xen/netback: fix spurious event detection for common event case
dpaa2-eth: fix memory leak in XDP_REDIRECT
net: phy: consider that suspend2ram may cut off PHY power
net/mlx5e: Don't change interrupt moderation params when DIM is enabled
net/mlx5e: Change interrupt moderation channel params also when channels are closed
net/mlx5: Fix health error state handling
net/mlx5e: Replace synchronize_rcu with synchronize_net
net/mlx5e: kTLS, Use refcounts to free kTLS RX priv context
net/mlx5: Disable devlink reload for multi port slave device
net/mlx5: Disallow RoCE on multi port slave device
net/mlx5: Disallow RoCE on lag device
net/mlx5: Disable devlink reload for lag devices
net/mlx5e: CT: manage the lifetime of the ct entry object
net/mlx5e: Check tunnel offload is required before setting SWP
mac80211: fix potential overflow when multiplying to u32 integers
libbpf: Ignore non function pointer member in struct_ops
bpf: Fix an unitialized value in bpf_iter
bpf, devmap: Use GFP_KERNEL for xdp bulk queue allocation
bpf: Fix bpf_fib_lookup helper MTU check for SKB ctx
selftests: mptcp: fix ACKRX debug message
tcp: fix SO_RCVLOWAT related hangs under mem pressure
net: axienet: Handle deferred probe on clock properly
cxgb4/chtls/cxgbit: Keeping the max ofld immediate data size same in cxgb4 and ulds
b43: N-PHY: Fix the update of coef for the PHY revision >= 3case
bpf: Clear subreg_def for global function return values
ibmvnic: add memory barrier to protect long term buffer
ibmvnic: skip send_request_unmap for timeout reset
net: dsa: felix: perform teardown in reverse order of setup
net: dsa: felix: don't deinitialize unused ports
net: phy: mscc: adding LCPLL reset to VSC8514
net: amd-xgbe: Reset the PHY rx data path when mailbox command timeout
net: amd-xgbe: Fix NETDEV WATCHDOG transmit queue timeout warning
net: amd-xgbe: Reset link when the link never comes back
net: amd-xgbe: Fix network fluctuations when using 1G BELFUSE SFP
net: mvneta: Remove per-cpu queue mapping for Armada 3700
net: enetc: fix destroyed phylink dereference during unbind
tty: convert tty_ldisc_ops 'read()' function to take a kernel pointer
tty: implement read_iter
fbdev: aty: SPARC64 requires FB_ATY_CT
drm/gma500: Fix error return code in psb_driver_load()
gma500: clean up error handling in init
drm/fb-helper: Add missed unlocks in setcmap_legacy()
drm/panel: mantix: Tweak init sequence
drm/vc4: hdmi: Take into account the clock doubling flag in atomic_check
crypto: sun4i-ss - linearize buffers content must be kept
crypto: sun4i-ss - fix kmap usage
crypto: arm64/aes-ce - really hide slower algos when faster ones are enabled
hwrng: ingenic - Fix a resource leak in an error handling path
media: allegro: Fix use after free on error
kcsan: Rewrite kcsan_prandom_u32_max() without prandom_u32_state()
drm: rcar-du: Fix PM reference leak in rcar_cmm_enable()
drm: rcar-du: Fix crash when using LVDS1 clock for CRTC
drm: rcar-du: Fix the return check of of_parse_phandle and of_find_device_by_node
drm/amdgpu: Fix macro name _AMDGPU_TRACE_H_ in preprocessor if condition
MIPS: c-r4k: Fix section mismatch for loongson2_sc_init
MIPS: lantiq: Explicitly compare LTQ_EBU_PCC_ISTAT against 0
drm/virtio: make sure context is created in gem open
drm/fourcc: fix Amlogic format modifier masks
media: ipu3-cio2: Build only for x86
media: i2c: ov5670: Fix PIXEL_RATE minimum value
media: imx: Unregister csc/scaler only if registered
media: imx: Fix csc/scaler unregister
media: mtk-vcodec: fix error return code in vdec_vp9_decode()
media: camss: missing error code in msm_video_register()
media: vsp1: Fix an error handling path in the probe function
media: em28xx: Fix use-after-free in em28xx_alloc_urbs
media: media/pci: Fix memleak in empress_init
media: tm6000: Fix memleak in tm6000_start_stream
media: aspeed: fix error return code in aspeed_video_setup_video()
ASoC: cs42l56: fix up error handling in probe
ASoC: qcom: qdsp6: Move frontend AIFs to q6asm-dai
evm: Fix memleak in init_desc
crypto: bcm - Rename struct device_private to bcm_device_private
sched/fair: Avoid stale CPU util_est value for schedutil in task dequeue
drm/sun4i: tcon: fix inverted DCLK polarity
media: imx7: csi: Fix regression for parallel cameras on i.MX6UL
media: imx7: csi: Fix pad link validation
media: ti-vpe: cal: fix write to unallocated memory
MIPS: properly stop .eh_frame generation
MIPS: Compare __SYNC_loongson3_war against 0
drm/tegra: Fix reference leak when pm_runtime_get_sync() fails
drm/amdgpu: toggle on DF Cstate after finishing xgmi injection
bsg: free the request before return error code
macintosh/adb-iop: Use big-endian autopoll mask
drm/amd/display: Fix 10/12 bpc setup in DCE output bit depth reduction.
drm/amd/display: Fix HDMI deep color output for DCE 6-11.
media: software_node: Fix refcounts in software_node_get_next_child()
media: lmedm04: Fix misuse of comma
media: vidtv: psi: fix missing crc for PMT
media: atomisp: Fix a buffer overflow in debug code
media: qm1d1c0042: fix error return code in qm1d1c0042_init()
media: cx25821: Fix a bug when reallocating some dma memory
media: mtk-vcodec: fix argument used when DEBUG is defined
media: pxa_camera: declare variable when DEBUG is defined
media: uvcvideo: Accept invalid bFormatIndex and bFrameIndex values
sched/eas: Don't update misfit status if the task is pinned
f2fs: compress: fix potential deadlock
ASoC: qcom: lpass-cpu: Remove bit clock state check
ASoC: SOF: Intel: hda: cancel D0i3 work during runtime suspend
perf/arm-cmn: Fix PMU instance naming
perf/arm-cmn: Move IRQs when migrating context
mtd: parser: imagetag: fix error codes in bcm963xx_parse_imagetag_partitions()
crypto: talitos - Work around SEC6 ERRATA (AES-CTR mode data size error)
crypto: talitos - Fix ctr(aes) on SEC1
drm/nouveau: bail out of nouveau_channel_new if channel init fails
mm: proc: Invalidate TLB after clearing soft-dirty page state
ata: ahci_brcm: Add back regulators management
ASoC: cpcap: fix microphone timeslot mask
ASoC: codecs: add missing max_register in regmap config
mtd: parsers: afs: Fix freeing the part name memory in failure
f2fs: fix to avoid inconsistent quota data
drm/amdgpu: Prevent shift wrapping in amdgpu_read_mask()
f2fs: fix a wrong condition in __submit_bio
ASoC: qcom: Fix typo error in HDMI regmap config callbacks
KVM: nSVM: Don't strip host's C-bit from guest's CR3 when reading PDPTRs
drm/mediatek: Check if fb is null
Drivers: hv: vmbus: Avoid use-after-free in vmbus_onoffer_rescind()
ASoC: Intel: sof_sdw: add missing TGL_HDMI quirk for Dell SKU 0A5E
ASoC: Intel: sof_sdw: add missing TGL_HDMI quirk for Dell SKU 0A3E
locking/lockdep: Avoid unmatched unlock
ASoC: qcom: lpass: Fix i2s ctl register bit map
ASoC: rt5682: Fix panic in rt5682_jack_detect_handler happening during system shutdown
ASoC: SOF: debug: Fix a potential issue on string buffer termination
btrfs: clarify error returns values in __load_free_space_cache
btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidapge
KVM: x86: Restore all 64 bits of DR6 and DR7 during RSM on x86-64
s390/zcrypt: return EIO when msg retry limit reached
drm/vc4: hdmi: Move hdmi reset to bind
drm/vc4: hdmi: Fix register offset with longer CEC messages
drm/vc4: hdmi: Fix up CEC registers
drm/vc4: hdmi: Restore cec physical address on reconnect
drm/vc4: hdmi: Compute the CEC clock divider from the clock rate
drm/vc4: hdmi: Update the CEC clock divider on HSM rate change
drm/lima: fix reference leak in lima_pm_busy
drm/dp_mst: Don't cache EDIDs for physical ports
hwrng: timeriomem - Fix cooldown period calculation
crypto: ecdh_helper - Ensure 'len >= secret.len' in decode_key()
io_uring: fix possible deadlock in io_uring_poll
nvmet-tcp: fix receive data digest calculation for multiple h2cdata PDUs
nvmet-tcp: fix potential race of tcp socket closing accept_work
nvme-multipath: set nr_zones for zoned namespaces
nvmet: remove extra variable in identify ns
nvmet: set status to 0 in case for invalid nsid
ASoC: SOF: sof-pci-dev: add missing Up-Extreme quirk
ima: Free IMA measurement buffer on error
ima: Free IMA measurement buffer after kexec syscall
ASoC: simple-card-utils: Fix device module clock
fs/jfs: fix potential integer overflow on shift of a int
jffs2: fix use after free in jffs2_sum_write_data()
ubifs: Fix memleak in ubifs_init_authentication
ubifs: replay: Fix high stack usage, again
ubifs: Fix error return code in alloc_wbufs()
irqchip/imx: IMX_INTMUX should not default to y, unconditionally
smp: Process pending softirqs in flush_smp_call_function_from_idle()
drm/amdgpu/display: remove hdcp_srm sysfs on device removal
capabilities: Don't allow writing ambiguous v3 file capabilities
HSI: Fix PM usage counter unbalance in ssi_hw_init
power: supply: cpcap: Add missing IRQF_ONESHOT to fix regression
clk: meson: clk-pll: fix initializing the old rate (fallback) for a PLL
clk: meson: clk-pll: make "ret" a signed integer
clk: meson: clk-pll: propagate the error from meson_clk_pll_set_rate()
selftests/powerpc: Make the test check in eeh-basic.sh posix compliant
regulator: qcom-rpmh-regulator: add pm8009-1 chip revision
arm64: dts: qcom: qrb5165-rb5: fix pm8009 regulators
quota: Fix memory leak when handling corrupted quota file
i2c: iproc: handle only slave interrupts which are enabled
i2c: iproc: update slave isr mask (ISR_MASK_SLAVE)
i2c: iproc: handle master read request
spi: cadence-quadspi: Abort read if dummy cycles required are too many
clk: sunxi-ng: h6: Fix CEC clock
clk: renesas: r8a779a0: Remove non-existent S2 clock
clk: renesas: r8a779a0: Fix parent of CBFUSA clock
HID: core: detect and skip invalid inputs to snto32()
RDMA/siw: Fix handling of zero-sized Read and Receive Queues.
dmaengine: fsldma: Fix a resource leak in the remove function
dmaengine: fsldma: Fix a resource leak in an error handling path of the probe function
dmaengine: owl-dma: Fix a resource leak in the remove function
dmaengine: hsu: disable spurious interrupt
mfd: bd9571mwv: Use devm_mfd_add_devices()
power: supply: cpcap-charger: Fix missing power_supply_put()
power: supply: cpcap-battery: Fix missing power_supply_put()
power: supply: cpcap-charger: Fix power_supply_put on null battery pointer
fdt: Properly handle "no-map" field in the memory region
of/fdt: Make sure no-map does not remove already reserved regions
RDMA/rtrs: Extend ibtrs_cq_qp_create
RDMA/rtrs-srv: Release lock before call into close_sess
RDMA/rtrs-srv: Use sysfs_remove_file_self for disconnect
RDMA/rtrs-clt: Set mininum limit when create QP
RDMA/rtrs: Call kobject_put in the failure path
RDMA/rtrs-srv: Fix missing wr_cqe
RDMA/rtrs-clt: Refactor the failure cases in alloc_clt
RDMA/rtrs-srv: Init wr_cnt as 1
power: reset: at91-sama5d2_shdwc: fix wkupdbc mask
rtc: s5m: select REGMAP_I2C
dmaengine: idxd: set DMA channel to be private
power: supply: fix sbs-charger build, needs REGMAP_I2C
clocksource/drivers/ixp4xx: Select TIMER_OF when needed
clocksource/drivers/mxs_timer: Add missing semicolon when DEBUG is defined
spi: imx: Don't print error on -EPROBEDEFER
RDMA/mlx5: Use the correct obj_id upon DEVX TIR creation
IB/mlx5: Add mutex destroy call to cap_mask_mutex mutex
clk: sunxi-ng: h6: Fix clock divider range on some clocks
platform/chrome: cros_ec_proto: Use EC_HOST_EVENT_MASK not BIT
platform/chrome: cros_ec_proto: Add LID and BATTERY to default mask
regulator: axp20x: Fix reference cout leak
watch_queue: Drop references to /dev/watch_queue
certs: Fix blacklist flag type confusion
regulator: s5m8767: Fix reference count leak
spi: atmel: Put allocated master before return
regulator: s5m8767: Drop regulators OF node reference
power: supply: axp20x_usb_power: Init work before enabling IRQs
power: supply: smb347-charger: Fix interrupt usage if interrupt is unavailable
regulator: core: Avoid debugfs: Directory ... already present! error
isofs: release buffer head before return
watchdog: intel-mid_wdt: Postpone IRQ handler registration till SCU is ready
auxdisplay: ht16k33: Fix refresh rate handling
objtool: Fix error handling for STD/CLD warnings
objtool: Fix retpoline detection in asm code
objtool: Fix ".cold" section suffix check for newer versions of GCC
scsi: lpfc: Fix ancient double free
iommu: Switch gather->end to the inclusive end
IB/umad: Return EIO in case of when device disassociated
IB/umad: Return EPOLLERR in case of when device disassociated
KVM: PPC: Make the VMX instruction emulation routines static
powerpc/47x: Disable 256k page size
powerpc/time: Enable sched clock for irqtime
mmc: owl-mmc: Fix a resource leak in an error handling path and in the remove function
mmc: sdhci-sprd: Fix some resource leaks in the remove function
mmc: usdhi6rol0: Fix a resource leak in the error handling path of the probe
mmc: renesas_sdhi_internal_dmac: Fix DMA buffer alignment from 8 to 128-bytes
ARM: 9046/1: decompressor: Do not clear SCTLR.nTLSMD for ARMv7+ cores
i2c: qcom-geni: Store DMA mapping data in geni_i2c_dev struct
amba: Fix resource leak for drivers without .remove
iommu: Move iotlb_sync_map out from __iommu_map
iommu: Properly pass gfp_t in _iommu_map() to avoid atomic sleeping
IB/mlx5: Return appropriate error code instead of ENOMEM
IB/cm: Avoid a loop when device has 255 ports
tracepoint: Do not fail unregistering a probe due to memory failure
rtc: zynqmp: depend on HAS_IOMEM
perf tools: Fix DSO filtering when not finding a map for a sampled address
perf vendor events arm64: Fix Ampere eMag event typo
RDMA/rxe: Fix coding error in rxe_recv.c
RDMA/rxe: Fix coding error in rxe_rcv_mcast_pkt
RDMA/rxe: Correct skb on loopback path
spi: stm32: properly handle 0 byte transfer
mfd: altera-sysmgr: Fix physical address storing more
mfd: wm831x-auxadc: Prevent use after free in wm831x_auxadc_read_irq()
powerpc/pseries/dlpar: handle ibm, configure-connector delay status
powerpc/8xx: Fix software emulation interrupt
clk: qcom: gcc-msm8998: Fix Alpha PLL type for all GPLLs
kunit: tool: fix unit test cleanup handling
kselftests: dmabuf-heaps: Fix Makefile's inclusion of the kernel's usr/include dir
RDMA/hns: Fixed wrong judgments in the goto branch
RDMA/siw: Fix calculation of tx_valid_cpus size
RDMA/hns: Fix type of sq_signal_bits
RDMA/hns: Disable RQ inline by default
clk: divider: fix initialization with parent_hw
spi: pxa2xx: Fix the controller numbering for Wildcat Point
powerpc/uaccess: Avoid might_fault() when user access is enabled
powerpc/kuap: Restore AMR after replaying soft interrupts
regulator: qcom-rpmh: fix pm8009 ldo7
clk: aspeed: Fix APLL calculate formula from ast2600-A2
selftests/ftrace: Update synthetic event syntax errors
perf symbols: Use (long) for iterator for bfd symbols
regulator: bd718x7, bd71828, Fix dvs voltage levels
spi: dw: Avoid stack content exposure
spi: Skip zero-length transfers in spi_transfer_one_message()
printk: avoid prb_first_valid_seq() where possible
perf symbols: Fix return value when loading PE DSO
nfsd: register pernet ops last, unregister first
svcrdma: Hold private mutex while invoking rdma_accept()
ceph: fix flush_snap logic after putting caps
RDMA/hns: Fixes missing error code of CMDQ
RDMA/ucma: Fix use-after-free bug in ucma_create_uevent
RDMA/rtrs-srv: Fix stack-out-of-bounds
RDMA/rtrs: Only allow addition of path to an already established session
RDMA/rtrs-srv: fix memory leak by missing kobject free
RDMA/rtrs-srv-sysfs: fix missing put_device
RDMA/rtrs-srv: Do not pass a valid pointer to PTR_ERR()
Input: sur40 - fix an error code in sur40_probe()
perf record: Fix continue profiling after draining the buffer
perf intel-pt: Fix missing CYC processing in PSB
perf intel-pt: Fix premature IPC
perf intel-pt: Fix IPC with CYC threshold
perf test: Fix unaligned access in sample parsing test
Input: elo - fix an error code in elo_connect()
sparc64: only select COMPAT_BINFMT_ELF if BINFMT_ELF is set
sparc: fix led.c driver when PROC_FS is not enabled
Input: zinitix - fix return type of zinitix_init_touch()
ARM: 9065/1: OABI compat: fix build when EPOLL is not enabled
misc: eeprom_93xx46: Fix module alias to enable module autoprobe
phy: rockchip-emmc: emmc_phy_init() always return 0
phy: cadence-torrent: Fix error code in cdns_torrent_phy_probe()
misc: eeprom_93xx46: Add module alias to avoid breaking support for non device tree users
PCI: rcar: Always allocate MSI addresses in 32bit space
soundwire: cadence: fix ACK/NAK handling
pwm: rockchip: Enable APB clock during register access while probing
pwm: rockchip: rockchip_pwm_probe(): Remove superfluous clk_unprepare()
pwm: rockchip: Eliminate potential race condition when probing
PCI: xilinx-cpm: Fix reference count leak on error path
VMCI: Use set_page_dirty_lock() when unregistering guest memory
PCI: Align checking of syscall user config accessors
mei: hbm: call mei_set_devstate() on hbm stop response
drm/msm: Fix MSM_INFO_GET_IOVA with carveout
drm/msm/dsi: Correct io_start for MSM8994 (20nm PHY)
drm/msm/mdp5: Fix wait-for-commit for cmd panels
drm/msm: Fix race of GPU init vs timestamp power management.
drm/msm: Fix races managing the OOB state for timestamp vs timestamps.
drm/msm/dp: trigger unplug event in msm_dp_display_disable
vfio/iommu_type1: Populate full dirty when detach non-pinned group
vfio/iommu_type1: Fix some sanity checks in detach group
vfio-pci/zdev: fix possible segmentation fault issue
ext4: fix potential htree index checksum corruption
phy: USB_LGM_PHY should depend on X86
coresight: etm4x: Skip accessing TRCPDCR in save/restore
nvmem: core: Fix a resource leak on error in nvmem_add_cells_from_of()
nvmem: core: skip child nodes not matching binding
soundwire: bus: use sdw_update_no_pm when initializing a device
soundwire: bus: use sdw_write_no_pm when setting the bus scale registers
soundwire: export sdw_write/read_no_pm functions
soundwire: bus: fix confusion on device used by pm_runtime
misc: fastrpc: fix incorrect usage of dma_map_sgtable
remoteproc/mediatek: acknowledge watchdog IRQ after handled
regmap: sdw: use _no_pm functions in regmap_read/write
ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it
mailbox: sprd: correct definition of SPRD_OUTBOX_FIFO_FULL
device-dax: Fix default return code of range_parse()
PCI: pci-bridge-emul: Fix array overruns, improve safety
PCI: cadence: Fix DMA range mapping early return error
i40e: Fix flow for IPv6 next header (extension header)
i40e: Add zero-initialization of AQ command structures
i40e: Fix overwriting flow control settings during driver loading
i40e: Fix addition of RX filters after enabling FW LLDP agent
i40e: Fix VFs not created
Take mmap lock in cacheflush syscall
nios2: fixed broken sys_clone syscall
i40e: Fix add TC filter for IPv6
octeontx2-af: Fix an off by one in rvu_dbg_qsize_write()
pwm: iqs620a: Fix overflow and optimize calculations
vfio/type1: Use follow_pte()
ice: report correct max number of TCs
ice: Account for port VLAN in VF max packet size calculation
ice: Fix state bits on LLDP mode switch
ice: update the number of available RSS queues
net: stmmac: fix CBS idleslope and sendslope calculation
net/mlx4_core: Add missed mlx4_free_cmd_mailbox()
PCI: rockchip: Make 'ep-gpios' DT property optional
vxlan: move debug check after netdev unregister
wireguard: device: do not generate ICMP for non-IP packets
wireguard: kconfig: use arm chacha even with no neon
ocfs2: fix a use after free on error
mm: memcontrol: fix NR_ANON_THPS accounting in charge moving
mm: memcontrol: fix slub memory accounting
mm/memory.c: fix potential pte_unmap_unlock pte error
mm/hugetlb: fix potential double free in hugetlb_register_node() error path
mm/hugetlb: suppress wrong warning info when alloc gigantic page
mm/compaction: fix misbehaviors of fast_find_migrateblock()
r8169: fix jumbo packet handling on RTL8168e
NFSv4: Fixes for nfs4_bitmask_adjust()
KVM: SVM: Intercept INVPCID when it's disabled to inject #UD
KVM: x86/mmu: Expand collapsible SPTE zap for TDP MMU to ZONE_DEVICE and HugeTLB pages
arm64: Add missing ISB after invalidating TLB in __primary_switch
i2c: brcmstb: Fix brcmstd_send_i2c_cmd condition
i2c: exynos5: Preserve high speed master code
mm,thp,shmem: make khugepaged obey tmpfs mount flags
mm: fix memory_failure() handling of dax-namespace metadata
mm/rmap: fix potential pte_unmap on an not mapped pte
proc: use kvzalloc for our kernel buffer
csky: Fix a size determination in gpr_get()
scsi: bnx2fc: Fix Kconfig warning & CNIC build errors
scsi: sd: sd_zbc: Don't pass GFP_NOIO to kvcalloc
block: reopen the device in blkdev_reread_part
ide/falconide: Fix module unload
scsi: sd: Fix Opal support
blk-settings: align max_sectors on "logical_block_size" boundary
soundwire: intel: fix possible crash when no device is detected
ACPI: property: Fix fwnode string properties matching
ACPI: configfs: add missing check after configfs_register_default_group()
cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known
HID: logitech-dj: add support for keyboard events in eQUAD step 4 Gaming
HID: wacom: Ignore attempts to overwrite the touch_max value from HID
Input: raydium_ts_i2c - do not send zero length
Input: xpad - add support for PowerA Enhanced Wired Controller for Xbox Series X|S
Input: joydev - prevent potential read overflow in ioctl
Input: i8042 - add ASUS Zenbook Flip to noselftest list
media: mceusb: Fix potential out-of-bounds shift
USB: serial: option: update interface mapping for ZTE P685M
usb: musb: Fix runtime PM race in musb_queue_resume_work
usb: dwc3: gadget: Fix setting of DEPCFG.bInterval_m1
usb: dwc3: gadget: Fix dep->interval for fullspeed interrupt
USB: serial: ftdi_sio: fix FTX sub-integer prescaler
USB: serial: pl2303: fix line-speed handling on newer chips
USB: serial: mos7840: fix error code in mos7840_write()
USB: serial: mos7720: fix error code in mos7720_write()
phy: lantiq: rcu-usb2: wait after clock enable
ALSA: fireface: fix to parse sync status register of latter protocol
ALSA: hda: Add another CometLake-H PCI ID
ALSA: hda/hdmi: Drop bogus check at closing a stream
ALSA: hda/realtek: modify EAPD in the ALC886
ALSA: hda/realtek: Quirk for HP Spectre x360 14 amp setup
MIPS: Ingenic: Disable HPTLB for D0 XBurst CPUs too
MIPS: Support binutils configured with --enable-mips-fix-loongson3-llsc=yes
MIPS: VDSO: Use CLANG_FLAGS instead of filtering out '--target='
Revert "MIPS: Octeon: Remove special handling of CONFIG_MIPS_ELF_APPENDED_DTB=y"
Revert "bcache: Kill btree_io_wq"
bcache: Give btree_io_wq correct semantics again
bcache: Move journal work to new flush wq
Revert "drm/amd/display: Update NV1x SR latency values"
drm/amd/display: Add FPU wrappers to dcn21_validate_bandwidth()
drm/amd/display: Remove Assert from dcn10_get_dig_frontend
drm/amd/display: Add vupdate_no_lock interrupts for DCN2.1
drm/amdkfd: Fix recursive lock warnings
drm/amdgpu: Set reference clock to 100Mhz on Renoir (v2)
drm/nouveau/kms: handle mDP connectors
drm/modes: Switch to 64bit maths to avoid integer overflow
drm/sched: Cancel and flush all outstanding jobs before finish.
drm/panel: kd35t133: allow using non-continuous dsi clock
drm/rockchip: Require the YTR modifier for AFBC
ASoC: siu: Fix build error by a wrong const prefix
selinux: fix inconsistency between inode_getxattr and inode_listsecurity
erofs: initialized fields can only be observed after bit is set
tpm_tis: Fix check_locality for correct locality acquisition
tpm_tis: Clean up locality release
KEYS: trusted: Fix incorrect handling of tpm_get_random()
KEYS: trusted: Fix migratable=1 failing
KEYS: trusted: Reserve TPM for seal and unseal operations
btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node
btrfs: do not warn if we can't find the reloc root when looking up backref
btrfs: add asserts for deleting backref cache nodes
btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root
btrfs: fix reloc root leak with 0 ref reloc roots on recovery
btrfs: splice remaining dirty_bg's onto the transaction dirty bg list
btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself
btrfs: account for new extents being deleted in total_bytes_pinned
btrfs: fix extent buffer leak on failure to copy root
drm/i915/gt: Flush before changing register state
drm/i915/gt: Correct surface base address for renderclear
crypto: arm64/sha - add missing module aliases
crypto: aesni - prevent misaligned buffers on the stack
crypto: michael_mic - fix broken misalignment handling
crypto: sun4i-ss - checking sg length is not sufficient
crypto: sun4i-ss - IV register does not work on A10 and A13
crypto: sun4i-ss - handle BigEndian for cipher
crypto: sun4i-ss - initialize need_fallback
soc: samsung: exynos-asv: don't defer early on not-supported SoCs
soc: samsung: exynos-asv: handle reading revision register error
seccomp: Add missing return in non-void function
arm64: ptrace: Fix seccomp of traced syscall -1 (NO_SYSCALL)
misc: rtsx: init of rts522a add OCP power off when no card is present
drivers/misc/vmw_vmci: restrict too big queue size in qp_host_alloc_queue
pstore: Fix typo in compression option name
dts64: mt7622: fix slow sd card access
arm64: dts: agilex: fix phy interface bit shift for gmac1 and gmac2
staging/mt7621-dma: mtk-hsdma.c->hsdma-mt7621.c
staging: gdm724x: Fix DMA from stack
staging: rtl8188eu: Add Edimax EW-7811UN V2 to device table
floppy: reintroduce O_NDELAY fix
media: i2c: max9286: fix access to unallocated memory
media: ir_toy: add another IR Droid device
media: ipu3-cio2: Fix mbus_code processing in cio2_subdev_set_fmt()
media: marvell-ccic: power up the device on mclk enable
media: smipcie: fix interrupt handling and IR timeout
x86/virt: Eat faults on VMXOFF in reboot flows
x86/reboot: Force all cpus to exit VMX root if VMX is supported
x86/fault: Fix AMD erratum #91 errata fixup for user code
x86/entry: Fix instrumentation annotation
powerpc/prom: Fix "ibm,arch-vec-5-platform-support" scan
rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
rcu/nocb: Perform deferred wake up before last idle's need_resched() check
kprobes: Fix to delay the kprobes jump optimization
arm64: Extend workaround for erratum 1024718 to all versions of Cortex-A55
iommu/arm-smmu-qcom: Fix mask extraction for bootloader programmed SMRs
arm64: kexec_file: fix memory leakage in create_dtb() when fdt_open_into() fails
arm64: uprobe: Return EOPNOTSUPP for AARCH32 instruction probing
arm64 module: set plt* section addresses to 0x0
arm64: spectre: Prevent lockdep splat on v4 mitigation enable path
riscv: Disable KSAN_SANITIZE for vDSO
watchdog: qcom: Remove incorrect usage of QCOM_WDT_ENABLE_IRQ
watchdog: mei_wdt: request stop on unregister
coresight: etm4x: Handle accesses to TRCSTALLCTLR
mtd: spi-nor: sfdp: Fix last erase region marking
mtd: spi-nor: sfdp: Fix wrong erase type bitmask for overlaid region
mtd: spi-nor: core: Fix erase type discovery for overlaid region
mtd: spi-nor: core: Add erase size check for erase command initialization
mtd: spi-nor: hisi-sfc: Put child node np on error path
fs/affs: release old buffer head on error path
seq_file: document how per-entry resources are managed.
x86: fix seq_file iteration for pat/memtype.c
mm: memcontrol: fix swap undercounting in cgroup2
mm: memcontrol: fix get_active_memcg return value
hugetlb: fix update_and_free_page contig page struct assumption
hugetlb: fix copy_huge_page_from_user contig page struct assumption
mm/vmscan: restore zone_reclaim_mode ABI
mm, compaction: make fast_isolate_freepages() stay within zone
KVM: nSVM: fix running nested guests when npt=0
nvmem: qcom-spmi-sdam: Fix uninitialized pdev pointer
module: Ignore _GLOBAL_OFFSET_TABLE_ when warning for undefined symbols
mmc: sdhci-esdhc-imx: fix kernel panic when remove module
mmc: sdhci-pci-o2micro: Bug fix for SDR104 HW tuning failure
powerpc/32: Preserve cr1 in exception prolog stack check to fix build error
powerpc/kexec_file: fix FDT size estimation for kdump kernel
powerpc/32s: Add missing call to kuep_lock on syscall entry
spmi: spmi-pmic-arb: Fix hw_irq overflow
mei: fix transfer over dma with extended header
mei: me: emmitsburg workstation DID
mei: me: add adler lake point S DID
mei: me: add adler lake point LP DID
gpio: pcf857x: Fix missing first interrupt
mfd: gateworks-gsc: Fix interrupt type
printk: fix deadlock when kernel panic
exfat: fix shift-out-of-bounds in exfat_fill_super()
zonefs: Fix file size of zones in full condition
kcmp: Support selection of SYS_kcmp without CHECKPOINT_RESTORE
thermal: cpufreq_cooling: freq_qos_update_request() returns < 0 on error
cpufreq: qcom-hw: drop devm_xxx() calls from init/exit hooks
cpufreq: intel_pstate: Change intel_pstate_get_hwp_max() argument
cpufreq: intel_pstate: Get per-CPU max freq via MSR_HWP_CAPABILITIES if available
proc: don't allow async path resolution of /proc/thread-self components
s390/vtime: fix inline assembly clobber list
virtio/s390: implement virtio-ccw revision 2 correctly
um: mm: check more comprehensively for stub changes
um: defer killing userspace on page table update failures
irqchip/loongson-pch-msi: Use bitmap_zalloc() to allocate bitmap
f2fs: fix out-of-repair __setattr_copy()
f2fs: enforce the immutable flag on open files
f2fs: flush data when enabling checkpoint back
sparc32: fix a user-triggerable oops in clear_user()
spi: fsl: invert spisel_boot signal on MPC8309
spi: spi-synquacer: fix set_cs handling
gfs2: fix glock confusion in function signal_our_withdraw
gfs2: Don't skip dlm unlock if glock has an lvb
gfs2: Lock imbalance on error path in gfs2_recover_one
gfs2: Recursive gfs2_quota_hold in gfs2_iomap_end
dm: fix deadlock when swapping to encrypted device
dm table: fix iterate_devices based device capability checks
dm table: fix DAX iterate_devices based device capability checks
dm table: fix zoned iterate_devices based device capability checks
dm writecache: fix performance degradation in ssd mode
dm writecache: return the exact table values that were set
dm writecache: fix writing beyond end of underlying device when shrinking
dm era: Recover committed writeset after crash
dm era: Update in-core bitset after committing the metadata
dm era: Verify the data block size hasn't changed
dm era: Fix bitset memory leaks
dm era: Use correct value size in equality function of writeset tree
dm era: Reinitialize bitset cache before digesting a new writeset
dm era: only resize metadata in preresume
drm/i915: Reject 446-480MHz HDMI clock on GLK
kgdb: fix to kill breakpoints on initmem after boot
ipv6: silence compilation warning for non-IPV6 builds
net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending
wireguard: selftests: test multiple parallel streams
wireguard: queueing: get rid of per-peer ring buffers
net: sched: fix police ext initialization
net: qrtr: Fix memory leak in qrtr_tun_open
net_sched: fix RTNL deadlock again caused by request_module()
ARM: dts: aspeed: Add LCLK to lpc-snoop
Linux 5.10.20
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I3fbcecd9413ce212dac68d5cc800c9457feba56a
commit 97f433c3601a24d3513d06f575a389a2ca4e11e4 upstream.
We get I/O errors when we run md-raid1 on the top of dm-integrity on the
top of ramdisk.
device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0xff00, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
device-mapper: integrity: Bio not aligned on 8 sectors: 0xffff, 0x1
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8048, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8147, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8246, 0xff
device-mapper: integrity: Bio not aligned on 8 sectors: 0x8345, 0xbb
The ramdisk device has logical_block_size 512 and max_sectors 255. The
dm-integrity device uses logical_block_size 4096 and it doesn't affect the
"max_sectors" value - thus, it inherits 255 from the ramdisk. So, we have
a device with max_sectors not aligned on logical_block_size.
The md-raid device sees that the underlying leg has max_sectors 255 and it
will split the bios on 255-sector boundary, making the bios unaligned on
logical_block_size.
In order to fix the bug, we round down max_sectors to logical_block_size.
Cc: stable@vger.kernel.org
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4601b4b130de2329fe06df80ed5d77265f2058e5 ]
Historically the BLKRRPART ioctls called into the now defunct ->revalidate
method, which caused the sd driver to check if any media is present.
When the ->revalidate method was removed this revalidation was lost,
leading to lots of I/O errors when using the eject command. Fix this by
reopening the device to rescan the partitions, and thus calling the
revalidation logic in the sd driver.
Fixes: 471bd0af54 ("sd: use bdev_check_media_change")
Reported--by: Tom Seewald <tseewald@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Tom Seewald <tseewald@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 41e76c85660c022c6bf5713bfb6c21e64a487cec upstream.
bfq_setup_cooperator() uses bfqd->in_serv_last_pos so detect whether it
makes sense to merge current bfq queue with the in-service queue.
However if the in-service queue is freshly scheduled and didn't dispatch
any requests yet, bfqd->in_serv_last_pos is stale and contains value
from the previously scheduled bfq queue which can thus result in a bogus
decision that the two queues should be merged. This bug can be observed
for example with the following fio jobfile:
[global]
direct=0
ioengine=sync
invalidate=1
size=1g
rw=read
[reader]
numjobs=4
directory=/mnt
where the 4 processes will end up in the one shared bfq queue although
they do IO to physically very distant files (for some reason I was able to
observe this only with slice_idle=1ms setting).
Fix the problem by invalidating bfqd->in_serv_last_pos when switching
in-service queue.
Fixes: 058fdecc6d ("block, bfq: fix in-service-queue check for queue merging")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add a resource-managed variant of blk_ksm_init() so that drivers don't
have to worry about calling blk_ksm_destroy().
Note that the implementation uses a custom devres action to call
blk_ksm_destroy() rather than switching the two allocations to be
directly devres-managed, e.g. with devm_kmalloc(). This is because we
need to keep zeroing the memory containing the keyslots when it is
freed, and also because we want to continue using kvmalloc() (and there
is no devm_kvmalloc()).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20210121082155.111333-2-ebiggers@kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
(cherry picked from commit 5851d3b042b694839d2241fbb3200ce958135cdf)
Bug: 161256007
Change-Id: I3cda7fc5f7509f8d967882a88ca30ddc6f6a0426
Signed-off-by: Eric Biggers <ebiggers@google.com>
Replace the following patches with upstream versions
(well, almost upstream; as of 2021-02-12 they are queued for 5.12 at
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next):
ANDROID-dm-add-support-for-passing-through-inline-crypto-support.patch
ANDROID-dm-enable-may_passthrough_inline_crypto-on-some-targets.patch
ANDROID-block-Introduce-passthrough-keyslot-manager.patch
Also, resolve conflicts with the following non-upstream patches for
hardware-wrapped key support. Notably, we need to handle the field
blk_keyslot_manager::features in a few places:
ANDROID-block-add-hardware-wrapped-key-support.patch
ANDROID-dm-add-support-for-passing-through-derive_raw_secret.patch
Finally, update non-upstream device-mapper targets (dm-bow and
dm-default-key) to use the new way of specifying inline crypto
passthrough support (DM_TARGET_PASSES_CRYPTO) rather than the old way
(may_passthrough_inline_crypto). These changes should be folded into:
ANDROID-dm-bow-Add-dm-bow-feature.patch
ANDROID-dm-add-dm-default-key-target-for-metadata-encryption.patch
Test: tested on db845c; verified that inline crypto support gets passed
through over dm-linear.
Bug: 162257830
Change-Id: I5e3dea1aa09fc1215c90857b5b51d9e3720ef7db
Signed-off-by: Eric Biggers <ebiggers@google.com>
Changes in 5.10.17
objtool: Fix seg fault with Clang non-section symbols
Revert "dts: phy: add GPIO number and active state used for phy reset"
gpio: mxs: GPIO_MXS should not default to y unconditionally
gpio: ep93xx: fix BUG_ON port F usage
gpio: ep93xx: Fix single irqchip with multi gpiochips
tracing: Do not count ftrace events in top level enable output
tracing: Check length before giving out the filter buffer
drm/i915: Fix overlay frontbuffer tracking
arm/xen: Don't probe xenbus as part of an early initcall
cgroup: fix psi monitor for root cgroup
Revert "drm/amd/display: Update NV1x SR latency values"
drm/i915/tgl+: Make sure TypeC FIA is powered up when initializing it
drm/dp_mst: Don't report ports connected if nothing is attached to them
dmaengine: move channel device_node deletion to driver
tmpfs: disallow CONFIG_TMPFS_INODE64 on s390
tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha
soc: ti: omap-prm: Fix boot time errors for rst_map_012 bits 0 and 1
arm64: dts: rockchip: Fix PCIe DT properties on rk3399
arm64: dts: qcom: sdm845: Reserve LPASS clocks in gcc
ARM: OMAP2+: Fix suspcious RCU usage splats for omap_enter_idle_coupled
arm64: dts: rockchip: remove interrupt-names property from rk3399 vdec node
platform/x86: hp-wmi: Disable tablet-mode reporting by default
arm64: dts: rockchip: Disable display for NanoPi R2S
ovl: perform vfs_getxattr() with mounter creds
cap: fix conversions on getxattr
ovl: skip getxattr of security labels
scsi: lpfc: Fix EEH encountering oops with NVMe traffic
x86/split_lock: Enable the split lock feature on another Alder Lake CPU
nvme-pci: ignore the subsysem NQN on Phison E16
drm/amd/display: Fix DPCD translation for LTTPR AUX_RD_INTERVAL
drm/amd/display: Add more Clock Sources to DCN2.1
drm/amd/display: Release DSC before acquiring
drm/amd/display: Fix dc_sink kref count in emulated_link_detect
drm/amd/display: Free atomic state after drm_atomic_commit
drm/amd/display: Decrement refcount of dc_sink before reassignment
riscv: virt_addr_valid must check the address belongs to linear mapping
bfq-iosched: Revert "bfq: Fix computation of shallow depth"
ARM: dts: lpc32xx: Revert set default clock rate of HCLK PLL
kallsyms: fix nonconverging kallsyms table with lld
ARM: ensure the signal page contains defined contents
ARM: kexec: fix oops after TLB are invalidated
ubsan: implement __ubsan_handle_alignment_assumption
Revert "lib: Restrict cpumask_local_spread to houskeeping CPUs"
x86/efi: Remove EFI PGD build time checks
lkdtm: don't move ctors to .rodata
KVM: x86: cleanup CR3 reserved bits checks
cgroup-v1: add disabled controller check in cgroup1_parse_param()
dmaengine: idxd: fix misc interrupt completion
ath9k: fix build error with LEDS_CLASS=m
mt76: dma: fix a possible memory leak in mt76_add_fragment()
drm/vc4: hvs: Fix buffer overflow with the dlist handling
dmaengine: idxd: check device state before issue command
bpf: Unbreak BPF_PROG_TYPE_KPROBE when kprobe is called via do_int3
bpf: Check for integer overflow when using roundup_pow_of_two()
netfilter: xt_recent: Fix attempt to update deleted entry
selftests: netfilter: fix current year
netfilter: nftables: fix possible UAF over chains from packet path in netns
netfilter: flowtable: fix tcp and udp header checksum update
xen/netback: avoid race in xenvif_rx_ring_slots_available()
net: hdlc_x25: Return meaningful error code in x25_open
net: ipa: set error code in gsi_channel_setup()
hv_netvsc: Reset the RSC count if NVSP_STAT_FAIL in netvsc_receive()
net: enetc: initialize the RFS and RSS memories
selftests: txtimestamp: fix compilation issue
net: stmmac: set TxQ mode back to DCB after disabling CBS
ibmvnic: Clear failover_pending if unable to schedule
netfilter: conntrack: skip identical origin tuple in same zone only
scsi: scsi_debug: Fix a memory leak
x86/build: Disable CET instrumentation in the kernel for 32-bit too
net: dsa: felix: implement port flushing on .phylink_mac_link_down
net: hns3: add a check for queue_id in hclge_reset_vf_queue()
net: hns3: add a check for tqp_index in hclge_get_ring_chain_from_mbx()
net: hns3: add a check for index in hclge_get_rss_key()
firmware_loader: align .builtin_fw to 8
drm/sun4i: tcon: set sync polarity for tcon1 channel
drm/sun4i: dw-hdmi: always set clock rate
drm/sun4i: Fix H6 HDMI PHY configuration
drm/sun4i: dw-hdmi: Fix max. frequency for H6
clk: sunxi-ng: mp: fix parent rate change flag check
i2c: stm32f7: fix configuration of the digital filter
h8300: fix PREEMPTION build, TI_PRE_COUNT undefined
scripts: set proper OpenSSL include dir also for sign-file
x86/pci: Create PCI/MSI irqdomain after x86_init.pci.arch_init()
arm64: mte: Allow PTRACE_PEEKMTETAGS access to the zero page
rxrpc: Fix clearance of Tx/Rx ring when releasing a call
udp: fix skb_copy_and_csum_datagram with odd segment sizes
net: dsa: call teardown method on probe failure
cpufreq: ACPI: Extend frequency tables to cover boost frequencies
cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there
net: gro: do not keep too many GRO packets in napi->rx_list
net: fix iteration for sctp transport seq_files
net/vmw_vsock: fix NULL pointer dereference
net/vmw_vsock: improve locking in vsock_connect_timeout()
net: watchdog: hold device global xmit lock during tx disable
bridge: mrp: Fix the usage of br_mrp_port_switchdev_set_state
switchdev: mrp: Remove SWITCHDEV_ATTR_ID_MRP_PORT_STAT
vsock/virtio: update credit only if socket is not closed
vsock: fix locking in vsock_shutdown()
net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
net/qrtr: restrict user-controlled length in qrtr_tun_write_iter()
ovl: expand warning in ovl_d_real()
kcov, usb: only collect coverage from __usb_hcd_giveback_urb in softirq
Linux 5.10.17
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Id0300681f52b51d3f466f1e66ec3a6c25f65f4d3
[ Upstream commit 388c705b95f23f317fa43e6abf9ff07b583b721a ]
This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.
bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core
sbitmap_get_shallow, which uses just the number to limit the scan depth of
each bitmap word, formula:
scan_percentage_for_each_word = shallow_depth / (1 << sbimap->shift) * 100%
That means the comments's percentiles 50%, 75%, 18%, 37% of bfq are correct.
But after commit patch 'bfq: Fix computation of shallow depth', we use
sbitmap.depth instead, as a example in following case:
sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
The resulsts of computed bfqd->word_depths[] are {128, 192, 48, 96}, and
three of the numbers exceed core dirver's 'sbitmap_word.depth=64' limit
nothing.
Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmAnzLcACgkQONu9yGCS
aT5FUQ/+LUBYHpWjyV1wrnjwf3AAtcUnZGtPUOEsv9d9lAcituistag2zHXive9g
K7HGria7BVcARnAtdcOLWB7ur9Vj+Ch1XVOVhSdI8EgGPslxoWKxmM03FQtSjQak
OYZAHc/A/mrTtG+rYROx4gp+jxaiaUx8e/zleFgNeN1GU9/owR2H8+d/a2L3bnzN
mgaYG4/0GTy1JfDXwsmiNa376dIViPAkukjS8AV+dPZKFag+TmcE0d/qTtDlmiQO
gSboV/8FzwKgUIxjOt6Rw6AniCfGTew/Dy/NkRiGB4ge5+aMZe78+IZ6xzRlbVix
d1/+7Iviy40pTWOZdRxwefAj0/MS9zZeVrDSA/Ips24EfD/0qxq9QEa3cEXvQkZF
ih5AX9obPBxHRsFwn7x9siP3ZW1W2jaEYzrXxIWBJxFDVRRh3/DMo5rljSkUWxzS
8dBpxfNiRMggsbgKPBtuV5+4Dzdbx5Dn1sbaMgT9pU1f+U0LH0KjIU1evuCFqUo6
C/Y61pDjc8GotBFuKjcbCYBMWpAJ/UwqRn4HrMBRMN+ZOpBQr/2RLaM8ROla8H3W
GrhADQlDuHForKHRuiuBpaUxZGLeZw2dpZClrV0WwzHLLV0KsQC0+xE9ge0/GPtQ
rnJPxYiKg2WJctVBlH2i5uLw6s25+dq4ufSZBmr2AOg8u0YccU4=
=BFeH
-----END PGP SIGNATURE-----
Merge 5.10.16 into android12-5.10
Changes in 5.10.16
io_uring: simplify io_task_match()
io_uring: add a {task,files} pair matching helper
io_uring: don't iterate io_uring_cancel_files()
io_uring: pass files into kill timeouts/poll
io_uring: always batch cancel in *cancel_files()
io_uring: fix files cancellation
io_uring: account io_uring internal files as REQ_F_INFLIGHT
io_uring: if we see flush on exit, cancel related tasks
io_uring: fix __io_uring_files_cancel() with TASK_UNINTERRUPTIBLE
io_uring: replace inflight_wait with tctx->wait
io_uring: fix cancellation taking mutex while TASK_UNINTERRUPTIBLE
io_uring: fix flush cqring overflow list while TASK_INTERRUPTIBLE
io_uring: fix list corruption for splice file_get
io_uring: fix sqo ownership false positive warning
io_uring: reinforce cancel on flush during exit
io_uring: drop mm/files between task_work_submit
gpiolib: cdev: clear debounce period if line set to output
powerpc/64/signal: Fix regression in __kernel_sigtramp_rt64() semantics
af_key: relax availability checks for skb size calculation
regulator: core: avoid regulator_resolve_supply() race condition
ASoC: wm_adsp: Fix control name parsing for multi-fw
drm/nouveau/nvif: fix method count when pushing an array
mac80211: 160MHz with extended NSS BW in CSA
ASoC: Intel: Skylake: Zero snd_ctl_elem_value
chtls: Fix potential resource leak
pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
pNFS/NFSv4: Improve rejection of out-of-order layouts
ALSA: hda: intel-dsp-config: add PCI id for TGL-H
ASoC: ak4458: correct reset polarity
ASoC: Intel: sof_sdw: set proper flags for Dell TGL-H SKU 0A5E
iwlwifi: mvm: skip power command when unbinding vif during CSA
iwlwifi: mvm: take mutex for calling iwl_mvm_get_sync_time()
iwlwifi: pcie: add a NULL check in iwl_pcie_txq_unmap
iwlwifi: pcie: fix context info memory leak
iwlwifi: mvm: invalidate IDs of internal stations at mvm start
iwlwifi: pcie: add rules to match Qu with Hr2
iwlwifi: mvm: guard against device removal in reprobe
iwlwifi: queue: bail out on invalid freeing
SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
SUNRPC: Handle 0 length opaque XDR object data properly
i2c: mediatek: Move suspend and resume handling to NOIRQ phase
blk-cgroup: Use cond_resched() when destroy blkgs
regulator: Fix lockdep warning resolving supplies
bpf: Fix verifier jmp32 pruning decision logic
bpf: Fix 32 bit src register truncation on div/mod
bpf: Fix verifier jsgt branch analysis on max bound
drm/i915: Fix ICL MG PHY vswing handling
drm/i915: Skip vswing programming for TBT
nilfs2: make splice write available again
Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
squashfs: avoid out of bounds writes in decompressors
squashfs: add more sanity checks in id lookup
squashfs: add more sanity checks in inode lookup
squashfs: add more sanity checks in xattr id lookup
Linux 5.10.16
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie3d667eb0c90288b118c756a33c70c8ceb097405
[ Upstream commit 6c635caef410aa757befbd8857c1eadde5cc22ed ]
On !PREEMPT kernel, we can get below softlockup when doing stress
testing with creating and destroying block cgroup repeatly. The
reason is it may take a long time to acquire the queue's lock in
the loop of blkcg_destroy_blkgs(), or the system can accumulate a
huge number of blkgs in pathological cases. We can add a need_resched()
check on each loop and release locks and do cond_resched() if true
to avoid this issue, since the blkcg_destroy_blkgs() is not called
from atomic contexts.
[ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
[ 4757.010698] Call trace:
[ 4757.010700] blkcg_destroy_blkgs+0x68/0x150
[ 4757.010701] cgwb_release_workfn+0x104/0x158
[ 4757.010702] process_one_work+0x1bc/0x3f0
[ 4757.010704] worker_thread+0x164/0x468
[ 4757.010705] kthread+0x108/0x138
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.10.13
iwlwifi: provide gso_type to GSO packets
nbd: freeze the queue while we're adding connections
tty: avoid using vfs_iocb_iter_write() for redirected console writes
ACPI: sysfs: Prefer "compatible" modalias
ACPI: thermal: Do not call acpi_thermal_check() directly
kernel: kexec: remove the lock operation of system_transition_mutex
ALSA: hda/realtek: Enable headset of ASUS B1400CEPE with ALC256
ALSA: hda/via: Apply the workaround generically for Clevo machines
parisc: Enable -mlong-calls gcc option by default when !CONFIG_MODULES
media: cec: add stm32 driver
media: cedrus: Fix H264 decoding
media: hantro: Fix reset_raw_fmt initialization
media: rc: fix timeout handling after switch to microsecond durations
media: rc: ite-cir: fix min_timeout calculation
media: rc: ensure that uevent can be read directly after rc device register
ARM: dts: tbs2910: rename MMC node aliases
ARM: dts: ux500: Reserve memory carveouts
ARM: dts: imx6qdl-gw52xx: fix duplicate regulator naming
wext: fix NULL-ptr-dereference with cfg80211's lack of commit()
x86/xen: avoid warning in Xen pv guest with CONFIG_AMD_MEM_ENCRYPT enabled
ASoC: AMD Renoir - refine DMI entries for some Lenovo products
Revert "drm/amdgpu/swsmu: drop set_fan_speed_percent (v2)"
drm/nouveau/kms/gk104-gp1xx: Fix > 64x64 cursors
drm/i915: Always flush the active worker before returning from the wait
drm/i915/gt: Always try to reserve GGTT address 0x0
drivers/nouveau/kms/nv50-: Reject format modifiers for cursor planes
bcache: only check feature sets when sb->version >= BCACHE_SB_VERSION_CDEV_WITH_FEATURES
net: usb: qmi_wwan: added support for Thales Cinterion PLSx3 modem family
s390: uv: Fix sysfs max number of VCPUs reporting
s390/vfio-ap: No need to disable IRQ after queue reset
PM: hibernate: flush swap writer after marking
x86/entry: Emit a symbol for register restoring thunk
efi/apple-properties: Reinstate support for boolean properties
crypto: marvel/cesa - Fix tdma descriptor on 64-bit
drivers: soc: atmel: Avoid calling at91_soc_init on non AT91 SoCs
drivers: soc: atmel: add null entry at the end of at91_soc_allowed_list[]
btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
btrfs: fix possible free space tree corruption with online conversion
KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[]
KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()
KVM: arm64: Filter out v8.1+ events on v8.0 HW
KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit
KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX
KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration
KVM: x86: get smi pending status correctly
KVM: Forbid the use of tagged userspace addresses for memslots
io_uring: fix wqe->lock/completion_lock deadlock
xen: Fix XenStore initialisation for XS_LOCAL
leds: trigger: fix potential deadlock with libata
arm64: dts: broadcom: Fix USB DMA address translation for Stingray
mt7601u: fix kernel crash unplugging the device
mt76: mt7663s: fix rx buffer refcounting
mt7601u: fix rx buffer refcounting
iwlwifi: Fix IWL_SUBDEVICE_NO_160 macro to use the correct bit.
drm/i915/gt: Clear CACHE_MODE prior to clearing residuals
drm/i915/pmu: Don't grab wakeref when enabling events
net/mlx5e: Fix IPSEC stats
ARM: dts: imx6qdl-kontron-samx6i: fix pwms for lcd-backlight
drm/nouveau/svm: fail NOUVEAU_SVM_INIT ioctl on unsupported devices
drm/vc4: Correct lbm size and calculation
drm/vc4: Correct POS1_SCL for hvs5
drm/nouveau/dispnv50: Restore pushing of all data.
drm/i915: Check for all subplatform bits
drm/i915/selftest: Fix potential memory leak
uapi: fix big endian definition of ipv6_rpl_sr_hdr
KVM: Documentation: Fix spec for KVM_CAP_ENABLE_CAP_VM
tee: optee: replace might_sleep with cond_resched
xen-blkfront: allow discard-* nodes to be optional
blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue
clk: imx: fix Kconfig warning for i.MX SCU clk
clk: mmp2: fix build without CONFIG_PM
clk: qcom: gcc-sm250: Use floor ops for sdcc clks
ARM: imx: build suspend-imx6.S with arm instruction set
ARM: zImage: atags_to_fdt: Fix node names on added root nodes
netfilter: nft_dynset: add timeout extension to template
Revert "RDMA/mlx5: Fix devlink deadlock on net namespace deletion"
Revert "block: simplify set_init_blocksize" to regain lost performance
xfrm: Fix oops in xfrm_replay_advance_bmp
xfrm: fix disable_xfrm sysctl when used on xfrm interfaces
selftests: xfrm: fix test return value override issue in xfrm_policy.sh
xfrm: Fix wraparound in xfrm_policy_addr_delta()
arm64: dts: ls1028a: fix the offset of the reset register
ARM: imx: fix imx8m dependencies
ARM: dts: imx6qdl-kontron-samx6i: fix i2c_lcd/cam default status
ARM: dts: imx6qdl-sr-som: fix some cubox-i platforms
arm64: dts: imx8mp: Correct the gpio ranges of gpio3
firmware: imx: select SOC_BUS to fix firmware build
RDMA/cxgb4: Fix the reported max_recv_sge value
ASoC: dt-bindings: lpass: Fix and common up lpass dai ids
ASoC: qcom: Fix incorrect volatile registers
ASoC: qcom: Fix broken support to MI2S TERTIARY and QUATERNARY
ASoC: qcom: lpass-ipq806x: fix bitwidth regmap field
spi: altera: Fix memory leak on error path
ASoC: Intel: Skylake: skl-topology: Fix OOPs ib skl_tplg_complete
powerpc/64s: prevent recursive replay_soft_interrupts causing superfluous interrupt
pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
ASoC: SOF: Intel: soundwire: fix select/depend unmet dependencies
ASoC: qcom: lpass: Fix out-of-bounds DAI ID lookup
iwlwifi: pcie: avoid potential PNVM leaks
iwlwifi: pnvm: don't skip everything when not reloading
iwlwifi: pnvm: don't try to load after failures
iwlwifi: pcie: set LTR on more devices
iwlwifi: pcie: use jiffies for memory read spin time limit
iwlwifi: pcie: reschedule in long-running memory reads
mac80211: pause TX while changing interface type
ice: fix FDir IPv6 flexbyte
ice: Implement flow for IPv6 next header (extension header)
ice: update dev_addr in ice_set_mac_address even if HW filter exists
ice: Don't allow more channels than LAN MSI-X available
ice: Fix MSI-X vector fallback logic
i40e: acquire VSI pointer only after VF is initialized
igc: fix link speed advertising
net/mlx5: Fix memory leak on flow table creation error flow
net/mlx5e: E-switch, Fix rate calculation for overflow
net/mlx5e: free page before return
net/mlx5e: Reduce tc unsupported key print level
net/mlx5: Maintain separate page trees for ECPF and PF functions
net/mlx5e: Disable hw-tc-offload when MLX5_CLS_ACT config is disabled
net/mlx5e: Fix CT rule + encap slow path offload and deletion
net/mlx5e: Correctly handle changing the number of queues when the interface is down
net/mlx5e: Revert parameters on errors when changing trust state without reset
net/mlx5e: Revert parameters on errors when changing MTU and LRO state without reset
net/mlx5: CT: Fix incorrect removal of tuple_nat_node from nat rhashtable
can: dev: prevent potential information leak in can_fill_info()
ACPI/IORT: Do not blindly trust DMA masks from firmware
of/device: Update dma_range_map only when dev has valid dma-ranges
iommu/amd: Use IVHD EFR for early initialization of IOMMU features
iommu/vt-d: Correctly check addr alignment in qi_flush_dev_iotlb_pasid()
nvme-multipath: Early exit if no path is available
selftests: forwarding: Specify interface when invoking mausezahn
rxrpc: Fix memory leak in rxrpc_lookup_local
NFC: fix resource leak when target index is invalid
NFC: fix possible resource leak
ASoC: mediatek: mt8183-da7219: ignore TDM DAI link by default
ASoC: mediatek: mt8183-mt6358: ignore TDM DAI link by default
ASoC: topology: Properly unregister DAI on removal
ASoC: topology: Fix memory corruption in soc_tplg_denum_create_values()
scsi: qla2xxx: Fix description for parameter ql2xenforce_iocb_limit
team: protect features update by RCU to avoid deadlock
tcp: make TCP_USER_TIMEOUT accurate for zero window probes
tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN
vsock: fix the race conditions in multi-transport support
Linux 5.10.13
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I75f419b25f24da559e446d62f75ce6bb9b0a5396
commit 2569063c7140c65a0d0ad075e95ddfbcda9ba3c0 upstream.
In case of blk_mq_is_sbitmap_shared(), we should test QUEUE_FLAG_HCTX_ACTIVE against
q->queue_flags instead of BLK_MQ_S_TAG_ACTIVE.
So fix it.
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Fixes: f1b49fdc1c ("blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It may happen that the underlying device's runtime-pm is
not controlled by block-pm. So it's possible that when
commands are sent to the device, it's suspended and may not
be resumed by blk-pm. Hence explicitly resume the parent
which is the platform device.
Bug: 178653131
Link: https://lore.kernel.org/linux-arm-msm/b1db5394aa3f6cf44cd9adb9c8d569caa0c9e4f5.1611803264.git.asutoshd@codeaurora.org/T/#u
Change-Id: Iead4102ea609d339549694164771e47debb58a00
Signed-off-by: Asutosh Das <asutoshd@codeaurora.org>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bao D. Nguyen <nguyenb@codeaurora.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmAHFpcACgkQONu9yGCS
aT4Vhw/+JLscHnfK//hbS6Nx95MY95VMzy+p2ccADXRy3O/5nr0HwGKnXTKB4Bg+
05S3Hv9ZU/XSszLWvgFQ0Z0peU241ASPz1uLTgtpziBT5plXa5eJULBZ+WknWMef
dNKpvKPphpEbQ0yz6o/4sbNAdiI9BzyGCOicQ2dl9nY7R/JA9YHquUD7iHMnvbs+
yxwwawNHVwszUT/fJT3iFzOAehHGAttHdf3z/bGPS1ogy2S7J5IluJgTAibd3P7G
5o7OUUA5ujEtjBLIkA61fqeL2Qaci83Ff/8KEPEfF1JeLBbMHYcLHnz3RAwBaLZh
nlM4smyTeekcnHIzyRGw16OmpoYwY3MQAt+UFLCzKhlnscB0UqCNkA9zQA9k/taw
cy7/fe5hWFU9DRv4uTUT2H1tkP+pNQ5eIaejPHMtld5JlYXoDN4RyQq7sAyMQgBj
CXADStYSR/f5sWWgRbRs1F7E0lrePsVpjOcqHXxbsS+52yN2CZSKazlOIJ9xArfM
cTzzLUuYbMZoHjIDdMMkjA41VMmyJ+BKrqEgzu3LsJQs57o/ckjnQx4VV5YiHhci
v35OL8oa9IZi8WQikB9bx2WZRWUChOGKwMNeeUwEFD4Zmye1OtyyHuzYQf9QSjRv
zbf1owwsg3xnfkvLcfru8mNMgJkgG8RpuNNVPO8boWZ4pgPu2tk=
=5K55
-----END PGP SIGNATURE-----
Merge 5.10.9 into android12-5.10
Changes in 5.10.9
btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
btrfs: prevent NULL pointer dereference in extent_io_tree_panic
ALSA: hda/realtek: fix right sounds and mute/micmute LEDs for HP machines
ALSA: doc: Fix reference to mixart.rst
ASoC: AMD Renoir - add DMI entry for Lenovo ThinkPad X395
ASoC: dapm: remove widget from dirty list on free
x86/hyperv: check cpu mask after interrupt has been disabled
drm/amdgpu: add green_sardine device id (v2)
drm/amdgpu: fix DRM_INFO flood if display core is not supported (bug 210921)
Revert "drm/amd/display: Fixed Intermittent blue screen on OLED panel"
drm/amdgpu: add new device id for Renior
drm/i915: Allow the sysadmin to override security mitigations
drm/i915/gt: Limit VFE threads based on GT
drm/i915/backlight: fix CPU mode backlight takeover on LPT
drm/bridge: sii902x: Refactor init code into separate function
dt-bindings: display: sii902x: Add supply bindings
drm/bridge: sii902x: Enable I/O and core VCC supplies if present
tracing/kprobes: Do the notrace functions check without kprobes on ftrace
tools/bootconfig: Add tracing_on support to helper scripts
ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR
ext4: fix wrong list_splice in ext4_fc_cleanup
ext4: fix bug for rename with RENAME_WHITEOUT
cifs: check pointer before freeing
cifs: fix interrupted close commands
riscv: Drop a duplicated PAGE_KERNEL_EXEC
riscv: return -ENOSYS for syscall -1
riscv: Fixup CONFIG_GENERIC_TIME_VSYSCALL
riscv: Fix KASAN memory mapping.
mips: fix Section mismatch in reference
mips: lib: uncached: fix non-standard usage of variable 'sp'
MIPS: boot: Fix unaligned access with CONFIG_MIPS_RAW_APPENDED_DTB
MIPS: Fix malformed NT_FILE and NT_SIGINFO in 32bit coredumps
MIPS: relocatable: fix possible boot hangup with KASLR enabled
RDMA/ocrdma: Fix use after free in ocrdma_dealloc_ucontext_pd()
ACPI: scan: Harden acpi_device_add() against device ID overflows
xen/privcmd: allow fetching resource sizes
compiler.h: Raise minimum version of GCC to 5.1 for arm64
mm/vmalloc.c: fix potential memory leak
mm/hugetlb: fix potential missing huge page size info
mm/process_vm_access.c: include compat.h
dm raid: fix discard limits for raid1
dm snapshot: flush merged data before committing metadata
dm integrity: fix flush with external metadata device
dm integrity: fix the maximum number of arguments
dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
dm crypt: do not wait for backlogged crypto request completion in softirq
dm crypt: do not call bio_endio() from the dm-crypt tasklet
dm crypt: defer decryption to a tasklet if interrupts disabled
stmmac: intel: change all EHL/TGL to auto detect phy addr
r8152: Add Lenovo Powered USB-C Travel Hub
btrfs: tree-checker: check if chunk item end overflows
ext4: don't leak old mountpoint samples
io_uring: don't take files/mm for a dead task
io_uring: drop mm and files after task_work_run
ARC: build: remove non-existing bootpImage from KBUILD_IMAGE
ARC: build: add uImage.lzma to the top-level target
ARC: build: add boot_targets to PHONY
ARC: build: move symlink creation to arch/arc/Makefile to avoid race
ARM: omap2: pmic-cpcap: fix maximum voltage to be consistent with defaults on xt875
ath11k: fix crash caused by NULL rx_channel
netfilter: ipset: fixes possible oops in mtype_resize
ath11k: qmi: try to allocate a big block of DMA memory first
btrfs: fix async discard stall
btrfs: merge critical sections of discard lock in workfn
btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
regulator: bd718x7: Add enable times
ethernet: ucc_geth: fix definition and size of ucc_geth_tx_global_pram
ARM: dts: ux500/golden: Set display max brightness
habanalabs: adjust pci controller init to new firmware
habanalabs/gaudi: retry loading TPC f/w on -EINTR
habanalabs: register to pci shutdown callback
staging: spmi: hisi-spmi-controller: Fix some error handling paths
spi: altera: fix return value for altera_spi_txrx()
habanalabs: Fix memleak in hl_device_reset
hwmon: (pwm-fan) Ensure that calculation doesn't discard big period values
lib/raid6: Let $(UNROLL) rules work with macOS userland
kconfig: remove 'kvmconfig' and 'xenconfig' shorthands
spi: fix the divide by 0 error when calculating xfer waiting time
io_uring: drop file refs after task cancel
bfq: Fix computation of shallow depth
arch/arc: add copy_user_page() to <asm/page.h> to fix build error on ARC
misdn: dsp: select CONFIG_BITREVERSE
net: ethernet: fs_enet: Add missing MODULE_LICENSE
selftests: fix the return value for UDP GRO test
nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN
nvme: avoid possible double fetch in handling CQE
nvmet-rdma: Fix list_del corruption on queue establishment failure
drm/amd/display: fix sysfs amdgpu_current_backlight_pwm NULL pointer issue
drm/amdgpu: fix a GPU hang issue when remove device
drm/amd/pm: fix the failure when change power profile for renoir
drm/amdgpu: fix potential memory leak during navi12 deinitialization
usb: typec: Fix copy paste error for NVIDIA alt-mode description
iommu/vt-d: Fix lockdep splat in sva bind()/unbind()
ACPI: scan: add stub acpi_create_platform_device() for !CONFIG_ACPI
drm/msm: Call msm_init_vram before binding the gpu
ARM: picoxcell: fix missing interrupt-parent properties
poll: fix performance regression due to out-of-line __put_user()
rcu-tasks: Move RCU-tasks initialization to before early_initcall()
bpf: Simplify task_file_seq_get_next()
bpf: Save correct stopping point in file seq iteration
x86/sev-es: Fix SEV-ES OUT/IN immediate opcode vc handling
cfg80211: select CONFIG_CRC32
nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context
iommu/vt-d: Update domain geometry in iommu_ops.at(de)tach_dev
net/mlx5e: CT: Use per flow counter when CT flow accounting is enabled
net/mlx5: Fix passing zero to 'PTR_ERR'
net/mlx5: E-Switch, fix changing vf VLANID
blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED
mm: fix clear_refs_write locking
mm: don't play games with pinned pages in clear_page_refs
mm: don't put pinned pages into the swap cache
perf intel-pt: Fix 'CPU too large' error
dump_common_audit_data(): fix racy accesses to ->d_name
ASoC: meson: axg-tdm-interface: fix loopback
ASoC: meson: axg-tdmin: fix axg skew offset
ASoC: Intel: fix error code cnl_set_dsp_D0()
nvmet-rdma: Fix NULL deref when setting pi_enable and traddr INADDR_ANY
nvme: don't intialize hwmon for discovery controllers
nvme-tcp: fix possible data corruption with bio merges
nvme-tcp: Fix warning with CONFIG_DEBUG_PREEMPT
NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
pNFS: We want return-on-close to complete when evicting the inode
pNFS: Mark layout for return if return-on-close was not sent
pNFS: Stricter ordering of layoutget and layoutreturn
NFS: Adjust fs_context error logging
NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
NFS: nfs_delegation_find_inode_server must first reference the superblock
NFS: nfs_igrab_and_active must first reference the superblock
scsi: ufs: Fix possible power drain during system suspend
ext4: fix superblock checksum failure when setting password salt
RDMA/restrack: Don't treat as an error allocation ID wrapping
RDMA/usnic: Fix memleak in find_free_vf_and_create_qp_grp
bnxt_en: Improve stats context resource accounting with RDMA driver loaded.
RDMA/mlx5: Fix wrong free of blue flame register on error
IB/mlx5: Fix error unwinding when set_has_smi_cap fails
umount(2): move the flag validity checks first
dm zoned: select CONFIG_CRC32
drm/i915/dsi: Use unconditional msleep for the panel_on_delay when there is no reset-deassert MIPI-sequence
drm/i915/icl: Fix initing the DSI DSC power refcount during HW readout
drm/i915/gt: Restore clear-residual mitigations for Ivybridge, Baytrail
mm, slub: consider rest of partial list if acquire_slab() fails
riscv: Trace irq on only interrupt is enabled
iommu/vt-d: Fix unaligned addresses for intel_flush_svm_range_dev()
net: sunrpc: interpret the return value of kstrtou32 correctly
selftests: netfilter: Pass family parameter "-f" to conntrack tool
dm: eliminate potential source of excessive kernel log noise
ALSA: fireface: Fix integer overflow in transmit_midi_msg()
ALSA: firewire-tascam: Fix integer overflow in midi_port_work()
netfilter: conntrack: fix reading nf_conntrack_buckets
netfilter: nf_nat: Fix memleak in nf_nat_init
Linux 5.10.9
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I609e501511889081e03d2d18ee7e1be95406f396
[ Upstream commit 02f938e9fed1681791605ca8b96c2d9da9355f6a ]
Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives
something like:
root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3
Add the decoding for that flag.
Fixes: 32bc15afed ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a ]
BFQ computes number of tags it allows to be allocated for each request type
based on tag bitmap. However it uses 1 << bitmap.shift as number of
available tags which is wrong. 'shift' is just an internal bitmap value
containing logarithm of how many bits bitmap uses in each bitmap word.
Thus number of tags allowed for some request types can be far to low.
Use proper bitmap.depth which has the number of tags instead.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.10.8
powerpc/32s: Fix RTAS machine check with VMAP stack
io_uring: synchronise IOPOLL on task_submit fail
io_uring: limit {io|sq}poll submit locking scope
io_uring: patch up IOPOLL overflow_flush sync
RDMA/hns: Avoid filling sl in high 3 bits of vlan_id
iommu/arm-smmu-qcom: Initialize SCTLR of the bypass context
drm/panfrost: Don't corrupt the queue mutex on open/close
io_uring: Fix return value from alloc_fixed_file_ref_node
scsi: ufs: Fix -Wsometimes-uninitialized warning
btrfs: skip unnecessary searches for xattrs when logging an inode
btrfs: fix deadlock when cloning inline extent and low on free metadata space
btrfs: shrink delalloc pages instead of full inodes
net: cdc_ncm: correct overhead in delayed_ndp_size
net: hns3: fix incorrect handling of sctp6 rss tuple
net: hns3: fix the number of queues actually used by ARQ
net: hns3: fix a phy loopback fail issue
net: stmmac: dwmac-sun8i: Fix probe error handling
net: stmmac: dwmac-sun8i: Balance internal PHY resource references
net: stmmac: dwmac-sun8i: Balance internal PHY power
net: stmmac: dwmac-sun8i: Balance syscon (de)initialization
net: vlan: avoid leaks on register_vlan_dev() failures
net/sonic: Fix some resource leaks in error handling paths
net: bareudp: add missing error handling for bareudp_link_config()
ptp: ptp_ines: prevent build when HAS_IOMEM is not set
net: ipv6: fib: flush exceptions when purging route
tools: selftests: add test for changing routes with PTMU exceptions
net: fix pmtu check in nopmtudisc mode
net: ip: always refragment ip defragmented packets
chtls: Fix hardware tid leak
chtls: Remove invalid set_tcb call
chtls: Fix panic when route to peer not configured
chtls: Avoid unnecessary freeing of oreq pointer
chtls: Replace skb_dequeue with skb_peek
chtls: Added a check to avoid NULL pointer dereference
chtls: Fix chtls resources release sequence
octeontx2-af: fix memory leak of lmac and lmac->name
nexthop: Fix off-by-one error in error path
nexthop: Unlink nexthop group entry in error path
nexthop: Bounce NHA_GATEWAY in FDB nexthop groups
s390/qeth: fix deadlock during recovery
s390/qeth: fix locking for discipline setup / removal
s390/qeth: fix L2 header access in qeth_l3_osa_features_check()
net: dsa: lantiq_gswip: Exclude RMII from modes that report 1 GbE
net/mlx5: Use port_num 1 instead of 0 when delete a RoCE address
net/mlx5e: ethtool, Fix restriction of autoneg with 56G
net/mlx5e: In skb build skip setting mark in switchdev mode
net/mlx5: Check if lag is supported before creating one
scsi: lpfc: Fix variable 'vport' set but not used in lpfc_sli4_abts_err_handler()
ionic: start queues before announcing link up
HID: wacom: Fix memory leakage caused by kfifo_alloc
fanotify: Fix sys_fanotify_mark() on native x86-32
ARM: OMAP2+: omap_device: fix idling of devices during probe
i2c: sprd: use a specific timeout to avoid system hang up issue
dmaengine: dw-edma: Fix use after free in dw_edma_alloc_chunk()
selftests/bpf: Clarify build error if no vmlinux
can: tcan4x5x: fix bittiming const, use common bittiming from m_can driver
can: m_can: m_can_class_unregister(): remove erroneous m_can_clk_stop()
can: kvaser_pciefd: select CONFIG_CRC32
spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending
cpufreq: powernow-k8: pass policy rather than use cpufreq_cpu_get()
spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
spi: stm32: FIFO threshold level - fix align packet size
i2c: i801: Fix the i2c-mux gpiod_lookup_table not being properly terminated
i2c: mediatek: Fix apdma and i2c hand-shake timeout
bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET
interconnect: imx: Add a missing of_node_put after of_device_is_available
interconnect: qcom: fix rpmh link failures
dmaengine: mediatek: mtk-hsdma: Fix a resource leak in the error handling path of the probe function
dmaengine: milbeaut-xdmac: Fix a resource leak in the error handling path of the probe function
dmaengine: xilinx_dma: check dma_async_device_register return value
dmaengine: xilinx_dma: fix incompatible param warning in _child_probe()
dmaengine: xilinx_dma: fix mixed_enum_type coverity warning
arm64: mm: Fix ARCH_LOW_ADDRESS_LIMIT when !CONFIG_ZONE_DMA
qed: select CONFIG_CRC32
phy: dp83640: select CONFIG_CRC32
wil6210: select CONFIG_CRC32
block: rsxx: select CONFIG_CRC32
lightnvm: select CONFIG_CRC32
zonefs: select CONFIG_CRC32
iommu/vt-d: Fix misuse of ALIGN in qi_flush_piotlb()
iommu/intel: Fix memleak in intel_irq_remapping_alloc
bpftool: Fix compilation failure for net.o with older glibc
nvme-tcp: Fix possible race of io_work and direct send
net/mlx5e: Fix memleak in mlx5e_create_l2_table_groups
net/mlx5e: Fix two double free cases
regmap: debugfs: Fix a memory leak when calling regmap_attach_dev
wan: ds26522: select CONFIG_BITREVERSE
arm64: cpufeature: remove non-exist CONFIG_KVM_ARM_HOST
regulator: qcom-rpmh-regulator: correct hfsmps515 definition
net: mvpp2: disable force link UP during port init procedure
drm/i915/dp: Track pm_qos per connector
net: mvneta: fix error message when MTU too large for XDP
selftests: fib_nexthops: Fix wrong mausezahn invocation
KVM: arm64: Don't access PMCR_EL0 when no PMU is available
xsk: Fix race in SKB mode transmit with shared cq
xsk: Rollback reservation at NETDEV_TX_BUSY
block/rnbd-clt: avoid module unload race with close confirmation
can: isotp: isotp_getname(): fix kernel information leak
block: fix use-after-free in disk_part_iter_next
net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
regmap: debugfs: Fix a reversed if statement in regmap_debugfs_init()
drm/panfrost: Remove unused variables in panfrost_job_close()
tools headers UAPI: Sync linux/fscrypt.h with the kernel sources
Linux 5.10.8
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib8272ec9f47a3c3813509bcacece3b16137332e1
commit aebf5db917055b38f4945ed6d621d9f07a44ff30 upstream.
Make sure that bdgrab() is done on the 'block_device' instance before
referring to it for avoiding use-after-free.
Cc: <stable@vger.kernel.org>
Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.10.7
i40e: Fix Error I40E_AQ_RC_EINVAL when removing VFs
iavf: fix double-release of rtnl_lock
net/sched: sch_taprio: ensure to reset/destroy all child qdiscs
net: mvpp2: Add TCAM entry to drop flow control pause frames
net: mvpp2: prs: fix PPPoE with ipv6 packet parse
net: systemport: set dev->max_mtu to UMAC_MAX_MTU_SIZE
ethernet: ucc_geth: fix use-after-free in ucc_geth_remove()
ethernet: ucc_geth: set dev->max_mtu to 1518
ionic: account for vlan tag len in rx buffer len
atm: idt77252: call pci_disable_device() on error path
net: mvpp2: Fix GoP port 3 Networking Complex Control configurations
net: stmmac: dwmac-meson8b: ignore the second clock input
ibmvnic: fix login buffer memory leak
ibmvnic: continue fatal error reset after passive init
net: ethernet: mvneta: Fix error handling in mvneta_probe
qede: fix offload for IPIP tunnel packets
virtio_net: Fix recursive call to cpus_read_lock()
net/ncsi: Use real net-device for response handler
net: ethernet: Fix memleak in ethoc_probe
net-sysfs: take the rtnl lock when storing xps_cpus
net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc
net-sysfs: take the rtnl lock when storing xps_rxqs
net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc
net: ethernet: ti: cpts: fix ethtool output when no ptp_clock registered
tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
e1000e: Only run S0ix flows if shutdown succeeded
e1000e: bump up timeout to wait when ME un-configures ULP mode
Revert "e1000e: disable s0ix entry and exit flows for ME systems"
e1000e: Export S0ix flags to ethtool
bnxt_en: Check TQM rings for maximum supported value.
net: mvpp2: fix pkt coalescing int-threshold configuration
bnxt_en: Fix AER recovery.
ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
net: sched: prevent invalid Scell_log shift count
net: hns: fix return value check in __lb_other_process()
erspan: fix version 1 check in gre_parse_header()
net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
bareudp: set NETIF_F_LLTX flag
bareudp: Fix use of incorrect min_headroom size
vhost_net: fix ubuf refcount incorrectly when sendmsg fails
r8169: work around power-saving bug on some chip versions
net: dsa: lantiq_gswip: Enable GSWIP_MII_CFG_EN also for internal PHYs
net: dsa: lantiq_gswip: Fix GSWIP_MII_CFG(p) register access
CDC-NCM: remove "connected" log message
ibmvnic: fix: NULL pointer dereference.
net: usb: qmi_wwan: add Quectel EM160R-GL
selftests: mlxsw: Set headroom size of correct port
stmmac: intel: Add PCI IDs for TGL-H platform
selftests/vm: fix building protection keys test
block: add debugfs stanza for QUEUE_FLAG_NOWAIT
workqueue: Kick a worker based on the actual activation of delayed works
scsi: ufs: Fix wrong print message in dev_err()
scsi: ufs-pci: Fix restore from S4 for Intel controllers
scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
scsi: ufs-pci: Fix recovery from hibernate exit errors for Intel controllers
scsi: ufs-pci: Enable UFSHCD_CAP_RPM_AUTOSUSPEND for Intel controllers
scsi: block: Introduce BLK_MQ_REQ_PM
scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
scsi: ide: Mark power management requests with RQF_PM instead of RQF_PREEMPT
scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
scsi: core: Only process PM requests if rpm_status != RPM_ACTIVE
local64.h: make <asm/local64.h> mandatory
lib/genalloc: fix the overflow when size is too big
depmod: handle the case of /sbin/depmod without /sbin in PATH
scsi: ufs: Clear UAC for FFU and RPMB LUNs
kbuild: don't hardcode depmod path
Bluetooth: revert: hci_h5: close serdev device and free hu in h5_close
scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT
scsi: block: Do not accept any requests while suspended
crypto: ecdh - avoid buffer overflow in ecdh_set_secret()
crypto: asym_tpm: correct zero out potential secrets
powerpc: Handle .text.{hot,unlikely}.* in linker script
Staging: comedi: Return -EFAULT if copy_to_user() fails
staging: mt7621-dma: Fix a resource leak in an error handling path
usb: gadget: enable super speed plus
USB: cdc-acm: blacklist another IR Droid device
USB: cdc-wdm: Fix use after free in service_outstanding_interrupt().
usb: typec: intel_pmc_mux: Configure HPD first for HPD+IRQ request
usb: dwc3: meson-g12a: disable clk on error handling path in probe
usb: dwc3: gadget: Restart DWC3 gadget when enabling pullup
usb: dwc3: gadget: Clear wait flag on dequeue
usb: dwc3: ulpi: Use VStsDone to detect PHY regs access completion
usb: dwc3: ulpi: Replace CPU-based busyloop with Protocol-based one
usb: dwc3: ulpi: Fix USB2.0 HS/FS/LS PHY suspend regression
usb: chipidea: ci_hdrc_imx: add missing put_device() call in usbmisc_get_init_data()
USB: xhci: fix U1/U2 handling for hardware with XHCI_INTEL_HOST quirk set
usb: usbip: vhci_hcd: protect shift size
usb: uas: Add PNY USB Portable SSD to unusual_uas
USB: serial: iuu_phoenix: fix DMA from stack
USB: serial: option: add LongSung M5710 module support
USB: serial: option: add Quectel EM160R-GL
USB: yurex: fix control-URB timeout handling
USB: usblp: fix DMA to stack
ALSA: usb-audio: Fix UBSAN warnings for MIDI jacks
usb: gadget: select CONFIG_CRC32
USB: Gadget: dummy-hcd: Fix shift-out-of-bounds bug
usb: gadget: f_uac2: reset wMaxPacketSize
usb: gadget: function: printer: Fix a memory leak for interface descriptor
usb: gadget: u_ether: Fix MTU size mismatch with RX packet size
USB: gadget: legacy: fix return error code in acm_ms_bind()
usb: gadget: Fix spinlock lockup on usb_function_deactivate
usb: gadget: configfs: Preserve function ordering after bind failure
usb: gadget: configfs: Fix use-after-free issue with udc_name
USB: serial: keyspan_pda: remove unused variable
hwmon: (amd_energy) fix allocation of hwmon_channel_info config
mm: make wait_on_page_writeback() wait for multiple pending writebacks
x86/mm: Fix leak of pmd ptlock
KVM: x86/mmu: Use -1 to flag an undefined spte in get_mmio_spte()
KVM: x86/mmu: Get root level from walkers when retrieving MMIO SPTE
kvm: check tlbs_dirty directly
KVM: x86/mmu: Ensure TDP MMU roots are freed after yield
x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR
x86/resctrl: Don't move a task to the same resource group
blk-iocost: fix NULL iocg deref from racing against initialization
ALSA: hda/via: Fix runtime PM for Clevo W35xSS
ALSA: hda/conexant: add a new hda codec CX11970
ALSA: hda/realtek - Fix speaker volume control on Lenovo C940
ALSA: hda/realtek: Add mute LED quirk for more HP laptops
ALSA: hda/realtek: Enable mute and micmute LED on HP EliteBook 850 G7
ALSA: hda/realtek: Add two "Intel Reference board" SSID in the ALC256.
iommu/vt-d: Move intel_iommu info from struct intel_svm to struct intel_svm_dev
btrfs: qgroup: don't try to wait flushing if we're already holding a transaction
btrfs: send: fix wrong file path when there is an inode with a pending rmdir
Revert "device property: Keep secondary firmware node secondary by type"
dmabuf: fix use-after-free of dmabuf's file->f_inode
arm64: link with -z norelro for LLD or aarch64-elf
drm/i915: clear the shadow batch
drm/i915: clear the gpu reloc batch
bcache: fix typo from SUUP to SUPP in features.h
bcache: check unsupported feature sets for bcache register
bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket
net/mlx5e: Fix SWP offsets when vlan inserted by driver
ARM: dts: OMAP3: disable AES on N950/N9
netfilter: x_tables: Update remaining dereference to RCU
netfilter: ipset: fix shift-out-of-bounds in htable_bits()
netfilter: xt_RATEEST: reject non-null terminated string from userspace
netfilter: nft_dynset: report EOPNOTSUPP on missing set feature
dmaengine: idxd: off by one in cleanup code
x86/mtrr: Correct the range check before performing MTRR type lookups
KVM: x86: fix shift out of bounds reported by UBSAN
xsk: Fix memory leak for failed bind
rtlwifi: rise completion at the last step of firmware callback
scsi: target: Fix XCOPY NAA identifier lookup
Linux 5.10.7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I1a7c195af35831fe362b027fe013c0c7e4dc20ea
commit d16baa3f1453c14d680c5fee01cd122a22d0e0ce upstream.
When initializing iocost for a queue, its rqos should be registered before
the blkcg policy is activated to allow policy data initiailization to lookup
the associated ioc. This unfortunately means that the rqos methods can be
called on bios before iocgs are attached to all existing blkgs.
While the race is theoretically possible on ioc_rqos_throttle(), it mostly
happened in ioc_rqos_merge() due to the difference in how they lookup ioc.
The former determines it from the passed in @rqos and then bails before
dereferencing iocg if the looked up ioc is disabled, which most likely is
the case if initialization is still in progress. The latter looked up ioc by
dereferencing the possibly NULL iocg making it a lot more prone to actually
triggering the bug.
* Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look
up ioc for consistency.
* Make ioc_rqos_throttle() and ioc_rqos_merge() test for NULL iocg before
dereferencing it.
* Explain the danger of NULL iocgs in blk_iocost_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jonathan Lemon <bsd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 52abca64fd9410ea6c9a3a74eab25663b403d7da ]
blk_queue_enter() accepts BLK_MQ_REQ_PM requests independent of the runtime
power management state. Now that SCSI domain validation no longer depends
on this behavior, modify the behavior of blk_queue_enter() as follows:
- Do not accept any requests while suspended.
- Only process power management requests while suspending or resuming.
Submitting BLK_MQ_REQ_PM requests to a device that is runtime suspended
causes runtime-suspended devices not to resume as they should. The request
which should cause a runtime resume instead gets issued directly, without
resuming the device first. Of course the device can't handle it properly,
the I/O fails, and the device remains suspended.
The problem is fixed by checking that the queue's runtime-PM status isn't
RPM_SUSPENDED before allowing a request to be issued, and queuing a
runtime-resume request if it is. In particular, the inline
blk_pm_request_resume() routine is renamed blk_pm_resume_queue() and the
code is unified by merging the surrounding checks into the routine. If the
queue isn't set up for runtime PM, or there currently is no restriction on
allowed requests, the request is allowed. Likewise if the BLK_MQ_REQ_PM
flag is set and the status isn't RPM_SUSPENDED. Otherwise a runtime resume
is queued and the request is blocked until conditions are more suitable.
[ bvanassche: modified commit message and removed Cc: stable because
without the previous patches from this series this patch would break
parallel SCSI domain validation + introduced queue_rpm_status() ]
Link: https://lore.kernel.org/r/20201209052951.16136-9-bvanassche@acm.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-and-tested-by: Martin Kepplinger <martin.kepplinger@puri.sm>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a4d34da715e3cb7e0741fe603dcd511bed067e00 ]
Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no longer
used by any kernel code.
Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Martin Kepplinger <martin.kepplinger@puri.sm>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0854bcdcdec26aecdc92c303816f349ee1fba2bc ]
Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation
functions set RQF_PM. This is the first step towards removing
BLK_MQ_REQ_PREEMPT.
Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Can Guo <cang@codeaurora.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dc30432605bbbd486dfede3852ea4d42c40a84b4 ]
This was missed in 021a24460d. Leads to the numeric value of
QUEUE_FLAG_NOWAIT (i.e. 29) showing up in
/sys/kernel/debug/block/*/state.
Fixes: 021a24460d
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.10.5
net/sched: sch_taprio: reset child qdiscs before freeing them
mptcp: fix security context on server socket
ethtool: fix error paths in ethnl_set_channels()
ethtool: fix string set id check
md/raid10: initialize r10_bio->read_slot before use.
drm/amd/display: Add get_dig_frontend implementation for DCEx
io_uring: close a small race gap for files cancel
jffs2: Allow setting rp_size to zero during remounting
jffs2: Fix NULL pointer dereference in rp_size fs option parsing
spi: dw-bt1: Fix undefined devm_mux_control_get symbol
opp: fix memory leak in _allocate_opp_table
opp: Call the missing clk_put() on error
scsi: block: Fix a race in the runtime power management code
mm/hugetlb: fix deadlock in hugetlb_cow error path
mm: memmap defer init doesn't work as expected
lib/zlib: fix inflating zlib streams on s390
io_uring: don't assume mm is constant across submits
io_uring: use bottom half safe lock for fixed file data
io_uring: add a helper for setting a ref node
io_uring: fix io_sqe_files_unregister() hangs
uapi: move constants from <linux/kernel.h> to <linux/const.h>
tools headers UAPI: Sync linux/const.h with the kernel headers
cgroup: Fix memory leak when parsing multiple source parameters
zlib: move EXPORT_SYMBOL() and MODULE_LICENSE() out of dfltcc_syms.c
scsi: cxgb4i: Fix TLS dependency
Bluetooth: hci_h5: close serdev device and free hu in h5_close
fbcon: Disable accelerated scrolling
reiserfs: add check for an invalid ih_entry_count
misc: vmw_vmci: fix kernel info-leak by initializing dbells in vmci_ctx_get_chkpt_doorbells()
media: gp8psk: initialize stats at power control logic
f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
ALSA: seq: Use bool for snd_seq_queue internal flags
ALSA: rawmidi: Access runtime->avail always in spinlock
bfs: don't use WARNING: string when it's just info.
ext4: check for invalid block size early when mounting a file system
fcntl: Fix potential deadlock in send_sig{io, urg}()
io_uring: check kthread stopped flag when sq thread is unparked
rtc: sun6i: Fix memleak in sun6i_rtc_clk_init
module: set MODULE_STATE_GOING state when a module fails to load
quota: Don't overflow quota file offsets
rtc: pl031: fix resource leak in pl031_probe
powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe()
i3c master: fix missing destroy_workqueue() on error in i3c_master_register
NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode
f2fs: avoid race condition for shrinker count
f2fs: fix race of pending_pages in decompression
module: delay kobject uevent until after module init call
powerpc/64: irq replay remove decrementer overflow check
fs/namespace.c: WARN if mnt_count has become negative
watchdog: rti-wdt: fix reference leak in rti_wdt_probe
um: random: Register random as hwrng-core device
um: ubd: Submit all data segments atomically
NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode fails
drm/amd/display: updated wm table for Renoir
tick/sched: Remove bogus boot "safety" check
s390: always clear kernel stack backchain before calling functions
io_uring: remove racy overflow list fast checks
ALSA: pcm: Clear the full allocated memory at hw_params
dm verity: skip verity work if I/O error when system is shutting down
ext4: avoid s_mb_prefetch to be zero in individual scenarios
device-dax: Fix range release
Linux 5.10.5
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2b481bfac06bafdef2cf3cc1ac2c2a4ddf9913dc
commit fa4d0f1992a96f6d7c988ef423e3127e613f6ac9 upstream.
With the current implementation the following race can happen:
* blk_pre_runtime_suspend() calls blk_freeze_queue_start() and
blk_mq_unfreeze_queue().
* blk_queue_enter() calls blk_queue_pm_only() and that function returns
true.
* blk_queue_enter() calls blk_pm_request_resume() and that function does
not call pm_request_resume() because the queue runtime status is
RPM_ACTIVE.
* blk_pre_runtime_suspend() changes the queue status into RPM_SUSPENDING.
Fix this race by changing the queue runtime status into RPM_SUSPENDING
before switching q_usage_counter to atomic mode.
Link: https://lore.kernel.org/r/20201209052951.16136-2-bvanassche@acm.org
Fixes: 986d413b7c ("blk-mq: Enable support for runtime power management")
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Stanley Chu <stanley.chu@mediatek.com>
Co-developed-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Can Guo <cang@codeaurora.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/L/n0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptxfD/9Dvasr1LIF9jTL5nU3iwwCAKVneI0VXsvn
x9xLv+3AlvyhKJSpDxjyYrrsNwo2r2Hay/3Yix079wMyDNcjuUeaDVscQ2Ed6WXI
slOhefyd4sgZf0kpB7TkwSrU1idoSEhOgcbZSiPLnlb071rYbJm5M6W15j9knUW0
HfomlryBgTX6b0et+YHvboTQ+31ZWTpZdMk9XEvQhCVynPOwU858rRKnFZUQSJ04
LGKXcA5xchxD8HVImwGv6QFTtFSzoyT3g68fEXEM9rPoEM8T0utVFPfF7o8L4Oui
nLxpBThwN18xwcc2iei8Fko2H6RgAmaD4cR4rgZaesa4QavPtT8N+JebMX47Tvhy
BBUbD/gRMPt1nO9ufPuLDyYBda2Ne5Z1DkBSIuYgZrreiOBdYGaJDbGIJ9uGwi+j
u4QvooSMoXb/7XGojxaJCjPDlNyhOb+fbZaOJDU49xQR3k01B6vteDnxGBlvbYHn
xuC82LTpTkWQT65ciaFAsW0L2lE6zlgkWWP3ahzH/qAg0HvODT9LDPNxvUpgZqUT
zoUHua3jAY2JBudUMLlacD+/ZS/T7krMl2XLSvyKaOIAyAlaluxK/uykQ+Z3sWgF
XGYQ/yv/4mQ1rTozqRFKndRPN++FCRzIwcHk8iBQCRlh8yLB/gVBr/Q6SgJrdnEj
WWwc1tRmwA==
=o4GN
-----END PGP SIGNATURE-----
Merge tag 'block-5.10-2020-12-05' of git://git.kernel.dk/linux-block
Pull block fix from Jens Axboe:
"Single fix for an issue with chunk_sectors and stacked devices"
* tag 'block-5.10-2020-12-05' of git://git.kernel.dk/linux-block:
block: use gcd() to fix chunk_sectors limit stacking
Restores splitting in terms of varied per-target ti->max_io_len
rather than use block core's single stacked 'chunk_sectors' limit.
- Like DM crypt, update DM integrity to not use crypto drivers that
have CRYPTO_ALG_ALLOCATES_MEMORY set.
- Fix DM writecache target's argument parsing and status display.
- Remove needless BUG() from dm writecache's persistent_memory_claim()
- Remove old gcc workaround in DM cache target's block_div() for ARM
link errors now that gcc >= 4.9 is required.
- Fix RCU locking in dm_blk_report_zones and dm_dax_zero_page_range.
- Remove old, and now frowned upon, BUG_ON(in_interrupt()) in
dm_table_event().
- Remove invalid sparse annotations from dm_prepare_ioctl() and
dm_unprepare_ioctl().
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl/KonYTHHNuaXR6ZXJA
cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWgSJCACNYEndubrROJZL+FOUQixzZQiOfphw
9Brb/XbdXWXIv7F+JV85E6olOqz7JjTGrO91uD5kwHEtVhDx5zT/GCm+5FoBrLa/
FuTphPRWNimZSU1umJe2AG9hOiDpPJJUe/wwj3QkBH2TeEHwBHblB8BkRFzxnP+p
0dGybrQBMtrH3GO65YG7qaASeBPl1+G3mVHfzViyhk1uoZL1y9pKbzPK60TkHcsa
VCGTPke5Ri3hvd85hmpDcXmyxjxZfCA8Jc/DrQ+DDEwakHoJFwlSzP7fqwHnpKHT
RDL4iOID54SViSGqzNcxlGtr/EHyN9Mom2d4Nnb0cgsRG4woCeJWJZMM
=+o1m
-----END PGP SIGNATURE-----
Merge tag 'for-5.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix DM's bio splitting changes that were made during v5.9. This
restores splitting in terms of varied per-target ti->max_io_len
rather than use block core's single stacked 'chunk_sectors' limit.
- Like DM crypt, update DM integrity to not use crypto drivers that
have CRYPTO_ALG_ALLOCATES_MEMORY set.
- Fix DM writecache target's argument parsing and status display.
- Remove needless BUG() from dm writecache's persistent_memory_claim()
- Remove old gcc workaround in DM cache target's block_div() for ARM
link errors now that gcc >= 4.9 is required.
- Fix RCU locking in dm_blk_report_zones and dm_dax_zero_page_range.
- Remove old, and now frowned upon, BUG_ON(in_interrupt()) in
dm_table_event().
- Remove invalid sparse annotations from dm_prepare_ioctl() and
dm_unprepare_ioctl().
* tag 'for-5.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: remove invalid sparse __acquires and __releases annotations
dm: fix double RCU unlock in dm_dax_zero_page_range() error path
dm: fix IO splitting
dm writecache: remove BUG() and fail gracefully instead
dm table: Remove BUG_ON(in_interrupt())
dm: fix bug with RCU locking in dm_blk_report_zones
Revert "dm cache: fix arm link errors with inline"
dm writecache: fix the maximum number of arguments
dm writecache: advance the number of arguments when reporting max_age
dm integrity: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
Commit 882ec4e609 ("dm table: stack 'chunk_sectors' limit to account
for target-specific splitting") caused a couple regressions:
1) Using lcm_not_zero() when stacking chunk_sectors was a bug because
chunk_sectors must reflect the most limited of all devices in the
IO stack.
2) DM targets that set max_io_len but that do _not_ provide an
.iterate_devices method no longer had there IO split properly.
And commit 5091cdec56 ("dm: change max_io_len() to use
blk_max_size_offset()") also caused a regression where DM no longer
supported varied (per target) IO splitting. The implication being the
potential for severely reduced performance for IO stacks that use a DM
target like dm-cache to hide performance limitations of a slower
device (e.g. one that requires 4K IO splitting).
Coming full circle: Fix all these issues by discontinuing stacking
chunk_sectors up using ti->max_io_len in dm_calculate_queue_limits(),
add optional chunk_sectors override argument to blk_max_size_offset()
and update DM's max_io_len() to pass ti->max_io_len to its
blk_max_size_offset() call.
Passing in an optional chunk_sectors override to blk_max_size_offset()
allows for code reuse of block's centralized calculation for max IO
size based on provided offset and split boundary.
Fixes: 882ec4e609 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting")
Fixes: 5091cdec56 ("dm: change max_io_len() to use blk_max_size_offset()")
Cc: stable@vger.kernel.org
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
commit 22ada802ed ("block: use lcm_not_zero() when stacking
chunk_sectors") broke chunk_sectors limit stacking. chunk_sectors must
reflect the most limited of all devices in the IO stack.
Otherwise malformed IO may result. E.g.: prior to this fix,
->chunk_sectors = lcm_not_zero(8, 128) would result in
blk_max_size_offset() splitting IO at 128 sectors rather than the
required more restrictive 8 sectors.
And since commit 07d098e6bb ("block: allow 'chunk_sectors' to be
non-power-of-2") care must be taken to properly stack chunk_sectors to
be compatible with the possibility that a non-power-of-2 chunk_sectors
may be stacked. This is why gcd() is used instead of reverting back
to using min_not_zero().
Fixes: 22ada802ed ("block: use lcm_not_zero() when stacking chunk_sectors")
Fixes: 07d098e6bb ("block: allow 'chunk_sectors' to be non-power-of-2")
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: John Dorminy <jdorminy@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If there is only one keyslot, then blk_ksm_init() computes
slot_hashtable_size=1 and log_slot_ht_size=0. This causes
blk_ksm_find_keyslot() to crash later because it uses
hash_ptr(key, log_slot_ht_size) to find the hash bucket containing the
key, and hash_ptr() doesn't support the bits == 0 case.
Fix this by making the hash table always have at least 2 buckets.
Tested by running:
kvm-xfstests -c ext4 -g encrypt -m inlinecrypt \
-o blk-crypto-fallback.num_keyslots=1
Fixes: 1b26283970 ("block: Keyslot Manager for Inline Encryption")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
disk_get_part needs to be paired with a disk_put_part.
Cc: stable@vger.kernel.org
Fixes: ef45fe470e ("blk-cgroup: show global disk stats in root cgroup io.stat")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For avoiding use-after-free on flush request, we call its .end_io() from
both timeout code path and __blk_mq_end_request().
When flush request's ref doesn't drop to zero, it is still used, we
can't mark it as IDLE, so fix it by marking IDLE when its refcount drops
to zero really.
Fixes: 65ff5cd045 ("blk-mq: mark flush request as IDLE in flush_end_io()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Return if the function ended up sending an uevent or not.
Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mark flush request as IDLE in its .end_io(), aligning it with how normal
requests behave. The flush request stays in in-flight tags if we're not
using an IO scheduler, so we need to change its state into IDLE.
Otherwise, we will hang in blk_mq_tagset_wait_completed_request() during
error recovery because flush the request state is kept as COMPLETED.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Chao Leng <lengchao@huawei.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the bio's size reaches max_append_sectors, bio_add_hw_page returns
0 then __bio_iov_append_get_pages returns -EINVAL. This is an expected
result of building a small enough bio not to be split in the IO path.
However, iov_iter is not advanced in this case, causing the same pages
are filled for the bio again and again.
Fix the case by properly advancing the iov_iter for already processed
pages.
Fixes: 0512a75b98 ("block: Introduce REQ_OP_ZONE_APPEND")
Cc: stable@vger.kernel.org # 5.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similarly to commit 457e490f2b ("blkcg: allocate struct blkcg_gq
outside request queue spinlock"), blkg_create can also trigger
occasional -ENOMEM failures at the radix insertion because any
allocation inside blkg_create has to be non-blocking, making it more
likely to fail. This causes trouble for userspace tools trying to
configure io weights who need to deal with this condition.
This patch reduces the occurrence of -ENOMEMs on this path by preloading
the radix tree element on a GFP_KERNEL context, such that we guarantee
the later non-blocking insertion won't fail.
A similar solution exists in blkcg_init_queue for the same situation.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Steps on the way to 5.10-rc1
Resolves conflicts in:
fs/userfaultfd.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie3fe3c818f1f6565cfd4fa551de72d2b72ef60af
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+UQjkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpvN9D/9iHU7Vgi8J3SiLrHYUiDtMSI5VnEmSBo6K
Ej/wbbrk4tm2UYi550krOk0dMaHxWD3XWSTlpnrE0sUfjs69G676yzxrlnPt50f5
XbcMc0YOHZfffeu9xXykUO8Q2918PTPC08eLaTK1I8lhKAuuTFCT/syGYu+prfd7
AogyuczaDok8nqJEK9QNr0iaEUbe17GQwmvpWyjHl/qfKhWvV2r6jCZZf6pzQj2c
zv3kbiT3u6xw9OEuhY0sgpTEfhAHEXbNIln6Ob4qVgxmOjwgiZdU/QXyw1i2s6pc
ks7e28P43r3VfNYGBfr/hQCeAJT9gOeUG5yBiQr7ooX6uNPL6GOCG7DO/g5y2thQ
NkV4hub/FjYWbSmRzDlJGj1fWn4L+3r/O8g5nMr+F1L3JYeaW0hOyStqBQ4O74Cj
04tvWQ8ndXdPQrm/iDhM6KxfCvR5TC6k4fy9XPpRW8JOxauhIwTZQJyEQUnXTH3v
pwv3IxRmuWGa3mrJZ5kGhsNAEGHdZCL5soLI+BXAD2MUW2IB5v2HpD/z1bvWL/51
uYiVIt/2LxgLkF7BXP40PnY0qqTsOwGxdd6wQhi5Jn9Et+JkmAAR6cVwXx4AhuQg
FT5mq7ZTQBZrErQu4Mr1k3UyqBFm4MB+mbJhWrVWnUnnyA6pcr1NUsUTz5JcyrWz
jWI7T1Si7w==
=dFJi
-----END PGP SIGNATURE-----
Merge tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request from Christoph
- rdma error handling fixes (Chao Leng)
- fc error handling and reconnect fixes (James Smart)
- fix the qid displace when tracing ioctl command (Keith Busch)
- don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni)
- fix MTDT for passthru (Logan Gunthorpe)
- blacklist Write Same on more devices (Kai-Heng Feng)
- fix an uninitialized work struct (zhenwei pi)"
- lightnvm out-of-bounds fix (Colin)
- SG allocation leak fix (Doug)
- rnbd fixes (Gioh, Guoqing, Jack)
- zone error translation fixes (Keith)
- kerneldoc markup fix (Mauro)
- zram lockdep fix (Peter)
- Kill unused io_context members (Yufen)
- NUMA memory allocation cleanup (Xianting)
- NBD config wakeup fix (Xiubo)
* tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits)
block: blk-mq: fix a kernel-doc markup
nvme-fc: shorten reconnect delay if possible for FC
nvme-fc: wait for queues to freeze before calling update_hr_hw_queues
nvme-fc: fix error loop in create_hw_io_queues
nvme-fc: fix io timeout to abort I/O
null_blk: use zone status for max active/open
nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru
nvmet: cleanup nvmet_passthru_map_sg()
nvmet: limit passthru MTDS by BIO_MAX_PAGES
nvmet: fix uninitialized work for zero kato
nvme-pci: disable Write Zeroes on Sandisk Skyhawk
nvme: use queuedata for nvme_req_qid
nvme-rdma: fix crash due to incorrect cqe
nvme-rdma: fix crash when connect rejected
block: remove unused members for io_context
blk-mq: remove the calling of local_memory_node()
zram: Fix __zram_bvec_{read,write}() locking order
skd_main: remove unused including <linux/version.h>
sgl_alloc_order: fix memory leak
lightnvm: fix out-of-bounds write to array devices->info[]
...
This reverts commit 9d3a39a5f1 as that
commit causes a bunch of SELinux errors.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I73dc15c4ecddcf1950ca7fca2cc107be27da8f3f
Steps on the way to 5.10-rc1
Resolves conflicts in:
include/linux/blk-crypto.h
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4012850c2e4b804d9e87e90b8e03a3b9ce21b5e7
We don't need to check whether the node is memoryless numa node before
calling allocator interface. SLUB(and SLAB,SLOB) relies on the page
allocator to pick a node. Page allocator should deal with memoryless
nodes just fine. It has zonelists constructed for each possible nodes.
And it will automatically fall back into a node which is closest to the
requested node. As long as __GFP_THISNODE is not enforced of course.
The code comments of kmem_cache_alloc_node() of SLAB also showed this:
* Fallback to other node is possible if __GFP_THISNODE is not set.
blk-mq code doesn't set __GFP_THISNODE, so we can remove the calling
of local_memory_node().
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix this warning:
./block/bio.c:1098: WARNING: Inline emphasis start-string without end-string.
The thing is that *iter is not a valid markup.
That seems to be a typo:
*iter -> @iter
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Using "@bio's parent" causes the following waring:
./block/bio.c:10: WARNING: Inline emphasis start-string without end-string.
The main problem here is that this would be converted into:
**bio**'s parent
By kernel-doc, which is not a valid notation. It would be
possible to use, instead, this kernel-doc markup:
``bio's`` parent
Yet, here, is probably simpler to just use an altenative language:
the parent of @bio
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
fix bio splitting for bios that were deferred to the worker thread
due to a DM device being suspended.
- Remove DM core's special handling of NVMe devices now that block
core has internalized efficiencies drivers previously needed to
be concerned about (via now removed direct_make_request).
- Fix request-based DM to not bounce through indirect dm_submit_bio;
instead have block core make direct call to blk_mq_submit_bio().
- Various DM core cleanups to simplify and improve code.
- Update DM cryot to not use drivers that set
CRYPTO_ALG_ALLOCATES_MEMORY.
- Fix DM raid's raid1 and raid10 discard limits for the purposes of
linux-stable. But then remove DM raid's discard limits settings now
that MD raid can efficiently handle large discards.
- A couple small cleanups across various targets.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl+Fx1gTHHNuaXR6ZXJA
cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWk5iB/9pONYmtfQ5oBx4jg/PU8cVYYIfOtwS
ZtItFbw7T9bkHVZ8d4hDr5LTq898cADuRD5edlR82gDOcXkiJlb5PqU39RoOTVvF
Xz87sWzHdGAK7rdnCMAc2hiX3oQOje9o7NxGeGQ/uPaNU+U/vJS0AZtEAwltocBd
j9MGESddBC636Gzbg5C0c0frikXd0am6qp6SCYJNpP5I0G2beHk2YX5Jqt9c7zMk
8kyQend5b5RvkPNWTAjkVfWUsIjwYHh6MF48ZoGvD0X3lWjIBiwyxC0UX5hSXq63
kB+nqxbXcvQLEBtJuDZ2bjyvrwzCVLpmfgLgzxOOU8fI5Q2U0zpsPaa0
=6YDu
-----END PGP SIGNATURE-----
Merge tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Improve DM core's bio splitting to use blk_max_size_offset(). Also
fix bio splitting for bios that were deferred to the worker thread
due to a DM device being suspended.
- Remove DM core's special handling of NVMe devices now that block core
has internalized efficiencies drivers previously needed to be
concerned about (via now removed direct_make_request).
- Fix request-based DM to not bounce through indirect dm_submit_bio;
instead have block core make direct call to blk_mq_submit_bio().
- Various DM core cleanups to simplify and improve code.
- Update DM cryot to not use drivers that set
CRYPTO_ALG_ALLOCATES_MEMORY.
- Fix DM raid's raid1 and raid10 discard limits for the purposes of
linux-stable. But then remove DM raid's discard limits settings now
that MD raid can efficiently handle large discards.
- A couple small cleanups across various targets.
* tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: fix request-based DM to not bounce through indirect dm_submit_bio
dm: remove special-casing of bio-based immutable singleton target on NVMe
dm: export dm_copy_name_and_uuid
dm: fix comment in __dm_suspend()
dm: fold dm_process_bio() into dm_submit_bio()
dm: fix missing imposition of queue_limits from dm_wq_work() thread
dm snap persistent: simplify area_io()
dm thin metadata: Remove unused local variable when create thin and snap
dm raid: remove unnecessary discard limits for raid10
dm raid: fix discard limits for raid1 and raid10
dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
dm: use dm_table_get_device_name() where appropriate in targets
dm table: make 'struct dm_table' definition accessible to all of DM core
dm: eliminate need for start_io_acct() forward declaration
dm: simplify __process_abnormal_io()
dm: push use of on-stack flush_bio down to __send_empty_flush()
dm: optimize max_io_len() by inlining max_io_len_target_boundary()
dm: push md->immutable_target optimization down to __process_bio()
dm: change max_io_len() to use blk_max_size_offset()
dm table: stack 'chunk_sectors' limit to account for target-specific splitting
A zoned device with limited resources to open or activate zones may
return an error when the host exceeds those limits. The same command may
be successful if retried later, but the host needs to wait for specific
zone states before it should expect a retry to succeed. Have the block
layer provide an appropriate status for these conditions so applications
can distinuguish this error for special handling.
Cc: linux-api@vger.kernel.org
Cc: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EYWYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsCgD/9Izy/mbiQMmcBPBuQFds2b2SwPAoB4RVcU
NU7pcI3EbAlcj7xDF08Z74Sr6MKyg+JhGid15iw47o+qFq6cxDKiESYLIrFmb70R
lUDkPr9J4OLNDSZ6hpM4sE6Qg9bzDPhRbAceDQRtVlqjuQdaOS2qZAjNG4qjO8by
3PDO7XHCW+X4HhXiu2PDCKuwyDlHxggYzhBIFZNf58US2BU8+tLn2gvTSvmTb27F
w0s5WU1Q5Q0W9RLrp4YTQi4SIIOq03BTSqpRjqhomIzhSQMieH95XNKGRitLjdap
2mFNJ+5I+DTB/TW2BDBrBRXnoV/QNBJsR0DDFnUZsHEejjXKEVt5BRCpSQC9A0WW
XUyVE1K+3GwgIxSI8tjPtyPEGzzhnqJjzHPq4LJLGlQje95v9JZ6bpODB7HHtZQt
rbNp8IoVQ0n01nIvkkt/vnzCE9VFbWFFQiiu5/+x26iKZXW0pAF9Dnw46nFHoYZi
llYvbKDcAUhSdZI8JuqnSnKhi7sLRNPnApBxs52mSX8qaE91sM2iRFDewYXzaaZG
NjijYCcUtopUvojwxYZaLnIpnKWG4OZqGTNw1IdgzUtfdxoazpg6+4wAF9vo7FEP
AePAUTKrfkGBm95uAP4bRvXBzS9UhXJvBrFW3grzRZybMj617F01yAR4N0xlMXeN
jMLrGe7sWA==
=xE9E
-----END PGP SIGNATURE-----
Merge tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
"Here are the driver updates for 5.10.
A few SCSI updates in here too, in coordination with Martin as they
depend on core block changes for the shared tag bitmap.
This contains:
- NVMe pull requests via Christoph:
- fix keep alive timer modification (Amit Engel)
- order the PCI ID list more sensibly (Andy Shevchenko)
- cleanup the open by controller helper (Chaitanya Kulkarni)
- use an xarray for the CSE log lookup (Chaitanya Kulkarni)
- support ZNS in nvmet passthrough mode (Chaitanya Kulkarni)
- fix nvme_ns_report_zones (Christoph Hellwig)
- add a sanity check to nvmet-fc (James Smart)
- fix interrupt allocation when too many polled queues are
specified (Jeffle Xu)
- small nvmet-tcp optimization (Mark Wunderlich)
- fix a controller refcount leak on init failure (Chaitanya
Kulkarni)
- misc cleanups (Chaitanya Kulkarni)
- major refactoring of the scanning code (Christoph Hellwig)
- MD updates via Song:
- Bug fixes in bitmap code, from Zhao Heming
- Fix a work queue check, from Guoqing Jiang
- Fix raid5 oops with reshape, from Song Liu
- Clean up unused code, from Jason Yan
- Discard improvements, from Xiao Ni
- raid5/6 page offset support, from Yufen Yu
- Shared tag bitmap for SCSI/hisi_sas/null_blk (John, Kashyap,
Hannes)
- null_blk open/active zone limit support (Niklas)
- Set of bcache updates (Coly, Dongsheng, Qinglang)"
* tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (78 commits)
md/raid5: fix oops during stripe resizing
md/bitmap: fix memory leak of temporary bitmap
md: fix the checking of wrong work queue
md/bitmap: md_bitmap_get_counter returns wrong blocks
md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks
md/raid0: remove unused function is_io_in_chunk_boundary()
nvme-core: remove extra condition for vwc
nvme-core: remove extra variable
nvme: remove nvme_identify_ns_list
nvme: refactor nvme_validate_ns
nvme: move nvme_validate_ns
nvme: query namespace identifiers before adding the namespace
nvme: revalidate zone bitmaps in nvme_update_ns_info
nvme: remove nvme_update_formats
nvme: update the known admin effects
nvme: set the queue limits in nvme_update_ns_info
nvme: remove the 0 lba_shift check in nvme_update_ns_info
nvme: clean up the check for too large logic block sizes
nvme: freeze the queue over ->lba_shift updates
nvme: factor out a nvme_configure_metadata helper
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
ZwfpDh2+Tg==
=LzyE
-----END PGP SIGNATURE-----
Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- Series of merge handling cleanups (Baolin, Christoph)
- Series of blk-throttle fixes and cleanups (Baolin)
- Series cleaning up BDI, seperating the block device from the
backing_dev_info (Christoph)
- Removal of bdget() as a generic API (Christoph)
- Removal of blkdev_get() as a generic API (Christoph)
- Cleanup of is-partition checks (Christoph)
- Series reworking disk revalidation (Christoph)
- Series cleaning up bio flags (Christoph)
- bio crypt fixes (Eric)
- IO stats inflight tweak (Gabriel)
- blk-mq tags fixes (Hannes)
- Buffer invalidation fixes (Jan)
- Allow soft limits for zone append (Johannes)
- Shared tag set improvements (John, Kashyap)
- Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)
- DM no-wait support (Mike, Konstantin)
- Request allocation improvements (Ming)
- Allow md/dm/bcache to use IO stat helpers (Song)
- Series improving blk-iocost (Tejun)
- Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
Xianting, Yang, Yufen, yangerkun)
* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
block: fix uapi blkzoned.h comments
blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
blk-mq: get rid of the dead flush handle code path
block: get rid of unnecessary local variable
block: fix comment and add lockdep assert
blk-mq: use helper function to test hw stopped
block: use helper function to test queue register
block: remove redundant mq check
block: invoke blk_mq_exit_sched no matter whether have .exit_sched
percpu_ref: don't refer to ref->data if it isn't allocated
block: ratelimit handle_bad_sector() message
blk-throttle: Re-use the throtl_set_slice_end()
blk-throttle: Open code __throtl_de/enqueue_tg()
blk-throttle: Move service tree validation out of the throtl_rb_first()
blk-throttle: Move the list operation after list validation
blk-throttle: Fix IO hang for a corner case
blk-throttle: Avoid tracking latency if low limit is invalid
blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
block: Remove redundant 'return' statement
...
Pull compat iovec cleanups from Al Viro:
"Christoph's series around import_iovec() and compat variant thereof"
* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
security/keys: remove compat_keyctl_instantiate_key_iov
mm: remove compat_process_vm_{readv,writev}
fs: remove compat_sys_vmsplice
fs: remove the compat readv/writev syscalls
fs: remove various compat readv/writev helpers
iov_iter: transparently handle compat iovecs in import_iovec
iov_iter: refactor rw_copy_check_uvector and import_iovec
iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
compat.h: fix a spelling error in <linux/compat.h>
blk_exit_queue will free elevator_data, while blk_mq_run_work_fn
will access it. Move cancel of hctx->run_work to the front of
blk_exit_queue to avoid use-after-free.
Fixes: 1b97871b50 ("blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release")
Signed-off-by: Yang Yang <yang.yang@vivo.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After commit 923218f616 ("blk-mq: don't allocate driver tag upfront
for flush rq"), blk_mq_submit_bio() will call blk_insert_flush()
directly to handle flush request rather than blk_mq_sched_insert_request()
in the case of elevator.
Then, all flush request either have set RQF_FLUSH_SEQ flag when call
blk_mq_sched_insert_request(), or have inserted into hctx->dispatch.
So, remove the dead code path.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since whole elevator register is protectd by sysfs_lock, we
don't need extras 'has_elevator'. Just use q->elevator directly.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After commit b89f625e28 ("block: don't release queue's sysfs
lock during switching elevator"), whole elevator register and
unregister function are covered by sysfs_lock. So, remove wrong
comment and add lockdep assert.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have introduced helper function blk_mq_hctx_stopped() to test
BLK_MQ_S_STOPPED.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have defined common interface blk_queue_registered() to
test QUEUE_FLAG_REGISTERED. Just use it.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
elv_support_iosched() will check queue_is_mq() for us. So, remove
the redundant check to clean code.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We will register debugfs for scheduler no matter whether it have
defined callback funciton .exit_sched. So, blk_mq_exit_sched()
is always needed to unregister debugfs. Also, q->elevator should
be set as NULL after exiting scheduler.
For now, since all register scheduler have defined .exit_sched,
it will not cause any actual problem. But It will be more reasonable
to do this change.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9/uU0QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpnQvD/wNEBP6d4ISx2/I6sDon9SKJgiY3CLF7x3f
F//GHMYP9+ZzoLdQRlebGiP6c5PVRL6ExJUVNT+Wc4h5jOuThuxy63j/zvv/RSFw
WH9lFiTG44zjbWjp3sCDOuIlHnCTsqA4zYb6os62q3v4SzenW/TA65C+yLn823AF
1VKeVvcoHDu3bvLwtLmAyqZAm2iJH02yKdclKgyaLSKdaGGPX2MJ4tW3GxqzA71i
7R/qer8KqYXSdJdghGI5eFycLnv/TE/bky02TlE+qUhIFwIhDNyo69IQzlMSQXmw
ECaAxMJYvzh6ruztkdJP0wOjYEryLY1oCusQEseB9M//qMlue/4Mi2D3bX5Ni1g4
blQQbIi1gu1J/fZrFtW7G/qHxDvT8oA5cFSv5e/72QRIghvavV6cvEP3s9Uu9v9l
3pA2LcErEgVellzvAe9q192mPpAUgR42VlUyYi7P74By+m7pWob2jWR0WsSbXqNk
pVhhW3s02hIf9HUAwJkqH46Y3FZmbpTBQvYByFnQh1VSRzmx69zZxs4SrKJTJq9L
Id83gBW+r1cuJ8QuZUX4D3ttIGuaZ7J8IdSY4JUBJPMOavbykb6YiWtZ4W5IW5R/
VYcuVTmJr37hcSBHJLw3FmlEN4IH/2QX+mrtJvCEWgeJACo3TVpv0QGw+gD1V5iS
EQzTCgctTg==
=THH6
-----END PGP SIGNATURE-----
Merge tag 'block5.9-2020-10-08' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes that should go into this release:
- NVMe controller error path reference fix (Chaitanya)
- Fix regression with IBM partitions on non-dasd devices (Christoph)
- Fix a missing clear in the compat CDROM packet structure (Peilin)"
* tag 'block5.9-2020-10-08' of git://git.kernel.dk/linux-block:
partitions/ibm: fix non-DASD devices
nvme-core: put ctrl ref when module ref get fail
block/scsi-ioctl: Fix kernel-infoleak in scsi_put_cdrom_generic_arg()
syzbot is reporting unkillable task [1], for the caller is failing to
handle a corrupted filesystem image which attempts to access beyond
the end of the device. While we need to fix the caller, flooding the
console with handle_bad_sector() message is unlikely useful.
[1] https://syzkaller.appspot.com/bug?id=f1f49fb971d7a3e01bd8ab8cff2ff4572ccf3092
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The __throtl_de/enqueue_tg() functions are only be called by
throtl_de/enqueue_tg(), thus we can just open code them to
make code more readable.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The throtl_schedule_next_dispatch() will validate if the service queue
is empty before calling update_min_dispatch_time(), and the
update_min_dispatch_time() will call throtl_rb_first(), which will
validate service queue again.
Thus we can move the service queue validation out of the
throtl_rb_first() to remove the redundant validation in the fast path.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should move the list operation after validation.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It can not scale up in throtl_adjusted_limit() if we set bps or iops is
1, which will cause IO hang when enable low limit. Thus we should treat
1 as a illegal value to avoid this issue.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The IO latency tracking is only for LOW limit, so we should add a
validation to avoid redundant latency tracking if the LOW limit
is not valid.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only update the tg->last_finish_time when the low limitaion is
enabled, so we can move the tg->last_finish_time validation a little
forward to avoid getting the unnecessary current time stamp if the
the low limitation is not enabled.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The throtl_downgrade_state() is always used to change to LIMIT_LOW
limitation, thus remove the latter meaningless parameter which
indicates the limitation index.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is unnecessary to force request-based DM to call into bio-based
dm_submit_bio (via indirect disk->fops->submit_bio) only to have it then
call blk_mq_submit_bio().
Fix this by establishing a request-based DM block_device_operations
(dm_rq_blk_dops, which doesn't have .submit_bio) and update
dm_setup_md_queue() to set md->disk->fops to it for
DM_TYPE_REQUEST_BASED.
Remove DM_TYPE_REQUEST_BASED conditional in dm_submit_bio and unexport
blk_mq_submit_bio.
Fixes: c62b37d96b ("block: move ->make_request_fn to struct block_device_operations")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Don't error out if the dasd_biodasdinfo symbol is not available.
Cc: stable@vger.kernel.org
Fixes: 26d7e28e38 ("s390/dasd: remove ioctl_by_bdev calls")
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
According to Documentation/block/stat.rst, inflight should not include
I/O requests that are in the queue but not yet dispatched to the device,
but blk-mq identifies as inflight any request that has a tag allocated,
which, for queues without elevator, happens at request allocation time
and before it is queued in the ctx (default case in blk_mq_submit_bio).
In addition, current behavior is different for queues with elevator from
queues without it, since for the former the driver tag is allocated at
dispatch time. A more precise approach would be to only consider
requests with state MQ_RQ_IN_FLIGHT.
This effectively reverts commit 6131837b1d ("blk-mq: count allocated
but not started requests in iostats inflight") to consolidate blk-mq
behavior with itself (elevator case) and with original documentation,
but it differs from the behavior used by the legacy path.
This version differs from v1 by using blk_mq_rq_state to access the
state attribute. Avoid using blk_mq_request_started, which was
suggested, since we don't want to include MQ_RQ_COMPLETE.
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move blk_mq_sched_try_merge to blk-merge.c, which allows to mark
a lot of the merge infrastructure static there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Also move the definition from the public blkdev.h to the private
block/blk.h header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Also move the definition from the public blkdev.h to the private
block/blk.h header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_crypt_set_ctx() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.
For now this assumption is still fine, since no callers violate it.
Making bio_crypt_set_ctx() able to fail would add unneeded complexity.
However, if a caller didn't use __GFP_DIRECT_RECLAIM, it would be very
hard to notice the bug. Make it easier by adding a WARN_ON_ONCE().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Satya Tangirala <satyat@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_crypto_rq_bio_prep() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.
However, blk_crypto_rq_bio_prep() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c.
This case isn't currently reachable with a bio that actually has an
encryption context. However, it's fragile to rely on this. Just make
blk_crypto_rq_bio_prep() able to fail.
Suggested-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_crypt_clone() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.
However, bio_crypt_clone() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via
kcryptd_io_read() in drivers/md/dm-crypt.c.
Neither case is currently reachable with a bio that actually has an
encryption context. However, it's fragile to rely on this. Just make
bio_crypt_clone() able to fail, analogous to bio_integrity_clone().
Reported-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Satya Tangirala <satyat@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All remaining callers of bdget() outside of fs/block_dev.c want to get a
reference to the struct block_device for a given struct hd_struct. Add
a helper just for that and then mark bdget static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.
This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
scsi_put_cdrom_generic_arg() is copying uninitialized stack memory to
userspace, since the compiler may leave a 3-byte hole in the middle of
`cgc32`. Fix it by adding a padding field to `struct
compat_cdrom_generic_command`.
Cc: stable@vger.kernel.org
Fixes: f3ee6e63a9 ("compat_ioctl: move CDROM_SEND_PACKET handling into scsi")
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: syzbot+85433a479a646a064ab3@syzkaller.appspotmail.com
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
DM depends on these block 5.10 commits:
22ada802ed block: use lcm_not_zero() when stacking chunk_sectors
07d098e6bb block: allow 'chunk_sectors' to be non-power-of-2
021a24460d block: add QUEUE_FLAG_NOWAIT
6abc49468e dm: add support for REQ_NOWAIT and enable it for linear target
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
'f5bbbbe4d635 ("blk-mq: sync the update nr_hw_queues with
blk_mq_queue_tag_busy_iter")' introduce a bug what we may sleep between
rcu lock. Then '530ca2c9bd69 ("blk-mq: Allow blocking queue tag iter
callbacks")' fix it by get request_queue's ref. And 'a9a808084d6a ("block:
Remove the synchronize_rcu() call from __blk_mq_update_nr_hw_queues()")'
remove the synchronize_rcu in __blk_mq_update_nr_hw_queues. We need
update the confused comments in blk_mq_queue_tag_busy_iter.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Blk-mq should call commit_rqs once 'bd.last != true' and no more
request will come(so virtscsi can kick the virtqueue, e.g.). We already
do that in 'blk_mq_dispatch_rq_list/blk_mq_try_issue_list_directly' while
list not empty and 'queued > 0'. However, we can seen the same scene
once the last request in list call queue_rq and return error like
BLK_STS_IOERR which will not requeue the request, and lead that list
empty but need call commit_rqs too(Or the request for virtscsi will stay
timeout until other request kick virtqueue).
We found this problem by do fsstress test with offline/online virtscsi
device repeat quickly.
Fixes: d666ba98f8 ("blk-mq: add mq_ops->commit_rqs()")
Reported-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We found blk_mq_alloc_rq_maps() takes more time in kernel space when
testing nvme device hot-plugging. The test and anlysis as below.
Debug code,
1, blk_mq_alloc_rq_maps():
u64 start, end;
depth = set->queue_depth;
start = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
current->pid, current->comm, current->nvcsw, current->nivcsw,
set->queue_depth, set->nr_hw_queues);
do {
err = __blk_mq_alloc_rq_maps(set);
if (!err)
break;
set->queue_depth >>= 1;
if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
err = -ENOMEM;
break;
}
} while (set->queue_depth);
end = ktime_get_ns();
pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
current->pid, current->comm,
current->nvcsw, current->nivcsw, end - start);
2, __blk_mq_alloc_rq_maps():
u64 start, end;
for (i = 0; i < set->nr_hw_queues; i++) {
start = ktime_get_ns();
if (!__blk_mq_alloc_rq_map(set, i))
goto out_unwind;
end = ktime_get_ns();
pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
}
Test nvme hot-plugging with above debug code, we found it totally cost more
than 3ms in kernel space without being scheduled out when alloc rqs for all
16 hw queues with depth 1023, each hw queue cost about 140-250us. The cost
time will be increased with hw queue number and queue depth increasing. And
in an extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it will try
"queue_depth >>= 1", more time will be consumed.
[ 428.428771] nvme nvme0: pci function 10000:01:00.0
[ 428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[ 428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[ 428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[ 432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[ 432.593404] hw queue 0 init cost time 22883 ns
[ 432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[ 432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[ 432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[ 432.596203] hw queue 0 init cost time 242630 ns
[ 432.596441] hw queue 1 init cost time 235913 ns
[ 432.596659] hw queue 2 init cost time 216461 ns
[ 432.596877] hw queue 3 init cost time 215851 ns
[ 432.597107] hw queue 4 init cost time 228406 ns
[ 432.597336] hw queue 5 init cost time 227298 ns
[ 432.597564] hw queue 6 init cost time 224633 ns
[ 432.597785] hw queue 7 init cost time 219954 ns
[ 432.597937] hw queue 8 init cost time 150930 ns
[ 432.598082] hw queue 9 init cost time 143496 ns
[ 432.598231] hw queue 10 init cost time 147261 ns
[ 432.598397] hw queue 11 init cost time 164522 ns
[ 432.598542] hw queue 12 init cost time 143401 ns
[ 432.598692] hw queue 13 init cost time 148934 ns
[ 432.598841] hw queue 14 init cost time 147194 ns
[ 432.598991] hw queue 15 init cost time 148942 ns
[ 432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[ 432.602611] nvme0n1: p1
So use this patch to trigger schedule between each hw queue init, to avoid
other threads getting stuck. It is not in atomic context when executing
__blk_mq_alloc_rq_maps(), so it is safe to call cond_resched().
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fixes up a merge issue in:
net/ipv6/route.c
on the way to a 5.9-rc7 release.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4eb508eb3761b95ad8f39dd79f03b3352481ceaf
Three fixes: one in drivers (lpfc) and two for zoned block devices.
The latter also impinges on the block layer but only to introduce a
new block API for setting the zone model rather than fiddling with the
queue directly in the zoned block driver.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCX29mRyYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishabnAP48vMYD
/cjyGAJfq/0k/U/t6pRPc5tUm89LOWcOJz0SjwD/YXcQNz7mx8MxnypAV1jbWXR7
iyWkPMYVc4EJh7oTARE=
=SQhI
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Three fixes: one in drivers (lpfc) and two for zoned block devices.
The latter also impinges on the block layer but only to introduce a
new block API for setting the zone model rather than fiddling with the
queue directly in the zoned block driver"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd: sd_zbc: Fix ZBC disk initialization
scsi: sd: sd_zbc: Fix handling of host-aware ZBC disks
scsi: lpfc: Fix initial FLOGI failure due to BBSCN not supported
An iocg may have 0 debt but non-zero delay. The current debt forgiveness
logic doesn't act on such iocgs. This can lead to unexpected behaviors - an
iocg with a little bit of debt will have its delay canceled through debt
forgiveness but one w/o any debt but active delay will have to wait out
until its delay decays out.
This patch updates the debt handling logic so that it treats delays the same
as debts. If either debt or delay is active, debt forgiveness logic kicks in
and acts on both the same way.
Also, avoid turning the debt and delay directly to zero as that can confuse
state transitions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Debt forgiveness logic was counting the number of consecutive !busy periods
as the trigger condition. While this usually works, it can easily be thrown
off by temporary fluctuations especially on configurations w/ short periods.
This patch reimplements debt forgiveness so that:
* Use the average usage over the forgiveness period instead of counting
consecutive periods.
* Debt is reduced at around the target rate (1/2 every 100ms) regardless of
ioc period duration.
* Usage threshold is raised to 50%. Combined with the preceding changes and
the switch to average usage, this makes debt forgivness a lot more
effective at reducing the amount of unnecessary idleness.
* Constants are renamed with DFGV_ prefix.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Debt sets the initial delay duration which is decayed over time. The current
debt reduction halved the debt but didn't change the delay. It prevented
future debts from increasing delay but didn't do anything to lower the
existing delay, limiting the mechanism's ability to reduce unnecessary
idling.
Reset iocg->delay to 0 after debt reduction so that iocg_kick_waitq()
recalculates new delay value based on the reduced debt amount.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Debt reduction was blocked if any iocg was short on budget in the past
period to avoid reducing debts while some iocgs are saturated. However, this
ends up unnecessarily blocking debt reduction due to temporary local
imbalances when the device is generally being underutilized, while also
failing to block when the underlying device is overwhelmed and the usage
becomes low from high latency.
Given that debt accumulation mostly happens with swapout bursts which can
significantly deteriorate the underlying device's latency response, the
current logic is not great.
Let's replace it with ioc->busy_level based condition so that we block debt
reduction when the underlying device is being saturated. ioc_forgive_debts()
call is moved after busy_level determination.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Debt reduction logic is going to be improved and expanded. Factor it out
into ioc_forgive_debts() and generalize the comment a bit. No functional
change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add QUEUE_FLAG_NOWAIT to allow a block device to advertise support for
REQ_NOWAIT. Bio-based devices may set QUEUE_FLAG_NOWAIT where
applicable.
Update QUEUE_FLAG_MQ_DEFAULT to include QUEUE_FLAG_NOWAIT. Also
update submit_bio_checks() to verify it is set for REQ_NOWAIT bios.
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No need to go through the hd_struct to find the partition number.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a littler helper to make the somewhat arcane bd_contains checks a
little more obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-5.10/block: (140 commits)
bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
bdi: invert BDI_CAP_NO_ACCT_WB
bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
mm: use SWP_SYNCHRONOUS_IO more intelligently
bdi: remove BDI_CAP_SYNCHRONOUS_IO
bdi: remove BDI_CAP_CGROUP_WRITEBACK
block: lift setting the readahead size into the block layer
md: update the optimal I/O size on reshape
bdi: initialize ->ra_pages and ->io_pages in bdi_init
aoe: set an optimal I/O size
bcache: inherit the optimal I/O size
drbd: remove dead code in device_to_statistics
fs: remove the unused SB_I_MULTIROOT flag
block: mark blkdev_get static
PM: mm: cleanup swsusp_swap_check
mm: split swap_type_of
PM: rewrite is_hibernate_resume_dev to not require an inode
mm: cleanup claim_swapfile
ocfs2: cleanup o2hb_region_dev_store
dasd: cleanup dasd_scan_partitions
...
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it. This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.
One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore. It is replaced with a queue attribute which
also is writable for easier testing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drivers shouldn't really mess with the readahead size, as that is a VM
concept. Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk. Also set
bdi->io_pages there as well by applying the same scheme based on
max_sectors. To ensure the limits work well for stacking drivers a
new helper is added to update the readahead limits from the block
limits, which is also called from disk_stack_limits.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set up a readahead size by default, as very few users have a good
reason to change it. This means code, ecryptfs, and orangefs now
set up the values while they were previously missing it, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use blkdev_get_by_dev instead of open coding it using bdget_disk +
blkdev_get, and split the code to read the partition table into a
separate helper to make it a little more obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can only scan for partitions on the whole disk, so move the flag
from struct block_device to struct gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is possible, albeit more unlikely, for a block device to have a non
power-of-2 for chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors,
which results in a full-stripe size of 1280K. This causes the RAID6's
io_opt to be advertised as 1280K, and a stacked device _could_ then be
made to use a blocksize, aka chunk_sectors, that matches non power-of-2
io_opt of underlying RAID6 -- resulting in stacked device's
chunk_sectors being a non power-of-2).
Update blk_queue_chunk_sectors() and blk_max_size_offset() to
accommodate drivers that need a non power-of-2 chunk_sectors.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using
lcm_not_zero() rather than min_not_zero() -- otherwise the final
'chunk_sectors' could result in sub-optimal alignment of IO to
component devices in the IO stack.
Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size'
then it is a bug in the driver and the device should be flagged as
'misaligned'.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bmd is allocated using kmalloc in bio_alloc_map_data, so make sure
is_null_mapped is properly initialized to false for the !null_mapped
case.
Fixes: f3256075ba ("block: remove the BIO_NULL_MAPPED flag")
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
sg_init_table zeroes its first argument, so the allocation of that argument
doesn't have to.
the semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression x;
@@
x =
- kzalloc
+ kmalloc
(...)
...
sg_init_table(x,...)
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When CONFIG_BLK_DEV_ZONED is disabled, allow using host-aware ZBC disks as
regular disks. In this case, ensure that command completion is correctly
executed by changing sd_zbc_complete() to return good_bytes instead of 0
and causing a hang during device probe (endless retries).
When CONFIG_BLK_DEV_ZONED is enabled and a host-aware disk is detected to
have partitions, it will be used as a regular disk. In this case, make sure
to not do anything in sd_zbc_revalidate_zones() as that triggers warnings.
Since all these different cases result in subtle settings of the disk queue
zoned model, introduce the block layer helper function
blk_queue_set_zoned() to generically implement setting up the effective
zoned model according to the disk type, the presence of partitions on the
disk and CONFIG_BLK_DEV_ZONED configuration.
Link: https://lore.kernel.org/r/20200915073347.832424-2-damien.lemoal@wdc.com
Fixes: b72053072c ("block: allow partitions on host aware zone devices")
Cc: <stable@vger.kernel.org>
Reported-by: Borislav Petkov <bp@alien8.de>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Do not need check the bps or iops limitation if bps or iops is unlimited.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The tg_may_dispatch() will call tg_with_in_bps_limit() and
tg_with_in_iops_limit() to check if we can dispatch a bio or
not, which will calculate bps/iops limitation multiple times.
But tg_may_dispatch() is always called under queue lock, which
means the bps/iops limitation will not change in tg_may_dispatch().
So we can calculate the bps/iops limitation only once, and pass
them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to
avoid calculating bps/iops limitation repeatedly.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The 'throtl_grp_quantum' and 'throtl_quantum' are both read-only
variables, thus better to use readable macros instead of static
variables, which can also save some spaces for .bss area.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
adjust_inuse_and_calc_cost() is responsible for reducing the amount of
donated weights dynamically in period as the budget runs low. Because we
don't want to do full donation calculation in period, we keep latching up
inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the
resulting hweight_inuse is satisfactory.
Unfortunately, the adj_step calculation was reading the active weight before
acquiring ioc->lock. Because the current thread could have lost race to
activate the iocg to another thread before entering this function, it may
read the active weight as zero before acquiring ioc->lock. When this
happens, the adj_step is calculated as zero and the incremental adjustment
loop becomes an infinite one.
Fix it by fetching the active weight after acquiring ioc->lock.
Fixes: b0853ab4a2 ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but
all hweight calculations round up and thus it may end up >= WEIGHT_ONE
triggering divide-by-zero and other issues. Bound the value to avoid
surprises.
Fixes: e08d02aa5f ("blk-iocost: implement Andy's method for donation weight updates")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These functions can be used to enable iostat for partitions on devices
like md, bcache.
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- NVMe pull request from Christoph:
- cancel async events before freeing them (David Milburn)
- revert a broken race fix (James Smart)
- fix command processing during resets (Sagi Grimberg)
- Fix a kyber crash with requeued flushes (Omar)
- Fix __bio_try_merge_page() same_page error for no merging (Ritesh)
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9boNoQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpm++D/9oEC1RazLFXwZD7rtXUMQ0bWRmbyM77Qtq
P7wn0poSSvHT6fNyd9ytf9STlTXeJz81Gk4jTRiau1HKAhc9GudYEzYFw0baNN82
AX5dO1Gt2vww+k4XAHCM0l0k2/IOgQg8d2hDJBt68bnDIW/T1T3GORqS5Ki0dw9R
EYVFbBePZTyUIAxDWnSKtNRR3TpMrfZfi9AAUpwGkKVcCZkHD4SlrNPGKd0ckD5Z
GnHdJtWjb5mIgVHMbHgWjcIjKhC7BTrL+sCqdBJ55NvfWXZ20QoKKDSx5BWl6rMI
g/eMAJjoYJ6Ih13sjIbrC7fHZBXzPRTRfqKBq8fM6oytD0cO9ZcUfpBeqiCWOyrT
SU3C1MkkqeskDGNXhjOq8lFWeyQlUgBg0rXIDDeFNusUB3QOZa3T7oirqZlfZsOi
G7WVd4/aftr+qB8GVl1HmLCg7U3rO2q6EuJ+aJDGh07TuiFi5qaPwRzmRcykKs62
UJ15W9JaNEHdGQs5rim7evz9qLCTyQqrwF7nDFBpM8hsraPPCNbwGoUbXLACtXGR
htjr5nxEoOEJs9SKZCWl9jXzvyoMkqLp4j6soVS7cZKUJU1qxMhf68FGylbHitEq
Pe1z7dG/3Pq/zV77aGTt1J40tB43tHr3gOSQ2swwjxqvYIjlvbP4xnl6SIHvLlof
blntc17XWQ==
=J16G
-----END PGP SIGNATURE-----
Merge tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Fix a regression in bdev partition locking (Christoph)
- NVMe pull request from Christoph:
- cancel async events before freeing them (David Milburn)
- revert a broken race fix (James Smart)
- fix command processing during resets (Sagi Grimberg)
- Fix a kyber crash with requeued flushes (Omar)
- Fix __bio_try_merge_page() same_page error for no merging (Ritesh)
* tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block:
block: Set same_page to false in __bio_try_merge_page if ret is false
nvme-fabrics: allow to queue requests for live queues
block: only call sched requeue_request() for scheduled requests
nvme-tcp: cancel async events before freeing event struct
nvme-rdma: cancel async events before freeing event struct
nvme-fc: cancel async events before freeing event struct
nvme: Revert: Fix controller creation races with teardown flow
block: restore a specific error code in bdev_del_partition
NVMe shares tagset between fabric queue and admin queue or between
connect_q and NS queue, so hctx_may_queue() can be called to allocate
request for these queues.
Tags can be reserved in these tagset. Before error recovery, there is
often lots of in-flight requests which can't be completed, and new
reserved request may be needed in error recovery path. However,
hctx_may_queue() can always return false because there is too many
in-flight requests which can't be completed during error handling.
Finally, nothing can proceed.
Fix this issue by always allowing reserved tag allocation in
hctx_may_queue(). This is reasonable because reserved tags are supposed
to always be available.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: David Milburn <dmilburn@redhat.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
scsi/sg.h is included more than once, Remove the one that isn't
necessary.
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The test and the explaination of the patch as bellow.
Before test we added more debug code in blkg_async_bio_workfn():
int count = 0
if (bios.head && bios.head->bi_next) {
need_plug = true;
blk_start_plug(&plug);
}
while ((bio = bio_list_pop(&bios))) {
/*io_punt is a sysctl user interface to control the print*/
if(io_punt) {
printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
current->comm, current->pid, bio->bi_iter.bi_sector,
(bio->bi_iter.bi_size)>>9, count++, need_plug);
}
submit_bio(bio);
}
if (need_plug)
blk_finish_plug(&plug);
Steps that need to be set to trigger *PUNT* io before testing:
mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
mount -t cgroup2 nodev /cgroup2
mkdir /cgroup2/cg3
echo "+io" > /cgroup2/cgroup.subtree_control
echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s
echo $$ > /cgroup2/cg3/cgroup.procs
Then use dd command to test btrfs PUNT io in current shell:
dd if=/dev/zero of=/btrfs/file bs=64K count=100000
Test hardware environment as below:
[root@localhost btrfs]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
With above debug code, test command and test environment, I did the
tests under 3 different system loads, which are triggered by stress:
1, Run 64 threads by command "stress -c 64 &"
[53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
[53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
[53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
[53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
[53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
[53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
... ...
[53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
[53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
[53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
[53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
[53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
[53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
[53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1
2, Run 32 threads by command "stress -c 32 &"
[50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
[50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
[50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
[50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
[50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
[50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
... ...
[50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
[50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
[50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
[50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
[50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
[50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
[50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
[50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1
3, Don't run thread by stress
[50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
[50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
[50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
[50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
[50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
[50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
[50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
[50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
[50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
[50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
[50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
[50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
[50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
[50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
[50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
[50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
[50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
[50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
[50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
[50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
[50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0
Analysis of above 3 test results with different system load:
>From above test, we can see more and more continuous bios can be plugged
with system load increasing. When run "stress -c 64 &", 310 continuous
bios are plugged; When run "stress -c 32 &", 260 continuous bios are
plugged; When don't run stress, at most only 2 continuous bios are
plugged, in most cases, bio_list only contains one single bio.
How to explain above phenomenon:
We know, in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it will
queue a work to workqueue blkcg_punt_bio_wq. But when the workqueue is
scheduled, it depends on the system load. When system load is low, the
workqueue will be quickly scheduled, and the bio in bio_list will be
quickly processed in blkg_async_bio_workfn(), so there is less chance
that the same io submit thread can add multiple continuous bios to
bio_list before workqueue is scheduled to run. The analysis aligned with
above test "3".
When system load is high, there is some delay before the workqueue can
be scheduled to run, the higher the system load the greater the delay.
So there is more chance that the same io submit thread can add multiple
continuous bios to bio_list. Then when the workqueue is scheduled to run,
there are more continuous bios in bio_list, which will be processed in
blkg_async_bio_workfn(). The analysis aligned with above test "1" and "2".
According to test, we can get io performance improved with the patch,
especially when system load is higher. Another optimazition is to use
the plug only when bio_list contains at least 2 bios.
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Like check_disk_changed, except that it does not call ->revalidate_disk
but leaves that to the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we hit the UINT_MAX limit of bio->bi_iter.bi_size and so we are anyway
not merging this page in this bio, then it make sense to make same_page
also as false before returning.
Without this patch, we hit below WARNING in iomap.
This mostly happens with very large memory system and / or after tweaking
vm dirty threshold params to delay writeback of dirty data.
WARNING: CPU: 18 PID: 5130 at fs/iomap/buffered-io.c:74 iomap_page_release+0x120/0x150
CPU: 18 PID: 5130 Comm: fio Kdump: loaded Tainted: G W 5.8.0-rc3 #6
Call Trace:
__remove_mapping+0x154/0x320 (unreliable)
iomap_releasepage+0x80/0x180
try_to_release_page+0x94/0xe0
invalidate_inode_page+0xc8/0x110
invalidate_mapping_pages+0x1dc/0x540
generic_fadvise+0x3c8/0x450
xfs_file_fadvise+0x2c/0xe0 [xfs]
vfs_fadvise+0x3c/0x60
ksys_fadvise64_64+0x68/0xe0
sys_fadvise64+0x28/0x40
system_call_exception+0xf8/0x1c0
system_call_common+0xf0/0x278
Fixes: cc90bc6842 ("block: fix "check bi_size overflow before merge"")
Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yang Yang reported the following crash caused by requeueing a flush
request in Kyber:
[ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
...
[ 2.517468] pc : clear_bit+0x18/0x2c
[ 2.517502] lr : sbitmap_queue_clear+0x40/0x228
[ 2.517503] sp : ffffff800832bc60 pstate : 00c00145
...
[ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
[ 2.517602] Call trace:
[ 2.517606] clear_bit+0x18/0x2c
[ 2.517619] kyber_finish_request+0x74/0x80
[ 2.517627] blk_mq_requeue_request+0x3c/0xc0
[ 2.517637] __scsi_queue_insert+0x11c/0x148
[ 2.517640] scsi_softirq_done+0x114/0x130
[ 2.517643] blk_done_softirq+0x7c/0xb0
[ 2.517651] __do_softirq+0x208/0x3bc
[ 2.517657] run_ksoftirqd+0x34/0x60
[ 2.517663] smpboot_thread_fn+0x1c4/0x2c0
[ 2.517667] kthread+0x110/0x120
[ 2.517669] ret_from_fork+0x10/0x18
This happens because Kyber doesn't track flush requests, so
kyber_finish_request() reads a garbage domain token. Only call the
scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for
the finish_request() hook in blk_mq_free_request()). Now that we're
handling it in blk-mq, also remove the check from BFQ.
Reported-by: Yang Yang <yang.yang@vivo.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch to the naming used by the other entries so that we can use the
QUEUE_RW_ENTRY helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add two helpers macros to avoid boilerplate code for the queue sysfs
entries.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
mdadm relies on the fact that deleting an invalid partition returns
-ENXIO or -ENOTTY to detect if a block device is a partition or a
whole device.
Fixes: 08fc1ab6d7 ("block: fix locking in bdev_del_partition")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now we usually free the hctx->sched_data by e->type->ops.exit_hctx(),
and no users will use blk_mq_sched_free_hctx_data() function.
Remove it.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Discarding blocks and buffers under a mounted filesystem is hardly
anything admin wants to do. Usually it will confuse the filesystem and
sometimes the loss of buffer_head state (including b_private field) can
even cause crashes like:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G O --------- - - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
...
Call Trace:
__jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
kjournald2+0xbd/0x270 [jbd2]
So if we don't have block device open with O_EXCL already, claim the
block device while we truncate buffer cache. This makes sure any
exclusive block device user (such as filesystem) cannot operate on the
device while we are discarding buffer cache.
Reported-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: fix !CONFIG_BLOCK error in truncate_bdev_range()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9SWMMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphIcD/488Q7rXb2eABp1fGs4gu+VFOCLogeHL8xh
5xHNiOPnZG2SGr8DQJY/7EX2kE65rbZi8/g+2N6anovI2nduRu0tzSra7fRgzbys
ZQC1CUel0MbCd7e8OaEfg108PSHNxBf1PqDcE7zCeyZ0DIs3s4vK/bQtmzzxZHgU
wNw4OIP9gOdqgjowb6GGHo9SLN4GT8rZ0jZVPLa7GwFsvxCTwv/7lHO8rqeSeuCu
5H6i3M/rSbtTXPLHf4Fy97x9WmBmdgu4epTXiwbOxaagpx3lm/7n1P3CpavR+Gcq
O5VGIIzazxPwnZl9y/6rZFLGYqcj38RxUvC8KtK6tDXxEu/BDJa1d6hXI03SyXAO
ZAiEpQTKOkJE3R8ewUDrXLvl3p6FvwZVZ5SIFwUb+0JFrVQYwrgfoRJtzb5SIUan
T9/bSYge7lFRI92FZRIqhvk8rsEBRdu7N/rQCyGf6GuZ0vRXWRAqN7T02iDn3czX
pdGAepU5ymw8CwyUiNNnkY0DUaQLBIO9tCA9epxLwdroQ95vJtMPRBX1STQ65GVk
XvMFAJqDAehQ/nP5xO60cWGZHyL7L/ccpofZlA/ytgAIZRa85GvhrdVy7yc6DKto
wu6h2tkX9+ldoUjVbn/60T+Ft3QUTlfAuDfherkNoFNB/G5i1pzOHbwvL7B3czr3
ZMjoNiOIqA==
=8fvz
-----END PGP SIGNATURE-----
Merge tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A bit larger than usual this week, mostly due to the NVMe fixes
arriving late for -rc3 and hence didn't make last weeks pull request.
- NVMe:
- instance leak and io boundary fixes from Keith
- fc locking fix from Christophe
- various tcp/rdma reset during traffic fixes from Sagi
- pci use-after-free fix from Tong
- tcp target null deref fix from Ziye
- Locking fix for partition removal (Christoph)
- Ensure bdi->io_pages is always set (me)
- Fixup for hd struct reference (Ming)
- Fix for zero length bvecs (Ming)
- Two small blk-iocost fixes (Tejun)"
* tag 'block-5.9-2020-09-04' of git://git.kernel.dk/linux-block:
block: allow for_each_bvec to support zero len bvec
blk-stat: make q->stats->lock irqsafe
blk-iocost: ioc_pd_free() shouldn't assume irq disabled
block: fix locking in bdev_del_partition
block: release disk reference in hd_struct_free_work
block: ensure bdi->io_pages is always initialized
nvme-pci: cancel nvme device request before disabling
nvme: only use power of two io boundaries
nvme: fix controller instance leak
nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
nvme: Fix NULL dereference for pci nvme controllers
nvme-rdma: fix reset hang if controller died in the middle of a reset
nvme-rdma: fix timeout handler
nvme-rdma: serialize controller teardown sequences
nvme-tcp: fix reset hang if controller died in the middle of a reset
nvme-tcp: fix timeout handler
nvme-tcp: serialize controller teardown sequences
nvme: have nvme_wait_freeze_timeout return if it timed out
nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
contention is possible for mq-deadline and bfq IO schedulers
when nr_hw_queues is more than one.
It is because kblockd work queue can submit IO from all online CPUs
(through blk_mq_run_hw_queues()) even though only one hctx has pending
commands.
The elevator callback .has_work for mq-deadline and bfq scheduler considers
pending work if there are any IOs on request queue but it does not account
hctx context.
Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering
the elevator even though there are no requests queued.
[jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap]
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For when using a shared sbitmap, no longer should the number of active
request queues per hctx be relied on for when judging how to share the tag
bitmap.
Instead maintain the number of active request queues per tag_set, and make
the judgement based on that.
Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The per-hctx nr_active value can no longer be used to fairly assign a share
of tag depth per request queue for when using a shared sbitmap, as it does
not consider that the tags are shared tags over all hctx's.
For this case, record the nr_active_requests per request_queue, and make
the judgement based on that value.
Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-mq.h and blk-mq-tag.h include on each other, which is less than ideal.
Locate hctx_may_queue() to blk-mq.h, as it is not really tag specific code.
In this way, we can drop the blk-mq-tag.h include of blk-mq.h
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.
In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shutdown. That problem is solved in commit bf0beec060
("blk-mq: drain I/O when all CPUs in a hctx are offline").
However, to take advantage of that blk-mq feature, the HBA HW queuess are
required to be mapped to that of the blk-mq hctx's; to do that, the HBA HW
queues need to be exposed to the upper layer.
In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.
However another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092 ("scsi:
core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would stop the LLDD being sent more than
.can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.
New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.
Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request be associated with a specific hctx,
i.e. not always hctx0. This will introduce extra memory usage.
This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].
[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass hctx/tagset flags argument down to blk_mq_init_tags() and
blk_mq_free_tags() for selective init/free.
For now, make it include the alloc policy flag, which can be evaluated
when needed (in blk_mq_init_tags()).
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the tags are allocated in blk_mq_init_tags(), it's better practice
to free in that same function upon error, rather than a callee which is to
init the bitmap tags (blk_mq_init_tags()).
[jpg: Split from an earlier patch with a new commit message]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function does not set the depth, but rather transitions from
shared to non-shared queues and vice versa.
So rename it to blk_mq_update_tag_set_shared() to better reflect
its purpose.
[jpg: take out some unrelated changes in blk_mq_init_bitmap_tags()]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BLK_MQ_F_TAG_SHARED actually means that tags is shared among request
queues, all of which should belong to LUNs attached to same HBA.
So rename it to make the point explicitly.
[jpg: rebase a few times, add rnbd-clt.c change]
Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only virtio_blk and xen-blkfront set the revalidate argument to true,
and both do not implement the ->revalidate_disk method. So switch
to the helper that just updates the size instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bd_invalidate with a new BDEV_NEED_PART_SCAN flag in a bd_flags
variable to better describe the condition.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove a duplicative condition to remove below cppcheck warnings:
"warning: Redundant condition: sched_allow_merge. '!A || (A && B)' is
equivalent to '!A || B' [redundantCondition]"
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If WRITE_ZERO/WRITE_SAME operation is not supported by the storage,
blk_cloned_rq_check_limits() will return IO error which will cause
device-mapper to fail the paths.
Instead, if the queue limit is set to 0, return BLK_STS_NOTSUPP.
BLK_STS_NOTSUPP will be ignored by device-mapper and will not fail the
paths.
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These are really cheap to collect and can be useful in debugging iocost
behavior. Add them as debug stats for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When an iocg accumulates too much vtime or gets deactivated, we throw away
some vtime, which lowers the overall device utilization. As the exact amount
which is being thrown away is known, we can compensate by accelerating the
vrate accordingly so that the extra vtime generated in the current period
matches what got lost.
This significantly improves work conservation when involving high weight
cgroups with intermittent and bursty IO patterns.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A low weight iocg can amass a large amount of debt, for example, when
anonymous memory gets reclaimed aggressively. If the system has a lot of
memory paired with a slow IO device, the debt can span multiple seconds or
more. If there are no other subsequent IO issuers, the in-debt iocg may end
up blocked paying its debt while the IO device is idle.
This patch implements a mechanism to protect against such pathological
cases. If the device has been sufficiently idle for a substantial amount of
time, the debts are halved. The criteria are on the conservative side as we
want to resolve the rare extreme cases without impacting regular operation
by forgiving debts too readily.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Debt handling had several issues.
* How much inuse a debtor carries wasn't clearly defined. inuse would be
driven down over time from not issuing IOs but it'd be better to clamp it
to minimum immediately once in debt.
* How much can be paid off was determined by hweight_inuse. As inuse was
driven down, the payment amount would fall together regardless of the
debtor's active weight. This means that the debtors were punished harshly.
* ioc_rqos_merge() wasn't calling blkcg_schedule_throttle() after
iocg_kick_delay().
This patch revamps debt handling so that
* Debt handling owns inuse for iocgs in debt and keeps them at zero.
* Payment amount is determined by hweight_active. This is more deterministic
and safer than hweight_inuse but still far from ideal in that it doesn't
factor in possible donations from other iocgs for debt payments. This
likely needs further improvements in the future.
* iocg_rqos_merge() now calls blkcg_schedule_throttle() as necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, iocost doesn't collect or expose any statistics punting off all
monitoring duties to drgn based iocost_monitor.py. While it works for some
scenarios, there are some usability and data availability challenges. For
example, accurate per-cgroup usage information can't be tracked by vtime
progression at all and the number available in iocg->usages[] are really
short-term snapshots used for control heuristics with possibly significant
errors.
This patch implements per-cgroup absolute usage stat counter and exposes it
through io.stat along with the current vrate. Usage stat collection and
flushing employ the same method as cgroup rstat on the active iocg's and the
only hot path overhead is preemption toggling and adding to a percpu
counter.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Curently, iocost syncs the delay duration to the outstanding debt amount,
which seemed enough to protect the system from anon memory hogs. However,
that was mostly because the delay calcuation was using hweight_inuse which
quickly converges towards zero under debt for delay duration calculation,
often pusnishing debtors overly harshly for longer than deserved.
The previous patch fixed the delay calcuation and now the protection against
anonymous memory hogs isn't enough because the effect of delay is indirect
and non-linear and a huge amount of future debt can accumulate abruptly
while unthrottled.
This patch implements delay hysteresis so that delay is decayed
exponentially over time instead of getting cleared immediately as debt is
paid off. While the overall behavior is similar to the blk-cgroup
implementation used by blk-iolatency, a lot of the details are different and
due to the empirical nature of the mechanism, it's challenging to adapt the
mechanism for one controller without negatively impacting the other.
As the delay is gradually decayed now, there's no point in running it from
its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and
the dedicated hrtimer is removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the margin drops below the minimum on a donating iocg, donation is
immediately canceled in full. There are a couple shortcomings with the
current behavior.
* It's abrupt. A small temporary budget deficit can lead to a wide swing in
weight allocation and a large surplus.
* It's open coded in the issue path but not implemented for the merge path.
A series of merges at a low inuse can make the iocg incur debts and stall
incorrectly.
This patch reimplements in-period donation snapbacks so that
* inuse adjustment and cost calculations are factored into
adjust_inuse_and_calc_cost() which is called from both the issue and merge
paths.
* Snapbacks are more gradual. It occurs in quarter steps.
* A snapback triggers if the margin goes below the low threshold and is
lower than the budget at the time of the last adjustment.
* For the above, __propagate_weights() stores the margin in
iocg->saved_margin. Move iocg->last_inuse storing together into
__propagate_weights() for consistency.
* Full snapback is guaranteed when there are waiters.
* With precise donation and gradual snapbacks, inuse adjustments are now a
lot more effective and the value of scaling inuse on weight changes isn't
clear. Removed inuse scaling from weight_update().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, debt handling requires only iocg->waitq.lock. In the future, we
want to adjust and propagate inuse changes depending on debt status. Let's
grab ioc->lock in debt handling paths in preparation.
* Because ioc->lock nests outside iocg->waitq.lock, the decision to grab
ioc->lock needs to be made before entering the critical sections.
* Add and use iocg_[un]lock() which handles the conditional double locking.
* Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when
the caller grabbed both locks.
This patch is prepatory and the comments contain references to future
changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
iocost has various safety nets to combat inuse adjustment calculation
inaccuracies. With Andy's method implemented in transfer_surpluses(), inuse
adjustment calculations are now accurate and we can make donation amount
determinations accurate too.
* Stop keeping track of past usage history and using the maximum. Act on the
immediate usage information.
* Remove donation constraints defined by SURPLUS_* constants. Donate
whatever isn't used.
* Determine the donation amount so that the iocg will end up with
MARGIN_TARGET_PCT budget at the end of the coming period assuming the same
usage as the previous period. TARGET is set at 50% of period, which is the
previous maximum. This provides smooth convergence for most repetitive IO
patterns.
* Apply donation logic early at 20% budget. There's no risk in doing so as
the calculation is based on the delta between the current budget and the
target budget at the end of the coming period.
* Remove preemptive iocg activation for zero cost IOs. As donation can reach
near zero now, the mere activation doesn't provide any protection anymore.
In the unlikely case that this becomes a problem, the right solution is
assigning appropriate costs for such IOs.
This significantly improves the donation determination logic while also
simplifying it. Now all donations are immediate, exact and smooth.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The margin handling was pretty inconsistent.
* ioc->margin_us and ioc->inuse_margin_vtime were used as vtime margin
thresholds. However, the two are in different units with the former
requiring conversion to vtime on use.
* iocg_kick_waitq() was using a quarter of WAITQ_TIMER_MARGIN_PCT of
period_us as the timer slack - ~1.2%. While iocg_kick_delay() was using a
quarter of ioc->margin_us - ~12.5%. There aren't strong reasons to use
different values for the two.
This patch cleans up margin and timer slack handling:
* vtime margins are now recorded in ioc->margins.{min, max} on period
duration changes and used consistently.
* Timer slack is now 1% of period_us and recorded in ioc->timer_slack_ns and
used consistently for iocg_kick_waitq() and iocg_kick_delay().
The only functional change is shortening of timer slack. No meaningful
visible change is expected.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
iocost implements work conservation by reducing iocg->inuse and propagating
the adjustment upwards proportionally. However, while I knew the target
absolute hierarchical proportion - adjusted hweight_inuse, I couldn't figure
out how to determine the iocg->inuse adjustment to achieve that and
approximated the adjustment by scaling iocg->inuse using the proportion of
the needed hweight_inuse changes.
When nested, these scalings aren't accurate even when adjusting a single
node as the donating node also receives the benefit of the donated portion.
When multiple nodes are donating as they often do, they can be wildly wrong.
iocost employed various safety nets to combat the inaccuracies. There are
ample buffers in determining how much to donate, the adjustments are
conservative and gradual. While it can achieve a reasonable level of work
conservation in simple scenarios, the inaccuracies can easily add up leading
to significant loss of total work. This in turn makes it difficult to
closely cap vrate as vrate adjustment is needed to compensate for the loss
of work. The combination of inaccurate donation calculations and vrate
adjustments can lead to wide fluctuations and clunky overall behaviors.
Andy Newell devised a method to calculate the needed ->inuse updates to
achieve the target hweight_inuse's. The method is compatible with the
proportional inuse adjustment propagation which allows all hot path
operations to be local to each iocg.
To roughly summarize, Andy's method divides the tree into donating and
non-donating parts, calculates global donation rate which is used to
determine the target hweight_inuse for each node, and then derives per-level
proportions. There's non-trivial amount of math involved. Please refer to
the following pdfs for detailed descriptions.
https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bohttps://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsEhttps://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN
This patch implements Andy's method in transfer_surpluses(). This makes the
donation calculations accurate per cycle and enables further improvements in
other parts of the donation logic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
They are in microseconds and wrap in around 1.2 hours with u32. While
unlikely, confusions from wraparounds are still possible. We aren't saving
anything meaningful by keeping these u32. Let's make them u64.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The way the surplus donation logic is structured isn't great. There are two
separate paths for starting/increasing donations and decreasing them making
the logic harder to follow and is prone to unnecessary behavior differences.
In preparation for improved donation handling, this patch restructures the
code so that
* All donors - new, increasing and decreasing - are funneled through the
same code path.
* The target donation calculation is factored into hweight_after_donation()
which is called once from the same spot for all possible donors.
* Actual inuse adjustment is factored into trasnfer_surpluses().
This change introduces a few behavior differences - e.g. donation amount
reduction now uses the max usage of the recent three periods just like new
and increasing donations, and inuse now gets adjusted upwards the same way
it gets downwards. These differences are unlikely to have severely negative
implications and the whole logic will be revamped soon.
This patch also removes two tracepoints. The existing TPs don't quite fit
the new implementation. A later patch will update and reinstate them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To improve weight donations, we want to able to scale inuse with a greater
accuracy and down below 1. Let's make non-hierarchical weights to use
WEIGHT_ONE based fixed point numbers too like hierarchical ones.
This doesn't cause any behavior changes yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Budget donations are inaccurate and could take multiple periods to converge.
To prevent triggering vrate adjustments while surplus transfers were
catching up, vrate adjustment was suppressed if donations were increasing,
which was indicated by non-zero nr_surpluses.
This entangling won't be necessary with the scheduled rewrite of donation
mechanism which will make it precise and immediate. Let's decouple the two
in preparation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We're gonna use HWEIGHT_WHOLE for regular weights too. Let's rename it to
WEIGHT_ONE.
Pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of marking iocgs with surplus with a flag and filtering for them
while walking all active iocgs, build a surpluses list. This doesn't make
much difference now but will help implementing improved donation logic which
will iterate iocgs with surplus multiple times.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, iocg->usages[] which are used to guide inuse adjustments are
calculated from vtime deltas. This, however, assumes that the hierarchical
inuse weight at the time of calculation held for the entire period, which
often isn't true and can lead to significant errors.
Now that we have absolute usage information collected, we can derive
iocg->usages[] from iocg->local_stat.usage_us so that inuse adjustment
decisions are made based on actual absolute usage. The calculated usage is
clamped between 1 and WEIGHT_ONE and WEIGHT_ONE is also used to signal
saturation regardless of the current hierarchical inuse weight.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
iocg_kick_waitq() is the function which pays debt and iocg_kick_delay()
updates the actual delay status accordingly. If iocg_kick_delay() is not
called after iocg_kick_delay() updated debt, unnecessarily large delays can
be applied temporarily.
Let's make sure such conditions don't occur by making iocg_kick_waitq()
always call iocg_kick_delay() after paying debt.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We'll make iocg_kick_waitq() call iocg_kick_delay(). Reorder them in
preparation. This is pure code reorganization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__propagate_weights() currently expects the callers to clamp inuse within
[1, active], which is needlessly fragile. The inuse adjustment logic is
going to be revamped, in preparation, let's make __propagate_weights() clamp
inuse on entry.
Also, make it avoid weight updates altogether if neither active or inuse is
changed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It already propagates two weights - active and inuse - and there will be
another soon. Let's drop the confusing misnomers. Rename
[__]propagate_active_weights() to [__]propagate_weights() and
commit_active_weights() to commit_weights().
This is pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-iocost has been reading percpu stat counters from remote cpus which on
some archs can lead to torn reads in really rare occassions. Use local[64]_t
for those counters.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use early returns and goto-based unwinding to simplify the flow a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The alignment offset is only used in slow path callers, so just calculate
it on the fly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The alignment offset is only used in slow path callers, so just calculate
it on the fly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The small blk_mq_attempt_merge() function is only called by
__blk_mq_sched_bio_merge(), just open code it.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are lots of duplicated code when trying to merge a bio from
plug list and sw queue, we can introduce a new helper to attempt
to merge a bio, which can simplify the blk_bio_list_merge()
and blk_attempt_plug_merge().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the blk_mq_bio_list_merge() into blk-merge.c and
rename it as a generic name.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It's better to move bio merge related functions into blk-merge.c,
which contains all merge related functions.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This comment was added before the multiqueue I/O scheduler framework
was introduced; multiqueue has support for I/O scheduling now, so this
obsolete comment can be removed.
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just check if there is private data, in which case the bio must have
originated from bio_copy_user_iov.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just duplicate a small amount of code in the low-level map into the bio
and copy to the bio routines, leading to much easier to follow and
maintain code, and better shared error handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Open code __blk_rq_unmap_user in the two callers. Both never pass a NULL
bio, and one of them can use an existing local variable instead of the bio
flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We can simply use a boolean flag in the bio_map_data data structure
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Two different callers use two different mutexes for updating the
block device size, which obviously doesn't help to actually protect
against concurrent updates from the different callers. In addition
one of the locks, bd_mutex is rather prone to deadlocks with other
parts of the block stack that use it for high level synchronization.
Switch to using a new spinlock protecting just the size updates, as
that is all we need, and make sure everyone does the update through
the proper helper.
This fixes a bug reported with the nvme revalidating disks during a
hot removal operation, which can currently deadlock on bd_mutex.
Reported-by: Xianting Tian <xianting_tian@126.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block-5.9:
blk-stat: make q->stats->lock irqsafe
blk-iocost: ioc_pd_free() shouldn't assume irq disabled
block: fix locking in bdev_del_partition
block: release disk reference in hd_struct_free_work
block: ensure bdi->io_pages is always initialized
nvme-pci: cancel nvme device request before disabling
nvme: only use power of two io boundaries
nvme: fix controller instance leak
nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
nvme: Fix NULL dereference for pci nvme controllers
nvme-rdma: fix reset hang if controller died in the middle of a reset
nvme-rdma: fix timeout handler
nvme-rdma: serialize controller teardown sequences
nvme-tcp: fix reset hang if controller died in the middle of a reset
nvme-tcp: fix timeout handler
nvme-tcp: serialize controller teardown sequences
nvme: have nvme_wait_freeze_timeout return if it timed out
nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
blk-iocost calls blk_stat_enable_accounting() while holding an irqsafe lock
which triggers a lockdep splat because q->stats->lock isn't irqsafe. Let's
make it irqsafe.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: cd006509b0 ("blk-iocost: account for IO size when testing latencies")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ioc_pd_free() grabs irq-safe ioc->lock without ensuring that irq is disabled
when it can be called with irq disabled or enabled. This has a small chance
of causing A-A deadlocks and triggers lockdep splats. Use irqsave operations
instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to hold the whole device bd_mutex to protect against
other thread concurrently deleting out partition before we get
to it, and thus causing a use after free.
Fixes: cddae808ae ("block: pass a hd_struct to delete_partition")
Reported-by: syzbot+6448f3c229bc52b82f69@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit e8c7d14ac6 ("block: revert back to synchronous request_queue removal")
stops to release request queue from wq context because that commit
supposed all blk_put_queue() is called in context which is allowed
to sleep. However, this assumption isn't true because we release disk's
reference in partition's percpu_ref's ->release() which doesn't allow
to sleep, because the ->release() is run via call_rcu().
Fixes this issue by moving put disk reference into hd_struct_free_work()
Fixes: e8c7d14ac6 ("block: revert back to synchronous request_queue removal")
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If a driver leaves the limit settings as the defaults, then we don't
initialize bdi->io_pages. This means that file systems may need to
work around bdi->io_pages == 0, which is somewhat messy.
Initialize the default value just like we do for ->ra_pages.
Cc: stable@vger.kernel.org
Fixes: 9491ae4aad ("mm: don't cap request size based on read-ahead setting")
Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9CwtMQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsehEAC4ReB53LLbZxqcmoA2RNs9yz1I4DM2PU6z
C+NSGGEnAFHQAhLbfCAzxbtQa6x/m64zoLd+8zHZNAeanJXarszcgSuqhXQFlEfX
7Jz/vdXGdu7Q4zgkLuO3FxleDoPoUC5qOSFHWYtMu6KvHLOkmc9DvdSUsFMDSThX
6RsoaQY2gDOD/pwtm8Cqmy89nLZdFoyxadXyk/lzxLodjeRZOwoVc+YM8YWmrXZ0
mKEEuO4uBWxUUmoyAwUABNqWWAkwTDEhrYCiiG81DkAa1Cu0mRXodN0xycr72cLZ
Ik2OlnTLCE6B0UXsBu2c0+qXGArWsvDyhEEkwF+O+Ump4IBIr72EmgZb+o2nnkXo
Uu4X/r0qeQ6XD+vBTHcE6oPUjJhV6uEXXon5aesE+vh277ILmHgMyjJKaSiJcY/E
efM5SuPRq2kuROKWLKiLJnpuJ/9ZTU/4nk4k1pOlWWOVGLHien0sWBBzQ+iWr6mm
eRl5EkI3JoahqIrNFz0+qF3DwKPVfu+B02/EzA8OXoYHIRV9KMS5eWX5hK12aZ3i
4AT3xuAanfcNs4qBAScOfHQxQu9U5Z7Mu4JQJ58xdsJd+UWBnbznUmSLob9KKk+c
X8AvAcYhb684F87VCmaCzDlIPMb46OYxLBgI6sz7L0xdc7i8TCeeEDbQCN1HixZ3
SNtKzalNXA==
=fAwK
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request from Sagi:
- nvme completion rework from Christoph and Chao that mostly came
from a bit of divergence of how we classify errors related to
pathing/retry etc.
- nvmet passthru fixes from Chaitanya
- minor nvmet fixes from Amit and I
- mpath round-robin path selection fix from Martin
- ignore noiob for zoned devices from Keith
- minor nvme-fc fix from Tianjia"
- BFQ cgroup leak fix (Dmitry)
- block layer MAINTAINERS addition (Geert)
- fix null_blk FUA checking (Hou)
- get_max_io_size() size fix (Keith)
- fix block page_is_mergeable() for compound pages (Matthew)
- discard granularity fixes (Ming)
- IO scheduler ordering fix (Ming)
- misc fixes
* tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block: (31 commits)
null_blk: fix passing of REQ_FUA flag in null_handle_rq
nvmet: Disable keep-alive timer when kato is cleared to 0h
nvme: redirect commands on dying queue
nvme: just check the status code type in nvme_is_path_error
nvme: refactor command completion
nvme: rename and document nvme_end_request
nvme: skip noiob for zoned devices
nvme-pci: fix PRP pool size
nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth
nvme: Use spin_lock_irq() when taking the ctrl->lock
nvmet: call blk_mq_free_request() directly
nvmet: fix oops in pt cmd execution
nvmet: add ns tear down label for pt-cmd handling
nvme: multipath: round-robin: eliminate "fallback" variable
nvme: multipath: round-robin: fix single non-optimized path case
nvme-fc: Fix wrong return value in __nvme_fc_init_request()
nvmet-passthru: Reject commands with non-sgl flags set
nvmet: fix a memory leak
blkcg: fix memleak for iolatency
MAINTAINERS: Add missing header files to BLOCK LAYER section
...
Normally, blkcg_iolatency_exit() will free related memory in iolatency
when cleanup queue. But if blk_throtl_init() return error and queue init
fail, blkcg_iolatency_exit() will not do that for us. Then it cause
memory leak.
Fixes: d706751215 ("block: introduce blk-iolatency io controller")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit aligning splits to physical block sizes inadvertently
modified one return case such that that it now returns 0 length splits
when the number of sectors doesn't exceed the physical offset. This
later hits a BUG in bio_split(). Restore the previous working behavior.
Fixes: 9cc5169cd4 ("block: Improve physical block alignment of split bios")
Reported-by: Eric Deal <eric.deal@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
c616cbee97 ("blk-mq: punt failed direct issue to dispatch list") supposed
to add request which has been through ->queue_rq() to the hw queue dispatch
list, however it adds request running out of budget or driver tag to hw queue
too. This way basically bypasses request merge, and causes too many request
dispatched to LLD, and system% is unnecessary increased.
Fixes this issue by adding request not through ->queue_rq into sw/scheduler
queue, and this way is safe because no ->queue_rq is called on this request
yet.
High %system can be observed on Azure storvsc device, and even soft lock
is observed. This patch reduces %system during heavy sequential IO,
meantime decreases soft lockup risk.
Fixes: c616cbee97 ("blk-mq: punt failed direct issue to dispatch list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Changes from v1:
- update commit description with proper ref-accounting justification
commit db37a34c56 ("block, bfq: get a ref to a group when adding it to a service tree")
introduce leak forbfq_group and blkcg_gq objects because of get/put
imbalance.
In fact whole idea of original commit is wrong because bfq_group entity
can not dissapear under us because it is referenced by child bfq_queue's
entities from here:
-> bfq_init_entity()
->bfqg_and_blkg_get(bfqg);
->entity->parent = bfqg->my_entity
-> bfq_put_queue(bfqq)
FINAL_PUT
->bfqg_and_blkg_put(bfqq_group(bfqq))
->kmem_cache_free(bfq_pool, bfqq);
So parent entity can not disappear while child entity is in tree,
and child entities already has proper protection.
This patch revert commit db37a34c56 ("block, bfq: get a ref to a group when adding it to a service tree")
bfq_group leak trace caused by bad commit:
-> blkg_alloc
-> bfq_pq_alloc
-> bfqg_get (+1)
->bfq_activate_bfqq
->bfq_activate_requeue_entity
-> __bfq_activate_entity
->bfq_get_entity
->bfqg_and_blkg_get (+1) <==== : Note1
->bfq_del_bfqq_busy
->bfq_deactivate_entity+0x53/0xc0 [bfq]
->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
-> bfq_forget_entity(is_in_service = true)
entity->on_st_or_in_serv = false <=== :Note2
if (is_in_service)
return; ==> do not touch reference
-> blkcg_css_offline
-> blkcg_destroy_blkgs
-> blkg_destroy
-> bfq_pd_offline
-> __bfq_deactivate_entity
if (!entity->on_st_or_in_serv) /* true, because (Note2)
return false;
-> bfq_pd_free
-> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
So bfq_group and blkcg_gq will leak forever, see test-case below.
##TESTCASE_BEGIN:
#!/bin/bash
max_iters=${1:-100}
#prep cgroup mounts
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
# Prepare blkdev
grep blkio /proc/cgroups
truncate -s 1M img
losetup /dev/loop0 img
echo bfq > /sys/block/loop0/queue/scheduler
grep blkio /proc/cgroups
for ((i=0;i<max_iters;i++))
do
mkdir -p /sys/fs/cgroup/blkio/a
echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
rmdir /sys/fs/cgroup/blkio/a
grep blkio /proc/cgroups
done
##TESTCASE_END:
Fixes: db37a34c56 ("block, bfq: get a ref to a group when adding it to a service tree")
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we pass in an offset which is larger than PAGE_SIZE, then
page_is_mergeable() thinks it's not mergeable with the previous bio_vec,
leading to a large number of bio_vecs being used. Use a slightly more
obvious test that the two pages are compatible with each other.
Fixes: 52d52d1c98 ("block: only allow contiguous page structs in a bio_vec")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When queue_max_discard_segments(q) is 1, blk_discard_mergable() will
return false for discard request, then normal request merge is applied.
However, only queue_max_segments() is checked, so max discard segment
limit isn't respected.
Check max discard segment limit in the request merge code for fixing
the issue.
Discard request failure of virtio_blk is fixed.
Fixes: 6984046608 ("block: fix the DISCARD request merge")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
SCHED_RESTART code path is relied to re-run queue for dispatch requests
in hctx->dispatch. Meantime the SCHED_RSTART flag is checked when adding
requests to hctx->dispatch.
memory barriers have to be used for ordering the following two pair of OPs:
1) adding requests to hctx->dispatch and checking SCHED_RESTART in
blk_mq_dispatch_rq_list()
2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
in blk_mq_sched_restart().
Without the added memory barrier, either:
1) blk_mq_sched_restart() may miss requests added to hctx->dispatch meantime
blk_mq_dispatch_rq_list() observes SCHED_RESTART, and not run queue in
dispatch side
or
2) blk_mq_dispatch_rq_list still sees SCHED_RESTART, and not run queue
in dispatch side, meantime checking if there is request in
hctx->dispatch from blk_mq_sched_restart() is missed.
IO hang in ltp/fs_fill test is reported by kernel test robot:
https://lkml.org/lkml/2020/7/26/77
Turns out it is caused by the above out-of-order OPs. And the IO hang
can't be observed any more after applying this patch.
Fixes: bd166ef183 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Jeffery <djeffery@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a kernel-doc warning in block/blk-mq.c:
../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert'
Fixes: 01e99aeca3 ("blk-mq: insert passthrough request into hctx->dispatch directly")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: André Almeida <andrealmeid@collabora.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl83DGUQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplTYEACs5kPPpSVdzVsOeHj0LkfQXiSlli8dbPBo
NB2+TIyr9NxJgxn8B4x+5/4DZgJaCoHOeyyzOocXQmvWGwWOTkrfX/OSyQOlRB5z
dpzqF0Huhw31MSQEiwA/8lo3omBmat9cMzMa5PJYPghMGfqyQDzVJk1lIX51a1th
oE01eBpNNsDK0OTwKrl6Rx2/OuFZnA0P3lQwgPZSLnDM6Hq+xeHTdx2LNSyE2QFv
GzYl4dFoXg3NReLv9D57b7hE6Dc95NcCDDeU7Y3cE7XPksKMA/TkVYOD20ysJ31l
9uzscvvcm2UugN2r0d/B35lf6NWmOG24SmkLMKTtExPGHOCQIbDAlSP/QQ4zz9pQ
2yA+eImpQnRsCzPbGcnBzwEF3yX5+lQYmFWac+0AHDiWEWkb8e3MzNSWPZrsN+cD
7U7c5Zw6zDEtl/naJccuZZPgQGbZgFJ/P6Wo6l5ywIPtE7wzv4MUe4eUxZhitL9M
0ZP6WIQd8oNQdNoCYVQDwPdYJYMq7uUQFUo40vaSfntZxVKZQao7cvUHwmzVzNlZ
v5UazETAx+4Eg6MNwfjKp+kt3rr6Xul7K9Nzn6R/cVacIU349FovUshm7WieoAUu
niZ40gXltxj7NDwHj3p/dqesW5Nhv/qk6hlVWoi9vdmh8vAVBy/fedQfocvKrFJy
prCI1h1UOQ==
=10Pr
-----END PGP SIGNATURE-----
Merge tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes on the block side of things:
- Discard granularity fix (Coly)
- rnbd cleanups (Guoqing)
- md error handling fix (Dan)
- md sysfs fix (Junxiao)
- Fix flush request accounting, which caused an IO slowdown for some
configurations (Ming)
- Properly propagate loop flag for partition scanning (Lennart)"
* tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
block: fix double account of flush request's driver tag
loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
rnbd: no need to set bi_end_io in rnbd_bio_map_kern
rnbd: remove rnbd_dev_submit_io
md-cluster: Fix potential error pointer dereference in resize_bitmaps()
block: check queue's limits.discard_granularity in __blkdev_issue_discard()
md: get sysfs entry after redundancy attr group create
In case of none scheduler, we share data request's driver tag for
flush request, so have to mark the flush request as INFLIGHT for
avoiding double account of this driver tag.
Fixes: 568f270065 ("blk-mq: centralise related handling into blk_mq_get_driver_tag")
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix a checkpatch warning that I fixed when applying
ANDROID-dm-add-support-for-passing-through-inline-crypto-support.patch
to android12-5.4 (http://aosp/1394279). This gets android-mainline in
sync.
This should be folded into
ANDROID-dm-add-support-for-passing-through-inline-crypto-support.patch.
Bug: 162257830
Change-Id: I083209d174630220e3b6d148d7317600ce040a84
Signed-off-by: Eric Biggers <ebiggers@google.com>
Tiny steps on the way to 5.9-rc1.
Fixes conflicts in:
fs/f2fs/inline.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I16d863ae44a51156499458e8c3486587cbe2babe
- Untangle the header spaghetti which causes build failures in various
situations caused by the lockdep additions to seqcount to validate that
the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict per
CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot
validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored and
write_seqcount_begin() has a lockdep assertion to validate that the
lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API is
unchanged and determines the type at compile time with the help of
_Generic which is possible now that the minimal GCC version has been
moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs which
have been addressed already independent of this.
While generaly useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if the
writers are serialized by an associated lock, which leads to the well
known reader preempts writer livelock. RT prevents this by storing the
associated lock pointer independent of lockdep in the seqcount and
changing the reader side to block on the lock when a reader detects
that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and initializers.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8xmPYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTuQEACyzQCjU8PgehPp9oMqWzaX2fcVyuZO
QU2yw6gmz2oTz3ZHUNwdW8UnzGh2OWosK3kDruoD9FtSS51lER1/ISfSPCGfyqxC
KTjOcB1Kvxwq/3LcCx7Zi3ZxWApat74qs3EhYhKtEiQ2Y9xv9rLq8VV1UWAwyxq0
eHpjlIJ6b6rbt+ARslaB7drnccOsdK+W/roNj4kfyt+gezjBfojGRdMGQNMFcpnv
shuTC+vYurAVIiVA/0IuizgHfwZiXOtVpjVoEWaxg6bBH6HNuYMYzdSa/YrlDkZs
n/aBI/Xkvx+Eacu8b1Zwmbzs5EnikUK/2dMqbzXKUZK61eV4hX5c2xrnr1yGWKTs
F/juh69Squ7X6VZyKVgJ9RIccVueqwR2EprXWgH3+RMice5kjnXH4zURp0GHALxa
DFPfB6fawcH3Ps87kcRFvjgm6FBo0hJ1AxmsW1dY4ACFB9azFa2euW+AARDzHOy2
VRsUdhL9CGwtPjXcZ/9Rhej6fZLGBXKr8uq5QiMuvttp4b6+j9FEfBgD4S6h8csl
AT2c2I9LcbWqyUM9P4S7zY/YgOZw88vHRuDH7tEBdIeoiHfrbSBU7EQ9jlAKq/59
f+Htu2Io281c005g7DEeuCYvpzSYnJnAitj5Lmp/kzk2Wn3utY1uIAVszqwf95Ul
81ppn2KlvzUK8g==
=7Gj+
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Thomas Gleixner:
"A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in
various situations caused by the lockdep additions to seqcount to
validate that the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict
per CPU seqcounts. As the lock is not part of the seqcount, lockdep
cannot validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored
and write_seqcount_begin() has a lockdep assertion to validate that
the lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API
is unchanged and determines the type at compile time with the help
of _Generic which is possible now that the minimal GCC version has
been moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs
which have been addressed already independent of this.
While generally useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if
the writers are serialized by an associated lock, which leads to
the well known reader preempts writer livelock. RT prevents this by
storing the associated lock pointer independent of lockdep in the
seqcount and changing the reader side to block on the lock when a
reader detects that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and
initializers"
* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/seqlock, headers: Untangle the spaghetti monster
locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
x86/headers: Remove APIC headers from <asm/smp.h>
seqcount: More consistent seqprop names
seqcount: Compress SEQCNT_LOCKNAME_ZERO()
seqlock: Fold seqcount_LOCKNAME_init() definition
seqlock: Fold seqcount_LOCKNAME_t definition
seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
hrtimer: Use sequence counter with associated raw spinlock
kvm/eventfd: Use sequence counter with associated spinlock
userfaultfd: Use sequence counter with associated spinlock
NFSv4: Use sequence counter with associated spinlock
iocost: Use sequence counter with associated spinlock
raid5: Use sequence counter with associated spinlock
vfs: Use sequence counter with associated spinlock
timekeeping: Use sequence counter with associated raw spinlock
xfrm: policy: Use sequence counters with associated lock
netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
netfilter: conntrack: Use sequence counter with associated spinlock
sched: tasks: Use sequence counter with associated spinlock
...
Steps on the way to 5.9-rc1
Resolves conflicts in:
drivers/irqchip/qcom-pdc.c
include/linux/device.h
net/xfrm/xfrm_state.c
security/lsm_audit.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4aeb3d04f4717714a421721eb3ce690c099bb30a
This series consists of the usual driver updates (ufs, qla2xxx, tcmu,
lpfc, hpsa, zfcp, scsi_debug) and minor bug fixes. We also have a
huge docbook fix update like most other subsystems and no major update
to the core (the few non trivial updates are either minor fixes or
removing an unused feature [scsi_sdb_cache]).
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXyxq1yYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishSoAAQChZ4i8
ZqYW3pL33JO3fA8vdjvLuyC489Hj4wzIsl3/bQEAxYyM6BSLvMoLWR2Plq/JmTLm
4W/LDptarpTiDI3NuDc=
=4b0W
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This consists of the usual driver updates (ufs, qla2xxx, tcmu, lpfc,
hpsa, zfcp, scsi_debug) and minor bug fixes.
We also have a huge docbook fix update like most other subsystems and
no major update to the core (the few non trivial updates are either
minor fixes or removing an unused feature [scsi_sdb_cache])"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (307 commits)
scsi: scsi_transport_srp: Sanitize scsi_target_block/unblock sequences
scsi: ufs-mediatek: Apply DELAY_AFTER_LPM quirk to Micron devices
scsi: ufs: Introduce device quirk "DELAY_AFTER_LPM"
scsi: virtio-scsi: Correctly handle the case where all LUNs are unplugged
scsi: scsi_debug: Implement tur_ms_to_ready parameter
scsi: scsi_debug: Fix request sense
scsi: lpfc: Fix typo in comment for ULP
scsi: ufs-mediatek: Prevent LPM operation on undeclared VCC
scsi: iscsi: Do not put host in iscsi_set_flashnode_param()
scsi: hpsa: Correct ctrl queue depth
scsi: target: tcmu: Make TMR notification optional
scsi: target: tcmu: Implement tmr_notify callback
scsi: target: tcmu: Fix and simplify timeout handling
scsi: target: tcmu: Factor out new helper ring_insert_padding
scsi: target: tcmu: Do not queue aborted commands
scsi: target: tcmu: Use priv pointer in se_cmd
scsi: target: Add tmr_notify backend function
scsi: target: Modify core_tmr_abort_task()
scsi: target: iscsi: Fix inconsistent debug message
scsi: target: iscsi: Fix login error when receiving
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8pjAEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgphuHEAC5hNAi99HLktAQ7qZy4cBqnGNKCrguFszq
Kxiecp3Nrb9EnAPNWYG+QMO0kD9z8quML85beBaJNxN1PlOk9pawqFAd4ziFncFI
ruZwIMP+oH/0OmPUA2a4ymrqu+rpyFvfsDL2RKJ9dirAt9fuv9W0RZM5g7Oz83Xi
cNdPRn0tOhK0DTPxL4M1/NR2OutSgvKDfA5Et3IrDFl7+bJAEFqmSO8wOSdZtvFp
KcR4O/DXnr5Wl6cPvzlvooQze8vGGJkXAyIKaC9cuBm/nlzMCBGG8kE0v3kRJ8Sc
uSSFkC+P+OlktY4JwXN+mCacDUdVBiiL/uUs1zel6HmociBgh67mgyJ6AfQtGZry
yVl9mj44qWZjAzCODv5KnuxlH+gBacdmjcQqwxsZ2P477gfNkxmBXgHeWdfzO9A/
zTUXaBDXg3VdYxQfD8zTWPkCwXYp+YG3SRb9pfrIWIiYuz2UECZTvl/8Upnacz2B
POTf+6vcNDlILCtboVE0mKEYR0ckxqrbs0NQloQdmVOfXNyhLml9OrXmwJIffVtE
pZ9g428c5bm44lIOiB2eW+QPsXo0s8GxqIrMtxzKsJ3WgFefwLiVDLJBqEt78jRJ
RvpGUxrMLgWFubowH8yDmWV+Fp0NpqcqF+GU45z8nGC3OTS+i0ZvUFYgLM6a2uOf
sv4bzDPDBg==
=uMth
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block
Pull block stacking updates from Jens Axboe:
"The stacking related fixes depended on both the core block and drivers
branches, so here's a topic branch with that change.
Outside of that, a late fix from Johannes for zone revalidation"
* tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block:
block: don't do revalidate zones on invalid devices
block: remove blk_queue_stack_limits
block: remove bdev_stack_limits
block: inherit the zoned characteristics in blk_stack_limits
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8od3oQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppkpD/9D+XqD9qYcYTj+ShVCc5+3RtMG5ZiAAX0y
l4QXomentn/1Y0UYXFGJH7JLZWrKYT0QiktLtfpe5pmTqRUkckTIyJQlsHb+K6Dz
lFjtywRK9pcFYgiWIUg80wlJKrTa8QdnrlS/Esn4YITKGRbgMIdFvq2jymXC+1ho
RgodlgzcBUREgHSLo0H3cqEKA53fQiJhKC6CbFrFdrkpf2yUpcTfEDtpSwuIuPj3
2AUed1qXUtNjdHciCn3N37OuHqXKAA9noXAWfg9Gx/5zfGUNX9QJvlsny1AopgS0
jJvPSDVAhu/qRLHW6q/ZOT0JAlHegguuTAOtgMh2cMpAS5sumCAtltxVcI7Qnx41
HalMpTefXsVoBo0gfjqldnIPt34ZNj5aH5GYaH/wPpSg6VkTVBJK8GuQDBvg27qT
w+U/T6EzuqniWXh/P3COhfrMCR9ueUOY1qWCRwzomlpeIfBhCzidt2wUqIxX1TOA
Q0Ltf0eERDevsZbE+tIm+VAAg98kHehcS2t8lfFYFO6/PKu2iJpJt/HtJbZNBE+W
rm96E4qXRiy1UuL7D9vBkaWsbnosuNHgGQXx57GlokQU+2IGBmOxV52XHiSxxpXd
AS1ZTd56ItmID8VaU09Pbf7ZFbiCgdEAxIbUFzaCuvo+lxryHFphIUARNi/zPnNT
UC2OzunCqA==
=oADH
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
- NVMe:
- ZNS support (Aravind, Keith, Matias, Niklas)
- Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David,
Dongli, Max, Sagi)
- null_blk zone capacity support (Aravind)
- MD:
- raid5/6 fixes (ChangSyun)
- Warning fixes (Damien)
- raid5 stripe fixes (Guoqing, Song, Yufen)
- sysfs deadlock fix (Junxiao)
- raid10 deadlock fix (Vitaly)
- struct_size conversions (Gustavo)
- Set of bcache updates/fixes (Coly)
* tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits)
md/raid5: Allow degraded raid6 to do rmw
md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
raid5: don't duplicate code for different paths in handle_stripe
raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show
md: print errno in super_written
md/raid5: remove the redundant setting of STRIPE_HANDLE
md: register new md sysfs file 'uuid' read-only
md: fix max sectors calculation for super 1.0
nvme-loop: remove extra variable in create ctrl
nvme-loop: set ctrl state connecting after init
nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths
nvme-multipath: fix logic for non-optimized paths
nvme-rdma: fix controller reset hang during traffic
nvme-tcp: fix controller reset hang during traffic
nvmet: introduce the passthru Kconfig option
nvmet: introduce the passthru configfs interface
nvmet: Add passthru enable/disable helpers
nvmet: add passthru code to process commands
nvme: export nvme_find_get_ns() and nvme_put_ns()
nvme: introduce nvme_ctrl_get_by_path()
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7asQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplrCD/0S17kio+k4cOJDGwl88WoJw+QiYmM5019k
decZ1JymQvV1HXRmlcZiEAu0hHDD0FoovSRrw7II3gw3GouETmYQM62f6ZTpDeMD
CED/fidnfULAkPaI6h+bj3jyI0cEuujG/R47rGSQEkIIr3RttqKZUzVkB9KN+KMw
+OBuXZtMIoFFEVJ91qwC2dm2qHLqOn1/5MlT59knso/xbPOYOXsFQpGiACJqF97x
6qSSI8uGE+HZqvL2OLWPDBbLEJhrq+dzCgxln5VlvLele4UcRhOdonUb7nUwEKCe
zwvtXzz16u1D1b8bJL4Kg5bGqyUAQUCSShsfBJJxh6vTTULiHyCX5sQaai1OEB16
4dpBL9E+nOUUix4wo9XBY0/KIYaPWg5L1CoEwkAXqkXPhFvNUucsC0u6KvmzZR3V
1OogVTjl6GhS8uEVQjTKNshkTIC9QHEMXDUOHtINDCb/sLU+ANXU5UpvsuzZ9+kt
KGc4mdyCwaKBq4YW9sVwhhq/RHLD4AUtWZiUVfOE+0cltCLJUNMbQsJ+XrcYaQnm
W4zz22Rep+SJuQNVcCW/w7N2zN3yB6gC1qeroSLvzw4b5el2TdFp+BcgVlLHK+uh
xjsGNCq++fyzNk7vvMZ5hVq4JGXYjza7AiP5HlQ8nqdiPUKUPatWCBqUm9i9Cz/B
n+0dlYbRwQ==
=2vmy
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Lots of cleanups in here, hardening the code and/or making it easier
to read and fixing bugs, but a core feature/change too adding support
for real async buffered reads. With the latter in place, we just need
buffered write async support and we're done relying on kthreads for
the fast path. In detail:
- Cleanup how memory accounting is done on ring setup/free (Bijan)
- sq array offset calculation fixup (Dmitry)
- Consistently handle blocking off O_DIRECT submission path (me)
- Support proper async buffered reads, instead of relying on kthread
offload for that. This uses the page waitqueue to drive retries
from task_work, like we handle poll based retry. (me)
- IO completion optimizations (me)
- Fix race with accounting and ring fd install (me)
- Support EPOLLEXCLUSIVE (Jiufei)
- Get rid of the io_kiocb unionizing, made possible by shrinking
other bits (Pavel)
- Completion side cleanups (Pavel)
- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)
- Request environment grabbing cleanups (Pavel)
- File and socket read/write cleanups (Pavel)
- Improve kiocb_set_rw_flags() (Pavel)
- Tons of fixes and cleanups (Pavel)
- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"
* tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
io_uring: flip if handling after io_setup_async_rw
fs: optimise kiocb_set_rw_flags()
io_uring: don't touch 'ctx' after installing file descriptor
io_uring: get rid of atomic FAA for cq_timeouts
io_uring: consolidate *_check_overflow accounting
io_uring: fix stalled deferred requests
io_uring: fix racy overflow count reporting
io_uring: deduplicate __io_complete_rw()
io_uring: de-unionise io_kiocb
io-wq: update hash bits
io_uring: fix missing io_queue_linked_timeout()
io_uring: mark ->work uninitialised after cleanup
io_uring: deduplicate io_grab_files() calls
io_uring: don't do opcode prep twice
io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
io_uring: batch put_task_struct()
tasks: add put_task_struct_many()
io_uring: return locked and pinned page accounting
io_uring: don't miscount pinned memory
io_uring: don't open-code recv kbuf managment
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec
yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/
FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS
x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB
b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut
8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh
SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi
yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv
GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw
ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb
+BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR
RzAUdZFbWw==
=abJG
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"Good amount of cleanups and tech debt removals in here, and as a
result, the diffstat shows a nice net reduction in code.
- Softirq completion cleanups (Christoph)
- Stop using ->queuedata (Christoph)
- Cleanup bd claiming (Christoph)
- Use check_events, moving away from the legacy media change
(Christoph)
- Use inode i_blkbits consistently (Christoph)
- Remove old unused writeback congestion bits (Christoph)
- Cleanup/unify submission path (Christoph)
- Use bio_uninit consistently, instead of bio_disassociate_blkg
(Christoph)
- sbitmap cleared bits handling (John)
- Request merging blktrace event addition (Jan)
- sysfs add/remove race fixes (Luis)
- blk-mq tag fixes/optimizations (Ming)
- Duplicate words in comments (Randy)
- Flush deferral cleanup (Yufen)
- IO context locking/retry fixes (John)
- struct_size() usage (Gustavo)
- blk-iocost fixes (Chengming)
- blk-cgroup IO stats fixes (Boris)
- Various little fixes"
* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
block: blk-timeout: delete duplicated word
block: blk-mq-sched: delete duplicated word
block: blk-mq: delete duplicated word
block: genhd: delete duplicated words
block: elevator: delete duplicated word and fix typos
block: bio: delete duplicated words
block: bfq-iosched: fix duplicated word
iocost_monitor: start from the oldest usage index
iocost: Fix check condition of iocg abs_vdebt
block: Remove callback typedefs for blk_mq_ops
block: Use non _rcu version of list functions for tag_set_list
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
...
When we loose a device for whatever reason while (re)scanning zones, we
trip over a NULL pointer in blk_revalidate_zone_cb, like in the following
log:
sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB)
sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed
sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08
sd 0:0:0:0: [sda] Sense Key : 0xb [current]
sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6
sda: failed to revalidate zones
sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B)
sda: detected capacity change from 14000519643136 to 0
==================================================================
BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550
Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58
CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Workqueue: events_unbound async_run_entry_fn
Call Trace:
dump_stack+0x7d/0xb0
? blk_revalidate_zone_cb+0x1b7/0x550
kasan_report.cold+0x5/0x37
? blk_revalidate_zone_cb+0x1b7/0x550
check_memory_region+0x145/0x1a0
blk_revalidate_zone_cb+0x1b7/0x550
sd_zbc_parse_report+0x1f1/0x370
? blk_req_zone_write_trylock+0x200/0x200
? sectors_to_logical+0x60/0x60
? blk_req_zone_write_trylock+0x200/0x200
? blk_req_zone_write_trylock+0x200/0x200
sd_zbc_report_zones+0x3c4/0x5e0
? sd_dif_config_host+0x500/0x500
blk_revalidate_disk_zones+0x231/0x44d
? _raw_write_lock_irqsave+0xb0/0xb0
? blk_queue_free_zone_bitmaps+0xd0/0xd0
sd_zbc_read_zones+0x8cf/0x11a0
sd_revalidate_disk+0x305c/0x64e0
? __device_add_disk+0x776/0xf20
? read_capacity_16.part.0+0x1080/0x1080
? blk_alloc_devt+0x250/0x250
? create_object.isra.0+0x595/0xa20
? kasan_unpoison_shadow+0x33/0x40
sd_probe+0x8dc/0xcd2
really_probe+0x20e/0xaf0
__driver_attach_async_helper+0x249/0x2d0
async_run_entry_fn+0xbe/0x560
process_one_work+0x764/0x1290
? _raw_read_unlock_irqrestore+0x30/0x30
worker_thread+0x598/0x12f0
? __kthread_parkme+0xc6/0x1b0
? schedule+0xed/0x2c0
? process_one_work+0x1290/0x1290
kthread+0x36b/0x440
? kthread_create_worker_on_cpu+0xa0/0xa0
ret_from_fork+0x22/0x30
==================================================================
When the device is already gone we end up with the following scenario:
The device's capacity is 0 and thus the number of zones will be 0 as well. When
allocating the bitmap for the conventional zones, we then trip over a NULL
pointer.
So if we encounter a zoned block device with a 0 capacity, don't dare to
revalidate the zones sizes.
Fixes: 6c6b354914 ("block: set the zone size in blk_revalidate_disk_zones atomically")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop the repeated word "request".
Change to the correct kernel-doc notation for function name separtor.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop the repeated word "to".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop the repeated word "the".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop the repeated word "to" in multiple places.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop the repeated word "the".
Fix typos of "features" and "specified".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Drop the repeated words "a" and "the".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We shouldn't skip iocg when its abs_vdebt is not zero.
Fixes: 0b80f9866e ("iocost: protect iocg->abs_vdebt with iocg->waitq.lock")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A sequence counter write side critical section must be protected by some
form of locking to serialize writers. A plain seqcount_t does not
contain the information of which lock must be held when entering a write
side critical section.
Use the new seqcount_spinlock_t data type, which allows to associate a
spinlock with the sequence counter. This enables lockdep to verify that
the spinlock used for writer serialization is held when the write side
critical section is entered.
If lockdep is disabled this lock association is compiled out and has
neither storage size nor runtime overhead.
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20200720155530.1173732-21-a.darwish@linutronix.de
tag_set_list is only accessed under the tag_set_lock lock. There is
no need for using the _rcu list functions.
The _rcu list function were introduced to allow read access to the
tag_set_list protected under RCU, see 705cda97ee ("blk-mq: Make it
safe to use RCU to iterate over blk_mq_tag_set.tag_list") and
05b7941394 ("Revert "blk-mq: don't handle TAG_SHARED in restart"").
Those changes got reverted later but the cleanup commit missed a
couple of places to undo the changes.
Fixes: 97889f9ac2 ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()"
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 05d18ae1cc ("scsi: pm: Balance pm_only counter of request queue
during system resume") fixed a problem in the block layer's runtime-PM
code: blk_set_runtime_active() failed to call blk_clear_pm_only().
However, the commit's implementation was awkward; it forced the SCSI
system-resume handler to choose whether to call blk_post_runtime_resume()
or blk_set_runtime_active(), depending on whether or not the SCSI device
had previously been runtime suspended.
This patch simplifies the situation considerably by adding the missing
function call directly into blk_set_runtime_active() (under the condition
that the queue is not already in the RPM_ACTIVE state). This allows the
SCSI routine to revert back to its original form. Furthermore, making this
change reveals that blk_post_runtime_resume() (in its success pathway) does
exactly the same thing as blk_set_runtime_active(). The duplicate code is
easily removed by making one routine call the other.
No functional changes are intended.
Link: https://lore.kernel.org/r/20200706151436.GA702867@rowland.harvard.edu
CC: Can Guo <cang@codeaurora.org>
CC: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This function is just a tiny wrapper around blk_stack_limits. Open code
it int the two callers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This function is just a tiny wrapper around blk_stack_limit and has
two callers. Simplify the stack a bit by open coding it in the two
callers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lift the code from device mapper into blk_stack_limits to inherity
the stacking limitations. This ensures we do the right thing for
all stacked zoned block devices.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* for-5.9/drivers: (38 commits)
block: add max_active_zones to blk-sysfs
block: add max_open_zones to blk-sysfs
s390/dasd: Use struct_size() helper
s390/dasd: fix inability to use DASD with DIAG driver
md-cluster: fix wild pointer of unlock_all_bitmaps()
md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes
md: fix deadlock causing by sysfs_notify
md: improve io stats accounting
md: raid0/linear: fix dereference before null check on pointer mddev
rsxx: switch from 'pci_free_consistent()' to 'dma_free_coherent()'
nvme: remove ns->disk checks
nvme-pci: use standard block status symbolic names
nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size()
nvme-pci: add a blank line after declarations
nvme-pci: fix some comments issues
nvme-pci: remove redundant segment validation
nvme: document quirked Intel models
nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs
nvme: support for zoned namespaces
nvme: support for multiple Command Sets Supported and Effects log pages
...
* for-5.9/block: (124 commits)
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
block: use bd_prepare_to_claim directly in the loop driver
block: refactor bd_start_claiming
block: simplify the restart case in __blkdev_get
Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait."
block: always remove partitions from blk_drop_partitions()
block: relax jiffies rounding for timeouts
blk-mq: remove redundant validation in __blk_mq_end_request()
blk-mq: Remove unnecessary local variable
writeback: remove bdi->congested_fn
writeback: remove struct bdi_writeback_congested
writeback: remove {set,clear}_wb_congested
...
In order to improve consistency and usability in cgroup stat accounting,
we would like to support the root cgroup's io.stat.
Since the root cgroup has processes doing io even if the system has no
explicitly created cgroups, we need to be careful to avoid overhead in
that case. For that reason, the rstat algorithms don't handle the root
cgroup, so just turning the file on wouldn't give correct statistics.
To get around this, we simulate flushing the iostat struct by filling it
out directly from global disk stats. The result is a root cgroup io.stat
file consistent with both /proc/diskstats and io.stat.
Note that in order to collect the disk stats, we needed to iterate over
devices. To facilitate that, we had to change the linkage of a disk_type
to external so that it can be used from blk-cgroup.c to iterate over
disks.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Previously, the code which printed io.stat only needed access to the
generic rstat flushing code, but since we plan to write some more
specific code for preparing root cgroup stats, we need to manipulate
iostat structs directly. Since declaring static functions ahead does not
seem like common practice in this file, simply move the iostat functions
up. We only plan to use blkg_iostat_set, but it seems better to keep them
all together.
Signed-off-by: Boris Burkov <boris@bur.io>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch improves discard bio split for address and size alignment in
__blkdev_issue_discard(). The aligned discard bio may help underlying
device controller to perform better discard and internal garbage
collection, and avoid unnecessary internal fragment.
Current discard bio split algorithm in __blkdev_issue_discard() may have
non-discarded fregment on device even the discard bio LBA and size are
both aligned to device's discard granularity size.
Here is the example steps on how to reproduce the above problem.
- On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
with thin mode and give it to a Linux virtual machine.
- Inside the Linux virtual machine, if the 50GB virtual disk shows up as
/dev/sdb, fill data into the first 50GB by,
# dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
- Discard the 50GB range from offset 0 on /dev/sdb,
# blkdiscard /dev/sdb -o 0 -l 53687091200
- Observe the underlying mapping status of the device
# sg_get_lba_status /dev/sdb -m 1048 --lba=0
descriptor LBA: 0x0000000000000000 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000000000800 blocks: 16773120 deallocated
descriptor LBA: 0x0000000000fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000017ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000001800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000001fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000027ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000002800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000002fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000037ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000003800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000003fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000047ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000004800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000004fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005000000 blocks: 8386560 deallocated
descriptor LBA: 0x00000000057ff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000005800000 blocks: 8386560 deallocated
descriptor LBA: 0x0000000005fff800 blocks: 2048 mapped (or unknown)
descriptor LBA: 0x0000000006000000 blocks: 6291456 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated
Although the discard bio starts at LBA 0 and has 50<<30 bytes size which
are perfect aligned to the discard granularity, from the above list
these are many 1MB (2048 sectors) internal fragments exist unexpectedly.
The problem is in __blkdev_issue_discard(), an improper algorithm causes
an improper bio size which is not aligned.
25 int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
26 sector_t nr_sects, gfp_t gfp_mask, int flags,
27 struct bio **biop)
28 {
29 struct request_queue *q = bdev_get_queue(bdev);
[snipped]
56
57 while (nr_sects) {
58 sector_t req_sects = min_t(sector_t, nr_sects,
59 bio_allowed_max_sectors(q));
60
61 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
62
63 bio = blk_next_bio(bio, 0, gfp_mask);
64 bio->bi_iter.bi_sector = sector;
65 bio_set_dev(bio, bdev);
66 bio_set_op_attrs(bio, op, 0);
67
68 bio->bi_iter.bi_size = req_sects << 9;
69 sector += req_sects;
70 nr_sects -= req_sects;
[snipped]
79 }
80
81 *biop = bio;
82 return 0;
83 }
84 EXPORT_SYMBOL(__blkdev_issue_discard);
At line 58-59, to discard a 50GB range, req_sects is set as return value
of bio_allowed_max_sectors(q), which is 8388607 sectors. In the above
case, the discard granularity is 2048 sectors, although the start LBA
and discard length are aligned to discard granularity, req_sects never
has chance to be aligned to discard granularity. This is why there are
some still-mapped 2048 sectors fragment in every 4 or 8 GB range.
If req_sects at line 58 is set to a value aligned to discard_granularity
and close to UNIT_MAX, then all consequent split bios inside device
driver are (almostly) aligned to discard_granularity of the device
queue. The 2048 sectors still-mapped fragment will disappear.
This patch introduces bio_aligned_discard_max_sectors() to return the
the value which is aligned to q->limits.discard_granularity and closest
to UINT_MAX. Then this patch replaces bio_allowed_max_sectors() with
this new routine to decide a more proper split bio length.
But we still need to handle the situation when discard start LBA is not
aligned to q->limits.discard_granularity, otherwise even the length is
aligned, current code may still leave 2048 fragment around every 4GB
range. Therefore, to calculate req_sects, firstly the start LBA of
discard range is checked (including partition offset), if it is not
aligned to discard granularity, the first split location should make
sure following bio has bi_sector aligned to discard granularity. Then
there won't be still-mapped fragment in the middle of the discard range.
The above is how this patch improves discard bio alignment in
__blkdev_issue_discard(). Now with this patch, after discard with same
command line mentiond previously, sg_get_lba_status returns,
descriptor LBA: 0x0000000000000000 blocks: 106954752 deallocated
descriptor LBA: 0x0000000006600000 blocks: 0 deallocated
We an see there is no 2048 sectors segment anymore, everything is clean.
Reported-and-tested-by: Acshai Manoj <acshai.manoj@microfocus.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Enzo Matsumiya <ematsumiya@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 7520872c0c ("block: don't defer flushes on blk-mq + scheduling")
tried to fix deadlock for cycled wait between flush requests and data
request into flush_data_in_flight. The former holded all driver tags
and wait for data request completion, but the latter can not complete
for waiting free driver tags.
After commit 923218f616 ("blk-mq: don't allocate driver tag upfront
for flush rq"), flush requests will not get driver tag before queuing
into flush queue.
* With elevator, flush request just get sched_tags before inserting
flush queue. It will not get driver tag until issue them to driver.
data request on list fq->flush_data_in_flight will complete in
the end.
* Without elevator, each flush request will get a driver tag when
allocate request. Then data request on fq->flush_data_in_flight
don't worry about lacking driver tag.
In both of these cases, cycled wait cannot be true. So we may allow
to defer flush request.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The sparse tool complains as follows:
block/blk-timeout.c:93:12: warning:
symbol 'blk_timeout_init' was not declared. Should it be static?
Function blk_timeout_init() is not used outside of blk-timeout.c, so
mark it static.
Fixes: 9054650fac ("block: relax jiffies rounding for timeouts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
The reverse-order double lock dance in ioc_release_fn() is using a
retry loop. This is a problem on PREEMPT_RT because it could preempt
the task that would release q->queue_lock and thus live lock in the
retry loop.
RCU is already managing the freeing of the request queue and icq. If
the trylock fails, use RCU to guarantee that the request queue and
icq are not freed and re-acquire the locks in the correct order,
allowing forward progress.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The legacy CFQ IO scheduler could call put_io_context() in its exit_icq()
elevator callback. This led to a lockdep warning, which was fixed in
commit d8c66c5d59 ("block: fix lockdep warning on io_context release
put_io_context()") by using a nested subclass for the ioc spinlock.
However, with commit f382fb0bce ("block: remove legacy IO schedulers")
the CFQ IO scheduler no longer exists.
The BFQ IO scheduler also implements the exit_icq() elevator callback but
does not call put_io_context().
The nested subclass for the ioc spinlock is no longer needed. Since it
existed as an exception and no longer applies, remove the nested subclass
usage.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new max_active zones definition in the sysfs documentation.
This definition will be common for all devices utilizing the zoned block
device support in the kernel.
Export max_active_zones according to this new definition for NVMe Zoned
Namespace devices, ZAC ATA devices (which are treated as SCSI devices by
the kernel), and ZBC SCSI devices.
Add the new max_active_zones member to struct request_queue, rather
than as a queue limit, since this property cannot be split across stacking
drivers.
For SCSI devices, even though max active zones is not part of the ZBC/ZAC
spec, export max_active_zones as 0, signifying "no limit".
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new max_open_zones definition in the sysfs documentation.
This definition will be common for all devices utilizing the zoned block
device support in the kernel.
Export max open zones according to this new definition for NVMe Zoned
Namespace devices, ZAC ATA devices (which are treated as SCSI devices by
the kernel), and ZBC SCSI devices.
Add the new max_open_zones member to struct request_queue, rather
than as a queue limit, since this property cannot be split across stacking
drivers.
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Javier González <javier@javigon.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 826f2f48da.
Qian Cai reports that this commit causes stalls with swap. Revert until
the reason can be figured out.
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In theory, when GENHD_FL_NO_PART_SCAN is set, no partitions can be created
on one disk. However, ioctl(BLKPG, BLKPG_ADD_PARTITION) doesn't check
GENHD_FL_NO_PART_SCAN, so partitions still can be added even though
GENHD_FL_NO_PART_SCAN is set.
So far blk_drop_partitions() only removes partitions when disk_part_scan_enabled()
return true. This way can make ghost partition on loop device after changing/clearing
FD in case that PARTSCAN is disabled, such as partitions can be added
via 'parted' on loop disk even though GENHD_FL_NO_PART_SCAN is set.
Fix this issue by always removing partitions in blk_drop_partitions(), and
this way is correct because the current code supposes that no partitions
can be added in case of GENHD_FL_NO_PART_SCAN.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In doing high IOPS testing, blk-mq is generally pretty well optimized.
There are a few things that stuck out as using more CPU than what is
really warranted, and one thing is the round_jiffies_up() that we do
twice for each request. That accounts for about 0.8% of the CPU in
my testing.
We can make this cheaper by avoiding an integer division, by just adding
a rough HZ mask that we can AND with instead. The timeouts are only on a
second granularity already, we don't have to be that accurate here and
this patch barely changes that. All we care about is nice grouping.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We've already validated the 'q->elevator' before calling
->ops.completed_request() in blk_mq_sched_completed_request(), thus no
need to validate rq->internal_tag again. Rmove it.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove unnecessary local variable 'ret' in blk_mq_dispatch_hctx_list().
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We never set any congested bits in the group writeback instances of it.
And for the simpler bdi-wide case a simple scalar field is all that
that is needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
md is the last driver using the legacy media_changed method. Switch
it over to (not so) new ->clear_events approach, which also removes the
need for the ->revalidate_disk method.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: remove unused 'bdops' variable in disk_clear_events()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move .nr_active update and request assignment into blk_mq_get_driver_tag(),
all are good to do during getting driver tag.
Meantime blk-flush related code is simplified and flush request needn't
to update the request table manually any more.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>