Commit Graph

1122007 Commits

Author SHA1 Message Date
Song Liu
74173ff458 Merge branch 'md-next-raid10-optimize' into md-next
This patchset tries to avoid that two locks are held unconditionally
in hot path.

Test environment:

Architecture:
aarch64 Huawei KUNPENG 920
x86 Intel(R) Xeon(R) Platinum 8380

Raid10 initialize:
mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

Test cmd:
(task set -c 0-15) fio -name=0 -ioengine=libaio -direct=1 -\
    group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers \
    -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 \
    -rw=randread

Test result:

aarch64:
before this patchset:           3.2 GiB/s
bind node before this patchset: 6.9 Gib/s
after this patchset:            7.9 Gib/s
bind node after this patchset:  8.0 Gib/s

x86:(bind node is not tested yet)
before this patchset: 7.0 GiB/s
after this patchset : 9.3 GiB/s

Please noted that in the test machine, memory access latency is very bad
across nodes compare to local node in aarch64, which is why bandwidth
while bind node is much better.
2022-09-22 00:05:05 -07:00
Yu Kuai
b9b083f904 md/raid10: convert resync_lock to use seqlock
Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier',
and io can't be dispatched until 'barrier' is dropped.

Since holding the 'barrier' is not common, convert 'resync_lock' to use
seqlock so that holding lock can be avoided in fast path.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-and-Tested-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:05 -07:00
Yu Kuai
4f350284a7 md/raid10: fix improper BUG_ON() in raise_barrier()
'conf->barrier' is protected by 'conf->resync_lock', reading
'conf->barrier' without holding the lock is wrong.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:05 -07:00
Yu Kuai
0c0be98bbe md/raid10: prevent unnecessary calls to wake_up() in fast path
Currently, wake_up() is called unconditionally in fast path such as
raid10_make_request(), which will cause lock contention under high
concurrency:

raid10_make_request
 wake_up
  __wake_up_common_lock
   spin_lock_irqsave

Improve performance by only call wake_up() if waitqueue is not empty
in allow_barrier() and raid10_make_request().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:05 -07:00
Yu Kuai
0de57e541b md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowait
For the case nowait in wait_barrier(), there is no point to increase
nr_waiting and then decrease it.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:05 -07:00
Yu Kuai
ed2e063f92 md/raid10: factor out code from wait_barrier() to stop_waiting_barrier()
Currently the nasty condition in wait_barrier() is hard to read. This
patch factors out the condition into a function.

There are no functional changes.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:05 -07:00
Logan Gunthorpe
3bfc3bcd78 md: Remove extra mddev_get() in md_seq_start()
A regression is seen where mddev devices stay permanently after they
are stopped due to an elevated reference count.

This was tracked down to an extra mddev_get() in md_seq_start().

It only happened rarely because most of the time the md_seq_start()
is called with a zero offset. The path with an extra mddev_get() only
happens when it starts with a non-zero offset.

The commit noted below changed an mddev_get() to check its success
but inadvertently left the original call in. Remove the extra call.

Fixes: 12a6caf273 ("md: only delete entries from all_mddevs when the disk is freed")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:04 -07:00
David Sloan
c66a6f41e0 md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk()
When running chunk-sized reads on disks with badblocks duplicate bio
free/puts are observed:

   =============================================================================
   BUG bio-200 (Not tainted): Object already free
   -----------------------------------------------------------------------------
   Allocated in mempool_alloc_slab+0x17/0x20 age=3 cpu=2 pid=7504
    __slab_alloc.constprop.0+0x5a/0xb0
    kmem_cache_alloc+0x31e/0x330
    mempool_alloc_slab+0x17/0x20
    mempool_alloc+0x100/0x2b0
    bio_alloc_bioset+0x181/0x460
    do_mpage_readpage+0x776/0xd00
    mpage_readahead+0x166/0x320
    blkdev_readahead+0x15/0x20
    read_pages+0x13f/0x5f0
    page_cache_ra_unbounded+0x18d/0x220
    force_page_cache_ra+0x181/0x1c0
    page_cache_sync_ra+0x65/0xb0
    filemap_get_pages+0x1df/0xaf0
    filemap_read+0x1e1/0x700
    blkdev_read_iter+0x1e5/0x330
    vfs_read+0x42a/0x570
   Freed in mempool_free_slab+0x17/0x20 age=3 cpu=2 pid=7504
    kmem_cache_free+0x46d/0x490
    mempool_free_slab+0x17/0x20
    mempool_free+0x66/0x190
    bio_free+0x78/0x90
    bio_put+0x100/0x1a0
    raid5_make_request+0x2259/0x2450
    md_handle_request+0x402/0x600
    md_submit_bio+0xd9/0x120
    __submit_bio+0x11f/0x1b0
    submit_bio_noacct_nocheck+0x204/0x480
    submit_bio_noacct+0x32e/0xc70
    submit_bio+0x98/0x1a0
    mpage_readahead+0x250/0x320
    blkdev_readahead+0x15/0x20
    read_pages+0x13f/0x5f0
    page_cache_ra_unbounded+0x18d/0x220
   Slab 0xffffea000481b600 objects=21 used=0 fp=0xffff8881206d8940 flags=0x17ffffc0010201(locked|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
   CPU: 0 PID: 34525 Comm: kworker/u24:2 Not tainted 6.0.0-rc2-localyes-265166-gf11c5343fa3f #143
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
   Workqueue: raid5wq raid5_do_work
   Call Trace:
    <TASK>
    dump_stack_lvl+0x5a/0x78
    dump_stack+0x10/0x16
    print_trailer+0x158/0x165
    object_err+0x35/0x50
    free_debug_processing.cold+0xb7/0xbe
    __slab_free+0x1ae/0x330
    kmem_cache_free+0x46d/0x490
    mempool_free_slab+0x17/0x20
    mempool_free+0x66/0x190
    bio_free+0x78/0x90
    bio_put+0x100/0x1a0
    mpage_end_io+0x36/0x150
    bio_endio+0x2fd/0x360
    md_end_io_acct+0x7e/0x90
    bio_endio+0x2fd/0x360
    handle_failed_stripe+0x960/0xb80
    handle_stripe+0x1348/0x3760
    handle_active_stripes.constprop.0+0x72a/0xaf0
    raid5_do_work+0x177/0x330
    process_one_work+0x616/0xb20
    worker_thread+0x2bd/0x6f0
    kthread+0x179/0x1b0
    ret_from_fork+0x22/0x30
    </TASK>

The double free is caused by an unnecessary bio_put() in the
if(is_badblock(...)) error path in raid5_read_one_chunk().

The error path was moved ahead of bio_alloc_clone() in c82aa1b767
("md/raid5: move checking badblock before clone bio in
raid5_read_one_chunk"). The previous code checked and freed align_bio
which required a bio_put. After the move that is no longer needed as
raid_bio is returned to the control of the common io path which
performs its own endio resulting in a double free on bad device blocks.

Fixes: c82aa1b767 ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk")
Signed-off-by: David Sloan <david.sloan@eideticom.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <Guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:04 -07:00
Logan Gunthorpe
e2eed85bc7 md/raid5: Ensure stripe_fill happens on non-read IO with journal
When doing degrade/recover tests using the journal a kernel BUG
is hit at drivers/md/raid5.c:4381 in handle_parity_checks5():

  BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));

This was found to occur because handle_stripe_fill() was skipped
for stripes in the journal due to a condition in that function.
Thus blocks were not fetched and R5_UPTODATE was not set when
the code reached handle_parity_checks5().

To fix this, don't skip handle_stripe_fill() unless the stripe is
for read.

Fixes: 07e8336484 ("md/r5cache: shift complex rmw from read path to write path")
Link: https://lore.kernel.org/linux-raid/e05c4239-41a9-d2f7-3cfa-4aa9d2cea8c1@deltatee.com/
Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:04 -07:00
Logan Gunthorpe
f9287c3e93 md/raid5: Don't read ->active_stripes if it's not needed
The atomic_read() is not needed in many cases so only do
the read after the first checks are done.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:04 -07:00
Logan Gunthorpe
2f2d51efd8 md/raid5: Cleanup prototype of raid5_get_active_stripe()
Drop the three bools in the prototype of raid5_get_active_stripe()
and replace them with a flags parameter.

At the same time, drop the distinction with __raid5_get_active_stripe().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:04 -07:00
Logan Gunthorpe
9892fa993f md/raid5: Drop extern on function declarations in raid5.h
externs should not be used in function declarations, so clean those
up.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:04 -07:00
Logan Gunthorpe
b6d56144fe md/raid5: Refactor raid5_get_active_stripe()
Refactor raid5_get_active_stripe() without the gotos with an
explicit infinite loop and some additional nesting.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:03 -07:00
Saurabh Sengar
1727fd5015 md: Replace snprintf with scnprintf
Current code produces a warning as shown below when total characters
in the constituent block device names plus the slashes exceeds 200.
snprintf() returns the number of characters generated from the given
input, which could cause the expression “200 – len” to wrap around
to a large positive number. Fix this by using scnprintf() instead,
which returns the actual number of characters written into the buffer.

[ 1513.267938] ------------[ cut here ]------------
[ 1513.267943] WARNING: CPU: 15 PID: 37247 at <snip>/lib/vsprintf.c:2509 vsnprintf+0x2c8/0x510
[ 1513.267944] Modules linked in:  <snip>
[ 1513.267969] CPU: 15 PID: 37247 Comm: mdadm Not tainted 5.4.0-1085-azure #90~18.04.1-Ubuntu
[ 1513.267969] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022
[ 1513.267971] RIP: 0010:vsnprintf+0x2c8/0x510
<-snip->
[ 1513.267982] Call Trace:
[ 1513.267986]  snprintf+0x45/0x70
[ 1513.267990]  ? disk_name+0x71/0xa0
[ 1513.267993]  dump_zones+0x114/0x240 [raid0]
[ 1513.267996]  ? _cond_resched+0x19/0x40
[ 1513.267998]  raid0_run+0x19e/0x270 [raid0]
[ 1513.268000]  md_run+0x5e0/0xc50
[ 1513.268003]  ? security_capable+0x3f/0x60
[ 1513.268005]  do_md_run+0x19/0x110
[ 1513.268006]  md_ioctl+0x195e/0x1f90
[ 1513.268007]  blkdev_ioctl+0x91f/0x9f0
[ 1513.268010]  block_ioctl+0x3d/0x50
[ 1513.268012]  do_vfs_ioctl+0xa9/0x640
[ 1513.268014]  ? __fput+0x162/0x260
[ 1513.268016]  ksys_ioctl+0x75/0x80
[ 1513.268017]  __x64_sys_ioctl+0x1a/0x20
[ 1513.268019]  do_syscall_64+0x5e/0x200
[ 1513.268021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 766038846e ("md/raid0: replace printk() with pr_*()")
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:03 -07:00
Guoqing Jiang
62bca04bb7 md/raid10: fix compile warning
With W=1, compiler complains.

drivers/md/raid10.c:1983: warning: bad line:

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:03 -07:00
XU pengfei
12ba6676b9 md/raid5: Fix spelling mistakes in comments
Fix spelling of 'waitting' in comments.

Signed-off-by: XU pengfei <xupengfei@nfschina.com>
Signed-off-by: Song Liu <song@kernel.org>
2022-09-22 00:05:03 -07:00
Li Jinlin
9713a67067 block/blk-rq-qos: delete useless enmu RQ_QOS_IOPRIO
Since blk-ioprio handing was converted from a rqos policy to a direct call,
RQ_QOS_IOPRIO is not used anymore, just delete it.

Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220916023241.32926-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 19:50:53 -06:00
Liu Shixin
8ef859995d block: aoe: use DEFINE_SHOW_ATTRIBUTE to simplify aoe_debugfs
Use DEFINE_SHOW_ATTRIBUTE helper macro to simplify the code.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Link: https://lore.kernel.org/r/20220915023424.3198940-1-liushixin2@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 19:49:24 -06:00
Wolfram Sang
e55e1b4831 block: move from strlcpy with unused retval to strscpy
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.

Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Acked-by: Geoff Levand <geoff@infradead.org>
Link: https://lore.kernel.org/r/20220818205958.6552-1-wsa+renesas@sang-engineering.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 19:45:04 -06:00
Gaosheng Cui
7f0c1afeac block/drbd: remove useless comments in receive_DataReply()
All implementations of req->collision, _req_may_be_done and
drbd_fail_pending_reads have been removed, so remove the comments
in receive_DataReply() that provide no useful information.

Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20220920015216.782190-3-cuigaosheng1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 19:44:05 -06:00
Gaosheng Cui
a437df5f8b drbd: remove orphan _req_may_be_done() declaration
The _req_may_be_done() has been removed by
commit 6870ca6d46 ("drbd: factor out master_bio completion
and drbd_request destruction paths"), so remove the orphan
declaration.

Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20220920015216.782190-2-cuigaosheng1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 19:44:05 -06:00
Yu Kuai
8c5035dfbb blk-wbt: call rq_qos_add() after wb_normal is initialized
Our test found a problem that wbt inflight counter is negative, which
will cause io hang(noted that this problem doesn't exist in mainline):

t1: device create	t2: issue io
add_disk
 blk_register_queue
  wbt_enable_default
   wbt_init
    rq_qos_add
    // wb_normal is still 0
			/*
			 * in mainline, disk can't be opened before
			 * bdev_add(), however, in old kernels, disk
			 * can be opened before blk_register_queue().
			 */
			blkdev_issue_flush
                        // disk size is 0, however, it's not checked
                         submit_bio_wait
                          submit_bio
                           blk_mq_submit_bio
                            rq_qos_throttle
                             wbt_wait
			      bio_to_wbt_flags
                               rwb_enabled
			       // wb_normal is 0, inflight is not increased

    wbt_queue_depth_changed(&rwb->rqos);
     wbt_update_limits
     // wb_normal is initialized
                            rq_qos_track
                             wbt_track
                              rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
			      // wb_normal is not 0,wbt_flags will be set
t3: io completion
blk_mq_free_request
 rq_qos_done
  wbt_done
   wbt_is_tracked
   // return true
   __wbt_done
    wbt_rqw_done
     atomic_dec_return(&rqw->inflight);
     // inflight is decreased

commit 8235b5c1e8 ("block: call bdev_add later in device_add_disk") can
avoid this problem, however it's better to fix this problem in wbt:

1) Lower kernel can't backport this patch due to lots of refactor.
2) Root cause is that wbt call rq_qos_add() before wb_normal is
initialized.

Fixes: e34cbd3074 ("blk-wbt: add general throttling mechanism")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220913105749.3086243-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:36:13 -06:00
Christoph Hellwig
f7de4886fe rnbd-srv: remove struct rnbd_dev
Given that rnbd_srv_sess_dev already has an open_flags member, there
is no need for the rnbd_dev indirection as a simple block_device pointer
works just as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20220909131509.3263924-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:35:23 -06:00
Christoph Hellwig
6856b194b2 rnbd-srv: remove rnbd_dev_{open,close}
These can be trivially open coded in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20220909131509.3263924-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:35:23 -06:00
Christoph Hellwig
2ecaa58104 rnbd-srv: remove rnbd_endio
Fold rnbd_endio into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20220909131509.3263924-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:35:23 -06:00
Christoph Hellwig
9ad1532060 rnbd-srv: simplify rnbd_srv_fill_msg_open_rsp
Remove all the wrappers and just get the information directly from
the block device, or where no such helpers exist the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20220909131509.3263924-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:35:23 -06:00
Bart Van Assche
b2bed51a52 block: Fix the enum blk_eh_timer_return documentation
The documentation of the blk_eh_timer_return enumeration values does not
reflect correctly how e.g. the SCSI core uses these values. Fix the
documentation.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 88b0cfad28 ("block: document the blk_eh_timer_return values")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220920200626.3422296-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:34:37 -06:00
Stefan Haberland
32ff8ce08b s390/dasd: add device ping attribute
Add a function to check if a device is accessible.
This makes mostly sense for copy pair secondary devices but it will work
for all devices.

The sysfs attribute ping is a write only attribute and will issue a NOP
CCW to the device.
In case of success it will return zero. If the device is not accessible
it will return an error code.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220920192616.808070-8-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:32:51 -06:00
Stefan Haberland
1fca631a11 s390/dasd: suppress generic error messages for PPRC secondary devices
Suppress generic command reject messages and dump of sense data for
Peer-To-Peer-Remote-Copy (PPRC) secondary errors.
If IO is issued on a PPRC secondary device, a specific
error message is printed instead.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220920192616.808070-7-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:32:51 -06:00
Stefan Haberland
112ff512fc s390/dasd: add ioctl to perform a swap of the drivers copy pair
The newly defined ioctl BIODASDCOPYPAIRSWAP takes a structure that
specifies a copy pair that should be swapped. It will call the device
discipline function to perform the swap operation.

The structure looks as followed:

struct dasd_copypair_swap_data_t {
       char primary[20];
       char secondary[20];
       __u8 reserved[64];
};

where primary is the old primary device that will be replaced by the
secondary device. The old primary will become a secondary device
afterwards.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220920192616.808070-6-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:32:51 -06:00
Stefan Haberland
413862caad s390/dasd: add copy pair swap capability
In case of errors or misbehaviour of the primary device a controlled
failover to one of the configured secondary devices needs to be
performed.

The swap processing stops I/O on the primary device, all requests are
re-queued to the blocklayer queue, the entries in the copy relation are
swapped and finally the link to the blockdevice is moved from primary to
secondary dasd device.
After this, the secondary becomes the new primary device and I/O is
restarted on that device.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220920192616.808070-5-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:32:51 -06:00
Stefan Haberland
a91ff09d39 s390/dasd: add copy pair setup
A copy relation that is configured on the storage server side needs to be
enabled separately in the device driver. A sysfs interface is created
that allows userspace tooling to control such setup.

The following sysfs entries are added to store and read copy relation
information:

copy_pair
    - Add/Delete a copy pair relation to the DASD device driver
    - Query all previously added copy pair relations
copy_role
    - Query the copy pair role of the device

To add a copy pair to the DASD device driver it has to be specified
through the sysfs attribute copy_pair. Only one secondary device can be
specified at a time together with the primary device. Both, secondary
and primary can be used equally to define the copy pair.
The secondary devices have to be offline when adding the copy relation.
The primary device needs to be specified first followed by the comma
separated secondary device.
Read from the copy_pair attribute to get the current setup and write
"clear" to the attribute to delete any existing setup.

Example:
$ echo 0.0.9700,0.0.9740 > /sys/bus/ccw/devices/0.0.9700/copy_pair
$ cat /sys/bus/ccw/devices/0.0.9700/copy_pair
0.0.9700,0.0.9740

During device online processing the required data will be read from the
storage server and the information will be compared to the setup
requested through the copy_pair attribute. The registration of the
primary and secondary device will be handled accordingly.
A blockdevice is only allocated for copy relation primary devices.

To query the copy role of a device read from the copy_role sysfs
attribute. Possible values are primary, secondary, and none.

Example:
$ cat /sys/bus/ccw/devices/0.0.9700/copy_role
primary

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220920192616.808070-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:32:51 -06:00
Stefan Haberland
3f217cceb6 s390/dasd: add query PPRC function
Add function to query the Peer-to-Peer-Remote-Copy (PPRC) state of a
device by reading the related structure through a read subsystem data call.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220920192616.808070-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:32:50 -06:00
Stefan Haberland
2b43bf061b s390/dasd: put block allocation in separate function
Put block allocation into a separate function to put some copy pair logic
in it in a later patch.

Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220920192616.808070-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21 08:32:50 -06:00
Li zeming
a7609c68f7 blk-iocost: Remove unnecessary (void*) conversions
The key pointer is void and hence does not need an explicit cast.

Signed-off-by: Li zeming <zeming@nfschina.com>
Link: https://lore.kernel.org/r/20220919012825.2936-1-zeming@nfschina.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:28:30 -06:00
Christoph Hellwig
118f3663fb block: remove PSI accounting from the bio layer
PSI accounting is now done by the VM code, where it should have been
since the beginning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220915094200.139713-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:24:38 -06:00
Christoph Hellwig
99486c511f erofs: add manual PSI accounting for the compressed address space
erofs uses an additional address space for compressed data read from disk
in addition to the one directly associated with the inode.  Reading into
the lower address space is open coded using add_to_page_cache_lru instead
of using the filemap.c helper for page allocation micro-optimizations,
which means it is not covered by the MM PSI annotations for ->read_folio
and ->readahead, so add manual ones instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220915094200.139713-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:24:38 -06:00
Christoph Hellwig
4088a47e78 btrfs: add manual PSI accounting for compressed reads
btrfs compressed reads try to always read the entire compressed chunk,
even if only a subset is requested.  Currently this is covered by the
magic PSI accounting underneath submit_bio, but that is about to go
away. Instead add manual psi_memstall_{enter,leave} annotations.

Note that for readahead this really should be using readahead_expand,
but the additionals reads are also done for plain ->read_folio where
readahead_expand can't work, so this overall logic is left as-is for
now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220915094200.139713-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:24:38 -06:00
Christoph Hellwig
527eb453bb sched/psi: export psi_memstall_{enter,leave}
To properly account for all refaults from file system logic, file systems
need to call psi_memstall_enter directly, so export it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220915094200.139713-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:24:38 -06:00
Christoph Hellwig
176042404e mm: add PSI accounting around ->read_folio and ->readahead calls
PSI tries to account for the cost of bringing back in pages discarded by
the MM LRU management.  Currently the prime place for that is hooked into
the bio submission path, which is a rather bad place:

 - it does not actually account I/O for non-block file systems, of which
   we have many
 - it adds overhead and a layering violation to the block layer

Add the accounting into the two places in the core MM code that read
pages into an address space by calling into ->read_folio and ->readahead
so that the entire file system operations are covered, to broaden
the coverage and allow removing the accounting in the block layer going
forward.

As psi_memstall_enter can deal with nested calls this will not lead to
double accounting even while the bio annotations are still present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220915094200.139713-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:24:38 -06:00
Ping-Xiang Chen
e88480871b block: fix comment typo in submit_bio of block-core.c.
This patch fix a comment typo in block-core.c.

Signed-off-by: Ping-Xiang Chen <p.x.chen@uci.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220914074237.31621-1-p.x.chen@uci.edu
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-20 08:22:47 -06:00
Jens Axboe
77571ba60e nvme updates for Linux 6.1
- handle number of queue changes in the TCP and RDMA drivers
    (Daniel Wagner)
  - allow changing the number of queues in nvmet (Daniel Wagner)
  - also consider host_iface when checking ip options (Daniel Wagner)
  - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco)
  - avoid unnecessary flush bios in nvmet (Guixin Liu)
  - shrink and better pack the nvme_iod structure (Keith Busch)
  - add comment for unaligned "fake" nqn (Linjun Bao)
  - print actual source IP address through sysfs "address" attr
    (Martin Belanger)
  - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmMpcyELHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOX4w//aVg+caMozh2FsHBcoZlPJNDl/lHq3gfD1dHSRGKQ
 F8dHMfPbVCBGZPLkQrXZq+B+vQ5N8uTLCnjgzJn7MGD2mkpXMjqnZ1OaX0v+J5/B
 fb/R4T6+wjObLQUE2fkV+/W+3wy1GGxbWKqzqn9SjSvmKghtq7923q9qZeDZaqzR
 YkBlYZQ1ViHOSo1P+wat25lJ7ATaDDWaOTAfA0HkDuZCJI/2GbUY5JL+QVhINsIb
 yIj6l9mcGcnia41bft11GIDv3zoMPRxJhT4EVf63tgRqLz58TwuKreL42hKigJWP
 uti3ZfoW83kUVmAAznHywPAQtgX12xj8njRv/dWsBnUDDq9aFGmTR2iDzkAZ5u6m
 uzS7ZNU9Ms50XHkGfj9Wp3j5VC3kONUMjrkG/GL/LPFzdxtiSBKkwFeH94i702qt
 wYFnv97oCxhj1UJt0+X/lkTSqE8SJee29Y4BybBPKZV0DGw+sysweqUsCIwBAuqp
 a2/fe+UAs7m4dKeUdjB7OsIg58fsP/hn26duabTjVgUPhT6tXFKyMvN+DZy+dfB+
 rjAPULxBxSGboGibKYmOvplEUbLVrRfWIM/Ub5qE4YGjgudixYXYLMCcDC0jVkej
 BUfu4kDrGZpZgkH3PqHqtIIKmgHYOJ7bWT+cZNzQBUhpoKBt0ruDvf/NT/DOtjPK
 EWU=
 =WnST
 -----END PGP SIGNATURE-----

Merge tag 'nvme-6.1-2022-09-20' of git://git.infradead.org/nvme into for-6.1/block

Pull NVMe updates from Christoph:

"nvme updates for Linux 6.1

 - handle number of queue changes in the TCP and RDMA drivers
   (Daniel Wagner)
 - allow changing the number of queues in nvmet (Daniel Wagner)
 - also consider host_iface when checking ip options (Daniel Wagner)
 - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco)
 - avoid unnecessary flush bios in nvmet (Guixin Liu)
 - shrink and better pack the nvme_iod structure (Keith Busch)
 - add comment for unaligned "fake" nqn (Linjun Bao)
 - print actual source IP address through sysfs "address" attr
   (Martin Belanger)
 - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang)"

* tag 'nvme-6.1-2022-09-20' of git://git.infradead.org/nvme:
  nvme-tcp: print actual source IP address through sysfs "address" attr
  nvmet-tcp: don't map pages which can't come from HIGHMEM
  nvme-pci: move iod dma_len fill gaps
  nvme-pci: iod npages fits in s8
  nvme-pci: iod's 'aborted' is a bool
  nvme-pci: remove nvme_queue from nvme_iod
  nvme: consider also host_iface when checking ip options
  nvme-rdma: handle number of queue changes
  nvme-tcp: handle number of queue changes
  nvmet: expose max queues to configfs
  nvmet: avoid unnecessary flush bio
  nvmet-auth: remove redundant parameters req
  nvmet-auth: clean up with done_kfree
  nvme-auth: remove the redundant req->cqe->result.u16 assignment operation
  nvme: move from strlcpy with unused retval to strscpy
  nvme: add comment for unaligned "fake" nqn
2022-09-20 07:50:14 -06:00
Coly Li
d2d05b8803 bcache: fix set_at_max_writeback_rate() for multiple attached devices
Inside set_at_max_writeback_rate() the calculation in following if()
check is wrong,
	if (atomic_inc_return(&c->idle_counter) <
	    atomic_read(&c->attached_dev_nr) * 6)

Because each attached backing device has its own writeback thread
running and increasing c->idle_counter, the counter increates much
faster than expected. The correct calculation should be,
	(counter / dev_nr) < dev_nr * 6
which equals to,
	counter < dev_nr * dev_nr * 6

This patch fixes the above mistake with correct calculation, and helper
routine idle_counter_exceeded() is added to make code be more clear.

Reported-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Acked-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Link: https://lore.kernel.org/r/20220919161647.81238-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-19 11:12:35 -06:00
Jilin Yuan
6dd3be6923 bcache:: fix repeated words in comments
Delete the redundant word 'we'.

Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220919161647.81238-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-19 11:12:35 -06:00
Jules Maselbas
11e529ccea bcache: bset: Fix comment typos
Remove the redundant word `by`, correct the typo `creaated`.

CC: Kent Overstreet <kent.overstreet@gmail.com>
CC: linux-bcache@vger.kernel.org
Signed-off-by: Jules Maselbas <jmaselbas@kalray.eu>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220919161647.81238-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-19 11:12:35 -06:00
Lin Feng
d86b4e6dc8 bcache: remove unused bch_mark_cache_readahead function def in stats.h
This is a cleanup for commit 1616a4c2ab ("bcache: remove bcache device
self-defined readahead")', currently no user for
bch_mark_cache_readahead() since that commit.

Signed-off-by: Lin Feng <linf@wangsu.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220919161647.81238-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-19 11:12:35 -06:00
Li Lei
97d26ae764 bcache: remove unnecessary flush_workqueue
All pending works will be drained by destroy_workqueue(), no need to call
flush_workqueue() explicitly.

Signed-off-by: Li Lei <lilei@szsandstone.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220919161647.81238-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-19 11:12:35 -06:00
Martin Belanger
02c57a82c0 nvme-tcp: print actual source IP address through sysfs "address" attr
TCP transport relies on the routing table to determine which source
address and interface to use when making a connection. Currently, there
is no way to tell from userspace where a connection was made. This
patch exposes the actual source address using a new field named
"src_addr=" in the "address" attribute.

This is needed to diagnose and identify connectivity issues. With the
source address we can infer the interface associated with each
connection.

This was tested with nvme-cli 2.0 to verify it does not have any
adverse effect. The new "src_addr=" field will simply be displayed in
the output of the "list-subsys" or "list -v" commands as shown here.

$ nvme list-subsys
nvme-subsys0 - NQN=nqn.2014-08.org.nvmexpress.discovery
\
 +- nvme0 tcp traddr=192.168.56.1,trsvcid=8009,src_addr=192.168.56.101 live

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:28 +02:00
Fabio M. De Francesco
5bfaba275a nvmet-tcp: don't map pages which can't come from HIGHMEM
kmap() is being deprecated in favor of kmap_local_page().[1]

There are two main problems with kmap(): (1) It comes with an overhead as
mapping space is restricted and protected by a global lock for
synchronization and (2) it also requires global TLB invalidation when the
kmap’s pool wraps and it might block when the mapping space is fully
utilized until a slot becomes available.

The pages which will be mapped are allocated in nvmet_tcp_map_data(),
using the GFP_KERNEL flag. This assures that they cannot come from
HIGHMEM. This imply that a straight page_address() can replace the kmap()
of sg_page(sg) in nvmet_tcp_map_pdu_iovec(). As a side effect, we might
also delete the field "nr_mapped" from struct "nvmet_tcp_cmd" because,
after removing the kmap() calls, there would be no longer any need of it.

In addition, there is no reason to use a kvec for the command receive
data buffers iovec, use a bio_vec instead and let iov_iter handle the
buffer mapping and data copy.

Test with blktests on a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel with
HIGHMEM64GB enabled.

[1] "[PATCH] checkpatch: Add kmap and kmap_atomic to the deprecated
list" https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com/

Cc: Chaitanya Kulkarni <chaitanyak@nvidia.com>
Cc: Keith Busch <kbusch@kernel.org>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
[sagi: added bio_vec plus minor naming changes]
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00
Keith Busch
c4c22c5208 nvme-pci: move iod dma_len fill gaps
The 32-bit field, dma_len, packs better in the iod struct above the
dma_addr_t on 64-bit systems.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-09-19 17:55:25 +02:00