commit c8e846ff93d5eaa5384f6f325a1687ac5921aade upstream.
dm-era doesn't support changing the data block size of existing devices,
so check explicitly that the requested block size for a new target
matches the one stored in the metadata.
Fixes: eec40579d84873 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Reviewed-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit de89afc1e40fdfa5f8b666e5d07c43d21a1d3be0 upstream.
Following a system crash, dm-era fails to recover the committed writeset
for the current era, leading to lost writes. That is, we lose the
information about what blocks were written during the affected era.
dm-era assumes that the writeset of the current era is archived when the
device is suspended. So, when resuming the device, it just moves on to
the next era, ignoring the committed writeset.
This assumption holds when the device is properly shut down. But, when
the system crashes, the code that suspends the target never runs, so the
writeset for the current era is not archived.
There are three issues that cause the committed writeset to get lost:
1. dm-era doesn't load the committed writeset when opening the metadata
2. The code that resizes the metadata wipes the information about the
committed writeset (assuming it was loaded at step 1)
3. era_preresume() starts a new era, without taking into account that
the current era might not have been archived, due to a system crash.
To fix this:
1. Load the committed writeset when opening the metadata
2. Fix the code that resizes the metadata to make sure it doesn't wipe
the loaded writeset
3. Fix era_preresume() to check for a loaded writeset and archive it,
before starting a new era.
Fixes: eec40579d84873 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4134455f2aafdfeab50cabb4cccb35e916034b93 upstream.
Do not attempt to write any data beyond the end of the underlying data
device while shrinking it.
The DM writecache device must be suspended when the underlying data
device is shrunk.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a666e5c05e7c4aaabb2c5d58117b0946803d03d2 upstream.
The system would deadlock when swapping to a dm-crypt device. The reason
is that for each incoming write bio, dm-crypt allocates memory that holds
encrypted data. These excessive allocations exhaust all the memory and the
result is either deadlock or OOM trigger.
This patch limits the number of in-flight swap bios, so that the memory
consumed by dm-crypt is limited. The limit is enforced if the target set
the "limit_swap_bios" variable and if the bio has REQ_SWAP set.
Non-swap bios are not affected becuase taking the semaphore would cause
performance degradation.
This is similar to request-based drivers - they will also block when the
number of requests is over the limit.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit afe78ab46f638ecdf80a35b122ffc92c20d9ae5d upstream.
This is potentially long running and not latency sensitive, let's get
it out of the way of other latency sensitive events.
As observed in the previous commit, the `system_wq` comes easily
congested by bcache, and this fixes a few more stalls I was observing
every once in a while.
Let's not make this `WQ_MEM_RECLAIM` as it showed to reduce performance
of boot and file system operations in my tests. Also, without
`WQ_MEM_RECLAIM`, I no longer see desktop stalls. This matches the
previous behavior as `system_wq` also does no memory reclaim:
> // workqueue.c:
> system_wq = alloc_workqueue("events", 0, 0);
Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d797bd9897e3559eb48d68368550d637d32e468c upstream.
Before killing `btree_io_wq`, the queue was allocated using
`create_singlethread_workqueue()` which has `WQ_MEM_RECLAIM`. After
killing it, it no longer had this property but `system_wq` is not
single threaded.
Let's combine both worlds and make it multi threaded but able to
reclaim memory.
Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9f233ffe02e5cef611100cd8c5bcf4de26ca7bef upstream.
This reverts commit 56b30770b27d54d68ad51eccc6d888282b568cee.
With the btree using the `system_wq`, I seem to see a lot more desktop
latency than I should.
After some more investigation, it looks like the original assumption
of 56b3077 no longer is true, and bcache has a very high potential of
congesting the `system_wq`. In turn, this introduces laggy desktop
performance, IO stalls (at least with btrfs), and input events may be
delayed.
So let's revert this. It's important to note that the semantics of
using `system_wq` previously mean that `btree_io_wq` should be created
before and destroyed after other bcache wqs to keep the same
assumptions.
Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dc5d17a3c39b06aef866afca19245a9cfb533a79 upstream.
One customer reports a crash problem which causes by flush request. It
triggers a warning before crash.
/* new request after previous flush is completed */
if (ktime_after(req_start, mddev->prev_flush_start)) {
WARN_ON(mddev->flush_bio);
mddev->flush_bio = bio;
bio = NULL;
}
The WARN_ON is triggered. We use spin lock to protect prev_flush_start and
flush_bio in md_flush_request. But there is no lock protection in
md_submit_flush_data. It can set flush_bio to NULL first because of
compiler reordering write instructions.
For example, flush bio1 sets flush bio to NULL first in
md_submit_flush_data. An interrupt or vmware causing an extended stall
happen between updating flush_bio and prev_flush_start. Because flush_bio
is NULL, flush bio2 can get the lock and submit to underlayer disks. Then
flush bio1 updates prev_flush_start after the interrupt or extended stall.
Then flush bio3 enters in md_flush_request. The start time req_start is
behind prev_flush_start. The flush_bio is not NULL(flush bio2 hasn't
finished). So it can trigger the WARN_ON now. Then it calls INIT_WORK
again. INIT_WORK() will re-initialize the list pointers in the
work_struct, which then can result in a corrupted work list and the
work_struct queued a second time. With the work list corrupted, it can
lead in invalid work items being used and cause a crash in
process_one_work.
We need to make sure only one flush bio can be handled at one same time.
So add spin lock in md_submit_flush_data to protect prev_flush_start and
flush_bio in an atomic way.
Reviewed-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5c02406428d5219c367c5f53457698c58bc5f917 upstream.
Otherwise a malicious user could (ab)use the "recalculate" feature
that makes dm-integrity calculate the checksums in the background
while the device is already usable. When the system restarts before all
checksums have been calculated, the calculation continues where it was
interrupted even if the recalculate feature is not requested the next
time the dm device is set up.
Disable recalculating if we use internal_hash or journal_hash with a
key (e.g. HMAC) and we don't have the "legacy_recalculate" flag.
This may break activation of a volume, created by an older kernel,
that is not yet fully recalculated -- if this happens, the user should
add the "legacy_recalculate" flag to constructor parameters.
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Daniel Glockner <dg@emlix.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f7b347acb5f6c29d9229bb64893d8b6a2c7949fb ]
The integrity target relies on skcipher for encryption/decryption, but
certain kernel configurations may not enable CRYPTO_SKCIPHER, leading to
compilation errors due to unresolved symbols. Explicitly select
CRYPTO_SKCIPHER for DM_INTEGRITY, since it is unconditionally dependent
on it.
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 809b1e4945774c9ec5619a8f4e2189b7b3833c0c upstream.
This reverts commit
644bda6f3460 ("dm table: fall back to getting device using name_to_dev_t()")
dm_get_dev_t() is just used to convert an arbitrary 'path' string
into a dev_t. It doesn't presume that the device is present; that
check will be done later, as the only caller is dm_get_device(),
which does a dm_get_table_device() later on, which will properly
open the device.
So if the path string already _is_ in major:minor representation
we can convert it directly, avoiding a recursion into the filesystem
to lookup the block device.
This avoids a hang in multipath_message() when the filesystem is
inaccessible.
Fixes: 644bda6f3460 ("dm table: fall back to getting device using name_to_dev_t()")
Cc: stable@vger.kernel.org
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0378c625afe80eb3f212adae42cc33c9f6f31abf upstream.
There wasn't ever a real need to log an error in the kernel log for
ioctls issued with insufficient permissions. Simply return an error
and if an admin/user is sufficiently motivated they can enable DM's
dynamic debugging to see an explanation for why the ioctls were
disallowed.
Reported-by: Nir Soffer <nsoffer@redhat.com>
Fixes: e980f62353c6 ("dm: don't allow ioctls to targets that don't map to whole devices")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9b5948267adc9e689da609eb61cf7ed49cae5fa8 ]
With external metadata device, flush requests are not passed down to the
data device.
Fix this by submitting the flush request in dm_integrity_flush_buffers. In
order to not degrade performance, we overlap the data device flush with
the metadata device flush.
Reported-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 17ffc193cdc6dc7a613d00d8ad47fc1f801b9bf0 upstream.
Advance the maximum number of arguments from 9 to 15 to account for
all potential feature flags that may be supplied.
Linux 4.19 added "meta_device"
(356d9d52e1221ba0c9f10b8b38652f78a5298329) and "recalculate"
(a3fcf7253139609bf9ff901fbf955fba047e75dd) flags.
Commit 468dfca38b1a6fbdccd195d875599cb7c8875cd9 added
"sectors_per_bit" and "bitmap_flush_interval".
Commit 84597a44a9d86ac949900441cea7da0af0f2f473 added
"allow_discards".
And the commit d537858ac8aaf4311b51240893add2fc62003b97 added
"fix_padding".
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fcc42338375a1e67b8568dbb558f8b784d0f3b01 upstream.
If the origin device has a volatile write-back cache and the following
events occur:
1: After finishing merge operation of one set of exceptions,
merge_callback() is invoked.
2: Update the metadata in COW device tracking the merge completion.
This update to COW device is flushed cleanly.
3: System crashes and the origin device's cache where the recent
merge was completed has not been flushed.
During the next cycle when we read the metadata from the COW device,
we will skip reading those metadata whose merge was completed in
step (1). This will lead to data loss/corruption.
To address this, flush the origin device post merge IO before
updating the metadata.
Cc: stable@vger.kernel.org
Signed-off-by: Akilesh Kailash <akailash@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit cc07d72bf350b77faeffee1c37bc52197171473f upstream.
Block core warned that discard_granularity was 0 for dm-raid with
personality of raid1. Reason is that raid_io_hints() was incorrectly
special-casing raid1 rather than raid0.
Fix raid_io_hints() by removing discard limits settings for
raid1. Check for raid0 instead.
Fixes: 61697a6abd24a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Stephan Bärwolf <stephan@matrixstorm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 252bd1256396cebc6fc3526127fdb0b317601318 ]
If emergency system shutdown is called, like by thermal shutdown,
a dm device could be alive when the block device couldn't process
I/O requests anymore. In this state, the handling of I/O errors
by new dm I/O requests or by those already in-flight can lead to
a verity corruption state, which is a misjudgment.
So, skip verity work in response to I/O error when system is shutting
down.
Signed-off-by: Hyeongseok Kim <hyeongseok@gmail.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 93decc563637c4288380912eac0eb42fb246cc04 upstream.
In __make_request() a new r10bio is allocated and passed to
raid10_read_request(). The read_slot member of the bio is not
initialized, and the raid10_read_request() uses it to index an
array. This leads to occasional panics.
Fix by initializing the field to invalid value and checking for
valid value in raid10_read_request().
Cc: stable@vger.kernel.org
Signed-off-by: Kevin Vigor <kvigor@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bca5b0658020be90b6b504ca514fd80110204f71 upstream.
md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
During sending msg, node can concurrently receive msg from another node.
When node does resync job, grab token_lockres:EX may trigger a deadlock:
```
nodeA nodeB
-------------------- --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
b.
md_do_sync
resync_info_update
send RESYNCING
+ set MD_CLUSTER_SEND_LOCK
+ wait for holding token_lockres:EX
c.
mdadm /dev/md0 --remove /dev/sdg
+ held reconfig_mutex
+ send REMOVE
+ wait_event(MD_CLUSTER_SEND_LOCK)
d.
recv_daemon //METADATA_UPDATED from A
process_metadata_update
+ (mddev_trylock(mddev) ||
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
//this time, both return false forever
```
Explaination:
a. A send METADATA_UPDATED
This will block another node to send msg
b. B does sync jobs, which will send RESYNCING at intervals.
This will be block for holding token_lockres:EX lock.
c. B do "mdadm --remove", which will send REMOVE.
This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
d. B recv METADATA_UPDATED msg, which send from A in step <a>.
This will be blocked by step <c>: holding mddev lock, it makes
wait_event can't hold mddev lock. (btw,
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)
There is a similar deadlock in commit 0ba959774e93
("md-cluster: use sync way to handle METADATA_UPDATED msg")
In that commit, step c is "update sb". This patch step c is
"mdadm --remove".
For fixing this issue, we can refer the solution of function:
metadata_update_start. Which does the same grab lock_token action.
lock_comm can use the same steps to avoid deadlock. By moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
but it is safe & can break deadlock.
Repro steps (I only triggered 3 times with hundreds tests):
two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
--bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
test script will hung when executing "mdadm --remove".
```
# dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D 0 5329 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? _cond_resched+0x2d/0x40
? schedule+0x4a/0xb0
? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
? wait_woken+0x80/0x80
? process_recvd_msg+0x113/0x1d0 [md_cluster]
? recv_daemon+0x9e/0x120 [md_cluster]
? md_thread+0x94/0x160 [md_mod]
? wait_woken+0x80/0x80
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
mdadm D 0 5423 1 0x00004004
Call Trace:
__schedule+0x1f6/0x560
? __schedule+0x1fe/0x560
? schedule+0x4a/0xb0
? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
? wait_woken+0x80/0x80
? remove_disk+0x4f/0x90 [md_cluster]
? hot_remove_disk+0xb1/0x1b0 [md_mod]
? md_ioctl+0x50c/0xba0 [md_mod]
? wait_woken+0x80/0x80
? blkdev_ioctl+0xa2/0x2a0
? block_ioctl+0x39/0x40
? ksys_ioctl+0x82/0xc0
? __x64_sys_ioctl+0x16/0x20
? do_syscall_64+0x5f/0x150
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
md0_resync D 0 5425 2 0x80004000
Call Trace:
__schedule+0x1f6/0x560
? schedule+0x4a/0xb0
? dlm_lock_sync+0xa1/0xd0 [md_cluster]
? wait_woken+0x80/0x80
? lock_token+0x2d/0x90 [md_cluster]
? resync_info_update+0x95/0x100 [md_cluster]
? raid1_sync_request+0x7d3/0xa40 [raid1]
? md_do_sync.cold+0x737/0xc8f [md_mod]
? md_thread+0x94/0x160 [md_mod]
? md_congested+0x30/0x30 [md_mod]
? kthread+0x115/0x140
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x1f/0x40
```
At last, thanks for Xiao's solution.
Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a8da01f79c89755fad55ed0ea96e8d2103242a72 upstream.
Reshape request should be blocked with ongoing resync job. In cluster
env, a node can start resync job even if the resync cmd isn't executed
on it, e.g., user executes "mdadm --grow" on node A, sometimes node B
will start resync job. However, current update_raid_disks() only check
local recovery status, which is incomplete. As a result, we see user will
execute "mdadm --grow" successfully on local, while the remote node deny
to do reshape job when it doing resync job. The inconsistent handling
cause array enter unexpected status. If user doesn't observe this issue
and continue executing mdadm cmd, the array doesn't work at last.
Fix this issue by blocking reshape request. When node executes "--grow"
and detects ongoing resync, it should stop and report error to user.
The following script reproduces the issue with ~100% probability.
(two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
```
# on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done
mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
sleep 5
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0
mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```
Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4d7659bfbe277a43399a4a2d90fca141e70f29e1 ]
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Fixes: 2ca4c92f58f9 ("dm ioctl: prevent empty message")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e7b624183d921b49ef0a96329f21647d38865ee9 ]
The BUG_ON(in_interrupt()) in dm_table_event() is a historic leftover from
a rework of the dm table code which changed the calling context.
Issuing a BUG for a wrong calling context is frowned upon and
in_interrupt() is deprecated and only covering parts of the wrong
contexts. The sanity check for the context is covered by
CONFIG_DEBUG_ATOMIC_SLEEP and other debug facilities already.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 857c4c0a8b2888d806f4308c58f59a6a81a1dee9 upstream.
Building on arch/s390/ results in this build error:
cc1: some warnings being treated as errors
../drivers/md/dm-writecache.c: In function 'persistent_memory_claim':
../drivers/md/dm-writecache.c:323:1: error: no return statement in function returning non-void [-Werror=return-type]
Fix this by replacing the BUG() with an -EOPNOTSUPP return.
Fixes: 48debafe4f2f ("dm: add writecache target")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bde3808bc8c2741ad3d804f84720409aee0c2972 upstream.
Fixes sparse warnings:
drivers/md/dm.c:508:12: warning: context imbalance in 'dm_prepare_ioctl' - wrong count at exit
drivers/md/dm.c:543:13: warning: context imbalance in 'dm_unprepare_ioctl' - wrong count at exit
Fixes: 971888c46993f ("dm: hold DM table for duration of ioctl rather than use blkdev_get")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 89478335718c98557f10470a9bc5c555b9261c4e upstream.
The dm_get_live_table() function makes RCU read lock so
dm_put_live_table() must be called even if dm_table map is not found.
Fixes: e76239a3748c9 ("block: add a report_zones method")
Cc: stable@vger.kernel.org
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 67aa3ec3dbc43d6e34401d9b2a40040ff7bb57af upstream.
Advance the maximum number of arguments to 16.
This fixes issue where certain operations, combined with table
configured args, exceed 10 arguments.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d837f7277f56e70d82b3a4a037d744854e62f387 ]
md_bitmap_get_counter() has code:
```
if (bitmap->bp[page].hijacked ||
bitmap->bp[page].map == NULL)
csize = ((sector_t)1) << (bitmap->chunkshift +
PAGE_COUNTER_SHIFT - 1);
```
The minus 1 is wrong, this branch should report 2048 bits of space.
With "-1" action, this only report 1024 bit of space.
This bug code returns wrong blocks, but it doesn't inflence bitmap logic:
1. Most callers focus this function return value (the counter of offset),
not the parameter blocks.
2. The bug is only triggered when hijacked is true or map is NULL.
the hijacked true condition is very rare.
the "map == null" only true when array is creating or resizing.
3. Even the caller gets wrong blocks, current code makes caller just to
call md_bitmap_get_counter() one more time.
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1383b347a8ae4a69c04ae3746e6cb5c8d38e2585 ]
Callers of get_bitmap_from_slot() are responsible to free the bitmap.
Suggested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit ee1dfad5325ff1cfb2239e564cd411b3bfe8667a upstream.
dm_queue_split() is removed because __split_and_process_bio() _must_
handle splitting bios to ensure proper bio submission and completion
ordering as a bio is split.
Otherwise, multiple recursive calls to ->submit_bio will cause multiple
split bios to be allocated from the same ->bio_split mempool at the same
time. This would result in deadlock in low memory conditions because no
progress could be made (only one bio is available in ->bio_split
mempool).
This fix has been verified to still fix the loss of performance, due
to excess splitting, that commit 120c9257f5f1 provided.
Fixes: 120c9257f5f1 ("Revert "dm: always call blk_queue_split() in dm_process_bio()"")
Cc: stable@vger.kernel.org # 5.0+, requires custom backport due to 5.9 changes
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 34cf78bf34d48dddddfeeadb44f9841d7864997a ]
This patch fix a lost wake-up problem caused by the race between
mca_cannibalize_lock and bch_cannibalize_unlock.
Consider two processes, A and B. Process A is executing
mca_cannibalize_lock, while process B takes c->btree_cache_alloc_lock
and is executing bch_cannibalize_unlock. The problem happens that after
process A executes cmpxchg and will execute prepare_to_wait. In this
timeslice process B executes wake_up, but after that process A executes
prepare_to_wait and set the state to TASK_INTERRUPTIBLE. Then process A
goes to sleep but no one will wake up it. This problem may cause bcache
device to dead.
Signed-off-by: Guoju Fang <fangguoju@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6ba01df72b4b63a26b4977790f58d8f775d2992c ]
Partitioned request-based devices cannot be used as underlying devices
for request-based DM because no partition offsets are added to each
incoming request. As such, until now, stacking on partitioned devices
would _always_ result in data corruption (e.g. wiping the partition
table, writing to other partitions, etc). Fix this by disallowing
request-based stacking on partitions.
While at it, since all .request_fn support has been removed from block
core, remove legacy dm-table code that differentiated between blk-mq and
.request_fn request-based.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit e2ec5128254518cae320d5dc631b71b94160f663 upstream.
DM was calling generic_fsdax_supported() to determine whether a device
referenced in the DM table supports DAX. However this is a helper for "leaf" device drivers so that
they don't have to duplicate common generic checks. High level code
should call dax_supported() helper which that calls into appropriate
helper for the particular device. This problem manifested itself as
kernel messages:
dm-3: error: dax access failed (-95)
when lvm2-testsuite run in cases where a DM device was stacked on top of
another DM device.
Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices")
Cc: <stable@vger.kernel.org>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/160061715195.13131.5503173247632041975.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 02186d8897d49b0afd3c80b6cf23437d91024065 upstream.
A recent fix to the dm_dax_supported() flow uncovered a latent bug. When
dm_get_live_table() fails it is still required to drop the
srcu_read_lock(). Without this change the lvm2 test-suite triggers this
warning:
# lvm2-testsuite --only pvmove-abort-all.sh
WARNING: lock held when returning to user space!
5.9.0-rc5+ #251 Tainted: G OE
------------------------------------------------
lvm/1318 is leaving the kernel with locks still held!
1 lock held by lvm/1318:
#0: ffff9372abb5a340 (&md->io_barrier){....}-{0:0}, at: dm_get_live_table+0x5/0xb0 [dm_mod]
...and later on this hang signature:
INFO: task lvm:1344 blocked for more than 122 seconds.
Tainted: G OE 5.9.0-rc5+ #251
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:lvm state:D stack: 0 pid: 1344 ppid: 1 flags:0x00004000
Call Trace:
__schedule+0x45f/0xa80
? finish_task_switch+0x249/0x2c0
? wait_for_completion+0x86/0x110
schedule+0x5f/0xd0
schedule_timeout+0x212/0x2a0
? __schedule+0x467/0xa80
? wait_for_completion+0x86/0x110
wait_for_completion+0xb0/0x110
__synchronize_srcu+0xd1/0x160
? __bpf_trace_rcu_utilization+0x10/0x10
__dm_suspend+0x6d/0x210 [dm_mod]
dm_suspend+0xf6/0x140 [dm_mod]
Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple devices")
Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reported-by: Adrian Huang <ahuang12@lenovo.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Adrian Huang <ahuang12@lenovo.com>
Link: https://lore.kernel.org/r/160045867590.25663.7548541079217827340.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3a653b205f29b3f9827a01a0c88bfbcb0d169494 upstream.
The following error ocurred when testing disk online/offline:
[ 301.798344] device-mapper: thin: 253:5: aborting current metadata transaction
[ 301.848441] device-mapper: thin: 253:5: failed to abort metadata transaction
[ 301.849206] Aborting journal on device dm-26-8.
[ 301.850489] EXT4-fs error (device dm-26) in __ext4_new_inode:943: Journal has aborted
[ 301.851095] EXT4-fs (dm-26): Delayed block allocation failed for inode 398742 at logical offset 181 with max blocks 19 with error 30
[ 301.854476] BUG: KASAN: use-after-free in dm_bm_set_read_only+0x3a/0x40 [dm_persistent_data]
Reason is:
metadata_operation_failed
abort_transaction
dm_pool_abort_metadata
__create_persistent_data_objects
r = __open_or_format_metadata
if (r) --> If failed will free pmd->bm but pmd->bm not set NULL
dm_block_manager_destroy(pmd->bm);
set_pool_mode
dm_pool_metadata_read_only(pool->pmd);
dm_bm_set_read_only(pmd->bm); --> use-after-free
Add checks to see if pmd->bm is NULL in dm_bm_set_read_only and
dm_bm_set_read_write functions. If bm is NULL it means creating the
bm failed and so dm_bm_is_read_only must return true.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 219403d7e56f9b716ad80ab87db85d29547ee73e upstream.
Maybe __create_persistent_data_objects() caller will use PTR_ERR as a
pointer, it will lead to some strange things.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d16ff19e69ab57e08bf908faaacbceaf660249de upstream.
Maybe __create_persistent_data_objects() caller will use PTR_ERR as a
pointer, it will lead to some strange things.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7785a9e4c228db6d01086a52d5685cd7336a08b7 upstream.
Use the DECLARE_CRYPTO_WAIT() macro to properly initialize the crypto
wait structures declared on stack before their use with
crypto_wait_req().
Fixes: 39d13a1ac41d ("dm crypt: reuse eboiv skcipher for IV generation")
Fixes: bbb1658461ac ("dm crypt: Implement Elephant diffuser for Bitlocker compatibility")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e27fec66f0a94e35a35548bd0b29ae616e62ec62 upstream.
The dm-integrity target did not report errors in bitmap mode just after
creation. The reason is that the function integrity_recalc didn't clean up
ic->recalc_bitmap as it proceeded with recalculation.
Fix this by updating the bitmap accordingly -- the double shift serves
to rounddown.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c322ee9320eaa4013ca3620b1130992916b19b31 upstream.
Commit 935fcc56abc3 ("dm mpath: only flush workqueue when needed")
changed flush_multipath_work() to avoid needless workqueue
flushing (of a multipath global workqueue). But that change didn't
realize the surrounding flush_multipath_work() code should also only
run if 'pg_init_in_progress' is set.
Fix this by only doing all of flush_multipath_work()'s PG init related
work if 'pg_init_in_progress' is set.
Otherwise multipath_wait_for_pg_init_completion() will run
unconditionally but the preceeding flush_workqueue(kmpath_handlerd)
may not. This could lead to deadlock (though only if kmpath_handlerd
never runs a corresponding work to decrement 'pg_init_in_progress').
It could also be, though highly unlikely, that the kmpath_handlerd
work that does PG init completes before 'pg_init_in_progress' is set,
and then an intervening DM table reload's multipath_postsuspend()
triggers flush_multipath_work().
Fixes: 935fcc56abc3 ("dm mpath: only flush workqueue when needed")
Cc: stable@vger.kernel.org
Reported-by: Ben Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f9e040efcc28309e5c592f7e79085a9a52e31f58 upstream.
The function dax_direct_access doesn't take partitions into account,
it always maps pages from the beginning of the device. Therefore,
persistent_memory_claim() must get the partition offset using
get_start_sect() and add it to the page offsets passed to
dax_direct_access().
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 65f0f017e7be8c70330372df23bcb2a407ecf02d ]
For some block devices which large capacity (e.g. 8TB) but small io_opt
size (e.g. 8 sectors), in bcache_device_init() the stripes number calcu-
lated by,
DIV_ROUND_UP_ULL(sectors, d->stripe_size);
might be overflow to the unsigned int bcache_device->nr_stripes.
This patch uses the uint64_t variable to store DIV_ROUND_UP_ULL()
and after the value is checked to be available in unsigned int range,
sets it to bache_device->nr_stripes. Then the overflow is avoided.
Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e8abe1de43dac658dacbd04a4543e0c988a8d386 ]
The error handling calls md_bitmap_free(bitmap) which checks for NULL
but will Oops if we pass an error pointer. Let's set "bitmap" to NULL
on this error path.
Fixes: afd756286083 ("md-cluster/raid10: resize all the bitmaps before start reshape")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e766668c6cd49d741cfb49eaeb38998ba34d27bc ]
dm_stop_queue() only uses blk_mq_quiesce_queue() so it doesn't
formally stop the blk-mq queue; therefore there is no point making the
blk_mq_queue_stopped() check -- it will never be stopped.
In addition, even though dm_stop_queue() actually tries to quiesce hw
queues via blk_mq_quiesce_queue(), checking with blk_queue_quiesced()
to avoid unnecessary queue quiesce isn't reliable because: the
QUEUE_FLAG_QUIESCED flag is set before synchronize_rcu() and
dm_stop_queue() may be called when synchronize_rcu() from another
blk_mq_quiesce_queue() is in-progress.
Fixes: 7b17c2f7292ba ("dm: Fix a race condition related to stopping and starting queues")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7a1481267999c02abf4a624515c1b5c7c1fccbd6 upstream.
offset_to_stripe() returns the stripe number (in type unsigned int) from
an offset (in type uint64_t) by the following calculation,
do_div(offset, d->stripe_size);
For large capacity backing device (e.g. 18TB) with small stripe size
(e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The actual
returned value which caller receives is 536870912, due to the overflow.
Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in
range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are
in range [0, bcache_dev->nr_stripes - 1].
This patch adds a upper limition check in offset_to_stripe(): the max
valid stripe number should be less than bcache_device->nr_stripes. If
the calculated stripe number from do_div() is equal to or larger than
bcache_device->nr_stripe, -EINVAL will be returned. (Normally nr_stripes
is less than INT_MAX, exceeding upper limitation doesn't mean overflow,
therefore -EOVERFLOW is not used as error code.)
This patch also changes nr_stripes' type of struct bcache_device from
'unsigned int' to 'int', and return value type of offset_to_stripe()
from 'unsigned int' to 'int', to match their exact data ranges.
All locations where bcache_device->nr_stripes and offset_to_stripe() are
referenced also get updated for the above type change.
Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5fe48867856367142d91a82f2cbf7a57a24cbb70 upstream.
There are some meta data of bcache are allocated by multiple pages,
and they are used as bio bv_page for I/Os to the cache device. for
example cache_set->uuids, cache->disk_buckets, journal_write->data,
bset_tree->data.
For such meta data memory, all the allocated pages should be treated
as a single memory block. Then the memory management and underlying I/O
code can treat them more clearly.
This patch adds __GFP_COMP flag to all the location allocating >0 order
pages for the above mentioned meta data. Then their pages are treated
as compound pages now.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a1c6ae3d9f3dd6aa5981a332a6f700cf1c25edef upstream.
In degraded raid5, we need to read parity to do reconstruct-write when
data disks fail. However, we can not read parity from
handle_stripe_dirtying() in force reconstruct-write mode.
Reproducible Steps:
1. Create degraded raid5
mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing
2. Set rmw_level to 0
echo 0 > /sys/block/md2/md/rmw_level
3. IO to raid5
Now some io may be stuck in raid5. We can use handle_stripe_fill() to read
the parity in this situation.
Cc: <stable@vger.kernel.org> # v4.4+
Reviewed-by: Alex Wu <alexwu@synology.com>
Reviewed-by: BingJing Chang <bingjingc@synology.com>
Reviewed-by: Danny Shih <dannyshih@synology.com>
Signed-off-by: ChangSyun Peng <allenpeng@synology.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 117f636ea695270fe492d0c0c9dfadc7a662af47 ]
In register_cache_set(), c is pointer to struct cache_set, and ca is
pointer to struct cache, if ca->sb.seq > c->sb.seq, it means this
registering cache has up to date version and other members, the in-
memory version and other members should be updated to the newer value.
But current implementation makes a cache set only has a single cache
device, so the above assumption works well except for a special case.
The execption is when a cache device new created and both ca->sb.seq and
c->sb.seq are 0, because the super block is never flushed out yet. In
the location for the following if() check,
2156 if (ca->sb.seq > c->sb.seq) {
2157 c->sb.version = ca->sb.version;
2158 memcpy(c->sb.set_uuid, ca->sb.set_uuid, 16);
2159 c->sb.flags = ca->sb.flags;
2160 c->sb.seq = ca->sb.seq;
2161 pr_debug("set version = %llu\n", c->sb.version);
2162 }
c->sb.version is not initialized yet and valued 0. When ca->sb.seq is 0,
the if() check will fail (because both values are 0), and the cache set
version, set_uuid, flags and seq won't be updated.
The above problem is hiden for current code, because the bucket size is
compatible among different super block version. And the next time when
running cache set again, ca->sb.seq will be larger than 0 and cache set
super block version will be updated properly.
But if the large bucket feature is enabled, sb->bucket_size is the low
16bits of the bucket size. For a power of 2 value, when the actual
bucket size exceeds 16bit width, sb->bucket_size will always be 0. Then
read_super_common() will fail because the if() check to
is_power_of_2(sb->bucket_size) is false. This is how the long time
hidden bug is triggered.
This patch modifies the if() check to the following way,
2156 if (ca->sb.seq > c->sb.seq || c->sb.seq == 0) {
Then cache set's version, set_uuid, flags and seq will always be updated
corectly including for a new created cache device.
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>