7411202a0f
8692 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
|
356056cbe6 |
btrfs: fix extent buffer leak after tree mod log failure at split_node()
[ Upstream commit ede600e497b1461d06d22a7d17703d9096868bc3 ]
At split_node(), if we fail to log the tree mod log copy operation, we
return without unlocking the split extent buffer we just allocated and
without decrementing the reference we own on it. Fix this by unlocking
it and decrementing the ref count before returning.
Fixes:
|
||
|
63008dab58 |
btrfs: fix race between quota disable and relocation
[ Upstream commit 8a4a0b2a3eaf75ca8854f856ef29690c12b2f531 ]
If we disable quotas while we have a relocation of a metadata block group
that has extents belonging to the quota root, we can cause the relocation
to fail with -ENOENT. This is because relocation builds backref nodes for
extents of the quota root and later needs to walk the backrefs and access
the quota root - however if in between a task disables quotas, it results
in deleting the quota root from the root tree (with btrfs_del_root(),
called from btrfs_quota_disable().
This can be sporadically triggered by test case btrfs/255 from fstests:
$ ./check btrfs/255
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg)
- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad)
# --- tests/btrfs/255.out 2023-03-02 21:47:53.876609426 +0000
# +++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad 2023-06-16 10:20:39.267563212 +0100
# @@ -1,2 +1,4 @@
# QA output created by 255
# +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory
# +There may be more info in syslog - try dmesg | tail
# Silence is golden
# ...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad' to see the entire diff)
Ran: btrfs/255
Failures: btrfs/255
Failed 1 of 1 tests
To fix this make the quota disable operation take the cleaner mutex, as
relocation of a block group also takes this mutex. This is also what we
do when deleting a subvolume/snapshot, we take the cleaner mutex in the
cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root()
at btrfs_drop_snapshot() while under the protection of the cleaner mutex.
Fixes:
|
||
|
0a55f346e0 |
btrfs: qgroup: catch reserved space leaks at unmount time
[ Upstream commit 5958253cf65de42493f17f36877a901486a90365 ] Before this patch, qgroup completely relies on per-inode extent io tree to detect reserved data space leak. However previous bug has already shown how release page before btrfs_finish_ordered_io() could lead to leak, and since it's QGROUP_RESERVED bit cleared without triggering qgroup rsv, it can't be detected by per-inode extent io tree. So this patch adds another (and hopefully the final) safety net to catch qgroup data reserved space leak. At least the new safety net catches all the leaks during development, so it should be pretty useful in the real world. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 8a4a0b2a3eaf ("btrfs: fix race between quota disable and relocation") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
ae91ab710d |
btrfs: fix warning when putting transaction with qgroups enabled after abort
commit aa84ce8a78a1a5c10cdf9c7a5fb0c999fbc2c8d6 upstream. If we have a transaction abort with qgroups enabled we get a warning triggered when doing the final put on the transaction, like this: [552.6789] ------------[ cut here ]------------ [552.6815] WARNING: CPU: 4 PID: 81745 at fs/btrfs/transaction.c:144 btrfs_put_transaction+0x123/0x130 [btrfs] [552.6817] Modules linked in: btrfs blake2b_generic xor (...) [552.6819] CPU: 4 PID: 81745 Comm: btrfs-transacti Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [552.6819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [552.6819] RIP: 0010:btrfs_put_transaction+0x123/0x130 [btrfs] [552.6821] Code: bd a0 01 00 (...) [552.6821] RSP: 0018:ffffa168c0527e28 EFLAGS: 00010286 [552.6821] RAX: ffff936042caed00 RBX: ffff93604a3eb448 RCX: 0000000000000000 [552.6821] RDX: ffff93606421b028 RSI: ffffffff92ff0878 RDI: ffff93606421b010 [552.6821] RBP: ffff93606421b000 R08: 0000000000000000 R09: ffffa168c0d07c20 [552.6821] R10: 0000000000000000 R11: ffff93608dc52950 R12: ffffa168c0527e70 [552.6821] R13: ffff93606421b000 R14: ffff93604a3eb420 R15: ffff93606421b028 [552.6821] FS: 0000000000000000(0000) GS:ffff93675fb00000(0000) knlGS:0000000000000000 [552.6821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [552.6821] CR2: 0000558ad262b000 CR3: 000000014feda005 CR4: 0000000000370ee0 [552.6822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [552.6822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [552.6822] Call Trace: [552.6822] <TASK> [552.6822] ? __warn+0x80/0x130 [552.6822] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6824] ? report_bug+0x1f4/0x200 [552.6824] ? handle_bug+0x42/0x70 [552.6824] ? exc_invalid_op+0x14/0x70 [552.6824] ? asm_exc_invalid_op+0x16/0x20 [552.6824] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6826] btrfs_cleanup_transaction+0xe7/0x5e0 [btrfs] [552.6828] ? _raw_spin_unlock_irqrestore+0x23/0x40 [552.6828] ? try_to_wake_up+0x94/0x5e0 [552.6828] ? __pfx_process_timeout+0x10/0x10 [552.6828] transaction_kthread+0x103/0x1d0 [btrfs] [552.6830] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] [552.6832] kthread+0xee/0x120 [552.6832] ? __pfx_kthread+0x10/0x10 [552.6832] ret_from_fork+0x29/0x50 [552.6832] </TASK> [552.6832] ---[ end trace 0000000000000000 ]--- This corresponds to this line of code: void btrfs_put_transaction(struct btrfs_transaction *transaction) { (...) WARN_ON(!RB_EMPTY_ROOT( &transaction->delayed_refs.dirty_extent_root)); (...) } The warning happens because btrfs_qgroup_destroy_extent_records(), called in the transaction abort path, we free all entries from the rbtree "dirty_extent_root" with rbtree_postorder_for_each_entry_safe(), but we don't actually empty the rbtree - it's still pointing to nodes that were freed. So set the rbtree's root node to NULL to avoid this warning (assign RB_ROOT). Fixes: 81f7eb00ff5b ("btrfs: destroy qgroup extent records on transaction abort") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
679c34821a |
btrfs: fix race when deleting quota root from the dirty cow roots list
commit b31cb5a6eb7a48b0a7bfdf06832b1fd5088d8c79 upstream.
When disabling quotas we are deleting the quota root from the list
fs_info->dirty_cowonly_roots without taking the lock that protects it,
which is struct btrfs_fs_info::trans_lock. This unsynchronized list
manipulation may cause chaos if there's another concurrent manipulation
of this list, such as when adding a root to it with
ctree.c:add_root_to_dirty_list().
This can result in all sorts of weird failures caused by a race, such as
the following crash:
[337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
[337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs]
[337571.279928] Code: 85 38 06 00 (...)
[337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206
[337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000
[337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070
[337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b
[337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600
[337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48
[337571.281723] FS: 00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000
[337571.281950] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0
[337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[337571.282874] Call Trace:
[337571.283101] <TASK>
[337571.283327] ? __die_body+0x1b/0x60
[337571.283570] ? die_addr+0x39/0x60
[337571.283796] ? exc_general_protection+0x22e/0x430
[337571.284022] ? asm_exc_general_protection+0x22/0x30
[337571.284251] ? commit_cowonly_roots+0x11f/0x250 [btrfs]
[337571.284531] btrfs_commit_transaction+0x42e/0xf90 [btrfs]
[337571.284803] ? _raw_spin_unlock+0x15/0x30
[337571.285031] ? release_extent_buffer+0x103/0x130 [btrfs]
[337571.285305] reset_balance_state+0x152/0x1b0 [btrfs]
[337571.285578] btrfs_balance+0xa50/0x11e0 [btrfs]
[337571.285864] ? __kmem_cache_alloc_node+0x14a/0x410
[337571.286086] btrfs_ioctl+0x249a/0x3320 [btrfs]
[337571.286358] ? mod_objcg_state+0xd2/0x360
[337571.286577] ? refill_obj_stock+0xb0/0x160
[337571.286798] ? seq_release+0x25/0x30
[337571.287016] ? __rseq_handle_notify_resume+0x3ba/0x4b0
[337571.287235] ? percpu_counter_add_batch+0x2e/0xa0
[337571.287455] ? __x64_sys_ioctl+0x88/0xc0
[337571.287675] __x64_sys_ioctl+0x88/0xc0
[337571.287901] do_syscall_64+0x38/0x90
[337571.288126] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[337571.288352] RIP: 0033:0x7f478aaffe9b
So fix this by locking struct btrfs_fs_info::trans_lock before deleting
the quota root from that list.
Fixes:
|
||
|
28850d25a6 |
btrfs: handle memory allocation failure in btrfs_csum_one_bio
[ Upstream commit 806570c0bb7b4847828c22c4934fcf2dc8fc572f ] Since f8a53bb58ec7 ("btrfs: handle checksum generation in the storage layer") the failures of btrfs_csum_one_bio() are handled via bio_end_io(). This means, we can return BLK_STS_RESOURCE from btrfs_csum_one_bio() in case the allocation of the ordered sums fails. This also fixes a syzkaller report, where injecting a failure into the kvzalloc() call results in a BUG_ON(). Reported-by: syzbot+d8941552e21eac774778@syzkaller.appspotmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
8e54667403 |
btrfs: unset reloc control if transaction commit fails in prepare_to_relocate()
commit 85f02d6c856b9f3a0acf5219de6e32f58b9778eb upstream. In btrfs_relocate_block_group(), the rc is allocated. Then btrfs_relocate_block_group() calls relocate_block_group() prepare_to_relocate() set_reloc_control() that assigns rc to the variable fs_info->reloc_ctl. When prepare_to_relocate() returns, it calls btrfs_commit_transaction() btrfs_start_dirty_block_groups() btrfs_alloc_path() kmem_cache_zalloc() which may fail for example (or other errors could happen). When the failure occurs, btrfs_relocate_block_group() detects the error and frees rc and doesn't set fs_info->reloc_ctl to NULL. After that, in btrfs_init_reloc_root(), rc is retrieved from fs_info->reloc_ctl and then used, which may cause a use-after-free bug. This possible bug can be triggered by calling btrfs_ioctl_balance() before calling btrfs_ioctl_defrag(). To fix this possible bug, in prepare_to_relocate(), check if btrfs_commit_transaction() fails. If the failure occurs, unset_reloc_control() is called to set fs_info->reloc_ctl to NULL. The error log in our fault-injection testing is shown as follows: [ 58.751070] BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x7ca/0x920 [btrfs] ... [ 58.753577] Call Trace: ... [ 58.755800] kasan_report+0x45/0x60 [ 58.756066] btrfs_init_reloc_root+0x7ca/0x920 [btrfs] [ 58.757304] record_root_in_trans+0x792/0xa10 [btrfs] [ 58.757748] btrfs_record_root_in_trans+0x463/0x4f0 [btrfs] [ 58.758231] start_transaction+0x896/0x2950 [btrfs] [ 58.758661] btrfs_defrag_root+0x250/0xc00 [btrfs] [ 58.759083] btrfs_ioctl_defrag+0x467/0xa00 [btrfs] [ 58.759513] btrfs_ioctl+0x3c95/0x114e0 [btrfs] ... [ 58.768510] Allocated by task 23683: [ 58.768777] ____kasan_kmalloc+0xb5/0xf0 [ 58.769069] __kmalloc+0x227/0x3d0 [ 58.769325] alloc_reloc_control+0x10a/0x3d0 [btrfs] [ 58.769755] btrfs_relocate_block_group+0x7aa/0x1e20 [btrfs] [ 58.770228] btrfs_relocate_chunk+0xf1/0x760 [btrfs] [ 58.770655] __btrfs_balance+0x1326/0x1f10 [btrfs] [ 58.771071] btrfs_balance+0x3150/0x3d30 [btrfs] [ 58.771472] btrfs_ioctl_balance+0xd84/0x1410 [btrfs] [ 58.771902] btrfs_ioctl+0x4caa/0x114e0 [btrfs] ... [ 58.773337] Freed by task 23683: ... [ 58.774815] kfree+0xda/0x2b0 [ 58.775038] free_reloc_control+0x1d6/0x220 [btrfs] [ 58.775465] btrfs_relocate_block_group+0x115c/0x1e20 [btrfs] [ 58.775944] btrfs_relocate_chunk+0xf1/0x760 [btrfs] [ 58.776369] __btrfs_balance+0x1326/0x1f10 [btrfs] [ 58.776784] btrfs_balance+0x3150/0x3d30 [btrfs] [ 58.777185] btrfs_ioctl_balance+0xd84/0x1410 [btrfs] [ 58.777621] btrfs_ioctl+0x4caa/0x114e0 [btrfs] ... Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Zixuan Fu <r33s3n6@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Stefan Ghinea <stefan.ghinea@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
4223d91ca1 |
btrfs: check return value of btrfs_commit_transaction in relocation
commit fb686c6824dd6294ca772b92424b8fba666e7d00 upstream. There are a few places where we don't check the return value of btrfs_commit_transaction in relocation.c. Thankfully all these places have straightforward error handling, so simply change all of the sites at once. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Stefan Ghinea <stefan.ghinea@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
a9f5423460 |
btrfs: use nofs when cleaning up aborted transactions
commit 597441b3436a43011f31ce71dc0a6c0bf5ce958a upstream. Our CI system caught a lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 6.3.0-rc7+ #1167 Not tainted ------------------------------------------------------ kswapd0/46 is trying to acquire lock: ffff8c6543abd650 (sb_internal#2){++++}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x5f/0x120 but task is already holding lock: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0xa5/0xe0 kmem_cache_alloc+0x31/0x2c0 alloc_extent_state+0x1d/0xd0 __clear_extent_bit+0x2e0/0x4f0 try_release_extent_mapping+0x216/0x280 btrfs_release_folio+0x2e/0x90 invalidate_inode_pages2_range+0x397/0x470 btrfs_cleanup_dirty_bgs+0x9e/0x210 btrfs_cleanup_one_transaction+0x22/0x760 btrfs_commit_transaction+0x3b7/0x13a0 create_subvol+0x59b/0x970 btrfs_mksubvol+0x435/0x4f0 __btrfs_ioctl_snap_create+0x11e/0x1b0 btrfs_ioctl_snap_create_v2+0xbf/0x140 btrfs_ioctl+0xa45/0x28f0 __x64_sys_ioctl+0x88/0xc0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc -> #0 (sb_internal#2){++++}-{0:0}: __lock_acquire+0x1435/0x21a0 lock_acquire+0xc2/0x2b0 start_transaction+0x401/0x730 btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_evict_inode+0x292/0x3d0 evict+0xcc/0x1d0 inode_lru_isolate+0x14d/0x1e0 __list_lru_walk_one+0xbe/0x1c0 list_lru_walk_one+0x58/0x80 prune_icache_sb+0x39/0x60 super_cache_scan+0x161/0x1f0 do_shrink_slab+0x163/0x340 shrink_slab+0x1d3/0x290 shrink_node+0x300/0x720 balance_pgdat+0x35c/0x7a0 kswapd+0x205/0x410 kthread+0xf0/0x120 ret_from_fork+0x29/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(sb_internal#2); lock(fs_reclaim); lock(sb_internal#2); *** DEADLOCK *** 3 locks held by kswapd0/46: #0: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0 #1: ffffffffabe50270 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x113/0x290 #2: ffff8c6543abd0e0 (&type->s_umount_key#44){++++}-{3:3}, at: super_cache_scan+0x38/0x1f0 stack backtrace: CPU: 0 PID: 46 Comm: kswapd0 Not tainted 6.3.0-rc7+ #1167 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x58/0x90 check_noncircular+0xd6/0x100 ? save_trace+0x3f/0x310 ? add_lock_to_list+0x97/0x120 __lock_acquire+0x1435/0x21a0 lock_acquire+0xc2/0x2b0 ? btrfs_commit_inode_delayed_inode+0x5f/0x120 start_transaction+0x401/0x730 ? btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_commit_inode_delayed_inode+0x5f/0x120 btrfs_evict_inode+0x292/0x3d0 ? lock_release+0x134/0x270 ? __pfx_wake_bit_function+0x10/0x10 evict+0xcc/0x1d0 inode_lru_isolate+0x14d/0x1e0 __list_lru_walk_one+0xbe/0x1c0 ? __pfx_inode_lru_isolate+0x10/0x10 ? __pfx_inode_lru_isolate+0x10/0x10 list_lru_walk_one+0x58/0x80 prune_icache_sb+0x39/0x60 super_cache_scan+0x161/0x1f0 do_shrink_slab+0x163/0x340 shrink_slab+0x1d3/0x290 shrink_node+0x300/0x720 balance_pgdat+0x35c/0x7a0 kswapd+0x205/0x410 ? __pfx_autoremove_wake_function+0x10/0x10 ? __pfx_kswapd+0x10/0x10 kthread+0xf0/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x29/0x50 </TASK> This happens because when we abort the transaction in the transaction commit path we call invalidate_inode_pages2_range on our block group cache inodes (if we have space cache v1) and any delalloc inodes we may have. The plain invalidate_inode_pages2_range() call passes through GFP_KERNEL, which makes sense in most cases, but not here. Wrap these two invalidate callees with memalloc_nofs_save/memalloc_nofs_restore to make sure we don't end up with the fs reclaim dependency under the transaction dependency. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
a3714a47b4 |
btrfs: fix space cache inconsistency after error loading it from disk
[ Upstream commit 0004ff15ea26015a0a3a6182dca3b9d1df32e2b7 ]
When loading a free space cache from disk, at __load_free_space_cache(),
if we fail to insert a bitmap entry, we still increment the number of
total bitmaps in the btrfs_free_space_ctl structure, which is incorrect
since we failed to add the bitmap entry. On error we then empty the
cache by calling __btrfs_remove_free_space_cache(), which will result
in getting the total bitmaps counter set to 1.
A failure to load a free space cache is not critical, so if a failure
happens we just rebuild the cache by scanning the extent tree, which
happens at block-group.c:caching_thread(). Yet the failure will result
in having the total bitmaps of the btrfs_free_space_ctl always bigger
by 1 then the number of bitmap entries we have. So fix this by having
the total bitmaps counter be incremented only if we successfully added
the bitmap entry.
Fixes:
|
||
|
5968983037 |
btrfs: replace calls to btrfs_find_free_ino with btrfs_find_free_objectid
[ Upstream commit abadc1fcd72e887a8f875dabe4a07aa8c28ac8af ] The former is going away as part of the inode map removal so switch callers to btrfs_find_free_objectid. No functional changes since with INODE_MAP disabled (default) find_free_objectid was called anyway. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 0004ff15ea26 ("btrfs: fix space cache inconsistency after error loading it from disk") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
3e2b4bceaa |
btrfs: print-tree: parent bytenr must be aligned to sector size
commit c87f318e6f47696b4040b58f460d5c17ea0280e6 upstream. Check nodesize to sectorsize in alignment check in print_extent_item. The comment states that and this is correct, similar check is done elsewhere in the functions. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: ea57788eb76d ("btrfs: require only sector size alignment for parent eb bytenr") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anastasia Belova <abelova@astralinux.ru> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
1e05bf5e80 |
btrfs: don't free qgroup space unless specified
commit d246331b78cbef86237f9c22389205bc9b4e1cc1 upstream. Boris noticed in his simple quotas testing that he was getting a leak with Sweet Tea's change to subvol create that stopped doing a transaction commit. This was just a side effect of that change. In the delayed inode code we have an optimization that will free extra reservations if we think we can pack a dir item into an already modified leaf. Previously this wouldn't be triggered in the subvolume create case because we'd commit the transaction, it was still possible but much harder to trigger. It could actually be triggered if we did a mkdir && subvol create with qgroups enabled. This occurs because in btrfs_insert_delayed_dir_index(), which gets called when we're adding the dir item, we do the following: btrfs_block_rsv_release(fs_info, trans->block_rsv, bytes, NULL); if we're able to skip reserving space. The problem here is that trans->block_rsv points at the temporary block rsv for the subvolume create, which has qgroup reservations in the block rsv. This is a problem because btrfs_block_rsv_release() will do the following: if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) { qgroup_to_release = block_rsv->qgroup_rsv_reserved - block_rsv->qgroup_rsv_size; block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size; } The temporary block rsv just has ->qgroup_rsv_reserved set, ->qgroup_rsv_size == 0. The optimization in btrfs_insert_delayed_dir_index() sets ->qgroup_rsv_reserved = 0. Then later on when we call btrfs_subvolume_release_metadata() which has btrfs_block_rsv_release(fs_info, rsv, (u64)-1, &qgroup_to_release); btrfs_qgroup_convert_reserved_meta(root, qgroup_to_release); qgroup_to_release is set to 0, and we do not convert the reserved metadata space. The problem here is that the block rsv code has been unconditionally messing with ->qgroup_rsv_reserved, because the main place this is used is delalloc, and any time we call btrfs_block_rsv_release() we do it with qgroup_to_release set, and thus do the proper accounting. The subvolume code is the only other code that uses the qgroup reservation stuff, but it's intermingled with the above optimization, and thus was getting its reservation freed out from underneath it and thus leaking the reserved space. The solution is to simply not mess with the qgroup reservations if we don't have qgroup_to_release set. This works with the existing code as anything that messes with the delalloc reservations always have qgroup_to_release set. This fixes the leak that Boris was observing. Reviewed-by: Qu Wenruo <wqu@suse.com> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
b232f5e8cf |
btrfs: fix btrfs_prev_leaf() to not return the same key twice
commit 6f932d4ef007d6a4ae03badcb749fbb8f49196f6 upstream. A call to btrfs_prev_leaf() may end up returning a path that points to the same item (key) again. This happens if while btrfs_prev_leaf(), after we release the path, a concurrent insertion happens, which moves items off from a sibling into the front of the previous leaf, and an item with the computed previous key does not exists. For example, suppose we have the two following leaves: Leaf A ------------------------------------------------------------- | ... key (300 96 10) key (300 96 15) key (300 96 16) | ------------------------------------------------------------- slot 20 slot 21 slot 22 Leaf B ------------------------------------------------------------- | key (300 96 20) key (300 96 21) key (300 96 22) ... | ------------------------------------------------------------- slot 0 slot 1 slot 2 If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with a path pointing to leaf B and slot 0 and the following happens: 1) At btrfs_prev_leaf() we compute the previous key to search as: (300 96 19), which is a key that does not exists in the tree; 2) Then we call btrfs_release_path() at btrfs_prev_leaf(); 3) Some other task inserts a key at leaf A, that sorts before the key at slot 20, for example it has an objectid of 299. In order to make room for the new key, the key at slot 22 is moved to the front of leaf B. This happens at push_leaf_right(), called from split_leaf(). After this leaf B now looks like: -------------------------------------------------------------------------------- | key (300 96 16) key (300 96 20) key (300 96 21) key (300 96 22) ... | -------------------------------------------------------------------------------- slot 0 slot 1 slot 2 slot 3 4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed previous key: (300 96 19). Since the key does not exists, btrfs_search_slot() returns 1 and with a path pointing to leaf B and slot 1, the item with key (300 96 20); 5) This makes btrfs_prev_leaf() return a path that points to slot 1 of leaf B, the same key as before it was called, since the key at slot 0 of leaf B (300 96 16) is less than the computed previous key, which is (300 96 19); 6) As a consequence btrfs_previous_item() returns a path that points again to the item with key (300 96 20). For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not be functional a problem, despite not making sense to return a new path pointing again to the same item/key. However for a caller such as tree-log.c:log_dir_items(), this has a bad consequence, as it can result in not logging some dir index deletions in case the directory is being logged without holding the inode's VFS lock (logging triggered while logging a child inode for example) - for the example scenario above, in case the dir index keys 17, 18 and 19 were deleted in the current transaction. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
5492d40812 |
btrfs: scrub: reject unsupported scrub flags
commit 604e6681e114d05a2e384c4d1e8ef81918037ef5 upstream. Since the introduction of scrub interface, the only flag that we support is BTRFS_SCRUB_READONLY. Thus there is no sanity checks, if there are some undefined flags passed in, we just ignore them. This is problematic if we want to introduce new scrub flags, as we have no way to determine if such flags are supported. Address the problem by introducing a check for the flags, and if unsupported flags are set, return -EOPNOTSUPP to inform the user space. This check should be backported for all supported kernels before any new scrub flags are introduced. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
50dbfd9dac |
btrfs: fix fast csum implementation detection
commit 68d99ab0e9221ef54506f827576c5a914680eeaf upstream. The BTRFS_FS_CSUM_IMPL_FAST flag is currently set whenever a non-generic crc32c is detected, which is the incorrect check if the file system uses a different checksumming algorithm. Refactor the code to only check this if crc32c is actually used. Note that in an ideal world the information if an algorithm is hardware accelerated or not should be provided by the crypto API instead, but that's left for another day. CC: stable@vger.kernel.org # 5.4.x: c8a5f8ca9a9c: btrfs: print checksum type and implementation at mount time CC: stable@vger.kernel.org # 5.4.x Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
c6db5f2a31 |
btrfs: print checksum type and implementation at mount time
commit c8a5f8ca9a9c7d5c5bc31d54f47ea9d86f93ed69 upstream. Per user request, print the checksum type and implementation at mount time among the messages. The checksum is user configurable and the actual crypto implementation is useful to see for performance reasons. The same information is also available after mount in /sys/fs/FSID/checksum file. Example: [25.323662] BTRFS info (device vdb): using sha256 (sha256-generic) checksum algorithm Link: https://github.com/kdave/btrfs-progs/issues/483 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
9b189af357 |
btrfs: scan device in non-exclusive mode
commit 50d281fc434cb8e2497f5e70a309ccca6b1a09f0 upstream. This fixes mkfs/mount/check failures due to race with systemd-udevd scan. During the device scan initiated by systemd-udevd, other user space EXCL operations such as mkfs, mount, or check may get blocked and result in a "Device or resource busy" error. This is because the device scan process opens the device with the EXCL flag in the kernel. Two reports were received: - btrfs/179 test case, where the fsck command failed with the -EBUSY error - LTP pwritev03 test case, where mkfs.vfs failed with the -EBUSY error, when mkfs.vfs tried to overwrite old btrfs filesystem on the device. In both cases, fsck and mkfs (respectively) were racing with a systemd-udevd device scan, and systemd-udevd won, resulting in the -EBUSY error for fsck and mkfs. Reproducing the problem has been difficult because there is a very small window during which these userspace threads can race to acquire the exclusive device open. Even on the system where the problem was observed, the problem occurrences were anywhere between 10 to 400 iterations and chances of reproducing decreases with debug printk()s. However, an exclusive device open is unnecessary for the scan process, as there are no write operations on the device during scan. Furthermore, during the mount process, the superblock is re-read in the below function call chain: btrfs_mount_root btrfs_open_devices open_fs_devices btrfs_open_one_device btrfs_get_bdev_and_sb So, to fix this issue, removes the FMODE_EXCL flag from the scan operation, and add a comment. The case where mkfs may still write to the device and a scan is running, the btrfs signature is not written at that time so scan will not recognize such device. Reported-by: Sherry Yang <sherry.yang@oracle.com> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202303170839.fdf23068-oliver.sang@intel.com CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
98e626c115 |
btrfs: send: limit number of clones and allocated memory size
[ Upstream commit 33e17b3f5ab74af12aca58c515bc8424ff69a343 ] The arg->clone_sources_count is u64 and can trigger a warning when a huge value is passed from user space and a huge array is allocated. Limit the allocated memory to 8MiB (can be increased if needed), which in turn limits the number of clone sources to 8M / sizeof(struct clone_root) = 8M / 40 = 209715. Real world number of clones is from tens to hundreds, so this is future proof. Reported-by: syzbot+4376a9a073770c173269@syzkaller.appspotmail.com Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
b6ac5e6be5 |
btrfs: free device in btrfs_close_devices for a single device filesystem
commit 5f58d783fd7823b2c2d5954d1126e702f94bfc4c upstream. We have this check to make sure we don't accidentally add older devices that may have disappeared and re-appeared with an older generation from being added to an fs_devices (such as a replace source device). This makes sense, we don't want stale disks in our file system. However for single disks this doesn't really make sense. I've seen this in testing, but I was provided a reproducer from a project that builds btrfs images on loopback devices. The loopback device gets cached with the new generation, and then if it is re-used to generate a new file system we'll fail to mount it because the new fs is "older" than what we have in cache. Fix this by freeing the cache when closing the device for a single device filesystem. This will ensure that the mount command passed device path is scanned successfully during the next mount. CC: stable@vger.kernel.org # 5.10+ Reported-by: Daan De Meyer <daandemeyer@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
7e43bb69bc |
btrfs: zlib: zero-initialize zlib workspace
commit eadd7deca0ad8a83edb2b894d8326c78e78635d6 upstream. KMSAN reports uses of uninitialized memory in zlib's longest_match() called on memory originating from zlib_alloc_workspace(). This issue is known by zlib maintainers and is claimed to be harmless, but to be on the safe side we'd better initialize the memory. Link: https://zlib.net/zlib_faq.html#faq36 Reported-by: syzbot+14d9e7602ebdf7ec0a60@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
ed29d8b309 |
btrfs: limit device extents to the device size
commit 3c538de0f2a74d50aff7278c092f88ae59cee688 upstream. There was a recent regression in btrfs/177 that started happening with the size class patches ("btrfs: introduce size class to block group allocator"). This however isn't a regression introduced by those patches, but rather the bug was uncovered by a change in behavior in these patches. The patches triggered more chunk allocations in the ^free-space-tree case, which uncovered a race with device shrink. The problem is we will set the device total size to the new size, and use this to find a hole for a device extent. However during shrink we may have device extents allocated past this range, so we could potentially find a hole in a range past our new shrink size. We don't actually limit our found extent to the device size anywhere, we assume that we will not find a hole past our device size. This isn't true with shrink as we're relocating block groups and thus creating holes past the device size. Fix this by making sure we do not search past the new device size, and if we wander into any device extents that start after our device size simply break from the loop and use whatever hole we've already found. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
89ac597e3e |
btrfs: fix race between quota rescan and disable leading to NULL pointer deref
commit b7adbf9ada3513d2092362c8eac5cddc5b651f5c upstream. If we have one task trying to start the quota rescan worker while another one is trying to disable quotas, we can end up hitting a race that results in the quota rescan worker doing a NULL pointer dereference. The steps for this are the following: 1) Quotas are enabled; 2) Task A calls the quota rescan ioctl and enters btrfs_qgroup_rescan(). It calls qgroup_rescan_init() which returns 0 (success) and then joins a transaction and commits it; 3) Task B calls the quota disable ioctl and enters btrfs_quota_disable(). It clears the bit BTRFS_FS_QUOTA_ENABLED from fs_info->flags and calls btrfs_qgroup_wait_for_completion(), which returns immediately since the rescan worker is not yet running. Then it starts a transaction and locks fs_info->qgroup_ioctl_lock; 4) Task A queues the rescan worker, by calling btrfs_queue_work(); 5) The rescan worker starts, and calls rescan_should_stop() at the start of its while loop, which results in 0 iterations of the loop, since the flag BTRFS_FS_QUOTA_ENABLED was cleared from fs_info->flags by task B at step 3); 6) Task B sets fs_info->quota_root to NULL; 7) The rescan worker tries to start a transaction and uses fs_info->quota_root as the root argument for btrfs_start_transaction(). This results in a NULL pointer dereference down the call chain of btrfs_start_transaction(). The stack trace is something like the one reported in Link tag below: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f] CPU: 1 PID: 34 Comm: kworker/u4:2 Not tainted 6.1.0-syzkaller-13872-gb6bb9676f216 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022 Workqueue: btrfs-qgroup-rescan btrfs_work_helper RIP: 0010:start_transaction+0x48/0x10f0 fs/btrfs/transaction.c:564 Code: 48 89 fb 48 (...) RSP: 0018:ffffc90000ab7ab0 EFLAGS: 00010206 RAX: 0000000000000041 RBX: 0000000000000208 RCX: ffff88801779ba80 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000 RBP: dffffc0000000000 R08: 0000000000000001 R09: fffff52000156f5d R10: fffff52000156f5d R11: 1ffff92000156f5c R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f2bea75b718 CR3: 000000001d0cc000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_qgroup_rescan_worker+0x3bb/0x6a0 fs/btrfs/qgroup.c:3402 btrfs_work_helper+0x312/0x850 fs/btrfs/async-thread.c:280 process_one_work+0x877/0xdb0 kernel/workqueue.c:2289 worker_thread+0xb14/0x1330 kernel/workqueue.c:2436 kthread+0x266/0x300 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 </TASK> Modules linked in: So fix this by having the rescan worker function not attempt to start a transaction if it didn't do any rescan work. Reported-by: syzbot+96977faa68092ad382c4@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000e5454b05f065a803@google.com/ Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
839f3ca8bc |
btrfs: replace strncpy() with strscpy()
[ Upstream commit 63d5429f68a3d4c4aa27e65a05196c17f86c41d6 ] Using strncpy() on NUL-terminated strings are deprecated. To avoid possible forming of non-terminated string strscpy() should be used. Found by Linux Verification Center (linuxtesting.org) with SVACE. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
c59ee1528b |
btrfs: fix resolving backrefs for inline extent followed by prealloc
commit 560840afc3e63bbe5d9c5ef6b2ecf8f3589adff6 upstream. If a file consists of an inline extent followed by a regular or prealloc extent, then a legitimate attempt to resolve a logical address in the non-inline region will result in add_all_parents reading the invalid offset field of the inline extent. If the inline extent item is placed in the leaf eb s.t. it is the first item, attempting to access the offset field will not only be meaningless, it will go past the end of the eb and cause this panic: [17.626048] BTRFS warning (device dm-2): bad eb member end: ptr 0x3fd4 start 30834688 member offset 16377 size 8 [17.631693] general protection fault, probably for non-canonical address 0x5088000000000: 0000 [#1] SMP PTI [17.635041] CPU: 2 PID: 1267 Comm: btrfs Not tainted 5.12.0-07246-g75175d5adc74-dirty #199 [17.637969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [17.641995] RIP: 0010:btrfs_get_64+0xe7/0x110 [17.649890] RSP: 0018:ffffc90001f73a08 EFLAGS: 00010202 [17.651652] RAX: 0000000000000001 RBX: ffff88810c42d000 RCX: 0000000000000000 [17.653921] RDX: 0005088000000000 RSI: ffffc90001f73a0f RDI: 0000000000000001 [17.656174] RBP: 0000000000000ff9 R08: 0000000000000007 R09: c0000000fffeffff [17.658441] R10: ffffc90001f73790 R11: ffffc90001f73788 R12: ffff888106afe918 [17.661070] R13: 0000000000003fd4 R14: 0000000000003f6f R15: cdcdcdcdcdcdcdcd [17.663617] FS: 00007f64e7627d80(0000) GS:ffff888237c80000(0000) knlGS:0000000000000000 [17.666525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [17.668664] CR2: 000055d4a39152e8 CR3: 000000010c596002 CR4: 0000000000770ee0 [17.671253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [17.673634] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [17.676034] PKRU: 55555554 [17.677004] Call Trace: [17.677877] add_all_parents+0x276/0x480 [17.679325] find_parent_nodes+0xfae/0x1590 [17.680771] btrfs_find_all_leafs+0x5e/0xa0 [17.682217] iterate_extent_inodes+0xce/0x260 [17.683809] ? btrfs_inode_flags_to_xflags+0x50/0x50 [17.685597] ? iterate_inodes_from_logical+0xa1/0xd0 [17.687404] iterate_inodes_from_logical+0xa1/0xd0 [17.689121] ? btrfs_inode_flags_to_xflags+0x50/0x50 [17.691010] btrfs_ioctl_logical_to_ino+0x131/0x190 [17.692946] btrfs_ioctl+0x104a/0x2f60 [17.694384] ? selinux_file_ioctl+0x182/0x220 [17.695995] ? __x64_sys_ioctl+0x84/0xc0 [17.697394] __x64_sys_ioctl+0x84/0xc0 [17.698697] do_syscall_64+0x33/0x40 [17.700017] entry_SYSCALL_64_after_hwframe+0x44/0xae [17.701753] RIP: 0033:0x7f64e72761b7 [17.709355] RSP: 002b:00007ffefb067f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [17.712088] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f64e72761b7 [17.714667] RDX: 00007ffefb067fb0 RSI: 00000000c0389424 RDI: 0000000000000003 [17.717386] RBP: 00007ffefb06d188 R08: 000055d4a390d2b0 R09: 00007f64e7340a60 [17.719938] R10: 0000000000000231 R11: 0000000000000246 R12: 0000000000000001 [17.722383] R13: 0000000000000000 R14: 00000000c0389424 R15: 000055d4a38fd2a0 [17.724839] Modules linked in: Fix the bug by detecting the inline extent item in add_all_parents and skipping to the next extent item. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
3b2c064a8e |
btrfs: send: avoid unaligned encoded writes when attempting to clone range
[ Upstream commit a11452a3709e217492798cf3686ac2cc8eb3fb51 ] When trying to see if we can clone a file range, there are cases where we end up sending two write operations in case the inode from the source root has an i_size that is not sector size aligned and the length from the current offset to its i_size is less than the remaining length we are trying to clone. Issuing two write operations when we could instead issue a single write operation is not incorrect. However it is not optimal, specially if the extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed to the send ioctl. In that case we can end up sending an encoded write with an offset that is not sector size aligned, which makes the receiver fallback to decompressing the data and writing it using regular buffered IO (so re-compressing the data in case the fs is mounted with compression enabled), because encoded writes fail with -EINVAL when an offset is not sector size aligned. The following example, which triggered a bug in the receiver code for the fallback logic of decompressing + regular buffer IO and is fixed by the patchset referred in a Link at the bottom of this changelog, is an example where we have the non-optimal behaviour due to an unaligned encoded write: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV > /dev/null mount -o compress $DEV $MNT # File foo has a size of 33K, not aligned to the sector size. xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar # Now clone the first 32K of file bar into foo at offset 0. xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo # Snapshot the default subvolume and create a full send stream (v2). btrfs subvolume snapshot -r $MNT $MNT/snap btrfs send --compressed-data -f /tmp/test.send $MNT/snap echo -e "\nFile bar in the original filesystem:" od -A d -t x1 $MNT/snap/bar umount $MNT mkfs.btrfs -f $DEV > /dev/null mount $DEV $MNT echo -e "\nReceiving stream in a new filesystem..." btrfs receive -f /tmp/test.send $MNT echo -e "\nFile bar in the new filesystem:" od -A d -t x1 $MNT/snap/bar umount $MNT Before this patch, the send stream included one regular write and one encoded write for file 'bar', with the later being not sector size aligned and causing the receiver to fallback to decompression + buffered writes. The output of the btrfs receive command in verbose mode (-vvv): (...) mkfile o258-7-0 rename o258-7-0 -> bar utimes clone bar - source=foo source offset=0 offset=0 length=32768 write bar - offset=32768 length=1024 encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0 encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument") (...) This patch avoids the regular write followed by an unaligned encoded write so that we end up sending a single encoded write that is aligned. So after this patch the stream content is (output of btrfs receive -vvv): (...) mkfile o258-7-0 rename o258-7-0 -> bar utimes clone bar - source=foo source offset=0 offset=0 length=32768 encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0 (...) So we get more optimal behaviour and avoid the silent data loss bug in versions of btrfs-progs affected by the bug referred by the Link tag below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1). Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
8eb912af52 |
btrfs: qgroup: fix sleep from invalid context bug in btrfs_qgroup_inherit()
[ Upstream commit f7e942b5bb35d8e3af54053d19a6bf04143a3955 ] Syzkaller reported BUG as follows: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 Call Trace: <TASK> dump_stack_lvl+0xcd/0x134 __might_resched.cold+0x222/0x26b kmem_cache_alloc+0x2e7/0x3c0 update_qgroup_limit_item+0xe1/0x390 btrfs_qgroup_inherit+0x147b/0x1ee0 create_subvol+0x4eb/0x1710 btrfs_mksubvol+0xfe5/0x13f0 __btrfs_ioctl_snap_create+0x2b0/0x430 btrfs_ioctl_snap_create_v2+0x25a/0x520 btrfs_ioctl+0x2a1c/0x5ce0 __x64_sys_ioctl+0x193/0x200 do_syscall_64+0x35/0x80 Fix this by calling qgroup_dirty() on @dstqgroup, and update limit item in btrfs_run_qgroups() later outside of the spinlock context. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
787138e4b9 |
btrfs: move QUOTA_ENABLED check to rescan_should_stop from btrfs_qgroup_rescan_worker
[ Upstream commit db5df254120004471e1c957957ab2f1e612dcbd6 ] Instead of having 2 places that short circuit the qgroup leaf scan have everything in the qgroup_rescan_leaf function. In addition to that, also ensure that the inconsistent qgroup flag is set when rescan_should_stop returns true. This both retains the old behavior when -EINTR was set in the body of the loop and at the same time also extends this behavior when scanning is interrupted due to remount or unmount operations. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: f7e942b5bb35 ("btrfs: qgroup: fix sleep from invalid context bug in btrfs_qgroup_inherit()") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
83aae3204e |
btrfs: free btrfs_path before copying inodes to userspace
[ Upstream commit 418ffb9e3cf6c4e2574d3a732b724916684bd133 ] btrfs_ioctl_logical_to_ino() frees the search path after the userspace copy from the temp buffer @inodes. Which potentially can lead to a lock splat. Fix this by freeing the path before we copy @inodes to userspace. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
a541f1f0ce |
btrfs: sysfs: normalize the error handling branch in btrfs_init_sysfs()
commit ffdbb44f2f23f963b8f5672e35c3a26088177a62 upstream. Although kset_unregister() can eventually remove all attribute files, explicitly rolling back with the matching function makes the code logic look clearer. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
d037681515 |
btrfs: free btrfs_path before copying subvol info to userspace
commit 013c1c5585ebcfb19c88efe79063d0463b1b6159 upstream. btrfs_ioctl_get_subvol_info() frees the search path after the userspace copy from the temp buffer @subvol_info. This can lead to a lock splat warning. Fix this by freeing the path before we copy it to userspace. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
69e2f1dd93 |
btrfs: free btrfs_path before copying fspath to userspace
commit 8cf96b409d9b3946ece58ced13f92d0f775b0442 upstream. btrfs_ioctl_ino_to_path() frees the search path after the userspace copy from the temp buffer @ipath->fspath. Which potentially can lead to a lock splat warning. Fix this by freeing the path before we copy it to userspace. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
3cde2bc708 |
btrfs: free btrfs_path before copying root refs to userspace
commit b740d806166979488e798e41743aaec051f2443f upstream. Syzbot reported the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Not tainted ------------------------------------------------------ syz-executor307/3029 is trying to acquire lock: ffff0000c02525d8 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x54/0xb4 mm/memory.c:5576 but task is already holding lock: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (btrfs-root-00){++++}-{3:3}: down_read_nested+0x64/0x84 kernel/locking/rwsem.c:1624 __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 btrfs_search_slot_get_root+0x74/0x338 fs/btrfs/ctree.c:1637 btrfs_search_slot+0x1b0/0xfd8 fs/btrfs/ctree.c:1944 btrfs_update_root+0x6c/0x5a0 fs/btrfs/root-tree.c:132 commit_fs_roots+0x1f0/0x33c fs/btrfs/transaction.c:1459 btrfs_commit_transaction+0x89c/0x12d8 fs/btrfs/transaction.c:2343 flush_space+0x66c/0x738 fs/btrfs/space-info.c:786 btrfs_async_reclaim_metadata_space+0x43c/0x4e0 fs/btrfs/space-info.c:1059 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289 worker_thread+0x340/0x610 kernel/workqueue.c:2436 kthread+0x12c/0x158 kernel/kthread.c:376 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860 -> #2 (&fs_info->reloc_mutex){+.+.}-{3:3}: __mutex_lock_common+0xd4/0xca8 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x38/0x44 kernel/locking/mutex.c:799 btrfs_record_root_in_trans fs/btrfs/transaction.c:516 [inline] start_transaction+0x248/0x944 fs/btrfs/transaction.c:752 btrfs_start_transaction+0x34/0x44 fs/btrfs/transaction.c:781 btrfs_create_common+0xf0/0x1b4 fs/btrfs/inode.c:6651 btrfs_create+0x8c/0xb0 fs/btrfs/inode.c:6697 lookup_open fs/namei.c:3413 [inline] open_last_lookups fs/namei.c:3481 [inline] path_openat+0x804/0x11c4 fs/namei.c:3688 do_filp_open+0xdc/0x1b8 fs/namei.c:3718 do_sys_openat2+0xb8/0x22c fs/open.c:1313 do_sys_open fs/open.c:1329 [inline] __do_sys_openat fs/open.c:1345 [inline] __se_sys_openat fs/open.c:1340 [inline] __arm64_sys_openat+0xb0/0xe0 fs/open.c:1340 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #1 (sb_internal#2){.+.+}-{0:0}: percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1826 [inline] sb_start_intwrite include/linux/fs.h:1948 [inline] start_transaction+0x360/0x944 fs/btrfs/transaction.c:683 btrfs_join_transaction+0x30/0x40 fs/btrfs/transaction.c:795 btrfs_dirty_inode+0x50/0x140 fs/btrfs/inode.c:6103 btrfs_update_time+0x1c0/0x1e8 fs/btrfs/inode.c:6145 inode_update_time fs/inode.c:1872 [inline] touch_atime+0x1f0/0x4a8 fs/inode.c:1945 file_accessed include/linux/fs.h:2516 [inline] btrfs_file_mmap+0x50/0x88 fs/btrfs/file.c:2407 call_mmap include/linux/fs.h:2192 [inline] mmap_region+0x7fc/0xc14 mm/mmap.c:1752 do_mmap+0x644/0x97c mm/mmap.c:1540 vm_mmap_pgoff+0xe8/0x1d0 mm/util.c:552 ksys_mmap_pgoff+0x1cc/0x278 mm/mmap.c:1586 __do_sys_mmap arch/arm64/kernel/sys.c:28 [inline] __se_sys_mmap arch/arm64/kernel/sys.c:21 [inline] __arm64_sys_mmap+0x58/0x6c arch/arm64/kernel/sys.c:21 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 -> #0 (&mm->mmap_lock){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 other info that might help us debug this: Chain exists of: &mm->mmap_lock --> &fs_info->reloc_mutex --> btrfs-root-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(&fs_info->reloc_mutex); lock(btrfs-root-00); lock(&mm->mmap_lock); *** DEADLOCK *** 1 lock held by syz-executor307/3029: #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline] #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279 stack backtrace: CPU: 0 PID: 3029 Comm: syz-executor307 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022 Call trace: dump_backtrace+0x1c4/0x1f0 arch/arm64/kernel/stacktrace.c:156 show_stack+0x2c/0x54 arch/arm64/kernel/stacktrace.c:163 __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x104/0x16c lib/dump_stack.c:106 dump_stack+0x1c/0x58 lib/dump_stack.c:113 print_circular_bug+0x2c4/0x2c8 kernel/locking/lockdep.c:2053 check_noncircular+0x14c/0x154 kernel/locking/lockdep.c:2175 check_prev_add kernel/locking/lockdep.c:3095 [inline] check_prevs_add kernel/locking/lockdep.c:3214 [inline] validate_chain kernel/locking/lockdep.c:3829 [inline] __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053 lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666 __might_fault+0x7c/0xb4 mm/memory.c:5577 _copy_to_user include/linux/uaccess.h:134 [inline] copy_to_user include/linux/uaccess.h:160 [inline] btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203 btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856 __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline] invoke_syscall arch/arm64/kernel/syscall.c:52 [inline] el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142 do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206 el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636 el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654 el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581 We do generally the right thing here, copying the references into a temporary buffer, however we are still holding the path when we do copy_to_user from the temporary buffer. Fix this by freeing the path before we copy to user space. Reported-by: syzbot+4ef9e52e464c6ff47d9d@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
6fa082ad96 |
btrfs: remove pointless and double ulist frees in error paths of qgroup tests
[ Upstream commit d0ea17aec12ea0f7b9d2ed727d8ef8169d1e7699 ] Several places in the qgroup self tests follow the pattern of freeing the ulist pointer they passed to btrfs_find_all_roots() if the call to that function returned an error. That is pointless because that function always frees the ulist in case it returns an error. Also In some places like at test_multiple_refs(), after a call to btrfs_qgroup_account_extent() we also leave "old_roots" and "new_roots" pointing to ulists that were freed, because btrfs_qgroup_account_extent() has freed those ulists, and if after that the next call to btrfs_find_all_roots() fails, we call ulist_free() on the "old_roots" ulist again, resulting in a double free. So remove those calls to reduce the code size and avoid double ulist free in case of an error. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
c914c56ac0 |
btrfs: selftests: fix wrong error check in btrfs_free_dummy_root()
commit 9b2f20344d450137d015b380ff0c2e2a6a170135 upstream.
The btrfs_alloc_dummy_root() uses ERR_PTR as the error return value
rather than NULL, if error happened, there will be a NULL pointer
dereference:
BUG: KASAN: null-ptr-deref in btrfs_free_dummy_root+0x21/0x50 [btrfs]
Read of size 8 at addr 000000000000002c by task insmod/258926
CPU: 2 PID: 258926 Comm: insmod Tainted: G W 6.1.0-rc2+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x34/0x44
kasan_report+0xb7/0x140
kasan_check_range+0x145/0x1a0
btrfs_free_dummy_root+0x21/0x50 [btrfs]
btrfs_test_free_space_cache+0x1a8c/0x1add [btrfs]
btrfs_run_sanity_tests+0x65/0x80 [btrfs]
init_btrfs_fs+0xec/0x154 [btrfs]
do_one_initcall+0x87/0x2a0
do_init_module+0xdf/0x320
load_module+0x3006/0x3390
__do_sys_finit_module+0x113/0x1b0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Fixes:
|
||
|
4ae03c869c |
btrfs: fix type of parameter generation in btrfs_get_dentry
commit 2398091f9c2c8e0040f4f9928666787a3e8108a7 upstream. The type of parameter generation has been u32 since the beginning, however all callers pass a u64 generation, so unify the types to prevent potential loss. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
5d1a47ebf8 |
btrfs: fix ulist leaks in error paths of qgroup self tests
[ Upstream commit d37de92b38932d40e4a251e876cc388f9aee5f42 ]
In the test_no_shared_qgroup() and test_multiple_refs() qgroup self tests,
if we fail to add the tree ref, remove the extent item or remove the
extent ref, we are returning from the test function without freeing the
"old_roots" ulist that was allocated by the previous calls to
btrfs_find_all_roots(). Fix that by calling ulist_free() before returning.
Fixes:
|
||
|
6a6731a0df |
btrfs: fix inode list leak during backref walking at find_parent_nodes()
[ Upstream commit 92876eec382a0f19f33d09d2c939e9ca49038ae5 ]
During backref walking, at find_parent_nodes(), if we are dealing with a
data extent and we get an error while resolving the indirect backrefs, at
resolve_indirect_refs(), or in the while loop that iterates over the refs
in the direct refs rbtree, we end up leaking the inode lists attached to
the direct refs we have in the direct refs rbtree that were not yet added
to the refs ulist passed as argument to find_parent_nodes(). Since they
were not yet added to the refs ulist and prelim_release() does not free
the lists, on error the caller can only free the lists attached to the
refs that were added to the refs ulist, all the remaining refs get their
inode lists never freed, therefore leaking their memory.
Fix this by having prelim_release() always free any attached inode list
to each ref found in the rbtree, and have find_parent_nodes() set the
ref's inode list to NULL once it transfers ownership of the inode list
to a ref added to the refs ulist passed to find_parent_nodes().
Fixes:
|
||
|
2c0329406b |
btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
[ Upstream commit 5614dc3a47e3310fbc77ea3b67eaadd1c6417bf1 ]
During backref walking, at resolve_indirect_refs(), if we get an error
we jump to the 'out' label and call ulist_free() on the 'parents' ulist,
which frees all the elements in the ulist - however that does not free
any inode lists that may be attached to elements, through the 'aux' field
of a ulist node, so we end up leaking lists if we have any attached to
the unodes.
Fix this by calling free_leaf_list() instead of ulist_free() when we exit
from resolve_indirect_refs(). The static function free_leaf_list() is
moved up for this to be possible and it's slightly simplified by removing
unnecessary code.
Fixes:
|
||
|
13a2719ec8 |
btrfs: fix processing of delayed tree block refs during backref walking
[ Upstream commit 943553ef9b51db303ab2b955c1025261abfdf6fb ] During backref walking, when processing a delayed reference with a type of BTRFS_TREE_BLOCK_REF_KEY, we have two bugs there: 1) We are accessing the delayed references extent_op, and its key, without the protection of the delayed ref head's lock; 2) If there's no extent op for the delayed ref head, we end up with an uninitialized key in the stack, variable 'tmp_op_key', and then pass it to add_indirect_ref(), which adds the reference to the indirect refs rb tree. This is wrong, because indirect references should have a NULL key when we don't have access to the key, and in that case they should be added to the indirect_missing_keys rb tree and not to the indirect rb tree. This means that if have BTRFS_TREE_BLOCK_REF_KEY delayed ref resulting from freeing an extent buffer, therefore with a count of -1, it will not cancel out the corresponding reference we have in the extent tree (with a count of 1), since both references end up in different rb trees. When using fiemap, where we often need to check if extents are shared through shared subtrees resulting from snapshots, it means we can incorrectly report an extent as shared when it's no longer shared. However this is temporary because after the transaction is committed the extent is no longer reported as shared, as running the delayed reference results in deleting the tree block reference from the extent tree. Outside the fiemap context, the result is unpredictable, as the key was not initialized but it's used when navigating the rb trees to insert and search for references (prelim_ref_compare()), and we expect all references in the indirect rb tree to have valid keys. The following reproducer triggers the second bug: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount -o compress $DEV $MNT # With a compressed 128M file we get a tree height of 2 (level 1 root). xfs_io -f -c "pwrite -b 1M 0 128M" $MNT/foo btrfs subvolume snapshot $MNT $MNT/snap # Fiemap should output 0x2008 in the flags column. # 0x2000 means shared extent # 0x8 means encoded extent (because it's compressed) echo echo "fiemap after snapshot, range [120M, 120M + 128K):" xfs_io -c "fiemap -v 120M 128K" $MNT/foo echo # Overwrite one extent and fsync to flush delalloc and COW a new path # in the snapshot's tree. # # After this we have a BTRFS_DROP_DELAYED_REF delayed ref of type # BTRFS_TREE_BLOCK_REF_KEY with a count of -1 for every COWed extent # buffer in the path. # # In the extent tree we have inline references of type # BTRFS_TREE_BLOCK_REF_KEY, with a count of 1, for the same extent # buffers, so they should cancel each other, and the extent buffers in # the fs tree should no longer be considered as shared. # echo "Overwriting file range [120M, 120M + 128K)..." xfs_io -c "pwrite -b 128K 120M 128K" $MNT/snap/foo xfs_io -c "fsync" $MNT/snap/foo # Fiemap should output 0x8 in the flags column. The extent in the range # [120M, 120M + 128K) is no longer shared, it's now exclusive to the fs # tree. echo echo "fiemap after overwrite range [120M, 120M + 128K):" xfs_io -c "fiemap -v 120M 128K" $MNT/foo echo umount $MNT Running it before this patch: $ ./test.sh (...) wrote 134217728/134217728 bytes at offset 0 128 MiB, 128 ops; 0.1152 sec (1.085 GiB/sec and 1110.5809 ops/sec) Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap' fiemap after snapshot, range [120M, 120M + 128K): /mnt/sdj/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [245760..246015]: 34304..34559 256 0x2008 Overwriting file range [120M, 120M + 128K)... wrote 131072/131072 bytes at offset 125829120 128 KiB, 1 ops; 0.0001 sec (683.060 MiB/sec and 5464.4809 ops/sec) fiemap after overwrite range [120M, 120M + 128K): /mnt/sdj/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [245760..246015]: 34304..34559 256 0x2008 The extent in the range [120M, 120M + 128K) is still reported as shared (0x2000 bit set) after overwriting that range and flushing delalloc, which is not correct - an entire path was COWed in the snapshot's tree and the extent is now only referenced by the original fs tree. Running it after this patch: $ ./test.sh (...) wrote 134217728/134217728 bytes at offset 0 128 MiB, 128 ops; 0.1198 sec (1.043 GiB/sec and 1068.2067 ops/sec) Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap' fiemap after snapshot, range [120M, 120M + 128K): /mnt/sdj/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [245760..246015]: 34304..34559 256 0x2008 Overwriting file range [120M, 120M + 128K)... wrote 131072/131072 bytes at offset 125829120 128 KiB, 1 ops; 0.0001 sec (694.444 MiB/sec and 5555.5556 ops/sec) fiemap after overwrite range [120M, 120M + 128K): /mnt/sdj/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [245760..246015]: 34304..34559 256 0x8 Now the extent is not reported as shared anymore. So fix this by passing a NULL key pointer to add_indirect_ref() when processing a delayed reference for a tree block if there's no extent op for our delayed ref head with a defined key. Also access the extent op only after locking the delayed ref head's lock. The reproducer will be converted later to a test case for fstests. Fixes: |
||
|
b397ce3477 |
btrfs: fix processing of delayed data refs during backref walking
[ Upstream commit 4fc7b57228243d09c0d878873bf24fa64a90fa01 ]
When processing delayed data references during backref walking and we are
using a share context (we are being called through fiemap), whenever we
find a delayed data reference for an inode different from the one we are
interested in, then we immediately exit and consider the data extent as
shared. This is wrong, because:
1) This might be a DROP reference that will cancel out a reference in the
extent tree;
2) Even if it's an ADD reference, it may be followed by a DROP reference
that cancels it out.
In either case we should not exit immediately.
Fix this by never exiting when we find a delayed data reference for
another inode - instead add the reference and if it does not cancel out
other delayed reference, we will exit early when we call
extent_is_shared() after processing all delayed references. If we find
a drop reference, then signal the code that processes references from
the extent tree (add_inline_refs() and add_keyed_refs()) to not exit
immediately if it finds there a reference for another inode, since we
have delayed drop references that may cancel it out. In this later case
we exit once we don't have references in the rb trees that cancel out
each other and have two references for different inodes.
Example reproducer for case 1):
$ cat test-1.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
xfs_io -f -c "pwrite 0 64K" $MNT/foo
cp --reflink=always $MNT/foo $MNT/bar
echo
echo "fiemap after cloning:"
xfs_io -c "fiemap -v" $MNT/foo
rm -f $MNT/bar
echo
echo "fiemap after removing file bar:"
xfs_io -c "fiemap -v" $MNT/foo
umount $MNT
Running it before this patch, the extent is still listed as shared, it has
the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
$ ./test-1.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
Example reproducer for case 2):
$ cat test-2.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
xfs_io -f -c "pwrite 0 64K" $MNT/foo
cp --reflink=always $MNT/foo $MNT/bar
# Flush delayed references to the extent tree and commit current
# transaction.
sync
echo
echo "fiemap after cloning:"
xfs_io -c "fiemap -v" $MNT/foo
rm -f $MNT/bar
echo
echo "fiemap after removing file bar:"
xfs_io -c "fiemap -v" $MNT/foo
umount $MNT
Running it before this patch, the extent is still listed as shared, it has
the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
$ ./test-2.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
After this patch, after deleting bar in both tests, the extent is not
reported with the 0x2000 flag anymore, it gets only the flag 0x1
(which is FIEMAP_EXTENT_LAST):
$ ./test-1.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x1
$ ./test-2.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x1
These tests will later be converted to a test case for fstests.
Fixes:
|
||
|
715fe15785 |
btrfs: scrub: try to fix super block errors
[ Upstream commit f9eab5f0bba76742af654f33d517bf62a0db8f12 ] [BUG] The following script shows that, although scrub can detect super block errors, it never tries to fix it: mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2 xfs_io -c "pwrite 67108864 4k" $dev2 mount $dev1 $mnt btrfs scrub start -B $dev2 btrfs scrub start -Br $dev2 umount $mnt The first scrub reports the super error correctly: scrub done for f3289218-abd3-41ac-a630-202f766c0859 Scrub started: Tue Aug 2 14:44:11 2022 Status: finished Duration: 0:00:00 Total to scrub: 1.26GiB Rate: 0.00B/s Error summary: super=1 Corrected: 0 Uncorrectable: 0 Unverified: 0 But the second read-only scrub still reports the same super error: Scrub started: Tue Aug 2 14:44:11 2022 Status: finished Duration: 0:00:00 Total to scrub: 1.26GiB Rate: 0.00B/s Error summary: super=1 Corrected: 0 Uncorrectable: 0 Unverified: 0 [CAUSE] The comments already shows that super block can be easily fixed by committing a transaction: /* * If we find an error in a super block, we just report it. * They will get written with the next transaction commit * anyway */ But the truth is, such assumption is not always true, and since scrub should try to repair every error it found (except for read-only scrub), we should really actively commit a transaction to fix this. [FIX] Just commit a transaction if we found any super block errors, after everything else is done. We cannot do this just after scrub_supers(), as btrfs_commit_transaction() will try to pause and wait for the running scrub, thus we can not call it with scrub_lock hold. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
26b7c0ac49 |
btrfs: fix race between quota enable and quota rescan ioctl
commit 331cd9461412e103d07595a10289de90004ac890 upstream. When enabling quotas, at btrfs_quota_enable(), after committing the transaction, we change fs_info->quota_root to point to the quota root we created and set BTRFS_FS_QUOTA_ENABLED at fs_info->flags. Then we try to start the qgroup rescan worker, first by initializing it with a call to qgroup_rescan_init() - however if that fails we end up freeing the quota root but we leave fs_info->quota_root still pointing to it, this can later result in a use-after-free somewhere else. We have previously set the flags BTRFS_FS_QUOTA_ENABLED and BTRFS_QGROUP_STATUS_FLAG_ON, so we can only fail with -EINPROGRESS at btrfs_quota_enable(), which is possible if someone already called the quota rescan ioctl, and therefore started the rescan worker. So fix this by ignoring an -EINPROGRESS and asserting we can't get any other error. Reported-by: Ye Bin <yebin10@huawei.com> Link: https://lore.kernel.org/linux-btrfs/20220823015931.421355-1-yebin10@huawei.com/ CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
9ab0c653ef |
btrfs: harden identification of a stale device
commit 770c79fb65506fc7c16459855c3839429f46cb32 upstream. Identifying and removing the stale device from the fs_uuids list is done by btrfs_free_stale_devices(). btrfs_free_stale_devices() in turn depends on device_path_matched() to check if the device appears in more than one btrfs_device structure. The matching of the device happens by its path, the device path. However, when device mapper is in use, the dm device paths are nothing but a link to the actual block device, which leads to the device_path_matched() failing to match. Fix this by matching the dev_t as provided by lookup_bdev() instead of plain string compare of the device paths. Reported-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
06ebb40b87 |
btrfs: tree-checker: check for overlapping extent items
[ Upstream commit 899b7f69f244e539ea5df1b4d756046337de44a5 ] We're seeing a weird problem in production where we have overlapping extent items in the extent tree. It's unclear where these are coming from, and in debugging we realized there's no check in the tree checker for this sort of problem. Add a check to the tree-checker to make sure that the extents do not overlap each other. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
2608885a4f |
btrfs: unify lookup return value when dir entry is missing
[ Upstream commit 8dcbc26194eb872cc3430550fb70bb461424d267 ] btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() lookup for dir entries and both are used during log replay or when updating a log tree during an unlink. However when the dir item does not exists, btrfs_lookup_dir_item() returns NULL while btrfs_lookup_dir_index_item() returns PTR_ERR(-ENOENT), and if the dir item exists but there is no matching entry for a given name or index, both return NULL. This makes the call sites during log replay to be more verbose than necessary and it makes it easy to miss this slight difference. Since we don't need to distinguish between those two cases, make btrfs_lookup_dir_index_item() always return NULL when there is no matching directory entry - either because there isn't any dir entry or because there is one but it does not match the given name and index. Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to 'index' since it is supposed to match an index number, and the name 'objectid' is not very good because it can easily be confused with an inode number (like the inode number a dir entry points to). CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
1fe3375cf2 |
btrfs: do not pin logs too early during renames
[ Upstream commit bd54f381a12ac695593271a663d36d14220215b2 ]
During renames we pin the logs of the roots a bit too early, before the
calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
since those will not change anything in a log tree.
In a scenario where we have multiple and diverse filesystem operations
running in parallel, those calls can take a significant amount of time,
due to lock contention on extent buffers, and delay log commits from other
tasks for longer than necessary.
So just pin logs after calls to btrfs_insert_inode_ref() and right before
the first operation that can update a log tree.
The following script that uses dbench was used for testing:
$ cat dbench-test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 120 16
umount $MNT
The tests were run on a machine with 12 cores, 64G of RAN, a NVMe device
and using a non-debug kernel config (Debian's default config).
The results compare a branch without this patch and without the previous
patch in the series, that has the subject:
"btrfs: eliminate some false positives when checking if inode was logged"
Versus the same branch with these two patches applied.
dbench with 8 clients, results before:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4391359 0.009 249.745
Close 3225882 0.001 3.243
Rename 185953 0.065 240.643
Unlink 886669 0.049 249.906
Deltree 112 2.455 217.433
Mkdir 56 0.002 0.004
Qpathinfo 3980281 0.004 3.109
Qfileinfo 697579 0.001 0.187
Qfsinfo 729780 0.002 2.424
Sfileinfo 357764 0.004 1.415
Find 1538861 0.016 4.863
WriteX
|
||
|
e9b4baabf8 |
btrfs: introduce btrfs_lookup_match_dir
[ Upstream commit a7d1c5dc8632e9b370ad26478c468d4e4e29f263 ] btrfs_search_slot is called in multiple places in dir-item.c to search for a dir entry, and then calling btrfs_match_dir_name to return a btrfs_dir_item. In order to reduce the number of callers of btrfs_search_slot, create a common function that looks for the dir key, and if found call btrfs_match_dir_item_name. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
dd3365d3b4 |
btrfs: check if root is readonly while setting security xattr
commit b51111271b0352aa596c5ae8faf06939e91b3b68 upstream. For a filesystem which has btrfs read-only property set to true, all write operations including xattr should be denied. However, security xattr can still be changed even if btrfs ro property is true. This happens because xattr_permission() does not have any restrictions on security.*, system.* and in some cases trusted.* from VFS and the decision is left to the underlying filesystem. See comments in xattr_permission() for more details. This patch checks if the root is read-only before performing the set xattr operation. Testcase: DEV=/dev/vdb MNT=/mnt mkfs.btrfs -f $DEV mount $DEV $MNT echo "file one" > $MNT/f1 setfattr -n "security.one" -v 2 $MNT/f1 btrfs property set /mnt ro true setfattr -n "security.one" -v 1 $MNT/f1 umount $MNT CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
5b44dcf8b7 |
btrfs: add info when mount fails due to stale replace target
commit f2c3bec215694fb8bc0ef5010f2a758d1906fc2d upstream. If the replace target device reappears after the suspended replace is cancelled, it blocks the mount operation as it can't find the matching replace-item in the metadata. As shown below, BTRFS error (device sda5): replace devid present without an active replace item To overcome this situation, the user can run the command btrfs device scan --forget <replace target device> and try the mount command again. And also, to avoid repeating the issue, superblock on the devid=0 must be wiped. wipefs -a device-path-to-devid=0. This patch adds some info when this situation occurs. Reported-by: Samuel Greiner <samuel@balkonien.org> Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/ CC: stable@vger.kernel.org # 5.0+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |