Commit Graph

781 Commits

Author SHA1 Message Date
Filipe Manana
656f30dba7 Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Josef Bacik
47ab2a6c68 Btrfs: remove empty block groups automatically
One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space.  This happens because all the chunks were allocated for
data since the metadata requirements were so low.  But now there's a bunch of
empty data block groups and not enough metadata space to do anything.  This
patch solves this problem by automatically deleting empty block groups.  If we
notice the used count go down to 0 when deleting or on mount notice that a block
group has a used count of 0 then we will queue it to be deleted.

When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it.  This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:21 -07:00
Miao Xie
8b110e393c Btrfs: implement repair function when direct read fails
This patch implement data repair function when direct read fails.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
  mirror.
- When the io on the mirror ends, we will insert the endio work into the
  dedicated btrfs workqueue, not common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue, if we
  insert the endio work of the io on the mirror into that workqueue, deadlock
  would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
  mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:01 -07:00
Miao Xie
ce7213c70c Btrfs: fix wrong device bytes_used in the super block
device->bytes_used will be changed when allocating a new chunk, and
disk_total_size will be changed if resizing is successful.
Meanwhile, the on-disk super blocks of the previous transaction
might not be updated. Considering the consistency of the metadata
in the previous transaction, We should use the size in the previous
transaction to check if the super block is beyond the boundary
of the device.

Though it is not big problem because we don't use it now, but anyway
it is better that we make it be consistent with the common metadata,
maybe we will use it in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:34 -07:00
Miao Xie
935e5cc935 Btrfs: fix wrong disk size when writing super blocks
total_size will be changed when resizing a device, and disk_total_size
will be changed if resizing is successful. Meanwhile, the on-disk super
blocks of the previous transaction might not be updated. Considering
the consistency of the metadata in the previous transaction, We should
use the size in the previous transaction to check if the super block is
beyond the boundary of the device. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:33 -07:00
Li RongQing
82f70d62f7 btrfs: remove the wrong comments
This comments became wrong after c3c532[bdi: add helper function for
doing init and register of a bdi for a file system], so remove them.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:28 -07:00
Andrey Utkin
56094eecd3 btrfs: Drop stray check of fixup_workers creation
The issue was introduced in a79b7d4b3e,
adding allocation of extent_workers, so this stray check is surely not
meant to be a check of something else.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=82021
Reported-by: Maks Naumov <maksqwe1@ukr.net>
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:03 -07:00
Wang Shilong
29549aec76 Btrfs: print btrfs specific info for some fatal error cases
Marc argued that if there are several btrfs filesystems mounted,
while users even don't know which filesystem hit the corrupted
errors something like generation verification failure.

Since @extent_buffer structure has a member @fs_info, let's output
btrfs device info.

Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:28 -07:00
David Sterba
707e8a0715 btrfs: use nodesize everywhere, kill leafsize
The nodesize and leafsize were never of different values. Unify the
usage and make nodesize the one. Cleanup the redundant checks and
helpers.

Shaves a few bytes from .text:

  text    data     bss     dec     hex filename
852418   24560   23112  900090   dbbfa btrfs.ko.before
851074   24584   23112  898770   db6d2 btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:14 -07:00
David Sterba
3abdbd780e btrfs: make close_ctree return void
There's no user of the return value and we can get rid of the comment in
put_super.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:11 -07:00
David Sterba
57cdc8db21 btrfs: cleanup ino cache members of btrfs_root
The naming is confusing, generic yet used for a specific cache. Add a
prefix 'ino_' or rename appropriately.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:09 -07:00
Liu Bo
9e0af23764 Btrfs: fix task hang under heavy compressed write
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.

Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.

Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --

normal_work_helper(arg)
    work = container_of(arg, struct btrfs_work, normal_work);

    work->func() <---- (we name it work X)
    for ordered_work in wq->ordered_list
            ordered_work->ordered_func()
            ordered_work->ordered_free()

The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will

file a readahead request
    btrfs_readpages()
         for page that is not in page cache
                __do_readpage()
                     submit_extent_page()
                           btrfs_submit_bio_hook()
                                 btrfs_bio_wq_end_io()
                                 submit_bio()
                                 end_workqueue_bio() <--(ret by the 1st endio)
                                      queue a work(named work Y) for the 2nd
                                      also the real endio()

So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.

A bit more explanation,

A,B,C -- struct btrfs_work
arg   -- struct work_struct

kthread:
worker_thread()
    pick up a work_struct from @worklist
    process_one_work(arg)
	worker->current_work = arg;  <-- arg is A->normal_work
	worker->current_func(arg)
		normal_work_helper(arg)
		     A = container_of(arg, struct btrfs_work, normal_work);

		     A->func()
		     A->ordered_func()
		     A->ordered_free()  <-- A gets freed

		     B->ordered_func()
			  submit_compressed_extents()
			      find_free_extent()
				  load_free_space_inode()
				      ...   <-- (the above readhead stack)
				      end_workqueue_bio()
					   btrfs_queue_work(work C)
		     B->ordered_free()

As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.

Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).

When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.

So the situation is that our kthread is waiting forever on work C.

Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.

With this patch, I no long hit the above hang.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-24 07:17:02 -07:00
Miao Xie
7df69d3e94 Btrfs: Fix wrong device size when we are resizing the device
total_bytes of device is just a in-memory variant which is used to record
the size of the device, and it might be changed before we resize a device,
if the resize operation fails, it will be fallbacked. But some code used it
to update on-disk metadata of the device, it would cause the problem that
on-disk metadata of the devices was not consistent. We should use the other
variant named disk_total_bytes to update the on-disk metadata of device,
because that variant is updated only when the resize operation is successful.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:18 -07:00
Chris Mason
8d875f95da btrfs: disable strict file flushes for renames and truncates
Truncates and renames are often used to replace old versions of a file
with new versions.  Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.

Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.

This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held.  It's rare, but these things can
deadlock.

This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.

Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:42 -07:00
Wang Shilong
5f3164813b Btrfs: fix race between balance recovery and root deletion
Balance recovery is called when RW mounting or remounting from
RO to RW, it is called to finish roots merging.

When doing balance recovery, relocation root's corresponding
fs root(whose root refs is 0) might be destroyed by cleaner
thread, this will make btrfs fail to mount.

Fix this problem by holding @cleaner_mutex when doing balance
recovery.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:04 -07:00
Josef Bacik
472b909ff6 btrfs: only unlock block in verify_parent_transid if we locked it
This is a regression from my patch a26e8c9f75, we
need to only unlock the block if we were the one who locked it.  Otherwise this
will trip BUG_ON()'s in locking.c  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:47 -07:00
Chris Mason
a79b7d4b3e Btrfs: async delayed refs
Delayed extent operations are triggered during transaction commits.
The goal is to queue up a healthly batch of changes to the extent
allocation tree and run through them in bulk.

This farms them off to async helper threads.  The goal is to have the
bulk of the delayed operations being done in the background, but this is
also important to limit our stack footprint.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:58 -07:00
Josef Bacik
faa2dbf004 Btrfs: add sanity tests for new qgroup accounting code
This exercises the various parts of the new qgroup accounting code.  We do some
basic stuff and do some things with the shared refs to make sure all that code
works.  I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:49 -07:00
Josef Bacik
fcebe4562d Btrfs: rework qgroup accounting
Currently qgroups account for space by intercepting delayed ref updates to fs
trees.  It does this by adding sequence numbers to delayed ref updates so that
it can figure out how the tree looked before the update so we can adjust the
counters properly.  The problem with this is that it does not allow delayed refs
to be merged, so if you say are defragging an extent with 5k snapshots pointing
to it we will thrash the delayed ref lock because we need to go back and
manually merge these things together.  Instead we want to process quota changes
when we know they are going to happen, like when we first allocate an extent, we
free a reference for an extent, we add new references etc.  This patch
accomplishes this by only adding qgroup operations for real ref changes.  We
only modify the sequence number when we need to lookup roots for bytenrs, this
reduces the amount of churn on the sequence number and allows us to merge
delayed refs as we add them most of the time.  This patch encompasses a bunch of
architectural changes

1) qgroup ref operations: instead of tracking qgroup operations through the
delayed refs we simply add new ref operations whenever we notice that we need to
when we've modified the refs themselves.

2) tree mod seq:  we no longer have this separation of major/minor counters.
this makes the sequence number stuff much more sane and we can remove some
locking that was needed to protect the counter.

3) delayed ref seq: we now read the tree mod seq number and use that as our
sequence.  This means each new delayed ref doesn't have it's own unique sequence
number, rather whenever we go to lookup backrefs we inc the sequence number so
we can make sure to keep any new operations from screwing up our world view at
that given point.  This allows us to merge delayed refs during runtime.

With all of these changes the delayed ref stuff is a little saner and the qgroup
accounting stuff no longer goes negative in some cases like it was before.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:48 -07:00
Filipe Manana
1f21ef0a34 Btrfs: check if items are ordered when a leaf is marked dirty
To ease finding bugs during development related to modifying btree leaves
in such a way that it makes its items not sorted by key anymore. Since this
is an expensive check, it's only enabled if CONFIG_BTRFS_FS_CHECK_INTEGRITY
is set, which isn't meant to be enabled for regular users.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:45 -07:00
Wang Shilong
de348ee022 Btrfs: make sure there are not any read requests before stopping workers
In close_ctree(), after we have stopped all workers,there maybe still
some read requests(for example readahead) to submit and this *maybe* trigger
an oops that user reported before:

kernel BUG at fs/btrfs/async-thread.c:619!

By hacking codes, i can reproduce this problem with one cpu available.
We fix this potential problem by invalidating all btree inode pages before
stopping all workers.

Thanks to Miao for pointing out this problem.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:43 -07:00
Tsutomu Itoh
59885b3930 Btrfs: fix possible memory leak in btrfs_create_tree()
In btrfs_create_tree(), if btrfs_insert_root() fails, we should
free root->commit_root.

Reported-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:42 -07:00
Miao Xie
27cdeb7096 Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:40 -07:00
Qu Wenruo
65d33fd7a6 btrfs: Add check to avoid cleanup roots already in fs_info->dead_roots.
Current btrfs_orphan_cleanup will also cleanup roots which is already in
fs_info->dead_roots without protection.
This will have conditional race with fs_info->cleaner_kthread.

This patch will use refs in root->root_item to detect roots in
dead_roots and avoid conflicts.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:35 -07:00
Miao Xie
21c7e75654 Btrfs: reclaim the reserved metadata space at background
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And When the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. At some cases, they would be blocked for
a long time, it made the performance fluctuate wildly.

So we introduce the background metadata space reclamation, when the space
is about to be exhausted, we insert a reclaim work into the workqueue, the
worker of the workqueue helps us to reclaim the reserved space at the
background. By this way, the tasks needn't reclaim the space by themselves at
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.

Here is my test result(Tested by compilebench):
 Memory:	2GB
 CPU:		2Cores * 1CPU
 Partition:	40GB(SSD)

Test command:
 # compilebench -D <mnt> -m

Without this patch:
 intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
 compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
 read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
 delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)

With this patch:
 intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
 compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
 read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
 delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:34 -07:00
Linus Torvalds
33c0022f0e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: limit the path size in send to PATH_MAX
  Btrfs: correctly set profile flags on seqlock retry
  Btrfs: use correct key when repeating search for extent item
  Btrfs: fix inode caching vs tree log
  Btrfs: fix possible memory leaks in open_ctree()
  Btrfs: avoid triggering bug_on() when we fail to start inode caching task
  Btrfs: move btrfs_{set,clear}_and_info() to ctree.h
  btrfs: replace error code from btrfs_drop_extents
  btrfs: Change the hole range to a more accurate value.
  btrfs: fix use-after-free in mount_subvol()
2014-04-27 13:26:28 -07:00
Wang Shilong
28c16cbbc3 Btrfs: fix possible memory leaks in open_ctree()
Fix possible memory leaks in the following error handling paths:

read_tree_block()
btrfs_recover_log_trees
btrfs_commit_super()
btrfs_find_orphan_roots()
btrfs_cleanup_fs_roots()

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24 16:43:32 -07:00
Linus Torvalds
3123bca719 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull second set of btrfs updates from Chris Mason:
 "The most important changes here are from Josef, fixing a btrfs
  regression in 3.14 that can cause corruptions in the extent allocation
  tree when snapshots are in use.

  Josef also fixed some deadlocks in send/recv and other assorted races
  when balance is running"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
  Btrfs: fix compile warnings on on avr32 platform
  btrfs: allow mounting btrfs subvolumes with different ro/rw options
  btrfs: export global block reserve size as space_info
  btrfs: fix crash in remount(thread_pool=) case
  Btrfs: abort the transaction when we don't find our extent ref
  Btrfs: fix EINVAL checks in btrfs_clone
  Btrfs: fix unlock in __start_delalloc_inodes()
  Btrfs: scrub raid56 stripes in the right way
  Btrfs: don't compress for a small write
  Btrfs: more efficient io tree navigation on wait_extent_bit
  Btrfs: send, build path string only once in send_hole
  btrfs: filter invalid arg for btrfs resize
  Btrfs: send, fix data corruption due to incorrect hole detection
  Btrfs: kmalloc() doesn't return an ERR_PTR
  Btrfs: fix snapshot vs nocow writting
  btrfs: Change the expanding write sequence to fix snapshot related bug.
  btrfs: make device scan less noisy
  btrfs: fix lockdep warning with reclaim lock inversion
  Btrfs: hold the commit_root_sem when getting the commit root during send
  Btrfs: remove transaction from send
  ...
2014-04-11 14:16:53 -07:00
Josef Bacik
9e351cc862 Btrfs: remove transaction from send
Lets try this again.  We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send.  So fix this by making sure looking at the
commit roots is always going to be consistent.  We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once.  Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes.  With this patch we no longer
deadlock and pass all the weird send/receive corner cases.  Thanks,

Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:39:30 -07:00
Josef Bacik
a26e8c9f75 Btrfs: don't clear uptodate if the eb is under IO
So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel.  This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed.  Turns out this is because of a
bad interaction of balance and send/receive.  Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening.  So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid.  But because we think it is invalid we clear uptodate and
re-read in the block.  If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.

Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data.  So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06 17:34:37 -07:00
Linus Torvalds
53c566625f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs changes from Chris Mason:
 "This is a pretty long stream of bug fixes and performance fixes.

  Qu Wenruo has replaced the btrfs async threads with regular kernel
  workqueues.  We'll keep an eye out for performance differences, but
  it's nice to be using more generic code for this.

  We still have some corruption fixes and other patches coming in for
  the merge window, but this batch is tested and ready to go"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
  Btrfs: fix a crash of clone with inline extents's split
  btrfs: fix uninit variable warning
  Btrfs: take into account total references when doing backref lookup
  Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
  Btrfs: fix incremental send's decision to delay a dir move/rename
  Btrfs: remove unnecessary inode generation lookup in send
  Btrfs: fix race when updating existing ref head
  btrfs: Add trace for btrfs_workqueue alloc/destroy
  Btrfs: less fs tree lock contention when using autodefrag
  Btrfs: return EPERM when deleting a default subvolume
  Btrfs: add missing kfree in btrfs_destroy_workqueue
  Btrfs: cache extent states in defrag code path
  Btrfs: fix deadlock with nested trans handles
  Btrfs: fix possible empty list access when flushing the delalloc inodes
  Btrfs: split the global ordered extents mutex
  Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
  Btrfs: reclaim delalloc metadata more aggressively
  Btrfs: remove unnecessary lock in may_commit_transaction()
  Btrfs: remove the unnecessary flush when preparing the pages
  Btrfs: just do dirty page flush for the inode with compression before direct IO
  ...
2014-04-04 15:31:36 -07:00
Miao Xie
573bfb72f7 Btrfs: fix possible empty list access when flushing the delalloc inodes
We didn't have a lock to protect the access to the delalloc inodes list, that is
we might access a empty delalloc inodes list if someone start flushing delalloc
inodes because the delalloc inodes were moved into a other list temporarily.
Fix it by wrapping the access with a lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:29 -04:00
Miao Xie
31f3d255c6 Btrfs: split the global ordered extents mutex
When we create a snapshot, we just need wait the ordered extents in
the source fs/file root, but because we use the global mutex to protect
this ordered extents list of the source fs/file root to avoid accessing
a empty list, if someone got the mutex to access the ordered extents list
of the other fs/file root, we had to wait.

This patch splits the above global mutex, now every fs/file root has
its own mutex to protect its own list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:28 -04:00
Miao Xie
8257b2dc3c Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume
If the snapshot creation happened after the nocow write but before the dirty
data flush, we would fail to flush the dirty data because of no space.

So we must keep track of when those nocow write operations start and when they
end, if there are nocow writers, the snapshot creators must wait. In order
to implement this function, I introduce btrfs_{start, end}_nocow_write(),
which is similar to mnt_{want,drop}_write().

These two functions are only used for nocow file write operations.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:22 -04:00
Qu Wenruo
d458b0540e btrfs: Cleanup the "_struct" suffix in btrfs_workequeue
Since the "_struct" suffix is mainly used for distinguish the differnt
btrfs_work between the original and the newly created one,
there is no need using the suffix since all btrfs_workers are changed
into btrfs_workqueue.

Also this patch fixed some codes whose code style is changed due to the
too long "_struct" suffix.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:16 -04:00
Qu Wenruo
a046e9c88b btrfs: Cleanup the old btrfs_worker.
Since all the btrfs_worker is replaced with the newly created
btrfs_workqueue, the old codes can be easily remove.

Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:15 -04:00
Qu Wenruo
fc97fab0ea btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue.
Replace the fs_info->qgroup_rescan_worker with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:13 -04:00
Qu Wenruo
5b3bc44e2e btrfs: Replace fs_info->delayed_workers workqueue with btrfs_workqueue.
Replace the fs_info->delayed_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:12 -04:00
Qu Wenruo
dc6e320998 btrfs: Replace fs_info->fixup_workers workqueue with btrfs_workqueue.
Replace the fs_info->fixup_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:12 -04:00
Qu Wenruo
736cfa15e8 btrfs: Replace fs_info->readahead_workers workqueue with btrfs_workqueue.
Replace the fs_info->readahead_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:11 -04:00
Qu Wenruo
e66f0bb144 btrfs: Replace fs_info->cache_workers workqueue with btrfs_workqueue.
Replace the fs_info->cache_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:10 -04:00
Qu Wenruo
d05a33ac26 btrfs: Replace fs_info->rmw_workers workqueue with btrfs_workqueue.
Replace the fs_info->rmw_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:09 -04:00
Qu Wenruo
fccb5d86d8 btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue.
Replace the fs_info->endio_* workqueues with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:08 -04:00
Qu Wenruo
a44903abe9 btrfs: Replace fs_info->flush_workers with btrfs_workqueue.
Replace the fs_info->submit_workers with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:07 -04:00
Qu Wenruo
a8c93d4ef6 btrfs: Replace fs_info->submit_workers with btrfs_workqueue.
Much like the fs_info->workers, replace the fs_info->submit_workers
use the same btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:07 -04:00
Qu Wenruo
afe3d24267 btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue
Much like the fs_info->workers, replace the fs_info->delalloc_workers
use the same btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:06 -04:00
Qu Wenruo
5cdc7ad337 btrfs: Replace fs_info->workers with btrfs_workqueue.
Use the newly created btrfs_workqueue_struct to replace the original
fs_info->workers

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:05 -04:00
Miao Xie
d1433debe7 Btrfs: just wait or commit our own log sub-transaction
We might commit the log sub-transaction which didn't contain the metadata we
logged. It was because we didn't record the log transid and just select
the current log sub-transaction to commit, but the right one might be
committed by the other task already. Actually, we needn't do anything
and it is safe that we go back directly in this case.

This patch improves the log sync by the above idea. We record the transid
of the log sub-transaction in which we log the metadata, and the transid
of the log sub-transaction we have committed. If the committed transid
is >= the transid we record when logging the metadata, we just go back.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:43 -04:00
Miao Xie
8b050d350c Btrfs: fix skipped error handle when log sync failed
It is possible that many tasks sync the log tree at the same time, but
only one task can do the sync work, the others will wait for it. But those
wait tasks didn't get the result of the log sync, and returned 0 when they
ended the wait. It caused those tasks skipped the error handle, and the
serious problem was they told the users the file sync succeeded but in
fact they failed.

This patch fixes this problem by introducing a log context structure,
we insert it into the a global list. When the sync fails, we will set
the error number of every log context in the list, then the waiting tasks
get the error number of the log context and handle the error if need.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:43 -04:00
Liu Bo
2a85d9cac1 Btrfs: fix possible deadlock in btrfs_cleanup_transaction
[13654.480669] ======================================================
[13654.480905] [ INFO: possible circular locking dependency detected ]
[13654.481003] 3.12.0+ #4 Tainted: G        W  O
[13654.481060] -------------------------------------------------------
[13654.481060] btrfs-transacti/9347 is trying to acquire lock:
[13654.481060]  (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
[13654.481060] but task is already holding lock:
[13654.481060]  (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs]
[13654.481060] which lock already depends on the new lock.

[13654.481060] the existing dependency chain (in reverse order) is:
[13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}:
[13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
[13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
[13654.481060]        [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs]
[13654.481060]        [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs]
[13654.481060]        [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs]
[13654.481060]        [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs]
[13654.481060]        [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs]
[13654.481060]        [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs]
[13654.481060]        [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs]
[13654.481060]        [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs]
[13654.481060]        [<ffffffff8114be91>] do_writepages+0x21/0x50
[13654.481060]        [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60
[13654.481060]        [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20
[13654.481060]        [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs]
[13654.481060]        [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs]
[13654.481060]        [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs]
[13654.481060]        [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs]
[13654.481060]        [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs]
[13654.481060]        [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs]
[13654.481060]        [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs]
[13654.481060]        [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs]
[13654.481060]        [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs]
[13654.481060]        [<ffffffff811b2409>] mount_fs+0x39/0x1b0
[13654.481060]        [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0
[13654.481060]        [<ffffffff811cfcce>] do_mount+0x23e/0xa90
[13654.481060]        [<ffffffff811d05a3>] SyS_mount+0x83/0xc0
[13654.481060]        [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b
[13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}:
[13654.481060]        [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70
[13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
[13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
[13654.481060]        [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
[13654.481060]        [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs]
[13654.481060]        [<ffffffff81079efa>] kthread+0xea/0xf0
[13654.481060]        [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
[13654.481060] other info that might help us debug this:

[13654.481060]  Possible unsafe locking scenario:

[13654.481060]        CPU0                    CPU1
[13654.481060]        ----                    ----
[13654.481060]   lock(&(&fs_info->ordered_root_lock)->rlock);
[13654.481060]				 lock(&(&root->ordered_extent_lock)->rlock);
[13654.481060]				 lock(&(&fs_info->ordered_root_lock)->rlock);
[13654.481060]   lock(&(&root->ordered_extent_lock)->rlock);
[13654.481060]
 *** DEADLOCK ***
[...]

======================================================

btrfs_destroy_all_ordered_extents()
gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock,
while btrfs_[add,remove]_ordered_extent()
acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock.

This patch fixes the above problem.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:16:37 -04:00