49283 Commits

Author SHA1 Message Date
3d4b4a3e30 xfs: remove bli from AIL before release on transaction abort
When a buffer is modified, logged and committed, it ultimately ends
up sitting on the AIL with a dirty bli waiting for metadata
writeback. If another transaction locks and invalidates the buffer
(freeing an inode chunk, for example) in the meantime, the bli is
flagged as stale, the dirty state is cleared and the bli remains in
the AIL.

If a shutdown occurs before the transaction that has invalidated the
buffer is committed, the transaction is ultimately aborted. The log
items are flagged as such and ->iop_unlock() handles the aborted
items. Because the bli is clean (due to the invalidation),
->iop_unlock() unconditionally releases it. The log item may still
reside in the AIL, however, which means the I/O completion handler
may still run and attempt to access it. This results in assert
failure due to the release of the bli while still present in the AIL
and a subsequent NULL dereference and panic in the buffer I/O
completion handling. This can be reproduced by running generic/388
in repetition.

To avoid this problem, update xfs_buf_item_unlock() to first check
whether the bli is aborted and if so, remove it from the AIL before
it is released. This ensures that the bli is no longer accessed
during the shutdown sequence after it has been freed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
79e641ce29 xfs: release bli from transaction properly on fs shutdown
If a filesystem shutdown occurs with a buffer log item in the CIL
and a log force occurs, the ->iop_unpin() handler is generally
expected to tear down the bli properly. This entails freeing the bli
memory and releasing the associated hold on the buffer so it can be
released and the filesystem unmounted.

If this sequence occurs while ->bli_refcount is elevated (i.e.,
another transaction is open and attempting to modify the buffer),
however, ->iop_unpin() may not be responsible for releasing the bli.
Instead, the transaction may release the final ->bli_refcount
reference and thus xfs_trans_brelse() is responsible for tearing
down the bli.

While xfs_trans_brelse() does drop the reference count, it only
attempts to release the bli if it is clean (i.e., not in the
CIL/AIL). If the filesystem is shutdown and the bli is sitting dirty
in the CIL as noted above, this ends up skipping the last
opportunity to release the bli. In turn, this leaves the hold on the
buffer and causes an unmount hang. This can be reproduced by running
generic/388 in repetition.

Update xfs_trans_brelse() to handle this shutdown corner case
correctly. If the final bli reference is dropped and the filesystem
is shutdown, remove the bli from the AIL (if necessary) and release
the bli to drop the buffer hold and ensure an unmount does not hang.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
0cbe48cc58 xfs: avoid harmless gcc-7 warnings
gcc-7 flags the use of integer math inside of a condition
as a potential bug:

fs/xfs/xfs_bmap_util.c: In function 'xfs_swap_extents_check_format':
fs/xfs/xfs_bmap_util.c:1619:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]
fs/xfs/xfs_bmap_util.c:1629:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context]

There is already a helper function for testing the di_forkoff
field for zero, so let's use that instead to shut up the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
f990fc5ad1 xfs: remove lsn relevant fields from xfs_trans structure and its users
The t_lsn is not used anymore and the t_commit_lsn is used as a tmp
storage for the checkpoint sequence number only in the current code.

And the start/commit lsn are tracked as a transaction group tag in
the xfs_cil_ctx instead of a single transaction, so remove them from
the xfs_trans structure and their users to match with the design.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
3398a4005f xfs: remove XFS_HSIZE
XFS_HSIZE is an extremly confusing way to calculate the size of handle_t.
Given that handle_t always only had two sizes, and one of them isn't
even covered by XFS_HSIZE to start with just remove the macro and use
a constant sizeof expression.

Note that XFS_HSIZE isn't used in xfsprogs, xfsdump or xfstests either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
d4ca1d550d xfs: dump transaction usage details on log reservation overrun
If a transaction log reservation overrun occurs, the ticket data
associated with the reservation is dumped in xfs_log_commit_cil().
This occurs long after the transaction items and details have been
removed from the transaction and effectively lost. This limited set
of ticket data provides very little information to support debugging
transaction overruns based on the typical report.

To improve transaction log reservation overrun reporting, create a
helper to dump transaction details such as log items, log vector
data, etc., as well as the underlying ticket data for the
transaction. Move the overrun detection from xfs_log_commit_cil() to
xlog_cil_insert_items() so it occurs prior to migration of the
logged items to the CIL. Call the new helper such that it is able to
dump this transaction data before it is lost.

Also, warn on overrun to provide callstack context for the offending
transaction and include a few additional messages from
xlog_cil_insert_items() to display the reservation consumed locally
for overhead such as log vector headers, split region headers and
the context ticket. This provides a complete general breakdown of
the reservation consumption of a transaction when/if it happens to
overrun the reservation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
e2f2342639 xfs: refactor xlog_cil_insert_items() to facilitate transaction dump
Transaction reservation overrun detection currently occurs too late
to print useful information about the offending transaction.
Ideally, the transaction data is printed before the associated log
items are moved from the transaction to the CIL, which occurs in
xlog_cil_insert_items(), such that details of the items logged by
the transaction are available for analysis.

Refactor xlog_cil_insert_items() to facilitate moving tx overrun
detection to this function. Update the function to track each bit of
extra log reservation stolen from the transaction (i.e., such as for
the CIL context ticket) and perform the log item migration as the
last operation before the CIL lock is released. This creates a
context where the transaction reservation consumption has been fully
calculated when the log items are moved to the CIL. This patch makes
no functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
7d2d565346 xfs: separate shutdown from ticket reservation print helper
xlog_print_tic_res() pre-dates delayed logging and the committed
items list (CIL) and thus retains some factoring warts, such as hard
coded function names in the output and the fact that it induces a
shutdown.

In preparation for more detailed logging of regular transaction
overrun situations, refactor xlog_print_tic_res() to be slightly
more generic. Reword some of the warning messages and pull the
shutdown into the callers.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
1040960efa xfs: define fatal assert build time tunable
While configurable at runtime, the DEBUG mode assert failure
behavior is usually either desired or not for a particular
situation. For example, developers using kernel modules may prefer
for fatal asserts to remain disabled across module reloads while QE
engineers doing broad regression testing may prefer to have fatal
asserts enabled on boot to facilitate data collection for bug
reports.

To provide a compromise/convenience for developers, create a Kconfig
option that sets the default value of the DEBUG mode 'bug_on_assert'
sysfs tunable. The default behavior remains to trigger kernel BUGs
on assert failures to preserve existing behavior across kernel
configuration updates with DEBUG mode enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
ccdab3d6e8 xfs: define bug_on_assert debug mode sysfs tunable
In DEBUG mode, assert failures unconditionally trigger a kernel BUG.
This is useful in diagnostic situations to panic a system and
collect detailed state information at the time of a failure.

This can also cause problems in cases where DEBUG mode code is
desired but it is preferable not trigger kernel BUGs on assert
failure. For example, during development of new code or during
certain xfstests tests that intentionally cause corruption and test
the kernel for survival (but otherwise may expect to trigger assert
failures).

To provide additional flexibility, create the
<sysfs>/fs/xfs/debug/bug_on_assert tunable to configure assert
failure behavior at runtime. This tunable is only available in DEBUG
mode and is enabled by default to preserve existing default
behavior. When disabled, assert failures in DEBUG mode result in
kernel warnings.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
e1a4e37cc7 xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent
In a pathological scenario where we are trying to bunmapi a single
extent in which every other block is shared, it's possible that trying
to unmap the entire large extent in a single transaction can generate so
many EFIs that we overflow the transaction reservation.

Therefore, use a heuristic to guess at the number of blocks we can
safely unmap from a reflink file's data fork in an single transaction.
This should prevent problems such as the log head slamming into the tail
and ASSERTs that trigger because we've exceeded the transaction
reservation.

Note that since bunmapi can fail to unmap the entire range, we must also
teach the deferred unmap code to roll into a new transaction whenever we
get low on reservation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[hch: random edits, all bugs are my fault]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-19 08:59:10 -07:00
d205a7d0ec xfs: refactor dir2 leaf readahead shadow buffer cleverness
Currently, the dir2 leaf block getdents function uses a complex state
tracking mechanism to create a shadow copy of the block mappings and
then uses the shadow copy to schedule readahead.  Since the read and
readahead functions are perfectly capable of reading the mappings
themselves, we can tear all that out in favor of a simpler function that
simply keeps pushing the readahead window further out.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-19 08:59:10 -07:00
7912e7fef2 xfs: push buffer of flush locked dquot to avoid quotacheck deadlock
Reclaim during quotacheck can lead to deadlocks on the dquot flush
lock:

 - Quotacheck populates a local delwri queue with the physical dquot
   buffers.
 - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and
   dirties all of the dquots.
 - Reclaim kicks in and attempts to flush a dquot whose buffer is
   already queud on the quotacheck queue. The flush succeeds but
   queueing to the reclaim delwri queue fails as the backing buffer is
   already queued. The flush unlock is now deferred to I/O completion
   of the buffer from the quotacheck queue.
 - The dqadjust bulkstat continues and dirties the recently flushed
   dquot once again.
 - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires
   the flush lock to update the backing buffers with the in-core
   recalculated values. It deadlocks on the redirtied dquot as the
   flush lock was already acquired by reclaim, but the buffer resides
   on the local delwri queue which isn't submitted until the end of
   quotacheck.

This is reproduced by running quotacheck on a filesystem with a
couple million inodes in low memory (512MB-1GB) situations. This is
a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write
buffer lists"), which removed a trylock and buffer I/O submission
from the quotacheck dquot flush sequence.

Quotacheck first resets and collects the physical dquot buffers in a
delwri queue. Then, it traverses the filesystem inodes via bulkstat,
updates the in-core dquots, flushes the corrected dquots to the
backing buffers and finally submits the delwri queue for I/O. Since
the backing buffers are queued across the entire quotacheck
operation, dquot reclaim cannot possibly complete a dquot flush
before quotacheck completes.

Therefore, quotacheck must submit the buffer for I/O in order to
cycle the flush lock and flush the dirty in-core dquot to the
buffer. Add a delwri queue buffer push mechanism to submit an
individual buffer for I/O without losing the delwri queue status and
use it from quotacheck to avoid the deadlock. This restores
quotacheck behavior to as before the regression was introduced.

Reported-by: Martin Svec <martin.svec@zoner.cz>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19 08:59:10 -07:00
1be7107fbe mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 21:50:20 +08:00
6e20350659 A fix for an old ceph ->fh_to_* bug from Luis and two timestamp
fixups from Zheng, prompted by the ongoing y2038 work.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZRTAEAAoJEEp/3jgCEfOLNwcH/jfzGqhOS262fU5FCCowfNVJ
 9ANzXiRpWykaHR7iPXOTRmRUqCJOCzhogmwzjnl7bQUX7cJPsFN+R7l9KR+dCAUX
 300dplDWZ5oQCX2c7A7vzRCIgv4wjQjtS0mo+dY/EBCNYcynoAUVmbr/87Ezrroi
 qfmA6pnI6hI527RLBkwIObljoAiy11MjQ1xFj0zS2bckWxfCSauO1v1qSpMhawkn
 v4fAWEKz3y8oUG3MtT7/Ukx4/GJAOcksxKZf93AW0sNwQozCxvB40D/Dda3NcT4s
 xYVVymUTYGTg1I/CmZHZxSqJwtKUOZJLwMFTXEFyo6bQH0Vj2pw/HaRf8Q5ksOU=
 =ClP/
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix for an old ceph ->fh_to_* bug from Luis and two timestamp fixups
  from Zheng, prompted by the ongoing y2038 work"

* tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client:
  ceph: unify inode i_ctime update
  ceph: use current_kernel_time() to get request time stamp
  ceph: check i_nlink while converting a file handle to dentry
2017-06-18 08:23:02 +09:00
adc311034c Changes since last update:
- Fix some bogus ASSERT failures on CONFIG_SMP=n and CONFIG_XFS_DEBUG=y.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZQhB9AAoJEPh/dxk0SrTrMWsP/A1gSHmaS2UI3WvXdqwAxldy
 GiDZ3oSAhn4SF7BY+AhIu+Ot9kwCyOVzd/RZ9zVYmRjTTFxYX7P099uua2e/mf70
 Z1o9oSk9H76srqo7L76h7lI2ONsgd28aqh082hmDrRUhhwy7LZThVV15ZPZFfgU4
 8Q17h0uRUVgSn8M8INsuiMpw1sCLJsXw/9Rb9iFMgi3tSJaGZ1Mm//nA1aeS8HFH
 xCKHYw7YUtVKtIVyVV1NGdtXhXZbNznJXelkUZLMnMgOOmAqWiUa8FOGPNbQEezX
 1VidjzGhxRF7uoUJNnnX5mM+26c8/Ip4cLtqvnFQo30bx+HXO3OR8jywQO+6DD9Q
 NxFHY5D/Peud96xsnq5mRzNMN5fqumyVAyUCXSebT3Oa42HMdva+66s9JoFfDehi
 Z7VmRdoryxexDRbhiQVev4xe20beLtOHAP2I5JrnJddrXb0aFWowLk776wa0uxD8
 7cbKgovikqSHk4/mqpyZ5iQeaufg/kOi2cNaTcAFeCbvbXYieeRXVlisMBh1DKec
 lSX5e4kNNS20VVbCYakAttK69lpZzuraYPDDsnb4HRlmt0VX12lYyqQwklY/YZ9S
 jDagtKWKDm/L/jq2j5Nd3uSycM+lMaq97mIMjzPRrnPjOriME1ZGLwEQbKfBLXnW
 3Flzt5C2Hk1Fb/VlNx4S
 =U99d
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Darrick Wong:
 "One more bugfix for you for 4.12-rc6 to fix something that came up in
  an earlier rc:

   - Fix some bogus ASSERT failures on CONFIG_SMP=n and CONFIG_XFS_DEBUG=y"

* tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix spurious spin_is_locked() assert failures on non-smp kernels
2017-06-17 17:34:41 +09:00
c8636b90a0 Merge branch 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ufs fixes from Al Viro:
 "Fix assorted ufs bugs: a couple of deadlocks, fs corruption in
  truncate(), oopsen on tail unpacking and truncate when racing with
  vmscan, mild fs corruption (free blocks stats summary buggered, *BSD
  fsck would complain and fix), several instances of broken logics
  around reserved blocks (starting with "check almost never triggers
  when it should" and then there are issues with sufficiently large
  UFS2)"

[ Note: ufs hasn't gotten any loving in a long time, because nobody
  really seems to use it. These ufs fixes are triggered by people
  actually caring now, not some sudden influx of new bugs.  - Linus ]

* 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs_truncate_blocks(): fix the case when size is in the last direct block
  ufs: more deadlock prevention on tail unpacking
  ufs: avoid grabbing ->truncate_mutex if possible
  ufs_get_locked_page(): make sure we have buffer_heads
  ufs: fix s_size/s_dsize users
  ufs: fix reserved blocks check
  ufs: make ufs_freespace() return signed
  ufs: fix logics in "ufs: make fsck -f happy"
2017-06-17 17:30:07 +09:00
ccd3d905f7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A couple of fixes; a leak in mntns_install() caught by Andrei (this
  cycle regression) + d_invalidate() softlockup fix - that had been
  reported by a bunch of people lately, but the problem is pretty old"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: don't forget to put old mntns in mntns_install
  Hang/soft lockup in d_invalidate with simultaneous calls
2017-06-17 17:26:53 +09:00
64c2b20301 userfaultfd: shmem: handle coredumping in handle_userfault()
Anon and hugetlbfs handle FOLL_DUMP set by get_dump_page() internally to
__get_user_pages().

shmem as opposed has no special FOLL_DUMP handling there so
handle_mm_fault() is invoked without mmap_sem and ends up calling
handle_userfault() that isn't expecting to be invoked without mmap_sem
held.

This makes handle_userfault() fail immediately if invoked through
shmem_vm_ops->fault during coredumping and solves the problem.

The side effect is a BUG_ON with no lock held triggered by the
coredumping process which exits.  Only 4.11 is affected, pre-4.11 anon
memory holes are skipped in __get_user_pages by checking FOLL_DUMP
explicitly against empty pagetables (mm/gup.c:no_page_table()).

It's zero cost as we already had a check for current->flags to prevent
futex to trigger userfaults during exit (PF_EXITING).

Link: http://lkml.kernel.org/r/20170615214838.27429-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: <stable@vger.kernel.org>	[4.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-17 06:37:05 +09:00
ab2789b72d A fix from Nic for a race seen in production (including a stable tag).
And while I'm sending you this I'm also sneaking in a trivial new helper
 from Bart so that we don't need inter-tree dependencies for the next merge
 window.
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAllDnicLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOkPA//TMmDanqxLjjz12m9TiQoCjo/iFCtv9KpuJH/rdCz
 EnWK1GdGtWhR3Z1uk/Ss3zbBA/CwfUR/urVdc1P/aefLoVmsYOWQi1jsPHCHtFG6
 zkDYHr7qYqu91otaO0HgFrcOpuJe+LdbhwZndvUiTYJN8vNMRnQAnKdiEUEKmArq
 dBUj/H0JTbQwSXHZat2ZS9PwHsm7RGO+0qeixxc/HE730LF0TEwnteoy9jlu5d7U
 v1RZs9/zszmvQpWU34vPHCVH/sNfTMdVGPzc9+WNrOoxjM9vmhEOE0jTiclOcsCK
 sMAYHCG7woxkCPVZmxqgLx6P/9zZav6L2NZFPcT3z4jFq5Um+ugJ691f1oHaTq+L
 Bnn1DJdTl50wtMnb7yS1Uux+Y0OswKAXvDdC6NFPGJWwEnG41K3oL78Pq/vN7bKV
 ynKxRZciIsy/9S/Oyzp0oYV+l/cyScPVe/KfUN4zvIALi/mltMkAXYaZMEZDp7Vo
 w2TeJO7Nr3O75ghw/yCFHTWMAVbrTJg/ma1rkdUeekKYXix+4Bpr2XYqA3HHZCQY
 06pvIH+fZs1XshFlCs3RoWXvjdfjDgIO8zjrvSkTs8WUK4AxVNXIDtPDA6fpzcGz
 yZEehpdbPWPDvdd1C7TzEAi6lgOV/W5AsPUfk5KbLOaFzKWRe+FYtzDykGwamYeP
 Ov8=
 =NGL4
 -----END PGP SIGNATURE-----

Merge tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs

Pull configfs updates from Christoph Hellwig:
 "A fix from Nic for a race seen in production (including a stable tag).

  And while I'm sending you this I'm also sneaking in a trivial new
  helper from Bart so that we don't need inter-tree dependencies for the
  next merge window"

* tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs:
  configfs: Introduce config_item_get_unless_zero()
  configfs: Fix race between create_link and configfs_rmdir
2017-06-16 18:45:47 +09:00
20223f0f39 fs: pass on flags in compat_writev
Fixes: 793b80ef14af ("vfs: pass a flags argument to vfs_readv/vfs_writev")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-16 18:40:51 +09:00
4068367c9c fs: don't forget to put old mntns in mntns_install
Fixes: 4f757f3cbf54 ("make sure that mntns_install() doesn't end up with referral for root")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 06:53:05 -04:00
81be24d263 Hang/soft lockup in d_invalidate with simultaneous calls
It's not hard to trigger a bunch of d_invalidate() on the same
dentry in parallel.  They end up fighting each other - any
dentry picked for removal by one will be skipped by the rest
and we'll go for the next iteration through the entire
subtree, even if everything is being skipped.  Morevoer, we
immediately go back to scanning the subtree.  The only thing
we really need is to dissolve all mounts in the subtree and
as soon as we've nothing left to do, we can just unhash the
dentry and bugger off.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 06:52:09 -04:00
54ed0f71f0 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "This fixes a bug on sparc where we may dereference freed stack memory"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: Work around deallocated stack frame reference gcc bug on sparc.
2017-06-15 17:54:51 +09:00
a8fad98483 ufs_truncate_blocks(): fix the case when size is in the last direct block
The logics when deciding whether we need to do anything with direct blocks
is broken when new size is within the last direct block.  It's better to
find the path to the last byte _not_ to be removed and use that instead
of the path to the beginning of the first block to be freed...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 03:57:46 -04:00
289dec5b89 ufs: more deadlock prevention on tail unpacking
->s_lock is not needed for ufs_change_blocknr()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 00:42:56 -04:00
09bf4f5b6e ufs: avoid grabbing ->truncate_mutex if possible
tail unpacking is done in a wrong place; the deadlocks galore
is best dealt with by doing that in ->write_iter() (and switching
to iomap, while we are at it), but that's rather painful to
backport.  The trouble comes from grabbing pages that cover
the beginning of tail from inside of ufs_new_fragments(); ongoing
pageout of any of those is going to deadlock on ->truncate_mutex
with process that got around to extending the tail holding that
and waiting for page to get unlocked, while ->writepage() on
that page is waiting on ->truncate_mutex.

The thing is, we don't need ->truncate_mutex when the fragment
we are trying to map is within the tail - the damn thing is
allocated (tail can't contain holes).

Let's do a plain lookup and if the fragment is present, we can
just pretend that we'd won the race in almost all cases.  The
only exception is a fragment between the end of tail and the
end of block containing tail.

Protect ->i_lastfrag with ->meta_lock - read_seqlock_excl() is
sufficient.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 00:41:18 -04:00
267309f394 ufs_get_locked_page(): make sure we have buffer_heads
callers rely upon that, but find_lock_page() racing with attempt of
page eviction by memory pressure might have left us with
	* try_to_free_buffers() successfully done
	* __remove_mapping() failed, leaving the page in our mapping
	* find_lock_page() returning an uptodate page with no
buffer_heads attached.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 23:32:19 -04:00
c596961d1b ufs: fix s_size/s_dsize users
For UFS2 we need 64bit variants; we even store them in uspi, but
use 32bit ones instead.  One wrinkle is in handling of reserved
space - recalculating it every time had been stupid all along, but
now it would become really ugly.  Just calculate it once...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 16:43:03 -04:00
b451cec4bb ufs: fix reserved blocks check
a) honour ->s_minfree; don't just go with default (5)
b) don't bother with capability checks until we know we'll need them

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:46:05 -04:00
fffd70f588 ufs: make ufs_freespace() return signed
as it is, checking that its return value is <= 0 is useless and
that's how it's being used.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:36:31 -04:00
96ecff1422 ufs: fix logics in "ufs: make fsck -f happy"
Storing stats _only_ at new locations is wrong for UFS1; old
locations should always be kept updated.  The check for "has
been converted to use of new locations" is also wrong - it
should be "->fs_maxbsize is equal to ->fs_bsize".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:17:32 -04:00
4ca2fea6f8 ceph: unify inode i_ctime update
Current __ceph_setattr() can set inode's i_ctime to current_time(),
req->r_stamp or attr->ia_ctime. These time stamps may have minor
differences. It may cause potential problem.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:37:23 +02:00
56199016e8 ceph: use current_kernel_time() to get request time stamp
ceph uses ktime_get_real_ts() to get request time stamp. In most
other cases, current_kernel_time() is used to get time stamp for
filesystem operations (called by current_time()).

There is granularity difference between ktime_get_real_ts() and
current_kernel_time(). The later one can be up to one jiffy behind
the former one. This can causes inode's ctime to go back.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:33:23 +02:00
03f219041f ceph: check i_nlink while converting a file handle to dentry
Converting a file handle to a dentry can be done call after the inode
unlink.  This means that __fh_to_dentry() requires an extra check to
verify the number of links is not 0.

The issue can be easily reproduced using xfstest generic/426, which does
something like:

    name_to_handle_at(&fh)
    echo 3 > /proc/sys/vm/drop_caches
    unlink()
    open_by_handle_at(&fh)

The call to open_by_handle_at() should fail, as the file doesn't exist
anymore.

Link: http://tracker.ceph.com/issues/19958
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:32:43 +02:00
19e72d3abb configfs: Introduce config_item_get_unless_zero()
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
[hch: minor style tweak]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-12 13:20:20 +02:00
ba80aa909c configfs: Fix race between create_link and configfs_rmdir
This patch closes a long standing race in configfs between
the creation of a new symlink in create_link(), while the
symlink target's config_item is being concurrently removed
via configfs_rmdir().

This can happen because the symlink target's reference
is obtained by config_item_get() in create_link() before
the CONFIGFS_USET_DROPPING bit set by configfs_detach_prep()
during configfs_rmdir() shutdown is actually checked..

This originally manifested itself on ppc64 on v4.8.y under
heavy load using ibmvscsi target ports with Novalink API:

[ 7877.289863] rpadlpar_io: slot U8247.22L.212A91A-V1-C8 added
[ 7879.893760] ------------[ cut here ]------------
[ 7879.893768] WARNING: CPU: 15 PID: 17585 at ./include/linux/kref.h:46 config_item_get+0x7c/0x90 [configfs]
[ 7879.893811] CPU: 15 PID: 17585 Comm: targetcli Tainted: G           O 4.8.17-customv2.22 #12
[ 7879.893812] task: c00000018a0d3400 task.stack: c0000001f3b40000
[ 7879.893813] NIP: d000000002c664ec LR: d000000002c60980 CTR: c000000000b70870
[ 7879.893814] REGS: c0000001f3b43810 TRAP: 0700   Tainted: G O     (4.8.17-customv2.22)
[ 7879.893815] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28222242  XER: 00000000
[ 7879.893820] CFAR: d000000002c664bc SOFTE: 1
                GPR00: d000000002c60980 c0000001f3b43a90 d000000002c70908 c0000000fbc06820
                GPR04: c0000001ef1bd900 0000000000000004 0000000000000001 0000000000000000
                GPR08: 0000000000000000 0000000000000001 d000000002c69560 d000000002c66d80
                GPR12: c000000000b70870 c00000000e798700 c0000001f3b43ca0 c0000001d4949d40
                GPR16: c00000014637e1c0 0000000000000000 0000000000000000 c0000000f2392940
                GPR20: c0000001f3b43b98 0000000000000041 0000000000600000 0000000000000000
                GPR24: fffffffffffff000 0000000000000000 d000000002c60be0 c0000001f1dac490
                GPR28: 0000000000000004 0000000000000000 c0000001ef1bd900 c0000000f2392940
[ 7879.893839] NIP [d000000002c664ec] config_item_get+0x7c/0x90 [configfs]
[ 7879.893841] LR [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
[ 7879.893842] Call Trace:
[ 7879.893844] [c0000001f3b43ac0] [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
[ 7879.893847] [c0000001f3b43b10] [c000000000329770] do_dentry_open+0x2c0/0x460
[ 7879.893849] [c0000001f3b43b70] [c000000000344480] path_openat+0x210/0x1490
[ 7879.893851] [c0000001f3b43c80] [c00000000034708c] do_filp_open+0xfc/0x170
[ 7879.893853] [c0000001f3b43db0] [c00000000032b5bc] do_sys_open+0x1cc/0x390
[ 7879.893856] [c0000001f3b43e30] [c000000000009584] system_call+0x38/0xec
[ 7879.893856] Instruction dump:
[ 7879.893858] 409d0014 38210030 e8010010 7c0803a6 4e800020 3d220000 e94981e0 892a0000
[ 7879.893861] 2f890000 409effe0 39200001 992a0000 <0fe00000> 4bffffd0 60000000 60000000
[ 7879.893866] ---[ end trace 14078f0b3b5ad0aa ]---

To close this race, go ahead and obtain the symlink's target
config_item reference only after the existing CONFIGFS_USET_DROPPING
check succeeds.

This way, if configfs_rmdir() wins create_link() will return -ENONET,
and if create_link() wins configfs_rmdir() will return -EBUSY.

Reported-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Tested-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
2017-06-12 13:20:10 +02:00
5e38b72ac1 Fix various bug fixes in ext4 caused by races and memory allocation
failures.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlk9it4ACgkQ8vlZVpUN
 gaNE2wgAiuo9pHO5cUzRhP5HG8Jz4vE2l6mpuPq2JbhT+xGFbvjX+UhA+DN6Cw35
 EK+MnBC6h2wFoQrEOLbavNWc94nb6HZWw2riheK4sst80hBpeclwInCVCw1DLmu6
 +Nx8fzXVqjSf57Qi8CR09AGovSFBgfAJFi8aJTe8KXaSRPx48bWFxK6glgIG5gRw
 VEOyEg5kPRYoUNA8ewsunC67V32ljYF2IZSzTTlrcqaLsXi+SWeiJ2hceYfgI0qz
 fZB1EFBvmGBsdsFM2SHiJpME6fzMEQqx7oTyNC8DhCxMRVd29WjxeqCwcU4nV0Q5
 jJaCKJftJ/q/eLKn7ksezhoL0A96BA==
 =Z9eH
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix various bug fixes in ext4 caused by races and memory allocation
  failures"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix fdatasync(2) after extent manipulation operations
  ext4: fix data corruption for mmap writes
  ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO
  ext4: fix quota charging for shared xattr blocks
  ext4: remove redundant check for encrypted file on dio write path
  ext4: remove unused d_name argument from ext4_search_dir() et al.
  ext4: fix off-by-one error when writing back pages before dio read
  ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()
  ext4: keep existing extra fields when inode expands
  ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors
  ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff()
  ext4: fix SEEK_HOLE
  jbd2: preserve original nofs flag during journal restart
  ext4: clear lockdep subtype for quota files on quota off
2017-06-11 11:57:47 -07:00
5faab9e0f0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull UFS fixes from Al Viro:
 "This is just the obvious backport fodder; I'm pretty sure that there
  will be more - definitely so wrt performance and quite possibly
  correctness as well"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs: we need to sync inode before freeing it
  excessive checks in ufs_write_failed() and ufs_evict_inode()
  ufs_getfrag_block(): we only grab ->truncate_mutex on block creation path
  ufs_extend_tail(): fix the braino in calling conventions of ufs_new_fragments()
  ufs: set correct ->s_maxsize
  ufs: restore maintaining ->i_blocks
  fix ufs_isblockset()
  ufs: restore proper tail allocation
2017-06-10 11:09:23 -07:00
66cea28a94 Merge branch 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Some fixes that Dave Sterba collected.

  We've been hitting an early enospc problem on production machines that
  Omar tracked down to an old int->u64 mistake. I waited a bit on this
  pull to make sure it was really the problem from production, but it's
  on ~2100 hosts now and I think we're good.

  Omar also noticed a commit in the queue would make new early ENOSPC
  problems. I pulled that out for now, which is why the top three
  commits are younger than the rest.

  Otherwise these are all fixes, some explaining very old bugs that
  we've been poking at for a while"

* 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix delalloc accounting leak caused by u32 overflow
  Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io
  btrfs: tree-log.c: Wrong printk information about namelen
  btrfs: fix race with relocation recovery and fs_root setup
  btrfs: fix memory leak in update_space_info failure path
  btrfs: use correct types for page indices in btrfs_page_exists_in_range
  btrfs: fix incorrect error return ret being passed to mapping_set_error
  btrfs: Make flush bios explicitely sync
  btrfs: fiemap: Cache and merge fiemap extent before submit it to user
2017-06-10 11:06:05 -07:00
67a70017fa ufs: we need to sync inode before freeing it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-10 12:02:28 -04:00
babef37dcc excessive checks in ufs_write_failed() and ufs_evict_inode()
As it is, short copy in write() to append-only file will fail
to truncate the excessive allocated blocks.  As the matter of
fact, all checks in ufs_truncate_blocks() are either redundant
or wrong for that caller.  As for the only other caller
(ufs_evict_inode()), we only need the file type checks there.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
006351ac8e ufs_getfrag_block(): we only grab ->truncate_mutex on block creation path
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
940ef1a0ed ufs_extend_tail(): fix the braino in calling conventions of ufs_new_fragments()
... and it really needs splitting into "new" and "extend" cases, but that's for
later

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
6b0d144fa7 ufs: set correct ->s_maxsize
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
eb315d2ae6 ufs: restore maintaining ->i_blocks
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
414cf7186d fix ufs_isblockset()
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
8785d84d00 ufs: restore proper tail allocation
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-09 16:28:01 -04:00
70e7af244f Btrfs: fix delalloc accounting leak caused by u32 overflow
btrfs_calc_trans_metadata_size() does an unsigned 32-bit multiplication,
which can overflow if num_items >= 4 GB / (nodesize * BTRFS_MAX_LEVEL * 2).
For a nodesize of 16kB, this overflow happens at 16k items. Usually,
num_items is a small constant passed to btrfs_start_transaction(), but
we also use btrfs_calc_trans_metadata_size() for metadata reservations
for extent items in btrfs_delalloc_{reserve,release}_metadata().

In drop_outstanding_extents(), num_items is calculated as
inode->reserved_extents - inode->outstanding_extents. The difference
between these two counters is usually small, but if many delalloc
extents are reserved and then the outstanding extents are merged in
btrfs_merge_extent_hook(), the difference can become large enough to
overflow in btrfs_calc_trans_metadata_size().

The overflow manifests itself as a leak of a multiple of 4 GB in
delalloc_block_rsv and the metadata bytes_may_use counter. This in turn
can cause early ENOSPC errors. Additionally, these WARN_ONs in
extent-tree.c will be hit when unmounting:

    WARN_ON(fs_info->delalloc_block_rsv.size > 0);
    WARN_ON(fs_info->delalloc_block_rsv.reserved > 0);
    WARN_ON(space_info->bytes_pinned > 0 ||
            space_info->bytes_reserved > 0 ||
            space_info->bytes_may_use > 0);

Fix it by casting nodesize to a u64 so that
btrfs_calc_trans_metadata_size() does a full 64-bit multiplication.
While we're here, do the same in btrfs_calc_trunc_metadata_size(); this
can't overflow with any existing uses, but it's better to be safe here
than have another hard-to-debug problem later on.

Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-09 12:48:36 -07:00
452e62b71f Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io
Before this, we use 'filled' mode here, ie. if all range has been
filled with EXTENT_DEFRAG bits, get to clear it, but if the defrag
range joins the adjacent delalloc range, then we'll have EXTENT_DEFRAG
bits in extent_state until releasing this inode's pages, and that
prevents extent_data from being freed.

This clears the bit if any was found within the ordered extent.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-06-09 12:48:29 -07:00