41955 Commits

Author SHA1 Message Date
1081230b74 Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
 "This first core part of the block IO changes contains:

   - Cleanup of the bio IO error signaling from Christoph.  We used to
     rely on the uptodate bit and passing around of an error, now we
     store the error in the bio itself.

   - Improvement of the above from myself, by shrinking the bio size
     down again to fit in two cachelines on x86-64.

   - Revert of the max_hw_sectors cap removal from a revision again,
     from Jeff Moyer.  This caused performance regressions in various
     tests.  Reinstate the limit, bump it to a more reasonable size
     instead.

   - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
     Most devices have huge trim limits, which can cause nasty latencies
     when deleting files.  Enable the admin to configure the size down.
     We will look into having a more sane default instead of UINT_MAX
     sectors.

   - Improvement of the SGP gaps logic from Keith Busch.

   - Enable the block core to handle arbitrarily sized bios, which
     enables a nice simplification of bio_add_page() (which is an IO hot
     path).  From Kent.

   - Improvements to the partition io stats accounting, making it
     faster.  From Ming Lei.

   - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
     file in blk-mq, as well as a fix for a blk-mq timeout race
     condition.

   - Ming Lin has been carrying Kents above mentioned patches forward
     for a while, and testing them.  Ming also did a few fixes around
     that.

   - Sasha Levin found and fixed a use-after-free problem introduced by
     the bio->bi_error changes from Christoph.

   - Small blk cgroup cleanup from Viresh Kumar"

* 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
  blk: Fix bio_io_vec index when checking bvec gaps
  block: Replace SG_GAPS with new queue limits mask
  block: bump BLK_DEF_MAX_SECTORS to 2560
  Revert "block: remove artifical max_hw_sectors cap"
  blk-mq: fix race between timeout and freeing request
  blk-mq: fix buffer overflow when reading sysfs file of 'pending'
  Documentation: update notes in biovecs about arbitrarily sized bios
  block: remove bio_get_nr_vecs()
  fs: use helper bio_add_page() instead of open coding on bi_io_vec
  block: kill merge_bvec_fn() completely
  md/raid5: get rid of bio_fits_rdev()
  md/raid5: split bio for chunk_aligned_read
  block: remove split code in blkdev_issue_{discard,write_same}
  btrfs: remove bio splitting and merge_bvec_fn() calls
  bcache: remove driver private bio splitting code
  block: simplify bio_add_page()
  block: make generic_make_request handle arbitrarily sized bios
  blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
  block: don't access bio->bi_error after bio_put()
  block: shrink struct bio down to 2 cache lines again
  ...
2015-09-02 13:10:25 -07:00
a457974f1b nfsd: deal with DELEGRETURN racing with CB_RECALL
We have observed the server sending recalls for delegation stateids
that have already been successfully returned. Change
nfsd4_cb_recall_done() to return success if the client has returned
the delegation. While this does not completely eliminate the sending
of recalls for delegations that have already been returned, this
does prevent unnecessarily declaring the callback path to be down.

Reported-by: Eric Meddaugh <etmsys@rit.edu>
Signed-off-by: Andrew Elble <aweits@rit.edu>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-09-02 10:05:28 -04:00
089b669506 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree for 4.3 (kerneldoc updates, printk()
  fixes, Documentation and MAINTAINERS updates)"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
  MAINTAINERS: update my e-mail address
  mod_devicetable: add space before */
  scsi: a100u2w: trivial typo in printk
  i2c: Fix typo in i2c-bfin-twi.c
  treewide: fix typos in comment blocks
  Doc: fix trivial typo in SubmittingPatches
  proportions: Spelling s/consitent/consistent/
  dm: Spelling s/consitent/consistent/
  aic7xxx: Fix typo in error message
  pcmcia: Fix typo in locking documentation
  scsi/arcmsr: Fix typos in error log
  drm/nouveau/gr: Fix typo in nv10.c
  [SCSI] Fix printk typos in drivers/scsi
  staging: comedi: Grammar s/Enable support a/Enable support for a/
  Btrfs: Spelling s/consitent/consistent/
  README: GTK+ is a acronym
  ASoC: omap: Fix typo in config option description
  mm: tlb.c: Fix error message
  ntfs: super.c: Fix error log
  fix typo in Documentation/SubmittingPatches
  ...
2015-09-01 18:46:42 -07:00
73b6fa8e49 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace updates from Eric Biederman:
 "This finishes up the changes to ensure proc and sysfs do not start
  implementing executable files, as the there are application today that
  are only secure because such files do not exist.

  It akso fixes a long standing misfeature of /proc/<pid>/mountinfo that
  did not show the proper source for files bind mounted from
  /proc/<pid>/ns/*.

  It also straightens out the handling of clone flags related to user
  namespaces, fixing an unnecessary failure of unshare(CLONE_NEWUSER)
  when files such as /proc/<pid>/environ are read while <pid> is calling
  unshare.  This winds up fixing a minor bug in unshare flag handling
  that dates back to the first version of unshare in the kernel.

  Finally, this fixes a minor regression caused by the introduction of
  sysfs_create_mount_point, which broke someone's in house application,
  by restoring the size of /sys/fs/cgroup to 0 bytes.  Apparently that
  application uses the directory size to determine if a tmpfs is mounted
  on /sys/fs/cgroup.

  The bind mount escape fixes are present in Al Viros for-next branch.
  and I expect them to come from there.  The bind mount escape is the
  last of the user namespace related security bugs that I am aware of"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  fs: Set the size of empty dirs to 0.
  userns,pidns: Force thread group sharing, not signal handler sharing.
  unshare: Unsharing a thread does not require unsharing a vm
  nsfs: Add a show_path method to fix mountinfo
  mnt: fs_fully_visible enforce noexec and nosuid  if !SB_I_NOEXEC
  vfs: Commit to never having exectuables on proc and sysfs.
2015-09-01 16:13:25 -07:00
4a3e5779cf nfs: Remove unneeded checking of the return value from scnprintf
The return value from scnprintf always less than the buffer length.
So, result >= len always false. This patch removes those checking.

int vscnprintf(char *buf, size_t size, const char *fmt, va_list args)
{
        int i;

	i = vsnprintf(buf, size, fmt, args);

	if (likely(i < size))
		return i;
	if (size != 0)
		return size - 1;
	return 0;
}

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-01 15:19:40 -07:00
4a70316cae nfs: Fix truncated client owner id without proto type
The length of "Linux NFSv4.0 " is 14, not 10.

Without this patch, I get a truncated client owner id as,
"Linux NFSv4.0 ::1/::1"

With this patch,
"Linux NFSv4.0 ::1/::1 tcp"

Fixes: a319268891 ("nfs: make nfs4_init_nonuniform_client_string use a dynamically allocated buffer")
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-01 15:19:40 -07:00
889d94d49a NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid
If a read-write layout has an invalid mirror, then we should
mark it as invalid, and return it.
If a read-only layout has an invalid mirror, then mark it as invalid
and check if there is still at least one valid mirror before we return
it.

Note: Also fix incorrect use of pnfs_generic_mark_devid_invalid().
We really want nfs4_mark_deviceid_unavailable().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-01 15:12:11 -07:00
81d6dc8b34 NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid
Unlike read layouts, the writeable layout cannot fall back to using only
one of the mirrors. It need to write to all of them.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-01 15:12:11 -07:00
388ef16640 NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid()
Unlike the files layout, flexfiles does not test for the NFS_DEVICEID_INVALID
flag. Instead it relies on NFS_DEVICEID_UNAVAILABLE.
Fix is to replace with nfs4_mark_deviceid_unavailable().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-01 15:12:11 -07:00
01a5ad827a f2fs: upset segment_info repair
upset segment_info like this:

276000|161 0|0   4|70  3|0   3|0   0|0   0|91  4|0   4|232 4|39
276104|0   4|0   4|1   4|0   4|0   4|280 4|0   4|42  4|262 4|38
276204|179 4|89  4|39  4|24  4|0   4|96  4|3   4|428 4|0   4|118
276304|112 4|97  4|0   4|0   4|0   4|68  4|0   4|0   4|86  4|138
276404|0   4|0   0|166 5|39  4|101 0|111

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-09-01 14:45:27 -07:00
972398fa0a NFSv4.1/flexfiles: Fix freeing of mirrors
Mirrors are now shared objects, so we should not be freeing them directly
inside ff_layout_free_lseg(). We should already be doing the right thing
in _ff_layout_free_lseg(), so just let it handle things.

Also ensure that ff_layout_free_mirror() frees the RPC credential if it
is set.

Fixes: 28a0d72c6867 ("Add refcounting to struct nfs4_ff_layout_mirror")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-01 12:18:57 -07:00
f984a7ce58 nfsd: return CLID_INUSE for unexpected SETCLIENTID_CONFIRM case
Somebody with a Solaris client was hitting this case.  We haven't
figured out why yet, and don't have a reproducer.  Meanwhile Frank
noticed that RFC 7530 actually recommends CLID_INUSE for this case.
Unlikely to help the original reporter, but may as well fix it.

Reported-by: Frank Filz <ffilzlnx@mindspring.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-09-01 13:53:40 -04:00
5d54b8cdea Merge branch 'xfs-misc-fixes-for-4.3-4' into for-next 2015-09-01 10:30:11 +10:00
3fcbbd244e nfsd: ensure that delegation stateid hash references are only put once
It's possible that a DELEGRETURN could race with (e.g.) client expiry,
in which case we could end up putting the delegation hash reference more
than once.

Have unhash_delegation_locked return a bool that indicates whether it
was already unhashed. In the case of destroy_delegation we only
conditionally put the hash reference if that returns true.

The other callers of unhash_delegation_locked call it while walking
list_heads that shouldn't yet be detached. If we find that it doesn't
return true in those cases, then throw a WARN_ON as that indicates that
we have a partially hashed delegation, and that something is likely very
wrong.

Tested-by: Andrew W Elble <aweits@rit.edu>
Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:32:16 -04:00
e85687393f nfsd: ensure that the ol stateid hash reference is only put once
When an open or lock stateid is hashed, we take an extra reference to
it. When we unhash it, we drop that reference. The code however does
not properly account for the case where we have two callers concurrently
trying to unhash the stateid. This can lead to list corruption and the
hash reference being put more than once.

Fix this by having unhash_ol_stateid use list_del_init on the st_perfile
list_head, and then testing to see if that list_head is empty before
releasing the hash reference. This means that some of the unhashing
wrappers now become bool return functions so we can test to see whether
the stateid was unhashed before we put the reference.

Reported-by: Andrew W Elble <aweits@rit.edu>
Tested-by: Andrew W Elble <aweits@rit.edu>
Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Tested-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:32:15 -04:00
51a5456859 nfsd: allow more than one laundry job to run at a time
We can potentially have several nfs4_laundromat jobs running if there
are multiple namespaces running nfsd on the box. Those are effectively
separated from one another though, so I don't see any reason to
serialize them.

Also, create_singlethread_workqueue automatically adds the
WQ_MEM_RECLAIM flag. Since we run this job on a timer, it's not really
involved in any reclaim paths. I see no need for a rescuer thread.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:32:14 -04:00
46cc8ba304 nfsd: don't WARN/backtrace for invalid container deployment.
These messages, combined with the backtrace they trigger, makes it seem
like a serious problem, though a quick search shows distros marking
it as a "won't fix" non-issue when the problem is reported by users.

The backtrace is overkill, and only really manages to show that if
you follow the code path, you can't really avoid it with bootargs
or configuration settings in the container.

Given that, lets tone it down a bit and get rid of the WARN severity,
and the associated backtrace, so people aren't needlessly alarmed.

Also, lets drop the split printk line, since they are grep unfriendly.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:32:08 -04:00
7fadc59cc8 fs: fix fs/locks.c kernel-doc warning
Fix kernel-doc warnings in fs/locks.c:

Warning(..//fs/locks.c:1577): No description found for parameter 'flags'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-08-31 16:27:25 -04:00
75976de655 NFSD: Return word2 bitmask if setting security label in OPEN/CREATE
Security label can be set in OPEN/CREATE request, nfsd should set
the bitmask in word2 if setting success.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:16:40 -04:00
ead8fb8c24 NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1
According to rfc5661 18.16.4,
"If EXCLUSIVE4_1 was used, the client determines the attributes
 used for the verifier by comparing attrset with cva_attrs.attrmask;"

So, EXCLUSIVE4_1 also needs those bitmask used to store the verifier.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:16:39 -04:00
7d580722c9 nfsd: SUPPATTR_EXCLCREAT must be encoded before SECURITY_LABEL.
The encode order should be as the bitmask defined order.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:16:39 -04:00
6896f15aab nfsd: Fix an FS_LAYOUT_TYPES/LAYOUT_TYPES encode bug
Currently we'll respond correctly to a request for either
FS_LAYOUT_TYPES or LAYOUT_TYPES, but not to a request for both
attributes simultaneously.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 16:12:39 -04:00
0a2050d744 NFSD: Store parent's stat in a separate value
After commit ae7095a7c4 (nfsd4: helper function for getting mounted_on
ino) we ignore the return value from get_parent_attributes().

Also, the following FATTR4_WORD2_LAYOUT_BLKSIZE uses stat.blksize, so to
avoid overwriting that, use an independent value for the parent's
attributes.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-31 15:11:05 -04:00
527afb4493 Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
We need not check path before btrfs_free_path() is called because
path is checked in btrfs_free_path().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:46:41 -07:00
c6dd6ea557 btrfs: async_thread: Fix workqueue 'max_active' value when initializing
At initializing time, for threshold-able workqueue, it's max_active
of kernel workqueue should be 1 and grow if it hits threshold.

But due to the bad naming, there is both 'max_active' for kernel
workqueue and btrfs workqueue.
So wrong value is given at workqueue initialization.

This patch fixes it, and to avoid further misunderstanding, change the
member name of btrfs_workqueue to 'current_active' and 'limit_active'.

Also corresponding comment is added for readability.

Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:46:40 -07:00
943c6e9925 btrfs: Add raid56 support for updating
num_tolerated_disk_barrier_failures in btrfs_balance

Code for updating fs_info->num_tolerated_disk_barrier_failures in
btrfs_balance() lacks raid56 support.

Reason:
 Above code was wroten in 2012-08-01, together with
 btrfs_calc_num_tolerated_disk_barrier_failures()'s first version.

 Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated
 later to support raid56, but code in btrfs_balance() was not
 updated together.

Fix:
 Merge above similar code to a common function:
 btrfs_get_num_tolerated_disk_barrier_failures()
 and make it support both case.

 It can fix this bug with a bonus of cleanup, and make these code
 never in above no-sync state from now on.

Suggested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:48 -07:00
2c4580454f btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
1: Use ARRAY_SIZE(types) to replace a static-value variant:
   int num_types = 4;

2: Use 'continue' on condition to reduce one level tab
   if (!XXX) {
       code;
       ...
   }
   ->
   if (XXX)
       continue;
   code;
   ...

3: Put setting 'num_tolerated_disk_barrier_failures = 2' to
   (num_tolerated_disk_barrier_failures > 2) condition to make
   make logic neat.
   if (num_tolerated_disk_barrier_failures > 0 && XXX)
       num_tolerated_disk_barrier_failures = 0;
   else if (num_tolerated_disk_barrier_failures > 1) {
       if (XXX)
           num_tolerated_disk_barrier_failures = 1;
       else if (XXX)
           num_tolerated_disk_barrier_failures = 2;
   ->
   if (num_tolerated_disk_barrier_failures > 0 && XXX)
       num_tolerated_disk_barrier_failures = 0;
   if (num_tolerated_disk_barrier_failures > 1 && XXX)
       num_tolerated_disk_barrier_failures = ;
   if (num_tolerated_disk_barrier_failures > 2 && XXX)
       num_tolerated_disk_barrier_failures = 2;

4: Remove comment of:
   num_mirrors - 1: if RAID1 or RAID10 is configured and more
   than 2 mirrors are used.
   which is not fit with code.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:47 -07:00
8c204c9657 btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
These variables are not used from introduced version, remove them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:46 -07:00
7955323bdc btrfs: Update out-of-date "skip parity stripe" comment
Because btrfs support scrub raid56 parity stripe now.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-08-31 11:45:45 -07:00
1c00038c76 Char/Misc driver patches for 4.3-rc1
Here's the "big" char/misc driver update for 4.3-rc1.
 
 Not much really interesting here, just a number of little changes all
 over the place, and some nice consolidation of the nvmem drivers to a
 common framework.  As usual, the mei drivers stand out as the largest
 "churn" to handle new devices and features in their hardware.
 
 All have been in linux-next for a while with no issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlXV844ACgkQMUfUDdst+ymYfQCgmDKjq3fsVHCxNZPxnukFYzvb
 xZkAnRb8fuub5gVQFP29A+rhyiuWD13v
 =Bq9K
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver patches from Greg KH:
 "Here's the "big" char/misc driver update for 4.3-rc1.

  Not much really interesting here, just a number of little changes all
  over the place, and some nice consolidation of the nvmem drivers to a
  common framework.  As usual, the mei drivers stand out as the largest
  "churn" to handle new devices and features in their hardware.

  All have been in linux-next for a while with no issues"

* tag 'char-misc-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits)
  auxdisplay: ks0108: initialize local parport variable
  extcon: palmas: Fix build break due to devm_gpiod_get_optional API change
  extcon: palmas: Support GPIO based USB ID detection
  extcon: Fix signedness bugs about break error handling
  extcon: Drop owner assignment from i2c_driver
  extcon: arizona: Simplify pdata symantics for micd_dbtime
  extcon: arizona: Declare 3-pole jack if we detect open circuit on mic
  extcon: Add exception handling to prevent the NULL pointer access
  extcon: arizona: Ensure variables are set for headphone detection
  extcon: arizona: Use gpiod inteface to handle micd_pol_gpio gpio
  extcon: arizona: Add basic microphone detection DT/ACPI bindings
  extcon: arizona: Update to use the new device properties API
  extcon: palmas: Remove the mutually_exclusive array
  extcon: Remove optional print_state() function pointer of struct extcon_dev
  extcon: Remove duplicate header file in extcon.h
  extcon: max77843: Clear IRQ bits state before request IRQ
  toshiba laptop: replace ioremap_cache with ioremap
  misc: eeprom: max6875: clean up max6875_read()
  misc: eeprom: clean up eeprom_read()
  misc: eeprom: 93xx46: clean up eeprom_93xx46_bin_read/write
  ...
2015-08-31 08:34:13 -07:00
2d89a1d3c9 NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file
If we have a read layout, then sanity check the minimal layout length
so that it does not extend beyond the end of file.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-31 02:05:47 -07:00
21b874c873 NFSv4.1/pnfs: Handle LAYOUTGET return values correctly
According to RFC5661 section 18.43.3, if the server cannot satisfy
the loga_minlength argument to LAYOUTGET, there are 2 cases:
1) If loga_minlength == 0, it returns NFS4ERR_LAYOUTTRYLATER
2) If loga_minlength != 0, it returns NFS4ERR_BADLAYOUT

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-31 01:33:12 -07:00
4ae93560b1 NFSv4.1/pnfs: Don't ask for a read layout for an empty file.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-31 01:33:12 -07:00
4a1e2feb9d NFSv4.1: Fix a protocol issue with CLOSE stateids
According to RFC5661 Section 18.2.4, CLOSE is supposed to return
the zero stateid. This means that nfs_clear_open_stateid_locked()
cannot assume that the result stateid will always match the 'other'
field of the existing open stateid when trying to determine a race
with a parallel OPEN.

Instead, we look at the argument, and check for matches.

Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-30 18:45:04 -07:00
90816d1dda NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors
If the file was fenced and/or has been deleted on the DS, then we want
to retry pNFS after a layoutreturn with error report. If the server
cannot fix the problem, then we rely on it to tell us so in the
response to the LAYOUTGET.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-30 13:10:27 -07:00
54d7185642 f2fs: avoid accessing NULL pointer in f2fs_drop_largest_extent
If extent cache is disable, we will encounter oops when triggering direct
IO as below:

BUG: unable to handle kernel NULL pointer dereference at 0000000c
IP: [<f0b9c61e>] f2fs_drop_largest_extent+0xe/0x30 [f2fs]
*pdpt = 000000002bb9a001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in: f2fs(O) fuse bnep rfcomm bluetooth nfsd dm_crypt nfs_acl auth_rpcgss oid_registry nfs binfmt_misc fscache lockd
sunrpc grace snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer
snd_seq_device snd soundcore joydev psmouse hid_generic i2c_piix4 serio_raw ppdev mac_hid parport_pc lp parport ext4 jbd2 mbcache
usbhid hid e1000
CPU: 3 PID: 3608 Comm: dd Tainted: G           O    4.2.0-rc4 #12
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: ef161600 ti: ebd5e000 task.ti: ebd5e000
EIP: 0060:[<f0b9c61e>] EFLAGS: 00010202 CPU: 3
EIP is at f2fs_drop_largest_extent+0xe/0x30 [f2fs]
EAX: 00000000 EBX: ddebc000 ECX: 00000000 EDX: 00000000
ESI: ebd5fdf8 EDI: 00000000 EBP: ebd5fd58 ESP: ebd5fd58
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 0000000c CR3: 2c24ee40 CR4: 000006f0
Stack:
 ebd5fda4 f0b8c005 00000000 00000001 00000000 f0b8c430 c816cd68 ddebc000
 ddebc088 00001000 00000555 00000555 ffffffff c160bb00 00055501 00000000
 00000000 00000100 00000000 ebd5fe20 f0b8c430 00000046 ef161600 00001000
Call Trace:
 [<f0b8c005>] __allocate_data_block+0x1a5/0x260 [f2fs]
 [<f0b8c430>] ? f2fs_direct_IO+0x370/0x440 [f2fs]
 [<c160bb00>] ? down_read+0x30/0x50
 [<f0b8c430>] f2fs_direct_IO+0x370/0x440 [f2fs]
 [<c113e115>] generic_file_direct_write+0xa5/0x260
 [<c10b53f8>] ? current_fs_time+0x18/0x50
 [<c113e38b>] __generic_file_write_iter+0xbb/0x210
 [<c113e50f>] ? generic_file_write_iter+0x2f/0x320
 [<c113e63c>] generic_file_write_iter+0x15c/0x320
 [<f0b77f29>] f2fs_file_write_iter+0x39/0x80 [f2fs]
 [<c11984d9>] __vfs_write+0xa9/0xe0
 [<c1199227>] vfs_write+0x97/0x180
 [<c119955b>] SyS_write+0x5b/0xd0
 [<c160dcd0>] sysenter_do_call+0x12/0x12
Code: 10 8b 50 1c 89 53 14 eb ca 8d 74 26 00 85 f6 74 86 eb a6 0f 0b 90 8d b4 26 00 00 00 00 55 89 e5 3e 8d 74 26 00 8b 80 d4 02 00
00 <8b> 48 0c 39 d1 77 0e 03 48 14 39 ca 73 07 c7 40 14 00 00 00 00
EIP: [<f0b9c61e>] f2fs_drop_largest_extent+0xe/0x30 [f2fs] SS:ESP 0068:ebd5fd58
CR2: 000000000000000c
---[ end trace a38c07026a1afffd ]---

This is because when extent cache is disable, extent_tree pointer in struct
f2fs_inode_info should be NULL, but in f2fs_drop_largest_extent we access
this NULL pointer directly without checking state of extent cache, then,
the oops occurs. Let's fix it by checking state of extent cache before
accessing.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-28 10:14:26 -07:00
1a7ccad88d xfs: fix error gotos in xfs_setattr_nonsize
As the code stands today, if xfs_trans_reserve() fails, we
goto out_dqrele, which does not free the allocated transaction.

Fix up the goto targets to undo everything properly.

Addresses-Coverity-Id: 145571
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-28 14:51:10 +10:00
8774cf8bac xfs: add mssing inode cache attempts counter increment
Increasing the inode cache attempt counter was apparently dropped while
refactoring the cache code and so stayed at the initial 0 value. Add the
increment back to make the runtime stats more useful.

Signed-off-by: Lucas Stach <dev@lynxeye.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-28 14:50:56 +10:00
c9eb256eda xfs: return errors from partial I/O failures to files
There is an issue with xfs's error reporting in some cases of I/O partially
failing and partially succeeding. Calls like fsync() can report success even
though not all I/O was successful in partial-failure cases such as one disk of
a RAID0 array being offline.

The issue can occur when there are more than one bio per xfs_ioend struct.
Each call to xfs_end_bio() for a bio completing will write a value to
ioend->io_error.  If a successful bio completes after any failed bio, no
error is reported do to it writing 0 over the error code set by any failed bio.
The I/O error information is now lost and when the ioend is completed
only success is reported back up the filesystem stack.

xfs_end_bio() should only set ioend->io_error in the case of BIO_UPTODATE
being clear.  ioend->io_error is initialized to 0 at allocation so only needs
to be updated by a failed bio. Also check that ioend->io_error is 0 so that
the first error reported will be the error code returned.

Cc: stable@vger.kernel.org
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-28 14:50:45 +10:00
dfdd4ac66c libxfs: bad magic number should set da block buffer error
If xfs_da3_node_read_verify() doesn't recognize the magic number of a
buffer it's just read, set the buffer error to -EFSCORRUPTED so that
the error can be sent up to userspace.  Without this patch we'll
notice the bad magic eventually while trying to traverse or change
the block, but we really ought to fail early in the verifier.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-28 14:50:03 +10:00
6669cb8bed NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload
The "FIXME" is outdated. Flexfiles does add a payload.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-27 20:43:20 -04:00
d13549074c NFSv4.1/flexfiles: Fix a protocol error in layoutreturn
According to the flexfiles protocol, the layoutreturn should specify an
array of errors in the following format:

struct ff_ioerr4 {
	offset4        ffie_offset;
	length4        ffie_length;
	stateid4       ffie_stateid;
	device_error4  ffie_errors<>;
};

This patch fixes up the code to ensure that our ffie_errors is indeed
encoded as an array (albeit with only a single entry).

Reported-by: Tom Haynes <thomas.haynes@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-27 20:42:20 -04:00
5334c5bdac NFS: Send attributes in OPEN request for NFS4_CREATE_EXCLUSIVE4_1
Client sends a SETATTR request after OPEN for updating attributes.
For create file with S_ISGID is set, the S_ISGID in SETATTR will be
ignored at nfs server as chmod of no PERMISSION.

v3, same as v2.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-27 19:47:07 -04:00
8c61282ff6 NFS: Get suppattr_exclcreat when getting server capabilities
Create file with attributs as NFS4_CREATE_EXCLUSIVE4_1 mode
depends on suppattr_exclcreat attribut.

v3, same as v2.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-27 19:45:27 -04:00
c5c3fb5f97 NFS: Make opened as optional argument in _nfs4_do_open
Check opened, only update it when non-NULL.
It's not needs define an unused value for the opened
when calling _nfs4_do_open.

v3, same as v2.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-27 19:44:38 -04:00
ae57ca0f4f NFS: Check size by inode_newsize_ok in nfs_setattr
Set rlimit for NFS's files is useless right now.
For local process's rlimit, it should be checked by nfs client.

The same, CIFS also call inode_change_ok checking rlimit at its client
in cifs_setattr_nounix() and cifs_setattr_unix().

v3, fix bad using of error

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-27 19:44:21 -04:00
cb389b9c0e dax: drop size parameter to ->direct_access()
None of the implementations currently use it.  The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-08-27 19:40:58 -04:00
0bdb8fa6ec NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return must notify of layout return
It's not sufficient to just mark the layout segment for layout return. We
also need to set the NFS_LAYOUT_RETURN_BEFORE_CLOSE flag in the layout header.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-27 19:17:33 -04:00
b3a5bbfd78 dlm: print error from kernel_sendpage
Print a dlm-specific error when a socket error occurs
when sending a dlm message.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2015-08-27 09:34:47 -05:00
19b2c30d3c f2fs: update extent tree in batches
This patch introduce a new helper f2fs_update_extent_tree_range which can
do extent mapping update at a specified range.

The main idea is:
1) punch all mapping info in extent node(s) which are at a specified range;
2) try to merge new extent mapping with adjacent node, or failing that,
   insert the mapping into extent tree as a new node.

In order to see the benefit, I add a function for stating time stamping
count as below:

uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
	return (uint64_t)hi << 32 | lo;
}

My test environment is: ubuntu, intel i7-3770, 16G memory, 256g micron ssd.

truncation path:	update extent cache from truncate_data_blocks_range
non-truncataion path:	update extent cache from other paths
total:			all update paths

a) Removing 128MB file which has one extent node mapping whole range of
file:
1. dd if=/dev/zero of=/mnt/f2fs/128M bs=1M count=128
2. sync
3. rm /mnt/f2fs/128M

Before:
		total		count		average
truncation:	7651022		32768		233.49

Patched:
		total		count		average
truncation:	3321		33		100.64

b) fsstress:
fsstress -d /mnt/f2fs -l 5 -n 100 -p 20
Test times:		5 times.

Before:
		total		count		average
truncation:	5812480.6	20911.6		277.95
non-truncation:	7783845.6	13440.8		579.12
total:		13596326.2	34352.4		395.79

Patched:
		total		count		average
truncation:	1281283.0	3041.6		421.25
non-truncation:	7355844.4	13662.8		538.38
total:		8637127.4	16704.4		517.06

1) For the updates in truncation path:
 - we can see updating in batches leads total tsc and update count reducing
   explicitly;
 - besides, for a single batched updating, punching multiple extent nodes
   in a loop, result in executing more operations, so our average tsc
   increase intensively.
2) For the updates in non-truncation path:
 - there is a little improvement, that is because for the scenario that we
   just need to update in the head or tail of extent node, new interface
   optimize to update info in extent node directly, rather than removing
   original extent node for updating and then inserting that updated one
   into cache as new node.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-08-26 11:50:35 -07:00