5459 Commits

Author SHA1 Message Date
c128e57551 NFS: Optimise the default readahead size
In the years since the max readahead size was fixed in NFS, a number of
things have happened:
- Users can now set the value directly using /sys/class/bdi
- NFS max supported block sizes have increased by several orders of
  magnitude from 64K to 1MB.
- Disk access latencies are orders of magnitude faster due to SSD + NVME.

In particular note that if the server is advertising 1MB as the optimal
read size, as that will set the readahead size to 15MB.
Let's therefore adjust down, and try to default to VM_READAHEAD_PAGES.
However let's inform the VM about our preferred block size so that it
can choose to round up in cases where that makes sense.

Reported-by: Alkis Georgopoulos <alkisg@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-24 15:58:20 -04:00
32c6e7eee3 NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
If a LOCKU request receives a NFS4ERR_OLD_STATEID, then bump the
seqid before resending. Ensure we only bump the seqid by 1.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 16:00:14 -04:00
0e0cb35b41 NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
If a CLOSE or OPEN_DOWNGRADE operation receives a NFS4ERR_OLD_STATEID
then bump the seqid before resending. Ensure we only bump the seqid
by 1.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:56:19 -04:00
e217e825dc NFSv4: Fix OPEN_DOWNGRADE error handling
If OPEN_DOWNGRADE returns a state error, then we want to initiate
state recovery in addition to marking the stateid as closed.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:52:44 -04:00
30cb3ee299 pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state seqid
If a LAYOUTRETURN receives a reply of NFS4ERR_OLD_STATEID then assume we've
missed an update, and just bump the stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:48:35 -04:00
9228395709 NFSv4: Add a helper to increment stateid seqids
Add a helper function to increment stateid seqids according to the
rules specified in RFC5661 Section 8.2.2.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:44:45 -04:00
6109bcf713 NFSv4: Handle RPC level errors in LAYOUTRETURN
Handle RPC level errors by assuming that the RPC call was successful.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:39:56 -04:00
078a432d1c NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
If the server sends a NFS4ERR_DELAY, then allow the caller to retry.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:36:58 -04:00
287a9c558b NFSv4: Clean up pNFS return-on-close error handling
Both close and delegreturn have identical code to handle pNFS
return-on-close. This patch refactors that code and places it
in pnfs.c

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:27:51 -04:00
9c47b18cf7 pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors
IF the server rejected our layout return with a state error such as
NFS4ERR_BAD_STATEID, or even a stale inode error, then we do want
to clear out all the remaining layout segments and mark that stateid
as invalid.

Fixes: 1c5bd76d17cca ("pNFS: Enable layoutreturn operation for...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:17:42 -04:00
581057c834 NFS: remove unused check for negative dentry
This check has been hanging out since we used to have parallel paths to add
dentry in nfs_create(), but that hasn't been the case for some years.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:15:24 -04:00
17fd6e457b NFSv3: use nfs_add_or_obtain() to create and reference inodes
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:15:24 -04:00
406cd91533 NFS: Refactor nfs_instantiate() for dentry referencing callers
Since commit b0c6108ecf64 ("nfs_instantiate(): prevent multiple aliases for
directory inode"), nfs_instantiate() may succeed without actually
instantiating the dentry that was passed in.  That can be problematic for
some callers in NFSv3, so this patch breaks things up so we can get the
actual dentry obtained.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-20 15:15:24 -04:00
f836b27eca NFS: Have nfs4_proc_get_lease_time() call nfs4_call_sync_custom()
This removes some code duplication, since both functions were doing the
same thing.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22 10:05:35 -04:00
cc15e24a3a NFS: Have nfs41_proc_secinfo_no_name() call nfs4_call_sync_custom()
We need to use the custom rpc_task_setup here to set the
RPC_TASK_NO_ROUND_ROBIN flag on the RPC call.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22 10:05:30 -04:00
4c952e3d1b NFS: Have nfs41_proc_reclaim_complete() call nfs4_call_sync_custom()
An async call followed by an rpc_wait_for_completion() is basically the
same as a synchronous call, so we can use nfs4_call_sync_custom() to
keep our custom callback ops and the RPC_TASK_NO_ROUND_ROBIN flag.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22 10:05:21 -04:00
50493364e7 NFS: Have _nfs4_proc_secinfo() call nfs4_call_sync_custom()
We do this to set the RPC_TASK_NO_ROUND_ROBIN flag in the task_setup
structure

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22 10:05:16 -04:00
dae40965d5 NFS: Have nfs4_proc_setclientid() call nfs4_call_sync_custom()
Rather than running the task manually

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22 10:05:11 -04:00
48c058543c NFS: Add an nfs4_call_sync_custom() function
There are a few cases where we need to manually configure the
rpc_task_setup structure to get the behavior we want.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-22 10:05:06 -04:00
1e672e3644 NFSv4: Fix a memory leak bug
In nfs4_try_migration(), if nfs4_begin_drain_session() fails, the
previously allocated 'page' and 'locations' are not deallocated, leading to
memory leaks. To fix this issue, go to the 'out' label to free 'page' and
'locations' before returning the error.

Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-21 16:39:29 -04:00
e2751463ea fs: nfs: Fix possible null-pointer dereferences in encode_attrs()
In encode_attrs(), there is an if statement on line 1145 to check
whether label is NULL:
    if (label && (attrmask[2] & FATTR4_WORD2_SECURITY_LABEL))

When label is NULL, it is used on lines 1178-1181:
    *p++ = cpu_to_be32(label->lfs);
    *p++ = cpu_to_be32(label->pi);
    *p++ = cpu_to_be32(label->len);
    p = xdr_encode_opaque_fixed(p, label->label, label->len);

To fix these bugs, label is checked before being used.

These bugs are found by a static analysis tool STCheck written by us.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-08-20 09:30:50 -04:00
67e7b52d44 NFSv4: Ensure state recovery handles ETIMEDOUT correctly
Ensure that the state recovery code handles ETIMEDOUT correctly,
and also that we set RPC_TASK_TIMEOUT when recovering open state.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-07 12:55:11 -04:00
dea1bb35c5 NFS: Fix regression whereby fscache errors are appearing on 'nofsc' mounts
People are reporing seeing fscache errors being reported concerning
duplicate cookies even in cases where they are not setting up fscache
at all. The rule needs to be that if fscache is not enabled, then it
should have no side effects at all.

To ensure this is the case, we disable fscache completely on all superblocks
for which the 'fsc' mount option was not set. In order to avoid issues
with '-oremount', we also disable the ability to turn fscache on via
remount.

Fixes: f1fe29b4a02d ("NFS: Use i_writecount to control whether...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200145
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Steve Dickson <steved@redhat.com>
Cc: David Howells <dhowells@redhat.com>
2019-08-04 22:35:41 -04:00
09a54f0ebf NFSv4: Fix an Oops in nfs4_do_setattr
If the user specifies an open mode of 3, then we don't have a NFSv4 state
attached to the context, and so we Oops when we try to dereference it.

Reported-by: Olga Kornievskaia <aglo@umich.edu>
Fixes: 29b59f9416937 ("NFSv4: change nfs4_do_setattr to take...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.10: 991eedb1371dc: NFSv4: Only pass the...
Cc: stable@vger.kernel.org # v4.10+
2019-08-04 22:35:41 -04:00
c77e22834a NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()
John Hubbard reports seeing the following stack trace:

nfs4_do_reclaim
   rcu_read_lock /* we are now in_atomic() and must not sleep */
       nfs4_purge_state_owners
           nfs4_free_state_owner
               nfs4_destroy_seqid_counter
                   rpc_destroy_wait_queue
                       cancel_delayed_work_sync
                           __cancel_work_timer
                               __flush_work
                                   start_flush_work
                                       might_sleep:
                                        (kernel/workqueue.c:2975: BUG)

The solution is to separate out the freeing of the state owners
from nfs4_purge_state_owners(), and perform that outside the atomic
context.

Reported-by: John Hubbard <jhubbard@nvidia.com>
Fixes: 0aaaf5c424c7f ("NFS: Cache state owners after files are closed")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-04 22:35:40 -04:00
e3c8dc761e NFSv4: Check the return value of update_open_stateid()
Ensure that we always check the return value of update_open_stateid()
so that we can retry if the update of local state failed. This fixes
infinite looping on state recovery.

Fixes: e23008ec81ef3 ("NFSv4 reduce attribute requests for open reclaim")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v3.7+
2019-08-04 22:35:40 -04:00
ad11408970 NFSv4.1: Only reap expired delegations
Fix nfs_reap_expired_delegations() to ensure that we only reap delegations
that are actually expired, rather than triggering on random errors.

Fixes: 45870d6909d5a ("NFSv4.1: Test delegation stateids when server...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-04 22:35:40 -04:00
27a30cf64a NFSv4.1: Fix open stateid recovery
The logic for checking in nfs41_check_open_stateid() whether the state
is supported by a delegation is inverted. In addition, it makes more
sense to perform that check before we check for expired locks.

Fixes: 8a64c4ef106d1 ("NFSv4.1: Even if the stateid is OK,...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-04 22:35:40 -04:00
731c74dd98 NFSv4: Report the error from nfs4_select_rw_stateid()
In pnfs_update_layout() ensure that we do report any fatal errors from
nfs4_select_rw_stateid().

Fixes: d9aba2b40de6 ("NFSv4: Don't use the zero stateid with layoutget")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-04 22:35:40 -04:00
c34fae003c NFSv4: When recovering state fails with EAGAIN, retry the same recovery
If the server returns with EAGAIN when we're trying to recover from
a server reboot, we currently delay for 1 second, but then mark the
stateid as needing recovery after the grace period has expired.

Instead, we should just retry the same recovery process immediately
after the 1 second delay. Break out of the loop after 10 retries.

Fixes: 35a61606a612 ("NFS: Reduce indentation of the switch statement...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-04 22:35:40 -04:00
86dbd08b32 NFSv4: Print an error in the syslog when state is marked as irrecoverable
When error recovery fails due to a fatal error on the server, ensure
we log it in the syslog.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-04 22:35:40 -04:00
5eb8d18ca0 NFSv4: Fix delegation state recovery
Once we clear the NFS_DELEGATED_STATE flag, we're telling
nfs_delegation_claim_opens() that we're done recovering all open state
for that stateid, so we really need to ensure that we test for all
open modes that are currently cached and recover them before exiting
nfs4_open_delegation_recall().

Fixes: 24311f884189d ("NFSv4: Recovery of recalled read delegations...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.3+
2019-08-04 22:35:40 -04:00
8c39a39e28 NFSv4: Fix a credential refcount leak in nfs41_check_delegation_stateid
It is unsafe to dereference delegation outside the rcu lock, and in
any case, the refcount is guaranteed held if cred is non-zero.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-08-04 22:35:40 -04:00
18253e034d Merge branch 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache and mountpoint updates from Al Viro:
 "Saner handling of refcounts to mountpoints.

  Transfer the counting reference from struct mount ->mnt_mountpoint
  over to struct mountpoint ->m_dentry. That allows us to get rid of the
  convoluted games with ordering of mount shutdowns.

  The cost is in teaching shrink_dcache_{parent,for_umount} to cope with
  mixed-filesystem shrink lists, which we'll also need for the Slab
  Movable Objects patchset"

* 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch the remnants of releasing the mountpoint away from fs_pin
  get rid of detach_mnt()
  make struct mountpoint bear the dentry reference to mountpoint, not struct mount
  Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists
  fs/namespace.c: shift put_mountpoint() to callers of unhash_mnt()
  __detach_mounts(): lookup_mountpoint() can't return ERR_PTR() anymore
  nfs: dget_parent() never returns NULL
  ceph: don't open-code the check for dead lockref
2019-07-20 09:15:51 -07:00
6860c981b9 NFS client updates for Linux 5.3
Highlights include:
 
 Stable fixes:
 - SUNRPC: Ensure bvecs are re-synced when we re-encode the RPC request
 - Fix an Oops in ff_layout_track_ds_error due to a PTR_ERR() dereference
 - Revert buggy NFS readdirplus optimisation
 - NFSv4: Handle the special Linux file open access mode
 - pnfs: Fix a problem where we gratuitously start doing I/O through the MDS
 
 Features:
 - Allow NFS client to set up multiple TCP connections to the server using
    a new 'nconnect=X' mount option. Queue length is used to balance load.
 - Enhance statistics reporting to report on all transports when using
    multiple connections.
 - Speed up SUNRPC by removing bh-safe spinlocks
 - Add a mechanism to allow NFSv4 to request that containers set a
    unique per-host identifier for when the hostname is not set.
 - Ensure NFSv4 updates the lease_time after a clientid update
 
 Bugfixes and cleanup:
 - Fix use-after-free in rpcrdma_post_recvs
 - Fix a memory leak when nfs_match_client() is interrupted
 - Fix buggy file access checking in NFSv4 open for execute
 - disable unsupported client side deduplication
 - Fix spurious client disconnections
 - Fix occasional RDMA transport deadlock
 - Various RDMA cleanups
 - Various tracepoint fixes
 - Fix the TCP callback channel to guarantee the server can actually send
    the number of callback requests that was negotiated at mount time.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl0wz/EACgkQZwvnipYK
 APJUrQ/9FpbWTaI/H0BkBf+/iIDAwuGozldDDRkHpHvykvcbNSFPNk5CA4fV/lCq
 rI0tD9ijj0KJhAjS1RWLb9oXMUm/MDIIcZNvsbOiIfGftzGnxAz022KS2JyV+Xk6
 ul7sCQ8QmbQJzfZk+aFZxl5slS9i5nM+Sa//k5SK3JghOlLuYkW50OlmE6ME8rKP
 sLWD2quLh1J5mjG+B9ml4YWozNi0elFN6ZqdzO/SjIhIG7rn2Hkr0glWcEprdiS7
 qHoF6dIn80oR37YNrcbBtEGqBnMnWkpGAGq2Hgf5I2YtO+xfEgMqqs9FNIcahtQG
 SGa1bnpcys9+fG4fnwzgOgYkvGzKdi/bIYc8lhBowl+76z+6VgjwlwT3wRtnWfO9
 288LZDMMfOSYoDkbA9iGtoQRviIsoGMoeH/r5tZe7U1W35h60NGPVhH9sP9/Gc2s
 Va1xBYK6PaIg+UxeghJmCnNK3zANyEI7AtDVvn0wsJuDcgXO5/PAANenc6/Fl6nf
 U9XB4wmgaM2jN5uqLYT3bJAaAPAXTOayVnnd9oDYb87/31sYM6iwzSPkmm5+qIsi
 0Ftks4PSjpjTTrmUvs3CL+ll3Y9bWUd58qlooGwi6u1DnXDQDzbIoRQBezNFJ+Lc
 MahUVY7w9ATyJFJRvbscO7mRS8qTmr4qP91bAOsYQ0P2Yo02L7k=
 =eR7d
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:

   - SUNRPC: Ensure bvecs are re-synced when we re-encode the RPC
     request

   - Fix an Oops in ff_layout_track_ds_error due to a PTR_ERR()
     dereference

   - Revert buggy NFS readdirplus optimisation

   - NFSv4: Handle the special Linux file open access mode

   - pnfs: Fix a problem where we gratuitously start doing I/O through
     the MDS

  Features:

   - Allow NFS client to set up multiple TCP connections to the server
     using a new 'nconnect=X' mount option. Queue length is used to
     balance load.

   - Enhance statistics reporting to report on all transports when using
     multiple connections.

   - Speed up SUNRPC by removing bh-safe spinlocks

   - Add a mechanism to allow NFSv4 to request that containers set a
     unique per-host identifier for when the hostname is not set.

   - Ensure NFSv4 updates the lease_time after a clientid update

  Bugfixes and cleanup:

   - Fix use-after-free in rpcrdma_post_recvs

   - Fix a memory leak when nfs_match_client() is interrupted

   - Fix buggy file access checking in NFSv4 open for execute

   - disable unsupported client side deduplication

   - Fix spurious client disconnections

   - Fix occasional RDMA transport deadlock

   - Various RDMA cleanups

   - Various tracepoint fixes

   - Fix the TCP callback channel to guarantee the server can actually
     send the number of callback requests that was negotiated at mount
     time"

* tag 'nfs-for-5.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (68 commits)
  pnfs/flexfiles: Add tracepoints for detecting pnfs fallback to MDS
  pnfs: Fix a problem where we gratuitously start doing I/O through the MDS
  SUNRPC: Optimise transport balancing code
  SUNRPC: Ensure the bvecs are reset when we re-encode the RPC request
  pnfs/flexfiles: Fix PTR_ERR() dereferences in ff_layout_track_ds_error
  NFSv4: Don't use the zero stateid with layoutget
  SUNRPC: Fix up backchannel slot table accounting
  SUNRPC: Fix initialisation of struct rpc_xprt_switch
  SUNRPC: Skip zero-refcount transports
  SUNRPC: Replace division by multiplication in calculation of queue length
  NFSv4: Validate the stateid before applying it to state recovery
  nfs4.0: Refetch lease_time after clientid update
  nfs4: Rename nfs41_setup_state_renewal
  nfs4: Make nfs4_proc_get_lease_time available for nfs4.0
  nfs: Fix copy-and-paste error in debug message
  NFS: Replace 16 seq_printf() calls by seq_puts()
  NFS: Use seq_putc() in nfs_show_stats()
  Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
  SUNRPC: Fix transport accounting when caller specifies an rpc_xprt
  NFS: Record task, client ID, and XID in xdr_status trace points
  ...
2019-07-18 14:32:33 -07:00
d5b9216fd5 pnfs/flexfiles: Add tracepoints for detecting pnfs fallback to MDS
Add tracepoints to allow debugging of the event chain leading to
a pnfs fallback to doing I/O through the MDS.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-18 15:50:28 -04:00
58bbeab425 pnfs: Fix a problem where we gratuitously start doing I/O through the MDS
If the client has to stop in pnfs_update_layout() to wait for another
layoutget to complete, it currently exits and defaults to I/O through
the MDS if the layoutget was successful.

Fixes: d03360aaf5cc ("pNFS: Ensure we return the error if someone kills...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.20+
2019-07-18 15:33:42 -04:00
8e04fdfadd pnfs/flexfiles: Fix PTR_ERR() dereferences in ff_layout_track_ds_error
mirror->mirror_ds can be NULL if uninitialised, but can contain
a PTR_ERR() if call to GETDEVICEINFO failed.

Fixes: 65990d1afbd2 ("pNFS/flexfiles: Fix a deadlock on LAYOUTGET")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # 4.10+
2019-07-18 14:43:52 -04:00
d9aba2b40d NFSv4: Don't use the zero stateid with layoutget
The NFSv4.1 protocol explicitly forbids us from using the zero stateid
together with layoutget, so when we see that nfs4_select_rw_stateid()
is unable to return a valid delegation, lock or open stateid, then
we should initiate recovery and retry.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-18 14:43:52 -04:00
7402a4fedc SUNRPC: Fix up backchannel slot table accounting
Add a per-transport maximum limit in the socket case, and add
helpers to allow the NFSv4 code to discover that limit.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-18 01:12:59 -04:00
50c8000744 NFSv4: Validate the stateid before applying it to state recovery
If the stateid is the zero or invalid stateid, then it is pointless
to attempt to use it for recovery. In that case, try to fall back
to using the open state stateid, or just doing a general recovery
of all state on a given inode.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-15 10:11:20 -04:00
5b596830d9 nfs4.0: Refetch lease_time after clientid update
RFC 7530 requires us to refetch the lease time attribute once a new
clientID is established. This is already implemented for the
nfs4.1(+) clients by nfs41_init_clientid, which calls
nfs41_finish_session_reset, which calls nfs4_setup_state_renewal.

To make nfs4_setup_state_renewal available for nfs4.0, move it
further to the top of the source file to include it regardles of
CONFIG_NFS_V4_1 and to save a forward declaration.

Call nfs4_setup_state_renewal from nfs4_init_clientid.

Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-13 11:48:41 -04:00
ea51efaa96 nfs4: Rename nfs41_setup_state_renewal
The function nfs41_setup_state_renewal is useful to the nfs 4.0 client
as well, so rename the function to nfs4_setup_state_renewal.

Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-13 11:48:41 -04:00
0efb01b2ac nfs4: Make nfs4_proc_get_lease_time available for nfs4.0
Compile nfs4_proc_get_lease_time, enc_get_lease_time and
dec_get_lease_time for nfs4.0. Use nfs4_sequence_done instead of
nfs41_sequence_done in nfs4_proc_get_lease_time,

Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-13 11:48:41 -04:00
2eaf426deb nfs: Fix copy-and-paste error in debug message
The debug message of decode_attr_lease_time incorrectly
says "file size". Fix it to "lease time".

Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-13 11:48:41 -04:00
1c316e39a0 NFS: Replace 16 seq_printf() calls by seq_puts()
Some strings should be put into a sequence.
Thus use the corresponding function “seq_puts”.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-13 11:47:49 -04:00
9bcaa35c68 NFS: Use seq_putc() in nfs_show_stats()
A single character (line break) should be put into a sequence.
Thus use the corresponding function “seq_putc”.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-13 11:47:49 -04:00
db531db951 Revert "NFS: readdirplus optimization by cache mechanism" (memleak)
This reverts commit be4c2d4723a4a637f0d1b4f7c66447141a4b3564.

That commit caused a severe memory leak in nfs_readdir_make_qstr().

When listing a directory with more than 100 files (this is how many
struct nfs_cache_array_entry elements fit in one 4kB page), all
allocated file name strings past those 100 leak.

The root of the leakage is that those string pointers are managed in
pages which are never linked into the page cache.

fs/nfs/dir.c puts pages into the page cache by calling
read_cache_page(); the callback function nfs_readdir_filler() will
then fill the given page struct which was passed to it, which is
already linked in the page cache (by do_read_cache_page() calling
add_to_page_cache_lru()).

Commit be4c2d4723a4 added another (local) array of allocated pages, to
be filled with more data, instead of discarding excess items received
from the NFS server.  Those additional pages can be used by the next
nfs_readdir_filler() call (from within the same nfs_readdir() call).

The leak happens when some of those additional pages are never used
(copied to the page cache using copy_highpage()).  The pages will be
freed by nfs_readdir_free_pages(), but their contents will not.  The
commit did not invoke nfs_readdir_clear_array() (and doing so would
have been dangerous, because it did not track which of those pages
were already copied to the page cache, risking double free bugs).

How to reproduce the leak:

- Use a kernel with CONFIG_SLUB_DEBUG_ON.

- Create a directory on a NFS mount with more than 100 files with
  names long enough to use the "kmalloc-32" slab (so we can easily
  look up the allocation counts):

  for i in `seq 110`; do touch ${i}_0123456789abcdef; done

- Drop all caches:

  echo 3 >/proc/sys/vm/drop_caches

- Check the allocation counter:

  grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
  30564391 nfs_readdir_add_to_array+0x73/0xd0 age=534558/4791307/6540952 pid=370-1048386 cpus=0-47 nodes=0-1

- Request a directory listing and check the allocation counters again:

  ls
  [...]
  grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
  30564511 nfs_readdir_add_to_array+0x73/0xd0 age=207/4792999/6542663 pid=370-1048386 cpus=0-47 nodes=0-1

There are now 120 new allocations.

- Drop all caches and check the counters again:

  echo 3 >/proc/sys/vm/drop_caches
  grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
  30564401 nfs_readdir_add_to_array+0x73/0xd0 age=735/4793524/6543176 pid=370-1048386 cpus=0-47 nodes=0-1

110 allocations are gone, but 10 have leaked and will never be freed.

Unhelpfully, those allocations are explicitly excluded from KMEMLEAK,
that's why my initial attempts with KMEMLEAK were not successful:

	/*
	 * Avoid a kmemleak false positive. The pointer to the name is stored
	 * in a page cache page which kmemleak does not scan.
	 */
	kmemleak_not_leak(string->name);

It would be possible to solve this bug without reverting the whole
commit:

- keep track of which pages were not used, and call
  nfs_readdir_clear_array() on them, or
- manually link those pages into the page cache

But for now I have decided to just revert the commit, because the real
fix would require complex considerations, risking more dangerous
(crash) bugs, which may seem unsuitable for the stable branches.

Signed-off-by: Max Kellermann <mk@cm4all.com>
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-12 16:01:37 -04:00
347543e640 Merge tag 'nfs-rdma-for-5.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
NFSoRDMA client updates for 5.3

New features:
- Add a way to place MRs back on the free list
- Reduce context switching
- Add new trace events

Bugfixes and cleanups:
- Fix a BUG when tracing is enabled with NFSv4.1
- Fix a use-after-free in rpcrdma_post_recvs
- Replace use of xdr_stream_pos in rpcrdma_marshal_req
- Fix occasional transport deadlock
- Fix show_nfs_errors macros, other tracing improvements
- Remove RPCRDMA_REQ_F_PENDING and fr_state
- Various simplifications and refactors
2019-07-12 12:11:01 -04:00
40f06c7995 Changes to copy_file_range for 5.3 from Dave and Amir:
- Create a generic copy_file_range handler and make individual
   filesystems responsible for calling it (i.e. no more assuming that
   do_splice_direct will work or is appropriate)
 - Refactor copy_file_range and remap_range parameter checking where they
   are the same
 - Install missing copy_file_range parameter checking(!)
 - Remove suid/sgid and update mtime like any other file write
 - Change the behavior so that a copy range crossing the source file's
   eof will result in a short copy to the source file's eof instead of
   EINVAL
 - Permit filesystems to decide if they want to handle cross-superblock
   copy_file_range in their local handlers.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0BGvAACgkQ+H93GTRK
 tOu2aw/+KGG7PiXm9ED3ZXUppKVddrZMOgqM7mSfHo6TBgW3pJUJcRIhawK0Wz/P
 stgTsOkurHSl3iT3vQyX4GTZvLoGN/rfsRLPxogJptBUqVv3BOrXsrI53f7V/kbm
 rtjlYsgExji7VBUiMTe5kOWWqxyR7B4nXyvY/8rier57rW/8C1I58B0OrxAmTK0k
 rz1e5BtE1dg91xA7cSdEc38FInz8MW8cvsrEzW9vyYY4IVE0PBuhhA1EvryxTrAZ
 hfthHFfzwxhJkI0mdha8uqNufNWrHLSqiwyjYC7pwAwSQzQPiQz9U17flu+URnfF
 kXaR5LdXbBP3pl46RdthrfuonWsv612cC1Qwfjs8PBG9lG7b9PGJ40MGVTiw7LlQ
 924/03ho0zAnV0E8Qn5O9nPshQNDJhwhzMS39EmMyFKb1D5XGzdMV0gDdIfx6hdO
 HDbw6VQ33S59gvk7v/gxsFB5Bs4PKfamHx/QmwQwpqWM5XExcr0yJ90OTBtAuY4r
 S+9gwG6uED3aPh8HbQ5UgnA8bZmMmi8AkcBvqJ9GgNw5SbZl0oyv9Sj6JNpoOejV
 8y9JkhoZUxqiihnKTw/vtMrj5RCOfifNBjMSwrShfLdLKtK0AZl1mXC0/1Q3VnEQ
 TUcyRHEzrtHgJ9/AK9xIyDNvNYzvHSLZj7maoZZumgQa2FOFrmw=
 =qM44
 -----END PGP SIGNATURE-----

Merge tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull copy_file_range updates from Darrick Wong:
 "This fixes numerous parameter checking problems and inconsistent
  behaviors in the new(ish) copy_file_range system call.

  Now the system call will actually check its range parameters
  correctly; refuse to copy into files for which the caller does not
  have sufficient privileges; update mtime and strip setuid like file
  writes are supposed to do; and allows copying up to the EOF of the
  source file instead of failing the call like we used to.

  Summary:

   - Create a generic copy_file_range handler and make individual
     filesystems responsible for calling it (i.e. no more assuming that
     do_splice_direct will work or is appropriate)

   - Refactor copy_file_range and remap_range parameter checking where
     they are the same

   - Install missing copy_file_range parameter checking(!)

   - Remove suid/sgid and update mtime like any other file write

   - Change the behavior so that a copy range crossing the source file's
     eof will result in a short copy to the source file's eof instead of
     EINVAL

   - Permit filesystems to decide if they want to handle
     cross-superblock copy_file_range in their local handlers"

* tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fuse: copy_file_range needs to strip setuid bits and update timestamps
  vfs: allow copy_file_range to copy across devices
  xfs: use file_modified() helper
  vfs: introduce file_modified() helper
  vfs: add missing checks to copy_file_range
  vfs: remove redundant checks from generic_remap_checks()
  vfs: introduce generic_file_rw_checks()
  vfs: no fallback for ->copy_file_range
  vfs: introduce generic_copy_file_range()
2019-07-10 20:32:37 -07:00