Commit Graph

67449 Commits

Author SHA1 Message Date
Trond Myklebust
5c4afe2ab6 NFSv4.2: decode_read_plus_data() must skip padding after data segment
All XDR opaque object sizes are 32-bit aligned, and a data segment is no
exception.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust
1ee6310119 NFSv4.2: Ensure we always reset the result->count in decode_read_plus()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Geliang Tang
1f70ea7009 NFSv4.1: use BITS_PER_LONG macro in nfs4session.h
Use the existing BITS_PER_LONG macro instead of calculating the value.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Frank van der Linden
a1f26739cc NFSv4.2: improve page handling for GETXATTR
XDRBUF_SPARSE_PAGES can cause problems for the RDMA transport,
and it's easy enough to allocate enough pages for the request
up front, so do that.

Also, since we've allocated the pages anyway, use the full
page aligned length for the receive buffer. This will allow
caching of valid replies that are too large for the caller,
but that still fit in the allocated pages.

Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust
fa94a951bf NFSv4.2: Fix up the get/listxattr calls to rpc_prepare_reply_pages()
Ensure that both getxattr and listxattr page array are correctly
aligned, and that getxattr correctly accounts for the page padding word.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-10 09:01:53 -05:00
NeilBrown
bf701b765e NFS: switch nfsiod to be an UNBOUND workqueue.
nfsiod is currently a concurrency-managed workqueue (CMWQ).
This means that workitems scheduled to nfsiod on a given CPU are queued
behind all other work items queued on any CMWQ on the same CPU.  This
can introduce unexpected latency.

Occaionally nfsiod can even cause excessive latency.  If the work item
to complete a CLOSE request calls the final iput() on an inode, the
address_space of that inode will be dismantled.  This takes time
proportional to the number of in-memory pages, which on a large host
working on large files (e.g..  5TB), can be a large number of pages
resulting in a noticable number of seconds.

We can avoid these latency problems by switching nfsiod to WQ_UNBOUND.
This causes each concurrent work item to gets a dedicated thread which
can be scheduled to an idle CPU.

There is precedent for this as several other filesystems use WQ_UNBOUND
workqueue for handling various async events.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: ada609ee2a ("workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Calum Mackay
9b82d88d59 lockd: don't use interval-based rebinding over TCP
NLM uses an interval-based rebinding, i.e. it clears the transport's
binding under certain conditions if more than 60 seconds have elapsed
since the connection was last bound.

This rebinding is not necessary for an autobind RPC client over a
connection-oriented protocol like TCP.

It can also cause problems: it is possible for nlm_bind_host() to clear
XPRT_BOUND whilst a connection worker is in the middle of trying to
reconnect, after it had already been checked in xprt_connect().

When the connection worker notices that XPRT_BOUND has been cleared
under it, in xs_tcp_finish_connecting(), that results in:

	xs_tcp_setup_socket: connect returned unhandled error -107

Worse, it's possible that the two can get into lockstep, resulting in
the same behaviour repeated indefinitely, with the above error every
300 seconds, without ever recovering, and the connection never being
established. This has been seen in practice, with a large number of NLM
client tasks, following a server restart.

The existing callers of nlm_bind_host & nlm_rebind_host should not need
to force the rebind, for TCP, so restrict the interval-based rebinding
to UDP only.

For TCP, we will still rebind when needed, e.g. on timeout, and connection
error (including closure), since connection-related errors on an existing
connection, ECONNREFUSED when trying to connect, and rpc_check_timeout(),
already unconditionally clear XPRT_BOUND.

To avoid having to add the fix, and explanation, to both nlm_bind_host()
and nlm_rebind_host(), remove the duplicate code from the former, and
have it call the latter.

Drop the dprintk, which adds no value over a trace.

Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Fixes: 35f5a422ce ("SUNRPC: new interface to force an RPC rebind")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Sargun Dhillon
d3ff46fe69 NFSv4: Refactor to use user namespaces for nfs4idmap
In several patches work has been done to enable NFSv4 to use user
namespaces:
58002399da: NFSv4: Convert the NFS client idmapper to use the container user namespace
3b7eb5e35d: NFS: When mounting, don't share filesystems between different user namespaces

Unfortunately, the userspace APIs were only such that the userspace facing
side of the filesystem (superblock s_user_ns) could be set to a non init
user namespace. This furthers the fs_context related refactoring, and
piggybacks on top of that logic, so the superblock user namespace, and the
NFS user namespace are the same.

Users can still use rpc.idmapd if they choose to, but there are complexities
with user namespaces and request-key that have yet to be addresssed.

Eventually, we will need to at least:
  * Come up with an upcall mechanism that can be triggered inside of the container,
    or safely triggered outside, with the requisite context to do the right
    mapping. * Handle whatever refactoring needs to be done in net/sunrpc.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d088c ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Sargun Dhillon
d18a9d3fa0 NFS: NFSv2/NFSv3: Use cred from fs_context during mount
There was refactoring done to use the fs_context for mounting done in:
62a55d088c: NFS: Additional refactoring for fs_context conversion

This made it so that the net_ns is fetched from the fs_context (the netns
that fsopen is called in). This change also makes it so that the credential
fetched during fsopen is used as well as the net_ns.

NFS has already had a number of changes to prepare it for user namespaces:
1a58e8a0e5: NFS: Store the credential of the mount process in the nfs_server
264d948ce7: NFS: Convert NFSv3 to use the container user namespace
c207db2f5d: NFS: Convert NFSv2 to use the container user namespace

Previously, different credentials could be used for creation of the
fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did
the actual credential check, and that's where current_creds() were fetched.
This meant that the user namespace which fsopen was called in could be a
non-init user namespace. This still requires that the user that calls
FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns.

This roughly allows a privileged user to mount on behalf of an unprivileged
usernamespace, by forking off and calling fsopen in the unprivileged user
namespace. It can then pass back that fsfd to the privileged process which
can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE
before switching back into the mount namespace of the container, and finish
up the mounting process and call fsmount and move_mount.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d088c ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Trond Myklebust
b6d49ecd10 NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode
When returning the layout in nfs4_evict_inode(), we need to ensure that
the layout is actually done being freed before we can proceed to free the
inode itself.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Trond Myklebust
17068466ad NFSv4: Fix open coded xdr_stream_remaining()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Trond Myklebust
9ed5af268e SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages()
rpc_prepare_reply_pages() currently expects the 'hdrsize' argument to
contain the length of the data that we expect to want placed in the head
kvec plus a count of 1 word of padding that is placed after the page data.
This is very confusing when trying to read the code, and sometimes leads
to callers adding an arbitrary value of '1' just in order to satisfy the
requirement (whether or not the page data actually needs such padding).

This patch aims to clarify the code by changing the 'hdrsize' argument
to remove that 1 word of padding. This means we need to subtract the
padding from all the existing callers.

Fixes: 02ef04e432 ("NFS: Account for XDR pad of buf->pages")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
046e5ccb41 NFSv4: Fix the alignment of page data in the getdeviceinfo reply
We can fit the device_addr4 opaque data padding in the pages.

Fixes: cf500bac8f ("SUNRPC: Introduce rpc_prepare_reply_pages()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
9889981349 pNFS: Clean up open coded xdr string decoding
Use the existing xdr_stream_decode_string_dup() to safely decode into
kmalloced strings.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
9a7016319e pNFS/flexfiles: Fix up layoutstats reporting for non-TCP transports
Ensure that we report the correct netid when using UDP or RDMA
transports to the DSes.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
4be78d2681 NFSv4/pNFS: Store the transport type in struct nfs4_pnfs_ds_addr
We want to enable RDMA and UDP as valid transport methods if a
GETDEVICEINFO call specifies it. Do so by adding a parser for the
netid that translates it to an appropriate argument for the RPC
transport layer.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
190c75a31f pNFS: Add helpers for allocation/free of struct nfs4_pnfs_ds_addr
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
a12f996d34 NFSv4/pNFS: Use connections to a DS that are all of the same protocol family
If the pNFS metadata server advertises multiple addresses for the same
data server, we should try to connect to just one protocol family and
transport type on the assumption that homogeneity will improve performance.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
1c3695d0bb NFS: Switch mount code to use xprt_find_transport_ident()
Switch the mount code to use xprt_find_transport_ident() and to check
the results before allowing the mount to proceed.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
794092c57f NFS: Do uncached readdir when we're seeking a cookie in an empty page cache
If the directory is changing, causing the page cache to get invalidated
while we are listing the contents, then the NFS client is currently forced
to read in the entire directory contents from scratch, because it needs
to perform a linear search for the readdir cookie. While this is not
an issue for small directories, it does not scale to directories with
millions of entries.
In order to be able to deal with large directories that are changing,
add a heuristic to ensure that if the page cache is empty, and we are
searching for a cookie that is not the zero cookie, we just default to
performing uncached readdir.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
35df59d3ef NFS: Reduce number of RPC calls when doing uncached readdir
If we're doing uncached readdir, allocate multiple pages in order to
try to avoid duplicate RPC calls for the same getdents() call.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
762567b7c7 NFS: Optimisations for monotonically increasing readdir cookies
If the server is handing out monotonically increasing readdir cookie values,
then we can optimise away searches through pages that contain cookies that
lie outside our search range.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
b593c09f83 NFS: Improve handling of directory verifiers
If the server insists on using the readdir verifiers in order to allow
cookies to expire, then we should ensure that we cache the verifier
with the cookie, so that we can return an error if the application
tries to use the expired cookie.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
9fff59ed4c NFS: Handle NFS4ERR_NOT_SAME and NFSERR_BADCOOKIE from readdir calls
If the server returns NFS4ERR_NOT_SAME or tells us that the cookie is
bad in response to a READDIR call, then we should empty the page cache
so that we can fill it from scratch again.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
82e22a5e62 NFS: Allow the NFS generic code to pass in a verifier to readdir
If we're ever going to allow support for servers that use the readdir
verifier, then that use needs to be managed by the middle layers as
those need to be able to reject cookies from other verifiers.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
6c981eff23 NFS: Cleanup to remove nfs_readdir_descriptor_t typedef
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
6b75cf9e30 NFS: Reduce readdir stack usage
The descriptor and the struct nfs_entry are both large structures,
so don't allocate them from the stack.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
dbeaf8c984 NFS: nfs_do_filldir() does not return a value
Clean up nfs_do_filldir().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
93b8959a0a NFS: More readdir cleanups
Remove the redundant caching of the credential in struct
nfs_open_dir_context.
Pass the buffer size as an argument to nfs_readdir_xdr_filler().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
1a34c8c9a4 NFS: Support larger readdir buffers
Support readdir buffers of up to 1MB in size so that we can read
large directories using few RPC calls.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
a52a8a6ada NFS: Simplify struct nfs_cache_array_entry
We don't need to store a hash, so replace struct qstr with a simple
const char pointer and length.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
ed09222d65 NFS: Replace kmap() with kmap_atomic() in nfs_readdir_search_array()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
e762a63981 NFS: Remove unnecessary kmap in nfs_readdir_xdr_to_array()
The kmapped pointer is only used once per loop to check if we need to
exit.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
3b2a09f127 NFS: Don't discard readdir results
If a readdir call returns more data than we can fit into one page
cache page, then allocate a new one for that data rather than
discarding the data.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
1f1d4aa4e4 NFS: Clean up directory array handling
Refactor to use pagecache_get_page() so that we can fill the page
in multiple stages.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
972bcdf233 NFS: Clean up nfs_readdir_page_filler()
Clean up handling of the case where there are no entries in the readdir
reply.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
b1e21c9743 NFS: Clean up readdir struct nfs_cache_array
Since the 'eof_index' is only ever used as a flag, make it so.
Also add a flag to detect if the page has been completely filled.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
2e7a464179 NFS: Ensure contents of struct nfs_open_dir_context are consistent
Ensure that the contents of struct nfs_open_dir_context are consistent
by setting them under the file->f_lock from a private copy (that is
known to be consistent).

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
2020-12-02 14:05:51 -05:00
Olga Kornievskaia
05ad917561 NFSv4.2: condition READDIR's mask for security label based on LSM state
Currently, the client will always ask for security_labels if the server
returns that it supports that feature regardless of any LSM modules
(such as Selinux) enforcing security policy. This adds performance
penalty to the READDIR operation.

Client adjusts superblock's support of the security_label based on
the server's support but also current client's configuration of the
LSM modules. Thus, prior to using the default bitmask in READDIR,
this patch checks the server's capabilities and then instructs
READDIR to remove FATTR4_WORD2_SECURITY_LABEL from the bitmask.

v5: fixing silly mistakes of the rushed v4
v4: simplifying logic
v3: changing label's initialization per Ondrej's comment
v2: dropping selinux hook and using the sb cap.

Suggested-by: Ondrej Mosnacek <omosnace@redhat.com>
Suggested-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 2b0143b5c9 ("VFS: normal filesystems (and lustre): d_inode() annotations")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
76998ebb91 NFSv4: Observe the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp
We need to respect the NFS_MOUNT_SOFTREVAL flag in _nfs4_proc_lookupp,
by timing out if the server is unavailable.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
3c5e9a59fa NFSv3: Add emulation of the lookupp() operation
In order to use the open_by_filehandle() operations on NFSv3, we need
to be able to emulate lookupp() so that nfs_get_parent() can be used
to convert disconnected dentries into connected ones.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:51 -05:00
Trond Myklebust
5f447cb881 NFSv3: Refactor nfs3_proc_lookup() to split out the dentry
We want to reuse the lookup code in NFSv3 in order to emulate the
NFSv4 lookupp operation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:51 -05:00
Dai Ngo
bd75475c2f NFSv4.2: Fix 5 seconds delay when doing inter server copy
Since commit b4868b44c5 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
seconds delay regardless of the size of the copy. The delay is from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid are 0.

Fix __nfs42_ssc_open to delay setting of NFS_OPEN_STATE in nfs4_state,
until after the call to update_open_stateid, to indicate this is the 1st
open. This fix is part of a 2 patches, the other patch is the fix in the
source server to return the stateid for COPY_NOTIFY request with seqid 1
instead of 0.

Fixes: ce0887ac96 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-02 13:45:33 -05:00
Chuck Lever
5482e09a88 NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation
By switching to an XFS-backed export, I am able to reproduce the
ibcomp worker crash on my client with xfstests generic/013.

For the failing LISTXATTRS operation, xdr_inline_pages() is called
with page_len=12 and buflen=128.

- When ->send_request() is called, rpcrdma_marshal_req() does not
  set up a Reply chunk because buflen is smaller than the inline
  threshold. Thus rpcrdma_convert_iovs() does not get invoked at
  all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked
  on the receive buffer.

- During reply processing, rpcrdma_inline_fixup() tries to copy
  received data into rq_rcv_buf->pages because page_len is positive.
  But there are no receive pages because rpcrdma_marshal_req() never
  allocated them.

The result is that the ibcomp worker faults and dies. Sometimes that
causes a visible crash, and sometimes it results in a transport hang
without other symptoms.

RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and
should eventually be fixed or replaced. However, my preference is
that upper-layer operations should explicitly allocate their receive
buffers (using GFP_KERNEL) when possible, rather than relying on
XDRBUF_SPARSE_PAGES.

Reported-by: Olga kornievskaia <kolga@netapp.com>
Suggested-by: Olga kornievskaia <kolga@netapp.com>
Fixes: c10a75145f ("NFSv4.2: add the extended attribute proc functions.")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Olga kornievskaia <kolga@netapp.com>
Reviewed-by: Frank van der Linden <fllinden@amazon.com>
Tested-by: Olga kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-02 13:19:20 -05:00
Trond Myklebust
63e2fffa59 pNFS/flexfiles: Fix array overflow when flexfiles mirroring is enabled
If the flexfiles mirroring is enabled, then the read code expects to be
able to set pgio->pg_mirror_idx to point to the data server that is
being used for this particular read. However it does not change the
pg_mirror_count because we only need to send a single read.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-11-30 10:52:22 -05:00
Linus Torvalds
1214917e00 More EFI fixes for v5.10-rc:
- revert efivarfs kmemleak fix again - it was a false positive;
 - make CONFIG_EFI_EARLYCON depend on CONFIG_EFI explicitly so it does not
   pull in other dependencies unnecessarily if CONFIG_EFI is not set
 - defer attempts to load SSDT overrides from EFI vars until after the
   efivar layer is up.
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAl+/i/UACgkQw08iOZLZ
 jyRb+Qv/RVQSvZW+u6MrYPZthmVxnNQ4pFHKAjibD3h9isbrWdq6AETtNghUGWQr
 nr1WeOj4Qa2aDe4z63Sra3QNsdLnSn0FsHuJPD8rozMd4N4jiChtpLukdShibXXz
 25yfs7KpNyqwj3QnFd2LpJBXGqzdoKFrzWbnWSnFEMrSfkqptBhogslhVxzal8Uz
 4hUyGhe/iBfgU720uoVCmofPpYxqV/cndEmsnA8rZnW5yTPXIn6f9c4KUOcFxgLS
 LchxZYRd9GZoQ4Yt40ih9JX1ILZNhrhXh96cfNuUiVwnf9Lg7xSOGIX3+e3PR9Lz
 1VI4UuA7HM8qDZx+N5iiBRrBIgtdtMgKLEip3n+x84/7P/p26HXe3OJCPBWh0Q0q
 aWCcV8qiwju6VYK6dv4Gz+2/OYiSKXrXZCx57dnCr7tv5srYenwsvCCYa9wLTpMW
 16/DRmkHcfbAdfS38Bhs3qY/zK+He7XE/sPJao/NuVwoJ3Nu0dA4LR7JFG7TVjKm
 bepO3A8W
 =gvBz
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent-for-v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI fixes from Borislav Petkov:
 "More EFI fixes forwarded from Ard Biesheuvel:

   - revert efivarfs kmemleak fix again - it was a false positive

   - make CONFIG_EFI_EARLYCON depend on CONFIG_EFI explicitly so it does
     not pull in other dependencies unnecessarily if CONFIG_EFI is not
     set

   - defer attempts to load SSDT overrides from EFI vars until after the
     efivar layer is up"

* tag 'efi-urgent-for-v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi: EFI_EARLYCON should depend on EFI
  efivarfs: revert "fix memory leak in efivarfs_create()"
  efi/efivars: Set generic ops before loading SSDT
2020-11-29 10:18:53 -08:00
Linus Torvalds
9223e74f99 io_uring-5.10-2020-11-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/BZBwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpku+EACqrIZ6wuA/mIWbKs9wWf280hRbIQL1q7zy
 Ejj2A6DfclZz8XuUUfrRQ2diDDPBNnfPVRfiIzAi/AO7f7jDn91+2ZYfoJJWLZbq
 kCDHScPI6+OjDho5r+3vocxdoXZfuYxBlKsIzdiLl32mGEWbYjd+rmlAVJ74FPYz
 mFddmQCZoTBuEArpW6h8bKoPLgoC4zzZ2MgcjjpHT7Me/ai8t7OoA+FBpUjRcu+q
 Bt/ZfjIiWOzss9+psk5BSrHo5yXY51TWiiuV8hVM3RwVsL2dabG4WPgaLzum3H9p
 UZ4NAvXdRX9gWfF85mo7PEo1/0REmRy4BOJQ8qtBnvq6eepttAXHBx0SFDfpcLzG
 oAB4HaxuixAdsutnIynBXad3MskhNtFzbz/4UqOcnKbpT5Zxi65YiBKAsNwXjIdU
 bc3rZ3hyP4FifzNFa86wFLQ9w9gKV8mEogAz122lxL44AyguZzwN1I3H/YSEt2iX
 tKqHNWxnrb9MP3ycAuMvwjF3aNruns3QVv5fZQ9CrAUnAVF6Q8o/E2aTt4rmiJUW
 kQAvQJQj6MhBi/GXJk6YmTTDYgEMZ5JxFV3IVW17YZiMlpvIFML4X2nKo12qhwnq
 8eZ0qJHFRyycW9eBY3qznw8S5Z7Nh8r91MjJs5sullHg99xKJbvR+IoswImfJMme
 Gk857imwGw==
 =HJnn
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.10-2020-11-27' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Out of bounds fix for the cq size cap from earlier this release (Joseph)

 - iov_iter type check fix (Pavel)

 - Files grab + cancelation fix (Pavel)

* tag 'io_uring-5.10-2020-11-27' of git://git.kernel.dk/linux-block:
  io_uring: fix files grab/cancel race
  io_uring: fix ITER_BVEC check
  io_uring: fix shift-out-of-bounds when round up cq size
2020-11-27 12:56:04 -08:00
Linus Torvalds
a17a3ca55e for-5.10-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/BGBQACgkQxWXV+ddt
 WDtRYQ/+IUGjJ4l6MyL3PwLgTVabKsSpm2R3y3M/0tVJ0FIhDXYbkpjB2CkpIdcH
 jUvHnRL4H59hwG6rwlXU2oC298FNbbrLGIU+c9DR50RuyQCDGnT02XvxwfDIEFzp
 WLNZ/CAhRkm6boj//70lV26BpeXT59KMYwNixfCNTXq9Ir0qYHHCGg4cEBQLS++2
 JUU8XVTLURIYiFOLbwmABI7V43OgDhdORIr+qnR8xjCUyhusZsjVVbvIdW3BDi/S
 wK7NJsfuqgsF0zD9URJjpwTFiJL9SvBLWR8JnM9NiLW3ZbkGBL+efL6mdWuH7534
 gruJRS2zYPMO2/Kpjy/31CWLap3PUSD3i8cKF+uo3liojWuSUhN8kfvcNxJVd5se
 NkEK+4zOjsDIVbv7gcjThSv4KTnOUO/XfN9TWUMuduaMBmGQNaQut1FpGV98utiK
 yW6x8xqcR4SI+lqY6ILqCK+qUHf19BLSsuyzZdTIontKKRA9F9hY8a4XTZuzTWml
 BGYmFGP640vOo8C9GjrQfpAwa7CB/DnF/cg1AAmuZ8vrEm9zYjmauFKK8ZPcveA3
 KGrnmIlYjhAIX16oRbfwOgj9D2xa1loBzJyHQHByvCMXGVFBnqRTRANRHFrdQWJB
 qh9+J4EJcUXPE9WGHxAW/g9vpFkV7IRABHs7aUB8zApxI9nGA0Q=
 =kcxn
 -----END PGP SIGNATURE-----

Merge tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few fixes for various warnings that accumulated over past two weeks:

   - tree-checker: add missing return values for some errors

   - lockdep fixes
      - when reading qgroup config and starting quota rescan
      - reverse order of quota ioctl lock and VFS freeze lock

   - avoid accessing potentially stale fs info during device scan,
     reported by syzbot

   - add scope NOFS protection around qgroup relation changes

   - check for running transaction before flushing qgroups

   - fix tracking of new delalloc ranges for some cases"

* tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix lockdep splat when enabling and disabling qgroups
  btrfs: do nofs allocations when adding and removing qgroup relations
  btrfs: fix lockdep splat when reading qgroup config on mount
  btrfs: tree-checker: add missing returns after data_ref alignment checks
  btrfs: don't access possibly stale fs_info data for printing duplicate device
  btrfs: tree-checker: add missing return after error in root_item
  btrfs: qgroup: don't commit transaction when we already hold the handle
  btrfs: fix missing delalloc new bit for new delalloc ranges
2020-11-27 12:42:13 -08:00
Pavel Begunkov
af60470347 io_uring: fix files grab/cancel race
When one task is in io_uring_cancel_files() and another is doing
io_prep_async_work() a race may happen. That's because after accounting
a request inflight in first call to io_grab_identity() it still may fail
and go to io_identity_cow(), which migh briefly keep dangling
work.identity and not only.

Grab files last, so io_prep_async_work() won't fail if it did get into
->inflight_list.

note: the bug shouldn't exist after making io_uring_cancel_files() not
poking into other tasks' requests.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-26 08:50:21 -07:00
Ard Biesheuvel
ff04f3b6f2 efivarfs: revert "fix memory leak in efivarfs_create()"
The memory leak addressed by commit fe5186cf12 is a false positive:
all allocations are recorded in a linked list, and freed when the
filesystem is unmounted. This leads to double frees, and as reported
by David, leads to crashes if SLUB is configured to self destruct when
double frees occur.

So drop the redundant kfree() again, and instead, mark the offending
pointer variable so the allocation is ignored by kmemleak.

Cc: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com>
Fixes: fe5186cf12 ("efivarfs: fix memory leak in efivarfs_create()")
Reported-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-11-25 16:55:02 +01:00