47351 Commits

Author SHA1 Message Date
3ccc6c6faa skbuff: optimize the pull_pages code in __pskb_pull_tail()
In the pull_pages code block, if the first frag size > eat,
we can end the loop in advance to avoid extra copy.

Signed-off-by: Lin Zhang <xiaolou4617@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-17 08:56:50 -07:00
36ac344e16 netfilter: expect: fix crash when putting uninited expectation
We crash in __nf_ct_expect_check: it calls nf_ct_remove_expect on the
uninitialised expectation instead of the existing one, so del_timer chokes
on a random memory address.

Fixes: ec0e3f01114ad32711243 ("netfilter: nf_ct_expect: Add nf_ct_remove_expect()")
Reported-by: Sergey Kvachonok <ravenexp@gmail.com>
Tested-by: Sergey Kvachonok <ravenexp@gmail.com>
Cc: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17 17:03:12 +02:00
974292defe netfilter: nf_tables: only allow in/output for arp packets
arp packets cannot be forwarded.

They can be bridged, but then they can be filtered using
either ebtables or nftables bridge family.

The bridge netfilter exposes a "call-arptables" switch which
pushes packets into arptables, but let's not expose this for nftables, so
better close this asap.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17 17:02:44 +02:00
97772bcd56 netfilter: nat: fix src map lookup
When doing initial conversion to rhashtable I replaced the bucket
walk with a single rhashtable_lookup_fast().

When moving to rhlist I failed to properly walk the list of identical
tuples, but that is what is needed for this to work correctly.
The table contains the original tuples, so the reply tuples are all
distinct.

We currently decide whether the mapping is in range based only on the
first entry. If it is not, we need to try the reply tuple of the
next entry until we either find an in-range mapping or have checked
all the entries.

This bug makes nat core attempt collision resolution while it might be
able to use the mapping as-is.

Fixes: 870190a9ec90 ("netfilter: nat: convert nat bysrc hash to rhashtable")
Reported-by: Jaco Kroon <jaco@uls.co.za>
Tested-by: Jaco Kroon <jaco@uls.co.za>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17 17:02:19 +02:00
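
The lookup described in the nat fix above (walk every entry that shares the key, not just the first one) can be sketched in plain C. This is a minimal illustration, not the kernel code: `struct toy_entry`, `in_range()` and `find_in_range()` are hypothetical names, and a bare linked list stands in for the rhlist.

```c
#include <stddef.h>

/* Hypothetical entry: several entries can share one lookup key. */
struct toy_entry {
    unsigned int reply_addr;        /* stand-in for the reply tuple */
    const struct toy_entry *next;   /* next entry with an identical key */
};

/* Range test used to decide whether a mapping can be reused as-is. */
static int in_range(unsigned int addr, unsigned int min, unsigned int max)
{
    return addr >= min && addr <= max;
}

/*
 * Walk all entries with the same key and return the first one whose
 * reply address is in range, or NULL once every entry has been checked.
 * Looking only at the head entry is the behaviour the fix replaces.
 */
const struct toy_entry *find_in_range(const struct toy_entry *head,
                                      unsigned int min, unsigned int max)
{
    for (; head; head = head->next)
        if (in_range(head->reply_addr, min, max))
            return head;
    return NULL;
}
```

With three chained entries holding 10, 20 and 30, a search for the range 15..25 returns the second entry even though the first one is out of range.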
cf56c2f892 netfilter: remove old pre-netns era hook api
no more users in the tree, remove this.

The old api is racy wrt. module removal, all users have been converted
to the netns-aware api.

The old api pretended we still have global hooks but that has not been
true for a long time.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17 17:01:10 +02:00
7c40b22f6f libceph: potential NULL dereference in ceph_msg_data_create()
If kmem_cache_zalloc() returns NULL then the INIT_LIST_HEAD(&data->links);
will Oops.  The callers aren't really prepared for NULL returns so it
doesn't make a lot of difference in real life.

Fixes: 5240d9f95dfe ("libceph: replace message data pointer with list")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-17 14:54:59 +02:00
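
The pattern behind the ceph_msg_data_create() fix, checking the allocation before touching the object, looks roughly like this in userspace C. `struct msg_data`, `struct list_node` and `msg_data_create()` are hypothetical stand-ins, and calloc() takes the place of kmem_cache_zalloc().

```c
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct list_node {
    struct list_node *next, *prev;
};

/* Hypothetical stand-in for the message data item. */
struct msg_data {
    int type;
    struct list_node links;
};

/* Userspace analogue of INIT_LIST_HEAD(): point the node at itself. */
static void list_node_init(struct list_node *n)
{
    n->next = n;
    n->prev = n;
}

/*
 * Allocate and initialise one item. Returns NULL on allocation failure
 * instead of dereferencing the NULL pointer in list_node_init().
 */
struct msg_data *msg_data_create(int type)
{
    struct msg_data *data = calloc(1, sizeof(*data));

    if (!data)
        return NULL;

    data->type = type;
    list_node_init(&data->links);
    return data;
}
```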
914902af4f libceph: don't call encode_request_finish() on MOSDBackoff messages
encode_request_finish() is for MOSDOp messages.  Calling it on
MOSDBackoff ack-block messages corrupts them.

Fixes: a02a946dfe96 ("libceph: respect RADOS_BACKOFF backoffs")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-17 14:54:59 +02:00
f5cc689865 libceph: use alloc_pg_mapping() in __decode_pg_upmap_items()
... otherwise we die in insert_pg_mapping(), which wants pg->node to be
empty, i.e. initialized with RB_CLEAR_NODE.

Fixes: 6f428df47dae ("libceph: pg_upmap[_items] infrastructure")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-17 14:54:58 +02:00
c2acfd95d0 libceph: set -EINVAL in one place in crush_decode()
No sooner than Dan had fixed this issue in commit 293dffaad8d5
("libceph: NULL deref on crush_decode() error path"), I brought it
back.  Add a new label and set -EINVAL once, right before failing.

Fixes: 278b1d709c6a ("libceph: ceph_decode_skip_* helpers")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-17 14:54:58 +02:00
00c8ebb360 libceph: NULL deref on osdmap_apply_incremental() error path
There are hidden gotos in the ceph_decode_* macros.  We need to set the
"err" variable on these error paths otherwise we end up returning
ERR_PTR(0) which is NULL.  It causes NULL dereferences in the callers.

Fixes: 6f428df47dae ("libceph: pg_upmap[_items] infrastructure")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[idryomov@gmail.com: similar bug in osdmap_decode(), changelog tweak]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-07-17 14:54:58 +02:00
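
Why ERR_PTR(0) is dangerous can be shown with simplified userspace copies of the kernel's error-pointer helpers. This is a sketch under assumptions: `decode_map()` is a hypothetical decoder, its `fail` flag simulates a decode macro jumping to the error label, and `set_err` controls whether err was assigned before the jump.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel's err.h helpers. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int dummy_map;   /* stands in for a successfully decoded object */

void *decode_map(int fail, int set_err)
{
    int err = 0;

    if (fail) {
        if (set_err)
            err = -EINVAL;
        goto bad;        /* the hidden goto inside a decode macro */
    }
    return &dummy_map;

bad:
    /* With err still 0 this is ERR_PTR(0), which is just NULL. */
    return ERR_PTR(err);
}
```

A caller that only tests IS_ERR() accepts the NULL result of the unset-err path as success and dereferences it, which is the failure mode the commit describes.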
f55ce7b024 netfilter: nfnetlink: Improve input length sanitization in nfnetlink_rcv
Verify that the length of the socket buffer is sufficient to cover the
nlmsghdr structure before accessing the nlh->nlmsg_len field for further
input sanitization. If the client only supplies 1-3 bytes of data in
sk_buff, then nlh->nlmsg_len remains partially uninitialized and
contains leftover memory from the corresponding kernel allocation.
Operating on such data may result in indeterminate evaluation of the
nlmsg_len < NLMSG_HDRLEN expression.

The bug was discovered by a runtime instrumentation designed to detect
use of uninitialized memory in the kernel. The patch prevents this and
other similar tools (e.g. KMSAN) from flagging this behavior in the future.

Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-17 13:27:46 +02:00
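
The ordering of checks in the nfnetlink fix can be sketched as follows: confirm the buffer covers the header before reading the length field stored inside it. `struct msg_hdr` is a simplified stand-in for struct nlmsghdr and `msg_header_ok()` is a hypothetical helper, not the kernel function.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct nlmsghdr. */
struct msg_hdr {
    uint32_t len;
    uint16_t type;
    uint16_t flags;
    uint32_t seq;
    uint32_t pid;
};

#define MSG_HDRLEN sizeof(struct msg_hdr)

/*
 * Return 1 if buf holds a complete, self-consistent message header and
 * 0 otherwise. The buffer length is checked first, so hdr.len is never
 * read from bytes the sender did not supply.
 */
int msg_header_ok(const unsigned char *buf, size_t buflen)
{
    struct msg_hdr hdr;

    if (buflen < MSG_HDRLEN)
        return 0;

    memcpy(&hdr, buf, sizeof(hdr));
    if (hdr.len < MSG_HDRLEN || hdr.len > buflen)
        return 0;

    return 1;
}
```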
1474774a7f sctp: remove the typedef sctp_hmac_algo_param_t
This patch is to remove the typedef sctp_hmac_algo_param_t, and
replace with struct sctp_hmac_algo_param in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 20:52:14 -07:00
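
The sizeof(variable) convention mentioned in the sctp typedef cleanups can be illustrated like this. `struct hmac_param` is a hypothetical layout chosen for the example, not the real sctp structure.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parameter layout, for illustration only. */
struct hmac_param {
    uint16_t type;
    uint16_t length;
    uint16_t ids[4];
};

struct hmac_param *hmac_param_new(void)
{
    /*
     * sizeof(*p) follows the declared type of p automatically, so the
     * allocation stays correct if p is later changed to another struct.
     * sizeof(struct hmac_param) would need a matching manual update.
     */
    struct hmac_param *p = malloc(sizeof(*p));

    if (!p)
        return NULL;

    memset(p, 0, sizeof(*p));
    p->length = sizeof(*p);
    return p;
}
```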
a762a9d94d sctp: remove the typedef sctp_chunks_param_t
This patch is to remove the typedef sctp_chunks_param_t, and
replace with struct sctp_chunks_param in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 20:52:14 -07:00
b02db702fa sctp: remove the typedef sctp_random_param_t
This patch is to remove the typedef sctp_random_param_t, and
replace with struct sctp_random_param in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 20:52:14 -07:00
15328d9fee sctp: remove the typedef sctp_supported_ext_param_t
This patch is to remove the typedef sctp_supported_ext_param_t, and
replace with struct sctp_supported_ext_param in the places where it's
using this typedef.

It is also to use sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 20:52:14 -07:00
85f6bd24ac sctp: remove the typedef sctp_adaptation_ind_param_t
This patch is to remove the typedef sctp_adaptation_ind_param_t, and
replace with struct sctp_adaptation_ind_param in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 20:52:14 -07:00
e925d506f1 sctp: remove the typedef sctp_supported_addrs_param_t
This patch is to remove the typedef sctp_supported_addrs_param_t, and
replace with struct sctp_supported_addrs_param in the places where it's
using this typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 20:52:14 -07:00
365ddb65e7 sctp: remove the typedef sctp_cookie_preserve_param_t
This patch is to remove the typedef sctp_cookie_preserve_param_t, and
replace with struct sctp_cookie_preserve_param in the places where it's
using this typedef.

It is also to fix some indents in sctp_sf_do_5_2_6_stale().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 20:52:13 -07:00
00987cc07e sctp: remove the typedef sctp_ipv6addr_param_t
This patch is to remove the typedef sctp_ipv6addr_param_t, and replace
with struct sctp_ipv6addr_param in the places where it's using this
typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 20:52:13 -07:00
a38905e6aa sctp: remove the typedef sctp_ipv4addr_param_t
This patch is to remove the typedef sctp_ipv4addr_param_t, and replace
with struct sctp_ipv4addr_param in the places where it's using this
typedef.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 20:52:13 -07:00
aed20a53a7 rds: cancel send/recv work before queuing connection shutdown
We could end up executing rds_conn_shutdown before the rds_recv_worker
thread, then rds_conn_shutdown -> rds_tcp_conn_shutdown can do a
sock_release and set sock->sk to null, which may interleave in bad
ways with rds_recv_worker, e.g., it could result in:

"BUG: unable to handle kernel NULL pointer dereference at 0000000000000078"
    [ffff881769f6fd70] release_sock at ffffffff815f337b
    [ffff881769f6fd90] rds_tcp_recv at ffffffffa043c888 [rds_tcp]
    [ffff881769f6fdb0] rds_recv_worker at ffffffffa04a4810 [rds]
    [ffff881769f6fde0] process_one_work at ffffffff810a14c1
    [ffff881769f6fe40] worker_thread at ffffffff810a1940
    [ffff881769f6fec0] kthread at ffffffff810a6b1e

Also, do not enqueue any new shutdown workq items when the connection is
shutting down (this may happen for rds-tcp in softirq mode, if a FIN
or CLOSE is received while the module is in the middle of an unload).

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-16 19:07:35 -07:00
3298456557 tcp_bbr: init pacing rate on first RTT sample
Fixes the following behavior: for connections that had no RTT sample
at the time of initializing congestion control, BBR was initializing
the pacing rate to a high nominal rate (based on a guess of RTT=1ms,
in case this is LAN traffic). Then BBR never adjusted the pacing rate
downward upon obtaining an actual RTT sample, if the connection never
filled the pipe (e.g. all sends were small app-limited writes()).

This fix adjusts the pacing rate upon obtaining the first RTT sample.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15 14:43:29 -07:00
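
The effect of the 1 ms RTT guess can be seen with a back-of-the-envelope rate computation. This is a toy sketch, not BBR's code: `toy_init_pacing_rate()` is a hypothetical function that computes cwnd * MSS / RTT in bytes per second and deliberately ignores BBR's gain factors and fixed-point bandwidth units.

```c
#include <stdint.h>

#define USEC_PER_SEC    1000000ULL
#define FALLBACK_RTT_US 1000ULL   /* 1 ms guess used when no RTT sample exists */

/*
 * Pacing rate in bytes per second for a window of cwnd_pkts packets of
 * mss_bytes each, delivered once per rtt_us microseconds. An rtt_us of 0
 * means "no sample yet" and falls back to the 1 ms guess. The product is
 * assumed to fit in 64 bits for the small inputs used here.
 */
uint64_t toy_init_pacing_rate(uint32_t cwnd_pkts, uint32_t mss_bytes,
                              uint64_t rtt_us)
{
    if (rtt_us == 0)
        rtt_us = FALLBACK_RTT_US;

    return (uint64_t)cwnd_pkts * mss_bytes * USEC_PER_SEC / rtt_us;
}
```

With 10 packets of 1000 bytes, the fallback yields 10,000,000 bytes/s, while a measured RTT of 100 ms yields 100,000 bytes/s. A connection that keeps the initial guess therefore paces a hundred times faster than its first real sample would suggest.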
1d3648eb5d tcp_bbr: remove sk_pacing_rate=0 transient during init
Fix a corner case noticed by Eric Dumazet, where BBR's setting
sk->sk_pacing_rate to 0 during initialization could theoretically
cause packets in the sending host to hang if there were packets "in
flight" in the pacing infrastructure at the time the BBR congestion
control state is initialized. This could occur if the pacing
infrastructure happened to race with bbr_init() in a way such that the
pacer read the 0 rather than the immediately following non-zero pacing
rate.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15 14:43:29 -07:00
79135b89b8 tcp_bbr: introduce bbr_init_pacing_rate_from_rtt() helper
Introduce a helper to initialize the BBR pacing rate unconditionally,
based on the current cwnd and RTT estimate. This is a pure refactor,
but is needed for two following fixes.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15 14:43:29 -07:00
f19fd62daf tcp_bbr: introduce bbr_bw_to_pacing_rate() helper
Introduce a helper to convert a BBR bandwidth and gain factor to a
pacing rate in bytes per second. This is a pure refactor, but is
needed for two following fixes.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15 14:43:29 -07:00
4aea287e90 tcp_bbr: cut pacing rate only if filled pipe
In bbr_set_pacing_rate(), which decides whether to cut the pacing
rate, there was some code that considered exiting STARTUP to be
equivalent to the notion of filling the pipe (i.e.,
bbr_full_bw_reached()). Specifically, as the code was structured,
exiting STARTUP and going into PROBE_RTT could cause us to cut the
pacing rate down to something silly and low, based on whatever
bandwidth samples we've had so far, when it's possible that all of
them have been small app-limited bandwidth samples that are not
representative of the bandwidth available in the path. (The code was
correct at the time it was written, but the state machine changed
without this spot being adjusted correspondingly.)

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15 14:43:29 -07:00
8b97ac5bda openvswitch: Fix for force/commit action failures
When there is an established connection in direction A->B, it is
possible to receive a packet on port B which then executes
ct(commit,force) without first performing ct() - ie, a lookup.
In this case, we would expect that this packet can delete the existing
entry so that we can commit a connection with direction B->A. However,
currently we only perform a check in skb_nfct_cached() for whether
OVS_CS_F_TRACKED is set and OVS_CS_F_INVALID is not set, ie that a
lookup previously occurred. In the above scenario, a lookup has not
occurred but we should still be able to statelessly look up the
existing entry and potentially delete the entry if it is in the
opposite direction.

This patch extends the check to also hint that if the action has the
force flag set, then we will lookup the existing entry so that the
force check at the end of skb_nfct_cached has the ability to delete
the connection.

Fixes: dd41d330b03 ("openvswitch: Add force commit.")
CC: Pravin Shelar <pshelar@nicira.com>
CC: dev@openvswitch.org
Signed-off-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15 14:40:21 -07:00
254d900b80 ipv4: ip_do_fragment: fix headroom tests
Some time ago David Woodhouse reported skb_under_panic
when we try to push ethernet header to fragmented ipv6 skbs.
It was fixed for ipv6 by Florian Westphal in
commit 1d325d217c7f ("ipv6: ip6_fragment: fix headroom tests and skb leak")

However, a similar problem still exists in ipv4.

It does not trigger skb_under_panic due to a paranoid check
in ip_finish_output2; however, according to Alexey Kuznetsov, the
current state is abnormal and ip_fragment should be fixed too.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-15 14:38:31 -07:00
52f6c588c7 Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random

Pull random updates from Ted Ts'o:
 "Add wait_for_random_bytes() and get_random_*_wait() functions so that
  callers can more safely get random bytes if they can block until the
  CRNG is initialized.

  Also print a warning if get_random_*() is called before the CRNG is
  initialized. By default, only one single-line warning will be printed
  per boot. If CONFIG_WARN_ALL_UNSEEDED_RANDOM is defined, then a
  warning will be printed for each function which tries to get random
  bytes before the CRNG is initialized. This can get spammy for certain
  architecture types, so it is not enabled by default"

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
  random: reorder READ_ONCE() in get_random_uXX
  random: suppress spammy warnings about unseeded randomness
  random: warn when kernel uses unseeded randomness
  net/route: use get_random_int for random counter
  net/neighbor: use get_random_u32 for 32-bit hash random
  rhashtable: use get_random_u32 for hash_rnd
  ceph: ensure RNG is seeded before using
  iscsi: ensure RNG is seeded before use
  cifs: use get_random_u32 for 32-bit lock random
  random: add get_random_{bytes,u32,u64,int,long,once}_wait family
  random: add wait_for_random_bytes() API
2017-07-15 12:44:02 -07:00
78dcf73421 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ->s_options removal from Al Viro:
 "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
  gets moved to explicit ->show_options(), killing ->s_options off +
  some cosmetic bits around fs/namespace.c and friends. Basically, the
  stuff needed to work with fsmount series with minimum of conflicts
  with other work.

  It's not strictly required for this merge window, but it would reduce
  the PITA during the coming cycle, so it would be nice to have those
  bits and pieces out of the way"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  isofs: Fix isofs_show_options()
  VFS: Kill off s_options and helpers
  orangefs: Implement show_options
  9p: Implement show_options
  isofs: Implement show_options
  afs: Implement show_options
  affs: Implement show_options
  befs: Implement show_options
  spufs: Implement show_options
  bpf: Implement show_options
  ramfs: Implement show_options
  pstore: Implement show_options
  omfs: Implement show_options
  hugetlbfs: Implement show_options
  VFS: Don't use save/replace_mount_options if not using generic_show_options
  VFS: Provide empty name qstr
  VFS: Make get_filesystem() return the affected filesystem
  VFS: Clean up whitespace in fs/namespace.c and fs/super.c
  Provide a function to create a NUL-terminated string from unterminated data
2017-07-15 12:00:42 -07:00
2173bd0631 Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull network field-by-field copy-in updates from Al Viro:
 "This part of the misc compat queue was held back for review from
  networking folks and since davem has just ACKed those..."

* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  get_compat_bpf_fprog(): don't copyin field-by-field
  get_compat_msghdr(): get rid of field-by-field copyin
  copy_msghdr_from_user(): get rid of field-by-field copyin
2017-07-15 11:06:17 -07:00
10b3bf5440 sctp: fix an array overflow when all ext chunks are set
Marcelo noticed an array overflow caused by commit c28445c3cb07
("sctp: add reconf_enable in asoc ep and netns"), in which sctp
would add SCTP_CID_RECONF into extensions when reconf_enable is
set in sctp_make_init and sctp_make_init_ack.

Now, when all ext chunks are set, 4 ext chunk ids can be put
into the extensions array while its size is only 3. This overflow
causes a kernel panic.

This patch fixes it by defining the extensions array size as 4 in
both sctp_make_init and sctp_make_init_ack.

Fixes: c28445c3cb07 ("sctp: add reconf_enable in asoc ep and netns")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14 09:05:10 -07:00
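
The overflow class described above, a fixed array sized for one fewer id than can be enabled, can be guarded against with a bounds-checked append. This is a generic sketch: `struct ext_list`, `ext_list_add()` and `MAX_EXT` are hypothetical names, and the sctp code itself simply enlarges the array.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_EXT 4   /* must match the number of ids that can be enabled */

struct ext_list {
    uint8_t ids[MAX_EXT];
    size_t  count;
};

/* Append one id; returns 0 on success, -1 if the array is already full. */
int ext_list_add(struct ext_list *l, uint8_t id)
{
    if (l->count >= MAX_EXT)
        return -1;

    l->ids[l->count++] = id;
    return 0;
}
```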
c4c4290c17 net sched actions: rename act_get_notify() to tcf_get_notify()
Make name consistent with other TC event notification routines, such as
tcf_add_notify() and tcf_del_notify()

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14 08:52:33 -07:00
ccd4eb49f3 net/packet: Fix Tx queue selection for AF_PACKET
When PACKET_QDISC_BYPASS is not used, Tx queue selection will be done
before the packet is enqueued, taking into account any mappings set by
a queuing discipline such as mqprio without hardware offloading. This
selection may be affected by a previously saved queue_mapping, either on
the Rx path, or done before the packet reaches the device, as it's
currently the case for AF_PACKET.

In order for queue selection to work as expected when using traffic
control, there can't be another selection done before that point is
reached, so move the call to packet_pick_tx_queue to
packet_direct_xmit, leaving the default xmit path as it was before
PACKET_QDISC_BYPASS was introduced.

A forward declaration of packet_pick_tx_queue() is introduced to avoid
the need to reorder the functions within the file.

Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option")
Signed-off-by: Iván Briano <ivan.briano@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14 08:20:28 -07:00
31a4562d74 net: bridge: fix dest lookup when vlan proto doesn't match
With 802.1ad support the vlan_ingress code started checking for vlan
protocol mismatch which causes the current tag to be inserted and the
bridge vlan protocol & pvid to be set. The vlan tag insertion changes
the skb mac_header and thus the lookup mac dest pointer which was loaded
prior to calling br_allowed_ingress in br_handle_frame_finish is VLAN_HLEN
bytes off now, pointing to the last two bytes of the destination mac and
the first four of the source mac causing lookups to always fail and
broadcasting all such packets to all ports. Same thing happens for locally
originated packets when passing via br_dev_xmit. So load the dest pointer
after the vlan checks and possible skb change.

Fixes: 8580e2117c06 ("bridge: Prepare for 802.1ad vlan filtering support")
Reported-by: Anitha Narasimha Murthy <anitha@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14 08:19:23 -07:00
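
The stale pointer in the bridge fix can be reproduced with a toy frame buffer. This sketch assumes tag insertion grows the header towards the front by VLAN_HLEN bytes and moves both MAC addresses there; `struct toy_frame`, `toy_frame_init()` and `toy_insert_tag()` are hypothetical helpers, and the tag bytes themselves are not written.

```c
#include <string.h>

#define ETH_ALEN  6
#define VLAN_HLEN 4

struct toy_frame {
    unsigned char buf[64];
    unsigned char *mac_header;   /* start of the Ethernet header */
};

/* Build a frame with VLAN_HLEN bytes of headroom and known MAC bytes. */
void toy_frame_init(struct toy_frame *f)
{
    memset(f->buf, 0, sizeof(f->buf));
    f->mac_header = f->buf + VLAN_HLEN;
    memset(f->mac_header, 0xDD, ETH_ALEN);             /* destination */
    memset(f->mac_header + ETH_ALEN, 0x55, ETH_ALEN);  /* source */
}

/*
 * Toy tag insertion: move both MAC addresses VLAN_HLEN bytes towards
 * the front and update mac_header. Only the shift matters here.
 */
void toy_insert_tag(struct toy_frame *f)
{
    unsigned char *new_hdr = f->mac_header - VLAN_HLEN;

    memmove(new_hdr, f->mac_header, 2 * ETH_ALEN);
    f->mac_header = new_hdr;
}
```

A destination pointer saved before toy_insert_tag() ends up VLAN_HLEN bytes past the new header start, so it covers the last two destination bytes followed by the first four source bytes. Reloading the pointer from mac_header after the insertion gives the correct address.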
230cd1279d netpoll: shut up a kernel warning on refcount
When we convert atomic_t to refcount_t, a new kernel warning
on "increment on 0" is introduced in the netpoll code,
zap_completion_queue(). In fact for this special case, we know
the refcount is 0 and we just have to set it to 1 to satisfy
the following dev_kfree_skb_any(), so we can just use
refcount_set(..., 1) instead.

Fixes: 633547973ffc ("net: convert sk_buff.users from atomic_t to refcount_t")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Cc: Reshetova, Elena <elena.reshetova@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14 08:16:59 -07:00
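
The difference between incrementing and setting a zero refcount can be modelled in a few lines. This is a toy model, not the kernel's refcount_t: it assumes an increment from zero is refused and flagged, and `struct toy_refcount`, `toy_refcount_inc()`, `toy_refcount_set()` and `toy_warnings` are hypothetical names.

```c
/* Toy model of a saturating reference counter. */
struct toy_refcount {
    unsigned int refs;
};

/* Counts how often an increment from zero was attempted. */
static int toy_warnings;

/*
 * Increment the counter. Going from 0 to 1 normally means the object
 * was already freed, so the toy refuses and records a warning.
 */
void toy_refcount_inc(struct toy_refcount *r)
{
    if (r->refs == 0) {
        toy_warnings++;   /* the kernel would emit a WARN here */
        return;
    }
    r->refs++;
}

/* Unconditionally set the counter, for callers that own the object. */
void toy_refcount_set(struct toy_refcount *r, unsigned int n)
{
    r->refs = n;
}
```

When the caller knows the count is 0 and still owns the object, as in the netpoll case above, setting it to 1 expresses the intent directly and avoids the warning.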
b86faee6d1 Merge tag 'nfs-for-4.13-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - Fix -EACCESS on commit to DS handling
   - Fix initialization of nfs_page_array->npages
   - Only invalidate dentries that are actually invalid

  Features:
   - Enable NFSoRDMA transparent state migration
   - Add support for lookup-by-filehandle
   - Add support for nfs re-exporting

  Other bugfixes and cleanups:
   - Christoph cleaned up the way we declare NFS operations
   - Clean up various internal structures
   - Various cleanups to commits
   - Various improvements to error handling
   - Set the dt_type of . and .. entries in NFS v4
   - Make slot allocation more reliable
   - Fix fscache stat printing
   - Fix uninitialized variable warnings
   - Fix potential list overrun in nfs_atomic_open()
   - Fix a race in NFSoRDMA RPC reply handler
   - Fix return size for nfs42_proc_copy()
   - Fix against MAC forgery timing attacks"

* tag 'nfs-for-4.13-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (68 commits)
  NFS: Don't run wake_up_bit() when nobody is waiting...
  nfs: add export operations
  nfs4: add NFSv4 LOOKUPP handlers
  nfs: add a nfs_ilookup helper
  nfs: replace d_add with d_splice_alias in atomic_open
  sunrpc: use constant time memory comparison for mac
  NFSv4.2 fix size storage for nfs42_proc_copy
  xprtrdma: Fix documenting comments in frwr_ops.c
  xprtrdma: Replace PAGE_MASK with offset_in_page()
  xprtrdma: FMR does not need list_del_init()
  xprtrdma: Demote "connect" log messages
  NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration
  NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migration
  xprtrdma: Don't defer MR recovery if ro_map fails
  xprtrdma: Fix FRWR invalidation error recovery
  xprtrdma: Fix client lock-up after application signal fires
  xprtrdma: Rename rpcrdma_req::rl_free
  xprtrdma: Pass only the list of registered MRs to ro_unmap_sync
  xprtrdma: Pre-mark remotely invalidated MRs
  xprtrdma: On invalidation failure, remove MWs from rl_registered
  ...
2017-07-13 14:35:37 -07:00
6240300597 Chuck's RDMA update overhauls the "call receive" side of the
RPC-over-RDMA transport to use the new rdma_rw API.
 
 Christoph cleaned the way nfs operations are declared, removing a bunch
 of function-pointer casts and declaring the operation vectors as const.
 
 Christoph's changes touch both client and server, and both client and
 server pulls this time around should be based on the same commits from
 Christoph.
 
 (Note: Anna and I initially didn't coordinate this well and we realized
 our pull requests were going to leave you with Christoph's 33 patches
 duplicated between our two trees.  We decided a last-minute rebase was
 the lesser of two evils, so her pull request will show that last-minute
 rebase.  Yell if that was the wrong choice, and we'll know better for
 next time....)

Merge tag 'nfsd-4.13' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Chuck's RDMA update overhauls the "call receive" side of the
  RPC-over-RDMA transport to use the new rdma_rw API.

  Christoph cleaned the way nfs operations are declared, removing a
  bunch of function-pointer casts and declaring the operation vectors as
  const.

  Christoph's changes touch both client and server, and both client and
  server pulls this time around should be based on the same commits from
  Christoph"

* tag 'nfsd-4.13' of git://linux-nfs.org/~bfields/linux: (53 commits)
  svcrdma: fix an incorrect check on -E2BIG and -EINVAL
  nfsd4: factor ctime into change attribute
  svcrdma: Remove svc_rdma_chunk_ctxt::cc_dir field
  svcrdma: use offset_in_page() macro
  svcrdma: Clean up after converting svc_rdma_recvfrom to rdma_rw API
  svcrdma: Clean-up svc_rdma_unmap_dma
  svcrdma: Remove frmr cache
  svcrdma: Remove unused Read completion handlers
  svcrdma: Properly compute .len and .buflen for received RPC Calls
  svcrdma: Use generic RDMA R/W API in RPC Call path
  svcrdma: Add recvfrom helpers to svc_rdma_rw.c
  sunrpc: Allocate up to RPCSVC_MAXPAGES per svc_rqst
  svcrdma: Don't account for Receive queue "starvation"
  svcrdma: Improve Reply chunk sanity checking
  svcrdma: Improve Write chunk sanity checking
  svcrdma: Improve Read chunk sanity checking
  svcrdma: Remove svc_rdma_marshal.c
  svcrdma: Avoid Send Queue overflow
  svcrdma: Squelch disconnection messages
  sunrpc: Disable splice for krb5i
  ...
2017-07-13 13:56:24 -07:00
5d89fb3322 net: set fib rule refcount after malloc
The configure callback of fib_rules_ops can change the refcnt of a
fib rule. For instance, mlxsw takes a refcnt when adding the processing
of the rule to a work queue. Thus the rule refcnt can not be reset to
1 afterwards. Move the refcnt setting to after the allocation.

Fixes: 5361e209dd30 ("net: avoid one splat in fib_nl_delrule()")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-13 13:43:54 -07:00
15a8b93fd5 sunrpc: use constant time memory comparison for mac
Otherwise, we enable a MAC forgery via timing attack.
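
For illustration, a constant-time comparison can be sketched in userspace
as follows. The helper name ct_memneq is hypothetical; in the kernel,
crypto_memneq() serves this purpose. Unlike memcmp(), the loop does not
stop at the first mismatching byte, so its running time does not reveal
how many leading bytes of a forged MAC were correct.

```c
#include <stddef.h>

/* Hypothetical helper: returns 0 if the buffers are equal, nonzero
 * otherwise. Every byte is examined regardless of where the first
 * mismatch occurs. */
static unsigned char ct_memneq(const void *a, const void *b, size_t len)
{
    const volatile unsigned char *pa = a;
    const volatile unsigned char *pb = b;
    unsigned char diff = 0;
    size_t i;

    /* Accumulate the XOR of each byte pair; no early exit. */
    for (i = 0; i < len; i++)
        diff |= pa[i] ^ pb[i];
    return diff;
}
```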

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: linux-nfs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:14 -04:00
6afafa7799 xprtrdma: Fix documenting comments in frwr_ops.c
Clean up.

FASTREG and LOCAL_INV WRs are typically not signaled. localinv_wake
is used for the last LOCAL_INV WR in a chain, which is always
signaled. The documenting comments should reflect that.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:13 -04:00
d933cc3201 xprtrdma: Replace PAGE_MASK with offset_in_page()
Clean up.
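
As a rough userspace sketch of the equivalence: the kernel defines
offset_in_page(p) as ((unsigned long)(p) & ~PAGE_MASK). The DEMO_ names
and the 4096-byte page size below are assumptions made for illustration
only.

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096UL                     /* assumed page size */
#define DEMO_PAGE_MASK (~(DEMO_PAGE_SIZE - 1))    /* clears the in-page bits */

/* Open-coded form: mask off everything except the in-page bits. */
static unsigned long offset_open_coded(uintptr_t addr)
{
    return (unsigned long)addr & ~DEMO_PAGE_MASK;
}

/* Helper form, mirroring the shape of the kernel macro. */
#define demo_offset_in_page(p) ((unsigned long)(p) & ~DEMO_PAGE_MASK)
```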

Reported-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:13 -04:00
e2f6ef0915 xprtrdma: FMR does not need list_del_init()
Clean up.

Commit 38f1932e60ba ("xprtrdma: Remove FMRs from the unmap list
after unmapping") utilized list_del_init() to try to prevent some
list corruption. The corruption was actually caused by the reply
handler racing with a signal. Now that MR invalidation is properly
serialized, list_del_init() can safely be replaced.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:13 -04:00
173b8f49b3 xprtrdma: Demote "connect" log messages
Some have complained about the log messages generated when xprtrdma
opens or closes a connection to a server. When an NFS mount is
mostly idle these can appear every few minutes as the client idles
out the connection and reconnects.

Connection and disconnection are a normal part of operation, and not
exceptional, so change these to dprintk's for now. At some point
all of these will be converted to tracepoints, but that's for
another day.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:12 -04:00
1f541895da xprtrdma: Don't defer MR recovery if ro_map fails
Deferred MR recovery does a DMA-unmapping of the MW. However, ro_map
invokes rpcrdma_defer_mr_recovery in some error cases where the MW
has not even been DMA-mapped yet.

Avoid a DMA-unmapping error by replacing the rpcrdma_defer_mr_recovery call.

Also note that if ib_dma_map_sg is asked to map 0 nents, it will
return 0. So the extra "if (i == 0)" check is no longer needed.

Fixes: 42fe28f60763 ("xprtrdma: Do not leak an MW during a DMA ...")
Fixes: 505bbe64dd04 ("xprtrdma: Refactor MR recovery work queues")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:11 -04:00
8d75483a23 xprtrdma: Fix FRWR invalidation error recovery
When ib_post_send() fails, all LOCAL_INV WRs past @bad_wr have to be
examined, and the MRs reset by hand.

I'm not sure how the existing code can work by comparing R_keys.
Restructure the logic so that instead it walks the chain of WRs,
starting from the first bad one.
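
The chain walk can be pictured with a simplified userspace sketch. The
demo_wr and demo_mr types and the reset_unposted helper are hypothetical
and much reduced; they are not the verbs API structures or the code in
this commit.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a memory region and a work
 * request that refers to it. */
struct demo_mr {
    int needs_reset;
};

struct demo_wr {
    struct demo_wr *next;
    struct demo_mr *mr;
};

/* Starting at the first WR that was not posted, flag every remaining MR
 * in the chain for manual reset. Returns the number of WRs visited. */
static int reset_unposted(struct demo_wr *bad_wr)
{
    int n = 0;

    for (; bad_wr; bad_wr = bad_wr->next) {
        bad_wr->mr->needs_reset = 1;
        n++;
    }
    return n;
}
```

WRs ahead of the bad one were posted and are left alone, which is why
the caller still has to wait for their completions.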

Make sure to wait for completion if at least one WR was actually
posted. Otherwise, if the ib_post_send fails, we can end up
DMA-unmapping the MR while LOCAL_INV operations are in flight.

Commit 7a89f9c626e3 ("xprtrdma: Honor ->send_request API contract")
added the rdma_disconnect() call site. The disconnect actually
causes more problems than it solves, and SQ overruns happen only as
a result of software bugs. So remove it.

Fixes: d7a21c1bed54 ("xprtrdma: Reset MRs in frwr_op_unmap_sync()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:11 -04:00
431af645cf xprtrdma: Fix client lock-up after application signal fires
After a signal, the RPC client aborts synchronous RPCs running on
behalf of the signaled application.

The server is still executing those RPCs, and will write the results
back into the client's memory when it's done. By the time the server
writes the results, that memory is likely being used for other
purposes. Therefore xprtrdma has to immediately invalidate all
memory regions used by those aborted RPCs to prevent the server's
writes from clobbering that re-used memory.

With FMR memory registration, invalidation takes a relatively long
time. In fact, the invalidation is often still running when the
server tries to write the results into the memory regions that are
being invalidated.

This sets up a race between two processes:

1.  After the signal, xprt_rdma_free calls ro_unmap_safe.
2.  While ro_unmap_safe is still running, the server replies and
    rpcrdma_reply_handler runs, calling ro_unmap_sync.

Both processes invoke ib_unmap_fmr on the same FMR.

The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
the same time, but HCAs generally don't tolerate this. Sometimes
this can result in a system crash.

If the HCA happens to survive, rpcrdma_reply_handler continues. It
removes the rpc_rqst from rq_list and releases the transport_lock.
This enables xprt_rdma_free to run in another process, and the
rpc_rqst is released while rpcrdma_reply_handler is still waiting
for the ib_unmap_fmr call to finish.

But further down in rpcrdma_reply_handler, the transport_lock is
taken again, and "rqst" is dereferenced. If "rqst" has already been
released, this triggers a general protection fault. Since bottom-
halves are disabled, the system locks up.

Address both issues by reversing the order of the xprt_lookup_rqst
call and the ro_unmap_sync call. Introduce a separate lookup
mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
xprt_lookup_rqst. Now the handler takes the transport_lock once
and holds it for the XID lookup and RPC completion.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:11 -04:00
a80d66c9e0 xprtrdma: Rename rpcrdma_req::rl_free
Clean up: I'm about to use the rl_free field for purposes other than
a free list. So use a more generic name.

This is a refactoring change only.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:10 -04:00
451d26e151 xprtrdma: Pass only the list of registered MRs to ro_unmap_sync
There are rare cases where an rpcrdma_req can be re-used (via
rpcrdma_buffer_put) while the RPC reply handler is still running.
This is due to a signal firing at just the wrong instant.

Since commit 9d6b04097882 ("xprtrdma: Place registered MWs on a
per-req list"), rpcrdma_mws are self-contained; i.e., they fully
describe an MR and scatterlist, and no part of that information is
stored in struct rpcrdma_req.

As part of closing the above race window, pass only the req's list
of registered MRs to ro_unmap_sync, rather than the rpcrdma_req
itself.

Some extra transport header sanity checking is removed. Since the
client depends on its own recollection of what memory had been
registered, there doesn't seem to be a way to abuse this change.

And, the check was not terribly effective. If the client had sent
Read chunks, the "list_empty" test is negative in both of the
removed cases, which are actually looking for Write or Reply
chunks.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:10 -04:00
4b196dc6fe xprtrdma: Pre-mark remotely invalidated MRs
There are rare cases where an rpcrdma_req and its matched
rpcrdma_rep can be re-used, via rpcrdma_buffer_put, while the RPC
reply handler is still using that req. This is typically due to a
signal firing at just the wrong instant.

As part of closing this race window, avoid using the wrong
rpcrdma_rep to detect remotely invalidated MRs. Mark MRs as
invalidated while we are sure the rep is still OK to use.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:10 -04:00