Commit Graph

63260 Commits

Author SHA1 Message Date
Alexander Lobakin
b65578cec1 skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too
commit 66c556025d687dbdd0f748c5e1df89c977b6c02a upstream.

Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
tiny skbs") ensured that skbs with data size lower than 1025 bytes
will be kmalloc'ed to avoid excessive page cache fragmentation and
memory consumption.
However, the fix adressed only __napi_alloc_skb() (primarily for
virtio_net and napi_get_frags()), but the issue can still be achieved
through __netdev_alloc_skb(), which is still used by several drivers.
Drivers often allocate a tiny skb for headers and place the rest of
the frame to frags (so-called copybreak).
Mirror the condition to __netdev_alloc_skb() to handle this case too.

Since v1 [0]:
 - fix "Fixes:" tag;
 - refine commit message (mention copybreak usecase).

[0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me

Fixes: a1c7fff7e1 ("net: netdev_alloc_skb() use build_skb()")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:23 +01:00
Guillaume Nault
83d7403b2e netfilter: rpfilter: mask ecn bits before fib lookup
commit 2e5a6266fbb11ae93c468dfecab169aca9c27b43 upstream.

RT_TOS() only masks one of the two ECN bits. Therefore rpfilter_mt()
treats Not-ECT or ECT(1) packets in a different way than those with
ECT(0) or CE.

Reproducer:

  Create two netns, connected with a veth:
  $ ip netns add ns0
  $ ip netns add ns1
  $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
  $ ip -netns ns0 link set dev veth01 up
  $ ip -netns ns1 link set dev veth10 up
  $ ip -netns ns0 address add 192.0.2.10/32 dev veth01
  $ ip -netns ns1 address add 192.0.2.11/32 dev veth10

  Add a route to ns1 in ns0:
  $ ip -netns ns0 route add 192.0.2.11/32 dev veth01

  In ns1, only packets with TOS 4 can be routed to ns0:
  $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10

  Ping from ns0 to ns1 works regardless of the ECN bits, as long as TOS
  is 4:
  $ ip netns exec ns0 ping -Q 4 192.0.2.11   # TOS 4, Not-ECT
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 5 192.0.2.11   # TOS 4, ECT(1)
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 6 192.0.2.11   # TOS 4, ECT(0)
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 7 192.0.2.11   # TOS 4, CE
    ... 0% packet loss ...

  Now use iptable's rpfilter module in ns1:
  $ ip netns exec ns1 iptables-legacy -t raw -A PREROUTING -m rpfilter --invert -j DROP

  Not-ECT and ECT(1) packets still pass:
  $ ip netns exec ns0 ping -Q 4 192.0.2.11   # TOS 4, Not-ECT
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 5 192.0.2.11   # TOS 4, ECT(1)
    ... 0% packet loss ...

  But ECT(0) and ECN packets are dropped:
  $ ip netns exec ns0 ping -Q 6 192.0.2.11   # TOS 4, ECT(0)
    ... 100% packet loss ...
  $ ip netns exec ns0 ping -Q 7 192.0.2.11   # TOS 4, CE
    ... 100% packet loss ...

After this patch, rpfilter doesn't drop ECT(0) and CE packets anymore.

Fixes: 8f97339d3f ("netfilter: add ipv4 reverse path filter match")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:19 +01:00
Cong Wang
238b5ebdb6 cls_flower: call nla_ok() before nla_next()
commit c96adff95619178e2118925578343ad54857c80c upstream.

fl_set_enc_opt() simply checks if there are still bytes left to parse,
but this is not sufficent as syzbot seems to be able to generate
malformatted netlink messages. nla_ok() is more strict so should be
used to validate the next nlattr here.

And nla_validate_nested_deprecated() has less strict check too, it is
probably too late to switch to the strict version, but we can just
call nla_ok() too after it.

Reported-and-tested-by: syzbot+2624e3778b18fc497c92@syzkaller.appspotmail.com
Fixes: 0a6e77784f ("net/sched: allow flower to match tunnel options")
Fixes: 79b1011cb3 ("net: sched: allow flower to match erspan options")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20210115185024.72298-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-27 11:55:19 +01:00
Maxim Mikityanskiy
636868a52d xsk: Clear pool even for inactive queues
[ Upstream commit b425e24a934e21a502d25089c6c7443d799c5594 ]

The number of queues can change by other means, rather than ethtool. For
example, attaching an mqprio qdisc with num_tc > 1 leads to creating
multiple sets of TX queues, which may be then destroyed when mqprio is
deleted. If an AF_XDP socket is created while mqprio is active,
dev->_tx[queue_id].pool will be filled, but then real_num_tx_queues may
decrease with deletion of mqprio, which will mean that the pool won't be
NULLed, and a further increase of the number of TX queues may expose a
dangling pointer.

To avoid any potential misbehavior, this commit clears pool for RX and
TX queues, regardless of real_num_*_queues, still taking into
consideration num_*_queues to avoid overflows.

Fixes: 1c1efc2af1 ("xsk: Create and free buffer pool independently from umem")
Fixes: a41b4f3c58 ("xsk: simplify xdp_clear_umem_at_qid implementation")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20210118160333.333439-1-maximmi@mellanox.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27 11:55:10 +01:00
Song Liu
ef9eb913c0 bpf: Reject too big ctx_size_in for raw_tp test run
[ Upstream commit 7ac6ad051150592557520b45773201b987ecfce3 ]

syzbot reported a WARNING for allocating too big memory:

WARNING: CPU: 1 PID: 8484 at mm/page_alloc.c:4976 __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5011
Modules linked in:
CPU: 1 PID: 8484 Comm: syz-executor862 Not tainted 5.11.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:4976
Code: 00 00 0c 00 0f 85 a7 00 00 00 8b 3c 24 4c 89 f2 44 89 e6 c6 44 24 70 00 48 89 6c 24 58 e8 d0 d7 ff ff 49 89 c5 e9 ea fc ff ff <0f> 0b e9 b5 fd ff ff 89 74 24 14 4c 89 4c 24 08 4c 89 74 24 18 e8
RSP: 0018:ffffc900012efb10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 1ffff9200025df66 RCX: 0000000000000000
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000140dc0
RBP: 0000000000140dc0 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff81b1f7e1 R11: 0000000000000000 R12: 0000000000000014
R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
FS:  000000000190c880(0000) GS:ffff8880b9e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f08b7f316c0 CR3: 0000000012073000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267
alloc_pages include/linux/gfp.h:547 [inline]
kmalloc_order+0x2e/0xb0 mm/slab_common.c:837
kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853
kmalloc include/linux/slab.h:557 [inline]
kzalloc include/linux/slab.h:682 [inline]
bpf_prog_test_run_raw_tp+0x4b5/0x670 net/bpf/test_run.c:282
bpf_prog_test_run kernel/bpf/syscall.c:3120 [inline]
__do_sys_bpf+0x1ea9/0x4f10 kernel/bpf/syscall.c:4398
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x440499
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffe1f3bfb18 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440499
RDX: 0000000000000048 RSI: 0000000020000600 RDI: 000000000000000a
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401ca0
R13: 0000000000401d30 R14: 0000000000000000 R15: 0000000000000000

This is because we didn't filter out too big ctx_size_in. Fix it by
rejecting ctx_size_in that are bigger than MAX_BPF_FUNC_ARGS (12) u64
numbers.

Fixes: 1b4d60ec16 ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
Reported-by: syzbot+4f98876664c7337a4ae6@syzkaller.appspotmail.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210112234254.1906829-1-songliubraving@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27 11:55:07 +01:00
Chuck Lever
00ee972739 SUNRPC: Handle TCP socket sends with kernel_sendpage() again
[ Upstream commit 4a85a6a3320b4a622315d2e0ea91a1d2b013bce4 ]

Daire Byrne reports a ~50% aggregrate throughput regression on his
Linux NFS server after commit da1661b93b ("SUNRPC: Teach server to
use xprt_sock_sendmsg for socket sends"), which replaced
kernel_send_page() calls in NFSD's socket send path with calls to
sock_sendmsg() using iov_iter.

Investigation showed that tcp_sendmsg() was not using zero-copy to
send the xdr_buf's bvec pages, but instead was relying on memcpy.
This means copying every byte of a large NFS READ payload.

It looks like TLS sockets do indeed support a ->sendpage method,
so it's really not necessary to use xprt_sock_sendmsg() to support
TLS fully on the server. A mechanical reversion of da1661b93b is
not possible at this point, but we can re-implement the server's
TCP socket sendmsg path using kernel_sendpage().

Reported-by: Daire Byrne <daire@dneg.com>
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27 11:54:57 +01:00
Vladimir Oltean
d0b97c8cd6 net: dsa: unbind all switches from tree when DSA master unbinds
commit 07b90056cb15ff9877dca0d8f1b6583d1051f724 upstream.

Currently the following happens when a DSA master driver unbinds while
there are DSA switches attached to it:

$ echo 0000:00:00.5 > /sys/bus/pci/drivers/mscc_felix/unbind
------------[ cut here ]------------
WARNING: CPU: 0 PID: 392 at net/core/dev.c:9507
Call trace:
 rollback_registered_many+0x5fc/0x688
 unregister_netdevice_queue+0x98/0x120
 dsa_slave_destroy+0x4c/0x88
 dsa_port_teardown.part.16+0x78/0xb0
 dsa_tree_teardown_switches+0x58/0xc0
 dsa_unregister_switch+0x104/0x1b8
 felix_pci_remove+0x24/0x48
 pci_device_remove+0x48/0xf0
 device_release_driver_internal+0x118/0x1e8
 device_driver_detach+0x28/0x38
 unbind_store+0xd0/0x100

Located at the above location is this WARN_ON:

	/* Notifier chain MUST detach us all upper devices. */
	WARN_ON(netdev_has_any_upper_dev(dev));

Other stacked interfaces, like VLAN, do indeed listen for
NETDEV_UNREGISTER on the real_dev and also unregister themselves at that
time, which is clearly the behavior that rollback_registered_many
expects. But DSA interfaces are not VLAN. They have backing hardware
(platform devices, PCI devices, MDIO, SPI etc) which have a life cycle
of their own and we can't just trigger an unregister from the DSA
framework when we receive a netdev notifier that the master unregisters.

Luckily, there is something we can do, and that is to inform the driver
core that we have a runtime dependency to the DSA master interface's
device, and create a device link where that is the supplier and we are
the consumer. Having this device link will make the DSA switch unbind
before the DSA master unbinds, which is enough to avoid the WARN_ON from
rollback_registered_many.

Note that even before the blamed commit, DSA did nothing intelligent
when the master interface got unregistered either. See the discussion
here:
https://lore.kernel.org/netdev/20200505210253.20311-1-f.fainelli@gmail.com/
But this time, at least the WARN_ON is loud enough that the
upper_dev_link commit can be blamed.

The advantage with this approach vs dev_hold(master) in the attached
link is that the latter is not meant for long term reference counting.
With dev_hold, the only thing that will happen is that when the user
attempts an unbind of the DSA master, netdev_wait_allrefs will keep
waiting and waiting, due to DSA keeping the refcount forever. DSA would
not access freed memory corresponding to the master interface, but the
unbind would still result in a freeze. Whereas with device links,
graceful teardown is ensured. It even works with cascaded DSA trees.

$ echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind
[ 1818.797546] device swp0 left promiscuous mode
[ 1819.301112] sja1105 spi2.0: Link is Down
[ 1819.307981] DSA: tree 1 torn down
[ 1819.312408] device eno2 left promiscuous mode
[ 1819.656803] mscc_felix 0000:00:00.5: Link is Down
[ 1819.667194] DSA: tree 0 torn down
[ 1819.711557] fsl_enetc 0000:00:00.2 eno2: Link is Down

This approach allows us to keep the DSA framework absolutely unchanged,
and the driver core will just know to unbind us first when the master
goes away - as opposed to the large (and probably impossible) rework
required if attempting to listen for NETDEV_UNREGISTER.

As per the documentation at Documentation/driver-api/device_link.rst,
specifying the DL_FLAG_AUTOREMOVE_CONSUMER flag causes the device link
to be automatically purged when the consumer fails to probe or later
unbinds. So we don't need to keep the consumer_link variable in struct
dsa_switch.

Fixes: 2f1e8ea726 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210111230943.3701806-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:05 +01:00
Lorenzo Bianconi
6423b21937 mac80211: check if atf has been disabled in __ieee80211_schedule_txq
commit c13cf5c159660451c8fbdc37efb998b198e1d305 upstream.

Check if atf has been disabled in __ieee80211_schedule_txq() in order to
avoid a given sta is always put to the beginning of the active_txqs list
and never moved to the end since deficit is not decremented in
ieee80211_sta_register_airtime()

Fixes: b4809e9484 ("mac80211: Add airtime accounting and scheduling to TXQs")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/93889406c50f1416214c079ca0b8c9faecc5143e.1608975195.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:04 +01:00
Felix Fietkau
a00432fa4c mac80211: do not drop tx nulldata packets on encrypted links
commit 2463ec86cd0338a2c2edbfb0b9d50c52ff76ff43 upstream.

ieee80211_tx_h_select_key drops any non-mgmt packets without a key when
encryption is used. This is wrong for nulldata packets that can't be
encrypted and are sent out for probing clients and indicating 4-address
mode.

Reported-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Fixes: a0761a3017 ("mac80211: drop data frames without key on encrypted links")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218191525.1168-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:04 +01:00
David Howells
6f3fe96a69 rxrpc: Fix handling of an unsupported token type in rxrpc_read()
[ Upstream commit d52e419ac8b50c8bef41b398ed13528e75d7ad48 ]

Clang static analysis reports the following:

net/rxrpc/key.c:657:11: warning: Assigned value is garbage or undefined
                toksize = toksizes[tok++];
                        ^ ~~~~~~~~~~~~~~~

rxrpc_read() contains two consecutive loops.  The first loop calculates the
token sizes and stores the results in toksizes[] and the second one uses
the array.  When there is an error in identifying the token in the first
loop, the token is skipped, no change is made to the toksizes[] array.
When the same error happens in the second loop, the token is not skipped.
This will cause the toksizes[] array to be out of step and will overrun
past the calculated sizes.

Fix this by making both loops log a message and return an error in this
case.  This should only happen if a new token type is incompletely
implemented, so it should normally be impossible to trigger this.

Fixes: 9a059cd5ca ("rxrpc: Downgrade the BUG() for unsupported token type in rxrpc_read()")
Reported-by: Tom Rix <trix@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/161046503122.2445787.16714129930607546635.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:03 +01:00
Vladimir Oltean
2bfb953aee net: dsa: clear devlink port type before unregistering slave netdevs
[ Upstream commit 91158e1680b164c8d101144ca916a3dca10c3e17 ]

Florian reported a use-after-free bug in devlink_nl_port_fill found with
KASAN:

(devlink_nl_port_fill)
(devlink_port_notify)
(devlink_port_unregister)
(dsa_switch_teardown.part.3)
(dsa_tree_teardown_switches)
(dsa_unregister_switch)
(bcm_sf2_sw_remove)
(platform_remove)
(device_release_driver_internal)
(device_links_unbind_consumers)
(device_release_driver_internal)
(device_driver_detach)
(unbind_store)

Allocated by task 31:
 alloc_netdev_mqs+0x5c/0x50c
 dsa_slave_create+0x110/0x9c8
 dsa_register_switch+0xdb0/0x13a4
 b53_switch_register+0x47c/0x6dc
 bcm_sf2_sw_probe+0xaa4/0xc98
 platform_probe+0x90/0xf4
 really_probe+0x184/0x728
 driver_probe_device+0xa4/0x278
 __device_attach_driver+0xe8/0x148
 bus_for_each_drv+0x108/0x158

Freed by task 249:
 free_netdev+0x170/0x194
 dsa_slave_destroy+0xac/0xb0
 dsa_port_teardown.part.2+0xa0/0xb4
 dsa_tree_teardown_switches+0x50/0xc4
 dsa_unregister_switch+0x124/0x250
 bcm_sf2_sw_remove+0x98/0x13c
 platform_remove+0x44/0x5c
 device_release_driver_internal+0x150/0x254
 device_links_unbind_consumers+0xf8/0x12c
 device_release_driver_internal+0x84/0x254
 device_driver_detach+0x30/0x34
 unbind_store+0x90/0x134

What happens is that devlink_port_unregister emits a netlink
DEVLINK_CMD_PORT_DEL message which associates the devlink port that is
getting unregistered with the ifindex of its corresponding net_device.
Only trouble is, the net_device has already been unregistered.

It looks like we can stub out the search for a corresponding net_device
if we clear the devlink_port's type. This looks like a bit of a hack,
but also seems to be the reason why the devlink_port_type_clear function
exists in the first place.

Fixes: 3122433eb5 ("net: dsa: Register devlink ports before calling DSA driver setup()")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian fainelli <f.fainelli@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210112004831.3778323-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:03 +01:00
Eric Dumazet
024158d3b5 net: avoid 32 x truesize under-estimation for tiny skbs
[ Upstream commit 3226b158e67cfaa677fd180152bfb28989cb2fac ]

Both virtio net and napi_get_frags() allocate skbs
with a very small skb->head

While using page fragments instead of a kmalloc backed skb->head might give
a small performance improvement in some cases, there is a huge risk of
under estimating memory usage.

For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
per page (order-3 page in x86), or even 64 on PowerPC

We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
but consuming far more memory for TCP buffers than instructed in tcp_mem[2]

Even if we force napi_alloc_skb() to only use order-0 pages, the issue
would still be there on arches with PAGE_SIZE >= 32768

This patch makes sure that small skb head are kmalloc backed, so that
other objects in the slab page can be reused instead of being held as long
as skbs are sitting in socket queues.

Note that we might in the future use the sk_buff napi cache,
instead of going through a more expensive __alloc_skb()

Another idea would be to use separate page sizes depending
on the allocated length (to never have more than 4 frags per page)

I would like to thank Greg Thelen for his precious help on this matter,
analysing crash dumps is always a time consuming task.

Fixes: fd11a83dd3 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:03 +01:00
Jakub Kicinski
00442a9621 net: sit: unregister_netdevice on newlink's error path
[ Upstream commit 47e4bb147a96f1c9b4e7691e7e994e53838bfff8 ]

We need to unregister the netdevice if config failed.
.ndo_uninit takes care of most of the heavy lifting.

This was uncovered by recent commit c269a24ce057 ("net: make
free_netdev() more lenient with unregistering devices").
Previously the partially-initialized device would be left
in the system.

Reported-and-tested-by: syzbot+2393580080a2da190f04@syzkaller.appspotmail.com
Fixes: e2f1f072db ("sit: allow to configure 6rd tunnels via netlink")
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20210114012947.2515313-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:02 +01:00
Baptiste Lepers
7c85d8e7eb rxrpc: Call state should be read with READ_ONCE() under some circumstances
[ Upstream commit a95d25dd7b94a5ba18246da09b4218f132fed60e ]

The call state may be changed at any time by the data-ready routine in
response to received packets, so if the call state is to be read and acted
upon several times in a function, READ_ONCE() must be used unless the call
state lock is held.

As it happens, we used READ_ONCE() to read the state a few lines above the
unmarked read in rxrpc_input_data(), so use that value rather than
re-reading it.

Fixes: a158bdd324 ("rxrpc: Fix call timeouts")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/161046715522.2450566.488819910256264150.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:01 +01:00
Petr Machata
1e5a4c74b5 net: dcb: Accept RTM_GETDCB messages carrying set-like DCB commands
[ Upstream commit df85bc140a4d6cbaa78d8e9c35154e1a2f0622c7 ]

In commit 826f328e2b7e ("net: dcb: Validate netlink message in DCB
handler"), Linux started rejecting RTM_GETDCB netlink messages if they
contained a set-like DCB_CMD_ command.

The reason was that privileges were only verified for RTM_SETDCB messages,
but the value that determined the action to be taken is the command, not
the message type. And validation of message type against the DCB command
was the obvious missing piece.

Unfortunately it turns out that mlnx_qos, a somewhat widely deployed tool
for configuration of DCB, accesses the DCB set-like APIs through
RTM_GETDCB.

Therefore do not bounce the discrepancy between message type and command.
Instead, in addition to validating privileges based on the actual message
type, validate them also based on the expected message type. This closes
the loophole of allowing DCB configuration on non-admin accounts, while
maintaining backward compatibility.

Fixes: 2f90b8657e ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
Fixes: 826f328e2b7e ("net: dcb: Validate netlink message in DCB handler")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/a3edcfda0825f2aa2591801c5232f2bbf2d8a554.1610384801.git.me@pmachata.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:01 +01:00
Petr Machata
dbdca9d9b7 net: dcb: Validate netlink message in DCB handler
[ Upstream commit 826f328e2b7e8854dd42ea44e6519cd75018e7b1 ]

DCB uses the same handler function for both RTM_GETDCB and RTM_SETDCB
messages. dcb_doit() bounces RTM_SETDCB mesasges if the user does not have
the CAP_NET_ADMIN capability.

However, the operation to be performed is not decided from the DCB message
type, but from the DCB command. Thus DCB_CMD_*_GET commands are used for
reading DCB objects, the corresponding SET and DEL commands are used for
manipulation.

The assumption is that set-like commands will be sent via an RTM_SETDCB
message, and get-like ones via RTM_GETDCB. However, this assumption is not
enforced.

It is therefore possible to manipulate DCB objects without CAP_NET_ADMIN
capability by sending the corresponding command in an RTM_GETDCB message.
That is a bug. Fix it by validating the type of the request message against
the type used for the response.

Fixes: 2f90b8657e ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
Signed-off-by: Petr Machata <me@pmachata.org>
Link: https://lore.kernel.org/r/a2a9b88418f3a58ef211b718f2970128ef9e3793.1608673640.git.me@pmachata.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:01 +01:00
Willem de Bruijn
26413630f4 esp: avoid unneeded kmap_atomic call
[ Upstream commit 9bd6b629c39e3fa9e14243a6d8820492be1a5b2e ]

esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for
the esp trailer.

It accesses the page with kmap_atomic to handle highmem. But
skb_page_frag_refill can return compound pages, of which
kmap_atomic only maps the first underlying page.

skb_page_frag_refill does not return highmem, because flag
__GFP_HIGHMEM is not set. ESP uses it in the same manner as TCP.
That also does not call kmap_atomic, but directly uses page_address,
in skb_copy_to_page_nocache. Do the same for ESP.

This issue has become easier to trigger with recent kmap local
debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP.

Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:01 +01:00
Hoang Le
60b8b4e631 tipc: fix NULL deref in tipc_link_xmit()
[ Upstream commit b77413446408fdd256599daf00d5be72b5f3e7c6 ]

The buffer list can have zero skb as following path:
tipc_named_node_up()->tipc_node_xmit()->tipc_link_xmit(), so
we need to check the list before casting an &sk_buff.

Fault report:
 [] tipc: Bulk publication failure
 [] general protection fault, probably for non-canonical [#1] PREEMPT [...]
 [] KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf]
 [] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.10.0-rc4+ #2
 [] Hardware name: Bochs ..., BIOS Bochs 01/01/2011
 [] RIP: 0010:tipc_link_xmit+0xc1/0x2180
 [] Code: 24 b8 00 00 00 00 4d 39 ec 4c 0f 44 e8 e8 d7 0a 10 f9 48 [...]
 [] RSP: 0018:ffffc90000006ea0 EFLAGS: 00010202
 [] RAX: dffffc0000000000 RBX: ffff8880224da000 RCX: 1ffff11003d3cc0d
 [] RDX: 0000000000000019 RSI: ffffffff886007b9 RDI: 00000000000000c8
 [] RBP: ffffc90000007018 R08: 0000000000000001 R09: fffff52000000ded
 [] R10: 0000000000000003 R11: fffff52000000dec R12: ffffc90000007148
 [] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc90000007018
 [] FS:  0000000000000000(0000) GS:ffff888037400000(0000) knlGS:000[...]
 [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [] CR2: 00007fffd2db5000 CR3: 000000002b08f000 CR4: 00000000000006f0

Fixes: af9b028e27 ("tipc: make media xmit call outside node spinlock context")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20210108071337.3598-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:00 +01:00
Aya Levin
bbcb4746a6 net: ipv6: Validate GSO SKB before finish IPv6 processing
[ Upstream commit b210de4f8c97d57de051e805686248ec4c6cfc52 ]

There are cases where GSO segment's length exceeds the egress MTU:
 - Forwarding of a TCP GRO skb, when DF flag is not set.
 - Forwarding of an skb that arrived on a virtualisation interface
   (virtio-net/vhost/tap) with TSO/GSO size set by other network
   stack.
 - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
   interface with a smaller MTU.
 - Arriving GRO skb (or GSO skb in a virtualised environment) that is
   bridged to a NETIF_F_TSO tunnel stacked over an interface with an
   insufficient MTU.

If so:
 - Consume the SKB and its segments.
 - Issue an ICMP packet with 'Packet Too Big' message containing the
   MTU, allowing the source host to reduce its Path MTU appropriately.

Note: These cases are handled in the same manner in IPv4 output finish.
This patch aligns the behavior of IPv6 and the one of IPv4.

Fixes: 9e50849054 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:04:00 +01:00
Baptiste Lepers
4b97ce051f udp: Prevent reuseport_select_sock from reading uninitialized socks
[ Upstream commit fd2ddef043592e7de80af53f47fa46fd3573086e ]

reuse->socks[] is modified concurrently by reuseport_add_sock. To
prevent reading values that have not been fully initialized, only read
the array up until the last known safe index instead of incorrectly
re-reading the last index of the array.

Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:03:59 +01:00
Dongseok Yi
24cd331741 net: fix use-after-free when UDP GRO with shared fraglist
[ Upstream commit 53475c5dd856212e91538a9501162e821cc1f791 ]

skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
writes, it will call skb_ensure_writable -> pskb_expand_head to create
a private linear section for the head_skb. And then call
skb_clone_fraglist -> skb_get on each skb in the fraglist.

skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in PF_PACKET.

Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
a link for the same frag_skbs chain. If a new skb (not frags) is
queued to one of the sk_receive_queue, multiple ptypes can see and
release this. It causes use-after-free.

[ 4443.426215] ------------[ cut here ]------------
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823]  skb_release_data+0x144/0x264
[ 4443.426828]  kfree_skb+0x58/0xc4
[ 4443.426832]  skb_queue_purge+0x64/0x9c
[ 4443.426844]  packet_set_ring+0x5f0/0x820
[ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
[ 4443.426853]  __sys_setsockopt+0x188/0x278
[ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869]  el0_svc_common+0xf0/0x1d0
[ 4443.426873]  el0_svc_handler+0x74/0x98
[ 4443.426880]  el0_svc+0x8/0xc

Fixes: 3a1296a38d (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-23 16:03:59 +01:00
Dinghao Liu
88a5c90f39 netfilter: nf_nat: Fix memleak in nf_nat_init
commit 869f4fdaf4ca7bb6e0d05caf6fa1108dddc346a7 upstream.

When register_pernet_subsys() fails, nf_nat_bysource
should be freed just like when nf_ct_extend_register()
fails.

Fixes: 1cd472bf03 ("netfilter: nf_nat: add nat hook register functions to nf_nat")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19 18:27:33 +01:00
Jesper Dangaard Brouer
f14e31c169 netfilter: conntrack: fix reading nf_conntrack_buckets
commit f6351c3f1c27c80535d76cac2299aec44c36291e upstream.

The old way of changing the conntrack hashsize runtime was through changing
the module param via file /sys/module/nf_conntrack/parameters/hashsize. This
was extended to sysctl change in commit 3183ab8997 ("netfilter: conntrack:
allow increasing bucket size via sysctl too").

The commit introduced second "user" variable nf_conntrack_htable_size_user
which shadow actual variable nf_conntrack_htable_size. When hashsize is
changed via module param this "user" variable isn't updated. This results in
sysctl net/netfilter/nf_conntrack_buckets shows the wrong value when users
update via the old way.

This patch fix the issue by always updating "user" variable when reading the
proc file. This will take care of changes to the actual variable without
sysctl need to be aware.

Fixes: 3183ab8997 ("netfilter: conntrack: allow increasing bucket size via sysctl too")
Reported-by: Yoel Caspersen <yoel@kviknet.dk>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19 18:27:33 +01:00
j.nixdorf@avm.de
f0cd3fba4e net: sunrpc: interpret the return value of kstrtou32 correctly
commit 86b53fbf08f48d353a86a06aef537e78e82ba721 upstream.

A return value of 0 means success. This is documented in lib/kstrtox.c.

This was found by trying to mount an NFS share from a link-local IPv6
address with the interface specified by its index:

  mount("[fe80::1%1]:/srv/nfs", "/mnt", "nfs", 0, "nolock,addr=fe80::1%1")

Before this commit this failed with EINVAL and also caused the following
message in dmesg:

  [...] NFS: bad IP address specified: addr=fe80::1%1

The syscall using the same address based on the interface name instead
of its index succeeds.

Credits for this patch go to my colleague Christian Speich, who traced
the origin of this bug to this line of code.

Signed-off-by: Johannes Nixdorf <j.nixdorf@avm.de>
Fixes: 00cfaa943e ("replace strict_strto calls")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-19 18:27:33 +01:00
Arnd Bergmann
4ac5d20182 cfg80211: select CONFIG_CRC32
[ Upstream commit 152a8a6c017bfdeda7f6d052fbc6e151891bd9b6 ]

Without crc32 support, this fails to link:

arm-linux-gnueabi-ld: net/wireless/scan.o: in function `cfg80211_scan_6ghz':
scan.c:(.text+0x928): undefined reference to `crc32_le'

Fixes: c8cb5b854b ("nl80211/cfg80211: support 6 GHz scanning")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19 18:27:28 +01:00
Vasily Averin
cc77e4a020 netfilter: ipset: fixes possible oops in mtype_resize
[ Upstream commit 2b33d6ffa9e38f344418976b06057e2fc2aa9e2a ]

currently mtype_resize() can cause oops

        t = ip_set_alloc(htable_size(htable_bits));
        if (!t) {
                ret = -ENOMEM;
                goto out;
        }
        t->hregion = ip_set_alloc(ahash_sizeof_regions(htable_bits));

Increased htable_bits can force htable_size() to return 0.
In own turn ip_set_alloc(0) returns not 0 but ZERO_SIZE_PTR,
so follwoing access to t->hregion should trigger an OOPS.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-19 18:27:23 +01:00
Vasily Averin
43f6ea4140 net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
commit 54970a2fbb673f090b7f02d7f57b10b2e0707155 upstream.

syzbot reproduces BUG_ON in skb_checksum_help():
tun creates (bogus) skb with huge partial-checksummed area and
small ip packet inside. Then ip_rcv trims the skb based on size
of internal ip packet, after that csum offset points beyond of
trimmed skb. Then checksum_tg() called via netfilter hook
triggers BUG_ON:

        offset = skb_checksum_start_offset(skb);
        BUG_ON(offset >= skb_headlen(skb));

To work around the problem this patch forces pskb_trim_rcsum_slow()
to return -EINVAL in described scenario. It allows its callers to
drop such kind of packets.

Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0
Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:17:06 +01:00
Oliver Hartkopp
61e8c02ae6 can: isotp: isotp_getname(): fix kernel information leak
commit b42b3a2744b3e8f427de79896720c72823af91ad upstream.

Initialize the sockaddr_can structure to prevent a data leak to user space.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: syzbot+057884e2f453e8afebc8@syzkaller.appspotmail.com
Fixes: e057dd3fc2 ("can: add ISO 15765-2:2016 transport protocol")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20210112091643.11789-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:17:05 +01:00
Magnus Karlsson
be66572736 xsk: Rollback reservation at NETDEV_TX_BUSY
commit b1b95cb5c0a9694d47d5f845ba97e226cfda957d upstream.

Rollback the reservation in the completion ring when we get a
NETDEV_TX_BUSY. When this error is received from the driver, we are
supposed to let the user application retry the transmit again. And in
order to do this, we need to roll back the failed send so it can be
retried. Unfortunately, we did not cancel the reservation we had made
in the completion ring. By not doing this, we actually make the
completion ring one entry smaller per NETDEV_TX_BUSY error we get, and
after enough of these errors the completion ring will be of size zero
and transmit will stop working.

Fix this by cancelling the reservation when we get a NETDEV_TX_BUSY
error.

Fixes: 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201218134525.13119-3-magnus.karlsson@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:17:05 +01:00
Magnus Karlsson
9ad0375ed2 xsk: Fix race in SKB mode transmit with shared cq
commit f09ced4053bc0a2094a12b60b646114c966ef4c6 upstream.

Fix a race when multiple sockets are simultaneously calling sendto()
when the completion ring is shared in the SKB case. This is the case
when you share the same netdev and queue id through the
XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
be in xsk_generic_xmit() and call the backpressure mechanism in
xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
specific scenario, a race might occur since the rings are
single-producer single-consumer.

Fix this by moving the tx_completion_lock from the socket to the pool
as the pool is shared between the sockets that share the completion
ring. (The pool is not shared when this is not the case.) And then
protect the accesses to xskq_prod_reserve() with this lock. The
tx_completion_lock is renamed cq_lock to better reflect that it
protects accesses to the potentially shared completion ring.

Fixes: 35fcde7f8d ("xsk: support for Tx")
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:17:05 +01:00
Petr Machata
5fb8a3116c nexthop: Bounce NHA_GATEWAY in FDB nexthop groups
[ Upstream commit b19218b27f3477316d296e8bcf4446aaf017aa69 ]

The function nh_check_attr_group() is called to validate nexthop groups.
The intention of that code seems to have been to bounce all attributes
above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
these attributes except when NHA_FDB attribute is present--then it accepts
them.

NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.

But that still leaves NHA_GATEWAY as an attribute that would be accepted in
FDB nexthop groups (with no meaning), so long as it keeps the address
family as unspecified:

 # ip nexthop add id 1 fdb via 127.0.0.1
 # ip nexthop add id 10 fdb via default group 1

The nexthop code is still relatively new and likely not used very broadly,
and the FDB bits are newer still. Even though there is a reproducer out
there, it relies on an improbable gateway arguments "via default", "via
all" or "via any". Given all this, I believe it is OK to reformulate the
condition to do the right thing and bounce NHA_GATEWAY.

Fixes: 38428d6871 ("nexthop: support for fdb ecmp nexthops")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:16:57 +01:00
Ido Schimmel
eaa7a6c39d nexthop: Unlink nexthop group entry in error path
[ Upstream commit 7b01e53eee6dce7a8a6736e06b99b68cd0cc7a27 ]

In case of error, remove the nexthop group entry from the list to which
it was previously added.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:16:57 +01:00
Ido Schimmel
6486bc0a34 nexthop: Fix off-by-one error in error path
[ Upstream commit 07e61a979ca4dddb3661f59328b3cd109f6b0070 ]

A reference was not taken for the current nexthop entry, so do not try
to put it in the error path.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:16:57 +01:00
Florian Westphal
b0ff6d00ed net: ip: always refragment ip defragmented packets
[ Upstream commit bb4cc1a18856a73f0ff5137df0c2a31f4c50f6cf ]

Conntrack reassembly records the largest fragment size seen in IPCB.
However, when this gets forwarded/transmitted, fragmentation will only
be forced if one of the fragmented packets had the DF bit set.

In that case, a flag in IPCB will force fragmentation even if the
MTU is large enough.

This should work fine, but this breaks with ip tunnels.
Consider client that sends a UDP datagram of size X to another host.

The client fragments the datagram, so two packets, of size y and z, are
sent. DF bit is not set on any of these packets.

Middlebox netfilter reassembles those packets back to single size-X
packet, before routing decision.

packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit
isn't set.  At output time, ip refragmentation is skipped as well
because x is still smaller than the mtu of the output device.

If ttransmit device is an ip tunnel, the packet size increases to
x+overhead.

Also, tunnel might be configured to force DF bit on outer header.

In this case, packet will be dropped (exceeds MTU) and an ICMP error is
generated back to sender.

But sender already respects the announced MTU, all the packets that
it sent did fit the announced mtu.

Force refragmentation as per original sizes unconditionally so ip tunnel
will encapsulate the fragments instead.

The only other solution I see is to place ip refragmentation in
the ip_tunnel code to handle this case.

Fixes: d6b915e29f ("ip_fragment: don't forward defragmented DF packet")
Reported-by: Christian Perle <christian.perle@secunet.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:16:56 +01:00
Florian Westphal
d5fc41ebe2 net: fix pmtu check in nopmtudisc mode
[ Upstream commit 50c661670f6a3908c273503dfa206dfc7aa54c07 ]

For some reason ip_tunnel insist on setting the DF bit anyway when the
inner header has the DF bit set, EVEN if the tunnel was configured with
'nopmtudisc'.

This means that the script added in the previous commit
cannot be made to work by adding the 'nopmtudisc' flag to the
ip tunnel configuration. Doing so breaks connectivity even for the
without-conntrack/netfilter scenario.

When nopmtudisc is set, the tunnel will skip the mtu check, so no
icmp error is sent to client. Then, because inner header has DF set,
the outer header gets added with DF bit set as well.

IP stack then sends an error to itself because the packet exceeds
the device MTU.

Fixes: 23a3647bc4 ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:16:56 +01:00
Sean Tranchetti
69363e37d4 net: ipv6: fib: flush exceptions when purging route
[ Upstream commit d8f5c29653c3f6995e8979be5623d263e92f6b86 ]

Route removal is handled by two code paths. The main removal path is via
fib6_del_route() which will handle purging any PMTU exceptions from the
cache, removing all per-cpu copies of the DST entry used by the route, and
releasing the fib6_info struct.

The second removal location is during fib6_add_rt2node() during a route
replacement operation. This path also calls fib6_purge_rt() to handle
cleaning up the per-cpu copies of the DST entries and releasing the
fib6_info associated with the older route, but it does not flush any PMTU
exceptions that the older route had. Since the older route is removed from
the tree during the replacement, we lose any way of accessing it again.

As these lingering DSTs and the fib6_info struct are holding references to
the underlying netdevice struct as well, unregistering that device from the
kernel can never complete.

Fixes: 2b760fcf5c ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:16:56 +01:00
Jakub Kicinski
9591f32a63 net: vlan: avoid leaks on register_vlan_dev() failures
[ Upstream commit 55b7ab1178cbf41f979ff83236d3321ad35ed2ad ]

VLAN checks for NETREG_UNINITIALIZED to distinguish between
registration failure and unregistration in progress.

Since commit cb626bf566 ("net-sysfs: Fix reference count leak")
registration failure may, however, result in NETREG_UNREGISTERED
as well as NETREG_UNINITIALIZED.

This fix is similer to cebb69754f ("rtnetlink: Fix
memory(net_device) leak when ->newlink fails")

Fixes: cb626bf566 ("net-sysfs: Fix reference count leak")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-17 14:16:55 +01:00
Magnus Karlsson
0fae7d269e xsk: Fix memory leak for failed bind
commit 8bee683384087a6275c9183a483435225f7bb209 upstream.

Fix a possible memory leak when a bind of an AF_XDP socket fails. When
the fill and completion rings are created, they are tied to the
socket. But when the buffer pool is later created at bind time, the
ownership of these two rings are transferred to the buffer pool as
they might be shared between sockets (and the buffer pool cannot be
created until we know what we are binding to). So, before the buffer
pool is created, these two rings are cleaned up with the socket, and
after they have been transferred they are cleaned up together with
the buffer pool.

The problem is that ownership was transferred before it was absolutely
certain that the buffer pool could be created and initialized
correctly and when one of these errors occurred, the fill and
completion rings did neither belong to the socket nor the pool and
where therefore leaked. Solve this by moving the ownership transfer
to the point where the buffer pool has been completely set up and
there is no way it can fail.

Fixes: 7361f9c3d7 ("xsk: Move fill and completion rings to buffer pool")
Reported-by: syzbot+cfa88ddd0655afa88763@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201214085127.3960-1-magnus.karlsson@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:26 +01:00
Pablo Neira Ayuso
8b109f4cd1 netfilter: nft_dynset: report EOPNOTSUPP on missing set feature
commit 95cd4bca7b1f4a25810f3ddfc5e767fb46931789 upstream.

If userspace requests a feature which is not available the original set
definition, then bail out with EOPNOTSUPP. If userspace sends
unsupported dynset flags (new feature not supported by this kernel),
then report EOPNOTSUPP to userspace. EINVAL should be only used to
report malformed netlink messages from userspace.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:26 +01:00
Florian Westphal
810bc977f8 netfilter: xt_RATEEST: reject non-null terminated string from userspace
commit 6cb56218ad9e580e519dcd23bfb3db08d8692e5a upstream.

syzbot reports:
detected buffer overflow in strlen
[..]
Call Trace:
 strlen include/linux/string.h:325 [inline]
 strlcpy include/linux/string.h:348 [inline]
 xt_rateest_tg_checkentry+0x2a5/0x6b0 net/netfilter/xt_RATEEST.c:143

strlcpy assumes src is a c-string. Check info->name before its used.

Reported-by: syzbot+e86f7c428c8c50db65b4@syzkaller.appspotmail.com
Fixes: 5859034d7e ("[NETFILTER]: x_tables: add RATEEST target")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:26 +01:00
Vasily Averin
d17f2ccf6f netfilter: ipset: fix shift-out-of-bounds in htable_bits()
commit 5c8193f568ae16f3242abad6518dc2ca6c8eef86 upstream.

htable_bits() can call jhash_size(32) and trigger shift-out-of-bounds

UBSAN: shift-out-of-bounds in net/netfilter/ipset/ip_set_hash_gen.h:151:6
shift exponent 32 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 8498 Comm: syz-executor519
 Not tainted 5.10.0-rc7-next-20201208-syzkaller #0
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:120
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
 htable_bits net/netfilter/ipset/ip_set_hash_gen.h:151 [inline]
 hash_mac_create.cold+0x58/0x9b net/netfilter/ipset/ip_set_hash_gen.h:1524
 ip_set_create+0x610/0x1380 net/netfilter/ipset/ip_set_core.c:1115
 nfnetlink_rcv_msg+0xecc/0x1180 net/netfilter/nfnetlink.c:252
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
 nfnetlink_rcv+0x1ac/0x420 net/netfilter/nfnetlink.c:600
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2345
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2399
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

This patch replaces htable_bits() by simple fls(hashsize - 1) call:
it alone returns valid nbits both for round and non-round hashsizes.
It is normal to set any nbits here because it is validated inside
following htable_size() call which returns 0 for nbits>31.

Fixes: 1feab10d7e6d("netfilter: ipset: Unified hash type generation")
Reported-by: syzbot+d66bfadebca46cf61a2b@syzkaller.appspotmail.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:26 +01:00
Subash Abhinov Kasiviswanathan
27bc60d967 netfilter: x_tables: Update remaining dereference to RCU
commit 443d6e86f821a165fae3fc3fc13086d27ac140b1 upstream.

This fixes the dereference to fetch the RCU pointer when holding
the appropriate xtables lock.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: cc00bcaa58 ("netfilter: x_tables: Switch synchronization to RCU")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:25 +01:00
Cong Wang
0d6eeee3b9 erspan: fix version 1 check in gre_parse_header()
[ Upstream commit 085c7c4e1c0e50d90b7d90f61a12e12b317a91e2 ]

Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
have an erspan header. So the check in gre_parse_header() is wrong,
we have to distinguish version 1 from version 0.

We can just check the gre header length like is_erspan_type1().

Fixes: cb73ee40b1 ("net: ip_gre: use erspan key field for tunnel lookup")
Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com
Cc: William Tu <u9012063@gmail.com>
Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:12 +01:00
Randy Dunlap
7a20969b87 net: sched: prevent invalid Scell_log shift count
[ Upstream commit bd1248f1ddbc48b0c30565fce897a3b6423313b8 ]

Check Scell_log shift size in red_check_params() and modify all callers
of red_check_params() to pass Scell_log.

This prevents a shift out-of-bounds as detected by UBSAN:
  UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22
  shift exponent 72 is too large for 32-bit type 'int'

Fixes: 8afa10cbe2 ("net_sched: red: Avoid illegal values")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com
Cc: Nogah Frankel <nogahf@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:12 +01:00
Guillaume Nault
5e87eabce7 ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
[ Upstream commit 21fdca22eb7df2a1e194b8adb812ce370748b733 ]

RT_TOS() only clears one of the ECN bits. Therefore, when
fib_compute_spec_dst() resorts to a fib lookup, it can return
different results depending on the value of the second ECN bit.

For example, ECT(0) and ECT(1) packets could be treated differently.

  $ ip netns add ns0
  $ ip netns add ns1
  $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
  $ ip -netns ns0 link set dev lo up
  $ ip -netns ns1 link set dev lo up
  $ ip -netns ns0 link set dev veth01 up
  $ ip -netns ns1 link set dev veth10 up

  $ ip -netns ns0 address add 192.0.2.10/24 dev veth01
  $ ip -netns ns1 address add 192.0.2.11/24 dev veth10

  $ ip -netns ns1 address add 192.0.2.21/32 dev lo
  $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
  $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0

With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
(ping uses -Q to set all TOS and ECN bits):

  $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
  [...]
  64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms

But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
because the "tos 4" route isn't matched:

  $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
  [...]
  64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms

After this patch the ECN bits don't affect the result anymore:

  $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
  [...]
  64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms

Fixes: 35ebf65e85 ("ipv4: Create and use fib_compute_spec_dst() helper.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:12 +01:00
Antoine Tenart
e4535dbb72 net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc
[ Upstream commit 4ae2bb81649dc03dfc95875f02126b14b773f7ab ]

Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:11 +01:00
Antoine Tenart
90297553d8 net-sysfs: take the rtnl lock when storing xps_rxqs
[ Upstream commit 2d57b4f142e0b03e854612b8e28978935414bced ]

Two race conditions can be triggered when storing xps rxqs, resulting in
various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue:

   - netif_set_xps_queue uses dev->tc_num as one of the parameters to
     compute the size of new_dev_maps when allocating it. dev->tc_num is
     also used to access the map, and the compiler may generate code to
     retrieve this field multiple times in the function.

   - netdev_set_num_tc sets dev->tc_num.

   If new_dev_maps is allocated using dev->tc_num and then dev->tc_num
   is set to a higher value through netdev_set_num_tc, later accesses to
   new_dev_maps in netif_set_xps_queue could lead to accessing memory
   outside of new_dev_maps; triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

   2.1. netdev_set_num_tc starts by resetting the xps queues,
        dev->tc_num isn't updated yet.

   2.2. netif_set_xps_queue is called, setting up the map with the
        *old* dev->num_tc.

   2.3. netdev_set_num_tc updates dev->tc_num.

   2.4. Later accesses to the map lead to out of bound accesses and
        oops.

   A similar issue can be found with netdev_reset_tc.

One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_rxqs in a concurrent thread. With the right timing an oops is
triggered.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_rxqs_store.

Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:11 +01:00
Antoine Tenart
0ca897c1ec net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc
[ Upstream commit fb25038586d0064123e393cadf1fadd70a9df97a ]

Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 184c449f91 ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:11 +01:00
Antoine Tenart
4da25d83b7 net-sysfs: take the rtnl lock when storing xps_cpus
[ Upstream commit 1ad58225dba3f2f598d2c6daed4323f24547168f ]

Two race conditions can be triggered when storing xps cpus, resulting in
various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue:

   - netif_set_xps_queue uses dev->tc_num as one of the parameters to
     compute the size of new_dev_maps when allocating it. dev->tc_num is
     also used to access the map, and the compiler may generate code to
     retrieve this field multiple times in the function.

   - netdev_set_num_tc sets dev->tc_num.

   If new_dev_maps is allocated using dev->tc_num and then dev->tc_num
   is set to a higher value through netdev_set_num_tc, later accesses to
   new_dev_maps in netif_set_xps_queue could lead to accessing memory
   outside of new_dev_maps; triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

   2.1. netdev_set_num_tc starts by resetting the xps queues,
        dev->tc_num isn't updated yet.

   2.2. netif_set_xps_queue is called, setting up the map with the
        *old* dev->num_tc.

   2.3. netdev_set_num_tc updates dev->tc_num.

   2.4. Later accesses to the map lead to out of bound accesses and
        oops.

   A similar issue can be found with netdev_reset_tc.

One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_cpus in a concurrent thread. With the right timing an oops is
triggered.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_cpus_store.

Fixes: 184c449f91 ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:10 +01:00
John Wang
2cdf8c2747 net/ncsi: Use real net-device for response handler
[ Upstream commit 427c940558560bff2583d07fc119a21094675982 ]

When aggregating ncsi interfaces and dedicated interfaces to bond
interfaces, the ncsi response handler will use the wrong net device to
find ncsi_dev, so that the ncsi interface will not work properly.
Here, we use the original net device to fix it.

Fixes: 138635cc27 ("net/ncsi: NCSI response packet handler")
Signed-off-by: John Wang <wangzhiqiang.bj@bytedance.com>
Link: https://lore.kernel.org/r/20201223055523.2069-1-wangzhiqiang.bj@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:10 +01:00
Davide Caratti
db8895aa58 net/sched: sch_taprio: ensure to reset/destroy all child qdiscs
[ Upstream commit 698285da79f5b0b099db15a37ac661ac408c80eb ]

taprio_graft() can insert a NULL element in the array of child qdiscs. As
a consquence, taprio_reset() might not reset child qdiscs completely, and
taprio_destroy() might leak resources. Fix it by ensuring that loops that
iterate over q->qdiscs[] don't end when they find the first NULL item.

Fixes: 44d4775ca518 ("net/sched: sch_taprio: reset child qdiscs before freeing them")
Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/13edef6778fef03adc751582562fba4a13e06d6a.1608240532.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12 20:18:08 +01:00
Edward Vear
ce9163cf7a Bluetooth: Fix attempting to set RPA timeout when unsupported
commit a31489d2a368d2f9225ed6a6f595c63bc7d10de8 upstream.

During controller initialization, an LE Set RPA Timeout command is sent
to the controller if supported. However, the value checked to determine
if the command is supported is incorrect. Page 1921 of the Bluetooth
Core Spec v5.2 shows that bit 2 of octet 35 of the Supported_Commands
field corresponds to the LE Set RPA Timeout command, but currently
bit 6 of octet 35 is checked. This patch checks the correct value
instead.

This issue led to the error seen in the following btmon output during
initialization of an adapter (rtl8761b) and prevented initialization
from completing.

< HCI Command: LE Set Resolvable Private Address Timeout (0x08|0x002e) plen 2
        Timeout: 900 seconds
> HCI Event: Command Complete (0x0e) plen 4
      LE Set Resolvable Private Address Timeout (0x08|0x002e) ncmd 2
        Status: Unsupported Remote Feature / Unsupported LMP Feature (0x1a)
= Close Index: 00:E0:4C:6B:E5:03

The error did not appear when running with this patch.

Signed-off-by: Edward Vear <edwardvear@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-09 13:46:23 +01:00
Michal Kubecek
62162b3223 ethtool: fix string set id check
[ Upstream commit efb796f5571f030743e1d9c662cdebdad724f8c5 ]

Syzbot reported a shift of a u32 by more than 31 in strset_parse_request()
which is undefined behavior. This is caused by range check of string set id
using variable ret (which is always 0 at this point) instead of id (string
set id from request).

Fixes: 71921690f9 ("ethtool: provide string sets with STRSET_GET request")
Reported-by: syzbot+96523fb438937cd01220@syzkaller.appspotmail.com
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/b54ed5c5fd972a59afea3e1badfb36d86df68799.1607952208.git.mkubecek@suse.cz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06 14:56:48 +01:00
Ivan Vecera
95fcb69c49 ethtool: fix error paths in ethnl_set_channels()
[ Upstream commit ef72cd3c5ce168829c6684ecb2cae047d3493690 ]

Fix two error paths in ethnl_set_channels() to avoid lock-up caused
but unreleased RTNL.

Fixes: e19c591eaf ("ethtool: set device channel counts with CHANNELS_SET request")
Reported-by: LiLiang <liali@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/20201215090810.801777-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06 14:56:48 +01:00
Paolo Abeni
aeab3d7a04 mptcp: fix security context on server socket
[ Upstream commit 0c14846032f2c0a3b63234e1fc2759f4155b6067 ]

Currently MPTCP is not propagating the security context
from the ingress request socket to newly created msk
at clone time.

Address the issue invoking the missing security helper.

Fixes: cf7da0d66c ("mptcp: Create SUBFLOW socket for incoming connections")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06 14:56:48 +01:00
Davide Caratti
a969a632cb net/sched: sch_taprio: reset child qdiscs before freeing them
[ Upstream commit 44d4775ca51805b376a8db5b34f650434a08e556 ]

syzkaller shows that packets can still be dequeued while taprio_destroy()
is running. Let sch_taprio use the reset() function to cancel the advance
timer and drop all skbs from the child qdiscs.

Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Link: https://syzkaller.appspot.com/bug?id=f362872379bf8f0017fb667c1ab158f2d1e764ae
Reported-by: syzbot+8971da381fb5a31f542d@syzkaller.appspotmail.com
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/63b6d79b0e830ebb0283e020db4df3cdfdfb2b94.1608142843.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-06 14:56:48 +01:00
Chuck Lever
35f71f3cbd xprtrdma: Fix XDRBUF_SPARSE_PAGES support
commit 15261b9126cd5bb2ad8521da49d8f5c042d904c7 upstream.

Olga K. observed that rpcrdma_marsh_req() allocates sparse pages
only when it has determined that a Reply chunk is necessary. There
are plenty of cases where no Reply chunk is needed, but the
XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in
rpcrdma_inline_fixup() when it tries to copy parts of the received
Reply into a missing page.

To avoid crashing, handle sparse page allocation up front.

Until XATTR support was added, this issue did not appear often
because the only SPARSE_PAGES consumer always expected a reply large
enough to always require a Reply chunk.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30 11:54:15 +01:00
Dan Aloni
c02c1df4fb sunrpc: fix xs_read_xdr_buf for partial pages receive
[ Upstream commit ac9645c87380e39a8fa87a1b51721efcdea89dbf ]

When receiving pages data, return value 'ret' when positive includes
`buf->page_base`, so we should subtract that before it is used for
changing `offset` and comparing against `want`.

This was discovered on the very rare cases where the server returned a
chunk of bytes that when added to the already received amount of bytes
for the pages happened to match the current `recv.len`, for example
on this case:

     buf->page_base : 258356
     actually received from socket: 1740
     ret : 260096
     want : 260096

In this case neither of the two 'if ... goto out' trigger, and we
continue to tail parsing.

Worth to mention that the ensuing EMSGSIZE from the continued execution of
`xs_read_xdr_buf` may be observed by an application due to 4 superfluous
bytes being added to the pages data.

Fixes: 277e4ab7d5 ("SUNRPC: Simplify TCP receive code by switching to using iterators")
Signed-off-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:53 +01:00
Wen Gong
7fcf1c1380 mac80211: fix a mistake check for rx_stats update
[ Upstream commit f879ac8ed6c83ce05fcb53815a8ea83c5b6099a1 ]

It should be !is_multicast_ether_addr() in ieee80211_rx_h_sta_process()
for the rx_stats update, below commit remove the !, this patch is to
change it back.

It lead the rx rate "iw wlan0 station dump" become invalid for some
scenario when IEEE80211_HW_USES_RSS is set.

Fixes: 09a740ce35 ("mac80211: receive and process S1G beacons")
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1607483189-3891-1-git-send-email-wgong@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:50 +01:00
Johannes Berg
1a889bda7d mac80211: don't set set TDLS STA bandwidth wider than possible
[ Upstream commit f65607cdbc6b0da356ef5a22552ddd9313cf87a0 ]

When we set up a TDLS station, we set sta->sta.bandwidth solely based
on the capabilities, because the "what's the current bandwidth" check
is bypassed and only applied for other types of stations.

This leads to the unfortunate scenario that the sta->sta.bandwidth is
160 MHz if both stations support it, but we never actually configure
this bandwidth unless the AP is already using 160 MHz; even for wider
bandwidth support we only go up to 80 MHz (at least right now.)

For iwlwifi, this can also lead to firmware asserts, telling us that
we've configured the TX rates for a higher bandwidth than is actually
available due to the PHY configuration.

For non-TDLS, we check against the interface's requested bandwidth,
but we explicitly skip this check for TDLS to cope with the wider BW
case. Change this to
 (a) still limit to the TDLS peer's own chandef, which gets factored
     into the overall PHY configuration we request from the driver,
     and
 (b) limit it to when the TDLS peer is authorized, because it's only
     factored into the channel context in this case.

Fixes: 504871e602 ("mac80211: fix bandwidth computation for TDLS peers")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.fcc7d29c4590.I11f77e9e25ddf871a3c8d5604650c763e2c5887a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:50 +01:00
Wei Yongjun
6013c99742 Bluetooth: sco: Fix crash when using BT_SNDMTU/BT_RCVMTU option
[ Upstream commit f6b8c6b5543983e9de29dc14716bfa4eb3f157c4 ]

This commit add the invalid check for connected socket, without it will
causes the following crash due to sco_pi(sk)->conn being NULL:

KASAN: null-ptr-deref in range [0x0000000000000050-0x0000000000000057]
CPU: 3 PID: 4284 Comm: test_sco Not tainted 5.10.0-rc3+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
RIP: 0010:sco_sock_getsockopt+0x45d/0x8e0
Code: 48 c1 ea 03 80 3c 02 00 0f 85 ca 03 00 00 49 8b 9d f8 04 00 00 48 b8 00
      00 00 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e b5 03 00 00 8b 43 50 48 8b 0c
RSP: 0018:ffff88801bb17d88 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff83a4ecdf
RDX: 000000000000000a RSI: ffffc90002fce000 RDI: 0000000000000050
RBP: 1ffff11003762fb4 R08: 0000000000000001 R09: ffff88810e1008c0
R10: ffffffffbd695dcf R11: fffffbfff7ad2bb9 R12: 0000000000000000
R13: ffff888018ff1000 R14: dffffc0000000000 R15: 000000000000000d
FS:  00007fb4f76c1700(0000) GS:ffff88811af80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005555e3b7a938 CR3: 00000001117be001 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 ? sco_skb_put_cmsg+0x80/0x80
 ? sco_skb_put_cmsg+0x80/0x80
 __sys_getsockopt+0x12a/0x220
 ? __ia32_sys_setsockopt+0x150/0x150
 ? syscall_enter_from_user_mode+0x18/0x50
 ? rcu_read_lock_bh_held+0xb0/0xb0
 __x64_sys_getsockopt+0xba/0x150
 ? syscall_enter_from_user_mode+0x1d/0x50
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 0fc1a726f8 ("Bluetooth: sco: new getsockopt options BT_SNDMTU/BT_RCVMTU")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Luiz Augusto Von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:40 +01:00
Fedor Tokarev
7941ee42dc net: sunrpc: Fix 'snprintf' return value check in 'do_xprt_debugfs'
[ Upstream commit 35a6d396721e28ba161595b0fc9e8896c00399bb ]

'snprintf' returns the number of characters which would have been written
if enough space had been available, excluding the terminating null byte.
Thus, the return value of 'sizeof(buf)' means that the last character
has been dropped.

Signed-off-by: Fedor Tokarev <ftokarev@gmail.com>
Fixes: 2f34b8bfae ("SUNRPC: add links for all client xprts to debugfs")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:30 +01:00
Trond Myklebust
c1e628f910 SUNRPC: xprt_load_transport() needs to support the netid "rdma6"
[ Upstream commit d5aa6b22e2258f05317313ecc02efbb988ed6d38 ]

According to RFC5666, the correct netid for an IPv6 addressed RDMA
transport is "rdma6", which we've supported as a mount option since
Linux-4.7. The problem is when we try to load the module "xprtrdma6",
that will fail, since there is no modulealias of that name.

Fixes: 181342c5eb ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:30 +01:00
Trond Myklebust
d1296acac0 SUNRPC: rpc_wake_up() should wake up tasks in the correct order
[ Upstream commit e4c72201b6ec3173dfe13fa2e2335a3ad78d4921 ]

Currently, we wake up the tasks by priority queue ordering, which means
that we ignore the batching that is supposed to help with QoS issues.

Fixes: c049f8ea9a ("SUNRPC: Remove the bh-safe lock requirement on the rpc_wait_queue->lock")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:30 +01:00
Sathish Narasimman
c98d335792 Bluetooth: Fix: LL PRivacy BLE device fails to connect
[ Upstream commit 1fb17dfc258ff6208f7873cc7b8e40e27515d2d5 ]

When adding device to white list the device is added to resolving list
also. It has to be added only when HCI_ENABLE_LL_PRIVACY flag is set.
HCI_ENABLE_LL_PRIVACY flag has to be tested before adding/deleting devices
to resolving list. use_ll_privacy macro is used only to check if controller
supports LL_Privacy.

https://bugzilla.kernel.org/show_bug.cgi?id=209745

Fixes: 0eee35bdfa ("Bluetooth: Update resolving list when updating whitelist")
Signed-off-by: Sathish Narasimman <sathish.narasimman@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:05 +01:00
Anmol Karn
147cdf5f34 Bluetooth: Fix null pointer dereference in hci_event_packet()
[ Upstream commit 6dfccd13db2ff2b709ef60a50163925d477549aa ]

AMP_MGR is getting derefernced in hci_phy_link_complete_evt(), when called
from hci_event_packet() and there is a possibility, that hcon->amp_mgr may
not be found when accessing after initialization of hcon.

- net/bluetooth/hci_event.c:4945
The bug seems to get triggered in this line:

bredr_hcon = hcon->amp_mgr->l2cap_conn->hcon;

Fix it by adding a NULL check for the hcon->amp_mgr before checking the ev-status.

Fixes: d5e911928b ("Bluetooth: AMP: Process Physical Link Complete evt")
Reported-and-tested-by: syzbot+0bef568258653cff272f@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=0bef568258653cff272f
Signed-off-by: Anmol Karn <anmol.karan123@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:05 +01:00
Colin Ian King
615bc1ba5b nl80211/cfg80211: fix potential infinite loop
[ Upstream commit ba5c25236bc3d399df82ebe923490ea8d2d35cf2 ]

The for-loop iterates with a u8 loop counter and compares this
with the loop upper limit of request->n_ssids which is an int type.
There is a potential infinite loop if n_ssids is larger than the
u8 loop counter, so fix this by making the loop counter an int.

Addresses-Coverity: ("Infinite loop")
Fixes: c8cb5b854b ("nl80211/cfg80211: support 6 GHz scanning")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20201029222407.390218-1-colin.king@canonical.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-30 11:53:03 +01:00
Anant Thazhemadam
05725b40b9 nl80211: validate key indexes for cfg80211_registered_device
commit 2d9463083ce92636a1bdd3e30d1236e3e95d859e upstream.

syzbot discovered a bug in which an OOB access was being made because
an unsuitable key_idx value was wrongly considered to be acceptable
while deleting a key in nl80211_del_key().

Since we don't know the cipher at the time of deletion, if
cfg80211_validate_key_settings() were to be called directly in
nl80211_del_key(), even valid keys would be wrongly determined invalid,
and deletion wouldn't occur correctly.
For this reason, a new function - cfg80211_valid_key_idx(), has been
created, to determine if the key_idx value provided is valid or not.
cfg80211_valid_key_idx() is directly called in 2 places -
nl80211_del_key(), and cfg80211_validate_key_settings().

Reported-by: syzbot+49d4cab497c2142ee170@syzkaller.appspotmail.com
Tested-by: syzbot+49d4cab497c2142ee170@syzkaller.appspotmail.com
Suggested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Link: https://lore.kernel.org/r/20201204215825.129879-1-anant.thazhemadam@gmail.com
Cc: stable@vger.kernel.org
[also disallow IGTK key IDs if no IGTK cipher is supported]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-26 16:02:45 +01:00
Peilin Ye
b260e4a688 Bluetooth: Fix slab-out-of-bounds read in hci_le_direct_adv_report_evt()
commit f7e0e8b2f1b0a09b527885babda3e912ba820798 upstream.

`num_reports` is not being properly checked. A malformed event packet with
a large `num_reports` number makes hci_le_direct_adv_report_evt() read out
of bounds. Fix it.

Cc: stable@vger.kernel.org
Fixes: 2f010b5588 ("Bluetooth: Add support for handling LE Direct Advertising Report events")
Reported-and-tested-by: syzbot+24ebd650e20bd263ca01@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=24ebd650e20bd263ca01
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-26 16:02:44 +01:00
Thierry Reding
eadec7f537 net: ipconfig: Avoid spurious blank lines in boot log
commit c9f64d1fc101c64ea2be1b2e562b4395127befc9 upstream.

When dumping the name and NTP servers advertised by DHCP, a blank line
is emitted if either of the lists is empty. This can lead to confusing
issues such as the blank line getting flagged as warning. This happens
because the blank line is the result of pr_cont("\n") and that may see
its level corrupted by some other driver concurrently writing to the
console.

Fix this by making sure that the terminating newline is only emitted
if at least one entry in the lists was printed before.

Reported-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20201110073757.1284594-1-thierry.reding@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-26 16:02:38 +01:00
David S. Miller
d9838b1d39 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-12-10

The following pull-request contains BPF updates for your *net* tree.

We've added 21 non-merge commits during the last 12 day(s) which contain
a total of 21 files changed, 163 insertions(+), 88 deletions(-).

The main changes are:

1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.

2) Fix ring_buffer__poll() return value, from Andrii.

3) Fix race in lwt_bpf, from Cong.

4) Fix test_offload, from Toke.

5) Various xsk fixes.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

Also thanks to reporters, reviewers and testers of commits in this pull-request:

Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-10 14:29:30 -08:00
Guillaume Nault
7fdd375e38 net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower
TCA_FLOWER_KEY_MPLS_OPT_LSE_LABEL is a u32 attribute (MPLS label is
20 bits long).

Fixes the following bug:

 $ tc filter add dev ethX ingress protocol mpls_uc \
     flower mpls lse depth 2 label 256             \
     action drop

 $ tc filter show dev ethX ingress
   filter protocol mpls_uc pref 49152 flower chain 0
   filter protocol mpls_uc pref 49152 flower chain 0 handle 0x1
     eth_type 8847
     mpls
       lse depth 2 label 0  <-- invalid label 0, should be 256
   ...

Fixes: 61aec25a6d ("cls_flower: Support filtering on multiple MPLS Label Stack Entries")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 20:39:38 -08:00
David S. Miller
b7e4ba9a91 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Switch to RCU in x_tables to fix possible NULL pointer dereference,
   from Subash Abhinov Kasiviswanathan.

2) Fix netlink dump of dynset timeouts later than 23 days.

3) Add comment for the indirect serialization of the nft commit mutex
   with rtnl_mutex.

4) Remove bogus check for confirmed conntrack when matching on the
   conntrack ID, from Brett Mastbergen.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 18:55:46 -08:00
Neal Cardwell
299bcb55ec tcp: fix cwnd-limited bug for TSO deferral where we send nothing
When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
into persistent scenarios where we have the following sequence:

(1) ACK for full-sized skb of N*MSS arrives
  -> tcp_write_xmit() transmit full-sized skb with N*MSS
  -> move pacing release time forward
  -> exit tcp_write_xmit() because pacing time is in the future

(2) TSQ callback or TCP internal pacing timer fires
  -> try to transmit next skb, but TSO deferral finds remainder of
     available cwnd is not big enough to trigger an immediate send
     now, so we defer sending until the next ACK.

(3) repeat...

So we can get into a case where we never mark ourselves as
cwnd-limited for many seconds at a time, even with
bulk/infinite-backlog senders, because:

o In case (1) above, every time in tcp_write_xmit() we have enough
cwnd to send a full-sized skb, we are not fully using the cwnd
(because cwnd is not a multiple of the TSO skb size). So every time we
send data, we are not cwnd limited, and so in the cwnd-limited
tracking code in tcp_cwnd_validate() we mark ourselves as not
cwnd-limited.

o In case (2) above, every time in tcp_write_xmit() that we try to
transmit the "remainder" of the cwnd but defer, we set the local
variable is_cwnd_limited to true, but we do not send any packets, so
sent_pkts is zero, so we don't call the cwnd-limited logic to update
tp->is_cwnd_limited.

Fixes: ca8a226343 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Reported-by: Ingemar Johansson <ingemar.s.johansson@ericsson.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-09 16:15:54 -08:00
Chris Mi
5137d30365 net: flow_offload: Fix memory leak for indirect flow block
The offending commit introduces a cleanup callback that is invoked
when the driver module is removed to clean up the tunnel device
flow block. But it returns on the first iteration of the for loop.
The remaining indirect flow blocks will never be freed.

Fixes: 1fac52da59 ("net: flow_offload: consolidate indirect flow_block infrastructure")
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
2020-12-09 16:08:33 -08:00
Wei Wang
8ef44b6fe4 tcp: Retain ECT bits for tos reflection
For DCTCP, we have to retain the ECT bits set by the congestion control
algorithm on the socket when reflecting syn TOS in syn-ack, in order to
make ECN work properly.

Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
Reported-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 16:08:23 -08:00
Michal Kubecek
a770bf5156 ethtool: fix stack overflow in ethnl_parse_bitset()
Syzbot reported a stack overflow in bitmap_from_arr32() called from
ethnl_parse_bitset() when bitset from netlink message is longer than
target bitmap length. While ethnl_compact_sanity_checks() makes sure that
trailing part is all zeros (i.e. the request does not try to touch bits
kernel does not recognize), we also need to cap change_bits to nbits so
that we don't try to write past the prepared bitmaps.

Fixes: 88db6d1e4f ("ethtool: add ethnl_parse_bitset() helper")
Reported-by: syzbot+9d39fa49d4df294aab93@syzkaller.appspotmail.com
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/3487ee3a98e14cd526f55b6caaa959d2dcbcad9f.1607465316.git.mkubecek@suse.cz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-09 15:50:38 -08:00
Oliver Hartkopp
323a391a22 can: isotp: isotp_setsockopt(): block setsockopt on bound sockets
The isotp socket can be widely configured in its behaviour regarding addressing
types, fill-ups, receive pattern tests and link layer length. Usually all
these settings need to be fixed before bind() and can not be changed
afterwards.

This patch adds a check to enforce the common usage pattern.

Fixes: e057dd3fc2 ("can: add ISO 15765-2:2016 transport protocol")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Thomas Wagner <thwa1@web.de>
Link: https://lore.kernel.org/r/20201203140604.25488-2-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20201204133508.742120-3-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-09 08:44:15 -08:00
Toke Høiland-Jørgensen
998f172962 xdp: Remove the xdp_attachment_flags_ok() callback
Since commit 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF
programs in net_device"), the XDP program attachment info is now maintained
in the core code. This interacts badly with the xdp_attachment_flags_ok()
check that prevents unloading an XDP program with different load flags than
it was loaded with. In practice, two kinds of failures are seen:

- An XDP program loaded without specifying a mode (and which then ends up
  in driver mode) cannot be unloaded if the program mode is specified on
  unload.

- The dev_xdp_uninstall() hook always calls the driver callback with the
  mode set to the type of the program but an empty flags argument, which
  means the flags_ok() check prevents the program from being removed,
  leading to bpf prog reference leaks.

The original reason this check was added was to avoid ambiguity when
multiple programs were loaded. With the way the checks are done in the core
now, this is quite simple to enforce in the core code, so let's add a check
there and get rid of the xdp_attachment_flags_ok() callback entirely.

Fixes: 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk
2020-12-09 16:27:42 +01:00
Brett Mastbergen
2d94b20b95 netfilter: nft_ct: Remove confirmation check for NFT_CT_ID
Since commit 656c8e9cc1 ("netfilter: conntrack: Use consistent ct id
hash calculation") the ct id will not change from initialization to
confirmation.  Removing the confirmation check allows for things like
adding an element to a 'typeof ct id' set in prerouting upon reception
of the first packet of a new connection, and then being able to
reference that set consistently both before and after the connection
is confirmed.

Fixes: 656c8e9cc1 ("netfilter: conntrack: Use consistent ct id hash calculation")
Signed-off-by: Brett Mastbergen <brett.mastbergen@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-09 10:31:58 +01:00
Eric Dumazet
72d05c00d7 tcp: select sane initial rcvq_space.space for big MSS
Before commit a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.

This is no longer the case, and Hazem Mohamed Abuelfotoh reported
that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames.

Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd for upper limit
of rcvq_space.space computation, while it can select later a smaller
value for tp->rcv_ssthresh and tp->window_clamp.

ss -temoi on receiver would show :

skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596

This means that TCP can not increase its window in tcp_grow_window(),
and that DRS can never kick.

Fix this by making sure that rcvq_space.space is not bigger than number of bytes
that can be held in TCP receive queue.

People unable/unwilling to change their kernel can work around this issue by
selecting a bigger tcp_rmem[1] value as in :

echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem

Based on an initial report and patch from Hazem Mohamed Abuelfotoh
 https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/

Fixes: a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Fixes: 041a14d267 ("tcp: start receiver buffer autotuning sooner")
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 16:27:48 -08:00
Cengiz Can
0398ba9e5a net: tipc: prevent possible null deref of link
`tipc_node_apply_property` does a null check on a `tipc_link_entry`
pointer but also accesses the same pointer out of the null check block.

This triggers a warning on Coverity Static Analyzer because we're
implying that `e->link` can BE null.

Move "Update MTU for node link entry" line into if block to make sure
that we're not in a state that `e->link` is null.

Signed-off-by: Cengiz Can <cengiz@kernel.wtf>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 15:53:41 -08:00
Pablo Neira Ayuso
42f1c27120 netfilter: nftables: comment indirect serialization of commit_mutex with rtnl_mutex
Add an explicit comment in the code to describe the indirect
serialization of the holders of the commit_mutex with the rtnl_mutex.
Commit 90d2723c6d ("netfilter: nf_tables: do not hold reference on
netdevice from preparation phase") already describes this, but a comment
in this case is better for reference.

Reported-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-08 21:53:48 +01:00
Pablo Neira Ayuso
917d80d376 netfilter: nft_dynset: fix timeouts later than 23 days
Use nf_msecs_to_jiffies64 and nf_jiffies64_to_msecs as provided by
8e1102d5a1 ("netfilter: nf_tables: support timeouts larger than 23
days"), otherwise ruleset listing breaks.

Fixes: a8b1e36d0d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-08 20:42:11 +01:00
Subash Abhinov Kasiviswanathan
cc00bcaa58 netfilter: x_tables: Switch synchronization to RCU
When running concurrent iptables rules replacement with data, the per CPU
sequence count is checked after the assignment of the new information.
The sequence count is used to synchronize with the packet path without the
use of any explicit locking. If there are any packets in the packet path using
the table information, the sequence count is incremented to an odd value and
is incremented to an even after the packet process completion.

The new table value assignment is followed by a write memory barrier so every
CPU should see the latest value. If the packet path has started with the old
table information, the sequence counter will be odd and the iptables
replacement will wait till the sequence count is even prior to freeing the
old table info.

However, this assumes that the new table information assignment and the memory
barrier is actually executed prior to the counter check in the replacement
thread. If CPU decides to execute the assignment later as there is no user of
the table information prior to the sequence check, the packet path in another
CPU may use the old table information. The replacement thread would then free
the table information under it leading to a use after free in the packet
processing context-

Unable to handle kernel NULL pointer dereference at virtual
address 000000000000008e
pc : ip6t_do_table+0x5d0/0x89c
lr : ip6t_do_table+0x5b8/0x89c
ip6t_do_table+0x5d0/0x89c
ip6table_filter_hook+0x24/0x30
nf_hook_slow+0x84/0x120
ip6_input+0x74/0xe0
ip6_rcv_finish+0x7c/0x128
ipv6_rcv+0xac/0xe4
__netif_receive_skb+0x84/0x17c
process_backlog+0x15c/0x1b8
napi_poll+0x88/0x284
net_rx_action+0xbc/0x23c
__do_softirq+0x20c/0x48c

This could be fixed by forcing instruction order after the new table
information assignment or by switching to RCU for the synchronization.

Fixes: 80055dab5d ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
Reported-by: Sean Tranchetti <stranche@codeaurora.org>
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-08 12:57:39 +01:00
Jakub Kicinski
819f56bad1 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2020-12-07

1) Sysbot reported fixes for the new 64/32 bit compat layer.
   From Dmitry Safonov.

2) Fix a memory leak in xfrm_user_policy that was introduced
   by adding the 64/32 bit compat layer. From Yu Kuai.

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  net: xfrm: fix memory leak in xfrm_user_policy()
  xfrm/compat: Don't allocate memory with __GFP_ZERO
  xfrm/compat: memset(0) 64-bit padding at right place
  xfrm/compat: Translate by copying XFRMA_UNSPEC attribute
====================

Link: https://lore.kernel.org/r/20201207093937.2874932-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-07 18:29:54 -08:00
Jianguo Wu
f55628b3e7 mptcp: print new line in mptcp_seq_show() if mptcp isn't in use
When do cat /proc/net/netstat, the output isn't append with a new line, it looks like this:
[root@localhost ~]# cat /proc/net/netstat
...
MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0[root@localhost ~]#

This is because in mptcp_seq_show(), if mptcp isn't in use, net->mib.mptcp_statistics is NULL,
so it just puts all 0 after "MPTcpExt:", and return, forgot the '\n'.

After this patch:

[root@localhost ~]# cat /proc/net/netstat
...
MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0
[root@localhost ~]#

Fixes: fc518953bc ("mptcp: add and use MIB counter infrastructure")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/142e2fd9-58d9-bb13-fb75-951cccc2331e@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-07 17:45:29 -08:00
Joseph Huang
851d0a73c9 bridge: Fix a deadlock when enabling multicast snooping
When enabling multicast snooping, bridge module deadlocks on multicast_lock
if 1) IPv6 is enabled, and 2) there is an existing querier on the same L2
network.

The deadlock was caused by the following sequence: While holding the lock,
br_multicast_open calls br_multicast_join_snoopers, which eventually causes
IP stack to (attempt to) send out a Listener Report (in igmp6_join_group).
Since the destination Ethernet address is a multicast address, br_dev_xmit
feeds the packet back to the bridge via br_multicast_rcv, which in turn
calls br_multicast_add_group, which then deadlocks on multicast_lock.

The fix is to move the call br_multicast_join_snoopers outside of the
critical section. This works since br_multicast_join_snoopers only deals
with IP and does not modify any multicast data structures of the bridge,
so there's no need to hold the lock.

Steps to reproduce:
1. sysctl net.ipv6.conf.all.force_mld_version=1
2. have another querier
3. ip link set dev bridge type bridge mcast_snooping 0 && \
   ip link set dev bridge type bridge mcast_snooping 1 < deadlock >

A typical call trace looks like the following:

[  936.251495]  _raw_spin_lock+0x5c/0x68
[  936.255221]  br_multicast_add_group+0x40/0x170 [bridge]
[  936.260491]  br_multicast_rcv+0x7ac/0xe30 [bridge]
[  936.265322]  br_dev_xmit+0x140/0x368 [bridge]
[  936.269689]  dev_hard_start_xmit+0x94/0x158
[  936.273876]  __dev_queue_xmit+0x5ac/0x7f8
[  936.277890]  dev_queue_xmit+0x10/0x18
[  936.281563]  neigh_resolve_output+0xec/0x198
[  936.285845]  ip6_finish_output2+0x240/0x710
[  936.290039]  __ip6_finish_output+0x130/0x170
[  936.294318]  ip6_output+0x6c/0x1c8
[  936.297731]  NF_HOOK.constprop.0+0xd8/0xe8
[  936.301834]  igmp6_send+0x358/0x558
[  936.305326]  igmp6_join_group.part.0+0x30/0xf0
[  936.309774]  igmp6_group_added+0xfc/0x110
[  936.313787]  __ipv6_dev_mc_inc+0x1a4/0x290
[  936.317885]  ipv6_dev_mc_inc+0x10/0x18
[  936.321677]  br_multicast_open+0xbc/0x110 [bridge]
[  936.326506]  br_multicast_toggle+0xec/0x140 [bridge]

Fixes: 4effd28c12 ("bridge: join all-snoopers multicast address")
Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20201204235628.50653-1-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-07 17:14:43 -08:00
Cong Wang
e3366884b3 lwt_bpf: Replace preempt_disable() with migrate_disable()
migrate_disable() is just a wrapper for preempt_disable() in
non-RT kernel. It is safe to replace it, and RT kernel will
benefit.

Note that it is introduced since Feb 2020.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201205075946.497763-2-xiyou.wangcong@gmail.com
2020-12-07 11:53:40 -08:00
Dongdong Wang
d9054a1ff5 lwt: Disable BH too in run_lwt_bpf()
The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
and BPF redirect helpers. Callers on RX path are all in BH context,
disabling preemption is not sufficient to prevent BH interruption.

In production, we observed strange packet drops because of the race
condition between LWT xmit and TC ingress, and we verified this issue
is fixed after we disable BH.

Although this bug was technically introduced from the beginning, that
is commit 3a0af8fd61 ("bpf: BPF for lightweight tunnel infrastructure"),
at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
So this patch may not work well before RCU flavor consolidation has been
completed around v5.0.

Update the comments above the code too, as call_rcu() is now BH friendly.

Signed-off-by: Dongdong Wang <wangdongdong.6@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/bpf/20201205075946.497763-1-xiyou.wangcong@gmail.com
2020-12-07 11:53:39 -08:00
Xin Long
10c678bd0a udp: fix the proto value passed to ip_protocol_deliver_rcu for the segments
Guillaume noticed that: for segments udp_queue_rcv_one_skb() returns the
proto, and it should pass "ret" unmodified to ip_protocol_deliver_rcu().
Otherwize, with a negtive value passed, it will underflow inet_protos.

This can be reproduced with IPIP FOU:

  # ip fou add port 5555 ipproto 4
  # ethtool -K eth1 rx-gro-list on

Fixes: cf329aa42b ("udp: cope with UDP GRO packet misdirection")
Reported-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-07 00:32:11 -08:00
Eric Dumazet
905b2032fa mac80211: mesh: fix mesh_pathtbl_init() error path
If tbl_mpp can not be allocated, we call mesh_table_free(tbl_path)
while tbl_path rhashtable has not yet been initialized, which causes
panics.

Simply factorize the rhashtable_init() call into mesh_table_alloc()

WARNING: CPU: 1 PID: 8474 at kernel/workqueue.c:3040 __flush_work kernel/workqueue.c:3040 [inline]
WARNING: CPU: 1 PID: 8474 at kernel/workqueue.c:3040 __cancel_work_timer+0x514/0x540 kernel/workqueue.c:3136
Modules linked in:
CPU: 1 PID: 8474 Comm: syz-executor663 Not tainted 5.10.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__flush_work kernel/workqueue.c:3040 [inline]
RIP: 0010:__cancel_work_timer+0x514/0x540 kernel/workqueue.c:3136
Code: 5d c3 e8 bf ae 29 00 0f 0b e9 f0 fd ff ff e8 b3 ae 29 00 0f 0b 43 80 3c 3e 00 0f 85 31 ff ff ff e9 34 ff ff ff e8 9c ae 29 00 <0f> 0b e9 dc fe ff ff 89 e9 80 e1 07 80 c1 03 38 c1 0f 8c 7d fd ff
RSP: 0018:ffffc9000165f5a0 EFLAGS: 00010293
RAX: ffffffff814b7064 RBX: 0000000000000001 RCX: ffff888021c80000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff888024039ca0 R08: dffffc0000000000 R09: fffffbfff1dd3e64
R10: fffffbfff1dd3e64 R11: 0000000000000000 R12: 1ffff920002cbebd
R13: ffff888024039c88 R14: 1ffff11004807391 R15: dffffc0000000000
FS:  0000000001347880(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000002cc0a000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 rhashtable_free_and_destroy+0x25/0x9c0 lib/rhashtable.c:1137
 mesh_table_free net/mac80211/mesh_pathtbl.c:69 [inline]
 mesh_pathtbl_init+0x287/0x2e0 net/mac80211/mesh_pathtbl.c:785
 ieee80211_mesh_init_sdata+0x2ee/0x530 net/mac80211/mesh.c:1591
 ieee80211_setup_sdata+0x733/0xc40 net/mac80211/iface.c:1569
 ieee80211_if_add+0xd5c/0x1cd0 net/mac80211/iface.c:1987
 ieee80211_add_iface+0x59/0x130 net/mac80211/cfg.c:125
 rdev_add_virtual_intf net/wireless/rdev-ops.h:45 [inline]
 nl80211_new_interface+0x563/0xb40 net/wireless/nl80211.c:3855
 genl_family_rcv_msg_doit net/netlink/genetlink.c:739 [inline]
 genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
 genl_rcv_msg+0xe4e/0x1280 net/netlink/genetlink.c:800
 netlink_rcv_skb+0x190/0x3a0 net/netlink/af_netlink.c:2494
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x780/0x930 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x9a8/0xd40 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 ____sys_sendmsg+0x519/0x800 net/socket.c:2353
 ___sys_sendmsg net/socket.c:2407 [inline]
 __sys_sendmsg+0x2b1/0x360 net/socket.c:2440
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 60854fd945 ("mac80211: mesh: convert path table to rhashtable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Link: https://lore.kernel.org/r/20201204162428.2583119-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 17:34:25 -08:00
Wang Hai
bb2da7651a openvswitch: fix error return code in validate_and_copy_dec_ttl()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Changing 'return start' to 'return action_start' can fix this bug.

Fixes: 69929d4c49 ("net: openvswitch: fix TTL decrement action netlink message format")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/20201204114314.1596-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 15:43:14 -08:00
Zhang Changzhong
ee4f52a8de net: bridge: vlan: fix error return code in __vlan_add()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: f8ed289fab ("bridge: vlan: use br_vlan_(get|put)_master to deal with refcounts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/1607071737-33875-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 15:41:06 -08:00
Zhang Changzhong
b410f04eb5 ipv4: fix error return code in rtm_to_fib_config()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: d15662682d ("ipv4: Allow ipv6 gateway with ipv4 routes")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/1607071695-33740-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 15:38:16 -08:00
Davide Caratti
4eef8b1f36 net/sched: fq_pie: initialize timer earlier in fq_pie_init()
with the following tdc testcase:

 83be: (qdisc, fq_pie) Create FQ-PIE with invalid number of flows

as fq_pie_init() fails, fq_pie_destroy() is called to clean up. Since the
timer is not yet initialized, it's possible to observe a splat like this:

  INFO: trying to register non-static key.
  the code is fine but needs lockdep annotation.
  turning off the locking correctness validator.
  CPU: 0 PID: 975 Comm: tc Not tainted 5.10.0-rc4+ #298
  Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
  Call Trace:
   dump_stack+0x99/0xcb
   register_lock_class+0x12dd/0x1750
   __lock_acquire+0xfe/0x3970
   lock_acquire+0x1c8/0x7f0
   del_timer_sync+0x49/0xd0
   fq_pie_destroy+0x3f/0x80 [sch_fq_pie]
   qdisc_create+0x916/0x1160
   tc_modify_qdisc+0x3c4/0x1630
   rtnetlink_rcv_msg+0x346/0x8e0
   netlink_unicast+0x439/0x630
   netlink_sendmsg+0x719/0xbf0
   sock_sendmsg+0xe2/0x110
   ____sys_sendmsg+0x5ba/0x890
   ___sys_sendmsg+0xe9/0x160
   __sys_sendmsg+0xd3/0x170
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [...]
  ODEBUG: assert_init not available (active state 0) object type: timer_list hint: 0x0
  WARNING: CPU: 0 PID: 975 at lib/debugobjects.c:508 debug_print_object+0x162/0x210
  [...]
  Call Trace:
   debug_object_assert_init+0x268/0x380
   try_to_del_timer_sync+0x6a/0x100
   del_timer_sync+0x9e/0xd0
   fq_pie_destroy+0x3f/0x80 [sch_fq_pie]
   qdisc_create+0x916/0x1160
   tc_modify_qdisc+0x3c4/0x1630
   rtnetlink_rcv_msg+0x346/0x8e0
   netlink_rcv_skb+0x120/0x380
   netlink_unicast+0x439/0x630
   netlink_sendmsg+0x719/0xbf0
   sock_sendmsg+0xe2/0x110
   ____sys_sendmsg+0x5ba/0x890
   ___sys_sendmsg+0xe9/0x160
   __sys_sendmsg+0xd3/0x170
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

fix it moving timer_setup() before any failure, like it was done on 'red'
with former commit 608b4adab1 ("net_sched: initialize timer earlier in
red_init()").

Fixes: ec97ecf1eb ("net: sched: add Flow Queue PIE packet scheduler")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/2e78e01c504c633ebdff18d041833cf2e079a3a4.1607020450.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 14:15:01 -08:00
Zhang Changzhong
12c8a8ca11 xsk: Return error code if force_zc is set
If force_zc is set, we should exit out with an error, not fall back to
copy mode.

Fixes: 921b68692a ("xsk: Enable sharing of dma mappings")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/1607077277-41995-1-git-send-email-zhangchangzhong@huawei.com
2020-12-04 16:48:31 +01:00
Borwankar, Antara
bdeca45a0c mac80211: set SDATA_STATE_RUNNING for monitor interfaces
During restarrt, mac80211 is supposed to reconfigure the driver.
When there's a monitor interface, the interface is added and the
channel context for it was created, but not assigned to it as it
was not considered running during the restart.

Fix this by setting SDATA_STATE_RUNNING while adding monitor
interfaces.

Signed-off-by: Borwankar, Antara <antara.borwankar@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.e1df99693a4c.I494579f28018c2d0b9d4083a664cf872c28405ae@changeid
[reword commit log]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-04 12:45:25 +01:00
Sara Sharon
f495acd885 cfg80211: initialize rekey_data
In case we have old supplicant, the akm field is uninitialized.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.930f0ab7ebee.Ic546e384efab3f4a89f318eafddc3eb7d556aecb@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-04 12:35:58 +01:00
Wen Gong
8fca2b8706 mac80211: fix return value of ieee80211_chandef_he_6ghz_oper
ieee80211_chandef_he_6ghz_oper() needs to return true if it
determined a value 6 GHz chandef, fix that.

Fixes: 1d00ce807e ("mac80211: support S1G association")
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1606121152-3452-1-git-send-email-wgong@codeaurora.org
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-04 12:35:58 +01:00
Davide Caratti
9608fa6530 net/sched: act_mpls: ensure LSE is pullable before reading it
when 'act_mpls' is used to mangle the LSE, the current value is read from
the packet dereferencing 4 bytes at mpls_hdr(): ensure that the label is
contained in the skb "linear" area.

Found by code inspection.

v2:
 - use MPLS_HLEN instead of sizeof(new_lse), thanks to Jakub Kicinski

Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/3243506cba43d14858f3bd21ee0994160e44d64a.1606987058.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 11:13:37 -08:00
Davide Caratti
43c13605ba net: openvswitch: ensure LSE is pullable before reading it
when openvswitch is configured to mangle the LSE, the current value is
read from the packet dereferencing 4 bytes at mpls_hdr(): ensure that
the label is contained in the skb "linear" area.

Found by code inspection.

Fixes: d27cf5c59a ("net: core: add MPLS update core helper and use in OvS")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/aa099f245d93218b84b5c056b67b6058ccf81a66.1606987185.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 11:13:29 -08:00
Davide Caratti
13de4ed9e3 net: skbuff: ensure LSE is pullable before decrementing the MPLS ttl
skb_mpls_dec_ttl() reads the LSE without ensuring that it is contained in
the skb "linear" area. Fix this calling pskb_may_pull() before reading the
current ttl.

Found by code inspection.

Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/53659f28be8bc336c113b5254dc637cc76bbae91.1606987074.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 11:13:21 -08:00
Dan Carpenter
6ee50c8e26 net/x25: prevent a couple of overflows
The .x25_addr[] address comes from the user and is not necessarily
NUL terminated.  This leads to a couple problems.  The first problem is
that the strlen() in x25_bind() can read beyond the end of the buffer.

The second problem is more subtle and could result in memory corruption.
The call tree is:
  x25_connect()
  --> x25_write_internal()
      --> x25_addr_aton()

The .x25_addr[] buffers are copied to the "addresses" buffer from
x25_write_internal() so it will lead to stack corruption.

Verify that the strings are NUL terminated and return -EINVAL if they
are not.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Fixes: a9288525d2 ("X25: Dont let x25_bind use addresses containing characters")
Reported-by: "kiyin(尹亮)" <kiyin@tencent.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/X8ZeAKm8FnFpN//B@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-02 17:26:36 -08:00
Xuan Zhuo
3413f04141 xsk: Change the tx writeable condition
Modify the tx writeable condition from the queue is not full to the
number of present tx queues is less than the half of the total number
of queues. Because the tx queue not full is a very short time, this will
cause a large number of EPOLLOUT events, and cause a large number of
process wake up.

Fixes: 35fcde7f8d ("xsk: support for Tx")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/508fef55188d4e1160747ead64c6dcda36735880.1606555939.git.xuanzhuo@linux.alibaba.com
2020-12-03 01:14:38 +01:00
Xuan Zhuo
f5da54187e xsk: Replace datagram_poll by sock_poll_wait
datagram_poll will judge the current socket status (EPOLLIN, EPOLLOUT)
based on the traditional socket information (eg: sk_wmem_alloc), but
this does not apply to xsk. So this patch uses sock_poll_wait instead of
datagram_poll, and the mask is calculated by xsk_poll.

Fixes: c497176cb2 ("xsk: add Rx receive functions and poll support")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/e82f4697438cd63edbf271ebe1918db8261b7c09.1606555939.git.xuanzhuo@linux.alibaba.com
2020-12-03 01:14:14 +01:00
Antoine Tenart
832ba59649 net: ip6_gre: set dev->hard_header_len when using header_ops
syzkaller managed to crash the kernel using an NBMA ip6gre interface. I
could reproduce it creating an NBMA ip6gre interface and forwarding
traffic to it:

  skbuff: skb_under_panic: text:ffffffff8250e927 len:148 put:44 head:ffff8c03c7a33
  ------------[ cut here ]------------
  kernel BUG at net/core/skbuff.c:109!
  Call Trace:
  skb_push+0x10/0x10
  ip6gre_header+0x47/0x1b0
  neigh_connected_output+0xae/0xf0

ip6gre tunnel provides its own header_ops->create, and sets it
conditionally when initializing the tunnel in NBMA mode. When
header_ops->create is used, dev->hard_header_len should reflect the
length of the header created. Otherwise, when not used,
dev->needed_headroom should be used.

Fixes: eb95f52fc7 ("net: ipv6_gre: Fix GRO to work on IPv6 over GRE tap")
Cc: Maria Pasechnik <mariap@mellanox.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://lore.kernel.org/r/20201130161911.464106-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-02 11:16:12 -08:00
Hoang Le
0643334902 tipc: fix incompatible mtu of transmission
In commit 682cd3cf94
("tipc: confgiure and apply UDP bearer MTU on running links"), we
introduced a function to change UDP bearer MTU and applied this new value
across existing per-link. However, we did not apply this new MTU value at
node level. This lead to packet dropped at link level if its size is
greater than new MTU value.

To fix this issue, we also apply this new MTU value for node level.

Fixes: 682cd3cf94 ("tipc: confgiure and apply UDP bearer MTU on running links")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20201130025544.3602-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:26:57 -08:00
Björn Töpel
ed1182dc00 xdp: Handle MEM_TYPE_XSK_BUFF_POOL correctly in xdp_return_buff()
It turns out that it does exist a path where xdp_return_buff() is
being passed an XDP buffer of type MEM_TYPE_XSK_BUFF_POOL. This path
is when AF_XDP zero-copy mode is enabled, and a buffer is redirected
to a DEVMAP with an attached XDP program that drops the buffer.

This change simply puts the handling of MEM_TYPE_XSK_BUFF_POOL back
into xdp_return_buff().

Fixes: 82c41671ca ("xdp: Simplify xdp_return_{frame, frame_rx_napi, buff}")
Reported-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Link: https://lore.kernel.org/bpf/20201127171726.123627-1-bjorn.topel@gmail.com
2020-11-30 23:00:26 +01:00
Jakub Kicinski
bd2d5c54dc Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Fix insufficient validation of IPSET_ATTR_IPADDR_IPV6 reported
   by syzbot.

2) Remove spurious reports on nf_tables when lockdep gets disabled,
   from Florian Westphal.

3) Fix memleak in the error path of error path of
   ip_vs_control_net_init(), from Wang Hai.

4) Fix missing control data in flow dissector, otherwise IP address
   matching in hardware offload infra does not work.

5) Fix hardware offload match on prefix IP address when userspace
   does not send a bitwise expression to represent the prefix.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: nftables_offload: build mask based from the matching bytes
  netfilter: nftables_offload: set address type in control dissector
  ipvs: fix possible memory leak in ip_vs_control_net_init
  netfilter: nf_tables: avoid false-postive lockdep splat
  netfilter: ipset: prevent uninit-value in hash_ip6_add
====================

Link: https://lore.kernel.org/r/20201127190313.24947-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 13:20:23 -08:00
Guillaume Nault
1ebf179037 ipv4: Fix tos mask in inet_rtm_getroute()
When inet_rtm_getroute() was converted to use the RCU variants of
ip_route_input() and ip_route_output_key(), the TOS parameters
stopped being masked with IPTOS_RT_MASK before doing the route lookup.

As a result, "ip route get" can return a different route than what
would be used when sending real packets.

For example:

    $ ip route add 192.0.2.11/32 dev eth0
    $ ip route add unreachable 192.0.2.11/32 tos 2
    $ ip route get 192.0.2.11 tos 2
    RTNETLINK answers: No route to host

But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
actually be routed using the first route:

    $ ping -c 1 -Q 2 192.0.2.11
    PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
    64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms

    --- 192.0.2.11 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms

This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
return results consistent with real route lookups.

Fixes: 3765d35ed8 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 13:14:23 -08:00
Jakub Kicinski
3771b82242 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-11-28

1) Do not reference the skb for xsk's generic TX side since when looped
   back into RX it might crash in generic XDP, from Björn Töpel.

2) Fix umem cleanup on a partially set up xsk socket when being destroyed,
   from Magnus Karlsson.

3) Fix an incorrect netdev reference count when failing xsk_bind() operation,
   from Marek Majtyka.

4) Fix bpftool to set an error code on failed calloc() in build_btf_type_table(),
   from Zhen Lei.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Add MAINTAINERS entry for BPF LSM
  bpftool: Fix error return value in build_btf_type_table
  net, xsk: Avoid taking multiple skbuff references
  xsk: Fix incorrect netdev reference count
  xsk: Fix umem cleanup bug at socket destruct
  MAINTAINERS: Update XDP and AF_XDP entries
====================

Link: https://lore.kernel.org/r/20201128005104.1205-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 12:04:57 -08:00
Jakub Kicinski
28d35ad083 Here are some batman-adv bugfixes:
- Fix head/tailroom issues for fragments, by Sven Eckelmann (3 patches)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl/BORsWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoX+AD/wJYmiGL8beD3eCBQdTjmvJNRbZ
 MIlzse2LsiSk/4G/UbrEr11M4pc6gFDghH/VXs2MxD2hd6qJsJH7ulAQKMPzTEqD
 WhwlyJ08O9HIivaN2Ri1ibwmLIDAYlvR2bsWu95weI/hDdE5ups/kIvBudM8b+fC
 HKQ/e9wGaiB+zUUobvl6bP0Av6u3xPNTHFkXgGTo1shgLISAYFsmEoYQ3yZYrYKy
 P9ckwLoJwMD6AODMRvWVmW4dzi5QSBWN/84XgQSoqdlZeNklIcTY2P5XWdePo7cv
 wTnSxm4gJwdU7i8dBXLbtOYzmavzq0JkQHnRcFcKm6RWMbaeNNwfYUwMLp4h4ZtG
 Mjh0qt4IyuRb5Tf7jkoZ9UR4fIftpuMNMMvE37Cfr7wUZKWtVbG9jHyydU2MZ678
 vAF5NigYHa88i+MSdbXq8fPflmhddiHsN05nzI3olzX7tBFWNDf+71P82Gm4zKuM
 YC4ZlZH7Rrb79C13CJm5xKN0MI5TYGDYJeFLV4Km7hZ9H9EyDJcsuSFPgV0R0hg5
 Y5rWIWRrK/lU1aSXPSEuyL1P+tbH5z5xeYxdPUz2WroqNTyre6ycXy+wuoyCw5Ij
 t91wqD8pNC1/3nynBbI2kcLeGFPjDTEfWYcYN7OE3Ts6AXr3fs8x1XUcUsseAZ1P
 oxaUxzzfCcDno3abtg==
 =mORx
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-pullrequest-20201127' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - Fix head/tailroom issues for fragments, by Sven Eckelmann (3 patches)

* tag 'batadv-net-pullrequest-20201127' of git://git.open-mesh.org/linux-merge:
  batman-adv: Don't always reallocate the fragmentation skb head
  batman-adv: Reserve needed_*room for fragments
  batman-adv: Consider fragmentation for needed_headroom
====================

Link: https://lore.kernel.org/r/20201127173849.19208-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 11:56:06 -08:00
Antoine Tenart
44f64f23ba netfilter: bridge: reset skb->pkt_type after NF_INET_POST_ROUTING traversal
Netfilter changes PACKET_OTHERHOST to PACKET_HOST before invoking the
hooks as, while it's an expected value for a bridge, routing expects
PACKET_HOST. The change is undone later on after hook traversal. This
can be seen with pairs of functions updating skb>pkt_type and then
reverting it to its original value:

For hook NF_INET_PRE_ROUTING:
  setup_pre_routing / br_nf_pre_routing_finish

For hook NF_INET_FORWARD:
  br_nf_forward_ip / br_nf_forward_finish

But the third case where netfilter does this, for hook
NF_INET_POST_ROUTING, the packet type is changed in br_nf_post_routing
but never reverted. A comment says:

  /* We assume any code from br_dev_queue_push_xmit onwards doesn't care
   * about the value of skb->pkt_type. */

But when having a tunnel (say vxlan) attached to a bridge we have the
following call trace:

  br_nf_pre_routing
  br_nf_pre_routing_ipv6
     br_nf_pre_routing_finish
  br_nf_forward_ip
     br_nf_forward_finish
  br_nf_post_routing           <- pkt_type is updated to PACKET_HOST
     br_nf_dev_queue_xmit      <- but not reverted to its original value
  vxlan_xmit
     vxlan_xmit_one
        skb_tunnel_check_pmtu  <- a check on pkt_type is performed

In this specific case, this creates issues such as when an ICMPv6 PTB
should be sent back. When CONFIG_BRIDGE_NETFILTER is enabled, the PTB
isn't sent (as skb_tunnel_check_pmtu checks if pkt_type is PACKET_HOST
and returns early).

If the comment is right and no one cares about the value of
skb->pkt_type after br_dev_queue_push_xmit (which isn't true), resetting
it to its original value should be safe.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20201123174902.622102-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 11:46:51 -08:00
Jakub Kicinski
d0742c49ca linux-can-fixes-for-5.10-20201127
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl/Ay8sTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqZvTB/9TfdD9pr5E4zIfStIlir7bvamDRK3y
 QVQd4+gaB6gLKNAQo0Kh+osbMZAm43ixzEtoLEHDOd+XaaNDyJdUpNUXyf+/ZZgs
 twMoCuY1vnO/1LPtVhLuggiyUWhc8YFPs7ZuFYmAWS+kYDMG4qxF3rMHcTjKw+bD
 TcD383ZldO8I2cTmdhT8ldx3tk0UUSCvG284Db43asUkTOhPs8SrkFLVlAZZfcNc
 d3gxzQYuMyR/s/ektyFP39ed9PO/OOQcg401XbT7aXvnCgaecTJHLIDfSvz6f69M
 HHybn8CWs68+Ne4k6kC4c5tGJzz3QGt+q1dO1J1IENG99DsDpoEnfNiS
 =xlpC
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-5.10-20201127' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2020-11-27

The first patch is by me and target the gs_usb driver and fixes the endianess
problem with candleLight firmware.

Another patch by me for the mcp251xfd driver add sanity checking to bail out if
no IRQ is configured.

The next three patches target the m_can driver. A patch by me removes the
hardcoded IRQF_TRIGGER_FALLING from the request_threaded_irq() as this clashes
with the trigger level specified in the DT. Further a patch by me fixes the
nominal bitiming tseg2 min value for modern m_can cores. Pankaj Sharma's patch
add support for cores version 3.3.x.

The last patch by Oliver Hartkopp is for af_can and converts a WARN() into a
pr_warn(), which is triggered by the syzkaller. It was able to create a
situation where the closing of a socket runs simultaneously to the notifier
call chain for removing the CAN network device in use.

* tag 'linux-can-fixes-for-5.10-20201127' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: af_can: can_rx_unregister(): remove WARN() statement from list operation sanity check
  can: m_can: m_can_dev_setup(): add support for bosch mcan version 3.3.0
  can: m_can: fix nominal bitiming tseg2 min for version >= 3.1
  can: m_can: m_can_open(): remove IRQF_TRIGGER_FALLING from request_threaded_irq()'s flags
  can: mcp251xfd: mcp251xfd_probe(): bail out if no IRQ was given
  can: gs_usb: fix endianess problem with candleLight firmware
====================

Link: https://lore.kernel.org/r/20201127100301.512603-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 11:13:39 -08:00
Willem de Bruijn
985f733742 sock: set sk_err to ee_errno on dequeue from errq
When setting sk_err, set it to ee_errno, not ee_origin.

Commit f5f99309fa ("sock: do not set sk_err in
sock_dequeue_err_skb") disabled updating sk_err on errq dequeue,
which is correct for most error types (origins):

  -       sk->sk_err = err;

Commit 38b257938a ("sock: reset sk_err when the error queue is
empty") reenabled the behavior for IMCP origins, which do require it:

  +       if (icmp_next)
  +               sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin;

But read from ee_errno.

Fixes: 38b257938a ("sock: reset sk_err when the error queue is empty")
Reported-by: Ayush Ranjan <ayushranjan@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20201126151220.2819322-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 11:07:06 -08:00
Paolo Abeni
d3ab78858f mptcp: fix NULL ptr dereference on bad MPJ
If an msk listener receives an MPJ carrying an invalid token, it
will zero the request socket msk entry. That should later
cause fallback and subflow reset - as per RFC - at
subflow_syn_recv_sock() time due to failing hmac validation.

Since commit 4cf8b7e48a ("subflow: introduce and use
mptcp_can_accept_new_subflow()"), we unconditionally dereference
- in mptcp_can_accept_new_subflow - the subflow request msk
before performing hmac validation. In the above scenario we
hit a NULL ptr dereference.

Address the issue doing the hmac validation earlier.

Fixes: 4cf8b7e48a ("subflow: introduce and use mptcp_can_accept_new_subflow()")
Tested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/03b2cfa3ac80d8fc18272edc6442a9ddf0b1e34e.1606400227.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 11:05:31 -08:00
Eelco Chaudron
69929d4c49 net: openvswitch: fix TTL decrement action netlink message format
Currently, the openvswitch module is not accepting the correctly formated
netlink message for the TTL decrement action. For both setting and getting
the dec_ttl action, the actions should be nested in the
OVS_DEC_TTL_ATTR_ACTION attribute as mentioned in the openvswitch.h uapi.

When the original patch was sent, it was tested with a private OVS userspace
implementation. This implementation was unfortunately not upstreamed and
reviewed, hence an erroneous version of this patch was sent out.

Leaving the patch as-is would cause problems as the kernel module could
interpret additional attributes as actions and vice-versa, due to the
actions not being encapsulated/nested within the actual attribute, but
being concatinated after it.

Fixes: 744676e777 ("openvswitch: add TTL decrement action")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/160622121495.27296.888010441924340582.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 11:03:06 -08:00
Pablo Neira Ayuso
a5d45bc0dc netfilter: nftables_offload: build mask based from the matching bytes
Userspace might match on prefix bytes of header fields if they are on
the byte boundary, this requires that the mask is adjusted accordingly.
Use NFT_OFFLOAD_MATCH_EXACT() for meta since prefix byte matching is not
allowed for this type of selector.

The bitwise expression might be optimized out by userspace, hence the
kernel needs to infer the prefix from the number of payload bytes to
match on. This patch adds nft_payload_offload_mask() to calculate the
bitmask to match on the prefix.

Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-27 12:10:47 +01:00
Pablo Neira Ayuso
3c78e9e0d3 netfilter: nftables_offload: set address type in control dissector
This patch adds nft_flow_rule_set_addr_type() to set the address type
from the nft_payload expression accordingly.

If the address type is not set in the control dissector then a rule that
matches either on source or destination IP address does not work.

After this patch, nft hardware offload generates the flow dissector
configuration as tc-flower does to match on an IP address.

This patch has been also tested functionally to make sure packets are
filtered out by the NIC.

This is also getting the code aligned with the existing netfilter flow
offload infrastructure which is also setting the control dissector.

Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-27 12:10:46 +01:00
Wang Hai
4bc3c8dc9f ipvs: fix possible memory leak in ip_vs_control_net_init
kmemleak report a memory leak as follows:

BUG: memory leak
unreferenced object 0xffff8880759ea000 (size 256):
backtrace:
[<00000000c0bf2deb>] kmem_cache_zalloc include/linux/slab.h:656 [inline]
[<00000000c0bf2deb>] __proc_create+0x23d/0x7d0 fs/proc/generic.c:421
[<000000009d718d02>] proc_create_reg+0x8e/0x140 fs/proc/generic.c:535
[<0000000097bbfc4f>] proc_create_net_data+0x8c/0x1b0 fs/proc/proc_net.c:126
[<00000000652480fc>] ip_vs_control_net_init+0x308/0x13a0 net/netfilter/ipvs/ip_vs_ctl.c:4169
[<000000004c927ebe>] __ip_vs_init+0x211/0x400 net/netfilter/ipvs/ip_vs_core.c:2429
[<00000000aa6b72d9>] ops_init+0xa8/0x3c0 net/core/net_namespace.c:151
[<00000000153fd114>] setup_net+0x2de/0x7e0 net/core/net_namespace.c:341
[<00000000be4e4f07>] copy_net_ns+0x27d/0x530 net/core/net_namespace.c:482
[<00000000f1c23ec9>] create_new_namespaces+0x382/0xa30 kernel/nsproxy.c:110
[<00000000098a5757>] copy_namespaces+0x2e6/0x3b0 kernel/nsproxy.c:179
[<0000000026ce39e9>] copy_process+0x220a/0x5f00 kernel/fork.c:2072
[<00000000b71f4efe>] _do_fork+0xc7/0xda0 kernel/fork.c:2428
[<000000002974ee96>] __do_sys_clone3+0x18a/0x280 kernel/fork.c:2703
[<0000000062ac0a4d>] do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
[<0000000093f1ce2c>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

In the error path of ip_vs_control_net_init(), remove_proc_entry() needs
to be called to remove the added proc entry, otherwise a memory leak
will occur.

Also, add some '#ifdef CONFIG_PROC_FS' because proc_create_net* return NULL
when PROC is not used.

Fixes: b17fc9963f ("IPVS: netns, ip_vs_stats and its procfs")
Fixes: 61b1ab4583 ("IPVS: netns, add basic init per netns.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-27 12:10:46 +01:00
Oliver Hartkopp
d73ff9b7c4 can: af_can: can_rx_unregister(): remove WARN() statement from list operation sanity check
To detect potential bugs in CAN protocol implementations (double removal of
receiver entries) a WARN() statement has been used if no matching list item was
found for removal.

The fault injection issued by syzkaller was able to create a situation where
the closing of a socket runs simultaneously to the notifier call chain for
removing the CAN network device in use.

This case is very unlikely in real life but it doesn't break anything.
Therefore we just replace the WARN() statement with pr_warn() to preserve the
notification for the CAN protocol development.

Reported-by: syzbot+381d06e0c8eaacb8706f@syzkaller.appspotmail.com
Reported-by: syzbot+d0ddd88c9a7432f041e6@syzkaller.appspotmail.com
Reported-by: syzbot+76d62d3b8162883c7d11@syzkaller.appspotmail.com
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201126192140.14350-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-27 10:49:28 +01:00
Sven Eckelmann
992b03b88e batman-adv: Don't always reallocate the fragmentation skb head
When a packet is fragmented by batman-adv, the original batman-adv header
is not modified. Only a new fragmentation is inserted between the original
one and the ethernet header. The code must therefore make sure that it has
a writable region of this size in the skbuff head.

But it is not useful to always reallocate the skbuff by this size even when
there would be more than enough headroom still in the skb. The reallocation
is just to costly during in this codepath.

Fixes: ee75ed8887 ("batman-adv: Fragment and send skbs larger than mtu")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-11-27 08:02:55 +01:00
Sven Eckelmann
c5cbfc8755 batman-adv: Reserve needed_*room for fragments
The batadv net_device is trying to propagate the needed_headroom and
needed_tailroom from the lower devices. This is needed to avoid cost
intensive reallocations using pskb_expand_head during the transmission.

But the fragmentation code split the skb's without adding extra room at the
end/beginning of the various fragments. This reduced the performance of
transmissions over complex scenarios (batadv on vxlan on wireguard) because
the lower devices had to perform the reallocations at least once.

Fixes: ee75ed8887 ("batman-adv: Fragment and send skbs larger than mtu")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-11-27 08:02:55 +01:00
Sven Eckelmann
4ca23e2c20 batman-adv: Consider fragmentation for needed_headroom
If a batman-adv packets has to be fragmented, then the original batman-adv
packet header is not stripped away. Instead, only a new header is added in
front of the packet after it was split.

This size must be considered to avoid cost intensive reallocations during
the transmission through the various device layers.

Fixes: 7bca68c784 ("batman-adv: Add lower layer needed_(head|tail)room to own ones")
Reported-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-11-27 08:02:55 +01:00
Maxim Mikityanskiy
025cc2fb6a net/tls: Protect from calling tls_dev_del for TLS RX twice
tls_device_offload_cleanup_rx doesn't clear tls_ctx->netdev after
calling tls_dev_del if TLX TX offload is also enabled. Clearing
tls_ctx->netdev gets postponed until tls_device_gc_task. It leaves a
time frame when tls_device_down may get called and call tls_dev_del for
RX one extra time, confusing the driver, which may lead to a crash.

This patch corrects this racy behavior by adding a flag to prevent
tls_device_down from calling tls_dev_del the second time.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20201125221810.69870-1-saeedm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 17:31:06 -08:00
Parav Pandit
a7b4364950 devlink: Make sure devlink instance and port are in same net namespace
When devlink reload operation is not used, netdev of an Ethernet port may
be present in different net namespace than the net namespace of the
devlink instance.

Ensure that both the devlink instance and devlink port netdev are located
in same net namespace.

Fixes: 070c63f20f ("net: devlink: allow to change namespaces during reload")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 17:26:34 -08:00
Parav Pandit
b187c9b417 devlink: Hold rtnl lock while reading netdev attributes
A netdevice of a devlink port can be moved to different net namespace
than its parent devlink instance.
This scenario occurs when devlink reload is not used.

When netdevice is undergoing migration to net namespace, its ifindex
and name may change.

In such use case, devlink port query may read stale netdev attributes.

Fix it by reading them under rtnl lock.

Fixes: bfcd3a4661 ("Introduce devlink infrastructure")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 17:26:34 -08:00
Florian Westphal
c0700dfa2c netfilter: nf_tables: avoid false-postive lockdep splat
There are reports wrt lockdep splat in nftables, e.g.:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 31416 at net/netfilter/nf_tables_api.c:622
lockdep_nfnl_nft_mutex_not_held+0x28/0x38 [nf_tables]
...

These are caused by an earlier, unrelated bug such as a n ABBA deadlock
in a different subsystem.
In such an event, lockdep is disabled and lockdep_is_held returns true
unconditionally.  This then causes the WARN() in nf_tables.

Make the WARN conditional on lockdep still active to avoid this.

Fixes: f102d66b33 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: https://lore.kernel.org/linux-kselftest/CA+G9fYvFUpODs+NkSYcnwKnXm62tmP=ksLeBPmB+KFrB2rvCtQ@mail.gmail.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-26 00:09:42 +01:00
Eric Dumazet
68ad89de91 netfilter: ipset: prevent uninit-value in hash_ip6_add
syzbot found that we are not validating user input properly
before copying 16 bytes [1].

Using NLA_BINARY in ipaddr_policy[] for IPv6 address is not correct,
since it ensures at most 16 bytes were provided.

We should instead make sure user provided exactly 16 bytes.

In old kernels (before v4.20), fix would be to remove the NLA_BINARY,
since NLA_POLICY_EXACT_LEN() was not yet available.

[1]
BUG: KMSAN: uninit-value in hash_ip6_add+0x1cba/0x3a50 net/netfilter/ipset/ip_set_hash_gen.h:892
CPU: 1 PID: 11611 Comm: syz-executor.0 Not tainted 5.10.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x21c/0x280 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
 __msan_warning+0x5f/0xa0 mm/kmsan/kmsan_instr.c:197
 hash_ip6_add+0x1cba/0x3a50 net/netfilter/ipset/ip_set_hash_gen.h:892
 hash_ip6_uadt+0x976/0xbd0 net/netfilter/ipset/ip_set_hash_ip.c:267
 call_ad+0x329/0xd00 net/netfilter/ipset/ip_set_core.c:1720
 ip_set_ad+0x111f/0x1440 net/netfilter/ipset/ip_set_core.c:1808
 ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1833
 nfnetlink_rcv_msg+0xc7d/0xdf0 net/netfilter/nfnetlink.c:252
 netlink_rcv_skb+0x70a/0x820 net/netlink/af_netlink.c:2494
 nfnetlink_rcv+0x4f0/0x4380 net/netfilter/nfnetlink.c:600
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x11da/0x14b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x173c/0x1840 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 ____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
 ___sys_sendmsg net/socket.c:2407 [inline]
 __sys_sendmsg+0x6d5/0x830 net/socket.c:2440
 __do_sys_sendmsg net/socket.c:2449 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fe2e503fc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000029ec0 RCX: 000000000045deb9
RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000003
RBP: 000000000118bf60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c
R13: 000000000169fb7f R14: 00007fe2e50409c0 R15: 000000000118bf2c

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
 kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:289
 __msan_chain_origin+0x57/0xa0 mm/kmsan/kmsan_instr.c:147
 ip6_netmask include/linux/netfilter/ipset/pfxlen.h:49 [inline]
 hash_ip6_netmask net/netfilter/ipset/ip_set_hash_ip.c:185 [inline]
 hash_ip6_uadt+0xb1c/0xbd0 net/netfilter/ipset/ip_set_hash_ip.c:263
 call_ad+0x329/0xd00 net/netfilter/ipset/ip_set_core.c:1720
 ip_set_ad+0x111f/0x1440 net/netfilter/ipset/ip_set_core.c:1808
 ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1833
 nfnetlink_rcv_msg+0xc7d/0xdf0 net/netfilter/nfnetlink.c:252
 netlink_rcv_skb+0x70a/0x820 net/netlink/af_netlink.c:2494
 nfnetlink_rcv+0x4f0/0x4380 net/netfilter/nfnetlink.c:600
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x11da/0x14b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x173c/0x1840 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 ____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
 ___sys_sendmsg net/socket.c:2407 [inline]
 __sys_sendmsg+0x6d5/0x830 net/socket.c:2440
 __do_sys_sendmsg net/socket.c:2449 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
 kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:289
 kmsan_memcpy_memmove_metadata+0x25e/0x2d0 mm/kmsan/kmsan.c:226
 kmsan_memcpy_metadata+0xb/0x10 mm/kmsan/kmsan.c:246
 __msan_memcpy+0x46/0x60 mm/kmsan/kmsan_instr.c:110
 ip_set_get_ipaddr6+0x2cb/0x370 net/netfilter/ipset/ip_set_core.c:310
 hash_ip6_uadt+0x439/0xbd0 net/netfilter/ipset/ip_set_hash_ip.c:255
 call_ad+0x329/0xd00 net/netfilter/ipset/ip_set_core.c:1720
 ip_set_ad+0x111f/0x1440 net/netfilter/ipset/ip_set_core.c:1808
 ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1833
 nfnetlink_rcv_msg+0xc7d/0xdf0 net/netfilter/nfnetlink.c:252
 netlink_rcv_skb+0x70a/0x820 net/netlink/af_netlink.c:2494
 nfnetlink_rcv+0x4f0/0x4380 net/netfilter/nfnetlink.c:600
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x11da/0x14b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x173c/0x1840 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 ____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
 ___sys_sendmsg net/socket.c:2407 [inline]
 __sys_sendmsg+0x6d5/0x830 net/socket.c:2440
 __do_sys_sendmsg net/socket.c:2449 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
 kmsan_internal_poison_shadow+0x5c/0xf0 mm/kmsan/kmsan.c:104
 kmsan_slab_alloc+0x8d/0xe0 mm/kmsan/kmsan_hooks.c:76
 slab_alloc_node mm/slub.c:2906 [inline]
 __kmalloc_node_track_caller+0xc61/0x15f0 mm/slub.c:4512
 __kmalloc_reserve net/core/skbuff.c:142 [inline]
 __alloc_skb+0x309/0xae0 net/core/skbuff.c:210
 alloc_skb include/linux/skbuff.h:1094 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1176 [inline]
 netlink_sendmsg+0xdb8/0x1840 net/netlink/af_netlink.c:1894
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 ____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
 ___sys_sendmsg net/socket.c:2407 [inline]
 __sys_sendmsg+0x6d5/0x830 net/socket.c:2440
 __do_sys_sendmsg net/socket.c:2449 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: a7b4f989a6 ("netfilter: ipset: IP set core support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-26 00:09:42 +01:00
Eric Dumazet
2543a6000e gro_cells: reduce number of synchronize_net() calls
After cited commit, gro_cells_destroy() became damn slow
on hosts with a lot of cores.

This is because we have one additional synchronize_net() per cpu as
stated in the changelog.

gro_cells_init() is setting NAPI_STATE_NO_BUSY_POLL, and this was enough
to not have one synchronize_net() call per netif_napi_del()

We can factorize all the synchronize_net() to a single one,
right before freeing per-cpu memory.

Fixes: 5198d545db ("net: remove napi_hash_del() from driver-facing API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201124203822.1360107-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 11:28:12 -08:00
Wang Hai
e255e11e66 ipv6: addrlabel: fix possible memory leak in ip6addrlbl_net_init
kmemleak report a memory leak as follows:

unreferenced object 0xffff8880059c6a00 (size 64):
  comm "ip", pid 23696, jiffies 4296590183 (age 1755.384s)
  hex dump (first 32 bytes):
    20 01 00 10 00 00 00 00 00 00 00 00 00 00 00 00   ...............
    1c 00 00 00 00 00 00 00 00 00 00 00 07 00 00 00  ................
  backtrace:
    [<00000000aa4e7a87>] ip6addrlbl_add+0x90/0xbb0
    [<0000000070b8d7f1>] ip6addrlbl_net_init+0x109/0x170
    [<000000006a9ca9d4>] ops_init+0xa8/0x3c0
    [<000000002da57bf2>] setup_net+0x2de/0x7e0
    [<000000004e52d573>] copy_net_ns+0x27d/0x530
    [<00000000b07ae2b4>] create_new_namespaces+0x382/0xa30
    [<000000003b76d36f>] unshare_nsproxy_namespaces+0xa1/0x1d0
    [<0000000030653721>] ksys_unshare+0x3a4/0x780
    [<0000000007e82e40>] __x64_sys_unshare+0x2d/0x40
    [<0000000031a10c08>] do_syscall_64+0x33/0x40
    [<0000000099df30e7>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

We should free all rules when we catch an error in ip6addrlbl_net_init().
otherwise a memory leak will occur.

Fixes: 2a8cc6c890 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Link: https://lore.kernel.org/r/20201124071728.8385-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 11:20:16 -08:00
Jakub Kicinski
26c8996526 Here is a batman-adv bugfix:
- set module owner to THIS_MODULE, by Taehee Yoo
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl+9DbQWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoWlxD/0TOgtU9fVykvjMG3U0rnB6pvp7
 rFH1Srqg5hsJ2PNLAF8fr0o4PDTyiAztirdJ4C4+ietug0ozxe4GZARB+/rPEk/3
 u558cDMZaqLOo195g70s+8cr0XXPy8Bt4ip/X4uW8DNTluKMiQcKNDrVkc6Cysa/
 5cnT4wf6RftG9eMbtG1wdEzqG/goQpmrP+bo+J1QdBQEycqMwegTWYZyU9behS+m
 u47FIa20XQ8FmoZx3oooGNb7z9iDHzwVc7PMHw19Xv14bahtz1BwydERdJMde3hP
 UD4HeAQla6Gl0ZO2pS5MuCQxEu8aphel75a3g+D4XCdyyIl1B3FzK1F+DWdoHlnx
 g0g7xv0nnfpF2wVhGkgIKfssQXino9YPyIjl/zLJ7YXXrTZCzkOjGiumSvrvXgox
 mTtvDvbSIOylQksPF5UJtoOo6mWihg7HERyrOwfIV6UxKuqbKhv3e17gUlkZPIcp
 ec4iFjGheE0H00zSzBkeb+76M0XQmDTbvmjUfOxaxP4o+u5W27QE+HznOTrQ9uBa
 FrMt++XjxSfmVsiFrpqE4LQyYfcFjO3Ei5yYfR3tYfDx4qegDvqksLbzb3Fkp2G9
 Pz9HEc+PYJk/w3N2Q+zSpbDJfUvHjyIPDeqsTdtbbUPUyfkGmghT7F9V+vof2jqm
 6pijD3ji8JN1wmh9yw==
 =Qr0B
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-pullrequest-20201124' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here is a batman-adv bugfix:

 - set module owner to THIS_MODULE, by Taehee Yoo

* tag 'batadv-net-pullrequest-20201124' of git://git.open-mesh.org/linux-merge:
  batman-adv: set .owner to THIS_MODULE
====================

Link: https://lore.kernel.org/r/20201124134417.17269-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 16:49:14 -08:00
Alexander Duyck
407c85c7dd tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN
When a BPF program is used to select between a type of TCP congestion
control algorithm that uses either ECN or not there is a case where the
synack for the frame was coming up without the ECT0 bit set. A bit of
research found that this was due to the final socket being configured to
dctcp while the listener socket was staying in cubic.

To reproduce it all that is needed is to monitor TCP traffic while running
the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
assuming tcp_dctcp module is loaded or compiled in and the traffic matches
the rules in the sample file, is that for all frames with the exception of
the synack the ECT0 bit is set.

To address that it is necessary to make one additional call to
tcp_bpf_ca_needs_ecn using the request socket and then use the output of
that to set the ECT0 bit for the tos/tclass of the packet.

Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 14:12:55 -08:00
Björn Töpel
36ccdf8582 net, xsk: Avoid taking multiple skbuff references
Commit 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
addressed the problem that packets were discarded from the Tx AF_XDP
ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was
bumping the skbuff reference count, so that the buffer would not be
freed by dev_direct_xmit(). A reference count larger than one means
that the skbuff is "shared", which is not the case.

If the "shared" skbuff is sent to the generic XDP receive path,
netif_receive_generic_xdp(), and pskb_expand_head() is entered the
BUG_ON(skb_shared(skb)) will trigger.

This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(),
where a user can select the skbuff free policy. This allows AF_XDP to
avoid bumping the reference count, but still keep the NETDEV_TX_BUSY
behavior.

Fixes: 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com
2020-11-24 22:39:56 +01:00
Moshe Shemesh
5204bb683c devlink: Fix reload stats structure
Fix reload stats structure exposed to the user. Change stats structure
hierarchy to have the reload action as a parent of the stat entry and
then stat entry includes value per limit. This will also help to avoid
string concatenation on iproute2 output.

Reload stats structure before this fix:
"stats": {
    "reload": {
        "driver_reinit": 2,
        "fw_activate": 1,
        "fw_activate_no_reset": 0
     }
}

After this fix:
"stats": {
    "reload": {
        "driver_reinit": {
            "unspecified": 2
        },
        "fw_activate": {
            "unspecified": 1,
            "no_reset": 0
        }
}

Fixes: a254c26426 ("devlink: Add reload stats")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/1606109785-25197-1-git-send-email-moshe@mellanox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 13:04:04 -08:00
Eyal Birger
d549699048 net/packet: fix packet receive on L3 devices without visible hard header
In the patchset merged by commit b9fcf0a0d8
("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
did not have header_ops were given one for the purpose of protocol parsing
on af_packet transmit path.

That change made af_packet receive path regard these devices as having a
visible L3 header and therefore aligned incoming skb->data to point to the
skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
reset their mac_header prior to ingress and therefore their incoming
packets became malformed.

Ideally these devices would reset their mac headers, or af_packet would be
able to rely on dev->hard_header_len being 0 for such cases, but it seems
this is not the case.

Fix by changing af_packet RX ll visibility criteria to include the
existence of a '.create()' header operation, which is used when creating
a device hard header - via dev_hard_header() - by upper layers, and does
not exist in these L3 devices.

As this predicate may be useful in other situations, add it as a common
dev_has_header() helper in netdevice.h.

Fixes: b9fcf0a0d8 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 17:29:36 -08:00
Stefano Garzarella
3fe356d58e vsock/virtio: discard packets only when socket is really closed
Starting from commit 8692cefc43 ("virtio_vsock: Fix race condition
in virtio_transport_recv_pkt"), we discard packets in
virtio_transport_recv_pkt() if the socket has been released.

When the socket is connected, we schedule a delayed work to wait the
RST packet from the other peer, also if SHUTDOWN_MASK is set in
sk->sk_shutdown.
This is done to complete the virtio-vsock shutdown algorithm, releasing
the port assigned to the socket definitively only when the other peer
has consumed all the packets.

If we discard the RST packet received, the socket will be closed only
when the VSOCK_CLOSE_TIMEOUT is reached.

Sergio discovered the issue while running ab(1) HTTP benchmark using
libkrun [1] and observing a latency increase with that commit.

To avoid this issue, we discard packet only if the socket is really
closed (SOCK_DONE flag is set).
We also set SOCK_DONE in virtio_transport_release() when we don't need
to wait any packets from the other peer (we didn't schedule the delayed
work). In this case we remove the socket from the vsock lists, releasing
the port assigned.

[1] https://github.com/containers/libkrun

Fixes: 8692cefc43 ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt")
Cc: justin.he@arm.com
Reported-by: Sergio Lopez <slp@redhat.com>
Tested-by: Sergio Lopez <slp@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Jia He <justin.he@arm.com>
Link: https://lore.kernel.org/r/20201120104736.73749-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 16:36:29 -08:00
Ricardo Dias
01770a1661 tcp: fix race condition when creating child sockets from syncookies
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.

Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 16:32:33 -08:00
Marek Majtyka
178648916e xsk: Fix incorrect netdev reference count
Fix incorrect netdev reference count in xsk_bind operation. Incorrect
reference count of the device appears when a user calls bind with the
XDP_ZEROCOPY flag on an interface which does not support zero-copy.
In such a case, an error is returned but the reference count is not
decreased. This change fixes the fault, by decreasing the reference count
in case of such an error.

The problem being corrected appeared in '162c820ed896' for the first time,
and the code was moved to new file location over the time with commit
'c2d3d6a47462'. This specific patch applies to all version starting
from 'c2d3d6a47462'. The same solution should be applied but on different
file (net/xdp/xdp_umem.c) and function (xdp_umem_assign_dev) for versions
from '162c820ed896' to 'c2d3d6a47462' excluded.

Fixes: 162c820ed8 ("xdp: hold device for umem regardless of zero-copy mode")
Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201120151443.105903-1-marekx.majtyka@intel.com
2020-11-23 13:19:33 +01:00
Julian Wiedmann
c5dab0941f net/af_iucv: set correct sk_protocol for child sockets
Child sockets erroneously inherit their parent's sk_type (ie. SOCK_*),
instead of the PF_IUCV protocol that the parent was created with in
iucv_sock_create().

We're currently not using sk->sk_protocol ourselves, so this shouldn't
have much impact (except eg. getting the output in skb_dump() right).

Fixes: eac3731bd0 ("[S390]: Add AF_IUCV socket support")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Link: https://lore.kernel.org/r/20201120100657.34407-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-21 14:43:45 -08:00
Alexander Duyck
55472017a4 tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control
When setting congestion control via a BPF program it is seen that the
SYN/ACK for packets within a given flow will not include the ECT0 flag. A
bit of simple printk debugging shows that when this is configured without
BPF we will see the value INET_ECN_xmit value initialized in
tcp_assign_congestion_control however when we configure this via BPF the
socket is in the closed state and as such it isn't configured, and I do not
see it being initialized when we transition the socket into the listen
state. The result of this is that the ECT0 bit is configured based on
whatever the default state is for the socket.

Any easy way to reproduce this is to monitor the following with tcpdump:
tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca

Without this patch the SYN/ACK will follow whatever the default is. If dctcp
all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0
will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK
bit matches the value seen on the other packets in the given stream.

Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 18:09:47 -08:00
Alexander Duyck
861602b577 tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header
An issue was recently found where DCTCP SYN/ACK packets did not have the
ECT bit set in the L3 header. A bit of code review found that the recent
change referenced below had gone though and added a mask that prevented the
ECN bits from being populated in the L3 header.

This patch addresses that by rolling back the mask so that it is only
applied to the flags coming from the incoming TCP request instead of
applying it to the socket tos/tclass field. Doing this the ECT bits were
restored in the SYN/ACK packets in my testing.

One thing that is not addressed by this patch set is the fact that
tcp_reflect_tos appears to be incompatible with ECN based congestion
avoidance algorithms. At a minimum the feature should likely be documented
which it currently isn't.

Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 18:09:47 -08:00
Vadim Fedorenko
20ffc7adf5 net/tls: missing received data after fast remote close
In case when tcp socket received FIN after some data and the
parser haven't started before reading data caller will receive
an empty buffer. This behavior differs from plain TCP socket and
leads to special treating in user-space.
The flow that triggers the race is simple. Server sends small
amount of data right after the connection is configured to use TLS
and closes the connection. In this case receiver sees TLS Handshake
data, configures TLS socket right after Change Cipher Spec record.
While the configuration is in process, TCP socket receives small
Application Data record, Encrypted Alert record and FIN packet. So
the TCP socket changes sk_shutdown to RCV_SHUTDOWN and sk_flag with
SK_DONE bit set. The received data is not parsed upon arrival and is
never sent to user-space.

Patch unpauses parser directly if we have unparsed data in tcp
receive queue.

Fixes: fcf4793e27 ("tls: check RCV_SHUTDOWN in tls_wait_data")
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/1605801588-12236-1-git-send-email-vfedorenko@novek.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 10:25:26 -08:00
Anmol Karn
3b3fd068c5 rose: Fix Null pointer dereference in rose_send_frame()
rose_send_frame() dereferences `neigh->dev` when called from
rose_transmit_clear_request(), and the first occurrence of the
`neigh` is in rose_loopback_timer() as `rose_loopback_neigh`,
and it is initialized in rose_add_loopback_neigh() as NULL.
i.e when `rose_loopback_neigh` used in rose_loopback_timer()
its `->dev` was still NULL and rose_loopback_timer() was calling
rose_rx_call_request() without checking for NULL.

- net/rose/rose_link.c
This bug seems to get triggered in this line:

rose_call = (ax25_address *)neigh->dev->dev_addr;

Fix it by adding NULL checking for `rose_loopback_neigh->dev`
in rose_loopback_timer().

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com
Tested-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=9d2a7ca8c7f2e4b682c97578dfa3f236258300b3
Signed-off-by: Anmol Karn <anmol.karan123@gmail.com>
Link: https://lore.kernel.org/r/20201119191043.28813-1-anmol.karan123@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 10:04:58 -08:00
Magnus Karlsson
537cf4e3cc xsk: Fix umem cleanup bug at socket destruct
Fix a bug that is triggered when a partially setup socket is
destroyed. For a fully setup socket, a socket that has been bound to a
device, the cleanup of the umem is performed at the end of the buffer
pool's cleanup work queue item. This has to be performed in a work
queue, and not in RCU cleanup, as it is doing a vunmap that cannot
execute in interrupt context. However, when a socket has only been
partially set up so that a umem has been created but the buffer pool
has not, the code erroneously directly calls the umem cleanup function
instead of using a work queue, and this leads to a BUG_ON() in
vunmap().

As there in this case is no buffer pool, we cannot use its work queue,
so we need to introduce a work queue for the umem and schedule this for
the cleanup. So in the case there is no pool, we are going to use the
umem's own work queue to schedule the cleanup. But if there is a
pool, the cleanup of the umem is still being performed by the pool's
work queue, as it is important that the umem is cleaned up after the
pool.

Fixes: e5e1a4bc91 ("xsk: Fix possible memory leak at socket close")
Reported-by: Marek Majtyka <marekx.majtyka@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Marek Majtyka <marekx.majtyka@intel.com>
Link: https://lore.kernel.org/bpf/1605873219-21629-1-git-send-email-magnus.karlsson@gmail.com
2020-11-20 15:52:39 +01:00
Jakub Kicinski
e6ea60bac1 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
1) libbpf should not attempt to load unused subprogs, from Andrii.

2) Make strncpy_from_user() mask out bytes after NUL terminator, from Daniel.

3) Relax return code check for subprograms in the BPF verifier, from Dmitrii.

4) Fix several sockmap issues, from John.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  fail_function: Remove a redundant mutex unlock
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
  lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
  libbpf: Fix VERSIONED_SYM_COUNT number parsing
  bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list
  bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
  bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
  bpf, sockmap: Use truesize with sk_rmem_schedule()
  bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
  bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made
  selftests/bpf: Fix error return code in run_getsockopt_test()
  bpf: Relax return code check for subprograms
  tools, bpftool: Add missing close before bpftool net attach exit
  MAINTAINERS/bpf: Update Andrii's entry.
  selftests/bpf: Fix unused attribute usage in subprogs_unused test
  bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id
  bpf: Fix passing zero to PTR_ERR() in bpf_btf_printf_prepare
  libbpf: Don't attempt to load unused subprog as an entry-point BPF program
====================

Link: https://lore.kernel.org/r/20201119200721.288-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19 12:26:10 -08:00
Karsten Graul
41a0be3f8f net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid()
Sparse complaints 3 times about:
net/smc/smc_ib.c:203:52: warning: incorrect type in argument 1 (different address spaces)
net/smc/smc_ib.c:203:52:    expected struct net_device const *dev
net/smc/smc_ib.c:203:52:    got struct net_device [noderef] __rcu *const ndev

Fix that by using the existing and validated ndev variable instead of
accessing attr->ndev directly.

Fixes: 5102eca903 ("net/smc: Use rdma_read_gid_l2_fields to L2 fields")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19 10:59:19 -08:00
Karsten Graul
0530bd6e6a net/smc: fix matching of existing link groups
With the multi-subnet support of SMC-Dv2 the match for existing link
groups should not include the vlanid of the network device.
Set ini->smcd_version accordingly before the call to smc_conn_create()
and use this value in smc_conn_create() to skip the vlanid check.

Fixes: 5c21c4ccaf ("net/smc: determine accepted ISM devices")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19 10:59:19 -08:00
Georg Kohmann
2d8f6481c1 ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module
IPV6=m
NF_DEFRAG_IPV6=y

ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function
`nf_ct_frag6_gather':
net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to
`ipv6_frag_thdr_truncated'

Netfilter is depending on ipv6 symbol ipv6_frag_thdr_truncated. This
dependency is forcing IPV6=y.

Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This
is the same solution as used with a similar issues: Referring to
commit 70b095c843 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6
module")

Fixes: 9d9e937b1c ("ipv6/netfilter: Discard first fragment not including all headers")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Georg Kohmann <geokohma@cisco.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19 10:49:50 -08:00
Florian Fainelli
1532b97784 net: Have netpoll bring-up DSA management interface
DSA network devices rely on having their DSA management interface up and
running otherwise their ndo_open() will return -ENETDOWN. Without doing
this it would not be possible to use DSA devices as netconsole when
configured on the command line. These devices also do not utilize the
upper/lower linking so the check about the netpoll device having upper
is not going to be a problem.

The solution adopted here is identical to the one done for
net/ipv4/ipconfig.c with 728c02089a ("net: ipv4: handle DSA enabled
master network devices"), with the network namespace scope being
restricted to that of the process configuring netpoll.

Fixes: 04ff53f96a ("net: dsa: Add netconsole support")
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20201117035236.22658-1-f.fainelli@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-18 11:04:11 -08:00
Zhang Changzhong
a5ebcbdf34 ah6: fix error return code in ah6_input()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1605581105-35295-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-18 10:53:16 -08:00
Florian Klink
c09c8a27b9 ipv4: use IS_ENABLED instead of ifdef
Checking for ifdef CONFIG_x fails if CONFIG_x=m.

Use IS_ENABLED instead, which is true for both built-ins and modules.

Otherwise, a
> ip -4 route add 1.2.3.4/32 via inet6 fe80::2 dev eth1
fails with the message "Error: IPv6 support not enabled in kernel." if
CONFIG_IPV6 is `m`.

In the spirit of b8127113d0.

Fixes: d15662682d ("ipv4: Allow ipv6 gateway with ipv4 routes")
Cc: Kim Phillips <kim.phillips@arm.com>
Signed-off-by: Florian Klink <flokli@flokli.de>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201115224509.2020651-1-flokli@flokli.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17 17:02:03 -08:00
Wang Hai
e33de7c531 inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill()
nlmsg_cancel() needs to be called in the error path of
inet_req_diag_fill to cancel the message.

Fixes: d545caca82 ("net: inet: diag: expose the socket mark to privileged processes.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Link: https://lore.kernel.org/r/20201116082018.16496-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17 16:08:36 -08:00
John Fastabend
4363023d26 bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list
When skb has a frag_list its possible for skb_to_sgvec() to fail. This
happens when the scatterlist has fewer elements to store pages than would
be needed for the initial skb plus any of its frags.

This case appears rare, but is possible when running an RX parser/verdict
programs exposed to the internet. Currently, when this happens we throw
an error, break the pipe, and kfree the msg. This effectively breaks the
application or forces it to do a retry.

Lets catch this case and handle it by doing an skb_linearize() on any
skb we receive with frags. At this point skb_to_sgvec should not fail
because the failing conditions would require frags to be in place.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556576837.73229.14800682790808797635.stgit@john-XPS-13-9370
2020-11-18 00:14:04 +01:00
John Fastabend
2443ca6667 bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
If the skb_verdict_prog redirects an skb knowingly to itself, fix your
BPF program this is not optimal and an abuse of the API please use
SK_PASS. That said there may be cases, such as socket load balancing,
where picking the socket is hashed based or otherwise picks the same
socket it was received on in some rare cases. If this happens we don't
want to confuse userspace giving them an EAGAIN error if we can avoid
it.

To avoid double accounting in these cases. At the moment even if the
skb has already been charged against the sockets rcvbuf and forward
alloc we check it again and do set_owner_r() causing it to be orphaned
and recharged. For one this is useless work, but more importantly we
can have a case where the skb could be put on the ingress queue, but
because we are under memory pressure we return EAGAIN. The trouble
here is the skb has already been accounted for so any rcvbuf checks
include the memory associated with the packet already. This rolls
up and can result in unnecessary EAGAIN errors in userspace read()
calls.

Fix by doing an unlikely check and skipping checks if skb->sk == sk.

Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556574804.73229.11328201020039674147.stgit@john-XPS-13-9370
2020-11-18 00:13:27 +01:00
John Fastabend
6fa9201a89 bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
If a socket redirects to itself and it is under memory pressure it is
possible to get a socket stuck so that recv() returns EAGAIN and the
socket can not advance for some time. This happens because when
redirecting a skb to the same socket we received the skb on we first
check if it is OK to enqueue the skb on the receiving socket by checking
memory limits. But, if the skb is itself the object holding the memory
needed to enqueue the skb we will keep retrying from kernel side
and always fail with EAGAIN. Then userspace will get a recv() EAGAIN
error if there are no skbs in the psock ingress queue. This will continue
until either some skbs get kfree'd causing the memory pressure to
reduce far enough that we can enqueue the pending packet or the
socket is destroyed. In some cases its possible to get a socket
stuck for a noticeable amount of time if the socket is only receiving
skbs from sk_skb verdict programs. To reproduce I make the socket
memory limits ridiculously low so sockets are always under memory
pressure. More often though if under memory pressure it looks like
a spurious EAGAIN error on user space side causing userspace to retry
and typically enough has moved on the memory side that it works.

To fix skip memory checks and skb_orphan if receiving on the same
sock as already assigned.

For SK_PASS cases this is easy, its always the same socket so we
can just omit the orphan/set_owner pair.

For backlog cases we need to check skb->sk and decide if the orphan
and set_owner pair are needed.

Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556572660.73229.12566203819812939627.stgit@john-XPS-13-9370
2020-11-18 00:12:41 +01:00
John Fastabend
70796fb751 bpf, sockmap: Use truesize with sk_rmem_schedule()
We use skb->size with sk_rmem_scheduled() which is not correct. Instead
use truesize to align with socket and tcp stack usage of sk_rmem_schedule.

Suggested-by: Daniel Borkman <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556570616.73229.17003722112077507863.stgit@john-XPS-13-9370
2020-11-18 00:12:37 +01:00
John Fastabend
36cd0e696a bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This
allows users to tune SO_RCVBUF and sockmap will honor them.

We can refactor the if(charge) case out in later patches. But, keep this
fix to the point.

Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370
2020-11-18 00:12:34 +01:00
John Fastabend
c9c89dcd87 bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made
If copy_page_to_iter() fails or even partially completes, but with fewer
bytes copied than expected we currently reset sg.start and return EFAULT.
This proves problematic if we already copied data into the user buffer
before we return an error. Because we leave the copied data in the user
buffer and fail to unwind the scatterlist so kernel side believes data
has been copied and user side believes data has _not_ been received.

Expected behavior should be to return number of bytes copied and then
on the next read we need to return the error assuming its still there. This
can happen if we have a copy length spanning multiple scatterlist elements
and one or more complete before the error is hit.

The error is rare enough though that my normal testing with server side
programs, such as nginx, httpd, envoy, etc., I have never seen this. The
only reliable way to reproduce that I've found is to stream movies over
my browser for a day or so and wait for it to hang. Not very scientific,
but with a few extra WARN_ON()s in the code the bug was obvious.

When we review the errors from copy_page_to_iter() it seems we are hitting
a page fault from copy_page_to_iter_iovec() where the code checks
fault_in_pages_writeable(buf, copy) where buf is the user buffer. It
also seems typical server applications don't hit this case.

The other way to try and reproduce this is run the sockmap selftest tool
test_sockmap with data verification enabled, but it doesn't reproduce the
fault. Perhaps we can trigger this case artificially somehow from the
test tools. I haven't sorted out a way to do that yet though.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/160556566659.73229.15694973114605301063.stgit@john-XPS-13-9370
2020-11-18 00:12:10 +01:00
Tariq Toukan
138559b9f9 net/tls: Fix wrong record sn in async mode of device resync
In async_resync mode, we log the TCP seq of records until the async request
is completed.  Later, in case one of the logged seqs matches the resync
request, we return it, together with its record serial number.  Before this
fix, we mistakenly returned the serial number of the current record
instead.

Fixes: ed9b7646b0 ("net/tls: Add asynchronous resync")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20201115131448.2702-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17 14:41:20 -08:00
Ryan Sharpelletti
1b9e2a8c99 tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimate
During loss recovery, retransmitted packets are forced to use TCP
timestamps to calculate the RTT samples, which have a millisecond
granularity. BBR is designed using a microsecond granularity. As a
result, multiple RTT samples could be truncated to the same RTT value
during loss recovery. This is problematic, as BBR will not enter
PROBE_RTT if the RTT sample is <= the current min_rtt sample, meaning
that if there are persistent losses, PROBE_RTT will constantly be
pushed off and potentially never re-entered. This patch makes sure
that BBR enters PROBE_RTT by checking if RTT sample is < the current
min_rtt sample, rather than <=.

The Netflix transport/TCP team discovered this bug in the Linux TCP
BBR code during lab tests.

Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Ryan Sharpelletti <sharpelletti@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17 11:03:22 -08:00
Vadim Fedorenko
3fe16edf67 net/tls: fix corrupted data in recvmsg
If tcp socket has more data than Encrypted Handshake Message then
tls_sw_recvmsg will try to decrypt next record instead of returning
full control message to userspace as mentioned in comment. The next
message - usually Application Data - gets corrupted because it uses
zero copy for decryption that's why the data is not stored in skb
for next iteration. Revert check to not decrypt next record if
current is not Application Data.

Fixes: 692d7b5d1f ("tls: Fix recvmsg() to be able to peek across multiple records")
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/1605413760-21153-1-git-send-email-vfedorenko@novek.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 17:11:02 -08:00
Heiner Kallweit
7a30ecc923 net: bridge: add missing counters to ndo_get_stats64 callback
In br_forward.c and br_input.c fields dev->stats.tx_dropped and
dev->stats.multicast are populated, but they are ignored in
ndo_get_stats64.

Fixes: 28172739f0 ("net: fix 64 bit counters on 32 bit arches")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/58ea9963-77ad-a7cf-8dfd-fc95ab95f606@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 15:47:50 -08:00
Georg Kohmann
9d9e937b1c ipv6/netfilter: Discard first fragment not including all headers
Packets are processed even though the first fragment don't include all
headers through the upper layer header. This breaks TAHI IPv6 Core
Conformance Test v6LC.1.3.6.

Referring to RFC8200 SECTION 4.5: "If the first fragment does not include
all headers through an Upper-Layer header, then that fragment should be
discarded and an ICMP Parameter Problem, Code 3, message should be sent to
the source of the fragment, with the Pointer field set to zero."

The fragment needs to be validated the same way it is done in
commit 2efdaaaf88 ("IPv6: reply ICMP error if the first fragment don't
include all headers") for ipv6. Wrap the validation into a common function,
ipv6_frag_thdr_truncated() to check for truncation in the upper layer
header. This validation does not fullfill all aspects of RFC 8200,
section 4.5, but is at the moment sufficient to pass mentioned TAHI test.

In netfilter, utilize the fragment offset returned by find_prev_fhdr() to
let ipv6_frag_thdr_truncated() start it's traverse from the fragment
header.

Return 0 to drop the fragment in the netfilter. This is the same behaviour
as used on other protocol errors in this function, e.g. when
nf_ct_frag6_queue() returns -EPROTO. The Fragment will later be picked up
by ipv6_frag_rcv() in reassembly.c. ipv6_frag_rcv() will then send an
appropriate ICMP Parameter Problem message back to the source.

References commit 2efdaaaf88 ("IPv6: reply ICMP error if the first
fragment don't include all headers")

Signed-off-by: Georg Kohmann <geokohma@cisco.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20201111115025.28879-1-geokohma@cisco.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 10:15:11 -08:00
Jakub Kicinski
e2142ef266 linux-can-fixes-for-5.10-20201115
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl+xZoMTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqX8lB/4gQq9MTczYjSxVGgqaEyToftX7rokl
 EiP6RcpXKvAY8p0IJHEWHeCGOqsD5am4xo6M/Z5nMAinBdywweSvQRdU8ZDdCb84
 AxJEbrkU7JTNqHbQnLp3wzl5sHPAwvRavf9plFwnPosMM2nS0HcPnjaj9kgO3cLI
 E/fMa+WmPwDjMEIBM2TJCgFh9pH5Fs8I3SRHC/TUYwigXKVtgIuIm1o6t4evBk9w
 zBW+5HX++r2FSt0164vVZyBCOCK/z2tEMCEO8l3mrUKdjBy3jAFqXfWxcQbaFy8f
 Oy8Nx1vgNZYVH7RqYJ6JuQO6/wQIPAWb6KOgDrDWaM6jl3ZdA7/gBdiR
 =Ed6S
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-5.10-20201115' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2020-11-15

Anant Thazhemadam contributed two patches for the AF_CAN that prevent potential
access of uninitialized member in can_rcv() and canfd_rcv().

The next patch is by Alejandro Concepcion Rodriguez and changes can_restart()
to use the correct function to push a skb into the networking stack from
process context.

Zhang Qilong's patch fixes a memory leak in the error path of the ti_hecc's
probe function.

A patch by me fixes mcba_usb_start_xmit() function in the mcba_usb driver, to
first fill the skb and then pass it to can_put_echo_skb().

Colin Ian King's patch fixes a potential integer overflow on shift in the
peak_usb driver.

The next two patches target the flexcan driver, a patch by me adds the missing
"req_bit" to the stop mode property comment (which was broken during net-next
for v5.10). Zhang Qilong's patch fixes the failure handling of
pm_runtime_get_sync().

The next seven patches target the m_can driver including the tcan4x5x spi
driver glue code. Enric Balletbo i Serra's patch for the tcan4x5x Kconfig fix
the REGMAP_SPI dependency handling. A patch by me for the tcan4x5x driver's
probe() function adds missing error handling to for devm_regmap_init(), and in
tcan4x5x_can_remove() the order of deregistration is fixed. Wu Bo's patch for
the m_can driver fixes the state change handling in
m_can_handle_state_change(). Two patches by Dan Murphy first introduce
m_can_class_free_dev() and then make use of it to fix the freeing of the can
device. A patch by Faiz Abbas add a missing shutdown of the CAN controller in
the m_can_stop() function.

* tag 'linux-can-fixes-for-5.10-20201115' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: m_can: m_can_stop(): set device to software init mode before closing
  can: m_can: Fix freeing of can device from peripherials
  can: m_can: m_can_class_free_dev(): introduce new function
  can: m_can: m_can_handle_state_change(): fix state change
  can: tcan4x5x: tcan4x5x_can_remove(): fix order of deregistration
  can: tcan4x5x: tcan4x5x_can_probe(): add missing error checking for devm_regmap_init()
  can: tcan4x5x: replace depends on REGMAP_SPI with depends on SPI
  can: flexcan: fix failure handling of pm_runtime_get_sync()
  can: flexcan: flexcan_setup_stop_mode(): add missing "req_bit" to stop mode property comment
  can: peak_usb: fix potential integer overflow on shift of a int
  can: mcba_usb: mcba_usb_start_xmit(): first fill skb, then pass to can_put_echo_skb()
  can: ti_hecc: Fix memleak in ti_hecc_probe
  can: dev: can_restart(): post buffer from the right context
  can: af_can: prevent potential access of uninitialized member in canfd_rcv()
  can: af_can: prevent potential access of uninitialized member in can_rcv()
====================

Link: https://lore.kernel.org/r/20201115174131.2089251-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 07:34:30 -08:00
Anant Thazhemadam
9aa9379d8f can: af_can: prevent potential access of uninitialized member in canfd_rcv()
In canfd_rcv(), cfd->len is uninitialized when skb->len = 0, and this
uninitialized cfd->len is accessed nonetheless by pr_warn_once().

Fix this uninitialized variable access by checking cfd->len's validity
condition (cfd->len > CANFD_MAX_DLEN) separately after the skb->len's
condition is checked, and appropriately modify the log messages that
are generated as well.
In case either of the required conditions fail, the skb is freed and
NET_RX_DROP is returned, same as before.

Fixes: d468984688 ("can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once")
Reported-by: syzbot+9bcb0c9409066696d3aa@syzkaller.appspotmail.com
Tested-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Link: https://lore.kernel.org/r/20201103213906.24219-3-anant.thazhemadam@gmail.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-15 18:24:33 +01:00
Anant Thazhemadam
c8c958a58f can: af_can: prevent potential access of uninitialized member in can_rcv()
In can_rcv(), cfd->len is uninitialized when skb->len = 0, and this
uninitialized cfd->len is accessed nonetheless by pr_warn_once().

Fix this uninitialized variable access by checking cfd->len's validity
condition (cfd->len > CAN_MAX_DLEN) separately after the skb->len's
condition is checked, and appropriately modify the log messages that
are generated as well.
In case either of the required conditions fail, the skb is freed and
NET_RX_DROP is returned, same as before.

Fixes: 8cb68751c1 ("can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once")
Reported-by: syzbot+9bcb0c9409066696d3aa@syzkaller.appspotmail.com
Tested-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Link: https://lore.kernel.org/r/20201103213906.24219-2-anant.thazhemadam@gmail.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-15 18:17:04 +01:00
Taehee Yoo
14a2e551fa batman-adv: set .owner to THIS_MODULE
If THIS_MODULE is not set, the module would be removed while debugfs is
being used.
It eventually makes kernel panic.

Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-11-15 11:43:56 +01:00
Wang Hai
849920c703 devlink: Add missing genlmsg_cancel() in devlink_nl_sb_port_pool_fill()
If sb_occ_port_pool_get() failed in devlink_nl_sb_port_pool_fill(),
msg should be canceled by genlmsg_cancel().

Fixes: df38dafd25 ("devlink: implement shared buffer occupancy monitoring interface")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Link: https://lore.kernel.org/r/20201113111622.11040-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-14 16:23:19 -08:00
Paul Moore
1ba86d4366 netlabel: fix an uninitialized warning in netlbl_unlabel_staticlist()
Static checking revealed that a previous fix to
netlbl_unlabel_staticlist() leaves a stack variable uninitialized,
this patches fixes that.

Fixes: 866358ec33 ("netlabel: fix our progress tracking in netlbl_unlabel_staticlist()")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/r/160530304068.15651.18355773009751195447.stgit@sifl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-14 12:07:57 -08:00
Xin Long
057a10fa1f sctp: change to hold/put transport for proto_unreach_timer
A call trace was found in Hangbin's Codenomicon testing with debug kernel:

  [ 2615.981988] ODEBUG: free active (active state 0) object type: timer_list hint: sctp_generate_proto_unreach_event+0x0/0x3a0 [sctp]
  [ 2615.995050] WARNING: CPU: 17 PID: 0 at lib/debugobjects.c:328 debug_print_object+0x199/0x2b0
  [ 2616.095934] RIP: 0010:debug_print_object+0x199/0x2b0
  [ 2616.191533] Call Trace:
  [ 2616.194265]  <IRQ>
  [ 2616.202068]  debug_check_no_obj_freed+0x25e/0x3f0
  [ 2616.207336]  slab_free_freelist_hook+0xeb/0x140
  [ 2616.220971]  kfree+0xd6/0x2c0
  [ 2616.224293]  rcu_do_batch+0x3bd/0xc70
  [ 2616.243096]  rcu_core+0x8b9/0xd00
  [ 2616.256065]  __do_softirq+0x23d/0xacd
  [ 2616.260166]  irq_exit+0x236/0x2a0
  [ 2616.263879]  smp_apic_timer_interrupt+0x18d/0x620
  [ 2616.269138]  apic_timer_interrupt+0xf/0x20
  [ 2616.273711]  </IRQ>

This is because it holds asoc when transport->proto_unreach_timer starts
and puts asoc when the timer stops, and without holding transport the
transport could be freed when the timer is still running.

So fix it by holding/putting transport instead for proto_unreach_timer
in transport, just like other timers in transport.

v1->v2:
  - Also use sctp_transport_put() for the "out_unlock:" path in
    sctp_generate_proto_unreach_event(), as Marcelo noticed.

Fixes: 50b5d6ad63 ("sctp: Fix a race between ICMP protocol unreachable and connect()")
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/102788809b554958b13b95d33440f5448113b8d6.1605331373.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-14 11:57:12 -08:00
Stefano Garzarella
65b422d9b6 vsock: forward all packets to the host when no H2G is registered
Before commit c0cfa2d8a7 ("vsock: add multi-transports support"),
if a G2H transport was loaded (e.g. virtio transport), every packets
was forwarded to the host, regardless of the destination CID.
The H2G transports implemented until then (vhost-vsock, VMCI) always
responded with an error, if the destination CID was not
VMADDR_CID_HOST.

From that commit, we are using the remote CID to decide which
transport to use, so packets with remote CID > VMADDR_CID_HOST(2)
are sent only through H2G transport. If no H2G is available, packets
are discarded directly in the guest.

Some use cases (e.g. Nitro Enclaves [1]) rely on the old behaviour
to implement sibling VMs communication, so we restore the old
behavior when no H2G is registered.
It will be up to the host to discard packets if the destination is
not the right one. As it was already implemented before adding
multi-transport support.

Tested with nested QEMU/KVM by me and Nitro Enclaves by Andra.

[1] Documentation/virt/ne_overview.rst

Cc: Jorgen Hansen <jhansen@vmware.com>
Cc: Dexuan Cui <decui@microsoft.com>
Fixes: c0cfa2d8a7 ("vsock: add multi-transports support")
Reported-by: Andra Paraschiv <andraprs@amazon.com>
Tested-by: Andra Paraschiv <andraprs@amazon.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/r/20201112133837.34183-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-14 11:33:39 -08:00
Zhang Qilong
ceb736e1d4 ipv6: Fix error path to cancel the meseage
genlmsg_cancel() needs to be called in the error path of
inet6_fill_ifmcaddr and inet6_fill_ifacaddr to cancel
the message.

Fixes: 6ecf4c37eb ("ipv6: enable IFA_TARGET_NETNSID for RTM_GETADDR")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Link: https://lore.kernel.org/r/20201112080950.1476302-1-zhangqilong3@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-13 18:20:00 -08:00
Jeff Dike
8cf8821e15 net: Exempt multicast addresses from five-second neighbor lifetime
Commit 58956317c8 ("neighbor: Improve garbage collection")
guarantees neighbour table entries a five-second lifetime.  Processes
which make heavy use of multicast can fill the neighour table with
multicast addresses in five seconds.  At that point, neighbour entries
can't be GC-ed because they aren't five seconds old yet, the kernel
log starts to fill up with "neighbor table overflow!" messages, and
sends start to fail.

This patch allows multicast addresses to be thrown out before they've
lived out their five seconds.  This makes room for non-multicast
addresses and makes messages to all addresses more reliable in these
circumstances.

Fixes: 58956317c8 ("neighbor: Improve garbage collection")
Signed-off-by: Jeff Dike <jdike@akamai.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201113015815.31397-1-jdike@akamai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-13 14:24:39 -08:00
Jakub Kicinski
1395f8df87 A handful of fixes:
* a use-after-free fix in rfkill
  * a memory leak fix in the mac80211 TX status path
  * some rate scaling fixes
  * a fix for the often-reported (by syzbot) sleeping
    in atomic issue with mac80211's station removal
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl+uSaYACgkQB8qZga/f
 l8Q83A/8DaRC1M1nuhnPeMXJd73qxPk58+WqmBJU8sQEGtyCHyQZ6EdHUgW/V4ww
 Yh4vYBOoiWGG6JC0N0G02O4SxGmk3hZyTdxJ8WxnsM7pJ4yzWufduJrYvnx59mbh
 muw+WgEYLdyWYyzn81/rmfvQKfeFiic4pwSBuru2o0Vhb2bNYeUpYIMmdHkzlEyV
 BEwdVGCSvKWuxXdGCbplAJ79MbN6jP03y1+u7ZLYz7P+OvZZVQqUFouNuHUfrLUs
 nS8sgRToFTDt1Hu9GagRvbvqpwXGFgH+4LpMwLLa4mFjPcCC2UPewbiSIk7EU4tU
 wCr9pRlQXmcD+yVuWvwEQJvHQgkzm/Krx517QteXYB/GLmNC0A43kyQK+CEDlk56
 Fj0hk3UmXmek2EBEI/jLURlsJr7QMWqRRXXVK2VIvEButP+iWmPPyZ01Tkfjz5IN
 jHl8ahiSqX3kcCZLldlvkmtwqKKtgE4coQnPVOJJRBKYX5YpwLm0rClTsvx6XMe2
 zsegduwv6il9v3ezNudfdkcPtPfptOrkkvHgBfw6VgmvxQA1iuC8XDt/5lTpw/lt
 hshBRFvKpBTp0x52nkvBs7b7qO8qM55hEZF7NISJOyLbO1VAJPt4zZEFukip+zow
 QgA556OfDrA5bHDmuYIChBkJFOKeTI7J8DYQy65vloowg6OpkYg=
 =r2Wn
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A handful of fixes:
 * a use-after-free fix in rfkill
 * a memory leak fix in the mac80211 TX status path
 * some rate scaling fixes
 * a fix for the often-reported (by syzbot) sleeping
   in atomic issue with mac80211's station removal

* tag 'mac80211-for-net-2020-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
  mac80211: free sta in sta_info_insert_finish() on errors
  mac80211: minstrel: fix tx status processing corner case
  mac80211: minstrel: remove deferred sampling code
  mac80211: fix memory leak on filtered powersave frames
  rfkill: Fix use-after-free in rfkill_resume()
====================

Link: https://lore.kernel.org/r/20201113093421.24025-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-13 11:58:55 -08:00
Johannes Berg
7bc40aedf2 mac80211: free sta in sta_info_insert_finish() on errors
If sta_info_insert_finish() fails, we currently keep the station
around and free it only in the caller, but there's only one such
caller and it always frees it immediately.

As syzbot found, another consequence of this split is that we can
put things that sleep only into __cleanup_single_sta() and not in
sta_info_free(), but this is the only place that requires such of
sta_info_free() now.

Change this to free the station in sta_info_insert_finish(), in
which case we can still sleep. This will also let us unify the
cleanup code later.

Cc: stable@vger.kernel.org
Fixes: dcd479e10a ("mac80211: always wind down STA state")
Reported-by: syzbot+32c6c38c4812d22f2f0b@syzkaller.appspotmail.com
Reported-by: syzbot+4c81fe92e372d26c4246@syzkaller.appspotmail.com
Reported-by: syzbot+6a7fe9faf0d1d61bc24a@syzkaller.appspotmail.com
Reported-by: syzbot+abed06851c5ffe010921@syzkaller.appspotmail.com
Reported-by: syzbot+b7aeb9318541a1c709f1@syzkaller.appspotmail.com
Reported-by: syzbot+d5a9416c6cafe53b5dd0@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20201112112201.ee6b397b9453.I9c31d667a0ea2151441cc64ed6613d36c18a48e0@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-11-13 09:48:32 +01:00
Xie He
4ee18c179e net: x25: Increase refcnt of "struct x25_neigh" in x25_rx_call_request
The x25_disconnect function in x25_subr.c would decrease the refcount of
"x25->neighbour" (struct x25_neigh) and reset this pointer to NULL.

However, the x25_rx_call_request function in af_x25.c, which is called
when we receive a connection request, does not increase the refcount when
it assigns the pointer.

Fix this issue by increasing the refcount of "struct x25_neigh" in
x25_rx_call_request.

This patch fixes frequent kernel crashes when using AF_X25 sockets.

Fixes: 4becb7ee5b ("net/x25: Fix x25_neigh refcnt leak when x25 disconnect")
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Link: https://lore.kernel.org/r/20201112103506.5875-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-12 18:01:40 -08:00
Joel Stanley
1922a46b8c net/ncsi: Fix netlink registration
If a user unbinds and re-binds a NC-SI aware driver the kernel will
attempt to register the netlink interface at runtime. The structure is
marked __ro_after_init so registration fails spectacularly at this point.

 # echo 1e660000.ethernet > /sys/bus/platform/drivers/ftgmac100/unbind
 # echo 1e660000.ethernet > /sys/bus/platform/drivers/ftgmac100/bind
  ftgmac100 1e660000.ethernet: Read MAC address 52:54:00:12:34:56 from chip
  ftgmac100 1e660000.ethernet: Using NCSI interface
  8<--- cut here ---
  Unable to handle kernel paging request at virtual address 80a8f858
  pgd = 8c768dd6
  [80a8f858] *pgd=80a0841e(bad)
  Internal error: Oops: 80d [#1] SMP ARM
  CPU: 0 PID: 116 Comm: sh Not tainted 5.10.0-rc3-next-20201111-00003-gdd25b227ec1e #51
  Hardware name: Generic DT based system
  PC is at genl_register_family+0x1f8/0x6d4
  LR is at 0xff26ffff
  pc : [<8073f930>]    lr : [<ff26ffff>]    psr: 20000153
  sp : 8553bc80  ip : 81406244  fp : 8553bd04
  r10: 8085d12c  r9 : 80a8f73c  r8 : 85739000
  r7 : 00000017  r6 : 80a8f860  r5 : 80c8ab98  r4 : 80a8f858
  r3 : 00000000  r2 : 00000000  r1 : 81406130  r0 : 00000017
  Flags: nzCv  IRQs on  FIQs off  Mode SVC_32  ISA ARM  Segment none
  Control: 00c5387d  Table: 85524008  DAC: 00000051
  Process sh (pid: 116, stack limit = 0x1f1988d6)
 ...
  Backtrace:
  [<8073f738>] (genl_register_family) from [<80860ac0>] (ncsi_init_netlink+0x20/0x48)
   r10:8085d12c r9:80c8fb0c r8:85739000 r7:00000000 r6:81218000 r5:85739000
   r4:8121c000
  [<80860aa0>] (ncsi_init_netlink) from [<8085d740>] (ncsi_register_dev+0x1b0/0x210)
   r5:8121c400 r4:8121c000
  [<8085d590>] (ncsi_register_dev) from [<805a8060>] (ftgmac100_probe+0x6e0/0x778)
   r10:00000004 r9:80950228 r8:8115bc10 r7:8115ab00 r6:9eae2c24 r5:813b6f88
   r4:85739000
  [<805a7980>] (ftgmac100_probe) from [<805355ec>] (platform_drv_probe+0x58/0xa8)
   r9:80c76bb0 r8:00000000 r7:80cd4974 r6:80c76bb0 r5:8115bc10 r4:00000000
  [<80535594>] (platform_drv_probe) from [<80532d58>] (really_probe+0x204/0x514)
   r7:80cd4974 r6:00000000 r5:80cd4868 r4:8115bc10

Jakub pointed out that ncsi_register_dev is obviously broken, because
there is only one family so it would never work if there was more than
one ncsi netdev.

Fix the crash by registering the netlink family once on boot, and drop
the code to unregister it.

Fixes: 955dc68cb9 ("net/ncsi: Add generic netlink family")
Signed-off-by: Joel Stanley <joel@jms.id.au>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Link: https://lore.kernel.org/r/20201112061210.914621-1-joel@jms.id.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-12 17:00:13 -08:00
Linus Torvalds
db7c953555 Networking fixes for 5.10-rc4, including fixes from the bpf subtree.
Current release - regressions:
 
  - arm64: dts: fsl-ls1028a-kontron-sl28: specify in-band mode for ENETC
 
 Current release - bugs in new features:
 
  - mptcp: provide rmem[0] limit offset to fix oops
 
 Previous release - regressions:
 
  - IPv6: Set SIT tunnel hard_header_len to zero to fix path MTU
    calculations
 
  - lan743x: correctly handle chips with internal PHY
 
  - bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE
 
  - mlx5e: Fix VXLAN port table synchronization after function reload
 
 Previous release - always broken:
 
  - bpf: Zero-fill re-used per-cpu map element
 
  - net: udp: fix out-of-order packets when forwarding with UDP GSO
              fraglists turned on
    - fix UDP header access on Fast/frag0 UDP GRO
    - fix IP header access and skb lookup on Fast/frag0 UDP GRO
 
  - ethtool: netlink: add missing netdev_features_change() call
 
  - net: Update window_clamp if SOCK_RCVBUF is set
 
  - igc: Fix returning wrong statistics
 
  - ch_ktls: fix multiple leaks and corner cases in Chelsio TLS offload
 
  - tunnels: Fix off-by-one in lower MTU bounds for ICMP/ICMPv6 replies
 
  - r8169: disable hw csum for short packets on all chip versions
 
  - vrf: Fix fast path output packet handling with async Netfilter rules
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+thbIACgkQMUZtbf5S
 Irsy0RAAhYIYDNMSkQhcVcQPMxbtStwgTtKrWxg/D2zh3Kg+B4oRgoNZnt9kmlHX
 Su/aRWbTWBkDIMxIWBfRsO3z5zSQm4yLG1FTlfsOcWzOJcsntCO8SzikyxtnEZK8
 Bpi7dOoKB6KF0V2YjM9AHh5fbXvS7KJfp/PjZ7Kpn5BEbFV8rKtIyiJxwXXZUr6O
 ddM9Om4i0zf+dmsY1HVEyowPQMVB3vbn8F3dPk3ZrD8NVa53NtvMRxHKSsourRbZ
 yp4LKZV+POKHPFglO4jhLymhyeiwb1qgA8wssk7EKu0bwPeOcER4Tpewh1ib4C/C
 sRRzj0Wlw6dyPCkyNKx23D7dF/DrnLmXLUBhGS2mu2htSlWOH6w6rFQoVSNGGy9T
 DKUlUVUPG80mgYdME6NLJ27GOGQzxoAvzWgpcL6dJs9jz8nQqABJeXvdjw/vc/XH
 AOaKy4VwE3qf0W106JpUb+a/q0RJf7w3o4c1vLc/AZwpshNBOsrJBqrTk2E5Nrhd
 mcQykaF++DbLPIyTqhHl0GpKapohThESyMvfc4WRBFBaCwgFdOY/t0Gz3GA2N8Jc
 fuq9NOB1bfouaFGfzdkZ7RZJi3lFqZfv/XiJCh/knp1/lHAQPo4TuADcFDsjeEc9
 yr48SRDnCqahAQ7bUP0b5i31SZzwAYb/HnwYuvf4LWFvHl9XG5A=
 =AKM7
 -----END PGP SIGNATURE-----

Merge tag 'net-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Current release - regressions:

   - arm64: dts: fsl-ls1028a-kontron-sl28: specify in-band mode for
     ENETC

  Current release - bugs in new features:

   - mptcp: provide rmem[0] limit offset to fix oops

  Previous release - regressions:

   - IPv6: Set SIT tunnel hard_header_len to zero to fix path MTU
     calculations

   - lan743x: correctly handle chips with internal PHY

   - bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE

   - mlx5e: Fix VXLAN port table synchronization after function reload

  Previous release - always broken:

   - bpf: Zero-fill re-used per-cpu map element

   - fix out-of-order UDP packets when forwarding with UDP GSO fraglists
     turned on:
       - fix UDP header access on Fast/frag0 UDP GRO
       - fix IP header access and skb lookup on Fast/frag0 UDP GRO

   - ethtool: netlink: add missing netdev_features_change() call

   - net: Update window_clamp if SOCK_RCVBUF is set

   - igc: Fix returning wrong statistics

   - ch_ktls: fix multiple leaks and corner cases in Chelsio TLS offload

   - tunnels: Fix off-by-one in lower MTU bounds for ICMP/ICMPv6 replies

   - r8169: disable hw csum for short packets on all chip versions

   - vrf: Fix fast path output packet handling with async Netfilter
     rules"

* tag 'net-5.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits)
  lan743x: fix use of uninitialized variable
  net: udp: fix IP header access and skb lookup on Fast/frag0 UDP GRO
  net: udp: fix UDP header access on Fast/frag0 UDP GRO
  devlink: Avoid overwriting port attributes of registered port
  vrf: Fix fast path output packet handling with async Netfilter rules
  cosa: Add missing kfree in error path of cosa_write
  net: switch to the kernel.org patchwork instance
  ch_ktls: stop the txq if reaches threshold
  ch_ktls: tcb update fails sometimes
  ch_ktls/cxgb4: handle partial tag alone SKBs
  ch_ktls: don't free skb before sending FIN
  ch_ktls: packet handling prior to start marker
  ch_ktls: Correction in middle record handling
  ch_ktls: missing handling of header alone
  ch_ktls: Correction in trimmed_len calculation
  cxgb4/ch_ktls: creating skbs causes panic
  ch_ktls: Update cheksum information
  ch_ktls: Correction in finding correct length
  cxgb4/ch_ktls: decrypted bit is not enough
  net/x25: Fix null-ptr-deref in x25_connect
  ...
2020-11-12 14:02:04 -08:00
Alexander Lobakin
55e729889b net: udp: fix IP header access and skb lookup on Fast/frag0 UDP GRO
udp{4,6}_lib_lookup_skb() use ip{,v6}_hdr() to get IP header of the
packet. While it's probably OK for non-frag0 paths, this helpers
will also point to junk on Fast/frag0 GRO when all headers are
located in frags. As a result, sk/skb lookup may fail or give wrong
results. To support both GRO modes, skb_gro_network_header() might
be used. To not modify original functions, add private versions of
udp{4,6}_lib_lookup_skb() only to perform correct sk lookups on GRO.

Present since the introduction of "application-level" UDP GRO
in 4.7-rc1.

Misc: replace totally unneeded ternaries with plain ifs.

Fixes: a6024562ff ("udp: Add GRO functions to UDP socket")
Suggested-by: Willem de Bruijn <willemb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-12 09:55:51 -08:00
Alexander Lobakin
4b1a86281c net: udp: fix UDP header access on Fast/frag0 UDP GRO
UDP GRO uses udp_hdr(skb) in its .gro_receive() callback. While it's
probably OK for non-frag0 paths (when all headers or even the entire
frame are already in skb head), this inline points to junk when
using Fast GRO (napi_gro_frags() or napi_gro_receive() with only
Ethernet header in skb head and all the rest in the frags) and breaks
GRO packet compilation and the packet flow itself.
To support both modes, skb_gro_header_fast() + skb_gro_header_slow()
are typically used. UDP even has an inline helper that makes use of
them, udp_gro_udphdr(). Use that instead of troublemaking udp_hdr()
to get rid of the out-of-order delivers.

Present since the introduction of plain UDP GRO in 5.0-rc1.

Fixes: e20cf8d3f1 ("udp: implement GRO for plain UDP sockets.")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-12 09:55:43 -08:00
Parav Pandit
9f73bd1c2c devlink: Avoid overwriting port attributes of registered port
Cited commit in fixes tag overwrites the port attributes for the
registered port.

Avoid such error by checking registered flag before setting attributes.

Fixes: 71ad8d55f8 ("devlink: Replace devlink_port_attrs_set parameters with a struct")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20201111034744.35554-1-parav@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-12 08:06:57 -08:00
Felix Fietkau
b2911a8439 mac80211: minstrel: fix tx status processing corner case
Some drivers fill the status rate list without setting the rate index after
the final rate to -1. minstrel_ht already deals with this, but minstrel
doesn't, which causes it to get stuck at the lowest rate on these drivers.

Fix this by checking the count as well.

Cc: stable@vger.kernel.org
Fixes: cccf129f82 ("mac80211: add the 'minstrel' rate control algorithm")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201111183359.43528-3-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-11-12 11:25:09 +01:00
Felix Fietkau
4fe40b8e15 mac80211: minstrel: remove deferred sampling code
Deferring sampling attempts to the second stage has some bad interactions
with drivers that process the rate table in hardware and use the probe flag
to indicate probing packets (e.g. most mt76 drivers). On affected drivers
it can lead to probing not working at all.

If the link conditions turn worse, it might not be such a good idea to
do a lot of sampling for lower rates in this case.

Fix this by simply skipping the sample attempt instead of deferring it,
but keep the checks that would allow it to be sampled if it was skipped
too often, but only if it has less than 95% success probability.

Also ensure that IEEE80211_TX_CTL_RATE_CTRL_PROBE is set for all probing
packets.

Cc: stable@vger.kernel.org
Fixes: cccf129f82 ("mac80211: add the 'minstrel' rate control algorithm")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201111183359.43528-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-11-12 11:24:43 +01:00
Felix Fietkau
1d18288555 mac80211: fix memory leak on filtered powersave frames
After the status rework, ieee80211_tx_status_ext is leaking un-acknowledged
packets for stations in powersave mode.
To fix this, move the code handling those packets from __ieee80211_tx_status
into ieee80211_tx_status_ext

Reported-by: Tobias Waldvogel <tobias.waldvogel@gmail.com>
Fixes: 3318111cf6 ("mac80211: reduce duplication in tx status functions")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201111183359.43528-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-11-12 11:23:58 +01:00
Claire Chang
94e2bd0b25 rfkill: Fix use-after-free in rfkill_resume()
If a device is getting removed or reprobed during resume, use-after-free
might happen. For example, h5_btrtl_resume() schedules a work queue for
device reprobing, which of course requires removal first.

If the removal happens in parallel with the device_resume() and wins the
race to acquire device_lock(), removal may remove the device from the PM
lists and all, but device_resume() is already running and will continue
when the lock can be acquired, thus calling rfkill_resume().

During this, if rfkill_set_block() is then called after the corresponding
*_unregister() and kfree() are called, there will be an use-after-free
in hci_rfkill_set_block():

BUG: KASAN: use-after-free in hci_rfkill_set_block+0x58/0xc0 [bluetooth]
...
Call trace:
  dump_backtrace+0x0/0x154
  show_stack+0x20/0x2c
  dump_stack+0xbc/0x12c
  print_address_description+0x88/0x4b0
  __kasan_report+0x144/0x168
  kasan_report+0x10/0x18
  check_memory_region+0x19c/0x1ac
  __kasan_check_write+0x18/0x24
  hci_rfkill_set_block+0x58/0xc0 [bluetooth]
  rfkill_set_block+0x9c/0x120
  rfkill_resume+0x34/0x70
  dpm_run_callback+0xf0/0x1f4
  device_resume+0x210/0x22c

Fix this by checking rfkill->registered in rfkill_resume(). device_del()
in rfkill_unregister() requires device_lock() and the whole rfkill_resume()
is also protected by the same lock via device_resume(), we can make sure
either the rfkill->registered is false before rfkill_resume() starts or the
rfkill device won't be unregistered before rfkill_resume() returns.

As async_resume() holds a reference to the device, at this level there can
be no use-after-free; only in the user that doesn't expect this scenario.

Fixes: 8589086f4e ("Bluetooth: hci_h5: Turn off RTL8723BS on suspend, reprobe on resume")
Signed-off-by: Claire Chang <tientzu@chromium.org>
Link: https://lore.kernel.org/r/20201110084908.219088-1-tientzu@chromium.org
[edit commit message for clarity and add more info provided later]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-11-12 09:18:06 +01:00
Martin Schiller
3611823087 net/x25: Fix null-ptr-deref in x25_connect
This fixes a regression for blocking connects introduced by commit
4becb7ee5b ("net/x25: Fix x25_neigh refcnt leak when x25 disconnect").

The x25->neighbour is already set to "NULL" by x25_disconnect() now,
while a blocking connect is waiting in
x25_wait_for_connection_establishment(). Therefore x25->neighbour must
not be accessed here again and x25->state is also already set to
X25_STATE_0 by x25_disconnect().

Fixes: 4becb7ee5b ("net/x25: Fix x25_neigh refcnt leak when x25 disconnect")
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Reviewed-by: Xie He <xie.he.0141@gmail.com>
Link: https://lore.kernel.org/r/20201109065449.9014-1-ms@dev.tdt.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-11 14:53:56 -08:00
Wang Hai
fa6882c636 tipc: fix memory leak in tipc_topsrv_start()
kmemleak report a memory leak as follows:

unreferenced object 0xffff88810a596800 (size 512):
  comm "ip", pid 21558, jiffies 4297568990 (age 112.120s)
  hex dump (first 32 bytes):
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
    ff ff ff ff ff ff ff ff 00 83 60 b0 ff ff ff ff  ..........`.....
  backtrace:
    [<0000000022bbe21f>] tipc_topsrv_init_net+0x1f3/0xa70
    [<00000000fe15ddf7>] ops_init+0xa8/0x3c0
    [<00000000138af6f2>] setup_net+0x2de/0x7e0
    [<000000008c6807a3>] copy_net_ns+0x27d/0x530
    [<000000006b21adbd>] create_new_namespaces+0x382/0xa30
    [<00000000bb169746>] unshare_nsproxy_namespaces+0xa1/0x1d0
    [<00000000fe2e42bc>] ksys_unshare+0x39c/0x780
    [<0000000009ba3b19>] __x64_sys_unshare+0x2d/0x40
    [<00000000614ad866>] do_syscall_64+0x56/0xa0
    [<00000000a1b5ca3c>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

'srv' is malloced in tipc_topsrv_start() but not free before
leaving from the error handling cases. We need to free it.

Fixes: 5c45ab24ac ("tipc: make struct tipc_server private for server.c")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Link: https://lore.kernel.org/r/20201109140913.47370-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-11 14:39:23 -08:00
Ursula Braun
4031eeafa7 net/af_iucv: fix null pointer dereference on shutdown
syzbot reported the following KASAN finding:

BUG: KASAN: nullptr-dereference in iucv_send_ctrl+0x390/0x3f0 net/iucv/af_iucv.c:385
Read of size 2 at addr 000000000000021e by task syz-executor907/519

CPU: 0 PID: 519 Comm: syz-executor907 Not tainted 5.9.0-syzkaller-07043-gbcf9877ad213 #0
Hardware name: IBM 3906 M04 701 (KVM/Linux)
Call Trace:
 [<00000000c576af60>] unwind_start arch/s390/include/asm/unwind.h:65 [inline]
 [<00000000c576af60>] show_stack+0x180/0x228 arch/s390/kernel/dumpstack.c:135
 [<00000000c9dcd1f8>] __dump_stack lib/dump_stack.c:77 [inline]
 [<00000000c9dcd1f8>] dump_stack+0x268/0x2f0 lib/dump_stack.c:118
 [<00000000c5fed016>] print_address_description.constprop.0+0x5e/0x218 mm/kasan/report.c:383
 [<00000000c5fec82a>] __kasan_report mm/kasan/report.c:517 [inline]
 [<00000000c5fec82a>] kasan_report+0x11a/0x168 mm/kasan/report.c:534
 [<00000000c98b5b60>] iucv_send_ctrl+0x390/0x3f0 net/iucv/af_iucv.c:385
 [<00000000c98b6262>] iucv_sock_shutdown+0x44a/0x4c0 net/iucv/af_iucv.c:1457
 [<00000000c89d3a54>] __sys_shutdown+0x12c/0x1c8 net/socket.c:2204
 [<00000000c89d3b70>] __do_sys_shutdown net/socket.c:2212 [inline]
 [<00000000c89d3b70>] __s390x_sys_shutdown+0x38/0x48 net/socket.c:2210
 [<00000000c9e36eac>] system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

There is nothing to shutdown if a connection has never been established.
Besides that iucv->hs_dev is not yet initialized if a socket is in
IUCV_OPEN state and iucv->path is not yet initialized if socket is in
IUCV_BOUND state.
So, just skip the shutdown calls for a socket in these states.

Fixes: eac3731bd0 ("[S390]: Add AF_IUCV socket support")
Fixes: 82492a355f ("af_iucv: add shutdown for HS transport")
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
[jwi: correct one Fixes tag]
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-10 18:08:17 -08:00
Mao Wenan
909172a149 net: Update window_clamp if SOCK_RCVBUF is set
When net.ipv4.tcp_syncookies=1 and syn flood is happened,
cookie_v4_check or cookie_v6_check tries to redo what
tcp_v4_send_synack or tcp_v6_send_synack did,
rsk_window_clamp will be changed if SOCK_RCVBUF is set,
which will make rcv_wscale is different, the client
still operates with initial window scale and can overshot
granted window, the client use the initial scale but local
server use new scale to advertise window value, and session
work abnormally.

Fixes: e88c64f0a4 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
Signed-off-by: Mao Wenan <wenan.mao@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1604967391-123737-1-git-send-email-wenan.mao@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-10 17:42:35 -08:00
Paul Moore
866358ec33 netlabel: fix our progress tracking in netlbl_unlabel_staticlist()
The current NetLabel code doesn't correctly keep track of the netlink
dump state in some cases, in particular when multiple interfaces with
large configurations are loaded.  The problem manifests itself by not
reporting the full configuration to userspace, even though it is
loaded and active in the kernel.  This patch fixes this by ensuring
that the dump state is properly reset when necessary inside the
netlbl_unlabel_staticlist() function.

Fixes: 8cc44579d1 ("NetLabel: Introduce static network labels for unlabeled connections")
Signed-off-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/160484450633.3752.16512718263560813473.stgit@sifl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-10 15:31:46 -08:00
Yu Kuai
48f486e13f net: xfrm: fix memory leak in xfrm_user_policy()
if xfrm_get_translator() failed, xfrm_user_policy() return without
freeing 'data', which is allocated in memdup_sockptr().

Fixes: 96392ee5a1 ("xfrm/compat: Translate 32-bit user_policy from sockptr")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-11-10 09:14:25 +01:00
Paolo Abeni
989ef49bdf mptcp: provide rmem[0] limit
The mptcp proto struct currently does not provide the
required limit for forward memory scheduling. Under
pressure sk_rmem_schedule() will unconditionally try
to use such field and will oops.

Address the issue inheriting the tcp limit, as we already
do for the wmem one.

Fixes: 9c3f94e168 ("mptcp: add missing memory scheduling in the rx path")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/37af798bd46f402fb7c79f57ebbdd00614f5d7fa.1604861097.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-09 17:34:34 -08:00
Alexander Lobakin
413691384a ethtool: netlink: add missing netdev_features_change() call
After updating userspace Ethtool from 5.7 to 5.9, I noticed that
NETDEV_FEAT_CHANGE is no more raised when changing netdev features
through Ethtool.
That's because the old Ethtool ioctl interface always calls
netdev_features_change() at the end of user request processing to
inform the kernel that our netdevice has some features changed, but
the new Netlink interface does not. Instead, it just notifies itself
with ETHTOOL_MSG_FEATURES_NTF.
Replace this ethtool_notify() call with netdev_features_change(), so
the kernel will be aware of any features changes, just like in case
with the ioctl interface. This does not omit Ethtool notifications,
as Ethtool itself listens to NETDEV_FEAT_CHANGE and drops
ETHTOOL_MSG_FEATURES_NTF on it
(net/ethtool/netlink.c:ethnl_netdev_event()).

From v1 [1]:
- dropped extra new line as advised by Jakub;
- no functional changes.

[1] https://lore.kernel.org/netdev/AlZXQ2o5uuTVHCfNGOiGgJ8vJ3KgO5YIWAnQjH0cDE@cp3-web-009.plabs.ch

Fixes: 0980bfcd69 ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/ahA2YWXYICz5rbUSQqNG4roJ8OlJzzYQX7PTiG80@cp4-web-028.plabs.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-09 17:15:34 -08:00
Stefano Brivio
77a2d673d5 tunnels: Fix off-by-one in lower MTU bounds for ICMP/ICMPv6 replies
Jianlin reports that a bridged IPv6 VXLAN endpoint, carrying IPv6
packets over a link with a PMTU estimation of exactly 1350 bytes,
won't trigger ICMPv6 Packet Too Big replies when the encapsulated
datagrams exceed said PMTU value. VXLAN over IPv6 adds 70 bytes of
overhead, so an ICMPv6 reply indicating 1280 bytes as inner MTU
would be legitimate and expected.

This comes from an off-by-one error I introduced in checks added
as part of commit 4cb47a8644 ("tunnels: PMTU discovery support
for directly bridged IP packets"), whose purpose was to prevent
sending ICMPv6 Packet Too Big messages with an MTU lower than the
smallest permissible IPv6 link MTU, i.e. 1280 bytes.

In iptunnel_pmtud_check_icmpv6(), avoid triggering a reply only if
the advertised MTU would be less than, and not equal to, 1280 bytes.

Also fix the analogous comparison for IPv4, that is, skip the ICMP
reply only if the resulting MTU is strictly less than 576 bytes.

This becomes apparent while running the net/pmtu.sh bridged VXLAN
or GENEVE selftests with adjusted lower-link MTU values. Using
e.g. GENEVE, setting ll_mtu to the values reported below, in the
test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception() test
function, we can see failures on the following tests:

             test                | ll_mtu
  -------------------------------|--------
  pmtu_ipv4_br_geneve4_exception |   626
  pmtu_ipv6_br_geneve4_exception |  1330
  pmtu_ipv6_br_geneve6_exception |  1350

owing to the different tunneling overheads implied by the
corresponding configurations.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: 4cb47a8644 ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Link: https://lore.kernel.org/r/4f5fc2f33bfdf8409549fafd4f952b008bf04d63.1604681709.git.sbrivio@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-09 15:39:39 -08:00
Oliver Herms
8ef9ba4d66 IPv6: Set SIT tunnel hard_header_len to zero
Due to the legacy usage of hard_header_len for SIT tunnels while
already using infrastructure from net/ipv4/ip_tunnel.c the
calculation of the path MTU in tnl_update_pmtu is incorrect.
This leads to unnecessary creation of MTU exceptions for any
flow going over a SIT tunnel.

As SIT tunnels do not have a header themsevles other than their
transport (L3, L2) headers we're leaving hard_header_len set to zero
as tnl_update_pmtu is already taking care of the transport headers
sizes.

This will also help avoiding unnecessary IPv6 GC runs and spinlock
contention seen when using SIT tunnels and for more than
net.ipv6.route.gc_thresh flows.

Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20201103104133.GA1573211@tws
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-09 15:07:40 -08:00
Linus Torvalds
3552c3709c This is mainly server-to-server copy and fallout from Chuck's 5.10 rpc
refactoring.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl+pk7QVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+lwgQAL0WE92H1QJwYtrC5bXko1CjXjL7
 I1lv/rMf1ZHhdbZLZQNSqXFYTGrO3w6n02H7bJcYlryg5YSt8i8evdJXICYyeZIX
 5QAT0K5hzHTNWKnumqBSwoVOPl1e6ImZtmyxqQvA/2sQP18OPvroK/9H0YkdnM3/
 d8lcpKTBCJj0UAWmktaXGYG8PdNSjaNXMfPRwpCOGHiXk+QBAb+QjshB54PKjjhR
 aiJTJzceroLer0YlQSXfVQMt6EwkTkjCbMbxPywfFYGGvl/Y7H4YgVA8rYqO/XZr
 BmP9V+xX87GyB0IEGxoheVcmTMUSw37JUfAC2oBQB9g2emG5avRn4vdhL25nKd1T
 sgaVC+0tnoMQ7KNaYp1SK6orgS+OIYeQLhxbu6jmU+viccJ621JmpRF+95OwEZ9Z
 4+vBwI3Oft20jndgNwrTvCLgkzEVFpJuayBeZCk7pvchM2YjaWwl291ix+cwM2wQ
 fwMVs6dpLIgfB8jNOM6qAfI1jB1HMePrPraqxddxh5tZ0Tt4C4uwpEIDDwaPesmJ
 FK3JB+7GpU/tMHmmaeVFUMGx9V+8fJFEC0MFUrrqAMZ3XbzQ+DM5ysk1TQsO0OEO
 F1ojiYNW8s4U+dLCY0S16vFVoQIuM9Ui1zXGaJHQgS04l+cFCmD495s4HtYA1k7l
 H/T/o416bZlbOhcK
 =bpPt
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.10-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "This is mainly server-to-server copy and fallout from Chuck's 5.10 rpc
  refactoring"

* tag 'nfsd-5.10-1' of git://linux-nfs.org/~bfields/linux:
  net/sunrpc: fix useless comparison in proc_do_xprt()
  net/sunrpc: return 0 on attempt to write to "transports"
  NFSD: fix missing refcount in nfsd4_copy by nfsd4_do_async_copy
  NFSD: Fix use-after-free warning when doing inter-server copy
  NFSD: MKNOD should return NFSERR_BADTYPE instead of NFSERR_INVAL
  SUNRPC: Fix general protection fault in trace_rpc_xdr_overflow()
  NFSD: NFSv3 PATHCONF Reply is improperly formed
2020-11-09 12:43:12 -08:00
Dmitry Safonov
ad37f77fd3 xfrm/compat: Don't allocate memory with __GFP_ZERO
32-bit to 64-bit messages translator zerofies needed paddings in the
translation, the rest is the actual payload.
Don't allocate zero pages as they are not needed.

Fixes: 5106f4a8ac ("xfrm/compat: Add 32=>64-bit messages translator")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-11-09 07:34:56 +01:00
Dmitry Safonov
d1949d045f xfrm/compat: memset(0) 64-bit padding at right place
32-bit messages translated by xfrm_compat can have attributes attached.
For all, but XFRMA_SA, XFRMA_POLICY the size of payload is the same
in 32-bit UABI and 64-bit UABI. For XFRMA_SA (struct xfrm_usersa_info)
and XFRMA_POLICY (struct xfrm_userpolicy_info) it's only tail-padding
that is present in 64-bit payload, but not in 32-bit.
The proper size for destination nlattr is already calculated by
xfrm_user_rcv_calculate_len64() and allocated with kvmalloc().

xfrm_attr_cpy32() copies 32-bit copy_len into 64-bit attribute
translated payload, zero-filling possible padding for SA/POLICY.
Due to a typo, *pos already has 64-bit payload size, in a result next
memset(0) is called on the memory after the translated attribute, not on
the tail-padding of it.

Fixes: 5106f4a8ac ("xfrm/compat: Add 32=>64-bit messages translator")
Reported-by: syzbot+c43831072e7df506a646@syzkaller.appspotmail.com
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-11-09 07:34:56 +01:00
Dmitry Safonov
dbd7ae5154 xfrm/compat: Translate by copying XFRMA_UNSPEC attribute
xfrm_xlate32() translates 64-bit message provided by kernel to be sent
for 32-bit listener (acknowledge or monitor). Translator code doesn't
expect XFRMA_UNSPEC attribute as it doesn't know its payload.
Kernel never attaches such attribute, but a user can.

I've searched if any opensource does it and the answer is no.
Nothing on github and google finds only tfcproject that has such code
commented-out.

What will happen if a user sends a netlink message with XFRMA_UNSPEC
attribute? Ipsec code ignores this attribute. But if there is a
monitor-process or 32-bit user requested ack - kernel will try to
translate such message and will hit WARN_ONCE() in xfrm_xlate64_attr().

Deal with XFRMA_UNSPEC by copying the attribute payload with
xfrm_nla_cpy(). In result, the default switch-case in xfrm_xlate64_attr()
becomes an unused code. Leave those 3 lines in case a new xfrm attribute
will be added.

Fixes: 5461fc0c8d ("xfrm/compat: Add 64=>32-bit messages translator")
Reported-by: syzbot+a7e701c8385bd8543074@syzkaller.appspotmail.com
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-11-09 07:34:56 +01:00
Dan Carpenter
ae2975046d net/sunrpc: fix useless comparison in proc_do_xprt()
In the original code, the "if (*lenp < 0)" check didn't work because
"*lenp" is unsigned.  Fortunately, the memory_read_from_buffer() call
will never fail in this context so it doesn't affect runtime.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-11-08 16:28:25 -05:00
Jakub Kicinski
86bbf01977 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-11-06

1) Pre-allocated per-cpu hashmap needs to zero-fill reused element, from David.

2) Tighten bpf_lsm function check, from KP.

3) Fix bpftool attaching to flow dissector, from Lorenz.

4) Use -fno-gcse for the whole kernel/bpf/core.c instead of function attribute, from Ard.

* git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Update verification logic for LSM programs
  bpf: Zero-fill re-used per-cpu map element
  bpf: BPF_PRELOAD depends on BPF_SYSCALL
  tools/bpftool: Fix attaching flow dissector
  libbpf: Fix possible use after free in xsk_socket__delete
  libbpf: Fix null dereference in xsk_socket__delete
  libbpf, hashmap: Fix undefined behavior in hash_bits
  bpf: Don't rely on GCC __attribute__((optimize)) to disable GCSE
  tools, bpftool: Remove two unused variables.
  tools, bpftool: Avoid array index warnings.
  xsk: Fix possible memory leak at socket close
  bpf: Add struct bpf_redir_neigh forward declaration to BPF helper defs
  samples/bpf: Set rlimit for memlock to infinity in all samples
  bpf: Fix -Wshadow warnings
  selftest/bpf: Fix profiler test using CO-RE relocation for enums
====================

Link: https://lore.kernel.org/r/20201106221759.24143-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-06 17:49:34 -08:00
Dan Carpenter
d435c05ab0 net/sunrpc: return 0 on attempt to write to "transports"
You can't write to this file because the permissions are 0444.  But
it sort of looked like you could do a write and it would result in
a read.  Then it looked like proc_sys_call_handler() just ignored
it.  Which is confusing.  It's more clear if the "write" just
returns zero.

Also, the "lenp" pointer is never NULL so that check can be removed.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-11-06 15:56:18 -05:00
Linus Torvalds
41f1653024 Networking fixes for 5.10-rc3, including fixes from wireless, can,
and netfilter subtrees.
 
 Current release - bugs in new features:
 
  - can: isotp: isotp_rcv_cf(): enable RX timeout handling in
    listen-only mode
 
 Previous release - regressions:
 
  - mac80211:
    - don't require VHT elements for HE on 2.4 GHz
    - fix regression where EAPOL frames were sent in plaintext
 
  - netfilter:
    - ipset: Update byte and packet counters regardless of whether
      they match
 
  - ip_tunnel: fix over-mtu packet send by allowing fragmenting even
    if inner packet has IP_DF (don't fragment) set in its header
    (when TUNNEL_DONT_FRAGMENT flag is not set on the tunnel dev)
 
  - net: fec: fix MDIO probing for some FEC hardware blocks
 
  - ip6_tunnel: set inner ipproto before ip6_tnl_encap to un-break
    gso support
 
  - sctp: Fix COMM_LOST/CANT_STR_ASSOC err reporting on big-endian
    platforms, sparse-related fix used the wrong integer size
 
 Previous release - always broken:
 
  - netfilter: use actual socket sk rather than skb sk when routing
    harder
 
  - r8169: work around short packet hw bug on RTL8125 by padding frames
 
  - net: ethernet: ti: cpsw: disable PTPv1 hw timestamping
    advertisement, the hardware does not support it
 
  - chelsio/chtls: fix always leaking ctrl_skb and another leak caused
    by a race condition
 
  - fix drivers incorrectly writing into skbs on TX:
    - cadence: force nonlinear buffers to be cloned
    - gianfar: Account for Tx PTP timestamp in the skb headroom
    - gianfar: Replace skb_realloc_headroom with skb_cow_head for PTP
 
  - can: flexcan:
    - remove FLEXCAN_QUIRK_DISABLE_MECR quirk for LS1021A
    - add ECC initialization for VF610 and LX2160A
    - flexcan_remove(): disable wakeup completely
 
  - can: fix packet echo functionality:
    - peak_canfd: fix echo management when loopback is on
    - make sure skbs are not freed in IRQ context in case they need
      to be dropped
    - always clone the skbs to make sure they have a reference on
      the socket, and prevent it from disappearing
    - fix real payload length return value for RTR frames
 
  - can: j1939: return failure on bind if netdev is down, rather than
    waiting indefinitely
 
 Misc:
 
  - IPv6: reply ICMP error if the first fragment don't include all
    headers to improve compliance with RFC 8200
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+kTDcACgkQMUZtbf5S
 IrtC9A//f9rwNFI7sRaz9FYi6ljtWY7paPxdOxy3pWRoNzbfffjTGSPheNvy1pQb
 IPaLsNwRrckQNSEPTbQqlUYcjzk1W74ffvq0sQOan4kNKxjX3uf78E6RuWARJsRC
 dLqfcJctO6bFi6sEMwIFZ2tLOO5lUIA+Pd0GbjhSdObWzl3uqJ26v7wC6vVk29vS
 116Mmhe8/TDVtCOzwlZnBPHqBJkTAirB+MAEX4Sp6FB9YirlcNZbWyHX5L6ejGqC
 WQVjU2tPBBugeo0j72tc+y0mD3iK0aLcPL+dk0EQQYHRDMVTebl+gxNPUXCo9Out
 HGe5z4e4qrR4Rx1W6MQ3pKwTYuCdwKjMRGd72JAi428/l4NN3y9W/HkI2Zuppd2l
 7ifURkNQllYjGCSoHBviJbajyFBeA1nkFJgMSJiRs4T167K3zTbsyjNnfa4LnsvS
 B3SrYMGqIH+oR20R9EoV8prVX+Alj1hh/jX02J8zsCcHmBqF2yZi17NarVAWoarm
 v/AAqehlP+D1vjAmbCG9DeborrjaNi+v6zFTKK6ZadvLXRJX/N+wEPIpG4KjiK8W
 DWKIVlee0R+kgCXE1n9AuZaZLWb7VwrAjkG1Pmfi3vkZhWeAhOW4X98ehhi/hVR/
 Gq+e48ZECW5yuOA1q4hbsCYkGr2qAn/LPbsXxhEmW8qwkJHZYkI=
 =5R2w
 -----END PGP SIGNATURE-----

Merge tag 'net-5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.10-rc3, including fixes from wireless, can, and
  netfilter subtrees.

  Current merge window - bugs in new features:

   - can: isotp: isotp_rcv_cf(): enable RX timeout handling in
     listen-only mode

  Previous releases - regressions:

   - mac80211:
      - don't require VHT elements for HE on 2.4 GHz
      - fix regression where EAPOL frames were sent in plaintext

   - netfilter:
      - ipset: Update byte and packet counters regardless of whether
        they match

   - ip_tunnel: fix over-mtu packet send by allowing fragmenting even if
     inner packet has IP_DF (don't fragment) set in its header (when
     TUNNEL_DONT_FRAGMENT flag is not set on the tunnel dev)

   - net: fec: fix MDIO probing for some FEC hardware blocks

   - ip6_tunnel: set inner ipproto before ip6_tnl_encap to un-break gso
     support

   - sctp: Fix COMM_LOST/CANT_STR_ASSOC err reporting on big-endian
     platforms, sparse-related fix used the wrong integer size

  Previous releases - always broken:

   - netfilter: use actual socket sk rather than skb sk when routing
     harder

   - r8169: work around short packet hw bug on RTL8125 by padding frames

   - net: ethernet: ti: cpsw: disable PTPv1 hw timestamping
     advertisement, the hardware does not support it

   - chelsio/chtls: fix always leaking ctrl_skb and another leak caused
     by a race condition

   - fix drivers incorrectly writing into skbs on TX:
      - cadence: force nonlinear buffers to be cloned
      - gianfar: Account for Tx PTP timestamp in the skb headroom
      - gianfar: Replace skb_realloc_headroom with skb_cow_head for PTP

   - can: flexcan:
      - remove FLEXCAN_QUIRK_DISABLE_MECR quirk for LS1021A
      - add ECC initialization for VF610 and LX2160A
      - flexcan_remove(): disable wakeup completely

   - can: fix packet echo functionality:
      - peak_canfd: fix echo management when loopback is on
      - make sure skbs are not freed in IRQ context in case they need to
        be dropped
      - always clone the skbs to make sure they have a reference on the
        socket, and prevent it from disappearing
      - fix real payload length return value for RTR frames

   - can: j1939: return failure on bind if netdev is down, rather than
     waiting indefinitely

  Misc:

   - IPv6: reply ICMP error if the first fragment don't include all
     headers to improve compliance with RFC 8200"

* tag 'net-5.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits)
  ionic: check port ptr before use
  r8169: work around short packet hw bug on RTL8125
  net: openvswitch: silence suspicious RCU usage warning
  chelsio/chtls: fix always leaking ctrl_skb
  chelsio/chtls: fix memory leaks caused by a race
  can: flexcan: flexcan_remove(): disable wakeup completely
  can: flexcan: add ECC initialization for VF610
  can: flexcan: add ECC initialization for LX2160A
  can: flexcan: remove FLEXCAN_QUIRK_DISABLE_MECR quirk for LS1021A
  can: mcp251xfd: remove unneeded break
  can: mcp251xfd: mcp251xfd_regmap_nocrc_read(): fix semicolon.cocci warnings
  can: mcp251xfd: mcp251xfd_regmap_crc_read(): increase severity of CRC read error messages
  can: peak_canfd: pucan_handle_can_rx(): fix echo management when loopback is on
  can: peak_usb: peak_usb_get_ts_time(): fix timestamp wrapping
  can: peak_usb: add range checking in decode operations
  can: xilinx_can: handle failure cases of pm_runtime_get_sync
  can: ti_hecc: ti_hecc_probe(): add missed clk_disable_unprepare() in error path
  can: isotp: padlen(): make const array static, makes object smaller
  can: isotp: isotp_rcv_cf(): enable RX timeout handling in listen-only mode
  can: isotp: Explain PDU in CAN_ISOTP help text
  ...
2020-11-06 11:50:28 -08:00
Jakub Kicinski
ac6f929d74 linux-can-fixes-for-5.10-20201103
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl+hzPwTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqU8YB/9PBALnpZFDSyOE/8rKoBoqA2uPfj2i
 Yplu98jkFqhnb5I1KFPCNJiTQd+/aAzM2LzeGHVOBMIF6scPUclC12k1q4fdLtX0
 6YMZ38w2I2hq8z1QIgOYo7jQ34NeonNt7T5CHEeBA7xXGnlo/WYDNDE0cruPnPRZ
 eFqM5f1/PVKKh4gFVTAqICC2ZMefL4rgAkFgFXj2rfiYr115OEGAwCav5Ys31p/y
 MI5SfQmNkfkE8HswMNBDQZ+8V5qkKvarHXwUcRfgUqkpqHQjzcOIJnCDh/ngIh50
 imwxHaCerXvEj8MBUcF2fZV7w6QPTFIV3TQ0AiUjuVUE3HPuR+JPvSl4
 =B3CC
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-5.10-20201103' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2020-11-03

The first two patches are by Oleksij Rempel and they add a generic
can-controller Device Tree yaml binding and convert the text based binding
of the flexcan driver to a yaml based binding.

Zhang Changzhong's patch fixes a remove_proc_entry warning in the AF_CAN
core.

A patch by me fixes a kfree_skb() call from IRQ context in the rx-offload
helper.

Vincent Mailhol contributes a patch to prevent a call to kfree_skb() in
hard IRQ context in can_get_echo_skb().

Oliver Hartkopp's patch fixes the length calculation for RTR CAN frames
in the __can_get_echo_skb() helper.

Oleksij Rempel's patch fixes a use-after-free that shows up with j1939 in
can_create_echo_skb().

Yegor Yefremov contributes 4 patches to enhance the j1939 documentation.

Zhang Changzhong's patch fixes a hanging task problem in j1939_sk_bind()
if the netdev is down.

Then there are three patches for the newly added CAN_ISOTP protocol. Geert
Uytterhoeven enhances the kconfig help text. Oliver Hartkopp's patch adds
missing RX timeout handling in listen-only mode and Colin Ian King's patch
decreases the generated object code by 926 bytes.

Zhang Changzhong contributes a patch for the ti_hecc driver that fixes the
error path in the probe function.

Navid Emamdoost's patch for the xilinx_can driver fixes the error handling
in case of failing pm_runtime_get_sync().

There are two patches for the peak_usb driver. Dan Carpenter adds range
checking in decode operations and Stephane Grosjean's patch fixes
a timestamp wrapping problem.

Stephane Grosjean's patch for th peak_canfd driver fixes echo management if
loopback is on.

The next three patches all target the mcp251xfd driver. The first one is
by me and it increased the severity of CRC read error messages. The kernel
test robot removes an unneeded semicolon and Tom Rix removes unneeded
break in several switch-cases.

The last 4 patches are by Joakim Zhang and target the flexcan driver,
the first three fix ECC related device specific quirks for the LS1021A,
LX2160A and the VF610 SoC. The last patch disable wakeup completely upon
driver remove.

* tag 'linux-can-fixes-for-5.10-20201103' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: (27 commits)
  can: flexcan: flexcan_remove(): disable wakeup completely
  can: flexcan: add ECC initialization for VF610
  can: flexcan: add ECC initialization for LX2160A
  can: flexcan: remove FLEXCAN_QUIRK_DISABLE_MECR quirk for LS1021A
  can: mcp251xfd: remove unneeded break
  can: mcp251xfd: mcp251xfd_regmap_nocrc_read(): fix semicolon.cocci warnings
  can: mcp251xfd: mcp251xfd_regmap_crc_read(): increase severity of CRC read error messages
  can: peak_canfd: pucan_handle_can_rx(): fix echo management when loopback is on
  can: peak_usb: peak_usb_get_ts_time(): fix timestamp wrapping
  can: peak_usb: add range checking in decode operations
  can: xilinx_can: handle failure cases of pm_runtime_get_sync
  can: ti_hecc: ti_hecc_probe(): add missed clk_disable_unprepare() in error path
  can: isotp: padlen(): make const array static, makes object smaller
  can: isotp: isotp_rcv_cf(): enable RX timeout handling in listen-only mode
  can: isotp: Explain PDU in CAN_ISOTP help text
  can: j1939: j1939_sk_bind(): return failure if netdev is down
  can: j1939: use backquotes for code samples
  can: j1939: swap addr and pgn in the send example
  can: j1939: fix syntax and spelling
  can: j1939: rename jacd tool
  ...
====================

Link: https://lore.kernel.org/r/<20201103220636.972106-1-mkl@pengutronix.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-04 10:38:07 -08:00
Jakub Kicinski
2da4c187ae Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
1) Fix packet receiving of standard IP tunnels when the xfrm_interface
   module is installed. From Xin Long.

2) Fix a race condition between spi allocating and hash list
   resizing. From zhuoliang zhang.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-04 08:12:52 -08:00
Eelco Chaudron
fea07a487c net: openvswitch: silence suspicious RCU usage warning
Silence suspicious RCU usage warning in ovs_flow_tbl_masks_cache_resize()
by replacing rcu_dereference() with rcu_dereference_ovsl().

In addition, when creating a new datapath, make sure it's configured under
the ovs_lock.

Fixes: 9bf24f594c ("net: openvswitch: make masks cache size configurable")
Reported-by: syzbot+9a8f8bfcc56e8578016c@syzkaller.appspotmail.com
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/160439190002.56943.1418882726496275961.stgit@ebuild
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-03 16:57:42 -08:00
Colin Ian King
c3ddac4b0c can: isotp: padlen(): make const array static, makes object smaller
Don't populate the const array plen on the stack but instead it static. Makes
the object code smaller by 926 bytes.

Before:
   text	   data	    bss	    dec	    hex	filename
  26531	   1943	     64	  28538	   6f7a	net/can/isotp.o

After:
   text	   data	    bss	    dec	    hex	filename
  25509	   2039	     64	  27612	   6bdc	net/can/isotp.o

(gcc version 10.2.0)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20201020154203.54711-1-colin.king@canonical.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-03 22:30:32 +01:00
Oliver Hartkopp
78656ea235 can: isotp: isotp_rcv_cf(): enable RX timeout handling in listen-only mode
As reported by Thomas Wagner:

    https://github.com/hartkopp/can-isotp/issues/34

the timeout handling for data frames is not enabled when the isotp socket is
used in listen-only mode (sockopt CAN_ISOTP_LISTEN_MODE). This mode is enabled
by the isotpsniffer application which therefore became inconsistend with the
strict rx timeout rules when running the isotp protocol in the operational
mode.

This patch fixes this inconsistency by moving the return condition for the
listen-only mode behind the timeout handling code.

Reported-by: Thomas Wagner <thwa1@web.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Fixes: e057dd3fc2 ("can: add ISO 15765-2:2016 transport protocol")
Link: https://github.com/hartkopp/can-isotp/issues/34
Link: https://lore.kernel.org/r/20201019120229.89326-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-03 22:30:32 +01:00
Geert Uytterhoeven
5a7de2408f can: isotp: Explain PDU in CAN_ISOTP help text
The help text for the CAN_ISOTP config symbol uses the acronym "PDU".  However,
this acronym is not explained here, nor in Documentation/networking/can.rst.

Expand the acronym to make it easier for users to decide if they need to enable
the CAN_ISOTP option or not.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20201013141341.28487-1-geert+renesas@glider.be
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-03 22:30:31 +01:00
Zhang Changzhong
08c487d8d8 can: j1939: j1939_sk_bind(): return failure if netdev is down
When a netdev down event occurs after a successful call to
j1939_sk_bind(), j1939_netdev_notify() can handle it correctly.

But if the netdev already in down state before calling j1939_sk_bind(),
j1939_sk_release() will stay in wait_event_interruptible() blocked
forever. Because in this case, j1939_netdev_notify() won't be called and
j1939_tp_txtimer() won't call j1939_session_cancel() or other function
to clear session for ENETDOWN error, this lead to mismatch of
j1939_session_get/put() and jsk->skb_pending will never decrease to
zero.

To reproduce it use following commands:
1. ip link add dev vcan0 type vcan
2. j1939acd -r 100,80-120 1122334455667788 vcan0
3. presses ctrl-c and thread will be blocked forever

This patch adds check for ndev->flags in j1939_sk_bind() to avoid this
kind of situation and return with -ENETDOWN.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1599460308-18770-1-git-send-email-zhangchangzhong@huawei.com
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-03 22:30:31 +01:00
Zhang Changzhong
3accbfdc36 can: proc: can_remove_proc(): silence remove_proc_entry warning
If can_init_proc() fail to create /proc/net/can directory, can_remove_proc()
will trigger a warning:

WARNING: CPU: 6 PID: 7133 at fs/proc/generic.c:672 remove_proc_entry+0x17b0
Kernel panic - not syncing: panic_on_warn set ...

Fix to return early from can_remove_proc() if can proc_dir does not exists.

Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1594709090-3203-1-git-send-email-zhangchangzhong@huawei.com
Fixes: 8e8cda6d73 ("can: initial support for network namespaces")
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-03 22:24:19 +01:00
Davide Caratti
e16b874ee8 mptcp: token: fix unititialized variable
gcc complains about use of uninitialized 'num'. Fix it by doing the first
assignment of 'num' when the variable is declared.

Fixes: 96d890daad ("mptcp: add msk interations helper")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/49e20da5d467a73414d4294a8bd35e2cb1befd49.1604308087.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-03 13:08:30 -08:00
Petr Malat
b6df8c8141 sctp: Fix COMM_LOST/CANT_STR_ASSOC err reporting on big-endian platforms
Commit 978aa04741 ("sctp: fix some type cast warnings introduced since
very beginning")' broke err reading from sctp_arg, because it reads the
value as 32-bit integer, although the value is stored as 16-bit integer.
Later this value is passed to the userspace in 16-bit variable, thus the
user always gets 0 on big-endian platforms. Fix it by reading the __u16
field of sctp_arg union, as reading err field would produce a sparse
warning.

Fixes: 978aa04741 ("sctp: fix some type cast warnings introduced since very beginning")
Signed-off-by: Petr Malat <oss@malat.biz>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/20201030132633.7045-1-oss@malat.biz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-02 15:03:25 -08:00
Jakub Kicinski
04a55c944f A couple of fixes, for
* HE on 2.4 GHz
  * a few issues syzbot found, but we have many more reports :-(
  * a regression in nl80211-transported EAPOL frames which had
    affected a number of users, from Mathy
  * kernel-doc markings in mac80211, from Mauro
  * a format argument in reg.c, from Ye Bin
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl+b4B0ACgkQB8qZga/f
 l8SdOg/+PcKqoNXh+VgP2ZvCcN3D/3ow5mf9OgNijPoM35a4sJZdGpRWp1oUXK8O
 1NbyE/mL3TiNS5HIb+jF03+pHOl+ysh0AE3QsQeZwn+bHkU61T9J477NQr4Y9hQ0
 eJZxDjgSUsJhx1xnGPH0QFUi7zaFQAfy1Q5AVCPP5ywEOovTuOY9Qw/7D2EZoh5L
 k2H1kjLIle2VCckqrL5pno3dz1lAGRZ5RGGiP8/ATfBH6pYON9yFSflc9x6azTS2
 vWyfZxzbrFWvT2YFMwNUnNl4oNjLIvGYmYzULp9MweyFA6lfZpIOAcGtVoM98nD7
 wu6KWeozo4c+3D25kqxRlpE6fILJ+uiCKcHV+7GyLsDkp9s2onE5f/UHmcP+pGkh
 QE/ubTbe2brWSpPwHyAXEg1FQ3WPmJj90Tr3OA2j+rIfh/+eUH/inoxvddqyOamR
 mBr8M1VRY8+PRAru9UKU+EG4CueX5GALxbOH8rJFtlDsJz33CxGm1sxpAc9pR4CX
 XYuaPYsunok7/tXRxXulaCE0B6DOqyhX8L7drVJ0nEpAv7J3WfkZxlFs8+3VPBG7
 BMiwNqNMKHM613a16vvl8ZOsguzRPIxhPLWVbcUb5c6NQq+4FOn5pZYtZU8jZ19e
 w3iDPbfaCCARnf5U5JKqTt45q+71rbU9pSncMjy7r5/r1hPGwiQ=
 =2AGz
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-10-30' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A couple of fixes, for
 * HE on 2.4 GHz
 * a few issues syzbot found, but we have many more reports :-(
 * a regression in nl80211-transported EAPOL frames which had
   affected a number of users, from Mathy
 * kernel-doc markings in mac80211, from Mauro
 * a format argument in reg.c, from Ye Bin
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-02 09:43:54 -08:00
Jakub Kicinski
859191b234 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Incorrect netlink report logic in flowtable and genID.

2) Add a selftest to check that wireguard passes the right sk
   to ip_route_me_harder, from Jason A. Donenfeld.

3) Pass the actual sk to ip_route_me_harder(), also from Jason.

4) Missing expression validation of updates via nft --check.

5) Update byte and packet counters regardless of whether they
   match, from Stefano Brivio.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31 17:34:19 -07:00
wenxu
20149e9eb6 ip_tunnel: fix over-mtu packet send fail without TUNNEL_DONT_FRAGMENT flags
The tunnel device such as vxlan, bareudp and geneve in the lwt mode set
the outer df only based TUNNEL_DONT_FRAGMENT.
And this was also the behavior for gre device before switching to use
ip_md_tunnel_xmit in commit 962924fa2b ("ip_gre: Refactor collect
metatdata mode tunnel xmit to ip_md_tunnel_xmit")

When the ip_gre in lwt mode xmit with ip_md_tunnel_xmi changed the rule and
make the discrepancy between handling of DF by different tunnels. So in the
ip_md_tunnel_xmit should follow the same rule like other tunnels.

Fixes: cfc7381b30 ("ip_tunnel: add collect_md mode to IPIP tunnel")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Link: https://lore.kernel.org/r/1604028728-31100-1-git-send-email-wenxu@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31 17:19:02 -07:00
Linus Torvalds
53760f9b74 flexible-array member conversion patches for 5.10-rc2
Hi Linus,
 
 Please, pull the following patches that replace zero-length arrays with
 flexible-array members.
 
 Thanks
 --
 Gustavo
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAl+cjRUACgkQRwW0y0cG
 2zGWAhAAjUfTsAmXWhKNaWFSCYR0Q822puTUWOKfiBd+jjGaO04luTtr2gjv2Dkb
 Vgad8H4N8oZU79xfh5JZ5PUyScaso8wE6ZJTh2PLKXpKmNd213f5x/pIt78CCDTa
 Y1L/eR41mmveTL3VNS3sf6WaZpT9owxJKGIY8JgdiOmSjxJQpX5zdaC1KYso4eXr
 lIXIRo9VLEmVLhhHhZi+QmX6+aQ05E1D9K0ENe4/uEnRsV525W78iwZ4fYeLzr+A
 krEOdgx6sPgzajPYnHoayrrcKNKxD5YY1SWuVSm2tqYYIhlRoK3f5xgLOd10RiHE
 YMgx8aWzGmGJwoUhgp1bo/l9EZ7O8OWRqM/GOP4x6Wgjdhqw2x5jgskmhsKNGEXu
 /BlbS+qL5aUrMCxhvNbApuZW6xBiBbva76MH3vU9vFhZbVz1CHLQdGI0tfxggYWS
 jc2UPgoxL9OQlf3jSc+gK7RMFhBGNWn2Aiy8GQas3BxPYXuYPvwOj+irDOG/qZ9D
 VZ5swUw4+th+DsF5K53mEFeLv0fONMgL9Ka5bNR6+k6HG0WNLYYVOiet3xYUDo1f
 eZbMZthfc+QW7R8cwG0WuFk6rC6mLqE+A9nQuLZoJD+VMuJd4pwW9+6EW8nDX08w
 FS4/o92xUFJfOCgaLRS61FSAuSmFENieN+yoKMK/Uf6PJVdNMb4=
 =vyu3
 -----END PGP SIGNATURE-----

Merge tag 'flexible-array-conversions-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux

Pull more flexible-array member conversions from Gustavo A. R. Silva:
 "Replace zero-length arrays with flexible-array members"

* tag 'flexible-array-conversions-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  printk: ringbuffer: Replace zero-length array with flexible-array member
  net/smc: Replace zero-length array with flexible-array member
  net/mlx5: Replace zero-length array with flexible-array member
  mei: hw: Replace zero-length array with flexible-array member
  gve: Replace zero-length array with flexible-array member
  Bluetooth: btintel: Replace zero-length array with flexible-array member
  scsi: target: tcmu: Replace zero-length array with flexible-array member
  ima: Replace zero-length array with flexible-array member
  enetc: Replace zero-length array with flexible-array member
  fs: Replace zero-length array with flexible-array member
  Bluetooth: Replace zero-length array with flexible-array member
  params: Replace zero-length array with flexible-array member
  tracepoint: Replace zero-length array with flexible-array member
  platform/chrome: cros_ec_proto: Replace zero-length array with flexible-array member
  platform/chrome: cros_ec_commands: Replace zero-length array with flexible-array member
  mailbox: zynqmp-ipi-message: Replace zero-length array with flexible-array member
  dmaengine: ti-cppi5: Replace zero-length array with flexible-array member
2020-10-31 14:31:28 -07:00
Hangbin Liu
2efdaaaf88 IPv6: reply ICMP error if the first fragment don't include all headers
Based on RFC 8200, Section 4.5 Fragment Header:

  -  If the first fragment does not include all headers through an
     Upper-Layer header, then that fragment should be discarded and
     an ICMP Parameter Problem, Code 3, message should be sent to
     the source of the fragment, with the Pointer field set to zero.

Checking each packet header in IPv6 fast path will have performance impact,
so I put the checking in ipv6_frag_rcv().

As the packet may be any kind of L4 protocol, I only checked some common
protocols' header length and handle others by (offset + 1) > skb->len.
Also use !(frag_off & htons(IP6_OFFSET)) to catch atomic fragments
(fragmented packet with only one fragment).

When send ICMP error message, if the 1st truncated fragment is ICMP message,
icmp6_send() will break as is_ineligible() return true. So I added a check
in is_ineligible() to let fragment packet with nexthdr ICMP but no ICMP header
return false.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31 13:16:02 -07:00
Colin Ian King
2f71e00619 net: atm: fix update of position index in lec_seq_next
The position index in leq_seq_next is not updated when the next
entry is fetched an no more entries are available. This causes
seq_file to report the following error:

"seq_file: buggy .next function lec_seq_next [lec] did not update
 position index"

Fix this by always updating the position index.

[ Note: this is an ancient 2002 bug, the sha is from the
  tglx/history repo ]

Fixes 4aea2cbff417 ("[ATM]: Move lan seq_file ops to lec.c [1/3]")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20201027114925.21843-1-colin.king@canonical.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-31 12:26:30 -07:00
Stefano Brivio
7d10e62c2f netfilter: ipset: Update byte and packet counters regardless of whether they match
In ip_set_match_extensions(), for sets with counters, we take care of
updating counters themselves by calling ip_set_update_counter(), and of
checking if the given comparison and values match, by calling
ip_set_match_counter() if needed.

However, if a given comparison on counters doesn't match the configured
values, that doesn't mean the set entry itself isn't matching.

This fix restores the behaviour we had before commit 4750005a85
("netfilter: ipset: Fix "don't update counters" mode when counters used
at the matching"), without reintroducing the issue fixed there: back
then, mtype_data_match() first updated counters in any case, and then
took care of matching on counters.

Now, if the IPSET_FLAG_SKIP_COUNTER_UPDATE flag is set,
ip_set_update_counter() will anyway skip counter updates if desired.

The issue observed is illustrated by this reproducer:

  ipset create c hash:ip counters
  ipset add c 192.0.2.1
  iptables -I INPUT -m set --match-set c src --bytes-gt 800 -j DROP

if we now send packets from 192.0.2.1, bytes and packets counters
for the entry as shown by 'ipset list' are always zero, and, no
matter how many bytes we send, the rule will never match, because
counters themselves are not updated.

Reported-by: Mithil Mhatre <mmhatre@redhat.com>
Fixes: 4750005a85 ("netfilter: ipset: Fix "don't update counters" mode when counters used at the matching")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-31 11:11:11 +01:00
Gustavo A. R. Silva
7206d58a3a net/smc: Replace zero-length array with flexible-array member
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arrays

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-30 16:57:42 -05:00
Alexander Ovechkin
9e7c5b396e ip6_tunnel: set inner ipproto before ip6_tnl_encap
ip6_tnl_encap assigns to proto transport protocol which
encapsulates inner packet, but we must pass to set_inner_ipproto
protocol of that inner packet.

Calling set_inner_ipproto after ip6_tnl_encap might break gso.
For example, in case of encapsulating ipv6 packet in fou6 packet, inner_ipproto
would be set to IPPROTO_UDP instead of IPPROTO_IPV6. This would lead to
incorrect calling sequence of gso functions:
ipv6_gso_segment -> udp6_ufo_fragment -> skb_udp_tunnel_segment -> udp6_ufo_fragment
instead of:
ipv6_gso_segment -> udp6_ufo_fragment -> skb_udp_tunnel_segment -> ip6ip6_gso_segment

Fixes: 6c11fbf97e ("ip6_tunnel: add MPLS transmit support")
Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru>
Link: https://lore.kernel.org/r/20201029171012.20904-1-ovov@yandex-team.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-30 08:07:30 -07:00
Pablo Neira Ayuso
c0391b6ab8 netfilter: nf_tables: missing validation from the abort path
If userspace does not include the trailing end of batch message, then
nfnetlink aborts the transaction. This allows to check that ruleset
updates trigger no errors.

After this patch, invoking this command from the prerouting chain:

 # nft -c add rule x y fib saddr . oif type local

fails since oif is not supported there.

This patch fixes the lack of rule validation from the abort/check path
to catch configuration errors such as the one above.

Fixes: a654de8fdc ("netfilter: nf_tables: fix chain dependency validation")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-30 12:57:39 +01:00
Jason A. Donenfeld
46d6c5ae95 netfilter: use actual socket sk rather than skb sk when routing harder
If netfilter changes the packet mark when mangling, the packet is
rerouted using the route_me_harder set of functions. Prior to this
commit, there's one big difference between route_me_harder and the
ordinary initial routing functions, described in the comment above
__ip_queue_xmit():

   /* Note: skb->sk can be different from sk, in case of tunnels */
   int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,

That function goes on to correctly make use of sk->sk_bound_dev_if,
rather than skb->sk->sk_bound_dev_if. And indeed the comment is true: a
tunnel will receive a packet in ndo_start_xmit with an initial skb->sk.
It will make some transformations to that packet, and then it will send
the encapsulated packet out of a *new* socket. That new socket will
basically always have a different sk_bound_dev_if (otherwise there'd be
a routing loop). So for the purposes of routing the encapsulated packet,
the routing information as it pertains to the socket should come from
that socket's sk, rather than the packet's original skb->sk. For that
reason __ip_queue_xmit() and related functions all do the right thing.

One might argue that all tunnels should just call skb_orphan(skb) before
transmitting the encapsulated packet into the new socket. But tunnels do
*not* do this -- and this is wisely avoided in skb_scrub_packet() too --
because features like TSQ rely on skb->destructor() being called when
that buffer space is truely available again. Calling skb_orphan(skb) too
early would result in buffers filling up unnecessarily and accounting
info being all wrong. Instead, additional routing must take into account
the new sk, just as __ip_queue_xmit() notes.

So, this commit addresses the problem by fishing the correct sk out of
state->sk -- it's already set properly in the call to nf_hook() in
__ip_local_out(), which receives the sk as part of its normal
functionality. So we make sure to plumb state->sk through the various
route_me_harder functions, and then make correct use of it following the
example of __ip_queue_xmit().

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-30 12:57:39 +01:00
Pablo Neira Ayuso
dceababac2 netfilter: nftables: fix netlink report logic in flowtable and genid
The netlink report should be sent regardless the available listeners.

Fixes: 84d7fce693 ("netfilter: nf_tables: export rule-set generation ID")
Fixes: 3b49e2e94e ("netfilter: nf_tables: add flow table netlink frontend")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-30 12:57:38 +01:00
Johannes Berg
c2f4681452 mac80211: don't require VHT elements for HE on 2.4 GHz
After the previous similar bugfix there was another bug here,
if no VHT elements were found we also disabled HE. Fix this to
disable HE only on the 5 GHz band; on 6 GHz it was already not
disabled, and on 2.4 GHz there need (should) not be any VHT.

Fixes: 57fa5e85d5 ("mac80211: determine chandef from HE 6 GHz operation")
Link: https://lore.kernel.org/r/20201013140156.535a2fc6192f.Id6e5e525a60ac18d245d86f4015f1b271fce6ee6@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30 10:22:42 +01:00
Ye Bin
db18d20d1c cfg80211: regulatory: Fix inconsistent format argument
Fix follow warning:
[net/wireless/reg.c:3619]: (warning) %d in format string (no. 2)
requires 'int' but the argument type is 'unsigned int'.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20201009070215.63695-1-yebin10@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30 10:06:56 +01:00
Mauro Carvalho Chehab
b1e8eb11fb mac80211: fix kernel-doc markups
Some identifiers have different names between their prototypes
and the kernel-doc markup.

Others need to be fixed, as kernel-doc markups should use this format:
        identifier - description

In the specific case of __sta_info_flush(), add a documentation
for sta_info_flush(), as this one is the one used outside
sta_info.c.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Link: https://lore.kernel.org/r/978d35eef2dc76e21c81931804e4eaefbd6d635e.1603469755.git.mchehab+huawei@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30 10:06:09 +01:00
Johannes Berg
dcd479e10a mac80211: always wind down STA state
When (for example) an IBSS station is pre-moved to AUTHORIZED
before it's inserted, and then the insertion fails, we don't
clean up the fast RX/TX states that might already have been
created, since we don't go through all the state transitions
again on the way down.

Do that, if it hasn't been done already, when the station is
freed. I considered only freeing the fast TX/RX state there,
but we might add more state so it's more robust to wind down
the state properly.

Note that we warn if the station was ever inserted, it should
have been properly cleaned up in that case, and the driver
will probably not like things happening out of order.

Reported-by: syzbot+2e293dbd67de2836ba42@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20201009141710.7223b322a955.I95bd08b9ad0e039c034927cce0b75beea38e059b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30 10:05:12 +01:00
Johannes Berg
9bdaf3b91e cfg80211: initialize wdev data earlier
There's a race condition in the netdev registration in that
NETDEV_REGISTER actually happens after the netdev is available,
and so if we initialize things only there, we might get called
with an uninitialized wdev through nl80211 - not using a wdev
but using a netdev interface index.

I found this while looking into a syzbot report, but it doesn't
really seem to be related, and unfortunately there's no repro
for it (yet). I can't (yet) explain how it managed to get into
cfg80211_release_pmsr() from nl80211_netlink_notify() without
the wdev having been initialized, as the latter only iterates
the wdevs that are linked into the rdev, which even without the
change here happened after init.

However, looking at this, it seems fairly clear that the init
needs to be done earlier, otherwise we might even re-init on a
netns move, when data might still be pending.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20201009135821.fdcbba3aad65.Ie9201d91dbcb7da32318812effdc1561aeaf4cdc@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30 10:03:59 +01:00
Johannes Berg
14f46c1e51 mac80211: fix use of skb payload instead of header
When ieee80211_skb_resize() is called from ieee80211_build_hdr()
the skb has no 802.11 header yet, in fact it consist only of the
payload as the ethernet frame is removed. As such, we're using
the payload data for ieee80211_is_mgmt(), which is of course
completely wrong. This didn't really hurt us because these are
always data frames, so we could only have added more tailroom
than we needed if we determined it was a management frame and
sdata->crypto_tx_tailroom_needed_cnt was false.

However, syzbot found that of course there need not be any payload,
so we're using at best uninitialized memory for the check.

Fix this to pass explicitly the kind of frame that we have instead
of checking there, by replacing the "bool may_encrypt" argument
with an argument that can carry the three possible states - it's
not going to be encrypted, it's a management frame, or it's a data
frame (and then we check sdata->crypto_tx_tailroom_needed_cnt).

Reported-by: syzbot+32fd1a1bfe355e93f1e2@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20201009132538.e1fd7f802947.I799b288466ea2815f9d4c84349fae697dca2f189@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30 10:03:48 +01:00
Mathy Vanhoef
804fc6a293 mac80211: fix regression where EAPOL frames were sent in plaintext
When sending EAPOL frames via NL80211 they are treated as injected
frames in mac80211. Due to commit 1df2bdba52 ("mac80211: never drop
injected frames even if normally not allowed") these injected frames
were not assigned a sta context in the function ieee80211_tx_dequeue,
causing certain wireless network cards to always send EAPOL frames in
plaintext. This may cause compatibility issues with some clients or
APs, which for instance can cause the group key handshake to fail and
in turn would cause the station to get disconnected.

This commit fixes this regression by assigning a sta context in
ieee80211_tx_dequeue to injected frames as well.

Note that sending EAPOL frames in plaintext is not a security issue
since they contain their own encryption and authentication protection.

Cc: stable@vger.kernel.org
Fixes: 1df2bdba52 ("mac80211: never drop injected frames even if normally not allowed")
Reported-by: Thomas Deutschmann <whissi@gentoo.org>
Tested-by: Christian Hesse <list@eworm.de>
Tested-by: Thomas Deutschmann <whissi@gentoo.org>
Signed-off-by: Mathy Vanhoef <Mathy.Vanhoef@kuleuven.be>
Link: https://lore.kernel.org/r/20201019160113.350912-1-Mathy.Vanhoef@kuleuven.be
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-10-30 10:03:24 +01:00
Gustavo A. R. Silva
b08eadd272 Bluetooth: Replace zero-length array with flexible-array member
There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-10-29 17:22:59 -05:00
Linus Torvalds
934291ffb6 Networking fixes for 5.10-rc2.
Current release regressions:
 
  - r8169: fix forced threading conflicting with other shared
    interrupts; we tried to fix the use of raise_softirq_irqoff
    from an IRQ handler on RT by forcing hard irqs, but this
    driver shares legacy PCI IRQs so drop the _irqoff() instead
 
  - tipc: fix memory leak caused by a recent syzbot report fix
    to tipc_buf_append()
 
 Current release - bugs in new features:
 
  - devlink: Unlock on error in dumpit() and fix some error codes
 
  - net/smc: fix null pointer dereference in smc_listen_decline()
 
 Previous release - regressions:
 
  - tcp: Prevent low rmem stalls with SO_RCVLOWAT.
 
  - net: protect tcf_block_unbind with block lock
 
  - ibmveth: Fix use of ibmveth in a bridge; the self-imposed filtering
    to only send legal frames to the hypervisor was too strict
 
  - net: hns3: Clear the CMDQ registers before unmapping BAR region;
    incorrect cleanup order was leading to a crash
 
  - bnxt_en - handful of fixes to fixes:
     - Send HWRM_FUNC_RESET fw command unconditionally, even
       if there are PCIe errors being reported
     - Check abort error state in bnxt_open_nic().
     - Invoke cancel_delayed_work_sync() for PFs also.
     - Fix regression in workqueue cleanup logic in bnxt_remove_one().
 
  - mlxsw: Only advertise link modes supported by both driver
    and device, after removal of 56G support from the driver
    56G was not cleared from advertised modes
 
  - net/smc: fix suppressed return code
 
 Previous release - always broken:
 
  - netem: fix zero division in tabledist, caused by integer overflow
 
  - bnxt_en: Re-write PCI BARs after PCI fatal error.
 
  - cxgb4: set up filter action after rewrites
 
  - net: ipa: command payloads already mapped
 
 Misc:
 
  - s390/ism: fix incorrect system EID, it's okay to change since
    it was added in current release
 
  - vsock: use ns_capable_noaudit() on socket create to suppress
    false positive audit messages
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+bGTcACgkQMUZtbf5S
 IrtMvxAAldlA7x22atOHJ2HMTqUGK3rlIQYgxlWJbfDnA7Ui4rZTDa/K0VkuS4ey
 rfaBf37XLDmzZkHgYvXG1qV2kB0MrXQqF7jJn+BNlAuM1kIsURt85Y2FxVu/+x6X
 wWtBgg/D77VXpeMimGcp8wBg5xFlUDdTezo+tInSuY9ahi1dUQx3ZSBTgqz3a5Vn
 wUwD7U0wkBEHkZFeLE6u0tdN9wY8IHH6cbMfzfnPxxIv6VVUOcQcvbomc+reEPhH
 vxeCHg7tK3yxbe9cPEbuwVDpoapB8Y627rv08Njhfuxx6Yysp/OOvUNRIBeD/7Gi
 TiZc6RMQ9XZ9QoGueaxFVSFIGRpRIQiO/gh+O5lWVX8dGsIjlKnw2E8gWmSS48YP
 cMAez0Fe+CJ2S2QNFbGVyJJX6xOl5h6kQaf88OiEhudpEUgyz156MNVwbJnE4fYk
 8GONCIea1hNjLQ1VUfcQEYdxChWVeAoUEZIFcK2YKA+1w9Ris6hV21j/aUxYXQRt
 RGOALFUtCRIEX28ZW8eEyXgp1EdUvp7qcIK5YZEF6YHWlRxQ8LkU6qhD7Mm2oqkE
 fydoMDz9TEBaWqFtpgQmZH76JYqd7btCsR2YPwnlKmcKQ3tEKtW0NKt1QH/DKcvm
 nmDA6A+52XSbar1sRlVPnr3IGfodqGQ3A35sVFS8jkcmMvDRlbk=
 =reLi
 -----END PGP SIGNATURE-----

Merge tag 'net-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Current release regressions:

   - r8169: fix forced threading conflicting with other shared
     interrupts; we tried to fix the use of raise_softirq_irqoff from an
     IRQ handler on RT by forcing hard irqs, but this driver shares
     legacy PCI IRQs so drop the _irqoff() instead

   - tipc: fix memory leak caused by a recent syzbot report fix to
     tipc_buf_append()

  Current release - bugs in new features:

   - devlink: Unlock on error in dumpit() and fix some error codes

   - net/smc: fix null pointer dereference in smc_listen_decline()

  Previous release - regressions:

   - tcp: Prevent low rmem stalls with SO_RCVLOWAT.

   - net: protect tcf_block_unbind with block lock

   - ibmveth: Fix use of ibmveth in a bridge; the self-imposed filtering
     to only send legal frames to the hypervisor was too strict

   - net: hns3: Clear the CMDQ registers before unmapping BAR region;
     incorrect cleanup order was leading to a crash

   - bnxt_en - handful of fixes to fixes:
      - Send HWRM_FUNC_RESET fw command unconditionally, even if there
        are PCIe errors being reported
      - Check abort error state in bnxt_open_nic().
      - Invoke cancel_delayed_work_sync() for PFs also.
      - Fix regression in workqueue cleanup logic in bnxt_remove_one().

   - mlxsw: Only advertise link modes supported by both driver and
     device, after removal of 56G support from the driver 56G was not
     cleared from advertised modes

   - net/smc: fix suppressed return code

  Previous release - always broken:

   - netem: fix zero division in tabledist, caused by integer overflow

   - bnxt_en: Re-write PCI BARs after PCI fatal error.

   - cxgb4: set up filter action after rewrites

   - net: ipa: command payloads already mapped

  Misc:

   - s390/ism: fix incorrect system EID, it's okay to change since it
     was added in current release

   - vsock: use ns_capable_noaudit() on socket create to suppress false
     positive audit messages"

* tag 'net-5.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits)
  r8169: fix issue with forced threading in combination with shared interrupts
  netem: fix zero division in tabledist
  ibmvnic: fix ibmvnic_set_mac
  mptcp: add missing memory scheduling in the rx path
  tipc: fix memory leak caused by tipc_buf_append()
  gtp: fix an use-before-init in gtp_newlink()
  net: protect tcf_block_unbind with block lock
  ibmveth: Fix use of ibmveth in a bridge.
  net/sched: act_mpls: Add softdep on mpls_gso.ko
  ravb: Fix bit fields checking in ravb_hwtstamp_get()
  devlink: Unlock on error in dumpit()
  devlink: Fix some error codes
  chelsio/chtls: fix memory leaks in CPL handlers
  chelsio/chtls: fix deadlock issue
  net: hns3: Clear the CMDQ registers before unmapping BAR region
  bnxt_en: Send HWRM_FUNC_RESET fw command unconditionally.
  bnxt_en: Check abort error state in bnxt_open_nic().
  bnxt_en: Re-write PCI BARs after PCI fatal error.
  bnxt_en: Invoke cancel_delayed_work_sync() for PFs also.
  bnxt_en: Fix regression in workqueue cleanup logic in bnxt_remove_one().
  ...
2020-10-29 12:55:02 -07:00
Aleksandr Nogikh
eadd1befdd netem: fix zero division in tabledist
Currently it is possible to craft a special netlink RTM_NEWQDISC
command that can result in jitter being equal to 0x80000000. It is
enough to set the 32 bit jitter to 0x02000000 (it will later be
multiplied by 2^6) or just set the 64 bit jitter via
TCA_NETEM_JITTER64. This causes an overflow during the generation of
uniformly distributed numbers in tabledist(), which in turn leads to
division by zero (sigma != 0, but sigma * 2 is 0).

The related fragment of code needs 32-bit division - see commit
9b0ed89 ("netem: remove unnecessary 64 bit modulus"), so switching to
64 bit is not an option.

Fix the issue by keeping the value of jitter within the range that can
be adequately handled by tabledist() - [0;INT_MAX]. As negative std
deviation makes no sense, take the absolute value of the passed value
and cap it at INT_MAX. Inside tabledist(), switch to unsigned 32 bit
arithmetic in order to prevent overflows.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Aleksandr Nogikh <nogikh@google.com>
Reported-by: syzbot+ec762a6342ad0d3c0d8f@syzkaller.appspotmail.com
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/r/20201028170731.1383332-1-aleksandrnogikh@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29 11:45:47 -07:00
Paolo Abeni
9c3f94e168 mptcp: add missing memory scheduling in the rx path
When moving the skbs from the subflow into the msk receive
queue, we must schedule there the required amount of memory.

Try to borrow the required memory from the subflow, if needed,
so that we leverage the existing TCP heuristic.

Fixes: 6771bfd9ee ("mptcp: update mptcp ack sequence from work queue")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Link: https://lore.kernel.org/r/f6143a6193a083574f11b00dbf7b5ad151bc4ff4.1603810630.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29 11:27:14 -07:00
Tung Nguyen
ceb1eb2fb6 tipc: fix memory leak caused by tipc_buf_append()
Commit ed42989eab ("tipc: fix the skb_unshare() in tipc_buf_append()")
replaced skb_unshare() with skb_copy() to not reduce the data reference
counter of the original skb intentionally. This is not the correct
way to handle the cloned skb because it causes memory leak in 2
following cases:
 1/ Sending multicast messages via broadcast link
  The original skb list is cloned to the local skb list for local
  destination. After that, the data reference counter of each skb
  in the original list has the value of 2. This causes each skb not
  to be freed after receiving ACK:
  tipc_link_advance_transmq()
  {
   ...
   /* release skb */
   __skb_unlink(skb, &l->transmq);
   kfree_skb(skb); <-- memory exists after being freed
  }

 2/ Sending multicast messages via replicast link
  Similar to the above case, each skb cannot be freed after purging
  the skb list:
  tipc_mcast_xmit()
  {
   ...
   __skb_queue_purge(pkts); <-- memory exists after being freed
  }

This commit fixes this issue by using skb_unshare() instead. Besides,
to avoid use-after-free error reported by KASAN, the pointer to the
fragment is set to NULL before calling skb_unshare() to make sure that
the original skb is not freed after freeing the fragment 2 times in
case skb_unshare() returns NULL.

Fixes: ed42989eab ("tipc: fix the skb_unshare() in tipc_buf_append()")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Reported-by: Thang Hoang Ngo <thang.h.ngo@dektech.com.au>
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://lore.kernel.org/r/20201027032403.1823-1-tung.q.nguyen@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29 09:51:52 -07:00
Magnus Karlsson
e5e1a4bc91 xsk: Fix possible memory leak at socket close
Fix a possible memory leak at xsk socket close that is caused by the
refcounting of the umem object being wrong. The reference count of the
umem was decremented only after the pool had been freed. Note that if
the buffer pool is destroyed, it is important that the umem is
destroyed after the pool, otherwise the umem would disappear while the
driver is still running. And as the buffer pool needs to be destroyed
in a work queue, the umem is also (if its refcount reaches zero)
destroyed after the buffer pool in that same work queue.

What was missing is that the refcount also needs to be decremented
when the pool is not freed and when the pool has not even been
created. The first case happens when the refcount of the pool is
higher than 1, i.e. it is still being used by some other socket using
the same device and queue id. In this case, it is safe to decrement
the refcount of the umem outside of the work queue as the umem will
never be freed because the refcount of the umem is always greater than
or equal to the refcount of the buffer pool. The second case is if the
buffer pool has not been created yet, i.e. the socket was closed
before it was bound but after the umem was created. In this case, it
is safe to destroy the umem outside of the work queue, since there is
no pool that can use it by definition.

Fixes: 1c1efc2af1 ("xsk: Create and free buffer pool independently from umem")
Reported-by: syzbot+eb71df123dc2be2c1456@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1603801921-2712-1-git-send-email-magnus.karlsson@gmail.com
2020-10-29 15:19:56 +01:00
Jason Gunthorpe
071ba4cc55 RDMA: Add rdma_connect_locked()
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED, either the
handler triggers a completion and another thread does rdma_connect() or
the handler directly calls rdma_connect().

In all cases rdma_connect() needs to hold the handler_mutex, but when
handler's are invoked this is already held by the core code. This causes
ULPs using the 2nd method to deadlock.

Provide a rdma_connect_locked() and have all ULPs call it from their
handlers.

Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com
Reported-and-tested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Fixes: 2a7cec5381 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state")
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-28 09:14:49 -03:00
Leon Romanovsky
d6535dca28 net: protect tcf_block_unbind with block lock
The tcf_block_unbind() expects that the caller will take block->cb_lock
before calling it, however the code took RTNL lock and dropped cb_lock
instead. This causes to the following kernel panic.

 WARNING: CPU: 1 PID: 13524 at net/sched/cls_api.c:1488 tcf_block_unbind+0x2db/0x420
 Modules linked in: mlx5_ib mlx5_core mlxfw ptp pps_core act_mirred act_tunnel_key cls_flower vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad ib_ipoib rdma_cm iw_cm ib_cm ib_uverbs ib_core overlay [last unloaded: mlxfw]
 CPU: 1 PID: 13524 Comm: test-ecmp-add-v Tainted: G        W         5.9.0+ #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:tcf_block_unbind+0x2db/0x420
 Code: ff 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e 41 5f c3 49 8d bc 24 30 01 00 00 be ff ff ff ff e8 7d 7f 70 00 85 c0 0f 85 7b fd ff ff <0f> 0b e9 74 fd ff ff 48 c7 c7 dc 6a 24 84 e8 02 ec fe fe e9 55 fd
 RSP: 0018:ffff888117d17968 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff88812f713c00 RCX: 1ffffffff0848d5b
 RDX: 0000000000000001 RSI: ffff88814fbc8130 RDI: ffff888107f2b878
 RBP: 1ffff11022fa2f3f R08: 0000000000000000 R09: ffffffff84115a87
 R10: fffffbfff0822b50 R11: ffff888107f2b898 R12: ffff88814fbc8000
 R13: ffff88812f713c10 R14: ffff888117d17a38 R15: ffff88814fbc80c0
 FS:  00007f6593d36740(0000) GS:ffff8882a4f00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00005607a00758f8 CR3: 0000000131aea006 CR4: 0000000000170ea0
 Call Trace:
  tc_block_indr_cleanup+0x3e0/0x5a0
  ? tcf_block_unbind+0x420/0x420
  ? __mutex_unlock_slowpath+0xe7/0x610
  flow_indr_dev_unregister+0x5e2/0x930
  ? mlx5e_restore_tunnel+0xdf0/0xdf0 [mlx5_core]
  ? mlx5e_restore_tunnel+0xdf0/0xdf0 [mlx5_core]
  ? flow_indr_block_cb_alloc+0x3c0/0x3c0
  ? mlx5_db_free+0x37c/0x4b0 [mlx5_core]
  mlx5e_cleanup_rep_tx+0x8b/0xc0 [mlx5_core]
  mlx5e_detach_netdev+0xe5/0x120 [mlx5_core]
  mlx5e_vport_rep_unload+0x155/0x260 [mlx5_core]
  esw_offloads_disable+0x227/0x2b0 [mlx5_core]
  mlx5_eswitch_disable_locked.cold+0x38e/0x699 [mlx5_core]
  mlx5_eswitch_disable+0x94/0xf0 [mlx5_core]
  mlx5_device_disable_sriov+0x183/0x1f0 [mlx5_core]
  mlx5_core_sriov_configure+0xfd/0x230 [mlx5_core]
  sriov_numvfs_store+0x261/0x2f0
  ? sriov_drivers_autoprobe_store+0x110/0x110
  ? sysfs_file_ops+0x170/0x170
  ? sysfs_file_ops+0x117/0x170
  ? sysfs_file_ops+0x170/0x170
  kernfs_fop_write+0x1ff/0x3f0
  ? rcu_read_lock_any_held+0x6e/0x90
  vfs_write+0x1f3/0x620
  ksys_write+0xf9/0x1d0
  ? __x64_sys_read+0xb0/0xb0
  ? lockdep_hardirqs_on_prepare+0x273/0x3f0
  ? syscall_enter_from_user_mode+0x1d/0x50
  do_syscall_64+0x2d/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

<...>

 ---[ end trace bfdd028ada702879 ]---

Fixes: 0fdcf78d59 ("net: use flow_indr_dev_setup_offload()")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20201026123327.1141066-1-leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-27 17:58:36 -07:00
Guillaume Nault
501b72ae24 net/sched: act_mpls: Add softdep on mpls_gso.ko
TCA_MPLS_ACT_PUSH and TCA_MPLS_ACT_MAC_PUSH might be used on gso
packets. Such packets will thus require mpls_gso.ko for segmentation.

v2: Drop dependency on CONFIG_NET_MPLS_GSO in Kconfig (from Jakub and
    David).

Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/1f6cab15bbd15666795061c55563aaf6a386e90e.1603708007.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-27 17:17:06 -07:00
Dan Carpenter
0d8cb9464a devlink: Unlock on error in dumpit()
This needs to unlock before returning.

Fixes: 544e7c33ec ("net: devlink: Add support for port regions")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20201026080127.GB1628785@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-27 17:05:57 -07:00
Dan Carpenter
6c211809c8 devlink: Fix some error codes
These paths don't set the error codes.  It's especially important in
devlink_nl_region_notify_build() where it leads to a NULL dereference in
the caller.

Fixes: 544e7c33ec ("net: devlink: Add support for port regions")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20201026080059.GA1628785@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-27 17:05:57 -07:00
Karsten Graul
96d6fded95 net/smc: fix suppressed return code
The patch that repaired the invalid return code in smcd_new_buf_create()
missed to take care of errno ENOSPC which has a special meaning that no
more DMBEs can be registered on the device. Fix that by keeping this
errno value during the translation of the return code.

Fixes: 6b1bbf94ab ("net/smc: fix invalid return code in smcd_new_buf_create()")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-26 16:29:14 -07:00
Karsten Graul
4a9baf45fd net/smc: fix null pointer dereference in smc_listen_decline()
smc_listen_work() calls smc_listen_decline() on label out_decl,
providing the ini pointer variable. But this pointer can still be null
when the label out_decl is reached.
Fix this by checking the ini variable in smc_listen_work() and call
smc_listen_decline() with the result directly.

Fixes: a7c9c5f4af ("net/smc: CLC accept / confirm V2")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-26 16:29:14 -07:00
Jeff Vander Stoep
af545bb5ee vsock: use ns_capable_noaudit() on socket create
During __vsock_create() CAP_NET_ADMIN is used to determine if the
vsock_sock->trusted should be set to true. This value is used later
for determing if a remote connection should be allowed to connect
to a restricted VM. Unfortunately, if the caller doesn't have
CAP_NET_ADMIN, an audit message such as an selinux denial is
generated even if the caller does not want a trusted socket.

Logging errors on success is confusing. To avoid this, switch the
capable(CAP_NET_ADMIN) check to the noaudit version.

Reported-by: Roman Kiryanov <rkir@google.com>
https://android-review.googlesource.com/c/device/generic/goldfish/+/1468545/
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/r/20201023143757.377574-1-jeffv@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-26 16:22:42 -07:00
Eric Biggers
23224e4500 mm: remove kzfree() compatibility definition
Commit 453431a549 ("mm, treewide: rename kzfree() to
kfree_sensitive()") renamed kzfree() to kfree_sensitive(),
but it left a compatibility definition of kzfree() to avoid
being too disruptive.

Since then a few more instances of kzfree() have slipped in.

Just get rid of them and remove the compatibility definition
once and for all.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25 11:39:02 -07:00
Willy Tarreau
3744741ada random32: add noise from network and scheduling activity
With the removal of the interrupt perturbations in previous random32
change (random32: make prandom_u32() output unpredictable), the PRNG
has become 100% deterministic again. While SipHash is expected to be
way more robust against brute force than the previous Tausworthe LFSR,
there's still the risk that whoever has even one temporary access to
the PRNG's internal state is able to predict all subsequent draws till
the next reseed (roughly every minute). This may happen through a side
channel attack or any data leak.

This patch restores the spirit of commit f227e3ec3b ("random32: update
the net random state on interrupt and activity") in that it will perturb
the internal PRNG's statee using externally collected noise, except that
it will not pick that noise from the random pool's bits nor upon
interrupt, but will rather combine a few elements along the Tx path
that are collectively hard to predict, such as dev, skb and txq
pointers, packet length and jiffies values. These ones are combined
using a single round of SipHash into a single long variable that is
mixed with the net_rand_state upon each invocation.

The operation was inlined because it produces very small and efficient
code, typically 3 xor, 2 add and 2 rol. The performance was measured
to be the same (even very slightly better) than before the switch to
SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
(i40e), the connection rate dropped from 556k/s to 555k/s while the
SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.

Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
Cc: George Spelvin <lkml@sdf.org>
Cc: Amit Klein <aksecurity@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-24 20:21:57 +02:00
Arjun Roy
435ccfa894 tcp: Prevent low rmem stalls with SO_RCVLOWAT.
With SO_RCVLOWAT, under memory pressure,
it is possible to enter a state where:

1. We have not received enough bytes to satisfy SO_RCVLOWAT.
2. We have not entered buffer pressure (see tcp_rmem_pressure()).
3. But, we do not have enough buffer space to accept more packets.

In this case, we advertise 0 rwnd (due to #3) but the application does
not drain the receive queue (no wakeup because of #1 and #2) so the
flow stalls.

Modify the heuristic for SO_RCVLOWAT so that, if we are advertising
rwnd<=rcv_mss, force a wakeup to prevent a stall.

Without this patch, setting tcp_rmem to 6143 and disabling TCP
autotune causes a stalled flow. With this patch, no stall occurs. This
is with RPC-style traffic with large messages.

Fixes: 03f45c883c ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-23 19:11:20 -07:00
Linus Torvalds
3cb12d27ff Fixes for 5.10-rc1 from the networking tree:
Cross-tree/merge window issues:
 
  - rtl8150: don't incorrectly assign random MAC addresses; fix late
    in the 5.9 cycle started depending on a return code from
    a function which changed with the 5.10 PR from the usb subsystem
 
 Current release - regressions:
 
  - Revert "virtio-net: ethtool configurable RXCSUM", it was causing
    crashes at probe when control vq was not negotiated/available
 
 Previous releases - regressions:
 
  - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO
    bus, only first device would be probed correctly
 
  - nexthop: Fix performance regression in nexthop deletion by
    effectively switching from recently added synchronize_rcu()
    to synchronize_rcu_expedited()
 
  - netsec: ignore 'phy-mode' device property on ACPI systems;
    the property is not populated correctly by the firmware,
    but firmware configures the PHY so just keep boot settings
 
 Previous releases - always broken:
 
  - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing
    bulk transfers getting "stuck"
 
  - icmp: randomize the global rate limiter to prevent attackers from
    getting useful signal
 
  - r8169: fix operation under forced interrupt threading, make the
    driver always use hard irqs, even on RT, given the handler is
    light and only wants to schedule napi (and do so through
    a _irqoff() variant, preferably)
 
  - bpf: Enforce pointer id generation for all may-be-null register
    type to avoid pointers erroneously getting marked as null-checked
 
  - tipc: re-configure queue limit for broadcast link
 
  - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN
    tunnels
 
  - fix various issues in chelsio inline tls driver
 
 Misc:
 
  - bpf: improve just-added bpf_redirect_neigh() helper api to support
    supplying nexthop by the caller - in case BPF program has already
    done a lookup we can avoid doing another one
 
  - remove unnecessary break statements
 
  - make MCTCP not select IPV6, but rather depend on it
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+R+5UACgkQMUZtbf5S
 Irt9KxAAiYme2aSvMOni0NQsOgQ5mVsy7tk0/4dyRqkAx0ggrfGcFuhgZYNm8ZKY
 KoQsQyn30Wb/2wAp1vX2I4Fod67rFyBfQg/8iWiEAu47X7Bj1lpPPJexSPKhF9/X
 e0TuGxZtoaDuV9C3Su/FOjRmnShGSFQu1SCyJThshwaGsFL3YQ0Ut07VRgRF8x05
 A5fy2SVVIw0JOQgV1oH0GP5oEK3c50oGnaXt8emm56PxVIfAYY0oq69hQUzrfMFP
 zV9R0XbnbCIibT8R3lEghjtXavtQTzK5rYDKazTeOyDU87M+yuykNYj7MhgDwl9Q
 UdJkH2OpMlJylEH3asUjz/+ObMhXfOuj/ZS3INtO5omBJx7x76egDZPMQe4wlpcC
 NT5EZMS7kBdQL8xXDob7hXsvFpuEErSUGruYTHp4H52A9ke1dRTH2kQszcKk87V3
 s+aVVPtJ5bHzF3oGEvfwP0DFLTF6WvjD0Ts0LmTY2DhpE//tFWV37j60Ni5XU21X
 fCPooihQbLOsq9D8zc0ydEvCg2LLWMXM5ovCkqfIAJzbGVYhnxJSryZwpOlKDS0y
 LiUmLcTZDoNR/szx0aJhVHdUUVgXDX/GsllHoc1w7ZvDRMJn40K+xnaF3dSMwtIl
 imhfc5pPi6fdBgjB0cFYRPfhwiwlPMQ4YFsOq9JvynJzmt6P5FQ=
 =ceke
 -----END PGP SIGNATURE-----

Merge tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Cross-tree/merge window issues:

   - rtl8150: don't incorrectly assign random MAC addresses; fix late in
     the 5.9 cycle started depending on a return code from a function
     which changed with the 5.10 PR from the usb subsystem

  Current release regressions:

   - Revert "virtio-net: ethtool configurable RXCSUM", it was causing
     crashes at probe when control vq was not negotiated/available

  Previous release regressions:

   - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO
     bus, only first device would be probed correctly

   - nexthop: Fix performance regression in nexthop deletion by
     effectively switching from recently added synchronize_rcu() to
     synchronize_rcu_expedited()

   - netsec: ignore 'phy-mode' device property on ACPI systems; the
     property is not populated correctly by the firmware, but firmware
     configures the PHY so just keep boot settings

  Previous releases - always broken:

   - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing
     bulk transfers getting "stuck"

   - icmp: randomize the global rate limiter to prevent attackers from
     getting useful signal

   - r8169: fix operation under forced interrupt threading, make the
     driver always use hard irqs, even on RT, given the handler is light
     and only wants to schedule napi (and do so through a _irqoff()
     variant, preferably)

   - bpf: Enforce pointer id generation for all may-be-null register
     type to avoid pointers erroneously getting marked as null-checked

   - tipc: re-configure queue limit for broadcast link

   - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN
     tunnels

   - fix various issues in chelsio inline tls driver

  Misc:

   - bpf: improve just-added bpf_redirect_neigh() helper api to support
     supplying nexthop by the caller - in case BPF program has already
     done a lookup we can avoid doing another one

   - remove unnecessary break statements

   - make MCTCP not select IPV6, but rather depend on it"

* tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
  tcp: fix to update snd_wl1 in bulk receiver fast path
  net: Properly typecast int values to set sk_max_pacing_rate
  netfilter: nf_fwd_netdev: clear timestamp in forwarding path
  ibmvnic: save changed mac address to adapter->mac_addr
  selftests: mptcp: depends on built-in IPv6
  Revert "virtio-net: ethtool configurable RXCSUM"
  rtnetlink: fix data overflow in rtnl_calcit()
  net: ethernet: mtk-star-emac: select REGMAP_MMIO
  net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup
  net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device
  bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static
  bpf, selftests: Extend test_tc_redirect to use modified bpf_redirect_neigh()
  bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop
  mptcp: depends on IPV6 but not as a module
  sfc: move initialisation of efx->filter_sem to efx_init_struct()
  mpls: load mpls_gso after mpls_iptunnel
  net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels
  net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action()
  net: dsa: bcm_sf2: make const array static, makes object smaller
  mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it
  ...
2020-10-23 12:05:49 -07:00
zhuoliang zhang
a779d91314 net: xfrm: fix a race condition during allocing spi
we found that the following race condition exists in
xfrm_alloc_userspi flow:

user thread                                    state_hash_work thread
----                                           ----
xfrm_alloc_userspi()
 __find_acq_core()
   /*alloc new xfrm_state:x*/
   xfrm_state_alloc()
   /*schedule state_hash_work thread*/
   xfrm_hash_grow_check()   	               xfrm_hash_resize()
 xfrm_alloc_spi                                  /*hold lock*/
      x->id.spi = htonl(spi)                     spin_lock_bh(&net->xfrm.xfrm_state_lock)
      /*waiting lock release*/                     xfrm_hash_transfer()
      spin_lock_bh(&net->xfrm.xfrm_state_lock)      /*add x into hlist:net->xfrm.state_byspi*/
	                                                hlist_add_head_rcu(&x->byspi)
                                                 spin_unlock_bh(&net->xfrm.xfrm_state_lock)

    /*add x into hlist:net->xfrm.state_byspi 2 times*/
    hlist_add_head_rcu(&x->byspi)

1. a new state x is alloced in xfrm_state_alloc() and added into the bydst hlist
in  __find_acq_core() on the LHS;
2. on the RHS, state_hash_work thread travels the old bydst and tranfers every xfrm_state
(include x) into the new bydst hlist and new byspi hlist;
3. user thread on the LHS gets the lock and adds x into the new byspi hlist again.

So the same xfrm_state (x) is added into the same list_hash
(net->xfrm.state_byspi) 2 times that makes the list_hash become
an inifite loop.

To fix the race, x->id.spi = htonl(spi) in the xfrm_alloc_spi() is moved
to the back of spin_lock_bh, sothat state_hash_work thread no longer add x
which id.spi is zero into the hash_list.

Fixes: f034b5d4ef ("[XFRM]: Dynamic xfrm_state hash table sizing.")
Signed-off-by: zhuoliang zhang <zhuoliang.zhang@mediatek.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-10-23 09:08:55 +02:00
Neal Cardwell
18ded910b5 tcp: fix to update snd_wl1 in bulk receiver fast path
In the header prediction fast path for a bulk data receiver, if no
data is newly acknowledged then we do not call tcp_ack() and do not
call tcp_ack_update_window(). This means that a bulk receiver that
receives large amounts of data can have the incoming sequence numbers
wrap, so that the check in tcp_may_update_window fails:
   after(ack_seq, tp->snd_wl1)

If the incoming receive windows are zero in this state, and then the
connection that was a bulk data receiver later wants to send data,
that connection can find itself persistently rejecting the window
updates in incoming ACKs. This means the connection can persistently
fail to discover that the receive window has opened, which in turn
means that the connection is unable to send anything, and the
connection's sending process can get permanently "stuck".

The fix is to update snd_wl1 in the header prediction fast path for a
bulk data receiver, so that it keeps up and does not see wrapping
problems.

This fix is based on a very nice and thorough analysis and diagnosis
by Apollon Oikonomopoulos (see link below).

This is a stable candidate but there is no Fixes tag here since the
bug predates current git history. Just for fun: looks like the bug
dates back to when header prediction was added in Linux v2.1.8 in Nov
1996. In that version tcp_rcv_established() was added, and the code
only updates snd_wl1 in tcp_ack(), and in the new "Bulk data transfer:
receiver" code path it does not call tcp_ack(). This fix seems to
apply cleanly at least as far back as v3.2.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Apollon Oikonomopoulos <apoikos@dmesg.gr>
Tested-by: Apollon Oikonomopoulos <apoikos@dmesg.gr>
Link: https://www.spinics.net/lists/netdev/msg692430.html
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201022143331.1887495-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22 12:26:57 -07:00
Ke Li
700465fd33 net: Properly typecast int values to set sk_max_pacing_rate
In setsockopt(SO_MAX_PACING_RATE) on 64bit systems, sk_max_pacing_rate,
after extended from 'u32' to 'unsigned long', takes unintentionally
hiked value whenever assigned from an 'int' value with MSB=1, due to
binary sign extension in promoting s32 to u64, e.g. 0x80000000 becomes
0xFFFFFFFF80000000.

Thus inflated sk_max_pacing_rate causes subsequent getsockopt to return
~0U unexpectedly. It may also result in increased pacing rate.

Fix by explicitly casting the 'int' value to 'unsigned int' before
assigning it to sk_max_pacing_rate, for zero extension to happen.

Fixes: 76a9ebe811 ("net: extend sk_pacing_rate to unsigned long")
Signed-off-by: Ji Li <jli@akamai.com>
Signed-off-by: Ke Li <keli@akamai.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201022064146.79873-1-keli@akamai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22 12:18:25 -07:00
Jakub Kicinski
594850ca43 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Update debugging in IPVS tcp protocol handler to make it easier
   to understand, from longguang.yue

2) Update TCP tracker to deal with keepalive packet after
   re-registration, from Franceso Ruggeri.

3) Missing IP6SKB_FRAGMENTED from netfilter fragment reassembly,
   from Georg Kohmann.

4) Fix bogus packet drop in ebtables nat extensions, from
   Thimothee Cocault.

5) Fix typo in flowtable documentation.

6) Reset skb timestamp in nft_fwd_netdev.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22 12:00:37 -07:00
Jakub Kicinski
d2775984d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-10-22

1) Fix enforcing NULL check in verifier for new helper return types of
   RET_PTR_TO_{BTF_ID,MEM_OR_BTF_ID}_OR_NULL, from Martin KaFai Lau.

2) Fix bpf_redirect_neigh() helper API before it becomes frozen by adding
   nexthop information as argument, from Toke Høiland-Jørgensen.

3) Guard & fix compilation of bpf_tail_call_static() when __bpf__ arch is
   not defined by compiler or clang too old, from Daniel Borkmann.

4) Remove misplaced break after return in attach_type_to_prog_type(), from
   Tom Rix.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-22 09:51:41 -07:00
Linus Torvalds
24717cfbbb The one new feature this time, from Anna Schumaker, is READ_PLUS, which
has the same arguments as READ but allows the server to return an array
 of data and hole extents.
 
 Otherwise it's a lot of cleanup and bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl+Q5vsVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+DUAP/RlALnXbaoWi8YCcEcc9U1LoQKbD
 CJpDR+FqCOyGwRuzWung/5pvkOO50fGEeAroos+2rF/NgRkQq8EFr9AuBhNOYUFE
 IZhWEOfu/r2ukXyBmcu21HGcWLwPnyJehvjuzTQW2wOHlBi/sdoL5Ap1sVlwVLj5
 EZ5kqJLD+ioG2sufW99Spi55l1Cy+3Y0IhLSWl4ZAE6s8hmFSYAJZFsOeI0Afx57
 USPTDRaeqjyEULkb+f8IhD0eRApOUo4evDn9dwQx+of7HPa1CiygctTKYwA3hnlc
 gXp2KpVA1REaiYVgOPwYlnqBmJ2K9X0wCRzcWy2razqEcVAX/2j7QCe9M2mn4DC8
 xZ2q4SxgXu9yf0qfUSVnDxWmP6ipqq7OmsG0JXTFseGKBdpjJY1qHhyqanVAGvEg
 I+xHnnWfGwNCftwyA3mt3RfSFPsbLlSBIMZxvN4kn8aVlqszGITOQvTdQcLYA6kT
 xWllBf4XKVXMqF0PzerxPDmfzBfhx6b1VPWOIVcu7VLBg3IXoEB2G5xG8MUJiSch
 OUTCt41LUQkerQlnzaZYqwmFdSBfXJefmcE/x/vps4VtQ/fPHX1jQyD7iTu3HfSP
 bRlkKHvNVeTodlBDe/HTPiTA99MShhBJyvtV5wfzIqwjc1cNreed+ePppxn8mxJi
 SmQ2uZk/MpUl7/V0
 =rcOj
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "The one new feature this time, from Anna Schumaker, is READ_PLUS,
  which has the same arguments as READ but allows the server to return
  an array of data and hole extents.

  Otherwise it's a lot of cleanup and bugfixes"

* tag 'nfsd-5.10' of git://linux-nfs.org/~bfields/linux: (43 commits)
  NFSv4.2: Fix NFS4ERR_STALE error when doing inter server copy
  SUNRPC: fix copying of multiple pages in gss_read_proxy_verf()
  sunrpc: raise kernel RPC channel buffer size
  svcrdma: fix bounce buffers for unaligned offsets and multiple pages
  nfsd: remove unneeded break
  net/sunrpc: Fix return value for sysctl sunrpc.transports
  NFSD: Encode a full READ_PLUS reply
  NFSD: Return both a hole and a data segment
  NFSD: Add READ_PLUS hole segment encoding
  NFSD: Add READ_PLUS data support
  NFSD: Hoist status code encoding into XDR encoder functions
  NFSD: Map nfserr_wrongsec outside of nfsd_dispatch
  NFSD: Remove the RETURN_STATUS() macro
  NFSD: Call NFSv2 encoders on error returns
  NFSD: Fix .pc_release method for NFSv2
  NFSD: Remove vestigial typedefs
  NFSD: Refactor nfsd_dispatch() error paths
  NFSD: Clean up nfsd_dispatch() variables
  NFSD: Clean up stale comments in nfsd_dispatch()
  NFSD: Clean up switch statement in nfsd_dispatch()
  ...
2020-10-22 09:44:27 -07:00
Linus Torvalds
334d431f65 9p pull request for inclusion in 5.10
A couple of small fixes (loff_t overflow on 32bit, syzbot uninitialized
 variable warning) and code cleanup (xen)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl+Rc/EACgkQq06b7GqY
 5nACqxAAnrPj6uA4IQGCQCbRKetkidZrIGre/8RyEAyfxj+8gV0POYp/1SFokAu+
 RzNkeDgKXkV8L/6BEc/XTi/Iy2HGlPOQLjpBcDLwOIe+ZSlXYK544zUqs20W8/ZH
 NHtUY9rvgEWaTyYCi0wjS51xqBs2/Km4OIUKj7OZHXx9rWmpPX/6xwgcR4tNAWEJ
 7AYkq9G0OiDbY9TaHRK0dOmVq1Xs+H3s5Ci4FMh/uFcQBI3UgfpYAegqTiMh8/AF
 aWgtMdidHaBIWdzb1W4UbBXVd14dFfcIqMTzstFO+mElO4VtQCAQtC16cJX9Dqgh
 2YQxZgF5uVAi3DogHLDT0iT6RiqJrU8WeDLfOq8AuOy29l5jlGB6y1GppmS4z0DK
 sw+fy0lMH9q2ahbJs3RMk05m9vK8L9HCjqnMbTD0TohtkEEWyhHJ2RadtkDnRAL+
 Vy+tlfENloVW4zJIFT8RI5pLXqLG4NGV5Tlp5D//5abV1V8uEeVFHe97JTBMsafi
 FAGWqghPPxRB7UhtbnNHiIGDClOctT9y+SY0mxIkq0Bw+26K5G/eXWwF+Wgelfrs
 n0nCrbi5rVdJJhcrZid4r/tUDpXydxgQmrXV8jsgpHH22B02WXKtAzcgWJsB9vec
 syicGh9Shr4cF1S0Rq/Brgl47WGwHjufsvWSwnukvZ1+X4Rs9Qw=
 =ONeq
 -----END PGP SIGNATURE-----

Merge tag '9p-for-5.10-rc1' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
 "A couple of small fixes (loff_t overflow on 32bit, syzbot
  uninitialized variable warning) and code cleanup (xen)"

* tag '9p-for-5.10-rc1' of git://github.com/martinetd/linux:
  net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid
  9p/xen: Fix format argument warning
  9P: Cast to loff_t before multiplying
2020-10-22 09:33:20 -07:00
Pablo Neira Ayuso
c77761c8a5 netfilter: nf_fwd_netdev: clear timestamp in forwarding path
Similar to 7980d2eabd ("ipvs: clear skb->tstamp in forwarding path").
fq qdisc requires tstamp to be cleared in forwarding path.

Fixes: 8203e2d844 ("net: clear skb->tstamp in forwarding paths")
Fixes: fb420d5d91 ("tcp/fq: move back to CLOCK_MONOTONIC")
Fixes: 80b14dee2b ("net: Add a new socket option for a future transmit time.")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-22 14:49:36 +02:00
Di Zhu
ebfe3c5183 rtnetlink: fix data overflow in rtnl_calcit()
"ip addr show" command execute error when we have a physical
network card with a large number of VFs

The return value of if_nlmsg_size() in rtnl_calcit() will exceed
range of u16 data type when any network cards has a larger number of
VFs. rtnl_vfinfo_size() will significant increase needed dump size when
the value of num_vfs is larger.

Eventually we get a wrong value of min_ifinfo_dump_size because of overflow
which decides the memory size needed by netlink dump and netlink_dump()
will return -EMSGSIZE because of not enough memory was allocated.

So fix it by promoting  min_dump_alloc data type to u32 to
avoid whole netlink message size overflow and it's also align
with the data type of struct netlink_callback{}.min_dump_alloc
which is assigned by return value of rtnl_calcit()

Signed-off-by: Di Zhu <zhudi21@huawei.com>
Link: https://lore.kernel.org/r/20201021020053.1401-1-zhudi21@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-21 18:24:08 -07:00
Toke Høiland-Jørgensen
ba452c9e99 bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop
Based on the discussion in [0], update the bpf_redirect_neigh() helper to
accept an optional parameter specifying the nexthop information. This makes
it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without
incurring a duplicate FIB lookup - since the FIB lookup helper will return
the nexthop information even if no neighbour is present, this can simply
be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns
BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen.

  [0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk
2020-10-22 01:28:54 +02:00
Linus Torvalds
ed7cfefe44 We have:
- a patch that removes crush_workspace_mutex (myself).  CRUSH
   computations are no longer serialized and can run in parallel.
 
 - a couple new filesystem client metrics for "ceph fs top" command
   (Xiubo Li)
 
 - a fix for a very old messenger bug that affected the filesystem,
   marked for stable (myself)
 
 - assorted fixups and cleanups throughout the codebase from Jeff
   and others.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl+QN8cTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi4iwCACLwUvO0ONTzUUb2D8ftfyXjo/jIod5
 eO7RHfCHOP2A83GQpdYdktVDSVeOUIOiPhxAxo1dL4GieI2/saXrnoevam7ogZkA
 OmR4drdtRVUqF/aATrtjiDMg2ge0dnx5gfjMxSP/FiPPpXOdrtxng/7tv8yo+03q
 AlMqg/YcxO06t1M1qh9SEyfzjcHyPnJU2i0ienngxnGxQ7QiMOR6anF1LNhtN803
 4fTBX2tLqDNAa+x5yF1kKSn9OmFNhc0oUsqef+Ck0Vw1LC0/SuxzE2J904iehAwy
 /HzJCer0zcp+eZhn3HhoHmp0UZpOC1dzMxa3CHyRJFrxCSTEhY8iqPiJ
 =wGr7
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:

 - a patch that removes crush_workspace_mutex (myself). CRUSH
   computations are no longer serialized and can run in parallel.

 - a couple new filesystem client metrics for "ceph fs top" command
   (Xiubo Li)

 - a fix for a very old messenger bug that affected the filesystem,
   marked for stable (myself)

 - assorted fixups and cleanups throughout the codebase from Jeff and
   others.

* tag 'ceph-for-5.10-rc1' of git://github.com/ceph/ceph-client: (27 commits)
  libceph: clear con->out_msg on Policy::stateful_server faults
  libceph: format ceph_entity_addr nonces as unsigned
  libceph: fix ENTITY_NAME format suggestion
  libceph: move a dout in queue_con_delay()
  ceph: comment cleanups and clarifications
  ceph: break up send_cap_msg
  ceph: drop separate mdsc argument from __send_cap
  ceph: promote to unsigned long long before shifting
  ceph: don't SetPageError on readpage errors
  ceph: mark ceph_fmt_xattr() as printf-like for better type checking
  ceph: fold ceph_update_writeable_page into ceph_write_begin
  ceph: fold ceph_sync_writepages into writepage_nounlock
  ceph: fold ceph_sync_readpages into ceph_readpage
  ceph: don't call ceph_update_writeable_page from page_mkwrite
  ceph: break out writeback of incompatible snap context to separate function
  ceph: add a note explaining session reject error string
  libceph: switch to the new "osd blocklist add" command
  libceph, rbd, ceph: "blacklist" -> "blocklist"
  ceph: have ceph_writepages_start call pagevec_lookup_range_tag
  ceph: use kill_anon_super helper
  ...
2020-10-21 10:34:10 -07:00
Matthieu Baerts
0ed37ac586 mptcp: depends on IPV6 but not as a module
Like TCP, MPTCP cannot be compiled as a module. Obviously, MPTCP IPv6'
support also depends on CONFIG_IPV6. But not all functions from IPv6
code are exported.

To simplify the code and reduce modifications outside MPTCP, it was
decided from the beginning to support MPTCP with IPv6 only if
CONFIG_IPV6 was built inlined. That's also why CONFIG_MPTCP_IPV6 was
created. More modifications are needed to support CONFIG_IPV6=m.

Even if it was not explicit, until recently, we were forcing CONFIG_IPV6
to be built-in because we had "select IPV6" in Kconfig. Now that we have
"depends on IPV6", we have to explicitly set "IPV6=y" to force
CONFIG_IPV6 not to be built as a module.

In other words, we can now only have CONFIG_MPTCP_IPV6=y if
CONFIG_IPV6=y.

Note that the new dependency might hide the fact IPv6 is not supported
in MPTCP even if we have CONFIG_IPV6=m. But selecting IPV6 like we did
before was forcing it to be built-in while it was maybe not what the
user wants.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 010b430d5d ("mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it")
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20201021105154.628257-1-matthieu.baerts@tessares.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-21 08:05:40 -07:00
Alexander Ovechkin
b7c24497ba mpls: load mpls_gso after mpls_iptunnel
mpls_iptunnel is used only for mpls encapsuation, and if encaplusated
packet is larger than MTU we need mpls_gso for segmentation.

Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru>
Acked-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: David Ahern <dsahern@gmail.com>
Link: https://lore.kernel.org/r/20201020114333.26866-1-ovov@yandex-team.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 21:16:45 -07:00
Davide Caratti
a7a12b5a0f net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels
the following command

 # tc action add action tunnel_key \
 > set src_ip 2001:db8::1 dst_ip 2001:db8::2 id 10 erspan_opts 1:6789:0:0

generates the following splat:

 BUG: KASAN: slab-out-of-bounds in tunnel_key_copy_opts+0xcc9/0x1010 [act_tunnel_key]
 Write of size 4 at addr ffff88813f5f1cc8 by task tc/873

 CPU: 2 PID: 873 Comm: tc Not tainted 5.9.0+ #282
 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
 Call Trace:
  dump_stack+0x99/0xcb
  print_address_description.constprop.7+0x1e/0x230
  kasan_report.cold.13+0x37/0x7c
  tunnel_key_copy_opts+0xcc9/0x1010 [act_tunnel_key]
  tunnel_key_init+0x160c/0x1f40 [act_tunnel_key]
  tcf_action_init_1+0x5b5/0x850
  tcf_action_init+0x15d/0x370
  tcf_action_add+0xd9/0x2f0
  tc_ctl_action+0x29b/0x3a0
  rtnetlink_rcv_msg+0x341/0x8d0
  netlink_rcv_skb+0x120/0x380
  netlink_unicast+0x439/0x630
  netlink_sendmsg+0x719/0xbf0
  sock_sendmsg+0xe2/0x110
  ____sys_sendmsg+0x5ba/0x890
  ___sys_sendmsg+0xe9/0x160
  __sys_sendmsg+0xd3/0x170
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f872a96b338
 Code: 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 25 43 2c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 41 89 d4 55
 RSP: 002b:00007ffffe367518 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 000000005f8f5aed RCX: 00007f872a96b338
 RDX: 0000000000000000 RSI: 00007ffffe367580 RDI: 0000000000000003
 RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000000001c
 R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000001
 R13: 0000000000686760 R14: 0000000000000601 R15: 0000000000000000

 Allocated by task 873:
  kasan_save_stack+0x19/0x40
  __kasan_kmalloc.constprop.7+0xc1/0xd0
  __kmalloc+0x151/0x310
  metadata_dst_alloc+0x20/0x40
  tunnel_key_init+0xfff/0x1f40 [act_tunnel_key]
  tcf_action_init_1+0x5b5/0x850
  tcf_action_init+0x15d/0x370
  tcf_action_add+0xd9/0x2f0
  tc_ctl_action+0x29b/0x3a0
  rtnetlink_rcv_msg+0x341/0x8d0
  netlink_rcv_skb+0x120/0x380
  netlink_unicast+0x439/0x630
  netlink_sendmsg+0x719/0xbf0
  sock_sendmsg+0xe2/0x110
  ____sys_sendmsg+0x5ba/0x890
  ___sys_sendmsg+0xe9/0x160
  __sys_sendmsg+0xd3/0x170
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

 The buggy address belongs to the object at ffff88813f5f1c00
  which belongs to the cache kmalloc-256 of size 256
 The buggy address is located 200 bytes inside of
  256-byte region [ffff88813f5f1c00, ffff88813f5f1d00)
 The buggy address belongs to the page:
 page:0000000011b48a19 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13f5f0
 head:0000000011b48a19 order:1 compound_mapcount:0
 flags: 0x17ffffc0010200(slab|head)
 raw: 0017ffffc0010200 0000000000000000 0000000d00000001 ffff888107c43400
 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff88813f5f1b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff88813f5f1c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff88813f5f1c80: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc
                                               ^
  ffff88813f5f1d00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff88813f5f1d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

using IPv6 tunnels, act_tunnel_key allocates a fixed amount of memory for
the tunnel metadata, but then it expects additional bytes to store tunnel
specific metadata with tunnel_key_copy_opts().

Fix the arguments of __ipv6_tun_set_dst(), so that 'md_size' contains the
size previously computed by tunnel_key_get_opts_len(), like it's done for
IPv4 tunnels.

Fixes: 0ed5269f9e ("net/sched: add tunnel option support to act_tunnel_key")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://lore.kernel.org/r/36ebe969f6d13ff59912d6464a4356fe6f103766.1603231100.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 21:10:41 -07:00
Guillaume Nault
b130762161 net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action()
We need to jump to the "err_out_locked" label when
tcf_gate_get_entries() fails. Otherwise, tc_setup_flow_action() exits
with ->tcfa_lock still held.

Fixes: d29bdd69ec ("net: schedule: add action gate offloading")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://lore.kernel.org/r/12f60e385584c52c22863701c0185e40ab08a7a7.1603207948.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 21:00:52 -07:00
Geert Uytterhoeven
010b430d5d mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it
MPTCP_IPV6 selects IPV6, thus enabling an optional feature the user may
not want to enable.  Fix this by making MPTCP_IPV6 depend on IPV6, like
is done for all other IPv6 features.

Fixes: f870fa0b57 ("mptcp: Add MPTCP socket stubs")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20201020073839.29226-1-geert@linux-m68k.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 20:54:08 -07:00
Defang Bo
280e3ebdaf nfc: Ensure presence of NFC_ATTR_FIRMWARE_NAME attribute in nfc_genl_fw_download()
Check that the NFC_ATTR_FIRMWARE_NAME attributes are provided by
the netlink client prior to accessing them.This prevents potential
unhandled NULL pointer dereference exceptions which can be triggered
by malicious user-mode programs, if they omit one or both of these
attributes.

Similar to commit a0323b979f ("nfc: Ensure presence of required attributes in the activate_target handler").

Fixes: 9674da8759 ("NFC: Add firmware upload netlink command")
Signed-off-by: Defang Bo <bodefang@126.com>
Link: https://lore.kernel.org/r/1603107538-4744-1-git-send-email-bodefang@126.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 17:06:22 -07:00
Geert Uytterhoeven
b142083b58 mptcp: MPTCP_KUNIT_TESTS should depend on MPTCP instead of selecting it
MPTCP_KUNIT_TESTS selects MPTCP, thus enabling an optional feature the
user may not want to enable.  Fix this by making the test depend on
MPTCP instead.

Fixes: a00a582203 ("mptcp: move crypto test to KUNIT")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/20201019113240.11516-1-geert@linux-m68k.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 16:50:06 -07:00
Geliang Tang
65b8c8a620 mptcp: move mptcp_options_received's port initialization
Move mptcp_options_received's port initialization from
mptcp_parse_option to mptcp_get_options, put it together with
the other fields initializations of mptcp_options_received.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 16:32:36 -07:00
Geliang Tang
fe2d9b1a0e mptcp: initialize mptcp_options_received's ahmac
Initialize mptcp_options_received's ahmac to zero, otherwise it
will be a random number when receiving ADD_ADDR suboption with echo-flag=1.

Fixes: 3df523ab58 ("mptcp: Add ADD_ADDR handling")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 16:32:14 -07:00
Roi Dayan
47b5d2a107 net/sched: act_ct: Fix adding udp port mangle operation
Need to use the udp header type and not tcp.

Fixes: 9c26ba9b1f ("net/sched: act_ct: Instantiate flow table entry actions")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Link: https://lore.kernel.org/r/20201019090244.3015186-1-roid@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-20 16:15:51 -07:00
Linus Torvalds
59f0e7eb2f NFS Client Updates for Linux 5.10
- Stable Fixes:
   - Wait for stateid updates after CLOSE/OPEN_DOWNGRADE # v5.4+
   - Fix nfs_path in case of a rename retry
   - Support EXCHID4_FLAG_SUPP_FENCE_OPS v4.2 EXCHANGE_ID flag
 
 - New features and improvements:
   - Replace dprintk() calls with tracepoints
   - Make cache consistency bitmap dynamic
   - Added support for the NFS v4.2 READ_PLUS operation
   - Improvements to net namespace uniquifier
 
 - Other bugfixes and cleanups
   - Remove redundant clnt pointer
   - Don't update timeout values on connection resets
   - Remove redundant tracepoints
   - Various cleanups to comments
   - Fix oops when trying to use copy_file_range with v4.0 source server
   - Improvements to flexfiles mirrors
   - Add missing "local_lock=posix" mount option
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl+PJv0ACgkQ18tUv7Cl
 QOtB+hAAza4BwQSAZs0QbszPcoNV0f0qND99hWcJ2ad5KCr8S5eNYaMqB8G8lbS+
 Ahuv6xRy69l2vjHvINXoSaSiXClYNAlAr5myiW0DqMLDpVV8VEMAhqgFJK97VpqQ
 AsbsdFwC4xAcQ+6nJ+IfK9f8AOYhvARkOP3171qkq0jX3Eq4j1IiQARn+JrGFZFL
 waC4UtVhCnx5z5pg7jyu7Ar4N0Xky72+Fm1EvRog4gndX2JbNNSkwavD33mWZ8iN
 3q6DRq/AtEk9F4ttPwVeALvpNCBqjoGcNw5FCBil7x4V+xIbDI2FBhusKas3GzXm
 W2tyUuJMTGpA6ZjkyYDcg0jC8vKUiUpMWgT3oZoQ/6g7P4vPzB/XB/RQcAInMRjS
 /MhWZ3YNQmApnVCIZSwOa9+oNUc69dM0+vqmesLvvb/aNjzRnGwiJNiuWGyrwoGX
 3SzpPUcJixa3+eKLi9uEHojKB+SuPIrB/HW4UYh/ZpOMyilVqUnX8RYbaK3qQ/gK
 Un++LjTmao9cXyjIDCiHJB630W01YSWXq1nwuSYkLuXCuALmmtgA4z8pm+4K+w+G
 SctETKVcD0qi3Yx5mFdU5A1dHXMB7nzYNaiRWYLvLO5qAR33nUPI4lV8rog2TUPe
 2iW4n0j2YxnTlwR3QMPYJRndP7KeR5XvOoNuh3XpKRVbmbc7/A4=
 =6RM5
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable Fixes:
   - Wait for stateid updates after CLOSE/OPEN_DOWNGRADE # v5.4+
   - Fix nfs_path in case of a rename retry
   - Support EXCHID4_FLAG_SUPP_FENCE_OPS v4.2 EXCHANGE_ID flag

  New features and improvements:
   - Replace dprintk() calls with tracepoints
   - Make cache consistency bitmap dynamic
   - Added support for the NFS v4.2 READ_PLUS operation
   - Improvements to net namespace uniquifier

  Other bugfixes and cleanups:
   - Remove redundant clnt pointer
   - Don't update timeout values on connection resets
   - Remove redundant tracepoints
   - Various cleanups to comments
   - Fix oops when trying to use copy_file_range with v4.0 source server
   - Improvements to flexfiles mirrors
   - Add missing 'local_lock=posix' mount option"

* tag 'nfs-for-5.10-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (55 commits)
  NFSv4.2: support EXCHGID4_FLAG_SUPP_FENCE_OPS 4.2 EXCHANGE_ID flag
  NFSv4: Fix up RCU annotations for struct nfs_netns_client
  NFS: Only reference user namespace from nfs4idmap struct instead of cred
  nfs: add missing "posix" local_lock constant table definition
  NFSv4: Use the net namespace uniquifier if it is set
  NFSv4: Clean up initialisation of uniquified client id strings
  NFS: Decode a full READ_PLUS reply
  SUNRPC: Add an xdr_align_data() function
  NFS: Add READ_PLUS hole segment decoding
  SUNRPC: Add the ability to expand holes in data pages
  SUNRPC: Split out _shift_data_right_tail()
  SUNRPC: Split out xdr_realign_pages() from xdr_align_pages()
  NFS: Add READ_PLUS data segment support
  NFS: Use xdr_page_pos() in NFSv4 decode_getacl()
  SUNRPC: Implement a xdr_page_pos() function
  SUNRPC: Split out a function for setting current page
  NFS: fix nfs_path in case of a rename retry
  fs: nfs: return per memcg count for xattr shrinkers
  NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE
  nfs: remove incorrect fallthrough label
  ...
2020-10-20 13:26:30 -07:00
Martijn de Gouw
d48c812474 SUNRPC: fix copying of multiple pages in gss_read_proxy_verf()
When the passed token is longer than 4032 bytes, the remaining part
of the token must be copied from the rqstp->rq_arg.pages. But the
copy must make sure it happens in a consecutive way.

With the existing code, the first memcpy copies 'length' bytes from
argv->iobase, but since the header is in front, this never fills the
whole first page of in_token->pages.

The mecpy in the loop copies the following bytes, but starts writing at
the next page of in_token->pages.  This leaves the last bytes of page 0
unwritten.

Symptoms were that users with many groups were not able to access NFS
exports, when using Active Directory as the KDC.

Signed-off-by: Martijn de Gouw <martijn.de.gouw@prodrive-technologies.com>
Fixes: 5866efa8cb "SUNRPC: Fix svcauth_gss_proxy_init()"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-20 13:21:30 -04:00
Roberto Bergantinos Corpas
27a1e8a0f7 sunrpc: raise kernel RPC channel buffer size
Its possible that using AUTH_SYS and mountd manage-gids option a
user may hit the 8k RPC channel buffer limit. This have been observed
on field, causing unanswered RPCs on clients after mountd fails to
write on channel :

rpc.mountd[11231]: auth_unix_gid: error writing reply

Userland nfs-utils uses a buffer size of 32k (RPC_CHAN_BUF_SIZE), so
lets match those two.

Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-20 13:21:30 -04:00
Saeed Mirzamohammadi
31cc578ae2 netfilter: nftables_offload: KASAN slab-out-of-bounds Read in nft_flow_rule_create
This patch fixes the issue due to:

BUG: KASAN: slab-out-of-bounds in nft_flow_rule_create+0x622/0x6a2
net/netfilter/nf_tables_offload.c:40
Read of size 8 at addr ffff888103910b58 by task syz-executor227/16244

The error happens when expr->ops is accessed early on before performing the boundary check and after nft_expr_next() moves the expr to go out-of-bounds.

This patch checks the boundary condition before expr->ops that fixes the slab-out-of-bounds Read issue.

Add nft_expr_more() and use it to fix this problem.

Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-20 13:54:54 +02:00
Timothée COCAULT
63137bc588 netfilter: ebtables: Fixes dropping of small packets in bridge nat
Fixes an error causing small packets to get dropped. skb_ensure_writable
expects the second parameter to be a length in the ethernet payload.=20
If we want to write the ethernet header (src, dst), we should pass 0.
Otherwise, packets with small payloads (< ETH_ALEN) will get dropped.

Fixes: c1a8311679 ("netfilter: bridge: convert skb_make_writable to skb_ensure_writable")
Signed-off-by: Timothée COCAULT <timothee.cocault@orange.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-20 13:54:53 +02:00
Georg Kohmann
68f9f9c2c3 netfilter: Drop fragmented ndisc packets assembled in netfilter
Fragmented ndisc packets assembled in netfilter not dropped as specified
in RFC 6980, section 5. This behaviour breaks TAHI IPv6 Core Conformance
Tests v6LC.2.1.22/23, V6LC.2.2.26/27 and V6LC.2.3.18.

Setting IP6SKB_FRAGMENTED flag during reassembly.

References: commit b800c3b966 ("ipv6: drop fragmented ndisc packets by default (RFC 6980)")
Signed-off-by: Georg Kohmann <geokohma@cisco.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-20 13:54:53 +02:00
Francesco Ruggeri
4f25434bcc netfilter: conntrack: connection timeout after re-register
If the first packet conntrack sees after a re-register is an outgoing
keepalive packet with no data (SEG.SEQ = SND.NXT-1), td_end is set to
SND.NXT-1.
When the peer correctly acknowledges SND.NXT, tcp_in_window fails
check III (Upper bound for valid (s)ack: sack <= receiver.td_end) and
returns false, which cascades into nf_conntrack_in setting
skb->_nfct = 0 and in later conntrack iptables rules not matching.
In cases where iptables are dropping packets that do not match
conntrack rules this can result in idle tcp connections to time out.

v2: adjust td_end when getting the reply rather than when sending out
    the keepalive packet.

Fixes: f94e63801a ("netfilter: conntrack: reset tcp maxwin on re-register")
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-20 13:54:53 +02:00
longguang.yue
79dce09ab0 ipvs: adjust the debug info in function set_tcp_state
Outputting client,virtual,dst addresses info when tcp state changes,
which makes the connection debug more clear

Signed-off-by: longguang.yue <bigclouds@163.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-20 13:54:46 +02:00
Ido Schimmel
df6afe2f7c nexthop: Fix performance regression in nexthop deletion
While insertion of 16k nexthops all using the same netdev ('dummy10')
takes less than a second, deletion takes about 130 seconds:

# time -p ip -b nexthop.batch
real 0.29
user 0.01
sys 0.15

# time -p ip link set dev dummy10 down
real 131.03
user 0.06
sys 0.52

This is because of repeated calls to synchronize_rcu() whenever a
nexthop is removed from a nexthop group:

# /usr/share/bcc/tools/offcputime -p `pgrep -nx ip` -K
...
    b'finish_task_switch'
    b'schedule'
    b'schedule_timeout'
    b'wait_for_completion'
    b'__wait_rcu_gp'
    b'synchronize_rcu.part.0'
    b'synchronize_rcu'
    b'__remove_nexthop'
    b'remove_nexthop'
    b'nexthop_flush_dev'
    b'nh_netdev_event'
    b'raw_notifier_call_chain'
    b'call_netdevice_notifiers_info'
    b'__dev_notify_flags'
    b'dev_change_flags'
    b'do_setlink'
    b'__rtnl_newlink'
    b'rtnl_newlink'
    b'rtnetlink_rcv_msg'
    b'netlink_rcv_skb'
    b'rtnetlink_rcv'
    b'netlink_unicast'
    b'netlink_sendmsg'
    b'____sys_sendmsg'
    b'___sys_sendmsg'
    b'__sys_sendmsg'
    b'__x64_sys_sendmsg'
    b'do_syscall_64'
    b'entry_SYSCALL_64_after_hwframe'
    -                ip (277)
        126554955

Since nexthops are always deleted under RTNL, synchronize_net() can be
used instead. It will call synchronize_rcu_expedited() which only blocks
for several microseconds as opposed to multiple milliseconds like
synchronize_rcu().

With this patch deletion of 16k nexthops takes less than a second:

# time -p ip link set dev dummy10 down
real 0.12
user 0.00
sys 0.04

Tested with fib_nexthops.sh which includes torture tests that prompted
the initial change:

# ./fib_nexthops.sh
...
Tests passed: 134
Tests failed:   0

Fixes: 90f33bffa3 ("nexthops: don't modify published nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20201016172914.643282-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-19 20:07:15 -07:00
Christian Eggers
bc7e343dbd net: dsa: tag_ksz: KSZ8795 and KSZ9477 also use tail tags
The Marvell 88E6060 uses tag_trailer.c and the KSZ8795, KSZ9477 and
KSZ9893 switches also use tail tags.

Fixes: 7a6ffe764b ("net: dsa: point out the tail taggers")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20201016171603.10587-1-ceggers@arri.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-19 17:32:50 -07:00
Taehee Yoo
0e8b8d6a2d net: core: use list_del_init() instead of list_del() in netdev_run_todo()
dev->unlink_list is reused unless dev is deleted.
So, list_del() should not be used.
Due to using list_del(), dev->unlink_list can't be reused so that
dev->nested_level update logic doesn't work.
In order to fix this bug, list_del_init() should be used instead
of list_del().

Test commands:
    ip link add bond0 type bond
    ip link add bond1 type bond
    ip link set bond0 master bond1
    ip link set bond0 nomaster
    ip link set bond1 master bond0
    ip link set bond1 nomaster

Splat looks like:
[  255.750458][ T1030] ============================================
[  255.751967][ T1030] WARNING: possible recursive locking detected
[  255.753435][ T1030] 5.9.0-rc8+ #772 Not tainted
[  255.754553][ T1030] --------------------------------------------
[  255.756047][ T1030] ip/1030 is trying to acquire lock:
[  255.757304][ T1030] ffff88811782a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync_multiple+0xc2/0x150
[  255.760056][ T1030]
[  255.760056][ T1030] but task is already holding lock:
[  255.761862][ T1030] ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding]
[  255.764581][ T1030]
[  255.764581][ T1030] other info that might help us debug this:
[  255.766645][ T1030]  Possible unsafe locking scenario:
[  255.766645][ T1030]
[  255.768566][ T1030]        CPU0
[  255.769415][ T1030]        ----
[  255.770259][ T1030]   lock(&dev_addr_list_lock_key/1);
[  255.771629][ T1030]   lock(&dev_addr_list_lock_key/1);
[  255.772994][ T1030]
[  255.772994][ T1030]  *** DEADLOCK ***
[  255.772994][ T1030]
[  255.775091][ T1030]  May be due to missing lock nesting notation
[  255.775091][ T1030]
[  255.777182][ T1030] 2 locks held by ip/1030:
[  255.778299][ T1030]  #0: ffffffffb1f63250 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2e4/0x8b0
[  255.780600][ T1030]  #1: ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding]
[  255.783411][ T1030]
[  255.783411][ T1030] stack backtrace:
[  255.784874][ T1030] CPU: 7 PID: 1030 Comm: ip Not tainted 5.9.0-rc8+ #772
[  255.786595][ T1030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  255.789030][ T1030] Call Trace:
[  255.789850][ T1030]  dump_stack+0x99/0xd0
[  255.790882][ T1030]  __lock_acquire.cold.71+0x166/0x3cc
[  255.792285][ T1030]  ? register_lock_class+0x1a30/0x1a30
[  255.793619][ T1030]  ? rcu_read_lock_sched_held+0x91/0xc0
[  255.794963][ T1030]  ? rcu_read_lock_bh_held+0xa0/0xa0
[  255.796246][ T1030]  lock_acquire+0x1b8/0x850
[  255.797332][ T1030]  ? dev_mc_sync_multiple+0xc2/0x150
[  255.798624][ T1030]  ? bond_enslave+0x3d4d/0x43e0 [bonding]
[  255.800039][ T1030]  ? check_flags+0x50/0x50
[  255.801143][ T1030]  ? lock_contended+0xd80/0xd80
[  255.802341][ T1030]  _raw_spin_lock_nested+0x2e/0x70
[  255.803592][ T1030]  ? dev_mc_sync_multiple+0xc2/0x150
[  255.804897][ T1030]  dev_mc_sync_multiple+0xc2/0x150
[  255.806168][ T1030]  bond_enslave+0x3d58/0x43e0 [bonding]
[  255.807542][ T1030]  ? __lock_acquire+0xe53/0x51b0
[  255.808824][ T1030]  ? bond_update_slave_arr+0xdc0/0xdc0 [bonding]
[  255.810451][ T1030]  ? check_chain_key+0x236/0x5e0
[  255.811742][ T1030]  ? mutex_is_locked+0x13/0x50
[  255.812910][ T1030]  ? rtnl_is_locked+0x11/0x20
[  255.814061][ T1030]  ? netdev_master_upper_dev_get+0xf/0x120
[  255.815553][ T1030]  do_setlink+0x94c/0x3040
[ ... ]

Reported-by: syzbot+4a0f7bc34e3997a6c7df@syzkaller.appspotmail.com
Fixes: 1fc70edb7d ("net: core: add nested_level variable in net_device")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20201015162606.9377-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-18 14:50:25 -07:00
Eelco Chaudron
f981fc3d51 net: openvswitch: fix to make sure flow_lookup() is not preempted
The flow_lookup() function uses per CPU variables, which must be called
with BH disabled. However, this is fine in the general NAPI use case
where the local BH is disabled. But, it's also called from the netlink
context. The below patch makes sure that even in the netlink path, the
BH is disabled.

In addition, u64_stats_update_begin() requires a lock to ensure one writer
which is not ensured here. Making it per-CPU and disabling NAPI (softirq)
ensures that there is always only one writer.

Fixes: eac87c413b ("net: openvswitch: reorder masks array based on usage")
Reported-by: Juri Lelli <jlelli@redhat.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/160295903253.7789.826736662555102345.stgit@ebuild
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-18 12:29:36 -07:00
Eric Dumazet
b38e7819ca icmp: randomize the global rate limiter
Keyu Man reported that the ICMP rate limiter could be used
by attackers to get useful signal. Details will be provided
in an upcoming academic publication.

Our solution is to add some noise, so that the attackers
no longer can get help from the predictable token bucket limiter.

Fixes: 4cdf507d54 ("icmp: add a global rate limitation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Keyu Man <kman001@ucr.edu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-16 16:47:09 -07:00
Hoang Huu Le
ec78e31852 tipc: fix incorrect setting window for bcast link
In commit 16ad3f4022
("tipc: introduce variable window congestion control"), we applied
the algorithm to select window size from minimum window to the
configured maximum window for unicast link, and, besides we chose
to keep the window size for broadcast link unchanged and equal (i.e
fix window 50)

However, when setting maximum window variable via command, the window
variable was re-initialized to unexpect value (i.e 32).

We fix this by updating the fix window for broadcast as we stated.

Fixes: 16ad3f4022 ("tipc: introduce variable window congestion control")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-16 14:09:12 -07:00
Hoang Huu Le
75cee397ae tipc: re-configure queue limit for broadcast link
The queue limit of the broadcast link is being calculated base on initial
MTU. However, when MTU value changed (e.g manual changing MTU on NIC
device, MTU negotiation etc.,) we do not re-calculate queue limit.
This gives throughput does not reflect with the change.

So fix it by calling the function to re-calculate queue limit of the
broadcast link.

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-16 14:09:12 -07:00
Dan Aloni
c327a310ec svcrdma: fix bounce buffers for unaligned offsets and multiple pages
This was discovered using O_DIRECT at the client side, with small
unaligned file offsets or IOs that span multiple file pages.

Fixes: e248aa7be8 ("svcrdma: Remove max_sge check at connect time")
Signed-off-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-16 15:15:04 -04:00
Artur Molchanov
c09f56b8f6 net/sunrpc: Fix return value for sysctl sunrpc.transports
Fix returning value for sysctl sunrpc.transports.
Return error code from sysctl proc_handler function proc_do_xprt instead of number of the written bytes.
Otherwise sysctl returns random garbage for this key.

Since v1:
- Handle negative returned value from memory_read_from_buffer as an error

Signed-off-by: Artur Molchanov <arturmolchanov@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-16 15:15:04 -04:00
Linus Torvalds
9ff9b0d392 networking changes for the 5.10 merge window
Add redirect_neigh() BPF packet redirect helper, allowing to limit stack
 traversal in common container configs and improving TCP back-pressure.
 Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.
 
 Expand netlink policy support and improve policy export to user space.
 (Ge)netlink core performs request validation according to declared
 policies. Expand the expressiveness of those policies (min/max length
 and bitmasks). Allow dumping policies for particular commands.
 This is used for feature discovery by user space (instead of kernel
 version parsing or trial and error).
 
 Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge.
 
 Allow more than 255 IPv4 multicast interfaces.
 
 Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
 packets of TCPv6.
 
 In Multi-patch TCP (MPTCP) support concurrent transmission of data
 on multiple subflows in a load balancing scenario. Enhance advertising
 addresses via the RM_ADDR/ADD_ADDR options.
 
 Support SMC-Dv2 version of SMC, which enables multi-subnet deployments.
 
 Allow more calls to same peer in RxRPC.
 
 Support two new Controller Area Network (CAN) protocols -
 CAN-FD and ISO 15765-2:2016.
 
 Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
 kernel problem.
 
 Add TC actions for implementing MPLS L2 VPNs.
 
 Improve nexthop code - e.g. handle various corner cases when nexthop
 objects are removed from groups better, skip unnecessary notifications
 and make it easier to offload nexthops into HW by converting
 to a blocking notifier.
 
 Support adding and consuming TCP header options by BPF programs,
 opening the doors for easy experimental and deployment-specific
 TCP option use.
 
 Reorganize TCP congestion control (CC) initialization to simplify life
 of TCP CC implemented in BPF.
 
 Add support for shipping BPF programs with the kernel and loading them
 early on boot via the User Mode Driver mechanism, hence reusing all the
 user space infra we have.
 
 Support sleepable BPF programs, initially targeting LSM and tracing.
 
 Add bpf_d_path() helper for returning full path for given 'struct path'.
 
 Make bpf_tail_call compatible with bpf-to-bpf calls.
 
 Allow BPF programs to call map_update_elem on sockmaps.
 
 Add BPF Type Format (BTF) support for type and enum discovery, as
 well as support for using BTF within the kernel itself (current use
 is for pretty printing structures).
 
 Support listing and getting information about bpf_links via the bpf
 syscall.
 
 Enhance kernel interfaces around NIC firmware update. Allow specifying
 overwrite mask to control if settings etc. are reset during update;
 report expected max time operation may take to users; support firmware
 activation without machine reboot incl. limits of how much impact
 reset may have (e.g. dropping link or not).
 
 Extend ethtool configuration interface to report IEEE-standard
 counters, to limit the need for per-vendor logic in user space.
 
 Adopt or extend devlink use for debug, monitoring, fw update
 in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw,
 mv88e6xxx, dpaa2-eth).
 
 In mlxsw expose critical and emergency SFP module temperature alarms.
 Refactor port buffer handling to make the defaults more suitable and
 support setting these values explicitly via the DCBNL interface.
 
 Add XDP support for Intel's igb driver.
 
 Support offloading TC flower classification and filtering rules to
 mscc_ocelot switches.
 
 Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
 fixed interval period pulse generator and one-step timestamping in
 dpaa-eth.
 
 Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
 offload.
 
 Add Lynx PHY/PCS MDIO module, and convert various drivers which have
 this HW to use it. Convert mvpp2 to split PCS.
 
 Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
 7-port Mediatek MT7531 IP.
 
 Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
 and wcn3680 support in wcn36xx.
 
 Improve performance for packets which don't require much offloads
 on recent Mellanox NICs by 20% by making multiple packets share
 a descriptor entry.
 
 Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto
 subtree to drivers/net. Move MDIO drivers out of the phy directory.
 
 Clean up a lot of W=1 warnings, reportedly the actively developed
 subsections of networking drivers should now build W=1 warning free.
 
 Make sure drivers don't use in_interrupt() to dynamically adapt their
 code. Convert tasklets to use new tasklet_setup API (sadly this
 conversion is not yet complete).
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+ItRwACgkQMUZtbf5S
 IrtTMg//UxpdR/MirT1DatBU0K/UGAZY82hV7F/UC8tPgjfHZeHvWlDFxfi3YP81
 PtPKbhRZ7DhwBXefUp6nY3UdvjftrJK2lJm8prJUPSsZRye8Wlcb7y65q7/P2y2U
 Efucyopg6RUrmrM0DUsIGYGJgylQLHnMYUl/keCsD4t5Bp4ksyi9R2t5eitGoWzh
 r3QGdbSa0AuWx4iu0i+tqp6Tj0ekMBMXLVb35dtU1t0joj2KTNEnSgABN3prOa8E
 iWYf2erOau68Ogp3yU3miCy0ZU4p/7qGHTtzbcp677692P/ekak6+zmfHLT9/Pjy
 2Stq2z6GoKuVxdktr91D9pA3jxG4LxSJmr0TImcGnXbvkMP3Ez3g9RrpV5fn8j6F
 mZCH8TKZAoD5aJrAJAMkhZmLYE1pvDa7KolSk8WogXrbCnTEb5Nv8FHTS1Qnk3yl
 wSKXuvutFVNLMEHCnWQLtODbTST9DI/aOi6EctPpuOA/ZyL1v3pl+gfp37S+LUTe
 owMnT/7TdvKaTD0+gIyU53M6rAWTtr5YyRQorX9awIu/4Ha0F0gYD7BJZQUGtegp
 HzKt59NiSrFdbSH7UdyemdBF4LuCgIhS7rgfeoUXMXmuPHq7eHXyHZt5dzPPa/xP
 81P0MAvdpFVwg8ij2yp2sHS7sISIRKq17fd1tIewUabxQbjXqPc=
 =bc1U
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:

 - Add redirect_neigh() BPF packet redirect helper, allowing to limit
   stack traversal in common container configs and improving TCP
   back-pressure.

   Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

 - Expand netlink policy support and improve policy export to user
   space. (Ge)netlink core performs request validation according to
   declared policies. Expand the expressiveness of those policies
   (min/max length and bitmasks). Allow dumping policies for particular
   commands. This is used for feature discovery by user space (instead
   of kernel version parsing or trial and error).

 - Support IGMPv3/MLDv2 multicast listener discovery protocols in
   bridge.

 - Allow more than 255 IPv4 multicast interfaces.

 - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
   packets of TCPv6.

 - In Multi-patch TCP (MPTCP) support concurrent transmission of data on
   multiple subflows in a load balancing scenario. Enhance advertising
   addresses via the RM_ADDR/ADD_ADDR options.

 - Support SMC-Dv2 version of SMC, which enables multi-subnet
   deployments.

 - Allow more calls to same peer in RxRPC.

 - Support two new Controller Area Network (CAN) protocols - CAN-FD and
   ISO 15765-2:2016.

 - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
   kernel problem.

 - Add TC actions for implementing MPLS L2 VPNs.

 - Improve nexthop code - e.g. handle various corner cases when nexthop
   objects are removed from groups better, skip unnecessary
   notifications and make it easier to offload nexthops into HW by
   converting to a blocking notifier.

 - Support adding and consuming TCP header options by BPF programs,
   opening the doors for easy experimental and deployment-specific TCP
   option use.

 - Reorganize TCP congestion control (CC) initialization to simplify
   life of TCP CC implemented in BPF.

 - Add support for shipping BPF programs with the kernel and loading
   them early on boot via the User Mode Driver mechanism, hence reusing
   all the user space infra we have.

 - Support sleepable BPF programs, initially targeting LSM and tracing.

 - Add bpf_d_path() helper for returning full path for given 'struct
   path'.

 - Make bpf_tail_call compatible with bpf-to-bpf calls.

 - Allow BPF programs to call map_update_elem on sockmaps.

 - Add BPF Type Format (BTF) support for type and enum discovery, as
   well as support for using BTF within the kernel itself (current use
   is for pretty printing structures).

 - Support listing and getting information about bpf_links via the bpf
   syscall.

 - Enhance kernel interfaces around NIC firmware update. Allow
   specifying overwrite mask to control if settings etc. are reset
   during update; report expected max time operation may take to users;
   support firmware activation without machine reboot incl. limits of
   how much impact reset may have (e.g. dropping link or not).

 - Extend ethtool configuration interface to report IEEE-standard
   counters, to limit the need for per-vendor logic in user space.

 - Adopt or extend devlink use for debug, monitoring, fw update in many
   drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
   dpaa2-eth).

 - In mlxsw expose critical and emergency SFP module temperature alarms.
   Refactor port buffer handling to make the defaults more suitable and
   support setting these values explicitly via the DCBNL interface.

 - Add XDP support for Intel's igb driver.

 - Support offloading TC flower classification and filtering rules to
   mscc_ocelot switches.

 - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
   fixed interval period pulse generator and one-step timestamping in
   dpaa-eth.

 - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
   offload.

 - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
   this HW to use it. Convert mvpp2 to split PCS.

 - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
   7-port Mediatek MT7531 IP.

 - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
   and wcn3680 support in wcn36xx.

 - Improve performance for packets which don't require much offloads on
   recent Mellanox NICs by 20% by making multiple packets share a
   descriptor entry.

 - Move chelsio inline crypto drivers (for TLS and IPsec) from the
   crypto subtree to drivers/net. Move MDIO drivers out of the phy
   directory.

 - Clean up a lot of W=1 warnings, reportedly the actively developed
   subsections of networking drivers should now build W=1 warning free.

 - Make sure drivers don't use in_interrupt() to dynamically adapt their
   code. Convert tasklets to use new tasklet_setup API (sadly this
   conversion is not yet complete).

* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
  Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
  net, sockmap: Don't call bpf_prog_put() on NULL pointer
  bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
  bpf, sockmap: Add locking annotations to iterator
  netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
  net: fix pos incrementment in ipv6_route_seq_next
  net/smc: fix invalid return code in smcd_new_buf_create()
  net/smc: fix valid DMBE buffer sizes
  net/smc: fix use-after-free of delayed events
  bpfilter: Fix build error with CONFIG_BPFILTER_UMH
  cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
  net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
  bpf: Fix register equivalence tracking.
  rxrpc: Fix loss of final ack on shutdown
  rxrpc: Fix bundle counting for exclusive connections
  netfilter: restore NF_INET_NUMHOOKS
  ibmveth: Identify ingress large send packets.
  ibmveth: Switch order of ibmveth_helper calls.
  cxgb4: handle 4-tuple PEDIT to NAT mode translation
  selftests: Add VRF route leaking tests
  ...
2020-10-15 18:42:13 -07:00
Jakub Kicinski
105faa8742 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-10-15

The main changes are:

1) Fix register equivalence tracking in verifier, from Alexei Starovoitov.

2) Fix sockmap error path to not call bpf_prog_put() with NULL, from Alex Dewar.

3) Fix sockmap to add locking annotations to iterator, from Lorenz Bauer.

4) Fix tcp_hdr_options test to use loopback address, from Martin KaFai Lau.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-15 12:45:00 -07:00
Jakub Kicinski
2295cddf99 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflicts in net/mptcp/protocol.h and
tools/testing/selftests/net/Makefile.

In both cases code was added on both sides in the same place
so just keep both.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-15 12:43:21 -07:00
Jakub Kicinski
2ecbc1f684 Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
This reverts commit 1d273fcc2c.

Alexei points out there's nothing implying headers will be built
and therefore exist under usr/include, so this fix doesn't make
much sense.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-15 12:33:24 -07:00
Alex Dewar
83c11c1755 net, sockmap: Don't call bpf_prog_put() on NULL pointer
If bpf_prog_inc_not_zero() fails for skb_parser, then bpf_prog_put() is
called unconditionally on skb_verdict, even though it may be NULL. Fix
and tidy up error path.

Fixes: 743df8b774 ("bpf, sockmap: Check skb_verdict and skb_parser programs explicitly")
Addresses-Coverity-ID: 1497799: Null pointer dereferences (FORWARD_NULL)
Signed-off-by: Alex Dewar <alex.dewar90@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20201012170952.60750-1-alex.dewar90@gmail.com
2020-10-15 21:05:23 +02:00
Lorenz Bauer
f58423aeab bpf, sockmap: Add locking annotations to iterator
The sparse checker currently outputs the following warnings:

    include/linux/rcupdate.h:632:9: sparse: sparse: context imbalance in 'sock_hash_seq_start' - wrong count at exit
    include/linux/rcupdate.h:632:9: sparse: sparse: context imbalance in 'sock_map_seq_start' - wrong count at exit

Add the necessary __acquires and __release annotations to make the
iterator locking schema palatable to sparse. Also add __must_hold
for good measure.

The kernel codebase uses both __acquires(rcu) and __acquires(RCU).
I couldn't find any guidance which one is preferred, so I used
what is easier to type out.

Fixes: 0365351524 ("net: Allow iterating sockmap and sockhash")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20201012091850.67452-1-lmb@cloudflare.com
2020-10-15 20:49:56 +02:00
Davide Caratti
346e320cb2 netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
nftables payload statements are used to mangle SCTP headers, but they can
only replace the Internet Checksum. As a consequence, nftables rules that
mangle sport/dport/vtag in SCTP headers potentially generate packets that
are discarded by the receiver, unless the CRC-32C is "offloaded" (e.g the
rule mangles a skb having 'ip_summed' equal to 'CHECKSUM_PARTIAL'.

Fix this extending uAPI definitions and L4 checksum update function, in a
way that userspace programs (e.g. nft) can instruct the kernel to compute
CRC-32C in SCTP headers. Also ensure that LIBCRC32C is built if NF_TABLES
is 'y' or 'm' in the kernel build configuration.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-15 11:45:19 -07:00
Yonghong Song
6617dfd440 net: fix pos incrementment in ipv6_route_seq_next
Commit 4fc427e051 ("ipv6_route_seq_next should increase position index")
tried to fix the issue where seq_file pos is not increased
if a NULL element is returned with seq_ops->next(). See bug
  https://bugzilla.kernel.org/show_bug.cgi?id=206283
The commit effectively does:
  - increase pos for all seq_ops->start()
  - increase pos for all seq_ops->next()

For ipv6_route, increasing pos for all seq_ops->next() is correct.
But increasing pos for seq_ops->start() is not correct
since pos is used to determine how many items to skip during
seq_ops->start():
  iter->skip = *pos;
seq_ops->start() just fetches the *current* pos item.
The item can be skipped only after seq_ops->show() which essentially
is the beginning of seq_ops->next().

For example, I have 7 ipv6 route entries,
  root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=4096
  00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001     eth0
  fe800000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000001 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001       lo
  fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
  ff000000000000000000000000000000 08 00000000000000000000000000000000 00 00000000000000000000000000000000 00000100 00000004 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  0+1 records in
  0+1 records out
  1050 bytes (1.0 kB, 1.0 KiB) copied, 0.00707908 s, 148 kB/s
  root@arch-fb-vm1:~/net-next

In the above, I specify buffer size 4096, so all records can be returned
to user space with a single trip to the kernel.

If I use buffer size 128, since each record size is 149, internally
kernel seq_read() will read 149 into its internal buffer and return the data
to user space in two read() syscalls. Then user read() syscall will trigger
next seq_ops->start(). Since the current implementation increased pos even
for seq_ops->start(), it will skip record #2, #4 and #6, assuming the first
record is #1.

  root@arch-fb-vm1:~/net-next dd if=/proc/net/ipv6_route bs=128
  00000000000000000000000000000000 40 00000000000000000000000000000000 00 00000000000000000000000000000000 00000400 00000001 00000000 00000001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
  fe800000000000002050e3fffebd3be8 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80200001     eth0
  00000000000000000000000000000000 00 00000000000000000000000000000000 00 00000000000000000000000000000000 ffffffff 00000001 00000000 00200200       lo
4+1 records in
4+1 records out
600 bytes copied, 0.00127758 s, 470 kB/s

To fix the problem, create a fake pos pointer so seq_ops->start()
won't actually increase seq_file pos. With this fix, the
above `dd` command with `bs=128` will show correct result.

Fixes: 4fc427e051 ("ipv6_route_seq_next should increase position index")
Cc: Alexei Starovoitov <ast@kernel.org>
Suggested-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-15 10:21:28 -07:00
Karsten Graul
6b1bbf94ab net/smc: fix invalid return code in smcd_new_buf_create()
smc_ism_register_dmb() returns error codes set by the ISM driver which
are not guaranteed to be negative or in the errno range. Such values
would not be handled by ERR_PTR() and finally the return code will be
used as a memory address.
Fix that by using a valid negative errno value with ERR_PTR().

Fixes: 72b7f6c487 ("net/smc: unique reason code for exceeded max dmb count")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-15 09:54:43 -07:00
Karsten Graul
ef12ad4588 net/smc: fix valid DMBE buffer sizes
The SMCD_DMBE_SIZES should include all valid DMBE buffer sizes, so the
correct value is 6 which means 1MB. With 7 the registration of an ISM
buffer would always fail because of the invalid size requested.
Fix that and set the value to 6.

Fixes: c6ba7c9ba4 ("net/smc: add base infrastructure for SMC-D and ISM")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-15 09:54:43 -07:00
Karsten Graul
d535ca1367 net/smc: fix use-after-free of delayed events
When a delayed event is enqueued then the event worker will send this
event the next time it is running and no other flow is currently
active. The event handler is called for the delayed event, and the
pointer to the event keeps set in lgr->delayed_event. This pointer is
cleared later in the processing by smc_llc_flow_start().
This can lead to a use-after-free condition when the processing does not
reach smc_llc_flow_start(), but frees the event because of an error
situation. Then the delayed_event pointer is still set but the event is
freed.
Fix this by always clearing the delayed event pointer when the event is
provided to the event handler for processing, and remove the code to
clear it in smc_llc_flow_start().

Fixes: 555da9af82 ("net/smc: add event-based llc_flow framework")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-15 09:54:43 -07:00
YueHaibing
1d273fcc2c bpfilter: Fix build error with CONFIG_BPFILTER_UMH
IF CONFIG_BPFILTER_UMH is set, building fails:

In file included from /usr/include/sys/socket.h:33:0,
                 from net/bpfilter/main.c:6:
/usr/include/bits/socket.h:390:10: fatal error: asm/socket.h: No such file or directory
 #include <asm/socket.h>
          ^~~~~~~~~~~~~~
compilation terminated.
scripts/Makefile.userprogs:43: recipe for target 'net/bpfilter/main.o' failed
make[2]: *** [net/bpfilter/main.o] Error 1

Add missing include path to fix this.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-15 09:36:14 -07:00
David Howells
ddc7834af8 rxrpc: Fix loss of final ack on shutdown
Fix the loss of transmission of a call's final ack when a socket gets shut
down.  This means that the server will retransmit the last data packet or
send a ping ack and then get an ICMP indicating the port got closed.  The
server will then view this as a failure.

Fixes: 3136ef49a1 ("rxrpc: Delay terminal ACK transmission on a client call")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-15 13:28:00 +01:00
David Howells
f3af4ad1e0 rxrpc: Fix bundle counting for exclusive connections
Fix rxrpc_unbundle_conn() to not drop the bundle usage count when cleaning
up an exclusive connection.

Based on the suggested fix from Hillf Danton.

Fixes: 245500d853 ("rxrpc: Rewrite the client connection manager")
Reported-by: syzbot+d57aaf84dd8a550e6d91@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hillf Danton <hdanton@sina.com>
2020-10-15 13:28:00 +01:00
Pablo Neira Ayuso
d25e2e9388 netfilter: restore NF_INET_NUMHOOKS
This definition is used by the iptables legacy UAPI, restore it.

Fixes: d3519cb89f ("netfilter: nf_tables: add inet ingress support")
Reported-by: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-14 20:28:05 -07:00
Mathieu Desnoyers
272928d1cd ipv6/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2)
As per RFC4443, the destination address field for ICMPv6 error messages
is copied from the source address field of the invoking packet.

In configurations with Virtual Routing and Forwarding tables, looking up
which routing table to use for sending ICMPv6 error messages is
currently done by using the destination net_device.

If the source and destination interfaces are within separate VRFs, or
one in the global routing table and the other in a VRF, looking up the
source address of the invoking packet in the destination interface's
routing table will fail if the destination interface's routing table
contains no route to the invoking packet's source address.

One observable effect of this issue is that traceroute6 does not work in
the following cases:

- Route leaking between global routing table and VRF
- Route leaking between VRFs

Use the source device routing table when sending ICMPv6 error
messages.

[ In the context of ipv4, it has been pointed out that a similar issue
  may exist with ICMP errors triggered when forwarding between network
  namespaces. It would be worthwhile to investigate whether ipv6 has
  similar issues, but is outside of the scope of this investigation. ]

[ Testing shows that similar issues exist with ipv6 unreachable /
  fragmentation needed messages.  However, investigation of this
  additional failure mode is beyond this investigation's scope. ]

Link: https://tools.ietf.org/html/rfc4443
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-14 17:14:23 -07:00
Mathieu Desnoyers
e1e84eb58e ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2)
As per RFC792, ICMP errors should be sent to the source host.

However, in configurations with Virtual Routing and Forwarding tables,
looking up which routing table to use is currently done by using the
destination net_device.

commit 9d1a6c4ea4 ("net: icmp_route_lookup should use rt dev to
determine L3 domain") changes the interface passed to
l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev
to skb_dst(skb_in)->dev. This effectively uses the destination device
rather than the source device for choosing which routing table should be
used to lookup where to send the ICMP error.

Therefore, if the source and destination interfaces are within separate
VRFs, or one in the global routing table and the other in a VRF, looking
up the source host in the destination interface's routing table will
fail if the destination interface's routing table contains no route to
the source host.

One observable effect of this issue is that traceroute does not work in
the following cases:

- Route leaking between global routing table and VRF
- Route leaking between VRFs

Preferably use the source device routing table when sending ICMP error
messages. If no source device is set, fall-back on the destination
device routing table. Else, use the main routing table (index 0).

[ It has been pointed out that a similar issue may exist with ICMP
  errors triggered when forwarding between network namespaces. It would
  be worthwhile to investigate, but is outside of the scope of this
  investigation. ]

[ It has also been pointed out that a similar issue exists with
  unreachable / fragmentation needed messages, which can be triggered by
  changing the MTU of eth1 in r1 to 1400 and running:

  ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2

  Some investigation points to raw_icmp_error() and raw_err() as being
  involved in this last scenario. The focus of this patch is TTL expired
  ICMP messages, which go through icmp_route_lookup.
  Investigation of failure modes related to raw_icmp_error() is beyond
  this investigation's scope. ]

Fixes: 9d1a6c4ea4 ("net: icmp_route_lookup should use rt dev to determine L3 domain")
Link: https://tools.ietf.org/html/rfc792
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-14 17:14:11 -07:00
Jakub Kicinski
1e40d75ef9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Extend nf_queue selftest to cover re-queueing, non-gso mode and
   delayed queueing, from Florian Westphal.

2) Clear skb->tstamp in IPVS forwarding path, from Julian Anastasov.

3) Provide netlink extended error reporting for EEXIST case.

4) Missing VLAN offload tag and proto in log target.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 20:02:54 -07:00
Cong Wang
fdafed4599 ip_gre: set dev->hard_header_len and dev->needed_headroom properly
GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
conditionally. When it is set, it assumes the outer IP header is
already created before ipgre_xmit().

This is not true when we send packets through a raw packet socket,
where L2 headers are supposed to be constructed by user. Packet
socket calls dev_validate_header() to validate the header. But
GRE tunnel does not set dev->hard_header_len, so that check can
be simply bypassed, therefore uninit memory could be passed down
to ipgre_xmit(). Similar for dev->needed_headroom.

dev->hard_header_len is supposed to be the length of the header
created by dev->header_ops->create(), so it should be used whenever
header_ops is set, and dev->needed_headroom should be used when it
is not set.

Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com
Cc: William Tu <u9012063@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 18:35:29 -07:00
Heiner Kallweit
5fc3594d36 xfrm: use new function dev_fetch_sw_netstats
Simplify the code by using new function dev_fetch_sw_netstats().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/a6b816f4-bbf2-9db0-d59a-7e4e9cc808fe@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:33:49 -07:00
Heiner Kallweit
3569939a81 net: openvswitch: use new function dev_fetch_sw_netstats
Simplify the code by using new function dev_fetch_sw_netstats().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/5e52dc91-97b1-82b0-214b-65d404e4a2ec@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:33:49 -07:00
Heiner Kallweit
6401297e76 mac80211: use new function dev_fetch_sw_netstats
Simplify the code by using new function dev_fetch_sw_netstats().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/93dda477-70ae-0ccf-71b4-bfebb66c9beb@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:33:49 -07:00
Heiner Kallweit
cf89f18fa4 iptunnel: use new function dev_fetch_sw_netstats
Simplify the code by using new function dev_fetch_sw_netstats().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/050f9a83-b195-a3d6-edbd-91a59040be21@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:33:49 -07:00
Heiner Kallweit
a0d2698101 net: dsa: use new function dev_fetch_sw_netstats
Simplify the code by using new function dev_fetch_sw_netstats().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/b6047017-8226-6b7e-a3cd-064e69fdfa27@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:33:49 -07:00
Heiner Kallweit
f3f04f0f3a net: bridge: use new function dev_fetch_sw_netstats
Simplify the code by using new function dev_fetch_sw_netstats().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/d1c3ff29-5691-9d54-d164-16421905fa59@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:33:49 -07:00
Heiner Kallweit
44fa32f008 net: add function dev_fetch_sw_netstats for fetching pcpu_sw_netstats
In several places the same code is used to populate rtnl_link_stats64
fields with data from pcpu_sw_netstats. Therefore factor out this code
to a new function dev_fetch_sw_netstats().

v2:
- constify argument netstats
- don't ignore netstats being NULL or an ERRPTR
- switch to EXPORT_SYMBOL_GPL

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/6d16a338-52f5-df69-0020-6bc771a7d498@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:33:48 -07:00
Or Cohen
c9bf52a173 net/af_unix: Remove unused old_pid variable
Commit 109f6e39fa ("af_unix: Allow SO_PEERCRED
to work across namespaces.") introduced the old_pid variable
in unix_listen, but it's never used.
Remove the declaration and the call to put_pid.

Signed-off-by: Or Cohen <orcohen@paloaltonetworks.com>
Link: https://lore.kernel.org/r/20201011153527.18628-1-orcohen@paloaltonetworks.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:21:46 -07:00
Christian Eggers
4e3bbb33e6 socket: don't clear SOCK_TSTAMP_NEW when SO_TIMESTAMPNS is disabled
SOCK_TSTAMP_NEW (timespec64 instead of timespec) is also used for
hardware time stamps (configured via SO_TIMESTAMPING_NEW).

User space (ptp4l) first configures hardware time stamping via
SO_TIMESTAMPING_NEW which sets SOCK_TSTAMP_NEW. In the next step, ptp4l
disables SO_TIMESTAMPNS(_NEW) (software time stamps), but this must not
switch hardware time stamps back to "32 bit mode".

This problem happens on 32 bit platforms were the libc has already
switched to struct timespec64 (from SO_TIMExxx_OLD to SO_TIMExxx_NEW
socket options). ptp4l complains with "missing timestamp on transmitted
peer delay request" because the wrong format is received (and
discarded).

Fixes: 887feae36a ("socket: Add SO_TIMESTAMP[NS]_NEW")
Fixes: 783da70e83 ("net: add sock_enable_timestamps")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:18:18 -07:00
Christian Eggers
59e611a566 socket: fix option SO_TIMESTAMPING_NEW
The comparison of optname with SO_TIMESTAMPING_NEW is wrong way around,
so SOCK_TSTAMP_NEW will first be set and than reset again. Additionally
move it out of the test for SOF_TIMESTAMPING_RX_SOFTWARE as this seems
unrelated.

This problem happens on 32 bit platforms were the libc has already
switched to struct timespec64 (from SO_TIMExxx_OLD to SO_TIMExxx_NEW
socket options). ptp4l complains with "missing timestamp on transmitted
peer delay request" because the wrong format is received (and
discarded).

Fixes: 9718475e69 ("socket: Add SO_TIMESTAMPING_NEW")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reviewed-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:18:18 -07:00
Julia Lawall
0403a2b53c net/tls: use semicolons rather than commas to separate statements
Replace commas with semicolons.  Commas introduce unnecessary
variability in the code structure and are hard to see.  What is done
is essentially described by the following Coccinelle semantic patch
(http://coccinelle.lip6.fr/):

// <smpl>
@@ expression e1,e2; @@
e1
-,
+;
e2
... when any
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/1602412498-32025-6-git-send-email-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:11:52 -07:00
Julia Lawall
6159e9633f net/ipv6: use semicolons rather than commas to separate statements
Replace commas with semicolons.  Commas introduce unnecessary
variability in the code structure and are hard to see.  What is done
is essentially described by the following Coccinelle semantic patch
(http://coccinelle.lip6.fr/):

// <smpl>
@@ expression e1,e2; @@
e1
-,
+;
e2
... when any
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://lore.kernel.org/r/1602412498-32025-5-git-send-email-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:11:52 -07:00
Julia Lawall
44797589c2 tcp: use semicolons rather than commas to separate statements
Replace commas with semicolons.  Commas introduce unnecessary
variability in the code structure and are hard to see.  What is done
is essentially described by the following Coccinelle semantic patch
(http://coccinelle.lip6.fr/):

// <smpl>
@@ expression e1,e2; @@
e1
-,
+;
e2
... when any
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/1602412498-32025-4-git-send-email-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:11:52 -07:00
Pablo Neira Ayuso
0d9826bc18 netfilter: nf_log: missing vlan offload tag and proto
Dump vlan tag and proto for the usual vlan offload case if the
NF_LOG_MACDECODE flag is set on. Without this information the logging is
misleading as there is no reference to the VLAN header.

[12716.993704] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0800 SRC=192.168.10.2 DST=172.217.168.163 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=2548 DF PROTO=TCP SPT=55848 DPT=80 WINDOW=501 RES=0x00 ACK FIN URGP=0
[12721.157643] test: IN=veth0 OUT= MACSRC=86:6c:92:ea:d6:73 MACDST=0e:3b:eb:86:73:76 VPROTO=8100 VID=10 MACPROTO=0806 ARP HTYPE=1 PTYPE=0x0800 OPCODE=2 MACSRC=86:6c:92:ea:d6:73 IPSRC=192.168.10.2 MACDST=0e:3b:eb:86:73:76 IPDST=192.168.10.1

Fixes: 83e96d443b ("netfilter: log: split family specific code to nf_log_{ip,ip6,common}.c files")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-14 01:25:14 +02:00
Linus Torvalds
6ad4bf6ea1 io_uring-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EXPEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiR4EAC3trm1ojXVF7y9/XRhcPpb4Pror+ZA1coO
 gyoy+zUuCEl9WCzzHWqXULMYMP0YzNJnJs0oLQPA1s0sx1H4uDMl/UXg0OXZisYG
 Y59Kca3c1DHFwj9KPQXfGmCEjc/rbDWK5TqRc2iZMz+6E5Mt71UFZHtenwgV1zD8
 hTmZZkzLCu2ePfOvrPONgL5tDqPWGVyn61phoC7cSzMF66juXGYuvQGktzi/m6q+
 jAxUnhKvKTlLB9wsq3s5X/20/QD56Yuba9U+YxeeNDBE8MDWQOsjz0mZCV1fn4p3
 h/6762aRaWaXH7EwMtsHFUWy7arJZg/YoFYNYLv4Ksyy3y4sMABZCy3A+JyzrgQ0
 hMu7vjsP+k22X1WH8nyejBfWNEmxu6dpgckKrgF0dhJcXk/acWA3XaDWZ80UwfQy
 isKRAP1rA0MJKHDMIwCzSQJDPvtUAkPptbNZJcUSU78o+pPoCaQ93V++LbdgGtKn
 iGJJqX05dVbcsDx5X7fluphjkUTC4yFr7ZgLgbOIedXajWRD8iOkO2xxCHk6SKFl
 iv9entvRcX9k3SHK9uffIUkRBUujMU0+HCIQFCO1qGmkCaS5nSrovZl4HoL7L/Dj
 +T8+v7kyJeklLXgJBaE7jk01O4HwZKjwPWMbCjvL9NKk8j7c1soYnRu5uNvi85Mu
 /9wn671s+w==
 =udgj
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - Add blkcg accounting for io-wq offload (Dennis)

 - A use-after-free fix for io-wq (Hillf)

 - Cancelation fixes and improvements

 - Use proper files_struct references for offload

 - Cleanup of io_uring_get_socket() since that can now go into our own
   header

 - SQPOLL fixes and cleanups, and support for sharing the thread

 - Improvement to how page accounting is done for registered buffers and
   huge pages, accounting the real pinned state

 - Series cleaning up the xarray code (Willy)

 - Various cleanups, refactoring, and improvements (Pavel)

 - Use raw spinlock for io-wq (Sebastian)

 - Add support for ring restrictions (Stefano)

* tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (62 commits)
  io_uring: keep a pointer ref_node in file_data
  io_uring: refactor *files_register()'s error paths
  io_uring: clean file_data access in files_register
  io_uring: don't delay io_init_req() error check
  io_uring: clean leftovers after splitting issue
  io_uring: remove timeout.list after hrtimer cancel
  io_uring: use a separate struct for timeout_remove
  io_uring: improve submit_state.ios_left accounting
  io_uring: simplify io_file_get()
  io_uring: kill extra check in fixed io_file_get()
  io_uring: clean up ->files grabbing
  io_uring: don't io_prep_async_work() linked reqs
  io_uring: Convert advanced XArray uses to the normal API
  io_uring: Fix XArray usage in io_uring_add_task_file
  io_uring: Fix use of XArray in __io_uring_files_cancel
  io_uring: fix break condition for __io_uring_register() waiting
  io_uring: no need to call xa_destroy() on empty xarray
  io_uring: batch account ->req_issue and task struct references
  io_uring: kill callback_head argument for io_req_task_work_add()
  io_uring: move req preps out of io_issue_sqe()
  ...
2020-10-13 12:36:21 -07:00
Linus Torvalds
39a5101f98 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Allow DRBG testing through user-space af_alg
   - Add tcrypt speed testing support for keyed hashes
   - Add type-safe init/exit hooks for ahash

  Algorithms:
   - Mark arc4 as obsolete and pending for future removal
   - Mark anubis, khazad, sead and tea as obsolete
   - Improve boot-time xor benchmark
   - Add OSCCA SM2 asymmetric cipher algorithm and use it for integrity

  Drivers:
   - Fixes and enhancement for XTS in caam
   - Add support for XIP8001B hwrng in xiphera-trng
   - Add RNG and hash support in sun8i-ce/sun8i-ss
   - Allow imx-rngc to be used by kernel entropy pool
   - Use crypto engine in omap-sham
   - Add support for Ingenic X1830 with ingenic"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (205 commits)
  X.509: Fix modular build of public_key_sm2
  crypto: xor - Remove unused variable count in do_xor_speed
  X.509: fix error return value on the failed path
  crypto: bcm - Verify GCM/CCM key length in setkey
  crypto: qat - drop input parameter from adf_enable_aer()
  crypto: qat - fix function parameters descriptions
  crypto: atmel-tdes - use semicolons rather than commas to separate statements
  crypto: drivers - use semicolons rather than commas to separate statements
  hwrng: mxc-rnga - use semicolons rather than commas to separate statements
  hwrng: iproc-rng200 - use semicolons rather than commas to separate statements
  hwrng: stm32 - use semicolons rather than commas to separate statements
  crypto: xor - use ktime for template benchmarking
  crypto: xor - defer load time benchmark to a later time
  crypto: hisilicon/zip - fix the uninitalized 'curr_qm_qp_num'
  crypto: hisilicon/zip - fix the return value when device is busy
  crypto: hisilicon/zip - fix zero length input in GZIP decompress
  crypto: hisilicon/zip - fix the uncleared debug registers
  lib/mpi: Fix unused variable warnings
  crypto: x86/poly1305 - Remove assignments with no effect
  hwrng: npcm - modify readl to readb
  ...
2020-10-13 08:50:16 -07:00
Linus Torvalds
85ed13e78d Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat iovec cleanups from Al Viro:
 "Christoph's series around import_iovec() and compat variant thereof"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  security/keys: remove compat_keyctl_instantiate_key_iov
  mm: remove compat_process_vm_{readv,writev}
  fs: remove compat_sys_vmsplice
  fs: remove the compat readv/writev syscalls
  fs: remove various compat readv/writev helpers
  iov_iter: transparently handle compat iovecs in import_iovec
  iov_iter: refactor rw_copy_check_uvector and import_iovec
  iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
  compat.h: fix a spelling error in <linux/compat.h>
2020-10-12 16:35:51 -07:00
Linus Torvalds
c90578360c Merge branch 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull copy_and_csum cleanups from Al Viro:
 "Saner calling conventions for csum_and_copy_..._user() and friends"

[ Removing 800+ lines of code and cleaning stuff up is good  - Linus ]

* 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ppc: propagate the calling conventions change down to csum_partial_copy_generic()
  amd64: switch csum_partial_copy_generic() to new calling conventions
  sparc64: propagate the calling convention changes down to __csum_partial_copy_...()
  xtensa: propagate the calling conventions change down into csum_partial_copy_generic()
  mips: propagate the calling convention change down into __csum_partial_copy_..._user()
  mips: __csum_partial_copy_kernel() has no users left
  mips: csum_and_copy_{to,from}_user() are never called under KERNEL_DS
  sparc32: propagate the calling conventions change down to __csum_partial_copy_sparc_generic()
  i386: propagate the calling conventions change down to csum_partial_copy_generic()
  sh: propage the calling conventions change down to csum_partial_copy_generic()
  m68k: get rid of zeroing destination on error in csum_and_copy_from_user()
  arm: propagate the calling convention changes down to csum_partial_copy_from_user()
  alpha: propagate the calling convention changes down to csum_partial_copy.c helpers
  saner calling conventions for csum_and_copy_..._user()
  csum_and_copy_..._user(): pass 0xffffffff instead of 0 as initial sum
  csum_partial_copy_nocheck(): drop the last argument
  unify generic instances of csum_partial_copy_nocheck()
  icmp_push_reply(): reorder adding the checksum up
  skb_copy_and_csum_bits(): don't bother with the last argument
2020-10-12 16:24:13 -07:00
Jakub Kicinski
ccdf7fae3a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-10-12

The main changes are:

1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong.

2) libbpf relocation support for different size load/store, from Andrii.

3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel.

4) BPF support for per-cpu variables, form Hao.

5) sockmap improvements, from John.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-12 16:16:50 -07:00
Jakub Kicinski
a308283fdb Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Inspect the reply packets coming from DR/TUN and refresh connection
   state and timeout, from longguang yue and Julian Anastasov.

2) Series to add support for the inet ingress chain type in nf_tables.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-12 15:00:36 -07:00
Pablo Neira Ayuso
98a381a7d4 netfilter: nftables: extend error reporting for chain updates
The initial support for netlink extended ACK is missing the chain update
path, which results in misleading error reporting in case of EEXIST.

Fixes 36dd1bcc07 ("netfilter: nf_tables: initial support for extended ACK reporting")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-12 16:54:30 +02:00
Ilya Dryomov
28e1581c3b libceph: clear con->out_msg on Policy::stateful_server faults
con->out_msg must be cleared on Policy::stateful_server
(!CEPH_MSG_CONNECT_LOSSY) faults.  Not doing so botches the
reconnection attempt, because after writing the banner the
messenger moves on to writing the data section of that message
(either from where it got interrupted by the connection reset or
from the beginning) instead of writing struct ceph_msg_connect.
This results in a bizarre error message because the server
sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct
ceph_msg_connect:

  libceph: mds0 (1)172.21.15.45:6828 socket error on write
  ceph: mds0 reconnect start
  libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
  libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
  libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch

AFAICT this bug goes back to the dawn of the kernel client.
The reason it survived for so long is that only MDS sessions
are stateful and only two MDS messages have a data section:
CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare)
and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved).
The connection has to get reset precisely when such message
is being sent -- in this case it was the former.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/47723
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-10-12 15:29:27 +02:00
Ilya Dryomov
a9dfe31e5c libceph: format ceph_entity_addr nonces as unsigned
Match the server side logs.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Ilya Dryomov
5a5036c89f libceph: move a dout in queue_con_delay()
The queued con->work can start executing (and therefore logging)
before we get to this "con->work has been queued" message, making
the logs confusing.  Move it up, with the meaning of "con->work
is about to be queued".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Ilya Dryomov
1b05fae7f2 libceph: switch to the new "osd blocklist add" command
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Ilya Dryomov
0b98acd618 libceph, rbd, ceph: "blacklist" -> "blocklist"
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Ilya Dryomov
3986f9a42e libceph: multiple workspaces for CRUSH computations
Replace a global map->crush_workspace (protected by a global mutex)
with a list of workspaces, up to the number of CPUs + 1.

This is based on a patch from Robin Geuze <robing@nl.team.blue>.
Robin and his team have observed a 10-20% increase in IOPS on all
queue depths and lower CPU usage as well on a high-end all-NVMe
100GbE cluster.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Oliver Hartkopp
f726f3d371 can: remove obsolete version strings
As pointed out by Jakub Kicinski here:
http://lore.kernel.org/r/20201009175751.5c54097f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com
this patch removes the obsolete version information of the different
CAN protocols and the AF_CAN core module.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201012074354.25839-2-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-10-12 10:06:39 +02:00
Oliver Hartkopp
ac911bfeb3 can: isotp: implement cleanups / improvements from review
As pointed out by Jakub Kicinski here:
http://lore.kernel.org/r/20201009175751.5c54097f@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com
this patch addresses the remarked issues:

- remove empty line in comment
- remove default=y for CAN_ISOTP in Kconfig
- make use of pr_notice_once()
- use GFP_ATOMIC instead of gfp_any() in soft hrtimer context

The version strings in the CAN subsystem are removed by a separate patch.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201012074354.25839-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-10-12 10:06:08 +02:00
Anant Thazhemadam
7ca1db21ef net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid
In p9_fd_create_unix, checking is performed to see if the addr (passed
as an argument) is NULL or not.
However, no check is performed to see if addr is a valid address, i.e.,
it doesn't entirely consist of only 0's.
The initialization of sun_server.sun_path to be equal to this faulty
addr value leads to an uninitialized variable, as detected by KMSAN.
Checking for this (faulty addr) and returning a negative error number
appropriately, resolves this issue.

Link: http://lkml.kernel.org/r/20201012042404.2508-1-anant.thazhemadam@gmail.com
Reported-by: syzbot+75d51fe5bf4ebe988518@syzkaller.appspotmail.com
Tested-by: syzbot+75d51fe5bf4ebe988518@syzkaller.appspotmail.com
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2020-10-12 10:05:47 +02:00
John Fastabend
ef5659280e bpf, sockmap: Allow skipping sk_skb parser program
Currently, we often run with a nop parser namely one that just does
this, 'return skb->len'. This happens when either our verdict program
can handle streaming data or it is only looking at socket data such
as IP addresses and other metadata associated with the flow. The second
case is common for a L3/L4 proxy for instance.

So lets allow loading programs without the parser then we can skip
the stream parser logic and avoid having to add a BPF program that
is effectively a nop.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160239297866.8495.13345662302749219672.stgit@john-Precision-5820-Tower
2020-10-11 18:09:44 -07:00
John Fastabend
743df8b774 bpf, sockmap: Check skb_verdict and skb_parser programs explicitly
We are about to allow skb_verdict to run without skb_parser programs
as a first step change code to check each program type specifically.
This should be a mechanical change without any impact to actual result.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160239294756.8495.5796595770890272219.stgit@john-Precision-5820-Tower
2020-10-11 18:09:44 -07:00
John Fastabend
0b17ad25d8 bpf, sockmap: Add memory accounting so skbs on ingress lists are visible
Move skb->sk assignment out of sk_psock_bpf_run() and into individual
callers. Then we can use proper skb_set_owner_r() call to assign a
sk to a skb. This improves things by also charging the truesize against
the sockets sk_rmem_alloc counter. With this done we get some accounting
in place to ensure the memory associated with skbs on the workqueue are
still being accounted for somewhere. Finally, by using skb_set_owner_r
the destructor is setup so we can just let the normal skb_kfree logic
recover the memory. Combined with previous patch dropping skb_orphan()
we now can recover from memory pressure and maintain accounting.

Note, we will charge the skbs against their originating socket even
if being redirected into another socket. Once the skb completes the
redirect op the kfree_skb will give the memory back. This is important
because if we charged the socket we are redirecting to (like it was
done before this series) the sock_writeable() test could fail because
of the skb trying to be sent is already charged against the socket.

Also TLS case is special. Here we wait until we have decided not to
simply PASS the packet up the stack. In the case where we PASS the
packet up the stack we already have an skb which is accounted for on
the TLS socket context.

For the parser case we continue to just set/clear skb->sk this is
because the skb being used here may be combined with other skbs or
turned into multiple skbs depending on the parser logic. For example
the parser could request a payload length greater than skb->len so
that the strparser needs to collect multiple skbs. At any rate
the final result will be handled in the strparser recv callback.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226867513.5692.10579573214635925960.stgit@john-Precision-5820-Tower
2020-10-11 18:00:57 -07:00
John Fastabend
10d58d0063 bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup
Calling skb_orphan() is unnecessary in the strp rcv handler because the skb
is from a skb_clone() in __strp_recv. So it never has a destructor or a
sk assigned. Plus its confusing to read because it might hint to the reader
that the skb could have an sk assigned which is not true. Even if we did
have an sk assigned it would be cleaner to simply wait for the upcoming
kfree_skb().

Additionally, move the comment about strparser clone up so its closer to
the logic it is describing and add to it so that it is more complete.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226865548.5692.9098315689984599579.stgit@john-Precision-5820-Tower
2020-10-11 18:00:57 -07:00
John Fastabend
9047f19e7c bpf, sockmap: Remove dropped data on errors in redirect case
In the sk_skb redirect case we didn't handle the case where we overrun
the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress.
Because we didn't have anything implemented we simply dropped the skb.
This meant data could be dropped if socket memory accounting was in
place.

This fixes the above dropped data case by moving the memory checks
later in the code where we actually do the send or recv. This pushes
those checks into the workqueue and allows us to return an EAGAIN error
which in turn allows us to try again later from the workqueue.

Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226863689.5692.13861422742592309285.stgit@john-Precision-5820-Tower
2020-10-11 18:00:57 -07:00
John Fastabend
29545f4977 bpf, sockmap: Remove skb_set_owner_w wmem will be taken later from sendpage
The skb_set_owner_w is unnecessary here. The sendpage call will create a
fresh skb and set the owner correctly from workqueue. Its also not entirely
harmless because it consumes cycles, but also impacts resource accounting
by increasing sk_wmem_alloc. This is charging the socket we are going to
send to for the skb, but we will put it on the workqueue for some time
before this happens so we are artifically inflating sk_wmem_alloc for
this period. Further, we don't know how many skbs will be used to send the
packet or how it will be broken up when sent over the new socket so
charging it with one big sum is also not correct when the workqueue may
break it up if facing memory pressure. Seeing we don't know how/when
this is going to be sent drop the early accounting.

A later patch will do proper accounting charged on receive socket for
the case where skbs get enqueued on the workqueue.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226861708.5692.17964237936462425136.stgit@john-Precision-5820-Tower
2020-10-11 18:00:57 -07:00
John Fastabend
9ecbfb06a0 bpf, sockmap: On receive programs try to fast track SK_PASS ingress
When we receive an skb and the ingress skb verdict program returns
SK_PASS we currently set the ingress flag and put it on the workqueue
so it can be turned into a sk_msg and put on the sk_msg ingress queue.
Then finally telling userspace with data_ready hook.

Here we observe that if the workqueue is empty then we can try to
convert into a sk_msg type and call data_ready directly without
bouncing through a workqueue. Its a common pattern to have a recv
verdict program for visibility that always returns SK_PASS. In this
case unless there is an ENOMEM error or we overrun the socket we
can avoid the workqueue completely only using it when we fall back
to error cases caused by memory pressure.

By doing this we eliminate another case where data may be dropped
if errors occur on memory limits in workqueue.

Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226859704.5692.12929678876744977669.stgit@john-Precision-5820-Tower
2020-10-11 18:00:57 -07:00
John Fastabend
cfea28f890 bpf, sockmap: Skb verdict SK_PASS to self already checked rmem limits
For sk_skb case where skb_verdict program returns SK_PASS to continue to
pass packet up the stack, the memory limits were already checked before
enqueuing in skb_queue_tail from TCP side. So, lets remove the extra checks
here. The theory is if the TCP stack believes we have memory to receive
the packet then lets trust the stack and not double check the limits.

In fact the accounting here can cause a drop if sk_rmem_alloc has increased
after the stack accepted this packet, but before the duplicate check here.
And worse if this happens because TCP stack already believes the data has
been received there is no retransmit.

Fixes: 51199405f9 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/160226857664.5692.668205469388498375.stgit@john-Precision-5820-Tower
2020-10-11 18:00:57 -07:00
Julian Anastasov
7980d2eabd ipvs: clear skb->tstamp in forwarding path
fq qdisc requires tstamp to be cleared in forwarding path

Reported-by: Evgeny B <abt-admin@mail.ru>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209427
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 8203e2d844 ("net: clear skb->tstamp in forwarding paths")
Fixes: fb420d5d91 ("tcp/fq: move back to CLOCK_MONOTONIC")
Fixes: 80b14dee2b ("net: Add a new socket option for a future transmit time.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-12 01:59:41 +02:00
Pablo Neira Ayuso
793d5d6124 netfilter: flowtable: reduce calls to pskb_may_pull()
Make two unfront calls to pskb_may_pull() to linearize the network and
transport header.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-12 01:58:10 +02:00
Pablo Neira Ayuso
d3519cb89f netfilter: nf_tables: add inet ingress support
This patch adds a new ingress hook for the inet family. The inet ingress
hook emulates the IP receive path code, therefore, unclean packets are
drop before walking over the ruleset in this basechain.

This patch also introduces the nft_base_chain_netdev() helper function
to check if this hook is bound to one or more devices (through the hook
list infrastructure). This check allows to perform the same handling for
the inet ingress as it would be a netdev ingress chain from the control
plane.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-10-12 01:57:34 +02:00