57691 Commits

Author SHA1 Message Date
c5f095baa8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Add NFT_CHAIN_POLICY_UNSET to replace hardcoded -1 to
   specify that the chain policy is unset. The chain policy
   field is actually defined as an 8-bit unsigned integer.

2) Remove always true condition reported by smatch in
   chain policy check.

3) Fix element lookup on dynamic sets, from Florian Westphal.

4) Use __u8 in ebtables uapi header, from Masahiro Yamada.

5) Bogus EBUSY when removing flowtable after chain flush,
   from Laura Garcia Liebana.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 20:15:00 +02:00
dfe5999dc0 net/sched: Set default of CONFIG_NET_TC_SKB_EXT to N
This a new feature, it is preferred that it defaults to N.
We will probe the feature support from userspace before actually using it.

Fixes: 95a7233c452a ('net: openvswitch: Set OvS recirc_id from tc chain index')
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 20:08:28 +02:00
3c30819dc6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-09-27

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix libbpf's BTF dumper to not skip anonymous enum definitions, from Andrii.

2) Fix BTF verifier issues when handling the BTF of vmlinux, from Alexei.

3) Fix nested calls into bpf_event_output() from TCP sockops BPF
   programs, from Allan.

4) Fix NULL pointer dereference in AF_XDP's xsk map creation when
   allocation fails, from Jonathan.

5) Remove unneeded 64 byte alignment requirement of the AF_XDP UMEM
   headroom, from Bjorn.

6) Remove unused XDP_OPTIONS getsockopt() call which results in an error
   on older kernels, from Toke.

7) Fix a client/server race in tcp_rtt BPF kselftest case, from Stanislav.

8) Fix indentation issue in BTF's btf_enum_check_kflag_member(), from Colin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 16:23:32 +02:00
e3ae1f96ac net: sched: sch_sfb: don't call qdisc_put() while holding tree lock
Recent changes that removed rtnl dependency from rules update path of tc
also made tcf_block_put() function sleeping. This function is called from
ops->destroy() of several Qdisc implementations, which in turn is called by
qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock,
which results sleeping-while-atomic BUG.

Steps to reproduce for sfb:

tc qdisc add dev ens1f0 handle 1: root sfb
tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10
tc qdisc change dev ens1f0 root handle 1: sfb

Resulting dmesg:

[ 7265.938717] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909
[ 7265.940152] in_atomic(): 1, irqs_disabled(): 0, pid: 28579, name: tc
[ 7265.941455] INFO: lockdep is turned off.
[ 7265.942744] CPU: 11 PID: 28579 Comm: tc Tainted: G        W         5.3.0-rc8+ #721
[ 7265.944065] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[ 7265.945396] Call Trace:
[ 7265.946709]  dump_stack+0x85/0xc0
[ 7265.947994]  ___might_sleep.cold+0xac/0xbc
[ 7265.949282]  __mutex_lock+0x5b/0x960
[ 7265.950543]  ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 7265.951803]  ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 7265.953022]  tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 7265.954248]  tcf_block_put_ext.part.0+0x21/0x50
[ 7265.955478]  tcf_block_put+0x50/0x70
[ 7265.956694]  sfq_destroy+0x15/0x50 [sch_sfq]
[ 7265.957898]  qdisc_destroy+0x5f/0x160
[ 7265.959099]  sfb_change+0x175/0x330 [sch_sfb]
[ 7265.960304]  tc_modify_qdisc+0x324/0x840
[ 7265.961503]  rtnetlink_rcv_msg+0x170/0x4b0
[ 7265.962692]  ? netlink_deliver_tap+0x95/0x400
[ 7265.963876]  ? rtnl_dellink+0x2d0/0x2d0
[ 7265.965064]  netlink_rcv_skb+0x49/0x110
[ 7265.966251]  netlink_unicast+0x171/0x200
[ 7265.967427]  netlink_sendmsg+0x224/0x3f0
[ 7265.968595]  sock_sendmsg+0x5e/0x60
[ 7265.969753]  ___sys_sendmsg+0x2ae/0x330
[ 7265.970916]  ? ___sys_recvmsg+0x159/0x1f0
[ 7265.972074]  ? do_wp_page+0x9c/0x790
[ 7265.973233]  ? __handle_mm_fault+0xcd3/0x19e0
[ 7265.974407]  __sys_sendmsg+0x59/0xa0
[ 7265.975591]  do_syscall_64+0x5c/0xb0
[ 7265.976753]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 7265.977938] RIP: 0033:0x7f229069f7b8
[ 7265.979117] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5
4
[ 7265.981681] RSP: 002b:00007ffd7ed2d158 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 7265.983001] RAX: ffffffffffffffda RBX: 000000005d813ca1 RCX: 00007f229069f7b8
[ 7265.984336] RDX: 0000000000000000 RSI: 00007ffd7ed2d1c0 RDI: 0000000000000003
[ 7265.985682] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000165c9a0
[ 7265.987021] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001
[ 7265.988309] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000

In sfb_change() function use qdisc_purge_queue() instead of
qdisc_tree_flush_backlog() to properly reset old child Qdisc and save
pointer to it into local temporary variable. Put reference to Qdisc after
sch tree lock is released in order not to call potentially sleeping cls API
in atomic section. This is safe to do because Qdisc has already been reset
by qdisc_purge_queue() inside sch tree lock critical section.

Reported-by: syzbot+ac54455281db908c581e@syzkaller.appspotmail.com
Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:13:55 +02:00
c2999f7fb0 net: sched: multiq: don't call qdisc_put() while holding tree lock
Recent changes that removed rtnl dependency from rules update path of tc
also made tcf_block_put() function sleeping. This function is called from
ops->destroy() of several Qdisc implementations, which in turn is called by
qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock,
which results sleeping-while-atomic BUG.

Steps to reproduce for multiq:

tc qdisc add dev ens1f0 root handle 1: multiq
tc qdisc add dev ens1f0 parent 1:10 handle 50: sfq perturb 10
ethtool -L ens1f0 combined 2
tc qdisc change dev ens1f0 root handle 1: multiq

Resulting dmesg:

[ 5539.419344] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909
[ 5539.420945] in_atomic(): 1, irqs_disabled(): 0, pid: 27658, name: tc
[ 5539.422435] INFO: lockdep is turned off.
[ 5539.423904] CPU: 21 PID: 27658 Comm: tc Tainted: G        W         5.3.0-rc8+ #721
[ 5539.425400] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[ 5539.426911] Call Trace:
[ 5539.428380]  dump_stack+0x85/0xc0
[ 5539.429823]  ___might_sleep.cold+0xac/0xbc
[ 5539.431262]  __mutex_lock+0x5b/0x960
[ 5539.432682]  ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 5539.434103]  ? __nla_validate_parse+0x51/0x840
[ 5539.435493]  ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 5539.436903]  tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 5539.438327]  tcf_block_put_ext.part.0+0x21/0x50
[ 5539.439752]  tcf_block_put+0x50/0x70
[ 5539.441165]  sfq_destroy+0x15/0x50 [sch_sfq]
[ 5539.442570]  qdisc_destroy+0x5f/0x160
[ 5539.444000]  multiq_tune+0x14a/0x420 [sch_multiq]
[ 5539.445421]  tc_modify_qdisc+0x324/0x840
[ 5539.446841]  rtnetlink_rcv_msg+0x170/0x4b0
[ 5539.448269]  ? netlink_deliver_tap+0x95/0x400
[ 5539.449691]  ? rtnl_dellink+0x2d0/0x2d0
[ 5539.451116]  netlink_rcv_skb+0x49/0x110
[ 5539.452522]  netlink_unicast+0x171/0x200
[ 5539.453914]  netlink_sendmsg+0x224/0x3f0
[ 5539.455304]  sock_sendmsg+0x5e/0x60
[ 5539.456686]  ___sys_sendmsg+0x2ae/0x330
[ 5539.458071]  ? ___sys_recvmsg+0x159/0x1f0
[ 5539.459461]  ? do_wp_page+0x9c/0x790
[ 5539.460846]  ? __handle_mm_fault+0xcd3/0x19e0
[ 5539.462263]  __sys_sendmsg+0x59/0xa0
[ 5539.463661]  do_syscall_64+0x5c/0xb0
[ 5539.465044]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 5539.466454] RIP: 0033:0x7f1fe08177b8
[ 5539.467863] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5
4
[ 5539.470906] RSP: 002b:00007ffe812de5d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 5539.472483] RAX: ffffffffffffffda RBX: 000000005d8135e3 RCX: 00007f1fe08177b8
[ 5539.474069] RDX: 0000000000000000 RSI: 00007ffe812de640 RDI: 0000000000000003
[ 5539.475655] RBP: 0000000000000000 R08: 0000000000000001 R09: 000000000182e9b0
[ 5539.477203] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001
[ 5539.478699] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000

Rearrange locking in multiq_tune() in following ways:

- In loop that removes Qdiscs from disabled queues, call
  qdisc_purge_queue() instead of qdisc_tree_flush_backlog() on Qdisc that
  is being destroyed. Save the Qdisc in temporary allocated array and call
  qdisc_put() on each element of the array after sch tree lock is released.
  This is safe to do because Qdiscs have already been reset by
  qdisc_purge_queue() inside sch tree lock critical section.

- Do the same change for second loop that initializes Qdiscs for newly
  enabled queues in multiq_tune() function. Since sch tree lock is obtained
  and released on each iteration of this loop, just call qdisc_put()
  directly outside of critical section. Don't verify that old Qdisc is not
  noop_qdisc before releasing reference to it because such check is already
  performed by qdisc_put*() functions.

Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:13:55 +02:00
4ce70b4aed net: sched: sch_htb: don't call qdisc_put() while holding tree lock
Recent changes that removed rtnl dependency from rules update path of tc
also made tcf_block_put() function sleeping. This function is called from
ops->destroy() of several Qdisc implementations, which in turn is called by
qdisc_put(). Some Qdiscs call qdisc_put() while holding sch tree spinlock,
which results sleeping-while-atomic BUG.

Steps to reproduce for htb:

tc qdisc add dev ens1f0 root handle 1: htb default 12
tc class add dev ens1f0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps
tc qdisc add dev ens1f0 parent 1:1 handle 40: sfq perturb 10
tc class add dev ens1f0 parent 1:1 classid 1:2 htb rate 100kbps ceil 100kbps

Resulting dmesg:

[ 4791.148551] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:909
[ 4791.151354] in_atomic(): 1, irqs_disabled(): 0, pid: 27273, name: tc
[ 4791.152805] INFO: lockdep is turned off.
[ 4791.153605] CPU: 19 PID: 27273 Comm: tc Tainted: G        W         5.3.0-rc8+ #721
[ 4791.154336] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[ 4791.155075] Call Trace:
[ 4791.155803]  dump_stack+0x85/0xc0
[ 4791.156529]  ___might_sleep.cold+0xac/0xbc
[ 4791.157251]  __mutex_lock+0x5b/0x960
[ 4791.157966]  ? console_unlock+0x363/0x5d0
[ 4791.158676]  ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 4791.159395]  ? tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 4791.160103]  tcf_chain0_head_change_cb_del.isra.0+0x1b/0xf0
[ 4791.160815]  tcf_block_put_ext.part.0+0x21/0x50
[ 4791.161530]  tcf_block_put+0x50/0x70
[ 4791.162233]  sfq_destroy+0x15/0x50 [sch_sfq]
[ 4791.162936]  qdisc_destroy+0x5f/0x160
[ 4791.163642]  htb_change_class.cold+0x5df/0x69d [sch_htb]
[ 4791.164505]  tc_ctl_tclass+0x19d/0x480
[ 4791.165360]  rtnetlink_rcv_msg+0x170/0x4b0
[ 4791.166191]  ? netlink_deliver_tap+0x95/0x400
[ 4791.166907]  ? rtnl_dellink+0x2d0/0x2d0
[ 4791.167625]  netlink_rcv_skb+0x49/0x110
[ 4791.168345]  netlink_unicast+0x171/0x200
[ 4791.169058]  netlink_sendmsg+0x224/0x3f0
[ 4791.169771]  sock_sendmsg+0x5e/0x60
[ 4791.170475]  ___sys_sendmsg+0x2ae/0x330
[ 4791.171183]  ? ___sys_recvmsg+0x159/0x1f0
[ 4791.171894]  ? do_wp_page+0x9c/0x790
[ 4791.172595]  ? __handle_mm_fault+0xcd3/0x19e0
[ 4791.173309]  __sys_sendmsg+0x59/0xa0
[ 4791.174024]  do_syscall_64+0x5c/0xb0
[ 4791.174725]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 4791.175435] RIP: 0033:0x7f0aa41497b8
[ 4791.176129] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 5
4
[ 4791.177532] RSP: 002b:00007fff4e37d588 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 4791.178243] RAX: ffffffffffffffda RBX: 000000005d8132f7 RCX: 00007f0aa41497b8
[ 4791.178947] RDX: 0000000000000000 RSI: 00007fff4e37d5f0 RDI: 0000000000000003
[ 4791.179662] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000020149a0
[ 4791.180382] R10: 0000000000404eda R11: 0000000000000246 R12: 0000000000000001
[ 4791.181100] R13: 000000000047f640 R14: 0000000000000000 R15: 0000000000000000

In htb_change_class() function save parent->leaf.q to local temporary
variable and put reference to it after sch tree lock is released in order
not to call potentially sleeping cls API in atomic section. This is safe to
do because Qdisc has already been reset by qdisc_purge_queue() inside sch
tree lock critical section.

Fixes: c266f64dbfa2 ("net: sched: protect block state with mutex")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:13:55 +02:00
05733434ee net/rds: Check laddr_check before calling it
In rds_bind(), laddr_check is called without checking if it is NULL or
not.  And rs_transport should be reset if rds_add_bound() fails.

Fixes: c5c1a030a7db ("net/rds: An rds_sock is added too early to the hash table")
Reported-by: syzbot+fae39afd2101a17ec624@syzkaller.appspotmail.com
Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:10:55 +02:00
f6c0f5d209 tcp: honor SO_PRIORITY in TIME_WAIT state
ctl packets sent on behalf of TIME_WAIT sockets currently
have a zero skb->priority, which can cause various problems.

In this patch we :

- add a tw_priority field in struct inet_timewait_sock.

- populate it from sk->sk_priority when a TIME_WAIT is created.

- For IPv4, change ip_send_unicast_reply() and its two
  callers to propagate tw_priority correctly.
  ip_send_unicast_reply() no longer changes sk->sk_priority.

- For IPv6, make sure TIME_WAIT sockets pass their tw_priority
  field to tcp_v6_send_response() and tcp_v6_send_ack().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:05:02 +02:00
e9a5dceee5 ipv6: tcp: provide sk->sk_priority to ctl packets
We can populate skb->priority for some ctl packets
instead of always using zero.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:05:02 +02:00
4f6570d720 ipv6: add priority parameter to ip6_xmit()
Currently, ip6_xmit() sets skb->priority based on sk->sk_priority

This is not desirable for TCP since TCP shares the same ctl socket
for a given netns. We want to be able to send RST or ACK packets
with a non zero skb->priority.

This patch has no functional change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:05:02 +02:00
159d2c7d81 sch_netem: fix rcu splat in netem_enqueue()
qdisc_root() use from netem_enqueue() triggers a lockdep warning.

__dev_queue_xmit() uses rcu_read_lock_bh() which is
not equivalent to rcu_read_lock() + local_bh_disable_bh as far
as lockdep is concerned.

WARNING: suspicious RCU usage
5.3.0-rc7+ #0 Not tainted
-----------------------------
include/net/sch_generic.h:492 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by syz-executor427/8855:
 #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: lwtunnel_xmit_redirect include/net/lwtunnel.h:92 [inline]
 #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: ip_finish_output2+0x2dc/0x2570 net/ipv4/ip_output.c:214
 #1: 00000000b5525c01 (rcu_read_lock_bh){....}, at: __dev_queue_xmit+0x20a/0x3650 net/core/dev.c:3804
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: spin_lock include/linux/spinlock.h:338 [inline]
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_xmit_skb net/core/dev.c:3502 [inline]
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_queue_xmit+0x14b8/0x3650 net/core/dev.c:3838

stack backtrace:
CPU: 0 PID: 8855 Comm: syz-executor427 Not tainted 5.3.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5357
 qdisc_root include/net/sch_generic.h:492 [inline]
 netem_enqueue+0x1cfb/0x2d80 net/sched/sch_netem.c:479
 __dev_xmit_skb net/core/dev.c:3527 [inline]
 __dev_queue_xmit+0x15d2/0x3650 net/core/dev.c:3838
 dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
 neigh_hh_output include/net/neighbour.h:500 [inline]
 neigh_output include/net/neighbour.h:509 [inline]
 ip_finish_output2+0x1726/0x2570 net/ipv4/ip_output.c:228
 __ip_finish_output net/ipv4/ip_output.c:308 [inline]
 __ip_finish_output+0x5fc/0xb90 net/ipv4/ip_output.c:290
 ip_finish_output+0x38/0x1f0 net/ipv4/ip_output.c:318
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip_mc_output+0x292/0xf40 net/ipv4/ip_output.c:417
 dst_output include/net/dst.h:436 [inline]
 ip_local_out+0xbb/0x190 net/ipv4/ip_output.c:125
 ip_send_skb+0x42/0xf0 net/ipv4/ip_output.c:1555
 udp_send_skb.isra.0+0x6b2/0x1160 net/ipv4/udp.c:887
 udp_sendmsg+0x1e96/0x2820 net/ipv4/udp.c:1174
 inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:657
 ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
 __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
 do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 10:29:11 +02:00
0355d6c1d5 kcm: disable preemption in kcm_parse_func_strparser()
After commit a2c11b034142 ("kcm: use BPF_PROG_RUN")
syzbot easily triggers the warning in cant_sleep().

As explained in commit 6cab5e90ab2b ("bpf: run bpf programs
with preemption disabled") we need to disable preemption before
running bpf programs.

BUG: assuming atomic context at net/kcm/kcmsock.c:382
in_atomic(): 0, irqs_disabled(): 0, pid: 7, name: kworker/u4:0
3 locks held by kworker/u4:0/7:
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: __write_once_size include/linux/compiler.h:226 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: atomic64_set include/asm-generic/atomic-instrumented.h:855 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: atomic_long_set include/asm-generic/atomic-long.h:40 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: set_work_data kernel/workqueue.c:620 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: set_work_pool_and_clear_pending kernel/workqueue.c:647 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: process_one_work+0x88b/0x1740 kernel/workqueue.c:2240
 #1: ffff8880a989fdc0 ((work_completion)(&strp->work)){+.+.}, at: process_one_work+0x8c1/0x1740 kernel/workqueue.c:2244
 #2: ffff888098998d10 (sk_lock-AF_INET){+.+.}, at: lock_sock include/net/sock.h:1522 [inline]
 #2: ffff888098998d10 (sk_lock-AF_INET){+.+.}, at: strp_sock_lock+0x2e/0x40 net/strparser/strparser.c:440
CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: kstrp strp_work
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 __cant_sleep kernel/sched/core.c:6826 [inline]
 __cant_sleep.cold+0xa4/0xbc kernel/sched/core.c:6803
 kcm_parse_func_strparser+0x54/0x200 net/kcm/kcmsock.c:382
 __strp_recv+0x5dc/0x1b20 net/strparser/strparser.c:221
 strp_recv+0xcf/0x10b net/strparser/strparser.c:343
 tcp_read_sock+0x285/0xa00 net/ipv4/tcp.c:1639
 strp_read_sock+0x14d/0x200 net/strparser/strparser.c:366
 do_strp_work net/strparser/strparser.c:414 [inline]
 strp_work+0xe3/0x130 net/strparser/strparser.c:423
 process_one_work+0x9af/0x1740 kernel/workqueue.c:2269

Fixes: a2c11b034142 ("kcm: use BPF_PROG_RUN")
Fixes: 6cab5e90ab2b ("bpf: run bpf programs with preemption disabled")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 10:27:14 +02:00
ca7a03c417 ipv6: do not free rt if FIB_LOOKUP_NOREF is set on suppress rule
Commit 7d9e5f422150 removed references from certain dsts, but accounting
for this never translated down into the fib6 suppression code. This bug
was triggered by WireGuard users who use wg-quick(8), which uses the
"suppress-prefix" directive to ip-rule(8) for routing all of their
internet traffic without routing loops. The test case added here
causes the reference underflow by causing packets to evaluate a suppress
rule.

Fixes: 7d9e5f422150 ("ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-26 09:34:25 +02:00
ea8564c865 openvswitch: change type of UPCALL_PID attribute to NLA_UNSPEC
userspace openvswitch patch "(dpif-linux: Implement the API
functions to allow multiple handler threads read upcall)"
changes its type from U32 to UNSPEC, but leave the kernel
unchanged

and after kernel 6e237d099fac "(netlink: Relax attr validation
for fixed length types)", this bug is exposed by the below
warning

	[   57.215841] netlink: 'ovs-vswitchd': attribute type 5 has an invalid length.

Fixes: 5cd667b0a456 ("openvswitch: Allow each vport to have an array of 'port_id's")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-26 09:32:33 +02:00
adecda5bee net: print proper warning on dst underflow
Proper warnings with stack traces make it much easier to figure out
what's doing the double free and create more meaningful bug reports from
users.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-26 09:05:56 +02:00
3e8b9bfa11 net/sched: cbs: Fix not adding cbs instance to list
When removing a cbs instance when offloading is enabled, the crash
below can be observed.

The problem happens because that when offloading is enabled, the cbs
instance is not added to the list.

Also, the current code doesn't handle correctly the case when offload
is disabled without removing the qdisc: if the link speed changes the
credit calculations will be wrong. When we create the cbs instance
with offloading enabled, it's not added to the notification list, when
later we disable offloading, it's not in the list, so link speed
changes will not affect it.

The solution for both issues is the same, add the cbs instance being
created unconditionally to the global list, even if the link state
notification isn't useful "right now".

Crash log:

[518758.189866] BUG: kernel NULL pointer dereference, address: 0000000000000000
[518758.189870] #PF: supervisor read access in kernel mode
[518758.189871] #PF: error_code(0x0000) - not-present page
[518758.189872] PGD 0 P4D 0
[518758.189874] Oops: 0000 [#1] SMP PTI
[518758.189876] CPU: 3 PID: 4825 Comm: tc Not tainted 5.2.9 #1
[518758.189877] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS ULTRA/Z390 AORUS ULTRA-CF, BIOS F7 03/14/2019
[518758.189881] RIP: 0010:__list_del_entry_valid+0x29/0xa0
[518758.189883] Code: 90 48 b8 00 01 00 00 00 00 ad de 55 48 8b 17 4c 8b 47 08 48 89 e5 48 39 c2 74 27 48 b8 00 02 00 00 00 00 ad de 49 39 c0 74 2d <49> 8b 30 48 39 fe 75 3d 48 8b 52 08 48 39 f2 75 4c b8 01 00 00 00
[518758.189885] RSP: 0018:ffffa27e43903990 EFLAGS: 00010207
[518758.189887] RAX: dead000000000200 RBX: ffff8bce69f0f000 RCX: 0000000000000000
[518758.189888] RDX: 0000000000000000 RSI: ffff8bce69f0f064 RDI: ffff8bce69f0f1e0
[518758.189890] RBP: ffffa27e43903990 R08: 0000000000000000 R09: ffff8bce69e788c0
[518758.189891] R10: ffff8bce62acd400 R11: 00000000000003cb R12: ffff8bce69e78000
[518758.189892] R13: ffff8bce69f0f140 R14: 0000000000000000 R15: 0000000000000000
[518758.189894] FS:  00007fa1572c8f80(0000) GS:ffff8bce6e0c0000(0000) knlGS:0000000000000000
[518758.189895] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[518758.189896] CR2: 0000000000000000 CR3: 000000040a398006 CR4: 00000000003606e0
[518758.189898] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[518758.189899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[518758.189900] Call Trace:
[518758.189904]  cbs_destroy+0x32/0xa0 [sch_cbs]
[518758.189906]  qdisc_destroy+0x45/0x120
[518758.189907]  qdisc_put+0x25/0x30
[518758.189908]  qdisc_graft+0x2c1/0x450
[518758.189910]  tc_get_qdisc+0x1c8/0x310
[518758.189912]  ? get_page_from_freelist+0x91a/0xcb0
[518758.189914]  rtnetlink_rcv_msg+0x293/0x360
[518758.189916]  ? kmem_cache_alloc_node_trace+0x178/0x260
[518758.189918]  ? __kmalloc_node_track_caller+0x38/0x50
[518758.189920]  ? rtnl_calcit.isra.0+0xf0/0xf0
[518758.189922]  netlink_rcv_skb+0x48/0x110
[518758.189923]  rtnetlink_rcv+0x10/0x20
[518758.189925]  netlink_unicast+0x15b/0x1d0
[518758.189926]  netlink_sendmsg+0x1ea/0x380
[518758.189929]  sock_sendmsg+0x2f/0x40
[518758.189930]  ___sys_sendmsg+0x295/0x2f0
[518758.189932]  ? ___sys_recvmsg+0x151/0x1e0
[518758.189933]  ? do_wp_page+0x7e/0x450
[518758.189935]  __sys_sendmsg+0x48/0x80
[518758.189937]  __x64_sys_sendmsg+0x1a/0x20
[518758.189939]  do_syscall_64+0x53/0x1f0
[518758.189941]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[518758.189942] RIP: 0033:0x7fa15755169a
[518758.189944] Code: 48 c7 c0 ff ff ff ff eb be 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 18 b8 2e 00 00 00 c5 fc 77 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 89 54 24 1c
[518758.189946] RSP: 002b:00007ffda58b60b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[518758.189948] RAX: ffffffffffffffda RBX: 000055e4b836d9a0 RCX: 00007fa15755169a
[518758.189949] RDX: 0000000000000000 RSI: 00007ffda58b6128 RDI: 0000000000000003
[518758.189951] RBP: 00007ffda58b6190 R08: 0000000000000001 R09: 000055e4b9d848a0
[518758.189952] R10: 0000000000000000 R11: 0000000000000246 R12: 000000005d654b49
[518758.189953] R13: 0000000000000000 R14: 00007ffda58b6230 R15: 00007ffda58b6210
[518758.189955] Modules linked in: sch_cbs sch_etf sch_mqprio netlink_diag unix_diag e1000e igb intel_pch_thermal thermal video backlight pcc_cpufreq
[518758.189960] CR2: 0000000000000000
[518758.189961] ---[ end trace 6a13f7aaf5376019 ]---
[518758.189963] RIP: 0010:__list_del_entry_valid+0x29/0xa0
[518758.189964] Code: 90 48 b8 00 01 00 00 00 00 ad de 55 48 8b 17 4c 8b 47 08 48 89 e5 48 39 c2 74 27 48 b8 00 02 00 00 00 00 ad de 49 39 c0 74 2d <49> 8b 30 48 39 fe 75 3d 48 8b 52 08 48 39 f2 75 4c b8 01 00 00 00
[518758.189967] RSP: 0018:ffffa27e43903990 EFLAGS: 00010207
[518758.189968] RAX: dead000000000200 RBX: ffff8bce69f0f000 RCX: 0000000000000000
[518758.189969] RDX: 0000000000000000 RSI: ffff8bce69f0f064 RDI: ffff8bce69f0f1e0
[518758.189971] RBP: ffffa27e43903990 R08: 0000000000000000 R09: ffff8bce69e788c0
[518758.189972] R10: ffff8bce62acd400 R11: 00000000000003cb R12: ffff8bce69e78000
[518758.189973] R13: ffff8bce69f0f140 R14: 0000000000000000 R15: 0000000000000000
[518758.189975] FS:  00007fa1572c8f80(0000) GS:ffff8bce6e0c0000(0000) knlGS:0000000000000000
[518758.189976] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[518758.189977] CR2: 0000000000000000 CR3: 000000040a398006 CR4: 00000000003606e0
[518758.189979] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[518758.189980] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: e0a7683d30e9 ("net/sched: cbs: fix port_rate miscalculation")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-26 09:03:03 +02:00
bf69abad27 net: Fix Kconfig indentation
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
    $ sed -e 's/^        /\t/' -i */Kconfig

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-26 08:56:17 +02:00
9b05b6e11d netfilter: nf_tables: bogus EBUSY when deleting flowtable after flush
The deletion of a flowtable after a flush in the same transaction
results in EBUSY. This patch adds an activation and deactivation of
flowtables in order to update the _use_ counter.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-25 11:01:19 +02:00
3a359798b1 nfc: enforce CAP_NET_RAW for raw sockets
When creating a raw AF_NFC socket, CAP_NET_RAW needs to be checked
first.

Signed-off-by: Ori Nimron <orinimron123@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-24 16:37:18 +02:00
e69dbd4619 ieee802154: enforce CAP_NET_RAW for raw sockets
When creating a raw AF_IEEE802154 socket, CAP_NET_RAW needs to be
checked first.

Signed-off-by: Ori Nimron <orinimron123@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-24 16:37:18 +02:00
0614e2b737 ax25: enforce CAP_NET_RAW for raw sockets
When creating a raw AF_AX25 socket, CAP_NET_RAW needs to be checked
first.

Signed-off-by: Ori Nimron <orinimron123@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-24 16:37:18 +02:00
6cc03e8aa3 appletalk: enforce CAP_NET_RAW for raw sockets
When creating a raw AF_APPLETALK socket, CAP_NET_RAW needs to be checked
first.

Signed-off-by: Ori Nimron <orinimron123@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-24 16:37:18 +02:00
3d66b89c30 net: sched: fix possible crash in tcf_action_destroy()
If the allocation done in tcf_exts_init() failed,
we end up with a NULL pointer in exts->actions.

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8198 Comm: syz-executor.3 Not tainted 5.3.0-rc8+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:tcf_action_destroy+0x71/0x160 net/sched/act_api.c:705
Code: c3 08 44 89 ee e8 4f cb bb fb 41 83 fd 20 0f 84 c9 00 00 00 e8 c0 c9 bb fb 48 89 d8 48 b9 00 00 00 00 00 fc ff df 48 c1 e8 03 <80> 3c 08 00 0f 85 c0 00 00 00 4c 8b 33 4d 85 f6 0f 84 9d 00 00 00
RSP: 0018:ffff888096e16ff0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: dffffc0000000000
RDX: 0000000000040000 RSI: ffffffff85b6ab30 RDI: 0000000000000000
RBP: ffff888096e17020 R08: ffff8880993f6140 R09: fffffbfff11cae67
R10: fffffbfff11cae66 R11: ffffffff88e57333 R12: 0000000000000000
R13: 0000000000000000 R14: ffff888096e177a0 R15: 0000000000000001
FS:  00007f62bc84a700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000758040 CR3: 0000000088b64000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 tcf_exts_destroy+0x38/0xb0 net/sched/cls_api.c:3030
 tcindex_set_parms+0xf7f/0x1e50 net/sched/cls_tcindex.c:488
 tcindex_change+0x230/0x318 net/sched/cls_tcindex.c:519
 tc_new_tfilter+0xa4b/0x1c70 net/sched/cls_api.c:2152
 rtnetlink_rcv_msg+0x838/0xb00 net/core/rtnetlink.c:5214
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5241
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x8a5/0xd60 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:657
 ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
 __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]

Fixes: 90b73b77d08e ("net: sched: change action API to use array of pointers to actions")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-24 16:33:57 +02:00
199ce850ce net_sched: add policy validation for action attributes
Similar to commit 8b4c3cdd9dd8
("net: sched: Add policy validation for tc attributes"), we need
to add proper policy validation for TC action attributes too.

Cc: David Ahern <dsahern@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-09-21 19:33:13 -07:00
62794fc4fb net_sched: add max len check for TCA_KIND
The TCA_KIND attribute is of NLA_STRING which does not check
the NUL char. KMSAN reported an uninit-value of TCA_KIND which
is likely caused by the lack of NUL.

Change it to NLA_NUL_STRING and add a max len too.

Fixes: 8b4c3cdd9dd8 ("net: sched: Add policy validation for tc attributes")
Reported-and-tested-by: syzbot+618aacd49e8c8b8486bd@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-09-21 19:18:51 -07:00
73f0c11d11 net: qrtr: Stop rx_worker before freeing node
As the endpoint is unregistered there might still be work pending to
handle incoming messages, which will result in a use after free
scenario. The plan is to remove the rx_worker, but until then (and for
stable@) ensure that the work is stopped before the node is freed.

Fixes: bdabad3e363d ("net: Add Qualcomm IPC router")
Cc: stable@vger.kernel.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-09-21 18:45:46 -07:00
7b09c2d052 ipv6: fix a typo in fib6_rule_lookup()
Yi Ren reported an issue discovered by syzkaller, and bisected
to the cited commit.

Many thanks to Yi, this trivial patch does not reflect the patient
work that has been done.

Fixes: d64a1f574a29 ("ipv6: honor RT6_LOOKUP_F_DST_NOREF in rule lookup logic")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Bisected-and-reported-by: Yi Ren <c4tren@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-09-20 19:17:57 -07:00
b41d936b5e sch_netem: fix a divide by zero in tabledist()
syzbot managed to crash the kernel in tabledist() loading
an empty distribution table.

	t = dist->table[rnd % dist->size];

Simply return an error when such load is attempted.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-09-20 19:12:22 -07:00
77d5bc7e6a ipv4: Revert removal of rt_uses_gateway
Julian noted that rt_uses_gateway has a more subtle use than 'is gateway
set':
    https://lore.kernel.org/netdev/alpine.LFD.2.21.1909151104060.2546@ja.home.ssi.bg/

Revert that part of the commit referenced in the Fixes tag.

Currently, there are no u8 holes in 'struct rtable'. There is a 4-byte hole
in the second cacheline which contains the gateway declaration. So move
rt_gw_family down to the gateway declarations since they are always used
together, and then re-use that u8 for rt_uses_gateway. End result is that
rtable size is unchanged.

Fixes: 1550c171935d ("ipv4: Prepare rtable for IPv6 gateway")
Reported-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-09-20 18:23:33 -07:00
92974a1d00 net/sched: act_sample: don't push mac header on ip6gre ingress
current 'sample' action doesn't push the mac header of ingress packets if
they are received by a layer 3 tunnel (like gre or sit); but it forgot to
check for gre over ipv6, so the following script:

 # tc q a dev $d clsact
 # tc f a dev $d ingress protocol ip flower ip_proto icmp action sample \
 > group 100 rate 1
 # psample -v -g 100

dumps everything, including outer header and mac, when $d is a gre tunnel
over ipv6. Fix this adding a missing label for ARPHRD_IP6GRE devices.

Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-09-20 17:01:59 -07:00
acab713177 netfilter: nf_tables: allow lookups in dynamic sets
This un-breaks lookups in sets that have the 'dynamic' flag set.
Given this active example configuration:

table filter {
  set set1 {
    type ipv4_addr
    size 64
    flags dynamic,timeout
    timeout 1m
  }

  chain input {
     type filter hook input priority 0; policy accept;
  }
}

... this works:
nft add rule ip filter input add @set1 { ip saddr }

-> whenever rule is triggered, the source ip address is inserted
into the set (if it did not exist).

This won't work:
nft add rule ip filter input ip saddr @set1 counter
Error: Could not process rule: Operation not supported

In other words, we can add entries to the set, but then can't make
matching decision based on that set.

That is just wrong -- all set backends support lookups (else they would
not be very useful).
The failure comes from an explicit rejection in nft_lookup.c.

Looking at the history, it seems like NFT_SET_EVAL used to mean
'set contains expressions' (aka. "is a meter"), for instance something like

 nft add rule ip filter input meter example { ip saddr limit rate 10/second }
 or
 nft add rule ip filter input meter example { ip saddr counter }

The actual meaning of NFT_SET_EVAL however, is
'set can be updated from the packet path'.

'meters' and packet-path insertions into sets, such as
'add @set { ip saddr }' use exactly the same kernel code (nft_dynset.c)
and thus require a set backend that provides the ->update() function.

The only set that provides this also is the only one that has the
NFT_SET_EVAL feature flag.

Removing the wrong check makes the above example work.
While at it, also fix the flag check during set instantiation to
allow supported combinations only.

Fixes: 8aeff920dcc9b3f ("netfilter: nf_tables: add stateful object reference to set elements")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-20 10:20:02 +02:00
ff175d0b0e netfilter: nf_tables_offload: fix always true policy is unset check
New smatch warnings:
net/netfilter/nf_tables_offload.c:316 nft_flow_offload_chain() warn: always true condition '(policy != -1) => (0-255 != (-1))'

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-20 10:20:02 +02:00
ad652f3811 netfilter: nf_tables: add NFT_CHAIN_POLICY_UNSET and use it
Default policy is defined as a unsigned 8-bit field, do not use a
negative value to leave it unset, use this new NFT_CHAIN_POLICY_UNSET
instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-20 10:20:02 +02:00
cf0eba3342 net/ncsi: Disable global multicast filter
Disabling multicast filtering from NCSI if it is supported. As it
should not filter any multicast packets. In current code, multicast
filter is enabled and with an exception of optional field supported
by device are disabled filtering.

Mainly I see if goal is to disable filtering for IPV6 packets then let
it disabled for every other types as well. As we are seeing issues with
LLDP not working with this enabled filtering. And there are other issues
with IPV6.

By Disabling this multicast completely, it is working for both IPV6 as
well as LLDP.

Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-09-19 18:04:40 -07:00
733ef7f056 xsk: relax UMEM headroom alignment
This patch removes the 64B alignment of the UMEM headroom. There is
really no reason for it, and having a headroom less than 64B should be
valid.

Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-09-19 14:23:41 +02:00
e170eb2771 Merge branch 'work.mount-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs mount API infrastructure updates from Al Viro:
 "Infrastructure bits of mount API conversions.

  The rest is more of per-filesystem updates and that will happen
  in separate pull requests"

* 'work.mount-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  mtd: Provide fs_context-aware mount_mtd() replacement
  vfs: Create fs_context-aware mount_bdev() replacement
  new helper: get_tree_keyed()
  vfs: set fs_context::user_ns for reconfigure
2019-09-18 13:15:58 -07:00
81160dda9a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Support IPV6 RA Captive Portal Identifier, from Maciej Żenczykowski.

 2) Use bio_vec in the networking instead of custom skb_frag_t, from
    Matthew Wilcox.

 3) Make use of xmit_more in r8169 driver, from Heiner Kallweit.

 4) Add devmap_hash to xdp, from Toke Høiland-Jørgensen.

 5) Support all variants of 5750X bnxt_en chips, from Michael Chan.

 6) More RTNL avoidance work in the core and mlx5 driver, from Vlad
    Buslov.

 7) Add TCP syn cookies bpf helper, from Petar Penkov.

 8) Add 'nettest' to selftests and use it, from David Ahern.

 9) Add extack support to drop_monitor, add packet alert mode and
    support for HW drops, from Ido Schimmel.

10) Add VLAN offload to stmmac, from Jose Abreu.

11) Lots of devm_platform_ioremap_resource() conversions, from
    YueHaibing.

12) Add IONIC driver, from Shannon Nelson.

13) Several kTLS cleanups, from Jakub Kicinski.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1930 commits)
  mlxsw: spectrum_buffers: Add the ability to query the CPU port's shared buffer
  mlxsw: spectrum: Register CPU port with devlink
  mlxsw: spectrum_buffers: Prevent changing CPU port's configuration
  net: ena: fix incorrect update of intr_delay_resolution
  net: ena: fix retrieval of nonadaptive interrupt moderation intervals
  net: ena: fix update of interrupt moderation register
  net: ena: remove all old adaptive rx interrupt moderation code from ena_com
  net: ena: remove ena_restore_ethtool_params() and relevant fields
  net: ena: remove old adaptive interrupt moderation code from ena_netdev
  net: ena: remove code duplication in ena_com_update_nonadaptive_moderation_interval _*()
  net: ena: enable the interrupt_moderation in driver_supported_features
  net: ena: reimplement set/get_coalesce()
  net: ena: switch to dim algorithm for rx adaptive interrupt moderation
  net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it
  net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable
  ethtool: implement Energy Detect Powerdown support via phy-tunable
  xen-netfront: do not assume sk_buff_head list is empty in error handling
  s390/ctcm: Delete unnecessary checks before the macro call “dev_kfree_skb”
  net: ena: don't wake up tx queue when down
  drop_monitor: Better sanitize notified packets
  ...
2019-09-18 12:34:53 -07:00
8b53c76533 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add the ability to abort a skcipher walk.

  Algorithms:
   - Fix XTS to actually do the stealing.
   - Add library helpers for AES and DES for single-block users.
   - Add library helpers for SHA256.
   - Add new DES key verification helper.
   - Add surrounding bits for ESSIV generator.
   - Add accelerations for aegis128.
   - Add test vectors for lzo-rle.

  Drivers:
   - Add i.MX8MQ support to caam.
   - Add gcm/ccm/cfb/ofb aes support in inside-secure.
   - Add ofb/cfb aes support in media-tek.
   - Add HiSilicon ZIP accelerator support.

  Others:
   - Fix potential race condition in padata.
   - Use unbound workqueues in padata"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (311 commits)
  crypto: caam - Cast to long first before pointer conversion
  crypto: ccree - enable CTS support in AES-XTS
  crypto: inside-secure - Probe transform record cache RAM sizes
  crypto: inside-secure - Base RD fetchcount on actual RD FIFO size
  crypto: inside-secure - Base CD fetchcount on actual CD FIFO size
  crypto: inside-secure - Enable extended algorithms on newer HW
  crypto: inside-secure: Corrected configuration of EIP96_TOKEN_CTRL
  crypto: inside-secure - Add EIP97/EIP197 and endianness detection
  padata: remove cpu_index from the parallel_queue
  padata: unbind parallel jobs from specific CPUs
  padata: use separate workqueues for parallel and serial work
  padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible
  crypto: pcrypt - remove padata cpumask notifier
  padata: make padata_do_parallel find alternate callback CPU
  workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs
  workqueue: unconfine alloc/apply/free_workqueue_attrs()
  padata: allocate workqueue internally
  arm64: dts: imx8mq: Add CAAM node
  random: Use wait_event_freezable() in add_hwgenerator_randomness()
  crypto: ux500 - Fix COMPILE_TEST warnings
  ...
2019-09-18 12:11:14 -07:00
4feaab05dc LED updates for 5.4-rc1
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQUwxxKyE5l/npt8ARiEGxRG/Sl2wUCXYAIeQAKCRBiEGxRG/Sl
 2/SzAQDEnoNxzV/R5kWFd+2kmFeY3cll0d99KMrWJ8om+kje6QD/cXxZHzFm+T1L
 UPF66k76oOODV7cyndjXnTnRXbeCRAM=
 =Szby
 -----END PGP SIGNATURE-----

Merge tag 'leds-for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds

Pull LED updates from Jacek Anaszewski:
 "In this cycle we've finally managed to contribute the patch set
  sorting out LED naming issues. Besides that there are many changes
  scattered among various LED class drivers and triggers.

  LED naming related improvements:

   - add new 'function' and 'color' fwnode properties and deprecate
     'label' property which has been frequently abused for conveying
     vendor specific names that have been available in sysfs anyway

   - introduce a set of standard LED_FUNCTION* definitions

   - introduce a set of standard LED_COLOR_ID* definitions

   - add a new {devm_}led_classdev_register_ext() API with the
     capability of automatic LED name composition basing on the
     properties available in the passed fwnode; the function is
     backwards compatible in a sense that it uses 'label' data, if
     present in the fwnode, for creating LED name

   - add tools/leds/get_led_device_info.sh script for retrieving LED
     vendor, product and bus names, if applicable; it also performs
     basic validation of an LED name

   - update following drivers and their DT bindings to use the new LED
     registration API:

        - leds-an30259a, leds-gpio, leds-as3645a, leds-aat1290, leds-cr0014114,
          leds-lm3601x, leds-lm3692x, leds-lp8860, leds-lt3593, leds-sc27xx-blt

  Other LED class improvements:

   - replace {devm_}led_classdev_register() macros with inlines

   - allow to call led_classdev_unregister() unconditionally

   - switch to use fwnode instead of be stuck with OF one

  LED triggers improvements:

   - led-triggers:
        - fix dereferencing of null pointer
        - fix a memory leak bug

   - ledtrig-gpio:
        - GPIO 0 is valid

  Drop superseeded apu2/3 support from leds-apu since for apu2+ a newer,
  more complete driver exists, based on a generic driver for the AMD
  SOCs gpio-controller, supporting LEDs as well other devices:

   - drop profile field from priv data

   - drop iosize field from priv data

   - drop enum_apu_led_platform_types

   - drop superseeded apu2/3 led support

   - add pr_fmt prefix for better log output

   - fix error message on probing failure

  Other misc fixes and improvements to existing LED class drivers:

   - leds-ns2, leds-max77650:
        - add of_node_put() before return

   - leds-pwm, leds-is31fl32xx:
        - use struct_size() helper

   - leds-lm3697, leds-lm36274, leds-lm3532:
        - switch to use fwnode_property_count_uXX()

   - leds-lm3532:
        - fix brightness control for i2c mode
        - change the define for the fs current register
        - fixes for the driver for stability
        - add full scale current configuration
        - dt: Add property for full scale current.
        - avoid potentially unpaired regulator calls
        - move static keyword to the front of declarations
        - fix optional led-max-microamp prop error handling

   - leds-max77650:
        - add of_node_put() before return
        - add MODULE_ALIAS()
        - Switch to fwnode property API

   - leds-as3645a:
        - fix misuse of strlcpy

   - leds-netxbig:
        - add of_node_put() in netxbig_leds_get_of_pdata()
        - remove legacy board-file support

   - leds-is31fl319x:
        - simplify getting the adapter of a client

   - leds-ti-lmu-common:
        - fix coccinelle issue
        - move static keyword to the front of declaration

   - leds-syscon:
        - use resource managed variant of device register

   - leds-ktd2692:
        - fix a typo in the name of a constant

   - leds-lp5562:
        - allow firmware files up to the maximum length

   - leds-an30259a:
        - fix typo

   - leds-pca953x:
        - include the right header"

* tag 'leds-for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds: (72 commits)
  leds: lm3532: Fix optional led-max-microamp prop error handling
  led: triggers: Fix dereferencing of null pointer
  leds: ti-lmu-common: Move static keyword to the front of declaration
  leds: lm3532: Move static keyword to the front of declarations
  leds: trigger: gpio: GPIO 0 is valid
  leds: pwm: Use struct_size() helper
  leds: is31fl32xx: Use struct_size() helper
  leds: ti-lmu-common: Fix coccinelle issue in TI LMU
  leds: lm3532: Avoid potentially unpaired regulator calls
  leds: syscon: Use resource managed variant of device register
  leds: Replace {devm_}led_classdev_register() macros with inlines
  leds: Allow to call led_classdev_unregister() unconditionally
  leds: lm3532: Add full scale current configuration
  dt: lm3532: Add property for full scale current.
  leds: lm3532: Fixes for the driver for stability
  leds: lm3532: Change the define for the fs current register
  leds: lm3532: Fix brightness control for i2c mode
  leds: Switch to use fwnode instead of be stuck with OF one
  leds: max77650: Switch to fwnode property API
  led: triggers: Fix a memory leak bug
  ...
2019-09-17 18:40:42 -07:00
1bab8d4c48 Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net
Pull in bug fixes from 'net' tree for the merge window.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-17 23:51:10 +02:00
7f2444d38f Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:

   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.

     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading of for
     travel.

   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewn caching of the leftmost node.

   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function

   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.

   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.

   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.

   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.

   - The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ...
2019-09-17 12:35:15 -07:00
94d18ee934 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "This cycle's RCU changes were:

   - A few more RCU flavor consolidation cleanups.

   - Updates to RCU's list-traversal macros improving lockdep usability.

   - Forward-progress improvements for no-CBs CPUs: Avoid ignoring
     incoming callbacks during grace-period waits.

   - Forward-progress improvements for no-CBs CPUs: Use ->cblist
     structure to take advantage of others' grace periods.

   - Also added a small commit that avoids needlessly inflicting
     scheduler-clock ticks on callback-offloaded CPUs.

   - Forward-progress improvements for no-CBs CPUs: Reduce contention on
     ->nocb_lock guarding ->cblist.

   - Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
     list to further reduce contention on ->nocb_lock guarding ->cblist.

   - Miscellaneous fixes.

   - Torture-test updates.

   - minor LKMM updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
  MAINTAINERS: Update from paulmck@linux.ibm.com to paulmck@kernel.org
  rcu: Don't include <linux/ktime.h> in rcutiny.h
  rcu: Allow rcu_do_batch() to dynamically adjust batch sizes
  rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload
  rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention
  rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention
  rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()
  rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()
  rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed
  rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended
  rcu/nocb: Add bypass callback queueing
  rcu/nocb: Atomic ->len field in rcu_segcblist structure
  rcu/nocb: Unconditionally advance and wake for excessive CBs
  rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock
  rcu/nocb: Reduce contention at no-CBs invocation-done time
  rcu/nocb: Reduce contention at no-CBs registry-time CB advancement
  rcu/nocb: Round down for number of no-CBs grace-period kthreads
  rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU
  rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread
  rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks
  ...
2019-09-16 16:28:19 -07:00
9f2f13f4ff ethtool: implement Energy Detect Powerdown support via phy-tunable
The `phy_tunable_id` has been named `ETHTOOL_PHY_EDPD` since it looks like
this feature is common across other PHYs (like EEE), and defining
`ETHTOOL_PHY_ENERGY_DETECT_POWER_DOWN` seems too long.

The way EDPD works, is that the RX block is put to a lower power mode,
except for link-pulse detection circuits. The TX block is also put to low
power mode, but the PHY wakes-up periodically to send link pulses, to avoid
lock-ups in case the other side is also in EDPD mode.

Currently, there are 2 PHY drivers that look like they could use this new
PHY tunable feature: the `adin` && `micrel` PHYs.

The ADIN's datasheet mentions that TX pulses are at intervals of 1 second
default each, and they can be disabled. For the Micrel KSZ9031 PHY, the
datasheet does not mention whether they can be disabled, but mentions that
they can modified.

The way this change is structured, is similar to the PHY tunable downshift
control:
* a `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` value is exposed to cover a default
  TX interval; some PHYs could specify a certain value that makes sense
* `ETHTOOL_PHY_EDPD_NO_TX` would disable TX when EDPD is enabled
* `ETHTOOL_PHY_EDPD_DISABLE` will disable EDPD

As noted by the `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` the interval unit is 1
millisecond, which should cover a reasonable range of intervals:
 - from 1 millisecond, which does not sound like much of a power-saver
 - to ~65 seconds which is quite a lot to wait for a link to come up when
   plugging a cable

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 22:02:45 +02:00
bef1746681 drop_monitor: Better sanitize notified packets
When working in 'packet' mode, drop monitor generates a notification
with a potentially truncated payload of the dropped packet. The payload
is copied from the MAC header, but I forgot to check that the MAC header
was set, so do it now.

Fixes: ca30707dee2b ("drop_monitor: Add packet alert mode")
Fixes: 5e58109b1ea4 ("drop_monitor: Add support for packet alert mode for hardware drops")
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 21:39:27 +02:00
5f06c63bd3 net: dsa: sja1105: Advertise the 8 TX queues
This is a preparation patch for the tc-taprio offload (and potentially
for other future offloads such as tc-mqprio).

Instead of looking directly at skb->priority during xmit, let's get the
netdev queue and the queue-to-traffic-class mapping, and put the
resulting traffic class into the dsa_8021q PCP field. The switch is
configured with a 1-to-1 PCP-to-ingress-queue-to-egress-queue mapping
(see vlan_pmap in sja1105_main.c), so the effect is that we can inject
into a front-panel's egress traffic class through VLAN tagging from
Linux, completely transparently.

Unfortunately the switch doesn't look at the VLAN PCP in the case of
management traffic to/from the CPU (link-local frames at
01-80-C2-xx-xx-xx or 01-1B-19-xx-xx-xx) so we can't alter the
transmission queue of this type of traffic on a frame-by-frame basis. It
is only selected through the "hostprio" setting which ATM is harcoded in
the driver to 7.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 21:32:57 +02:00
47d23af292 net: dsa: Pass ndo_setup_tc slave callback to drivers
DSA currently handles shared block filters (for the classifier-action
qdisc) in the core due to what I believe are simply pragmatic reasons -
hiding the complexity from drivers and offerring a simple API for port
mirroring.

Extend the dsa_slave_setup_tc function by passing all other qdisc
offloads to the driver layer, where the driver may choose what it
implements and how. DSA is simply a pass-through in this case.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 21:32:57 +02:00
9c66d15646 taprio: Add support for hardware offloading
This allows taprio to offload the schedule enforcement to capable
network cards, resulting in more precise windows and less CPU usage.

The gate mask acts on traffic classes (groups of queues of same
priority), as specified in IEEE 802.1Q-2018, and following the existing
taprio and mqprio semantics.
It is up to the driver to perform conversion between tc and individual
netdev queues if for some reason it needs to make that distinction.

Full offload is requested from the network interface by specifying
"flags 2" in the tc qdisc creation command, which in turn corresponds to
the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit.

The important detail here is the clockid which is implicitly /dev/ptpN
for full offload, and hence not configurable.

A reference counting API is added to support the use case where Ethernet
drivers need to keep the taprio offload structure locally (i.e. they are
a multi-port switch driver, and configuring a port depends on the
settings of other ports as well). The refcount_t variable is kept in a
private structure (__tc_taprio_qopt_offload) and not exposed to drivers.

In the future, the private structure might also be expanded with a
backpointer to taprio_sched *q, to implement the notification system
described in the patch (of when admin became oper, or an error occurred,
etc, so the offload can be monitored with 'tc qdisc show').

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 21:32:57 +02:00
8f7baad7f0 tcp: Add snd_wnd to TCP_INFO
Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
performance problems --
> (1) Usually when we're diagnosing TCP performance problems, we do so
> from the sender, since the sender makes most of the
> performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
> From the sender-side the thing that would be most useful is to see
> tp->snd_wnd, the receive window that the receiver has advertised to
> the sender.

This serves the purpose of adding an additional __u32 to avoid the
would-be hole caused by the addition of the tcpi_rcvi_ooopack field.

Signed-off-by: Thomas Higdon <tph@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 16:26:11 +02:00
f9af2dbbfe tcp: Add TCP_INFO counter for packets received out-of-order
For receive-heavy cases on the server-side, we want to track the
connection quality for individual client IPs. This counter, similar to
the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
tracks out-of-order packet reception. By providing this counter in
TCP_INFO, it will allow understanding to what degree receive-heavy
sockets are experiencing out-of-order delivery and packet drops
indicating congestion.

Please note that this is similar to the counter in NetBSD TCP_INFO, and
has the same name.

Also note that we avoid increasing the size of the tcp_sock struct by
taking advantage of a hole.

Signed-off-by: Thomas Higdon <tph@fb.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 16:26:11 +02:00
28f2c362db Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-09-16

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Now that initial BPF backend for gcc has been merged upstream, enable
   BPF kselftest suite for bpf-gcc. Also fix a BE issue with access to
   bpf_sysctl.file_pos, from Ilya.

2) Follow-up fix for link-vmlinux.sh to remove bash-specific extensions
   related to recent work on exposing BTF info through sysfs, from Andrii.

3) AF_XDP zero copy fixes for i40e and ixgbe driver which caused umem
   headroom to be added twice, from Ciara.

4) Refactoring work to convert sock opt tests into test_progs framework
   in BPF kselftests, from Stanislav.

5) Fix a general protection fault in dev_map_hash_update_elem(), from Toke.

6) Cleanup to use BPF_PROG_RUN() macro in KCM, from Sami.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 16:02:03 +02:00