Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changed key initialization of tcp_fastopen cookies to net_get_random_once.
If the user sets a custom key net_get_random_once must be called at
least once to ensure we don't overwrite the user provided key when the
first cookie is generated later on.
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the ehash and ipv6_hash_secrets with net_get_random_once.
Each compilation unit gets its own secret now:
ipv4/inet_hashtables.o
ipv4/udp.o
ipv6/inet6_hashtables.o
ipv6/udp.o
rds/connection.o
The functions still get inlined into the hashing functions. In the fast
path we have at most two (needed in ipv6) if (unlikely(...)).
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch splits the secret key for syncookies for ipv4 and ipv6 and
initializes them with net_get_random_once. This change was the reason I
did this series. I think the initialization of the syncookie_secret is
way to early.
Cc: Florian Westphal <fw@strlen.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_get_random_once is a new macro which handles the initialization
of secret keys. It is possible to call it in the fast path. Only the
initialization depends on the spinlock and is rather slow. Otherwise
it should get used just before the key is used to delay the entropy
extration as late as possible to get better randomness. It returns true
if the key got initialized.
The usage of static_keys for net_get_random_once is a bit uncommon so
it needs some further explanation why this actually works:
=== In the simple non-HAVE_JUMP_LABEL case we actually have ===
no constrains to use static_key_(true|false) on keys initialized with
STATIC_KEY_INIT_(FALSE|TRUE). So this path just expands in favor of
the likely case that the initialization is already done. The key is
initialized like this:
___done_key = { .enabled = ATOMIC_INIT(0) }
The check
if (!static_key_true(&___done_key)) \
expands into (pseudo code)
if (!likely(___done_key > 0))
, so we take the fast path as soon as ___done_key is increased from the
helper function.
=== If HAVE_JUMP_LABELs are available this depends ===
on patching of jumps into the prepared NOPs, which is done in
jump_label_init at boot-up time (from start_kernel). It is forbidden
and dangerous to use net_get_random_once in functions which are called
before that!
At compilation time NOPs are generated at the call sites of
net_get_random_once. E.g. net/ipv6/inet6_hashtable.c:inet6_ehashfn (we
need to call net_get_random_once two times in inet6_ehashfn, so two NOPs):
71: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
76: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Both will be patched to the actual jumps to the end of the function to
call __net_get_random_once at boot time as explained above.
arch_static_branch is optimized and inlined for false as return value and
actually also returns false in case the NOP is placed in the instruction
stream. So in the fast case we get a "return false". But because we
initialize ___done_key with (enabled != (entries & 1)) this call-site
will get patched up at boot thus returning true. The final check looks
like this:
if (!static_key_true(&___done_key)) \
___ret = __net_get_random_once(buf, \
expands to
if (!!static_key_false(&___done_key)) \
___ret = __net_get_random_once(buf, \
So we get true at boot time and as soon as static_key_slow_inc is called
on the key it will invert the logic and return false for the fast path.
static_key_slow_inc will change the branch because it got initialized
with .enabled == 0. After static_key_slow_inc is called on the key the
branch is replaced with a nop again.
=== Misc: ===
The helper defers the increment into a workqueue so we don't
have problems calling this code from atomic sections. A seperate boolean
(___done) guards the case where we enter net_get_random_once again before
the increment happend.
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch splits the inet6_ehashfn into separate ones in
ipv6/inet6_hashtables.o and ipv6/udp.o to ease the introduction of
seperate secrets keys later.
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This duplicates a bit of code but let's us easily introduce
separate secret keys later. The separate compilation units are
ipv4/inet_hashtabbles.o, ipv4/udp.o and rds/connection.o.
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now inet_gso_segment() is stackable, its relatively easy to
implement GSO/TSO support for IPIP
Performance results, when segmentation is done after tunnel
device (as no NIC is yet enabled for TSO IPIP support) :
Before patch :
lpq83:~# ./netperf -H 7.7.9.84 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.9.84 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 16384 10.00 3357.88 5.09 3.70 2.983 2.167
After patch :
lpq83:~# ./netperf -H 7.7.9.84 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.9.84 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 16384 10.00 7710.19 4.52 6.62 1.152 1.687
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support GSO on IPIP, we need to make
inet_gso_segment() stackable.
It should not assume network header starts right after mac
header.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes gre_handle_offloads() more generic
and rename it to iptunnel_handle_offloads()
This will be used to add GSO/TSO support to IPIP tunnels.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While implementing GSO/TSO support for IPIP, I found skb_segment()
was assuming network header was immediately following mac header.
Its not really true in the case inet_gso_segment() is stacked :
By the time tcp_gso_segment() is called, network header points
to the inner IP header.
Let's instead assume nothing and pick the current offsets found in
original skb, we have skb_headers_offset_update() helper for that.
Also move the csum_start update inside skb_headers_offset_update()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now, if user application does:
sendto len<mtu flag MSG_MORE
sendto len>mtu flag 0
The skb is not treated as fragmented one because it is not initialized
that way. So move the initialization to fix this.
introduced by:
commit e89e9cf539 "[IPv4/IPv6]: UFO Scatter-gather approach"
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now, if user application does:
sendto len<mtu flag MSG_MORE
sendto len>mtu flag 0
The skb is not treated as fragmented one because it is not initialized
that way. So move the initialization to fix this.
introduced by:
commit e89e9cf539 "[IPv4/IPv6]: UFO Scatter-gather approach"
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
if up->pending != 0 dontfrag is left with default value -1. That
causes that application that do:
sendto len>mtu flag MSG_MORE
sendto len>mtu flag 0
will receive EMSGSIZE errno as the result of the second sendto.
This patch fixes it by respecting IPV6_DONTFRAG socket option.
introduced by:
commit 4b340ae20d "IPv6: Complete IPV6_DONTFRAG support"
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_gso_send_check() and ipv6_gso_segment() are called by
skb_mac_gso_segment() under rcu lock, no need to use
rcu_read_lock() / rcu_read_unlock()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_gso_segment() and inet_gso_send_check() are called by
skb_mac_gso_segment() under rcu lock, no need to use
rcu_read_lock() / rcu_read_unlock()
Avoid calling ip_hdr() twice per function.
We can use ip_send_check() helper.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case of credentials passing in unix stream sockets (dgram
sockets seem not affected), we get a rather sparse race after
commit 16e5726 ("af_unix: dont send SCM_CREDENTIALS by default").
We have a stream server on receiver side that requests credential
passing from senders (e.g. nc -U). Since we need to set SO_PASSCRED
on each spawned/accepted socket on server side to 1 first (as it's
not inherited), it can happen that in the time between accept() and
setsockopt() we get interrupted, the sender is being scheduled and
continues with passing data to our receiver. At that time SO_PASSCRED
is neither set on sender nor receiver side, hence in cmsg's
SCM_CREDENTIALS we get eventually pid:0, uid:65534, gid:65534
(== overflow{u,g}id) instead of what we actually would like to see.
On the sender side, here nc -U, the tests in maybe_add_creds()
invoked through unix_stream_sendmsg() would fail, as at that exact
time, as mentioned, the sender has neither SO_PASSCRED on his side
nor sees it on the server side, and we have a valid 'other' socket
in place. Thus, sender believes it would just look like a normal
connection, not needing/requesting SO_PASSCRED at that time.
As reverting 16e5726 would not be an option due to the significant
performance regression reported when having creds always passed,
one way/trade-off to prevent that would be to set SO_PASSCRED on
the listener socket and allow inheriting these flags to the spawned
socket on server side in accept(). It seems also logical to do so
if we'd tell the listener socket to pass those flags onwards, and
would fix the race.
Before, strace:
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"blub\n", 4096}],
msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS{pid=0, uid=65534, gid=65534}},
msg_flags=0}, 0) = 5
After, strace:
recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"blub\n", 4096}],
msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS{pid=11580, uid=1000, gid=1000}},
msg_flags=0}, 0) = 5
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The backbone gw check has to be VLAN specific so that code
using it can specify VID where the check has to be done.
In the TT code, the check has been moved into the
tt_global_add() function so that it can be performed on a
per-entry basis instead of ignoring all the TT data received
from another backbone node. Only TT global entries belonging
to the VLAN where the backbone node is connected to are
skipped.
All the other spots where the TT code was checking whether a
node is a backbone have been removed.
Moreover, batadv_bla_is_backbone_gw_orig() now returns bool
since it used to return only 1 or 0.
Cc: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Instead of unconditionally removing all the TT entries
served by a given originator, make tt_global_orig_del()
remove only entries matching a given VLAN identifier
provided as argument.
If such argument is negative all the global entries
served by the originator are removed.
This change is used into the BLA code to purge entries
served by a newly discovered Backbone node, but limiting
the operation only to those connected to the VLAN where the
backbone has been discovered.
Cc: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
This change allows nodes to handle the TT table on a
per-VLAN basis. This is needed because nodes may have to
store only some of the global entries advertised by another
node.
In this scenario such nodes would re-create only a partial
global table and would not be able to compute a correct CRC
anymore.
This patch splits the logic and introduces one CRC per VLAN.
In this way a node fetching only some entries belonging to
some VLANs is still able to compute the needed CRCs and
still check the table correctness.
With this patch the shape of the TVLV-TT is changed too
because now a node needs to advertise all the CRCs of all
the VLANs that it is wired to.
The debug output of the local Translation Table now shows
the CRC along with each entry since there is not a common
value for the entire table anymore.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
A TT response may be prepared and sent while the local or
global translation table is getting updated.
The worst case is when one of the tables is accessed after
its content has been recently updated but the metadata
(TTVN/CRC) has not yet. In this case the reader will get a
table content which does not match the TTVN/CRC.
This will lead to an inconsistent state and so to a TT
recovery.
To avoid entering this situation, put a lock around those TT
operations recomputing the metadata and around the TT
Response creation (the latter is the only reader that
accesses the metadata together with the table).
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
this comment refers to the old batmand codebase and does
not make sense anymore. Remove it
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
With this patch the functions batadv_send_skb_unicast() and
batadv_send_skb_unicast_4addr() are further refined into
batadv_send_skb_via_tt(), batadv_send_skb_via_tt_4addr() and
batadv_send_skb_via_gw(). This way we avoid any "guessing" about where to send
a packet in the unicast forwarding methods and let the callers decide.
This is going to be useful for the upcoming multicast related patches in
particular.
Further, the return values were polished a little to use the more
appropriate NET_XMIT_* defines.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Acked-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
AP isolation has to be enabled on one VLAN interface only.
This patch moves the AP isolation attribute to the per-vlan
interface attribute set, enabling it to have a different
value depending on the selected vlan.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Each VLAN can now have its own set of attributes which are
exported through a new subfolder in the sysfs tree.
Each VLAN created on top of a soft_iface will have its own
subfolder.
The subfolder is named "vlan%VID" and it is created inside
the "mesh" sysfs folder belonging to batman-adv.
Attributes corresponding to the untagged LAN are stored in
the root sysfs folder as before.
This patch also creates all the needed macros and data
structures to easily handle new VLAN spacific attributes.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Since batman-adv is now fully VLAN-aware, a proper framework
able to handle per-vlan-interface attributes is needed.
Those attributes will affect the associated VLAN interface
only, rather than the real soft_iface (which would result
in every vlan interface having the same attribute
configuration).
To make the code simpler and easier to extend, attributes
associated to the standalone soft_iface are now treated
like belonging to yet another vlan having a special vid.
This vid is different from the others because it is made up
by all zeros and the VLAN_HAS_TAG bit is not set.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
The same IP subnet can be used on different VLANs, therefore
DAT has to differentiate whether the IP to resolve belongs
to one or the other virtual LAN.
To accomplish this task DAT has to deal with the VLAN tag
and store it together with each ARP entry.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
The gateway code is now adapted in order to correctly
interact with the Translation Table component by using the
vlan ID
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
now that each TT entry is characterised by a VLAN ID, the
latter has to be taken into consideration when computing the
local/global table CRC as it would be theoretically possible
to have the same client in two different VLANs
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
To make the translation table code VLAN-aware, each entry
must carry the VLAN ID which it belongs to. This patch adds
such attribute to the related TT structures.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
My university will stop email service for alumni in january 2014, please
use my new e-mail address instead.
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Fix bogus merge conflict resolution by checking the return
values of the skb preparation routines.
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Randy found that if network namespace not enabled then
nd_net does not exist and would cause compilation failure.
This is handled correctly by using the dev_net() macro.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the specialized code in __tcp_retransmit_skb() that tries to trim
any ACKed payload preceding a FIN before we retransmit (this was added
in 1999 in v2.2.3pre3). This trimming code was made unreachable by the
more general code added above it that uses tcp_trim_head() to trim any
ACKed payload, with or without a FIN (this was added in "[NET]: Add
segmentation offload support to TCP." in 2002 circa v2.5.33).
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently set the value that variable vid is pointing, which will be
used in FDB later, to 0 at br_allowed_ingress() when we receive untagged
or priority-tagged frames, even though the PVID is valid.
This leads to FDB updates in such a wrong way that they are learned with
VID 0.
Update the value to that of PVID if the PVID is applied.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are using the VLAN_TAG_PRESENT bit to detect whether the PVID is
set or not at br_get_pvid(), while we don't care about the bit in
adding/deleting the PVID, which makes it impossible to forward any
incomming untagged frame with vlan_filtering enabled.
Since vid 0 cannot be used for the PVID, we can use vid 0 to indicate
that the PVID is not set, which is slightly more efficient than using
the VLAN_TAG_PRESENT.
Fix the problem by getting rid of using the VLAN_TAG_PRESENT.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.1Q says that when we receive priority-tagged (VID 0) frames
use the PVID for the port as its VID.
(See IEEE 802.1Q-2011 6.9.1 and Table 9-2)
Apply the PVID to not only untagged frames but also priority-tagged frames.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.1Q says that:
- VID 0 shall not be configured as a PVID, or configured in any Filtering
Database entry.
- VID 4095 shall not be configured as a PVID, or transmitted in a tag
header. This VID value may be used to indicate a wildcard match for the VID
in management operations or Filtering Database entries.
(See IEEE 802.1Q-2011 6.9.1 and Table 9-2)
Don't accept adding these VIDs in the vlan_filtering implementation.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rtmsg_fib function doesn't modify this argument so mark
it const.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current test works fine in practice. The "amount" variable is
actually used as a boolean so negative values or any non-zero values
count as "true". However since we don't allow numbers greater than one,
let's not allow negative numbers either.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_table_lookup has included the rcu lock protection.
Signed-off-by: baker.zhang <baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
1) Don't use a wildcard SA if a more precise one is in acquire state,
from Fan Du.
2) Simplify the SA lookup when using wildcard source. We need to check
only the destination in this case, from Fan Du.
3) Add a receive path hook for IPsec virtual tunnel interfaces
to xfrm6_mode_tunnel.
4) Add support for IPsec virtual tunnel interfaces to ipv6.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename tcp_tso_segment() to tcp_gso_segment(), to better reflect
what is going on, and ease grep games.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When checking statistics or changing parameters on a link, the
link_find_link function is used to locate the link with a given
name. The complex method of deconstructing the name into local
and remote address/interface is error prone and may fail if the
interface names contains special characters. We change the lookup
method to iterate over the list of nodes and compare the link
names.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
link_cmd_set_value() takes commands for link, bearer and media related
configuration. Genereally the function returns 0 when a command is
recognized, and -EINVAL when it is not. However, in the switch for link
related commands it returns 0 even when the command is unrecognized. This
will sometimes make it look as if a failed configuration command has been
successful, but has otherwise no negative effects.
We remove this anomaly by returning -EINVAL even for link commands. We also
rework all three switches to make them conforming to common kernel coding
style.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, rcv_msg() always returns zero on a packet delivery upcall
from net_device.
To make its behavior more compliant with the way this API should be
used, we change this to let it return NET_RX_SUCCESS (which is zero
anyway) when it is able to handle the packet, and NET_RX_DROP otherwise.
The latter does not imply any functional change, it only enables the
driver to keep more accurate statistics about the fate of delivered
packets.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tipc_block_bearer() currently takes a bearer name (const char*)
as argument. This requires the function to make a lookup to find
the pointer to the corresponding bearer struct. In the current
code base this is not necessary, since the only two callers
(tipc_continue(),recv_notification()) already have validated
copies of this pointer, and hence can pass it directly in the
function call.
We change tipc_block_bearer() to directly take struct tipc_bearer*
as argument instead.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TIPC 'bearer' exists as an abstract concept, while 'media'
is deemed a specific implementation of a bearer, such as Ethernet
or Infiniband media. When a component inside TIPC wants to control
a specific media, it only needs to access the generic bearer API
to achieve this. However, in the current media implementations,
the 'bearer' name is also extensively used in media specific
function and variable names.
This may create confusion, so we choose to replace the term 'bearer'
with 'media' in all function names, variable names, and prefixes
where this is what really is meant.
Note that this change is cosmetic only, and no runtime behaviour
changes are made here.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tipc_msg_build() now copies message data from iovec to skb_buff
using memcpy_fromiovecend(), which doesn't need to be passed the
iovec length to perform the copying.
So we remove the parameter indicating iovec length in all
functions where TIPC messages are built and sent.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tipc_msg_build() calls skb_copy_to_linear_data_offset() to copy data
from user space to kernel space. However, the latter function does
in its turn call memcpy() to perform the actual copying. This poses
an obvious security and robustness risk, since memcpy() never makes
any validity check on the pointer it is copying from.
To correct this, we the replace the offending function call with
a call to memcpy_fromiovecend(), which uses copy_from_user() to
perform the copying.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While working on virtio_net new allocation strategy to increase
payload/truesize ratio, we found that refactoring sk_page_frag_refill()
was needed.
This patch splits sk_page_frag_refill() into two parts, adding
skb_page_frag_refill() which can be used without a socket.
While we are at it, add a minimum frag size of 32 for
sk_page_frag_refill()
Michael will either use netdev_alloc_frag() from softirq context,
or skb_page_frag_refill() from process context in refill_work()
(GFP_KERNEL allocations)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Dalton <mwdalton@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville says:
====================
This is a batch of updates intended for the 3.13 stream...
The biggest item of interest in here is wcn36xx, the new mac80211
driver for Qualcomm WCN3660/WCN3680 hardware.
Regarding the mac80211 bits, Johannes says:
"We have an assortment of cleanups and new features, of which the
biggest one is probably the channel-switch support in IBSS. Nothing
else really stands out much."
On top of that, the ath9k and rt2x00 get a lot of update action from
Felix Fietkau and Gabor Juhos, respectively. There are a handful of
updates to other drivers here and there as well.
Please let me know if there are problems!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit be4f154d5e
bridge: Clamp forward_delay when enabling STP
had a typo when attempting to clamp maximum forward delay.
It is possible to set bridge_forward_delay to be higher then
permitted maximum when STP is off. When turning STP on, the
higher then allowed delay has to be clamed down to max value.
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Half of the rt_cache_stat fields are no longer used after IP
route cache removal, lets shrink this per cpu area.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_can_gso() should only be used as a hint in tcp_sendmsg() to build GSO
packets in the first place. (As a performance hint)
Once we have GSO packets in write queue, we can not decide they are no
longer GSO only because flow now uses a route which doesn't handle
TSO/GSO.
Core networking stack handles the case very well for us, all we need
is keeping track of packet counts in MSS terms, regardless of
segmentation done later (in GSO or hardware)
Right now, if tcp_fragment() splits a GSO packet in two parts,
@left and @right, and route changed through a non GSO device,
both @left and @right have pcount set to 1, which is wrong,
and leads to incorrect packet_count tracking.
This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with MTU
discovery")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP stack should make sure it owns skbs before mangling them.
We had various crashes using bnx2x, and it turned out gso_size
was cleared right before bnx2x driver was populating TC descriptor
of the _previous_ packet send. TCP stack can sometime retransmit
packets that are still in Qdisc.
Of course we could make bnx2x driver more robust (using
ACCESS_ONCE(shinfo->gso_size) for example), but the bug is TCP stack.
We have identified two points where skb_unclone() was needed.
This patch adds a WARN_ON_ONCE() to warn us if we missed another
fix of this kind.
Kudos to Neal for finding the root cause of this bug. Its visible
using small MSS.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville says:
====================
Please pull this batch of fixes intended for the 3.12 stream!
For the mac80211 bits, Johannes says:
"Jouni fixes a remain-on-channel vs. scan bug, and Felix fixes client TX
probing on VLANs."
And also:
"This time I have two fixes from Emmanuel for RF-kill issues, and fixed
two issues reported by Evan Huus and Thomas Lindroth respectively."
On top of those...
Avinash Patil adds a couple of mwifiex fixes to properly inform cfg80211
about some different types of disconnects, avoiding WARNINGs.
Mark Cave-Ayland corrects a pointer arithmetic problem in rtlwifi,
avoiding incorrect automatic gain calculations.
Solomon Peachy sends a cw1200 fix for locking around calls to
cw1200_irq_handler, addressing "lost interrupt" problems.
Please let me know if there are problems!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
On receiving an ACK that covers the loss probe sequence, TLP
immediately sets the congestion state to Open, even though some packets
are not recovered and retransmisssion are on the way. The later ACks
may trigger a WARN_ON check in step D of tcp_fastretrans_alert(), e.g.,
https://bugzilla.redhat.com/show_bug.cgi?id=989251
The fix is to follow the similar procedure in recovery by calling
tcp_try_keep_open(). The sender switches to Open state if no packets
are retransmissted. Otherwise it goes to Disorder and let subsequent
ACKs move the state to Recovery or Open.
Reported-By: Michael Sterrett <michael@sterretts.net>
Tested-By: Dormando <dormando@rydia.net>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP/IPv6 fragmentation knows how to compute only TCP/UDP checksum.
This causes problems if SCTP packets has to be fragmented and
ipsummed has been set to PARTIAL due to checksum offload support.
This condition can happen when retransmitting after MTU discover,
or when INIT or other control chunks are larger then MTU.
Check for the rare fragmentation condition in SCTP and use software
checksum calculation in this case.
CC: Fan Du <fan.du@windriver.com>
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
igb/ixgbe have hardware sctp checksum support, when this feature is enabled
and also IPsec is armed to protect sctp traffic, ugly things happened as
xfrm_output checks CHECKSUM_PARTIAL to do checksum operation(sum every thing
up and pack the 16bits result in the checksum field). The result is fail
establishment of sctp communication.
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
netfilter updates: nf_tables pull request
The following patchset contains the current original nf_tables tree
condensed in 17 patches. I have organized them by chronogical order
since the original nf_tables code was released in 2009 and by
dependencies between the different patches.
The patches are:
1) Adapt all existing hooks in the tree to pass hook ops to the
hook callback function, required by nf_tables, from Patrick McHardy.
2) Move alloc_null_binding to nf_nat_core, as it is now also needed by
nf_tables and ip_tables, original patch from Patrick McHardy but
required major changes to adapt it to the current tree that I made.
3) Add nf_tables core, including the netlink API, the packet filtering
engine, expressions and built-in tables, from Patrick McHardy. This
patch includes accumulated fixes since 2009 and minor enhancements.
The patch description contains a list of references to the original
patches for the record. For those that are not familiar to the
original work, see [1], [2] and [3].
4) Add netlink set API, this replaces the original set infrastructure
to introduce a netlink API to add/delete sets and to add/delete
set elements. This includes two set types: the hash and the rb-tree
sets (used for interval based matching). The main difference with
ipset is that this infrastructure is data type agnostic. Patch from
Patrick McHardy.
5) Allow expression operation overload, this API change allows us to
provide define expression subtypes depending on the configuration
that is received from user-space via Netlink. It is used by follow
up patches to provide optimized versions of the payload and cmp
expressions and the x_tables compatibility layer, from Patrick
McHardy.
6) Add optimized data comparison operation, it requires the previous
patch, from Patrick McHardy.
7) Add optimized payload implementation, it requires patch 5, from
Patrick McHardy.
8) Convert built-in tables to chain types. Each chain type have special
semantics (filter, route and nat) that are used by userspace to
configure the chain behaviour. The main chain regarding iptables
is that tables become containers of chain, with no specific semantics.
However, you may still configure your tables and chains to retain
iptables like semantics, patch from me.
9) Add compatibility layer for x_tables. This patch adds support to
use all existing x_tables extensions from nf_tables, this is used
to provide a userspace utility that accepts iptables syntax but
used internally the nf_tables kernel core. This patch includes
missing features in the nf_tables core such as the per-chain
stats, default chain policy and number of chain references, which
are required by the iptables compatibility userspace tool. Patch
from me.
10) Fix transport protocol matching, this fix is a side effect of the
x_tables compatibility layer, which now provides a pointer to the
transport header, from me.
11) Add support for dormant tables, this feature allows you to disable
all chains and rules that are contained in one table, from me.
12) Add IPv6 NAT support. At the time nf_tables was made, there was no
NAT IPv6 support yet, from Tomasz Bursztyka.
13) Complete net namespace support. This patch register the protocol
family per net namespace, so tables (thus, other objects contained
in tables such as sets, chains and rules) are only visible from the
corresponding net namespace, from me.
14) Add the insert operation to the nf_tables netlink API, this requires
adding a new position attribute that allow us to locate where in the
ruleset a rule needs to be inserted, from Eric Leblond.
15) Add rule batching support, including atomic rule-set updates by
using rule-set generations. This patch includes a change to nfnetlink
to include two new control messages to indicate the beginning and
the end of a batch. The end message is interpreted as the commit
message, if it's missing, then the rule-set updates contained in the
batch are aborted, from me.
16) Add trace support to the nf_tables packet filtering core, from me.
17) Add ARP filtering support, original patch from Patrick McHardy, but
adapted to fit into the chain type infrastructure. This was recovered
to be used by nft userspace tool and our compatibility arptables
userspace tool.
There is still work to do to fully replace x_tables [4] [5] but that can
be done incrementally by extending our netlink API. Moreover, looking at
netfilter-devel and the amount of contributions to nf_tables we've been
getting, I think it would be good to have it mainstream to avoid accumulating
large patchsets skip continuous rebases.
I tried to provide a reasonable patchset, we have more than 100 accumulated
patches in the original nf_tables tree, so I collapsed many of the small
fixes to the main patch we had since 2009 and provide a small batch for
review to netdev, while trying to retain part of the history.
For those who didn't give a try to nf_tables yet, there's a quick howto
available from Eric Leblond that describes how to get things working [6].
Comments/reviews welcome.
Thanks!
[1] http://lwn.net/Articles/324251/
[2] http://workshop.netfilter.org/2013/wiki/images/e/ee/Nftables-osd-2013-developer.pdf
[3] http://lwn.net/Articles/564095/
[4] http://people.netfilter.org/pablo/map-pending-work.txt
[4] http://people.netfilter.org/pablo/nftables-todo.txt
[5] https://home.regit.org/netfilter-en/nftables-quick-howto/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP listener refactoring, part 6 :
Use sock_gen_put() from inet_diag_dump_one_icsk() for future
SYN_RECV support.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for tracing the packet travel through
the ruleset, in a similar fashion to x_tables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a batch support to nfnetlink. Basically, it adds
two new control messages:
* NFNL_MSG_BATCH_BEGIN, that indicates the beginning of a batch,
the nfgenmsg->res_id indicates the nfnetlink subsystem ID.
* NFNL_MSG_BATCH_END, that results in the invocation of the
ss->commit callback function. If not specified or an error
ocurred in the batch, the ss->abort function is invoked
instead.
The end message represents the commit operation in nftables, the
lack of end message results in an abort. This patch also adds the
.call_batch function that is only called from the batch receival
path.
This patch adds atomic rule updates and dumps based on
bitmask generations. This allows to atomically commit a set of
rule-set updates incrementally without altering the internal
state of existing nf_tables expressions/matches/targets.
The idea consists of using a generation cursor of 1 bit and
a bitmask of 2 bits per rule. Assuming the gencursor is 0,
then the genmask (expressed as a bitmask) can be interpreted
as:
00 active in the present, will be active in the next generation.
01 inactive in the present, will be active in the next generation.
10 active in the present, will be deleted in the next generation.
^
gencursor
Once you invoke the transition to the next generation, the global
gencursor is updated:
00 active in the present, will be active in the next generation.
01 active in the present, needs to zero its future, it becomes 00.
10 inactive in the present, delete now.
^
gencursor
If a dump is in progress and nf_tables enters a new generation,
the dump will stop and return -EBUSY to let userspace know that
it has to retry again. In order to invalidate dumps, a global
genctr counter is increased everytime nf_tables enters a new
generation.
This new operation can be used from the user-space utility
that controls the firewall, eg.
nft -f restore
The rule updates contained in `file' will be applied atomically.
cat file
-----
add filter INPUT ip saddr 1.1.1.1 counter accept #1
del filter INPUT ip daddr 2.2.2.2 counter drop #2
-EOF-
Note that the rule 1 will be inactive until the transition to the
next generation, the rule 2 will be evicted in the next generation.
There is a penalty during the rule update due to the branch
misprediction in the packet matching framework. But that should be
quickly resolved once the iteration over the commit list that
contain rules that require updates is finished.
Event notification happens once the rule-set update has been
committed. So we skip notifications is case the rule-set update
is aborted, which can happen in case that the rule-set is tested
to apply correctly.
This patch squashed the following patches from Pablo:
* nf_tables: atomic rule updates and dumps
* nf_tables: get rid of per rule list_head for commits
* nf_tables: use per netns commit list
* nfnetlink: add batch support and use it from nf_tables
* nf_tables: all rule updates are transactional
* nf_tables: attach replacement rule after stale one
* nf_tables: do not allow deletion/replacement of stale rules
* nf_tables: remove unused NFTA_RULE_FLAGS
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a new rule attribute NFTA_RULE_POSITION which is
used to store the position of a rule relatively to the others.
By providing the create command and specifying the position, the
rule is inserted after the rule with the handle equal to the
provided position.
Regarding notification, the position attribute specifies the
handle of the previous rule to make sure we don't point to any
stale rule in notifications coming from the commit path.
This patch includes the following fix from Pablo:
* nf_tables: fix rule deletion event reporting
Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Register family per netnamespace to ensure that sets are
only visible in its approapriate namespace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch generalizes the NAT expression to support both IPv4 and IPv6
using the existing IPv4/IPv6 NAT infrastructure. This also adds the
NAT chain type for IPv6.
This patch collapses the following patches that were posted to the
netfilter-devel mailing list, from Tomasz:
* nf_tables: Change NFTA_NAT_ attributes to better semantic significance
* nf_tables: Split IPv4 NAT into NAT expression and IPv4 NAT chain
* nf_tables: Add support for IPv6 NAT expression
* nf_tables: Add support for IPv6 NAT chain
* nf_tables: Fix up build issue on IPv6 NAT support
And, from Pablo Neira Ayuso:
* fix missing dependencies in nft_chain_nat
Signed-off-by: Tomasz Bursztyka <tomasz.bursztyka@linux.intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch allows you to temporarily disable an entire table.
You can change the state of a dormant table via NFT_MSG_NEWTABLE
messages. Using this operation you can wake up a table, so their
chains are registered.
This provides atomicity at chain level. Thus, the rule-set of one
chain is applied at once, avoiding any possible intermediate state
in every chain. Still, the chains that belongs to a table are
registered consecutively. This also allows you to have inactive
tables in the kernel.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We cannot use skb->transport_header since it's unset, use
pkt->xt.thoff instead.
Now possible using information made available through the x_tables
compatibility layer.
Reported-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the x_tables compatibility layer. This allows you
to use existing x_tables matches and targets from nf_tables.
This compatibility later allows us to use existing matches/targets
for features that are still missing in nf_tables. We can progressively
replace them with native nf_tables extensions. It also provides the
userspace compatibility software that allows you to express the
rule-set using the iptables syntax but using the nf_tables kernel
components.
In order to get this compatibility layer working, I've done the
following things:
* add NFNL_SUBSYS_NFT_COMPAT: this new nfnetlink subsystem is used
to query the x_tables match/target revision, so we don't need to
use the native x_table getsockopt interface.
* emulate xt structures: this required extending the struct nft_pktinfo
to include the fragment offset, which is already obtained from
ip[6]_tables and that is used by some matches/targets.
* add support for default policy to base chains, required to emulate
x_tables.
* add NFTA_CHAIN_USE attribute to obtain the number of references to
chains, required by x_tables emulation.
* add chain packet/byte counters using per-cpu.
* support 32-64 bits compat.
For historical reasons, this patch includes the following patches
that were posted in the netfilter-devel mailing list.
From Pablo Neira Ayuso:
* nf_tables: add default policy to base chains
* netfilter: nf_tables: add NFTA_CHAIN_USE attribute
* nf_tables: nft_compat: private data of target and matches in contiguous area
* nf_tables: validate hooks for compat match/target
* nf_tables: nft_compat: release cached matches/targets
* nf_tables: x_tables support as a compile time option
* nf_tables: fix alias for xtables over nftables module
* nf_tables: add packet and byte counters per chain
* nf_tables: fix per-chain counter stats if no counters are passed
* nf_tables: don't bump chain stats
* nf_tables: add protocol and flags for xtables over nf_tables
* nf_tables: add ip[6]t_entry emulation
* nf_tables: move specific layer 3 compat code to nf_tables_ipv[4|6]
* nf_tables: support 32bits-64bits x_tables compat
* nf_tables: fix compilation if CONFIG_COMPAT is disabled
From Patrick McHardy:
* nf_tables: move policy to struct nft_base_chain
* nf_tables: send notifications for base chain policy changes
From Alexander Primak:
* nf_tables: remove the duplicate NF_INET_LOCAL_OUT
From Nicolas Dichtel:
* nf_tables: fix compilation when nf-netlink is a module
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch converts built-in tables/chains to chain types that
allows you to deploy customized table and chain configurations from
userspace.
After this patch, you have to specify the chain type when
creating a new chain:
add chain ip filter output { type filter hook input priority 0; }
^^^^ ------
The existing chain types after this patch are: filter, route and
nat. Note that tables are just containers of chains with no specific
semantics, which is a significant change with regards to iptables.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add an optimized payload expression implementation for small (up to 4 bytes)
aligned data loads from the linear packet area.
This patch also includes original Patrick McHardy's entitled (nf_tables:
inline nft_payload_fast_eval() into main evaluation loop).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add an optimized version of nft_data_cmp() that only handles values of to
4 bytes length.
This patch includes original Patrick McHardy's patch entitled (nf_tables:
inline nft_cmp_fast_eval() into main evaluation loop).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Split the expression ops into two parts and support overloading of
the runtime expression ops based on the requested function through
a ->select_ops() callback.
This can be used to provide optimized implementations, for instance
for loading small aligned amounts of data from the packet or inlining
frequently used operations into the main evaluation loop.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the new netlink API for maintaining nf_tables sets
independently of the ruleset. The API supports the following operations:
- creation of sets
- deletion of sets
- querying of specific sets
- dumping of all sets
- addition of set elements
- removal of set elements
- dumping of all set elements
Sets are identified by name, each table defines an individual namespace.
The name of a set may be allocated automatically, this is mostly useful
in combination with the NFT_SET_ANONYMOUS flag, which destroys a set
automatically once the last reference has been released.
Sets can be marked constant, meaning they're not allowed to change while
linked to a rule. This allows to perform lockless operation for set
types that would otherwise require locking.
Additionally, if the implementation supports it, sets can (as before) be
used as maps, associating a data value with each key (or range), by
specifying the NFT_SET_MAP flag and can be used for interval queries by
specifying the NFT_SET_INTERVAL flag.
Set elements are added and removed incrementally. All element operations
support batching, reducing netlink message and set lookup overhead.
The old "set" and "hash" expressions are replaced by a generic "lookup"
expression, which binds to the specified set. Userspace is not aware
of the actual set implementation used by the kernel anymore, all
configuration options are generic.
Currently the implementation selection logic is largely missing and the
kernel will simply use the first registered implementation supporting the
requested operation. Eventually, the plan is to have userspace supply a
description of the data characteristics and select the implementation
based on expected performance and memory use.
This patch includes the new 'lookup' expression to look up for element
matching in the set.
This patch includes kernel-doc descriptions for this set API and it
also includes the following fixes.
From Patrick McHardy:
* netfilter: nf_tables: fix set element data type in dumps
* netfilter: nf_tables: fix indentation of struct nft_set_elem comments
* netfilter: nf_tables: fix oops in nft_validate_data_load()
* netfilter: nf_tables: fix oops while listing sets of built-in tables
* netfilter: nf_tables: destroy anonymous sets immediately if binding fails
* netfilter: nf_tables: propagate context to set iter callback
* netfilter: nf_tables: add loop detection
From Pablo Neira Ayuso:
* netfilter: nf_tables: allow to dump all existing sets
* netfilter: nf_tables: fix wrong type for flags variable in newelem
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds nftables which is the intended successor of iptables.
This packet filtering framework reuses the existing netfilter hooks,
the connection tracking system, the NAT subsystem, the transparent
proxying engine, the logging infrastructure and the userspace packet
queueing facilities.
In a nutshell, nftables provides a pseudo-state machine with 4 general
purpose registers of 128 bits and 1 specific purpose register to store
verdicts. This pseudo-machine comes with an extensible instruction set,
a.k.a. "expressions" in the nftables jargon. The expressions included
in this patch provide the basic functionality, they are:
* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianess.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into
registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.
Using this instruction-set, the userspace utility 'nft' can transform
the rules expressed in human-readable text representation (using a
new syntax, inspired by tcpdump) to nftables bytecode.
nftables also inherits the table, chain and rule objects from
iptables, but in a more configurable way, and it also includes the
original datatype-agnostic set infrastructure with mapping support.
This set infrastructure is enhanced in the follow up patch (netfilter:
nf_tables: add netlink set API).
This patch includes the following components:
* the netlink API: net/netfilter/nf_tables_api.c and
include/uapi/netfilter/nf_tables.h
* the packet filter core: net/netfilter/nf_tables_core.c
* the expressions (described above): net/netfilter/nft_*.c
* the filter tables: arp, IPv4, IPv6 and bridge:
net/ipv4/netfilter/nf_tables_ipv4.c
net/ipv6/netfilter/nf_tables_ipv6.c
net/ipv4/netfilter/nf_tables_arp.c
net/bridge/netfilter/nf_tables_bridge.c
* the NAT table (IPv4 only):
net/ipv4/netfilter/nf_table_nat_ipv4.c
* the route table (similar to mangle):
net/ipv4/netfilter/nf_table_route_ipv4.c
net/ipv6/netfilter/nf_table_route_ipv6.c
* internal definitions under:
include/net/netfilter/nf_tables.h
include/net/netfilter/nf_tables_core.h
* It also includes an skeleton expression:
net/netfilter/nft_expr_template.c
and the preliminary implementation of the meta target
net/netfilter/nft_meta_target.c
It also includes a change in struct nf_hook_ops to add a new
pointer to store private data to the hook, that is used to store
the rule list per chain.
This patch is based on the patch from Patrick McHardy, plus merged
accumulated cleanups, fixes and small enhancements to the nftables
code that has been done since 2009, which are:
From Patrick McHardy:
* nf_tables: adjust netlink handler function signatures
* nf_tables: only retry table lookup after successful table module load
* nf_tables: fix event notification echo and avoid unnecessary messages
* nft_ct: add l3proto support
* nf_tables: pass expression context to nft_validate_data_load()
* nf_tables: remove redundant definition
* nft_ct: fix maxattr initialization
* nf_tables: fix invalid event type in nf_tables_getrule()
* nf_tables: simplify nft_data_init() usage
* nf_tables: build in more core modules
* nf_tables: fix double lookup expression unregistation
* nf_tables: move expression initialization to nf_tables_core.c
* nf_tables: build in payload module
* nf_tables: use NFPROTO constants
* nf_tables: rename pid variables to portid
* nf_tables: save 48 bits per rule
* nf_tables: introduce chain rename
* nf_tables: check for duplicate names on chain rename
* nf_tables: remove ability to specify handles for new rules
* nf_tables: return error for rule change request
* nf_tables: return error for NLM_F_REPLACE without rule handle
* nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
* nf_tables: fix NLM_F_MULTI usage in netlink notifications
* nf_tables: include NLM_F_APPEND in rule dumps
From Pablo Neira Ayuso:
* nf_tables: fix stack overflow in nf_tables_newrule
* nf_tables: nft_ct: fix compilation warning
* nf_tables: nft_ct: fix crash with invalid packets
* nft_log: group and qthreshold are 2^16
* nf_tables: nft_meta: fix socket uid,gid handling
* nft_counter: allow to restore counters
* nf_tables: fix module autoload
* nf_tables: allow to remove all rules placed in one chain
* nf_tables: use 64-bits rule handle instead of 16-bits
* nf_tables: fix chain after rule deletion
* nf_tables: improve deletion performance
* nf_tables: add missing code in route chain type
* nf_tables: rise maximum number of expressions from 12 to 128
* nf_tables: don't delete table if in use
* nf_tables: fix basechain release
From Tomasz Bursztyka:
* nf_tables: Add support for changing users chain's name
* nf_tables: Change chain's name to be fixed sized
* nf_tables: Add support for replacing a rule by another one
* nf_tables: Update uapi nftables netlink header documentation
From Florian Westphal:
* nft_log: group is u16, snaplen u32
From Phil Oester:
* nf_tables: operational limit match
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Similar to nat_decode_session, alloc_null_binding is needed for both
ip_tables and nf_tables, so move it to nf_nat_core.c. This change
is required by nf_tables.
This is an adapted version of the original patch from Patrick McHardy.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pass the hook ops to the hookfn to allow for generic hook
functions. This change is required by nf_tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If a frame's timestamp is calculated, and the bitrate
calculation goes wrong and returns zero, the system
will attempt to divide by zero and crash. Catch this
case and print the rate information that the driver
reported when this happens.
Cc: stable@vger.kernel.org
Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When parsing an invalid radiotap header, the parser can overrun
the buffer that is passed in because it doesn't correctly check
1) the minimum radiotap header size
2) the space for extended bitmaps
The first issue doesn't affect any in-kernel user as they all
check the minimum size before calling the radiotap function.
The second issue could potentially affect the kernel if an skb
is passed in that consists only of the radiotap header with a
lot of extended bitmaps that extend past the SKB. In that case
a read-only buffer overrun by at most 4 bytes is possible.
Fix this by adding the appropriate checks to the parser.
Cc: stable@vger.kernel.org
Reported-by: Evan Huus <eapache@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We do not actually need to set any rx filters for the virtual batman
soft interface. However a dummy handler enables a user to set static
multicast listeners for instance.
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
This is a simple batadv_tt_save_orig_buffer() refactoring
aiming to make it more generic and avoid useless casts.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Implement batadv_tt_entries() to get the number of entries
fitting in a given amount of bytes. This computation is done
several times in the code and therefore it is useful to have
an helper function.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
This is not used anymore with the new fragmentation, and it might
actually mess up the bonding code because find_router() assumes it
is only called once per packet.
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
The module prints a warning when the MTU on the hard interface is too
small to transfer payload traffic without fragmentation. The required
MTU is calculated based on the encapsulation header size. If network
coding is compild into the module its header size is taken into
account as well.
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
the icmp and the icmp_rr packets share the same initial
fields since they use the same code to be processed and
forwarded.
Extract the common fields and put them into a separate
struct so that future ICMP packets can be easily added
without bloating the packet definition.
However, keep the seqno field outside of the newly created
common header because future ICMP types may require a
bigger sequence number space.
This change breaks compatibility due to fields reordering
in the ICMP headers.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
When comparing a network ordered value with a constant, it
is better to convert the constant at compile time by means
of htons() instead of converting the value at runtime using
ntohs().
This refactoring may slightly improve the code performance.
Moreover substitute __constant_htons() with htons() since
the latter increase readability and it is smart enough to be
as efficient as the former
Signed-off-by: Antonio Quartulli <ordex@autistici.org>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Acked-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Non-broadcast packets larger than MTU are fragmented and sent with
an encapsulating header. Up to 16 fragments are supported, which are
sent in reverse order on the wire to allow minimal memory copying when
creating fragments.
Signed-off-by: Martin Hundebøll <martin@hundeboll.net>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Fragments arriving at their destination are buffered for later merge.
Merged packets are passed to the main receive function as had they never
been fragmented.
Fragments are forwarded without merging if the MTU of the outgoing
interface is smaller than the size of the merged packet.
Signed-off-by: Martin Hundebøll <martin@hundeboll.net>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>