Steffen Klassert says:
====================
1) Allow to avoid copying DSCP during encapsulation
by setting a SA flag. From Nicolas Dichtel.
2) Constify the netlink dispatch table, no need to modify it
at runtime. From Mathias Krause.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We use _get() and _put() for device ref-counting in the kernel. However,
hci_conn_put() is _not_ used for ref-counting, hence, rename it to
hci_conn_drop() so we can later fix ref-counting and introduce
hci_conn_put().
hci_conn_hold() and hci_conn_put() are currently used to manage how long a
connection should be held alive. When the last user drops the connection,
we spawn a delayed work that performs the disconnect. Obviously, this has
nothing to do with ref-counting for the _object_ but rather for the
keep-alive of the connection.
But we really _need_ proper ref-counting for the _object_ to allow
connection-users like rfcomm-tty, HIDP or others.
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Some drivers need SSID in AP and IBSS mode. AP SSID is provided
through BSS_CHANGED_SSID notification. There was no easy way to
do the same for IBSS. In IBSS mode SSID is known but was not
stored in BSS configuration. Extend the AP-mode functionality
to also work in IBSS mode.
Signed-off-by: Marek Puzyniak <marek.puzyniak@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch introduces an UAPI header for the SCTP protocol,
so that we can facilitate the maintenance and development of
user land applications or libraries, in particular in terms
of header synchronization.
To not break compatibility, some fragments from lksctp-tools'
netinet/sctp.h have been carefully included, while taking care
that neither kernel nor user land breaks, so both compile fine
with this change (for lksctp-tools I tested with the old
netinet/sctp.h header and with a newly adapted one that includes
the uapi sctp header). lksctp-tools smoke test run through
successfully as well in both cases.
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The callers always pass current to sock_update_netprio().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The callers always pass current to sock_update_classid().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of invalidating all IPv6 addresses with global scope
when one decides to use IPv6 tokens, we should only invalidate
previous tokens and leave the rest intact until they expire
eventually (or are intact forever). For doing this less greedy
approach, we're adding a bool at the end of inet6_ifaddr structure
instead, for two reasons: i) per-inet6_ifaddr flag space is
already used up, making it wider might not be a good idea,
since ii) also we do not necessarily need to export this
information into user space.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving data messages, the "BUG_ON(skb->len < skb->data_len)" in
the skb_pull() function triggers a kernel panic.
Replace the skb_pull logic by a per skb offset as advised by
Eric Dumazet.
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for IPv6 tokenized IIDs, that allow
for administrators to assign well-known host-part addresses
to nodes whilst still obtaining global network prefix from
Router Advertisements. It is currently in draft status.
The primary target for such support is server platforms
where addresses are usually manually configured, rather
than using DHCPv6 or SLAAC. By using tokenised identifiers,
hosts can still determine their network prefix by use of
SLAAC, but more readily be automatically renumbered should
their network prefix change. [...]
The disadvantage with static addresses is that they are
likely to require manual editing should the network prefix
in use change. If instead there were a method to only
manually configure the static identifier part of the IPv6
address, then the address could be automatically updated
when a new prefix was introduced, as described in [RFC4192]
for example. In such cases a DNS server might be
configured with such a tokenised interface identifier of
::53, and SLAAC would use the token in constructing the
interface address, using the advertised prefix. [...]
http://tools.ietf.org/html/draft-chown-6man-tokenised-ipv6-identifiers-02
The implementation is partially based on top of Mark K.
Thompson's proof of concept. However, it uses the Netlink
interface for configuration resp. data retrival, so that
it can be easily extended in future. Successfully tested
by myself.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check for NULL before calling the following operations from "struct
ieee802154_mlme_ops": assoc_req, assoc_resp, disassoc_req, start_req,
and scan_req.
This fixes a current oops where those functions are called but not
implemented. It also updates the documentation to clarify that they
are now optional by design. If a call to an unimplemented function
is attempted, the kernel returns EOPNOTSUPP via netlink.
The following operations are still required: get_phy, get_pan_id,
get_short_addr, and get_dsn.
Note that the places where this patch changes the initialization
of "ret" should not affect the rest of the code since "ret" was
always set (again) before returning its value.
Signed-off-by: Werner Almesberger <werner@almesberger.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It served no purpose: we never call it from anywhere in the stack
and the only driver that did implement it (fakehard) merely provided
a dummy value.
There is also considerable doubt whether it would make sense to
even attempt beacon processing at this level in the Linux kernel.
Signed-off-by: Werner Almesberger <werner@almesberger.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that uids and gids are completely encapsulated in kuid_t
and kgid_t we no longer need to pass struct cred which allowed
us to test both the uid and the user namespace for equality.
Passing struct cred potentially allows us to pass the entire group
list as BSD does but I don't believe the cost of cache line misses
justifies retaining code for a future potential application.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter and IPVS updates for
your net-next tree, most relevantly they are:
* Add net namespace support to NFLOG, ULOG and ebt_ulog and NFQUEUE.
The LOG and ebt_log target has been also adapted, but they still
depend on the syslog netnamespace that seems to be missing, from
Gao Feng.
* Don't lose indications of congestion in IPv6 fragmentation handling,
from Hannes Frederic Sowa.i
* IPVS conversion to use RCU, including some code consolidation patches
and optimizations, also some from Julian Anastasov.
* cpu fanout support for NFQUEUE, from Holger Eitzenberger.
* Better error reporting to userspace when dropping packets from
all our _*_[xfrm|route]_me_harder functions, from Patrick McHardy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to verify that the given sockets actually are l2cap sockets. If
they aren't, we are not supposed to access bt_sk(sock) and we shouldn't
start the session if the offsets turn out to be valid local BT addresses.
That is, if someone passes a TCP socket to HIDCONNADD, then we access some
random offset in the TCP socket (which isn't even guaranteed to be valid).
Fix this by checking that the socket is an l2cap socket.
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
This patch adds netns support to nf_log and it prepares netns
support for existing loggers. It is composed of four major
changes.
1) nf_log_register has been split to two functions: nf_log_register
and nf_log_set. The new nf_log_register is used to globally
register the nf_logger and nf_log_set is used for enabling
pernet support from nf_loggers.
Per netns is not yet complete after this patch, it comes in
separate follow up patches.
2) Add net as a parameter of nf_log_bind_pf. Per netns is not
yet complete after this patch, it only allows to bind the
nf_logger to the protocol family from init_net and it skips
other cases.
3) Adapt all nf_log_packet callers to pass netns as parameter.
After this patch, this function only works for init_net.
4) Make the sysctl net/netfilter/nf_log pernet.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch makes this proc dentry pernet. So far only init_net
had a /proc/net/netfilter directory.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch implements per hash bucket locking for the frag queue
hash. This removes two write locks, and the only remaining write
lock is for protecting hash rebuild. This essentially reduce the
readers-writer lock to a rebuild lock.
This patch is part of "net: frag performance followup"
http://thread.gmane.org/gmane.linux.network/263644
of which two patches have already been accepted:
Same test setup as previous:
(http://thread.gmane.org/gmane.linux.network/257155)
Two 10G interfaces, on seperate NUMA nodes, are under-test, and uses
Ethernet flow-control. A third interface is used for generating the
DoS attack (with trafgen).
Notice, I have changed the frag DoS generator script to be more
efficient/deadly. Before it would only hit one RX queue, now its
sending packets causing multi-queue RX, due to "better" RX hashing.
Test types summary (netperf UDP_STREAM):
Test-20G64K == 2x10G with 65K fragments
Test-20G3F == 2x10G with 3x fragments (3*1472 bytes)
Test-20G64K+DoS == Same as 20G64K with frag DoS
Test-20G3F+DoS == Same as 20G3F with frag DoS
Test-20G64K+MQ == Same as 20G64K with Multi-Queue frag DoS
Test-20G3F+MQ == Same as 20G3F with Multi-Queue frag DoS
When I rebased this-patch(03) (on top of net-next commit a210576c) and
removed the _bh spinlock, I saw a performance regression. BUT this
was caused by some unrelated change in-between. See tests below.
Test (A) is what I reported before for patch-02, accepted in commit 1b5ab0de.
Test (B) verifying-retest of commit 1b5ab0de corrospond to patch-02.
Test (C) is what I reported before for this-patch
Test (D) is net-next master HEAD (commit a210576c), which reveals some
(unknown) performance regression (compared against test (B)).
Test (D) function as a new base-test.
Performance table summary (in Mbit/s):
(#) Test-type: 20G64K 20G3F 20G64K+DoS 20G3F+DoS 20G64K+MQ 20G3F+MQ
---------- ------- ------- ---------- --------- -------- -------
(A) Patch-02 : 18848.7 13230.1 4103.04 5310.36 130.0 440.2
(B) 1b5ab0de : 18841.5 13156.8 4101.08 5314.57 129.0 424.2
(C) Patch-03v1: 18838.0 13490.5 4405.11 6814.72 196.6 461.6
(D) a210576c : 18321.5 11250.4 3635.34 5160.13 119.1 405.2
(E) with _bh : 17247.3 11492.6 3994.74 6405.29 166.7 413.6
(F) without bh: 17471.3 11298.7 3818.05 6102.11 165.7 406.3
Test (E) and (F) is this-patch(03), with(V1) and without(V2) the _bh spinlocks.
I cannot explain the slow down for 20G64K (but its an artificial
"lab-test" so I'm not worried). But the other results does show
improvements. And test (E) "with _bh" version is slightly better.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
----
V2:
- By analysis from Hannes Frederic Sowa and Eric Dumazet, we don't
need the spinlock _bh versions, as Netfilter currently does a
local_bh_disable() before entering inet_fragment.
- Fold-in desc from cover-mail
V3:
- Drop the chain_len counter per hash bucket.
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver init queue is no longer needed. This can be all handled
inside the drivers now. So remove it.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Some drivers require a special stage for their early init. This is
always specific to the driver or transport. So call back into driver to
allow bringing up the device.
The advantage with this stage is that the Bluetooth core is actually
handling the HCI layer now. This means that command and event processing
is available.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This patch adds a __hci_cmd_sync_ev function, analogous to
__hci_cmd_sync except that it also takes an event parameter to indicate
that the command completes with a special event instead of command
complete. Internally this new function takes advantage of the
hci_req_add_ev function introduced in the previous patch.
The primary expected user of this new function are the setup routines of
HCI drivers which may want to send custom commands and return only when
they have completed.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds support for having commands within HCI requests that do
not result in a command complete but some other event. This is at least
needed for some vendor specific commands to be issued in the
hdev->setup() procecure, but might also be useful for other commands.
The way that the support is implemented is by extending the skb control
buffer to have a field to indicate that the command is expected to
terminate with a special event. After sending the command each received
event can then be compared against this field through hdev->sent_cmd.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds a helper function for sending a single HCI command
waiting for its completion and then returning back the parameters in the
resulting command complete event (if there was one).
The implementation is very similar to that of hci_req_sync() except that
instead of invocing a callback for sending HCI commands the function
constructs and sends one itself and after being woken up picks the last
received event from hdev->recv_evt (if it matches the right criteria)
and returns it.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
This patch adds tracking of received HCI events to the hci_dev struct.
This is necessary so that a subsequent patch can implement a function
for sending a single command synchronously and returning the resulting
command complete parameters in the function return value.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
This patch removes the hci_req_cmd_status function since it is not
used anymore. The HCI request framework now considers the HCI command
has complete once the Command Status or Command Complete Event is
received.
Signed-off-by: Andre Guedes <andre.guedes@openbossa.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Remove a declaration left over from the TCPCT-ectomy. This sysctl is
no longer referenced anywhere since 1a2c6181c4 ("tcp: Remove TCPCT").
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the final step in RCU conversion.
Things that are removed:
- svc->usecnt: now svc is accessed under RCU read lock
- svc->inc: and some unused code
- ip_vs_bind_pe and ip_vs_unbind_pe: no ability to replace PE
- __ip_vs_svc_lock: replaced with RCU
- IP_VS_WAIT_WHILE: now readers lookup svcs and dests under
RCU and work in parallel with configuration
Other changes:
- before now, a RCU read-side critical section included the
calling of the schedule method, now it is extended to include
service lookup
- ip_vs_svc_table and ip_vs_svc_fwm_table are now using hlist
- svc->pe and svc->scheduler remain to the end (of grace period),
the schedulers are prepared for such RCU readers
even after done_service is called but they need
to use synchronize_rcu because last ip_vs_scheduler_put
can happen while RCU read-side critical sections
use an outdated svc->scheduler pointer
- as planned, update_service is removed
- empty services can be freed immediately after grace period.
If dests were present, the services are freed from
the dest trash code
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
In previous commits the schedulers started to access
svc->destinations with _rcu list traversal primitives
because the IP_VS_WAIT_WHILE macro still plays the role of
grace period. Now it is time to finish the updating part,
i.e. adding and deleting of dests with _rcu suffix before
removing the IP_VS_WAIT_WHILE in next commit.
We use the same rule for conns as for the
schedulers: dests can be searched in RCU read-side critical
section where ip_vs_dest_hold can be called by ip_vs_bind_dest.
Some things are not perfect, for example, calling
functions like ip_vs_lookup_dest from updating code under
RCU, just because we use some function both from reader
and from updater.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
This method releases the scheduler state,
it can not fail. Such change will help to properly
replace the scheduler in following patch.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
All dests will go to trash, no exceptions.
But we have to use new list node t_list for this, due
to RCU changes in following patches. Dests will wait there
initial grace period and later all conns and schedulers to
put their reference. The dests don't get reference for
staying in dest trash as before.
As result, we do not load ip_vs_dest_put with
extra checks for last refcnt and the schedulers do not
need to play games with atomic_inc_not_zero while
selecting best destination.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
ip_vs_dest_hold will be used under RCU lock
while ip_vs_dest_put can be called even after dest
is removed from service, as it happens for conns and
some schedulers.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Allow schedulers to use rcu_dereference when
returning destination on lookup. The RCU read-side critical
section will allow ip_vs_bind_dest to get dest refcnt as
preparation for the step where destinations will be
deleted without an IP_VS_WAIT_WHILE guard that holds the
packet processing during update.
Add new optional scheduler methods add_dest,
del_dest and upd_dest. For now the methods are called
together with update_service but update_service will be
removed in a following change.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
We have many fields to set and few to reset,
use kmem_cache_alloc instead to save some cycles.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
__ip_vs_conn_in_get and ip_vs_conn_out_get are
hot places. Optimize them, so that ports are matched first.
By moving net and fwmark below, on 32-bit arch we can fit
caddr in 32-byte cache line and all addresses in 64-byte
cache line.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Convert __ip_vs_conntbl_lock_array as follows:
- readers that do not modify conn lists will use RCU lock
- updaters that modify lists will use spinlock_t
Now for conn lookups we will use RCU read-side
critical section. Without using __ip_vs_conn_get such
places have access to connection fields and can
dereference some pointers like pe and pe_data plus
the ability to update timer expiration. If full access
is required we contend for reference.
We add barrier in __ip_vs_conn_put, so that
other CPUs see the refcnt operation after other writes.
With the introduction of ip_vs_conn_unlink()
we try to reorganize ip_vs_conn_expire(), so that
unhashing of connections that should stay more time is
avoided, even if it is for very short time.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
rs_lock was used to protect rs_table (hash table)
from updaters (under global mutex) and readers (packet handlers).
We can remove rs_lock by using RCU lock for readers. Reclaiming
dest only with kfree_rcu is enough because the readers access
only fields from the ip_vs_dest structure.
Use hlist for rs_table.
As we are now using hlist_del_rcu, introduce in_rs_table
flag as replacement for the list_empty checks which do not
work with RCU. It is needed because only NAT dests are in
the rs_table.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
We use locks like tcp_app_lock, udp_app_lock,
sctp_app_lock to protect access to the protocol hash tables
from readers in packet context while the application
instances (inc) are [un]registered under global mutex.
As the hash tables are mostly read when conns are
created and bound to app, use RCU for readers and reclaim
app instance after grace period.
Simplify ip_vs_app_inc_get because we use usecnt
only for statistics and rely on module refcounting.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Currently when forwarding requests to real servers
we use dst_lock and atomic operations when cloning the
dst_cache value. As the dst_cache value does not change
most of the time it is better to use RCU and to lock
dst_lock only when we need to replace the obsoleted dst.
For this to work we keep dst_cache in new structure protected
by RCU. For packets to remote real servers we will use noref
version of dst_cache, it will be valid while we are in RCU
read-side critical section because now dst_release for replaced
dsts will be invoked after the grace period. Packets to
local real servers that are passed to local stack with
NF_ACCEPT need a dst clone.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Move and give better names to two functions:
- ip_vs_dst_reset to __ip_vs_dst_cache_reset
- __ip_vs_dev_reset to ip_vs_forget_dev
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Avoid replacing the cached route for real server
on every packet with different TOS. I doubt that routing
by TOS for real server is used at all, so we should be
better with such optimization.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Currently, when a socket receives something on the error queue it only wakes up
the socket on select if it is in the "read" list, that is the socket has
something to read. It is useful also to wake the socket if it is in the error
list, which would enable software to wait on error queue packets without waking
up for regular data on the socket. The main use case is for receiving
timestamped transmit packets which return the timestamp to the socket via the
error queue. This enables an application to select on the socket for the error
queue only instead of for the regular traffic.
-v2-
* Added the SO_SELECT_ERR_QUEUE socket option to every architechture specific file
* Modified every socket poll function that checks error queue
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jeffrey Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Matthew Vick <matthew.vick@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the protection of netns_frags.nqueues updates under the LRU_lock,
instead of the write lock. As they are located on the same cacheline,
and this is also needed when transitioning to use per hash bucket locking.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip-header id needs to be incremented even if IP_DF flag is set.
This behaviour was changed in commit 490ab08127cebc25e3a26
(IP_GRE: Fix IP-Identification).
Following patch fixes it so that identification is always
incremented.
Reported-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Inspection of upper layer protocol is considered harmful, especially
if it is about ARP or other stateful upper layer protocol; driver
cannot (and should not) have full state of them.
IPv4 over Firewire module used to inspect ARP (both in sending path
and in receiving path), and record peer's GUID, max packet size, max
speed and fifo address. This patch removes such inspection by extending
our "hardware address" definition to include other information as well:
max packet size, max speed and fifo. By doing this, The neighbour
module in networking subsystem can cache them.
Note: As we have started ignoring sspd and max_rec in ARP/NDP, those
information will not be used in the driver when sending.
When a packet is being sent, the IP layer fills our pseudo header with
the extended "hardware address", including GUID and fifo. The driver
can look-up node-id (the real but rather volatile low-level address)
by GUID, and then the module can send the packet to the wire using
parameters provided in the extendedn hardware address.
This approach is realistic because IP over IEEE1394 (RFC2734) and IPv6
over IEEE1394 (RFC3146) share same "hardware address" format
in their address resolution protocols.
Here, extended "hardware address" is defined as follows:
union fwnet_hwaddr {
u8 u[16];
struct {
__be64 uniq_id; /* EUI-64 */
u8 max_rec; /* max packet size */
u8 sspd; /* max speed */
__be16 fifo_hi; /* hi 16bits of FIFO addr */
__be32 fifo_lo; /* lo 32bits of FIFO addr */
} __packed uc;
};
Note that Hardware address is declared as union, so that we can map full
IP address into this, when implementing MCAP (Multicast Cannel Allocation
Protocol) for IPv6, but IP and ARP subsystem do not need to know this
format in detail.
One difference between original ARP (RFC826) and 1394 ARP (RFC2734)
is that 1394 ARP Request/Reply do not contain the target hardware address
field (aka ar$tha). This difference is handled in the ARP subsystem.
CC: Stephan Gatzka <stephan.gatzka@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following patch refactors GRE code into ip tunneling code and GRE
specific code. Common tunneling code is moved to ip_tunnel module.
ip_tunnel module is written as generic library which can be used
by different tunneling implementations.
ip_tunnel module contains following components:
- packet xmit and rcv generic code. xmit flow looks like
(gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
- hash table of all devices.
- lookup for tunnel devices.
- control plane operations like device create, destroy, ioctl, netlink
operations code.
- registration for tunneling modules, like gre, ipip etc.
- define single pcpu_tstats dev->tstats.
- struct tnl_ptk_info added to pass parsed tunnel packet parameters.
ipip.h header is renamed to ip_tunnel.h
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>