Fix the following build warning:
Documentation/networking/devlink/mlx5.rst:13: WARNING: Error parsing content block for the "list-table" directive:
+uniform two-level bullet list expected, but row 2 does not contain the same number of items as row 1 (2 vs 3).
...
Add the missing item in the first row.
Fixes: 0844fa5f7b ("net/mlx5: Let user configure io_eq_size param")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The rep legacy RQ completion handling was missing the appropriate
handling of error CQEs (dump the CQE and queue a recover work), fix it
by calling trigger_report() when needed.
Since all CQE handling flows do the exact same error CQE handling,
extract it to a common helper function.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Remove redundant and trivial error logging when trying to
offload mirred device with unsupported devices.
Using OVS could hit those a lot and the errors are still
logged in extack.
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Feature dependencies should be resolved in fix features rather than in
set features flow. Move the check that disables HW-GRO in case CQE
compression is enabled from set_feature_hw_gro() to
mlx5e_fix_features().
Signed-off-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Remove redundant space when constructing the feature's enum. Validate
against the indented enum value.
Fixes: 6c72cb05d4 ("net/mlx5e: Use bitmap field for profile features")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
When using libvirt to passthrough VF to VM it will always set the VF vlan
to 0 even if user didn’t request it, this will cause libvirt to fail to
boot in case the PF isn't eswitch owner.
Example of such case is the DPU host PF which isn't eswitch manager, so
any attempt to passthrough VF of it using libvirt will fail.
Fix it by not returning error in case set VF vlan is called with vid 0.
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add FEC counters' statistics of corrected_blocks and
uncorrectable_blocks, along with their lanes via ethtool.
HW supports corrected_blocks and uncorrectable_blocks counters both for
RS-FEC mode and FC-FEC mode. In FC mode these counters are accumulated
per lane, while in RS mode the correction method crosses lanes, thus
only total corrected_blocks and uncorrectable_blocks are reported in
this mode.
Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
log_max_qp in driver's default profile #2 was set to 18, but FW actually
supports 17 at the most - a situation that led to the concerning print
when the driver is loaded:
"log_max_qp value in current profile is 18, changing to HCA capabaility
limit (17)"
The expected behavior from mlx5_profile #2 is to match the maximum FW
capability in regards to log_max_qp. Thus, log_max_qp in profile #2 is
initialized to a defined static value (0xff) - which basically means that
when loading this profile, log_max_qp value will be what the currently
installed FW supports at most.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently all SFs are using the same CPUs. Spreading SF over CPUs, in
round-robin manner, in order to achieve better distribution of the SFs
over available CPUs.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently IRQs are requested one by one. To balance spreading IRQs
among cpus using such scheme requires remembering cpu mask for the
cpus used for a given device. This complicates the IRQ allocation
scheme in subsequent patch.
Hence, prepare the code for bulk IRQs allocation. This enables
spreading IRQs among cpus in subsequent patch.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The downstream patches add more functionality to irq_pool_affinity.
Move the irq_pool_affinity logic to a new file in order to ease the
coding and maintenance of it.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Move affinity binding of the IRQ to irq_request function in order to
bind the IRQ before inserting it to the xarray.
After this change, the IRQ is ready for use when inserted to the xarray.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, IRQ layer have a separate flow for ctrl and comp IRQs, and
the distinction between ctrl and comp IRQs is done in the IRQ layer.
In order to ease the coding and maintenance of the IRQ layer,
introduce a new API for requesting control IRQs -
mlx5_ctrl_irq_request(struct mlx5_core_dev *dev).
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Callers of this functions ignore its return value, as reported by
Wang Qing, in one of the return paths, it returns positive values.
Since return value is ignored anyways, void out the return type of the
function.
Reported-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The second parameter of bpf_d_path() can only accept writable
memories. Rdonly_mem obtained from bpf_per_cpu_ptr() can not
be passed into bpf_d_path for modification. This patch adds
a selftest to verify this behavior.
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220106205525.2116218-1-haoluo@google.com
This adds documention for:
- bpf_map_delete_batch()
- bpf_map_lookup_batch()
- bpf_map_lookup_and_delete_batch()
- bpf_map_update_batch()
This also updates the public API for the `keys` parameter
of `bpf_map_delete_batch()`, and both the
`keys` and `values` parameters of `bpf_map_update_batch()`
to be constants.
Signed-off-by: Grant Seltzer <grantseltzer@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220106201304.112675-1-grantseltzer@gmail.com
PT_REGS*() macro on some architectures force-cast struct pt_regs to
other types (user_pt_regs, etc) and might drop volatile modifiers, if any.
Volatile isn't really required as pt_regs value isn't supposed to change
during the BPF program run, so this is correct behavior.
But progs/loop3.c relies on that volatile modifier to ensure that loop
is preserved. Fix loop3.c by declaring i and sum variables as volatile
instead. It preserves the loop and makes the test pass on all
architectures (including s390x which is currently broken).
Fixes: 3cc31d7940 ("libbpf: Normalize PT_REGS_xxx() macro definitions")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220106205156.955373-1-andrii@kernel.org
kfree() and bitmap_free() are the same. But using the latter is more
consistent when freeing memory allocated with bitmap_zalloc().
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When a bitmap is local to a function, it is safe to use the non-atomic
__[set|clear]_bit(). No concurrent accesses can occur.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The 'possible_idx' bitmap is set just after it is zeroed, so we can save
the first step.
The 'free_idx' bitmap is used only at the end of the function as the
result of a bitmap xor operation. So there is no need to explicitly
zero it before.
So, slightly simply the code and remove 2 useless 'bitmap_zero()' call
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
In current switchdev implementation, every VF PR is assigned to
individual ring on switchdev ctrl VSI. For slow-path traffic, there
is a mapping VF->ring done in software based on src_vsi value (by
calling ice_eswitch_get_target_netdev function).
With this change, HW solution is introduced which is more
efficient. For each VF, src MAC (VF's MAC) filter will be created,
which forwards packets to the corresponding switchdev ctrl VSI queue
based on src MAC address.
This filter has to be removed and then replayed in case of
resetting one VF. Keep information about this rule in repr->mac_rule,
thanks to that we know which rule has to be removed and replayed
for a given VF.
In case of CORE/GLOBAL all rules are removed
automatically. We have to take care of readding them. This is done
by ice_replay_vsi_adv_rule.
When driver leaves switchdev mode, remove all advanced rules
from switchdev ctrl VSI. This is done by ice_rem_adv_rule_for_vsi.
Flag repr->rule_added is needed because in some cases reset
might be triggered before VF sends request to add MAC.
Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice_replay_vsi_adv_rule will replay advanced rules for a given VSI.
Exit this function when list of rules for given recipe is empty.
Do not add rule when given vsi_handle does not match vsi_handle
from the rule info.
Use ICE_MAX_NUM_RECIPES instead of ICE_SW_LKUP_LAST in order to find
advanced rules as well.
Signed-off-by: Victor Raj <victor.raj@intel.com>
Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Laurent reported that they have seen a significant amount of TCP retransmissions
at high throughput from applications residing in network namespaces talking to
the outside world via veths. The drops were seen on the qdisc layer (fq_codel,
as per systemd default) of the phys device such as ena or virtio_net due to all
traffic hitting a _single_ TX queue _despite_ multi-queue device. (Note that the
setup was _not_ using XDP on veths as the issue is generic.)
More specifically, after edbea92202 ("veth: Store queue_mapping independently
of XDP prog presence") which made it all the way back to v4.19.184+,
skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX
queue by default for veths) instead of leaving at 0.
This is eventually retained and callbacks like ena_select_queue() will also pick
single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the traffic
is forwarded to that device via upper stack or other means. Similarly, for others
not implementing ndo_select_queue() if XPS is disabled, netdev_pick_tx() might
call into the skb_tx_hash() and check for prior skb_rx_queue_recorded() as well.
In general, it is a _bad_ idea for virtual devices like veth to mess around with
queue selection [by default]. Given dev->real_num_tx_queues is by default 1,
the skb->queue_mapping was left untouched, and so prior to edbea92202 the
netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys device.
Unbreak this and restore prior behavior by removing the skb_record_rx_queue()
from veth_xmit() altogether.
If the veth peer has an XDP program attached, then it would return the first RX
queue index in xdp_md->rx_queue_index (unless configured in non-default manner).
However, this is still better than breaking the generic case.
Fixes: edbea92202 ("veth: Store queue_mapping independently of XDP prog presence")
Fixes: 638264dc90 ("veth: Support per queue XDP ring")
Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Toshiaki Makita <toshiaki.makita1@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the ibmveth sysfs code to use default_groups
field which has been the preferred way since aa30f47cf6 ("kobject: Add
support for default attribute groups to kobj_type") so that we can soon
get rid of the obsolete default_attrs field.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Cristobal Forno <cforno12@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clean the following coccicheck warning:
./drivers/net/ethernet/sfc/efx_channels.c:870:36-37: WARNING opportunity
for swap().
./drivers/net/ethernet/sfc/efx_channels.c:824:36-37: WARNING opportunity
for swap().
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ethtool_get_phy_stats(), the phydev varaible is set to
dev->phydev but dev->phydev is still used. Replace
dev->phydev uses with phydev.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the PCS selection to use mac_select_pcs, which allows the PCS
to perform any validation it needs.
We must use separate phylink_pcs instances for the USX and SGMII PCS,
rather than just changing the "ops" pointer before re-setting it to
phylink as this interface queries the PCS, rather than requesting it
to be changed.
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet suggested to allow users to modify max GRO packet size.
We have seen GRO being disabled by users of appliances (such as
wifi access points) because of claimed bufferbloat issues,
or some work arounds in sch_cake, to split GRO/GSO packets.
Instead of disabling GRO completely, one can chose to limit
the maximum packet size of GRO packets, depending on their
latency constraints.
This patch adds a per device gro_max_size attribute
that can be changed with ip link command.
ip link set dev eth0 gro_max_size 16000
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When multiple sockets using the SOF_TIMESTAMPING_BIND_PHC flag received
a packet with a hardware timestamp (e.g. multiple PTP instances in
different PTP domains using the UDPv4/v6 multicast or L2 transport),
the timestamps received on some sockets were corrupted due to repeated
conversion of the same timestamp (by the same or different vclocks).
Fix ptp_convert_timestamp() to not modify the shared skb timestamp
and return the converted timestamp as a ktime_t instead. If the
conversion fails, return 0 to not confuse the application with
timestamps corresponding to an unexpected PHC.
Fixes: d7c0882655 ("net: socket: support hardware timestamp conversion to PHC bound")
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Yangbo Lu <yangbo.lu@nxp.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As discussed during review here:
https://patchwork.kernel.org/project/netdevbpf/patch/20220105132141.2648876-3-vladimir.oltean@nxp.com/
we should inform developers about pitfalls of concurrent access to the
boolean properties of dsa_switch and dsa_port, now that they've been
converted to bit fields. No other measure than a comment needs to be
taken, since the code paths that update these bit fields are not
concurrent with each other.
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a cosmetic incremental fixup to commits
7787ff7763 ("net: dsa: merge all bools of struct dsa_switch into a single u32")
bde82f389a ("net: dsa: merge all bools of struct dsa_port into a single u8")
The desire to make this change was enunciated after posting these
patches here:
https://patchwork.kernel.org/project/netdevbpf/cover/20220105132141.2648876-1-vladimir.oltean@nxp.com/
but due to a slight timing overlap (message posted at 2:28 p.m. UTC,
merge commit is at 2:46 p.m. UTC), that comment was missed and the
changes were applied as-is.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean says:
====================
DSA initialization cleanups
These patches contain miscellaneous work that makes the DSA init code
path symmetric with the teardown path, and some additional patches
carried by Ansuel Smith for his register access over Ethernet work, but
those patches can be applied as-is too.
https://patchwork.kernel.org/project/netdevbpf/patch/20211214224409.5770-3-ansuelsmth@gmail.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit a57d8c217a ("net: dsa: flush switchdev workqueue before
tearing down CPU/DSA ports"), the port setup and teardown procedure
became asymmetric.
The fact of the matter is that user ports need the shared ports to be up
before they can be used for CPU-initiated termination. And since we
register net devices for the user ports, those won't be functional until
we also call the setup for the shared (CPU, DSA) ports. But we may do
that later, depending on the port numbering scheme of the hardware we
are dealing with.
It just makes sense that all shared ports are brought up before any user
port is. I can't pinpoint any issue due to the current behavior, but
let's change it nonetheless, for consistency's sake.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA needs to simulate master tracking events when a binding is first
with a DSA master established and torn down, in order to give drivers
the simplifying guarantee that ->master_state_change calls are made
only when the master's readiness state to pass traffic changes.
master_state_change() provide a operational bool that DSA driver can use
to understand if DSA master is operational or not.
To avoid races, we need to block the reception of
NETDEV_UP/NETDEV_CHANGE/NETDEV_GOING_DOWN events in the netdev notifier
chain while we are changing the master's dev->dsa_ptr (this changes what
netdev_uses_dsa(dev) reports).
The dsa_master_setup() and dsa_master_teardown() functions optionally
require the rtnl_mutex to be held, if the tagger needs the master to be
promiscuous, these functions call dev_set_promiscuity(). Move the
rtnl_lock() from that function and make it top-level.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At present there are two paths for changing the MTU of the DSA master.
The first is:
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> dsa_slave_change_mtu
-> dev_set_mtu(master)
The second is:
dsa_tree_setup
-> dsa_tree_setup_master
-> dsa_master_setup
-> dev_set_mtu(dev)
So the dev_set_mtu() call from dsa_master_setup() has been effectively
superseded by the dsa_slave_change_mtu(slave_dev, ETH_DATA_LEN) that is
done from dsa_slave_create() for each user port. The later function also
updates the master MTU according to the largest user port MTU from the
tree. Therefore, updating the master MTU through a separate code path
isn't needed.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently dsa_slave_create() has two sequences of rtnl_lock/rtnl_unlock
in a row. Remove the rtnl_unlock() and rtnl_lock() in between, such that
the operation can execute slighly faster.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In dsa_slave_create() there are 2 sections that take rtnl_lock():
MTU change and netdev registration. They are separated by PHY
initialization.
There isn't any strict ordering requirement except for the fact that
netdev registration should be last. Therefore, we can perform the MTU
change a bit later, after the PHY setup. A future change will then be
able to merge the two rtnl_lock sections into one.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2022-01-06
1) Fix some clang_analyzer warnings about never read variables.
From luo penghao.
2) Check for pols[0] only once in xfrm_expand_policies().
From Jean Sacren.
3) The SA curlft.use_time was updated only on SA cration time.
Update whenever the SA is used. From Antony Antony
4) Add support for SM3 secure hash.
From Xu Jia.
5) Add support for SM4 symmetric cipher algorithm.
From Xu Jia.
6) Add a rate limit for SA mapping change messages.
From Antony Antony.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an xdp_do_redirect_frame() variant which supports pre-computed
xdp_frame structures. This will be used in bpf_prog_run() to avoid having
to write to the xdp_frame structure when the XDP program doesn't modify the
frame boundaries.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103150812.87914-6-toke@redhat.com
All map redirect functions except XSK maps convert xdp_buff to xdp_frame
before enqueueing it. So move this conversion of out the map functions
and into xdp_do_redirect(). This removes a bit of duplicated code, but more
importantly it makes it possible to support caller-allocated xdp_frame
structures, which will be added in a subsequent commit.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103150812.87914-5-toke@redhat.com
Store the XDP mem ID inside the page_pool struct so it can be retrieved
later for use in bpf_prog_run().
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20220103150812.87914-4-toke@redhat.com
Add a new callback function to page_pool that, if set, will be called every
time a new page is allocated. This will be used from bpf_test_run() to
initialise the page data with the data provided by userspace when running
XDP programs with redirect turned on.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20220103150812.87914-3-toke@redhat.com
The functions that register an XDP memory model take a struct xdp_rxq as
parameter, but the RXQ is not actually used for anything other than pulling
out the struct xdp_mem_info that it embeds. So refactor the register
functions and export variants that just take a pointer to the xdp_mem_info.
This is in preparation for enabling XDP_REDIRECT in bpf_prog_run(), using a
page_pool instance that is not connected to any network device.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220103150812.87914-2-toke@redhat.com
Ong Boon says:
====================
First of all, sorry for taking more time to get back to this series and
thanks to all valuble feedback in series-1 at [1] from Jesper and Song
Liu.
Since then I have looked into what Jesper suggested in [2] and worked on
revising the patch series into several patches for ease of review:
v1->v2:
1/7: [No change]. Add VLAN tag (ID & Priority) to the generated Tx-Only
frames.
2/7: [No change]. Add DMAC and SMAC setting to the generated Tx-Only
frames. If parameters are not set, previous DMAC and SMAC are used.
3/7: [New]. Add support for selecting different CLOCK for clock_gettime()
used in get_nsecs.
4/7: [New]. This is a total rework from series-1 3/4-patch [3]. It uses
clock_nanosleep() suggested by Jesper. In addition, added statistic
for Tx schedule variance under application stat (-a|--app-stats).
Make the cyclic Tx operation and --poll mode to be mutually-
exclusive. Still, the ability to specify TX cycle time and used
together with batch size and packet count remain the same.
5/7: [New]. Add the support for TX process schedule policy and priority
setting. By default, SCHED_OTHER policy is used. This too is matching
the schedule policy setting in [2].
6/7: [Change]. This is update from series-1 4/4-patch [4]. Added TX clean
process time-out in 1s granularity with configurable retries count
(-O|--retries).
7/7: [New]. Added timestamp for TX packet following pktgen_hdr format
matching the implementation in [2]. However, the sequence ID remains
the same as it is instead of process schedule diff in [2].
To summarize on what program options have been added with v2 series
using an example below:-
DMAC (-G) = fa:8d:f1:e2:0b:e8
SMAC (-H) = ce:17:07:17:3e:3a
VLAN tagged (-V)
VLAN ID (-J) = 12
VLAN Pri (-K) = 3
Tx Queue (-q) = 3
Cycle Time in us (-T) = 1000
Batch (-b) = 2
Packet Count = 6
Tx schedule policy (-W) = FIFO
Tx schedule priority (-U) = 50
Clock selection (-w) = REALTIME
Tx timeout retries(-O) = 5
Tx timestamp (-y)
Cyclic Tx schedule stat (-a)
Note: xdpsock sets UDP dest-port and src-port to 0x1000 as default.
Sending Board
=============
$ xdpsock -i eth0 -t -N -z -H ce:17:07:17:3e:3a -G fa:8d:f1:e2:0b:e8 \
-V -J 12 -K 3 -q 3 \
-T 1000 -b 2 -C 6 -W FIFO -U 50 -w REALTIME \
-O 5 -y -a
sock0@eth0:3 txonly xdp-drv
pps pkts 0.00
rx 0 0
tx 0 6
calls/s count
rx empty polls 0 0
fill fail polls 0 0
copy tx sendtos 0 0
tx wakeup sendtos 0 5
opt polls 0 0
period min ave max cycle
Cyclic TX 1000000 31033 32009 33397 3
Receiving Board
===============
$ tcpdump -nei eth0 udp port 0x1000 -vv -Q in -X \
--time-stamp-precision nano
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:46:40.520111580 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~....
0x0010: 0a0a 0a20 1000 1000 0018 e997 be9b e955 ...............U
0x0020: 0000 0000 61cd 2ba1 0006 987c ....a.+....|
03:46:40.520112163 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~....
0x0010: 0a0a 0a20 1000 1000 0018 e996 be9b e955 ...............U
0x0020: 0000 0001 61cd 2ba1 0006 987c ....a.+....|
03:46:40.521066860 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~....
0x0010: 0a0a 0a20 1000 1000 0018 e5af be9b e955 ...............U
0x0020: 0000 0002 61cd 2ba1 0006 9c62 ....a.+....b
03:46:40.521067012 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~....
0x0010: 0a0a 0a20 1000 1000 0018 e5ae be9b e955 ...............U
0x0020: 0000 0003 61cd 2ba1 0006 9c62 ....a.+....b
03:46:40.522061935 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~....
0x0010: 0a0a 0a20 1000 1000 0018 e1c5 be9b e955 ...............U
0x0020: 0000 0004 61cd 2ba1 0006 a04a ....a.+....J
03:46:40.522062173 ce:17:07:17:3e:3a > fa:8d:f1:e2:0b:e8, ethertype 802.1Q (0x8100), length 62: vlan 12, p 3, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [none], proto UDP (17), length 44)
10.10.10.16.4096 > 10.10.10.32.4096: [udp sum ok] UDP, length 16
0x0000: 4500 002c 0000 0000 4011 527e 0a0a 0a10 E..,....@.R~....
0x0010: 0a0a 0a20 1000 1000 0018 e1c4 be9b e955 ...............U
0x0020: 0000 0005 61cd 2ba1 0006 a04a ....a.+....J
I have tested the above with both tagged and untagged packet format and
based on the timestamp in tcpdump found that the timing of the batch
cyclic transmission is correct.
Appreciate if community can give the patch series v2 a try and point out
any gap.
Thanks
Boon Leong
[1] https://patchwork.kernel.org/project/netdevbpf/cover/20211124091821.3916046-1-boon.leong.ong@intel.com/
[2] https://github.com/netoptimizer/network-testing/blob/master/src/udp_pacer.c
[3] https://patchwork.kernel.org/project/netdevbpf/patch/20211124091821.3916046-4-boon.leong.ong@intel.com/
[4] https://patchwork.kernel.org/project/netdevbpf/patch/20211124091821.3916046-5-boon.leong.ong@intel.com/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It may be useful to add timestamp for Tx packets for continuous or cyclic
transmit operation. The timestamp and sequence ID of a Tx packet are
stored according to pktgen header format. To enable per-packet timestamp,
use -y|--tstamp option. If timestamp is off, pktgen header is not
included in the UDP payload. This means receiving side can use the magic
number for pktgen for differentiation.
The implementation supports both VLAN tagged and untagged option. By
default, the minimum packet size is set at 64B. However, if VLAN tagged
is on (-V), the minimum packet size is increased to 66B just so to fit
the pktgen_hdr size.
Added hex_dump() into the code path just for future cross-checking.
As before, simply change to "#define DEBUG_HEXDUMP 1" to inspect the
accuracy of TX packet.
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211230035447.523177-8-boon.leong.ong@intel.com
When user sets tx-pkt-count and in case where there are invalid Tx frame,
the complete_tx_only_all() process polls indefinitely. So, this patch
adds a time-out mechanism into the process so that the application
can terminate automatically after it retries 3*polling interval duration.
v1->v2:
Thanks to Jesper's and Song Liu's suggestion.
- clean-up git message to remove polling log
- make the Tx time-out retries configurable with 1s granularity
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211230035447.523177-7-boon.leong.ong@intel.com
By default, TX schedule policy is SCHED_OTHER (round-robin time-sharing).
To improve TX cyclic scheduling, we add SCHED_FIFO policy and its priority
by using -W FIFO or --policy=FIFO and -U <PRIO> or --schpri=<PRIO>.
A) From xdpsock --app-stats, for SCHED_OTHER policy:
$ xdpsock -i eth0 -t -N -z -T 1000 -b 16 -C 100000 -a
period min ave max cycle
Cyclic TX 1000000 53507 75334 712642 6250
B) For SCHED_FIFO policy and schpri=50:
$ xdpsock -i eth0 -t -N -z -T 1000 -b 16 -C 100000 -a -W FIFO -U 50
period min ave max cycle
Cyclic TX 1000000 3699 24859 54397 6250
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211230035447.523177-6-boon.leong.ong@intel.com
Tx cycle time is in micro-seconds unit. By combining the batch size (-b M)
and Tx cycle time (-T|--tx-cycle N), xdpsock now can transmit batch-size of
packets every N-us periodically. Cyclic TX operation is not applicable if
--poll mode is used.
To transmit 16 packets every 1ms cycle time for total of 100000 packets
silently:
$ xdpsock -i eth0 -T -N -z -T 1000 -b 16 -C 100000
To print cyclic TX schedule variance stats, use --app-stats|-a:
$ xdpsock -i eth0 -T -N -z -T 1000 -b 16 -C 100000 -a
sock0@eth0:0 txonly xdp-drv
pps pkts 0.00
rx 0 0
tx 0 100000
calls/s count
rx empty polls 0 0
fill fail polls 0 0
copy tx sendtos 0 0
tx wakeup sendtos 0 6254
opt polls 0 0
period min ave max cycle
Cyclic TX 1000000 53507 75334 712642 6250
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211230035447.523177-5-boon.leong.ong@intel.com