android_kernel_xiaomi_sm8450

Author	SHA1	Message	Date
Greg Kroah-Hartman	cf3a19d56e	Merge 5.10.201 into android12-5.10-lts Changes in 5.10.201 iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user() sched/uclamp: Ignore (util == 0) optimization in feec() when p_util_max = 0 vfs: fix readahead(2) on block devices x86/srso: Fix SBPB enablement for (possible) future fixed HW futex: Don't include process MM in futex key on no-MMU x86/boot: Fix incorrect startup_gdt_descr.size pstore/platform: Add check for kstrdup genirq/matrix: Exclude managed interrupts in irq_matrix_allocated() i40e: fix potential memory leaks in i40e_remove() udp: add missing WRITE_ONCE() around up->encap_rcv tcp: call tcp_try_undo_recovery when an RTOd TFO SYNACK is ACKed overflow: Implement size_t saturating arithmetic helpers gve: Use size_add() in call to struct_size() mlxsw: Use size_mul() in call to struct_size() tipc: Use size_add() in calls to struct_size() net: spider_net: Use size_add() in call to struct_size() wifi: rtw88: debug: Fix the NULL vs IS_ERR() bug for debugfs_create_file() wifi: mt76: mt7603: rework/fix rx pse hang check tcp_metrics: add missing barriers on delete tcp_metrics: properly set tp->snd_ssthresh in tcp_init_metrics() tcp_metrics: do not create an entry from tcp_init_metrics() wifi: rtlwifi: fix EDCA limit set by BT coexistence can: dev: can_restart(): don't crash kernel if carrier is OK can: dev: can_restart(): fix race condition between controller restart and netif_carrier_on() PM / devfreq: rockchip-dfi: Make pmu regmap mandatory thermal: core: prevent potential string overflow r8169: use tp_to_dev instead of open code r8169: fix rare issue with broken rx after link-down on RTL8125 chtls: fix tp->rcv_tstamp initialization tcp: fix cookie_init_timestamp() overflows ACPI: sysfs: Fix create_pnp_modalias() and create_of_modalias() ipv6: avoid atomic fragment on GSO packets net: add DEV_STATS_READ() helper ipvlan: properly track tx_errors regmap: debugfs: Fix a erroneous check after snprintf() clk: qcom: clk-rcg2: Fix clock rate overflow for high parent frequencies clk: qcom: mmcc-msm8998: Add hardware clockgating registers to some clks clk: qcom: mmcc-msm8998: Don't check halt bit on some branch clks clk: qcom: mmcc-msm8998: Set bimc_smmu_gdsc always on clk: qcom: mmcc-msm8998: Fix the SMMU GDSC clk: qcom: gcc-sm8150: use ARRAY_SIZE instead of specifying num_parents clk: qcom: gcc-sm8150: Fix gcc_sdcc2_apps_clk_src clk: imx: Select MXC_CLK for CLK_IMX8QXP clk: imx: imx8mq: correct error handling path clk: asm9260: use parent index to link the reference clock clk: linux/clk-provider.h: fix kernel-doc warnings and typos spi: nxp-fspi: use the correct ioremap function clk: keystone: pll: fix a couple NULL vs IS_ERR() checks clk: ti: Add ti_dt_clk_name() helper to use clock-output-names clk: ti: Update pll and clockdomain clocks to use ti_dt_clk_name() clk: ti: Update component clocks to use ti_dt_clk_name() clk: ti: change ti_clk_register[_omap_hw]() API clk: ti: fix double free in of_ti_divider_clk_setup() clk: npcm7xx: Fix incorrect kfree clk: mediatek: clk-mt6765: Add check for mtk_alloc_clk_data clk: mediatek: clk-mt6779: Add check for mtk_alloc_clk_data clk: mediatek: clk-mt6797: Add check for mtk_alloc_clk_data clk: mediatek: clk-mt7629-eth: Add check for mtk_alloc_clk_data clk: mediatek: clk-mt7629: Add check for mtk_alloc_clk_data clk: mediatek: clk-mt2701: Add check for mtk_alloc_clk_data clk: qcom: config IPQ_APSS_6018 should depend on QCOM_SMEM platform/x86: wmi: Fix probe failure when failing to register WMI devices platform/x86: wmi: remove unnecessary initializations platform/x86: wmi: Fix opening of char device hwmon: (axi-fan-control) Support temperature vs pwm points hwmon: (axi-fan-control) Fix possible NULL pointer dereference hwmon: (coretemp) Fix potentially truncated sysfs attribute name drm/rockchip: vop: Fix reset of state in duplicate state crtc funcs drm/rockchip: vop: Fix call to crtc reset helper drm/radeon: possible buffer overflow drm/bridge: tc358768: Fix use of uninitialized variable drm/bridge: tc358768: Disable non-continuous clock mode drm/bridge: tc358768: Fix bit updates drm/mediatek: Fix iommu fault during crtc enabling drm/rockchip: cdn-dp: Fix some error handling paths in cdn_dp_probe() arm64/arm: xen: enlighten: Fix KPTI checks drm/rockchip: Fix type promotion bug in rockchip_gem_iommu_map() xen-pciback: Consider INTx disabled when MSI/MSI-X is enabled arm64: dts: qcom: msm8916: Fix iommu local address range arm64: dts: qcom: sdm845-mtp: fix WiFi configuration ARM: dts: qcom: mdm9615: populate vsdcc fixed regulator soc: qcom: llcc: Handle a second device without data corruption firmware: ti_sci: Mark driver as non removable clk: scmi: Free scmi_clk allocated when the clocks with invalid info are skipped selftests/pidfd: Fix ksft print formats selftests/resctrl: Ensure the benchmark commands fits to its array crypto: hisilicon/hpre - Fix a erroneous check after snprintf() hwrng: geode - fix accessing registers libnvdimm/of_pmem: Use devm_kstrdup instead of kstrdup and check its return value nd_btt: Make BTT lanes preemptible crypto: caam/qi2 - fix Chacha20 + Poly1305 self test failure crypto: caam/jr - fix Chacha20 + Poly1305 self test failure crypto: qat - mask device capabilities with soft straps crypto: qat - increase size of buffers hid: cp2112: Fix duplicate workqueue initialization ARM: 9321/1: memset: cast the constant byte to unsigned char ext4: move 'ix' sanity check to corrent position ASoC: fsl: mpc5200_dma.c: Fix warning of Function parameter or member not described IB/mlx5: Fix rdma counter binding for RAW QP RDMA/hns: Fix uninitialized ucmd in hns_roce_create_qp_common() RDMA/hns: Fix signed-unsigned mixed comparisons ASoC: fsl: Fix PM disable depth imbalance in fsl_easrc_probe scsi: ufs: core: Leave space for '\0' in utf8 desc string RDMA/hfi1: Workaround truncation compilation error hid: cp2112: Fix IRQ shutdown stopping polling for all IRQs on chip sh: bios: Revive earlyprintk support Revert "HID: logitech-hidpp: add a module parameter to keep firmware gestures" HID: logitech-hidpp: Remove HIDPP_QUIRK_NO_HIDINPUT quirk HID: logitech-hidpp: Don't restart IO, instead defer hid_connect() only HID: logitech-hidpp: Revert "Don't restart communication if not necessary" HID: logitech-hidpp: Move get_wireless_feature_index() check to hidpp_connect_event() ASoC: Intel: Skylake: Fix mem leak when parsing UUIDs fails padata: Convert from atomic_t to refcount_t on parallel_data->refcnt padata: Fix refcnt handling in padata_free_shell() ASoC: ams-delta.c: use component after check mfd: core: Un-constify mfd_cell.of_reg mfd: core: Ensure disabled devices are skipped without aborting mfd: dln2: Fix double put in dln2_probe leds: pwm: Don't disable the PWM when the LED should be off leds: trigger: ledtrig-cpu:: Fix 'output may be truncated' issue for 'cpu' tty: tty_jobctrl: fix pid memleak in disassociate_ctty() livepatch: Fix missing newline character in klp_resolve_symbols() usb: dwc2: fix possible NULL pointer dereference caused by driver concurrency dmaengine: ti: edma: handle irq_of_parse_and_map() errors misc: st_core: Do not call kfree_skb() under spin_lock_irqsave() tools: iio: privatize globals and functions in iio_generic_buffer.c file tools: iio: iio_generic_buffer: Fix some integer type and calculation tools: iio: iio_generic_buffer ensure alignment USB: usbip: fix stub_dev hub disconnect dmaengine: pxa_dma: Remove an erroneous BUG_ON() in pxad_free_desc() f2fs: fix to initialize map.m_pblk in f2fs_precache_extents() interconnect: qcom: sc7180: Retire DEFINE_QBCM interconnect: qcom: sc7180: Set ACV enable_mask modpost: fix tee MODULE_DEVICE_TABLE built on big-endian host powerpc/40x: Remove stale PTE_ATOMIC_UPDATES macro powerpc/xive: Fix endian conversion size powerpc/imc-pmu: Use the correct spinlock initializer. powerpc/pseries: fix potential memory leak in init_cpu_associativity() xhci: Loosen RPM as default policy to cover for AMD xHC 1.1 usb: host: xhci-plat: fix possible kernel oops while resuming perf machine: Avoid out of bounds LBR memory read perf hist: Add missing puts to hist__account_cycles i3c: Fix potential refcount leak in i3c_master_register_new_i3c_devs rtc: pcf85363: fix wrong mask/val parameters in regmap_update_bits call pcmcia: cs: fix possible hung task and memory leak pccardd() pcmcia: ds: fix refcount leak in pcmcia_device_add() pcmcia: ds: fix possible name leak in error path in pcmcia_device_add() media: i2c: max9286: Fix some redundant of_node_put() calls media: bttv: fix use after free error due to btv->timeout timer media: s3c-camif: Avoid inappropriate kfree() media: vidtv: psi: Add check for kstrdup media: vidtv: mux: Add check and kfree for kstrdup media: cedrus: Fix clock/reset sequence media: dvb-usb-v2: af9035: fix missing unlock regmap: prevent noinc writes from clobbering cache pwm: sti: Avoid conditional gotos pwm: sti: Reduce number of allocations and drop usage of chip_data pwm: brcmstb: Utilize appropriate clock APIs in suspend/resume Input: synaptics-rmi4 - fix use after free in rmi_unregister_function() llc: verify mac len before reading mac header hsr: Prevent use after free in prp_create_tagged_frame() tipc: Change nla_policy for bearer-related names to NLA_NUL_STRING inet: shrink struct flowi_common dccp: Call security_inet_conn_request() after setting IPv4 addresses. dccp/tcp: Call security_inet_conn_request() after setting IPv6 addresses. net: r8169: Disable multicast filter for RTL8168H and RTL8107E Fix termination state for idr_for_each_entry_ul() net: stmmac: xgmac: Enable support for multiple Flexible PPS outputs net/smc: fix dangling sock under state SMC_APPFINCLOSEWAIT net/smc: allow cdc msg send rather than drop it with NULL sndbuf_desc net/smc: put sk reference if close work was canceled tg3: power down device only on SYSTEM_POWER_OFF r8169: respect userspace disabling IFF_MULTICAST netfilter: xt_recent: fix (increase) ipv6 literal buffer length netfilter: nft_redir: use `struct nf_nat_range2` throughout and deduplicate eval call-backs netfilter: nat: fix ipv6 nat redirect with mapped and scoped addresses x86: Share definition of __is_canonical_address() x86/sev-es: Allow copy_from_kernel_nofault() in earlier boot drm/syncobj: fix DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE spi: spi-zynq-qspi: add spi-mem to driver kconfig dependencies fbdev: imsttfb: Fix error path of imsttfb_probe() fbdev: imsttfb: fix a resource leak in probe fbdev: fsl-diu-fb: mark wr_reg_wa() static tracing/kprobes: Fix the order of argument descriptions Revert "mmc: core: Capture correct oemid-bits for eMMC cards" btrfs: use u64 for buffer sizes in the tree search ioctls Linux 5.10.201 Change-Id: I0ce874e25eb6aeebf5826d6ef843fdbbf55d7c7d Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2023-11-29 14:46:51 +00:00
Gustavo A. R. Silva	9b8486fdad	tipc: Use size_add() in calls to struct_size() [ Upstream commit 2506a91734754de690869824fb0d1ac592ec1266 ] If, for any reason, the open-coded arithmetic causes a wraparound, the protection that `struct_size()` adds against potential integer overflows is defeated. Fix this by hardening call to `struct_size()` with `size_add()`. Fixes: `e034c6d23b` ("tipc: Use struct_size() helper") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-11-20 11:06:44 +01:00
Greg Kroah-Hartman	7ae5626406	Revert "tipc: do not update mtu if msg_max is too small in mtu negotiation" This reverts commit `2bd4ff4ffb` which is commit 56077b56cd3fb78e1c8619e29581ba25a5c55e86 upstream. It breaks the Android kernel ABI and is not needed for Android devices, so it is safe to revert for now. If it is determined that it is needed in the future, it can be brought back in an abi-preserving way. Bug: 161946584 Change-Id: I5fc048bf94f78a38691e3b27cf225536200bcd49 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2023-06-27 12:12:09 +00:00
Xin Long	2bd4ff4ffb	tipc: do not update mtu if msg_max is too small in mtu negotiation [ Upstream commit 56077b56cd3fb78e1c8619e29581ba25a5c55e86 ] When doing link mtu negotiation, a malicious peer may send Activate msg with a very small mtu, e.g. 4 in Shuang's testing, without checking for the minimum mtu, l->mtu will be set to 4 in tipc_link_proto_rcv(), then n->links[bearer_id].mtu is set to 4294967228, which is a overflow of '4 - INT_H_SIZE - EMSG_OVERHEAD' in tipc_link_mss(). With tipc_link.mtu = 4, tipc_link_xmit() kept printing the warning: tipc: Too large msg, purging xmit list 1 5 0 40 4! tipc: Too large msg, purging xmit list 1 15 0 60 4! And with tipc_link_entry.mtu 4294967228, a huge skb was allocated in named_distribute(), and when purging it in tipc_link_xmit(), a crash was even caused: general protection fault, probably for non-canonical address 0x2100001011000dd: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 6.3.0.neta #19 RIP: 0010:kfree_skb_list_reason+0x7e/0x1f0 Call Trace: <IRQ> skb_release_data+0xf9/0x1d0 kfree_skb_reason+0x40/0x100 tipc_link_xmit+0x57a/0x740 [tipc] tipc_node_xmit+0x16c/0x5c0 [tipc] tipc_named_node_up+0x27f/0x2c0 [tipc] tipc_node_write_unlock+0x149/0x170 [tipc] tipc_rcv+0x608/0x740 [tipc] tipc_udp_recv+0xdc/0x1f0 [tipc] udp_queue_rcv_one_skb+0x33e/0x620 udp_unicast_rcv_skb.isra.72+0x75/0x90 __udp4_lib_rcv+0x56d/0xc20 ip_protocol_deliver_rcu+0x100/0x2d0 This patch fixes it by checking the new mtu against tipc_bearer_min_mtu(), and not updating mtu if it is too small. Fixes: `ed193ece26` ("tipc: simplify link mtu negotiation") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2023-05-30 12:57:52 +01:00
YueHaibing	36e248269a	tipc: Fix potential OOB in tipc_link_proto_rcv() [ Upstream commit 743117a997bbd4840e827295c07e59bcd7f7caa3 ] Fix the potential risk of OOB if skb_linearize() fails in tipc_link_proto_rcv(). Fixes: `5cbb28a4bf` ("tipc: linearize arriving NAME_DISTR and LINK_PROTO buffers") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20221203094635.29024-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-12-14 11:32:03 +01:00
Xin Long	f299d3fbe4	tipc: simplify the finalize work queue [ Upstream commit be07f056396d6bb40963c45a02951c566ddeef8e ] This patch is to use "struct work_struct" for the finalize work queue instead of "struct tipc_net_work", as it can get the "net" and "addr" from tipc_net's other members and there is no need to add extra net and addr in tipc_net by defining "struct tipc_net_work". Note that it's safe to get net from tn->bcl as bcl is always released after the finalize work queue is done. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-06-29 08:59:47 +02:00
Tung Nguyen	5e42f90d72	tipc: fix incorrect order of state message data sanity check [ Upstream commit c79fcc27be90b308b3fa90811aefafdd4078668c ] When receiving a state message, function tipc_link_validate_msg() is called to validate its header portion. Then, its data portion is validated before it can be accessed correctly. However, current data sanity check is done after the message header is accessed to update some link variables. This commit fixes this issue by moving the data sanity check to the beginning of state message handling and right after the header sanity check. Fixes: 9aa422ad3266 ("tipc: improve size validations for received domain records") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Link: https://lore.kernel.org/r/20220308021200.9245-1-tung.q.nguyen@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2022-03-16 14:15:58 +01:00
Jon Maloy	3c7e594355	tipc: improve size validations for received domain records commit 9aa422ad326634b76309e8ff342c246800621216 upstream. The function tipc_mon_rcv() allows a node to receive and process domain_record structs from peer nodes to track their views of the network topology. This patch verifies that the number of members in a received domain record does not exceed the limit defined by MAX_MON_DOMAIN, something that may otherwise lead to a stack overflow. tipc_mon_rcv() is called from the function tipc_link_proto_rcv(), where we are reading a 32 bit message data length field into a uint16. To avert any risk of bit overflow, we add an extra sanity check for this in that function. We cannot see that happen with the current code, but future designers being unaware of this risk, may introduce it by allowing delivery of very large (> 64k) sk buffers from the bearer layer. This potential problem was identified by Eric Dumazet. This fixes CVE-2022-0435 Reported-by: Samuel Page <samuel.page@appgate.com> Reported-by: Eric Dumazet <edumazet@google.com> Fixes: `35c55c9877` ("tipc: add neighbor monitoring framework") Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Samuel Page <samuel.page@appgate.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2022-02-11 09:09:03 +01:00
Xin Long	9c3c2ef6ca	tipc: only accept encrypted MSG_CRYPTO msgs [ Upstream commit 271351d255b09e39c7f6437738cba595f9b235be ] The MSG_CRYPTO msgs are always encrypted and sent to other nodes for keys' deployment. But when receiving in peers, if those nodes do not validate it and make sure it's encrypted, one could craft a malicious MSG_CRYPTO msg to deploy its key with no need to know other nodes' keys. This patch is to do that by checking TIPC_SKB_CB(skb)->decrypted and discard it if this packet never got decrypted. Note that this is also a supplementary fix to CVE-2021-43267 that can be triggered by an unencrypted malicious MSG_CRYPTO msg. Fixes: `1ef6f7c939` ("tipc: add automatic session key exchange") Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-11-26 10:39:14 +01:00
Hoang Le	60b8b4e631	tipc: fix NULL deref in tipc_link_xmit() [ Upstream commit b77413446408fdd256599daf00d5be72b5f3e7c6 ] The buffer list can have zero skb as following path: tipc_named_node_up()->tipc_node_xmit()->tipc_link_xmit(), so we need to check the list before casting an &sk_buff. Fault report: [] tipc: Bulk publication failure [] general protection fault, probably for non-canonical [#1] PREEMPT [...] [] KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf] [] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.10.0-rc4+ #2 [] Hardware name: Bochs ..., BIOS Bochs 01/01/2011 [] RIP: 0010:tipc_link_xmit+0xc1/0x2180 [] Code: 24 b8 00 00 00 00 4d 39 ec 4c 0f 44 e8 e8 d7 0a 10 f9 48 [...] [] RSP: 0018:ffffc90000006ea0 EFLAGS: 00010202 [] RAX: dffffc0000000000 RBX: ffff8880224da000 RCX: 1ffff11003d3cc0d [] RDX: 0000000000000019 RSI: ffffffff886007b9 RDI: 00000000000000c8 [] RBP: ffffc90000007018 R08: 0000000000000001 R09: fffff52000000ded [] R10: 0000000000000003 R11: fffff52000000dec R12: ffffc90000007148 [] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc90000007018 [] FS: 0000000000000000(0000) GS:ffff888037400000(0000) knlGS:000[...] [] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [] CR2: 00007fffd2db5000 CR3: 000000002b08f000 CR4: 00000000000006f0 Fixes: `af9b028e27` ("tipc: make media xmit call outside node spinlock context") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Link: https://lore.kernel.org/r/20210108071337.3598-1-hoang.h.le@dektech.com.au Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-01-23 16:04:00 +01:00
David S. Miller	3ab0a7a0c3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing it's initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-22 16:45:34 -07:00
Tuong Lien	1ef6f7c939	tipc: add automatic session key exchange With support from the master key option in the previous commit, it becomes easy to make frequent updates/exchanges of session keys between authenticated cluster nodes. Basically, there are two situations where the key exchange will take in place: - When a new node joins the cluster (with the master key), it will need to get its peer's TX key, so that be able to decrypt further messages from that peer. - When a new session key is generated (by either user manual setting or later automatic rekeying feature), the key will be distributed to all peer nodes in the cluster. A key to be exchanged is encapsulated in the data part of a 'MSG_CRYPTO /KEY_DISTR_MSG' TIPC v2 message, then xmit-ed as usual and encrypted by using the master key before sending out. Upon receipt of the message it will be decrypted in the same way as regular messages, then attached as the sender's RX key in the receiver node. In this way, the key exchange is reliable by the link layer, as well as security, integrity and authenticity by the crypto layer. Also, the forward security will be easily achieved by user changing the master key actively but this should not be required very frequently. The key exchange feature is independent on the presence of a master key Note however that the master key still is needed for new nodes to be able to join the cluster. It is also optional, and can be turned off/on via the sysfs: 'net/tipc/key_exchange_enabled' [default 1: enabled]. Backward compatibility is guaranteed because for nodes that do not have master key support, key exchange using master key ie. tx_key = 0 if any will be shortly discarded at the message validation step. In other words, the key exchange feature will be automatically disabled to those nodes. v2: fix the "implicit declaration of function 'tipc_crypto_key_flush'" error in node.c. The function only exists when built with the TIPC "CONFIG_TIPC_CRYPTO" option. v3: use 'info->extack' for a message emitted due to netlink operations instead (- David's comment). Reported-by: kernel test robot <lkp@intel.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-18 13:58:37 -07:00
Lu Wei	2e5117ba9f	net: tipc: kerneldoc fixes Fix parameter description of tipc_link_bc_create() Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: `16ad3f4022` ("tipc: introduce variable window congestion control") Signed-off-by: Lu Wei <luwei32@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-09-15 13:33:04 -07:00
YueHaibing	622a63f6f3	tipc: Remove unused macro TIPC_NACK_INTV There is no caller in tree any more. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-08-31 12:39:07 -07:00
Gustavo A. R. Silva	df561f6688	treewide: Use fallthrough pseudo-keyword Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>	2020-08-23 17:36:59 -05:00
Miaohe Lin	7f8901b74b	net: tipc: Convert to use the preferred fallthrough macro Convert the uses of fallthrough comments to fallthrough macro. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-08-18 12:46:48 -07:00
David S. Miller	a57066b1a0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net The UDP reuseport conflict was a little bit tricky. The net-next code, via bpf-next, extracted the reuseport handling into a helper so that the BPF sk lookup code could invoke it. At the same time, the logic for reuseport handling of unconnected sockets changed via commit `efc6b6f6c3` which changed the logic to carry on the reuseport result into the rest of the lookup loop if we do not return immediately. This requires moving the reuseport_has_conns() logic into the callers. While we are here, get rid of inline directives as they do not belong in foo.c files. The other changes were cases of more straightforward overlapping modifications. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-25 17:49:04 -07:00
Tung Nguyen	6ef9dcb780	tipc: allow to build NACK message in link timeout function Commit `02288248b0` ("tipc: eliminate gap indicator from ACK messages") eliminated sending of the 'gap' indicator in regular ACK messages and only allowed to build NACK message with enabled probe/probe_reply. However, necessary correction for building NACK message was missed in tipc_link_timeout() function. This leads to significant delay and link reset (due to retransmission failure) in lossy environment. This commit fixes it by setting the 'probe' flag to 'true' when the receive deferred queue is not empty. As a result, NACK message will be built to send back to another peer. Fixes: `02288248b0` ("tipc: eliminate gap indicator from ACK messages") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-20 20:11:22 -07:00
Andrew Lunn	d8141208b0	net: tipc: kerneldoc fixes Simple fixes which require no deep knowledge of the code. Cc: Jon Maloy <jmaloy@redhat.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-13 17:20:40 -07:00
David S. Miller	71930d6102	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net All conflicts seemed rather trivial, with some guidance from Saeed Mameed on the tc_ct.c one. Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-11 00:46:00 -07:00
Hamish Martin	a34f829164	tipc: fix retransmission on unicast links A scenario has been observed where a 'bc_init' message for a link is not retransmitted if it fails to be received by the peer. This leads to the peer never establishing the link fully and it discarding all other data received on the link. In this scenario the message is lost in transit to the peer. The issue is traced to the 'nxt_retr' field of the skb not being initialised for links that aren't a bc_sndlink. This leads to the comparison in tipc_link_advance_transmq() that gates whether to attempt retransmission of a message performing in an undesirable way. Depending on the relative value of 'jiffies', this comparison: time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr) may return true or false given that 'nxt_retr' remains at the uninitialised value of 0 for non bc_sndlinks. This is most noticeable shortly after boot when jiffies is initialised to a high value (to flush out rollover bugs) and we compare a jiffies of, say, 4294940189 to zero. In that case time_before returns 'true' leading to the skb not being retransmitted. The fix is to ensure that all skbs have a valid 'nxt_retr' time set for them and this is achieved by refactoring the setting of this value into a central function. With this fix, transmission losses of 'bc_init' messages do not stall the link establishment forever because the 'bc_init' message is retransmitted and the link eventually establishes correctly. Fixes: `382f598fb6` ("tipc: reduce duplicate packets for unicast traffic") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-07-08 15:39:50 -07:00
Gustavo A. R. Silva	e034c6d23b	tipc: Use struct_size() helper Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-06-19 20:15:25 -07:00
Hoang Huu Le	cad2929dc4	tipc: update a binding service via broadcast Currently, updating binding table (add service binding to name table/withdraw a service binding) is being sent over replicast. However, if we are scaling up clusters to > 100 nodes/containers this method is less affection because of looping through nodes in a cluster one by one. It is worth to use broadcast to update a binding service. This way, the binding table can be updated on all peer nodes in one shot. Broadcast is used when all peer nodes, as indicated by a new capability flag TIPC_NAMED_BCAST, support reception of this message type. Four problems need to be considered when introducing this feature. 1) When establishing a link to a new peer node we still update this by a unicast 'bulk' update. This may lead to race conditions, where a later broadcast publication/withdrawal bypass the 'bulk', resulting in disordered publications, or even that a withdrawal may arrive before the corresponding publication. We solve this by adding an 'is_last_bulk' bit in the last bulk messages so that it can be distinguished from all other messages. Only when this message has arrived do we open up for reception of broadcast publications/withdrawals. 2) When a first legacy node is added to the cluster all distribution will switch over to use the legacy 'replicast' method, while the opposite happens when the last legacy node leaves the cluster. This entails another risk of message disordering that has to be handled. We solve this by adding a sequence number to the broadcast/replicast messages, so that disordering can be discovered and corrected. Note however that we don't need to consider potential message loss or duplication at this protocol level. 3) Bulk messages don't contain any sequence numbers, and will always arrive in order. Hence we must exempt those from the sequence number control and deliver them unconditionally. We solve this by adding a new 'is_bulk' bit in those messages so that they can be recognized. 4) Legacy messages, which don't contain any new bits or sequence numbers, but neither can arrive out of order, also need to be exempt from the initial synchronization and sequence number check, and delivered unconditionally. Therefore, we add another 'is_not_legacy' bit to all new messages so that those can be distinguished from legacy messages and the latter delivered directly. v1->v2: - fix warning issue reported by kbuild test robot <lkp@intel.com> - add santiy check to drop the publication message with a sequence number that is lower than the agreed synch point Signed-off-by: kernel test robot <lkp@intel.com> Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-06-17 08:53:34 -07:00
Tuong Lien	03b6fefd9b	tipc: add support for broadcast rcv stats dumping This commit enables dumping the statistics of a broadcast-receiver link like the traditional 'broadcast-link' one (which is for broadcast- sender). The link dumping can be triggered via netlink (e.g. the iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the indicator. The name of a broadcast-receiver link of a specific peer will be in the format: 'broadcast-link:<peer-id>'. For example: Link <broadcast-link:1001002> Window:50 packets RX packets:7841 fragments:2408/440 bundles:0/0 TX packets:0 fragments:0/0 bundles:0/0 RX naks:0 defs:124 dups:0 TX naks:21 acks:0 retrans:0 Congestion link:0 Send queue max:0 avg:0 In addition, the broadcast-receiver link statistics can be reset in the usual way via netlink by specifying that link name in command. Note: the 'tipc_link_name_ext()' is removed because the link name can now be retrieved simply via the 'l->name'. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-26 15:16:52 -07:00
Tuong Lien	a91d55d162	tipc: enable broadcast retrans via unicast In some environment, broadcast traffic is suppressed at high rate (i.e. a kind of bandwidth limit setting). When it is applied, TIPC broadcast can still run successfully. However, when it comes to a high load, some packets will be dropped first and TIPC tries to retransmit them but the packet retransmission is intentionally broadcast too, so making things worse and not helpful at all. This commit enables the broadcast retransmission via unicast which only retransmits packets to the specific peer that has really reported a gap i.e. not broadcasting to all nodes in the cluster, so will prevent from being suppressed, and also reduce some overheads on the other peers due to duplicates, finally improve the overall TIPC broadcast performance. Note: the functionality can be turned on/off via the sysctl file: echo 1 > /proc/sys/net/tipc/bc_retruni echo 0 > /proc/sys/net/tipc/bc_retruni Default is '0', i.e. the broadcast retransmission still works as usual. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-26 15:16:52 -07:00
Tuong Lien	c6ed7a5cc2	tipc: add back link trace events In the previous commit ("tipc: add Gap ACK blocks support for broadcast link"), we have removed the following link trace events due to the code changes: - tipc_link_bc_ack - tipc_link_retrans This commit adds them back along with some minor changes to adapt to the new code. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-26 15:16:52 -07:00
Tuong Lien	d7626b5acf	tipc: introduce Gap ACK blocks for broadcast link As achieved through commit `9195948fbf` ("tipc: improve TIPC throughput by Gap ACK blocks"), we apply the same mechanism for the broadcast link as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will consist of two parts built for both the broadcast and unicast types: 31 16 15 0 +-------------+-------------+-------------+-------------+ \| bgack_cnt \| ugack_cnt \| len \| +-------------+-------------+-------------+-------------+ - \| gap \| ack \| \| +-------------+-------------+-------------+-------------+ > bc gacks : : : \| +-------------+-------------+-------------+-------------+ - \| gap \| ack \| \| +-------------+-------------+-------------+-------------+ > uc gacks : : : \| +-------------+-------------+-------------+-------------+ - which is "automatically" backward-compatible. We also increase the max number of Gap ACK blocks to 128, allowing upto 64 blocks per type (total buffer size = 516 bytes). Besides, the 'tipc_link_advance_transmq()' function is refactored which is applicable for both the unicast and broadcast cases now, so some old functions can be removed and the code is optimized. With the patch, TIPC broadcast is more robust regardless of packet loss or disorder, latency, ... in the underlying network. Its performance is boost up significantly. For example, experiment with a 5% packet loss rate results: $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000 real 0m 42.46s user 0m 1.16s sys 0m 17.67s Without the patch: $ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000 real 8m 27.94s user 0m 0.55s sys 0m 2.38s Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-05-26 15:16:52 -07:00
Tuong Lien	edadedf1c5	tipc: fix incorrect increasing of link window In commit `16ad3f4022` ("tipc: introduce variable window congestion control"), we allow link window to change with the congestion avoidance algorithm. However, there is a bug that during the slow-start if packet retransmission occurs, the link will enter the fast-recovery phase, set its window to the 'ssthresh' which is never less than 300, so the link window suddenly increases to that limit instead of decreasing. Consequently, two issues have been observed: - For broadcast-link: it can leave a gap between the link queues that a new packet will be inserted and sent before the previous ones, i.e. not in-order. - For unicast: the algorithm does not work as expected, the link window jumps to the slow-start threshold whereas packet retransmission occurs. This commit fixes the issues by avoiding such the link window increase, but still decreasing if the 'ssthresh' is lowered. Fixes: `16ad3f4022` ("tipc: introduce variable window congestion control") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-04-15 16:23:33 -07:00
Jon Maloy	b7ffa045e7	tipc: don't send gap blocks in ACK messages In the commit referred to below we eliminated sending of the 'gap' indicator in regular ACK messages, reserving this to explicit NACK ditto. Unfortunately we missed to also eliminate building of the 'gap block' area in ACK messages. This area is meant to report gaps in the received packet sequence following the initial gap, so that lost packets can be retransmitted earlier and received out-of-sequence packets can be released earlier. However, the interpretation of those blocks is dependent on a complete and correct sequence of gaps and acks. Hence, when the initial gap indicator is missing a single gap block will be interpreted as an acknowledgment of all preceding packets. This may lead to packets being released prematurely from the sender's transmit queue, with easily predicatble consequences. We now fix this by not building any gap block area if there is no initial gap to report. Fixes: commit `02288248b0` ("tipc: eliminate gap indicator from ACK messages") Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-17 14:16:56 -08:00
Jon Maloy	16ad3f4022	tipc: introduce variable window congestion control We introduce a simple variable window congestion control for links. The algorithm is inspired by the Reno algorithm, covering both 'slow start', 'congestion avoidance', and 'fast recovery' modes. - We introduce hard lower and upper window limits per link, still different and configurable per bearer type. - We introduce a 'slow start theshold' variable, initially set to the maximum window size. - We let a link start at the minimum congestion window, i.e. in slow start mode, and then let is grow rapidly (+1 per rceived ACK) until it reaches the slow start threshold and enters congestion avoidance mode. - In congestion avoidance mode we increment the congestion window for each window-size number of acked packets, up to a possible maximum equal to the configured maximum window. - For each non-duplicate NACK received, we drop back to fast recovery mode, by setting the both the slow start threshold to and the congestion window to (current_congestion_window / 2). - If the timeout handler finds that the transmit queue has not moved since the previous timeout, it drops the link back to slow start and forces a probe containing the last sent sequence number to the sent to the peer, so that this can discover the stale situation. This change does in reality have effect only on unicast ethernet transport, as we have seen that there is no room whatsoever for increasing the window max size for the UDP bearer. For now, we also choose to keep the limits for the broadcast link unchanged and equal. This algorithm seems to give a 50-100% throughput improvement for messages larger than MTU. Suggested-by: Xin Long <lucien.xin@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:31:15 -08:00
Jon Maloy	d3b09995ab	tipc: eliminate more unnecessary nacks and retransmissions When we increase the link tranmsit window we often observe the following scenario: 1) A STATE message bypasses a sequence of traffic packets and arrives far ahead of those to the receiver. STATE messages contain a 'peers_nxt_snt' field to indicate which was the last packet sent from the peer. This mechanism is intended as a last resort for the receiver to detect missing packets, e.g., during very low traffic when there is no packet flow to help early loss detection. 3) The receiving link compares the 'peer_nxt_snt' field to its own 'rcv_nxt', finds that there is a gap, and immediately sends a NACK message back to the peer. 4) When this NACKs arrives at the sender, all the requested retransmissions are performed, since it is a first-time request. Just like in the scenario described in the previous commit this leads to many redundant retransmissions, with decreased throughput as a consequence. We fix this by adding two more conditions before we send a NACK in this sitution. First, the deferred queue must be empty, so we cannot assume that the potential packet loss has already been detected by other means. Second, we check the 'peers_snd_nxt' field only in probe/ probe_reply messages, thus turning this into a true mechanism of last resort as it was really meant to be. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:31:15 -08:00
Jon Maloy	02288248b0	tipc: eliminate gap indicator from ACK messages When we increase the link send window we sometimes observe the following scenario: 1) A packet #N arrives out of order far ahead of a sequence of older packets which are still under way. The packet is added to the deferred queue. 2) The missing packets arrive in sequence, and for each 16th of them an ACK is sent back to the receiver, as it should be. 3) When building those ACK messages, it is checked if there is a gap between the link's 'rcv_nxt' and the first packet in the deferred queue. This is always the case until packet number #N-1 arrives, and a 'gap' indicator is added, effectively turning them into NACK messages. 4) When those NACKs arrive at the sender, all the requested retransmissions are done, since it is a first-time request. This sometimes leads to a huge amount of redundant retransmissions, causing a drop in max throughput. This problem gets worse when we in a later commit introduce variable window congestion control, since it drops the link back to 'fast recovery' much more often than necessary. We now fix this by not sending any 'gap' indicator in regular ACK messages. We already have a mechanism for sending explicit NACKs in place, and this is sufficient to keep up the packet flow. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-12-10 17:31:15 -08:00
Hoang Le	ba5f6a8617	tipc: update replicast capability for broadcast send link When setting up a cluster with non-replicast/replicast capability supported. This capability will be disabled for broadcast send link in order to be backwards compatible. However, when these non-support nodes left and be removed out the cluster. We don't update this capability on broadcast send link. Then, some of features that based on this capability will also disabling as unexpected. In this commit, we make sure the broadcast send link capabilities will be re-calculated as soon as a node removed/rejoined a cluster. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-22 09:29:50 -08:00
Tuong Lien	fc1b6d6de2	tipc: introduce TIPC encryption & authentication This commit offers an option to encrypt and authenticate all messaging, including the neighbor discovery messages. The currently most advanced algorithm supported is the AEAD AES-GCM (like IPSec or TLS). All encryption/decryption is done at the bearer layer, just before leaving or after entering TIPC. Supported features: - Encryption & authentication of all TIPC messages (header + data); - Two symmetric-key modes: Cluster and Per-node; - Automatic key switching; - Key-expired revoking (sequence number wrapped); - Lock-free encryption/decryption (RCU); - Asynchronous crypto, Intel AES-NI supported; - Multiple cipher transforms; - Logs & statistics; Two key modes: - Cluster key mode: One single key is used for both TX & RX in all nodes in the cluster. - Per-node key mode: Each nodes in the cluster has one specific TX key. For RX, a node requires its peers' TX key to be able to decrypt the messages from those peers. Key setting from user-space is performed via netlink by a user program (e.g. the iproute2 'tipc' tool). Internal key state machine: Attach Align(RX) +-+ +-+ \| V \| V +---------+ Attach +---------+ \| IDLE \|---------------->\| PENDING \|(user = 0) +---------+ +---------+ A A Switch\| A \| \| \| \| \| \| Free(switch/revoked) \| \| (Free)\| +----------------------+ \| \|Timeout \| (TX) \| \| \|(RX) \| \| \| \| \| \| v \| +---------+ Switch +---------+ \| PASSIVE \|<----------------\| ACTIVE \| +---------+ (RX) +---------+ (user = 1) (user >= 1) The number of TFMs is 10 by default and can be changed via the procfs 'net/tipc/max_tfms'. At this moment, as for simplicity, this file is also used to print the crypto statistics at runtime: echo 0xfff1 > /proc/sys/net/tipc/max_tfms The patch defines a new TIPC version (v7) for the encryption message (- backward compatibility as well). The message is basically encapsulated as follows: +----------------------------------------------------------+ \| TIPCv7 encryption \| Original TIPCv2 \| Authentication \| \| header \| packet (encrypted) \| Tag \| +----------------------------------------------------------+ The throughput is about ~40% for small messages (compared with non- encryption) and ~9% for large messages. With the support from hardware crypto i.e. the Intel AES-NI CPU instructions, the throughput increases upto ~85% for small messages and ~55% for large messages. By default, the new feature is inactive (i.e. no encryption) until user sets a key for TIPC. There is however also a new option - "TIPC_CRYPTO" in the kernel configuration to enable/disable the new code when needed. MAINTAINERS \| add two new files 'crypto.h' & 'crypto.c' in tipc Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-08 14:01:59 -08:00
Tuong Lien	d0d605c5e1	tipc: eliminate the dummy packet in link synching When preparing tunnel packets for the link failover or synchronization, as for the safe algorithm, we added a dummy packet on the pair link but never sent it out. In the case of failover, the pair link will be reset anyway. But for link synching, it will always result in retransmission of the dummy packet after that. We have also observed that such the retransmission at the early stage when a new node comes in a large cluster will take some time and hard to be done, leading to the repeated retransmit failures and the link is reset. Since in commit `4929a932be` ("tipc: optimize link synching mechanism") we have already built a dummy 'TUNNEL_PROTOCOL' message on the new link for the synchronization, there's no need for the dummy on the pair one, this commit will skip it when the new mechanism takes in place. In case nothing exists in the pair link's transmq, the link synching will just start and stop shortly on the peer side. The patch is backward compatible. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Tested-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-06 21:16:02 -08:00
Hoang Le	426071f1f3	tipc: reduce sensitive to retransmit failures With huge cluster (e.g >200nodes), the amount of that flow: gap -> retransmit packet -> acked will take time in case of STATE_MSG dropped/delayed because a lot of traffic. This lead to 1.5 sec tolerance value criteria made link easy failure around 2nd, 3rd of failed retransmission attempts. Instead of re-introduced criteria of 99 faled retransmissions to fix the issue, we increase failure detection timer to ten times tolerance value. Fixes: `77cf8edbc0` ("tipc: simplify stale link failure criteria") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Acked-by: Jon Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-06 17:37:43 -08:00
Tuong Lien	06e7c70c6e	tipc: improve message bundling algorithm As mentioned in commit `e95584a889` ("tipc: fix unlimited bundling of small messages"), the current message bundling algorithm is inefficient that can generate bundles of only one payload message, that causes unnecessary overheads for both the sender and receiver. This commit re-designs the 'tipc_msg_make_bundle()' function (now named as 'tipc_msg_try_bundle()'), so that when a message comes at the first place, we will just check & keep a reference to it if the message is suitable for bundling. The message buffer will be put into the link backlog queue and processed as normal. Later on, when another one comes we will make a bundle with the first message if possible and so on... This way, a bundle if really needed will always consist of at least two payload messages. Otherwise, we let the first buffer go its way without any need of bundling, so reduce the overheads to zero. Moreover, since now we have both the messages in hand, we can even optimize the 'tipc_msg_bundle()' function, make bundle of a very large (size ~ MSS) and small messages which is not with the current algorithm e.g. [1400-byte message] + [10-byte message] (MTU = 1500). Acked-by: Ying Xue <ying.xue@windreiver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-03 17:26:15 -08:00
Geert Uytterhoeven	8ebed8ae49	tipc: Spelling s/enpoint/endpoint/ Fix misspelling of "endpoint". Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-28 13:42:17 -07:00
Tuong Lien	e95584a889	tipc: fix unlimited bundling of small messages We have identified a problem with the "oversubscription" policy in the link transmission code. When small messages are transmitted, and the sending link has reached the transmit window limit, those messages will be bundled and put into the link backlog queue. However, bundles of data messages are counted at the 'CRITICAL' level, so that the counter for that level, instead of the counter for the real, bundled message's level is the one being increased. Subsequent, to-be-bundled data messages at non-CRITICAL levels continue to be tested against the unchanged counter for their own level, while contributing to an unrestrained increase at the CRITICAL backlog level. This leaves a gap in congestion control algorithm for small messages that can result in starvation for other users or a "real" CRITICAL user. Even that eventually can lead to buffer exhaustion & link reset. We fix this by keeping a 'target_bskb' buffer pointer at each levels, then when bundling, we only bundle messages at the same importance level only. This way, we know exactly how many slots a certain level have occupied in the queue, so can manage level congestion accurately. By bundling messages at the same level, we even have more benefits. Let consider this: - One socket sends 64-byte messages at the 'CRITICAL' level; - Another sends 4096-byte messages at the 'LOW' level; When a 64-byte message comes and is bundled the first time, we put the overhead of message bundle to it (+ 40-byte header, data copy, etc.) for later use, but the next message can be a 4096-byte one that cannot be bundled to the previous one. This means the last bundle carries only one payload message which is totally inefficient, as for the receiver also! Later on, another 64-byte message comes, now we make a new bundle and the same story repeats... With the new bundling algorithm, this will not happen, the 64-byte messages will be bundled together even when the 4096-byte message(s) comes in between. However, if the 4096-byte messages are sent at the same level i.e. 'CRITICAL', the bundling algorithm will again cause the same overhead. Also, the same will happen even with only one socket sending small messages at a rate close to the link transmit's one, so that, when one message is bundled, it's transmitted shortly. Then, another message comes, a new bundle is created and so on... We will solve this issue radically by another patch. Fixes: `365ad353c2` ("tipc: reduce risk of user starvation during link congestion") Reported-by: Hoang Le <hoang.h.le@dektech.com.au> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-10-02 11:02:05 -04:00
David S. Miller	446bf64b61	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Merge conflict of mlx5 resolved using instructions in merge commit `9566e650bf`. Signed-off-by: David S. Miller <davem@davemloft.net>	2019-08-19 11:54:03 -07:00
Jon Maloy	e654f9f53b	tipc: clean up skb list lock handling on send path The policy for handling the skb list locks on the send and receive paths is simple. - On the send path we never need to grab the lock on the 'xmitq' list when the destination is an exernal node. - On the receive path we always need to grab the lock on the 'inputq' list, irrespective of source node. However, when transmitting node local messages those will eventually end up on the receive path of a local socket, meaning that the argument 'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in the function tipc_sk_rcv(). This has been handled by always initializing the spinlock of the 'xmitq' list at message creation, just in case it may end up on the receive path later, and despite knowing that the lock in most cases never will be used. This approach is inaccurate and confusing, and has also concealed the fact that the stated 'no lock grabbing' policy for the send path is violated in some cases. We now clean up this by never initializing the lock at message creation, instead doing this at the moment we find that the message actually will enter the receive path. At the same time we fix the four locations where we incorrectly access the spinlock on the send/error path. This patch also reverts commit `d12cffe932` ("tipc: ensure head->lock is initialised") which has now become redundant. CC: Eric Dumazet <edumazet@google.com> Reported-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-08-18 14:01:07 -07:00
Tuong Lien	712042313b	tipc: fix false detection of retransmit failures This commit eliminates the use of the link 'stale_limit' & 'prev_from' (besides the already removed - 'stale_cnt') variables in the detection of repeated retransmit failures as there is no proper way to initialize them to avoid a false detection, i.e. it is not really a retransmission failure but due to a garbage values in the variables. Instead, a jiffies variable will be added to individual skbs (like the way we restrict the skb retransmissions) in order to mark the first skb retransmit time. Later on, at the next retransmissions, the timestamp will be checked to see if the skb in the link transmq is "too stale", that is, the link tolerance time has passed, so that a link reset will be ordered. Note, just checking on the first skb in the queue is fine enough since it must be the oldest one. A counter is also added to keep track the actual skb retransmissions' number for later checking when the failure happens. The downside of this approach is that the skb->cb[] buffer is about to be exhausted, however it is always able to allocate another memory area and keep a reference to it when needed. Fixes: `77cf8edbc0` ("tipc: simplify stale link failure criteria") Reported-by: Hoang Le <hoang.h.le@dektech.com.au> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-08-16 16:27:13 -07:00
Jon Maloy	7c5b420559	tipc: reduce risk of wakeup queue starvation In commit `365ad353c2` ("tipc: reduce risk of user starvation during link congestion") we allowed senders to add exactly one list of extra buffers to the link backlog queues during link congestion (aka "oversubscription"). However, the criteria for when to stop adding wakeup messages to the input queue when the overload abates is inaccurate, and may cause starvation problems during very high load. Currently, we stop adding wakeup messages after 10 total failed attempts where we find that there is no space left in the backlog queue for a certain importance level. The counter for this is accumulated across all levels, which may lead the algorithm to leave the loop prematurely, although there may still be plenty of space available at some levels. The result is sometimes that messages near the wakeup queue tail are not added to the input queue as they should be. We now introduce a more exact algorithm, where we keep adding wakeup messages to a level as long as the backlog queue has free slots for the corresponding level, and stop at the moment there are no more such slots or when there are no more wakeup messages to dequeue. Fixes: `365ad35` ("tipc: reduce risk of user starvation during link congestion") Reported-by: Tung Nguyen <tung.q.nguyen@dektech.com.au> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-08-01 18:19:28 -04:00
Tuong Lien	2320bcdae6	tipc: fix changeover issues due to large packet In conjunction with changing the interfaces' MTU (e.g. especially in the case of a bonding) where the TIPC links are brought up and down in a short time, a couple of issues were detected with the current link changeover mechanism: 1) When one link is up but immediately forced down again, the failover procedure will be carried out in order to failover all the messages in the link's transmq queue onto the other working link. The link and node state is also set to FAILINGOVER as part of the process. The message will be transmited in form of a FAILOVER_MSG, so its size is plus of 40 bytes (= the message header size). There is no problem if the original message size is not larger than the link's MTU - 40, and indeed this is the max size of a normal payload messages. However, in the situation above, because the link has just been up, the messages in the link's transmq are almost SYNCH_MSGs which had been generated by the link synching procedure, then their size might reach the max value already! When the FAILOVER_MSG is built on the top of such a SYNCH_MSG, its size will exceed the link's MTU. As a result, the messages are dropped silently and the failover procedure will never end up, the link will not be able to exit the FAILINGOVER state, so cannot be re-established. 2) The same scenario above can happen more easily in case the MTU of the links is set differently or when changing. In that case, as long as a large message in the failure link's transmq queue was built and fragmented with its link's MTU > the other link's one, the issue will happen (there is no need of a link synching in advance). 3) The link synching procedure also faces with the same issue but since the link synching is only started upon receipt of a SYNCH_MSG, dropping the message will not result in a state deadlock, but it is not expected as design. The 1) & 3) issues are resolved by the last commit that only a dummy SYNCH_MSG (i.e. without data) is generated at the link synching, so the size of a FAILOVER_MSG if any then will never exceed the link's MTU. For the 2) issue, the only solution is trying to fragment the messages in the failure link's transmq queue according to the working link's MTU so they can be failovered then. A new function is made to accomplish this, it will still be a TUNNEL PROTOCOL/FAILOVER MSG but if the original message size is too large, it will be fragmented & reassembled at the receiving side. Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-07-25 15:55:47 -07:00
Tuong Lien	4929a932be	tipc: optimize link synching mechanism This commit along with the next one are to resolve the issues with the link changeover mechanism. See that commit for details. Basically, for the link synching, from now on, we will send only one single ("dummy") SYNCH message to peer. The SYNCH message does not contain any data, just a header conveying the synch point to the peer. A new node capability flag ("TIPC_TUNNEL_ENHANCED") is introduced for backward compatible! Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Suggested-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-07-25 15:55:47 -07:00
Jon Maloy	53962bcea9	tipc: embed jiffies in macro TIPC_BC_RETR_LIM The macro TIPC_BC_RETR_LIM is always used in combination with 'jiffies', so we can just as well perform the addition in the macro itself. This way, we get a few shorter code lines and one less line break. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-07-01 19:10:57 -07:00
Jon Maloy	a7dc51adca	tipc: rename function msg_get_wrapped() to msg_inner_hdr() We rename the inline function msg_get_wrapped() to the more comprehensible msg_inner_hdr(). Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-25 13:42:54 -07:00
Jon Maloy	20c6731294	tipc: eliminate unnecessary skb expansion during retransmission We increase the allocated headroom for the buffer copies to be retransmitted. This eliminates the need for the lower stack levels (UDP/IP/L2) to expand the headroom in order to add their own headers. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-25 13:40:23 -07:00
Jon Maloy	77cf8edbc0	tipc: simplify stale link failure criteria In commit `a4dc70d46c` ("tipc: extend link reset criteria for stale packet retransmission") we made link retransmission failure events dependent on the link tolerance, and not only of the number of failed retransmission attempts, as we did earlier. This works well. However, keeping the original, additional criteria of 99 failed retransmissions is now redundant, and may in some cases lead to failure detection times in the order of minutes instead of the expected 1.5 sec link tolerance value. We now remove this criteria altogether. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-25 13:28:57 -07:00
David S. Miller	92ad6325cb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Minor SPDX change conflict. Signed-off-by: David S. Miller <davem@davemloft.net>	2019-06-22 08:59:24 -04:00

1 2 3 4 5 ...

436 Commits