-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmMPexEACgkQONu9yGCS
aT7SIg//QPmoJq2ho7oqDXzdxW67Eay3QZEPDoBol34RxEXoAUpxFB1nQlC3u1aI
OyPNXqQSPkObkXRMAVYStTZWgN3iUngorbsDOM+svGpAxt9zC/6d7JGNdhstaQLG
p/OoWaV7qwnNUsvndhohdmwU9TqjwpbvQwSa570uWQ47nIoxMyIz0iR80GjBSNGf
a2QiJg4OsaVxqxoySB6I6qAceRMbLOZVxW6p963IYC9Fj4j1NmhsPDIy95aidEN5
RG+Ng9GnuYRo0ktlhSje9YKyE5bYhUNCi6GWsCyArAFo0db/2GzRFweZRy5w7MC/
IaFQf93pDZinIBfDJliXfFMBx4YLdI3IHdtILPJvF7d1U5n6pG44knrPkPHzNouf
Ife8SckAPLzZeffobIcOXgoZqM3Xj/5mpHWffPQ2wIpL0ylf4bshPiC8mIRoyblh
ufrzUV6r7uBesp18c6nhjwAKgNVaw4w9+CpDk0qLlDELKNfENJ9wMRAJpcifYJKL
jJVWJh2wXG4kBWbp/2SetMkNNEeqn/PQUVY843uRE2iE76J2lzly5/+gI4DsSN6+
z2ZQL5tzguZvLw0s+si+doU+orbpzXluJncNdJyw8+1A7J2kxSn/Xfks9X3BKDyi
69pxUx627rMJZi4Pwsc1tyoeTVj32EAmUqronHD9tsQKsujIX0M=
=DO69
-----END PGP SIGNATURE-----
Merge 5.10.140 into android12-5.10-lts
Changes in 5.10.140
audit: fix potential double free on error path from fsnotify_add_inode_mark
parisc: Fix exception handler for fldw and fstw instructions
kernel/sys_ni: add compat entry for fadvise64_64
pinctrl: amd: Don't save/restore interrupt status and wake status bits
xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list()
xfs: reject crazy array sizes being fed to XFS_IOC_GETBMAP*
fs: remove __sync_filesystem
vfs: make sync_filesystem return errors from ->sync_fs
xfs: return errors in xfs_fs_sync_fs
xfs: only bother with sync_filesystem during readonly remount
kernel/sched: Remove dl_boosted flag comment
xfrm: fix refcount leak in __xfrm_policy_check()
xfrm: clone missing x->lastused in xfrm_do_migrate
af_key: Do not call xfrm_probe_algs in parallel
xfrm: policy: fix metadata dst->dev xmit null pointer dereference
NFS: Don't allocate nfs_fattr on the stack in __nfs42_ssc_open()
NFSv4.2 fix problems with __nfs42_ssc_open
SUNRPC: RPC level errors should set task->tk_rpc_status
mm/huge_memory.c: use helper function migration_entry_to_page()
mm/smaps: don't access young/dirty bit if pte unpresent
rose: check NULL rose_loopback_neigh->loopback
nfc: pn533: Fix use-after-free bugs caused by pn532_cmd_timeout
ice: xsk: Force rings to be sized to power of 2
ice: xsk: prohibit usage of non-balanced queue id
net/mlx5e: Properly disable vlan strip on non-UL reps
net: ipa: don't assume SMEM is page-aligned
net: moxa: get rid of asymmetry in DMA mapping/unmapping
bonding: 802.3ad: fix no transmission of LACPDUs
net: ipvtap - add __init/__exit annotations to module init/exit funcs
netfilter: ebtables: reject blobs that don't provide all entry points
bnxt_en: fix NQ resource accounting during vf creation on 57500 chips
netfilter: nft_payload: report ERANGE for too long offset and length
netfilter: nft_payload: do not truncate csum_offset and csum_type
netfilter: nf_tables: do not leave chain stats enabled on error
netfilter: nft_osf: restrict osf to ipv4, ipv6 and inet families
netfilter: nft_tunnel: restrict it to netdev family
netfilter: nftables: remove redundant assignment of variable err
netfilter: nf_tables: consolidate rule verdict trace call
netfilter: nft_cmp: optimize comparison for 16-bytes
netfilter: bitwise: improve error goto labels
netfilter: nf_tables: upfront validation of data via nft_data_init()
netfilter: nf_tables: disallow jump to implicit chain from set element
netfilter: nf_tables: disallow binding to already bound chain
tcp: tweak len/truesize ratio for coalesce candidates
net: Fix data-races around sysctl_[rw]mem(_offset)?.
net: Fix data-races around sysctl_[rw]mem_(max|default).
net: Fix data-races around weight_p and dev_weight_[rt]x_bias.
net: Fix data-races around netdev_max_backlog.
net: Fix data-races around netdev_tstamp_prequeue.
ratelimit: Fix data-races in ___ratelimit().
bpf: Folding omem_charge() into sk_storage_charge()
net: Fix data-races around sysctl_optmem_max.
net: Fix a data-race around sysctl_tstamp_allow_data.
net: Fix a data-race around sysctl_net_busy_poll.
net: Fix a data-race around sysctl_net_busy_read.
net: Fix a data-race around netdev_budget.
net: Fix a data-race around netdev_budget_usecs.
net: Fix data-races around sysctl_fb_tunnels_only_for_init_net.
net: Fix data-races around sysctl_devconf_inherit_init_net.
net: Fix a data-race around sysctl_somaxconn.
ixgbe: stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter
rxrpc: Fix locking in rxrpc's sendmsg
ionic: fix up issues with handling EAGAIN on FW cmds
btrfs: fix silent failure when deleting root reference
btrfs: replace: drop assert for suspended replace
btrfs: add info when mount fails due to stale replace target
btrfs: check if root is readonly while setting security xattr
perf/x86/lbr: Enable the branch type for the Arch LBR by default
x86/unwind/orc: Unwind ftrace trampolines with correct ORC entry
x86/bugs: Add "unknown" reporting for MMIO Stale Data
loop: Check for overflow while configuring loop
asm-generic: sections: refactor memory_intersects
s390: fix double free of GS and RI CBs on fork() failure
ACPI: processor: Remove freq Qos request for all CPUs
xen/privcmd: fix error exit of privcmd_ioctl_dm_op()
mm/hugetlb: fix hugetlb not supporting softdirty tracking
Revert "md-raid: destroy the bitmap after destroying the thread"
md: call __md_stop_writes in md_stop
arm64: Fix match_list for erratum 1286807 on Arm Cortex-A76
Documentation/ABI: Mention retbleed vulnerability info file for sysfs
blk-mq: fix io hung due to missing commit_rqs
perf python: Fix build when PYTHON_CONFIG is user supplied
perf/x86/intel/uncore: Fix broken read_counter() for SNB IMC PMU
scsi: ufs: core: Enable link lost interrupt
scsi: storvsc: Remove WQ_MEM_RECLAIM from storvsc_error_wq
bpf: Don't use tnum_range on array range checking for poke descriptors
Linux 5.10.140
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I29f4b4af2a584dc2f2789aac613583603002464a
commit a657182a5c5150cdfacb6640aad1d2712571a409 upstream.
Hsin-Wei reported a KASAN splat triggered by their BPF runtime fuzzer which
is based on a customized syzkaller:
BUG: KASAN: slab-out-of-bounds in bpf_int_jit_compile+0x1257/0x13f0
Read of size 8 at addr ffff888004e90b58 by task syz-executor.0/1489
CPU: 1 PID: 1489 Comm: syz-executor.0 Not tainted 5.19.0 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x9c/0xc9
print_address_description.constprop.0+0x1f/0x1f0
? bpf_int_jit_compile+0x1257/0x13f0
kasan_report.cold+0xeb/0x197
? kvmalloc_node+0x170/0x200
? bpf_int_jit_compile+0x1257/0x13f0
bpf_int_jit_compile+0x1257/0x13f0
? arch_prepare_bpf_dispatcher+0xd0/0xd0
? rcu_read_lock_sched_held+0x43/0x70
bpf_prog_select_runtime+0x3e8/0x640
? bpf_obj_name_cpy+0x149/0x1b0
bpf_prog_load+0x102f/0x2220
? __bpf_prog_put.constprop.0+0x220/0x220
? find_held_lock+0x2c/0x110
? __might_fault+0xd6/0x180
? lock_downgrade+0x6e0/0x6e0
? lock_is_held_type+0xa6/0x120
? __might_fault+0x147/0x180
__sys_bpf+0x137b/0x6070
? bpf_perf_link_attach+0x530/0x530
? new_sync_read+0x600/0x600
? __fget_files+0x255/0x450
? lock_downgrade+0x6e0/0x6e0
? fput+0x30/0x1a0
? ksys_write+0x1a8/0x260
__x64_sys_bpf+0x7a/0xc0
? syscall_enter_from_user_mode+0x21/0x70
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f917c4e2c2d
The problem here is that a range of tnum_range(0, map->max_entries - 1) has
limited ability to represent the concrete tight range with the tnum as the
set of resulting states from value + mask can result in a superset of the
actual intended range, and as such a tnum_in(range, reg->var_off) check may
yield true when it shouldn't, for example tnum_range(0, 2) would result in
00XX -> v = 0000, m = 0011 such that the intended set of {0, 1, 2} is here
represented by a less precise superset of {0, 1, 2, 3}. As the register is
known const scalar, really just use the concrete reg->var_off.value for the
upper index check.
Fixes: d2e4c1e6c2 ("bpf: Constant map key tracking for prog array pokes")
Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/984b37f9fdf7ac36831d2137415a4a915744c1b6.1661462653.git.daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6d17a112e9a63ff6a5edffd1676b99e0ffbcd269 upstream.
Link lost is treated as fatal error with commit c99b9b230149 ("scsi: ufs:
Treat link loss as fatal error"), but the event isn't registered as
interrupt source. Enable it.
Link: https://lore.kernel.org/r/1659404551-160958-1-git-send-email-kwmad.kim@samsung.com
Fixes: c99b9b230149 ("scsi: ufs: Treat link loss as fatal error")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Kiwoong Kim <kwmad.kim@samsung.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 11745ecfe8fea4b4a4c322967a7605d2ecbd5080 upstream.
Existing code was generating bogus counts for the SNB IMC bandwidth counters:
$ perf stat -a -I 1000 -e uncore_imc/data_reads/,uncore_imc/data_writes/
1.000327813 1,024.03 MiB uncore_imc/data_reads/
1.000327813 20.73 MiB uncore_imc/data_writes/
2.000580153 261,120.00 MiB uncore_imc/data_reads/
2.000580153 23.28 MiB uncore_imc/data_writes/
The problem was introduced by commit:
07ce734dd8 ("perf/x86/intel/uncore: Clean up client IMC")
Where the read_counter callback was replace to point to the generic
uncore_mmio_read_counter() function.
The SNB IMC counters are freerunnig 32-bit counters laid out contiguously in
MMIO. But uncore_mmio_read_counter() is using a readq() call to read from
MMIO therefore reading 64-bit from MMIO. Although this is okay for the
uncore_perf_event_update() function because it is shifting the value based
on the actual counter width to compute a delta, it is not okay for the
uncore_pmu_event_start() which is simply reading the counter and therefore
priming the event->prev_count with a bogus value which is responsible for
causing bogus deltas in the perf stat command above.
The fix is to reintroduce the custom callback for read_counter for the SNB
IMC PMU and use readl() instead of readq(). With the change the output of
perf stat is back to normal:
$ perf stat -a -I 1000 -e uncore_imc/data_reads/,uncore_imc/data_writes/
1.000120987 296.94 MiB uncore_imc/data_reads/
1.000120987 138.42 MiB uncore_imc/data_writes/
2.000403144 175.91 MiB uncore_imc/data_reads/
2.000403144 68.50 MiB uncore_imc/data_writes/
Fixes: 07ce734dd8 ("perf/x86/intel/uncore: Clean up client IMC")
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20220803160031.1379788-1-eranian@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bc9e7fe313d5e56d4d5f34bcc04d1165f94f86fb upstream.
The previous change to Python autodetection had a small mistake where
the auto value was used to determine the Python binary, rather than the
user supplied value. The Python binary is only used for one part of the
build process, rather than the final linking, so it was producing
correct builds in most scenarios, especially when the auto detected
value matched what the user wanted, or the system only had a valid set
of Pythons.
Change it so that the Python binary path is derived from either the
PYTHON_CONFIG value or PYTHON value, depending on what is specified by
the user. This was the original intention.
This error was spotted in a build failure an odd cross compilation
environment after commit 4c41cb46a732fe82 ("perf python: Prefer
python3") was merged.
Fixes: 630af16eee495f58 ("perf tools: Use Python devtools for version autodetection rather than runtime")
Signed-off-by: James Clark <james.clark@arm.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@arm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220728093946.1337642-1-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 65fac0d54f374625b43a9d6ad1f2c212bd41f518 upstream.
Currently, in virtio_scsi, if 'bd->last' is not set to true while
dispatching request, such io will stay in driver's queue, and driver
will wait for block layer to dispatch more rqs. However, if block
layer failed to dispatch more rq, it should trigger commit_rqs to
inform driver.
There is a problem in blk_mq_try_issue_list_directly() that commit_rqs
won't be called:
// assume that queue_depth is set to 1, list contains two rq
blk_mq_try_issue_list_directly
blk_mq_request_issue_directly
// dispatch first rq
// last is false
__blk_mq_try_issue_directly
blk_mq_get_dispatch_budget
// succeed to get first budget
__blk_mq_issue_directly
scsi_queue_rq
cmd->flags |= SCMD_LAST
virtscsi_queuecommand
kick = (sc->flags & SCMD_LAST) != 0
// kick is false, first rq won't issue to disk
queued++
blk_mq_request_issue_directly
// dispatch second rq
__blk_mq_try_issue_directly
blk_mq_get_dispatch_budget
// failed to get second budget
ret == BLK_STS_RESOURCE
blk_mq_request_bypass_insert
// errors is still 0
if (!list_empty(list) || errors && ...)
// won't pass, commit_rqs won't be called
In this situation, first rq relied on second rq to dispatch, while
second rq relied on first rq to complete, thus they will both hung.
Fix the problem by also treat 'BLK_STS_*RESOURCE' as 'errors' since
it means that request is not queued successfully.
Same problem exists in blk_mq_dispatch_rq_list(), 'BLK_STS_*RESOURCE'
can't be treated as 'errors' here, fix the problem by calling
commit_rqs if queue_rq return 'BLK_STS_*RESOURCE'.
Fixes: d666ba98f8 ("blk-mq: add mq_ops->commit_rqs()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220726122224.1790882-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 00da0cb385d05a89226e150a102eb49d8abb0359 upstream.
While reporting for the AMD retbleed vulnerability was added in
6b80b59b3555 ("x86/bugs: Report AMD retbleed vulnerability")
the new sysfs file was not mentioned so far in the ABI documentation for
sysfs-devices-system-cpu. Fix that.
Fixes: 6b80b59b3555 ("x86/bugs: Report AMD retbleed vulnerability")
Signed-off-by: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220801091529.325327-1-carnil@debian.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5e1e087457c94ad7fafbe1cf6f774c6999ee29d4 upstream.
Since commit 51f559d66527 ("arm64: Enable repeat tlbi workaround on KRYO4XX
gold CPUs"), we failed to detect erratum 1286807 on Cortex-A76 because its
entry in arm64_repeat_tlbi_list[] was accidently corrupted by this commit.
Fix this issue by creating a separate entry for Kryo4xx Gold.
Fixes: 51f559d66527 ("arm64: Enable repeat tlbi workaround on KRYO4XX gold CPUs")
Cc: Shreyas K K <quic_shrekk@quicinc.com>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220809043848.969-1-yuzenghui@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0dd84b319352bb8ba64752d4e45396d8b13e6018 upstream.
From the link [1], we can see raid1d was running even after the path
raid_dtr -> md_stop -> __md_stop.
Let's stop write first in destructor to align with normal md-raid to
fix the KASAN issue.
[1]. https://lore.kernel.org/linux-raid/CAPhsuW5gc4AakdGNdF8ubpezAuDLFOYUO_sfMZcec6hQFm8nhg@mail.gmail.com/T/#m7f12bf90481c02c6d2da68c64aeed4779b7df74a
Fixes: 48df498daf ("md: move bitmap_destroy to the beginning of __md_stop")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1d258758cf06a0734482989911d184dd5837ed4e upstream.
This reverts commit e151db8ecfb019b7da31d076130a794574c89f6f. Because it
obviously breaks clustered raid as noticed by Neil though it fixed KASAN
issue for dm-raid, let's revert it and fix KASAN issue in next commit.
[1]. https://lore.kernel.org/linux-raid/a6657e08-b6a7-358b-2d2a-0ac37d49d23a@linux.dev/T/#m95ac225cab7409f66c295772483d091084a6d470
Fixes: e151db8ecfb0 ("md-raid: destroy the bitmap after destroying the thread")
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f96f7a40874d7c746680c0b9f57cef2262ae551f upstream.
Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2.
I observed that hugetlb does not support/expect write-faults in shared
mappings that would have to map the R/O-mapped page writable -- and I
found two case where we could currently get such faults and would
erroneously map an anon page into a shared mapping.
Reproducers part of the patches.
I propose to backport both fixes to stable trees. The first fix needs a
small adjustment.
This patch (of 2):
Staring at hugetlb_wp(), one might wonder where all the logic for shared
mappings is when stumbling over a write-protected page in a shared
mapping. In fact, there is none, and so far we thought we could get away
with that because e.g., mprotect() should always do the right thing and
map all pages directly writable.
Looks like we were wrong:
--------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>
#define HUGETLB_SIZE (2 * 1024 * 1024u)
static void clear_softdirty(void)
{
int fd = open("/proc/self/clear_refs", O_WRONLY);
const char *ctrl = "4";
int ret;
if (fd < 0) {
fprintf(stderr, "open(clear_refs) failed\n");
exit(1);
}
ret = write(fd, ctrl, strlen(ctrl));
if (ret != strlen(ctrl)) {
fprintf(stderr, "write(clear_refs) failed\n");
exit(1);
}
close(fd);
}
int main(int argc, char **argv)
{
char *map;
int fd;
fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT);
if (!fd) {
fprintf(stderr, "open() failed\n");
return -errno;
}
if (ftruncate(fd, HUGETLB_SIZE)) {
fprintf(stderr, "ftruncate() failed\n");
return -errno;
}
map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "mmap() failed\n");
return -errno;
}
*map = 0;
if (mprotect(map, HUGETLB_SIZE, PROT_READ)) {
fprintf(stderr, "mmprotect() failed\n");
return -errno;
}
clear_softdirty();
if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) {
fprintf(stderr, "mmprotect() failed\n");
return -errno;
}
*map = 0;
return 0;
}
--------------------------------------------------------------------------
Above test fails with SIGBUS when there is only a single free hugetlb page.
# echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
Bus error (core dumped)
And worse, with sufficient free hugetlb pages it will map an anonymous page
into a shared mapping, for example, messing up accounting during unmap
and breaking MAP_SHARED semantics:
# echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
# cat /proc/meminfo | grep HugePages_
HugePages_Total: 2
HugePages_Free: 1
HugePages_Rsvd: 18446744073709551615
HugePages_Surp: 0
Reason in this particular case is that vma_wants_writenotify() will
return "true", removing VM_SHARED in vma_set_page_prot() to map pages
write-protected. Let's teach vma_wants_writenotify() that hugetlb does not
support softdirty tracking.
Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com
Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com
Fixes: 64e455079e ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org> [3.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c5deb27895e017a0267de0a20d140ad5fcc55a54 upstream.
The error exit of privcmd_ioctl_dm_op() is calling unlock_pages()
potentially with pages being NULL, leading to a NULL dereference.
Additionally lock_pages() doesn't check for pin_user_pages_fast()
having been completely successful, resulting in potentially not
locking all pages into memory. This could result in sporadic failures
when using the related memory in user mode.
Fix all of that by calling unlock_pages() always with the real number
of pinned pages, which will be zero in case pages being NULL, and by
checking the number of pages pinned by pin_user_pages_fast() matching
the expected number of pages.
Cc: <stable@vger.kernel.org>
Fixes: ab520be8cd ("xen/privcmd: Add IOCTL_PRIVCMD_DM_OP")
Reported-by: Rustam Subkhankulov <subkhankulov@ispras.ru>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Link: https://lore.kernel.org/r/20220825141918.3581-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 36527b9d882362567ceb4eea8666813280f30e6f upstream.
The freq Qos request would be removed repeatedly if the cpufreq policy
relates to more than one CPU. Then, it would cause the "called for unknown
object" warning.
Remove the freq Qos request for each CPU relates to the cpufreq policy,
instead of removing repeatedly for the last CPU of it.
Fixes: a1bb46c36c ("ACPI: processor: Add QoS requests for all CPUs")
Reported-by: Jeremy Linton <Jeremy.Linton@arm.com>
Tested-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Riwen Lu <luriwen@kylinos.cn>
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 13cccafe0edcd03bf1c841de8ab8a1c8e34f77d9 upstream.
The pointers for guarded storage and runtime instrumentation control
blocks are stored in the thread_struct of the associated task. These
pointers are initially copied on fork() via arch_dup_task_struct()
and then cleared via copy_thread() before fork() returns. If fork()
happens to fail after the initial task dup and before copy_thread(),
the newly allocated task and associated thread_struct memory are
freed via free_task() -> arch_release_task_struct(). This results in
a double free of the guarded storage and runtime info structs
because the fields in the failed task still refer to memory
associated with the source task.
This problem can manifest as a BUG_ON() in set_freepointer() (with
CONFIG_SLAB_FREELIST_HARDENED enabled) or KASAN splat (if enabled)
when running trinity syscall fuzz tests on s390x. To avoid this
problem, clear the associated pointer fields in
arch_dup_task_struct() immediately after the new task is copied.
Note that the RI flag is still cleared in copy_thread() because it
resides in thread stack memory and that is where stack info is
copied.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Fixes: 8d9047f8b9 ("s390/runtime instrumentation: simplify task exit handling")
Fixes: 7b83c6297d ("s390/guarded storage: simplify task exit handling")
Cc: <stable@vger.kernel.org> # 4.15
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20220816155407.537372-1-bfoster@redhat.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0c7d7cc2b4fe2e74ef8728f030f0f1674f9f6aee upstream.
There are two problems with the current code of memory_intersects:
First, it doesn't check whether the region (begin, end) falls inside the
region (virt, vend), that is (virt < begin && vend > end).
The second problem is if vend is equal to begin, it will return true but
this is wrong since vend (virt + size) is not the last address of the
memory region but (virt + size -1) is. The wrong determination will
trigger the misreporting when the function check_for_illegal_area calls
memory_intersects to check if the dma region intersects with stext region.
The misreporting is as below (stext is at 0x80100000):
WARNING: CPU: 0 PID: 77 at kernel/dma/debug.c:1073 check_for_illegal_area+0x130/0x168
DMA-API: chipidea-usb2 e0002000.usb: device driver maps memory from kernel text or rodata [addr=800f0000] [len=65536]
Modules linked in:
CPU: 1 PID: 77 Comm: usb-storage Not tainted 5.19.0-yocto-standard #5
Hardware name: Xilinx Zynq Platform
unwind_backtrace from show_stack+0x18/0x1c
show_stack from dump_stack_lvl+0x58/0x70
dump_stack_lvl from __warn+0xb0/0x198
__warn from warn_slowpath_fmt+0x80/0xb4
warn_slowpath_fmt from check_for_illegal_area+0x130/0x168
check_for_illegal_area from debug_dma_map_sg+0x94/0x368
debug_dma_map_sg from __dma_map_sg_attrs+0x114/0x128
__dma_map_sg_attrs from dma_map_sg_attrs+0x18/0x24
dma_map_sg_attrs from usb_hcd_map_urb_for_dma+0x250/0x3b4
usb_hcd_map_urb_for_dma from usb_hcd_submit_urb+0x194/0x214
usb_hcd_submit_urb from usb_sg_wait+0xa4/0x118
usb_sg_wait from usb_stor_bulk_transfer_sglist+0xa0/0xec
usb_stor_bulk_transfer_sglist from usb_stor_bulk_srb+0x38/0x70
usb_stor_bulk_srb from usb_stor_Bulk_transport+0x150/0x360
usb_stor_Bulk_transport from usb_stor_invoke_transport+0x38/0x440
usb_stor_invoke_transport from usb_stor_control_thread+0x1e0/0x238
usb_stor_control_thread from kthread+0xf8/0x104
kthread from ret_from_fork+0x14/0x2c
Refactor memory_intersects to fix the two problems above.
Before the 1d7db834a027e ("dma-debug: use memory_intersects()
directly"), memory_intersects is called only by printk_late_init:
printk_late_init -> init_section_intersects ->memory_intersects.
There were few places where memory_intersects was called.
When commit 1d7db834a027e ("dma-debug: use memory_intersects()
directly") was merged and CONFIG_DMA_API_DEBUG is enabled, the DMA
subsystem uses it to check for an illegal area and the calltrace above
is triggered.
[akpm@linux-foundation.org: fix nearby comment typo]
Link: https://lkml.kernel.org/r/20220819081145.948016-1-quanyang.wang@windriver.com
Fixes: 9795593625 ("asm/sections: add helpers to check for section data")
Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thierry Reding <treding@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c490a0b5a4f36da3918181a8acdc6991d967c5f3 upstream.
The userspace can configure a loop using an ioctl call, wherein
a configuration of type loop_config is passed (see lo_ioctl()'s
case on line 1550 of drivers/block/loop.c). This proceeds to call
loop_configure() which in turn calls loop_set_status_from_info()
(see line 1050 of loop.c), passing &config->info which is of type
loop_info64*. This function then sets the appropriate values, like
the offset.
loop_device has lo_offset of type loff_t (see line 52 of loop.c),
which is typdef-chained to long long, whereas loop_info64 has
lo_offset of type __u64 (see line 56 of include/uapi/linux/loop.h).
The function directly copies offset from info to the device as
follows (See line 980 of loop.c):
lo->lo_offset = info->lo_offset;
This results in an overflow, which triggers a warning in iomap_iter()
due to a call to iomap_iter_done() which has:
WARN_ON_ONCE(iter->iomap.offset > iter->pos);
Thus, check for negative value during loop_set_status_from_info().
Bug report: https://syzkaller.appspot.com/bug?id=c620fe14aac810396d3c3edc9ad73848bf69a29e
Reported-and-tested-by: syzbot+a8e049cd3abd342936b6@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220823160810.181275-1-code@siddh.me
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7df548840c496b0141fb2404b889c346380c2b22 upstream.
Older Intel CPUs that are not in the affected processor list for MMIO
Stale Data vulnerabilities currently report "Not affected" in sysfs,
which may not be correct. Vulnerability status for these older CPUs is
unknown.
Add known-not-affected CPUs to the whitelist. Report "unknown"
mitigation status for CPUs that are not in blacklist, whitelist and also
don't enumerate MSR ARCH_CAPABILITIES bits that reflect hardware
immunity to MMIO Stale Data vulnerabilities.
Mitigation is not deployed when the status is unknown.
[ bp: Massage, fixup. ]
Fixes: 8d50cdf8b834 ("x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data")
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/a932c154772f2121794a5f2eded1a11013114711.1657846269.git.pawan.kumar.gupta@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fc2e426b1161761561624ebd43ce8c8d2fa058da upstream.
When meeting ftrace trampolines in ORC unwinding, unwinder uses address
of ftrace_{regs_}call address to find the ORC entry, which gets next frame at
sp+176.
If there is an IRQ hitting at sub $0xa8,%rsp, the next frame should be
sp+8 instead of 176. It makes unwinder skip correct frame and throw
warnings such as "wrong direction" or "can't access registers", etc,
depending on the content of the incorrect frame address.
By adding the base address ftrace_{regs_}caller with the offset
*ip - ops->trampoline*, we can get the correct address to find the ORC entry.
Also change "caller" to "tramp_addr" to make variable name conform to
its content.
[ mingo: Clarified the changelog a bit. ]
Fixes: 6be7fa3c74 ("ftrace, orc, x86: Handle ftrace dynamically allocated trampolines")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220819084334.244016-1-chenzhongjin@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 32ba156df1b1c8804a4e5be5339616945eafea22 upstream.
On the platform with Arch LBR, the HW raw branch type encoding may leak
to the perf tool when the SAVE_TYPE option is not set.
In the intel_pmu_store_lbr(), the HW raw branch type is stored in
lbr_entries[].type. If the SAVE_TYPE option is set, the
lbr_entries[].type will be converted into the generic PERF_BR_* type
in the intel_pmu_lbr_filter() and exposed to the user tools.
But if the SAVE_TYPE option is NOT set by the user, the current perf
kernel doesn't clear the field. The HW raw branch type leaks.
There are two solutions to fix the issue for the Arch LBR.
One is to clear the field if the SAVE_TYPE option is NOT set.
The other solution is to unconditionally convert the branch type and
expose the generic type to the user tools.
The latter is implemented here, because
- The branch type is valuable information. I don't see a case where
you would not benefit from the branch type. (Stephane Eranian)
- Not having the branch type DOES NOT save any space in the
branch record (Stephane Eranian)
- The Arch LBR HW can retrieve the common branch types from the
LBR_INFO. It doesn't require the high overhead SW disassemble.
Fixes: 47125db27e ("perf/x86/intel/lbr: Support Architectural LBR")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220816125612.2042397-1-kan.liang@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b51111271b0352aa596c5ae8faf06939e91b3b68 upstream.
For a filesystem which has btrfs read-only property set to true, all
write operations including xattr should be denied. However, security
xattr can still be changed even if btrfs ro property is true.
This happens because xattr_permission() does not have any restrictions
on security.*, system.* and in some cases trusted.* from VFS and
the decision is left to the underlying filesystem. See comments in
xattr_permission() for more details.
This patch checks if the root is read-only before performing the set
xattr operation.
Testcase:
DEV=/dev/vdb
MNT=/mnt
mkfs.btrfs -f $DEV
mount $DEV $MNT
echo "file one" > $MNT/f1
setfattr -n "security.one" -v 2 $MNT/f1
btrfs property set /mnt ro true
setfattr -n "security.one" -v 1 $MNT/f1
umount $MNT
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f2c3bec215694fb8bc0ef5010f2a758d1906fc2d upstream.
If the replace target device reappears after the suspended replace is
cancelled, it blocks the mount operation as it can't find the matching
replace-item in the metadata. As shown below,
BTRFS error (device sda5): replace devid present without an active replace item
To overcome this situation, the user can run the command
btrfs device scan --forget <replace target device>
and try the mount command again. And also, to avoid repeating the issue,
superblock on the devid=0 must be wiped.
wipefs -a device-path-to-devid=0.
This patch adds some info when this situation occurs.
Reported-by: Samuel Greiner <samuel@balkonien.org>
Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 59a3991984dbc1fc47e5651a265c5200bd85464e upstream.
If the filesystem mounts with the replace-operation in a suspended state
and try to cancel the suspended replace-operation, we hit the assert. The
assert came from the commit fe97e2e173 ("btrfs: dev-replace: replace's
scrub must not be running in suspended state") that was actually not
required. So just remove it.
$ mount /dev/sda5 /btrfs
BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'
$ mount -o degraded /dev/sda5 /btrfs <-- success.
$ btrfs replace cancel /btrfs
kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
kernel: ------------[ cut here ]------------
kernel: kernel BUG at fs/btrfs/ctree.h:3750!
After the patch:
$ btrfs replace cancel /btrfs
BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled
Fixes: fe97e2e173 ("btrfs: dev-replace: replace's scrub must not be running in suspended state")
CC: stable@vger.kernel.org # 5.0+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 47bf225a8d2cccb15f7e8d4a1ed9b757dd86afd7 upstream.
At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end
up returning from the function with a value of 0 (success). This happens
because the function returns the value stored in the variable 'err',
which is 0, while the error value we got from btrfs_search_slot() is
stored in the 'ret' variable.
So fix it by setting 'err' with the error value.
Fixes: 8289ed9f93bef2 ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0fc4dd452d6c14828eed6369155c75c0ac15bab3 ]
In looping on FW update tests we occasionally see the
FW_ACTIVATE_STATUS command fail while it is in its EAGAIN loop
waiting for the FW activate step to finsh inside the FW. The
firmware is complaining that the done bit is set when a new
dev_cmd is going to be processed.
Doing a clean on the cmd registers and doorbell before exiting
the wait-for-done and cleaning the done bit before the sleep
prevents this from occurring.
Fixes: fbfb803153 ("ionic: Add hardware init and device commands")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b0f571ecd7943423c25947439045f0d352ca3dbf ]
Fix three bugs in the rxrpc's sendmsg implementation:
(1) rxrpc_new_client_call() should release the socket lock when returning
an error from rxrpc_get_call_slot().
(2) rxrpc_wait_for_tx_window_intr() will return without the call mutex
held in the event that we're interrupted by a signal whilst waiting
for tx space on the socket or relocking the call mutex afterwards.
Fix this by: (a) moving the unlock/lock of the call mutex up to
rxrpc_send_data() such that the lock is not held around all of
rxrpc_wait_for_tx_window*() and (b) indicating to higher callers
whether we're return with the lock dropped. Note that this means
recvmsg() will not block on this call whilst we're waiting.
(3) After dropping and regaining the call mutex, rxrpc_send_data() needs
to go and recheck the state of the tx_pending buffer and the
tx_total_len check in case we raced with another sendmsg() on the same
call.
Thinking on this some more, it might make sense to have different locks for
sendmsg() and recvmsg(). There's probably no need to make recvmsg() wait
for sendmsg(). It does mean that recvmsg() can return MSG_EOR indicating
that a call is dead before a sendmsg() to that call returns - but that can
currently happen anyway.
Without fix (2), something like the following can be induced:
WARNING: bad unlock balance detected!
5.16.0-rc6-syzkaller #0 Not tainted
-------------------------------------
syz-executor011/3597 is trying to release lock (&call->user_mutex) at:
[<ffffffff885163a3>] rxrpc_do_sendmsg+0xc13/0x1350 net/rxrpc/sendmsg.c:748
but there are no more locks to release!
other info that might help us debug this:
no locks held by syz-executor011/3597.
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_unlock_imbalance_bug include/trace/events/lock.h:58 [inline]
__lock_release kernel/locking/lockdep.c:5306 [inline]
lock_release.cold+0x49/0x4e kernel/locking/lockdep.c:5657
__mutex_unlock_slowpath+0x99/0x5e0 kernel/locking/mutex.c:900
rxrpc_do_sendmsg+0xc13/0x1350 net/rxrpc/sendmsg.c:748
rxrpc_sendmsg+0x420/0x630 net/rxrpc/af_rxrpc.c:561
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
____sys_sendmsg+0x6e8/0x810 net/socket.c:2409
___sys_sendmsg+0xf3/0x170 net/socket.c:2463
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2492
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
[Thanks to Hawkins Jiawei and Khalid Masum for their attempts to fix this]
Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Reported-by: syzbot+7f0483225d0c94cb3441@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: syzbot+7f0483225d0c94cb3441@syzkaller.appspotmail.com
cc: Hawkins Jiawei <yin31149@gmail.com>
cc: Khalid Masum <khalid.masum.92@gmail.com>
cc: Dan Carpenter <dan.carpenter@oracle.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/166135894583.600315.7170979436768124075.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 25d7a5f5a6bb15a2dae0a3f39ea5dda215024726 ]
The ixgbe_ptp_start_cyclecounter is intended to be called whenever the
cyclecounter parameters need to be changed.
Since commit a9763f3cb5 ("ixgbe: Update PTP to support X550EM_x
devices"), this function has cleared the SYSTIME registers and reset the
TSAUXC DISABLE_SYSTIME bit.
While these need to be cleared during ixgbe_ptp_reset, it is wrong to clear
them during ixgbe_ptp_start_cyclecounter. This function may be called
during both reset and link status change. When link changes, the SYSTIME
counter is still operating normally, but the cyclecounter should be updated
to account for the possibly changed parameters.
Clearing SYSTIME when link changes causes the timecounter to jump because
the cycle counter now reads zero.
Extract the SYSTIME initialization out to a new function and call this
during ixgbe_ptp_reset. This prevents the timecounter adjustment and avoids
an unnecessary reset of the current time.
This also restores the original SYSTIME clearing that occurred during
ixgbe_ptp_reset before the commit above.
Reported-by: Steve Payne <spayne@aurora.tech>
Reported-by: Ilya Evenbach <ievenbach@aurora.tech>
Fixes: a9763f3cb5 ("ixgbe: Update PTP to support X550EM_x devices")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3c9ba81d72047f2e81bb535d42856517b613aba7 ]
While reading sysctl_somaxconn, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a5612ca10d1aa05624ebe72633e0c8c792970833 ]
While reading sysctl_devconf_inherit_init_net, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 856c395cfa ("net: introduce a knob to control whether to inherit devconf config")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit af67508ea6cbf0e4ea27f8120056fa2efce127dd ]
While reading sysctl_fb_tunnels_only_for_init_net, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 79134e6ce2 ("net: do not create fallback tunnels for non-default namespaces")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit fa45d484c52c73f79db2c23b0cdfc6c6455093ad ]
While reading netdev_budget_usecs, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 7acf8a1e8a ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2e0c42374ee32e72948559d2ae2f7ba3dc6b977c ]
While reading netdev_budget, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 51b0bdedb8 ("[NET]: Separate two usages of netdev_max_backlog.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e59ef36f0795696ab229569c153936bfd068d21c ]
While reading sysctl_net_busy_read, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 2d48d67fa8 ("net: poll/select low latency socket support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c42b7cddea47503411bfb5f2f93a4154aaffa2d9 ]
While reading sysctl_net_busy_poll, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 0602129286 ("net: add low latency socket poll")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d2154b0afa73c0159b2856f875c6b4fe7cf6a95e ]
While reading sysctl_tstamp_allow_data, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: b245be1f4d ("net-timestamp: no-payload only sysctl")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7de6d09f51917c829af2b835aba8bb5040f8e86a ]
While reading sysctl_optmem_max, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9e838b02b0bb795793f12049307a354e28b5749c ]
sk_storage_charge() is the only user of omem_charge().
This patch simplifies it by folding omem_charge() into
sk_storage_charge().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201112211301.2586255-1-kafai@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6bae8ceb90ba76cdba39496db936164fa672b9be ]
While reading rs->interval and rs->burst, they can be changed
concurrently via sysctl (e.g. net_ratelimit_state). Thus, we
need to add READ_ONCE() to their readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 61adf447e38664447526698872e21c04623afb8e ]
While reading netdev_tstamp_prequeue, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 3b098e2d7c ("net: Consistent skb timestamping")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5dcd08cd19912892586c6082d56718333e2d19db ]
While reading netdev_max_backlog, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
While at it, we remove the unnecessary spaces in the doc.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bf955b5ab8f6f7b0632cdef8e36b14e4f6e77829 ]
While reading weight_p, it can be changed concurrently. Thus, we need
to add READ_ONCE() to its reader.
Also, dev_[rt]x_weight can be read/written at the same time. So, we
need to use READ_ONCE() and WRITE_ONCE() for its access. Moreover, to
use the same weight_p while changing dev_[rt]x_weight, we add a mutex
in proc_do_dev_weight().
Fixes: 3d48b53fb2 ("net: dev_weight: TX/RX orthogonality")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1227c1771dd2ad44318aa3ab9e3a293b3f34ff2a ]
While reading sysctl_[rw]mem_(max|default), they can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 02739545951ad4c1215160db7fbf9b7a918d3c0b ]
While reading these sysctl variables, they can be changed concurrently.
Thus, we need to add READ_ONCE() to their readers.
- .sysctl_rmem
- .sysctl_rwmem
- .sysctl_rmem_offset
- .sysctl_wmem_offset
- sysctl_tcp_rmem[1, 2]
- sysctl_tcp_wmem[1, 2]
- sysctl_decnet_rmem[1]
- sysctl_decnet_wmem[1]
- sysctl_tipc_rmem[1]
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 240bfd134c592791fdceba1ce7fc3f973c33df2d ]
tcp_grow_window() is using skb->len/skb->truesize to increase tp->rcv_ssthresh
which has a direct impact on advertized window sizes.
We added TCP coalescing in linux-3.4 & linux-3.5:
Instead of storing skbs with one or two MSS in receive queue (or OFO queue),
we try to append segments together to reduce memory overhead.
High performance network drivers tend to cook skb with 3 parts :
1) sk_buff structure (256 bytes)
2) skb->head contains room to copy headers as needed, and skb_shared_info
3) page fragment(s) containing the ~1514 bytes frame (or more depending on MTU)
Once coalesced into a previous skb, 1) and 2) are freed.
We can therefore tweak the way we compute len/truesize ratio knowing
that skb->truesize is inflated by 1) and 2) soon to be freed.
This is done only for in-order skb, or skb coalesced into OFO queue.
The result is that low rate flows no longer pay the memory price of having
low GRO aggregation factor. Same result for drivers not using GRO.
This is critical to allow a big enough receiver window,
typically tcp_rmem[2] / 2.
We have been using this at Google for about 5 years, it is due time
to make it upstream.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f323ef3a0d49e147365284bc1f02212e617b7f09 ]
Extend struct nft_data_desc to add a flag field that specifies
nft_data_init() is being called for set element data.
Use it to disallow jump to implicit chain from set element, only jump
to chain via immediate expression is allowed.
Fixes: d0e2c7de92 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 341b6941608762d8235f3fd1e45e4d7114ed8c2c ]
Instead of parsing the data and then validate that type and length are
correct, pass a description of the expected data so it can be validated
upfront before parsing it to bail out earlier.
This patch adds a new .size field to specify the maximum size of the
data area. The .len field is optional and it is used as an input/output
field, it provides the specific length of the expected data in the input
path. If then .len field is not specified, then obtained length from the
netlink attribute is stored. This is required by cmp, bitwise, range and
immediate, which provide no netlink attribute that describes the data
length. The immediate expression uses the destination register type to
infer the expected data type.
Relying on opencoded validation of the expected data might lead to
subtle bugs as described in 7e6bc1f6cabc ("netfilter: nf_tables:
stricter validation of element data").
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 23f68d462984bfda47c7bf663dca347e8e3df549 ]
Allow up to 16-byte comparisons with a new cmp fast version. Use two
64-bit words and calculate the mask representing the bits to be
compared. Make sure the comparison is 64-bit aligned and avoid
out-of-bound memory access on registers.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>