898b031dc2
40544 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Zqiang
|
d0a8c0e31a |
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
[ Upstream commit 3c1566bca3f8349f12b75d0a2d5e4a20ad6262ec ] For kernels built with CONFIG_PREEMPT_RCU=y, the following scenario can result in a NULL-pointer dereference: CPU1 CPU2 rcu_preempt_deferred_qs_irqrestore rcu_print_task_exp_stall if (special.b.blocked) READ_ONCE(rnp->exp_tasks) != NULL raw_spin_lock_rcu_node np = rcu_next_node_entry(t, rnp) if (&t->rcu_node_entry == rnp->exp_tasks) WRITE_ONCE(rnp->exp_tasks, np) .... raw_spin_unlock_irqrestore_rcu_node raw_spin_lock_irqsave_rcu_node t = list_entry(rnp->exp_tasks->prev, struct task_struct, rcu_node_entry) (if rnp->exp_tasks is NULL, this will dereference a NULL pointer) The problem is that CPU2 accesses the rcu_node structure's->exp_tasks field without holding the rcu_node structure's ->lock and CPU2 did not observe CPU1's change to rcu_node structure's ->exp_tasks in time. Therefore, if CPU1 sets rcu_node structure's->exp_tasks pointer to NULL, then CPU2 might dereference that NULL pointer. This commit therefore holds the rcu_node structure's ->lock while accessing that structure's->exp_tasks field. [ paulmck: Apply Frederic Weisbecker feedback. ] Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Paul E. McKenney
|
522c441faf |
refscale: Move shutdown from wait_event() to wait_event_idle()
[ Upstream commit 6bc6e6b27524304aadb9c04611ddb1c84dd7617a ] The ref_scale_shutdown() kthread/function uses wait_event() to wait for the refscale test to complete. However, although the read-side tests are normally extremely fast, there is no law against specifying a very large value for the refscale.loops module parameter or against having a slow read-side primitive. Either way, this might well trigger the hung-task timeout. This commit therefore replaces those wait_event() calls with calls to wait_event_idle(), which do not trigger the hung-task timeout. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Thomas Gleixner
|
a84b08314f |
tick/broadcast: Make broadcast device replacement work correctly
[ Upstream commit f9d36cf445ffff0b913ba187a3eff78028f9b1fb ]
When a tick broadcast clockevent device is initialized for one shot mode
then tick_broadcast_setup_oneshot() OR's the periodic broadcast mode
cpumask into the oneshot broadcast cpumask.
This is required when switching from periodic broadcast mode to oneshot
broadcast mode to ensure that CPUs which are waiting for periodic
broadcast are woken up on the next tick.
But it is subtly broken, when an active broadcast device is replaced and
the system is already in oneshot (NOHZ/HIGHRES) mode. Victor observed
this and debugged the issue.
Then the OR of the periodic broadcast CPU mask is wrong as the periodic
cpumask bits are sticky after tick_broadcast_enable() set it for a CPU
unless explicitly cleared via tick_broadcast_disable().
That means that this sets all other CPUs which have tick broadcasting
enabled at that point unconditionally in the oneshot broadcast mask.
If the affected CPUs were already idle and had their bits set in the
oneshot broadcast mask then this does no harm. But for non idle CPUs
which were not set this corrupts their state.
On their next invocation of tick_broadcast_enable() they observe the bit
set, which indicates that the broadcast for the CPU is already set up.
As a consequence they fail to update the broadcast event even if their
earliest expiring timer is before the actually programmed broadcast
event.
If the programmed broadcast event is far in the future, then this can
cause stalls or trigger the hung task detector.
Avoid this by telling tick_broadcast_setup_oneshot() explicitly whether
this is the initial switch over from periodic to oneshot broadcast which
must take the periodic broadcast mask into account. In the case of
initialization of a replacement device this prevents that the broadcast
oneshot mask is modified.
There is a second problem with broadcast device replacement in this
function. The broadcast device is only armed when the previous state of
the device was periodic.
That is correct for the switch from periodic broadcast mode to oneshot
broadcast mode as the underlying broadcast device could operate in
oneshot state already due to lack of periodic state in hardware. In that
case it is already armed to expire at the next tick.
For the replacement case this is wrong as the device is in shutdown
state. That means that any already pending broadcast event will not be
armed.
This went unnoticed because any CPU which goes idle will observe that
the broadcast device has an expiry time of KTIME_MAX and therefore any
CPUs next timer event will be earlier and cause a reprogramming of the
broadcast device. But that does not guarantee that the events of the
CPUs which were already in idle are delivered on time.
Fix this by arming the newly installed device for an immediate event
which will reevaluate the per CPU expiry times and reprogram the
broadcast device accordingly. This is simpler than caching the last
expiry time in yet another place or saving it before the device exchange
and handing it down to the setup function. Replacement of broadcast
devices is not a frequent operation and usually happens once somewhere
late in the boot process.
Fixes:
|
||
John Stultz
|
1b9c92432f |
locking/rwsem: Add __always_inline annotation to __down_read_common() and inlined callers
commit 92cc5d00a431e96e5a49c0b97e5ad4fa7536bd4b upstream.
Apparently despite it being marked inline, the compiler
may not inline __down_read_common() which makes it difficult
to identify the cause of lock contention, as the blocked
function in traceevents will always be listed as
__down_read_common().
So this patch adds __always_inline annotation to the common
function (as well as the inlined helper callers) to force it to
be inlined so the blocking function will be listed (via Wchan)
in traceevents.
Fixes:
|
||
Marco Elver
|
706ae66574 |
kcsan: Avoid READ_ONCE() in read_instrumented_memory()
commit 8dec88070d964bfeb4198f34cb5956d89dd1f557 upstream.
Haibo Li reported:
| Unable to handle kernel paging request at virtual address
| ffffff802a0d8d7171
| Mem abort info⭕
| ESR = 0x9600002121
| EC = 0x25: DABT (current EL), IL = 32 bitsts
| SET = 0, FnV = 0 0
| EA = 0, S1PTW = 0 0
| FSC = 0x21: alignment fault
| Data abort info⭕
| ISV = 0, ISS = 0x0000002121
| CM = 0, WnR = 0 0
| swapper pgtable: 4k pages, 39-bit VAs, pgdp=000000002835200000
| [ffffff802a0d8d71] pgd=180000005fbf9003, p4d=180000005fbf9003,
| pud=180000005fbf9003, pmd=180000005fbe8003, pte=006800002a0d8707
| Internal error: Oops: 96000021 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 2 PID: 45 Comm: kworker/u8:2 Not tainted
| 5.15.78-android13-8-g63561175bbda-dirty #1
| ...
| pc : kcsan_setup_watchpoint+0x26c/0x6bc
| lr : kcsan_setup_watchpoint+0x88/0x6bc
| sp : ffffffc00ab4b7f0
| x29: ffffffc00ab4b800 x28: ffffff80294fe588 x27: 0000000000000001
| x26: 0000000000000019 x25: 0000000000000001 x24: ffffff80294fdb80
| x23: 0000000000000000 x22: ffffffc00a70fb68 x21: ffffff802a0d8d71
| x20: 0000000000000002 x19: 0000000000000000 x18: ffffffc00a9bd060
| x17: 0000000000000001 x16: 0000000000000000 x15: ffffffc00a59f000
| x14: 0000000000000001 x13: 0000000000000000 x12: ffffffc00a70faa0
| x11: 00000000aaaaaaab x10: 0000000000000054 x9 : ffffffc00839adf8
| x8 : ffffffc009b4cf00 x7 : 0000000000000000 x6 : 0000000000000007
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffffc00a70fb70
| x2 : 0005ff802a0d8d71 x1 : 0000000000000000 x0 : 0000000000000000
| Call trace:
| kcsan_setup_watchpoint+0x26c/0x6bc
| __tsan_read2+0x1f0/0x234
| inflate_fast+0x498/0x750
| zlib_inflate+0x1304/0x2384
| __gunzip+0x3a0/0x45c
| gunzip+0x20/0x30
| unpack_to_rootfs+0x2a8/0x3fc
| do_populate_rootfs+0xe8/0x11c
| async_run_entry_fn+0x58/0x1bc
| process_one_work+0x3ec/0x738
| worker_thread+0x4c4/0x838
| kthread+0x20c/0x258
| ret_from_fork+0x10/0x20
| Code: b8bfc2a8 2a0803f7 14000007 d503249f (78bfc2a8) )
| ---[ end trace 613a943cb0a572b6 ]-----
The reason for this is that on certain arm64 configuration since
|
||
Chen Yu
|
72f3217aa1 |
PM: hibernate: Do not get block device exclusively in test_resume mode
[ Upstream commit 5904de0d735bbb3b4afe9375c5b4f9748f882945 ] The system refused to do a test_resume because it found that the swap device has already been taken by someone else. Specifically, the swsusp_check()->blkdev_get_by_dev(FMODE_EXCL) is supposed to do this check. Steps to reproduce: dd if=/dev/zero of=/swapfile bs=$(cat /proc/meminfo | awk '/MemTotal/ {print $2}') count=1024 conv=notrunc mkswap /swapfile swapon /swapfile swap-offset /swapfile echo 34816 > /sys/power/resume_offset echo test_resume > /sys/power/disk echo disk > /sys/power/state PM: Using 3 thread(s) for compression PM: Compressing and saving image data (293150 pages)... PM: Image saving progress: 0% PM: Image saving progress: 10% ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata1.00: configured for UDMA/100 ata2: SATA link down (SStatus 0 SControl 300) ata5: SATA link down (SStatus 0 SControl 300) ata6: SATA link down (SStatus 0 SControl 300) ata3: SATA link down (SStatus 0 SControl 300) ata4: SATA link down (SStatus 0 SControl 300) PM: Image saving progress: 20% PM: Image saving progress: 30% PM: Image saving progress: 40% PM: Image saving progress: 50% pcieport 0000:00:02.5: pciehp: Slot(0-5): No device found PM: Image saving progress: 60% PM: Image saving progress: 70% PM: Image saving progress: 80% PM: Image saving progress: 90% PM: Image saving done PM: hibernation: Wrote 1172600 kbytes in 2.70 seconds (434.29 MB/s) PM: S| PM: hibernation: Basic memory bitmaps freed PM: Image not found (code -16) This is because when using the swapfile as the hibernation storage, the block device where the swapfile is located has already been mounted by the OS distribution(usually mounted as the rootfs). This is not an issue for normal hibernation, because software_resume()->swsusp_check() happens before the block device(rootfs) mount. But it is a problem for the test_resume mode. Because when test_resume happens, the block device has been mounted already. Thus remove the FMODE_EXCL for test_resume mode. This would not be a problem because in test_resume stage, the processes have already been frozen, and the race condition described in Commit |
||
Chen Yu
|
208ba216cc |
PM: hibernate: Turn snapshot_test into global variable
[ Upstream commit 08169a162f97819d3e5b4a342bb9cf5137787154 ] There is need to check snapshot_test and open block device in different mode, so as to avoid the race condition. No functional changes intended. Suggested-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Stable-dep-of: 5904de0d735b ("PM: hibernate: Do not get block device exclusively in test_resume mode") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Geert Uytterhoeven
|
c2b990d7aa |
timekeeping: Fix references to nonexistent ktime_get_fast_ns()
[ Upstream commit 158009f1b4a33bc0f354b994eea361362bd83226 ]
There was never a function named ktime_get_fast_ns().
Presumably these should refer to ktime_get_mono_fast_ns() instead.
Fixes:
|
||
Michael Kelley
|
4aa9243ebe |
swiotlb: fix debugfs reporting of reserved memory pools
[ Upstream commit 5499d01c029069044a3b3e50501c77b474c96178 ]
For io_tlb_nslabs, the debugfs code reports the correct value for a
specific reserved memory pool. But for io_tlb_used, the value reported
is always for the default pool, not the specific reserved pool. Fix this.
Fixes:
|
||
Doug Berger
|
e6c69b06e7 |
swiotlb: relocate PageHighMem test away from rmem_swiotlb_setup
[ Upstream commit a90922fa25370902322e9de6640e58737d459a50 ]
The reservedmem_of_init_fn's are invoked very early at boot before the
memory zones have even been defined. This makes it inappropriate to test
whether the page corresponding to a PFN is in ZONE_HIGHMEM from within
one.
Removing the check allows an ARM 32-bit kernel with SPARSEMEM enabled to
boot properly since otherwise we would be de-referencing an
uninitialized sparsemem map to perform pfn_to_page() check.
The arm64 architecture happens to work (and also has no high memory) but
other 32-bit architectures could also be having similar issues.
While it would be nice to provide early feedback about a reserved DMA
pool residing in highmem, it is not possible to do that until the first
time we try to use it, which is where the check is moved to.
Fixes:
|
||
Petr Mladek
|
c3c2aee6f9 |
workqueue: Fix hung time report of worker pools
[ Upstream commit 335a42ebb0ca8ee9997a1731aaaae6dcd704c113 ]
The workqueue watchdog prints a warning when there is no progress in
a worker pool. Where the progress means that the pool started processing
a pending work item.
Note that it is perfectly fine to process work items much longer.
The progress should be guaranteed by waking up or creating idle
workers.
show_one_worker_pool() prints state of non-idle worker pool. It shows
a delay since the last pool->watchdog_ts.
The timestamp is updated when a first pending work is queued in
__queue_work(). Also it is updated when a work is dequeued for
processing in worker_thread() and rescuer_thread().
The delay is misleading when there is no pending work item. In this
case it shows how long the last work item is being proceed. Show
zero instead. There is no stall if there is no pending work.
Fixes:
|
||
Beau Belgrave
|
0489c2b2c3 |
tracing/user_events: Ensure write index cannot be negative
[ Upstream commit cd98c93286a30cc4588dfd02453bec63c2f4acf4 ]
The write index indicates which event the data is for and accesses a
per-file array. The index is passed by user processes during write()
calls as the first 4 bytes. Ensure that it cannot be negative by
returning -EINVAL to prevent out of bounds accesses.
Update ftrace self-test to ensure this occurs properly.
Link: https://lkml.kernel.org/r/20230425225107.8525-2-beaub@linux.microsoft.com
Fixes:
|
||
Schspa Shi
|
6472a6d0c7 |
sched/rt: Fix bad task migration for rt tasks
[ Upstream commit feffe5bb274dd3442080ef0e4053746091878799 ] Commit |
||
Yang Jihong
|
2d44928903 |
perf/core: Fix hardlockup failure caused by perf throttle
[ Upstream commit 15def34e2635ab7e0e96f1bc32e1b69609f14942 ] commit |
||
Libo Chen
|
944465c772 |
sched/fair: Fix inaccurate tally of ttwu_move_affine
[ Upstream commit 39afe5d6fc59237ff7738bf3ede5a8856822d59d ]
There are scenarios where non-affine wakeups are incorrectly counted as
affine wakeups by schedstats.
When wake_affine_idle() returns prev_cpu which doesn't equal to
nr_cpumask_bits, it will slip through the check: target == nr_cpumask_bits
in wake_affine() and be counted as if target == this_cpu in schedstats.
Replace target == nr_cpumask_bits with target != this_cpu to make sure
affine wakeups are accurately tallied.
Fixes:
|
||
Stanislav Fomichev
|
551a26668c |
bpf: Don't EFAULT for getsockopt with optval=NULL
[ Upstream commit 00e74ae0863827d944e36e56a4ce1e77e50edb91 ]
Some socket options do getsockopt with optval=NULL to estimate the size
of the final buffer (which is returned via optlen). This breaks BPF
getsockopt assumptions about permitted optval buffer size. Let's enforce
these assumptions only when non-NULL optval is provided.
Fixes:
|
||
Alexei Starovoitov
|
c3fb321447 |
bpf: Fix race between btf_put and btf_idr walk.
[ Upstream commit acf1c3d68e9a31f10d92bc67ad4673cdae5e8d92 ] Florian and Eduard reported hard dead lock: [ 58.433327] _raw_spin_lock_irqsave+0x40/0x50 [ 58.433334] btf_put+0x43/0x90 [ 58.433338] bpf_find_btf_id+0x157/0x240 [ 58.433353] btf_parse_fields+0x921/0x11c0 This happens since btf->refcount can be 1 at the time of btf_put() and btf_put() will call btf_free_id() which will try to grab btf_idr_lock and will dead lock. Avoid the issue by doing btf_put() without locking. Fixes: |
||
Feng Zhou
|
52c3d68d99 |
bpf/btf: Fix is_int_ptr()
[ Upstream commit 91f2dc6838c19342f7f2993627c622835cc24890 ] When tracing a kernel function with arg type is u32*, btf_ctx_access() would report error: arg2 type INT is not a struct. The commit |
||
Daniel Borkmann
|
f9361cf40b |
bpf: Fix __reg_bound_offset 64->32 var_off subreg propagation
[ Upstream commit 7be14c1c9030f73cc18b4ff23b78a0a081f16188 ] Xu reports that after commit |
||
Luis Gerhorst
|
157c84b793 |
bpf: Remove misleading spec_v1 check on var-offset stack read
[ Upstream commit 082cdc69a4651dd2a77539d69416a359ed1214f5 ] For every BPF_ADD/SUB involving a pointer, adjust_ptr_min_max_vals() ensures that the resulting pointer has a constant offset if bypass_spec_v1 is false. This is ensured by calling sanitize_check_bounds() which in turn calls check_stack_access_for_ptr_arithmetic(). There, -EACCESS is returned if the register's offset is not constant, thereby rejecting the program. In summary, an unprivileged user must never be able to create stack pointers with a variable offset. That is also the case, because a respective check in check_stack_write() is missing. If they were able to create a variable-offset pointer, users could still use it in a stack-write operation to trigger unsafe speculative behavior [1]. Because unprivileged users must already be prevented from creating variable-offset stack pointers, viable options are to either remove this check (replacing it with a clarifying comment), or to turn it into a "verifier BUG"-message, also adding a similar check in check_stack_write() (for consistency, as a second-level defense). This patch implements the first option to reduce verifier bloat. This check was introduced by commit |
||
Andrii Nakryiko
|
a62ba7e0d2 |
bpf: fix precision propagation verbose logging
[ Upstream commit 34f0677e7afd3a292bc1aadda7ce8e35faedb204 ] Fix wrong order of frame index vs register/slot index in precision propagation verbose (level 2) output. It's wrong and very confusing as is. Fixes: 529409ea92d5 ("bpf: propagate precision across all frames, not just the last one") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230313184017.4083374-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Andrii Nakryiko
|
0049d2edda |
bpf: take into account liveness when propagating precision
[ Upstream commit 52c2b005a3c18c565fc70cfd0ca49375f301e952 ]
When doing state comparison, if old state has register that is not
marked as REG_LIVE_READ, then we just skip comparison, regardless what's
the state of corresponing register in current state. This is because not
REG_LIVE_READ register is irrelevant for further program execution and
correctness. All good here.
But when we get to precision propagation, after two states were declared
equivalent, we don't take into account old register's liveness, and thus
attempt to propagate precision for register in current state even if
that register in old state was not REG_LIVE_READ anymore. This is bad,
because register in current state could be anything at all and this
could cause -EFAULT due to internal logic bugs.
Fix by taking into account REG_LIVE_READ liveness mark to keep the logic
in state comparison in sync with precision propagation.
Fixes:
|
||
Sebastian Andrzej Siewior
|
290e26ec0d |
tick/common: Align tick period with the HZ tick.
[ Upstream commit e9523a0d81899361214d118ad60ef76f0e92f71d ]
With HIGHRES enabled tick_sched_timer() is programmed every jiffy to
expire the timer_list timers. This timer is programmed accurate in
respect to CLOCK_MONOTONIC so that 0 seconds and nanoseconds is the
first tick and the next one is 1000/CONFIG_HZ ms later. For HZ=250 it is
every 4 ms and so based on the current time the next tick can be
computed.
This accuracy broke since the commit mentioned below because the jiffy
based clocksource is initialized with higher accuracy in
read_persistent_wall_and_boot_offset(). This higher accuracy is
inherited during the setup in tick_setup_device(). The timer still fires
every 4ms with HZ=250 but timer is no longer aligned with
CLOCK_MONOTONIC with 0 as it origin but has an offset in the us/ns part
of the timestamp. The offset differs with every boot and makes it
impossible for user land to align with the tick.
Align the tick period with CLOCK_MONOTONIC ensuring that it is always a
multiple of 1000/CONFIG_HZ ms.
Fixes:
|
||
Zqiang
|
ae6803b663 |
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
[ Upstream commit db7b464df9d820186e98a65aa6a10f0d51fbf8ce ]
This commit adds checks for the TICK_DEP_MASK_RCU_EXP bit, thus enabling
RCU expedited grace periods to actually force-enable scheduling-clock
interrupts on holdout CPUs.
Fixes:
|
||
Ondrej Mosnacek
|
d79d3430e1 |
tracing: Fix permissions for the buffer_percent file
commit 4f94559f40ad06d627c0fdfc3319cec778a2845b upstream.
This file defines both read and write operations, yet it is being
created as read-only. This means that it can't be written to without the
CAP_DAC_OVERRIDE capability. Fix the permissions to allow root to write
to it without the need to override DAC perms.
Link: https://lore.kernel.org/linux-trace-kernel/20230503140114.3280002-1-omosnace@redhat.com
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes:
|
||
Zhang Zhengming
|
f6ee841ff2 |
relayfs: fix out-of-bounds access in relay_file_read
commit 43ec16f1450f4936025a9bdf1a273affdb9732c1 upstream.
There is a crash in relay_file_read, as the var from
point to the end of last subbuf.
The oops looks something like:
pc : __arch_copy_to_user+0x180/0x310
lr : relay_file_read+0x20c/0x2c8
Call trace:
__arch_copy_to_user+0x180/0x310
full_proxy_read+0x68/0x98
vfs_read+0xb0/0x1d0
ksys_read+0x6c/0xf0
__arm64_sys_read+0x20/0x28
el0_svc_common.constprop.3+0x84/0x108
do_el0_svc+0x74/0x90
el0_svc+0x1c/0x28
el0_sync_handler+0x88/0xb0
el0_sync+0x148/0x180
We get the condition by analyzing the vmcore:
1). The last produced byte and last consumed byte
both at the end of the last subbuf
2). A softirq calls function(e.g __blk_add_trace)
to write relay buffer occurs when an program is calling
relay_file_read_avail().
relay_file_read
relay_file_read_avail
relay_file_read_consume(buf, 0, 0);
//interrupted by softirq who will write subbuf
....
return 1;
//read_start point to the end of the last subbuf
read_start = relay_file_read_start_pos
//avail is equal to subsize
avail = relay_file_read_subbuf_avail
//from points to an invalid memory address
from = buf->start + read_start
//system is crashed
copy_to_user(buffer, from, avail)
Link: https://lkml.kernel.org/r/20230419040203.37676-1-zhang.zhengming@h3c.com
Fixes:
|
||
Zheng Yejian
|
c8a3341b33 |
rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
commit 7a29fb4a4771124bc61de397dbfc1554dbbcc19c upstream.
Registering a kprobe on __rcu_irq_enter_check_tick() can cause kernel
stack overflow as shown below. This issue can be reproduced by enabling
CONFIG_NO_HZ_FULL and booting the kernel with argument "nohz_full=",
and then giving the following commands at the shell prompt:
# cd /sys/kernel/tracing/
# echo 'p:mp1 __rcu_irq_enter_check_tick' >> kprobe_events
# echo 1 > events/kprobes/enable
This commit therefore adds __rcu_irq_enter_check_tick() to the kprobes
blacklist using NOKPROBE_SYMBOL().
Insufficient stack space to handle exception!
ESR: 0x00000000f2000004 -- BRK (AArch64)
FAR: 0x0000ffffccf3e510
Task stack: [0xffff80000ad30000..0xffff80000ad38000]
IRQ stack: [0xffff800008050000..0xffff800008058000]
Overflow stack: [0xffff089c36f9f310..0xffff089c36fa0310]
CPU: 5 PID: 190 Comm: bash Not tainted 6.2.0-rc2-00320-g1f5abbd77e2c #19
Hardware name: linux,dummy-virt (DT)
pstate: 400003c5 (nZcv DAIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : __rcu_irq_enter_check_tick+0x0/0x1b8
lr : ct_nmi_enter+0x11c/0x138
sp : ffff80000ad30080
x29: ffff80000ad30080 x28: ffff089c82e20000 x27: 0000000000000000
x26: 0000000000000000 x25: ffff089c02a8d100 x24: 0000000000000000
x23: 00000000400003c5 x22: 0000ffffccf3e510 x21: ffff089c36fae148
x20: ffff80000ad30120 x19: ffffa8da8fcce148 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: ffffa8da8e44ea6c
x14: ffffa8da8e44e968 x13: ffffa8da8e03136c x12: 1fffe113804d6809
x11: ffff6113804d6809 x10: 0000000000000a60 x9 : dfff800000000000
x8 : ffff089c026b404f x7 : 00009eec7fb297f7 x6 : 0000000000000001
x5 : ffff80000ad30120 x4 : dfff800000000000 x3 : ffffa8da8e3016f4
x2 : 0000000000000003 x1 : 0000000000000000 x0 : 0000000000000000
Kernel panic - not syncing: kernel stack overflow
CPU: 5 PID: 190 Comm: bash Not tainted 6.2.0-rc2-00320-g1f5abbd77e2c #19
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0xf8/0x108
show_stack+0x20/0x30
dump_stack_lvl+0x68/0x84
dump_stack+0x1c/0x38
panic+0x214/0x404
add_taint+0x0/0xf8
panic_bad_stack+0x144/0x160
handle_bad_stack+0x38/0x58
__bad_stack+0x78/0x7c
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
[...]
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
arm64_enter_el1_dbg.isra.0+0x14/0x20
el1_dbg+0x2c/0x90
el1h_64_sync_handler+0xcc/0xe8
el1h_64_sync+0x64/0x68
__rcu_irq_enter_check_tick+0x0/0x1b8
el1_interrupt+0x28/0x60
el1h_64_irq_handler+0x18/0x28
el1h_64_irq+0x64/0x68
__ftrace_set_clr_event_nolock+0x98/0x198
__ftrace_set_clr_event+0x58/0x80
system_enable_write+0x144/0x178
vfs_write+0x174/0x738
ksys_write+0xd0/0x188
__arm64_sys_write+0x4c/0x60
invoke_syscall+0x64/0x180
el0_svc_common.constprop.0+0x84/0x160
do_el0_svc+0x48/0xe8
el0_svc+0x34/0xd0
el0t_64_sync_handler+0xb8/0xc0
el0t_64_sync+0x190/0x194
SMP: stopping secondary CPUs
Kernel Offset: 0x28da86000000 from 0xffff800008000000
PHYS_OFFSET: 0xfffff76600000000
CPU features: 0x00000,01a00100,0000421b
Memory Limit: none
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/all/20221119040049.795065-1-zhengyejian1@huawei.com/
Fixes:
|
||
Johannes Berg
|
d9834abd8b |
ring-buffer: Sync IRQ works before buffer destruction
commit 675751bb20634f981498c7d66161584080cc061e upstream.
If something was written to the buffer just before destruction,
it may be possible (maybe not in a real system, but it did
happen in ARCH=um with time-travel) to destroy the ringbuffer
before the IRQ work ran, leading this KASAN report (or a crash
without KASAN):
BUG: KASAN: slab-use-after-free in irq_work_run_list+0x11a/0x13a
Read of size 8 at addr 000000006d640a48 by task swapper/0
CPU: 0 PID: 0 Comm: swapper Tainted: G W O 6.3.0-rc1 #7
Stack:
60c4f20f 0c203d48 41b58ab3 60f224fc
600477fa 60f35687 60c4f20f 601273dd
00000008 6101eb00 6101eab0 615be548
Call Trace:
[<60047a58>] show_stack+0x25e/0x282
[<60c609e0>] dump_stack_lvl+0x96/0xfd
[<60c50d4c>] print_report+0x1a7/0x5a8
[<603078d3>] kasan_report+0xc1/0xe9
[<60308950>] __asan_report_load8_noabort+0x1b/0x1d
[<60232844>] irq_work_run_list+0x11a/0x13a
[<602328b4>] irq_work_tick+0x24/0x34
[<6017f9dc>] update_process_times+0x162/0x196
[<6019f335>] tick_sched_handle+0x1a4/0x1c3
[<6019fd9e>] tick_sched_timer+0x79/0x10c
[<601812b9>] __hrtimer_run_queues.constprop.0+0x425/0x695
[<60182913>] hrtimer_interrupt+0x16c/0x2c4
[<600486a3>] um_timer+0x164/0x183
[...]
Allocated by task 411:
save_stack_trace+0x99/0xb5
stack_trace_save+0x81/0x9b
kasan_save_stack+0x2d/0x54
kasan_set_track+0x34/0x3e
kasan_save_alloc_info+0x25/0x28
____kasan_kmalloc+0x8b/0x97
__kasan_kmalloc+0x10/0x12
__kmalloc+0xb2/0xe8
load_elf_phdrs+0xee/0x182
[...]
The buggy address belongs to the object at 000000006d640800
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 584 bytes inside of
freed 1024-byte region [000000006d640800, 000000006d640c00)
Add the appropriate irq_work_sync() so the work finishes before
the buffers are destroyed.
Prior to the commit in the Fixes tag below, there was only a
single global IRQ work, so this issue didn't exist.
Link: https://lore.kernel.org/linux-trace-kernel/20230427175920.a76159263122.I8295e405c44362a86c995e9c2c37e3e03810aa56@changeid
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes:
|
||
Tze-nan Wu
|
ad7cc2a29e |
ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus
commit 7c339fb4d8577792378136c15fde773cfb863cb8 upstream.
In ring_buffer_reset_online_cpus, the buffer_size_kb write operation
may permanently fail if the cpu_online_mask changes between two
for_each_online_buffer_cpu loops. The number of increases and decreases
on both cpu_buffer->resize_disabled and cpu_buffer->record_disabled may be
inconsistent, causing some CPUs to have non-zero values for these atomic
variables after the function returns.
This issue can be reproduced by "echo 0 > trace" while hotplugging cpu.
After reproducing success, we can find out buffer_size_kb will not be
functional anymore.
To prevent leaving 'resize_disabled' and 'record_disabled' non-zero after
ring_buffer_reset_online_cpus returns, we ensure that each atomic variable
has been set up before atomic_sub() to it.
Link: https://lore.kernel.org/linux-trace-kernel/20230426062027.17451-1-Tze-nan.Wu@mediatek.com
Cc: stable@vger.kernel.org
Cc: <mhiramat@kernel.org>
Cc: npiggin@gmail.com
Fixes:
|
||
Kees Cook
|
b9f6845a49 |
kheaders: Use array declaration instead of char
commit b69edab47f1da8edd8e7bfdf8c70f51a2a5d89fb upstream.
Under CONFIG_FORTIFY_SOURCE, memcpy() will check the size of destination
and source buffers. Defining kernel_headers_data as "char" would trip
this check. Since these addresses are treated as byte arrays, define
them as arrays (as done everywhere else).
This was seen with:
$ cat /sys/kernel/kheaders.tar.xz >> /dev/null
detected buffer overflow in memcpy
kernel BUG at lib/string_helpers.c:1027!
...
RIP: 0010:fortify_panic+0xf/0x20
[...]
Call Trace:
<TASK>
ikheaders_read+0x45/0x50 [kheaders]
kernfs_fop_read_iter+0x1a4/0x2f0
...
Reported-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20230302112130.6e402a98@kernel.org/
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Fixes:
|
||
Joel Fernandes (Google)
|
3e7b8a723b |
tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
commit 58d7668242647e661a20efe065519abd6454287e upstream.
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined.
However, cpu_is_hotpluggable() still returns true for those CPUs. This causes
torture tests that do offlining to end up trying to offline this CPU causing
test failures. Such failure happens on all architectures.
Fix the repeated error messages thrown by this (even if the hotplug errors are
harmless) by asking the opinion of the nohz subsystem on whether the CPU can be
hotplugged.
[ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ]
For drivers/base/ portion:
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Zhouyi Zhou <zhouzhouyi@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: rcu <rcu@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes:
|
||
Thomas Gleixner
|
bccf9fe296 |
posix-cpu-timers: Implement the missing timer_wait_running callback
commit f7abf14f0001a5a47539d9f60bbdca649e43536b upstream.
For some unknown reason the introduction of the timer_wait_running callback
missed to fixup posix CPU timers, which went unnoticed for almost four years.
Marco reported recently that the WARN_ON() in timer_wait_running()
triggers with a posix CPU timer test case.
Posix CPU timers have two execution models for expiring timers depending on
CONFIG_POSIX_CPU_TIMERS_TASK_WORK:
1) If not enabled, the expiry happens in hard interrupt context so
spin waiting on the remote CPU is reasonably time bound.
Implement an empty stub function for that case.
2) If enabled, the expiry happens in task work before returning to user
space or guest mode. The expired timers are marked as firing and moved
from the timer queue to a local list head with sighand lock held. Once
the timers are moved, sighand lock is dropped and the expiry happens in
fully preemptible context. That means the expiring task can be scheduled
out, migrated, interrupted etc. So spin waiting on it is more than
suboptimal.
The timer wheel has a timer_wait_running() mechanism for RT, which uses
a per CPU timer-base expiry lock which is held by the expiry code and the
task waiting for the timer function to complete blocks on that lock.
This does not work in the same way for posix CPU timers as there is no
timer base and expiry for process wide timers can run on any task
belonging to that process, but the concept of waiting on an expiry lock
can be used too in a slightly different way:
- Add a mutex to struct posix_cputimers_work. This struct is per task
and used to schedule the expiry task work from the timer interrupt.
- Add a task_struct pointer to struct cpu_timer which is used to store
a the task which runs the expiry. That's filled in when the task
moves the expired timers to the local expiry list. That's not
affecting the size of the k_itimer union as there are bigger union
members already
- Let the task take the expiry mutex around the expiry function
- Let the waiter acquire a task reference with rcu_read_lock() held and
block on the expiry mutex
This avoids spin-waiting on a task which might not even be on a CPU and
works nicely for RT too.
Fixes:
|
||
Qais Yousef
|
d362a03d92 |
sched/fair: Fixes for capacity inversion detection
commit: da07d2f9c153e457e845d4dcfdd13568d71d18a4 upstream. Traversing the Perf Domains requires rcu_read_lock() to be held and is conditional on sched_energy_enabled(). Ensure right protections applied. Also skip capacity inversion detection for our own pd; which was an error. Fixes: 44c7b80bffc3 ("sched/fair: Detect capacity inversion") Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230112122708.330667-3-qyousef@layalina.io (cherry picked from commit da07d2f9c153e457e845d4dcfdd13568d71d18a4) Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Qais Yousef
|
799c7301de |
sched/fair: Consider capacity inversion in util_fits_cpu()
commit: aa69c36f31aadc1669bfa8a3de6a47b5e6c98ee8 upstream. We do consider thermal pressure in util_fits_cpu() for uclamp_min only. With the exception of the biggest cores which by definition are the max performance point of the system and all tasks by definition should fit. Even under thermal pressure, the capacity of the biggest CPU is the highest in the system and should still fit every task. Except when it reaches capacity inversion point, then this is no longer true. We can handle this by using the inverted capacity as capacity_orig in util_fits_cpu(). Which not only addresses the problem above, but also ensure uclamp_max now considers the inverted capacity. Force fitting a task when a CPU is in this adverse state will contribute to making the thermal throttling last longer. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-10-qais.yousef@arm.com (cherry picked from commit aa69c36f31aadc1669bfa8a3de6a47b5e6c98ee8) Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Qais Yousef
|
fe1c982958 |
sched/fair: Detect capacity inversion
commit: 44c7b80bffc3a657a36857098d5d9c49d94e652b upstream. Check each performance domain to see if thermal pressure is causing its capacity to be lower than another performance domain. We assume that each performance domain has CPUs with the same capacities, which is similar to an assumption made in energy_model.c We also assume that thermal pressure impacts all CPUs in a performance domain equally. If there're multiple performance domains with the same capacity_orig, we will trigger a capacity inversion if the domain is under thermal pressure. The new cpu_in_capacity_inversion() should help users to know when information about capacity_orig are not reliable and can opt in to use the inverted capacity as the 'actual' capacity_orig. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220804143609.515789-9-qais.yousef@arm.com (cherry picked from commit 44c7b80bffc3a657a36857098d5d9c49d94e652b) Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Ondrej Mosnacek
|
ec90129b91 |
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
commit 659c0ce1cb9efc7f58d380ca4bb2a51ae9e30553 upstream.
Linux Security Modules (LSMs) that implement the "capable" hook will
usually emit an access denial message to the audit log whenever they
"block" the current task from using the given capability based on their
security policy.
The occurrence of a denial is used as an indication that the given task
has attempted an operation that requires the given access permission, so
the callers of functions that perform LSM permission checks must take care
to avoid calling them too early (before it is decided if the permission is
actually needed to perform the requested operation).
The __sys_setres[ug]id() functions violate this convention by first
calling ns_capable_setid() and only then checking if the operation
requires the capability or not. It means that any caller that has the
capability granted by DAC (task's capability set) but not by MAC (LSMs)
will generate a "denied" audit record, even if is doing an operation for
which the capability is not required.
Fix this by reordering the checks such that ns_capable_setid() is checked
last and -EPERM is returned immediately if it returns false.
While there, also do two small optimizations:
* move the capability check before prepare_creds() and
* bail out early in case of a no-op.
Link: https://lkml.kernel.org/r/20230217162154.837549-1-omosnace@redhat.com
Fixes:
|
||
Daniel Borkmann
|
89603f4c91 |
bpf: Fix incorrect verifier pruning due to missing register precision taints
[ Upstream commit 71b547f561247897a0a14f3082730156c0533fed ]
Juan Jose et al reported an issue found via fuzzing where the verifier's
pruning logic prematurely marks a program path as safe.
Consider the following program:
0: (b7) r6 = 1024
1: (b7) r7 = 0
2: (b7) r8 = 0
3: (b7) r9 = -2147483648
4: (97) r6 %= 1025
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2
7: (97) r6 %= 1
8: (b7) r9 = 0
9: (bd) if r6 <= r9 goto pc+1
10: (b7) r6 = 0
11: (b7) r0 = 0
12: (63) *(u32 *)(r10 -4) = r0
13: (18) r4 = 0xffff888103693400 // map_ptr(ks=4,vs=48)
15: (bf) r1 = r4
16: (bf) r2 = r10
17: (07) r2 += -4
18: (85) call bpf_map_lookup_elem#1
19: (55) if r0 != 0x0 goto pc+1
20: (95) exit
21: (77) r6 >>= 10
22: (27) r6 *= 8192
23: (bf) r1 = r0
24: (0f) r0 += r6
25: (79) r3 = *(u64 *)(r0 +0)
26: (7b) *(u64 *)(r1 +0) = r3
27: (95) exit
The verifier treats this as safe, leading to oob read/write access due
to an incorrect verifier conclusion:
func#0 @0
0: R1=ctx(off=0,imm=0) R10=fp0
0: (b7) r6 = 1024 ; R6_w=1024
1: (b7) r7 = 0 ; R7_w=0
2: (b7) r8 = 0 ; R8_w=0
3: (b7) r9 = -2147483648 ; R9_w=-2147483648
4: (97) r6 %= 1025 ; R6_w=scalar()
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff00000000; 0xffffffff)) R9_w=-2147483648
7: (97) r6 %= 1 ; R6_w=scalar()
8: (b7) r9 = 0 ; R9=0
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
10: (b7) r6 = 0 ; R6_w=0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 9
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=0
22: (27) r6 *= 8192 ; R6_w=0
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 19
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
last_idx 18 first_idx 9
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
regs=40 stack=0 before 10: (b7) r6 = 0
25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
27: (95) exit
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1
frame 0: propagating r6
last_idx 19 first_idx 11
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0
last_idx 9 first_idx 9
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=0 R10=fp0
last_idx 8 first_idx 0
regs=40 stack=0 before 8: (b7) r9 = 0
regs=40 stack=0 before 7: (97) r6 %= 1
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=40 stack=0 before 5: (05) goto pc+0
regs=40 stack=0 before 4: (97) r6 %= 1025
regs=40 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
19: safe
frame 0: propagating r6
last_idx 9 first_idx 0
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=40 stack=0 before 5: (05) goto pc+0
regs=40 stack=0 before 4: (97) r6 %= 1025
regs=40 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
from 6 to 9: safe
verification time 110 usec
stack depth 4
processed 36 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 2
The verifier considers this program as safe by mistakenly pruning unsafe
code paths. In the above func#0, code lines 0-10 are of interest. In line
0-3 registers r6 to r9 are initialized with known scalar values. In line 4
the register r6 is reset to an unknown scalar given the verifier does not
track modulo operations. Due to this, the verifier can also not determine
precisely which branches in line 6 and 9 are taken, therefore it needs to
explore them both.
As can be seen, the verifier starts with exploring the false/fall-through
paths first. The 'from 19 to 21' path has both r6=0 and r9=0 and the pointer
arithmetic on r0 += r6 is therefore considered safe. Given the arithmetic,
r6 is correctly marked for precision tracking where backtracking kicks in
where it walks back the current path all the way where r6 was set to 0 in
the fall-through branch.
Next, the pruning logics pops the path 'from 9 to 11' from the stack. Also
here, the state of the registers is the same, that is, r6=0 and r9=0, so
that at line 19 the path can be pruned as it is considered safe. It is
interesting to note that the conditional in line 9 turned r6 into a more
precise state, that is, in the fall-through path at the beginning of line
10, it is R6=scalar(umin=1), and in the branch-taken path (which is analyzed
here) at the beginning of line 11, r6 turned into a known const r6=0 as
r9=0 prior to that and therefore (unsigned) r6 <= 0 concludes that r6 must
be 0 (**):
[...] ; R6_w=scalar()
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
[...]
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
[...]
The next path is 'from 6 to 9'. The verifier considers the old and current
state equivalent, and therefore prunes the search incorrectly. Looking into
the two states which are being compared by the pruning logic at line 9, the
old state consists of R6_rwD=Pscalar() R9_rwD=0 R10=fp0 and the new state
consists of R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968)
R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0. While r6 had the reg->precise flag
correctly set in the old state, r9 did not. Both r6'es are considered as
equivalent given the old one is a superset of the current, more precise one,
however, r9's actual values (0 vs 0x80000000) mismatch. Given the old r9
did not have reg->precise flag set, the verifier does not consider the
register as contributing to the precision state of r6, and therefore it
considered both r9 states as equivalent. However, for this specific pruned
path (which is also the actual path taken at runtime), register r6 will be
0x400 and r9 0x80000000 when reaching line 21, thus oob-accessing the map.
The purpose of precision tracking is to initially mark registers (including
spilled ones) as imprecise to help verifier's pruning logic finding equivalent
states it can then prune if they don't contribute to the program's safety
aspects. For example, if registers are used for pointer arithmetic or to pass
constant length to a helper, then the verifier sets reg->precise flag and
backtracks the BPF program instruction sequence and chain of verifier states
to ensure that the given register or stack slot including their dependencies
are marked as precisely tracked scalar. This also includes any other registers
and slots that contribute to a tracked state of given registers/stack slot.
This backtracking relies on recorded jmp_history and is able to traverse
entire chain of parent states. This process ends only when all the necessary
registers/slots and their transitive dependencies are marked as precise.
The backtrack_insn() is called from the current instruction up to the first
instruction, and its purpose is to compute a bitmask of registers and stack
slots that need precision tracking in the parent's verifier state. For example,
if a current instruction is r6 = r7, then r6 needs precision after this
instruction and r7 needs precision before this instruction, that is, in the
parent state. Hence for the latter r7 is marked and r6 unmarked.
For the class of jmp/jmp32 instructions, backtrack_insn() today only looks
at call and exit instructions and for all other conditionals the masks
remain as-is. However, in the given situation register r6 has a dependency
on r9 (as described above in **), so also that one needs to be marked for
precision tracking. In other words, if an imprecise register influences a
precise one, then the imprecise register should also be marked precise.
Meaning, in the parent state both dest and src register need to be tracked
for precision and therefore the marking must be more conservative by setting
reg->precise flag for both. The precision propagation needs to cover both
for the conditional: if the src reg was marked but not the dst reg and vice
versa.
After the fix the program is correctly rejected:
func#0 @0
0: R1=ctx(off=0,imm=0) R10=fp0
0: (b7) r6 = 1024 ; R6_w=1024
1: (b7) r7 = 0 ; R7_w=0
2: (b7) r8 = 0 ; R8_w=0
3: (b7) r9 = -2147483648 ; R9_w=-2147483648
4: (97) r6 %= 1025 ; R6_w=scalar()
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff80000000; 0x7fffffff),u32_min=-2147483648) R9_w=-2147483648
7: (97) r6 %= 1 ; R6_w=scalar()
8: (b7) r9 = 0 ; R9=0
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
10: (b7) r6 = 0 ; R6_w=0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 9
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=0
22: (27) r6 *= 8192 ; R6_w=0
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 19
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
last_idx 18 first_idx 9
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
regs=40 stack=0 before 10: (b7) r6 = 0
25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
27: (95) exit
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1
frame 0: propagating r6
last_idx 19 first_idx 11
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0
last_idx 9 first_idx 9
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
parent didn't have regs=240 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=P0 R10=fp0
last_idx 8 first_idx 0
regs=240 stack=0 before 8: (b7) r9 = 0
regs=40 stack=0 before 7: (97) r6 %= 1
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
19: safe
from 6 to 9: R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0
9: (bd) if r6 <= r9 goto pc+1
last_idx 9 first_idx 0
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
last_idx 9 first_idx 0
regs=200 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
11: R6=scalar(umax=18446744071562067968) R9=-2147483648
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=3,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0_w=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=scalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=scalar(umax=18014398507384832,var_off=(0x0; 0x3fffffffffffff))
22: (27) r6 *= 8192 ; R6_w=scalar(smax=9223372036854767616,umax=18446744073709543424,var_off=(0x0; 0xffffffffffffe000),s32_max=2147475456,u32_max=-8192)
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 21
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
parent didn't have regs=40 stack=0 marks: R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R6_r=Pscalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm????
last_idx 19 first_idx 11
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0
last_idx 9 first_idx 0
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
regs=240 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
math between map_value pointer and register with unbounded min value is not allowed
verification time 886 usec
stack depth 4
processed 49 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 2
Fixes:
|
||
Waiman Long
|
243d9f3a11 |
cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods
[ Upstream commit eee87853794187f6adbe19533ed79c8b44b36a91 ]
In the case of CLONE_INTO_CGROUP, not all cpusets are ready to accept
new tasks. It is too late to check that in cpuset_fork(). So we need
to add the cpuset_can_fork() and cpuset_cancel_fork() methods to
pre-check it before we can allow attachment to a different cpuset.
We also need to set the attach_in_progress flag to alert other code
that a new task is going to be added to the cpuset.
Fixes:
|
||
Waiman Long
|
e33ab28395 |
cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
[ Upstream commit 42a11bf5c5436e91b040aeb04063be1710bb9f9c ] By default, the clone(2) syscall spawn a child process into the same cgroup as its parent. With the use of the CLONE_INTO_CGROUP flag introduced by commit |
||
Waiman Long
|
ff5a4fe259 |
cgroup/cpuset: Skip spread flags update on v2
[ Upstream commit 18f9a4d47527772515ad6cbdac796422566e6440 ] Cpuset v2 has no spread flags to set. So we can skip spread flags update if cpuset v2 is being used. Also change the name to cpuset_update_task_spread_flags() to indicate that there are multiple spread flags. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Stable-dep-of: 42a11bf5c543 ("cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Vincent Guittot
|
31c5ad43bd |
sched/fair: Fix imbalance overflow
[ Upstream commit 91dcf1e8068e9a8823e419a7a34ff4341275fb70 ]
When local group is fully busy but its average load is above system load,
computing the imbalance will overflow and local group is not the best
target for pulling this load.
Fixes:
|
||
Waiman Long
|
a746ad276b |
cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()
commit ba9182a89626d5f83c2ee4594f55cb9c1e60f0e2 upstream.
After a successful cpuset_can_attach() call which increments the
attach_in_progress flag, either cpuset_cancel_attach() or cpuset_attach()
will be called later. In cpuset_attach(), tasks in cpuset_attach_wq,
if present, will be woken up at the end. That is not the case in
cpuset_cancel_attach(). So missed wakeup is possible if the attach
operation is somehow cancelled. Fix that by doing the wakeup in
cpuset_cancel_attach() as well.
Fixes:
|
||
Waiman Long
|
f06c9b0154 |
cgroup/cpuset: Fix partition root's cpuset.cpus update bug
commit 292fd843de26c551856e66faf134512c52dd78b4 upstream. It was found that commit 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") introduced a bug that corrupted "cpuset.cpus" of a partition root when it was updated. It is because the tmp->new_cpus field of the passed tmp parameter of update_parent_subparts_cpumask() should not be used at all as it contains important cpumask data that should not be overwritten. Fix it by using tmp->addmask instead. Also update update_cpumask() to make sure that trialcs->cpu_allowed will not be corrupted until it is no longer needed. Fixes: 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") Signed-off-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v6.2+ Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Josh Don
|
53244494bf |
cgroup: fix display of forceidle time at root
commit fcdb1eda5302599045bb366e679cccb4216f3873 upstream.
We need to reset forceidle_sum to 0 when reading from root, since the
bstat we accumulate into is stack allocated.
To make this more robust, just replace the existing cputime reset with a
memset of the overall bstat.
Signed-off-by: Josh Don <joshdon@google.com>
Fixes:
|
||
Steven Rostedt (Google)
|
f58574f238 |
tracing: Have tracing_snapshot_instance_cond() write errors to the appropriate instance
[ Upstream commit 9d52727f8043cfda241ae96896628d92fa9c50bb ]
If a trace instance has a failure with its snapshot code, the error
message is to be written to that instance's buffer. But currently, the
message is written to the top level buffer. Worse yet, it may also disable
the top level buffer and not the instance that had the issue.
Link: https://lkml.kernel.org/r/20230405022341.688730321@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <zwisler@google.com>
Fixes:
|
||
Steven Rostedt (Google)
|
5620eeb379 |
tracing: Add trace_array_puts() to write into instance
[ Upstream commit d503b8f7474fe7ac616518f7fc49773cbab49f36 ] Add a generic trace_array_puts() that can be used to "trace_puts()" into an allocated trace_array instance. This is just another variant of trace_array_printk(). Link: https://lkml.kernel.org/r/20230207173026.584717290@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: 9d52727f8043 ("tracing: Have tracing_snapshot_instance_cond() write errors to the appropriate instance") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
Tetsuo Handa
|
3756171b97 |
cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex
[ Upstream commit 57dcd64c7e036299ef526b400a8d12b8a2352f26 ] syzbot is reporting circular locking dependency between cpu_hotplug_lock and freezer_mutex, for commit |
||
Liam R. Howlett
|
1c87a6f82a |
mm: enable maple tree RCU mode by default.
commit 3dd4432549415f3c65dd52d5c687629efbf4ece1 upstream.
Use the maple tree in RCU mode for VMA tracking.
The maple tree tracks the stack and is able to update the pivot
(lower/upper boundary) in-place to allow the page fault handler to write
to the tree while holding just the mmap read lock. This is safe as the
writes to the stack have a guard VMA which ensures there will always be
a NULL in the direction of the growth and thus will only update a pivot.
It is possible, but not recommended, to have VMAs that grow up/down
without guard VMAs. syzbot has constructed a testcase which sets up a
VMA to grow and consume the empty space. Overwriting the entire NULL
entry causes the tree to be altered in a way that is not safe for
concurrent readers; the readers may see a node being rewritten or one
that does not match the maple state they are using.
Enabling RCU mode allows the concurrent readers to see a stable node and
will return the expected result.
Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
Cc: stable@vger.kernel.org
Fixes:
|
||
Zheng Yejian
|
3663f5d5bb |
ring-buffer: Fix race while reader and writer are on the same page
commit 6455b6163d8c680366663cdb8c679514d55fc30c upstream. When user reads file 'trace_pipe', kernel keeps printing following logs that warn at "cpu_buffer->reader_page->read > rb_page_size(reader)" in rb_get_reader_page(). It just looks like there's an infinite loop in tracing_read_pipe(). This problem occurs several times on arm64 platform when testing v5.10 and below. Call trace: rb_get_reader_page+0x248/0x1300 rb_buffer_peek+0x34/0x160 ring_buffer_peek+0xbc/0x224 peek_next_entry+0x98/0xbc __find_next_entry+0xc4/0x1c0 trace_find_next_entry_inc+0x30/0x94 tracing_read_pipe+0x198/0x304 vfs_read+0xb4/0x1e0 ksys_read+0x74/0x100 __arm64_sys_read+0x24/0x30 el0_svc_common.constprop.0+0x7c/0x1bc do_el0_svc+0x2c/0x94 el0_svc+0x20/0x30 el0_sync_handler+0xb0/0xb4 el0_sync+0x160/0x180 Then I dump the vmcore and look into the problematic per_cpu ring_buffer, I found that tail_page/commit_page/reader_page are on the same page while reader_page->read is obviously abnormal: tail_page == commit_page == reader_page == { .write = 0x100d20, .read = 0x8f9f4805, // Far greater than 0xd20, obviously abnormal!!! .entries = 0x10004c, .real_end = 0x0, .page = { .time_stamp = 0x857257416af0, .commit = 0xd20, // This page hasn't been full filled. // .data[0...0xd20] seems normal. } } The root cause is most likely the race that reader and writer are on the same page while reader saw an event that not fully committed by writer. To fix this, add memory barriers to make sure the reader can see the content of what is committed. Since commit |
||
Steven Rostedt (Google)
|
dc48648699 |
tracing/synthetic: Make lastcmd_mutex static
commit 31c683967174b487939efaf65e41f5ff1404e141 upstream. The lastcmd_mutex is only used in trace_events_synth.c and should be static. Link: https://lore.kernel.org/linux-trace-kernel/202304062033.cRStgOuP-lkp@intel.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230406111033.6e26de93@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Tze-nan Wu <Tze-nan.Wu@mediatek.com> Fixes: 4ccf11c4e8a8e ("tracing/synthetic: Fix races on freeing last_cmd") Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |