c2a0eb3b20
40569 Commits
|
1b7b048c22 |
bpf: Force kprobe multi expected_attach_type for kprobe_multi link
[ Upstream commit db8eae6bc5c702d8e3ab2d0c6bb5976c131576eb ]
We currently allow creating a perf link for a program with
expected_attach_type == BPF_TRACE_KPROBE_MULTI.
This will cause a crash when we call helpers like get_attach_cookie or
get_func_ip in such a program, because they will call the kprobe_multi
version (current->bpf_ctx context setup) of those helpers while the
perf_link current->bpf_ctx context setup is expected.
Make sure that the BPF_TRACE_KPROBE_MULTI expected_attach_type is used
only for programs attaching through the kprobe_multi link.
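A minimal sketch of the kind of guard this describes, rejecting perf-link attachment for a program whose expected_attach_type was set for the kprobe_multi link; the helper name and placement are illustrative, not the upstream diff:
```c
/* Illustrative only: programs with the kprobe_multi attach type assume
 * the kprobe_multi current->bpf_ctx layout, so refuse to attach them
 * through a perf link.
 */
static int perf_link_attach_check(const struct bpf_prog *prog)
{
	if (prog->type == BPF_PROG_TYPE_KPROBE &&
	    prog->expected_attach_type == BPF_TRACE_KPROBE_MULTI)
		return -EINVAL;	/* only kprobe_multi links may use this type */
	return 0;
}
```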
Fixes:
|
||
|
fc3afb3378 |
bpf/btf: Accept function names that contain dots
[ Upstream commit 9724160b3942b0a967b91a59f81da5593f28b8ba ]
When building a kernel with LLVM=1, LLVM_IAS=0 and CONFIG_KASAN=y, LLVM
leaves DWARF tags for the "asan.module_ctor" & co symbols. In turn,
pahole creates BTF_KIND_FUNC entries for these and this makes the BTF
metadata validation fail because they contain a dot.
In a dramatic turn of events, this BTF verification failure can cause
the netfilter_bpf initialization to fail, causing netfilter_core to
free the netfilter_helper hashmap and netfilter_ftp to trigger a
use-after-free. The risk of u-a-f in netfilter will be addressed
separately but the existence of "asan.module_ctor" debug info under some
build conditions sounds like a good enough reason to accept functions
that contain dots in BTF.
Although using only LLVM=1 is the recommended way to compile clang-based
kernels, users can certainly do LLVM=1, LLVM_IAS=0 as well, and we still
try to support that combination according to Nick. To clarify:
- kernel > v5.10: LLVM=1 (LLVM_IAS=0 is not the default) is recommended,
but a user can still set LLVM=1, LLVM_IAS=0 to trigger the issue
- kernel <= 5.10: LLVM=1 (LLVM_IAS=0 is the default) is recommended, in
which case GNU as will be used
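A hedged sketch of the relaxation described: a name validator for BTF_KIND_FUNC entries that tolerates dots, so compiler-generated symbols like "asan.module_ctor" pass validation. The helper name and exact character rules are illustrative, not the upstream implementation:
```c
#include <ctype.h>
#include <stdbool.h>

/* Illustrative: accept C-identifier characters plus '.' in function names
 * so that entries such as "asan.module_ctor" do not fail BTF validation.
 */
static bool btf_func_name_valid(const char *name)
{
	if (!*name || isdigit((unsigned char)*name))
		return false;
	for (const char *p = name; *p; p++) {
		if (!isalnum((unsigned char)*p) && *p != '_' && *p != '.')
			return false;
	}
	return true;
}
```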
Fixes:
|
||
|
d9a0b1a53c |
bpf: Fix verifier id tracking of scalars on spill
[ Upstream commit 713274f1f2c896d37017efee333fd44149710119 ]
The following scenario describes a bug in the verifier where it
incorrectly concludes about equivalent scalar IDs which could lead to
verifier bypass in privileged mode:
1. Prepare a 32-bit rogue number.
2. Put the rogue number into the upper half of a 64-bit register, and
roll a random (unknown to the verifier) bit in the lower half. The
rest of the bits should be zero (although variations are possible).
3. Assign an ID to the register by MOVing it to another arbitrary
register.
4. Perform a 32-bit spill of the register, then perform a 32-bit fill to
another register. Due to a bug in the verifier, the ID will be
preserved, although the new register will contain only the lower 32
bits, i.e. all zeros except one random bit.
At this point there are two registers with different values but the same
ID, which means the integrity of the verifier state has been corrupted.
5. Compare the new 32-bit register with 0. In the branch where it's
equal to 0, the verifier will believe that the original 64-bit
register is also 0, because it has the same ID, but its actual value
still contains the rogue number in the upper half.
Some optimizations of the verifier prevent the actual bypass, so
extra care is needed: the comparison must be between two registers,
and both branches must be reachable (this is why one random bit is
needed). Both branches are still suitable for the bypass.
6. Right shift the original register by 32 bits to pop the rogue number.
7. Use the rogue number as an offset with any pointer. The verifier will
believe that the offset is 0, while in reality it's the given number.
The fix is similar to the 32-bit BPF_MOV handling in check_alu_op for
SCALAR_VALUE. If the spill is narrowing the actual register value, don't
keep the ID, make sure it's reset to 0.
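A sketch of the fix as described, assuming a spill path that copies the source register state into the stack slot; the structure is simplified and the names are approximate:
```c
/* Illustrative: when spilling fewer than 8 bytes of a scalar, the saved
 * value no longer matches the full 64-bit register, so drop the id that
 * would otherwise make the verifier treat the two as equivalent.
 */
static void save_scalar_spill(struct bpf_reg_state *slot,
			      const struct bpf_reg_state *reg, int size)
{
	*slot = *reg;			/* copy register state into the slot */
	if (size != 8 /* BPF_REG_SIZE */)
		slot->id = 0;		/* narrowing spill: sever id-based equality */
}
```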
Fixes:
|
||
|
461fc3391c |
bpf: track immediate values written to stack by BPF_ST instruction
[ Upstream commit ecdf985d7615356b78241fdb159c091830ed0380 ]
For aligned stack writes using BPF_ST instruction track stored values in a same way BPF_STX is handled, e.g. make sure that the following commands produce similar verifier knowledge:
fp[-8] = 42; r1 = 42; fp[-8] = r1;
This covers two cases:
- non-null values written to stack are stored as spill of fake registers;
- null values written to stack are stored as STACK_ZERO marks.
Previously both cases above used STACK_MISC marks instead.
Some verifier test cases relied on the old logic to obtain STACK_MISC marks for some stack values. These test cases are updated in the same commit to avoid failures during bisect.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230214232030.1502829-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Stable-dep-of: 713274f1f2c8 ("bpf: Fix verifier id tracking of scalars on spill")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
1dfca388fc |
bpf: ensure main program has an extable
commit 0108a4e9f3584a7a2c026d1601b0682ff7335d95 upstream.
When subprograms are in use, the main program is not jit'd after the
subprograms because jit_subprogs sets a value for prog->bpf_func upon
success. Subsequent calls to the JIT are bypassed when this value is
non-NULL. This leads to a situation where the main program and its
func[0] counterpart are both in the bpf kallsyms tree, but only func[0]
has an extable. Extables are only created during JIT. Now there are
two nearly identical program ksym entries in the tree, but only one has
an extable. Depending upon how the entries are placed, there's a chance
that a fault will call search_extable on the aux with the NULL entry.
Since jit_subprogs already copies state from func[0] to the main
program, include the extable pointer in this state duplication.
Additionally, ensure that the copy of the main program in func[0] is not
added to the bpf_prog_kallsyms table. Instead, let the main program get
added later in bpf_prog_load(). This ensures there is only a single
copy of the main program in the kallsyms table, and that its tag matches
the tag observed by tooling like bpftool.
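A hedged sketch of the state duplication step mentioned above, mirroring func[0]'s exception table into the main program after the subprogram JIT; the field names follow the kernel's bpf_prog layout, but this is illustrative rather than the exact patch:
```c
/* Illustrative: after jit_subprogs() succeeds, func[0] is the JITed copy
 * of the main program and owns the extable, so copy it over as part of
 * the existing state duplication into the main program.
 */
static void copy_extable_to_main_prog(struct bpf_prog *prog,
				      struct bpf_prog **func)
{
	prog->aux->extable       = func[0]->aux->extable;
	prog->aux->num_exentries = func[0]->aux->num_exentries;
	/* func[0] itself is then kept out of bpf_prog_kallsyms; the main
	 * program is registered later in bpf_prog_load(). */
}
```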
Cc: stable@vger.kernel.org
Fixes:
|
||
|
0d7a4e6589 |
cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex in freezer_css_{online,offline}()
commit f0cc749254d12c78e93dae3b27b21dc9546843d0 upstream.
syzbot is again reporting circular locking dependency between
cpu_hotplug_lock and freezer_mutex. Do like what we did with
commit 57dcd64c7e036299 ("cgroup,freezer: hold cpu_hotplug_lock
before freezer_mutex").
Reported-by: syzbot <syzbot+2ab700fe1829880a2ec6@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=2ab700fe1829880a2ec6
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+2ab700fe1829880a2ec6@syzkaller.appspotmail.com>
Fixes:
|
||
|
7b162a18d3 |
cgroup: Do not corrupt task iteration when rebinding subsystem
commit 6f363f5aa845561f7ea496d8b1175e3204470486 upstream.
We found a refcount UAF bug as follows:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 1 PID: 342 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x148
Workqueue: events cpuset_hotplug_workfn
Call trace:
refcount_warn_saturate+0xa0/0x148
__refcount_add.constprop.0+0x5c/0x80
css_task_iter_advance_css_set+0xd8/0x210
css_task_iter_advance+0xa8/0x120
css_task_iter_next+0x94/0x158
update_tasks_root_domain+0x58/0x98
rebuild_root_domains+0xa0/0x1b0
rebuild_sched_domains_locked+0x144/0x188
cpuset_hotplug_workfn+0x138/0x5a0
process_one_work+0x1e8/0x448
worker_thread+0x228/0x3e0
kthread+0xe0/0xf0
ret_from_fork+0x10/0x20
then a kernel panic will be triggered as below:
Unable to handle kernel paging request at virtual address 00000000c0000010
Call trace:
cgroup_apply_control_disable+0xa4/0x16c
rebind_subsystems+0x224/0x590
cgroup_destroy_root+0x64/0x2e0
css_free_rwork_fn+0x198/0x2a0
process_one_work+0x1d4/0x4bc
worker_thread+0x158/0x410
kthread+0x108/0x13c
ret_from_fork+0x10/0x18
The race that causes this bug is shown below:
(hotplug cpu) | (umount cpuset)
mutex_lock(&cpuset_mutex) | mutex_lock(&cgroup_mutex)
cpuset_hotplug_workfn |
rebuild_root_domains | rebind_subsystems
update_tasks_root_domain | spin_lock_irq(&css_set_lock)
css_task_iter_start | list_move_tail(&cset->e_cset_node[ss->id]
while(css_task_iter_next) | &dcgrp->e_csets[ss->id]);
css_task_iter_end | spin_unlock_irq(&css_set_lock)
mutex_unlock(&cpuset_mutex) | mutex_unlock(&cgroup_mutex)
Inside css_task_iter_start/next/end, css_set_lock is held and then
released, so while iterating tasks (left side), the css_set may be moved to
another list (right side); then it->cset_head points to the old list head
and it->cset_pos->next points to the head node of the new list, which can't
be used as a struct css_set.
To fix this issue, switch from all css_sets to only scgrp's css_sets to
patch in-flight iterators to preserve correct iteration, and then
update it->cset_head as well.
Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Link: https://www.spinics.net/lists/cgroups/msg37935.html
Suggested-by: Michal Koutný <mkoutny@suse.com>
Link: https://lore.kernel.org/all/20230526114139.70274-1-xiujianfeng@huaweicloud.com/
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Fixes:
|
||
|
0c6552f837 |
tick/common: Align tick period during sched_timer setup
commit 13bb06f8dd42071cb9a49f6e21099eea05d4b856 upstream.
The tick period is aligned very early while the first clock_event_device is registered. At that point the system runs in periodic mode and switches later to one-shot mode if possible.
The next wake-up event is programmed based on the aligned value (tick_next_period) but the delta value, that is used to program the clock_event_device, is computed based on ktime_get(). With the subtracted offset, the device fires earlier than the exact time frame. With a large enough offset the system programs the timer for the next wake-up and the remaining time left is too small to make any boot progress. The system hangs.
Move the alignment later to the setup of tick_sched timer. At this point the system switches to oneshot mode and a high resolution clocksource is available. At this point it is safe to align tick_next_period because ktime_get() will now return accurate (not jiffies based) time.
[bigeasy: Patch description + testing].
Fixes: e9523a0d81899 ("tick/common: Align tick period with the HZ tick.")
Reported-by: Mathias Krause <minipli@grsecurity.net>
Reported-by: "Bhatnagar, Rishabh" <risbhat@amazon.com>
Suggested-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/5a56290d-806e-b9a5-f37c-f21958b5a8c0@grsecurity.net
Link: https://lore.kernel.org/12c6f9a3-d087-b824-0d05-0d18c9bc1bf3@amazon.com
Link: https://lore.kernel.org/r/20230615091830.RxMV2xf_@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
c6cbb4e1c1 |
rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
commit 5da7cb193db32da783a3f3e77d8b639989321d48 upstream.
Memory passed to kvfree_rcu() that is to be freed is tracked by a per-CPU kfree_rcu_cpu structure, which in turn contains pointers to kvfree_rcu_bulk_data structures that contain pointers to memory that has not yet been handed to RCU, along with a kfree_rcu_cpu_work structure that tracks the memory that has already been handed to RCU. These structures track three categories of memory: (1) Memory for kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived during an OOM episode. The first two categories are tracked in a cache-friendly manner involving a dynamically allocated page of pointers (the aforementioned kvfree_rcu_bulk_data structures), while the third uses a simple (but decidedly cache-unfriendly) linked list through the rcu_head structures in each block of memory.
On a given CPU, these three categories are handled as a unit, with that CPU's kfree_rcu_cpu_work structure having one pointer for each of the three categories. Clearly, new memory for a given category cannot be placed in the corresponding kfree_rcu_cpu_work structure until any old memory has had its grace period elapse and thus has been removed. And the kfree_rcu_monitor() function does in fact check for this.
Except that the kfree_rcu_monitor() function checks these pointers one at a time. This means that if the previous kfree_rcu() memory passed to RCU had only category 1 and the current one has only category 2, the kfree_rcu_monitor() function will send that current category-2 memory along immediately. This can result in memory being freed too soon, that is, out from under unsuspecting RCU readers.
To see this, consider the following sequence of events, in which:
o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset", then is preempted.
o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset" after a later grace period. Except that "from_cset" is freed right after the previous grace period ended, so that "from_cset" is immediately freed. Task A resumes and references "from_cset"'s member, after which nothing good happens.
In full detail:
CPU 0                                  CPU 1
----------------------                 ----------------------
count_memcg_event_mm()
|rcu_read_lock()  <---
|mem_cgroup_from_task()
|// css_set_ptr is the "from_cset" mentioned on CPU 1
|css_set_ptr = rcu_dereference((task)->cgroups)
|// Hard irq comes, current task is scheduled out.
                                       cgroup_attach_task()
                                       |cgroup_migrate()
                                       |cgroup_migrate_execute()
                                       |css_set_move_task(task, from_cset, to_cset, true)
                                       |cgroup_move_task(task, to_cset)
                                       |rcu_assign_pointer(.., to_cset)
                                       |...
                                       |cgroup_migrate_finish()
                                       |put_css_set_locked(from_cset)
                                       |from_cset->refcount return 0
                                       |kfree_rcu(cset, rcu_head) // free from_cset after new gp
                                       |add_ptr_to_bulk_krc_lock()
                                       |schedule_delayed_work(&krcp->monitor_work, ..)
                                       kfree_rcu_monitor()
                                       |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
                                       |queue_rcu_work(system_wq, &krwp->rcu_work)
                                       |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
                                       |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp
                                       // There is a previous call_rcu(.., rcu_work_rcufn)
                                       // gp end, rcu_work_rcufn() is called.
                                       rcu_work_rcufn()
                                       |__queue_work(.., rwork->wq, &rwork->work);
                                       |kfree_rcu_work()
                                       |krwp->bulk_head_free[0] bulk is freed before new gp end!!!
                                       |The "from_cset" is freed before new gp end.
// the task resumes some time later.
|css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed.
This commit therefore causes kfree_rcu_monitor() to refrain from moving kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU grace period has completed for all three categories.
v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
Fixes:
|
||
|
013027918a |
kexec: support purgatories with .text.hot sections
commit 8652d44f466ad5772e7d1756e9457046189b0dfc upstream.
Patch series "kexec: Fix kexec_file_load for llvm16 with PGO", v7.
When upreving llvm I realised that kexec stopped working on my test
platform.
The reason seems to be that due to PGO there are multiple .text sections
in the purgatory, and kexec does not support that.
This patch (of 4):
Clang16 links the purgatory text in two sections when PGO is in use:
[ 1] .text PROGBITS 0000000000000000 00000040
00000000000011a1 0000000000000000 AX 0 0 16
[ 2] .rela.text RELA 0000000000000000 00003498
0000000000000648 0000000000000018 I 24 1 8
...
[17] .text.hot. PROGBITS 0000000000000000 00003220
000000000000020b 0000000000000000 AX 0 0 1
[18] .rela.text.hot. RELA 0000000000000000 00004428
0000000000000078 0000000000000018 I 24 17 8
And both of them have their range [sh_addr ... sh_addr+sh_size] on the
area pointed by `e_entry`.
This causes image->start to be calculated twice, once for .text and
again for .text.hot. The second calculation leaves image->start
in a random location.
Because of this, the system crashes immediately after:
kexec_core: Starting new kernel
Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-0-b05c520b7296@chromium.org
Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-1-b05c520b7296@chromium.org
Fixes:
|
||
|
314e973f36 |
cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks()
[ Upstream commit ab1de7ead871ebe6d12a774c3c25de0388cde082 ] The commit |
||
|
c68b4db581 |
cgroup: always put cset in cgroup_css_set_put_fork
[ Upstream commit 2bd110339288c18823dcace602b63b0d8627e520 ]
A successful call to cgroup_css_set_fork() will always have taken
a ref on kargs->cset (regardless of CLONE_INTO_CGROUP), so always
do a corresponding put in cgroup_css_set_put_fork().
Without this, a cset and its contained css structures will be
leaked for some fork failures. The following script reproduces
the leak for a fork failure due to exceeding pids.max in the
pids controller. A similar thing can happen if we jump to the
bad_fork_cancel_cgroup label in copy_process().
[ -z "$1" ] && echo "Usage $0 pids-root" && exit 1
PID_ROOT=$1
CGROUP=$PID_ROOT/foo
[ -e $CGROUP ] && rmdir -f $CGROUP
mkdir $CGROUP
echo 5 > $CGROUP/pids.max
echo $$ > $CGROUP/cgroup.procs
fork_bomb()
{
set -e
for i in $(seq 10); do
/bin/sleep 3600 &
done
}
(fork_bomb) &
wait
echo $$ > $PID_ROOT/cgroup.procs
kill $(cat $CGROUP/cgroup.procs)
rmdir $CGROUP
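A hedged sketch of the behavior change described above in cgroup_css_set_put_fork(): drop the reference taken by cgroup_css_set_fork() unconditionally instead of only under CLONE_INTO_CGROUP. The structure is simplified and the function name is illustrative, not the literal diff:
```c
static void cgroup_css_set_put_fork_sketch(struct kernel_clone_args *kargs)
{
	struct css_set *cset = kargs->cset;

	kargs->cset = NULL;
	if (cset)
		put_css_set(cset);	/* always release the ref from cgroup_css_set_fork() */

	if (kargs->flags & CLONE_INTO_CGROUP) {
		/* additionally drop the cgroup reference and lock taken
		 * specifically for CLONE_INTO_CGROUP */
	}
}
```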
Fixes:
|
||
|
7a2e2ca9ad |
cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers
[ Upstream commit 4cdb91b0dea7d7f59fa84a13c7753cd434fdedcf ]
Replace mutex_[un]lock() with cgroup_[un]lock() wrappers to stay consistent across cgroup core and other subsystem code, while operating on the cgroup_mutex.
Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Stable-dep-of: 2bd110339288 ("cgroup: always put cset in cgroup_css_set_put_fork")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
dbc880567a |
bpf: Add extra path pointer check to d_path helper
[ Upstream commit f46fab0e36e611a2389d3843f34658c849b6bd60 ]
Anastasios reported crash on stable 5.15 kernel with following
BPF attached to lsm hook:
SEC("lsm.s/bprm_creds_for_exec")
int BPF_PROG(bprm_creds_for_exec, struct linux_binprm *bprm)
{
struct path *path = &bprm->executable->f_path;
char p[128] = { 0 };
bpf_d_path(path, p, 128);
return 0;
}
But bprm->executable can be NULL, so bpf_d_path call will crash:
BUG: kernel NULL pointer dereference, address: 0000000000000018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
...
RIP: 0010:d_path+0x22/0x280
...
Call Trace:
<TASK>
bpf_d_path+0x21/0x60
bpf_prog_db9cf176e84498d9_bprm_creds_for_exec+0x94/0x99
bpf_trampoline_6442506293_0+0x55/0x1000
bpf_lsm_bprm_creds_for_exec+0x5/0x10
security_bprm_creds_for_exec+0x29/0x40
bprm_execve+0x1c1/0x900
do_execveat_common.isra.0+0x1af/0x260
__x64_sys_execve+0x32/0x40
It's a problem for all stable trees with the bpf_d_path helper, which was
added in 5.9.
This issue is fixed in current bpf code, where we identify and mark
trusted pointers, so the above code would fail even to load.
For the sake of the stable trees, and to work around a potentially broken
verifier in the future, add code that reads the path object from
the passed pointer and verifies that it's valid in kernel space.
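A sketch of that workaround: before calling d_path(), read the path object through copy_from_kernel_nofault() so an invalid or NULL pointer fails gracefully instead of faulting. The helper layout is illustrative and simplified:
```c
BPF_CALL_3(bpf_d_path_sketch, struct path *, path, char *, buf, u32, sz)
{
	struct path copy;
	char *p;

	/* Verify the pointer refers to readable kernel memory and take a
	 * private copy of the path before using it.
	 */
	if (copy_from_kernel_nofault(&copy, path, sizeof(*path)))
		return -EINVAL;

	p = d_path(&copy, buf, sz);
	if (IS_ERR(p))
		return PTR_ERR(p);
	/* ... move the string to the start of buf and return its length ... */
	return 0;
}
```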
Fixes:
|
||
|
3849e7fcea |
bpf: Fix elem_size not being set for inner maps
[ Upstream commit cba41bb78d70aad98d8e61e019fd48c561f7f396 ] Commit |
||
|
d7612a922b |
bpf: Fix UAF in task local storage
[ Upstream commit b0fd1852bcc21accca6260ef245356d5c141ff66 ]
When task local storage was generalized for tracing programs, the
bpf_task_local_storage callback was moved from a BPF LSM hook
callback for security_task_free LSM hook to its own callback. But a
failure case in bad_fork_cleanup_security was missed which, when
triggered, led to a dangling task owner pointer and a subsequent
use-after-free. Move the bpf_task_storage_free to the very end of
free_task to handle all failure cases.
This issue was noticed when a BPF LSM program was attached to the
task_alloc hook on a kernel with KASAN enabled. The program used
bpf_task_storage_get to copy the task local storage from the current
task to the new task being created.
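A minimal sketch of the fix as described: release BPF task local storage at the very end of free_task() so every fork-failure path is covered. Simplified; not the exact kernel/fork.c hunk, and the final-free call is a placeholder:
```c
void free_task_sketch(struct task_struct *tsk)
{
	/* ... release scheduler, stack and other per-task resources ... */

	/* Last step: free BPF task local storage here so that even tasks
	 * torn down via bad_fork_cleanup_security cannot leave a dangling
	 * owner pointer behind.
	 */
	bpf_task_storage_free(tsk);
	free_task_struct_placeholder(tsk);	/* illustrative final free */
}
```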
Fixes:
|
||
|
3f731926a1 |
tracing/probe: trace_probe_primary_from_call(): checked list_first_entry
commit 81d0fa4cb4fc0e1a49c2b22f92c43d9fe972ebcf upstream.
All callers of trace_probe_primary_from_call() check the return
value to be non-NULL. However, the function returns
list_first_entry(&tpe->probes, ...) which can never be NULL.
Additionally, it does not check for the list being possibly empty,
possibly causing a type confusion on empty lists.
Use list_first_entry_or_null() which solves both problems.
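A sketch of the change, assuming the helper looks roughly like this; the struct layout is abbreviated:
```c
static struct trace_probe *
trace_probe_primary_from_call(struct trace_event_call *call)
{
	struct trace_probe_event *tpe = trace_probe_event_from_call(call);

	if (!tpe)
		return NULL;

	/* list_first_entry() can never return NULL and would type-confuse an
	 * empty list; list_first_entry_or_null() handles both problems.
	 */
	return list_first_entry_or_null(&tpe->probes, struct trace_probe, list);
}
```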
Link: https://lore.kernel.org/linux-trace-kernel/20230128-list-entry-null-check-v1-1-8bde6a3da2ef@diag.uniroma1.it/
Fixes:
|
||
|
7403630eb9 |
tracing/histograms: Allow variables to have some modifiers
commit e30fbc618e97b38dbb49f1d44dcd0778d3f23b8c upstream.
Modifiers are used to change the behavior of keys. For instance, they can be grouped into buckets, converted to syscall names (from the syscall identifier), show the task->comm of the current pid, be an array of longs that represent a stacktrace, and more.
It was found that nothing stopped a value from taking a modifier, as values are simple counters. If this happened, it would call code that was not expecting a modifier and crash the kernel. This was fixed by having the ___create_val_field() function test if a modifier was present and fail if one was. This fixed the crash.
Now there's a problem with variables. Variables are used to pass fields from one event to another. Variables are allowed to have some modifiers, as the processing may need to happen at the time of the event (like stacktraces and comm names of the current pid). The issue is that it too uses __create_val_field(). Now that it fails on modifiers, variables can no longer use them (this is a regression).
As not all modifiers are for variables, have them use a separate check.
Link: https://lore.kernel.org/linux-trace-kernel/20230523221108.064a5d82@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: e0213434fe3e4 ("tracing: Do not let histogram values have some modifiers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
2a1195f0e0 |
tracing/timerlat: Always wakeup the timerlat thread
commit 632478a05821bc1c9b55c3a1dd0fb1be7bfa1acc upstream.
While testing rtla timerlat auto analysis, I reach a condition where
the interface was not receiving tracing data. I was able to manually
reproduce the problem with these steps:
# echo 0 > tracing_on # disable trace
# echo 1 > osnoise/stop_tracing_us # stop trace if timerlat irq > 1 us
# echo timerlat > current_tracer # enable timerlat tracer
# sleep 1 # wait... that is the time when rtla
# apply configs like prio or cgroup
# echo 1 > tracing_on # start tracing
# cat trace
# tracer: timerlat
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# ||||| ACTIVATION
# TASK-PID CPU# ||||| TIMESTAMP ID CONTEXT LATENCY
# | | | ||||| | | | |
NOTHING!
Then, trying to enable tracing again with echo 1 > tracing_on resulted
in no change: the trace was still not tracing.
This problem happens because the timerlat IRQ hits the stop tracing
condition while tracing is off, and does not wake up the timerlat thread,
so the timerlat threads are kept sleeping forever, resulting in no
trace, even after re-enabling the tracer.
Avoid this condition by always waking up the threads, even after stopping
tracing, allowing the tracer to return to its normal operation after
tracing is turned on again.
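A hedged sketch of the described change at the tail of the timerlat IRQ handler: wake the per-CPU timerlat thread even on the path that stops tracing, so it can be rearmed when tracing is re-enabled. The predicate and variable names are assumptions, not the upstream code:
```c
	/* Illustrative sketch of the handler's tail */
	if (stop_tracing_threshold_exceeded)	/* assumed predicate */
		osnoise_stop_tracing();

	/* Always wake the thread, even when tracing was just stopped, so the
	 * tracer resumes normal operation after the next "echo 1 > tracing_on".
	 */
	wake_up_process(tlat->kthread);
	return HRTIMER_NORESTART;
```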
Link: https://lore.kernel.org/linux-trace-kernel/1ed8f830638b20a39d535d27d908e319a9a3c4e2.1683822622.git.bristot@kernel.org
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
53384076f7 |
x86/pci/xen: populate MSI sysfs entries
commit 335b4223466dd75f9f3ea4918187afbadd22e5c8 upstream. Commit |
||
|
1a9e80f757 |
bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
commit b34ffb0c6d23583830f9327864b9c1f486003305 upstream.
The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that the maps try to lock the bucket.
If this fails, then maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of the free lists, and it doesn't belong to the hash table, so it can't be
re-used; this eventually leads to the permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails.
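A sketch of that fix: if the bucket lock cannot be taken, push the pre-allocated element back onto the LRU local free list before returning the error. The surrounding update path is simplified and the error handling is illustrative:
```c
	l_new = prealloc_lru_pop(htab, key, hash);	/* element allocated up front */
	if (!l_new)
		return -ENOMEM;

	ret = htab_lock_bucket(htab, b, hash, &flags);
	if (ret) {
		/* Previously the element leaked here; return it to the local
		 * free list instead so later updates can reuse it.
		 */
		bpf_lru_push_free(&htab->lru, &l_new->lru_node);
		return ret;
	}
```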
Fixes:
|
||
|
177ee41f61 |
bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
commit 0613d8ca9ab382caabe9ed2dceb429e9781e443f upstream.
A narrow load from a 64-bit context field results in a 64-bit load
followed potentially by a 64-bit right-shift and then a bitwise AND
operation to extract the relevant data.
In the case of a 32-bit access, an immediate mask of 0xffffffff is used
to construct a 64-bit BPF_AND operation which then sign-extends the mask
value and effectively acts as a glorified no-op. For example:
0: 61 10 00 00 00 00 00 00 r0 = *(u32 *)(r1 + 0)
results in the following code generation for a 64-bit field:
ldr x7, [x7] // 64-bit load
mov x10, #0xffffffffffffffff
and x7, x7, x10
Fix the mask generation so that narrow loads always perform a 32-bit AND
operation:
ldr x7, [x7] // 64-bit load
mov w10, #0xffffffff
and w7, w7, w10
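A hedged sketch of the verifier-side change, in a convert_ctx_accesses()-style rewrite: emit a 32-bit ALU AND for the narrow-load mask so a 0xffffffff immediate is not sign-extended into a no-op. The surrounding context is simplified and the mask expression is written defensively for this sketch:
```c
	if (ctx_field_size > size) {		/* narrow load of a wider field */
		if (shift)
			insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
							insn->dst_reg, shift);
		/* Use a 32-bit AND: with a 64-bit AND a 0xffffffff mask is
		 * sign-extended and masks nothing at all.
		 */
		insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
						(u32)((1ULL << size * 8) - 1));
	}
```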
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
Cc: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Fixes:
|
||
|
c46d3efb4d |
rethook: use preempt_{disable, enable}_notrace in rethook_trampoline_handler
commit be243bacfb25f5219f2396d787408e8cf1301dd1 upstream.
This patch replaces preempt_{disable, enable} with the corresponding
notrace versions in rethook_trampoline_handler, so there are no worries
about stack recursion or overflow introduced by preempt_count_{add, sub}
in the fprobe + rethook context.
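A minimal sketch of that substitution inside the trampoline handler; the surrounding logic is omitted and the return value is a placeholder:
```c
unsigned long rethook_trampoline_handler_sketch(struct pt_regs *regs,
						unsigned long frame)
{
	unsigned long ret;

	preempt_disable_notrace();	/* was preempt_disable(): avoid tracing
					 * preempt_count_add() under fprobe+rethook */
	ret = 0; /* ... walk and consume the rethook nodes here ... */
	preempt_enable_notrace();	/* was preempt_enable() */

	return ret;
}
```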
Link: https://lore.kernel.org/all/20230517034510.15639-2-zegao@tencent.com/
Fixes:
|
||
|
60039bf72f |
bpf: Add preempt_count_{sub,add} into btf id deny list
[ Upstream commit c11bd046485d7bf1ca200db0e7d0bdc4bafdd395 ]
The recursion check in __bpf_prog_enter* and __bpf_prog_exit* leave preempt_count_{sub,add} unprotected. When attaching trampoline to them we get panic as follows,
[ 867.843050] BUG: TASK stack guard page was hit at 0000000009d325cf (stack is 0000000046a46a15..00000000537e7b28)
[ 867.843064] stack guard page: 0000 [#1] PREEMPT SMP NOPTI
[ 867.843067] CPU: 8 PID: 11009 Comm: trace Kdump: loaded Not tainted 6.2.0+ #4
[ 867.843100] Call Trace:
[ 867.843101] <TASK>
[ 867.843104] asm_exc_int3+0x3a/0x40
[ 867.843108] RIP: 0010:preempt_count_sub+0x1/0xa0
[ 867.843135] __bpf_prog_enter_recur+0x17/0x90
[ 867.843148] bpf_trampoline_6442468108_0+0x2e/0x1000
[ 867.843154] ? preempt_count_sub+0x1/0xa0
[ 867.843157] preempt_count_sub+0x5/0xa0
[ 867.843159] ? migrate_enable+0xac/0xf0
[ 867.843164] __bpf_prog_exit_recur+0x2d/0x40
[ 867.843168] bpf_trampoline_6442468108_0+0x55/0x1000
...
[ 867.843788] preempt_count_sub+0x5/0xa0
[ 867.843793] ? migrate_enable+0xac/0xf0
[ 867.843829] __bpf_prog_exit_recur+0x2d/0x40
[ 867.843837] BUG: IRQ stack guard page was hit at 0000000099bd8228 (stack is 00000000b23e2bc4..000000006d95af35)
[ 867.843841] BUG: IRQ stack guard page was hit at 000000005ae07924 (stack is 00000000ffd69623..0000000014eb594c)
[ 867.843843] BUG: IRQ stack guard page was hit at 00000000028320f0 (stack is 00000000034b6438..0000000078d1bcec)
[ 867.843842] bpf_trampoline_6442468108_0+0x55/0x1000
...
That is because in __bpf_prog_exit_recur, the preempt_count_{sub,add} are called after prog->active is decreased.
Fixing this by adding these two functions into btf ids deny list.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yafang <laoar.shao@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jiri Olsa <olsajiri@gmail.com>
Acked-by: Hao Luo <haoluo@google.com>
Link: https://lore.kernel.org/r/20230413025248.79764-1-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
4e7a81b5e7 |
bpf: Annotate data races in bpf_local_storage
[ Upstream commit 0a09a2f933c73dc76ab0b72da6855f44342a8903 ]
There are a few cases where hlist_node is checked to be unhashed without holding the lock protecting its modification. In this case, one must use hlist_unhashed_lockless to avoid load tearing and KCSAN reports. Fix this by using lockless variant in places not protected by the lock.
Since this is not prompted by any actual KCSAN reports but only from code review, I have not included a fixes tag.
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Cc: KP Singh <kpsingh@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230221200646.2500777-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
d0a8c0e31a |
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
[ Upstream commit 3c1566bca3f8349f12b75d0a2d5e4a20ad6262ec ]
For kernels built with CONFIG_PREEMPT_RCU=y, the following scenario can result in a NULL-pointer dereference:
CPU1                                            CPU2
rcu_preempt_deferred_qs_irqrestore              rcu_print_task_exp_stall
  if (special.b.blocked)                          READ_ONCE(rnp->exp_tasks) != NULL
    raw_spin_lock_rcu_node
    np = rcu_next_node_entry(t, rnp)
    if (&t->rcu_node_entry == rnp->exp_tasks)
      WRITE_ONCE(rnp->exp_tasks, np)
      ....
      raw_spin_unlock_irqrestore_rcu_node
                                                  raw_spin_lock_irqsave_rcu_node
                                                  t = list_entry(rnp->exp_tasks->prev,
                                                      struct task_struct, rcu_node_entry)
                                                  (if rnp->exp_tasks is NULL, this
                                                   will dereference a NULL pointer)
The problem is that CPU2 accesses the rcu_node structure's ->exp_tasks field without holding the rcu_node structure's ->lock and CPU2 did not observe CPU1's change to rcu_node structure's ->exp_tasks in time. Therefore, if CPU1 sets rcu_node structure's ->exp_tasks pointer to NULL, then CPU2 might dereference that NULL pointer.
This commit therefore holds the rcu_node structure's ->lock while accessing that structure's ->exp_tasks field.
[ paulmck: Apply Frederic Weisbecker feedback. ]
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
522c441faf |
refscale: Move shutdown from wait_event() to wait_event_idle()
[ Upstream commit 6bc6e6b27524304aadb9c04611ddb1c84dd7617a ]
The ref_scale_shutdown() kthread/function uses wait_event() to wait for the refscale test to complete. However, although the read-side tests are normally extremely fast, there is no law against specifying a very large value for the refscale.loops module parameter or against having a slow read-side primitive. Either way, this might well trigger the hung-task timeout.
This commit therefore replaces those wait_event() calls with calls to wait_event_idle(), which do not trigger the hung-task timeout.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
a84b08314f |
tick/broadcast: Make broadcast device replacement work correctly
[ Upstream commit f9d36cf445ffff0b913ba187a3eff78028f9b1fb ]
When a tick broadcast clockevent device is initialized for one shot mode
then tick_broadcast_setup_oneshot() OR's the periodic broadcast mode
cpumask into the oneshot broadcast cpumask.
This is required when switching from periodic broadcast mode to oneshot
broadcast mode to ensure that CPUs which are waiting for periodic
broadcast are woken up on the next tick.
But it is subtly broken, when an active broadcast device is replaced and
the system is already in oneshot (NOHZ/HIGHRES) mode. Victor observed
this and debugged the issue.
Then the OR of the periodic broadcast CPU mask is wrong as the periodic
cpumask bits are sticky after tick_broadcast_enable() set it for a CPU
unless explicitly cleared via tick_broadcast_disable().
That means that this sets all other CPUs which have tick broadcasting
enabled at that point unconditionally in the oneshot broadcast mask.
If the affected CPUs were already idle and had their bits set in the
oneshot broadcast mask then this does no harm. But for non idle CPUs
which were not set this corrupts their state.
On their next invocation of tick_broadcast_enable() they observe the bit
set, which indicates that the broadcast for the CPU is already set up.
As a consequence they fail to update the broadcast event even if their
earliest expiring timer is before the actually programmed broadcast
event.
If the programmed broadcast event is far in the future, then this can
cause stalls or trigger the hung task detector.
Avoid this by telling tick_broadcast_setup_oneshot() explicitly whether
this is the initial switch over from periodic to oneshot broadcast which
must take the periodic broadcast mask into account. In the case of
initialization of a replacement device this prevents the broadcast
oneshot mask from being modified.
There is a second problem with broadcast device replacement in this
function. The broadcast device is only armed when the previous state of
the device was periodic.
That is correct for the switch from periodic broadcast mode to oneshot
broadcast mode as the underlying broadcast device could operate in
oneshot state already due to lack of periodic state in hardware. In that
case it is already armed to expire at the next tick.
For the replacement case this is wrong as the device is in shutdown
state. That means that any already pending broadcast event will not be
armed.
This went unnoticed because any CPU which goes idle will observe that
the broadcast device has an expiry time of KTIME_MAX and therefore any
CPUs next timer event will be earlier and cause a reprogramming of the
broadcast device. But that does not guarantee that the events of the
CPUs which were already in idle are delivered on time.
Fix this by arming the newly installed device for an immediate event
which will reevaluate the per CPU expiry times and reprogram the
broadcast device accordingly. This is simpler than caching the last
expiry time in yet another place or saving it before the device exchange
and handing it down to the setup function. Replacement of broadcast
devices is not a frequent operation and usually happens once somewhere
late in the boot process.
Fixes:
|
||
|
1b9c92432f |
locking/rwsem: Add __always_inline annotation to __down_read_common() and inlined callers
commit 92cc5d00a431e96e5a49c0b97e5ad4fa7536bd4b upstream.
Apparently despite it being marked inline, the compiler
may not inline __down_read_common() which makes it difficult
to identify the cause of lock contention, as the blocked
function in traceevents will always be listed as
__down_read_common().
So this patch adds __always_inline annotation to the common
function (as well as the inlined helper callers) to force it to
be inlined so the blocking function will be listed (via Wchan)
in traceevents.
Fixes:
|
||
|
706ae66574 |
kcsan: Avoid READ_ONCE() in read_instrumented_memory()
commit 8dec88070d964bfeb4198f34cb5956d89dd1f557 upstream.
Haibo Li reported:
| Unable to handle kernel paging request at virtual address
| ffffff802a0d8d71
| Mem abort info:
| ESR = 0x96000021
| EC = 0x25: DABT (current EL), IL = 32 bits
| SET = 0, FnV = 0
| EA = 0, S1PTW = 0
| FSC = 0x21: alignment fault
| Data abort info:
| ISV = 0, ISS = 0x00000021
| CM = 0, WnR = 0
| swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000028352000
| [ffffff802a0d8d71] pgd=180000005fbf9003, p4d=180000005fbf9003,
| pud=180000005fbf9003, pmd=180000005fbe8003, pte=006800002a0d8707
| Internal error: Oops: 96000021 [#1] PREEMPT SMP
| Modules linked in:
| CPU: 2 PID: 45 Comm: kworker/u8:2 Not tainted
| 5.15.78-android13-8-g63561175bbda-dirty #1
| ...
| pc : kcsan_setup_watchpoint+0x26c/0x6bc
| lr : kcsan_setup_watchpoint+0x88/0x6bc
| sp : ffffffc00ab4b7f0
| x29: ffffffc00ab4b800 x28: ffffff80294fe588 x27: 0000000000000001
| x26: 0000000000000019 x25: 0000000000000001 x24: ffffff80294fdb80
| x23: 0000000000000000 x22: ffffffc00a70fb68 x21: ffffff802a0d8d71
| x20: 0000000000000002 x19: 0000000000000000 x18: ffffffc00a9bd060
| x17: 0000000000000001 x16: 0000000000000000 x15: ffffffc00a59f000
| x14: 0000000000000001 x13: 0000000000000000 x12: ffffffc00a70faa0
| x11: 00000000aaaaaaab x10: 0000000000000054 x9 : ffffffc00839adf8
| x8 : ffffffc009b4cf00 x7 : 0000000000000000 x6 : 0000000000000007
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffffc00a70fb70
| x2 : 0005ff802a0d8d71 x1 : 0000000000000000 x0 : 0000000000000000
| Call trace:
| kcsan_setup_watchpoint+0x26c/0x6bc
| __tsan_read2+0x1f0/0x234
| inflate_fast+0x498/0x750
| zlib_inflate+0x1304/0x2384
| __gunzip+0x3a0/0x45c
| gunzip+0x20/0x30
| unpack_to_rootfs+0x2a8/0x3fc
| do_populate_rootfs+0xe8/0x11c
| async_run_entry_fn+0x58/0x1bc
| process_one_work+0x3ec/0x738
| worker_thread+0x4c4/0x838
| kthread+0x20c/0x258
| ret_from_fork+0x10/0x20
| Code: b8bfc2a8 2a0803f7 14000007 d503249f (78bfc2a8)
| ---[ end trace 613a943cb0a572b6 ]---
The reason for this is that on certain arm64 configuration since
|
||
|
72f3217aa1 |
PM: hibernate: Do not get block device exclusively in test_resume mode
[ Upstream commit 5904de0d735bbb3b4afe9375c5b4f9748f882945 ]
The system refused to do a test_resume because it found that the swap device has already been taken by someone else. Specifically, the swsusp_check()->blkdev_get_by_dev(FMODE_EXCL) is supposed to do this check.
Steps to reproduce:
dd if=/dev/zero of=/swapfile bs=$(cat /proc/meminfo | awk '/MemTotal/ {print $2}') count=1024 conv=notrunc
mkswap /swapfile
swapon /swapfile
swap-offset /swapfile
echo 34816 > /sys/power/resume_offset
echo test_resume > /sys/power/disk
echo disk > /sys/power/state
PM: Using 3 thread(s) for compression
PM: Compressing and saving image data (293150 pages)...
PM: Image saving progress: 0%
PM: Image saving progress: 10%
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: configured for UDMA/100
ata2: SATA link down (SStatus 0 SControl 300)
ata5: SATA link down (SStatus 0 SControl 300)
ata6: SATA link down (SStatus 0 SControl 300)
ata3: SATA link down (SStatus 0 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
PM: Image saving progress: 20%
PM: Image saving progress: 30%
PM: Image saving progress: 40%
PM: Image saving progress: 50%
pcieport 0000:00:02.5: pciehp: Slot(0-5): No device found
PM: Image saving progress: 60%
PM: Image saving progress: 70%
PM: Image saving progress: 80%
PM: Image saving progress: 90%
PM: Image saving done
PM: hibernation: Wrote 1172600 kbytes in 2.70 seconds (434.29 MB/s)
PM: S|
PM: hibernation: Basic memory bitmaps freed
PM: Image not found (code -16)
This is because when using the swapfile as the hibernation storage, the block device where the swapfile is located has already been mounted by the OS distribution (usually mounted as the rootfs). This is not an issue for normal hibernation, because software_resume()->swsusp_check() happens before the block device (rootfs) mount. But it is a problem for the test_resume mode, because when test_resume happens, the block device has been mounted already.
Thus remove the FMODE_EXCL for test_resume mode. This would not be a problem because in test_resume stage, the processes have already been frozen, and the race condition described in Commit |
||
|
208ba216cc |
PM: hibernate: Turn snapshot_test into global variable
[ Upstream commit 08169a162f97819d3e5b4a342bb9cf5137787154 ]
There is a need to check snapshot_test and open the block device in a different mode, so as to avoid the race condition. No functional changes intended.
Suggested-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Stable-dep-of: 5904de0d735b ("PM: hibernate: Do not get block device exclusively in test_resume mode")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
c2b990d7aa |
timekeeping: Fix references to nonexistent ktime_get_fast_ns()
[ Upstream commit 158009f1b4a33bc0f354b994eea361362bd83226 ]
There was never a function named ktime_get_fast_ns().
Presumably these should refer to ktime_get_mono_fast_ns() instead.
Fixes:
|
||
|
4aa9243ebe |
swiotlb: fix debugfs reporting of reserved memory pools
[ Upstream commit 5499d01c029069044a3b3e50501c77b474c96178 ]
For io_tlb_nslabs, the debugfs code reports the correct value for a
specific reserved memory pool. But for io_tlb_used, the value reported
is always for the default pool, not the specific reserved pool. Fix this.
Fixes:
|
||
|
e6c69b06e7 |
swiotlb: relocate PageHighMem test away from rmem_swiotlb_setup
[ Upstream commit a90922fa25370902322e9de6640e58737d459a50 ]
The reservedmem_of_init_fn's are invoked very early at boot before the
memory zones have even been defined. This makes it inappropriate to test
whether the page corresponding to a PFN is in ZONE_HIGHMEM from within
one.
Removing the check allows an ARM 32-bit kernel with SPARSEMEM enabled to
boot properly since otherwise we would be de-referencing an
uninitialized sparsemem map to perform a pfn_to_page() check.
The arm64 architecture happens to work (and also has no high memory) but
other 32-bit architectures could also be having similar issues.
While it would be nice to provide early feedback about a reserved DMA
pool residing in highmem, it is not possible to do that until the first
time we try to use it, which is where the check is moved to.
Fixes:
|
||
|
c3c2aee6f9 |
workqueue: Fix hung time report of worker pools
[ Upstream commit 335a42ebb0ca8ee9997a1731aaaae6dcd704c113 ]
The workqueue watchdog prints a warning when there is no progress in
a worker pool. Where the progress means that the pool started processing
a pending work item.
Note that it is perfectly fine to process work items much longer.
The progress should be guaranteed by waking up or creating idle
workers.
show_one_worker_pool() prints state of non-idle worker pool. It shows
a delay since the last pool->watchdog_ts.
The timestamp is updated when a first pending work is queued in
__queue_work(). Also it is updated when a work is dequeued for
processing in worker_thread() and rescuer_thread().
The delay is misleading when there is no pending work item. In this
case it shows how long the last work item has been running. Show
zero instead. There is no stall if there is no pending work.
Fixes:
|
||
|
0489c2b2c3 |
tracing/user_events: Ensure write index cannot be negative
[ Upstream commit cd98c93286a30cc4588dfd02453bec63c2f4acf4 ]
The write index indicates which event the data is for and accesses a
per-file array. The index is passed by user processes during write()
calls as the first 4 bytes. Ensure that it cannot be negative by
returning -EINVAL to prevent out of bounds accesses.
Update ftrace self-test to ensure this occurs properly.
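A sketch of the added check where the write index is parsed from the first 4 bytes of the write() payload; the surrounding iov_iter handling is elided and simplified:
```c
	int idx;

	if (copy_from_iter(&idx, sizeof(idx), i) != sizeof(idx))
		return -EFAULT;

	/* The index selects an entry in a per-file array; a negative value
	 * would walk out of bounds, so reject it up front.
	 */
	if (idx < 0)
		return -EINVAL;
```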
Link: https://lkml.kernel.org/r/20230425225107.8525-2-beaub@linux.microsoft.com
Fixes:
|
||
|
6472a6d0c7 |
sched/rt: Fix bad task migration for rt tasks
[ Upstream commit feffe5bb274dd3442080ef0e4053746091878799 ] Commit |
||
|
2d44928903 |
perf/core: Fix hardlockup failure caused by perf throttle
[ Upstream commit 15def34e2635ab7e0e96f1bc32e1b69609f14942 ] commit |
||
|
944465c772 |
sched/fair: Fix inaccurate tally of ttwu_move_affine
[ Upstream commit 39afe5d6fc59237ff7738bf3ede5a8856822d59d ]
There are scenarios where non-affine wakeups are incorrectly counted as
affine wakeups by schedstats.
When wake_affine_idle() returns prev_cpu, which is not equal to
nr_cpumask_bits, it will slip through the check target == nr_cpumask_bits
in wake_affine() and be counted as if target == this_cpu in schedstats.
Replace target == nr_cpumask_bits with target != this_cpu to make sure
affine wakeups are accurately tallied.
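A hedged sketch of the schedstat accounting at the tail of wake_affine() after the change; only the relevant lines are shown and the statistics field names are approximate rather than the exact hunk:
```c
	schedstat_inc(p->stats.nr_wakeups_affine_attempts);
	if (target != this_cpu)		/* was: target == nr_cpumask_bits */
		return prev_cpu;	/* non-affine wakeup: do not count it as affine */

	schedstat_inc(sd->ttwu_move_affine);
	schedstat_inc(p->stats.nr_wakeups_affine);
	return target;
```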
Fixes:
|
||
|
551a26668c |
bpf: Don't EFAULT for getsockopt with optval=NULL
[ Upstream commit 00e74ae0863827d944e36e56a4ce1e77e50edb91 ]
Some socket options do getsockopt with optval=NULL to estimate the size
of the final buffer (which is returned via optlen). This breaks BPF
getsockopt assumptions about permitted optval buffer size. Let's enforce
these assumptions only when non-NULL optval is provided.
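A sketch of the relaxed check in the cgroup getsockopt hook: the optlen bounds enforcement is applied only when a non-NULL optval was passed. Simplified from the surrounding function; variable names are illustrative:
```c
	/* When optval is NULL user space is only probing the required size,
	 * so do not treat a larger ctx.optlen as a BPF program error.
	 */
	if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) {
		ret = -EFAULT;
		goto out;
	}
```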
Fixes:
|
||
|
c3fb321447 |
bpf: Fix race between btf_put and btf_idr walk.
[ Upstream commit acf1c3d68e9a31f10d92bc67ad4673cdae5e8d92 ]
Florian and Eduard reported hard dead lock:
[ 58.433327] _raw_spin_lock_irqsave+0x40/0x50
[ 58.433334] btf_put+0x43/0x90
[ 58.433338] bpf_find_btf_id+0x157/0x240
[ 58.433353] btf_parse_fields+0x921/0x11c0
This happens since btf->refcount can be 1 at the time of btf_put() and btf_put() will call btf_free_id() which will try to grab btf_idr_lock and will dead lock.
Avoid the issue by doing btf_put() without locking.
Fixes: |
||
|
52c3d68d99 |
bpf/btf: Fix is_int_ptr()
[ Upstream commit 91f2dc6838c19342f7f2993627c622835cc24890 ]
When tracing a kernel function whose argument type is u32*, btf_ctx_access() would report the error: arg2 type INT is not a struct. The commit |
||
|
f9361cf40b |
bpf: Fix __reg_bound_offset 64->32 var_off subreg propagation
[ Upstream commit 7be14c1c9030f73cc18b4ff23b78a0a081f16188 ] Xu reports that after commit |
||
|
157c84b793 |
bpf: Remove misleading spec_v1 check on var-offset stack read
[ Upstream commit 082cdc69a4651dd2a77539d69416a359ed1214f5 ]
For every BPF_ADD/SUB involving a pointer, adjust_ptr_min_max_vals() ensures that the resulting pointer has a constant offset if bypass_spec_v1 is false. This is ensured by calling sanitize_check_bounds() which in turn calls check_stack_access_for_ptr_arithmetic(). There, -EACCESS is returned if the register's offset is not constant, thereby rejecting the program.
In summary, an unprivileged user must never be able to create stack pointers with a variable offset. That is also the case, because a respective check in check_stack_write() is missing. If they were able to create a variable-offset pointer, users could still use it in a stack-write operation to trigger unsafe speculative behavior [1].
Because unprivileged users must already be prevented from creating variable-offset stack pointers, viable options are to either remove this check (replacing it with a clarifying comment), or to turn it into a "verifier BUG"-message, also adding a similar check in check_stack_write() (for consistency, as a second-level defense). This patch implements the first option to reduce verifier bloat.
This check was introduced by commit |
||
|
a62ba7e0d2 |
bpf: fix precision propagation verbose logging
[ Upstream commit 34f0677e7afd3a292bc1aadda7ce8e35faedb204 ]
Fix wrong order of frame index vs register/slot index in precision propagation verbose (level 2) output. It's wrong and very confusing as is.
Fixes: 529409ea92d5 ("bpf: propagate precision across all frames, not just the last one")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230313184017.4083374-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
0049d2edda |
bpf: take into account liveness when propagating precision
[ Upstream commit 52c2b005a3c18c565fc70cfd0ca49375f301e952 ]
When doing state comparison, if old state has register that is not
marked as REG_LIVE_READ, then we just skip comparison, regardless of the
state of the corresponding register in the current state. This is because
a register that is not REG_LIVE_READ is irrelevant for further program
execution and correctness. All good here.
But when we get to precision propagation, after two states were declared
equivalent, we don't take into account old register's liveness, and thus
attempt to propagate precision for register in current state even if
that register in old state was not REG_LIVE_READ anymore. This is bad,
because register in current state could be anything at all and this
could cause -EFAULT due to internal logic bugs.
Fix by taking into account REG_LIVE_READ liveness mark to keep the logic
in state comparison in sync with precision propagation.
Fixes:
|
||
|
290e26ec0d |
tick/common: Align tick period with the HZ tick.
[ Upstream commit e9523a0d81899361214d118ad60ef76f0e92f71d ]
With HIGHRES enabled tick_sched_timer() is programmed every jiffy to
expire the timer_list timers. This timer is programmed accurate in
respect to CLOCK_MONOTONIC so that 0 seconds and nanoseconds is the
first tick and the next one is 1000/CONFIG_HZ ms later. For HZ=250 it is
every 4 ms and so based on the current time the next tick can be
computed.
This accuracy broke since the commit mentioned below because the jiffy
based clocksource is initialized with higher accuracy in
read_persistent_wall_and_boot_offset(). This higher accuracy is
inherited during the setup in tick_setup_device(). The timer still fires
every 4 ms with HZ=250, but the timer is no longer aligned with
CLOCK_MONOTONIC with 0 as its origin; it has an offset in the us/ns part
of the timestamp. The offset differs with every boot and makes it
impossible for user land to align with the tick.
Align the tick period with CLOCK_MONOTONIC ensuring that it is always a
multiple of 1000/CONFIG_HZ ms.
Fixes:
|
||
|
ae6803b663 |
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
[ Upstream commit db7b464df9d820186e98a65aa6a10f0d51fbf8ce ]
This commit adds checks for the TICK_DEP_MASK_RCU_EXP bit, thus enabling
RCU expedited grace periods to actually force-enable scheduling-clock
interrupts on holdout CPUs.
Fixes:
|
||
|
d79d3430e1 |
tracing: Fix permissions for the buffer_percent file
commit 4f94559f40ad06d627c0fdfc3319cec778a2845b upstream.
This file defines both read and write operations, yet it is being
created as read-only. This means that it can't be written to without the
CAP_DAC_OVERRIDE capability. Fix the permissions to allow root to write
to it without the need to override DAC perms.
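A minimal sketch of the fix described: create the buffer_percent file with the tracing helper's read-write mode constant instead of read-only; shown as it would appear in the tracefs setup, though the exact hunk may differ:
```c
	trace_create_file("buffer_percent", TRACE_MODE_WRITE, d_tracer,
			  tr, &buffer_percent_fops);
```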
Link: https://lore.kernel.org/linux-trace-kernel/20230503140114.3280002-1-omosnace@redhat.com
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes:
|