Commit Graph

40341 Commits

Author SHA1 Message Date
Phil Auld
4070e9cf72 cpu/hotplug: Make target_store() a nop when target == state
[ Upstream commit 64ea6e44f85b9b75925ebe1ba0e6e8430cc4e06f ]

Writing the current state back in hotplug/target calls cpu_down()
which will set cpu dying even when it isn't and then nothing will
ever clear it. A stress test that reads values and writes them back
for all cpu device files in sysfs will trigger the BUG() in
select_fallback_rq once all cpus are marked as dying.

kernel/cpu.c::target_store()
	...
        if (st->state < target)
                ret = cpu_up(dev->id, target);
        else
                ret = cpu_down(dev->id, target);

cpu_down() -> cpu_set_state()
	 bool bringup = st->state < target;
	 ...
	 if (cpu_dying(cpu) != !bringup)
		set_cpu_dying(cpu, !bringup);

Fix this by letting state==target fall through in the target_store()
conditional. Also make sure st->target == target in that case.

Fixes: 757c989b99 ("cpu/hotplug: Make target state writeable")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20221117162329.3164999-2-pauld@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:58 +01:00
Alexey Izbyshev
a1e49256c7 futex: Resend potentially swallowed owner death notification
[ Upstream commit 90d758896787048fa3d4209309d4800f3920e66f ]

Commit ca16d5bee5 ("futex: Prevent robust futex exit race") addressed
two cases when tasks waiting on a robust non-PI futex remained blocked
despite the futex not being owned anymore:

* if the owner died after writing zero to the futex word, but before
  waking up a waiter

* if a task waiting on the futex was woken up, but died before updating
  the futex word (effectively swallowing the notification without acting
  on it)

In the second case, the task could be woken up either by the previous
owner (after the futex word was reset to zero) or by the kernel (after
the OWNER_DIED bit was set and the TID part of the futex word was reset
to zero) if the previous owner died without the resetting the futex.

Because the referenced commit wakes up a potential waiter only if the
whole futex word is zero, the latter subcase remains unaddressed.

Fix this by looking only at the TID part of the futex when deciding
whether a wake up is needed.

Fixes: ca16d5bee5 ("futex: Prevent robust futex exit race")
Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221111215439.248185-1-izbyshev@ispras.ru
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:58 +01:00
Yang Yingliang
6eda351971 genirq/irqdesc: Don't try to remove non-existing sysfs files
[ Upstream commit 9049e1ca41983ab773d7ea244bee86d7835ec9f5 ]

Fault injection tests trigger warnings like this:

  kernfs: can not remove 'chip_name', no directory
  WARNING: CPU: 0 PID: 253 at fs/kernfs/dir.c:1616 kernfs_remove_by_name_ns+0xce/0xe0
  RIP: 0010:kernfs_remove_by_name_ns+0xce/0xe0
  Call Trace:
   <TASK>
   remove_files.isra.1+0x3f/0xb0
   sysfs_remove_group+0x68/0xe0
   sysfs_remove_groups+0x41/0x70
   __kobject_del+0x45/0xc0
   kobject_del+0x29/0x40
   free_desc+0x42/0x70
   irq_free_descs+0x5e/0x90

The reason is that the interrupt descriptor sysfs handling does not roll
back on a failing kobject_add() during allocation. If the descriptor is
freed later on, kobject_del() is invoked with a not added kobject resulting
in the above warnings.

A proper rollback in case of a kobject_add() failure would be the straight
forward solution. But this is not possible due to the way how interrupt
descriptor sysfs handling works.

Interrupt descriptors are allocated before sysfs becomes available. So the
sysfs files for the early allocated descriptors are added later in the boot
process. At this point there can be nothing useful done about a failing
kobject_add(). For consistency the interrupt descriptor allocation always
treats kobject_add() failures as non-critical and just emits a warning.

To solve this problem, keep track in the interrupt descriptor whether
kobject_add() was successful or not and make the invocation of
kobject_del() conditional on that.

[ tglx: Massage changelog, comments and use a state bit. ]

Fixes: ecb3f394c5 ("genirq: Expose interrupt information through sysfs")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20221128151612.1786122-1-yangyingliang@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:58 +01:00
Chen Zhongjin
e9b4dc13d3 perf: Fix possible memleak in pmu_dev_alloc()
[ Upstream commit e8d7a90c08ce963c592fb49845f2ccc606a2ac21 ]

In pmu_dev_alloc(), when dev_set_name() failed, it will goto free_dev
and call put_device(pmu->dev) to release it.
However pmu->dev->release is assigned after this, which makes warning
and memleak.
Call dev_set_name() after pmu->dev->release = pmu_dev_release to fix it.

  Device '(null)' does not have a release() function...
  WARNING: CPU: 2 PID: 441 at drivers/base/core.c:2332 device_release+0x1b9/0x240
  ...
  Call Trace:
    <TASK>
    kobject_put+0x17f/0x460
    put_device+0x20/0x30
    pmu_dev_alloc+0x152/0x400
    perf_pmu_register+0x96b/0xee0
    ...
  kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
  unreferenced object 0xffff888014759000 (size 2048):
    comm "modprobe", pid 441, jiffies 4294931444 (age 38.332s)
    backtrace:
      [<0000000005aed3b4>] kmalloc_trace+0x27/0x110
      [<000000006b38f9b8>] pmu_dev_alloc+0x50/0x400
      [<00000000735f17be>] perf_pmu_register+0x96b/0xee0
      [<00000000e38477f1>] 0xffffffffc0ad8603
      [<000000004e162216>] do_one_initcall+0xd0/0x4e0
      ...

Fixes: abe4340057 ("perf: Sysfs enumeration")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221111103653.91058-1-chenzhongjin@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:56 +01:00
xiongxin
574b10475d PM: hibernate: Fix mistake in kerneldoc comment
[ Upstream commit 6e5d7300cbe7c3541bc31f16db3e9266e6027b4b ]

The actual maximum image size formula in hibernate_preallocate_memory()
is as follows:

max_size = (count - (size + PAGES_FOR_IO)) / 2
	    - 2 * DIV_ROUND_UP(reserved_size, PAGE_SIZE);

but the one in the kerneldoc comment of the function is different and
incorrect.

Fixes: ddeb648708 ("PM / Hibernate: Add sysfs knob to control size of memory for drivers")
Signed-off-by: xiongxin <xiongxin@kylinos.cn>
[ rjw: Subject and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:55 +01:00
Hao Lee
44de1adb24 sched/psi: Fix possible missing or delayed pending event
[ Upstream commit e38f89af6a13e895805febd3a329a13ab7e66fa4 ]

When a pending event exists and growth is less than the threshold, the
current logic is to skip this trigger without generating event. However,
from e6df4ead85 ("psi: fix possible trigger missing in the window"),
our purpose is to generate event as long as pending event exists and the
rate meets the limit, no matter what growth is.
This patch handles this case properly.

Fixes: e6df4ead85 ("psi: fix possible trigger missing in the window")
Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/20220919072356.GA29069@haolee.io
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:55 +01:00
Qais Yousef
e7c542d2cf sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
[ Upstream commit d81304bc6193554014d4372a01debdf65e1e9a4d ]

If the utilization of the woken up task is 0, we skip the energy
calculation because it has no impact.

But if the task is boosted (uclamp_min != 0) will have an impact on task
placement and frequency selection. Only skip if the util is truly
0 after applying uclamp values.

Change uclamp_task_cpu() signature to avoid unnecessary additional calls
to uclamp_eff_get(). feec() is the only user now.

Fixes: 732cd75b8c ("sched/fair: Select an energy-efficient CPU on task wake-up")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-8-qais.yousef@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:55 +01:00
Qais Yousef
cbd8c04040 sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
[ Upstream commit c56ab1b3506ba0e7a872509964b100912bde165d ]

So that it is now uclamp aware.

This fixes a major problem of busy tasks capped with UCLAMP_MAX keeping
the system in overutilized state which disables EAS and leads to wasting
energy in the long run.

Without this patch running a busy background activity like JIT
compilation on Pixel 6 causes the system to be in overutilized state
74.5% of the time.

With this patch this goes down to  9.79%.

It also fixes another problem when long running tasks that have their
UCLAMP_MIN changed while running such that they need to upmigrate to
honour the new UCLAMP_MIN value. The upmigration doesn't get triggered
because overutilized state never gets set in this state, hence misfit
migration never happens at tick in this case until the task wakes up
again.

Fixes: af24bde8df ("sched/uclamp: Add uclamp support to energy_compute()")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-7-qais.yousef@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:55 +01:00
Qais Yousef
324ce6e2c6 sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
[ Upstream commit a2e7f03ed28fce26c78b985f87913b6ce3accf9d ]

Use the new util_fits_cpu() to ensure migration margin and capacity
pressure are taken into account correctly when uclamp is being used
otherwise we will fail to consider CPUs as fitting in scenarios where
they should.

s/asym_fits_capacity/asym_fits_cpu/ to better reflect what it does now.

Fixes: b4c9c9f156 ("sched/fair: Prefer prev cpu in asymmetric wakeup path")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-6-qais.yousef@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:55 +01:00
Qais Yousef
3b9c1559de sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
[ Upstream commit b759caa1d9f667b94727b2ad12589cbc4ce13a82 ]

Use the new util_fits_cpu() to ensure migration margin and capacity
pressure are taken into account correctly when uclamp is being used
otherwise we will fail to consider CPUs as fitting in scenarios where
they should.

Fixes: b4c9c9f156 ("sched/fair: Prefer prev cpu in asymmetric wakeup path")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-5-qais.yousef@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:54 +01:00
Qais Yousef
75ba48621b sched/uclamp: Fix fits_capacity() check in feec()
[ Upstream commit 244226035a1f9b2b6c326e55ae5188fab4f428cb ]

As reported by Yun Hsiang [1], if a task has its uclamp_min >= 0.8 * 1024,
it'll always pick the previous CPU because fits_capacity() will always
return false in this case.

The new util_fits_cpu() logic should handle this correctly for us beside
more corner cases where similar failures could occur, like when using
UCLAMP_MAX.

We open code uclamp_rq_util_with() except for the clamp() part,
util_fits_cpu() needs the 'raw' values to be passed to it.

Also introduce uclamp_rq_{set, get}() shorthand accessors to get uclamp
value for the rq. Makes the code more readable and ensures the right
rules (use READ_ONCE/WRITE_ONCE) are respected transparently.

[1] https://lists.linaro.org/pipermail/eas-dev/2020-July/001488.html

Fixes: 1d42509e47 ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Reported-by: Yun Hsiang <hsiang023167@gmail.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-4-qais.yousef@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:54 +01:00
Qais Yousef
d355960543 sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
[ Upstream commit b48e16a69792b5dc4a09d6807369d11b2970cc36 ]

So that the new uclamp rules in regard to migration margin and capacity
pressure are taken into account correctly.

Fixes: a7008c07a5 ("sched/fair: Make task_fits_capacity() consider uclamp restrictions")
Co-developed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-3-qais.yousef@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:54 +01:00
Qais Yousef
0ef8dac246 sched/uclamp: Fix relationship between uclamp and migration margin
[ Upstream commit 48d5e9daa8b767e75ed9421665b037a49ce4bc04 ]

fits_capacity() verifies that a util is within 20% margin of the
capacity of a CPU, which is an attempt to speed up upmigration.

But when uclamp is used, this 20% margin is problematic because for
example if a task is boosted to 1024, then it will not fit on any CPU
according to fits_capacity() logic.

Or if a task is boosted to capacity_orig_of(medium_cpu). The task will
end up on big instead on the desired medium CPU.

Similar corner cases exist for uclamp and usage of capacity_of().
Slightest irq pressure on biggest CPU for example will make a 1024
boosted task look like it can't fit.

What we really want is for uclamp comparisons to ignore the migration
margin and capacity pressure, yet retain them for when checking the
_actual_ util signal.

For example, task p:

	p->util_avg = 300
	p->uclamp[UCLAMP_MIN] = 1024

Will fit a big CPU. But

	p->util_avg = 900
	p->uclamp[UCLAMP_MIN] = 1024

will not, this should trigger overutilized state because the big CPU is
now *actually* being saturated.

Similar reasoning applies to capping tasks with UCLAMP_MAX. For example:

	p->util_avg = 1024
	p->uclamp[UCLAMP_MAX] = capacity_orig_of(medium_cpu)

Should fit the task on medium cpus without triggering overutilized
state.

Inlined comments expand more on desired behavior in more scenarios.

Introduce new util_fits_cpu() function which encapsulates the new logic.
The new function is not used anywhere yet, but will be used to update
various users of fits_capacity() in later patches.

Fixes: af24bde8df ("sched/uclamp: Add uclamp support to energy_compute()")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220804143609.515789-2-qais.yousef@arm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:54 +01:00
Kuniyuki Iwashima
5b81f0c6c6 seccomp: Move copy_seccomp() to no failure path.
[ Upstream commit a1140cb215fa13dcec06d12ba0c3ee105633b7c4 ]

Our syzbot instance reported memory leaks in do_seccomp() [0], similar
to the report [1].  It shows that we miss freeing struct seccomp_filter
and some objects included in it.

We can reproduce the issue with the program below [2] which calls one
seccomp() and two clone() syscalls.

The first clone()d child exits earlier than its parent and sends a
signal to kill it during the second clone(), more precisely before the
fatal_signal_pending() test in copy_process().  When the parent receives
the signal, it has to destroy the embryonic process and return -EINTR to
user space.  In the failure path, we have to call seccomp_filter_release()
to decrement the filter's refcount.

Initially, we called it in free_task() called from the failure path, but
the commit 3a15fb6ed9 ("seccomp: release filter after task is fully
dead") moved it to release_task() to notify user space as early as possible
that the filter is no longer used.

To keep the change and current seccomp refcount semantics, let's move
copy_seccomp() just after the signal check and add a WARN_ON_ONCE() in
free_task() for future debugging.

[0]:
unreferenced object 0xffff8880063add00 (size 256):
  comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.914s)
  hex dump (first 32 bytes):
    01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff  ................
  backtrace:
    do_seccomp (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/seccomp.c:666 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
    do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
unreferenced object 0xffffc90000035000 (size 4096):
  comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 00 00 00 05 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    __vmalloc_node_range (mm/vmalloc.c:3226)
    __vmalloc_node (mm/vmalloc.c:3261 (discriminator 4))
    bpf_prog_alloc_no_stats (kernel/bpf/core.c:91)
    bpf_prog_alloc (kernel/bpf/core.c:129)
    bpf_prog_create_from_user (net/core/filter.c:1414)
    do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
    do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
unreferenced object 0xffff888003fa1000 (size 1024):
  comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    bpf_prog_alloc_no_stats (./include/linux/slab.h:600 ./include/linux/slab.h:733 kernel/bpf/core.c:95)
    bpf_prog_alloc (kernel/bpf/core.c:129)
    bpf_prog_create_from_user (net/core/filter.c:1414)
    do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
    do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
unreferenced object 0xffff888006360240 (size 16):
  comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
  hex dump (first 16 bytes):
    01 00 37 00 76 65 72 6c e0 83 01 06 80 88 ff ff  ..7.verl........
  backtrace:
    bpf_prog_store_orig_filter (net/core/filter.c:1137)
    bpf_prog_create_from_user (net/core/filter.c:1428)
    do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
    do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
unreferenced object 0xffff8880060183e0 (size 8):
  comm "repro_seccomp", pid 230, jiffies 4294687090 (age 9.915s)
  hex dump (first 8 bytes):
    06 00 00 00 00 00 ff 7f                          ........
  backtrace:
    kmemdup (mm/util.c:129)
    bpf_prog_store_orig_filter (net/core/filter.c:1144)
    bpf_prog_create_from_user (net/core/filter.c:1428)
    do_seccomp (kernel/seccomp.c:671 kernel/seccomp.c:708 kernel/seccomp.c:1871 kernel/seccomp.c:1991)
    do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

[1]: https://syzkaller.appspot.com/bug?id=2809bb0ac77ad9aa3f4afe42d6a610aba594a987

[2]:
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

void main(void)
{
	struct sock_filter filter[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog fprog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};
	long i, pid;

	syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog);

	for (i = 0; i < 2; i++) {
		pid = syscall(__NR_clone, CLONE_NEWNET | SIGKILL, NULL, NULL, 0);
		if (pid == 0)
			return;
	}
}

Fixes: 3a15fb6ed9 ("seccomp: release filter after task is fully dead")
Reported-by: syzbot+ab17848fe269b573eb71@syzkaller.appspotmail.com
Reported-by: Ayushman Dutta <ayudutta@amazon.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220823154532.82913-1-kuniyu@amazon.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-12-31 13:31:53 +01:00
Tejun Heo
fbf8321238 memcg: Fix possible use-after-free in memcg_write_event_control()
memcg_write_event_control() accesses the dentry->d_name of the specified
control fd to route the write call.  As a cgroup interface file can't be
renamed, it's safe to access d_name as long as the specified file is a
regular cgroup file.  Also, as these cgroup interface files can't be
removed before the directory, it's safe to access the parent too.

Prior to 347c4a8747 ("memcg: remove cgroup_event->cft"), there was a
call to __file_cft() which verified that the specified file is a regular
cgroupfs file before further accesses.  The cftype pointer returned from
__file_cft() was no longer necessary and the commit inadvertently
dropped the file type check with it allowing any file to slip through.
With the invarients broken, the d_name and parent accesses can now race
against renames and removals of arbitrary files and cause
use-after-free's.

Fix the bug by resurrecting the file type check in __file_cft().  Now
that cgroupfs is implemented through kernfs, checking the file
operations needs to go through a layer of indirection.  Instead, let's
check the superblock and dentry type.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 347c4a8747 ("memcg: remove cgroup_event->cft")
Cc: stable@kernel.org # v3.14+
Reported-by: Jann Horn <jannh@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-08 10:40:58 -08:00
Linus Torvalds
bce9332220 proc: proc_skip_spaces() shouldn't think it is working on C strings
proc_skip_spaces() seems to think it is working on C strings, and ends
up being just a wrapper around skip_spaces() with a really odd calling
convention.

Instead of basing it on skip_spaces(), it should have looked more like
proc_skip_char(), which really is the exact same function (except it
skips a particular character, rather than whitespace).  So use that as
inspiration, odd coding and all.

Now the calling convention actually makes sense and works for the
intended purpose.

Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05 12:09:06 -08:00
Linus Torvalds
e6cfaf34be proc: avoid integer type confusion in get_proc_long
proc_get_long() is passed a size_t, but then assigns it to an 'int'
variable for the length.  Let's not do that, even if our IO paths are
limited to MAX_RW_COUNT (exactly because of these kinds of type errors).

So do the proper test in the rigth type.

Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-05 11:33:40 -08:00
Linus Torvalds
0c3b5bcb48 - Fix a use-after-free case where the perf pending task callback would
see an already freed event
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOMqHUACgkQEsHwGGHe
 VUpRzw/9Gow+0wbm2XhMuweUA6t3LgNweOmzDl9w8k1f55OD6niCvuDiF9jSaiKZ
 UwGyErasp2dlEVjuNGnp42qSHos3vRiR7sdZZQG+7opWV2FFyxyFpx5x8UEgVnFy
 gOuEij5vLXBApUdNRAcVqCbvivs4Lv6SggDyQ075zGzuOmUv57vw2jDt8YfKaFcp
 jZTiL+j5GKwihndDB6ayx+7Gwo9a9ASKrTgz8JK2tPOIHZR4X9y9ot1IanZnxzwF
 d0kFpLgF/ZqjPRpJoaFn/jgk1AfahQyYHXh7lQ1aP7rLSLRRGcfTBX4n9nC3BYT+
 EHaA94l151L1mzbR69ij9tryAERU4NlguD/FIuCeW+6IEPiuwBNGklXF+rRegNj4
 IYC0ZSld/NyWKtOrwNSrFRMsxFm583Pg6TaBkvU1rGd5YVQ7GImrj7UjecXO/W71
 iXpfarF7ur2zmd+5+F5FB34VYw8GumRo+D+XIb34+8UMBURTX36hgXvSC3sVyyCw
 b0c758F3+1zTwm8z52T1RhOOp47t5iWAznwTq6k1cT7788PDXJ9sGYXIpdLpwKcI
 Fuj61alwamGeUciCr0iKGtCLRHayZII7OeQh1VjXuqgCwI3hI2j3EaI9C74WSApn
 ttVInS0Ka2xcu//A1VFltkMOWNMQK9JeTlqdqctwypTL3WVb2XA=
 =jo4r
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.1_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Fix a use-after-free case where the perf pending task callback would
   see an already freed event

* tag 'perf_urgent_for_v6.1_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix perf_pending_task() UaF
2022-12-04 12:36:23 -08:00
Linus Torvalds
01f856ae6d Including fixes from bpf, can and wifi.
Current release - new code bugs:
 
  - eth: mlx5e:
    - use kvfree() in mlx5e_accel_fs_tcp_create()
    - MACsec, fix RX data path 16 RX security channel limit
    - MACsec, fix memory leak when MACsec device is deleted
    - MACsec, fix update Rx secure channel active field
    - MACsec, fix add Rx security association (SA) rule memory leak
 
 Previous releases - regressions:
 
  - wifi: cfg80211: don't allow multi-BSSID in S1G
 
  - stmmac: set MAC's flow control register to reflect current settings
 
  - eth: mlx5:
    - E-switch, fix duplicate lag creation
    - fix use-after-free when reverting termination table
 
 Previous releases - always broken:
 
  - ipv4: fix route deletion when nexthop info is not specified
 
  - bpf: fix a local storage BPF map bug where the value's spin lock
    field can get initialized incorrectly
 
  - tipc: re-fetch skb cb after tipc_msg_validate
 
  - wifi: wilc1000: fix Information Element parsing
 
  - packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
 
  - sctp: fix memory leak in sctp_stream_outq_migrate()
 
  - can: can327: fix potential skb leak when netdev is down
 
  - can: add number of missing netdev freeing on error paths
 
  - aquantia: do not purge addresses when setting the number of rings
 
  - wwan: iosm:
    - fix incorrect skb length leading to truncated packet
    - fix crash in peek throughput test due to skb UAF
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmOGOdYACgkQMUZtbf5S
 IrsknQ//SAoOyDOEu15YzOt8hAupLKoF6MM+D0dwwTEQZLf7IVXCjPpkKtVh7Si7
 YCBoyrqrDs7vwaUrVoKY19Amwov+EYrHCpdC+c7wdZ7uxTaYfUbJJUGmxYOR179o
 lV1+1Aiqg9F9C6CUsmZ5lDN2Yb7/uPDBICIV8LM+VzJAtXjurBVauyMwAxLxPOAr
 cgvM+h5xzE7DXMF2z8R/mUq5MSIWoJo9hy2UwbV+f2liMTQuw9rwTbyw3d7+H/6p
 xmJcBcVaABjoUEsEhld3NTlYbSEnlFgCQBfDWzf2e4y6jBxO0JepuIc7SZwJFRJY
 XBqdsKcGw5RkgKbksKUgxe126XFX0SUUQEp0UkOIqe15k7eC2yO9uj1gRm6OuV4s
 J94HKzHX9WNV5OQ790Ed2JyIJScztMZlNFVJ/cz2/+iKR42xJg6kaO6Rt2fobtmL
 VC2cH+RfHzLl+2+7xnfzXEDgFePSBlA02Aq1wihU3zB3r7WCFHchEf9T7sGt1QF0
 03R+8E3+N2tYqphPAXyDoy6kXQJTPxJHAe1FNHJlwgfieUDEWZi/Pm+uQrKIkDeo
 oq9MAV2QBNSD1w4wl7cXfvicO5kBr/OP6YBqwkpsGao2jCSIgkWEX2DRrUaLczXl
 5/Z+m/gCO5tAEcVRYfMivxUIon//9EIhbErVpHTlNWpRHk24eS4=
 =0Lnw
 -----END PGP SIGNATURE-----

Merge tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, can and wifi.

  Current release - new code bugs:

   - eth: mlx5e:
      - use kvfree() in mlx5e_accel_fs_tcp_create()
      - MACsec, fix RX data path 16 RX security channel limit
      - MACsec, fix memory leak when MACsec device is deleted
      - MACsec, fix update Rx secure channel active field
      - MACsec, fix add Rx security association (SA) rule memory leak

  Previous releases - regressions:

   - wifi: cfg80211: don't allow multi-BSSID in S1G

   - stmmac: set MAC's flow control register to reflect current settings

   - eth: mlx5:
      - E-switch, fix duplicate lag creation
      - fix use-after-free when reverting termination table

  Previous releases - always broken:

   - ipv4: fix route deletion when nexthop info is not specified

   - bpf: fix a local storage BPF map bug where the value's spin lock
     field can get initialized incorrectly

   - tipc: re-fetch skb cb after tipc_msg_validate

   - wifi: wilc1000: fix Information Element parsing

   - packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE

   - sctp: fix memory leak in sctp_stream_outq_migrate()

   - can: can327: fix potential skb leak when netdev is down

   - can: add number of missing netdev freeing on error paths

   - aquantia: do not purge addresses when setting the number of rings

   - wwan: iosm:
      - fix incorrect skb length leading to truncated packet
      - fix crash in peek throughput test due to skb UAF"

* tag 'net-6.1-rc8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits)
  net: ethernet: renesas: ravb: Fix promiscuous mode after system resumed
  MAINTAINERS: Update maintainer list for chelsio drivers
  ionic: update MAINTAINERS entry
  sctp: fix memory leak in sctp_stream_outq_migrate()
  packet: do not set TP_STATUS_CSUM_VALID on CHECKSUM_COMPLETE
  net/mlx5: Lag, Fix for loop when checking lag
  Revert "net/mlx5e: MACsec, remove replay window size limitation in offload path"
  net: marvell: prestera: Fix a NULL vs IS_ERR() check in some functions
  net: tun: Fix use-after-free in tun_detach()
  net: mdiobus: fix unbalanced node reference count
  net: hsr: Fix potential use-after-free
  tipc: re-fetch skb cb after tipc_msg_validate
  mptcp: fix sleep in atomic at close time
  mptcp: don't orphan ssk in mptcp_close()
  dsa: lan9303: Correct stat name
  ipv4: Fix route deletion when nexthop info is not specified
  net: wwan: iosm: fix incorrect skb length
  net: wwan: iosm: fix crash in peek throughput test
  net: wwan: iosm: fix dma_alloc_coherent incompatible pointer type
  net: wwan: iosm: fix kernel test robot reported error
  ...
2022-11-29 09:52:10 -08:00
Peter Zijlstra
517e6a301f perf: Fix perf_pending_task() UaF
Per syzbot it is possible for perf_pending_task() to run after the
event is free()'d. There are two related but distinct cases:

 - the task_work was already queued before destroying the event;
 - destroying the event itself queues the task_work.

The first cannot be solved using task_work_cancel() since
perf_release() itself might be called from a task_work (____fput),
which means the current->task_works list is already empty and
task_work_cancel() won't be able to find the perf_pending_task()
entry.

The simplest alternative is extending the perf_event lifetime to cover
the task_work.

The second is just silly, queueing a task_work while you know the
event is going away makes no sense and is easily avoided by
re-arranging how the event is marked STATE_DEAD and ensuring it goes
through STATE_OFF on the way down.

Reported-by: syzbot+9228d6098455bb209ec8@syzkaller.appspotmail.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
2022-11-29 17:42:49 +01:00
Linus Torvalds
cb525a6513 tracing fixes for 6.1:
- Fix osnoise duration type to 64bit not 32bit.
 
 - Have histogram triggers be able to handle an unexpected NULL pointer
   for the record event, that can happen when the histogram first starts up.
 
 - Clear out ring buffers when dynamic events are removed, as the type
   that is saved in the ring buffer is used to read the event, and a
   stale type that is reused by another event could cause use after free
   issues.
 
 - Trivial comment fix.
 
 - Fix memory leak in user_event_create().
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY4UPShQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qo3qAQCz2R814gCo54WEDcvRUmAmF9BSE2b5
 wtaDfeyXQYbaMwEA4+cuNoMMlQvbVbmcss/Uh7RpRCB8+24iypeiEZNtIgg=
 =GPrG
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix osnoise duration type to 64bit not 32bit

 - Have histogram triggers be able to handle an unexpected NULL pointer
   for the record event, which can happen when the histogram first
   starts up

 - Clear out ring buffers when dynamic events are removed, as the type
   that is saved in the ring buffer is used to read the event, and a
   stale type that is reused by another event could cause use after free
   issues

 - Trivial comment fix

 - Fix memory leak in user_event_create()

* tag 'trace-v6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Free buffers when a used dynamic event is removed
  tracing: Add tracing_reset_all_online_cpus_unlocked() function
  tracing: Fix race where histograms can be called before the event
  tracing/osnoise: Fix duration type
  tracing/user_events: Fix memory leak in user_event_create()
  tracing/hist: add in missing * in comment blocks
2022-11-28 14:42:29 -08:00
Linus Torvalds
5afcab2217 - Two more fixes to the perf sigtrap handling:
- output the address in the sample only when it has been requested
  - handle the case where user-only events can hit in kernel and thus
    upset the sigtrap sanity checking
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmODP5wACgkQEsHwGGHe
 VUqmBhAAnxQTCg4agqtKQyHAnFGZb09dE6pYghcpJzF2sOh9sffCYl9XYd53AGVS
 z7nTGagPmT0AxdtjbFvDnxXuvpPVfcS9rIKG6oBzPUcXSVKr9PsqzZJJiAibg+uP
 OSkC75tqNgBTnyDSGc+C5yd7LmWV98Y66t0fhwGUTHpAPY+vkbvYvi8yJYfVqZsA
 Fc/6j+MJMFO1/52GoP42HVs2F7oyuXceJAAVhyYkYitC23mvrK2BRTZAkKsfRKBk
 Un5gFMo+iFUTrOMo48f8BMNEtgfC5iAPPK2ZlFGwTEbDQ98+inf461s8jZq/Yeg7
 KcdHw0Tk1rynBrFlbbUQArDEPyqzmj10FMVZ5VgUNdMneGEy+EUdpcoilX0ZccaA
 NMnNDMPpU4S0cbuGn43Y2Jlv4QA9oECHqZ8NndAoJ5domR9g12WyVibD5W/bYBYQ
 4Z2lv6luOCfWd2xQgT4s0tfMjwQb7NKmTrT4No9je7/IBDS5YjzBjPriAd4kbhFj
 MUt6KKLK0fN4ASmxajXp6iYRALL2nvgJ2w1gzO0pcjMGCJ1WZBKDdfbNLbgulCwq
 GJHNc55F7lWmkMNFLbWb28YLHjXGskQj4J/RARFszQq9khQnPEFj4X69StQfrghT
 77rJF9CrS3KnxG4gCNWe5qvfQb9+Fi7IhwyWDI9VR0DITxGogvY=
 =9CQa
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.1_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:
 "Two more fixes to the perf sigtrap handling:

   - output the address in the sample only when it has been requested

   - handle the case where user-only events can hit in kernel and thus
     upset the sigtrap sanity checking"

* tag 'perf_urgent_for_v6.1_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Consider OS filter fail
  perf: Fixup SIGTRAP and sample_flags interaction
2022-11-27 11:53:41 -08:00
Linus Torvalds
88817acb8b Power management fixes for 6.1-rc7
- Revert a recent schedutil cpufreq governor change that introduced
    a performace regression on Pixel 6 (Sam Wu).
 
  - Fix amd-pstate driver initialization after running the kernel via
    kexec (Wyes Karny).
 
  - Turn amd-pstate into a built-in driver which allows it to take
    precedence over acpi-cpufreq by default on supported systems and
    amend it with a mechanism to disable this behavior (Perry Yuan).
 
  - Update amd-pstate documentation in accordance with the other changes
    made to it (Perry Yuan).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmOA9bgSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxfpsP/j1JTOy3tPCsvPgI+b/RCWq33sm5PpaU
 iFYpfxE9uexy2FyGgsZSqp4fMpMw9fCjiCpv/yfxtLmKccmOyW2kG7IYnNuV5QuN
 FnN6B9WfsLohbuk87H07GpuGZTy4OISjmEGE/gTHXhY6n0eQ9utiX9b8zQuRdhsO
 Gz7N0GaYrDh1lo9HeHNXiXbgIuVSj8iRyzWjJwdeha8NhZxwzQ3tZz5zNcBlCeJ+
 sFtopCWqDZ5DbmKQy4rYXRuW8zC6UUQ4opqcpG5+0e5e51aa0lexJtUqxcyuKqHd
 8plSZEEofCd7XecsvdXGzxhN29FFFIPkHZHic6/GQQou0gC8uGdp/LQMCoiZ3sJ1
 3YblYJkYXMCsi9COjCAesMwy6PENL98XLfQYVmfT/EEHPgyxG1hD0a/xf0UvXzsf
 V2h/gX+gfnDspCtStkrprapQb1WTCgA8d4tOB6x47EpCfY+H3HrNPFNIpJ93ErQJ
 rAe8jAARh1qAbOEWYipc0XmnJLV4aDBAkR+6uE7GHXu6ESOUfvOzpVrQ5d3x4Ohk
 a+vyupwiRGxRRoBjuGFyf/v15g6/UeLi5A/4Qi5SpCM7Uet7RnE/ltxjbgZJZgfG
 gtBJvNaFExWJI5hFZIpz4hubsWvFMaF6SwlXnVMdYGuKZl0w119wAHFLkyO4EmHk
 5RacLxDHEJ4q
 =+qZV
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These revert a recent change in the schedutil cpufreq governor that
  had not been expected to make any functional difference, but turned
  out to introduce a performance regression, fix an initialization issue
  in the amd-pstate driver and make it actually replace the venerable
  ACPI cpufreq driver on the supported systems by default.

  Specifics:

   - Revert a recent schedutil cpufreq governor change that introduced a
     performace regression on Pixel 6 (Sam Wu)

   - Fix amd-pstate driver initialization after running the kernel via
     kexec (Wyes Karny)

   - Turn amd-pstate into a built-in driver which allows it to take
     precedence over acpi-cpufreq by default on supported systems and
     amend it with a mechanism to disable this behavior (Perry Yuan)

   - Update amd-pstate documentation in accordance with the other
     changes made to it (Perry Yuan)"

* tag 'pm-6.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Documentation: add amd-pstate kernel command line options
  Documentation: amd-pstate: add driver working mode introduction
  cpufreq: amd-pstate: add amd-pstate driver parameter for mode selection
  cpufreq: amd-pstate: change amd-pstate driver to be built-in type
  cpufreq: amd-pstate: cpufreq: amd-pstate: reset MSR_AMD_PERF_CTL register at init
  Revert "cpufreq: schedutil: Move max CPU capacity to sugov_policy"
2022-11-25 12:43:33 -08:00
Linus Torvalds
0b1dcc2cf5 24 hotfixes. 8 marked cc:stable and 16 for post-6.0 issues.
There have been a lot of hotfixes this cycle, and this is quite a large
 batch given how far we are into the -rc cycle.  Presumably a reflection of
 the unusually large amount of MM material which went into 6.1-rc1.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY4Bd6gAKCRDdBJ7gKXxA
 jvX6AQCsG1ld24kMpdD+70XXUyC29g/6/jribgtZApHyDYjxSwD/WmLNpPlUPRax
 WB071Y5w65vjSTUKvwU0OLGbHwyxgAw=
 =swD5
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "24 MM and non-MM hotfixes. 8 marked cc:stable and 16 for post-6.0
  issues.

  There have been a lot of hotfixes this cycle, and this is quite a
  large batch given how far we are into the -rc cycle. Presumably a
  reflection of the unusually large amount of MM material which went
  into 6.1-rc1"

* tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
  test_kprobes: fix implicit declaration error of test_kprobes
  nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty
  mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1
  mm: fix unexpected changes to {failslab|fail_page_alloc}.attr
  swapfile: fix soft lockup in scan_swap_map_slots
  hugetlb: fix __prep_compound_gigantic_page page flag setting
  kfence: fix stack trace pruning
  proc/meminfo: fix spacing in SecPageTables
  mm: multi-gen LRU: retry folios written back while isolated
  mailmap: update email address for Satya Priya
  mm/migrate_device: return number of migrating pages in args->cpages
  kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible
  MAINTAINERS: update Alex Hung's email address
  mailmap: update Alex Hung's email address
  mm: mmap: fix documentation for vma_mas_szero
  mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed
  mm/memory: return vm_fault_t result from migrate_to_ram() callback
  mm: correctly charge compressed memory to its memcg
  ipc/shm: call underlying open/close vm_ops
  gcov: clang: fix the buffer overflow issue
  ...
2022-11-25 10:18:25 -08:00
Peter Zijlstra
030a976efa perf: Consider OS filter fail
Some PMUs (notably the traditional hardware kind) have boundary issues
with the OS filter. Specifically, it is possible for
perf_event_attr::exclude_kernel=1 events to trigger in-kernel due to
SKID or errata.

This can upset the sigtrap logic some and trigger the WARN.

However, if this invalid sample is the first we must not loose the
SIGTRAP, OTOH if it is the second, it must not override the
pending_addr with a (possibly) invalid one.

Fixes: ca6c21327c ("perf: Fix missing SIGTRAPs")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Link: https://lkml.kernel.org/r/Y3hDYiXwRnJr8RYG@xpf.sh.intel.com
2022-11-24 10:12:23 +01:00
Peter Zijlstra
af169b7759 perf: Fixup SIGTRAP and sample_flags interaction
The perf_event_attr::sigtrap functionality relies on data->addr being
set. However commit 7b08463015 ("perf: Use sample_flags for addr")
changed this to only initialize data->addr when not 0.

Fixes: 7b08463015 ("perf: Use sample_flags for addr")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/Y3426b4OimE%2FI5po%40hirez.programming.kicks-ass.net
2022-11-24 10:12:23 +01:00
Steven Rostedt (Google)
4313e5a613 tracing: Free buffers when a used dynamic event is removed
After 65536 dynamic events have been added and removed, the "type" field
of the event then uses the first type number that is available (not
currently used by other events). A type number is the identifier of the
binary blobs in the tracing ring buffer (known as events) to map them to
logic that can parse the binary blob.

The issue is that if a dynamic event (like a kprobe event) is traced and
is in the ring buffer, and then that event is removed (because it is
dynamic, which means it can be created and destroyed), if another dynamic
event is created that has the same number that new event's logic on
parsing the binary blob will be used.

To show how this can be an issue, the following can crash the kernel:

 # cd /sys/kernel/tracing
 # for i in `seq 65536`; do
     echo 'p:kprobes/foo do_sys_openat2 $arg1:u32' > kprobe_events
 # done

For every iteration of the above, the writing to the kprobe_events will
remove the old event and create a new one (with the same format) and
increase the type number to the next available on until the type number
reaches over 65535 which is the max number for the 16 bit type. After it
reaches that number, the logic to allocate a new number simply looks for
the next available number. When an dynamic event is removed, that number
is then available to be reused by the next dynamic event created. That is,
once the above reaches the max number, the number assigned to the event in
that loop will remain the same.

Now that means deleting one dynamic event and created another will reuse
the previous events type number. This is where bad things can happen.
After the above loop finishes, the kprobes/foo event which reads the
do_sys_openat2 function call's first parameter as an integer.

 # echo 1 > kprobes/foo/enable
 # cat /etc/passwd > /dev/null
 # cat trace
             cat-2211    [005] ....  2007.849603: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
             cat-2211    [005] ....  2007.849620: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
             cat-2211    [005] ....  2007.849838: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
             cat-2211    [005] ....  2007.849880: foo: (do_sys_openat2+0x0/0x130) arg1=4294967196
 # echo 0 > kprobes/foo/enable

Now if we delete the kprobe and create a new one that reads a string:

 # echo 'p:kprobes/foo do_sys_openat2 +0($arg2):string' > kprobe_events

And now we can the trace:

 # cat trace
        sendmail-1942    [002] .....   530.136320: foo: (do_sys_openat2+0x0/0x240) arg1=             cat-2046    [004] .....   530.930817: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
             cat-2046    [004] .....   530.930961: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
             cat-2046    [004] .....   530.934278: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
             cat-2046    [004] .....   530.934563: foo: (do_sys_openat2+0x0/0x240) arg1="������������������������������������������������������������������������������������������������"
            bash-1515    [007] .....   534.299093: foo: (do_sys_openat2+0x0/0x240) arg1="kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk���������@��4Z����;Y�����U

And dmesg has:

==================================================================
BUG: KASAN: use-after-free in string+0xd4/0x1c0
Read of size 1 at addr ffff88805fdbbfa0 by task cat/2049

 CPU: 0 PID: 2049 Comm: cat Not tainted 6.1.0-rc6-test+ #641
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x77
  print_report+0x17f/0x47b
  kasan_report+0xad/0x130
  string+0xd4/0x1c0
  vsnprintf+0x500/0x840
  seq_buf_vprintf+0x62/0xc0
  trace_seq_printf+0x10e/0x1e0
  print_type_string+0x90/0xa0
  print_kprobe_event+0x16b/0x290
  print_trace_line+0x451/0x8e0
  s_show+0x72/0x1f0
  seq_read_iter+0x58e/0x750
  seq_read+0x115/0x160
  vfs_read+0x11d/0x460
  ksys_read+0xa9/0x130
  do_syscall_64+0x3a/0x90
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x7fc2e972ade2
 Code: c0 e9 b2 fe ff ff 50 48 8d 3d b2 3f 0a 00 e8 05 f0 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
 RSP: 002b:00007ffc64e687c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
 RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fc2e972ade2
 RDX: 0000000000020000 RSI: 00007fc2e980d000 RDI: 0000000000000003
 RBP: 00007fc2e980d000 R08: 00007fc2e980c010 R09: 0000000000000000
 R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000020f00
 R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
  </TASK>

 The buggy address belongs to the physical page:
 page:ffffea00017f6ec0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5fdbb
 flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
 raw: 000fffffc0000000 0000000000000000 ffffea00017f6ec8 0000000000000000
 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff88805fdbbe80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff88805fdbbf00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 >ffff88805fdbbf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                ^
  ffff88805fdbc000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff88805fdbc080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ==================================================================

This was found when Zheng Yejian sent a patch to convert the event type
number assignment to use IDA, which gives the next available number, and
this bug showed up in the fuzz testing by Yujie Liu and the kernel test
robot. But after further analysis, I found that this behavior is the same
as when the event type numbers go past the 16bit max (and the above shows
that).

As modules have a similar issue, but is dealt with by setting a
"WAS_ENABLED" flag when a module event is enabled, and when the module is
freed, if any of its events were enabled, the ring buffer that holds that
event is also cleared, to prevent reading stale events. The same can be
done for dynamic events.

If any dynamic event that is being removed was enabled, then make sure the
buffers they were enabled in are now cleared.

Link: https://lkml.kernel.org/r/20221123171434.545706e3@gandalf.local.home
Link: https://lore.kernel.org/all/20221110020319.1259291-1-zhengyejian1@huawei.com/

Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Depends-on: e18eb8783e ("tracing: Add tracing_reset_all_online_cpus_unlocked() function")
Depends-on: 5448d44c38 ("tracing: Add unified dynamic event framework")
Depends-on: 6212dd2968 ("tracing/kprobes: Use dyn_event framework for kprobe events")
Depends-on: 065e63f951 ("tracing: Only have rmmod clear buffers that its events were active in")
Depends-on: 575380da8b ("tracing: Only clear trace buffer on module unload if event was traced")
Fixes: 77b44d1b7c ("tracing/kprobes: Rename Kprobe-tracer to kprobe-event")
Reported-by: Zheng Yejian <zhengyejian1@huawei.com>
Reported-by: Yujie Liu <yujie.liu@intel.com>
Reported-by: kernel test robot <yujie.liu@intel.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-23 19:07:12 -05:00
Steven Rostedt (Google)
e18eb8783e tracing: Add tracing_reset_all_online_cpus_unlocked() function
Currently the tracing_reset_all_online_cpus() requires the
trace_types_lock held. But only one caller of this function actually has
that lock held before calling it, and the other just takes the lock so
that it can call it. More users of this function is needed where the lock
is not held.

Add a tracing_reset_all_online_cpus_unlocked() function for the one use
case that calls it without being held, and also add a lockdep_assert to
make sure it is held when called.

Then have tracing_reset_all_online_cpus() take the lock internally, such
that callers do not need to worry about taking it.

Link: https://lkml.kernel.org/r/20221123192741.658273220@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-23 19:06:11 -05:00
Steven Rostedt (Google)
ef38c79a52 tracing: Fix race where histograms can be called before the event
commit 94eedf3dde ("tracing: Fix race where eprobes can be called before
the event") fixed an issue where if an event is soft disabled, and the
trigger is being added, there's a small window where the event sees that
there's a trigger but does not see that it requires reading the event yet,
and then calls the trigger with the record == NULL.

This could be solved with adding memory barriers in the hot path, or to
make sure that all the triggers requiring a record check for NULL. The
latter was chosen.

Commit 94eedf3dde set the eprobe trigger handle to check for NULL, but
the same needs to be done with histograms.

Link: https://lore.kernel.org/linux-trace-kernel/20221118211809.701d40c0f8a757b0df3c025a@kernel.org/
Link: https://lore.kernel.org/linux-trace-kernel/20221123164323.03450c3a@gandalf.local.home

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Reported-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-23 19:05:50 -05:00
Mukesh Ojha
a6f810efab gcov: clang: fix the buffer overflow issue
Currently, in clang version of gcov code when module is getting removed
gcov_info_add() incorrectly adds the sfn_ptr->counter to all the
dst->functions and it result in the kernel panic in below crash report. 
Fix this by properly handling it.

[    8.899094][  T599] Unable to handle kernel write to read-only memory at virtual address ffffff80461cc000
[    8.899100][  T599] Mem abort info:
[    8.899102][  T599]   ESR = 0x9600004f
[    8.899103][  T599]   EC = 0x25: DABT (current EL), IL = 32 bits
[    8.899105][  T599]   SET = 0, FnV = 0
[    8.899107][  T599]   EA = 0, S1PTW = 0
[    8.899108][  T599]   FSC = 0x0f: level 3 permission fault
[    8.899110][  T599] Data abort info:
[    8.899111][  T599]   ISV = 0, ISS = 0x0000004f
[    8.899113][  T599]   CM = 0, WnR = 1
[    8.899114][  T599] swapper pgtable: 4k pages, 39-bit VAs, pgdp=00000000ab8de000
[    8.899116][  T599] [ffffff80461cc000] pgd=18000009ffcde003, p4d=18000009ffcde003, pud=18000009ffcde003, pmd=18000009ffcad003, pte=00600000c61cc787
[    8.899124][  T599] Internal error: Oops: 9600004f [#1] PREEMPT SMP
[    8.899265][  T599] Skip md ftrace buffer dump for: 0x1609e0
....
..,
[    8.899544][  T599] CPU: 7 PID: 599 Comm: modprobe Tainted: G S         OE     5.15.41-android13-8-g38e9b1af6bce #1
[    8.899547][  T599] Hardware name: XXX (DT)
[    8.899549][  T599] pstate: 82400005 (Nzcv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
[    8.899551][  T599] pc : gcov_info_add+0x9c/0xb8
[    8.899557][  T599] lr : gcov_event+0x28c/0x6b8
[    8.899559][  T599] sp : ffffffc00e733b00
[    8.899560][  T599] x29: ffffffc00e733b00 x28: ffffffc00e733d30 x27: ffffffe8dc297470
[    8.899563][  T599] x26: ffffffe8dc297000 x25: ffffffe8dc297000 x24: ffffffe8dc297000
[    8.899566][  T599] x23: ffffffe8dc0a6200 x22: ffffff880f68bf20 x21: 0000000000000000
[    8.899569][  T599] x20: ffffff880f68bf00 x19: ffffff8801babc00 x18: ffffffc00d7f9058
[    8.899572][  T599] x17: 0000000000088793 x16: ffffff80461cbe00 x15: 9100052952800785
[    8.899575][  T599] x14: 0000000000000200 x13: 0000000000000041 x12: 9100052952800785
[    8.899577][  T599] x11: ffffffe8dc297000 x10: ffffffe8dc297000 x9 : ffffff80461cbc80
[    8.899580][  T599] x8 : ffffff8801babe80 x7 : ffffffe8dc2ec000 x6 : ffffffe8dc2ed000
[    8.899583][  T599] x5 : 000000008020001f x4 : fffffffe2006eae0 x3 : 000000008020001f
[    8.899586][  T599] x2 : ffffff8027c49200 x1 : ffffff8801babc20 x0 : ffffff80461cb3a0
[    8.899589][  T599] Call trace:
[    8.899590][  T599]  gcov_info_add+0x9c/0xb8
[    8.899592][  T599]  gcov_module_notifier+0xbc/0x120
[    8.899595][  T599]  blocking_notifier_call_chain+0xa0/0x11c
[    8.899598][  T599]  do_init_module+0x2a8/0x33c
[    8.899600][  T599]  load_module+0x23cc/0x261c
[    8.899602][  T599]  __arm64_sys_finit_module+0x158/0x194
[    8.899604][  T599]  invoke_syscall+0x94/0x2bc
[    8.899607][  T599]  el0_svc_common+0x1d8/0x34c
[    8.899609][  T599]  do_el0_svc+0x40/0x54
[    8.899611][  T599]  el0_svc+0x94/0x2f0
[    8.899613][  T599]  el0t_64_sync_handler+0x88/0xec
[    8.899615][  T599]  el0t_64_sync+0x1b4/0x1b8
[    8.899618][  T599] Code: f905f56c f86e69ec f86e6a0f 8b0c01ec (f82e6a0c)
[    8.899620][  T599] ---[ end trace ed5218e9e5b6e2e6 ]---

Link: https://lkml.kernel.org/r/1668020497-13142-1-git-send-email-quic_mojha@quicinc.com
Fixes: e178a5beb3 ("gcov: clang support")
Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Tom Rix <trix@redhat.com>
Cc: <stable@vger.kernel.org>	[5.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22 18:50:41 -08:00
Daniel Bristot de Oliveira
022632f6c4 tracing/osnoise: Fix duration type
The duration type is a 64 long value, not an int. This was
causing some long noise to report wrong values.

Change the duration to a 64 bits value.

Link: https://lkml.kernel.org/r/a93d8a8378c7973e9c609de05826533c9e977939.1668692096.git.bristot@kernel.org

Cc: stable@vger.kernel.org
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-22 18:12:01 -05:00
Xiu Jianfeng
ccc6e59007 tracing/user_events: Fix memory leak in user_event_create()
Before current_user_event_group(), it has allocated memory and save it
in @name, this should freed before return error.

Link: https://lkml.kernel.org/r/20221115014445.158419-1-xiujianfeng@huawei.com

Fixes: e5d271812e ("tracing/user_events: Move pages/locks into groups to prepare for namespaces")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-22 18:09:50 -05:00
Colin Ian King
0a068f4a71 tracing/hist: add in missing * in comment blocks
There are a couple of missing * in comment blocks. Fix these.
Cleans up two clang warnings:

kernel/trace/trace_events_hist.c:986: warning: bad line:
kernel/trace/trace_events_hist.c:3229: warning: bad line:

Link: https://lkml.kernel.org/r/20221020133019.1547587-1-colin.i.king@gmail.com

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-22 16:17:33 -05:00
Sam Wu
cdcc5ef26b Revert "cpufreq: schedutil: Move max CPU capacity to sugov_policy"
This reverts commit 6d5afdc97e.

On a Pixel 6 device, it is observed that this commit increases
latency by approximately 50ms, or 20%, in migrating a task
that requires full CPU utilization from a LITTLE CPU to Fmax
on a big CPU. Reverting this change restores the latency back
to its original baseline value.

Fixes: 6d5afdc97e ("cpufreq: schedutil: Move max CPU capacity to sugov_policy")
Signed-off-by: Sam Wu <wusamuel@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-11-22 19:56:52 +01:00
Xu Kuohai
836e49e103 bpf: Do not copy spin lock field from user in bpf_selem_alloc
bpf_selem_alloc function is used by inode_storage, sk_storage and
task_storage maps to set map value, for these map types, there may
be a spin lock in the map value, so if we use memcpy to copy the whole
map value from user, the spin lock field may be initialized incorrectly.

Since the spin lock field is zeroed by kzalloc, call copy_map_value
instead of memcpy to skip copying the spin lock field to fix it.

Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20221114134720.1057939-2-xukuohai@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-21 11:45:37 -08:00
Linus Torvalds
c6c67bf9bc tracing/probes: Fixes for v6.1
- Fix possible NULL pointer dereference  on trace_event_file in kprobe_event_gen_test_exit()
 
 - Fix NULL pointer dereference for trace_array in kprobe_event_gen_test_exit()
 
 - Fix memory leak of filter string for eprobes
 
 - Fix a possible memory leak in rethook_alloc()
 
 - Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
   which can cause a possible use-after-free
 
 - Fix warning in eprobe filter creation
 
 - Fix eprobe filter creation as it picked the wrong event for the fields
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY3qRDhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qgCdAP0cB1ZjmMM7O8OFdnrTV0jfavZTnNNC
 ut9sczpYQ0upcwEArmVvB+H3sTM6Y7PrsQEUn8gsc7WmieUoDAOr0hIe4AI=
 =DGVC
 -----END PGP SIGNATURE-----

Merge tag 'trace-probes-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing/probes fixes from Steven Rostedt:

 - Fix possible NULL pointer dereference on trace_event_file in
   kprobe_event_gen_test_exit()

 - Fix NULL pointer dereference for trace_array in
   kprobe_event_gen_test_exit()

 - Fix memory leak of filter string for eprobes

 - Fix a possible memory leak in rethook_alloc()

 - Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case which
   can cause a possible use-after-free

 - Fix warning in eprobe filter creation

 - Fix eprobe filter creation as it picked the wrong event for the
   fields

* tag 'trace-probes-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/eprobe: Fix eprobe filter to make a filter correctly
  tracing/eprobe: Fix warning in filter creation
  kprobes: Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
  rethook: fix a potential memleak in rethook_alloc()
  tracing/eprobe: Fix memory leak of filter string
  tracing: kprobe: Fix potential null-ptr-deref on trace_array in kprobe_event_gen_test_exit()
  tracing: kprobe: Fix potential null-ptr-deref on trace_event_file in kprobe_event_gen_test_exit()
2022-11-20 15:31:20 -08:00
Linus Torvalds
5239ddeb48 Tracing fixes:
- Fix polling to block on watermark like the reads do, as user space
   applications get confused when the select says read is available, and then
   the read blocks.
 
 - Fix accounting of ring buffer dropped pages as it is what is used to
   determine if the buffer is empty or not.
 
 - Fix memory leak in tracing_read_pipe()
 
 - Fix struct trace_array warning about being declared in parameters
 
 - Fix accounting of ftrace pages used in output at start up.
 
 - Fix allocation of dyn_ftrace pages by subtracting one from order instead of
   diving it by 2
 
 - Static analyzer found a case were a pointer being used outside of a NULL
   check. (rb_head_page_deactivate())
 
 - Fix possible NULL pointer dereference if kstrdup() fails in ftrace_add_mod()
 
 - Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()
 
 - Fix bad pointer dereference in register_synth_event() on error path.
 
 - Remove unused __bad_type_size() method
 
 - Fix possible NULL pointer dereference of entry in list 'tr->err_log'
 
 - Fix NULL pointer deference race if eprobe is called before the event setup
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY3qNNBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiVzAP9vdtLkseOueVqPJ/Wc6v3z0xlkxO4L
 Aj9jOac822SPOQEAvUJ1DM1bxm/D2BY5AQsfgSGjdaVYP+I3kvETNgWspQI=
 =3ta3
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix polling to block on watermark like the reads do, as user space
   applications get confused when the select says read is available, and
   then the read blocks

 - Fix accounting of ring buffer dropped pages as it is what is used to
   determine if the buffer is empty or not

 - Fix memory leak in tracing_read_pipe()

 - Fix struct trace_array warning about being declared in parameters

 - Fix accounting of ftrace pages used in output at start up.

 - Fix allocation of dyn_ftrace pages by subtracting one from order
   instead of diving it by 2

 - Static analyzer found a case were a pointer being used outside of a
   NULL check (rb_head_page_deactivate())

 - Fix possible NULL pointer dereference if kstrdup() fails in
   ftrace_add_mod()

 - Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()

 - Fix bad pointer dereference in register_synth_event() on error path

 - Remove unused __bad_type_size() method

 - Fix possible NULL pointer dereference of entry in list 'tr->err_log'

 - Fix NULL pointer deference race if eprobe is called before the event
   setup

* tag 'trace-v6.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix race where eprobes can be called before the event
  tracing: Fix potential null-pointer-access of entry in list 'tr->err_log'
  tracing: Remove unused __bad_type_size() method
  tracing: Fix wild-memory-access in register_synth_event()
  tracing: Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()
  ftrace: Fix null pointer dereference in ftrace_add_mod()
  ring_buffer: Do not deactivate non-existant pages
  ftrace: Optimize the allocation for mcount entries
  ftrace: Fix the possible incorrect kernel message
  tracing: Fix warning on variable 'struct trace_array'
  tracing: Fix memory leak in tracing_read_pipe()
  ring-buffer: Include dropped pages in counting dirty patches
  tracing/ring-buffer: Have polling block on watermark
2022-11-20 15:25:32 -08:00
Steven Rostedt (Google)
94eedf3dde tracing: Fix race where eprobes can be called before the event
The flag that tells the event to call its triggers after reading the event
is set for eprobes after the eprobe is enabled. This leads to a race where
the eprobe may be triggered at the beginning of the event where the record
information is NULL. The eprobe then dereferences the NULL record causing
a NULL kernel pointer bug.

Test for a NULL record to keep this from happening.

Link: https://lore.kernel.org/linux-trace-kernel/20221116192552.1066630-1-rafaelmendsr@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20221117214249.2addbe10@gandalf.local.home

Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-20 14:05:50 -05:00
Linus Torvalds
d4f754c361 - Fix a small race on the task's exit path where there's a
misunderstanding whether the task holds rq->lock or not
 
 - Prevent processes from getting killed when using deprecated or unknown
 rseq ABI flags in order to be able to fuzz the rseq() syscall with
 syzkaller
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmN6FH0ACgkQEsHwGGHe
 VUrsFw//QdGg/ZpGdeGH5njOFX9KCLrcMojz+Ip7vn4TeWiT/0J70jym4Pxkvwls
 Cz1s47kmhEE56xYi0gMJJi6XtjYWh1/sptuTkS+wHNp+NliaHMnmPPY08wL6/ZEq
 Zm8KP1e3OHjl+k82yhuIeYAFFdUpbLlQkjCKHqQkBlemh0MF+GeTavHDWe3kLolb
 arEld+kIo7OUkBDmVGk6BaE3R0DuCYGokDcj9xGNx0alcQCtu+oT+T9jYMgwBIn2
 zGpfM2jeB4nwYKYv+Mg6tY38vaz6bHZWSoDhaB/+GzewhZMdGGCo3SdN207WJHUK
 TSjjz6a5sMJSZTpVW0K6MxJhv245KNt3Tc/JpDtqVq4KfKNcv0Rs4bDIWWyIsvow
 KlSUvRBeDT3rx9152vB72GiPptZogWsTc5dM0/ZlBaCfM4UBreYgO37ZYYK9hPdx
 M2rCOafUSwkDFG+A5JCWM7d5sPAndvAAX29qqmrgbTVSV+XKAx3UdWpGJTEL8bSF
 OWIFhVd2nqOglL41c1Rm4zk9TJ3k9mMsgnsuskgclmAI7GmRkPADHlv19bRRe8UL
 TEDBOCQaNYFkjkp+4IM/I3GXGa+SiOHEiK7G2CjJEeHbzWoGKQPFpJtOtH4zHNV+
 m32Ek2XBw1zKjfAyr1WGyA+GE4tTIIphqgonu4cA2cKCYjm/SG4=
 =I+nD
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Fix a small race on the task's exit path where there's a
   misunderstanding whether the task holds rq->lock or not

 - Prevent processes from getting killed when using deprecated or
   unknown rseq ABI flags in order to be able to fuzz the rseq() syscall
   with syzkaller

* tag 'sched_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix race in task_call_func()
  rseq: Use pr_warn_once() when deprecated/unknown ABI flags are encountered
2022-11-20 10:43:52 -08:00
Linus Torvalds
eb0ef8add5 - Fix an intel PT erratum where CPUs do not support single range output
for more than 4K
 
 - Fix a NULL ptr dereference which can happen after an NMI interferes
 with the event enabling dance in amd_pmu_enable_all()
 
 - Free the events array too when freeing uncore contexts on CPU online,
   thereby fixing a memory leak
 
 - Improve the pending SIGTRAP check
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmN6DfgACgkQEsHwGGHe
 VUoVUxAAuAaPNOil4T95BGBYL7O7w5VvUQtaq9TtSm1ZwDlz63BNFwrLdlyi3pnc
 4J84lWB1m0hPJHggqqWwBcIRL4xgrNZN4F0Q641Ns8IcAUtYbuQ4Gxm4io5IysFN
 DYph6EZ597S7izdZD/S0wETqjgE5/zs16x2/E9E9aJ9CH6Kewqa0y5bXhieF4+UF
 o0z55KB8YtS19BvopZLunj3tvwOSgAqgkgW7UtbR6clhcXBiq6/6bAjEBxoPrYRE
 7B8LEEMol7ERyIZHLE/xA38D6o1BhlD9OnVNpwYtDeYe63dtPrQAfjDwkk7gOdWB
 H9mNHqYB4kPMaUt4sDfDRTVYvioUXH0y7gjvl54efUtnj3vdxhUFzjD+JFh4nzrr
 efCU+Cy8TzYV1V7c9yn9JjIp7w0rLX2jMlPvL8GGwlYQ8nXQKR6OHi3Dw9kYLKrT
 /jtAxkv8+enbUiMukw0AyhJfgzSM8OkUIjtHGTgkY7Q2BEv+yC7ogvD9wIozocLa
 w0EIJo5wtLUHu78u5+5tIOZYcZq1A89lPV8TXNzCGU6+xwm3wRzpvcAICp285NqA
 gXBEtWGULUB+paz4db5tnChP9xK8MOt4Ro3/Y00Hxr3ZgO6zJ2nc7j8fENX9gON0
 YAeZB9yPPcvy6oABYcVDH3sG/wyN5wCnlPGxrVom4EYRRWzhAL8=
 =9ZHJ
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Fix an intel PT erratum where CPUs do not support single range output
   for more than 4K

 - Fix a NULL ptr dereference which can happen after an NMI interferes
   with the event enabling dance in amd_pmu_enable_all()

 - Free the events array too when freeing uncore contexts on CPU online,
   thereby fixing a memory leak

 - Improve the pending SIGTRAP check

* tag 'perf_urgent_for_v6.1_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/pt: Fix sampling using single range output
  perf/x86/amd: Fix crash due to race between amd_pmu_enable_all, perf NMI and throttling
  perf/x86/amd/uncore: Fix memory leak for events array
  perf: Improve missing SIGTRAP checking
2022-11-20 10:41:14 -08:00
Zheng Yejian
067df9e0ad tracing: Fix potential null-pointer-access of entry in list 'tr->err_log'
Entries in list 'tr->err_log' will be reused after entry number
exceed TRACING_LOG_ERRS_MAX.

The cmd string of the to be reused entry will be freed first then
allocated a new one. If the allocation failed, then the entry will
still be in list 'tr->err_log' but its 'cmd' field is set to be NULL,
later access of 'cmd' is risky.

Currently above problem can cause the loss of 'cmd' information of first
entry in 'tr->err_log'. When execute `cat /sys/kernel/tracing/error_log`,
reproduce logs like:
  [   37.495100] trace_kprobe: error: Maxactive is not for kprobe(null) ^
  [   38.412517] trace_kprobe: error: Maxactive is not for kprobe
    Command: p4:myprobe2 do_sys_openat2
            ^

Link: https://lore.kernel.org/linux-trace-kernel/20221114104632.3547266-1-zhengyejian1@huawei.com

Fixes: 1581a884b7 ("tracing: Remove size restriction on tracing_log_err cmd strings")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-17 20:41:01 -05:00
Qiujun Huang
b8752064e3 tracing: Remove unused __bad_type_size() method
__bad_type_size() is unused after
commit 04ae87a52074("ftrace: Rework event_create_dir()").
So, remove it.

Link: https://lkml.kernel.org/r/D062EC2E-7DB7-4402-A67E-33C3577F551E@gmail.com

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-17 20:21:06 -05:00
Masami Hiramatsu (Google)
40adaf51cb tracing/eprobe: Fix eprobe filter to make a filter correctly
Since the eprobe filter was defined based on the eprobe's trace event
itself, it doesn't work correctly. Use the original trace event of
the eprobe when making the filter so that the filter works correctly.

Without this fix:

 # echo 'e syscalls/sys_enter_openat \
	flags_rename=$flags:u32 if flags < 1000' >> dynamic_events
 # echo 1 > events/eprobes/sys_enter_openat/enable
[  114.551550] event trace: Could not enable event sys_enter_openat
-bash: echo: write error: Invalid argument

With this fix:
 # echo 'e syscalls/sys_enter_openat \
	flags_rename=$flags:u32 if flags < 1000' >> dynamic_events
 # echo 1 > events/eprobes/sys_enter_openat/enable
 # tail trace
cat-241     [000] ...1.   266.498449: sys_enter_openat: (syscalls.sys_enter_openat) flags_rename=0
cat-242     [000] ...1.   266.977640: sys_enter_openat: (syscalls.sys_enter_openat) flags_rename=0

Link: https://lore.kernel.org/all/166823166395.1385292.8931770640212414483.stgit@devnote3/

Fixes: 752be5c5c9 ("tracing/eprobe: Add eprobe filter support")
Reported-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Tested-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2022-11-18 10:15:34 +09:00
Rafael Mendonca
342a4a2f99 tracing/eprobe: Fix warning in filter creation
The filter pointer (filterp) passed to create_filter() function must be a
pointer that references a NULL pointer, otherwise, we get a warning when
adding a filter option to the event probe:

root@localhost:/sys/kernel/tracing# echo 'e:egroup/stat_runtime_4core sched/sched_stat_runtime \
        runtime=$runtime:u32 if cpu < 4' >> dynamic_events
[ 5034.340439] ------------[ cut here ]------------
[ 5034.341258] WARNING: CPU: 0 PID: 223 at kernel/trace/trace_events_filter.c:1939 create_filter+0x1db/0x250
[...] stripped
[ 5034.345518] RIP: 0010:create_filter+0x1db/0x250
[...] stripped
[ 5034.351604] Call Trace:
[ 5034.351803]  <TASK>
[ 5034.351959]  ? process_preds+0x1b40/0x1b40
[ 5034.352241]  ? rcu_read_lock_bh_held+0xd0/0xd0
[ 5034.352604]  ? kasan_set_track+0x29/0x40
[ 5034.352904]  ? kasan_save_alloc_info+0x1f/0x30
[ 5034.353264]  create_event_filter+0x38/0x50
[ 5034.353573]  __trace_eprobe_create+0x16f4/0x1d20
[ 5034.353964]  ? eprobe_dyn_event_release+0x360/0x360
[ 5034.354363]  ? mark_held_locks+0xa6/0xf0
[ 5034.354684]  ? _raw_spin_unlock_irqrestore+0x35/0x60
[ 5034.355105]  ? trace_hardirqs_on+0x41/0x120
[ 5034.355417]  ? _raw_spin_unlock_irqrestore+0x35/0x60
[ 5034.355751]  ? __create_object+0x5b7/0xcf0
[ 5034.356027]  ? lock_is_held_type+0xaf/0x120
[ 5034.356362]  ? rcu_read_lock_bh_held+0xb0/0xd0
[ 5034.356716]  ? rcu_read_lock_bh_held+0xd0/0xd0
[ 5034.357084]  ? kasan_set_track+0x29/0x40
[ 5034.357411]  ? kasan_save_alloc_info+0x1f/0x30
[ 5034.357715]  ? __kasan_kmalloc+0xb8/0xc0
[ 5034.357985]  ? write_comp_data+0x2f/0x90
[ 5034.358302]  ? __sanitizer_cov_trace_pc+0x25/0x60
[ 5034.358691]  ? argv_split+0x381/0x460
[ 5034.358949]  ? write_comp_data+0x2f/0x90
[ 5034.359240]  ? eprobe_dyn_event_release+0x360/0x360
[ 5034.359620]  trace_probe_create+0xf6/0x110
[ 5034.359940]  ? trace_probe_match_command_args+0x240/0x240
[ 5034.360376]  eprobe_dyn_event_create+0x21/0x30
[ 5034.360709]  create_dyn_event+0xf3/0x1a0
[ 5034.360983]  trace_parse_run_command+0x1a9/0x2e0
[ 5034.361297]  ? dyn_event_release+0x500/0x500
[ 5034.361591]  dyn_event_write+0x39/0x50
[ 5034.361851]  vfs_write+0x311/0xe50
[ 5034.362091]  ? dyn_event_seq_next+0x40/0x40
[ 5034.362376]  ? kernel_write+0x5b0/0x5b0
[ 5034.362637]  ? write_comp_data+0x2f/0x90
[ 5034.362937]  ? __sanitizer_cov_trace_pc+0x25/0x60
[ 5034.363258]  ? ftrace_syscall_enter+0x544/0x840
[ 5034.363563]  ? write_comp_data+0x2f/0x90
[ 5034.363837]  ? __sanitizer_cov_trace_pc+0x25/0x60
[ 5034.364156]  ? write_comp_data+0x2f/0x90
[ 5034.364468]  ? write_comp_data+0x2f/0x90
[ 5034.364770]  ksys_write+0x158/0x2a0
[ 5034.365022]  ? __ia32_sys_read+0xc0/0xc0
[ 5034.365344]  __x64_sys_write+0x7c/0xc0
[ 5034.365669]  ? syscall_enter_from_user_mode+0x53/0x70
[ 5034.366084]  do_syscall_64+0x60/0x90
[ 5034.366356]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 5034.366767] RIP: 0033:0x7ff0b43938f3
[...] stripped
[ 5034.371892]  </TASK>
[ 5034.374720] ---[ end trace 0000000000000000 ]---

Link: https://lore.kernel.org/all/20221108202148.1020111-1-rafaelmendsr@gmail.com/

Fixes: 752be5c5c9 ("tracing/eprobe: Add eprobe filter support")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2022-11-18 10:15:34 +09:00
Li Huafei
5dd7caf0bd kprobes: Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
In __unregister_kprobe_top(), if the currently unregistered probe has
post_handler but other child probes of the aggrprobe do not have
post_handler, the post_handler of the aggrprobe is cleared. If this is
a ftrace-based probe, there is a problem. In later calls to
disarm_kprobe(), we will use kprobe_ftrace_ops because post_handler is
NULL. But we're armed with kprobe_ipmodify_ops. This triggers a WARN in
__disarm_kprobe_ftrace() and may even cause use-after-free:

  Failed to disarm kprobe-ftrace at kernel_clone+0x0/0x3c0 (error -2)
  WARNING: CPU: 5 PID: 137 at kernel/kprobes.c:1135 __disarm_kprobe_ftrace.isra.21+0xcf/0xe0
  Modules linked in: testKprobe_007(-)
  CPU: 5 PID: 137 Comm: rmmod Not tainted 6.1.0-rc4-dirty #18
  [...]
  Call Trace:
   <TASK>
   __disable_kprobe+0xcd/0xe0
   __unregister_kprobe_top+0x12/0x150
   ? mutex_lock+0xe/0x30
   unregister_kprobes.part.23+0x31/0xa0
   unregister_kprobe+0x32/0x40
   __x64_sys_delete_module+0x15e/0x260
   ? do_user_addr_fault+0x2cd/0x6b0
   do_syscall_64+0x3a/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd
   [...]

For the kprobe-on-ftrace case, we keep the post_handler setting to
identify this aggrprobe armed with kprobe_ipmodify_ops. This way we
can disarm it correctly.

Link: https://lore.kernel.org/all/20221112070000.35299-1-lihuafei1@huawei.com/

Fixes: 0bc11ed5ab ("kprobes: Allow kprobes coexist with livepatch")
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2022-11-18 10:15:34 +09:00
Yi Yang
0a1ebe35cb rethook: fix a potential memleak in rethook_alloc()
In rethook_alloc(), the variable rh is not freed or passed out
if handler is NULL, which could lead to a memleak, fix it.

Link: https://lore.kernel.org/all/20221110104438.88099-1-yiyang13@huawei.com/
[Masami: Add "rethook:" tag to the title.]

Fixes: 54ecbe6f1e ("rethook: Add a generic return hook")
Cc: stable@vger.kernel.org
Signed-off-by: Yi Yang <yiyang13@huawei.com>
Acke-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2022-11-18 10:15:34 +09:00
Rafael Mendonca
d1776c0202 tracing/eprobe: Fix memory leak of filter string
The filter string doesn't get freed when a dynamic event is deleted. If a
filter is set, then memory is leaked:

root@localhost:/sys/kernel/tracing# echo 'e:egroup/stat_runtime_4core \
        sched/sched_stat_runtime runtime=$runtime:u32 if cpu < 4' >> dynamic_events
root@localhost:/sys/kernel/tracing# echo "-:egroup/stat_runtime_4core"  >> dynamic_events
root@localhost:/sys/kernel/tracing# echo scan > /sys/kernel/debug/kmemleak
[  224.416373] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
root@localhost:/sys/kernel/tracing# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810156f1b8 (size 8):
  comm "bash", pid 224, jiffies 4294935612 (age 55.800s)
  hex dump (first 8 bytes):
    63 70 75 20 3c 20 34 00                          cpu < 4.
  backtrace:
    [<000000009f880725>] __kmem_cache_alloc_node+0x18e/0x720
    [<0000000042492946>] __kmalloc+0x57/0x240
    [<0000000034ea7995>] __trace_eprobe_create+0x1214/0x1d30
    [<00000000d70ef730>] trace_probe_create+0xf6/0x110
    [<00000000915c7b16>] eprobe_dyn_event_create+0x21/0x30
    [<000000000d894386>] create_dyn_event+0xf3/0x1a0
    [<00000000e9af57d5>] trace_parse_run_command+0x1a9/0x2e0
    [<0000000080777f18>] dyn_event_write+0x39/0x50
    [<0000000089f0ec73>] vfs_write+0x311/0xe50
    [<000000003da1bdda>] ksys_write+0x158/0x2a0
    [<00000000bb1e616e>] __x64_sys_write+0x7c/0xc0
    [<00000000e8aef1f7>] do_syscall_64+0x60/0x90
    [<00000000fe7fe8ba>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Additionally, in __trace_eprobe_create() function, if an error occurs after
the call to trace_eprobe_parse_filter(), which allocates the filter string,
then memory is also leaked. That can be reproduced by creating the same
event probe twice:

root@localhost:/sys/kernel/tracing# echo 'e:egroup/stat_runtime_4core \
        sched/sched_stat_runtime runtime=$runtime:u32 if cpu < 4' >> dynamic_events
root@localhost:/sys/kernel/tracing# echo 'e:egroup/stat_runtime_4core \
        sched/sched_stat_runtime runtime=$runtime:u32 if cpu < 4' >> dynamic_events
-bash: echo: write error: File exists
root@localhost:/sys/kernel/tracing# echo scan > /sys/kernel/debug/kmemleak
[  207.871584] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
root@localhost:/sys/kernel/tracing# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8881020d17a8 (size 8):
  comm "bash", pid 223, jiffies 4294938308 (age 31.000s)
  hex dump (first 8 bytes):
    63 70 75 20 3c 20 34 00                          cpu < 4.
  backtrace:
    [<000000000e4f5f31>] __kmem_cache_alloc_node+0x18e/0x720
    [<0000000024f0534b>] __kmalloc+0x57/0x240
    [<000000002930a28e>] __trace_eprobe_create+0x1214/0x1d30
    [<0000000028387903>] trace_probe_create+0xf6/0x110
    [<00000000a80d6a9f>] eprobe_dyn_event_create+0x21/0x30
    [<000000007168698c>] create_dyn_event+0xf3/0x1a0
    [<00000000f036bf6a>] trace_parse_run_command+0x1a9/0x2e0
    [<00000000014bde8b>] dyn_event_write+0x39/0x50
    [<0000000078a097f7>] vfs_write+0x311/0xe50
    [<00000000996cb208>] ksys_write+0x158/0x2a0
    [<00000000a3c2acb0>] __x64_sys_write+0x7c/0xc0
    [<0000000006b5d698>] do_syscall_64+0x60/0x90
    [<00000000780e8ecf>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fix both issues by releasing the filter string in
trace_event_probe_cleanup().

Link: https://lore.kernel.org/all/20221108235738.1021467-1-rafaelmendsr@gmail.com/

Fixes: 752be5c5c9 ("tracing/eprobe: Add eprobe filter support")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2022-11-18 10:15:34 +09:00
Shang XiaoJing
22ea4ca963 tracing: kprobe: Fix potential null-ptr-deref on trace_array in kprobe_event_gen_test_exit()
When test_gen_kprobe_cmd() failed after kprobe_event_gen_cmd_end(), it
will goto delete, which will call kprobe_event_delete() and release the
corresponding resource. However, the trace_array in gen_kretprobe_test
will point to the invalid resource. Set gen_kretprobe_test to NULL
after called kprobe_event_delete() to prevent null-ptr-deref.

BUG: kernel NULL pointer dereference, address: 0000000000000070
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 246 Comm: modprobe Tainted: G        W
6.1.0-rc1-00174-g9522dc5c87da-dirty #248
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:__ftrace_set_clr_event_nolock+0x53/0x1b0
Code: e8 82 26 fc ff 49 8b 1e c7 44 24 0c ea ff ff ff 49 39 de 0f 84 3c
01 00 00 c7 44 24 18 00 00 00 00 e8 61 26 fc ff 48 8b 6b 10 <44> 8b 65
70 4c 8b 6d 18 41 f7 c4 00 02 00 00 75 2f
RSP: 0018:ffffc9000159fe00 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88810971d268 RCX: 0000000000000000
RDX: ffff8881080be600 RSI: ffffffff811b48ff RDI: ffff88810971d058
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
R10: ffffc9000159fe58 R11: 0000000000000001 R12: ffffffffa0001064
R13: ffffffffa000106c R14: ffff88810971d238 R15: 0000000000000000
FS:  00007f89eeff6540(0000) GS:ffff88813b600000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000070 CR3: 000000010599e004 CR4: 0000000000330ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __ftrace_set_clr_event+0x3e/0x60
 trace_array_set_clr_event+0x35/0x50
 ? 0xffffffffa0000000
 kprobe_event_gen_test_exit+0xcd/0x10b [kprobe_event_gen_test]
 __x64_sys_delete_module+0x206/0x380
 ? lockdep_hardirqs_on_prepare+0xd8/0x190
 ? syscall_enter_from_user_mode+0x1c/0x50
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f89eeb061b7

Link: https://lore.kernel.org/all/20221108015130.28326-3-shangxiaojing@huawei.com/

Fixes: 64836248dd ("tracing: Add kprobe event command generation test module")
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2022-11-18 10:15:34 +09:00
Shang XiaoJing
e0d75267f5 tracing: kprobe: Fix potential null-ptr-deref on trace_event_file in kprobe_event_gen_test_exit()
When trace_get_event_file() failed, gen_kretprobe_test will be assigned
as the error code. If module kprobe_event_gen_test is removed now, the
null pointer dereference will happen in kprobe_event_gen_test_exit().
Check if gen_kprobe_test or gen_kretprobe_test is error code or NULL
before dereference them.

BUG: kernel NULL pointer dereference, address: 0000000000000012
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 3 PID: 2210 Comm: modprobe Not tainted
6.1.0-rc1-00171-g2159299a3b74-dirty #217
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:kprobe_event_gen_test_exit+0x1c/0xb5 [kprobe_event_gen_test]
Code: Unable to access opcode bytes at 0xffffffff9ffffff2.
RSP: 0018:ffffc900015bfeb8 EFLAGS: 00010246
RAX: ffffffffffffffea RBX: ffffffffa0002080 RCX: 0000000000000000
RDX: ffffffffa0001054 RSI: ffffffffa0001064 RDI: ffffffffdfc6349c
RBP: ffffffffa0000000 R08: 0000000000000004 R09: 00000000001e95c0
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000800
R13: ffffffffa0002420 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f56b75be540(0000) GS:ffff88813bc00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff9ffffff2 CR3: 000000010874a006 CR4: 0000000000330ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __x64_sys_delete_module+0x206/0x380
 ? lockdep_hardirqs_on_prepare+0xd8/0x190
 ? syscall_enter_from_user_mode+0x1c/0x50
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Link: https://lore.kernel.org/all/20221108015130.28326-2-shangxiaojing@huawei.com/

Fixes: 64836248dd ("tracing: Add kprobe event command generation test module")
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2022-11-18 10:15:33 +09:00
Shang XiaoJing
1b5f1c34d3 tracing: Fix wild-memory-access in register_synth_event()
In register_synth_event(), if set_synth_event_print_fmt() failed, then
both trace_remove_event_call() and unregister_trace_event() will be
called, which means the trace_event_call will call
__unregister_trace_event() twice. As the result, the second unregister
will causes the wild-memory-access.

register_synth_event
    set_synth_event_print_fmt failed
    trace_remove_event_call
        event_remove
            if call->event.funcs then
            __unregister_trace_event (first call)
    unregister_trace_event
        __unregister_trace_event (second call)

Fix the bug by avoiding to call the second __unregister_trace_event() by
checking if the first one is called.

general protection fault, probably for non-canonical address
	0xfbd59c0000000024: 0000 [#1] SMP KASAN PTI
KASAN: maybe wild-memory-access in range
[0xdead000000000120-0xdead000000000127]
CPU: 0 PID: 3807 Comm: modprobe Not tainted
6.1.0-rc1-00186-g76f33a7eedb4 #299
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:unregister_trace_event+0x6e/0x280
Code: 00 fc ff df 4c 89 ea 48 c1 ea 03 80 3c 02 00 0f 85 0e 02 00 00 48
b8 00 00 00 00 00 fc ff df 4c 8b 63 08 4c 89 e2 48 c1 ea 03 <80> 3c 02
00 0f 85 e2 01 00 00 49 89 2c 24 48 85 ed 74 28 e8 7a 9b
RSP: 0018:ffff88810413f370 EFLAGS: 00010a06
RAX: dffffc0000000000 RBX: ffff888105d050b0 RCX: 0000000000000000
RDX: 1bd5a00000000024 RSI: ffff888119e276e0 RDI: ffffffff835a8b20
RBP: dead000000000100 R08: 0000000000000000 R09: fffffbfff0913481
R10: ffffffff8489a407 R11: fffffbfff0913480 R12: dead000000000122
R13: ffff888105d050b8 R14: 0000000000000000 R15: ffff888105d05028
FS:  00007f7823e8d540(0000) GS:ffff888119e00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7823e7ebec CR3: 000000010a058002 CR4: 0000000000330ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __create_synth_event+0x1e37/0x1eb0
 create_or_delete_synth_event+0x110/0x250
 synth_event_run_command+0x2f/0x110
 test_gen_synth_cmd+0x170/0x2eb [synth_event_gen_test]
 synth_event_gen_test_init+0x76/0x9bc [synth_event_gen_test]
 do_one_initcall+0xdb/0x480
 do_init_module+0x1cf/0x680
 load_module+0x6a50/0x70a0
 __do_sys_finit_module+0x12f/0x1c0
 do_syscall_64+0x3f/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Link: https://lkml.kernel.org/r/20221117012346.22647-3-shangxiaojing@huawei.com

Fixes: 4b147936fa ("tracing: Add support for 'synthetic' events")
Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com>
Cc: stable@vger.kernel.org
Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Cc: <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-11-17 18:24:58 -05:00