In commit c5320926e3 ("mem-hotplug: introduce movable_node boot
option"), the memblock allocation direction is changed to bottom-up and
then back to top-down like this:
1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
2. memblock_set_bottom_up(false), called by x86's numa_init().
Even though (1) occurs in generic mm code, it is wrapped by #ifdef
CONFIG_MOVABLE_NODE, which depends on X86_64.
This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
things will be unbalanced. (1) will happen for them, but (2) will not.
This toggle was added in the first place because x86 has a delay between
adding memblocks and marking them as hotpluggable. Since other arches
do this marking either immediately or not at all, they do not require
the bottom-up toggle.
So, resolve things by moving (1) from cmdline_parse_movable_node() to
x86's setup_arch(), immediately after the movable_node parameter has
been parsed.
Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alistair Popple <apopple@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 RAS updates from Ingo Molnar:
"The main changes in this development cycle were:
- more AMD northbridge support work, mostly in preparation for Fam17h
CPUs (Yazen Ghannam, Borislav Petkov)
- cleanups/refactorings and fixes (Borislav Petkov, Tony Luck,
Yinghai Lu)"
* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Include the PPIN in MCE records when available
x86/mce/AMD: Add system physical address translation for AMD Fam17h
x86/amd_nb: Add SMN and Indirect Data Fabric access for AMD Fam17h
x86/amd_nb: Add Fam17h Data Fabric as "Northbridge"
x86/amd_nb: Make all exports EXPORT_SYMBOL_GPL
x86/amd_nb: Make amd_northbridges internal to amd_nb.c
x86/mce/AMD: Reset Threshold Limit after logging error
x86/mce/AMD: Fix HWID_MCATYPE calculation by grouping arguments
x86/MCE: Correct TSC timestamping of error records
x86/RAS: Hide SMCA bank names
x86/RAS: Rename smca_bank_names to smca_names
x86/RAS: Simplify SMCA HWID descriptor struct
x86/RAS: Simplify SMCA bank descriptor struct
x86/MCE: Dump MCE to dmesg if no consumers
x86/RAS: Add TSC timestamp to the injected MCE
x86/MCE: Do not look at panic_on_oops in the severity grading
Pull scheduler updates from Ingo Molnar:
"The main scheduler changes in this cycle were:
- support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducig a
notion of 'better cores', which the scheduler will prefer to
schedule single threaded workloads on. (Tim Chen, Srinivas
Pandruvada)
- enhance the handling of asymmetric capacity CPUs further (Morten
Rasmussen)
- improve/fix load handling when moving tasks between task groups
(Vincent Guittot)
- simplify and clean up the cputime code (Stanislaw Gruszka)
- improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent
Guittot)
- make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)
- add uaccess atomicity debugging (when using access_ok() in the
wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)
- implement various fixes, cleanups and other enhancements (Daniel
Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
sched/core: Use load_avg for selecting idlest group
sched/core: Fix find_idlest_group() for fork
kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
kthread: Don't use to_live_kthread() in kthread_[un]park()
kthread: Don't use to_live_kthread() in kthread_stop()
Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
kthread: Make struct kthread kmalloc'ed
x86/uaccess, sched/preempt: Verify access_ok() context
sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
x86/sched: Use #include <linux/mutex.h> instead of #include <asm/mutex.h>
cpufreq/intel_pstate: Use CPPC to get max performance
acpi/bus: Set _OSC for diverse core support
acpi/bus: Enable HWP CPPC objects
x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
x86/sysctl: Add sysctl for ITMT scheduling feature
x86: Enable Intel Turbo Boost Max Technology 3.0
x86/topology: Define x86's arch_update_cpu_topology
sched: Extend scheduler's asym packing
sched/fair: Clean up the tunable parameter definitions
...
Pull perf updates from Ingo Molnar:
"This update is pretty big and almost exclusively includes tooling
changes, because v4.9's LTS status forced to completion most of the
pending kernel side hardware enablement work and because we tried to
freeze core perf work a bit to give a time window for the fuzzing
efforts.
The diff is large mostly due to the JSON hardware event tables added
for Intel and Power8 CPUs. This was a popular feature request from
people working close to hardware and from the HPC community.
Tree size is big because this added the CPU event tables for over a
decade of Intel CPUs. Future changes for a CPU vendor alrady support
should be much smaller, as events for new models are added. The new
events are listed in 'perf list', for the CPU model the tool is
running on. If you find an interesting event it can be used as-is:
$ perf stat -a -e l2_lines_out.pf_clean sleep 1
Performance counter stats for 'system wide':
7,860,403 l2_lines_out.pf_clean
1.000624918 seconds time elapsed
The event lists can be searched the usual 'perf list' fashion for
(case insensitive) substrings as well:
$ perf list l2_lines_out
List of pre-defined events (to be used in -e):
cache:
l2_lines_out.demand_clean
[Clean L2 cache lines evicted by demand]
l2_lines_out.demand_dirty
[Dirty L2 cache lines evicted by demand]
l2_lines_out.dirty_all
[Dirty L2 cache lines filling the L2]
l2_lines_out.pf_clean
[Clean L2 cache lines evicted by L2 prefetch]
l2_lines_out.pf_dirty
[Dirty L2 cache lines evicted by L2 prefetch]
etc.
There's a few high level categories as well that can be listed:
'cache', 'floating point', 'frontend', 'memory', 'pipeline', 'virtual
memory'.
Existing generic events and workflows should work as-is.
The only kernel side change is a late breaking fix for an older
regression, related to Intel BTS, LBR and PT feature interaction.
On the tooling side there are three new tools / major features:
- The new 'perf c2c' tool provides means for Shared Data C2C/HITM
analysis.
This allows you to track down cacheline contention. The tool is
based on x86's load latency and precise store facility events
provided by Intel CPUs.
It was tested by Joe Mario and has proven to be useful, finding
some cacheline contentions. Joe also wrote a blog about c2c tool
with examples:
https://joemario.github.io/blog/2016/09/01/c2c-blog/
excerpt of the content on this site:
At a high level, “perf c2c” will show you:
* The cachelines where false sharing was detected.
* The readers and writers to those cachelines, and the offsets where those accesses occurred.
* The pid, tid, instruction addr, function name, binary object name for those readers and writers.
* The source file and line number for each reader and writer.
* The average load latency for the loads to those cachelines.
* Which numa nodes the samples a cacheline came from and which CPUs were involved.
Using perf c2c is similar to using the Linux perf tool today.
First collect data with “perf c2c record”, then generate a
report output with “perf c2c report”
There one finds extensive details on using the tool, with tips on
reducing the volume of samples while still capturing enough to do
its job. (Dick Fowles, Joe Mario, Don Zickus, Jiri Olsa)
- The new 'perf sched timehist' tool provides tailored analysis of
scheduling events.
Example usage:
perf sched record -- sleep 1
perf sched timehist
By default it shows the individual schedule events, including the
wait time (time between sched-out and next sched-in events for the
task), the task scheduling delay (time between wakeup and actually
running) and run time for the task:
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
-------- ------ ---------------- --------- --------- --------
1.874569 [0011] gcc[31949] 0.014 0.000 1.148
1.874591 [0010] gcc[31951] 0.000 0.000 0.024
1.874603 [0010] migration/10[59] 3.350 0.004 0.011
1.874604 [0011] <idle> 1.148 0.000 0.035
1.874723 [0005] <idle> 0.016 0.000 1.383
1.874746 [0005] gcc[31949] 0.153 0.078 0.022
...
Times are in msec.usec. (David Ahern, Namhyung Kim)
- Add CPU vendor hardware event tables:
Add JSON files with vendor event naming for Intel and Power8
processors, allowing users of tools like oprofile to keep using the
event names they are used to, as well as people reading vendor
documentation, where such naming is used. (Andi Kleen, Sukadev
Bhattiprolu)
You should see all the new events with 'perf list' and you should
be able to search them, for example 'perf list miss' will list all
the myriads of miss events.
Other tooling features added were:
- Cross-arch annotation support:
o Improve ARM support in the annotation code, affecting 'perf
annotate', 'perf report' and live annotation in 'perf top' (Kim
Phillips)
o Initial support for PowerPC in the annotation code (Ravi
Bangoria)
o Support AArch64 in the 'annotate' code, native/local and
cross-arch/remote (Kim Phillips)
- Allow considering just events in a given time interval, via the
'--time start.s.ms,end.s.ms' command line, added to 'perf kmem',
'perf report', 'perf sched timehist' and 'perf script' (David
Ahern)
- Add option to stop printing a callchain at one of a given group of
symbol names (David Ahern)
- Track memory freed in 'perf kmem stat' (David Ahern)
- Allow querying and setting .perfconfig variables (Taeung Song)
- Show branch information in callchains (predicted, TSX aborts, loop
iteractions, etc) (Jin Yao)
- Dynamicly change verbosity level by pressing 'V' in the 'perf
top/report' hists TUI browser (Alexis Berlemont)
- Implement 'perf trace --delay' in the same fashion as in 'perf
record --delay', to skip sampling workload initialization events
(Alexis Berlemont)
- Make vendor named events case insensitive in 'perf list', i.e.
'perf list LONGEST_LAT' works just the same as 'perf list
longest_lat' (Andi Kleen)
- Add unwinding support for jitdump (Stefano Sanfilippo)
Tooling infrastructure changes:
- Support linking perf with clang and LLVM libraries, initially
statically, but this limitation will be lifted and shared
libraries, when available, will be preferred to the static build,
that should, as with other features, be enabled explicitly (Wang
Nan)
- Add initial support (and perf test entry) for tooling hooks,
starting with 'record_start' and 'record_end', that will have as
its initial user the eBPF infrastructure, where perf_ prefixed
functions will be JITed and run when such hooks are called (Wang
Nan)
- Implement assorted libbpf improvements (Wang Nan)"
... and lots of other changes, features, cleanups and refactorings I
did not list, see the shortlog and the git log for details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (220 commits)
perf/x86: Fix exclusion of BTS and LBR for Goldmont
perf tools: Explicitly document that --children is enabled by default
perf sched timehist: Cleanup idle_max_cpu handling
perf sched timehist: Handle zero sample->tid properly
perf callchain: Introduce callchain_cursor__copy()
perf sched: Cleanup option processing
perf sched timehist: Improve error message when analyzing wrong file
perf tools: Move perf build related variables under non fixdep leg
perf tools: Force fixdep compilation at the start of the build
perf tools: Move PERF-VERSION-FILE target into rules area
perf build: Check LLVM version in feature check
perf annotate: Show raw form for jump instruction with indirect target
perf tools: Add non config targets
perf tools: Cleanup build directory before each test
perf tools: Move python/perf.so target into rules area
perf tools: Move install-gtk target into rules area
tools build: Move tabs to spaces where suitable
tools build: Make the .cmd file more readable
perf clang: Compile BPF script using builtin clang support
perf clang: Support compile IR to BPF object and add testcase
...
Pull mm/PAT cleanup from Ingo Molnar:
"A single cleanup for a generic interface that was originally
introduced for PAT"
* 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pat, mm: Make track_pfn_insert() return void
Pull locking updates from Ingo Molnar:
"The tree got pretty big in this development cycle, but the net effect
is pretty good:
115 files changed, 673 insertions(+), 1522 deletions(-)
The main changes were:
- Rework and generalize the mutex code to remove per arch mutex
primitives. (Peter Zijlstra)
- Add vCPU preemption support: add an interface to query the
preemption status of vCPUs and use it in locking primitives - this
optimizes paravirt performance. (Pan Xinhui, Juergen Gross,
Christian Borntraeger)
- Introduce cpu_relax_yield() and remov cpu_relax_lowlatency() to
clean up and improve the s390 lock yielding machinery and its core
kernel impact. (Christian Borntraeger)
- Micro-optimize mutexes some more. (Waiman Long)
- Reluctantly add the to-be-deprecated mutex_trylock_recursive()
interface on a temporary basis, to give the DRM code more time to
get rid of its locking hacks. Any other users will be NAK-ed on
sight. (We turned off the deprecation warning for the time being to
not pollute the build log.) (Peter Zijlstra)
- Improve the rtmutex code a bit, in light of recent long lived
bugs/races. (Thomas Gleixner)
- Misc fixes, cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/paravirt: Fix bool return type for PVOP_CALL()
x86/paravirt: Fix native_patch()
locking/ww_mutex: Use relaxed atomics
locking/rtmutex: Explain locking rules for rt_mutex_proxy_unlock()/init_proxy_locked()
locking/rtmutex: Get rid of RT_MUTEX_OWNER_MASKALL
x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()
locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted
locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()
Documentation/virtual/kvm: Support the vCPU preemption check
x86/xen: Support the vCPU preemption check
x86/kvm: Support the vCPU preemption check
x86/kvm: Support the vCPU preemption check
kvm: Introduce kvm_write_guest_offset_cached()
locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests
locking/spinlocks, s390: Implement vcpu_is_preempted(cpu)
locking/core, powerpc: Implement vcpu_is_preempted(cpu)
sched/core: Introduce the vcpu_is_preempted(cpu) interface
sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q
locking/core: Provide common cpu_relax_yield() definition
locking/mutex: Don't mark mutex_trylock_recursive() as deprecated, temporarily
...
Pull EFI updates from Ingo Molnar:
"The main changes in this development cycle were:
- Implement EFI dev path parser and other changes to fully support
thunderbolt devices on Apple Macbooks (Lukas Wunner)
- Add RNG seeding via the EFI stub, on ARM/arm64 (Ard Biesheuvel)
- Expose EFI framebuffer configuration to user-space, to improve
tooling (Peter Jones)
- Misc fixes and cleanups (Ivan Hu, Wei Yongjun, Yisheng Xie, Dan
Carpenter, Roy Franz)"
* 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/libstub: Make efi_random_alloc() allocate below 4 GB on 32-bit
thunderbolt: Compile on x86 only
thunderbolt, efi: Fix Kconfig dependencies harder
thunderbolt, efi: Fix Kconfig dependencies
thunderbolt: Use Device ROM retrieved from EFI
x86/efi: Retrieve and assign Apple device properties
efi: Allow bitness-agnostic protocol calls
efi: Add device path parser
efi/arm*/libstub: Invoke EFI_RNG_PROTOCOL to seed the UEFI RNG table
efi/libstub: Add random.c to ARM build
efi: Add support for seeding the RNG from a UEFI config table
MAINTAINERS: Add ARM and arm64 EFI specific files to EFI subsystem
efi/libstub: Fix allocation size calculations
efi/efivar_ssdt_load: Don't return success on allocation failure
efifb: Show framebuffer layout as device attributes
efi/efi_test: Use memdup_user() as a cleanup
efi/efi_test: Fix uninitialized variable 'rv'
efi/efi_test: Fix uninitialized variable 'datasize'
efi/arm*: Fix efi_init() error handling
efi: Remove unused include of <linux/version.h>
Pull SMP bootup updates from Ingo Molnar:
"Three changes to unify/standardize some of the bootup message printing
in kernel/smp.c between architectures"
* 'core-smp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kernel/smp: Tell the user we're bringing up secondary CPUs
kernel/smp: Make the SMP boot message common on all arches
kernel/smp: Define pr_fmt() for smp.c
Commit:
3cded41794 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
introduced a paravirt op with bool return type [*]
It turns out that the PVOP_CALL*() macros miscompile when rettype is
bool. Code that looked like:
83 ef 01 sub $0x1,%edi
ff 15 32 a0 d8 00 callq *0xd8a032(%rip) # ffffffff81e28120 <pv_lock_ops+0x20>
84 c0 test %al,%al
ended up looking like so after PVOP_CALL1() was applied:
83 ef 01 sub $0x1,%edi
48 63 ff movslq %edi,%rdi
ff 14 25 20 81 e2 81 callq *0xffffffff81e28120
48 85 c0 test %rax,%rax
Note how it tests the whole of %rax, even though a typical bool return
function only sets %al, like:
0f 95 c0 setne %al
c3 retq
This is because ____PVOP_CALL() does:
__ret = (rettype)__eax;
and while regular integer type casts truncate the result, a cast to
bool tests for any !0 value. Fix this by explicitly truncating to
sizeof(rettype) before casting.
[*] The actual bug should've been exposed in commit:
446f3dc8cc ("locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests")
but that didn't properly implement the paravirt call.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 3cded41794 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
Link: http://lkml.kernel.org/r/20161208154349.346057680@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While chasing a regression I noticed we potentially patch the wrong
code in native_patch().
If we do not select the native code sequence, we must use the default
patcher, not fall-through the switch case.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel test robot <xiaolong.ye@intel.com>
Fixes: 3cded41794 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
Link: http://lkml.kernel.org/r/20161208154349.270616999@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
An earlier patch allowed enabling PT and LBR at the same
time on Goldmont. However it also allowed enabling BTS and LBR
at the same time, which is still not supported. Fix this by
bypassing the check only for PT.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alexander.shishkin@intel.com
Cc: kan.liang@intel.com
Cc: <stable@vger.kernel.org>
Fixes: ccbebba4c6 ("perf/x86/intel/pt: Bypass PT vs. LBR exclusivity if the core supports it")
Link: http://lkml.kernel.org/r/20161209001417.4713-1-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch allows XDP prog to extend/remove the packet
data at the head (like adding or removing header). It is
done by adding a new XDP helper bpf_xdp_adjust_head().
It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that XDP prog does not work on skb.
This patch adds one "xdp_adjust_head" bit to bpf_prog for the
XDP-capable driver to check if the XDP prog requires
bpf_xdp_adjust_head() support. The driver can then decide
to error out during XDP_SETUP_PROG.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull x86 fixes from Ingo Molnar:
"Misc fixes: a core dumping crash fix, a guess-unwinder regression fix,
plus three build warning fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind: Fix guess-unwinder regression
x86/build: Annotate die() with noreturn to fix build warning on clang
x86/platform/olpc: Fix resume handler build warning
x86/apic/uv: Silence a shift wrapping warning
x86/coredump: Always use user_regs_struct for compat_elf_gregset_t
I recently encountered wreckage because access_ok() was used where it
should not be, add an explicit WARN when access_ok() is used wrongly.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lukasz reported that perf stat counters overflow handling is broken on KNL/SLM.
Both these parts have full_width_write set, and that does indeed have
a problem. In order to deal with counter wrap, we must sample the
counter at at least half the counter period (see also the sampling
theorem) such that we can unambiguously reconstruct the count.
However commit:
069e0c3c40 ("perf/x86/intel: Support full width counting")
sets the sampling interval to the full period, not half.
Fixing that exposes another issue, in that we must not sign extend the
delta value when we shift it right; the counter cannot have
decremented after all.
With both these issues fixed, counter overflow functions correctly
again.
Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Tested-by: Liang, Kan <kan.liang@intel.com>
Tested-by: Odzioba, Lukasz <lukasz.odzioba@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: stable@vger.kernel.org
Fixes: 069e0c3c40 ("perf/x86/intel: Support full width counting")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The Knights Mill is enough close to Knights Landing so the path reuses
C-state residency support of the latter.
Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20161201000853.18260-1-piotr.luc@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Right now CONFIG_SCHED_MC_PRIO has X86_INTEL_PSTATE as a dependency,
which is not enabled by default and which hides the CONFIG_SCHED_MC_PRIO
hardware-enabling feature.
Select X86_INTEL_PSTATE instead, plus its dependency (CPU_FREQ), if the
user enables CONFIG_SCHED_MC_PRIO=y.
(Also align the CONFIG_SCHED_MC_PRIO Kconfig help text in standard style.)
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: bp@suse.de
Cc: jolsa@redhat.com
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: rjw@rjwysocki.net
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename CONFIG_SCHED_ITMT for Intel Turbo Boost Max Technology 3.0
to CONFIG_SCHED_MC_PRIO. This makes the configuration extensible
in future to other architectures that wish to similarly establish
CPU core priorities support in the scheduler.
The description in Kconfig is updated to reflect this change with
added details for better clarity. The configuration is explicitly
default-y, to enable the feature on CPUs that have this feature.
It has no effect on non-TBM3 CPUs.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bp@suse.de
Cc: jolsa@redhat.com
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/2b2ee29d93e3f162922d72d0165a1405864fbb23.1480444902.git.tim.c.chen@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
asm/mutex.h is gone from the locking tree, which makes sched/core break the build.
Use linux/mutex.h instead, which is the canonical method.
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: bp@suse.de
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
My attempt at fixing some KASAN false positive warnings was rather brain
dead, and it broke the guess unwinder. With frame pointers disabled,
/proc/<pid>/stack is broken:
# cat /proc/1/stack
[<ffffffffffffffff>] 0xffffffffffffffff
Restore the code flow to more closely resemble its previous state, while
still using READ_ONCE_NOCHECK() macros to silence KASAN false positives.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: c2d75e03d6 ("x86/unwind: Prevent KASAN false positive warnings in guess unwinder")
Link: http://lkml.kernel.org/r/b824f92c2c22eca5ec95ac56bd2a7c84cf0b9df9.1480309971.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixes below warning with clang:
In file included from ../arch/x86/tools/relocs_64.c:17:
../arch/x86/tools/relocs.c:977:6: warning: variable 'do_reloc' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
Signed-off-by: Peter Foley <pefoley2@pefoley.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161126222229.673-1-pefoley2@pefoley.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix:
arch/x86/platform/olpc/olpc-xo15-sci.c:199:12: warning: ‘xo15_sci_resume’
defined but not used [-Wunused-function]
static int xo15_sci_resume(struct device *dev)
^
which I see in randconfig builds here.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161126142706.13602-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Four fixes for bugs found by syzkaller on x86, all for stable.
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJYObr8AAoJEED/6hsPKofocbIH/j3p7QB73rDM2OCBhzTgGoOb
hcMLXnYEBD5C48ym2QW+wTEWJNNBikKOknYDX8wD1fIsaf8QoMqjEOSyxLPlexWI
mfTZnRAqSqYY9sPdlexpGAQV1uusCoIf2q9A+kW9Yy5q9ngzimiimRtFXgb/u6o5
mXZc7WcM8ZYSYdS+0Bz1lL6k1MGt1Yn207tQ3QNdWi4Pn6aWZp3+8C7rLjWu5zq8
LkMRsgedyxjULnyXedF+/IaXlC7qVO2LVwdxuHWsmeAPp/GmrNbAD+/4JKNk/Sgz
DPcPOWB/cCcCbWVY/8k+gRm0mnknX4bqYnwHwju++gwiUmJXIg3vWKfCDUw2SN0=
=MnV8
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"Four fixes for bugs found by syzkaller on x86, all for stable"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: check for pic and ioapic presence before use
KVM: x86: fix out-of-bounds accesses of rtc_eoi map
KVM: x86: drop error recovery in em_jmp_far and em_ret_far
KVM: x86: fix out-of-bounds access in lapic
Intel Turbo Boost Max Technology 3.0 (ITMT) feature
allows some cores to be boosted to higher turbo
frequency than others.
Add /proc/sys/kernel/sched_itmt_enabled so operator
can enable/disable scheduling of tasks that favor cores
with higher turbo boost frequency potential.
By default, system that is ITMT capable and single
socket has this feature turned on. It is more likely
to be lightly loaded and operates in Turbo range.
When there is a change in the ITMT scheduling operation
desired, a rebuild of the sched domain is initiated
so the scheduler can set up sched domains with appropriate
flag to enable/disable ITMT scheduling operations.
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/07cc62426a28bad57b01ab16bb903a9c84fa5421.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On platforms supporting Intel Turbo Boost Max Technology 3.0, the maximum
turbo frequencies of some cores in a CPU package may be higher than for
the other cores in the same package. In that case, better performance
(and possibly lower energy consumption as well) can be achieved by
making the scheduler prefer to run tasks on the CPUs with higher max
turbo frequencies.
To that end, set up a core priority metric to abstract the core
preferences based on the maximum turbo frequency. In that metric,
the cores with higher maximum turbo frequencies are higher-priority
than the other cores in the same package and that causes the scheduler
to favor them when making load-balancing decisions using the asymmertic
packing approach. At the same time, the priority of SMT threads with a
higher CPU number is reduced so as to avoid scheduling tasks on all of
the threads that belong to a favored core before all of the other cores
have been given a task to run.
The priority metric will be initialized by the P-state driver with the
help of the sched_set_itmt_core_prio() function. The P-state driver
will also determine whether or not ITMT is supported by the platform
and will call sched_set_itmt_support() to indicate that.
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: linux-pm@vger.kernel.org
Cc: peterz@infradead.org
Cc: jolsa@redhat.com
Cc: rjw@rjwysocki.net
Cc: linux-acpi@vger.kernel.org
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: bp@suse.de
Link: http://lkml.kernel.org/r/cd401ccdff88f88c8349314febdc25d51f7c48f7.1479844244.git.tim.c.chen@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
KVM was using arrays of size KVM_MAX_VCPUS with vcpu_id, but ID can be
bigger that the maximal number of VCPUs, resulting in out-of-bounds
access.
Found by syzkaller:
BUG: KASAN: slab-out-of-bounds in __apic_accept_irq+0xb33/0xb50 at addr [...]
Write of size 1 by task a.out/27101
CPU: 1 PID: 27101 Comm: a.out Not tainted 4.9.0-rc5+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[...]
Call Trace:
[...] __apic_accept_irq+0xb33/0xb50 arch/x86/kvm/lapic.c:905
[...] kvm_apic_set_irq+0x10e/0x180 arch/x86/kvm/lapic.c:495
[...] kvm_irq_delivery_to_apic+0x732/0xc10 arch/x86/kvm/irq_comm.c:86
[...] ioapic_service+0x41d/0x760 arch/x86/kvm/ioapic.c:360
[...] ioapic_set_irq+0x275/0x6c0 arch/x86/kvm/ioapic.c:222
[...] kvm_ioapic_inject_all arch/x86/kvm/ioapic.c:235
[...] kvm_set_ioapic+0x223/0x310 arch/x86/kvm/ioapic.c:670
[...] kvm_vm_ioctl_set_irqchip arch/x86/kvm/x86.c:3668
[...] kvm_arch_vm_ioctl+0x1a08/0x23c0 arch/x86/kvm/x86.c:3999
[...] kvm_vm_ioctl+0x1fa/0x1a70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3099
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: af1bae5497 ("KVM: x86: bump KVM_MAX_VCPU_ID to 1023")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
em_jmp_far and em_ret_far assumed that setting IP can only fail in 64
bit mode, but syzkaller proved otherwise (and SDM agrees).
Code segment was restored upon failure, but it was left uninitialized
outside of long mode, which could lead to a leak of host kernel stack.
We could have fixed that by always saving and restoring the CS, but we
take a simpler approach and just break any guest that manages to fail
as the error recovery is error-prone and modern CPUs don't need emulator
for this.
Found by syzkaller:
WARNING: CPU: 2 PID: 3668 at arch/x86/kvm/emulate.c:2217 em_ret_far+0x428/0x480
Kernel panic - not syncing: panic_on_warn set ...
CPU: 2 PID: 3668 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[...]
Call Trace:
[...] __dump_stack lib/dump_stack.c:15
[...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
[...] panic+0x1b7/0x3a3 kernel/panic.c:179
[...] __warn+0x1c4/0x1e0 kernel/panic.c:542
[...] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
[...] em_ret_far+0x428/0x480 arch/x86/kvm/emulate.c:2217
[...] em_ret_far_imm+0x17/0x70 arch/x86/kvm/emulate.c:2227
[...] x86_emulate_insn+0x87a/0x3730 arch/x86/kvm/emulate.c:5294
[...] x86_emulate_instruction+0x520/0x1ba0 arch/x86/kvm/x86.c:5545
[...] emulate_instruction arch/x86/include/asm/kvm_host.h:1116
[...] complete_emulated_io arch/x86/kvm/x86.c:6870
[...] complete_emulated_mmio+0x4e9/0x710 arch/x86/kvm/x86.c:6934
[...] kvm_arch_vcpu_ioctl_run+0x3b7a/0x5a90 arch/x86/kvm/x86.c:6978
[...] kvm_vcpu_ioctl+0x61e/0xdd0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2557
[...] vfs_ioctl fs/ioctl.c:43
[...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
[...] SYSC_ioctl fs/ioctl.c:694
[...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
[...] entry_SYSCALL_64_fastpath+0x1f/0xc2
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: d1442d85cc ("KVM: x86: Handle errors when RIP is set during far jumps")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
'm_io' is stored in 6 bits so it's a number in the 0-63 range. Static
analysis tools complain that 1 << 63 will wrap so I have changed it to
1ULL << m_io.
This code is over three years old so presumably the bug doesn't happen
very frequently in real life or someone would have complained by now.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Travis <travis@sgi.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Fixes: b15cc4a12b ("x86, uv, uv3: Update x2apic Support for SGI UV3")
Link: http://lkml.kernel.org/r/20161123221908.GA23997@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
90954e7b94 ("x86/coredump: Use pr_reg size, rather that TIF_IA32 flag")
changed the coredumping code to construct the elf coredump file according
to register set size - and that's good: if binary crashes with 32-bit code
selector, generate 32-bit ELF core, otherwise - 64-bit core.
That was made for restoring 32-bit applications on x86_64: we want
32-bit application after restore to generate 32-bit ELF dump on crash.
All was quite good and recently I started reworking 32-bit applications
dumping part of CRIU: now it has two parasites (32 and 64) for seizing
compat/native tasks, after rework it'll have one parasite, working in
64-bit mode, to which 32-bit prologue long-jumps during infection.
And while it has worked for my work machine, in VM with
!CONFIG_X86_X32_ABI during reworking I faced that segfault in 32-bit
binary, that has long-jumped to 64-bit mode results in dereference
of garbage:
32-victim[19266]: segfault at f775ef65 ip 00000000f775ef65 sp 00000000f776aa50 error 14
BUG: unable to handle kernel paging request at ffffffffffffffff
IP: [<ffffffff81332ce0>] strlen+0x0/0x20
[...]
Call Trace:
[] elf_core_dump+0x11a9/0x1480
[] do_coredump+0xa6b/0xe60
[] get_signal+0x1a8/0x5c0
[] do_signal+0x23/0x660
[] exit_to_usermode_loop+0x34/0x65
[] prepare_exit_to_usermode+0x2f/0x40
[] retint_user+0x8/0x10
That's because we have 64-bit registers set (with according total size)
and we're writing it to elf_thread_core_info which has smaller size
on !CONFIG_X86_X32_ABI. That lead to overwriting ELF notes part.
Tested on 32-, 64-bit ELF crashes and on 32-bit binaries that have
jumped with 64-bit code selector - all is readable with gdb.
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: 90954e7b94 ("x86/coredump: Use pr_reg size, rather that TIF_IA32 flag")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf fixes from Ingo Molnar:
"Six fixes for bugs that were found via fuzzing, and a trivial
hw-enablement patch for AMD Family-17h CPU PMUs"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Allow only a single PMU/box within an events group
perf/x86/intel: Cure bogus unwind from PEBS entries
perf/x86: Restore TASK_SIZE check on frame pointer
perf/core: Fix address filter parser
perf/x86: Add perf support for AMD family-17h processors
perf/x86/uncore: Fix crash by removing bogus event_list[] handling for SNB client uncore IMC
perf/core: Do not set cpuctx->cgrp for unscheduled cgroups
Intel Xeons from Ivy Bridge onwards support a processor identification
number set in the factory. To the user this is a handy unique number to
identify a particular CPU. Intel can decode this to the fab/production
run to track errors. On systems that have it, include it in the machine
check record. I'm told that this would be helpful for users that run
large data centers with multi-socket servers to keep track of which CPUs
are seeing errors.
Boris:
* Add some clarifying comments and spacing.
* Mask out [63:2] in the disabled-but-not-locked case
* Call the MSR variable "val" for more readability.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/20161123114855.njguoaygp3qnbkia@pd.tnic
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- two fixes to make (very) old Intel CPUs boot reliably
- fix the intel-mid driver and rename it
- two KASAN false positive fixes
- an FPU fix
- two sysfb fixes
- two build fixes related to new toolchain versions"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/intel-mid: Rename platform_wdt to platform_mrfld_wdt
x86/build: Build compressed x86 kernels as PIE when !CONFIG_RELOCATABLE as well
x86/platform/intel-mid: Register watchdog device after SCU
x86/fpu: Fix invalid FPU ptrace state after execve()
x86/boot: Fail the boot if !M486 and CPUID is missing
x86/traps: Ignore high word of regs->cs in early_fixup_exception()
x86/dumpstack: Prevent KASAN false positive warnings
x86/unwind: Prevent KASAN false positive warnings in guess unwinder
x86/boot: Avoid warning for zero-filling .bss
x86/sysfb: Fix lfb_size calculation
x86/sysfb: Add support for 64bit EFI lfb_base
Group validation expects all events to be of the same PMU; however
is_uncore_pmu() is too wide, it matches _all_ uncore events, even
across PMUs.
This triggers failure when we group different events from different
uncore PMUs, like:
perf stat -vv -e '{uncore_cbox_0/config=0x0334/,uncore_qpi_0/event=1/}' -a sleep 1
Fix is_uncore_pmu() by only matching events to the box at hand.
Note that generic code; ran after this step; will disallow this
mixture of PMU events.
Reported-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20161118125354.GQ3117@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vince Weaver reported that perf_fuzzer + KASAN detects that PEBS event
unwinds sometimes do 'weird' things. In particular, we seemed to be
ending up unwinding from random places on the NMI stack.
While it was somewhat expected that the event record BP,SP would not
match the interrupt BP,SP in that the interrupt is strictly later than
the record event, it was overlooked that it could be on an already
overwritten stack.
Therefore, don't copy the recorded BP,SP over the interrupted BP,SP
when we need stack unwinds.
Note that its still possible the unwind doesn't full match the actual
event, as its entirely possible to have done an (I)RET between record
and interrupt, but on average it should still point in the general
direction of where the event came from. Also, it's the best we can do,
considering.
The particular scenario that triggered the bogus NMI stack unwind was
a PEBS event with very short period, upon enabling the event at the
tail of the PMI handler (FREEZE_ON_PMI is not used), it instantly
triggers a record (while still on the NMI stack) which in turn
triggers the next PMI. This then causes back-to-back NMIs and we'll
try and unwind the stack-frame from the last NMI, which obviously is
now overwritten by our own.
Analyzed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davej@codemonkey.org.uk <davej@codemonkey.org.uk>
Cc: dvyukov@google.com <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: ca037701a0 ("perf, x86: Add PEBS infrastructure")
Link: http://lkml.kernel.org/r/20161117171731.GV3157@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
75925e1ad7 ("perf/x86: Optimize stack walk user accesses")
... switched from copy_from_user_nmi() to __copy_from_user_nmi() with a manual
access_ok() check.
Unfortunately, copy_from_user_nmi() does an explicit check against TASK_SIZE,
whereas the access_ok() uses whatever the current address limit of the task is.
We are getting NMIs when __probe_kernel_read() has switched to KERNEL_DS, and
then see vmalloc faults when we access what looks like pointers into vmalloc
space:
[] WARNING: CPU: 3 PID: 3685731 at arch/x86/mm/fault.c:435 vmalloc_fault+0x289/0x290
[] CPU: 3 PID: 3685731 Comm: sh Tainted: G W 4.6.0-5_fbk1_223_gdbf0f40 #1
[] Call Trace:
[] <NMI> [<ffffffff814717d1>] dump_stack+0x4d/0x6c
[] [<ffffffff81076e43>] __warn+0xd3/0xf0
[] [<ffffffff81076f2d>] warn_slowpath_null+0x1d/0x20
[] [<ffffffff8104a899>] vmalloc_fault+0x289/0x290
[] [<ffffffff8104b5a0>] __do_page_fault+0x330/0x490
[] [<ffffffff8104b70c>] do_page_fault+0xc/0x10
[] [<ffffffff81794e82>] page_fault+0x22/0x30
[] [<ffffffff81006280>] ? perf_callchain_user+0x100/0x2a0
[] [<ffffffff8115124f>] get_perf_callchain+0x17f/0x190
[] [<ffffffff811512c7>] perf_callchain+0x67/0x80
[] [<ffffffff8114e750>] perf_prepare_sample+0x2a0/0x370
[] [<ffffffff8114e840>] perf_event_output+0x20/0x60
[] [<ffffffff8114aee7>] ? perf_event_update_userpage+0xc7/0x130
[] [<ffffffff8114ea01>] __perf_event_overflow+0x181/0x1d0
[] [<ffffffff8114f484>] perf_event_overflow+0x14/0x20
[] [<ffffffff8100a6e3>] intel_pmu_handle_irq+0x1d3/0x490
[] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
[] [<ffffffff81197191>] ? vunmap_page_range+0x1a1/0x2f0
[] [<ffffffff811972f1>] ? unmap_kernel_range_noflush+0x11/0x20
[] [<ffffffff814f2056>] ? ghes_copy_tofrom_phys+0x116/0x1f0
[] [<ffffffff81040d1d>] ? x2apic_send_IPI_self+0x1d/0x20
[] [<ffffffff8100411d>] perf_event_nmi_handler+0x2d/0x50
[] [<ffffffff8101ea31>] nmi_handle+0x61/0x110
[] [<ffffffff8101ef94>] default_do_nmi+0x44/0x110
[] [<ffffffff8101f13b>] do_nmi+0xdb/0x150
[] [<ffffffff81795187>] end_repeat_nmi+0x1a/0x1e
[] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
[] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
[] [<ffffffff8147daf7>] ? copy_user_enhanced_fast_string+0x7/0x10
[] <<EOE>> <IRQ> [<ffffffff8115d05e>] ? __probe_kernel_read+0x3e/0xa0
Fix this by moving the valid_user_frame() check to before the uaccess
that loads the return address and the pointer to the next frame.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Fixes: 75925e1ad7 ("perf/x86: Optimize stack walk user accesses")
Signed-off-by: Ingo Molnar <mingo@kernel.org>