293 Commits

Author SHA1 Message Date
aab03e05e8 sched/deadline: Add SCHED_DEADLINE structures & implementation
Introduces the data structures, constants and symbols needed for
SCHED_DEADLINE implementation.

Core data structure of SCHED_DEADLINE are defined, along with their
initializers. Hooks for checking if a task belong to the new policy
are also added where they are needed.

Adds a scheduling class, in sched/dl.c and a new policy called
SCHED_DEADLINE. It is an implementation of the Earliest Deadline
First (EDF) scheduling algorithm, augmented with a mechanism (called
Constant Bandwidth Server, CBS) that makes it possible to isolate
the behaviour of tasks between each other.

The typical -deadline task will be made up of a computation phase
(instance) which is activated on a periodic or sporadic fashion. The
expected (maximum) duration of such computation is called the task's
runtime; the time interval by which each instance need to be completed
is called the task's relative deadline. The task's absolute deadline
is dynamically calculated as the time instant a task (better, an
instance) activates plus the relative deadline.

The EDF algorithms selects the task with the smallest absolute
deadline as the one to be executed first, while the CBS ensures each
task to run for at most its runtime every (relative) deadline
length time interval, avoiding any interference between different
tasks (bandwidth isolation).
Thanks to this feature, also tasks that do not strictly comply with
the computational model sketched above can effectively use the new
policy.

To summarize, this patch:
 - introduces the data structures, constants and symbols needed;
 - implements the core logic of the scheduling algorithm in the new
   scheduling class file;
 - provides all the glue code between the new scheduling class and
   the core scheduler and refines the interactions between sched/dl
   and the other existing scheduling classes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Michael Trimarchi <michael@amarulasolutions.com>
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13 13:41:06 +01:00
d50dde5a10 sched: Add new scheduler syscalls to support an extended scheduling parameters ABI
Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).

In general, it makes possible to specify a periodic/sporadic task,
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of their own timing constraints,
i.e.:

 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:

 - defines the new struct sched_attr, containing all the fields
   that are necessary for specifying a task in the computational
   model described above;

 - defines and implements the new scheduling related syscalls that
   manipulate it, i.e., sched_setattr() and sched_getattr().

Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.

Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-13 13:41:04 +01:00
ffe732c243 Merge branch 'sched/urgent' into sched/core
Merge the latest batch of fixes before applying development patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-17 15:22:35 +01:00
5d4cf996cf sched: Assign correct scheduling domain to 'sd_llc'
Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
dereference on sd_busy but the fix also altered what scheduling domain it
used for the 'sd_llc' percpu variable.

One impact of this is that a task selecting a runqueue may consider
idle CPUs that are not cache siblings as candidates for running.
Tasks are then running on CPUs that are not cache hot.

This was found through bisection where ebizzy threads were not seeing equal
performance and it looked like a scheduling fairness issue. This patch
mitigates but does not completely fix the problem on all machines tested
implying there may be an additional bug or a common root cause. Here are
the average range of performance seen by individual ebizzy threads. It
was tested on top of candidate patches related to x86 TLB range flushing.

	4-core machine
			    3.13.0-rc3            3.13.0-rc3
			       vanilla            fixsd-v3r3
	Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
	Mean   2        0.34 (  0.00%)        0.10 ( 70.59%)
	Mean   3        1.29 (  0.00%)        0.93 ( 27.91%)
	Mean   4        7.08 (  0.00%)        0.77 ( 89.12%)
	Mean   5      193.54 (  0.00%)        2.14 ( 98.89%)
	Mean   6      151.12 (  0.00%)        2.06 ( 98.64%)
	Mean   7      115.38 (  0.00%)        2.04 ( 98.23%)
	Mean   8      108.65 (  0.00%)        1.92 ( 98.23%)

	8-core machine
	Mean   1         0.00 (  0.00%)        0.00 (  0.00%)
	Mean   2         0.40 (  0.00%)        0.21 ( 47.50%)
	Mean   3        23.73 (  0.00%)        0.89 ( 96.25%)
	Mean   4        12.79 (  0.00%)        1.04 ( 91.87%)
	Mean   5        13.08 (  0.00%)        2.42 ( 81.50%)
	Mean   6        23.21 (  0.00%)       69.46 (-199.27%)
	Mean   7        15.85 (  0.00%)      101.72 (-541.77%)
	Mean   8       109.37 (  0.00%)       19.13 ( 82.51%)
	Mean   12      124.84 (  0.00%)       28.62 ( 77.07%)
	Mean   16      113.50 (  0.00%)       24.16 ( 78.71%)

It's eliminated for one machine and reduced for another.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: H Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-17 15:08:43 +01:00
8e8339a3a1 sched: Initialize power_orig for overlapping groups
Yinghai reported that he saw a /0 in sg_capacity on his EX parts.
Make sure to always initialize power_orig now that we actually use it.

Ideally build_sched_domains() -> init_sched_groups_power() would also
initialize this; but for some yet unexplained reason some setups seem
to miss updates there.

Reported-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-12-11 15:47:59 +01:00
e6c390f2df sched: Add sched_class->task_dead() method
Add a new function to the scheduling class interface. It is called
at the end of a context switch, if the prev task is in TASK_DEAD state.

It will be useful for the scheduling classes that want to be notified
when one of their tasks dies, e.g. to perform some cleanup actions,
such as SCHED_DEADLINE.

Signed-off-by: Dario Faggioli <raistlin@linux.it>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Cc: bruce.ashfield@windriver.com
Cc: claudio@evidence.eu.com
Cc: darren@dvhart.com
Cc: dhaval.giani@gmail.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: jkacur@redhat.com
Cc: johan.eker@ericsson.com
Cc: liming.wang@windriver.com
Cc: luca.abeni@unitn.it
Cc: michael@amarulasolutions.com
Cc: nicola.manica@disi.unitn.it
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: p.faure@akatech.ch
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@sssup.it
Cc: vincent.guittot@linaro.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-2-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-27 14:08:50 +01:00
192301e70a sched: Check TASK_DEAD rather than EXIT_DEAD in schedule_debug()
schedule_debug() ignores in_atomic() if prev->exit_state != 0.
This is not what we want, ->exit_state is set by exit_notify()
but we should complain until the task does the last schedule()
in TASK_DEAD.

See also 7407251a0e2e "PF_DEAD cleanup", I think this ancient
commit explains why schedule() had to rely on ->exit_state,
until that commit exit_notify() disabled preemption and set
PF_DEAD which was used to detect the exiting task.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131113154538.GB15810@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-27 13:50:53 +01:00
32e475d76a sched: Expose preempt_schedule_irq()
Tony reported that aa0d53260596 ("ia64: Use preempt_schedule_irq")
broke PREEMPT=n builds on ia64.

Ok, wrapped my brain around it. I tripped over the magic asm foo which
has a single need_resched check and schedule point for both sys call
return and interrupt return.

So you need the schedule_preempt_irq() for kernel preemption from
interrupt return while on a normal syscall preemption a schedule would
be sufficient. But using schedule_preempt_irq() is not harmful here in
any way. It just sets the preempt_active bit also in cases where it
would not be required.

Even on preempt=n kernels adding the preempt_active bit is completely
harmless. So instead of having an extra function, moving the existing
one out of the ifdef PREEMPT looks like the sanest thing to do.

It would also allow getting rid of various other sti/schedule/cli asm
magic in other archs.

Reported-and-Tested-by: Tony Luck <tony.luck@gmail.com>
Fixes: aa0d53260596 ("ia64: Use preempt_schedule_irq")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[slightly edited Changelog]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311211230030.30673@ionos.tec.linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-27 11:04:53 +01:00
39e24d8ffb sched: Fix a trivial syntax misuse
Use if statement instead of while loop.

Signed-off-by: Shigeru Yoshida <shigeru.yoshida@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <trivial@kernel.org>
Link: http://lkml.kernel.org/r/20131123.183801.769652906919404319.shigeru.yoshida@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-26 18:37:55 +01:00
0515973ffb sched: Fix a trivial typo in comments
Fix a trivial typo in rq_attach_root().

Signed-off-by: Shigeru Yoshida <shigeru.yoshida@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131117.121236.1990617639803941055.shigeru.yoshida@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-19 17:01:18 +01:00
42eb088ed2 sched: Avoid NULL dereference on sd_busy
Commit 37dc6b50cee9 ("sched: Remove unnecessary iteration over sched
domains to update nr_busy_cpus") forgot to clear 'sd_busy' under some
conditions leading to a possible NULL deref in set_cpu_sd_state_idle().

Reported-by: Anton Blanchard <anton@samba.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131118113701.GF3866@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-19 17:01:16 +01:00
911b2898b3 sched: Optimize task_sched_runtime()
Large multi-threaded apps like to hit this using do_sys_times() and
then queue up on the rq->lock.

Avoid when possible.

Larry reported ~20% performance increase his test case.

Reported-by: Larry Woodman <lwoodman@redhat.com>
Suggested-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131111172925.GG26898@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-13 13:33:54 +01:00
37dc6b50ce sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus
nr_busy_cpus parameter is used by nohz_kick_needed() to find out the
number of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES
flag set.  Therefore instead of updating nr_busy_cpus at every level
of sched domain, since it is irrelevant, we can update this parameter
only at the parent domain of the sd which has this flag set. Introduce
a per-cpu parameter sd_busy which represents this parent domain.

In nohz_kick_needed() we directly query the nr_busy_cpus parameter
associated with the groups of sd_busy.

By associating sd_busy with the highest domain which has
SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains
which could have this flag set and trigger nohz_idle_balancing if any
of the levels have more than one busy cpu.

sd_busy is irrelevant for asymmetric load balancing. However sd_asym
has been introduced to represent the highest sched domain which has
SD_ASYM_PACKING flag set so that it can be queried directly when
required.

While we are at it, we might as well change the nohz_idle parameter to
be updated at the sd_busy domain level alone and not the base domain
level of a CPU.  This will unify the concept of busy cpus at just one
level of sched domain where it is currently used.

Signed-off-by: Preeti U Murthy<preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: svaidy@linux.vnet.ibm.com
Cc: vincent.guittot@linaro.org
Cc: bitbucket@online.de
Cc: benh@kernel.crashing.org
Cc: anton@samba.org
Cc: Morten.Rasmussen@arm.com
Cc: pjt@google.com
Cc: peterz@infradead.org
Cc: mikey@neuling.org
Link: http://lkml.kernel.org/r/20131030031252.23426.4417.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 12:37:55 +01:00
b8a216269e sched: Move completion code from core.c to completion.c
Completions already have their own header file: linux/completion.h
Move the implementation out of kernel/sched/core.c and into its own
file: kernel/sched/completion.c.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-x2y49rmxu5dljt66ai2lcfuw@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 07:49:19 +01:00
b4145872f7 sched: Move wait code from core.c to wait.c
For some reason only the wait part of the wait api lives in
kernel/sched/wait.c and the wake part still lives in kernel/sched/core.c;
ammend this.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-ftycee88naznulqk7ei5mbci@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 07:49:18 +01:00
1ee14e6c8c sched: Fix race on toggling cfs_bandwidth_used
When we transition cfs_bandwidth_used to false, any currently
throttled groups will incorrectly return false from cfs_rq_throttled.
While tg_set_cfs_bandwidth will unthrottle them eventually, currently
running code (including at least dequeue_task_fair and
distribute_cfs_runtime) will cause errors.

Fix this by turning off cfs_bandwidth_used only after unthrottling all
cfs_rqs.

Tested: toggle bandwidth back and forth on a loaded cgroup. Caused
crashes in minutes without the patch, hasn't crashed with it.

Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181611.22647.80365.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-29 12:02:19 +01:00
ac9ff7997b sched: Remove extra put_online_cpus() inside sched_setaffinity()
Commit 6acce3ef8:

	sched: Remove get_online_cpus() usage

has left one extra put_online_cpus() inside sched_setaffinity(),
remove it to fix the WARN:

   ------------[ cut here ]------------
   WARNING: CPU: 0 PID: 3166 at kernel/cpu.c:84 put_online_cpus+0x43/0x70()
   ...
   [<ffffffff810c3fef>] put_online_cpus+0x43/0x70 [
   [<ffffffff810efd59>] sched_setaffinity+0x7d/0x1f9 [
   ...

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/526DD0EE.1090309@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-28 11:36:50 +01:00
6acce3ef84 sched: Remove get_online_cpus() usage
Remove get_online_cpus() usage from the scheduler; there's 4 sites that
use it:

 - sched_init_smp(); where its completely superfluous since we're in
   'early' boot and there simply cannot be any hotplugging.

 - sched_getaffinity(); we already take a raw spinlock to protect the
   task cpus_allowed mask, this disables preemption and therefore
   also stabilizes cpu_online_mask as that's modified using
   stop_machine. However switch to active mask for symmetry with
   sched_setaffinity()/set_cpus_allowed_ptr(). We guarantee active
   mask stability by inserting sync_rcu/sched() into _cpu_down.

 - sched_setaffinity(); we don't appear to need get_online_cpus()
   either, there's two sites where hotplug appears relevant:
    * cpuset_cpus_allowed(); for the !cpuset case we use possible_mask,
      for the cpuset case we hold task_lock, which is a spinlock and
      thus for mainline disables preemption (might cause pain on RT).
    * set_cpus_allowed_ptr(); Holds all scheduler locks and thus has
      preemption properly disabled; also it already deals with hotplug
      races explicitly where it releases them.

 - migrate_swap(); we can make stop_two_cpus() do the heavy lifting for
   us with a little trickery. By adding a sync_sched/rcu() after the
   CPU_DOWN_PREPARE notifier we can provide preempt/rcu guarantees for
   cpu_active_mask. Use these to validate that both our cpus are active
   when queueing the stop work before we queue the stop_machine works
   for take_cpu_down().

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20131011123820.GV3081@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-16 14:22:16 +02:00
746023159c sched: Fix race in migrate_swap_stop()
There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap
places with task T, on CPU B.

Task P:
  - call migrate_swap
Task T:
  - go to sleep, removing itself from the runqueue
Task P:
  - double lock the runqueues on CPU A & B
Task T:
  - get woken up, place itself on the runqueue of CPU C
Task P:
  - see that task T is on a runqueue, and pretend to remove it
    from the runqueue on CPU B

Now CPUs B & C both have corrupted scheduler data structures.

This patch fixes it, by holding the pi_lock for both of the tasks
involved in the migrate swap. This prevents task T from waking up,
and placing itself onto another runqueue, until after migrate_swap
has released all locks.

This means that, when migrate_swap checks, task T will be either
on the runqueue where it was originally seen, or not on any
runqueue at all. Migrate_swap deals correctly with of those cases.

Tested-by: Joe Mario <jmario@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: hannes@cmpxchg.org
Cc: aarcange@redhat.com
Cc: srikar@linux.vnet.ibm.com
Cc: tglx@linutronix.de
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/20131010181722.GO13848@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-16 14:22:14 +02:00
1e3646ffc6 mm: numa: Revert temporarily disabling of NUMA migration
With the scan rate code working (at least for multi-instance specjbb),
the large hammer that is "sched: Do not migrate memory immediately after
switching node" can be replaced with something smarter. Revert temporarily
migration disabling and all traces of numa_migrate_seq.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-61-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:20 +02:00
930aa174fc sched/numa: Remove the numa_balancing_scan_period_reset sysctl
With scan rate adaptions based on whether the workload has properly
converged or not there should be no need for the scan period reset
hammer. Get rid of it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-60-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:18 +02:00
0ec8aa00f2 sched/numa: Avoid migrating tasks that are placed on their preferred node
This patch classifies scheduler domains and runqueues into types depending
the number of tasks that are about their NUMA placement and the number
that are currently running on their preferred node. The types are

regular: There are tasks running that do not care about their NUMA
	placement.

remote: There are tasks running that care about their placement but are
	currently running on a node remote to their ideal placement

all: No distinction

To implement this the patch tracks the number of tasks that are optimally
NUMA placed (rq->nr_preferred_running) and the number of tasks running
that care about their placement (nr_numa_running). The load balancer
uses this information to avoid migrating idea placed NUMA tasks as long
as better options for load balancing exists. For example, it will not
consider balancing between a group whose tasks are all perfectly placed
and a group with remote tasks.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-56-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:48:10 +02:00
5e1576ed0e sched/numa: Stay on the same node if CLONE_VM
A newly spawned thread inside a process should stay on the same
NUMA node as its parent. This prevents processes from being "torn"
across multiple NUMA nodes every time they spawn a new thread.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-49-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:57 +02:00
8c8a743c50 sched/numa: Use {cpu, pid} to create task groups for shared faults
While parallel applications tend to align their data on the cache
boundary, they tend not to align on the page or THP boundary.
Consequently tasks that partition their data can still "false-share"
pages presenting a problem for optimal NUMA placement.

This patch uses NUMA hinting faults to chain tasks together into
numa_groups. As well as storing the NID a task was running on when
accessing a page a truncated representation of the faulting PID is
stored. If subsequent faults are from different PIDs it is reasonable
to assume that those two tasks share a page and are candidates for
being grouped together. Note that this patch makes no scheduling
decisions based on the grouping information.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1381141781-10992-44-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:47 +02:00
fb13c7ee0e sched/numa: Use a system-wide search to find swap/migration candidates
This patch implements a system-wide search for swap/migration candidates
based on total NUMA hinting faults. It has a balance limit, however it
doesn't properly consider total node balance.

In the old scheme a task selected a preferred node based on the highest
number of private faults recorded on the node. In this scheme, the preferred
node is based on the total number of faults. If the preferred node for a
task changes then task_numa_migrate will search the whole system looking
for tasks to swap with that would improve both the overall compute
balance and minimise the expected number of remote NUMA hinting faults.

Not there is no guarantee that the node the source task is placed
on by task_numa_migrate() has any relationship to the newly selected
task->numa_preferred_nid due to compute overloading.

Signed-off-by: Mel Gorman <mgorman@suse.de>
[ Do not swap with tasks that cannot run on source cpu]
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Fixed compiler warning on UP. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-40-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 14:47:25 +02:00
ac66f54772 sched/numa: Introduce migrate_swap()
Use the new stop_two_cpus() to implement migrate_swap(), a function that
flips two tasks between their respective cpus.

I'm fairly sure there's a less crude way than employing the stop_two_cpus()
method, but everything I tried either got horribly fragile and/or complex. So
keep it simple for now.

The notable detail is how we 'migrate' tasks that aren't runnable
anymore. We'll make it appear like we migrated them before they went to
sleep. The sole difference is the previous cpu in the wakeup path, so we
override this.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/1381141781-10992-39-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:46 +02:00
6fe6b2d6da sched/numa: Do not migrate memory immediately after switching node
The load balancer can move tasks between nodes and does not take NUMA
locality into account. With automatic NUMA balancing this may result in the
tasks working set being migrated to the new node. However, as the fault
buffer will still store faults from the old node the schduler may decide to
reset the preferred node and migrate the task back resulting in more
migrations.

The ideal would be that the scheduler did not migrate tasks with a heavy
memory footprint but this may result nodes being overloaded. We could
also discard the fault information on task migration but this would still
cause all the tasks working set to be migrated. This patch simply avoids
migrating the memory for a short time after a task is migrated.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-31-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:36 +02:00
e6628d5b0a sched/numa: Reschedule task on preferred NUMA node once selected
A preferred node is selected based on the node the most NUMA hinting
faults was incurred on. There is no guarantee that the task is running
on that node at the time so this patch rescheules the task to run on
the most idle CPU of the selected node when selected. This avoids
waiting for the balancer to make a decision.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-25-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:28 +02:00
3a7053b322 sched/numa: Favour moving tasks towards the preferred node
This patch favours moving tasks towards NUMA node that recorded a higher
number of NUMA faults during active load balancing.  Ideally this is
self-reinforcing as the longer the task runs on that node, the more faults
it should incur causing task_numa_placement to keep the task running on that
node. In reality a big weakness is that the nodes CPUs can be overloaded
and it would be more efficient to queue tasks on an idle node and migrate
to the new node. This would require additional smarts in the balancer so
for now the balancer will simply prefer to place the task on the preferred
node for a PTE scans which is controlled by the numa_balancing_settle_count
sysctl. Once the settle_count number of scans has complete the schedule
is free to place the task on an alternative node if the load is imbalanced.

[srikar@linux.vnet.ibm.com: Fixed statistics]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Tunable and use higher faults instead of preferred. ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:26 +02:00
745d61476d sched/numa: Update NUMA hinting faults once per scan
NUMA hinting fault counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artifically high
count due to very recent faulting activity and may create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-22-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:25 +02:00
688b7585d1 sched/numa: Select a preferred node with the most numa hinting faults
This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-21-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:23 +02:00
f809ca9a55 sched/numa: Track NUMA hinting faults on per-node basis
This patch tracks what nodes numa hinting faults were incurred on.
This information is later used to schedule a task on the node storing
the pages most frequently faulted by the task.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:22 +02:00
7e8d16b6cb sched/numa: Initialise numa_next_scan properly
Scan delay logic and resets are currently initialised to start scanning
immediately instead of delaying properly. Initialise them properly at
fork time and catch when a new mm has been allocated.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-09 12:40:19 +02:00
a233f1120c sched: Prepare for per-cpu preempt_count
When using per-cpu preempt_count variables we need to save/restore the
preempt_count on context switch (into per task storage; for instance
the old thread_info::preempt_count variable) because of
PREEMPT_ACTIVE.

However, this means that on fork() the preempt_count value of the last
context switch gets copied and if we had a PREEMPT_ACTIVE switch right
before cloning a child task the child task will now too have
PREEMPT_ACTIVE set and start its life with an extra PREEMPT_ACTIVE
count.

Therefore we need to make init_task_preempt_count() unconditional;
this resets whatever preempt_count we inherited from our parent
process.

Doing so for !per-cpu implementations is harmless.

For !PREEMPT_COUNT kernels we need to be careful not to start life
with an increased preempt_count.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-4k0b7oy1rcdyzochwiixuwi9@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 14:07:55 +02:00
bdb4380658 sched: Extract the basic add/sub preempt_count modifiers
Rewrite the preempt_count macros in order to extract the 3 basic
preempt_count value modifiers:

  __preempt_count_add()
  __preempt_count_sub()

and the new:

  __preempt_count_dec_and_test()

And since we're at it anyway, replace the unconventional
$op_preempt_count names with the more conventional preempt_count_$op.

Since these basic operators are equivalent to the previous _notrace()
variants, do away with the _notrace() versions.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 14:07:54 +02:00
0102874755 sched: Create more preempt_count accessors
We need a few special preempt_count accessors:
 - task_preempt_count() for when we're interested in the preemption
   count of another (non-running) task.
 - init_task_preempt_count() for properly initializing the preemption
   count.
 - init_idle_preempt_count() a special case of the above for the idle
   threads.

With these no generic code ever touches thread_info::preempt_count
anymore and architectures could choose to remove it.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-jf5swrio8l78j37d06fzmo4r@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 14:07:52 +02:00
f27dde8dee sched: Add NEED_RESCHED to the preempt_count
In order to combine the preemption and need_resched test we need to
fold the need_resched information into the preempt_count value.

Since the NEED_RESCHED flag is set across CPUs this needs to be an
atomic operation, however we very much want to avoid making
preempt_count atomic, therefore we keep the existing TIF_NEED_RESCHED
infrastructure in place but at 3 sites test it and fold its value into
preempt_count; namely:

 - resched_task() when setting TIF_NEED_RESCHED on the current task
 - scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a
                   remote task it follows it up with a reschedule IPI
                   and we can modify the cpu local preempt_count from
                   there.
 - cpu_idle_loop() for when resched_task() found tsk_is_polling().

We use an inverted bitmask to indicate need_resched so that a 0 means
both need_resched and !atomic.

Also remove the barrier() in preempt_enable() between
preempt_enable_no_resched() and preempt_check_resched() to avoid
having to reload the preemption value and allow the compiler to use
the flags of the previuos decrement. I couldn't come up with any sane
reason for this barrier() to be there as preempt_enable_no_resched()
already has a barrier() before doing the decrement.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-7a7m5qqbn5pmwnd4wko9u6da@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 14:07:49 +02:00
4a2b4b2227 sched: Introduce preempt_count accessor functions
Replace the single preempt_count() 'function' that's an lvalue with
two proper functions:

 preempt_count() - returns the preempt_count value as rvalue
 preempt_count_set() - Allows setting the preempt-count value

Also provide preempt_count_ptr() as a convenience wrapper to implement
all modifying operations.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-orxrbycjozopqfhb4dxdkdvb@git.kernel.org
[ Fixed build failure. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 14:07:32 +02:00
b021fe3e25 sched, rcu: Make RCU use resched_cpu()
We're going to deprecate and remove set_need_resched() for it will do
the wrong thing. Make an exception for RCU and allow it to use
resched_cpu() which will do the right thing.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-2eywnacjl1nllctl1nszqa5w@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 13:53:08 +02:00
4314895165 sched: Micro-optimize by dropping unnecessary task_rq() calls
We always know the rq used, let's just pass it around.
This seems to cut the size of scheduler core down a tiny bit:

Before:

  [linux]$ size kernel/sched/core.o.orig
     text    data     bss     dec     hex filename
    62760   16130    3876   82766   1434e kernel/sched/core.o.orig

After:

  [linux]$ size kernel/sched/core.o.patched
     text    data     bss     dec     hex filename
    62566   16130    3876   82572   1428c kernel/sched/core.o.patched

Probably speeds it up as well.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130922142054.GA11499@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-25 13:51:06 +02:00
9bd721c55c sched/balancing: Consider max cost of idle balance per sched domain
In this patch, we keep track of the max cost we spend doing idle load balancing
for each sched domain. If the avg time the CPU remains idle is less then the
time we have already spent on idle balancing + the max cost of idle balancing
in the sched domain, then we don't continue to attempt the balance. We also
keep a per rq variable, max_idle_balance_cost, which keeps track of the max
time spent on newidle load balances throughout all its domains so that we can
determine the avg_idle's max value.

By using the max, we avoid overrunning the average. This further reduces the
chance we attempt balancing when the CPU is not idle for longer than the cost
to balance.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379096813-3032-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 12:03:44 +02:00
abfafa54db sched: Reduce overestimating rq->avg_idle
When updating avg_idle, if the delta exceeds some max value, then avg_idle
gets set to the max, regardless of what the previous avg was. This can cause
avg_idle to often be overestimated.

This patch modifies the way we update avg_idle by always updating it with the
function call to update_avg() first. Then, if avg_idle exceeds the max, we set
it to the max.

Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1379096813-3032-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-20 12:03:41 +02:00
ae7a835cc5 Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Gleb Natapov:
 "The highlights of the release are nested EPT and pv-ticketlocks
  support (hypervisor part, guest part, which is most of the code, goes
  through tip tree).  Apart of that there are many fixes for all arches"

Fix up semantic conflicts as discussed in the pull request thread..

* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (88 commits)
  ARM: KVM: Add newlines to panic strings
  ARM: KVM: Work around older compiler bug
  ARM: KVM: Simplify tracepoint text
  ARM: KVM: Fix kvm_set_pte assignment
  ARM: KVM: vgic: Bump VGIC_NR_IRQS to 256
  ARM: KVM: Bugfix: vgic_bytemap_get_reg per cpu regs
  ARM: KVM: vgic: fix GICD_ICFGRn access
  ARM: KVM: vgic: simplify vgic_get_target_reg
  KVM: MMU: remove unused parameter
  KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate()
  KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls
  KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX
  KVM: x86: update masterclock when kvmclock_offset is calculated (v2)
  KVM: PPC: Book3S: Fix compile error in XICS emulation
  KVM: PPC: Book3S PR: return appropriate error when allocation fails
  arch: powerpc: kvm: add signed type cast for comparation
  KVM: x86: add comments where MMIO does not return to the emulator
  KVM: vmx: count exits to userspace during invalid guest emulation
  KVM: rename __kvm_io_bus_sort_cmp to kvm_io_bus_cmp
  kvm: optimize away THP checks in kvm_is_mmio_pfn()
  ...
2013-09-04 18:15:06 -07:00
6832d9652f Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timers/nohz changes from Ingo Molnar:
 "It mostly contains fixes and full dynticks off-case optimizations, by
  Frederic Weisbecker"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  nohz: Include local CPU in full dynticks global kick
  nohz: Optimize full dynticks's sched hooks with static keys
  nohz: Optimize full dynticks state checks with static keys
  nohz: Rename a few state variables
  vtime: Always debug check snapshot source _before_ updating it
  vtime: Always scale generic vtime accounting results
  vtime: Optimize full dynticks accounting off case with static keys
  vtime: Describe overriden functions in dedicated arch headers
  m68k: hardirq_count() only need preempt_mask.h
  hardirq: Split preempt count mask definitions
  context_tracking: Split low level state headers
  vtime: Fix racy cputime delta update
  vtime: Remove a few unneeded generic vtime state checks
  context_tracking: User/kernel broundary cross trace events
  context_tracking: Optimize context switch off case with static keys
  context_tracking: Optimize guest APIs off case with static key
  context_tracking: Optimize main APIs off case with static key
  context_tracking: Ground setup for static key use
  context_tracking: Remove full dynticks' hacky dependency on wide context tracking
  nohz: Only enable context tracking on full dynticks CPUs
  ...
2013-09-04 09:36:54 -07:00
5e0b3a4e88 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "Various optimizations, cleanups and smaller fixes - no major changes
  in scheduler behavior"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix the sd_parent_degenerate() code
  sched/fair: Rework and comment the group_imb code
  sched/fair: Optimize find_busiest_queue()
  sched/fair: Make group power more consistent
  sched/fair: Remove duplicate load_per_task computations
  sched/fair: Shrink sg_lb_stats and play memset games
  sched: Clean-up struct sd_lb_stat
  sched: Factor out code to should_we_balance()
  sched: Remove one division operation in find_busiest_queue()
  sched/cputime: Use this_cpu_add() in task_group_account_field()
  cpumask: Fix cpumask leak in partition_sched_domains()
  sched/x86: Optimize switch_mm() for multi-threaded workloads
  generic-ipi: Kill unnecessary variable - csd_flags
  numa: Mark __node_set() as __always_inline
  sched/fair: Cleanup: remove duplicate variable declaration
  sched/__wake_up_sync_key(): Fix nr_exclusive tasks which lead to WF_SYNC clearing
2013-09-04 08:36:35 -07:00
0d99b70873 Merge branches 'perf-urgent-for-linus' and 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf changes from Ingo Molnar:
 "As a first remark I'd like to point out that the obsolete '-f'
  (--force) option, which has not done anything for several releases,
  has been removed from 'perf record' and related utilities.  Everyone
  please update muscle memory accordingly! :-)

  Main changes on the perf kernel side:

   - Performance optimizations:
        . for trace events, by Steve Rostedt.
        . for time values, by Peter Zijlstra

   - New hardware support:
        . for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan
        . for Intel SNB-EP uncore PMUs, by Zheng Yan

   - Enhanced hardware support:
        . for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan

   - Core perf events code enhancements and fixes:
        . for full-nohz feature handling, by Frederic Weisbecker
        . for group events, by Jiri Olsa
        . for call chains, by Frederic Weisbecker
        . for event stream parsing, by Adrian Hunter

   - New ABI details:
        . Add attr->mmap2 attribute, by Stephane Eranian
        . Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa
        . Export u64 time_zero on the mmap header page to allow TSC
          calculation, by Adrian Hunter
        . Add dummy software event, by Adrian Hunter.
        . Add a new PERF_SAMPLE_IDENTIFIER to make samples always
          parseable, by Adrian Hunter.
        . Make Power7 events available via sysfs, by Runzhen Wang.

   - Code cleanups and refactorings:
        . for nohz-full, by Frederic Weisbecker
        . for group events, by Jiri Olsa

   - Documentation updates:
        . for perf_event_type, by Peter Zijlstra

  Main changes on the perf tooling side (some of these tooling changes
  utilize the above kernel side changes):

   - Lots of 'perf trace' enhancements:

        . Make 'perf trace' command line arguments consistent with
          'perf record', by David Ahern.

        . Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo.

        . Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo.

        . Support ! in -e expressions, to filter a list of syscalls,
          by Arnaldo Carvalho de Melo.

        . Arg formatting improvements to allow masking arguments in
          syscalls such as futex and open, where the some arguments are
          ignored and thus should not be printed depending on other args,
          by Arnaldo Carvalho de Melo.

        . Beautify futex open, openat, open_by_handle_at, lseek and futex
          syscalls, by Arnaldo Carvalho de Melo.

        . Add option to analyze events in a file versus live, so that
          one can do:

           [root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1
           [ perf record: Woken up 0 times to write data ]
           [ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ]
           [root@zoo ~]# perf trace -i perf.data -e futex --duration 1
              17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua
             113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967
             133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496
           [root@zoo ~]#

          By David Ahern.

        . Honor target pid / tid options when analyzing a file, by David Ahern.

        . Introduce better formatting of syscall arguments, including so
          far beautifiers for mmap, madvise, syscall return values,
          by Arnaldo Carvalho de Melo.

        . Handle HUGEPAGE defines in the mmap beautifier, by David Ahern.

   - 'perf report/top' enhancements:

        . Do annotation using /proc/kcore and /proc/kallsyms when
          available, removing the forced need for a vmlinux file kernel
          assembly annotation. This also improves this use case because
          vmlinux has just the initial kernel image, not what is actually
          in use after various code patchings by things like alternatives.
          By Adrian Hunter.

        . Add --ignore-callees=<regex> option to collapse undesired parts
          of call graphs, by Greg Price.

        . Simplify symbol filtering by doing it at machine class level,
          by Adrian Hunter.

        . Add support for callchains in the gtk UI, by Namhyung Kim.

        . Add --objdump option to 'perf top', by Sukadev Bhattiprolu.

   - 'perf kvm' enhancements:

        . Add option to print only events that exceed a specified time
          duration, by David Ahern.

        . Improve stack trace printing, by David Ahern.

        . Update documentation of the live command, by David Ahern

        . Add perf kvm stat live mode that combines aspects of 'perf kvm
          stat' record and report, by David Ahern.

        . Add option to analyze specific VM in perf kvm stat report, by
          David Ahern.

        . Do not require /lib/modules/* on a guest, by Jason Wessel.

   - 'perf script' enhancements:

        . Fix symbol offset computation for some dsos, by David Ahern.

        . Fix named threads support, by David Ahern.

        . Don't install scripting files files when perl/python support
          is disabled, by Arnaldo Carvalho de Melo.

   - 'perf test' enhancements:

        . Add various improvements and fixes to the "vmlinux matches
          kallsyms" 'perf test' entry, related to the /proc/kcore
          annotation feature. By Adrian Hunter.

        . Add sample parsing test, by Adrian Hunter.

        . Add test for reading object code, by Adrian Hunter.

        . Add attr record group sampling test, by Jiri Olsa.

        . Misc testing infrastructure improvements and other details,
          by Jiri Olsa.

   - 'perf list' enhancements:

        . Skip unsupported hardware events, by Namhyung Kim.

        . List pmu events, by Andi Kleen.

   - 'perf diff' enhancements:

        . Add support for more than two files comparison, by Jiri Olsa.

   - 'perf sched' enhancements:

        . Various improvements, including removing reliance on some
          scheduler tracepoints that provide the same information as the
          PERF_RECORD_{FORK,EXIT} events. By David Ahern.

        . Remove odd build stall by moving a large struct initialization
          from a local variable to a global one, by Namhyung Kim.

   - 'perf stat' enhancements:

        . Add --initial-delay option to skip measuring for a defined
          startup phase, by Andi Kleen.

   - Generic perf tooling infrastructure/plumbing changes:

        . Tidy up sample parsing validation, by Adrian Hunter.

        . Fix up jobserver setup in libtraceevent Makefile.
          by Arnaldo Carvalho de Melo.

        . Debug improvements, by Adrian Hunter.

        . Fix correlation of samples coming after PERF_RECORD_EXIT event,
          by David Ahern.

        . Improve robustness of the topology parsing code,
          by Stephane Eranian.

        . Add group leader sampling, that allows just one event in a group
          to sample while the other events have just its values read,
          by Jiri Olsa.

        . Add support for a new modifier "D", which requests that the
          event, or group of events, be pinned to the PMU.
          By Michael Ellerman.

        . Support callchain sorting based on addresses, by Andi Kleen

        . Prep work for multi perf data file storage, by Jiri Olsa.

        . libtraceevent cleanups, by Namhyung Kim.

  And lots and lots of other fixes and code reorganizations that did not
  make it into the list, see the shortlog, diffstat and the Git log for
  details!"

[ Also merge a leftover from the 3.11 cycle ]

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Prevent race in unthrottling code

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (237 commits)
  perf trace: Tell arg formatters the arg index
  perf trace: Add beautifier for open's flags arg
  perf trace: Add beautifier for lseek's whence arg
  perf tools: Fix symbol offset computation for some dsos
  perf list: Skip unsupported events
  perf tests: Add 'keep tracking' test
  perf tools: Add support for PERF_COUNT_SW_DUMMY
  perf: Add a dummy software event to keep tracking
  perf trace: Add beautifier for futex 'operation' parm
  perf trace: Allow syscall arg formatters to mask args
  perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node()
  perf: Export struct perf_branch_entry to userspace
  perf: Add attr->mmap2 attribute to an event
  perf/x86: Add Silvermont (22nm Atom) support
  perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X
  perf trace: Handle missing HUGEPAGE defines
  perf trace: Honor target pid / tid options when analyzing a file
  perf trace: Add option to analyze events in a file versus live
  perf evlist: Add tracepoint lookup by name
  perf tests: Add a sample parsing test
  ...
2013-09-04 08:25:35 -07:00
32dad03d16 Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "A lot of activities on the cgroup front.  Most changes aren't visible
  to userland at all at this point and are laying foundation for the
  planned unified hierarchy.

   - The biggest change is decoupling the lifetime management of css
     (cgroup_subsys_state) from that of cgroup's.  Because controllers
     (cpu, memory, block and so on) will need to be dynamically enabled
     and disabled, css which is the association point between a cgroup
     and a controller may come and go dynamically across the lifetime of
     a cgroup.  Till now, css's were created when the associated cgroup
     was created and stayed till the cgroup got destroyed.

     Assumptions around this tight coupling permeated through cgroup
     core and controllers.  These assumptions are gradually removed,
     which consists bulk of patches, and css destruction path is
     completely decoupled from cgroup destruction path.  Note that
     decoupling of creation path is relatively easy on top of these
     changes and the patchset is pending for the next window.

   - cgroup has its own event mechanism cgroup.event_control, which is
     only used by memcg.  It is overly complex trying to achieve high
     flexibility whose benefits seem dubious at best.  Going forward,
     new events will simply generate file modified event and the
     existing mechanism is being made specific to memcg.  This pull
     request contains prepatory patches for such change.

   - Various fixes and cleanups"

Fixed up conflict in kernel/cgroup.c as per Tejun.

* 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
  cgroup: fix cgroup_css() invocation in css_from_id()
  cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
  cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
  cgroup: implement CFTYPE_NO_PREFIX
  cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
  cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
  cgroup: fix cgroup_write_event_control()
  cgroup: fix subsystem file accesses on the root cgroup
  cgroup: change cgroup_from_id() to css_from_id()
  cgroup: use css_get() in cgroup_create() to check CSS_ROOT
  cpuset: remove an unncessary forward declaration
  cgroup: RCU protect each cgroup_subsys_state release
  cgroup: move subsys file removal to kill_css()
  cgroup: factor out kill_css()
  cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
  cgroup: replace cgroup->css_kill_cnt with ->nr_css
  cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
  cgroup: move cgroup->subsys[] assignment to online_css()
  cgroup: reorganize css init / exit paths
  cgroup: add __rcu modifier to cgroup->subsys[]
  ...
2013-09-03 18:25:03 -07:00
10866e62e8 sched/fair: Fix the sd_parent_degenerate() code
I found that on my WSM box I had a redundant domain:

[    0.949769] CPU0 attaching sched-domain:
[    0.953765]  domain 0: span 0,12 level SIBLING
[    0.958335]   groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
[    0.964548]   domain 1: span 0-5,12-17 level MC
[    0.969206]    groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
[    0.984993]    domain 2: span 0-5,12-17 level CPU
[    0.989822]     groups: 0-5,12-17 (cpu_power = 7055)
[    0.995049]     domain 3: span 0-23 level NUMA
[    0.999620]      groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)

Note how domain 2 has only a single group and spans the same CPUs as
domain 1. We should not keep such domains and do in fact have code to
prune these.

It turns out that the 'new' SD_PREFER_SIBLING flag causes this, it
makes sd_parent_degenerate() fail on the CPU domain. We can easily
fix this by 'ignoring' the SD_PREFER_SIBLING bit and transfering it
to whatever domain ends up covering the span.

With this patch the domains now look like this:

[    0.950419] CPU0 attaching sched-domain:
[    0.954454]  domain 0: span 0,12 level SIBLING
[    0.959039]   groups: 0 (cpu_power = 587) 12 (cpu_power = 588)
[    0.965271]   domain 1: span 0-5,12-17 level MC
[    0.969936]    groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176)
[    0.985737]    domain 2: span 0-23 level NUMA
[    0.990231]     groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056)

Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-ys201g4jwukj0h8xcamakxq1@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-02 08:27:40 +02:00
aee2bce3cf Merge branch 'linus' into perf/core
Pick up the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-29 12:02:08 +02:00
c8d2d47a9c cpumask: Fix cpumask leak in partition_sched_domains()
If doms_new is NULL, partition_sched_domains() will reset ndoms_cur
to 0, and free old sched domains with free_sched_domains(doms_cur, ndoms_cur).
As ndoms_cur is 0, the cpumask will not be freed.

Signed-off-by: Xiaotian Feng <xtfeng@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1375790802-11857-1-git-send-email-xtfeng@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-08-16 17:44:27 +02:00