dl_task_timer() may fire on a different rq from where a task was removed
after throttling. Since the call path is:
dl_task_timer() ->
enqueue_task_dl() ->
enqueue_dl_entity() ->
replenish_dl_entity()
and replenish_dl_entity() uses dl_se's rq, we can't use current's rq
in dl_task_timer(), but we need to lock the task's previous one.
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kirill Tkhai <ktkhai@parallels.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Fixes: 3960c8c0c789 ("sched: Make dl_task_time() use task_rq_lock()")
Link: http://lkml.kernel.org/r/1427792017-7356-1-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Obviously, 'rq' is not used in these two functions, therefore,
there is no reason for it to be passed as an argument.
Signed-off-by: Abel Vesa <abelvesa@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1425383427-26244-1-git-send-email-abelvesa@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It was found when doing a hotplug stress test on POWER, that the
machine either hit softlockups or rcu_sched stall warnings. The
issue was traced to commit:
7cba160ad789 ("powernv/cpuidle: Redesign idle states management")
which exposed the cpu_down() race with hrtimer based broadcast mode:
5d1638acb9f6 ("tick: Introduce hrtimer based broadcast")
The race is the following:
Assume CPU1 is the CPU which holds the hrtimer broadcasting duty
before it is taken down.
CPU0 CPU1
cpu_down() take_cpu_down()
disable_interrupts()
cpu_die()
while (CPU1 != CPU_DEAD) {
msleep(100);
switch_to_idle();
stop_cpu_timer();
schedule_broadcast();
}
tick_cleanup_cpu_dead()
take_over_broadcast()
So after CPU1 disabled interrupts it cannot handle the broadcast
hrtimer anymore, so CPU0 will be stuck forever.
Fix this by explicitly taking over broadcast duty before cpu_die().
This is a temporary workaround. What we really want is a callback
in the clockevent device which allows us to do that from the dying
CPU by pushing the hrtimer onto a different cpu. That might involve
an IPI and is definitely more complex than this immediate fix.
Changelog was picked up from:
https://lkml.org/lkml/2015/2/16/213
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: mpe@ellerman.id.au
Cc: nicolas.pitre@linaro.org
Cc: peterz@infradead.org
Cc: rjw@rjwysocki.net
Fixes: http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html
Link: http://lkml.kernel.org/r/20150330092410.24979.59887.stgit@preeti.in.ibm.com
[ Merged it to the latest timer tree, renamed the callback, tidied up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the broadcasting related section to the GENERIC_CLOCKEVENTS=y
section - this also solves build failures on architectures that
don't use generic clockevents yet.
Also standardize include file style to make it easier to read, and
use nesting depth aware preprocessor directives to make future merges
easier.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
frequently updated inods to never have their timestamps updated.
These changes guarantee that no timestamp on disk will be stale by
more than 24 hours.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJVGx6pAAoJEPL5WVaVDYGjh5cIAKQAyGST92IbTkxRZsxMgqnH
7LQI+fbNn6oHGEjSSnsWLxl6CpwT4WrCmj8WhVmpAoTLU958nBbF7iZAaaeQCGeS
3EqaNOlKvuOK9M5PKK7a5AWO04uJuj+t6s536OqHyB1zRb1yYMsywllPzu63eigA
jxu2yZxkFIKjo2ohSaTDRONVCsQGlqgZ2Aq/Ho5vy5QffVJKTN1G/3Kf33xukUyr
SAnndaax23jMqcFJE3gePYXc3W8EuGoloehKyo04qFeNNVMmSoytXAwMzcTmHn+H
biOTN5ezSKbYzv1aevRg7UuSPv17/yIo3aEberfLBgsn5O4wJGDdS+LajaI5/x8=
=0k0d
-----END PGP SIGNATURE-----
Merge tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull lazytime fixes from Ted Ts'o:
"This fixes a problem in the lazy time patches, which can cause
frequently updated inods to never have their timestamps updated.
These changes guarantee that no timestamp on disk will be stale by
more than 24 hours"
* tag 'lazytime_fix' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
fs: add dirtytime_expire_seconds sysctl
fs: make sure the timestamps for lazytime inodes eventually get written
Use the new tick_suspend/resume_local() and get rid of the
homebrewn implementation of these in the ARM bL switcher. The
check for the cpumask is completely pointless. There is no harm
to suspend a per cpu tick device unconditionally. If that's a
real issue then we fix it proper at the core level and not with
some completely undocumented hacks in some random core code.
Move the tick internals to the core code, now that this nuisance
is gone.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ rjw: Rebase, changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Link: http://lkml.kernel.org/r/1655112.Ws17YsMfN7@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Xen calls on every cpu into tick_resume() which is just wrong.
tick_resume() is for the syscore global suspend/resume
invocation. What XEN really wants is a per cpu local resume
function.
Provide a tick_resume_local() function and use it in XEN.
Also provide a complementary tick_suspend_local() and modify
tick_unfreeze() and tick_freeze(), respectively, to use the
new local tick resume/suspend functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Combined two patches, rebased, modified subject/changelog. ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1698741.eezk9tnXtG@vostro.rjw.lan
[ Merged to latest timers/core. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Solely used in tick-broadcast.c and the return value is
hardcoded 0. Make it static and void.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1689058.QkHYDJSRKu@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
clockevents_notify() is a leftover from the early design of the
clockevents facility. It's really not a notification mechanism,
it's a multiplex call.
We are way better off to have explicit calls instead of this
monstrosity. Split out the suspend/resume() calls and invoke
them directly from the call sites.
No locking required at this point because these calls happen
with interrupts disabled and a single cpu online.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Rebased on top of 4.0-rc5. ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/713674030.jVm1qaHuPf@vostro.rjw.lan
[ Rebased on top of latest timers/core. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Called with 'clockevents_lock' held and interrupts disabled
already.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51005827.yXt5tjZMBs@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
No point to expose everything to the world. People just believe
such functions can be abused for whatever purposes. Sigh.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ Rebased on top of 4.0-rc5 ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/28017337.VbCUc39Gme@vostro.rjw.lan
[ Merged to latest timers/core ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
tick-internal.h is pretty confusing as a lot of the stub inlines
are there several times.
Distangle the maze and make clear functional sections.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/16068264.vcNp79HLaT@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move clocksource related stuff to timekeeping.h and remove the
pointless include from ntp.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/2714218.nM5AEfAHj0@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This option was for simpler migration to the clock events code.
Most architectures have been converted and the option has been
disfunctional as a standalone option for quite some time. Remove
it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5021859.jl9OC1medj@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fix from Ingo Molnar:
"A single sched/rt corner case fix for RLIMIT_RTIME correctness"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix RLIMIT_RTTIME when PI-boosting to RT
Pull perf fix from Ingo Molnar:
"A perf kernel side fix for a fuzzer triggered lockup"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix irq_work 'tail' recursion
Pull locking fix from Ingo Molnar:
"A module unload lockdep race fix"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Fix the module unload key range freeing logic
It was a requirement in the legacy interface that drivers must
initialize ->mode field to 'CLOCK_EVT_MODE_UNUSED'. This field
isn't used anymore by the new interface and so should be only
checked for the legacy interface.
Probably it can be dropped as well as core doesn't rely on it
anymore, but lets keep it to support legacy interface.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: linaro-kernel@lists.linaro.org
Cc: linaro-networking@linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/c6604fa1a77fe1fc8dcab87769857228fb1dadd5.1425037853.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'enum clock_event_mode' is used for two purposes today:
- to pass mode to the driver of clockevent device::set_mode().
- for managing state of the device for clockevents core.
For supporting new modes/states we have moved away from the
legacy set_mode() callback to new per-mode/state callbacks. New
modes/states shouldn't be exposed to the legacy (now OBSOLOTE)
callbacks and so we shouldn't add new states to 'enum
clock_event_mode'.
Lets have separate enums for the two use cases mentioned above.
Keep using the earlier enum for legacy set_mode() callback and
mark it OBSOLETE. And add another enum to clearly specify the
possible states of a clockevent device.
This also renames the newly added per-mode callbacks to reflect
state changes.
We haven't got rid of 'mode' member of 'struct
clock_event_device' as it is used by some of the clockevent
drivers and it would automatically die down once we migrate
those drivers to the new interface. It ('mode') is only updated
now for the drivers using the legacy interface.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: linaro-kernel@lists.linaro.org
Cc: linaro-networking@linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/b6b0143a8a57bd58352ad35e08c25424c879c0cb.1425037853.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Upcoming patch will redefine possible states of a clockevent
device. The RESUME mode is a special case only for tick's
clockevent devices. In future it can be replaced by ->resume()
callback already available for clockevent devices.
Lets handle it separately so that clockevents_set_mode() only
handles states valid across all devices. This also renames
set_mode_resume() to tick_resume() to make it more explicit.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: linaro-kernel@lists.linaro.org
Cc: linaro-networking@linaro.org
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/c1b0112410870f49e7bf06958e1483eac6c15e20.1425037853.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation for more tk_fast instances, remove all hard-coded
tk_fast_mono references.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150319093400.484279927@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce tkr_raw and make use of it.
base_raw -> tkr_raw.base
clock->{mult,shift} -> tkr_raw.{mult.shift}
Kill timekeeping_get_ns_raw() in favour of
timekeeping_get_ns(&tkr_raw), this removes all mono_raw special
casing.
Duplicate the updates to tkr_mono.cycle_last into tkr_raw.cycle_last,
both need the same value.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150319093400.422589590@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation of adding another tkr field, rename this one to
tkr_mono. Also rename tk_read_base::base_mono to tk_read_base::base,
since the structure is not specific to CLOCK_MONOTONIC and the mono
name got added to the tk_read_base instance.
Lots of trivial churn.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150319093400.344679419@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
One version of sched_rt_global_constaints() (the !rt-cgroup one)
changes state, therefore if we fail the later sched_dl_global_constraints()
call the state is left in an inconsistent state.
Fix this by changing the order of the calls.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[ Improved the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/1426590931-4639-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit 40767b0dc768 ("sched/deadline: Fix deadline parameter
modification handling") we clear the thottled state when switching
from a dl task, therefore we should never find it set in switching to
a dl task.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[ Improved the changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/1426590931-4639-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a CPU is kicked to do nohz idle balancing, it wakes up to do load
balancing on itself, followed by load balancing on behalf of idle CPUs.
But it may end up with load after the load balancing attempt on itself.
This aborts nohz idle balancing. As a result several idle CPUs are left
without tasks till such a time that an ILB CPU finds it unfavorable to
pull tasks upon itself. This delays spreading of load across idle CPUs
and worse, clutters only a few CPUs with tasks.
The effect of the above problem was observed on an SMT8 POWER server
with 2 levels of numa domains. Busy loops equal to number of cores were
spawned. Since load balancing on fork/exec is discouraged across numa
domains, all busy loops would start on one of the numa domains. However
it was expected that eventually one busy loop would run per core across
all domains due to nohz idle load balancing. But it was observed that it
took as long as 10 seconds to spread the load across numa domains.
Further investigation showed that this was a consequence of the
following:
1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
load balancing [Given the experiment, upto 6 CPUs per core could be
potentially idle in this domain.]
2. However the ILB CPU would call load_balance() on itself before
initiating nohz idle load balancing.
3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
tasks from its sibling cores to even out load.
4. Now that the ILB CPU was no longer idle, it would abort nohz idle
load balancing
As a result the opportunities to spread load across numa domains were
lost until such a time that the cores within the first numa domain had
equal number of tasks among themselves. This is a pretty bad scenario,
since the cores within the first numa domain would have as many as 4
tasks each, while cores in the neighbouring numa domains would all
remain idle.
Fix this, by checking if a CPU was woken up to do nohz idle load
balancing, before it does load balancing upon itself. This way we allow
idle CPUs across the system to do load balancing which results in
quicker spread of load, instead of performing load balancing within the
local sched domain hierarchy of the ILB CPU alone under circumstances
such as above.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jason Low <jason.low2@hp.com>
Cc: benh@kernel.crashing.org
Cc: daniel.lezcano@linaro.org
Cc: efault@gmx.de
Cc: iamjoonsoo.kim@lge.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: riel@redhat.com
Cc: srikar@linux.vnet.ibm.com
Cc: svaidy@linux.vnet.ibm.com
Cc: tim.c.chen@linux.intel.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the freq invariant accounting (in
__update_entity_runnable_avg() and sched_rt_avg_update()) get the
scale factor from a weak function call, this means that even for archs
that default on their implementation the compiler cannot see into this
function and optimize the extra scaling math away.
This is sad, esp. since its a 64-bit multiplication which can be quite
costly on some platforms.
So replace the weak function with #ifdef and __always_inline goo. This
is not quite as nice from an arch support PoV but should at least
result in compile time errors if done wrong.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
capacity for CFS tasks can be significantly reduced. Once we detect such
situation by comparing cpu_capacity_orig and cpu_capacity, we trig an idle
load balance to check if it's worth moving its tasks on an idle CPU.
It's worth trying to move the task before the CPU is fully utilized to
minimize the preemption by irq or RT tasks.
Once the idle load_balance has selected the busiest CPU, it will look for an
active load balance for only two cases:
- There is only 1 task on the busiest CPU.
- We haven't been able to move a task of the busiest rq.
A CPU with a reduced capacity is included in the 1st case, and it's worth to
actively migrate its task if the idle CPU has got more available capacity for
CFS tasks. This test has been added in need_active_balance.
As a sidenote, this will not generate more spurious ilb because we already
trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that
has a task, we will trig the ilb once for migrating the task.
The nohz_kick_needed function has been cleaned up a bit while adding the new
test
env.src_cpu and env.src_rq must be set unconditionnally because they are used
in need_active_balance which is called even if busiest->nr_running equals 1
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-12-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE.
'struct sg_lb_stats:group_capacity_factor' divides the capacity of the group
by SCHED_LOAD_SCALE to estimate how many task can run in the group. Then, it
compares this value with the sum of nr_running to decide if the group is
overloaded or not.
But the 'group_capacity_factor' concept is hardly working for SMT systems, it
sometimes works for big cores but fails to do the right thing for little cores.
Below are two examples to illustrate the problem that this patch solves:
1- If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUS will have a max capacity_factor of 2
(div_round_closest(3x640/1024) = 2) which means that it will be seen as
overloaded even if we have only one task per CPU.
2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
(1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
(at max and thanks to the fix [0] for SMT system that prevent the apparition
of ghost CPUs) but if one CPU is fully used by rt tasks (and its capacity is
reduced to nearly nothing), the capacity factor of the group will still be 4
(div_round_closest(3*1512/1024) = 5 which is cap to 4 with [0]).
So, this patch tries to solve this issue by removing capacity_factor and
replacing it with the 2 following metrics:
- The available CPU's capacity for CFS tasks which is already used by
load_balance().
- The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
has been re-introduced to compute the usage of a CPU by CFS tasks.
'group_capacity_factor' and 'group_has_free_capacity' has been removed and replaced
by 'group_no_capacity'. We compare the number of task with the number of CPUs and
we evaluate the level of utilization of the CPUs to define if a group is
overloaded or if a group has capacity to handle more tasks.
For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
so it will be selected in priority (among the overloaded groups). Since [1],
SD_PREFER_SIBLING is no more concerned by the computation of 'load_above_capacity'
because local is not overloaded.
[1] 9a5d9ba6a363 ("sched/fair: Allow calculate_imbalance() to move idle cpus")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1425052454-25797-9-git-send-email-vincent.guittot@linaro.org
[ Tidied up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Monitor the usage level of each group of each sched_domain level. The usage is
the portion of cpu_capacity_orig that is currently used on a CPU or group of
CPUs. We use the utilization_load_avg to evaluate the usage level of each
group.
The utilization_load_avg only takes into account the running time of the CFS
tasks on a CPU with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
utilized. Nevertheless, we must cap utilization_load_avg which can be
temporally greater than SCHED_LOAD_SCALE after the migration of a task on this
CPU and until the metrics are stabilized.
The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
running load on the CPU whereas the available capacity for the CFS task is in
the range [0..cpu_capacity_orig]. In order to test if a CPU is fully utilized
by CFS tasks, we have to scale the utilization in the cpu_capacity_orig range
of the CPU to get the usage of the latter. The usage can then be compared with
the available capacity (ie cpu_capacity) to deduct the usage level of a CPU.
The frequency scaling invariance of the usage is not taken into account in this
patch, it will be solved in another patch which will deal with frequency
scaling invariance on the utilization_load_avg.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425455327-13508-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This new field 'cpu_capacity_orig' reflects the original capacity of a CPU
before being altered by rt tasks and/or IRQ
The cpu_capacity_orig will be used:
- to detect when the capacity of a CPU has been noticeably reduced so we can
trig load balance to look for a CPU with better capacity. As an example, we
can detect when a CPU handles a significant amount of irq
(with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by
scheduler whereas CPUs, which are really idle, are available.
- evaluate the available capacity for CFS tasks
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The average running time of RT tasks is used to estimate the remaining compute
capacity for CFS tasks. This remaining capacity is the original capacity scaled
down by a factor (aka scale_rt_capacity). This estimation of available capacity
must also be invariant with frequency scaling.
A frequency scaling factor is applied on the running time of the RT tasks for
computing scale_rt_capacity.
In sched_rt_avg_update(), we now scale the RT execution time like below:
rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
Then, scale_rt_capacity can be summarized by:
scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total
with available = total - rq->rt_avg
This has been been optimized in current code by:
scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT)
But we can also developed the equation like below:
scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total)
and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in
the computation of rq->rt_avg and scale_rt_capacity().
so rq->rt_avg += rt_delta * arch_scale_freq_capacity()
and
scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)
arch_scale_frequency_capacity() will be called in the hot path of the scheduler
which implies to have a short and efficient function.
As an example, arch_scale_frequency_capacity() should return a cached value that
is updated periodically outside of the hot path.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Apply frequency scale-invariance correction factor to usage tracking.
Each segment of the running_avg_sum geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling.
As a result, utilization_load_avg which is the sum of utilization_avg_contrib,
becomes invariant too. So the usage level that is returned by get_cpu_usage(),
stays relative to the max frequency as the cpu_capacity which is is compared against.
Then, we want the keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761=48388 (88761 is the max weigth of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity() must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY = 1024).
So we define the range to [0..SCHED_SCALE_CAPACITY] in order to avoid overflow.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425455186-13451-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add usage contribution tracking for group entities. Unlike
se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
entities is the sum of se->avg.utilization_avg_contrib for all entities on the
group runqueue.
It is _not_ influenced in any way by the task group h_load. Hence it is
representing the actual cpu usage of the group, not its intended load
contribution which may differ significantly from the utilization on
lightly utilized systems.
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add new statistics which reflect the average time a task is running on the CPU
and the sum of these running time of the tasks on a runqueue. The latter is
named utilization_load_avg.
This patch is based on the usage metric that was proposed in the 1st
versions of the per-entity load tracking patchset by Paul Turner
<pjt@google.com> but that has be removed afterwards. This version differs from
the original one in the sense that it's not linked to task_group.
The rq's utilization_load_avg will be used to check if a rq is overloaded or
not instead of trying to compute how many tasks a group of CPUs can handle.
Rename runnable_avg_period into avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.
Add some descriptions of the variables to explain their differences.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Morten.Rasmussen@arm.com
Cc: Paul Turner <pjt@google.com>
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Trivial cleanups, to improve the readability of the generic sched_clock() code:
- Improve and standardize comments
- Standardize the coding style
- Use vertical spacing where appropriate
- etc.
No code changed:
md5:
19a053b31e0c54feaeff1492012b019a sched_clock.o.before.asm
19a053b31e0c54feaeff1492012b019a sched_clock.o.after.asm
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently it is possible for an NMI (or FIQ on ARM) to come in
and read sched_clock() whilst update_sched_clock() has locked
the seqcount for writing. This results in the NMI handler
locking up when it calls raw_read_seqcount_begin().
This patch fixes the NMI safety issues by providing banked clock
data. This is a similar approach to the one used in Thomas
Gleixner's 4396e058c52e("timekeeping: Provide fast and NMI safe
access to CLOCK_MONOTONIC").
Suggested-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1427397806-20889-6-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently update_sched_clock() is marked as notrace but this
function is not called by ftrace. This is trivially fixed by
removing the mark up.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1427397806-20889-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently cd.read_data.suspended is read by the hotpath function
sched_clock(). This variable need not be accessed on the
hotpath. In fact, once it is removed, we can remove the
conditional branches from sched_clock() and install a dummy
read_sched_clock function to suspend the clock.
The new master copy of the function pointer
(actual_read_sched_clock) is introduced and is used for all
reads of the clock hardware except those within sched_clock
itself.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1427397806-20889-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently sched_clock(), a very hot code path, is not optimized
to minimise its cache profile. In particular:
1. cd is not ____cacheline_aligned,
2. struct clock_data does not distinguish between hotpath and
coldpath data, reducing locality of reference in the hotpath,
3. Some hotpath data is missing from struct clock_data and is marked
__read_mostly (which more or less guarantees it will not share a
cache line with cd).
This patch corrects these problems by extracting all hotpath
data into a separate structure and using ____cacheline_aligned
to ensure the hotpath uses a single (64 byte) cache line.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1427397806-20889-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the scope of the raw_write_seqcount_begin/end() in
sched_clock_register() far exceeds the scope of the read section
in sched_clock(). This gives the impression of safety during
cursory review but achieves little.
Note that this is likely to be a latent issue at present because
sched_clock_register() is typically called before we enable
interrupts, however the issue does risk bugs being needlessly
introduced as the code evolves.
This patch fixes the problem by increasing the scope of the read
locking performed by sched_clock() to cover all data modified by
sched_clock_register.
We also improve clarity by moving writes to struct clock_data
that do not impact sched_clock() outside of the critical
section.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
[ Reworked it slightly to apply to tip/timers/core]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1427397806-20889-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
Across the board the 4.0-rc1 numbers are much slower, and the degradation
is far worse when using the large memory footprint configs. Perf points
straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:
- 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys
- default_send_IPI_mask_sequence_phys
- 99.99% physflat_send_IPI_mask
- 99.37% native_send_call_func_ipi
smp_call_function_many
- native_flush_tlb_others
- 99.85% flush_tlb_page
ptep_clear_flush
try_to_unmap_one
rmap_walk
try_to_unmap
migrate_pages
migrate_misplaced_page
- handle_mm_fault
- 99.73% __do_page_fault
trace_do_page_fault
do_async_page_fault
+ async_page_fault
0.63% native_send_call_func_single_ipi
generic_exec_single
smp_call_function_single
This is showing excessive migration activity even though excessive
migrations are meant to get throttled. Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults. However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote. This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen. This patch tracks when migration failures occur and
slows the PTE scanner.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following point:
2. per-CPU pvclock time info is updated if the
underlying CPU changes.
Is not true anymore since "KVM: x86: update pvclock area conditionally,
on cpu migration".
Add task migration notification back.
Problem noticed by Andy Lutomirski.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
CC: stable@kernel.org # 3.11+