25362 Commits

Author SHA1 Message Date
875aabf52e Merge branch 'uuid-types'
Merge 'uuid-types' from git://git.infradead.org/users/hch/uuid.git
2017-07-03 14:13:44 +02:00
43188702b3 bpf, verifier: add additional patterns to evaluate_reg_imm_alu
Currently the verifier does not track imm across alu operations when
the source register is of unknown type. This adds additional pattern
matching to catch this and track imm. We've seen LLVM generating this
pattern while working on cilium.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03 02:22:52 -07:00
7bda4b40c5 bpf: extend bpf_trace_printk to support %i
Currently, bpf_trace_printk does not support common formatting
symbol '%i' however vsprintf does and is what eventually gets
called by bpf helper. If users are used to '%i' and currently
make use of it, then bpf_trace_printk will just return with
error without dumping anything to the trace pipe, so just add
support for '%i' to the helper.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03 02:22:52 -07:00
9780c0ab1a bpf: export whether tail call has jited owner
We do export through fdinfo already whether a prog is JITed or not,
given a program load can fail in case of either prog or tail call map
has JITed property, but neither both are JITed or not JITed, we can
facilitate error reporting in loaders like iproute2 through exporting
owner_jited of tail call map. We already do export owner_prog_type
through this facility, so parser can pick up both for comparison.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03 02:22:52 -07:00
f96da09473 bpf: simplify narrower ctx access
This work tries to make the semantics and code around the
narrower ctx access a bit easier to follow. Right now
everything is done inside the .is_valid_access(). Offset
matching is done differently for read/write types, meaning
writes don't support narrower access and thus matching only
on offsetof(struct foo, bar) is enough whereas for read
case that supports narrower access we must check for
offsetof(struct foo, bar) + offsetof(struct foo, bar) +
sizeof(<bar>) - 1 for each of the cases. For read cases of
individual members that don't support narrower access (like
packet pointers or skb->cb[] case which has its own narrow
access logic), we check as usual only offsetof(struct foo,
bar) like in write case. Then, for the case where narrower
access is allowed, we also need to set the aux info for the
access. Meaning, ctx_field_size and converted_op_size have
to be set. First is the original field size e.g. sizeof(<bar>)
as in above example from the user facing ctx, and latter
one is the target size after actual rewrite happened, thus
for the kernel facing ctx. Also here we need the range match
and we need to keep track changing convert_ctx_access() and
converted_op_size from is_valid_access() as both are not at
the same location.

We can simplify the code a bit: check_ctx_access() becomes
simpler in that we only store ctx_field_size as a meta data
and later in convert_ctx_accesses() we fetch the target_size
right from the location where we do convert. Should the verifier
be misconfigured we do reject for BPF_WRITE cases or target_size
that are not provided. For the subsystems, we always work on
ranges in is_valid_access() and add small helpers for ranges
and narrow access, convert_ctx_accesses() sets target_size
for the relevant instruction.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Cc: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-03 02:22:52 -07:00
40304b2a15 bpf: BPF support for sock_ops
Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connections parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.

Alghough there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it oes not require
application changes and it can be updated easily at any time.

Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
only once during the connections's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code.  For example, before sending SYN and SYN-ACKs to set
an appropriate timeout, when the connection is established to set
congestion control, etc. As a result it has "op" field to specify the
type of operation requested.

The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.

This patch only contains the framework to support the new BPF program
type, following patches add the functionality to set various connection
parameters.

This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
and a new bpf syscall command to load a new program of this type:
BPF_PROG_LOAD_SOCKET_OPS.

Two new corresponding structs (one for the kernel one for the user/BPF
program):

/* kernel version */
struct bpf_sock_ops_kern {
        struct sock *sk;
        __u32  op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
};

/* user version
 * Some fields are in network byte order reflecting the sock struct
 * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
 * convert them to host byte order.
 */
struct bpf_sock_ops {
        __u32 op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;     /* In network byte order */
        __u32 local_ip4;      /* In network byte order */
        __u32 remote_ip6[4];  /* In network byte order */
        __u32 local_ip6[4];   /* In network byte order */
        __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte horder */
};

Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.

The reply fields of the bpf_sockt_ops struct are there in case a bpf
program needs to return a value larger than an integer.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 16:15:13 -07:00
c0a0c7a4e1 Two fixes:
One is for a crash when using the :mod: trace probe command into
  stack_trace_filter. This bug was introduced during the last merge
  window.
 
  The other was there forever. It's a small bug that makes it impossible
  to name a module function for kprobes when the module starts with a digit.
 -----BEGIN PGP SIGNATURE-----
 
 iQExBAABCAAbBQJZVsqbFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
 AsoH+wWK8tsDqI6MBTXO+v5RwLUu/zHClcMEcJGKLFsHRZ8HOJ9Afg+c1LbTujRR
 Ck20l+U/DibVO1AnjJ9elJDj7/3ajTUfCrTVCKf5B9XbdzAD2qZle3byynhZvm2Q
 CwVbzMvRYd5jZzEiO95YKOhH6iIDfXOZM7vQzz0F/bDZn9uAxiFumwdiSNA+f2wP
 6Ykeuth/IZh0xdbaTqsH1XvhweUSpIIjhOZH/V/uAg+LuRffh4sOBR/wjYUMuegX
 tgpGfJu8VVsa+GIBcThdCkjy1k48GLBsZ6j/47Lhc5r3GNIVR7D7mb9x3zTn1U30
 WTeDu+betctH7b03hiH88kKoEfg=
 =RXOC
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull last-minute tracing fixes from Steven Rostedt:
 "Two fixes:

  One is for a crash when using the :mod: trace probe command into
  stack_trace_filter. This bug was introduced during the last merge
  window.

  The other was there forever. It's a small bug that makes it impossible
  to name a module function for kprobes when the module starts with a
  digit"

* tag 'trace-v4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/kprobes: Allow to create probe with a module name starting with a digit
  ftrace: Fix regression with module command in stack_trace_filter
2017-06-30 17:18:57 -07:00
3859a271a0 randstruct: Mark various structs for randomization
This marks many critical kernel structures for randomization. These are
structures that have been targeted in the past in security exploits, or
contain functions pointers, pointers to function pointer tables, lists,
workqueues, ref-counters, credentials, permissions, or are otherwise
sensitive. This initial list was extracted from Brad Spengler/PaX Team's
code in the last public patch of grsecurity/PaX based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code.

Left out of this list is task_struct, which requires special handling
and will be covered in a subsequent patch.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-30 12:00:51 -07:00
b079115937 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A set of overlapping changes in macvlan and the rocker
driver, nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 12:43:08 -04:00
c207aee480 objtool, x86: Add several functions and files to the objtool whitelist
In preparation for an objtool rewrite which will have broader checks,
whitelist functions and files which cause problems because they do
unusual things with the stack.

These whitelists serve as a TODO list for which functions and files
don't yet have undwarf unwinder coverage.  Eventually most of the
whitelists can be removed in favor of manual CFI hint annotations or
objtool improvements.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/7f934a5d707a574bda33ea282e9478e627fb1829.1498659915.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30 10:19:19 +02:00
725816e8aa posix_clocks: Use get_itimerspec64() and put_itimerspec64()
Usage of these apis and their compat versions makes
the syscalls: timer_settime and timer_gettime and their
compat implementations simpler.

This patch also serves as a preparatory patch for changing
syscalls to use new time_t data types to support the
y2038 effort by isolating the processing of user pointers
through these apis.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 04:15:02 -04:00
c0edd7c9ac nanosleep: Use get_timespec64() and put_timespec64()
Usage of these apis and their compat versions makes
the syscalls: clock_nanosleep and nanosleep and
their compat implementations simpler.

This is a preparatory patch to isolate data conversions to
struct timespec64 at userspace boundaries. This helps contain
the changes needed to transition to new y2038 safe types.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 04:14:14 -04:00
5c4994102f posix-timers: Use get_timespec64() and put_timespec64()
Usage of these apis and their compat versions makes
the syscalls: clock_gettime, clock_settime, clock_getres
and their compat implementations simpler.

This is a preparatory patch to isolate data conversions to
struct timespec64 at userspace boundaries. This helps contain
the changes needed to transition to new y2038 safe types.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 04:13:19 -04:00
72298e5c92 sched/cputime: Refactor the cputime_adjust() code
Address a Coverity false positive, which is caused by overly
convoluted code:

Value assigned to variable 'utime' at line 619:utime = rtime;
is overwritten at line 642:utime = rtime - stime; before it
can be used. This makes such variable assignment useless.

Remove this variable assignment and refactor the code related.

Addresses-Coverity-ID: 1371643
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Cc: Frans Klaver <fransklaver@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/20170629184128.GA5271@embeddedgus
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30 09:37:59 +02:00
993647a293 cpu/hotplug: Constify attribute_group structures
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group.

So mark the non-const structs as const:

File size before:
   text	   data	    bss	    dec	    hex	filename
  12582	  15361	     20	  27963	   6d3b	kernel/cpu.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
  12710	  15265	     20	  27995	   6d5b	kernel/cpu.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: anna-maria@linutronix.de
Cc: bigeasy@linutronix.de
Cc: boris.ostrovsky@oracle.com
Cc: rcochran@linutronix.de
Link: http://lkml.kernel.org/r/f9079e94e12b36d245e7adbf67d312bc5d0250c6.1498737970.git.arvind.yadav.cs@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30 09:34:39 +02:00
48365b3884 sched/debug: Expose the number of RT/DL tasks that can migrate
Add the value of the rt_rq.rt_nr_migratory and dl_rq.dl_nr_migratory
to the sched_debug output, for instance:

 rt_rq[0]:
   .rt_nr_running                 : 2
   .rt_nr_migratory               : 1     <--- Like this
   .rt_throttled                  : 0
   .rt_time                       : 828.645877
   .rt_runtime                    : 1000.000000

This is useful to debug problems related to the RT/DL schedulers.

This also fixes the format of some variables, that were unsigned, rather
than signed.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Link: http://lkml.kernel.org/r/7896f71cada54ee7dd8507bb666063a2e051c3d4.1498482127.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-30 09:32:07 +02:00
e4448ed87c bpf: don't open-code memdup_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 02:04:11 -04:00
a9bd8dfa53 kimage_file_prepare_segments(): don't open-code memdup_user()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-30 02:04:10 -04:00
9e52b32567 tracing/kprobes: Allow to create probe with a module name starting with a digit
Always try to parse an address, since kstrtoul() will safely fail when
given a symbol as input. If that fails (which will be the case for a
symbol), try to parse a symbol instead.

This allows creating a probe such as:

    p:probe/vlan_gro_receive 8021q:vlan_gro_receive+0

Which is necessary for this command to work:

    perf probe -m 8021q -a vlan_gro_receive

Link: http://lkml.kernel.org/r/fd72d666f45b114e2c5b9cf7e27b91de1ec966f1.1498122881.git.sd@queasysnail.net

Cc: stable@vger.kernel.org
Fixes: 413d37d1e ("tracing: Add kprobe-based event tracer")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-29 23:13:23 -04:00
4d8a991d46 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Need to access netdev->num_rx_queues behind an accessor in netvsc
    driver otherwise the build breaks with some configs, from Arnd
    Bergmann.

 2) Add dummy xfrm_dev_event() so that build doesn't fail when
    CONFIG_XFRM_OFFLOAD is not set. From Hangbin Liu.

 3) Don't OOPS when pfkey_msg2xfrm_state() signals an erros, from Dan
    Carpenter.

 4) Fix MCDI command size for filter operations in sfc driver, from
    Martin Habets.

 5) Fix UFO segmenting so that we don't calculate incorrect checksums,
    from Michal Kubecek.

 6) When ipv6 datagram connects fail, reset destination address and
    port. From Wei Wang.

 7) TCP disconnect must reset the cached receive DST, from WANG Cong.

 8) Fix sign extension bug on 32-bit in dev_get_stats(), from Eric
    Dumazet.

 9) fman driver has to depend on HAS_DMA, from Madalin Bucur.

10) Fix bpf pointer leak with xadd in verifier, from Daniel Borkmann.

11) Fix negative page counts with GFO, from Michal Kubecek.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
  sfc: fix attempt to translate invalid filter ID
  net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish()
  bpf: prevent leaking pointer via xadd on unpriviledged
  arcnet: com20020-pci: add missing pdev setup in netdev structure
  arcnet: com20020-pci: fix dev_id calculation
  arcnet: com20020: remove needless base_addr assignment
  Trivial fix to spelling mistake in arc_printk message
  arcnet: change irq handler to lock irqsave
  rocker: move dereference before free
  mlxsw: spectrum_router: Fix NULL pointer dereference
  net: sched: Fix one possible panic when no destroy callback
  virtio-net: serialize tx routine during reset
  net: usb: asix88179_178a: Add support for the Belkin B2B128
  fsl/fman: add dependency on HAS_DMA
  net: prevent sign extension in dev_get_stats()
  tcp: reset sk_rx_dst in tcp_disconnect()
  net: ipv6: reset daddr and dport in sk if connect() fails
  bnx2x: Don't log mc removal needlessly
  bnxt_en: Fix netpoll handling.
  bnxt_en: Add missing logic to handle TPA end error conditions.
  ...
2017-06-29 14:30:07 -07:00
59494fe2c8 PM: hibernate: constify attribute_group structures.
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.

File size before:
   text	   data	    bss	    dec	    hex	filename
   6332	    488	    308	   7128	   1bd8	kernel/power/hibernate.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
   6396	    424	    308	   7128	   1bd8	kernel/power/hibernate.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-29 23:05:48 +02:00
6bdf6abc56 bpf: prevent leaking pointer via xadd on unpriviledged
Leaking kernel addresses on unpriviledged is generally disallowed,
for example, verifier rejects the following:

  0: (b7) r0 = 0
  1: (18) r2 = 0xffff897e82304400
  3: (7b) *(u64 *)(r1 +48) = r2
  R2 leaks addr into ctx

Doing pointer arithmetic on them is also forbidden, so that they
don't turn into unknown value and then get leaked out. However,
there's xadd as a special case, where we don't check the src reg
for being a pointer register, e.g. the following will pass:

  0: (b7) r0 = 0
  1: (7b) *(u64 *)(r1 +48) = r0
  2: (18) r2 = 0xffff897e82304400 ; map
  4: (db) lock *(u64 *)(r1 +48) += r2
  5: (95) exit

We could store the pointer into skb->cb, loose the type context,
and then read it out from there again to leak it eventually out
of a map value. Or more easily in a different variant, too:

   0: (bf) r6 = r1
   1: (7a) *(u64 *)(r10 -8) = 0
   2: (bf) r2 = r10
   3: (07) r2 += -8
   4: (18) r1 = 0x0
   6: (85) call bpf_map_lookup_elem#1
   7: (15) if r0 == 0x0 goto pc+3
   R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
   8: (b7) r3 = 0
   9: (7b) *(u64 *)(r0 +0) = r3
  10: (db) lock *(u64 *)(r0 +0) += r6
  11: (b7) r0 = 0
  12: (95) exit

  from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
  11: (b7) r0 = 0
  12: (95) exit

Prevent this by checking xadd src reg for pointer types. Also
add a couple of test cases related to this.

Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:44:34 -04:00
8007e40a24 bpf: Fix out-of-bound access on interpreters[]
The index is off-by-one when fp->aux->stack_depth
has already been rounded up to 32.  In particular,
if stack_depth is 512, the index will be 16.

The fix is to round_up and then takes -1 instead of round_down.

[   22.318680] ==================================================================
[   22.319745] BUG: KASAN: global-out-of-bounds in bpf_prog_select_runtime+0x48a/0x670
[   22.320737] Read of size 8 at addr ffffffff82aadae0 by task sockex3/1946
[   22.321646]
[   22.321858] CPU: 1 PID: 1946 Comm: sockex3 Tainted: G        W       4.12.0-rc6-01680-g2ee87db3a287 #22
[   22.323061] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
[   22.324260] Call Trace:
[   22.324612]  dump_stack+0x67/0x99
[   22.325081]  print_address_description+0x1e8/0x290
[   22.325734]  ? bpf_prog_select_runtime+0x48a/0x670
[   22.326360]  kasan_report+0x265/0x350
[   22.326860]  __asan_report_load8_noabort+0x19/0x20
[   22.327484]  bpf_prog_select_runtime+0x48a/0x670
[   22.328109]  bpf_prog_load+0x626/0xd40
[   22.328637]  ? __bpf_prog_charge+0xc0/0xc0
[   22.329222]  ? check_nnp_nosuid.isra.61+0x100/0x100
[   22.329890]  ? __might_fault+0xf6/0x1b0
[   22.330446]  ? lock_acquire+0x360/0x360
[   22.331013]  SyS_bpf+0x67c/0x24d0
[   22.331491]  ? trace_hardirqs_on+0xd/0x10
[   22.332049]  ? __getnstimeofday64+0xaf/0x1c0
[   22.332635]  ? bpf_prog_get+0x20/0x20
[   22.333135]  ? __audit_syscall_entry+0x300/0x600
[   22.333770]  ? syscall_trace_enter+0x540/0xdd0
[   22.334339]  ? exit_to_usermode_loop+0xe0/0xe0
[   22.334950]  ? do_syscall_64+0x48/0x410
[   22.335446]  ? bpf_prog_get+0x20/0x20
[   22.335954]  do_syscall_64+0x181/0x410
[   22.336454]  entry_SYSCALL64_slow_path+0x25/0x25
[   22.337121] RIP: 0033:0x7f263fe81f19
[   22.337618] RSP: 002b:00007ffd9a3440c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
[   22.338619] RAX: ffffffffffffffda RBX: 0000000000aac5fb RCX: 00007f263fe81f19
[   22.339600] RDX: 0000000000000030 RSI: 00007ffd9a3440d0 RDI: 0000000000000005
[   22.340470] RBP: 0000000000a9a1e0 R08: 0000000000a9a1e0 R09: 0000009d00000001
[   22.341430] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000010000
[   22.342411] R13: 0000000000a9a023 R14: 0000000000000001 R15: 0000000000000003
[   22.343369]
[   22.343593] The buggy address belongs to the variable:
[   22.344241]  interpreters+0x80/0x980
[   22.344708]
[   22.344908] Memory state around the buggy address:
[   22.345556]  ffffffff82aad980: 00 00 00 04 fa fa fa fa 04 fa fa fa fa fa fa fa
[   22.346449]  ffffffff82aada00: 00 00 00 00 00 fa fa fa fa fa fa fa 00 00 00 00
[   22.347361] >ffffffff82aada80: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa
[   22.348301]                                                        ^
[   22.349142]  ffffffff82aadb00: 00 01 fa fa fa fa fa fa 00 00 00 00 00 00 00 00
[   22.350058]  ffffffff82aadb80: 00 00 07 fa fa fa fa fa 00 00 05 fa fa fa fa fa
[   22.350984] ==================================================================

Fixes: b870aa901f4b ("bpf: use different interpreter depending on required stack size")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 15:37:04 -04:00
14dc6f04f4 bpf: Add syscall lookup support for fd array and htab
This patch allows userspace to do BPF_MAP_LOOKUP_ELEM on
BPF_MAP_TYPE_PROG_ARRAY,
BPF_MAP_TYPE_ARRAY_OF_MAPS and
BPF_MAP_TYPE_HASH_OF_MAPS.

The lookup returns a prog-id or map-id to the userspace.
The userspace can then use the BPF_PROG_GET_FD_BY_ID
or BPF_MAP_GET_FD_BY_ID to get a fd.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 13:13:25 -04:00
0f17976568 ftrace: Fix regression with module command in stack_trace_filter
When doing the following command:

 # echo ":mod:kvm_intel" > /sys/kernel/tracing/stack_trace_filter

it triggered a crash.

This happened with the clean up of probes. It required all callers to the
regex function (doing ftrace filtering) to have ops->private be a pointer to
a trace_array. But for the stack tracer, that is not the case.

Allow for the ops->private to be NULL, and change the function command
callbacks to handle the trace_array pointer being NULL as well.

Fixes: d2afd57a4b96 ("tracing/ftrace: Allow instances to have their own function probes")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-29 10:05:45 -04:00
96b5b19459 module: make the modinfo name const
This can be accomplished by making blacklisted() also accept const.

Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
[jeyu: fix typo]
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-06-29 14:19:17 +02:00
ff801b716e sched/numa: Hide numa_wake_affine() from UP build
Stephen reported the following build warning in UP:

kernel/sched/fair.c:2657:9: warning: 'struct sched_domain' declared inside
parameter list
         ^
/home/sfr/next/next/kernel/sched/fair.c:2657:9: warning: its scope is only this
definition or declaration, which is probably not what you want

Hide the numa_wake_affine() inline stub on UP builds to get rid of it.

Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-06-29 08:25:52 +02:00
2287d8664f timers: Make the cpu base lock raw
The timers cpu base lock could not be converted to a raw spinlock becaue
the lock held time was non-deterministic due to cascading and long lasting
timer wheel traversals.

The rework of the timer wheel to the new non-cascading model removed also
the wheel traversals and the lock held times are deterministic now. This
allows to make the lock raw and thereby unbreaks NOHz* on preempt-RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20170627161538.30257-1-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-06-29 00:27:24 +02:00
5136f6365c cgroup: implement "nsdelegate" mount option
Currently, cgroup only supports delegation to !root users and cgroup
namespaces don't get any special treatments.  This limits the
usefulness of cgroup namespaces as they by themselves can't be safe
delegation boundaries.  A process inside a cgroup can change the
resource control knobs of the parent in the namespace root and may
move processes in and out of the namespace if cgroups outside its
namespace are visible somehow.

This patch adds a new mount option "nsdelegate" which makes cgroup
namespaces delegation boundaries.  If set, cgroup behaves as if write
permission based delegation took place at namespace boundaries -
writes to the resource control knobs from the namespace root are
denied and migration crossing the namespace boundary aren't allowed
from inside the namespace.

This allows cgroup namespace to function as a delegation boundary by
itself.

v2: Silently ignore nsdelegate specified on !init mounts.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aravind Anbudurai <aru7@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
2017-06-28 14:45:21 -04:00
824ecbe01c cgroup: restructure cgroup_procs_write_permission()
Restructure cgroup_procs_write_permission() to make extending
permission logic easier.

This patch doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-06-28 14:45:02 -04:00
4ec7846785 ftrace: Decrement count for dyn_ftrace_total_info for init functions
Init boot up functions may be traced, but they are also freed when the
kernel finishes booting. These are removed from the ftrace tables, and the
debug variable for dyn_ftrace_total_info needs to reflect that as well.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-28 11:57:03 -04:00
3b58a3c72f ftrace: Unlock hash mutex on failed allocation in process_mod_list()
If the new_hash fails to allocate, then unlock the hash mutex on error.

Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-28 09:09:38 -04:00
165d1cc007 kmod: reduce atomic operations on kmod_concurrent and simplify
When checking if we want to allow a kmod thread to kick off we increment,
then read to see if we should enable a thread. If we were over the allowed
limit limit we decrement. Splitting the increment far apart from decrement
means there could be a time where two increments happen potentially
giving a false failure on a thread which should have been allowed.

CPU1			CPU2
atomic_inc()
			atomic_inc()
atomic_read()
			atomic_read()
atomic_dec()
			atomic_dec()

In this case a read on CPU1 gets the atomic_inc()'s and we could negate
it from getting a kmod thread. We could try to prevent this with a lock
or preemption but that is overkill. We can fix by reducing the number of
atomic operations. We do this by inverting the logic of of the enabler,
instead of incrementing kmod_concurrent as we get new kmod users, define the
variable kmod_concurrent_max as the max number of currently allowed kmod
users and as we get new kmod users just decrement it if its still positive.
This combines the dec and read in one atomic operation.

In this case we no longer get the same false failure:

CPU1			CPU2
atomic_dec_if_positive()
			atomic_dec_if_positive()
atomic_inc()
			atomic_inc()

The number of threads is computed at init, and since the current computation
of kmod_concurrent includes the thread count we can avoid setting
kmod_concurrent_max later in boot through an init call by simply sticking to
50 as the kmod_concurrent_max. The assumption here is a system with modules
must at least have ~16 MiB of RAM.

Suggested-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-06-27 19:36:31 +02:00
93437353da module: use list_for_each_entry_rcu() on find_module_all()
The module list has been using RCU in a lot of other calls
for a while now, we just overlooked changing this one over to
use RCU.

Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-06-27 19:35:52 +02:00
441dae8f2f tracing: Add support for display of tgid in trace output
Earlier patches introduced ability to record the tgid using the 'record-tgid'
option. Here we read the tgid and output it if the option is enabled.

Link: http://lkml.kernel.org/r/20170626053844.5746-3-joelaf@google.com

Cc: kernel-team@android.com
Cc: Ingo Molnar <mingo@redhat.com>
Tested-by: Michael Sartain <mikesart@gmail.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:28 -04:00
d914ba37d7 tracing: Add support for recording tgid of tasks
Inorder to support recording of tgid, the following changes are made:

* Introduce a new API (tracing_record_taskinfo) to additionally record the tgid
  along with the task's comm at the same time. This has has the benefit of not
  setting trace_cmdline_save before all the information for a task is saved.
* Add a new API tracing_record_taskinfo_sched_switch to record task information
  for 2 tasks at a time (previous and next) and use it from sched_switch probe.
* Preserve the old API (tracing_record_cmdline) and create it as a wrapper
  around the new one so that existing callers aren't affected.
* Reuse the existing sched_switch and sched_wakeup probes to record tgid
  information and add a new option 'record-tgid' to enable recording of tgid

When record-tgid option isn't enabled to being with, we take care to make sure
that there's isn't memory or runtime overhead.

Link: http://lkml.kernel.org/r/20170627020155.5139-1-joelaf@google.com

Cc: kernel-team@android.com
Cc: Ingo Molnar <mingo@redhat.com>
Tested-by: Michael Sartain <mikesart@gmail.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:28 -04:00
83dd14933e ftrace: Decrement count for dyn_ftrace_total_info file
The dyn_ftrace_total_info file is used to show how many functions have been
converted into nops and can be used by ftrace. The problem is that it does
not get decremented when functions are removed (init boot code being freed,
and modules being freed). That means the number is very inaccurate everytime
functions are removed from the ftrace tables. Decrement it when functions
are removed.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:27 -04:00
6a9c981b1e ftrace: Remove unused function ftrace_arch_read_dyn_info()
ftrace_arch_read_dyn_info() was used so that archs could add its own debug
information into the dyn_ftrace_total_info in the tracefs file system. That
file is for debugging usage of dynamic ftrace. No arch uses that function
anymore, so just get rid of it.

This also allows for tracing_read_dyn_info() to be cleaned up a bit.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-27 13:30:22 -04:00
eba74c2944 PM / hibernate: Drop redundant parameter of swsusp_alloc()
The first parameter of swsusp_alloc is not used, so drop it.

Signed-off-by: BaoJun Luo <baojun.luo@samsung.com>
[ rjw: Subject & changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-27 02:10:44 +02:00
49368a47f6 PM / hibernate: Use CONFIG_HAVE_SET_MEMORY for include condition
Kbuild reported a build failure when CONFIG_STRICT_KERNEL_RWX was
enabled on powerpc. We don't yet have ARCH_HAS_SET_MEMORY and ppc32
saw a build failure.

I've only done a basic compile test with a config that has
hibernation enabled.

Fixes: 50327ddfbc92 (kernel/power/snapshot.c: use set_memory.h header)
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-06-27 02:05:28 +02:00
0b5fa22906 seccomp: Switch from atomic_t to recount_t
This switches the seccomp usage tracking from atomic_t to refcount_t to
gain refcount overflow protections.

Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: David Windsor <dwindsor@gmail.com>
Cc: Hans Liljestrand <hans.liljestrand@aalto.fi>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-26 09:24:00 -07:00
131b635159 seccomp: Clean up core dump logic
This just cleans up the core dumping logic to avoid the braces around
the RET_KILL case.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-06-26 09:22:33 -07:00
8c08f0d5c6 ftrace: Have cached module filters be an active filter
When a module filter is added to set_ftrace_filter, if the module is not
loaded, it is cached. This should be considered an active filter, and
function tracing should be filtered by this. That is, if a cached module
filter is the only filter set, then no function tracing should be happening,
as all the functions available will be filtered out.

This makes sense, as the reason to add a cached module filter, is to trace
the module when you load it. There shouldn't be any other tracing happening
until then.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:04 -04:00
d7fbf8df7c ftrace: Implement cached modules tracing on module load
If a module is cached in the set_ftrace_filter, and that module is loaded,
then enable tracing on that module as if the cached module text was written
into set_ftrace_filter just as the module is loaded.

  # echo ":mod:kvm_intel" >
  # cat /sys/kernel/tracing/set_ftrace_filter
 #### all functions enabled ####
 :mod:kvm_intel
  # modprobe kvm_intel
  # cat /sys/kernel/tracing/set_ftrace_filter
 vmx_get_rflags [kvm_intel]
 vmx_get_pkru [kvm_intel]
 vmx_get_interrupt_shadow [kvm_intel]
 vmx_rdtscp_supported [kvm_intel]
 vmx_invpcid_supported [kvm_intel]
 [..]

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:03 -04:00
5985ea8bd5 ftrace: Have the cached module list show in set_ftrace_filter
When writing in a module filter into set_ftrace_filter for a module that is
not yet loaded, it it cached, and will be executed when the module is loaded
(although that is not implemented yet at this commit). Display the list of
cached modules to be traced.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:02 -04:00
673feb9d76 ftrace: Add :mod: caching infrastructure to trace_array
This is the start of the infrastructure work to allow for tracing module
functions before it is loaded.

Currently the following command:

  # echo :mod:some-mod > set_ftrace_filter

will enable tracing of all functions within the module "some-mod" if it is
loaded. What we want, is if the module is not loaded, that line will be
saved. When the module is loaded, then the "some-mod" will have that line
executed on it, so that the functions within it starts being traced.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-06-26 11:53:02 -04:00
1ba5c08b58 kernel/module.c: suppress warning about unused nowarn variable
This patch fix the following warning:
kernel/module.c: In function 'add_usage_links':
kernel/module.c:1653:6: warning: variable 'nowarn' set but not used [-Wunused-but-set-variable]

[jeyu: folded in first patch since it only swapped the function order
so that del_usage_links can be called from add_usage_links]
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: Jessica Yu <jeyu@kernel.org>
2017-06-26 17:23:19 +02:00
bf22ff45be genirq: Avoid unnecessary low level irq function calls
Check irq state in enable/disable/unmask/mask_irq to avoid unnecessary
low level irq function calls.

This has two advantages:
    - Conditionals are faster than hardware access

    - Solves issues with the underlying refcounting of the pinctrl
      infrastructure

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: tfiga@chromium.org
Cc: briannorris@chromium.org
Cc: dianders@chromium.org
Link: http://lkml.kernel.org/r/1498476814-12563-2-git-send-email-jeffy.chen@rock-chips.com
2017-06-26 15:47:00 +02:00
d829b8fb24 genirq: Set irq masked state when initializing irq_desc
The irq default state is set to disabled when allocating irq desc, but the
masked state flag is not set. This is inconsistent vs. the state tracking
logic which is used to prevent unnecessary calls to hardware level irq chip
functions.

Set the masked state flag as well.

Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: tfiga@chromium.org
Cc: briannorris@chromium.org
Cc: dianders@chromium.org
Link: http://lkml.kernel.org/r/1498476814-12563-1-git-send-email-jeffy.chen@rock-chips.com
2017-06-26 14:05:41 +02:00
63a766a178 posix-stubs: Conditionally include COMPAT_SYS_NI defines
These apis only need to be defined if CONFIG_COMPAT is
enabled.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-25 21:58:46 -04:00