1. 27 Apr, 2017 1 commit
    • Ingo Molnar's avatar
      Merge branch 'rcu/urgent' of... · 96fd20cf
      Ingo Molnar authored
      Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
      
      Pull RCU updates from Paul E. McKenney:
      
       "This series greatly reduces the performance degradation of Tree SRCU
        on a CPU-hotplug stress test.  The effect was not subtle:  Mike Galbraith
        measured Classic SRCU at 55 seconds and Tree SRCU at more than 16 -minutes-
        for this test.  Mike collected ftrace data that showed that Classic SRCU
        was auto-expediting invocations of synchronize_srcu() that found SRCU
        completely idle.  This series therefore adds this auto-expedite capability
        to Tree SRCU, bringing the performance shortfall to less than ten seconds,
        which is a great improvement over the initial shortfall of 15 minutes."
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      96fd20cf
  2. 26 Apr, 2017 5 commits
    • Paul E. McKenney's avatar
      srcu: Specify auto-expedite holdoff time · 22607d66
      Paul E. McKenney authored
      On small systems, in the absence of readers, expedited SRCU grace
      periods can complete in less than a microsecond.  This means that an
      eight-CPU system can have all CPUs doing synchronize_srcu() in a tight
      loop and almost always expedite.  This might actually be desirable in
      some situations, but in general it is a good way to needlessly burn
      CPU cycles.  And in those situations where it is desirable, your friend
      is the function synchronize_srcu_expedited().
      
      For other situations, this commit adds a kernel parameter that specifies
      a holdoff between completing the last SRCU grace period and auto-expediting
      the next.  If the next grace period starts before the holdoff expires,
      auto-expediting is disabled.  The holdoff is 50 microseconds by default,
      and can be tuned to the desired number of nanoseconds.  A value of zero
      disables auto-expediting.
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: default avatarMike Galbraith <efault@gmx.de>
      22607d66
    • Paul E. McKenney's avatar
      srcu: Expedite first synchronize_srcu() when idle · 2da4b2a7
      Paul E. McKenney authored
      Classic SRCU in effect expedites the first synchronize_srcu() when SRCU
      is idle, and Mike Galbraith demonstrated that some use cases do in fact
      rely on this behavior.  In particular, Mike showed that Steven Rostedt's
      hotplug stress script takes 55 seconds with Classic SRCU and more than
      16 -minutes- when running Tree SRCU.  Assuming that each Tree SRCU's call
      to synchronize_srcu() takes four milliseconds, this implies that Steven's
      test invokes synchronize_srcu() in isolation, but more than once per
      200 microseconds.  Mike used ftrace to demonstrate that the time between
      successive calls to synchronize_srcu() ranged from 118 to 342 microseconds,
      with one outlier at 80 milliseconds.  This data clearly indicates that
      Tree SRCU needs to expedite the first invocation of synchronize_srcu()
      during an SRCU idle period.
      
      This commit therefor introduces a srcu_might_be_idle() function that
      probabilistically checks whether or not SRCU is idle.  This function is
      used by synchronize_rcu() as an additional criterion in deciding whether
      or not to expedite.
      
      (Hat trick to Peter Zijlstra for his earlier suggestion that this might
      in fact be a problem.  Which for all I know might have motivated Mike to
      look into it.)
      Reported-by: default avatarMike Galbraith <efault@gmx.de>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: default avatarMike Galbraith <efault@gmx.de>
      2da4b2a7
    • Paul E. McKenney's avatar
      srcu: Expedited grace periods with reduced memory contention · 1e9a038b
      Paul E. McKenney authored
      Commit f60d231a ("srcu: Crude control of expedited grace periods")
      introduced a per-srcu_struct atomic counter to track outstanding
      requests for grace periods.  This works, but represents a memory-contention
      bottleneck.  This commit therefore uses the srcu_node combining tree
      to remove this bottleneck.
      
      This commit adds new ->srcu_gp_seq_needed_exp fields to the
      srcu_data, srcu_node, and srcu_struct structures, which track the
      farthest-in-the-future grace period that must be expedited, which in
      turn requires that all nearer-term grace periods also be expedited.
      Requests for expediting start with the srcu_data structure, run up
      through the srcu_node tree, and end at the srcu_struct structure.
      Note that it may be necessary to expedite a grace period that just
      now started, and this is handled by a new srcu_funnel_exp_start()
      function, which is invoked when the grace period itself is already
      in its way, but when that grace period was not marked as expedited.
      
      A new srcu_get_delay() function returns zero if there is at least one
      expedited SRCU grace period in flight, or SRCU_INTERVAL otherwise.
      This function is used to calculate delays:  Normal grace periods
      are allowed to extend in order to cover more requests with a given
      grace-period computation, which decreases per-request overhead.
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: default avatarMike Galbraith <efault@gmx.de>
      1e9a038b
    • Paul E. McKenney's avatar
      srcu: Make rcutorture writer stalls print SRCU GP state · 7f6733c3
      Paul E. McKenney authored
      In the past, SRCU was simple enough that there was little point in
      making the rcutorture writer stall messages print the SRCU grace-period
      number state.  With the advent of Tree SRCU, this has changed.  This
      commit therefore makes Classic, Tiny, and Tree SRCU report this state
      to rcutorture as needed.
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: default avatarMike Galbraith <efault@gmx.de>
      7f6733c3
    • Paul E. McKenney's avatar
      srcu: Exact tracking of srcu_data structures containing callbacks · c7e88067
      Paul E. McKenney authored
      The current Tree SRCU implementation schedules a workqueue for every
      srcu_data covered by a given leaf srcu_node structure having callbacks,
      even if only one of those srcu_data structures actually contains
      callbacks.  This is clearly inefficient for workloads that don't feature
      callbacks everywhere all the time.  This commit therefore adds an array
      of masks that are used by the leaf srcu_node structures to track exactly
      which srcu_data structures contain callbacks.
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: default avatarMike Galbraith <efault@gmx.de>
      c7e88067
  3. 24 Apr, 2017 2 commits
    • Paul E. McKenney's avatar
      srcu: Make SRCU be built by default · d160a727
      Paul E. McKenney authored
      SRCU is optional, and included only if there is a "select SRCU" in effect.
      However, we now have Tiny SRCU, so this commit defaults CONFIG_SRCU=y.
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      d160a727
    • Paul E. McKenney's avatar
      srcu: Fix Kconfig botch when SRCU not selected · 677df9d4
      Paul E. McKenney authored
      If the CONFIG_SRCU option is not selected, for example, when building
      arch/tile allnoconfig, the following build errors appear:
      
      	kernel/rcu/tree.o: In function `srcu_online_cpu':
      	tree.c:(.text+0x4248): multiple definition of `srcu_online_cpu'
      	kernel/rcu/srcutree.o:srcutree.c:(.text+0x2120): first defined here
      	kernel/rcu/tree.o: In function `srcu_offline_cpu':
      	tree.c:(.text+0x4250): multiple definition of `srcu_offline_cpu'
      	kernel/rcu/srcutree.o:srcutree.c:(.text+0x2160): first defined here
      
      The corresponding .config file shows CONFIG_TREE_SRCU=y, but no sign
      of CONFIG_SRCU, which fatally confuses SRCU's #ifdefs, resulting in
      the above errors.  The reason this occurs is the folowing line in
      init/Kconfig's definition for TREE_SRCU:
      
      	default y if !TINY_RCU && !CLASSIC_SRCU
      
      If CONFIG_CLASSIC_SRCU=n, as it will be in for allnoconfig, and if
      CONFIG_SMP=y, then we will get CONFIG_TREE_SRCU=y but no CONFIG_SRCU,
      as seen in the .config file, and which will result in the above errors.
      This error did not show up during rcutorture testing because rcutorture
      forces CONFIG_SRCU=y, as it must to prevent build errors in rcutorture.c.
      
      This commit therefore conditions TREE_SRCU (and TINY_SRCU, while it is
      at it) with SRCU, like this:
      
      	default y if SRCU && !TINY_RCU && !CLASSIC_SRCU
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      Reported-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/20170423162205.GP3956@linux.vnet.ibm.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      677df9d4
  4. 23 Apr, 2017 1 commit
  5. 21 Apr, 2017 29 commits
    • Linus Torvalds's avatar
      Merge tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux · 94836ecf
      Linus Torvalds authored
      Pull nfsd bugfix from Bruce Fields:
       "Fix a 4.11 regression that triggers a BUG() on an attempt to use an
        unsupported NFSv4 compound op"
      
      * tag 'nfsd-4.11-2' of git://linux-nfs.org/~bfields/linux:
        nfsd: fix oops on unsupported operation
      94836ecf
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 057a650b
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Don't race in IPSEC dumps, from Yuejie Shi.
      
       2) Verify lengths properly in IPSEC reqeusts, from Herbert Xu.
      
       3) Fix out of bounds access in ipv6 segment routing code, from David
          Lebrun.
      
       4) Don't write into the header of cloned SKBs in smsc95xx driver, from
          James Hughes.
      
       5) Several other drivers have this bug too, fix them. From Eric
          Dumazet.
      
       6) Fix access to uninitialized data in TC action cookie code, from
          Wolfgang Bumiller.
      
       7) Fix double free in IPV6 segment routing, again from David Lebrun.
      
       8) Don't let userspace set the RTF_PCPU flag, oops. From David Ahern.
      
       9) Fix use after free in qrtr code, from Dan Carpenter.
      
      10) Don't double-destroy devices in ip6mr code, from Nikolay
          Aleksandrov.
      
      11) Don't pass out-of-range TX queue indices into drivers, from Tushar
          Dave.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
        netpoll: Check for skb->queue_mapping
        ip6mr: fix notification device destruction
        bpf, doc: update bpf maintainers entry
        net: qrtr: potential use after free in qrtr_sendmsg()
        bpf: Fix values type used in test_maps
        net: ipv6: RTF_PCPU should not be settable from userspace
        gso: Validate assumption of frag_list segementation
        kaweth: use skb_cow_head() to deal with cloned skbs
        ch9200: use skb_cow_head() to deal with cloned skbs
        lan78xx: use skb_cow_head() to deal with cloned skbs
        sr9700: use skb_cow_head() to deal with cloned skbs
        cx82310_eth: use skb_cow_head() to deal with cloned skbs
        smsc75xx: use skb_cow_head() to deal with cloned skbs
        ipv6: sr: fix double free of skb after handling invalid SRH
        MAINTAINERS: Add "B:" field for networking.
        net sched actions: allocate act cookie early
        qed: Fix issue in populating the PFC config paramters.
        qed: Fix possible system hang in the dcbnl-getdcbx() path.
        qed: Fix sending an invalid PFC error mask to MFW.
        qed: Fix possible error in populating max_tc field.
        ...
      057a650b
    • Tushar Dave's avatar
      netpoll: Check for skb->queue_mapping · c70b17b7
      Tushar Dave authored
      Reducing real_num_tx_queues needs to be in sync with skb queue_mapping
      otherwise skbs with queue_mapping greater than real_num_tx_queues
      can be sent to the underlying driver and can result in kernel panic.
      
      One such event is running netconsole and enabling VF on the same
      device. Or running netconsole and changing number of tx queues via
      ethtool on same device.
      
      e.g.
      Unable to handle kernel NULL pointer dereference
      tsk->{mm,active_mm}->context = 0000000000001525
      tsk->{mm,active_mm}->pgd = fff800130ff9a000
                    \|/ ____ \|/
                    "@'/ .. \`@"
                    /_| \__/ |_\
                       \__U_/
      kworker/48:1(475): Oops [#1]
      CPU: 48 PID: 475 Comm: kworker/48:1 Tainted: G           OE
      4.11.0-rc3-davem-net+ #7
      Workqueue: events queue_process
      task: fff80013113299c0 task.stack: fff800131132c000
      TSTATE: 0000004480e01600 TPC: 00000000103f9e3c TNPC: 00000000103f9e40 Y:
      00000000    Tainted: G           OE
      TPC: <ixgbe_xmit_frame_ring+0x7c/0x6c0 [ixgbe]>
      g0: 0000000000000000 g1: 0000000000003fff g2: 0000000000000000 g3:
      0000000000000001
      g4: fff80013113299c0 g5: fff8001fa6808000 g6: fff800131132c000 g7:
      00000000000000c0
      o0: fff8001fa760c460 o1: fff8001311329a50 o2: fff8001fa7607504 o3:
      0000000000000003
      o4: fff8001f96e63a40 o5: fff8001311d77ec0 sp: fff800131132f0e1 ret_pc:
      000000000049ed94
      RPC: <set_next_entity+0x34/0xb80>
      l0: 0000000000000000 l1: 0000000000000800 l2: 0000000000000000 l3:
      0000000000000000
      l4: 000b2aa30e34b10d l5: 0000000000000000 l6: 0000000000000000 l7:
      fff8001fa7605028
      i0: fff80013111a8a00 i1: fff80013155a0780 i2: 0000000000000000 i3:
      0000000000000000
      i4: 0000000000000000 i5: 0000000000100000 i6: fff800131132f1a1 i7:
      00000000103fa4b0
      I7: <ixgbe_xmit_frame+0x30/0xa0 [ixgbe]>
      Call Trace:
       [00000000103fa4b0] ixgbe_xmit_frame+0x30/0xa0 [ixgbe]
       [0000000000998c74] netpoll_start_xmit+0xf4/0x200
       [0000000000998e10] queue_process+0x90/0x160
       [0000000000485fa8] process_one_work+0x188/0x480
       [0000000000486410] worker_thread+0x170/0x4c0
       [000000000048c6b8] kthread+0xd8/0x120
       [0000000000406064] ret_from_fork+0x1c/0x2c
       [0000000000000000]           (null)
      Disabling lock debugging due to kernel taint
      Caller[00000000103fa4b0]: ixgbe_xmit_frame+0x30/0xa0 [ixgbe]
      Caller[0000000000998c74]: netpoll_start_xmit+0xf4/0x200
      Caller[0000000000998e10]: queue_process+0x90/0x160
      Caller[0000000000485fa8]: process_one_work+0x188/0x480
      Caller[0000000000486410]: worker_thread+0x170/0x4c0
      Caller[000000000048c6b8]: kthread+0xd8/0x120
      Caller[0000000000406064]: ret_from_fork+0x1c/0x2c
      Caller[0000000000000000]:           (null)
      Signed-off-by: default avatarTushar Dave <tushar.n.dave@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c70b17b7
    • Nikolay Aleksandrov's avatar
      ip6mr: fix notification device destruction · 723b929c
      Nikolay Aleksandrov authored
      Andrey Konovalov reported a BUG caused by the ip6mr code which is caused
      because we call unregister_netdevice_many for a device that is already
      being destroyed. In IPv4's ipmr that has been resolved by two commits
      long time ago by introducing the "notify" parameter to the delete
      function and avoiding the unregister when called from a notifier, so
      let's do the same for ip6mr.
      
      The trace from Andrey:
      ------------[ cut here ]------------
      kernel BUG at net/core/dev.c:6813!
      invalid opcode: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 1165 Comm: kworker/u4:3 Not tainted 4.11.0-rc7+ #251
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
      01/01/2011
      Workqueue: netns cleanup_net
      task: ffff880069208000 task.stack: ffff8800692d8000
      RIP: 0010:rollback_registered_many+0x348/0xeb0 net/core/dev.c:6813
      RSP: 0018:ffff8800692de7f0 EFLAGS: 00010297
      RAX: ffff880069208000 RBX: 0000000000000002 RCX: 0000000000000001
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88006af90569
      RBP: ffff8800692de9f0 R08: ffff8800692dec60 R09: 0000000000000000
      R10: 0000000000000006 R11: 0000000000000000 R12: ffff88006af90070
      R13: ffff8800692debf0 R14: dffffc0000000000 R15: ffff88006af90000
      FS:  0000000000000000(0000) GS:ffff88006cb00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe7e897d870 CR3: 00000000657e7000 CR4: 00000000000006e0
      Call Trace:
       unregister_netdevice_many.part.105+0x87/0x440 net/core/dev.c:7881
       unregister_netdevice_many+0xc8/0x120 net/core/dev.c:7880
       ip6mr_device_event+0x362/0x3f0 net/ipv6/ip6mr.c:1346
       notifier_call_chain+0x145/0x2f0 kernel/notifier.c:93
       __raw_notifier_call_chain kernel/notifier.c:394
       raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401
       call_netdevice_notifiers_info+0x51/0x90 net/core/dev.c:1647
       call_netdevice_notifiers net/core/dev.c:1663
       rollback_registered_many+0x919/0xeb0 net/core/dev.c:6841
       unregister_netdevice_many.part.105+0x87/0x440 net/core/dev.c:7881
       unregister_netdevice_many net/core/dev.c:7880
       default_device_exit_batch+0x4fa/0x640 net/core/dev.c:8333
       ops_exit_list.isra.4+0x100/0x150 net/core/net_namespace.c:144
       cleanup_net+0x5a8/0xb40 net/core/net_namespace.c:463
       process_one_work+0xc04/0x1c10 kernel/workqueue.c:2097
       worker_thread+0x223/0x19c0 kernel/workqueue.c:2231
       kthread+0x35e/0x430 kernel/kthread.c:231
       ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
      Code: 3c 32 00 0f 85 70 0b 00 00 48 b8 00 02 00 00 00 00 ad de 49 89
      47 78 e9 93 fe ff ff 49 8d 57 70 49 8d 5f 78 eb 9e e8 88 7a 14 fe <0f>
      0b 48 8b 9d 28 fe ff ff e8 7a 7a 14 fe 48 b8 00 00 00 00 00
      RIP: rollback_registered_many+0x348/0xeb0 RSP: ffff8800692de7f0
      ---[ end trace e0b29c57e9b3292c ]---
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Tested-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      723b929c
    • Daniel Borkmann's avatar
      bpf, doc: update bpf maintainers entry · cdb90499
      Daniel Borkmann authored
      Add various related files that have been missing under
      BPF entry covering essential parts of its infrastructure
      and also add myself as co-maintainer.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cdb90499
    • Dan Carpenter's avatar
      net: qrtr: potential use after free in qrtr_sendmsg() · 6f60f438
      Dan Carpenter authored
      If skb_pad() fails then it frees the skb so we should check for errors.
      
      Fixes: bdabad3e ("net: Add Qualcomm IPC router")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6f60f438
    • David Miller's avatar
      bpf: Fix values type used in test_maps · 89087c45
      David Miller authored
      Maps of per-cpu type have their value element size adjusted to 8 if it
      is specified smaller during various map operations.
      
      This makes test_maps as a 32-bit binary fail, in fact the kernel
      writes past the end of the value's array on the user's stack.
      
      To be quite honest, I think the kernel should reject creation of a
      per-cpu map that doesn't have a value size of at least 8 if that's
      what the kernel is going to silently adjust to later.
      
      If the user passed something smaller, it is a sizeof() calcualtion
      based upon the type they will actually use (just like in this testcase
      code) in later calls to the map operations.
      
      Fixes: df570f57 ("samples/bpf: unit test for BPF_MAP_TYPE_PERCPU_ARRAY")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      89087c45
    • David Ahern's avatar
      net: ipv6: RTF_PCPU should not be settable from userspace · 557c44be
      David Ahern authored
      Andrey reported a fault in the IPv6 route code:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 1 PID: 4035 Comm: a.out Not tainted 4.11.0-rc7+ #250
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff880069809600 task.stack: ffff880062dc8000
      RIP: 0010:ip6_rt_cache_alloc+0xa6/0x560 net/ipv6/route.c:975
      RSP: 0018:ffff880062dced30 EFLAGS: 00010206
      RAX: dffffc0000000000 RBX: ffff8800670561c0 RCX: 0000000000000006
      RDX: 0000000000000003 RSI: ffff880062dcfb28 RDI: 0000000000000018
      RBP: ffff880062dced68 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff880062dcfb28 R14: dffffc0000000000 R15: 0000000000000000
      FS:  00007feebe37e7c0(0000) GS:ffff88006cb00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000205a0fe4 CR3: 000000006b5c9000 CR4: 00000000000006e0
      Call Trace:
       ip6_pol_route+0x1512/0x1f20 net/ipv6/route.c:1128
       ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
      ...
      
      Andrey's syzkaller program passes rtmsg.rtmsg_flags with the RTF_PCPU bit
      set. Flags passed to the kernel are blindly copied to the allocated
      rt6_info by ip6_route_info_create making a newly inserted route appear
      as though it is a per-cpu route. ip6_rt_cache_alloc sees the flag set
      and expects rt->dst.from to be set - which it is not since it is not
      really a per-cpu copy. The subsequent call to __ip6_dst_alloc then
      generates the fault.
      
      Fix by checking for the flag and failing with EINVAL.
      
      Fixes: d52d3997 ("ipv6: Create percpu rt6_info")
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Tested-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      557c44be
    • Ilan Tayari's avatar
      gso: Validate assumption of frag_list segementation · 43170c4e
      Ilan Tayari authored
      Commit 07b26c94 ("gso: Support partial splitting at the frag_list
      pointer") assumes that all SKBs in a frag_list (except maybe the last
      one) contain the same amount of GSO payload.
      
      This assumption is not always correct, resulting in the following
      warning message in the log:
          skb_segment: too many frags
      
      For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
      one frag, and some with 2 frags.
      After GRO, the frag_list SKBs end up having different amounts of payload.
      If this frag_list SKB is then forwarded, the aforementioned assumption
      is violated.
      
      Validate the assumption, and fall back to software GSO if it not true.
      
      Change-Id: Ia03983f4a47b6534dd987d7a2aad96d54d46d212
      Fixes: 07b26c94 ("gso: Support partial splitting at the frag_list pointer")
      Signed-off-by: default avatarIlan Tayari <ilant@mellanox.com>
      Signed-off-by: default avatarIlya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      43170c4e
    • David S. Miller's avatar
      Merge branch 'skb_cow_head' · 918b7024
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net: use skb_cow_head() to deal with cloned skbs
      
      James Hughes found an issue with smsc95xx driver. Same problematic code
      is found in other drivers.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      918b7024
    • Eric Dumazet's avatar
      kaweth: use skb_cow_head() to deal with cloned skbs · 39fba783
      Eric Dumazet authored
      We can use skb_cow_head() to properly deal with clones,
      especially the ones coming from TCP stack that allow their head being
      modified. This avoids a copy.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: James Hughes <james.hughes@raspberrypi.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      39fba783
    • Eric Dumazet's avatar
      ch9200: use skb_cow_head() to deal with cloned skbs · 6bc6895b
      Eric Dumazet authored
      We need to ensure there is enough headroom to push extra header,
      but we also need to check if we are allowed to change headers.
      
      skb_cow_head() is the proper helper to deal with this.
      
      Fixes: 4a476bd6 ("usbnet: New driver for QinHeng CH9200 devices")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: James Hughes <james.hughes@raspberrypi.org>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6bc6895b
    • Eric Dumazet's avatar
      lan78xx: use skb_cow_head() to deal with cloned skbs · d4ca7359
      Eric Dumazet authored
      We need to ensure there is enough headroom to push extra header,
      but we also need to check if we are allowed to change headers.
      
      skb_cow_head() is the proper helper to deal with this.
      
      Fixes: 55d7de9d ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: James Hughes <james.hughes@raspberrypi.org>
      Cc: Woojung Huh <woojung.huh@microchip.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d4ca7359
    • Eric Dumazet's avatar
      sr9700: use skb_cow_head() to deal with cloned skbs · d532c108
      Eric Dumazet authored
      We need to ensure there is enough headroom to push extra header,
      but we also need to check if we are allowed to change headers.
      
      skb_cow_head() is the proper helper to deal with this.
      
      Fixes: c9b37458 ("USB2NET : SR9700 : One chip USB 1.1 USB2NET SR9700Device Driver Support")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: James Hughes <james.hughes@raspberrypi.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d532c108
    • Eric Dumazet's avatar
      cx82310_eth: use skb_cow_head() to deal with cloned skbs · a9e840a2
      Eric Dumazet authored
      We need to ensure there is enough headroom to push extra header,
      but we also need to check if we are allowed to change headers.
      
      skb_cow_head() is the proper helper to deal with this.
      
      Fixes: cc28a20e ("introduce cx82310_eth: Conexant CX82310-based ADSL router USB ethernet driver")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: James Hughes <james.hughes@raspberrypi.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a9e840a2
    • Eric Dumazet's avatar
      smsc75xx: use skb_cow_head() to deal with cloned skbs · b7c6d267
      Eric Dumazet authored
      We need to ensure there is enough headroom to push extra header,
      but we also need to check if we are allowed to change headers.
      
      skb_cow_head() is the proper helper to deal with this.
      
      Fixes: d0cad871 ("smsc75xx: SMSC LAN75xx USB gigabit ethernet adapter driver")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: James Hughes <james.hughes@raspberrypi.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b7c6d267
    • David Lebrun's avatar
      ipv6: sr: fix double free of skb after handling invalid SRH · 95b9b88d
      David Lebrun authored
      The icmpv6_param_prob() function already does a kfree_skb(),
      this patch removes the duplicate one.
      
      Fixes: 1ababeba ("ipv6: implement dataplane support for rthdr type 4 (Segment Routing Header)")
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarDavid Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      95b9b88d
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.11-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 92b4fc75
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "Just two fixes.
      
        The first fixes kprobing a stdu, and is marked for stable as it's been
        broken for ~ever. In hindsight this could have gone in next.
      
        The other is a fix for a change we merged this cycle, where if we take
        a certain exception when the kernel is running relocated (currently
        only used for kdump), we checkstop the box.
      
        Thanks to Ravi Bangoria"
      
      * tag 'powerpc-4.11-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/64: Fix HMI exception on LE with CONFIG_RELOCATABLE=y
        powerpc/kprobe: Fix oops when kprobed on 'stdu' instruction
      92b4fc75
    • Linus Torvalds's avatar
      Merge tag 'pci-v4.11-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · fe7ba289
      Linus Torvalds authored
      Pull PCI fix from Bjorn Helgaas:
       "Sorry this is so late. It's been in -next for over a week, but I
        forgot to send it on until now.
      
        A single fix to the DT binding of the HiSilicon PCIe host support"
      
      * tag 'pci-v4.11-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        PCI: hisi: Fix DT binding (hisi-pcie-almost-ecam)
      fe7ba289
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.dk/linux-block · a9aa1908
      Linus Torvalds authored
      Pull block layer fixes from Jens Axboe:
       "A couple of last minute fixes for regressions in this cycle. More
        specifically:
      
         - Two patches from Andy, adjusting the NVMe APST quirks to avoid some
           issues specific to one Toshiba drive, and some variant of Samsung
           on two specific Dell laptops.
      
         - A fix for mtip32xx, turning off mq scheduling on that device. We
           have a real fix for this, but it's too late in the cycle.
           Thankfully we already have a NO_SCHED flag we can apply here. A
           prep patch for this is ensuring that we honor the NO_SCHED flag
           when attempting to online switch schedulers, previsouly we only did
           so for drive load time. From Ming.
      
         - Fixing an oops in blk-mq polling with scheduling attached. This one
           is easily reproducible, it would be a shame to release 4.11 with
           that issue. From me.
      
        I'd prefer not having to send in patches at this point in time, but
        the above are all things that have regressed in this cycle and the
        fixes are relatively straight forward"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        blk-mq: fix potential oops with polling and blk-mq scheduler
        nvme: Quirk APST off on "THNSF5256GPUK TOSHIBA"
        nvme: Adjust the Samsung APST quirk
        mtip32xx: pass BLK_MQ_F_NO_SCHED
        block: respect BLK_MQ_F_NO_SCHED
      a9aa1908
    • Linus Torvalds's avatar
      Merge tag 'acpi-4.11-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 4664e322
      Linus Torvalds authored
      Pull ACPI build fix from Rafael Wysocki:
       "This avoids a false-positive build warning from the compiler.
      
        Specifics:
      
         - Avoid a false-positive warning regarding a variable that may not be
           initialized that started to trigger after a previous general build
           fix (Arnd Bergmann)"
      
      * tag 'acpi-4.11-final' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI / power: Avoid maybe-uninitialized warning
      4664e322
    • Linus Torvalds's avatar
      Merge tag 'mmc-v4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc · 11b211ed
      Linus Torvalds authored
      Pull MMC fixes from Ulf Hansson:
       "MMC core:
      
         - kmalloc sdio scratch buffer to make it DMA-friendly
      
        MMC host:
      
         - dw_mmc: Fix behaviour for SDIO IRQs when runtime PM is used
      
         - sdhci-esdhc-imx: Correct pad I/O drive strength for UHS-DDR50
           cards"
      
      * tag 'mmc-v4.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
        mmc: sdhci-esdhc-imx: increase the pad I/O drive strength for DDR50 card
        mmc: dw_mmc: Don't allow Runtime PM for SDIO cards
        mmc: sdio: fix alignment issue in struct sdio_func
      11b211ed
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 4d4dfc1c
      Linus Torvalds authored
      Pull input fixlet from Dmitry Torokhov:
       "An update to Elan PS/2 driver to allow working on yet another
        Lifebook"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: elantech - add Fujitsu Lifebook E547 to force crc_enabled
      4d4dfc1c
    • David S. Miller's avatar
      MAINTAINERS: Add "B:" field for networking. · b0522e13
      David S. Miller authored
      We want people to report bugs to the netdev list.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b0522e13
    • Paul E. McKenney's avatar
      Merge branches 'doc.2017.04.12a', 'fixes.2017.04.19a' and 'srcu.2017.04.21a' into HEAD · f2094107
      Paul E. McKenney authored
      doc.2017.04.12a: Documentation updates
      fixes.2017.04.19a: Miscellaneous fixes
      srcu.2017.04.21a: Parallelize SRCU callback handling
      f2094107
    • Paul E. McKenney's avatar
      rcu: Make non-preemptive schedule be Tasks RCU quiescent state · bcbfdd01
      Paul E. McKenney authored
      Currently, a call to schedule() acts as a Tasks RCU quiescent state
      only if a context switch actually takes place.  However, just the
      call to schedule() guarantees that the calling task has moved off of
      whatever tracing trampoline that it might have been one previously.
      This commit therefore plumbs schedule()'s "preempt" parameter into
      rcu_note_context_switch(), which then records the Tasks RCU quiescent
      state, but only if this call to schedule() was -not- due to a preemption.
      
      To avoid adding overhead to the common-case context-switch path,
      this commit hides the rcu_note_context_switch() check under an existing
      non-common-case check.
      Suggested-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      bcbfdd01
    • Paul E. McKenney's avatar
      srcu: Expedite srcu_schedule_cbs_snp() callback invocation · 0497b489
      Paul E. McKenney authored
      Although Tree SRCU does reduce delays when there is at least one
      synchronize_srcu_expedited() invocation pending, srcu_schedule_cbs_snp()
      still waits for SRCU_INTERVAL before invoking callbacks.  Since
      synchronize_srcu_expedited() now posts a callback and waits for
      that callback to do a wakeup, this destroys the expedited nature of
      synchronize_srcu_expedited().  This destruction became apparent to
      Marc Zyngier in the guise of a guest-OS bootup slowdown from five
      seconds to no fewer than forty seconds.
      
      This commit therefore invokes callbacks immediately at the end of the
      grace period when there is at least one synchronize_srcu_expedited()
      invocation pending.  This brought Marc's guest-OS bootup times back
      into the realm of reason.
      Reported-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      0497b489
    • Paul E. McKenney's avatar
      srcu: Parallelize callback handling · da915ad5
      Paul E. McKenney authored
      Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2],
      however, there are workloads that could result in a high volume of
      concurrent invocations of call_srcu(), which with current SRCU would
      result in excessive lock contention on the srcu_struct structure's
      ->queue_lock, which protects SRCU's callback lists.  This commit therefore
      moves SRCU to per-CPU callback lists, thus greatly reducing contention.
      
      Because a given SRCU instance no longer has a single centralized callback
      list, starting grace periods and invoking callbacks are both more complex
      than in the single-list Classic SRCU implementation.  Starting grace
      periods and handling callbacks are now handled using an srcu_node tree
      that is in some ways similar to the rcu_node trees used by RCU-bh,
      RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is
      controlled by exactly the same Kconfig options and boot parameters that
      control the shape of the rcu_node tree).
      
      In addition, the old per-CPU srcu_array structure is now named srcu_data
      and contains an rcu_segcblist structure named ->srcu_cblist for its
      callbacks (and a spinlock to protect this).  The srcu_struct gets
      an srcu_gp_seq that is used to associate callback segments with the
      corresponding completion-time grace-period number.  These completion-time
      grace-period numbers are propagated up the srcu_node tree so that the
      grace-period workqueue handler can determine whether additional grace
      periods are needed on the one hand and where to look for callbacks that
      are ready to be invoked.
      
      The srcu_barrier() function must now wait on all instances of the per-CPU
      ->srcu_cblist.  Because each ->srcu_cblist is protected by ->lock,
      srcu_barrier() can remotely add the needed callbacks.  In theory,
      it could also remotely start grace periods, but in practice doing so
      is complex and racy.  And interestingly enough, it is never necessary
      for srcu_barrier() to start a grace period because srcu_barrier() only
      enqueues a callback when a callback is already present--and it turns out
      that a grace period has to have already been started for this pre-existing
      callback.  Furthermore, it is only the callback that srcu_barrier()
      needs to wait on, not any particular grace period.  Therefore, a new
      rcu_segcblist_entrain() function enqueues the srcu_barrier() function's
      callback into the same segment occupied by the last pre-existing callback
      in the list.  The special case where all the pre-existing callbacks are
      on a different list (because they are in the process of being invoked)
      is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL
      segment, relying on the done-callbacks check that takes place after all
      callbacks are inovked.
      
      Note that the readers use the same algorithm as before.  Note that there
      is a separate srcu_idx that tells the readers what counter to increment.
      This unfortunately cannot be combined with srcu_gp_seq because they
      need to be incremented at different times.
      
      This commit introduces some ugly #ifdefs in rcutorture.  These will go
      away when I feel good enough about Tree SRCU to ditch Classic SRCU.
      
      Some crude performance comparisons, courtesy of a quickly hacked rcuperf
      asynchronous-grace-period capability:
      
      			Callback Queuing Overhead
      			-------------------------
      	# CPUS		Classic SRCU	Tree SRCU
      	------          ------------    ---------
      	     2              0.349 us     0.342 us
      	    16             31.66  us     0.4   us
      	    41             ---------     0.417 us
      
      The times are the 90th percentiles, a statistic that was chosen to reject
      the overheads of the occasional srcu_barrier() call needed to avoid OOMing
      the test machine.  The rcuperf test hangs when running Classic SRCU at 41
      CPUs, hence the line of dashes.  Despite the hacks to both the rcuperf code
      and that statistics, this is a convincing demonstration of Tree SRCU's
      performance and scalability advantages.
      
      [1] https://lwn.net/Articles/309030/
      [2] https://patchwork.kernel.org/patch/5108281/Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]
      da915ad5
    • Paul E. McKenney's avatar
      kvm: Move srcu_struct fields to end of struct kvm · 6ade8694
      Paul E. McKenney authored
      Parallelizing SRCU callback handling will increase the size of
      srcu_struct, which will move the kvm structure's kvm_arch field out
      of reach of powerpc's current assembly code, which will result in the
      following sort of build error:
      
      arch/powerpc/kvm/book3s_hv_rmhandlers.S:617: Error: operand out of range (0x000000000000b328 is not between 0xffffffffffff8000 and 0x0000000000007fff)
      
      This commit moves the srcu_struct fields in the kvm structure to follow
      the kvm_arch field, which will allow powerpc's assembly code to continue
      to be able to reach the kvm_arch field.
      Reported-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Reported-by: default avatarMichael Ellerman <michaele@au1.ibm.com>
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      Suggested-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [ paulmck: Moved this commit to precede SRCU callback parallelization,
        and reworded the commit log into future tense, all in the name of
        bisectability. ]
      6ade8694
  6. 20 Apr, 2017 2 commits
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · c154165e
      Linus Torvalds authored
      Merge two mm fixes from Andrew Morton.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: prevent NR_ISOLATE_* stats from going negative
        Revert "mm, page_alloc: only use per-cpu allocator for irq-safe requests"
      c154165e
    • Rabin Vincent's avatar
      mm: prevent NR_ISOLATE_* stats from going negative · fc280fe8
      Rabin Vincent authored
      Commit 6afcf8ef ("mm, compaction: fix NR_ISOLATED_* stats for pfn
      based migration") moved the dec_node_page_state() call (along with the
      page_is_file_cache() call) to after putback_lru_page().
      
      But page_is_file_cache() can change after putback_lru_page() is called,
      so it should be called before putback_lru_page(), as it was before that
      patch, to prevent NR_ISOLATE_* stats from going negative.
      
      Without this fix, non-CONFIG_SMP kernels end up hanging in the
      while(too_many_isolated()) { congestion_wait() } loop in
      shrink_active_list() due to the negative stats.
      
       Mem-Info:
        active_anon:32567 inactive_anon:121 isolated_anon:1
        active_file:6066 inactive_file:6639 isolated_file:4294967295
                                                          ^^^^^^^^^^
        unevictable:0 dirty:115 writeback:0 unstable:0
        slab_reclaimable:2086 slab_unreclaimable:3167
        mapped:3398 shmem:18366 pagetables:1145 bounce:0
        free:1798 free_pcp:13 free_cma:0
      
      Fixes: 6afcf8ef ("mm, compaction: fix NR_ISOLATED_* stats for pfn based migration")
      Link: http://lkml.kernel.org/r/1492683865-27549-1-git-send-email-rabin.vincent@axis.comSigned-off-by: default avatarRabin Vincent <rabinv@axis.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Ming Ling <ming.ling@spreadtrum.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc280fe8