1. 20 Nov, 2013 2 commits
    • Merge tag 'md/3.13' of git://neil.brown.name/md · 6d6e352c
      Linus Torvalds authored
      Pull md update from Neil Brown:
       "Mostly optimisations and obscure bug fixes.
         - raid5 gets less lock contention
         - raid1 gets less contention between normal-io and resync-io during
           resync"
      
      * tag 'md/3.13' of git://neil.brown.name/md:
        md/raid5: Use conf->device_lock protect changing of multi-thread resources.
        md/raid5: Before freeing old multi-thread worker, it should flush them.
        md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
        UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
        raid1: Rewrite the implementation of iobarrier.
        raid1: Add some macros to make code clearly.
        raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
        raid1: Add a field array_frozen to indicate whether raid in freeze state.
        md: Convert use of typedef ctl_table to struct ctl_table
        md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
        md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
        md: fix some places where mddev_lock return value is not checked.
        raid5: Retry R5_ReadNoMerge flag when hit a read error.
        raid5: relieve lock contention in get_active_stripe()
        raid5: relieve lock contention in get_active_stripe()
        wait: add wait_event_cmd()
        md/raid5.c: add proper locking to error path of raid5_start_reshape.
        md: fix calculation of stacking limits on level change.
        raid5: Use slow_path to release stripe when mddev->thread is null
    • aacraid: prevent invalid pointer dereference · b4789b8e
      Mahesh Rajashekhara authored
      It appears that the driver runs into a problem here if fibsize is too
      small, because we allocate user_srbcmd with only fibsize bytes but later
      access it up to user_srbcmd->sg.count to copy it over to srbcmd.
      
      It is not correct to test (fibsize < sizeof(*user_srbcmd)) because this
      structure already includes one sg element and this is not needed for
      commands without data.  So, we would recommend to add the following
      (instead of test for fibsize == 0).
      Signed-off-by: Mahesh Rajashekhara <Mahesh.Rajashekhara@pmcs.com>
      Reported-by: Nico Golde <nico@ngolde.de>
      Reported-by: Fabian Yamaguchi <fabs@goesec.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
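The check that the aacraid message recommends is not reproduced in this log. As a rough illustration only, the idea of validating a user-supplied size against a structure that already embeds one scatter-gather element might be sketched as follows. The struct layouts and the names `sg_entry`, `sg_map`, `user_cmd` and `fib_size_ok` are hypothetical stand-ins, not the driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's user-facing structures. */
struct sg_entry {
    uint32_t addr;
    uint32_t count;
};

struct sg_map {
    uint32_t count;         /* number of valid entries in sg[] */
    struct sg_entry sg[1];  /* one element is already embedded */
};

struct user_cmd {
    uint32_t function;
    uint32_t flags;
    struct sg_map sg;
};

/*
 * Accept a buffer of fibsize bytes only if it covers the fixed part of
 * the command plus every sg element the command claims to carry.
 * A command without data needs no sg element, so the minimum size is
 * the struct size minus the one embedded element.
 */
static bool fib_size_ok(const struct user_cmd *cmd, size_t fibsize,
                        size_t max_size)
{
    const size_t min_size = sizeof(struct user_cmd) - sizeof(struct sg_entry);

    assert(cmd != NULL);
    if (fibsize < min_size || fibsize > max_size)
        return false;
    /* sg.count lies within the first min_size bytes, so it is safe to read. */
    return fibsize >= min_size + (size_t)cmd->sg.count * sizeof(struct sg_entry);
}
```

Testing against `sizeof(struct user_cmd)` alone would reject a valid command that carries no data, which is the objection raised in the message.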
  2. 19 Nov, 2013 38 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 1ee2dcc2
      Linus Torvalds authored
      Pull networking fixes from David Miller:
       "Mostly these are fixes for fallout due to merge window changes, as
        well as cures for problems that have been with us for a much longer
        period of time"
      
       1) Johannes Berg noticed two major deficiencies in our genetlink
          registration.  Some genetlink protocols were passing in constant
          counts for their ops array rather than something like
          ARRAY_SIZE(ops) or similar.  Also, some genetlink protocols were
          using fixed IDs for their multicast groups.
      
          We have to retain these fixed IDs to keep existing userland tools
          working, but reserve them so that other multicast groups used by
          other protocols can not possibly conflict.
      
          In dealing with these two problems, we actually now use less state
          management for genetlink operations and multicast groups.
      
       2) When configuring interface hardware timestamping, fix several
          drivers that simply do not validate that the hwtstamp_config value
          is one the driver actually supports.  From Ben Hutchings.
      
       3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.
      
       4) In dev_forward_skb(), set the skb->protocol in the right order
          relative to skb_scrub_packet().  From Alexei Starovoitov.
      
       5) Bridge erroneously fails to use the proper wrapper functions to make
          calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid.  Fix from Toshiaki
          Makita.
      
       6) When detaching a bridge port, make sure to flush all VLAN IDs to
          prevent them from leaking, also from Toshiaki Makita.
      
       7) Put in a compromise for TCP Small Queues so that deep queued devices
          that delay TX reclaim non-trivially don't have such a performance
          decrease.  One particularly problematic area is 802.11 AMPDU in
          wireless.  From Eric Dumazet.
      
       8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
          here.  Fix from Eric Dumazet, reported by Dave Jones.
      
       9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.
      
      10) When computing mergeable buffer sizes, virtio-net fails to take the
          virtio-net header into account.  From Michael Dalton.
      
      11) Fix seqlock deadlock in ip4_datagram_connect() wrt.  statistic
          bumping, this one has been with us for a while.  From Eric Dumazet.
      
      12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
          Hugne.
      
      13) 6lowpan bit used for traffic classification was wrong, from Jukka
          Rissanen.
      
      14) macvlan has the same issue as normal vlans did wrt.  propagating LRO
          disabling down to the real device, fix it the same way.  From Michal
          Kubecek.
      
      15) CPSW driver needs to soft reset all slaves during suspend, from
          Daniel Mack.
      
      16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.
      
      17) The xen-netfront RX buffer refill timer isn't properly scheduled on
          partial RX allocation success, from Ma JieYue.
      
      18) When ipv6 ping protocol support was added, the AF_INET6 protocol
          initialization cleanup path on failure was borked a little.  Fix
          from Vlad Yasevich.
      
      19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
          blocks we can do the wrong thing with the msg_name we write back to
          userspace.  From Hannes Frederic Sowa.  There is another fix in the
          works from Hannes which will prevent future problems of this nature.
      
      20) Fix route leak in VTI tunnel transmit, from Fan Du.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
        genetlink: make multicast groups const, prevent abuse
        genetlink: pass family to functions using groups
        genetlink: add and use genl_set_err()
        genetlink: remove family pointer from genl_multicast_group
        genetlink: remove genl_unregister_mc_group()
        hsr: don't call genl_unregister_mc_group()
        quota/genetlink: use proper genetlink multicast APIs
        drop_monitor/genetlink: use proper genetlink multicast APIs
        genetlink: only pass array to genl_register_family_with_ops()
        tcp: don't update snd_nxt, when a socket is switched from repair mode
        atm: idt77252: fix dev refcnt leak
        xfrm: Release dst if this dst is improper for vti tunnel
        netlink: fix documentation typo in netlink_set_err()
        be2net: Delete secondary unicast MAC addresses during be_close
        be2net: Fix unconditional enabling of Rx interface options
        net, virtio_net: replace the magic value
        ping: prevent NULL pointer dereference on write to msg_name
        bnx2x: Prevent "timeout waiting for state X"
        bnx2x: prevent CFC attention
        bnx2x: Prevent panic during DMAE timeout
        ...
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · 4457e6f6
      Linus Torvalds authored
      Pull sparc fixes from David Miller:
       "Two merge window fallout build fixes"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc64: merge fix
        sparc64: fix build regession
    • Merge tag 'please-pull-fixia64' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux · e87e7be9
      Linus Torvalds authored
      Pull ia64 fix from Tony Luck:
       "Unbreak ia64 build by avoiding circular dependency"
      
      * tag 'please-pull-fixia64' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
        kernel/bounds: avoid circular dependencies in generated headers
    • kernel/bounds: avoid circular dependencies in generated headers · 24b9fdc5
      Kirill A. Shutemov authored
      <linux/spinlock.h> has heavy dependencies on other header files.
      It triggers circular dependencies in generated headers on IA64, at
      least:
      
        CC      kernel/bounds.s
      In file included from /home/space/kas/git/public/linux/arch/ia64/include/asm/thread_info.h:9:0,
                       from include/linux/thread_info.h:54,
                       from include/asm-generic/preempt.h:4,
                       from arch/ia64/include/generated/asm/preempt.h:1,
                       from include/linux/preempt.h:18,
                       from include/linux/spinlock.h:50,
                       from kernel/bounds.c:14:
      /home/space/kas/git/public/linux/arch/ia64/include/asm/asm-offsets.h:1:35: fatal error: generated/asm-offsets.h: No such file or directory
      compilation terminated.
      
      Let's replace <linux/spinlock.h> with <linux/spinlock_types.h>, which
      is enough to find out the size of spinlock_t.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-and-Tested-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
    • Merge branch 'genetlink_mcast' · 091e0662
      David S. Miller authored
      Johannes Berg says:
      
      ====================
      genetlink: clean up multicast group APIs
      
      The generic netlink multicast group registration doesn't have to
      be dynamic, and can thus be simplified just like I did with the
      ops. This removes some complexity in registration code.
      
      Additionally, two users of generic netlink already use multicast
      groups in a wrong way, add workarounds for those two to keep the
      userspace API working, but at the same time make them not clash
      with other users of multicast groups as might happen now.
      
      While making it all a bit easier, also prevent such abuse by adding
      checks to the APIs so each family can only use the groups it owns.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • genetlink: make multicast groups const, prevent abuse · 2a94fe48
      Johannes Berg authored
      Register generic netlink multicast groups as an array with
      the family and give them contiguous group IDs. Then instead
      of passing the global group ID to the various functions that
      send messages, pass the ID relative to the family - for most
      families that's just 0 because they only have one group.
      
      This avoids the list_head and ID in each group, adding a new
      field for the mcast group ID offset to the family.
      
      At the same time, this allows us to prevent abusing groups
      again like the quota and dropmon code did, since we can now
      check that a family only uses a group it owns.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
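The family-relative group IDs described in the genetlink commit above can be illustrated with a small model. This is a minimal sketch under stated assumptions: `struct family`, `struct mcast_group`, the `group_offset` field and `family_group_id()` are hypothetical names for illustration, not the kernel's `struct genl_family` or its API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a family and its multicast groups. */
struct mcast_group {
    const char *name;
};

struct family {
    const char *name;
    const struct mcast_group *groups; /* const array owned by the family */
    unsigned int n_groups;
    unsigned int group_offset;        /* first global ID, set at registration */
};

/*
 * Translate a family-relative group index into a global group ID.
 * Returns false when the family does not own the requested group,
 * which is the check that prevents one family from using another's ID.
 */
static bool family_group_id(const struct family *f, unsigned int rel,
                            unsigned int *global)
{
    assert(f != NULL && global != NULL);
    if (rel >= f->n_groups)
        return false;
    *global = f->group_offset + rel;
    return true;
}
```

A family with a single group always passes 0 as the relative index, and the offset stored in the family takes the place of a per-group ID and list linkage.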
    • genetlink: pass family to functions using groups · 68eb5503
      Johannes Berg authored
      This doesn't really change anything, but prepares for the
      next patch that will change the APIs to pass the group ID
      within the family, rather than the global group ID.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • genetlink: add and use genl_set_err() · 62b68e99
      Johannes Berg authored
      Add a static inline to generic netlink to wrap netlink_set_err()
      to make it easier to use here - use it in openvswitch (the only
      generic netlink user of netlink_set_err()).
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • genetlink: remove family pointer from genl_multicast_group · c2ebb908
      Johannes Berg authored
      There's no reason to have the family pointer there since it
      can just be passed internally where needed, so remove it.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • genetlink: remove genl_unregister_mc_group() · 06fb555a
      Johannes Berg authored
      There are no users of this API remaining, and we'll soon
      change group registration to be static (like ops are now).
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • hsr: don't call genl_unregister_mc_group() · 03ed3827
      Johannes Berg authored
      There's no need to unregister the multicast group if the
      generic netlink family is registered immediately after.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • quota/genetlink: use proper genetlink multicast APIs · 2ecf7536
      Johannes Berg authored
      The quota code is abusing the genetlink API and is using
      its family ID as the multicast group ID, which is invalid
      and may belong to somebody else (and likely will.)
      
      Make the quota code use the correct API, but since this
      is already used as-is by userspace, reserve a family ID
      for this code and also reserve that group ID to not break
      userspace assumptions.
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drop_monitor/genetlink: use proper genetlink multicast APIs · e5dcecba
      Johannes Berg authored
      The drop monitor code is abusing the genetlink API and is
      statically using the generic netlink multicast group 1, even
      if that group belongs to somebody else (which it invariably
      will, since it's not reserved.)
      
      Make the drop monitor code use the proper APIs to reserve a
      group ID, but also reserve the group id 1 in generic netlink
      code to preserve the userspace API. Since drop monitor can
      be a module, don't clear the bit for it on unregistration.
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • genetlink: only pass array to genl_register_family_with_ops() · c53ed742
      Johannes Berg authored
      As suggested by David Miller, make genl_register_family_with_ops()
      a macro and pass only the array, evaluating ARRAY_SIZE() in the
      macro; this is a little safer.

      The openvswitch code has some indirection, so assign ops/n_ops
      directly in that code. This might ultimately just assign the pointers
      in the family initializations, saving the struct genl_family_and_ops
      and code (once mcast groups are handled differently.)
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
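The macro approach in the commit above, where the caller passes only the array and the element count is derived from it, might look roughly like this. It is a sketch, not the kernel code: `struct op`, `struct family_desc`, `register_with_ops_n()` and `register_with_ops()` are hypothetical names, and `ARRAY_SIZE` is defined locally in its common form.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical operation and family descriptors. */
struct op {
    int cmd;
};

struct family_desc {
    const struct op *ops;
    size_t n_ops;
};

/* Function form: the caller supplies the count and can get it wrong. */
static int register_with_ops_n(struct family_desc *f, const struct op *ops,
                               size_t n_ops)
{
    assert(f != NULL);
    f->ops = ops;
    f->n_ops = n_ops;
    return 0;
}

/*
 * Macro form: only the array is passed and the count is derived from it.
 * The argument must be a real array; a pointer would yield a wrong count.
 */
#define register_with_ops(f, ops) \
    register_with_ops_n((f), (ops), ARRAY_SIZE(ops))
```

With this shape a hard-coded count can no longer drift out of sync with the array. A caller that only holds a pointer and a count cannot use the macro and has to assign the two fields itself, which matches the openvswitch note in the message.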
    • tcp: don't update snd_nxt, when a socket is switched from repair mode · dbde4979
      Andrey Vagin authored
      snd_nxt must be updated synchronously with sk_send_head.  Otherwise
      tp->packets_out may be updated incorrectly, which may cause a kernel panic.
      
      Here is a kernel panic from my host.
      [  103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
      [  103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150
      ...
      [  146.301158] Call Trace:
      [  146.301158]  [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0
      
      Before this panic a tcp socket was restored. This socket had sent and
      unsent data in the write queue. Sent data was restored in repair mode,
      then the socket was switched from repair mode and unsent data was
      restored. After that the socket was switched back into repair mode.
      
      At that moment we had a socket whose write queue looked like this:
      snd_una    snd_nxt   write_seq
         |_________|________|
                   |
      	  sk_send_head
      
      After the second switch from repair mode the state of the socket had
      changed:
      
      snd_una          snd_nxt, write_seq
         |_________ ________|
                   |
      	  sk_send_head
      
      This state is inconsistent, because snd_nxt and sk_send_head are not
      synchronized.
      
      Below is a call trace showing how packets_out can be incremented
      twice for one skb if snd_nxt and sk_send_head are not synchronized.
      In this case packets_out will always be positive, even when
      sk_write_queue is empty.
      
      tcp_write_wakeup
      	skb = tcp_send_head(sk);
      	tcp_fragment
      		if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq))
      			tcp_adjust_pcount(sk, skb, diff);
      	tcp_event_new_data_sent
      		tp->packets_out += tcp_skb_pcount(skb);
      
      I think updating snd_nxt isn't required when a socket is switched from
      repair mode, because it is initialized in tcp_connect_init. Then, when
      the write queue is restored, snd_nxt is incremented in
      tcp_event_new_data_sent, so it is always in a consistent state.
      
      I have checked that the bug is not reproduced with this patch and
      that all tests for restoring tcp connections work fine.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • atm: idt77252: fix dev refcnt leak · b5de4a22
      Ying Xue authored
      init_card() calls dev_get_by_name() to get a network device, but it
      doesn't decrease the network device reference count after the device
      is used.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
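The leak described in the idt77252 commit above is a lookup that takes a reference with no matching release. The pairing can be modelled in plain C as below. This is a user-space sketch: `struct device`, `device_get_by_name()`, `device_put()` and `probe_name_len()` are hypothetical and only mimic the get/put discipline, not the kernel's net device API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical refcounted device; models the get/put pairing only. */
struct device {
    const char *name;
    int refcnt;
};

static struct device devices[] = { { "eth0", 1 }, { "atm0", 1 } };

/* Look up a device by name and take a reference on success. */
static struct device *device_get_by_name(const char *name)
{
    for (size_t i = 0; i < sizeof(devices) / sizeof(devices[0]); i++) {
        if (strcmp(devices[i].name, name) == 0) {
            devices[i].refcnt++;
            return &devices[i];
        }
    }
    return NULL;
}

/* Drop a reference taken by device_get_by_name(). */
static void device_put(struct device *dev)
{
    assert(dev != NULL && dev->refcnt > 0);
    dev->refcnt--;
}

/* Use a device briefly and release the reference before returning. */
static int probe_name_len(const char *name)
{
    struct device *dev = device_get_by_name(name);
    int len;

    if (!dev)
        return -1;
    len = (int)strlen(dev->name);
    device_put(dev); /* without this the count never returns to its old value */
    return len;
}
```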
    • xfrm: Release dst if this dst is improper for vti tunnel · 236c9f84
      fan.du authored
      After searching for rt by the vti tunnel dst/src parameters,
      if this rt is not attached to any transformation, or the
      transformation is not tunnel oriented, this rt should be
      released back to the ip layer.
      
      Otherwise the dst memory is leaked.
      Signed-off-by: Fan Du <fan.du@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: fix documentation typo in netlink_set_err() · 840e93f2
      Johannes Berg authored
      The parameter is just 'group', not 'groups', fix the documentation typo.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'arc-v3.13-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · dec8e461
      Linus Torvalds authored
      Pull second set of ARC changes from Vineet Gupta:
       - Support for Perf from Mischa
       - Enabling GPIO/Pinctrl drivers for Abilis TB10x platform
       - New defconfig for buildroot
      
      * tag 'arc-v3.13-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: [plat-arcfpga] Add defconfig without initramfs location
        ARC: perf: ARC 700 PMU doesn't support sampling events
        ARC: Add documentation on DT binding for ARC700 PMU
        ARC: Add perf support for ARC700 cores
        ARC: [TB10x] Updates for GPIO and pinctrl
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 806dace6
      Linus Torvalds authored
      Pull second set of s390 patches from Martin Schwidefsky:
       "The handling of the PCI hotplug notifications has been improved, the
        zfcp dumper can now detect the HSA size dynamically and the default
        install kernel has been changed to the compressed bzImage.  And two
        bug-fixes for scm and 3270"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390/pci: implement hotplug notifications
        s390/scm_block: do not hide eadm subchannel dependency
        s390/sclp: Consolidate early sclp init calls to sclp_early_detect()
        s390/sclp: Move early code from sclp_cmd.c to sclp_early.c
        s390/sclp: Determine HSA size dynamically for zfcpdump
        s390/sclp: Move declarations for sclp_sdias into separate header file
        s390/pci: implement pcibios_remove_bus
        s390/pci: improve handling of bus resources
        s390/3270: fix missing device_destroy() call
        s390/boot: Install bzImage as default kernel image
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml · cdc7ef89
      Linus Torvalds authored
      Pull UML changes from Richard Weinberger:
       "This pile contains a nice defconfig cleanup, a rewritten stack
        unwinder and various cleanups"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
        um: Remove unused declarations from <as-layout.h>
        um: remove used STDIO_CONSOLE Kconfig param
        um/vdso: add .gitignore for a couple of targets
        arch/um: make it work with defconfig and x86_64
        um: Make kstack_depth_to_print conform to arch/x86
        um: Get rid of thread_struct->saved_task
        um: Make stack trace reliable against kernel mode faults
        um: Rewrite show_stack()
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9066d9b2
      Linus Torvalds authored
      Pull x86 fix from Ingo Molnar:
       "A modular build fix for certain .config's"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86: Export 'boot_cpu_physical_apicid' to modules
    • Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 40071626
      Linus Torvalds authored
      Pull irq cleanups from Ingo Molnar:
       "This is a multi-arch cleanup series from Thomas Gleixner, which we
        kept to near the end of the merge window, to not interfere with
        architecture updates.
      
        This series (motivated by the -rt kernel) unifies more aspects of IRQ
        handling and generalizes PREEMPT_ACTIVE"
      
      * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        preempt: Make PREEMPT_ACTIVE generic
        sparc: Use preempt_schedule_irq
        ia64: Use preempt_schedule_irq
        m32r: Use preempt_schedule_irq
        hardirq: Make hardirq bits generic
        m68k: Simplify low level interrupt handling code
        genirq: Prevent spurious detection for unconditionally polled interrupts
    • md/raid5: Use conf->device_lock protect changing of multi-thread resources. · 60aaf933
      majianpeng authored
      When we change group_thread_cnt from sysfs entry, it can OOPS.
      
      The kernel messages are:
      [  135.299021] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  135.299073] IP: [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299107] PGD 0
      [  135.299122] Oops: 0000 [#1] SMP
      [  135.299144] Modules linked in: netconsole e1000e ptp pps_core
      [  135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24
      [  135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000
      [  135.299283] RIP: 0010:[<ffffffff815188ab>]  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299323] RSP: 0018:ffff8800b77a5c48  EFLAGS: 00010002
      [  135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008
      [  135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00
      [  135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000
      [  135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00
      [  135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70
      [  135.299479] FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      [  135.299510] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0
      [  135.299559] Stack:
      [  135.299570]  ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300
      [  135.299611]  000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8
      [  135.299654]  000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98
      [  135.299696] Call Trace:
      [  135.299711]  [<ffffffff8107383e>] ? __wake_up+0x4e/0x70
      [  135.299733]  [<ffffffff81518f88>] raid5d+0x4c8/0x680
      [  135.299756]  [<ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0
      [  135.299781]  [<ffffffff81524c9f>] md_thread+0x11f/0x170
      [  135.299804]  [<ffffffff81069cd0>] ? wake_up_bit+0x40/0x40
      [  135.299826]  [<ffffffff81524b80>] ? md_rdev_init+0x110/0x110
      [  135.299850]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  135.299871]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299899]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  135.299923]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90
      [  135.300005] RIP  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.300005]  RSP <ffff8800b77a5c48>
      [  135.300005] CR2: 0000000000000000
      [  135.300005] ---[ end trace 504854e5bb7562ed ]---
      [  135.300005] Kernel panic - not syncing: Fatal exception
      
      This is because raid5d() can be running when the multi-thread
      resources are changed via sysfs.  So we need to provide locking.
      
      conf->device_lock is suitable, but we cannot simply call
      alloc_thread_groups() under this lock, as we cannot allocate memory
      while holding a spinlock.
      So change alloc_thread_groups() to allocate and return the data
      structures; raid5_store_group_thread_cnt() can then take the lock
      while updating the pointers to the data structures.
      
      This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x
      stable series.
      
      Fixes: b721420e
      Cc: stable@vger.kernel.org (3.12)
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Shaohua Li <shli@kernel.org>
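The pattern in the raid5 commit above is to allocate before taking the spinlock and to hold the lock only while swapping pointers. It can be sketched in user-space C as follows. This is an illustration under assumptions: `struct conf`, `conf_lock()`, `alloc_workers()` and `set_worker_cnt()` are hypothetical, and a C11 `atomic_flag` stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical configuration object guarded by a tiny spinlock. */
struct conf {
    atomic_flag lock;
    int *workers;    /* per-worker state */
    int worker_cnt;
};

static void conf_lock(struct conf *c)
{
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
        ; /* spin until the flag is released */
}

static void conf_unlock(struct conf *c)
{
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

/* Step 1: allocate and return the new structures with no lock held. */
static int *alloc_workers(int cnt)
{
    return cnt > 0 ? calloc((size_t)cnt, sizeof(int)) : NULL;
}

/* Step 2: take the lock only to publish the new pointers. */
static int set_worker_cnt(struct conf *c, int cnt)
{
    int *new_workers, *old;

    assert(c != NULL);
    new_workers = alloc_workers(cnt);
    if (cnt > 0 && !new_workers)
        return -1;

    conf_lock(c);
    old = c->workers;
    c->workers = new_workers;
    c->worker_cnt = cnt > 0 ? cnt : 0;
    conf_unlock(c);

    free(old); /* released outside the lock */
    return 0;
}
```

The critical section contains only pointer and counter assignments, so nothing inside it can sleep or fail. Whether the old state can be freed immediately depends on no queued work still referring to it, which the sketch does not model.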
    • md/raid5: Before freeing old multi-thread worker, it should flush them. · d206dcfa
      majianpeng authored
      When changing group_thread_cnt from sysfs entry, the kernel can oops.
      
      The kernel messages are:
      [  740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  740.961444] IP: [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961476] PGD b9013067 PUD b651e067 PMD 0
      [  740.961503] Oops: 0000 [#1] SMP
      [  740.961525] Modules linked in: netconsole e1000e ptp pps_core
      [  740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23
      [  740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000
      [  740.961673] RIP: 0010:[<ffffffff81062570>]  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961708] RSP: 0018:ffff88013a247e08  EFLAGS: 00010086
      [  740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400
      [  740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680
      [  740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d
      [  740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00
      [  740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0
      [  740.961861] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
      [  740.961893] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0
      [  740.962001] Stack:
      [  740.962001]  0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680
      [  740.962001]  ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0
      [  740.962001]  ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8
      [  740.962001] Call Trace:
      [  740.962001]  [<ffffffff810639c6>] worker_thread+0x206/0x3f0
      [  740.962001]  [<ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0
      [  740.962001]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84
      [  740.962001] RIP  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.962001]  RSP <ffff88013a247e08>
      [  740.962001] CR2: 0000000000000008
      [  740.962001] ---[ end trace 39181460000748de ]---
      [  740.962001] Kernel panic - not syncing: Fatal exception
      
      This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH.
      A worker is queued to handle them.
      But before raid5_do_work is called, raid5d handles those
      stripes, making conf->active_stripe = 0.
      So mddev_suspend() can return.
      We might then free the old worker resources before the queued
      raid5_do_work() has handled them.  When it runs, it crashes.
      
      	raid5d()		raid5_store_group_thread_cnt()
      	queue_work		mddev_suspend()
      				handle_stripes
      				active_stripe=0
      				free(old worker resources)
      	process_one_work
      	raid5_do_work
      
      To avoid this, we should flush the worker resources before freeing them.
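
      The ordering argued for above can be sketched in user space. This is a
      single-threaded toy model, not the md code: every toy_* name is
      hypothetical, and toy_flush() only stands in for waiting until queued work
      has run. It shows why the old worker array may be freed only after the
      pending work that points into it has been run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for a work item and a work queue (hypothetical names). */
struct toy_work {
    void (*fn)(struct toy_work *w);
    struct toy_work *next;
};

struct toy_queue {
    struct toy_work *head;
    struct toy_work *tail;
};

/* Per-worker state; a queued toy_work points into this memory. */
struct toy_worker {
    struct toy_work work;   /* must stay the first member for the cast below */
    int id;
};

int toy_handled_total;      /* counts work items that actually ran */

void toy_do_work(struct toy_work *w)
{
    struct toy_worker *worker = (struct toy_worker *)w;
    assert(worker->id >= 0);    /* reads worker memory: a use-after-free if it was freed */
    toy_handled_total++;
}

struct toy_worker *toy_alloc_workers(size_t cnt)
{
    struct toy_worker *workers = calloc(cnt, sizeof(*workers));
    for (size_t i = 0; workers && i < cnt; i++) {
        workers[i].work.fn = toy_do_work;
        workers[i].id = (int)i;
    }
    return workers;
}

void toy_queue_work(struct toy_queue *q, struct toy_work *w)
{
    w->next = NULL;
    if (q->tail)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
}

/* Run everything that is still pending; returns the number of items run. */
int toy_flush(struct toy_queue *q)
{
    int ran = 0;
    while (q->head) {
        struct toy_work *w = q->head;
        q->head = w->next;
        if (!q->head)
            q->tail = NULL;
        w->fn(w);
        ran++;
    }
    return ran;
}

/* Swap in a new worker array. Flushing first guarantees that no queued
 * item still refers to the old array when it is freed. */
struct toy_worker *toy_replace_workers(struct toy_queue *q,
                                       struct toy_worker *old, size_t new_cnt)
{
    struct toy_worker *fresh = toy_alloc_workers(new_cnt);
    if (!fresh)
        return old;         /* keep the old workers on allocation failure */
    toy_flush(q);
    free(old);
    return fresh;
}
```

      Calling free(old) before toy_flush(q) in toy_replace_workers() would
      reproduce the failure mode described above: the queued item would run
      against freed memory.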
      
      This fixes a bug introduced in 3.12 so is suitable for the 3.12.x
      stable series.
      
      Cc: stable@vger.kernel.org (3.12)
      Fixes: b721420e
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      d206dcfa
    • majianpeng's avatar
      md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. · e59aa23f
      majianpeng authored
      R5_ReadNoMerge means that this bio must not be merged with other bios or
      requests. REQ_FLUSH was used to achieve this, but REQ_NOMERGE does the
      same job.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e59aa23f
    • Aurelien Jarno's avatar
      UAPI: include <asm/byteorder.h> in linux/raid/md_p.h · c0f8bd14
      Aurelien Jarno authored
      linux/raid/md_p.h uses conditionals that depend on endianness and fails
      with an error if none of __BIG_ENDIAN, __LITTLE_ENDIAN or
      __BYTE_ORDER is defined, but it doesn't include any header which can
      define these constants. This makes the header unusable on its own.
      
      This patch adds a #include <asm/byteorder.h> at the beginning of this
      header to make it usable alone. This is needed to compile klibc on MIPS.
      Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
      Signed-off-by: NeilBrown <neilb@suse.de>
      c0f8bd14
    • majianpeng's avatar
      raid1: Rewrite the implementation of iobarrier. · 79ef3a8a
      majianpeng authored
      There is an iobarrier in raid1 because of contention between normal IO and
      resync IO.  It suspends all normal IO when resync/recovery happens.
      
      However, if normal IO is outside the resync window, there is no contention.
      So this patch changes the barrier mechanism to only block IO that
      could contend with the resync that is currently happening.
      
      We partition the whole space into five parts.
      |---------|-----------|------------|----------------|-------|
              start   next_resync   start_next_window    end_window
      
      start + RESYNC_WINDOW = next_resync
      next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
      start_next_window + NEXT_NORMALIO_DISTANCE = end_window
      
      Firstly we introduce some concepts:
      
      1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
            same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
            So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
      2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
            and start_next_window.  It also indicates the distance between
            start_next_window and end_window.
            It is currently 3 * RESYNC_WINDOW but could be tuned if
            this turned out not to be optimal.
      3 - next_resync: the next sector at which we will do sync IO.
      4 - start: a position which is at most RESYNC_WINDOW before
            next_resync.
      5 - start_next_window:  a position which is NEXT_NORMALIO_DISTANCE
            beyond next_resync.  Normal-io after this position doesn't need to
            wait for resync-io to complete.
      6 - end_window:  a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
            next_resync.  This also doesn't need to wait, but is counted
            differently.
      7 - current_window_requests:  the count of normalIO between
            start_next_window and end_window.
      8 - next_window_requests: the count of normalIO after end_window.
      
      NormalIO will be partitioned into four types:
      
      NormIO1:  the end sector of the bio is smaller than or equal to start.
      NormIO2:  the start sector of the bio is larger than or equal to end_window.
      NormIO3:  the start sector of the bio is larger than or equal to
                start_next_window (and smaller than end_window).
      NormIO4:  everything else, i.e. a bio that touches the region between
                start and start_next_window.
      
      |--------|-----------|--------------------|----------------|-------------|
          | start   |   next_resync   |  start_next_window   |  end_window |
       NormIO1   NormIO4            NormIO4                NormIO3      NormIO2
      
      For NormIO1, we don't need any io barrier.
      For NormIO4, we used a similar approach to the original iobarrier
          mechanism.  The normalIO and resyncIO must be kept separate.
      For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
          and "next_window_requests". They indicate the count of active
          requests in the two windows.
          For these, we don't wait for resync io to complete.
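
      The partitioning above can be sketched as a small classifier. This is an
      illustration under stated assumptions, not the raid1 code: positions are
      in bytes rather than sectors, RESYNC_DEPTH is an assumed name for the
      "32 resync requests" limit, and, following the diagram, NormIO4 is taken
      to be any bio that touches the region between start and
      start_next_window.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the description above; sizes are in bytes for simplicity. */
#define RESYNC_BLOCK_SIZE       (64 * 1024)
#define RESYNC_DEPTH            32      /* assumed name: at most 32 resync requests */
#define RESYNC_WINDOW           (RESYNC_DEPTH * RESYNC_BLOCK_SIZE)  /* 2MB */
#define NEXT_NORMALIO_DISTANCE  (3 * RESYNC_WINDOW)

enum normio_type { NORMIO1 = 1, NORMIO2, NORMIO3, NORMIO4 };

struct resync_layout {
    uint64_t start;
    uint64_t next_resync;
    uint64_t start_next_window;
    uint64_t end_window;
};

/* Derive the four boundaries from the current resync position. */
struct resync_layout layout_for(uint64_t next_resync)
{
    struct resync_layout l;

    l.next_resync = next_resync;
    /* start is at most RESYNC_WINDOW before next_resync; clamp at 0. */
    l.start = next_resync > RESYNC_WINDOW ? next_resync - RESYNC_WINDOW : 0;
    l.start_next_window = next_resync + NEXT_NORMALIO_DISTANCE;
    l.end_window = l.start_next_window + NEXT_NORMALIO_DISTANCE;
    return l;
}

/* Classify a bio covering [bio_start, bio_end). */
enum normio_type classify(const struct resync_layout *l,
                          uint64_t bio_start, uint64_t bio_end)
{
    assert(bio_start <= bio_end);
    if (bio_end <= l->start)
        return NORMIO1;     /* entirely behind the resync window: no barrier */
    if (bio_start >= l->end_window)
        return NORMIO2;     /* counted in next_window_requests */
    if (bio_start >= l->start_next_window)
        return NORMIO3;     /* counted in current_window_requests */
    return NORMIO4;         /* may contend with resync: must use the barrier */
}
```

      With next_resync at 16MB, layout_for() gives start = 14MB,
      start_next_window = 22MB and end_window = 28MB.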
      
      For the resync action, if there are NormIO4s, we must wait for them.
      If not, we can proceed.
      But if the resync action reaches start_next_window and
      current_window_requests > 0 (that is, there are NormIO3s), we must
      wait until current_window_requests becomes zero.
      When current_window_requests becomes zero, start_next_window also
      moves forward. Then current_window_requests will be replaced by
      next_window_requests.
      
      One problem is when and how to change from NormIO2 to
      NormIO3.  Only then can the sync action progress.
      
      We add a field in struct r1conf "start_next_window".
      
      A: if start_next_window == MaxSector, it means there are no NormIO2/3.
         So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
      B: if current_window_requests == 0 && next_window_requests != 0, it
         means start_next_window moves to end_window
      
      Another problem is how to differentiate between an
      old NormIO2 (which is now a NormIO3) and a NormIO2.
      For example, there are many bios which are NormIO2 and one bio which is
      NormIO3. The NormIO3 completes first, so the NormIO2 bios become NormIO3.
      
      We add a field in struct r1bio "start_next_window".
      This is used to record the position conf->start_next_window when the call
      to wait_barrier() is made in make_request().
      
      In allow_barrier(), we check conf->start_next_window.
      If r1bio->start_next_window == conf->start_next_window, it means
      there was no transition between NormIO2 and NormIO3.
      If r1bio->start_next_window != conf->start_next_window, it means
      there was a transition between NormIO2 and NormIO3.  There can only
      have been one transition.  So it can only mean the bio is an old NormIO2.
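
      The counting rules above (A, B, and the old-NormIO2 check) can be
      sketched as a simplified model. It is an illustration only: model_conf,
      model_io_begin() and model_io_end() are hypothetical names, there is no
      locking or waiting, positions are in bytes, and resetting
      start_next_window to MAX_SECTOR once no requests remain is an assumption
      drawn from rule A.

```c
#include <assert.h>
#include <stdint.h>

#define NEXT_NORMALIO_DISTANCE  (3ULL * 32 * 64 * 1024)  /* 3 * RESYNC_WINDOW */
#define MAX_SECTOR              UINT64_MAX  /* stand-in for MaxSector */

struct model_conf {
    uint64_t next_resync;
    uint64_t start_next_window;
    unsigned current_window_requests;
    unsigned next_window_requests;
};

/* Account a NormIO2/3 request when it starts. Returns the window start
 * that the request must remember (r1bio->start_next_window in the text). */
uint64_t model_io_begin(struct model_conf *c, uint64_t bio_start)
{
    if (c->start_next_window == MAX_SECTOR)     /* rule A: no NormIO2/3 yet */
        c->start_next_window = c->next_resync + NEXT_NORMALIO_DISTANCE;
    assert(bio_start >= c->start_next_window);  /* only NormIO2/3 are counted */

    if (bio_start >= c->start_next_window + NEXT_NORMALIO_DISTANCE)
        c->next_window_requests++;              /* NormIO2 */
    else
        c->current_window_requests++;           /* NormIO3 */
    return c->start_next_window;
}

/* Account the completion of a request started with model_io_begin(). */
void model_io_end(struct model_conf *c, uint64_t saved_start, uint64_t bio_start)
{
    if (saved_start == c->start_next_window &&
        bio_start >= c->start_next_window + NEXT_NORMALIO_DISTANCE)
        c->next_window_requests--;      /* still a NormIO2 */
    else
        c->current_window_requests--;   /* a NormIO3, or an old NormIO2 */

    if (c->current_window_requests == 0) {
        if (c->next_window_requests) {  /* rule B: the window moves forward */
            c->current_window_requests = c->next_window_requests;
            c->next_window_requests = 0;
            c->start_next_window += NEXT_NORMALIO_DISTANCE;
        } else {
            c->start_next_window = MAX_SECTOR;  /* assumed reset, see rule A */
        }
    }
}
```

      Because a request remembers the window start it saw when it began, a
      NormIO2 that completes after the window has moved is charged against
      current_window_requests, which is where its count was transferred.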
      
      For one bio, there may be many r1bios. So we make sure
      all the r1bio->start_next_window have the same value.
      If we meet a blocked_dev in make_request(), it must call allow_barrier
      and wait_barrier, so the value of conf->start_next_window may differ
      before and after the call.
      If there are many r1bios with different start_next_window values,
      the accounting for the relevant bio depends on the value in the last
      r1bio, which would cause errors. To avoid this, we must wait for previous
      r1bios to complete.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      79ef3a8a
    • majianpeng's avatar
      raid1: Add some macros to make code clearly. · 8e005f7c
      majianpeng authored
      A subsequent patch will use some constant parameters.
      Using macros will make the code clearer.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      8e005f7c
    • majianpeng's avatar
      raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array... · 07169fd4
      majianpeng authored
      raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
      
      We used to use raise_barrier to suspend normal IO while we reconfigure
      the array.  However raise_barrier will soon only suspend some normal
      IO, not all.  So we need something else.
      Change it to use freeze_array.
      But freeze_array not only suspends normal IO, it also suspends
      resync IO.
      For the places that call raise_barrier for reconfiguration, this isn't a
      problem.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      07169fd4
    • majianpeng's avatar
      raid1: Add a field array_frozen to indicate whether raid in freeze state. · b364e3d0
      majianpeng authored
      The following patch will rewrite the handling of contention between normal
      IO and resync IO, so we add a field to indicate whether the array is in
      the frozen state.
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b364e3d0
    • Joe Perches's avatar
      md: Convert use of typedef ctl_table to struct ctl_table · 82592c38
      Joe Perches authored
      This typedef is unnecessary and should just be removed.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      82592c38
    • NeilBrown's avatar
      md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes. · 30b8feb7
      NeilBrown authored
      When raid5 recovery hits a fresh badblock, this badblock will be flagged as
      an unack badblock until md_update_sb() is called.
      But md_stop will take the reconfig lock, which means raid5d can't call
      md_update_sb() in md_check_recovery(), so the badblock will always
      be unack. The raid5d thread then enters an infinite loop and
      md_stop_writes() can never stop sync_thread. This causes deadlock.
      
      To solve this, when the STOP_ARRAY ioctl is issued and sync_thread is
      running, we need to set the md->recovery FROZEN and INTR flags and wait for
      sync_thread to stop before we (re)take the reconfig lock.
      
      This requires that raid5 reshape_request notices MD_RECOVERY_INTR
      (which it probably should have noticed anyway) and stops waiting for a
      metadata update in that case.
      Reported-by: Jianpeng Ma <majianpeng@gmail.com>
      Reported-by: Bian Yu <bianyu@kedacom.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      30b8feb7
    • NeilBrown's avatar
      md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. · c91abf5a
      NeilBrown authored
      We currently use kthread_should_stop() in various places in the
      sync/reshape code to abort early.
      However some places set MD_RECOVERY_INTR but don't immediately call
      md_reap_sync_thread() (and we will shortly get another one).
      When this happens we are relying on md_check_recovery() to reap the
      thread and that only happens when it finishes normally.
      So MD_RECOVERY_INTR must lead to a normal finish without the
      kthread_should_stop() test.
      
      So replace all relevant tests, and be more careful when the thread is
      interrupted not to acknowledge that latest step in a reshape as it may
      not be fully committed yet.
      
      Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
      so we don't have to wait for the speed to drop before we can abort.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c91abf5a
    • NeilBrown's avatar
      md: fix some places where mddev_lock return value is not checked. · 29f097c4
      NeilBrown authored
      Sometimes we need to lock an mddev and cannot cope with
      failure due to interrupt.
      In these cases we should use mutex_lock, not mutex_lock_interruptible.
      Signed-off-by: NeilBrown <neilb@suse.de>
      29f097c4
    • Bian Yu's avatar
      raid5: Retry R5_ReadNoMerge flag when hit a read error. · edfa1f65
      Bian Yu authored
      Because of block layer merging, the failure of one bio causes the other
      bios that belong to the same request to fail too, so raid5_end_read_request
      will record all these bios as badblocks.
      If the request is retried with the R5_ReadNoMerge flag to avoid merging
      bios, badblocks records only the sectors that are actually bad.
      
      test:
      hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
      mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
      mdadm /dev/md0 -f /dev/sdd
      mdadm /dev/md0 -r /dev/sdd
      mdadm --zero-superblock /dev/sdd
      mdadm /dev/md0 -a /dev/sdd
      
      1. Without this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      299776 256
      299776 256
      
      2. With this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      300000 8
      300000 8
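
      The difference between the two outputs can be sketched as a toy model.
      It is not the raid5 code: record_merged() and record_unmerged() are
      hypothetical helpers, and the 256-sector request extent and 8-sector bio
      size are simply read off the bad_blocks output above; the real merge
      boundaries are an assumption.

```c
#include <assert.h>
#include <stdint.h>

struct bad_range {
    uint64_t start;   /* first sector */
    uint64_t len;     /* number of sectors */
};

/* Merged case: the request fails as a unit, so every bio in it looks bad
 * and the whole extent is recorded. */
struct bad_range record_merged(uint64_t req_start, uint64_t req_len)
{
    struct bad_range r = { req_start, req_len };
    return r;
}

/* Unmerged retry: each bio of bio_len sectors is read on its own, and only
 * the bio that contains bad_sector fails. Returns a zero-length range if
 * bad_sector lies outside the request. */
struct bad_range record_unmerged(uint64_t req_start, uint64_t req_len,
                                 uint64_t bio_len, uint64_t bad_sector)
{
    struct bad_range none = { 0, 0 };

    assert(bio_len > 0);
    for (uint64_t s = req_start; s < req_start + req_len; s += bio_len) {
        if (bad_sector >= s && bad_sector < s + bio_len) {
            struct bad_range r = { s, bio_len };
            return r;
        }
    }
    return none;
}
```

      For the test above, record_merged(299776, 256) yields the range
      "299776 256", while record_unmerged(299776, 256, 8, 300000) narrows it to
      "300000 8".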
      Signed-off-by: Bian Yu <bianyu@kedacom.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      edfa1f65
    • Shaohua Li's avatar
      raid5: relieve lock contention in get_active_stripe() · 4bda556a
      Shaohua Li authored
      Track the empty inactive list count, so md_raid5_congested() can use it to
      make its decision.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      4bda556a
    • Al Viro's avatar
      seq_file: always clear m->count when we free m->buf · 801a7605
      Al Viro authored
      Once we'd freed m->buf, m->count should become zero - we have no valid
      contents reachable via m->buf.
      Reported-by: Charley (Hao Chuan) Chu <charley.chu@broadcom.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      801a7605