1. 29 Apr, 2016 3 commits
    • Steve Capper's avatar
      mm: exclude HugeTLB pages from THP page_mapped() logic · 66ee95d1
      Steve Capper authored
      HugeTLB pages cannot be split, so we use the compound_mapcount to track
      rmaps.
      
      Currently page_mapped() will check the compound_mapcount, but will also
      go through the constituent pages of a THP compound page and query the
      individual _mapcount's too.
      
      Unfortunately, page_mapped() does not distinguish between HugeTLB and
      THP compound pages and assumes that a compound page always needs to have
      HPAGE_PMD_NR pages querying.
      
      For most cases when dealing with HugeTLB this is just inefficient, but
      for scenarios where the HugeTLB page size is less than the pmd block
      size (e.g.  when using contiguous bit on ARM) this can lead to crashes.
      
      This patch adjusts the page_mapped function such that we skip the
      unnecessary THP reference checks for HugeTLB pages.
      
      Fixes: e1534ae9 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
      Signed-off-by: default avatarSteve Capper <steve.capper@arm.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      66ee95d1
    • Atsushi Kumagai's avatar
      kexec: export OFFSET(page.compound_head) to find out compound tail page · d7f53518
      Atsushi Kumagai authored
      PageAnon() always look at head page to check PAGE_MAPPING_ANON and tail
      page's page->mapping has just a poisoned data since commit 1c290f64
      ("mm: sanitize page->mapping for tail pages").
      
      If makedumpfile checks page->mapping of a compound tail page to
      distinguish anonymous page as usual, it must fail in newer kernel.  So
      it's necessary to export OFFSET(page.compound_head) to avoid checking
      compound tail pages.
      
      The problem is that unnecessary hugepages won't be removed from a dump
      file in kernels 4.5.x and later.  This means that extra disk space would
      be consumed.  It's a problem, but not critical.
      Signed-off-by: default avatarAtsushi Kumagai <ats-kumagai@wm.jp.nec.com>
      Acked-by: default avatarDave Young <dyoung@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d7f53518
    • Atsushi Kumagai's avatar
      kexec: update VMCOREINFO for compound_order/dtor · 8639a847
      Atsushi Kumagai authored
      makedumpfile refers page.lru.next to get the order of compound pages for
      page filtering.
      
      However, now the order is stored in page.compound_order, hence
      VMCOREINFO should be updated to export the offset of
      page.compound_order.
      
      The fact is, page.compound_order was introduced already in kernel 4.0,
      but the offset of it was the same as page.lru.next until kernel 4.3, so
      this was not actual problem.
      
      The above can be said also for page.lru.prev and page.compound_dtor,
      it's necessary to detect hugetlbfs pages.  Further, the content was
      changed from direct address to the ID which means dtor.
      
      The problem is that unnecessary hugepages won't be removed from a dump
      file in kernels 4.4.x and later.  This means that extra disk space would
      be consumed.  It's a problem, but not critical.
      Signed-off-by: default avatarAtsushi Kumagai <ats-kumagai@wm.jp.nec.com>
      Acked-by: default avatarDave Young <dyoung@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8639a847
  2. 27 Apr, 2016 9 commits
    • Linus Torvalds's avatar
      Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · b75a2bf8
      Linus Torvalds authored
      Pull workqueue fix from Tejun Heo:
       "So, it turns out we had a silly bug in the most fundamental part of
        workqueue for a very long time.  AFAICS, this dates back to pre-git
        era and has quite likely been there from the time workqueue was first
        introduced.
      
        A work item uses its PENDING bit to synchronize multiple queuers.
        Anyone who wins the PENDING bit owns the pending state of the work
        item.  Whether a queuer wins or loses the race, one thing should be
        guaranteed - there will soon be at least one execution of the work
        item - where "after" means that the execution instance would be able
        to see all the changes that the queuer has made prior to the queueing
        attempt.
      
        Unfortunately, we were missing a smp_mb() after clearing PENDING for
        execution, so nothing guaranteed visibility of the changes that a
        queueing loser has made, which manifested as a reproducible blk-mq
        stall.
      
        Lots of kudos to Roman for debugging the problem.  The patch for
        -stable is the minimal one.  For v3.7, Peter is working on a patch to
        make the code path slightly more efficient and less fragile"
      
      * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: fix ghost PENDING flag while doing MQ IO
      b75a2bf8
    • Linus Torvalds's avatar
      Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 763cfc86
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "Two patches to fix a deadlock which can be easily triggered if memcg
        charge moving is used.
      
        This bug was introduced while converting threadgroup locking to a
        global percpu_rwsem and is caused by cgroup controller task migration
        path depending on the ability to create new kthreads.  cpuset had a
        similar issue which was fixed by performing heavy-lifting operations
        asynchronous to task migration.  The two patches fix the same issue in
        memcg in a similar way.  The first patch makes the mechanism generic
        and the second relocates memcg charge moving outside the migration
        path.
      
        Given that we don't want to perform heavy operations while
        writelocking threadgroup lock anyway, moving them out of the way is a
        desirable solution.  One thing to note is that the problem was
        difficult to debug because lockdep couldn't figure out the deadlock
        condition.  Looking into how to improve that"
      
      * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        memcg: relocate charge moving from ->attach to ->post_attach
        cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
      763cfc86
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 3118e5f9
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "I2C has one buildfix, one ABBA deadlock fix, and three simple 'add ID'
        patches"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: exynos5: Fix possible ABBA deadlock by keeping I2C clock prepared
        i2c: cpm: Fix build break due to incompatible pointer types
        i2c: ismt: Add Intel DNV PCI ID
        i2c: xlp9xx: add support for Broadcom Vulcan
        i2c: rk3x: add support for rk3228
      3118e5f9
    • Linus Torvalds's avatar
      Merge tag 'arc-4.6-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · 24131a61
      Linus Torvalds authored
      Pull ARC fixes from Vineet Gupta:
      
       - lockdep now works for ARCv2 builds
      
       - enable DT reserved-memory binding (for forthcoming HDMI driver)
      
      * tag 'arc-4.6-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: add support for reserved memory defined by device tree
        ARC: support generic per-device coherent dma mem
        Documentation: dt: arc: fix spelling mistakes
        ARCv2: Enable LOCKDEP
      24131a61
    • Linus Torvalds's avatar
      Merge tag 'nios2-v4.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2 · 508fea71
      Linus Torvalds authored
      Pull arch/nios2 fix from Ley Foon Tan:
       "memset: use the right constraint modifier for the %4 output operand"
      
      * tag 'nios2-v4.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2:
        nios2: memset: use the right constraint modifier for the %4 output operand
      508fea71
    • Linus Torvalds's avatar
      Merge tag 'platform-drivers-x86-v4.6-3' of... · 9453203b
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v4.6-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
      
      Pull x86 platform driver fix from Darren Hart:
       "Fix regression caused by hotkey enabling value in toshiba_acpi"
      
      * tag 'platform-drivers-x86-v4.6-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
        toshiba_acpi: Fix regression caused by hotkey enabling value
      9453203b
    • Alexey Brodkin's avatar
      ARC: add support for reserved memory defined by device tree · 1b10cb21
      Alexey Brodkin authored
      Enable reserved memory initialization from device tree.
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      1b10cb21
    • Alexey Brodkin's avatar
      ARC: support generic per-device coherent dma mem · 32ed9a0e
      Alexey Brodkin authored
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      32ed9a0e
    • Romain Perier's avatar
      nios2: memset: use the right constraint modifier for the %4 output operand · a8950e49
      Romain Perier authored
      Depending on the size of the area to be memset'ed, the nios2 memset implementation
      either uses a naive loop (for buffers smaller or equal than 8 bytes) or a more optimized
      implementation (for buffers larger than 8 bytes). This implementation does 4-byte stores
      rather than 1-byte stores to speed up memset.
      
      However, we discovered that on our nios2 platform, memset() was not properly setting the
      buffer to the expected value. A memset of 0xff would not set the entire buffer to 0xff, but to:
      
      0xff 0x00 0xff 0x00 0xff 0x00 0xff 0x00 ...
      
      Which is obviously incorrect. Our investigation has revealed that the problem lies in the
      incorrect constraints used in the inline assembly.
      
      The following piece of assembly, from the nios2 memset implementation, is supposed to
      create a 4-byte value that repeats 4 times the 1-byte pattern passed as memset argument:
      
      /* fill8 %3, %5 (c & 0xff) */
      "       slli    %4, %5, 8\n"
      "       or      %4, %4, %5\n"
      "       slli    %3, %4, 16\n"
      "       or      %3, %3, %4\n"
      
      However, depending on the compiler and optimization level, this code might be compiled as:
      
      34:	280a923a 	slli	r5,r5,8
      38:	294ab03a 	or	r5,r5,r5
      3c:	2808943a 	slli	r4,r5,16
      40:	2148b03a 	or	r4,r4,r5
      
      This is wrong because r5 gets used both for %5 and %4, which leads to the final pattern
      stored in r4 to be 0xff00ff00 rather than the expected 0xffffffff.
      
      %4 is defined with the "=r" constraint, i.e as an output operand. However, as explained in
      http://www.ethernut.de/en/documents/arm-inline-asm.html, this does not prevent gcc from
      using the same register for an output operand (%4) and input operand (%5). By using the
      constraint modifier '&', we indicate that the register should be used for output only. With this
      change, we get the following assembly output:
      
      34:	2810923a 	slli	r8,r5,8
      38:	4150b03a 	or	r8,r8,r5
      3c:	400e943a 	slli	r7,r8,16
      40:	3a0eb03a 	or	r7,r7,r8
      
      Which correctly produces the 0xffffffff pattern when 0xff is passed as the memset() pattern.
      
      It is worth mentioning the observed consequence of this bug: we were hitting the kernel
      BUG() in mm/bootmem.c:__free() that verifies when marking a page as free that it was
      previously marked as occupied (i.e that the bit was set to 1). The entire bootmem bitmap is
      set to 0xff bit via a memset() during the bootmem initialization. The bootmem_free() call right
      after the initialization was finding some bits to be set to 0, which didn't make sense since the
      bitmap has just been memset'ed to 0xff. Except that due to the bug explained above, the
      bitmap was in fact initialized to 0xff00ff00.
      
      Thanks to Marek Vasut for his help and feedback.
      Signed-off-by: default avatarRomain Perier <romain.perier@free-electrons.com>
      Acked-by: default avatarMarek Vasut <marex@denx.de>
      Acked-by: default avatarLey Foon Tan <lftan@altera.com>
      a8950e49
  3. 26 Apr, 2016 10 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · f28f20da
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Handle v4/v6 mixed sockets properly in soreuseport, from Craig
          Gallak.
      
       2) Bug fixes for the new macsec facility (missing kmalloc NULL checks,
          missing locking around netdev list traversal, etc.) from Sabrina
          Dubroca.
      
       3) Fix handling of host routes on ifdown in ipv6, from David Ahern.
      
       4) Fix double-fdput in bpf verifier.  From Jann Horn.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (31 commits)
        bpf: fix double-fdput in replace_map_fd_with_map_ptr()
        net: ipv6: Delete host routes on an ifdown
        Revert "ipv6: Revert optional address flusing on ifdown."
        net/mlx4_en: fix spurious timestamping callbacks
        net: dummy: remove note about being Y by default
        cxgbi: fix uninitialized flowi6
        ipv6: Revert optional address flusing on ifdown.
        ipv4/fib: don't warn when primary address is missing if in_dev is dead
        net/mlx5: Add pci shutdown callback
        net/mlx5_core: Remove static from local variable
        net/mlx5e: Use vport MTU rather than physical port MTU
        net/mlx5e: Fix minimum MTU
        net/mlx5e: Device's mtu field is u16 and not int
        net/mlx5_core: Add ConnectX-5 to list of supported devices
        net/mlx5e: Fix MLX5E_100BASE_T define
        net/mlx5_core: Fix soft lockup in steering error flow
        qlcnic: Update version to 5.3.64
        net: stmmac: socfpga: Remove re-registration of reset controller
        macsec: fix netlink attribute validation
        macsec: add missing macsec prefix in uapi
        ...
      f28f20da
    • Linus Torvalds's avatar
      Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 91ea692f
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "Here are the latest bug fixes for ARM SoCs, mostly addressing recent
        regressions.  Changes are across several platforms, so I'm listing
        every change separately here.
      
        Regressions since 4.5:
      
         - A correction of the psci firmware DT binding, to prevent users from
           relying on unintended semantics
      
         - Actually getting the newly merged clock driver for some OMAP
           platforms to work
      
         - A revert of patches for the Qualcomm BAM, these need to be reworked
           for 4.7 to avoid breaking boards other than the one they were
           intended for
      
         - A correction for the I2C device nodes on the Socionext Uniphier
           platform
      
         - i.MX SDHCI was broken for non-DT platforms due to a change with the
           setting of the DMA mask
      
         - A revert of a patch that accidentally added a nonexisting clock on
           the Rensas "Porter" board
      
         - A couple of OMAP fixes that are all related to suspend after the
           power domain changes for dra7
      
         - On Mediatek, revert part of the power domain initialization changes
           that broke mt8173-evb
      
        Fixes for older bugs:
      
         - Workaround for an "external abort" in the omap34xx suspend/resume
           code.
      
         - The USB1/eSATA should not be listed as an excon device on
           am57xx-beagle-x15 (broken since v4.0)
      
         - A v4.5 regression in the TI AM33xx and AM43XX DT specifying
           incorrect DMA request lines for the GPMC
      
         - The jiffies calibration on Renesas platforms was incorrect for some
           modern CPU cores.
      
         - A hardware errata woraround for clockdomains on TI DRA7"
      
      * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        drivers: firmware: psci: unify enable-method binding on ARM {64,32}-bit systems
        arm64: dts: uniphier: fix I2C nodes of PH1-LD20
        ARM: shmobile: timer: Fix preset_lpj leading to too short delays
        Revert "ARM: dts: porter: Enable SCIF_CLK frequency and pins"
        ARM: dts: r8a7791: Don't disable referenced optional clocks
        Revert "ARM: OMAP: Catch callers of revision information prior to it being populated"
        ARM: OMAP3: Fix external abort on 36xx waking from off mode idle
        ARM: dts: am57xx-beagle-x15: remove extcon_usb1
        ARM: dts: am437x: Fix GPMC dma properties
        ARM: dts: am33xx: Fix GPMC dma properties
        Revert "soc: mediatek: SCPSYS: Fix double enabling of regulators"
        ARM: mach-imx: sdhci-esdhc-imx: initialize DMA mask
        ARM: DRA7: clockdomain: Implement timer workaround for errata i874
        ARM: OMAP: Catch callers of revision information prior to it being populated
        ARM: dts: dra7: Correct clock tree for sys_32k_ck
        ARM: OMAP: DRA7: Provide proper class to omap2_set_globals_tap
        ARM: OMAP: DRA7: wakeupgen: Skip SAR save for wakeupgen
        Revert "dts: msm8974: Add dma channels for blsp2_i2c1 node"
        Revert "dts: msm8974: Add blsp2_bam dma node"
        ARM: dts: Add clocks for dm814x ADPLL
      91ea692f
    • Linus Torvalds's avatar
      devpts: more pty driver interface cleanups · 8ead9dd5
      Linus Torvalds authored
      This is more prep-work for the upcoming pty changes.  Still just code
      cleanup with no actual semantic changes.
      
      This removes a bunch pointless complexity by just having the slave pty
      side remember the dentry associated with the devpts slave rather than
      the inode.  That allows us to remove all the "look up the dentry" code
      for when we want to remove it again.
      
      Together with moving the tty pointer from "inode->i_private" to
      "dentry->d_fsdata" and getting rid of pointless inode locking, this
      removes about 30 lines of code.  Not only is the end result smaller,
      it's simpler and easier to understand.
      
      The old code, for example, depended on the d_find_alias() to not just
      find the dentry, but also to check that it is still hashed, which in
      turn validated the tty pointer in the inode.
      
      That is a _very_ roundabout way to say "invalidate the cached tty
      pointer when the dentry is removed".
      
      The new code just does
      
      	dentry->d_fsdata = NULL;
      
      in devpts_pty_kill() instead, invalidating the tty pointer rather more
      directly and obviously.  Don't do something complex and subtle when the
      obvious straightforward approach will do.
      
      The rest of the patch (ie apart from code deletion and the above tty
      pointer clearing) is just switching the calling convention to pass the
      dentry or file pointer around instead of the inode.
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Aurelien Jarno <aurelien@aurel32.net>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8ead9dd5
    • Jann Horn's avatar
      bpf: fix double-fdput in replace_map_fd_with_map_ptr() · 8358b02b
      Jann Horn authored
      When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode
      references a non-map file descriptor as a map file descriptor, the error
      handling code called fdput() twice instead of once (in __bpf_map_get() and
      in replace_map_fd_with_map_ptr()). If the file descriptor table of the
      current task is shared, this causes f_count to be decremented too much,
      allowing the struct file to be freed while it is still in use
      (use-after-free). This can be exploited to gain root privileges by an
      unprivileged user.
      
      This bug was introduced in
      commit 0246e64d ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only
      exploitable since
      commit 1be7f75d ("bpf: enable non-root eBPF programs") because
      previously, CAP_SYS_ADMIN was required to reach the vulnerable code.
      
      (posted publicly according to request by maintainer)
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8358b02b
    • David Ahern's avatar
      net: ipv6: Delete host routes on an ifdown · 38bd10c4
      David Ahern authored
      It was a simple idea -- save IPv6 configured addresses on a link down
      so that IPv6 behaves similar to IPv4. As always the devil is in the
      details and the IPv6 stack as too many behavioral differences from IPv4
      making the simple idea more complicated than it needs to be.
      
      The current implementation for keeping IPv6 addresses can panic or spit
      out a warning in one of many paths:
      
      1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
         rt6_fill_node while handling a route dump request.
      
      2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
         fib6_del
      
      3. Panic in fib6_purge_rt because rt6i_ref count is not 1.
      
      The root cause of all these is references related to the host route for
      an address that is retained.
      
      So, this patch deletes the host route every time the ifdown loop runs.
      Since the host route is deleted and will be re-generated an up there is
      no longer a need for the l3mdev fix up. On the 'admin up' side move
      addrconf_permanent_addr into the NETDEV_UP event handling so that it
      runs only once versus on UP and CHANGE events.
      
      All of the current panics and warnings appear to be related to
      addresses on the loopback device, but given the catastrophic nature when
      a bug is triggered this patch takes the conservative approach and evicts
      all host routes rather than trying to determine when it can be re-used
      and when it can not. That can be a later optimizaton if desired.
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      38bd10c4
    • David S. Miller's avatar
      Revert "ipv6: Revert optional address flusing on ifdown." · 6a923934
      David S. Miller authored
      This reverts commit 841645b5.
      
      Ok, this puts the feature back.  I've decided to apply David A.'s
      bug fix and run with that rather than make everyone wait another
      whole release for this feature.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6a923934
    • Roman Pen's avatar
      workqueue: fix ghost PENDING flag while doing MQ IO · 346c09f8
      Roman Pen authored
      The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
      with the following backtrace:
      
      [  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
      [  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
      [  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
      [  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
      [  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
      [  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
      [  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
      [  601.350965] Call Trace:
      [  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
      [  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
      [  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
      [  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
      [  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
      [  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
      [  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
      [  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
      [  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
      [  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
      [  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
      [  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
      [  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
      [  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
      [  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
      [  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
      [  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
      [  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
      [  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50
      
      The underlying device is a null_blk, with default parameters:
      
        queue_mode    = MQ
        submit_queues = 1
      
      Verification that nullb0 has something inflight:
      
      root@pserver8:~# cat /sys/block/nullb0/inflight
             0        1
      root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
      ...
      /sys/block/nullb0/mq/0/cpu2/rq_list
      CTX pending:
              ffff8838038e2400
      ...
      
      During debug it became clear that stalled request is always inserted in
      the rq_list from the following path:
      
         save_stack_trace_tsk + 34
         blk_mq_insert_requests + 231
         blk_mq_flush_plug_list + 281
         blk_flush_plug_list + 199
         wait_on_page_bit + 192
         __filemap_fdatawait_range + 228
         filemap_fdatawait_range + 20
         filemap_write_and_wait_range + 63
         blkdev_fsync + 27
         vfs_fsync_range + 73
         blkdev_write_iter + 202
         __vfs_write + 170
         vfs_write + 169
         kernel_write + 56
      
      So blk_flush_plug_list() was called with from_schedule == true.
      
      If from_schedule is true, that means that finally blk_mq_insert_requests()
      offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
      i.e. it calls kblockd_schedule_delayed_work_on().
      
      That means, that we race with another CPU, which is about to execute
      __blk_mq_run_hw_queue() work.
      
      Further debugging shows the following traces from different CPUs:
      
        CPU#0                                  CPU#1
        ----------------------------------     -------------------------------
        reqeust A inserted
        STORE hctx->ctx_map[0] bit marked
        kblockd_schedule...() returns 1
        <schedule to kblockd workqueue>
                                               request B inserted
                                               STORE hctx->ctx_map[1] bit marked
                                               kblockd_schedule...() returns 0
        *** WORK PENDING bit is cleared ***
        flush_busy_ctxs() is executed, but
        bit 1, set by CPU#1, is not observed
      
      As a result request B pended forever.
      
      This behaviour can be explained by speculative LOAD of hctx->ctx_map on
      CPU#0, which is reordered with clear of PENDING bit and executed _before_
      actual STORE of bit 1 on CPU#1.
      
      The proper fix is an explicit full barrier <mfence>, which guarantees
      that clear of PENDING bit is to be executed before all possible
      speculative LOADS or STORES inside actual work function.
      Signed-off-by: default avatarRoman Pen <roman.penyaev@profitbricks.com>
      Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
      Cc: Michael Wang <yun.wang@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      346c09f8
    • Sudeep Holla's avatar
      drivers: firmware: psci: unify enable-method binding on ARM {64,32}-bit systems · 978fa436
      Sudeep Holla authored
      Currently ARM CPUs DT bindings allows different enable-method value for
      PSCI based systems. On ARM 64-bit this property is required and must be
      "psci" while on ARM 32-bit systems this property is optional and must
      be "arm,psci" if present.
      
      However, "arm,psci" has always been the compatible string for the PSCI
      node, and was never intended to be the enable-method. So this is a bug
      in the binding and not a deliberate attempt at specifying 32-bit
      differently.
      
      This is problematic if 32-bit OS is run on 64-bit system which has
      "psci" as enable-method rather than the expected "arm,psci".
      
      So let's unify the value into "psci" and remove support for "arm,psci"
      before it finds any users.
      Reported-by: default avatarSoby Mathew <Soby.Mathew@arm.com>
      Cc: Rob Herring <robh+dt@kernel.org>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Acked-by: default avatarLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: default avatarSudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      978fa436
    • Eric Dumazet's avatar
      net/mlx4_en: fix spurious timestamping callbacks · fc96256c
      Eric Dumazet authored
      When multiple skb are TX-completed in a row, we might incorrectly keep
      a timestamp of a prior skb and cause extra work.
      
      Fixes: ec693d47 ("net/mlx4_en: Add HW timestamping (TS) support")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fc96256c
    • Ivan Babrou's avatar
      9f5db535
  4. 25 Apr, 2016 9 commits
  5. 24 Apr, 2016 9 commits