1. 02 Jul, 2014 19 commits
    • Johan Hovold's avatar
      USB: cdc-acm: fix potential urb leak and PM imbalance in write · 5d62cbeb
      Johan Hovold authored
      commit 183a4508 upstream.
      
      Make sure to check return value of autopm get in write() in order to
      avoid urb leak and PM counter imbalance on errors.
      
      Fixes: 11ea859d ("USB: additional power savings for cdc-acm devices
      that support remote wakeup")
      Signed-off-by: default avatarJohan Hovold <jhovold@gmail.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5d62cbeb
    • Johan Hovold's avatar
      USB: cdc-acm: fix shutdown and suspend race · 4ed88943
      Johan Hovold authored
      commit ed797074 upstream.
      
      We should stop I/O unconditionally at suspend rather than rely on the
      tty-port initialised flag (which is set prior to stopping I/O during
      shutdown) in order to prevent suspend returning with URBs still active.
      
      Fixes: 11ea859d ("USB: additional power savings for cdc-acm devices
      that support remote wakeup")
      Signed-off-by: default avatarJohan Hovold <jhovold@gmail.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4ed88943
    • Johan Hovold's avatar
      USB: cdc-acm: fix runtime PM for control messages · 4d9aa622
      Johan Hovold authored
      commit bae3f4c5 upstream.
      
      Fix runtime PM handling of control messages by adding the required PM
      counter operations.
      
      Fixes: 11ea859d ("USB: additional power savings for cdc-acm devices
      that support remote wakeup")
      Signed-off-by: default avatarJohan Hovold <jhovold@gmail.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4d9aa622
    • Johan Hovold's avatar
      USB: cdc-acm: fix broken runtime suspend · 870306cf
      Johan Hovold authored
      commit 140cb81a upstream.
      
      The current ACM runtime-suspend implementation is broken in several
      ways:
      
      Firstly, it buffers only the first write request being made while
      suspended -- any further writes are silently dropped.
      
      Secondly, writes being dropped also leak write urbs, which are never
      reclaimed (until the device is unbound).
      
      Thirdly, even the single buffered write is not cleared at shutdown
      (which may happen before the device is resumed), something which can
      lead to another urb leak as well as a PM usage-counter leak.
      
      Fix this by implementing a delayed-write queue using urb anchors and
      making sure to discard the queue properly at shutdown.
      
      Fixes: 11ea859d ("USB: additional power savings for cdc-acm devices
      that support remote wakeup")
      Reported-by: default avatarXiao Jin <jin.xiao@intel.com>
      Signed-off-by: default avatarJohan Hovold <jhovold@gmail.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      870306cf
    • Johan Hovold's avatar
      USB: cdc-acm: fix write and resume race · 529ce0fc
      Johan Hovold authored
      commit e144ed28 upstream.
      
      Fix race between write() and resume() due to improper locking that could
      lead to writes being reordered.
      
      Resume must be done atomically and susp_count be protected by the
      write_lock in order to prevent racing with write(). This could otherwise
      lead to writes being reordered if write() grabs the write_lock after
      susp_count is decremented, but before the delayed urb is submitted.
      
      Fixes: 11ea859d ("USB: additional power savings for cdc-acm devices
      that support remote wakeup")
      Signed-off-by: default avatarJohan Hovold <jhovold@gmail.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      529ce0fc
    • Johan Hovold's avatar
      USB: cdc-acm: fix write and suspend race · 06c2f546
      Johan Hovold authored
      commit 5a345c20 upstream.
      
      Fix race between write() and suspend() which could lead to writes being
      dropped (or I/O while suspended) if the device is runtime suspended
      while a write request is being processed.
      
      Specifically, suspend() releases the write_lock after determining the
      device is idle but before incrementing the susp_count, thus leaving a
      window where a concurrent write() can submit an urb.
      
      Fixes: 11ea859d ("USB: additional power savings for cdc-acm devices
      that support remote wakeup")
      Signed-off-by: default avatarJohan Hovold <jhovold@gmail.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      06c2f546
    • James Hogan's avatar
      MIPS: KVM: Allocate at least 16KB for exception handlers · 263be4bc
      James Hogan authored
      commit 7006e2df upstream.
      
      Each MIPS KVM guest has its own copy of the KVM exception vector. This
      contains the TLB refill exception handler at offset 0x000, the general
      exception handler at offset 0x180, and interrupt exception handlers at
      offset 0x200 in case Cause_IV=1. A common handler is copied to offset
      0x2000 and offset 0x3000 is used for temporarily storing k1 during entry
      from guest.
      
      However the amount of memory allocated for this purpose is calculated as
      0x200 rounded up to the next page boundary, which is insufficient if 4KB
      pages are in use. This can lead to the common handler at offset 0x2000
      being overwritten and infinitely recursive exceptions on the next exit
      from the guest.
      
      Increase the minimum size from 0x200 to 0x4000 to cover the full use of
      the page.
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: kvm@vger.kernel.org
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: Sanjay Lal <sanjayl@kymasys.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      263be4bc
    • Paolo Bonzini's avatar
      KVM: lapic: sync highest ISR to hardware apic on EOI · 331f2b02
      Paolo Bonzini authored
      commit fc57ac2c upstream.
      
      When Hyper-V enlightenments are in effect, Windows prefers to issue an
      Hyper-V MSR write to issue an EOI rather than an x2apic MSR write.
      The Hyper-V MSR write is not handled by the processor, and besides
      being slower, this also causes bugs with APIC virtualization.  The
      reason is that on EOI the processor will modify the highest in-service
      interrupt (SVI) field of the VMCS, as explained in section 29.1.4 of
      the SDM; every other step in EOI virtualization is already done by
      apic_send_eoi or on VM entry, but this one is missing.
      
      We need to do the same, and be careful not to muck with the isr_count
      and highest_isr_cache fields that are unused when virtual interrupt
      delivery is enabled.
      Reviewed-by: default avatarYang Zhang <yang.z.zhang@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      331f2b02
    • Guenter Roeck's avatar
      mfd: sm501: dbg_regs attribute must be read-only · 7228e303
      Guenter Roeck authored
      commit 8a8320c2 upstream.
      
      Fix:
      
      sm501 sm501: SM501 At b3e00000: Version 050100a0, 8 Mb, IRQ 100
      Attribute dbg_regs: write permission without 'store'
      ------------[ cut here ]------------
      WARNING: at drivers/base/core.c:620
      
      dbg_regs does not have a write function and must therefore be marked
      as read-only.
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarLee Jones <lee.jones@linaro.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      7228e303
    • Benjamin LaHaise's avatar
      aio: fix aio request leak when events are reaped by userspace · 1c55a373
      Benjamin LaHaise authored
      commit f8567a38 upstream.
      
      The aio cleanups and optimizations by kmo that were merged into the 3.10
      tree added a regression for userspace event reaping.  Specifically, the
      reference counts are not decremented if the event is reaped in userspace,
      leading to the application being unable to submit further aio requests.
      This patch applies to 3.12+.  A separate backport is required for 3.10/3.11.
      This issue was uncovered as part of CVE-2014-0206.
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      1c55a373
    • Benjamin LaHaise's avatar
      aio: fix kernel memory disclosure in io_getevents() introduced in v3.10 · bee3f7b8
      Benjamin LaHaise authored
      commit edfbbf38 upstream.
      
      A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
      by commit a31ad380.  The changes made to
      aio_read_events_ring() failed to correctly limit the index into
      ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of
      an arbitrary page with a copy_to_user() to copy the contents into userspace.
      This vulnerability has been assigned CVE-2014-0206.  Thanks to Mateusz and
      Petr for disclosing this issue.
      
      This patch applies to v3.12+.  A separate backport is needed for 3.10/3.11.
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      bee3f7b8
    • James Hogan's avatar
      serial: 8250_dw: Fix LCR workaround regression · be31bc4b
      James Hogan authored
      commit 6979f8d2 upstream.
      
      Commit c49436b6 (serial: 8250_dw: Improve unwritable LCR workaround)
      caused a regression. It added a check that the LCR was written properly
      to detect and workaround the busy quirk, but the behaviour of bit 5
      (UART_LCR_SPAR) differs between IP versions 3.00a and 3.14c per the
      docs. On older versions this caused the check to fail and it would
      repeatedly force idle and rewrite the LCR register, causing delays and
      preventing any input from serial being received.
      
      This is fixed by masking out UART_LCR_SPAR before making the comparison.
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Tim Kryger <tim.kryger@linaro.org>
      Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
      Cc: Matt Porter <matt.porter@linaro.org>
      Cc: Markus Mayer <markus.mayer@linaro.org>
      Tested-by: default avatarTim Kryger <tim.kryger@linaro.org>
      Tested-by: default avatarEzequiel Garcia <ezequiel.garcia@free-electrons.com>
      Tested-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      be31bc4b
    • Tim Kryger's avatar
      serial: 8250_dw: Improve unwritable LCR workaround · 24015775
      Tim Kryger authored
      commit c49436b6 upstream.
      
      When configured with UART_16550_COMPATIBLE=NO or in versions prior to
      the introduction of this option, the Designware UART will ignore writes
      to the LCR if the UART is busy.  The current workaround saves a copy of
      the last written LCR and re-writes it in the ISR for a special interrupt
      that is raised when a write was ignored.
      
      Unfortunately, interrupts are typically disabled prior to performing a
      sequence of register writes that include the LCR so the point at which
      the retry occurs is too late.  An example is serial8250_do_set_termios()
      where an ignored LCR write results in the baud divisor not being set and
      instead a garbage character is sent out the transmitter.
      
      Furthermore, since serial_port_out() offers no way to indicate failure,
      a serious effort must be made to ensure that the LCR is actually updated
      before returning back to the caller.  This is difficult, however, as a
      UART that was busy during the first attempt is likely to still be busy
      when a subsequent attempt is made unless some extra action is taken.
      
      This updated workaround reads back the LCR after each write to confirm
      that the new value was accepted by the hardware.  Should the hardware
      ignore a write, the TX/RX FIFOs are cleared and the receive buffer read
      before attempting to rewrite the LCR out of the hope that doing so will
      force the UART into an idle state.  While this may seem unnecessarily
      aggressive, writes to the LCR are used to change the baud rate, parity,
      stop bit, or data length so the data that may be lost is likely not
      important.  Admittedly, this is far from ideal but it seems to be the
      best that can be done given the hardware limitations.
      
      Lastly, the revised workaround doesn't touch the LCR in the ISR, so it
      avoids the possibility of a "serial8250: too much work for irq" lock up.
      This problem is rare in real situations but can be reproduced easily by
      wiring up two UARTs and running the following commands.
      
        # stty -F /dev/ttyS1 echo
        # stty -F /dev/ttyS2 echo
        # cat /dev/ttyS1 &
        [1] 375
        # echo asdf > /dev/ttyS1
        asdf
      
        [   27.700000] serial8250: too much work for irq96
        [   27.700000] serial8250: too much work for irq96
        [   27.710000] serial8250: too much work for irq96
        [   27.710000] serial8250: too much work for irq96
        [   27.720000] serial8250: too much work for irq96
        [   27.720000] serial8250: too much work for irq96
        [   27.730000] serial8250: too much work for irq96
        [   27.730000] serial8250: too much work for irq96
        [   27.740000] serial8250: too much work for irq96
      Signed-off-by: default avatarTim Kryger <tim.kryger@linaro.org>
      Reviewed-by: default avatarMatt Porter <matt.porter@linaro.org>
      Reviewed-by: default avatarMarkus Mayer <markus.mayer@linaro.org>
      Reviewed-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      
      Conflicts:
      	drivers/tty/serial/8250/8250_dw.c
      24015775
    • Naoya Horiguchi's avatar
      mm: add !pte_present() check on existing hugetlb_entry callbacks · 7032d5fb
      Naoya Horiguchi authored
      commit d4c54919 upstream.
      
      The age table walker doesn't check non-present hugetlb entry in common
      path, so hugetlb_entry() callbacks must check it.  The reason for this
      behavior is that some callers want to handle it in its own way.
      
      [ I think that reason is bogus, btw - it should just do what the regular
        code does, which is to call the "pte_hole()" function for such hugetlb
        entries  - Linus]
      
      However, some callers don't check it now, which causes unpredictable
      result, for example when we have a race between migrating hugepage and
      reading /proc/pid/numa_maps.  This patch fixes it by adding !pte_present
      checks on buggy callbacks.
      
      This bug exists for years and got visible by introducing hugepage
      migration.
      
      ChangeLog v2:
      - fix if condition (check !pte_present() instead of pte_present())
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org> [3.12+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      [ Backported to 3.15.  Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      7032d5fb
    • Jeff Layton's avatar
      nfsd: don't halt scanning the DRC LRU list when there's an RC_INPROG entry · 5b96c379
      Jeff Layton authored
      commit 1b19453d upstream.
      
      Currently, the DRC cache pruner will stop scanning the list when it
      hits an entry that is RC_INPROG. It's possible however for a call to
      take a *very* long time. In that case, we don't want it to block other
      entries from being pruned if they are expired or we need to trim the
      cache to get back under the limit.
      
      Fix the DRC cache pruner to just ignore RC_INPROG entries.
      Signed-off-by: default avatarJeff Layton <jlayton@primarydata.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      5b96c379
    • Anatol Pomozov's avatar
      aio: block io_destroy() until all context requests are completed · 04242f7d
      Anatol Pomozov authored
      commit e02ba72a upstream.
      
      deletes aio context and all resources related to. It makes sense that
      no IO operations connected to the context should be running after the context
      is destroyed. As we removed io_context we have no chance to
      get requests status or call io_getevents().
      
      man page for io_destroy says that this function may block until
      all context's requests are completed. Before kernel 3.11 io_destroy()
      blocked indeed, but since aio refactoring in 3.11 it is not true anymore.
      
      Here is a pseudo-code that shows a testcase for a race condition discovered
      in 3.11:
      
        initialize io_context
        io_submit(read to buffer)
        io_destroy()
      
        // context is destroyed so we can free the resources
        free(buffers);
      
        // if the buffer is allocated by some other user he'll be surprised
        // to learn that the buffer still filled by an outstanding operation
        // from the destroyed io_context
      
      The fix is straight-forward - add a completion struct and wait on it
      in io_destroy, complete() should be called when number of in-fligh requests
      reaches zero.
      
      If two or more io_destroy() called for the same context simultaneously then
      only the first one waits for IO completion, other calls behaviour is undefined.
      
      Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and
        do not see the race condition anymore.
      Signed-off-by: default avatarAnatol Pomozov <anatol.pomozov@gmail.com>
      Signed-off-by: default avatarBenjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      04242f7d
    • Jeff Layton's avatar
      nfsd: don't try to reuse an expired DRC entry off the list · 0c9a0cfb
      Jeff Layton authored
      commit a0ef5e19 upstream.
      
      Currently when we are processing a request, we try to scrape an expired
      or over-limit entry off the list in preference to allocating a new one
      from the slab.
      
      This is unnecessarily complicated. Just use the slab layer.
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      0c9a0cfb
    • Zhichuang SUN's avatar
      drivers/video/fbdev/fb-puv3.c: Add header files for function unifb_mmap · 9e6f2084
      Zhichuang SUN authored
      commit fbc6c4a1 upstream.
      
      Function unifb_mmap calls functions which are defined in linux/mm.h
      and asm/pgtable.h
      
      The related error (for unicore32 with unicore32_defconfig):
      	CC      drivers/video/fbdev/fb-puv3.o
      	drivers/video/fbdev/fb-puv3.c: In function 'unifb_mmap':
      	drivers/video/fbdev/fb-puv3.c:646: error: implicit declaration of
      				      function 'vm_iomap_memory'
      	drivers/video/fbdev/fb-puv3.c:646: error: implicit declaration of
      				      function 'pgprot_noncached'
      Signed-off-by: default avatarZhichuang Sun <sunzc522@gmail.com>
      Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Jingoo Han <jg1.han@samsung.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Joe Perches <joe@perches.com>
      Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Cc: linux-fbdev@vger.kernel.org
      Acked-by: default avatarXuetao Guan <gxt@mprc.pku.edu.cn>
      Signed-off-by: default avatarTomi Valkeinen <tomi.valkeinen@ti.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9e6f2084
    • Chen Gang's avatar
      arch/unicore32/mm/alignment.c: include "asm/pgtable.h" to avoid compiling error · 976c2ec4
      Chen Gang authored
      commit 1ff38c56 upstream.
      
      Need include "asm/pgtable.h" to include "asm-generic/pgtable-nopmd.h",
      so can let 'pmd_t' defined. The related error with allmodconfig:
      
          CC      arch/unicore32/mm/alignment.o
        In file included from arch/unicore32/mm/alignment.c:24:
        arch/unicore32/include/asm/tlbflush.h:135: error: expected .). before .*. token
        arch/unicore32/include/asm/tlbflush.h:154: error: expected .). before .*. token
        In file included from arch/unicore32/mm/alignment.c:27:
        arch/unicore32/mm/mm.h:15: error: expected .=., .,., .;., .sm. or ._attribute__. before .*. token
        arch/unicore32/mm/mm.h:20: error: expected .=., .,., .;., .sm. or ._attribute__. before .*. token
        arch/unicore32/mm/mm.h:25: error: expected .=., .,., .;., .sm. or ._attribute__. before .*. token
        make[1]: *** [arch/unicore32/mm/alignment.o] Error 1
        make: *** [arch/unicore32/mm] Error 2
      Signed-off-by: default avatarChen Gang <gang.chen.5i5j@gmail.com>
      Acked-by: default avatarXuetao Guan <gxt@mprc.pku.edu.cn>
      Signed-off-by: default avatarXuetao Guan <gxt@mprc.pku.edu.cn>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      976c2ec4
  2. 27 Jun, 2014 21 commits
    • Goldwyn Rodrigues's avatar
      ocfs2: revert iput deferring code in ocfs2_drop_dentry_lock · fc597a30
      Goldwyn Rodrigues authored
      commit 8ed6b237 upstream.
      
      The following patches are reverted in this patch because these patches
      caused performance regression in the remote unlink() calls.
      
        ea455f8a - ocfs2: Push out dropping of dentry lock to ocfs2_wq
        f7b1aa69 - ocfs2: Fix deadlock on umount
        5fd13189 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount
      
      Previous patches in this series removed the possible deadlocks from
      downconvert thread so the above patches shouldn't be needed anymore.
      
      The regression is caused because these patches delay the iput() in case
      of dentry unlocks.  This also delays the unlocking of the open lockres.
      The open lockresource is required to test if the inode can be wiped from
      disk or not.  When the deleting node does not get the open lock, it
      marks it as orphan (even though it is not in use by another
      node/process) and causes a journal checkpoint.  This delays operations
      following the inode eviction.  This also moves the inode to the orphaned
      inode which further causes more I/O and a lot of unneccessary orphans.
      
      The following script can be used to generate the load causing issues:
      
        declare -a create
        declare -a remove
        declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
        unique="`mktemp -u XXXXX`"
        script="/tmp/idontknow-${unique}.sh"
        cat <<EOF > "${script}"
        for n in {1..8}; do mkdir -p test/dir\${n}
          eval touch test/dir\${n}/foo{1.."\$1"}
        done
        EOF
        chmod 700 "${script}"
      
        function fcreate ()
        {
          exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
        }
      
        function fremove ()
        {
          exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
        }
      
        function fcp ()
        {
          exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
        }
      
        echo -------------------------------------------------
        echo "| # files | create #s | copy #s | remove #s |"
        echo -------------------------------------------------
        for ((x=0; x < ${#iterations[*]} ; x++)) do
          create[$x]="`fcreate ${iterations[$x]}`"
          copy[$x]="`fcp ${iterations[$x]}`"
          remove[$x]="`fremove`"
          printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
        done
        rm "${script}"
        echo "------------------------"
      Signed-off-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      fc597a30
    • Jan Kara's avatar
      ocfs2: avoid blocking in ocfs2_mark_lockres_freeing() in downconvert thread · 8f265718
      Jan Kara authored
      commit 84d86f83 upstream.
      
      If we are dropping last inode reference from downconvert thread, we will
      end up calling ocfs2_mark_lockres_freeing() which can block if the lock
      we are freeing is queued thus creating an A-A deadlock.  Luckily, since
      we are the downconvert thread, we can immediately dequeue the lock and
      thus avoid waiting in this case.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8f265718
    • Jan Kara's avatar
      ocfs2: implement delayed dropping of last dquot reference · cac99bee
      Jan Kara authored
      commit e3a767b6 upstream.
      
      We cannot drop last dquot reference from downconvert thread as that
      creates the following deadlock:
      
      NODE 1                                  NODE2
      holds dentry lock for 'foo'
      holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
                                              dquot_initialize(bar)
                                                ocfs2_dquot_acquire()
                                                  ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                                                  ...
      downconvert thread (triggered from another
      node or a different process from NODE2)
        ocfs2_dentry_post_unlock()
          ...
          iput(foo)
            ocfs2_evict_inode(foo)
              ocfs2_clear_inode(foo)
                dquot_drop(inode)
                  ...
      	    ocfs2_dquot_release()
                    ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
                     - blocks
                                                  finds we need more space in
                                                  quota file
                                                  ...
                                                  ocfs2_extend_no_holes()
                                                    ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
                                                      - deadlocks waiting for
                                                        downconvert thread
      
      We solve the problem by postponing dropping of the last dquot reference to
      a workqueue if it happens from the downconvert thread.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      cac99bee
    • Jan Kara's avatar
      quota: provide function to grab quota structure reference · b5258061
      Jan Kara authored
      commit 9f985cb6 upstream.
      
      Provide dqgrab() function to get quota structure reference when we are
      sure it already has at least one active reference.  Make use of this
      function inside quota code.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      b5258061
    • Jan Kara's avatar
      ocfs2: move dquot_initialize() in ocfs2_delete_inode() somewhat later · 2eb0658f
      Jan Kara authored
      commit bd62ad7a upstream.
      
      Move dquot_initalize() call in ocfs2_delete_inode() after the moment we
      verify inode is actually a sane one to delete.  We certainly don't want
      to initialize quota for system inodes etc.  This also avoids calling
      into quota code from downconvert thread.
      
      Add more details into the comment why bailing out from
      ocfs2_delete_inode() when we are in downconvert thread is OK.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarMark Fasheh <mfasheh@suse.de>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      2eb0658f
    • Lidong Zhong's avatar
      dlm: keep listening connection alive with sctp mode · 34b6c049
      Lidong Zhong authored
      commit 883854c5 upstream.
      
      The connection struct with nodeid 0 is the listening socket,
      not a connection to another node.  The sctp resend function
      was not checking that the nodeid was valid (non-zero), so it
      would mistakenly get and resend on the listening connection
      when nodeid was zero.
      Signed-off-by: default avatarLidong Zhong <lzhong@suse.com>
      Signed-off-by: default avatarDavid Teigland <teigland@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      34b6c049
    • Miao Xie's avatar
      Btrfs: fix BUG_ON() casued by the reserved space migration · 9bf37c05
      Miao Xie authored
      commit 20dd2cbf upstream.
      
      When we did space balance and snapshot creation at the same time, we might
      meet the following oops:
       kernel BUG at fs/btrfs/inode.c:3038!
       [SNIP]
       Call Trace:
       [<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs]
       [<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs]
       [<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs]
       [<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs]
       [<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs]
       [<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379
       [<ffffffff811215a9>] vfs_ioctl+0x1d/0x39
       [<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2
       [<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8
       [<ffffffff81121e88>] SyS_ioctl+0x57/0x83
       [<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14
       [<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b
       [SNIP]
       RIP  [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs]
      
      The reason of the problem is that the relocation root creation stole
      the reserved space, which was reserved for orphan item deletion.
      
      There are several ways to fix this problem, one is to increasing
      the reserved space size of the space balace, and then we can use
      that space to create the relocation tree for each fs/file trees.
      But it is hard to calculate the suitable size because we doesn't
      know how many fs/file trees we need relocate.
      
      We fixed this problem by reserving the space for relocation root creation
      actively since the space it need is very small (one tree block, used for
      root node copy), then we use that reserved space to create the
      relocation tree. If we don't reserve space for relocation tree creation,
      we will use the reserved space of the balance.
      Signed-off-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9bf37c05
    • Josef Bacik's avatar
      Btrfs: fix two use-after-free bugs with transaction cleanup · 8b16b61c
      Josef Bacik authored
      commit 724e2315 upstream.
      
      I was noticing the slab redzone stuff going off every once and a while during
      transaction aborts.  This was caused by two things
      
      1) We would walk the pending snapshots and set their error to -ECANCELED.  We
      don't need to do this, the snapshot stuff waits for a transaction commit and if
      there is a problem we just free our pending snapshot object and exit.  Doing
      this was causing us to touch the pending snapshot object after the thing had
      already been freed.
      
      2) We were freeing the transaction manually with wanton disregard for it's
      use_count reference counter.  To fix this I cleaned up the transaction freeing
      loop to either wait for the transaction commit to finish if it was in the middle
      of that (since it will be cleaned and freed up there) or to do the cleanup
      oursevles.
      
      I also moved the global "kill all things dirty everywhere" stuff outside of the
      transaction cleanup loop since that only needs to be done once.  With this patch
      I'm no longer seeing slab corruption because of use after frees.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8b16b61c
    • Josef Bacik's avatar
      Btrfs: don't delete ordered roots from list during cleanup · 445d1c3a
      Josef Bacik authored
      commit 1de2cfde upstream.
      
      During transaction cleanup after an abort we are just removing roots from the
      ordered roots list which is incorrect.  We have a BUG_ON() to make sure that the
      root is still part of the ordered roots list when we put our ordered extent
      which we were tripping in this case.  So do like we do everywhere else and just
      move it to the tail of the ordered roots list and allow the normal cleanup to
      take care of stuff.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      445d1c3a
    • Josef Bacik's avatar
      Btrfs: cleanup transaction on abort · bd32872f
      Josef Bacik authored
      commit 4e121c06 upstream.
      
      If we abort not during a transaction commit we won't clean up anything until we
      unmount.  Unfortunately if we abort in the middle of writing out an ordered
      extent we won't clean it up and if somebody is waiting on that ordered extent
      they will wait forever.  To fix this just make the transaction kthread call the
      cleanup transaction stuff if it notices theres an error, and make
      btrfs_end_transaction wake up the transaction kthread if there is an error.
      Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      bd32872f
    • Josef Bacik's avatar
      Btrfs: do not release metadata for space cache inodes · 4b6d66d1
      Josef Bacik authored
      commit b6d08f06 upstream.
      
      I've been testing our error paths and I was tripping the BUG_ON() in
      drop_outstanding_extent because our outstanding_extents is 0 for space cache
      inodes.  This is because we don't reserve metadata space for these inodes since
      we depend on the global block reserve for our space.  To fix this we need to
      make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for
      space cache inodes.  With this patch I'm no longer panicing.  Thanks,
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      4b6d66d1
    • Filipe David Borba Manana's avatar
      Btrfs: don't leak block group on error · 596075a2
      Filipe David Borba Manana authored
      commit e84cc142 upstream.
      
      In extent-tree.c:btrfs_write_dirty_block_groups(), if the call to
      write_one_cache_group() failed, we would return without putting
      the block group first.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      596075a2
    • Filipe David Borba Manana's avatar
      Btrfs: fix sync fs to actually wait for all data to be persisted · a2ea3d78
      Filipe David Borba Manana authored
      commit 9b199859 upstream.
      
      Currently the fs sync function (super.c:btrfs_sync_fs()) doesn't
      wait for delayed work to finish before returning success to the
      caller. This change fixes this, ensuring that there's no data loss
      if a power failure happens right after fs sync returns success to
      the caller and before the next commit happens.
      
      Steps to reproduce the data loss issue:
      
      $ mkfs.btrfs -f /dev/sdb3
      $ mount /dev/sdb3 /mnt/btrfs
      $ perl -e '$d = ("\x41" x 6001); open($f,">","/mnt/btrfs/foobar"); print $f $d; close($f);' && btrfs fi sync /mnt/btrfs
      
      Right after the btrfs fi sync command (a second or 2 for example), power
      off the machine and reboot it. The file will be empty, as it can be verified
      after mounting the filesystem and through btrfs-debug-tree:
      
      $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8
              item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36
                      location key (257 INODE_ITEM 0) type FILE
                      namelen 6 datalen 0 name: foobar
              item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160
                      inode generation 7 transid 7 size 0 block group 0 mode 100644 links 1
              item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16
                      inode ref index 2 namelen 6 name: foobar
      checksum tree key (CSUM_TREE ROOT_ITEM 0)
      leaf 29429760 items 0 free space 3995 generation 7 owner 7
      fs uuid 6192815c-af2a-4b75-b3db-a959ffb6166e
      chunk uuid b529c44b-938c-4d3d-910a-013b4700bcae
      uuid tree key (UUID_TREE ROOT_ITEM 0)
      
      After this patch, the data loss no longer happens after a power failure and
      btrfs-debug-tree shows:
      
      $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8
      	item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36
      		location key (257 INODE_ITEM 0) type FILE
      		namelen 6 datalen 0 name: foobar
      	item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160
      		inode generation 6 transid 6 size 6001 block group 0 mode 100644 links 1
      	item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16
      		inode ref index 2 namelen 6 name: foobar
      	item 6 key (257 EXTENT_DATA 0) itemoff 3522 itemsize 53
      		extent data disk byte 12845056 nr 8192
      		extent data offset 0 nr 8192 ram 8192
      		extent compression 0
      checksum tree key (CSUM_TREE ROOT_ITEM 0)
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: default avatarMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      a2ea3d78
    • Filipe David Borba Manana's avatar
      Btrfs: fix tracking of orphan inode count · f19eb84e
      Filipe David Borba Manana authored
      commit 703c88e0 upstream.
      
      In inode.c:btrfs_orphan_add() if we failed to insert the orphan
      item, we would return without decrementing the orphan count that
      we just incremented before attempting the insertion, leaving the
      orphan inode count wrong.
      
      In inode.c:btrfs_orphan_del(), we were decrementing the inode
      orphan count if the bit BTRFS_INODE_ORPHAN_META_RESERVED was set,
      which is logically wrong because it should be decremented if the
      bit BTRFS_INODE_HAS_ORPHAN_ITEM was set - after all we increment
      the count when we set the bit BTRFS_INODE_HAS_ORPHAN_ITEM elsewhere.
      Signed-off-by: default avatarFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: default avatarJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      f19eb84e
    • Steve French's avatar
      Do not send ClientGUID on SMB2.02 dialect · c82b3dd9
      Steve French authored
      commit 3c5f9be1 upstream.
      
      ClientGUID must be zero for SMB2.02 dialect.  See section 2.2.3
      of MS-SMB2. For SMB2.1 and later it must be non-zero.
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      CC: Sachin Prabhu <sprabhu@redhat.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      c82b3dd9
    • Sachin Prabhu's avatar
      cifs: Set client guid on per connection basis · 8f7e86ca
      Sachin Prabhu authored
      commit 39552ea8 upstream.
      
      When mounting from a Windows 2012R2 server, we hit the following
      problem:
      1) Mount with any of the following versions - 2.0, 2.1 or 3.0
      2) unmount
      3) Attempt a mount again using a different SMB version >= 2.0.
      
      You end up with the following failure:
      Status code returned 0xc0000203 STATUS_USER_SESSION_DELETED
      CIFS VFS: Send error in SessSetup = -5
      CIFS VFS: cifs_mount failed w/return code = -5
      
      I cannot reproduce this issue using a Windows 2008 R2 server.
      
      This appears to be caused because we use the same client guid for the
      connection on first mount which we then disconnect and attempt to mount
      again using a different protocol version. By generating a new guid each
      time a new connection is Negotiated, we avoid hitting this problem.
      Signed-off-by: default avatarSachin Prabhu <sprabhu@redhat.com>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      8f7e86ca
    • Steve French's avatar
      Check SMB3 dialects against downgrade attacks · 16e57e55
      Steve French authored
      commit ff1c038a upstream.
      
      When we are running SMB3 or SMB3.02 connections which are signed
      we need to validate the protocol negotiation information,
      to ensure that the negotiate protocol response was not tampered with.
      
      Add the missing FSCTL which is sent at mount time (immediately after
      the SMB3 Tree Connect) to validate that the capabilities match
      what we think the server sent.
      
      "Secure dialect negotiation is introduced in SMB3 to protect against
      man-in-the-middle attempt to downgrade dialect negotiation.
      The idea is to prevent an eavesdropper from downgrading the initially
      negotiated dialect and capabilities between the client and the server."
      
      For more explanation see 2.2.31.4 of MS-SMB2 or
      http://blogs.msdn.com/b/openspecification/archive/2012/06/28/smb3-secure-dialect-negotiation.aspxReviewed-by: default avatarPavel Shilovsky <piastry@etersoft.ru>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      [ddiss@suse.de: backported atop kernel without clone_range support]
      Signed-off-by: default avatarDavid Disseldorp <ddiss@suse.de>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      16e57e55
    • Michal Kubecek's avatar
      xfrm: fix race between netns cleanup and state expire notification · 3f8fd8ad
      Michal Kubecek authored
      commit 21ee543e upstream.
      
      The xfrm_user module registers its pernet init/exit after xfrm
      itself so that its net exit function xfrm_user_net_exit() is
      executed before xfrm_net_exit() which calls xfrm_state_fini() to
      cleanup the SA's (xfrm states). This opens a window between
      zeroing net->xfrm.nlsk pointer and deleting all xfrm_state
      instances which may access it (via the timer). If an xfrm state
      expires in this window, xfrm_exp_state_notify() will pass null
      pointer as socket to nlmsg_multicast().
      
      As the notifications are called inside rcu_read_lock() block, it
      is sufficient to retrieve the nlsk socket with rcu_dereference()
      and check the it for null.
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      3f8fd8ad
    • Michal Kubeček's avatar
      vlan: more careful checksum features handling · 306ba5b2
      Michal Kubeček authored
      commit da08143b upstream.
      
      When combining real_dev's features and vlan_features, simple
      bitwise AND is used. This doesn't work well for checksum
      offloading features as if one set has NETIF_F_HW_CSUM and the
      other NETIF_F_IP_CSUM and/or NETIF_F_IPV6_CSUM, we end up with
      no checksum offloading. However, from the logical point of view
      (how can_checksum_protocol() works), NETIF_F_HW_CSUM contains
      the functionality of NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM so
      that the result should be IP/IPV6.
      
      Add helper function netdev_intersect_features() implementing
      this logic and use it in vlan_dev_fix_features().
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      306ba5b2
    • Ben Hutchings's avatar
      net/compat: Fix minor information leak in siocdevprivate_ioctl() · e45145b6
      Ben Hutchings authored
      commit 417c3522 upstream.
      
      We don't need to check that ifr_data itself is a valid user pointer,
      but we should check &ifr_data is.  Thankfully the copy of ifr_name is
      checked, so this can only leak a few bytes from immediately above the
      user address limit.
      Signed-off-by: default avatarBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      e45145b6
    • Benjamin Poirier's avatar
      net: Do not enable tx-nocache-copy by default · 9f6e089c
      Benjamin Poirier authored
      commit cdb3f4a3 upstream.
      
      There are many cases where this feature does not improve performance or even
      reduces it.
      
      For example, here are the results from tests that I've run using 3.12.6 on one
      Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
      from the Xeon, but they're similar on the i7. All numbers report the
      mean±stddev over 10 runs of 10s.
      
      1) latency tests similar to what is described in "c6e1a0d1 net: Allow no-cache
      copy from user on transmit"
      There is no statistically significant difference between tx-nocache-copy
      on/off.
      nic irqs spread out (one queue per cpu)
      
      200x netperf -r 1400,1
      tx-nocache-copy off
              692000±1000 tps
              50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
      tx-nocache-copy on
              693000±1000 tps
              50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7
      
      200x netperf -r 14000,14000
      tx-nocache-copy off
              86450±80 tps
              50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
      tx-nocache-copy on
              86110±60 tps
              50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20
      
      2) single stream throughput tests
      tx-nocache-copy leads to higher service demand
      
                              throughput  cpu0        cpu1        demand
                              (Gb/s)      (Gcycle)    (Gcycle)    (cycle/B)
      
      nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)
      
      tx-nocache-copy off     9402±5      9.4±0.2                 0.80±0.01
      tx-nocache-copy on      9403±3      9.85±0.04               0.838±0.004
      
      nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)
      
      tx-nocache-copy off     9401±5      5.83±0.03   5.0±0.1     0.923±0.007
      tx-nocache-copy on      9404±2      5.74±0.03   5.523±0.009 0.958±0.002
      
      As a second example, here are some results from Eric Dumazet with latest
      net-next.
      tx-nocache-copy also leads to higher service demand
      
      (cpu is Intel(R) Xeon(R) CPU X5660  @ 2.80GHz)
      
      lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
      lpq83:~# perf stat ./netperf -H lpq84 -c
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
      
       87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000
      
       Performance counter stats for './netperf -H lpq84 -c':
      
             4282.648396 task-clock                #    0.423 CPUs utilized
                   9,348 context-switches          #    0.002 M/sec
                      88 CPU-migrations            #    0.021 K/sec
                     355 page-faults               #    0.083 K/sec
          11,812,797,651 cycles                    #    2.758 GHz                     [82.79%]
           9,020,522,817 stalled-cycles-frontend   #   76.36% frontend cycles idle    [82.54%]
           4,579,889,681 stalled-cycles-backend    #   38.77% backend  cycles idle    [67.33%]
           6,053,172,792 instructions              #    0.51  insns per cycle
                                                   #    1.49  stalled cycles per insn [83.64%]
             597,275,583 branches                  #  139.464 M/sec                   [83.70%]
               8,960,541 branch-misses             #    1.50% of all branches         [83.65%]
      
            10.128990264 seconds time elapsed
      
      lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
      lpq83:~# perf stat ./netperf -H lpq84 -c
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB
      
       87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000
      
       Performance counter stats for './netperf -H lpq84 -c':
      
             2847.375441 task-clock                #    0.281 CPUs utilized
                  11,632 context-switches          #    0.004 M/sec
                      49 CPU-migrations            #    0.017 K/sec
                     354 page-faults               #    0.124 K/sec
           7,646,889,749 cycles                    #    2.686 GHz                     [83.34%]
           6,115,050,032 stalled-cycles-frontend   #   79.97% frontend cycles idle    [83.31%]
           1,726,460,071 stalled-cycles-backend    #   22.58% backend  cycles idle    [66.55%]
           2,079,702,453 instructions              #    0.27  insns per cycle
                                                   #    2.94  stalled cycles per insn [83.22%]
             363,773,213 branches                  #  127.757 M/sec                   [83.29%]
               4,242,732 branch-misses             #    1.17% of all branches         [83.51%]
      
            10.128449949 seconds time elapsed
      
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarBenjamin Poirier <bpoirier@suse.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      9f6e089c