1. 06 Oct, 2015 16 commits
  2. 30 Sep, 2015 24 commits
    • Helge Deller's avatar
      parisc: Filter out spurious interrupts in PA-RISC irq handler · 634a6a28
      Helge Deller authored
      commit b1b4e435 upstream.
      
      When detecting a serial port on newer PA-RISC machines (with iosapic) we have a
      long way to go to find the right IRQ line, registering it, then registering the
      serial port and the irq handler for the serial port. During this phase spurious
      interrupts for the serial port may happen which then crashes the kernel because
      the action handler might not have been set up yet.
      
      So, basically it's a race condition between the serial port hardware and the
      CPU which sets up the necessary fields in the irq sructs. The main reason for
      this race is, that we unmask the serial port irqs too early without having set
      up everything properly before (which isn't easily possible because we need the
      IRQ number to register the serial ports).
      
      This patch is a work-around for this problem. It adds checks to the CPU irq
      handler to verify if the IRQ action field has been initialized already. If not,
      we just skip this interrupt (which isn't critical for a serial port at bootup).
      The real fix would probably involve rewriting all PA-RISC specific IRQ code
      (for CPU, IOSAPIC, GSC and EISA) to use IRQ domains with proper parenting of
      the irq chips and proper irq enabling along this line.
      
      This bug has been in the PA-RISC port since the beginning, but the crashes
      happened very rarely with currently used hardware.  But on the latest machine
      which I bought (a C8000 workstation), which uses the fastest CPUs (4 x PA8900,
      1GHz) and which has the largest possible L1 cache size (64MB each), the kernel
      crashed at every boot because of this race. So, without this patch the machine
      would currently be unuseable.
      
      For the record, here is the flow logic:
      1. serial_init_chip() in 8250_gsc.c calls iosapic_serial_irq().
      2. iosapic_serial_irq() calls txn_alloc_irq() to find the irq.
      3. iosapic_serial_irq() calls cpu_claim_irq() to register the CPU irq
      4. cpu_claim_irq() unmasks the CPU irq (which it shouldn't!)
      5. serial_init_chip() then registers the 8250 port.
      Problems:
      - In step 4 the CPU irq shouldn't have been registered yet, but after step 5
      - If serial irq happens between 4 and 5 have finished, the kernel will crash
      
      Cc: linux-parisc@vger.kernel.org
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      [ luis: backported to 3.16: used Helge's backport ]
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      634a6a28
    • Wilson Kok's avatar
      fib_rules: fix fib rule dumps across multiple skbs · de090271
      Wilson Kok authored
      commit 41fc0143 upstream.
      
      dump_rules returns skb length and not error.
      But when family == AF_UNSPEC, the caller of dump_rules
      assumes that it returns an error. Hence, when family == AF_UNSPEC,
      we continue trying to dump on -EMSGSIZE errors resulting in
      incorrect dump idx carried between skbs belonging to the same dump.
      This results in fib rule dump always only dumping rules that fit
      into the first skb.
      
      This patch fixes dump_rules to return error so that we exit correctly
      and idx is correctly maintained between skbs that are part of the
      same dump.
      Signed-off-by: default avatarWilson Kok <wkok@cumulusnetworks.com>
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      de090271
    • Jesse Gross's avatar
      openvswitch: Zero flows on allocation. · 6d2b5595
      Jesse Gross authored
      commit ae5f2fb1 upstream.
      
      When support for megaflows was introduced, OVS needed to start
      installing flows with a mask applied to them. Since masking is an
      expensive operation, OVS also had an optimization that would only
      take the parts of the flow keys that were covered by a non-zero
      mask. The values stored in the remaining pieces should not matter
      because they are masked out.
      
      While this works fine for the purposes of matching (which must always
      look at the mask), serialization to netlink can be problematic. Since
      the flow and the mask are serialized separately, the uninitialized
      portions of the flow can be encoded with whatever values happen to be
      present.
      
      In terms of functionality, this has little effect since these fields
      will be masked out by definition. However, it leaks kernel memory to
      userspace, which is a potential security vulnerability. It is also
      possible that other code paths could look at the masked key and get
      uninitialized data, although this does not currently appear to be an
      issue in practice.
      
      This removes the mask optimization for flows that are being installed.
      This was always intended to be the case as the mask optimizations were
      really targetting per-packet flow operations.
      
      Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
      Signed-off-by: default avatarJesse Gross <jesse@nicira.com>
      Acked-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [ luis: backported to 3.16: adjusted context ]
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      6d2b5595
    • Marcelo Ricardo Leitner's avatar
      sctp: fix race on protocol/netns initialization · eb084bd1
      Marcelo Ricardo Leitner authored
      commit 8e2d61e0 upstream.
      
      Consider sctp module is unloaded and is being requested because an user
      is creating a sctp socket.
      
      During initialization, sctp will add the new protocol type and then
      initialize pernet subsys:
      
              status = sctp_v4_protosw_init();
              if (status)
                      goto err_protosw_init;
      
              status = sctp_v6_protosw_init();
              if (status)
                      goto err_v6_protosw_init;
      
              status = register_pernet_subsys(&sctp_net_ops);
      
      The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
      is possible for userspace to create SCTP sockets like if the module is
      already fully loaded. If that happens, one of the possible effects is
      that we will have readers for net->sctp.local_addr_list list earlier
      than expected and sctp_net_init() does not take precautions while
      dealing with that list, leading to a potential panic but not limited to
      that, as sctp_sock_init() will copy a bunch of blank/partially
      initialized values from net->sctp.
      
      The race happens like this:
      
           CPU 0                           |  CPU 1
        socket()                           |
         __sock_create                     | socket()
          inet_create                      |  __sock_create
           list_for_each_entry_rcu(        |
              answer, &inetsw[sock->type], |
              list) {                      |   inet_create
            /* no hits */                  |
           if (unlikely(err)) {            |
            ...                            |
            request_module()               |
            /* socket creation is blocked  |
             * the module is fully loaded  |
             */                            |
             sctp_init                     |
              sctp_v4_protosw_init         |
               inet_register_protosw       |
                list_add_rcu(&p->list,     |
                             last_perm);   |
                                           |  list_for_each_entry_rcu(
                                           |     answer, &inetsw[sock->type],
              sctp_v6_protosw_init         |     list) {
                                           |     /* hit, so assumes protocol
                                           |      * is already loaded
                                           |      */
                                           |  /* socket creation continues
                                           |   * before netns is initialized
                                           |   */
              register_pernet_subsys       |
      
      Simply inverting the initialization order between
      register_pernet_subsys() and sctp_v4_protosw_init() is not possible
      because register_pernet_subsys() will create a control sctp socket, so
      the protocol must be already visible by then. Deferring the socket
      creation to a work-queue is not good specially because we loose the
      ability to handle its errors.
      
      So, as suggested by Vlad, the fix is to split netns initialization in
      two moments: defaults and control socket, so that the defaults are
      already loaded by when we register the protocol, while control socket
      initialization is kept at the same moment it is today.
      
      Fixes: 4db67e80 ("sctp: Make the address lists per network namespace")
      Signed-off-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      eb084bd1
    • Daniel Borkmann's avatar
      netlink, mmap: transform mmap skb into full skb on taps · ce90dd09
      Daniel Borkmann authored
      commit 1853c949 upstream.
      
      Ken-ichirou reported that running netlink in mmap mode for receive in
      combination with nlmon will throw a NULL pointer dereference in
      __kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
      to handle kernel paging request". The problem is the skb_clone() in
      __netlink_deliver_tap_skb() for skbs that are mmaped.
      
      I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
      skb has it pointed to netlink_skb_destructor(), set in the handler
      netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
      that in such cases, __kfree_skb() doesn't perform a skb_release_data()
      via skb_release_all(), where skb->head is possibly being freed through
      kfree(head) into slab allocator, although netlink mmap skb->head points
      to the mmap buffer. Similarly, the same has to be done also for large
      netlink skbs where the data area is vmalloced. Therefore, as discussed,
      make a copy for these rather rare cases for now. This fixes the issue
      on my and Ken-ichirou's test-cases.
      
      Reference: http://thread.gmane.org/gmane.linux.network/371129
      Fixes: bcbde0d4 ("net: netlink: virtual tap device management")
      Reported-by: default avatarKen-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarKen-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [ luis: backported to 3.16: adjusted context ]
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      ce90dd09
    • Richard Laing's avatar
      net/ipv6: Correct PIM6 mrt_lock handling · 0fdc68d7
      Richard Laing authored
      commit 25b4a44c upstream.
      
      In the IPv6 multicast routing code the mrt_lock was not being released
      correctly in the MFC iterator, as a result adding or deleting a MIF would
      cause a hang because the mrt_lock could not be acquired.
      
      This fix is a copy of the code for the IPv4 case and ensures that the lock
      is released correctly.
      Signed-off-by: default avatarRichard Laing <richard.laing@alliedtelesis.co.nz>
      Acked-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      0fdc68d7
    • Eugene Shatokhin's avatar
      usbnet: Get EVENT_NO_RUNTIME_PM bit before it is cleared · 781a95d3
      Eugene Shatokhin authored
      commit f50791ac upstream.
      
      It is needed to check EVENT_NO_RUNTIME_PM bit of dev->flags in
      usbnet_stop(), but its value should be read before it is cleared
      when dev->flags is set to 0.
      
      The problem was spotted and the fix was provided by
      Oliver Neukum <oneukum@suse.de>.
      Signed-off-by: default avatarEugene Shatokhin <eugene.shatokhin@rosalab.ru>
      Acked-by: default avatarOliver Neukum <oneukum@suse.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      781a95d3
    • Kevin Cernekee's avatar
      UBI: block: Add missing cache flushes · afdef6be
      Kevin Cernekee authored
      commit 98fb1ffd upstream.
      
      Block drivers are responsible for calling flush_dcache_page() on each
      BIO request. This operation keeps the I$ coherent with the D$ on
      architectures that don't have hardware coherency support. Without this
      flush, random crashes are seen when executing user programs from an ext4
      filesystem backed by a ubiblock device.
      
      This patch is based on the change implemented in commit 2d4dc890
      ("block: add helpers to run flush_dcache_page() against a bio and a
      request's pages").
      
      Fixes: 9d54c8a3 ("UBI: R/O block driver on top of UBI volumes")
      Signed-off-by: default avatarKevin Cernekee <cernekee@chromium.org>
      Signed-off-by: default avatarEzequiel Garcia <ezequiel.garcia@imgtec.com>
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      [ luis: backported to 3.16: used Kevin's backport to 3.14 ]
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      afdef6be
    • Sasha Levin's avatar
      RDS: verify the underlying transport exists before creating a connection · a93002fa
      Sasha Levin authored
      commit 74e98eb0 upstream.
      
      There was no verification that an underlying transport exists when creating
      a connection, this would cause dereferencing a NULL ptr.
      
      It might happen on sockets that weren't properly bound before attempting to
      send a message, which will cause a NULL ptr deref:
      
      [135546.047719] kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
      [135546.051270] Modules linked in:
      [135546.051781] CPU: 4 PID: 15650 Comm: trinity-c4 Not tainted 4.2.0-next-20150902-sasha-00041-gbaa1222-dirty #2527
      [135546.053217] task: ffff8800835bc000 ti: ffff8800bc708000 task.ti: ffff8800bc708000
      [135546.054291] RIP: __rds_conn_create (net/rds/connection.c:194)
      [135546.055666] RSP: 0018:ffff8800bc70fab0  EFLAGS: 00010202
      [135546.056457] RAX: dffffc0000000000 RBX: 0000000000000f2c RCX: ffff8800835bc000
      [135546.057494] RDX: 0000000000000007 RSI: ffff8800835bccd8 RDI: 0000000000000038
      [135546.058530] RBP: ffff8800bc70fb18 R08: 0000000000000001 R09: 0000000000000000
      [135546.059556] R10: ffffed014d7a3a23 R11: ffffed014d7a3a21 R12: 0000000000000000
      [135546.060614] R13: 0000000000000001 R14: ffff8801ec3d0000 R15: 0000000000000000
      [135546.061668] FS:  00007faad4ffb700(0000) GS:ffff880252000000(0000) knlGS:0000000000000000
      [135546.062836] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [135546.063682] CR2: 000000000000846a CR3: 000000009d137000 CR4: 00000000000006a0
      [135546.064723] Stack:
      [135546.065048]  ffffffffafe2055c ffffffffafe23fc1 ffffed00493097bf ffff8801ec3d0008
      [135546.066247]  0000000000000000 00000000000000d0 0000000000000000 ac194a24c0586342
      [135546.067438]  1ffff100178e1f78 ffff880320581b00 ffff8800bc70fdd0 ffff880320581b00
      [135546.068629] Call Trace:
      [135546.069028] ? __rds_conn_create (include/linux/rcupdate.h:856 net/rds/connection.c:134)
      [135546.069989] ? rds_message_copy_from_user (net/rds/message.c:298)
      [135546.071021] rds_conn_create_outgoing (net/rds/connection.c:278)
      [135546.071981] rds_sendmsg (net/rds/send.c:1058)
      [135546.072858] ? perf_trace_lock (include/trace/events/lock.h:38)
      [135546.073744] ? lockdep_init (kernel/locking/lockdep.c:3298)
      [135546.074577] ? rds_send_drop_to (net/rds/send.c:976)
      [135546.075508] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3795)
      [135546.076349] ? __might_fault (mm/memory.c:3795)
      [135546.077179] ? rds_send_drop_to (net/rds/send.c:976)
      [135546.078114] sock_sendmsg (net/socket.c:611 net/socket.c:620)
      [135546.078856] SYSC_sendto (net/socket.c:1657)
      [135546.079596] ? SYSC_connect (net/socket.c:1628)
      [135546.080510] ? trace_dump_stack (kernel/trace/trace.c:1926)
      [135546.081397] ? ring_buffer_unlock_commit (kernel/trace/ring_buffer.c:2479 kernel/trace/ring_buffer.c:2558 kernel/trace/ring_buffer.c:2674)
      [135546.082390] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
      [135546.083410] ? trace_event_raw_event_sys_enter (include/trace/events/syscalls.h:16)
      [135546.084481] ? do_audit_syscall_entry (include/trace/events/syscalls.h:16)
      [135546.085438] ? trace_buffer_unlock_commit (kernel/trace/trace.c:1749)
      [135546.085515] rds_ib_laddr_check(): addr 36.74.25.172 ret -99 node type -1
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      a93002fa
    • Paul Mackerras's avatar
      powerpc/MSI: Fix race condition in tearing down MSI interrupts · 18a7214f
      Paul Mackerras authored
      commit e297c939 upstream.
      
      This fixes a race which can result in the same virtual IRQ number
      being assigned to two different MSI interrupts.  The most visible
      consequence of that is usually a warning and stack trace from the
      sysfs code about an attempt to create a duplicate entry in sysfs.
      
      The race happens when one CPU (say CPU 0) is disposing of an MSI
      while another CPU (say CPU 1) is setting up an MSI.  CPU 0 calls
      (for example) pnv_teardown_msi_irqs(), which calls
      msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its
      hardware IRQ number) is no longer in use.  Then, before CPU 0 gets
      to calling irq_dispose_mapping() to free up the virtal IRQ number,
      CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an
      MSI, and gets the same hardware IRQ number that CPU 0 just freed.
      CPU 1 then calls irq_create_mapping() to get a virtual IRQ number,
      which sees that there is currently a mapping for that hardware IRQ
      number and returns the corresponding virtual IRQ number (which is
      the same virtual IRQ number that CPU 0 was using).  CPU 0 then
      calls irq_dispose_mapping() and frees that virtual IRQ number.
      Now, if another CPU comes along and calls irq_create_mapping(), it
      is likely to get the virtual IRQ number that was just freed,
      resulting in the same virtual IRQ number apparently being used for
      two different hardware interrupts.
      
      To fix this race, we just move the call to msi_bitmap_free_hwirqs()
      to after the call to irq_dispose_mapping().  Since virq_to_hw()
      doesn't work for the virtual IRQ number after irq_dispose_mapping()
      has been called, we need to call it before irq_dispose_mapping() and
      remember the result for the msi_bitmap_free_hwirqs() call.
      
      The pattern of calling msi_bitmap_free_hwirqs() before
      irq_dispose_mapping() appears in 5 places under arch/powerpc, and
      appears to have originated in commit 05af7bd2 ("[POWERPC] MPIC
      U3/U4 MSI backend") from 2007.
      
      Fixes: 05af7bd2 ("[POWERPC] MPIC U3/U4 MSI backend")
      Reported-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      18a7214f
    • Eric Whitney's avatar
      ext4: fix loss of delalloc extent info in ext4_zero_range() · 125fa8e2
      Eric Whitney authored
      commit 94426f4b upstream.
      
      In ext4_zero_range(), removing a file's entire block range from the
      extent status tree removes all records of that file's delalloc extents.
      The delalloc accounting code uses this information, and its loss can
      then lead to accounting errors and kernel warnings at writeback time and
      subsequent file system damage.  This is most noticeable on bigalloc
      file systems where code in ext4_ext_map_blocks() handles cases where
      delalloc extents share clusters with a newly allocated extent.
      
      Because we're not deleting a block range and are correctly updating the
      status of its associated extent, there is no need to remove anything
      from the extent status tree.
      
      When this patch is combined with an unrelated bug fix for
      ext4_zero_range(), kernel warnings and e2fsck errors reported during
      xfstests runs on bigalloc filesystems are greatly reduced without
      introducing regressions on other xfstests-bld test scenarios.
      Signed-off-by: default avatarEric Whitney <enwlinux@gmail.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      125fa8e2
    • Yann Droneaud's avatar
      perf/x86: Fix copy_from_user_nmi() return if range is not ok · 324e82a3
      Yann Droneaud authored
      commit ebf2d268 upstream.
      
      Commit 0a196848 ("perf: Fix arch_perf_out_copy_user default"),
      changes copy_from_user_nmi() to return the number of
      remaining bytes so that it behave like copy_from_user().
      
      Unfortunately, when the range is outside of the process
      memory, the return value  is still the number of byte
      copied, eg. 0, instead of the remaining bytes.
      
      As all users of copy_from_user_nmi() were modified as
      part of commit 0a196848, the function should be
      fixed to return the total number of bytes if range is
      not correct.
      Signed-off-by: default avatarYann Droneaud <ydroneaud@opteya.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1435001923-30986-1-git-send-email-ydroneaud@opteya.comSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Yann Droneaud <ydroneaud@opteya.com>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      324e82a3
    • NeilBrown's avatar
      md/raid10: always set reshape_safe when initializing reshape_position. · f912b3e0
      NeilBrown authored
      commit 299b0685 upstream.
      
      'reshape_position' tracks where in the reshape we have reached.
      'reshape_safe' tracks where in the reshape we have safely recorded
      in the metadata.
      
      These are compared to determine when to update the metadata.
      So it is important that reshape_safe is initialised properly.
      Currently it isn't.  When starting a reshape from the beginning
      it usually has the correct value by luck.  But when reducing the
      number of devices in a RAID10, it has the wrong value and this leads
      to the metadata not being updated correctly.
      This can lead to corruption if the reshape is not allowed to complete.
      
      This patch is suitable for any -stable kernel which supports RAID10
      reshape, which is 3.5 and later.
      
      Fixes: 3ea7daa5 ("md/raid10: add reshape support")
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      f912b3e0
    • NeilBrown's avatar
      md: flush ->event_work before stopping array. · 66dcd4a1
      NeilBrown authored
      commit ee5d004f upstream.
      
      The 'event_work' worker used by dm-raid may still be running
      when the array is stopped.  This can result in an oops.
      
      So flush the workqueue on which it is run after detaching
      and before destroying the device.
      Reported-by: default avatarHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Fixes: 9d09e663 ("dm: raid456 basic support")
      [ luis: backported to 3.16: adjusted context ]
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      66dcd4a1
    • Christoph Hellwig's avatar
      scsi_dh: fix randconfig build error · dbd9d457
      Christoph Hellwig authored
      commit 294ab783 upstream.
      
      It looks like the Kconfig check that was meant to fix this (commit
      fe9233fb [SCSI] scsi_dh: fix kconfig related
      build errors) was actually reversed, but no-one noticed until the new set of
      patches which separated DM and SCSI_DH).
      
      Fixes: fe9233fbSigned-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJames Bottomley <JBottomley@Odin.com>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      dbd9d457
    • Daniel Borkmann's avatar
      netlink, mmap: fix edge-case leakages in nf queue zero-copy · f9810d56
      Daniel Borkmann authored
      commit 6bb0fef4 upstream.
      
      When netlink mmap on receive side is the consumer of nf queue data,
      it can happen that in some edge cases, we write skb shared info into
      the user space mmap buffer:
      
      Assume a possible rx ring frame size of only 4096, and the network skb,
      which is being zero-copied into the netlink skb, contains page frags
      with an overall skb->len larger than the linear part of the netlink
      skb.
      
      skb_zerocopy(), which is generic and thus not aware of the fact that
      shared info cannot be accessed for such skbs then tries to write and
      fill frags, thus leaking kernel data/pointers and in some corner cases
      possibly writing out of bounds of the mmap area (when filling the
      last slot in the ring buffer this way).
      
      I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has
      an advertised length larger than 4096, where the linear part is visible
      at the slot beginning, and the leaked sizeof(struct skb_shared_info)
      has been written to the beginning of the next slot (also corrupting
      the struct nl_mmap_hdr slot header incl. status etc), since skb->end
      points to skb->data + ring->frame_size - NL_MMAP_HDRLEN.
      
      The fix adds and lets __netlink_alloc_skb() take the actual needed
      linear room for the network skb + meta data into account. It's completely
      irrelevant for non-mmaped netlink sockets, but in case mmap sockets
      are used, it can be decided whether the available skb_tailroom() is
      really large enough for the buffer, or whether it needs to internally
      fallback to a normal alloc_skb().
      
      >From nf queue side, the information whether the destination port is
      an mmap RX ring is not really available without extra port-to-socket
      lookup, thus it can only be determined in lower layers i.e. when
      __netlink_alloc_skb() is called that checks internally for this. I
      chose to add the extra ldiff parameter as mmap will then still work:
      We have data_len and hlen in nfqnl_build_packet_message(), data_len
      is the full length (capped at queue->copy_range) for skb_zerocopy()
      and hlen some possible part of data_len that needs to be copied; the
      rem_len variable indicates the needed remaining linear mmap space.
      
      The only other workaround in nf queue internally would be after
      allocation time by f.e. cap'ing the data_len to the skb_tailroom()
      iff we deal with an mmap skb, but that would 1) expose the fact that
      we use a mmap skb to upper layers, and 2) trim the skb where we
      otherwise could just have moved the full skb into the normal receive
      queue.
      
      After the patch, in my test case the ring slot doesn't fit and therefore
      shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and
      thus needs to be picked up via recv().
      
      Fixes: 3ab1f683 ("nfnetlink: add support for memory mapped netlink")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [ luis: backported to 3.16: adjusted context ]
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      f9810d56
    • Sergei Shtylyov's avatar
      fixed_phy: pass 'irq' to fixed_phy_add() · 9a8359ae
      Sergei Shtylyov authored
      commit bd1a05ee upstream.
      
      I've noticed  that fixed_phy_register() ignores its 'irq' parameter instead of
      passing it to fixed_phy_add(). Luckily, fixed_phy_register()  seems to  always
      be  called with PHY_POLL  for 'irq'... :-)
      
      Fixes: a7595121 ("net: phy: extend fixed driver with fixed_phy_register()")
      Signed-off-by: default avatarSergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Acked-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [ luis: backported to 3.16:
        - fixed_phy_add() only has 3 parameters in 3.16 kernel ]
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      9a8359ae
    • Eric Dumazet's avatar
      task_work: remove fifo ordering guarantee · 34180538
      Eric Dumazet authored
      commit c8219906 upstream.
      
      In commit f341861f ("task_work: add a scheduling point in
      task_work_run()") I fixed a latency problem adding a cond_resched()
      call.
      
      Later, commit ac3d0da8 added yet another loop to reverse a list,
      bringing back the latency spike :
      
      I've seen in some cases this loop taking 275 ms, if for example a
      process with 2,000,000 files is killed.
      
      We could add yet another cond_resched() in the reverse loop, or we
      can simply remove the reversal, as I do not think anything
      would depend on order of task_work_add() submitted works.
      
      Fixes: ac3d0da8 ("task_work: Make task_work_add() lockless")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarMaciej Żenczykowski <maze@google.com>
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      34180538
    • Daniel Borkmann's avatar
      ipv6: fix exthdrs offload registration in out_rt path · 723f683d
      Daniel Borkmann authored
      commit e41b0bed upstream.
      
      We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
      but in error path, we try to unregister it with inet_del_offload(). This
      doesn't seem correct, it should actually be inet6_del_offload(), also
      ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
      also uses rthdr_offload twice), but it got removed entirely later on.
      
      Fixes: 3336288a ("ipv6: Switch to using new offload infrastructure.")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      723f683d
    • Jialing Fu's avatar
      mmc: core: fix race condition in mmc_wait_data_done · 4a0b1afc
      Jialing Fu authored
      commit 71f8a4b8 upstream.
      
      The following panic is captured in ker3.14, but the issue still exists
      in latest kernel.
      ---------------------------------------------------------------------
      [   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer dereference
      at virtual address 00000578
      ......
      [   20.738499] c0 3136 (Compiler) PC is at _raw_spin_lock_irqsave+0x24/0x60
      [   20.738527] c0 3136 (Compiler) LR is at _raw_spin_lock_irqsave+0x20/0x60
      [   20.740134] c0 3136 (Compiler) Call trace:
      [   20.740165] c0 3136 (Compiler) [<ffffffc0008ee900>] _raw_spin_lock_irqsave+0x24/0x60
      [   20.740200] c0 3136 (Compiler) [<ffffffc0000dd024>] __wake_up+0x1c/0x54
      [   20.740230] c0 3136 (Compiler) [<ffffffc000639414>] mmc_wait_data_done+0x28/0x34
      [   20.740262] c0 3136 (Compiler) [<ffffffc0006391a0>] mmc_request_done+0xa4/0x220
      [   20.740314] c0 3136 (Compiler) [<ffffffc000656894>] sdhci_tasklet_finish+0xac/0x264
      [   20.740352] c0 3136 (Compiler) [<ffffffc0000a2b58>] tasklet_action+0xa0/0x158
      [   20.740382] c0 3136 (Compiler) [<ffffffc0000a2078>] __do_softirq+0x10c/0x2e4
      [   20.740411] c0 3136 (Compiler) [<ffffffc0000a24bc>] irq_exit+0x8c/0xc0
      [   20.740439] c0 3136 (Compiler) [<ffffffc00008489c>] handle_IRQ+0x48/0xac
      [   20.740469] c0 3136 (Compiler) [<ffffffc000081428>] gic_handle_irq+0x38/0x7c
      ----------------------------------------------------------------------
      Because in SMP, "mrq" has race condition between below two paths:
      path1: CPU0: <tasklet context>
        static void mmc_wait_data_done(struct mmc_request *mrq)
        {
           mrq->host->context_info.is_done_rcv = true;
           //
           // If CPU0 has just finished "is_done_rcv = true" in path1, and at
           // this moment, IRQ or ICache line missing happens in CPU0.
           // What happens in CPU1 (path2)?
           //
           // If the mmcqd thread in CPU1(path2) hasn't entered to sleep mode:
           // path2 would have chance to break from wait_event_interruptible
           // in mmc_wait_for_data_req_done and continue to run for next
           // mmc_request (mmc_blk_rw_rq_prep).
           //
           // Within mmc_blk_rq_prep, mrq is cleared to 0.
           // If below line still gets host from "mrq" as the result of
           // compiler, the panic happens as we traced.
           wake_up_interruptible(&mrq->host->context_info.wait);
        }
      
      path2: CPU1: <The mmcqd thread runs mmc_queue_thread>
        static int mmc_wait_for_data_req_done(...
        {
           ...
           while (1) {
                 wait_event_interruptible(context_info->wait,
                         (context_info->is_done_rcv ||
                          context_info->is_new_req));
           	   static void mmc_blk_rw_rq_prep(...
                 {
                 ...
                 memset(brq, 0, sizeof(struct mmc_blk_request));
      
      This issue happens very coincidentally; however adding mdelay(1) in
      mmc_wait_data_done as below could duplicate it easily.
      
         static void mmc_wait_data_done(struct mmc_request *mrq)
         {
           mrq->host->context_info.is_done_rcv = true;
      +    mdelay(1);
           wake_up_interruptible(&mrq->host->context_info.wait);
          }
      
      At runtime, IRQ or ICache line missing may just happen at the same place
      of the mdelay(1).
      
      This patch gets the mmc_context_info at the beginning of function, it can
      avoid this race condition.
      Signed-off-by: default avatarJialing Fu <jlfu@marvell.com>
      Tested-by: default avatarShawn Lin <shawn.lin@rock-chips.com>
      Fixes: 2220eedf ("mmc: fix async request mechanism ....")
      Signed-off-by: default avatarShawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      4a0b1afc
    • Yishai Hadas's avatar
      IB/uverbs: Fix race between ib_uverbs_open and remove_one · 99521bca
      Yishai Hadas authored
      commit 35d4a0b6 upstream.
      
      Fixes: 2a72f212 ("IB/uverbs: Remove dev_table")
      
      Before this commit there was a device look-up table that was protected
      by a spin_lock used by ib_uverbs_open and by ib_uverbs_remove_one. When
      it was dropped and container_of was used instead, it enabled the race
      with remove_one as dev might be freed just after:
      dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev) but
      before the kref_get.
      
      In addition, this buggy patch added some dead code as
      container_of(x,y,z) can never be NULL and so dev can never be NULL.
      As a result the comment above ib_uverbs_open saying "the open method
      will either immediately run -ENXIO" is wrong as it can never happen.
      
      The solution follows Jason Gunthorpe suggestion from below URL:
      https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html
      
      cdev will hold a kref on the parent (the containing structure,
      ib_uverbs_device) and only when that kref is released it is
      guaranteed that open will never be called again.
      
      In addition, fixes the active count scheme to use an atomic
      not a kref to prevent WARN_ON as pointed by above comment
      from Jason.
      Signed-off-by: default avatarYishai Hadas <yishaih@mellanox.com>
      Signed-off-by: default avatarShachar Raindel <raindel@mellanox.com>
      Reviewed-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      99521bca
    • Noa Osherovich's avatar
      IB/mlx4: Use correct SL on AH query under RoCE · 79243b3f
      Noa Osherovich authored
      commit 5e99b139 upstream.
      
      The mlx4 IB driver implementation for ib_query_ah used a wrong offset
      (28 instead of 29) when link type is Ethernet. Fixed to use the correct one.
      
      Fixes: fa417f7b ('IB/mlx4: Add support for IBoE')
      Signed-off-by: default avatarShani Michaeli <shanim@mellanox.com>
      Signed-off-by: default avatarNoa Osherovich <noaos@mellanox.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      79243b3f
    • Jack Morgenstein's avatar
      IB/mlx4: Forbid using sysfs to change RoCE pkeys · 58e1eb81
      Jack Morgenstein authored
      commit 2b135db3 upstream.
      
      The pkey mapping for RoCE must remain the default mapping:
      VFs:
        virtual index 0 = mapped to real index 0 (0xFFFF)
        All others indices: mapped to a real pkey index containing an
                            invalid pkey.
      PF:
        virtual index i = real index i.
      
      Don't allow users to change these mappings using files found in
      sysfs.
      
      Fixes: c1e7e466 ('IB/mlx4: Add iov directory in sysfs under the ib device')
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      58e1eb81
    • Jack Morgenstein's avatar
      IB/mlx4: Fix potential deadlock when sending mad to wire · 72d64a06
      Jack Morgenstein authored
      commit 90c1d8b6 upstream.
      
      send_mad_to_wire takes the same spinlock that is taken in
      the interrupt context.  Therefore, it needs irqsave/restore.
      
      Fixes: b9c5d6a6 ('IB/mlx4: Add multicast group (MCG) paravirtualization for SR-IOV')
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarLuis Henriques <luis.henriques@canonical.com>
      72d64a06