1. 22 Mar, 2017 40 commits
    • Alexey Kardashevskiy's avatar
      powerpc/iommu: Pass mm_struct to init/cleanup helpers · 2ba7ef21
      Alexey Kardashevskiy authored
      [ Upstream commit 88f54a35 ]
      
      We are going to get rid of @current references in mmu_context_boos3s64.c
      and cache mm_struct in the VFIO container. Since mm_context_t does not
      have reference counting, we will be using mm_struct which does have
      the reference counter.
      
      This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
      than mm_context_t (which is embedded into mm).
      
      This should not cause any behavioral change.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2ba7ef21
    • Alexey Kardashevskiy's avatar
      vfio/spapr: Postpone allocation of userspace version of TCE table · 5d8b3e75
      Alexey Kardashevskiy authored
      [ Upstream commit 39701e56 ]
      
      The iommu_table struct manages a hardware TCE table and a vmalloc'd
      table with corresponding userspace addresses. Both are allocated when
      the default DMA window is created and this happens when the very first
      group is attached to a container.
      
      As we are going to allow the userspace to configure container in one
      memory context and pas container fd to another, we have to postpones
      such allocations till a container fd is passed to the destination
      user process so we would account locked memory limit against the actual
      container user constrainsts.
      
      This postpones the it_userspace array allocation till it is used first
      time for mapping. The unmapping patch already checks if the array is
      allocated.
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5d8b3e75
    • Vitaly Kuznetsov's avatar
      Drivers: hv: ring_buffer: count on wrap around mappings in get_next_pkt_raw() (v2) · 3c0cbb47
      Vitaly Kuznetsov authored
      [ Upstream commit fa32ff65 ]
      
      With wrap around mappings in place we can always provide drivers with
      direct links to packets on the ring buffer, even when they wrap around.
      Do the required updates to get_next_pkt_raw()/put_pkt_raw()
      
      The first version of this commit was reverted (65a532f3) to deal with
      cross-tree merge issues which are (hopefully) resolved now.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarK. Y. Srinivasan <kys@microsoft.com>
      Tested-by: default avatarDexuan Cui <decui@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c0cbb47
    • Thomas Falcon's avatar
      ibmveth: calculate gso_segs for large packets · 3e5a7f5b
      Thomas Falcon authored
      [ Upstream commit 94acf164 ]
      
      Include calculations to compute the number of segments
      that comprise an aggregated large packet.
      Signed-off-by: default avatarThomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Reviewed-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Reviewed-by: default avatarJonathan Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3e5a7f5b
    • Gavin Shan's avatar
      PCI: Do any VF BAR updates before enabling the BARs · fb7c521a
      Gavin Shan authored
      [ Upstream commit f40ec3c7 ]
      
      Previously we enabled VFs and enable their memory space before calling
      pcibios_sriov_enable().  But pcibios_sriov_enable() may update the VF BARs:
      for example, on PPC PowerNV we may change them to manage the association of
      VFs to PEs.
      
      Because 64-bit BARs cannot be updated atomically, it's unsafe to update
      them while they're enabled.  The half-updated state may conflict with other
      devices in the system.
      
      Call pcibios_sriov_enable() before enabling the VFs so any BAR updates
      happen while the VF BARs are disabled.
      
      [bhelgaas: changelog]
      Tested-by: default avatarCarol Soto <clsoto@us.ibm.com>
      Signed-off-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb7c521a
    • Bjorn Helgaas's avatar
      PCI: Ignore BAR updates on virtual functions · 3d58444d
      Bjorn Helgaas authored
      [ Upstream commit 63880b23 ]
      
      VF BARs are read-only zero, so updating VF BARs will not have any effect.
      See the SR-IOV spec r1.1, sec 3.4.1.11.
      
      We already ignore these updates because of 70675e0b ("PCI: Don't try to
      restore VF BARs"); this merely restructures it slightly to make it easier
      to split updates for standard and SR-IOV BARs.
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d58444d
    • Bjorn Helgaas's avatar
      PCI: Update BARs using property bits appropriate for type · 74cce811
      Bjorn Helgaas authored
      [ Upstream commit 45d004f4 ]
      
      The BAR property bits (0-3 for memory BARs, 0-1 for I/O BARs) are supposed
      to be read-only, but we do save them in res->flags and include them when
      updating the BAR.
      
      Mask the I/O property bits with ~PCI_BASE_ADDRESS_IO_MASK (0x3) instead of
      PCI_REGION_FLAG_MASK (0xf) to make it obvious that we can't corrupt bits
      2-3 of I/O addresses.
      
      Use PCI_ROM_ADDRESS_MASK for ROM BARs.  This means we'll only check the top
      21 bits (instead of the 28 bits we used to check) of a ROM BAR to see if
      the update was successful.
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      74cce811
    • Bjorn Helgaas's avatar
      PCI: Don't update VF BARs while VF memory space is enabled · a38012dc
      Bjorn Helgaas authored
      [ Upstream commit 546ba9f8 ]
      
      If we update a VF BAR while it's enabled, there are two potential problems:
      
        1) Any driver that's using the VF has a cached BAR value that is stale
           after the update, and
      
        2) We can't update 64-bit BARs atomically, so the intermediate state
           (new lower dword with old upper dword) may conflict with another
           device, and an access by a driver unrelated to the VF may cause a bus
           error.
      
      Warn about attempts to update VF BARs while they are enabled.  This is a
      programming error, so use dev_WARN() to get a backtrace.
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a38012dc
    • Bjorn Helgaas's avatar
      PCI: Decouple IORESOURCE_ROM_ENABLE and PCI_ROM_ADDRESS_ENABLE · bb479246
      Bjorn Helgaas authored
      [ Upstream commit 7a6d312b ]
      
      Remove the assumption that IORESOURCE_ROM_ENABLE == PCI_ROM_ADDRESS_ENABLE.
      PCI_ROM_ADDRESS_ENABLE is the ROM enable bit defined by the PCI spec, so if
      we're reading or writing a BAR register value, that's what we should use.
      IORESOURCE_ROM_ENABLE is a corresponding bit in struct resource flags.
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb479246
    • Bjorn Helgaas's avatar
      PCI: Add comments about ROM BAR updating · ed09d211
      Bjorn Helgaas authored
      [ Upstream commit 0b457dde ]
      
      pci_update_resource() updates a hardware BAR so its address matches the
      kernel's struct resource UNLESS it's a disabled ROM BAR.  We only update
      those when we enable the ROM.
      
      It's not obvious from the code why ROM BARs should be handled specially.
      Apparently there are Matrox devices with defective ROM BARs that read as
      zero when disabled.  That means that if pci_enable_rom() reads the disabled
      BAR, sets PCI_ROM_ADDRESS_ENABLE (without re-inserting the address), and
      writes it back, it would enable the ROM at address zero.
      
      Add comments and references to explain why we can't make the code look more
      rational.
      
      The code changes are from 755528c8 ("Ignore disabled ROM resources at
      setup") and 8085ce08 ("[PATCH] Fix PCI ROM mapping").
      
      Link: https://lkml.org/lkml/2005/8/30/138Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed09d211
    • Bjorn Helgaas's avatar
      PCI: Remove pci_resource_bar() and pci_iov_resource_bar() · 7b65c3a8
      Bjorn Helgaas authored
      [ Upstream commit 286c2378 ]
      
      pci_std_update_resource() only deals with standard BARs, so we don't have
      to worry about the complications of VF BARs in an SR-IOV capability.
      
      Compute the BAR address inline and remove pci_resource_bar().  That makes
      pci_iov_resource_bar() unused, so remove that as well.
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7b65c3a8
    • Bjorn Helgaas's avatar
      PCI: Separate VF BAR updates from standard BAR updates · 6a5f3e66
      Bjorn Helgaas authored
      [ Upstream commit 6ffa2489 ]
      
      Previously pci_update_resource() used the same code path for updating
      standard BARs and VF BARs in SR-IOV capabilities.
      
      Split the VF BAR update into a new pci_iov_update_resource() internal
      interface, which makes it simpler to compute the BAR address (we can get
      rid of pci_resource_bar() and pci_iov_resource_bar()).
      
      This patch:
      
        - Renames pci_update_resource() to pci_std_update_resource(),
        - Adds pci_iov_update_resource(),
        - Makes pci_update_resource() a wrapper that calls the appropriate one,
      
      No functional change intended.
      Signed-off-by: default avatarBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a5f3e66
    • Vitaly Kuznetsov's avatar
      x86/hyperv: Handle unknown NMIs on one CPU when unknown_nmi_panic · 29d92878
      Vitaly Kuznetsov authored
      [ Upstream commit 59107e2f ]
      
      There is a feature in Hyper-V ('Debug-VM --InjectNonMaskableInterrupt')
      which injects NMI to the guest. We may want to crash the guest and do kdump
      on this NMI by enabling unknown_nmi_panic. To make kdump succeed we need to
      allow the kdump kernel to re-establish VMBus connection so it will see
      VMBus devices (storage, network,..).
      
      To properly unload VMBus making it possible to start over during kdump we
      need to do the following:
      
       - Send an 'unload' message to the hypervisor. This can be done on any CPU
         so we do this the crashing CPU.
      
       - Receive the 'unload finished' reply message. WS2012R2 delivers this
         message to the CPU which was used to establish VMBus connection during
         module load and this CPU may differ from the CPU sending 'unload'.
      
      Receiving a VMBus message means the following:
      
       - There is a per-CPU slot in memory for one message. This slot can in
         theory be accessed by any CPU.
      
       - We get an interrupt on the CPU when a message was placed into the slot.
      
       - When we read the message we need to clear the slot and signal the fact
         to the hypervisor. In case there are more messages to this CPU pending
         the hypervisor will deliver the next message. The signaling is done by
         writing to an MSR so this can only be done on the appropriate CPU.
      
      To avoid doing cross-CPU work on crash we have vmbus_wait_for_unload()
      function which checks message slots for all CPUs in a loop waiting for the
      'unload finished' messages. However, there is an issue which arises when
      these conditions are met:
      
       - We're crashing on a CPU which is different from the one which was used
         to initially contact the hypervisor.
      
       - The CPU which was used for the initial contact is blocked with interrupts
         disabled and there is a message pending in the message slot.
      
      In this case we won't be able to read the 'unload finished' message on the
      crashing CPU. This is reproducible when we receive unknown NMIs on all CPUs
      simultaneously: the first CPU entering panic() will proceed to crash and
      all other CPUs will stop themselves with interrupts disabled.
      
      The suggested solution is to handle unknown NMIs for Hyper-V guests on the
      first CPU which gets them only. This will allow us to rely on VMBus
      interrupt handler being able to receive the 'unload finish' message in
      case it is delivered to a different CPU.
      
      The issue is not reproducible on WS2016 as Debug-VM delivers NMI to the
      boot CPU only, WS2012R2 and earlier Hyper-V versions are affected.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Acked-by: default avatarK. Y. Srinivasan <kys@microsoft.com>
      Cc: devel@linuxdriverproject.org
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Link: http://lkml.kernel.org/r/20161202100720.28121-1-vkuznets@redhat.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      29d92878
    • Michael Cyr's avatar
      scsi: ibmvscsis: Synchronize cmds at remove time · 456be98b
      Michael Cyr authored
      [ Upstream commit 8bf11557 ]
      
      This patch adds code to disconnect from the client, which will make sure
      any outstanding commands have been completed, before continuing on with
      the remove operation.
      Signed-off-by: default avatarMichael Cyr <mikecyr@us.ibm.com>
      Signed-off-by: default avatarBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Tested-by: default avatarSteven Royer <seroyer@linux.vnet.ibm.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      456be98b
    • Michael Cyr's avatar
      scsi: ibmvscsis: Synchronize cmds at tpg_enable_store time · 94700877
      Michael Cyr authored
      [ Upstream commit c9b3379f ]
      
      This patch changes the way the IBM vSCSI server driver manages its
      Command/Response Queue (CRQ).  We used to register the CRQ with phyp at
      probe time.  Now we wait until tpg_enable_store.  Similarly, when
      tpg_enable_store is called to "disable" (i.e. the stored value is 0),
      we unregister the queue with phyp.
      
      One consquence to this is that we have no need for the PART_UP_WAIT_ENAB
      state, since we can't get an Init Message from the client in our CRQ if
      we're waiting to be enabled, since we haven't registered the queue yet.
      Signed-off-by: default avatarMichael Cyr <mikecyr@us.ibm.com>
      Signed-off-by: default avatarBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Tested-by: default avatarSteven Royer <seroyer@linux.vnet.ibm.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      94700877
    • Michael Cyr's avatar
      scsi: ibmvscsis: Rearrange functions for future patches · 189491f8
      Michael Cyr authored
      [ Upstream commit 79fac9c9 ]
      
      This patch reorders functions in a manner necessary for a follow-on
      patch.  It also makes some minor styling changes (mostly removing extra
      spaces) and fixes some typos.
      
      There are no code changes in this patch, with one exception: due to the
      reordering of the functions, I needed to explicitly declare a function
      at the top of the file.  However, this will be removed in the next patch,
      since the code requiring the predeclaration will be removed.
      Signed-off-by: default avatarMichael Cyr <mikecyr@us.ibm.com>
      Signed-off-by: default avatarBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Tested-by: default avatarSteven Royer <seroyer@linux.vnet.ibm.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      189491f8
    • Michael Cyr's avatar
    • Michael Cyr's avatar
    • Michael Cyr's avatar
    • Todd Fujinaka's avatar
      igb: add i211 to i210 PHY workaround · 61229e62
      Todd Fujinaka authored
      [ Upstream commit 5bc8c230 ]
      
      i210 and i211 share the same PHY but have different PCI IDs. Don't
      forget i211 for any i210 workarounds.
      Signed-off-by: default avatarTodd Fujinaka <todd.fujinaka@intel.com>
      Tested-by: default avatarAaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61229e62
    • Chris J Arges's avatar
      igb: Workaround for igb i210 firmware issue · 15ffc931
      Chris J Arges authored
      [ Upstream commit 4e684f59 ]
      
      Sometimes firmware may not properly initialize I347AT4_PAGE_SELECT causing
      the probe of an igb i210 NIC to fail. This patch adds an addition zeroing
      of this register during igb_get_phy_id to workaround this issue.
      
      Thanks for Jochen Henneberg for the idea and original patch.
      Signed-off-by: default avatarChris J Arges <christopherarges@gmail.com>
      Tested-by: default avatarAaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      15ffc931
    • Dan Streetman's avatar
      xen: do not re-use pirq number cached in pci device msi msg data · 4b40611a
      Dan Streetman authored
      [ Upstream commit c74fd80f ]
      
      Revert the main part of commit:
      af42b8d1 ("xen: fix MSI setup and teardown for PV on HVM guests")
      
      That commit introduced reading the pci device's msi message data to see
      if a pirq was previously configured for the device's msi/msix, and re-use
      that pirq.  At the time, that was the correct behavior.  However, a
      later change to Qemu caused it to call into the Xen hypervisor to unmap
      all pirqs for a pci device, when the pci device disables its MSI/MSIX
      vectors; specifically the Qemu commit:
      c976437c7dba9c7444fb41df45468968aaa326ad
      ("qemu-xen: free all the pirqs for msi/msix when driver unload")
      
      Once Qemu added this pirq unmapping, it was no longer correct for the
      kernel to re-use the pirq number cached in the pci device msi message
      data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
      pirqs when the pci device disables its MSI/MSIX vectors.
      
      This bug is causing failures to initialize multiple NVMe controllers
      under Xen, because the NVMe driver sets up a single MSIX vector for
      each controller (concurrently), and then after using that to talk to
      the controller for some configuration data, it disables the single MSIX
      vector and re-configures all the MSIX vectors it needs.  So the MSIX
      setup code tries to re-use the cached pirq from the first vector
      for each controller, but the hypervisor has already given away that
      pirq to another controller, and its initialization fails.
      
      This is discussed in more detail at:
      https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
      
      Fixes: af42b8d1 ("xen: fix MSI setup and teardown for PV on HVM guests")
      Signed-off-by: default avatarDan Streetman <dan.streetman@canonical.com>
      Reviewed-by: default avatarStefano Stabellini <sstabellini@kernel.org>
      Acked-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@verizon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4b40611a
    • Krister Johansen's avatar
      dmaengine: iota: ioat_alloc_chan_resources should not perform sleeping allocations. · 2382c148
      Krister Johansen authored
      commit 21d25f6a upstream.
      
      On a kernel with DEBUG_LOCKS, ioat_free_chan_resources triggers an
      in_interrupt() warning.  With PROVE_LOCKING, it reports detecting a
      SOFTIRQ-safe to SOFTIRQ-unsafe lock ordering in the same code path.
      
      This is because dma_generic_alloc_coherent() checks if the GFP flags
      permit blocking.  It allocates from different subsystems if blocking is
      permitted.  The free path knows how to return the memory to the correct
      allocator.  If GFP_KERNEL is specified then the alloc and free end up
      going through cma_alloc(), which uses mutexes.
      
      Given that ioat_free_chan_resources() can be called in interrupt
      context, ioat_alloc_chan_resources() must specify GFP_NOWAIT so that the
      allocations do not block and instead use an allocator that uses
      spinlocks.
      Signed-off-by: default avatarKrister Johansen <kjlx@templeofstupid.com>
      Acked-by: default avatarDave Jiang <dave.jiang@intel.com>
      Signed-off-by: default avatarVinod Koul <vinod.koul@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2382c148
    • Daniel Borkmann's avatar
      bpf: fix mark_reg_unknown_value for spilled regs on map value marking · 0e0f1d6f
      Daniel Borkmann authored
      [ Upstream commit 6760bf2d ]
      
      Martin reported a verifier issue that hit the BUG_ON() for his
      test case in the mark_reg_unknown_value() function:
      
        [  202.861380] kernel BUG at kernel/bpf/verifier.c:467!
        [...]
        [  203.291109] Call Trace:
        [  203.296501]  [<ffffffff811364d5>] mark_map_reg+0x45/0x50
        [  203.308225]  [<ffffffff81136558>] mark_map_regs+0x78/0x90
        [  203.320140]  [<ffffffff8113938d>] do_check+0x226d/0x2c90
        [  203.331865]  [<ffffffff8113a6ab>] bpf_check+0x48b/0x780
        [  203.343403]  [<ffffffff81134c8e>] bpf_prog_load+0x27e/0x440
        [  203.355705]  [<ffffffff8118a38f>] ? handle_mm_fault+0x11af/0x1230
        [  203.369158]  [<ffffffff812d8188>] ? security_capable+0x48/0x60
        [  203.382035]  [<ffffffff811351a4>] SyS_bpf+0x124/0x960
        [  203.393185]  [<ffffffff810515f6>] ? __do_page_fault+0x276/0x490
        [  203.406258]  [<ffffffff816db320>] entry_SYSCALL_64_fastpath+0x13/0x94
      
      This issue got uncovered after the fix in a08dd0da ("bpf: fix
      regression on verifier pruning wrt map lookups"). The reason why it
      wasn't noticed before was, because as mentioned in a08dd0da,
      mark_map_regs() was doing the id matching incorrectly based on the
      uncached regs[regno].id. So, in the first loop, we walked all regs
      and as soon as we found regno == i, then this reg's id was cleared
      when calling mark_reg_unknown_value() thus that every subsequent
      register was probed against id of 0 (which, in combination with the
      PTR_TO_MAP_VALUE_OR_NULL type is an invalid condition that no other
      register state can hold), and therefore wasn't type transitioned such
      as in the spilled register case for the second loop.
      
      Now since that got fixed, it turned out that 57a09bf0 ("bpf:
      Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
      mark_reg_unknown_value() incorrectly for the spilled regs, and thus
      hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.
      
      Although spilled regs have the same type as the non-spilled regs
      for the verifier state, that is, struct bpf_reg_state, they are
      semantically different from the non-spilled regs. In other words,
      there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
      in the stack, for example, register R<x> could have been spilled by
      the program to stack location X, Y, Z, and in mark_map_regs() we
      need to scan these stack slots of type STACK_SPILL for potential
      registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
      Therefore, depending on the location, the spilled_regs regno can
      be a lot higher than just MAX_BPF_REG's value since we operate on
      stack instead. The reset in mark_reg_unknown_value() itself is
      just fine, only that the BUG_ON() was inappropriate for this. Fix
      it by making a __mark_reg_unknown_value() version that can be
      called from mark_map_reg() generically; we know for the non-spilled
      case that the regno is always < MAX_BPF_REG anyway.
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Reported-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e0f1d6f
    • Daniel Borkmann's avatar
      bpf: fix regression on verifier pruning wrt map lookups · 1889d6d9
      Daniel Borkmann authored
      [ Upstream commit a08dd0da ]
      
      Commit 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
      registers") introduced a regression where existing programs stopped
      loading due to reaching the verifier's maximum complexity limit,
      whereas prior to this commit they were loading just fine; the affected
      program has roughly 2k instructions.
      
      What was found is that state pruning couldn't be performed effectively
      anymore due to mismatches of the verifier's register state, in particular
      in the id tracking. It doesn't mean that 57a09bf0 is incorrect per
      se, but rather that verifier needs to perform a lot more work for the
      same program with regards to involved map lookups.
      
      Since commit 57a09bf0 is only about tracking registers with type
      PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
      until they are promoted through pattern matching with a NULL check to
      either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
      id becomes irrelevant for the transitioned types.
      
      For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
      but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even
      transferred further into other types that don't make use of it. Among
      others, one example is where UNKNOWN_VALUE is set on function call
      return with RET_INTEGER return type.
      
      states_equal() will then fall through the memcmp() on register state;
      note that the second memcmp() uses offsetofend(), so the id is part of
      that since d2a4dd37 ("bpf: fix state equivalence"). But the bisect
      pointed already to 57a09bf0, where we really reach beyond complexity
      limit. What I found was that states_equal() often failed in this
      case due to id mismatches in spilled regs with registers in type
      PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform
      a memcmp() on their reg state and don't have any other optimizations
      in place, therefore also id was relevant in this case for making a
      pruning decision.
      
      We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
      For the affected program, it resulted in a ~17 fold reduction of
      complexity and let the program load fine again. Selftest suite also
      runs fine. The only other place where env->id_gen is used currently is
      through direct packet access, but for these cases id is long living, thus
      a different scenario.
      
      Also, the current logic in mark_map_regs() is not fully correct when
      marking NULL branch with UNKNOWN_VALUE. We need to cache the destination
      reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE,
      it's id is reset and any subsequent registers that hold the original id
      and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE
      anymore, since mark_map_reg() reuses the uncached regs[regno].id that
      was just overridden. Note, we don't need to cache it outside of
      mark_map_regs(), since it's called once on this_branch and the other
      time on other_branch, which are both two independent verifier states.
      A test case for this is added here, too.
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1889d6d9
    • Alexei Starovoitov's avatar
      bpf: fix state equivalence · b7f5aa1c
      Alexei Starovoitov authored
      [ Upstream commit d2a4dd37 ]
      
      Commmits 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      and 48461135 ("bpf: allow access into map value arrays") by themselves
      are correct, but in combination they make state equivalence ignore 'id' field
      of the register state which can lead to accepting invalid program.
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarThomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b7f5aa1c
    • Thomas Graf's avatar
      bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers · 1411707a
      Thomas Graf authored
      [ Upstream commit 57a09bf0 ]
      
      A BPF program is required to check the return register of a
      map_elem_lookup() call before accessing memory. The verifier keeps
      track of this by converting the type of the result register from
      PTR_TO_MAP_VALUE_OR_NULL to PTR_TO_MAP_VALUE after a conditional
      jump ensures safety. This check is currently exclusively performed
      for the result register 0.
      
      In the event the compiler reorders instructions, BPF_MOV64_REG
      instructions may be moved before the conditional jump which causes
      them to keep their type PTR_TO_MAP_VALUE_OR_NULL to which the
      verifier objects when the register is accessed:
      
      0: (b7) r1 = 10
      1: (7b) *(u64 *)(r10 -8) = r1
      2: (bf) r2 = r10
      3: (07) r2 += -8
      4: (18) r1 = 0x59c00000
      6: (85) call 1
      7: (bf) r4 = r0
      8: (15) if r0 == 0x0 goto pc+1
       R0=map_value(ks=8,vs=8) R4=map_value_or_null(ks=8,vs=8) R10=fp
      9: (7a) *(u64 *)(r4 +0) = 0
      R4 invalid mem access 'map_value_or_null'
      
      This commit extends the verifier to keep track of all identical
      PTR_TO_MAP_VALUE_OR_NULL registers after a map_elem_lookup() by
      assigning them an ID and then marking them all when the conditional
      jump is observed.
      Signed-off-by: default avatarThomas Graf <tgraf@suug.ch>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1411707a
    • Hannes Frederic Sowa's avatar
      dccp: fix memory leak during tear-down of unsuccessful connection request · 9e38375a
      Hannes Frederic Sowa authored
      [ Upstream commit 72ef9c41 ]
      
      This patch fixes a memory leak, which happens if the connection request
      is not fulfilled between parsing the DCCP options and handling the SYN
      (because e.g. the backlog is full), because we forgot to free the
      list of ack vectors.
      Reported-by: default avatarJianwen Ji <jiji@redhat.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9e38375a
    • Hannes Frederic Sowa's avatar
      tun: fix premature POLLOUT notification on tun devices · beaa66cc
      Hannes Frederic Sowa authored
      [ Upstream commit b20e2d54 ]
      
      aszlig observed failing ssh tunnels (-w) during initialization since
      commit cc9da6cc ("ipv6: addrconf: use stable address generator for
      ARPHRD_NONE"). We already had reports that the mentioned commit breaks
      Juniper VPN connections. I can't clearly say that the Juniper VPN client
      has the same problem, but it is worth a try to hint to this patch.
      
      Because of the early generation of link local addresses, the kernel now
      can start asking for routers on the local subnet much earlier than usual.
      Those router solicitation packets arrive inside the ssh channels and
      should be transmitted to the tun fd before the configuration scripts
      might have upped the interface and made it ready for transmission.
      
      ssh polls on the interface and receives back a POLL_OUT. It tries to send
      the earily router solicitation packet to the tun interface.  Unfortunately
      it hasn't been up'ed yet by config scripts, thus failing with -EIO. ssh
      doesn't retry again and considers the tun interface broken forever.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=121131
      Fixes: cc9da6cc ("ipv6: addrconf: use stable address generator for ARPHRD_NONE")
      Cc: Bjørn Mork <bjorn@mork.no>
      Reported-by: default avatarValdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Reported-by: default avatarJonas Lippuner <jonas@lippuner.ca>
      Cc: Jonas Lippuner <jonas@lippuner.ca>
      Reported-by: default avataraszlig <aszlig@redmoonstudios.org>
      Cc: aszlig <aszlig@redmoonstudios.org>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      beaa66cc
    • Jon Maxwell's avatar
      dccp/tcp: fix routing redirect race · 98933eb3
      Jon Maxwell authored
      [ Upstream commit 45caeaa5 ]
      
      As Eric Dumazet pointed out this also needs to be fixed in IPv6.
      v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.
      
      We have seen a few incidents lately where a dst_enty has been freed
      with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
      dst_entry. If the conditions/timings are right a crash then ensues when the
      freed dst_entry is referenced later on. A Common crashing back trace is:
      
       #8 [] page_fault at ffffffff8163e648
          [exception RIP: __tcp_ack_snd_check+74]
      .
      .
       #9 [] tcp_rcv_established at ffffffff81580b64
      #10 [] tcp_v4_do_rcv at ffffffff8158b54a
      #11 [] tcp_v4_rcv at ffffffff8158cd02
      #12 [] ip_local_deliver_finish at ffffffff815668f4
      #13 [] ip_local_deliver at ffffffff81566bd9
      #14 [] ip_rcv_finish at ffffffff8156656d
      #15 [] ip_rcv at ffffffff81566f06
      #16 [] __netif_receive_skb_core at ffffffff8152b3a2
      #17 [] __netif_receive_skb at ffffffff8152b608
      #18 [] netif_receive_skb at ffffffff8152b690
      #19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
      #20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
      #21 [] net_rx_action at ffffffff8152bac2
      #22 [] __do_softirq at ffffffff81084b4f
      #23 [] call_softirq at ffffffff8164845c
      #24 [] do_softirq at ffffffff81016fc5
      #25 [] irq_exit at ffffffff81084ee5
      #26 [] do_IRQ at ffffffff81648ff8
      
      Of course it may happen with other NIC drivers as well.
      
      It's found the freed dst_entry here:
      
       224 static bool tcp_in_quickack_mode(struct sock *sk)
       225 {
       226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);
       227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);
       228 
       229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
       230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
       231 }
      
      But there are other backtraces attributed to the same freed dst_entry in
      netfilter code as well.
      
      All the vmcores showed 2 significant clues:
      
      - Remote hosts behind the default gateway had always been redirected to a
      different gateway. A rtable/dst_entry will be added for that host. Making
      more dst_entrys with lower reference counts. Making this more probable.
      
      - All vmcores showed a postitive LockDroppedIcmps value, e.g:
      
      LockDroppedIcmps                  267
      
      A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
      regardless of whether user space has the socket locked. This can result in a
      race condition where the same dst_entry cached in sk->sk_dst_entry can be
      decremented twice for the same socket via:
      
      do_redirect()->__sk_dst_check()-> dst_release().
      
      Which leads to the dst_entry being prematurely freed with another socket
      pointing to it via sk->sk_dst_cache and a subsequent crash.
      
      To fix this skip do_redirect() if usespace has the socket locked. Instead let
      the redirect take place later when user space does not have the socket
      locked.
      
      The dccp/IPv6 code is very similar in this respect, so fixing it there too.
      
      As Eric Garver pointed out the following commit now invalidates routes. Which
      can set the dst->obsolete flag so that ipv4_dst_check() returns null and
      triggers the dst_release().
      
      Fixes: ceb33206 ("ipv4: Kill routes during PMTU/redirect updates.")
      Cc: Eric Garver <egarver@redhat.com>
      Cc: Hannes Sowa <hsowa@redhat.com>
      Signed-off-by: default avatarJon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      98933eb3
    • Florian Westphal's avatar
      bridge: drop netfilter fake rtable unconditionally · 9bce26f2
      Florian Westphal authored
      [ Upstream commit a13b2082 ]
      
      Andreas reports kernel oops during rmmod of the br_netfilter module.
      Hannes debugged the oops down to a NULL rt6info->rt6i_indev.
      
      Problem is that br_netfilter has the nasty concept of adding a fake
      rtable to skb->dst; this happens in a br_netfilter prerouting hook.
      
      A second hook (in bridge LOCAL_IN) is supposed to remove these again
      before the skb is handed up the stack.
      
      However, on module unload hooks get unregistered which means an
      skb could traverse the prerouting hook that attaches the fake_rtable,
      while the 'fake rtable remove' hook gets removed from the hooklist
      immediately after.
      
      Fixes: 34666d46 ("netfilter: bridge: move br_netfilter out of the core")
      Reported-by: default avatarAndreas Karis <akaris@redhat.com>
      Debugged-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Acked-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9bce26f2
    • Florian Westphal's avatar
      ipv6: avoid write to a possibly cloned skb · 683100ed
      Florian Westphal authored
      [ Upstream commit 79e49503 ]
      
      ip6_fragment, in case skb has a fraglist, checks if the
      skb is cloned.  If it is, it will move to the 'slow path' and allocates
      new skbs for each fragment.
      
      However, right before entering the slowpath loop, it updates the
      nexthdr value of the last ipv6 extension header to NEXTHDR_FRAGMENT,
      to account for the fragment header that will be inserted in the new
      ipv6-fragment skbs.
      
      In case original skb is cloned this munges nexthdr value of another
      skb.  Avoid this by doing the nexthdr update for each of the new fragment
      skbs separately.
      
      This was observed with tcpdump on a bridge device where netfilter ipv6
      reassembly is active:  tcpdump shows malformed fragment headers as
      the l4 header (icmpv6, tcp, etc). is decoded as a fragment header.
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reported-by: default avatarAndreas Karis <akaris@redhat.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      683100ed
    • Sabrina Dubroca's avatar
      ipv6: make ECMP route replacement less greedy · 4a8d3bb7
      Sabrina Dubroca authored
      [ Upstream commit 67e19400 ]
      
      Commit 27596472 ("ipv6: fix ECMP route replacement") introduced a
      loop that removes all siblings of an ECMP route that is being
      replaced. However, this loop doesn't stop when it has replaced
      siblings, and keeps removing other routes with a higher metric.
      We also end up triggering the WARN_ON after the loop, because after
      this nsiblings < 0.
      
      Instead, stop the loop when we have taken care of all routes with the
      same metric as the route being replaced.
      
        Reproducer:
        ===========
          #!/bin/sh
      
          ip netns add ns1
          ip netns add ns2
          ip -net ns1 link set lo up
      
          for x in 0 1 2 ; do
              ip link add veth$x netns ns2 type veth peer name eth$x netns ns1
              ip -net ns1 link set eth$x up
              ip -net ns2 link set veth$x up
          done
      
          ip -net ns1 -6 r a 2000::/64 nexthop via fe80::0 dev eth0 \
                  nexthop via fe80::1 dev eth1 nexthop via fe80::2 dev eth2
          ip -net ns1 -6 r a 2000::/64 via fe80::42 dev eth0 metric 256
          ip -net ns1 -6 r a 2000::/64 via fe80::43 dev eth0 metric 2048
      
          echo "before replace, 3 routes"
          ip -net ns1 -6 r | grep -v '^fe80\|^ff00'
          echo
      
          ip -net ns1 -6 r c 2000::/64 nexthop via fe80::4 dev eth0 \
                  nexthop via fe80::5 dev eth1 nexthop via fe80::6 dev eth2
      
          echo "after replace, only 2 routes, metric 2048 is gone"
          ip -net ns1 -6 r | grep -v '^fe80\|^ff00'
      
      Fixes: 27596472 ("ipv6: fix ECMP route replacement")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Acked-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: default avatarXin Long <lucien.xin@gmail.com>
      Reviewed-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a8d3bb7
    • David Ahern's avatar
      mpls: Do not decrement alive counter for unregister events · 87c0286a
      David Ahern authored
      [ Upstream commit 79099aab ]
      
      Multipath routes can be rendered usesless when a device in one of the
      paths is deleted. For example:
      
      $ ip -f mpls ro ls
      100
      	nexthop as to 200 via inet 172.16.2.2  dev virt12
      	nexthop as to 300 via inet 172.16.3.2  dev br0
      101
      	nexthop as to 201 via inet6 2000:2::2  dev virt12
      	nexthop as to 301 via inet6 2000:3::2  dev br0
      
      $ ip li del br0
      
      When br0 is deleted the other hop is not considered in
      mpls_select_multipath because of the alive check -- rt_nhn_alive
      is 0.
      
      rt_nhn_alive is decremented once in mpls_ifdown when the device is taken
      down (NETDEV_DOWN) and again when it is deleted (NETDEV_UNREGISTER). For
      a 2 hop route, deleting one device drops the alive count to 0. Since
      devices are taken down before unregistering, the decrement on
      NETDEV_UNREGISTER is redundant.
      
      Fixes: c89359a4 ("mpls: support for dead routes")
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      87c0286a
    • David Ahern's avatar
      mpls: Send route delete notifications when router module is unloaded · b61206e2
      David Ahern authored
      [ Upstream commit e37791ec ]
      
      When the mpls_router module is unloaded, mpls routes are deleted but
      notifications are not sent to userspace leaving userspace caches
      out of sync. Add the call to mpls_notify_route in mpls_net_exit as
      routes are freed.
      
      Fixes: 0189197f ("mpls: Basic routing support")
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b61206e2
    • Etienne Noss's avatar
      act_connmark: avoid crashing on malformed nlattrs with null parms · 47c8dc47
      Etienne Noss authored
      [ Upstream commit 52491c76 ]
      
      tcf_connmark_init does not check in its configuration if TCA_CONNMARK_PARMS
      is set, resulting in a null pointer dereference when trying to access it.
      
      [501099.043007] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
      [501099.043039] IP: [<ffffffffc10c60fb>] tcf_connmark_init+0x8b/0x180 [act_connmark]
      ...
      [501099.044334] Call Trace:
      [501099.044345]  [<ffffffffa47270e8>] ? tcf_action_init_1+0x198/0x1b0
      [501099.044363]  [<ffffffffa47271b0>] ? tcf_action_init+0xb0/0x120
      [501099.044380]  [<ffffffffa47250a4>] ? tcf_exts_validate+0xc4/0x110
      [501099.044398]  [<ffffffffc0f5fa97>] ? u32_set_parms+0xa7/0x270 [cls_u32]
      [501099.044417]  [<ffffffffc0f60bf0>] ? u32_change+0x680/0x87b [cls_u32]
      [501099.044436]  [<ffffffffa4725d1d>] ? tc_ctl_tfilter+0x4dd/0x8a0
      [501099.044454]  [<ffffffffa44a23a1>] ? security_capable+0x41/0x60
      [501099.044471]  [<ffffffffa470ca01>] ? rtnetlink_rcv_msg+0xe1/0x220
      [501099.044490]  [<ffffffffa470c920>] ? rtnl_newlink+0x870/0x870
      [501099.044507]  [<ffffffffa472cc61>] ? netlink_rcv_skb+0xa1/0xc0
      [501099.044524]  [<ffffffffa47073f4>] ? rtnetlink_rcv+0x24/0x30
      [501099.044541]  [<ffffffffa472c634>] ? netlink_unicast+0x184/0x230
      [501099.044558]  [<ffffffffa472c9d8>] ? netlink_sendmsg+0x2f8/0x3b0
      [501099.044576]  [<ffffffffa46d8880>] ? sock_sendmsg+0x30/0x40
      [501099.044592]  [<ffffffffa46d8e03>] ? SYSC_sendto+0xd3/0x150
      [501099.044608]  [<ffffffffa425fda1>] ? __do_page_fault+0x2d1/0x510
      [501099.044626]  [<ffffffffa47fbd7b>] ? system_call_fast_compare_end+0xc/0x9b
      
      Fixes: 22a5dc0e ("net: sched: Introduce connmark action")
      Signed-off-by: default avatarÉtienne Noss <etienne.noss@wifirst.fr>
      Signed-off-by: default avatarVictorien Molle <victorien.molle@wifirst.fr>
      Acked-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      47c8dc47
    • Dmitry V. Levin's avatar
      uapi: fix linux/packet_diag.h userspace compilation error · ccb65adc
      Dmitry V. Levin authored
      [ Upstream commit 745cb7f8 ]
      
      Replace MAX_ADDR_LEN with its numeric value to fix the following
      linux/packet_diag.h userspace compilation error:
      
      /usr/include/linux/packet_diag.h:67:17: error: 'MAX_ADDR_LEN' undeclared here (not in a function)
        __u8 pdmc_addr[MAX_ADDR_LEN];
      
      This is not the first case in the UAPI where the numeric value
      of MAX_ADDR_LEN is used instead of symbolic one, uapi/linux/if_link.h
      already does the same:
      
      $ grep MAX_ADDR_LEN include/uapi/linux/if_link.h
      	__u8 mac[32]; /* MAX_ADDR_LEN */
      
      There are no UAPI headers besides these two that use MAX_ADDR_LEN.
      Signed-off-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Acked-by: default avatarPavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ccb65adc
    • Paolo Abeni's avatar
      net/tunnel: set inner protocol in network gro hooks · b07eed8f
      Paolo Abeni authored
      [ Upstream commit 294acf1c ]
      
      The gso code of several tunnels type (gre and udp tunnels)
      takes for granted that the skb->inner_protocol is properly
      initialized and drops the packet elsewhere.
      
      On the forwarding path no one is initializing such field,
      so gro encapsulated packets are dropped on forward.
      
      Since commit 38720352 ("gre: Use inner_proto to obtain
      inner header protocol"), this can be reproduced when the
      encapsulated packets use gre as the tunneling protocol.
      
      The issue happens also with vxlan and geneve tunnels since
      commit 8bce6d7d ("udp: Generalize skb_udp_segment"), if the
      forwarding host's ingress nic has h/w offload for such tunnel
      and a vxlan/geneve device is configured on top of it, regardless
      of the configured peer address and vni.
      
      To address the issue, this change initialize the inner_protocol
      field for encapsulated packets in both ipv4 and ipv6 gro complete
      callbacks.
      
      Fixes: 38720352 ("gre: Use inner_proto to obtain inner header protocol")
      Fixes: 8bce6d7d ("udp: Generalize skb_udp_segment")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b07eed8f
    • David Ahern's avatar
      vrf: Fix use-after-free in vrf_xmit · db6e7796
      David Ahern authored
      [ Upstream commit f7887d40 ]
      
      KASAN detected a use-after-free:
      
      [  269.467067] BUG: KASAN: use-after-free in vrf_xmit+0x7f1/0x827 [vrf] at addr ffff8800350a21c0
      [  269.467067] Read of size 4 by task ssh/1879
      [  269.467067] CPU: 1 PID: 1879 Comm: ssh Not tainted 4.10.0+ #249
      [  269.467067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [  269.467067] Call Trace:
      [  269.467067]  dump_stack+0x81/0xb6
      [  269.467067]  kasan_object_err+0x21/0x78
      [  269.467067]  kasan_report+0x2f7/0x450
      [  269.467067]  ? vrf_xmit+0x7f1/0x827 [vrf]
      [  269.467067]  ? ip_output+0xa4/0xdb
      [  269.467067]  __asan_load4+0x6b/0x6d
      [  269.467067]  vrf_xmit+0x7f1/0x827 [vrf]
      ...
      
      Which corresponds to the skb access after xmit handling. Fix by saving
      skb->len and using the saved value to update stats.
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      db6e7796
    • Eric Dumazet's avatar
      dccp: fix use-after-free in dccp_feat_activate_values · 7c0eaeec
      Eric Dumazet authored
      [ Upstream commit 62f8f4d9 ]
      
      Dmitry reported crashes in DCCP stack [1]
      
      Problem here is that when I got rid of listener spinlock, I missed the
      fact that DCCP stores a complex state in struct dccp_request_sock,
      while TCP does not.
      
      Since multiple cpus could access it at the same time, we need to add
      protection.
      
      [1]
      BUG: KASAN: use-after-free in dccp_feat_activate_values+0x967/0xab0
      net/dccp/feat.c:1541 at addr ffff88003713be68
      Read of size 8 by task syz-executor2/8457
      CPU: 2 PID: 8457 Comm: syz-executor2 Not tainted 4.10.0-rc7+ #127
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:15 [inline]
       dump_stack+0x292/0x398 lib/dump_stack.c:51
       kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
       print_address_description mm/kasan/report.c:200 [inline]
       kasan_report_error mm/kasan/report.c:289 [inline]
       kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
       kasan_report mm/kasan/report.c:332 [inline]
       __asan_report_load8_noabort+0x29/0x30 mm/kasan/report.c:332
       dccp_feat_activate_values+0x967/0xab0 net/dccp/feat.c:1541
       dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
       dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
       dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
       dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
       ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
       NF_HOOK include/linux/netfilter.h:257 [inline]
       ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
       dst_input include/net/dst.h:507 [inline]
       ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
       NF_HOOK include/linux/netfilter.h:257 [inline]
       ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
       __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
       __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
       process_backlog+0xe5/0x6c0 net/core/dev.c:4839
       napi_poll net/core/dev.c:5202 [inline]
       net_rx_action+0xe70/0x1900 net/core/dev.c:5267
       __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
       do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902
       </IRQ>
       do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328
       do_softirq kernel/softirq.c:176 [inline]
       __local_bh_enable_ip+0x1f2/0x200 kernel/softirq.c:181
       local_bh_enable include/linux/bottom_half.h:31 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:971 [inline]
       ip6_finish_output2+0xbb0/0x23d0 net/ipv6/ip6_output.c:123
       ip6_finish_output+0x302/0x960 net/ipv6/ip6_output.c:148
       NF_HOOK_COND include/linux/netfilter.h:246 [inline]
       ip6_output+0x1cb/0x8d0 net/ipv6/ip6_output.c:162
       ip6_xmit+0xcdf/0x20d0 include/net/dst.h:501
       inet6_csk_xmit+0x320/0x5f0 net/ipv6/inet6_connection_sock.c:179
       dccp_transmit_skb+0xb09/0x1120 net/dccp/output.c:141
       dccp_xmit_packet+0x215/0x760 net/dccp/output.c:280
       dccp_write_xmit+0x168/0x1d0 net/dccp/output.c:362
       dccp_sendmsg+0x79c/0xb10 net/dccp/proto.c:796
       inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
       sock_sendmsg_nosec net/socket.c:635 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:645
       SYSC_sendto+0x660/0x810 net/socket.c:1687
       SyS_sendto+0x40/0x50 net/socket.c:1655
       entry_SYSCALL_64_fastpath+0x1f/0xc2
      RIP: 0033:0x4458b9
      RSP: 002b:00007f8ceb77bb58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 00000000004458b9
      RDX: 0000000000000023 RSI: 0000000020e60000 RDI: 0000000000000017
      RBP: 00000000006e1b90 R08: 00000000200f9fe1 R09: 0000000000000020
      R10: 0000000000008010 R11: 0000000000000282 R12: 00000000007080a8
      R13: 0000000000000000 R14: 00007f8ceb77c9c0 R15: 00007f8ceb77c700
      Object at ffff88003713be50, in cache kmalloc-64 size: 64
      Allocated:
      PID = 8446
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
       save_stack+0x43/0xd0 mm/kasan/kasan.c:502
       set_track mm/kasan/kasan.c:514 [inline]
       kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
       kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2738
       kmalloc include/linux/slab.h:490 [inline]
       dccp_feat_entry_new+0x214/0x410 net/dccp/feat.c:467
       dccp_feat_push_change+0x38/0x220 net/dccp/feat.c:487
       __feat_register_sp+0x223/0x2f0 net/dccp/feat.c:741
       dccp_feat_propagate_ccid+0x22b/0x2b0 net/dccp/feat.c:949
       dccp_feat_server_ccid_dependencies+0x1b3/0x250 net/dccp/feat.c:1012
       dccp_make_response+0x1f1/0xc90 net/dccp/output.c:423
       dccp_v6_send_response+0x4ec/0xc20 net/dccp/ipv6.c:217
       dccp_v6_conn_request+0xaba/0x11b0 net/dccp/ipv6.c:377
       dccp_rcv_state_process+0x51e/0x1650 net/dccp/input.c:606
       dccp_v6_do_rcv+0x213/0x350 net/dccp/ipv6.c:632
       sk_backlog_rcv include/net/sock.h:893 [inline]
       __sk_receive_skb+0x36f/0xcc0 net/core/sock.c:479
       dccp_v6_rcv+0xba5/0x1d00 net/dccp/ipv6.c:742
       ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
       NF_HOOK include/linux/netfilter.h:257 [inline]
       ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
       dst_input include/net/dst.h:507 [inline]
       ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
       NF_HOOK include/linux/netfilter.h:257 [inline]
       ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
       __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
       __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
       process_backlog+0xe5/0x6c0 net/core/dev.c:4839
       napi_poll net/core/dev.c:5202 [inline]
       net_rx_action+0xe70/0x1900 net/core/dev.c:5267
       __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
      Freed:
      PID = 15
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
       save_stack+0x43/0xd0 mm/kasan/kasan.c:502
       set_track mm/kasan/kasan.c:514 [inline]
       kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:578
       slab_free_hook mm/slub.c:1355 [inline]
       slab_free_freelist_hook mm/slub.c:1377 [inline]
       slab_free mm/slub.c:2954 [inline]
       kfree+0xe8/0x2b0 mm/slub.c:3874
       dccp_feat_entry_destructor.part.4+0x48/0x60 net/dccp/feat.c:418
       dccp_feat_entry_destructor net/dccp/feat.c:416 [inline]
       dccp_feat_list_pop net/dccp/feat.c:541 [inline]
       dccp_feat_activate_values+0x57f/0xab0 net/dccp/feat.c:1543
       dccp_create_openreq_child+0x464/0x610 net/dccp/minisocks.c:121
       dccp_v6_request_recv_sock+0x1f6/0x1960 net/dccp/ipv6.c:457
       dccp_check_req+0x335/0x5a0 net/dccp/minisocks.c:186
       dccp_v6_rcv+0x69e/0x1d00 net/dccp/ipv6.c:711
       ip6_input_finish+0x46d/0x17a0 net/ipv6/ip6_input.c:279
       NF_HOOK include/linux/netfilter.h:257 [inline]
       ip6_input+0xdb/0x590 net/ipv6/ip6_input.c:322
       dst_input include/net/dst.h:507 [inline]
       ip6_rcv_finish+0x289/0x890 net/ipv6/ip6_input.c:69
       NF_HOOK include/linux/netfilter.h:257 [inline]
       ipv6_rcv+0x12ec/0x23d0 net/ipv6/ip6_input.c:203
       __netif_receive_skb_core+0x1ae5/0x3400 net/core/dev.c:4190
       __netif_receive_skb+0x2a/0x170 net/core/dev.c:4228
       process_backlog+0xe5/0x6c0 net/core/dev.c:4839
       napi_poll net/core/dev.c:5202 [inline]
       net_rx_action+0xe70/0x1900 net/core/dev.c:5267
       __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
      Memory state around the buggy address:
       ffff88003713bd00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff88003713bd80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff88003713be00: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb
                                                                ^
      
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Tested-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c0eaeec