1. 12 Jul, 2018 11 commits
    • Stefan Baranoff's avatar
      tcp: fix sequence numbers for repaired sockets re-using TIME-WAIT sockets · 21684dc4
      Stefan Baranoff authored
      This patch fixes a bug where the sequence numbers of a socket created using
      TCP repair functionality are lower than set after connect is called.
      This occurs when the repair socket overlaps with a TIME-WAIT socket and
      triggers the re-use code. The amount lower is equal to the number of times
      that a particular IP/port set is re-used and then put back into TIME-WAIT.
      Re-using the first time the sequence number is 1 lower, closing that socket
      and then re-opening (with repair) a new socket with the same addresses/ports
      puts the sequence number 2 lower than set via setsockopt. The third time is
      3 lower, etc. I have not tested what the limit of this acrewal is, if any.
      
      The fix is, if a socket is in repair mode, to respect the already set
      sequence number and timestamp when it would have already re-used the
      TIME-WAIT socket.
      Signed-off-by: default avatarStefan Baranoff <sbaranoff@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      21684dc4
    • Jacob Keller's avatar
      sch_fq_codel: zero q->flows_cnt when fq_codel_init fails · 83fe6b87
      Jacob Keller authored
      When fq_codel_init fails, qdisc_create_dflt will cleanup by using
      qdisc_destroy. This function calls the ->reset() op prior to calling the
      ->destroy() op.
      
      Unfortunately, during the failure flow for sch_fq_codel, the ->flows
      parameter is not initialized, so the fq_codel_reset function will null
      pointer dereference.
      
         kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
         kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
         kernel: PGD 0 P4D 0
         kernel: Oops: 0000 [#1] SMP PTI
         kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe drm_kms_helper isci ttm firewire_ohci
         kernel:  mdio drm igb libsas crc32c_intel firewire_core ptp pps_core scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf ipmi_msghandler [last unloaded: i40e]
         kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G           OE    4.16.13custom-fq-codel-test+ #3
         kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.02.05.0004.051120151007 05/11/2015
         kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
         kernel: RSP: 0018:ffffbfbf4c1fb620 EFLAGS: 00010246
         kernel: RAX: 0000000000000400 RBX: 0000000000000000 RCX: 00000000000005b9
         kernel: RDX: 0000000000000000 RSI: ffff9d03264a60c0 RDI: ffff9cfd17b31c00
         kernel: RBP: 0000000000000001 R08: 00000000000260c0 R09: ffffffffb679c3e9
         kernel: R10: fffff1dab06a0e80 R11: ffff9cfd163af800 R12: ffff9cfd17b31c00
         kernel: R13: 0000000000000001 R14: ffff9cfd153de600 R15: 0000000000000001
         kernel: FS:  00007fdec2f92800(0000) GS:ffff9d0326480000(0000) knlGS:0000000000000000
         kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         kernel: CR2: 0000000000000008 CR3: 0000000c1956a006 CR4: 00000000000606e0
         kernel: Call Trace:
         kernel:  qdisc_destroy+0x56/0x140
         kernel:  qdisc_create_dflt+0x8b/0xb0
         kernel:  mq_init+0xc1/0xf0
         kernel:  qdisc_create_dflt+0x5a/0xb0
         kernel:  dev_activate+0x205/0x230
         kernel:  __dev_open+0xf5/0x160
         kernel:  __dev_change_flags+0x1a3/0x210
         kernel:  dev_change_flags+0x21/0x60
         kernel:  do_setlink+0x660/0xdf0
         kernel:  ? down_trylock+0x25/0x30
         kernel:  ? xfs_buf_trylock+0x1a/0xd0 [xfs]
         kernel:  ? rtnl_newlink+0x816/0x990
         kernel:  ? _xfs_buf_find+0x327/0x580 [xfs]
         kernel:  ? _cond_resched+0x15/0x30
         kernel:  ? kmem_cache_alloc+0x20/0x1b0
         kernel:  ? rtnetlink_rcv_msg+0x200/0x2f0
         kernel:  ? rtnl_calcit.isra.30+0x100/0x100
         kernel:  ? netlink_rcv_skb+0x4c/0x120
         kernel:  ? netlink_unicast+0x19e/0x260
         kernel:  ? netlink_sendmsg+0x1ff/0x3c0
         kernel:  ? sock_sendmsg+0x36/0x40
         kernel:  ? ___sys_sendmsg+0x295/0x2f0
         kernel:  ? ebitmap_cmp+0x6d/0x90
         kernel:  ? dev_get_by_name_rcu+0x73/0x90
         kernel:  ? skb_dequeue+0x52/0x60
         kernel:  ? __inode_wait_for_writeback+0x7f/0xf0
         kernel:  ? bit_waitqueue+0x30/0x30
         kernel:  ? fsnotify_grab_connector+0x3c/0x60
         kernel:  ? __sys_sendmsg+0x51/0x90
         kernel:  ? do_syscall_64+0x74/0x180
         kernel:  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
         kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 00 00 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 8b 3b e8 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
         kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: ffffbfbf4c1fb620
         kernel: CR2: 0000000000000008
         kernel: ---[ end trace e81a62bede66274e ]---
      
      This is caused because flows_cnt is non-zero, but flows hasn't been
      initialized. fq_codel_init has left the private data in a partially
      initialized state.
      
      To fix this, reset flows_cnt to 0 when we fail to initialize.
      Additionally, to make the state more consistent, also cleanup the flows
      pointer when the allocation of backlogs fails.
      
      This fixes the NULL pointer dereference, since both the for-loop and
      memset in fq_codel_reset will be no-ops when flow_cnt is zero.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      83fe6b87
    • David S. Miller's avatar
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue · 35288486
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2018-07-12
      
      This series contains updates to ixgbe and e100/e1000 kernel documentation.
      
      Alex fixes ixgbe to ensure that we are more explicit about the ordering
      of updates to the receive address register (RAR) table.
      
      Dan Carpenter fixes an issue where we were reading one element beyond
      the end of the array.
      
      Mauro Carvalho Chehab fixes formatting issues in the e100.rst and
      e1000.rst that were causing errors during 'make htmldocs'.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      35288486
    • Mauro Carvalho Chehab's avatar
      networking: e1000.rst: Get rid of Sphinx warnings · 8dc4b1a7
      Mauro Carvalho Chehab authored
      Documentation/networking/e1000.rst:83: ERROR: Unexpected indentation.
          Documentation/networking/e1000.rst:84: WARNING: Block quote ends without a blank line; unexpected unindent.
          Documentation/networking/e1000.rst:173: WARNING: Definition list ends without a blank line; unexpected unindent.
          Documentation/networking/e1000.rst:236: WARNING: Definition list ends without a blank line; unexpected unindent.
      
      While here, fix highlights and mark a table as such.
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      8dc4b1a7
    • Mauro Carvalho Chehab's avatar
      networking: e100.rst: Get rid of Sphinx warnings · b203cc7a
      Mauro Carvalho Chehab authored
      Documentation/networking/e100.rst:57: WARNING: Literal block expected; none found.
          Documentation/networking/e100.rst:68: WARNING: Literal block expected; none found.
          Documentation/networking/e100.rst:75: WARNING: Literal block expected; none found.
          Documentation/networking/e100.rst:84: WARNING: Literal block expected; none found.
          Documentation/networking/e100.rst:93: WARNING: Inline emphasis start-string without end-string.
      
      While here, fix some highlights.
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      b203cc7a
    • Dan Carpenter's avatar
      ixgbe: Off by one in ixgbe_ipsec_tx() · c4111041
      Dan Carpenter authored
      The ipsec->tx_tbl[] has IXGBE_IPSEC_MAX_SA_COUNT elements so the > needs
      to be changed to >= so we don't read one element beyond the end of the
      array.
      
      Fixes: 59259470 ("ixgbe: process the Tx ipsec offload")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: default avatarShannon Nelson <shannon.nelson@oracle.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      c4111041
    • Alexander Duyck's avatar
      ixgbe: Be more careful when modifying MAC filters · d14c780c
      Alexander Duyck authored
      This change makes it so that we are much more explicit about the ordering
      of updates to the receive address register (RAR) table. Prior to this patch
      I believe we may have been updating the table while entries were still
      active, or possibly allowing for reordering of things since we weren't
      explicitly flushing writes to either the lower or upper portion of the
      register prior to accessing the other half.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@intel.com>
      Reviewed-by: default avatarShannon Nelson <shannon.nelson@oracle.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      d14c780c
    • David S. Miller's avatar
      Merge branch 'ieee802154-for-davem-2018-07-11' of... · 672f5cce
      David S. Miller authored
      Merge branch 'ieee802154-for-davem-2018-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
      
      Stefan Schmidt says:
      
      ====================
      pull-request: ieee802154 for net 2018-07-11
      
      An update from ieee802154 for your *net* tree.
      
      Build system fix for a missing include from Arnd Bergmann.
      Setting the IFLA_LINK for the lowpan parent from Lubomir Rintel.
      Fixes for some RX corner cases in adf7242 driver by Michael Hennerich.
      And some small patches to cleanup our BUG_ON vs WARN_ON usage.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      672f5cce
    • Ewan D. Milne's avatar
      qed: fix spelling mistake "successffuly" -> "successfully" · 20c4515a
      Ewan D. Milne authored
      Trivial fix to spelling mistake in qed_probe message.
      Signed-off-by: default avatarEwan D. Milne <emilne@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      20c4515a
    • Russell King's avatar
      sfp: fix module initialisation with netdev already up · 576cd320
      Russell King authored
      It was been observed that with a particular order of initialisation,
      the netdev can be up, but the SFP module still has its TX_DISABLE
      signal asserted.  This occurs when the network device brought up before
      the SFP kernel module has been inserted by userspace.
      
      This occurs because sfp-bus layer does not hear about the change in
      network device state, and so assumes that it is still down.  Set
      netdev->sfp when the upstream is registered to work around this problem.
      Signed-off-by: default avatarRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      576cd320
    • Russell King's avatar
      sfp: ensure we clean up properly on bus registration failure · f20a4c46
      Russell King authored
      We fail to correctly clean up after a bus registration failure, which
      can lead to an incorrect assumption about the registration state of
      the upstream or sfp cage.
      Signed-off-by: default avatarRussell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f20a4c46
  2. 09 Jul, 2018 17 commits
    • Taehee Yoo's avatar
      rhashtable: add restart routine in rhashtable_free_and_destroy() · 0026129c
      Taehee Yoo authored
      rhashtable_free_and_destroy() cancels re-hash deferred work
      then walks and destroys elements. at this moment, some elements can be
      still in future_tbl. that elements are not destroyed.
      
      test case:
      nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy
      all elements of sets before destroying sets and chains.
      But rhashtable_free_and_destroy() doesn't destroy elements of future_tbl.
      so that splat occurred.
      
      test script:
         %cat test.nft
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		}
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
         table ip aa {
      	   map map1 {
      		   type ipv4_addr : verdict;
      		   elements = {
      			   0 : jump a0,
      			   1 : jump a0,
      			   2 : jump a0,
      			   3 : jump a0,
      			   4 : jump a0,
      			   5 : jump a0,
      			   6 : jump a0,
      			   7 : jump a0,
      			   8 : jump a0,
      			   9 : jump a0,
      		   }
      	   }
      	   chain a0 {
      	   }
         }
         flush ruleset
      
         %while :; do nft -f test.nft; done
      
      Splat looks like:
      [  200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
      [  200.806944] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
      [  200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
      [  200.820297] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
      [  200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 [nf_tables]
      [  200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 <0f> 0b 48 89 da 48 b8 00 00 00 00 00 fc ff
      [  200.860366] RSP: 0000:ffff880118dbf4d0 EFLAGS: 00010282
      [  200.866354] RAX: 0000000000000061 RBX: ffff88010cdeaf08 RCX: 0000000000000000
      [  200.874355] RDX: 0000000000000061 RSI: 0000000000000008 RDI: ffffed00231b7e90
      [  200.882361] RBP: ffff880118dbf4e8 R08: ffffed002373bcfb R09: ffffed002373bcfa
      [  200.890354] R10: 0000000000000000 R11: ffffed002373bcfb R12: dead000000000200
      [  200.898356] R13: dead000000000100 R14: ffffffffbb62af38 R15: dffffc0000000000
      [  200.906354] FS:  00007fefc31fd700(0000) GS:ffff88011b800000(0000) knlGS:0000000000000000
      [  200.915533] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.922355] CR2: 0000557f1c8e9128 CR3: 0000000106880000 CR4: 00000000001006e0
      [  200.930353] Call Trace:
      [  200.932351]  ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
      [  200.939525]  ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 [nf_tables]
      [  200.947525]  ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
      [  200.952383]  ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
      [  200.959532]  ? nla_parse+0xab/0x230
      [  200.963529]  ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
      [  200.968384]  ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
      [  200.975525]  ? debug_show_all_locks+0x290/0x290
      [  200.980363]  ? debug_show_all_locks+0x290/0x290
      [  200.986356]  ? sched_clock_cpu+0x132/0x170
      [  200.990352]  ? find_held_lock+0x39/0x1b0
      [  200.994355]  ? sched_clock_local+0x10d/0x130
      [  200.999531]  ? memset+0x1f/0x40
      
      V2:
       - free all tables requested by Herbert Xu
      Signed-off-by: default avatarTaehee Yoo <ap420073@gmail.com>
      Acked-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0026129c
    • David S. Miller's avatar
      Merge branch 'bnxt_en-Bug-fixes' · 252dd176
      David S. Miller authored
      Michael Chan says:
      
      ====================
      bnxt_en: Bug fixes.
      
      These are bug fixes in error code paths, TC Flower VLAN TCI flow
      checking bug fix, proper filtering of Broadcast packets if IFF_BROADCAST
      is not set, and a bug fix in bnxt_get_max_rings() to return 0 ring
      parameters when the return value is -ENOMEM.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      252dd176
    • Vikas Gupta's avatar
      bnxt_en: Fix for system hang if request_irq fails · c58387ab
      Vikas Gupta authored
      Fix bug in the error code path when bnxt_request_irq() returns failure.
      bnxt_disable_napi() should not be called in this error path because
      NAPI has not been enabled yet.
      
      Fixes: c0c050c5 ("bnxt_en: New Broadcom ethernet driver.")
      Signed-off-by: default avatarVikas Gupta <vikas.gupta@broadcom.com>
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c58387ab
    • Michael Chan's avatar
      bnxt_en: Do not modify max IRQ count after RDMA driver requests/frees IRQs. · 30f52947
      Michael Chan authored
      Calling bnxt_set_max_func_irqs() to modify the max IRQ count requested or
      freed by the RDMA driver is flawed.  The max IRQ count is checked when
      re-initializing the IRQ vectors and this can happen multiple times
      during ifup or ethtool -L.  If the max IRQ is reduced and the RDMA
      driver is operational, we may not initailize IRQs correctly.  This
      problem shows up on VFs with very small number of MSIX.
      
      There is no other logic that relies on the IRQ count excluding the ones
      used by RDMA.  So we fix it by just removing the call to subtract or
      add the IRQs used by RDMA.
      
      Fixes: a588e458 ("bnxt_en: Add interface to support RDMA driver.")
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      30f52947
    • Michael Chan's avatar
      bnxt_en: Support clearing of the IFF_BROADCAST flag. · 30e33848
      Michael Chan authored
      Currently, the driver assumes IFF_BROADCAST is always set and always sets
      the broadcast filter.  Modify the code to set or clear the broadcast
      filter according to the IFF_BROADCAST flag.
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      30e33848
    • Michael Chan's avatar
      bnxt_en: Always set output parameters in bnxt_get_max_rings(). · 78f058a4
      Michael Chan authored
      The current code returns -ENOMEM and does not bother to set the output
      parameters to 0 when no rings are available.  Some callers, such as
      bnxt_get_channels() will display garbage ring numbers when that happens.
      Fix it by always setting the output parameters.
      
      Fixes: 6e6c5a57 ("bnxt_en: Modify bnxt_get_max_rings() to support shared or non shared rings.")
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      78f058a4
    • Michael Chan's avatar
      bnxt_en: Fix inconsistent BNXT_FLAG_AGG_RINGS logic. · 07f4fde5
      Michael Chan authored
      If there aren't enough RX rings available, the driver will attempt to
      use a single RX ring without the aggregation ring.  If that also
      fails, the BNXT_FLAG_AGG_RINGS flag is cleared but the other ring
      parameters are not set consistently to reflect that.  If more RX
      rings become available at the next open, the RX rings will be in
      an inconsistent state and may crash when freeing the RX rings.
      
      Fix it by restoring the BNXT_FLAG_AGG_RINGS if not enough RX rings are
      available to run without aggregation rings.
      
      Fixes: bdbd1eb5 ("bnxt_en: Handle no aggregation ring gracefully.")
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      07f4fde5
    • Venkat Duvvuru's avatar
      bnxt_en: Fix the vlan_tci exact match check. · e32d4e60
      Venkat Duvvuru authored
      It is possible that OVS may set don’t care for DEI/CFI bit in
      vlan_tci mask. Hence, checking for vlan_tci exact match will endup
      in a vlan flow rejection.
      
      This patch fixes the problem by checking for vlan_pcp and vid
      separately, instead of checking for the entire vlan_tci.
      
      Fixes: e85a9be9 (bnxt_en: do not allow wildcard matches for L2 flows)
      Signed-off-by: default avatarVenkat Duvvuru <venkatkumar.duvvuru@broadcom.com>
      Signed-off-by: default avatarMichael Chan <michael.chan@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e32d4e60
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 26420d9c
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for your net tree:
      
      1) Missing module autoloadfor icmp and icmpv6 x_tables matches,
         from Florian Westphal.
      
      2) Possible non-linear access to TCP header from tproxy, from
         Mate Eckl.
      
      3) Do not allow rbtree to be used for single elements, this patch
         moves all set backend into one single module since such thing
         can only happen if hashtable module is explicitly blacklisted,
         which should not ever be done.
      
      4) Reject error and standard targets from nft_compat for sanity
         reasons, they are never used from there.
      
      5) Don't crash on double hashsize module parameter, from Andrey
         Ryabinin.
      
      6) Drop dst on skb before placing it in the fragmentation
         reassembly queue, from Florian Westphal.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      26420d9c
    • Florian Westphal's avatar
      netfilter: ipv6: nf_defrag: drop skb dst before queueing · 84379c9a
      Florian Westphal authored
      Eric Dumazet reports:
       Here is a reproducer of an annoying bug detected by syzkaller on our production kernel
       [..]
       ./b78305423 enable_conntrack
       Then :
       sleep 60
       dmesg | tail -10
       [  171.599093] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  181.631024] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  191.687076] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  201.703037] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  211.711072] unregister_netdevice: waiting for lo to become free. Usage count = 2
       [  221.959070] unregister_netdevice: waiting for lo to become free. Usage count = 2
      
      Reproducer sends ipv6 fragment that hits nfct defrag via LOCAL_OUT hook.
      skb gets queued until frag timer expiry -- 1 minute.
      
      Normally nf_conntrack_reasm gets called during prerouting, so skb has
      no dst yet which might explain why this wasn't spotted earlier.
      Reported-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Reported-by: default avatarJohn Sperbeck <jsperbeck@google.com>
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Tested-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      84379c9a
    • Andrey Ryabinin's avatar
      netfilter: nf_conntrack: Fix possible possible crash on module loading. · 2045cdfa
      Andrey Ryabinin authored
      Loading the nf_conntrack module with doubled hashsize parameter, i.e.
      	  modprobe nf_conntrack hashsize=12345 hashsize=12345
      causes NULL-ptr deref.
      
      If 'hashsize' specified twice, the nf_conntrack_set_hashsize() function
      will be called also twice.
      The first nf_conntrack_set_hashsize() call will set the
      'nf_conntrack_htable_size' variable:
      
      	nf_conntrack_set_hashsize()
      		...
      		/* On boot, we can set this without any fancy locking. */
      		if (!nf_conntrack_htable_size)
      			return param_set_uint(val, kp);
      
      But on the second invocation, the nf_conntrack_htable_size is already set,
      so the nf_conntrack_set_hashsize() will take a different path and call
      the nf_conntrack_hash_resize() function. Which will crash on the attempt
      to dereference 'nf_conntrack_hash' pointer:
      
      	BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      	RIP: 0010:nf_conntrack_hash_resize+0x255/0x490 [nf_conntrack]
      	Call Trace:
      	 nf_conntrack_set_hashsize+0xcd/0x100 [nf_conntrack]
      	 parse_args+0x1f9/0x5a0
      	 load_module+0x1281/0x1a50
      	 __se_sys_finit_module+0xbe/0xf0
      	 do_syscall_64+0x7c/0x390
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this, by checking !nf_conntrack_hash instead of
      !nf_conntrack_htable_size. nf_conntrack_hash will be initialized only
      after the module loaded, so the second invocation of the
      nf_conntrack_set_hashsize() won't crash, it will just reinitialize
      nf_conntrack_htable_size again.
      Signed-off-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      2045cdfa
    • Florian Westphal's avatar
      netfilter: nft_compat: explicitly reject ERROR and standard target · 21d5e078
      Florian Westphal authored
      iptables-nft never requests these, but make this explicitly illegal.
      If it were quested, kernel could oops as ->eval is NULL, furthermore,
      the builtin targets have no owning module so its possible to rmmod
      eb/ip/ip6_tables module even if they would be loaded.
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      21d5e078
    • Michael Hennerich's avatar
      net: ieee802154: adf7242: Fix OCL calibration runs · 58e9683d
      Michael Hennerich authored
      Reissuing RC_RX every 400ms - to adjust for offset drift in
      receiver see datasheet page 61, OCL section.
      Signed-off-by: default avatarMichael Hennerich <michael.hennerich@analog.com>
      Signed-off-by: default avatarAlexandru Ardelean <alexandru.ardelean@analog.com>
      Signed-off-by: default avatarStefan Schmidt <stefan@datenfreihafen.org>
      58e9683d
    • Michael Hennerich's avatar
      net: ieee802154: adf7242: Fix erroneous RX enable · 36d26d6b
      Michael Hennerich authored
      Only enable RX mode if the netdev is opened.
      Signed-off-by: default avatarMichael Hennerich <michael.hennerich@analog.com>
      Signed-off-by: default avatarAlexandru Ardelean <alexandru.ardelean@analog.com>
      Signed-off-by: default avatarStefan Schmidt <stefan@datenfreihafen.org>
      36d26d6b
    • Stefan Schmidt's avatar
      ieee802154: fakelb: switch from BUG_ON() to WARN_ON() on problem · 8f2fbc6c
      Stefan Schmidt authored
      The check is valid but it does not warrant to crash the kernel. A
      WARN_ON() is good enough here.
      Found by checkpatch.
      Signed-off-by: default avatarStefan Schmidt <stefan@datenfreihafen.org>
      8f2fbc6c
    • Stefan Schmidt's avatar
      ieee802154: at86rf230: use __func__ macro for debug messages · 8a81388e
      Stefan Schmidt authored
      Instead of having the function name hard-coded (it might change and we
      forgot to update them in the debug output) we can use __func__ instead
      and also shorter the line so we do not need to break it. Also fix an
      extra blank line while being here.
      Found by checkpatch.
      Signed-off-by: default avatarStefan Schmidt <stefan@datenfreihafen.org>
      8a81388e
    • Stefan Schmidt's avatar
      ieee802154: at86rf230: switch from BUG_ON() to WARN_ON() on problem · 20f33045
      Stefan Schmidt authored
      The check is valid but it does not warrant to crash the kernel. A
      WARN_ON() is good enough here.
      Found by checkpatch.
      Signed-off-by: default avatarStefan Schmidt <stefan@datenfreihafen.org>
      20f33045
  3. 08 Jul, 2018 5 commits
    • Eric Dumazet's avatar
      tcp: cleanup copied_seq and urg_data in tcp_disconnect · 6508b678
      Eric Dumazet authored
      tcp_zerocopy_receive() relies on tcp_inq() to limit number of bytes
      requested by user.
      
      syzbot found that after tcp_disconnect(), tcp_inq() was returning
      a stale value (number of bytes in queue before the disconnect).
      
      Note that after this patch, ioctl(fd, SIOCINQ, &val) is also fixed
      and returns 0, so this might be a candidate for all known linux kernels.
      
      While we are at this, we probably also should clear urg_data to
      avoid other syzkaller reports after it discovers how to deal with
      urgent data.
      
      syzkaller repro :
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
      bind(3, {sa_family=AF_INET, sin_port=htons(20000), sin_addr=inet_addr("224.0.0.1")}, 16) = 0
      connect(3, {sa_family=AF_INET, sin_port=htons(20000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
      send(3, ..., 4096, 0) = 4096
      connect(3, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 128) = 0
      getsockopt(3, SOL_TCP, TCP_ZEROCOPY_RECEIVE, ..., [16]) = 0 // CRASH
      
      Fixes: 05255b82 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6508b678
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 7f93d129
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2018-07-07
      
      The following pull-request contains BPF updates for your *net* tree.
      
      Plenty of fixes for different components:
      
      1) A set of critical fixes for sockmap and sockhash, from John Fastabend.
      
      2) fixes for several race conditions in af_xdp, from Magnus Karlsson.
      
      3) hash map refcnt fix, from Mauricio Vasquez.
      
      4) samples/bpf fixes, from Taeung Song.
      
      5) ifup+mtu check for xdp_redirect, from Toshiaki Makita.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7f93d129
    • Paolo Abeni's avatar
      ipfrag: really prevent allocation on netns exit · f6f2a4a2
      Paolo Abeni authored
      Setting the low threshold to 0 has no effect on frags allocation,
      we need to clear high_thresh instead.
      
      The code was pre-existent to commit 648700f7 ("inet: frags:
      use rhashtables for reassembly units"), but before the above,
      such assignment had a different role: prevent concurrent eviction
      from the worker and the netns cleanup helper.
      
      Fixes: 648700f7 ("inet: frags: use rhashtables for reassembly units")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f6f2a4a2
    • Lorenzo Colitti's avatar
      net: diag: Don't double-free TCP_NEW_SYN_RECV sockets in tcp_abort · acc2cf4e
      Lorenzo Colitti authored
      When tcp_diag_destroy closes a TCP_NEW_SYN_RECV socket, it first
      frees it by calling inet_csk_reqsk_queue_drop_and_and_put in
      tcp_abort, and then frees it again by calling sock_gen_put.
      
      Since tcp_abort only has one caller, and all the other codepaths
      in tcp_abort don't free the socket, just remove the free in that
      function.
      
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Tested: passes Android sock_diag_test.py, which exercises this codepath
      Fixes: d7226c7a ("net: diag: Fix refcnt leak in error path destroying socket")
      Signed-off-by: default avatarLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      acc2cf4e
    • David Ahern's avatar
      net/ipv4: Set oif in fib_compute_spec_dst · e7372197
      David Ahern authored
      Xin reported that icmp replies may not use the address on the device the
      echo request is received if the destination address is broadcast. Instead
      a route lookup is done without considering VRF context. Fix by setting
      oif in flow struct to the master device if it is enslaved. That directs
      the lookup to the VRF table. If the device is not enslaved, oif is still
      0 so no affect.
      
      Fixes: cd2fbe1b ("net: Use VRF device index for lookups on RX")
      Reported-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e7372197
  4. 07 Jul, 2018 7 commits
    • Toshiaki Makita's avatar
      xdp: XDP_REDIRECT should check IFF_UP and MTU · d8d7218a
      Toshiaki Makita authored
      Otherwise we end up with attempting to send packets from down devices
      or to send oversized packets, which may cause unexpected driver/device
      behaviour. Generic XDP has already done this check, so reuse the logic
      in native XDP.
      
      Fixes: 814abfab ("xdp: add bpf_redirect helper function")
      Signed-off-by: default avatarToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d8d7218a
    • Alexei Starovoitov's avatar
      Merge branch 'sockhash-fixes' · 4fb126cb
      Alexei Starovoitov authored
      John Fastabend says:
      
      ====================
      First three patches resolve issues found while testing sockhash and
      reviewing code. Syzbot also found them about the same time as I was
      working on fixes. The main issue is in the sockhash path we reduced
      the scope of sk_callback lock but this meant we could get update and
      close running in parallel so fix that here.
      
      Then testing sk_msg and sk_skb programs together found that skb->dev
      is not always assigned and some of the helpers were depending on this
      to lookup max mtu. Fix this by using SKB_MAX_ALLOC when no MTU is
      available.
      
      Finally, Martin spotted that the sockmap code was still using the
      qdisc skb cb structure. But I was sure we had fixed this long ago.
      Looks like we missed it in a merge conflict resolution and then by
      chance data_end offset was the same in both structures so everything
      sort of continued to work even though it could break at any moment
      if the structs ever change. So redo the conversion and this time
      also convert the helpers.
      
      v2: fix '0 files changed' issue in patches
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      4fb126cb
    • John Fastabend's avatar
      bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb · 0ea488ff
      John Fastabend authored
      In commit
      
        'bpf: bpf_compute_data uses incorrect cb structure' (8108a775)
      
      we added the routine bpf_compute_data_end_sk_skb() to compute the
      correct data_end values, but this has since been lost. In kernel
      v4.14 this was correct and the above patch was applied in it
      entirety. Then when v4.14 was merged into v4.15-rc1 net-next tree
      we lost the piece that renamed bpf_compute_data_pointers to the
      new function bpf_compute_data_end_sk_skb. This was done here,
      
      e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      
      When it conflicted with the following rename patch,
      
      6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      
      Finally, after a refactor I thought even the function
      bpf_compute_data_end_sk_skb() was no longer needed and it was
      erroneously removed.
      
      However, we never reverted the sk_skb_convert_ctx_access() usage of
      tcp_skb_cb which had been committed and survived the merge conflict.
      Here we fix this by adding back the helper and *_data_end_sk_skb()
      usage. Using the bpf_skc_data_end mapping is not correct because it
      expects a qdisc_skb_cb object but at the sock layer this is not the
      case. Even though it happens to work here because we don't overwrite
      any data in-use at the socket layer and the cb structure is cleared
      later this has potential to create some subtle issues. But, even
      more concretely the filter.c access check uses tcp_skb_cb.
      
      And by some act of chance though,
      
      struct bpf_skb_data_end {
              struct qdisc_skb_cb        qdisc_cb;             /*     0    28 */
      
              /* XXX 4 bytes hole, try to pack */
      
              void *                     data_meta;            /*    32     8 */
              void *                     data_end;             /*    40     8 */
      
              /* size: 48, cachelines: 1, members: 3 */
              /* sum members: 44, holes: 1, sum holes: 4 */
              /* last cacheline: 48 bytes */
      };
      
      and then tcp_skb_cb,
      
      struct tcp_skb_cb {
      	[...]
                      struct {
                              __u32      flags;                /*    24     4 */
                              struct sock * sk_redir;          /*    32     8 */
                              void *     data_end;             /*    40     8 */
                      } bpf;                                   /*          24 */
              };
      
      So when we use offset_of() to track down the byte offset we get 40 in
      either case and everything continues to work. Fix this mess and use
      correct structures its unclear how long this might actually work for
      until someone moves the structs around.
      Reported-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Fixes: e1ea2f98 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
      Fixes: 6aaae2b6 ("bpf: rename bpf_compute_data_end into bpf_compute_data_pointers")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      0ea488ff
    • John Fastabend's avatar
      bpf: sockmap, consume_skb in close path · 7ebc14d5
      John Fastabend authored
      Currently, when a sock is closed and the bpf_tcp_close() callback is
      used we remove memory but do not free the skb. Call consume_skb() if
      the skb is attached to the buffer.
      
      Reported-by: syzbot+d464d2c20c717ef5a6a8@syzkaller.appspotmail.com
      Fixes: 1aa12bdf ("bpf: sockmap, add sock close() hook to remove socks")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      7ebc14d5
    • John Fastabend's avatar
      bpf: sockhash, disallow bpf_tcp_close and update in parallel · 99ba2b5a
      John Fastabend authored
      After latest lock updates there is no longer anything preventing a
      close and recvmsg call running in parallel. Additionally, we can
      race update with close if we close a socket and simultaneously update
      if via the BPF userspace API (note the cgroup ops are already run
      with sock_lock held).
      
      To resolve this take sock_lock in close and update paths.
      
      Reported-by: syzbot+b680e42077a0d7c9a0c4@syzkaller.appspotmail.com
      Fixes: e9db4ef6 ("bpf: sockhash fix omitted bucket lock in sock_close")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      99ba2b5a
    • John Fastabend's avatar
      bpf: fix sk_skb programs without skb->dev assigned · 0c6bc6e5
      John Fastabend authored
      Multiple BPF helpers in use by sk_skb programs calculate the max
      skb length using the __bpf_skb_max_len function. However, this
      calculates the max length using the skb->dev pointer which can be
      NULL when an sk_skb program is paired with an sk_msg program.
      
      To force this a sk_msg program needs to redirect into the ingress
      path of a sock with an attach sk_skb program. Then the the sk_skb
      program would need to call one of the helpers that adjust the skb
      size.
      
      To fix the null ptr dereference use SKB_MAX_ALLOC size if no dev
      is available.
      
      Fixes: 8934ce2f ("bpf: sockmap redirect ingress support")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      0c6bc6e5
    • Alexei Starovoitov's avatar
      Merge branch 'sockmap-fixes' · 631da853
      Alexei Starovoitov authored
      John Fastabend says:
      
      ====================
      I missed fixing the error path in the sockhash code to align with
      supporting socks in multiple maps. Simply checking if the psock is
      present does not mean we can decrement the reference count because
      it could be part of another map. Fix this by cleaning up the error
      path so this situation does not happen.
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      631da853