1. 21 Sep, 2022 7 commits
  2. 20 Sep, 2022 33 commits
    • liquidio: CN23XX: delete repeated words, add missing words and fix typo in comment · c29b0682
      Ruffalo Lavoisier authored
      - Delete the repeated word 'to' in the comment.
      
      - Add the missing 'use' word within the sentence.
      
      - Correct spelling on 'malformation', 'needs'.
      Signed-off-by: Ruffalo Lavoisier <RuffaloLavoisier@gmail.com>
      Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
      Link: https://lore.kernel.org/r/20220919053447.5702-1-RuffaloLavoisier@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • octeontx2-pf: Fix unused variable build error · 54b9a2bb
      Ren Zhijie authored
      If CONFIG_DCB is not set, building with
      make ARCH=x86_64 CROSS_COMPILE=x86_64-linux-gnu-
      fails like this:
      
      drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c: In function ‘otx2_select_queue’:
      drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c:1886:19: error: unused variable ‘pf’ [-Werror=unused-variable]
        struct otx2_nic *pf = netdev_priv(netdev);
                         ^~
      cc1: all warnings being treated as errors
      
      To fix this build error, put the definition of *pf under the CONFIG_DCB guard.
      
      Fixes: 99c969a8 ("octeontx2-pf: Add egress PFC support")
      Signed-off-by: Ren Zhijie <renzhijie2@huawei.com>
      Link: https://lore.kernel.org/r/20220919025840.256411-1-renzhijie2@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ice: Add low latency Tx timestamp read · 1229b339
      Karol Kolacinski authored
      E810 products can support low latency Tx timestamp register read.
      This requires usage of threaded IRQ instead of kthread to reduce the
      kthread start latency (spikes up to 20 ms).
      Add a check for the device capability and use the new method if
      supported.
      Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
      Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://lore.kernel.org/r/20220916201728.241510-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'add-a-secondary-at-port-to-the-telit-fn990' · 0572b18d
      Jakub Kicinski authored
      Fabio Porcedda says:
      
      ====================
      Add a secondary AT port to the Telit FN990
      
      In order to add a secondary AT port to the Telit FN990, first add "DUN2"
      to mhi_wwan_ctrl.c, then add a secondary AT port to the
      Telit FN990 in pci_generic.c.
      ====================
      
      Link: https://lore.kernel.org/r/20220916144329.243368-1-fabio.porcedda@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bus: mhi: host: pci_generic: Add a secondary AT port to Telit FN990 · 479aa3b0
      Fabio Porcedda authored
      Add a secondary AT port using one of the OEM reserved channels.
      Signed-off-by: Fabio Porcedda <fabio.porcedda@gmail.com>
      Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: wwan: mhi_wwan_ctrl: Add DUN2 to have a secondary AT port · 0c60d165
      Fabio Porcedda authored
      In order to have a secondary AT port add "DUN2".
      Signed-off-by: Fabio Porcedda <fabio.porcedda@gmail.com>
      Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'refactor-duplicate-codes-in-the-tc-cls-walk-function' · adae216f
      Jakub Kicinski authored
      Zhengchao Shao says:
      
      ====================
      refactor duplicate codes in the tc cls walk function
      
      The walk implementation of most tc cls modules is basically the same.
      That is, the values of count and skip are checked first. If count is
      greater than or equal to skip, the registered fn function is executed.
      Otherwise, increase the value of count. So the code can be refactored
      into a helper function, which then replaces the open-coded logic of each
      cls module, in alphabetical order.

      The walk function is invoked during dump. Therefore, test cases related
      to the tdc filter need to be added.
      ====================
      
      Link: https://lore.kernel.org/r/20220916020251.190097-1-shaozhengchao@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/tc-testings: add list case for basic filter · 972e8861
      Zhengchao Shao authored
      Test 0811: Add multiple basic filter with cmp ematch u8/link layer and
      default action and dump them
      Test 5129: List basic filters
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/tc-testings: add selftests for tcindex filter · fa8dfba5
      Zhengchao Shao authored
      Test 8293: Add tcindex filter with default action
      Test 7281: Add tcindex filter with hash size and pass action
      Test b294: Add tcindex filter with mask shift and reclassify action
      Test 0532: Add tcindex filter with pass_on and continue actions
      Test d473: Add tcindex filter with pipe action
      Test 2940: Add tcindex filter with multiple actions
      Test 1893: List tcindex filters
      Test 2041: Change tcindex filter with pass action
      Test 9203: Replace tcindex filter with pass action
      Test 7957: Delete tcindex filter with drop action
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/tc-testings: add selftests for rsvp filter · 23020350
      Zhengchao Shao authored
      Test 2141: Add rsvp filter with tcp proto and specific IP address
      Test 5267: Add rsvp filter with udp proto and specific IP address
      Test 2819: Add rsvp filter with src ip and src port
      Test c967: Add rsvp filter with tunnelid and continue action
      Test 5463: Add rsvp filter with tunnel and pipe action
      Test 2332: Add rsvp filter with multiple actions
      Test 8879: Add rsvp filter with tunnel and skp flag
      Test 8261: List rsvp filters
      Test 8989: Delete rsvp filter
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/tc-testings: add selftests for route filter · 67107e7f
      Zhengchao Shao authored
      Test e122: Add route filter with from and to tag
      Test 6573: Add route filter with fromif and to tag
      Test 1362: Add route filter with to flag and reclassify action
      Test 4720: Add route filter with from flag and continue actions
      Test 2812: Add route filter with from tag and pipe action
      Test 7994: Add route filter with multiple actions
      Test 4312: List route filters
      Test 2634: Delete route filter with pipe action
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/tc-testings: add selftests for flow filter · 58f82b3a
      Zhengchao Shao authored
      Test 5294: Add flow filter with map key and ops
      Test 3514: Add flow filter with map key or ops
      Test 7534: Add flow filter with map key xor ops
      Test 4524: Add flow filter with map key rshift ops
      Test 0230: Add flow filter with map key addend ops
      Test 2344: Add flow filter with src map key
      Test 9304: Add flow filter with proto map key
      Test 9038: Add flow filter with proto-src map key
      Test 2a03: Add flow filter with proto-dst map key
      Test a073: Add flow filter with iif map key
      Test 3b20: Add flow filter with priority map key
      Test 8945: Add flow filter with mark map key
      Test c034: Add flow filter with nfct map key
      Test 0205: Add flow filter with nfct-src map key
      Test 5315: Add flow filter with nfct-src map key
      Test 7849: Add flow filter with nfct-proto-src map key
      Test 9902: Add flow filter with nfct-proto-dst map key
      Test 6742: Add flow filter with rt-classid map key
      Test 5432: Add flow filter with sk-uid map key
      Test 4134: Add flow filter with sk-gid map key
      Test 4522: Add flow filter with vlan-tag map key
      Test 4253: Add flow filter with rxhash map key
      Test 4452: Add flow filter with hash key list
      Test 4341: Add flow filter with multiple ops
      Test 4392: List flow filters
      Test 4322: Change flow filter with map key num
      Test 2320: Replace flow filter with map key num
      Test 3213: Delete flow filter with map key num
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/tc-testings: add selftests for cgroup filter · 33c41192
      Zhengchao Shao authored
      Test 6273: Add cgroup filter with cmp ematch u8/link layer and drop action
      Test 4721: Add cgroup filter with cmp ematch u8/link layer with trans
      flag and pass action
      Test d392: Add cgroup filter with cmp ematch u16/link layer and pipe action
      Test 0234: Add cgroup filter with cmp ematch u32/link layer and multiple
      actions
      Test 8499: Add cgroup filter with cmp ematch u8/network layer and pass
      action
      Test b273: Add cgroup filter with cmp ematch u8/network layer with trans
      flag and drop action
      Test 1934: Add cgroup filter with cmp ematch u16/network layer and pipe
      action
      Test 2733: Add cgroup filter with cmp ematch u32/network layer and
      multiple actions
      Test 3271: Add cgroup filter with NOT cmp ematch rule and pass action
      Test 2362: Add cgroup filter with two ANDed cmp ematch rules and single
      action
      Test 9993: Add cgroup filter with two ORed cmp ematch rules and single
      action
      Test 2331: Add cgroup filter with two ANDed cmp ematch rules and one ORed
      ematch rule and single action
      Test 3645: Add cgroup filter with two ANDed cmp ematch rules and one NOT
      ORed ematch rule and single action
      Test b124: Add cgroup filter with u32 ematch u8/zero offset and drop
      action
      Test 7381: Add cgroup filter with u32 ematch u8/zero offset and invalid
      value >0xFF
      Test 2231: Add cgroup filter with u32 ematch u8/positive offset and drop
      action
      Test 1882: Add cgroup filter with u32 ematch u8/invalid mask >0xFF
      Test 1237: Add cgroup filter with u32 ematch u8/missing offset
      Test 3812: Add cgroup filter with u32 ematch u8/missing AT keyword
      Test 1112: Add cgroup filter with u32 ematch u8/missing value
      Test 3241: Add cgroup filter with u32 ematch u8/non-numeric value
      Test e231: Add cgroup filter with u32 ematch u8/non-numeric mask
      Test 4652: Add cgroup filter with u32 ematch u8/negative offset and pass
      Test 1331: Add cgroup filter with u32 ematch u16/zero offset and pipe
      action
      Test e354: Add cgroup filter with u32 ematch u16/zero offset and invalid
      value >0xFFFF
      Test 3538: Add cgroup filter with u32 ematch u16/positive offset and drop
      action
      Test 4576: Add cgroup filter with u32 ematch u16/invalid mask >0xFFFF
      Test b842: Add cgroup filter with u32 ematch u16/missing offset
      Test c924: Add cgroup filter with u32 ematch u16/missing AT keyword
      Test cc93: Add cgroup filter with u32 ematch u16/missing value
      Test 123c: Add cgroup filter with u32 ematch u16/non-numeric value
      Test 3675: Add cgroup filter with u32 ematch u16/non-numeric mask
      Test 1123: Add cgroup filter with u32 ematch u16/negative offset and drop
      action
      Test 4234: Add cgroup filter with u32 ematch u16/nexthdr+ offset and pass
      action
      Test e912: Add cgroup filter with u32 ematch u32/zero offset and pipe
      action
      Test 1435: Add cgroup filter with u32 ematch u32/positive offset and drop
      action
      Test 1282: Add cgroup filter with u32 ematch u32/missing offset
      Test 6456: Add cgroup filter with u32 ematch u32/missing AT keyword
      Test 4231: Add cgroup filter with u32 ematch u32/missing value
      Test 2131: Add cgroup filter with u32 ematch u32/non-numeric value
      Test f125: Add cgroup filter with u32 ematch u32/non-numeric mask
      Test 4316: Add cgroup filter with u32 ematch u32/negative offset and drop
      action
      Test 23ae: Add cgroup filter with u32 ematch u32/nexthdr+ offset and pipe
      action
      Test 23a1: Add cgroup filter with canid ematch and single SFF
      Test 324f: Add cgroup filter with canid ematch and single SFF with mask
      Test 2576: Add cgroup filter with canid ematch and multiple SFF
      Test 4839: Add cgroup filter with canid ematch and multiple SFF with masks
      Test 6713: Add cgroup filter with canid ematch and single EFF
      Test ab9d: Add cgroup filter with canid ematch and multiple EFF with masks
      Test 5349: Add cgroup filter with canid ematch and a combination of
      SFF/EFF
      Test c934: Add cgroup filter with canid ematch and a combination of
      SFF/EFF with masks
      Test 4319: Replace cgroup filter with different match
      Test 4636: Delete cgroup filter
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/tc-testings: add selftests for bpf filter · 93f3f2ea
      Zhengchao Shao authored
      Test 23c3: Add cBPF filter with valid bytecode
      Test 1563: Add cBPF filter with invalid bytecode
      Test 2334: Add eBPF filter with valid object-file
      Test 2373: Add eBPF filter with invalid object-file
      Test 4423: Replace cBPF bytecode
      Test 5122: Delete cBPF filter
      Test e0a9: List cBPF filters
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: use tc_cls_stats_dump() in filter · 5508ff7c
      Zhengchao Shao authored
      Use tc_cls_stats_dump() in filter.
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: cls_api: add helper for tc cls walker stats dump · fe0df81d
      Zhengchao Shao authored
      The walk implementation of most tc cls modules is basically the same.
      That is, the values of count and skip are checked first. If count is
      greater than or equal to skip, the registered fn function is executed.
      Otherwise, increase the value of count. So the duplicated code can be
      moved into a shared helper.
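
The count/skip pattern described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the kernel code: the names `Walker`, `stats_dump` and `walk` are hypothetical, and it assumes that the counter advances for every visited entry unless the callback reports an error.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Walker:
    """Hypothetical stand-in for the walker state (fn, skip, count, stop)."""
    fn: Callable[[object], int]  # callback; a negative return aborts the walk
    skip: int = 0                # number of leading entries to pass over
    count: int = 0               # entries visited so far
    stop: bool = False           # set once the callback reports an error


def stats_dump(walker: Walker, entry: object) -> bool:
    """Shared step: call fn once count has reached skip, then advance count.

    Returns False when the walk should stop.
    """
    if walker.count >= walker.skip and walker.fn(entry) < 0:
        walker.stop = True
        return False
    walker.count += 1
    return True


def walk(entries: Iterable[object], walker: Walker) -> None:
    """What a per-classifier walk loop reduces to once the helper exists."""
    for entry in entries:
        if not stats_dump(walker, entry):
            break


seen: List[object] = []


def record(entry: object) -> int:
    seen.append(entry)
    return 0


w = Walker(fn=record, skip=2)
walk(["f0", "f1", "f2", "f3"], w)
print(seen, w.count)
```

With `skip=2`, the first two entries are only counted and the callback runs for the remaining ones, which is how a dump can resume where a previous pass left off.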
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Victor Nogueira <victor@mojatatu.com>
      Tested-by: Victor Nogueira <victor@mojatatu.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: broadcom: bcm4908_enet: handle -EPROBE_DEFER when getting MAC · e93a766d
      Rafał Miłecki authored
      Reading MAC from OF may return -EPROBE_DEFER if the underlying NVMEM device
      isn't ready yet. In that case, pass the error code up and "wait" to be
      probed later.
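
The deferral idea can be sketched as a small Python model. It is not driver code: `FakeNvmem`, `read_mac` and `probe` are hypothetical names, the MAC address is made up, and the only kernel fact used is that EPROBE_DEFER has the value 517.

```python
EPROBE_DEFER = 517  # value of the kernel's EPROBE_DEFER errno


class FakeNvmem:
    """Hypothetical NVMEM provider that becomes ready some time after boot."""

    def __init__(self) -> None:
        self.ready = False
        self.mac = "02:00:00:00:00:01"  # made-up address for the example


def read_mac(nvmem: FakeNvmem):
    """Return (err, mac); err is -EPROBE_DEFER while the provider is not ready."""
    if not nvmem.ready:
        return -EPROBE_DEFER, None
    return 0, nvmem.mac


def probe(nvmem: FakeNvmem, state: dict) -> int:
    """Model of a probe routine that propagates -EPROBE_DEFER to its caller."""
    err, mac = read_mac(nvmem)
    if err == -EPROBE_DEFER:
        return err  # ask to be probed again later
    state["mac"] = mac
    return 0


nvmem, state = FakeNvmem(), {}
first = probe(nvmem, state)   # provider not ready yet: probe is deferred
nvmem.ready = True
second = probe(nvmem, state)  # retried later: probe succeeds
print(first, second, state)
```

The point of the pattern is that the first attempt leaves no half-initialized state behind, so a later retry can simply run the same probe again.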
      Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
      Link: https://lore.kernel.org/r/20220915133013.2243-1-zajec5@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: make NET_(DEV|NS)_REFCNT_TRACKER depend on NET · caddb4e0
      Lukas Bulwahn authored
      It makes little sense to ask if networking namespace or net device refcount
      tracking shall be enabled for debug kernel builds without network support.
      
      This is similar to the commit eb0b39ef ("net: CONFIG_DEBUG_NET depends
      on CONFIG_NET").
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Link: https://lore.kernel.org/r/20220915124256.32512-1-lukas.bulwahn@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'small-tc-taprio-improvements' · c3194a67
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Small tc-taprio improvements
      
      This series contains:
      - the proper protected variant of rcu_dereference() of admin and oper
        schedules for accesses from the slow path
      - a removal of an extra function pointer indirection for
        qdisc->dequeue() and qdisc->peek()
      - a removal of WARN_ON_ONCE() checks that can never trigger
      - the addition of netlink extack messages to some qdisc->init() failures
      
      These were split from an earlier patch set, hence the v2.
      ====================
      
      Link: https://lore.kernel.org/r/20220915105046.2404072-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: taprio: replace safety precautions with comments · 2c08a4f8
      Vladimir Oltean authored
      The WARN_ON_ONCE() checks introduced in commit 13511704 ("net:
      taprio offload: enforce qdisc to netdev queue mapping") take a small
      toll on performance, but otherwise, the conditions are never expected to
      happen. Replace them with comments, such that the information is still
      conveyed to developers.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: taprio: add extack messages in taprio_init · 026de64d
      Vladimir Oltean authored
      Stop contributing to the proverbial user unfriendliness of tc, and tell
      the user what is wrong wherever possible.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: taprio: stop going through private ops for dequeue and peek · 25becba6
      Vladimir Oltean authored
      Since commit 13511704 ("net: taprio offload: enforce qdisc to netdev
      queue mapping"), taprio_dequeue_soft() and taprio_peek_soft() are de
      facto the only implementations for Qdisc_ops :: dequeue and Qdisc_ops ::
      peek that taprio provides.
      
      This is because in full offload mode, __dev_queue_xmit() will select a
      txq->qdisc which is never root taprio qdisc. So if nothing is enqueued
      in the root qdisc, it will never be run and nothing will get dequeued
      from it.
      
      Therefore, we can remove the private indirection from taprio, and always
      point Qdisc_ops :: dequeue to taprio_dequeue_soft (now simply named
      taprio_dequeue) and Qdisc_ops :: peek to taprio_peek_soft (now simply
      named taprio_peek).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: taprio: remove redundant FULL_OFFLOAD_IS_ENABLED check in taprio_enqueue · fa65edde
      Vladimir Oltean authored
      Since commit 13511704 ("net: taprio offload: enforce qdisc to netdev
      queue mapping"), __dev_queue_xmit() will select a txq->qdisc for the
      full offload case of taprio which isn't the root taprio qdisc, so
      qdisc enqueues will never pass through taprio_enqueue().
      
      That commit already introduced one safety precaution check for
      FULL_OFFLOAD_IS_ENABLED(); a second one is really not needed, so
      simplify the conditional for entering into the GSO segmentation logic.
      Also reword the comment a little, to appear more natural after the code
      change.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: taprio: use rtnl_dereference for oper and admin sched in taprio_destroy() · 9af23657
      Vladimir Oltean authored
      Sparse complains that taprio_destroy() dereferences q->oper_sched and
      q->admin_sched without rcu_dereference(), since they are marked as __rcu
      in the taprio private structure.
      
      1671:28: warning: incorrect type in argument 1 (different address spaces)
      1671:28:    expected struct callback_head *head
      1671:28:    got struct callback_head [noderef] __rcu *
      1674:28: warning: incorrect type in argument 1 (different address spaces)
      1674:28:    expected struct callback_head *head
      1674:28:    got struct callback_head [noderef] __rcu *
      
      To silence that build warning, do actually use rtnl_dereference(), since
      we know the rtnl_mutex is held at the time of q->destroy().
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: taprio: taprio_dump and taprio_change are protected by rtnl_mutex · 18cdd2f0
      Vladimir Oltean authored
      Since the writer-side lock is taken here, we do not need to open an RCU
      read-side critical section, instead we can use rtnl_dereference() to
      tell lockdep we are serialized with concurrent writes.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: taprio: taprio_offload_config_changed() is protected by rtnl_mutex · c8cbe123
      Vladimir Oltean authored
      The locking in taprio_offload_config_changed() is wrong (but also
      inconsequentially so). The current_entry_lock does not serialize changes
      to the admin and oper schedules, only to the current entry. In fact, the
      rtnl_mutex does that, and that is taken at the time when taprio_change()
      is called.
      
      Replace the rcu_dereference_protected() method with the proper RCU
      annotation, and drop the unnecessary spin lock.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'nfp-flower-police-validation-and-ct-enhancements' · 4d3c8848
      Jakub Kicinski authored
      Simon Horman says:
      
      ====================
      nfp: flower: police validation and ct enhancements
      
      This series enhances the flower hardware offload
      facility provided by the nfp driver.

      1. Add validation of police actions created independently of flows

      2. Add support for offload of ct NAT action
      
      3. Support offload of rule which has both vlan push/pop/mangle
         and ct action
      ====================
      
      Link: https://lore.kernel.org/r/20220914160604.1740282-1-simon.horman@corigine.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfp: flower: support vlan action in pre_ct · 742b7072
      Hui Zhou authored
      Support hardware offload of rule which has both vlan push/pop/mangle
      and ct action.
      Signed-off-by: Hui Zhou <hui.zhou@corigine.com>
      Reviewed-by: Louis Peens <louis.peens@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfp: flower: support hw offload for ct nat action · 5cee92c6
      Hui Zhou authored
      Support the ct nat action when pre_ct is merged with post_ct
      and nft. At the same time, add the extra checksum action
      and hardware stats for nft to meet the action check when
      doing nat.
      Signed-off-by: Hui Zhou <hui.zhou@corigine.com>
      Reviewed-by: Louis Peens <louis.peens@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfp: flower: add validation of for police actions which are independent of flows · 9f1a948f
      Ziyang Chen authored
      Validation of police actions was added to offload drivers in
      commit d97b4b10 ("flow_offload: reject offload for all drivers with
      invalid police parameters")
      
      This patch extends that validation in the nfp driver to include
      police actions which are created independently of flows.
      Signed-off-by: Ziyang Chen <ziyang.chen@corigine.com>
      Reviewed-by: Baowen Zheng <baowen.zheng@corigine.com>
      Reviewed-by: Louis Peens <louis.peens@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'tcp-introduce-optional-per-netns-ehash' · 4fa37e49
      Jakub Kicinski authored
      Kuniyuki Iwashima says:
      
      ====================
      tcp: Introduce optional per-netns ehash.
      
      The more sockets we have in the hash table, the longer we spend looking
      up the socket.  While running a number of small workloads on the same
      host, they penalise each other and cause performance degradation.
      
      The root cause might be a single workload that consumes much more
      resources than the others.  It often happens on a cloud service where
      different workloads share the same computing resource.
      
      On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
      entries), after running iperf3 in different netns, creating 24Mi sockets
      without data transfer in the root netns causes about 10% performance
      regression for the iperf3's connection.
      
      thash_entries  sockets  length  Gbps
             524288        1       1  50.7
                        24Mi      48  45.1
      
      It is basically related to the length of the list of each hash bucket.
      For testing purposes to see how performance drops along the length,
      I set 131072 (1Mi / 8) to thash_entries, and here's the result.
      
      thash_entries  sockets  length  Gbps
             131072        1       1  50.7
                         1Mi       8  49.9
                         2Mi      16  48.9
                         4Mi      32  47.3
                         8Mi      64  44.6
                        16Mi     128  40.6
                        24Mi     192  36.3
                        32Mi     256  32.5
                        40Mi     320  27.0
                        48Mi     384  25.0
      
      To resolve the socket lookup degradation, we introduce an optional
      per-netns hash table for TCP, but it's just ehash, and we still share
      the global bhash, bhash2 and lhash2.
      
      With a smaller ehash, we can look up non-listener sockets faster and
      isolate such noisy neighbours.  Also, we can reduce lock contention.
      
      For details, please see the last patch.
      
        patch 1 - 4: prep for per-netns ehash
        patch     5: small optimisation for netns dismantle without TIME_WAIT sockets
        patch     6: add per-netns ehash
      
      Many thanks to Eric Dumazet for reviewing and advising.
      ====================
      
      Link: https://lore.kernel.org/r/20220908011022.45342-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Introduce optional per-netns ehash. · d1e5e640
      Kuniyuki Iwashima authored
      The more sockets we have in the hash table, the longer we spend looking
      up the socket.  While running a number of small workloads on the same
      host, they penalise each other and cause performance degradation.
      
      The root cause might be a single workload that consumes much more
      resources than the others.  It often happens on a cloud service where
      different workloads share the same computing resource.
      
      On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
      entries), after running iperf3 in different netns, creating 24Mi sockets
      without data transfer in the root netns causes about 10% performance
      regression for the iperf3's connection.
      
      thash_entries  sockets  length  Gbps
             524288        1       1  50.7
                        24Mi      48  45.1
      
      It is basically related to the length of the list of each hash bucket.
      For testing purposes to see how performance drops along the length,
      I set 131072 (1Mi / 8) to thash_entries, and here's the result.
      
       thash_entries		sockets		length		Gbps
              131072		      1		     1		50.7
      			    1Mi		     8		49.9
      			    2Mi		    16		48.9
      			    4Mi		    32		47.3
      			    8Mi		    64		44.6
      			   16Mi		   128		40.6
      			   24Mi		   192		36.3
      			   32Mi		   256		32.5
      			   40Mi		   320		27.0
      			   48Mi		   384		25.0
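
      The "length" column can be reproduced from the other two columns.  The
      following is only a sketch in Python: it assumes sockets are spread
      uniformly over the buckets, and the helper names are made up for
      illustration.  The numbers are copied from the tables above.

```python
MI = 1 << 20            # 1Mi
BASELINE_GBPS = 50.7    # throughput with a single socket (list length 1)

# (sockets, measured Gbps) pairs copied from the thash_entries=131072 table
MEASUREMENTS = [
    (1 * MI, 49.9), (2 * MI, 48.9), (4 * MI, 47.3),
    (8 * MI, 44.6), (16 * MI, 40.6), (24 * MI, 36.3),
    (32 * MI, 32.5), (40 * MI, 27.0), (48 * MI, 25.0),
]


def average_list_length(sockets, buckets):
    """Average sockets per bucket, assuming a uniform hash distribution."""
    return sockets // buckets


def throughput_drop_percent(gbps, baseline=BASELINE_GBPS):
    """Relative throughput loss against the single-socket baseline."""
    return (baseline - gbps) / baseline * 100


if __name__ == "__main__":
    for sockets, gbps in MEASUREMENTS:
        length = average_list_length(sockets, 131072)
        drop = throughput_drop_percent(gbps)
        print(f"{sockets // MI:>3}Mi  length={length:>3}  drop={drop:5.1f}%")
```

      With 524288 buckets and 24Mi sockets the same helper gives a length of
      48, matching the first table.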
      
      To resolve the socket lookup degradation, we introduce an optional
      per-netns hash table for TCP, but it's just ehash, and we still share
      the global bhash, bhash2 and lhash2.
      
      With a smaller ehash, we can look up non-listener sockets faster and
      isolate such noisy neighbours.  In addition, we can reduce lock contention.
      
      We can control the ehash size by a new sysctl knob.  However, depending
      on workloads, it will require very sensitive tuning, so we disable the
      feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover,
      we can fall back to using the global ehash in case we fail to allocate
      enough memory for a new ehash.  The maximum size is 16Mi, which is large
      enough that even if we have 48Mi sockets, the average list length is 3,
      and regression would be less than 1%.
      
      We can check the current ehash size by another read-only sysctl knob,
      net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
      the global ehash (per-netns ehash is disabled or failed to allocate
      memory).
      
        # dmesg | cut -d ' ' -f 5- | grep "established hash"
        TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
      
        # sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries
      
        # sysctl net.ipv4.tcp_child_ehash_entries
        net.ipv4.tcp_child_ehash_entries = 0  # disabled by default
      
        # ip netns add test1
        # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = -524288  # share the global ehash
      
        # sysctl -w net.ipv4.tcp_child_ehash_entries=100
        net.ipv4.tcp_child_ehash_entries = 100
      
        # ip netns add test2
        # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets
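
      The transcript above shows a requested size of 100 turning into 128
      buckets.  A minimal Python sketch of that behaviour follows, assuming
      the size is rounded up to the next power of two (which is consistent
      with the "2^n buckets" comment but is not taken from the kernel
      source).  It also assumes, based on the -524288 example, that the
      magnitude of a negative value is the size of the shared global ehash.
      Both function names are hypothetical.

```python
def round_up_pow2(n):
    """Smallest power of two >= n (assumed rounding rule, see lead-in)."""
    if n < 1:
        raise ValueError("size must be positive")
    return 1 << (n - 1).bit_length()


def describe_ehash_entries(value):
    """Interpret a net.ipv4.tcp_ehash_entries reading."""
    if value < 0:
        # Negative: the netns shares the global ehash.
        return f"shares the global ehash ({-value} buckets)"
    return f"owns a per-netns ehash ({value} buckets)"


if __name__ == "__main__":
    print(round_up_pow2(100))
    print(describe_ehash_entries(-524288))
    print(describe_ehash_entries(128))
```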
      
      When two or more processes in the same netns create per-netns ehash
      concurrently with different sizes, we need to guarantee the size in
      one of the following ways:
      
        1) Share the global ehash and create per-netns ehash
      
        First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
        netns sysctl knobs where we can safely change tcp_child_ehash_entries
        and clone()/unshare() to create a per-netns ehash.
      
        2) Control write on sysctl by BPF
      
        We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
        sysctl knobs.
      
      Note that the global ehash allocated at boot time is spread over
      available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
      pages for each per-netns ehash depending on the current process's NUMA
      policy.  By default, the allocation is done in the local node only, so
      the per-netns hash table could fully reside on a random node.  Thus,
      depending on the NUMA policy the netns is created with and the CPU the
      current thread is running on, we could see some performance differences
      for highly optimised networking applications.
      
      Note also that the default values of two sysctl knobs depend on the ehash
      size and should be tuned carefully:
      
        tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
        tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
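
      To get a feel for the resulting defaults, the two formulas above can
      be evaluated for a few sizes.  This is a sketch only: it assumes
      integer division, and the function name is hypothetical.

```python
def default_tcp_limits(ehash_entries):
    """Defaults derived from the ehash size, per the two formulas above."""
    return {
        "tcp_max_tw_buckets": ehash_entries // 2,
        "tcp_max_syn_backlog": max(128, ehash_entries // 128),
    }


if __name__ == "__main__":
    for entries in (100, 524288, 16 << 20):
        print(entries, default_tcp_limits(entries))
```

      For small tables the max() keeps tcp_max_syn_backlog at 128, so only
      tcp_max_tw_buckets shrinks with the ehash size.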
      
      As a bonus, we can dismantle netns faster.  Currently, while destroying
      netns, we call inet_twsk_purge(), which walks through the global ehash.
      The global ehash can be big because it holds many non-TIME_WAIT sockets
      from all netns.  With a split ehash, inet_twsk_purge() only needs to
      walk each netns's own table to clean up its TIME_WAIT sockets.
      
      With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
      to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
      Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
      keep it protocol-family-independent.
      
      In the future, we could optimise ehash lookup/iteration further by removing
      netns comparison for the per-netns ehash.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d1e5e640
    • Kuniyuki Iwashima's avatar
      tcp: Save unnecessary inet_twsk_purge() calls. · edc12f03
      Kuniyuki Iwashima authored
      While destroying netns, we call inet_twsk_purge() in tcp_sk_exit_batch()
      and tcpv6_net_exit_batch() for AF_INET and AF_INET6.  The following
      commands make the kernel walk through the potentially big ehash twice
      even though the netns has no TIME_WAIT sockets.
      
        # ip netns add test
        # ip netns del test
      
        or
      
        # unshare -n /bin/true >/dev/null
      
      When tw_refcount is 1, the netns has no TIME_WAIT sockets, so we need
      not call inet_twsk_purge() for that netns.  We can skip the iteration
      entirely if all netns in net_exit_list have no TIME_WAIT sockets.  This
      change eliminates the cost of the additional unshare() described in the
      next patch to guarantee the per-netns ehash size.
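
      The skip condition can be modelled outside the kernel.  The Python
      sketch below is only an illustration of the rule stated above, not the
      kernel code: it takes the tw_refcount value of each netns in
      net_exit_list and reports whether a purge is still needed.  The
      function name is hypothetical.

```python
def needs_twsk_purge(tw_refcounts):
    """True if any exiting netns may still hold TIME_WAIT sockets.

    A tw_refcount of 1 is treated as "no TIME_WAIT sockets", as described
    in the text above; any other value forces the ehash walk.
    """
    return any(refcount != 1 for refcount in tw_refcounts)


if __name__ == "__main__":
    print(needs_twsk_purge([1, 1, 1]))   # every netns is idle
    print(needs_twsk_purge([1, 3, 1]))   # one netns has TIME_WAIT sockets
```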
      
      Tested:
      
        # mount -t debugfs none /sys/kernel/debug/
        # echo cleanup_net > /sys/kernel/debug/tracing/set_ftrace_filter
        # echo inet_twsk_purge >> /sys/kernel/debug/tracing/set_ftrace_filter
        # echo function > /sys/kernel/debug/tracing/current_tracer
        # cat ./add_del_unshare.sh
        for i in `seq 1 40`
        do
            (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
        done
        wait;
        # ./add_del_unshare.sh
      
      Before the patch:
      
        # cat /sys/kernel/debug/tracing/trace_pipe
          kworker/u128:0-8       [031] ...1.   174.162765: cleanup_net <-process_one_work
          kworker/u128:0-8       [031] ...1.   174.240796: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [032] ...1.   174.244759: inet_twsk_purge <-tcp_sk_exit_batch
          kworker/u128:0-8       [034] ...1.   174.290861: cleanup_net <-process_one_work
          kworker/u128:0-8       [039] ...1.   175.245027: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [046] ...1.   175.290541: inet_twsk_purge <-tcp_sk_exit_batch
          kworker/u128:0-8       [037] ...1.   175.321046: cleanup_net <-process_one_work
          kworker/u128:0-8       [024] ...1.   175.941633: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [025] ...1.   176.242539: inet_twsk_purge <-tcp_sk_exit_batch
      
      After:
      
        # cat /sys/kernel/debug/tracing/trace_pipe
          kworker/u128:0-8       [038] ...1.   428.116174: cleanup_net <-process_one_work
          kworker/u128:0-8       [038] ...1.   428.262532: cleanup_net <-process_one_work
          kworker/u128:0-8       [030] ...1.   429.292645: cleanup_net <-process_one_work
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      edc12f03