1. 09 Oct, 2023 1 commit
  2. 07 Oct, 2023 7 commits
  3. 06 Oct, 2023 8 commits
    • net: sched: cls_u32: Fix allocation size in u32_init() · c4d49196
      Gustavo A. R. Silva authored
      commit d61491a5 ("net/sched: cls_u32: Replace one-element array
      with flexible-array member") incorrectly replaced an instance of
      `sizeof(*tp_c)` with `struct_size(tp_c, hlist->ht, 1)`. This results
      in an over-allocation of 8 bytes.
      
      This change is wrong because `hlist` in `struct tc_u_common` is a
      pointer:
      
      net/sched/cls_u32.c:
      struct tc_u_common {
              struct tc_u_hnode __rcu *hlist;
              void                    *ptr;
              int                     refcnt;
              struct idr              handle_idr;
              struct hlist_node       hnode;
              long                    knodes;
      };
      
      So, the use of `struct_size()` makes no sense: we don't need to allocate
      any extra space for a flexible-array member. `sizeof(*tp_c)` is just fine.
      
      So, `struct_size(tp_c, hlist->ht, 1)` translates to:
      
      sizeof(*tp_c) + sizeof(tp_c->hlist->ht) ==
      sizeof(struct tc_u_common) + sizeof(struct tc_u_knode *) ==
      						144 + 8  == 0x98 (bytes)
      						     ^^^
      						      |
      						unnecessary extra
      						allocation size
      
      $ pahole -C tc_u_common net/sched/cls_u32.o
      struct tc_u_common {
      	struct tc_u_hnode *        hlist;                /*     0     8 */
      	void *                     ptr;                  /*     8     8 */
      	int                        refcnt;               /*    16     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	struct idr                 handle_idr;           /*    24    96 */
      	/* --- cacheline 1 boundary (64 bytes) was 56 bytes ago --- */
      	struct hlist_node          hnode;                /*   120    16 */
      	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
      	long int                   knodes;               /*   136     8 */
      
      	/* size: 144, cachelines: 3, members: 6 */
      	/* sum members: 140, holes: 1, sum holes: 4 */
      	/* last cacheline: 16 bytes */
      };
      
      And with `sizeof(*tp_c)`, we have:
      
      	sizeof(*tp_c) == sizeof(struct tc_u_common) == 144 == 0x90 (bytes)
      
      which is the correct and original allocation size.
      
      Fix this issue by replacing `struct_size(tp_c, hlist->ht, 1)` with
      `sizeof(*tp_c)`, and avoid allocating 8 too many bytes.
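
The size difference described above can be reproduced in userspace with a minimal sketch. The struct and macro names below (`hnode_demo`, `common_demo`, `STRUCT_SIZE_DEMO`) are hypothetical stand-ins, not the kernel definitions, and the macro is a simplified form of `struct_size()` without the kernel's overflow checks.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct knode_demo;

struct hnode_demo {
	unsigned int divisor;
	struct knode_demo *ht[];	/* real flexible-array member */
};

struct common_demo {
	struct hnode_demo *hlist;	/* a pointer, not a flexible array */
	void *ptr;
	int refcnt;
	long knodes;
};

/* Simplified form of struct_size(); the kernel macro also checks for overflow. */
#define STRUCT_SIZE_DEMO(p, member, count) \
	(sizeof(*(p)) + (size_t)(count) * sizeof(*(p)->member))

/* Size requested by the incorrect call: one extra pointer-sized slot. */
size_t common_size_with_struct_size(void)
{
	struct common_demo *tp_c = NULL;	/* only used inside sizeof, never dereferenced */

	return STRUCT_SIZE_DEMO(tp_c, hlist->ht, 1);
}

/* Size requested by the fixed call. */
size_t common_size_with_sizeof(void)
{
	return sizeof(struct common_demo);
}
```

The first function returns exactly one pointer size more than the second, because the macro adds the size of one `ht[]` element that lives in a different allocation.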
      
      The following difference in binary output is expected and reflects the
      desired change:
      
      | net/sched/cls_u32.o
      | @@ -6148,7 +6148,7 @@
      | include/linux/slab.h:599
      |     2cf5:      mov    0x0(%rip),%rdi        # 2cfc <u32_init+0xfc>
      |                        2cf8: R_X86_64_PC32     kmalloc_caches+0xc
      |-    2cfc:      mov    $0x98,%edx
      |+    2cfc:      mov    $0x90,%edx
      Reported-by: Alejandro Colomar <alx@kernel.org>
      Closes: https://lore.kernel.org/lkml/09b4a2ce-da74-3a19-6961-67883f634d98@kernel.org/
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c4d49196
    • Merge branch 'qca8k-fixes' · 3d35d954
      David S. Miller authored
      Marek Behún says:
      
      ====================
      net: dsa: qca8k: fix qca8k driver for Turris 1.x
      
      this is v2 of
        https://lore.kernel.org/netdev/20231002104612.21898-1-kabel@kernel.org/
      
      Changes since v1:
      - fixed a typo in commit message noticed by Simon Horman
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d35d954
    • net: dsa: qca8k: fix potential MDIO bus conflict when accessing internal PHYs via management frames · 526c8ee0
      Marek Behún authored
      Besides the QCA8337 switch, the Turris 1.x device also has a Micron
      ethernet PHY (dedicated to the WAN port) on its MDIO bus.
      
      We've been experiencing a strange behavior of the WAN ethernet
      interface, wherein the WAN PHY started timing out the MDIO accesses, for
      example when the interface was brought down and then back up.
      
      Bisecting led to commit 2cd54856 ("net: dsa: qca8k: add support for
      phy read/write with mgmt Ethernet"), which added support to access the
      QCA8337 switch's internal PHYs via management ethernet frames.
      
      Connecting the MDIO bus pins onto an oscilloscope, I was able to see
      that the MDIO bus was active whenever a request to read/write an
      internal PHY register was done via a management ethernet frame.
      
      My theory is that the switch core always communicates with the
      internal PHYs via the MDIO bus, even when externally we request the
      access via ethernet. This MDIO bus is the same one via which the switch
      and internal PHYs are accessible to the board, and the board may have
      other devices connected on this bus. An ASCII illustration may give more
      insight:
      
                 +---------+
            +----|         |
            |    | WAN PHY |
            | +--|         |
            | |  +---------+
            | |
            | |  +----------------------------------+
            | |  | QCA8337                          |
      MDC   | |  |                        +-------+ |
      ------o-+--|--------o------------o--|       | |
      MDIO    |  |        |            |  | PHY 1 |-|--to RJ45
      --------o--|---o----+---------o--+--|       | |
                 |   |    |         |  |  +-------+ |
      	   | +-------------+  |  o--|       | |
      	   | | MDIO MDC    |  |  |  | PHY 2 |-|--to RJ45
      eth1	   | |             |  o--+--|       | |
      -----------|-|port0        |  |  |  +-------+ |
                 | |             |  |  o--|       | |
      	   | | switch core |  |  |  | PHY 3 |-|--to RJ45
                 | +-------------+  o--+--|       | |
      	   |                  |  |  +-------+ |
      	   |                  |  o--|  ...  | |
      	   +----------------------------------+
      
      When we send a request to read an internal PHY register via an ethernet
      management frame via eth1, the switch core receives the ethernet frame
      on port 0 and then communicates with the internal PHY via MDIO. At this
      time, other potential devices, such as the WAN PHY on Turris 1.x, cannot
      use the MDIO bus, since it may cause a bus conflict.
      
      Fix this issue by locking the MDIO bus even when we are accessing the
      PHY registers via ethernet management frames.
      
      Fixes: 2cd54856 ("net: dsa: qca8k: add support for phy read/write with mgmt Ethernet")
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Reviewed-by: Christian Marangi <ansuelsmth@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      526c8ee0
    • net: dsa: qca8k: fix regmap bulk read/write methods on big endian systems · 5652d174
      Marek Behún authored
      Commit c766e077 ("net: dsa: qca8k: convert to regmap read/write
      API") introduced bulk read/write methods to qca8k's regmap.
      
      The regmap bulk read/write methods get the register address in a buffer
      passed as a void pointer parameter (the same buffer also contains the
      read/written values). The register address occupies only as many bytes
      as it requires at the beginning of this buffer. For example if the
      .reg_bits member in regmap_config is 16 (as is the case for this
      driver), the register address occupies only the first 2 bytes in this
      buffer, so it can be cast to u16.
      
      But the original commit implementing these bulk read/write methods cast
      the buffer to u32:
        u32 reg = *(u32 *)reg_buf & U16_MAX;
      taking the first 4 bytes. This works on little endian systems where the
      first 2 bytes of the buffer correspond to the low 16-bits, but it
      obviously cannot work on big endian systems.
      
      Fix this by casting the beginning of the buffer to u16 as
         u32 reg = *(u16 *)reg_buf;
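
A minimal userspace sketch of the difference between the two reads follows. The function names and the 4-byte demo buffer are hypothetical, and `memcpy()` is used in place of the pointer casts so the sketch stays portable; it is an illustration, not the driver code.

```c
#include <stdint.h>
#include <string.h>

/*
 * Read a 16-bit register address stored in native byte order in the
 * first 2 bytes of buf. This gives the same value on any host.
 */
uint32_t reg_from_buf_u16(const void *buf)
{
	uint16_t reg;

	memcpy(&reg, buf, sizeof(reg));
	return reg;
}

/*
 * The old approach: read 4 bytes and keep the low 16 bits. On a big
 * endian host the low 16 bits come from bytes 2..3 of the buffer, which
 * hold data, not the register address.
 */
uint32_t reg_from_buf_u32_masked(const void *buf)
{
	uint32_t word;

	memcpy(&word, buf, sizeof(word));
	return word & UINT16_MAX;
}

/* Hypothetical helper: build a buffer holding reg followed by 2 data bytes. */
void fill_demo_buf(uint8_t buf[4], uint16_t reg)
{
	memcpy(buf, &reg, sizeof(reg));
	buf[2] = 0xaa;
	buf[3] = 0xbb;
}
```

With a buffer built by `fill_demo_buf(buf, 0x1234)`, the 16-bit read returns 0x1234 everywhere, while the masked 32-bit read returns 0x1234 only on little endian hosts.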
      
      Fixes: c766e077 ("net: dsa: qca8k: convert to regmap read/write API")
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Tested-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Christian Marangi <ansuelsmth@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5652d174
    • Merge branch 'lynx-28g-fixes' · 109c2de9
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Fixes for lynx-28g PHY driver
      
      This series fixes some issues in the Lynx 28G SerDes driver, namely an
      oops when unloading the module, a race between the periodic workqueue
      and the PHY API, and a race between phy_set_mode_ext() calls on multiple
      lanes on the same SerDes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      109c2de9
    • phy: lynx-28g: serialize concurrent phy_set_mode_ext() calls to shared registers · 139ad114
      Vladimir Oltean authored
      The protocol converter configuration registers PCC8, PCCC, PCCD
      (implemented by the driver), as well as others, control protocol
      converters from multiple lanes (each represented as a different
      struct phy). So, if there are simultaneous calls to phy_set_mode_ext()
      to lanes sharing the same PCC register (either for the "old" or for the
      "new" protocol), corruption of the values programmed to hardware is
      possible, because lynx_28g_rmw() has no locking.
      
      Add a spinlock in the struct lynx_28g_priv shared by all lanes, and take
      the global spinlock from the phy_ops :: set_mode() implementation. There
      are no other callers which modify PCC registers.
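
The locking pattern can be sketched in userspace roughly as follows. Everything here is an assumption made for illustration: the `serdes_demo` structure, the 4-bit-per-lane field layout, and the C11 `atomic_flag` spinlock standing in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of state shared by all lanes of one SerDes. */
struct serdes_demo {
	atomic_flag pcc_lock;	/* stands in for the kernel spinlock */
	uint32_t pcc8;		/* one register shared by several lanes */
};

void serdes_demo_init(struct serdes_demo *priv)
{
	atomic_flag_clear(&priv->pcc_lock);
	priv->pcc8 = 0;
}

/* Unlocked read-modify-write, analogous to a helper with no locking. */
static void demo_rmw(uint32_t *reg, uint32_t val, uint32_t mask)
{
	uint32_t tmp = *reg;

	tmp &= ~mask;
	tmp |= val & mask;
	*reg = tmp;
}

/*
 * Set the protocol of one lane (assumed 4-bit field per lane, lane < 8).
 * The lock is taken by the caller of the rmw helper, so two lanes that
 * share the register cannot interleave their read and write steps.
 */
void lane_demo_set_mode(struct serdes_demo *priv, unsigned int lane,
			uint32_t proto)
{
	uint32_t shift = lane * 4;

	while (atomic_flag_test_and_set_explicit(&priv->pcc_lock,
						 memory_order_acquire))
		;	/* spin until the current holder releases the lock */

	demo_rmw(&priv->pcc8, proto << shift, UINT32_C(0xf) << shift);

	atomic_flag_clear_explicit(&priv->pcc_lock, memory_order_release);
}
```

Taking the lock in the mode-setting path rather than inside the helper mirrors the description above, where set_mode() is the only caller that modifies the shared registers.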
      
      Fixes: 8f73b37c ("phy: add support for the Layerscape SerDes 28G")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      139ad114
    • phy: lynx-28g: lock PHY while performing CDR lock workaround · 0ac87fe5
      Vladimir Oltean authored
      lynx_28g_cdr_lock_check() runs once per second in a workqueue to reset
      the lane receiver if the CDR has not locked onto bit transitions in the
      RX stream. But the PHY consumer may do stuff with the PHY simultaneously,
      and that isn't okay. Block concurrent generic PHY calls by holding the
      PHY mutex from this workqueue.
      
      Fixes: 8f73b37c ("phy: add support for the Layerscape SerDes 28G")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ac87fe5
    • phy: lynx-28g: cancel the CDR check work item on the remove path · f200bab3
      Ioana Ciornei authored
      The blamed commit added the CDR check work item but didn't cancel it on
      the remove path. Fix this by adding a remove function which takes care
      of it.
      
      Fixes: 8f73b37c ("phy: add support for the Layerscape SerDes 28G")
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f200bab3
  4. 05 Oct, 2023 20 commits
    • Merge tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · f291209e
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from Bluetooth, netfilter, BPF and WiFi.
      
        I didn't collect precise data but feels like we've got a lot of 6.5
        fixes here. WiFi fixes are most user-awaited.
      
        Current release - regressions:
      
         - Bluetooth: fix hci_link_tx_to RCU lock usage
      
        Current release - new code bugs:
      
         - bpf: mprog: fix maximum program check on mprog attachment
      
         - eth: ti: icssg-prueth: fix signedness bug in prueth_init_tx_chns()
      
        Previous releases - regressions:
      
         - ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling
      
         - vringh: don't use vringh_kiov_advance() in vringh_iov_xfer(), it
           doesn't handle zero length like we expected
      
         - wifi:
            - cfg80211: fix cqm_config access race, fix crashes with brcmfmac
            - iwlwifi: mvm: handle PS changes in vif_cfg_changed
            - mac80211: fix mesh id corruption on 32 bit systems
            - mt76: mt76x02: fix MT76x0 external LNA gain handling
      
         - Bluetooth: fix handling of HCI_QUIRK_STRICT_DUPLICATE_FILTER
      
         - l2tp: fix handling of transhdrlen in __ip{,6}_append_data()
      
         - dsa: mv88e6xxx: avoid EEPROM timeout when EEPROM is absent
      
         - eth: stmmac: fix the incorrect parameter after refactoring
      
        Previous releases - always broken:
      
         - net: replace calls to sock->ops->connect() with kernel_connect(),
           prevent address rewrite in kernel_bind(); otherwise BPF hooks may
           modify arguments, unexpectedly to the caller
      
         - tcp: fix delayed ACKs when reads and writes align with MSS
      
         - bpf:
            - verifier: unconditionally reset backtrack_state masks on global
              func exit
            - s390: let arch_prepare_bpf_trampoline return program size, fix
              struct_ops offsets
            - sockmap: fix accounting of available bytes in presence of PEEKs
            - sockmap: reject sk_msg egress redirects to non-TCP sockets
      
         - ipv4/fib: send netlink notify when delete source address routes
      
         - ethtool: plca: fix width of reads when parsing netlink commands
      
         - netfilter: nft_payload: rebuild vlan header on h_proto access
      
         - Bluetooth: hci_codec: fix leaking memory of local_codecs
      
         - eth: intel: ice: always add legacy 32byte RXDID in supported_rxdids
      
         - eth: stmmac:
           - dwmac-stm32: fix resume on STM32 MCU
           - remove buggy and unneeded stmmac_poll_controller, depend on NAPI
      
         - ibmveth: always recompute TCP pseudo-header checksum, fix use of
           the driver with Open vSwitch
      
         - wifi:
            - rtw88: rtw8723d: fix MAC address offset in EEPROM
            - mt76: fix lock dependency problem for wed_lock
            - mwifiex: sanity check data reported by the device
            - iwlwifi: ensure ack flag is properly cleared
            - iwlwifi: mvm: fix a memory corruption due to bad pointer arithm
            - iwlwifi: mvm: fix incorrect usage of scan API
      
        Misc:
      
         - wifi: mac80211: work around Cisco AP 9115 VHT MPDU length"
      
      * tag 'net-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
        MAINTAINERS: update Matthieu's email address
        mptcp: userspace pm allow creating id 0 subflow
        mptcp: fix delegated action races
        net: stmmac: remove unneeded stmmac_poll_controller
        net: lan743x: also select PHYLIB
        net: ethernet: mediatek: disable irq before schedule napi
        net: mana: Fix oversized sge0 for GSO packets
        net: mana: Fix the tso_bytes calculation
        net: mana: Fix TX CQE error handling
        netlink: annotate data-races around sk->sk_err
        sctp: update hb timer immediately after users change hb_interval
        sctp: update transport state when processing a dupcook packet
        tcp: fix delayed ACKs for MSS boundary condition
        tcp: fix quick-ack counting to count actual ACKs of new data
        page_pool: fix documentation typos
        tipc: fix a potential deadlock on &tx->lock
        net: stmmac: dwmac-stm32: fix resume on STM32 MCU
        ipv4: Set offload_failed flag in fibmatch results
        netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
        netfilter: nf_tables: Deduplicate nft_register_obj audit logs
        ...
      f291209e
    • Merge tag 'integrity-v6.6-fix' of... · cb84fb87
      Linus Torvalds authored
      Merge tag 'integrity-v6.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
      
      Pull integrity fixes from Mimi Zohar:
       "Two additional patches to fix the removal of the deprecated
        IMA_TRUSTED_KEYRING Kconfig"
      
      * tag 'integrity-v6.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
        ima: rework CONFIG_IMA dependency block
        ima: Finish deprecation of IMA_TRUSTED_KEYRING Kconfig
      cb84fb87
    • Merge tag 'leds-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds · e90822d7
      Linus Torvalds authored
      Pull LED fix from Lee Jones:
       "Just the one bug-fix:
      
         - Fix regression affecting LED_COLOR_ID_MULTI users"
      
      * tag 'leds-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds:
        leds: Drop BUG_ON check for LED_COLOR_ID_MULTI
      e90822d7
    • Merge tag 'mfd-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd · bc622f16
      Linus Torvalds authored
      Pull MFD fixes from Lee Jones:
       "A couple of small fixes:
      
         - Potential build failure in CS42L43
      
         - Device Tree bindings clean-up for a superseded patch"
      
      * tag 'mfd-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
        dt-bindings: mfd: Revert "dt-bindings: mfd: maxim,max77693: Add USB connector"
        mfd: cs42l43: Fix MFD_CS42L43 dependency on REGMAP_IRQ
      bc622f16
    • Merge tag 'ovl-fixes-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs · 403688e0
      Linus Torvalds authored
      Pull overlayfs fixes from Amir Goldstein:
      
       - Fix for file reference leak regression
      
       - Fix for NULL pointer deref regression
      
       - Fixes for RCU-walk race regressions:
      
         Two of the fixes were taken from Al's RCU pathwalk race fixes series
         with his consent [1].
      
         Note that unlike most of Al's series, these two patches are not about
         racing with ->kill_sb() and they are also very recent regressions
         from v6.5, so I think it's worth getting them into v6.5.y.
      
         There is also a fix for an RCU pathwalk race with ->kill_sb(), which
         may have been solved in vfs generic code as you suggested, but it
         also rids overlayfs from a nasty hack, so I think it's worth anyway.
      
      Link: https://lore.kernel.org/linux-fsdevel/20231003204749.GA800259@ZenIV/ [1]
      
      * tag 'ovl-fixes-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
        ovl: fix NULL pointer defer when encoding non-decodable lower fid
        ovl: make use of ->layers safe in rcu pathwalk
        ovl: fetch inode once in ovl_dentry_revalidate_common()
        ovl: move freeing ovl_entry past rcu delay
        ovl: fix file reference leak when submitting aio
      403688e0
    • Merge branch 'mptcp-fixes-and-maintainer-email-update-for-v6-6' · c29d9845
      Jakub Kicinski authored
      Mat Martineau says:
      
      ====================
      mptcp: Fixes and maintainer email update for v6.6
      
      Patch 1 addresses a race condition in MPTCP "delegated actions"
      infrastructure. Affects v5.19 and later.
      
      Patch 2 removes an unnecessary restriction that did not allow additional
      outgoing subflows using the local address of the initial MPTCP subflow.
      v5.16 and later.
      
      Patch 3 updates Matthieu's email address.
      ====================
      
      Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-0-28de4ac663ae@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c29d9845
    • MAINTAINERS: update Matthieu's email address · 8eed6ee3
      Matthieu Baerts authored
      Use my kernel.org account instead.
      
      The other one will bounce by the end of the year.
      Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
      Signed-off-by: Mat Martineau <martineau@kernel.org>
      Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-3-28de4ac663ae@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8eed6ee3
    • mptcp: userspace pm allow creating id 0 subflow · e5ed101a
      Geliang Tang authored
      This patch drops id 0 limitation in mptcp_nl_cmd_sf_create() to allow
      creating additional subflows with the local addr ID 0.
      
      There is no reason not to allow additional subflows from this local
      address: we should be able to create new subflows from the initial
      endpoint. This limitation was breaking fullmesh support from userspace.
      
      Fixes: 702c2f64 ("mptcp: netlink: allow userspace-driven subflow establishment")
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/391
      Cc: stable@vger.kernel.org
      Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <martineau@kernel.org>
      Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-2-28de4ac663ae@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e5ed101a
    • mptcp: fix delegated action races · a5efdbce
      Paolo Abeni authored
      The delegated action infrastructure is prone to the following
      race: different CPUs can try to schedule different delegated
      actions on the same subflow at the same time.
      
      Each of them will check different bits via mptcp_subflow_delegate(),
      and will try to schedule the action on the related per-cpu napi
      instance.
      
      Depending on the timing, both can observe an empty delegated list
      node, causing the same entry to be added simultaneously on two different
      lists.
      
      The root cause is that the delegated actions infra does not provide
      a single synchronization point. Address the issue by reserving an
      additional bit to mark the subflow as scheduled for delegation.
      Acquiring this bit guarantees that the caller owns the delegated list
      node and can safely schedule the subflow.
      
      Clear this bit only when the subflow scheduling is completed, ensuring
      proper barriers are in place.
      
      Additionally swap the meaning of the delegated_action bitmask, to allow
      the usage of the existing helper to set multiple bits at once.
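
The single-synchronization-point idea can be sketched with C11 atomics. This is a minimal sketch under stated assumptions: the bit names, the `subflow_demo` structure and both functions are hypothetical and do not reproduce the MPTCP code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit layout; the names below are illustrative only. */
#define DEMO_SCHEDULED	(1u << 0)	/* reserved bit: list node is owned */
#define DEMO_ACTION_A	(1u << 1)
#define DEMO_ACTION_B	(1u << 2)

struct subflow_demo {
	atomic_uint delegated_status;
};

/*
 * Record the requested action and try to take ownership of the list node
 * in one atomic step. Returns true only for the caller that flipped
 * DEMO_SCHEDULED from 0 to 1; that caller alone may queue the node.
 */
bool subflow_demo_delegate(struct subflow_demo *sf, unsigned int action)
{
	unsigned int old;

	old = atomic_fetch_or(&sf->delegated_status, action | DEMO_SCHEDULED);
	return !(old & DEMO_SCHEDULED);
}

/*
 * Called once scheduling has completed: release ownership and return the
 * actions that were requested in the meantime.
 */
unsigned int subflow_demo_complete(struct subflow_demo *sf)
{
	unsigned int old = atomic_exchange(&sf->delegated_status, 0);

	return old & ~DEMO_SCHEDULED;
}
```

Because the ownership bit and the action bits are set by the same atomic operation, two CPUs requesting different actions can never both conclude that the list node is free.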
      
      Fixes: bcd97734 ("mptcp: use delegate action to schedule 3rd ack retrans")
      Cc: stable@vger.kernel.org
      Reviewed-by: Mat Martineau <martineau@kernel.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <martineau@kernel.org>
      Link: https://lore.kernel.org/r/20231004-send-net-20231004-v1-1-28de4ac663ae@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a5efdbce
    • net: stmmac: remove unneeded stmmac_poll_controller · 3eef8555
      Remi Pommarel authored
      When using netconsole, netpoll_poll_dev() can be called from interrupt
      context, so calling disable_irq() there causes the following kernel
      warning with CONFIG_DEBUG_ATOMIC_SLEEP enabled:
      
        BUG: sleeping function called from invalid context at kernel/irq/manage.c:137
        in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 10, name: ksoftirqd/0
        CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G        W         5.15.42-00075-g816b502b2298-dirty #117
        Hardware name: aml (r1) (DT)
        Call trace:
         dump_backtrace+0x0/0x270
         show_stack+0x14/0x20
         dump_stack_lvl+0x8c/0xac
         dump_stack+0x18/0x30
         ___might_sleep+0x150/0x194
         __might_sleep+0x64/0xbc
         synchronize_irq+0x8c/0x150
         disable_irq+0x2c/0x40
         stmmac_poll_controller+0x140/0x1a0
         netpoll_poll_dev+0x6c/0x220
         netpoll_send_skb+0x308/0x390
         netpoll_send_udp+0x418/0x760
         write_msg+0x118/0x140 [netconsole]
         console_unlock+0x404/0x500
         vprintk_emit+0x118/0x250
         dev_vprintk_emit+0x19c/0x1cc
         dev_printk_emit+0x90/0xa8
         __dev_printk+0x78/0x9c
         _dev_warn+0xa4/0xbc
         ath10k_warn+0xe8/0xf0 [ath10k_core]
         ath10k_htt_txrx_compl_task+0x790/0x7fc [ath10k_core]
         ath10k_pci_napi_poll+0x98/0x1f4 [ath10k_pci]
         __napi_poll+0x58/0x1f4
         net_rx_action+0x504/0x590
         _stext+0x1b8/0x418
         run_ksoftirqd+0x74/0xa4
         smpboot_thread_fn+0x210/0x3c0
         kthread+0x1fc/0x210
         ret_from_fork+0x10/0x20
      
      Since [0], .ndo_poll_controller is only needed if the driver does not
      use NAPI or uses it only partially. Because stmmac uses NAPI,
      stmmac_poll_controller can be removed, fixing the above warning.
      
      [0] commit ac3d9dd0 ("netpoll: make ndo_poll_controller() optional")
      
      Cc: <stable@vger.kernel.org> # 5.15.x
      Fixes: 47dd7a54 ("net: add support for STMicroelectronics Ethernet controllers")
      Signed-off-by: Remi Pommarel <repk@triplefau.lt>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/1c156a6d8c9170bd6a17825f2277115525b4d50f.1696429960.git.repk@triplefau.lt
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3eef8555
    • net: lan743x: also select PHYLIB · 566aeed6
      Randy Dunlap authored
      Since FIXED_PHY depends on PHYLIB, PHYLIB needs to be set to avoid
      a kconfig warning:
      
      WARNING: unmet direct dependencies detected for FIXED_PHY
        Depends on [n]: NETDEVICES [=y] && PHYLIB [=n]
        Selected by [y]:
        - LAN743X [=y] && NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MICROCHIP [=y] && PCI [=y] && PTP_1588_CLOCK_OPTIONAL [=y]
      
      Fixes: 73c4d1b3 ("net: lan743x: select FIXED_PHY")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: lore.kernel.org/r/202309261802.JPbRHwti-lkp@intel.com
      Cc: Bryan Whitehead <bryan.whitehead@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Reviewed-by: Simon Horman <horms@kernel.org>
      Tested-by: Simon Horman <horms@kernel.org> # build-tested
      Link: https://lore.kernel.org/r/20231002193544.14529-1-rdunlap@infradead.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      566aeed6
    • net: ethernet: mediatek: disable irq before schedule napi · fcdfc462
      Christian Marangi authored
      While searching for a possible refactor of napi_schedule_prep and
      __napi_schedule, it was noticed that the mtk eth driver disables the
      interrupt for rx and tx AFTER napi is scheduled.
      
      While this case is very hard to reproduce, the interrupt can end up
      disabled and never enabled again if the napi completes and re-enables
      the interrupt before the handler disables it.
      
      This is because a napi driven by interrupt expects the following
      sequence:
      1. interrupt received. napi prepared -> interrupt disabled -> napi
         scheduled
      2. napi triggered. ring cleared -> interrupt enabled -> wait for new
         interrupt
      
      To prevent this case, disable the interrupt BEFORE the napi is
      scheduled.
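
The ordering problem can be illustrated with a small userspace model. This is only a sketch: the `napi_demo` structure and all function names are hypothetical, and the model assumes the worst case for the race, where the poll runs to completion as soon as it is scheduled.

```c
#include <stdbool.h>

/* Hypothetical model of one interrupt source driving one NAPI instance. */
struct napi_demo {
	bool irq_enabled;
	bool scheduled;
};

/* Models napi_schedule_prep(): succeeds only if not already scheduled. */
static bool demo_schedule_prep(struct napi_demo *n)
{
	if (n->scheduled)
		return false;
	n->scheduled = true;
	return true;
}

/*
 * Models __napi_schedule() in the worst case for the race: the poll runs
 * to completion right away, clears the ring and re-enables the interrupt.
 */
static void demo_schedule_and_poll(struct napi_demo *n)
{
	n->scheduled = false;
	n->irq_enabled = true;
}

/* Old order: schedule first, disable the interrupt afterwards. */
void demo_irq_handler_old(struct napi_demo *n)
{
	if (demo_schedule_prep(n)) {
		demo_schedule_and_poll(n);
		n->irq_enabled = false;	/* lands after the poll re-enabled it */
	}
}

/* Fixed order: disable the interrupt before scheduling. */
void demo_irq_handler_fixed(struct napi_demo *n)
{
	if (demo_schedule_prep(n)) {
		n->irq_enabled = false;
		demo_schedule_and_poll(n);
	}
}
```

In this model the old handler leaves the interrupt disabled with nothing left to re-enable it, while the fixed handler ends with the interrupt enabled again.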
      
      Fixes: 656e7052 ("net-next: mediatek: add support for MT7623 ethernet")
      Cc: stable@vger.kernel.org
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Link: https://lore.kernel.org/r/20231002140805.568-1-ansuelsmth@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      fcdfc462
    • Merge branch 'net-mana-fix-some-tx-processing-bugs' · defe4b87
      Paolo Abeni authored
      Haiyang Zhang says:
      
      ====================
      net: mana: Fix some TX processing bugs
      
      Fix TX processing bugs on error handling, tso_bytes calculation,
      and sge0 size.
      ====================
      
      Link: https://lore.kernel.org/r/1696020147-14989-1-git-send-email-haiyangz@microsoft.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      defe4b87
    • net: mana: Fix oversized sge0 for GSO packets · a43e8e9f
      Haiyang Zhang authored
      Handle the case when GSO SKB linear length is too large.
      
      MANA NIC requires GSO packets to put only the header part to SGE0,
      otherwise the TX queue may stop at the HW level.
      
      So, use 2 SGEs for the skb linear part which contains more than the
      packet header.
      
      Fixes: ca9c54d2 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      a43e8e9f
    • net: mana: Fix the tso_bytes calculation · 7a54de92
      Haiyang Zhang authored
      sizeof(struct hop_jumbo_hdr) is not part of tso_bytes, so remove
      the subtraction from header size.
      
      Cc: stable@vger.kernel.org
      Fixes: bd7fc6e1 ("net: mana: Add new MANA VF performance counters for easier troubleshooting")
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      7a54de92
    • net: mana: Fix TX CQE error handling · b2b00006
      Haiyang Zhang authored
      For an unknown TX CQE error type (probably from newer hardware),
      still free the SKB, update the queue tail, etc., otherwise the
      accounting will be wrong.
      
      Also, TX errors can be triggered by injecting corrupted packets, so
      replace the WARN_ONCE with ratelimited error logging.
      
      Cc: stable@vger.kernel.org
      Fixes: ca9c54d2 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      b2b00006
    • Linus Torvalds's avatar
      Merge tag 'rtla-v6.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux · 3006adf3
      Linus Torvalds authored
      Pull rtla fixes from Daniel Bristot de Oliveira:
       "rtla (Real-Time Linux Analysis) tool fixes.
      
        Timerlat auto-analysis:
      
         - Timerlat was reporting thread interference time even when no
           thread noise events occurred. This happened because the thread
           interference variable was not reset after the analysis of a
           timerlat activation that did not hit the threshold.
      
         - The IRQ handler delay is estimated from the delta of the IRQ
           latency reported by timerlat, and the timestamp from IRQ handler
           start event. If the delta is near-zero, the drift from the external
           clock and the trace event and/or the overhead can cause the value
           to be negative. If the value is negative, print a zero-delay.
      
         - IRQ handlers happening after the timerlat thread event but before
           the stop tracing were being reported as IRQ that happened before
           the *current* IRQ occurrence. Ignore Previous IRQ noise in this
           condition because they are valid only for the *next* timerlat
           activation.
      
        Timerlat user-space:
      
         - Timerlat was stopping all user-space threads if a CPU became
           offline. Do not stop the entire tool if a CPU is or becomes
           offline; stop only the thread of the unavailable CPU. Stop the
           tool only if all threads leave because their CPUs are or become
           offline.
      
        man-pages:
      
         - Fix command line example in timerlat hist man page"
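
      The negative IRQ delay fix listed above amounts to clamping an
      estimated delta at zero. A minimal C sketch of that idea follows; the
      function and parameter names are hypothetical and not taken from the
      rtla sources, and it assumes both timestamps are in nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not the rtla implementation): estimate how long
 * the IRQ handler start was delayed relative to when it was expected.
 * If clock drift or tracing overhead puts the observed start before the
 * expected start, report a zero delay instead of a negative value.
 */
static uint64_t irq_start_delay_ns(uint64_t expected_start_ns,
				   uint64_t observed_start_ns)
{
	if (observed_start_ns > expected_start_ns)
		return observed_start_ns - expected_start_ns;
	return 0; /* the delta would be negative: clamp to zero */
}

/* Small usage check for the helper above. */
static void irq_start_delay_demo(void)
{
	assert(irq_start_delay_ns(1000, 1250) == 250);
	assert(irq_start_delay_ns(1250, 1000) == 0);
}
```

      Comparing before subtracting avoids the unsigned wrap-around that a
      plain subtraction would produce when the delta is negative.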
      
      * tag 'rtla-v6.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux:
        rtla: fix a example in rtla-timerlat-hist.rst
        rtla/timerlat: Do not stop user-space if a cpu is offline
        rtla/timerlat_aa: Fix previous IRQ delay for IRQs that happens after thread sample
        rtla/timerlat_aa: Fix negative IRQ delay
        rtla/timerlat_aa: Zero thread sum after every sample analysis
      3006adf3
    • Eric Dumazet's avatar
      netlink: annotate data-races around sk->sk_err · d0f95894
      Eric Dumazet authored
      syzbot caught another data-race in netlink when
      setting sk->sk_err.
      
      Annotate all of them for good measure.
      
      BUG: KCSAN: data-race in netlink_recvmsg / netlink_recvmsg
      
      write to 0xffff8881613bb220 of 4 bytes by task 28147 on cpu 0:
      netlink_recvmsg+0x448/0x780 net/netlink/af_netlink.c:1994
      sock_recvmsg_nosec net/socket.c:1027 [inline]
      sock_recvmsg net/socket.c:1049 [inline]
      __sys_recvfrom+0x1f4/0x2e0 net/socket.c:2229
      __do_sys_recvfrom net/socket.c:2247 [inline]
      __se_sys_recvfrom net/socket.c:2243 [inline]
      __x64_sys_recvfrom+0x78/0x90 net/socket.c:2243
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      write to 0xffff8881613bb220 of 4 bytes by task 28146 on cpu 1:
      netlink_recvmsg+0x448/0x780 net/netlink/af_netlink.c:1994
      sock_recvmsg_nosec net/socket.c:1027 [inline]
      sock_recvmsg net/socket.c:1049 [inline]
      __sys_recvfrom+0x1f4/0x2e0 net/socket.c:2229
      __do_sys_recvfrom net/socket.c:2247 [inline]
      __se_sys_recvfrom net/socket.c:2243 [inline]
      __x64_sys_recvfrom+0x78/0x90 net/socket.c:2243
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x00000000 -> 0x00000016
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 28146 Comm: syz-executor.0 Not tainted 6.6.0-rc3-syzkaller-00055-g9ed22ae6 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023
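
      For readers unfamiliar with this kind of annotation, the sketch below
      approximates the idea in userspace C. The kernel uses READ_ONCE() and
      WRITE_ONCE(); the helpers and the struct here are hypothetical
      stand-ins written for illustration, not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for the socket field; not the kernel's struct sock. */
struct model_sock {
	int sk_err;
};

/*
 * Userspace approximations of READ_ONCE()/WRITE_ONCE() for an int: the
 * volatile access asks the compiler to perform a single load or store
 * and not to fuse, duplicate, or elide it. They mark the race as
 * intentional; they do not add any memory ordering.
 */
static inline int read_once_int(const int *p)
{
	return *(const volatile int *)p;
}

static inline void write_once_int(int *p, int val)
{
	*(volatile int *)p = val;
}

/* Writer side: record an error code through the annotated store. */
static void model_set_err(struct model_sock *sk, int err)
{
	write_once_int(&sk->sk_err, err);
}

/* Reader side: lockless read paired with the annotated store. */
static int model_get_err(const struct model_sock *sk)
{
	return read_once_int(&sk->sk_err);
}

/* Small usage check: 0x16 is the value seen in the report above. */
static void model_err_demo(void)
{
	struct model_sock sk = { .sk_err = 0 };

	model_set_err(&sk, 0x16);
	assert(model_get_err(&sk) == 0x16);
}
```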
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20231003183455.3410550-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d0f95894
    • Xin Long's avatar
      sctp: update hb timer immediately after users change hb_interval · 1f4e803c
      Xin Long authored
      Currently, when hb_interval is changed by users, it won't take effect
      until the next expiry of the hb timer. As the default value is 30s,
      users may have to wait up to 30s for an hb_interval update to take
      effect.
      
      This becomes pretty bad in containers where a much smaller value is
      usually set on hb_interval. This patch improves it by resetting the
      hb timer immediately once the value of hb_interval is updated by users.
      
      Note that we don't address the already existing 'problem' when sending
      a heartbeat 'on demand' if one hb has just been sent (from the timer)
      mentioned in:
      
        https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg590224.html
      
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Link: https://lore.kernel.org/r/75465785f8ee5df2fb3acdca9b8fafdc18984098.1696172660.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1f4e803c
    • Xin Long's avatar
      sctp: update transport state when processing a dupcook packet · 2222a780
      Xin Long authored
      During the 4-way handshake, the transport's state is set to ACTIVE in
      sctp_process_init() when processing INIT_ACK chunk on client or
      COOKIE_ECHO chunk on server.
      
      In the collision scenario below:
      
        192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
          192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
          192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
          192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
          192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
        192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021]
      
      when processing COOKIE_ECHO on 192.168.1.2, as it's in COOKIE_WAIT state,
      sctp_sf_do_dupcook_b() is called by sctp_sf_do_5_2_4_dupcook() where it
      creates a new association and sets its transport to ACTIVE then updates
      to the old association in sctp_assoc_update().
      
      However, in sctp_assoc_update(), it will skip the transport update if it
      finds a transport with the same ipaddr already existing in the old asoc,
      and this causes the old asoc's transport state not to move to ACTIVE
      after the handshake.
      
      This means if DATA retransmission happens at this moment, it won't be able
      to enter PF state because of the check 'transport->state == SCTP_ACTIVE'
      in sctp_do_8_2_transport_strike().
      
      This patch fixes it by updating the transport in sctp_assoc_update() with
      sctp_assoc_add_peer(), which updates the transport state if a transport
      with the same ipaddr already exists in the old asoc.
      
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Link: https://lore.kernel.org/r/fd17356abe49713ded425250cc1ae51e9f5846c6.1696172325.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2222a780
  5. 04 Oct, 2023 4 commits
    • Neal Cardwell's avatar
      tcp: fix delayed ACKs for MSS boundary condition · 4720852e
      Neal Cardwell authored
      This commit fixes poor delayed ACK behavior that can cause poor TCP
      latency in a particular boundary condition: when an application makes
      a TCP socket write that is an exact multiple of the MSS size.
      
      The problem is that there is a painful boundary discontinuity in the
      current delayed ACK behavior. With the current delayed ACK behavior,
      we have:
      
      (1) If an app reads data when > 1*MSS is unacknowledged, then
          tcp_cleanup_rbuf() ACKs immediately because of:
      
           tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
      
      (2) If an app reads all received data, and the packets were < 1*MSS,
          and either (a) the app is not ping-pong or (b) we received two
          packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately because
          of:
      
           ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
            ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
             !inet_csk_in_pingpong_mode(sk))) &&
      
      (3) *However*: if an app reads exactly 1*MSS of data,
          tcp_cleanup_rbuf() does not send an immediate ACK. This is true
          even if the app is not ping-pong and the 1*MSS of data had the PSH
          bit set, suggesting the sending application completed an
          application write.
      
      Thus if the app is not ping-pong, we have this painful case where
      >1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
      write whose last skb is an exact multiple of 1*MSS can get a 40ms
      delayed ACK. This means that any app that transfers data in one
      direction and takes care to align write size or packet size with MSS
      can suffer this problem. With receive zero copy making 4KB MSS values
      more common, it is becoming more common to have application writes
      naturally align with MSS, and more applications are likely to
      encounter this delayed ACK problem.
      
      The fix in this commit is to refine the delayed ACK heuristics with a
      simple check: immediately ACK a received 1*MSS skb with PSH bit set if
      the app reads all data. Why? If an skb has a len of exactly 1*MSS and
      has the PSH bit set then it is likely the end of an application
      write. So more data may not be arriving soon, and yet the data sender
      may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
      set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
      an ACK immediately if the app reads all of the data and is not
      ping-pong. Note that this logic is also executed for the case where
      len > MSS, but in that case this logic does not matter (and does not
      hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
      app reads data and there is more than an MSS of unACKed data.
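
      The three cases and the fix can be modelled in a few lines of
      userspace C. This is a simplified sketch for illustration only: the
      struct, the flag values, and the function names are hypothetical, the
      sub-MSS heuristic is reduced to its essentials, and read_all stands in
      for the kernel's check that the receive queue has been drained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags mirroring ICSK_ACK_PUSHED / ICSK_ACK_PUSHED2. */
enum {
	MODEL_ACK_PUSHED  = 1 << 0,
	MODEL_ACK_PUSHED2 = 1 << 1,
};

struct rx_model {
	uint32_t rcv_mss;     /* models icsk->icsk_ack.rcv_mss */
	uint32_t unacked;     /* models tp->rcv_nxt - tp->rcv_wup */
	unsigned int pending; /* models icsk->icsk_ack.pending */
	bool pingpong;        /* models inet_csk_in_pingpong_mode(sk) */
};

/* Receive one skb of len bytes; apply_fix toggles the new PSH check. */
static void model_rx(struct rx_model *m, uint32_t len, bool psh,
		     bool apply_fix)
{
	assert(m->rcv_mss > 0);
	m->unacked += len;
	if (len < m->rcv_mss) {
		/* Sub-MSS packet: a second one upgrades PUSHED to PUSHED2. */
		if (m->pending & MODEL_ACK_PUSHED)
			m->pending |= MODEL_ACK_PUSHED2;
		m->pending |= MODEL_ACK_PUSHED;
	} else if (apply_fix && psh) {
		/* The fix: a full-MSS skb with PSH likely ends a write. */
		m->pending |= MODEL_ACK_PUSHED;
	}
}

/* Would the receiver ACK immediately when the app reads? */
static bool model_ack_on_read(const struct rx_model *m, bool read_all)
{
	if (m->unacked > m->rcv_mss)                      /* case (1) */
		return true;
	if (read_all &&
	    ((m->pending & MODEL_ACK_PUSHED2) ||
	     ((m->pending & MODEL_ACK_PUSHED) && !m->pingpong)))
		return true;                              /* case (2) */
	return false;                                     /* case (3) */
}
```

      In this model, a single 1*MSS skb with PSH set yields no immediate ACK
      with apply_fix false, and an immediate ACK with apply_fix true, as long
      as the connection is not in ping-pong mode.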
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Cc: Xin Guo <guoxin0309@gmail.com>
      Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4720852e
    • Neal Cardwell's avatar
      tcp: fix quick-ack counting to count actual ACKs of new data · 059217c1
      Neal Cardwell authored
      This commit fixes quick-ack counting so that it only considers that a
      quick-ack has been provided if we are sending an ACK that newly
      acknowledges data.
      
      The code was erroneously using the number of data segments in outgoing
      skbs when deciding how many quick-ack credits to remove. This logic
      does not make sense, and could cause poor performance in
      request-response workloads, like RPC traffic, where requests or
      responses can be multi-segment skbs.
      
      When a TCP connection decides to send N quick-acks, that is to
      accelerate the cwnd growth of the congestion control module
      controlling the remote endpoint of the TCP connection. That quick-ack
      decision is purely about the incoming data and outgoing ACKs. It has
      nothing to do with the outgoing data or the size of outgoing data.
      
      And in particular, an ACK only serves the intended purpose of allowing
      the remote congestion control to grow the congestion window quickly if
      the ACK is ACKing or SACKing new data.
      
      The fix is simple: only count packets as serving the goal of the
      quickack mechanism if they are ACKing/SACKing new data. We can tell
      whether this is the case by checking inet_csk_ack_scheduled(), since
      we schedule an ACK exactly when we are ACKing/SACKing new data.
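
      The change in accounting can be sketched as a small userspace model in
      C. The struct and function names are hypothetical, ack_scheduled stands
      in for inet_csk_ack_scheduled(), and side effects of leaving quick-ack
      mode (such as adjusting the ACK timeout) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

struct quickack_model {
	unsigned int quick; /* remaining quick-ack credits */
	bool ack_scheduled; /* true if this packet ACKs/SACKs new data */
};

/*
 * Consume quick-ack credits when a packet is sent.
 * Old behaviour (apply_fix == false): charge one credit per outgoing
 * data segment. New behaviour: charge at most one credit, and only if
 * the packet actually acknowledges new data.
 */
static void model_dec_quickack(struct quickack_model *m,
			       unsigned int data_segs, bool apply_fix)
{
	unsigned int pkts;

	if (!m->quick)
		return;
	pkts = apply_fix ? (m->ack_scheduled ? 1u : 0u) : data_segs;
	if (pkts >= m->quick)
		m->quick = 0;
	else
		m->quick -= pkts;
}

/* Small usage check: a 10-segment response that ACKs nothing new. */
static void model_quickack_demo(void)
{
	struct quickack_model old_way = { .quick = 16, .ack_scheduled = false };
	struct quickack_model new_way = { .quick = 16, .ack_scheduled = false };

	model_dec_quickack(&old_way, 10, false);
	model_dec_quickack(&new_way, 10, true);
	assert(old_way.quick == 6);  /* credits burned by outgoing data */
	assert(new_way.quick == 16); /* credits preserved */
}
```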
      
      Fixes: fc6415bc ("[TCP]: Fix quick-ack decrementing with TSO.")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      059217c1
    • Jakub Kicinski's avatar
      Merge tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · c56e67f3
      Jakub Kicinski authored
      Florian Westphal says:
      
      ====================
      netfilter patches for net
      
      First patch resolves a regression with vlan header matching, this was
      broken since 6.5 release.  From myself.
      
      Second patch fixes an ancient problem with sctp connection tracking in
      case INIT_ACK packets are delayed.  This comes with a selftest, both
      patches from Xin Long.
      
      Patch 4 extends the existing nftables audit selftest, from
      Phil Sutter.
      
      Patch 5, also from Phil, avoids a situation where nftables
      would emit an audit record twice. This was broken since 5.13 days.
      
      Patch 6, from myself, avoids spurious insertion failure if we encounter an
      overlapping but expired range during element insertion with the
      'nft_set_rbtree' backend. This problem exists since 6.2.
      
      * tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure
        netfilter: nf_tables: Deduplicate nft_register_obj audit logs
        selftests: netfilter: Extend nft_audit.sh
        selftests: netfilter: test for sctp collision processing in nf_conntrack
        netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp
        netfilter: nft_payload: rebuild vlan header on h_proto access
      ====================
      
      Link: https://lore.kernel.org/r/20231004141405.28749-1-fw@strlen.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c56e67f3
    • Randy Dunlap's avatar
      page_pool: fix documentation typos · 513dbc10
      Randy Dunlap authored
      Correct grammar for better readability.
      
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jesper Dangaard Brouer <hawk@kernel.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Link: https://lore.kernel.org/r/20231001003846.29541-1-rdunlap@infradead.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      513dbc10