1. 24 Aug, 2022 23 commits
  2. 23 Aug, 2022 17 commits
    • docs: netlink: basic introduction to Netlink · 510156a7
      Jakub Kicinski authored
      Provide a bit of a brain dump of netlink related information
      as documentation. Hopefully this will be useful to people
      trying to navigate implementing YAML based parsing in languages
      we won't be able to help with.
      
      I started writing this doc while trying to figure out what
      it'd take to widen the applicability of YAML to good old rtnl,
      but the doc grew beyond that as it usually happens.
      
      In all honesty a lot of this information is new to me as I usually
      follow the "copy an existing example, drink to forget" process
      of writing netlink user space, so reviews will be much appreciated.
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Acked-by: Jonathan Corbet <corbet@lwn.net>
      Link: https://lore.kernel.org/r/20220819200221.422801-2-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: improve and fix netlink kdoc · 30b60554
      Jakub Kicinski authored
      Subsequent patch will render the kdoc from
      include/uapi/linux/netlink.h into Documentation.
      We need to fix the warnings. While at it move
      the comments on struct nlmsghdr to a proper
      kdoc comment.
      
      Link: https://lore.kernel.org/r/20220819200221.422801-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ftmac100: set max_mtu to allow DSA overhead setting · 6c2c782f
      Sergei Antonov authored
      In case ftmac100 is used with a DSA switch, Linux wants to set MTU
      to 1504 to accommodate for DSA overhead. With the default max_mtu
      it leads to the error message:
       ftmac100 92000000.mac eth0: error -22 setting MTU to 1504 to include DSA overhead
      
      ftmac100 supports packet length 1518 (MAX_PKT_SIZE constant), so it is
      safe to report it in max_mtu.
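      The -22 in the message above is -EINVAL, which is what an upper-bound
      check on the MTU returns. A minimal user-space sketch of that kind of
      check follows; validate_mtu() and both constant names are hypothetical
      and only illustrate the arithmetic, they are not the driver's code.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative constants. 1500 is the usual Ethernet default max_mtu;
 * 1518 is the MAX_PKT_SIZE value quoted in the commit message. */
#define DEFAULT_ETH_MAX_MTU   1500
#define FTMAC100_MAX_PKT_SIZE 1518

/*
 * Hypothetical helper: reject an MTU outside [min_mtu, max_mtu].
 * A max_mtu of 0 is treated here as "no upper limit" (an assumption).
 */
static int validate_mtu(int new_mtu, int min_mtu, int max_mtu)
{
	if (new_mtu < min_mtu)
		return -EINVAL;
	if (max_mtu > 0 && new_mtu > max_mtu)
		return -EINVAL;
	return 0;
}
```

      With the default limit, validate_mtu(1504, 68, DEFAULT_ETH_MAX_MTU)
      fails, while the same request passes once the limit is 1518.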
      Signed-off-by: Sergei Antonov <saproj@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20220821160844.474277-1-saproj@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'dsa-changes-for-multiple-cpu-ports-part-3' · 52412f55
      Paolo Abeni authored
      Vladimir Oltean says:
      
      ====================
      DSA changes for multiple CPU ports (part 3)
      
      Those who have been following part 1:
      https://patchwork.kernel.org/project/netdevbpf/cover/20220511095020.562461-1-vladimir.oltean@nxp.com/
      and part 2:
      https://patchwork.kernel.org/project/netdevbpf/cover/20220521213743.2735445-1-vladimir.oltean@nxp.com/
      will know that I am trying to enable the second internal port pair from
      the NXP LS1028A Felix switch for DSA-tagged traffic via "ocelot-8021q".
      This series represents part 3 of that effort.
      
      Covered here are some preparations in DSA for handling multiple DSA
      masters:
      - when changing the tagging protocol via sysfs
      - when the masters go down
      as well as preparation for monitoring the upper devices of a DSA master
      (to support DSA masters under a LAG).
      
      There are also 2 small preparations for the ocelot driver, for the case
      where multiple tag_8021q CPU ports are used in a LAG. Both those changes
      have to do with PGID forwarding domains.
      
      Compared to v1, the patches were trimmed down to just another
      preparation stage, and the UAPI changes were pushed further out to part 4.
      https://patchwork.kernel.org/project/netdevbpf/cover/20220523104256.3556016-1-olteanv@gmail.com/
      
      Compared to v2, I had to export a symbol I forgot to
      (ocelot_port_teardown_dsa_8021q_cpu), to avoid a build breakage when the
      felix and seville drivers are built as modules.
      ====================
      
      Link: https://lore.kernel.org/r/20220819174820.3585002-1-vladimir.oltean@nxp.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: mscc: ocelot: adjust forwarding domain for CPU ports in a LAG · 291ac151
      Vladimir Oltean authored
      Currently when we have 2 CPU ports configured for DSA tag_8021q mode and
      we put them in a LAG, a PGID dump looks like this:
      
      PGID_SRC[0] = ports 4,
      PGID_SRC[1] = ports 4,
      PGID_SRC[2] = ports 4,
      PGID_SRC[3] = ports 4,
      PGID_SRC[4] = ports 0, 1, 2, 3, 4, 5,
      PGID_SRC[5] = no ports
      
      (ports 0-3 are user ports, ports 4 and 5 are CPU ports)
      
      There are 2 problems with the configuration above:
      
      - user ports should enable forwarding towards both CPU ports, not just 4,
        and the aggregation PGIDs should prune one CPU port or the other from
        the destination port mask, based on a hash computed from packet headers.
      
      - CPU ports should not be allowed to forward towards themselves and also
        not towards other ports in the same LAG as themselves
      
      The first problem requires fixing up the PGID_SRC of user ports, when
      ocelot_port_assigned_dsa_8021q_cpu_mask() is called. We need to say that
      when a user port is assigned to a tag_8021q CPU port and that port is in
      a LAG, it should forward towards all ports in that LAG.
      
      The second problem requires fixing up the PGID_SRC of port 4, to remove
      ports 4 and 5 (in a LAG) from the allowed destinations.
      
      After this change, the PGID source masks look as follows:
      
      PGID_SRC[0] = ports 4, 5,
      PGID_SRC[1] = ports 4, 5,
      PGID_SRC[2] = ports 4, 5,
      PGID_SRC[3] = ports 4, 5,
      PGID_SRC[4] = ports 0, 1, 2, 3,
      PGID_SRC[5] = no ports
      
      Note that PGID_SRC[5] still looks weird (it should say "0, 1, 2, 3" just
      like PGID_SRC[4] does), but I've tested forwarding through this CPU port
      and it doesn't seem like anything is affected (it appears that PGID_SRC[4]
      is being looked up on forwarding from the CPU, since both ports 4 and 5
      have logical port ID 4). The reason why it looks weird is because
      we've never called ocelot_port_assign_dsa_8021q_cpu() for any user port
      towards port 5 (all user ports are assigned to port 4 which is in a LAG
      with 5).
      
      Since things aren't broken, I'm willing to leave it like that for now
      and just document the oddity.
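      The two rules described above can be sketched as a toy bitmask model.
      This is not the ocelot driver code: assigned_cpu[], cpu_lag, lag_mask()
      and pgid_src() are hypothetical names, and the topology is hard-coded
      to the 6-port example from this message.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS 6
#define BIT(n) (1u << (n))

/* Toy topology: ports 0-3 are user ports assigned to CPU port 4;
 * ports 4 and 5 are CPU ports bonded in one LAG (-1 = not a user port). */
static const int assigned_cpu[NUM_PORTS] = { 4, 4, 4, 4, -1, -1 };
static const uint32_t cpu_lag = BIT(4) | BIT(5);

/* All ports sharing a LAG with @port, or just @port if it is not bonded. */
static uint32_t lag_mask(int port)
{
	return (cpu_lag & BIT(port)) ? cpu_lag : BIT(port);
}

/* Allowed destination mask for traffic received on @port. */
static uint32_t pgid_src(int port)
{
	uint32_t mask = 0;
	int p;

	/* Rule 1: a user port forwards to every port in its CPU port's LAG. */
	if (assigned_cpu[port] >= 0)
		return lag_mask(assigned_cpu[port]);

	/* Rule 2: a CPU port forwards to the user ports assigned to it,
	 * never to itself or to other members of its own LAG. */
	for (p = 0; p < NUM_PORTS; p++)
		if (assigned_cpu[p] == port)
			mask |= BIT(p);
	return mask & ~lag_mask(port);
}
```

      In this model port 5 ends up with an empty mask for the same reason
      given above: no user port is assigned to it directly.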
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: mscc: ocelot: set up tag_8021q CPU ports independent of user port affinity · 36a0bf44
      Vladimir Oltean authored
      This is a partial revert of commit c295f983 ("net: mscc: ocelot:
      switch from {,un}set to {,un}assign for tag_8021q CPU ports"), because
      as it turns out, this isn't how tag_8021q CPU ports under a LAG are
      supposed to work.
      
      Under that scenario, all user ports are "assigned" to the single
      tag_8021q CPU port represented by the logical port corresponding to the
      bonding interface. So one CPU port in a LAG would have is_dsa_8021q_cpu
      set to true (the one whose physical port ID is equal to the logical port
      ID), and the other one to false.
      
      In turn, this makes 2 undesirable things happen:
      
      (1) PGID_CPU contains only the first physical CPU port, rather than both
      (2) only the first CPU port will be added to the private VLANs used by
          ocelot for VLAN-unaware bridging
      
      To make the driver behave in the same way for both bonded CPU ports, we
      need to bring back the old concept of setting up a port as a tag_8021q
      CPU port, and this is what deals with VLAN membership and PGID_CPU
      updating. But we also need the CPU port "assignment" (the user to CPU
      port affinity), and this is what updates the PGID_SRC forwarding rules.
      
      All DSA CPU ports are statically configured for tag_8021q mode when the
      tagging protocol is changed to ocelot-8021q. User ports are "assigned"
      to one CPU port or the other dynamically (this will be handled by a
      future change).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: use dsa_tree_for_each_cpu_port in dsa_tree_{setup,teardown}_master · 5dc760d1
      Vladimir Oltean authored
      More logic will be added to dsa_tree_setup_master() and
      dsa_tree_teardown_master() in upcoming changes.
      
      Reduce the indentation by one level in these functions by introducing
      and using a dedicated iterator for CPU ports of a tree.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: all DSA masters must be down when changing the tagging protocol · f41ec1fd
      Vladimir Oltean authored
      The fact that the tagging protocol is set and queried from the
      /sys/class/net/<dsa-master>/dsa/tagging file is a bit of a quirk from
      the single CPU port days which isn't aging very well now that DSA can
      have more than a single CPU port. This is because the tagging protocol
      is a switch property, yet in the presence of multiple CPU ports it can
      be queried and set from multiple sysfs files, all of which are handled
      by the same implementation.
      
      The current logic ensures that the net device whose sysfs file we're
      changing the tagging protocol through must be down. That net device is
      the DSA master, and this is fine for single DSA master / CPU port setups.
      
      But exactly because the tagging protocol is per switch [ tree, in fact ]
      and not per DSA master, this isn't fine any longer with multiple CPU
      ports, and we must iterate through the tree and find all DSA masters,
      and make sure that all of them are down.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: only bring down user ports assigned to a given DSA master · 7136097e
      Vladimir Oltean authored
      This is an adaptation of commit c0a8a9c2 ("net: dsa: automatically
      bring user ports down when master goes down") for multiple DSA masters.
      When a DSA master goes down, only the user ports under its control
      should go down too, the others can still send/receive traffic.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: existing DSA masters cannot join upper interfaces · 4f03dcc6
      Vladimir Oltean authored
      All the traffic to/from a DSA master is supposed to be distributed among
      its DSA switch upper interfaces, so we should not allow other upper
      device kinds.
      
      An exception to this is DSA_TAG_PROTO_NONE (switches with no DSA tags),
      and in that case it is actually expected to create e.g. VLAN interfaces
      on the master. But for those, netdev_uses_dsa(master) returns false, so
      the restriction doesn't apply.
      
      The motivation for this change is to allow LAG interfaces of DSA masters
      to be DSA masters themselves. We want to restrict the user's degrees of
      freedom by 1: the LAG should already have all DSA masters as lowers, and
      while lower ports of the LAG can be removed, none can be added after the
      fact.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: bridge: move DSA master bridging restriction to DSA · 920a33cd
      Vladimir Oltean authored
      When DSA gains support for multiple CPU ports in a LAG, it will become
      mandatory to monitor the changeupper events for the DSA master.
      
      In fact, there are already some restrictions to be imposed in that area,
      namely that a DSA master cannot be a bridge port except in some special
      circumstances.
      
      Centralize the restrictions at the level of the DSA layer as a
      preliminary step.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: don't stop at NOTIFY_OK when calling ds->ops->port_prechangeupper · 0498277e
      Vladimir Oltean authored
      dsa_slave_prechangeupper_sanity_check() is supposed to enforce some
      adjacency restrictions, and calls ds->ops->port_prechangeupper if the
      driver implements it.
      
      We convert the error code from the port_prechangeupper() call to a
      notifier code, and 0 is converted to NOTIFY_OK, but the caller of
      dsa_slave_prechangeupper_sanity_check() stops at any notifier code
      different from NOTIFY_DONE.
      
      Avoid this by converting back the notifier code to an error code, so
      that both NOTIFY_OK and NOTIFY_DONE will be seen as 0. This allows more
      parallel sanity check functions to be added.
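      The round trip described above can be sketched in user space. The two
      helpers below restate the kernel's notifier_from_errno() and
      notifier_to_errno() from include/linux/notifier.h as I understand
      them; treat the exact bodies and constant values as an assumption and
      check them against your tree.

```c
#include <assert.h>

#define NOTIFY_DONE      0x0000 /* handler had nothing to do */
#define NOTIFY_OK        0x0001 /* handler ran and succeeded */
#define NOTIFY_STOP_MASK 0x8000 /* stop calling further handlers */

/* Encode 0 or a negative errno as a notifier code. */
static int notifier_from_errno(int err)
{
	if (err)
		return NOTIFY_STOP_MASK | (NOTIFY_OK - err);
	return NOTIFY_OK;
}

/* Decode a notifier code; NOTIFY_DONE and NOTIFY_OK both become 0. */
static int notifier_to_errno(int ret)
{
	ret &= ~NOTIFY_STOP_MASK;
	return ret > NOTIFY_OK ? NOTIFY_OK - ret : 0;
}
```

      A caller that compares the raw code against NOTIFY_DONE treats
      NOTIFY_OK as "stop", while a caller that tests notifier_to_errno()
      only stops on a real error.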
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: dsa: walk through all changeupper notifier functions · 4c3f80d2
      Vladimir Oltean authored
      Traditionally, DSA has had a single netdev notifier handling function
      for each device type.
      
      For the sake of code cleanliness, we would like to introduce more
      handling functions which do one thing, but the conditions for entering
      these functions start to overlap. Example: a handling function which
      tracks whether any bridges contain both DSA and non-DSA interfaces.
      Either this is placed before dsa_slave_changeupper(), case in which it
      will prevent that function from executing, or we place it after
      dsa_slave_changeupper(), case in which we will prevent it from
      executing. The other alternative is to ignore errors from the new
      handling function (not ideal).
      
      To support this usage, we need to change the pattern. In the new model,
      we enter all notifier handling sub-functions, and exit with NOTIFY_DONE
      if there is nothing to do. This allows the sub-functions to be
      relatively free-form and independent from each other.
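      The new pattern can be sketched as a loop over independent
      sub-handlers that only stops on a real error. This is a simplified
      user-space model, not the DSA code: sub_handler_t, run_sub_handlers(),
      nothing_to_do() and reject() are hypothetical names, and the handlers
      return plain errno values rather than notifier codes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* A sub-handler returns 0 ("nothing to do" or "handled") or -errno. */
typedef int (*sub_handler_t)(void *ctx);

/* Enter every sub-handler in order; bail out only on the first error. */
static int run_sub_handlers(const sub_handler_t *handlers, size_t n, void *ctx)
{
	size_t i;
	int err;

	for (i = 0; i < n; i++) {
		err = handlers[i](ctx);
		if (err)
			return err;
	}
	return 0;
}

/* Example handlers; ctx points at an int counting how many were entered. */
static int nothing_to_do(void *ctx)
{
	++*(int *)ctx;
	return 0;
}

static int reject(void *ctx)
{
	++*(int *)ctx;
	return -EOPNOTSUPP;
}
```

      A handler with nothing to do no longer prevents the handlers after it
      from running, which is what lets their entry conditions overlap.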
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'vsock-updates-for-so_rcvlowat-handling' · 139b5fbd
      Paolo Abeni authored
      Arseniy Krasnov says:
      
      ====================
      vsock: updates for SO_RCVLOWAT handling
      
      This patchset includes some updates for SO_RCVLOWAT:
      
      1) af_vsock:
         During my experiments with zerocopy receive, I found that in some
         cases the poll() implementation violates POSIX: when a socket has a
         non-default SO_RCVLOWAT (i.e. not 1), poll() always sets the POLLIN
         and POLLRDNORM bits in 'revents', even when the number of bytes
         available to read on the socket is smaller than the SO_RCVLOWAT
         value. In this case the user sees the POLLIN flag and then tries to
         read data (for example using a 'read()' call), but the read call
         blocks, because the SO_RCVLOWAT logic is implemented in the dequeue
         loop in af_vsock.c. At the same time, POSIX requires that:
      
         "POLLIN     Data other than high-priority data may be read without
                     blocking.
          POLLRDNORM Normal data may be read without blocking."
      
         See https://www.open-std.org/jtc1/sc22/open/n4217.pdf, page 293.
      
         So the poll() syscall returns POLLIN, but the subsequent read call
         blocks.

         The socket(7) man page also says:
      
         "Since Linux 2.6.28, select(2), poll(2), and epoll(7) indicate a
         socket as readable only if at least SO_RCVLOWAT bytes are available."
      
         I checked the TCP callback for poll() (tcp_poll() in net/ipv4/tcp.c):
         it uses the SO_RCVLOWAT value to decide whether to set the POLLIN
         bit. I also tested this case with a TCP socket, and it behaves as
         POSIX requires.

         I've added some fixes to af_vsock.c and virtio_transport_common.c;
         a test is also implemented.
      
      2) virtio/vsock:
         This adds an optimization to wake-ups when new data arrives. Now
         SO_RCVLOWAT is considered before waking up sleepers that wait for
         new data. There is no point in kicking a waiter when the number of
         available bytes in the socket's queue is < SO_RCVLOWAT: if we wake
         up the reader in this case, it will wait for SO_RCVLOWAT bytes
         anyway during dequeue, or, in the poll() case, the
         POLLIN/POLLRDNORM bits won't be set, so such an exit from poll()
         would be "spurious". This logic is also used for TCP sockets.
      
      3) vmci/vsock:
         Same as 2), but I'm not sure about these changes. It would be very
         good to get comments from someone who knows this code.
      
      4) Hyper-V:
         As Dexuan Cui mentioned, it is difficult to support SO_RCVLOWAT for
         the Hyper-V transport, so he suggested disabling this feature for
         Hyper-V.
      ====================
      
      Link: https://lore.kernel.org/r/de41de4c-0345-34d7-7c36-4345258b7ba8@sberdevices.ru
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • vsock_test: POLLIN + SO_RCVLOWAT test · b1346338
      Arseniy Krasnov authored
      This adds a test to check that when poll() returns the POLLIN and
      POLLRDNORM bits, the next read call won't block.
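      The property under test comes down to one predicate: report the socket
      as readable only once at least SO_RCVLOWAT bytes are queued. A minimal
      sketch follows, assuming a hypothetical readable_mask() helper; it is
      not the af_vsock.c or vsock_test code.

```c
#define _XOPEN_SOURCE 700 /* make sure POLLRDNORM is exposed by <poll.h> */
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/*
 * Hypothetical helper: the poll() event bits a receive queue should
 * report, given the queued byte count and the SO_RCVLOWAT threshold.
 */
static short readable_mask(size_t bytes_available, size_t rcvlowat)
{
	if (bytes_available >= rcvlowat)
		return POLLIN | POLLRDNORM;
	return 0;
}
```

      With a threshold of 128 bytes, 64 queued bytes yield no readable bits,
      so a caller that waits for POLLIN is not sent into a read that blocks.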
      Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • vmci/vsock: check SO_RCVLOWAT before wake up reader · e061aed9
      Arseniy Krasnov authored
      This adds an extra condition for waking up the data reader: do it only
      when the number of readable bytes is >= SO_RCVLOWAT. Otherwise there is
      no point in kicking the user, because it will wait until SO_RCVLOWAT
      bytes can be dequeued. This check is performed in vsock_data_ready().
      Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
      Reviewed-by: Vishnu Dasa <vdasa@vmware.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • virtio/vsock: check SO_RCVLOWAT before wake up reader · 39f1ed33
      Arseniy Krasnov authored
      This adds an extra condition for waking up the data reader: do it only
      when the number of readable bytes is >= SO_RCVLOWAT. Otherwise there is
      no point in kicking the user, because it will wait until SO_RCVLOWAT
      bytes can be dequeued. This check is performed in vsock_data_ready().
      Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>