1. 20 Jul, 2021 40 commits
    • Merge branch 'fdb-fanout' · 083cd5a4
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Fan out FDB entries pointing towards the bridge to all switchdev member ports
      
      The "DSA RX filtering" series has added some important support for
      interpreting addresses towards the bridge device as host addresses and
      installing them as FDB entries towards the CPU port, but it does not
      cover all circumstances and needs further work.
      
      To be precise, the mechanism introduced in that series only works as
      long as the ports are fairly static and no port joins or leaves the
      bridge once the configuration is done. If any port leaves, host FDB
      entries that were installed during runtime (for example the user changes
      the MAC address of the bridge device) will be prematurely deleted,
      resulting in a broken setup.
      
      I see this work as targeted for "net-next" because technically it was
      not supposed to work. Also, there are still corner cases and holes to be
      plugged. For example, today, FDB entries on foreign interfaces are not
      covered by br_fdb_replay(), which means that there are cases where some
      host addresses are either lost, or never deleted by DSA. That will be
      resolved once more work gets accepted, in particular the "Allow
      forwarding for the software bridge data path to be offloaded to capable
      devices" series, which moves the br_fdb_replay() call to the bridge core
      and therefore would be required to solve the problem in a generic way
      for every switchdev driver and not just for DSA.
      
      These patches also pave the way for a cleaner implementation for FDB
      entries pointing towards a LAG upper interface in DSA (that code needs
      only to be added, nothing changed), however this is not done here.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: use switchdev_handle_fdb_{add,del}_to_device · b94dc99c
      Vladimir Oltean authored
      Using the new fan-out helper for FDB entries installed on the software
      bridge, we can install host addresses with the proper refcount on the
      CPU port, such that this case:
      
      ip link set swp0 master br0
      ip link set swp1 master br0
      ip link set swp2 master br0
      ip link set swp3 master br0
      ip link set br0 address 00:01:02:03:04:05
      ip link set swp3 nomaster
      
      works properly and the br0 address remains installed as a host entry
      with refcount 3 instead of getting deleted.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: switchdev: introduce a fanout helper for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE · 8ca07176
      Vladimir Oltean authored
      Currently DSA has an issue with FDB entries pointing towards the bridge
      in the presence of br_fdb_replay() being called at port join and leave
      time.
      
      In particular, each bridge port will ask for a replay for the FDB
      entries pointing towards the bridge when it joins, and for another
      replay when it leaves.
      
      This means that for example, a bridge with 4 switch ports will notify
      DSA 4 times of the bridge MAC address.
      
      But if the MAC address of the bridge changes during the normal runtime
      of the system, the bridge notifies switchdev [ once ] of the deletion of
      the old MAC address as a local FDB towards the bridge, and of the
      insertion [ again once ] of the new MAC address as a local FDB.
      
      This is a problem, because DSA keeps the old MAC address as a host FDB
      entry with refcount 4 (4 ports asked for it using br_fdb_replay). So the
      old MAC address will not be deleted. Additionally, the new MAC address
      will only be installed with refcount 1, and when the first switch port
      leaves the bridge (leaving 3 others as still members), it will delete
      with it the new MAC address of the bridge from the local FDB entries
      kept by DSA (because the br_fdb_replay call on deletion will bring the
      entry's refcount from 1 to 0).
      
      So the problem, really, is that the number of br_fdb_replay() calls
      does not match the refcount with which a host FDB entry is offloaded to
      DSA during normal runtime.
      
      An elegant way to solve the problem would be to make the switchdev
      notification emitted by br_fdb_change_mac_address() result in a host FDB
      kept by DSA which has a refcount exactly equal to the number of ports
      under that bridge. Then, no matter how many DSA ports join or leave that
      bridge, the host FDB entry will always be deleted when there are exactly
      zero remaining DSA switch ports members of the bridge.
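
      The mismatch and the proposed fix can be modelled with a plain counter.
      This is a minimal userspace sketch, not the kernel code: struct host_fdb
      and every function name below are hypothetical, and "fanout" simply
      selects between one notification per MAC change (today's behavior as
      described above) and one notification per bridge port (the proposal).

```c
#include <assert.h>

/* Hypothetical stand-in for one host FDB entry kept by DSA. */
struct host_fdb {
	int refcount;
};

static void host_fdb_add(struct host_fdb *e, int times)
{
	e->refcount += times;
}

static void host_fdb_del(struct host_fdb *e, int times)
{
	assert(e->refcount >= times);
	e->refcount -= times;
}

/* Refcount left on the OLD bridge MAC after the bridge changes its MAC. */
static int old_mac_refcount_after_change(int nports, int fanout)
{
	struct host_fdb old_mac = { 0 };
	int i;

	/* Each port that joined asked for one replay of the bridge MAC. */
	for (i = 0; i < nports; i++)
		host_fdb_add(&old_mac, 1);

	/* The bridge then announces the deletion of its old MAC. */
	host_fdb_del(&old_mac, fanout ? nports : 1);
	return old_mac.refcount;
}

/* Refcount left on the NEW bridge MAC after one port leaves the bridge. */
static int new_mac_refcount_after_leave(int nports, int fanout)
{
	struct host_fdb new_mac = { 0 };

	/* The bridge announces the insertion of its new MAC. */
	host_fdb_add(&new_mac, fanout ? nports : 1);

	/* The leaving port's replay deletes the bridge MAC once. */
	host_fdb_del(&new_mac, 1);
	return new_mac.refcount;
}
```

      With 4 ports and no fan-out, the old address is left with a stale
      refcount of 3 and the new address drops to 0 as soon as one port leaves.
      With fan-out, the old address reaches 0 and the new one stays at 3.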
      
      To implement the proposed solution, we remember that the switchdev
      objects and port attributes have some helpers provided by switchdev,
      which can be optionally called by drivers:
      switchdev_handle_port_obj_{add,del} and switchdev_handle_port_attr_set.
      These helpers:
      - fan out a switchdev object/attribute emitted for the bridge towards
        all the lower interfaces that pass the check_cb().
      - fan out a switchdev object/attribute emitted for a bridge port that is
        a LAG towards all the lower interfaces that pass the check_cb().
      
      In other words, this is the model we need for the FDB events too:
      something that will keep an FDB entry emitted towards a physical port as
      it is, but translate an FDB entry emitted towards the bridge into N FDB
      entries, one per physical port.
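
      The translation described above might look roughly like the sketch
      below. It is a simplified userspace model under stated assumptions:
      struct dev, handle_fdb_add() and the two callbacks are hypothetical
      names, not the switchdev API, and LAG handling is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a net device; not the kernel's struct. */
struct dev {
	int is_bridge;
	int is_offloaded;     /* would pass the driver's check_cb() */
	struct dev **lowers;  /* bridge ports, only used if is_bridge */
	size_t num_lowers;
	int fdb_adds;         /* how many add callbacks this device received */
};

typedef int (*check_cb_t)(const struct dev *dev);
typedef void (*add_cb_t)(struct dev *dev);

static int port_is_ours(const struct dev *dev)
{
	return dev->is_offloaded;
}

static void port_fdb_add(struct dev *dev)
{
	dev->fdb_adds++;
}

/*
 * An FDB event towards a port is delivered as it is. An FDB event towards
 * the bridge is fanned out to every lower interface that passes check_cb.
 * Returns the number of callbacks made.
 */
static size_t handle_fdb_add(struct dev *dev, check_cb_t check_cb,
			     add_cb_t add_cb)
{
	size_t i, calls = 0;

	assert(dev != NULL);

	if (!dev->is_bridge) {
		if (check_cb(dev)) {
			add_cb(dev);
			calls++;
		}
		return calls;
	}

	for (i = 0; i < dev->num_lowers; i++) {
		if (check_cb(dev->lowers[i])) {
			add_cb(dev->lowers[i]);
			calls++;
		}
	}

	return calls;
}
```

      For a bridge with 3 offloaded ports and 1 foreign port, an event towards
      the bridge results in 3 callbacks in this model, and the foreign port is
      skipped because it fails the check.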
      
      Of course, there are many differences between fanning out a switchdev
      object (VLAN) on 3 lower interfaces of a LAG and fanning out an FDB
      entry on 3 lower interfaces of a LAG. Intuitively, an FDB entry towards
      a LAG should be treated specially, because FDB entries are unicast, we
      can't just install the same address towards 3 destinations. It is
      imaginable that drivers might want to treat this case specifically, so
      create some methods for this case and do not recurse into the LAG lower
      ports, just the bridge ports.
      
      DSA also listens for FDB entries on "foreign" interfaces, aka interfaces
      bridged with us which are not part of our hardware domain: think an
      Ethernet switch bridged with a Wi-Fi AP. For those addresses, DSA
      installs host FDB entries. However, there we have the same problem
      (those host FDB entries are installed with a refcount of only 1) and an
      even bigger one which we did not have with FDB entries towards the
      bridge:
      
      br_fdb_replay() is currently not called for FDB entries on foreign
      interfaces, just for the physical port and for the bridge itself.
      
      So when DSA sniffs an address learned by the software bridge towards a
      foreign interface like an e1000 port, and then that e1000 leaves the
      bridge, DSA remains with the dangling host FDB address. That will be
      fixed separately by replaying all FDB entries and not just the ones
      towards the port and the bridge.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: switchdev: introduce helper for checking dynamically learned FDB entries · c6451cda
      Vladimir Oltean authored
      It is a bit difficult to understand what DSA checks when it tries to
      avoid installing dynamically learned addresses on foreign interfaces as
      local host addresses, so create a generic switchdev helper that can be
      reused and is generally more readable.
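
      One plausible shape for such a helper is sketched below. The struct and
      its two flags are assumptions made for illustration (a trimmed-down,
      hypothetical stand-in for the FDB notification info), so this should be
      read as a sketch of the idea and not as the helper added by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the FDB notification info; names are assumptions. */
struct fdb_info {
	bool added_by_user;  /* entry was configured statically by the user */
	bool is_local;       /* entry terminates locally, e.g. a port's own MAC */
};

/* An entry counts as dynamically learned if it is neither static nor local. */
static bool fdb_is_dynamically_learned(const struct fdb_info *info)
{
	assert(info != NULL);
	return !info->added_by_user && !info->is_local;
}
```

      A caller that wants to skip dynamically learned addresses on foreign
      interfaces can then test a single, named condition.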
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: add Maxlinear GPY115/21x/24x driver · 7d901a1e
      Xu Liang authored
      Add driver to support the Maxlinear GPY115, GPY211, GPY212, GPY215,
      GPY241, GPY245 PHYs. Separate from XWAY PHY driver because this series
      has different register layout and new features not supported in XWAY PHY.
      Signed-off-by: Xu Liang <lxu@maxlinear.com>
      Acked-by: Hauke Mehrtens <hmehrtens@maxlinear.com>
      Tested-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: add API to read 802.3-c45 IDs · 8b72b301
      Xu Liang authored
      Add API to read 802.3-c45 IDs so that C22/C45 mixed device can use
      C45 APIs without failing ID checks.
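
      As background, a Clause 45 device exposes a two-register identifier per
      MMD (registers 2 and 3), which are combined into one 32-bit ID. The
      sketch below shows only that combination step for a single MMD. The
      callback type, read_c45_id() and fake_mmd_read() are hypothetical names,
      the register values are arbitrary, and this is not the API added by the
      patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MDIO_DEVID1 2  /* upper 16 bits of the device identifier */
#define MDIO_DEVID2 3  /* lower 16 bits of the device identifier */

/* Hypothetical bus accessor: read one 16-bit register of one MMD. */
typedef uint16_t (*mmd_read_t)(int devad, int regnum);

/* Combine the two identifier registers of one MMD into a 32-bit ID. */
static uint32_t read_c45_id(mmd_read_t read, int devad)
{
	uint32_t id;

	assert(read != NULL);
	id = (uint32_t)read(devad, MDIO_DEVID1) << 16;
	id |= read(devad, MDIO_DEVID2);
	return id;
}

/* Test double standing in for a real MDIO bus; the values are arbitrary. */
static uint16_t fake_mmd_read(int devad, int regnum)
{
	(void)devad;
	return regnum == MDIO_DEVID1 ? 0x1234 : 0x5678;
}
```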
      Signed-off-by: Xu Liang <lxu@maxlinear.com>
      Acked-by: Hauke Mehrtens <hmehrtens@maxlinear.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tag_8021q-cross-chip' · 08f329fc
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Proper cross-chip support for tag_8021q
      
      The cross-chip bridging support for tag_8021q/sja1105 introduced here:
      https://patchwork.ozlabs.org/project/netdev/cover/20200510163743.18032-1-olteanv@gmail.com/
      
      took some shortcuts and is not reusable in other topologies except for
      the one it was written for: disjoint DSA trees. A diagram of this
      topology can be seen here:
      https://patchwork.ozlabs.org/project/netdev/patch/20200510163743.18032-3-olteanv@gmail.com/
      
      However there are sja1105 switches on other boards using other
      topologies, most notably:
      
      - Daisy chained:
                                                   |
          sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
       [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
                                         |
                                         +---------+
                                                   |
          sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
                                         |
                                         +---------+
                                                   |
          sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
       [  user ] [  user ] [  user ] [  user ] [  dsa  ]
      
      - "H" topology:
      
               eth0                                                     eth1
                |                                                        |
             CPU port                                                CPU port
                |                        DSA link                        |
       sw0p0  sw0p1  sw0p2  sw0p3  sw0p4 -------- sw1p4  sw1p3  sw1p2  sw1p1  sw1p0
         |             |      |                            |      |             |
       user          user   user                         user   user          user
       port          port   port                         port   port          port
      
      In fact, the current code for tag_8021q cross-chip links works for
      neither of these 2 classes of topologies.
      
      The main reasons are:
      (a) The sja1105 driver does not treat DSA links. In the "disjoint trees"
          topology, the routing port towards any other switch is also the CPU
          port, and that was already configured so it already worked.
          This series does not deal with enabling DSA links in the sja1105
          driver, that is a fairly trivial task that will be dealt with
          separately.
      (b) The tag_8021q code for cross-chip links assumes that any 2 switches
          between which cross-chip forwarding needs to be enabled (i.e. which
          have user ports part of the same bridge) are at most 1 hop away from each
          other. This was true for the "disjoint trees" case because
          once a packet reached the CPU port, VLAN-unaware bridging was done
          by the DSA master towards the other switches based on destination
          MAC address, so the tag_8021q header was not interpreted in any way.
          However, in a daisy chain setup with 3 switches, all of them will
          interpret the tag_8021q header, and all tag_8021q VLANs need to be
          installed in all switches.
      
      When looking at the O(n^2) real complexity of the problem, it is clear
      that the current code had absolutely no chance of working in the general
      case. So this patch series brings a redesign of tag_8021q, in light of
      its new requirements. Anything with O(n^2) complexity (where n is the
      number of switches in a DSA tree) is an obvious candidate for the DSA
      cross-chip notifier support.
      
      One by one, the patches are:
      - The sja1105 driver is extremely entangled with tag_8021q, to be exact,
        with that driver's best_effort_vlan_filtering support. We drop this
        operating mode, which means that sja1105 temporarily loses network
        stack termination for VLAN-aware bridges. That operating mode raced
        itself to its own grave anyway due to some hardware limitations in
        combination with PTP reported by NXP customers. I can't say a lot
        more, but network stack termination for VLAN-aware bridges in sja1105
        will be reimplemented soon with a much, much better solution.
      - What remains of tag_8021q in sja1105 is support for standalone ports
        mode and for VLAN-unaware bridging. We refactor the API surface of
        tag_8021q to a single pair of dsa_tag_8021q_{register,unregister}
        functions and we clean up everything else related to tag_8021q from
        sja1105 and felix.
      - Then we move tag_8021q into the DSA core. I thought about this a lot,
        and there is really no other way to add a DSA_NOTIFIER_TAG_8021Q_VLAN_ADD
        cross-chip notifier if DSA has no way to know if the individual
        switches use tag_8021q or not. So it needs to be part of the core to
        use notifiers.
      - Then we modify tag_8021q to update dynamically on bridge_{join,leave}
        events, instead of what we have today which is simply installing the
        VLANs on all ports of a switch and leaving port isolation up to
        somebody else. This change is necessary because port isolation over a
        DSA link cannot be done in any other way except based on VLAN
        membership, as opposed to bridging within the same switch which had 2
        choices (at least on sja1105).
      - Finally we add 2 new cross-chip notifiers for adding and deleting a
        tag_8021q VLAN, which is properly refcounted similar to the bridge FDB
        and MDB code, and complete cleanup is done on teardown (note that this
        is unlike regular bridge VLANs, where we currently cannot do
        refcounting because the user can run "bridge vlan add dev swp0 vid 100"
        a gazillion times, and "bridge vlan del dev swp0 vid 100" just once,
        and for some reason expect that the VLAN will be deleted. But I digress).
        With this opportunity we remove a lot of hard-to-digest code and
        replace it with much more idiomatic DSA-style code.
      
      This series was regression-tested on:
      - Single-switch boards with SJA1105T
      - Disjoint-tree boards with SJA1105S and Felix (using ocelot-8021q)
      - H topology boards using SJA1110A
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_8021q: add proper cross-chip notifier support · c64b9c05
      Vladimir Oltean authored
      The big problem which mandates cross-chip notifiers for tag_8021q is
      this:
      
                                                   |
          sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
       [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
                                         |
                                         +---------+
                                                   |
          sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
                                         |
                                         +---------+
                                                   |
          sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
      
      When the user runs:
      
      ip link add br0 type bridge
      ip link set sw0p0 master br0
      ip link set sw2p0 master br0
      
      It doesn't work.
      
      This is because dsa_8021q_crosschip_bridge_join() assumes that "ds" and
      "other_ds" are at most 1 hop away from each other, so it is sufficient
      to add the RX VLAN of {ds, port} into {other_ds, other_port} and vice
      versa and presto, the cross-chip link works. When there is another
      switch in the middle, such as in this case switch 1 with its DSA links
      sw1p3 and sw1p4, somebody needs to tell it about these VLANs too.
      
      Which is exactly why the problem is quadratic: when a port joins a
      bridge, for each port in the tree that's already in that same bridge we
      notify a tag_8021q VLAN addition of that port's RX VLAN to the entire
      tree. It is a very complicated web of VLANs.
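
      The growth can be made concrete with a small counting model. It is a
      sketch under simplifying assumptions: struct tree and port_join() are
      hypothetical, user ports are numbered globally across the tree, and only
      the rule stated above is modelled (one tree-wide notification per port
      already in the bridge, seen by every switch).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 16

/* Hypothetical model of a DSA tree and its bridge membership. */
struct tree {
	size_t num_switches;
	int in_bridge[MAX_PORTS];  /* 1 if user port i is a bridge member */
	size_t vlan_visits;        /* total per-switch VLAN add requests */
};

/*
 * A port joins the bridge. For every port already in the bridge, one
 * tree-wide notification is emitted, and each notification is seen by
 * every switch in the tree. Returns the notifications emitted.
 */
static size_t port_join(struct tree *t, size_t port)
{
	size_t i, notifications = 0;

	assert(port < MAX_PORTS);

	for (i = 0; i < MAX_PORTS; i++) {
		if (i != port && t->in_bridge[i])
			notifications++;
	}

	t->in_bridge[port] = 1;
	t->vlan_visits += notifications * t->num_switches;
	return notifications;
}
```

      In this model, 4 ports joining one after another in a 3-switch tree emit
      0 + 1 + 2 + 3 = 6 notifications, which amounts to 18 per-switch VLAN
      requests. The notification count grows with the square of the number of
      bridge members.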
      
      It must be mentioned that currently we install tag_8021q VLANs on too
      many ports (DSA links - to be precise, on all of them). For example,
      when sw2p0 joins br0, and assuming sw1p0 was part of br0 too, we add the
      RX VLAN of sw2p0 on the DSA links of switch 0 too, even though there
      isn't any port of switch 0 that is a member of br0 (at least yet).
      In theory we could notify only the switches which sit in between the
      port joining the bridge and the port reacting to that bridge_join event.
      But in practice that is impossible, because of the way 'link' properties
      are described in the device tree. The DSA bindings require DT writers to
      list out not only the real/physical DSA links, but in fact the entire
      routing table, like for example switch 0 above will have:
      
      	sw0p3: port@3 {
      		link = <&sw1p4 &sw2p4>;
      	};
      
      This was done because:
      
      /* TODO: ideally DSA ports would have a single dp->link_dp member,
       * and no dst->rtable nor this struct dsa_link would be needed,
       * but this would require some more complex tree walking,
       * so keep it stupid at the moment and list them all.
       */
      
      but it is a perfect example of a situation where too much information is
      actively detrimental, because we are now in the position where we
      cannot distinguish a real DSA link from one that is put there to avoid
      the 'complex tree walking'. And because DT is ABI, there is not much we
      can change.
      
      And because we do not know which DSA links are real and which ones
      aren't, we can't really know if DSA switch A is in the data path between
      switches B and C, in the general case.
      
      So this is why tag_8021q RX VLANs are added on all DSA links, and
      probably why it will never change.
      
      On the other hand, at least the number of additions/deletions is well
      balanced, and this means that once we implement reference counting at
      the cross-chip notifier level a la fdb/mdb, there is absolutely zero
      need for a struct dsa_8021q_crosschip_link, it's all self-managing.
      
      In fact, with the tag_8021q notifiers emitted from the bridge join
      notifiers, it becomes so generic that sja1105 does not need to do
      anything anymore, we can just delete its implementation of the
      .crosschip_bridge_{join,leave} methods.
      
      Among other things we can simply delete is the home-grown implementation
      of sja1105_notify_crosschip_switches(). The reason why that is wrong is
      because it is not quadratic - it only covers remote switches to which we
      have a cross-chip bridging link and that does not cover in-between
      switches. This deletion is part of the same patch because sja1105 used
      to poke deep inside the guts of the tag_8021q context in order to do
      that. Because the cross-chip links went away, so must the sja1105 code.
      
      Last but not least, dsa_8021q_setup_port() is simplified (and also
      renamed). Because our TAG_8021Q_VLAN_ADD notifier is designed to react
      on the CPU port too, the four dsa_8021q_vid_apply() calls:
      - 1 for RX VLAN on user port
      - 1 for the user port's RX VLAN on the CPU port
      - 1 for TX VLAN on user port
      - 1 for the user port's TX VLAN on the CPU port
      
      now get squashed into only 2 notifier calls via
      dsa_port_tag_8021q_vlan_add.
      
      And because the notifiers to add and to delete a tag_8021q VLAN are
      distinct, now we finally break up the port setup and teardown into
      separate functions instead of relying on a "bool enabled" flag which
      tells us what to do. Arguably it should have been this way from the
      get go.
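
      The difference between the two API shapes can be sketched as follows.
      Everything here is hypothetical (struct port, the VLAN table and all
      function names); it only contrasts a flag-driven function with a
      setup/teardown pair built on distinct add and delete operations.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_VLANS 8

/* Hypothetical per-port VLAN table; 1 means the VLAN is installed. */
struct port {
	int vlan[NUM_VLANS];
};

/* Old shape: one function, and the flag decides between add and delete. */
static void port_vlan_apply(struct port *p, int vid, bool enabled)
{
	assert(vid >= 0 && vid < NUM_VLANS);
	p->vlan[vid] = enabled ? 1 : 0;
}

/* New shape: distinct operations, mirroring distinct add/del notifiers. */
static void port_vlan_add(struct port *p, int vid)
{
	assert(vid >= 0 && vid < NUM_VLANS);
	p->vlan[vid] = 1;
}

static void port_vlan_del(struct port *p, int vid)
{
	assert(vid >= 0 && vid < NUM_VLANS);
	p->vlan[vid] = 0;
}

/* Setup and teardown become two readable, symmetric functions. */
static void port_setup(struct port *p, int rx_vid, int tx_vid)
{
	port_vlan_add(p, rx_vid);
	port_vlan_add(p, tx_vid);
}

static void port_teardown(struct port *p, int rx_vid, int tx_vid)
{
	port_vlan_del(p, rx_vid);
	port_vlan_del(p, tx_vid);
}
```

      The split version has no call site where the reader must remember what
      a bare true or false argument means.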
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_8021q: manage RX VLANs dynamically at bridge join/leave time · e19cc13c
      Vladimir Oltean authored
      There has been at least one wasted opportunity for tag_8021q to be used
      by a driver:
      
      https://patchwork.ozlabs.org/project/netdev/patch/20200710113611.3398-3-kurt@linutronix.de/#2484272
      
      because of a design decision: the declared purpose of tag_8021q is to
      offer source port/switch identification for a tagging driver for packets
      coming from a switch with no hardware DSA tagging support. It is not
      intended to provide VLAN-based port isolation, because its first user,
      sja1105, had another mechanism for bridging domain isolation, the L2
      Forwarding Table. So even if 2 ports are in the same VLAN but they are
      separated via the L2 Forwarding Table, they will not communicate with
      one another. The L2 Forwarding Table is managed by the
      sja1105_bridge_join() and sja1105_bridge_leave() methods.
      
      As a consequence, today tag_8021q does not bother too much with hooking
      into .port_bridge_join() and .port_bridge_leave() because that would
      introduce yet another degree of freedom, it just iterates statically
      through all ports of a switch and adds the RX VLAN of one port to all
      the others. In this way, whenever .port_bridge_join() is called,
      bridging will magically work because the RX VLANs are already installed
      everywhere they need to be.
      
      This is not to say that the reason for the change in this patch is to
      satisfy the hellcreek and similar use cases, that is merely a nice side
      effect. Instead it is to make sja1105 cross-chip links work properly
      over a DSA link.
      
      For context, sja1105 today supports a degenerate form of cross-chip
      bridging, where the switches are interconnected through their CPU ports
      ("disjoint trees" topology). There is some code which has been
      generalized into dsa_8021q_crosschip_link_{add,del}, but it is not
      enough, and frankly it is impossible to build upon that.
      Real multi-switch DSA trees, like daisy chains or H trees, which have
      actual DSA links, do not work.
      
      The problem is that sja1105 is unlike mv88e6xxx, and does not have a PVT
      for cross-chip bridging, which is a table by which the local switch can
      select the forwarding domain for packets from a certain ingress switch
      ID and source port. The sja1105 switches cannot parse their own DSA
      tags, because, well, they don't really have support for DSA tags, it's
      all VLANs.
      
      So to make something like cross-chip bridging between sw0p0 and sw1p0
      work over the sw0p3/sw1p3 DSA link with sja1105 in the topology
      below:
      
                               |                                  |
          sw0p0     sw0p1     sw0p2     sw0p3          sw1p3     sw1p2     sw1p1     sw1p0
       [  user ] [  user ] [  cpu  ] [  dsa  ] ---- [  dsa  ] [  cpu  ] [  user ] [  user ]
      
      we need to ask ourselves 2 questions:
      
      (1) how should the L2 Forwarding Table be managed?
      (2) how should the VLAN Lookup Table be managed?
      
      i.e. what should prevent packets from going to unwanted ports?
      
      Since, as mentioned, there is no PVT, the L2 Forwarding Table only
      contains forwarding rules for local ports. So we can say "all user ports
      are allowed to forward to all CPU ports and all DSA links".
      
      If we allow forwarding to DSA links unconditionally, this means we must
      prevent forwarding using the VLAN Lookup Table. This is in fact
      asymmetric with what we do for tag_8021q on ports local to the same
      switch, and it matters because now that we are making tag_8021q a core
      DSA feature, we need to hook into .crosschip_bridge_join() to add/remove
      the tag_8021q VLANs. So for symmetry it makes sense to manage the VLANs
      for local forwarding in the same way as cross-chip forwarding.
      
      Note that there is a very precise reason why tag_8021q hooks into
      dsa_switch_bridge_join() which acts at the cross-chip notifier level,
      and not at a higher level such as dsa_port_bridge_join(). We need to
      install the RX VLAN of the newly joining port into the VLAN table of all
      the existing ports across the tree that are part of the same bridge, and
      the notifier already does the iteration through the switches for us.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register · 328621f6
      Vladimir Oltean authored
      Right now, setting up tag_8021q is a 2-step operation for a driver,
      first the context structure needs to be created, then the VLANs need to
      be installed on the ports. A similar thing is true for teardown.
      
      Merge the 2 steps into the register/unregister methods, to be as
      transparent as possible for the driver as to what tag_8021q does behind
      the scenes. This also gets rid of the funny "bool setup == true means
      setup, == false means teardown" API that tag_8021q used to expose.
      
      Note that dsa_tag_8021q_register() must be called at least in the
      .setup() driver method and never earlier (like in the driver probe
      function). This is because the DSA switch tree is not initialized at
      probe time, and the cross-chip notifiers will not work.
      
      For symmetry with .setup(), the unregister method should be put in
      .teardown().
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: make tag_8021q operations part of the core · 5da11eb4
      Vladimir Oltean authored
      Make tag_8021q a more central element of DSA and move the 2 driver
      specific operations outside of struct dsa_8021q_context (which is
      supposed to hold dynamic data and not really constant function
      pointers).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: let the core manage the tag_8021q context · d7b1fd52
      Vladimir Oltean authored
      The basic problem description is as follows:
      
      Be there 3 switches in a daisy chain topology:
      
                                                   |
          sw0p0     sw0p1     sw0p2     sw0p3     sw0p4
       [  user ] [  user ] [  user ] [  dsa  ] [  cpu  ]
                                         |
                                         +---------+
                                                   |
          sw1p0     sw1p1     sw1p2     sw1p3     sw1p4
       [  user ] [  user ] [  user ] [  dsa  ] [  dsa  ]
                                         |
                                         +---------+
                                                   |
          sw2p0     sw2p1     sw2p2     sw2p3     sw2p4
       [  user ] [  user ] [  user ] [  user ] [  dsa  ]
      
      The CPU will not be able to ping through the user ports of the
      bottom-most switch (like for example sw2p0), simply because tag_8021q
      was not coded up for this scenario - it has always assumed DSA switch
      trees with a single switch.
      
      To add support for the topology above, we must admit that the RX VLAN of
      sw2p0 must be added on some ports of switches 0 and 1 as well. This is
      in fact a textbook example of a thing that can use the cross-chip notifier
      framework that DSA has set up in switch.c.
      
      There is only one problem: core DSA (switch.c) is not able right now to
      make the connection between a struct dsa_switch *ds and a struct
      dsa_8021q_context *ctx. Right now, it is drivers who call into
      tag_8021q.c and always provide a struct dsa_8021q_context *ctx pointer,
      and tag_8021q.c calls them back with the .tag_8021q_vlan_{add,del}
      methods.
      
      But with cross-chip notifiers, it is possible for tag_8021q to call
      drivers without drivers having ever asked for anything. A good example
      is right above: when sw2p0 wants to set itself up for tag_8021q,
      the .tag_8021q_vlan_add method needs to be called for switches 1 and 0,
      so that they transport sw2p0's VLANs towards the CPU without dropping
      them.
      
      So instead of letting drivers manage the tag_8021q context, add a
      tag_8021q_ctx pointer inside of struct dsa_switch, which will be
      populated when dsa_tag_8021q_register() returns success.
      
      The patch is fairly long-winded because we are partly reverting commit
      5899ee36 ("net: dsa: tag_8021q: add a context structure") which made
      the driver-facing tag_8021q API use "ctx" instead of "ds". Now that we
      can access "ctx" directly from "ds", this is no longer needed.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: build tag_8021q.c as part of DSA core · 8b6e638b
      Vladimir Oltean authored
      Upcoming patches will add tag_8021q related logic to switch.c and
      port.c, in order to allow it to make use of cross-chip notifiers.
      In addition, a struct dsa_8021q_context *ctx pointer will be added to
      struct dsa_switch.
      
      It seems fairly low-reward to #ifdef the *ctx from struct dsa_switch and
      to provide shim implementations of the entire tag_8021q.c calling
      surface (not even clear what to do about the tag_8021q cross-chip
      notifiers to avoid compiling them). The runtime overhead for switches
      which don't use tag_8021q is fairly small because all helpers will check
      for ds->tag_8021q_ctx being a NULL pointer and stop there.
      
      So let's make it part of dsa_core.o.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vladimir Oltean's avatar
      net: dsa: tag_8021q: create dsa_tag_8021q_{register,unregister} helpers · cedf4670
      Vladimir Oltean authored
      In preparation for moving tag_8021q to core DSA, move all initialization
      and teardown related to tag_8021q, which is currently done by drivers,
      into 2 functions called "register" and "unregister". These will gather more
      functionality in future patches, which will better justify the chosen
      naming scheme.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vladimir Oltean's avatar
      net: dsa: tag_8021q: remove struct packet_type declaration · 8afbea18
      Vladimir Oltean authored
      This is no longer necessary since tag_8021q doesn't register itself as a
      full-blown tagger anymore.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vladimir Oltean's avatar
      net: dsa: tag_8021q: use symbolic error names · 69ebb370
      Vladimir Oltean authored
      Use %pe to give the user a string holding the error code instead of just
      a number.
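      
      In the kernel, %pe is a printk format that takes an error pointer, as
      in dev_err(dev, "...: %pe\n", ERR_PTR(err)). The userspace sketch below
      only models the visible effect. The model_* helpers and the short
      lookup table are assumptions for illustration; the kernel's own table
      covers far more codes.

```c
#include <errno.h>
#include <stdio.h>

/* Map an errno value (positive or negative) to its symbolic name. */
static const char *model_errname(int err)
{
	switch (err < 0 ? -err : err) {
	case ENOMEM:
		return "ENOMEM";
	case EINVAL:
		return "EINVAL";
	case EOPNOTSUPP:
		return "EOPNOTSUPP";
	case EBUSY:
		return "EBUSY";
	default:
		return NULL;	/* unknown to this small table */
	}
}

/* Format like %pe: symbolic name when known, plain number otherwise. */
static int model_format_err(char *buf, size_t len, int err)
{
	const char *name = model_errname(err);

	if (name)
		return snprintf(buf, len, "-%s", name);
	return snprintf(buf, len, "%d", err);
}
```

      A log line built this way reads "-ENOMEM" rather than "-12", which
      saves the reader a lookup in the errno headers.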
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vladimir Oltean's avatar
      net: dsa: tag_8021q: use "err" consistently instead of "rc" · a81a4574
      Vladimir Oltean authored
      Some of the tag_8021q code has been taken out of sja1105, which uses
      "rc" for its return code variables, whereas the DSA core uses "err".
      Change tag_8021q for consistency.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vladimir Oltean's avatar
      net: dsa: sja1105: delete the best_effort_vlan_filtering mode · 0fac6aa0
      Vladimir Oltean authored
      Simply put, the best-effort VLAN filtering mode relied on VLAN retagging
      from a bridge VLAN towards a tag_8021q sub-VLAN in order to be able to
      decode the source port in the tagger, but the VLAN retagging
      implementation inside the sja1105 chips is not the best and we were
      relying on marginal operating conditions.
      
      The most notable limitation of the best-effort VLAN filtering mode is
      its incapacity to treat this case properly:
      
      ip link add br0 type bridge vlan_filtering 1
      ip link set swp2 master br0
      ip link set swp4 master br0
      bridge vlan del dev swp4 vid 1
      bridge vlan add dev swp4 vid 1 pvid
      
      When sending an untagged packet through swp2, the expectation is for it
      to be forwarded to swp4 as egress-tagged (so it will contain VLAN ID 1
      on egress). But the switch will send it as egress-untagged.
      
      There was an attempt to fix this here:
      https://patchwork.kernel.org/project/netdevbpf/patch/20210407201452.1703261-2-olteanv@gmail.com/
      
      but it failed miserably because it broke PTP RX timestamping, in a way
      that cannot be corrected due to hardware issues related to VLAN
      retagging.
      
      So with either PTP broken or pushing VLAN headers on egress for untagged
      packets being broken, the sad reality is that the best-effort VLAN
      filtering code is broken. Delete it.
      
      Note that this means there will be a temporary loss of functionality in
      this driver until it is replaced with something better (network stack
      RX/TX capability for "mode 2" as described in
      Documentation/networking/dsa/sja1105.rst, the "port under VLAN-aware
      bridge" case). We simply cannot keep this code until that driver rework
      is done, it is super bloated and tangled with tag_8021q.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 's390-next' · c18e9405
      David S. Miller authored
      Julian Wiedmann says:
      
      ====================
      s390/qeth: updates 2021-07-20
      
      please apply the following patch series for qeth to netdev's net-next tree.
      
      This removes the deprecated support for OSN-mode devices, and does some
      follow-on cleanups.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'veth-flexible-channel-numbers' · 542bb396
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      veth: more flexible channels number configuration
      
      XDP setups can benefit from multiple veth RX/TX queues. Currently
      veth allows setting this number only at creation time via the
      'numrxqueues' and 'numtxqueues' parameters.
      
      This series introduces support for the ethtool set_channel operation
      and allows configuring the queue number via a new module parameter.
      
      The veth default configuration is not changed.
      
      Finally self-tests are updated to check the new features, with both
      valid and invalid arguments.
      
      This iteration is a rebase of the most recent RFC. It does not provide
      a module parameter to configure the default number of queues, but I
      think it could be worthwhile.
      
      RFC v1 -> RFC v2:
       - report more consistent 'combined' count
       - make set_channel as resilient as possible to errors
       - drop module parameter - but I would still consider it.
       - more self-tests
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'bridge-vlan-multicast' · 2c080404
      David S. Miller authored
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: multicast: add vlan support
      
      This patchset adds initial per-vlan multicast support, most of the code
      deals with moving to multicast context pointers from bridge/port pointers.
      That allows us to switch them with the per-vlan contexts when a multicast
      packet is being processed and vlan multicast snooping has been enabled.
      That is controlled by a global bridge option added in patch 06 which is
      off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note
      that this option can change only under RTNL and doesn't require
      multicast_lock, so we need to be careful when retrieving mcast contexts
      in parallel. For packet processing they are switched only once in
      br_multicast_rcv() and then used until the packet has been processed.
      For the most part we need these contexts only to read config values and
      check if they are disabled. The global mcast state which is maintained
      consists of querier and router timers, the rest are config options.
      The port mcast state which is maintained consists of query timer and
      link to router port list if it's ever marked as a router port. Port
      multicast contexts _must_ be used only with their respective global
      contexts, that is a bridge port's mcast context must be used only with
      bridge's global mcast context and a vlan/port's mcast context must be
      used only with that vlan's global mcast context due to the router port
      lists. This way a bridge port can be marked as a router in multiple
      vlans, but might not be a router in some other vlan. Also this allows us
      to have per-vlan querier elections, per-vlan queries and basically the
      whole multicast state becomes per-vlan when the option is enabled.
      One of the hardest parts is synchronization with vlan's memory
      management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED
      which is changed only under multicast_lock. When a vlan is being
      destroyed first that flag is removed under the lock, then the multicast
      context is torn down which includes waiting for any outstanding context
      timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED
      it must be checked first if the contexts are vlan and the multicast_lock
      has been acquired. That is done by all IGMP/MLD packet processing
      functions and timers. When processing a packet we have RCU so the vlan
      memory won't be freed, but if the flag is missing we must not process it.
      The timers are synchronized in the same way with the addition of waiting
      for them to finish in case they are running after removing the flag
      under multicast_lock (i.e. they were waiting for the lock). Multicast vlan
      snooping requires vlan filtering to be enabled, if it's disabled then
      snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED
      controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all
      vlan disabled checks. We need both flags because one is controlled by
      user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is
      for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED).
      Since the latter is also used for synchronization between the multicast
      and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we
      rely on it when checking if a vlan context is disabled. The multicast
      fast-path has 3 new bit tests on the cache-hot bridge flags field, I
      didn't observe any measurable difference. I haven't forced either
      context options to be always disabled when the other type is enabled
      because the state consists of timers which either expire (router) or
      don't affect the normal operation. Some options, like the mcast querier
      one, won't be allowed to change for the disabled context type, that will
      come with a future patch-set which adds per-vlan querier control.
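      
      The flag-under-lock rule described above can be sketched in userspace
      as follows. This is only a model: a pthread mutex stands in for
      multicast_lock, and the model_* names, the flag value and the struct
      layout are hypothetical. The real code additionally relies on RCU to
      keep the vlan memory alive and waits for outstanding timers during
      teardown.

```c
#include <pthread.h>

#define MODEL_VLFLAG_MCAST_ENABLED 0x1u	/* stand-in for BR_VLFLAG_MCAST_ENABLED */

struct model_vlan {
	unsigned int priv_flags;
	unsigned int packets_processed;
};

static pthread_mutex_t model_multicast_lock = PTHREAD_MUTEX_INITIALIZER;

/* Packet/timer path: take the lock, then check the flag before any work. */
static int model_process(struct model_vlan *v)
{
	int handled = 0;

	pthread_mutex_lock(&model_multicast_lock);
	if (v->priv_flags & MODEL_VLFLAG_MCAST_ENABLED) {
		v->packets_processed++;
		handled = 1;
	}
	pthread_mutex_unlock(&model_multicast_lock);
	return handled;
}

/* Teardown path: clear the flag under the same lock before tearing down. */
static void model_disable(struct model_vlan *v)
{
	pthread_mutex_lock(&model_multicast_lock);
	v->priv_flags &= ~MODEL_VLFLAG_MCAST_ENABLED;
	pthread_mutex_unlock(&model_multicast_lock);
	/* the real code would now wait for any running context timers */
}
```

      Any processing path that was waiting for the lock while the flag was
      being cleared sees it missing once it gets the lock and backs off.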
      
      Another important addition is the global vlan options, so far we had
      only per bridge/port vlan options but in order to control vlan multicast
      snooping globally we need to add a new type of global vlan options.
      They can be changed only on the bridge device and are dumped only when a
      special flag is set in the dump request. The first global option is vlan
      mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED
      private flag. It can be set only on master vlan entries. There will be
      many more global vlan options in the future both for multicast config
      and other per-vlan options (e.g. STP).
      
      There's a lot of room for improvements, I'll do some of the initial
      ones but splitting the state to different contexts opens the door
      for a lot more. Also any new multicast options become vlan-supported with
      very little to no effort by using the same contexts.
      
      Short patch description:
        patches 01-04: initial mcast context add, no functional changes
        patch      05: adds vlan mcast init and control helpers and uses them on
                       vlan create/destroy
        patch      06: adds a global bridge mcast vlan snooping knob (default
                       off)
        patches 07-08: add a helper for users which must derive the contexts
                       based on current bridge and vlan options (e.g. timers)
        patch      09: adds checks for disabled vlan contexts in packet
                       processing and timers
        patch      10: adds support for per-vlan querier and tagged queries
        patch      11: adds router port vlan id in the notifications
        patches 12-14: add global vlan options support (change, dump, notify)
        patch      15: adds per-vlan global mcast snooping control
      
      Future patch-sets which build on this one (in order):
       - vlan state mcast handling
       - user-space mdb contexts (currently only the bridge contexts are used
         there)
       - all bridge multicast config options added per-vlan global and per
         vlan/port
       - iproute2 support for all the new uAPIs
       - selftests
      
      This set has been stress-tested (deleting/adding ports/vlans while changing
      vlan mcast snooping while processing IGMP/MLD packets), and also has
      passed all bridge self-tests. I'm sending this set as early as possible
      since there're a few more related sets that should go in the same
      release to get proper and full mcast vlan snooping support.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: clean up device_type management · ae57ea7a
      Julian Wiedmann authored
      qeth uses three device_type structs - a generic one, and one for each
      sub-driver (which is used for fixed-layer devices only). Instead of
      exporting these device_types back and forth between the driver's modules,
      make all the logic self-contained within the sub-drivers.
      
      On disc->setup() they either install their own device_type, or add the
      sysfs attributes that are missing in the generic device_type. Later on
      disc->remove() these attributes are removed again from any device that
      has the generic device_type.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: clean up QETH_PROT_* naming · a37cfa28
      Julian Wiedmann authored
      The QETH_PROT_* naming is shared between two unrelated areas - one is
      the MPC-level protocol identifiers, the other is the qeth_prot_version
      enum.
      
      Rename the MPC definitions to use QETH_MPC_PROT_*.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Julian Wiedmann's avatar
      s390/qeth: remove OSN support · a8c7629c
      Julian Wiedmann authored
      Commit fb64de1b ("s390/qeth: phase out OSN support") spelled out
      why the OSN support in qeth is in a bad shape, and put any remaining
      interested parties on notice to speak up before it gets ripped out.
      
      It's 2021 now, so make good on that promise and remove all the
      OSN-specific parts from qeth. This also means that we no longer need to
      export various parts of the cmd & data path internals to the L2 driver.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · bc672d49
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2021-07-19
      
      This series contains updates to iavf and i40e drivers.
      
      Stefan Assmann adds locking to a path that does not acquire a spinlock
      where needed for i40e. He also adjusts locking of critical sections to
      help avoid races and removes overriding of the adapter state during
      pending reset for iavf driver.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'veth-flexible-channel-numbers' · e4b1dc43
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      veth: more flexible channels number configuration
      
      XDP setups can benefit from multiple veth RX/TX queues. Currently
      veth allows setting this number only at creation time via the
      'numrxqueues' and 'numtxqueues' parameters.
      
      This series introduces support for the ethtool set_channel operation
      and allows configuring the queue number via a new module parameter.
      
      The veth default configuration is not changed.
      
      Finally self-tests are updated to check the new features, with both
      valid and invalid arguments.
      
      This iteration is a rebase of the most recent RFC. It does not provide
      a module parameter to configure the default number of queues, but I
      think it could be worthwhile.
      
      RFC v1 -> RFC v2:
       - report more consistent 'combined' count
       - make set_channel as resilient as possible to errors
       - drop module parameter - but I would still consider it.
       - more self-tests
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Paolo Abeni's avatar
      selftests: net: veth: add tests for set_channel · 1ec2230f
      Paolo Abeni authored
      Simple functional test for the newly exposed features.
      
      Also add an optional stress test for the channel number
      update under flood.
      
      RFC v1 -> RFC v2:
       - add the stress test
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Paolo Abeni's avatar
      veth: create by default nr_possible_cpus queues · 9d3684c2
      Paolo Abeni authored
      This allows easier XDP usage. The number of default active
      queues is not changed: 1 RX and 1 TX so that this does
      not introduce overhead on the datapath for queue selection.
      
      v1 -> v2:
       - drop the module parameter, force default to nr_possible_cpus - Toke
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Paolo Abeni's avatar
      veth: implement support for set_channel ethtool op · 4752eeb3
      Paolo Abeni authored
      This change implements the set_channel() ethtool operation,
      preserving the current default values and allowing the number
      of queues to be set within the range fixed at device creation
      time.
      
      The update operation tries hard to leave the device in a
      consistent status in case of errors.
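      
      A rough userspace sketch of the "leave the device consistent on error"
      idea, under stated assumptions: struct model_dev, the model_* helpers
      and the fail_at test hook are all hypothetical and are not the veth
      implementation. The pattern is that enabling may fail midway, while
      disabling cannot, so a partial enable can always be undone.

```c
#include <errno.h>

#define MODEL_MAX_QUEUES 8

struct model_dev {
	int active;	/* queues [0, active) are enabled */
	int max;	/* upper bound fixed at creation, <= MODEL_MAX_QUEUES */
	int enabled[MODEL_MAX_QUEUES];
	int fail_at;	/* test hook: enabling this index fails, -1 = never */
};

/* "disable" cannot fail, which is what makes rollback always possible. */
static void model_disable_range(struct model_dev *d, int start, int end)
{
	for (int i = start; i < end; i++)
		d->enabled[i] = 0;
}

static int model_enable_range(struct model_dev *d, int start, int end)
{
	for (int i = start; i < end; i++) {
		if (i == d->fail_at) {
			model_disable_range(d, start, i);	/* undo partial work */
			return -ENOMEM;
		}
		d->enabled[i] = 1;
	}
	return 0;
}

static int model_set_channels(struct model_dev *d, int new_count)
{
	int err;

	if (new_count < 1 || new_count > d->max)
		return -EINVAL;
	if (new_count > d->active) {
		err = model_enable_range(d, d->active, new_count);
		if (err)
			return err;	/* device keeps its previous queue set */
	} else {
		model_disable_range(d, new_count, d->active);
	}
	d->active = new_count;	/* commit only after everything succeeded */
	return 0;
}
```

      The active count is updated last, so a failed request leaves both the
      per-queue state and the reported count exactly as they were.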
      
      RFC v1 -> RFC v2:
       - don't flip device status on set_channel()
       - roll-back the changes if possible on error - Jakub
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Paolo Abeni's avatar
      veth: factor out initialization helper · dedd53c5
      Paolo Abeni authored
      Extract into simpler helpers the code to enable and disable a
      range of xdp/napi instances, with the common property that
      "disable" helpers can't fail.
      
      Will be used by the next patch. No functional change intended.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Paolo Abeni's avatar
      veth: always report zero combined channels · f7918b79
      Paolo Abeni authored
      veth get_channel currently reports channels as being both RX/TX and
      combined. As Jakub noted:
      
      """
      ethtool man page is relatively clear, unfortunately the kernel code
      is not and few read the man page. A channel is approximately an IRQ,
      not a queue, and IRQ can't be dedicated and combined simultaneously
      """
      
      This patch changes the information exposed by veth_get_channels,
      setting max_combined to zero, being more consistent with the above
      statement. The ethtool_channels is always cleared by the caller, we just
      need to avoid setting the 'combined' fields.
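      
      As an illustration only, the reporting rule can be modelled like this.
      The field names in struct model_channels mirror struct
      ethtool_channels, but the struct is a local copy, and struct
      model_veth and model_get_channels() are hypothetical names rather
      than the driver's code.

```c
/* Local copy of the relevant ethtool_channels fields, for illustration. */
struct model_channels {
	unsigned int max_rx, max_tx, max_other, max_combined;
	unsigned int rx_count, tx_count, other_count, combined_count;
};

struct model_veth {
	unsigned int num_rx, num_tx;	/* queues allocated at creation */
	unsigned int real_rx, real_tx;	/* queues currently active */
};

static void model_get_channels(const struct model_veth *dev,
			       struct model_channels *ch)
{
	/*
	 * The caller is expected to have zeroed *ch, so the 'combined'
	 * fields report 0 simply because they are never written here.
	 */
	ch->max_rx = dev->num_rx;
	ch->max_tx = dev->num_tx;
	ch->rx_count = dev->real_rx;
	ch->tx_count = dev->real_tx;
}
```

      Reporting dedicated RX and TX channels with zero combined channels
      matches the quoted remark that a channel cannot be dedicated and
      combined at the same time.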
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vasily Averin's avatar
      memcg: enable accounting for scm_fp_list objects · 2c6ad20b
      Vasily Averin authored
      Unix sockets allow sending file descriptors via SCM_RIGHTS type messages.
      Each such send call forces the kernel to allocate up to 2 KB of memory for
      struct scm_fp_list.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vasily Averin's avatar
      memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs allocation · 1b51d827
      Vasily Averin authored
      Author: Andrey Ryabinin <aryabinin@virtuozzo.com>
      
      The size of the ip_tunnel_prl structs allocation is controllable from
      user-space, thus it's better to avoid spam in dmesg if allocation failed.
      Also add __GFP_ACCOUNT as this is a good candidate for per-memcg
      accounting. Allocation is temporary and limited by 4GB.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vasily Averin's avatar
      memcg: enable accounting for VLAN group array · a89893dd
      Vasily Averin authored
      The VLAN group array consumes up to 8 pages of memory per net device.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vasily Averin's avatar
      memcg: enable accounting for inet_bin_bucket cache · 990c74e3
      Vasily Averin authored
      A net namespace can create up to 64K TCP and DCCP ports and force the kernel
      to allocate up to several megabytes of memory per netns
      for inet_bind_bucket objects.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vasily Averin's avatar
      memcg: enable accounting for IP address and routing-related objects · 6126891c
      Vasily Averin authored
      A netadmin inside a container can use 'ip a a' and 'ip r a'
      to assign a large number of ipv4/ipv6 addresses and routing entries
      and force the kernel to allocate megabytes of unaccounted memory
      for long-lived per-netdevice related kernel objects:
      'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node',
      'struct rt6_info', 'struct fib_rules' and ip_fib caches.
      
      These objects can be manually removed, though usually they live
      in memory until their net namespace is destroyed.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      
      One such object is the 'struct fib6_node', mostly allocated in
      net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
      
       write_lock_bh(&table->tb6_lock);
       err = fib6_add(&table->tb6_root, rt, info, mxc);
       write_unlock_bh(&table->tb6_lock);
      
      In this case it is not enough to simply add SLAB_ACCOUNT to the corresponding
      kmem cache. The proper memory cgroup still cannot be found due to the
      incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
      
      The obsolete in_interrupt() does not describe the real execution context properly.
      From include/linux/preempt.h:
      
       The following macros are deprecated and should not be used in new code:
       in_interrupt()	- We're in NMI,IRQ,SoftIRQ context or have BH disabled
      
      To verify the current execution context, the new macro should be used instead:
       in_task()	- We're in task context
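      
      The difference between the two checks can be shown with a simplified
      userspace model of the preempt counter. The bit positions and widths
      below, and all MODEL_*/model_* names, are assumptions for this sketch
      and should not be read as the kernel's exact layout. What it does
      reflect is that disabling bottom halves bumps the softirq field
      without setting the "serving a softirq" bit.

```c
/* Simplified preempt counter layout; shifts and widths are assumed. */
#define MODEL_SOFTIRQ_SHIFT	8
#define MODEL_HARDIRQ_SHIFT	16
#define MODEL_NMI_SHIFT		20

#define MODEL_SOFTIRQ_OFFSET	(1u << MODEL_SOFTIRQ_SHIFT)
#define MODEL_SOFTIRQ_MASK	(0xffu << MODEL_SOFTIRQ_SHIFT)
#define MODEL_HARDIRQ_MASK	(0xfu << MODEL_HARDIRQ_SHIFT)
#define MODEL_NMI_MASK		(0xfu << MODEL_NMI_SHIFT)

/*
 * Disabling bottom halves adds twice the softirq offset, so the lowest
 * bit of the softirq field is left to mean "currently serving a softirq".
 */
#define MODEL_BH_DISABLE_OFFSET	(2u * MODEL_SOFTIRQ_OFFSET)

/* True in NMI, hard IRQ, softirq, and also whenever BH is disabled. */
static int model_in_interrupt(unsigned int pc)
{
	return (pc & (MODEL_NMI_MASK | MODEL_HARDIRQ_MASK |
		      MODEL_SOFTIRQ_MASK)) != 0;
}

/* True unless we are really in NMI, hard IRQ, or serving a softirq. */
static int model_in_task(unsigned int pc)
{
	return (pc & (MODEL_NMI_MASK | MODEL_HARDIRQ_MASK |
		      MODEL_SOFTIRQ_OFFSET)) == 0;
}
```

      With only BH disabled, as inside a write_lock_bh() section, the model
      reports in_interrupt() as true even though the code still runs in task
      context. That is the case in which an in_interrupt()-based bypass
      fails to find the memory cgroup.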
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Vasily Averin's avatar
      memcg: enable accounting for net_device and Tx/Rx queues · c948f51c
      Vasily Averin authored
      A container netadmin can create a lot of fake net devices,
      then create a new net namespace and repeat this again and again.
      A net device can request the creation of up to 4096 tx and rx queues,
      and force the kernel to allocate up to several tens of megabytes of memory
      per net device.
      
      It makes sense to account for them to restrict the host's memory
      consumption from inside the memcg-limited container.
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'bridge-vlan-multicast' · 2967eed9
      David S. Miller authored
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: multicast: add vlan support
      
      This patchset adds initial per-vlan multicast support, most of the code
      deals with moving to multicast context pointers from bridge/port pointers.
      That allows us to switch them with the per-vlan contexts when a multicast
      packet is being processed and vlan multicast snooping has been enabled.
      That is controlled by a global bridge option added in patch 06 which is
      off by default (BR_BOOLOPT_MCAST_VLAN_SNOOPING). It is important to note
      that this option can change only under RTNL and doesn't require
      multicast_lock, so we need to be careful when retrieving mcast contexts
      in parallel. For packet processing they are switched only once in
      br_multicast_rcv() and then used until the packet has been processed.
      For the most part we need these contexts only to read config values and
      check if they are disabled. The global mcast state which is maintained
      consists of querier and router timers, the rest are config options.
      The port mcast state which is maintained consists of query timer and
      link to router port list if it's ever marked as a router port. Port
      multicast contexts _must_ be used only with their respective global
      contexts, that is a bridge port's mcast context must be used only with
      bridge's global mcast context and a vlan/port's mcast context must be
      used only with that vlan's global mcast context due to the router port
      lists. This way a bridge port can be marked as a router in multiple
      vlans, but might not be a router in some other vlan. Also this allows us
      to have per-vlan querier elections, per-vlan queries and basically the
      whole multicast state becomes per-vlan when the option is enabled.
      One of the hardest parts is synchronization with vlan's memory
      management, that is done through a new vlan flag: BR_VLFLAG_MCAST_ENABLED
      which is changed only under multicast_lock. When a vlan is being
      destroyed first that flag is removed under the lock, then the multicast
      context is torn down which includes waiting for any outstanding context
      timers. Since all of the vlan processing depends on BR_VLFLAG_MCAST_ENABLED
      it must be checked first if the contexts are vlan and the multicast_lock
      has been acquired. That is done by all IGMP/MLD packet processing
      functions and timers. When processing a packet we have RCU so the vlan
      memory won't be freed, but if the flag is missing we must not process it.
      The timers are synchronized in the same way with the addition of waiting
      for them to finish in case they are running after removing the flag
      under multicast_lock (i.e. they were waiting for the lock). Multicast vlan
      snooping requires vlan filtering to be enabled, if it's disabled then
      snooping gets automatically disabled, too. BR_VLFLAG_GLOBAL_MCAST_ENABLED
      controls if a vlan has BR_VLFLAG_MCAST_ENABLED set which is used in all
      vlan disabled checks. We need both flags because one is controlled by
      user-space globally (BR_VLFLAG_GLOBAL_MCAST_ENABLED) and the other is
      for a particular bridge/vlan or port/vlan entry (BR_VLFLAG_MCAST_ENABLED).
      Since the latter is also used for synchronization between the multicast
      and vlan code, and also controlled by BR_VLFLAG_GLOBAL_MCAST_ENABLED we
      rely on it when checking if a vlan context is disabled. The multicast
      fast-path has 3 new bit tests on the cache-hot bridge flags field, I
      didn't observe any measurable difference. I haven't forced either
      context options to be always disabled when the other type is enabled
      because the state consists of timers which either expire (router) or
      don't affect the normal operation. Some options, like the mcast querier
      one, won't be allowed to change for the disabled context type, that will
      come with a future patch-set which adds per-vlan querier control.
      
      Another important addition is the global vlan options, so far we had
      only per bridge/port vlan options but in order to control vlan multicast
      snooping globally we need to add a new type of global vlan options.
      They can be changed only on the bridge device and are dumped only when a
      special flag is set in the dump request. The first global option is vlan
      mcast snooping control, it controls the vlan BR_VLFLAG_GLOBAL_MCAST_ENABLED
      private flag. It can be set only on master vlan entries. There will be
      many more global vlan options in the future both for multicast config
      and other per-vlan options (e.g. STP).
      
      There's a lot of room for improvements, I'll do some of the initial
      ones but splitting the state to different contexts opens the door
      for a lot more. Also any new multicast options become vlan-supported with
      very little to no effort by using the same contexts.
      
      Short patch description:
        patches 01-04: initial mcast context add, no functional changes
        patch      05: adds vlan mcast init and control helpers and uses them on
                       vlan create/destroy
        patch      06: adds a global bridge mcast vlan snooping knob (default
                       off)
        patches 07-08: add a helper for users which must derive the contexts
                       based on current bridge and vlan options (e.g. timers)
        patch      09: adds checks for disabled vlan contexts in packet
                       processing and timers
        patch      10: adds support for per-vlan querier and tagged queries
        patch      11: adds router port vlan id in the notifications
        patches 12-14: add global vlan options support (change, dump, notify)
        patch      15: adds per-vlan global mcast snooping control
      
      Future patch-sets which build on this one (in order):
       - vlan state mcast handling
       - user-space mdb contexts (currently only the bridge contexts are used
         there)
       - all bridge multicast config options added per-vlan global and per
         vlan/port
       - iproute2 support for all the new uAPIs
       - selftests
      
      This set has been stress-tested (deleting/adding ports/vlans while changing
      vlan mcast snooping while processing IGMP/MLD packets), and also has
      passed all bridge self-tests. I'm sending this set as early as possible
      since there're a few more related sets that should go in the same
      release to get proper and full mcast vlan snooping support.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Nikolay Aleksandrov's avatar
      net: bridge: vlan: add mcast snooping control · 9dee572c
      Nikolay Aleksandrov authored
      Add a new global vlan option which controls whether multicast snooping
      is enabled or disabled for a single vlan. It controls the vlan private
      flag: BR_VLFLAG_GLOBAL_MCAST_ENABLED.
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Nikolay Aleksandrov's avatar
      net: bridge: vlan: notify when global options change · 9aba624d
      Nikolay Aleksandrov authored
      Add support for global options notifications. They use only RTM_NEWVLAN
      since global options can only be set and are contained in a separate
      vlan global options attribute. Notifications are compressed in ranges
      where possible, i.e. the sequential vlan options are equal.
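      
      The range compression mentioned above can be sketched roughly as
      follows. The structs, the single mcast_snooping field standing in for
      "the global options", and model_compress() are all hypothetical; the
      bridge code builds netlink messages rather than filling an array.

```c
#include <stddef.h>

struct model_vlan_opts {
	unsigned short vid;
	unsigned char mcast_snooping;	/* stand-in for the global options */
};

struct model_range {
	unsigned short start, end;
	unsigned char mcast_snooping;
};

/*
 * Collapse consecutive VIDs with identical options into ranges.
 * The input must be sorted by vid. Returns the number of ranges written.
 */
static size_t model_compress(const struct model_vlan_opts *v, size_t n,
			     struct model_range *out, size_t out_len)
{
	size_t used = 0;

	for (size_t i = 0; i < n; i++) {
		if (used > 0 &&
		    out[used - 1].end + 1 == v[i].vid &&
		    out[used - 1].mcast_snooping == v[i].mcast_snooping) {
			out[used - 1].end = v[i].vid;	/* extend current range */
			continue;
		}
		if (used == out_len)
			break;	/* output full; size out for n entries to avoid this */
		out[used].start = v[i].vid;
		out[used].end = v[i].vid;
		out[used].mcast_snooping = v[i].mcast_snooping;
		used++;
	}
	return used;
}
```

      A range is closed as soon as either the VID sequence has a gap or the
      options differ, so each emitted range describes vlans that are both
      sequential and identically configured.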
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>