1. 30 Mar, 2020 18 commits
  2. 29 Mar, 2020 10 commits
    • Linux 5.6 · 7111951b
      Linus Torvalds authored
    • Merge branch 'akpm' (patches from Andrew) · 570203ec
      Linus Torvalds authored
      Merge vm fixes from Andrew Morton:
       "5 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm/sparse: fix kernel crash with pfn_section_valid check
        mm: fork: fix kernel_stack memcg stats for various stack implementations
        hugetlb_cgroup: fix illegal access to memory
        drivers/base/memory.c: indicate all memory blocks as removable
        mm/swapfile.c: move inode_lock out of claim_swapfile
    • Merge tag 'timers-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ab93e984
      Linus Torvalds authored
      Pull timer fix from Thomas Gleixner:
       "A single fix for the Hyper-V clocksource driver to make sched clock
        actually return nanoseconds and not the virtual clock value which
        increments at 1e7 Hz (100ns)"
      
      * tag 'timers-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        clocksource/drivers/hyper-v: Make sched clock return nanoseconds correctly
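
      The unit conversion behind this fix can be sketched in plain C. This is
      a minimal illustration, assuming only what the pull message states (one
      counter tick per 100ns); the helper name and constant are hypothetical
      and are not taken from the Hyper-V driver.

```c
#include <stdint.h>

/* Assumption from the pull message: one reference-counter tick is 100 ns. */
#define MODEL_HV_TICK_NS 100ULL

/* Hypothetical helper: convert a raw 100 ns tick count to nanoseconds. */
static uint64_t model_hv_ticks_to_ns(uint64_t ticks)
{
	return ticks * MODEL_HV_TICK_NS;
}
```

      Returning the raw tick count where nanoseconds are expected would
      understate elapsed time by a factor of 100.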
    • Merge tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 01af08bd
      Linus Torvalds authored
      Pull irq fix from Thomas Gleixner:
       "A single bugfix to prevent reference leaks in irq affinity notifiers"
      
      * tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq: Fix reference leaks on irq affinity notifiers
    • mm/sparse: fix kernel crash with pfn_section_valid check · b943f045
      Aneesh Kumar K.V authored
      Fix a crash like this:
      
          BUG: Kernel NULL pointer dereference on read at 0x00000000
          Faulting instruction address: 0xc000000000c3447c
          Oops: Kernel access of bad area, sig: 11 [#1]
          LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
          CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
          ...
          NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
          LR [c000000000088354] vmemmap_free+0x144/0x320
          Call Trace:
             section_deactivate+0x220/0x240
             __remove_pages+0x118/0x170
             arch_remove_memory+0x3c/0x150
             memunmap_pages+0x1cc/0x2f0
             devm_action_release+0x30/0x50
             release_nodes+0x2f8/0x3e0
             device_release_driver_internal+0x168/0x270
             unbind_store+0x130/0x170
             drv_attr_store+0x44/0x60
             sysfs_kf_write+0x68/0x80
             kernfs_fop_write+0x100/0x290
             __vfs_write+0x3c/0x70
             vfs_write+0xcc/0x240
             ksys_write+0x7c/0x140
             system_call+0x5c/0x68
      
      The crash is due to NULL dereference at
      
      	test_bit(idx, ms->usage->subsection_map);
      
      due to ms->usage = NULL in pfn_section_valid()
      
      With commit d41e2f3b ("mm/hotplug: fix hot remove failure in
      SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
      depopulate_section_memmap().  This was done so that pfn_to_page() can
      work correctly with a kernel config that disables SPARSEMEM_VMEMMAP.
      With that config pfn_to_page() does
      
      	__section_mem_map_addr(__sec) + __pfn;
      
      where
      
        static inline struct page *__section_mem_map_addr(struct mem_section *section)
        {
      	unsigned long map = section->section_mem_map;
      	map &= SECTION_MAP_MASK;
      	return (struct page *)map;
        }
      
      Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is
      used to check the pfn validity (pfn_valid()).  Since section_deactivate()
      releases mem_section->usage if a section is fully deactivated, a
      pfn_valid() check after a subsection deactivation causes a kernel crash.
      
        static inline int pfn_valid(unsigned long pfn)
        {
        ...
      	return early_section(ms) || pfn_section_valid(ms, pfn);
        }
      
      where
      
        static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
        {
      	int idx = subsection_map_index(pfn);
      
      	return test_bit(idx, ms->usage->subsection_map);
        }
      
      Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
      freed.  For architectures like ppc64 where large pages are used for
      vmemmap mapping (16MB), a specific vmemmap mapping can cover multiple
      sections.  Hence before a vmemmap mapping page can be freed, the kernel
      needs to make sure there are no valid sections within that mapping.
      Clearing the section valid bit before depopulate_section_memmap() enables
      this.
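
      The ordering described above can be modelled in userspace C. This is
      only a sketch under simplifying assumptions: the structures are reduced
      stand-ins for the kernel's, the model_* function names are hypothetical,
      and the flag value is chosen for the model alone.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures; not the real layout. */
#define SECTION_HAS_MEM_MAP 0x2UL
#define MODEL_SUBSECTIONS   64

struct mem_section_usage {
	bool subsection_map[MODEL_SUBSECTIONS];
};

struct mem_section {
	unsigned long section_mem_map;   /* carries only flag bits in this model */
	struct mem_section_usage *usage;
};

static bool model_valid_section(const struct mem_section *ms)
{
	return ms && (ms->section_mem_map & SECTION_HAS_MEM_MAP);
}

/* Test the flag first, so a released usage pointer is never dereferenced. */
static bool model_pfn_valid(const struct mem_section *ms, unsigned int idx)
{
	if (idx >= MODEL_SUBSECTIONS || !model_valid_section(ms))
		return false;
	return ms->usage->subsection_map[idx];
}

/* Fully deactivate a section: clear the flag before releasing usage. */
static void model_section_deactivate(struct mem_section *ms)
{
	ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
	free(ms->usage);
	ms->usage = NULL;
}
```

      With the flag cleared first, a validity check issued after deactivation
      returns false instead of following the NULL usage pointer.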
      
      [aneesh.kumar@linux.ibm.com: add comment]
        Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com
      Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
      Fixes: d41e2f3b ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
      Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fork: fix kernel_stack memcg stats for various stack implementations · 8380ce47
      Roman Gushchin authored
      Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
      space for task stacks can be allocated using __vmalloc_node_range(),
      alloc_pages_node() and kmem_cache_alloc_node().
      
      In the first and the second cases the page->mem_cgroup pointer is set,
      but in the third it's not: memcg membership of a slab page should be
      determined using the memcg_from_slab_page() function, which looks at
      page->slab_cache->memcg_params.memcg.  In this case, using
      mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
      the page->mem_cgroup pointer is NULL even for pages charged to a
      non-root memory cgroup.
      
      It can lead to kernel_stack per-memcg counters permanently showing 0 on
      some architectures (depending on the configuration).
      
      In order to fix it, let's introduce a mod_memcg_obj_state() helper,
      which takes a pointer to a kernel object as its first argument, uses
      mem_cgroup_from_obj() to get an RCU-protected memcg pointer and calls
      mod_memcg_state().  This handles all possible configurations
      (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
      spilling any memcg/kmem specifics into fork.c.
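
      The idea of resolving the owning memcg in one place can be sketched as
      a userspace model. Everything here is an assumption made for
      illustration: the structures are reduced stand-ins, the model_* names
      are hypothetical, and the model works on a page descriptor rather than
      on an arbitrary object pointer as the real helper does.

```c
#include <stddef.h>

struct memcg {
	long kernel_stack_kb;            /* per-memcg kernel_stack counter */
};

struct slab_cache {
	struct memcg *memcg;
};

struct page {
	struct memcg *mem_cgroup;        /* set for page- or vmalloc-backed stacks */
	struct slab_cache *slab_cache;   /* set for slab-backed stacks */
};

/* Look in whichever place holds the owner; NULL means nothing to charge. */
static struct memcg *model_memcg_from_page(const struct page *page)
{
	if (page->slab_cache)
		return page->slab_cache->memcg;
	return page->mem_cgroup;
}

/* Callers update the counter without knowing how the stack was allocated. */
static void model_mod_stack_state(const struct page *page, long delta_kb)
{
	struct memcg *memcg = model_memcg_from_page(page);

	if (memcg)
		memcg->kernel_stack_kb += delta_kb;
}
```

      A caller that only read page->mem_cgroup would skip the slab-backed
      case entirely, which matches the counter staying at 0.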
      
      Note: This is a special version of the patch created for stable
      backports.  It contains code from the following two patches:
        - mm: memcg/slab: introduce mem_cgroup_from_obj()
        - mm: fork: fix kernel_stack memcg stats for various stack implementations
      
      [guro@fb.com: introduce mem_cgroup_from_obj()]
        Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb_cgroup: fix illegal access to memory · 726b7bbe
      Mina Almasry authored
      This appears to be a mistake in commit faced7e0 ("mm: hugetlb
      controller for cgroups v2").
      
      Essentially that commit does a hugetlb_cgroup_from_counter assuming that
      page_counter_try_charge has initialized counter.
      
      But if that call has failed, it seems it will not initialize counter, so
      hugetlb_cgroup_from_counter(counter) ends up pointing to random memory,
      causing kasan to complain.
      
      The solution is to simply use 'h_cg', instead of
      hugetlb_cgroup_from_counter(counter), since that is a reference to the
      hugetlb_cgroup anyway.  After this change kasan ceases to complain.
      
      Fixes: faced7e0 ("mm: hugetlb controller for cgroups v2")
      Reported-by: syzbot+cac0c4e204952cf449b1@syzkaller.appspotmail.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/20200313223920.124230-1-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/memory.c: indicate all memory blocks as removable · 53cdc1cb
      David Hildenbrand authored
      We see multiple issues with the implementation/interface to compute
      whether a memory block can be offlined (exposed via
      /sys/devices/system/memory/memoryX/removable) and would like to simplify
      it (remove the implementation).
      
      1. It runs basically lockless. While this might be good for performance,
         we see possible races with memory offlining that will require at
         least some sort of locking to fix.
      
      2. Nowadays, more false positives are possible. No arch-specific checks
         are performed that validate if memory offlining will not be denied
         right away (and such a check will require locking). For example, arm64
         won't allow offlining any memory block that was added during boot,
         which implies a very high error rate. Other archs have other
         constraints.
      
      3. The interface is inherently racy. E.g., if a memory block is detected
         to be removable (and was not a false positive at that time), there is
         still no guarantee that offlining will actually succeed. So any
         caller already has to deal with false positives.
      
      4. It is unclear which performance benefit this interface actually
         provides. The introducing commit 5c755e9f ("memory-hotplug: add
         sysfs removable attribute for hotplug memory remove") mentioned
      
      	"A user-level agent must be able to identify which sections
      	 of memory are likely to be removable before attempting the
      	 potentially expensive operation."
      
         However, no actual performance comparison was included.
      
      Known users:
      
       - lsmem: Will group memory blocks based on the "removable" property. [1]
      
       - chmem: Indirect user. It has a RANGE mode where one can specify
                removable ranges identified via lsmem to be offlined. However,
                it also has a "SIZE" mode, which allows a sysadmin to skip the
                manual "identify removable blocks" step. [2]
      
       - powerpc-utils: Uses the "removable" attribute to skip some memory
                blocks right away when trying to find some to offline+remove.
                However, with ballooning enabled, it already skips this
                information completely (because it once resulted in many false
                negatives). Therefore, the implementation can deal with false
                positives properly already. [3]
      
      According to Nathan Fontenot, DLPAR on powerpc is nowadays no longer
      driven from userspace via the drmgr command (powerpc-utils).  Nowadays
      it's managed in the kernel - including onlining/offlining of memory
      blocks - triggered by drmgr writing to /sys/kernel/dlpar.  So the
      affected legacy userspace handling is only active on old kernels.  Only
      very old versions of drmgr on a new kernel (unlikely) might execute
      slower - totally acceptable.
      
      With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not
      break any user space tool.  We implement a very bad heuristic now.
      Without CONFIG_MEMORY_HOTREMOVE we cannot offline anything, so report
      "not removable" as before.
      
      Original discussion can be found in [4] ("[PATCH RFC v1] mm:
      is_mem_section_removable() overhaul").
      
      Other users of is_mem_section_removable() will be removed next, so that
      we can remove is_mem_section_removable() completely.
      
      [1] http://man7.org/linux/man-pages/man1/lsmem.1.html
      [2] http://man7.org/linux/man-pages/man8/chmem.8.html
      [3] https://github.com/ibm-power-utilities/powerpc-utils
      [4] https://lkml.kernel.org/r/20200117105759.27905-1-david@redhat.com
      
      Also, this patch probably fixes a crash reported by Steve.
      http://lkml.kernel.org/r/CAPcyv4jpdaNvJ67SkjyUJLBnBnXXQv686BiVW042g03FUmWLXw@mail.gmail.com
      Reported-by: "Scargall, Steve" <steve.scargall@intel.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Nathan Fontenot <ndfont@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Karel Zak <kzak@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200128093542.6908-1-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapfile.c: move inode_lock out of claim_swapfile · d795a90e
      Naohiro Aota authored
      claim_swapfile() currently keeps the inode locked when it is successful,
      or when the file is already a swapfile (with -EBUSY).  In the other
      error cases, it does not lock the inode.

      This inconsistency between the lock state and the return value is quite
      confusing and actually causes a bad unlock balance, as shown below, in
      the "bad_swap" section of __do_sys_swapon().

      This commit fixes the issue by moving the inode_lock() and IS_SWAPFILE
      check out of claim_swapfile().  The inode is unlocked in the
      "bad_swap_unlock_inode" section, so that the inode is guaranteed to be
      unlocked at "bad_swap".  Thus, error-handling code after the locking now
      jumps to "bad_swap_unlock_inode" instead of "bad_swap".
      
          =====================================
          WARNING: bad unlock balance detected!
          5.5.0-rc7+ #176 Not tainted
          -------------------------------------
          swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550
          but there are no more locks to release!
      
          other info that might help us debug this:
          no locks held by swapon/4294.
      
          stack backtrace:
          CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176
          Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014
          Call Trace:
           dump_stack+0xa1/0xea
           print_unlock_imbalance_bug.cold+0x114/0x123
           lock_release+0x562/0xed0
           up_write+0x2d/0x490
           __do_sys_swapon+0x94b/0x3550
           __x64_sys_swapon+0x54/0x80
           do_syscall_64+0xa4/0x4b0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
          RIP: 0033:0x7f15da0a0dc7
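
      The two-label error path described in this commit can be sketched as a
      userspace model. This is an illustration only: the lock is a plain
      depth counter, the model_* names and parameters are hypothetical, and
      the real swapon path does far more work between the labels.

```c
#include <errno.h>
#include <stdbool.h>

struct model_inode {
	int lock_depth;      /* 0 means unlocked; must never go negative */
	bool is_swapfile;
};

static void model_inode_lock(struct model_inode *inode)
{
	inode->lock_depth++;
}

static void model_inode_unlock(struct model_inode *inode)
{
	inode->lock_depth--;
}

/* The claim step takes no lock in this model; it only succeeds or fails. */
static int model_claim(bool ok)
{
	return ok ? 0 : -EINVAL;
}

static int model_swapon(struct model_inode *inode, bool claim_ok, bool setup_ok)
{
	int err;

	err = model_claim(claim_ok);
	if (err)
		goto bad_swap;               /* lock not taken yet */

	model_inode_lock(inode);
	if (inode->is_swapfile) {
		err = -EBUSY;
		goto bad_swap_unlock_inode;  /* lock held: must release it */
	}
	if (!setup_ok) {
		err = -EINVAL;
		goto bad_swap_unlock_inode;
	}
	inode->is_swapfile = true;
	model_inode_unlock(inode);
	return 0;

bad_swap_unlock_inode:
	model_inode_unlock(inode);
bad_swap:
	return err;
}
```

      Because every jump taken while the lock is held goes through
      "bad_swap_unlock_inode", the depth counter returns to 0 on all paths.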
      
      Fixes: 1638045c ("mm: set S_SWAPFILE on blockdev swap devices")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Qais Yousef <qais.yousef@arm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · e595dd94
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix memory leak in vti6, from Torsten Hilbrich.
      
       2) Fix double free in xfrm_policy_timer, from YueHaibing.
      
       3) NL80211_ATTR_CHANNEL_WIDTH attribute is put with wrong type, from
          Johannes Berg.
      
       4) Wrong allocation failure check in qlcnic driver, from Xu Wang.
      
       5) Get ks8851-ml IO operations right, for real this time, from Marek
          Vasut.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits)
        r8169: fix PHY driver check on platforms w/o module softdeps
        net: ks8851-ml: Fix IO operations, again
        mlxsw: spectrum_mr: Fix list iteration in error path
        qlcnic: Fix bad kzalloc null test
        mac80211: set IEEE80211_TX_CTRL_PORT_CTRL_PROTO for nl80211 TX
        mac80211: mark station unauthorized before key removal
        mac80211: Check port authorization in the ieee80211_tx_dequeue() case
        cfg80211: Do not warn on same channel at the end of CSA
        mac80211: drop data frames without key on encrypted links
        ieee80211: fix HE SPR size calculation
        nl80211: fix NL80211_ATTR_CHANNEL_WIDTH attribute type
        xfrm: policy: Fix doulbe free in xfrm_policy_timer
        bpf: Explicitly memset some bpf info structures declared on the stack
        bpf: Explicitly memset the bpf_attr structure
        bpf: Sanitize the bpf_struct_ops tcp-cc name
        vti6: Fix memory leak of skb if input policy check fails
        esp: remove the skb from the chain when it's enqueued in cryptd_wq
        ipv6: xfrm6_tunnel.c: Use built-in RCU list checking
        xfrm: add the missing verify_sec_ctx_len check in xfrm_add_acquire
        xfrm: fix uctx len check in verify_sec_ctx_len
        ...
  3. 28 Mar, 2020 4 commits
  4. 27 Mar, 2020 8 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · a0ba26f3
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2020-03-27
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 3 non-merge commits during the last 4 day(s) which contain
      a total of 4 files changed, 25 insertions(+), 20 deletions(-).
      
      The main changes are:
      
      1) Explicitly memset the bpf_attr structure on bpf() syscall to avoid
         having to rely on compiler to do so. Issues have been noticed on
         some compilers with padding and other oddities where the request was
         then unexpectedly rejected, from Greg Kroah-Hartman.
      
      2) Sanitize the bpf_struct_ops TCP congestion control name in order to
         avoid problematic characters such as whitespaces, from Martin KaFai Lau.
      ====================
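
      The explicit-memset approach from item 1 can be sketched in userspace
      C. This is a minimal model, not the bpf() syscall code: the structure,
      the model_* names and the size check are assumptions for illustration.

```c
#include <string.h>

/* Hypothetical attribute layout; padding may follow 'cmd_arg' on many ABIs. */
struct model_attr {
	unsigned int cmd_arg;
	unsigned long long ptr;
	unsigned int flags;
};

/*
 * Copy 'size' caller-supplied bytes into a fully zeroed attr.
 * Returns 0 on success, or -1 if the caller claims more than fits.
 */
static int model_attr_from_user(struct model_attr *attr, const void *uattr,
				size_t size)
{
	if (size > sizeof(*attr))
		return -1;
	memset(attr, 0, sizeof(*attr));  /* zeroes fields and padding alike */
	memcpy(attr, uattr, size);
	return 0;
}

/* Return 1 if every byte from offset 'from' to the end is zero. */
static int model_tail_is_zero(const struct model_attr *attr, size_t from)
{
	const unsigned char *p = (const unsigned char *)attr;
	size_t i;

	for (i = from; i < sizeof(*attr); i++)
		if (p[i])
			return 0;
	return 1;
}
```

      memset() writes every byte of the object, so later checks that expect
      unused bytes to be zero do not depend on how the compiler handles
      padding in an initializer.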
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'DSA-mtu' · 1a147b74
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Configure the MTU on DSA switches
      
      This series adds support for configuring the MTU on front-panel switch
      ports, while seamlessly adapting the CPU port and the DSA master to the
      largest value plus the tagger overhead.
      
      It also implements bridge MTU auto-normalization within the DSA core,
      following the feedback on the implementation of this feature inside
      the bridge driver in v2.
      
      Support was added for quite a number of switches, in the hope that this
      series would gain some traction:
       - sja1105
       - felix
       - vsc73xx
       - b53 and rest of the platform
      
      V3 of this series was submitted here:
      https://patchwork.ozlabs.org/cover/1262394/
      
      V2 of this series was submitted here:
      https://patchwork.ozlabs.org/cover/1261471/
      
      V1 of this series was submitted here:
      https://patchwork.ozlabs.org/cover/1199868/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: felix: support changing the MTU · 0b912fc9
      Vladimir Oltean authored
      Changing the MTU for this switch means altering the
      DEV_GMII:MAC_CFG_STATUS:MAC_MAXLEN_CFG field MAX_LEN, which in turn
      limits the size of frames that can be received.
      
      Special accounting needs to be done for the DSA CPU port (NPI port in
      hardware terms). The NPI port configuration needs to be held inside the
      private ocelot structure, since it is now accessed from multiple places.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: vsc73xx: make the MTU configurable · fb77ffc6
      Vladimir Oltean authored
      Instead of hardcoding the MTU to the maximum value allowed by the
      hardware, obey the value known by the operating system.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: implement the port MTU callbacks · c279c726
      Vladimir Oltean authored
      On this switch, the frame length enforcements are performed by the
      ingress policers. There are 2 types of those: regular L2 (also called
      best-effort) and Virtual Link policers (an ARINC664/AFDX concept for
      defining L2 streams with certain QoS abilities). To avoid future
      confusion, I prefer to call the reset reason "Best-effort policers",
      even though the VL policers are not yet supported.
      
      We also need to change the setup of the initial static config, such that
      DSA calls to .change_mtu (which are expensive) become no-ops and don't
      reset the switch 5 times.
      
      A driver-level decision is to unconditionally allow single VLAN-tagged
      traffic on all ports. The CPU port must accept an additional VLAN header
      for the DSA tag, which is again a driver-level decision.
      
      The policers actually count bytes not only from the SDU, but also from
      the Ethernet header and FCS, so those need to be accounted for as well.
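
      The byte accounting described above can be sketched as follows. This is
      a hedged illustration rather than the driver's code: the helper name is
      hypothetical, and the constants are the standard Ethernet sizes defined
      locally for the example.

```c
/* Standard Ethernet sizes in bytes, defined locally for this sketch. */
#define MODEL_ETH_HLEN    14  /* destination MAC + source MAC + EtherType */
#define MODEL_VLAN_HLEN    4  /* one 802.1Q tag */
#define MODEL_ETH_FCS_LEN  4  /* frame check sequence */

/* Hypothetical helper: policer byte limit for a given L2 payload MTU. */
static int model_policer_maxlen(int mtu, int is_cpu_port)
{
	/* Every port admits single VLAN-tagged frames. */
	int maxlen = mtu + MODEL_ETH_HLEN + MODEL_VLAN_HLEN + MODEL_ETH_FCS_LEN;

	/* The CPU port also carries a VLAN header used as the DSA tag. */
	if (is_cpu_port)
		maxlen += MODEL_VLAN_HLEN;
	return maxlen;
}
```

      Under these assumptions an MTU of 1500 gives a limit of 1522 bytes on a
      front-panel port and 1526 bytes on the CPU port.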
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: b53: add MTU configuration support · 6ae5834b
      Murali Krishna Policharla authored
      It looks like the Broadcom switches supported by the b53 driver don't
      support precise configuration of the MTU, but just a mumbo-jumbo boolean
      flag. Set that.
      
      Also configure BCM583XX devices to send and receive jumbo frames when
      ports are configured with 10/100 Mbps speed.
      Signed-off-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: implement auto-normalization of MTU for bridge hardware datapath · bff33f7e
      Vladimir Oltean authored
      Many switches don't have an explicit knob for configuring the MTU
      (maximum transmission unit per interface).  Instead, they do the
      length-based packet admission checks on the ingress interface, for
      reasons that are easy to understand (why would you accept a packet in
      the queuing subsystem if you know you're going to drop it anyway).
      
      So it is actually the MRU that these switches permit configuring.
      
      In Linux there only exists the IFLA_MTU netlink attribute and the
      associated dev_set_mtu function. The comments like to play blind and say
      that it's changing the "maximum transfer unit", which is to say that
      there isn't any directionality in the meaning of the MTU word. So that
      is the interpretation that this patch is giving to things: MTU == MRU.
      
      When 2 interfaces having different MTUs are bridged, the bridge driver
      MTU auto-adjustment logic kicks in: what br_mtu_auto_adjust() does is it
      adjusts the MTU of the bridge net device itself (and not that of the
      slave net devices) to the minimum value of all slave interfaces, in
      order for forwarded packets to not exceed the MTU regardless of the
      interface they are received and sent on.
      
      The idea behind this behavior, and why the slave MTUs are not adjusted,
      is that normal termination from Linux over the L2 forwarding domain
      should happen over the bridge net device, which _is_ properly limited by
      the minimum MTU. And termination over individual slave devices is
      possible even if those are bridged. But that is not "forwarding", so
      there's no reason to do normalization there, since only a single
      interface sees that packet.
      
      The problem with those switches that can only control the MRU is with
      the offloaded data path, where a packet received on an interface with
      MRU 9000 would still be forwarded to an interface with MRU 1500. And the
      br_mtu_auto_adjust() function does not really help, since the MTU
      configured on the bridge net device is ignored.
      
      In order to enforce the de-facto MTU == MRU rule for these switches, we
      need to do MTU normalization, which means: in order for no packet larger
      than the MTU configured on this port to be sent, then we need to limit
      the MRU on all ports that this packet could possibly come from. AKA
      since we are configuring the MRU via MTU, it means that all ports within
      a bridge forwarding domain should have the same MTU.
      
      And that is exactly what this patch is trying to do.
      
      From an implementation perspective, we try to follow the intent of the
      user, otherwise there is a risk that we might livelock them (they try to
      change the MTU on an already-bridged interface, but we just keep
      changing it back in an attempt to keep the MTU normalized). So the MTU
      that the bridge is normalized to is either:
      
       - The most recently changed one:
      
         ip link set dev swp0 master br0
         ip link set dev swp1 master br0
         ip link set dev swp0 mtu 1400
      
         This sequence will make swp1 inherit MTU 1400 from swp0.
      
       - That of the interface most recently added to the bridge:
      
         ip link set dev swp0 master br0
         ip link set dev swp1 mtu 1400
         ip link set dev swp1 master br0
      
         The above sequence will make swp0 inherit MTU 1400 as well.
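
      The two rules above can be sketched as a small userspace model. This is
      an illustration under stated assumptions: the bridge is a fixed-size
      array of port MTUs, and all model_* names are hypothetical rather than
      DSA core functions.

```c
#include <stddef.h>

#define MODEL_MAX_PORTS 8

struct model_bridge {
	int mtu[MODEL_MAX_PORTS];  /* MTU of each bridged port */
	size_t nports;
};

/* Apply one MTU to every port in the forwarding domain. */
static void model_normalize(struct model_bridge *br, int mtu)
{
	size_t i;

	for (i = 0; i < br->nports; i++)
		br->mtu[i] = mtu;
}

/* Changing a bridged port's MTU: the most recent value wins everywhere. */
static int model_port_set_mtu(struct model_bridge *br, size_t port, int mtu)
{
	if (port >= br->nports)
		return -1;
	model_normalize(br, mtu);
	return 0;
}

/* Joining the bridge: the newcomer's MTU wins everywhere. */
static int model_bridge_join(struct model_bridge *br, int port_mtu)
{
	if (br->nports == MODEL_MAX_PORTS)
		return -1;
	br->nports++;
	model_normalize(br, port_mtu);
	return 0;
}
```

      In both sequences shown above, the model ends with every bridged port
      at MTU 1400.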
      Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: configure the MTU for switch ports · bfcb8132
      Vladimir Oltean authored
      It is useful to be able to configure port policers on a switch to accept
      frames of various sizes:
      
      - Increase the MTU for better throughput from the default of 1500 if it
        is known that there is no 10/100 Mbps device in the network.
      - Decrease the MTU to limit the latency of high-priority frames under
        congestion, or work around various network segments that add extra
        headers to packets which can't be fragmented.
      
      For DSA slave ports, this is mostly a pass-through callback, called
      through the regular ndo ops and at probe time (to ensure consistency
      across all supported switches).
      
      The CPU port is called with an MTU equal to the largest configured MTU
      of the slave ports. The assumption is that the user might want to
      sustain a bidirectional conversation with a partner over any switch
      port.
      
      The DSA master is configured the same as the CPU port, plus the tagger
      overhead. Since the MTU is by definition L2 payload (sans Ethernet
      header), it is up to each individual driver to figure out if it needs to
      do anything special for its frame tags on the CPU port (it shouldn't
      except in special cases). So the MTU does not contain the tagger
      overhead on the CPU port.
      However the MTU of the DSA master, minus the tagger overhead, is used as
      a proxy for the MTU of the CPU port, which does not have a net device.
      This is to avoid uselessly calling the .change_mtu function on the CPU
      port when nothing should change.
      
      So it is safe to assume that the DSA master and the CPU port MTUs are
      apart by exactly the tagger's overhead in bytes.
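
      The relationship between the three MTUs can be sketched in a few lines
      of C. This is a minimal model of the arithmetic only; the function
      names are hypothetical and the tagger overhead is passed in as a plain
      byte count.

```c
#include <stddef.h>

/* The CPU port is given the largest MTU configured on any slave port. */
static int model_cpu_port_mtu(const int *slave_mtu, size_t n)
{
	int largest = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (slave_mtu[i] > largest)
			largest = slave_mtu[i];
	return largest;
}

/* The DSA master also carries the tag, so add the tagger overhead. */
static int model_master_mtu(int cpu_mtu, int tagger_overhead)
{
	return cpu_mtu + tagger_overhead;
}
```

      For example, slave MTUs of 1500, 9000 and 1400 with a 4-byte tag would
      give a CPU port MTU of 9000 and a master MTU of 9004 in this model.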
      
      Some changes were made around dsa_master_set_mtu(), a function which has
      now been removed, for 2 reasons:
        - dev_set_mtu() already calls dev_validate_mtu(), so it's redundant to
          do the same thing in DSA
        - __dev_set_mtu() returns 0 if ops->ndo_change_mtu is an absent method
      That is to say, there's no need for this function in DSA, we can safely
      call dev_set_mtu() directly, take the rtnl lock when necessary, and just
      propagate whatever errors get reported (since the user probably wants to
      be informed).
      
      Some inspiration (mainly in the MTU DSA notifier) was taken from a
      vaguely similar patch from Murali and Florian, who are credited as
      co-developers down below.
      Co-developed-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
      Signed-off-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
      Co-developed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>