1. 11 Oct, 2018 3 commits
  2. 08 Oct, 2018 1 commit
  3. 07 Oct, 2018 7 commits
  4. 06 Oct, 2018 1 commit
    • Greg Kroah-Hartman's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · c1d84a1b
      Greg Kroah-Hartman authored
      Dave writes:
        "Networking fixes:
      
        1) Fix truncation of 32-bit right shift in bpf, from Jann Horn.
      
        2) Fix memory leak in wireless wext compat, from Stefan Seyfried.
      
        3) Use after free in cfg80211's reg_process_hint(), from Yu Zhao.
      
        4) Need to cancel pending work when unbinding in smsc75xx otherwise
           we oops, also from Yu Zhao.
      
        5) Don't allow enslaving a team device to itself, from Ido Schimmel.
      
        6) Fix backwards compat with older userspace for rtnetlink FDB dumps.
           From Mauricio Faria.
      
        7) Add validation of tc policy netlink attributes, from David Ahern.
      
        8) Fix RCU locking in rawv6_send_hdrinc(), from Wei Wang."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
        net: mvpp2: Extract the correct ethtype from the skb for tx csum offload
        ipv6: take rcu lock in rawv6_send_hdrinc()
        net: sched: Add policy validation for tc attributes
        rtnetlink: fix rtnl_fdb_dump() for ndmsg header
        yam: fix a missing-check bug
        net: bpfilter: Fix type cast and pointer warnings
        net: cxgb3_main: fix a missing-check bug
        bpf: 32-bit RSH verification must truncate input before the ALU op
        net: phy: phylink: fix SFP interface autodetection
        be2net: don't flip hw_features when VXLANs are added/deleted
        net/packet: fix packet drop as of virtio gso
        net: dsa: b53: Keep CPU port as tagged in all VLANs
        openvswitch: load NAT helper
        bnxt_en: get the reduced max_irqs by the ones used by RDMA
        bnxt_en: free hwrm resources, if driver probe fails.
        bnxt_en: Fix enables field in HWRM_QUEUE_COS2BW_CFG request
        bnxt_en: Fix VNIC reservations on the PF.
        team: Forbid enslaving team device to itself
        net/usb: cancel pending work when unbinding smsc75xx
        mlxsw: spectrum: Delete RIF when VLAN device is removed
        ...
      c1d84a1b
  5. 05 Oct, 2018 28 commits
    • Greg Kroah-Hartman's avatar
      Merge branch 'akpm' · 091a1eaa
      Greg Kroah-Hartman authored
      * akpm:
        mm: madvise(MADV_DODUMP): allow hugetlbfs pages
        ocfs2: fix locking for res->tracking and dlm->tracking_list
        mm/vmscan.c: fix int overflow in callers of do_shrink_slab()
        mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly
        mm/vmstat.c: fix outdated vmstat_text
        proc: restrict kernel stack dumps to root
        mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes
        mm/migrate.c: split only transparent huge pages when allocation fails
        ipc/shm.c: use ERR_CAST() for shm_lock() error return
        mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl
        mm, thp: fix mlocking THP page with migration enabled
        ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()
        hugetlb: take PMD sharing into account when flushing tlb/caches
        mm: migration: fix migration of huge PMD shared pages
      091a1eaa
    • Daniel Black's avatar
      mm: madvise(MADV_DODUMP): allow hugetlbfs pages · d41aa525
      Daniel Black authored
      Reproducer, assuming 2M of hugetlbfs available:
      
      Hugetlbfs mounted, size=2M and option user=testuser
      
        # mount | grep ^hugetlbfs
        hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
        # sysctl vm.nr_hugepages=1
        vm.nr_hugepages = 1
        # grep Huge /proc/meminfo
        AnonHugePages:         0 kB
        ShmemHugePages:        0 kB
        HugePages_Total:       1
        HugePages_Free:        1
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB
        Hugetlb:            2048 kB
      
      Code:
      
        #include <sys/mman.h>
        #include <stddef.h>
        #define SIZE 2*1024*1024
        int main()
        {
          void *ptr;
          ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);
          madvise(ptr, SIZE, MADV_DONTDUMP);
          madvise(ptr, SIZE, MADV_DODUMP);
        }
      
      Compile and strace:
      
        mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
        madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
        madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)
      
      hugetlbfs pages have VM_DONTEXPAND in the VmFlags driver pages based on
      author testing with analysis from Florian Weimer[1].
      
      The inclusion of VM_DONTEXPAND into the VM_SPECIAL defination was a
      consequence of the large useage of VM_DONTEXPAND in device drivers.
      
      A consequence of [2] is that VM_DONTEXPAND marked pages are unable to be
      marked DODUMP.
      
      A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
      memory for a while and later request that madvise(MADV_DODUMP) on the same
      memory.  We correct this omission by allowing madvice(MADV_DODUMP) on
      hugetlbfs pages.
      
      [1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
      [2] commit 0103bd16 ("mm: prepare VM_DONTDUMP for using in drivers")
      
      Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
      Link: https://lists.launchpad.net/maria-discuss/msg05245.html
      Fixes: 0103bd16 ("mm: prepare VM_DONTDUMP for using in drivers")
      Reported-by: default avatarKenneth Penza <kpenza@gmail.com>
      Signed-off-by: default avatarDaniel Black <daniel@linux.ibm.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d41aa525
    • Ashish Samant's avatar
      ocfs2: fix locking for res->tracking and dlm->tracking_list · cbe355f5
      Ashish Samant authored
      In dlm_init_lockres() we access and modify res->tracking and
      dlm->tracking_list without holding dlm->track_lock.  This can cause list
      corruptions and can end up in kernel panic.
      
      Fix this by locking res->tracking and dlm->tracking_list with
      dlm->track_lock instead of dlm->spinlock.
      
      Link: http://lkml.kernel.org/r/1529951192-4686-1-git-send-email-ashish.samant@oracle.comSigned-off-by: default avatarAshish Samant <ashish.samant@oracle.com>
      Reviewed-by: default avatarChangwei Ge <ge.changwei@h3c.com>
      Acked-by: default avatarJoseph Qi <jiangqi903@gmail.com>
      Acked-by: default avatarJun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cbe355f5
    • Kirill Tkhai's avatar
      mm/vmscan.c: fix int overflow in callers of do_shrink_slab() · b8e57efa
      Kirill Tkhai authored
      do_shrink_slab() returns unsigned long value, and the placing into int
      variable cuts high bytes off.  Then we compare ret and 0xfffffffe (since
      SHRINK_EMPTY is converted to ret type).
      
      Thus a large number of objects returned by do_shrink_slab() may be
      interpreted as SHRINK_EMPTY, if low bytes of their value are equal to
      0xfffffffe.  Fix that by declaration ret as unsigned long in these
      functions.
      
      Link: http://lkml.kernel.org/r/153813407177.17544.14888305435570723973.stgit@localhost.localdomainSigned-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Reported-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b8e57efa
    • Jann Horn's avatar
      mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly · 58bc4c34
      Jann Horn authored
      5dd0b16c ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
      on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
      the kernel unconditional to reduce #ifdef soup, but (either to avoid
      showing dummy zero counters to userspace, or because that code was missed)
      didn't update the vmstat_array, meaning that all following counters would
      be shown with incorrect values.
      
      This only affects kernel builds with
      CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.
      
      Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
      Fixes: 5dd0b16c ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      58bc4c34
    • Jann Horn's avatar
      mm/vmstat.c: fix outdated vmstat_text · 28e2c4bb
      Jann Horn authored
      7a9cdebd ("mm: get rid of vmacache_flush_all() entirely") removed the
      VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
      entry in vmstat_text.  This causes an out-of-bounds access in
      vmstat_show().
      
      Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
      is probably very rare.
      
      Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
      Fixes: 7a9cdebd ("mm: get rid of vmacache_flush_all() entirely")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      28e2c4bb
    • Jann Horn's avatar
      proc: restrict kernel stack dumps to root · f8a00cef
      Jann Horn authored
      Currently, you can use /proc/self/task/*/stack to cause a stack walk on
      a task you control while it is running on another CPU.  That means that
      the stack can change under the stack walker.  The stack walker does
      have guards against going completely off the rails and into random
      kernel memory, but it can interpret random data from your kernel stack
      as instruction pointers and stack pointers.  This can cause exposure of
      kernel stack contents to userspace.
      
      Restrict the ability to inspect kernel stacks of arbitrary tasks to root
      in order to prevent a local attacker from exploiting racy stack unwinding
      to leak kernel task stack contents.  See the added comment for a longer
      rationale.
      
      There don't seem to be any users of this userspace API that can't
      gracefully bail out if reading from the file fails.  Therefore, I believe
      that this change is unlikely to break things.  In the case that this patch
      does end up needing a revert, the next-best solution might be to fake a
      single-entry stack based on wchan.
      
      Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
      Fixes: 2ec220e2 ("proc: add /proc/*/stack")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f8a00cef
    • Anshuman Khandual's avatar
      mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes · 20916d46
      Anshuman Khandual authored
      ARM64 architecture also supports 32MB and 512MB HugeTLB page sizes.  This
      just adds mmap() system call argument encoding for them.
      
      Link: http://lkml.kernel.org/r/1537841300-6979-1-git-send-email-anshuman.khandual@arm.comSigned-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      20916d46
    • Anshuman Khandual's avatar
      mm/migrate.c: split only transparent huge pages when allocation fails · e6112fc3
      Anshuman Khandual authored
      split_huge_page_to_list() fails on HugeTLB pages.  I was experimenting
      with moving 32MB contig HugeTLB pages on arm64 (with a debug patch
      applied) and hit the following stack trace when the kernel crashed.
      
      [ 3732.462797] Call trace:
      [ 3732.462835]  split_huge_page_to_list+0x3b0/0x858
      [ 3732.462913]  migrate_pages+0x728/0xc20
      [ 3732.462999]  soft_offline_page+0x448/0x8b0
      [ 3732.463097]  __arm64_sys_madvise+0x724/0x850
      [ 3732.463197]  el0_svc_handler+0x74/0x110
      [ 3732.463297]  el0_svc+0x8/0xc
      [ 3732.463347] Code: d1000400 f90b0e60 f2fbd5a2 a94982a1 (f9000420)
      
      When unmap_and_move[_huge_page]() fails due to lack of memory, the
      splitting should happen only for transparent huge pages not for HugeTLB
      pages.  PageTransHuge() returns true for both THP and HugeTLB pages.
      Hence the conditonal check should test PagesHuge() flag to make sure that
      given pages is not a HugeTLB one.
      
      Link: http://lkml.kernel.org/r/1537798495-4996-1-git-send-email-anshuman.khandual@arm.com
      Fixes: 94723aaf ("mm: unclutter THP migration")
      Signed-off-by: default avatarAnshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e6112fc3
    • Kees Cook's avatar
      ipc/shm.c: use ERR_CAST() for shm_lock() error return · 59cf0a93
      Kees Cook authored
      This uses ERR_CAST() instead of an open-coded cast, as it is casting
      across structure pointers, which upsets __randomize_layout:
      
      ipc/shm.c: In function `shm_lock':
      ipc/shm.c:209:9: note: randstruct: casting between randomized structure pointer types (ssa): `struct shmid_kernel' and `struct kern_ipc_perm'
      
        return (void *)ipcp;
               ^~~~~~~~~~~~
      
      Link: http://lkml.kernel.org/r/20180919180722.GA15073@beast
      Fixes: 82061c57 ("ipc: drop ipc_lock()")
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      59cf0a93
    • YueHaibing's avatar
      mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl · 51896864
      YueHaibing authored
      get_user_pages_fast() will return negative value if no pages were pinned,
      then be converted to a unsigned, which is compared to zero, giving the
      wrong result.
      
      Link: http://lkml.kernel.org/r/20180921095015.26088-1-yuehaibing@huawei.com
      Fixes: 09e35a4a ("mm/gup_benchmark: handle gup failures")
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      51896864
    • Kirill A. Shutemov's avatar
      mm, thp: fix mlocking THP page with migration enabled · e125fe40
      Kirill A. Shutemov authored
      A transparent huge page is represented by a single entry on an LRU list.
      Therefore, we can only make unevictable an entire compound page, not
      individual subpages.
      
      If a user tries to mlock() part of a huge page, we want the rest of the
      page to be reclaimable.
      
      We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
      PMD on border of VM_LOCKED VMA will be split into PTE table.
      
      Introduction of THP migration breaks[1] the rules around mlocking THP
      pages.  If we had a single PMD mapping of the page in mlocked VMA, the
      page will get mlocked, regardless of PTE mappings of the page.
      
      For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
      remove_migration_pmd().
      
      Anon THP pages can only be shared between processes via fork().  Mlocked
      page can only be shared if parent mlocked it before forking, otherwise CoW
      will be triggered on mlock().
      
      For Anon-THP, we can fix the issue by munlocking the page on removing PTE
      migration entry for the page.  PTEs for the page will always come after
      mlocked PMD: rmap walks VMAs from oldest to newest.
      
      Test-case:
      
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/wait.h>
      	#include <linux/mempolicy.h>
      	#include <numaif.h>
      
      	int main(void)
      	{
      	        unsigned long nodemask = 4;
      	        void *addr;
      
      		addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
      			MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
      
      	        if (fork()) {
      			wait(NULL);
      			return 0;
      	        }
      
      	        mlock(addr, 4UL << 10);
      	        mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
      	                &nodemask, 4, MPOL_MF_MOVE);
      
      	        return 0;
      	}
      
      [1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com
      
      Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
      Fixes: 616b8371 ("mm: thp: enable thp migration in generic path")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarVegard Nossum <vegard.nossum@oracle.com>
      Reviewed-by: default avatarZi Yan <zi.yan@cs.rutgers.edu>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e125fe40
    • Larry Chen's avatar
      ocfs2: fix crash in ocfs2_duplicate_clusters_by_page() · 69eb7765
      Larry Chen authored
      ocfs2_duplicate_clusters_by_page() may crash if one of the extent's pages
      is dirty.  When a page has not been written back, it is still in dirty
      state.  If ocfs2_duplicate_clusters_by_page() is called against the dirty
      page, the crash happens.
      
      To fix this bug, we can just unlock the page and wait until the page until
      its not dirty.
      
      The following is the backtrace:
      
      kernel BUG at /root/code/ocfs2/refcounttree.c:2961!
      [exception RIP: ocfs2_duplicate_clusters_by_page+822]
      __ocfs2_move_extent+0x80/0x450 [ocfs2]
      ? __ocfs2_claim_clusters+0x130/0x250 [ocfs2]
      ocfs2_defrag_extent+0x5b8/0x5e0 [ocfs2]
      __ocfs2_move_extents_range+0x2a4/0x470 [ocfs2]
      ocfs2_move_extents+0x180/0x3b0 [ocfs2]
      ? ocfs2_wait_for_recovery+0x13/0x70 [ocfs2]
      ocfs2_ioctl_move_extents+0x133/0x2d0 [ocfs2]
      ocfs2_ioctl+0x253/0x640 [ocfs2]
      do_vfs_ioctl+0x90/0x5f0
      SyS_ioctl+0x74/0x80
      do_syscall_64+0x74/0x140
      entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Once we find the page is dirty, we do not wait until it's clean, rather we
      use write_one_page() to write it back
      
      Link: http://lkml.kernel.org/r/20180829074740.9438-1-lchen@suse.com
      [lchen@suse.com: update comments]
        Link: http://lkml.kernel.org/r/20180830075041.14879-1-lchen@suse.com
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarLarry Chen <lchen@suse.com>
      Acked-by: default avatarChangwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69eb7765
    • Mike Kravetz's avatar
      hugetlb: take PMD sharing into account when flushing tlb/caches · dff11abe
      Mike Kravetz authored
      When fixing an issue with PMD sharing and migration, it was discovered via
      code inspection that other callers of huge_pmd_unshare potentially have an
      issue with cache and tlb flushing.
      
      Use the routine adjust_range_if_pmd_sharing_possible() to calculate worst
      case ranges for mmu notifiers.  Ensure that this range is flushed if
      huge_pmd_unshare succeeds and unmaps a PUD_SUZE area.
      
      Link: http://lkml.kernel.org/r/20180823205917.16297-3-mike.kravetz@oracle.comSigned-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dff11abe
    • Mike Kravetz's avatar
      mm: migration: fix migration of huge PMD shared pages · 017b1660
      Mike Kravetz authored
      The page migration code employs try_to_unmap() to try and unmap the source
      page.  This is accomplished by using rmap_walk to find all vmas where the
      page is mapped.  This search stops when page mapcount is zero.  For shared
      PMD huge pages, the page map count is always 1 no matter the number of
      mappings.  Shared mappings are tracked via the reference count of the PMD
      page.  Therefore, try_to_unmap stops prematurely and does not completely
      unmap all mappings of the source page.
      
      This problem can result is data corruption as writes to the original
      source page can happen after contents of the page are copied to the target
      page.  Hence, data is lost.
      
      This problem was originally seen as DB corruption of shared global areas
      after a huge page was soft offlined due to ECC memory errors.  DB
      developers noticed they could reproduce the issue by (hotplug) offlining
      memory used to back huge pages.  A simple testcase can reproduce the
      problem by creating a shared PMD mapping (note that this must be at least
      PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
      migrate_pages() to migrate process pages between nodes while continually
      writing to the huge pages being migrated.
      
      To fix, have the try_to_unmap_one routine check for huge PMD sharing by
      calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
      mapping it will be 'unshared' which removes the page table entry and drops
      the reference on the PMD page.  After this, flush caches and TLB.
      
      mmu notifiers are called before locking page tables, but we can not be
      sure of PMD sharing until page tables are locked.  Therefore, check for
      the possibility of PMD sharing before locking so that notifiers can
      prepare for the worst possible case.
      
      Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
      [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
        Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      017b1660
    • Greg Kroah-Hartman's avatar
      Merge tag 'pci-v4.19-fixes-3' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · 5943a9bb
      Greg Kroah-Hartman authored
      Bjorn writes:
        "PCI fixes for v4.19:
      
         - Reprogram bridge prefetch registers to fix NVIDIA and Radeon issues
           after suspend/resume (Daniel Drake)
      
         - Fix mvebu I/O mapping creation sequence (Thomas Petazzoni)
      
         - Fix minor MAINTAINERS file match issue (Bjorn Helgaas)"
      
      * tag 'pci-v4.19-fixes-3' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
        PCI: mvebu: Fix PCI I/O mapping creation sequence
        MAINTAINERS: Remove obsolete drivers/pci pattern from ACPI section
        PCI: Reprogram bridge prefetch registers on resume
      5943a9bb
    • Greg Kroah-Hartman's avatar
      Merge tag 'for-4.19/dm-fixes-2' of... · b98d6cb8
      Greg Kroah-Hartman authored
      Merge tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Mike writes:
        "device mapper fixes
      
         - Fix a DM thinp __udivdi3 undefined on 32-bit bug introduced during
           4.19 merge window.
      
         - Fix leak and dangling pointer in DM multipath's scsi_dh related code.
      
         - A couple stable@ fixes for DM cache's resize support.
      
         - A DM raid fix to remove "const" from decipher_sync_action()'s return
           type."
      
      * tag 'for-4.19/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm cache: fix resize crash if user doesn't reload cache table
        dm cache metadata: ignore hints array being too small during resize
        dm raid: remove bogus const from decipher_sync_action() return type
        dm mpath: fix attached_handler_name leak and dangling hw_handler_name pointer
        dm thin metadata: fix __udivdi3 undefined on 32-bit
      b98d6cb8
    • Greg Kroah-Hartman's avatar
      Merge tag 'gpio-v4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio · 3830711f
      Greg Kroah-Hartman authored
      Linus writes:
        "A single GPIO fix:
         Free the last used descriptor, an off by one error.
         This is tagged for stable as well."
      
      * tag 'gpio-v4.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
        gpiolib: Free the last requested descriptor
      3830711f
    • Greg Kroah-Hartman's avatar
      Merge tag 'pm-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 5aebc7d2
      Greg Kroah-Hartman authored
      Rafael writes:
        "Power management fix for 4.19-rc7
      
         Fix a bug that may cause runtime PM to misbehave for some devices
         after a failing or aborted system suspend which is nasty enough for
         an -rc7 time frame fix."
      
      * tag 'pm-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PM / core: Clear the direct_complete flag on errors
      5aebc7d2
    • Greg Kroah-Hartman's avatar
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 31d09908
      Greg Kroah-Hartman authored
      Ingo writes:
        "perf fixes:
          - fix a CPU#0 hot unplug bug and a PCI enumeration bug in the x86 Intel uncore PMU driver
          - fix a CPU event enumeration bug in the x86 AMD PMU driver
          - fix a perf ring-buffer corruption bug when using tracepoints
          - fix a PMU unregister locking bug"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events
        perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKX
        perf/ring_buffer: Prevent concurent ring buffer access
        perf/x86/intel/uncore: Use boot_cpu_data.phys_proc_id instead of hardcorded physical package ID 0
        perf/core: Fix perf_pmu_unregister() locking
      31d09908
    • Greg Kroah-Hartman's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 247373b5
      Greg Kroah-Hartman authored
      Ingo writes:
        "x86 fixes:
      
         Misc fixes:
      
          - fix various vDSO bugs: asm constraints and retpolines
          - add vDSO test units to make sure they never re-appear
          - fix UV platform TSC initialization bug
          - fix build warning on Clang"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/vdso: Fix vDSO syscall fallback asm constraint regression
        x86/cpu/amd: Remove unnecessary parentheses
        x86/vdso: Only enable vDSO retpolines when enabled and supported
        x86/tsc: Fix UV TSC initialization
        x86/platform/uv: Provide is_early_uv_system()
        selftests/x86: Add clock_gettime() tests to test_vdso
        x86/vdso: Fix asm constraints on vDSO syscall fallbacks
      247373b5
    • Greg Kroah-Hartman's avatar
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 8be67373
      Greg Kroah-Hartman authored
      Ingo writes:
        "scheduler fixes:
      
         These fixes address a rather involved performance regression between
         v4.17->v4.19 in the sched/numa auto-balancing code. Since distros
         really need this fix we accelerated it to sched/urgent for a faster
         upstream merge.
      
         NUMA scheduling and balancing performance is now largely back to
         v4.17 levels, without reintroducing the NUMA placement bugs that
         v4.18 and v4.19 fixed.
      
         Many thanks to Srikar Dronamraju, Mel Gorman and Jirka Hladky, for
         reporting, testing, re-testing and solving this rather complex set of
         bugs."
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task
        mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration
        sched/numa: Avoid task migration for small NUMA improvement
        mm/migrate: Use spin_trylock() while resetting rate limit
        sched/numa: Limit the conditions where scan period is reset
        sched/numa: Reset scan rate whenever task moves across nodes
        sched/numa: Pass destination CPU as a parameter to migrate_task_rq
        sched/numa: Stop multiple tasks from moving to the CPU at the same time
      8be67373
    • Greg Kroah-Hartman's avatar
      Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1df377db
      Greg Kroah-Hartman authored
      Ingo writes:
        "locking fixes:
      
         A fix in the ww_mutex self-test that produces a scary splat, plus an
         updates to the maintained-filed patters in MAINTAINER."
      
      * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        locking/ww_mutex: Fix runtime warning in the WW mutex selftest
        MAINTAINERS: Remove dead path from LOCKING PRIMITIVES entry
      1df377db
    • Greg Kroah-Hartman's avatar
      Merge tag 'sound-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 8b6b383c
      Greg Kroah-Hartman authored
      Takashi writes:
        "sound fixes for 4.19-rc7
      
         Just two small fixes for HD-audio: one is for a typo in completion
         timeout, and another a fixup for Dell machines as usual"
      
      * tag 'sound-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda/realtek - Cannot adjust speaker's volume on Dell XPS 27 7760
        ALSA: hda: Fix the audio-component completion timeout
      8b6b383c
    • Maxime Chevallier's avatar
      net: mvpp2: Extract the correct ethtype from the skb for tx csum offload · 35f3625c
      Maxime Chevallier authored
      When offloading the L3 and L4 csum computation on TX, we need to extract
      the l3_proto from the ethtype, independently of the presence of a vlan
      tag.
      
      The actual driver uses skb->protocol as-is, resulting in packets with
      the wrong L4 checksum being sent when there's a vlan tag in the packet
      header and checksum offloading is enabled.
      
      This commit makes use of vlan_protocol_get() to get the correct ethtype
      regardless the presence of a vlan tag.
      
      Fixes: 3f518509 ("ethernet: Add new driver for Marvell Armada 375 network unit")
      Signed-off-by: default avatarMaxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      35f3625c
    • Wei Wang's avatar
      ipv6: take rcu lock in rawv6_send_hdrinc() · a688caa3
      Wei Wang authored
      In rawv6_send_hdrinc(), in order to avoid an extra dst_hold(), we
      directly assign the dst to skb and set passed in dst to NULL to avoid
      double free.
      However, in error case, we free skb and then do stats update with the
      dst pointer passed in. This causes use-after-free on the dst.
      Fix it by taking rcu read lock right before dst could get released to
      make sure dst does not get freed until the stats update is done.
      Note: we don't have this issue in ipv4 cause dst is not used for stats
      update in v4.
      
      Syzkaller reported following crash:
      BUG: KASAN: use-after-free in rawv6_send_hdrinc net/ipv6/raw.c:692 [inline]
      BUG: KASAN: use-after-free in rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921
      Read of size 8 at addr ffff8801d95ba730 by task syz-executor0/32088
      
      CPU: 1 PID: 32088 Comm: syz-executor0 Not tainted 4.19.0-rc2+ #93
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
       print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
       rawv6_send_hdrinc net/ipv6/raw.c:692 [inline]
       rawv6_sendmsg+0x4421/0x4630 net/ipv6/raw.c:921
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:631
       ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114
       __sys_sendmsg+0x11d/0x280 net/socket.c:2152
       __do_sys_sendmsg net/socket.c:2161 [inline]
       __se_sys_sendmsg net/socket.c:2159 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x457099
      Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f83756edc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00007f83756ee6d4 RCX: 0000000000457099
      RDX: 0000000000000000 RSI: 0000000020003840 RDI: 0000000000000004
      RBP: 00000000009300a0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000004d4b30 R14: 00000000004c90b1 R15: 0000000000000000
      
      Allocated by task 32088:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
       kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
       kmem_cache_alloc+0x12e/0x730 mm/slab.c:3554
       dst_alloc+0xbb/0x1d0 net/core/dst.c:105
       ip6_dst_alloc+0x35/0xa0 net/ipv6/route.c:353
       ip6_rt_cache_alloc+0x247/0x7b0 net/ipv6/route.c:1186
       ip6_pol_route+0x8f8/0xd90 net/ipv6/route.c:1895
       ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2093
       fib6_rule_lookup+0x277/0x860 net/ipv6/fib6_rules.c:122
       ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2121
       ip6_route_output include/net/ip6_route.h:88 [inline]
       ip6_dst_lookup_tail+0xe27/0x1d60 net/ipv6/ip6_output.c:951
       ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079
       rawv6_sendmsg+0x12d9/0x4630 net/ipv6/raw.c:905
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:621 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:631
       ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114
       __sys_sendmsg+0x11d/0x280 net/socket.c:2152
       __do_sys_sendmsg net/socket.c:2161 [inline]
       __se_sys_sendmsg net/socket.c:2159 [inline]
       __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 5356:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
       kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
       __cache_free mm/slab.c:3498 [inline]
       kmem_cache_free+0x83/0x290 mm/slab.c:3756
       dst_destroy+0x267/0x3c0 net/core/dst.c:141
       dst_destroy_rcu+0x16/0x19 net/core/dst.c:154
       __rcu_reclaim kernel/rcu/rcu.h:236 [inline]
       rcu_do_batch kernel/rcu/tree.c:2576 [inline]
       invoke_rcu_callbacks kernel/rcu/tree.c:2880 [inline]
       __rcu_process_callbacks kernel/rcu/tree.c:2847 [inline]
       rcu_process_callbacks+0xf23/0x2670 kernel/rcu/tree.c:2864
       __do_softirq+0x30b/0xad8 kernel/softirq.c:292
      
      Fixes: 1789a640 ("raw: avoid two atomics in xmit")
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a688caa3
    • David Ahern's avatar
      net: sched: Add policy validation for tc attributes · 8b4c3cdd
      David Ahern authored
      A number of TC attributes are processed without proper validation
      (e.g., length checks). Add a tca policy for all input attributes and use
      when invoking nlmsg_parse.
      
      The 2 Fixes tags below cover the latest additions. The other attributes
      are a string (KIND), nested attribute (OPTIONS which does seem to have
      validation in most cases), for dumps only or a flag.
      
      Fixes: 5bc17018 ("net: sched: introduce multichain support for filters")
      Fixes: d47a6b0e ("net: sched: introduce ingress/egress block index attributes for qdisc")
      Signed-off-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8b4c3cdd
    • Mauricio Faria de Oliveira's avatar
      rtnetlink: fix rtnl_fdb_dump() for ndmsg header · bd961c9b
      Mauricio Faria de Oliveira authored
      Currently, rtnl_fdb_dump() assumes the family header is 'struct ifinfomsg',
      which is not always true -- 'struct ndmsg' is used by iproute2 ('ip neigh').
      
      The problem is, the function bails out early if nlmsg_parse() fails, which
      does occur for iproute2 usage of 'struct ndmsg' because the payload length
      is shorter than the family header alone (as 'struct ifinfomsg' is assumed).
      
      This breaks backward compatibility with userspace -- nothing is sent back.
      
      Some examples with iproute2 and netlink library for go [1]:
      
       1) $ bridge fdb show
          33:33:00:00:00:01 dev ens3 self permanent
          01:00:5e:00:00:01 dev ens3 self permanent
          33:33:ff:15:98:30 dev ens3 self permanent
      
            This one works, as it uses 'struct ifinfomsg'.
      
            fdb_show() @ iproute2/bridge/fdb.c
              """
              .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
              ...
              if (rtnl_dump_request(&rth, RTM_GETNEIGH, [...]
              """
      
       2) $ ip --family bridge neigh
          RTNETLINK answers: Invalid argument
          Dump terminated
      
            This one fails, as it uses 'struct ndmsg'.
      
            do_show_or_flush() @ iproute2/ip/ipneigh.c
              """
              .n.nlmsg_type = RTM_GETNEIGH,
              .n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ndmsg)),
              """
      
       3) $ ./neighlist
          < no output >
      
            This one fails, as it uses 'struct ndmsg'-based.
      
            neighList() @ netlink/neigh_linux.go
              """
              req := h.newNetlinkRequest(unix.RTM_GETNEIGH, [...]
              msg := Ndmsg{
              """
      
      The actual breakage was introduced by commit 0ff50e83 ("net: rtnetlink:
      bail out from rtnl_fdb_dump() on parse error"), because nlmsg_parse() fails
      if the payload length (with the _actual_ family header) is less than the
      family header length alone (which is assumed, in parameter 'hdrlen').
      This is true in the examples above with struct ndmsg, with size and payload
      length shorter than struct ifinfomsg.
      
      However, that commit just intends to fix something under the assumption the
      family header is indeed an 'struct ifinfomsg' - by preventing access to the
      payload as such (via 'ifm' pointer) if the payload length is not sufficient
      to actually contain it.
      
      The assumption was introduced by commit 5e6d2435 ("bridge: netlink dump
      interface at par with brctl"), to support iproute2's 'bridge fdb' command
      (not 'ip neigh') which indeed uses 'struct ifinfomsg', thus is not broken.
      
      So, in order to unbreak the 'struct ndmsg' family headers and still allow
      'struct ifinfomsg' to continue to work, check for the known message sizes
      used with 'struct ndmsg' in iproute2 (with zero or one attribute which is
      not used in this function anyway) then do not parse the data as ifinfomsg.
      
      Same examples with this patch applied (or revert/before the original fix):
      
          $ bridge fdb show
          33:33:00:00:00:01 dev ens3 self permanent
          01:00:5e:00:00:01 dev ens3 self permanent
          33:33:ff:15:98:30 dev ens3 self permanent
      
          $ ip --family bridge neigh
          dev ens3 lladdr 33:33:00:00:00:01 PERMANENT
          dev ens3 lladdr 01:00:5e:00:00:01 PERMANENT
          dev ens3 lladdr 33:33:ff:15:98:30 PERMANENT
      
          $ ./neighlist
          netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0x0, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
          netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x1, 0x0, 0x5e, 0x0, 0x0, 0x1}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
          netlink.Neigh{LinkIndex:2, Family:7, State:128, Type:0, Flags:2, IP:net.IP(nil), HardwareAddr:net.HardwareAddr{0x33, 0x33, 0xff, 0x15, 0x98, 0x30}, LLIPAddr:net.IP(nil), Vlan:0, VNI:0}
      
      Tested on mainline (v4.19-rc6) and net-next (3bd09b05).
      
      References:
      
      [1] netlink library for go (test-case)
          https://github.com/vishvananda/netlink
      
          $ cat ~/go/src/neighlist/main.go
          package main
          import ("fmt"; "syscall"; "github.com/vishvananda/netlink")
          func main() {
              neighs, _ := netlink.NeighList(0, syscall.AF_BRIDGE)
              for _, neigh := range neighs { fmt.Printf("%#v\n", neigh) }
          }
      
          $ export GOPATH=~/go
          $ go get github.com/vishvananda/netlink
          $ go build neighlist
          $ ~/go/src/neighlist/neighlist
      
      Thanks to David Ahern for suggestions to improve this patch.
      
      Fixes: 0ff50e83 ("net: rtnetlink: bail out from rtnl_fdb_dump() on parse error")
      Fixes: 5e6d2435 ("bridge: netlink dump interface at par with brctl")
      Reported-by: default avatarAidan Obley <aobley@pivotal.io>
      Signed-off-by: default avatarMauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bd961c9b