1. 03 Apr, 2019 40 commits
    • Cornelia Huck's avatar
      vfio: ccw: only free cp on final interrupt · 0f273f0c
      Cornelia Huck authored
      commit 50b7f1b7 upstream.
      
      When we get an interrupt for a channel program, it is not
      necessarily the final interrupt; for example, the issuing
      guest may request an intermediate interrupt by specifying
      the program-controlled-interrupt flag on a ccw.
      
      We must not switch the state to idle if the interrupt is not
      yet final; even more importantly, we must not free the translated
      channel program if the interrupt is not yet final, or the host
      can crash during cp rewind.
      
      Fixes: e5f84dba ("vfio: ccw: return I/O results asynchronously")
      Cc: stable@vger.kernel.org # v4.12+
      Reviewed-by: default avatarEric Farman <farman@linux.ibm.com>
      Signed-off-by: default avatarCornelia Huck <cohuck@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0f273f0c
    • Naveen N. Rao's avatar
      powerpc: bpf: Fix generation of load/store DW instructions · 92d4ee2e
      Naveen N. Rao authored
      commit 86be36f6 upstream.
      
      Yauheni Kaliuta pointed out that PTR_TO_STACK store/load verifier test
      was failing on powerpc64 BE, and rightfully indicated that the PPC_LD()
      macro is not masking away the last two bits of the offset per the ISA,
      resulting in the generation of 'lwa' instruction instead of the intended
      'ld' instruction.
      
      Segher also pointed out that we can't simply mask away the last two bits
      as that will result in loading/storing from/to a memory location that
      was not intended.
      
      This patch addresses this by using ldx/stdx if the offset is not
      word-aligned. We load the offset into a temporary register (TMP_REG_2)
      and use that as the index register in a subsequent ldx/stdx. We fix
      PPC_LD() macro to mask off the last two bits, but enhance PPC_BPF_LL()
      and PPC_BPF_STL() to factor in the offset value and generate the proper
      instruction sequence. We also convert all existing users of PPC_LD() and
      PPC_STD() to use these macros. All existing uses of these macros have
      been audited to ensure that TMP_REG_2 can be clobbered.
      
      Fixes: 156d0e29 ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
      Cc: stable@vger.kernel.org # v4.9+
      Reported-by: default avatarYauheni Kaliuta <yauheni.kaliuta@redhat.com>
      Signed-off-by: default avatarNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      92d4ee2e
    • Kohji Okuno's avatar
      ARM: imx6q: cpuidle: fix bug that CPU might not wake up at expected time · 9397f0d9
      Kohji Okuno authored
      commit 91740fc8 upstream.
      
      In the current cpuidle implementation for i.MX6q, the CPU that sets
      'WAIT_UNCLOCKED' and the CPU that returns to 'WAIT_CLOCKED' are always
      the same. While the CPU that sets 'WAIT_UNCLOCKED' is in IDLE state of
      "WAIT", if the other CPU wakes up and enters IDLE state of "WFI"
      istead of "WAIT", this CPU can not wake up at expired time.
       Because, in the case of "WFI", the CPU must be waked up by the local
      timer interrupt. But, while 'WAIT_UNCLOCKED' is set, the local timer
      is stopped, when all CPUs execute "wfi" instruction. As a result, the
      local timer interrupt is not fired.
       In this situation, this CPU will wake up by IRQ different from local
      timer. (e.g. broacast timer)
      
      So, this fix changes CPU to return to 'WAIT_CLOCKED'.
      Signed-off-by: default avatarKohji Okuno <okuno.kohji@jp.panasonic.com>
      Fixes: e5f9dec8 ("ARM: imx6q: support WAIT mode using cpuidle")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarShawn Guo <shawnguo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9397f0d9
    • Filipe Manana's avatar
      Btrfs: fix assertion failure on fsync with NO_HOLES enabled · fd1b2536
      Filipe Manana authored
      commit 0ccc3876 upstream.
      
      Back in commit a89ca6f2 ("Btrfs: fix fsync after truncate when
      no_holes feature is enabled") I added an assertion that is triggered when
      an inline extent is found to assert that the length of the (uncompressed)
      data the extent represents is the same as the i_size of the inode, since
      that is true most of the time I couldn't find or didn't remembered about
      any exception at that time. Later on the assertion was expanded twice to
      deal with a case of a compressed inline extent representing a range that
      matches the sector size followed by an expanding truncate, and another
      case where fallocate can update the i_size of the inode without adding
      or updating existing extents (if the fallocate range falls entirely within
      the first block of the file). These two expansion/fixes of the assertion
      were done by commit 7ed586d0 ("Btrfs: fix assertion on fsync of
      regular file when using no-holes feature") and commit 6399fb5a
      ("Btrfs: fix assertion failure during fsync in no-holes mode").
      These however missed the case where an falloc expands the i_size of an
      inode to exactly the sector size and inline extent exists, for example:
      
       $ mkfs.btrfs -f -O no-holes /dev/sdc
       $ mount /dev/sdc /mnt
      
       $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar
       wrote 1096/1096 bytes at offset 0
       1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec)
      
       $ xfs_io -c "falloc 1096 3000" /mnt/foobar
       $ xfs_io -c "fsync" /mnt/foobar
       Segmentation fault
      
       $ dmesg
       [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727
       [701253.602962] ------------[ cut here ]------------
       [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533!
       [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G        W         5.0.0-rc8-btrfs-next-45 #1
       [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs]
       (...)
       [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286
       [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000
       [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868
       [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000
       [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18
       [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80
       [701253.607769] FS:  00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000
       [701253.608163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0
       [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [701253.609608] Call Trace:
       [701253.609994]  btrfs_log_inode+0xdfb/0xe40 [btrfs]
       [701253.610383]  btrfs_log_inode_parent+0x2be/0xa60 [btrfs]
       [701253.610770]  ? do_raw_spin_unlock+0x49/0xc0
       [701253.611150]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
       [701253.611537]  btrfs_sync_file+0x3b2/0x440 [btrfs]
       [701253.612010]  ? do_sysinfo+0xb0/0xf0
       [701253.612552]  do_fsync+0x38/0x60
       [701253.612988]  __x64_sys_fsync+0x10/0x20
       [701253.613360]  do_syscall_64+0x60/0x1b0
       [701253.613733]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [701253.614103] RIP: 0033:0x7f14da4e66d0
       (...)
       [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
       [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0
       [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003
       [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010
       [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260
       [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10
       (...)
       [701253.619941] ---[ end trace e088d74f132b6da5 ]---
      
      Updating the assertion again to allow for this particular case would result
      in a meaningless assertion, plus there is currently no risk of logging
      content that would result in any corruption after a log replay if the size
      of the data encoded in an inline extent is greater than the inode's i_size
      (which is not currently possibe either with or without compression),
      therefore just remove the assertion.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fd1b2536
    • Nikolay Borisov's avatar
      btrfs: Avoid possible qgroup_rsv_size overflow in btrfs_calculate_inode_block_rsv_size · 0ae3b84b
      Nikolay Borisov authored
      commit 139a5617 upstream.
      
      qgroup_rsv_size is calculated as the product of
      outstanding_extent * fs_info->nodesize. The product is calculated with
      32 bit precision since both variables are defined as u32. Yet
      qgroup_rsv_size expects a 64 bit result.
      
      Avoid possible multiplication overflow by casting outstanding_extent to
      u64. Such overflow would in the worst case (64K nodesize) require more
      than 65536 extents, which is quite large and i'ts not likely that it
      would happen in practice.
      
      Fixes-coverity-id: 1435101
      Fixes: ff6bc37e ("btrfs: qgroup: Use independent and accurate per inode qgroup rsv")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0ae3b84b
    • Andrea Righi's avatar
      btrfs: raid56: properly unmap parity page in finish_parity_scrub() · 1cf4ab01
      Andrea Righi authored
      commit 3897b6f0 upstream.
      
      Parity page is incorrectly unmapped in finish_parity_scrub(), triggering
      a reference counter bug on i386, i.e.:
      
       [ 157.662401] kernel BUG at mm/highmem.c:349!
       [ 157.666725] invalid opcode: 0000 [#1] SMP PTI
      
      The reason is that kunmap(p_page) was completely left out, so we never
      did an unmap for the p_page and the loop unmapping the rbio page was
      iterating over the wrong number of stripes: unmapping should be done
      with nr_data instead of rbio->real_stripes.
      
      Test case to reproduce the bug:
      
       - create a raid5 btrfs filesystem:
         # mkfs.btrfs -m raid5 -d raid5 /dev/sdb /dev/sdc /dev/sdd /dev/sde
      
       - mount it:
         # mount /dev/sdb /mnt
      
       - run btrfs scrub in a loop:
         # while :; do btrfs scrub start -BR /mnt; done
      
      BugLink: https://bugs.launchpad.net/bugs/1812845
      Fixes: 5a6ac9ea ("Btrfs, raid56: support parity scrub on raid56")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarAndrea Righi <andrea.righi@canonical.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1cf4ab01
    • David Sterba's avatar
      btrfs: don't report readahead errors and don't update statistics · d952c337
      David Sterba authored
      commit 0cc068e6 upstream.
      
      As readahead is an optimization, all errors are usually filtered out,
      but still properly handled when the real read call is done. The commit
      5e9d3982 ("btrfs: readpages() should submit IO as read-ahead") added
      REQ_RAHEAD to readpages() because that's only used for readahead
      (despite what one would expect from the callback name).
      
      This causes a flood of messages and inflated read error stats, so skip
      reporting in case it's readahead.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202403Reported-by: default avatarLimeTech <tomm@lime-technology.com>
      Fixes: 5e9d3982 ("btrfs: readpages() should submit IO as read-ahead")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d952c337
    • Josef Bacik's avatar
      btrfs: remove WARN_ON in log_dir_items · b57220cc
      Josef Bacik authored
      commit 2cc83342 upstream.
      
      When Filipe added the recursive directory logging stuff in
      2f2ff0ee ("Btrfs: fix metadata inconsistencies after directory
      fsync") he specifically didn't take the directory i_mutex for the
      children directories that we need to log because of lockdep.  This is
      generally fine, but can lead to this WARN_ON() tripping if we happen to
      run delayed deletion's in between our first search and our second search
      of dir_item/dir_indexes for this directory.  We expect this to happen,
      so the WARN_ON() isn't necessary.  Drop the WARN_ON() and add a comment
      so we know why this case can happen.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b57220cc
    • Filipe Manana's avatar
      Btrfs: fix incorrect file size after shrinking truncate and fsync · 22dcb30f
      Filipe Manana authored
      commit bf504110 upstream.
      
      If we do a shrinking truncate against an inode which is already present
      in the respective log tree and then rename it, as part of logging the new
      name we end up logging an inode item that reflects the old size of the
      file (the one which we previously logged) and not the new smaller size.
      The decision to preserve the size previously logged was added by commit
      1a4bcf47 ("Btrfs: fix fsync data loss after adding hard link to
      inode") in order to avoid data loss after replaying the log. However that
      decision is only needed for the case the logged inode size is smaller then
      the current size of the inode, as explained in that commit's change log.
      If the current size of the inode is smaller then the previously logged
      size, we know a shrinking truncate happened and therefore need to use
      that smaller size.
      
      Example to trigger the problem:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo
        $ xfs_io -c "fsync" /mnt/foo
        $ xfs_io -c "truncate 3000" /mnt/foo
      
        $ mv /mnt/foo /mnt/bar
        $ xfs_io -c "fsync" /mnt/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/bar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0008000
      
      Once we rename the file, we log its name (and inode item), and because
      the inode was already logged before in the current transaction, we log it
      with a size of 8000 bytes because that is the size we previously logged
      (with the first fsync). As part of the rename, besides logging the inode,
      we do also sync the log, which is done since commit d4682ba0
      ("Btrfs: sync log after logging new name"), so the next fsync against our
      inode is effectively a no-op, since no new changes happened since the
      rename operation. Even if did not sync the log during the rename
      operation, the same problem (fize size of 8000 bytes instead of 3000
      bytes) would be visible after replaying the log if the log ended up
      getting synced to disk through some other means, such as for example by
      fsyncing some other modified file. In the example above the fsync after
      the rename operation is there just because not every filesystem may
      guarantee logging/journalling the inode (and syncing the log/journal)
      during the rename operation, for example it is needed for f2fs, but not
      for ext4 and xfs.
      
      Fix this scenario by, when logging a new name (which is triggered by
      rename and link operations), using the current size of the inode instead
      of the previously logged inode size.
      
      A test case for fstests follows soon.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695
      CC: stable@vger.kernel.org # 4.4+
      Reported-by: default avatarSeulbae Kim <seulbae@gatech.edu>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      22dcb30f
    • Michael Ellerman's avatar
      powerpc/security: Fix spectre_v2 reporting · a1df5db3
      Michael Ellerman authored
      commit 92edf8df upstream.
      
      When I updated the spectre_v2 reporting to handle software count cache
      flush I got the logic wrong when there's no software count cache
      enabled at all.
      
      The result is that on systems with the software count cache flush
      disabled we print:
      
        Mitigation: Indirect branch cache disabled, Software count cache flush
      
      Which correctly indicates that the count cache is disabled, but
      incorrectly says the software count cache flush is enabled.
      
      The root of the problem is that we are trying to handle all
      combinations of options. But we know now that we only expect to see
      the software count cache flush enabled if the other options are false.
      
      So split the two cases, which simplifies the logic and fixes the bug.
      We were also missing a space before "(hardware accelerated)".
      
      The result is we see one of:
      
        Mitigation: Indirect branch serialisation (kernel only)
        Mitigation: Indirect branch cache disabled
        Mitigation: Software count cache flush
        Mitigation: Software count cache flush (hardware accelerated)
      
      Fixes: ee13cb24 ("powerpc/64s: Add support for software count cache flush")
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: default avatarMichael Neuling <mikey@neuling.org>
      Reviewed-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a1df5db3
    • Christophe Leroy's avatar
      powerpc/fsl: Fix the flush of branch predictor. · 986f0c65
      Christophe Leroy authored
      commit 27da8071 upstream.
      
      The commit identified below adds MC_BTB_FLUSH macro only when
      CONFIG_PPC_FSL_BOOK3E is defined. This results in the following error
      on some configs (seen several times with kisskb randconfig_defconfig)
      
      arch/powerpc/kernel/exceptions-64e.S:576: Error: Unrecognized opcode: `mc_btb_flush'
      make[3]: *** [scripts/Makefile.build:367: arch/powerpc/kernel/exceptions-64e.o] Error 1
      make[2]: *** [scripts/Makefile.build:492: arch/powerpc/kernel] Error 2
      make[1]: *** [Makefile:1043: arch/powerpc] Error 2
      make: *** [Makefile:152: sub-make] Error 2
      
      This patch adds a blank definition of MC_BTB_FLUSH for other cases.
      
      Fixes: 10c5e83a ("powerpc/fsl: Flush the branch predictor at each kernel entry (64bit)")
      Cc: Diana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: default avatarDaniel Axtens <dja@axtens.net>
      Reviewed-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      986f0c65
    • Diana Craciun's avatar
      powerpc/fsl: Fixed warning: orphan section `__btb_flush_fixup' · b848d19c
      Diana Craciun authored
      commit 039daac5 upstream.
      
      Fixed the following build warning:
      powerpc-linux-gnu-ld: warning: orphan section `__btb_flush_fixup' from
      `arch/powerpc/kernel/head_44x.o' being placed in section
      `__btb_flush_fixup'.
      Signed-off-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b848d19c
    • Diana Craciun's avatar
      powerpc/fsl: Update Spectre v2 reporting · 632d8392
      Diana Craciun authored
      commit dfa88658 upstream.
      
      Report branch predictor state flush as a mitigation for
      Spectre variant 2.
      Signed-off-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      632d8392
    • Diana Craciun's avatar
      powerpc/fsl: Enable runtime patching if nospectre_v2 boot arg is used · 43f40620
      Diana Craciun authored
      commit 3bc8ea86 upstream.
      
      If the user choses not to use the mitigations, replace
      the code sequence with nops.
      Signed-off-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      43f40620
    • Diana Craciun's avatar
      powerpc/fsl: Flush branch predictor when entering KVM · a46a5038
      Diana Craciun authored
      commit e7aa61f4 upstream.
      
      Switching from the guest to host is another place
      where the speculative accesses can be exploited.
      Flush the branch predictor when entering KVM.
      Signed-off-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a46a5038
    • Diana Craciun's avatar
      powerpc/fsl: Flush the branch predictor at each kernel entry (32 bit) · 3cb931c7
      Diana Craciun authored
      commit 7fef4362 upstream.
      
      In order to protect against speculation attacks on
      indirect branches, the branch predictor is flushed at
      kernel entry to protect for the following situations:
      - userspace process attacking another userspace process
      - userspace process attacking the kernel
      Basically when the privillege level change (i.e.the kernel
      is entered), the branch predictor state is flushed.
      Signed-off-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3cb931c7
    • Diana Craciun's avatar
      powerpc/fsl: Flush the branch predictor at each kernel entry (64bit) · cf72dad9
      Diana Craciun authored
      commit 10c5e83a upstream.
      
      In order to protect against speculation attacks on
      indirect branches, the branch predictor is flushed at
      kernel entry to protect for the following situations:
      - userspace process attacking another userspace process
      - userspace process attacking the kernel
      Basically when the privillege level change (i.e. the
      kernel is entered), the branch predictor state is flushed.
      Signed-off-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cf72dad9
    • Diana Craciun's avatar
      powerpc/fsl: Add nospectre_v2 command line argument · 020e5f13
      Diana Craciun authored
      commit f633a8ad upstream.
      
      When the command line argument is present, the Spectre variant 2
      mitigations are disabled.
      Signed-off-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      020e5f13
    • Diana Craciun's avatar
      powerpc/fsl: Emulate SPRN_BUCSR register · 4a6a2287
      Diana Craciun authored
      commit 98518c4d upstream.
      
      In order to flush the branch predictor the guest kernel performs
      writes to the BUCSR register which is hypervisor privilleged. However,
      the branch predictor is flushed at each KVM entry, so the branch
      predictor has been already flushed, so just return as soon as possible
      to guest.
      Signed-off-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      [mpe: Tweak comment formatting]
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a6a2287
    • Diana Craciun's avatar
      powerpc/fsl: Add macro to flush the branch predictor · 4944f1d4
      Diana Craciun authored
      commit 1cbf8990 upstream.
      
      The BUCSR register can be used to invalidate the entries in the
      branch prediction mechanisms.
      Signed-off-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4944f1d4
    • Diana Craciun's avatar
      powerpc/fsl: Add infrastructure to fixup branch predictor flush · d67ab3d9
      Diana Craciun authored
      commit 76a5eaa3 upstream.
      
      In order to protect against speculation attacks (Spectre
      variant 2) on NXP PowerPC platforms, the branch predictor
      should be flushed when the privillege level is changed.
      This patch is adding the infrastructure to fixup at runtime
      the code sections that are performing the branch predictor flush
      depending on a boot arg parameter which is added later in a
      separate patch.
      Signed-off-by: default avatarDiana Craciun <diana.craciun@nxp.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d67ab3d9
    • Eric Dumazet's avatar
      tun: add a missing rcu_read_unlock() in error path · e044d21c
      Eric Dumazet authored
      commit 9180bb4f upstream.
      
      In my latest patch I missed one rcu_read_unlock(), in case
      device is down.
      
      Fixes: 4477138f ("tun: properly test for IFF_UP")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e044d21c
    • Dean Nelson's avatar
      thunderx: eliminate extra calls to put_page() for pages held for recycling · 6bdb5fdc
      Dean Nelson authored
      [ Upstream commit cd35ef91 ]
      
      For the non-XDP case, commit 77322538 ("net: thunderx: Optimize
      page recycling for XDP") added code to nicvf_free_rbdr() that, when releasing
      the additional receive buffer page reference held for recycling, repeatedly
      calls put_page() until the page's _refcount goes to zero. Which results in
      the page being freed.
      
      This is not okay if the page's _refcount was greater than 1 (in the non-XDP
      case), because nicvf_free_rbdr() should not be subtracting more than what
      nicvf_alloc_page() had previously added to the page's _refcount, which was
      only 1 (in the non-XDP case).
      
      This can arise if a received packet is still being processed and the receive
      buffer (i.e., skb->head) has not yet been freed via skb_free_head() when
      nicvf_free_rbdr() is spinning through the aforementioned put_page() loop.
      
      If this should occur, when the received packet finishes processing and
      skb_free_head() is called, various problems can ensue. Exactly what, depends on
      whether the page has already been reallocated or not, anything from "BUG: Bad
      page state ... ", to "Unable to handle kernel NULL pointer dereference ..." or
      "Unable to handle kernel paging request...".
      
      So this patch changes nicvf_free_rbdr() to only call put_page() once for pages
      held for recycling (in the non-XDP case).
      
      Fixes: 77322538 ("net: thunderx: Optimize page recycling for XDP")
      Signed-off-by: default avatarDean Nelson <dnelson@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6bdb5fdc
    • Dean Nelson's avatar
      thunderx: enable page recycling for non-XDP case · ac8411d7
      Dean Nelson authored
      [ Upstream commit b3e20806 ]
      
      Commit 77322538 ("net: thunderx: Optimize page recycling for XDP")
      added code to nicvf_alloc_page() that inadvertently disables receive buffer
      page recycling for the non-XDP case by always NULL'ng the page pointer.
      
      This patch corrects two if-conditionals to allow for the recycling of non-XDP
      mode pages by only setting the page pointer to NULL when the page is not ready
      for recycling.
      
      Fixes: 77322538 ("net: thunderx: Optimize page recycling for XDP")
      Signed-off-by: default avatarDean Nelson <dnelson@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ac8411d7
    • John Hurley's avatar
      net: sched: fix cleanup NULL pointer exception in act_mirr · a491de90
      John Hurley authored
      [ Upstream commit 064c5d68 ]
      
      A new mirred action is created by the tcf_mirred_init function. This
      contains a list head struct which is inserted into a global list on
      successful creation of a new action. However, after a creation, it is
      still possible to error out and call the tcf_idr_release function. This,
      in turn, calls the act_mirr cleanup function via __tcf_idr_release and
      __tcf_action_put. This cleanup function tries to delete the list entry
      which is as yet uninitialised, leading to a NULL pointer exception.
      
      Fix this by initialising the list entry on creation of a new action.
      
      Bug report:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 8000000840c73067 P4D 8000000840c73067 PUD 858dcc067 PMD 0
      Oops: 0002 [#1] SMP PTI
      CPU: 32 PID: 5636 Comm: handler194 Tainted: G           OE     5.0.0+ #186
      Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.3.6 06/03/2015
      RIP: 0010:tcf_mirred_release+0x42/0xa7 [act_mirred]
      Code: f0 90 39 c0 e8 52 04 57 c8 48 c7 c7 b8 80 39 c0 e8 94 fa d4 c7 48 8b 93 d0 00 00 00 48 8b 83 d8 00 00 00 48 c7 c7 f0 90 39 c0 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 83 d0 00
      RSP: 0018:ffffac4aa059f688 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff9dcd1b214d00 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff9dcd1fa165f8 RDI: ffffffffc03990f0
      RBP: ffff9dccf9c7af80 R08: 0000000000000a3b R09: 0000000000000000
      R10: ffff9dccfa11f420 R11: 0000000000000000 R12: 0000000000000001
      R13: ffff9dcd16b433c0 R14: ffff9dcd1b214d80 R15: 0000000000000000
      FS:  00007f441bfff700(0000) GS:ffff9dcd1fa00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 0000000839e64004 CR4: 00000000001606e0
      Call Trace:
      tcf_action_cleanup+0x59/0xca
      __tcf_action_put+0x54/0x6b
      __tcf_idr_release.cold.33+0x9/0x12
      tcf_mirred_init.cold.20+0x22e/0x3b0 [act_mirred]
      tcf_action_init_1+0x3d0/0x4c0
      tcf_action_init+0x9c/0x130
      tcf_exts_validate+0xab/0xc0
      fl_change+0x1ca/0x982 [cls_flower]
      tc_new_tfilter+0x647/0x8d0
      ? load_balance+0x14b/0x9e0
      rtnetlink_rcv_msg+0xe3/0x370
      ? __switch_to_asm+0x40/0x70
      ? __switch_to_asm+0x34/0x70
      ? _cond_resched+0x15/0x30
      ? __kmalloc_node_track_caller+0x1d4/0x2b0
      ? rtnl_calcit.isra.31+0xf0/0xf0
      netlink_rcv_skb+0x49/0x110
      netlink_unicast+0x16f/0x210
      netlink_sendmsg+0x1df/0x390
      sock_sendmsg+0x36/0x40
      ___sys_sendmsg+0x27b/0x2c0
      ? futex_wake+0x80/0x140
      ? do_futex+0x2b9/0xac0
      ? ep_scan_ready_list.constprop.22+0x1f2/0x210
      ? ep_poll+0x7a/0x430
      __sys_sendmsg+0x47/0x80
      do_syscall_64+0x55/0x100
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 4e232818 ("net: sched: act_mirred: remove dependency on rtnl lock")
      Signed-off-by: default avatarJohn Hurley <john.hurley@netronome.com>
      Reviewed-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a491de90
    • Herbert Xu's avatar
      ila: Fix rhashtable walker list corruption · 7254ad09
      Herbert Xu authored
      [ Upstream commit b5f9bd15 ]
      
      ila_xlat_nl_cmd_flush uses rhashtable walkers allocated from the
      stack but it never frees them.  This corrupts the walker list of
      the hash table.
      
      This patch fixes it.
      
      Reported-by: syzbot+dae72a112334aa65a159@syzkaller.appspotmail.com
      Fixes: b6e71bde ("ila: Flush netlink command to clear xlat...")
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7254ad09
    • Zhiqiang Liu's avatar
      vxlan: Don't call gro_cells_destroy() before device is unregistered · 979f8a67
      Zhiqiang Liu authored
      [ Upstream commit cc4807bb ]
      
      Commit ad6c9986 ("vxlan: Fix GRO cells race condition between
      receive and link delete") fixed a race condition for the typical case a vxlan
      device is dismantled from the current netns. But if a netns is dismantled,
      vxlan_destroy_tunnels() is called to schedule a unregister_netdevice_queue()
      of all the vxlan tunnels that are related to this netns.
      
      In vxlan_destroy_tunnels(), gro_cells_destroy() is called and finished before
      unregister_netdevice_queue(). This means that the gro_cells_destroy() call is
      done too soon, for the same reasons explained in above commit.
      
      So we need to fully respect the RCU rules, and thus must remove the
      gro_cells_destroy() call or risk use after-free.
      
      Fixes: 58ce31cc ("vxlan: GRO support at tunnel layer")
      Signed-off-by: default avatarSuanming.Mou <mousuanming@huawei.com>
      Suggested-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: default avatarZhiqiang Liu <liuzhiqiang26@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      979f8a67
    • Sabrina Dubroca's avatar
      vrf: prevent adding upper devices · 3b1386be
      Sabrina Dubroca authored
      [ Upstream commit 1017e098 ]
      
      VRF devices don't work with upper devices. Currently, it's possible to
      add a VRF device to a bridge or team, and to create macvlan, macsec, or
      ipvlan devices on top of a VRF (bond and vlan are prevented respectively
      by the lack of an ndo_set_mac_address op and the NETIF_F_VLAN_CHALLENGED
      feature flag).
      
      Fix this by setting the IFF_NO_RX_HANDLER flag (introduced in commit
      f5426250 ("net: introduce IFF_NO_RX_HANDLER")).
      
      Cc: David Ahern <dsahern@gmail.com>
      Fixes: 193125db ("net: Introduce VRF device driver")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Acked-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b1386be
    • Eric Dumazet's avatar
      tun: properly test for IFF_UP · 8ea78da1
      Eric Dumazet authored
      [ Upstream commit 4477138f ]
      
      Same reasons than the ones explained in commit 4179cb5a
      ("vxlan: test dev->flags & IFF_UP before calling netif_rx()")
      
      netif_rx_ni() or napi_gro_frags() must be called under a strict contract.
      
      At device dismantle phase, core networking clears IFF_UP
      and flush_all_backlogs() is called after rcu grace period
      to make sure no incoming packet might be in a cpu backlog
      and still referencing the device.
      
      A similar protocol is used for gro layer.
      
      Most drivers call netif_rx() from their interrupt handler,
      and since the interrupts are disabled at device dismantle,
      netif_rx() does not have to check dev->flags & IFF_UP
      
      Virtual drivers do not have this guarantee, and must
      therefore make the check themselves.
      
      Fixes: 1bd4978a ("tun: honor IFF_UP in tun_get_user()")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8ea78da1
    • Erik Hugne's avatar
      tipc: fix cancellation of topology subscriptions · 52a7505c
      Erik Hugne authored
      [ Upstream commit 33872d79 ]
      
      When cancelling a subscription, we have to clear the cancel bit in the
      request before iterating over any established subscriptions with memcmp.
      Otherwise no subscription will ever be found, and it will not be
      possible to explicitly unsubscribe individual subscriptions.
      
      Fixes: 8985ecc7 ("tipc: simplify endianness handling in topology subscriber")
      Signed-off-by: default avatarErik Hugne <erik.hugne@gmail.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      52a7505c
    • Xin Long's avatar
      tipc: change to check tipc_own_id to return in tipc_net_stop · 1be6c0c7
      Xin Long authored
      [ Upstream commit 9926cb5f ]
      
      When running a syz script, a panic occurred:
      
      [  156.088228] BUG: KASAN: use-after-free in tipc_disc_timeout+0x9c9/0xb20 [tipc]
      [  156.094315] Call Trace:
      [  156.094844]  <IRQ>
      [  156.095306]  dump_stack+0x7c/0xc0
      [  156.097346]  print_address_description+0x65/0x22e
      [  156.100445]  kasan_report.cold.3+0x37/0x7a
      [  156.102402]  tipc_disc_timeout+0x9c9/0xb20 [tipc]
      [  156.106517]  call_timer_fn+0x19a/0x610
      [  156.112749]  run_timer_softirq+0xb51/0x1090
      
      It was caused by the netns freed without deleting the discoverer timer,
      while later on the netns would be accessed in the timer handler.
      
      The timer should have been deleted by tipc_net_stop() when cleaning up a
      netns. However, tipc has been able to enable a bearer and start d->timer
      without the local node_addr set since Commit 52dfae5c ("tipc: obtain
      node identity from interface by default"), which caused the timer not to
      be deleted in tipc_net_stop() then.
      
      So fix it in tipc_net_stop() by changing to check local node_id instead
      of local node_addr, as Jon suggested.
      
      While at it, remove the calling of tipc_nametbl_withdraw() there, since
      tipc_nametbl_stop() will take of the nametbl's freeing after.
      
      Fixes: 52dfae5c ("tipc: obtain node identity from interface by default")
      Reported-by: syzbot+a25307ad099309f1c2b9@syzkaller.appspotmail.com
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Acked-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1be6c0c7
    • Erik Hugne's avatar
      tipc: allow service ranges to be connect()'ed on RDM/DGRAM · 24d1a625
      Erik Hugne authored
      [ Upstream commit ea239314 ]
      
      We move the check that prevents connecting service ranges to after
      the RDM/DGRAM check, and move address sanity control to a separate
      function that also validates the service range.
      
      Fixes: 23998835 ("tipc: improve address sanity check in tipc_connect()")
      Signed-off-by: default avatarErik Hugne <erik.hugne@gmail.com>
      Signed-off-by: default avatarJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      24d1a625
    • Eric Dumazet's avatar
      tcp: do not use ipv6 header for ipv4 flow · 7115df61
      Eric Dumazet authored
      [ Upstream commit 89e41309 ]
      
      When a dual stack tcp listener accepts an ipv4 flow,
      it should not attempt to use an ipv6 header or tcp_v6_iif() helper.
      
      Fixes: 1397ed35 ("ipv6: add flowinfo for tcp6 pkt_options for all cases")
      Fixes: df3687ff ("ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7115df61
    • Xin Long's avatar
      sctp: use memdup_user instead of vmemdup_user · cab576f1
      Xin Long authored
      [ Upstream commit ef82bcfa ]
      
      In sctp_setsockopt_bindx()/__sctp_setsockopt_connectx(), it allocates
      memory with addrs_size which is passed from userspace. We used flag
      GFP_USER to put some more restrictions on it in Commit cacc0621
      ("sctp: use GFP_USER for user-controlled kmalloc").
      
      However, since Commit c981f254 ("sctp: use vmemdup_user() rather
      than badly open-coding memdup_user()"), vmemdup_user() has been used,
      which doesn't check GFP_USER flag when goes to vmalloc_*(). So when
      addrs_size is a huge value, it could exhaust memory and even trigger
      oom killer.
      
      This patch is to use memdup_user() instead, in which GFP_USER would
      work to limit the memory allocation with a huge addrs_size.
      
      Note we can't fix it by limiting 'addrs_size', as there's no demand
      for it from RFC.
      
      Reported-by: syzbot+ec1b7575afef85a0e5ca@syzkaller.appspotmail.com
      Fixes: c981f254 ("sctp: use vmemdup_user() rather than badly open-coding memdup_user()")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cab576f1
    • Xin Long's avatar
      sctp: get sctphdr by offset in sctp_compute_cksum · 97265479
      Xin Long authored
      [ Upstream commit 273160ff ]
      
      sctp_hdr(skb) only works when skb->transport_header is set properly.
      
      But in Netfilter, skb->transport_header for ipv6 is not guaranteed
      to be right value for sctphdr. It would cause to fail to check the
      checksum for sctp packets.
      
      So fix it by using offset, which is always right in all places.
      
      v1->v2:
        - Fix the changelog.
      
      Fixes: e6d8b64b ("net: sctp: fix and consolidate SCTP checksumming code")
      Reported-by: default avatarLi Shuang <shuali@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      97265479
    • Herbert Xu's avatar
      rhashtable: Still do rehash when we get EEXIST · cf86f7a9
      Herbert Xu authored
      [ Upstream commit 408f13ef ]
      
      As it stands if a shrink is delayed because of an outstanding
      rehash, we will go into a rescheduling loop without ever doing
      the rehash.
      
      This patch fixes this by still carrying out the rehash and then
      rescheduling so that we can shrink after the completion of the
      rehash should it still be necessary.
      
      The return value of EEXIST captures this case and other cases
      (e.g., another thread expanded/rehashed the table at the same
      time) where we should still proceed with the rehash.
      
      Fixes: da20420f ("rhashtable: Add nested tables")
      Reported-by: default avatarJosh Elsasser <jelsasser@appneta.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Tested-by: default avatarJosh Elsasser <jelsasser@appneta.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cf86f7a9
    • Maxime Chevallier's avatar
      packets: Always register packet sk in the same order · 69cea7cf
      Maxime Chevallier authored
      [ Upstream commit a4dc6a49 ]
      
      When using fanouts with AF_PACKET, the demux functions such as
      fanout_demux_cpu will return an index in the fanout socket array, which
      corresponds to the selected socket.
      
      The ordering of this array depends on the order the sockets were added
      to a given fanout group, so for FANOUT_CPU this means sockets are bound
      to cpus in the order they are configured, which is OK.
      
      However, when stopping then restarting the interface these sockets are
      bound to, the sockets are reassigned to the fanout group in the reverse
      order, due to the fact that they were inserted at the head of the
      interface's AF_PACKET socket list.
      
      This means that traffic that was directed to the first socket in the
      fanout group is now directed to the last one after an interface restart.
      
      In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
      socket that used to receive traffic from the last CPU after an interface
      restart.
      
      This commit introduces a helper to add a socket at the tail of a list,
      then uses it to register AF_PACKET sockets.
      
      Note that this changes the order in which sockets are listed in /proc and
      with sock_diag.
      
      Fixes: dc99f600 ("packet: Add fanout support")
      Signed-off-by: default avatarMaxime Chevallier <maxime.chevallier@bootlin.com>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69cea7cf
    • YueHaibing's avatar
      net-sysfs: call dev_hold if kobject_init_and_add success · d9d215be
      YueHaibing authored
      [ Upstream commit a3e23f71 ]
      
      In netdev_queue_add_kobject and rx_queue_add_kobject,
      if sysfs_create_group failed, kobject_put will call
      netdev_queue_release to decrease dev refcont, however
      dev_hold has not be called. So we will see this while
      unregistering dev:
      
      unregister_netdevice: waiting for bcsh0 to become free. Usage count = -1
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Fixes: d0d66837 ("net: don't decrement kobj reference count on init failure")
      Signed-off-by: default avatarYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9d215be
    • Aaro Koskinen's avatar
      net: stmmac: fix memory corruption with large MTUs · 8dcf078d
      Aaro Koskinen authored
      [ Upstream commit 223a960c ]
      
      When using 16K DMA buffers and ring mode, the DES3 refill is not working
      correctly as the function is using a bogus pointer for checking the
      private data. As a result stale pointers will remain in the RX descriptor
      ring, so DMA will now likely overwrite/corrupt some already freed memory.
      
      As simple reproducer, just receive some UDP traffic:
      
      	# ifconfig eth0 down; ifconfig eth0 mtu 9000; ifconfig eth0 up
      	# iperf3 -c 192.168.253.40 -u -b 0 -R
      
      If you didn't crash by now check the RX descriptors to find non-contiguous
      RX buffers:
      
      	cat /sys/kernel/debug/stmmaceth/eth0/descriptors_status
      	[...]
      	1 [0x2be5020]: 0xa3220321 0x9ffc1ffc 0x72d70082 0x130e207e
      					     ^^^^^^^^^^^^^^^^^^^^^
      	2 [0x2be5040]: 0xa3220321 0x9ffc1ffc 0x72998082 0x1311a07e
      					     ^^^^^^^^^^^^^^^^^^^^^
      
      A simple ping test will now report bad data:
      
      	# ping -s 8200 192.168.253.40
      	PING 192.168.253.40 (192.168.253.40) 8200(8228) bytes of data.
      	8208 bytes from 192.168.253.40: icmp_seq=1 ttl=64 time=1.00 ms
      	wrong data byte #8144 should be 0xd0 but was 0x88
      
      Fix the wrong pointer. Also we must refill DES3 only if the DMA buffer
      size is 16K.
      
      Fixes: 54139cf3 ("net: stmmac: adding multiple buffers for rx")
      Signed-off-by: default avatarAaro Koskinen <aaro.koskinen@nokia.com>
      Acked-by: default avatarJose Abreu <joabreu@synopsys.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8dcf078d
    • Eric Dumazet's avatar
      net: rose: fix a possible stack overflow · 7eeb12ed
      Eric Dumazet authored
      [ Upstream commit e5dcc0c3 ]
      
      rose_write_internal() uses a temp buffer of 100 bytes, but a manual
      inspection showed that given arbitrary input, rose_create_facilities()
      can fill up to 110 bytes.
      
      Lets use a tailroom of 256 bytes for peace of mind, and remove
      the bounce buffer : we can simply allocate a big enough skb
      and adjust its length as needed.
      
      syzbot report :
      
      BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:352 [inline]
      BUG: KASAN: stack-out-of-bounds in rose_create_facilities net/rose/rose_subr.c:521 [inline]
      BUG: KASAN: stack-out-of-bounds in rose_write_internal+0x597/0x15d0 net/rose/rose_subr.c:116
      Write of size 7 at addr ffff88808b1ffbef by task syz-executor.0/24854
      
      CPU: 0 PID: 24854 Comm: syz-executor.0 Not tainted 5.0.0+ #97
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x123/0x190 mm/kasan/generic.c:191
       memcpy+0x38/0x50 mm/kasan/common.c:131
       memcpy include/linux/string.h:352 [inline]
       rose_create_facilities net/rose/rose_subr.c:521 [inline]
       rose_write_internal+0x597/0x15d0 net/rose/rose_subr.c:116
       rose_connect+0x7cb/0x1510 net/rose/af_rose.c:826
       __sys_connect+0x266/0x330 net/socket.c:1685
       __do_sys_connect net/socket.c:1696 [inline]
       __se_sys_connect net/socket.c:1693 [inline]
       __x64_sys_connect+0x73/0xb0 net/socket.c:1693
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458079
      Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f47b8d9dc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458079
      RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000004
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f47b8d9e6d4
      R13: 00000000004be4a4 R14: 00000000004ceca8 R15: 00000000ffffffff
      
      The buggy address belongs to the page:
      page:ffffea00022c7fc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x1fffc0000000000()
      raw: 01fffc0000000000 0000000000000000 ffffffff022c0101 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88808b1ffa80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff88808b1ffb00: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00 03
      >ffff88808b1ffb80: f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 04 f3
                                                                   ^
       ffff88808b1ffc00: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
       ffff88808b1ffc80: 00 00 00 00 00 00 00 f1 f1 f1 f1 f1 f1 01 f2 01
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7eeb12ed