1. 30 Aug, 2017 40 commits
    • mm/madvise.c: fix freeing of locked page with MADV_FREE · 4823f463
      Eric Biggers authored
      commit 263630e8 upstream.
      
      If madvise(..., MADV_FREE) split a transparent hugepage, it called
      put_page() before unlock_page().
      
      This was wrong because put_page() can free the page, e.g. if a
      concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
      mapping. put_page() then rightfully complained about freeing a locked
      page.
      
      Fix this by moving the unlock_page() before put_page().
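
      The ordering problem can be illustrated with a small userspace model.
      This is only a sketch: struct toy_page and the toy_* helpers are
      hypothetical stand-ins, not kernel APIs, and the model tracks nothing
      but a lock bit and a reference count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page: only a lock bit and a reference count. */
struct toy_page {
    bool locked;
    int refcount;
    bool freed_while_locked;   /* models the "Bad page state" report */
};

static void toy_unlock_page(struct toy_page *p)
{
    p->locked = false;
}

/* Drop one reference; dropping the last one "frees" the page. */
static void toy_put_page(struct toy_page *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0 && p->locked)
        p->freed_while_locked = true;
}

/* Old order: the final put may happen while the page is still locked. */
static bool release_put_then_unlock(struct toy_page *p)
{
    toy_put_page(p);
    toy_unlock_page(p);   /* in a real kernel this could touch a freed page */
    return p->freed_while_locked;
}

/* Fixed order: unlock first, then drop the reference. */
static bool release_unlock_then_put(struct toy_page *p)
{
    toy_unlock_page(p);
    toy_put_page(p);
    return p->freed_while_locked;
}
```

      In this model the old order only misbehaves when the caller holds the
      last reference, which matches the race with MADV_DONTNEED described
      above.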
      
      This bug was found by syzkaller, which encountered the following splat:
      
          BUG: Bad page state in process syzkaller412798  pfn:1bd800
          page:ffffea0006f60000 count:0 mapcount:0 mapping:          (null) index:0x20a00
          flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
          raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
          raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
          page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
          bad because of flags: 0x1(locked)
          Modules linked in:
          CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
          Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
          Call Trace:
           __dump_stack lib/dump_stack.c:16 [inline]
           dump_stack+0x194/0x257 lib/dump_stack.c:52
           bad_page+0x230/0x2b0 mm/page_alloc.c:565
           free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
           free_pages_check mm/page_alloc.c:952 [inline]
           free_pages_prepare mm/page_alloc.c:1043 [inline]
           free_pcp_prepare mm/page_alloc.c:1068 [inline]
           free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
           __put_single_page mm/swap.c:79 [inline]
           __put_page+0xfb/0x160 mm/swap.c:113
           put_page include/linux/mm.h:814 [inline]
           madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
           walk_pmd_range mm/pagewalk.c:50 [inline]
           walk_pud_range mm/pagewalk.c:108 [inline]
           walk_p4d_range mm/pagewalk.c:134 [inline]
           walk_pgd_range mm/pagewalk.c:160 [inline]
           __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
           walk_page_range+0x200/0x470 mm/pagewalk.c:326
           madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
           madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
           madvise_dontneed_free mm/madvise.c:555 [inline]
           madvise_vma mm/madvise.c:664 [inline]
           SYSC_madvise mm/madvise.c:832 [inline]
           SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
           entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Here is a C reproducer:
      
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <sys/mman.h>
          #include <unistd.h>
      
          #define MADV_FREE	8
          #define PAGE_SIZE	4096
      
          static void *mapping;
          static const size_t mapping_size = 0x1000000;
      
          static void *madvise_thrproc(void *arg)
          {
              madvise(mapping, mapping_size, (long)arg);
              return NULL;
          }
      
          int main(void)
          {
              pthread_t t[2];
      
              for (;;) {
                  mapping = mmap(NULL, mapping_size, PROT_WRITE,
                                 MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      
                  munmap(mapping + mapping_size / 2, PAGE_SIZE);
      
                  pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED);
                  pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE);
                  pthread_join(t[0], NULL);
                  pthread_join(t[1], NULL);
                  munmap(mapping, mapping_size);
              }
          }
      
      Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
      CONFIG_DEBUG_VM=y are needed.
      
      Google Bug Id: 64696096
      
      Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
      Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • i2c: designware: Fix system suspend · c237efed
      Ulf Hansson authored
      commit a23318fe upstream.
      
      Commit 8503ff16 ("i2c: designware: Avoid unnecessary resuming
      during system suspend") may suggest to the PM core that it try the
      so-called direct_complete path for system sleep. In this path, the PM
      core treats a runtime suspended device as if it is already in a proper
      low power state for system sleep, so it skips the system sleep
      callbacks for the device, except for the ->prepare() and ->complete()
      callbacks.

      However, the PM core may unset the direct_complete flag for a parent
      device if one of its child devices was system suspended before it. In
      this scenario, the PM core invokes the system sleep callbacks
      regardless of whether the device is runtime suspended.

      In particular, when an i2c slave device exists, the above path is
      triggered, which breaks the assumption that the i2c device is always
      runtime resumed whenever dw_i2c_plat_suspend() is called.

      More precisely, dw_i2c_plat_suspend() calls clk_core_disable() and
      clk_core_unprepare() for an already disabled/unprepared clock, leading
      to a splat in the log about unbalanced clock calls and breaking system
      sleep.
      
      To still allow the direct_complete path in cases when it's possible, but
      also to keep the fix simple, let's runtime resume the i2c device in the
      ->suspend() callback, before continuing to put the device into low power
      state.
      
      Note, in cases when the i2c device is attached to the ACPI PM domain, this
      problem doesn't occur, because ACPI's ->suspend() callback, assigned to
      acpi_subsys_suspend(), already calls pm_runtime_resume() for the device.
      
      It should also be noted that this change does not fix commit 8503ff16
      ("i2c: designware: Avoid unnecessary resuming during system suspend"),
      because for the non-ACPI case the system sleep support was already
      broken prior to that point.

      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Tested-by: John Stultz <john.stultz@linaro.org>
      Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
      Acked-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
      Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dax: fix deadlock due to misaligned PMD faults · 3a9495fd
      Ross Zwisler authored
      commit fffa281b upstream.
      
      In DAX there are two separate places where the 2MiB range of a PMD is
      defined.
      
      The first is in the page tables, where a PMD mapping inserted for a
      given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
      PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary below the
      address to the 2MiB boundary above the address.
      
      So, for example, a fault at address 3MiB (0x30 0000) falls within the
      PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
      
      The second PMD range is in the mapping->page_tree, where a given file
      offset is covered by a radix tree entry that spans from one 2MiB aligned
      file offset to another 2MiB aligned file offset.
      
      So, for example, the file offset for 3MiB (pgoff 768) falls within the
      PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
      512) to 4MiB (pgoff 1024).
      
      This system works so long as the addresses and file offsets for a given
      mapping both have the same offsets relative to the start of each PMD.
      
      Consider the case where the starting address for a given file isn't 2MiB
      aligned - say our faulting address is 3 MiB (0x30 0000), but that
      corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
      the mapping are misaligned so that the 2MiB range defined in the page
      tables never matches up with the 2MiB range defined in the radix tree.
      
      The current code notices this case for DAX faults to storage with the
      following test in dax_pmd_insert_mapping():
      
      	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
      		goto unlock_fallback;
      
      This test makes sure that the pfn we get from the driver is 2MiB
      aligned, and relies on the assumption that the 2MiB alignment of the pfn
      we get back from the driver matches the 2MiB alignment of the faulting
      address.
      
      However, faults to holes were not checked and we could hit the problem
      described above.
      
      This was reported in response to the NVML nvml/src/test/pmempool_sync
      TEST5:
      
      	$ cd nvml/src/test/pmempool_sync
      	$ make TEST5
      
      You can grab NVML here:
      
      	https://github.com/pmem/nvml/
      
      The dmesg warning you see when you hit this error is:
      
        WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
      
      Where we notice in dax_insert_mapping_entry() that the radix tree entry
      we are about to replace doesn't match the locked entry that we had
      previously inserted into the tree.  This happens because the initial
      insertion was done in grab_mapping_entry() using a pgoff calculated from
      the faulting address (vmf->address), and the replacement in
      dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
      vmf->pgoff.
      
      In our failure case those two page offsets (one calculated from
      vmf->address, one using vmf->pgoff) point to different order 9 radix
      tree entries.
      
      This failure case can result in a deadlock because the radix tree unlock
      also happens on the pgoff calculated from vmf->address.  This means that
      the locked radix tree entry that we swapped in to the tree in
      dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
      future faults to that 2MiB range will block forever.
      
      Fix this by validating that the faulting address's PMD offset matches
      the PMD offset from the start of the file.  This check is done at the
      very beginning of the fault and covers faults that would have mapped to
      storage as well as faults to holes.  I left the COLOUR check in
      dax_pmd_insert_mapping() in place in case we ever hit the insanity
      condition where the alignment of the pfn we get from the driver doesn't
      match the alignment of the userspace address.
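
      The alignment check described above can be sketched in userspace C.
      This is a minimal model, assuming 4KiB pages and 2MiB PMDs; the TOY_*
      constants and pmd_offsets_match() are hypothetical names, not the
      kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* 4KiB base pages (assumed) */
#define TOY_PMD_SHIFT  21   /* 2MiB PMD mappings (assumed) */
/* Mask selecting a page's index within its 2MiB PMD (511 here). */
#define TOY_PG_PMD_COLOUR \
    ((UINT64_C(1) << (TOY_PMD_SHIFT - TOY_PAGE_SHIFT)) - 1)

/*
 * True if the faulting address and the file offset sit at the same page
 * index within their respective 2MiB ranges, so the page-table PMD and
 * the radix tree entry cover the same pages.
 */
static bool pmd_offsets_match(uint64_t address, uint64_t pgoff)
{
    uint64_t addr_colour = (address >> TOY_PAGE_SHIFT) & TOY_PG_PMD_COLOUR;
    uint64_t file_colour = pgoff & TOY_PG_PMD_COLOUR;

    return addr_colour == file_colour;
}

/* The two cases from the text: a fault at 3MiB with pgoff 768 and pgoff 0. */
static void pmd_examples(void)
{
    assert(pmd_offsets_match(0x300000, 768));   /* aligned: PMD is usable */
    assert(!pmd_offsets_match(0x300000, 0));    /* misaligned: fall back */
}
```

      A fault handler built on such a check would fall back to smaller pages
      whenever it returns false, before touching the radix tree at all.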
      
      Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled · 735a252f
      Kirill A. Shutemov authored
      commit 435c0b87 upstream.
      
      /sys/kernel/mm/transparent_hugepage/shmem_enabled controls whether we
      allocate huge pages when allocating pages for the private in-kernel
      shmem mount.

      Unfortunately, as Dan noticed, I've screwed it up and the only way to
      make the kernel allocate huge pages for the mount is to use "force"
      there.  All other values are effectively ignored.
      
      Link: http://lkml.kernel.org/r/20170822144254.66431-1-kirill.shutemov@linux.intel.com
      Fixes: 5a6e75f8 ("shmem: prepare huge= mount option and sysfs knob")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • PM/hibernate: touch NMI watchdog when creating snapshot · b2719637
      Chen Yu authored
      commit 556b969a upstream.
      
      Counting the pages for creating the hibernation snapshot can take a
      significant amount of time, especially on systems with large memory.
      Since the counting is performed with IRQs disabled, this might lead to
      an NMI lockup.  The following warning was found on a system with 1.5TB
      DRAM:
      
        Freezing user space processes ... (elapsed 0.002 seconds) done.
        OOM killer disabled.
        PM: Preallocating image memory...
        NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
        CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
        task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
        RIP: 0010:memory_bm_find_bit+0xf4/0x100
        Call Trace:
         swsusp_set_page_free+0x2b/0x30
         mark_free_pages+0x147/0x1c0
         count_data_pages+0x41/0xa0
         hibernate_preallocate_memory+0x80/0x450
         hibernation_snapshot+0x58/0x410
         hibernate+0x17c/0x310
         state_store+0xdf/0xf0
         kobj_attr_store+0xf/0x20
         sysfs_kf_write+0x37/0x40
         kernfs_fop_write+0x11c/0x1a0
         __vfs_write+0x37/0x170
         vfs_write+0xb1/0x1a0
         SyS_write+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        ...
        done (allocated 6590003 pages)
        PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)
      
      It took nearly 20 seconds (on a 2.10GHz CPU), so the NMI lockup was
      triggered.  If the NMI watchdog timeout is set to 1 second, a safe
      interval should be 6590003/20 = 320k pages in theory.  However, some
      platforms may run at a lower frequency, so feed the watchdog every
      100k pages.
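
      The periodic watchdog feed can be sketched as follows. This is a
      userspace model only: toy_touch_nmi_watchdog() is a hypothetical
      counter standing in for the real watchdog call, and the 128k interval
      is the power-of-two value mentioned in the follow-up note below, which
      lets a mask replace the modulus.

```c
#include <assert.h>

/* Power-of-two interval, so "count & (interval - 1)" replaces a modulus. */
#define TOY_WD_PAGE_COUNT (128UL * 1024)

static unsigned long toy_watchdog_touches;   /* stand-in for the watchdog */

static void toy_touch_nmi_watchdog(void)
{
    toy_watchdog_touches++;
}

/*
 * Walk nr_pages pages and feed the watchdog once every TOY_WD_PAGE_COUNT
 * pages.  Returns how many times the watchdog was fed.
 */
static unsigned long walk_pages(unsigned long nr_pages)
{
    unsigned long page_count = 0;

    assert((TOY_WD_PAGE_COUNT & (TOY_WD_PAGE_COUNT - 1)) == 0);
    toy_watchdog_touches = 0;
    for (unsigned long pfn = 0; pfn < nr_pages; pfn++) {
        /* per-page bookkeeping would go here */
        if (!(++page_count & (TOY_WD_PAGE_COUNT - 1)))
            toy_touch_nmi_watchdog();
    }
    return toy_watchdog_touches;
}
```

      For the 6590003 pages from the log above, this model feeds the
      watchdog 50 times over the whole walk.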
      
      [yu.c.chen@intel.com: simplification]
        Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com
      [yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus]
      Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.com
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Reported-by: Jan Filipcewicz <jan.filipcewicz@intel.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARCv2: PAE40: set MSB even if !CONFIG_ARC_HAS_PAE40 but PAE exists in SoC · 8b366972
      Vineet Gupta authored
      commit b5ddb6d5 upstream.
      
      PAE40 configuration in hardware extends some of the address registers
      for TLB/cache ops to 2 words.

      So far the kernel was NOT setting the higher word if the feature was
      not enabled in software, which is wrong. Those need to be set to 0 in
      such a case.

      Normally this would be done in the cache flush / TLB ops. However,
      since these registers only exist conditionally, this would have to be
      conditional on a flag set at boot, which is expensive/ugly, especially
      for the more common case where PAE exists but is not in use. Optimize
      that by zeroing them once at boot; nobody will write to them
      afterwards.

      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARCv2: PAE40: Explicitly set MSB counterpart of SLC region ops addresses · fcedf2f2
      Alexey Brodkin authored
      commit 7d79cee2 upstream.
      
      It is necessary to explicitly set both SLC_AUX_RGN_START1 and
      SLC_AUX_RGN_END1, which hold the MSB bits of the physical address of
      the region start and end respectively. Otherwise the SLC region
      operation is executed in an unpredictable manner.

      Without this patch, SLC flushes on HSDK (IOC disabled) were taking
      seconds.

      Reported-by: Vladimir Kondratiev <vladimir.kondratiev@intel.com>
      Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      [vgupta: PAR40 regs only written if PAE40 exist]
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARCv2: SLC: Make sure busy bit is set properly for region ops · 763ad317
      Alexey Brodkin authored
      commit b37174d9 upstream.
      
      c70c4733 "ARCv2: SLC: Make sure busy bit is set properly on SLC flushing"
      fixes the problem for the entire-SLC operation, where the problem was
      initially caught. But given the nature of the issue, it is perfectly
      possible for the busy bit to be read incorrectly even when a region
      operation was started.

      So extend the initial fix to region operations as well.

      Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ALSA: firewire-motu: destroy stream data surely at failure of card initialization · 8537b1e0
      Takashi Sakamoto authored
      commit dbd7396b upstream.
      
      When sound card registration fails after stream data has been
      initialized, this module leaves the allocated stream data unreleased.
      This commit fixes the bug.
      
      Fixes: 9b2bb4f2 ('ALSA: firewire-motu: add stream management functionality')
      Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ALSA: firewire: fix NULL pointer dereference when releasing uninitialized data of iso-resource · 59d00061
      Takashi Sakamoto authored
      commit 0c264af7 upstream.
      
      When 'iso_resource_free()' is called for uninitialized data, the
      function dereferences a NULL pointer through its 'unit' member. This
      occurs when unplugging audio and music units on the IEEE 1394 bus after
      card registration has failed.

      This commit fixes the bug. The bug has existed since kernel v4.5.
      
      Fixes: 324540c4 ('ALSA: fireface: postpone sound card registration') at v4.12
      Fixes: 8865a31e ('ALSA: firewire-motu: postpone sound card registration') at v4.12
      Fixes: b610386c ('ALSA: firewire-tascam: deleyed registration of sound card') at v4.7
      Fixes: 86c8dd7f ('ALSA: firewire-digi00x: delayed registration of sound card') at v4.7
      Fixes: 6c29230e ('ALSA: oxfw: delayed registration of sound card') at v4.7
      Fixes: 7d3c1d59 ('ALSA: fireworks: delayed registration of sound card') at v4.7
      Fixes: 04a2c73c ('ALSA: bebob: delayed registration of sound card') at v4.7
      Fixes: b59fb190 ('ALSA: dice: postpone card registration') at v4.5
      Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ALSA: hda - Add stereo mic quirk for Lenovo G50-70 (17aa:3978) · 2f45c61b
      Takashi Iwai authored
      commit bbba6f9d upstream.
      
      Lenovo G50-70 (17aa:3978) with a Conexant codec chip requires a
      workaround for the inverted stereo dmic, similar to other Lenovo
      models.
      
      Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1020657
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ALSA: core: Fix unexpected error at replacing user TLV · ba6b08b6
      Takashi Iwai authored
      commit 88c54cdf upstream.
      
      When a user tries to replace the user-defined control TLV, the kernel
      checks whether its content changed via memcmp().  The problem is that
      the kernel passes the return value from memcmp() as is.  memcmp() may
      give a negative value depending on the comparison result, and this is
      then interpreted as an error code.

      The patch covers that corner case and properly returns 1 for a
      changed TLV.
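
      The pitfall can be shown with a short userspace sketch. The helper
      name tlv_changed() is hypothetical; the point is only that the result
      of memcmp() is collapsed to 0 or 1 so its sign can never be mistaken
      for a negative error code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns 0 if the two buffers are identical and 1 if they differ.
 * Returning memcmp() directly could yield a negative number, which a
 * caller using the "negative means error" convention would misread.
 */
static int tlv_changed(const void *old_tlv, const void *new_tlv, size_t size)
{
    return memcmp(old_tlv, new_tlv, size) != 0;
}

/* Both orderings of two differing buffers report "changed", never an error. */
static void tlv_examples(void)
{
    const unsigned char a[] = { 1, 2, 3 };
    const unsigned char b[] = { 1, 2, 2 };

    assert(memcmp(b, a, sizeof(a)) < 0);    /* raw result is negative */
    assert(tlv_changed(b, a, sizeof(a)) == 1);
    assert(tlv_changed(a, b, sizeof(a)) == 1);
    assert(tlv_changed(a, a, sizeof(a)) == 0);
}
```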
      
      Fixes: 8aa9b586 ("[ALSA] Control API - more robust TLV implementation")
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ALSA: usb-audio: Add delay quirk for H650e/Jabra 550a USB headsets · 1157dcda
      Joakim Tjernlund authored
      commit 07b3b5e9 upstream.
      
      These headsets report a lot of "cannot set freq 44100 to ep 0x81"
      errors and need a small delay between sample rate settings, just like
      the Zoom R16/24. Add both headsets to the Zoom R16/24 quirk for
      a 1 ms delay between control msgs.

      Signed-off-by: Joakim Tjernlund <joakim.tjernlund@infinera.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: block guest protection keys unless the host has them enabled · 2f76f62a
      Paolo Bonzini authored
      commit c469268c upstream.
      
      If the host has protection keys disabled, we cannot read and write the
      guest PKRU---RDPKRU and WRPKRU fail with #GP(0) if CR4.PKE=0.  Block
      the PKU cpuid bit in that case.
      
      This ensures that guest_CR4.PKE=1 implies host_CR4.PKE=1.
      
      Fixes: 1be0e61c
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.state · 3c498d4b
      Paolo Bonzini authored
      commit 38cfd5e3 upstream.
      
      The host pkru is restored right after vcpu exit (commit 1be0e61c), so
      KVM_GET_XSAVE will return the host PKRU value instead.  Fix this by
      using the guest PKRU explicitly in fill_xsave and load_xsave.  This
      part is based on a patch by Junkang Fu.
      
      The host PKRU data may also not match the value in vcpu->arch.guest_fpu.state,
      because it could have been changed by userspace since the last time
      it was saved, so skip loading it in kvm_load_guest_fpu.

      Reported-by: Junkang Fu <junkang.fjk@alibaba-inc.com>
      Cc: Yang Zhang <zy107165@alibaba-inc.com>
      Fixes: 1be0e61c
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: simplify handling of PKRU · d0e52c82
      Paolo Bonzini authored
      commit b9dd21e1 upstream.
      
      Move it to struct kvm_arch_vcpu, replacing guest_pkru_valid with a
      simple comparison against the host value of the register.  In addition,
      the write of PKRU can be skipped if the guest has not enabled the feature.
      Once we do this, we need not test OSPKE in the host anymore, because
      guest_CR4.PKE=1 implies host_CR4.PKE=1.
      
      The static PKU test is kept to elide the code on older CPUs.

      Suggested-by: Yang Zhang <zy107165@alibaba-inc.com>
      Fixes: 1be0e61c
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: s390: sthyi: fix specification exception detection · 6dc06cd6
      Heiko Carstens authored
      commit 857b8de9 upstream.
      
      sthyi should only generate a specification exception if the function
      code is zero and the response buffer is not on a 4k boundary.
      
      The current code would also test for unknown function codes if the
      response buffer, that is currently only defined for function code 0,
      is not on a 4k boundary and incorrectly inject a specification
      exception instead of returning with condition code 3 and return code 4
      (unsupported function code).
      
      Fix this by moving the boundary check.
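
      The intended check order can be sketched as below. This is a
      simplified model covering only the two conditions discussed in this
      message; enum sthyi_outcome and sthyi_check() are hypothetical names,
      and any other validity checks the emulation performs are left out.

```c
#include <assert.h>
#include <stdint.h>

enum sthyi_outcome {
    STHYI_OK,
    STHYI_SPEC_EXCEPTION,    /* specification exception is injected */
    STHYI_UNSUPPORTED_FC     /* condition code 3, return code 4 */
};

/*
 * Test the function code first.  The 4k-boundary requirement belongs to
 * the response buffer of function code 0, so it is only checked there.
 */
static enum sthyi_outcome sthyi_check(uint64_t function_code,
                                      uint64_t buffer_addr)
{
    if (function_code != 0)
        return STHYI_UNSUPPORTED_FC;
    if (buffer_addr & 0xfff)
        return STHYI_SPEC_EXCEPTION;
    return STHYI_OK;
}

static void sthyi_examples(void)
{
    /* Unknown function code with an unaligned buffer: no exception. */
    assert(sthyi_check(1, 0x1001) == STHYI_UNSUPPORTED_FC);
    /* Function code 0 with an unaligned buffer: specification exception. */
    assert(sthyi_check(0, 0x1001) == STHYI_SPEC_EXCEPTION);
}
```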
      
      Fixes: 95ca2cb5 ("KVM: s390: Add sthyi emulation")
      Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: s390: sthyi: fix sthyi inline assembly · e516834a
      Heiko Carstens authored
      commit 4a4eefcd upstream.
      
      The sthyi inline assembly misses register r3 within the clobber
      list. The sthyi instruction will always write a return code to
      register "R2+1", which in this case would be r3. Due to that we may
      have register corruption and see host crashes or data corruption
      depending on how gcc decided to allocate and use registers during
      compile time.
      
      Fixes: 95ca2cb5 ("KVM: s390: Add sthyi emulation")
      Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Input: ALPS - fix two-finger scroll breakage in right side on ALPS touchpad · ddae9e6e
      Masaki Ota authored
      commit 4a646580 upstream.
      
      Fix the issue that two-finger scroll does not work correctly with the
      V8 protocol. The cause is that the V8 protocol X-coordinate decoding
      is wrong on SS4 PLUS devices. Add an SS4 PLUS X decode definition.

      More notes:
      The problem was exposed by commit e7348396 ("Input: ALPS
      - fix V8+ protocol handling (73 03 28)"), where a fix for the V8+
      protocol was applied.  Although the culprit must have been present
      beforehand, two-finger scroll happened to work for some reason even
      with the wrongly reported values.  It was broken by the commit above
      just because that commit changed the x_max value, which made libinput
      interpret the MT events correctly.  Since the X coordinate is reported
      as falsely doubled, events on the right half go outside the boundary
      and are no longer handled.  This resulted in a broken two-finger
      scroll.

      One-finger events are decoded differently and did not suffer from
      this problem.  The problem affected only MT events. --tiwai
      
      Fixes: e7348396 ("Input: ALPS - fix V8+ protocol handling (73 03 28)")
      Signed-off-by: Masaki Ota <masaki.ota@jp.alps.com>
      Tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Paul Donohue <linux-kernel@PaulSD.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Input: elan_i2c - add ELAN0602 ACPI ID to support Lenovo Yoga310 · 8dcee8e8
      KT Liao authored
      commit 1d2226e4 upstream.
      
      Add ELAN0602 to the list of known ACPI IDs to enable support for ELAN
      touchpads found in Lenovo Yoga310.

      Signed-off-by: KT Liao <kt.liao@emc.com.tw>
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Input: trackpoint - add new trackpoint firmware ID · 38c36f9d
      Aaron Ma authored
      commit ec667683 upstream.
      
      Synaptics added new TP firmware IDs, 0x2 and 0x3; for now the lower 2
      bits both indicate a TP. Change the constant to a bitwise value.

      This makes the trackpoint be recognized on the Lenovo Carbon X1 Gen5
      instead of being identified as "PS/2 Generic Mouse".
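
      The difference between an exact match and a bitwise test can be
      sketched as follows. The macro and function names are hypothetical,
      and the previously accepted ID of 0x1 is an assumption inferred from
      the new IDs listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TP_IDENT_OLD  0x01   /* assumed: the single ID accepted before */
#define TOY_TP_IDENT_MASK 0x03   /* either of the lower two bits marks a TP */

/* Exact comparison: rejects the new firmware IDs 0x2 and 0x3. */
static bool is_trackpoint_exact(uint8_t id)
{
    return id == TOY_TP_IDENT_OLD;
}

/* Bitwise test: accepts any ID with one of the lower two bits set. */
static bool is_trackpoint_bitwise(uint8_t id)
{
    return (id & TOY_TP_IDENT_MASK) != 0;
}

static void trackpoint_examples(void)
{
    assert(!is_trackpoint_exact(0x02));    /* falls back to generic mouse */
    assert(is_trackpoint_bitwise(0x02));
    assert(is_trackpoint_bitwise(0x03));
}
```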

      Signed-off-by: Aaron Ma <aaron.ma@canonical.com>
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bpf/verifier: fix min/max handling in BPF_SUB · c9c682f3
      Edward Cree authored
      
      [ Upstream commit 9305706c ]
      
      We have to subtract the src max from the dst min, and vice-versa, since
      (e.g.) the smallest result comes from the largest subtrahend.
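
      The rule is ordinary interval subtraction, sketched here in userspace
      C. struct bounds and bounds_sub() are hypothetical names, and the
      sketch assumes the values are small enough that the subtractions do
      not overflow, which a real verifier has to handle separately.

```c
#include <assert.h>
#include <stdint.h>

/* Closed interval [min, max] of possible register values. */
struct bounds {
    int64_t min;
    int64_t max;
};

/*
 * Bounds of (dst - src): the smallest result pairs the smallest dst with
 * the largest src, and the largest result pairs the largest dst with the
 * smallest src.
 */
static struct bounds bounds_sub(struct bounds dst, struct bounds src)
{
    struct bounds r;

    assert(dst.min <= dst.max && src.min <= src.max);
    r.min = dst.min - src.max;
    r.max = dst.max - src.min;
    return r;
}
```

      For example, with dst in [10, 20] and src in [1, 5] the result lies in
      [5, 19]; subtracting min from min and max from max would give the
      incorrect range [9, 15].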
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9c682f3
    • Daniel Borkmann's avatar
      bpf: fix mixed signed/unsigned derived min/max value bounds · eb6cf01c
      Daniel Borkmann authored
      
      [ Upstream commit 4cabc5b1 ]
      
      Edward reported that there's an issue in min/max value bounds
      tracking when signed and unsigned compares both provide hints
      on limits when having unknown variables. E.g. a program such
      as the following should have been rejected:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff8a94cda93400
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+7
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = -1
        10: (2d) if r1 > r2 goto pc+3
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        11: (65) if r1 s> 0x1 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=0,max_value=1
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        12: (0f) r0 += r1
        13: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=0,max_value=1 R1=inv,min_value=0,max_value=1
        R2=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
        14: (b7) r0 = 0
        15: (95) exit
      
      What happens is that in the first part ...
      
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = -1
        10: (2d) if r1 > r2 goto pc+3
      
      ... r1 carries an unsigned value, and is compared as unsigned
      against a register carrying an immediate. Verifier deduces in
      reg_set_min_max() that since the compare is unsigned and operation
      is greater than (>), that in the fall-through/false case, r1's
      minimum bound must be 0 and maximum bound must be r2. Latter is
      larger than the bound and thus max value is reset back to being
      'invalid' aka BPF_REGISTER_MAX_RANGE. Thus, r1 state is now
      'R1=inv,min_value=0'. The subsequent test ...
      
        11: (65) if r1 s> 0x1 goto pc+2
      
      ... is a signed compare of r1 with immediate value 1. Here,
      verifier deduces in reg_set_min_max() that since the compare
      is signed this time and operation is greater than (>), that
      in the fall-through/false case, we can deduce that r1's maximum
      bound must be 1, meaning with prior test, we result in r1 having
      the following state: R1=inv,min_value=0,max_value=1. Given that
      the actual value this holds is -8, the bounds are wrongly deduced.
      When this is being added to r0 which holds the map_value(_adj)
      type, then subsequent store access in above case will go through
      check_mem_access() which invokes check_map_access_adj(), that
      will then probe whether the map memory is in bounds based
      on the min_value and max_value as well as access size since
      the actual unknown value is min_value <= x <= max_value; commit
      fce366a9 ("bpf, verifier: fix alu ops against map_value{,
      _adj} register types") provides some more explanation on the
      semantics.
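
The mismatch between the derived bounds and the real value can be checked outside the kernel. The following is a minimal sketch, assuming ordinary two's-complement 64-bit integers; `falls_through_both()` and `in_derived_bounds()` are hypothetical helpers that only mirror instructions 10 and 11 of the first listing and are not verifier code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when a 64-bit value falls through both tests of the first listing:
 * insn 10 (unsigned r1 > r2, with r2 = -1) and insn 11 (signed r1 s> 1).
 */
bool falls_through_both(uint64_t r1)
{
    uint64_t r2 = (uint64_t)-1;          /* r2 = -1 is UINT64_MAX when unsigned */
    bool taken_unsigned = r1 > r2;       /* can never be true */
    /* Signed view of the same bits; the conversion is implementation-defined
     * before C23 but wraps on mainstream two's-complement compilers. */
    bool taken_signed = (int64_t)r1 > 1;

    return !taken_unsigned && !taken_signed;
}

/* The range the pre-fix logic derived for r1: min_value=0, max_value=1. */
bool in_derived_bounds(int64_t v)
{
    return v >= 0 && v <= 1;
}
```

With r1 = -8, `falls_through_both()` holds while `in_derived_bounds()` does not, which is the gap the commit message describes.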
      
      It is worth noting in this context that, in the current code,
      min_value and max_value tracking are used for two things, i)
      dynamic map value access via check_map_access_adj() and since
      commit 06c1c049 ("bpf: allow helpers access to variable memory")
      ii) also enforced at check_helper_mem_access() when passing a
      memory address (pointer to packet, map value, stack) and length
      pair to a helper and the length in this case is an unknown value
      defining an access range through min_value/max_value in that
      case. The min_value/max_value tracking is /not/ used in the
      direct packet access case to track ranges. However, the issue
      also affects case ii), for example, the following crafted program
      based on the same principle must be rejected as well:
      
         0: (b7) r2 = 0
         1: (bf) r3 = r10
         2: (07) r3 += -512
         3: (7a) *(u64 *)(r10 -16) = -8
         4: (79) r4 = *(u64 *)(r10 -16)
         5: (b7) r6 = -1
         6: (2d) if r4 > r6 goto pc+5
        R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512
        R4=inv,min_value=0 R6=imm-1,max_value=18446744073709551615,min_align=1 R10=fp
         7: (65) if r4 s> 0x1 goto pc+4
        R1=ctx R2=imm0,min_value=0,max_value=0,min_align=2147483648 R3=fp-512
        R4=inv,min_value=0,max_value=1 R6=imm-1,max_value=18446744073709551615,min_align=1
        R10=fp
         8: (07) r4 += 1
         9: (b7) r5 = 0
        10: (6a) *(u16 *)(r10 -512) = 0
        11: (85) call bpf_skb_load_bytes#26
        12: (b7) r0 = 0
        13: (95) exit
      
      Meaning, while we initialize the max_value stack slot that the
      verifier thinks we access in the [1,2] range, in reality we
      pass -7 as length which is interpreted as u32 in the helper.
      Thus, this issue is relevant also for the case of helper ranges.
      Resetting both bounds in check_reg_overflow() in case only one
      of them exceeds limits is also not enough as similar test can be
      created that uses values which are within range, thus also here
      learned min value in r1 is incorrect when mixed with later signed
      test to create a range:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff880ad081fa00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+7
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = 2
        10: (3d) if r2 >= r1 goto pc+3
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        11: (65) if r1 s> 0x4 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
        R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        12: (0f) r0 += r1
        13: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=3,max_value=4
        R1=inv,min_value=3,max_value=4 R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        14: (b7) r0 = 0
        15: (95) exit
      
      This leaves us with two options for fixing this: i) to invalidate
      all prior learned information once we switch signed context, ii)
      to track min/max signed and unsigned boundaries separately as
      done in [0]. (Given latter introduces major changes throughout
      the whole verifier, it's rather net-next material, thus this
      patch follows option i), meaning we can derive bounds either
      from only signed tests or only unsigned tests.) There is still the
      case of adjust_reg_min_max_vals(), where we adjust bounds on ALU
      operations, meaning programs like the following where boundaries
      on the reg get mixed in context later on when bounds are merged
      on the dst reg must get rejected, too:
      
         0: (7a) *(u64 *)(r10 -8) = 0
         1: (bf) r2 = r10
         2: (07) r2 += -8
         3: (18) r1 = 0xffff89b2bf87ce00
         5: (85) call bpf_map_lookup_elem#1
         6: (15) if r0 == 0x0 goto pc+6
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R10=fp
         7: (7a) *(u64 *)(r10 -16) = -8
         8: (79) r1 = *(u64 *)(r10 -16)
         9: (b7) r2 = 2
        10: (3d) if r2 >= r1 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R10=fp
        11: (b7) r7 = 1
        12: (65) if r7 s> 0x0 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,max_value=0 R10=fp
        13: (b7) r0 = 0
        14: (95) exit
      
        from 12 to 15: R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0
        R1=inv,min_value=3 R2=imm2,min_value=2,max_value=2,min_align=2 R7=imm1,min_value=1 R10=fp
        15: (0f) r7 += r1
        16: (65) if r7 s> 0x4 goto pc+2
        R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
        17: (0f) r0 += r7
        18: (72) *(u8 *)(r0 +0) = 0
        R0=map_value_adj(ks=8,vs=8,id=0),min_value=4,max_value=4 R1=inv,min_value=3
        R2=imm2,min_value=2,max_value=2,min_align=2 R7=inv,min_value=4,max_value=4 R10=fp
        19: (b7) r0 = 0
        20: (95) exit
      
      Meaning, in adjust_reg_min_max_vals() we must also reset range
      values on the dst when src/dst registers have mixed signed/
      unsigned derived min/max value bounds with one unbounded value
      as otherwise they can be added together deducing false boundaries.
      Once both boundaries are established from either ALU ops or
      compare operations w/o mixing signed/unsigned insns, then they
      can safely be added to other regs also having both boundaries
      established. Adding regs with one unbounded side to a map value
      where the bounded side has been learned w/o mixing ops is
      possible, but the resulting map value won't recover from that,
      meaning such op is considered invalid on the time of actual
      access. Invalid bounds are set on the dst reg in case i) src reg,
      or ii) in case dst reg already had them. The only way to recover
      would be to perform i) ALU ops but only 'add' is allowed on map
      value types or ii) comparisons, but these are disallowed on
      pointers in case they span a range. This is fine as only BPF_JEQ
      and BPF_JNE may be performed on PTR_TO_MAP_VALUE_OR_NULL registers
      which potentially turn them into PTR_TO_MAP_VALUE type depending
      on the branch, so only here min/max value cannot be invalidated
      for them.
      
      In terms of state pruning, value_from_signed is considered
      as well in states_equal() when dealing with adjusted map values.
      With regards to breaking existing programs, there is a small
      risk, but use-cases are rather quite narrow where this could
      occur and mixing compares probably unlikely.
      
      Joint work with Josef and Edward.
      
        [0] https://lists.iovisor.org/pipermail/iovisor-dev/2017-June/000822.html
      
      Fixes: 48461135 ("bpf: allow access into map value arrays")
      Reported-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb6cf01c
    • John Fastabend's avatar
      bpf, verifier: add additional patterns to evaluate_reg_imm_alu · 659ee968
      John Fastabend authored
      
      [ Upstream commit 43188702 ]
      
      Currently the verifier does not track imm across alu operations when
      the source register is of unknown type. This adds additional pattern
      matching to catch this and track imm. We've seen LLVM generating this
      pattern while working on cilium.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      659ee968
    • Konstantin Khlebnikov's avatar
      net_sched: fix order of queue length updates in qdisc_replace() · d8a4ae09
      Konstantin Khlebnikov authored
      
      [ Upstream commit 68a66d14 ]
      
      It is important to call qdisc_tree_reduce_backlog() after changing the queue
      length. The parent qdisc should deactivate the class in ->qlen_notify(), called
      from qdisc_tree_reduce_backlog(), but this happens only if qdisc->q.qlen is zero.
      
      Missed class deactivations lead to crashes/warnings when picking packets
      from an empty qdisc and to corrupted state when reactivating the class later.
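
The ordering requirement can be illustrated with a toy model. Everything here is hypothetical (`toy_child`, `toy_parent`, `toy_qlen_notify()` and the two purge functions are not the kernel's types or functions); the sketch only shows why the notification must see the already-updated queue length.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; not the kernel's struct Qdisc or class types. */
struct toy_child  { unsigned int qlen; };
struct toy_parent { struct toy_child *child; bool class_active; };

/* Stand-in for ->qlen_notify(): deactivate only when the child is empty. */
void toy_qlen_notify(struct toy_parent *p)
{
    if (p->child->qlen == 0)
        p->class_active = false;
}

/* Order described by the fix: change the queue length, then notify. */
void toy_purge_fixed(struct toy_parent *p)
{
    p->child->qlen = 0;
    toy_qlen_notify(p);
}

/* Order before the fix: notify while qlen is still non-zero. */
void toy_purge_buggy(struct toy_parent *p)
{
    toy_qlen_notify(p);
    p->child->qlen = 0;
}
```

In the buggy order the class stays marked active although its queue is empty, which corresponds to the missed deactivation described above.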
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Fixes: 86a7996c ("net_sched: introduce qdisc_replace() helper")
      Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d8a4ae09
    • Xin Long's avatar
      net: sched: fix NULL pointer dereference when action calls some targets · 09e1d36d
      Xin Long authored
      
      [ Upstream commit 4f8a881a ]
      
      As we know, a target's checkentry may dereference par.entryinfo
      to check the entry. But when a sched action calls xt_check_target,
      par.entryinfo is set to NULL, which causes a kernel panic when calling
      some targets.
      
      It can be reproduced with:
        # tc qd add dev eth1 ingress handle ffff:
        # tc filter add dev eth1 parent ffff: u32 match u32 0 0 action xt \
          -j ECN --ecn-tcp-remove
      
      It could also crash the kernel when using the CLUSTERIP or TPROXY target.
      
      For now there is no proper value for par.entryinfo in ipt_init_target,
      but it cannot be set to NULL. This patch avoids all these
      panics by setting it to an ipt_entry object with all members = 0.
      
      Note that this issue has been there since the very beginning.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      09e1d36d
    • Colin Ian King's avatar
      irda: do not leak initialized list.dev to userspace · f4e4a296
      Colin Ian King authored
      
      [ Upstream commit b024d949 ]
      
      list.dev has not been initialized, so copy_to_user() copies
      data from the stack back to user space, which is a potential
      information leak. Fix this by ensuring all of list is initialized to
      zero.
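
The pattern behind this kind of fix can be sketched in userspace C. The layout below is an assumption for illustration (`struct toy_list` and `toy_fill_reply()` are hypothetical and do not match the irda structures); it shows one common way to make sure no stale bytes remain in an object before it is copied out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical reply layout; the real irda structure differs. */
struct toy_list {
    unsigned int len;
    unsigned char dev[16];
};

/*
 * Only `len` is assigned explicitly, so the whole object is zeroed first.
 * That way `dev` and any padding carry no leftover stack contents when the
 * structure is later copied to another domain.
 */
void toy_fill_reply(struct toy_list *out)
{
    assert(out != NULL);
    memset(out, 0, sizeof(*out));
    out->len = 0;
}
```

Zeroing with memset() also clears padding bytes, which a member-by-member assignment would leave untouched.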
      
      Detected by CoverityScan, CID#1357894 ("Uninitialized scalar variable")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f4e4a296
    • Huy Nguyen's avatar
      net/mlx4_core: Enable 4K UAR if SRIOV module parameter is not enabled · 754df4da
      Huy Nguyen authored
      
      [ Upstream commit ca3d89a3 ]
      
      enable_4k_uar module parameter was added in patch cited below to
      address the backward compatibility issue in SRIOV when the VM has
      system's PAGE_SIZE uar implementation and the Hypervisor has 4k uar
      implementation.
      
      The above compatibility issue does not exist in the non SRIOV case.
      In this patch, we always enable 4k uar implementation if SRIOV
      is not enabled on mlx4's supported cards.
      
      Fixes: 76e39ccf ("net/mlx4_core: Fix backward compatibility on VFs")
      Signed-off-by: Huy Nguyen <huyn@mellanox.com>
      Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      754df4da
    • Neal Cardwell's avatar
      tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP · 2d093adf
      Neal Cardwell authored
      
      [ Upstream commit cdbeb633 ]
      
      In some situations tcp_send_loss_probe() can realize that it's unable
      to send a loss probe (TLP), and falls back to calling tcp_rearm_rto()
      to schedule an RTO timer. In such cases, sometimes tcp_rearm_rto()
      realizes that the RTO was eligible to fire immediately or at some
      point in the past (delta_us <= 0). Previously in such cases
      tcp_rearm_rto() was scheduling such "overdue" RTOs to happen at now +
      icsk_rto, which caused needless delays of hundreds of milliseconds
      (and non-linear behavior that made reproducible testing
      difficult). This commit changes the logic to schedule "overdue" RTOs
      ASAP, rather than at now + icsk_rto.
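
The change in scheduling can be sketched as two small functions. This is an illustration only: `toy_rearm_delay_old()` and `toy_rearm_delay_new()` are hypothetical names, all values are in microseconds, and the 1 us floor used for "ASAP" is an assumption rather than something stated above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Behaviour described as the previous one: an overdue RTO (delta_us <= 0)
 * is pushed out by a full RTO interval.
 */
int64_t toy_rearm_delay_old(int64_t delta_us, int64_t icsk_rto_us)
{
    assert(icsk_rto_us > 0);
    return delta_us > 0 ? delta_us : icsk_rto_us;
}

/*
 * Behaviour after the change: an overdue RTO is scheduled as soon as
 * possible. The 1 us minimum is an assumed stand-in for "ASAP".
 */
int64_t toy_rearm_delay_new(int64_t delta_us, int64_t icsk_rto_us)
{
    assert(icsk_rto_us > 0);
    return delta_us > 0 ? delta_us : 1;
}
```

With an RTO of 200000 us that is already 5000 us overdue, the old variant waits another 200000 us while the new one waits the minimum.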
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Suggested-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d093adf
    • Wei Wang's avatar
      ipv6: repair fib6 tree in failure case · 7bbc60d9
      Wei Wang authored
      
      [ Upstream commit 348a4002 ]
      
      In fib6_add(), it is possible that fib6_add_1() picks an intermediate
      node and sets the node's fn->leaf to NULL in order to add this new
      route. However, if fib6_add_rt2node() fails to add the new
      route for some reason, fn->leaf will be left as NULL and could
      potentially cause a crash when fn->leaf is accessed in fib6_locate().
      This patch makes sure fib6_repair_tree() is called to properly repair
      fn->leaf in the above failure case.
      
      Here is the syzkaller reported general protection fault in fib6_locate:
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 0 PID: 40937 Comm: syz-executor3 Not tainted
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      task: ffff8801d7d64100 ti: ffff8801d01a0000 task.ti: ffff8801d01a0000
      RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] __ipv6_prefix_equal64_half include/net/ipv6.h:475 [inline]
      RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] ipv6_prefix_equal include/net/ipv6.h:492 [inline]
      RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] fib6_locate_1 net/ipv6/ip6_fib.c:1210 [inline]
      RIP: 0010:[<ffffffff82a3e0e1>]  [<ffffffff82a3e0e1>] fib6_locate+0x281/0x3c0 net/ipv6/ip6_fib.c:1233
      RSP: 0018:ffff8801d01a36a8  EFLAGS: 00010202
      RAX: 0000000000000020 RBX: ffff8801bc790e00 RCX: ffffc90002983000
      RDX: 0000000000001219 RSI: ffff8801d01a37a0 RDI: 0000000000000100
      RBP: ffff8801d01a36f0 R08: 00000000000000ff R09: 0000000000000000
      R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000001
      R13: dffffc0000000000 R14: ffff8801d01a37a0 R15: 0000000000000000
      FS:  00007f6afd68c700(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004c6340 CR3: 00000000ba41f000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Stack:
       ffff8801d01a37a8 ffff8801d01a3780 ffffed003a0346f5 0000000c82a23ea0
       ffff8800b7bd7700 ffff8801d01a3780 ffff8800b6a1c940 ffffffff82a23ea0
       ffff8801d01a3920 ffff8801d01a3748 ffffffff82a223d6 ffff8801d7d64988
      Call Trace:
       [<ffffffff82a223d6>] ip6_route_del+0x106/0x570 net/ipv6/route.c:2109
       [<ffffffff82a23f9d>] inet6_rtm_delroute+0xfd/0x100 net/ipv6/route.c:3075
       [<ffffffff82621359>] rtnetlink_rcv_msg+0x549/0x7a0 net/core/rtnetlink.c:3450
       [<ffffffff8274c1d1>] netlink_rcv_skb+0x141/0x370 net/netlink/af_netlink.c:2281
       [<ffffffff82613ddf>] rtnetlink_rcv+0x2f/0x40 net/core/rtnetlink.c:3456
       [<ffffffff8274ad38>] netlink_unicast_kernel net/netlink/af_netlink.c:1206 [inline]
       [<ffffffff8274ad38>] netlink_unicast+0x518/0x750 net/netlink/af_netlink.c:1232
       [<ffffffff8274b83e>] netlink_sendmsg+0x8ce/0xc30 net/netlink/af_netlink.c:1778
       [<ffffffff82564aff>] sock_sendmsg_nosec net/socket.c:609 [inline]
       [<ffffffff82564aff>] sock_sendmsg+0xcf/0x110 net/socket.c:619
       [<ffffffff82564d62>] sock_write_iter+0x222/0x3a0 net/socket.c:834
       [<ffffffff8178523d>] new_sync_write+0x1dd/0x2b0 fs/read_write.c:478
       [<ffffffff817853f4>] __vfs_write+0xe4/0x110 fs/read_write.c:491
       [<ffffffff81786c38>] vfs_write+0x178/0x4b0 fs/read_write.c:538
       [<ffffffff817892a9>] SYSC_write fs/read_write.c:585 [inline]
       [<ffffffff817892a9>] SyS_write+0xd9/0x1b0 fs/read_write.c:577
       [<ffffffff82c71e32>] entry_SYSCALL_64_fastpath+0x12/0x17
      
      Note: there is no "Fixes" tag as this seems to be a bug introduced
      very early.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7bbc60d9
    • Wei Wang's avatar
      ipv6: reset fn->rr_ptr when replacing route · 368129fe
      Wei Wang authored
      
      [ Upstream commit 383143f3 ]
      
      syzkaller reported the following use-after-free issue in rt6_select():
      BUG: KASAN: use-after-free in rt6_select net/ipv6/route.c:755 [inline] at addr ffff8800bc6994e8
      BUG: KASAN: use-after-free in ip6_pol_route.isra.46+0x1429/0x1470 net/ipv6/route.c:1084 at addr ffff8800bc6994e8
      Read of size 4 by task syz-executor1/439628
      CPU: 0 PID: 439628 Comm: syz-executor1 Not tainted 4.3.5+ #8
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       0000000000000000 ffff88018fe435b0 ffffffff81ca384d ffff8801d3588c00
       ffff8800bc699380 ffff8800bc699500 dffffc0000000000 ffff8801d40a47c0
       ffff88018fe435d8 ffffffff81735751 ffff88018fe43660 ffff8800bc699380
      Call Trace:
       [<ffffffff81ca384d>] __dump_stack lib/dump_stack.c:15 [inline]
       [<ffffffff81ca384d>] dump_stack+0xc1/0x124 lib/dump_stack.c:51
      sctp: [Deprecated]: syz-executor0 (pid 439615) Use of struct sctp_assoc_value in delayed_ack socket option.
      Use struct sctp_sack_info instead
       [<ffffffff81735751>] kasan_object_err+0x21/0x70 mm/kasan/report.c:158
       [<ffffffff817359c4>] print_address_description mm/kasan/report.c:196 [inline]
       [<ffffffff817359c4>] kasan_report_error+0x1b4/0x4a0 mm/kasan/report.c:285
       [<ffffffff81735d93>] kasan_report mm/kasan/report.c:305 [inline]
       [<ffffffff81735d93>] __asan_report_load4_noabort+0x43/0x50 mm/kasan/report.c:325
       [<ffffffff82a28e39>] rt6_select net/ipv6/route.c:755 [inline]
       [<ffffffff82a28e39>] ip6_pol_route.isra.46+0x1429/0x1470 net/ipv6/route.c:1084
       [<ffffffff82a28fb1>] ip6_pol_route_output+0x81/0xb0 net/ipv6/route.c:1203
       [<ffffffff82ab0a50>] fib6_rule_action+0x1f0/0x680 net/ipv6/fib6_rules.c:95
       [<ffffffff8265cbb6>] fib_rules_lookup+0x2a6/0x7a0 net/core/fib_rules.c:223
       [<ffffffff82ab1430>] fib6_rule_lookup+0xd0/0x250 net/ipv6/fib6_rules.c:41
       [<ffffffff82a22006>] ip6_route_output+0x1d6/0x2c0 net/ipv6/route.c:1224
       [<ffffffff829e83d2>] ip6_dst_lookup_tail+0x4d2/0x890 net/ipv6/ip6_output.c:943
       [<ffffffff829e889a>] ip6_dst_lookup_flow+0x9a/0x250 net/ipv6/ip6_output.c:1079
       [<ffffffff82a9f7d8>] ip6_datagram_dst_update+0x538/0xd40 net/ipv6/datagram.c:91
       [<ffffffff82aa0978>] __ip6_datagram_connect net/ipv6/datagram.c:251 [inline]
       [<ffffffff82aa0978>] ip6_datagram_connect+0x518/0xe50 net/ipv6/datagram.c:272
       [<ffffffff82aa1313>] ip6_datagram_connect_v6_only+0x63/0x90 net/ipv6/datagram.c:284
       [<ffffffff8292f790>] inet_dgram_connect+0x170/0x1f0 net/ipv4/af_inet.c:564
       [<ffffffff82565547>] SYSC_connect+0x1a7/0x2f0 net/socket.c:1582
       [<ffffffff8256a649>] SyS_connect+0x29/0x30 net/socket.c:1563
       [<ffffffff82c72032>] entry_SYSCALL_64_fastpath+0x12/0x17
      Object at ffff8800bc699380, in cache ip6_dst_cache size: 384
      
      The root cause of it is that in fib6_add_rt2node(), when it replaces an
      existing route with the new one, it does not update fn->rr_ptr.
      This commit resets fn->rr_ptr to NULL when it points to a route which is
      replaced in fib6_add_rt2node().
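
The idea of clearing a cached pointer when its target is replaced can be sketched with a small linked list. The types and the function below are hypothetical (`toy_route`, `toy_node`, `toy_replace_route()`), not the fib6 structures; only the rr_ptr reset mirrors the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified types; not the kernel's fib6 structures. */
struct toy_route { int id; struct toy_route *next; };
struct toy_node  { struct toy_route *leaf; struct toy_route *rr_ptr; };

/*
 * Replace the route at *slot with new_rt. If the node's round-robin
 * pointer still refers to the route being replaced, reset it to NULL so
 * that a later lookup cannot follow a pointer to a released route.
 */
void toy_replace_route(struct toy_node *fn, struct toy_route **slot,
                       struct toy_route *new_rt)
{
    struct toy_route *old = *slot;

    assert(old != NULL && new_rt != NULL);
    new_rt->next = old->next;
    *slot = new_rt;
    if (fn->rr_ptr == old)
        fn->rr_ptr = NULL;
}
```

A lookup that finds rr_ptr == NULL is expected to fall back to the node's leaf, so resetting is safe; leaving it pointing at the old route is what opens the use-after-free window.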
      
      Fixes: 27596472 ("ipv6: fix ECMP route replacement")
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      368129fe
    • Eric Dumazet's avatar
      tipc: fix use-after-free · c549de48
      Eric Dumazet authored
      
      [ Upstream commit 5bfd37b4 ]
      
      syzkaller reported use-after-free in tipc [1]
      
      When msg->rep skb is freed, set the pointer to NULL,
      so that caller does not free it again.
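
The free-then-clear pattern can be sketched in userspace C. This is a simplified stand-in, assuming a hypothetical `struct toy_msg` with a heap-allocated `rep` buffer instead of an skb; `toy_dumpit()` and `toy_cleanup()` are illustrative names only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical message holding a reply buffer; not the tipc structure. */
struct toy_msg { char *rep; };

/* Callee: on its error path it frees the reply and clears the pointer. */
int toy_dumpit(struct toy_msg *msg, int fail)
{
    assert(msg != NULL);
    msg->rep = malloc(64);
    if (!msg->rep)
        return -1;
    if (fail) {
        free(msg->rep);
        msg->rep = NULL;   /* the caller's cleanup now sees NULL */
        return -1;
    }
    return 0;
}

/* Caller cleanup: free(NULL) is a no-op, so there is no second free. */
void toy_cleanup(struct toy_msg *msg)
{
    free(msg->rep);
    msg->rep = NULL;
}
```

Without the `msg->rep = NULL` line in the error path, `toy_cleanup()` would free the same buffer a second time, which is the userspace analogue of the reported bug.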
      
      [1]
      
      ==================================================================
      BUG: KASAN: use-after-free in skb_push+0xd4/0xe0 net/core/skbuff.c:1466
      Read of size 8 at addr ffff8801c6e71e90 by task syz-executor5/4115
      
      CPU: 1 PID: 4115 Comm: syz-executor5 Not tainted 4.13.0-rc4+ #32
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       print_address_description+0x73/0x250 mm/kasan/report.c:252
       kasan_report_error mm/kasan/report.c:351 [inline]
       kasan_report+0x24e/0x340 mm/kasan/report.c:409
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
       skb_push+0xd4/0xe0 net/core/skbuff.c:1466
       tipc_nl_compat_recv+0x833/0x18f0 net/tipc/netlink_compat.c:1209
       genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
       genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
       netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
       netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
       netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
       netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       sock_write_iter+0x31a/0x5d0 net/socket.c:898
       call_write_iter include/linux/fs.h:1743 [inline]
       new_sync_write fs/read_write.c:457 [inline]
       __vfs_write+0x684/0x970 fs/read_write.c:470
       vfs_write+0x189/0x510 fs/read_write.c:518
       SYSC_write fs/read_write.c:565 [inline]
       SyS_write+0xef/0x220 fs/read_write.c:557
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x4512e9
      RSP: 002b:00007f3bc8184c08 EFLAGS: 00000216 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 00000000004512e9
      RDX: 0000000000000020 RSI: 0000000020fdb000 RDI: 0000000000000006
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004b5e76
      R13: 00007f3bc8184b48 R14: 00000000004b5e86 R15: 0000000000000000
      
      Allocated by task 4115:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
       kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
       kmem_cache_alloc_node+0x13d/0x750 mm/slab.c:3651
       __alloc_skb+0xf1/0x740 net/core/skbuff.c:219
       alloc_skb include/linux/skbuff.h:903 [inline]
       tipc_tlv_alloc+0x26/0xb0 net/tipc/netlink_compat.c:148
       tipc_nl_compat_dumpit+0xf2/0x3c0 net/tipc/netlink_compat.c:248
       tipc_nl_compat_handle net/tipc/netlink_compat.c:1130 [inline]
       tipc_nl_compat_recv+0x756/0x18f0 net/tipc/netlink_compat.c:1199
       genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
       genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
       netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
       netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
       netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
       netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       sock_write_iter+0x31a/0x5d0 net/socket.c:898
       call_write_iter include/linux/fs.h:1743 [inline]
       new_sync_write fs/read_write.c:457 [inline]
       __vfs_write+0x684/0x970 fs/read_write.c:470
       vfs_write+0x189/0x510 fs/read_write.c:518
       SYSC_write fs/read_write.c:565 [inline]
       SyS_write+0xef/0x220 fs/read_write.c:557
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Freed by task 4115:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
       __cache_free mm/slab.c:3503 [inline]
       kmem_cache_free+0x77/0x280 mm/slab.c:3763
       kfree_skbmem+0x1a1/0x1d0 net/core/skbuff.c:622
       __kfree_skb net/core/skbuff.c:682 [inline]
       kfree_skb+0x165/0x4c0 net/core/skbuff.c:699
       tipc_nl_compat_dumpit+0x36a/0x3c0 net/tipc/netlink_compat.c:260
       tipc_nl_compat_handle net/tipc/netlink_compat.c:1130 [inline]
       tipc_nl_compat_recv+0x756/0x18f0 net/tipc/netlink_compat.c:1199
       genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
       genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
       netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
       netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
       netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
       netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       sock_write_iter+0x31a/0x5d0 net/socket.c:898
       call_write_iter include/linux/fs.h:1743 [inline]
       new_sync_write fs/read_write.c:457 [inline]
       __vfs_write+0x684/0x970 fs/read_write.c:470
       vfs_write+0x189/0x510 fs/read_write.c:518
       SYSC_write fs/read_write.c:565 [inline]
       SyS_write+0xef/0x220 fs/read_write.c:557
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      The buggy address belongs to the object at ffff8801c6e71dc0
       which belongs to the cache skbuff_head_cache of size 224
      The buggy address is located 208 bytes inside of
       224-byte region [ffff8801c6e71dc0, ffff8801c6e71ea0)
      The buggy address belongs to the page:
      page:ffffea00071b9c40 count:1 mapcount:0 mapping:ffff8801c6e71000 index:0x0
      flags: 0x200000000000100(slab)
      raw: 0200000000000100 ffff8801c6e71000 0000000000000000 000000010000000c
      raw: ffffea0007224a20 ffff8801d98caf48 ffff8801d9e79040 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8801c6e71d80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
       ffff8801c6e71e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff8801c6e71e80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
                               ^
       ffff8801c6e71f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff8801c6e71f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      ==================================================================
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c549de48
    • Alexander Potapenko's avatar
      sctp: fully initialize the IPv6 address in sctp_v6_to_addr() · 62b3580f
      Alexander Potapenko authored
      
      [ Upstream commit 15339e44 ]
      
      KMSAN reported use of uninitialized sctp_addr->v4.sin_addr.s_addr and
      sctp_addr->v6.sin6_scope_id in sctp_v6_cmp_addr() (see below).
      Make sure all fields of an IPv6 address are initialized, which
      guarantees that the IPv4 fields are also initialized.
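
      The idea can be sketched in userspace C. The union and function
      names below (demo_addr, demo_v6_to_addr) are hypothetical stand-ins,
      not the kernel's definitions. The sketch assumes a common layout,
      such as Linux's, where the IPv4 sin_addr field shares its bytes with
      the IPv6 sin6_flowinfo field, so setting every IPv6 field also
      leaves no stale bytes behind the IPv4 view.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical userspace stand-in for a union of socket address types. */
union demo_addr {
	struct sockaddr     sa;
	struct sockaddr_in  v4;
	struct sockaddr_in6 v6;
};

/*
 * Assign every field of the IPv6 view, including the ones that are
 * easy to forget (flowinfo and scope_id), so that later comparisons
 * never read bytes left over from a previous use of the storage.
 */
void demo_v6_to_addr(union demo_addr *addr,
		     const struct in6_addr *saddr, in_port_t port)
{
	addr->v6.sin6_family   = AF_INET6;
	addr->v6.sin6_port     = port;   /* caller passes network byte order */
	addr->v6.sin6_flowinfo = 0;
	addr->v6.sin6_addr     = *saddr;
	addr->v6.sin6_scope_id = 0;
}
```

      Filling the union with a junk pattern before calling the helper is
      an easy way to check that no field keeps its old contents.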
      
      ==================================================================
       BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
       net/sctp/ipv6.c:517
       CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
       01/01/2011
       Call Trace:
        dump_stack+0x172/0x1c0 lib/dump_stack.c:42
        is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
        kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
        native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
        arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
        arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
        __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
        sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
        sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
        sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
        sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
        sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
        inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
        sock_sendmsg_nosec net/socket.c:633 [inline]
        sock_sendmsg net/socket.c:643 [inline]
        SYSC_sendto+0x608/0x710 net/socket.c:1696
        SyS_sendto+0x8a/0xb0 net/socket.c:1664
        entry_SYSCALL_64_fastpath+0x13/0x94
       RIP: 0033:0x44b479
       RSP: 002b:00007f6213f21c08 EFLAGS: 00000286 ORIG_RAX: 000000000000002c
       RAX: ffffffffffffffda RBX: 0000000020000000 RCX: 000000000044b479
       RDX: 0000000000000041 RSI: 0000000020edd000 RDI: 0000000000000006
       RBP: 00000000007080a8 R08: 0000000020b85fe4 R09: 000000000000001c
       R10: 0000000000040005 R11: 0000000000000286 R12: 00000000ffffffff
       R13: 0000000000003760 R14: 00000000006e5820 R15: 0000000000ff8000
       origin description: ----dst_saddr@sctp_v6_get_dst
       local variable created at:
        sk_fullsock include/net/sock.h:2321 [inline]
        inet6_sk include/linux/ipv6.h:309 [inline]
        sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
        sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
      ==================================================================
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      62b3580f
    • Eric Dumazet's avatar
      tun: handle register_netdevice() failures properly · dda84477
      Eric Dumazet authored
      
      [ Upstream commit ff244c6b ]
      
      syzkaller reported a double free [1], caused by the fact that the
      tun driver was not updated properly when priv_destructor was added.

      If register_netdevice() fails, priv_destructor() has already been
      called, so the caller must not release that state a second time.
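
      The ownership rule can be illustrated with a small userspace model.
      Everything here (demo_dev, demo_register, demo_setup) is hypothetical
      and only mimics the shape of the problem: a registration call that
      runs the destructor itself when it fails, and a caller that has to
      leave the already released state alone.

```c
#include <stdlib.h>

/* Hypothetical device owning one private allocation. */
struct demo_dev {
	void *security;                              /* private state */
	void (*priv_destructor)(struct demo_dev *);  /* releases it */
	int free_calls;                              /* bookkeeping for the demo */
};

void demo_destructor(struct demo_dev *dev)
{
	free(dev->security);
	dev->security = NULL;
	dev->free_calls++;
}

/* Model of a register call that cleans up private state when it fails. */
int demo_register(struct demo_dev *dev, int should_fail)
{
	if (should_fail) {
		if (dev->priv_destructor)
			dev->priv_destructor(dev);
		return -1;
	}
	return 0;
}

int demo_setup(struct demo_dev *dev, int should_fail)
{
	dev->security = malloc(32);
	if (!dev->security)
		return -1;
	dev->priv_destructor = demo_destructor;
	dev->free_calls = 0;

	if (demo_register(dev, should_fail) < 0) {
		/*
		 * The destructor already ran inside demo_register().
		 * Freeing dev->security again here would be the double free.
		 */
		return -1;
	}
	return 0;
}
```

      In this model the failure path releases the private state exactly
      once, which is the invariant the error handling has to preserve.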
      
      [1]
      BUG: KASAN: double-free or invalid-free in selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
      
      CPU: 0 PID: 2919 Comm: syzkaller227220 Not tainted 4.13.0-rc4+ #23
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       print_address_description+0x7f/0x260 mm/kasan/report.c:252
       kasan_report_double_free+0x55/0x80 mm/kasan/report.c:333
       kasan_slab_free+0xa0/0xc0 mm/kasan/kasan.c:514
       __cache_free mm/slab.c:3503 [inline]
       kfree+0xd3/0x260 mm/slab.c:3820
       selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
       security_tun_dev_free_security+0x48/0x80 security/security.c:1512
       tun_set_iff drivers/net/tun.c:1884 [inline]
       __tun_chr_ioctl+0x2ce6/0x3d50 drivers/net/tun.c:2064
       tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x443ff9
      RSP: 002b:00007ffc34271f68 EFLAGS: 00000217 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000443ff9
      RDX: 0000000020533000 RSI: 00000000400454ca RDI: 0000000000000003
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000401ce0
      R13: 0000000000401d70 R14: 0000000000000000 R15: 0000000000000000
      
      Allocated by task 2919:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_kmalloc+0xaa/0xd0 mm/kasan/kasan.c:551
       kmem_cache_alloc_trace+0x101/0x6f0 mm/slab.c:3627
       kmalloc include/linux/slab.h:493 [inline]
       kzalloc include/linux/slab.h:666 [inline]
       selinux_tun_dev_alloc_security+0x49/0x170 security/selinux/hooks.c:5012
       security_tun_dev_alloc_security+0x6d/0xa0 security/security.c:1506
       tun_set_iff drivers/net/tun.c:1839 [inline]
       __tun_chr_ioctl+0x1730/0x3d50 drivers/net/tun.c:2064
       tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Freed by task 2919:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_slab_free+0x6e/0xc0 mm/kasan/kasan.c:524
       __cache_free mm/slab.c:3503 [inline]
       kfree+0xd3/0x260 mm/slab.c:3820
       selinux_tun_dev_free_security+0x15/0x20 security/selinux/hooks.c:5023
       security_tun_dev_free_security+0x48/0x80 security/security.c:1512
       tun_free_netdev+0x13b/0x1b0 drivers/net/tun.c:1563
       register_netdevice+0x8d0/0xee0 net/core/dev.c:7605
       tun_set_iff drivers/net/tun.c:1859 [inline]
       __tun_chr_ioctl+0x1caf/0x3d50 drivers/net/tun.c:2064
       tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2309
       vfs_ioctl fs/ioctl.c:45 [inline]
       do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
       SYSC_ioctl fs/ioctl.c:700 [inline]
       SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      The buggy address belongs to the object at ffff8801d2843b40
       which belongs to the cache kmalloc-32 of size 32
      The buggy address is located 0 bytes inside of
       32-byte region [ffff8801d2843b40, ffff8801d2843b60)
      The buggy address belongs to the page:
      page:ffffea000660cea8 count:1 mapcount:0 mapping:ffff8801d2843000 index:0xffff8801d2843fc1
      flags: 0x200000000000100(slab)
      raw: 0200000000000100 ffff8801d2843000 ffff8801d2843fc1 000000010000003f
      raw: ffffea0006626a40 ffffea00066141a0 ffff8801dbc00100
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8801d2843a00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
       ffff8801d2843a80: 00 00 00 fc fc fc fc fc fb fb fb fb fc fc fc fc
      >ffff8801d2843b00: 00 00 00 00 fc fc fc fc fb fb fb fb fc fc fc fc
                                                 ^
       ffff8801d2843b80: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
       ffff8801d2843c00: fb fb fb fb fc fc fc fc fb fb fb fb fc fc fc fc
      
      ==================================================================
      
      Fixes: cf124db5 ("net: Fix inconsistent teardown and release of private netdev state.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dda84477
    • Colin Ian King's avatar
      nfp: fix infinite loop on umapping cleanup · 3c3181e1
      Colin Ian King authored
      
      [ Upstream commit eac2c68d ]
      
      The while loop that performs the DMA page unmapping never decrements
      the index counter f and hence loops forever. Fix this with a
      pre-decrement on f.
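
      A minimal sketch of the unwind pattern, using a hypothetical array
      of flags in place of real DMA mappings (demo_unwind and its
      arguments are illustrative names, not driver code). Here f is the
      index at which mapping failed, so only entries below f need to be
      released.

```c
/*
 * Release every entry that was set up before index f.
 * mapped[i] == 1 means "mapped"; the function clears the flag and
 * returns how many entries it released.
 */
int demo_unwind(int *mapped, int f)
{
	int released = 0;

	/* The pre-decrement both advances the loop and skips entry f itself. */
	while (--f >= 0) {
		mapped[f] = 0;
		released++;
	}
	return released;
}
```

      Without the decrement in the loop condition, f would keep its value
      and the condition would stay true forever.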
      
      Detected by CoverityScan, CID#1357309 ("Infinite loop")
      
      Fixes: 4c352362 ("net: add driver for Netronome NFP4000/NFP6000 NIC VFs")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c3181e1
    • Eric Dumazet's avatar
      ipv4: better IP_MAX_MTU enforcement · 9c579acf
      Eric Dumazet authored
      
      [ Upstream commit c780a049 ]
      
      While working on yet another syzkaller report, I found
      that our IP_MAX_MTU enforcements were not properly done.
      
      gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
      final result can be bigger than IP_MAX_MTU :/
      
      This is a problem because device mtu can be changed on other cpus or
      threads.
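
      The hazard and one way around it can be sketched as follows. The
      names are hypothetical (demo_netdev, DEMO_READ_ONCE, DEMO_IP_MAX_MTU),
      and the cap value of 0xFFFF is an assumption made for the sketch.
      The volatile access imitates the kernel's read-once idiom, forcing a
      single load into a local so that the comparison and the returned
      value cannot come from two different reads. In portable userspace
      code an atomic load would be the standard tool.

```c
/* Assumed cap for this sketch. */
#define DEMO_IP_MAX_MTU 0xFFFFU

/* Force exactly one load of x; imitates a read-once helper. */
#define DEMO_READ_ONCE(x) (*(const volatile unsigned int *)&(x))

struct demo_netdev {
	unsigned int mtu;   /* may be changed concurrently elsewhere */
};

unsigned int demo_clamped_mtu(const struct demo_netdev *dev)
{
	/* Snapshot the value once, then clamp the snapshot. */
	unsigned int mtu = DEMO_READ_ONCE(dev->mtu);

	return mtu < DEMO_IP_MAX_MTU ? mtu : DEMO_IP_MAX_MTU;
}
```

      Because the clamp operates on the local copy, the result can never
      exceed the cap, whatever happens to dev->mtu in the meantime.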
      
      While this patch does not fix the issue I am working on, it is
      probably worth addressing.

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c579acf
    • Eric Dumazet's avatar
      ptr_ring: use kmalloc_array() · 12ee6d75
      Eric Dumazet authored
      
      [ Upstream commit 81fbfe8a ]
      
      As found by syzkaller, malicious users can set whatever tx_queue_len
      on a tun device and eventually crash the kernel.
      
      Let's remove the ALIGN(XXX, SMP_CACHE_BYTES) rounding, since a small
      ring buffer is not fast anyway.
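
      kmalloc_array() refuses requests whose element count times element
      size would overflow. A userspace sketch of the same check, with a
      hypothetical helper name (demo_alloc_queue) and malloc() standing in
      for the kernel allocator:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate room for n pointers, failing cleanly instead of letting
 * n * sizeof(void *) wrap around to a small number.
 */
void **demo_alloc_queue(size_t n)
{
	if (n > SIZE_MAX / sizeof(void *))
		return NULL;            /* multiplication would overflow */

	return malloc(n * sizeof(void *));
}
```

      With the check in place, an absurd user-supplied length produces an
      allocation failure rather than an undersized buffer.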
      
      Fixes: 2e0ab8ca ("ptr_ring: array based FIFO for pointers")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      12ee6d75
    • Liping Zhang's avatar
      openvswitch: fix skb_panic due to the incorrect actions attrlen · cb445bfc
      Liping Zhang authored
      
      [ Upstream commit 494bea39 ]
      
      For sw_flow_actions, actions_len only represents the size of the
      kernel-internal form. When we dump the actions to userspace they are
      converted, so their true size may become bigger than actions_len.

      Unfortunately, for OVS_PACKET_ATTR_ACTIONS we use actions_len to
      allocate the skbuff, so the user_skb may be too small and an oops
      like this will happen:
        skbuff: skb_over_panic: text:ffffffff8148fabf len:1749 put:157 head:
        ffff881300f39000 data:ffff881300f39000 tail:0x6d5 end:0x6c0 dev:<NULL>
        ------------[ cut here ]------------
        kernel BUG at net/core/skbuff.c:129!
        [...]
        Call Trace:
         <IRQ>
         [<ffffffff8148be82>] skb_put+0x43/0x44
         [<ffffffff8148fabf>] skb_zerocopy+0x6c/0x1f4
         [<ffffffffa0290d36>] queue_userspace_packet+0x3a3/0x448 [openvswitch]
         [<ffffffffa0292023>] ovs_dp_upcall+0x30/0x5c [openvswitch]
         [<ffffffffa028d435>] output_userspace+0x132/0x158 [openvswitch]
         [<ffffffffa01e6890>] ? ip6_rcv_finish+0x74/0x77 [ipv6]
         [<ffffffffa028e277>] do_execute_actions+0xcc1/0xdc8 [openvswitch]
         [<ffffffffa028e3f2>] ovs_execute_actions+0x74/0x106 [openvswitch]
         [<ffffffffa0292130>] ovs_dp_process_packet+0xe1/0xfd [openvswitch]
         [<ffffffffa0292b77>] ? key_extract+0x63c/0x8d5 [openvswitch]
         [<ffffffffa029848b>] ovs_vport_receive+0xa1/0xc3 [openvswitch]
        [...]
      
      We can also see that actions_len is much smaller than orig_len:
        crash> struct sw_flow_actions 0xffff8812f539d000
        struct sw_flow_actions {
          rcu = {
            next = 0xffff8812f5398800,
            func = 0xffffe3b00035db32
          },
          orig_len = 1384,
          actions_len = 592,
          actions = 0xffff8812f539d01c
        }
      
      So as a quick fix, use orig_len instead of actions_len to allocate
      the user_skb.

      Finally, this oops happened on our system running a relatively old
      kernel, but the same risk still exists on mainline, since the wrong
      actions_len has been used from the beginning.
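
      The sizing decision can be sketched with a hypothetical structure
      that carries only the two lengths from the dump above. The names
      demo_flow_actions, demo_upcall_room and demo_fits are illustrative,
      and the field comments reflect one reading of the description
      rather than the kernel's own documentation.

```c
#include <stddef.h>

struct demo_flow_actions {
	size_t orig_len;     /* assumed: size of the userspace-visible form */
	size_t actions_len;  /* size of the kernel-internal form */
};

/* Reserve room based on the larger, userspace-visible size. */
size_t demo_upcall_room(const struct demo_flow_actions *acts)
{
	return acts->orig_len;
}

/* Returns 1 when `needed` bytes fit into `room` bytes, 0 otherwise. */
int demo_fits(size_t room, size_t needed)
{
	return needed <= room;
}
```

      With the values from the dump (orig_len = 1384, actions_len = 592),
      a buffer sized by actions_len cannot hold 1384 bytes of converted
      actions, while one sized by orig_len can.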
      
      Fixes: ccea7445 ("openvswitch: include datapath actions with sampled-packet upcall to userspace")
      Cc: Neil McKee <neil.mckee@inmon.com>
      Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cb445bfc
    • David Ahern's avatar
      net: igmp: Use ingress interface rather than vrf device · c6fc7b98
      David Ahern authored
      
      [ Upstream commit c7b725be ]
      
      Anuradha reported that statically added groups for interfaces enslaved
      to a VRF device were not persisting. The problem is that igmp queries
      and reports need to use the data in the in_dev for the real ingress
      device rather than the VRF device. Update igmp_rcv accordingly.
      
      Fixes: e58e4159 ("net: Enable support for VRF with ipv4 multicast")
      Reported-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c6fc7b98
    • Daniel Borkmann's avatar
      bpf: fix bpf_trace_printk on 32 bit archs · 921739a9
      Daniel Borkmann authored
      
      [ Upstream commit 88a5c690 ]
      
      James reported that on MIPS32 bpf_trace_printk() is currently
      broken while MIPS64 works fine:
      
        bpf_trace_printk() uses conditional operators to attempt to
        pass different types to __trace_printk() depending on the
        format operators. This doesn't work as intended on 32-bit
        architectures where u32 and long are passed differently to
        u64, since the result of C conditional operators follows the
        "usual arithmetic conversions" rules, such that the values
        passed to __trace_printk() will always be u64 [causing issues
        later in the va_list handling for vscnprintf()].
      
        For example the samples/bpf/tracex5 test printed lines like
        below on MIPS32, where the fd and buf have come from the u64
        fd argument, and the size from the buf argument:
      
          [...] 1180.941542: 0x00000001: write(fd=1, buf=  (null), size=6258688)
      
        Instead of this:
      
          [...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
      
      One way to get it working is to expand the various combinations
      of argument types into 8 different cases for 32 bit and 64 bit
      kernels. James tested the fix on MIPS32 and MIPS64 and confirmed
      that it resolves the issue.
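
      The conversion rule behind the bug can be observed directly in
      standard C. The helpers below are hypothetical illustrations
      (demo_conditional_size, demo_branch_size), not the kernel code: the
      first shows that a conditional expression mixing a 32-bit and a
      64-bit operand always has the wider type, the second shows that
      separate branches keep each operand at its own width.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * The type of "c ? a : b" is fixed at compile time by the usual
 * arithmetic conversions, so the result is 64 bits wide no matter
 * which operand is selected at run time.
 */
size_t demo_conditional_size(int want_narrow, uint32_t narrow, uint64_t wide)
{
	return sizeof(want_narrow ? narrow : wide);
}

/* Explicit branches let each value keep its intended width. */
size_t demo_branch_size(int want_narrow, uint32_t narrow, uint64_t wide)
{
	if (want_narrow)
		return sizeof(narrow);
	return sizeof(wide);
}
```

      This is why selecting the argument type with a conditional operator
      cannot work for a variadic call: the chosen value is widened before
      it is ever passed, and each combination has to be spelled out as a
      separate call instead.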
      
      Fixes: 9c959c86 ("tracing: Allow BPF programs to call bpf_trace_printk()")
      Reported-by: James Hogan <james.hogan@imgtec.com>
      Tested-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      921739a9