1. 14 Feb, 2020 4 commits
    • RDMA/netlink: Do not always generate an ACK for some netlink operations · ab40fc36
      Håkon Bugge authored
      commit a242c369 upstream.
      
      In rdma_nl_rcv_skb(), the local variable err is assigned the return value
      of the supplied callback function, which could be one of
      ib_nl_handle_resolve_resp(), ib_nl_handle_set_timeout(), or
      ib_nl_handle_ip_res_resp(). These three functions all return skb->len on
      success.
      
      rdma_nl_rcv_skb() is merely a copy of netlink_rcv_skb(). The callback
      functions used by the latter have the convention: "Returns 0 on success or
      a negative error code".
      
      In particular, the statement (equal for both functions):
      
         if (nlh->nlmsg_flags & NLM_F_ACK || err)
      
      implies that rdma_nl_rcv_skb() always will ack a message, independent of
      the NLM_F_ACK being set in nlmsg_flags or not.
      
      The fix could be to change the above statement, but it is better to keep
      the two *_rcv_skb() functions equal in this respect and instead change the
      three callback functions in the rdma subsystem to the correct convention.
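
      The effect of the shared test can be sketched in user space. This is a minimal model, not the kernel code: should_ack(), cb_returns_len() and cb_returns_zero() are hypothetical names, and only the NLM_F_ACK value is taken from <linux/netlink.h>.

```c
#include <stdbool.h>

#define NLM_F_ACK 0x04 /* same value as in <linux/netlink.h> */

/* Hypothetical stand-in for the ACK test shared by both *_rcv_skb() functions. */
static bool should_ack(unsigned int nlmsg_flags, int err)
{
	return (nlmsg_flags & NLM_F_ACK) || err;
}

/* Old rdma convention: the callback returns skb->len on success. */
static int cb_returns_len(int skb_len)
{
	return skb_len;
}

/* netlink convention: 0 on success or a negative error code. */
static int cb_returns_zero(int skb_len)
{
	(void)skb_len;
	return 0;
}
```

      With a non-zero skb->len as the "success" value, should_ack() is true even when NLM_F_ACK is clear; with the 0-on-success convention the flag alone decides.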
      
      Fixes: 2ca546b9 ("IB/sa: Route SA pathrecord query through netlink")
      Fixes: ae43f828 ("IB/core: Add IP to GID netlink offload")
      Link: https://lore.kernel.org/r/20191216120436.3204814-1-haakon.bugge@oracle.com
      Suggested-by: Mark Haywood <mark.haywood@oracle.com>
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Tested-by: Mark Haywood <mark.haywood@oracle.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • IB/mlx4: Fix memory leak in add_gid error flow · 6ddcb302
      Jack Morgenstein authored
      commit eaad647e upstream.
      
      In procedure mlx4_ib_add_gid(), if the driver is unable to update the FW
      gid table, there is a memory leak in the driver's copy of the gid table:
      the gid entry's context buffer is not freed.
      
      If such an error occurs, free the entry's context buffer, and mark the
      entry as available (by setting its context pointer to NULL).
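
      The error-flow pattern described here can be sketched as follows. This is a toy model under stated assumptions: toy_gid_entry and toy_add_gid() are hypothetical, and fw_ret stands in for the result of the firmware table update.

```c
#include <stdlib.h>

/* Hypothetical driver-side GID table entry; ctx == NULL means "available". */
struct toy_gid_entry {
	void *ctx;
};

/* fw_ret models the return value of the firmware GID table update. */
static int toy_add_gid(struct toy_gid_entry *e, int fw_ret)
{
	e->ctx = malloc(16);
	if (!e->ctx)
		return -1;

	if (fw_ret) {
		/* Firmware update failed: free the context and release the entry. */
		free(e->ctx);
		e->ctx = NULL;
		return fw_ret;
	}
	return 0;
}
```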
      
      Fixes: e26be1bf ("IB/mlx4: Implement ib_device callbacks")
      Link: https://lore.kernel.org/r/20200115085050.73746-1-leon@kernel.org
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • hv_sock: Remove the accept port restriction · 9a7f8a17
      Sunil Muthuswamy authored
      [ Upstream commit c742c59e ]
      
      Currently, hv_sock restricts the port the guest socket can accept
      connections on. hv_sock divides the socket port namespace into two parts
      for server side (listening socket), 0-0x7FFFFFFF & 0x80000000-0xFFFFFFFF
      (there are no restrictions on client port namespace). The first part
      (0-0x7FFFFFFF) is reserved for sockets where connections can be accepted.
      The second part (0x80000000-0xFFFFFFFF) is reserved for allocating ports
      for the peer (host) socket, once a connection is accepted.
      This reservation of the port namespace is specific to hv_sock and not
      known by the generic vsock library (ex: af_vsock). This is problematic
      because auto-binds/ephemeral ports are handled by the generic vsock
      library and it has no knowledge of this port reservation and could
      allocate a port that is not compatible with hv_sock (and legitimately so).
      The issue hasn't surfaced so far because the auto-bind code of vsock
      (__vsock_bind_stream) prior to the change 'VSOCK: bind to random port for
      VMADDR_PORT_ANY' would start walking up from LAST_RESERVED_PORT (1023) and
      start assigning ports. That will take a large number of iterations to hit
      0x7FFFFFFF. But, after the above change to randomize port selection, the
      issue has started coming up more frequently.
      There has really been no good reason to have this port reservation logic
      in hv_sock from the get go. Reserving a local port for peer ports is not
      how things are handled generally. Peer ports should reflect the peer port.
      This fixes the issue by lifting the port reservation, and also returns the
      right peer port. Since the code converts the GUID to the peer port (by
      using the first 4 bytes), there is a possibility of conflicts, but that
      seems like a reasonable risk to take, given this is limited to vsock and
      that only applies to all local sockets.
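
      The conflict risk mentioned above follows from using only four of the GUID's sixteen bytes. A minimal sketch, assuming a hypothetical toy_guid type and host byte order for the conversion (neither is taken from the hv_sock source):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte service GUID; only its first four bytes carry the port. */
struct toy_guid {
	unsigned char b[16];
};

static uint32_t port_from_guid(const struct toy_guid *g)
{
	uint32_t port;

	memcpy(&port, g->b, sizeof(port)); /* host byte order, an assumption */
	return port;
}
```

      Two GUIDs that differ only after the fourth byte map to the same peer port in this model.
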
      Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • ASoC: pcm: update FE/BE trigger order based on the command · e7751a4b
      Ranjani Sridharan authored
      [ Upstream commit acbf2774 ]
      
      Currently, the trigger orders SND_SOC_DPCM_TRIGGER_PRE/POST
      determine the order in which FE DAI and BE DAI are triggered.
      In the case of SND_SOC_DPCM_TRIGGER_PRE, the FE DAI is
      triggered before the BE DAI and in the case of
      SND_SOC_DPCM_TRIGGER_POST, the BE DAI is triggered before
      the FE DAI. And this order remains the same irrespective of the
      trigger command.
      
      In the case of the SOF driver, during playback, the FW
      expects the BE DAI to be triggered before the FE DAI during
      the START trigger. The BE DAI trigger handles the starting of
      Link DMA and so it must be started before the FE DAI is started
      to prevent xruns during pause/release. This can be addressed
      by setting the trigger order for the FE dai link to
      SND_SOC_DPCM_TRIGGER_POST. But during the STOP trigger,
      the FW expects the FE DAI to be triggered before the BE DAI.
      Retaining the same order during the START and STOP commands,
      results in FW error as the DAI component in the FW is still
      active.
      
      The issue can be fixed by mirroring the trigger order of
      FE and BE DAIs during the START and STOP trigger. So, with the
      trigger order set to SND_SOC_DPCM_TRIGGER_PRE, the FE DAI will be
      triggered first during the SNDRV_PCM_TRIGGER_START/RESUME/PAUSE_RELEASE
      commands and the BE DAI will be triggered first during the
      SNDRV_PCM_TRIGGER_STOP/SUSPEND/PAUSE_PUSH commands. Conversely, with the
      trigger order set to SND_SOC_DPCM_TRIGGER_POST, the BE DAI will be
      triggered first during the SNDRV_PCM_TRIGGER_START/RESUME/PAUSE_RELEASE
      commands and the FE DAI will be triggered first during the
      SNDRV_PCM_TRIGGER_STOP/SUSPEND/PAUSE_PUSH commands.
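
      The mirrored ordering can be sketched as a small decision function. This is an illustration only: the enum names are hypothetical stand-ins for the kernel constants, and it assumes the start-like group is START/RESUME/PAUSE_RELEASE and the stop-like group is STOP/SUSPEND/PAUSE_PUSH.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel constants named in the text. */
enum dpcm_order { TRIGGER_PRE, TRIGGER_POST };
enum pcm_cmd {
	CMD_START, CMD_RESUME, CMD_PAUSE_RELEASE,
	CMD_STOP, CMD_SUSPEND, CMD_PAUSE_PUSH
};

static bool is_start_like(enum pcm_cmd cmd)
{
	return cmd == CMD_START || cmd == CMD_RESUME ||
	       cmd == CMD_PAUSE_RELEASE;
}

/* Returns true when the FE DAI should be triggered before the BE DAI. */
static bool fe_first(enum dpcm_order order, enum pcm_cmd cmd)
{
	bool start = is_start_like(cmd);

	return order == TRIGGER_PRE ? start : !start;
}
```

      For a given trigger order, the stop-like commands simply reverse whatever the start-like commands do.
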
      Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
      Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
      Link: https://lore.kernel.org/r/20191104224812.3393-2-ranjani.sridharan@linux.intel.com
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 11 Feb, 2020 36 commits
    • Linux 4.19.103 · 35766839
      Greg Kroah-Hartman authored
    • rxrpc: Fix service call disconnection · 06748661
      David Howells authored
      [ Upstream commit b39a934e ]
      
      The recent patch that substituted a flag on an rxrpc_call for the
      connection pointer being NULL as an indication that a call was disconnected
      puts the set_bit in the wrong place for service calls.  This is only a
      problem if a call is implicitly terminated by a new call coming in on the
      same connection channel instead of a terminating ACK packet.
      
      In such a case, rxrpc_input_implicit_end_call() calls
      __rxrpc_disconnect_call(), which is now (incorrectly) setting the
      disconnection bit, meaning that when rxrpc_release_call() is later called,
      it doesn't call rxrpc_disconnect_call() and so the call isn't removed from
      the peer's error distribution list and the list gets corrupted.
      
      KASAN finds the issue as an access after release on a call, but the
      position at which it occurs is confusing as it appears to be related to a
      different call (the call site is where the latter call is being removed
      from the error distribution list and either the next or pprev pointer
      points to a previously released call).
      
      Fix this by moving the setting of the flag from __rxrpc_disconnect_call()
      to rxrpc_disconnect_call() in the same place that the connection pointer
      was being cleared.
      
      Fixes: 5273a191 ("rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • perf/core: Fix mlock accounting in perf_mmap() · a3623db4
      Song Liu authored
      commit 00346155 upstream.
      
      Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
      a perf ring buffer may lead to an integer underflow in locked memory
      accounting. This may lead to undesired behaviors, such as failures in
      BPF map creation.
      
      Address this by adjusting the accounting logic to take into account the
      possibility that the amount of already locked memory may exceed the
      current limit.
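
      The underflow can be reproduced with plain unsigned arithmetic. The sketch below is a simplified model, not the perf_mmap() code: mlock_split, split_naive() and split_clamped() are hypothetical names, and clamping the already locked amount to the limit is one way to implement the adjustment described above.

```c
/* How a request is split between per-user locked memory and the overflow. */
struct mlock_split {
	unsigned long user_extra; /* charged against the per-user limit */
	unsigned long extra;      /* charged elsewhere once the limit is hit */
};

/* Naive split: silently assumes user_locked never exceeds limit. */
static struct mlock_split split_naive(unsigned long user_locked,
				      unsigned long limit,
				      unsigned long request)
{
	struct mlock_split s = { request, 0 };

	if (user_locked + request > limit)
		s.extra = user_locked + request - limit;
	s.user_extra -= s.extra; /* wraps around when extra > request */
	return s;
}

/* Clamped split: tolerates user_locked > limit after the limit was lowered. */
static struct mlock_split split_clamped(unsigned long user_locked,
					unsigned long limit,
					unsigned long request)
{
	struct mlock_split s = { request, 0 };

	if (user_locked > limit)
		user_locked = limit;
	if (user_locked + request > limit)
		s.extra = user_locked + request - limit;
	s.user_extra -= s.extra; /* extra <= request here, so no wrap */
	return s;
}
```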
      
      Fixes: c4b75479 ("perf/core: Make the mlock accounting simple again")
      Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • clocksource: Prevent double add_timer_on() for watchdog_timer · 6284d30e
      Konstantin Khlebnikov authored
      commit febac332 upstream.
      
      Kernel crashes inside QEMU/KVM are observed:
      
        kernel BUG at kernel/time/timer.c:1154!
        BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().
      
      At the same time another cpu got:
      
        general protection fault: 0000 [#1] SMP PTI of poison pointer 0xdead000000000200 in:
      
        __hlist_del at include/linux/list.h:681
        (inlined by) detach_timer at kernel/time/timer.c:818
        (inlined by) expire_timers at kernel/time/timer.c:1355
        (inlined by) __run_timers at kernel/time/timer.c:1686
        (inlined by) run_timer_softirq at kernel/time/timer.c:1699
      
      Unfortunately kernel logs are badly scrambled, stacktraces are lost.
      
      Printing the timer->function before the BUG_ON() pointed to
      clocksource_watchdog().
      
      The execution of clocksource_watchdog() can race with a sequence of
      clocksource_stop_watchdog() .. clocksource_start_watchdog():
      
      expire_timers()
       detach_timer(timer, true);
        timer->entry.pprev = NULL;
       raw_spin_unlock_irq(&base->lock);
       call_timer_fn
        clocksource_watchdog()
      
      					clocksource_watchdog_kthread() or
      					clocksource_unbind()
      
      					spin_lock_irqsave(&watchdog_lock, flags);
      					clocksource_stop_watchdog();
      					 del_timer(&watchdog_timer);
      					 watchdog_running = 0;
      					spin_unlock_irqrestore(&watchdog_lock, flags);
      
      					spin_lock_irqsave(&watchdog_lock, flags);
      					clocksource_start_watchdog();
      					 add_timer_on(&watchdog_timer, ...);
      					 watchdog_running = 1;
      					spin_unlock_irqrestore(&watchdog_lock, flags);
      
        spin_lock(&watchdog_lock);
        add_timer_on(&watchdog_timer, ...);
         BUG_ON(timer_pending(timer) || !timer->function);
          timer_pending() -> true
          BUG()
      
      I.e. inside clocksource_watchdog() watchdog_timer could be already armed.
      
      Check timer_pending() before calling add_timer_on(). This is sufficient as
      all operations are synchronized by watchdog_lock.
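
      The guard can be illustrated with a toy timer. This is a sketch under stated assumptions: toy_timer, toy_add_timer() and toy_rearm_checked() are hypothetical, locking is omitted, and the bugs counter stands in for the BUG_ON() in add_timer_on().

```c
#include <stdbool.h>

/* Toy timer: 'pending' models timer_pending(), 'bugs' counts BUG_ON() hits. */
struct toy_timer {
	bool pending;
	int adds;
	int bugs;
};

/* Models add_timer_on(): arming an already pending timer is a bug. */
static void toy_add_timer(struct toy_timer *t)
{
	if (t->pending) {
		t->bugs++;
		return;
	}
	t->pending = true;
	t->adds++;
}

/* Re-arm from the watchdog callback, skipping a timer that is already armed. */
static void toy_rearm_checked(struct toy_timer *t)
{
	if (!t->pending)
		toy_add_timer(t);
}
```

      If the stop/start sequence has already re-armed the timer, the checked re-arm becomes a no-op instead of a second add.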
      
      Fixes: 75c5158f ("timekeeping: Update clocksource with stop_machine")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/apic/msi: Plug non-maskable MSI affinity race · 032a2bf9
      Thomas Gleixner authored
      commit 6f1a4891 upstream.
      
      Evan tracked down a subtle race between the update of the MSI message and
      the device raising an interrupt internally on PCI devices which do not
      support MSI masking. The update of the MSI message is non-atomic and
      consists of either 2 or 3 sequential 32bit wide writes to the PCI config
      space.
      
         - Write address low 32bits
         - Write address high 32bits (If supported by device)
         - Write data
      
      When an interrupt is migrated then both address and data might change, so
      the kernel attempts to mask the MSI interrupt first. But for MSI, masking
      is optional, so there exist devices which do not provide it. That means
      that if the device raises an interrupt internally between the writes then
      an MSI message is sent built from half-updated state.
      
      On x86 this can lead to spurious interrupts on the wrong interrupt
      vector when the affinity setting changes both address and data. As a
      consequence the device interrupt can be lost causing the device to
      become stuck or malfunctioning.
      
      Evan tried to handle that by disabling MSI across an MSI message
      update. That's not feasible because disabling MSI has issues on its own:
      
       If MSI is disabled the PCI device is routing an interrupt to the legacy
       INTx mechanism. The INTx delivery can be disabled, but the disablement is
       not working on all devices.
      
       Some devices lose interrupts when both MSI and INTx delivery are disabled.
      
      Another way to solve this would be to enforce the allocation of the same
      vector on all CPUs in the system for this kind of screwed devices. That
      could be done, but it would bring back the vector space exhaustion problems
      which got solved a few years ago.
      
      Fortunately the high address (if supported by the device) is only relevant
      when X2APIC is enabled which implies interrupt remapping. In the interrupt
      remapping case the affinity setting is happening at the interrupt remapping
      unit and the PCI MSI message is programmed only once when the PCI device is
      initialized.
      
      That makes it possible to solve it with a two step update:
      
        1) Target the MSI msg to the new vector on the current target CPU
      
        2) Target the MSI msg to the new vector on the new target CPU
      
      In both cases writing the MSI message is only changing a single 32bit word
      which prevents the issue of inconsistency.
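
      The two-step update can be modelled with a toy message in which the destination CPU lives in the address word and the vector in the data word. This is a deliberate simplification of the real MSI encoding, and toy_msi, step1(), step2() and words_changed() are hypothetical names.

```c
#include <stdint.h>

/* Toy MSI message: destination CPU in the address word, vector in the data word. */
struct toy_msi {
	uint32_t address_lo;
	uint32_t data;
};

/* Number of 32-bit words that differ between two messages. */
static int words_changed(struct toy_msi a, struct toy_msi b)
{
	return (a.address_lo != b.address_lo) + (a.data != b.data);
}

/* Step 1: new vector, still on the current CPU (only the data word changes). */
static struct toy_msi step1(struct toy_msi cur, uint32_t new_vector)
{
	cur.data = new_vector;
	return cur;
}

/* Step 2: move to the new CPU, vector unchanged (only the address word changes). */
static struct toy_msi step2(struct toy_msi mid, uint32_t new_cpu)
{
	mid.address_lo = new_cpu;
	return mid;
}
```

      Each step rewrites one word, so a device that fires between the two writes always sees a self-consistent message.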
      
      After writing the final destination it is necessary to check whether the
      device issued an interrupt while the intermediate state #1 (new vector,
      current CPU) was in effect.
      
      This is possible because the affinity change is always happening on the
      current target CPU. The code runs with interrupts disabled, so the
      interrupt can be detected by checking the IRR of the local APIC. If the
      vector is pending in the IRR then the interrupt is retriggered on the new
      target CPU by sending an IPI for the associated vector on the target CPU.
      
      This can cause spurious interrupts on both the local and the new target
      CPU.
      
       1) If the new vector is not in use on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then interrupt entry code will
          ignore that spurious interrupt. The vector is marked so that the
          'No irq handler for vector' warning is suppressed once.
      
       2) If the new vector is in use already on the local CPU then the IRR check
          might see a pending interrupt from the device which is using this
          vector. The IPI to the new target CPU will then invoke the handler of
          the device, which got the affinity change, even if that device did not
          issue an interrupt.
      
       3) If the new vector is in use already on the local CPU and the device
          affected by the affinity change raised an interrupt during the
          transitional state (step #1 above) then the handler of the device which
          uses that vector on the local CPU will be invoked.
      
      Case 1 is uninteresting and has no unintended side effects. Cases 2 and 3
      might expose issues in device driver interrupt handlers which are not
      prepared to handle a spurious interrupt correctly. This is not a
      regression, it's just exposing something which was already broken as
      spurious interrupts can happen for a lot of reasons and all driver
      handlers need to be able to deal with them.
      Reported-by: Evan Green <evgreen@chromium.org>
      Debugged-by: Evan Green <evgreen@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Evan Green <evgreen@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/87imkr4s7n.fsf@nanos.tec.linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cifs: fail i/o on soft mounts if sessionsetup errors out · 71a47ed6
      Ronnie Sahlberg authored
      commit b0dd940e upstream.
      
      RHBZ: 1579050
      
      If we have a soft mount we should fail commands for session-setup
      failures (such as the password having changed/ account being deleted/ ...)
      and return an error back to the application.
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section · 0a69047d
      David Hildenbrand authored
      [ Upstream commit e822969c ]
      
      Patch series "mm: fix max_pfn not falling on section boundary", v2.
      
      Playing with different memory sizes for a x86-64 guest, I discovered that
      some memmaps (highest section if max_mem does not fall on the section
      boundary) are marked as being valid and online, but contain garbage.  We
      have to properly initialize these memmaps.
      
      Looking at /proc/kpageflags and friends, I found some more issues,
      partially related to this.
      
      This patch (of 3):
      
      If max_pfn is not aligned to a section boundary, we can easily run into
      BUGs.  This can e.g., be triggered on x86-64 under QEMU by specifying a
      memory size that is not a multiple of 128MB (e.g., 4097MB, but also
      4160MB).  I was told that on real HW, we can easily have this scenario
      (esp., one of the main reasons sub-section hotadd of devmem was added).
      
      The issue is, that we have a valid memmap (pfn_valid()) for the whole
      section, and the whole section will be marked "online".
      pfn_to_online_page() will succeed, but the memmap contains garbage.
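
      The arithmetic behind the problem can be sketched as follows. The constants assume 4 KiB pages and the 128MB sections mentioned above; the helper names are hypothetical, and the max_pfn value 0x144000 used in the note after the code is an assumption chosen so that the 0x144001 address passed to page-types in this message falls just past it.

```c
#define PAGE_SHIFT        12 /* assumed 4 KiB pages */
#define SECTION_SIZE_BITS 27 /* assumed 128 MiB sections, as in the text */
#define PAGES_PER_SECTION (1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))

/* max_pfn rounded up to the next section boundary. */
static unsigned long section_end_pfn(unsigned long max_pfn)
{
	return (max_pfn + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1);
}

/* True when pfn lies in the online last section but beyond the end of memory. */
static int in_uninitialized_tail(unsigned long pfn, unsigned long max_pfn)
{
	return pfn >= max_pfn && pfn < section_end_pfn(max_pfn);
}
```

      With max_pfn = 0x144000 the last section extends to pfn 0x148000, so pfn 0x144001 has a memmap entry that nothing initialized.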
      
      E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
      4160M" - (see tools/vm/page-types.c):
      
      [  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
      [  200.477500] #PF: supervisor read access in kernel mode
      [  200.478334] #PF: error_code(0x0000) - not-present page
      [  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
      [  200.479557] Oops: 0000 [#4] SMP NOPTI
      [  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
      [  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
      [  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
      [  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
      [  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
      [  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
      [  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
      [  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
      [  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
      [  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
      [  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
      [  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
      [  200.488897] Call Trace:
      [  200.489115]  kpageflags_read+0xe9/0x140
      [  200.489447]  proc_reg_read+0x3c/0x60
      [  200.489755]  vfs_read+0xc2/0x170
      [  200.490037]  ksys_pread64+0x65/0xa0
      [  200.490352]  do_syscall_64+0x5c/0xa0
      [  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
      after cold/hot plugging a DIMM to such a system:
      
      [root@localhost ~]# cat /proc/kpageflags > /dev/null
      [  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
      [  111.517907] #PF: supervisor read access in kernel mode
      [  111.518333] #PF: error_code(0x0000) - not-present page
      [  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
      
      This patch fixes that by at least zero-ing out that memmap (so e.g.,
      page_to_pfn() will not crash).  Commit 907ec5fc ("mm: zero remaining
      unavailable struct pages") tried to fix a similar issue, but forgot to
      consider this special case.
      
      After this patch, there are still problems to solve.  E.g., not all of
      these pages falling into a memory hole will actually get initialized later
      and set PageReserved - they are only zeroed out - but at least the
      immediate crashes are gone.  A follow-up patch will take care of this.
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: <stable@vger.kernel.org>	[4.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm: return zero_resv_unavail optimization · f19a50c1
      Pavel Tatashin authored
      [ Upstream commit ec393a0f ]
      
      When checking for valid pfns in zero_resv_unavail(), it is not necessary
      to verify that pfns within pageblock_nr_pages ranges are valid, only the
      first one needs to be checked.  This is because memory for pages is
      allocated in contiguous chunks that contain pageblock_nr_pages struct
      pages.
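
      A user-space sketch of the idea: test validity once per pageblock and treat the rest of the block the same way. The names, the 512-page block size and the toy_pfn_valid() oracle are assumptions for illustration; the oracle's limit is expected to be pageblock aligned, mirroring the contiguous allocation described above.

```c
#define PAGEBLOCK_NR_PAGES 512UL /* assumed: 2 MiB pageblocks with 4 KiB pages */

/* Counts how often validity was tested, to show the saving. */
static unsigned long valid_checks;

/* Hypothetical validity oracle: a memmap exists for pfns below 'limit'. */
static int toy_pfn_valid(unsigned long pfn, unsigned long limit)
{
	valid_checks++;
	return pfn < limit;
}

/* Count pfns in [spfn, epfn) that would be zeroed, one check per pageblock. */
static unsigned long count_zeroed(unsigned long spfn, unsigned long epfn,
				  unsigned long limit)
{
	unsigned long pfn = spfn, n = 0;

	while (pfn < epfn) {
		unsigned long block = pfn & ~(PAGEBLOCK_NR_PAGES - 1);
		unsigned long next = block + PAGEBLOCK_NR_PAGES;
		unsigned long end = next < epfn ? next : epfn;

		if (toy_pfn_valid(block, limit))
			n += end - pfn; /* rest of this block is valid too */
		pfn = end;              /* jump straight to the next block */
	}
	return n;
}
```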
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com
      Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm: zero remaining unavailable struct pages · 9ac5917a
      Naoya Horiguchi authored
      [ Upstream commit 907ec5fc ]
      
      Patch series "mm: Fix for movable_node boot option", v3.
      
      This patch series contains a fix for the movable_node boot option issue
      which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM
      regions into memblock.reserved").
      
      The commit breaks the option because it changed the memory gap range to
      reserved memblock.  So, the node is marked as Normal zone even if the SRAT
      has Hot pluggable affinity.
      
      First and second patch fix the original issue which the commit tried to
      fix, then revert the commit.
      
      This patch (of 3):
      
      There is a kernel panic that is triggered when reading /proc/kpageflags on
      the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
      
        BUG: unable to handle kernel paging request at fffffffffffffffe
        PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
        RIP: 0010:stable_page_flags+0x27/0x3c0
        Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
        RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
        RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
        R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
        R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
        FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
        Call Trace:
         kpageflags_read+0xc7/0x120
         proc_reg_read+0x3c/0x60
         __vfs_read+0x36/0x170
         vfs_read+0x89/0x130
         ksys_pread64+0x71/0x90
         do_syscall_64+0x5b/0x160
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7efc42e75e23
        Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
      
      According to kernel bisection, this problem became visible due to commit
      f7f99100 which changes how struct pages are initialized.
      
      Memblock layout affects the pfn ranges covered by node/zone.  Consider
      that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
      default (no memmap= given) memblock layout is like below:
      
        MEMBLOCK configuration:
         memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x4
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
         memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      If you give memmap=1G!4G (so it just covers memory[0x2]),
      the range [0x100000000-0x13fffffff] is gone:
      
        MEMBLOCK configuration:
         memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x3
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      This shrinks node 0's pfn range because it is calculated from the
      address range of memblock.memory.  So some of the struct pages in the gap
      range are left uninitialized.

      We have a function zero_resv_unavail() which zeroes the struct pages
      outside memblock.memory, but currently it covers only the reserved
      unavailable range (i.e.  memblock.memory && !memblock.reserved).  This
      patch extends it to cover all unavailable ranges, which fixes the reported
      issue.
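
      Covering every unavailable range amounts to walking the holes before and between the memblock.memory regions. A minimal sketch, assuming sorted, non-overlapping regions with exclusive end addresses; struct range and find_gaps() are hypothetical, and the hole after the last region is left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t end; /* exclusive */
};

/*
 * Collect the holes before and between sorted, non-overlapping memory
 * regions. Returns the number of holes written to 'out'.
 */
static size_t find_gaps(const struct range *mem, size_t n,
			struct range *out, size_t max_out)
{
	uint64_t next = 0;
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (mem[i].start > next && count < max_out)
			out[count++] = (struct range){ next, mem[i].start };
		next = mem[i].end;
	}
	return count;
}
```

      Fed with the memmap=1G!4G layout shown above, the last hole runs from 0xbffd7000 to 0x140000000, which includes the range that was removed from node 0.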
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sashal@kernel.org>
      9ac5917a
    • Sean Christopherson's avatar
      KVM: Play nice with read-only memslots when querying host page size · 21b70d9b
      Sean Christopherson authored
      [ Upstream commit 42cde48b ]
      
      Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
      on read-only memslots due to gfn_to_hva() assuming writes.  Functionally,
      this allows x86 to create large mappings for read-only memslots that
      are backed by HugeTLB mappings.
      
      Note, the changelog for commit 05da4558 ("KVM: MMU: large page
      support") states "If the largepage contains write-protected pages, a
      large pte is not used.", but "write-protected" refers to pages that are
      temporarily read-only, e.g. read-only memslots didn't even exist at the
      time.
      
      Fixes: 4d8b81ab ("KVM: introduce readonly memslot")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      21b70d9b
    • Sean Christopherson's avatar
      KVM: Use vcpu-specific gva->hva translation when querying host page size · dabf1a10
      Sean Christopherson authored
      [ Upstream commit f9b84e19 ]
      
      Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
      correct set of memslots is used when handling x86 page faults in SMM.
      
      Fixes: 54bf36aa ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      dabf1a10
    • Miaohe Lin's avatar
      KVM: nVMX: vmread should not set rflags to specify success in case of #PF · eb2c9541
      Miaohe Lin authored
      [ Upstream commit a4d956b9 ]
      
      If writing to the vmread destination operand results in a #PF, vmread
      should not call nested_vmx_succeed() to set rflags to specify success.
      This is similar to what is done in VMPTRST (see handle_vmptrst()).
      
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      eb2c9541
    • Sean Christopherson's avatar
      KVM: VMX: Add non-canonical check on writes to RTIT address MSRs · 57211b73
      Sean Christopherson authored
      [ Upstream commit fe6ed369 ]
      
      Reject writes to RTIT address MSRs if the data being written is a
      non-canonical address as the MSRs are subject to canonical checks, e.g.
      KVM will trigger an unchecked #GP when loading the values to hardware
      during pt_guest_enter().
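
      The canonical check mentioned above can be illustrated as follows.
      This is a sketch only: it assumes 48-bit virtual addresses, and the
      names is_noncanonical() and write_rtit_addr() are hypothetical
      stand-ins, not KVM functions.

```python
def is_noncanonical(addr: int, vaddr_bits: int = 48) -> bool:
    """True if bits 63..(vaddr_bits - 1) of a 64-bit address are not all equal."""
    addr &= (1 << 64) - 1
    upper = addr >> (vaddr_bits - 1)             # bits 63..47 for 48-bit addressing
    all_ones = (1 << (64 - vaddr_bits + 1)) - 1  # those same bits, all set
    return upper not in (0, all_ones)


def write_rtit_addr(msrs: dict, index: int, data: int) -> None:
    """Hypothetical MSR write path: reject the value before it reaches hardware."""
    if is_noncanonical(data):
        raise ValueError("non-canonical address rejected")
    msrs[index] = data


msrs = {}
write_rtit_addr(msrs, 0, 0x00007fffffffffff)      # canonical, accepted
try:
    write_rtit_addr(msrs, 1, 0x0000800000000000)  # bit 47 set, bits 63..48 clear
except ValueError as exc:
    print("rejected:", exc)
```

      Rejecting the value at write time means the bad address never has to
      be loaded into hardware later.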
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      57211b73
    • Sean Christopherson's avatar
      KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM · 9b376cb6
      Sean Christopherson authored
      [ Upstream commit 736c291c ]
      
      Convert a plethora of parameters and variables in the MMU and page fault
      flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
      
      Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
      addresses.  When TDP is enabled, the fault address is a guest physical
      address and thus can be a 64-bit value, even when both KVM and its guest
      are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
      64-bit field, not a natural width field.
      
      Using a gva_t for the fault address means KVM will incorrectly drop the
      upper 32-bits of the GPA.  Ditto for gva_to_gpa() when it is used to
      translate L2 GPAs to L1 GPAs.
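
      The truncation described above is easy to see in a small model.  This
      sketch assumes that gva_t is as wide as a native long (32 bits on a
      32-bit kernel) and that gpa_t is always 64 bits; the fault address is
      a made-up example value.

```python
GVA_T_MASK_32BIT = 0xFFFF_FFFF          # assumed width of gva_t on a 32-bit kernel
GPA_T_MASK = 0xFFFF_FFFF_FFFF_FFFF      # assumed width of gpa_t on any kernel


def as_gva_t_32bit(addr: int) -> int:
    """Model storing an address in a 32-bit gva_t: the upper bits are dropped."""
    return addr & GVA_T_MASK_32BIT


def as_gpa_t(addr: int) -> int:
    """Model storing an address in a 64-bit gpa_t: nothing is lost."""
    return addr & GPA_T_MASK


fault_gpa = 0x1_2345_6000  # hypothetical guest physical address above 4 GiB

print(hex(as_gva_t_32bit(fault_gpa)))  # upper 32 bits are gone
print(hex(as_gpa_t(fault_gpa)))        # full address preserved
```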
      
      Opportunistically rename variables and parameters to better reflect the
      dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
      "addr" instead of "vaddr" when the address may be either a GVA or an L2
      GPA.  Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
      a confusing "gpa_t gva" declaration; this also sets the stage for a
      future patch to combine nonpaging_page_fault() and tdp_page_fault() with
      minimal churn.
      
      Sprinkle in a few comments to document flows where an address is known
      to be a GVA and thus can be safely truncated to a 32-bit value.  Add
      WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
      document such cases and detect bugs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9b376cb6
    • Sean Christopherson's avatar
      KVM: x86/mmu: Apply max PA check for MMIO sptes to 32-bit KVM · c2e29d0f
      Sean Christopherson authored
      [ Upstream commit e30a7d62 ]
      
      Remove the bogus 64-bit only condition from the check that disables MMIO
      spte optimization when the system supports the max PA, i.e. doesn't have
      any reserved PA bits.  32-bit KVM always uses PAE paging for the shadow
      MMU, and per Intel's SDM:
      
        PAE paging translates 32-bit linear addresses to 52-bit physical
        addresses.
      
      The kernel's restrictions on max physical addresses are limits on how
      much memory the kernel can reasonably use, not what physical addresses
      are supported by hardware.
      
      Fixes: ce88decf ("KVM: MMU: mmio page fault support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c2e29d0f
    • Josef Bacik's avatar
      btrfs: flush write bio if we loop in extent_write_cache_pages · 86047371
      Josef Bacik authored
      [ Upstream commit 96bf313ecb33567af4cb53928b0c951254a02759 ]
      
      There exists a deadlock with range_cyclic that has existed forever.  If
      we loop around with a bio already built we could deadlock with a writer
      who has the page locked that we're attempting to write but is waiting on
      a page in our bio to be written out.  The task traces are as follows
      
        PID: 1329874  TASK: ffff889ebcdf3800  CPU: 33  COMMAND: "kworker/u113:5"
         #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
         #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
         #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
         #3 [ffffc900297bb708] __lock_page at ffffffff811f145b
         #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
         #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
         #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
         #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
         #8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
         #9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
      
        PID: 2167901  TASK: ffff889dc6a59c00  CPU: 14  COMMAND: "aio-dio-invalid"
         #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
         #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
         #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
         #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
         #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
         #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
         #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
         #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
         #8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
         #9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
      
      I used drgn to find the respective pages we were stuck on
      
      page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
      page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
      
      As you can see the kworker is waiting for bit 0 (PG_locked) on index
      7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
      8148.  aio-dio-invalid has 7680, and the kworker epd looks like the
      following
      
        crash> struct extent_page_data ffffc900297bbbb0
        struct extent_page_data {
          bio = 0xffff889f747ed830,
          tree = 0xffff889eed6ba448,
          extent_locked = 0,
          sync_io = 0
        }
      
      Probably worth mentioning as well that it waits for writeback of the
      page to complete while holding a lock on it (at prepare_pages()).
      
      Using drgn I walked the bio pages looking for page
      0xffffea00fbfc7500 which is the one we're waiting for writeback on
      
        bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
        for i in range(0, bio.bi_vcnt.value_()):
            bv = bio.bi_io_vec[i]
            if bv.bv_page.value_() == 0xffffea00fbfc7500:
                print("FOUND IT")
      
      which validated what I suspected.
      
      The fix for this is simple, flush the epd before we loop back around to
      the beginning of the file during writeout.
      
      Fixes: b293f02e ("Btrfs: Add writepages support")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      86047371
    • Wayne Lin's avatar
      drm/dp_mst: Remove VCPI while disabling topology mgr · 4ecba33e
      Wayne Lin authored
      [ Upstream commit 64e62bdf ]
      
      [Why]
      
      This patch addresses an issue observed when hotplugging DP
      daisy-chained monitors.
      
      e.g.
      src-mstb-mstb-sst -> src (unplug) mstb-mstb-sst -> src-mstb-mstb-sst
      (plug in again)
      
      Once a DP MST capable device is unplugged, the driver calls
      drm_dp_mst_topology_mgr_set_mst() to disable MST. This function
      cleans up the topology manager's data while disabling mst_state.
      However, it doesn't clean up the proposed_vcpis of the topology manager.
      If proposed_vcpi is not reset, then once MST daisy-chained monitors are
      plugged in later, the code fails at port validation while trying to
      allocate payloads.
      
      When an MST capable device is plugged in again and the driver tries to
      allocate payloads by calling drm_dp_update_payload_part1(), this
      function iterates over all proposed virtual channels to see if
      any proposed VCPI's num_slots is greater than 0. If any proposed
      VCPI's num_slots is greater than 0 and the port which the
      specific virtual channel is directed to is not in the topology, the
      code fails at the port validation. Since there are stale VCPI
      allocations from the previous topology enablement in proposed_vcpi[],
      the code fails at port validation and returns EINVAL.
      
      [How]
      
      Clean up the data of stale proposed_vcpi[] and reset mgr->proposed_vcpis
      to NULL while disabling mst in drm_dp_mst_topology_mgr_set_mst().
      
      Changes since v1:
      * Add more details to the commit message to describe the issue which
        the patch is trying to fix
      
      Signed-off-by: Wayne Lin <Wayne.Lin@amd.com>
      [added cc to stable]
      Signed-off-by: Lyude Paul <lyude@redhat.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20191205090043.7580-1-Wayne.Lin@amd.com
      Cc: <stable@vger.kernel.org> # v3.17+
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4ecba33e
    • Claudiu Beznea's avatar
      drm: atmel-hlcdc: enable clock before configuring timing engine · 1f1611dc
      Claudiu Beznea authored
      [ Upstream commit 2c1fb9d8 ]
      
      Changing the pixel clock source without having this clock source enabled
      will block the timing engine, and the operations that follow (in this
      case setting the ATMEL_HLCDC_CFG(5) settings in
      atmel_hlcdc_crtc_mode_set_nofb()) will fail. It is recommended (although
      this is not present in the datasheet) to enable the pixel clock source
      before making any changes to the timing engine (only the SAM9X60
      datasheet specifies that the peripheral clock and pixel clock must be
      enabled before using the LCD controller).
      
      Fixes: 1a396789 ("drm: add Atmel HLCDC Display Controller support")
      Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com>
      Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: <stable@vger.kernel.org> # v4.0+
      Link: https://patchwork.freedesktop.org/patch/msgid/1576672109-22707-3-git-send-email-claudiu.beznea@microchip.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1f1611dc
    • Josef Bacik's avatar
      btrfs: free block groups after free'ing fs trees · 159db2ae
      Josef Bacik authored
      [ Upstream commit 4e19443d ]
      
      Sometimes when running generic/475 we would trip the
      WARN_ON(cache->reserved) check when free'ing the block groups on umount.
      This is because sometimes we don't commit the transaction because of IO
      errors and thus do not cleanup the tree logs until at umount time.
      
      These blocks are still reserved until they are cleaned up, but they
      aren't cleaned up until _after_ we do the free block groups work.  Fix
      this by moving the free after free'ing the fs roots, that way all of the
      tree logs are cleaned up and we have a properly cleaned fs.  A bunch of
      loops of generic/475 confirmed this fixes the problem.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      159db2ae
    • Anand Jain's avatar
      btrfs: use bool argument in free_root_pointers() · 381a16fa
      Anand Jain authored
      [ Upstream commit 4273eaff ]
      
      We don't need an int argument; a bool will do in free_root_pointers().
      Also rename the argument, as it confused two people.
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      381a16fa
    • Eric Biggers's avatar
      ext4: fix deadlock allocating crypto bounce page from mempool · 987bb7a3
      Eric Biggers authored
      [ Upstream commit 547c556f ]
      
      ext4_writepages() on an encrypted file has to encrypt the data, but it
      can't modify the pagecache pages in-place, so it encrypts the data into
      bounce pages and writes those instead.  All bounce pages are allocated
      from a mempool using GFP_NOFS.
      
      This is not correct use of a mempool, and it can deadlock.  This is
      because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never
      fail" mode for mempool_alloc() where a failed allocation will fall back
      to waiting for one of the preallocated elements in the pool.
      
      But since this mode is used for all a bio's pages and not just the
      first, it can deadlock waiting for pages already in the bio to be freed.
      
      This deadlock can be reproduced by patching mempool_alloc() to pretend
      that pool->alloc() always fails (so that it always falls back to the
      preallocations), and then creating an encrypted file of size > 128 KiB.
      
      Fix it by only using GFP_NOFS for the first page in the bio.  For
      subsequent pages just use GFP_NOWAIT, and if any of those fail, just
      submit the bio and start a new one.
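
      The allocation strategy described above can be modelled with a toy
      pool.  This is a sketch under stated assumptions, not the ext4 code:
      BouncePool and write_pages() are hypothetical names, the pool hands
      out only its preallocated reserve (mirroring the reproduction where
      pool->alloc() always fails), and submitting a bio is assumed to free
      its bounce pages immediately.

```python
class BouncePool:
    """Toy stand-in for a mempool that only has its preallocated reserve."""

    def __init__(self, reserve: int):
        self.free = reserve

    def alloc(self, may_wait: bool) -> bool:
        if self.free > 0:
            self.free -= 1
            return True
        if may_wait:
            # A waiting allocation would sleep until a page is freed; if the
            # pages are all held by the caller's own bio, that never happens.
            raise RuntimeError("would deadlock: reserve is held by our own bio")
        return False  # non-waiting allocation simply fails

    def release(self, count: int) -> None:
        self.free += count


def write_pages(pool: BouncePool, npages: int) -> list:
    """Return the sizes of the bios submitted while writing npages pages."""
    submitted = []
    in_bio = 0
    for _ in range(npages):
        # Only the first page of a bio may wait (GFP_NOFS in the text above);
        # later pages use a non-waiting allocation (GFP_NOWAIT).
        if not pool.alloc(may_wait=(in_bio == 0)):
            submitted.append(in_bio)
            pool.release(in_bio)       # models I/O completion freeing the pages
            in_bio = 0
            pool.alloc(may_wait=True)  # first page of the new bio
        in_bio += 1
    if in_bio:
        submitted.append(in_bio)
        pool.release(in_bio)
    return submitted


bios = write_pages(BouncePool(reserve=32), npages=80)
print(bios)
```

      With a reserve of 32 pages, an 80-page write is split into several
      smaller bios instead of blocking on pages that the same bio already
      holds.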
      
      This will need to be fixed in f2fs too, but that's less straightforward.
      
      Fixes: c9af28fd ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
      Cc: stable@vger.kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      987bb7a3
    • Florian Fainelli's avatar
      net: dsa: b53: Always use dev->vlan_enabled in b53_configure_vlan() · 25a1729e
      Florian Fainelli authored
      [ Upstream commit df373702 ]
      
      b53_configure_vlan() is called by the bcm_sf2 driver upon setup and
      indirectly through resume as well. During the initial setup, we are
      guaranteed that dev->vlan_enabled is false, so there is no change in
      behavior, however during suspend, we may have enabled VLANs before, so we
      do want to restore that setting.
      
      Fixes: dad8d7c6 ("net: dsa: b53: Properly account for VLAN filtering")
      Fixes: 967dd82f ("net: dsa: b53: Add support for Broadcom RoboSwitch")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      25a1729e
    • Harini Katakam's avatar
      net: macb: Limit maximum GEM TX length in TSO · 62e5f512
      Harini Katakam authored
      [ Upstream commit f822e9c4 ]
      
      GEM_MAX_TX_LEN currently resolves to 0x3FF8 for any IP version supporting
      TSO with full 14bits of length field in payload descriptor. But an IP
      errata causes false amba_error (bit 6 of ISR) when length in payload
      descriptors is specified above 16387. The error occurs because the DMA
      falsely concludes that there is not enough space in SRAM for incoming
      payload. These errors were observed continuously under stress of large
      packets using iperf on a version where SRAM was 16K for each queue. This
      errata will be documented shortly and affects all versions since TSO
      functionality was added. Hence limit the max length to 0x3FC0 (rounded).
      
      Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      62e5f512
    • Harini Katakam's avatar
      net: macb: Remove unnecessary alignment check for TSO · de784e74
      Harini Katakam authored
      [ Upstream commit 41c1ef97 ]
      
      The IP TSO implementation does NOT require the length to be a
      multiple of 8. That is only a requirement for UFO as per IP
      documentation. Hence, exit macb_features_check function in the
      beginning if the protocol is not UDP. Only when it is UDP,
      proceed further to the alignment checks. Update comments to
      reflect the same. Also remove dead code checking for protocol
      TCP when calculating header length.
      
      Fixes: 1629dd4f ("cadence: Add LSO support.")
      Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      de784e74
    • Raed Salem's avatar
      net/mlx5: IPsec, fix memory leak at mlx5_fpga_ipsec_delete_sa_ctx · 16415cf7
      Raed Salem authored
      [ Upstream commit 08db2cf5 ]
      
      SA context is allocated at mlx5_fpga_ipsec_create_sa_ctx,
      however the counterpart mlx5_fpga_ipsec_delete_sa_ctx function
      nullifies sa_ctx pointer without freeing the memory allocated,
      hence the memory leak.
      
      Fix by freeing the SA context when the SA is released.
      
      Fixes: d6c4f029 ("net/mlx5: Refactor accel IPSec code")
      Signed-off-by: Raed Salem <raeds@mellanox.com>
      Reviewed-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      16415cf7
    • Raed Salem's avatar
      net/mlx5: IPsec, Fix esp modify function attribute · c893c6e6
      Raed Salem authored
      [ Upstream commit 0dc2c534 ]
      
      The return value of mlx5_fpga_esp_validate_xfrm_attrs is wrongly
      negated: a zero value indicates success, but the negation causes it to
      be treated as a failure instead.
      
      Fix by removing the unary not operator.
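
      The effect of negating a "0 on success" return value can be shown
      with a small model.  This is only a sketch: validate_attrs(), the
      keylen rule and the errno values returned by the callers are
      hypothetical and are not taken from the mlx5 driver.

```python
EINVAL = 22
EOPNOTSUPP = 95


def validate_attrs(attrs: dict) -> int:
    """Hypothetical validator: 0 on success, negative errno on failure."""
    if attrs.get("keylen") not in (128, 256):
        return -EINVAL
    return 0


def modify_buggy(attrs: dict) -> int:
    if not validate_attrs(attrs):   # inverted: 0 (success) takes the error path
        return -EOPNOTSUPP
    return 0


def modify_fixed(attrs: dict) -> int:
    if validate_attrs(attrs):       # non-zero means validation failed
        return -EOPNOTSUPP
    return 0


good = {"keylen": 128}
bad = {"keylen": 7}
print(modify_buggy(good), modify_buggy(bad))
print(modify_fixed(good), modify_fixed(bad))
```

      In the buggy variant valid attributes are rejected and invalid ones
      are accepted; dropping the negation restores the intended behaviour.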
      
      Fixes: 05564d0a ("net/mlx5: Add flow-steering commands for FPGA IPSec implementation")
      Signed-off-by: Raed Salem <raeds@mellanox.com>
      Reviewed-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c893c6e6
    • Florian Fainelli's avatar
      net: systemport: Avoid RBUF stuck in Wake-on-LAN mode · b81a002b
      Florian Fainelli authored
      [ Upstream commit 263a425a ]
      
      After a number of suspend and resume cycles, it is possible for the RBUF
      to be stuck in Wake-on-LAN mode, despite the MPD enable bit being
      cleared which instructed the RBUF to exit that mode.
      
      Avoid creating that problematic condition by clearing the RX_EN and
      TX_EN bits in the UniMAC prior to disabling the Magic Packet Detector
      logic, which is guaranteed to make the RBUF exit Wake-on-LAN mode.
      
      Fixes: 83e82f4c ("net: systemport: add Wake-on-LAN support")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b81a002b
    • Cong Wang's avatar
      net_sched: fix a resource leak in tcindex_set_parms() · 7b3dbf95
      Cong Wang authored
      [ Upstream commit 52b5ae50 ]
      
      Jakub noticed there is a potential resource leak in
      tcindex_set_parms(): when tcindex_filter_result_init() fails
      and it jumps to 'errout1' which doesn't release the memory
      and resources allocated by tcindex_alloc_perfect_hash().
      
      We should just jump to 'errout_alloc' which calls
      tcindex_free_perfect_hash().
      
      Fixes: b9a24bb7 ("net_sched: properly handle failure case of tcf_exts_init()")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7b3dbf95
    • Lorenzo Bianconi's avatar
      net: mvneta: move rx_dropped and rx_errors in per-cpu stats · af746042
      Lorenzo Bianconi authored
      [ Upstream commit c35947b8 ]
      
      Move the rx_dropped and rx_errors counters into mvneta_pcpu_stats in
      order to avoid possible races when updating statistics.
      
      Fixes: 562e2f46 ("net: mvneta: Improve the buffer allocation method for SWBM")
      Fixes: dc35a10f ("net: mvneta: bm: add support for hardware buffer management")
      Fixes: c5aff182 ("net: mvneta: driver for Marvell Armada 370/XP network unit")
      Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      af746042
    • Florian Fainelli's avatar
      net: dsa: bcm_sf2: Only 7278 supports 2Gb/sec IMP port · fbd4c421
      Florian Fainelli authored
      [ Upstream commit de34d708 ]
      
      The 7445 switch clocking profiles do not allow us to run the IMP port at
      2Gb/sec in a way that it is reliable and consistent. Make sure that the
      setting is only applied to the 7278 family.
      
      Fixes: 8f1880cb ("net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fbd4c421
    • Eric Dumazet's avatar
      bonding/alb: properly access headers in bond_alb_xmit() · 6513fd0a
      Eric Dumazet authored
      [ Upstream commit 38f88c45 ]
      
      syzbot managed to send an IPX packet through bond_alb_xmit()
      and af_packet and triggered a use-after-free.
      
      First, bond_alb_xmit() was using ipx_hdr() helper to reach
      the IPX header, but ipx_hdr() was using the transport offset
      instead of the network offset. In the particular syzbot
      report, the transport offset was 0xFFFF.
      
      This patch removes ipx_hdr() since it was only (mis)used from bonding.
      
      Then we need to make sure IPv4/IPv6/IPX headers are pulled
      in skb->head before dereferencing anything.
      
      BUG: KASAN: use-after-free in bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
      Read of size 2 at addr ffff8801ce56dfff by task syz-executor.2/18108
       (if (ipx_hdr(skb)->ipx_checksum != IPX_NO_CHECKSUM) ...)
      
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       [<ffffffff8441fc42>] __dump_stack lib/dump_stack.c:17 [inline]
       [<ffffffff8441fc42>] dump_stack+0x14d/0x20b lib/dump_stack.c:53
       [<ffffffff81a7dec4>] print_address_description+0x6f/0x20b mm/kasan/report.c:282
       [<ffffffff81a7e0ec>] kasan_report_error mm/kasan/report.c:380 [inline]
       [<ffffffff81a7e0ec>] kasan_report mm/kasan/report.c:438 [inline]
       [<ffffffff81a7e0ec>] kasan_report.cold+0x8c/0x2a0 mm/kasan/report.c:422
       [<ffffffff81a7dc4f>] __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:469
       [<ffffffff82c8c00a>] bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
       [<ffffffff82c60c74>] __bond_start_xmit drivers/net/bonding/bond_main.c:4199 [inline]
       [<ffffffff82c60c74>] bond_start_xmit+0x4f4/0x1570 drivers/net/bonding/bond_main.c:4224
       [<ffffffff83baa558>] __netdev_start_xmit include/linux/netdevice.h:4525 [inline]
       [<ffffffff83baa558>] netdev_start_xmit include/linux/netdevice.h:4539 [inline]
       [<ffffffff83baa558>] xmit_one net/core/dev.c:3611 [inline]
       [<ffffffff83baa558>] dev_hard_start_xmit+0x168/0x910 net/core/dev.c:3627
       [<ffffffff83bacf35>] __dev_queue_xmit+0x1f55/0x33b0 net/core/dev.c:4238
       [<ffffffff83bae3a8>] dev_queue_xmit+0x18/0x20 net/core/dev.c:4278
       [<ffffffff84339189>] packet_snd net/packet/af_packet.c:3226 [inline]
       [<ffffffff84339189>] packet_sendmsg+0x4919/0x70b0 net/packet/af_packet.c:3252
       [<ffffffff83b1ac0c>] sock_sendmsg_nosec net/socket.c:673 [inline]
       [<ffffffff83b1ac0c>] sock_sendmsg+0x12c/0x160 net/socket.c:684
       [<ffffffff83b1f5a2>] __sys_sendto+0x262/0x380 net/socket.c:1996
       [<ffffffff83b1f700>] SYSC_sendto net/socket.c:2008 [inline]
       [<ffffffff83b1f700>] SyS_sendto+0x40/0x60 net/socket.c:2004
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6513fd0a
    • Andreas Kemnade's avatar
      mfd: rn5t618: Mark ADC control register volatile · 5e4013f9
      Andreas Kemnade authored
      commit 2f3dc25c upstream.
      
      There is a bit which gets cleared after conversion.
      
      Fixes: 9bb9e29c ("mfd: Add Ricoh RN5T618 PMIC core driver")
      Signed-off-by: Andreas Kemnade <andreas@kemnade.info>
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5e4013f9
    • Marco Felsch's avatar
      mfd: da9062: Fix watchdog compatible string · 17d00207
      Marco Felsch authored
      commit 1112ba02 upstream.
      
      The watchdog driver compatible is "dlg,da9062-watchdog" and not
      "dlg,da9062-wdt". Therefore the mfd-core can't populate the of_node and
      fwnode. As a result, the watchdog driver can't parse the devicetree.
      
      Fixes: 9b40b030 ("mfd: da9062: Supply core driver")
      Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>
      Acked-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Adam Thomson <Adam.Thomson.Opensource@diasemi.com>
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      17d00207
    • Dan Carpenter's avatar
      ubi: Fix an error pointer dereference in error handling code · d9e9451c
      Dan Carpenter authored
      commit 5d3805af upstream.
      
      If "seen_pebs = init_seen(ubi);" fails then "seen_pebs" is an error pointer
      and we try to kfree() it which results in an Oops.
      
      This patch re-arranges the error handling so now it only frees things
      which have been allocated successfully.
      
      Fixes: daef3dd1 ("UBI: Fastmap: Add self check to detect absent PEBs")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Richard Weinberger <richard@nod.at>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9e9451c
    • Sascha Hauer's avatar
      ubi: fastmap: Fix inverted logic in seen selfcheck · 5fe3a95d
      Sascha Hauer authored
      commit ef5aafb6 upstream.
      
      set_seen() sets the bit corresponding to the PEB number in the bitmap,
      so when self_check_seen() wants to find PEBs that haven't been seen we
      have to print the PEBs that have their bit cleared, not the ones which
      have it set.
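
      The inverted test can be illustrated with an integer used as a
      bitmap.  This is a sketch: set_seen() here is a simplified stand-in
      for the kernel helper of the same name, and the two listing functions
      are hypothetical.

```python
def set_seen(bitmap: int, pnum: int) -> int:
    """Return the bitmap with the bit for PEB number pnum set."""
    return bitmap | (1 << pnum)


def unseen_pebs(bitmap: int, peb_count: int) -> list:
    """Correct check: report the PEBs whose bit is still cleared."""
    return [p for p in range(peb_count) if not bitmap & (1 << p)]


def unseen_pebs_inverted(bitmap: int, peb_count: int) -> list:
    """Inverted check: wrongly reports the PEBs whose bit is set."""
    return [p for p in range(peb_count) if bitmap & (1 << p)]


seen = 0
for pnum in (0, 1, 3):
    seen = set_seen(seen, pnum)

print(unseen_pebs(seen, 5))
print(unseen_pebs_inverted(seen, 5))
```

      With PEBs 0, 1 and 3 marked as seen out of five, only the first
      function reports the PEBs that were actually missed.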
      
      Fixes: 5d71afb0 ("ubi: Use bitmaps in Fastmap self-check code")
      Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
      Signed-off-by: Richard Weinberger <richard@nod.at>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5fe3a95d
    • Trond Myklebust's avatar
      nfsd: Return the correct number of bytes written to the file · 9939dffe
      Trond Myklebust authored
      commit 09a80f2a upstream.
      
      We must allow for the fact that iov_iter_write() could have returned
      a short write (e.g. if there was an ENOSPC issue).
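
      A short write can be modelled in a few lines.  This is a sketch, not
      the nfsd code: write_some() and the two handlers are hypothetical,
      and the space limit stands in for a condition such as ENOSPC.

```python
def write_some(store: bytearray, data: bytes, space_left: int) -> int:
    """Toy write: stores at most space_left bytes and returns how many it stored."""
    count = min(len(data), space_left)
    store.extend(data[:count])
    return count


def handle_write_wrong(store: bytearray, data: bytes, space_left: int) -> int:
    write_some(store, data, space_left)
    return len(data)   # claims everything was written, even after a short write


def handle_write_right(store: bytearray, data: bytes, space_left: int) -> int:
    return write_some(store, data, space_left)   # report what was really written


payload = b"abcdef"
wrong = handle_write_wrong(bytearray(), payload, space_left=4)
store = bytearray()
right = handle_write_right(store, payload, space_left=4)
print(wrong, right, bytes(store))
```

      Only the second handler lets the caller notice that two bytes still
      need to be written.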
      
      Fixes: d890be15 ("nfsd: Add I/O trace points in the NFSv4 write path")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9939dffe