1. 26 Apr, 2018 40 commits
    • Hans de Goede's avatar
      ACPI / bus: Do not call _STA on battery devices with unmet dependencies · f920e914
      Hans de Goede authored
      
      [ Upstream commit 54ddce70 ]
      
      The battery code uses acpi_device->dep_unmet to check for unmet deps and
      if there are unmet deps it does not bind to the device to avoid errors
      about missing OpRegions when calling ACPI methods on the device.
      
      The missing OpRegions when there are unmet deps problem also applies to
      the _STA method of some battery devices and calling it too early results
      in errors like these:
      
      [    0.123579] ACPI Error: No handler for Region [ECRM] (00000000ba9edc4c)
                     [GenericSerialBus] (20170831/evregion-166)
      [    0.123601] ACPI Error: Region GenericSerialBus (ID=9) has no handler
                     (20170831/exfldio-299)
      [    0.123618] ACPI Error: Method parse/execution failed
                     \_SB.I2C1.BAT1._STA, AE_NOT_EXIST (20170831/psparse-550)
      
      This commit fixes these errors happening when acpi_get_bus_status gets
      called by checking dep_unmet for battery devices and reporting a status
      of 0 until all dependencies are met.
      Signed-off-by: default avatarHans de Goede <hdegoede@redhat.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f920e914
    • Chen Yu's avatar
      ACPI: processor_perflib: Do not send _PPC change notification if not ready · 51939996
      Chen Yu authored
      
      [ Upstream commit ba1edb9a ]
      
      The following warning was triggered after resumed from S3 -
      if all the nonboot CPUs were put offline before suspend:
      
      [ 1840.329515] unchecked MSR access error: RDMSR from 0x771 at rIP: 0xffffffff86061e3a (native_read_msr+0xa/0x30)
      [ 1840.329516] Call Trace:
      [ 1840.329521]  __rdmsr_on_cpu+0x33/0x50
      [ 1840.329525]  generic_exec_single+0x81/0xb0
      [ 1840.329527]  smp_call_function_single+0xd2/0x100
      [ 1840.329530]  ? acpi_ds_result_pop+0xdd/0xf2
      [ 1840.329532]  ? acpi_ds_create_operand+0x215/0x23c
      [ 1840.329534]  rdmsrl_on_cpu+0x57/0x80
      [ 1840.329536]  ? cpumask_next+0x1b/0x20
      [ 1840.329538]  ? rdmsrl_on_cpu+0x57/0x80
      [ 1840.329541]  intel_pstate_update_perf_limits+0xf3/0x220
      [ 1840.329544]  ? notifier_call_chain+0x4a/0x70
      [ 1840.329546]  intel_pstate_set_policy+0x4e/0x150
      [ 1840.329548]  cpufreq_set_policy+0xcd/0x2f0
      [ 1840.329550]  cpufreq_update_policy+0xb2/0x130
      [ 1840.329552]  ? cpufreq_update_policy+0x130/0x130
      [ 1840.329556]  acpi_processor_ppc_has_changed+0x65/0x80
      [ 1840.329558]  acpi_processor_notify+0x80/0x100
      [ 1840.329561]  acpi_ev_notify_dispatch+0x44/0x5c
      [ 1840.329563]  acpi_os_execute_deferred+0x14/0x20
      [ 1840.329565]  process_one_work+0x193/0x3c0
      [ 1840.329567]  worker_thread+0x35/0x3b0
      [ 1840.329569]  kthread+0x125/0x140
      [ 1840.329571]  ? process_one_work+0x3c0/0x3c0
      [ 1840.329572]  ? kthread_park+0x60/0x60
      [ 1840.329575]  ? do_syscall_64+0x67/0x180
      [ 1840.329577]  ret_from_fork+0x25/0x30
      [ 1840.329585] unchecked MSR access error: WRMSR to 0x774 (tried to write 0x0000000000000000) at rIP: 0xffffffff86061f78 (native_write_msr+0x8/0x30)
      [ 1840.329586] Call Trace:
      [ 1840.329587]  __wrmsr_on_cpu+0x37/0x40
      [ 1840.329589]  generic_exec_single+0x81/0xb0
      [ 1840.329592]  smp_call_function_single+0xd2/0x100
      [ 1840.329594]  ? acpi_ds_create_operand+0x215/0x23c
      [ 1840.329595]  ? cpumask_next+0x1b/0x20
      [ 1840.329597]  wrmsrl_on_cpu+0x57/0x70
      [ 1840.329598]  ? rdmsrl_on_cpu+0x57/0x80
      [ 1840.329599]  ? wrmsrl_on_cpu+0x57/0x70
      [ 1840.329602]  intel_pstate_hwp_set+0xd3/0x150
      [ 1840.329604]  intel_pstate_set_policy+0x119/0x150
      [ 1840.329606]  cpufreq_set_policy+0xcd/0x2f0
      [ 1840.329607]  cpufreq_update_policy+0xb2/0x130
      [ 1840.329610]  ? cpufreq_update_policy+0x130/0x130
      [ 1840.329613]  acpi_processor_ppc_has_changed+0x65/0x80
      [ 1840.329615]  acpi_processor_notify+0x80/0x100
      [ 1840.329617]  acpi_ev_notify_dispatch+0x44/0x5c
      [ 1840.329619]  acpi_os_execute_deferred+0x14/0x20
      [ 1840.329620]  process_one_work+0x193/0x3c0
      [ 1840.329622]  worker_thread+0x35/0x3b0
      [ 1840.329624]  kthread+0x125/0x140
      [ 1840.329625]  ? process_one_work+0x3c0/0x3c0
      [ 1840.329626]  ? kthread_park+0x60/0x60
      [ 1840.329628]  ? do_syscall_64+0x67/0x180
      [ 1840.329631]  ret_from_fork+0x25/0x30
      
      This is because if there's only one online CPU, the MSR_PM_ENABLE
      (package wide)can not be enabled after resumed, due to
      intel_pstate_hwp_enable() will only be invoked on AP's online
      process after resumed - if there's no AP online, the HWP remains
      disabled after resumed (BIOS has disabled it in S3). Then if
      there comes a _PPC change notification which touches HWP register
      during this stage, the warning is triggered.
      
      Since we don't call acpi_processor_register_performance() when
      HWP is enabled, the pr->performance will be NULL. When this is
      NULL we don't need to do _PPC change notification.
      Reported-by: default avatarDoug Smythies <dsmythies@telus.net>
      Suggested-by: default avatarSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: default avatarYu Chen <yu.c.chen@intel.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      51939996
    • Jean Delvare's avatar
      firmware: dmi_scan: Fix handling of empty DMI strings · 573cb560
      Jean Delvare authored
      
      [ Upstream commit a7770ae1 ]
      
      The handling of empty DMI strings looks quite broken to me:
      * Strings from 1 to 7 spaces are not considered empty.
      * True empty DMI strings (string index set to 0) are not considered
        empty, and result in allocating a 0-char string.
      * Strings with invalid index also result in allocating a 0-char
        string.
      * Strings starting with 8 spaces are all considered empty, even if
        non-space characters follow (sounds like a weird thing to do, but
        I have actually seen occurrences of this in DMI tables before.)
      * Strings which are considered empty are reported as 8 spaces,
        instead of being actually empty.
      
      Some of these issues are the result of an off-by-one error in memcmp,
      the rest is incorrect by design.
      
      So let's get it square: missing strings and strings made of only
      spaces, regardless of their length, should be treated as empty and
      no memory should be allocated for them. All other strings are
      non-empty and should be allocated.
      Signed-off-by: default avatarJean Delvare <jdelvare@suse.de>
      Fixes: 79da4721 ("x86: fix DMI out of memory problems")
      Cc: Parag Warudkar <parag.warudkar@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      573cb560
    • Arnd Bergmann's avatar
      x86/dumpstack: Avoid uninitlized variable · ee06ed9b
      Arnd Bergmann authored
      
      [ Upstream commit ebfc1501 ]
      
      In some configurations, 'partial' does not get initialized, as shown by
      this gcc-8 warning:
      
      arch/x86/kernel/dumpstack.c: In function 'show_trace_log_lvl':
      arch/x86/kernel/dumpstack.c:156:4: error: 'partial' may be used uninitialized in this function [-Werror=maybe-uninitialized]
          show_regs_if_on_stack(&stack_info, regs, partial);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      This initializes it to false, to get the previous behavior in this case.
      
      Fixes: a9cdbe72 ("x86/dumpstack: Fix partial register dumps")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: https://lkml.kernel.org/r/20180202145634.200291-1-arnd@arndb.deSigned-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee06ed9b
    • Arnd Bergmann's avatar
      x86/power: Fix swsusp_arch_resume prototype · 42350547
      Arnd Bergmann authored
      
      [ Upstream commit 328008a7 ]
      
      The declaration for swsusp_arch_resume marks it as 'asmlinkage', but the
      definition in x86-32 does not, and it fails to include the header with the
      declaration. This leads to a warning when building with
      link-time-optimizations:
      
      kernel/power/power.h:108:23: error: type of 'swsusp_arch_resume' does not match original declaration [-Werror=lto-type-mismatch]
       extern asmlinkage int swsusp_arch_resume(void);
                             ^
      arch/x86/power/hibernate_32.c:148:0: note: 'swsusp_arch_resume' was previously declared here
       int swsusp_arch_resume(void)
      
      This moves the declaration into a globally visible header file and fixes up
      both x86 definitions to match it.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: linux-pm@vger.kernel.org
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Link: https://lkml.kernel.org/r/20180202145634.200291-2-arnd@arndb.deSigned-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      42350547
    • Subash Abhinov Kasiviswanathan's avatar
      netfilter: ipv6: nf_defrag: Kill frag queue on RFC2460 failure · 074372c8
      Subash Abhinov Kasiviswanathan authored
      
      [ Upstream commit ea23d5e3 ]
      
      Failures were seen in ICMPv6 fragmentation timeout tests if they were
      run after the RFC2460 failure tests. Kernel was not sending out the
      ICMPv6 fragment reassembly time exceeded packet after the fragmentation
      reassembly timeout of 1 minute had elapsed.
      
      This happened because the frag queue was not released if an error in
      IPv6 fragmentation header was detected by RFC2460.
      
      Fixes: 83f1999c ("netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460")
      Signed-off-by: default avatarSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      074372c8
    • Sebastian Ott's avatar
      s390/eadm: fix CONFIG_BLOCK include dependency · 2cd51003
      Sebastian Ott authored
      
      [ Upstream commit 366b77ae ]
      
      Commit 2a842aca ("block: introduce new block status code type")
      added blk_status_t usage to the eadm subchannel driver. However
      blk_status_t is unknown when included via <linux/blkdev.h> for CONFIG_BLOCK=n.
      
      Only include <linux/blk_types.h> since this is the only dependency eadm has.
      
      This fixes build failures like below:
      In file included from drivers/s390/cio/eadm_sch.c:24:0:
      ./arch/s390/include/asm/eadm.h:111:4: error: unknown type name 'blk_status_t'; did you mean 'si_status'?
          blk_status_t error);
      Reported-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarSebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2cd51003
    • Karol Herbst's avatar
      drm/nouveau/pmu/fuc: don't use movw directly anymore · eb41efa1
      Karol Herbst authored
      
      [ Upstream commit fe9748b7 ]
      
      Fixes failure to compile with recent envyas as a result of the 'movw'
      alias being removed for v5.
      
      A bit of history:
      
      v3 only has a 16-bit sign-extended immediate mov op. In order to set
      the high bits, there's a separate 'sethi' op. envyas validates that
      the value passed to mov(imm) is between -0x8000 and 0x7fff. In order
      to simplify macros that load both the low and high word, a 'movw'
      alias was added which takes an unsigned 16-bit immediate. However the
      actual hardware op still sign extends.
      
      v5 has a full 32-bit immediate mov op. The v3 16-bit immediate mov op
      is gone (loads 0 into the dst reg). However due to a bug in envyas,
      the movw alias still existed, and selected the no-longer-present v3
      16-bit immediate mov op. As a result usage of movw on v5 is the same
      as mov with a 0x0 argument.
      
      The proper fix throughout is to only ever use the 'movw' alias in
      combination with 'sethi'. Anything else should get the sign-extended
      validation to ensure that the intended value ends up in the
      destination register.
      
      Changes in fuc3 binaries is the result of a different encoding being
      selected for a mov with an 8-bit value.
      
      v2: added commit message written by Ilia, thanks for that!
      v3: messed up rebasing, now it should apply
      Signed-off-by: default avatarKarol Herbst <kherbst@redhat.com>
      Signed-off-by: default avatarBen Skeggs <bskeggs@redhat.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eb41efa1
    • Don Hiatt's avatar
      IB/core: Map iWarp AH type to undefined in rdma_ah_find_type · fd370b8e
      Don Hiatt authored
      
      [ Upstream commit 87daac68 ]
      
      iWarp devices do not support the creation of address handles
      so return AH_ATTR_TYPE_UNDEFINED for all iWarp devices.
      
      While we are here reduce the size of port_num to u8 and add
      a comment.
      
      Fixes: 44c58487 ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types")
      Reported-by: default avatarParav Pandit <parav@mellanox.com>
      CC: Sean Hefty <sean.hefty@intel.com>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarShiraz Saleem <shiraz.saleem@intel.com>
      Signed-off-by: default avatarDon Hiatt <don.hiatt@intel.com>
      Signed-off-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fd370b8e
    • Alex Estrin's avatar
      IB/ipoib: Fix for potential no-carrier state · f63bb026
      Alex Estrin authored
      
      [ Upstream commit 10293610 ]
      
      On reboot SM can program port pkey table before ipoib registered its
      event handler, which could result in missing pkey event and leave root
      interface with initial pkey value from index 0.
      
      Since OPA port starts with invalid pkey in index 0, root interface will
      fail to initialize and stay down with no-carrier flag.
      
      For IB ipoib interface may end up with pkey different from value
      opensm put in pkey table idx 0, resulting in connectivity issues
      (different mcast groups, for example).
      
      Close the window by calling event handler after registration
      to make sure ipoib pkey is in sync with port pkey table.
      Reviewed-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarAlex Estrin <alex.estrin@intel.com>
      Signed-off-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f63bb026
    • Alex Estrin's avatar
      IB/hfi1: Fix for potential refcount leak in hfi1_open_file() · 8f96d408
      Alex Estrin authored
      
      [ Upstream commit 2b1e7fe1 ]
      
      The dd refcount is speculatively incremented prior to allocating
      the fd memory with kzalloc(). If that kzalloc() failed the dd
      refcount leaks.
      Increment refcount on kzalloc success.
      
      Fixes: e11ffbd5 ("IB/hfi1: Do not free hfi1 cdev parent structure early")
      Reviewed-by: default avatarMichael J Ruhl <michael.j.ruhl@intel.com>
      Signed-off-by: default avatarAlex Estrin <alex.estrin@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8f96d408
    • Michael J. Ruhl's avatar
      IB/hfi1: Re-order IRQ cleanup to address driver cleanup race · 5ceae769
      Michael J. Ruhl authored
      
      [ Upstream commit 82a97926 ]
      
      The pci_request_irq() interfaces always adds the IRQF_SHARED bit to
      all IRQ requests.
      
      When the kernel is built with CONFIG_DEBUG_SHIRQ config flag, if the
      IRQF_SHARED bit is set, a call to the IRQ handler is made from the
      __free_irq() function. This is testing a race condition between the
      IRQ cleanup and an IRQ racing the cleanup.  The HFI driver should be
      able to handle this race, but does not.
      
      This race can cause traces that start with this footprint:
      
      BUG: unable to handle kernel NULL pointer dereference at   (null)
      Call Trace:
       <hfi1 irq handler>
       ...
       __free_irq+0x1b3/0x2d0
       free_irq+0x35/0x70
       pci_free_irq+0x1c/0x30
       clean_up_interrupts+0x53/0xf0 [hfi1]
       hfi1_start_cleanup+0x122/0x190 [hfi1]
       postinit_cleanup+0x1d/0x280 [hfi1]
       remove_one+0x233/0x250 [hfi1]
       pci_device_remove+0x39/0xc0
      
      Export IRQ cleanup function so it can be called from other modules.
      
      Using the exported cleanup function:
      
        Re-order the driver cleanup code to clean up IRQ resources before
        other resources, eliminating the race.
      
        Re-order error path for init so that the race does not occur.
      
      Reduce severity on spurious error message for SDMA IRQs to info.
      Reviewed-by: default avatarAlex Estrin <alex.estrin@intel.com>
      Reviewed-by: default avatarPatel Jay P <jay.p.patel@intel.com>
      Reviewed-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: default avatarMichael J. Ruhl <michael.j.ruhl@intel.com>
      Signed-off-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ceae769
    • Jens Axboe's avatar
      blk-mq: fix discard merge with scheduler attached · 73027d80
      Jens Axboe authored
      
      [ Upstream commit 445251d0 ]
      
      I ran into an issue on my laptop that triggered a bug on the
      discard path:
      
      WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
       Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
       CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G     U           4.15.0+ #176
       Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
       RIP: 0010:nvme_setup_cmd+0x3d3/0x430
       RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
       RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
       RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
       RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
       R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
       R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
       FS:  0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        nvme_queue_rq+0x40/0xa00
        ? __sbitmap_queue_get+0x24/0x90
        ? blk_mq_get_tag+0xa3/0x250
        ? wait_woken+0x80/0x80
        ? blk_mq_get_driver_tag+0x97/0xf0
        blk_mq_dispatch_rq_list+0x7b/0x4a0
        ? deadline_remove_request+0x49/0xb0
        blk_mq_do_dispatch_sched+0x4f/0xc0
        blk_mq_sched_dispatch_requests+0x106/0x170
        __blk_mq_run_hw_queue+0x53/0xa0
        __blk_mq_delay_run_hw_queue+0x83/0xa0
        blk_mq_run_hw_queue+0x6c/0xd0
        blk_mq_sched_insert_request+0x96/0x140
        __blk_mq_try_issue_directly+0x3d/0x190
        blk_mq_try_issue_directly+0x30/0x70
        blk_mq_make_request+0x1a4/0x6a0
        generic_make_request+0xfd/0x2f0
        ? submit_bio+0x5c/0x110
        submit_bio+0x5c/0x110
        ? __blkdev_issue_discard+0x152/0x200
        submit_bio_wait+0x43/0x60
        ext4_process_freed_data+0x1cd/0x440
        ? account_page_dirtied+0xe2/0x1a0
        ext4_journal_commit_callback+0x4a/0xc0
        jbd2_journal_commit_transaction+0x17e2/0x19e0
        ? kjournald2+0xb0/0x250
        kjournald2+0xb0/0x250
        ? wait_woken+0x80/0x80
        ? commit_timeout+0x10/0x10
        kthread+0x111/0x130
        ? kthread_create_worker_on_cpu+0x50/0x50
        ? do_group_exit+0x3a/0xa0
        ret_from_fork+0x1f/0x30
       Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
       ---[ end trace 50d361cc444506c8 ]---
       print_req_error: I/O error, dev nvme0n1, sector 847167488
      
      Decoding the assembly, the request claims to have 0xffff segments,
      while nvme counts two. This turns out to be because we don't check
      for a data carrying request on the mq scheduler path, and since
      blk_phys_contig_segment() returns true for a non-data request,
      we decrement the initial segment count of 0 and end up with
      0xffff in the unsigned short.
      
      There are a few issues here:
      
      1) We should initialize the segment count for a discard to 1.
      2) The discard merging is currently using the data limits for
         segments and sectors.
      
      Fix this up by having attempt_merge() correctly identify the
      request, and by initializing the segment count correctly
      for discards.
      
      This can only be triggered with mq-deadline on discard capable
      devices right now, which isn't a common configuration.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      73027d80
    • Ed Swierk's avatar
      openvswitch: Remove padding from packet before L3+ conntrack processing · 6eddea4b
      Ed Swierk authored
      
      [ Upstream commit 9382fe71 ]
      
      IPv4 and IPv6 packets may arrive with lower-layer padding that is not
      included in the L3 length. For example, a short IPv4 packet may have
      up to 6 bytes of padding following the IP payload when received on an
      Ethernet device with a minimum packet length of 64 bytes.
      
      Higher-layer processing functions in netfilter (e.g. nf_ip_checksum(),
      and help() in nf_conntrack_ftp) assume skb->len reflects the length of
      the L3 header and payload, rather than referring back to
      ip_hdr->tot_len or ipv6_hdr->payload_len, and get confused by
      lower-layer padding.
      
      In the normal IPv4 receive path, ip_rcv() trims the packet to
      ip_hdr->tot_len before invoking netfilter hooks. In the IPv6 receive
      path, ip6_rcv() does the same using ipv6_hdr->payload_len. Similarly
      in the br_netfilter receive path, br_validate_ipv4() and
      br_validate_ipv6() trim the packet to the L3 length before invoking
      netfilter hooks.
      
      Currently in the OVS conntrack receive path, ovs_ct_execute() pulls
      the skb to the L3 header but does not trim it to the L3 length before
      calling nf_conntrack_in(NF_INET_PRE_ROUTING). When
      nf_conntrack_proto_tcp encounters a packet with lower-layer padding,
      nf_ip_checksum() fails causing a "nf_ct_tcp: bad TCP checksum" log
      message. While extra zero bytes don't affect the checksum, the length
      in the IP pseudoheader does. That length is based on skb->len, and
      without trimming, it doesn't match the length the sender used when
      computing the checksum.
      
      In ovs_ct_execute(), trim the skb to the L3 length before higher-layer
      processing.
      Signed-off-by: default avatarEd Swierk <eswierk@skyportsystems.com>
      Acked-by: default avatarPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6eddea4b
    • shidao.ytt's avatar
      mm/fadvise: discard partial page if endbyte is also EOF · 3b1d9626
      shidao.ytt authored
      
      [ Upstream commit a7ab400d ]
      
      During our recent testing with fadvise(FADV_DONTNEED), we find that if
      given offset/length is not page-aligned, the last page will not be
      discarded.  The tool we use is vmtouch (https://hoytech.com/vmtouch/),
      we map a 10KB-sized file into memory and then try to run this tool to
      evict the whole file mapping, but the last single page always remains
      staying in the memory:
      
      $./vmtouch -e test_10K
                 Files: 1
           Directories: 0
         Evicted Pages: 3 (12K)
               Elapsed: 2.1e-05 seconds
      
      $./vmtouch test_10K
                 Files: 1
           Directories: 0
        Resident Pages: 1/3  4K/12K  33.3%
               Elapsed: 5.5e-05 seconds
      
      However when we test with an older kernel, say 3.10, this problem is
      gone.  So we wonder if this is a regression:
      
      $./vmtouch -e test_10K
                 Files: 1
           Directories: 0
         Evicted Pages: 3 (12K)
               Elapsed: 8.2e-05 seconds
      
      $./vmtouch test_10K
                 Files: 1
           Directories: 0
        Resident Pages: 0/3  0/12K  0%  <-- partial page also discarded
               Elapsed: 5e-05 seconds
      
      After digging a little bit into this problem, we find it seems not a
      regression.  Not discarding partial page is likely to be on purpose
      according to commit 441c228f ("mm: fadvise: document the
      fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel
      Gorman.  He explained why partial pages should be preserved instead of
      being discarded when using fadvise(FADV_DONTNEED).
      
      However, the interesting part is that the actual code did NOT work as
      the same as it was described, the partial page was still discarded
      anyway, due to a calculation mistake of `end_index' passed to
      invalidate_mapping_pages().  This mistake has not been fixed until
      recently, that's why we fail to reproduce our problem in old kernels.
      The fix is done in commit 18aba41c ("mm/fadvise.c: do not discard
      partial pages with POSIX_FADV_DONTNEED") by Oleg Drokin.
      
      Back to the original testing, our problem becomes that there is a
      special case that, if the page-unaligned `endbyte' is also the end of
      file, it is not necessary at all to preserve the last partial page, as
      we all know no one else will use the rest of it.  It should be safe
      enough if we just discard the whole page.  So we add an EOF check in
      this patch.
      
      We also find a poosbile real world issue in mainline kernel.  Assume
      such scenario: A userspace backup application want to backup a huge
      amount of small files (<4k) at once, the developer might (I guess) want
      to use fadvise(FADV_DONTNEED) to save memory.  However, FADV_DONTNEED
      won't really happen since the only page mapped is a partial page, and
      kernel will preserve it.  Our patch also fixes this problem, since we
      know the endbyte is EOF, so we discard it.
      
      Here is a simple reproducer to reproduce and verify each scenario we
      described above:
      
        test_fadvise.c
        ==============================
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <stdlib.h>
        #include <string.h>
        #include <stdio.h>
        #include <unistd.h>
      
        int main(int argc, char **argv)
        {
        	int i, fd, ret, len;
        	struct stat buf;
        	void *addr;
        	unsigned char *vec;
        	char *strbuf;
        	ssize_t pagesize = getpagesize();
        	ssize_t filesize;
      
        	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
        	if (fd < 0)
        		return -1;
        	filesize = strtoul(argv[2], NULL, 10);
      
        	strbuf = malloc(filesize);
        	memset(strbuf, 42, filesize);
        	write(fd, strbuf, filesize);
        	free(strbuf);
        	fsync(fd);
      
        	len = (filesize + pagesize - 1) / pagesize;
        	printf("length of pages: %d\n", len);
      
        	addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
        	if (addr == MAP_FAILED)
        		return -1;
      
        	ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
        	if (ret < 0)
        		return -1;
      
        	vec = malloc(len);
        	ret = mincore(addr, filesize, (void *)vec);
        	if (ret < 0)
        		return -1;
      
        	for (i = 0; i < len; i++)
        		printf("pages[%d]: %x\n", i, vec[i] & 0x1);
      
        	free(vec);
        	close(fd);
      
        	return 0;
        }
        ==============================
      
      Test 1: running on kernel with commit 18aba41c reverted:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6.revert+
        [root@caspar ~]# ./test_fadvise file1 1024
        length of pages: 1
        pages[0]: 0    # <-- partial page discarded
        [root@caspar ~]# ./test_fadvise file2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise file3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 0    # <-- partial page discarded
      
      Test 2: running on mainline kernel:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6+
        [root@caspar ~]# ./test_fadvise test1 1024
        length of pages: 1
        pages[0]: 1    # <-- partial and the only page not discarded
        [root@caspar ~]# ./test_fadvise test2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise test3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 1    # <-- partial page not discarded
      
      Test 3: running on kernel with this patch:
      
        [root@caspar ~]# uname -r
        4.15.0-rc6.patched+
        [root@caspar ~]# ./test_fadvise test1 1024
        length of pages: 1
        pages[0]: 0    # <-- partial page and EOF, discarded
        [root@caspar ~]# ./test_fadvise test2 8192
        length of pages: 2
        pages[0]: 0
        pages[1]: 0
        [root@caspar ~]# ./test_fadvise test3 10240
        length of pages: 3
        pages[0]: 0
        pages[1]: 0
        pages[2]: 0    # <-- partial page and EOF, discarded
      
      [akpm@linux-foundation.org: tweak code comment]
      Link: http://lkml.kernel.org/r/5222da9ee20e1695eaabb69f631f200d6e6b8876.1515132470.git.jinli.zjl@alibaba-inc.comSigned-off-by: default avatarshidao.ytt <shidao.ytt@alibaba-inc.com>
      Signed-off-by: default avatarCaspar Zhang <jinli.zjl@alibaba-inc.com>
      Reviewed-by: default avatarOliver Yang <zhiche.yy@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b1d9626
    • Mel Gorman's avatar
      mm: pin address_space before dereferencing it while isolating an LRU page · 1f9c87e2
      Mel Gorman authored
      
      [ Upstream commit 69d763fc ]
      
      Minchan Kim asked the following question -- what locks protects
      address_space destroying when race happens between inode trauncation and
      __isolate_lru_page? Jan Kara clarified by describing the race as follows
      
      CPU1                                            CPU2
      
      truncate(inode)                                 __isolate_lru_page()
        ...
        truncate_inode_page(mapping, page);
          delete_from_page_cache(page)
            spin_lock_irqsave(&mapping->tree_lock, flags);
              __delete_from_page_cache(page, NULL)
                page_cache_tree_delete(..)
                  ...                                   mapping = page_mapping(page);
                  page->mapping = NULL;
                  ...
            spin_unlock_irqrestore(&mapping->tree_lock, flags);
            page_cache_free_page(mapping, page)
              put_page(page)
                if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded address_space
      
                                                        if (mapping && !mapping->a_ops->migratepage)
      - we've dereferenced mapping which is potentially already free.
      
      The race is theoretically possible but unlikely.  Before the
      delete_from_page_cache, truncate_cleanup_page is called so the page is
      likely to be !PageDirty or PageWriteback which gets skipped by the only
      caller that checks the mappping in __isolate_lru_page.  Even if the race
      occurs, a substantial amount of work has to happen during a tiny window
      with no preemption but it could potentially be done using a virtual
      machine to artifically slow one CPU or halt it during the critical
      window.
      
      This patch should eliminate the race with truncation by try-locking the
      page before derefencing mapping and aborting if the lock was not
      acquired.  There was a suggestion from Huang Ying to use RCU as a
      side-effect to prevent mapping being freed.  However, I do not like the
      solution as it's an unconventional means of preserving a mapping and
      it's not a context where rcu_read_lock is obviously protecting rcu data.
      
      Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
      Fixes: c8244935 ("mm: compaction: make isolate_lru_page() filter-aware again")
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1f9c87e2
    • Yang Shi's avatar
      mm: thp: use down_read_trylock() in khugepaged to avoid long block · 8054b87f
      Yang Shi authored
      
      [ Upstream commit 3b454ad3 ]
      
      In the current design, khugepaged needs to acquire mmap_sem before
      scanning an mm.  But in some corner cases, khugepaged may scan a process
      which is modifying its memory mapping, so khugepaged blocks in
      uninterruptible state.  But the process might hold the mmap_sem for a
      long time when modifying a huge memory space and it may trigger the
      below khugepaged hung issue:
      
        INFO: task khugepaged:270 blocked for more than 120 seconds.
        Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        khugepaged D 0 270 2 0x00000000 
        ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
        ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
        0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
        Call Trace:
          schedule+0x36/0x80
          rwsem_down_read_failed+0xf0/0x150
          call_rwsem_down_read_failed+0x18/0x30
          down_read+0x20/0x40
          khugepaged+0x476/0x11d0
          kthread+0xe6/0x100
          ret_from_fork+0x25/0x30
      
      So it sounds pointless to just block khugepaged waiting for the
      semaphore so replace down_read() with down_read_trylock() to move to
      scan the next mm quickly instead of just blocking on the semaphore so
      that other processes can get more chances to install THP.  Then
      khugepaged can come back to scan the skipped mm when it has finished the
      current round full_scan.
      
      And it appears that the change can improve khugepaged efficiency a
      little bit.
      
      Below is the test result when running LTP on a 24 cores 4GB memory 2
      nodes NUMA VM:
      
                                          pristine          w/ trylock
        full_scan                         197               187
        pages_collapsed                   21                26
        thp_fault_alloc                   40818             44466
        thp_fault_fallback                18413             16679
        thp_collapse_alloc                21                150
        thp_collapse_alloc_failed         14                16
        thp_file_alloc                    369               369
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: tweak comment]
      [arnd@arndb.de: avoid uninitialized variable use]
        Link: http://lkml.kernel.org/r/20171215125129.2948634-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1513281203-54878-1-git-send-email-yang.s@alibaba-inc.comSigned-off-by: default avatarYang Shi <yang.s@alibaba-inc.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8054b87f
    • Nitin Gupta's avatar
      sparc64: update pmdp_invalidate() to return old pmd value · 6acb8818
      Nitin Gupta authored
      
      [ Upstream commit a8e654f0 ]
      
      It's required to avoid losing dirty and accessed bits.
      
      [akpm@linux-foundation.org: add a `do' to the do-while loop]
      Link: http://lkml.kernel.org/r/20171213105756.69879-9-kirill.shutemov@linux.intel.comSigned-off-by: default avatarNitin Gupta <nitin.m.gupta@oracle.com>
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6acb8818
    • Kirill A. Shutemov's avatar
      asm-generic: provide generic_pmdp_establish() · 78185a93
      Kirill A. Shutemov authored
      
      [ Upstream commit c58f0bb7 ]
      
      Patch series "Do not lose dirty bit on THP pages", v4.
      
      Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
      dirty and access bits if CPU sets them after pmdp dereference, but
      before set_pmd_at().
      
      The bug can lead to data loss, but the race window is tiny and I haven't
      seen any reports that suggested that it happens in reality.  So I don't
      think it worth sending it to stable.
      
      Unfortunately, there's no way to address the issue in a generic way.  We
      need to fix all architectures that support THP one-by-one.
      
      All architectures that have THP supported have to provide atomic
      pmdp_invalidate() that returns previous value.
      
      If generic implementation of pmdp_invalidate() is used, architecture
      needs to provide atomic pmdp_estabish().
      
      pmdp_estabish() is not used out-side generic implementation of
      pmdp_invalidate() so far, but I think this can change in the future.
      
      This patch (of 12):
      
      This is an implementation of pmdp_establish() that is only suitable for
      an architecture that doesn't have hardware dirty/accessed bits.  In this
      case we can't race with CPU which sets these bits and non-atomic
      approach is fine.
      
      Link: http://lkml.kernel.org/r/20171213105756.69879-2-kirill.shutemov@linux.intel.comSigned-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      78185a93
    • Yisheng Xie's avatar
      mm/mempolicy: add nodes_empty check in SYSC_migrate_pages · 305e5675
      Yisheng Xie authored
      
      [ Upstream commit 0486a38b ]
      
      As in manpage of migrate_pages, the errno should be set to EINVAL when
      none of the node IDs specified by new_nodes are on-line and allowed by
      the process's current cpuset context, or none of the specified nodes
      contain memory.  However, when test by following case:
      
      	new_nodes = 0;
      	old_nodes = 0xf;
      	ret = migrate_pages(pid, old_nodes, new_nodes, MAX);
      
      The ret will be 0 and no errno is set.  As the new_nodes is empty, we
      should expect EINVAL as documented.
      
      To fix the case like above, this patch check whether target nodes AND
      current task_nodes is empty, and then check whether AND
      node_states[N_MEMORY] is empty.
      
      Link: http://lkml.kernel.org/r/1510882624-44342-4-git-send-email-xieyisheng1@huawei.comSigned-off-by: default avatarYisheng Xie <xieyisheng1@huawei.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Chris Salls <salls@cs.ucsb.edu>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tan Xiaojun <tanxiaojun@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      305e5675
    • Yisheng Xie's avatar
      mm/mempolicy: fix the check of nodemask from user · 6cab60ac
      Yisheng Xie authored
      
      [ Upstream commit 56521e7a ]
      
      As Xiaojun reported the ltp of migrate_pages01 will fail on arm64 system
      which has 4 nodes[0...3], all have memory and CONFIG_NODES_SHIFT=2:
      
        migrate_pages01    0  TINFO  :  test_invalid_nodes
        migrate_pages01   14  TFAIL  :  migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1
        migrate_pages01   15  TFAIL  :  migrate_pages_common.c:55: call succeeded unexpectedly
      
      In this case the test_invalid_nodes of migrate_pages01 will call:
      SYSC_migrate_pages as:
      
        migrate_pages(0, , {0x0000000000000001}, 64, , {0x0000000000000010}, 64) = 0
      
      The new nodes specifies one or more node IDs that are greater than the
      maximum supported node ID, however, the errno is not set to EINVAL as
      expected.
      
      As man pages of set_mempolicy[1], mbind[2], and migrate_pages[3]
      mentioned, when nodemask specifies one or more node IDs that are greater
      than the maximum supported node ID, the errno should set to EINVAL.
      However, get_nodes only check whether the part of bits
      [BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES), maxnode) is zero or not, and
      remain [MAX_NUMNODES, BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES)
      unchecked.
      
      This patch is to check the bits of [MAX_NUMNODES, maxnode) in get_nodes
      to let migrate_pages set the errno to EINVAL when nodemask specifies one
      or more node IDs that are greater than the maximum supported node ID,
      which follows the manpage's guide.
      
      [1] http://man7.org/linux/man-pages/man2/set_mempolicy.2.html
      [2] http://man7.org/linux/man-pages/man2/mbind.2.html
      [3] http://man7.org/linux/man-pages/man2/migrate_pages.2.html
      
      Link: http://lkml.kernel.org/r/1510882624-44342-3-git-send-email-xieyisheng1@huawei.comSigned-off-by: default avatarYisheng Xie <xieyisheng1@huawei.com>
      Reported-by: default avatarTan Xiaojun <tanxiaojun@huawei.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Chris Salls <salls@cs.ucsb.edu>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6cab60ac
    • piaojun's avatar
      ocfs2: return error when we attempt to access a dirty bh in jbd2 · a7fbc7f3
      piaojun authored
      
      [ Upstream commit d984187e ]
      
      We should not reuse the dirty bh in jbd2 directly due to the following
      situation:
      
      1. When removing extent rec, we will dirty the bhs of extent rec and
         truncate log at the same time, and hand them over to jbd2.
      
      2. The bhs are submitted to jbd2 area successfully.
      
      3. The write-back thread of device help flush the bhs to disk but
         encounter write error due to abnormal storage link.
      
      4. After a while the storage link become normal. Truncate log flush
         worker triggered by the next space reclaiming found the dirty bh of
         truncate log and clear its 'BH_Write_EIO' and then set it uptodate in
         __ocfs2_journal_access():
      
         ocfs2_truncate_log_worker
           ocfs2_flush_truncate_log
             __ocfs2_flush_truncate_log
               ocfs2_replay_truncate_records
                 ocfs2_journal_access_di
                   __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata.
      
      5. Then jbd2 will flush the bh of truncate log to disk, but the bh of
         extent rec is still in error state, and unfortunately nobody will
         take care of it.
      
      6. At last the space of extent rec was not reduced, but truncate log
         flush worker have given it back to globalalloc. That will cause
         duplicate cluster problem which could be identified by fsck.ocfs2.
      
      Sadly we can hardly revert this but set fs read-only in case of ruining
      atomicity and consistency of space reclaim.
      
      Link: http://lkml.kernel.org/r/5A6E8092.8090701@huawei.com
      Fixes: acf8fdbe ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
      Signed-off-by: default avatarJun Piao <piaojun@huawei.com>
      Reviewed-by: default avatarYiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: default avatarChangwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a7fbc7f3
    • piaojun's avatar
      ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute · a66174eb
      piaojun authored
      
      [ Upstream commit 16c8d569 ]
      
      The race between *set_acl and *get_acl will cause getting incomplete
      xattr data as below:
      
        processA                                    processB
      
        ocfs2_set_acl
          ocfs2_xattr_set
            __ocfs2_xattr_set_handle
      
                                                    ocfs2_get_acl_nolock
                                                      ocfs2_xattr_get_nolock:
      
      processB may get incomplete xattr data if processA hasn't set_acl done.
      
      So we should use 'ip_xattr_sem' to protect getting extended attribute in
      ocfs2_get_acl_nolock(), as other processes could be changing it
      concurrently.
      
      Link: http://lkml.kernel.org/r/5A5DDCFF.7030001@huawei.comSigned-off-by: default avatarJun Piao <piaojun@huawei.com>
      Reviewed-by: default avatarAlex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a66174eb
    • piaojun's avatar
      ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid · 66aaeed2
      piaojun authored
      
      [ Upstream commit 025bcbde ]
      
      If metadata is corrupted such as 'invalid inode block', we will get
      failed by calling 'mount()' and then set filesystem readonly as below:
      
        ocfs2_mount
          ocfs2_initialize_super
            ocfs2_init_global_system_inodes
              ocfs2_iget
                ocfs2_read_locked_inode
                  ocfs2_validate_inode_block
      	      ocfs2_error
      	        ocfs2_handle_error
      	          ocfs2_set_ro_flag(osb, 0);  // set readonly
      
      In this situation we need return -EROFS to 'mount.ocfs2', so that user
      can fix it by fsck.  And then mount again.  In addition, 'mount.ocfs2'
      should be updated correspondingly as it only return 1 for all errno.
      And I will post a patch for 'mount.ocfs2' too.
      
      Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.comSigned-off-by: default avatarJun Piao <piaojun@huawei.com>
      Reviewed-by: default avatarAlex Chen <alex.chen@huawei.com>
      Reviewed-by: default avatarJoseph Qi <jiangqi903@gmail.com>
      Reviewed-by: default avatarChangwei Ge <ge.changwei@h3c.com>
      Reviewed-by: default avatarGang He <ghe@suse.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      66aaeed2
    • Jan H. Schönherr's avatar
      fs/dax.c: release PMD lock even when there is no PMD support in DAX · 710b5124
      Jan H. Schönherr authored
      
      [ Upstream commit ee190ca6 ]
      
      follow_pte_pmd() can theoretically return after having acquired a PMD
      lock, even when DAX was not compiled with CONFIG_FS_DAX_PMD.
      
      Release the PMD lock unconditionally.
      
      Link: http://lkml.kernel.org/r/20180118133839.20587-1-jschoenh@amazon.de
      Fixes: f729c8c9 ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean")
      Signed-off-by: default avatarJan H. Schönherr <jschoenh@amazon.de>
      Reviewed-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      710b5124
    • Vitaly Kuznetsov's avatar
      x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested · cc0600da
      Vitaly Kuznetsov authored
      
      [ Upstream commit d391f120 ]
      
      I was investigating an issue with seabios >= 1.10 which stopped working
      for nested KVM on Hyper-V. The problem appears to be in
      handle_ept_violation() function: when we do fast mmio we need to skip
      the instruction so we do kvm_skip_emulated_instruction(). This, however,
      depends on VM_EXIT_INSTRUCTION_LEN field being set correctly in VMCS.
      However, this is not the case.
      
      Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when
      EPT MISCONFIG occurs. While on real hardware it was observed to be set,
      some hypervisors follow the spec and don't set it; we end up advancing
      IP with some random value.
      
      I checked with Microsoft and they confirmed they don't fill
      VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG.
      
      Fix the issue by doing instruction skip through emulator when running
      nested.
      
      Fixes: 68c3b4d1Suggested-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Suggested-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc0600da
    • KarimAllah Ahmed's avatar
      kvm: Map PFN-type memory regions as writable (if possible) · d757c3a9
      KarimAllah Ahmed authored
      
      [ Upstream commit a340b3e2 ]
      
      For EPT-violations that are triggered by a read, the pages are also mapped with
      write permissions (if their memory region is also writable). That would avoid
      getting yet another fault on the same page when a write occurs.
      
      This optimization only happens when you have a "struct page" backing the memory
      region. So also enable it for memory regions that do not have a "struct page".
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarKarimAllah Ahmed <karahmed@amazon.de>
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d757c3a9
    • Gustavo A. R. Silva's avatar
      tcp_nv: fix potential integer overflow in tcpnv_acked · a6a25002
      Gustavo A. R. Silva authored
      
      [ Upstream commit e4823fbd ]
      
      Add suffix ULL to constant 80000 in order to avoid a potential integer
      overflow and give the compiler complete information about the proper
      arithmetic to use. Notice that this constant is used in a context that
      expects an expression of type u64.
      
      The current cast to u64 effectively applies to the whole expression
      as an argument of type u64 to be passed to div64_u64, but it does
      not prevent it from being evaluated using 32-bit arithmetic instead
      of 64-bit arithmetic.
      
      Also, once the expression is properly evaluated using 64-bit arithmentic,
      there is no need for the parentheses and the external cast to u64.
      
      Addresses-Coverity-ID: 1357588 ("Unintentional integer overflow")
      Signed-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6a25002
    • Dmitry Vyukov's avatar
      netfilter: x_tables: fix pointer leaks to userspace · ad10785a
      Dmitry Vyukov authored
      
      [ Upstream commit 1e98ffea ]
      
      Several netfilter matches and targets put kernel pointers into
      info objects, but don't set usersize in descriptors.
      This leads to kernel pointer leaks if a match/target is set
      and then read back to userspace.
      
      Properly set usersize for these matches/targets.
      
      Found with manual code inspection.
      
      Fixes: ec231890 ("xtables: extend matches and targets with .usersize")
      Signed-off-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ad10785a
    • Vitaly Kuznetsov's avatar
      x86/hyperv: Check for required priviliges in hyperv_init() · 2b7cc936
      Vitaly Kuznetsov authored
      
      [ Upstream commit 89a8f6d4 ]
      
      In hyperv_init() its presumed that it always has access to VP index and
      hypercall MSRs while according to the specification it should be checked if
      it's allowed to access the corresponding MSRs before accessing them.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: kvm@vger.kernel.org
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: devel@linuxdriverproject.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Cathy Avery <cavery@redhat.com>
      Cc: Mohammed Gamal <mmorsy@redhat.com>
      Link: https://lkml.kernel.org/r/20180124132337.30138-2-vkuznets@redhat.comSigned-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2b7cc936
    • Andy Spencer's avatar
      gianfar: prevent integer wrapping in the rx handler · cdf635a6
      Andy Spencer authored
      
      [ Upstream commit 202a0a70 ]
      
      When the frame check sequence (FCS) is split across the last two frames
      of a fragmented packet, part of the FCS gets counted twice, once when
      subtracting the FCS, and again when subtracting the previously received
      data.
      
      For example, if 1602 bytes are received, and the first fragment contains
      the first 1600 bytes (including the first two bytes of the FCS), and the
      second fragment contains the last two bytes of the FCS:
      
        'skb->len == 1600' from the first fragment
      
        size  = lstatus & BD_LENGTH_MASK; # 1602
        size -= ETH_FCS_LEN;              # 1598
        size -= skb->len;                 # -2
      
      Since the size is unsigned, it wraps around and causes a BUG later in
      the packet handling, as shown below:
      
        kernel BUG at ./include/linux/skbuff.h:2068!
        Oops: Exception in kernel mode, sig: 5 [#1]
        ...
        NIP [c021ec60] skb_pull+0x24/0x44
        LR [c01e2fbc] gfar_clean_rx_ring+0x498/0x690
        Call Trace:
        [df7edeb0] [c01e2c1c] gfar_clean_rx_ring+0xf8/0x690 (unreliable)
        [df7edf20] [c01e33a8] gfar_poll_rx_sq+0x3c/0x9c
        [df7edf40] [c023352c] net_rx_action+0x21c/0x274
        [df7edf90] [c0329000] __do_softirq+0xd8/0x240
        [df7edff0] [c000c108] call_do_irq+0x24/0x3c
        [c0597e90] [c00041dc] do_IRQ+0x64/0xc4
        [c0597eb0] [c000d920] ret_from_except+0x0/0x18
        --- interrupt: 501 at arch_cpu_idle+0x24/0x5c
      
      Change the size to a signed integer and then trim off any part of the
      FCS that was received prior to the last fragment.
      
      Fixes: 6c389fc9 ("gianfar: fix size of scatter-gathered frames")
      Signed-off-by: default avatarAndy Spencer <aspencer@spacex.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cdf635a6
    • Logan Gunthorpe's avatar
      ntb_transport: Fix bug with max_mw_size parameter · 67fa8bff
      Logan Gunthorpe authored
      
      [ Upstream commit cbd27448 ]
      
      When using the max_mw_size parameter of ntb_transport to limit the size of
      the Memory windows, communication cannot be established and the queues
      freeze.
      
      This is because the mw_size that's reported to the peer is correctly
      limited but the size used locally is not. So the MW is initialized
      with a buffer smaller than the window but the TX side is using the
      full window. This means the TX side will be writing to a region of the
      window that points nowhere.
      
      This is easily fixed by applying the same limit to tx_size in
      ntb_transport_init_queue().
      
      Fixes: e26a5843 ("NTB: Split ntb_hw_intel and ntb_transport drivers")
      Signed-off-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Acked-by: default avatarAllen Hubbe <Allen.Hubbe@dell.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: default avatarJon Mason <jdmason@kudzu.us>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      67fa8bff
    • Leon Romanovsky's avatar
      RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failure · d810c548
      Leon Romanovsky authored
      
      [ Upstream commit b081808a ]
      
      Failure in XRCD FW deallocation command leaves memory leaked and
      returns error to the user which he can't do anything about it.
      
      This patch changes behavior to always free memory and always return
      success to the user.
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Reviewed-by: default avatarMajd Dibbiny <majd@mellanox.com>
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Reviewed-by: default avatarYuval Shaia <yuval.shaia@oracle.com>
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d810c548
    • Michael Bringmann's avatar
      powerpc/numa: Ensure nodes initialized for hotplug · 0bddd43a
      Michael Bringmann authored
      
      [ Upstream commit ea05ba7c ]
      
      This patch fixes some problems encountered at runtime with
      configurations that support memory-less nodes, or that hot-add CPUs
      into nodes that are memoryless during system execution after boot. The
      problems of interest include:
      
      * Nodes known to powerpc to be memoryless at boot, but to have CPUs in
        them are allowed to be 'possible' and 'online'. Memory allocations
        for those nodes are taken from another node that does have memory
        until and if memory is hot-added to the node.
      
      * Nodes which have no resources assigned at boot, but which may still
        be referenced subsequently by affinity or associativity attributes,
        are kept in the list of 'possible' nodes for powerpc. Hot-add of
        memory or CPUs to the system can reference these nodes and bring
        them online instead of redirecting the references to one of the set
        of nodes known to have memory at boot.
      
      Note that this software operates under the context of CPU hotplug. We
      are not doing memory hotplug in this code, but rather updating the
      kernel's CPU topology (i.e. arch_update_cpu_topology /
      numa_update_cpu_topology). We are initializing a node that may be used
      by CPUs or memory before it can be referenced as invalid by a CPU
      hotplug operation. CPU hotplug operations are protected by a range of
      APIs including cpu_maps_update_begin/cpu_maps_update_done,
      cpus_read/write_lock / cpus_read/write_unlock, device locks, and more.
      Memory hotplug operations, including try_online_node, are protected by
      mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the
      case of CPUs being hot-added to a previously memoryless node, the
      try_online_node operation occurs wholly within the CPU locks with no
      overlap. Using HMC hot-add/hot-remove operations, we have been able to
      add and remove CPUs to any possible node without failures. HMC
      operations involve a degree self-serialization, though.
      Signed-off-by: default avatarMichael Bringmann <mwb@linux.vnet.ibm.com>
      Reviewed-by: default avatarNathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0bddd43a
    • Michael Bringmann's avatar
      powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes · 0caebc38
      Michael Bringmann authored
      
      [ Upstream commit a346137e ]
      
      On powerpc systems which allow 'hot-add' of CPU or memory resources,
      it may occur that the new resources are to be inserted into nodes that
      were not used for these resources at bootup. In the kernel, any node
      that is used must be defined and initialized. These empty nodes may
      occur when,
      
      * Dedicated vs. shared resources. Shared resources require information
        such as the VPHN hcall for CPU assignment to nodes. Associativity
        decisions made based on dedicated resource rules, such as
        associativity properties in the device tree, may vary from decisions
        made using the values returned by the VPHN hcall.
      
      * memoryless nodes at boot. Nodes need to be defined as 'possible' at
        boot for operation with other code modules. Previously, the powerpc
        code would limit the set of possible nodes to those which have
        memory assigned at boot, and were thus online. Subsequent add/remove
        of CPUs or memory would only work with this subset of possible
        nodes.
      
      * memoryless nodes with CPUs at boot. Due to the previous restriction
        on nodes, nodes that had CPUs but no memory were being collapsed
        into other nodes that did have memory at boot. In practice this
        meant that the node assignment presented by the runtime kernel
        differed from the affinity and associativity attributes presented by
        the device tree or VPHN hcalls. Nodes that might be known to the
        pHyp were not 'possible' in the runtime kernel because they did not
        have memory at boot.
      
      This patch ensures that sufficient nodes are defined to support
      configuration requirements after boot, as well as at boot. This patch
      set fixes a couple of problems.
      
      * Nodes known to powerpc to be memoryless at boot, but to have CPUs in
        them are allowed to be 'possible' and 'online'. Memory allocations
        for those nodes are taken from another node that does have memory
        until and if memory is hot-added to the node. * Nodes which have no
        resources assigned at boot, but which may still be referenced
        subsequently by affinity or associativity attributes, are kept in
        the list of 'possible' nodes for powerpc. Hot-add of memory or CPUs
        to the system can reference these nodes and bring them online
        instead of redirecting to one of the set of nodes that were known to
        have memory at boot.
      
      This patch extracts the value of the lowest domain level (number of
      allocable resources) from the device tree property
      "ibm,max-associativity-domains" to use as the maximum number of nodes
      to setup as possibly available in the system. This new setting will
      override the instruction:
      
          nodes_and(node_possible_map, node_possible_map, node_online_map);
      
      presently seen in the function arch/powerpc/mm/numa.c:initmem_init().
      
      If the "ibm,max-associativity-domains" property is not present at
      boot, no operation will be performed to define or enable additional
      nodes, or enable the above 'nodes_and()'.
      Signed-off-by: default avatarMichael Bringmann <mwb@linux.vnet.ibm.com>
      Reviewed-by: default avatarNathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0caebc38
    • Mickaël Salaün's avatar
      samples/bpf: Partially fixes the bpf.o build · b086dd2d
      Mickaël Salaün authored
      
      [ Upstream commit c25ef6a5 ]
      
      Do not build lib/bpf/bpf.o with this Makefile but use the one from the
      library directory.  This avoid making a buggy bpf.o file (e.g. missing
      symbols).
      
      This patch is useful if some code (e.g. Landlock tests) needs both the
      bpf.o (from tools/lib/bpf) and the bpf_load.o (from samples/bpf).
      Signed-off-by: default avatarMickaël Salaün <mic@digikod.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b086dd2d
    • Jacob Keller's avatar
      i40e: fix reported mask for ntuple filters · 64e5e46c
      Jacob Keller authored
      
      [ Upstream commit 40339af3 ]
      
      In commit 36777d9f ("i40e: check current configured input set when
      adding ntuple filters") some code was added to report the input set
      mask for a given filter when reporting it to the user.
      
      This code is necessary so that the reported filter correctly displays
      that it is or is not masking certain fields.
      
      Unfortunately the code was incorrect. Development error accidentally
      swapped the mask values for the IPv4 addresses with the L4 port numbers.
      The port numbers are only 16bits wide while IPv4 addresses are 32 bits.
      Unfortunately we assigned only 16 bits to the IPv4 address masks.
      Additionally we assigned 32bit value 0xFFFFFFF to the TCP port numbers.
      This second part does not matter as the value would be truncated to
      16bits regardless, but it is unnecessary.
      
      Fix the reported masks to properly report that the entire field is
      masked.
      
      Fixes: 36777d9f ("i40e: check current configured input set when adding ntuple filters")
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      64e5e46c
    • Jacob Keller's avatar
      i40e: program fragmented IPv4 filter input set · 1ec85fe4
      Jacob Keller authored
      
      [ Upstream commit 02b4016b ]
      
      When implementing support for IP_USER_FLOW filters, we correctly
      programmed a filter for both the non fragmented IPv4/Other filter, as
      well as the fragmented IPv4 filters. However, we did not properly
      program the input set for fragmented IPv4 PCTYPE. This meant that the
      filters would almost certainly not match, unless the user specified all
      of the flow types.
      
      Add support to program the fragmented IPv4 filter input set. Since we
      always program these filters together, we'll assume that the two input
      sets must match, and will thus always program the input sets to the same
      value.
      Signed-off-by: default avatarJacob Keller <jacob.e.keller@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ec85fe4
    • Emil Tantilov's avatar
      ixgbe: don't set RXDCTL.RLPML for 82599 · 7addb3e4
      Emil Tantilov authored
      
      [ Upstream commit 2bafa8fa ]
      
      commit 2de6aa3a ("ixgbe: Add support for padding packet")
      
      Uses RXDCTL.RLPML to limit the maximum frame size on Rx when using
      build_skb. Unfortunately that register does not work on 82599.
      
      Added an explicit check to avoid setting this register on 82599 MAC.
      
      Extended the comment related to the setting of RXDCTL.RLPML to better
      explain its purpose.
      Signed-off-by: default avatarEmil Tantilov <emil.s.tantilov@intel.com>
      Tested-by: default avatarAndrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7addb3e4
    • Jake Daryll Obina's avatar
      jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path · 27eb641f
      Jake Daryll Obina authored
      
      [ Upstream commit 5bdd0c6f ]
      
      If jffs2_iget() fails for a newly-allocated inode, jffs2_do_clear_inode()
      can get called twice in the error handling path, the first call in
      jffs2_iget() itself and the second through iget_failed(). This can result
      to a use-after-free error in the second jffs2_do_clear_inode() call, such
      as shown by the oops below wherein the second jffs2_do_clear_inode() call
      was trying to free node fragments that were already freed in the first
      jffs2_do_clear_inode() call.
      
      [   78.178860] jffs2: error: (1904) jffs2_do_read_inode_internal: CRC failed for read_inode of inode 24 at physical location 0x1fc00c
      [   78.178914] Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b7b
      [   78.185871] pgd = ffffffc03a567000
      [   78.188794] [6b6b6b6b6b6b6b7b] *pgd=0000000000000000, *pud=0000000000000000
      [   78.194968] Internal error: Oops: 96000004 [#1] PREEMPT SMP
      ...
      [   78.513147] PC is at rb_first_postorder+0xc/0x28
      [   78.516503] LR is at jffs2_kill_fragtree+0x28/0x90 [jffs2]
      [   78.520672] pc : [<ffffff8008323d28>] lr : [<ffffff8000eb1cc8>] pstate: 60000105
      [   78.526757] sp : ffffff800cea38f0
      [   78.528753] x29: ffffff800cea38f0 x28: ffffffc01f3f8e80
      [   78.532754] x27: 0000000000000000 x26: ffffff800cea3c70
      [   78.536756] x25: 00000000dc67c8ae x24: ffffffc033d6945d
      [   78.540759] x23: ffffffc036811740 x22: ffffff800891a5b8
      [   78.544760] x21: 0000000000000000 x20: 0000000000000000
      [   78.548762] x19: ffffffc037d48910 x18: ffffff800891a588
      [   78.552764] x17: 0000000000000800 x16: 0000000000000c00
      [   78.556766] x15: 0000000000000010 x14: 6f2065646f6e695f
      [   78.560767] x13: 6461657220726f66 x12: 2064656c69616620
      [   78.564769] x11: 435243203a6c616e x10: 7265746e695f6564
      [   78.568771] x9 : 6f6e695f64616572 x8 : ffffffc037974038
      [   78.572774] x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008
      [   78.576775] x5 : 002f91d85bd44a2f x4 : 0000000000000000
      [   78.580777] x3 : 0000000000000000 x2 : 000000403755e000
      [   78.584779] x1 : 6b6b6b6b6b6b6b6b x0 : 6b6b6b6b6b6b6b6b
      ...
      [   79.038551] [<ffffff8008323d28>] rb_first_postorder+0xc/0x28
      [   79.042962] [<ffffff8000eb5578>] jffs2_do_clear_inode+0x88/0x100 [jffs2]
      [   79.048395] [<ffffff8000eb9ddc>] jffs2_evict_inode+0x3c/0x48 [jffs2]
      [   79.053443] [<ffffff8008201ca8>] evict+0xb0/0x168
      [   79.056835] [<ffffff8008202650>] iput+0x1c0/0x200
      [   79.060228] [<ffffff800820408c>] iget_failed+0x30/0x3c
      [   79.064097] [<ffffff8000eba0c0>] jffs2_iget+0x2d8/0x360 [jffs2]
      [   79.068740] [<ffffff8000eb0a60>] jffs2_lookup+0xe8/0x130 [jffs2]
      [   79.073434] [<ffffff80081f1a28>] lookup_slow+0x118/0x190
      [   79.077435] [<ffffff80081f4708>] walk_component+0xfc/0x28c
      [   79.081610] [<ffffff80081f4dd0>] path_lookupat+0x84/0x108
      [   79.085699] [<ffffff80081f5578>] filename_lookup+0x88/0x100
      [   79.089960] [<ffffff80081f572c>] user_path_at_empty+0x58/0x6c
      [   79.094396] [<ffffff80081ebe14>] vfs_statx+0xa4/0x114
      [   79.098138] [<ffffff80081ec44c>] SyS_newfstatat+0x58/0x98
      [   79.102227] [<ffffff800808354c>] __sys_trace_return+0x0/0x4
      [   79.106489] Code: d65f03c0 f9400001 b40000e1 aa0103e0 (f9400821)
      
      The jffs2_do_clear_inode() call in jffs2_iget() is unnecessary since
      iget_failed() will eventually call jffs2_do_clear_inode() if needed, so
      just remove it.
      
      Fixes: 5451f79f ("iget: stop JFFS2 from using iget() and read_inode()")
      Reviewed-by: default avatarRichard Weinberger <richard@nod.at>
      Signed-off-by: default avatarJake Daryll Obina <jake.obina@gmail.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarSasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      27eb641f