1. 29 Nov, 2019 5 commits
  2. 18 Nov, 2019 3 commits
    • Andrew Duggan's avatar
      HID: rmi: Check that the RMI_STARTED bit is set before unregistering the RMI transport device · 8725aa4f
      Andrew Duggan authored
      In the event that the RMI device is unreachable, the calls to rmi_set_mode() or
      rmi_set_page() will fail before registering the RMI transport device. When the
      device is removed, rmi_remove() will call rmi_unregister_transport_device()
      which will attempt to access the rmi_dev pointer which was not set.
      This patch adds a check of the RMI_STARTED bit before calling
      rmi_unregister_transport_device().  The RMI_STARTED bit is only set
      after rmi_register_transport_device() completes successfully.
      
      The kernel oops was reported in this message:
      https://www.spinics.net/lists/linux-input/msg58433.html
      
      [jkosina@suse.cz: reworded changelog as agreed with Andrew]
      Signed-off-by: default avatarAndrew Duggan <aduggan@synaptics.com>
      Reported-by: default avatarFederico Cerutti <federico@ceres-c.it>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      8725aa4f
    • Heiner Kallweit's avatar
      HID: quirks: remove hid-led devices from hid_have_special_driver · b03e5774
      Heiner Kallweit authored
      Since e04a0442 ("HID: core: remove the absolute need of
      hid_have_special_driver[]") it's no longer needed to list these LED
      devices in hid_have_special_driver[]. This allows libraries needing
      access to the hidraw device to work properly.
      Signed-off-by: default avatarHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: default avatarBenjamin Tissoires <benjamin.tissoires@redhat.com>
      b03e5774
    • Blaž Hrastnik's avatar
      HID: Improve Windows Precision Touchpad detection. · 2dbc6f11
      Blaž Hrastnik authored
      Per Microsoft spec, usage 0xC5 (page 0xFF) returns a blob containing
      data used to verify the touchpad as a Windows Precision Touchpad.
      
         0x85, REPORTID_PTPHQA,    //    REPORT_ID (PTPHQA)
          0x09, 0xC5,              //    USAGE (Vendor Usage 0xC5)
          0x15, 0x00,              //    LOGICAL_MINIMUM (0)
          0x26, 0xff, 0x00,        //    LOGICAL_MAXIMUM (0xff)
          0x75, 0x08,              //    REPORT_SIZE (8)
          0x96, 0x00, 0x01,        //    REPORT_COUNT (0x100 (256))
          0xb1, 0x02,              //    FEATURE (Data,Var,Abs)
      
      However, some devices, namely Microsoft's Surface line of products
      instead implement a "segmented device certification report" (usage 0xC6)
      which returns the same report, but in smaller chunks.
      
          0x06, 0x00, 0xff,        //     USAGE_PAGE (Vendor Defined)
          0x85, REPORTID_PTPHQA,   //     REPORT_ID (PTPHQA)
          0x09, 0xC6,              //     USAGE (Vendor usage for segment #)
          0x25, 0x08,              //     LOGICAL_MAXIMUM (8)
          0x75, 0x08,              //     REPORT_SIZE (8)
          0x95, 0x01,              //     REPORT_COUNT (1)
          0xb1, 0x02,              //     FEATURE (Data,Var,Abs)
          0x09, 0xC7,              //     USAGE (Vendor Usage)
          0x26, 0xff, 0x00,        //     LOGICAL_MAXIMUM (0xff)
          0x95, 0x20,              //     REPORT_COUNT (32)
          0xb1, 0x02,              //     FEATURE (Data,Var,Abs)
      
      By expanding Win8 touchpad detection to also look for the segmented
      report, all Surface touchpads are now properly recognized by
      hid-multitouch.
      Signed-off-by: default avatarBlaž Hrastnik <blaz@mxxn.io>
      Signed-off-by: default avatarBenjamin Tissoires <benjamin.tissoires@redhat.com>
      2dbc6f11
  3. 15 Nov, 2019 4 commits
  4. 14 Nov, 2019 1 commit
    • Jinke Fan's avatar
      HID: quirks: Add quirk for HP MSU1465 PIXART OEM mouse · f1a0094c
      Jinke Fan authored
      The PixArt OEM mouse disconnets/reconnects every minute on
      Linux. All contents of dmesg are repetitive:
      
      [ 1465.810014] usb 1-2.2: USB disconnect, device number 20
      [ 1467.431509] usb 1-2.2: new low-speed USB device number 21 using xhci_hcd
      [ 1467.654982] usb 1-2.2: New USB device found, idVendor=03f0,idProduct=1f4a, bcdDevice= 1.00
      [ 1467.654985] usb 1-2.2: New USB device strings: Mfr=1, Product=2,SerialNumber=0
      [ 1467.654987] usb 1-2.2: Product: HP USB Optical Mouse
      [ 1467.654988] usb 1-2.2: Manufacturer: PixArt
      [ 1467.699722] input: PixArt HP USB Optical Mouse as /devices/pci0000:00/0000:00:07.1/0000:05:00.3/usb1/1-2/1-2.2/1-2.2:1.0/0003:03F0:1F4A.0012/input/input19
      [ 1467.700124] hid-generic 0003:03F0:1F4A.0012: input,hidraw0: USB HID v1.11 Mouse [PixArt HP USB Optical Mouse] on usb-0000:05:00.3-2.2/input0
      
      So add HID_QUIRK_ALWAYS_POLL for this one as well.
      Test the patch, the mouse is no longer disconnected and there are no
      duplicate logs in dmesg.
      
      Reference:
      https://github.com/sriemer/fix-linux-mouseSigned-off-by: default avatarJinke Fan <fanjinke@hygon.cn>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      f1a0094c
  5. 12 Nov, 2019 1 commit
    • Candle Sun's avatar
      HID: core: check whether Usage Page item is after Usage ID items · 1cb0d2ae
      Candle Sun authored
      Upstream commit 58e75155 ("HID: core: move Usage Page concatenation
      to Main item") adds support for Usage Page item after Usage ID items
      (such as keyboards manufactured by Primax).
      
      Usage Page concatenation in Main item works well for following report
      descriptor patterns:
      
          USAGE_PAGE (Keyboard)                   05 07
          USAGE_MINIMUM (Keyboard LeftControl)    19 E0
          USAGE_MAXIMUM (Keyboard Right GUI)      29 E7
          LOGICAL_MINIMUM (0)                     15 00
          LOGICAL_MAXIMUM (1)                     25 01
          REPORT_SIZE (1)                         75 01
          REPORT_COUNT (8)                        95 08
          INPUT (Data,Var,Abs)                    81 02
      
      -------------
      
          USAGE_MINIMUM (Keyboard LeftControl)    19 E0
          USAGE_MAXIMUM (Keyboard Right GUI)      29 E7
          LOGICAL_MINIMUM (0)                     15 00
          LOGICAL_MAXIMUM (1)                     25 01
          REPORT_SIZE (1)                         75 01
          REPORT_COUNT (8)                        95 08
          USAGE_PAGE (Keyboard)                   05 07
          INPUT (Data,Var,Abs)                    81 02
      
      But it makes the parser act wrong for the following report
      descriptor pattern(such as some Gamepads):
      
          USAGE_PAGE (Button)                     05 09
          USAGE (Button 1)                        09 01
          USAGE (Button 2)                        09 02
          USAGE (Button 4)                        09 04
          USAGE (Button 5)                        09 05
          USAGE (Button 7)                        09 07
          USAGE (Button 8)                        09 08
          USAGE (Button 14)                       09 0E
          USAGE (Button 15)                       09 0F
          USAGE (Button 13)                       09 0D
          USAGE_PAGE (Consumer Devices)           05 0C
          USAGE (Back)                            0a 24 02
          USAGE (HomePage)                        0a 23 02
          LOGICAL_MINIMUM (0)                     15 00
          LOGICAL_MAXIMUM (1)                     25 01
          REPORT_SIZE (1)                         75 01
          REPORT_COUNT (11)                       95 0B
          INPUT (Data,Var,Abs)                    81 02
      
      With Usage Page concatenation in Main item, parser recognizes all the
      11 Usages as consumer keys, it is not the HID device's real intention.
      
      This patch checks whether Usage Page is really defined after Usage ID
      items by comparing usage page using status.
      
      Usage Page concatenation on currently defined Usage Page will always
      do in local parsing when Usage ID items encountered.
      
      When Main item is parsing, concatenation will do again with last
      defined Usage Page if this page has not been used in the previous
      usages concatenation.
      Signed-off-by: default avatarCandle Sun <candle.sun@unisoc.com>
      Signed-off-by: default avatarNianfu Bai <nianfu.bai@unisoc.com>
      Benjamin Tissoires <benjamin.tissoires@redhat.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      1cb0d2ae
  6. 07 Nov, 2019 1 commit
  7. 06 Nov, 2019 19 commits
    • Jason Gerecke's avatar
      HID: wacom: generic: Treat serial number and related fields as unsigned · ff479731
      Jason Gerecke authored
      The HID descriptors for most Wacom devices oddly declare the serial
      number and other related fields as signed integers. When these numbers
      are ingested by the HID subsystem, they are automatically sign-extended
      into 32-bit integers. We treat the fields as unsigned elsewhere in the
      kernel and userspace, however, so this sign-extension causes problems.
      In particular, the sign-extended tool ID sent to userspace as ABS_MISC
      does not properly match unsigned IDs used by xf86-input-wacom and libwacom.
      
      We introduce a function 'wacom_s32tou' that can undo the automatic sign
      extension performed by 'hid_snto32'. We call this function when processing
      the serial number and related fields to ensure that we are dealing with
      and reporting the unsigned form. We opt to use this method rather than
      adding a descriptor fixup in 'wacom_hid_usage_quirk' since it should be
      more robust in the face of future devices.
      
      Ref: https://github.com/linuxwacom/input-wacom/issues/134
      Fixes: f85c9dc6 ("HID: wacom: generic: Support tool ID and additional tool types")
      CC: <stable@vger.kernel.org> # v4.10+
      Signed-off-by: default avatarJason Gerecke <jason.gerecke@wacom.com>
      Reviewed-by: default avatarAaron Armstrong Skomra <aaron.skomra@wacom.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      ff479731
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 4dd58158
      Linus Torvalds authored
      Merge more fixes from Andrew Morton:
       "17 fixes"
      
      Mostly mm fixes and one ocfs2 locking fix.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges
        mm/memory_hotplug: fix updating the node span
        scripts/gdb: fix debugging modules compiled with hot/cold partitioning
        mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly
        MAINTAINERS: update information for "MEMORY MANAGEMENT"
        dump_stack: avoid the livelock of the dump_lock
        zswap: add Vitaly to the maintainers list
        mm/page_alloc.c: ratelimit allocation failure warnings more aggressively
        mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y
        mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo
        mm, vmstat: hide /proc/pagetypeinfo from normal users
        mm/mmu_notifiers: use the right return code for WARN_ON
        ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()
        mm: thp: handle page cache THP correctly in PageTransCompoundMap
        mm, meminit: recalculate pcpu batch and high limits after init completes
        mm/gup_benchmark: fix MAP_HUGETLB case
        mm: memcontrol: fix NULL-ptr deref in percpu stats flush
      4dd58158
    • Johannes Weiner's avatar
      mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges · 869712fd
      Johannes Weiner authored
      While upgrading from 4.16 to 5.2, we noticed these allocation errors in
      the log of the new kernel:
      
        SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
          cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
          node 0: slabs: 5, objs: 170, free: 0
      
              slab_out_of_memory+1
              ___slab_alloc+969
              __slab_alloc+14
              kmem_cache_alloc+346
              inet_twsk_alloc+60
              tcp_time_wait+46
              tcp_fin+206
              tcp_data_queue+2034
              tcp_rcv_state_process+784
              tcp_v6_do_rcv+405
              __release_sock+118
              tcp_close+385
              inet_release+46
              __sock_release+55
              sock_close+17
              __fput+170
              task_work_run+127
              exit_to_usermode_loop+191
              do_syscall_64+212
              entry_SYSCALL_64_after_hwframe+68
      
      accompanied by an increase in machines going completely radio silent
      under memory pressure.
      
      One thing that changed since 4.16 is e699e2c6 ("net, mm: account
      sock objects to kmemcg"), which made these slab caches subject to cgroup
      memory accounting and control.
      
      The problem with that is that cgroups, unlike the page allocator, do not
      maintain dedicated atomic reserves.  As a cgroup's usage hovers at its
      limit, atomic allocations - such as done during network rx - can fail
      consistently for extended periods of time.  The kernel is not able to
      operate under these conditions.
      
      We don't want to revert the culprit patch, because it indeed tracks a
      potentially substantial amount of memory used by a cgroup.
      
      We also don't want to implement dedicated atomic reserves for cgroups.
      There is no point in keeping a fixed margin of unused bytes in the
      cgroup's memory budget to accomodate a consumer that is impossible to
      predict - we'd be wasting memory and get into configuration headaches,
      not unlike what we have going with min_free_kbytes.  We do this for
      physical mem because we have to, but cgroups are an accounting game.
      
      Instead, account these privileged allocations to the cgroup, but let
      them bypass the configured limit if they have to.  This way, we get the
      benefits of accounting the consumed memory and have it exert pressure on
      the rest of the cgroup, but like with the page allocator, we shift the
      burden of reclaimining on behalf of atomic allocations onto the regular
      allocations that can block.
      
      Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
      Fixes: e699e2c6 ("net, mm: account sock objects to kmemcg")
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.18+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      869712fd
    • David Hildenbrand's avatar
      mm/memory_hotplug: fix updating the node span · 656d5711
      David Hildenbrand authored
      We recently started updating the node span based on the zone span to
      avoid touching uninitialized memmaps.
      
      Currently, we will always detect the node span to start at 0, meaning a
      node can easily span too many pages.  pgdat_is_empty() will still work
      correctly if all zones span no pages.  We should skip over all zones
      without spanned pages and properly handle the first detected zone that
      spans pages.
      
      Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node
      span cannot easily be inspected and tested.  The node span gives no real
      guarantees when an architecture supports memory hotplug, meaning it can
      easily contain holes or span pages of different nodes.
      
      The node span is not really used after init on architectures that
      support memory hotplug.
      
      E.g., we use it in mm/memory_hotplug.c:try_offline_node() and in
      mm/kmemleak.c:kmemleak_scan().  These users seem to be fine.
      
      Link: http://lkml.kernel.org/r/20191027222714.5313-1-david@redhat.com
      Fixes: 00d6c019 ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      656d5711
    • Ilya Leoshkevich's avatar
      scripts/gdb: fix debugging modules compiled with hot/cold partitioning · 8731acc5
      Ilya Leoshkevich authored
      gcc's -freorder-blocks-and-partition option makes it group frequently
      and infrequently used code in .text.hot and .text.unlikely sections
      respectively.  At least when building modules on s390, this option is
      used by default.
      
      gdb assumes that all code is located in .text section, and that .text
      section is located at module load address.  With such modules this is no
      longer the case: there is code in .text.hot and .text.unlikely, and
      either of them might precede .text.
      
      Fix by explicitly telling gdb the addresses of code sections.
      
      It might be tempting to do this for all sections, not only the ones in
      the white list.  Unfortunately, gdb appears to have an issue, when
      telling it about e.g. loadable .note.gnu.build-id section causes it to
      think that non-loadable .note.Linux section is loaded at address 0,
      which in turn causes NULL pointers to be resolved to bogus symbols.  So
      keep using the white list approach for the time being.
      
      Link: http://lkml.kernel.org/r/20191028152734.13065-1-iii@linux.ibm.comSigned-off-by: default avatarIlya Leoshkevich <iii@linux.ibm.com>
      Reviewed-by: default avatarJan Kiszka <jan.kiszka@siemens.com>
      Cc: Kieran Bingham <kbingham@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8731acc5
    • Roman Gushchin's avatar
      mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly · 221ec5c0
      Roman Gushchin authored
      page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
      slab pages, because it depends on PgHead AND PgSlab flags to be set to
      determine the memory cgroup from the kmem_cache.  It's correct for
      compound pages, but not for generic small pages.  Those don't have PgHead
      set, so it ends up returning zero.
      
      Fix this by replacing the condition to PageSlab() && !PageTail().
      
      Before this patch:
        [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
        0x0000000000000080	        38        0  _______S___________________________________	slab
      
      After this patch:
        [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
        0x0000000000000080	       147        0  _______S___________________________________	slab
      
      Also, hwpoison_filter_task() uses output of page_cgroup_ino() in order
      to filter error injection events based on memcg.  So if
      page_cgroup_ino() fails to return memcg pointer, we just fail to inject
      memory error.  Considering that hwpoison filter is for testing, affected
      users are limited and the impact should be marginal.
      
      [n-horiguchi@ah.jp.nec.com: changelog additions]
      Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      221ec5c0
    • Song Liu's avatar
      MAINTAINERS: update information for "MEMORY MANAGEMENT" · 6981b76c
      Song Liu authored
      I was trying to find the mm tree in MAINTAINERS by searching "Morton".
      Unfortunately, I didn't find one.  And I didn't even locate the MEMORY
      MANAGEMENT section quickly, because Andrew's name was not listed there.
      
      Thanks to Johannes who helped me find the mm tree.
      
      Let save other's time searching around by adding:
      
      M:	Andrew Morton <akpm@linux-foundation.org>
      T:	git git://github.com/hnaz/linux-mm.git
      
      [akpm@linux-foundation.org: add ozlabs.org quilt trees]
      Link: http://lkml.kernel.org/r/20191030202217.3498133-1-songliubraving@fb.comSigned-off-by: default avatarSong Liu <songliubraving@fb.com>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6981b76c
    • Kevin Hao's avatar
      dump_stack: avoid the livelock of the dump_lock · 5cbf2fff
      Kevin Hao authored
      In the current code, we use the atomic_cmpxchg() to serialize the output
      of the dump_stack(), but this implementation suffers the thundering herd
      problem.  We have observed such kind of livelock on a Marvell cn96xx
      board(24 cpus) when heavily using the dump_stack() in a kprobe handler.
      Actually we can let the competitors to wait for the releasing of the
      lock before jumping to atomic_cmpxchg().  This will definitely mitigate
      the thundering herd problem.  Thanks Linus for the suggestion.
      
      [akpm@linux-foundation.org: fix comment]
      Link: http://lkml.kernel.org/r/20191030031637.6025-1-haokexin@gmail.com
      Fixes: b58d9774 ("dump_stack: serialize the output from dump_stack()")
      Signed-off-by: default avatarKevin Hao <haokexin@gmail.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5cbf2fff
    • Vitaly Wool's avatar
      a3163130
    • Johannes Weiner's avatar
      mm/page_alloc.c: ratelimit allocation failure warnings more aggressively · 1be334e5
      Johannes Weiner authored
      While investigating a bug related to higher atomic allocation failures,
      we noticed the failure warnings positively drowning the console, and in
      our case trigger lockup warnings because of a serial console too slow to
      handle all that output.
      
      But even if we had a faster console, it's unclear what additional
      information the current level of repetition provides.
      
      Allocation failures happen for three reasons: The machine is OOM, the VM
      is failing to handle reasonable requests, or somebody is making
      unreasonable requests (and didn't acknowledge their opportunism with
      __GFP_NOWARN).  Having the memory dump, a callstack, and the ratelimit
      stats on skipped failure warnings should provide enough information to
      let users/admins/developers know whether something is wrong and point
      them in the right direction for debugging, bpftracing etc.
      
      Limit allocation failure warnings to one spew every ten seconds.
      
      Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1be334e5
    • Ville Syrjälä's avatar
      mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y · ec649c9d
      Ville Syrjälä authored
      I got some khugepaged spew on a 32bit x86:
      
        BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
        in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
        INFO: lockdep is turned off.
        CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
        Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203    07/08/2009
        Call Trace:
         dump_stack+0x66/0x8e
         ___might_sleep.cold.96+0x95/0xa6
         __might_sleep+0x2e/0x80
         collapse_huge_page.isra.51+0x5ac/0x1360
         khugepaged+0x9a9/0x20f0
         kthread+0xf5/0x110
         ret_from_fork+0x2e/0x38
      
      Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic()
      vs.  mmu_notifier_invalidate_range_start().  Let's do the naive approach
      and just reorder the two operations.
      
      Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com
      Fixes: 810e24e0 ("mm/mmu_notifiers: annotate with might_sleep()")
      Signed-off-by: default avatarVille Syrjl <ville.syrjala@linux.intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ec649c9d
    • Michal Hocko's avatar
      mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo · 93b3a674
      Michal Hocko authored
      pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
      This is not really nice because it blocks both any interrupts on that
      cpu and the page allocator.  On large machines this might even trigger
      the hard lockup detector.
      
      Considering the pagetypeinfo is a debugging tool we do not really need
      exact numbers here.  The primary reason to look at the outuput is to see
      how pageblocks are spread among different migratetypes and low number of
      pages is much more interesting therefore putting a bound on the number
      of pages on the free_list sounds like a reasonable tradeoff.
      
      The new output will simply tell
        [...]
        Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
      
      instead of
        Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
      
      The limit has been chosen arbitrary and it is a subject of a future
      change should there be a need for that.
      
      While we are at it, also drop the zone lock after each free_list
      iteration which will help with the IRQ and page allocator responsiveness
      even further as the IRQ lock held time is always bound to those 100k
      pages.
      
      [akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
      Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarWaiman Long <longman@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      93b3a674
    • Michal Hocko's avatar
      mm, vmstat: hide /proc/pagetypeinfo from normal users · abaed011
      Michal Hocko authored
      /proc/pagetypeinfo is a debugging tool to examine internal page
      allocator state wrt to fragmentation.  It is not very useful for any
      other use so normal users really do not need to read this file.
      
      Waiman Long has noticed that reading this file can have negative side
      effects because zone->lock is necessary for gathering data and that a)
      interferes with the page allocator and its users and b) can lead to hard
      lockups on large machines which have very long free_list.
      
      Reduce both issues by simply not exporting the file to regular users.
      
      Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
      Fixes: 467c996c ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarWaiman Long <longman@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarWaiman Long <longman@redhat.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Jann Horn <jannh@google.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      abaed011
    • Jason Gunthorpe's avatar
      mm/mmu_notifiers: use the right return code for WARN_ON · df2ec764
      Jason Gunthorpe authored
      The return code from the op callback is actually in _ret, while the
      WARN_ON was checking ret which causes it to misfire.
      
      Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
      Fixes: 8402ce61 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
      Signed-off-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df2ec764
    • Shuning Zhang's avatar
      ocfs2: protect extent tree in ocfs2_prepare_inode_for_write() · e74540b2
      Shuning Zhang authored
      When the extent tree is modified, it should be protected by inode
      cluster lock and ip_alloc_sem.
      
      The extent tree is accessed and modified in the
      ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem.
      
      The following is a case.  The function ocfs2_fiemap is accessing the
      extent tree, which is modified at the same time.
      
        kernel BUG at fs/ocfs2/extent_map.c:475!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
        CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
        Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
        task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
        RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
        Call Trace:
          ocfs2_fiemap+0x1e3/0x430 [ocfs2]
          do_vfs_ioctl+0x155/0x510
          SyS_ioctl+0x81/0xa0
          system_call_fastpath+0x18/0xd8
        Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
        RIP  ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
        ---[ end trace c8aa0c8180e869dc ]---
        Kernel panic - not syncing: Fatal exception
        Kernel Offset: disabled
      
      This issue can be reproduced every week in a production environment.
      
      This issue is related to the usage mode.  If others use ocfs2 in this
      mode, the kernel will panic frequently.
      
      [akpm@linux-foundation.org: coding style fixes]
      [Fix new warning due to unused function by removing said function - Linus ]
      Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.comSigned-off-by: default avatarShuning Zhang <sunny.s.zhang@oracle.com>
      Reviewed-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: default avatarGang He <ghe@suse.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e74540b2
    • Yang Shi's avatar
      mm: thp: handle page cache THP correctly in PageTransCompoundMap · 169226f7
      Yang Shi authored
      We have a usecase to use tmpfs as QEMU memory backend and we would like
      to take the advantage of THP as well.  But, our test shows the EPT is
      not PMD mapped even though the underlying THP are PMD mapped on host.
      The number showed by /sys/kernel/debug/kvm/largepage is much less than
      the number of PMD mapped shmem pages as the below:
      
        7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   579584 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        12
      
      And some benchmarks do worse than with anonymous THPs.
      
      By digging into the code we figured out that commit 127393fb ("mm:
      thp: kvm: fix memory corruption in KVM with THP enabled") checks if
      there is a single PTE mapping on the page for anonymous THP when setting
      up EPT map.  But the _mapcount < 0 check doesn't work for page cache THP
      since every subpage of page cache THP would get _mapcount inc'ed once it
      is PMD mapped, so PageTransCompoundMap() always returns false for page
      cache THP.  This would prevent KVM from setting up PMD mapped EPT entry.
      
      So we need handle page cache THP correctly.  However, when page cache
      THP's PMD gets split, kernel just remove the map instead of setting up
      PTE map like what anonymous THP does.  Before KVM calls get_user_pages()
      the subpages may get PTE mapped even though it is still a THP since the
      page cache THP may be mapped by other processes at the mean time.
      
      Checking its _mapcount and whether the THP has PTE mapped or not.
      Although this may report some false negative cases (PTE mapped by other
      processes), it looks not trivial to make this accurate.
      
      With this fix /sys/kernel/debug/kvm/largepage would show reasonable
      pages are PMD mapped by EPT as the below:
      
        7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted)
        Size:            4194304 kB
        [snip]
        AnonHugePages:         0 kB
        ShmemPmdMapped:   557056 kB
        [snip]
        Locked:                0 kB
      
        cat /sys/kernel/debug/kvm/largepages
        271
      
      And the benchmarks are as same as anonymous THPs.
      
      [yang.shi@linux.alibaba.com: v4]
        Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd78fedd ("rmap: support file thp")
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Reported-by: default avatarGang Deng <gavin.dg@linux.alibaba.com>
      Tested-by: default avatarGang Deng <gavin.dg@linux.alibaba.com>
      Suggested-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      169226f7
    • Mel Gorman's avatar
      mm, meminit: recalculate pcpu batch and high limits after init completes · 3e8fc007
      Mel Gorman authored
      Deferred memory initialisation updates zone->managed_pages during the
      initialisation phase but before that finishes, the per-cpu page
      allocator (pcpu) calculates the number of pages allocated/freed in
      batches as well as the maximum number of pages allowed on a per-cpu
      list.  As zone->managed_pages is not up to date yet, the pcpu
      initialisation calculates inappropriately low batch and high values.
      
      This increases zone lock contention quite severely in some cases with
      the degree of severity depending on how many CPUs share a local zone and
      the size of the zone.  A private report indicated that kernel build
      times were excessive with extremely high system CPU usage.  A perf
      profile indicated that a large chunk of time was lost on zone->lock
      contention.
      
      This patch recalculates the pcpu batch and high values after deferred
      initialisation completes for every populated zone in the system.  It was
      tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
      workload -- allmodconfig and all available CPUs.
      
      mmtests configuration: config-workload-kernbench-max Configuration was
      modified to build on a fresh XFS partition.
      
      kernbench
                                      5.4.0-rc3              5.4.0-rc3
                                        vanilla           resetpcpu-v2
      Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
      Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
      Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
      Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
      Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
      Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
      
                         5.4.0-rc3    5.4.0-rc3
                           vanilla resetpcpu-v2
      Duration User       39766.24     49221.79
      Duration System     44298.10     13361.67
      Duration Elapsed      519.11       388.87
      
      The patch reduces system CPU usage by 69.86% and total build time by
      26.65%.  The variance of system CPU usage is also much reduced.
      
      Before, this was the breakdown of batch and high values over all zones
      was:
      
          256               batch: 1
          256               batch: 63
          512               batch: 7
          256               high:  0
          256               high:  378
          512               high:  42
      
      512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
      the patch:
      
          256               batch: 1
          768               batch: 63
          256               high:  0
          768               high:  378
      
      [mgorman@techsingularity.net: fix merge/linkage snafu]
        Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.netLink: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.netSigned-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3e8fc007
    • John Hubbard's avatar
      mm/gup_benchmark: fix MAP_HUGETLB case · 64801d19
      John Hubbard authored
      The MAP_HUGETLB ("-H" option) of gup_benchmark fails:
      
        $ sudo ./gup_benchmark -H
        mmap: Invalid argument
      
      This is because gup_benchmark.c is passing in a file descriptor to
      mmap(), but the fd came from opening up the /dev/zero file.  This
      confuses the mmap syscall implementation, which thinks that, if the
      caller did not specify MAP_ANONYMOUS, then the file must be a huge page
      file.  So it attempts to verify that the file really is a huge page
      file, as you can see here:
      
      ksys_mmap_pgoff()
      {
          if (!(flags & MAP_ANONYMOUS)) {
              retval = -EINVAL;
              if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
                  goto out_fput; /* THIS IS WHERE WE END UP */
      
          else if (flags & MAP_HUGETLB) {
              ...proceed normally, /dev/zero is ok here...
      
      ...and of course is_file_hugepages() returns "false" for the /dev/zero
      file.
      
      The problem is that the user space program, gup_benchmark.c, really just
      wants anonymous memory here.  The simplest way to get that is to pass
      MAP_ANONYMOUS whenever MAP_HUGETLB is specified, so that's what this
      patch does.
      
      Link: http://lkml.kernel.org/r/20191021212435.398153-2-jhubbard@nvidia.comSigned-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      64801d19
    • Shakeel Butt's avatar
      mm: memcontrol: fix NULL-ptr deref in percpu stats flush · 7961eee3
      Shakeel Butt authored
      __mem_cgroup_free() can be called on the failure path in
      mem_cgroup_alloc().  However memcg_flush_percpu_vmstats() and
      memcg_flush_percpu_vmevents() which are called from __mem_cgroup_free()
      access the fields of memcg which can potentially be null if called from
      failure path from mem_cgroup_alloc().  Indeed syzbot has reported the
      following crash:
      
      	kasan: CONFIG_KASAN_INLINE enabled
      	kasan: GPF could be caused by NULL-ptr deref or user memory access
      	general protection fault: 0000 [#1] PREEMPT SMP KASAN
      	CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
      	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      	RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
      	Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
      	RSP: 0018:ffff888095c27980 EFLAGS: 00010206
      	RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000
      	RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007
      	RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33
      	R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760
      	R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007
      	FS:  00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	__mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
      	mem_cgroup_free mm/memcontrol.c:5033 [inline]
      	mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
      	css_create kernel/cgroup/cgroup.c:5156 [inline]
      	cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
      	cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
      	kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
      	vfs_mkdir+0x42e/0x670 fs/namei.c:3807
      	do_mkdirat+0x234/0x2a0 fs/namei.c:3830
      	__do_sys_mkdir fs/namei.c:3846 [inline]
      	__se_sys_mkdir fs/namei.c:3844 [inline]
      	__x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
      	do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
      	entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixing this by moving the flush to mem_cgroup_free as there is no need
      to flush anything if we see failure in mem_cgroup_alloc().
      
      Link: http://lkml.kernel.org/r/20191018165231.249872-1-shakeelb@google.com
      Fixes: bb65f89b ("mm: memcontrol: flush percpu vmevents before releasing memcg")
      Fixes: c350a99e ("mm: memcontrol: flush percpu vmstats before releasing memcg")
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+515d5bcfe179cdf049b2@syzkaller.appspotmail.com
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7961eee3
  8. 05 Nov, 2019 3 commits
    • Linus Torvalds's avatar
      Merge tag 'for-linus-2019-11-05' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 26bc6721
      Linus Torvalds authored
      Pull clone3 stack argument update from Christian Brauner:
       "This changes clone3() to do basic stack validation and to set up the
        stack depending on whether or not it is growing up or down.
      
        With clone3() the expectation is now very simply that the .stack
        argument points to the lowest address of the stack and that
        .stack_size specifies the initial stack size. This is diferent from
        legacy clone() where the "stack" argument had to point to the lowest
        or highest address of the stack depending on the architecture.
      
        clone3() was released with 5.3. Currently, it is not documented and
        very unclear to userspace how the stack and stack_size argument have
        to be passed. After talking to glibc folks we concluded that changing
        clone3() to determine stack direction and doing basic validation is
        the right course of action.
      
        Note, this is a potentially user visible change. In the very unlikely
        case, that it breaks someone's use-case we will revert. (And then e.g.
        place the new behavior under an appropriate flag.)
      
        Note that passing an empty stack will continue working just as before.
        Breaking someone's use-case is very unlikely. Neither glibc nor musl
        currently expose a wrapper for clone3(). There is currently also no
        real motivation for anyone to use clone3() directly. First, because
        using clone{3}() with stacks requires some assembly (see glibc and
        musl). Second, because it does not provide features that legacy
        clone() doesn't. New features for clone3() will first happen in v5.5
        which is why v5.4 is still a good time to try and make that change now
        and backport it to v5.3.
      
        I did a codesearch on https://codesearch.debian.net, github, and
        gitlab and could not find any software currently relying directly on
        clone3(). I expect this to change once we land CLONE_CLEAR_SIGHAND
        which was a request coming from glibc at which point they'll likely
        start using it"
      
      * tag 'for-linus-2019-11-05' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        clone3: validate stack arguments
      26bc6721
    • Linus Torvalds's avatar
      Merge tag 'gpio-v5.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio · 7111fa11
      Linus Torvalds authored
      Pull GPIO fixes from Linus Walleij:
       "More GPIO fixes! We found a late regression in the Intel Merrifield
        driver. Oh well. We fixed it up.
      
         - Fix a build error in the tools used for kselftest
      
         - A series of reverts to bring the Intel Merrifield back to working.
      
        We will likely unrevert the reverts for v5.5 but we can't have v5.4
        broken"
      
      * tag 'gpio-v5.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
        Revert "gpio: merrifield: Pass irqchip when adding gpiochip"
        Revert "gpio: merrifield: Restore use of irq_base"
        Revert "gpio: merrifield: Move hardware initialization to callback"
        tools: gpio: Use !building_out_of_srctree to determine srctree
      7111fa11
    • Christian Brauner's avatar
      clone3: validate stack arguments · fa729c4d
      Christian Brauner authored
      Validate the stack arguments and setup the stack depening on whether or not
      it is growing down or up.
      
      Legacy clone() required userspace to know in which direction the stack is
      growing and pass down the stack pointer appropriately. To make things more
      confusing microblaze uses a variant of the clone() syscall selected by
      CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
      IA64 has a separate clone2() syscall which also takes an additional
      stack_size argument. Finally, parisc has a stack that is growing upwards.
      Userspace therefore has a lot nasty code like the following:
      
       #define __STACK_SIZE (8 * 1024 * 1024)
       pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
       {
               pid_t ret;
               void *stack;
      
               stack = malloc(__STACK_SIZE);
               if (!stack)
                       return -ENOMEM;
      
       #ifdef __ia64__
               ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
       #elif defined(__parisc__) /* stack grows up */
               ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
       #else
               ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
       #endif
               return ret;
       }
      
      or even crazier variants such as [3].
      
      With clone3() we have the ability to validate the stack. We can check that
      when stack_size is passed, the stack pointer is valid and the other way
      around. We can also check that the memory area userspace gave us is fine to
      use via access_ok(). Furthermore, we probably should not require
      userspace to know in which direction the stack is growing. It is easy
      for us to do this in the kernel and I couldn't find the original
      reasoning behind exposing this detail to userspace.
      
      /* Intentional user visible API change */
      clone3() was released with 5.3. Currently, it is not documented and very
      unclear to userspace how the stack and stack_size argument have to be
      passed. After talking to glibc folks we concluded that trying to change
      clone3() to setup the stack instead of requiring userspace to do this is
      the right course of action.
      Note, that this is an explicit change in user visible behavior we introduce
      with this patch. If it breaks someone's use-case we will revert! (And then
      e.g. place the new behavior under an appropriate flag.)
      Breaking someone's use-case is very unlikely though. First, neither glibc
      nor musl currently expose a wrapper for clone3(). Second, there is no real
      motivation for anyone to use clone3() directly since it does not provide
      features that legacy clone doesn't. New features for clone3() will first
      happen in v5.5 which is why v5.4 is still a good time to try and make that
      change now and backport it to v5.3. Searches on [4] did not reveal any
      packages calling clone3().
      
      [1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com
      [2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
      [3]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/raw-clone.h#L31
      [4]: https://codesearch.debian.net
      Fixes: 7f192e3c ("fork: add clone3")
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-api@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 5.3
      Cc: GNU C Library <libc-alpha@sourceware.org>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarAleksa Sarai <cyphar@cyphar.com>
      Link: https://lore.kernel.org/r/20191031113608.20713-1-christian.brauner@ubuntu.com
      fa729c4d
  9. 03 Nov, 2019 3 commits