1. 23 Aug, 2021 1 commit
    • Eric W. Biederman's avatar
      ucounts: Fix regression preventing increasing of rlimits in init_user_ns · 5ddf994f
      Eric W. Biederman authored
      "Ma, XinjianX" <xinjianx.ma@intel.com> reported:
      
      > When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
      > in kselftest failed with following message.
      >
      > # selftests: mqueue: mq_perf_tests
      > #
      > # Initial system state:
      > #       Using queue path:                       /mq_perf_tests
      > #       RLIMIT_MSGQUEUE(soft):                  819200
      > #       RLIMIT_MSGQUEUE(hard):                  819200
      > #       Maximum Message Size:                   8192
      > #       Maximum Queue Size:                     10
      > #       Nice value:                             0
      > #
      > # Adjusted system state for testing:
      > #       RLIMIT_MSGQUEUE(soft):                  (unlimited)
      > #       RLIMIT_MSGQUEUE(hard):                  (unlimited)
      > #       Maximum Message Size:                   16777216
      > #       Maximum Queue Size:                     65530
      > #       Nice value:                             -20
      > #       Continuous mode:                        (disabled)
      > #       CPUs to pin:                            3
      > # ./mq_perf_tests: mq_open() at 296: Too many open files
      > not ok 2 selftests: mqueue: mq_perf_tests # exit=1
      >
      > Test env:
      > rootfs: debian-10
      > gcc version: 9
      
      After investigation the problem turned out to be that ucount_max for
      the rlimits in init_user_ns was being set to the initial rlimit value.
      The practical problem is that ucount_max provides a limit that
      applications inside the user namespace cannot exceed, which means in
      practice that rlimits that have been converted to use the ucount
      infrastructure were not able to exceed their initial rlimits.
      
      Solve this by setting the relevant values of ucount_max to
      RLIM_INFINITY.  A limit in init_user_ns is pointless so the code
      should allow the values to grow as large as possible without risking
      an underflow or an overflow.
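
      The effect of the ceiling can be illustrated with a small userspace
      model.  This is only a sketch under stated assumptions: struct
      ns_counter and charge() are hypothetical names, not kernel code, and
      LONG_MAX stands in for RLIM_INFINITY.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these names are not the kernel's. */
struct ns_counter {
	long value;	/* current charge in this namespace */
	long max;	/* namespace-wide ceiling (models ucount_max) */
};

/*
 * Try to charge `amount` against the counter.  The new total must fit
 * under both the caller's rlimit and the namespace ceiling.
 */
static bool charge(struct ns_counter *c, long amount, long rlimit)
{
	long new_value;

	assert(c != NULL);
	if (amount < 0 || amount > LONG_MAX - c->value)
		return false;	/* refuse anything that would overflow */

	new_value = c->value + amount;
	if (new_value > rlimit || new_value > c->max)
		return false;

	c->value = new_value;
	return true;
}
```

      With max left at the initial limit of 819200, a charge of 1000000
      fails even after the rlimit has been raised to LONG_MAX; with max
      set to LONG_MAX the same charge succeeds and only the rlimit governs.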
      
      As the ltp test case was a bit of a pain I have reproduced the rlimit failure
      and tested the fix with the following little C program:
      > #include <stdio.h>
      > #include <fcntl.h>
      > #include <sys/stat.h>
      > #include <mqueue.h>
      > #include <sys/time.h>
      > #include <sys/resource.h>
      > #include <errno.h>
      > #include <string.h>
      > #include <stdlib.h>
      > #include <limits.h>
      > #include <unistd.h>
      >
      > int main(int argc, char **argv)
      > {
      > 	struct mq_attr mq_attr;
      > 	struct rlimit rlim;
      > 	mqd_t mqd;
      > 	int ret;
      >
      > 	ret = getrlimit(RLIMIT_MSGQUEUE, &rlim);
      > 	if (ret != 0) {
      > 		fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      > 	printf("RLIMIT_MSGQUEUE %lu %lu\n",
      > 	       rlim.rlim_cur, rlim.rlim_max);
      > 	rlim.rlim_cur = RLIM_INFINITY;
      > 	rlim.rlim_max = RLIM_INFINITY;
      > 	ret = setrlimit(RLIMIT_MSGQUEUE, &rlim);
      > 	if (ret != 0) {
      > 		fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      >
      > 	memset(&mq_attr, 0, sizeof(struct mq_attr));
      > 	mq_attr.mq_maxmsg = 65536 - 1;
      > 	mq_attr.mq_msgsize = 16*1024*1024 - 1;
      >
      > 	mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &mq_attr);
      > 	if (mqd == (mqd_t)-1) {
      > 		fprintf(stderr, "mq_open failed: %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      > 	ret = mq_close(mqd);
      > 	if (ret) {
      > 		fprintf(stderr, "mq_close failed; %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      >
      > 	return EXIT_SUCCESS;
      > }
      
      Fixes: 6e52a9f0 ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
      Fixes: d7c9e99a ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
      Fixes: d6469690 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
      Fixes: 21d1c5e3 ("Reimplement RLIMIT_NPROC on top of ucounts")
      Reported-by: kernel test robot lkp@intel.com
      Acked-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/87eeajswfc.fsf_-_@disp2133
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      5ddf994f
  2. 09 Aug, 2021 1 commit
  3. 28 Jul, 2021 1 commit
    • Alexey Gladkov's avatar
      ucounts: Fix race condition between alloc_ucounts and put_ucounts · 345daff2
      Alexey Gladkov authored
      The race happens because put_ucounts() decrements the reference count
      without holding the spinlock, and get_ucounts() is called after the
      spinlock has been dropped:
      
      CPU0                    CPU1
      ----                    ----
      alloc_ucounts()         put_ucounts()
      
      spin_lock_irq(&ucounts_lock);
      ucounts = find_ucounts(ns, uid, hashent);
      
                              atomic_dec_and_test(&ucounts->count)
      
      spin_unlock_irq(&ucounts_lock);
      
                              spin_lock_irqsave(&ucounts_lock, flags);
                              hlist_del_init(&ucounts->node);
                              spin_unlock_irqrestore(&ucounts_lock, flags);
                              kfree(ucounts);
      
      ucounts = get_ucounts(ucounts);
      
      ==================================================================
      BUG: KASAN: use-after-free in instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
      BUG: KASAN: use-after-free in atomic_add_negative include/asm-generic/atomic-instrumented.h:556 [inline]
      BUG: KASAN: use-after-free in get_ucounts kernel/ucount.c:152 [inline]
      BUG: KASAN: use-after-free in get_ucounts kernel/ucount.c:150 [inline]
      BUG: KASAN: use-after-free in alloc_ucounts+0x19b/0x5b0 kernel/ucount.c:188
      Write of size 4 at addr ffff88802821e41c by task syz-executor.4/16785
      
      CPU: 1 PID: 16785 Comm: syz-executor.4 Not tainted 5.14.0-rc1-next-20210712-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:105
       print_address_description.constprop.0.cold+0x6c/0x309 mm/kasan/report.c:233
       __kasan_report mm/kasan/report.c:419 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:436
       check_region_inline mm/kasan/generic.c:183 [inline]
       kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
       instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
       atomic_add_negative include/asm-generic/atomic-instrumented.h:556 [inline]
       get_ucounts kernel/ucount.c:152 [inline]
       get_ucounts kernel/ucount.c:150 [inline]
       alloc_ucounts+0x19b/0x5b0 kernel/ucount.c:188
       set_cred_ucounts+0x171/0x3a0 kernel/cred.c:684
       __sys_setuid+0x285/0x400 kernel/sys.c:623
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x4665d9
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007fde54097188 EFLAGS: 00000246 ORIG_RAX: 0000000000000069
      RAX: ffffffffffffffda RBX: 000000000056bf80 RCX: 00000000004665d9
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000ff
      RBP: 00000000004bfcb9 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 000000000056bf80
      R13: 00007ffc8655740f R14: 00007fde54097300 R15: 0000000000022000
      
      Allocated by task 16784:
       kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       ____kasan_kmalloc mm/kasan/common.c:513 [inline]
       ____kasan_kmalloc mm/kasan/common.c:472 [inline]
       __kasan_kmalloc+0x9b/0xd0 mm/kasan/common.c:522
       kmalloc include/linux/slab.h:591 [inline]
       kzalloc include/linux/slab.h:721 [inline]
       alloc_ucounts+0x23d/0x5b0 kernel/ucount.c:169
       set_cred_ucounts+0x171/0x3a0 kernel/cred.c:684
       __sys_setuid+0x285/0x400 kernel/sys.c:623
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Freed by task 16785:
       kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
       kasan_set_track+0x1c/0x30 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xfb/0x130 mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:229 [inline]
       slab_free_hook mm/slub.c:1650 [inline]
       slab_free_freelist_hook+0xdf/0x240 mm/slub.c:1675
       slab_free mm/slub.c:3235 [inline]
       kfree+0xeb/0x650 mm/slub.c:4295
       put_ucounts kernel/ucount.c:200 [inline]
       put_ucounts+0x117/0x150 kernel/ucount.c:192
       put_cred_rcu+0x27a/0x520 kernel/cred.c:124
       rcu_do_batch kernel/rcu/tree.c:2550 [inline]
       rcu_core+0x7ab/0x1380 kernel/rcu/tree.c:2785
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Last potentially related work creation:
       kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
       kasan_record_aux_stack+0xe5/0x110 mm/kasan/generic.c:348
       insert_work+0x48/0x370 kernel/workqueue.c:1332
       __queue_work+0x5c1/0xed0 kernel/workqueue.c:1498
       queue_work_on+0xee/0x110 kernel/workqueue.c:1525
       queue_work include/linux/workqueue.h:507 [inline]
       call_usermodehelper_exec+0x1f0/0x4c0 kernel/umh.c:435
       kobject_uevent_env+0xf8f/0x1650 lib/kobject_uevent.c:618
       netdev_queue_add_kobject net/core/net-sysfs.c:1621 [inline]
       netdev_queue_update_kobjects+0x374/0x450 net/core/net-sysfs.c:1655
       register_queue_kobjects net/core/net-sysfs.c:1716 [inline]
       netdev_register_kobject+0x35a/0x430 net/core/net-sysfs.c:1959
       register_netdevice+0xd33/0x1500 net/core/dev.c:10331
       nsim_init_netdevsim drivers/net/netdevsim/netdev.c:317 [inline]
       nsim_create+0x381/0x4d0 drivers/net/netdevsim/netdev.c:364
       __nsim_dev_port_add+0x32e/0x830 drivers/net/netdevsim/dev.c:1295
       nsim_dev_port_add_all+0x53/0x150 drivers/net/netdevsim/dev.c:1355
       nsim_dev_probe+0xcb5/0x1190 drivers/net/netdevsim/dev.c:1496
       call_driver_probe drivers/base/dd.c:517 [inline]
       really_probe+0x23c/0xcd0 drivers/base/dd.c:595
       __driver_probe_device+0x338/0x4d0 drivers/base/dd.c:747
       driver_probe_device+0x4c/0x1a0 drivers/base/dd.c:777
       __device_attach_driver+0x20b/0x2f0 drivers/base/dd.c:894
       bus_for_each_drv+0x15f/0x1e0 drivers/base/bus.c:427
       __device_attach+0x228/0x4a0 drivers/base/dd.c:965
       bus_probe_device+0x1e4/0x290 drivers/base/bus.c:487
       device_add+0xc2f/0x2180 drivers/base/core.c:3356
       nsim_bus_dev_new drivers/net/netdevsim/bus.c:431 [inline]
       new_device_store+0x436/0x710 drivers/net/netdevsim/bus.c:298
       bus_attr_store+0x72/0xa0 drivers/base/bus.c:122
       sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:139
       kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2152 [inline]
       new_sync_write+0x426/0x650 fs/read_write.c:518
       vfs_write+0x75a/0xa40 fs/read_write.c:605
       ksys_write+0x12d/0x250 fs/read_write.c:658
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Second to last potentially related work creation:
       kasan_save_stack+0x1b/0x40 mm/kasan/common.c:38
       kasan_record_aux_stack+0xe5/0x110 mm/kasan/generic.c:348
       insert_work+0x48/0x370 kernel/workqueue.c:1332
       __queue_work+0x5c1/0xed0 kernel/workqueue.c:1498
       queue_work_on+0xee/0x110 kernel/workqueue.c:1525
       queue_work include/linux/workqueue.h:507 [inline]
       call_usermodehelper_exec+0x1f0/0x4c0 kernel/umh.c:435
       kobject_uevent_env+0xf8f/0x1650 lib/kobject_uevent.c:618
       kobject_synth_uevent+0x701/0x850 lib/kobject_uevent.c:208
       uevent_store+0x20/0x50 drivers/base/core.c:2371
       dev_attr_store+0x50/0x80 drivers/base/core.c:2072
       sysfs_kf_write+0x110/0x160 fs/sysfs/file.c:139
       kernfs_fop_write_iter+0x342/0x500 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:2152 [inline]
       new_sync_write+0x426/0x650 fs/read_write.c:518
       vfs_write+0x75a/0xa40 fs/read_write.c:605
       ksys_write+0x12d/0x250 fs/read_write.c:658
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff88802821e400
       which belongs to the cache kmalloc-192 of size 192
      The buggy address is located 28 bytes inside of
       192-byte region [ffff88802821e400, ffff88802821e4c0)
      The buggy address belongs to the page:
      page:ffffea0000a08780 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2821e
      flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000000200 dead000000000100 dead000000000122 ffff888010841a00
      raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12cc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY), pid 1, ts 12874702440, free_ts 12637793385
       prep_new_page mm/page_alloc.c:2433 [inline]
       get_page_from_freelist+0xa72/0x2f80 mm/page_alloc.c:4166
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5374
       alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2119
       alloc_pages+0x238/0x2a0 mm/mempolicy.c:2242
       alloc_slab_page mm/slub.c:1713 [inline]
       allocate_slab+0x32b/0x4c0 mm/slub.c:1853
       new_slab mm/slub.c:1916 [inline]
       new_slab_objects mm/slub.c:2662 [inline]
       ___slab_alloc+0x4ba/0x820 mm/slub.c:2825
       __slab_alloc.constprop.0+0xa7/0xf0 mm/slub.c:2865
       slab_alloc_node mm/slub.c:2947 [inline]
       slab_alloc mm/slub.c:2989 [inline]
       __kmalloc+0x312/0x330 mm/slub.c:4133
       kmalloc include/linux/slab.h:596 [inline]
       kzalloc include/linux/slab.h:721 [inline]
       __register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1318
       rds_tcp_init_net+0x1db/0x4f0 net/rds/tcp.c:551
       ops_init+0xaf/0x470 net/core/net_namespace.c:140
       __register_pernet_operations net/core/net_namespace.c:1137 [inline]
       register_pernet_operations+0x35a/0x850 net/core/net_namespace.c:1214
       register_pernet_device+0x26/0x70 net/core/net_namespace.c:1301
       rds_tcp_init+0x77/0xe0 net/rds/tcp.c:717
       do_one_initcall+0x103/0x650 init/main.c:1285
       do_initcall_level init/main.c:1360 [inline]
       do_initcalls init/main.c:1376 [inline]
       do_basic_setup init/main.c:1396 [inline]
       kernel_init_freeable+0x6b8/0x741 init/main.c:1598
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1343 [inline]
       free_pcp_prepare+0x312/0x7d0 mm/page_alloc.c:1394
       free_unref_page_prepare mm/page_alloc.c:3329 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3408
       __vunmap+0x783/0xb70 mm/vmalloc.c:2587
       free_work+0x58/0x70 mm/vmalloc.c:82
       process_one_work+0x98d/0x1630 kernel/workqueue.c:2276
       worker_thread+0x658/0x11f0 kernel/workqueue.c:2422
       kthread+0x3e5/0x4d0 kernel/kthread.c:319
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
      
      Memory state around the buggy address:
       ffff88802821e300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff88802821e380: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
      >ffff88802821e400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                  ^
       ffff88802821e480: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
       ffff88802821e500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      ==================================================================
      
      - The race fix has two parts.
        * Changing the code to guarantee that ucounts->count is only decremented
          when ucounts_lock is held.  This guarantees that find_ucounts
          will never find a structure with a zero reference count.
        * Changing alloc_ucounts to increment ucounts->count while
          ucounts_lock is held.  This guarantees the reference count on the
          found data structure will not be decremented to zero (and the data
          structure freed) before the reference count is incremented.
        -- Eric Biederman
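
      The two parts described above can be sketched in userspace.  This is
      a minimal model, not the kernel implementation: struct entry,
      entry_get() and entry_put() are hypothetical names, a C11 atomic_flag
      stands in for ucounts_lock, and the allocation is done under the lock
      only to keep the example short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace model of a refcounted lookup table. */
struct entry {
	int uid;
	int count;		/* only modified while table_lock is held */
	struct entry *next;
};

static atomic_flag table_lock = ATOMIC_FLAG_INIT;
static struct entry *table_head;

static void lock_table(void)
{
	while (atomic_flag_test_and_set_explicit(&table_lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void unlock_table(void)
{
	atomic_flag_clear_explicit(&table_lock, memory_order_release);
}

/* Find or create the entry for uid; the reference is taken under the lock. */
static struct entry *entry_get(int uid)
{
	struct entry *e;

	lock_table();
	for (e = table_head; e; e = e->next)
		if (e->uid == uid)
			break;
	if (!e) {
		e = calloc(1, sizeof(*e));
		if (!e) {
			unlock_table();
			return NULL;
		}
		e->uid = uid;
		e->next = table_head;
		table_head = e;
	}
	e->count++;	/* incremented before the lock is dropped */
	unlock_table();
	return e;
}

/* Drop a reference; the decrement and the unlink share one critical section. */
static void entry_put(struct entry *e)
{
	struct entry **pp;

	if (!e)
		return;
	lock_table();
	assert(e->count > 0);
	if (--e->count > 0) {
		unlock_table();
		return;
	}
	for (pp = &table_head; *pp; pp = &(*pp)->next) {
		if (*pp == e) {
			*pp = e->next;
			break;
		}
	}
	unlock_table();
	free(e);	/* no lookup can reach e any more */
}
```

      Because the count only reaches zero while the lock is held and the
      entry is unlinked in the same critical section, a lookup never finds
      an entry with a zero count, and a found entry cannot be freed before
      its count has been incremented.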
      
      Reported-by: syzbot+01985d7909f9468f013c@syzkaller.appspotmail.com
      Reported-by: syzbot+59dd63761094a80ad06d@syzkaller.appspotmail.com
      Reported-by: syzbot+6cd79f45bb8fa1c9eeae@syzkaller.appspotmail.com
      Reported-by: syzbot+b6e65bd125a05f803d6b@syzkaller.appspotmail.com
      Fixes: b6c33652 ("Use atomic_t for ucounts reference counting")
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/7b2ace1759b281cdd2d66101d6b305deef722efb.1627397820.git.legion@kernel.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      345daff2
  4. 25 Jul, 2021 9 commits
  5. 24 Jul, 2021 27 commits
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 6498f615
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
      
       - properly set the memory size, which fixes 32-bit systems
      
       - allow initrd to load anywhere in memory, rather than restricting it
         to the first 256MiB
      
       - fix the 'mem=' parameter on 64-bit systems to properly account for
         the maximum supported memory now that the kernel is outside the
         linear map
      
       - avoid installing mappings into the last 4KiB of memory, which
         conflicts with error values
      
       - prevent the stack from being freed while it is being walked
      
       - a handful of fixes to the new copy to/from user routines
      
      * tag 'riscv-for-linus-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: __asm_copy_to-from_user: Fix: Typos in comments
        riscv: __asm_copy_to-from_user: Remove unnecessary size check
        riscv: __asm_copy_to-from_user: Fix: fail on RV32
        riscv: __asm_copy_to-from_user: Fix: overrun copy
        riscv: stacktrace: pin the task's stack in get_wchan
        riscv: Make sure the kernel mapping does not overlap with IS_ERR_VALUE
        riscv: Make sure the linear mapping does not use the kernel mapping
        riscv: Fix memory_limit for 64-bit kernel
        RISC-V: load initrd wherever it fits into memory
        riscv: Fix 32-bit RISC-V boot failure
      6498f615
    • Linus Torvalds's avatar
      ACPI: fix NULL pointer dereference · fc68f42a
      Linus Torvalds authored
      Commit 71f64283 ("ACPI: utils: Fix reference counting in
      for_each_acpi_dev_match()") started doing "acpi_dev_put()" on a pointer
      that was possibly NULL.  That fails miserably, because that helper
      inline function is not set up to handle that case.
      
      Just make acpi_dev_put() silently accept a NULL pointer, rather than
      calling down to put_device() with an invalid offset off that NULL
      pointer.
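
      The shape of such a NULL-tolerant helper can be sketched as follows.
      This is an illustration only: struct dev_model, struct adev_model and
      both functions are hypothetical stand-ins, not the ACPI or driver
      core definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the embedded, refcounted device object. */
struct dev_model {
	int refcount;
};

/* Hypothetical stand-in for the outer object that embeds the device. */
struct adev_model {
	int handle;		/* placeholder so that dev is not at offset 0 */
	struct dev_model dev;
};

/* Inner put: requires a valid pointer, like the helper it models. */
static void dev_model_put(struct dev_model *dev)
{
	assert(dev != NULL);
	dev->refcount--;
}

/*
 * Outer put: without the NULL check, &adev->dev on a NULL adev would be
 * a small non-NULL address (the member offset), which the inner helper
 * cannot recognise as invalid.
 */
static void adev_model_put(struct adev_model *adev)
{
	if (!adev)
		return;
	dev_model_put(&adev->dev);
}
```

      Calling adev_model_put(NULL) is then a no-op, so a loop that may end
      with a NULL cursor can drop its reference unconditionally.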
      
      Link: https://lore.kernel.org/lkml/a607c149-6bf6-0fd0-0e31-100378504da2@kernel.dk/
      Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
      Tested-by: Daniel Scally <djrscally@gmail.com>
      Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc68f42a
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 7ffca2bb
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Four fixes, all in drivers, all of which can lead to user visible
        problems in certain situations"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: target: Fix NULL dereference on XCOPY completion
        scsi: mpt3sas: Transition IOC to Ready state during shutdown
        scsi: target: Fix protect handling in WRITE SAME(32)
        scsi: iscsi: Fix iface sysfs attr detection
      7ffca2bb
    • Linus Torvalds's avatar
      Merge tag 'io_uring-5.14-2021-07-24' of git://git.kernel.dk/linux-block · 0ee818c3
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - Fix a memory leak due to a race condition in io_init_wq_offload
         (Yang)
      
       - Poll error handling fixes (Pavel)
      
       - Fix early fdput() regression (me)
      
       - Don't reissue iopoll requests off release path (me)
      
       - Add a safety check for io-wq queue off wrong path (me)
      
      * tag 'io_uring-5.14-2021-07-24' of git://git.kernel.dk/linux-block:
        io_uring: explicitly catch any illegal async queue attempt
        io_uring: never attempt iopoll reissue from release path
        io_uring: fix early fdput() of file
        io_uring: fix memleak in io_init_wq_offload()
        io_uring: remove double poll entry on arm failure
        io_uring: explicitly count entries for poll reqs
      0ee818c3
    • Linus Torvalds's avatar
      Merge tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block · 4d4a60ce
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - NVMe pull request (Christoph):
          - tracing fix (Keith Busch)
          - fix multipath head refcounting (Hannes Reinecke)
          - Write Zeroes vs PI fix (me)
          - drop a bogus WARN_ON (Zhihao Cheng)
      
       - Increase max blk-cgroup policy size, now that mq-deadline
         uses it too (Oleksandr)
      
      * tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block:
        nvme: set the PRACT bit when using Write Zeroes with T10 PI
        nvme: fix nvme_setup_command metadata trace event
        nvme: fix refcounting imbalance when all paths are down
        nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
        block: increase BLKCG_MAX_POLS
      4d4a60ce
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 0823baef
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "Two bugfixes for the I2C subsystem"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: mpc: Poll for MCF
        misc: eeprom: at24: Always append device id even if label property is set.
      0823baef
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · bca1d4de
      Linus Torvalds authored
      Merge misc mm fixes from Andrew Morton:
       "15 patches.
      
        VM subsystems affected by this patch series: userfaultfd, kfence,
        highmem, pagealloc, memblock, pagecache, secretmem, pagemap, and
        hugetlbfs"
      
      * akpm:
        hugetlbfs: fix mount mode command line processing
        mm: fix the deadlock in finish_fault()
        mm: mmap_lock: fix disabling preemption directly
        mm/secretmem: wire up ->set_page_dirty
        writeback, cgroup: do not reparent dax inodes
        writeback, cgroup: remove wb from offline list before releasing refcnt
        memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
        mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction
        mm: use kmap_local_page in memzero_page
        mm: call flush_dcache_page() in memcpy_to_page() and memzero_page()
        kfence: skip all GFP_ZONEMASK allocations
        kfence: move the size check to the beginning of __kfence_alloc()
        kfence: defer kfence_test_init to ensure that kunit debugfs is created
        selftest: use mmap instead of posix_memalign to allocate memory
        userfaultfd: do not untag user pointers
      bca1d4de
    • Akira Tsukamoto's avatar
      riscv: __asm_copy_to-from_user: Fix: Typos in comments · ea196c54
      Akira Tsukamoto authored
      Fix typos and grammar mistakes, and use a more intuitive label
      name.

      Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
      Fixes: ca6eaaa2 ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      ea196c54
    • Akira Tsukamoto's avatar
      riscv: __asm_copy_to-from_user: Remove unnecessary size check · d4b3e010
      Akira Tsukamoto authored
      Clean up:

      A size of 0 is evaluated in the next step, so the check is not
      required here.

      Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
      Fixes: ca6eaaa2 ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      d4b3e010
    • Akira Tsukamoto's avatar
      riscv: __asm_copy_to-from_user: Fix: fail on RV32 · 22b5f16f
      Akira Tsukamoto authored
      Fix a bug in the bytes-to-bits conversion on RV32.

      a3 contains a number of bytes, and multiplying it by 8 gives the
      number of bits.  LGREG holds 2 for RV32 and 3 for RV64, so to
      multiply by 8 the shift must always be the constant 3.
      The 2 was mistakenly used for RV32.
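
      The arithmetic can be checked with a short C sketch.  The function
      and constant names are illustrative assumptions; only the LGREG
      values (2 and 3) come from the description above.

```c
#include <assert.h>

/* LGREG as described above: log2 of the register size in bytes. */
enum { LGREG_RV32 = 2, LGREG_RV64 = 3 };

/* A byte has 8 = 1 << 3 bits regardless of the register width. */
enum { LOG2_BITS_PER_BYTE = 3 };

/* Correct conversion: always shift by the constant 3. */
static unsigned long bytes_to_bits(unsigned long bytes)
{
	return bytes << LOG2_BITS_PER_BYTE;
}

/* Buggy conversion: the shift amount follows the register width. */
static unsigned long bytes_to_bits_lgreg(unsigned long bytes, int lgreg)
{
	assert(lgreg == LGREG_RV32 || lgreg == LGREG_RV64);
	return bytes << lgreg;
}
```

      For 3 bytes the correct result is 24 bits.  Shifting by LGREG gives
      24 only when LGREG is 3; with the RV32 value of 2 it gives 12.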

      Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
      Fixes: ca6eaaa2 ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      22b5f16f
    • Akira Tsukamoto's avatar
      riscv: __asm_copy_to-from_user: Fix: overrun copy · 6010d300
      Akira Tsukamoto authored
      There were two causes for the overrun memory access.

      First, the threshold size was too small.  Aligning dst requires
      one SZREG and the unrolled word copy requires 8*SZREG, so the
      total has to be at least 9*SZREG.

      Second, inside the unrolled copy, subtracting -(8*SZREG-1) made the
      loop run one extra iteration.  The proper value is -(8*SZREG).
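
      The threshold part of this reasoning can be checked with a small C
      sketch.  It models only the size arithmetic, not the assembly
      routine; SZREG is assumed to be 8 (RV64) and the function names are
      hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SZREG = 8 };	/* assumed register size in bytes (RV64); 4 on RV32 */

static_assert(SZREG == 4 || SZREG == 8, "SZREG must be a register size");

/* Bytes that must be copied singly to bring dst up to SZREG alignment. */
static size_t bytes_to_align(uintptr_t dst)
{
	return (SZREG - dst % SZREG) % SZREG;
}

/*
 * True if, after aligning dst, at least one full unrolled pass of
 * eight registers still fits inside the requested size.
 */
static int unrolled_pass_fits(uintptr_t dst, size_t size)
{
	size_t head = bytes_to_align(dst);

	return size >= head && size - head >= (size_t)(8 * SZREG);
}
```

      A size of 9*SZREG passes the check for every possible misalignment
      of dst, whereas 8*SZREG passes only when dst is already aligned.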

      Signed-off-by: Akira Tsukamoto <akira.tsukamoto@gmail.com>
      Fixes: ca6eaaa2 ("riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall")
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      6010d300
    • Mike Kravetz's avatar
      hugetlbfs: fix mount mode command line processing · e0f7e2b2
      Mike Kravetz authored
      In commit 32021982 ("hugetlbfs: Convert to fs_context") processing
      of the mount mode string was changed from match_octal() to fsparam_u32.
      
      This changed existing behavior as match_octal does not require octal
      values to have a '0' prefix, but fsparam_u32 does.
      
      Use fsparam_u32oct which provides the same behavior as match_octal.
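
      The difference in behavior resembles parsing with automatic base
      detection versus a forced base of 8.  The sketch below is a
      userspace analogy using strtoul(), not the kernel parser, and
      parse_mode() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Parse a mode string with the given strtoul() base.
 * Base 0 detects the base from the prefix ("0" means octal);
 * base 8 treats the digits as octal with or without the prefix.
 */
static unsigned long parse_mode(const char *s, int base)
{
	assert(s != NULL);
	return strtoul(s, NULL, base);
}
```

      With base 0, "755" is read as decimal 755 and only "0755" yields
      octal 0755; with base 8 both spellings yield 0755.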
      
      Link: https://lkml.kernel.org/r/20210721183326.102716-1-mike.kravetz@oracle.com
      Fixes: 32021982 ("hugetlbfs: Convert to fs_context")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dennis Camera <bugs+kernel.org@dtnr.ch>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0f7e2b2
    • Qi Zheng's avatar
      mm: fix the deadlock in finish_fault() · e4dc3489
      Qi Zheng authored
      Commit 63f3655f ("mm, memcg: fix reclaim deadlock with writeback")
      fixed the following ABBA deadlock by pre-allocating the pte page table
      without holding the page lock.
      
                                              lock_page(A)
                                              SetPageWriteback(A)
                                              unlock_page(A)
        lock_page(B)
                                              lock_page(B)
        pte_alloc_one
          shrink_page_list
            wait_on_page_writeback(A)
                                              SetPageWriteback(B)
                                              unlock_page(B)

                                              # flush A, B to clear the writeback
      
      Commit f9ce0be7 ("mm: Cleanup faultaround and finish_fault()
      codepaths") reworked the relevant code but ignored this race.  This will
      cause the deadlock above to appear again, so fix it.
      
      Link: https://lkml.kernel.org/r/20210721074849.57004-1-zhengqi.arch@bytedance.com
      Fixes: f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths")
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4dc3489
    • Muchun Song's avatar
      mm: mmap_lock: fix disabling preemption directly · e904c2cc
      Muchun Song authored
      Commit 832b5072 ("mm: mmap_lock: use local locks instead of
      disabling preemption") fixed a bug by using local locks.
      
      But commit d01079f3 ("mm/mmap_lock: remove dead code for
      !CONFIG_TRACING configurations") changed those lines back to the
      original version.
      
      I guess the regression was introduced while resolving conflicts.
      
      Link: https://lkml.kernel.org/r/20210720074228.76342-1-songmuchun@bytedance.com
      Fixes: d01079f3 ("mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e904c2cc
    • Mike Rapoport's avatar
      mm/secretmem: wire up ->set_page_dirty · af642374
      Mike Rapoport authored
      Bring secretmem up to date with the changes made in commit 0af57378
      ("mm: require ->set_page_dirty to be explicitly wired up") so that an
      unconditional call to this method won't cause crashes.
      
      Link: https://lkml.kernel.org/r/20210716063933.31633-1-rppt@kernel.org
      Fixes: 0af57378 ("mm: require ->set_page_dirty to be explicitly wired up")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af642374
    • Roman Gushchin's avatar
      writeback, cgroup: do not reparent dax inodes · 593311e8
      Roman Gushchin authored
      The inode switching code is not suited for dax inodes.  An attempt to
      switch a dax inode to a parent writeback structure (as a part of a
      writeback cleanup procedure) results in a panic like this:
      
        run fstests generic/270 at 2021-07-15 05:54:02
        XFS (pmem0p2): EXPERIMENTAL big timestamp feature in use.  Use at your own risk!
        XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
        XFS (pmem0p2): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
        XFS (pmem0p2): Mounting V5 Filesystem
        XFS (pmem0p2): Ending clean mount
        XFS (pmem0p2): Quotacheck needed: Please wait.
        XFS (pmem0p2): Quotacheck: Done.
        XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
        XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
        XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
        BUG: unable to handle page fault for address: 0000000005b0f669
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 13 PID: 10479 Comm: kworker/13:16 Not tainted 5.14.0-rc1-master-8096acd7+ #8
        Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
        Workqueue: inode_switch_wbs inode_switch_wbs_work_fn
        RIP: 0010:inode_do_switch_wbs+0xaf/0x470
        Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
        RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
        RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
        RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
        RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
        R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
        R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
        FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
        Call Trace:
         inode_switch_wbs_work_fn+0xb6/0x2a0
         process_one_work+0x1e6/0x380
         worker_thread+0x53/0x3d0
         kthread+0x10f/0x130
         ret_from_fork+0x22/0x30
        Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm mgag200 i2c_algo_bit iTCO_wdt irqbypass drm_kms_helper iTCO_vendor_support acpi_ipmi rapl syscopyarea sysfillrect intel_cstate ipmi_si sysimgblt ioatdma dax_pmem_compat fb_sys_fops ipmi_devintf device_dax i2c_i801 pcspkr intel_uncore hpilo nd_pmem cec dax_pmem_core dca i2c_smbus acpi_tad lpc_ich ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel tg3 ghash_clmulni_intel serio_raw hpsa hpwdt scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod
        CR2: 0000000005b0f669
        ---[ end trace ed2105faff8384f3 ]---
        RIP: 0010:inode_do_switch_wbs+0xaf/0x470
        Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
        RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
        RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
        RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
        RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
        R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
        R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
        FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
        Kernel panic - not syncing: Fatal exception
        Kernel Offset: 0x15200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
        ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      The crash happens on an attempt to iterate over attached pagecache pages
      and check the dirty flag: a dax inode's xarray contains pfn's instead of
      generic struct page pointers.
      
      This happens for DAX and not for other kinds of non-page entries in the
      inodes because it's a tagged iteration, and shadow/swap entries are
      never tagged; only DAX entries get tagged.
      
      Fix the problem by bailing out (with the false return value) of
      inode_prepare_wbs_switch() if a dax inode is passed.
      
      [willy@infradead.org: changelog addition]
      
      Link: https://lkml.kernel.org/r/20210719171350.3876830-1-guro@fb.com
      Fixes: c22d70a1 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
      Reported-by: Darrick J. Wong <djwong@kernel.org>
      Tested-by: Darrick J. Wong <djwong@kernel.org>
      Tested-by: Murphy Zhou <jencce.kernel@gmail.com>
      Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      593311e8
    • Roman Gushchin's avatar
      writeback, cgroup: remove wb from offline list before releasing refcnt · b43a9e76
      Roman Gushchin authored
      Boyang reported that the commit c22d70a1 ("writeback, cgroup:
      release dying cgwbs by switching attached inodes") causes the kernel to
      crash while running xfstests generic/256 on ext4 on aarch64 and ppc64le.
      
        run fstests generic/256 at 2021-07-12 05:41:40
        EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: . Quota mode: none.
        Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
        Mem abort info:
           ESR = 0x96000005
           EC = 0x25: DABT (current EL), IL = 32 bits
           SET = 0, FnV = 0
           EA = 0, S1PTW = 0
           FSC = 0x05: level 1 translation fault
        Data abort info:
           ISV = 0, ISS = 0x00000005
           CM = 0, WnR = 0
        user pgtable: 64k pages, 48-bit VAs, pgdp=00000000b0502000
        [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
        Internal error: Oops: 96000005 [#1] SMP
        Modules linked in: dm_flakey dm_snapshot dm_bufio dm_zero dm_mod loop tls rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc ext4 vfat fat mbcache jbd2 drm fuse xfs libcrc32c crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_blk virtio_net net_failover virtio_console failover virtio_mmio aes_neon_bs [last unloaded: scsi_debug]
        CPU: 0 PID: 408468 Comm: kworker/u8:5 Tainted: G X --------- ---  5.14.0-0.rc1.15.bx.el9.aarch64 #1
        Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
        Workqueue: events_unbound cleanup_offline_cgwbs_workfn
        pstate: 004000c5 (nzcv daIF +PAN -UAO -TCO BTYPE=--)
        pc : cleanup_offline_cgwbs_workfn+0x320/0x394
        lr : cleanup_offline_cgwbs_workfn+0xe0/0x394
        sp : ffff80001554fd10
        x29: ffff80001554fd10 x28: 0000000000000000 x27: 0000000000000001
        x26: 0000000000000000 x25: 00000000000000e0 x24: ffffd2a2fbe671a8
        x23: ffff80001554fd88 x22: ffffd2a2fbe67198 x21: ffffd2a2fc25a730
        x20: ffff210412bc3000 x19: ffff210412bc3280 x18: 0000000000000000
        x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
        x14: 0000000000000000 x13: 0000000000000030 x12: 0000000000000040
        x11: ffff210481572238 x10: ffff21048157223a x9 : ffffd2a2fa276c60
        x8 : ffff210484106b60 x7 : 0000000000000000 x6 : 000000000007d18a
        x5 : ffff210416a86400 x4 : ffff210412bc0280 x3 : 0000000000000000
        x2 : ffff80001554fd88 x1 : ffff210412bc0280 x0 : 0000000000000003
        Call trace:
           cleanup_offline_cgwbs_workfn+0x320/0x394
           process_one_work+0x1f4/0x4b0
           worker_thread+0x184/0x540
           kthread+0x114/0x120
           ret_from_fork+0x10/0x18
        Code: d63f0020 97f99963 17ffffa6 f8588263 (f9400061)
        ---[ end trace e250fe289272792a ]---
        Kernel panic - not syncing: Oops: Fatal exception
        SMP: stopping secondary CPUs
        SMP: failed to stop secondary CPUs 0-2
        Kernel Offset: 0x52a2e9fa0000 from 0xffff800010000000
        PHYS_OFFSET: 0xfff0defca0000000
        CPU features: 0x00200251,23200840
        Memory Limit: none
        ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
      
      The problem happens when cgwb_release_workfn() races with
      cleanup_offline_cgwbs_workfn(): wb_tryget() in
      cleanup_offline_cgwbs_workfn() can be called after percpu_ref_exit() in
      cgwb_release_workfn(), which is basically a use-after-free error.
      
      Fix the problem by removing the writeback structure from the
      offline list before releasing the percpu reference counter.  This
      guarantees that cleanup_offline_cgwbs_workfn() will not see or
      access writeback structures which are about to be released.
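
      The ordering described above can be illustrated with a minimal,
      single-threaded userspace sketch.  It is not the kernel code: struct
      wb_model, offline_head and every function name below are hypothetical
      stand-ins, and the locking that the real code relies on is left out.
      The only point shown is that unlinking happens before the reference is
      torn down, so a list walker never meets a torn-down entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup writeback structure. */
struct wb_model {
    struct wb_model *prev, *next; /* links on the offline list */
    bool ref_exited;              /* models percpu_ref_exit() having run */
};

/* Circular list head, in the style of the kernel's list_head. */
struct wb_model offline_head = { &offline_head, &offline_head, false };

/* Link wb right after the list head. */
void offline_add(struct wb_model *wb)
{
    wb->next = offline_head.next;
    wb->prev = &offline_head;
    offline_head.next->prev = wb;
    offline_head.next = wb;
}

/* Unlink wb and make it point at itself. */
void offline_del(struct wb_model *wb)
{
    wb->prev->next = wb->next;
    wb->next->prev = wb->prev;
    wb->prev = wb->next = wb;
}

/* Release path with the fixed ordering: unlink first, then drop the ref. */
void wb_release(struct wb_model *wb)
{
    offline_del(wb);
    wb->ref_exited = true;
}

/* Cleanup walker: count list entries, optionally only torn-down ones. */
int cleanup_count(bool stale_only)
{
    int n = 0;

    for (struct wb_model *wb = offline_head.next; wb != &offline_head;
         wb = wb->next)
        if (!stale_only || wb->ref_exited)
            n++;
    return n;
}
```

      With this ordering, cleanup_count(true) stays at zero after any
      number of wb_release() calls, which is the invariant the fix is after.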
      
      Link: https://lkml.kernel.org/r/20210716201039.3762203-1-guro@fb.com
      Fixes: c22d70a1 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Boyang Xue <bxue@redhat.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Tested-by: Darrick J. Wong <djwong@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Murphy Zhou <jencce.kernel@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b43a9e76
    • Mike Rapoport's avatar
      memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions · 79e482e9
      Mike Rapoport authored
      Commit b10d6bca ("arch, drivers: replace for_each_membock() with
      for_each_mem_range()") didn't take into account that when there is
      movable_node parameter in the kernel command line, for_each_mem_range()
      would skip ranges marked with MEMBLOCK_HOTPLUG.
      
      The page table setup code in POWER uses for_each_mem_range() to create
      the linear mapping of the physical memory and since the regions marked
      as MEMBLOCK_HOTPLUG are skipped, they never make it to the linear map.
      
      A later access to the memory in those ranges will fail:
      
        BUG: Unable to handle kernel data access on write at 0xc000000400000000
        Faulting instruction address: 0xc00000000008a3c0
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in:
        CPU: 0 PID: 53 Comm: kworker/u2:0 Not tainted 5.13.0 #7
        NIP:  c00000000008a3c0 LR: c0000000003c1ed8 CTR: 0000000000000040
        REGS: c000000008a57770 TRAP: 0300   Not tainted  (5.13.0)
        MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 84222202  XER: 20040000
        CFAR: c0000000003c1ed4 DAR: c000000400000000 DSISR: 42000000 IRQMASK: 0
        GPR00: c0000000003c1ed8 c000000008a57a10 c0000000019da700 c000000400000000
        GPR04: 0000000000000280 0000000000000180 0000000000000400 0000000000000200
        GPR08: 0000000000000100 0000000000000080 0000000000000040 0000000000000300
        GPR12: 0000000000000380 c000000001bc0000 c0000000001660c8 c000000006337e00
        GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR20: 0000000040000000 0000000020000000 c000000001a81990 c000000008c30000
        GPR24: c000000008c20000 c000000001a81998 000fffffffff0000 c000000001a819a0
        GPR28: c000000001a81908 c00c000001000000 c000000008c40000 c000000008a64680
        NIP clear_user_page+0x50/0x80
        LR __handle_mm_fault+0xc88/0x1910
        Call Trace:
          __handle_mm_fault+0xc44/0x1910 (unreliable)
          handle_mm_fault+0x130/0x2a0
          __get_user_pages+0x248/0x610
          __get_user_pages_remote+0x12c/0x3e0
          get_arg_page+0x54/0xf0
          copy_string_kernel+0x11c/0x210
          kernel_execve+0x16c/0x220
          call_usermodehelper_exec_async+0x1b0/0x2f0
          ret_from_kernel_thread+0x5c/0x70
        Instruction dump:
        79280fa4 79271764 79261f24 794ae8e2 7ca94214 7d683a14 7c893a14 7d893050
        7d4903a6 60000000 60000000 60000000 <7c001fec> 7c091fec 7c081fec 7c051fec
        ---[ end trace 490b8c67e6075e09 ]---
      
      Making for_each_mem_range() include MEMBLOCK_HOTPLUG regions in the
      traversal fixes this issue.
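
      The effect of the flag on the traversal can be sketched in userspace C.
      This is a simplified model under stated assumptions, not the memblock
      implementation: struct model_region, the MODEL_* flags and both
      functions are hypothetical names, and the skip rule is reduced to the
      one condition the description talks about.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags standing in for MEMBLOCK_NONE / MEMBLOCK_HOTPLUG. */
enum model_flags { MODEL_NONE = 0, MODEL_HOTPLUG = 1 << 0 };

struct model_region {
    uint64_t base;
    uint64_t size;
    unsigned flags;
};

/*
 * A hotpluggable region is skipped only when movable_node is set and the
 * walker did not ask for hotpluggable regions.
 */
int model_should_skip(const struct model_region *r, unsigned walk_flags,
                      int movable_node)
{
    return movable_node && (r->flags & MODEL_HOTPLUG) &&
           !(walk_flags & MODEL_HOTPLUG);
}

/* Bytes that a linear-map setup loop would cover for a given walk. */
uint64_t model_mapped_bytes(const struct model_region *regions, size_t n,
                            unsigned walk_flags, int movable_node)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        if (!model_should_skip(&regions[i], walk_flags, movable_node))
            total += regions[i].size;
    return total;
}
```

      In this model, a walk with MODEL_NONE and movable_node set misses the
      hotpluggable region, while a walk that passes MODEL_HOTPLUG covers all
      of memory.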
      
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1976100
      Link: https://lkml.kernel.org/r/20210712071132.20902-1-rppt@kernel.org
      Fixes: b10d6bca ("arch, drivers: replace for_each_membock() with for_each_mem_range()")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79e482e9
    • Sergei Trofimovich's avatar
      mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction · 69e5d322
      Sergei Trofimovich authored
      To reproduce the failure we need the following system:
      
       - kernel command: page_poison=1 init_on_free=0 init_on_alloc=0
      
       - kernel config:
          * CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
          * CONFIG_INIT_ON_FREE_DEFAULT_ON=y
          * CONFIG_PAGE_POISONING=y
      
      Resulting in:
      
          0000000085629bdd: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          0000000022861832: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00000000c597f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          CPU: 11 PID: 15195 Comm: bash Kdump: loaded Tainted: G     U     O      5.13.1-gentoo-x86_64 #1
          Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2801 01/13/2021
          Call Trace:
           dump_stack+0x64/0x7c
           __kernel_unpoison_pages.cold+0x48/0x84
           post_alloc_hook+0x60/0xa0
           get_page_from_freelist+0xdb8/0x1000
           __alloc_pages+0x163/0x2b0
           __get_free_pages+0xc/0x30
           pgd_alloc+0x2e/0x1a0
           mm_init+0x185/0x270
           dup_mm+0x6b/0x4f0
           copy_process+0x190d/0x1b10
           kernel_clone+0xba/0x3b0
           __do_sys_clone+0x8f/0xb0
           do_syscall_64+0x68/0x80
           entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Before commit 51cba1eb ("init_on_alloc: Optimize static branches")
      init_on_alloc never enabled the static branch by default.  It could only
      be enabled explicitly by init_mem_debugging_and_hardening().
      
      But after commit 51cba1eb, a static branch could already be enabled
      by default.  There was no code to ever disable it.  That caused
      page_poison=1 / init_on_free=1 conflict.
      
      This change extends init_mem_debugging_and_hardening() to also
      explicitly disable the static branches when needed.
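
      The intended behaviour can be modelled in a few lines of userspace C.
      This is only a sketch: struct model_state and
      model_init_mem_hardening() are hypothetical names, plain booleans
      stand in for the static branches, and the assumption is that page
      poisoning takes precedence over both init_on_alloc and init_on_free.

```c
#include <stdbool.h>

/* Hypothetical model of the boot-time state involved. */
struct model_state {
    bool init_on_alloc_early;   /* requested via Kconfig default or cmdline */
    bool init_on_free_early;
    bool page_poison_requested;
    bool init_on_alloc_branch;  /* models the static branch; may start true */
    bool init_on_free_branch;
};

/*
 * Page poisoning wins over auto-init, and each branch is set in both
 * directions, so a branch that defaulted to "on" is switched off when the
 * feature is not wanted.
 */
void model_init_mem_hardening(struct model_state *s)
{
    if ((s->init_on_alloc_early || s->init_on_free_early) &&
        s->page_poison_requested) {
        s->init_on_alloc_early = false;
        s->init_on_free_early = false;
    }
    s->init_on_alloc_branch = s->init_on_alloc_early;
    s->init_on_free_branch = s->init_on_free_early;
}
```

      Starting from branches that default to true with page poisoning
      requested, both branches end up false in this model, which matches the
      reproducer configuration listed above.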
      
      Link: https://lkml.kernel.org/r/20210714031935.4094114-1-keescook@chromium.org
      Link: https://lore.kernel.org/r/20210712215816.1512739-1-slyfox@gentoo.org
      Fixes: 51cba1eb ("init_on_alloc: Optimize static branches")
      Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Kees Cook <keescook@chromium.org>
      Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
      Reported-by: <bowsingbetee@pm.me>
      Tested-by: <bowsingbetee@protonmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69e5d322
    • Christoph Hellwig's avatar
      mm: use kmap_local_page in memzero_page · d9a42b53
      Christoph Hellwig authored
      The commit that introduced the global memzero_page() says in its log
      that it switches to kmap_local_page, but it doesn't actually do that.
      
      Link: https://lkml.kernel.org/r/20210713055231.137602-3-hch@lst.de
      Fixes: 28961998 ("iov_iter: lift memzero_page() to highmem.h")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9a42b53
    • Christoph Hellwig's avatar
      mm: call flush_dcache_page() in memcpy_to_page() and memzero_page() · 8dad53a1
      Christoph Hellwig authored
      memcpy_to_page and memzero_page can write to arbitrary pages, which
      could be in the page cache or in high memory, so call
      flush_dcache_page to flush the dcache.
      
      This is a problem when using these helpers on dcache challenged
      architectures.  Right now there are just a few users, chances are no one
      used the PC floppy driver, the aha1542 driver for an ISA SCSI HBA, and a
      few advanced and optional btrfs and ext4 features on those platforms yet
      since the conversion.
      
      Link: https://lkml.kernel.org/r/20210713055231.137602-2-hch@lst.de
      Fixes: bb90d4bc ("mm/highmem: Lift memcpy_[to|from]_page to core")
      Fixes: 28961998 ("iov_iter: lift memzero_page() to highmem.h")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8dad53a1
    • Alexander Potapenko's avatar
      kfence: skip all GFP_ZONEMASK allocations · 236e9f15
      Alexander Potapenko authored
      Allocation requests outside ZONE_NORMAL (MOVABLE, HIGHMEM or DMA) cannot
      be fulfilled by KFENCE, because KFENCE memory pool is located in a zone
      different from the requested one.
      
      Because callers of kmem_cache_alloc() may actually rely on the
      allocation to reside in the requested zone (e.g.  memory allocations
      done with __GFP_DMA must be DMAable), skip all allocations done with
      GFP_ZONEMASK and/or respective SLAB flags (SLAB_CACHE_DMA and
      SLAB_CACHE_DMA32).
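
      The filter described above can be sketched as a small predicate.  This
      is a userspace illustration only: the MODEL_* bit values are made up
      for the example and do not match the kernel's real gfp or slab flag
      encodings, and model_kfence_may_serve() is a hypothetical name.

```c
#include <stdbool.h>

/* Hypothetical bit values; the kernel's real encodings differ. */
#define MODEL_GFP_DMA      0x01u
#define MODEL_GFP_HIGHMEM  0x02u
#define MODEL_GFP_DMA32    0x04u
#define MODEL_GFP_MOVABLE  0x08u
#define MODEL_GFP_ZONEMASK (MODEL_GFP_DMA | MODEL_GFP_HIGHMEM | \
                            MODEL_GFP_DMA32 | MODEL_GFP_MOVABLE)
#define MODEL_GFP_ZERO     0x100u /* an unrelated, non-zone flag */

#define MODEL_SLAB_CACHE_DMA   0x01u
#define MODEL_SLAB_CACHE_DMA32 0x02u

/*
 * Return true only when neither the request nor the cache asks for a
 * specific zone, i.e. when the sampled pool may serve the allocation.
 */
bool model_kfence_may_serve(unsigned gfp_flags, unsigned cache_flags)
{
    if (gfp_flags & MODEL_GFP_ZONEMASK)
        return false;
    if (cache_flags & (MODEL_SLAB_CACHE_DMA | MODEL_SLAB_CACHE_DMA32))
        return false;
    return true;
}
```

      Requests that carry any zone modifier fall through to the regular
      allocator in this model, so the zone requirement is still honoured.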
      
      Link: https://lkml.kernel.org/r/20210714092222.1890268-2-glider@google.com
      Fixes: 0ce20dd8 ("mm: add Kernel Electric-Fence infrastructure")
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      236e9f15
    • Alexander Potapenko's avatar
      kfence: move the size check to the beginning of __kfence_alloc() · 235a85cb
      Alexander Potapenko authored
      Check the allocation size before toggling kfence_allocation_gate.
      
      This way allocations that can't be served by KFENCE will not result in
      waiting for another CONFIG_KFENCE_SAMPLE_INTERVAL without allocating
      anything.
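
      Why the order of the two checks matters can be shown with a small
      userspace model.  It is a sketch under assumptions, not the KFENCE
      code: MODEL_PAGE_SIZE, allocation_gate and the model_* functions are
      hypothetical, and the gate is reduced to a single C11 atomic counter
      that a timer would reset once per sample interval.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u /* assumed page size for the sketch */

/* 0 = open; non-zero = closed until the next sample interval. */
atomic_int allocation_gate;

/* Returns true if this request would be served from the sampled pool. */
bool model_kfence_alloc(size_t size)
{
    /* Size check first: an oversized request must not close the gate. */
    if (size > MODEL_PAGE_SIZE)
        return false;

    /* Only the first caller per interval gets through the gate. */
    if (atomic_load(&allocation_gate) != 0 ||
        atomic_fetch_add(&allocation_gate, 1) + 1 > 1)
        return false;

    return true;
}

/* Models the periodic timer that opens the gate again. */
void model_timer_reopen_gate(void)
{
    atomic_store(&allocation_gate, 0);
}

int model_gate_value(void)
{
    return atomic_load(&allocation_gate);
}
```

      If the size check came after the gate update instead, an oversized
      request would leave the gate closed and the sample for that interval
      would be lost without any allocation being made.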
      
      Link: https://lkml.kernel.org/r/20210714092222.1890268-1-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Suggested-by: Marco Elver <elver@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      235a85cb
    • Weizhao Ouyang's avatar
      kfence: defer kfence_test_init to ensure that kunit debugfs is created · 32ae8a06
      Weizhao Ouyang authored
      kfence_test_init and kunit_init both run at the same late_initcall
      level.  If kfence_test_init is linked ahead of kunit_init, it gets a
      NULL debugfs_rootdir as its parent dentry, so kfence_test_init and
      kfence_debugfs_init both try to create a debugfs node named "kfence"
      under debugfs_mount->mnt_root.  The second attempt fails with EEXIST
      and logs "debugfs: Directory 'kfence' with parent '/' already
      present!".  Defer kfence_test_init to avoid this.
      
      Link: https://lkml.kernel.org/r/20210714113140.2949995-1-o451686892@gmail.com
      Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
      Tested-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32ae8a06
    • Peter Collingbourne's avatar
      selftest: use mmap instead of posix_memalign to allocate memory · 0db282ba
      Peter Collingbourne authored
      This test passes pointers obtained from anon_allocate_area to the
      userfaultfd and mremap APIs.  This causes a problem if the system
      allocator returns tagged pointers because with the tagged address ABI
      the kernel rejects tagged addresses passed to these APIs, which would
      end up causing the test to fail.  To make this test compatible with such
      system allocators, stop using the system allocator to allocate memory in
      anon_allocate_area, and instead just use mmap.
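
      A minimal sketch of allocating a test area with mmap rather than the
      system allocator is shown below.  The helper names anon_area_alloc()
      and anon_area_free() are hypothetical and this is not the selftest
      code itself; it only assumes a POSIX system that provides
      MAP_ANONYMOUS.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* exposes MAP_ANONYMOUS under strict -std modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map nr_pages * page_size bytes of private anonymous memory.
 * Returns NULL on failure.  The kernel picks the address, so the pointer
 * is page aligned and never passes through the heap allocator.
 */
void *anon_area_alloc(size_t nr_pages, size_t page_size)
{
    void *area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return area == MAP_FAILED ? NULL : area;
}

/* Unmap an area obtained from anon_area_alloc(); returns 0 on success. */
int anon_area_free(void *area, size_t nr_pages, size_t page_size)
{
    return munmap(area, nr_pages * page_size);
}
```

      Because the address comes straight from mmap, it is not decorated by
      a tagging allocator and can be passed to address-taking system calls
      as-is.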
      
      Link: https://lkml.kernel.org/r/20210714195437.118982-3-pcc@google.com
      Link: https://linux-review.googlesource.com/id/Icac91064fcd923f77a83e8e133f8631c5b8fc241
      Fixes: c47174fc ("userfaultfd: selftest")
      Co-developed-by: Lokesh Gidra <lokeshgidra@google.com>
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Alistair Delva <adelva@google.com>
      Cc: William McVicker <willmcvicker@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Mitch Phillips <mitchp@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.4]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0db282ba
    • Peter Collingbourne's avatar
      userfaultfd: do not untag user pointers · e71e2ace
      Peter Collingbourne authored
      Patch series "userfaultfd: do not untag user pointers", v5.
      
      If a user program uses userfaultfd on ranges of heap memory, it may end
      up passing a tagged pointer to the kernel in the range.start field of
      the UFFDIO_REGISTER ioctl.  This can happen when using an MTE-capable
      allocator, or on Android if using the Tagged Pointers feature for MTE
      readiness [1].
      
      When a fault subsequently occurs, the tag is stripped from the fault
      address returned to the application in the fault.address field of struct
      uffd_msg.  However, from the application's perspective, the tagged
      address *is* the memory address, so if the application is unaware of
      memory tags, it may get confused by receiving an address that is, from
      its point of view, outside of the bounds of the allocation.  We observed
      this behavior in the kselftest for userfaultfd [2] but other
      applications could have the same problem.
      
      Address this by not untagging pointers passed to the userfaultfd ioctls.
      Instead, let the system call fail.  Also change the kselftest to use
      mmap so that it doesn't encounter this problem.
      
      [1] https://source.android.com/devices/tech/debug/tagged-pointers
      [2] tools/testing/selftests/vm/userfaultfd.c
      
      This patch (of 2):
      
      Do not untag pointers passed to the userfaultfd ioctls.  Instead, let
      the system call fail.  This will provide an early indication of problems
      with tag-unaware userspace code instead of letting the code get confused
      later, and is consistent with how we decided to handle brk/mmap/mremap
      in commit dcde2373 ("mm: Avoid creating virtual address aliases in
      brk()/mmap()/mremap()"), as well as being consistent with the existing
      tagged address ABI documentation relating to how ioctl arguments are
      handled.
      
      The code change is a revert of commit 7d032574 ("userfaultfd: untag
      user pointers") plus some fixups to some additional calls to
      validate_range that have appeared since then.
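
      The difference between stripping a tag and rejecting a tagged address
      can be sketched in userspace C.  This is a simplified model, not the
      kernel's validate_range(): the MODEL_* constants and model_* functions
      are hypothetical, the tag is assumed to live in the top byte of a
      64-bit address, and a fixed page size stands in for the minimum
      mappable address.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL
#define MODEL_TAG_SHIFT 56 /* assumed: tag occupies the top byte */

/* Old behaviour (model): silently strip the tag before validating. */
uint64_t model_untag(uint64_t addr)
{
    return addr & ((1ULL << MODEL_TAG_SHIFT) - 1);
}

/*
 * New behaviour (model): validate the address exactly as passed in.  A
 * tagged address lies above task_size, so the range is rejected and the
 * caller sees the error immediately.
 */
bool model_validate_range(uint64_t start, uint64_t len, uint64_t task_size)
{
    if (start & (MODEL_PAGE_SIZE - 1))
        return false;
    if (len & (MODEL_PAGE_SIZE - 1))
        return false;
    if (!len)
        return false;
    if (start < MODEL_PAGE_SIZE) /* stand-in for a minimum mappable address */
        return false;
    if (start >= task_size)
        return false;
    if (len > task_size - start)
        return false;
    return true;
}
```

      In this model a tagged start address fails validation, whereas the
      same address with the tag stripped would have passed and later faults
      would have reported an address the application does not recognise.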
      
      Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
      Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
      Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
      Fixes: 63f0c603 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Delva <adelva@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mitch Phillips <mitchp@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: William McVicker <willmcvicker@google.com>
      Cc: <stable@vger.kernel.org>	[5.4]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e71e2ace
    • Jisheng Zhang's avatar
      riscv: stacktrace: pin the task's stack in get_wchan · 76f5dfac
      Jisheng Zhang authored
      Pin the task's stack before calling walk_stackframe() in get_wchan().
      This fixes the panic reported by Andreas when CONFIG_VMAP_STACK=y:
      
      [   65.609696] Unable to handle kernel paging request at virtual address ffffffd0003bbde8
      [   65.610460] Oops [#1]
      [   65.610626] Modules linked in: virtio_blk virtio_mmio rtc_goldfish btrfs blake2b_generic libcrc32c xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
      [   65.611670] CPU: 2 PID: 1 Comm: systemd Not tainted 5.14.0-rc1-1.g34fe32a-default #1 openSUSE Tumbleweed (unreleased) c62f7109153e5a0897ee58ba52393ad99b070fd2
      [   65.612334] Hardware name: riscv-virtio,qemu (DT)
      [   65.613008] epc : get_wchan+0x5c/0x88
      [   65.613334]  ra : get_wchan+0x42/0x88
      [   65.613625] epc : ffffffff800048a4 ra : ffffffff8000488a sp : ffffffd00021bb90
      [   65.614008]  gp : ffffffff817709f8 tp : ffffffe07fe91b80 t0 : 00000000000001f8
      [   65.614411]  t1 : 0000000000020000 t2 : 0000000000000000 s0 : ffffffd00021bbd0
      [   65.614818]  s1 : ffffffd0003bbdf0 a0 : 0000000000000001 a1 : 0000000000000002
      [   65.615237]  a2 : ffffffff81618008 a3 : 0000000000000000 a4 : 0000000000000000
      [   65.615637]  a5 : ffffffd0003bc000 a6 : 0000000000000002 a7 : ffffffe27d370000
      [   65.616022]  s2 : ffffffd0003bbd90 s3 : ffffffff8071a81e s4 : 0000000000003fff
      [   65.616407]  s5 : ffffffffffffc000 s6 : 0000000000000000 s7 : ffffffff81618008
      [   65.616845]  s8 : 0000000000000001 s9 : 0000000180000040 s10: 0000000000000000
      [   65.617248]  s11: 000000000000016b t3 : 000000ff00000000 t4 : 0c6aec92de5e3fd7
      [   65.617672]  t5 : fff78f60608fcfff t6 : 0000000000000078
      [   65.618088] status: 0000000000000120 badaddr: ffffffd0003bbde8 cause: 000000000000000d
      [   65.618621] [<ffffffff800048a4>] get_wchan+0x5c/0x88
      [   65.619008] [<ffffffff8022da88>] do_task_stat+0x7a2/0xa46
      [   65.619325] [<ffffffff8022e87e>] proc_tgid_stat+0xe/0x16
      [   65.619637] [<ffffffff80227dd6>] proc_single_show+0x46/0x96
      [   65.619979] [<ffffffff801ccb1e>] seq_read_iter+0x190/0x31e
      [   65.620341] [<ffffffff801ccd70>] seq_read+0xc4/0x104
      [   65.620633] [<ffffffff801a6bfe>] vfs_read+0x6a/0x112
      [   65.620922] [<ffffffff801a701c>] ksys_read+0x54/0xbe
      [   65.621206] [<ffffffff801a7094>] sys_read+0xe/0x16
      [   65.621474] [<ffffffff8000303e>] ret_from_syscall+0x0/0x2
      [   65.622169] ---[ end trace f24856ed2b8789c5 ]---
      [   65.622832] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
      Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      76f5dfac
  6. 23 Jul, 2021 1 commit
    • Jens Axboe's avatar
      io_uring: explicitly catch any illegal async queue attempt · 991468dc
      Jens Axboe authored
      Catch an illegal case to queue async from an unrelated task that got
      the ring fd passed to it. This should not be possible to hit, but
      better be proactive and catch it explicitly. io-wq is extended to
      check for early IO_WQ_WORK_CANCEL being set on a work item as well,
      so it can run the request through the normal cancelation path.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      991468dc