1. 23 Aug, 2022 3 commits
  2. 22 Aug, 2022 6 commits
      Merge tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 072e5135
      Linus Torvalds authored
      Pull NFS client fixes from Trond Myklebust:
      "Stable fixes:
         - NFS: Fix another fsync() issue after a server reboot
      
        Bugfixes:
         - NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
         - NFS: Fix missing unlock in nfs_unlink()
         - Add sanity checking of the file type used by __nfs42_ssc_open
         - Fix a case where we're failing to set task->tk_rpc_status
      
        Cleanups:
         - Remove the NFS_CONTEXT_RESEND_WRITES flag that got obsoleted by the
           fsync() fix"
      
      * tag 'nfs-for-5.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        SUNRPC: RPC level errors should set task->tk_rpc_status
        NFSv4.2 fix problems with __nfs42_ssc_open
        NFS: unlink/rmdir shouldn't call d_delete() twice on ENOENT
        NFS: Cleanup to remove unused flag NFS_CONTEXT_RESEND_WRITES
        NFS: Remove a bogus flag setting in pnfs_write_done_resend_to_mds
        NFS: Fix another fsync() issue after a server reboot
        NFS: Fix missing unlock in nfs_unlink()
      Merge tag 'fs.idmapped.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping · d3cd67d6
      Linus Torvalds authored
      
      Pull idmapping fixes from Christian Brauner:
      
       - Since Seth joined as co-maintainer for idmapped mounts we decided to
         use a shared git tree. Konstantin suggested we use vfs/idmapping.git
         on kernel.org under the vfs/ namespace. So this updates the tree in
         the maintainers file.
      
       - Ensure that POSIX ACLs checking, getting, and setting works correctly
         for filesystems mountable with a filesystem idmapping that want to
         support idmapped mounts.
      
         Since no filesystem mountable with an fs_idmapping supports idmapped
         mounts yet, there is no problem. But this could change in the
         future, so add a check to refuse to create idmapped mounts when the
         mounter is not privileged over the mount's idmapping.
      
       - Check that caller is privileged over the idmapping that will be
         attached to a mount.
      
         Currently no FS_USERNS_MOUNT filesystems support idmapped mounts,
         thus this is not a problem as only CAP_SYS_ADMIN in init_user_ns is
         allowed to set up idmapped mounts. But this could change in the
         future, so add a check to refuse to create idmapped mounts when the
         mounter is not privileged over the mount's idmapping.
      
       - Fix POSIX ACLs for ntfs3. While looking at our current POSIX ACL
         handling in the context of some overlayfs work I went through a range
         of other filesystems checking how they handle them currently and
         encountered a few bugs in ntfs3.
      
         I sent these fixes some time ago, but they haven't been picked up
         even though the pull request for other ntfs3 fixes was sent afterwards.
         This should really be fixed as right now POSIX ACLs are broken in
         certain circumstances for ntfs3.
      
      * tag 'fs.idmapped.fixes.v6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
        ntfs: fix acl handling
        fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
        MAINTAINERS: update idmapping tree
        acl: handle idmapped mounts for idmapped filesystems
      Merge tag 'filelock-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux · b20ee481
      Linus Torvalds authored
      Pull file locking fix from Jeff Layton:
       "Just a single patch for a bugfix in the flock() codepath, introduced
        by a patch that went in recently"
      
      * tag 'filelock-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
        locks: Fix dropped call to ->fl_release_private()
      perf tools: Fix compile error for x86 · cfd2b5c1
      Yang Jihong authored
      Commit a0a12c3e ("asm goto: eradicate CC_HAS_ASM_GOTO") eradicates
      CC_HAS_ASM_GOTO, and in the process also causes the perf tool on x86 to
      use asm_volatile_goto when compiling __GEN_RMWcc.
      
      However, asm_volatile_goto is not declared in the perf tool headers,
      which causes a compilation error:
      
        In file included from tools/arch/x86/include/asm/atomic.h:7,
                         from tools/include/asm/atomic.h:6,
                         from tools/include/linux/atomic.h:5,
                         from tools/include/linux/refcount.h:41,
                         from tools/lib/perf/include/internal/cpumap.h:5,
                         from tools/perf/util/cpumap.h:7,
                         from tools/perf/util/env.h:7,
                         from tools/perf/util/header.h:12,
                         from pmu-events/pmu-events.c:9:
        tools/arch/x86/include/asm/atomic.h: In function ‘atomic_dec_and_test’:
        tools/arch/x86/include/asm/rmwcc.h:7:2: error: implicit declaration of function ‘asm_volatile_goto’ [-Werror=implicit-function-declaration]
          asm_volatile_goto (fullop "; j" cc " %l[cc_label]"  \
          ^~~~~~~~~~~~~~~~~
      
      Define asm_volatile_goto in compiler_types.h if not declared, like the
      main kernel header files do.
      
      Fixes: a0a12c3e ("asm goto: eradicate CC_HAS_ASM_GOTO")
      Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Tested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
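The fallback-define pattern described in the perf commit above can be sketched as follows. This is a minimal illustration, not the actual header: the macro body is an assumption modeled on the commit text (spelled `__asm__ goto` here so it also builds in strict ISO modes), and `falls_through()` is a hypothetical demo function. It needs GCC or Clang, since `asm goto` is a compiler extension.

```c
/* Fallback: define asm_volatile_goto only if no earlier header did. */
#ifndef asm_volatile_goto
#define asm_volatile_goto(x...) __asm__ goto(x)
#endif

/*
 * Hypothetical demo: the asm template is empty, so control never jumps
 * to the label and the function falls through to "return 0".
 */
static int falls_through(void)
{
	asm_volatile_goto("" : : : : taken);
	return 0;
taken:
	return 1;
}
```

Because of the `#ifndef` guard, a header that already provides its own `asm_volatile_goto` is left untouched.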
      ntfs: fix acl handling · 0c3bc789
      Christian Brauner authored
      While looking at our current POSIX ACL handling in the context of some
      overlayfs work I went through a range of other filesystems checking how they
      handle them currently and encountered ntfs3.
      
      The posix_acl_{from,to}_xattr() helpers always need to operate on the
      filesystem idmapping. Since ntfs3 can only be mounted in the initial user
      namespace the relevant idmapping is init_user_ns.
      
      The posix_acl_{from,to}_xattr() helpers are concerned with translating between
      the kernel internal struct posix_acl{_entry} and the uapi struct
      posix_acl_xattr_{header,entry} and the kernel internal data structure is cached
      filesystem wide.
      
      Additional idmappings such as the caller's idmapping or the mount's idmapping
      are handled higher up in the VFS. Individual filesystems usually do not need to
      concern themselves with these.
      
      The posix_acl_valid() helper is concerned with checking whether the values in
      the kernel internal struct posix_acl can be represented in the filesystem's
      idmapping. IOW, if they can be written to disk. So this helper too needs to
      take the filesystem's idmapping.
      
      Fixes: be71b5cb ("fs/ntfs3: Add attrib operations")
      Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
      Cc: ntfs3@lists.linux.dev
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Linux 6.0-rc2 · 1c23f9e6
      Linus Torvalds authored
  3. 21 Aug, 2022 17 commits
  4. 20 Aug, 2022 14 commits
      kprobes: don't call disarm_kprobe() for disabled kprobes · 9c80e799
      Kuniyuki Iwashima authored
      The assumption in __disable_kprobe() is wrong, and it could try to disarm
      an already disarmed kprobe and fire the WARN_ONCE() below. [0]  We can
      easily reproduce this issue.
      
      1. Write 0 to /sys/kernel/debug/kprobes/enabled.
      
        # echo 0 > /sys/kernel/debug/kprobes/enabled
      
      2. Run execsnoop.  At this time, one kprobe is disabled.
      
        # /usr/share/bcc/tools/execsnoop &
        [1] 2460
        PCOMM            PID    PPID   RET ARGS
      
        # cat /sys/kernel/debug/kprobes/list
        ffffffff91345650  r  __x64_sys_execve+0x0    [FTRACE]
        ffffffff91345650  k  __x64_sys_execve+0x0    [DISABLED][FTRACE]
      
      3. Write 1 to /sys/kernel/debug/kprobes/enabled, which changes
         kprobes_all_disarmed to false but does not arm the disabled kprobe.
      
        # echo 1 > /sys/kernel/debug/kprobes/enabled
      
        # cat /sys/kernel/debug/kprobes/list
        ffffffff91345650  r  __x64_sys_execve+0x0    [FTRACE]
        ffffffff91345650  k  __x64_sys_execve+0x0    [DISABLED][FTRACE]
      
      4. Kill execsnoop, at which point __disable_kprobe() calls disarm_kprobe()
         for the disabled kprobe and hits the WARN_ONCE() in
         __disarm_kprobe_ftrace().
      
        # fg
        /usr/share/bcc/tools/execsnoop
        ^C
      
      Actually, WARN_ONCE() is fired twice, and __unregister_kprobe_top() misses
      some cleanups and leaves the aggregated kprobe in the hash table.  Then,
      __unregister_trace_kprobe() initialises tk->rp.kp.list and creates an
      infinite loop like this.
      
        aggregated kprobe.list -> kprobe.list -.
                                           ^    |
                                           '.__.'
      
      In this situation, these commands fall into the infinite loop and result
      in RCU stall or soft lockup.
      
        cat /sys/kernel/debug/kprobes/list : show_kprobe_addr() enters into the
                                             infinite loop with RCU.
      
        /usr/share/bcc/tools/execsnoop : warn_kprobe_rereg() holds kprobe_mutex,
                                         and __get_valid_kprobe() is stuck in
                                         the loop.
      
      To avoid the issue, make sure we don't call disarm_kprobe() for disabled
      kprobes.
      
      [0]
      Failed to disarm kprobe-ftrace at __x64_sys_execve+0x0/0x40 (error -2)
      WARNING: CPU: 6 PID: 2460 at kernel/kprobes.c:1130 __disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
      Modules linked in: ena
      CPU: 6 PID: 2460 Comm: execsnoop Not tainted 5.19.0+ #28
      Hardware name: Amazon EC2 c5.2xlarge/, BIOS 1.0 10/16/2017
      RIP: 0010:__disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
      Code: 24 8b 02 eb c1 80 3d c4 83 f2 01 00 75 d4 48 8b 75 00 89 c2 48 c7 c7 90 fa 0f 92 89 04 24 c6 05 ab 83 01 e8 e4 94 f0 ff <0f> 0b 8b 04 24 eb b1 89 c6 48 c7 c7 60 fa 0f 92 89 04 24 e8 cc 94
      RSP: 0018:ffff9e6ec154bd98 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffffffff930f7b00 RCX: 0000000000000001
      RDX: 0000000080000001 RSI: ffffffff921461c5 RDI: 00000000ffffffff
      RBP: ffff89c504286da8 R08: 0000000000000000 R09: c0000000fffeffff
      R10: 0000000000000000 R11: ffff9e6ec154bc28 R12: ffff89c502394e40
      R13: ffff89c502394c00 R14: ffff9e6ec154bc00 R15: 0000000000000000
      FS:  00007fe800398740(0000) GS:ffff89c812d80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000c00057f010 CR3: 0000000103b54006 CR4: 00000000007706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
      <TASK>
       __disable_kprobe (kernel/kprobes.c:1716)
       disable_kprobe (kernel/kprobes.c:2392)
       __disable_trace_kprobe (kernel/trace/trace_kprobe.c:340)
       disable_trace_kprobe (kernel/trace/trace_kprobe.c:429)
       perf_trace_event_unreg.isra.2 (./include/linux/tracepoint.h:93 kernel/trace/trace_event_perf.c:168)
       perf_kprobe_destroy (kernel/trace/trace_event_perf.c:295)
       _free_event (kernel/events/core.c:4971)
       perf_event_release_kernel (kernel/events/core.c:5176)
       perf_release (kernel/events/core.c:5186)
       __fput (fs/file_table.c:321)
       task_work_run (./include/linux/sched.h:2056 (discriminator 1) kernel/task_work.c:179 (discriminator 1))
       exit_to_user_mode_prepare (./include/linux/resume_user_mode.h:49 kernel/entry/common.c:169 kernel/entry/common.c:201)
       syscall_exit_to_user_mode (./arch/x86/include/asm/jump_label.h:55 ./arch/x86/include/asm/nospec-branch.h:384 ./arch/x86/include/asm/entry-common.h:94 kernel/entry/common.c:133 kernel/entry/common.c:296)
       do_syscall_64 (arch/x86/entry/common.c:87)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
      RIP: 0033:0x7fe7ff210654
      Code: 15 79 89 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb be 0f 1f 00 8b 05 9a cd 20 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 34 fc
      RSP: 002b:00007ffdbd1d3538 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
      RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007fe7ff210654
      RDX: 0000000000000000 RSI: 0000000000002401 RDI: 0000000000000008
      RBP: 0000000000000000 R08: 94ae31d6fda838a4 R09: 00007fe8001c9d30
      R10: 00007ffdbd1d34b0 R11: 0000000000000246 R12: 00007ffdbd1d3600
      R13: 0000000000000000 R14: fffffffffffffffc R15: 00007ffdbd1d3560
      </TASK>
      
      Link: https://lkml.kernel.org/r/20220813020509.90805-1-kuniyu@amazon.com
      Fixes: 69d54b91 ("kprobes: makes kprobes/enabled works correctly for optimized kprobes.")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reported-by: Ayushman Dutta <ayudutta@amazon.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
      Cc: Kuniyuki Iwashima <kuni1840@gmail.com>
      Cc: Ayushman Dutta <ayudutta@amazon.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
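The guard described in the kprobes commit above ("don't call disarm_kprobe() for disabled kprobes") can be illustrated with a toy userspace model. Everything here is hypothetical: the `toy_*` names and the struct are invented for illustration and are not the kernel's data structures. The only detail taken from the log is that disarming a probe that is not armed fails with error -2 (ENOENT).

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model only; these are not the kernel's types. */
struct toy_kprobe {
	bool disabled;	/* stands in for the [DISABLED] flag */
	bool armed;	/* whether the probe is actually installed */
};

/* Stands in for the global "all kprobes disarmed" switch. */
static bool toy_all_disarmed;

/* Disarming a probe that is not armed fails, like "error -2" in the log. */
static int toy_disarm(struct toy_kprobe *p)
{
	if (!p->armed)
		return -ENOENT;
	p->armed = false;
	return 0;
}

/* Old behaviour in this model: only the global switch is consulted. */
static int toy_disable_old(struct toy_kprobe *p)
{
	int ret = 0;

	if (!toy_all_disarmed)
		ret = toy_disarm(p);
	if (!ret)
		p->disabled = true;
	return ret;
}

/* Fixed behaviour: skip the disarm when the probe is already disabled. */
static int toy_disable_fixed(struct toy_kprobe *p)
{
	int ret = 0;

	if (!toy_all_disarmed && !p->disabled)
		ret = toy_disarm(p);
	if (!ret)
		p->disabled = true;
	return ret;
}
```

With the global switch re-enabled (step 3) and a probe that is disabled and not armed, `toy_disable_old()` returns -ENOENT while `toy_disable_fixed()` returns 0.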
      mm/shmem: shmem_replace_page() remember NR_SHMEM · 76d36dea
      Hugh Dickins authored
      Elsewhere, NR_SHMEM is updated at the same time as shmem NR_FILE_PAGES;
      but shmem_replace_page() was forgetting to do that - so NR_SHMEM stats
      could grow too big or too small, in those unusual cases when it's used.
      
      Link: https://lkml.kernel.org/r/cec7c09d-5874-e160-ada6-6e10ee48784@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Radoslaw Burny <rburny@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      mm/shmem: tmpfs fallocate use file_modified() · 15f242bb
      Hugh Dickins authored
      5.18 fixed the btrfs and ext4 fallocates to use file_modified(), as xfs
      was already doing, to drop privileges: and fstests generic/{683,684,688}
      expect this.  There's no need to argue over keep-size allocation (which
      could just update ctime): fix shmem_fallocate() to behave the same way.
      
      Link: https://lkml.kernel.org/r/39c5e62-4896-7795-c0a0-f79c50d4909@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Radoslaw Burny <rburny@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      mm/shmem: fix chattr fsflags support in tmpfs · cb241339
      Hugh Dickins authored
      ext[234] have always allowed unimplemented chattr flags to be set, but
      other filesystems have tended to be stricter.  Follow the stricter
      approach for tmpfs: I don't want to have to explain why csu attributes
      don't actually work, and we won't need to update the chattr(1) manpage;
      and it's never wrong to start off strict, relaxing later if persuaded.
      Allow only a (append only), i (immutable), A (no atime) and d (no dump).
      
      Although lsattr showed 'A' inherited, the NOATIME behavior was not being
      inherited: because nothing sync'ed FS_NOATIME_FL to S_NOATIME.  Add
      shmem_set_inode_flags() to sync the flags, using inode_set_flags() to
      avoid that instant of lost immutability during fileattr_set().
      
      But that change switched generic/079 from passing to failing: because
      FS_IMMUTABLE_FL and FS_APPEND_FL had been unconventionally included in the
      INHERITED fsflags: remove them and generic/079 is back to passing.
      
      Link: https://lkml.kernel.org/r/2961dcb0-ddf3-b9f0-3268-12a4ff996856@google.com
      Fixes: e408e695 ("mm/shmem: support FS_IOC_[SG]ETFLAGS in tmpfs")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Radoslaw Burny <rburny@google.com>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
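The strict policy described in the tmpfs chattr commit above (only a, i, A and d are accepted) can be sketched as a simple mask check. The FS_*_FL constants come from <linux/fs.h>; the macro `TOY_TMPFS_FL_ALLOWED` and the helper `toy_tmpfs_fsflags_ok()` are hypothetical names for illustration and are not the actual shmem code.

```c
#include <stdbool.h>
#include <linux/fs.h>

/* Hypothetical mask of the four chattr flags the commit says tmpfs allows. */
#define TOY_TMPFS_FL_ALLOWED \
	(FS_APPEND_FL | FS_IMMUTABLE_FL | FS_NOATIME_FL | FS_NODUMP_FL)

/* Returns true only if no flag outside the allowed set is requested. */
static bool toy_tmpfs_fsflags_ok(unsigned int fsflags)
{
	return (fsflags & ~TOY_TMPFS_FL_ALLOWED) == 0;
}
```

Under this sketch a request for the c, s or u attributes (FS_COMPR_FL, FS_SECRM_FL, FS_UNRM_FL) is refused rather than silently stored.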
      mm/hugetlb: support write-faults in shared mappings · 1d8d1464
      David Hildenbrand authored
      If we ever get a write-fault on a write-protected page in a shared
      mapping, we'd be in trouble (again).  Instead, we can simply map the page
      writable.
      
      And in fact, there is even a way right now to trigger that code via
      uffd-wp ever since we started to support it for shmem in 5.19:
      
      --------------------------------------------------------------------------
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <fcntl.h>
       #include <unistd.h>
       #include <errno.h>
       #include <sys/mman.h>
       #include <sys/syscall.h>
       #include <sys/ioctl.h>
       #include <linux/userfaultfd.h>
      
       #define HUGETLB_SIZE (2 * 1024 * 1024u)
      
       static char *map;
       int uffd;
      
       static int temp_setup_uffd(void)
       {
       	struct uffdio_api uffdio_api;
       	struct uffdio_register uffdio_register;
       	struct uffdio_writeprotect uffd_writeprotect;
       	struct uffdio_range uffd_range;
      
       	uffd = syscall(__NR_userfaultfd,
       		       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
       	if (uffd < 0) {
       		fprintf(stderr, "syscall() failed: %d\n", errno);
       		return -errno;
       	}
      
       	uffdio_api.api = UFFD_API;
       	uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
       	if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
       		fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
       		return -errno;
       	}
      
       	if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
       		fprintf(stderr, "UFFD_FEATURE_PAGEFAULT_FLAG_WP missing\n");
       		return -ENOSYS;
       	}
      
       	/* Register UFFD-WP */
       	uffdio_register.range.start = (unsigned long) map;
       	uffdio_register.range.len = HUGETLB_SIZE;
       	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
       	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
       		fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
       		return -errno;
       	}
      
       	/* Writeprotect a single page. */
       	uffd_writeprotect.range.start = (unsigned long) map;
       	uffd_writeprotect.range.len = HUGETLB_SIZE;
       	uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
       	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
       		fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
       		return -errno;
       	}
      
       	/* Unregister UFFD-WP without prior writeunprotection. */
       	uffd_range.start = (unsigned long) map;
       	uffd_range.len = HUGETLB_SIZE;
       	if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
       		fprintf(stderr, "UFFDIO_UNREGISTER failed: %d\n", errno);
       		return -errno;
       	}
      
       	return 0;
       }
      
       int main(int argc, char **argv)
       {
       	int fd;
      
       	fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT, 0600);
       	if (fd < 0) {
       		fprintf(stderr, "open() failed\n");
       		return -errno;
       	}
       	if (ftruncate(fd, HUGETLB_SIZE)) {
       		fprintf(stderr, "ftruncate() failed\n");
       		return -errno;
       	}
      
       	map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
       	if (map == MAP_FAILED) {
       		fprintf(stderr, "mmap() failed\n");
       		return -errno;
       	}
      
       	*map = 0;
      
       	if (temp_setup_uffd())
       		return 1;
      
       	*map = 0;
      
       	return 0;
       }
      --------------------------------------------------------------------------
      
      Above test fails with SIGBUS when there is only a single free hugetlb page.
       # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
       # ./test
       Bus error (core dumped)
      
      And worse, with sufficient free hugetlb pages it will map an anonymous page
      into a shared mapping, for example, messing up accounting during unmap
      and breaking MAP_SHARED semantics:
       # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
       # ./test
       # cat /proc/meminfo | grep HugePages_
       HugePages_Total:       2
       HugePages_Free:        1
       HugePages_Rsvd:    18446744073709551615
       HugePages_Surp:        0
      
      Reason is that uffd-wp doesn't clear the uffd-wp PTE bit when
      unregistering and consequently keeps the PTE writeprotected.  Reason for
      this is to avoid the additional overhead when unregistering.  Note that
      this is the case also for !hugetlb and that we will end up with writable
      PTEs that still have the uffd-wp PTE bit set once we return from
      hugetlb_wp().  I'm not touching the uffd-wp PTE bit for now, because it
      seems to be a generic thing -- wp_page_reuse() also doesn't clear it.
      
      VM_MAYSHARE handling in hugetlb_fault() for FAULT_FLAG_WRITE indicates
      that MAP_SHARED handling was at least envisioned, but could never have
      worked as expected.
      
      While at it, make sure that we never end up in hugetlb_wp() on write
      faults without VM_WRITE, because we don't support maybe_mkwrite()
      semantics as commonly used in the !hugetlb case -- for example, in
      wp_page_reuse().
      
      Note that there is no need to do any kind of reservation in
      hugetlb_fault() in this case ...  because we already have a hugetlb page
      mapped R/O that we will simply map writable and we are not dealing with
      COW/unsharing.
      
      Link: https://lkml.kernel.org/r/20220811103435.188481-3-david@redhat.com
      Fixes: b1f9e876 ("mm/uffd: enable write protection for shmem & hugetlbfs")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.19]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      mm/hugetlb: fix hugetlb not supporting softdirty tracking · f96f7a40
      David Hildenbrand authored
      Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2.
      
      I observed that hugetlb does not support/expect write-faults in shared
      mappings that would have to map the R/O-mapped page writable -- and I
      found two cases where we could currently get such faults and would
      erroneously map an anon page into a shared mapping.
      
      Reproducers are part of the patches.
      
      I propose to backport both fixes to stable trees.  The first fix needs a
      small adjustment.
      
      
      This patch (of 2):
      
      Staring at hugetlb_wp(), one might wonder where all the logic for shared
      mappings is when stumbling over a write-protected page in a shared
      mapping.  In fact, there is none, and so far we thought we could get away
      with that because e.g., mprotect() should always do the right thing and
      map all pages directly writable.
      
      Looks like we were wrong:
      
      --------------------------------------------------------------------------
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <fcntl.h>
       #include <unistd.h>
       #include <errno.h>
       #include <sys/mman.h>
      
       #define HUGETLB_SIZE (2 * 1024 * 1024u)
      
       static void clear_softdirty(void)
       {
               int fd = open("/proc/self/clear_refs", O_WRONLY);
               const char *ctrl = "4";
               int ret;
      
               if (fd < 0) {
                       fprintf(stderr, "open(clear_refs) failed\n");
                       exit(1);
               }
               ret = write(fd, ctrl, strlen(ctrl));
               if (ret != strlen(ctrl)) {
                       fprintf(stderr, "write(clear_refs) failed\n");
                       exit(1);
               }
               close(fd);
       }
      
       int main(int argc, char **argv)
       {
               char *map;
               int fd;
      
               fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT, 0600);
               if (fd < 0) {
                       fprintf(stderr, "open() failed\n");
                       return -errno;
               }
               if (ftruncate(fd, HUGETLB_SIZE)) {
                       fprintf(stderr, "ftruncate() failed\n");
                       return -errno;
               }
      
               map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
               if (map == MAP_FAILED) {
                       fprintf(stderr, "mmap() failed\n");
                       return -errno;
               }
      
               *map = 0;
      
               if (mprotect(map, HUGETLB_SIZE, PROT_READ)) {
                       fprintf(stderr, "mprotect() failed\n");
                       return -errno;
               }
      
               clear_softdirty();
      
               if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) {
                       fprintf(stderr, "mprotect() failed\n");
                       return -errno;
               }
      
               *map = 0;
      
               return 0;
       }
      --------------------------------------------------------------------------
      
      Above test fails with SIGBUS when there is only a single free hugetlb page.
       # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
       # ./test
       Bus error (core dumped)
      
      And worse, with sufficient free hugetlb pages it will map an anonymous page
      into a shared mapping, for example, messing up accounting during unmap
      and breaking MAP_SHARED semantics:
       # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
       # ./test
       # cat /proc/meminfo | grep HugePages_
       HugePages_Total:       2
       HugePages_Free:        1
       HugePages_Rsvd:    18446744073709551615
       HugePages_Surp:        0
      
      Reason in this particular case is that vma_wants_writenotify() will
      return "true", removing VM_SHARED in vma_set_page_prot() to map pages
      write-protected. Let's teach vma_wants_writenotify() that hugetlb does not
      support softdirty tracking.
      
      Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com
      Fixes: 64e45507 ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      mm/uffd: reset write protection when unregister with wp-mode · f369b07c
      Peter Xu authored
      The motivation of this patch comes from a recent report and patchfix from
      David Hildenbrand on hugetlb shared handling of wr-protected page [1].
      
      With the reproducer provided in commit message of [1], one can leverage
      the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect
      not only the attacker process, but also the whole system.
      
      The lazy-reset mechanism of uffd-wp was used to make unregister faster.
      It relies on the assumption that any leftover pgtable entries only affect
      the process itself: the user is expected to be aware of what it does, and
      nothing outside the process should be affected.
      
      But it seems that this is not true, and it can also be utilized to make
      some exploit easier.
      
      So far there's no evidence that the lazy-reset is important to any
      userfaultfd users, because normally the unregister will only happen once
      for a specific range of memory during the lifecycle of the process.
      
      Considering all of the above, what this patch proposes is to do explicit pte
      resets when unregistering a uffd region with wr-protect mode enabled.
      
      It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false)
      right before ioctl(UFFDIO_UNREGISTER) for the user.  So potentially it'll
      make the unregister slower.  From that pov it's a very slight abi change,
      but hopefully nothing should break with this change either.
      
      Regarding the change itself: the core of the uffd write [un]protect
      operation is moved into a separate function (uffd_wp_range()) and it is
      reused in the unregister code path.
      
      Note that the new function will not check for anything, e.g.  ranges or
      memory types, because they should have been checked during the previous
      UFFDIO_REGISTER or it should have failed already.  It also doesn't check
      mmap_changing because we hold the mmap write lock anyway.
      
      I added a Fixes tag pointing at the introduction of uffd-wp for
      shmem+hugetlbfs because that's the only issue reported so far and that's the
      commit from which David's reproducer starts working (v5.19+).  But the whole
      idea actually applies not only to file memory but also to anonymous memory.
      It's just that we don't need to fix anonymous memory prior to v5.19 because
      there's no known way to exploit it.
      
      IOW, this patch can also fix the issue reported in [1], as patch 2 does.
      
      [1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/
      
      Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
      Fixes: b1f9e876 ("mm/uffd: enable write protection for shmem & hugetlbfs")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f369b07c
    • Peter Xu's avatar
      mm/smaps: don't access young/dirty bit if pte unpresent · efd41493
      Peter Xu authored
      These bits are only valid when the ptes are present.  Introduce two
      booleans for them and set them to false when !pte_present(), for both
      pte and pmd accounting.
      
      The bug was found during code reading and no real-world issue has been
      reported, but logically such an error can cause incorrect readings in
      either smaps or smaps_rollup output for quite a few fields.
      
      For example, it could cause over-estimates of values like Shared_Dirty,
      Private_Dirty, Referenced.  Or it could cause under-estimates of values
      like LazyFree, Shared_Clean, Private_Clean.
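      
      The idea can be sketched as a small userspace model.  This is only an
      illustration under assumed names (struct model_pte, struct model_stats
      and model_account() are hypothetical), not the actual smaps code: the
      young/dirty bits are read only for present entries and default to false
      otherwise.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a page table entry. */
struct model_pte {
	bool present;	/* false for swap or migration entries */
	bool young;	/* meaningful only when present */
	bool dirty;	/* meaningful only when present */
};

/* Hypothetical subset of the counters reported by smaps. */
struct model_stats {
	unsigned long referenced;
	unsigned long dirty;
	unsigned long clean;
};

/* Account one mapping of 'size' bytes, ignoring stale bits if !present. */
static void model_account(struct model_stats *st, struct model_pte pte,
			  unsigned long size)
{
	bool young = false;
	bool dirty = false;

	if (pte.present) {
		young = pte.young;
		dirty = pte.dirty;
	}

	if (young)
		st->referenced += size;
	if (dirty)
		st->dirty += size;
	else
		st->clean += size;
}
```

      In this model a non-present entry whose bit positions happen to be set
      is counted as clean and unreferenced, which matches the direction of
      the over- and under-estimates listed above.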
      
      Link: https://lkml.kernel.org/r/20220805160003.58929-1-peterx@redhat.com
      Fixes: b1d4d9e0 ("proc/smaps: carefully handle migration entries")
      Fixes: c94b6923 ("/proc/PID/smaps: Add PMD migration entry parsing")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      efd41493
    • Hao Lee's avatar
      mm: add DEVICE_ZONE to FOR_ALL_ZONES · a39c5d3c
      Hao Lee authored
      FOR_ALL_ZONES should be consistent with enum zone_type.  Otherwise,
      __count_zid_vm_events() can add the count to the wrong item when zid is
      ZONE_DEVICE.
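      
      A reduced model of the indexing shows why the two lists have to match.
      All names below (model_zone, MODEL_FOR_ALL_ZONES, model_count_zid and
      the M-prefixed items) are hypothetical stand-ins, and the real enums
      contain more zones and events.

```c
/* Hypothetical, reduced model of enum zone_type. */
enum model_zone { MZONE_NORMAL, MZONE_MOVABLE, MZONE_DEVICE, MZONE_COUNT };

/* One item per zone, in enum order, including the device zone. */
#define MODEL_FOR_ALL_ZONES(xx) xx##_NORMAL, xx##_MOVABLE, xx##_DEVICE

enum model_event {
	MODEL_FOR_ALL_ZONES(MPGALLOC),
	MPGFREE,
	MODEL_EVENT_COUNT
};

static unsigned long model_events[MODEL_EVENT_COUNT];

/* Select the per-zone item by offsetting from the _NORMAL item. */
#define model_count_zid(item, zid, delta) \
	(model_events[item##_NORMAL - MZONE_NORMAL + (zid)] += (delta))
```

      If MODEL_FOR_ALL_ZONES omitted the _DEVICE entry, the same offset for
      MZONE_DEVICE would land on MPGFREE, the next unrelated item.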
      
      Link: https://lkml.kernel.org/r/20220807154442.GA18167@haolee.io
      Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a39c5d3c
    • Randy Dunlap's avatar
      kernel/sys_ni: add compat entry for fadvise64_64 · a8faed3a
      Randy Dunlap authored
      When CONFIG_ADVISE_SYSCALLS is not set/enabled and CONFIG_COMPAT is
      set/enabled, the riscv compat_syscall_table references
      'compat_sys_fadvise64_64', which is not defined:
      
      riscv64-linux-ld: arch/riscv/kernel/compat_syscall_table.o:(.rodata+0x6f8):
      undefined reference to `compat_sys_fadvise64_64'
      
      Add 'fadvise64_64' to kernel/sys_ni.c as a conditional COMPAT function so
      that when CONFIG_ADVISE_SYSCALLS is not set, there is a fallback function
      available.
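      
      The fallback mechanism can be illustrated with a weak symbol in plain C.
      This is a sketch of the general technique only, assuming GCC or Clang on
      an ELF target; the names model_compat_sys_fadvise64_64 and
      model_compat_table are hypothetical and this is not the kernel's macro.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for a conditional syscall entry: a weak definition
 * that a strong definition in another object file would override at link
 * time. Without a strong definition, references still resolve to this one.
 */
__attribute__((weak)) long model_compat_sys_fadvise64_64(void)
{
	return -ENOSYS;
}

/* Model of a syscall table slot that references the symbol. */
static long (*const model_compat_table[])(void) = {
	model_compat_sys_fadvise64_64,
};
```

      Here the table links even when the real implementation is compiled out,
      and a call through the slot reports that the syscall is not implemented.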
      
      Link: https://lkml.kernel.org/r/20220807220934.5689-1-rdunlap@infradead.org
      Fixes: d3ac21ca ("mm: Support compiling out madvise and fadvise")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a8faed3a
    • David Hildenbrand's avatar
      mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW · 5535be30
      David Hildenbrand authored
      Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
      that FOLL_FORCE can be dangerous, especially if there are races that can
      be exploited by user space.
      
      Right now, it would be sufficient to have some code that sets a PTE of a
      R/O-mapped shared page dirty, in order for it to erroneously become
      writable by FOLL_FORCE.  The implications of setting a write-protected PTE
      dirty might not be immediately obvious to everyone.
      
      And in fact ever since commit 9ae0f87d ("mm/shmem: unconditionally set
      pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
      a shmem page R/O while marking the pte dirty.  This can be used by
      unprivileged user space to modify tmpfs/shmem file content even if the
      user does not have write permissions to the file, and to bypass memfd
      write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
      
      To fix such security issues for good, the insight is that we really only
      need that fancy retry logic (FOLL_COW) for COW mappings that are not
      writable (!VM_WRITE).  And in a COW mapping, we really only broke COW if
      we have an exclusive anonymous page mapped.  If we have something else
      mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
      we have to trigger a write fault to break COW.  If we don't find an
      exclusive anonymous page when we retry, we have to trigger COW breaking
      once again because something intervened.
      
      Let's move away from this mandatory-retry + dirty handling and rely on our
      PageAnonExclusive() flag for making a similar decision, to use the same
      COW logic as in other kernel parts here as well.  In case we stumble over
      a PTE in a COW mapping that does not map an exclusive anonymous page, COW
      was not properly broken and we have to trigger a fake write-fault to break
      COW.
      
      Just like we do in can_change_pte_writable() added via commit 64fe24a3
      ("mm/mprotect: try avoiding write faults for exclusive anonymous pages
      when changing protection") and commit 76aefad6 ("mm/mprotect: fix
      soft-dirty check in can_change_pte_writable()"), take care of softdirty
      and uffd-wp manually.
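      
      The resulting decision can be sketched as a userspace model.  This is a
      simplified illustration under assumed names (struct model_gup_state and
      model_can_follow_write() are hypothetical), not the kernel function
      itself; each field flattens a check the text above describes.

```c
#include <stdbool.h>

/* Hypothetical flattened view of the state the check looks at. */
struct model_gup_state {
	bool pte_writable;	/* PTE already allows writes */
	bool foll_force;	/* caller passed FOLL_FORCE */
	bool vma_shared;	/* shared mapping */
	bool vma_maywrite;	/* mapping may be made writable */
	bool vma_write;		/* mapping is currently writable (VM_WRITE) */
	bool anon_exclusive;	/* an exclusive anonymous page is mapped */
	bool softdirty_pending;	/* soft-dirty tracking on, PTE not soft-dirty */
	bool uffd_wp;		/* PTE is uffd write-protected */
};

/* Return true if a write may proceed without triggering a write fault. */
static bool model_can_follow_write(const struct model_gup_state *s)
{
	if (s->pte_writable)
		return true;
	if (!s->foll_force)
		return false;
	/* FOLL_FORCE only applies to COW mappings that are not writable. */
	if (s->vma_shared || !s->vma_maywrite || s->vma_write)
		return false;
	/* COW counts as broken only for an exclusive anonymous page. */
	if (!s->anon_exclusive)
		return false;
	/* Soft-dirty and uffd-wp still require a real write fault. */
	if (s->softdirty_pending)
		return false;
	return !s->uffd_wp;
}
```

      Note that the model never consults a dirty bit: whether the PTE is
      dirty plays no role in the decision.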
      
      For example, a write() via /proc/self/mem to a uffd-wp-protected range has
      to fail instead of silently granting write access and bypassing the
      userspace fault handler.  Note that FOLL_FORCE is not only used for debug
      access, but also triggered by applications without debug intentions, for
      example, when pinning pages via RDMA.
      
      This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
      affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
      
      Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
      let's just get rid of it.
      
      Thanks to Nadav Amit for pointing out that the pte_dirty() check in
      FOLL_FORCE code is problematic and might be exploitable.
      
      Note 1: We don't check for the PTE being dirty because it doesn't matter
      	for making a "was COWed" decision anymore, and whoever modifies the
      	page has to set the page dirty either way.
      
      Note 2: Kernels before extended uffd-wp support and before
      	PageAnonExclusive (< 5.19) can simply revert the problematic
      	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
      	v5.19 requires minor adjustments due to lack of
      	vma_soft_dirty_enabled().
      
      Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
      Fixes: 9ae0f87d ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.16]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5535be30
    • Jiri Slaby's avatar
      Revert "zram: remove double compression logic" · 37887783
      Jiri Slaby authored
      This reverts commit e7be8d1d ("zram: remove double compression
      logic") as it causes zram failures.  It does not revert cleanly because
      PTR_ERR handling was introduced in the meantime; this is handled with an
      appropriate IS_ERR check.
      
      When under memory pressure, zs_malloc() can fail.  Before the above
      commit, the allocation was retried with direct reclaim enabled (GFP_NOIO).
      After the commit, it is not -- only __GFP_KSWAPD_RECLAIM is tried.
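      
      The retry pattern being restored can be sketched as a userspace model.
      The names model_alloc_fn, model_alloc_with_retry() and
      model_pressure_alloc() are hypothetical, and the boolean flag only
      stands in for whether direct reclaim is allowed; this is not the zram
      code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allocator callback: returns NULL on failure. */
typedef void *(*model_alloc_fn)(size_t size, bool direct_reclaim);

/* Try a cheap allocation first, then retry with direct reclaim allowed. */
static void *model_alloc_with_retry(model_alloc_fn alloc, size_t size,
				    int *attempts)
{
	void *p;

	*attempts = 1;
	p = alloc(size, false);
	if (p)
		return p;

	*attempts = 2;
	return alloc(size, true);
}

/* Fake allocator that succeeds only when direct reclaim is allowed. */
static void *model_pressure_alloc(size_t size, bool direct_reclaim)
{
	static char buf[64];

	if (!direct_reclaim || size > sizeof(buf))
		return NULL;
	return buf;
}
```

      Without the second attempt, the caller in this model would see a
      failure under pressure even though memory could have been reclaimed.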
      
      So when the failure occurs under memory pressure, the overlying
      filesystem, such as ext2 (mounted by the ext4 module in this case), can
      emit failures, making the (file)system unusable:
        EXT4-fs warning (device zram0): ext4_end_bio:343: I/O error 10 writing to inode 16386 starting block 159744)
        Buffer I/O error on device zram0, logical block 159744
      
      With direct reclaim, memory is really reclaimed and the allocation
      eventually succeeds.  In the worst case, the oom killer is invoked, which
      is the proper outcome if the user sets up zram too large (in comparison
      to available RAM).
      
      This very diff doesn't apply cleanly to 5.19 (stable) (see the PTR_ERR
      note above).  Use a revert of e7be8d1d directly there.
      
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1202203
      Link: https://lkml.kernel.org/r/20220810070609.14402-1-jslaby@suse.cz
      Fixes: e7be8d1d ("zram: remove double compression logic")
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Dmitry Rokosov <ddrokosov@sberdevices.ru>
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.19]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      37887783
    • Dan Carpenter's avatar
      get_maintainer: add Alan to .get_maintainer.ignore · d10a72de
      Dan Carpenter authored
      Alan asked to be added to the .get_maintainer.ignore list.
      
      Link: https://lkml.kernel.org/r/YvN30KhO9aD5Sza9@kili
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d10a72de
    • Linus Torvalds's avatar
      Merge tag 'kbuild-fixes-v6.0' of... · 15b3f48a
      Linus Torvalds authored
      Merge tag 'kbuild-fixes-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
      
      Pull Kbuild fixes from Masahiro Yamada:
      
       - Fix module versioning broken on some architectures
      
       - Make dummy-tools enable CONFIG_PPC_LONG_DOUBLE_128
      
       - Remove -Wformat-zero-length, which has no warning instance
      
       - Fix the order between drivers and libs in modules.order
      
       - Fix false-positive warnings in clang-analyzer
      
      * tag 'kbuild-fixes-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        scripts/clang-tools: Remove DeprecatedOrUnsafeBufferHandling check
        kbuild: fix the modules order between drivers and libs
        scripts/Makefile.extrawarn: Do not disable clang's -Wformat-zero-length
        kbuild: dummy-tools: pretend we understand __LONG_DOUBLE_128__
        modpost: fix module versioning when a symbol lacks valid CRC
      15b3f48a