1. 05 Jun, 2023 4 commits
    • chenzhiyin's avatar
      fs.h: Optimize file struct to prevent false sharing · a7bc2e8d
      chenzhiyin authored
      In the syscall test of UnixBench, performance regression occurred due
      to false sharing.
      
      The lock and atomic members, including file::f_lock, file::f_count and
      file::f_pos_lock are highly contended and frequently updated in the
      high-concurrency test scenarios. perf c2c indentified one affected
      read access, file::f_op.
      To prevent false sharing, the layout of file struct is changed as
      following
      (A) f_lock, f_count and f_pos_lock are put together to share the same
      cache line.
      (B) The read mostly members, including f_path, f_inode, f_op are put
      into a separate cache line.
      (C) f_mode is put together with f_count, since they are used frequently
       at the same time.
      Due to '__randomize_layout' attribute of file struct, the updated layout
      only can be effective when CONFIG_RANDSTRUCT_NONE is 'y'.
      
      The optimization has been validated in the syscall test of UnixBench.
      performance gain is 30~50%. Furthermore, to confirm the optimization
      effectiveness on the other codes path, the results of fsdisk, fsbuffer
      and fstime are also shown.
      
      Here are the detailed test results of unixbench.
      
      Command: numactl -C 3-18 ./Run -c 16 syscall fsbuffer fstime fsdisk
      
      Without Patch
      ------------------------------------------------------------------------
      File Copy 1024 bufsize 2000 maxblocks   875052.1 KBps  (30.0 s, 2 samples)
      File Copy 256 bufsize 500 maxblocks     235484.0 KBps  (30.0 s, 2 samples)
      File Copy 4096 bufsize 8000 maxblocks  2815153.5 KBps  (30.0 s, 2 samples)
      System Call Overhead                   5772268.3 lps   (10.0 s, 7 samples)
      
      System Benchmarks Partial Index         BASELINE       RESULT    INDEX
      File Copy 1024 bufsize 2000 maxblocks     3960.0     875052.1   2209.7
      File Copy 256 bufsize 500 maxblocks       1655.0     235484.0   1422.9
      File Copy 4096 bufsize 8000 maxblocks     5800.0    2815153.5   4853.7
      System Call Overhead                     15000.0    5772268.3   3848.2
                                                                    ========
      System Benchmarks Index Score (Partial Only)                    2768.3
      
      With Patch
      ------------------------------------------------------------------------
      File Copy 1024 bufsize 2000 maxblocks  1009977.2 KBps  (30.0 s, 2 samples)
      File Copy 256 bufsize 500 maxblocks     264765.9 KBps  (30.0 s, 2 samples)
      File Copy 4096 bufsize 8000 maxblocks  3052236.0 KBps  (30.0 s, 2 samples)
      System Call Overhead                   8237404.4 lps   (10.0 s, 7 samples)
      
      System Benchmarks Partial Index         BASELINE       RESULT    INDEX
      File Copy 1024 bufsize 2000 maxblocks     3960.0    1009977.2   2550.4
      File Copy 256 bufsize 500 maxblocks       1655.0     264765.9   1599.8
      File Copy 4096 bufsize 8000 maxblocks     5800.0    3052236.0   5262.5
      System Call Overhead                     15000.0    8237404.4   5491.6
                                                                    ========
      System Benchmarks Index Score (Partial Only)                    3295.3
      Signed-off-by: default avatarchenzhiyin <zhiyin.chen@intel.com>
      Message-Id: <20230601092400.27162-1-zhiyin.chen@intel.com>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      a7bc2e8d
    • Fabio M. De Francesco's avatar
      highmem: Rename put_and_unmap_page() to unmap_and_put_page() · d0e13540
      Fabio M. De Francesco authored
      With commit 849ad04c ("new helper: put_and_unmap_page()"), Al Viro
      introduced the put_and_unmap_page() to use in those many places where we
      have a common pattern consisting of calls to kunmap_local() +
      put_page().
      
      Obviously, first we unmap and then we put pages. Instead, the original
      name of this helper seems to imply that we first put and then unmap.
      
      Therefore, rename the helper and change the only known upstreamed user
      (i.e., fs/sysv) before this helper enters common use and might become
      difficult to find all call sites and instead easy to break the builds.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarFabio M. De Francesco <fmdefrancesco@gmail.com>
      Reviewed-by: default avatarEric Biggers <ebiggers@google.com>
      Message-Id: <20230602103307.5637-1-fmdefrancesco@gmail.com>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      d0e13540
    • David Howells's avatar
      cachefiles: Allow the cache to be non-root · 79aa2849
      David Howells authored
      Set mode 0600 on files in the cache so that cachefilesd can run as an
      unprivileged user rather than leaving the files all with 0.  Directories
      are already set to 0700.
      
      Userspace then needs to set the uid and gid before issuing the "bind"
      command and the cache must've been chown'd to those IDs.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Reviewed-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatarGao Xiang <hsiangkao@linux.alibaba.com>
      cc: David Howells <dhowells@redhat.com>
      cc: Jeff Layton <jlayton@kernel.org>
      cc: linux-cachefs@redhat.com
      cc: linux-erofs@lists.ozlabs.org
      cc: linux-fsdevel@vger.kernel.org
      Message-Id: <1853230.1684516880@warthog.procyon.org.uk>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      79aa2849
    • Yihuan Pan's avatar
      init: remove unused names parameter in split_fs_names() · 26e293f7
      Yihuan Pan authored
      The split_fs_names() function takes a names parameter, but the function
      actually uses the root_fs_names global variable instead. This names
      parameter is not used in the function, so it can be safely removed.
      
      This change does not affect the functionality of split_fs_names() or any
      other part of the kernel.
      Signed-off-by: default avatarYihuan Pan <xun794@gmail.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Message-Id: <4lsiigvaw4lxcs37rlhgepv77xyxym6krkqcpc3xfncnswok3y@b67z3b44orar>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      26e293f7
  2. 02 Jun, 2023 1 commit
    • Kees Cook's avatar
      jfs: Use unsigned variable for length calculations · 820eb59d
      Kees Cook authored
      To avoid confusing the compiler about possible negative sizes, switch
      "ssize" which can never be negative from int to u32.  Seen with GCC 13:
      
      ../fs/jfs/namei.c: In function 'jfs_symlink': ../include/linux/fortify-string.h:57:33: warning: '__builtin_memcpy' pointer overflow between offset 0 and size [-2147483648, -1]
      [-Warray-bounds=]
         57 | #define __underlying_memcpy     __builtin_memcpy
            |                                 ^
      ...
      ../fs/jfs/namei.c:950:17: note: in expansion of macro 'memcpy'
        950 |                 memcpy(ip->i_link, name, ssize);
            |                 ^~~~~~
      
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: jfs-discussion@lists.sourceforge.net
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarJeff Xu <jeffxu@chromium.org>
      Message-Id: <20230204183355.never.877-kees@kernel.org>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      820eb59d
  3. 01 Jun, 2023 1 commit
  4. 24 May, 2023 1 commit
    • David Sterba's avatar
      fs: use UB-safe check for signed addition overflow in remap_verify_area · b7a9a503
      David Sterba authored
      The following warning pops up with enabled UBSAN in tests fstests/generic/303:
      
        [23127.529395] UBSAN: Undefined behaviour in fs/read_write.c:1725:7
        [23127.529400] signed integer overflow:
        [23127.529403] 4611686018427322368 + 9223372036854775807 cannot be represented in type 'long long int'
        [23127.529412] CPU: 4 PID: 26180 Comm: xfs_io Not tainted 5.2.0-rc2-1.ge195904-vanilla+ #450
        [23127.556999] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
        [23127.557001] Call Trace:
        [23127.557060]  dump_stack+0x67/0x9b
        [23127.557070]  ubsan_epilogue+0x9/0x40
        [23127.573496]  handle_overflow+0xb3/0xc0
        [23127.573514]  do_clone_file_range+0x28f/0x2a0
        [23127.573547]  vfs_clone_file_range+0x35/0xb0
        [23127.573564]  ioctl_file_clone+0x8d/0xc0
        [23127.590144]  do_vfs_ioctl+0x300/0x700
        [23127.590160]  ksys_ioctl+0x70/0x80
        [23127.590203]  ? trace_hardirqs_off_thunk+0x1a/0x1c
        [23127.590210]  __x64_sys_ioctl+0x16/0x20
        [23127.590215]  do_syscall_64+0x5c/0x1d0
        [23127.590224]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [23127.590231] RIP: 0033:0x7ff6d7250327
        [23127.590241] RSP: 002b:00007ffe3a38f1d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
        [23127.590246] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007ff6d7250327
        [23127.590249] RDX: 00007ffe3a38f220 RSI: 000000004020940d RDI: 0000000000000003
        [23127.590252] RBP: 0000000000000000 R08: 00007ffe3a3c80a0 R09: 00007ffe3a3c8080
        [23127.590255] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 0000000000000000
        [23127.590260] R13: 0000000000000000 R14: 3fffffffffff0000 R15: 00007ff6d750a20c
      
      As loff_t is a signed type, we should use the safe overflow checks
      instead of relying on compiler implementation.
      
      The bogus values are intentional and the test is supposed to verify the
      boundary conditions.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Message-Id: <20230523162628.17071-1-dsterba@suse.com>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      b7a9a503
  5. 17 May, 2023 4 commits
    • Arnd Bergmann's avatar
      procfs: consolidate arch_report_meminfo declaration · ef104443
      Arnd Bergmann authored
      The arch_report_meminfo() function is provided by four architectures,
      with a __weak fallback in procfs itself. On architectures that don't
      have a custom version, the __weak version causes a warning because
      of the missing prototype.
      
      Remove the architecture specific prototypes and instead add one
      in linux/proc_fs.h.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for arch/x86
      Acked-by: Helge Deller <deller@gmx.de> # parisc
      Reviewed-by: default avatarAlexander Gordeev <agordeev@linux.ibm.com>
      Message-Id: <20230516195834.551901-1-arnd@kernel.org>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      ef104443
    • Arnd Bergmann's avatar
      fs: pipe: reveal missing function protoypes · 247c8d2f
      Arnd Bergmann authored
      A couple of functions from fs/pipe.c are used both internally
      and for the watch queue code, but the declaration is only
      visible when the latter is enabled:
      
      fs/pipe.c:1254:5: error: no previous prototype for 'pipe_resize_ring'
      fs/pipe.c:758:15: error: no previous prototype for 'account_pipe_buffers'
      fs/pipe.c:764:6: error: no previous prototype for 'too_many_pipe_buffers_soft'
      fs/pipe.c:771:6: error: no previous prototype for 'too_many_pipe_buffers_hard'
      fs/pipe.c:777:6: error: no previous prototype for 'pipe_is_unprivileged_user'
      
      Make the visible unconditionally to avoid these warnings.
      
      Fixes: c73be61c ("pipe: Add general notification queue support")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Message-Id: <20230516195629.551602-1-arnd@kernel.org>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      247c8d2f
    • Arnd Bergmann's avatar
      fs: d_path: include internal.h · df67cb4c
      Arnd Bergmann authored
      make W=1 warns about a missing prototype that is defined but
      not visible at point where simple_dname() is defined:
      
      fs/d_path.c:317:7: error: no previous prototype for 'simple_dname' [-Werror=missing-prototypes]
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Message-Id: <20230516195444.551461-1-arnd@kernel.org>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      df67cb4c
    • Vladimir Sementsov-Ogievskiy's avatar
      coredump: require O_WRONLY instead of O_RDWR · 88e46070
      Vladimir Sementsov-Ogievskiy authored
      The motivation for this patch has been to enable using a stricter
      apparmor profile to prevent programs from reading any coredump in the
      system.
      
      However, this became something else. The following details are based on
      Christian's and Linus' archeology into the history of the number "2" in
      the coredump handling code.
      
      To make sure we're not accidently introducing some subtle behavioral
      change into the coredump code we set out on a voyage into the depths of
      history.git to figure out why this was O_RDWR in the first place.
      
      Coredump handling was introduced over 30 years ago in commit
      ddc733f4 ("[PATCH] Linux-0.97 (August 1, 1992)").
      The original code used O_WRONLY:
      
          open_namei("core",O_CREAT | O_WRONLY | O_TRUNC,0600,&inode,NULL)
      
      However, this changed in 1993 and starting with commit
      9cb9f18b ("[PATCH] Linux-0.99.10 (June 7, 1993)") the coredump code
      suddenly used the constant "2":
      
          open_namei("core",O_CREAT | 2 | O_TRUNC,0600,&inode,NULL)
      
      This was curious as in the same commit the kernel switched from
      constants to proper defines in other places such as KERNEL_DS and
      USER_DS and O_RDWR did already exist.
      
      So why was "2" used? It turns out that open_namei() - an early version
      of what later turned into filp_open() - didn't accept O_RDWR.
      
      A semantic quirk of the open() uapi is the definition of the O_RDONLY
      flag. It would seem natural to define:
      
          #define O_RDWR (O_RDONLY | O_WRONLY)
      
      but that isn't possible because:
      
          #define O_RDONLY 0
      
      This makes O_RDONLY effectively meaningless when passed to the kernel.
      In other words, there has never been a way - until O_PATH at least - to
      open a file without any permission; O_RDONLY was always implied on the
      uapi side while the kernel does in fact allow opening files without
      permissions.
      
      The trouble comes when trying to map the uapi flags onto the
      corresponding file mode flags FMODE_{READ,WRITE}. This mapping still
      happens today and is causing issues to this day (We ran into this
      during additions for openat2() for example.).
      
      So the special value "3" was used to indicate that the file was opened
      for special access:
      
          f->f_flags = flag = flags;
          f->f_mode = (flag+1) & O_ACCMODE;
          if (f->f_mode)
                  flag++;
      
      This allowed the file mode to be set to FMODE_READ | FMODE_WRITE mapping
      the O_{RDONLY,WRONLY,RDWR} flags into the FMODE_{READ,WRITE} flags. The
      special access then required read-write permissions and 0 was used to
      access symlinks.
      
      But back when ddc733f4 ("[PATCH] Linux-0.97 (August 1, 1992)") added
      coredump handling open_namei() took the FMODE_{READ,WRITE} flags as an
      argument. So the coredump handling introduced in
      ddc733f4 ("[PATCH] Linux-0.97 (August 1, 1992)") was buggy because
      O_WRONLY shouldn't have been passed. Since O_WRONLY is 1 but
      open_namei() took FMODE_{READ,WRITE} it was passed FMODE_READ on
      accident.
      
      So 9cb9f18b ("[PATCH] Linux-0.99.10 (June 7, 1993)") was a bugfix
      for this and the 2 didn't really mean O_RDWR, it meant FMODE_WRITE which
      was correct.
      
      The clue is that FMODE_{READ,WRITE} didn't exist yet and thus a raw "2"
      value was passed.
      
      Fast forward 5 years when around 2.2.4pre4 (February 16, 1999) this code
      was changed to:
      
          -       dentry = open_namei(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
          ...
          +       file = filp_open(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
      
      At this point the raw "2" should have become O_WRONLY again as
      filp_open() didn't take FMODE_{READ,WRITE} but O_{RDONLY,WRONLY,RDWR}.
      
      Another 17 years later, the code was changed again cementing the mistake
      and making it almost impossible to detect when commit
      378c6520 ("fs/coredump: prevent fsuid=0 dumps into user-controlled directories")
      replaced the raw "2" with O_RDWR.
      
      And now, here we are with this patch that sent us on a quest to answer
      the big questions in life such as "Why are coredump files opened with
      O_RDWR?" and "Is it safe to just use O_WRONLY?".
      
      So with this commit we're reintroducing O_WRONLY again and bringing this
      code back to its original state when it was first introduced in commit
      ddc733f4 ("[PATCH] Linux-0.97 (August 1, 1992)") over 30 years ago.
      Signed-off-by: default avatarVladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
      Message-Id: <20230420120409.602576-1-vsementsov@yandex-team.ru>
      [brauner@kernel.org: completely rewritten commit message]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarChristian Brauner <brauner@kernel.org>
      88e46070
  6. 15 May, 2023 5 commits
  7. 14 May, 2023 13 commits
  8. 13 May, 2023 11 commits