1. 13 Mar, 2019 3 commits
  2. 06 Mar, 2019 8 commits
    • Jaegeuk Kim's avatar
      f2fs: don't clear CP_QUOTA_NEED_FSCK_FLAG · fb40d618
      Jaegeuk Kim authored
      If we met this once, let fsck.f2fs clear this only.
      Note that, this addresses all the subtle fault injection test.
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      fb40d618
    • Chao Yu's avatar
      f2fs: don't allow negative ->write_io_size_bits · 6d52e135
      Chao Yu authored
      As Dan reported:
      
      "We put an upper bound on ->write_io_size_bits but we don't have a lower
      bound."
      
      So let's add lower bound check for ->write_io_size_bits in parse_options().
      
      [We don't allow configuring ->write_io_size_bits to zero, since at least
      we need to fill one dummy page for aligned IO.]
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      6d52e135
    • Chao Yu's avatar
      f2fs: fix to check inline_xattr_size boundary correctly · 500e0b28
      Chao Yu authored
      We use below condition to check inline_xattr_size boundary:
      
      	if (!F2FS_OPTION(sbi).inline_xattr_size ||
      		F2FS_OPTION(sbi).inline_xattr_size >=
      				DEF_ADDRS_PER_INODE -
      				F2FS_TOTAL_EXTRA_ATTR_SIZE -
      				DEF_INLINE_RESERVED_SIZE -
      				DEF_MIN_INLINE_SIZE)
      
      There is there problems in that check:
      - we should allow inline_xattr_size equaling to min size of inline
      {data,dentry} area.
      - F2FS_TOTAL_EXTRA_ATTR_SIZE and inline_xattr_size are based on
      different size unit, previous one is 4 bytes, latter one is 1 bytes.
      - DEF_MIN_INLINE_SIZE only indicate min size of inline data area,
      however, we need to consider min size of inline dentry area as well,
      minimal inline dentry should at least contain two entries: '.' and
      '..', so that min inline_dentry size is 40 bytes.
      
      .bitmap		1 * 1 = 1
      .reserved	1 * 1 = 1
      .dentry		11 * 2 = 22
      .filename	8 * 2 = 16
      total		40
      Signed-off-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      500e0b28
    • Sahitya Tummala's avatar
      f2fs: do not use mutex lock in atomic context · 9083977d
      Sahitya Tummala authored
      Fix below warning coming because of using mutex lock in atomic context.
      
      BUG: sleeping function called from invalid context at kernel/locking/mutex.c:98
      in_atomic(): 1, irqs_disabled(): 0, pid: 585, name: sh
      Preemption disabled at: __radix_tree_preload+0x28/0x130
      Call trace:
       dump_backtrace+0x0/0x2b4
       show_stack+0x20/0x28
       dump_stack+0xa8/0xe0
       ___might_sleep+0x144/0x194
       __might_sleep+0x58/0x8c
       mutex_lock+0x2c/0x48
       f2fs_trace_pid+0x88/0x14c
       f2fs_set_node_page_dirty+0xd0/0x184
      
      Do not use f2fs_radix_tree_insert() to avoid doing cond_resched() with
      spin_lock() acquired.
      Signed-off-by: default avatarSahitya Tummala <stummala@codeaurora.org>
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      9083977d
    • Chao Yu's avatar
      f2fs: fix potential data inconsistence of checkpoint · c42d28ce
      Chao Yu authored
      Previously, we changed lock from cp_rwsem to node_change, it solved
      the deadlock issue which was caused by below race condition:
      
      Thread A			Thread B
      - f2fs_setattr
       - f2fs_lock_op  -- read_lock
       - dquot_transfer
        - __dquot_transfer
         - dquot_acquire
          - commit_dqblk
           - f2fs_quota_write
            - f2fs_write_begin
             - f2fs_write_failed
      				- write_checkpoint
      				 - block_operations
      				  - f2fs_lock_all  -- write_lock
              - f2fs_truncate_blocks
               - f2fs_lock_op  -- read_lock
      
      But it breaks the sematics of cp_rwsem, in other callers like:
      - f2fs_file_write_iter -> f2fs_write_begin -> f2fs_write_failed
      - f2fs_direct_IO -> f2fs_write_failed
      
      We allow to truncate dnode w/o cp_rwsem held, result in incorrect sit
      bitmap update, which can cause further data corruption.
      
      So this patch reverts previous fix implementation, and try to fix
      deadlock by skipping calling f2fs_truncate_blocks() in f2fs_write_failed()
      only for quota file, and keep the preallocated data/node in the tail of
      quota file, we can expecte that the preallocated space can be used to
      store quota info latter soon.
      
      Fixes: af033b2a ("f2fs: guarantee journalled quota data by checkpoint")
      Signed-off-by: default avatarGao Xiang <gaoxiang25@huawei.com>
      Signed-off-by: default avatarSheng Yong <shengyong1@huawei.com>
      Signed-off-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      c42d28ce
    • Chengguang Xu's avatar
      f2fs: jump to label 'free_node_inode' when failing from d_make_root() · 025cdb16
      Chengguang Xu authored
      When sb->s_root is NULL dput() will do nothing,
      so jump to label 'free_node_inode' instead of lable
      'free_root_inode' when failing from d_make_root().
      Signed-off-by: default avatarChengguang Xu <cgxu519@gmx.com>
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      025cdb16
    • Chao Yu's avatar
      f2fs: fix to document inline_xattr_size option · 7321dd97
      Chao Yu authored
      We missed to add document for inline_xattr_size mount option in f2fs.txt,
      add it.
      Signed-off-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      7321dd97
    • zhengliang's avatar
      f2fs: fix to data block override node segment by mistake · a0770e13
      zhengliang authored
      v4: Rearrange the previous three versions.
      
      The following scenario could lead to data block override by mistake.
      
      TASK A            |  TASK kworker                                            |     TASK B                                            |       TASK C
                        |                                                          |                                                       |
      open              |                                                          |                                                       |
      write             |                                                          |                                                       |
      close             |                                                          |                                                       |
                        |  f2fs_write_data_pages                                   |                                                       |
                        |    f2fs_write_cache_pages                                |                                                       |
                        |      f2fs_outplace_write_data                            |                                                       |
                        |        f2fs_allocate_data_block (get block in seg S,     |                                                       |
                        |                                  S is full, and only     |                                                       |
                        |                                  have this valid data    |                                                       |
                        |                                  block)                  |                                                       |
                        |          allocate_segment                                |                                                       |
                        |          locate_dirty_segment (mark S as PRE)            |                                                       |
                        |        f2fs_submit_page_write (submit but is not         |                                                       |
                        |                                written on dev)           |                                                       |
      unlink            |                                                          |                                                       |
       iput_final       |                                                          |                                                       |
        f2fs_drop_inode |                                                          |                                                       |
          f2fs_truncate |                                                          |                                                       |
       (not evict)      |                                                          |                                                       |
                        |                                                          | write_checkpoint                                      |
                        |                                                          |  flush merged bio but not wait file data writeback    |
                        |                                                          |  set_prefree_as_free (mark S as FREE)                 |
                        |                                                          |                                                       | update NODE/DATA
                        |                                                          |                                                       | allocate_segment (select S)
                        |     writeback done                                       |                                                       |
      
      So we need to guarantee io complete before truncate inode in f2fs_drop_inode.
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarZheng Liang <zhengliang6@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      a0770e13
  3. 16 Feb, 2019 8 commits
  4. 04 Feb, 2019 1 commit
  5. 22 Jan, 2019 6 commits
    • Chao Yu's avatar
      f2fs: fix to set sbi dirty correctly · 20109873
      Chao Yu authored
      In order to record direct IO count, we add two additional type in
      enum count_type: F2FS_DIO_{WRITE,READ}, but those IO won't dirty
      filesystem metadata, so we don't need to set filesystem dirty in
      inc_page_count(), fix it.
      
      Fixes: 02b16d0a ("f2fs: add to account direct IO")
      Signed-off-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      20109873
    • Chao Yu's avatar
      f2fs: fix to initialize variable to avoid UBSAN/smatch warning · f9aa52a8
      Chao Yu authored
      As Dan Carpenter as below:
      
      The patch df634f444ee9: "f2fs: use rb_*_cached friends" from Oct 4,
      2018, leads to the following static checker warning:
      
      	fs/f2fs/extent_cache.c:606 f2fs_update_extent_tree_range()
      	error: uninitialized symbol 'leftmost'.
      
      And also Eric Biggers, and Kyungtae Kim reported, there is an UBSAN
      warning described as below:
      
      We report a bug in linux-4.20.2: "UBSAN: Undefined behaviour in
      fs/f2fs/extent_cache.c"
      
      kernel config: https://kt0755.github.io/etc/config_v4.20_stable
      repro: https://kt0755.github.io/etc/repro.4a3e7.c (f2fs is mounted on
      /mnt/f2fs/)
      
      This arose in f2fs_update_extent_tree_range (fs/f2fs/extent_cache.c:605).
      It seems that, for some reason, its last argument became "24"
      although that was supposed to be bool type.
      
      =========================================
      UBSAN: Undefined behaviour in fs/f2fs/extent_cache.c:605:4
      load of value 24 is not a valid value for type '_Bool'
      CPU: 0 PID: 6774 Comm: syz-executor5 Not tainted 4.20.2 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0xb1/0x118 lib/dump_stack.c:113
       ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
       __ubsan_handle_load_invalid_value+0x17a/0x1be lib/ubsan.c:457
       f2fs_update_extent_tree_range+0x1d4a/0x1d50 fs/f2fs/extent_cache.c:605
       f2fs_update_extent_cache+0x2b6/0x350 fs/f2fs/extent_cache.c:804
       f2fs_update_data_blkaddr+0x61/0x70 fs/f2fs/data.c:656
       f2fs_outplace_write_data+0x1d6/0x4b0 fs/f2fs/segment.c:3140
       f2fs_convert_inline_page+0x86d/0x2060 fs/f2fs/inline.c:163
       f2fs_convert_inline_inode+0x6b5/0xad0 fs/f2fs/inline.c:208
       f2fs_preallocate_blocks+0x78b/0xb00 fs/f2fs/data.c:982
       f2fs_file_write_iter+0x31b/0xf40 fs/f2fs/file.c:3062
       call_write_iter include/linux/fs.h:1857 [inline]
       new_sync_write fs/read_write.c:474 [inline]
       __vfs_write+0x538/0x6e0 fs/read_write.c:487
       vfs_write+0x1b3/0x520 fs/read_write.c:549
       ksys_write+0xde/0x1c0 fs/read_write.c:598
       __do_sys_write fs/read_write.c:610 [inline]
       __se_sys_write fs/read_write.c:607 [inline]
       __x64_sys_write+0x7e/0xc0 fs/read_write.c:607
       do_syscall_64+0xbe/0x4f0 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4497b9
      Code: e8 8c 9f 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48
      89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
      01 f0 ff ff 0f 83 9b 6b fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f1ea15edc68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00007f1ea15ee6cc RCX: 00000000004497b9
      RDX: 0000000000001000 RSI: 0000000020000140 RDI: 0000000000000013
      RBP: 000000000071bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 000000000000bb50 R14: 00000000006f4bf0 R15: 00007f1ea15ee700
      =========================================
      
      As I checked, this uninitialized variable won't cause extent cache
      corruption, but in order to avoid such kind of warning of both UBSAN
      and smatch, fix to initialize related variable.
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reported-by: default avatarEric Biggers <ebiggers@google.com>
      Reported-by: default avatarKyungtae Kim <kt0755@gmail.com>
      Signed-off-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      f9aa52a8
    • Sheng Yong's avatar
      f2fs: UBSAN: set boolean value iostat_enable correctly · ac929858
      Sheng Yong authored
      When setting /sys/fs/f2fs/<DEV>/iostat_enable with non-bool value, UBSAN
      reports the following warning.
      
      [ 7562.295484] ================================================================================
      [ 7562.296531] UBSAN: Undefined behaviour in fs/f2fs/f2fs.h:2776:10
      [ 7562.297651] load of value 64 is not a valid value for type '_Bool'
      [ 7562.298642] CPU: 1 PID: 7487 Comm: dd Not tainted 4.20.0-rc4+ #79
      [ 7562.298653] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [ 7562.298662] Call Trace:
      [ 7562.298760]  dump_stack+0x46/0x5b
      [ 7562.298811]  ubsan_epilogue+0x9/0x40
      [ 7562.298830]  __ubsan_handle_load_invalid_value+0x72/0x90
      [ 7562.298863]  f2fs_file_write_iter+0x29f/0x3f0
      [ 7562.298905]  __vfs_write+0x115/0x160
      [ 7562.298922]  vfs_write+0xa7/0x190
      [ 7562.298934]  ksys_write+0x50/0xc0
      [ 7562.298973]  do_syscall_64+0x4a/0xe0
      [ 7562.298992]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 7562.299001] RIP: 0033:0x7fa45ec19c00
      [ 7562.299004] Code: 73 01 c3 48 8b 0d 88 92 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d dd eb 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ce 8f 01 00 48 89 04 24
      [ 7562.299044] RSP: 002b:00007ffca52b49e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [ 7562.299052] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa45ec19c00
      [ 7562.299059] RDX: 0000000000000400 RSI: 000000000093f000 RDI: 0000000000000001
      [ 7562.299065] RBP: 000000000093f000 R08: 0000000000000004 R09: 0000000000000000
      [ 7562.299071] R10: 00007ffca52b47b0 R11: 0000000000000246 R12: 0000000000000400
      [ 7562.299077] R13: 000000000093f000 R14: 000000000093f400 R15: 0000000000000000
      [ 7562.299091] ================================================================================
      
      So, if iostat_enable is enabled, set its value as true.
      Signed-off-by: default avatarSheng Yong <shengyong1@huawei.com>
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      ac929858
    • Sheng Yong's avatar
      f2fs: add brackets for macros · 2f84babf
      Sheng Yong authored
      Signed-off-by: default avatarSheng Yong <shengyong1@huawei.com>
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      2f84babf
    • Sheng Yong's avatar
      f2fs: check if file namelen exceeds max value · 720db068
      Sheng Yong authored
      Dentry bitmap is not enough to detect incorrect dentries. So this patch
      also checks the namelen value of a dentry.
      Signed-off-by: default avatarGong Chen <gongchen4@huawei.com>
      Signed-off-by: default avatarSheng Yong <shengyong1@huawei.com>
      Reviewed-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      720db068
    • Chao Yu's avatar
      f2fs: fix to trigger fsck if dirent.name_len is zero · ddf06b75
      Chao Yu authored
      While traversing dirents in f2fs_fill_dentries(), if bitmap is valid,
      filename length should not be zero, otherwise, directory structure
      consistency could be corrupted, in this case, let's print related
      info and set SBI_NEED_FSCK to trigger fsck for repairing.
      Signed-off-by: default avatarChao Yu <yuchao0@huawei.com>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      ddf06b75
  6. 09 Jan, 2019 3 commits
  7. 08 Jan, 2019 5 commits
  8. 07 Jan, 2019 4 commits
    • Masahiro Yamada's avatar
      arch: restore generic-y += shmparam.h for some architectures · 3bd6e94b
      Masahiro Yamada authored
      For some reasons, I accidentally got rid of "generic-y += shmparam.h"
      from some architectures.
      
      Restore them to fix building c6x, h8300, hexagon, m68k, microblaze,
      openrisc, and unicore32.
      
      Fixes: d6e4b3e3 ("arch: remove redundant UAPI generic-y defines")
      Signed-off-by: default avatarMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3bd6e94b
    • Linus Torvalds's avatar
      Linux 5.0-rc1 · bfeffd15
      Linus Torvalds authored
      bfeffd15
    • Linus Torvalds's avatar
      Merge tag 'kbuild-v4.21-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild · 85e1ffbd
      Linus Torvalds authored
      Pull more Kbuild updates from Masahiro Yamada:
      
       - improve boolinit.cocci and use_after_iter.cocci semantic patches
      
       - fix alignment for kallsyms
      
       - move 'asm goto' compiler test to Kconfig and clean up jump_label
         CONFIG option
      
       - generate asm-generic wrappers automatically if arch does not
         implement mandatory UAPI headers
      
       - remove redundant generic-y defines
      
       - misc cleanups
      
      * tag 'kbuild-v4.21-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
        kconfig: rename generated .*conf-cfg to *conf-cfg
        kbuild: remove unnecessary stubs for archheader and archscripts
        kbuild: use assignment instead of define ... endef for filechk_* rules
        arch: remove redundant UAPI generic-y defines
        kbuild: generate asm-generic wrappers if mandatory headers are missing
        arch: remove stale comments "UAPI Header export list"
        riscv: remove redundant kernel-space generic-y
        kbuild: change filechk to surround the given command with { }
        kbuild: remove redundant target cleaning on failure
        kbuild: clean up rule_dtc_dt_yaml
        kbuild: remove UIMAGE_IN and UIMAGE_OUT
        jump_label: move 'asm goto' support test to Kconfig
        kallsyms: lower alignment on ARM
        scripts: coccinelle: boolinit: drop warnings on named constants
        scripts: coccinelle: check for redeclaration
        kconfig: remove unused "file" field of yylval union
        nds32: remove redundant kernel-space generic-y
        nios2: remove unneeded HAS_DMA define
      85e1ffbd
    • Linus Torvalds's avatar
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ac5eed2b
      Linus Torvalds authored
      Pull perf tooling updates form Ingo Molnar:
       "A final batch of perf tooling changes: mostly fixes and small
        improvements"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
        perf session: Add comment for perf_session__register_idle_thread()
        perf thread-stack: Fix thread stack processing for the idle task
        perf thread-stack: Allocate an array of thread stacks
        perf thread-stack: Factor out thread_stack__init()
        perf thread-stack: Allow for a thread stack array
        perf thread-stack: Avoid direct reference to the thread's stack
        perf thread-stack: Tidy thread_stack__bottom() usage
        perf thread-stack: Simplify some code in thread_stack__process()
        tools gpio: Allow overriding CFLAGS
        tools power turbostat: Override CFLAGS assignments and add LDFLAGS to build command
        tools thermal tmon: Allow overriding CFLAGS assignments
        tools power x86_energy_perf_policy: Override CFLAGS assignments and add LDFLAGS to build command
        perf c2c: Increase the HITM ratio limit for displayed cachelines
        perf c2c: Change the default coalesce setup
        perf trace beauty ioctl: Beautify USBDEVFS_ commands
        perf trace beauty: Export function to get the files for a thread
        perf trace: Wire up ioctl's USBDEBFS_ cmd table generator
        perf beauty ioctl: Add generator for USBDEVFS_ ioctl commands
        tools headers uapi: Grab a copy of usbdevice_fs.h
        perf trace: Store the major number for a file when storing its pathname
        ...
      ac5eed2b
  9. 06 Jan, 2019 2 commits
    • Linus Torvalds's avatar
      Change mincore() to count "mapped" pages rather than "cached" pages · 574823bf
      Linus Torvalds authored
      The semantics of what "in core" means for the mincore() system call are
      somewhat unclear, but Linux has always (since 2.3.52, which is when
      mincore() was initially done) treated it as "page is available in page
      cache" rather than "page is mapped in the mapping".
      
      The problem with that traditional semantic is that it exposes a lot of
      system cache state that it really probably shouldn't, and that users
      shouldn't really even care about.
      
      So let's try to avoid that information leak by simply changing the
      semantics to be that mincore() counts actual mapped pages, not pages
      that might be cheaply mapped if they were faulted (note the "might be"
      part of the old semantics: being in the cache doesn't actually guarantee
      that you can access them without IO anyway, since things like network
      filesystems may have to revalidate the cache before use).
      
      In many ways the old semantics were somewhat insane even aside from the
      information leak issue.  From the very beginning (and that beginning is
      a long time ago: 2.3.52 was released in March 2000, I think), the code
      had a comment saying
      
        Later we can get more picky about what "in core" means precisely.
      
      and this is that "later".  Admittedly it is much later than is really
      comfortable.
      
      NOTE! This is a real semantic change, and it is for example known to
      change the output of "fincore", since that program literally does a
      mmmap without populating it, and then doing "mincore()" on that mapping
      that doesn't actually have any pages in it.
      
      I'm hoping that nobody actually has any workflow that cares, and the
      info leak is real.
      
      We may have to do something different if it turns out that people have
      valid reasons to want the old semantics, and if we can limit the
      information leak sanely.
      
      Cc: Kevin Easton <kevin@guarana.org>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Masatake YAMATO <yamato@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      574823bf
    • Linus Torvalds's avatar
      Fix 'acccess_ok()' on alpha and SH · 94bd8a05
      Linus Torvalds authored
      Commit 594cc251 ("make 'user_access_begin()' do 'access_ok()'")
      broke both alpha and SH booting in qemu, as noticed by Guenter Roeck.
      
      It turns out that the bug wasn't actually in that commit itself (which
      would have been surprising: it was mostly a no-op), but in how the
      addition of access_ok() to the strncpy_from_user() and strnlen_user()
      functions now triggered the case where those functions would test the
      access of the very last byte of the user address space.
      
      The string functions actually did that user range test before too, but
      they did it manually by just comparing against user_addr_max().  But
      with user_access_begin() doing the check (using "access_ok()"), it now
      exposed problems in the architecture implementations of that function.
      
      For example, on alpha, the access_ok() helper macro looked like this:
      
        #define __access_ok(addr, size) \
              ((get_fs().seg & (addr | size | (addr+size))) == 0)
      
      and what it basically tests is of any of the high bits get set (the
      USER_DS masking value is 0xfffffc0000000000).
      
      And that's completely wrong for the "addr+size" check.  Because it's
      off-by-one for the case where we check to the very end of the user
      address space, which is exactly what the strn*_user() functions do.
      
      Why? Because "addr+size" will be exactly the size of the address space,
      so trying to access the last byte of the user address space will fail
      the __access_ok() check, even though it shouldn't.  As a result, the
      user string accessor functions failed consistently - because they
      literally don't know how long the string is going to be, and the max
      access is going to be that last byte of the user address space.
      
      Side note: that alpha macro is buggy for another reason too - it re-uses
      the arguments twice.
      
      And SH has another version of almost the exact same bug:
      
        #define __addr_ok(addr) \
              ((unsigned long __force)(addr) < current_thread_info()->addr_limit.seg)
      
      so far so good: yes, a user address must be below the limit.  But then:
      
        #define __access_ok(addr, size)         \
              (__addr_ok((addr) + (size)))
      
      is wrong with the exact same off-by-one case: the case when "addr+size"
      is exactly _equal_ to the limit is actually perfectly fine (think "one
      byte access at the last address of the user address space")
      
      The SH version is actually seriously buggy in another way: it doesn't
      actually check for overflow, even though it did copy the _comment_ that
      talks about overflow.
      
      So it turns out that both SH and alpha actually have completely buggy
      implementations of access_ok(), but they happened to work in practice
      (although the SH overflow one is a serious serious security bug, not
      that anybody likely cares about SH security).
      
      This fixes the problems by using a similar macro on both alpha and SH.
      It isn't trying to be clever, the end address is based on this logic:
      
              unsigned long __ao_end = __ao_a + __ao_b - !!__ao_b;
      
      which basically says "add start and length, and then subtract one unless
      the length was zero".  We can't subtract one for a zero length, or we'd
      just hit an underflow instead.
      
      For a lot of access_ok() users the length is a constant, so this isn't
      actually as expensive as it initially looks.
      Reported-and-tested-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      94bd8a05