1. 31 May, 2019 32 commits
    • Amir Goldstein's avatar
      ovl: relax WARN_ON() for overlapping layers use case · ebe6000e
      Amir Goldstein authored
      commit acf3062a upstream.
      
      This nasty little syzbot repro:
      https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000
      
      Creates overlay mounts where the same directory is both in upper and lower
      layers. Simplified example:
      
        mkdir foo work
        mount -t overlay none foo -o"lowerdir=.,upperdir=foo,workdir=work"
      
      The repro runs several threads in parallel that attempt to chdir into foo
      and attempt to symlink/rename/exec/mkdir the file bar.
      
      The repro hits a WARN_ON() I placed in ovl_instantiate(), which suggests
      that an overlay inode already exists in cache and is hashed by the pointer
      of the real upper dentry that ovl_create_real() has just created. At the
      point of the WARN_ON(), for overlay dir inode lock is held and upper dir
      inode lock, so at first, I did not see how this was possible.
      
      On a closer look, I see that after ovl_create_real(), because of the
      overlapping upper and lower layers, a lookup by another thread can find the
      file foo/bar that was just created in upper layer, at overlay path
      foo/foo/bar and hash the an overlay inode with the new real dentry as lower
      dentry. This is possible because the overlay directory foo/foo is not
      locked and the upper dentry foo/bar is in dcache, so ovl_lookup() can find
      it without taking upper dir inode shared lock.
      
      Overlapping layers is considered a wrong setup which would result in
      unexpected behavior, but it shouldn't crash the kernel and it shouldn't
      trigger WARN_ON() either, so relax this WARN_ON() and leave a pr_warn()
      instead to cover all cases of failure to get an overlay inode.
      
      The error returned from failure to insert new inode to cache with
      inode_insert5() was changed to -EEXIST, to distinguish from the error
      -ENOMEM returned on failure to get/allocate inode with iget5_locked().
      
      Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com
      Fixes: 01b39dcc ("ovl: use inode_insert5() to hash a newly...")
      Signed-off-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ebe6000e
    • Will Deacon's avatar
      arm64: errata: Add workaround for Cortex-A76 erratum #1463225 · 2f71a4e3
      Will Deacon authored
      commit 969f5ea6 upstream.
      
      Revisions of the Cortex-A76 CPU prior to r4p0 are affected by an erratum
      that can prevent interrupts from being taken when single-stepping.
      
      This patch implements a software workaround to prevent userspace from
      effectively being able to disable interrupts.
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      2f71a4e3
    • Shile Zhang's avatar
      fbdev: fix divide error in fb_var_to_videomode · 4aeac859
      Shile Zhang authored
      commit cf84807f upstream.
      
      To fix following divide-by-zero error found by Syzkaller:
      
        divide error: 0000 [#1] SMP PTI
        CPU: 7 PID: 8447 Comm: test Kdump: loaded Not tainted 4.19.24-8.al7.x86_64 #1
        Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        RIP: 0010:fb_var_to_videomode+0xae/0xc0
        Code: 04 44 03 46 78 03 4e 7c 44 03 46 68 03 4e 70 89 ce d1 ee 69 c0 e8 03 00 00 f6 c2 01 0f 45 ce 83 e2 02 8d 34 09 0f 45 ce 31 d2 <41> f7 f0 31 d2 f7 f1 89 47 08 f3 c3 66 0f 1f 44 00 00 0f 1f 44 00
        RSP: 0018:ffffb7e189347bf0 EFLAGS: 00010246
        RAX: 00000000e1692410 RBX: ffffb7e189347d60 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb7e189347c10
        RBP: ffff99972a091c00 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000100
        R13: 0000000000010000 R14: 00007ffd66baf6d0 R15: 0000000000000000
        FS:  00007f2054d11740(0000) GS:ffff99972fbc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f205481fd20 CR3: 00000004288a0001 CR4: 00000000001606a0
        Call Trace:
         fb_set_var+0x257/0x390
         ? lookup_fast+0xbb/0x2b0
         ? fb_open+0xc0/0x140
         ? chrdev_open+0xa6/0x1a0
         do_fb_ioctl+0x445/0x5a0
         do_vfs_ioctl+0x92/0x5f0
         ? __alloc_fd+0x3d/0x160
         ksys_ioctl+0x60/0x90
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x5b/0x190
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f20548258d7
        Code: 44 00 00 48 8b 05 b9 15 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 15 2d 00 f7 d8 64 89 01 48
      
      It can be triggered easily with following test code:
      
        #include <linux/fb.h>
        #include <fcntl.h>
        #include <sys/ioctl.h>
        int main(void)
        {
                struct fb_var_screeninfo var = {.activate = 0x100, .pixclock = 60};
                int fd = open("/dev/fb0", O_RDWR);
                if (fd < 0)
                        return 1;
      
                if (ioctl(fd, FBIOPUT_VSCREENINFO, &var))
                        return 1;
      
                return 0;
        }
      Signed-off-by: default avatarShile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Fredrik Noring <noring@nocrew.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: default avatarMukesh Ojha <mojha@codeaurora.org>
      Signed-off-by: default avatarBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4aeac859
    • Tobin C. Harding's avatar
      btrfs: sysfs: don't leak memory when failing add fsid · 53804824
      Tobin C. Harding authored
      commit e3277335 upstream.
      
      A failed call to kobject_init_and_add() must be followed by a call to
      kobject_put().  Currently in the error path when adding fs_devices we
      are missing this call.  This could be fixed by calling
      btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or
      by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid().
      Here we choose the second option because it prevents the slightly
      unusual error path handling requirements of kobject from leaking out
      into btrfs functions.
      
      Add a call to kobject_put() in the error path of kobject_add_and_init().
      This causes the release method to be called if kobject_init_and_add()
      fails.  open_tree() is the function that calls btrfs_sysfs_add_fsid()
      and the error code in this function is already written with the
      assumption that the release method is called during the error path of
      open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the
      fail_fsdev_sysfs label).
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      53804824
    • Tobin C. Harding's avatar
      btrfs: sysfs: Fix error path kobject memory leak · b2a48467
      Tobin C. Harding authored
      commit 450ff834 upstream.
      
      If a call to kobject_init_and_add() fails we must call kobject_put()
      otherwise we leak memory.
      
      Calling kobject_put() when kobject_init_and_add() fails drops the
      refcount back to 0 and calls the ktype release method (which in turn
      calls the percpu destroy and kfree).
      
      Add call to kobject_put() in the error path of call to
      kobject_init_and_add().
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarTobin C. Harding <tobin@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2a48467
    • Filipe Manana's avatar
      Btrfs: fix race between ranged fsync and writeback of adjacent ranges · ffd658ad
      Filipe Manana authored
      commit 0c713cba upstream.
      
      When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
      inode) that happens to be ranged, which happens during a msync() or writes
      for files opened with O_SYNC for example, we can end up with a corrupt log,
      due to different file extent items representing ranges that overlap with
      each other, or hit some assertion failures.
      
      When doing a ranged fsync we only flush delalloc and wait for ordered
      exents within that range. If while we are logging items from our inode
      ordered extents for adjacent ranges complete, we end up in a race that can
      make us insert the file extent items that overlap with others we logged
      previously and the assertion failures.
      
      For example, if tree-log.c:copy_items() receives a leaf that has the
      following file extents items, all with a length of 4K and therefore there
      is an implicit hole in the range 68K to 72K - 1:
      
        (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
      
      It copies them to the log tree. However due to the need to detect implicit
      holes, it may release the path, in order to look at the previous leaf to
      detect an implicit hole, and then later it will search again in the tree
      for the first file extent item key, with the goal of locking again the
      leaf (which might have changed due to concurrent changes to other inodes).
      
      However when it locks again the leaf containing the first key, the key
      corresponding to the extent at offset 72K may not be there anymore since
      there is an ordered extent for that range that is finishing (that is,
      somewhere in the middle of btrfs_finish_ordered_io()), and it just
      removed the file extent item but has not yet replaced it with a new file
      extent item, so the part of copy_items() that does hole detection will
      decide that there is a hole in the range starting from 68K to 76K - 1,
      and therefore insert a file extent item to represent that hole, having
      a key offset of 68K. After that we now have a log tree with 2 different
      extent items that have overlapping ranges:
      
       1) The file extent item copied before copy_items() released the path,
          which has a key offset of 72K and a length of 4K, representing the
          file range 72K to 76K - 1.
      
       2) And a file extent item representing a hole that has a key offset of
          68K and a length of 8K, representing the range 68K to 76K - 1. This
          item was inserted after releasing the path, and overlaps with the
          extent item inserted before.
      
      The overlapping extent items can cause all sorts of unpredictable and
      incorrect behaviour, either when replayed or if a fast (non full) fsync
      happens later, which can trigger a BUG_ON() when calling
      btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
      trace like the following:
      
        [61666.783269] ------------[ cut here ]------------
        [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
        [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
        (...)
        [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
        [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
        [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
        [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
        [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
        [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
        [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
        [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
        [61666.786253] FS:  00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
        [61666.786253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
        [61666.786253] Call Trace:
        [61666.786253]  __btrfs_drop_extents+0x5e3/0xaad [btrfs]
        [61666.786253]  ? time_hardirqs_on+0x9/0x14
        [61666.786253]  btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
        [61666.786253]  ? release_extent_buffer+0x38/0xb4 [btrfs]
        [61666.786253]  btrfs_log_inode+0xb6e/0xcdc [btrfs]
        [61666.786253]  ? lock_acquire+0x131/0x1c5
        [61666.786253]  ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
        [61666.786253]  btrfs_log_inode_parent+0x223/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? lockref_get_not_zero+0x2c/0x34
        [61666.786253]  ? rcu_read_unlock+0x3e/0x5d
        [61666.786253]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
        [61666.786253]  btrfs_sync_file+0x317/0x42c [btrfs]
        [61666.786253]  vfs_fsync_range+0x8c/0x9e
        [61666.786253]  SyS_msync+0x13c/0x1c9
        [61666.786253]  entry_SYSCALL_64_fastpath+0x18/0xad
      
      A sample of a corrupt log tree leaf with overlapping extents I got from
      running btrfs/072:
      
            item 14 key (295 108 200704) itemoff 2599 itemsize 53
                    extent data disk bytenr 0 nr 0
                    extent data offset 0 nr 458752 ram 458752
            item 15 key (295 108 659456) itemoff 2546 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 606208 nr 163840 ram 770048
            item 16 key (295 108 663552) itemoff 2493 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 610304 nr 155648 ram 770048
            item 17 key (295 108 819200) itemoff 2440 itemsize 53
                    extent data disk bytenr 4334788608 nr 4096
                    extent data offset 0 nr 4096 ram 4096
      
      The file extent item at offset 659456 (item 15) ends at offset 823296
      (659456 + 163840) while the next file extent item (item 16) starts at
      offset 663552.
      
      Another different problem that the race can trigger is a failure in the
      assertions at tree-log.c:copy_items(), which expect that the first file
      extent item key we found before releasing the path exists after we have
      released path and that the last key we found before releasing the path
      also exists after releasing the path:
      
        $ cat -n fs/btrfs/tree-log.c
        4080          if (need_find_last_extent) {
        4081                  /* btrfs_prev_leaf could return 1 without releasing the path */
        4082                  btrfs_release_path(src_path);
        4083                  ret = btrfs_search_slot(NULL, inode->root, &first_key,
        4084                                  src_path, 0, 0);
        4085                  if (ret < 0)
        4086                          return ret;
        4087                  ASSERT(ret == 0);
        (...)
        4103                  if (i >= btrfs_header_nritems(src_path->nodes[0])) {
        4104                          ret = btrfs_next_leaf(inode->root, src_path);
        4105                          if (ret < 0)
        4106                                  return ret;
        4107                          ASSERT(ret == 0);
        4108                          src = src_path->nodes[0];
        4109                          i = 0;
        4110                          need_find_last_extent = true;
        4111                  }
        (...)
      
      The second assertion implicitly expects that the last key before the path
      release still exists, because the surrounding while loop only stops after
      we have found that key. When this assertion fails it produces a stack like
      this:
      
        [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
        [139590.037406] ------------[ cut here ]------------
        [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
        [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G        W         5.0.0-btrfs-next-46 #1
        (...)
        [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
        (...)
        [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
        [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
        [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
        [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
        [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
        [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
        [139590.042501] FS:  00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
        [139590.042847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
        [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [139590.044250] Call Trace:
        [139590.044631]  copy_items+0xa3f/0x1000 [btrfs]
        [139590.045009]  ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
        [139590.045396]  btrfs_log_inode+0x7b3/0xd70 [btrfs]
        [139590.045773]  btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
        [139590.046143]  ? do_raw_spin_unlock+0x49/0xc0
        [139590.046510]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [139590.046872]  btrfs_sync_file+0x3b6/0x440 [btrfs]
        [139590.047243]  btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
        [139590.047592]  __vfs_write+0x129/0x1c0
        [139590.047932]  vfs_write+0xc2/0x1b0
        [139590.048270]  ksys_write+0x55/0xc0
        [139590.048608]  do_syscall_64+0x60/0x1b0
        [139590.048946]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [139590.049287] RIP: 0033:0x7f2efc4be190
        (...)
        [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
        [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
        [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
        [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
        [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
        (...)
        [139590.055128] ---[ end trace 193f35d0215cdeeb ]---
      
      So fix this race between a full ranged fsync and writeback of adjacent
      ranges by flushing all delalloc and waiting for all ordered extents to
      complete before logging the inode. This is the simplest way to solve the
      problem because currently the full fsync path does not deal with ranges
      at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
      to look at adjacent ranges for hole detection. For use cases of ranged
      fsyncs this can make a few fsyncs slower but on the other hand it can
      make some following fsyncs to other ranges do less work or no need to do
      anything at all. A full fsync is rare anyway and happens only once after
      loading/creating an inode and once after less common operations such as a
      shrinking truncate.
      
      This is an issue that exists for a long time, and was often triggered by
      generic/127, because it does mmap'ed writes and msync (which triggers a
      ranged fsync). Adding support for the tree checker to detect overlapping
      extents (next patch in the series) and trigger a WARN() when such cases
      are found, and then calling btrfs_check_leaf_full() at the end of
      btrfs_insert_file_extent() made the issue much easier to detect. Running
      btrfs/072 with that change to the tree checker and making fsstress open
      files always with O_SYNC made it much easier to trigger the issue (as
      triggering it with generic/127 is very rare).
      
      CC: stable@vger.kernel.org # 3.16+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ffd658ad
    • Filipe Manana's avatar
      Btrfs: avoid fallback to transaction commit during fsync of files with holes · fb4bdda0
      Filipe Manana authored
      commit ebb92906 upstream.
      
      When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
      file that has holes and has file extent items spanning two or more leafs,
      we can end up falling to back to a full transaction commit due to a logic
      bug that leads to failure to insert a duplicate file extent item that is
      meant to represent a hole between the last file extent item of a leaf and
      the first file extent item in the next leaf. The failure (EEXIST error)
      leads to a transaction commit (as most errors when logging an inode do).
      
      For example, we have the two following leafs:
      
      Leaf N:
      
        -----------------------------------------------
        | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
        -----------------------------------------------
        The file extent item at the end of leaf N has a length of 4Kb,
        representing the file range from 64K to 68K - 1.
      
      Leaf N + 1:
      
        -----------------------------------------------
        | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
        -----------------------------------------------
        The file extent item at the first slot of leaf N + 1 has a length of
        4Kb too, representing the file range from 72K to 76K - 1.
      
      During the full fsync path, when we are at tree-log.c:copy_items() with
      leaf N as a parameter, after processing the last file extent item, that
      represents the extent at offset 64K, we take a look at the first file
      extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
      between the two extents, and therefore we insert a file extent item
      representing that hole, starting at file offset 68K and ending at offset
      72K - 1. However we don't update the value of *last_extent, which is used
      to represent the end offset (plus 1, non-inclusive end) of the last file
      extent item inserted in the log, so it stays with a value of 68K and not
      with a value of 72K.
      
      Then, when copy_items() is called for leaf N + 1, because the value of
      *last_extent is smaller then the offset of the first extent item in the
      leaf (68K < 72K), we look at the last file extent item in the previous
      leaf (leaf N) and see it there's a 4K gap between it and our first file
      extent item (again, 68K < 72K), so we decide to insert a file extent item
      representing the hole, starting at file offset 68K and ending at offset
      72K - 1, this insertion will fail with -EEXIST being returned from
      btrfs_insert_file_extent() because we already inserted a file extent item
      representing a hole for this offset (68K) in the previous call to
      copy_items(), when processing leaf N.
      
      The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
      which falls back to a full transaction commit.
      
      Fix this by adjusting *last_extent after inserting a hole when we had to
      look at the next leaf.
      
      Fixes: 4ee3fad3 ("Btrfs: fix fsync after hole punching when using no-holes feature")
      Cc: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fb4bdda0
    • Filipe Manana's avatar
      Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path · be69efb3
      Filipe Manana authored
      commit 72bd2323 upstream.
      
      Currently when we fail to COW a path at btrfs_update_root() we end up
      always aborting the transaction. However all the current callers of
      btrfs_update_root() are able to deal with errors returned from it, many do
      end up aborting the transaction themselves (directly or not, such as the
      transaction commit path), other BUG_ON() or just gracefully cancel whatever
      they were doing.
      
      When syncing the fsync log, we call btrfs_update_root() through
      tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log
      sync code does not abort the transaction, instead it gracefully handles
      the error and returns -EAGAIN to the fsync handler, so that it falls back
      to a transaction commit. Any other error different from -ENOSPC, makes the
      log sync code abort the transaction.
      
      So remove the transaction abort from btrfs_update_log() when we fail to
      COW a path to update the root item, so that if an -ENOSPC failure happens
      we avoid aborting the current transaction and have a chance of the fsync
      succeeding after falling back to a transaction commit.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413
      Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
      Cc: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      be69efb3
    • Johnny Chang's avatar
      btrfs: Check the compression level before getting a workspace · 69bb5079
      Johnny Chang authored
      commit 2b90883c upstream.
      
      When a file's compression property is set as zlib or zstd but leave
      the compression mount option not be set, that means btrfs will try
      to compress the file with default compression level. But in
      btrfs_compress_pages(), it calls get_workspace() with level = 0.
      This will return a workspace with a wrong compression level.
      For zlib, the compression level in the workspace will be 0
      (that means "store only"). And for zstd, the compression in the
      workspace will be 1, not the default level 3.
      
      How to reproduce:
        mkfs -t btrfs /dev/sdb
        mount /dev/sdb /mnt/
        mkdir /mnt/zlib
        btrfs property set /mnt/zlib/ compression zlib
        dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10
        sync
        btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M
      
      btrfs-debugfs output:
      * before:
        ...
        (258 9961472): ram 524288 disk 1106247680 disk_size 524288
        file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00
      
      * after:
       ...
       (258 10354688): ram 131072 disk 14217216 disk_size 4096
       file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00
      
      The steps for zstd are similar, but need to put a debugging message to
      show the level of the return workspace in zstd_get_workspace().
      
      This commit adds a check of the compression level before getting a
      workspace by set_level().
      
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: default avatarJohnny Chang <johnnyc@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69bb5079
    • Josef Bacik's avatar
      btrfs: don't double unlock on error in btrfs_punch_hole · 38bf3e22
      Josef Bacik authored
      commit 8fca9550 upstream.
      
      If we have an error writing out a delalloc range in
      btrfs_punch_hole_lock_range we'll unlock the inode and then goto
      out_only_mutex, where we will again unlock the inode.  This is bad,
      don't do this.
      
      Fixes: f27451f2 ("Btrfs: add support for fallocate's zero range operation")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      38bf3e22
    • Andreas Gruenbacher's avatar
      gfs2: Fix sign extension bug in gfs2_update_stats · 873aac4c
      Andreas Gruenbacher authored
      commit 5a5ec83d upstream.
      
      Commit 4d207133 changed the types of the statistic values in struct
      gfs2_lkstats from s64 to u64.  Because of that, what should be a signed
      value in gfs2_update_stats turned into an unsigned value.  When shifted
      right, we end up with a large positive value instead of a small negative
      value, which results in an incorrect variance estimate.
      
      Fixes: 4d207133 ("gfs2: Make statistics unsigned, suitable for use with do_div()")
      Signed-off-by: default avatarAndreas Gruenbacher <agruenba@redhat.com>
      Cc: stable@vger.kernel.org # v4.4+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      873aac4c
    • Christoph Hellwig's avatar
      arm64/iommu: handle non-remapped addresses in ->mmap and ->get_sgtable · b232deed
      Christoph Hellwig authored
      commit a98d9ae9 upstream.
      
      DMA allocations that can't sleep may return non-remapped addresses, but
      we do not properly handle them in the mmap and get_sgtable methods.
      Resolve non-vmalloc addresses using virt_to_page to handle this corner
      case.
      
      Cc: <stable@vger.kernel.org>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: default avatarRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b232deed
    • Will Deacon's avatar
      arm64: Kconfig: Make ARM64_PSEUDO_NMI depend on BROKEN for now · 4101ec81
      Will Deacon authored
      commit 96a13f57 upstream.
      
      Although we merged support for pseudo-nmi using interrupt priority
      masking in 5.1, we've since uncovered a number of non-trivial issues
      with the implementation. Although there are patches pending to address
      these problems, we're facing issues that prevent us from merging them at
      this current time:
      
        https://lkml.kernel.org/r/1556553607-46531-1-git-send-email-julien.thierry@arm.com
      
      For now, simply mark this optional feature as BROKEN in the hope that we
      can fix things properly in the near future.
      
      Cc: <stable@vger.kernel.org> # 5.1
      Cc: Julien Thierry <julien.thierry@arm.com>
      Acked-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4101ec81
    • Ard Biesheuvel's avatar
      arm64/kernel: kaslr: reduce module randomization range to 2 GB · 2c210489
      Ard Biesheuvel authored
      commit b2eed9b5 upstream.
      
      The following commit
      
        7290d580 ("module: use relative references for __ksymtab entries")
      
      updated the ksymtab handling of some KASLR capable architectures
      so that ksymtab entries are emitted as pairs of 32-bit relative
      references. This reduces the size of the entries, but more
      importantly, it gets rid of statically assigned absolute
      addresses, which require fixing up at boot time if the kernel
      is self relocating (which takes a 24 byte RELA entry for each
      member of the ksymtab struct).
      
      Since ksymtab entries are always part of the same module as the
      symbol they export, it was assumed at the time that a 32-bit
      relative reference is always sufficient to capture the offset
      between a ksymtab entry and its target symbol.
      
      Unfortunately, this is not always true: in the case of per-CPU
      variables, a per-CPU variable's base address (which usually differs
      from the actual address of any of its per-CPU copies) is allocated
      in the vicinity of the ..data.percpu section in the core kernel
      (i.e., in the per-CPU reserved region which follows the section
      containing the core kernel's statically allocated per-CPU variables).
      
      Since we randomize the module space over a 4 GB window covering
      the core kernel (based on the -/+ 4 GB range of an ADRP/ADD pair),
      we may end up putting the core kernel out of the -/+ 2 GB range of
      32-bit relative references of module ksymtab entries that refer to
      per-CPU variables.
      
      So reduce the module randomization range a bit further. We lose
      1 bit of randomization this way, but this is something we can
      tolerate.
      
      Cc: <stable@vger.kernel.org> # v4.19+
      Signed-off-by: default avatarArd Biesheuvel <ard.biesheuvel@arm.com>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c210489
    • Dan Williams's avatar
      libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead · e9e27bfc
      Dan Williams authored
      commit 52f476a3 upstream.
      
      Jeff discovered that performance improves from ~375K iops to ~519K iops
      on a simple psync-write fio workload when moving the location of 'struct
      page' from the default PMEM location to DRAM. This result is surprising
      because the expectation is that 'struct page' for dax is only needed for
      third party references to dax mappings. For example, a dax-mapped buffer
      passed to another system call for direct-I/O requires 'struct page' for
      sending the request down the driver stack and pinning the page. There is
      no usage of 'struct page' for first party access to a file via
      read(2)/write(2) and friends.
      
      However, this "no page needed" expectation is violated by
      CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
      copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
      check_heap_object() helper routine assumes the buffer is backed by a
      slab allocator (DRAM) page and applies some checks.  Those checks are
      invalid, dax pages do not originate from the slab, and redundant,
      dax_iomap_actor() has already validated that the I/O is within bounds.
      Specifically that routine validates that the logical file offset is
      within bounds of the file, then it does a sector-to-pfn translation
      which validates that the physical mapping is within bounds of the block
      device.
      
      Bypass additional hardened usercopy overhead and call the 'no check'
      versions of the copy_{to,from}_iter operations directly.
      
      Fixes: 0aed55af ("x86, uaccess: introduce copy_from_iter_flushcache...")
      Cc: <stable@vger.kernel.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Reported-and-tested-by: default avatarJeff Smits <jeff.smits@intel.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e9e27bfc
    • Wanpeng Li's avatar
      KVM: nVMX: Fix using __this_cpu_read() in preemptible context · e3feb4af
      Wanpeng Li authored
      commit 541e886f upstream.
      
       BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/4590
        caller is nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
        CPU: 4 PID: 4590 Comm: qemu-system-x86 Tainted: G           OE     5.1.0-rc4+ #1
        Call Trace:
         dump_stack+0x67/0x95
         __this_cpu_preempt_check+0xd2/0xe0
         nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
         nested_vmx_run+0xda/0x2b0 [kvm_intel]
         handle_vmlaunch+0x13/0x20 [kvm_intel]
         vmx_handle_exit+0xbd/0x660 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0xa2c/0x1e50 [kvm]
         kvm_vcpu_ioctl+0x3ad/0x6d0 [kvm]
         do_vfs_ioctl+0xa5/0x6e0
         ksys_ioctl+0x6d/0x80
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x6f/0x6c0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Accessing per-cpu variable should disable preemption, this patch extends the
      preemption disable region for __this_cpu_read().
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Fixes: 52017608 ("KVM: nVMX: add option to perform early consistency checks via H/W")
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e3feb4af
    • Suthikulpanit, Suravee's avatar
      kvm: svm/avic: fix off-by-one in checking host APIC ID · 4a4c222e
      Suthikulpanit, Suravee authored
      commit c9bcd3e3 upstream.
      
      Current logic does not allow VCPU to be loaded onto CPU with
      APIC ID 255. This should be allowed since the host physical APIC ID
      field in the AVIC Physical APIC table entry is an 8-bit value,
      and APIC ID 255 is valid in system with x2APIC enabled.
      Instead, do not allow VCPU load if the host APIC ID cannot be
      represented by an 8-bit value.
      
      Also, use the more appropriate AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK
      instead of AVIC_MAX_PHYSICAL_ID_COUNT.
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4a4c222e
    • Peter Xu's avatar
      kvm: Check irqchip mode before assign irqfd · baaee956
      Peter Xu authored
      commit 654f1f13 upstream.
      
      When assigning kvm irqfd we didn't check the irqchip mode but we allow
      KVM_IRQFD to succeed with all the irqchip modes.  However it does not
      make much sense to create irqfd even without the kernel chips.  Let's
      provide a arch-dependent helper to check whether a specific irqfd is
      allowed by the arch.  At least for x86, it should make sense to check:
      
      - when irqchip mode is NONE, all irqfds should be disallowed, and,
      
      - when irqchip mode is SPLIT, irqfds that are with resamplefd should
        be disallowed.
      
      For either of the case, previously we'll silently ignore the irq or
      the irq ack event if the irqchip mode is incorrect.  However that can
      cause misterious guest behaviors and it can be hard to triage.  Let's
      fail KVM_IRQFD even earlier to detect these incorrect configurations.
      
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Radim Krčmář <rkrcmar@redhat.com>
      CC: Alex Williamson <alex.williamson@redhat.com>
      CC: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      baaee956
    • Dan Williams's avatar
      dax: Arrange for dax_supported check to span multiple devices · e00303be
      Dan Williams authored
      commit 7bf7eac8 upstream.
      
      Pankaj reports that starting with commit ad428cdb "dax: Check the
      end of the block-device capacity with dax_direct_access()" device-mapper
      no longer allows dax operation. This results from the stricter checks in
      __bdev_dax_supported() that validate that the start and end of a
      block-device map to the same 'pagemap' instance.
      
      Teach the dax-core and device-mapper to validate the 'pagemap' on a
      per-target basis. This is accomplished by refactoring the
      bdev_dax_supported() internals into generic_fsdax_supported() which
      takes a sector range to validate. Consequently generic_fsdax_supported()
      is suitable to be used in a device-mapper ->iterate_devices() callback.
      A new ->dax_supported() operation is added to allow composite devices to
      split and route upper-level bdev_dax_supported() requests.
      
      Fixes: ad428cdb ("dax: Check the end of the block-device...")
      Cc: <stable@vger.kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reported-by: default avatarPankaj Gupta <pagupta@redhat.com>
      Reviewed-by: default avatarPankaj Gupta <pagupta@redhat.com>
      Tested-by: default avatarPankaj Gupta <pagupta@redhat.com>
      Tested-by: default avatarVaibhav Jain <vaibhav@linux.ibm.com>
      Reviewed-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e00303be
    • Tom Zanussi's avatar
      tracing: Add a check_val() check before updating cond_snapshot() track_val · 269360f1
      Tom Zanussi authored
      commit 9b2ca371 upstream.
      
      Without this check a snapshot is taken whenever a bucket's max is hit,
      rather than only when the global max is hit, as it should be.
      
      Before:
      
        In this example, we do a first run of the workload (cyclictest),
        examine the output, note the max ('triggering value') (347), then do
        a second run and note the max again.
      
        In this case, the max in the second run (39) is below the max in the
        first run, but since we haven't cleared the histogram, the first max
        is still in the histogram and is higher than any other max, so it
        should still be the max for the snapshot.  It isn't however - the
        value should still be 347 after the second run.
      
        # echo 'hist:keys=pid:ts0=common_timestamp.usecs if comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
        # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_prio,next_comm,prev_pid,prev_prio,prev_comm):onmax($wakeup_lat).snapshot() if next_comm=="cyclictest"' >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
      
        # cyclictest -p 80 -n -s -t 2 -D 2
      
        # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
      
        { next_pid:       2143 } hitcount:        199
          max:         44  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/4
      
        { next_pid:       2145 } hitcount:       1325
          max:         38  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/2
      
        { next_pid:       2144 } hitcount:       1982
          max:        347  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/6
      
        Snapshot taken (see tracing/snapshot).  Details:
            triggering value { onmax($wakeup_lat) }:        347
            triggered by event with key: { next_pid:       2144 }
      
        # cyclictest -p 80 -n -s -t 2 -D 2
      
        # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
      
        { next_pid:       2143 } hitcount:        199
          max:         44  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/4
      
        { next_pid:       2148 } hitcount:        199
          max:         16  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/1
      
        { next_pid:       2145 } hitcount:       1325
          max:         38  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/2
      
        { next_pid:       2150 } hitcount:       1326
          max:         39  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/4
      
        { next_pid:       2144 } hitcount:       1982
          max:        347  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/6
      
        { next_pid:       2149 } hitcount:       1983
          max:        130  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/0
      
        Snapshot taken (see tracing/snapshot).  Details:
          triggering value { onmax($wakeup_lat) }:    39
          triggered by event with key: { next_pid:       2150 }
      
      After:
      
        In this example, we do a first run of the workload (cyclictest),
        examine the output, note the max ('triggering value') (375), then do
        a second run and note the max again.
      
        In this case, the max in the second run is still 375, the highest in
        any bucket, as it should be.
      
        # cyclictest -p 80 -n -s -t 2 -D 2
      
        # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
      
        { next_pid:       2072 } hitcount:        200
          max:         28  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/5
      
        { next_pid:       2074 } hitcount:       1323
          max:        375  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/2
      
        { next_pid:       2073 } hitcount:       1980
          max:        153  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/6
      
        Snapshot taken (see tracing/snapshot).  Details:
          triggering value { onmax($wakeup_lat) }:        375
          triggered by event with key: { next_pid:       2074 }
      
        # cyclictest -p 80 -n -s -t 2 -D 2
      
        # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
      
        { next_pid:       2101 } hitcount:        199
          max:         49  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/6
      
        { next_pid:       2072 } hitcount:        200
          max:         28  next_prio:        120  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/5
      
        { next_pid:       2074 } hitcount:       1323
          max:        375  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/2
      
        { next_pid:       2103 } hitcount:       1325
          max:         74  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/4
      
        { next_pid:       2073 } hitcount:       1980
          max:        153  next_prio:         19  next_comm: cyclictest
          prev_pid:          0  prev_prio:        120  prev_comm: swapper/6
      
        { next_pid:       2102 } hitcount:       1981
          max:         84  next_prio:         19  next_comm: cyclictest
          prev_pid:         12  prev_prio:        120  prev_comm: kworker/0:1
      
        Snapshot taken (see tracing/snapshot).  Details:
          triggering value { onmax($wakeup_lat) }:        375
          triggered by event with key: { next_pid:       2074 }
      
      Link: http://lkml.kernel.org/r/95958351329f129c07504b4d1769c47a97b70d65.1555597045.git.tom.zanussi@linux.intel.com
      
      Cc: stable@vger.kernel.org
      Fixes: a3785b7e ("tracing: Add hist trigger snapshot() action")
      Signed-off-by: default avatarTom Zanussi <tom.zanussi@linux.intel.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      269360f1
    • Trac Hoang's avatar
      mmc: sdhci-iproc: Set NO_HISPD bit to fix HS50 data hold time problem · acf49fa4
      Trac Hoang authored
      commit ec0970e0 upstream.
      
      The iproc host eMMC/SD controller hold time does not meet the
      specification in the HS50 mode.  This problem can be mitigated
      by disabling the HISPD bit; thus forcing the controller output
      data to be driven on the falling clock edges rather than the
      rising clock edges.
      
      Stable tag (v4.12+) chosen to assist stable kernel maintainers so that
      the change does not produce merge conflicts backporting to older kernel
      versions. In reality, the timing bug existed since the driver was first
      introduced but there is no need for this driver to be supported in kernel
      versions that old.
      
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: default avatarTrac Hoang <trac.hoang@broadcom.com>
      Signed-off-by: default avatarScott Branden <scott.branden@broadcom.com>
      Acked-by: default avatarAdrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      acf49fa4
    • Trac Hoang's avatar
      mmc: sdhci-iproc: cygnus: Set NO_HISPD bit to fix HS50 data hold time problem · a0514c0a
      Trac Hoang authored
      commit b7dfa695 upstream.
      
      The iproc host eMMC/SD controller hold time does not meet the
      specification in the HS50 mode. This problem can be mitigated
      by disabling the HISPD bit; thus forcing the controller output
      data to be driven on the falling clock edges rather than the
      rising clock edges.
      
      This change applies only to the Cygnus platform.
      
      Stable tag (v4.12+) chosen to assist stable kernel maintainers so that
      the change does not produce merge conflicts backporting to older kernel
      versions. In reality, the timing bug existed since the driver was first
      introduced but there is no need for this driver to be supported in kernel
      versions that old.
      
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: default avatarTrac Hoang <trac.hoang@broadcom.com>
      Signed-off-by: default avatarScott Branden <scott.branden@broadcom.com>
      Acked-by: default avatarAdrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a0514c0a
    • Daniel Axtens's avatar
      crypto: vmx - CTR: always increment IV as quadword · 1860a557
      Daniel Axtens authored
      commit 009b30ac upstream.
      
      The kernel self-tests picked up an issue with CTR mode:
      alg: skcipher: p8_aes_ctr encryption test failed (wrong result) on test vector 3, cfg="uneven misaligned splits, may sleep"
      
      Test vector 3 has an IV of FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFD, so
      after 3 increments it should wrap around to 0.
      
      In the aesp8-ppc code from OpenSSL, there are two paths that
      increment IVs: the bulk (8 at a time) path, and the individual
      path which is used when there are fewer than 8 AES blocks to
      process.
      
      In the bulk path, the IV is incremented with vadduqm: "Vector
      Add Unsigned Quadword Modulo", which does 128-bit addition.
      
      In the individual path, however, the IV is incremented with
      vadduwm: "Vector Add Unsigned Word Modulo", which instead
      does 4 32-bit additions. Thus the IV would instead become
      FFFFFFFFFFFFFFFFFFFFFFFF00000000, throwing off the result.
      
      Use vadduqm.
      
      This was probably a typo originally, what with q and w being
      adjacent. It is a pretty narrow edge case: I am really
      impressed by the quality of the kernel self-tests!
      
      Fixes: 5c380d62 ("crypto: vmx - Add support for VMS instructions by ASM")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Acked-by: default avatarNayna Jain <nayna@linux.ibm.com>
      Tested-by: default avatarNayna Jain <nayna@linux.ibm.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1860a557
    • Eric Biggers's avatar
      crypto: hash - fix incorrect HASH_MAX_DESCSIZE · 6920fcd3
      Eric Biggers authored
      commit e1354400 upstream.
      
      The "hmac(sha3-224-generic)" algorithm has a descsize of 368 bytes,
      which is greater than HASH_MAX_DESCSIZE (360) which is only enough for
      sha3-224-generic.  The check in shash_prepare_alg() doesn't catch this
      because the HMAC template doesn't set descsize on the algorithms, but
      rather sets it on each individual HMAC transform.
      
      This causes a stack buffer overflow when SHASH_DESC_ON_STACK() is used
      with hmac(sha3-224-generic).
      
      Fix it by increasing HASH_MAX_DESCSIZE to the real maximum.  Also add a
      sanity check to hmac_init().
      
      This was detected by the improved crypto self-tests in v5.2, by loading
      the tcrypt module with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y enabled.  I
      didn't notice this bug when I ran the self-tests by requesting the
      algorithms via AF_ALG (i.e., not using tcrypt), probably because the
      stack layout differs in the two cases and that made a difference here.
      
      KASAN report:
      
          BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:359 [inline]
          BUG: KASAN: stack-out-of-bounds in shash_default_import+0x52/0x80 crypto/shash.c:223
          Write of size 360 at addr ffff8880651defc8 by task insmod/3689
      
          CPU: 2 PID: 3689 Comm: insmod Tainted: G            E     5.1.0-10741-g35c99ffa #11
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
          Call Trace:
           __dump_stack lib/dump_stack.c:77 [inline]
           dump_stack+0x86/0xc5 lib/dump_stack.c:113
           print_address_description+0x7f/0x260 mm/kasan/report.c:188
           __kasan_report+0x144/0x187 mm/kasan/report.c:317
           kasan_report+0x12/0x20 mm/kasan/common.c:614
           check_memory_region_inline mm/kasan/generic.c:185 [inline]
           check_memory_region+0x137/0x190 mm/kasan/generic.c:191
           memcpy+0x37/0x50 mm/kasan/common.c:125
           memcpy include/linux/string.h:359 [inline]
           shash_default_import+0x52/0x80 crypto/shash.c:223
           crypto_shash_import include/crypto/hash.h:880 [inline]
           hmac_import+0x184/0x240 crypto/hmac.c:102
           hmac_init+0x96/0xc0 crypto/hmac.c:107
           crypto_shash_init include/crypto/hash.h:902 [inline]
           shash_digest_unaligned+0x9f/0xf0 crypto/shash.c:194
           crypto_shash_digest+0xe9/0x1b0 crypto/shash.c:211
           generate_random_hash_testvec.constprop.11+0x1ec/0x5b0 crypto/testmgr.c:1331
           test_hash_vs_generic_impl+0x3f7/0x5c0 crypto/testmgr.c:1420
           __alg_test_hash+0x26d/0x340 crypto/testmgr.c:1502
           alg_test_hash+0x22e/0x330 crypto/testmgr.c:1552
           alg_test.part.7+0x132/0x610 crypto/testmgr.c:4931
           alg_test+0x1f/0x40 crypto/testmgr.c:4952
      
      Fixes: b68a7ec1 ("crypto: hash - Remove VLA usage")
      Reported-by: default avatarCorentin Labbe <clabbe.montjoie@gmail.com>
      Cc: <stable@vger.kernel.org> # v4.20+
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Tested-by: default avatarCorentin Labbe <clabbe.montjoie@gmail.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6920fcd3
    • Martin K. Petersen's avatar
      Revert "scsi: sd: Keep disk read-only when re-reading partition" · 204d5350
      Martin K. Petersen authored
      commit 8acf608e upstream.
      
      This reverts commit 20bd1d02.
      
      This patch introduced regressions for devices that come online in
      read-only state and subsequently switch to read-write.
      
      Given how the partition code is currently implemented it is not
      possible to persist the read-only flag across a device revalidate
      call. This may need to get addressed in the future since it is common
      for user applications to proactively call BLKRRPART.
      
      Reverting this commit will re-introduce a regression where a
      device-initiated revalidate event will cause the admin state to be
      forgotten. A separate patch will address this issue.
      
      Fixes: 20bd1d02 ("scsi: sd: Keep disk read-only when re-reading partition")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      204d5350
    • Andrea Parri's avatar
      sbitmap: fix improper use of smp_mb__before_atomic() · 15e5e4b9
      Andrea Parri authored
      commit a0934fd2 upstream.
      
      This barrier only applies to the read-modify-write operations; in
      particular, it does not apply to the atomic_set() primitive.
      
      Replace the barrier with an smp_mb().
      
      Fixes: 6c0ca7ae ("sbitmap: fix wakeup hang after sbq resize")
      Cc: stable@vger.kernel.org
      Reported-by: default avatar"Paul E. McKenney" <paulmck@linux.ibm.com>
      Reported-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: linux-block@vger.kernel.org
      Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      15e5e4b9
    • Andrea Parri's avatar
      bio: fix improper use of smp_mb__before_atomic() · 01b5e7f8
      Andrea Parri authored
      commit f381c6a4 upstream.
      
      This barrier only applies to the read-modify-write operations; in
      particular, it does not apply to the atomic_set() primitive.
      
      Replace the barrier with an smp_mb().
      
      Fixes: dac56212 ("bio: skip atomic inc/dec of ->bi_cnt for most use cases")
      Cc: stable@vger.kernel.org
      Reported-by: default avatar"Paul E. McKenney" <paulmck@linux.ibm.com>
      Reported-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarAndrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: linux-block@vger.kernel.org
      Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      01b5e7f8
    • Borislav Petkov's avatar
      x86/kvm/pmu: Set AMD's virt PMU version to 1 · e83d85e7
      Borislav Petkov authored
      commit a80c4ec1 upstream.
      
      After commit:
      
        672ff6cf ("KVM: x86: Raise #GP when guest vCPU do not support PMU")
      
      my AMD guests started #GPing like this:
      
        general protection fault: 0000 [#1] PREEMPT SMP
        CPU: 1 PID: 4355 Comm: bash Not tainted 5.1.0-rc6+ #3
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        RIP: 0010:x86_perf_event_update+0x3b/0xa0
      
      with Code: pointing to RDPMC. It is RDPMC because the guest has the
      hardware watchdog CONFIG_HARDLOCKUP_DETECTOR_PERF enabled which uses
      perf. Instrumenting kvm_pmu_rdpmc() some, showed that it fails due to:
      
        if (!pmu->version)
        	return 1;
      
      which the above commit added. Since AMD's PMU leaves the version at 0,
      that causes the #GP injection into the guest.
      
      Set pmu->version arbitrarily to 1 and move it above the non-applicable
      struct kvm_pmu members.
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Cc: kvm@vger.kernel.org
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Mihai Carabas <mihai.carabas@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 672ff6cf ("KVM: x86: Raise #GP when guest vCPU do not support PMU")
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e83d85e7
    • Paolo Bonzini's avatar
      KVM: x86: fix return value for reserved EFER · d7a74fba
      Paolo Bonzini authored
      commit 66f61c92 upstream.
      
      Commit 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for
      host-initiated writes", 2019-04-02) introduced a "return false" in a
      function returning int, and anyway set_efer has a "nonzero on error"
      conventon so it should be returning 1.
      Reported-by: default avatarPavel Machek <pavel@denx.de>
      Fixes: 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d7a74fba
    • Jan Kara's avatar
      ext4: wait for outstanding dio during truncate in nojournal mode · 824adfb2
      Jan Kara authored
      commit 82a25b02 upstream.
      
      We didn't wait for outstanding direct IO during truncate in nojournal
      mode (as we skip orphan handling in that case). This can lead to fs
      corruption or stale data exposure if truncate ends up freeing blocks
      and these get reallocated before direct IO finishes. Fix the condition
      determining whether the wait is necessary.
      
      CC: stable@vger.kernel.org
      Fixes: 1c9114f9 ("ext4: serialize unlocked dio reads with truncate")
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      824adfb2
    • Jan Kara's avatar
      ext4: do not delete unlinked inode from orphan list on failed truncate · 5f2e67d3
      Jan Kara authored
      commit ee0ed02c upstream.
      
      It is possible that unlinked inode enters ext4_setattr() (e.g. if
      somebody calls ftruncate(2) on unlinked but still open file). In such
      case we should not delete the inode from the orphan list if truncate
      fails. Note that this is mostly a theoretical concern as filesystem is
      corrupted if we reach this path anyway but let's be consistent in our
      orphan handling.
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5f2e67d3
    • Steven Rostedt (VMware)'s avatar
      x86: Hide the int3_emulate_call/jmp functions from UML · 680ae6ba
      Steven Rostedt (VMware) authored
      commit 693713cb upstream.
      
      User Mode Linux does not have access to the ip or sp fields of the pt_regs,
      and accessing them causes UML to fail to build. Hide the int3_emulate_jmp()
      and int3_emulate_call() instructions from UML, as it doesn't need them
      anyway.
      Reported-by: default avatarkbuild test robot <lkp@intel.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      680ae6ba
  2. 25 May, 2019 8 commits
    • Greg Kroah-Hartman's avatar
      Linux 5.1.5 · 83536593
      Greg Kroah-Hartman authored
      83536593
    • Yifeng Li's avatar
      fbdev: sm712fb: fix memory frequency by avoiding a switch/case fallthrough · fa416e2b
      Yifeng Li authored
      commit 9dc20113 upstream.
      
      A fallthrough in switch/case was introduced in f627caf5 ("fbdev:
      sm712fb: fix crashes and garbled display during DPMS modesetting"),
      due to my copy-paste error, which would cause the memory clock frequency
      for SM720 to be programmed to SM712.
      
      Since it only reprograms the clock to a different frequency, it's only
      a benign issue without visible side-effect, so it also evaded Sudip
      Mukherjee's code review and regression tests. scripts/checkpatch.pl
      also failed to discover the issue, possibly due to nested switch
      statements.
      
      This issue was found by Stephen Rothwell by building linux-next with
      -Wimplicit-fallthrough.
      Reported-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Fixes: f627caf5 ("fbdev: sm712fb: fix crashes and garbled display during DPMS modesetting")
      Signed-off-by: default avatarYifeng Li <tomli@tomli.me>
      Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fa416e2b
    • Adam Ford's avatar
      ARM: dts: imx6q-logicpd: Reduce inrush current on start · bf5ccf41
      Adam Ford authored
      commit dbb58e29 upstream.
      
      The main 3.3V regulator sources a series of additional regulators.
      This patch adds a small delay, so when the 3.3V regulator comes
      on it delays a bit before the subsequent regulators can come on.
      This reduces the inrush current a bit on the external DC power
      supply to help prevent a situation where the sourcing power supply
      cannot source enough current and overloads and the kit fails to
      start.
      
      Fixes: 1c207f91 ("ARM: dts: imx: Add support for Logic PD i.MX6QD EVM")
      Signed-off-by: default avatarAdam Ford <aford173@gmail.com>
      Signed-off-by: default avatarShawn Guo <shawnguo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf5ccf41
    • Adam Ford's avatar
      ARM: dts: imx6q-logicpd: Reduce inrush current on USBH1 · 5397a98f
      Adam Ford authored
      commit 7aedca87 upstream.
      
      Some USB peripherals draw more power, and the sourcing regulator
      take a little time to turn on.  This patch fixes an issue where
      some devices occasionally do not get detected, because the power
      isn't quite ready when communication starts, so we add a bit
      of a delay.
      
      Fixes: 1c207f91 ("ARM: dts: imx: Add support for Logic PD i.MX6QD EVM")
      Signed-off-by: default avatarAdam Ford <aford173@gmail.com>
      Signed-off-by: default avatarShawn Guo <shawnguo@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5397a98f
    • Qu Wenruo's avatar
      btrfs: reloc: Fix NULL pointer dereference due to expanded reloc_root lifespan · a28d37d7
      Qu Wenruo authored
      commit 10995c04 upstream.
      
      Commit d2311e69 ("btrfs: relocation: Delay reloc tree deletion after
      merge_reloc_roots()") expands the life span of root->reloc_root.
      
      This breaks certain checs of fs_info->reloc_ctl.  Before that commit, if
      we have a root with valid reloc_root, then it's ensured to have
      fs_info->reloc_ctl.
      
      But now since reloc_root doesn't always mean a valid fs_info->reloc_ctl,
      such check is unreliable and can cause the following NULL pointer
      dereference:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000005c1
        IP: btrfs_reloc_pre_snapshot+0x20/0x50 [btrfs]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 0 PID: 10379 Comm: snapperd Not tainted
        Call Trace:
         create_pending_snapshot+0xd7/0xfc0 [btrfs]
         create_pending_snapshots+0x8e/0xb0 [btrfs]
         btrfs_commit_transaction+0x2ac/0x8f0 [btrfs]
         btrfs_mksubvol+0x561/0x570 [btrfs]
         btrfs_ioctl_snap_create_transid+0x189/0x190 [btrfs]
         btrfs_ioctl_snap_create_v2+0x102/0x150 [btrfs]
         btrfs_ioctl+0x5c9/0x1e60 [btrfs]
         do_vfs_ioctl+0x90/0x5f0
         SyS_ioctl+0x74/0x80
         do_syscall_64+0x7b/0x150
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
        RIP: 0033:0x7fd7cdab8467
      
      Fix it by explicitly checking fs_info->reloc_ctl other than using the
      implied root->reloc_root.
      
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a28d37d7
    • Arnd Bergmann's avatar
      y2038: Make CONFIG_64BIT_TIME unconditional · 9e195a42
      Arnd Bergmann authored
      commit f3d96467 upstream.
      
      As Stepan Golosunov points out, there is a small mistake in the
      get_timespec64() function in the kernel. It was originally added under the
      assumption that CONFIG_64BIT_TIME would get enabled on all 32-bit and
      64-bit architectures, but when the conversion was done, it was only turned
      on for 32-bit ones.
      
      The effect is that the get_timespec64() function never clears the upper
      half of the tv_nsec field for 32-bit tasks in compat mode. Clearing this is
      required for POSIX compliant behavior of functions that pass a 'timespec'
      structure with a 64-bit tv_sec and a 32-bit tv_nsec, plus uninitialized
      padding.
      
      The easiest fix for linux-5.1 is to just make the Kconfig symbol
      unconditional, as it was originally intended. As a follow-up, the #ifdef
      CONFIG_64BIT_TIME can be removed completely..
      
      Note: for native 32-bit mode, no change is needed, this works as
      designed and user space should never need to clear the upper 32
      bits of the tv_nsec field, in or out of the kernel.
      
      Fixes: 00bf25d6 ("y2038: use time32 syscall names on 32-bit")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Joseph Myers <joseph@codesourcery.com>
      Cc: libc-alpha@sourceware.org
      Cc: linux-api@vger.kernel.org
      Cc: Deepa Dinamani <deepa.kernel@gmail.com>
      Cc: Lukasz Majewski <lukma@denx.de>
      Cc: Stepan Golosunov <stepan@golosunov.pp.ru>
      Link: https://lore.kernel.org/lkml/20190422090710.bmxdhhankurhafxq@sghpc.golosunov.pp.ru/
      Link: https://lkml.kernel.org/r/20190429131951.471701-1-arnd@arndb.deSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9e195a42
    • Daniel Borkmann's avatar
      bpf, lru: avoid messing with eviction heuristics upon syscall lookup · 9503419a
      Daniel Borkmann authored
      commit 50b045a8 upstream.
      
      One of the biggest issues we face right now with picking LRU map over
      regular hash table is that a map walk out of user space, for example,
      to just dump the existing entries or to remove certain ones, will
      completely mess up LRU eviction heuristics and wrong entries such
      as just created ones will get evicted instead. The reason for this
      is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
      system call lookup side as well. Thus upon walk, all entries are
      being marked, so information of actual least recently used ones
      are "lost".
      
      In case of Cilium where it can be used (besides others) as a BPF
      based connection tracker, this current behavior causes disruption
      upon control plane changes that need to walk the map from user space
      to evict certain entries. Discussion result from bpfconf [0] was that
      we should simply just remove marking from system call side as no
      good use case could be found where it's actually needed there.
      Therefore this patch removes marking for regular LRU and per-CPU
      flavor. If there ever should be a need in future, the behavior could
      be selected via map creation flag, but due to mentioned reason we
      avoid this here.
      
        [0] http://vger.kernel.org/bpfconf.html
      
      Fixes: 29ba732a ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
      Fixes: 8f844938 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9503419a
    • Daniel Borkmann's avatar
      bpf: add map_lookup_elem_sys_only for lookups from syscall side · 45b56138
      Daniel Borkmann authored
      commit c6110222 upstream.
      
      Add a callback map_lookup_elem_sys_only() that map implementations
      could use over map_lookup_elem() from system call side in case the
      map implementation needs to handle the latter differently than from
      the BPF data path. If map_lookup_elem_sys_only() is set, this will
      be preferred pick for map lookups out of user space. This hook is
      used in a follow-up fix for LRU map, but once development window
      opens, we can convert other map types from map_lookup_elem() (here,
      the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
      the callback to simplify and clean up the latter.
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      45b56138