1. 13 May, 2016 12 commits
    • Filipe Manana's avatar
      Btrfs: fix race between fsync and direct IO writes for prealloc extents · 0b901916
      Filipe Manana authored
      When we do a direct IO write against a preallocated extent (fallocate)
      that does not go beyond the i_size of the inode, we do the write operation
      without holding the inode's i_mutex (an optimization that landed in
      commit 38851cc1 ("Btrfs: implement unlocked dio write")). This allows
      for a very tiny time window where a race can happen with a concurrent
      fsync using the fast code path, as the direct IO write path creates first
      a new extent map (no longer flagged as a prealloc extent) and then it
      creates the ordered extent, while the fast fsync path first collects
      ordered extents and then it collects extent maps. This allows for the
      possibility of the fast fsync path to collect the new extent map without
      collecting the new ordered extent, and therefore logging an extent item
      based on the extent map without waiting for the ordered extent to be
      created and complete. This can result in a situation where after a log
      replay we end up with an extent not marked anymore as prealloc but it was
      only partially written (or not written at all), exposing random, stale or
      garbage data corresponding to the unwritten pages and without any
      checksums in the csum tree covering the extent's range.
      
      This is an extension of what was done in commit de0ee0ed ("Btrfs: fix
      race between fsync and lockless direct IO writes").
      
      So fix this by creating first the ordered extent and then the extent
      map, so that this way if the fast fsync patch collects the new extent
      map it also collects the corresponding ordered extent.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      0b901916
    • Filipe Manana's avatar
      Btrfs: fix number of transaction units for renames with whiteout · 5062af35
      Filipe Manana authored
      When we do a rename with the whiteout flag, we need to create the whiteout
      inode, which in the worst case requires 5 transaction units (1 inode item,
      1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the
      number of transaction units from 11 to 16 if the whiteout flag is set.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      5062af35
    • Filipe Manana's avatar
      Btrfs: pin logs earlier when doing a rename exchange operation · 376e5a57
      Filipe Manana authored
      The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(),
      which had a race fixed by my previous patch titled "Btrfs: pin log earlier
      when renaming", and so it suffers from the same problem.
      
      We pin the logs of the affected roots after we insert the new inode
      references, leaving a time window where concurrent tasks logging the
      inodes can end up logging both the new and old references, resulting
      in log trees that when replayed can turn the metadata into inconsistent
      states. This behaviour was added to btrfs_rename() in 2009 without any
      explanation about why not pinning the logs earlier, just leaving a
      comment about the posibility for the race. As of today it's perfectly
      safe and sane to pin the logs before we start doing any of the steps
      involved in the rename operation.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      376e5a57
    • Filipe Manana's avatar
      Btrfs: unpin logs if rename exchange operation fails · 86e8aa0e
      Filipe Manana authored
      If rename exchange operations fail at some point after we pinned any of
      the logs, we end up aborting the current transaction but never unpin the
      logs, which leaves concurrent tasks that are trying to sync the logs (as
      part of an fsync request from user space) blocked forever and preventing
      the filesystem from being unmountable.
      
      Fix this by safely unpinning the log.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      86e8aa0e
    • Filipe Manana's avatar
      Btrfs: fix inode leak on failure to setup whiteout inode in rename · c9901618
      Filipe Manana authored
      If we failed to fully setup the whiteout inode during a rename operation
      with the whiteout flag, we ended up leaking the inode, not decrementing
      its link count nor removing all its items from the fs/subvol tree.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      c9901618
    • Dan Fuhry's avatar
      btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT · cdd1fedf
      Dan Fuhry authored
      Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new
      behavior in the renameat2() syscall. This behavior is primarily used by
      overlayfs. This patch adds support for these flags to btrfs, enabling it to
      be used as a fully functional upper layer for overlayfs.
      
      RENAME_EXCHANGE support was written by Davide Italiano originally
      submitted on 2 April 2015.
      Signed-off-by: default avatarDavide Italiano <dccitaliano@gmail.com>
      Signed-off-by: default avatarDan Fuhry <dfuhry@datto.com>
      [ remove unlikely ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      cdd1fedf
    • Filipe Manana's avatar
      Btrfs: pin log earlier when renaming · c4aba954
      Filipe Manana authored
      We were pinning the log right after the first step in the rename operation
      (inserting inode ref for the new name in the destination directory)
      instead of doing it before. This behaviour was introduced in 2009 for some
      reason that was not mentioned neither on the changelog nor any comment,
      with the drawback of a small time window where concurrent log writers can
      end up logging the new inode reference for the inode we are renaming while
      the rename operation is in progress (so that we can end up with a log
      containing both the new and old references). As of today there's no reason
      to not pin the log before that first step anymore, so just fix this.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      c4aba954
    • Filipe Manana's avatar
      Btrfs: unpin log if rename operation fails · 3dc9e8f7
      Filipe Manana authored
      If rename operations fail at some point after we pinned the log, we end
      up aborting the current transaction but never unpin the log, which leaves
      concurrent tasks that are trying to sync the log (as part of an fsync
      request from user space) blocked forever and preventing the filesystem
      from being unmountable.
      
      Fix this by safely unpinning the log.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      3dc9e8f7
    • Filipe Manana's avatar
      Btrfs: don't do unnecessary delalloc flushes when relocating · 9cfa3e34
      Filipe Manana authored
      Before we start the actual relocation process of a block group, we do
      calls to flush delalloc of all inodes and then wait for ordered extents
      to complete. However we do these flush calls just to make sure we don't
      race with concurrent tasks that have actually already started to run
      delalloc and have allocated an extent from the block group we want to
      relocate, right before we set it to readonly mode, but have not yet
      created the respective ordered extents. The flush calls make us wait
      for such concurrent tasks because they end up calling
      filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
      __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
      btrfs_run_delalloc_work()) which ends up serializing us with those tasks
      due to attempts to lock the same pages (and the delalloc flush procedure
      calls the allocator and creates the ordered extents before unlocking the
      pages).
      
      These flushing calls not only make us waste time (cpu, IO) but also reduce
      the chances of writing larger extents (applications might be writing to
      contiguous ranges and we flush before they finish dirtying the whole
      ranges).
      
      So make sure we don't flush delalloc and just wait for concurrent tasks
      that have already started flushing delalloc and have allocated an extent
      from the block group we are about to relocate.
      
      This change also ends up fixing a race with direct IO writes that makes
      relocation not wait for direct IO ordered extents. This race is
      illustrated by the following diagram:
      
              CPU 1                                       CPU 2
      
       btrfs_relocate_block_group(bg X)
      
                                                 starts direct IO write,
                                                 target inode currently has no
                                                 ordered extents ongoing nor
                                                 dirty pages (delalloc regions),
                                                 therefore the root for our inode
                                                 is not in the list
                                                 fs_info->ordered_roots
      
                                                 btrfs_direct_IO()
                                                   __blockdev_direct_IO()
                                                     btrfs_get_blocks_direct()
                                                       btrfs_lock_extent_direct()
                                                         locks range in the io tree
                                                       btrfs_new_extent_direct()
                                                         btrfs_reserve_extent()
                                                           --> extent allocated
                                                               from bg X
      
         btrfs_inc_block_group_ro(bg X)
      
         btrfs_start_delalloc_roots()
           __start_delalloc_inodes()
             --> does nothing, no dealloc ranges
                 in the inode's io tree so the
                 inode's root is not in the list
                 fs_info->delalloc_roots
      
         btrfs_wait_ordered_roots()
           --> does not find the inode's root in the
               list fs_info->ordered_roots
      
           --> ends up not waiting for the direct IO
               write started by the task at CPU 2
      
         relocate_block_group(rc->stage ==
           MOVE_DATA_EXTENTS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
      
           iterates the extent tree, using its
           commit root and moves extents into new
           locations
      
                                                         btrfs_add_ordered_extent_dio()
                                                           --> now a ordered extent is
                                                               created and added to the
                                                               list root->ordered_extents
                                                               and the root added to the
                                                               list fs_info->ordered_roots
                                                           --> this is too late and the
                                                               task at CPU 1 already
                                                               started the relocation
      
           btrfs_commit_transaction()
      
                                                         btrfs_finish_ordered_io()
                                                           btrfs_alloc_reserved_file_extent()
                                                             --> adds delayed data reference
                                                                 for the extent allocated
                                                                 from bg X
      
         relocate_block_group(rc->stage ==
           UPDATE_DATA_PTRS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
               --> delayed refs are run, so an extent
                   item for the allocated extent from
                   bg X is added to extent tree
               --> commit roots are switched, so the
                   next scan in the extent tree will
                   see the extent item
      
           sees the extent in the extent tree
      
      When this happens the relocation produces the following warning when it
      finishes:
      
      [ 7260.832836] ------------[ cut here ]------------
      [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
      [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
      [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
      [ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
      [ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
      [ 7260.852998] Call Trace:
      [ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
      [ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
      [ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
      [ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
      [ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
      [ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
      [ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
      [ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
      [ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
      [ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
      [ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
      [ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
      [ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
      [ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
      [ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
      [ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
      
      This is because at the end of the first stage, in relocate_block_group(),
      we commit the current transaction, which makes delayed refs run, the
      commit roots are switched and so the second stage will find the extent
      item that the ordered extent added to the delayed refs. But this extent
      was not moved (ordered extent completed after first stage finished), so
      at the end of the relocation our block group item still has a positive
      used bytes counter, triggering a warning at the end of
      btrfs_relocate_block_group(). Later on when trying to read the extent
      contents from disk we hit a BUG_ON() due to the inability to map a block
      with a logical address that belongs to the block group we relocated and
      is no longer valid, resulting in the following trace:
      
      [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
      [ 7344.887518] ------------[ cut here ]------------
      [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
      [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
      [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
      [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
      [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
      [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
      [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
      [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
      [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
      [ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
      [ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
      [ 7344.888431] Stack:
      [ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
      [ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
      [ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
      [ 7344.888431] Call Trace:
      [ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
      [ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
      [ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
      [ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
      [ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
      [ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
      [ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
      [ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
      [ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
      [ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
      [ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
      [ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
      [ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
      [ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
      [ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
      [ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431]  RSP <ffff8802046878f0>
      [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      9cfa3e34
    • Filipe Manana's avatar
      Btrfs: don't wait for unrelated IO to finish before relocation · 578def7c
      Filipe Manana authored
      Before the relocation process of a block group starts, it sets the block
      group to readonly mode, then flushes all delalloc writes and then finally
      it waits for all ordered extents to complete. This last step includes
      waiting for ordered extents destinated at extents allocated in other block
      groups, making us waste unecessary time.
      
      So improve this by waiting only for ordered extents that fall into the
      block group's range.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      578def7c
    • Filipe Manana's avatar
      Btrfs: fix empty symlink after creating symlink and fsync parent dir · 3f9749f6
      Filipe Manana authored
      If we create a symlink, fsync its parent directory, crash/power fail and
      mount the filesystem, we end up with an empty symlink, which not only is
      useless it's also not allowed in linux (the man page symlink(2) is well
      explicit about that).  So we just need to make sure to fully log an inode
      if it's a symlink, to ensure its inline extent gets logged, ensuring the
      same behaviour as ext3, ext4, xfs, reiserfs, f2fs, nilfs2, etc.
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir /mnt/testdir
        $ sync
        $ ln -s /mnt/foo /mnt/testdir/bar
        $ xfs_io -c fsync /mnt/testdir
        <power fail>
        $ mount /dev/sdb /mnt
        $ readlink /mnt/testdir/bar
        <empty string>
      
      A test case for fstests follows soon.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      3f9749f6
    • Filipe Manana's avatar
      Btrfs: fix for incorrect directory entries after fsync log replay · 657ed1aa
      Filipe Manana authored
      If we move a directory to a new parent and later log that parent and don't
      explicitly log the old parent, when we replay the log we can end up with
      entries for the moved directory in both the old and new parent directories.
      Besides being ilegal to have directories with multiple hard links in linux,
      it also resulted in the leaving the inode item with a link count of 1.
      A similar issue also happens if we move a regular file - after the log tree
      is replayed the file has a link in both the old and new parent directories,
      when it should be only at the new directory.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ mkdir /mnt/x
        $ mkdir /mnt/y
        $ touch /mnt/x/foo
        $ mkdir /mnt/y/z
        $ sync
        $ ln /mnt/x/foo /mnt/x/bar
        $ mv /mnt/y/z /mnt/x/z
        < power fail >
        $ mount /dev/sdc /mnt
        $ ls -1Ri /mnt
        /mnt:
        257 x
        258 y
      
        /mnt/x:
        259 bar
        259 foo
        260 z
      
        /mnt/x/z:
      
        /mnt/y:
        260 z
      
        /mnt/y/z:
      
        $ umount /dev/sdc
        $ btrfs check /dev/sdc
        Checking filesystem on /dev/sdc
        UUID: a67e2c4a-a4b4-4fdc-b015-9d9af1e344be
        checking extents
        checking free space cache
        checking fs roots
        root 5 inode 260 errors 2000, link count wrong
              unresolved ref dir 257 index 4 namelen 1 name z filetype 2 errors 0
              unresolved ref dir 258 index 2 namelen 1 name z filetype 2 errors 0
        (...)
      
      Attempting to remove the directory becomes impossible:
      
        $ mount /dev/sdc /mnt
        $ rmdir /mnt/y/z
        $ ls -lh /mnt/y
        ls: cannot access /mnt/y/z: No such file or directory
        total 0
        d????????? ? ? ? ?            ? z
        $ rmdir /mnt/x/z
        rmdir: failed to remove ‘/mnt/x/z’: Stale file handle
        $ ls -lh /mnt/x
        ls: cannot access /mnt/x/z: Stale file handle
        total 0
        -rw-r--r-- 2 root root 0 Apr  6 18:06 bar
        -rw-r--r-- 2 root root 0 Apr  6 18:06 foo
        d????????? ? ?    ?    ?            ? z
      
      So make sure that on rename we set the last_unlink_trans value for our
      inode, even if it's a directory, to the value of the current transaction's
      ID and that if the new parent directory is logged that we fallback to a
      transaction commit.
      
      A test case for fstests is being submitted as well.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      657ed1aa
  2. 08 May, 2016 1 commit
  3. 07 May, 2016 7 commits
  4. 06 May, 2016 20 commits
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.dk/linux-block · 07837831
      Linus Torvalds authored
      Pull writeback fix from Jens Axboe:
       "Just a single fix for domain aware writeback, fixing a regression that
        can cause balance_dirty_pages() to keep looping while not getting any
        work done"
      
      * 'for-linus' of git://git.kernel.dk/linux-block:
        writeback: Fix performance regression in wb_over_bg_thresh()
      07837831
    • Linus Torvalds's avatar
      Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3f86ba5d
      Linus Torvalds authored
      Pull x86 fixes from Ingo Molnar:
       "This contains two fixes: a boot fix for older SGI/UV systems, and an
        APIC calibration fix"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO
        x86/platform/UV: Bring back the call to map_low_mmrs in uv_system_init
      3f86ba5d
    • Linus Torvalds's avatar
      Merge tag 'pm+acpi-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 01ec7167
      Linus Torvalds authored
      Pull power management and ACPI fixes from Rafael Wysocki:
       "Fixes for problems introduced or discovered recently (intel_pstate,
        sti-cpufreq, ARM64 cpuidle, Operating Performance Points framework,
        generic device properties framework) and one fix for a hotplug-related
        deadlock in ACPICA that's been there forever, but is nasty enough.
      
        Specifics:
      
         - Fix for a recent regression in the intel_pstate driver causing it
           to fail to restore the HWP (HW-managed P-states) configuration of
           the boot CPU after suspend-to-RAM (Rafael Wysocki).
      
         - Fix for two recent regressions in the intel_pstate driver, one that
           can trigger a divide by zero if the driver is accessed via sysfs
           before it manages to take the first sample and one causing it to
           fail to update a structure field used in a trace point, so the
           information coming from it is less useful (Rafael Wysocki).
      
         - Fix for a problem in the sti-cpufreq driver introduced during the
           4.5 cycle that causes it to break CPU PM in multi-platform kernels
           by registering cpufreq-dt (which subsequently doesn't work)
           unconditionally and preventing the driver that would actually work
           from registering (Sudeep Holla).
      
         - Stable-candidate fix for an ARM64 cpuidle issue causing idle state
           usage counters to be incorrectly updated for idle states that were
           not entered due to errors (James Morse).
      
         - Fix for a recently introduced issue in the OPP (Operating
           Performance Points) framework causing it to print bogus error
           messages for missing optional regulators (Viresh Kumar).
      
         - Fix for a recently introduced issue in the generic device
           properties framework that may cause it to attempt to dereferece and
           invalid pointer in some cases (Heikki Krogerus).
      
         - Fix for a deadlock in the ACPICA core that may be triggered by
           device (eg Thunderbolt) hotplug (Prarit Bhargava)"
      
      * tag 'pm+acpi-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        PM / OPP: Remove useless check
        ACPICA: Dispatcher: Update thread ID for recursive method calls
        intel_pstate: Fix intel_pstate_get()
        cpufreq: intel_pstate: Fix HWP on boot CPU after system resume
        cpufreq: st: enable selective initialization based on the platform
        ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value
        device property: Avoid potential dereferences of invalid pointers
      01ec7167
    • Linus Torvalds's avatar
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 17d25a33
      Linus Torvalds authored
      Pull scheduler fix from Ingo Molnar:
       "This contains a single fix that fixes a nohz tick stopping bug when
        mixed-poliocy SCHED_FIFO and SCHED_RR tasks are present on a runqueue"
      
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        nohz/full, sched/rt: Fix missed tick-reenabling bug in sched_can_stop_tick()
      17d25a33
    • Linus Torvalds's avatar
      Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 18fb92c3
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "This tree contains two fixes: new Intel CPU model numbers and an
        AMD/iommu uncore PMU driver fix"
      
      * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/amd/iommu: Do not register a task ctx for uncore like PMUs
        perf/x86: Add model numbers for Kabylake CPUs
      18fb92c3
    • Linus Torvalds's avatar
      Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · cade8184
      Linus Torvalds authored
      Pull EFI fixes from Ingo Molnar:
       "This tree contains three fixes: a console spam fix, a file pattern fix
        and a sysfb_efi fix for a bug that triggered on older ThinkPads"
      
      * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/sysfb_efi: Fix valid BAR address range check
        x86/efi-bgrt: Switch all pr_err() to pr_notice() for invalid BGRT
        MAINTAINERS: Remove asterisk from EFI directory names
      cade8184
    • Linus Torvalds's avatar
      Merge branch 'parisc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 83a395d3
      Linus Torvalds authored
      Pull parisc fix from Helge Deller:
       "Patch from Dmitry V Levin to fix a kernel crash when a straced process
        calls the (invalid) syscall which is equal to value of __NR_Linux_syscalls"
      
      * 'parisc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: fix a bug when syscall number of tracee is __NR_Linux_syscalls
      83a395d3
    • Linus Torvalds's avatar
      Merge tag 'arc-4.6-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · dd287690
      Linus Torvalds authored
      Pull ARC fixes from Vineet Gupta:
       "Late in the cycle, but this has fixes for couple of issues: a PAE40
        boot crash and Arnd spotting lack of barriers in BE io-accessors.
      
        The 3rd patch for enabling highmem in low physical mem ;-) honestly is
        more than a "fix" but its been in works for some time, seems to be
        stable in testing and enables 2 of our customers to go forward with
        4.6 kernel.
      
         - Fix for PTE truncation in PAE40 builds
         - Fix for big endian IO accessors lacking IO barrier
         - Allow HIGHMEM to work with low physical addresses"
      
      * tag 'arc-4.6-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: support HIGHMEM even without PAE40
        ARC: Fix PAE40 boot failures due to PTE truncation
        ARC: Add missing io barriers to io{read,write}{16,32}be()
      dd287690
    • Linus Torvalds's avatar
      Merge tag 'powerpc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 4883d11e
      Linus Torvalds authored
      Pull powerpc fix from Michael Ellerman:
       "Fix bad inline asm constraint in create_zero_mask() from Anton
        Blanchard"
      
      * tag 'powerpc-4.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc: Fix bad inline asm constraint in create_zero_mask()
      4883d11e
    • Linus Torvalds's avatar
      Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · 659a1823
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Fixes for i915, amdgpu/radeon and imx.
      
        The IMX fix is for an autoloading regression found in Fedora.  The
        radeon fixes, are the same fix to amdgpu/radeon to avoid a hardware
        lockup in some circumstances with a bad mode, and a double free bug I
        took a few hours chasing down the other morning.
      
        The i915 fixes are across the board, all stable material, and fixing
        some hangs and suspend/resume issues, along with a live status
        regressions"
      
      * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
        gpu: ipu-v3: Fix imx-ipuv3-crtc module autoloading
        drm/amdgpu: make sure vertical front porch is at least 1
        drm/radeon: make sure vertical front porch is at least 1
        drm/amdgpu: set metadata pointer to NULL after freeing.
        drm/i915: Make RPS EI/thresholds multiple of 25 on SNB-BDW
        drm/i915: Fake HDMI live status
        drm/i915: Fix eDP low vswing for Broadwell
        drm/i915/ddi: Fix eDP VDD handling during booting and suspend/resume
        drm/i915: Fix system resume if PCI device remained enabled
        drm/i915: Avoid stalling on pending flips for legacy cursor updates
      659a1823
    • Dmitry V. Levin's avatar
      parisc: fix a bug when syscall number of tracee is __NR_Linux_syscalls · f0b22d1b
      Dmitry V. Levin authored
      Do not load one entry beyond the end of the syscall table when the
      syscall number of a traced process equals to __NR_Linux_syscalls.
      Similar bug with regular processes was fixed by commit 3bb457af
      ("[PARISC] Fix bug when syscall nr is __NR_Linux_syscalls").
      
      This bug was found by strace test suite.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Acked-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      f0b22d1b
    • Rafael J. Wysocki's avatar
      Merge branches 'pm-opp-fixes', 'pm-cpufreq-fixes' and 'pm-cpuidle-fixes' · 5f2f88e3
      Rafael J. Wysocki authored
      * pm-opp-fixes:
        PM / OPP: Remove useless check
      
      * pm-cpufreq-fixes:
        intel_pstate: Fix intel_pstate_get()
        cpufreq: intel_pstate: Fix HWP on boot CPU after system resume
        cpufreq: st: enable selective initialization based on the platform
      
      * pm-cpuidle-fixes:
        ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value
      5f2f88e3
    • Rafael J. Wysocki's avatar
      Merge branches 'acpica-fixes' and 'device-properties-fixes' · 7c21b38c
      Rafael J. Wysocki authored
      * acpica-fixes:
        ACPICA: Dispatcher: Update thread ID for recursive method calls
      
      * device-properties-fixes:
        device property: Avoid potential dereferences of invalid pointers
      7c21b38c
    • Chen Yu's avatar
      x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO · 886123fb
      Chen Yu authored
      Currently we read the tsc radio: ratio = (MSR_PLATFORM_INFO >> 8) & 0x1f;
      
      Thus we get bit 8-12 of MSR_PLATFORM_INFO, however according to the SDM
      (35.5), the ratio bits are bit 8-15.
      
      Ignoring the upper bits can result in an incorrect tsc ratio, which causes the
      TSC calibration and the Local APIC timer frequency to be incorrect.
      
      Fix this problem by masking 0xff instead.
      
      [ tglx: Massaged changelog ]
      
      Fixes: 7da7c156 "x86, tsc: Add static (MSR) TSC calibration on Intel Atom SoCs"
      Signed-off-by: default avatarChen Yu <yu.c.chen@intel.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: stable@vger.kernel.org
      Cc: Bin Gao <bin.gao@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Link: http://lkml.kernel.org/r/1462505619-5516-1-git-send-email-yu.c.chen@intel.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      886123fb
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 9caa7e78
      Linus Torvalds authored
      Merge fixes from Andrew Morton:
       "14 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        byteswap: try to avoid __builtin_constant_p gcc bug
        lib/stackdepot: avoid to return 0 handle
        mm: fix kcompactd hang during memory offlining
        modpost: fix module autoloading for OF devices with generic compatible property
        proc: prevent accessing /proc/<PID>/environ until it's ready
        mm/zswap: provide unique zpool name
        mm: thp: kvm: fix memory corruption in KVM with THP enabled
        MAINTAINERS: fix Rajendra Nayak's address
        mm, cma: prevent nr_isolated_* counters from going negative
        mm: update min_free_kbytes from khugepaged after core initialization
        huge pagecache: mmap_sem is unlocked when truncation splits pmd
        rapidio/mport_cdev: fix uapi type definitions
        mm: memcontrol: let v2 cgroups follow changes in system swappiness
        mm: thp: correct split_huge_pages file permission
      9caa7e78
    • Linus Torvalds's avatar
      mailmap: add John Paul Adrian Glaubitz · 43a3e837
      Linus Torvalds authored
      Apparently patchwork ended up truncating the full name.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      43a3e837
    • Linus Torvalds's avatar
      Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 7270a3f7
      Linus Torvalds authored
      Pull libnvdimm fixes from Dan Williams:
      
       - a fix for the persistent memory 'struct page' driver.  The
         implementation overlooked the fact that pages are allocated in 2MB
         units leading to -ENOMEM when establishing some configurations.
      
         It's tagged for -stable as the problem was introduced with the
         initial implementation in 4.5.
      
       - The new "error status translation" routine, introduced with the 4.6
         updates to the nfit driver, missed a necessary path in
         acpi_nfit_ctl().
      
         The end result is that we are falsely assuming commands complete
         successfully when the embedded status says otherwise.
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        nfit: fix translation of command status results
        libnvdimm, pfn: fix memmap reservation sizing
      7270a3f7
    • Arnd Bergmann's avatar
      byteswap: try to avoid __builtin_constant_p gcc bug · 7322dd75
      Arnd Bergmann authored
      This is another attempt to avoid a regression in wwn_to_u64() after that
      started using get_unaligned_be64(), which in turn ran into a bug on
      gcc-4.9 through 6.1.
      
      The regression got introduced due to the combination of two separate
      workarounds (commits e3bde956: "include/linux/unaligned: force
      inlining of byteswap operations" and ef3fb242: "scsi: fc: use
      get/put_unaligned64 for wwn access") that each try to sidestep distinct
      problems with gcc behavior (code growth and increased stack usage).
      
      Unfortunately after both have been applied, a more serious gcc bug has
      been uncovered, leading to incorrect object code that discards part of a
      function and causes undefined behavior.
      
      As part of this problem is how __builtin_constant_p gets evaluated on an
      argument passed by reference into an inline function, this avoids the
      use of __builtin_constant_p() for all architectures that set
      CONFIG_ARCH_USE_BUILTIN_BSWAP.  Most architectures do not set
      ARCH_SUPPORTS_OPTIMIZED_INLINING, which means they probably do not
      suffer from the problem in the qla2xxx driver, but they might still run
      into it elsewhere.
      
      Both of the original workarounds were only merged in the 4.6 kernel, and
      the bug that is fixed by this patch should only appear if both are
      there, so we probably don't need to backport the fix.  On the other
      hand, it works by simplifying the code path and should not have any
      negative effects.
      
      [arnd@arndb.de: fix older gcc warnings]
        (http://lkml.kernel.org/r/12243652.bxSxEgjgfk@wuerfel)
      Link: https://lkml.org/lkml/headers/2016/4/12/1103
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70232
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
      Fixes: e3bde956 ("include/linux/unaligned: force inlining of byteswap operations")
      Fixes: ef3fb242 ("scsi: fc: use get/put_unaligned64 for wwn access")
      Link: http://lkml.kernel.org/r/1780465.XdtPJpi8Tt@wuerfelSigned-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Reviewed-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Tested-by: Josh Poimboeuf <jpoimboe@redhat.com> # on gcc-5.3
      Tested-by: default avatarQuinn Tran <quinn.tran@qlogic.com>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
      Cc: Jan Hubicka <hubicka@ucw.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7322dd75
    • Joonsoo Kim's avatar
      lib/stackdepot: avoid to return 0 handle · 7c31190b
      Joonsoo Kim authored
      Recently, we allow to save the stacktrace whose hashed value is 0.  It
      causes the problem that stackdepot could return 0 even if in success.
      User of stackdepot cannot distinguish whether it is success or not so we
      need to solve this problem.  In this patch, 1 bit are added to handle
      and make valid handle none 0 by setting this bit.  After that, valid
      handle will not be 0 and 0 handle will represent failure correctly.
      
      Fixes: 33334e25 ("lib/stackdepot.c: allow the stack trace hash to be zero")
      Link: http://lkml.kernel.org/r/1462252403-1106-1-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7c31190b
    • Vlastimil Babka's avatar
      mm: fix kcompactd hang during memory offlining · 172400c6
      Vlastimil Babka authored
      Assume memory47 is the last online block left in node1.  This will hang:
      
        # echo offline > /sys/devices/system/node/node1/memory47/state
      
      After a couple of minutes, the following pops up in dmesg:
      
        INFO: task bash:957 blocked for more than 120 seconds.
               Not tainted 4.6.0-rc6+ #6
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        bash            D ffff8800b7adbaf8     0   957    951 0x00000000
        Call Trace:
          schedule+0x35/0x80
          schedule_timeout+0x1ac/0x270
          wait_for_completion+0xe1/0x120
          kthread_stop+0x4f/0x110
          kcompactd_stop+0x26/0x40
          __offline_pages.constprop.28+0x7e6/0x840
          offline_pages+0x11/0x20
          memory_block_action+0x73/0x1d0
          memory_subsys_offline+0x47/0x60
          device_offline+0x86/0xb0
          store_mem_state+0xda/0xf0
          dev_attr_store+0x18/0x30
          sysfs_kf_write+0x37/0x40
          kernfs_fop_write+0x11d/0x170
          __vfs_write+0x37/0x120
          vfs_write+0xa9/0x1a0
          SyS_write+0x55/0xc0
          entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      kcompactd is waiting for kcompactd_max_order > 0 when it's woken up to
      actually exit.  Check kthread_should_stop() to break out of the wait.
      
      Fixes: 698b1b30 ("mm, compaction: introduce kcompactd").
      Reported-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      172400c6