1. 30 May, 2018 40 commits
    • btrfs: fix lockdep splat in btrfs_alloc_subvolume_writers · de00d572
      Jeff Mahoney authored
      [ Upstream commit 8a5a916d ]
      
      While running btrfs/011, I hit the following lockdep splat.
      
      This is the important bit:
         pcpu_alloc+0x1ac/0x5e0
         __percpu_counter_init+0x4e/0xb0
         btrfs_init_fs_root+0x99/0x1c0 [btrfs]
         btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
         resolve_indirect_refs+0x130/0x830 [btrfs]
         find_parent_nodes+0x69e/0xff0 [btrfs]
         btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
         btrfs_find_all_roots+0x50/0x70 [btrfs]
         btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
         btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
      
      The percpu_counter_init call in btrfs_alloc_subvolume_writers
      uses GFP_KERNEL, which we can't do during transaction commit.
      
      This switches it to GFP_NOFS.
      
      ========================================================
      WARNING: possible irq lock inversion dependency detected
      4.12.14-kvmsmall #8 Tainted: G        W
      --------------------------------------------------------
      kswapd0/50 just changed the state of lock:
       (&delayed_node->mutex){+.+.-.}, at: [<ffffffffc06994fa>] __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
      but this lock took another, RECLAIM_FS-unsafe lock in the past:
       (pcpu_alloc_mutex){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      other info that might help us debug this:
      Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> pcpu_alloc_mutex
      
       Possible interrupt unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(pcpu_alloc_mutex);
                                     local_irq_disable();
                                     lock(&delayed_node->mutex);
                                     lock(&found->groups_sem);
        <Interrupt>
          lock(&delayed_node->mutex);
      
       *** DEADLOCK ***
      
      2 locks held by kswapd0/50:
       #0:  (shrinker_rwsem){++++..}, at: [<ffffffff811dc11f>] shrink_slab+0x7f/0x5b0
       #1:  (&type->s_umount_key#30){+++++.}, at: [<ffffffff8126dec6>] trylock_super+0x16/0x50
      
      the shortest dependencies between 2nd lock and 1st lock:
         -> (pcpu_alloc_mutex){+.+.+.} ops: 4904 {
            HARDIRQ-ON-W at:
                                __mutex_lock+0x4e/0x8c0
                                pcpu_alloc+0x1ac/0x5e0
                                alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                                __do_tune_cpucache+0x2c/0x220
                                do_tune_cpucache+0x26/0xc0
                                enable_cpucache+0x6d/0xf0
                                kmem_cache_init_late+0x42/0x75
                                start_kernel+0x343/0x4cb
                                x86_64_start_kernel+0x127/0x134
                                secondary_startup_64+0xa5/0xb0
            SOFTIRQ-ON-W at:
                                __mutex_lock+0x4e/0x8c0
                                pcpu_alloc+0x1ac/0x5e0
                                alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                                __do_tune_cpucache+0x2c/0x220
                                do_tune_cpucache+0x26/0xc0
                                enable_cpucache+0x6d/0xf0
                                kmem_cache_init_late+0x42/0x75
                                start_kernel+0x343/0x4cb
                                x86_64_start_kernel+0x127/0x134
                                secondary_startup_64+0xa5/0xb0
            RECLAIM_FS-ON-W at:
                                   __kmalloc+0x47/0x310
                                   pcpu_extend_area_map+0x2b/0xc0
                                   pcpu_alloc+0x3ec/0x5e0
                                   alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                                   __do_tune_cpucache+0x2c/0x220
                                   do_tune_cpucache+0x26/0xc0
                                   enable_cpucache+0x6d/0xf0
                                   __kmem_cache_create+0x1bf/0x390
                                   create_cache+0xba/0x1b0
                                   kmem_cache_create+0x1f8/0x2b0
                                   ksm_init+0x6f/0x19d
                                   do_one_initcall+0x50/0x1b0
                                   kernel_init_freeable+0x201/0x289
                                   kernel_init+0xa/0x100
                                   ret_from_fork+0x3a/0x50
            INITIAL USE at:
                               __mutex_lock+0x4e/0x8c0
                               pcpu_alloc+0x1ac/0x5e0
                               alloc_kmem_cache_cpus.isra.70+0x25/0xa0
                               setup_cpu_cache+0x2f/0x1f0
                               __kmem_cache_create+0x1bf/0x390
                               create_boot_cache+0x8b/0xb1
                               kmem_cache_init+0xa1/0x19e
                               start_kernel+0x270/0x4cb
                               x86_64_start_kernel+0x127/0x134
                               secondary_startup_64+0xa5/0xb0
          }
          ... key      at: [<ffffffff821d8e70>] pcpu_alloc_mutex+0x70/0xa0
          ... acquired at:
         pcpu_alloc+0x1ac/0x5e0
         __percpu_counter_init+0x4e/0xb0
         btrfs_init_fs_root+0x99/0x1c0 [btrfs]
         btrfs_get_fs_root.part.54+0x5b/0x150 [btrfs]
         resolve_indirect_refs+0x130/0x830 [btrfs]
         find_parent_nodes+0x69e/0xff0 [btrfs]
         btrfs_find_all_roots_safe+0xa0/0x110 [btrfs]
         btrfs_find_all_roots+0x50/0x70 [btrfs]
         btrfs_qgroup_prepare_account_extents+0x53/0x90 [btrfs]
         btrfs_commit_transaction+0x3ce/0x9b0 [btrfs]
         transaction_kthread+0x176/0x1b0 [btrfs]
         kthread+0x102/0x140
         ret_from_fork+0x3a/0x50
      
        -> (&fs_info->commit_root_sem){++++..} ops: 1566382 {
           HARDIRQ-ON-W at:
                              down_write+0x3e/0xa0
                              cache_block_group+0x287/0x420 [btrfs]
                              find_free_extent+0x106c/0x12d0 [btrfs]
                              btrfs_reserve_extent+0xd8/0x170 [btrfs]
                              cow_file_range.isra.66+0x133/0x470 [btrfs]
                              run_delalloc_range+0x121/0x410 [btrfs]
                              writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
                              __extent_writepage+0x19a/0x360 [btrfs]
                              extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
                              extent_writepages+0x4d/0x60 [btrfs]
                              do_writepages+0x1a/0x70
                              __filemap_fdatawrite_range+0xa7/0xe0
                              btrfs_rename+0x5ee/0xdb0 [btrfs]
                              vfs_rename+0x52a/0x7e0
                              SyS_rename+0x351/0x3b0
                              do_syscall_64+0x79/0x1e0
                              entry_SYSCALL_64_after_hwframe+0x42/0xb7
           HARDIRQ-ON-R at:
                              down_read+0x35/0x90
                              caching_thread+0x57/0x560 [btrfs]
                              normal_work_helper+0x1c0/0x5e0 [btrfs]
                              process_one_work+0x1e0/0x5c0
                              worker_thread+0x44/0x390
                              kthread+0x102/0x140
                              ret_from_fork+0x3a/0x50
           SOFTIRQ-ON-W at:
                              down_write+0x3e/0xa0
                              cache_block_group+0x287/0x420 [btrfs]
                              find_free_extent+0x106c/0x12d0 [btrfs]
                              btrfs_reserve_extent+0xd8/0x170 [btrfs]
                              cow_file_range.isra.66+0x133/0x470 [btrfs]
                              run_delalloc_range+0x121/0x410 [btrfs]
                              writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
                              __extent_writepage+0x19a/0x360 [btrfs]
                              extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
                              extent_writepages+0x4d/0x60 [btrfs]
                              do_writepages+0x1a/0x70
                              __filemap_fdatawrite_range+0xa7/0xe0
                              btrfs_rename+0x5ee/0xdb0 [btrfs]
                              vfs_rename+0x52a/0x7e0
                              SyS_rename+0x351/0x3b0
                              do_syscall_64+0x79/0x1e0
                              entry_SYSCALL_64_after_hwframe+0x42/0xb7
           SOFTIRQ-ON-R at:
                              down_read+0x35/0x90
                              caching_thread+0x57/0x560 [btrfs]
                              normal_work_helper+0x1c0/0x5e0 [btrfs]
                              process_one_work+0x1e0/0x5c0
                              worker_thread+0x44/0x390
                              kthread+0x102/0x140
                              ret_from_fork+0x3a/0x50
           INITIAL USE at:
                             down_write+0x3e/0xa0
                             cache_block_group+0x287/0x420 [btrfs]
                             find_free_extent+0x106c/0x12d0 [btrfs]
                             btrfs_reserve_extent+0xd8/0x170 [btrfs]
                             cow_file_range.isra.66+0x133/0x470 [btrfs]
                             run_delalloc_range+0x121/0x410 [btrfs]
                             writepage_delalloc.isra.50+0xfe/0x180 [btrfs]
                             __extent_writepage+0x19a/0x360 [btrfs]
                             extent_write_cache_pages.constprop.56+0x249/0x3e0 [btrfs]
                             extent_writepages+0x4d/0x60 [btrfs]
                             do_writepages+0x1a/0x70
                             __filemap_fdatawrite_range+0xa7/0xe0
                             btrfs_rename+0x5ee/0xdb0 [btrfs]
                             vfs_rename+0x52a/0x7e0
                             SyS_rename+0x351/0x3b0
                             do_syscall_64+0x79/0x1e0
                             entry_SYSCALL_64_after_hwframe+0x42/0xb7
         }
         ... key      at: [<ffffffffc0729578>] __key.61970+0x0/0xfffffffffff9aa88 [btrfs]
         ... acquired at:
         cache_block_group+0x287/0x420 [btrfs]
         find_free_extent+0x106c/0x12d0 [btrfs]
         btrfs_reserve_extent+0xd8/0x170 [btrfs]
         btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
         btrfs_create_tree+0xbb/0x2a0 [btrfs]
         btrfs_create_uuid_tree+0x37/0x140 [btrfs]
         open_ctree+0x23c0/0x2660 [btrfs]
         btrfs_mount+0xd36/0xf90 [btrfs]
         mount_fs+0x3a/0x160
         vfs_kern_mount+0x66/0x150
         btrfs_mount+0x18c/0xf90 [btrfs]
         mount_fs+0x3a/0x160
         vfs_kern_mount+0x66/0x150
         do_mount+0x1c1/0xcc0
         SyS_mount+0x7e/0xd0
         do_syscall_64+0x79/0x1e0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
       -> (&found->groups_sem){++++..} ops: 2134587 {
          HARDIRQ-ON-W at:
                            down_write+0x3e/0xa0
                            __link_block_group+0x34/0x130 [btrfs]
                            btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
                            open_ctree+0x2054/0x2660 [btrfs]
                            btrfs_mount+0xd36/0xf90 [btrfs]
                            mount_fs+0x3a/0x160
                            vfs_kern_mount+0x66/0x150
                            btrfs_mount+0x18c/0xf90 [btrfs]
                            mount_fs+0x3a/0x160
                            vfs_kern_mount+0x66/0x150
                            do_mount+0x1c1/0xcc0
                            SyS_mount+0x7e/0xd0
                            do_syscall_64+0x79/0x1e0
                            entry_SYSCALL_64_after_hwframe+0x42/0xb7
          HARDIRQ-ON-R at:
                            down_read+0x35/0x90
                            btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
                            open_ctree+0x207b/0x2660 [btrfs]
                            btrfs_mount+0xd36/0xf90 [btrfs]
                            mount_fs+0x3a/0x160
                            vfs_kern_mount+0x66/0x150
                            btrfs_mount+0x18c/0xf90 [btrfs]
                            mount_fs+0x3a/0x160
                            vfs_kern_mount+0x66/0x150
                            do_mount+0x1c1/0xcc0
                            SyS_mount+0x7e/0xd0
                            do_syscall_64+0x79/0x1e0
                            entry_SYSCALL_64_after_hwframe+0x42/0xb7
          SOFTIRQ-ON-W at:
                            down_write+0x3e/0xa0
                            __link_block_group+0x34/0x130 [btrfs]
                            btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
                            open_ctree+0x2054/0x2660 [btrfs]
                            btrfs_mount+0xd36/0xf90 [btrfs]
                            mount_fs+0x3a/0x160
                            vfs_kern_mount+0x66/0x150
                            btrfs_mount+0x18c/0xf90 [btrfs]
                            mount_fs+0x3a/0x160
                            vfs_kern_mount+0x66/0x150
                            do_mount+0x1c1/0xcc0
                            SyS_mount+0x7e/0xd0
                            do_syscall_64+0x79/0x1e0
                            entry_SYSCALL_64_after_hwframe+0x42/0xb7
          SOFTIRQ-ON-R at:
                            down_read+0x35/0x90
                            btrfs_calc_num_tolerated_disk_barrier_failures+0x113/0x1f0 [btrfs]
                            open_ctree+0x207b/0x2660 [btrfs]
                            btrfs_mount+0xd36/0xf90 [btrfs]
                            mount_fs+0x3a/0x160
                            vfs_kern_mount+0x66/0x150
                            btrfs_mount+0x18c/0xf90 [btrfs]
                            mount_fs+0x3a/0x160
                            vfs_kern_mount+0x66/0x150
                            do_mount+0x1c1/0xcc0
                            SyS_mount+0x7e/0xd0
                            do_syscall_64+0x79/0x1e0
                            entry_SYSCALL_64_after_hwframe+0x42/0xb7
          INITIAL USE at:
                           down_write+0x3e/0xa0
                           __link_block_group+0x34/0x130 [btrfs]
                           btrfs_read_block_groups+0x33d/0x7b0 [btrfs]
                           open_ctree+0x2054/0x2660 [btrfs]
                           btrfs_mount+0xd36/0xf90 [btrfs]
                           mount_fs+0x3a/0x160
                           vfs_kern_mount+0x66/0x150
                           btrfs_mount+0x18c/0xf90 [btrfs]
                           mount_fs+0x3a/0x160
                           vfs_kern_mount+0x66/0x150
                           do_mount+0x1c1/0xcc0
                           SyS_mount+0x7e/0xd0
                           do_syscall_64+0x79/0x1e0
                           entry_SYSCALL_64_after_hwframe+0x42/0xb7
        }
        ... key      at: [<ffffffffc0729488>] __key.59101+0x0/0xfffffffffff9ab78 [btrfs]
        ... acquired at:
         find_free_extent+0xcb4/0x12d0 [btrfs]
         btrfs_reserve_extent+0xd8/0x170 [btrfs]
         btrfs_alloc_tree_block+0x12f/0x4c0 [btrfs]
         __btrfs_cow_block+0x110/0x5b0 [btrfs]
         btrfs_cow_block+0xd7/0x290 [btrfs]
         btrfs_search_slot+0x1f6/0x960 [btrfs]
         btrfs_lookup_inode+0x2a/0x90 [btrfs]
         __btrfs_update_delayed_inode+0x65/0x210 [btrfs]
         btrfs_commit_inode_delayed_inode+0x121/0x130 [btrfs]
         btrfs_evict_inode+0x3fe/0x6a0 [btrfs]
         evict+0xc4/0x190
         __dentry_kill+0xbf/0x170
         dput+0x2ae/0x2f0
         SyS_rename+0x2a6/0x3b0
         do_syscall_64+0x79/0x1e0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      -> (&delayed_node->mutex){+.+.-.} ops: 5580204 {
         HARDIRQ-ON-W at:
                          __mutex_lock+0x4e/0x8c0
                          btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
                          btrfs_update_inode+0x83/0x110 [btrfs]
                          btrfs_dirty_inode+0x62/0xe0 [btrfs]
                          touch_atime+0x8c/0xb0
                          do_generic_file_read+0x818/0xb10
                          __vfs_read+0xdc/0x150
                          vfs_read+0x8a/0x130
                          SyS_read+0x45/0xa0
                          do_syscall_64+0x79/0x1e0
                          entry_SYSCALL_64_after_hwframe+0x42/0xb7
         SOFTIRQ-ON-W at:
                          __mutex_lock+0x4e/0x8c0
                          btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
                          btrfs_update_inode+0x83/0x110 [btrfs]
                          btrfs_dirty_inode+0x62/0xe0 [btrfs]
                          touch_atime+0x8c/0xb0
                          do_generic_file_read+0x818/0xb10
                          __vfs_read+0xdc/0x150
                          vfs_read+0x8a/0x130
                          SyS_read+0x45/0xa0
                          do_syscall_64+0x79/0x1e0
                          entry_SYSCALL_64_after_hwframe+0x42/0xb7
         IN-RECLAIM_FS-W at:
                             __mutex_lock+0x4e/0x8c0
                             __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
                             btrfs_evict_inode+0x22c/0x6a0 [btrfs]
                             evict+0xc4/0x190
                             dispose_list+0x35/0x50
                             prune_icache_sb+0x42/0x50
                             super_cache_scan+0x139/0x190
                             shrink_slab+0x262/0x5b0
                             shrink_node+0x2eb/0x2f0
                             kswapd+0x2eb/0x890
                             kthread+0x102/0x140
                             ret_from_fork+0x3a/0x50
         INITIAL USE at:
                         __mutex_lock+0x4e/0x8c0
                         btrfs_delayed_update_inode+0x46/0x6e0 [btrfs]
                         btrfs_update_inode+0x83/0x110 [btrfs]
                         btrfs_dirty_inode+0x62/0xe0 [btrfs]
                         touch_atime+0x8c/0xb0
                         do_generic_file_read+0x818/0xb10
                         __vfs_read+0xdc/0x150
                         vfs_read+0x8a/0x130
                         SyS_read+0x45/0xa0
                         do_syscall_64+0x79/0x1e0
                         entry_SYSCALL_64_after_hwframe+0x42/0xb7
       }
       ... key      at: [<ffffffffc072d488>] __key.56935+0x0/0xfffffffffff96b78 [btrfs]
       ... acquired at:
         __lock_acquire+0x264/0x11c0
         lock_acquire+0xbd/0x1e0
         __mutex_lock+0x4e/0x8c0
         __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
         btrfs_evict_inode+0x22c/0x6a0 [btrfs]
         evict+0xc4/0x190
         dispose_list+0x35/0x50
         prune_icache_sb+0x42/0x50
         super_cache_scan+0x139/0x190
         shrink_slab+0x262/0x5b0
         shrink_node+0x2eb/0x2f0
         kswapd+0x2eb/0x890
         kthread+0x102/0x140
         ret_from_fork+0x3a/0x50
      
      stack backtrace:
      CPU: 1 PID: 50 Comm: kswapd0 Tainted: G        W        4.12.14-kvmsmall #8 SLE15 (unreleased)
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      Call Trace:
       dump_stack+0x78/0xb7
       print_irq_inversion_bug.part.38+0x19f/0x1aa
       check_usage_forwards+0x102/0x120
       ? ret_from_fork+0x3a/0x50
       ? check_usage_backwards+0x110/0x110
       mark_lock+0x16c/0x270
       __lock_acquire+0x264/0x11c0
       ? pagevec_lookup_entries+0x1a/0x30
       ? truncate_inode_pages_range+0x2b3/0x7f0
       lock_acquire+0xbd/0x1e0
       ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
       __mutex_lock+0x4e/0x8c0
       ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
       ? __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
       ? btrfs_evict_inode+0x1f6/0x6a0 [btrfs]
       __btrfs_release_delayed_node+0x3a/0x1f0 [btrfs]
       btrfs_evict_inode+0x22c/0x6a0 [btrfs]
       evict+0xc4/0x190
       dispose_list+0x35/0x50
       prune_icache_sb+0x42/0x50
       super_cache_scan+0x139/0x190
       shrink_slab+0x262/0x5b0
       shrink_node+0x2eb/0x2f0
       kswapd+0x2eb/0x890
       kthread+0x102/0x140
       ? mem_cgroup_shrink_node+0x2c0/0x2c0
       ? kthread_create_on_node+0x40/0x40
       ret_from_fork+0x3a/0x50
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix copy_items() return value when logging an inode · 92efba91
      Filipe Manana authored
      [ Upstream commit 8434ec46 ]
      
      When logging an inode, at tree-log.c:copy_items(), if we call
      btrfs_next_leaf() in the loop that checks for the need to log holes, we
      need to make sure copy_items() returns the value 1 to its caller and
      not 0 (on success). This is because the path the caller passed was
      released and is now different from what it was before. The caller
      expects a return value of 0 to mean both success and that the path
      has not changed, while a return value of 1 means success and also
      signals the caller that it cannot reuse the path and has to perform
      another tree search.
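
The return convention described above can be modelled in user space. This is only a rough sketch, not the btrfs code: `path_sketch`, `search_sketch()`, `copy_items_sketch()` and `log_items_sketch()` are hypothetical names, and the `moved_to_next_leaf` flag stands in for the case where btrfs_next_leaf() was called in the hole-checking loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a btree path; not the real struct btrfs_path. */
struct path_sketch {
    bool valid;     /* true while the path still points at the searched leaf */
    int searches;   /* number of tree searches performed so far */
};

/* Hypothetical stand-in for a tree search: (re)positions the path. */
static void search_sketch(struct path_sketch *p)
{
    p->valid = true;
    p->searches++;
}

/*
 * Models the convention from the commit message:
 *   < 0  error (never produced by this sketch)
 *     0  success, path unchanged, caller may keep using it
 *     1  success, but the path was released, caller must search again
 */
static int copy_items_sketch(struct path_sketch *p, bool moved_to_next_leaf)
{
    assert(p->valid);
    if (moved_to_next_leaf) {
        p->valid = false;   /* the path the caller handed in is gone */
        return 1;
    }
    return 0;
}

/* Caller side: repeat the search only when told the path cannot be reused. */
static int log_items_sketch(struct path_sketch *p, bool moved_to_next_leaf)
{
    int ret = copy_items_sketch(p, moved_to_next_leaf);

    if (ret < 0)
        return ret;
    if (ret > 0)
        search_sketch(p);
    return 0;
}
```

Returning 0 in the `moved_to_next_leaf` case would leave the caller working with a path it believes is still valid, which is the situation the fix avoids.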
      
      Even though this case should not be triggered under normal
      circumstances, or is at least very rare, its consequences can be very
      unpredictable (especially when replaying a log tree).
      
      Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: tests/qgroup: Fix wrong tree backref level · d7255626
      Qu Wenruo authored
      [ Upstream commit 3c0efdf0 ]
      
      The extent tree of the test fs is like the following:
      
       BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
        item 0 key (4096 168 4096) itemoff 3944 itemsize 51
                extent refs 1 gen 1 flags 2
                tree block key (68719476736 0 0) level 1
                                                 ^^^^^^^
                ref#0: tree block backref root 5
      
      The test uses an empty tree as the fs tree, so there is no way that its
      level can be 1.
      
      For a real fs tree (one created by mkfs) without skinny metadata, the
      backref should look like:
      
       item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
               refs 1 gen 4 flags TREE_BLOCK
               tree block key (256 INODE_ITEM 0) level 0
                                                 ^^^^^^^
               tree block backref root 5
      
      Fix the level to 0, so that it won't break the upcoming tree level checker.
      
      Fixes: faa2dbf0 ("Btrfs: add sanity tests for new qgroup accounting code")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc/64s: sreset panic if there is no debugger or crash dump handlers · 27a913cc
      Nicholas Piggin authored
      [ Upstream commit d40b6768 ]
      
      system_reset_exception does most of its own crash handling now,
      invoking the debugger or crash dumps if they are registered. If not,
      then it goes through to die() to print stack traces, and then is
      supposed to panic (according to comments).
      
      However after die() prints oopses, it does its own handling which
      doesn't allow system_reset_exception to panic (e.g., it may just
      kill the current process). This patch causes sreset exceptions to
      return from die after it prints messages but before acting.
      
      This also stops die from invoking the debugger on 0x100 crashes.
      system_reset_exception similarly calls the debugger. It had been
      thought this was harmless (because if the debugger was disabled,
      neither call would fire, and if it was enabled the first call
      would return). However in some cases like xmon 'X' command, the
      debugger returns 0, which currently causes it to be entered
      again (first in system_reset_exception, then in die), which is
      confusing.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: bgmac: Correctly annotate register space · 305f25c1
      Florian Fainelli authored
      [ Upstream commit 16a1c064 ]
      
      All the members: base, idm_base and nicpm_base should be annotated with
      __iomem since they are pointers to register space. This fixes a bunch of
      sparse reported warnings.
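
As a rough illustration of what the annotation buys, the sketch below mimics it in user space. Everything here is an assumption made for the example: the struct name `bgmac_regs_sketch`, the two accessor helpers, and the fallback definitions of `__iomem` and `__force` (which only take effect when sparse defines `__CHECKER__`; the exact attribute spelling in kernel headers varies by version). A real driver would go through accessors such as readl() and writel() rather than dereferencing the pointer.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed stand-ins for the kernel's annotations. Under sparse they put the
 * pointer in a separate, non-dereferenceable address space; for a normal
 * compiler they expand to nothing.
 */
#ifdef __CHECKER__
# define __iomem __attribute__((noderef, address_space(2)))
# define __force __attribute__((force))
#else
# define __iomem
# define __force
#endif

/* Hypothetical struct holding the three register-space pointers named above. */
struct bgmac_regs_sketch {
    void __iomem *base;
    void __iomem *idm_base;
    void __iomem *nicpm_base;
};

/* Sketch of a 32-bit register read; offset is in bytes and must be aligned. */
static uint32_t reg_read32_sketch(const void __iomem *base, unsigned int offset)
{
    assert(offset % 4 == 0);
    const volatile uint8_t *p = (const volatile uint8_t __force *)base + offset;
    return *(const volatile uint32_t *)p;
}

/* Sketch of a 32-bit register write with the same constraints. */
static void reg_write32_sketch(void __iomem *base, unsigned int offset, uint32_t val)
{
    assert(offset % 4 == 0);
    volatile uint8_t *p = (volatile uint8_t __force *)base + offset;
    *(volatile uint32_t *)p = val;
}
```

With the annotation in place, sparse can flag code that dereferences a register pointer directly or mixes it with ordinary memory pointers, which is the class of warning the commit addresses.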
      
      Fixes: f6a95a24 ("net: ethernet: bgmac: Add platform device support")
      Fixes: dd5c5d03 ("net: ethernet: bgmac: add NS2 support")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: bgmac: Fix endian access in bgmac_dma_tx_ring_free() · 435290f7
      Florian Fainelli authored
      [ Upstream commit 60d6e6f0 ]
      
      bgmac_dma_tx_ring_free() assigns the ctl1 word, which is a little-endian
      32-bit word, without using the proper accessors. Fix this and, because a
      length cannot be negative, use unsigned int while at it.
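
The pattern can be sketched in portable user-space C. This is an illustration under stated assumptions, not the driver code: the `_sketch` helpers below are stand-ins for the kernel's le32_to_cpu()/cpu_to_le32(), the descriptor layout is simplified, and the length mask value is assumed for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 32-bit value whose bytes are stored in little-endian order in memory. */
typedef uint32_t le32_sketch;

/* Assumed mask for the length field of ctl1 (illustrative value). */
#define DESC_CTL1_LEN_MASK_SKETCH 0x00001fffu

/* Host order to little-endian storage; works on any host endianness. */
static le32_sketch cpu_to_le32_sketch(uint32_t v)
{
    uint8_t b[4] = {
        (uint8_t)(v & 0xff), (uint8_t)((v >> 8) & 0xff),
        (uint8_t)((v >> 16) & 0xff), (uint8_t)((v >> 24) & 0xff)
    };
    le32_sketch out;

    memcpy(&out, b, sizeof(out));
    return out;
}

/* Little-endian storage to host order. */
static uint32_t le32_to_cpu_sketch(le32_sketch v)
{
    uint8_t b[4];

    memcpy(b, &v, sizeof(b));
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Simplified descriptor: both control words are little-endian in memory. */
struct dma_desc_sketch {
    le32_sketch ctl0;
    le32_sketch ctl1;
};

static void desc_set_len_sketch(struct dma_desc_sketch *d, unsigned int len)
{
    assert(len <= DESC_CTL1_LEN_MASK_SKETCH);
    d->ctl1 = cpu_to_le32_sketch(len);
}

/* Convert first, then mask; the length is unsigned since it cannot be negative. */
static unsigned int desc_len_sketch(const struct dma_desc_sketch *d)
{
    uint32_t ctl1 = le32_to_cpu_sketch(d->ctl1);

    return ctl1 & DESC_CTL1_LEN_MASK_SKETCH;
}
```

Reading `d->ctl1` directly happens to work on a little-endian host, which is why this kind of bug tends to show up only on big-endian machines or under sparse's endianness checks.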
      
      Fixes: 9cde9450 ("bgmac: implement scatter/gather support")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sparc64: Make atomic_xchg() an inline function rather than a macro. · 4a6cd791
      David S. Miller authored
      [ Upstream commit d13864b6 ]
      
      This avoids a lot of -Wunused warnings such as:
      
      ====================
      kernel/debug/debug_core.c: In function ‘kgdb_cpu_enter’:
      ./arch/sparc/include/asm/cmpxchg_64.h:55:22: warning: value computed is not used [-Wunused-value]
       #define xchg(ptr,x) ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
      
      ./arch/sparc/include/asm/atomic_64.h:86:30: note: in expansion of macro ‘xchg’
       #define atomic_xchg(v, new) (xchg(&((v)->counter), new))
                                    ^~~~
      kernel/debug/debug_core.c:508:4: note: in expansion of macro ‘atomic_xchg’
          atomic_xchg(&kgdb_active, cpu);
          ^~~~~~~~~~~
      ====================
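
A minimal user-space sketch of the idea, assuming GCC or Clang (it uses the `__atomic_exchange_n` builtin rather than the sparc64 implementation); `atomic_sketch_t`, `atomic_xchg_sketch()` and `set_active_sketch()` are hypothetical names. When the exchange is a macro that expands to a cast expression, discarding its value triggers -Wunused-value; a function call whose return value is ignored does not.

```c
#include <assert.h>

/* Hypothetical counterpart of the kernel's atomic_t. */
typedef struct {
    int counter;
} atomic_sketch_t;

/*
 * Inline-function form: atomically stores new_val and returns the old value.
 * The caller is free to ignore the result.
 */
static inline int atomic_xchg_sketch(atomic_sketch_t *v, int new_val)
{
    return __atomic_exchange_n(&v->counter, new_val, __ATOMIC_SEQ_CST);
}

/* Mirrors the call site in the warning above: the old value is not needed. */
static void set_active_sketch(atomic_sketch_t *active, int cpu)
{
    assert(cpu >= 0);
    atomic_xchg_sketch(active, cpu);   /* result deliberately discarded */
}
```

The inline function also gives the arguments real types, so a mismatched pointer is reported at the call site instead of deep inside a macro expansion.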
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fscache: Fix hanging wait on page discarded by writeback · 22f1bde5
      David Howells authored
      [ Upstream commit 2c984257 ]
      
      If the fscache asynchronous write operation elects to discard a page that's
      pending storage to the cache because the page would be over the store limit
      then it needs to wake the page as someone may be waiting on completion of
      the write.
      
      The problem is that the store limit may be updated by a different
      asynchronous operation - and so may miss the write - and that the store
      limit may not even get updated until later by the netfs.
      
      Fix the kernel hang by making fscache_write_op() mark as written any pages
      that are over the limit.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • lan78xx: Connect phy early · 6d03ff16
      Alexander Graf authored
      [ Upstream commit 92571a1a ]
      
      When using wicked with a lan78xx device attached to the system, we
      end up with ethtool commands issued on the device before an ifup
      is issued. That led to the following crash:
      
          Unable to handle kernel NULL pointer dereference at virtual address 0000039c
          pgd = ffff800035b30000
          [0000039c] *pgd=0000000000000000
          Internal error: Oops: 96000004 [#1] SMP
          Modules linked in: [...]
          Supported: Yes
          CPU: 3 PID: 638 Comm: wickedd Tainted: G            E      4.12.14-0-default #1
          Hardware name: raspberrypi rpi/rpi, BIOS 2018.03-rc2 02/21/2018
          task: ffff800035e74180 task.stack: ffff800036718000
          PC is at phy_ethtool_ksettings_get+0x20/0x98
          LR is at lan78xx_get_link_ksettings+0x44/0x60 [lan78xx]
          pc : [<ffff0000086f7f30>] lr : [<ffff000000dcca84>] pstate: 20000005
          sp : ffff80003671bb20
          x29: ffff80003671bb20 x28: ffff800035e74180
          x27: ffff000008912000 x26: 000000000000001d
          x25: 0000000000000124 x24: ffff000008f74d00
          x23: 0000004000114809 x22: 0000000000000000
          x21: ffff80003671bbd0 x20: 0000000000000000
          x19: ffff80003671bbd0 x18: 000000000000040d
          x17: 0000000000000001 x16: 0000000000000000
          x15: 0000000000000000 x14: ffffffffffffffff
          x13: 0000000000000000 x12: 0000000000000020
          x11: 0101010101010101 x10: fefefefefefefeff
          x9 : 7f7f7f7f7f7f7f7f x8 : fefefeff31677364
          x7 : 0000000080808080 x6 : ffff80003671bc9c
          x5 : ffff80003671b9f8 x4 : ffff80002c296190
          x3 : 0000000000000000 x2 : 0000000000000000
          x1 : ffff80003671bbd0 x0 : ffff80003671bc00
          Process wickedd (pid: 638, stack limit = 0xffff800036718000)
          Call trace:
          Exception stack(0xffff80003671b9e0 to 0xffff80003671bb20)
          b9e0: ffff80003671bc00 ffff80003671bbd0 0000000000000000 0000000000000000
          ba00: ffff80002c296190 ffff80003671b9f8 ffff80003671bc9c 0000000080808080
          ba20: fefefeff31677364 7f7f7f7f7f7f7f7f fefefefefefefeff 0101010101010101
          ba40: 0000000000000020 0000000000000000 ffffffffffffffff 0000000000000000
          ba60: 0000000000000000 0000000000000001 000000000000040d ffff80003671bbd0
          ba80: 0000000000000000 ffff80003671bbd0 0000000000000000 0000004000114809
          baa0: ffff000008f74d00 0000000000000124 000000000000001d ffff000008912000
          bac0: ffff800035e74180 ffff80003671bb20 ffff000000dcca84 ffff80003671bb20
          bae0: ffff0000086f7f30 0000000020000005 ffff80002c296000 ffff800035223900
          bb00: 0000ffffffffffff 0000000000000000 ffff80003671bb20 ffff0000086f7f30
          [<ffff0000086f7f30>] phy_ethtool_ksettings_get+0x20/0x98
          [<ffff000000dcca84>] lan78xx_get_link_ksettings+0x44/0x60 [lan78xx]
          [<ffff0000087cbc40>] ethtool_get_settings+0x68/0x210
          [<ffff0000087cc0d4>] dev_ethtool+0x214/0x2180
          [<ffff0000087e5008>] dev_ioctl+0x400/0x630
          [<ffff00000879dd00>] sock_do_ioctl+0x70/0x88
          [<ffff00000879f5f8>] sock_ioctl+0x208/0x368
          [<ffff0000082cde10>] do_vfs_ioctl+0xb0/0x848
          [<ffff0000082ce634>] SyS_ioctl+0x8c/0xa8
          Exception stack(0xffff80003671bec0 to 0xffff80003671c000)
          bec0: 0000000000000009 0000000000008946 0000fffff4e841d0 0000aa0032687465
          bee0: 0000aaaafa2319d4 0000fffff4e841d4 0000000032687465 0000000032687465
          bf00: 000000000000001d 7f7fff7f7f7f7f7f 72606b622e71ff4c 7f7f7f7f7f7f7f7f
          bf20: 0101010101010101 0000000000000020 ffffffffffffffff 0000ffff7f510c68
          bf40: 0000ffff7f6a9d18 0000ffff7f44ce30 000000000000040d 0000ffff7f6f98f0
          bf60: 0000fffff4e842c0 0000000000000001 0000aaaafa2c2e00 0000ffff7f6ab000
          bf80: 0000fffff4e842c0 0000ffff7f62a000 0000aaaafa2b9f20 0000aaaafa2c2e00
          bfa0: 0000fffff4e84818 0000fffff4e841a0 0000ffff7f5ad0cc 0000fffff4e841a0
          bfc0: 0000ffff7f44ce3c 0000000080000000 0000000000000009 000000000000001d
          bfe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
      The culprit is quite simple: The driver tries to access the phy left and right,
      but only actually has a working reference to it when the device is up.
      
      The fix thus is quite simple too: Get a reference to the phy on probe already
      and keep it even when the device is going down.
      
      With this patch applied, I can successfully run wicked on my system and bring
      the interface up and down as many times as I want, without getting NULL pointer
      dereferences in between.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: VMX: raise internal error for exception during invalid protected mode state · 80b8f3da
      Sean Christopherson authored
      [ Upstream commit add5ff7a ]
      
      Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if we encounter
      an exception in Protected Mode while emulating guest due to invalid
      guest state.  Unlike Big RM, KVM doesn't support emulating exceptions
      in PM, i.e. PM exceptions are always injected via the VMCS.  Because
      we will never do VMRESUME due to emulation_required, the exception is
      never realized and we'll keep emulating the faulting instruction over
      and over until we receive a signal.
      
      Exit to userspace iff there is a pending exception, i.e. don't exit
      simply on a requested event. The purpose of this check and exit is to
      aid in debugging a guest that is in all likelihood already doomed.
      Invalid guest state in PM is extremely limited in normal operation,
      e.g. it generally only occurs for a few instructions early in BIOS,
      and any exception at this time is all but guaranteed to be fatal.
      Non-vectored interrupts, e.g. INIT, SIPI and SMI, can be cleanly
      handled/emulated, while checking for vectored interrupts, e.g. INTR
      and NMI, without hitting false positives would add a fair amount of
      complexity for almost no benefit (getting hit by lightning seems
      more likely than encountering this specific scenario).
      
      Add a WARN_ON_ONCE to vmx_queue_exception() if we try to inject an
      exception via the VMCS and emulation_required is true.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/mm: Fix bogus warning during EFI bootup, use boot_cpu_has() instead of... · fd97bbca
      Sai Praneeth authored
      x86/mm: Fix bogus warning during EFI bootup, use boot_cpu_has() instead of this_cpu_has() in build_cr3_noflush()
      
      [ Upstream commit 162ee5a8 ]
      
      Linus reported the following boot warning:
      
        WARNING: CPU: 0 PID: 0 at arch/x86/include/asm/tlbflush.h:134 load_new_mm_cr3+0x114/0x170
        [...]
        Call Trace:
        switch_mm_irqs_off+0x267/0x590
        switch_mm+0xe/0x20
        efi_switch_mm+0x3e/0x50
        efi_enter_virtual_mode+0x43f/0x4da
        start_kernel+0x3bf/0x458
        secondary_startup_64+0xa5/0xb0
      
      ... after merging:
      
        03781e40: x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3
      
      When the platform supports PCID and if CONFIG_DEBUG_VM=y is enabled,
      build_cr3_noflush() (called via switch_mm()) does a sanity check to see
      if X86_FEATURE_PCID is set.
      
      Presently, build_cr3_noflush() uses "this_cpu_has(X86_FEATURE_PCID)" to
      perform the check but this_cpu_has() works only after SMP is initialized
      (i.e. per cpu cpu_info's should be populated) and this happens to be very
      late in the boot process (during rest_init()).
      
      As efi_runtime_services() are called during (early) kernel boot time
      and run time, modify build_cr3_noflush() to use boot_cpu_has() all the
      time. As suggested by Dave Hansen, this should be OK because all CPUs have
      the same capabilities on x86.
      
      With this change the warning is fixed.
      
      ( Dave also suggested that we put a warning in this_cpu_has() if it's used
        early in the boot process. This is still work in progress as it affects
        MCE. )
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Lee Chun-Yi <jlee@suse.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Cc: Ricardo Neri <ricardo.neri@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/1522870459-7432-1-git-send-email-sai.praneeth.prakhya@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warning · 3aeaeecd
      Davidlohr Bueso authored
      [ Upstream commit d29a2064 ]
      
      While running rt-tests' pi_stress program I got the following splat:
      
        rq->clock_update_flags < RQCF_ACT_SKIP
        WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20
      
        [...]
      
        <IRQ>
        enqueue_top_rt_rq+0xf4/0x150
        ? cpufreq_dbs_governor_start+0x170/0x170
        sched_rt_rq_enqueue+0x65/0x80
        sched_rt_period_timer+0x156/0x360
        ? sched_rt_rq_enqueue+0x80/0x80
        __hrtimer_run_queues+0xfa/0x260
        hrtimer_interrupt+0xcb/0x220
        smp_apic_timer_interrupt+0x62/0x120
        apic_timer_interrupt+0xf/0x20
        </IRQ>
      
        [...]
      
        do_idle+0x183/0x1e0
        cpu_startup_entry+0x5f/0x70
        start_secondary+0x192/0x1d0
        secondary_startup_64+0xa5/0xb0
      
      We can get rid of it by the "traditional" means of adding an
      update_rq_clock() call after acquiring the rq->lock in
      do_sched_rt_period_timer().
      
      The case for the RT task throttling (which this workload also hits)
      can be ignored in that the skip_update call is actually bogus and
      quite the contrary (the request bits are removed/reverted).
      
      By setting RQCF_UPDATED we really don't care if the skip is happening
      or not and will therefore make the assert_clock_updated() check happy.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: linux-kernel@vger.kernel.org
      Cc: rostedt@goodmis.org
      Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • powerpc/64s/idle: Fix restore of AMOR on POWER9 after deep sleep · be6a5ad5
      Nicholas Piggin authored
      [ Upstream commit c1b25a17 ]
      
      POWER8 restores AMOR when waking from deep sleep, but POWER9 does not,
      because it does not go through the subcore restore.
      
      Have POWER9 restore it in core restore.
      
      Fixes: ee97b6b9 ("powerpc/mm/radix: Setup AMOR in HV mode to allow key 0")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ocfs2/dlm: don't handle migrate lockres if already in shutdown · 839c27f7
      Jun Piao authored
      [ Upstream commit bb34f24c ]
      
      We should not handle migrate lockres if we are already in
      'DLM_CTXT_IN_SHUTDOWN', as that will cause the lockres to remain after
      leaving the dlm domain.  Eventually, other nodes will get stuck in an
      infinite loop when requesting a lock from us.

      The problem is caused by concurrent umount between nodes.  Before
      receiving N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as the
      migrate target.  So N2 will continue sending lockres to N1 even though
      N1 has left the domain.
      
              N1                             N2 (owner)
                                             touch file
      
          access the file,
          and get pr lock
      
                                             begin leave domain and
                                             pick up N1 as new owner
      
          begin leave domain and
          migrate all lockres done
      
                                             begin migrate lockres to N1
      
          end leave domain, but
          the lockres left
          unexpectedly, because
          migrate task has passed
      
      [piaojun@huawei.com: v3]
        Link: http://lkml.kernel.org/r/5A9CBD19.5020107@huawei.com
      Link: http://lkml.kernel.org/r/5A99F028.2090902@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • IB/rxe: Fix for oops in rxe_register_device on ppc64le arch · 9ebe2977
      Mikhail Malygin authored
      [ Upstream commit efc365e7 ]
      
      On the ppc64le arch, the rxe_add command causes an oops in the kernel log:
      
      [   92.495140] Oops: Kernel access of bad area, sig: 11 [#1]
      [   92.499710] SMP NR_CPUS=2048 NUMA pSeries
      [   92.499792] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) nf_conntrack_netlink(E) nfnetlink(E) xfrm_user(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) xt_addrtype(E) iptable_filter(E) ip_tables(E) xt_conntrack(E) x_tables(E) nf_nat(E) nf_conntrack(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) af_packet(E) rpcrdma(E) ib_isert(E) iscsi_target_mod(E) ib_iser(E) libiscsi(E) ib_srpt(E) target_core_mod(E) ib_srp(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) bochs_drm(E) ttm(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) drm(E) agpgart(E) virtio_rng(E) virtio_console(E) rtc_generic(E) dm_ec(OEN) ttln_rdma(OEN) rdma_cm(E) configfs(E) iw_cm(E) ib_cm(E) rdma_rxe(E) ip6_udp_tunnel(E) udp_tunnel(E) ib_core(E) qla2xxx(E)
      [   92.499832]  scsi_transport_fc(E) nvme_fc(E) nvme_fabrics(E) nvme_core(E) ipmi_watchdog(E) ipmi_ssif(E) ipmi_poweroff(E) ipmi_powernv(EX) ipmi_devintf(E) ipmi_msghandler(E) dummy(E) ext4(E) crc16(E) jbd2(E) mbcache(E) dm_service_time(E) scsi_transport_iscsi(E) sd_mod(E) sr_mod(E) cdrom(E) hid_generic(E) usbhid(E) virtio_blk(E) virtio_scsi(E) virtio_net(E) ibmvscsi(EX) scsi_transport_srp(E) xhci_pci(E) xhci_hcd(E) usbcore(E) usb_common(E) virtio_pci(E) virtio_ring(E) virtio(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E)
      [   92.499834] Supported: No, Unsupported modules are loaded
      [   92.499839] CPU: 3 PID: 5576 Comm: sh Tainted: G           OE   NX 4.4.120-ttln.17-default #1
      [   92.499841] task: c0000000afe8a490 ti: c0000000beba8000 task.ti: c0000000beba8000
      [   92.499842] NIP: c00000000008ba3c LR: c000000000027644 CTR: c00000000008ba10
      [   92.499844] REGS: c0000000bebab750 TRAP: 0300   Tainted: G           OE   NX  (4.4.120-ttln.17-default)
      [   92.499850] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28424428  XER: 20000000
      [   92.499871] CFAR: 0000000000002424 DAR: 0000000000000208 DSISR: 40000000 SOFTE: 1
                     GPR00: c000000000027644 c0000000bebab9d0 c000000000f09700 0000000000000000
                     GPR04: d0000000043d7192 0000000000000002 000000000000001a fffffffffffffffe
                     GPR08: 000000000000009c c00000000008ba10 d0000000043e5848 d0000000043d3828
                     GPR12: c00000000008ba10 c000000007a02400 0000000010062e38 0000010020388860
                     GPR16: 0000000000000000 0000000000000000 00000100203885f0 00000000100f6c98
                     GPR20: c0000000b3f1fcc0 c0000000b3f1fc48 c0000000b3f1fbd0 c0000000b3f1fb58
                     GPR24: c0000000b3f1fae0 c0000000b3f1fa68 00000000000005dc c0000000b3f1f9f0
                     GPR28: d0000000043e5848 c0000000b3f1f900 c0000000b3f1f320 c0000000b3f1f000
      [   92.499881] NIP [c00000000008ba3c] dma_get_required_mask_pSeriesLP+0x2c/0x1a0
      [   92.499885] LR [c000000000027644] dma_get_required_mask+0x44/0xac
      [   92.499886] Call Trace:
      [   92.499891] [c0000000bebab9d0] [c0000000bebaba30] 0xc0000000bebaba30 (unreliable)
      [   92.499894] [c0000000bebaba10] [c000000000027644] dma_get_required_mask+0x44/0xac
      [   92.499904] [c0000000bebaba30] [d0000000043cb4b4] rxe_register_device+0xc4/0x430 [rdma_rxe]
      [   92.499910] [c0000000bebabab0] [d0000000043c06c8] rxe_add+0x448/0x4e0 [rdma_rxe]
      [   92.499915] [c0000000bebabb30] [d0000000043d28dc] rxe_net_add+0x4c/0xf0 [rdma_rxe]
      [   92.499921] [c0000000bebabb60] [d0000000043d305c] rxe_param_set_add+0x6c/0x1ac [rdma_rxe]
      [   92.499924] [c0000000bebabbf0] [c0000000000e78c0] param_attr_store+0xa0/0x180
      [   92.499927] [c0000000bebabc70] [c0000000000e6448] module_attr_store+0x48/0x70
      [   92.499932] [c0000000bebabc90] [c000000000391f60] sysfs_kf_write+0x70/0xb0
      [   92.499935] [c0000000bebabcb0] [c000000000390f1c] kernfs_fop_write+0x18c/0x1e0
      [   92.499939] [c0000000bebabd00] [c0000000002e22ac] __vfs_write+0x4c/0x1d0
      [   92.499942] [c0000000bebabd90] [c0000000002e2f94] vfs_write+0xc4/0x200
      [   92.499945] [c0000000bebabde0] [c0000000002e488c] SyS_write+0x6c/0x110
      [   92.499948] [c0000000bebabe30] [c000000000009384] system_call+0x38/0xe4
      [   92.499949] Instruction dump:
      [   92.499954] 4e800020 3c4c00e8 3842dcf0 7c0802a6 f8010010 60000000 7c0802a6 fba1ffe8
      [   92.499958] fbc1fff0 fbe1fff8 f8010010 f821ffc1 <e9230208> 7c7e1b78 2fa90000 419e0078
      [   92.499962] ---[ end trace bed077e15eb420cf ]---
      
      It fails in dma_get_required_mask(), which has a ppc-specific implementation
      that fails if the provided device argument is NULL.
      Signed-off-by: Mikhail Malygin <mikhail@malygin.me>
      Reviewed-by: Yonatan Cohen <yonatanc@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: Fix possible softlock on single core machines · 370b3353
      Nikolay Borisov authored
      [ Upstream commit 1e1c50a9 ]
      
      do_chunk_alloc implements a loop checking whether there is a pending
      chunk allocation and, if so, causes the caller to loop. Generally this
      loop is executed only once; however, testing with btrfs/072 on a
      single-core VM uncovered an extreme case where the system could loop
      indefinitely. This is due to a missing cond_resched in the loop, which
      means the previous chunk allocator never gets a chance to finish its job.
      
      The fix is to simply add the missing cond_resched.
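
The shape of such a waiting loop can be sketched in userspace. This is only an analogy, not the btrfs code: `chunk_alloc_pending`, `wait_for_chunk_alloc()` and the `max_yields` bound are hypothetical names, and `sched_yield()` stands in for the kernel's `cond_resched()`.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for "another task is allocating a chunk". */
static atomic_bool chunk_alloc_pending;

/*
 * Wait until the pending allocation finishes, giving up the CPU on each
 * pass so the task doing the allocation can make progress. Returns the
 * number of times the caller yielded. max_yields bounds the sketch; the
 * kernel loop has no such bound.
 */
static unsigned long wait_for_chunk_alloc(unsigned long max_yields)
{
    unsigned long yields = 0;

    assert(max_yields > 0);
    while (atomic_load(&chunk_alloc_pending) && yields < max_yields) {
        sched_yield();  /* userspace analogue of cond_resched() */
        yields++;
    }
    return yields;
}
```

Without the yield, a waiter on a single CPU can spin while the task it is waiting for never runs.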
      
      Fixes: 6d74119f ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix NULL pointer dereference in log_dir_items · acfd8e88
      Liu Bo authored
      [ Upstream commit 80c0b421 ]
      
      btrfs_next_leaf() can return 0, 1, or a negative value. When it returns a
      negative value, path->nodes[0] could be NULL. log_dir_items lacks a check
      for the negative case, so we may run into a NULL pointer dereference panic.
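
A minimal sketch of the three-way return check, using toy stand-ins rather than the btrfs API: `struct toy_path`, `toy_next_leaf()` and `toy_log_step()` are hypothetical, and the `outcome` parameter simply forces the return value so each branch can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a btrfs path; leaf plays the role of path->nodes[0]. */
struct toy_path {
    const int *leaf;
};

/*
 * Toy analogue of btrfs_next_leaf(): 0 means "moved to the next leaf",
 * 1 means "no more leaves", and a negative value is an error, after
 * which the leaf pointer may be NULL.
 */
static int toy_next_leaf(struct toy_path *p, int outcome, const int *next)
{
    if (outcome < 0) {
        p->leaf = NULL;
        return outcome;
    }
    if (outcome == 1)
        return 1;
    p->leaf = next;
    return 0;
}

/* Reads the leaf only after ruling out both the error and the end cases. */
static int toy_log_step(struct toy_path *p, int outcome, const int *next, int *value)
{
    int ret = toy_next_leaf(p, outcome, next);

    if (ret < 0)
        return ret;   /* error: p->leaf must not be dereferenced */
    if (ret == 1)
        return 1;     /* normal end of iteration */
    assert(p->leaf != NULL);
    *value = *p->leaf;
    return 0;
}
```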
      
      Fixes: e02119d5 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: bail out on error during replay_dir_deletes · afef64b1
      Liu Bo authored
      [ Upstream commit b98def7c ]
      
      If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
      to bail out, otherwise @ret would be forced to be 0 after 'break;' and
      the caller won't be aware of it.
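
The control-flow problem can be illustrated with a toy loop. This is a sketch, not the actual replay_dir_deletes code: `toy_replay()` and its `results` array are hypothetical and merely simulate successive btrfs_next_leaf() return values.

```c
#include <assert.h>
#include <stddef.h>

/*
 * results[i] simulates the i-th return value of a next-leaf call:
 * 0 = continue, 1 = no more leaves, negative = error.
 */
static int toy_replay(const int *results, int n)
{
    int ret = 0;

    assert(results != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        ret = results[i];
        if (ret < 0)
            goto out;   /* bail out and keep the error code */
        if (ret == 1)
            break;      /* normal end: fall through to ret = 0 */
    }
    ret = 0;            /* reached only on the non-error paths */
out:
    return ret;
}
```

Using `break` for the error case as well would fall through to `ret = 0` and hide the failure from the caller.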
      
      Fixes: e02119d5 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: thp: fix potential clearing to referenced flag in page_idle_clear_pte_refs_one() · 5ade3c96
      Yang Shi authored
      [ Upstream commit f0849ac0 ]
      
      For PTE-mapped THP, the compound THP has not been split into normal 4K
      pages yet, so the whole THP is considered referenced if any one of its
      sub-pages is referenced.

      When walking PTE-mapped THP by pvmw, all relevant PTEs will be checked
      to retrieve the referenced bit.  But the current code just returns the
      result of the last PTE.  If the last PTE has not been referenced, the
      referenced flag will be cleared.
      
      Just set referenced when ptep{pmdp}_clear_young_notify() returns true.
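
The difference between overwriting and accumulating the flag can be sketched as follows. The helpers are toy stand-ins: `toy_clear_young()` only mimics the "test and clear" behaviour of ptep_clear_young_notify(), and `toy_clear_refs()` is a hypothetical walker over an array of young bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy "test and clear young": returns the old bit and clears it. */
static bool toy_clear_young(bool *young)
{
    bool was_young = *young;

    *young = false;
    return was_young;
}

/*
 * Walk every PTE of the mapping. The flag is only ever set, never
 * overwritten, so one referenced PTE anywhere marks the whole page
 * as referenced regardless of what the last PTE says.
 */
static bool toy_clear_refs(bool *ptes, size_t n)
{
    bool referenced = false;

    assert(ptes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (toy_clear_young(&ptes[i]))
            referenced = true;
    }
    return referenced;
}
```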
      
      Link: http://lkml.kernel.org/r/1518212451-87134-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: fix races between address_space dereference and free in page_evicatable · 8d700626
      Huang Ying authored
      [ Upstream commit e92bb4dd ]
      
      When page_mapping() is called and the mapping is dereferenced in
      page_evictable() through shrink_active_list(), it is possible for the
      inode to be truncated and the embedded address space to be freed at the
      same time.  This may lead to the following race.
      
      CPU1                                                CPU2
      
      truncate(inode)                                     shrink_active_list()
        ...                                                 page_evictable(page)
        truncate_inode_page(mapping, page);
          delete_from_page_cache(page)
            spin_lock_irqsave(&mapping->tree_lock, flags);
              __delete_from_page_cache(page, NULL)
                page_cache_tree_delete(..)
                  ...                                         mapping = page_mapping(page);
                  page->mapping = NULL;
                  ...
            spin_unlock_irqrestore(&mapping->tree_lock, flags);
            page_cache_free_page(mapping, page)
              put_page(page)
                if (put_page_testzero(page)) -> false
      - inode now has no pages and can be freed including embedded address_space
      
                                                              mapping_unevictable(mapping)
      							  test_bit(AS_UNEVICTABLE, &mapping->flags);
      - we've dereferenced mapping which is potentially already free.
      
      A similar race exists between swap cache freeing and page_evictable()
      too.

      The address_space in inode and swap cache will be freed after an RCU
      grace period.  So the races are fixed by enclosing the page_mapping()
      and address_space usage in rcu_read_lock/unlock().  Some comments are
      added in code to make it clear what is protected by the RCU read lock.
      
      Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/ksm: fix interaction with THP · 763111d9
      Claudio Imbrenda authored
      [ Upstream commit 77da2ba0 ]
      
      This patch fixes a corner case for KSM.  When two pages belong or
      belonged to the same transparent hugepage, and they should be merged,
      KSM fails to split the page, and therefore no merging happens.
      
      This bug can be reproduced by:
      * making sure ksm is running (disabling ksmtuned if necessary)
      * enabling transparent hugepages
      * allocating a THP-aligned 1-THP-sized buffer
        e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
      * filling it with the same values
        e.g. memset(p, 42, 1<<21)
      * performing madvise to make it mergeable
        e.g. madvise(p, 1<<21, MADV_MERGEABLE)
      * waiting for KSM to perform a few scans
      
      The expected outcome is that all the pages get merged (1 shared and
      the rest sharing); the actual outcome is that no pages get merged (1
      unshared and the rest volatile).
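
The allocation steps of the reproducer can be put together as a small C helper. This is a sketch assuming Linux with glibc: `ksm_thp_buffer()` and `THP_SIZE` are hypothetical names, and the 2 MiB size is the amd64 value from the steps above. Enabling KSM and THP, and waiting for the scans, still has to be done outside the program.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for madvise() under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE ((size_t)1 << 21)   /* 2 MiB: one THP on amd64 */

/*
 * Allocates one THP-aligned, THP-sized buffer, fills it with identical
 * bytes and asks KSM to consider it for merging. Returns the buffer
 * (caller frees it) or NULL if allocation failed. *madvise_errno is 0
 * on success, or the errno reported by madvise(), for example EINVAL
 * when the kernel was built without CONFIG_KSM.
 */
static void *ksm_thp_buffer(int *madvise_errno)
{
    void *p = NULL;

    assert(madvise_errno != NULL);
    *madvise_errno = 0;
    if (posix_memalign(&p, THP_SIZE, THP_SIZE) != 0)
        return NULL;
    memset(p, 42, THP_SIZE);
#ifdef MADV_MERGEABLE
    if (madvise(p, THP_SIZE, MADV_MERGEABLE) != 0)
        *madvise_errno = errno;
#else
    *madvise_errno = ENOSYS;   /* platform has no MADV_MERGEABLE */
#endif
    return p;
}
```

A caller would keep the buffer alive for a few KSM scan periods and then compare the counters under /sys/kernel/mm/ksm/ (pages_shared, pages_sharing, pages_unshared, pages_volatile) against the expected outcome described above.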
      
      The reason for this behaviour is that we increase the reference count
      once for both pages we want to merge, but if they belong to the same
      hugepage (or compound page), the reference counter used in both cases is
      the one of the head of the compound page.  This means that
      split_huge_page will find a value of the reference counter too high and
      will fail.
      
      This patch solves this problem by testing if the two pages to merge
      belong to the same hugepage when attempting to merge them.  If so, the
      hugepage is split safely.  This means that the hugepage is not split if
      not necessary.
      
      Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ibmvnic: Zero used TX descriptor counter on reset · 378a1e49
      Thomas Falcon authored
      [ Upstream commit 41f71467 ]
      
      The counter that tracks used TX descriptors pending completion
      needs to be zeroed as part of a device reset. This change fixes
      a bug causing transmit queues to be stopped unnecessarily and in
      some cases a transmit queue stall and timeout reset. If the counter
      is not reset, the remaining descriptors will not be "removed",
      effectively reducing queue capacity. If the queue is over half full,
      it will cause the queue to stall if stopped.
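
A toy model shows why a stale counter shrinks the usable queue. It is not the ibmvnic data structure: `struct toy_tx_queue`, `toy_xmit()`, `toy_reset()` and `toy_fill()` are hypothetical, and the `zero_used` flag exists only to compare the two behaviours.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_tx_queue {
    int capacity;   /* total TX descriptors */
    int used;       /* descriptors handed out, awaiting completion */
};

/* Hands out one descriptor; fails when the counter says the queue is full. */
static bool toy_xmit(struct toy_tx_queue *q)
{
    assert(q->used >= 0 && q->used <= q->capacity);
    if (q->used == q->capacity)
        return false;
    q->used++;
    return true;
}

/*
 * A reset drops the in-flight descriptors, so no completion will ever
 * decrement 'used' for them. Zeroing the counter gives the slots back.
 */
static void toy_reset(struct toy_tx_queue *q, bool zero_used)
{
    if (zero_used)
        q->used = 0;
}

/* Transmits until the queue reports full; returns how many were accepted. */
static int toy_fill(struct toy_tx_queue *q)
{
    int sent = 0;

    while (toy_xmit(q))
        sent++;
    return sent;
}
```

With three descriptors outstanding on a four-entry queue, a reset that leaves the counter alone accepts only one more packet, while a reset that zeroes it accepts four.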
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • dp83640: Ensure against premature access to PHY registers after reset · d04e5e72
      Esben Haabendal authored
      [ Upstream commit 76327a35 ]
      
      The datasheet specifies a 3uS pause after performing a software
      reset. The default implementation of genphy_soft_reset() does not
      provide this, so implement soft_reset with the needed pause.
      Signed-off-by: Esben Haabendal <eha@deif.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • perf clang: Add support for recent clang versions · 4be06bc0
      Sandipan Das authored
      [ Upstream commit 7854e499 ]
      
      The clang API calls used by perf have changed in recent releases and
      builds succeed with libclang-3.9 only. This introduces compatibility
      with libclang-4.0 and above.
      
      Without this patch, we will see the following compilation errors with
      libclang-4.0+:
      
       util/c++/clang.cpp: In function ‘clang::CompilerInvocation* perf::createCompilerInvocation(llvm::opt::ArgStringList, llvm::StringRef&, clang::DiagnosticsEngine&)’:
       util/c++/clang.cpp:62:33: error: ‘IK_C’ was not declared in this scope
         Opts.Inputs.emplace_back(Path, IK_C);
                                        ^~~~
       util/c++/clang.cpp: In function ‘std::unique_ptr<llvm::Module> perf::getModuleFromSource(llvm::opt::ArgStringList, llvm::StringRef, llvm::IntrusiveRefCntPtr<clang::vfs::FileSystem>)’:
       util/c++/clang.cpp:75:26: error: no matching function for call to ‘clang::CompilerInstance::setInvocation(clang::CompilerInvocation*)’
         Clang.setInvocation(&*CI);
                                 ^
       In file included from util/c++/clang.cpp:14:0:
       /usr/include/clang/Frontend/CompilerInstance.h:231:8: note: candidate: void clang::CompilerInstance::setInvocation(std::shared_ptr<clang::CompilerInvocation>)
          void setInvocation(std::shared_ptr<CompilerInvocation> Value);
               ^~~~~~~~~~~~~
      
      Committer testing:
      
      Tested on Fedora 27 after installing the clang-devel and llvm-devel
      packages, versions:
      
        # rpm -qa | egrep llvm\|clang
        llvm-5.0.1-6.fc27.x86_64
        clang-libs-5.0.1-5.fc27.x86_64
        clang-5.0.1-5.fc27.x86_64
        clang-tools-extra-5.0.1-5.fc27.x86_64
        llvm-libs-5.0.1-6.fc27.x86_64
        llvm-devel-5.0.1-6.fc27.x86_64
        clang-devel-5.0.1-5.fc27.x86_64
        #
      
      Make sure you don't have some older version lying around in /usr/local,
      etc, then:
      
        $ make LIBCLANGLLVM=1 -C tools/perf install-bin
      
      And in the end perf will be linked against these libraries:
      
        # ldd ~/bin/perf | egrep -i llvm\|clang
      	libclangAST.so.5 => /lib64/libclangAST.so.5 (0x00007f8bb2eb4000)
      	libclangBasic.so.5 => /lib64/libclangBasic.so.5 (0x00007f8bb29e3000)
      	libclangCodeGen.so.5 => /lib64/libclangCodeGen.so.5 (0x00007f8bb23f7000)
      	libclangDriver.so.5 => /lib64/libclangDriver.so.5 (0x00007f8bb2060000)
      	libclangFrontend.so.5 => /lib64/libclangFrontend.so.5 (0x00007f8bb1d06000)
      	libclangLex.so.5 => /lib64/libclangLex.so.5 (0x00007f8bb1a3e000)
      	libclangTooling.so.5 => /lib64/libclangTooling.so.5 (0x00007f8bb17d4000)
      	libclangEdit.so.5 => /lib64/libclangEdit.so.5 (0x00007f8bb15c5000)
      	libclangSema.so.5 => /lib64/libclangSema.so.5 (0x00007f8bb0cc9000)
      	libclangAnalysis.so.5 => /lib64/libclangAnalysis.so.5 (0x00007f8bb0a23000)
      	libclangParse.so.5 => /lib64/libclangParse.so.5 (0x00007f8bb0725000)
      	libclangSerialization.so.5 => /lib64/libclangSerialization.so.5 (0x00007f8bb039a000)
      	libLLVM-5.0.so => /lib64/libLLVM-5.0.so (0x00007f8bace98000)
      	libclangASTMatchers.so.5 => /lib64/../lib64/libclangASTMatchers.so.5 (0x00007f8bab735000)
      	libclangFormat.so.5 => /lib64/../lib64/libclangFormat.so.5 (0x00007f8bab4b2000)
      	libclangRewrite.so.5 => /lib64/../lib64/libclangRewrite.so.5 (0x00007f8bab2a1000)
      	libclangToolingCore.so.5 => /lib64/../lib64/libclangToolingCore.so.5 (0x00007f8bab08e000)
        #
      Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Fixes: 00b86691 ("perf clang: Add builtin clang support ant test case")
      Link: http://lkml.kernel.org/r/20180404180419.19056-2-sandipan@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4be06bc0
    • Sandipan Das's avatar
      perf tools: Fix perf builds with clang support · ee7c28b2
      Sandipan Das authored
      [ Upstream commit c2fb54a1 ]
      
      For libclang, some distro packages provide static libraries (.a) while
      some provide shared libraries (.so). Currently, perf code can only be
      linked with static libraries. This makes perf build possible for both
      cases.
      Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Fixes: d58ac0bf ("perf build: Add clang and llvm compile and linking support")
      Link: http://lkml.kernel.org/r/20180404180419.19056-1-sandipan@linux.vnet.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee7c28b2
    • Anshuman Khandual's avatar
      powerpc/fscr: Enable interrupts earlier before calling get_user() · 6689a4c7
      Anshuman Khandual authored
      [ Upstream commit 709b973c ]
      
      The function get_user() can sleep while trying to fetch the instruction
      from user address space, which causes the following warning from the
      scheduler.
      
      BUG: sleeping function called from invalid context
      
      Interrupts do get re-enabled, but only after get_user() has been
      called. This change enables interrupts earlier so that the get_user()
      call is covered. While at it, check for kernel mode and crash, as
      this interrupt should not have been triggered from kernel context.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6689a4c7
    • Shunyong Yang's avatar
      cpufreq: CPPC: Initialize shared perf capabilities of CPUs · 96fdc64d
      Shunyong Yang authored
      [ Upstream commit 8913315e ]
      
      When multiple CPUs are related in one cpufreq policy, the first online
      CPU will be chosen by default to handle cpufreq operations. Let's take
      cpu0 and cpu1 as an example.
      
      When cpu0 goes offline, policy->cpu shifts to cpu1, so cpu1's perf
      capabilities must be initialized. Otherwise, its perf capabilities are
      all zeros and speed changes cannot take effect.
      
      This patch copies perf capabilities of the first online CPU to other
      shared CPUs when policy shared type is CPUFREQ_SHARED_TYPE_ANY.
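      
      A minimal user-space sketch of the copy described above, assuming a flat
      array of per-CPU data and a boolean membership array in place of the
      kernel's cpumask. The names toy_perf_caps, toy_cpu and toy_share_caps()
      are hypothetical and are not the driver's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU performance capabilities. */
struct toy_perf_caps {
	unsigned int highest_perf;
	unsigned int lowest_perf;
};

struct toy_cpu {
	struct toy_perf_caps caps;
};

/*
 * Copy the capabilities of the first online CPU ("first") to every other
 * CPU that shares its policy, so that none of them is left with zeros.
 */
static void toy_share_caps(struct toy_cpu *cpus, const bool *in_policy,
			   size_t ncpus, size_t first)
{
	size_t i;

	assert(first < ncpus);
	for (i = 0; i < ncpus; i++) {
		if (i == first || !in_policy[i])
			continue;
		cpus[i].caps = cpus[first].caps;
	}
}
```

      With this in place, a CPU that later takes over policy->cpu already
      carries non-zero capabilities.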
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Shunyong Yang <shunyong.yang@hxt-semitech.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      96fdc64d
    • Carlos Maiolino's avatar
      Force log to disk before reading the AGF during a fstrim · 8bff7ca9
      Carlos Maiolino authored
      [ Upstream commit 8c81dd46 ]
      
      Forcing the log to disk after reading the AGF is wrong: we might be
      calling xfs_log_force with XFS_LOG_SYNC with a metadata lock held.
      
      This can cause a deadlock when racing a fstrim with a filesystem
      shutdown.
      
      The deadlock was identified due to a miscalculation bug in device-mapper
      dm-thin, which reports lack of space to its users earlier than the device itself
      really runs out of space, putting the device-mapper volume into an error state.
      
      The problem happened while filling the filesystem with a single file,
      triggering the bug in device-mapper, consequently causing an IO error
      and shutting down the filesystem.
      
      If that file is removed and fstrim is executed before XFS finishes the
      shutdown process, the fstrim process will end up holding the buffer
      lock and going to sleep on the cil wait queue.
      
      At this point, the shutdown process will try to wake up all the threads
      waiting on the cil wait queue, but to do so it will try to take the
      same buffer lock already held by the fstrim, locking up the filesystem.
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8bff7ca9
    • Jens Axboe's avatar
      sr: get/drop reference to device in revalidate and check_events · 28143fe3
      Jens Axboe authored
      [ Upstream commit 2d097c50 ]
      
      We can't just use scsi_cd() to get the scsi_cd structure; we have
      to grab a live reference to the device. For both callbacks, we're
      not inside an open where we already hold a reference to the device.
      
      This fixes device removal/addition under concurrent device access,
      which otherwise could result in the below oops.
      
      NULL pointer dereference at 0000000000000010
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in:
      sr 12:0:0:0: [sr2] scsi-1 drive
       scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core sb_edac xl
      sr 12:0:0:0: Attached scsi CD-ROM sr2
       sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress zlib_defc
      sr 12:0:0:0: Attached scsi generic sg7 type 5
       igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif]
      CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
      Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
      RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
      RSP: 0018:ffff883ff357bb58 EFLAGS: 00010292
      RAX: ffffffffa00b07d0 RBX: ffff883ff3058000 RCX: ffff883ff357bb66
      RDX: 0000000000000003 RSI: 0000000000007530 RDI: ffff881fea631000
      RBP: 0000000000000000 R08: ffff881fe4d38400 R09: 0000000000000000
      R10: 0000000000000000 R11: 00000000000001b6 R12: 000000000800005d
      R13: 000000000800005d R14: ffff883ffd9b3790 R15: 0000000000000000
      FS:  00007f7dc8e6d8c0(0000) GS:ffff883fff340000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000010 CR3: 0000003ffda98005 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       ? __invalidate_device+0x48/0x60
       check_disk_change+0x4c/0x60
       sr_block_open+0x16/0xd0 [sr_mod]
       __blkdev_get+0xb9/0x450
       ? iget5_locked+0x1c0/0x1e0
       blkdev_get+0x11e/0x320
       ? bdget+0x11d/0x150
       ? _raw_spin_unlock+0xa/0x20
       ? bd_acquire+0xc0/0xc0
       do_dentry_open+0x1b0/0x320
       ? inode_permission+0x24/0xc0
       path_openat+0x4e6/0x1420
       ? cpumask_any_but+0x1f/0x40
       ? flush_tlb_mm_range+0xa0/0x120
       do_filp_open+0x8c/0xf0
       ? __seccomp_filter+0x28/0x230
       ? _raw_spin_unlock+0xa/0x20
       ? __handle_mm_fault+0x7d6/0x9b0
       ? list_lru_add+0xa8/0xc0
       ? _raw_spin_unlock+0xa/0x20
       ? __alloc_fd+0xaf/0x160
       ? do_sys_open+0x1a6/0x230
       do_sys_open+0x1a6/0x230
       do_syscall_64+0x5a/0x100
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      Reviewed-by: Lee Duncan <lduncan@suse.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      28143fe3
    • Xidong Wang's avatar
      z3fold: fix memory leak · 3a0de65a
      Xidong Wang authored
      [ Upstream commit 1ec6995d ]
      
      In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
      released on the error path taken when pool->compact_wq, which holds the
      return value of create_singlethread_workqueue(), is NULL.  This
      results in a memory leak.
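      
      The general shape of such an error path can be sketched in user-space C
      with goto-based unwinding. This is only an illustration under stated
      assumptions: toy_alloc(), toy_free(), toy_pool and toy_create_pool() are
      hypothetical, malloc() stands in for the kernel allocators, and the
      "fail" flags simulate allocation failures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_pool {
	void *unbuddied;	/* stands in for the __alloc_percpu() result */
	void *compact_wq;	/* stands in for the workqueue */
};

static int toy_live_allocs;	/* number of allocations not yet freed */

static void *toy_alloc(size_t size, int fail)
{
	void *p = fail ? NULL : malloc(size);

	if (p)
		toy_live_allocs++;
	return p;
}

static void toy_free(void *p)
{
	assert(p);
	free(p);
	toy_live_allocs--;
}

/* Each failure jumps to the label that releases what was set up so far. */
static struct toy_pool *toy_create_pool(int fail_percpu, int fail_wq)
{
	struct toy_pool *pool = toy_alloc(sizeof(*pool), 0);

	if (!pool)
		goto out;
	pool->unbuddied = toy_alloc(64, fail_percpu);
	if (!pool->unbuddied)
		goto out_pool;
	pool->compact_wq = toy_alloc(64, fail_wq);
	if (!pool->compact_wq)
		goto out_unbuddied;	/* without this step the buffer leaks */
	return pool;

out_unbuddied:
	toy_free(pool->unbuddied);
out_pool:
	toy_free(pool);
out:
	return NULL;
}
```

      Calling toy_create_pool(0, 1) simulates the workqueue failure; the
      counter toy_live_allocs returning to zero shows nothing was leaked.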
      
      [akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
      Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
      Signed-off-by: Xidong Wang <wangxidong_97@163.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a0de65a
    • Tom Abraham's avatar
      swap: divide-by-zero when zero length swap file on ssd · 2ab77381
      Tom Abraham authored
      [ Upstream commit a06ad633 ]
      
      Calling swapon() on a zero length swap file on SSD can lead to a
      divide-by-zero.
      
      Although creating such files isn't possible with mkswap and they would be
      considered invalid, it would be better for the swapon code to be more
      robust and handle this condition gracefully (return -EINVAL).
      Especially since the fix is small and straightforward.
      
      To help with wear leveling on SSD, the swapon syscall calculates a
      random position in the swap file using modulo p->highest_bit, which is
      set to maxpages - 1 in read_swap_header.
      
      If the swap file is zero length, read_swap_header sets maxpages=1 and
      last_page=0, resulting in p->highest_bit=0 and we divide-by-zero when we
      modulo p->highest_bit in swapon syscall.
      
      This can be prevented by having read_swap_header return zero if
      last_page is zero.
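      
      The arithmetic above can be sketched in user-space C. This is a
      simplified illustration, not the kernel code: toy_read_swap_header() and
      toy_pick_start() are hypothetical stand-ins, and only the names
      maxpages, last_page and highest_bit mirror the description.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for read_swap_header(): returns the number of
 * usable pages (maxpages), or 0 to signal an invalid header so that the
 * caller can fail with -EINVAL.
 */
static unsigned long toy_read_swap_header(unsigned long last_page)
{
	if (last_page == 0)	/* zero-length swap file: reject it */
		return 0;
	return last_page + 1;	/* pages 0..last_page */
}

/*
 * Hypothetical stand-in for the SSD start-position choice: reduce a
 * random value modulo highest_bit (maxpages - 1).
 */
static unsigned long toy_pick_start(unsigned long maxpages, unsigned long rnd)
{
	unsigned long highest_bit = maxpages - 1;

	assert(highest_bit != 0);	/* guaranteed by the header check */
	return 1 + rnd % highest_bit;
}
```

      Without the early return, a zero-length file would yield maxpages = 1,
      highest_bit = 0, and the modulo would divide by zero.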
      
      Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
      Signed-off-by: Thomas Abraham <tabraham@suse.com>
      Reported-by: <Mark.Landis@Teradata.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2ab77381
    • Danilo Krummrich's avatar
      fs/proc/proc_sysctl.c: fix potential page fault while unregistering sysctl table · 9c9844d9
      Danilo Krummrich authored
      [ Upstream commit a0b0d1c3 ]
      
      proc_sys_link_fill_cache() does not take currently unregistering sysctl
      tables into account, which might result in a page fault in
      sysctl_follow_link() - add a check to fix it.
      
      This bug has been present since v3.4.
      
      Link: http://lkml.kernel.org/r/20180228013506.4915-1-danilokrummrich@dk-develop.de
      Fixes: 0e47c99d ("sysctl: Replace root_list with links between sysctl_table_sets")
      Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
      Acked-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c9844d9
    • Dave Hansen's avatar
      x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init · 59bdc587
      Dave Hansen authored
      [ Upstream commit 639d6aaf ]
      
      __ro_after_init data gets stuck in the .rodata section.  That's normally
      fine because the kernel itself manages the R/W properties.
      
      But, if we run __change_page_attr() on an area which is __ro_after_init,
      the .rodata checks will trigger and force the area to be immediately
      read-only, even if it is early-ish in boot.  This caused problems when
      trying to clear the _PAGE_GLOBAL bit for these areas in the PTI code:
      it cleared _PAGE_GLOBAL like I asked, but also took it upon itself
      to clear _PAGE_RW.  The kernel then oopses the next time it writes to
      a __ro_after_init data structure.
      
      To fix this, add the kernel_set_to_readonly check, just like we have
      for kernel text, just a few lines below in this function.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205514.8D898241@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      59bdc587
    • Joerg Roedel's avatar
      x86/pgtable: Don't set huge PUD/PMD on non-leaf entries · c1af6891
      Joerg Roedel authored
      [ Upstream commit e3e28812 ]
      
      The pmd_set_huge() and pud_set_huge() functions are used from
      the generic ioremap() code to establish large mappings where this
      is possible.
      
      But the generic ioremap() code does not check whether the
      PMD/PUD entries are already populated with a non-leaf entry,
      so that any page-table pages these entries point to will be
      lost.
      
      Further, on x86-32 with SHARED_KERNEL_PMD=0, this causes a
      BUG_ON() in vmalloc_sync_one() when PMD entries are synced
      from swapper_pg_dir to the current page-table. This happens
      because the PMD entry from swapper_pg_dir was promoted to a
      huge-page entry while the current PGD still contains the
      non-leaf entry. Because both entries are present and point
      to a different page, the BUG_ON() triggers.
      
      This was actually triggered with pti-x32 enabled in a KVM
      virtual machine by the graphics driver.
      
      A real and better fix for that would be to improve the
      page-table handling in the generic ioremap() code. But that is
      out-of-scope for this patch-set and left for later work.
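      
      The guard named in the subject line can be sketched as follows. This is
      a toy model under stated assumptions: toy_pmd_t, the TOY_* flag bits and
      toy_pmd_set_huge() are hypothetical and do not reflect the real x86
      page-table encoding or the kernel's helpers.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PRESENT 0x1UL	/* entry is in use */
#define TOY_HUGE    0x2UL	/* entry is a leaf mapping a large page */

typedef struct {
	unsigned long val;
} toy_pmd_t;

/*
 * Install a huge mapping unless the entry is a present non-leaf entry,
 * i.e. it points to a lower-level page table that would be lost.
 * Returns 1 on success and 0 when the caller must use small pages.
 */
static int toy_pmd_set_huge(toy_pmd_t *pmd, unsigned long addr)
{
	assert(pmd != NULL);
	if ((pmd->val & TOY_PRESENT) && !(pmd->val & TOY_HUGE))
		return 0;
	pmd->val = addr | TOY_PRESENT | TOY_HUGE;
	return 1;
}
```

      Returning 0 lets the caller fall back to mapping the range with
      smaller pages instead of overwriting the existing entry.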
      Reported-by: David H. Gutteridge <dhgutteridge@sympatico.ca>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Waiman Long <llong@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180411152437.GC15462@8bytes.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c1af6891
    • Filipe Manana's avatar
      Btrfs: fix loss of prealloc extents past i_size after fsync log replay · c527ab91
      Filipe Manana authored
      [ Upstream commit 471d557a ]
      
      Currently if we allocate extents beyond an inode's i_size (through the
      fallocate system call) and then fsync the file, we log the extents but
      after a power failure we replay them and then immediately drop them.
      This behaviour has existed since about 2009, commit c71bf099 ("Btrfs:
      Avoid orphan inodes cleanup while replaying log"), because it marks
      the inode as an orphan instead of dropping any extents beyond i_size
      before replaying logged extents, so after the log replay, and while
      the mount operation is still ongoing, we find the inode marked as an
      orphan and then perform a truncation (drop extents beyond the inode's
      i_size). Because the processing of orphan inodes is still done
      right after replaying the log and before the mount operation finishes,
      the intention of that commit does not make any sense (at least as
      of today). However, reverting that behaviour is not enough, because
      we cannot simply discard all extents beyond i_size and then replay
      logged extents, because we risk dropping extents beyond i_size created
      in past transactions, for example:
      
        add prealloc extent beyond i_size
        fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode
        transaction commit
        add another prealloc extent beyond i_size
        fsync - triggers the fast fsync path
        power failure
      
      In that scenario, we would drop the first extent and then replay the
      second one. To fix this, just make sure that all prealloc extents
      beyond i_size are logged, and if we find too many (which is far from
      a common case), fall back to a full transaction commit (like we do when
      logging regular extents in the fast fsync path).
      
      Trivial reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
       $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo
       $ sync
       $ xfs_io -c "falloc -k 256K 1M" /mnt/foo
       $ xfs_io -c "fsync" /mnt/foo
       <power failure>
      
       # mount to replay log
       $ mount /dev/sdb /mnt
       # at this point the file only has one extent, at offset 0, size 256K
      
      A test case for fstests follows soon, covering multiple scenarios that
      involve adding prealloc extents with previous shrinking truncates and
      without such truncates.
      
      Fixes: c71bf099 ("Btrfs: Avoid orphan inodes cleanup while replaying log")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c527ab91
    • Liu Bo's avatar
      Btrfs: clean up resources during umount after trans is aborted · f2924e32
      Liu Bo authored
      [ Upstream commit af722733 ]
      
      Currently, if some fatal error occurs, such as all I/O getting -EIO,
      resources are cleaned up when
      a) the transaction is being committed or
      b) BTRFS_FS_STATE_ERROR is set
      
      However, in some rare cases, resources may be left behind after the
      transaction gets aborted and umount may run into some ASSERT(), e.g.
      ASSERT(list_empty(&block_group->dirty_list));
      
      For case a), in btrfs_commit_transaction(), there are several places at the
      beginning where we just call btrfs_end_transaction() without cleaning up
      resources.  For case b), it is possible that the trans handle doesn't have
      any dirty stuff, so only the trans handle is marked as aborted while
      BTRFS_FS_STATE_ERROR is not set, and resources remain in memory.
      
      This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
      no resources stay in memory after umount.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f2924e32
    • Johannes Thumshirn's avatar
      nvme: don't send keep-alives to the discovery controller · 1908ca22
      Johannes Thumshirn authored
      [ Upstream commit 74c6c715 ]
      
      NVMe over Fabrics 1.0 Section 5.2 "Discovery Controller Properties and
      Command Support" Figure 31 "Discovery Controller – Admin Commands"
      explicitly lists all commands but "Get Log Page" and "Identify" as
      reserved, but NetApp reports that the Linux host is sending Keep Alive
      commands to the discovery controller, which is a violation of the
      spec.
      
      We're already checking for discovery controllers when configuring the
      keep alive timeout, but when creating a discovery controller we're not
      hard wiring the keep alive timeout to 0 and thus remain on
      NVME_DEFAULT_KATO for the discovery controller.
      
      This can be easily reproduced by issuing a direct connect to the
      discovery subsystem using:
      'nvme connect [...] --nqn=nqn.2014-08.org.nvmexpress.discovery'
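      
      A minimal sketch of hard wiring the timeout, assuming a simplified
      options structure. toy_ctrl_opts, toy_parse_opts() and the value of
      TOY_DEFAULT_KATO are hypothetical; only the discovery NQN string is
      taken from the command above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_DISCOVERY_NQN "nqn.2014-08.org.nvmexpress.discovery"
#define TOY_DEFAULT_KATO  5	/* seconds; illustrative value only */

struct toy_ctrl_opts {
	bool discovery_nqn;	/* connecting to the discovery subsystem? */
	unsigned int kato;	/* keep-alive timeout; 0 disables keep-alives */
};

/* Fill in the options for a connection to the subsystem named by nqn. */
static void toy_parse_opts(struct toy_ctrl_opts *opts, const char *nqn)
{
	assert(opts != NULL && nqn != NULL);
	opts->kato = TOY_DEFAULT_KATO;
	opts->discovery_nqn = strcmp(nqn, TOY_DISCOVERY_NQN) == 0;
	if (opts->discovery_nqn)
		opts->kato = 0;	/* never send Keep Alive to discovery */
}
```

      Any other subsystem NQN keeps the default timeout in this sketch.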
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Fixes: 07bfcd09 ("nvme-fabrics: add a generic NVMe over Fabrics library")
      Reported-by: Martin George <marting@netapp.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1908ca22
    • Jean Delvare's avatar
      firmware: dmi_scan: Fix UUID length safety check · 145b7e06
      Jean Delvare authored
      [ Upstream commit 90fe6f8f ]
      
      The test which ensures that the DMI type 1 structure is long enough
      to hold the UUID is off by one. It would fail if the structure is
      exactly 24 bytes long, while that's sufficient to hold the UUID.
      
      I don't expect this bug to cause problems in practice because all
      implementations I have seen had length 8, 25 or 27 bytes, in line
      with the SMBIOS specifications. But let's fix it still.
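      
      The off-by-one can be illustrated with two small predicates. This is a
      sketch only: the function names are hypothetical, and the UUID offset of
      8 used in the usage note is an assumption inferred from the 24-byte
      figure above (8 + 16).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_UUID_LEN 16	/* a UUID occupies 16 bytes */

/* Off-by-one check: rejects a structure that ends exactly at the UUID's end. */
static bool toy_uuid_fits_old(size_t length, size_t index)
{
	return !(length <= index + TOY_UUID_LEN);
}

/* Corrected check: bytes index .. index + 15 only need length >= index + 16. */
static bool toy_uuid_fits_new(size_t length, size_t index)
{
	return !(length < index + TOY_UUID_LEN);
}
```

      For a 24-byte structure with the UUID at offset 8, the old predicate
      rejects the structure while the corrected one accepts it.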
      Signed-off-by: Jean Delvare <jdelvare@suse.de>
      Fixes: a814c359 ("firmware: dmi_scan: Check DMI structure length")
      Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      145b7e06
    • Rich Felker's avatar
      sh: fix debug trap failure to process signals before return to user · d9179b4a
      Rich Felker authored
      [ Upstream commit 96a59899 ]
      
      When responding to a debug trap (breakpoint) in userspace, the
      kernel's trap handler raised SIGTRAP but returned from the trap via a
      code path that ignored pending signals, resulting in an infinite loop
      re-executing the trapping instruction.
      Signed-off-by: Rich Felker <dalias@libc.org>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9179b4a
    • Yelena Krivosheev's avatar
      net: mvneta: fix enable of all initialized RXQs · 4ee9130f
      Yelena Krivosheev authored
      [ Upstream commit e81b5e01 ]
      
      In mvneta_port_up() we enable the relevant RX and TX port queues by
      writing the queues' bitmap to the appropriate register.
      
      q_map must be zero at the beginning of this process.
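      
      Why q_map has to start from zero can be shown with a small user-space
      sketch, assuming one accumulator variable is reused for the TX and RX
      bitmaps. toy_maps and toy_port_up() are hypothetical, and the
      reset_between flag exists only to contrast the two behaviours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_maps {
	uint32_t txq;	/* bitmap that would be written for TX queues */
	uint32_t rxq;	/* bitmap that would be written for RX queues */
};

/*
 * Build both bitmaps with one shared accumulator. With reset_between set
 * to false, bits collected for TX leak into the RX bitmap.
 */
static struct toy_maps toy_port_up(const bool *tx_init, unsigned int ntx,
				   const bool *rx_init, unsigned int nrx,
				   bool reset_between)
{
	struct toy_maps maps;
	uint32_t q_map = 0;
	unsigned int queue;

	assert(ntx <= 32 && nrx <= 32);	/* one bit per queue */

	for (queue = 0; queue < ntx; queue++)
		if (tx_init[queue])
			q_map |= UINT32_C(1) << queue;
	maps.txq = q_map;

	if (reset_between)
		q_map = 0;	/* start the RX bitmap from zero */

	for (queue = 0; queue < nrx; queue++)
		if (rx_init[queue])
			q_map |= UINT32_C(1) << queue;
	maps.rxq = q_map;

	return maps;
}
```

      With three initialized TX queues and only RX queue 0 initialized, the
      RX bitmap is 0x1 when the accumulator is reset and 0x7 when it is not.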
      Signed-off-by: Yelena Krivosheev <yelena@marvell.com>
      Acked-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4ee9130f