15 Jun, 2019 10 commits
    • blkcg, writeback: dead memcgs shouldn't contribute to writeback ownership arbitration · 66311422
      Tejun Heo authored
      wbc_account_io() collects information on cgroup ownership of writeback
      pages to determine which cgroup should own the inode.  Pages can stay
      associated with dead memcgs but we want to avoid attributing IOs to
      dead blkcgs as much as possible as the association is likely to be
      stale.  However, currently, pages associated with dead memcgs
      contribute to the accounting, delaying and/or confusing the
      arbitration.
      
      Fix it by ignoring pages associated with dead memcgs.
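
      The shape of the fix, as a simplified sketch of the check added to
      wbc_account_io() (css here is the page's memcg css):

        struct cgroup_subsys_state *css;

        css = mem_cgroup_css_from_page(page);
        /* dead cgroups shouldn't contribute to inode ownership arbitration */
        if (!(css->flags & CSS_ONLINE))
                return;

        id = css->id;
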
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: blkcg_activate_policy() should initialize ancestors first · 71c81407
      Tejun Heo authored
      When blkcg_activate_policy() is creating blkg_policy_data for
      existing blkgs, it did so in the wrong order - descendants first.
      Fix it so that ancestors are initialized before their descendants.
      None of the existing controllers seem to be affected by this.
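
      A minimal sketch of the ordering fix, assuming new blkgs are
      inserted at the head of q->blkg_list so a reverse walk visits
      ancestors before their descendants:

        /* initialize ancestors first so pd_init_fn() can rely on parent state */
        list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
                if (pol->pd_init_fn)
                        pol->pd_init_fn(blkg->pd[pol->plid]);
        }
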
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: percpu_ref init/exit should be done from blkg_alloc/free() · ef069b97
      Tejun Heo authored
      blkg alloc is performed as a separate step from the rest of blkg
      creation so that GFP_KERNEL allocations can be used when creating
      blkgs from configuration file writes; otherwise, user actions may
      fail due to failures of opportunistic GFP_NOWAIT allocations.
      
      While making blkgs use percpu_ref, 7fcf2b03 ("blkcg: change blkg
      reference counting to use percpu_ref") incorrectly added unconditional
      opportunistic percpu_ref_init() to blkg_create() breaking this
      guarantee.
      
      This patch moves percpu_ref_init() to blkg_alloc() so that it uses
      the @gfp_mask that blkg_alloc() is called with.  Also,
      percpu_ref_exit() is moved to blkg_free() for consistency.
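
      A sketch of the reshuffled allocation path, assuming the v5.2-era
      signatures of blkg_alloc() and percpu_ref_init():

        blkg = kzalloc_node(sizeof(*blkg), gfp_mask, q->node);
        if (!blkg)
                return NULL;

        /* use the caller's @gfp_mask instead of opportunistic GFP_NOWAIT */
        if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask))
                goto err_free;
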
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 7fcf2b03 ("blkcg: change blkg reference counting to use percpu_ref")
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: update blkcg_print_stat() to handle larger outputs · f539da82
      Tejun Heo authored
      Depending on the number of devices, blkcg stats can go over the
      default seqfile buf size.  seqfile normally retries with a larger
      buffer, but since the ->pd_stat() addition, blkcg_print_stat()
      doesn't tell seqfile that an overflow has happened, so the output
      gets truncated.  Fix it by calling seq_commit() with -1 on possible
      overflows.
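
      A sketch of the pattern with the seq_get_buf()/seq_commit() pair
      (pd_stat_output below is a stand-in for the formatted stats):

        char *buf;
        size_t size = seq_get_buf(sf, &buf), off = 0;

        /* ->pd_stat() and friends append their output into buf */
        off += scnprintf(buf + off, size - off, "%s\n", pd_stat_output);

        if (off < size - 1)
                seq_commit(sf, off);    /* everything fit */
        else
                seq_commit(sf, -1);     /* overflowed: seqfile retries bigger */
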
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 903d23f0 ("blk-cgroup: allow controllers to output their own stats")
      Cc: stable@vger.kernel.org # v4.19+
      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iolatency: clear use_delay when io.latency is set to zero · 5de0073f
      Tejun Heo authored
      If use_delay is non-zero when the latency target of a cgroup is set
      to zero, it stays stuck until io.latency is enabled on the cgroup
      again.  This keeps readahead disabled for the cgroup, impacting
      performance negatively.
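
      A sketch of the idea in iolatency_set_limit() (the exact hunk may
      differ; blkcg_clear_delay() is the relevant helper):

        oldval = iolat->min_lat_nsec;
        iolat->min_lat_nsec = lat_val;

        /* target cleared: drop any leftover delay so readahead resumes */
        if (oldval && !lat_val)
                blkcg_clear_delay(blkg);
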
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Fixes: d7067512 ("block: introduce blk-iolatency io controller")
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: bio: Use struct_size() in kmalloc() · f1f8f292
      Gustavo A. R. Silva authored
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct bio_map_data {
      	...
              struct iovec iov[];
      };
      
      instance = kmalloc(sizeof(struct bio_map_data) + sizeof(struct iovec) * count,
                         GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kmalloc(struct_size(instance, iov, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: genhd: Use struct_size() helper · 78b90a2c
      Gustavo A. R. Silva authored
      Make use of the struct_size() helper instead of an open-coded version
      in order to avoid any potential type mistakes, in particular in the
      context in which this code is being used.
      
      So, replace the following form:
      
      sizeof(*new_ptbl) + target * sizeof(new_ptbl->part[0])
      
      with:
      
      struct_size(new_ptbl, part, target)
      
      Also, notice that the variable size is unnecessary, hence it is
      removed.
      
      This code was detected with the help of Coccinelle.
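
      For reference, a hypothetical expansion of the overflow checking the
      helper provides (the real macro lives in include/linux/overflow.h
      and saturates rather than wrapping):

        size_t sz;

        if (check_mul_overflow((size_t)target, sizeof(new_ptbl->part[0]), &sz) ||
            check_add_overflow(sz, sizeof(*new_ptbl), &sz))
                sz = SIZE_MAX;  /* saturated size makes the allocation fail cleanly */
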
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: null_blk: fix race condition for null_del_dev · 7602843f
      Bob Liu authored
      A duplicate call of null_del_dev() will trigger a null pointer error
      like the one below.  The reason is a race condition between
      nullb_device_power_store() and nullb_group_drop_item():

        CPU#0                         CPU#1
        ----------------              -----------------
        do_rmdir()
         >configfs_rmdir()
          >client_drop_item()
           >nullb_group_drop_item()
                                      nullb_device_power_store()
                                       >null_del_dev()

            >test_and_clear_bit(NULLB_DEV_FL_UP)
             >null_del_dev()
             ^^^^^
             Duplicated null_del_dev() triggers the null pointer error

                                       >clear_bit(NULLB_DEV_FL_UP)

      The fix is to keep the sequence of clearing NULLB_DEV_FL_UP and
      calling null_del_dev() consistent: atomically test-and-clear the
      flag first and tear the device down only when the flag was actually
      set, as sketched after the oops log below.
      
      [  698.613600] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      [  698.613608] #PF error: [normal kernel read fault]
      [  698.613611] PGD 0 P4D 0
      [  698.613619] Oops: 0000 [#1] SMP PTI
      [  698.613627] CPU: 3 PID: 6382 Comm: rmdir Not tainted 5.0.0+ #35
      [  698.613631] Hardware name: LENOVO 20LJS2EV08/20LJS2EV08, BIOS R0SET33W (1.17 ) 07/18/2018
      [  698.613644] RIP: 0010:null_del_dev+0xc/0x110 [null_blk]
      [  698.613649] Code: 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b eb 97 e8 47 bb 2a e8 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 53 <8b> 77 18 48 89 fb 4c 8b 27 48 c7 c7 40 57 1e c1 e8 bf c7 cb e8 48
      [  698.613654] RSP: 0018:ffffb887888bfde0 EFLAGS: 00010286
      [  698.613659] RAX: 0000000000000000 RBX: ffff9d436d92bc00 RCX: ffff9d43a9184681
      [  698.613663] RDX: ffffffffc11e5c30 RSI: 0000000068be6540 RDI: 0000000000000000
      [  698.613667] RBP: ffffb887888bfdf0 R08: 0000000000000001 R09: 0000000000000000
      [  698.613671] R10: ffffb887888bfdd8 R11: 0000000000000f16 R12: ffff9d436d92bc08
      [  698.613675] R13: ffff9d436d94e630 R14: ffffffffc11e5088 R15: ffffffffc11e5000
      [  698.613680] FS:  00007faa68be6540(0000) GS:ffff9d43d14c0000(0000) knlGS:0000000000000000
      [  698.613685] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  698.613689] CR2: 0000000000000018 CR3: 000000042f70c002 CR4: 00000000003606e0
      [  698.613693] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  698.613697] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  698.613700] Call Trace:
      [  698.613712]  nullb_group_drop_item+0x50/0x70 [null_blk]
      [  698.613722]  client_drop_item+0x29/0x40
      [  698.613728]  configfs_rmdir+0x1ed/0x300
      [  698.613738]  vfs_rmdir+0xb2/0x130
      [  698.613743]  do_rmdir+0x1c7/0x1e0
      [  698.613750]  __x64_sys_rmdir+0x17/0x20
      [  698.613759]  do_syscall_64+0x5a/0x110
      [  698.613768]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
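
      A sketch of the fixed teardown, as applied to both racing paths;
      whichever side wins the test_and_clear_bit() does the teardown and
      the loser skips it:

        if (test_and_clear_bit(NULLB_DEV_FL_UP, &dev->flags)) {
                mutex_lock(&lock);
                dev->power = false;
                null_del_dev(dev->nullb);
                mutex_unlock(&lock);
        }
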
      Signed-off-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq/debugfs: Fix improper print qualifier · 315eb656
      Pavel Begunkov authored
      struct blk_rq_stat::mean is a u64 value, so print it with %llu.
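
      A sketch of the corrected print, assuming the surrounding
      print_stat() helper in block/blk-mq-debugfs.c:

        seq_printf(m, "samples=%d, mean=%llu, min=%llu, max=%llu",
                   stat->nr_samples, stat->mean, stat->min, stat->max);
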
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid10: read balance chooses idlest disk for SSD · e9eeba28
      Guoqing Jiang authored
      Andy reported that a raid10 array with SSD disks has poor read
      performance: in his tests, raid1 could sometimes be 3x faster than
      raid10 [1].

      The reason is that raid10 chooses the lowest-distance disk for a
      read request.  That approach doesn't work well for SSDs, which have
      no spindle like HDDs; instead we should read from the SSD with the
      least pending IO, as commit 9dedf603 ("md/raid1: read balance
      chooses idlest disk for SSD") did for raid1.

      So this commit selects the idlest SSD disk for reads if the array
      has no rotational disks; otherwise, read_balance keeps the previous
      distance-priority algorithm.  With the change, raid10 performance
      increases greatly per Andy's test [2].
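
      A sketch of the selection logic in read_balance() (md/raid10.c);
      names follow the upstream patch:

        /* inside the per-copy loop */
        nonrot = blk_queue_nonrot(bdev_get_queue(rdev->bdev));
        has_nonrot_disk |= nonrot;
        pending = atomic_read(&rdev->nr_pending);
        if (min_pending > pending && nonrot) {
                min_pending = pending;          /* idlest SSD so far */
                best_pending_slot = slot;
        }
        if (best_dist > new_distance) {
                best_dist = new_distance;       /* lowest seek distance so far */
                best_dist_slot = slot;
        }

        /* after the loop: an all-SSD array prefers the idlest disk */
        disk = has_nonrot_disk ? best_pending_slot : best_dist_slot;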
      
      [1] https://marc.info/?l=linux-raid&m=155915890004761&w=2
      [2] https://marc.info/?l=linux-raid&m=155990654223786&w=2
      Tested-by: Andy Smith <andy@strugglers.net>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>