1. 18 Aug, 2017 10 commits
  2. 17 Aug, 2017 2 commits
  3. 15 Aug, 2017 1 commit
  4. 11 Aug, 2017 3 commits
    • cfq: Give a chance for arming slice idle timer in case of group_idle · b3193bc0
      Ritesh Harjani authored
      In the scenario below, blkio cgroups are not served according to
      their assigned weights:
      1. The underlying device is non-rotational with a single HW queue
      of depth >= CFQ_HW_QUEUE_MIN.
      2. Two blkio cgroups, cg1 (weight 1000) and cg2 (weight 100), each
      run one process (file1 and file2 respectively) doing sync IO.
      
      fio results for the above use case (without this patch):
      file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan  1 19:41:49 1970
        write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)
      <...>
      file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan  1 19:41:49 1970
        write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)
      <...>
      // Both processes get equal BW even though they belong to different
      cgroups with weights of 1000 (cg1) and 100 (cg2).
      
      In the above case (non-rotational NCQ devices), as soon as a request
      from cg1 completes, and even though cg1 has been given a larger slice
      (set_slice=10), CFQ expires the group when the driver tries to fetch
      the next request, without granting it any idle time or weight
      priority, and schedules another cfq group (here cg2). The two cfq
      groups (cg1 and cg2) therefore keep alternating for disk time, and
      cgroup weight-based scheduling is lost.
      
      This patch gives the CFQ algorithm (cfq_arm_slice_timer) a chance to
      arm the slice timer when group_idle is enabled. If group_idle is not
      wanted either (including on non-rotational NCQ drives), it must be
      explicitly set to 0 through sysfs.
      
      fio results for the above use case with this patch:
      file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan  1 00:06:08 1970
        write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)
      <..>
      file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan  1 00:06:08 1970
        write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)
      <..>
      // Here each process's BW follows its cgroup's weight.
      Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
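To put the two fio runs quoted in the cfq commit side by side, the sketch below computes the file1/file2 bandwidth ratio for each run, plus the ratio the configured weights would give if bandwidth were split strictly in proportion to weight. The function names are illustrative, and strict proportionality is an assumption of this sketch, not something the commit message claims.

```c
#include <stdio.h>

/* Ratio of two bandwidth figures given in the same unit (e.g. MiB/s). */
double bw_ratio(double bw_a, double bw_b)
{
    return bw_a / bw_b;
}

/* Ratio that a strictly weight-proportional split would produce. */
double weight_ratio(unsigned int weight_a, unsigned int weight_b)
{
    return (double)weight_a / (double)weight_b;
}

/* Print the comparison for the figures quoted in the commit message. */
void print_cfq_comparison(void)
{
    printf("without patch: %.2f\n", bw_ratio(41.1, 40.5));
    printf("with patch:    %.2f\n", bw_ratio(53.3, 32.6));
    printf("weights:       %.2f\n", weight_ratio(1000, 100));
}
```

With the quoted numbers the file1/file2 ratio moves from just above 1.0 to a little over 1.6, while the weights themselves are 10:1, so in this test the patch restores differentiation between the groups rather than a strictly proportional split.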
    • block, bfq: boost throughput with flash-based non-queueing devices · edaf9428
      Paolo Valente authored
      When a queue associated with a process remains empty, there are cases
      where throughput gets boosted if the device is idled to await the
      arrival of a new I/O request for that queue. Currently, BFQ assumes
      that one of these cases is when the device has no internal queueing
      (regardless of the properties of the I/O being served). Unfortunately,
      this condition has proved to be too general. So, this commit refines it
      as "the device has no internal queueing and is rotational".
      
      This refinement provides a significant throughput boost with random
      I/O, on flash-based storage without internal queueing. For example, on
      a HiKey board, throughput increases by up to 125%, growing, e.g., from
      6.9MB/s to 15.6MB/s with two or three random readers in parallel.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
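The refinement described in the bfq commit above can be modelled as a change to a single predicate. This is a simplified sketch of that one condition only, not BFQ's full idling decision; the struct and function names are hypothetical and do not come from the kernel source.

```c
#include <stdbool.h>

/* Simplified device description; field names are hypothetical. */
struct dev_props {
    bool rotational;        /* true for spinning disks */
    bool internal_queueing; /* true if the device queues requests internally */
};

/* Condition as described before the commit: no internal queueing. */
bool idle_case_before(const struct dev_props *d)
{
    return !d->internal_queueing;
}

/* Refined condition: no internal queueing and rotational. */
bool idle_case_after(const struct dev_props *d)
{
    return !d->internal_queueing && d->rotational;
}
```

For a flash device without internal queueing (the HiKey case in the message), the first predicate holds and the second does not, which is the situation where the commit reports the throughput gain.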
    • block,bfq: refactor device-idling logic · d5be3fef
      Paolo Valente authored
      The logic that decides whether to idle the device is scattered across
      three functions. Almost all of the logic is in the function
      bfq_bfqq_may_idle, but (1) part of the decision is made in
      bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
      switch off idling regardless of the output of bfq_bfqq_may_idle. In
      addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
      their decisions as a function of parameters that are used, for similar
      purposes, also in bfq_bfqq_may_idle. This commit addresses these
      issues by moving all the logic into bfq_bfqq_may_idle.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 10 Aug, 2017 1 commit
  6. 09 Aug, 2017 7 commits
  7. 07 Aug, 2017 2 commits
  8. 02 Aug, 2017 1 commit
  9. 01 Aug, 2017 1 commit
    • blk-mq: add warning to __blk_mq_run_hw_queue() for ints disabled · b7a71e66
      Jens Axboe authored
      We recently had a bug in the IPR SCSI driver, where it would end up
      making the SCSI mid layer run the mq hardware queue with interrupts
      disabled. This isn't legal, since the software queue locking relies
      on never being grabbed from interrupt context. Additionally, drivers
      that set BLK_MQ_F_BLOCKING may schedule from this context.
      
      Add a WARN_ON_ONCE() to catch bad users up front.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
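The "warn once" behaviour that the blk-mq commit above relies on can be illustrated in userspace. The sketch below is a stand-in, not the kernel's WARN_ON_ONCE() macro: the helper, the counter, and the queue-run function are all hypothetical names, and the "interrupts disabled" state is passed in as a plain flag.

```c
#include <stdbool.h>
#include <stdio.h>

int warnings_printed; /* how many warnings have been emitted so far */

/*
 * Report the first time cond is true, stay silent afterwards, and
 * return cond so callers can still branch on it.
 */
bool warn_once_if(bool cond, const char *what)
{
    static bool warned;

    if (cond && !warned) {
        warned = true;
        warnings_printed++;
        fprintf(stderr, "WARNING: %s\n", what);
    }
    return cond;
}

/* Hypothetical queue-run entry point guarded by the check. */
void run_hw_queue_sketch(bool irqs_disabled)
{
    warn_once_if(irqs_disabled, "hw queue run with interrupts disabled");
    /* ... request dispatch would happen here ... */
}
```

The point of warning only once is that a misbehaving caller is flagged early without flooding the log on every subsequent queue run.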
  10. 29 Jul, 2017 12 commits
    • blk-mq: blk_mq_requeue_work() doesn't need to save IRQ flags · 18e9781d
      Jens Axboe authored
      We know we're in process context, so don't bother using the
      IRQ safe versions of the spin lock.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: DAC960: shut up format-overflow warning · 33027c2b
      Arnd Bergmann authored
      gcc-7 points out that a large controller number would overflow the
      string length for the procfs name and the firmware version string:
      
      drivers/block/DAC960.c: In function 'DAC960_Probe':
      drivers/block/DAC960.c:6591:38: warning: 'sprintf' may write a terminating nul past the end of the destination [-Wformat-overflow=]
      drivers/block/DAC960.c: In function 'DAC960_V1_ReadControllerConfiguration':
      drivers/block/DAC960.c:1681:40: error: '%02d' directive writing between 2 and 3 bytes into a region of size between 2 and 5 [-Werror=format-overflow=]
      drivers/block/DAC960.c:1681:40: note: directive argument in the range [0, 255]
      drivers/block/DAC960.c:1681:3: note: 'sprintf' output between 10 and 14 bytes into a destination of size 12
      
      Both of these seem appropriately sized, and using snprintf()
      instead of sprintf() improves this by ensuring that even
      incorrect data won't cause undefined behavior here.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
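The sprintf()-to-snprintf() change in the DAC960 commit above follows a general pattern: bound the write by the destination size and use the return value to detect truncation. The sketch below shows that pattern with a 12-byte buffer and arguments up to 255, as in the gcc notes; the helper name and the format string are assumptions for illustration and are not taken from the driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a version-like string into dst without writing past dst_len.
 * Returns true if the whole string fit, false if it was truncated
 * (dst is still NUL-terminated whenever dst_len > 0).
 */
bool format_version(char *dst, size_t dst_len,
                    unsigned int major, unsigned int minor,
                    char type, unsigned int turn)
{
    int n = snprintf(dst, dst_len, "%u.%02u-%c-%02u",
                     major, minor, type, turn);

    /* snprintf returns the length it wanted to write, excluding the NUL. */
    return n >= 0 && (size_t)n < dst_len;
}
```

With small values the string fits in 12 bytes; with every numeric field at 255 it needs 13 characters plus the terminator, so the call reports truncation instead of overflowing the buffer.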
    • block: use standard blktrace API to output cgroup info for debug notes · 35fe6d76
      Shaohua Li authored
      Currently cfq/bfq/blk-throttle each output cgroup info in the trace in
      their own way. There is now a standard blktrace API for this, so
      convert them to use it.
      
      Note that this changes the behavior slightly. cgroup info is no longer
      output by default; it is output only when the 'blk_cgroup' option is
      enabled. Likewise, cgroup info is not output as a string by default,
      only when the 'blk_cgname' option is enabled. The cgroup info also
      appears at a different position in the note string. I don't think these
      behavior changes are a big issue (they actually make the trace data
      shorter, which is good), since the blktrace note is solely for
      debugging.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blktrace: add an option to allow displaying cgroup path · 69fd5c39
      Shaohua Li authored
      By default we output the cgroup id in blktrace. This adds an option to
      display the cgroup path. Since getting the cgroup path is a relatively
      heavy operation, it is not enabled by default.
      
      With the option enabled, blktrace will output something like this:
      dd-1353  [007] d..2   293.015252:   8,0   /test/level  D   R 24 + 8 [dd]
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: always attach cgroup info into bio · 007cc56b
      Shaohua Li authored
      blkcg_bio_issue_check() already gets the blkcg for a BIO.
      bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
      operation. There is no reason not to attach the cgroup info to the bio
      in blkcg_bio_issue_check(). This also makes blktrace output the correct
      cgroup info.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blktrace: export cgroup info in trace · ca1136c9
      Shaohua Li authored
      Currently blktrace isn't cgroup aware. blktrace prints the task name of
      the current context, but that task isn't always in the cgroup the BIO
      comes from, so the task name can't be used to find the IO cgroup. For
      example, writeback BIOs always come from the flusher thread, but the
      BIOs belong to different blk cgroups. Requests can be requeued and
      dispatched from completely different tasks. MD/DM are other examples.
      
      This patch tries to fix the gap. We print out cgroup fhandle info in
      blktrace. Userspace can use open_by_handle_at() syscall to find the
      cgroup by fhandle. Or userspace can use name_to_handle_at() syscall to
      find fhandle for a cgroup and use a BPF program to filter out blktrace
      for a specific cgroup.
      
      We add a new 'blk_cgroup' trace option for the blk tracer. It is off by
      default, so applications that don't know about the new option aren't
      affected. When it's on, we output fhandle info right after blk_io_trace
      with an extra bit set in the event action. So from the application's
      point of view, blktrace with the option enabled outputs new actions.
      
      I didn't change blk trace event yet, since I'm not sure if changing the
      trace event output is an ABI issue. If not, I'll do it later.
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • cgroup: export fhandle info for a cgroup · 121508df
      Shaohua Li authored
      Add an API to export cgroup fhandle info. We don't export a full 'struct
      file_handle', since it contains info that isn't required. Specifically,
      a cgroup is always a directory, so we don't need a
      'FILEID_INO32_GEN_PARENT' type fhandle; we only need to export the inode
      number and generation number, just like generic_fh_to_dentry does. We
      can also avoid the overhead of getting an inode, since kernfs_node_id
      (ino and generation) has all the info required.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • kernfs: add exportfs operations · aa818825
      Shaohua Li authored
      Now we have the facilities to implement exportfs operations. The idea is
      cgroup can export the fhandle info to userspace, then userspace uses
      fhandle to find the cgroup name. Another example is userspace can get
      fhandle for a cgroup and BPF uses the fhandle to filter info for the
      cgroup.
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • kernfs: introduce kernfs_node_id · c53cd490
      Shaohua Li authored
      inode number and generation can identify a kernfs node. We are going to
      export the identification by exportfs operations, so put ino and
      generation into a separate structure. It's convenient when later patches
      use the identification.
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
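The idea of identifying a node by inode number plus generation, as in the kernfs_node_id commit above, can be sketched in userspace as follows. This is a model for illustration only: the struct layout, the packing order, and the function names are assumptions, not the kernel's definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of a node identifier; not the kernel's definition. */
struct node_id {
    uint32_t ino;        /* inode number */
    uint32_t generation; /* bumped when an inode number gets reused */
};

/* Pack both fields into one 64-bit key (generation in the high half). */
uint64_t node_id_pack(struct node_id id)
{
    return ((uint64_t)id.generation << 32) | id.ino;
}

/* A saved id refers to the current node only if both fields match. */
bool node_id_matches(struct node_id saved, struct node_id current)
{
    return saved.ino == current.ino &&
           saved.generation == current.generation;
}
```

Keeping both fields in one structure means a handle saved earlier can be rejected as stale when the same inode number later reappears with a different generation.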
    • kernfs: don't set dentry->d_fsdata · 319ba91d
      Shaohua Li authored
      When working on adding exportfs operations to kernfs, I found it hard
      to initialize dentry->d_fsdata in the exportfs operations. There seems
      to be no way to do it without a race condition. Looking at the kernfs
      code closely, there is no need to set dentry->d_fsdata:
      inode->i_private already points to the kernfs_node, and we can get the
      inode from a dentry. So this patch just deletes the d_fsdata usage.
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • kernfs: add an API to get kernfs node from inode number · ba16b284
      Shaohua Li authored
      Add an API to get kernfs node from inode number. We will need this to
      implement exportfs operations.
      
      This API will be used in blktrace too later, so it should be as fast as
      possible. To make the API lock free, kernfs node is freed in RCU
      context. And we depend on kernfs_node count/ino number to filter out
      stale kernfs nodes.
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • kernfs: implement i_generation · 4a3ef68a
      Shaohua Li authored
      Set i_generation for kernfs inodes. This is required to implement
      exportfs operations. The generation is 32-bit, so it can wrap around
      and cause stale files to be found. To reduce that possibility, we
      don't reuse an inode number immediately. When inode number allocation
      wraps, we increase the generation number. This way generation and
      inode number together form a 64-bit number that is unlikely to be
      duplicated. This does make the idr tree more sparse and wastes some
      memory. Since idr manages 32-bit keys, idr uses a 6-level radix tree,
      with each level covering 6 bits of the key. In a kernfs with 100k
      inodes, the worst case will have around 300k radix tree nodes. Each
      node is 576 bytes, so the tree will use about 150M of memory. That
      doesn't sound too bad; if it really is a problem, we should find a
      better data structure.
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
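The memory estimate in the i_generation commit above is simple arithmetic and can be reproduced with two small helpers. The sketch takes the key width, bits per level, node count, and node size straight from the commit message; the function names are illustrative.

```c
#include <stdint.h>

/* Levels needed to cover key_bits when each level handles bits_per_level. */
unsigned int radix_levels(unsigned int key_bits, unsigned int bits_per_level)
{
    /* ceiling division */
    return (key_bits + bits_per_level - 1) / bits_per_level;
}

/* Total bytes used by `nodes` tree nodes of `node_bytes` each. */
uint64_t radix_tree_bytes(uint64_t nodes, uint64_t node_bytes)
{
    return nodes * node_bytes;
}
```

For a 32-bit key at 6 bits per level this gives 6 levels, and 300k nodes of 576 bytes come to 172,800,000 bytes, which is in the same range as the rough figure quoted in the message.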