1. 26 Jun, 2019 2 commits
  2. 25 Jun, 2019 7 commits
    • Paolo Valente's avatar
      block, bfq: re-schedule empty queues if they deserve I/O plugging · 3726112e
      Paolo Valente authored
      Consider, on one side, a bfq_queue Q that remains empty while in
      service, and, on the other side, the pending I/O of bfq_queues that,
      according to their timestamps, have to be served after Q.  If an
      uncontrolled amount of I/O from the latter bfq_queues were dispatched
      while Q is waiting for its new I/O to arrive, then Q's bandwidth
      guarantees would be violated. To prevent this, I/O dispatch is plugged
      until Q receives new I/O (except for a properly controlled amount of
      injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
      for the following reason.
      
      Preemption is performed in two steps. First, Q is expired and
      re-scheduled. Second, the new bfq_queue to serve is chosen. The first
      step is needed by the second, as the second can be performed only
      after Q's timestamps have been properly updated (done in the
      expiration step), and Q has been re-queued for service. This
      dependency is a consequence of the way how BFQ's scheduling algorithm
      is currently implemented.
      
      But Q is not re-scheduled at all in the first step, because Q is
      empty. As a consequence, an uncontrolled amount of I/O may be
      dispatched until Q becomes non empty again. This breaks Q's service
      guarantees.
      
      This commit addresses this issue by re-scheduling Q even if it is
      empty. This in turn breaks the assumption that all scheduled queues
      are non empty. Then a few extra checks are now needed.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3726112e
    • Paolo Valente's avatar
      block, bfq: preempt lower-weight or lower-priority queues · 96a291c3
      Paolo Valente authored
      BFQ enqueues the I/O coming from each process into a separate
      bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
      served for at most timeout_sync milliseconds (default: 125 ms). This
      service scheme is prone to the following inaccuracy.
      
      While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
      receive I/O, and, according to BFQ's scheduling policy, may become the
      right bfq_queue to serve, in place of the currently in-service
      bfq_queue. In this respect, postponing the service of Q2 to after the
      service of Q1 finishes may delay the completion of Q2's I/O, compared
      with an ideal service in which all non-empty bfq_queues are served in
      parallel, and every non-empty bfq_queue is served at a rate
      proportional to the bfq_queue's weight. This additional delay is equal
      at most to the time Q1 may unjustly remain in service before switching
      to Q2.
      
      If Q1 and Q2 have the same weight, then this time is most likely
      negligible compared with the completion time to be guaranteed to Q2's
      I/O. In addition, first, one of the reasons why BFQ may want to serve
      Q1 for a while is that this boosts throughput and, second, serving Q1
      longer reduces BFQ's overhead. As a conclusion, it is usually better
      not to preempt Q1 if both Q1 and Q2 have the same weight.
      
      In contrast, as Q2's weight or priority becomes higher and higher
      compared with that of Q1, the above delay becomes larger and larger,
      compared with the I/O completion times that have to be guaranteed to
      Q2 according to Q2's weight. So reducing this delay may be more
      important than avoiding the costs of preempting Q1.
      
      Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
      higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
      triggers a new choice of the next bfq_queue to serve. If Q2 really is
      the next bfq_queue to serve, then Q2 will be set in service
      immediately.
      
      This change reduces the component of the I/O latency caused by the
      above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
      SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
      a process doing sporadic random reads while another process is doing
      continuous sequential reads.
      Signed-off-by: default avatarNicola Bottura <bottura.nicola95@gmail.com>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      96a291c3
    • Paolo Valente's avatar
      block, bfq: detect wakers and unconditionally inject their I/O · 13a857a4
      Paolo Valente authored
      A bfq_queue Q may happen to be synchronized with another
      bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
      receive new I/O. We call Q2 "waker queue".
      
      If I/O plugging is being performed for Q, and Q is not receiving any
      more I/O because of the above synchronization, then, thanks to BFQ's
      injection mechanism, the waker queue is likely to get served before
      the I/O-plugging timeout fires.
      
      Unfortunately, this fact may not be sufficient to guarantee a high
      throughput during the I/O plugging, because the inject limit for Q may
      be too low to guarantee a lot of injected I/O. In addition, the
      duration of the plugging, i.e., the time before Q finally receives new
      I/O, may not be minimized, because the waker queue may happen to be
      served only after other queues.
      
      To address these issues, this commit introduces the explicit detection
      of the waker queue, and the unconditional injection of a pending I/O
      request of the waker queue on each invocation of
      bfq_dispatch_request().
      
      One may be concerned that this systematic injection of I/O from the
      waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
      the contrary, next Q's I/O is brought forward dramatically, for it is
      not blocked for milliseconds.
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      13a857a4
    • Paolo Valente's avatar
      block, bfq: bring forward seek&think time update · a3f9bce3
      Paolo Valente authored
      Until the base value for request service times gets finally computed
      for a bfq_queue, the inject limit for that queue does depend on the
      think-time state (short|long) of the queue. A timely update of the
      think time then guarantees a quicker activation or deactivation of the
      injection. Fortunately, the think time of a bfq_queue is updated in
      the same code path as the inject limit; but after the inject limit.
      
      This commits moves the update of the think time before the update of
      the inject limit. For coherence, it moves the update of the seek time
      too.
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a3f9bce3
    • Paolo Valente's avatar
      block, bfq: update base request service times when possible · 24792ad0
      Paolo Valente authored
      I/O injection gets reduced if it increases the request service times
      of the victim queue beyond a certain threshold.  The threshold, in its
      turn, is computed as a function of the base service time enjoyed by
      the queue when it undergoes no injection.
      
      As a consequence, for injection to work properly, the above base value
      has to be accurate. In this respect, such a value may vary over
      time. For example, it varies if the size or the spatial locality of
      the I/O requests in the queue change. It is then important to update
      this value whenever possible. This commit performs this update.
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      24792ad0
    • Paolo Valente's avatar
      block, bfq: fix rq_in_driver check in bfq_update_inject_limit · db599f9e
      Paolo Valente authored
      One of the cases where the parameters for injection may be updated is
      when there are no more in-flight I/O requests. The number of in-flight
      requests is stored in the field bfqd->rq_in_driver of the descriptor
      bfqd of the device. So, the controlled condition is
      bfqd->rq_in_driver == 0.
      
      Unfortunately, this is wrong because, the instruction that checks this
      condition is in the code path that handles the completion of a
      request, and, in particular, the instruction is executed before
      bfqd->rq_in_driver is decremented in such a code path.
      
      This commit fixes this issue by just replacing 0 with 1 in the
      comparison.
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      db599f9e
    • Paolo Valente's avatar
      block, bfq: reset inject limit when think-time state changes · 766d6141
      Paolo Valente authored
      Until the base value of the request service times gets finally
      computed for a bfq_queue, the inject limit does depend on the
      think-time state (short|long). The limit must be 0 or 1 if the think
      time is deemed, respectively, as short or long. However, such a check
      and possible limit update is performed only periodically, once per
      second. So, to make the injection mechanism much more reactive, this
      commit performs the update also every time the think-time state
      changes.
      
      In addition, in the following special case, this commit lets the
      inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's
      think time is short: bfqq's I/O is synchronized with that of some
      other queue, i.e., bfqq may receive new I/O only after the I/O of the
      other queue is completed. Keeping the inject limit to 1 allows the
      blocking I/O to be served while bfqq is in service. And this is very
      convenient both for bfqq and for the total throughput, as explained
      in detail in the comments in bfq_update_has_short_ttime().
      Reported-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: default avatarSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      766d6141
  3. 24 Jun, 2019 1 commit
    • Jens Axboe's avatar
      Merge branch 'nvme-5.3' of git://git.infradead.org/nvme into for-5.3/block · 6b2c8e52
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "A large chunk of NVMe updates for 5.3.  Highlights:
      
       - improved PCIe suspent support (Keith Busch)
       - error injection support for the admin queue (Akinobu Mita)
       - Fibre Channel discovery improvements (James Smart)
       - tracing improvements including nvmetc tracing support (Minwoo Im)
       - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
         Kulkarni)"
      
      * 'nvme-5.3' of git://git.infradead.org/nvme: (26 commits)
        Documentation: nvme: add an example for nvme fault injection
        nvme: enable to inject errors into admin commands
        nvme: prepare for fault injection into admin commands
        nvmet: introduce target-side trace
        nvme-trace: print result and status in hex format
        nvme-trace: support for fabrics commands in host-side
        nvme-trace: move opcode symbol print to nvme.h
        nvme-trace: do not export nvme_trace_disk_name
        nvme-pci: clean up nvme_remove_dead_ctrl a bit
        nvme-pci: properly report state change failure in nvme_reset_work
        nvme-pci: set the errno on ctrl state change error
        nvme-pci: adjust irq max_vector using num_possible_cpus()
        nvme-pci: remove queue_count_ops for write_queues and poll_queues
        nvme-pci: remove unnecessary zero for static var
        nvme-pci: use host managed power state for suspend
        nvme: introduce nvme_is_fabrics to check fabrics cmd
        nvme: export get and set features
        nvme: fix possible io failures when removing multipathed ns
        nvme-fc: add message when creating new association
        lpfc: add sysfs interface to post NVME RSCN
        ...
      6b2c8e52
  4. 21 Jun, 2019 30 commits