1. 19 Apr, 2017 22 commits
    • block: remove blk_end_request_cur · fa1a15c0
      Christoph Hellwig authored
      This function is not used anywhere in the kernel.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: remove blk_end_request_err and __blk_end_request_err · 314fe91b
      Christoph Hellwig authored
      Both functions are entirely unused.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: remove the osdblk driver · 10081552
      Christoph Hellwig authored
      This was just a proof of concept user for the SCSI OSD library, and
      never had any real users.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Boaz Harrosh <ooo@electrozaur.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block: Make writeback throttling defaults consistent for SQ devices · 8330cdb0
      Jan Kara authored
      When CFQ is used as an elevator, it disables writeback throttling
      because they don't play well together. Later when a different elevator
      is chosen for the device, writeback throttling doesn't get enabled
      again as it should. Make sure CFQ enables writeback throttling (if it
      should be enabled by default) when we switch from it to another IO
      scheduler.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: split bfq-iosched.c into multiple source files · ea25da48
      Paolo Valente authored
      The BFQ I/O scheduler features an optimal fair-queuing
      (proportional-share) scheduling algorithm, enriched with several
      mechanisms to boost throughput and reduce latency for interactive and
      real-time applications. This makes BFQ a large and complex piece of
      code. This commit addresses this issue by splitting BFQ into three
      main, independent components, and by moving each component into a
      separate source file:
      1. Main algorithm: handles the interaction with the kernel, and
      decides which requests to dispatch; it uses the following two further
      components to achieve its goals.
      2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
      computes the schedule, using weights and budgets provided by the above
      component.
      3. cgroups support: handles group operations (creation, destruction,
      move, ...).
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: remove all get and put of I/O contexts · 6fa3e8d3
      Paolo Valente authored
      When a bfq queue is set in service and when it is merged, a reference
      to the I/O context associated with the queue is taken. This reference
      is then released when the queue is deselected from service or
      split. More precisely, the release of the reference is postponed to
      when the scheduler lock is released, to avoid nesting between the
      scheduler and the I/O-context lock. In fact, such nesting would lead
      to deadlocks, because of other code paths that take the same locks in
      the opposite order. This postponing of I/O-context releases does
      complicate code.
      
      This commit addresses this issue by modifying the involved operations
      so that they no longer need to take the above I/O-context references.
      It then removes every get and release of these references.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: handle bursts of queue activations · e1b2324d
      Arianna Avanzini authored
      Many popular I/O-intensive services or applications spawn or
      reactivate many parallel threads/processes during short time
      intervals. Examples are systemd during boot or git grep.  These
      services or applications benefit mostly from a high throughput: the
      quicker the I/O generated by their processes is cumulatively served,
      the sooner the target job of these services or applications gets
      completed. As a consequence, it is almost always counterproductive to
      weight-raise any of the queues associated to the processes of these
      services or applications: in most cases it would just lower the
      throughput, mainly because weight-raising also implies device idling.
      
      To address this issue, an I/O scheduler needs, first, to detect which
      queues are associated with these services or applications. In this
      respect, we have that, from the I/O-scheduler standpoint, these
      services or applications cause bursts of activations, i.e.,
      activations of different queues occurring shortly after each
      other. However, a shorter burst of activations may also be caused by
      the start of an application that does not consist of many parallel
      I/O-bound threads (see the comments on the function bfq_handle_burst
      for details).
      
      In view of these facts, this commit introduces:
      1) a heuristic to detect (only) bursts of queue activations caused by
         services or applications consisting of many parallel I/O-bound
         threads;
      2) the prevention of device idling and weight-raising for the queues
         belonging to these bursts.
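
The detection step described above can be illustrated with a minimal Python sketch. It is not the code of bfq_handle_burst: the class name, the time window, the size threshold and the bookkeeping are all assumptions chosen for illustration. Activations that follow each other within the window are collected into one burst, and once the burst grows past the threshold its queues are treated as belonging to a large burst (and would therefore get no weight-raising and no device idling).

```python
from dataclasses import dataclass, field


@dataclass
class BurstDetector:
    """Toy burst detector: groups queue activations that occur close in time."""
    window: float = 0.18          # hypothetical max gap (seconds) inside a burst
    large_threshold: int = 8      # hypothetical size that makes a burst "large"
    last_activation: float = float("-inf")
    burst: list = field(default_factory=list)
    large_burst: bool = False

    def activate(self, queue_id: str, now: float) -> bool:
        """Record an activation; return True if the burst is now large."""
        if now - self.last_activation > self.window:
            # Too far from the previous activation: start a new burst.
            self.burst = []
            self.large_burst = False
        self.last_activation = now
        self.burst.append(queue_id)
        if len(self.burst) >= self.large_threshold:
            self.large_burst = True
        return self.large_burst

    def in_large_burst(self, queue_id: str) -> bool:
        """True if the queue belongs to the current burst and that burst is large."""
        return self.large_burst and queue_id in self.burst
```

A short burst (fewer activations than the threshold) is left alone in this sketch, which mirrors the remark above that the start of an ordinary application may also cause a small burst.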
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: boost the throughput with random I/O on NCQ-capable HDDs · e01eff01
      Paolo Valente authored
      This patch is basically the counterpart, for NCQ-capable rotational
      devices, of the previous patch. Exactly as the previous patch does on
      flash-based devices and for any workload, this patch disables device
      idling on rotational devices, but only for random I/O. In fact, only
      for queues doing random I/O does disabling idling boost the throughput
      on NCQ-capable rotational devices. To not break service guarantees,
      idling is disabled for NCQ-enabled rotational devices only when the
      same symmetry conditions considered in the previous patches hold.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: boost the throughput on NCQ-capable flash-based devices · bf2b79e7
      Paolo Valente authored
      This patch boosts the throughput on NCQ-capable flash-based devices,
      while still preserving latency guarantees for interactive and soft
      real-time applications. The throughput is boosted by just not idling
      the device when the in-service queue remains empty, even if the queue
      is sync and has a non-null idle window. This helps to keep the drive's
      internal queue full, which is necessary to achieve maximum
      performance. This solution to boost the throughput is a port of
      commits a68bbdd and f7d7b7a7 for CFQ.
      
      As already highlighted in a previous patch, allowing the device to
      prefetch and internally reorder requests trivially causes loss of
      control on the request service order, and hence on service guarantees.
      Fortunately, as discussed in detail in the comments on the function
      bfq_bfqq_may_idle(), if every process has to receive the same
      fraction of the throughput, then the service order enforced by the
      internal scheduler of a flash-based device is relatively close to that
      enforced by BFQ. In particular, it is close enough to let service
      guarantees be substantially preserved.
      
      Things change in an asymmetric scenario, i.e., if not every process
      has to receive the same fraction of the throughput. In this case, to
      guarantee the desired throughput distribution, the device must be
      prevented from prefetching requests. This is exactly what this patch
      does in asymmetric scenarios.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: reduce idling only in symmetric scenarios · 1de0c4cd
      Arianna Avanzini authored
      A seeky queue (i.e., a queue containing random requests) is assigned a
      very small device-idling slice, for throughput reasons. Unfortunately,
      given the process associated with a seeky queue, this behavior causes
      the following problem: if the process, say P, performs sync I/O and
      has a higher weight than some other processes doing I/O and associated
      with non-seeky queues, then BFQ may fail to guarantee to P its
      reserved share of the throughput. The reason is that idling is key
      for providing service guarantees to processes doing sync I/O [1].
      
      This commit addresses this issue by allowing the device-idling slice
      to be reduced for a seeky queue only if the scenario happens to be
      symmetric, i.e., if all the queues are to receive the same share of
      the throughput.
      
      [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
          Scheduler", Proceedings of the First Workshop on Mobile System
          Technologies (MST-2015), May 2015.
          http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Riccardo Pizzetti <riccardo.pizzetti@gmail.com>
      Signed-off-by: Samuele Zecchini <samuele.zecchini92@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: add Early Queue Merge (EQM) · 36eca894
      Arianna Avanzini authored
      A set of processes may happen to perform interleaved reads, i.e.,
      read requests whose union would give rise to a sequential read pattern.
      There are two typical cases: first, processes reading fixed-size chunks
      of data at a fixed distance from each other; second, processes reading
      variable-size chunks at variable distances. The latter case occurs for
      example with QEMU, which splits the I/O generated by a guest into
      multiple chunks, and lets these chunks be served by a pool of I/O
      threads, iteratively assigning the next chunk of I/O to the first
      available thread. CFQ denotes as 'cooperating' a set of processes that
      are doing interleaved I/O, and when it detects cooperating processes,
      it merges their queues to obtain a sequential I/O pattern from the union
      of their I/O requests, and hence boost the throughput.
      
      Unfortunately, in the following frequent case, the mechanism
      implemented in CFQ for detecting cooperating processes and merging
      their queues is not responsive enough to handle also the fluctuating
      I/O pattern of the second type of processes. Suppose that one process
      of the second type issues a request close to the next request to serve
      of another process of the same type. At that time the two processes
      would be considered as cooperating. But, if the request issued by the
      first process is to be merged with some other already-queued request,
      then, by the time CFQ checks whether the two processes are
      cooperating (some time after this request has arrived), the two
      processes are likely to be already doing I/O in distant zones of the
      disk surface or device memory.
      
      However, CFQ uses preemption to get a sequential read pattern out of
      the read requests performed by the second type of processes too.  As a
      consequence, CFQ uses two different mechanisms to achieve the same
      goal: boosting the throughput with interleaved I/O.
      
      This patch introduces Early Queue Merge (EQM), a unified mechanism to
      get a sequential read pattern with both types of processes. The main
      idea is to immediately check whether a newly-arrived request lets some
      pair of processes become cooperating, both in the case of actual
      request insertion and, to be responsive with the second type of
      processes, in the case of request merge. Both types of processes are
      then handled by just merging their queues.
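
The early check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the BFQ code: the function names, the closeness threshold and the data structures are assumptions. The point it shows is that the cooperation check runs on every request arrival, whether the request is inserted or merged, instead of waiting for a later scheduling decision.

```python
CLOSE_THR = 8 * 1024  # hypothetical closeness threshold, in sectors


def find_cooperator(new_sector, issuer, next_sector_by_queue, thr=CLOSE_THR):
    """Return the id of another queue whose next request is close to new_sector."""
    for qid, next_sector in next_sector_by_queue.items():
        if qid != issuer and abs(new_sector - next_sector) <= thr:
            return qid
    return None


def on_request(new_sector, issuer, next_sector_by_queue, merged):
    """Run the check at arrival time, for inserted and merged requests alike."""
    partner = find_cooperator(new_sector, issuer, next_sector_by_queue)
    if partner is not None:
        # Record that the two queues should be merged into one.
        merged.add(frozenset((issuer, partner)))
    return partner
```

Here `next_sector_by_queue` maps each queue to the position of its next request to serve, and `merged` collects the pairs of queues found to be cooperating.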
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: reduce latency during request-pool saturation · cfd69712
      Paolo Valente authored
      This patch introduces a heuristic that reduces latency when the
      I/O-request pool is saturated. This goal is achieved by disabling
      device idling, for non-weight-raised queues, when there are weight-
      raised queues with pending or in-flight requests. In fact, as
      explained in more detail in the comment on the function
      bfq_bfqq_may_idle(), this reduces the rate at which processes
      associated with non-weight-raised queues grab requests from the pool,
      thereby increasing the probability that processes associated with
      weight-raised queues get a request immediately (or at least soon) when
      they need one. Along the same line, if there are weight-raised queues,
      then this patch halves the service rate of async (write) requests for
      non-weight-raised queues.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: preserve a low latency also with NCQ-capable drives · bcd56426
      Paolo Valente authored
      I/O schedulers typically allow NCQ-capable drives to prefetch I/O
      requests, as NCQ boosts the throughput exactly by prefetching and
      internally reordering requests.
      
      Unfortunately, as discussed in detail and shown experimentally in [1],
      this may cause fairness and latency guarantees to be violated. The
      main problem is that the internal scheduler of an NCQ-capable drive
      may postpone the service of some unlucky (prefetched) requests as long
      as it deems serving other requests more appropriate to boost the
      throughput.
      
      This patch addresses this issue by not disabling device idling for
      weight-raised queues, even if the device supports NCQ. This allows BFQ
      to start serving a new queue, and therefore allows the drive to
      prefetch new requests, only after the idling timeout expires. At that
      time, all the outstanding requests of the expired queue have been most
      certainly served.
      
      [1] P. Valente and M. Andreolini, "Improving Application
          Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
          the 5th Annual International Systems and Storage Conference
          (SYSTOR '12), June 2012.
          Slightly extended version:
          http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: reduce I/O latency for soft real-time applications · 77b7dcea
      Paolo Valente authored
      To guarantee a low latency also to the I/O requests issued by soft
      real-time applications, this patch introduces a further heuristic,
      which weight-raises (in the sense explained in the previous patch)
      also the queues associated with applications deemed as soft real-time.
      
      To be deemed as soft real-time, an application must meet two
      requirements.  First, the application must not require an average
      bandwidth higher than the approximate bandwidth required to play back
      or record a compressed high-definition video. Second, the request
      pattern of the application must be isochronous, i.e., after issuing a
      request or a batch of requests, the application must stop issuing new
      requests until all its pending requests have been completed. After
      that, the application may issue a new batch, and so on.
      
      As for the second requirement, it is critical to require also that,
      after all the pending requests of the application have been completed,
      an adequate minimum amount of time elapses before the application
      starts issuing new requests. This prevents also greedy (i.e.,
      I/O-bound) applications from being incorrectly deemed, occasionally,
      as soft real-time. In fact, if *any amount of time* is fine, then even
      a greedy application may, paradoxically, meet both the above
      requirements, if: (1) the application performs random I/O and/or the
      device is slow, and (2) the CPU load is high. The reason is the
      following.  First, if condition (1) is true, then, during the service
      of the application, the throughput may be low enough to let the
      application meet the bandwidth requirement.  Second, if condition (2)
      is true as well, then the application may occasionally behave in an
      apparently isochronous way, because it may simply stop issuing
      requests while the CPUs are busy serving other processes.
      
      To address this issue, the heuristic leverages the simple fact that
      greedy applications issue *all* their requests as quickly as they can,
      whereas soft real-time applications spend some time processing data
      after each batch of requests is completed. In particular, the
      heuristic works as follows. First, according to the above isochrony
      requirement, the heuristic checks whether an application may be soft
      real-time, thereby giving to the application the opportunity to be
      deemed as such, only when both the following two conditions happen to
      hold: 1) the queue associated with the application has expired and is
      empty, 2) there is no outstanding request of the application.
      
      Suppose that both conditions hold at time, say, t_c and that the
      application issues its next request at time, say, t_i. At time t_c the
      heuristic computes the next time instant, called soft_rt_next_start in
      the code, such that, only if t_i >= soft_rt_next_start, then both the
      next conditions will hold when the application issues its next
      request: 1) the application will meet the above bandwidth requirement,
      2) a given minimum time interval, say Delta, will have elapsed from
      time t_c (so as to filter out greedy applications).
      
      The current value of Delta is a little bit higher than the value that
      we have found, experimentally, to be adequate on a real,
      general-purpose machine. In particular we had to increase Delta to
      make the filter quite precise also in slower, embedded systems, and in
      KVM/QEMU virtual machines (details in the comments on the code).
      
      If the application actually issues its next request after time
      soft_rt_next_start, then its associated queue will be weight-raised
      for a relatively short time interval. If, during this time interval,
      the application proves again to meet the bandwidth and isochrony
      requirements, then the end of the weight-raising period for the queue
      is moved forward, and so on. Note that an application whose associated
      queue never happens to be empty when it expires will never have the
      opportunity to be deemed as soft real-time.
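
As a rough illustration of how a time instant like soft_rt_next_start could be derived from the two conditions above, consider the following Python sketch. Only the name soft_rt_next_start, t_c, t_i and Delta come from the text; the other parameter names and the exact formula for the bandwidth bound are assumptions (the bound is obtained by requiring that the service received since the queue last became busy, divided by the elapsed time, does not exceed the bandwidth cap).

```python
def soft_rt_next_start(t_c, last_busy_start, service, max_rate, delta):
    """Earliest arrival time of the next request that keeps the queue soft real-time.

    t_c             -- time at which the queue expired empty with nothing outstanding
    last_busy_start -- time at which the queue last became busy (hypothetical field)
    service         -- sectors served since last_busy_start (hypothetical field)
    max_rate        -- bandwidth cap for soft real-time queues, in sectors/second
    delta           -- minimum idle interval (Delta) that filters out greedy queues
    """
    # Condition 1: average bandwidth service / (t_i - last_busy_start) <= max_rate.
    bandwidth_bound = last_busy_start + service / max_rate
    # Condition 2: at least delta has elapsed since t_c.
    isochrony_bound = t_c + delta
    return max(bandwidth_bound, isochrony_bound)


def is_soft_rt(t_i, next_start):
    """The queue is deemed soft real-time only if its next request is late enough."""
    return t_i >= next_start
```

Taking the maximum of the two bounds gives the smallest instant for which both conditions hold at once, which is why a single comparison against t_i suffices when the next request arrives.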
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: improve responsiveness · 44e44a1b
      Paolo Valente authored
      This patch introduces a simple heuristic to load applications quickly,
      and to perform the I/O requested by interactive applications just as
      quickly. To this purpose, both a newly-created queue and a queue
      associated with an interactive application (we explain in a moment how
      BFQ decides whether the associated application is interactive),
      receive the following two special treatments:
      
      1) The weight of the queue is raised.
      
      2) The queue unconditionally enjoys device idling when it empties; in
      fact, if the requests of a queue are sync, then performing device
      idling for the queue is a necessary condition to guarantee that the
      queue receives a fraction of the throughput proportional to its weight
      (see [1] for details).
      
      For brevity, we call just weight-raising the combination of these
      two preferential treatments. For a newly-created queue,
      weight-raising starts immediately and lasts for a time interval that:
      1) depends on the device speed and type (rotational or
      non-rotational), and 2) is equal to the time needed to load (start up)
      a large-size application on that device, with cold caches and with no
      additional workload.
      
      Finally, as for guaranteeing a fast execution to interactive,
      I/O-related tasks (such as opening a file), consider that any
      interactive application blocks and waits for user input both after
      starting up and after executing some task. After a while, the user may
      trigger new operations, after which the application stops again, and
      so on. Accordingly, the low-latency heuristic weight-raises again a
      queue in case it becomes backlogged after being idle for a
      sufficiently long (configurable) time. The weight-raising then lasts
      for the same time as for a just-created queue.
      
      According to our experiments, the combination of this low-latency
      heuristic and of the improvements described in the previous patch
      allows BFQ to guarantee a high application responsiveness.
      
      [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
          Scheduler", Proceedings of the First Workshop on Mobile System
          Technologies (MST-2015), May 2015.
          http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: add more fairness with writes and slow processes · c074170e
      Paolo Valente authored
      This patch deals with two sources of unfairness, which can also cause
      high latencies and throughput loss. The first source is related to
      write requests. Write requests tend to starve read requests, basically
      because, on one side, writes are slower than reads, whereas, on the
      other side, storage devices confuse schedulers by deceptively
      signaling the completion of write requests immediately after receiving
      them. This patch addresses this issue by just throttling writes. In
      particular, after a write request is dispatched for a queue, the
      budget of the queue is decremented by the number of sectors to write,
      multiplied by an (over)charge coefficient. The value of the
      coefficient is the result of our tuning with different devices.
      
      The second source of unfairness has to do with slowness detection:
      when the in-service queue is expired, BFQ also checks whether the
      queue has been "too slow", i.e., has consumed its last-assigned budget
      at such a low rate that it would have been impossible to consume all
      of this budget within the maximum time slice T_max (Subsec. 3.5 in
      [1]). In this case, the queue is always (over)charged the whole
      budget, to reduce its utilization of the device. Both this overcharge
      and the slowness-detection criterion may cause unfairness.
      
      First, always charging a full budget to a slow queue is too coarse. It
      is much more accurate, and this patch lets BFQ do so, to charge an
      amount of service 'equivalent' to the amount of time during which the
      queue has been in service. As explained in more detail in the comments
      on the code, this enables BFQ to provide time fairness among slow
      queues.
      
      Secondly, because of ZBR, a queue may be deemed as slow when its
      associated process is performing I/O on the slowest zones of a
      disk. However, unless the process is truly too slow, not reducing the
      disk utilization of the queue is more profitable in terms of disk
      throughput than the opposite. A similar problem is caused by logical
      block mapping on non-rotational devices. For this reason, this patch
      lets a queue be charged time, and not budget, only if the queue has
      consumed less than 2/3 of its assigned budget. As an additional,
      important benefit, this tolerance allows BFQ to preserve enough
      elasticity to still perform bandwidth, and not time, distribution with
      little unlucky or quasi-sequential processes.
      
      Finally, for the same reasons as above, this patch makes slowness
      detection itself much less harsh: a queue is deemed slow only if it
      has consumed its budget at less than half of the peak rate.
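
The charging rules described above can be summarized with a small Python sketch. It is a simplification under stated assumptions, not the patch itself: the overcharge coefficient value, the function names, the way the slowness test and the 2/3 test are combined, and the formula used for the time-equivalent service are all hypothetical.

```python
WRITE_OVERCHARGE = 10  # hypothetical value; the text only says it was tuned empirically


def charge_for_dispatch(sectors, is_write, coeff=WRITE_OVERCHARGE):
    """Budget decrement for one dispatched request; writes are overcharged."""
    return sectors * coeff if is_write else sectors


def is_slow(consumed, elapsed, peak_rate):
    """Slow only if the budget was consumed at less than half of the peak rate."""
    return consumed / elapsed < peak_rate / 2


def service_to_charge(budget, consumed, elapsed, t_max, peak_rate):
    """Charge time instead of service only to slow queues that used < 2/3 of budget."""
    if is_slow(consumed, elapsed, peak_rate) and consumed < 2 * budget / 3:
        # Assumed time-equivalent service: the share of the budget that
        # corresponds to the fraction of T_max spent in service.
        return min(budget, budget * elapsed / t_max)
    return consumed
```

For example, a queue that consumed 300 of 1200 sectors in half of T_max, at well below half of the peak rate, is charged 600 sectors in this sketch rather than the full budget.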
      
      [1] P. Valente and M. Andreolini, "Improving Application
          Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
          the 5th Annual International Systems and Storage Conference
          (SYSTOR '12), June 2012.
          Slightly extended version:
          http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: modify the peak-rate estimator · ab0e43e9
      Paolo Valente authored
      Unless the maximum budget B_max that BFQ can assign to a queue is set
      explicitly by the user, BFQ automatically updates B_max. In
      particular, BFQ dynamically sets B_max to the number of sectors that
      can be read, at the current estimated peak rate, during the maximum
      time, T_max, allowed before a budget timeout occurs. In formulas, if
      we denote as R_est the estimated peak rate, then B_max = T_max ∗
      R_est. Hence, the higher R_est is with respect to the actual device
      peak rate, the more likely processes are to incur budget timeouts
      unjustly. Besides, too high a value of B_max unnecessarily
      increases the deviation from an ideal, smooth service.
      
      Unfortunately, it is not trivial to estimate the peak rate correctly:
      because of the presence of sw and hw queues between the scheduler and
      the device components that finally serve I/O requests, it is hard to
      say exactly when a given dispatched request is served inside the
      device, and for how long. As a consequence, it is hard to know
      precisely at what rate a given set of requests is actually served by
      the device.
      
      On the opposite end, the dispatch time of any request is trivially
      available, and, from this piece of information, the "dispatch rate"
      of requests can be immediately computed. So, the idea in the next
      function is to use what is known, namely request dispatch times
      (plus, when useful, request completion times), to estimate what is
      unknown, namely in-device request service rate.
      
      The main issue is that, because of the above facts, the rate at
      which a certain set of requests is dispatched over a certain time
      interval can vary greatly with respect to the rate at which the
      same requests are then served. But, since the size of any
      intermediate queue is limited, and the service scheme is lossless
      (no request is silently dropped), the following obvious convergence
      property holds: the number of requests dispatched MUST become
      closer and closer to the number of requests completed as the
      observation interval grows. This is the key property used in
      this new version of the peak-rate estimator.
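
The relation B_max = T_max * R_est and the convergence property above can be made concrete with a short Python sketch. The function names and the numbers used in the example are assumptions; the sketch only shows the arithmetic, not the estimator implemented by the patch.

```python
def estimate_rate(total_sectors, interval_s):
    """Dispatch rate (sectors/second) over an observation interval."""
    return total_sectors / interval_s


def max_budget(t_max_s, r_est):
    """B_max = T_max * R_est: sectors servable at the estimated peak rate in T_max."""
    return t_max_s * r_est


def relative_gap(dispatched, queue_depth):
    """Upper bound on (dispatched - completed) / dispatched.

    With a lossless service scheme, at most queue_depth requests can sit in
    the intermediate queues, so the gap shrinks as the interval grows.
    """
    return min(1.0, queue_depth / dispatched)
```

With a hypothetical depth of 32, observing 320 dispatches leaves a gap of at most 10%, while observing 3200 leaves at most 1%, which is why a long enough observation interval makes the dispatch rate a usable proxy for the in-device service rate.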
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block, bfq: improve throughput boosting · 54b60456
      Paolo Valente authored
      The feedback-loop algorithm used by BFQ to compute queue (process)
      budgets is basically a set of three update rules, one for each of the
      main reasons why a queue may be expired. If many processes suddenly
      switch from sporadic I/O to greedy and sequential I/O, then these
      rules are quite slow to assign large budgets to these processes, and
      hence to achieve a high throughput. On the opposite side, BFQ assigns
      the maximum possible budget B_max to a just-created queue. This allows
      a high throughput to be achieved immediately if the associated process
      is I/O-bound and performs sequential I/O from the beginning. But it
      also increases the worst-case latency experienced by the first
      requests issued by the process, because the larger the budget of a
      queue waiting for service is, the later the queue will be served by
      B-WF2Q+ (Subsec 3.3 in [1]). This is detrimental for an interactive or
      soft real-time application.
      
      To tackle these throughput and latency problems, on one hand this
      patch changes the initial budget value to B_max/2. On the other hand,
      it re-tunes the three rules, adopting a more aggressive,
      multiplicative increase/linear decrease scheme. This scheme trades
      latency for throughput more than before, and tends to assign large
      budgets quickly to processes that are or become I/O-bound. For two of
      the expiration reasons, the new version of the rules also contains
      some more little improvements, briefly described below.
      
      *No more backlog.* In this case, the budget was larger than the number
      of sectors actually read/written by the process before it stopped
      doing I/O. Hence, to reduce latency for the possible future I/O
      requests of the process, the old rule simply set the next budget to
      the number of sectors actually consumed by the process. However, if
      there are still outstanding requests, then the process may not yet
      have issued its next request just because it is still waiting for
      the completion of some of the outstanding ones. If this sub-case
      holds true, then the new rule, instead of decreasing the budget,
      doubles it, proactively, in the hope that: 1) a larger budget will fit
      the actual needs of the process, and 2) the process is sequential and
      hence a higher throughput will be achieved by serving the process
      longer after granting it access to the device.
      
      *Budget timeout*. The original rule set the new budget to the maximum
      value B_max, to maximize throughput and let all processes experiencing
      budget timeouts receive the same share of the device time. In our
      experiments we verified that this sudden jump to B_max did not provide
      appreciable benefits; rather it increased the latency of processes
      performing sporadic and short I/O. The new rule only doubles the
      budget.
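
      The re-tuned rules can be sketched in C as follows. This is only an
      illustration under stated assumptions, not the code of the patch: the
      names, the value of B_MAX, the linear-decrease step, and the handling
      of the budget-exhausted case (which the text above does not detail)
      are all hypothetical.

```c
#include <assert.h>

/* All identifiers and constants below are hypothetical. */
#define B_MAX    16384u    /* assumed maximum budget, in sectors */
#define DEC_STEP 1024u     /* assumed linear-decrease step, in sectors */
#define B_MIN    DEC_STEP  /* assumed lower bound for any budget */

enum expire_reason {
    EXPIRE_BUDGET_EXHAUSTED, /* queue consumed its whole budget */
    EXPIRE_NO_MORE_BACKLOG,  /* queue emptied before using its budget */
    EXPIRE_BUDGET_TIMEOUT    /* queue held the device for too long */
};

static unsigned int clamp_budget(unsigned int b)
{
    if (b > B_MAX)
        return B_MAX;
    if (b < B_MIN)
        return B_MIN;
    return b;
}

/* A just-created queue starts from B_max/2 instead of B_max. */
unsigned int initial_budget(void)
{
    return B_MAX / 2;
}

/* Compute the next budget of a queue that has just been expired. */
unsigned int next_budget(unsigned int budget, enum expire_reason reason,
                         int has_outstanding)
{
    assert(budget >= B_MIN && budget <= B_MAX);

    switch (reason) {
    case EXPIRE_NO_MORE_BACKLOG:
        /* The process may just be waiting for completions: grow. */
        if (has_outstanding)
            return clamp_budget(budget * 2);
        /* Otherwise shrink linearly to reduce latency. */
        return budget >= B_MIN + DEC_STEP ? budget - DEC_STEP : B_MIN;
    case EXPIRE_BUDGET_TIMEOUT:
        /* Double the budget rather than jumping to B_MAX. */
        return clamp_budget(budget * 2);
    case EXPIRE_BUDGET_EXHAUSTED:
        /* Assumption: multiplicative increase here as well. */
        return clamp_budget(budget * 2);
    }
    return budget;
}
```

      With these assumed constants, a queue that keeps hitting the budget
      timeout goes from B_MAX/2 to B_MAX in one step and then stays there.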
      
      [1] P. Valente and M. Andreolini, "Improving Application
          Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
          the 5th Annual International Systems and Storage Conference
          (SYSTOR '12), June 2012.
          Slightly extended version:
          http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      54b60456
    • Arianna Avanzini's avatar
      block, bfq: add full hierarchical scheduling and cgroups support · e21b7a0b
      Arianna Avanzini authored
      Add complete support for full hierarchical scheduling, with a cgroups
      interface. Full hierarchical scheduling is implemented through the
      'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
      associated with processes, and groups are represented in general by
      entities. Given the bfq_queues associated with the processes belonging
      to a given group, the entities representing these queues are children of
      the entity representing the group. At higher levels, if a group, say
      G, contains other groups, then the entity representing G is the parent
      entity of the entities representing the groups in G.
      
      Hierarchical scheduling is performed as follows: if the timestamps of
      a leaf entity (i.e., of a bfq_queue) change, and such a change lets
      the entity become the next-to-serve entity for its parent entity, then
      the timestamps of the parent entity are recomputed as a function of
      the budget of its new next-to-serve leaf entity. If the parent entity
      belongs, in its turn, to a group, and its new timestamps let it become
      the next-to-serve for its parent entity, then the timestamps of the
      latter parent entity are recomputed as well, and so on. When a new
      bfq_queue must be set in service, the reverse path is followed: the
      next-to-serve highest-level entity is chosen, then its next-to-serve
      child entity, and so on, until the next-to-serve leaf entity is
      reached, and the bfq_queue that this entity represents is set in
      service.
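
      The top-down selection path can be illustrated with a small C sketch.
      It is a simplification under stated assumptions: the struct and
      function names are hypothetical, and "next-to-serve" is reduced to
      "smallest finish timestamp among the children", whereas B-WF2Q+ uses
      a richer rule and caches the choice in each parent.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4  /* arbitrary bound for this sketch */

/* Hypothetical entity: represents either a group or a leaf queue. */
struct entity {
    const char *name;
    unsigned long finish;                  /* virtual finish timestamp */
    struct entity *children[MAX_CHILDREN];
    size_t nr_children;                    /* 0 for a leaf (a queue) */
};

/* Simplified next-to-serve rule: the child with the smallest finish. */
static struct entity *next_to_serve(struct entity *parent)
{
    struct entity *best = NULL;

    for (size_t i = 0; i < parent->nr_children; i++) {
        struct entity *c = parent->children[i];

        if (!best || c->finish < best->finish)
            best = c;
    }
    return best;
}

/* Walk down from the root until a leaf entity (a queue) is reached. */
struct entity *select_queue(struct entity *root)
{
    struct entity *e = root;

    assert(root != NULL);
    while (e->nr_children > 0)
        e = next_to_serve(e);
    return e;
}
```

      Calling select_queue() on the root entity returns the leaf whose
      queue would be set in service under this simplified rule.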
      
      Writeback is accounted for on a per-group basis, i.e., for each group,
      the async I/O requests of the processes of the group are enqueued in a
      distinct bfq_queue, and the entity associated with this queue is a
      child of the entity associated with the group.
      
      Weights can be assigned explicitly to groups and processes through the
      cgroups interface, differently from what happens, for single
      processes, if the cgroups interface is not used (as explained in the
      description of the previous patch). In particular, since each node has
      a full scheduler, each group can be assigned its own weight.
      Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e21b7a0b
    • Paolo Valente's avatar
      block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler · aee69d78
      Paolo Valente authored
      We tag as v0 the version of BFQ containing only BFQ's engine plus
      hierarchical support. BFQ's engine is introduced by this commit, while
      hierarchical support is added by the next commit. We use the v0 tag to
      distinguish this minimal version of BFQ from the versions that also
      contain the features and improvements added by subsequent commits. BFQ-v0
      coincides with the version of BFQ submitted a few years ago [1], apart
      from the introduction of preemption, described below.
      
      BFQ is a proportional-share I/O scheduler, whose general structure,
      plus a lot of code, are borrowed from CFQ.
      
      - Each process doing I/O on a device is associated with a weight and a
        (bfq_)queue.
      
      - BFQ grants exclusive access to the device, for a while, to one queue
        (process) at a time, and implements this service model by
        associating every queue with a budget, measured in number of
        sectors.
      
        - After a queue is granted access to the device, the budget of the
          queue is decremented, on each request dispatch, by the size of the
          request.
      
        - The in-service queue is expired, i.e., its service is suspended,
          only if one of the following events occurs: 1) the queue finishes
          its budget, 2) the queue empties, 3) a "budget timeout" fires.
      
          - The budget timeout prevents processes doing random I/O from
            holding the device for too long and dramatically reducing
            throughput.
      
          - Actually, as in CFQ, a queue associated with a process issuing
            sync requests may not be expired immediately when it empties.
            Instead, BFQ may idle the device for a short time interval,
            giving the process the chance to go on being served if it issues
            a new request in time. Device idling typically boosts the
            throughput on rotational devices, if processes do synchronous
            and sequential I/O. In addition, under BFQ, device idling is
            also instrumental in guaranteeing the desired throughput
            fraction to processes issuing sync requests (see [2] for
            details).
      
            - With respect to idling for service guarantees, if several
              processes are competing for the device at the same time, but
              all processes (and groups, after the following commit) have
              the same weight, then BFQ guarantees the expected throughput
              distribution without ever idling the device. Throughput is
              thus as high as possible in this common scenario.
      
        - Queues are scheduled according to a variant of WF2Q+, named
          B-WF2Q+, and implemented using an augmented rb-tree to preserve an
          O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
          also ready for hierarchical scheduling. However, for a cleaner
          logical breakdown, the code that enables and completes
          hierarchical support is provided in the next commit, which focuses
          exactly on this feature.
      
        - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
          perfectly fair, and smooth service. In particular, B-WF2Q+
          guarantees that each queue receives a fraction of the device
          throughput proportional to its weight, even if the throughput
          fluctuates, and regardless of: the device parameters, the current
          workload and the budgets assigned to the queue.
      
        - The last, budget-independence, property (although probably
          counterintuitive at first) is definitely beneficial, for
          the following reasons:
      
          - First, with any proportional-share scheduler, the maximum
            deviation with respect to an ideal service is proportional to
            the maximum budget (slice) assigned to queues. As a consequence,
            BFQ can keep this deviation tight not only because of the
            accurate service of B-WF2Q+, but also because BFQ *does not*
            need to assign a larger budget to a queue to let the queue
            receive a higher fraction of the device throughput.
      
          - Second, BFQ is free to choose, for every process (queue), the
            budget that best fits the needs of the process, or best
            leverages the I/O pattern of the process. In particular, BFQ
            updates queue budgets with a simple feedback-loop algorithm that
            allows a high throughput to be achieved, while still providing
            tight latency guarantees to time-sensitive applications. When
            the in-service queue expires, this algorithm computes the next
            budget of the queue so as to:
      
            - Let large budgets be eventually assigned to the queues
              associated with I/O-bound applications performing sequential
              I/O: in fact, the longer these applications are served once
              they get access to the device, the higher the throughput is.
      
            - Let small budgets be eventually assigned to the queues
              associated with time-sensitive applications (which typically
              perform sporadic and short I/O), because the smaller the
              budget assigned to a queue waiting for service is, the sooner
              B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
      
      - Weights can be assigned to processes only indirectly, through I/O
        priorities, and according to the relation:
        weight = 10 * (IOPRIO_BE_NR - ioprio).
        The next patch provides, instead, a cgroups interface through which
        weights can be assigned explicitly.
      
      - If several processes are competing for the device at the same time,
        but all processes and groups have the same weight, then BFQ
        guarantees the expected throughput distribution without ever idling
        the device. It uses preemption instead. Throughput is then much
        higher in this common scenario.
      
      - ioprio classes are served in strict priority order, i.e.,
        lower-priority queues are not served as long as there are
        higher-priority queues.  Among queues in the same class, the
        bandwidth is distributed in proportion to the weight of each
        queue. A very thin extra bandwidth is however guaranteed to the Idle
        class, to prevent it from starving.
      
      - If the strict_guarantees parameter is set (default: unset), then BFQ
           - always performs idling when the in-service queue becomes empty;
           - forces the device to serve one I/O request at a time, by
             dispatching a new request only if there is no outstanding
             request.
        In the presence of differentiated weights or I/O-request sizes,
        both the above conditions are needed to guarantee that every
        queue receives its allotted share of the bandwidth (see
        Documentation/block/bfq-iosched.txt for more details). Setting
        strict_guarantees may evidently affect throughput.
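
      Two of the points in the list above lend themselves to a short C
      sketch: the ioprio-to-weight relation and the charging of a queue's
      budget on each dispatch. The sketch rests on assumptions: all struct
      and function names are hypothetical, IOPRIO_BE_NR is taken to be 8 as
      in the Linux headers, and the budget timeout and device idling are
      left out.

```c
#include <assert.h>

#define IOPRIO_BE_NR 8  /* best-effort priority levels, assumed to be 8 */

/* weight = 10 * (IOPRIO_BE_NR - ioprio), as stated in the list above. */
unsigned int ioprio_to_weight(unsigned int ioprio)
{
    assert(ioprio < IOPRIO_BE_NR);
    return 10 * (IOPRIO_BE_NR - ioprio);
}

/* Hypothetical, heavily simplified view of a queue in service. */
struct queue {
    unsigned int budget;   /* remaining budget, in sectors */
    unsigned int backlog;  /* number of requests still queued */
};

enum service_state {
    KEEP_SERVING,      /* budget and backlog both remain */
    BUDGET_EXHAUSTED,  /* expiration reason 1 */
    QUEUE_EMPTY        /* expiration reason 2 (idling not modeled) */
};

/* Dispatch one request of the given size and charge it to the budget. */
enum service_state dispatch(struct queue *q, unsigned int sectors)
{
    assert(q->backlog > 0);

    q->backlog--;
    q->budget = sectors < q->budget ? q->budget - sectors : 0;

    if (q->budget == 0)
        return BUDGET_EXHAUSTED;
    if (q->backlog == 0)
        return QUEUE_EMPTY;
    return KEEP_SERVING;
}
```

      Under this mapping the highest best-effort priority (ioprio 0) gets
      weight 80 and the lowest (ioprio 7) gets weight 10.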
      
      [1] https://lkml.org/lkml/2008/4/1/234
          https://lkml.org/lkml/2008/11/11/148
      
      [2] P. Valente and M. Andreolini, "Improving Application
          Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
          the 5th Annual International Systems and Storage Conference
          (SYSTOR '12), June 2012.
          Slightly extended version:
          http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf
      Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      aee69d78
    • Josef Bacik's avatar
      nbd: set the max segment size to UINT_MAX · ebb16d0d
      Josef Bacik authored
      NBD doesn't care about limiting the segment size, so let the user push
      the largest bios they want.  This allows us to control the request size
      solely through max_sectors_kb.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ebb16d0d
    • Jens Axboe's avatar
      Merge branch 'stable/for-jens-4.12' of... · 6af38473
      Jens Axboe authored
      Merge branch 'stable/for-jens-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-4.12/block
      
      Konrad writes:
      
      It has one fix - to emit an uevent whenever the size of the guest disk image
      changes.
      6af38473
  2. 18 Apr, 2017 1 commit
  3. 17 Apr, 2017 12 commits
    • Josef Bacik's avatar
      nbd: add a flag to destroy an nbd device on disconnect · a2c97909
      Josef Bacik authored
      For ease of management, it would be nice for users to be able to
      specify that the device node for an nbd device is destroyed once it is
      disconnected and has no more users.  Add a client flag to enable this.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a2c97909
    • Josef Bacik's avatar
      nbd: add device refcounting · c6a4759e
      Josef Bacik authored
      In order to support deleting the device on disconnect we need to
      refcount the actual nbd_device struct.  So add the refcounting framework
      and change how we free the normal devices at rmmod time so we can catch
      reference leaks.
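
      The general get/put pattern behind such refcounting can be sketched
      in userspace C. This is an illustration only, assuming C11 atomics;
      the struct and function names are hypothetical and do not mirror the
      nbd driver, which uses the kernel's own primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical refcounted device object. */
struct device {
    atomic_int refs;
    int index;
};

/* Number of devices not yet freed; non-zero at shutdown means a leak. */
static atomic_int live_devices;

/* Allocate a device holding one initial reference. */
struct device *device_create(int index)
{
    struct device *d = malloc(sizeof(*d));

    if (!d)
        return NULL;
    atomic_init(&d->refs, 1);
    d->index = index;
    atomic_fetch_add(&live_devices, 1);
    return d;
}

/* Take an extra reference. */
void device_get(struct device *d)
{
    atomic_fetch_add(&d->refs, 1);
}

/* Drop a reference; free the device when the last one goes away. */
bool device_put(struct device *d)
{
    if (atomic_fetch_sub(&d->refs, 1) == 1) {
        atomic_fetch_sub(&live_devices, 1);
        free(d);
        return true;
    }
    return false;
}

/* Checked at shutdown (the analogue of rmmod time) to catch leaks. */
int devices_leaked(void)
{
    return atomic_load(&live_devices);
}
```

      A shutdown path that drops its own reference and then finds
      devices_leaked() non-zero knows that some get was never matched by a
      put.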
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c6a4759e
    • Josef Bacik's avatar
      nbd: add a status netlink command · 47d902b9
      Josef Bacik authored
      Allow users to query the status of existing nbd devices.  Right now this
      only returns whether or not the device is connected, but could be
      extended in the future to include more information.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      47d902b9
    • Josef Bacik's avatar
      nbd: handle dead connections · 560bc4b3
      Josef Bacik authored
      Sometimes we like to upgrade our server without making all of our
      clients freak out and reconnect.  This patch provides a way to specify a
      dead connection timeout to allow us to pause all requests and wait for
      new connections to be opened.  With this in place I can take down the
      nbd server for less than the dead connection timeout time and bring it
      back up and everything resumes gracefully.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      560bc4b3
    • Josef Bacik's avatar
      nbd: only clear the queue on device teardown · 2516ab15
      Josef Bacik authored
      When running a disconnect torture test I noticed that sometimes we would
      crash with a negative ref count on our queue.  This was because we were
      ending the same request twice.  Turns out we were racing with
      NBD_CLEAR_SOCK clearing the requests as well as the teardown of the
      device clearing the requests.  So instead make the ioctl only shut down
      the sockets and make it so that we only ever run nbd_clear_que from the
      device teardown.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2516ab15
    • Josef Bacik's avatar
      nbd: multicast dead link notifications · 799f9a38
      Josef Bacik authored
      Provide a mechanism to notify userspace that there's been a link problem
      on an NBD device.  This will allow userspace to re-establish a connection
      and provide the new socket to the device without disrupting the device.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      799f9a38
    • Josef Bacik's avatar
      nbd: add a reconfigure netlink command · b7aa3d39
      Josef Bacik authored
      We want to be able to reconnect dead connections to existing block
      devices, so add a reconfigure netlink command.  We will also allow users
      to change their timeout on the fly, but everything else will require a
      disconnect and reconnect.  You won't be able to add more connections
      either, only replace dead connections with new, live ones.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b7aa3d39
    • Josef Bacik's avatar
      nbd: add a basic netlink interface · e46c7287
      Josef Bacik authored
      The existing ioctl interface for configuring NBD devices is a bit
      cumbersome and hard to extend.  The other problem is we leave a
      userspace app sitting in its syscall until the device disconnects,
      which is less than ideal.
      
      This patch introduces a netlink interface for adding and disconnecting
      nbd devices.  This has the benefits of being easily extendable without
      breaking older userspace applications, and allows us to configure an nbd
      device without leaving a userspace app sitting waiting for the device to
      disconnect.
      
      With this interface we also gain the ability to configure more devices
      than are preallocated at insmod time.  We have also gained the ability
      to not specify a particular device and have one provided for us, so that
      userspace doesn't need to find a free device to configure.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e46c7287
    • Josef Bacik's avatar
      nbd: stop using the bdev everywhere · 29eaadc0
      Josef Bacik authored
      In preparation for the upcoming netlink interface we need to not rely on
      already having the bdev for the NBD device we are doing operations on.
      Instead of passing the bdev around, just use it in places where we know
      we already have the bdev.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      29eaadc0
    • Josef Bacik's avatar
      nbd: separate out the config information · 5ea8d108
      Josef Bacik authored
      In order to properly refcount the various aspects of an NBD device we
      need to separate out the configuration elements of the nbd device.  The
      configuration of an NBD device has a different lifetime from the actual
      device, so it doesn't make sense to bundle these two concepts.  Add a
      config_refs to keep track of the configuration structure, that way we
      can be sure that we never access it when we've torn down the device.
      Add a new nbd_config structure to hold all of the transient
      configuration information.  Finally create this when we open the device
      so that it is in place when we start to configure the device.  This has
      a nice side-effect of fixing a long standing problem where you could end
      up with a half-configured nbd device that needed to be "disconnected" in
      order to be usable again.  Now once we close our device the
      configuration will be discarded.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5ea8d108
    • Josef Bacik's avatar
      nbd: handle single path failures gracefully · f3733247
      Josef Bacik authored
      Currently if we have multiple connections and one of them goes down we will tear
      down the whole device.  However there's no reason we need to do this as we
      could have other connections that are working fine.  Deal with this by keeping
      track of the state of the different connections, and if we lose one we mark it
      as dead and send all IO destined for that socket to one of the other healthy
      sockets.  Any outstanding requests that were on the dead socket will time out and
      be re-submitted properly.
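
      The fallback to a healthy connection can be sketched as follows. This
      is a minimal illustration under stated assumptions: the struct and
      function names are hypothetical, the connection bound is arbitrary,
      and the round-robin search order is a choice made for this sketch
      rather than something the commit message specifies.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CONNS 8  /* arbitrary bound for this sketch */

/* Hypothetical per-connection state. */
struct conn {
    bool dead;
};

struct config {
    struct conn conns[MAX_CONNS];
    int nr_conns;
};

/* Record that a connection has failed. */
void mark_dead(struct config *cfg, int idx)
{
    assert(idx >= 0 && idx < cfg->nr_conns);
    cfg->conns[idx].dead = true;
}

/*
 * Return the connection to use for a request that would normally go to
 * 'preferred': the preferred one if it is healthy, otherwise the next
 * healthy one, or -1 if every connection is dead.
 */
int pick_conn(const struct config *cfg, int preferred)
{
    assert(preferred >= 0 && preferred < cfg->nr_conns);

    for (int i = 0; i < cfg->nr_conns; i++) {
        int idx = (preferred + i) % cfg->nr_conns;

        if (!cfg->conns[idx].dead)
            return idx;
    }
    return -1;
}
```

      Only when pick_conn() returns -1 would the whole device need to be
      torn down in this model.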
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f3733247
    • Josef Bacik's avatar
      nbd: put socket in error cases · 9b1355d5
      Josef Bacik authored
      When adding a new socket we look it up and then try to add it to our
      configuration.  If any of those steps fail we need to make sure we put
      the socket so we don't leak them.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9b1355d5
  4. 16 Apr, 2017 5 commits
    • Dan Carpenter's avatar
      lightnvm: fix some error code in pblk-init.c · 1c6286f2
      Dan Carpenter authored
      There were a bunch of places in pblk_lines_init() where we didn't set an
      error code.  And in pblk_writer_init() we accidentally return 1 instead
      of a correct error code, which would result in an Oops later.
      
      Fixes: 11a5d6fdf919 ("lightnvm: physical block device (pblk) target")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1c6286f2
    • Dan Carpenter's avatar
      lightnvm: fix some WARN() messages · 2a79efd8
      Dan Carpenter authored
      WARN_ON() takes a condition, not an error message.  I slightly tweaked
      some conditions so hopefully they're clearer.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2a79efd8
    • Dan Carpenter's avatar
      lightnvm: pblk-gc: fix an error pointer dereference in init · 503ec94e
      Dan Carpenter authored
      These labels are reversed so we could end up dereferencing an error
      pointer or leaking.
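
      The label-ordering issue is easiest to see in a generic goto-unwind
      sketch. The code below is not the pblk-gc code: it is a userspace
      illustration with hypothetical names, it uses malloc() and NULL
      checks where the kernel code deals with error pointers, and fail_at
      is a made-up knob for simulating a failure at a given step.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical object with two resources acquired in order. */
struct gc {
    void *reader;
    void *writer;
};

/*
 * Acquire reader, then writer. Each failure jumps to the label that
 * undoes only what has already been acquired, so the labels appear in
 * the reverse order of the acquisitions.
 */
int gc_init(struct gc *gc, int fail_at)
{
    int ret;

    gc->reader = NULL;
    gc->writer = NULL;

    gc->reader = (fail_at == 1) ? NULL : malloc(16);
    if (!gc->reader) {
        ret = -ENOMEM;
        goto fail;              /* nothing to undo yet */
    }

    gc->writer = (fail_at == 2) ? NULL : malloc(16);
    if (!gc->writer) {
        ret = -ENOMEM;
        goto fail_free_reader;  /* undo only the reader */
    }
    return 0;

fail_free_reader:
    free(gc->reader);
    gc->reader = NULL;
fail:
    return ret;
}

/* Release both resources after a successful gc_init(). */
void gc_exit(struct gc *gc)
{
    free(gc->writer);
    free(gc->reader);
    gc->writer = NULL;
    gc->reader = NULL;
}
```

      Swapping the two labels would free nothing after the second failure
      (a leak) and would try to release a resource that was never acquired
      after the first one.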
      
      Fixes: 7f347ba6bb3a ("lightnvm: physical block device (pblk) target")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      503ec94e
    • Javier González's avatar
      lightnvm: physical block device (pblk) target · a4bd217b
      Javier González authored
      This patch introduces pblk, a host-side translation layer for
      Open-Channel SSDs to expose them like block devices. The translation
      layer allows data placement decisions and I/O scheduling to be
      managed by the host, enabling users to optimize the SSD for their
      specific workloads.
      
      An open-channel SSD has a set of LUNs (parallel units) and a
      collection of blocks. Each block can be read in any order, but
      writes must be sequential. Writes may also fail, and a block may
      need to be reset before new writes can be applied.
      
      To manage the constraints, pblk maintains a logical to
      physical address (L2P) table, a write cache, garbage
      collection logic, a recovery scheme, and logic to rate-limit
      user I/Os versus garbage collection I/Os.
      
      The L2P table is fully-associative and manages sectors at a
      4KB granularity. Pblk stores the L2P table in two places: in
      the out-of-band area of the media and on the last page of a
      line. In the case of a power failure, pblk will perform a
      scan to recover the L2P table.
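
      A fully-associative L2P table at 4KB granularity can be pictured as a
      flat array indexed by logical sector. The sketch below is only an
      illustration: the names are hypothetical, physical addresses are
      reduced to plain 64-bit integers, and the persistence and recovery
      described above are not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SECTOR_SIZE 4096u      /* mapping granularity: 4KB */
#define PPA_EMPTY   UINT64_MAX /* marks a logical sector with no mapping */

/* Hypothetical L2P table: one physical address per logical sector. */
struct l2p {
    uint64_t *map;
    uint64_t nr_sectors;
};

/* Create a table covering capacity_bytes of logical space. */
int l2p_init(struct l2p *t, uint64_t capacity_bytes)
{
    t->nr_sectors = capacity_bytes / SECTOR_SIZE;
    if (t->nr_sectors == 0)
        return -1;

    t->map = malloc((size_t)t->nr_sectors * sizeof(*t->map));
    if (!t->map)
        return -1;

    for (uint64_t i = 0; i < t->nr_sectors; i++)
        t->map[i] = PPA_EMPTY;
    return 0;
}

/* Any logical sector may point at any physical address. */
void l2p_update(struct l2p *t, uint64_t lba, uint64_t ppa)
{
    assert(lba < t->nr_sectors);
    t->map[lba] = ppa;
}

uint64_t l2p_lookup(const struct l2p *t, uint64_t lba)
{
    assert(lba < t->nr_sectors);
    return t->map[lba];
}

void l2p_free(struct l2p *t)
{
    free(t->map);
    t->map = NULL;
    t->nr_sectors = 0;
}
```

      Because writes within a block must be sequential, rewriting a logical
      sector means placing the data at a new physical address and calling
      l2p_update() again, which is what leaves stale data behind for
      garbage collection.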
      
      The user data is organized into lines. A line is data
      striped across blocks and LUNs. The lines enable the host to
      reduce the amount of metadata to maintain besides the user
      data and make it easier to implement RAID or erasure coding
      in the future.
      
      pblk implements multi-tenant support and can be instantiated
      multiple times on the same drive. Each instance owns a
      portion of the SSD - both regarding I/O bandwidth and
      capacity - providing I/O isolation for each instance.
      
      Finally, pblk also exposes a sysfs interface that allows
      user-space to peek into the internals of pblk. The interface
      is available at /dev/block/*/pblk/ where * is the block
      device name exposed.
      
      This work also contains contributions from:
        Matias Bjørling <matias@cnexlabs.com>
        Simon A. F. Lund <slund@cnexlabs.com>
        Young Tack Jin <youngtack.jin@gmail.com>
        Huaicheng Li <huaicheng@cs.uchicago.edu>
      Signed-off-by: Javier González <javier@cnexlabs.com>
      Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      a4bd217b
    • Javier González's avatar
      lightnvm: convert sprintf into strlcpy · 6eb08245
      Javier González authored
      Convert sprintf calls to strlcpy in order to make possible buffer
      overflows more obvious.
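
      The difference from sprintf is that a strlcpy-style copy takes the
      destination size and reports truncation. Since strlcpy is not
      available in every C library, the sketch below uses a hypothetical
      helper, bounded_copy(), written to follow the usual strlcpy
      semantics; it is an illustration, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper with strlcpy-like semantics: copy at most
 * size - 1 bytes of src into dst, always NUL-terminate when size > 0,
 * and return the length of src. A return value >= size means the copy
 * was truncated.
 */
size_t bounded_copy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size > 0) {
        size_t n = len < size - 1 ? len : size - 1;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}
```

      A caller can compare the return value against the buffer size to
      detect a name that did not fit, instead of silently writing past the
      end of the buffer as an unbounded sprintf would.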
      Signed-off-by: Javier González <javier@cnexlabs.com>
      Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      6eb08245