1. 02 Sep, 2020 19 commits
    • Tejun Heo's avatar
      blk-iocost: revamp donation amount determination · f1de2439
      Tejun Heo authored
      iocost has various safety nets to combat inuse adjustment calculation
      inaccuracies. With Andy's method implemented in transfer_surpluses(), inuse
      adjustment calculations are now accurate and we can make donation amount
      determinations accurate too.
      
      * Stop keeping track of past usage history and using the maximum. Act on the
        immediate usage information.
      
      * Remove donation constraints defined by SURPLUS_* constants. Donate
        whatever isn't used.
      
      * Determine the donation amount so that the iocg will end up with
        MARGIN_TARGET_PCT budget at the end of the coming period assuming the same
        usage as the previous period. TARGET is set at 50% of period, which is the
        previous maximum. This provides smooth convergence for most repetitive IO
        patterns.
      
      * Apply donation logic early at 20% budget. There's no risk in doing so as
        the calculation is based on the delta between the current budget and the
        target budget at the end of the coming period.
      
      * Remove preemptive iocg activation for zero cost IOs. As donation can reach
        near zero now, the mere activation doesn't provide any protection anymore.
        In the unlikely case that this becomes a problem, the right solution is
        assigning appropriate costs for such IOs.
      
      This significantly improves the donation determination logic while also
      simplifying it. Now all donations are immediate, exact and smooth.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f1de2439
    • Tejun Heo's avatar
      blk-iocost: implement Andy's method for donation weight updates · e08d02aa
      Tejun Heo authored
      iocost implements work conservation by reducing iocg->inuse and propagating
      the adjustment upwards proportionally. However, while I knew the target
      absolute hierarchical proportion - adjusted hweight_inuse, I couldn't figure
      out how to determine the iocg->inuse adjustment to achieve that and
      approximated the adjustment by scaling iocg->inuse using the proportion of
      the needed hweight_inuse changes.
      
      When nested, these scalings aren't accurate even when adjusting a single
      node as the donating node also receives the benefit of the donated portion.
      When multiple nodes are donating as they often do, they can be wildly wrong.
      
      iocost employed various safety nets to combat the inaccuracies. There are
      ample buffers in determining how much to donate, the adjustments are
      conservative and gradual. While it can achieve a reasonable level of work
      conservation in simple scenarios, the inaccuracies can easily add up leading
      to significant loss of total work. This in turn makes it difficult to
      closely cap vrate as vrate adjustment is needed to compensate for the loss
      of work. The combination of inaccurate donation calculations and vrate
      adjustments can lead to wide fluctuations and clunky overall behaviors.
      
      Andy Newell devised a method to calculate the needed ->inuse updates to
      achieve the target hweight_inuse's. The method is compatible with the
      proportional inuse adjustment propagation which allows all hot path
      operations to be local to each iocg.
      
      To roughly summarize, Andy's method divides the tree into donating and
      non-donating parts, calculates global donation rate which is used to
      determine the target hweight_inuse for each node, and then derives per-level
      proportions. There's non-trivial amount of math involved. Please refer to
      the following pdfs for detailed descriptions.
      
        https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
        https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
        https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN
      
      This patch implements Andy's method in transfer_surpluses(). This makes the
      donation calculations accurate per cycle and enables further improvements in
      other parts of the donation logic.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e08d02aa
    • Tejun Heo's avatar
      blk-iocost: restructure surplus donation logic · 93f7d2db
      Tejun Heo authored
      The way the surplus donation logic is structured isn't great. There are two
      separate paths for starting/increasing donations and decreasing them making
      the logic harder to follow and is prone to unnecessary behavior differences.
      
      In preparation for improved donation handling, this patch restructures the
      code so that
      
      * All donors - new, increasing and decreasing - are funneled through the
        same code path.
      
      * The target donation calculation is factored into hweight_after_donation()
        which is called once from the same spot for all possible donors.
      
      * Actual inuse adjustment is factored into trasnfer_surpluses().
      
      This change introduces a few behavior differences - e.g. donation amount
      reduction now uses the max usage of the recent three periods just like new
      and increasing donations, and inuse now gets adjusted upwards the same way
      it gets downwards. These differences are unlikely to have severely negative
      implications and the whole logic will be revamped soon.
      
      This patch also removes two tracepoints. The existing TPs don't quite fit
      the new implementation. A later patch will update and reinstate them.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      93f7d2db
    • Tejun Heo's avatar
      blk-iocost: decouple vrate adjustment from surplus transfers · 065655c8
      Tejun Heo authored
      Budget donations are inaccurate and could take multiple periods to converge.
      To prevent triggering vrate adjustments while surplus transfers were
      catching up, vrate adjustment was suppressed if donations were increasing,
      which was indicated by non-zero nr_surpluses.
      
      This entangling won't be necessary with the scheduled rewrite of donation
      mechanism which will make it precise and immediate. Let's decouple the two
      in preparation.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      065655c8
    • Tejun Heo's avatar
      blk-iocost: replace iocg->has_surplus with ->surplus_list · 8692d2db
      Tejun Heo authored
      Instead of marking iocgs with surplus with a flag and filtering for them
      while walking all active iocgs, build a surpluses list. This doesn't make
      much difference now but will help implementing improved donation logic which
      will iterate iocgs with surplus multiple times.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8692d2db
    • Tejun Heo's avatar
      blk-iocost: calculate iocg->usages[] from iocg->local_stat.usage_us · 1aa50d02
      Tejun Heo authored
      Currently, iocg->usages[] which are used to guide inuse adjustments are
      calculated from vtime deltas. This, however, assumes that the hierarchical
      inuse weight at the time of calculation held for the entire period, which
      often isn't true and can lead to significant errors.
      
      Now that we have absolute usage information collected, we can derive
      iocg->usages[] from iocg->local_stat.usage_us so that inuse adjustment
      decisions are made based on actual absolute usage. The calculated usage is
      clamped between 1 and WEIGHT_ONE and WEIGHT_ONE is also used to signal
      saturation regardless of the current hierarchical inuse weight.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1aa50d02
    • Tejun Heo's avatar
      blk-iocost: add absolute usage stat · 97eb1975
      Tejun Heo authored
      Currently, iocost doesn't collect or expose any statistics punting off all
      monitoring duties to drgn based iocost_monitor.py. While it works for some
      scenarios, there are some usability and data availability challenges. For
      example, accurate per-cgroup usage information can't be tracked by vtime
      progression at all and the number available in iocg->usages[] are really
      short-term snapshots used for control heuristics with possibly significant
      errors.
      
      This patch implements per-cgroup absolute usage stat counter and exposes it
      through io.stat along with the current vrate. Usage stat collection and
      flushing employ the same method as cgroup rstat on the active iocg's and the
      only hot path overhead is preemption toggling and adding to a percpu
      counter.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      97eb1975
    • Tejun Heo's avatar
      blk-iocost: grab ioc->lock for debt handling · da437b95
      Tejun Heo authored
      Currently, debt handling requires only iocg->waitq.lock. In the future, we
      want to adjust and propagate inuse changes depending on debt status. Let's
      grab ioc->lock in debt handling paths in preparation.
      
      * Because ioc->lock nests outside iocg->waitq.lock, the decision to grab
        ioc->lock needs to be made before entering the critical sections.
      
      * Add and use iocg_[un]lock() which handles the conditional double locking.
      
      * Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when
        the caller grabbed both locks.
      
      This patch is prepatory and the comments contain references to future
      changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      da437b95
    • Tejun Heo's avatar
      blk-iocost: streamline vtime margin and timer slack handling · 7ca5b2e6
      Tejun Heo authored
      The margin handling was pretty inconsistent.
      
      * ioc->margin_us and ioc->inuse_margin_vtime were used as vtime margin
        thresholds. However, the two are in different units with the former
        requiring conversion to vtime on use.
      
      * iocg_kick_waitq() was using a quarter of WAITQ_TIMER_MARGIN_PCT of
        period_us as the timer slack - ~1.2%. While iocg_kick_delay() was using a
        quarter of ioc->margin_us - ~12.5%. There aren't strong reasons to use
        different values for the two.
      
      This patch cleans up margin and timer slack handling:
      
      * vtime margins are now recorded in ioc->margins.{min, max} on period
        duration changes and used consistently.
      
      * Timer slack is now 1% of period_us and recorded in ioc->timer_slack_ns and
        used consistently for iocg_kick_waitq() and iocg_kick_delay().
      
      The only functional change is shortening of timer slack. No meaningful
      visible change is expected.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7ca5b2e6
    • Tejun Heo's avatar
      blk-iocost: make ioc_now->now and ioc->period_at 64bit · ce95570a
      Tejun Heo authored
      They are in microseconds and wrap in around 1.2 hours with u32. While
      unlikely, confusions from wraparounds are still possible. We aren't saving
      anything meaningful by keeping these u32. Let's make them u64.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ce95570a
    • Tejun Heo's avatar
      blk-iocost: use WEIGHT_ONE based fixed point number for weights · bd0adb91
      Tejun Heo authored
      To improve weight donations, we want to able to scale inuse with a greater
      accuracy and down below 1. Let's make non-hierarchical weights to use
      WEIGHT_ONE based fixed point numbers too like hierarchical ones.
      
      This doesn't cause any behavior changes yet.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bd0adb91
    • Tejun Heo's avatar
      blk-iocost: s/HWEIGHT_WHOLE/WEIGHT_ONE/g · fe20cdb5
      Tejun Heo authored
      We're gonna use HWEIGHT_WHOLE for regular weights too. Let's rename it to
      WEIGHT_ONE.
      
      Pure rename.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fe20cdb5
    • Tejun Heo's avatar
      blk-iocost: make iocg_kick_waitq() call iocg_kick_delay() after paying debt · 7b84b49e
      Tejun Heo authored
      iocg_kick_waitq() is the function which pays debt and iocg_kick_delay()
      updates the actual delay status accordingly. If iocg_kick_delay() is not
      called after iocg_kick_delay() updated debt, unnecessarily large delays can
      be applied temporarily.
      
      Let's make sure such conditions don't occur by making iocg_kick_waitq()
      always call iocg_kick_delay() after paying debt.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7b84b49e
    • Tejun Heo's avatar
      blk-iocost: move iocg_kick_delay() above iocg_kick_waitq() · 6ef20f78
      Tejun Heo authored
      We'll make iocg_kick_waitq() call iocg_kick_delay(). Reorder them in
      preparation. This is pure code reorganization.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6ef20f78
    • Tejun Heo's avatar
      blk-iocost: clamp inuse and skip noops in __propagate_weights() · db84a72a
      Tejun Heo authored
      __propagate_weights() currently expects the callers to clamp inuse within
      [1, active], which is needlessly fragile. The inuse adjustment logic is
      going to be revamped, in preparation, let's make __propagate_weights() clamp
      inuse on entry.
      
      Also, make it avoid weight updates altogether if neither active or inuse is
      changed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      db84a72a
    • Tejun Heo's avatar
      blk-iocost: rename propagate_active_weights() to propagate_weights() · 00410f1b
      Tejun Heo authored
      It already propagates two weights - active and inuse - and there will be
      another soon. Let's drop the confusing misnomers. Rename
      [__]propagate_active_weights() to [__]propagate_weights() and
      commit_active_weights() to commit_weights().
      
      This is pure rename.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      00410f1b
    • Tejun Heo's avatar
      blk-iocost: use local[64]_t for percpu stat · 5e124f74
      Tejun Heo authored
      blk-iocost has been reading percpu stat counters from remote cpus which on
      some archs can lead to torn reads in really rare occassions. Use local[64]_t
      for those counters.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5e124f74
    • Christoph Hellwig's avatar
    • Christoph Hellwig's avatar
      block: remove the disk argument to delete_partition · 8328eb28
      Christoph Hellwig authored
      We can trivially derive the gendisk from the hd_struct.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8328eb28
  2. 01 Sep, 2020 21 commits