1. 28 Jun, 2013 3 commits
    • drbd: fix error return code in drbd_init() · 6110d70b
      Wei Yongjun authored
      Fix to return a negative error code from the error handling
      case instead of 0, as returned elsewhere in this function.
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • drbd: Do not sleep inside rcu · 26ea8f92
      Andreas Gruenbacher authored
      Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'stable/for-jens-3.10' of... · f35546e0
      Jens Axboe authored
      Merge branch 'stable/for-jens-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.11/drivers
      
      Konrad writes:
      
      It has the 'feature-max-indirect-segments' implemented in both backend
      and frontend. The current problem with the backend and frontend is that the
      segment size is limited to 11 pages. It means we can at most squeeze in 44kB per
      request. The ring can hold 32 (next power of two below 36) requests, meaning we
      can do 1.4M of outstanding requests. Nowadays that is not enough.
      
      The problem in the past was addressed in two ways - but neither one went upstream.
      The first solution to this proposed by Justin from Spectralogic was to negotiate
      the segment size.  This means that the ‘struct blkif_sring_entry’ is now a variable size.
      It can expand from 112 bytes (cover 11 pages of data - 44kB) to 1580 bytes
      (256 pages of data - so 1MB). It is a simple extension by just making the array in the
      request expand from 11 to a variable size negotiated. But it had limits: this extension
      still limits the number of segments per request to 255 (as the total number must be
      specified in the request, which only has an 8-bit field for that purpose).
      
      The other solution (from Intel - Ronghui) was to create one extra ring that only has the
      ‘struct blkif_request_segment’ in them. The ‘struct blkif_request’ would be changed to have
      an index in said ‘segment ring’. There is only one segment ring. This means that the size of
      the initial ring is still the same. The requests would point to the segment and enumerate out
      how many of the indexes it wants to use. The limit is of course the size of the segment.
      If one assumes a one-page segment this means we can in one request cover ~4MB.
      
      Those patches were posted as RFC and the author never followed up on the ideas on changing
      it to be a bit more flexible.
      
      There is yet another mechanism that could be employed (which these patches implement) - and it
      borrows from the VirtIO protocol: ‘indirect descriptors’. This is very similar to
      what Intel suggests, but with a twist. The twist is to negotiate how many of these
      'segment' pages (aka indirect descriptor pages) we want to support (in reality we negotiate
      how many entries in the segment we want to cover, and we modulo the number if it is
      bigger than the segment size).
      
      This means that with the existing 36 slots in the ring (single page) we can cover:
      32 slots * each blkif_request_indirect covers: 512 * 4096 ~= 64M. Since we have ample space
      in the blkif_request_indirect to span more than one indirect page, that number (64M)
      can be also multiplied by eight = 512MB.
      
      Roger Pau Monne took the idea and implemented them in these patches. They work
      great and the corner cases (migration between backends with and without this extension)
      work nicely. The backend has a limit right now of how many indirect entries
      it can handle: one indirect page, and at maximum 256 entries (out of 512 - so 50% of the page
      is used). That comes out to 32 slots * 256 entries in an indirect page * 1 indirect page
      per request * 4096 = 32MB.
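
The capacity figures quoted in this message all come from the same product: usable ring slots * segments per request * page size. The sketch below reproduces that arithmetic, assuming 4096-byte pages as the message does. The helper names (round_down_pow2, ring_capacity_bytes) are made up for illustration and are not part of the Xen headers.

```c
#include <stdint.h>

#define PAGE_BYTES 4096u /* page size assumed by the figures in the message */

/* Largest power of two that does not exceed n (n >= 1). */
static uint32_t round_down_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p <= n / 2)
		p *= 2;
	return p;
}

/* Bytes that can be outstanding: slots * segments per request * page size. */
static uint64_t ring_capacity_bytes(uint32_t slots, uint32_t segs_per_req)
{
	return (uint64_t)slots * segs_per_req * PAGE_BYTES;
}
```

With 36 slots rounded down to 32, 11 segments per request give 1441792 bytes (about 1.4MB), 512 give 64MB, 8 * 512 give 512MB, and the backend's current cap of 256 gives 32MB.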
      
      This is a conservative number that can change in the future. Right now it strikes
      a good balance between giving excellent performance, memory usage in the backend, and
      balancing the needs of many guests.
      
      In the patchset there is also the split of the blkback structure to be per-VBD.
      This means that the spinlock contention we had with many guests trying to do I/O and
      all the blkback threads hitting the same lock has been eliminated.
      
      Also there are bug-fixes to deal with oddly sized sectors, insane amounts of
      requests on the ring, and also a security fix (posted earlier).
  2. 25 Jun, 2013 1 commit
  3. 21 Jun, 2013 2 commits
  4. 19 Jun, 2013 11 commits
  5. 17 Jun, 2013 2 commits
    • xen/blkback: Check for insane amounts of request on the ring (v6). · 8e3f8755
      Konrad Rzeszutek Wilk authored
      Check that the ring does not have an insane amount of requests
      (more than there could fit on the ring).
      
      If we detect this case we will stop processing the requests
      and wait until the XenBus disconnects the ring.
      
      The existing check RING_REQUEST_CONS_OVERFLOW which checks for how
      many responses we have created in the past (rsp_prod_pvt) vs
      requests consumed (req_cons) and whether said difference is greater or
      equal to the size of the ring, does not catch this case.
      
      What the condition does check is whether there is a need to process more,
      as we still have a backlog of responses to finish. Note that both
      of those values (rsp_prod_pvt and req_cons) are not exposed on the
      shared ring.
      
      To understand this problem a mini crash course in ring protocol
      response/request updates is in order.
      
      There are four entries: req_prod and rsp_prod; req_event and rsp_event
      to track the ring entries. We are only concerned about the first two -
      which set the tone of this bug.
      
      The req_prod is a value incremented by frontend for each request put
      on the ring. Conversely the rsp_prod is a value incremented by the backend
      for each response put on the ring (rsp_prod gets set by rsp_prod_pvt when
      pushing the responses on the ring).  Both values can
      wrap and are modulo the size of the ring (in block case that is 32).
      Please see RING_GET_REQUEST and RING_GET_RESPONSE for more details.
      
      The culprit here is that if the difference between the
      req_prod and req_cons is greater than the ring size we have a problem.
      Fortunately for us, the '__do_block_io_op' loop:
      
      	rc = blk_rings->common.req_cons;
      	rp = blk_rings->common.sring->req_prod;
      
      	while (rc != rp) {
      
      		..
      		blk_rings->common.req_cons = ++rc; /* before make_response() */
      
      	}
      
      will loop up to the point when rc == rp. The macro inside of the
      loop (RING_GET_REQUEST) is smart and indexes based on the modulo
      of the ring size. If the frontend has provided a bogus req_prod value
      we will loop until the 'rc == rp' - which means we could be processing
      already processed requests (or responses) often.
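
To make the hazard concrete, here is a small stand-alone model of that loop. It is only a sketch: the 32-slot ring size comes from the text, while walk_ring and the visits[] counter are hypothetical names, not blkback code. It counts how often each slot is touched when the consumer walks from req_cons to an untrusted req_prod.

```c
#include <stdint.h>

#define RING_SLOTS 32u /* block ring size quoted in the message */

/*
 * Walk from req_cons to req_prod the way the loop above does, indexing
 * slots modulo the ring size. Returns the number of iterations and
 * records how many times each slot was visited.
 */
static uint32_t walk_ring(uint32_t req_cons, uint32_t req_prod,
			  uint32_t visits[RING_SLOTS])
{
	uint32_t rc = req_cons;
	uint32_t iterations = 0;
	uint32_t i;

	for (i = 0; i < RING_SLOTS; i++)
		visits[i] = 0;

	while (rc != req_prod) {
		visits[rc & (RING_SLOTS - 1)]++; /* same slot, again and again */
		rc++;                            /* free-running, wraps naturally */
		iterations++;
	}
	return iterations;
}
```

With req_prod = req_cons + 31415, the loop runs 31415 times and every one of the 32 slots is handled roughly 980 times, even though at most 32 requests can really be on the ring.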
      
      The reason the RING_REQUEST_CONS_OVERFLOW is not helping here is
      b/c it only tracks how many responses we have internally produced
      and whether we should process more. The astute reader will
      notice that the macro RING_REQUEST_CONS_OVERFLOW takes two
      arguments - more on this later.
      
      For example, if we were to enter this function with these values:
      
      	blk_rings->common.sring->req_prod = X+31415 (X is the value from
      		the last time __do_block_io_op was called).
      	blk_rings->common.req_cons = X
      	blk_rings->common.rsp_prod_pvt = X
      
      The RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, blk_rings->common.req_cons)
      is doing:
      
      	req_cons - rsp_prod_pvt >= 32
      
      Which is,
      	X - X >= 32 or 0 >= 32
      
      And that is false, so we continue on looping (this bug).
      
      If we re-use said macro RING_REQUEST_CONS_OVERFLOW and pass in the rp
      (sring->req_prod) instead of rc, then this macro can do the check:
      
           req_prod - rsp_prod_pvt >= 32
      
      Which is,
             X + 31415 - X >= 32 , or 31415 >= 32
      
      which is true, so we can error out and break out of the function.
      
      Unfortunately the difference between rsp_prod_pvt and req_prod can be
      at 32 (which would error out in the macro). This condition exists when
      the backend is lagging behind with the responses and still has not finished
      responding to all of them (so make_response has not been called), and
      the rsp_prod_pvt + 32 == req_cons. This ends up with us not being able
      to use said macro.
      
      Hence introducing a new macro called RING_REQUEST_PROD_OVERFLOW which does
      a simple check of:
      
          req_prod - rsp_prod_pvt > RING_SIZE
      
      And with the X values from above:
      
         X + 31415 - X > 32
      
      Returns true. Also note that if the ring is full (which is where
      the RING_REQUEST_CONS_OVERFLOW triggered), we would not hit the
      same condition:
      
         X + 32 - X > 32
      
      Which is false.
      
      Let's use that macro.
      Note that in v5 of this patchset the macro was different - we used an
      earlier version.
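
The two comparisons discussed in this message can be tried out in isolation. The sketch below restates them as plain functions over free-running 32-bit counters. The function names are illustrative only; the real checks are macros in xen/io/ring.h, and the ring size of 32 is the block-ring value from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 32u /* block ring size used in the examples above */

/* ">=" form: true as soon as the distance reaches the ring size. */
static bool cons_overflow(uint32_t rsp_prod_pvt, uint32_t cons)
{
	return (uint32_t)(cons - rsp_prod_pvt) >= RING_SLOTS;
}

/* ">" form: a completely full ring (distance == 32) is still legal. */
static bool prod_overflow(uint32_t rsp_prod_pvt, uint32_t req_prod)
{
	return (uint32_t)(req_prod - rsp_prod_pvt) > RING_SLOTS;
}
```

Because the subtraction is done on unsigned 32-bit values, both checks keep working when the counters wrap. With the X values from the message, prod_overflow() rejects X + 31415 but accepts X + 32, whereas feeding req_prod into the ">=" form would also reject the legitimately full ring.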
      
      Cc: stable@vger.kernel.org
      [v1: Move the check outside the loop]
      [v2: Add a pr_warn as suggested by David]
      [v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan]
      [v4: Move wake_up after kthread_stop as suggested by Jan]
      [v5: Use RING_REQUEST_PROD_OVERFLOW instead]
      [v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: Jan Beulich <jbeulich@suse.com>
      
    • xen/io/ring.h: new macro to detect whether there are too many requests on the ring · 8d925690
      Jan Beulich authored
      Backends may need to protect themselves against an insane number of
      produced requests stored by a frontend, in case they iterate over
      requests until reaching the req_prod value. There can't be more
      requests on the ring than the difference between produced requests
      and produced (but possibly not yet published) responses.
      
      This is a more strict alternative to a patch previously posted by
      Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>.
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  6. 07 Jun, 2013 2 commits
    • xen/blkback: Check device permissions before allowing OP_DISCARD · 604c499c
      Konrad Rzeszutek Wilk authored
      We need to make sure that the device is not RO or that
      the request is not past the number of sectors we want to
      issue the DISCARD operation for.
      
      This fixes CVE-2013-2140.
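
The two conditions named above (device not read-only, request not past the end of the device) can be expressed as a small predicate. This is a hedged sketch, not the blkback code: struct vbd_view and discard_allowed() are hypothetical names, and the wrap-around test is an extra precaution assumed here so that the range check itself cannot be fooled by an overflowing sector count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an exported virtual block device. */
struct vbd_view {
	bool     readonly;   /* exported read-only to the guest */
	uint64_t nr_sectors; /* device size in sectors */
};

/* True only if discarding nr sectors starting at first is acceptable. */
static bool discard_allowed(const struct vbd_view *vbd,
			    uint64_t first, uint64_t nr)
{
	uint64_t end = first + nr;

	if (vbd->readonly)
		return false;              /* never discard on a RO device */
	if (end < first)
		return false;              /* first + nr wrapped around */
	return end <= vbd->nr_sectors; /* range must stay inside the device */
}
```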
      
      Cc: stable@vger.kernel.org
      Acked-by: Jan Beulich <JBeulich@suse.com>
      Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
      [v1: Made it pr_warn instead of pr_debug]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    • xen/blkback: Use physical sector size for setup · 7c4d7d71
      Stefan Bader authored
      Currently xen-blkback passes the logical sector size over xenbus and
      xen-blkfront sets up the paravirt disk with that logical block size.
      But newer drives usually have the logical sector size set to 512 for
      compatibility reasons and would show the actual sector size only in
      physical sector size.
      This results in the device being partitioned and accessed in dom0 with
      the correct sector size, but the guest thinks 512 bytes is the correct
      block size. And that results in poor performance.
      
      To fix this, blkback gets modified to pass also physical-sector-size
      over xenbus and blkfront to use both values to set up the paravirt
      disk. I did not just change the passed in sector-size because I am
      not sure having a bigger logical sector size than the physical one
      is valid (and that would happen if a newer dom0 kernel hits an older
      domU kernel). Also this way a domU set up before should still be
      accessible (just some tools might detect the unaligned setup).
      
      [v2: Make xenbus write failure non-fatal]
      [v3: Use xenbus_scanf instead of xenbus_gather]
      [v4: Rebased against segment changes]
      Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  7. 04 Jun, 2013 2 commits
  8. 14 May, 2013 17 commits
    • blk-throttle: implement proper hierarchy support · 9138125b
      Tejun Heo authored
      With the recent updates, blk-throttle is finally ready for proper
      hierarchy support.  Dispatching now honors service_queue->parent_sq
      and propagates correctly.  The only thing missing is setting
      ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
      hierarchy.
      
      This patch updates throtl_pd_init() such that service_queues form the
      same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
      As this concludes proper hierarchy support for blkcg, the shameful
      .broken_hierarchy tag is removed from blkio_subsys.
      
      v2: Updated blkio-controller.txt as suggested by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
    • blk-throttle: implement throtl_grp->has_rules[] · 693e751e
      Tejun Heo authored
      blk_throtl_bio() has a quick exit path for throtl_grps without limits
      configured.  It looks at the bps and iops limits and if both are not
      configured, the bio is issued immediately.  While this is correct in
      the current flat hierarchy as each throtl_grp behaves completely
      independently, it would become wrong in proper hierarchy mode.  A
      group without any limits could still be limited by one of its
      ancestors and bio's queued for such group should not bypass
      blk-throtl.
      
      As having a quick bypass mechanism is beneficial, this patch
      reimplements the mechanism such that it's correct even with proper
      hierarchy.  throtl_grp->has_rules[] is added.  These booleans are
      updated for the whole subtree whenever a config is updated so that
      has_rules[] of the whole subtree stays synchronized.  They're also
      updated when a new throtl_grp comes online so that it can't escape the
      limits of its ancestors.
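
A minimal sketch of the rule described above: a group "has rules" in a direction if it has its own limit or any ancestor has rules. The struct layout, the NO_LIMIT sentinel and the function name are simplifications assumed for illustration; only the idea of a per-direction has_rules[] flag comes from the patch description.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { READ_DIR = 0, WRITE_DIR = 1 };
#define NO_LIMIT UINT64_MAX /* stand-in for "limit not configured" */

struct grp {
	struct grp *parent;  /* NULL for the root group */
	uint64_t bps[2];     /* bytes/sec limit per direction */
	uint64_t iops[2];    /* IOs/sec limit per direction */
	bool has_rules[2];   /* own or inherited limit exists */
};

/* Recompute has_rules[] for one group; its parent must be up to date. */
static void grp_update_has_rules(struct grp *g)
{
	int rw;

	for (rw = READ_DIR; rw <= WRITE_DIR; rw++) {
		bool own = g->bps[rw] != NO_LIMIT || g->iops[rw] != NO_LIMIT;
		bool inherited = g->parent && g->parent->has_rules[rw];

		g->has_rules[rw] = own || inherited;
	}
}
```

Calling this top-down over a subtree after a config change keeps every descendant's flags in sync, so the quick-exit path only needs to test has_rules[rw] of the group the bio belongs to.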
      
      As no throtl_grp has another throtl_grp as parent now, this patch
      doesn't yet make any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: Account for child group's start time in parent while bio climbs up · 32ee5bc4
      Vivek Goyal authored
      With the planned proper hierarchy support, a bio will climb up the
      tree before actually being dispatched. This makes sure bio is also
      subjected to parent's throttling limits, if any.
      
      It might happen that parent is idle and when bio is transferred to
      parent, a new slice starts fresh. But that is incorrect as parent's
      wait time should have started when bio was queued in child group, and
      starting fresh causes IOs to be throttled more than configured as they
      climb the hierarchy.
      
      Given the fact that we have not written hierarchical algorithm in a
      way where child's and parent's time slices are synchronized, we
      transfer the child's start time to parent if parent was idling.  If
      parent was busy doing dispatch of other bios all this while, this is
      not an issue.
      
      Child's slice start time is passed to parent. Parent looks at its
      last expired slice start time. If child's start time is after parent's
      old start time, that means parent had been idle and after parent
      went idle, child had an IO queued. So use child's start time as
      parent start time.
      
      If parent's start time is after child's start time, that means,
      when IO got queued in child group, parent was not idle. But later
      it dispatched some IO, its slice got trimmed and then it went idle.
      After a while child's request got shifted in parent group. In this
      case use parent's old start time as new start time as that's the
      duration of slice we did not use.
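
Both cases reduce to picking the later of the two start times. The sketch below shows that choice with a wrap-safe comparison, assuming a free-running tick counter in the style of jiffies. tick_after_eq() and parent_slice_start() are illustrative names, not the functions added by the patch.

```c
#include <stdbool.h>

/* Wrap-safe "a is at or after b" for a free-running tick counter. */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Start time for the parent's new slice when a bio arrives from a child
 * and the parent's previous slice has expired: the child's start time
 * if it is the later one, otherwise the parent's old start time.
 */
static unsigned long parent_slice_start(unsigned long parent_old_start,
					unsigned long child_start)
{
	return tick_after_eq(child_start, parent_old_start)
		? child_start : parent_old_start;
}
```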
      
      This logic is far from perfect as if there are multiple children
      then first child transferring the bio decides the start time while
      a bio might have queued up even earlier in other child, which is
      yet to be transferred up to parent. In that case we will lose
      time and bandwidth in parent. This patch is just an approximation
      to make situation somewhat better.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blk-throttle: add throtl_qnode for dispatch fairness · c5cc2070
      Tejun Heo authored
      With flat hierarchy, there's only single level of dispatching
      happening and fairness beyond that point is the responsibility of the
      rest of the block layer and driver, which usually works out okay;
      however, with the planned hierarchy support,
      service_queue->bio_lists[] can be filled up by bios from a single
      source.  While the limits would still be honored, it'd be very easy to
      starve IOs from siblings or children.
      
      To avoid such starvation, this patch implements throtl_qnode and
      converts service_queue->bio_lists[] to lists of per-source qnodes
      which in turn contain the bios.  For example, when a bio is
      dispatched from a child group, the bio doesn't get queued on
      ->bio_lists[] directly but it first gets queued on the group's qnode
      which in turn gets queued on service_queue->queued[].  When
      dispatching for the upper level, the ->queued[] list is consumed in
      round-robin order so that the dispatch window is consumed fairly by
      all IO sources.
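
The round-robin consumption can be modelled with fixed-size arrays. This is a sketch under stated assumptions: bios are plain ints, the capacities are arbitrary, and struct qnode / struct queued with queued_add() / queued_pop() are hypothetical stand-ins for the list-based kernel structures.

```c
#include <stddef.h>

#define MAX_SRC 8  /* arbitrary capacity for the sketch */
#define MAX_BIO 16

struct qnode {
	int bios[MAX_BIO]; /* FIFO of bio ids from one source */
	size_t head, count;
};

struct queued {
	struct qnode *active[MAX_SRC]; /* qnodes with pending bios, in service order */
	size_t n;
};

/* Queue a bio on its source's qnode; activate the qnode on first use. */
static int queued_add(struct queued *q, struct qnode *qn, int bio)
{
	if (qn->count == MAX_BIO)
		return -1;
	if (qn->count == 0) {
		if (q->n == MAX_SRC)
			return -1;
		q->active[q->n++] = qn;
	}
	qn->bios[(qn->head + qn->count) % MAX_BIO] = bio;
	qn->count++;
	return 0;
}

/* Pop one bio from the first qnode, then rotate that qnode to the tail. */
static int queued_pop(struct queued *q)
{
	struct qnode *qn;
	size_t i;
	int bio;

	if (q->n == 0)
		return -1;
	qn = q->active[0];
	bio = qn->bios[qn->head];
	qn->head = (qn->head + 1) % MAX_BIO;
	qn->count--;
	for (i = 1; i < q->n; i++)
		q->active[i - 1] = q->active[i];
	if (qn->count > 0)
		q->active[q->n - 1] = qn; /* still has bios: back of the line */
	else
		q->n--;                   /* drained: drop from the active list */
	return bio;
}
```

If source A queues bios 1, 2, 3 and source B then queues bio 10, popping yields 1, 10, 2, 3, so B is not stuck behind all of A's bios.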
      
      There are two ways a bio can come to a throtl_grp - directly queued to
      the group or dispatched from a child.  For the former
      throtl_grp->qnode_on_self[rw] is used.  For the latter, the child's
      ->qnode_on_parent[rw].
      
      Note that this means that the child which is contributing a bio to its
      parent should stay pinned until all its bios are dispatched to its
      grand-parent.  This patch moves blkg refcnting from bio add/remove
      spots to qnode activation/deactivation so that the blkg containing an
      active qnode is always pinned.  As child pins the parent, this is
      sufficient for keeping the relevant sub-tree pinned while bios are in
      flight.
      
      The starvation issue was spotted by Vivek Goyal.
      
      v2: The original patch used the same throtl_grp->qnode_on_self/parent
          for reads and writes causing RWs to be queued incorrectly if there
          already are outstanding IOs in the other direction.  They should
          be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs
          can use different qnodes.  Spotted by Vivek Goyal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: make throtl_pending_timer_fn() ready for hierarchy · 2e48a530
      Tejun Heo authored
      throtl_pending_timer_fn() currently assumes that the parent_sq is the
      top level one and the bio's dispatched are ready to be issued;
      however, this assumption will be wrong with proper hierarchy support.
      This patch makes the following changes to make
      throtl_pending_timer_fn() ready for hierarchy.
      
      * If the parent_sq isn't the top-level one, update the parent
        throtl_grp's dispatch time and schedule the next dispatch as
        necessary.  If the parent's dispatch time is now, repeat the
        function for the parent throtl_grp.
      
      * If the parent_sq is the top-level one, kick issue work_item as
        before.
      
      * The debug message printed by throtl_log() now prints out the
        service_queue's nr_queued[] instead of the total nr_queued as the
        latter becomes uninteresting and misleading with hierarchical
        dispatch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: make tg_dispatch_one_bio() ready for hierarchy · 6bc9c2b4
      Tejun Heo authored
      tg_dispatch_one_bio() currently assumes that the parent_sq is the top
      level one and the bio being dispatched is ready to be issued; however,
      this assumption will be wrong with proper hierarchy support.  This
      patch makes the following changes to make tg_dispatch_one_bio() ready
      for hierarchy.
      
      * throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead
        of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to
        transfer a bio from a child tg to its parent.
      
      * tg_dispatch_one_bio() is updated to distinguish whether its parent
        is another throtl_grp or the throtl_data.  If former, the bio is
        transferred to the parent throtl_grp using throtl_add_bio_tg().  If
        latter, the bio is ready to be issued and put on the top-level
        service_queue's bio_lists[] and throtl_data->nr_queued is
        decremented.
      
      As all throtl_grps currently have the top level service_queue as their
      ->parent_sq, this patch in itself doesn't make any behavior
      difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: make blk_throtl_bio() ready for hierarchy · 9e660acf
      Tejun Heo authored
      Currently, blk_throtl_bio() issues the passed in bio directly if it's
      within limits of its associated tg (throtl_grp).  This behavior
      becomes incorrect with hierarchy support as the bio should be
      accounted to and throttled by the ancestor throtl_grps too.
      
      This patch makes the direct issue path of blk_throtl_bio() loop
      until it reaches the top-level service_queue or gets throttled.  If
      the former, the bio can be issued directly; otherwise, it gets queued
      at the first layer it was above limits.
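
The climbing loop can be sketched as follows. The per-group byte budget is a deliberately crude stand-in for the real bps/iops slice accounting, and struct tgrp and submit_bio_sketch() are hypothetical names. Only the control flow (charge, move to the parent, stop at the first level that is over its limit or already has bios queued) follows the description.

```c
#include <stddef.h>
#include <stdint.h>

struct tgrp {
	struct tgrp *parent; /* NULL: the parent is the top-level queue */
	uint64_t budget;     /* bytes still allowed in the current slice */
	unsigned int queued; /* bios already waiting at this level */
};

/*
 * Returns NULL if the bio passed every level and may be issued directly,
 * otherwise the group at which it was queued.
 */
static struct tgrp *submit_bio_sketch(struct tgrp *tg, uint64_t bytes)
{
	for (; tg; tg = tg->parent) {
		/* Queue behind existing bios, or when over the limit. */
		if (tg->queued || bytes > tg->budget) {
			tg->queued++;
			return tg;
		}
		tg->budget -= bytes; /* within limits: charge and climb */
	}
	return NULL;
}
```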
      
      As tg->parent_sq is always the top-level service queue currently, this
      patch in itself doesn't make any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: make blk_throtl_drain() ready for hierarchy · 2a12f0dc
      Tejun Heo authored
      The current blk_throtl_drain() assumes that all active throtl_grps are
      queued on throtl_data->service_queue, which won't be true once
      hierarchy support is implemented.
      
      This patch makes blk_throtl_drain() perform post-order walk of the
      blkg hierarchy draining each associated throtl_grp, which guarantees
      that all bios will eventually be pushed to the top-level service_queue
      in throtl_data.
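
Why post-order is enough can be seen in a toy model: if every child is drained into its parent before the parent itself is drained, everything ends up at the root. The sketch uses a hypothetical struct node with a plain counter in place of queued bios and a recursive walk instead of the kernel's blkg iterator.

```c
#include <stddef.h>

#define MAX_CHILDREN 4 /* arbitrary fan-out for the sketch */

struct node {
	struct node *children[MAX_CHILDREN];
	size_t nr_children;
	unsigned int queued; /* bios waiting at this level */
};

/* Drain the subtree below n into n, children before parents. */
static void drain_subtree(struct node *n)
{
	size_t i;

	for (i = 0; i < n->nr_children; i++) {
		struct node *c = n->children[i];

		drain_subtree(c);       /* descendants first (post-order) */
		n->queued += c->queued; /* then push the child's bios up */
		c->queued = 0;
	}
}
```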
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: dispatch from throtl_pending_timer_fn() · 6e1a5704
      Tejun Heo authored
      Currently, blk_throtl_dispatch_work_fn() is responsible for both
      dispatching bio's from throtl_grp's according to their limits and then
      issuing the dispatched bios.
      
      This patch moves the dispatch part to throtl_pending_timer_fn() so
      that the work item is kicked iff there are bio's to issue.  This is to
      avoid work item execution at each step when hierarchy support is
      enabled.  bio's will be dispatched towards the top-level service_queue
      from the timers at each layer and the work item will only be used to
      issue the bio's which reached the top-level service_queue.
      
      While fetching bio's to issue from bio_lists[],
      blk_throtl_dispatch_work_fn() fetches all READs before WRITEs.  While
      the original code also dispatched READs first, if multiple throtl_grps
      are dispatched on the same run, WRITEs from throtl_grp which is
      dispatched first would precede READs from throtl_grps which are
      dispatched later.  While this is a behavior change, given that the
      previous code already prioritized READs and block layer generally
      prioritizes and segregates READs from WRITEs, this isn't likely to
      make any noticeable differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: implement dispatch looping · 7f52f98c
      Tejun Heo authored
      throtl_select_dispatch() only dispatches throtl_quantum bios on each
      invocation.  blk_throtl_dispatch_work_fn() in turn depends on
      throtl_schedule_next_dispatch() scheduling the next dispatch window
      immediately so that undue delays aren't incurred.  This effectively
      chains multiple dispatch work item executions back-to-back when there
      are more than throtl_quantum bios to dispatch on a given tick.
      
      There is no reason to finish the current work item just to repeat it
      immediately.  This patch makes throtl_schedule_next_dispatch() return
      %false without doing anything if the current dispatch window is still
      open and updates blk_throtl_dispatch_work_fn() to repeat dispatching
      after cpu_relax() on %false return.
      
      This change will help implementing hierarchy support as dispatching
      will be done from pending_timer and immediate reschedule of timer
      function isn't supported and doesn't make much sense.
      
      While this patch changes how dispatch behaves when there are more than
      throtl_quantum bios to dispatch on a single tick, the behavior change
      is immaterial.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work · 69df0ab0
      Tejun Heo authored
      Currently, throtl_data->dispatch_work is a delayed_work item which
      handles both delayed dispatch and issuing bios.  The two tasks will be
      separated to support proper hierarchy.  To prepare for that, this
      patch separates out the timer into throtl_service_queue->pending_timer
      from throtl_data->dispatch_work and makes the latter a work_struct.
      
      * As the timer is now per-service_queue, it's initialized and
        del_sync'd as its corresponding service_queue is created and
        destroyed.  The timer, when triggered, simply schedules
        throtl_data->dispatch_work for execution.
      
      * throtl_schedule_delayed_work() is renamed to
        throtl_schedule_pending_timer() and takes @sq and @expires now.
      
      * Similarly, throtl_schedule_next_dispatch() now takes @sq, which
        should be the parent_sq of the service_queue which just got a new
        bio or updated.  As the parent_sq is always the top-level
        service_queue now, this doesn't change anything at this point.
      
      This patch doesn't introduce any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it · 2a0f61e6
      Tejun Heo authored
      With proper hierarchy support, a bio can be dispatched multiple times
      until it reaches the top-level service_queue and we don't want to
      update dispatch stats at each step.  They are local stats and will be
      kept local.  If recursive stats are necessary, they should be
      implemented separately and definitely not by updating counters
      recursively on each dispatch.
      
      This patch moves REQ_THROTTLED setting to throtl_charge_bio() and gates
      stats update with it so that dispatch stats are updated only on the
      first time the bio is charged to a throtl_grp, which will always be
      the throtl_grp the bio was originally queued to.
      
      This means that REQ_THROTTLED would be set even for bios which don't
      get throttled.  As we don't want bios to leave blk-throtl with the
      flag set, move REQ_THROTTLED clearing to the end of blk_throtl_bio()
      and clear if the bio is being issued directly.
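
A hedged sketch of the gating idea: the flag doubles as an "already accounted" marker, so charging the same bio again at a higher level leaves the stats alone. The types and the BIO_THROTTLED bit are stand-ins invented for the example; the real flag is REQ_THROTTLED on the bio and the stats live in the blkcg structures.

```c
#define BIO_THROTTLED 0x1u /* stand-in for REQ_THROTTLED */

struct bio_sketch {
	unsigned int flags;
	unsigned int bytes;
};

struct stats_sketch {
	unsigned long ios;
	unsigned long bytes;
};

/* Charge a bio to a group; update that group's stats on the first charge only. */
static void charge_bio(struct bio_sketch *bio, struct stats_sketch *stats)
{
	if (!(bio->flags & BIO_THROTTLED)) {
		bio->flags |= BIO_THROTTLED;
		stats->ios++;
		stats->bytes += bio->bytes;
	}
}

/* A bio that is issued directly must not leave with the flag still set. */
static void finish_direct_issue(struct bio_sketch *bio)
{
	bio->flags &= ~BIO_THROTTLED;
}
```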
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() · fda6f272
      Tejun Heo authored
      Now that both throtl_data and throtl_grp embed throtl_service_queue,
      we can unify throtl_log() and throtl_log_tg().
      
      * sq_to_tg() is added.  This returns the throtl_grp a service_queue is
        embedded in.  If the service_queue is the top-level one embedded in
        throtl_data, NULL is returned.
      
      * sq_to_td() is added.  A service_queue is always associated with a
        throtl_data.  This function finds the associated td and returns it.
      
      * throtl_log() is updated to take throtl_service_queue instead of
        throtl_data.  If the service_queue is one embedded in throtl_grp, it
        prints the same header as throtl_log_tg() did.  If it's one embedded
        in throtl_data, it behaves the same as before.  This renders
        throtl_log_tg() unnecessary.  Removed.
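
The two lookups rely on the service_queue being embedded in its owner, so the owner can be recovered with pointer arithmetic. The sketch below uses cut-down structs and a local container_of-style macro. All names ending in _sketch are hypothetical, and telling the two owners apart via parent_sq being NULL for the top-level queue is an assumption of this sketch.

```c
#include <stddef.h>

struct service_queue {
	struct service_queue *parent_sq; /* NULL only for the top-level queue */
};

struct data_sketch {                 /* plays the role of throtl_data */
	struct service_queue service_queue;
};

struct grp_sketch {                  /* plays the role of throtl_grp */
	struct data_sketch *td;
	struct service_queue service_queue;
};

/* Recover the enclosing struct from a pointer to one of its members. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Group the queue is embedded in, or NULL for the top-level queue. */
static struct grp_sketch *sq_to_grp(struct service_queue *sq)
{
	if (sq && sq->parent_sq)
		return container_of_sketch(sq, struct grp_sketch, service_queue);
	return NULL;
}

/* The data structure a queue is associated with, whichever owner embeds it. */
static struct data_sketch *sq_to_data(struct service_queue *sq)
{
	struct grp_sketch *g = sq_to_grp(sq);

	return g ? g->td
		 : container_of_sketch(sq, struct data_sketch, service_queue);
}
```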
      
      This change is necessary for hierarchy support as we're gonna be using
      the same code paths to dispatch bios to intermediate service_queues
      embedded in throtl_grps and the top-level service_queue embedded in
      throtl_data.
      
      This patch doesn't make any behavior changes.
      
      v2: throtl_log() didn't print a space after blkg path.  Updated so
          that it prints a space after throtl_grp path.  Spotted by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
    • blk-throttle: add throtl_service_queue->parent_sq · 77216b04
      Tejun Heo authored
      To prepare for hierarchy support, this patch adds
      throtl_service_queue->parent_sq which points to the parent
      service_queue.  Currently, for all service_queues embedded in
      throtl_grps, it points to throtl_data->service_queue.  As
      throtl_data->service_queue doesn't have a parent, its parent_sq is set
      to NULL.
      
      There are a number of functions which take both throtl_grp *tg and
      throtl_service_queue *parent_sq.  With this patch, the parent
      service_queue can be determined from @tg and the @parent_sq arguments
      are removed.
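
As a rough illustration of why the extra argument becomes redundant, the sketch below stores the parent pointer once at init time and looks it up from the group afterwards. The helper names tg_init_sq() and tg_enqueue() and the nr_pending counter are hypothetical and exist only for this example; the struct layouts are simplified.

```c
/* Simplified layouts; nr_pending is a hypothetical counter for illustration. */
struct throtl_service_queue {
	struct throtl_service_queue *parent_sq;	/* NULL for the top-level queue */
	unsigned int nr_pending;
};

struct throtl_data {
	struct throtl_service_queue service_queue;
};

struct throtl_grp {
	struct throtl_service_queue service_queue;
};

/* Hypothetical init helper: a group's queue points at the top-level one. */
static void tg_init_sq(struct throtl_grp *tg, struct throtl_data *td)
{
	tg->service_queue.parent_sq = &td->service_queue;
	tg->service_queue.nr_pending = 0;
}

/*
 * Before: tg_enqueue(struct throtl_grp *tg,
 *                    struct throtl_service_queue *parent_sq)
 * After:  the parent is derived from @tg, so the argument is dropped.
 */
static void tg_enqueue(struct throtl_grp *tg)
{
	struct throtl_service_queue *parent_sq = tg->service_queue.parent_sq;

	parent_sq->nr_pending++;
}
```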
      
      This patch doesn't make any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      77216b04
    • Tejun Heo's avatar
      blk-throttle: generalize update_disptime optimization in blk_throtl_bio() · 0e9f4164
      Tejun Heo authored
      When blk_throtl_bio() wants to queue a bio to a tg (throtl_grp), it
      avoids invoking tg_update_disptime() and
      throtl_schedule_next_dispatch() if the tg already has bios queued in
      that direction.  As a new bio is appended after the existing ones, it
      can't change the tg's next dispatch time or the parent's dispatch
      schedule.
      
      This optimization is currently open coded in blk_throtl_bio().
      Whether the target biolist was occupied was recorded in a local
      variable and later used to skip disptime update.  This patch
      generalizes it so that throtl_add_bio_tg() sets a new flag
      THROTL_TG_WAS_EMPTY if the biolist was empty before the new bio was
      added.  tg_update_disptime() clears the flag automatically.
      blk_throtl_bio() is updated to simply test the flag before updating
      disptime.
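
The flag handshake can be sketched as follows. This is a simplified userspace model under stated assumptions: the bio list is reduced to a counter, the bit value of THROTL_TG_WAS_EMPTY is chosen arbitrarily, and queue_bio() and disptime_updates are hypothetical names standing in for the relevant part of blk_throtl_bio().

```c
enum tg_state_flags {
	THROTL_TG_WAS_EMPTY = 1 << 1,	/* list was empty before the last add (bit value assumed) */
};

struct throtl_grp {
	unsigned int flags;
	unsigned int nr_queued;		/* stand-in for a real bio list */
	unsigned int disptime_updates;	/* hypothetical counter, for illustration */
};

/* Queue one bio; remember whether the list was empty beforehand. */
static void throtl_add_bio_tg(struct throtl_grp *tg)
{
	if (!tg->nr_queued)
		tg->flags |= THROTL_TG_WAS_EMPTY;
	tg->nr_queued++;
}

/* Recompute the dispatch time (modelled as a counter) and clear the flag. */
static void tg_update_disptime(struct throtl_grp *tg)
{
	tg->disptime_updates++;
	tg->flags &= ~THROTL_TG_WAS_EMPTY;
}

/* Hypothetical caller: only update disptime if the list was empty. */
static void queue_bio(struct throtl_grp *tg)
{
	throtl_add_bio_tg(tg);
	if (tg->flags & THROTL_TG_WAS_EMPTY)
		tg_update_disptime(tg);
}
```

Because the flag lives in the group rather than in a local variable, any caller that adds a bio can apply the same check, which is what recursive dispatch would need.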
      
      This patch doesn't make any functional differences now but will enable
      using the same optimization for recursive dispatch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      0e9f4164
    • Tejun Heo's avatar
      blk-throttle: dispatch to throtl_data->service_queue.bio_lists[] · 651930bc
      Tejun Heo authored
      throtl_service_queues will eventually form a tree which is anchored at
      throtl_data->service_queue and queued bios will climb the tree to the
      top service_queue to be executed.
      
      This patch makes the dispatch paths in blk_throtl_dispatch_work_fn()
      and blk_throtl_drain() dispatch bios to
      throtl_data->service_queue.bio_lists[] instead of the on-stack
      bio_lists.  This lets the final dispatch to the top-level
      service_queue share the same mechanism as dispatches through the rest
      of the hierarchy.
      
      As bios should be issued in a sleepable context,
      blk_throtl_dispatch_work_fn() transfers all dispatched bios from the
      service_queue bio_lists[] into an on-stack one before dropping
      queue_lock and issuing the bios.
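
The transfer-then-issue pattern can be modelled in userspace roughly as below. This is a sketch under assumptions, not the kernel implementation: a single list stands in for bio_lists[], the lock is modelled by a boolean so the "never issue under the lock" rule can be asserted, and bio_list_add()/bio_list_pop() are simplified local versions of the kernel helpers of the same name. dispatch_work() and issue_bio() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct bio {
	struct bio *next;
	bool issued;
};

struct bio_list {
	struct bio *head, *tail;
};

static bool queue_lock_held;	/* stand-in for the request queue's spinlock */

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
	bio->next = NULL;
	if (bl->tail)
		bl->tail->next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
	struct bio *bio = bl->head;

	if (bio) {
		bl->head = bio->next;
		if (!bl->head)
			bl->tail = NULL;
		bio->next = NULL;
	}
	return bio;
}

/* Issuing may sleep, so it must not run with the lock held. */
static void issue_bio(struct bio *bio)
{
	assert(!queue_lock_held);
	bio->issued = true;
}

/* Move everything to an on-stack list under the lock, then issue unlocked. */
static unsigned int dispatch_work(struct bio_list *sq_list)
{
	struct bio_list on_stack = { NULL, NULL };
	struct bio *bio;
	unsigned int nr = 0;

	queue_lock_held = true;		/* models taking queue_lock */
	while ((bio = bio_list_pop(sq_list)))
		bio_list_add(&on_stack, bio);
	queue_lock_held = false;	/* models dropping queue_lock */

	while ((bio = bio_list_pop(&on_stack))) {
		issue_bio(bio);
		nr++;
	}
	return nr;
}
```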
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      651930bc
    • Tejun Heo's avatar
      blk-throttle: move bio_lists[] and friends to throtl_service_queue · 73f0d49a
      Tejun Heo authored
      throtl_service_queues will eventually form a tree which is anchored at
      throtl_data->service_queue and queued bios will climb the tree to the
      top service_queue to be executed.
      
      This patch moves bio_lists[] and nr_queued[] from throtl_grp to its
      service_queue to prepare for that.  As currently only the
      throtl_data->service_queue is in use, this patch just ends up moving
      throtl_grp->bio_lists[] and ->nr_queued[] to
      throtl_grp->service_queue.bio_lists[] and ->nr_queued[] without making
      any functional differences.
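
In struct terms the move looks roughly like this. The layouts are simplified assumptions that keep only the relocated fields, and tg_nr_queued() is a hypothetical accessor added to show how an access path changes.

```c
struct bio;

struct bio_list {
	struct bio *head, *tail;
};

/* The per-direction lists and counters now live in the service queue. */
struct throtl_service_queue {
	struct bio_list bio_lists[2];	/* indexed by direction: read, write */
	unsigned int nr_queued[2];
};

struct throtl_grp {
	struct throtl_service_queue service_queue;
	/* bio_lists[] and nr_queued[] used to be members of this struct */
};

/* Hypothetical accessor: was tg->nr_queued[rw] before the move. */
static unsigned int tg_nr_queued(const struct throtl_grp *tg, int rw)
{
	return tg->service_queue.nr_queued[rw];
}
```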
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      73f0d49a