1. 18 Dec, 2011 11 commits
    • Wu Fengguang's avatar
      writeback: balanced_rate cannot exceed write bandwidth · bdaac490
      Wu Fengguang authored
      Add an upper limit to balanced_rate according to the below inequality.
      This filters out some rare but huge singular points, which at least
      enables more readable gnuplot figures.
      
      When there are N dd dirtiers,
      
      	balanced_dirty_ratelimit = write_bw / N
      
      So it holds that
      
      	balanced_dirty_ratelimit <= write_bw
      
      The singular points originate from dirty_rate in the below formular:
      
              balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
      where
      	dirty_rate = (number of page dirties in the past 200ms) / 200ms
      
      In the extreme case, if all dd tasks suddenly get blocked on something
      else and hence no pages are dirtied at all, dirty_rate will be 0 and
      balanced_dirty_ratelimit will be inf. This could happen in reality.
      
      Note that these huge singular points are not a real threat, since they
      are _guaranteed_ to be filtered out by the
      	min(balanced_dirty_ratelimit, task_ratelimit)
      line in bdi_update_dirty_ratelimit(). task_ratelimit is based on the
      number of dirty pages, which will never _suddenly_ fly away like
      balanced_dirty_ratelimit. So any weirdly large balanced_dirty_ratelimit
      will be cut down to the level of task_ratelimit.
      
      There won't be tiny singular points though, as long as the dirty pages
      lie inside the dirty throttling region (above the freerun region).
      Because there the dd tasks will be throttled by balanced_dirty_pages()
      and won't be able to suddenly dirty much more pages than average.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      bdaac490
    • Wu Fengguang's avatar
      writeback: do strict bdi dirty_exceeded · 82791940
      Wu Fengguang authored
      This helps to reduce dirty throttling polls and hence CPU overheads.
      
      bdi->dirty_exceeded typically only helps when suddenly starting 100+
      dd's on a disk, in which case the dd's may need to poll
      balance_dirty_pages() earlier than tsk->nr_dirtied_pause.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      82791940
    • Wu Fengguang's avatar
      writeback: avoid tiny dirty poll intervals · 5b9b3574
      Wu Fengguang authored
      The LKP tests see big 56% regression for the case fio_mmap_randwrite_64k.
      Shaohua manages to root cause it to be the much smaller dirty pause times
      and hence much more frequent invocations to the IO-less balance_dirty_pages().
      Since fio_mmap_randwrite_64k effectively contains both reads and writes,
      the more frequent pauses triggered more idling in the cfq IO scheduler.
      
      The solution is to increase pause time all the way up to the max 200ms
      in this case, which is found to restore most performance. This will help
      reduce CPU overheads in other cases, too.
      
      Note that I don't expect many performance critical workloads to run this
      access pattern: the mmap read-on-write is rather inefficient and could
      be avoided by doing normal writes syscalls.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reported-by: default avatarLi Shaohua <shaohua.li@intel.com>
      Tested-by: default avatarLi Shaohua <shaohua.li@intel.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      5b9b3574
    • Wu Fengguang's avatar
      writeback: max, min and target dirty pause time · 7ccb9ad5
      Wu Fengguang authored
      Control the pause time and the call intervals to balance_dirty_pages()
      with three parameters:
      
      1) max_pause, limited by bdi_dirty and MAX_PAUSE
      
      2) the target pause time, grows with the number of dd tasks
         and is normally limited by max_pause/2
      
      3) the minimal pause, set to half the target pause
         and is used to skip short sleeps and accumulate them into bigger ones
      
      The typical behaviors after patch:
      
      - if ever task_ratelimit is far below dirty_ratelimit, the pause time
        will remain constant at max_pause and nr_dirtied_pause will be
        fluctuating with task_ratelimit
      
      - in the normal cases, nr_dirtied_pause will remain stable (keep in the
        same pace with dirty_ratelimit) and the pause time will be fluctuating
        with task_ratelimit
      
      In summary, someone has to fluctuate with task_ratelimit, because
      
      	task_ratelimit = nr_dirtied_pause / pause
      
      We normally prefer a stable nr_dirtied_pause, until reaching max_pause.
      
      The notable behavior changes are:
      
      - in stable workloads, there will no longer be sudden big trajectory
        switching of nr_dirtied_pause as concerned by Peter. It will be as
        smooth as dirty_ratelimit and changing proportionally with it (as
        always, assuming bdi bandwidth does not fluctuate across 2^N lines,
        otherwise nr_dirtied_pause will show up in 2+ parallel trajectories)
      
      - in the rare cases when something keeps task_ratelimit far below
        dirty_ratelimit, the smoothness can no longer be retained and
        nr_dirtied_pause will be "dancing" with task_ratelimit. This fixes a
        (not that destructive but still not good) bug that
      	  dirty_ratelimit gets brought down undesirably
      	  <= balanced_dirty_ratelimit is under estimated
      	  <= weakly executed task_ratelimit
      	  <= pause goes too large and gets trimmed down to max_pause
      	  <= nr_dirtied_pause (based on dirty_ratelimit) is set too large
      	  <= dirty_ratelimit being much larger than task_ratelimit
      
      - introduce min_pause to avoid small pause sleeps
      
      - when pause is trimmed down to max_pause, try to compensate it at the
        next pause time
      
      The "refactor" type of changes are:
      
      The max_pause equation is slightly transformed to make it slightly more
      efficient.
      
      We now scale target_pause by (N * 10ms) on 2^N concurrent tasks, which
      is effectively equal to the original scaling max_pause by (N * 20ms)
      because the original code does implicit target_pause ~= max_pause / 2.
      Based on the same implicit ratio, target_pause starts with 10ms on 1 dd.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      7ccb9ad5
    • Wu Fengguang's avatar
      writeback: dirty ratelimit - think time compensation · 83712358
      Wu Fengguang authored
      Compensate the task's think time when computing the final pause time,
      so that ->dirty_ratelimit can be executed accurately.
      
              think time := time spend outside of balance_dirty_pages()
      
      In the rare case that the task slept longer than the 200ms period time
      (result in negative pause time), the sleep time will be compensated in
      the following periods, too, if it's less than 1 second.
      
      Accumulated errors are carefully avoided as long as the max pause area
      is not hitted.
      
      Pseudo code:
      
              period = pages_dirtied / task_ratelimit;
              think = jiffies - dirty_paused_when;
              pause = period - think;
      
      1) normal case: period > think
      
              pause = period - think
              dirty_paused_when = jiffies + pause
              nr_dirtied = 0
      
                                   period time
                    |===============================>|
                        think time      pause time
                    |===============>|==============>|
              ------|----------------|---------------|------------------------
              dirty_paused_when   jiffies
      
      2) no pause case: period <= think
      
              don't pause; reduce future pause time by:
              dirty_paused_when += period
              nr_dirtied = 0
      
                                 period time
                    |===============================>|
                                        think time
                    |===================================================>|
              ------|--------------------------------+-------------------|----
              dirty_paused_when                                       jiffies
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      83712358
    • Wu Fengguang's avatar
      btrfs: fix dirtied pages accounting on sub-page writes · 32c7f202
      Wu Fengguang authored
      When doing 1KB sequential writes to the same page,
      balance_dirty_pages_ratelimited_nr() should be called once instead of 4
      times, the latter makes the dirtier tasks be throttled much too heavy.
      
      Fix it with proper de-accounting on clear_page_dirty_for_io().
      
      CC: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      32c7f202
    • Wu Fengguang's avatar
      writeback: fix dirtied pages accounting on redirty · 2f800fbd
      Wu Fengguang authored
      De-account the accumulative dirty counters on page redirty.
      
      Page redirties (very common in ext4) will introduce mismatch between
      counters (a) and (b)
      
      a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
      b) NR_WRITTEN, BDI_WRITTEN
      
      This will introduce systematic errors in balanced_rate and result in
      dirty page position errors (ie. the dirty pages are no longer balanced
      around the global/bdi setpoints).
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      2f800fbd
    • Wu Fengguang's avatar
      writeback: fix dirtied pages accounting on sub-page writes · d3bc1fef
      Wu Fengguang authored
      When dd in 512bytes, generic_perform_write() calls
      balance_dirty_pages_ratelimited() 8 times for the same page, but
      obviously the page is only dirtied once.
      
      Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      d3bc1fef
    • Wu Fengguang's avatar
      writeback: charge leaked page dirties to active tasks · 54848d73
      Wu Fengguang authored
      It's a years long problem that a large number of short-lived dirtiers
      (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
      (eg. dd) as well as pushing the dirty pages to the global hard limit.
      
      The solution is to charge the pages dirtied by the exited gcc to the
      other random dirtying tasks. It sounds not perfect, however should
      behave good enough in practice, seeing as that throttled tasks aren't
      actually running so those that are running are more likely to pick it up
      and get throttled, therefore promoting an equal spread.
      
      Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarRandy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      54848d73
    • Jan Kara's avatar
      writeback: Include all dirty inodes in background writeback · 1bc36b64
      Jan Kara authored
      Current livelock avoidance code makes background work to include only inodes
      that were dirtied before background writeback has started. However background
      writeback can be running for a long time and thus excluding newly dirtied
      inodes can eventually exclude significant portion of dirty inodes making
      background writeback inefficient. Since background writeback avoids livelocking
      the flusher thread by yielding to any other work, there is no real reason why
      background work should not include all dirty inodes so change the logic in
      wb_writeback().
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      1bc36b64
    • Wu Fengguang's avatar
      writeback: show writeback reason with __print_symbolic · b3bba872
      Wu Fengguang authored
      This makes the binary trace understandable by trace-cmd.
      
      CC: Dave Chinner <david@fromorbit.com>
      CC: Curt Wohlgemuth <curtw@google.com>
      CC: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      b3bba872
  2. 17 Dec, 2011 1 commit
  3. 16 Dec, 2011 22 commits
  4. 15 Dec, 2011 6 commits