• NeilBrown's avatar
    mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE · a37b0715
    NeilBrown authored
    PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
    loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
    daemon needs to write to one bdi (the final bdi) in order to free up
    writes queued to another bdi (the client bdi).
    
    The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
    pages, so that it can still dirty pages after other processses have been
    throttled.  The purpose of this is to avoid deadlock that happen when
    the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
    but it is being thottled and cannot write.
    
    This approach was designed when all threads were blocked equally,
    independently on which device they were writing to, or how fast it was.
    Since that time the writeback algorithm has changed substantially with
    different threads getting different allowances based on non-trivial
    heuristics.  This means the simple "add 25%" heuristic is no longer
    reliable.
    
    The important issue is not that the daemon needs a *larger* dirty page
    allowance, but that it needs a *private* dirty page allowance, so that
    dirty pages for the "client" bdi that it is helping to clear (the bdi
    for an NFS filesystem or loop block device etc) do not affect the
    throttling of the daemon writing to the "final" bdi.
    
    This patch changes the heuristic so that the task is not throttled when
    the bdi it is writing to has a dirty page count below below (or equal
    to) the free-run threshold for that bdi.  This ensures it will always be
    able to have some pages in flight, and so will not deadlock.
    
    In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
    still be throttled by global threshold, but that is acceptable as it is
    only the deadlock state that is interesting for this flag.
    
    This approach of "only throttle when target bdi is busy" is consistent
    with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
    it causes attention to be focussed only on the target bdi.
    
    So this patch
     - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
     - removes the 25% bonus that that flag gives, and
     - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
       global and the local free-run thresholds are exceeded.
    
    Note that previously realtime threads were treated the same as
    PF_LESS_THROTTLE threads.  This patch does *not* change the behvaiour
    for real-time threads, so it is now different from the behaviour of nfsd
    and loop tasks.  I don't know what is wanted for realtime.
    
    [akpm@linux-foundation.org: coding style fixes]
    Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Reviewed-by: default avatarJan Kara <jack@suse.cz>
    Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
    Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.nameSigned-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    a37b0715
sched.h 55.9 KB