    ARM: 8505/1: dma-mapping: Optimize allocation · 33298ef6
    Doug Anderson authored
    __iommu_alloc_buffer() is expected to be called to allocate fairly
    sizeable buffers.  In simple video tests I saw it trying to allocate
    4,194,304 bytes.  The function tries to allocate large chunks
    in order to optimize IOMMU TLB usage.
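
    As a quick sanity check on those numbers, assume 4K pages (the message implies this but never states PAGE_SIZE). Under that assumption, 4,194,304 bytes is 1024 pages, and the smallest power-of-two block that covers it is order 10. size_to_order() below is a hypothetical helper that mimics what the kernel's get_order() computes. It is not that function.

```python
PAGE_SIZE = 4096  # assumption: 4K pages


def size_to_order(size: int) -> int:
    """Smallest order n such that (PAGE_SIZE << n) >= size.

    Hypothetical helper for illustration; it mimics the result of the
    kernel's get_order() but is not that function.
    """
    pages = (size + PAGE_SIZE - 1) // PAGE_SIZE  # round up to whole pages
    # For pages >= 1, (pages - 1).bit_length() equals ceil(log2(pages)).
    return max(pages - 1, 0).bit_length()


size = 4_194_304
print(size // PAGE_SIZE, size_to_order(size))  # 1024 10
```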
    
    The current function is very, very slow.
    
    One problem is the way it keeps trying and trying to allocate big
    chunks.  Imagine very fragmented memory that has 4M free but no two
    free pages that are contiguous.  Further imagine allocating 4M (1024
    pages).  We'll make the following allocation attempts:
    - For page 1:
      - Try to allocate order 10 (no retry)
      - Try to allocate order 9 (no retry)
      - ...
      - Try to allocate order 0 (with retry, but not needed)
    - For page 2:
      - Try to allocate order 9 (no retry)
      - Try to allocate order 8 (no retry)
      - ...
      - Try to allocate order 0 (with retry, but not needed)
    - ...
    - ...
    
    The total number of alloc() calls for this case is:
      sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
      => 9228
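
    The count can be reproduced with a small simulation of the old loop. This is a simplified model, not the kernel code. It assumes every attempt at an order <= max_free_order succeeds and every larger order fails (0 models the fully fragmented case), and it ignores retries and timing. The starting order for each chunk is floor(log2(remaining pages)), which is what the walkthrough implies: with 1024 pages left it starts at order 10, and with 1023 left it starts at order 9. old_alloc_calls() is a hypothetical name.

```python
def old_alloc_calls(total_pages: int, max_free_order: int = 0) -> int:
    """Count alloc attempts made by the old strategy (hypothetical model).

    Assumes every attempt at an order <= max_free_order succeeds and every
    larger order fails; max_free_order=0 models fully fragmented memory.
    """
    calls = 0
    remaining = total_pages
    while remaining:
        # Biggest order that fits in what is left: floor(log2(remaining)).
        order = remaining.bit_length() - 1
        while True:
            calls += 1
            if order <= max_free_order:
                break  # this attempt succeeds
            order -= 1  # failed: step down a single order and try again
        remaining -= 1 << order  # pages obtained by the successful attempt
    return calls


print(old_alloc_calls(1024))  # 9228
# Cross-check of the sum above using exact integer math:
# i.bit_length() == int(log2(i)) + 1 for i >= 1.
print(sum(i.bit_length() for i in range(1, 1025)))  # 9228
```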
    
    The above is obviously the worst case, but given how slow alloc can be we
    really want to try to avoid even somewhat bad cases.  I timed the old
    code with a device under memory pressure and it wasn't hard to see it
    take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
    was done on kernel 3.14, so possibly mainline would behave
    differently).
    
    A second problem is that allocating big chunks under memory pressure
    is just not a great idea unless we really need them.  We can make do
    pretty well with smaller chunks, so it's probably wise to leave bigger
    chunks for other users once memory pressure is on.
    
    Let's adjust the allocation like this:
    
    1. If a big chunk fails, stop trying too hard and bump down to
       lower-order allocations.
    2. Don't try useless orders.  The whole point of big chunks is to
       optimize the TLB, and the TLB can really only make use of 2M, 1M,
       64K and 4K sizes.
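
    The adjusted strategy can also be modeled with a small simulation. This is a simplified model, not the patch itself. It assumes every attempt at an order <= max_free_order succeeds and larger ones fail. ORDERS encodes the 2M, 1M, 64K and 4K sizes as page orders, assuming 4K pages. new_alloc_calls() is a hypothetical name. The key behavior is that a failed big chunk moves permanently to the next smaller useful order, instead of walking down every order again for each chunk.

```python
# 2M, 1M, 64K and 4K chunks expressed as page orders (assumes 4K pages).
ORDERS = (9, 8, 4, 0)


def new_alloc_calls(total_pages: int, max_free_order: int = 0) -> int:
    """Count alloc attempts made by the adjusted strategy (hypothetical model).

    Assumes every attempt at an order <= max_free_order succeeds and every
    larger order fails; max_free_order=0 models fully fragmented memory.
    """
    calls = 0
    remaining = total_pages
    idx = 0  # index into ORDERS; it only ever moves toward smaller orders
    while remaining:
        order = ORDERS[idx]
        if (1 << order) > remaining:
            idx += 1  # chunk is bigger than what is left: drop down
            continue
        calls += 1
        if order > max_free_order:
            idx += 1  # big chunk failed: bump down instead of trying harder
            continue
        remaining -= 1 << order
    return calls


print(new_alloc_calls(1024))  # 1027: three failed big tries, then 1024 pages
print(new_alloc_calls(1024, max_free_order=4))  # 66: two failures, 64 x 64K
```

    Under this model, the fully fragmented 4M case drops from the 9228 attempts computed above to 1027. Real timings depend on the allocator and are not something the model can predict.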
    
    We'll still tend to eat up a bunch of big chunks, but that might be the
    right answer for some users.  A future patch could possibly add a new
    DMA_ATTR that would let the caller decide that TLB optimization isn't
    important and that we should use smaller chunks.  Presumably this would
    be a sane strategy for some callers.

    Signed-off-by: Douglas Anderson <dianders@chromium.org>
    Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Reviewed-by: Robin Murphy <robin.murphy@arm.com>
    Reviewed-by: Tomasz Figa <tfiga@chromium.org>
    Tested-by: Javier Martinez Canillas <javier@osg.samsung.com>
    Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>