• Linus Torvalds's avatar
    x86: re-introduce non-generic memcpy_{to,from}io · 170d13ca
    Linus Torvalds authored
    This has been broken forever, and nobody ever really noticed because
    it's purely a performance issue.
    
    Long long ago, in commit 6175ddf0 ("x86: Clean up mem*io functions")
    Brian Gerst simplified the memory copies to and from iomem, since on
    x86, the instructions to access iomem are exactly the same as the
    regular instructions.
    
    That is technically true, and things worked, and nobody said anything.
    Besides, back then the regular memcpy was pretty simple and worked fine.
    
    Nobody noticed except for David Laight, that is.  David has a testing a
    TLP monitor he was writing for an FPGA, and has been occasionally
    complaining about how memcpy_toio() writes things one byte at a time.
    
    Which is completely unacceptable from a performance standpoint, even if
    it happens to technically work.
    
    The reason it's writing one byte at a time is because while it's
    technically true that accesses to iomem are the same as accesses to
    regular memory on x86, the _granularity_ (and ordering) of accesses
    matter to iomem in ways that they don't matter to regular cached memory.
    
    In particular, when ERMS is set, we default to using "rep movsb" for
    larger memory copies.  That is indeed perfectly fine for real memory,
    since the whole point is that the CPU is going to do cacheline
    optimizations and executes the memory copy efficiently for cached
    memory.
    
    With iomem? Not so much.  With iomem, "rep movsb" will indeed work, but
    it will copy things one byte at a time. Slowly and ponderously.
    
    Now, originally, back in 2010 when commit 6175ddf0 was done, we
    didn't use ERMS, and this was much less noticeable.
    
    Our normal memcpy() was simpler in other ways too.
    
    Because in fact, it's not just about using the string instructions.  Our
    memcpy() these days does things like "read and write overlapping values"
    to handle the last bytes of the copy.  Again, for normal memory,
    overlapping accesses isn't an issue.  For iomem? It can be.
    
    So this re-introduces the specialized memcpy_toio(), memcpy_fromio() and
    memset_io() functions.  It doesn't particularly optimize them, but it
    tries to at least not be horrid, or do overlapping accesses.  In fact,
    this uses the existing __inline_memcpy() function that we still had
    lying around that uses our very traditional "rep movsl" loop followed by
    movsw/movsb for the final bytes.
    
    Somebody may decide to try to improve on it, but if we've gone almost a
    decade with only one person really ever noticing and complaining, maybe
    it's not worth worrying about further, once it's not _completely_ broken?
    Reported-by: default avatarDavid Laight <David.Laight@aculab.com>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    170d13ca
string_64.h 4.05 KB