-
Andrew Morton authored
When an ext3 (or ext2) file is first created the filesystem has to choose the initial starting block for its data allocations. In the usual (new-file) case, that initial goal block is the zeroeth block of a particular blockgroup. This is the worst possible choice. Because it _guarantees_ that this file's blocks will be pessimally intermingled with the blocks of another file which is growing within the same blockgroup. We've always had this problem with files in the same directory. With the introduction of the Orlov allocator we now have the problem with files in different directories. And it got noticed. This is the cause of the post-Orlov 50% slowdown in dbench throughput on ext3 on write-through caching SCSI on SMP. And 25% in ext2. It doesn't happen on uniprocessor because a single CPU will not exhibit sufficient concurrency in allocation against two or more files. It will happen on uniprocessor if the files are growing slowly. It has always happened if the files are in the same directory. ext2 has the same problem but it is siginficantly less damaging there because of ext2's eight-block per-inode preallocation window. The patch largely solves this problem by not always starting the allocation goal at the zeroeth block of the blockgroup. We instead chop the blockgroup into sixteen starting points and select one of those based on the lower four bits of the calling process's PID. The PID was chosen as the index because this will help to ensure that related files have the same starting goal. If one process is slowly writing two files in the same directory, we still lose. Using the PID in the heuristic is a bit weird. As an alternative I tried using the file's directory's i_ino. That fixed the dbench problem OK but caused a 15% slowdown in the fast-growth `untar a kernel tree' workload. Because this approach will cause files which are in different directories to spread out more. Suppressing that behaviour when the files are all being created by the same process is a reasonable heuristic. I changed dbench to never unlink its files, and used e2fsck to determine how many fragmented files were present after a `dbench 32' run. With this patch and the next couple, ext2's fragmentation went from 22% to 13% and ext3's from 25% to 10.4%.
d2562c9d