    block: Fix the starving writes bug in the anticipatory IO scheduler · d585d0b9
    Divyesh Shah authored
    The AS scheduler alternates between issuing read and write batches. It
    switches batches only after all requests from the previous batch have
    completed.
    
    When switching to a write batch while a read request is still in
    flight, it waits for that request to complete. It signals the pending
    switch by setting ad->changed_batch and the new direction, but it does
    not update batch_expire_time for the new write batch, as it does when
    there are no pending requests.
    On completion of the read request, it sees that we were waiting for the
    switch, schedules work for kblockd right away and resets the
    ad->changed_batch flag.
    When kblockd then enters dispatch_request, where it is expected to pick
    up a write request, it instead ends the write batch immediately,
    because batch_expire_time was never updated and still holds the expiry
    timestamp of the previous batch.
    
    This results in write starvation in every case where the scheduler
    intends to switch to a write batch while a read request is still in
    flight: the batch is reverted to a read batch right away.
    
    This also holds true in the reverse case (switching from a write batch
    to a read batch with an in-flight write request).
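
    The sequence described above can be modelled in a few lines of C. This
    is a toy sketch, not the kernel code: struct toy_sched, BATCH_LEN and
    all function names are hypothetical, time is a plain counter, and the
    "fixed" path assumes the expiry is refreshed when the in-flight request
    completes, which is one plausible place for the fix.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_dir { TOY_READ, TOY_WRITE };

/* Hypothetical stand-in for the scheduler state discussed above. */
struct toy_sched {
        enum toy_dir batch;     /* direction of the current batch */
        bool changed_batch;     /* switch announced, waiting for in-flight I/O */
        long batch_expires;     /* time at which the current batch ends */
        int in_flight;          /* requests issued but not yet completed */
};

#define BATCH_LEN 10            /* assumed batch length, arbitrary units */

/* Start switching to direction `to` at time `now`. */
void toy_begin_switch(struct toy_sched *s, enum toy_dir to, long now)
{
        s->batch = to;
        if (s->in_flight > 0) {
                /* Only the intent is recorded; the expiry stays stale. */
                s->changed_batch = true;
                return;
        }
        s->batch_expires = now + BATCH_LEN;
}

/* Complete one in-flight request at time `now`. */
void toy_complete(struct toy_sched *s, long now, bool fixed)
{
        assert(s->in_flight > 0);
        s->in_flight--;
        if (s->changed_batch && s->in_flight == 0) {
                s->changed_batch = false;
                if (fixed)      /* assumed fix: give the new batch its own expiry */
                        s->batch_expires = now + BATCH_LEN;
        }
}

/* The dispatcher ends the batch as soon as its expiry has passed. */
bool toy_batch_alive(const struct toy_sched *s, long now)
{
        return now < s->batch_expires;
}

/* Read batch expired at t=10 with one read in flight; switch at t=12. */
bool toy_write_batch_survives(bool fixed)
{
        struct toy_sched s = {
                .batch = TOY_READ, .changed_batch = false,
                .batch_expires = 10, .in_flight = 1,
        };

        toy_begin_switch(&s, TOY_WRITE, 12);
        toy_complete(&s, 13, fixed);
        return toy_batch_alive(&s, 13);
}
```

    In this model toy_write_batch_survives(false) reports that the write
    batch is already expired when the dispatcher first looks at it, while
    toy_write_batch_survives(true) leaves it a full BATCH_LEN to run.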
    
    I've checked that this bug exists on 2.6.11, 2.6.18, 2.6.24 and
    linux-2.6-block git HEAD. I've tested the fix on x86 platforms with
    SCSI drives where the driver asks for the next request while a current
    request is in-flight.
    
    This patch is based off linux-2.6-block git HEAD.
    
    Bug reproduction:
    A simple scenario which reproduces this bug is:
    - dd if=/dev/hda3 of=/dev/null &
    - lilo
       The lilo takes forever to complete.
    
    This can also be reproduced fairly easily with the earlier dd and
    another test program doing msync().
    
    The example test program below, when run with an extra argument, should
    print a message after every iteration, but it simply hangs forever.
    With this bugfix it makes forward progress.
    
    ====
    Example test program using msync()
    (thanks to suleiman AT google DOT com)
    
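    #define _GNU_SOURCE             /* for O_NOATIME */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <err.h>
    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    
    /* Read the x86 time-stamp counter. */
    static inline uint64_t
    rdtsc(void)
    {
            uint32_t lo, hi;
    
            /* "=a"/"=d" is correct on both i386 and x86-64, unlike "=A". */
            __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
            return (((uint64_t)hi << 32) | lo);
    }
    
    int
    main(int argc, char **argv)
    {
            struct stat st;
            uint64_t e, s, t;
            char *p;
            long i;
            int fd;
    
            if (argc < 2) {
                    printf("Usage: %s <file>\n", argv[0]);
                    return (1);
            }
    
            if ((fd = open(argv[1], O_RDWR | O_NOATIME)) < 0)
                    err(1, "open");
    
            if (fstat(fd, &st) < 0)
                    err(1, "fstat");
    
            p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    err(1, "mmap");
    
            t = 0;
            for (i = 0; i < 1000; i++) {
                    /* Dirty the first page and write it back synchronously. */
                    *p = 0;
                    msync(p, 4096, MS_SYNC);
                    /* Time the next store to the freshly cleaned page. */
                    s = rdtsc();
                    *p = 0;
                    __asm__ __volatile__("" ::: "memory");
                    e = rdtsc();
                    if (argc > 2)
                            printf("%ld: %" PRIu64 " cycles %jd %jd\n",
                                i, e - s, (intmax_t)s, (intmax_t)e);
                    t += e - s;
            }
            printf("average time: %" PRIu64 " cycles\n", t / 1000);
            return (0);
    }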
    
    Cc: <stable@kernel.org>
    Acked-by: Nick Piggin <npiggin@suse.de>
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
as-iosched.c 38.2 KB