Commit 0d85f8bf authored by Andrew Morton, committed by Linus Torvalds

[PATCH] direct IO updates

This patch is a performance and correctness update to the direct-IO
code: O_DIRECT and the raw driver.  It mainly affects IO against
blockdevs.

The direct_io code was returning -EINVAL for a filesystem hole.  Change
it to clear the userspace page instead.
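
As a rough illustration of the hole handling described above, the following userspace sketch zeroes the part of a page that corresponds to one unmapped block. The patch itself does this with kmap_atomic() and memset() inside do_direct_IO(); the function name `zero_hole_block` and the fixed `SKETCH_PAGE_SIZE` are assumptions made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for this sketch only */

/*
 * Zero the bytes of `page` that belong to one unmapped (hole) block,
 * rather than failing the read. `block_in_page` is the index of the
 * block within the page, in units of (1 << blkbits) bytes.
 */
void zero_hole_block(unsigned char *page, unsigned block_in_page,
		     unsigned blkbits)
{
	size_t blocksize = (size_t)1 << blkbits;
	size_t off = (size_t)block_in_page << blkbits;

	/* The block must lie entirely inside the page. */
	assert(off + blocksize <= SKETCH_PAGE_SIZE);
	memset(page + off, 0, blocksize);
}
```

With 512-byte blocks (blkbits == 9), block 2 of a page covers bytes 1024 through 1535, so only that range is cleared and the rest of the page is left alone.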

There were a few restrictions and weirdnesses wrt blocksize and
alignments.  The code has been reworked so we now lay out maximum-sized
BIOs at any sector alignment.
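
To make the layout rule a little more concrete, here is a small sketch of the arithmetic the reworked do_direct_IO() loop in this patch uses to decide how many blocks to add to the current page: the mapped blocks still available, the room left in the page, and the blocks left in the request are each applied as an upper bound. The function name `sketch_chunk_blocks` and the fixed page size are assumptions for illustration; the real code works on fields of struct dio.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for this sketch only */

/*
 * Return the number of blocks to add to the current bio_vec.
 *   blocks_available:       blocks left from the last mapping call
 *   bv_len:                 bytes already placed in the current page
 *   blocks_left_in_request: final_block_in_request - block_in_file
 */
unsigned sketch_chunk_blocks(unsigned blocks_available, unsigned bv_len,
			     unsigned long long blocks_left_in_request,
			     unsigned blkbits)
{
	unsigned chunk = blocks_available;
	unsigned room;

	assert(bv_len <= SKETCH_PAGE_SIZE);
	room = (SKETCH_PAGE_SIZE - bv_len) >> blkbits;
	if (chunk > room)
		chunk = room;
	if (chunk > blocks_left_in_request)
		chunk = (unsigned)blocks_left_in_request;
	return chunk;
}
```

With 512-byte blocks and an empty page, a large mapping yields eight blocks per page, which is how a fine alignment can still produce page-sized segments.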

Because of this, the raw driver has been altered to set the blockdev's
soft blocksize to the minimum possible at open() time.  Typically, 512
bytes.  There are now no performance disadvantages to using small
blocksizes, and this gives the finest possible alignment.
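
The effect of a smaller soft blocksize on alignment can be sketched as follows. This assumes the request check is a simple mask test built from `blocksize_mask = (1 << blkbits) - 1`, which generic_direct_IO() declares; the exact check is not visible in this diff, and the helper name `dio_request_is_aligned` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if both the file offset and the byte count are multiples
 * of the soft blocksize (1 << blkbits), otherwise 0.
 */
int dio_request_is_aligned(uint64_t offset, size_t count, unsigned blkbits)
{
	uint64_t blocksize_mask;

	assert(blkbits < 32);
	blocksize_mask = ((uint64_t)1 << blkbits) - 1;
	return ((offset | (uint64_t)count) & blocksize_mask) == 0;
}
```

A request at offset 512 for 1024 bytes passes with a 512-byte blocksize (blkbits == 9) but would be rejected with a 4096-byte blocksize (blkbits == 12), which is why the raw driver now selects the smallest blocksize the device supports.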

There is no API here for setting or querying the soft blocksize of the
raw driver (there never was, really), which could conceivably be a
problem.  If it is, we can permit BLKBSZSET and BLKBSZGET against the
fd which /dev/raw/rawN returned, but that would require that
blk_ioctl() be exported to modules again.

This code is wickedly quick.  Here's an oprofile of a single 500MHz
PIII reading from four (old) scsi disks (two aic7xxx controllers) via
the raw driver.  Aggregate throughput is 72 megabytes/second:

c013363c 24       0.0896492   __set_page_dirty_buffers
c021b8cc 24       0.0896492   ahc_linux_isr
c012b5dc 25       0.0933846   kmem_cache_free
c014d894 26       0.09712     dio_bio_complete
c01cc78c 26       0.09712     number
c0123bd4 40       0.149415    follow_page
c01eed8c 46       0.171828    end_that_request_first
c01ed410 49       0.183034    blk_recount_segments
c01ed574 65       0.2428      blk_rq_map_sg
c014db38 85       0.317508    do_direct_IO
c021b090 90       0.336185    ahc_linux_run_device_queue
c010bb78 236      0.881551    timer_interrupt
c01052d8 25354    94.707      poll_idle

A testament to the efficiency of the 2.5 block layer.

And against four IDE disks on an HPT374 controller.  Throughput is 120
megabytes/sec:

c01eed8c 80       0.292462    end_that_request_first
c01fe850 87       0.318052    hpt3xx_intrproc
c01ed574 123      0.44966     blk_rq_map_sg
c01f8f10 141      0.515464    ata_select
c014db38 153      0.559333    do_direct_IO
c010bb78 235      0.859107    timer_interrupt
c01f9144 281      1.02727     ata_irq_enable
c01ff990 290      1.06017     udma_pci_init
c01fe878 308      1.12598     hpt3xx_maskproc
c02006f8 379      1.38554     idedisk_do_request
c02356a0 609      2.22637     pci_conf1_read
c01ff8dc 611      2.23368     udma_pci_start
c01ff950 922      3.37062     udma_pci_irq_status
c01f8fac 1002     3.66308     ata_status
c01ff26c 1059     3.87146     ata_start_dma
c01feb70 1141     4.17124     hpt374_udma_stop
c01f9228 3072     11.2305     ata_out_regfile
c01052d8 15193    55.5422     poll_idle

Not so good.

One problem which has been identified with O_DIRECT is the cost of
repeated calls into the mapping's get_block() callback.  Not a big
problem with ext2 but other filesystems have more complex get_block
implementations.

So what I have done is to require that callers of generic_direct_IO()
implement the new `get_blocks()' interface.  This is a small extension
to get_block().  It gets passed another argument which indicates the
maximum number of blocks which should be mapped, and it returns the
number of blocks which it did map in bh_result->b_size.  This allows
the fs to map up to 4G of disk (or of hole) in a single get_blocks()
invocation.
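
A minimal userspace model of that contract, for a file that consists of one contiguous extent, might look like the sketch below. `struct sketch_bh` is a hypothetical stand-in for the b_blocknr, b_size and mapped-state fields of struct buffer_head, and `struct sketch_extent` and `sketch_get_blocks` are invented for the example; none of them are kernel interfaces. The sketch uses a 64-bit b_size for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant buffer_head fields. */
struct sketch_bh {
	uint64_t b_blocknr;	/* first disk block of the mapping */
	uint64_t b_size;	/* bytes mapped (or size of the hole) */
	int mapped;		/* 0 means the range is a hole */
};

/* Hypothetical file layout: one linear run of disk blocks. */
struct sketch_extent {
	uint64_t file_start;	/* first file block covered */
	uint64_t disk_start;	/* disk block backing file_start */
	uint64_t nr_blocks;	/* length of the run, in blocks */
};

/*
 * Map up to max_blocks blocks starting at file block iblock.
 * The number of blocks mapped is reported through bh->b_size.
 */
int sketch_get_blocks(const struct sketch_extent *ext, uint64_t iblock,
		      unsigned long max_blocks, struct sketch_bh *bh,
		      unsigned blkbits)
{
	uint64_t avail;

	assert(max_blocks > 0);
	bh->mapped = 0;
	bh->b_blocknr = 0;

	if (iblock < ext->file_start ||
	    iblock >= ext->file_start + ext->nr_blocks) {
		/* Outside the extent: report a single-block hole. */
		bh->b_size = (uint64_t)1 << blkbits;
		return 0;
	}

	/* Never map more than the caller asked for. */
	avail = ext->file_start + ext->nr_blocks - iblock;
	if (avail > max_blocks)
		avail = max_blocks;

	bh->b_blocknr = ext->disk_start + (iblock - ext->file_start);
	bh->b_size = avail << blkbits;
	bh->mapped = 1;
	return 0;
}
```

The caller recovers the block count as b_size >> blkbits, which is what get_more_blocks() in this patch does to fill in dio->blocks_available.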

There are some other caveats and requirements of get_blocks() which are
documented in the comment block over fs/direct-io.c:get_more_blocks().

Possibly, get_blocks() will be the 2.6 kernel's way of doing gang block
mapping.  It certainly allows good speedups.  But it doesn't allow the
fs to return a scatter list of blocks - it only understands linear
chunks of disk.  I think that's really all it _should_ do.

I'll let get_blocks() sit for a while and wait for some feedback.  If
it is sufficient and nobody objects too much, I shall convert all
get_block() instances in the kernel to be get_blocks() instances.  And
I'll teach readahead (at least) to use the get_blocks() extension.

Delayed allocate writeback could use get_blocks().  As could
block_prepare_write() for blocksize < PAGE_CACHE_SIZE.  There's no
mileage using it in mpage_writepages() because all our filesystems are
syncalloc, and nobody uses MAP_SHARED for much.

It will be tricky to use get_blocks() for writes, because if a ton of
blocks have been mapped into the file and then something goes wrong,
the kernel needs to either remove those blocks from the file or zero
them out.  The direct_io code zeroes them out.

btw, some time ago you mentioned that some drivers and/or hardware may
get upset if there are multiple simultaneous IOs in progress against
the same block.  Well, the raw driver has always allowed that to
happen.  O_DIRECT writes to blockdevs do as well now.

todo:

1) The driver will probably explode if someone runs BLKBSZSET while
   IO is in progress.  Need to use bdclaim() somewhere.

2) readv() and writev() need to become direct_io-aware.  At present
   we're doing stop-and-wait for each segment when performing
   readv/writev against the raw driver and O_DIRECT blockdevs.
parent 62b52f5c
...@@ -17,11 +17,9 @@ ...@@ -17,11 +17,9 @@
#include <linux/smp_lock.h> #include <linux/smp_lock.h>
#include <asm/uaccess.h> #include <asm/uaccess.h>
#define dprintk(x...)
typedef struct raw_device_data_s { typedef struct raw_device_data_s {
struct block_device *binding; struct block_device *binding;
int inuse, sector_size, sector_bits; int inuse;
struct semaphore mutex; struct semaphore mutex;
} raw_device_data_t; } raw_device_data_t;
...@@ -65,15 +63,15 @@ __initcall(raw_init); ...@@ -65,15 +63,15 @@ __initcall(raw_init);
/* /*
* Open/close code for raw IO. * Open/close code for raw IO.
*
* Set the device's soft blocksize to the minimum possible. This gives the
* finest possible alignment and has no adverse impact on performance.
*/ */
int raw_open(struct inode *inode, struct file *filp) int raw_open(struct inode *inode, struct file *filp)
{ {
int minor; int minor;
struct block_device * bdev; struct block_device * bdev;
int err; int err;
int sector_size;
int sector_bits;
minor = minor(inode->i_rdev); minor = minor(inode->i_rdev);
...@@ -87,12 +85,11 @@ int raw_open(struct inode *inode, struct file *filp) ...@@ -87,12 +85,11 @@ int raw_open(struct inode *inode, struct file *filp)
} }
down(&raw_devices[minor].mutex); down(&raw_devices[minor].mutex);
/* /*
* No, it is a normal raw device. All we need to do on open is * No, it is a normal raw device. All we need to do on open is
* to check that the device is bound, and force the underlying * to check that the device is bound.
* block device to a sector-size blocksize.
*/ */
bdev = raw_devices[minor].binding; bdev = raw_devices[minor].binding;
err = -ENODEV; err = -ENODEV;
if (!bdev) if (!bdev)
...@@ -100,23 +97,19 @@ int raw_open(struct inode *inode, struct file *filp) ...@@ -100,23 +97,19 @@ int raw_open(struct inode *inode, struct file *filp)
atomic_inc(&bdev->bd_count); atomic_inc(&bdev->bd_count);
err = blkdev_get(bdev, filp->f_mode, 0, BDEV_RAW); err = blkdev_get(bdev, filp->f_mode, 0, BDEV_RAW);
if (err) if (!err) {
goto out; int minsize = bdev_hardsect_size(bdev);
/*
* Don't change the blocksize if we already have users using
* this device
*/
if (raw_devices[minor].inuse++) if (bdev) {
goto out; int ret;
sector_size = bdev_hardsect_size(bdev);
raw_devices[minor].sector_size = sector_size;
for (sector_bits = 0; !(sector_size & 1); )
sector_size>>=1, sector_bits++;
raw_devices[minor].sector_bits = sector_bits;
ret = set_blocksize(bdev, minsize);
if (ret)
printk("%s: set_blocksize() failed: %d\n",
__FUNCTION__, ret);
}
raw_devices[minor].inuse++;
}
out: out:
up(&raw_devices[minor].mutex); up(&raw_devices[minor].mutex);
...@@ -137,26 +130,29 @@ int raw_release(struct inode *inode, struct file *filp) ...@@ -137,26 +130,29 @@ int raw_release(struct inode *inode, struct file *filp)
return 0; return 0;
} }
/* Forward ioctls to the underlying block device. */ /* Forward ioctls to the underlying block device. */
int raw_ioctl(struct inode *inode, int raw_ioctl(struct inode *inode,
struct file *flip, struct file *filp,
unsigned int command, unsigned int command,
unsigned long arg) unsigned long arg)
{ {
int minor = minor(inode->i_rdev), err; int minor = minor(inode->i_rdev);
int err;
struct block_device *b; struct block_device *b;
err = -ENODEV;
if (minor < 1 && minor > 255) if (minor < 1 && minor > 255)
return -ENODEV; goto out;
b = raw_devices[minor].binding; b = raw_devices[minor].binding;
err = -EINVAL; err = -EINVAL;
if (b && b->bd_inode && b->bd_op && b->bd_op->ioctl) { if (b == NULL)
goto out;
if (b->bd_inode && b->bd_op && b->bd_op->ioctl)
err = b->bd_op->ioctl(b->bd_inode, NULL, command, arg); err = b->bd_op->ioctl(b->bd_inode, NULL, command, arg);
} out:
return err; return err;
} }
/* /*
* Deal with ioctls against the raw-device control interface, to bind * Deal with ioctls against the raw-device control interface, to bind
...@@ -164,12 +160,12 @@ int raw_ioctl(struct inode *inode, ...@@ -164,12 +160,12 @@ int raw_ioctl(struct inode *inode,
*/ */
int raw_ctl_ioctl(struct inode *inode, int raw_ctl_ioctl(struct inode *inode,
struct file *flip, struct file *filp,
unsigned int command, unsigned int command,
unsigned long arg) unsigned long arg)
{ {
struct raw_config_request rq; struct raw_config_request rq;
int err = 0; int err;
int minor; int minor;
switch (command) { switch (command) {
...@@ -178,26 +174,23 @@ int raw_ctl_ioctl(struct inode *inode, ...@@ -178,26 +174,23 @@ int raw_ctl_ioctl(struct inode *inode,
/* First, find out which raw minor we want */ /* First, find out which raw minor we want */
if (copy_from_user(&rq, (void *) arg, sizeof(rq))) { err = -EFAULT;
err = -EFAULT; if (copy_from_user(&rq, (void *) arg, sizeof(rq)))
break; goto out;
}
minor = rq.raw_minor; minor = rq.raw_minor;
if (minor <= 0 || minor > MINORMASK) { err = -EINVAL;
err = -EINVAL; if (minor <= 0 || minor > MINORMASK)
break; goto out;
}
if (command == RAW_SETBIND) { if (command == RAW_SETBIND) {
/* /*
* This is like making block devices, so demand the * This is like making block devices, so demand the
* same capability * same capability
*/ */
if (!capable(CAP_SYS_ADMIN)) { err = -EPERM;
err = -EPERM; if (!capable(CAP_SYS_ADMIN))
break; goto out;
}
/* /*
* For now, we don't need to check that the underlying * For now, we don't need to check that the underlying
...@@ -206,24 +199,23 @@ int raw_ctl_ioctl(struct inode *inode, ...@@ -206,24 +199,23 @@ int raw_ctl_ioctl(struct inode *inode,
* major/minor numbers make sense. * major/minor numbers make sense.
*/ */
if ((rq.block_major == 0 && err = -EINVAL;
rq.block_minor != 0) || if ((rq.block_major == 0 && rq.block_minor != 0) ||
rq.block_major > MAX_BLKDEV || rq.block_major > MAX_BLKDEV ||
rq.block_minor > MINORMASK) { rq.block_minor > MINORMASK)
err = -EINVAL; goto out;
break;
}
down(&raw_devices[minor].mutex); down(&raw_devices[minor].mutex);
err = -EBUSY;
if (raw_devices[minor].inuse) { if (raw_devices[minor].inuse) {
up(&raw_devices[minor].mutex); up(&raw_devices[minor].mutex);
err = -EBUSY; goto out;
break;
} }
if (raw_devices[minor].binding) if (raw_devices[minor].binding)
bdput(raw_devices[minor].binding); bdput(raw_devices[minor].binding);
raw_devices[minor].binding = raw_devices[minor].binding =
bdget(kdev_t_to_nr(mk_kdev(rq.block_major, rq.block_minor))); bdget(kdev_t_to_nr(mk_kdev(rq.block_major,
rq.block_minor)));
up(&raw_devices[minor].mutex); up(&raw_devices[minor].mutex);
} else { } else {
struct block_device *bdev; struct block_device *bdev;
...@@ -237,16 +229,18 @@ int raw_ctl_ioctl(struct inode *inode, ...@@ -237,16 +229,18 @@ int raw_ctl_ioctl(struct inode *inode,
} else { } else {
rq.block_major = rq.block_minor = 0; rq.block_major = rq.block_minor = 0;
} }
err = copy_to_user((void *) arg, &rq, sizeof(rq)); err = -EFAULT;
if (err) if (copy_to_user((void *) arg, &rq, sizeof(rq)))
err = -EFAULT; goto out;
} }
err = 0;
break; break;
default: default:
err = -EINVAL; err = -EINVAL;
break;
} }
out:
return err; return err;
} }
...@@ -257,7 +251,7 @@ ssize_t raw_read(struct file *filp, char * buf, size_t size, loff_t *offp) ...@@ -257,7 +251,7 @@ ssize_t raw_read(struct file *filp, char * buf, size_t size, loff_t *offp)
ssize_t raw_write(struct file *filp, const char *buf, size_t size, loff_t *offp) ssize_t raw_write(struct file *filp, const char *buf, size_t size, loff_t *offp)
{ {
return rw_raw_dev(WRITE, filp, (char *) buf, size, offp); return rw_raw_dev(WRITE, filp, (char *)buf, size, offp);
} }
ssize_t ssize_t
......
...@@ -24,14 +24,14 @@ ...@@ -24,14 +24,14 @@
#include <asm/uaccess.h> #include <asm/uaccess.h>
static unsigned long max_block(struct block_device *bdev) static sector_t max_block(struct block_device *bdev)
{ {
unsigned int retval = ~0U; sector_t retval = ~0U;
loff_t sz = bdev->bd_inode->i_size; loff_t sz = bdev->bd_inode->i_size;
if (sz) { if (sz) {
unsigned int size = block_size(bdev); sector_t size = block_size(bdev);
unsigned int sizebits = blksize_bits(size); unsigned sizebits = blksize_bits(size);
retval = (sz >> sizebits); retval = (sz >> sizebits);
} }
return retval; return retval;
...@@ -88,7 +88,9 @@ int sb_min_blocksize(struct super_block *sb, int size) ...@@ -88,7 +88,9 @@ int sb_min_blocksize(struct super_block *sb, int size)
return sb_set_blocksize(sb, size); return sb_set_blocksize(sb, size);
} }
static int blkdev_get_block(struct inode * inode, sector_t iblock, struct buffer_head * bh, int create) static int
blkdev_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create)
{ {
if (iblock >= max_block(inode->i_bdev)) if (iblock >= max_block(inode->i_bdev))
return -EIO; return -EIO;
...@@ -99,12 +101,26 @@ static int blkdev_get_block(struct inode * inode, sector_t iblock, struct buffer ...@@ -99,12 +101,26 @@ static int blkdev_get_block(struct inode * inode, sector_t iblock, struct buffer
return 0; return 0;
} }
static int
blkdev_get_blocks(struct inode *inode, sector_t iblock,
unsigned long max_blocks, struct buffer_head *bh, int create)
{
if ((iblock + max_blocks) >= max_block(inode->i_bdev))
return -EIO;
bh->b_bdev = inode->i_bdev;
bh->b_blocknr = iblock;
bh->b_size = max_blocks << inode->i_blkbits;
set_buffer_mapped(bh);
return 0;
}
static int static int
blkdev_direct_IO(int rw, struct inode *inode, char *buf, blkdev_direct_IO(int rw, struct inode *inode, char *buf,
loff_t offset, size_t count) loff_t offset, size_t count)
{ {
return generic_direct_IO(rw, inode, buf, offset, return generic_direct_IO(rw, inode, buf, offset,
count, blkdev_get_block); count, blkdev_get_blocks);
} }
static int blkdev_writepage(struct page * page) static int blkdev_writepage(struct page * page)
......
...@@ -13,6 +13,7 @@ ...@@ -13,6 +13,7 @@
#include <linux/types.h> #include <linux/types.h>
#include <linux/fs.h> #include <linux/fs.h>
#include <linux/mm.h> #include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/pagemap.h> #include <linux/pagemap.h>
#include <linux/bio.h> #include <linux/bio.h>
#include <linux/wait.h> #include <linux/wait.h>
...@@ -39,13 +40,17 @@ struct dio { ...@@ -39,13 +40,17 @@ struct dio {
struct bio_vec *bvec; /* current bvec in that bio */ struct bio_vec *bvec; /* current bvec in that bio */
struct inode *inode; struct inode *inode;
int rw; int rw;
unsigned blkbits; /* doesn't change */
sector_t block_in_file; /* changes */ sector_t block_in_file; /* changes */
unsigned blocks_available; /* At block_in_file. changes */
sector_t final_block_in_request;/* doesn't change */ sector_t final_block_in_request;/* doesn't change */
unsigned first_block_in_page; /* doesn't change */ unsigned first_block_in_page; /* doesn't change, Used only once */
int boundary; /* prev block is at a boundary */ int boundary; /* prev block is at a boundary */
int reap_counter; /* rate limit reaping */ int reap_counter; /* rate limit reaping */
get_block_t *get_block; get_blocks_t *get_blocks; /* block mapping function */
sector_t last_block_in_bio; sector_t last_block_in_bio; /* current final block in bio */
sector_t next_block_in_bio; /* next block to be added to bio */
struct buffer_head map_bh; /* last get_blocks() result */
/* Page fetching state */ /* Page fetching state */
int curr_page; /* changes */ int curr_page; /* changes */
...@@ -53,15 +58,16 @@ struct dio { ...@@ -53,15 +58,16 @@ struct dio {
unsigned long curr_user_address;/* changes */ unsigned long curr_user_address;/* changes */
/* Page queue */ /* Page queue */
struct page *pages[DIO_PAGES]; struct page *pages[DIO_PAGES]; /* page buffer */
unsigned head; unsigned head; /* next page to process */
unsigned tail; unsigned tail; /* last valid page + 1 */
int page_errors; /* errno from get_user_pages() */
/* BIO completion state */ /* BIO completion state */
atomic_t bio_count; atomic_t bio_count; /* nr bios in flight */
spinlock_t bio_list_lock; spinlock_t bio_list_lock; /* protects bio_list */
struct bio *bio_list; /* singly linked via bi_private */ struct bio *bio_list; /* singly linked via bi_private */
struct task_struct *waiter; struct task_struct *waiter; /* waiting task (NULL if none) */
}; };
/* /*
...@@ -93,6 +99,21 @@ static int dio_refill_pages(struct dio *dio) ...@@ -93,6 +99,21 @@ static int dio_refill_pages(struct dio *dio)
NULL); /* vmas */ NULL); /* vmas */
up_read(&current->mm->mmap_sem); up_read(&current->mm->mmap_sem);
if (ret < 0 && dio->blocks_available && (dio->rw == WRITE)) {
/*
* A memory fault, but the filesystem has some outstanding
* mapped blocks. We need to use those blocks up to avoid
* leaking stale data in the file.
*/
if (dio->page_errors == 0)
dio->page_errors = ret;
dio->pages[0] = ZERO_PAGE(dio->cur_user_address);
dio->head = 0;
dio->tail = 1;
ret = 0;
goto out;
}
if (ret >= 0) { if (ret >= 0) {
dio->curr_user_address += ret * PAGE_SIZE; dio->curr_user_address += ret * PAGE_SIZE;
dio->curr_page += ret; dio->curr_page += ret;
...@@ -100,6 +121,7 @@ static int dio_refill_pages(struct dio *dio) ...@@ -100,6 +121,7 @@ static int dio_refill_pages(struct dio *dio)
dio->tail = ret; dio->tail = ret;
ret = 0; ret = 0;
} }
out:
return ret; return ret;
} }
...@@ -115,11 +137,8 @@ static struct page *dio_get_page(struct dio *dio) ...@@ -115,11 +137,8 @@ static struct page *dio_get_page(struct dio *dio)
int ret; int ret;
ret = dio_refill_pages(dio); ret = dio_refill_pages(dio);
if (ret) { if (ret)
printk("%s: dio_refill_pages returns %d\n",
__FUNCTION__, ret);
return ERR_PTR(ret); return ERR_PTR(ret);
}
BUG_ON(dio_pages_present(dio) == 0); BUG_ON(dio_pages_present(dio) == 0);
} }
return dio->pages[dio->head++]; return dio->pages[dio->head++];
...@@ -140,8 +159,9 @@ static void dio_bio_end_io(struct bio *bio) ...@@ -140,8 +159,9 @@ static void dio_bio_end_io(struct bio *bio)
spin_lock_irqsave(&dio->bio_list_lock, flags); spin_lock_irqsave(&dio->bio_list_lock, flags);
bio->bi_private = dio->bio_list; bio->bi_private = dio->bio_list;
dio->bio_list = bio; dio->bio_list = bio;
if (dio->waiter)
wake_up_process(dio->waiter);
spin_unlock_irqrestore(&dio->bio_list_lock, flags); spin_unlock_irqrestore(&dio->bio_list_lock, flags);
wake_up_process(dio->waiter);
} }
static int static int
...@@ -179,6 +199,7 @@ static void dio_bio_submit(struct dio *dio) ...@@ -179,6 +199,7 @@ static void dio_bio_submit(struct dio *dio)
dio->bio = NULL; dio->bio = NULL;
dio->bvec = NULL; dio->bvec = NULL;
dio->boundary = 0;
} }
/* /*
...@@ -202,10 +223,12 @@ static struct bio *dio_await_one(struct dio *dio) ...@@ -202,10 +223,12 @@ static struct bio *dio_await_one(struct dio *dio)
while (dio->bio_list == NULL) { while (dio->bio_list == NULL) {
set_current_state(TASK_UNINTERRUPTIBLE); set_current_state(TASK_UNINTERRUPTIBLE);
if (dio->bio_list == NULL) { if (dio->bio_list == NULL) {
dio->waiter = current;
spin_unlock_irqrestore(&dio->bio_list_lock, flags); spin_unlock_irqrestore(&dio->bio_list_lock, flags);
blk_run_queues(); blk_run_queues();
schedule(); schedule();
spin_lock_irqsave(&dio->bio_list_lock, flags); spin_lock_irqsave(&dio->bio_list_lock, flags);
dio->waiter = NULL;
} }
set_current_state(TASK_RUNNING); set_current_state(TASK_RUNNING);
} }
...@@ -268,29 +291,142 @@ static int dio_bio_reap(struct dio *dio) ...@@ -268,29 +291,142 @@ static int dio_bio_reap(struct dio *dio)
while (dio->bio_list) { while (dio->bio_list) {
unsigned long flags; unsigned long flags;
struct bio *bio; struct bio *bio;
int ret2;
spin_lock_irqsave(&dio->bio_list_lock, flags); spin_lock_irqsave(&dio->bio_list_lock, flags);
bio = dio->bio_list; bio = dio->bio_list;
dio->bio_list = bio->bi_private; dio->bio_list = bio->bi_private;
spin_unlock_irqrestore(&dio->bio_list_lock, flags); spin_unlock_irqrestore(&dio->bio_list_lock, flags);
ret2 = dio_bio_complete(dio, bio); ret = dio_bio_complete(dio, bio);
if (ret == 0)
ret = ret2;
} }
dio->reap_counter = 0; dio->reap_counter = 0;
} }
return ret; return ret;
} }
/*
* Call into the fs to map some more disk blocks. We record the current number
* of available blocks at dio->blocks_available. These are in units of the
* fs blocksize, (1 << inode->i_blkbits).
*
* The fs is allowed to map lots of blocks at once. If it wants to do that,
* it uses the passed inode-relative block number as the file offset, as usual.
*
* get_blocks() is passed the number of i_blkbits-sized blocks which direct_io
* has remaining to do. The fs should not map more than this number of blocks.
*
* If the fs has mapped a lot of blocks, it should populate bh->b_size to
* indicate how much contiguous disk space has been made available at
* bh->b_blocknr.
*
* If *any* of the mapped blocks are new, then the fs must set buffer_new().
* This isn't very efficient...
*
* In the case of filesystem holes: the fs may return an arbitrarily-large
* hole by returning an appropriate value in b_size and by clearing
* buffer_mapped(). This code _should_ handle that case correctly, but it has
* only been tested against single-block holes (b_size == blocksize).
*/
static int get_more_blocks(struct dio *dio)
{
int ret;
struct buffer_head *map_bh = &dio->map_bh;
if (dio->blocks_available)
return 0;
/*
* If there was a memory error and we've overwritten all the
* mapped blocks then we can now return that memory error
*/
if (dio->page_errors) {
ret = dio->page_errors;
goto out;
}
map_bh->b_state = 0;
map_bh->b_size = 0;
BUG_ON(dio->block_in_file >= dio->final_block_in_request);
ret = (*dio->get_blocks)(dio->inode, dio->block_in_file,
dio->final_block_in_request - dio->block_in_file,
map_bh, dio->rw == WRITE);
if (ret)
goto out;
if (buffer_mapped(map_bh)) {
BUG_ON(map_bh->b_size == 0);
BUG_ON((map_bh->b_size & ((1 << dio->blkbits) - 1)) != 0);
dio->blocks_available = map_bh->b_size >> dio->blkbits;
/* blockdevs do not set buffer_new */
if (buffer_new(map_bh)) {
sector_t block = map_bh->b_blocknr;
unsigned i;
for (i = 0; i < dio->blocks_available; i++)
unmap_underlying_metadata(map_bh->b_bdev,
block++);
}
} else {
BUG_ON(dio->rw != READ);
if (dio->bio)
dio_bio_submit(dio);
}
dio->next_block_in_bio = map_bh->b_blocknr;
out:
return ret;
}
/*
* Check to see if we can continue to grow the BIO. If not, then send it.
*/
static void dio_prep_bio(struct dio *dio)
{
if (dio->bio == NULL)
return;
if (dio->bio->bi_idx == dio->bio->bi_vcnt ||
dio->boundary ||
dio->last_block_in_bio != dio->next_block_in_bio - 1)
dio_bio_submit(dio);
}
/*
* There is no bio. Make one now.
*/
static int dio_new_bio(struct dio *dio)
{
sector_t sector;
int ret;
ret = dio_bio_reap(dio);
if (ret)
goto out;
sector = dio->next_block_in_bio << (dio->blkbits - 9);
ret = dio_bio_alloc(dio, dio->map_bh.b_bdev, sector,
DIO_BIO_MAX_SIZE / PAGE_SIZE);
dio->boundary = 0;
out:
return ret;
}
/* /*
* Walk the user pages, and the file, mapping blocks to disk and emitting BIOs. * Walk the user pages, and the file, mapping blocks to disk and emitting BIOs.
*
* Direct IO against a blockdev is different from a file. Because we can
* happily perform page-sized but 512-byte aligned IOs. It is important that
* blockdev IO be able to have fine alignment and large sizes.
*
* So what we do is to permit the ->get_blocks function to populate bh.b_size
* with the size of IO which is permitted at this offset and this i_blkbits.
*
* For best results, the blockdev should be set up with 512-byte i_blkbits and
* it should set b_size to PAGE_SIZE or more inside get_blocks(). This gives
* fine alignment but still allows this function to work in PAGE_SIZE units.
*/ */
int do_direct_IO(struct dio *dio) int do_direct_IO(struct dio *dio)
{ {
struct inode * const inode = dio->inode; const unsigned blkbits = dio->blkbits;
const unsigned blkbits = inode->i_blkbits;
const unsigned blocksize = 1 << blkbits;
const unsigned blocks_per_page = PAGE_SIZE >> blkbits; const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
struct page *page; struct page *page;
unsigned block_in_page; unsigned block_in_page;
...@@ -309,46 +445,35 @@ int do_direct_IO(struct dio *dio) ...@@ -309,46 +445,35 @@ int do_direct_IO(struct dio *dio)
} }
new_page = 1; new_page = 1;
for ( ; block_in_page < blocks_per_page; block_in_page++) { while (block_in_page < blocks_per_page) {
struct buffer_head map_bh;
struct bio *bio; struct bio *bio;
unsigned this_chunk_bytes; /* # of bytes mapped */
unsigned this_chunk_blocks; /* # of blocks */
unsigned u;
map_bh.b_state = 0; ret = get_more_blocks(dio);
ret = (*dio->get_block)(inode, dio->block_in_file, if (ret)
&map_bh, dio->rw == WRITE);
if (ret) {
printk("%s: get_block returns %d\n",
__FUNCTION__, ret);
goto fail_release; goto fail_release;
/* Handle holes */
if (!buffer_mapped(&dio->map_bh)) {
char *kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr + (block_in_page << blkbits),
0, 1 << blkbits);
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);
dio->block_in_file++;
dio->next_block_in_bio++;
block_in_page++;
goto next_block;
} }
/* blockdevs do not set buffer_new */
if (buffer_new(&map_bh)) dio_prep_bio(dio);
unmap_underlying_metadata(map_bh.b_bdev,
map_bh.b_blocknr);
if (!buffer_mapped(&map_bh)) {
ret = -EINVAL; /* A hole */
goto fail_release;
}
if (dio->bio) {
if (dio->bio->bi_idx == dio->bio->bi_vcnt ||
dio->boundary ||
dio->last_block_in_bio !=
map_bh.b_blocknr - 1) {
dio_bio_submit(dio);
dio->boundary = 0;
}
}
if (dio->bio == NULL) { if (dio->bio == NULL) {
ret = dio_bio_reap(dio); ret = dio_new_bio(dio);
if (ret)
goto fail_release;
ret = dio_bio_alloc(dio, map_bh.b_bdev,
map_bh.b_blocknr << (blkbits - 9),
DIO_BIO_MAX_SIZE / PAGE_SIZE);
if (ret) if (ret)
goto fail_release; goto fail_release;
new_page = 1; new_page = 1;
dio->boundary = 0;
} }
bio = dio->bio; bio = dio->bio;
...@@ -357,17 +482,34 @@ int do_direct_IO(struct dio *dio) ...@@ -357,17 +482,34 @@ int do_direct_IO(struct dio *dio)
page_cache_get(page); page_cache_get(page);
dio->bvec->bv_page = page; dio->bvec->bv_page = page;
dio->bvec->bv_len = 0; dio->bvec->bv_len = 0;
dio->bvec->bv_offset = block_in_page*blocksize; dio->bvec->bv_offset = block_in_page << blkbits;
bio->bi_idx++; bio->bi_idx++;
new_page = 0;
} }
new_page = 0;
dio->bvec->bv_len += blocksize; /* Work out how much disk we can add to this page */
bio->bi_size += blocksize; this_chunk_blocks = dio->blocks_available;
dio->last_block_in_bio = map_bh.b_blocknr; u = (PAGE_SIZE - dio->bvec->bv_len) >> blkbits;
dio->boundary = buffer_boundary(&map_bh); if (this_chunk_blocks > u)
this_chunk_blocks = u;
dio->block_in_file++; u = dio->final_block_in_request - dio->block_in_file;
if (dio->block_in_file >= dio->final_block_in_request) if (this_chunk_blocks > u)
this_chunk_blocks = u;
this_chunk_bytes = this_chunk_blocks << blkbits;
BUG_ON(this_chunk_bytes == 0);
dio->bvec->bv_len += this_chunk_bytes;
bio->bi_size += this_chunk_bytes;
dio->next_block_in_bio += this_chunk_blocks;
dio->last_block_in_bio = dio->next_block_in_bio - 1;
dio->boundary = buffer_boundary(&dio->map_bh);
dio->block_in_file += this_chunk_blocks;
block_in_page += this_chunk_blocks;
dio->blocks_available -= this_chunk_blocks;
next_block:
if (dio->block_in_file > dio->final_block_in_request)
BUG();
if (dio->block_in_file == dio->final_block_in_request)
break; break;
} }
block_in_page = 0; block_in_page = 0;
...@@ -381,11 +523,16 @@ int do_direct_IO(struct dio *dio) ...@@ -381,11 +523,16 @@ int do_direct_IO(struct dio *dio)
return ret; return ret;
} }
/*
* The main direct-IO function. This is a library function for use by
* filesystem drivers.
*/
int int
generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset, generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset,
size_t count, get_block_t get_block) size_t count, get_blocks_t get_blocks)
{ {
const unsigned blocksize_mask = (1 << inode->i_blkbits) - 1; const unsigned blkbits = inode->i_blkbits;
const unsigned blocksize_mask = (1 << blkbits) - 1;
const unsigned long user_addr = (unsigned long)buf; const unsigned long user_addr = (unsigned long)buf;
int ret; int ret;
int ret2; int ret2;
...@@ -403,16 +550,18 @@ generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset, ...@@ -403,16 +550,18 @@ generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset,
dio.bvec = NULL; dio.bvec = NULL;
dio.inode = inode; dio.inode = inode;
dio.rw = rw; dio.rw = rw;
dio.block_in_file = offset >> inode->i_blkbits; dio.blkbits = blkbits;
dio.final_block_in_request = (offset + count) >> inode->i_blkbits; dio.block_in_file = offset >> blkbits;
dio.blocks_available = 0;
dio.final_block_in_request = (offset + count) >> blkbits;
/* Index into the first page of the first block */ /* Index into the first page of the first block */
dio.first_block_in_page = (user_addr & (PAGE_SIZE - 1)) dio.first_block_in_page = (user_addr & (PAGE_SIZE - 1)) >> blkbits;
>> inode->i_blkbits;
dio.boundary = 0; dio.boundary = 0;
dio.reap_counter = 0; dio.reap_counter = 0;
dio.get_block = get_block; dio.get_blocks = get_blocks;
dio.last_block_in_bio = -1; dio.last_block_in_bio = -1;
dio.next_block_in_bio = -1;
/* Page fetching state */ /* Page fetching state */
dio.curr_page = 0; dio.curr_page = 0;
...@@ -428,12 +577,13 @@ generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset, ...@@ -428,12 +577,13 @@ generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset,
/* Page queue */ /* Page queue */
dio.head = 0; dio.head = 0;
dio.tail = 0; dio.tail = 0;
dio.page_errors = 0;
/* BIO completion state */ /* BIO completion state */
atomic_set(&dio.bio_count, 0); atomic_set(&dio.bio_count, 0);
spin_lock_init(&dio.bio_list_lock); spin_lock_init(&dio.bio_list_lock);
dio.bio_list = NULL; dio.bio_list = NULL;
dio.waiter = current; dio.waiter = NULL;
ret = do_direct_IO(&dio); ret = do_direct_IO(&dio);
...@@ -444,9 +594,11 @@ generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset, ...@@ -444,9 +594,11 @@ generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset,
ret2 = dio_await_completion(&dio); ret2 = dio_await_completion(&dio);
if (ret == 0) if (ret == 0)
ret = ret2; ret = ret2;
if (ret == 0)
ret = dio.page_errors;
if (ret == 0) if (ret == 0)
ret = count - ((dio.final_block_in_request - ret = count - ((dio.final_block_in_request -
dio.block_in_file) << inode->i_blkbits); dio.block_in_file) << blkbits);
out: out:
return ret; return ret;
} }
......
...@@ -606,11 +606,24 @@ static int ext2_bmap(struct address_space *mapping, long block) ...@@ -606,11 +606,24 @@ static int ext2_bmap(struct address_space *mapping, long block)
return generic_block_bmap(mapping,block,ext2_get_block); return generic_block_bmap(mapping,block,ext2_get_block);
} }
static int
ext2_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks,
struct buffer_head *bh_result, int create)
{
int ret;
ret = ext2_get_block(inode, iblock, bh_result, create);
if (ret == 0)
bh_result->b_size = (1 << inode->i_blkbits);
return ret;
}
static int static int
ext2_direct_IO(int rw, struct inode *inode, char *buf, ext2_direct_IO(int rw, struct inode *inode, char *buf,
loff_t offset, size_t count) loff_t offset, size_t count)
{ {
return generic_direct_IO(rw, inode, buf, offset, count, ext2_get_block); return generic_direct_IO(rw, inode, buf,
offset, count, ext2_get_blocks);
} }
static int static int
......
...@@ -293,10 +293,23 @@ static int jfs_bmap(struct address_space *mapping, long block) ...@@ -293,10 +293,23 @@ static int jfs_bmap(struct address_space *mapping, long block)
return generic_block_bmap(mapping, block, jfs_get_block); return generic_block_bmap(mapping, block, jfs_get_block);
} }
static int
jfs_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks,
struct buffer_head *bh_result, int create)
{
int ret;
ret = jfs_get_block(inode, iblock, bh_result, create);
if (ret == 0)
bh_result->b_size = (1 << inode->i_blkbits);
return ret;
}
static int jfs_direct_IO(int rw, struct inode *inode, char *buf, static int jfs_direct_IO(int rw, struct inode *inode, char *buf,
loff_t offset, size_t count) loff_t offset, size_t count)
{ {
return generic_direct_IO(rw, inode, buf, offset, count, jfs_get_block); return generic_direct_IO(rw, inode, buf,
offset, count, jfs_get_blocks);
} }
struct address_space_operations jfs_aops = { struct address_space_operations jfs_aops = {
......
...@@ -211,7 +211,11 @@ extern void mnt_init(unsigned long); ...@@ -211,7 +211,11 @@ extern void mnt_init(unsigned long);
extern void files_init(unsigned long); extern void files_init(unsigned long);
struct buffer_head; struct buffer_head;
typedef int (get_block_t)(struct inode*,sector_t,struct buffer_head*,int); typedef int (get_block_t)(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
typedef int (get_blocks_t)(struct inode *inode, sector_t iblock,
unsigned long max_blocks,
struct buffer_head *bh_result, int create);
#include <linux/pipe_fs_i.h> #include <linux/pipe_fs_i.h>
/* #include <linux/umsdos_fs_i.h> */ /* #include <linux/umsdos_fs_i.h> */
...@@ -1238,7 +1242,7 @@ extern void do_generic_file_read(struct file *, loff_t *, read_descriptor_t *, r ...@@ -1238,7 +1242,7 @@ extern void do_generic_file_read(struct file *, loff_t *, read_descriptor_t *, r
ssize_t generic_file_direct_IO(int rw, struct inode *inode, char *buf, ssize_t generic_file_direct_IO(int rw, struct inode *inode, char *buf,
loff_t offset, size_t count); loff_t offset, size_t count);
int generic_direct_IO(int rw, struct inode *inode, char *buf, int generic_direct_IO(int rw, struct inode *inode, char *buf,
loff_t offset, size_t count, get_block_t *get_block); loff_t offset, size_t count, get_blocks_t *get_blocks);
extern loff_t no_llseek(struct file *file, loff_t offset, int origin); extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin); extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
......