Commits · de16834e42295433958e8af61730c872efcdf9de · nexedi / linux

29 Jul, 2002 28 commits

[PATCH] update overcommit doc and comment · de16834e
Hugh Dickins authored Jul 29, 2002
```
Update Doc and remove FIXME comment from fork.c now accounting right.
```
de16834e

[PATCH] fix shared and private accounting · e054680b

Hugh Dickins authored Jul 29, 2002

do_mmap_pgoff's (file == NULL) check was incorrect: it caused shared
MAP_ANONYMOUS objects to be counted twice (again in shmem_file_setup),
and again on fork(); whereas the equivalent shared /dev/zero objects
were correctly counted.  Conversely, a private readonly file mapping
was (correctly) not counted, but still not counted when mprotected to
writable: mprotect_fixup had pointless "charged = 0" changes, now it
does vm_enough_memory checking when private is first made writable
(but later we may want to refine behaviour on a noreserve mapping).

Also changed correct (flags & MAP_SHARED) test in do_mmap_pgoff to
equivalent (vm_flags & VM_SHARED) test: because do_mmap_pgoff is
dealing with vm_flags rather than the input flags by that stage.

e054680b

[PATCH] remove unhelpful vm_unacct_vma · 3f7583d3

Hugh Dickins authored Jul 29, 2002

Remove vm_unacct_vma function: it's only used in one place,
which can do it better by using vm_unacct_memory directly.

3f7583d3

[PATCH] mmap MAP_NORESERVE not in vm_flags · feb32a85

Hugh Dickins authored Jul 29, 2002

do_mmap_pgoff clears MAP_NORESERVE from vm_flags when VM accounts
strictly: but it's not in vm_flags, it's in flags (and tested there).

feb32a85

[PATCH] mremap MAP_NORESERVE not in flags · 4b07a3c5

Hugh Dickins authored Jul 29, 2002

There is no point in do_mremap clearing MAP_NORESERVE from its flags:
it has already validated that only the MREMAP_ flags can be set,
and it has no use for MAP_NORESERVE in the code that follows anyway.

4b07a3c5

[PATCH] SHMEM_MAX_BYTES overflow checking · 36372380

Hugh Dickins authored Jul 29, 2002

shmem_notify_change and shmem_file_write are now careful about an
overflowingly large loff_t before shifting it into unsigned long for
vm_enough_memory.
Rename SHMEM_MAX_BLOCKS to SHMEM_MAX_INDEX (to avoid confusion with
512-byte blocks), define SHMEM_MAX_BYTES from it.

But 2.5 vmtruncate lacked the s_maxbytes error handling which
shmem_notify_change now expects: bring it in from the -dj tree.
shmem_file_write error handling needs a closer look later on.
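
As a rough user-space illustration of the overflow concern (not the kernel code): the DEMO_ names, the 4 KB page size, the 2^20-page limit and the use of uint32_t to stand in for a 32-bit unsigned long are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  ((uint64_t)1 << DEMO_PAGE_SHIFT)
#define DEMO_MAX_INDEX  ((uint64_t)1 << 20)                  /* hypothetical page limit */
#define DEMO_MAX_BYTES  (DEMO_MAX_INDEX << DEMO_PAGE_SHIFT)  /* derived byte limit */

/* Unchecked form: a huge 64-bit size is silently truncated to 32 bits. */
static uint32_t demo_naive_pages(int64_t size)
{
    return (uint32_t)((uint64_t)size >> DEMO_PAGE_SHIFT);
}

/* Checked form: reject sizes beyond the byte limit before narrowing. */
static bool demo_checked_pages(int64_t size, uint32_t *pages)
{
    assert(pages != NULL);
    if (size < 0 || (uint64_t)size > DEMO_MAX_BYTES)
        return false;
    /* Round up to whole pages; the result is at most DEMO_MAX_INDEX. */
    *pages = (uint32_t)(((uint64_t)size + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT);
    return true;
}
```

With these assumed limits, a size of 2^44 bytes truncates to 0 pages in the unchecked form, while the checked form refuses it.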

36372380

[PATCH] shmem_file_write rounding VM_ACCT · 477436ba

Hugh Dickins authored Jul 29, 2002

Repeated overnight kernel builds in tmpfs showed insane Committed_AS
by morning.  The main bug was that shmem_file_write was passing
(newsize-oldsize)>>PAGE_SHIFT to vm_enough_memory, but it has to be
((newsize>>PAGE_SHIFT)-(oldsize>>PAGE_SHIFT)) - imagine 1k writes.

But actually, if we're going to do strict accounting, then we should
round up to next page not down - use VM_ACCT macro throughout (needs
unusual mix of PAGE_CACHE_SIZE with PAGE_SHIFT); and must count one
page for a long symlink.
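
The arithmetic can be checked in user space. This is only a sketch: the DEMO_ names and the 4 KB page size are assumptions, and demo_vm_acct is written in the spirit of the VM_ACCT macro mentioned above rather than copied from it.

```c
#include <assert.h>

/* Assumed page size for illustration. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

/* Round a byte count up to whole pages. */
static unsigned long demo_vm_acct(unsigned long size)
{
    return (size + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}

/* Buggy form: shifts the byte delta, so sub-page writes charge nothing. */
static unsigned long charge_delta_shifted(unsigned long oldsize, unsigned long newsize)
{
    assert(newsize >= oldsize);
    return (newsize - oldsize) >> DEMO_PAGE_SHIFT;
}

/* Shift each size first: correct, but rounds a partial last page down. */
static unsigned long charge_rounded_down(unsigned long oldsize, unsigned long newsize)
{
    assert(newsize >= oldsize);
    return (newsize >> DEMO_PAGE_SHIFT) - (oldsize >> DEMO_PAGE_SHIFT);
}

/* Strict accounting: round each size up to the next page. */
static unsigned long charge_rounded_up(unsigned long oldsize, unsigned long newsize)
{
    assert(newsize >= oldsize);
    return demo_vm_acct(newsize) - demo_vm_acct(oldsize);
}

/* Grow a file from 0 to total bytes in step-sized writes, summing the charges. */
static unsigned long total_charge(unsigned long (*charge)(unsigned long, unsigned long),
                                  unsigned long step, unsigned long total)
{
    unsigned long charged = 0, size;

    for (size = 0; size < total; size += step)
        charged += charge(size, size + step);
    return charged;
}
```

Growing a file to 5120 bytes in 1k writes charges 0 pages with the buggy form, 1 page when each size is shifted first, and 2 pages when rounding up.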

477436ba

[PATCH] implement kmem_cache_size() · e0126e64

Christoph Hellwig authored Jul 29, 2002

Currently there is no way to find out the effective object size of a slab
cache. XFS has lots of IRIX-derived code that wants to do zalloc() style
allocations on zones (which are implemented as slab caches in XFS/Linux)
and thus needs to know about it. There are three ways to implement it:

a) implement kmem_cache_zalloc
b) make the xfs zone a struct of kmem_cache_t and a size variable
c) implement kmem_cache_size

The current XFS tree does a) but I absolutely don't like it as it encourages
people to use kmem_cache_zalloc for new code instead of thinking about how
to utilize slab object reuse. b) would be easy, but I guess kmem_cache_size
is useful enough to get into the kernel. Here's the patch:
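
The patch itself is not reproduced on this page. As a user-space analogue of option c) only, the sketch below shows how a size query lets a caller build a zalloc() style helper without the cache offering one. struct demo_cache and every demo_ function are hypothetical stand-ins, not the kernel's slab API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a slab cache: it only remembers its object size. */
struct demo_cache {
    size_t objsize;
};

/* Analogue of the size query: report the effective object size. */
static size_t demo_cache_size(const struct demo_cache *cache)
{
    return cache->objsize;
}

/* Plain allocation; the contents are uninitialised, as with a slab object. */
static void *demo_cache_alloc(struct demo_cache *cache)
{
    assert(cache->objsize > 0);
    return malloc(cache->objsize);
}

/* Caller-side zalloc() style helper, built on the size query. */
static void *demo_zone_zalloc(struct demo_cache *zone)
{
    void *obj = demo_cache_alloc(zone);

    if (obj)
        memset(obj, 0, demo_cache_size(zone));
    return obj;
}
```

Keeping the zeroing in the caller means that ordinary users of the cache are not nudged toward zeroed allocations, which is the objection raised against option a).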

e0126e64

Rename "sys_pread/pwrite" to "sys_pread64/pwrite64" to match the · 5ff53a14
Linus Torvalds authored Jul 29, 2002
```
actual implementation and avoid confusion.
```
5ff53a14
[PATCH] fix e1000 after irq craziness · 12ebbff8
Dave Hansen authored Jul 28, 2002
```
I just duplicated the method used in drivers/net/tulip/de2104x.c
```
12ebbff8
parport: fix warning - "flags" is unused after cli/sti removal. · cb217905
Linus Torvalds authored Jul 28, 2002

cb217905
[PATCH] export rwsem downgrade function · ac7823e4
David Howells authored Jul 28, 2002
```
This should do the trick.
```
ac7823e4
[PATCH] fix REQ_QUEUED clearing in blk_insert_request() · 08f9788a
Jens Axboe authored Jul 28, 2002

08f9788a

[PATCH] page table page->index · f6c2354a

Paul Mackerras authored Jul 28, 2002

I found a situation where page->index for a pagetable page can be set
to 0 instead of the correct value.  This means that ptep_to_address
will return the wrong answer.  The problem occurs when remap_pmd_range
calls pte_alloc_map and pte_alloc_map needs to allocate a new pte
page, because remap_pmd_range has masked off the top bits of the
address (to avoid overflow in the computation of `end'), and it passes
the masked address to pte_alloc_map.
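
A small sketch of why the masked address loses information. It assumes an i386-like two-level layout (4 KB pages, 1024 ptes per pagetable page) and that the index recorded for a pagetable page is the base of the range it maps; the DEMO_ names are illustrative, not kernel identifiers.

```c
#include <assert.h>

/* Assumed i386-like layout, for illustration only. */
#define DEMO_PAGE_SHIFT   12
#define DEMO_PTRS_PER_PTE 1024UL
#define DEMO_PGDIR_SIZE   (DEMO_PTRS_PER_PTE << DEMO_PAGE_SHIFT) /* 4 MB per pte page */
#define DEMO_PGDIR_MASK   (~(DEMO_PGDIR_SIZE - 1))

/* Index recorded for a pte page: base virtual address of the range it maps. */
static unsigned long demo_pte_page_index(unsigned long address)
{
    return address & DEMO_PGDIR_MASK;
}

/* What the allocator sees when the caller has already masked off the top bits. */
static unsigned long demo_index_from_masked(unsigned long address)
{
    unsigned long masked = address & ~DEMO_PGDIR_MASK;

    /* The masked address always lies within the first 4 MB. */
    assert(masked < DEMO_PGDIR_SIZE);
    return demo_pte_page_index(masked);
}
```

For an address such as 0x40123000 the correct index under these assumptions is 0x40000000, but the masked address yields 0.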

Now we presumably don't need to get from the physical pages mapped by
remap_page_range back to the ptes mapping them.  But we could easily
map some normal pages using ptes in that pagetable page subsequently,
and when we call ptep_to_address on their ptes it will give the wrong
answer.

The patch below fixes the problem.

There is a more general question this brings up - some of the
procedures which iterate over ranges of ptes will do the wrong thing
if the end of the address range is too close to ~0UL, while others are
OK.  Is this a problem in practice?  On i386, ppc, and the 64-bit
architectures it isn't since user addresses can't go anywhere near
~0UL, but what about arm or m68k for instance?

And BTW, being able to go from a pte pointer to the mm and virtual
address that that pte maps is an extremely useful thing on ppc, since
it will enable me to do MMU hash-table management at set_pte (and
ptep_*) time and thus avoid the extra traversal of the pagetables that
I am currently doing in flush_tlb_*.  So if you do decide to back out
rmap, please leave in the hooks for setting page->mapping and
page->index on pagetable pages.

f6c2354a

[PATCH] fix include/linux/timer.h compile · 07611a33
Paul Mackerras authored Jul 28, 2002
```
include/linux/timer.h needs to include <linux/stddef.h>
to get the definition of NULL.
```
07611a33

[PATCH] fix do_open() interaction with rd.c · bac5bcac

Adam J. Richter authored Jul 28, 2002

	linux-2.5.28/drivers/block_dev.c has a new do_open that broke
initial ramdisk support, because it now requires devices that "manually"
set bdev->bd_openers to set bdev->bd_inode->i_size as well.  The
following single line patch, suggested by Russell King, fixes the
problem.

	There does not appear to be anyone acting as maintainer for
rd.c, so I posted to lkml yesterday to ask if anyone objected to my
submitting the patch to you, and I also emailed the message to Russell
King and Al Viro.  Nobody has complained.  I have been running the
patch for almost a day without problems.

bac5bcac

Merge bk://bk.arm.linux.org.uk:14691 · 0cd3455f
Linus Torvalds authored Jul 28, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
0cd3455f

[SERIAL] Cleanup includes. · 29363ba0

Russell King authored Jul 29, 2002

Al Viro pointed out there was a fair bit of redundancy here.  We
remove many include files from the serial layer, leaving those
which are necessary for it to build.  This has been posted to lkml,
no one complained.

This cset also combines a missing include of asm/io.h in 8250_pci.c
(unfortunately I've lost the name of the reporter, sorry.)

29363ba0

[SERIAL] Add HP Diva PCI serial port support. · 81059697
Russell King authored Jul 29, 2002

81059697
Merge bk://jfs.bkbits.net/linux-2.5 · fd4588d0
Linus Torvalds authored Jul 28, 2002
```
into home.transmeta.com:/home/torvalds/v2.5/linux
```
fd4588d0

Remove d_delete call from jfs_rmdir and jfs_unlink · cb36441a

Dave Kleikamp authored Jul 28, 2002

jfs_rmdir and jfs_unlink have always called d_delete, but it hasn't
caused a problem until 2.5.28.  The call is an artifact of the 2.2
kernel, which had gone unnoticed in 2.4 and 2.5.

cb36441a

Automerge · db469c8d
Linus Torvalds authored Jul 28, 2002

db469c8d
VM: remove unused /proc/sys/vm/kswapd and swapctl.h · a074f680
Christoph Hellwig authored Jul 29, 2002
```
These were totally unused for a long time.  It's interesting how
many files include swapctl.h, though..
```
a074f680

[PATCH] Remove cli() from R3964 line discipline. · aa1190a2

David Woodhouse authored Jul 28, 2002

I did this ages ago but never submitted it because I never got round to
testing it. I still haven't tested it, but it ought to work, and the code
is definitely broken without it...

aa1190a2

Leftover from trident cli/sti removal. · 86560be4
Linus Torvalds authored Jul 28, 2002
```
Noticed by Zwane Mwaikambo.
```
86560be4
Merge hch's addition of generic_file_sendfile into jffs2_file_operations. · 1a4c1b9e
David Woodhouse authored Jul 29, 2002

1a4c1b9e
[SERIAL] Add pci_disable_device() to initialisation failure paths. · 6a3c9612
Russell King authored Jul 29, 2002

6a3c9612

[SERIAL] Remove some old compatibility cruft from 8250_pci.c · b3a1d183

Russell King authored Jul 29, 2002

8250_pci.c contains some old compatibility cruft for when __devexit
wasn't defined by the generic kernel.  It is now, so it's gone.

b3a1d183

28 Jul, 2002 12 commits

[PATCH] restore lru_cache_del() in truncate_complete_page · ff42067b

Andrew Morton authored Jul 28, 2002

I removed the PF_INVALIDATE debug check from buffercache
leaks, too.  It's non-functional - the flag should have been
set across truncate_inode_pages(), not invalidate_inode_pages().

ff42067b

Cset exclude: mingo@elte.hu|ChangeSet|20020728030719|07783 · 5dfb4838
Linus Torvalds authored Jul 28, 2002

5dfb4838
Make "cpu_relax()" imply a barrier, since that's how it is · 3f0c2c5b
Linus Torvalds authored Jul 28, 2002
```
used.

This fixes a lockup in synchronize_irq() on x86.
```
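
A minimal sketch of the idea, assuming a GCC-compatible compiler: the demo_ names are hypothetical, and a real cpu_relax() may also emit an architecture-specific pause instruction. The empty asm with a "memory" clobber stops the compiler from caching the flag in a register across loop iterations.

```c
#include <assert.h>

/* Compiler barrier: forces memory to be re-read after this point. */
#define demo_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical relax helper that implies a barrier. */
static inline void demo_cpu_relax(void)
{
    demo_barrier();
}

/* Spin until *flag reads as zero or max_spins is reached; return the spin count.
 * Without the barrier, the compiler could hoist the load of *flag out of the loop. */
static int demo_wait_for_clear(const int *flag, int max_spins)
{
    int spins = 0;

    assert(max_spins >= 0);
    while (*flag && spins < max_spins) {
        demo_cpu_relax();
        spins++;
    }
    return spins;
}
```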
3f0c2c5b
Automerge · 984e13d3
Linus Torvalds authored Jul 28, 2002

984e13d3
Merge · 621f5626
Linus Torvalds authored Jul 28, 2002

621f5626

[PATCH] sched-2.5.29-B1 · 8e77485f

Ingo Molnar authored Jul 28, 2002

the attached patch is a comment update of sched.c and it also does a small
cleanup in migration_thread().

8e77485f

[PATCH] SCSI MODE_SENSE transfer length fix · c5155e55

Matthew Dharm authored Jul 28, 2002

Modified the MODE_SENSE write-protect test in sd.c to issue a SCSI
request with request_bufflen set to the same size that the MODE_SENSE
command being issued requests.

c5155e55

[PATCH] SCSI INQUIRY transfer length fix · d7cdb541

Matthew Dharm authored Jul 28, 2002

Fixed one of the INQUIRY commands used for probing SCSI devices.  This
badly-formed command was trapped by the usb-storage driver BUG_ON()
which is designed to stop commands with a badly formed transfer_length
field.

d7cdb541

[PATCH] put_page() uses audited · 06829ded

Andrew Morton authored Jul 28, 2002

Audit put_page() uses of pages that may be in the page cache.

Use page_cache_release() instead.

06829ded

[PATCH] Re: Limit in set_thread_area · 686d6649

Ingo Molnar authored Jul 28, 2002

the attached patch does the set_thread_area parameter simplification - it
also cleans up some other TLS issues, it removes the tls_* fields from the
thread_struct, and removes the now unused page-granularity flag.

686d6649

[PATCH] permit modular build of raw driver · 603e29ca

Andrew Morton authored Jul 28, 2002

This patch allows the raw driver to be built as a kernel module.

It also cleans up a bunch of stuff, C99ifies the initialisers, gives
lots of symbols static scope, etc.

The module is unloadable when there are zero bindings. The current
ioctl() interface has no way of undoing a binding - it only allows
bindings to be overwritten. So I overloaded a bind to major=0,minor=0
to mean "undo the binding". I'll update the raw(8) manpage for that.

generic_file_direct_IO has been exported to modules.

The call to invalidate_inode_pages2() has been moved from all
generic_file_direct_IO() callers into generic_file_direct_IO() itself,
mainly to avoid exporting invalidate_inode_pages2() to modules.

603e29ca

[PATCH] direct IO updates · 0d85f8bf

Andrew Morton authored Jul 28, 2002

This patch is a performance and correctness update to the direct-IO
code: O_DIRECT and the raw driver.  It mainly affects IO against
blockdevs.

The direct_io code was returning -EINVAL for a filesystem hole.  Change
it to clear the userspace page instead.

There were a few restrictions and weirdnesses wrt blocksize and
alignments.  The code has been reworked so we now lay out maximum-sized
BIOs at any sector alignment.

Because of this, the raw driver has been altered to set the blockdev's
soft blocksize to the minimum possible at open() time.  Typically, 512
bytes.  There are now no performance disadvantages to using small
blocksizes, and this gives the finest possible alignment.

There is no API here for setting or querying the soft blocksize of the
raw driver (there never was, really), which could conceivably be a
problem.  If it is, we can permit BLKBSZSET and BLKBSZGET against the
fd which /dev/raw/rawN returned, but that would require that
blk_ioctl() be exported to modules again.

This code is wickedly quick.  Here's an oprofile of a single 500MHz
PIII reading from four (old) scsi disks (two aic7xxx controllers) via
the raw driver.  Aggregate throughput is 72 megabytes/second:

c013363c 24       0.0896492   __set_page_dirty_buffers
c021b8cc 24       0.0896492   ahc_linux_isr
c012b5dc 25       0.0933846   kmem_cache_free
c014d894 26       0.09712     dio_bio_complete
c01cc78c 26       0.09712     number
c0123bd4 40       0.149415    follow_page
c01eed8c 46       0.171828    end_that_request_first
c01ed410 49       0.183034    blk_recount_segments
c01ed574 65       0.2428      blk_rq_map_sg
c014db38 85       0.317508    do_direct_IO
c021b090 90       0.336185    ahc_linux_run_device_queue
c010bb78 236      0.881551    timer_interrupt
c01052d8 25354    94.707      poll_idle

A testament to the efficiency of the 2.5 block layer.

And against four IDE disks on an HPT374 controller.  Throughput is 120
megabytes/sec:

c01eed8c 80       0.292462    end_that_request_first
c01fe850 87       0.318052    hpt3xx_intrproc
c01ed574 123      0.44966     blk_rq_map_sg
c01f8f10 141      0.515464    ata_select
c014db38 153      0.559333    do_direct_IO
c010bb78 235      0.859107    timer_interrupt
c01f9144 281      1.02727     ata_irq_enable
c01ff990 290      1.06017     udma_pci_init
c01fe878 308      1.12598     hpt3xx_maskproc
c02006f8 379      1.38554     idedisk_do_request
c02356a0 609      2.22637     pci_conf1_read
c01ff8dc 611      2.23368     udma_pci_start
c01ff950 922      3.37062     udma_pci_irq_status
c01f8fac 1002     3.66308     ata_status
c01ff26c 1059     3.87146     ata_start_dma
c01feb70 1141     4.17124     hpt374_udma_stop
c01f9228 3072     11.2305     ata_out_regfile
c01052d8 15193    55.5422     poll_idle

Not so good.

One problem which has been identified with O_DIRECT is the cost of
repeated calls into the mapping's get_block() callback.  Not a big
problem with ext2 but other filesystems have more complex get_block
implementations.

So what I have done is to require that callers of generic_direct_IO()
implement the new `get_blocks()' interface.  This is a small extension
to get_block().  It gets passed another argument which indicates the
maximum number of blocks which should be mapped, and it returns the
number of blocks which it did map in bh_result->b_size.  This allows
the fs to map up to 4G of disk (or of hole) in a single get_block()
invocation.
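
The shape of such a call can be sketched in user space. This is not the kernel interface: struct demo_extent, struct demo_map_result and demo_get_blocks are hypothetical, and nr_blocks merely stands in for the length that the real callback reports through bh_result->b_size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent: len file blocks starting at file_block are stored
 * contiguously on disk starting at disk_block. */
struct demo_extent {
    unsigned long file_block;
    unsigned long disk_block;
    unsigned long len;
};

/* Stand-in for the result the callback fills in. */
struct demo_map_result {
    int mapped;               /* 1 = disk blocks, 0 = hole */
    unsigned long disk_block; /* meaningful only when mapped */
    unsigned long nr_blocks;  /* length of the linear chunk */
};

/* Map one linear chunk starting at iblock, at most max_blocks long.
 * exts must be sorted by file_block and must not overlap. */
static void demo_get_blocks(const struct demo_extent *exts, size_t nr_exts,
                            unsigned long iblock, unsigned long max_blocks,
                            struct demo_map_result *res)
{
    unsigned long chunk = max_blocks;
    size_t i;

    assert(max_blocks > 0);
    res->mapped = 0;
    res->disk_block = 0;

    for (i = 0; i < nr_exts; i++) {
        const struct demo_extent *e = &exts[i];

        if (iblock >= e->file_block && iblock - e->file_block < e->len) {
            unsigned long off = iblock - e->file_block;
            unsigned long avail = e->len - off;

            res->mapped = 1;
            res->disk_block = e->disk_block + off;
            res->nr_blocks = avail < max_blocks ? avail : max_blocks;
            return;
        }
        if (e->file_block > iblock) {
            /* iblock sits in a hole that ends where this extent begins. */
            unsigned long hole = e->file_block - iblock;

            if (hole < chunk)
                chunk = hole;
            break;
        }
    }
    res->nr_blocks = chunk; /* hole, clipped to max_blocks */
}
```

One call returns a whole run of blocks or a whole hole, so a caller doing large I/O makes far fewer mapping calls than it would one block at a time.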

There are some other caveats and requirements of get_blocks() which are
documented in the comment block over fs/direct_io.c:get_more_blocks().

Possibly, get_blocks() will be the 2.6 kernel's way of doing gang block
mapping.  It certainly allows good speedups.  But it doesn't allow the
fs to return a scatter list of blocks - it only understands linear
chunks of disk.  I think that's really all it _should_ do.

I'll let get_blocks() sit for a while and wait for some feedback.  If
it is sufficient and nobody objects too much, I shall convert all
get_block() instances in the kernel to be get_blocks() instances.  And
I'll teach readahead (at least) to use the get_blocks() extension.

Delayed allocate writeback could use get_blocks().  As could
block_prepare_write() for blocksize < PAGE_CACHE_SIZE.  There's no
mileage using it in mpage_writepages() because all our filesystems are
syncalloc, and nobody uses MAP_SHARED for much.

It will be tricky to use get_blocks() for writes, because if a ton of
blocks have been mapped into the file and then something goes wrong,
the kernel needs to either remove those blocks from the file or zero
them out.  The direct_io code zeroes them out.

btw, some time ago you mentioned that some drivers and/or hardware may
get upset if there are multiple simultaneous IOs in progress against
the same block.  Well, the raw driver has always allowed that to
happen.  O_DIRECT writes to blockdevs do as well now.

todo:

1) The driver will probably explode if someone runs BLKBSZSET while
   IO is in progress.  Need to use bdclaim() somewhere.

2) readv() and writev() need to become direct_io-aware.  At present
   we're doing stop-and-wait for each segment when performing
   readv/writev against the raw driver and O_DIRECT blockdevs.

0d85f8bf