1. 19 Jul, 2007 40 commits
    • coredump masking: ELF-FDPIC: enable core dump filtering · ee78b0a6
      Kawai, Hidehiro authored
      This patch enables core dump filtering for ELF-FDPIC-formatted core files.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump masking: ELF-FDPIC: remove an unused argument · e2e00906
      Kawai, Hidehiro authored
      This patch removes an unused argument from elf_fdpic_dump_segments().
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump masking: ELF: enable core dump filtering · a1b59e80
      Kawai, Hidehiro authored
      This patch enables core dump filtering for ELF-formatted core files.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump masking: add an interface for core dump filter · 3cb4a0bb
      Kawai, Hidehiro authored
      This patch adds an interface to set and reset the flags that determine
      whether each memory segment is dumped when a core file is generated.
      
      The /proc/<pid>/coredump_filter file provides access to the flags.  You can
      read or change the flag status of a particular process by reading from or
      writing to the file.
      
      The flag status is inherited by the child process when it is created.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump masking: reimplementation of dumpable using two flags · 6c5d5238
      Kawai, Hidehiro authored
      This patch changes mm_struct.dumpable to a pair of bit flags.
      
      set_dumpable() converts the three-valued dumpable setting into two flags and
      stores them in the lower two bits of mm_struct.flags instead of
      mm_struct.dumpable.  get_dumpable() performs the reverse conversion.
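
      The conversion described above can be sketched in plain C.  This is only
      an illustration, not the kernel code: the function names, the bit names
      and the exact encoding (0 -> 00, 1 -> 01, 2 -> 11) are assumptions made
      for this example, and the atomic bitops and memory barriers the kernel
      would need are left out.

```c
/* Hypothetical bit positions inside the flags word (illustrative names). */
#define DUMPABLE_BIT       0u
#define DUMP_SECURELY_BIT  1u
#define DUMPABLE_MASK      0x3ul

/* Store a three-valued dumpable setting (0, 1 or 2) in the lower two bits,
 * leaving all other bits of the flags word untouched. */
static inline unsigned long encode_dumpable(unsigned long flags, int value)
{
	flags &= ~DUMPABLE_MASK;
	switch (value) {
	case 1:
		flags |= 1ul << DUMPABLE_BIT;
		break;
	case 2:
		flags |= (1ul << DUMPABLE_BIT) | (1ul << DUMP_SECURELY_BIT);
		break;
	default:	/* 0 (or out of range): not dumpable, both bits clear */
		break;
	}
	return flags;
}

/* Reverse conversion: map the two bits back to 0, 1 or 2. */
static inline int decode_dumpable(unsigned long flags)
{
	unsigned long bits = flags & DUMPABLE_MASK;

	return bits >= 2 ? 2 : (int)bits;
}
```

      Keeping the setting in two bits of a shared flags word is what lets the
      remaining bits carry other per-mm flags.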
      
      [akpm@linux-foundation.org: export set_dumpable]
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump masking: bound suid_dumpable sysctl · 76fdbb25
      Kawai, Hidehiro authored
      This patch series is version 5 of the core dump masking feature, which
      controls which VMAs should be dumped based on their memory types and
      per-process flags.
      
      I adopted most of Andrew's suggestions from the previous version.  He also
      suggested using a system call instead of the /proc/<pid>/ interface, but I
      decided to keep the latter because adding a new system call with a pid
      argument would have a big impact on the kernel.
      
      You can access the per-process flags via the /proc/<pid>/coredump_filter
      interface.  coredump_filter represents a bitmask of memory types; if a bit
      is set, VMAs of the corresponding memory type are written into the core file
      when the process is dumped.  The bitmask is inherited from the parent process
      when a process is created.
      
      The original purpose is to avoid a long system slowdown when many processes
      that share a huge shared memory segment are dumped at the same time.  To
      achieve this, this patch series adds the ability to suppress dumping of
      anonymous shared memory for specified processes.  In this version, three other
      memory types are also supported.
      
      Here are the coredump_filter bits:
        bit 0: anonymous private memory
        bit 1: anonymous shared memory
        bit 2: file-backed private memory
        bit 3: file-backed shared memory
      
      The default value of coredump_filter is 0x3.  This means that, by default,
      the new core dump routine behaves the same as the conventional one.
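
      As a rough illustration of how such a bitmask can be interpreted, the
      sketch below tests one bit per memory type.  The enum, the macro and the
      function name are hypothetical and are not taken from the patch; only
      the bit assignments and the default value 0x3 come from the description
      above.

```c
#include <stdbool.h>

/* Bit numbers as listed above; the identifiers are illustrative only. */
enum mem_type {
	ANON_PRIVATE = 0,	/* bit 0: anonymous private memory */
	ANON_SHARED  = 1,	/* bit 1: anonymous shared memory */
	FILE_PRIVATE = 2,	/* bit 2: file-backed private memory */
	FILE_SHARED  = 3	/* bit 3: file-backed shared memory */
};

#define COREDUMP_FILTER_DEFAULT 0x3u

/* A VMA of the given type is dumped only if its bit is set in the filter. */
static inline bool should_dump(unsigned int filter, enum mem_type type)
{
	return (filter >> type) & 1u;
}
```

      With the default 0x3, both anonymous types are dumped and both
      file-backed types are skipped.  A filter of 0x1 would keep anonymous
      private memory but leave out anonymous shared memory, which is the case
      the series was originally written for.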
      
      In this version, the coredump_filter bits and mm.dumpable are merged into
      mm.flags, which is accessed with atomic bitops.
      
      The supported core file formats are ELF and ELF-FDPIC.  ELF has been tested,
      but ELF-FDPIC has been neither built nor tested because I don't have a test
      environment.
      
      This patch limits the value of the suid_dumpable sysctl to the range 0 to 2.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • docbook: don't reference file without kernel-doc · 86fd6dfc
      Randy Dunlap authored
      Remove include/linux/rmap.h from kernel-api.tmpl since it no longer
      contains kernel-doc.  Fixes this warning:
      
      Warning(linux-2.6.22//include/linux/rmap.h): no structured comments found
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel-doc: fix leading dot in man-mode output · cdccb316
      Randy Dunlap authored
      If a parameter description begins with a '.', this indicates a "request"
      for "man" mode output (*roff), so it needs special handling.
      
      Problem case is in include/asm-i386/atomic.h for function
      atomic_add_unless():
       * @u: ...unless v is equal to u.
      This parameter description is currently not printed in man mode output.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel-doc: strip C99 comments · 51f5a0c8
      Randy Dunlap authored
      Strip C99-style comments from the input stream.
      /*...*/ comments are already stripped.
      C99 comments confuse the kernel-doc script.
      
      Also update some comments.
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel-doc: fix unnamed struct/union warning · 5f8c7c98
      Randy Dunlap authored
      Fix kernel-doc warning:
      Warning(linux-2.6.22-rc2-git2/include/linux/skbuff.h:316): No description found for parameter '}'
      
      which is caused by nested anonymous structs/unions ending with:
        };
      };
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel-doc: add tools doc in Makefile · 2ac534bc
      Randy Dunlap authored
      Add kernel-doc tools info in Makefile.
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: remove path_walk export · f79c20f5
      Josef 'Jeff' Sipek authored
      Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: mark link_path_walk static · c4a7808f
      Josef 'Jeff' Sipek authored
      Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nfsctl: use vfs_path_lookup · 16b6287a
      Josef 'Jeff' Sipek authored
      Use vfs_path_lookup instead of open-coding the necessary functionality.
      Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Acked-by: NeilBrown <neilb@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sunrpc: use vfs_path_lookup · 4ac4efc1
      Josef 'Jeff' Sipek authored
      Use vfs_path_lookup instead of open-coding the necessary functionality.
      Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: introduce vfs_path_lookup · 16f18200
      Josef 'Jeff' Sipek authored
      Stackable file systems, among others, frequently need to look up paths or
      path components starting from an arbitrary point in the namespace
      (identified by a dentry and a vfsmount).  Currently, such file systems use
      lookup_one_len, which is frowned upon [1] as it does not pass the lookup
      intent along; not passing a lookup intent can, for example, trigger BUG_ONs
      when stacking on top of NFSv4.
      
      The first patch introduces a new lookup function to allow lookup starting
      from an arbitrary point in the namespace.  This approach has been suggested
      by Christoph Hellwig [2].
      
      The second patch changes sunrpc to use vfs_path_lookup.
      
      The third patch changes nfsctl.c to use vfs_path_lookup.
      
      The fourth patch marks link_path_walk static.
      
      The fifth and last patch unexports path_walk because it is no longer
      necessary to call it directly, and using the new vfs_path_lookup is
      cleaner.
      
      For example, the following snippet of code looks up "some/path/component"
      in a directory pointed to by parent_{dentry,vfsmnt}:
      
      err = vfs_path_lookup(parent_dentry, parent_vfsmnt,
      		      "some/path/component", 0, &nd);
      if (!err) {
      	/* exists */
      
      	...
      
      	/* once done, release the references */
      	path_release(&nd);
      } else if (err == -ENOENT) {
      	/* doesn't exist */
      } else {
      	/* other error */
      }
      
      VFS functions such as lookup_create can be used on the nameidata structure
      to pass the create intent to the file system.
      Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: variable length argument support · b6a2fea3
      Ollie Wild authored
      Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
      the old mm into the new mm.
      
      We create the new mm before the binfmt code runs, and place the new stack at
      the very top of the address space.  Once the binfmt code runs and figures out
      where the stack should be, we move it downwards.
      
      It is a bit peculiar in that we have one task with two mm's, one of which is
      inactive.
      
      [a.p.zijlstra@chello.nl: limit stack size]
      Signed-off-by: Ollie Wild <aaw@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      [bunk@stusta.de: unexport bprm_mm_init]
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • audit: rework execve audit · bdf4c48a
      Peter Zijlstra authored
      The purpose of audit_bprm() is to log the argv array to a userspace daemon at
      the end of the execve system call.  Since user-space hasn't had time to run,
      this array is still in pristine state on the process' stack; so no need to
      copy it, we can just grab it from there.
      
      In order to minimize the damage to audit_log_*(), copy each string into a
      temporary kernel buffer first.
      
      Currently the audit code requires that the full argument vector fits in a
      single packet.  For now, the argv size is therefore clipped to a (sysctl)
      limit, but only when execve auditing is enabled.
      
      If the audit protocol gets extended to allow for multiple packets this check
      can be removed.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ollie Wild <aaw@google.com>
      Cc: <linux-audit@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch: personality independent stack top · b111757c
      Peter Zijlstra authored
      Add a new arch macro, STACK_TOP_MAX, which gives the largest valid stack
      address for the architecture in question.
      
      It differs from STACK_TOP in that it will not distinguish between
      personalities but will always return the largest possible address.
      
      This is used to create the initial stack on execve, which we will move down to
      the proper location once the binfmt code has figured out where that is.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ollie Wild <aaw@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • use the new percpu interface for shared data · f34e3b61
      Fenghua Yu authored
      Currently most of the per cpu data that is accessed by different cpus
      has a ____cacheline_aligned_in_smp attribute.  Move all this data to the
      new per cpu shared data section: .data.percpu.shared_aligned.
      
      This will separate the percpu data that is referenced frequently by other
      cpus from the local-only percpu data.
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • define new percpu interface for shared data · 5fb7dc37
      Fenghua Yu authored
      The per cpu data section contains two types of data: one set that is
      exclusively accessed by the local cpu, and another set that is per cpu
      but also shared by remote cpus.  In the current kernel, these two sets are
      not clearly separated.  This can potentially cause the same data
      cacheline to be shared between the two sets of data, which will result in
      unnecessary bouncing of the cacheline between cpus.
      
      One way to fix the problem is to cacheline align the remotely accessed per
      cpu data, both at the beginning and at the end.  Because of the padding at
      both ends, this will likely cause some memory wastage and also the
      interface to achieve this is not clean.
      
      This patch:
      
      Moves the remotely accessed per cpu data (which is currently marked
      as ____cacheline_aligned_in_smp) into a different section, where all the data
      elements are cacheline aligned. And as such, this differentiates the local
      only data and remotely accessed data cleanly.
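
      The effect of cacheline-aligning the remotely accessed data can be
      illustrated in user-space C11.  This is only an analogy, not the kernel
      interface: the 64-byte line size, the struct and the helper below are
      assumptions made for the example.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHELINE_SIZE 64	/* assumed; the real size is architecture specific */

/* Data that other cpus also touch: aligned so that each instance starts on
 * its own cacheline.  The compiler pads sizeof() up to the alignment, so
 * neighbouring instances never share a line. */
struct remotely_accessed {
	alignas(CACHELINE_SIZE) long queue_len;
};

/* True if two addresses fall into the same (assumed) cacheline. */
static inline bool same_cacheline(uintptr_t a, uintptr_t b)
{
	return a / CACHELINE_SIZE == b / CACHELINE_SIZE;
}
```

      Grouping all such objects in one section, as the patch does, means the
      padding is only paid for data that actually needs it.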
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: <linux-arch@vger.kernel.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • jprobes: make jprobes a little safer for users · 3d7e3382
      Michael Ellerman authored
      I realise jprobes are a razor-blades-included type of interface, but that
      doesn't mean we can't try and make them safer to use.  This guy I know once
      wrote code like this:
      
      struct jprobe jp = { .kp.symbol_name = "foo", .entry = "jprobe_foo" };
      
      And then his kernel exploded. Oops.
      
      This patch adds an arch hook, arch_deref_entry_point() (I don't like it
      either) which takes the void * in a struct jprobe, and gives back the text
      address that it represents.
      
      We can then use that in register_jprobe() to check that the entry point we're
      passed is actually in the kernel text, rather than just some random value.
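
      The kind of check described here can be sketched as a simple range test.
      The function below is hypothetical and is not the code added by the
      patch; the text boundaries are passed in explicitly so that the example
      stays self-contained.

```c
#include <errno.h>
#include <stdint.h>

/* Accept an entry point only if it lies inside [text_start, text_end).
 * A pointer to a string literal such as "jprobe_foo" would live in a data
 * section and therefore fail this test. */
static inline int check_entry_point(uintptr_t text_start, uintptr_t text_end,
				    uintptr_t entry)
{
	if (entry < text_start || entry >= text_end)
		return -EINVAL;
	return 0;
}
```

      On platforms where the entry is a function descriptor, the address would
      first have to be translated to the real text address, which is the job
      the commit message assigns to arch_deref_entry_point().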
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • jprobes: remove JPROBE_ENTRY() · 9e367d85
      Michael Ellerman authored
      AFAICT now that jprobe.entry is a void *, JPROBE_ENTRY doesn't do anything
      useful, so remove it.
      
      I've left a do-nothing version so that out-of-tree jprobes code will still
      compile without modifications.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • jprobes: make struct jprobe.entry a void * · 81eae375
      Michael Ellerman authored
      Currently jprobe.entry is a kprobe_opcode_t *, but that's a lie.  On some
      platforms it doesn't point to an opcode at all, it points to a function
      descriptor.
      
      It's really a pointer to something that the arch code can turn into a function
      entry point.  And that's what actually happens: none of the generic code ever
      looks at jprobe.entry, it's only ever dereferenced by arch code.
      
      So just make it a void *.
      Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: sanify file_ra_state names · f9acc8c7
      Fengguang Wu authored
      Rename some file_ra_state variables and remove some accessors.
      
      It results in much simpler code.
      Kudos to Rusty!
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: split ondemand readahead interface into two functions · cf914a7d
      Rusty Russell authored
      Split ondemand readahead interface into two functions.  I think this makes it
      a little clearer for non-readahead experts (like Rusty).
      
      Internally they both call ondemand_readahead(), but the page argument is
      changed to an obvious boolean flag.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: share PG_readahead and PG_reclaim · fe3cba17
      Fengguang Wu authored
      Share the same page flag bit for PG_readahead and PG_reclaim.
      
      One is used only on file reads, the other only for emergency writes.  One
      is used mostly for fresh/young pages, the other for old pages.
      
      Combinations of possible interactions are:
      
      a) clear PG_reclaim => implicit clear of PG_readahead
      	it will delay an asynchronous readahead into a synchronous one
      	it actually does _good_ for readahead:
      		the pages will be reclaimed soon, it's readahead thrashing!
      		in this case, synchronous readahead makes more sense.
      
      b) clear PG_readahead => implicit clear of PG_reclaim
      	one (and only one) page will not be reclaimed in time
      	it can be avoided by checking PageWriteback(page) in readahead first
      
      c) set PG_reclaim => implicit set of PG_readahead
      	will confuse readahead and make it restart the size rampup process
      	it's a trivial problem, and can mostly be avoided by checking
      	PageWriteback(page) first in readahead
      
      d) set PG_readahead => implicit set of PG_reclaim
      	PG_readahead will never be set on already cached pages.
      	PG_reclaim will always be cleared on dirtying a page.
      	so not a problem.
      
      In summary,
      	a)   we get better behavior
      	b,d) possible interactions can be avoided
      	c)   a race condition exists that might affect readahead, but the chance
      	     is _really_ low, and the impact on readahead is trivial.
      
      Compound pages also use PG_reclaim, but for now they do not interact with
      reclaim/readahead code.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: pass real splice size · d8983910
      Fengguang Wu authored
      Pass real splice size to page_cache_readahead_ondemand().
      
      The splice code works in chunks of 16 pages internally.  The readahead code
      should be told of the overall splice size, instead of the internal chunk size.
      Otherwise bad things may happen.  Imagine some 17-page random splice reads.
      The code before this patch will result in two readahead calls: readahead(16);
      readahead(1);  That leads to one 16-page I/O and one 32-page I/O: one extra I/O
      and 31 readahead miss pages.
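
      The arithmetic behind the 31 miss pages can be spelled out in a few
      lines of C.  The helper is hypothetical and only sums the I/O sizes
      quoted above; it does not model how the readahead code arrives at them.

```c
#include <stddef.h>

/* Pages read from disk but never asked for: the total size of all submitted
 * I/Os minus the number of pages the caller actually wanted. */
static inline unsigned int readahead_miss_pages(const unsigned int *io_pages,
						size_t n_ios,
						unsigned int wanted_pages)
{
	unsigned int total = 0;

	for (size_t i = 0; i < n_ios; i++)
		total += io_pages[i];

	return total > wanted_pages ? total - wanted_pages : 0;
}
```

      For the 17-page read described above, the two I/Os of 16 and 32 pages
      add up to 48 pages, so 48 - 17 = 31 pages are read for nothing.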
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: move synchronous readahead call out of splice loop · 431a4820
      Fengguang Wu authored
      Move synchronous page_cache_readahead_ondemand() call out of splice loop.
      
      This avoids one pointless page allocation/insertion in case of non-zero
      ra_pages, or many pointless readahead calls in case of zero ra_pages.
      
      Note that if a user sets ra_pages to less than PIPE_BUFFERS=16 pages, they
      will not get the expected readahead behavior anyway.  The splice code works in
      batches of 16 pages, which can be taken as another form of synchronous
      readahead.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: remove the old algorithm · c743d96b
      Fengguang Wu authored
      Remove the old readahead algorithm.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: convert ext3/ext4 invocations · dc7868fc
      Fengguang Wu authored
      Convert ext3/ext4 dir reads to use on-demand readahead.
      
      Readahead for dirs operates _not_ at the file level, but at the blockdev level.
      This makes a difference when the data blocks are not contiguous.  And the read
      routine is somewhat opaque: there's no handy info about the status of the
      current page.  So a simplified call scheme is employed: call into readahead
      whenever the current page falls outside the readahead window.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: convert splice invocations · a08a166f
      Fengguang Wu authored
      Convert splice reads to use on-demand readahead.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Jens Axboe <axboe@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: convert filemap invocations · 3ea89ee8
      Fengguang Wu authored
      Convert filemap reads to use on-demand readahead.
      
      The new call scheme is to
      - call readahead on non-cached page
      - call readahead on look-ahead page
      - update prev_index when finished with the read request
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: on-demand readahead logic · 122a21d1
      Fengguang Wu authored
      This is a minimal readahead algorithm that aims to replace the current one.
      It is more flexible and reliable, while maintaining almost the same behavior
      and performance.  It is also fully integrated with adaptive readahead.
      
      It is designed to be called on demand:
      	- on a missing page, to do synchronous readahead
      	- on a lookahead page, to do asynchronous readahead
      
      In this way it eliminates the awkward workarounds for cache hit/miss,
      readahead thrashing, retried read, and unaligned read.  It also adopts the
      data structure introduced by adaptive readahead, parameterizes readahead
      pipelining with `lookahead_index', and reduces the current/ahead windows to
      one single window.
      
      HEURISTICS
      
      The logic deals with four cases:
      
      	- sequential-next
      		found a consistent readahead window, so push it forward
      
      	- random
      		standalone small read, so read as is
      
      	- sequential-first
      		create a new readahead window for a sequential/oversize request
      
      	- lookahead-clueless
      		hit a lookahead page not associated with the readahead window,
      		so create a new readahead window and ramp it up
      
      In each case, three parameters are determined:
      
      	- readahead index: where the next readahead begins
      	- readahead size:  how much to readahead
      	- lookahead size:  when to do the next readahead (for pipelining)
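The four cases and the three parameters can be illustrated with a simplified classifier. This is a sketch under stated assumptions, not the logic of the patch: the `toy_ra`, `toy_plan`, `toy_classify()` and `toy_push_forward()` names, the exact tests, and the window-doubling policy are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ra_case {
	RA_SEQUENTIAL_NEXT,
	RA_RANDOM,
	RA_SEQUENTIAL_FIRST,
	RA_LOOKAHEAD_CLUELESS
};

struct toy_ra {
	size_t start;       /* first page of the current window */
	size_t size;        /* window length in pages; 0 means no window */
	size_t prev_index;  /* last page of the previous read request */
};

/* The three parameters determined for each case. */
struct toy_plan {
	size_t ra_index;    /* where the next readahead begins */
	size_t ra_size;     /* how much to read ahead */
	size_t la_size;     /* when to do the next readahead */
};

static enum ra_case toy_classify(const struct toy_ra *ra, size_t offset,
				 size_t req_size, size_t max_pages,
				 bool hit_lookahead_page)
{
	/* The request is consistent with the tracked window. */
	if (ra->size && offset >= ra->start && offset <= ra->start + ra->size)
		return RA_SEQUENTIAL_NEXT;
	/* A marked page was hit, but no window accounts for it. */
	if (hit_lookahead_page)
		return RA_LOOKAHEAD_CLUELESS;
	/* Continues the previous read, or is an oversize request. */
	if (offset - ra->prev_index <= 1 || req_size > max_pages)
		return RA_SEQUENTIAL_FIRST;
	/* Standalone small read: read as is. */
	return RA_RANDOM;
}

/* Sequential-next: push the window forward (doubling is an assumption). */
static struct toy_plan toy_push_forward(const struct toy_ra *ra,
					size_t max_pages)
{
	struct toy_plan p;

	assert(ra->size > 0);
	p.ra_index = ra->start + ra->size;
	p.ra_size = ra->size * 2 < max_pages ? ra->size * 2 : max_pages;
	p.la_size = p.ra_size;  /* the whole new window is read ahead of need */
	return p;
}
```

The order of the tests matters in this sketch: a consistent window wins over everything else, and a look-ahead hit without a window is handled before the sequential/random decision.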
      
      BEHAVIORS
      
      The old behaviors are maximally preserved for trivial sequential/random reads.
      Notable changes are:
      
      	- It no longer imposes strict sequential checks.
      	  It might help some interleaved cases, and clustered random reads.
      	  It does introduce risks of a random lookahead hit triggering an
      	  unexpected readahead. But in general it is more likely to do good
      	  than to do evil.
      
      	- Interleaved reads are supported in a minimal way.
      	  Their chances of being detected and properly handled are still low.
      
      	- Readahead thrashing is better handled.
      	  The current readahead leads to tiny average I/O sizes, because it
      	  never turns back for the thrashed pages.  They have to be faulted in
      	  by do_generic_mapping_read() one by one, whereas the on-demand
      	  readahead will redo readahead for them.
      
      OVERHEADS
      
      The new code reduces the overhead of
      
      	- excessively calling the readahead routine on small sized reads
      	  (the current readahead code insists on seeing all requests)
      
      	- doing a lot of pointless page-cache lookups for small cached files
      	  (the current readahead only turns itself off after 256 cache hits;
      	  unfortunately most files are < 1MB, so they never get that chance)
      
      That accounts for a speedup of
      	- 0.3% on 1-page sequential reads on a sparse file
      	- 1.2% on 1-page cache hot sequential reads
      	- 3.2% on 256-page cache hot sequential reads
      	- 1.3% on cache hot `tar /lib`
      
      However, it does introduce one extra page-cache lookup per cache miss, which
      impacts random reads slightly.  That is a 1% overhead for 1-page random
      reads on a sparse file.
      
      PERFORMANCE
      
      The basic benchmark setup is
      	- 2.6.20 kernel with on-demand readahead
      	- 1MB max readahead size
      	- 2.9GHz Intel Core 2 CPU
      	- 2GB memory
      	- 160G/8M Hitachi SATA II 7200 RPM disk
      
      The benchmarks show that
      	- it maintains the same performance for trivial sequential/random reads
      	- sysbench/OLTP performance on MySQL gains up to 8%
      	- performance on readahead thrashing gains up to 3 times
      
      iozone throughput (KB/s): roughly the same
      ==========================================
      iozone -c -t1 -s 4096m -r 64k
      
      			       2.6.20          on-demand      gain
      first run
      	  "  Initial write "   61437.27        64521.53      +5.0%
      	  "        Rewrite "   47893.02        48335.20      +0.9%
      	  "           Read "   62111.84        62141.49      +0.0%
      	  "        Re-read "   62242.66        62193.17      -0.1%
      	  "   Reverse Read "   50031.46        49989.79      -0.1%
      	  "    Stride read "    8657.61         8652.81      -0.1%
      	  "    Random read "   13914.28        13898.23      -0.1%
      	  " Mixed workload "   19069.27        19033.32      -0.2%
      	  "   Random write "   14849.80        14104.38      -5.0%
      	  "         Pwrite "   62955.30        65701.57      +4.4%
      	  "          Pread "   62209.99        62256.26      +0.1%
      
      second run
      	  "  Initial write "   60810.31        66258.69      +9.0%
      	  "        Rewrite "   49373.89        57833.66     +17.1%
      	  "           Read "   62059.39        62251.28      +0.3%
      	  "        Re-read "   62264.32        62256.82      -0.0%
      	  "   Reverse Read "   49970.96        50565.72      +1.2%
      	  "    Stride read "    8654.81         8638.45      -0.2%
      	  "    Random read "   13901.44        13949.91      +0.3%
      	  " Mixed workload "   19041.32        19092.04      +0.3%
      	  "   Random write "   14019.99        14161.72      +1.0%
      	  "         Pwrite "   64121.67        68224.17      +6.4%
      	  "          Pread "   62225.08        62274.28      +0.1%
      
      In summary, writes are unstable, while reads are pretty close on average:
      
      			  access pattern  2.6.20  on-demand   gain
      				   Read  62085.61  62196.38  +0.2%
      				Re-read  62253.49  62224.99  -0.0%
      			   Reverse Read  50001.21  50277.75  +0.6%
      			    Stride read   8656.21   8645.63  -0.1%
      			    Random read  13907.86  13924.07  +0.1%
      	 		 Mixed workload  19055.29  19062.68  +0.0%
      				  Pread  62217.53  62265.27  +0.1%
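The summary rows appear to be the mean of the two runs, with the gain taken relative to the 2.6.20 figure. A minimal sketch of that arithmetic follows; the `mean2()` and `gain_pct()` helper names are hypothetical.

```c
#include <assert.h>

/* Mean throughput of two benchmark runs. */
static double mean2(double first_run, double second_run)
{
	return (first_run + second_run) / 2.0;
}

/* Relative change of `after` over `before`, in percent. */
static double gain_pct(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

For the "Read" row, `mean2(62111.84, 62059.39)` gives the 2.6.20 average of about 62085.61 KB/s, and `gain_pct(62085.61, 62196.38)` is just under 0.18, which the table rounds to +0.2%.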
      
      aio-stress: roughly the same
      ============================
      aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
      aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
      
      					2.6.20      on-demand  delta
      			sequential	 92.57s      92.54s    -0.0%
      			random		311.87s     312.15s    +0.1%
      
      sysbench fileio: roughly the same
      =================================
      sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
      	 --file-total-size=4G --file-block-size=64K \
      	 --num-threads=001 --max-requests=10000 --max-time=900 run
      
      				threads    2.6.20   on-demand    delta
      		first run
      				      1   59.1974s    59.2262s  +0.0%
      				      2   58.0575s    58.2269s  +0.3%
      				      4   48.0545s    47.1164s  -2.0%
      				      8   41.0684s    41.2229s  +0.4%
      				     16   35.8817s    36.4448s  +1.6%
      				     32   32.6614s    32.8240s  +0.5%
      				     64   23.7601s    24.1481s  +1.6%
      				    128   24.3719s    23.8225s  -2.3%
      				    256   23.2366s    22.0488s  -5.1%
      
      		second run
      				      1   59.6720s    59.5671s  -0.2%
      				      8   41.5158s    41.9541s  +1.1%
      				     64   25.0200s    23.9634s  -4.2%
      				    256   22.5491s    20.9486s  -7.1%
      
      Note that the numbers are not very stable because of the writes.
      The overall performance is close when we sum all seconds up:
      
                      sum all up               495.046s    491.514s   -0.7%
      
      sysbench oltp (trans/sec): up to 8% gain
      ========================================
      sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
      	 --mysql-socket=/var/run/mysqld/mysqld.sock \
      	 --mysql-user=root --mysql-password=readahead \
      	 --num-threads=064 --max-requests=10000 --max-time=900 run
      
      	10000-transactions run
      				threads    2.6.20   on-demand    gain
      				      1     62.81       64.56   +2.8%
      				      2     67.97       70.93   +4.4%
      				      4     81.81       85.87   +5.0%
      				      8     94.60       97.89   +3.5%
      				     16     99.07      104.68   +5.7%
      				     32     95.93      104.28   +8.7%
      				     64     96.48      103.68   +7.5%
      	5000-transactions run
      				      1     48.21       48.65   +0.9%
      				      8     68.60       70.19   +2.3%
      				     64     70.57       74.72   +5.9%
      	2000-transactions run
      				      1     37.57       38.04   +1.3%
      				      2     38.43       38.99   +1.5%
      				      4     45.39       46.45   +2.3%
      				      8     51.64       52.36   +1.4%
      				     16     54.39       55.18   +1.5%
      				     32     52.13       54.49   +4.5%
      				     64     54.13       54.61   +0.9%
      
      These are interesting results.  Some investigation shows that
      	- MySQL is accessing the db file non-uniformly: some parts are
      	  hotter than others.
      	- It is mostly doing 4-page random reads, and sometimes doing two
      	  reads in a row; the latter one triggers a 16-page readahead.
      	- The on-demand readahead leaves many lookahead pages (flagged
      	  PG_readahead) there.  Many of them will be hit and trigger
      	  more readahead pages, which might save more seeks.
      	- Naturally, the readahead windows tend to lie in hot areas,
      	  and the lookahead pages in hot areas are more likely to be hit.
      	- The higher the overall read density, the larger the possible gain.
      
      That also explains the adaptive readahead tricks for clustered random reads.
      
      readahead thrashing: 3 times better
      ===================================
      We boot the kernel with "mem=128m single" and start one new 100KB/s stream
      every second, until reaching 200 streams.
      
      			      max throughput     min avg I/O size
      		2.6.20:            5MB/s               16KB
      		on-demand:        15MB/s              140KB
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: data structure and routines · 5ce1110b
      Fengguang Wu authored
      Extend struct file_ra_state to support the on-demand readahead logic.  Also
      define some helpers for it.
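As a rough illustration of what such state and helpers might look like, here is a user-space sketch. The struct layout and the `toy_ra_*` helper names are assumptions made for this example, not the fields or routines added by the patch; only the idea of tracking a readahead position and a lookahead position is taken from the series.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long pgoff_t;  /* page index within a file */

struct toy_ra_state {
	pgoff_t ra_start;         /* first page of the readahead window */
	pgoff_t lookahead_index;  /* reading this page triggers the next readahead */
	pgoff_t readahead_index;  /* one past the window: next readahead starts here */
};

/* Describe a window of ra_size pages whose last la_size pages are look-ahead. */
static void toy_ra_set(struct toy_ra_state *ra, pgoff_t start,
		       unsigned long ra_size, unsigned long la_size)
{
	assert(la_size <= ra_size);
	ra->ra_start = start;
	ra->readahead_index = start + ra_size;
	ra->lookahead_index = ra->readahead_index - la_size;
}

/* Does the window cover this page index? */
static bool toy_ra_has_index(const struct toy_ra_state *ra, pgoff_t index)
{
	return index >= ra->ra_start && index < ra->readahead_index;
}

/* Window size in pages. */
static unsigned long toy_ra_size(const struct toy_ra_state *ra)
{
	return ra->readahead_index - ra->ra_start;
}

/* Number of pages between the look-ahead mark and the window end. */
static unsigned long toy_ra_lookahead_size(const struct toy_ra_state *ra)
{
	return ra->readahead_index - ra->lookahead_index;
}
```

Storing indexes rather than sizes keeps the "where does the next readahead begin" question a plain field read, while the sizes are derived on demand.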
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: MIN_RA_PAGES/MAX_RA_PAGES macros · f615bfca
      Fengguang Wu authored
      Define two convenient macros for read-ahead:
      	- MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
      	- MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD
      
      Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_
      page sizes like 64k.
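The effect of the two rounding directions can be shown with a short sketch. The functions below take the page size as a parameter so both cases can be compared; the 128 KB and 16 KB values and the `toy_*` names are assumptions for illustration, not definitions quoted from the patch.

```c
#include <assert.h>

#define TOY_VM_MAX_READAHEAD_KB 128UL  /* assumed maximum, in KB */
#define TOY_VM_MIN_READAHEAD_KB 16UL   /* assumed minimum, in KB */

/* Maximum readahead in pages: integer division rounds down. */
static unsigned long toy_max_ra_pages(unsigned long page_size)
{
	assert(page_size > 0);
	return TOY_VM_MAX_READAHEAD_KB * 1024 / page_size;
}

/* Minimum readahead in pages: rounded up so it never becomes zero. */
static unsigned long toy_min_ra_pages(unsigned long page_size)
{
	assert(page_size > 0);
	return (TOY_VM_MIN_READAHEAD_KB * 1024 + page_size - 1) / page_size;
}
```

With 4 KB pages both roundings are exact (32 and 4 pages under the assumed values). With 64 KB pages a rounded-down minimum would be 0 pages, whereas the rounded-up version yields 1 page.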
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: add look-ahead support to __do_page_cache_readahead() · 46fc3e7b
      Fengguang Wu authored
      Add look-ahead support to __do_page_cache_readahead().
      
      It works by
      	- marking the Nth backwards page with PG_readahead
      	  (which instructs the page's first reader to invoke readahead),
      	- and only doing the marking for newly allocated pages
      	  (to prevent blindly doing readahead on already cached pages).
      
      Look-ahead is a technique to achieve I/O pipelining:
      
      While the application is working through a chunk of cached pages, the kernel
      reads ahead the next chunk of pages _before_ they are needed.  It effectively
      hides low-level I/O latencies from high-level applications.
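The marking rule can be sketched in user space as follows. This is an illustration only: `toy_page`, `toy_do_readahead()` and its parameters are hypothetical, and the real function works on the page cache rather than on a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool cached;          /* page is already in the toy page cache */
	bool readahead_mark;  /* stand-in for PG_readahead */
};

/*
 * Bring nr_to_read pages starting at offset into the toy cache and mark
 * the page that lies lookahead_size pages before the end of the batch.
 * Returns the number of newly allocated pages.
 */
static size_t toy_do_readahead(struct toy_page *pages, size_t npages,
			       size_t offset, size_t nr_to_read,
			       size_t lookahead_size)
{
	size_t allocated = 0;

	assert(lookahead_size <= nr_to_read);

	for (size_t i = 0; i < nr_to_read && offset + i < npages; i++) {
		struct toy_page *p = &pages[offset + i];

		if (p->cached)
			continue;  /* already cached: never marked */

		p->cached = true;
		allocated++;

		/* Only a newly allocated page may carry the mark. */
		if (i == nr_to_read - lookahead_size)
			p->readahead_mark = true;
	}
	return allocated;
}
```

Reading 8 pages with a look-ahead size of 2 marks the page at position 6 of the batch. If that page was already cached, no page is marked at all, which is the "do not blindly read ahead on cached pages" behavior described above.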
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: introduce PG_readahead · d77c2d7c
      Fengguang Wu authored
      Introduce a new page flag: PG_readahead.
      
      It acts as a look-ahead mark, which tells the page reader: Hey, it's time to
      invoke the read-ahead logic.  For the sake of I/O pipelining, don't wait until
      it runs out of cached pages!
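A page flag of this kind is a single bit in a flags word. The sketch below shows one plausible way to set it and to consume it so that only the first reader reacts; the bit number, the `toy_flag_page` type and the helper names are hypothetical, and the helpers are not atomic like real kernel page-flag operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PG_READAHEAD 0  /* hypothetical bit number */

struct toy_flag_page {
	unsigned long flags;
};

/* Set the look-ahead mark on a page. */
static void toy_set_readahead(struct toy_flag_page *p)
{
	assert(p != NULL);
	p->flags |= 1UL << TOY_PG_READAHEAD;
}

/* Report whether the mark was set, and clear it in the same step. */
static bool toy_test_clear_readahead(struct toy_flag_page *p)
{
	unsigned long mask = 1UL << TOY_PG_READAHEAD;
	bool was_set;

	assert(p != NULL);
	was_set = (p->flags & mask) != 0;
	p->flags &= ~mask;
	return was_set;
}
```

Clearing the bit when it is observed means a second reader of the same page does not start another readahead.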
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • AIO sparse fix (type of ki_flags) · 2ba2d003
      David Brownell authored
      Fix type issue reported by latest 'sparse': kiocb.ki_flags should be
      "unsigned long" (not "long"), to match bitop type signature.
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • eCryptfs: ecryptfs_setattr() bugfix · 64ee4808
      Michael Halcrow authored
      There is another bug recently introduced into the ecryptfs_setattr()
      function in 2.6.22.  eCryptfs will attempt to treat special files like
      regular eCryptfs files on chmod, chown, and so forth.  This leads to a NULL
      pointer dereference.  This patch validates that the file is a regular file
      before proceeding with operations related to the inode's crypt_stat.
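The shape of the fix, checking the file type before touching per-file crypto state, can be sketched in user space. Everything here is hypothetical (`toy_inode`, `toy_crypt_stat`, `toy_setattr()`, the file-type enum); a kernel implementation would typically test the inode mode with `S_ISREG()` instead.

```c
#include <assert.h>
#include <stddef.h>

enum toy_file_type { TOY_REGULAR, TOY_CHAR_DEVICE, TOY_FIFO };

struct toy_crypt_stat {
	int flags;
};

struct toy_inode {
	enum toy_file_type type;
	struct toy_crypt_stat *crypt_stat;  /* NULL for special files in this model */
};

/*
 * Apply an attribute change. Returns 1 if the crypto state was updated,
 * 0 if the inode is not a regular file and the crypto step was skipped.
 */
static int toy_setattr(struct toy_inode *inode, int new_flags)
{
	assert(inode != NULL);

	/* Validate the file type first: special files have no crypt_stat. */
	if (inode->type != TOY_REGULAR)
		return 0;

	assert(inode->crypt_stat != NULL);
	inode->crypt_stat->flags = new_flags;
	return 1;
}
```

Without the type check, calling `toy_setattr()` on the FIFO or device inode of this model would dereference a NULL `crypt_stat`, which mirrors the failure described above.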
      
      Thanks to Ryusuke Konishi for finding this bug and suggesting the fix.
      Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>