1. 05 Feb, 2008 40 commits
    • Lucas Woods's avatar
      arch/alpha: remove duplicate includes · e820ce72
      Lucas Woods authored
      Signed-off-by: default avatarLucas Woods <woodzy@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e820ce72
    • Adrian Bunk's avatar
      m68knommu: remove duplicate exports · 49eaf7d7
      Adrian Bunk authored
      One EXPORT_SYMBOL should be enough for everyone.
      Signed-off-by: default avatarAdrian Bunk <bunk@kernel.org>
      Acked-by: default avatarGreg Ungerer <gerg@uclinux.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49eaf7d7
    • Paul Mundt's avatar
      nommu: add new vmalloc_user() and remap_vmalloc_range() interfaces. · f905bc44
      Paul Mundt authored
      This builds on top of the earlier vmalloc_32_user() work introduced by
      b5073173, as we now have places in the nommu
      allmodconfig that hit up against these missing APIs.
      
      As vmalloc_32_user() is already implemented, this is moved over to
      vmalloc_user() and simply made a wrapper.  As all current nommu platforms are
      32-bit addressable, there's no special casing we have to do for ZONE_DMA and
      things of that nature as per GFP_VMALLOC32.
      
      remap_vmalloc_range() needs to check VM_USERMAP in order to figure out whether
      we permit the remap or not, which means that we also have to rework the
      vmalloc_user() code to grovel for the VMA and set the flag.
      Signed-off-by: default avatarPaul Mundt <lethal@linux-sh.org>
      Acked-by: default avatarDavid McCullough <david_mccullough@securecomputing.com>
      Acked-by: default avatarDavid Howells <dhowells@redhat.com>
      Acked-by: default avatarGreg Ungerer <gerg@snapgear.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f905bc44
    • Jiri Olsa's avatar
      m68knommu: removing config variable DUMPTOFLASH · f156ac8c
      Jiri Olsa authored
      Removing config variable DUMPTOFLASH, since it is not used
      Signed-off-by: default avatarJiri Olsa <olsajiri@gmail.com>
      Acked-by: default avatarGreg Ungerer <gerg@snapgear.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f156ac8c
    • Jiri Olsa's avatar
      m68knomu: remove dead config symbols from m68knomu code · c155f3f9
      Jiri Olsa authored
      remove dead config symbols from m68knommu code
      Signed-off-by: default avatarJiri Olsa <olsajiri@gmail.com>
      Acked-by: default avatarGreg Ungerer <gerg@snapgear.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c155f3f9
    • Greg Ungerer's avatar
      m68knommu: use ARRAY_SIZE in ColdFire serial driver · 16791963
      Greg Ungerer authored
      Use ARRAY_SIZE macroto get maximum ports in ColdFire serial driver.
      Signed-off-by: default avatarGreg Ungerer <gerg@uclinux.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      16791963
    • Pavel Emelyanov's avatar
      frv: use find_task_by_vpid in cxn_pin_by_pid · 540e3102
      Pavel Emelyanov authored
      The function is question gets the pid from sysctl table, so this one is a
      virtual pid, i.e.  the pid of a task as it is seen from inside a namespace.
      
      So the find_task_by_vpid() must be used here.
      Signed-off-by: default avatarPavel Emelyanov <xemul@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      540e3102
    • Jiri Olsa's avatar
      frv: remove dead config symbol from FRV code · 8c5900b2
      Jiri Olsa authored
      Remove dead config symbol from FRV code.
      Signed-off-by: default avatarJiri Olsa <olsajiri@gmail.com>
      Acked-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c5900b2
    • Robert P. J. Day's avatar
      FRV: move DMA macros to scatterlist.h for consistency. · 82b12e23
      Robert P. J. Day authored
      To be consistent with other architectures, these two DMA macros should
      be defined in scatterlist.h as opposed to dma-mapping.h
      Signed-off-by: default avatarRobert P. J. Day <rpjday@crashcourse.ca>
      Acked-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      82b12e23
    • David Howells's avatar
      FRV: permit the memory to be located elsewhere in NOMMU mode · 7038220a
      David Howells authored
      Permit the memory to be located somewhere other than address 0xC0000000 in
      NOMMU mode.  The configuration options are already present, it just
      requires wiring up in the linker script.
      
      Note that only a limited set of locations of runtime addresses are available
      because of the way the CPU protection registers work.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7038220a
    • Casey Schaufler's avatar
      Smack: Simplified Mandatory Access Control Kernel · e114e473
      Casey Schaufler authored
      Smack is the Simplified Mandatory Access Control Kernel.
      
      Smack implements mandatory access control (MAC) using labels
      attached to tasks and data containers, including files, SVIPC,
      and other tasks. Smack is a kernel based scheme that requires
      an absolute minimum of application support and a very small
      amount of configuration data.
      
      Smack uses extended attributes and
      provides a set of general mount options, borrowing technics used
      elsewhere. Smack uses netlabel for CIPSO labeling. Smack provides
      a pseudo-filesystem smackfs that is used for manipulation of
      system Smack attributes.
      
      The patch, patches for ls and sshd, a README, a startup script,
      and x86 binaries for ls and sshd are also available on
      
          http://www.schaufler-ca.com
      
      Development has been done using Fedora Core 7 in a virtual machine
      environment and on an old Sony laptop.
      
      Smack provides mandatory access controls based on the label attached
      to a task and the label attached to the object it is attempting to
      access. Smack labels are deliberately short (1-23 characters) text
      strings. Single character labels using special characters are reserved
      for system use. The only operation applied to Smack labels is equality
      comparison. No wildcards or expressions, regular or otherwise, are
      used. Smack labels are composed of printable characters and may not
      include "/".
      
      A file always gets the Smack label of the task that created it.
      
      Smack defines and uses these labels:
      
          "*" - pronounced "star"
          "_" - pronounced "floor"
          "^" - pronounced "hat"
          "?" - pronounced "huh"
      
      The access rules enforced by Smack are, in order:
      
      1. Any access requested by a task labeled "*" is denied.
      2. A read or execute access requested by a task labeled "^"
         is permitted.
      3. A read or execute access requested on an object labeled "_"
         is permitted.
      4. Any access requested on an object labeled "*" is permitted.
      5. Any access requested by a task on an object with the same
         label is permitted.
      6. Any access requested that is explicitly defined in the loaded
         rule set is permitted.
      7. Any other access is denied.
      
      Rules may be explicitly defined by writing subject,object,access
      triples to /smack/load.
      
      Smack rule sets can be easily defined that describe Bell&LaPadula
      sensitivity, Biba integrity, and a variety of interesting
      configurations. Smack rule sets can be modified on the fly to
      accommodate changes in the operating environment or even the time
      of day.
      
      Some practical use cases:
      
      Hierarchical levels. The less common of the two usual uses
      for MLS systems is to define hierarchical levels, often
      unclassified, confidential, secret, and so on. To set up smack
      to support this, these rules could be defined:
      
         C        Unclass rx
         S        C       rx
         S        Unclass rx
         TS       S       rx
         TS       C       rx
         TS       Unclass rx
      
      A TS process can read S, C, and Unclass data, but cannot write it.
      An S process can read C and Unclass. Note that specifying that
      TS can read S and S can read C does not imply TS can read C, it
      has to be explicitly stated.
      
      Non-hierarchical categories. This is the more common of the
      usual uses for an MLS system. Since the default rule is that a
      subject cannot access an object with a different label no
      access rules are required to implement compartmentalization.
      
      A case that the Bell & LaPadula policy does not allow is demonstrated
      with this Smack access rule:
      
      A case that Bell&LaPadula does not allow that Smack does:
      
          ESPN    ABC   r
          ABC     ESPN  r
      
      On my portable video device I have two applications, one that
      shows ABC programming and the other ESPN programming. ESPN wants
      to show me sport stories that show up as news, and ABC will
      only provide minimal information about a sports story if ESPN
      is covering it. Each side can look at the other's info, neither
      can change the other. Neither can see what FOX is up to, which
      is just as well all things considered.
      
      Another case that I especially like:
      
          SatData Guard   w
          Guard   Publish w
      
      A program running with the Guard label opens a UDP socket and
      accepts messages sent by a program running with a SatData label.
      The Guard program inspects the message to ensure it is wholesome
      and if it is sends it to a program running with the Publish label.
      This program then puts the information passed in an appropriate
      place. Note that the Guard program cannot write to a Publish
      file system object because file system semanitic require read as
      well as write.
      
      The four cases (categories, levels, mutual read, guardbox) here
      are all quite real, and problems I've been asked to solve over
      the years. The first two are easy to do with traditonal MLS systems
      while the last two you can't without invoking privilege, at least
      for a while.
      Signed-off-by: default avatarCasey Schaufler <casey@schaufler-ca.com>
      Cc: Joshua Brindle <method@manicmethod.com>
      Cc: Paul Moore <paul.moore@hp.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Ahmed S. Darwish" <darwish.07@gmail.com>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e114e473
    • Paul Moore's avatar
      NetLabel: introduce a new kernel configuration API for NetLabel · eda61d32
      Paul Moore authored
      Add a new set of configuration functions to the NetLabel/LSM API so that
      LSMs can perform their own configuration of the NetLabel subsystem without
      relying on assistance from userspace.
      Signed-off-by: default avatarPaul Moore <paul.moore@hp.com>
      Signed-off-by: default avatarCasey Schaufler <casey@schaufler-ca.com>
      Reviewed-by: default avatarJames Morris <jmorris@namei.org>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eda61d32
    • Serge E. Hallyn's avatar
      oom_kill: remove uid==0 checks · 97829955
      Serge E. Hallyn authored
      Root processes are considered more important when out of memory and killing
      proceses.  The check for CAP_SYS_ADMIN was augmented with a check for
      uid==0 or euid==0.
      
      There are several possible ways to look at this:
      
      	1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN
      	   alone.  However CAP_SYS_RESOURCE is the one that really
      	   means "give me extra resources" so allow for that as
      	   well.
      	2. Any privileged code should be protected, but uid is not
      	   an indication of privilege.  So we should check whether
      	   any capabilities are raised.
      	3. uid==0 makes processes on the host as well as in containers
      	   more important, so we should keep the existing checks.
      	4. uid==0 makes processes only on the host more important,
      	   even without any capabilities.  So we should be keeping
      	   the (uid==0||euid==0) check but only when
      	   userns==&init_user_ns.
      
      I'm following number 1 here.
      Signed-off-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Cc: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      97829955
    • Serge E. Hallyn's avatar
      capabilities: introduce per-process capability bounding set · 3b7391de
      Serge E. Hallyn authored
      The capability bounding set is a set beyond which capabilities cannot grow.
       Currently cap_bset is per-system.  It can be manipulated through sysctl,
      but only init can add capabilities.  Root can remove capabilities.  By
      default it includes all caps except CAP_SETPCAP.
      
      This patch makes the bounding set per-process when file capabilities are
      enabled.  It is inherited at fork from parent.  Noone can add elements,
      CAP_SETPCAP is required to remove them.
      
      One example use of this is to start a safer container.  For instance, until
      device namespaces or per-container device whitelists are introduced, it is
      best to take CAP_MKNOD away from a container.
      
      The bounding set will not affect pP and pE immediately.  It will only
      affect pP' and pE' after subsequent exec()s.  It also does not affect pI,
      and exec() does not constrain pI'.  So to really start a shell with no way
      of regain CAP_MKNOD, you would do
      
      	prctl(PR_CAPBSET_DROP, CAP_MKNOD);
      	cap_t cap = cap_get_proc();
      	cap_value_t caparray[1];
      	caparray[0] = CAP_MKNOD;
      	cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
      	cap_set_proc(cap);
      	cap_free(cap);
      
      The following test program will get and set the bounding
      set (but not pI).  For instance
      
      	./bset get
      		(lists capabilities in bset)
      	./bset drop cap_net_raw
      		(starts shell with new bset)
      		(use capset, setuid binary, or binary with
      		file capabilities to try to increase caps)
      
      ************************************************************
      cap_bound.c
      ************************************************************
       #include <sys/prctl.h>
       #include <linux/capability.h>
       #include <sys/types.h>
       #include <unistd.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
      
       #ifndef PR_CAPBSET_READ
       #define PR_CAPBSET_READ 23
       #endif
      
       #ifndef PR_CAPBSET_DROP
       #define PR_CAPBSET_DROP 24
       #endif
      
      int usage(char *me)
      {
      	printf("Usage: %s get\n", me);
      	printf("       %s drop <capability>\n", me);
      	return 1;
      }
      
       #define numcaps 32
      char *captable[numcaps] = {
      	"cap_chown",
      	"cap_dac_override",
      	"cap_dac_read_search",
      	"cap_fowner",
      	"cap_fsetid",
      	"cap_kill",
      	"cap_setgid",
      	"cap_setuid",
      	"cap_setpcap",
      	"cap_linux_immutable",
      	"cap_net_bind_service",
      	"cap_net_broadcast",
      	"cap_net_admin",
      	"cap_net_raw",
      	"cap_ipc_lock",
      	"cap_ipc_owner",
      	"cap_sys_module",
      	"cap_sys_rawio",
      	"cap_sys_chroot",
      	"cap_sys_ptrace",
      	"cap_sys_pacct",
      	"cap_sys_admin",
      	"cap_sys_boot",
      	"cap_sys_nice",
      	"cap_sys_resource",
      	"cap_sys_time",
      	"cap_sys_tty_config",
      	"cap_mknod",
      	"cap_lease",
      	"cap_audit_write",
      	"cap_audit_control",
      	"cap_setfcap"
      };
      
      int getbcap(void)
      {
      	int comma=0;
      	unsigned long i;
      	int ret;
      
      	printf("i know of %d capabilities\n", numcaps);
      	printf("capability bounding set:");
      	for (i=0; i<numcaps; i++) {
      		ret = prctl(PR_CAPBSET_READ, i);
      		if (ret < 0)
      			perror("prctl");
      		else if (ret==1)
      			printf("%s%s", (comma++) ? ", " : " ", captable[i]);
      	}
      	printf("\n");
      	return 0;
      }
      
      int capdrop(char *str)
      {
      	unsigned long i;
      
      	int found=0;
      	for (i=0; i<numcaps; i++) {
      		if (strcmp(captable[i], str) == 0) {
      			found=1;
      			break;
      		}
      	}
      	if (!found)
      		return 1;
      	if (prctl(PR_CAPBSET_DROP, i)) {
      		perror("prctl");
      		return 1;
      	}
      	return 0;
      }
      
      int main(int argc, char *argv[])
      {
      	if (argc<2)
      		return usage(argv[0]);
      	if (strcmp(argv[1], "get")==0)
      		return getbcap();
      	if (strcmp(argv[1], "drop")!=0 || argc<3)
      		return usage(argv[0]);
      	if (capdrop(argv[2])) {
      		printf("unknown capability\n");
      		return 1;
      	}
      	return execl("/bin/bash", "/bin/bash", NULL);
      }
      ************************************************************
      
      [serue@us.ibm.com: fix typo]
      Signed-off-by: default avatarSerge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: default avatarAndrew G. Morgan <morgan@kernel.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: James Morris <jmorris@namei.org>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>a
      Signed-off-by: default avatar"Serge E. Hallyn" <serue@us.ibm.com>
      Tested-by: default avatarJiri Slaby <jirislaby@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b7391de
    • Andrew Morgan's avatar
      Remove unnecessary include from include/linux/capability.h · 46c383cc
      Andrew Morgan authored
      KaiGai Kohei observed that this line in the linux header is not needed.
      Signed-off-by: default avatarAndrew G. Morgan <morgan@kernel.org>
      Cc: KaiGai Kohei <kaigai@kaigai.gr.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      46c383cc
    • Andrew Morgan's avatar
      Add 64-bit capability support to the kernel · e338d263
      Andrew Morgan authored
      The patch supports legacy (32-bit) capability userspace, and where possible
      translates 32-bit capabilities to/from userspace and the VFS to 64-bit
      kernel space capabilities.  If a capability set cannot be compressed into
      32-bits for consumption by user space, the system call fails, with -ERANGE.
      
      FWIW libcap-2.00 supports this change (and earlier capability formats)
      
       http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/
      
      [akpm@linux-foundation.org: coding-syle fixes]
      [akpm@linux-foundation.org: use get_task_comm()]
      [ezk@cs.sunysb.edu: build fix]
      [akpm@linux-foundation.org: do not initialise statics to 0 or NULL]
      [akpm@linux-foundation.org: unused var]
      [serue@us.ibm.com: export __cap_ symbols]
      Signed-off-by: default avatarAndrew G. Morgan <morgan@kernel.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: default avatarErez Zadok <ezk@cs.sunysb.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e338d263
    • Andrew Morton's avatar
      revert "capabilities: clean up file capability reading" · 8f6936f4
      Andrew Morton authored
      Revert b68680e4 to make way for the next
      patch: "Add 64-bit capability support to the kernel".
      
      We want to keep the vfs_cap_data.data[] structure, using two 'data's for
      64-bit caps (and later three for 96-bit caps), whereas
      b68680e4 had gotten rid of the 'data' struct
      made its members inline.
      
      The 64-bit caps patch keeps the stack abuse fix at get_file_caps(), which was
      the more important part of that patch.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f6936f4
    • David P. Quigley's avatar
      VFS: Reorder vfs_getxattr to avoid unnecessary calls to the LSM · 4bea5805
      David P. Quigley authored
      Originally vfs_getxattr would pull the security xattr variable using
      the inode getxattr handle and then proceed to clobber it with a subsequent call
      to the LSM.
      
      This patch reorders the two operations such that when the xattr requested is
      in the security namespace it first attempts to grab the value from the LSM
      directly.
      
      If it fails to obtain the value because there is no module present or the
      module does not support the operation it will fall back to using the inode
      getxattr operation.
      
      In the event that both are inaccessible it returns EOPNOTSUPP.
      Signed-off-by: default avatarDavid P. Quigley <dpquigl@tycho.nsa.gov>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Acked-by: default avatarJames Morris <jmorris@namei.org>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4bea5805
    • David P. Quigley's avatar
      VFS/Security: Rework inode_getsecurity and callers to return resulting buffer · 42492594
      David P. Quigley authored
      This patch modifies the interface to inode_getsecurity to have the function
      return a buffer containing the security blob and its length via parameters
      instead of relying on the calling function to give it an appropriately sized
      buffer.
      
      Security blobs obtained with this function should be freed using the
      release_secctx LSM hook.  This alleviates the problem of the caller having to
      guess a length and preallocate a buffer for this function allowing it to be
      used elsewhere for Labeled NFS.
      
      The patch also removed the unused err parameter.  The conversion is similar to
      the one performed by Al Viro for the security_getprocattr hook.
      Signed-off-by: default avatarDavid P. Quigley <dpquigl@tycho.nsa.gov>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Acked-by: default avatarJames Morris <jmorris@namei.org>
      Acked-by: default avatarSerge Hallyn <serue@us.ibm.com>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      42492594
    • Matt Mackall's avatar
      37291458
    • Matt Mackall's avatar
      slob: reduce external fragmentation by using three free lists · 20cecbae
      Matt Mackall authored
      By putting smaller objects on their own list, we greatly reduce overall
      external fragmentation and increase repeatability.  This reduces total SLOB
      overhead from > 50% to ~6% on a simple boot test.
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      20cecbae
    • Matt Mackall's avatar
      slob: fix free block merging at head of subpage · 679299b3
      Matt Mackall authored
      We weren't merging freed blocks at the beginning of the free list.  Fixing
      this showed a 2.5% efficiency improvement in a userspace test harness.
      Signed-off-by: default avatarMatt Mackall <mpm@selenic.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      679299b3
    • Fengguang Wu's avatar
      writeback: speed up writeback of big dirty files · 8bc3be27
      Fengguang Wu authored
      After making dirty a 100M file, the normal behavior is to start the
      writeback for all data after 30s delays.  But sometimes the following
      happens instead:
      
      	- after 30s:    ~4M
      	- after 5s:     ~4M
      	- after 5s:     all remaining 92M
      
      Some analyze shows that the internal io dispatch queues goes like this:
      
      		s_io            s_more_io
      		-------------------------
      	1)	100M,1K         0
      	2)	1K              96M
      	3)	0               96M
      1) initial state with a 100M file and a 1K file
      
      2) 4M written, nr_to_write <= 0, so write more
      
      3) 1K written, nr_to_write > 0, no more writes(BUG)
      
      nr_to_write > 0 in (3) fools the upper layer to think that data have all
      been written out.  The big dirty file is actually still sitting in
      s_more_io.  We cannot simply splice s_more_io back to s_io as soon as s_io
      becomes empty, and let the loop in generic_sync_sb_inodes() continue: this
      may starve newly expired inodes in s_dirty.  It is also not an option to
      draw inodes from both s_more_io and s_dirty, an let the loop go on: this
      might lead to live locks, and might also starve other superblocks in sync
      time(well kupdate may still starve some superblocks, that's another bug).
      
      We have to return when a full scan of s_io completes.  So nr_to_write > 0
      does not necessarily mean that "all data are written".  This patch
      introduces a flag writeback_control.more_io to indicate that more io should
      be done.  With it the big dirty file no longer has to wait for the next
      kupdate invokation 5s later.
      
      In sync_sb_inodes() we only set more_io on super_blocks we actually
      visited.  This avoids the interaction between two pdflush deamons.
      
      Also in __sync_single_inode() we don't blindly keep requeuing the io if the
      filesystem cannot progress.  Failing to do so may lead to 100% iowait.
      Tested-by: default avatarMike Snitzer <snitzer@gmail.com>
      Signed-off-by: default avatarFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8bc3be27
    • Sam Ravnborg's avatar
      mm: fix section mismatch warning in sparse.c · a322f8ab
      Sam Ravnborg authored
      Fix following warning:
      WARNING: mm/built-in.o(.text+0x22069): Section mismatch in reference from the function sparse_early_usemap_alloc() to the function .init.text:__alloc_bootmem_node()
      
      static sparse_early_usemap_alloc() were used only by sparse_init()
      and with sparse_init() annotated _init it is safe to
      annotate sparse_early_usemap_alloc with __init too.
      Signed-off-by: default avatarSam Ravnborg <sam@ravnborg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a322f8ab
    • Nick Piggin's avatar
      mm: fix PageUptodate data race · 0ed361de
      Nick Piggin authored
      After running SetPageUptodate, preceeding stores to the page contents to
      actually bring it uptodate may not be ordered with the store to set the
      page uptodate.
      
      Therefore, another CPU which checks PageUptodate is true, then reads the
      page contents can get stale data.
      
      Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
      PageUptodate.
      
      Many places that test PageUptodate, do so with the page locked, and this
      would be enough to ensure memory ordering in those places if
      SetPageUptodate were only called while the page is locked.  Unfortunately
      that is not always the case for some filesystems, but it could be an idea
      for the future.
      
      Also bring the handling of anonymous page uptodateness in line with that of
      file backed page management, by marking anon pages as uptodate when they
      _are_ uptodate, rather than when our implementation requires that they be
      marked as such.  Doing allows us to get rid of the smp_wmb's in the page
      copying functions, which were especially added for anonymous pages for an
      analogous memory ordering problem.  Both file and anonymous pages are
      handled with the same barriers.
      
      FAQ:
      Q. Why not do this in flush_dcache_page?
      A. Firstly, flush_dcache_page handles only one side (the smb side) of the
      ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
      memory barriers in a completely unrelated function is nasty; at least in the
      PageUptodate macros, they are located together with (half) the operations
      involved in the ordering. Thirdly, the smp_wmb is only required when first
      bringing the page uptodate, wheras flush_dcache_page should be called each time
      it is written to through the kernel mapping. It is logically the wrong place to
      put it.
      
      Q. Why does this increase my text size / reduce my performance / etc.
      A. Because it is adding the necessary instructions to eliminate the data-race.
      
      Q. Can it be improved?
      A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
      run under the page lock, we could avoid the smp_rmb places where PageUptodate
      is queried under the page lock. Requires audit of all filesystems and at least
      some would need reworking. That's great you're interested, I'm eagerly awaiting
      your patches.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0ed361de
    • Shaohua Li's avatar
      page migraton: handle orphaned pages · 62e1c553
      Shaohua Li authored
      Orphaned page might have fs-private metadata, the page is truncated.  As
      the page hasn't mapping, page migration refuse to migrate the page.  It
      appears the page is only freed in page reclaim and if zone watermark is
      low, the page is never freed, as a result migration always fail.  I thought
      we could free the metadata so such page can be freed in migration and make
      migration more reliable.
      
      [akpm@linux-foundation.org: go direct to try_to_free_buffers()]
      Signed-off-by: default avatarShaohua Li <shaohua.li@intel.com>
      Acked-by: default avatarNick Piggin <npiggin@suse.de>
      Acked-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62e1c553
    • Yasunori Goto's avatar
      Document lowmem_reserve_ratio · 7786fa9a
      Yasunori Goto authored
      Though the lower_zone_protection was changed to lowmem_reserve_ratio, the
      document has been not changed.  The lowmem_reserve_ratio seems quite hard
      to estimate, but there is no guidance.  This patch is to change document
      for it.
      Signed-off-by: default avatarYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Andrea Arcangeli <andrea@cpushare.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7786fa9a
    • Masatake YAMATO's avatar
      check ADVICE of fadvise64_64 even if get_xip_page is given · b5beb1ca
      Masatake YAMATO authored
      I've written some test programs in ltp project.  During writing I met an
      problem which I cannot solve in user land.  So I wrote a patch for linux
      kernel.  Please, include this patch if acceptable.
      
      The test program tests the 4th parameter of fadvise64_64:
      
          long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);
      
      My test case calls fadvise64_64 with invalid advice value and checks errno is
      set to EINVAL.  About the advice parameter man page says:
      
          ...
          Permissible values for advice include:
      
      	   POSIX_FADV_NORMAL
                        ...
      	   POSIX_FADV_SEQUENTIAL
                        ...
      	   POSIX_FADV_RANDOM
      		  ...
      	   POSIX_FADV_NOREUSE
                        ...
      	   POSIX_FADV_WILLNEED
                        ...
      	   POSIX_FADV_DONTNEED
      		  ...
          ERRORS
                 ...
      	   EINVAL An invalid value was specified for advice.
      
      However, I got a bug report that the system call invocations
      in my test case returned 0 unexpectedly.
      
      I've inspected the kernel code:
      
          asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
          {
      	    struct file *file = fget(fd);
      	    struct address_space *mapping;
      	    struct backing_dev_info *bdi;
      	    loff_t endbyte;			/* inclusive */
      	    pgoff_t start_index;
      	    pgoff_t end_index;
      	    unsigned long nrpages;
      	    int ret = 0;
      
      	    if (!file)
      		    return -EBADF;
      
      	    if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
      		    ret = -ESPIPE;
      		    goto out;
      	    }
      
      	    mapping = file->f_mapping;
      	    if (!mapping || len < 0) {
      		    ret = -EINVAL;
      		    goto out;
      	    }
      
      	    if (mapping->a_ops->get_xip_page)
      		    /* no bad return value, but ignore advice */
      		    goto out;
          ...
          out:
      	    fput(file);
      	    return ret;
          }
      
      I found the advice parameter is just ignored in the case
      mapping->a_ops->get_xip_page is given. This behavior is different from
      what is written on the man page. Is this o.k.?
      
      get_xip_page is given if CONFIG_EXT2_FS_XIP is true.
      Anyway I cannot find the easy way to detect get_xip_page
      field is given or CONFIG_EXT2_FS_XIP is true from the
      user space.
      
      I propose the following patch which checks the advice parameter
      even if get_xip_page is given.
      Signed-off-by: default avatarMasatake YAMATO <yamato@redhat.com>
      Acked-by: default avatarCarsten Otte <cotte@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b5beb1ca
    • Larry Woodman's avatar
      Include count of pagecache pages in show_mem() output · e6f3602d
      Larry Woodman authored
      The show_mem() output does not include the total number of pagecache
      pages.  This would be helpful when analyzing the debug information in
      the /var/log/messages file after OOM kills occur.
      
      This patch includes the total pagecache pages in that output.
      Signed-off-by: default avatarLarry Woodman <lwoodman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e6f3602d
    • Bjorn Steinbrink's avatar
      Fix dirty page accounting leak with ext3 data=journal · a2b34564
      Bjorn Steinbrink authored
      In 46d2277c ("Clean up and make
      try_to_free_buffers() not race with dirty pages"), try_to_free_buffers
      was changed to bail out if the page was dirty.
      
      That in turn caused truncate_complete_page to leak massive amounts of
      memory, because the dirty bit was only cleared after the call to
      try_to_free_buffers.
      
      So the call to cancel_dirty_page was moved up to have the dirty bit
      cleared early in 3e67c098 ("truncate:
      clear page dirtiness before running try_to_free_buffers()").
      
      The problem with that fix is, that the page can be redirtied after
      cancel_dirty_page was called, eg. like this:
      
      truncate_complete_page()
        cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
        do_invalidatepage()
          ext3_invalidatepage()
            journal_invalidatepage()
              journal_unmap_buffer()
                __dispose_buffer()
                  __journal_unfile_buffer()
                    __journal_temp_unlink_buffer()
                      mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
      
      And then we end up with dirty pages being wrongly accounted.
      
      As a result, in ecdfc978 ("Resurrect
      'try_to_free_buffers()' VM hackery") the changes to try_to_free_buffers
      were reverted, so the original reason for the massive memory leak is
      gone, and we can also revert the move of the call to cancel_dirty_page
      from truncate_complete_page and get the accounting right again.
      
      I'm not sure if it matters, but opposed to the final check in
      __remove_from_page_cache, this one also cares about the task io
      accounting, so maybe we want to use this instead, although it's not
      quite the clean fix either.
      Signed-off-by: default avatarBjörn Steinbrink <B.Steinbrink@gmx.de>
      Tested-by: default avatarKrzysztof Piotr Oledzki <ole@ans.pl>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Osterried <osterried@jesse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a2b34564
    • Qi Yong's avatar
      set_page_refcounted() VM_BUG_ON fix · ae1276b9
      Qi Yong authored
      The current PageTail semantic is that a PageTail page is first a
      PageCompound page.  So remove the redundant PageCompound test in
      set_page_refcounted().
      Signed-off-by: default avatarQi Yong <qiyong@fc-cn.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae1276b9
    • Harvey Harrison's avatar
      mm: remove fastcall from mm/ · 920c7a5d
      Harvey Harrison authored
      fastcall is always defined to be empty, remove it
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      920c7a5d
    • Andi Kleen's avatar
    • Qi Yong's avatar
      skip writing data pages when inode is under I_SYNC · 2d544564
      Qi Yong authored
      Since I_SYNC was split out from I_LOCK, the concern in commit
      4b89eed9 ("Write back inode data pages
      even when the inode itself is locked") is not longer valid.
      
      We should revert to the original behavior: in __writeback_single_inode(),
      when we find an I_SYNC-ed inode and we're not doing a data-integrity sync,
      skip writing entirely.  Otherwise, we are double calling do_writepages()
      Signed-off-by: default avatarQi Yong <qiyong@fc-cn.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Joern Engel <joern@wohnheim.fh-wedel.de>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d544564
    • Hugh Dickins's avatar
      mm: don't waste swap on locked pages · 5a9bbdcd
      Hugh Dickins authored
      try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
      migrating), and recycles it back to the active list.  But if it's an
      anonymous page, we've already allocated swap to it: just wasting swap.
      Spot locked pages in page_referenced_one and treat them as referenced.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Tested-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ethan Solomita <solo@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5a9bbdcd
    • Christoph Lameter's avatar
      vmstat: remove prefetch · 9eccf2a8
      Christoph Lameter authored
      Remove the prefetch logic in order to avoid touching impossible per cpu
      areas.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Mike Travis <travis@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9eccf2a8
    • Andrea Arcangeli's avatar
      Fix /proc dcache deadlock in do_exit · 7766755a
      Andrea Arcangeli authored
      This patch fixes a sles9 system hang in start_this_handle from a customer
      with some heavy workload where all tasks are waiting on kjournald to commit
      the transaction, but kjournald waits on t_updates to go down to zero (it
      never does).
      
      This was reported as a lowmem shortage deadlock but when checking the debug
      data I noticed the VM wasn't under pressure at all (well it was really
      under vm pressure, because lots of tasks hanged in the VM prune_dcache
      methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS
      mode, the holder of the journal handle should have if this was a vm issue
      in the first place).
      
      No task was apparently holding the leftover handle in the committing
      transaction, so I deduced t_updates was stuck to 1 because a journal_stop
      was never run by some path (this turned out to be correct).  With a debug
      patch adding proper reverse links and stack trace logging in ext3 deployed
      in production, I found journal_stop is never run because
      mark_inode_dirty_sync is called inside release_task called by do_exit.
      (that was quite fun because I would have never thought about this
      subtleness, I thought a regular path in ext3 had a bug and it forgot to
      call journal_stop)
      
      do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
      come back to run journal_stop)
      
      The reason is that shrink_dcache_parent is racy by design (feature not
      a bug) and it can do blocking I/O in some case, but the point is that
      calling shrink_dcache_parent at the last stage of do_exit isn't safe
      for self-reaping tasks.
      
      I guess the memory pressure of the unbalanced highmem system allowed
      to trigger this more easily.
      
      Now mainline doesn't have this line in iput (like sles9 has):
      
          	     if (inode->i_state & I_DIRTY_DELAYED)
      	     			mark_inode_dirty_sync(inode);
      
      so it will probably not crash with ext3, but for example ext2 implements an
      I/O-blocking ext2_put_inode that will lead to similar screwups with
      ext2_free_blocks never coming back and it's definitely wrong to call
      blocking-IO paths inside do_exit.  So this should fix a subtle bug in
      mainline too (not verified in practice though).  The equivalent fix for
      ext3 is also not verified yet to fix the problem in sles9 but I don't have
      doubt it will (it usually takes days to crash, so it'll take weeks to be
      sure).
      
      An alternate fix would be to offload that work to a kernel thread, but I
      don't think a reschedule for this is worth it, the vm should be able to
      collect those entries for the synchronous release_task.
      Signed-off-by: default avatarAndrea Arcangeli <andrea@suse.de>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7766755a
    • Bron Gondwana's avatar
      mm/page-writeback: highmem_is_dirtyable option · 195cf453
      Bron Gondwana authored
      Add vm.highmem_is_dirtyable toggle
      
      A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
      approximately 2Gb size which contains a hash format that is written
      randomly by the dbclean process.  On 2.6.16 this process took a few
      minutes.  With lowmem only accounting of dirty ratios, this takes about 12
      hours of 100% disk IO, all random writes.
      
      Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
      add the highmem back to the total available memory count.
      
      [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
      Signed-off-by: default avatarBron Gondwana <brong@fastmail.fm>
      Cc: Ethan Solomita <solo@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      195cf453
    • Christoph Lameter's avatar
      Page allocator: get rid of the list of cold pages · 3dfa5721
      Christoph Lameter authored
      We have repeatedly discussed if the cold pages still have a point. There is
      one way to join the two lists: Use a single list and put the cold pages at the
      end and the hot pages at the beginning. That way a single list can serve for
      both types of allocations.
      
      The discussion of the RFC for this and Mel's measurements indicate that
      there may not be too much of a point left to having separate lists for
      hot and cold pages (see http://marc.info/?t=119492914200001&r=1&w=2).
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Martin Bligh <mbligh@mbligh.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3dfa5721
    • Robert Bragg's avatar
      mm: don't allow ioremapping of ranges larger than vmalloc space · 5dc33185
      Robert Bragg authored
      When running with a 16M IOREMAP_MAX_ORDER (on armv7) we found that the
      vmlist search routine in __get_vm_area_node can mistakenly allow a driver
      to ioremap a range larger than vmalloc space.
      
      If at the time of the ioremap all existing vmlist areas sit below the
      determined alignment then the search routine continues past all entries and
      exits the for loop - straight into the found: label - without ever testing
      for integer wrapping or that the requested size fits.
      
      We were seeing a driver successfully ioremap 128M of flash even though
      there was only 120M of vmalloc space.  From that point the system was left
      with the remainder of the first 16M of space to vmalloc/ioremap within.
      Signed-off-by: default avatarRobert Bragg <robert@sixbynine.org>
      Acked-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5dc33185