1. 18 Jul, 2007 24 commits
    • xen: Core Xen implementation · 5ead97c8
      Jeremy Fitzhardinge authored
      This patch is a rollup of all the core pieces of the Xen
      implementation, including:
       - booting and setup
       - pagetable setup
       - privileged instructions
       - segmentation
       - interrupt flags
       - upcalls
       - multicall batching
      
      BOOTING AND SETUP
      
      The vmlinux image is decorated with ELF notes which tell the Xen
      domain builder what the kernel's requirements are; the domain builder
      then constructs the address space accordingly and starts the kernel.
      
      Xen has its own entrypoint for the kernel (contained in an ELF note).
      The ELF notes are set up by xen-head.S, which is included into head.S.
      In principle it could be linked separately, but it seems to provoke
      lots of binutils bugs.
      
      Because the domain builder starts the kernel in a fairly sane state
      (32-bit protected mode, paging enabled, flat segments set up), there's
      not a lot of setup needed before starting the kernel proper.  The main
      steps are:
        1. Install the Xen paravirt_ops, which is simply a matter of a
           structure assignment.
        2. Set init_mm to use the Xen-supplied pagetables (analogous to the
           head.S generated pagetables in a native boot).
        3. Reserve address space for Xen, since it takes a chunk at the top
           of the address space for its own use.
        4. Call start_kernel()
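
      Step 1 above (installing the Xen operations by structure assignment)
      can be pictured with a minimal user-space sketch.  The structure, its
      fields and every function name below are hypothetical stand-ins, not
      the kernel's actual paravirt_ops definition.

```c
/* Hypothetical, heavily reduced stand-in for the kernel's paravirt_ops. */
struct pv_ops_sketch {
    const char *name;
    void (*irq_disable)(void);
    void (*irq_enable)(void);
};

static int events_masked;               /* stand-in for Xen's event mask */

static void native_irq_disable(void) { /* native code would execute cli */ }
static void native_irq_enable(void)  { /* native code would execute sti */ }
static void xen_irq_disable(void)    { events_masked = 1; }
static void xen_irq_enable(void)     { events_masked = 0; }

/* The kernel starts out with the native operations... */
static struct pv_ops_sketch pv_ops = {
    "bare hardware", native_irq_disable, native_irq_enable
};

static const struct pv_ops_sketch xen_pv_ops = {
    "Xen", xen_irq_disable, xen_irq_enable
};

/* ...and the Xen entry path replaces them with one structure assignment. */
static void xen_install_ops(void)
{
    pv_ops = xen_pv_ops;
}
```

      After xen_install_ops() runs, every caller that goes through pv_ops
      reaches the Xen variants without knowing which backend is active.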
      
      PAGETABLE SETUP
      
      Once we hit the main kernel boot sequence, it will end up calling back
      via paravirt_ops to set up various pieces of Xen specific state.  One
      of the critical things which requires a bit of extra care is the
      construction of the initial init_mm pagetable.  Because Xen places
      tight constraints on pagetables (an active pagetable must always be
      valid, and must always be mapped read-only to the guest domain), we
      need to be careful when constructing the new pagetable to keep these
      constraints in mind.  It turns out that the easiest way to do this is
      to use the initial Xen-provided pagetable as a template, and then just
      insert new mappings for memory where a mapping doesn't already exist.
      
      This means that during pagetable setup, the Xen code uses a special
      version of xen_set_pte which ignores any attempt to remap a read-only page as
      read-write (since Xen will map its own initial pagetable as RO), but
      lets other changes to the ptes happen, so that things like NX are set
      properly.
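
      The rule described above can be sketched as follows.  This is an
      illustration only: the type, the function name and the bit
      definitions are assumptions chosen for the example, not the code
      from the patch.

```c
#include <stdint.h>

#define PTE_PRESENT (1ull << 0)
#define PTE_RW      (1ull << 1)
#define PTE_NX      (1ull << 63)

typedef uint64_t pte_sketch_t;

/*
 * Early-boot variant of set_pte: if the existing entry is present and
 * read-only, the new value is not allowed to become writable.  All other
 * bits of the new value (NX, for example) are applied as requested.
 */
static void set_pte_init_sketch(pte_sketch_t *ptep, pte_sketch_t newval)
{
    pte_sketch_t old = *ptep;

    if ((old & PTE_PRESENT) && !(old & PTE_RW))
        newval &= ~PTE_RW;      /* keep the read-only mapping read-only */

    *ptep = newval;
}
```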
      
      PRIVILEGED INSTRUCTIONS AND SEGMENTATION
      
      When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
      This means that it is more privileged than user-mode in ring 3, but it
      still can't run privileged instructions directly.  Instructions that
      are not performance-critical are handled by taking a privilege
      exception, trapping into the hypervisor, and emulating the instruction
      there, but performance-critical instructions have their own specific
      paravirt_ops.  In many cases we can avoid having to do any hypercalls
      for these instructions, or the Xen implementation is quite different
      from the normal native version.
      
      The privileged instructions fall into the broad classes of:
        Segmentation: setting up the GDT and the GDT entries, LDT,
           TLS and so on.  Xen doesn't allow the GDT to be directly
           modified; all GDT updates are done via hypercalls where the new
           entries can be validated.  This is important because Xen uses
           segment limits to prevent the guest kernel from damaging the
           hypervisor itself.
        Traps and exceptions: Xen uses a special format for trap entrypoints,
           so when the kernel wants to set an IDT entry, it needs to be
           converted to the form Xen expects.  Xen sets int 0x80 up specially
           so that the trap goes straight from userspace into the guest kernel
           without going via the hypervisor.  sysenter isn't supported.
        Kernel stack: The esp0 entry is extracted from the tss and provided to
           Xen.
        TLB operations: the various TLB calls are mapped into corresponding
           Xen hypercalls.
        Control registers: all the control registers are privileged.  The most
           important is cr3, which points to the base of the current pagetable,
           and we handle it specially.
      
      Another instruction we treat specially is CPUID, even though it's not
      privileged.  We want to control what CPU features are visible to the
      rest of the kernel, and so CPUID ends up going into a paravirt_op.
      Xen implements this mainly to disable the ACPI and APIC subsystems.
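
      As a rough illustration of filtering CPUID results, the sketch below
      clears the APIC and ACPI feature bits from the EDX output of leaf 1.
      The function name is hypothetical; the bit positions are the
      architectural ones for CPUID leaf 1, and whether the patch masks
      exactly these bits in this way is an assumption.

```c
#include <stdint.h>

#define FEATURE_APIC_BIT  9   /* CPUID leaf 1, EDX bit 9: on-chip APIC */
#define FEATURE_ACPI_BIT 22   /* CPUID leaf 1, EDX bit 22: ACPI thermal/clock control */

/* Hide the APIC and ACPI feature bits from the rest of the kernel. */
static uint32_t filter_cpuid_edx(uint32_t leaf, uint32_t edx)
{
    if (leaf == 1)
        edx &= ~((1u << FEATURE_APIC_BIT) | (1u << FEATURE_ACPI_BIT));
    return edx;
}
```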
      
      INTERRUPT FLAGS
      
      Xen maintains its own separate flag for masking events, which is
      contained within the per-cpu vcpu_info structure.  Because the guest
      kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
      ignored (and must be, because even if a guest domain disables
      interrupts for itself, it can't disable them overall).
      
      (A note on terminology: "events" and interrupts are effectively
      synonymous.  However, rather than using an "enable flag", Xen uses a
      "mask flag", which blocks event delivery when it is non-zero.)
      
      There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
      are implemented to manage the Xen event mask state.  The only thing
      worth noting is that when events are unmasked, we need to explicitly
      see if there's a pending event and call into the hypervisor to make
      sure it gets delivered.
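
      A user-space sketch of the mask-flag handling described above.  The
      structure is a reduced stand-in for vcpu_info, and the function names
      and the simulated hypercall are hypothetical.  Real code would also
      need a compiler barrier between clearing the mask and testing for a
      pending event.

```c
/* Reduced stand-in for Xen's per-cpu vcpu_info: only the two flags used here. */
struct vcpu_info_sketch {
    unsigned char evtchn_upcall_pending;  /* an event is waiting for delivery */
    unsigned char evtchn_upcall_mask;     /* non-zero blocks event delivery */
};

static struct vcpu_info_sketch vcpu;
static int forced_callbacks;              /* counts simulated hypercalls */

/* Hypothetical stand-in for the hypercall that makes Xen deliver the event. */
static void force_callback_sketch(void)
{
    forced_callbacks++;
    vcpu.evtchn_upcall_pending = 0;
}

/* "cli": mask events. */
static void irq_disable_sketch(void)
{
    vcpu.evtchn_upcall_mask = 1;
}

/* "sti": unmask events, then deliver anything that arrived while masked. */
static void irq_enable_sketch(void)
{
    vcpu.evtchn_upcall_mask = 0;
    if (vcpu.evtchn_upcall_pending)
        force_callback_sketch();
}

/* "save_fl": report the current mask state. */
static unsigned long save_fl_sketch(void)
{
    return vcpu.evtchn_upcall_mask;
}

/* "restore_fl": return to a previously saved mask state. */
static void restore_fl_sketch(unsigned long flags)
{
    if (flags)
        irq_disable_sketch();
    else
        irq_enable_sketch();
}
```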
      
      UPCALLS
      
      Xen needs a couple of upcall (or callback) functions to be implemented
      by each guest.  One is the event upcall, which is how events
      (interrupts, effectively) are delivered to the guest.  The other is
      the failsafe callback, which is used to report errors in either
      reloading a segment register, or caused by iret.  These are
      implemented in i386/kernel/entry.S so they can jump into the normal
      iret_exc path when necessary.
      
      MULTICALL BATCHING
      
      Xen provides a multicall mechanism, which allows multiple hypercalls
      to be issued at once in order to mitigate the cost of trapping into
      the hypervisor.  This is particularly useful for context switches,
      since the 4-5 hypercalls they would normally need (reload cr3, update
      TLS, maybe update LDT) can be reduced to one.  This patch implements a
      generic batching mechanism for hypercalls, which gets used in many
      places in the Xen code.
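
      The batching idea can be sketched in user space as below.  The entry
      layout, the batch size and all names are hypothetical, and the "trap"
      is simulated with a counter; Xen's real multicall entries and the
      patch's batching API differ.

```c
#include <stddef.h>

#define MC_BATCH 8

/* Hypothetical queued hypercall: an operation number plus two arguments. */
struct mc_entry {
    unsigned long op;
    unsigned long args[2];
};

static struct mc_entry mc_queue[MC_BATCH];
static size_t mc_count;          /* entries currently queued */
static unsigned int traps;       /* simulated entries into the hypervisor */
static size_t calls_done;        /* hypercalls executed so far */

/* Issue everything queued so far with a single (simulated) trap. */
static void mc_flush(void)
{
    if (mc_count == 0)
        return;
    traps++;
    calls_done += mc_count;
    mc_count = 0;
}

/* Queue one hypercall, flushing first if the batch is already full. */
static void mc_add(unsigned long op, unsigned long a0, unsigned long a1)
{
    if (mc_count == MC_BATCH)
        mc_flush();
    mc_queue[mc_count].op = op;
    mc_queue[mc_count].args[0] = a0;
    mc_queue[mc_count].args[1] = a1;
    mc_count++;
}
```

      A context switch would queue its cr3, TLS and LDT updates with
      mc_add() and call mc_flush() once, paying for one trap instead of
      several.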
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Ian Pratt <ian.pratt@xensource.com>
      Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Cc: Adrian Bunk <bunk@stusta.de>
    • xen: Add Xen interface header files · a42089dd
      Jeremy Fitzhardinge authored
      Add Xen interface header files. These are taken fairly directly from
      the Xen tree, but somewhat rearranged to suit the kernel's conventions.
      
      Define macros and inline functions for doing hypercalls into the
      hypervisor.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    • Add nosegneg capability to the vsyscall page notes · 24037a8b
      Jeremy Fitzhardinge authored
      Add the "nosegneg" fake capability to the vsyscall page notes. This is
      used by the runtime linker to select a glibc version which then
      disables negative-offset accesses to the thread-local segment via
      %gs. These accesses require emulation in Xen (because segments are
      truncated to protect the hypervisor address space) and avoiding them
      provides a measurable performance boost.
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: Zachary Amsden <zach@vmware.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
    • Add a sched_clock paravirt_op · 688340ea
      Jeremy Fitzhardinge authored
      The tsc-based get_scheduled_cycles interface is not a good match for
      Xen's runstate accounting, which reports everything in nanoseconds.
      
      This patch replaces this interface with a sched_clock interface, which
      matches both Xen and VMI's requirements.
      
      In order to do this, we:
         1. replace get_scheduled_cycles with sched_clock
         2. hoist cycles_2_ns into a common header
         3. update vmi accordingly
      
      One thing to note: because sched_clock is implemented as a weak
      function in kernel/sched.c, we must define a real function in order to
      override this weak binding.  This means the usual paravirt_ops
      technique of using an inline function won't work in this case.
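
      For readers unfamiliar with the cycles-to-nanoseconds conversion
      mentioned in step 2, here is a user-space sketch of a fixed-point
      conversion of that kind.  The scale factor of 10 bits and the names
      are assumptions made for the example and may not match the hoisted
      header exactly.

```c
#include <stdint.h>

#define CYC2NS_SCALE_FACTOR 10   /* fixed-point shift: 2^10 */

static uint64_t cyc2ns_scale;

/* Precompute nanoseconds-per-cycle, scaled by 2^10, from the CPU clock in kHz. */
static void set_cyc2ns_scale_sketch(uint64_t cpu_khz)
{
    cyc2ns_scale = (1000000ull << CYC2NS_SCALE_FACTOR) / cpu_khz;
}

/* Convert a cycle count to nanoseconds using only a multiply and a shift. */
static uint64_t cycles_2_ns_sketch(uint64_t cycles)
{
    return (cycles * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
}
```

      With a 2 GHz clock (cpu_khz = 2000000) the scale is 512, so 2000
      cycles convert to 1000 ns.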
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Zachary Amsden <zach@vmware.com>
      Cc: Dan Hecht <dhecht@vmware.com>
      Cc: john stultz <johnstul@us.ibm.com>
    • paravirt: helper to disable all IO space · d572929c
      Jeremy Fitzhardinge authored
      In a virtual environment, device drivers such as legacy IDE will waste
      quite a lot of time probing for their devices which will never appear.
      This helper function allows a paravirt implementation to lay claim to
      the whole iomem and ioport space, thereby disabling all device drivers
      trying to claim IO resources.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
    • Allocate and free vmalloc areas · 5f4352fb
      Jeremy Fitzhardinge authored
      Allocate/release a chunk of vmalloc address space:
       alloc_vm_area reserves a chunk of address space, and makes sure all
       the pagetables are constructed for that address range - but no pages.
      
       free_vm_area releases the address space range.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
      Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: "Jan Beulich" <JBeulich@novell.com>
      Cc: "Andi Kleen" <ak@muc.de>
    • paravirt: export __supported_pte_mask · bdef40a6
      Jeremy Fitzhardinge authored
      __supported_pte_mask is needed when constructing pte values.  Xen
      device drivers need to do this to make mappings of foreign pages (i.e.,
      pages granted to us by other domains).
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
    • paravirt: make siblingmap functions visible · c70df743
      Jeremy Fitzhardinge authored
      Paravirt implementations need to set the sibling map on new cpus.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
    • paravirt: unstatic smp_store_cpu_info · 724faa89
      Jeremy Fitzhardinge authored
      Paravirt implementations need to store cpu info when bringing up cpus.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
    • paravirt: unstatic leave_mm · 53787013
      Jeremy Fitzhardinge authored
      Make leave_mm globally visible, specifically so that Xen can use it to
      shoot down lazy uses of cr3.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    • paravirt: increase IRQ limit · 03f0c2f9
      Jeremy Fitzhardinge authored
      When running with CONFIG_PARAVIRT, we may want lots of IRQs even if
      there's no IO APIC.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    • paravirt: add a hook for once the allocator is ready · 6996d3b6
      Jeremy Fitzhardinge authored
      Add a hook so that the paravirt backend knows when the allocator is
      ready.  This is useful for the obvious reason that the allocator is
      available, but the other side-effect of having the bootmem allocator
      available is that each page now has an associated "struct page".
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
    • paravirt: add an "mm" argument to alloc_pt · fdb4c338
      Jeremy Fitzhardinge authored
      It's useful to know which mm is allocating a pagetable.  Xen uses this
      to determine whether the pagetable being added to is pinned or not.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
    • use elfnote.h to generate vsyscall notes. · 810bab44
      Jeremy Fitzhardinge authored
      Use existing elfnote.h to generate vsyscall notes, rather than doing
      it locally.  Changes elfnote.h a bit to suit, since this is the first
      asm user, and it wasn't quite right.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.com>
    • usermodehelper: Tidy up waiting · 86313c48
      Jeremy Fitzhardinge authored
      Rather than using a tri-state integer for the wait flag in
      call_usermodehelper_exec, define a proper enum, and use that.  I've
      preserved the integer values so that any callers I've missed should
      still work OK.
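
      A sketch of what such an enum can look like.  The values follow the
      old tri-state integer convention (-1, 0, 1); the enumerator names and
      the helper function are given here as assumptions for illustration
      and should be checked against the patch itself.

```c
/* Values mirror the old tri-state integer so unconverted callers behave the same. */
enum umh_wait_sketch {
    UMH_NO_WAIT   = -1,  /* don't wait at all */
    UMH_WAIT_EXEC =  0,  /* wait for the exec, but not for the process */
    UMH_WAIT_PROC =  1,  /* wait for the process to complete */
};

/* Hypothetical helper: a readable description of each wait mode. */
static const char *umh_wait_describe(enum umh_wait_sketch wait)
{
    switch (wait) {
    case UMH_NO_WAIT:   return "no wait";
    case UMH_WAIT_EXEC: return "wait for exec";
    case UMH_WAIT_PROC: return "wait for process";
    }
    return "unknown";
}
```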
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: David Howells <dhowells@redhat.com>
    • Add common orderly_poweroff() · 10a0a8d4
      Jeremy Fitzhardinge authored
      Various pieces of code around the kernel want to be able to trigger an
      orderly poweroff.  This pulls them together into a single
      implementation.
      
      By default the poweroff command is /sbin/poweroff, but it can be set
      via sysctl: kernel/poweroff_cmd.  This is split at whitespace, so it
      can include command-line arguments.
      
      This patch replaces four other instances of invoking either "poweroff"
      or "shutdown -h now": two sbus drivers, and acpi thermal
      management.
      
      sparc64 has its own "powerd"; still need to determine whether it should
      be replaced by orderly_poweroff().
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: Len Brown <lenb@kernel.org>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David S. Miller <davem@davemloft.net>
    • usermodehelper: split setup from execution · 0ab4dc92
      Jeremy Fitzhardinge authored
      Rather than having hundreds of variations of call_usermodehelper for
      various pieces of usermode state which could be set up, split the
      info allocation and initialization from the actual process execution.
      
      This means the general pattern becomes:
       info = call_usermodehelper_setup(path, argv, envp); /* basic state */
       call_usermodehelper_<SET EXTRA STATE>(info, stuff...);	/* extra state */
       call_usermodehelper_exec(info, wait);	/* run process and free info */
      
      This patch introduces wrappers for all the existing calling styles for
      call_usermodehelper_*, but folds their implementations into one.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Björn Steinbrink <B.Steinbrink@gmx.de>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
    • add argv_split() · d84d1cc7
      Jeremy Fitzhardinge authored
      argv_split() is a helper function which takes a string, splits it at
      whitespace, and returns a NULL-terminated argv vector.  This is
      deliberately simple - it does no quote processing of any kind.
      
      [ Seems to me that this is something which is already being done in
        the kernel, but I couldn't find any other implementations, either to
        steal or replace.  Keep an eye out. ]
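
      The behaviour described above can be sketched in user space as
      follows.  This is not the kernel implementation: it uses malloc
      instead of the kernel allocators, and the function names and the
      argcp out-parameter are assumptions for the example.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split str at whitespace.  Returns a malloc'd, NULL-terminated vector of
 * malloc'd strings, or NULL on allocation failure.  No quote processing.
 */
static char **argv_split_sketch(const char *str, int *argcp)
{
    int argc = 0, cap = 4;
    char **argv = malloc(cap * sizeof(*argv));

    if (!argv)
        return NULL;

    while (*str) {
        while (isspace((unsigned char)*str))
            str++;                          /* skip leading whitespace */
        if (!*str)
            break;
        const char *start = str;
        while (*str && !isspace((unsigned char)*str))
            str++;                          /* find the end of the word */

        if (argc + 2 > cap) {               /* room for word + terminator */
            cap *= 2;
            char **tmp = realloc(argv, cap * sizeof(*argv));
            if (!tmp)
                goto fail;
            argv = tmp;
        }
        size_t len = (size_t)(str - start);
        argv[argc] = malloc(len + 1);
        if (!argv[argc])
            goto fail;
        memcpy(argv[argc], start, len);
        argv[argc][len] = '\0';
        argc++;
    }
    argv[argc] = NULL;
    if (argcp)
        *argcp = argc;
    return argv;

fail:
    while (argc > 0)
        free(argv[--argc]);
    free(argv);
    return NULL;
}

/* Free a vector returned by argv_split_sketch(). */
static void argv_free_sketch(char **argv)
{
    for (char **p = argv; p && *p; p++)
        free(*p);
    free(argv);
}
```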
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
    • add kstrndup · 1e66df3e
      Jeremy Fitzhardinge authored
      Add a kstrndup function, modelled on strndup.  Like strndup this
      returns a string copied into its own allocated memory, but it copies
      no more than the specified number of bytes from the source.
      
      Remove private strndup() from irda code.
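
      For illustration, a user-space equivalent of the copying behaviour
      described above.  The name is hypothetical and malloc stands in for
      the kernel allocator; the kernel function would additionally need
      allocation flags, which this sketch leaves out.

```c
#include <stdlib.h>
#include <string.h>

/* Copy at most max bytes of s into freshly allocated, NUL-terminated memory. */
static char *strndup_sketch(const char *s, size_t max)
{
    size_t len = 0;

    while (len < max && s[len])     /* bounded length, like strnlen */
        len++;

    char *buf = malloc(len + 1);
    if (buf) {
        memcpy(buf, s, len);
        buf[len] = '\0';
    }
    return buf;
}
```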
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@mandriva.com>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Panagiotis Issaris <takis@issaris.org>
      Cc: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
    • zs: move to the serial subsystem · 8b4a4080
      Maciej W. Rozycki authored
      This is a reimplementation of the zs driver for the serial subsystem.  Any
      resemblance to the old driver is purely coincidental.  ;-) I do hope I got
      the handling of modem lines right -- better do not tackle me about the
      issue unless you feel too good...
      
      Any users of the old driver: please note the numbers of the serial lines
      have now been swapped, i.e.  ttyS0 <-> ttyS1 and ttyS2 <-> ttyS3.  It has
      to do with the modem lines mentioned above; basically the port A in a given
      chip has to be initialised before the port B if you want to use the latter
      as the serial console (which is usually the case), as operations on modem
      lines of the serial line associated with the port B access both ports (see
      the comment at the top of the driver for the details of wiring used).
      Please update your scripts.
      
      This is also the reason each SCC now requests an IRQ once only (as seen in
      "/proc/interrupts") -- the handler takes care of both ports at once as the
      line associated with the port B has to take status update interrupts from
      both ports (and yet the line of the port A takes its own for itself too).
      The old driver never got it right...
      Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • serial: add early_serial_setup() back to header file · b187f180
      Yinghai Lu authored
      early_serial_setup was removed from serial.h, but its declaration was
      never added to serial_8250.h.  Add it there.
      Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fbdev: make fb_append_extra_logo() depend on fb=y · 04e08d0e
      Arnd Bergmann authored
      We can't show the extra logo from boot code if FB is built as a module.
      Make the FB_LOGO_EXTRA depend on FB=y.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Antonino A. Daplas" <adaplas@pol.net>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dm: fix memory leak in dm_create_persistent() when starting metadata update thread fails · 851a8a7f
      Jesper Juhl authored
      If, in dm_create_persistent(), the call to create_singlethread_workqueue()
      fails then we'll return without freeing the memory allocated to 'ps', thus
      leaking sizeof(struct pstore) bytes.  This patch fixes the leak.
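
      The shape of such a fix, sketched in user space.  Everything here is
      a hypothetical stand-in: malloc replaces the kernel allocator, the
      workqueue creation is simulated, and live_allocs exists only so the
      example can show that nothing is leaked on the failure path.

```c
#include <stdlib.h>

/* Reduced stand-in for the real structure. */
struct pstore_sketch {
    void *metadata_wq;
};

static int live_allocs;     /* pstore objects currently allocated */

/* Returns 0 and stores the new object in *out, or -1 on failure. */
static int create_persistent_sketch(struct pstore_sketch **out, int wq_fails)
{
    struct pstore_sketch *ps = malloc(sizeof(*ps));

    if (!ps)
        return -1;
    live_allocs++;

    /* Simulated workqueue creation, which may fail. */
    ps->metadata_wq = wq_fails ? NULL : malloc(1);
    if (!ps->metadata_wq) {
        free(ps);           /* the fix: release ps before bailing out */
        live_allocs--;
        return -1;
    }

    *out = ps;
    return 0;
}
```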
      
      Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
      Acked-by: Alasdair G Kergon <agk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slob: Kill off duplicate kzalloc() definition. · cb32da04
      Paul Mundt authored
      With the slab zeroing allocation cleanups, Christoph stubbed in a generic
      kzalloc(), but SLOB was missed. Follow the SLAB/SLUB changes and
      kill off the __kzalloc() wrapper that SLOB was using.
      Reported-by: Jan Engelhardt <jengelh@computergmbh.de>
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 17 Jul, 2007 16 commits