1. 13 Feb, 2008 11 commits
    • Herbert Xu's avatar
      [IPV6]: Fix IPsec datagram fragmentation · 28a89453
      Herbert Xu authored
      This is a long-standing bug in the IPsec IPv6 code that breaks
      when we emit a IPsec tunnel-mode datagram packet.  The problem
      is that the code the emits the packet assumes the IPv6 stack
      will fragment it later, but the IPv6 stack assumes that whoever
      is emitting the packet is going to pre-fragment the packet.
      
      In the long term we need to fix both sides, e.g., to get the
      datagram code to pre-fragment as well as to get the IPv6 stack
      to fragment locally generated tunnel-mode packet.
      
      For now this patch does the second part which should make it
      work for the IPsec host case.
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      28a89453
    • David S. Miller's avatar
      [NDISC]: Fix race in generic address resolution · 69cc64d8
      David S. Miller authored
      Frank Blaschka provided the bug report and the initial suggested fix
      for this bug.  He also validated this version of this fix.
      
      The problem is that the access to neigh->arp_queue is inconsistent, we
      grab references when dropping the lock lock to call
      neigh->ops->solicit() but this does not prevent other threads of
      control from trying to send out that packet at the same time causing
      corruptions because both code paths believe they have exclusive access
      to the skb.
      
      The best option seems to be to hold the write lock on neigh->lock
      during the ->solicit() call.  I looked at all of the ndisc_ops
      implementations and this seems workable.  The only case that needs
      special care is the IPV4 ARP implementation of arp_solicit().  It
      wants to take neigh->lock as a reader to protect the header entry in
      neigh->ha during the emission of the soliciation.  We can simply
      remove the read lock calls to take care of that since holding the lock
      as a writer at the caller providers a superset of the protection
      afforded by the existing read locking.
      
      The rest of the ->solicit() implementations don't care whether the
      neigh is locked or not.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      69cc64d8
    • David Newall's avatar
      hci_ldisc: fix null pointer deref · 3611f4d2
      David Newall authored
      Arjan:
      
        With the help of kerneloops.org I've spotted a nice little interaction
        between the TTY layer and the bluetooth code, however the tty layer is not
        something I'm all too familiar with so I rather ask than brute-force fix the
        code incorrectly.
      
        The raw details are at:
        http://www.kerneloops.org/search.php?search=uart_flush_buffer
      
        What happens is that, on closing the bluetooth tty, the tty layer goes
        into the release_dev() function, which first does a bunch of stuff, then
        sets the file->private_data to NULL, does some more stuff and then calls the
        ldisc close function.  Which in this case, is hci_uart_tty_close().
      
        Now, hci_uart_tty_close() calls hci_uart_close() which clears some
        internal bit, and then calls hci_uart_flush()...  which calls back to the
        tty layers' uart_flush_buffer() function.  (in drivers/bluetooth/hci_tty.c
        around line 194) Which then WARN_ON()'s because that's not allowed/supposed
        to be called this late in the shutdown of the port....
      
        Should the bluetooth driver even call this flush function at all??
      
      David:
      
        This seems to be what happens: Hci_uart_close() flushes using
        hci_uart_flush().  Subsequently, in hci_dev_do_close(), (one step in
        hci_unregister_dev()), hci_uart_flush() is called again.  The comment in
        uart_flush_buffer(), relating to the WARN_ON(), indicates you can't flush
        after the port is closed; which sounds reasonable.  I think hci_uart_close()
        should set hdev->flush to NULL before returning.  Hci_dev_do_close() does
        check for this.  The code path is rather involved and I'm not entirely clear
        of all steps, but I think that's what should be done.
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3611f4d2
    • Jarek Poplawski's avatar
      [AX25] ax25_ds_timer: use mod_timer instead of add_timer · e848b583
      Jarek Poplawski authored
      This patch changes current use of: init_timer(), add_timer()
      and del_timer() to setup_timer() with mod_timer(), which
      should be safer anyway.
      Reported-by: default avatarJann Traschewski <jann@gmx.de>
      Signed-off-by: default avatarJarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e848b583
    • Jarek Poplawski's avatar
      [AX25] ax25_timer: use mod_timer instead of add_timer · 21fab4a8
      Jarek Poplawski authored
      According to one of Jann's OOPS reports it looks like
      BUG_ON(timer_pending(timer)) triggers during add_timer()
      in ax25_start_t1timer(). This patch changes current use
      of: init_timer(), add_timer() and del_timer() to
      setup_timer() with mod_timer(), which should be safer
      anyway.
      Reported-by: default avatarJann Traschewski <jann@gmx.de>
      Signed-off-by: default avatarJarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      21fab4a8
    • Jarek Poplawski's avatar
      [AX25] ax25_route: make ax25_route_lock BH safe · 4de211f1
      Jarek Poplawski authored
      > =================================
      > [ INFO: inconsistent lock state ]
      > 2.6.24-dg8ngn-p02 #1
      > ---------------------------------
      > inconsistent {softirq-on-W} -> {in-softirq-R} usage.
      > linuxnet/3046 [HC0[0]:SC1[2]:HE1:SE0] takes:
      >  (ax25_route_lock){--.+}, at: [<f8a0cfb7>] ax25_get_route+0x18/0xb7 [ax25]
      > {softirq-on-W} state was registered at:
      ...
      
      This lockdep report shows that ax25_route_lock is taken for reading in
      softirq context, and for writing in process context with BHs enabled.
      So, to make this safe, all write_locks in ax25_route.c are changed to
      _bh versions.
      
      Reported-by: Jann Traschewski <jann@gmx.de>,
      Signed-off-by: default avatarJarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4de211f1
    • Jarek Poplawski's avatar
      [AX25] af_ax25: remove sock lock in ax25_info_show() · 1105b5d1
      Jarek Poplawski authored
      This lockdep warning:
      
      > =======================================================
      > [ INFO: possible circular locking dependency detected ]
      > 2.6.24 #3
      > -------------------------------------------------------
      > swapper/0 is trying to acquire lock:
      >  (ax25_list_lock){-+..}, at: [<f91dd3b1>] ax25_destroy_socket+0x171/0x1f0 [ax25]
      >
      > but task is already holding lock:
      >  (slock-AF_AX25){-+..}, at: [<f91dbabc>] ax25_std_heartbeat_expiry+0x1c/0xe0 [ax25]
      >
      > which lock already depends on the new lock.
      ...
      
      shows that ax25_list_lock and slock-AF_AX25 are taken in different
      order: ax25_info_show() takes slock (bh_lock_sock(ax25->sk)) while
      ax25_list_lock is held, so reversely to other functions. To fix this
      the sock lock should be moved to ax25_info_start(), and there would
      be still problem with breaking ax25_list_lock (it seems this "proper"
      order isn't optimal yet). But, since it's only for reading proc info
      it seems this is not necessary (e.g.  ax25_send_to_raw() does similar
      reading without this lock too).
      
      So, this patch removes sock lock to avoid deadlock possibility; there
      is also used sock_i_ino() function, which reads sk_socket under proper
      read lock. Additionally printf format of this i_ino is changed to %lu.
      Reported-by: default avatarBernard Pidoux F6BVP <f6bvp@free.fr>
      Signed-off-by: default avatarJarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1105b5d1
    • Stephen Hemminger's avatar
      fib_trie: /proc/net/route performance improvement · 8315f5d8
      Stephen Hemminger authored
      Use key/offset caching to change /proc/net/route (use by iputils route)
      from O(n^2) to O(n). This improves performance from 30sec with 160,000
      routes to 1sec.
      Signed-off-by: default avatarStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8315f5d8
    • Stephen Hemminger's avatar
      fib_trie: handle empty tree · ec28cf73
      Stephen Hemminger authored
      This fixes possible problems when trie_firstleaf() returns NULL
      to trie_leafindex().
      Signed-off-by: default avatarStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ec28cf73
    • David S. Miller's avatar
      [IPV4]: Remove IP_TOS setting privilege checks. · e4f8b5d4
      David S. Miller authored
      Various RFCs have all sorts of things to say about the CS field of the
      DSCP value.  In particular they try to make the distinction between
      values that should be used by "user applications" and things like
      routing daemons.
      
      This seems to have influenced the CAP_NET_ADMIN check which exists for
      IP_TOS socket option settings, but in fact it has an off-by-one error
      so it wasn't allowing CS5 which is meant for "user applications" as
      well.
      
      Further adding to the inconsistency and brokenness here, IPV6 does not
      validate the DSCP values specified for the IPV6_TCLASS socket option.
      
      The real actual uses of these TOS values are system specific in the
      final analysis, and these RFC recommendations are just that, "a
      recommendation".  In fact the standards very purposefully use
      "SHOULD" and "SHOULD NOT" when describing how these values can be
      used.
      
      In the final analysis the only clean way to provide consistency here
      is to remove the CAP_NET_ADMIN check.  The alternatives just don't
      work out:
      
      1) If we add the CAP_NET_ADMIN check to ipv6, this can break existing
         setups.
      
      2) If we just fix the off-by-one error in the class comparison in
         IPV4, certain DSCP values can be used in IPV6 but not IPV4 by
         default.  So people will just ask for a sysctl asking to
         override that.
      
      I checked several other freely available kernel trees and they
      do not make any privilege checks in this area like we do.  For
      the BSD stacks, this goes back all the way to Stevens Volume 2
      and beyond.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e4f8b5d4
    • David S. Miller's avatar
  2. 12 Feb, 2008 9 commits
    • Linus Torvalds's avatar
      WMI: initialize wmi_blocks.list even if ACPI is disabled · 96b5a46e
      Linus Torvalds authored
      Even if we don't want to register the WMI driver, we should initialize
      the wmi_blocks list to be empty, since we don't want the wmi helper
      functions to oops just because that basic list has not even been set up.
      
      With this, "find_guid()" will happily return "not found" rather than
      oopsing all over the place, and the callers will then just automatically
      return false or AE_NOT_FOUND as appropriate.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      96b5a46e
    • Roland McGrath's avatar
      x86: vdso_install fix · 2c158269
      Roland McGrath authored
      The makefile magic for installing the 32-bit vdso images on disk had a
      little error.  A single-line change would fix that bug, but this does a
      little more to reduce the error-prone duplication of this bit of
      makefile variable magic.
      Signed-off-by: default avatarRoland McGrath <roland@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2c158269
    • KOSAKI Motohiro's avatar
      mempolicy: silently restrict nodemask to allowed nodes · 31f1de46
      KOSAKI Motohiro authored
      Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
      presence of memoryless nodes.  This patch attempts to fix that problem.
      
      Some background:
      
      numactl --interleave=all calls set_mempolicy(2) with a fully populated
      [out to MAXNUMNODES] nodemask.  set_mempolicy() [in do_set_mempolicy()]
      calls contextualize_policy() which requires that the nodemask be a
      subset of the current task's mems_allowed; else EINVAL will be returned.
      
      A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
      i.e., nodes with memory.  So, a fully populated nodemask will be
      declared invalid if it includes memoryless nodes.
      
        NOTE:  the same thing will occur when running in a cpuset
               with restricted mem_allowed--for the same reason:
               node mask contains dis-allowed nodes.
      
      mbind(2), on the other hand, just masks off any nodes in the nodemask
      that are not included in the caller's mems_allowed.
      
      In each case [mbind() and set_mempolicy()], mpol_check_policy() will
      complain [again, resulting in EINVAL] if the nodemask contains any
      memoryless nodes.  This is somewhat redundant as mpol_new() will remove
      memoryless nodes for interleave policy, as will bind_zonelist()--called
      by mpol_new() for BIND policy.
      
      Proposed fix:
      
      1) modify contextualize_policy logic to:
         a) remember whether the incoming node mask is empty.
         b) if not, restrict the nodemask to allowed nodes, as is
            currently done in-line for mbind().  This guarantees
            that the resulting mask includes only nodes with memory.
      
            NOTE:  this is a [benign, IMO] change in behavior for
                   set_mempolicy().  Dis-allowed nodes will be
                   silently ignored, rather than returning an error.
      
         c) fold this code into mpol_check_policy(), replace 2 calls to
            contextualize_policy() to call mpol_check_policy() directly
            and remove contextualize_policy().
      
      2) In existing mpol_check_policy() logic, after "contextualization":
         a) MPOL_DEFAULT:  require that in coming mask "was_empty"
         b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
            contains at least one node.
         c) add a case for MPOL_PREFERRED:  if in coming was not empty
            and resulting mask IS empty, user specified invalid nodes.
            Return EINVAL.
         c) remove the now redundant check for memoryless nodes
      
      3) remove the now redundant masking of policy nodes for interleave
         policy from mpol_new().
      
      4) Now that mpol_check_policy() contextualizes the nodemask, remove
         the in-line nodes_and() from sys_mbind().  I believe that this
         restores mbind() to the behavior before the memoryless-nodes
         patch series.  E.g., we'll no longer treat an invalid nodemask
         with MPOL_PREFERRED as local allocation.
      
      [ Patch history:
      
        v1 -> v2:
         - Communicate whether or not incoming node mask was empty to
           mpol_check_policy() for better error checking.
         - As suggested by David Rientjes, remove the now unused
           cpuset_nodes_subset_current_mems_allowed() from cpuset.h
      
        v2 -> v3:
         - As suggested by Kosaki Motohito, fold the "contextualization"
           of policy nodemask into mpol_check_policy().  Looks a little
           cleaner. ]
      Signed-off-by: default avatarLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      31f1de46
    • Linus Torvalds's avatar
    • Jonathan Corbet's avatar
      Be more robust about bad arguments in get_user_pages() · 900cf086
      Jonathan Corbet authored
      So I spent a while pounding my head against my monitor trying to figure
      out the vmsplice() vulnerability - how could a failure to check for
      *read* access turn into a root exploit? It turns out that it's a buffer
      overflow problem which is made easy by the way get_user_pages() is
      coded.
      
      In particular, "len" is a signed int, and it is only checked at the
      *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
      will execute once and decrement len to -1.  At that point, the loop will
      proceed until the next invalid address is found; in the process, it will
      likely overflow the pages array passed in to get_user_pages().
      
      I think that, if get_user_pages() has been asked to grab zero pages,
      that's what it should do.  Thus this patch; it is, among other things,
      enough to block the (already fixed) root exploit and any others which
      might be lurking in similar code.  I also think that the number of pages
      should be unsigned, but changing the prototype of this function probably
      requires some more careful review.
      Signed-off-by: default avatarJonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      900cf086
    • Linus Torvalds's avatar
    • Pekka Enberg's avatar
      Add Matt to MAINTAINERS as a SLAB allocator maintainer · c76d118e
      Pekka Enberg authored
      Matt is already the maintainer of SLOB which is one of the "SLAB" allocators in
      the kernel so add him to MAINTAINERS.
      Signed-off-by: default avatarPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c76d118e
    • Linus Torvalds's avatar
      Merge branch 'upstream-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev · a17b7a39
      Linus Torvalds authored
      * 'upstream-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
        sata_mv: platform driver allocs dma without create
        pata_ninja32: setup changes
        pata_legacy: typo fix
        pata_amd: Note in the module description it handles Nvidia
        sata_mv: fix loop with last port
        libata: ignore deverr on SETXFER if mode is configured
        pata_via: fix SATA cable detection on cx700
      a17b7a39
    • Andi Kleen's avatar
      Make topology fallback macros reference their arguments. · 271cad6d
      Andi Kleen authored
      This avoids warnings with unreferenced variables in the !NUMA case.
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      271cad6d
  3. 11 Feb, 2008 20 commits