1. 06 Mar, 2013 40 commits
    • Eric Dumazet's avatar
      ipv6: use a stronger hash for tcp · 0fd0ff7e
      Eric Dumazet authored
      [ Upstream commit 08dcdbf6 ]
      
      It looks like its possible to open thousands of TCP IPv6
      sessions on a server, all landing in a single slot of TCP hash
      table. Incoming packets have to lookup sockets in a very
      long list.
      
      We should hash all bits from foreign IPv6 addresses, using
      a salt and hash mix, not a simple XOR.
      
      inet6_ehashfn() can also separately use the ports, instead
      of xoring them.
      Reported-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      0fd0ff7e
    • Li Wei's avatar
      ipv4: fix a bug in ping_err(). · 52430c06
      Li Wei authored
      [ Upstream commit b531ed61 ]
      
      We should get 'type' and 'code' from the outer ICMP header.
      Signed-off-by: default avatarLi Wei <lw@cn.fujitsu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      52430c06
    • David Vrabel's avatar
      xen-netback: cancel the credit timer when taking the vif down · db3521fa
      David Vrabel authored
      [ Upstream commit 3e55f8b3 ]
      
      If the credit timer is left armed after calling
      xen_netbk_remove_xenvif(), then it may fire and attempt to schedule
      the vif which will then oops as vif->netbk == NULL.
      
      This may happen both in the fatal error path and during normal
      disconnection from the front end.
      
      The sequencing during shutdown is critical to ensure that: a)
      vif->netbk doesn't become unexpectedly NULL; and b) the net device/vif
      is not freed.
      
      1. Mark as unschedulable (netif_carrier_off()).
      2. Synchronously cancel the timer.
      3. Remove the vif from the schedule list.
      4. Remove it from it netback thread group.
      5. Wait for vif->refcnt to become 0.
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: default avatarIan Campbell <ian.campbell@citrix.com>
      Reported-by: default avatarChristopher S. Aker <caker@theshore.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      db3521fa
    • David Vrabel's avatar
      xen-netback: correctly return errors from netbk_count_requests() · aa067cfc
      David Vrabel authored
      [ Upstream commit 35876b5f ]
      
      netbk_count_requests() could detect an error, call
      netbk_fatal_tx_error() but return 0.  The vif may then be used
      afterwards (e.g., in a call to netbk_tx_error().
      
      Since netbk_fatal_tx_error() could set vif->refcnt to 1, the vif may
      be freed immediately after the call to netbk_fatal_tx_error() (e.g.,
      if the vif is also removed).
      
      Netback thread              Xenwatch thread
      -------------------------------------------
      netbk_fatal_tx_err()        netback_remove()
                                    xenvif_disconnect()
                                      ...
                                      free_netdev()
      netbk_tx_err() Oops!
      Signed-off-by: default avatarWei Liu <wei.liu2@citrix.com>
      Signed-off-by: default avatarJan Beulich <JBeulich@suse.com>
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Reported-by: default avatarChristopher S. Aker <caker@theshore.net>
      Acked-by: default avatarIan Campbell <ian.campbell@citrix.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      aa067cfc
    • Stephen Hemminger's avatar
      bridge: set priority of STP packets · 1642974c
      Stephen Hemminger authored
      [ Upstream commit 547b4e71 ]
      
      Spanning Tree Protocol packets should have always been marked as
      control packets, this causes them to get queued in the high prirority
      FIFO. As Radia Perlman mentioned in her LCA talk, STP dies if bridge
      gets overloaded and can't communicate. This is a long-standing bug back
      to the first versions of Linux bridge.
      Signed-off-by: default avatarStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      1642974c
    • Jan Beulich's avatar
      xen-pciback: rate limit error messages from xen_pcibk_enable_msi{,x}() · ecb1d58c
      Jan Beulich authored
      commit 51ac8893 upstream.
      
      ... as being guest triggerable (e.g. by invoking
      XEN_PCI_OP_enable_msi{,x} on a device not being MSI/MSI-X capable).
      
      This is CVE-2013-0231 / XSA-43.
      
      Also make the two messages uniform in both their wording and severity.
      Signed-off-by: default avatarJan Beulich <jbeulich@suse.com>
      Acked-by: default avatarIan Campbell <ian.campbell@citrix.com>
      Reviewed-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      [bwh: Backported to 3.2: add #include <linux/ratelimited.h>, needed by
       printk_ratelimited()]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      ecb1d58c
    • Helge Deller's avatar
      unbreak automounter support on 64-bit kernel with 32-bit userspace (v2) · f20a819c
      Helge Deller authored
      commit 4f4ffc3a upstream.
      
      automount-support is broken on the parisc architecture, because the existing
      #if list does not include a check for defined(__hppa__). The HPPA (parisc)
      architecture is similiar to other 64bit Linux targets where we have to define
      autofs_wqt_t (which is passed back and forth to user space) as int type which
      has a size of 32bit across 32 and 64bit kernels.
      
      During the discussion on the mailing list, H. Peter Anvin suggested to invert
      the #if list since only specific platforms (specifically those who do not have
      a 32bit userspace, like IA64 and Alpha) should have autofs_wqt_t as unsigned
      long type.
      
      This suggestion is probably the best way to go, since Arm64 (and maybe others?)
      seems to have a non-working automounter. So in the long run even for other new
      upcoming architectures this inverted check seem to be the best solution, since
      it will not require them to change this #if again (unless they are 64bit only).
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Acked-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Acked-by: default avatarIan Kent <raven@themaw.net>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      CC: James Bottomley <James.Bottomley@HansenPartnership.com>
      CC: Rolf Eike Beer <eike-kernel@sf-tec.de>
      [bwh: Backported to 3.2: adjust filename]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      f20a819c
    • Heiko Carstens's avatar
      s390/timer: avoid overflow when programming clock comparator · 903fba3b
      Heiko Carstens authored
      commit d911e03d upstream.
      
      Since ed4f2094 "s390/time: fix sched_clock() overflow" a new helper function
      is used to avoid overflows when converting TOD format values to nanosecond
      values.
      The kvm interrupt code formerly however only worked by accident because of
      an overflow. It tried to program a timer that would expire in more than ~29
      years. Because of the old TOD-to-nanoseconds overflow bug the real expiry
      value however was much smaller, but now it isn't anymore.
      This however triggers yet another bug in the function that programs the clock
      comparator s390_next_ktime(): if the absolute "expires" value is after 2042
      this will result in an overflow and the programmed value is lower than the
      current TOD value which immediatly triggers a clock comparator (= timer)
      interrupt.
      Since the timer isn't expired it will be programmed immediately again and so
      on... the result is a dead system.
      To fix this simply program the maximum possible value if an overflow is
      detected.
      Reported-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Tested-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      903fba3b
    • Alex Deucher's avatar
      drm/radeon/evergreen+: wait for the MC to settle after MC blackout · 39ca2020
      Alex Deucher authored
      commit ed39fadd upstream.
      
      Some chips seem to need a little delay after blacking out
      the MC before the requests actually stop.
      
      May fix:
      https://bugs.freedesktop.org/show_bug.cgi?id=56139
      https://bugs.freedesktop.org/show_bug.cgi?id=57567Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      39ca2020
    • Alexander Duyck's avatar
      igb: Remove artificial restriction on RQDPC stat reading · 74d789d8
      Alexander Duyck authored
      commit ae1c07a6 upstream.
      
      For some reason the reading of the RQDPC register was being artificially
      limited to 4K.  Instead of limiting the value we should read the value and
      add the full amount.  Otherwise this can lead to a misleading number of
      dropped packets when the actual value is in fact much higher.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: default avatarJeff Pieper   <jeffrey.e.pieper@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      74d789d8
    • Paolo Bonzini's avatar
      nbd: fsync and kill block device on shutdown · 4189aa4c
      Paolo Bonzini authored
      commit 3a2d63f8 upstream.
      
      There are two problems with shutdown in the NBD driver.
      
      1: Receiving the NBD_DISCONNECT ioctl does not sync the filesystem.
      
         This patch adds the sync operation into __nbd_ioctl()'s
         NBD_DISCONNECT handler.  This is useful because BLKFLSBUF is restricted
         to processes that have CAP_SYS_ADMIN, and the NBD client may not
         possess it (fsync of the block device does not sync the filesystem,
         either).
      
      2: Once we clear the socket we have no guarantee that later reads will
         come from the same backing storage.
      
         The patch adds calls to kill_bdev() in __nbd_ioctl()'s socket
         clearing code so the page cache is cleaned, lest reads that hit on the
         page cache will return stale data from the previously-accessible disk.
      
      Example:
      
          # qemu-nbd -r -c/dev/nbd0 /dev/sr0
          # file -s /dev/nbd0
          /dev/stdin: # UDF filesystem data (version 1.5) etc.
          # qemu-nbd -d /dev/nbd0
          # qemu-nbd -r -c/dev/nbd0 /dev/sda
          # file -s /dev/nbd0
          /dev/stdin: # UDF filesystem data (version 1.5) etc.
      
      While /dev/sda has:
      
          # file -s /dev/sda
          /dev/sda: x86 boot sector; etc.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: default avatarPaul Clements <Paul.Clements@steeleye.com>
      Cc: Alex Bligh <alex@alex.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2:
       - Adjusted context
       - s/\bnbd\b/lo/
       - Incorporate export of kill_bdev() from commit ff01bb48
         ('fs: move code out of buffer.c')]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      4189aa4c
    • Xi Wang's avatar
      sysctl: fix null checking in bin_dn_node_address() · 0ba15a92
      Xi Wang authored
      commit df1778be upstream.
      
      The null check of `strchr() + 1' is broken, which is always non-null,
      leading to OOB read.  Instead, check the result of strchr().
      Signed-off-by: default avatarXi Wang <xi.wang@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      0ba15a92
    • Tejun Heo's avatar
      idr: fix top layer handling · a2a9af6c
      Tejun Heo authored
      commit 326cf0f0 upstream.
      
      Most functions in idr fail to deal with the high bits when the idr
      tree grows to the maximum height.
      
      * idr_get_empty_slot() stops growing idr tree once the depth reaches
        MAX_IDR_LEVEL - 1, which is one depth shallower than necessary to
        cover the whole range.  The function doesn't even notice that it
        didn't grow the tree enough and ends up allocating the wrong ID
        given sufficiently high @starting_id.
      
        For example, on 64 bit, if the starting id is 0x7fffff01,
        idr_get_empty_slot() will grow the tree 5 layer deep, which only
        covers the 30 bits and then proceed to allocate as if the bit 30
        wasn't specified.  It ends up allocating 0x3fffff01 without the bit
        30 but still returns 0x7fffff01.
      
      * __idr_remove_all() will not remove anything if the tree is fully
        grown.
      
      * idr_find() can't find anything if the tree is fully grown.
      
      * idr_for_each() and idr_get_next() can't iterate anything if the tree
        is fully grown.
      
      Fix it by introducing idr_max() which returns the maximum possible ID
      given the depth of tree and replacing the id limit checks in all
      affected places.
      
      As the idr_layer pointer array pa[] needs to be 1 larger than the
      maximum depth, enlarge pa[] arrays by one.
      
      While this plugs the discovered issues, the whole code base is
      horrible and in desparate need of rewrite.  It's fragile like hell,
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2:
       - Adjust context
       - s/MAX_IDR_LEVEL/MAX_LEVEL/; s/MAX_IDR_SHIFT/MAX_ID_SHIFT/
       - Drop change to idr_alloc()]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      a2a9af6c
    • Hugh Dickins's avatar
      idr: make idr_get_next() good for rcu_read_lock() · 6195dff3
      Hugh Dickins authored
      commit 9f7de827 upstream.
      
      Make one small adjustment to idr_get_next(): take the height from the top
      layer (stable under RCU) instead of from the root (unprotected by RCU), as
      idr_find() does: so that it can be used with RCU locking.  Copied comment
      on RCU locking from idr_find().
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: default avatarLi Zefan <lizf@cn.fujitsu.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      6195dff3
    • Tejun Heo's avatar
      firewire: add minor number range check to fw_device_init() · 507b2ac9
      Tejun Heo authored
      commit 3bec60d5 upstream.
      
      fw_device_init() didn't check whether the allocated minor number isn't
      too large.  Fail if it goes overflows MINORBITS.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Suggested-by: default avatarStefan Richter <stefanr@s5r6.in-berlin.de>
      Acked-by: default avatarStefan Richter <stefanr@s5r6.in-berlin.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      507b2ac9
    • Tejun Heo's avatar
      block: fix synchronization and limit check in blk_alloc_devt() · a51db52c
      Tejun Heo authored
      commit ce23bba8 upstream.
      
      idr allocation in blk_alloc_devt() wasn't synchronized against lookup
      and removal, and its limit check was off by one - 1 << MINORBITS is
      the number of minors allowed, not the maximum allowed minor.
      
      Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit
      checking.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      a51db52c
    • Tejun Heo's avatar
      idr: fix a subtle bug in idr_get_next() · 18536d61
      Tejun Heo authored
      commit 6cdae741 upstream.
      
      The iteration logic of idr_get_next() is borrowed mostly verbatim from
      idr_for_each().  It walks down the tree looking for the slot matching
      the current ID.  If the matching slot is not found, the ID is
      incremented by the distance of single slot at the given level and
      repeats.
      
      The implementation assumes that during the whole iteration id is aligned
      to the layer boundaries of the level closest to the leaf, which is true
      for all iterations starting from zero or an existing element and thus is
      fine for idr_for_each().
      
      However, idr_get_next() may be given any point and if the starting id
      hits in the middle of a non-existent layer, increment to the next layer
      will end up skipping the same offset into it.  For example, an IDR with
      IDs filled between [64, 127] would look like the following.
      
                [  0  64 ... ]
             /----/   |
             |        |
            NULL    [ 64 ... 127 ]
      
      If idr_get_next() is called with 63 as the starting point, it will try
      to follow down the pointer from 0.  As it is NULL, it will then try to
      proceed to the next slot in the same level by adding the slot distance
      at that level which is 64 - making the next try 127.  It goes around the
      loop and finds and returns 127 skipping [64, 126].
      
      Note that this bug also triggers in idr_for_each_entry() loop which
      deletes during iteration as deletions can make layers go away leaving
      the iteration with unaligned ID into missing layers.
      
      Fix it by ensuring proceeding to the next slot doesn't carry over the
      unaligned offset - ie.  use round_up(id + 1, slot_distance) instead of
      id += slot_distance.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarDavid Teigland <teigland@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      18536d61
    • Tomas Henzl's avatar
      block: fix ext_devt_idr handling · 2abb7d3a
      Tomas Henzl authored
      commit 7b74e912 upstream.
      
      While adding and removing a lot of disks disks and partitions this
      sometimes shows up:
      
        WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
        Hardware name:
        sysfs: cannot create duplicate filename '/dev/block/259:751'
        Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
        Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
        Call Trace:
          warn_slowpath_common+0x87/0xc0
          warn_slowpath_fmt+0x46/0x50
          sysfs_add_one+0xc9/0x130
          sysfs_do_create_link+0x12b/0x170
          sysfs_create_link+0x13/0x20
          device_add+0x317/0x650
          idr_get_new+0x13/0x50
          add_partition+0x21c/0x390
          rescan_partitions+0x32b/0x470
          sd_open+0x81/0x1f0 [sd_mod]
          __blkdev_get+0x1b6/0x3c0
          blkdev_get+0x10/0x20
          register_disk+0x155/0x170
          add_disk+0xa6/0x160
          sd_probe_async+0x13b/0x210 [sd_mod]
          add_wait_queue+0x46/0x60
          async_thread+0x102/0x250
          default_wake_function+0x0/0x20
          async_thread+0x0/0x250
          kthread+0x96/0xa0
          child_rip+0xa/0x20
          kthread+0x0/0xa0
          child_rip+0x0/0x20
      
      This most likely happens because dev_t is freed while the number is
      still used and idr_get_new() is not protected on every use.  The fix
      adds a mutex where it wasn't before and moves the dev_t free function so
      it is called after device del.
      Signed-off-by: default avatarTomas Henzl <thenzl@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2: adjust filename]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      2abb7d3a
    • Xiaowei.Hu's avatar
      ocfs2: ac->ac_allow_chain_relink=0 won't disable group relink · 0ebab908
      Xiaowei.Hu authored
      commit 309a85b6 upstream.
      
      ocfs2_block_group_alloc_discontig() disables chain relink by setting
      ac->ac_allow_chain_relink = 0 because it grabs clusters from multiple
      cluster groups.
      
      It doesn't keep the credits for all chain relink,but
      ocfs2_claim_suballoc_bits overrides this in this call trace:
      ocfs2_block_group_claim_bits()->ocfs2_claim_clusters()->
      __ocfs2_claim_clusters()->ocfs2_claim_suballoc_bits()
      ocfs2_claim_suballoc_bits set ac->ac_allow_chain_relink = 1; then call
      ocfs2_search_chain() one time and disable it again, and then we run out
      of credits.
      
      Fix is to allow relink by default and disable it in
      ocfs2_block_group_alloc_discontig.
      
      Without this patch, End-users will run into a crash due to run out of
      credits, backtrace like this:
      
        RIP: 0010:[<ffffffffa0808b14>]  [<ffffffffa0808b14>]
        jbd2_journal_dirty_metadata+0x164/0x170 [jbd2]
        RSP: 0018:ffff8801b919b5b8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff88022139ddc0 RCX: ffff880159f652d0
        RDX: ffff880178aa3000 RSI: ffff880159f652d0 RDI: ffff880087f09bf8
        RBP: ffff8801b919b5e8 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000001e00 R11: 00000000000150b0 R12: ffff880159f652d0
        R13: ffff8801a0cae908 R14: ffff880087f09bf8 R15: ffff88018d177800
        FS:  00007fc9b0b6b6e0(0000) GS:ffff88022fd40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 000000000040819c CR3: 0000000184017000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process dd (pid: 9945, threadinfo ffff8801b919a000, task ffff880149a264c0)
        Call Trace:
          ocfs2_journal_dirty+0x2f/0x70 [ocfs2]
          ocfs2_relink_block_group+0x111/0x480 [ocfs2]
          ocfs2_search_chain+0x455/0x9a0 [ocfs2]
          ...
      Signed-off-by: default avatarXiaowei.Hu <xiaowei.hu@oracle.com>
      Reviewed-by: default avatarSrinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      0ebab908
    • Jeff Liu's avatar
      ocfs2: fix ocfs2_init_security_and_acl() to initialize acl correctly · 98e4a531
      Jeff Liu authored
      commit 32918dd9 upstream.
      
      We need to re-initialize the security for a new reflinked inode with its
      parent dirs if it isn't specified to be preserved for ocfs2_reflink().
      However, the code logic is broken at ocfs2_init_security_and_acl()
      although ocfs2_init_security_get() succeed.  As a result,
      ocfs2_acl_init() does not involked and therefore the default ACL of
      parent dir was missing on the new inode.
      
      Note this was introduced by 9d8f13ba ("security: new
      security_inode_init_security API adds function callback")
      
      To reproduce:
      
          set default ACL for the parent dir(ocfs2 in this case):
          $ setfacl -m default:user:jeff:rwx ../ocfs2/
          $ getfacl ../ocfs2/
          # file: ../ocfs2/
          # owner: jeff
          # group: jeff
          user::rwx
          group::r-x
          other::r-x
          default:user::rwx
          default:user:jeff:rwx
          default:group::r-x
          default:mask::rwx
          default:other::r-x
      
          $ touch a
          $ getfacl a
          # file: a
          # owner: jeff
          # group: jeff
          user::rw-
          group::rw-
          other::r--
      
      Before patching, create reflink file b from a, the user
      default ACL entry(user:jeff:rwx)was missing:
      
          $ ./ocfs2_reflink a b
          $ getfacl b
          # file: b
          # owner: jeff
          # group: jeff
          user::rw-
          group::rw-
          other::r--
      
      In this case, the end user can also observed an error message at syslog:
      
        (ocfs2_reflink,3229,2):ocfs2_init_security_and_acl:7193 ERROR: status = 0
      
      After applying this patch, create reflink file c from a:
      
          $ ./ocfs2_reflink a c
          $ getfacl c
          # file: c
          # owner: jeff
          # group: jeff
          user::rw-
          user:jeff:rwx			#effective:rw-
          group::r-x			#effective:r--
          mask::rw-
          other::r--
      
      Test program:
      /* Usage: reflink <source> <dest> */
      #include <stdio.h>
      #include <stdint.h>
      #include <stdbool.h>
      #include <string.h>
      #include <errno.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <sys/ioctl.h>
      
      static int
      reflink_file(char const *src_name, char const *dst_name,
      	     bool preserve_attrs)
      {
      	int fd;
      
      #ifndef REFLINK_ATTR_NONE
      #  define REFLINK_ATTR_NONE 0
      #endif
      #ifndef REFLINK_ATTR_PRESERVE
      #  define REFLINK_ATTR_PRESERVE 1
      #endif
      #ifndef OCFS2_IOC_REFLINK
      	struct reflink_arguments {
      		uint64_t old_path;
      		uint64_t new_path;
      		uint64_t preserve;
      	};
      
      #  define OCFS2_IOC_REFLINK _IOW ('o', 4, struct reflink_arguments)
      #endif
      	struct reflink_arguments args = {
      		.old_path = (unsigned long) src_name,
      		.new_path = (unsigned long) dst_name,
      		.preserve = preserve_attrs ? REFLINK_ATTR_PRESERVE :
      					     REFLINK_ATTR_NONE,
      	};
      
      	fd = open(src_name, O_RDONLY);
      	if (fd < 0) {
      		fprintf(stderr, "Failed to open %s: %s\n",
      			src_name, strerror(errno));
      		return -1;
      	}
      
      	if (ioctl(fd, OCFS2_IOC_REFLINK, &args) < 0) {
      		fprintf(stderr, "Failed to reflink %s to %s: %s\n",
      			src_name, dst_name, strerror(errno));
      		return -1;
      	}
      }
      
      int
      main(int argc, char *argv[])
      {
      	if (argc != 3) {
      		fprintf(stdout, "Usage: %s source dest\n", argv[0]);
      		return 1;
      	}
      
      	return reflink_file(argv[1], argv[2], 0);
      }
      Signed-off-by: default avatarJie Liu <jeff.liu@oracle.com>
      Reviewed-by: default avatarTao Ma <boyu.mt@taobao.com>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      98e4a531
    • H. Peter Anvin's avatar
      x86: Make sure we can boot in the case the BDA contains pure garbage · c2581d3d
      H. Peter Anvin authored
      commit 7c100936 upstream.
      
      On non-BIOS platforms it is possible that the BIOS data area contains
      garbage instead of being zeroed or something equivalent (firmware
      people: we are talking of 1.5K here, so please do the sane thing.)
      
      We need on the order of 20-30K of low memory in order to boot, which
      may grow up to < 64K in the future.  We probably want to avoid the
      lowest of the low memory.  At the same time, it seems extremely
      unlikely that a legitimate EBDA would ever reach down to the 128K
      (which would require it to be over half a megabyte in size.)  Thus,
      pick 128K as the cutoff for "this is insane, ignore."  We may still
      end up reserving a bunch of extra memory on the low megabyte, but that
      is not really a major issue these days.  In the worst case we lose
      512K of RAM.
      
      This code really should be merged with trim_bios_range() in
      arch/x86/kernel/setup.c, but that is a bigger patch for a later merge
      window.
      Reported-by: default avatarDarren Hart <dvhart@linux.intel.com>
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Cc: Matt Fleming <matt.fleming@intel.com>
      Link: http://lkml.kernel.org/n/tip-oebml055yyfm8yxmria09rja@git.kernel.orgSigned-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      c2581d3d
    • Jan Kara's avatar
      ocfs2: fix possible use-after-free with AIO · d7763498
      Jan Kara authored
      commit 9b171e0c upstream.
      
      Running AIO is pinning inode in memory using file reference. Once AIO
      is completed using aio_complete(), file reference is put and inode can
      be freed from memory. So we have to be sure that calling aio_complete()
      is the last thing we do with the inode.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarJoel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      d7763498
    • Konrad Rzeszutek Wilk's avatar
      doc, kernel-parameters: Document 'console=hvc<n>' · 7479d5dd
      Konrad Rzeszutek Wilk authored
      commit a2fd6419 upstream.
      
      Both the PowerPC hypervisor and Xen hypervisor can utilize the
      hvc driver.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Link: http://lkml.kernel.org/r/1361825650-14031-3-git-send-email-konrad.wilk@oracle.comSigned-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      7479d5dd
    • Konrad Rzeszutek Wilk's avatar
      doc, xen: Mention 'earlyprintk=xen' in the documentation. · 56070188
      Konrad Rzeszutek Wilk authored
      commit 2482a92e upstream.
      
      The earlyprintk for Xen PV guests utilizes a simple hypercall
      (console_io) to provide output to Xen emergency console.
      
      Note that the Xen hypervisor should be booted with 'loglevel=all'
      to output said information.
      Reported-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Link: http://lkml.kernel.org/r/1361825650-14031-2-git-send-email-konrad.wilk@oracle.comSigned-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      56070188
    • Shawn Guo's avatar
      mmc: sdhci-esdhc-imx: fix host version read · e3efb0af
      Shawn Guo authored
      commit ef4d0888 upstream.
      
      When commit 95a2482a (mmc: sdhci-esdhc-imx: add basic imx6q usdhc
      support) works around host version issue on imx6q, it gets the
      register address fixup "reg ^= 2" lost for imx25/35/51/53 esdhc.
      Thus, the controller version on these SoCs is wrongly identified
      as v1 while it's actually v2.
      
      Add the address fixup back and take a different approach to correct
      imx6q host version, so that the host version read gets back to work
      for all SoCs.
      Signed-off-by: default avatarShawn Guo <shawn.guo@linaro.org>
      Signed-off-by: default avatarChris Ball <cjb@laptop.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      e3efb0af
    • Greg Thelen's avatar
      tmpfs: fix use-after-free of mempolicy object · 2b82b58d
      Greg Thelen authored
      commit 5f00110f upstream.
      
      The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
      option is not specified in the remount request.  A new policy can be
      specified if mpol=M is given.
      
      Before this patch remounting an mpol bound tmpfs without specifying
      mpol= mount option in the remount request would set the filesystem's
      mempolicy object to a freed mempolicy object.
      
      To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
          # mkdir /tmp/x
      
          # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0
      
          # mount -o remount,size=200M nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
              # note ? garbage in mpol=... output above
      
          # dd if=/dev/zero of=/tmp/x/f count=1
              # panic here
      
      Panic:
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<          (null)>]           (null)
          [...]
          Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
          Call Trace:
            mpol_shared_policy_init+0xa5/0x160
            shmem_get_inode+0x209/0x270
            shmem_mknod+0x3e/0xf0
            shmem_create+0x18/0x20
            vfs_create+0xb5/0x130
            do_last+0x9a1/0xea0
            path_openat+0xb3/0x4d0
            do_filp_open+0x42/0xa0
            do_sys_open+0xfe/0x1e0
            compat_sys_open+0x1b/0x20
            cstar_dispatch+0x7/0x1f
      
      Non-debug kernels will not crash immediately because referencing the
      dangling mpol will not cause a fault.  Instead the filesystem will
      reference a freed mempolicy object, which will cause unpredictable
      behavior.
      
      The problem boils down to a dropped mpol reference below if
      shmem_parse_options() does not allocate a new mpol:
      
          config = *sbinfo
          shmem_parse_options(data, &config, true)
          mpol_put(sbinfo->mpol)
          sbinfo->mpol = config.mpol  /* BUG: saves unreferenced mpol */
      
      This patch avoids the crash by not releasing the mempolicy if
      shmem_parse_options() doesn't create a new mpol.
      
      How far back does this issue go? I see it in both 2.6.36 and 3.3.  I did
      not look back further.
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      2b82b58d
    • Mel Gorman's avatar
      mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages · 9fb42c9a
      Mel Gorman authored
      commit 67d46b29 upstream.
      
      Rob van der Heij reported the following (paraphrased) on private mail.
      
      	The scenario is that I want to avoid backups to fill up the page
      	cache and purge stuff that is more likely to be used again (this is
      	with s390x Linux on z/VM, so I don't give it as much memory that
      	we don't care anymore). So I have something with LD_PRELOAD that
      	intercepts the close() call (from tar, in this case) and issues
      	a posix_fadvise() just before closing the file.
      
      	This mostly works, except for small files (less than 14 pages)
      	that remains in page cache after the face.
      
      Unfortunately Rob has not had a chance to test this exact patch but the
      test program below should be reproducing the problem he described.
      
      The issue is the per-cpu pagevecs for LRU additions.  If the pages are
      added by one CPU but fadvise() is called on another then the pages
      remain resident as the invalidate_mapping_pages() only drains the local
      pagevecs via its call to pagevec_release().  The user-visible effect is
      that a program that uses fadvise() properly is not obeyed.
      
      A possible fix for this is to put the necessary smarts into
      invalidate_mapping_pages() to globally drain the LRU pagevecs if a
      pagevec page could not be discarded.  The downside with this is that an
      inode cache shrink would send a global IPI and memory pressure
      potentially causing global IPI storms is very undesirable.
      
      Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
      check if invalidate_mapping_pages() discarded all the requested pages.
      If a subset of pages are discarded it drains the LRU pagevecs and tries
      again.  If the second attempt fails, it assumes it is due to the pages
      being mapped, locked or dirty and does not care.  With this patch, an
      application using fadvise() correctly will be obeyed but there is a
      downside that a malicious application can force the kernel to send
      global IPIs and increase overhead.
      
      If accepted, I would like this to be considered as a -stable candidate.
      It's not an urgent issue but it's a system call that is not working as
      advertised which is weak.
      
      The following test program demonstrates the problem.  It should never
      report that pages are still resident but will without this patch.  It
      assumes that CPU 0 and 1 exist.
      
      int main() {
      	int fd;
      	int pagesize = getpagesize();
      	ssize_t written = 0, expected;
      	char *buf;
      	unsigned char *vec;
      	int resident, i;
      	cpu_set_t set;
      
      	/* Prepare a buffer for writing */
      	expected = FILESIZE_PAGES * pagesize;
      	buf = malloc(expected + 1);
      	if (buf == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      	buf[expected] = 0;
      	memset(buf, 'a', expected);
      
      	/* Prepare the mincore vec */
      	vec = malloc(FILESIZE_PAGES);
      	if (vec == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Bind ourselves to CPU 0 */
      	CPU_ZERO(&set);
      	CPU_SET(0, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* open file, unlink and write buffer */
      	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR);
      	if (fd == -1) {
      		perror("open");
      		exit(EXIT_FAILURE);
      	}
      	unlink("fadvise-test-file");
      	while (written < expected) {
      		ssize_t this_write;
      		this_write = write(fd, buf + written, expected - written);
      
      		if (this_write == -1) {
      			perror("write");
      			exit(EXIT_FAILURE);
      		}
      
      		written += this_write;
      	}
      	free(buf);
      
      	/*
      	 * Force ourselves to another CPU. If fadvise only flushes the local
      	 * CPUs pagevecs then the fadvise will fail to discard all file pages
      	 */
      	CPU_ZERO(&set);
      	CPU_SET(1, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* sync and fadvise to discard the page cache */
      	fsync(fd);
      	if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
      		perror("posix_fadvise");
      		exit(EXIT_FAILURE);
      	}
      
      	/* map the file and use mincore to see which parts of it are resident */
      	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
      	if (buf == NULL) {
      		perror("mmap");
      		exit(EXIT_FAILURE);
      	}
      	if (mincore(buf, expected, vec) == -1) {
      		perror("mincore");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Check residency */
      	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
      		if (vec[i])
      			resident++;
      	}
      	if (resident != 0) {
      		printf("Nr unexpected pages resident: %d\n", resident);
      		exit(EXIT_FAILURE);
      	}
      
      	munmap(buf, expected);
      	close(fd);
      	free(vec);
      	exit(EXIT_SUCCESS);
      }
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reported-by: default avatarRob van der Heij <rvdheij@gmail.com>
      Tested-by: default avatarRob van der Heij <rvdheij@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      9fb42c9a
    • Robin Holt's avatar
      mmu_notifier_unregister NULL Pointer deref and multiple ->release() callouts · 09454abb
      Robin Holt authored
      commit 751efd86 upstream.
      
      There is a race condition between mmu_notifier_unregister() and
      __mmu_notifier_release().
      
      Assume two tasks, one calling mmu_notifier_unregister() as a result of a
      filp_close() ->flush() callout (task A), and the other calling
      mmu_notifier_release() from an mmput() (task B).
      
                      A                               B
      t1                                              srcu_read_lock()
      t2              if (!hlist_unhashed())
      t3                                              srcu_read_unlock()
      t4              srcu_read_lock()
      t5                                              hlist_del_init_rcu()
      t6                                              synchronize_srcu()
      t7              srcu_read_unlock()
      t8              hlist_del_rcu()  <--- NULL pointer deref.
      
      Additionally, the list traversal in __mmu_notifier_release() is not
      protected by the by the mmu_notifier_mm->hlist_lock which can result in
      callouts to the ->release() notifier from both mmu_notifier_unregister()
      and __mmu_notifier_release().
      
      -stable suggestions:
      
      The stable trees prior to 3.7.y need commits 21a92735 and
      70400303 cherry-picked in that order prior to cherry-picking this
      commit.  The 3.7.y tree already has those two commits.
      Signed-off-by: default avatarRobin Holt <holt@sgi.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Sagi Grimberg <sagig@mellanox.co.il>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      09454abb
    • Andrea Arcangeli's avatar
      mm: mmu_notifier: make the mmu_notifier srcu static · 3d2b066c
      Andrea Arcangeli authored
      commit 70400303 upstream.
      
      The variable must be static especially given the variable name.
      
      s/RCU/SRCU/ over a few comments.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      3d2b066c
    • Sagi Grimberg's avatar
      mm: mmu_notifier: have mmu_notifiers use a global SRCU so they may safely schedule · 80da2799
      Sagi Grimberg authored
      commit 21a92735 upstream.
      
      With an RCU based mmu_notifier implementation, any callout to
      mmu_notifier_invalidate_range_{start,end}() or
      mmu_notifier_invalidate_page() would not be allowed to call schedule()
      as that could potentially allow a modification to the mmu_notifier
      structure while it is currently being used.
      
      Since srcu allocs 4 machine words per instance per cpu, we may end up
      with memory exhaustion if we use srcu per mm.  So all mms share a global
      srcu.  Note that during large mmu_notifier activity exit & unregister
      paths might hang for longer periods, but it is tolerable for current
      mmu_notifier clients.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.co.il>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      80da2799
    • Phileas Fogg's avatar
      powerpc/kexec: Disable hard IRQ before kexec · 819a56e8
      Phileas Fogg authored
      commit 8520e443 upstream.
      
      Disable hard IRQ before kexec a new kernel image.
      Not doing it can result in corrupted data in the memory segments
      reserved for the new kernel.
      Signed-off-by: default avatarPhileas Fogg <phileas-fogg@mail.ru>
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      819a56e8
    • Jan Kara's avatar
      fs: Fix possible use-after-free with AIO · 53edcd23
      Jan Kara authored
      commit 54c807e7 upstream.
      
      Running AIO is pinning inode in memory using file reference. Once AIO
      is completed using aio_complete(), file reference is put and inode can
      be freed from memory. So we have to be sure that calling aio_complete()
      is the last thing we do with the inode.
      
      CC: Christoph Hellwig <hch@infradead.org>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      53edcd23
    • Lukas Czerner's avatar
      ext4: fix free clusters calculation in bigalloc filesystem · 668f6e6a
      Lukas Czerner authored
      commit 304e220f upstream.
      
      ext4_has_free_clusters() should tell us whether there is enough free
      clusters to allocate, however number of free clusters in the file system
      is converted to blocks using EXT4_C2B() which is not only wrong use of
      the macro (we should have used EXT4_NUM_B2C) but it's also completely
      wrong concept since everything else is in cluster units.
      
      Moreover when calculating number of root clusters we should be using
      macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be
      off by one. However r_blocks_count should always be a multiple of the
      cluster ratio so doing a plain bit shift should be enough here. We
      avoid using EXT4_B2C() because it's confusing.
      
      As a result of the first problem number of free clusters is much bigger
      than it should have been and ext4_has_free_clusters() would return 1 even
      if there is really not enough free clusters available.
      
      Fix this by removing the EXT4_C2B() conversion of free clusters and
      using bit shift when calculating number of root clusters. This bug
      affects number of xfstests tests covering file system ENOSPC situation
      handling. With this patch most of the ENOSPC problems with bigalloc file
      system disappear, especially the errors caused by delayed allocation not
      having enough space when the actual allocation is finally requested.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      668f6e6a
    • Lars-Peter Clausen's avatar
      drivers/video/backlight/adp88?0_bl.c: fix resume · 8ca4e7b9
      Lars-Peter Clausen authored
      commit 5eb02c01 upstream.
      
      Clearing the NSTBY bit in the control register also automatically clears
      the BLEN bit.  So we need to make sure to set it again during resume,
      otherwise the backlight will stay off.
      Signed-off-by: default avatarLars-Peter Clausen <lars@metafoo.de>
      Acked-by: default avatarMichael Hennerich <michael.hennerich@analog.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      8ca4e7b9
    • Junxiao Bi's avatar
      ocfs2: unlock super lock if lockres refresh failed · 9ba3d1e2
      Junxiao Bi authored
      commit 3278bb74 upstream.
      
      If lockres refresh failed, the super lock will never be released which
      will cause some processes on other cluster nodes hung forever.
      Signed-off-by: default avatarJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      9ba3d1e2
    • MITSUNARI Shigeo's avatar
      fs/block_dev.c: page cache wrongly left invalidated after revalidate_disk() · 65a79a3b
      MITSUNARI Shigeo authored
      commit 7630b661 upstream.
      
      We found that bdev->bd_invalidated was left set once revalidate_disk()
      is called, which results in page cache flush every time that device is
      open.
      
      Specifically, we found this problem in MD block device.  Once we resize
      a MD device, mdadm --monitor periodically flush all page cache for that
      device every 60 or 1000 seconds when it opens the device.
      
      This bug lies since at least 3.2.0 till the latest kernel(3.6.2).  Patch
      is attached.
      
      The following steps will reproduce the problem.
      
      1. prepair a block device (eg /dev/sdb).
      
      2. create two partitions:
      
         sudo parted /dev/sdb
         mklabel gpt
         mkpart primary 0% 50%
         mkpart primary 50% 100%
      
      3. create a md device.
      
         sudo mdadm -C /dev/md/hoge -l 1 -n 2 -e 1.2 --assume-clean --auto=md --symlink=no /dev/sdb1 /dev/sdb2
      
      4. create file system and mount it
      
         sudo mkfs.ext3 /dev/md/hoge
         sudo mkdir /mnt/test
         sudo mount /dev/md/hoge /mnt/test
      
      5. try to resize the device
      
         sudo mdadm -G /dev/md/hoge --size=max
      
      6. create a file to fill file cache.
      
        sudo dd if=/dev/urandom of=/mnt/test/data bs=1M count=10
      
      and verify the current status of file by free command.
      
      7. mdadm monitor will open the md device every 1000 seconds and you
         will find all file cache on the device are cleared.
      
      The timing can be reduced by the following steps.
      
      a) kill mdadm and restart it with --delay option
      
         /sbin/mdadm --monitor --delay=30 --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog
      
      or open the md device directly.
      
         sudo dd if=/dev/md/hoge of=/dev/null bs=4096 count=1
      Signed-off-by: default avatarMITSUNARI Shigeo <herumi@nifty.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      65a79a3b
    • Jim Somerville's avatar
      inotify: remove broken mask checks causing unmount to be EINVAL · c8164cb5
      Jim Somerville authored
      commit 676a0675 upstream.
      
      Running the command:
      
      	inotifywait -e unmount /mnt/disk
      
      immediately aborts with a -EINVAL return code.  This is however a valid
      parameter.  This abort occurs only if unmount is the sole event
      parameter.  If other event parameters are supplied, then the unmount
      event wait will work.
      
      The problem was introduced by commit 44b350fc ("inotify: Fix mask
      checks").  In that commit, it states:
      
      	The mask checks in inotify_update_existing_watch() and
      	inotify_new_watch() are useless because inotify_arg_to_mask()
      	sets FS_IN_IGNORED and FS_EVENT_ON_CHILD bits anyway.
      
      But instead of removing the useless checks, it did this:
      
      	        mask = inotify_arg_to_mask(arg);
      	-       if (unlikely(!mask))
      	+       if (unlikely(!(mask & IN_ALL_EVENTS)))
      	                return -EINVAL;
      
      The problem is that IN_ALL_EVENTS doesn't include IN_UNMOUNT, and other
      parts of the code keep IN_UNMOUNT separate from IN_ALL_EVENTS.  So the
      check should be:
      
      	if (unlikely(!(mask & (IN_ALL_EVENTS | IN_UNMOUNT))))
      
      But inotify_arg_to_mask(arg) always sets the IN_UNMOUNT bit in the mask
      anyway, so the check is always going to pass and thus should simply be
      removed.  Also note that inotify_arg_to_mask completely controls what
      mask bits get set from arg, there's no way for invalid bits to get
      enabled there.
      
      Lets fix it by simply removing the useless broken checks.
      Signed-off-by: default avatarJim Somerville <Jim.Somerville@windriver.com>
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      c8164cb5
    • Tejun Heo's avatar
      posix-timer: Don't call idr_find() with out-of-range ID · 4fa1b6ed
      Tejun Heo authored
      commit e182bb38 upstream.
      
      When idr_find() was fed a negative ID, it used to look up the ID
      ignoring the sign bit before recent ("idr: remove MAX_IDR_MASK and
      move left MAX_IDR_* into idr.c") patch. Now a negative ID triggers
      a WARN_ON_ONCE().
      
      __lock_timer() feeds timer_id from userland directly to idr_find()
      without sanitizing it which can trigger the above malfunctions.  Add a
      range check on @timer_id before invoking idr_find() in __lock_timer().
      
      While timer_t is defined as int by all archs at the moment, Andrew
      worries that it may be defined as a larger type later on.  Make the
      test cover larger integers too so that it at least is guaranteed to
      not return the wrong timer.
      
      Note that WARN_ON_ONCE() in idr_find() on id < 0 is transitional
      precaution while moving away from ignoring MSB.  Once it's gone we can
      remove the guard as long as timer_t isn't larger than int.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>nnn
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20130220232412.GL3570@htj.dyndns.orgSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      4fa1b6ed
    • Pawel Moll's avatar
      ALSA: usb: Fix Processing Unit Descriptor parsers · 7d3cbb43
      Pawel Moll authored
      commit b531f81b upstream.
      
      Commit 99fc8645 "ALSA: usb-mixer:
      parse descriptors with structs" introduced a set of useful parsers
      for descriptors. Unfortunately the parses for the Processing Unit
      Descriptor came with a very subtle bug...
      
      Functions uac_processing_unit_iProcessing() and
      uac_processing_unit_specific() were indexing the baSourceID array
      forgetting the fields before the iProcessing and process-specific
      descriptors.
      
      The problem was observed with Sound Blaster Extigy mixer,
      where nNrModes in Up/Down-mix Processing Unit Descriptor
      was accessed at offset 10 of the descriptor (value 0)
      instead of offset 15 (value 7). In result the resulting
      control had interesting limit values:
      
      Simple mixer control 'Channel Routing Mode Select',0
        Capabilities: volume volume-joined penum
        Playback channels: Mono
        Capture channels: Mono
        Limits: 0 - -1
        Mono: -1 [100%]
      
      Fixed by starting from the bmControls, which was calculated
      correctly, instead of baSourceID.
      
      Now the mentioned control is fine:
      
      Simple mixer control 'Channel Routing Mode Select',0
        Capabilities: volume volume-joined penum
        Playback channels: Mono
        Capture channels: Mono
        Limits: 0 - 6
        Mono: 0 [0%]
      Signed-off-by: default avatarPawel Moll <mail@pawelmoll.com>
      Signed-off-by: default avatarTakashi Iwai <tiwai@suse.de>
      [bwh: Backported to 3.2: adjust filename]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      7d3cbb43
    • Matt Fleming's avatar
      x86, efi: Make "noefi" really disable EFI runtime serivces · 1a116473
      Matt Fleming authored
      commit fb834c7a upstream.
      
      commit 1de63d60 ("efi: Clear EFI_RUNTIME_SERVICES rather than
      EFI_BOOT by "noefi" boot parameter") attempted to make "noefi" true to
      its documentation and disable EFI runtime services to prevent the
      bricking bug described in commit e0094244 ("samsung-laptop:
      Disable on EFI hardware"). However, it's not possible to clear
      EFI_RUNTIME_SERVICES from an early param function because
      EFI_RUNTIME_SERVICES is set in efi_init() *after* parse_early_param().
      
      This resulted in "noefi" effectively becoming a no-op and no longer
      providing users with a way to disable EFI, which is bad for those
      users that have buggy machines.
      Reported-by: default avatarWalt Nelson Jr <walt0924@gmail.com>
      Cc: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: default avatarMatt Fleming <matt.fleming@intel.com>
      Link: http://lkml.kernel.org/r/1361392572-25657-1-git-send-email-matt@console-pimps.orgSigned-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      [bwh: Backported to 3.2: efi_runtime_init() is not a separate function,
       so put a whole set of statements in an if (!disable_runtime) block]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      1a116473