1. 06 Aug, 2015 11 commits
    • freeing unlinked file indefinitely delayed · f34efdfe
      Al Viro authored
      commit 75a6f82a upstream.
      
      	Normally opening a file, unlinking it and then closing will have
      the inode freed upon close() (provided that it's not otherwise busy and
      has no remaining links, of course).  However, there's one case where that
      does *not* happen.  Namely, if you open it by fhandle with cold dcache,
      then unlink() and close().
      
      	In the normal case, d_delete() in unlink(2) notices that the dentry
      is busy and unhashes it; on the final dput() it will be forcibly evicted from
      the dcache, triggering iput() and inode removal.  In this case, though, we end
      up with *two* dentries - a disconnected one (created by open-by-fhandle) and
      a regular one (used by unlink()).  The latter will have its reference to the
      inode dropped just fine, but the former will not - it's considered hashed (it
      is on the ->s_anon list), so it will stay around until memory pressure
      finally does it in.  As a result, the final iput() is delayed
      indefinitely.  It's trivial to reproduce -
      
      #define _GNU_SOURCE	/* name_to_handle_at(), open_by_handle_at() */
      #include <fcntl.h>	/* also struct file_handle, MAX_HANDLE_SZ */
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/stat.h>
      
      void flush_dcache(void)
      {
              system("mount -o remount,rw /");
      }
      
      static char buf[20 * 1024 * 1024];
      
      int main(void)
      {
              int fd;
              union {
                      struct file_handle f;
                      char buf[MAX_HANDLE_SZ];
              } x;
              int m;
      
              x.f.handle_bytes = sizeof(x);
              chdir("/root");
              mkdir("foo", 0700);
              fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
              close(fd);
              name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
              flush_dcache();
              fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
              unlink("foo/bar");
              write(fd, buf, sizeof(buf));
              system("df .");			/* 20Mb eaten */
              close(fd);
              system("df .");			/* should've freed those 20Mb */
              flush_dcache();
              system("df .");			/* should be the same as #2 */
      }
      
      will spit out something like
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 303843      1131 100% /
      Filesystem     1K-blocks   Used Available Use% Mounted on
      /dev/root         322023 283282     21692  93% /
      - the inode gets freed only when the dentry is finally evicted (here we trigger
      that by remount; normally it would have happened in response to memory
      pressure, hell knows when).
      Acked-by: J. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      [ kamal: backport to 3.19-stable: no fast_dput() ]
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • 9p: don't leave a half-initialized inode sitting around · 72069d8f
      Al Viro authored
      commit 0a73d0a2 upstream.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • hpfs: kstrdup() out of memory handling · be901bdf
      Sanidhya Kashyap authored
      commit ce657611 upstream.
      
      Under memory pressure the kstrdup() may fail, leaving new_opts unset, so
      return -ENOMEM in that case instead of proceeding.
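      
      A minimal userspace sketch of the pattern the fix applies (the function
      name is hypothetical and strdup() stands in for kstrdup()): bail out with
      -ENOMEM as soon as the duplication fails instead of using the NULL pointer
      later.
      
      #include <errno.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      
      static int remount_with_options(const char *data)
      {
              char *new_opts = strdup(data);	/* kstrdup() in the kernel */
      
              if (!new_opts)
                      return -ENOMEM;		/* the error path the fix adds */
      
              /* ... parse and apply new_opts ... */
              free(new_opts);
              return 0;
      }
      
      int main(void)
      {
              printf("%d\n", remount_with_options("uid=0,gid=0"));
              return 0;
      }
      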
      Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • ACPI / PNP: Reserve ACPI resources at the fs_initcall_sync stage · 3e47907f
      Rafael J. Wysocki authored
      commit 0294112e upstream.
      
      This effectively reverts the following three commits:
      
       7bc10388 ACPI / resources: free memory on error in add_region_before()
       0f1b414d ACPI / PNP: Avoid conflicting resource reservations
       b9a5e5e1 ACPI / init: Fix the ordering of acpi_reserve_resources()
      
      (commit b9a5e5e1 introduced regressions, some of which, but not
      all, were addressed by commit 0f1b414d, and commit 7bc10388
      was a fixup on top of the latter) and causes ACPI fixed hardware
      resources to be reserved at the fs_initcall_sync stage of system
      initialization.
      
      The story is as follows.  First, a boot regression was reported due
      to an apparent resource reservation ordering change after a commit
      that shouldn't lead to such changes.  Investigation led to the
      conclusion that the problem happened because acpi_reserve_resources()
      was executed at the device_initcall() stage of system initialization
      which wasn't strictly ordered with respect to driver initialization
      (and with respect to the initialization of the pcieport driver in
      particular), so a random change causing the device initcalls to be
      run in a different order might break things.
      
      The response to that was to attempt to run acpi_reserve_resources()
      as soon as we knew that ACPI would be in use (commit b9a5e5e1).
      However, that turned out to be too early, because it caused resource
      reservations made by the PNP system driver to fail on at least one
      system and that failure was addressed by commit 0f1b414d.
      
      That fix still turned out to be insufficient, though, because
      calling acpi_reserve_resources() before the fs_initcall stage of
      system initialization caused a boot regression to happen on the
      eCAFE EC-800-H20G/S netbook.  That meant that we could only call
      acpi_reserve_resources() at the fs_initcall initialization stage
      or later, but then we might just as well call it after the PNP
      initialization, in which case commit 0f1b414d wouldn't be
      necessary any more.
      
      For this reason, the changes made by commit 0f1b414d are reverted
      (along with a memory leak fixup on top of that commit), the changes
      made by commit b9a5e5e1 that went too far are reverted too and
      acpi_reserve_resources() is changed into fs_initcall_sync, which
      will cause it to be executed after the PNP subsystem initialization
      (which is an fs_initcall) and before device initcalls (including
      the pcieport driver initialization) which should avoid the initial
      issue.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=100581
      Link: http://marc.info/?t=143092384600002&r=1&w=2
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=99831
      Link: http://marc.info/?t=143389402600001&r=1&w=2
      Fixes: b9a5e5e1 "ACPI / init: Fix the ordering of acpi_reserve_resources()"
      Reported-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • ext4: replace open coded nofail allocation in ext4_free_blocks() · 61bd00c1
      Michal Hocko authored
      commit 7444a072 upstream.
      
      ext4_free_blocks is looping around the allocation request and mimics
      __GFP_NOFAIL behavior without any allocation fallback strategy. Let's
      remove the open-coded loop and replace it with __GFP_NOFAIL. Without the
      flag the allocator has no way to find out about the never-fail requirement
      and cannot help in any way.
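      
      As a userspace illustration of the difference (not the ext4 code; malloc()
      stands in for the kernel allocator), an open-coded retry loop like the one
      removed here keeps the "must not fail" requirement private to the caller,
      whereas __GFP_NOFAIL hands that requirement to the allocator itself:
      
      #include <stdio.h>
      #include <stdlib.h>
      
      /* What the removed code did in spirit: retry by hand, so the allocator
       * never learns that failure is not an option and cannot prioritize or
       * reclaim on our behalf. */
      static void *alloc_retry_by_hand(size_t size)
      {
              void *p;
      
              while ((p = malloc(size)) == NULL)
                      ;	/* spin until it succeeds */
              return p;
      }
      
      int main(void)
      {
              void *p = alloc_retry_by_hand(64);
      
              printf("allocated %p\n", p);
              free(p);
              return 0;
      }
      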
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • ext4: correctly migrate a file with a hole at the beginning · 4e792b2a
      Eryu Guan authored
      commit 8974fec7 upstream.
      
      Currently ext4_ind_migrate() doesn't correctly handle a file which
      contains a hole at the beginning of the file.  This causes the migration
      to be done incorrectly, and if there is a subsequent delayed allocation
      write to the "hole", it would reclaim the same data blocks again and
      result in fs corruption.
      
        # assuming 4k block size ext4, with delalloc enabled
        # skip the first block and write to the second block
        xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile
      
        # converting to indirect-mapped file, which would move the data blocks
        # to the beginning of the file, but extent status cache still marks
        # that region as a hole
        chattr -e /mnt/ext4/testfile
      
        # delayed allocation writes to the "hole", reclaim the same data block
        # again, results in i_blocks corruption
        xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
        umount /mnt/ext4
        e2fsck -nf /dev/sda6
        ...
        Inode 53, i_blocks is 16, should be 8.  Fix? no
        ...
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • ext4: be more strict when migrating to non-extent based file · 261c583a
      Eryu Guan authored
      commit d6f123a9 upstream.
      
      Currently the check in ext4_ind_migrate() is not enough before doing the
      real conversion:
      
      a) delayed allocated extents could bypass the check on eh->eh_entries
         and eh->eh_depth
      
      This can be demonstrated by this script
      
        xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
        chattr -e /mnt/ext4/testfile
      
      where testfile has two extents but is still converted to the non-extent
      based file format.
      
      b) only the extent length is checked but not the offset, which would result
         in data loss (delalloc) or fs corruption (nodelalloc), because a
         non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30)
         blocks
      
      This can be demonstrated by
      
        xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
        chattr -e /mnt/ext4/testfile
        sync
      
      If delalloc is enabled, dmesg prints
        EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
        EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
        EXT4-fs (dm-4): This should not happen!! Data will be lost
      
      If delalloc is disabled, e2fsck -nf shows corruption
        Inode 53, i_size is 5497558142976, should be 4096.  Fix? no
      
      Fix the two issues by
      
      a) forcing all delayed allocation blocks to be allocated before checking
         eh->eh_depth and eh->eh_entries
      b) ensuring that the last logical block of the extent is within the direct map
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • ext4: fix reservation release on invalidatepage for delalloc fs · e726980a
      Lukas Czerner authored
      commit 9705acd6 upstream.
      
      On a delalloc-enabled file system, the invalidatepage operation in
      ext4_da_page_release_reservation() wants to clear the delayed
      buffer and remove the extent covering the delayed buffer from the extent
      status tree.
      
      However, currently there is a bug where on systems with page size >
      block size we will always remove extents from the start of the page
      regardless of where the actual delayed buffers are positioned in the page.
      This leads to errors like this:
      
      EXT4-fs warning (device loop0): ext4_da_release_space:1225:
      ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
      blocks
      
      This can also cause data loss at writeback time if the file system is
      in an ENOSPC condition, because we're releasing the reservation for
      someone else's delayed buffer.
      
      Fix this by only removing extents that correspond to the part of the
      page we want to invalidate.
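      
      A small userspace sketch of the range arithmetic this relies on
      (hypothetical constants; this is not the ext4 code): with page size >
      block size, only the blocks covered by the invalidated byte range should
      have their reservation released, not the blocks from the start of the page.
      
      #include <stdio.h>
      
      #define PAGE_SIZE   65536U	/* e.g. 64k pages */
      #define BLOCK_SIZE   4096U	/* e.g. 4k ext4 blocks */
      
      int main(void)
      {
              unsigned int offset = 8192, length = 12288;	/* invalidated range */
              unsigned int first_blk = offset / BLOCK_SIZE;
              unsigned int last_blk  = (offset + length - 1) / BLOCK_SIZE;
      
              printf("release blocks %u..%u of this page, not 0..%u\n",
                     first_blk, last_blk, PAGE_SIZE / BLOCK_SIZE - 1);
              return 0;
      }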
      
      This problem is reproducible with the following fio recipe (however, I was
      only able to reproduce it with fio-2.1 or older):
      
      [global]
      bs=8k
      iodepth=1024
      iodepth_batch=60
      randrepeat=1
      size=1m
      directory=/mnt/test
      numjobs=20
      [job1]
      ioengine=sync
      bs=1k
      direct=1
      rw=randread
      filename=file1:file2
      [job2]
      ioengine=libaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job3]
      bs=1k
      ioengine=posixaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job5]
      bs=1k
      ioengine=sync
      rw=randread
      filename=file1:file2
      [job7]
      ioengine=libaio
      rw=randwrite
      filename=file1:file2
      [job8]
      ioengine=posixaio
      rw=randwrite
      filename=file1:file2
      [job10]
      ioengine=mmap
      rw=randwrite
      bs=1k
      filename=file1:file2
      [job11]
      ioengine=mmap
      rw=randwrite
      direct=1
      filename=file1:file2
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • Btrfs: fix fsync data loss after append write · f5880941
      Filipe Manana authored
      commit e4545de5 upstream.
      
      If we do an append write to a file (which increases its inode's i_size)
      that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
      and the previous transaction added a new hard link to the file, which sets
      the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
      the file, the inode's new i_size isn't logged. This has the consequence
      that after the fsync log is replayed, the file size remains what it was
      before the append write operation, which means users/applications will
      not be able to read the data that was successfully fsync'ed before.
      
      This happens because neither the inode item nor the delayed inode get
      their i_size updated when the append write is made - doing so would
      require starting a transaction in the buffered write path, something that
      we do not do intentionally for performance reasons.
      
      Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
      set the inode is logged with its current i_size (log the in-memory inode
      into the log tree).
      
      This issue is not a recent regression and is easy to reproduce with the
      following test case for fstests:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        here=`pwd`
        tmp=/tmp/$$
        status=1	# failure is the default!
      
        _cleanup()
        {
                _cleanup_flakey
                rm -f $tmp.*
        }
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _supported_fs generic
        _supported_os Linux
        _need_to_be_root
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        _crash_and_mount()
        {
                # Simulate a crash/power loss.
                _load_flakey_table $FLAKEY_DROP_WRITES
                _unmount_flakey
                # Allow writes again and mount. This makes the fs replay its fsync log.
                _load_flakey_table $FLAKEY_ALLOW_WRITES
                _mount_flakey
        }
      
        rm -f $seqres.full
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create the test file with some initial data and then fsync it.
        # The fsync here is only needed to trigger the issue in btrfs, as it causes
        # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                        -c "fsync" \
                        $SCRATCH_MNT/foo | _filter_xfs_io
        sync
      
        # Add a hard link to our file.
        # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
        # which is a necessary condition to trigger the issue.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
      
        # Sync the filesystem to force a commit of the current btrfs transaction, this
        # is a necessary condition to trigger the bug on btrfs.
        sync
      
        # Now append more data to our file, increasing its size, and fsync the file.
        # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
        # write path did not update the inode item in the btree nor the delayed inode
        # item (in-memory structure) in the current transaction (created by the fsync
        # handler), the fsync did not record the inode's new i_size in the fsync
        # log/journal. This made the data unavailable after the fsync log/journal is
        # replayed.
        $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
                     -c "fsync" \
                     $SCRATCH_MNT/foo | _filter_xfs_io
      
        echo "File content after fsync and before crash:"
        od -t x1 $SCRATCH_MNT/foo
      
        _crash_and_mount
      
        echo "File content after crash and log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        status=0
        exit
      
      The expected file output before and after the crash/power failure is the
      same, with the appended data available:
      
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0200000
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      [ luis: backported to 3.16: adjusted context ]
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • Btrfs: fix race between caching kthread and returning inode to inode cache · 13321418
      Filipe Manana authored
      commit ae9d8f17 upstream.
      
      While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
      we could have a concurrent call to btrfs_return_ino() that adds a new
      entry to the root's free space cache of pinned inodes. This concurrent
      call does not acquire the fs_info->commit_root_sem before adding a new
      entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
      because the caching kthread calls btrfs_unpin_free_ino() after setting
      the caching state to BTRFS_CACHE_FINISHED. It therefore races with the
      task calling btrfs_return_ino(): the latter adds a new entry while the
      former (the caching kthread) navigates the cache's rbtree, removing
      and freeing nodes from the cache's rbtree without acquiring the spinlock
      that protects the rbtree.
      
      This race resulted in memory corruption due to double free of struct
      btrfs_free_space objects because both tasks can end up freeing the
      same objects. Note that adding a new entry can result in merging it with
      other entries in the cache, in which case those entries are freed.
      This is particularly important as btrfs_free_space structures are also
      used for the block group free space caches.
      
      This memory corruption can be detected by a debugging kernel, which
      reports it with the following trace:
      
      [132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
      [132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [132408.505075]  ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
      [132408.505075]  ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
      [132408.505075]  ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
      [132408.505075] Call Trace:
      [132408.505075]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
      [132408.505075]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
      [132408.505075]  [<ffffffff81154e1e>] __slab_error.isra.28+0x25/0x36
      [132408.505075]  [<ffffffff81155733>] __cache_free+0xe2/0x4b6
      [132408.505075]  [<ffffffffa054a95a>] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
      [132408.505075]  [<ffffffffa0505b5f>] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
      [132408.505075]  [<ffffffff810f3b30>] ? time_hardirqs_off+0x15/0x28
      [132408.505075]  [<ffffffff81084d42>] ? trace_hardirqs_off+0xd/0xf
      [132408.505075]  [<ffffffff811563a1>] ? kfree+0xb6/0x14e
      [132408.505075]  [<ffffffff811563d0>] kfree+0xe5/0x14e
      [132408.505075]  [<ffffffffa0505b5f>] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
      [132408.505075]  [<ffffffffa0505e08>] caching_kthread+0x29e/0x2d9 [btrfs]
      [132408.505075]  [<ffffffffa0505b6a>] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
      [132408.505075]  [<ffffffff8106698f>] kthread+0xef/0xf7
      [132408.505075]  [<ffffffff810f3b08>] ? time_hardirqs_on+0x15/0x28
      [132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
      [132408.505075]  [<ffffffff814653d2>] ret_from_fork+0x42/0x70
      [132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
      [132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
      [132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
      [132409.503355] ------------[ cut here ]------------
      [132409.504241] kernel BUG at mm/slab.c:2571!
      
      Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
      that protects the rbtree while doing the searches and removing entries.
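      
      A minimal userspace analogue of the locking rule the fix applies (the
      names are hypothetical and a plain linked list stands in for the rbtree):
      every path that walks, removes, or frees entries must hold the same lock
      as the path that inserts them, otherwise the two threads can end up
      freeing or corrupting the same node.
      
      #include <pthread.h>
      #include <stdlib.h>
      
      struct entry {
              struct entry *next;
              long ino;
      };
      
      static struct entry *cache;		/* stands in for the rbtree */
      static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
      
      /* analogue of btrfs_return_ino() adding a pinned inode entry */
      static void *producer(void *arg)
      {
              (void)arg;
              for (long i = 0; i < 10000; i++) {
                      struct entry *e = calloc(1, sizeof(*e));
      
                      e->ino = i;
                      pthread_mutex_lock(&cache_lock);
                      e->next = cache;
                      cache = e;
                      pthread_mutex_unlock(&cache_lock);
              }
              return NULL;
      }
      
      /* analogue of btrfs_unpin_free_ino() draining the cache; the bug was
       * doing this walk without holding the lock */
      static void *consumer(void *arg)
      {
              (void)arg;
              for (int i = 0; i < 10000; i++) {
                      pthread_mutex_lock(&cache_lock);
                      struct entry *e = cache;
                      if (e)
                              cache = e->next;
                      pthread_mutex_unlock(&cache_lock);
                      free(e);		/* free(NULL) is a no-op */
              }
              return NULL;
      }
      
      int main(void)
      {
              pthread_t p, c;
      
              pthread_create(&p, NULL, producer, NULL);
              pthread_create(&c, NULL, consumer, NULL);
              pthread_join(p, NULL);
              pthread_join(c, NULL);
      
              while (cache) {			/* drain whatever is left */
                      struct entry *e = cache;
                      cache = cache->next;
                      free(e);
              }
              return 0;
      }
      
      (Build with "gcc -pthread"; in the kernel code the lock in question is the
      spinlock protecting the free space cache's rbtree, which this fix makes
      btrfs_unpin_free_ino() take as well.)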
      
      Fixes: 1c70d8fb ("Btrfs: fix inode caching vs tree log")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      [ luis: backported to 3.16: adjusted context ]
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
    • Btrfs: use kmem_cache_free when freeing entry in inode cache · 6c7f16ec
      Filipe Manana authored
      commit c3f4a168 upstream.
      
      The free space entries are allocated using kmem_cache_zalloc(),
      through __btrfs_add_free_space(), therefore we should use
      kmem_cache_free() and not kfree() to avoid any confusion and
      any potential problem. Looking at the kfree() definition at
      mm/slab.c it has the following comment:
      
        /*
         * (...)
         *
         * Don't free memory not originally allocated by kmalloc()
         * or you will run into trouble.
         */
      
      So better be safe and use kmem_cache_free().
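      
      As a userspace analogue of the rule this patch enforces (mmap()/munmap()
      standing in for kmem_cache_zalloc()/kmem_cache_free(); none of this is the
      btrfs code): memory must be returned through the interface that matches
      its allocator, and handing a mapping to free() would be the same category
      of bug as the kfree() removed here.
      
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      
      int main(void)
      {
              size_t len = 4096;
              char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
              if (p == MAP_FAILED)
                      return 1;
      
              strcpy(p, "allocated by mmap, so release with munmap");
              puts(p);
      
              /* free(p) here would be undefined behaviour (wrong allocator);
               * the matching release function is munmap(). */
              munmap(p, len);
              return 0;
      }
      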
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
  2. 04 Aug, 2015 1 commit
  3. 30 Jul, 2015 9 commits
  4. 28 Jul, 2015 1 commit
  5. 23 Jul, 2015 18 commits