1. 28 Oct, 2015 40 commits
    • nfs: fix pg_test page count calculation · f67da137
      Peng Tao authored
      [ Upstream commit 048883e0 ]
      
      We really want sizeof(struct page *) instead. Otherwise we limit
      maximum IO size to 64 pages rather than 512 pages on a 64bit system.
      
      Fixes 2e11f829(nfs: cap request size to fit a kmalloced page array).
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Peng Tao <tao.peng@primarydata.com>
      Fixes: 2e11f829 ("nfs: cap request size to fit a kmalloced page array")
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
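The two limits quoted in the nfs commit above follow from simple division, which can be checked in user space. This is a minimal sketch, not the NFS code: `entries_per_array` is a hypothetical helper, and the 4096-byte page, 64-byte `struct page` and 8-byte pointer are assumed values typical of a 64-bit system.

```c
#include <stddef.h>

/* Hypothetical helper: number of elements of elem_size bytes that fit
 * in one array of array_bytes bytes. */
size_t entries_per_array(size_t array_bytes, size_t elem_size)
{
    return array_bytes / elem_size;
}
```

Under those assumptions, `entries_per_array(4096, 64)` gives 64 and `entries_per_array(4096, 8)` gives 512, matching the figures in the commit message.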
    • netfilter: nf_log: don't zap all loggers on unregister · 8bf6c729
      Florian Westphal authored
      [ Upstream commit 205ee117 ]
      
      Like nf_log_unset, nf_log_unregister must not reset the list of loggers.
      Otherwise, a call to nf_log_unregister() will render loggers of other nf
      protocols unusable:
      
      iptables -A INPUT -j LOG
      modprobe nf_log_arp ; rmmod nf_log_arp
      iptables -A INPUT -j LOG
      iptables: No chain/target/match by that name
      
      Fixes: 30e0c6a6 ("netfilter: nf_log: prepare net namespace support for loggers")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
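The difference between zapping every logger and clearing only one's own can be modelled with a small table. This is a simplified user-space sketch, not the netfilter code: `struct logger`, `struct log_table`, `NPROTO` and both functions are hypothetical stand-ins for the per-protocol logger state the commit describes.

```c
#include <stddef.h>

#define NPROTO 4  /* assumed number of protocol slots, for illustration */

struct logger { const char *name; };

/* Active logger per protocol; a stand-in for the real logger list. */
struct log_table { const struct logger *active[NPROTO]; };

/* Problematic shape: wipes every slot, including other modules' loggers. */
void unregister_zap_all(struct log_table *t, const struct logger *l)
{
    (void)l;
    for (size_t i = 0; i < NPROTO; i++)
        t->active[i] = NULL;
}

/* Intended shape: clears only the slots that point at this logger. */
void unregister_own_only(struct log_table *t, const struct logger *l)
{
    for (size_t i = 0; i < NPROTO; i++)
        if (t->active[i] == l)
            t->active[i] = NULL;
}
```

With an IPv4 logger in slot 0 and an ARP logger in slot 1, unregistering the ARP logger through `unregister_own_only` leaves slot 0 intact, whereas `unregister_zap_all` empties both.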
    • netfilter: nf_log: Introduce nft_log_dereference() macro · 2f6e5594
      Marcelo Leitner authored
      [ Upstream commit 0c26ed1c ]
      
      Wrap up a common call pattern in an easier to handle call.

      Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • netfilter: nft_compat: skip family comparison in case of NFPROTO_UNSPEC · 98a38395
      Pablo Neira Ayuso authored
      [ Upstream commit ba378ca9 ]
      
      Fix lookup of existing match/target structures in the corresponding list
      by skipping the family check if NFPROTO_UNSPEC is used.
      
      Without this fix, the failed lookup results in the allocation and
      insertion of one match/target structure for each use of them. This not
      only bloats memory consumption but also severely affects the time to
      reload the ruleset from the iptables-compat utility.
      
      After this patch, iptables-compat-restore and iptables-compat take
      almost the same time to reload large rulesets.
      
      Fixes: 0ca743a5 ("netfilter: nf_tables: add compatibility layer for x_tables")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
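The lookup rule described in the nft_compat commit above can be sketched as follows. This is an illustration under assumptions, not the kernel code: `struct ext`, `ext_lookup` and `FAMILY_UNSPEC` are hypothetical names, with `FAMILY_UNSPEC` standing in for NFPROTO_UNSPEC.

```c
#include <stddef.h>
#include <string.h>

#define FAMILY_UNSPEC 0  /* stand-in for NFPROTO_UNSPEC */

struct ext { const char *name; int family; };

/* Return the cached entry for (name, family), or NULL if none matches.
 * An entry registered with FAMILY_UNSPEC matches any requested family,
 * so the family comparison is skipped for it. */
const struct ext *ext_lookup(const struct ext *cache, size_t n,
                             const char *name, int family)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(cache[i].name, name) != 0)
            continue;
        if (cache[i].family == FAMILY_UNSPEC || cache[i].family == family)
            return &cache[i];
    }
    return NULL;
}
```

Because the unspecified-family entry is found on every lookup, a caller would reuse it rather than allocate a fresh structure per rule, which is the memory and reload-time effect the commit message describes.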
    • netfilter: nf_log: wait for rcu grace after logger unregistration · 8dafc993
      Pablo Neira Ayuso authored
      [ Upstream commit ad5001cc ]
      
      The nf_log_unregister() function needs to call synchronize_rcu() to make sure
      that the objects are not dereferenced anymore on module removal.
      
      Fixes: 5962815a ("netfilter: nf_log: use an array of loggers instead of list")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • netfilter: ctnetlink: put back references to master ct and expect objects · ba1fa01d
      Pablo Neira Ayuso authored
      [ Upstream commit 95dd8653 ]
      
      We have to put back the references to the master conntrack and the expectation
      that we just created, otherwise we'll leak them.
      
      Fixes: 0ef71ee1 ("netfilter: ctnetlink: refactor ctnetlink_create_expect")
      Reported-by: Tim Wiess <Tim.Wiess@watchguard.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • netfilter: nf_conntrack: Support expectations in different zones · f17d9f15
      Joe Stringer authored
      [ Upstream commit 4b31814d ]
      
      When zones were originally introduced, the expectation functions were
      all extended to perform lookup using the zone. However, insertion was
      not modified to check the zone. This means that two expectations which
      are intended to apply for different connections that have the same tuple
      but exist in different zones cannot both be tracked.
      
      Fixes: 5d0aa2cc (netfilter: nf_conntrack: add support for "conntrack zones")
      Signed-off-by: Joe Stringer <joestringer@nicira.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
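The insertion check the conntrack commit above calls for can be reduced to one comparison. This is a deliberately simplified sketch: `struct expect` and `expectations_conflict` are hypothetical, and the tuple is collapsed to a single integer key, whereas a real conntrack tuple is a multi-field structure.

```c
#include <stdbool.h>
#include <stdint.h>

struct expect {
    uint32_t tuple;  /* simplified stand-in for a conntrack tuple */
    uint16_t zone;   /* conntrack zone id */
};

/* Two expectations conflict only if both the tuple and the zone match.
 * Comparing the tuple alone would reject an expectation that merely
 * shares its tuple with one in a different zone. */
bool expectations_conflict(const struct expect *a, const struct expect *b)
{
    return a->tuple == b->tuple && a->zone == b->zone;
}
```

Two expectations with the same tuple in zones 1 and 2 do not conflict under this rule, so both can be tracked.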
    • netfilter: nfnetlink: work around wrong endianess in res_id field · 3ad1bd82
      Pablo Neira Ayuso authored
      [ Upstream commit a9de9777 ]
      
      The convention in nfnetlink is to use network byte order in every header field
      as well as in the attribute payload. The initial version of the batching
      infrastructure assumes that res_id comes in host byte order though.
      
      The only client of the batching infrastructure is nf_tables, so let's add a
      workaround to address this inconsistency. We currently have 11 nfnetlink
      subsystems according to NFNL_SUBSYS_COUNT, so we can assume that the subsystem
      2560, i.e. htons(10), will not be allocated anytime soon, so it can be an alias
      of nf_tables from the nfnetlink batching path when interpreting the res_id
      field.
      
      Based on original patch from Florian Westphal.

      Reported-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
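The value 2560 in the nfnetlink commit above is simply 10 with its two bytes swapped, which is what htons(10) produces on a little-endian host. The sketch below shows that arithmetic and one possible shape of the alias check; `swap16`, `res_id_is_nftables` and `SUBSYS_NFTABLES` are hypothetical names, and the check is an assumption about the idea, not the kernel's code.

```c
#include <stdbool.h>
#include <stdint.h>

#define SUBSYS_NFTABLES 10  /* subsystem id implied by htons(10) above */

/* Swap the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Accept the subsystem id in either byte order. */
bool res_id_is_nftables(uint16_t res_id)
{
    return res_id == SUBSYS_NFTABLES || res_id == swap16(SUBSYS_NFTABLES);
}
```

Here `swap16(10)` is 2560 (0x000A becomes 0x0A00), so both 10 and 2560 are treated as the same subsystem.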
    • dm raid: fix round up of default region size · 545d2525
      Mikulas Patocka authored
      [ Upstream commit 042745ee ]
      
      Commit 3a0f9aae ("dm raid: round region_size to power of two")
      intended to make sure that the default region size is a power of two.
      However, the logic in that commit is incorrect and sets the variable
      region_size to 0 or 1, depending on whether min_region_size is a power
      of two.
      
      Fix this logic, using roundup_pow_of_two(), so that region_size is
      properly rounded up to the next power of two.

      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Fixes: 3a0f9aae ("dm raid: round region_size to power of two")
      Cc: stable@vger.kernel.org # v3.8+
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
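The rounding behaviour the dm raid commit above expects can be illustrated in user space. This is a minimal sketch of the technique, not the kernel's roundup_pow_of_two() implementation; `round_up_pow2` is a hypothetical name.

```c
#include <stdint.h>

/* Smallest power of two that is >= n.
 * Valid for 1 <= n <= 2^63; larger inputs are not supported here. */
uint64_t round_up_pow2(uint64_t n)
{
    uint64_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}
```

A value that is already a power of two is returned unchanged (`round_up_pow2(8192)` is 8192), and anything in between moves up to the next one (`round_up_pow2(8193)` is 16384).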
    • USB: option: add ZTE PIDs · 798191c2
      Liu.Zhao authored
      [ Upstream commit 19ab6bc5 ]
      
      Add ZTE device PIDs to the kernel.

      Signed-off-by: Liu.Zhao <lzsos369@163.com>
      Cc: stable <stable@vger.kernel.org>
      [johan: sort the new entries ]
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • staging: ion: fix corruption of ion_import_dma_buf · 606c9512
      Shawn Lin authored
      [ Upstream commit 6fa92e2b ]
      
      We found this issue, and it still exists in the latest kernel. Simply
      keep ion_handle_create under mutex_lock to avoid this race.
      
      WARNING: CPU: 2 PID: 2648 at drivers/staging/android/ion/ion.c:512 ion_handle_add+0xb4/0xc0()
      ion_handle_add: buffer already found.
      Modules linked in: iwlmvm iwlwifi mac80211 cfg80211 compat
      CPU: 2 PID: 2648 Comm: TimedEventQueue Tainted: G        W    3.14.0 #7
       00000000 00000000 9a3efd2c 80faf273 9a3efd6c 9a3efd5c 80935dc9 811d7fd3
       9a3efd88 00000a58 812208a0 00000200 80e128d4 80e128d4 8d4ae00c a8cd8600
       a8cd8094 9a3efd74 80935e0e 00000009 9a3efd6c 811d7fd3 9a3efd88 9a3efd9c
      Call Trace:
        [<80faf273>] dump_stack+0x48/0x69
        [<80935dc9>] warn_slowpath_common+0x79/0x90
        [<80e128d4>] ? ion_handle_add+0xb4/0xc0
        [<80e128d4>] ? ion_handle_add+0xb4/0xc0
        [<80935e0e>] warn_slowpath_fmt+0x2e/0x30
        [<80e128d4>] ion_handle_add+0xb4/0xc0
        [<80e144cc>] ion_import_dma_buf+0x8c/0x110
        [<80c517c4>] reg_init+0x364/0x7d0
        [<80993363>] ? futex_wait+0x123/0x210
        [<80992e0e>] ? get_futex_key+0x16e/0x1e0
        [<8099308f>] ? futex_wake+0x5f/0x120
        [<80c51e19>] vpu_service_ioctl+0x1e9/0x500
        [<80994aec>] ? do_futex+0xec/0x8e0
        [<80971080>] ? prepare_to_wait_event+0xc0/0xc0
        [<80c51c30>] ? reg_init+0x7d0/0x7d0
        [<80a22562>] do_vfs_ioctl+0x2d2/0x4c0
        [<80b198ad>] ? inode_has_perm.isra.41+0x2d/0x40
        [<80b199cf>] ? file_has_perm+0x7f/0x90
        [<80b1a5f7>] ? selinux_file_ioctl+0x47/0xf0
        [<80a227a8>] SyS_ioctl+0x58/0x80
        [<80fb45e8>] syscall_call+0x7/0x7
        [<80fb0000>] ? mmc_do_calc_max_discard+0xab/0xe4
      
      Fixes: 83271f62 ("ion: hold reference to handle...")
      Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
      Reviewed-by: Laura Abbott <labbott@redhat.com>
      Cc: stable <stable@vger.kernel.org> # 3.14+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • dm btree: add ref counting ops for the leaves of top level btrees · f2670858
      Joe Thornber authored
      [ Upstream commit b0dc3c8b ]
      
      When using nested btrees, the top leaves of the top levels contain
      block addresses for the root of the next tree down.  If we shadow a
      shared leaf node the leaf values (sub tree roots) should be incremented
      accordingly.
      
      This is only an issue if there is metadata sharing in the top levels,
      which only occurs if metadata snapshots are being used (as is possible
      with dm-thinp).  It could result in a block from the thinp metadata
      snap being reused early, thus corrupting the thinp metadata snap.

      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • svcrdma: Fix send_reply() scatter/gather set-up · e8b81595
      Chuck Lever authored
      [ Upstream commit 9d11b51c ]
      
      The Linux NFS server returns garbage in the data payload of inline
      NFS/RDMA READ replies. These are READs of under 1000 bytes or so
      where the client has not provided either a reply chunk or a write
      list.
      
      The NFS server delivers the data payload for an NFS READ reply to
      the transport in an xdr_buf page list. If the NFS client did not
      provide a reply chunk or a write list, send_reply() is supposed to
      set up a separate sge for the page containing the READ data, and
      another sge for XDR padding if needed, then post all of the sges via
      a single SEND Work Request.
      
      The problem is send_reply() does not advance through the xdr_buf
      when setting up scatter/gather entries for SEND WR. It always calls
      dma_map_xdr with xdr_off set to zero. When there's more than one
      sge, dma_map_xdr() sets up the SEND sge's so they all point to the
      xdr_buf's head.
      
      The current Linux NFS/RDMA client always provides a reply chunk or
      a write list when performing an NFS READ over RDMA. Therefore, it
      does not exercise this particular case. The Linux server has never
      had to use more than one extra sge for building RPC/RDMA replies
      with a Linux client.
      
      However, an NFS/RDMA client _is_ allowed to send small NFS READs
      without setting up a write list or reply chunk. The NFS READ reply
      fits entirely within the inline reply buffer in this case. This is
      perhaps a more efficient way of performing NFS READs that the Linux
      NFS/RDMA client may some day adopt.
      
      Fixes: b432e6b3 ('svcrdma: Change DMA mapping logic to . . .')
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=285
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
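The core of the svcrdma problem above is a running offset that was never advanced. The sketch below shows the intended pattern in isolation; `struct sge_desc` and `build_sges` are hypothetical, and the segments stand loosely for the head, page data and padding of a reply buffer rather than for the real xdr_buf layout.

```c
#include <stddef.h>

/* Simplified scatter/gather descriptor: where a segment starts in the
 * logical buffer and how long it is. */
struct sge_desc { size_t offset; size_t length; };

/* Fill out[] with one entry per segment, advancing the running offset
 * so that each entry starts where the previous one ended. Leaving the
 * offset at zero would make every entry point at the first segment. */
void build_sges(const size_t *seg_len, size_t nsegs, struct sge_desc *out)
{
    size_t off = 0;

    for (size_t i = 0; i < nsegs; i++) {
        out[i].offset = off;
        out[i].length = seg_len[i];
        off += seg_len[i];
    }
}
```

For segment lengths of 100, 512 and 2 bytes, the resulting offsets are 0, 100 and 612.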
    • ath10k: fix dma_mapping_error() handling · fa1b77ba
      Michal Kazior authored
      [ Upstream commit 5e55e3cb ]
      
      dma_mapping_error() returns 1 when DMA mapping fails. The
      driver would return bogus values and could
      possibly confuse itself if DMA failed.
      
      Fixes: 767d34fc ("ath10k: remove DMA mapping wrappers")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
      Signed-off-by: Kalle Valo <kvalo@qca.qualcomm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
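The pitfall in the ath10k commit above is passing a boolean-style failure indicator up the call chain as if it were an error code. The sketch below models that in user space; `fake_mapping_error` and `map_buffer` are hypothetical stand-ins, and the choice of -EIO is an assumption made for illustration.

```c
#include <errno.h>

/* Stand-in for a mapping check that reports failure as a nonzero value
 * rather than as a negative errno. */
int fake_mapping_error(int mapping_ok)
{
    return mapping_ok ? 0 : 1;
}

/* Translate the boolean-style result into a conventional negative error
 * code instead of returning the raw value to the caller. */
int map_buffer(int mapping_ok)
{
    if (fake_mapping_error(mapping_ok))
        return -EIO;
    return 0;
}
```

A caller that tests for `ret < 0` would miss a raw return value of 1, which is why the translation step matters.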
    • Btrfs: update fix for read corruption of compressed and shared extents · 089699ed
      Filipe Manana authored
      [ Upstream commit 808f80b4 ]
      
      My previous fix in commit 005efedf ("Btrfs: fix read corruption of
      compressed and shared extents") was effective only if the compressed
      extents cover a file range with a length that is not a multiple of 16
      pages. That's because the detection of when we reached a different range
      of the file that shares the same compressed extent as the previously
      processed range was done at extent_io.c:__do_contiguous_readpages(),
      which covers subranges with a length up to 16 pages, because
      extent_readpages() groups the pages in clusters no larger than 16 pages.
      So fix this by tracking the start of the previously processed file
      range's extent map at extent_readpages().
      
      The following test case for fstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
      
        rm -f $seqres.full
      
        test_clone_and_read_compressed_extent()
        {
            local mount_opts=$1
      
            _scratch_mkfs >>$seqres.full 2>&1
            _scratch_mount $mount_opts
      
            # Create our test file with a single extent of 64Kb that is going to
            # be compressed no matter which compression algo is used (zlib/lzo).
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
                $SCRATCH_MNT/foo | _filter_xfs_io
      
            # Now clone the compressed extent into an adjacent file offset.
            $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
                $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
            echo "File digest before unmount:"
            md5sum $SCRATCH_MNT/foo | _filter_scratch
      
            # Remount the fs or clear the page cache to trigger the bug in
            # btrfs. Because the extent has an uncompressed length that is a
            # multiple of 16 pages, all the pages belonging to the second range
            # of the file (64K to 128K), which points to the same extent as the
            # first range (0K to 64K), had their contents full of zeroes instead
            # of the byte 0xaa. This was a bug exclusively in the read path of
            # compressed extents, the correct data was stored on disk, btrfs
            # just failed to fill in the pages correctly.
            _scratch_remount
      
            echo "File digest after remount:"
            # Must match the digest we got before.
            md5sum $SCRATCH_MNT/foo | _filter_scratch
        }
      
        echo -e "\nTesting with zlib compression..."
        test_clone_and_read_compressed_extent "-o compress=zlib"
      
        _scratch_unmount
      
        echo -e "\nTesting with lzo compression..."
        test_clone_and_read_compressed_extent "-o compress=lzo"
      
        status=0
        exit
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: Timofey Titovets <nefelim4ag@gmail.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
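Why the earlier fix missed ranges whose length is a multiple of 16 pages can be seen in a toy model. This sketch is an assumption-laden simplification of the Btrfs commit above, not Btrfs code: `count_em_changes` is hypothetical, each page is represented only by the start offset of the extent map covering it, and clusters are assumed to begin at page indexes that are multiples of 16.

```c
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_PAGES 16    /* cluster size named in the commit message */
#define NO_PREV UINT64_MAX  /* sentinel: no extent map seen yet */

/* Count how often the extent-map start changes between consecutive
 * pages. em_start[i] is the start offset of the extent map covering
 * page i. If reset_per_cluster is nonzero, the remembered start is
 * forgotten at every cluster boundary (tracking scoped to one cluster);
 * otherwise it persists across clusters (tracking scoped to the whole
 * read request). */
size_t count_em_changes(const uint64_t *em_start, size_t npages,
                        int reset_per_cluster)
{
    uint64_t prev = NO_PREV;
    size_t changes = 0;

    for (size_t i = 0; i < npages; i++) {
        if (reset_per_cluster && i % CLUSTER_PAGES == 0)
            prev = NO_PREV;
        if (prev != NO_PREV && prev != em_start[i])
            changes++;
        prev = em_start[i];
    }
    return changes;
}
```

With two 16-page ranges (extent maps starting at 0 and 65536), the per-cluster variant sees no change because the switch falls exactly on a cluster boundary, while the persistent variant sees one. With a 2-page range followed by a 4-page range, both variants see the change.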
    • Btrfs: fix read corruption of compressed and shared extents · 3c62114f
      Filipe Manana authored
      [ Upstream commit 005efedf ]
      
      If a file has a range pointing to a compressed extent, followed by
      another range that points to the same compressed extent and a read
      operation attempts to read both ranges (either completely or part of
      them), the pages that correspond to the second range are incorrectly
      filled with zeroes.
      
      Consider the following example:
      
        File layout
        [0 - 8K]                      [8K - 24K]
            |                             |
            |                             |
         points to extent X,         points to extent X,
         offset 4K, length of 8K     offset 0, length 16K
      
        [extent X, compressed length = 4K uncompressed length = 16K]
      
      If a readpages() call spans the 2 ranges, a single bio to read the extent
      is submitted - extent_io.c:submit_extent_page() would only create a new
      bio to cover the second range pointing to the extent if the extent it
      points to had a different logical address than the extent associated with
      the first range. As a consequence, the compressed read end io
      handler (compression.c:end_compressed_bio_read()) finishes once the extent
      is decompressed into the pages covering the first range, leaving the
      remaining pages (belonging to the second range) filled with zeroes (done
      by compression.c:btrfs_clear_biovec_end()).
      
      So fix this by submitting the current bio whenever we find a range
      pointing to a compressed extent that was preceded by a range with a
      different extent map. This is the simplest solution for this corner
      case. Making the end io callback populate both ranges (or more, if we
      have multiple pointing to the same extent) is a much more complex
      solution since each bio is tightly coupled with a single extent map and
      the extent maps associated to the ranges pointing to the shared extent
      can have different offsets and lengths.
      
      The following test case for fstests triggers the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
      
        rm -f $seqres.full
      
        test_clone_and_read_compressed_extent()
        {
            local mount_opts=$1
      
            _scratch_mkfs >>$seqres.full 2>&1
            _scratch_mount $mount_opts
      
            # Create a test file with a single extent that is compressed (the
            # data we write into it is highly compressible no matter which
            # compression algorithm is used, zlib or lzo).
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K"        \
                            -c "pwrite -S 0xbb 4K 8K"        \
                            -c "pwrite -S 0xcc 12K 4K"       \
                            $SCRATCH_MNT/foo | _filter_xfs_io
      
            # Now clone our extent into an adjacent offset.
            $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
                $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
            # Same as before but for this file we clone the extent into a lower
            # file offset.
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K"         \
                            -c "pwrite -S 0xbb 12K 8K"        \
                            -c "pwrite -S 0xcc 20K 4K"        \
                            $SCRATCH_MNT/bar | _filter_xfs_io
      
            $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
                $SCRATCH_MNT/bar $SCRATCH_MNT/bar
      
            echo "File digests before unmounting filesystem:"
            md5sum $SCRATCH_MNT/foo | _filter_scratch
            md5sum $SCRATCH_MNT/bar | _filter_scratch
      
            # Evicting the inode or clearing the page cache before reading
            # again the file would also trigger the bug - reads were returning
            # all bytes in the range corresponding to the second reference to
            # the extent with a value of 0, but the correct data was persisted
            # (it was a bug exclusively in the read path). The issue happened
            # only if the same readpages() call targeted pages belonging to the
            # first and second ranges that point to the same compressed extent.
            _scratch_remount
      
            echo "File digests after mounting filesystem again:"
            # Must match the same digests we got before.
            md5sum $SCRATCH_MNT/foo | _filter_scratch
            md5sum $SCRATCH_MNT/bar | _filter_scratch
        }
      
        echo -e "\nTesting with zlib compression..."
        test_clone_and_read_compressed_extent "-o compress=zlib"
      
        _scratch_unmount
      
        echo -e "\nTesting with lzo compression..."
        test_clone_and_read_compressed_extent "-o compress=lzo"
      
        status=0
        exit
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • btrfs: skip waiting on ordered range for special files · b0849b6a
      Jeff Mahoney authored
      [ Upstream commit a30e577c ]
      
      In btrfs_evict_inode, we properly truncate the page cache for evicted
      inodes but then we call btrfs_wait_ordered_range for every inode as well.
      It's the right thing to do for regular files but results in incorrect
      behavior for device inodes for block devices.
      
      filemap_fdatawrite_range gets called with inode->i_mapping which gets
      resolved to the block device inode before getting passed to
      wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi.  What happens
      next depends on whether there's an open file handle associated with the
      inode.  If there is, we write to the block device, which is unexpected
      behavior.  If there isn't, we fall through normally and inode->i_data is used.
      We can also end up racing against open/close which can result in crashes
      when i_mapping points to a block device inode that has been closed.
      
      Since there can't be any page cache associated with special file inodes,
      it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
      the problem.
      
      Cc: <stable@vger.kernel.org>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
      Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • ASoC: dwc: correct irq clear method · b03abc8b
      Yitian Bu authored
      [ Upstream commit 4873867e ]
      
      According to the DesignWare I2S datasheet, the tx/rx XRUN irq is cleared
      by reading the TOR/ROR registers, rather than by writing to them.

      Signed-off-by: Yitian Bu <yitian.bu@tangramtek.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • ASoC: fix broken pxa SoC support · 8df65445
      Robert Jarzmik authored
      [ Upstream commit 3c8f7710 ]
      
      The previous fix of pxa library support, which was introduced to fix the
      library dependency, broke the previous SoC behavior, where a machine
      code binding pxa2xx-ac97 with a codec relied on:
       - sound/soc/pxa/pxa2xx-ac97.c
       - sound/soc/codecs/XXX.c
      
      For example, the mioa701_wm9713.c machine code is currently broken. The
      "select ARM" statement wrongly selects the soc/arm/pxa2xx-ac97 for
      compilation, because SND_PXA2XX_AC97 is unfortunately declared in both
      sound/arm/Kconfig and sound/soc/pxa/Kconfig.
      
      Fix this by ensuring that SND_PXA2XX_SOC correctly triggers the correct
      pxa2xx-ac97 compilation.
      
      Fixes: 846172df ("ASoC: fix SND_PXA2XX_LIB Kconfig warning")
      Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • ASoC: pxa: pxa2xx-ac97: fix dma requestor lines · b496c804
      Robert Jarzmik authored
      [ Upstream commit 8811191f ]
      
      PCM receive and transmit DMA requestor lines were swapped, breaking the
      PCM playback interface for PXA platforms using the sound/soc/ variant
      instead of the sound/arm variant.
      
      The commit below shows the inversion in the requestor lines.
      
      Fixes: d65a1458 ("ASoC: pxa: use snd_dmaengine_dai_dma_data")
      Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • ALSA: hda - Apply SPDIF pin ctl to MacBookPro 12,1 · acd1288e
      John Flatness authored
      [ Upstream commit e8ff581f ]
      
      The MacBookPro 12,1 has the same setup as the 11 for controlling the
      status of the optical audio light. Simply apply the existing workaround
      to the subsystem ID for the 12,1.
      
      [sorted the fixup entry by tiwai]
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=105401
      Signed-off-by: John Flatness <john@zerocrates.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • ALSA: hda: Add dock support for ThinkPad T550 · 91b15aa1
      Laura Abbott authored
      [ Upstream commit d05ea7da ]
      
      Much like all the other Lenovo laptops, add a quirk to make
      sound work with docking.
      
      Reported-and-tested-by: lacknerflo@gmail.com
      Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • ALSA: synth: Fix conflicting OSS device registration on AWE32 · 63758060
      Takashi Iwai authored
      [ Upstream commit 225db576 ]
      
      When OSS emulation is loaded on ISA SB AWE32 chip, we now get kernel
      warnings like:
        WARNING: CPU: 0 PID: 2791 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x51/0x80()
        sysfs: cannot create duplicate filename '/devices/isa/sbawe.0/sound/card0/seq-oss-0-0'
      
      It's because both emux synth and opl3 drivers try to register their
      OSS device object with the same static index number 0.  This hasn't
      been a big problem until the recent rewrite of device management code
      (that exposes sysfs at the same time), but it's been an obvious bug.
      
      This patch works around it just by using a different index number for the
      emux synth object.  There may be a more elegant fix, but this is
      enough for now, as this code won't be touched very often anyway.

      Reported-and-tested-by: Michael Shell <list1@michaelshell.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault · 204e65d3
      Mel Gorman authored
      [ Upstream commit 2f84a899 ]
      
      SunDong reported the following on
      
        https://bugzilla.kernel.org/show_bug.cgi?id=103841
      
      	I think I find a linux bug, I have the test cases is constructed. I
      	can stable recurring problems in fedora22(4.0.4) kernel version,
      	arch for x86_64.  I construct transparent huge page, when the parent
      	and child process with MAP_SHARE, MAP_PRIVATE way to access the same
      	huge page area, it has the opportunity to lead to huge page copy on
      	write failure, and then it will munmap the child corresponding mmap
      	area, but then the child mmap area with VM_MAYSHARE attributes, child
      	process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
      	functions (vma - > vm_flags & VM_MAYSHARE).
      
      There were a number of problems with the report (e.g.  it's hugetlbfs that
      triggers this, not transparent huge pages) but it was fundamentally
      correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
      looks like this
      
      	 vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
      	 next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
      	 prot 8000000000000027 anon_vma           (null) vm_ops ffffffff8182a7a0
      	 pgoff 0 file ffff88106bdb9800 private_data           (null)
      	 flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
      	 ------------
      	 kernel BUG at mm/hugetlb.c:462!
      	 SMP
      	 Modules linked in: xt_pkttype xt_LOG xt_limit [..]
      	 CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
      	 Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
      	 set_vma_resv_flags+0x2d/0x30
      
      The VM_BUG_ON is correct because private and shared mappings have
      different reservation accounting but the warning clearly shows that the
      VMA is shared.
      
      When a private COW fails to allocate a new page then only the process
      that created the VMA gets the page -- all the children unmap the page.
      If the children access that data in the future then they get killed.
      
      The problem is that the same file is mapped shared and private.  During
      the COW, the allocation fails, the VMAs are traversed to unmap the other
      private pages but a shared VMA is found and the bug is triggered.  This
      patch identifies such VMAs and skips them.
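
The skip described above can be sketched in userspace C. This is a minimal illustration, not the kernel code: `struct toy_vma`, `TOY_VM_MAYSHARE`, and `toy_unmap_private()` are hypothetical stand-ins for the kernel's VMA structure, its VM_MAYSHARE flag, and the traversal that unmaps the other private mappings after a failed COW allocation.

```c
#include <stddef.h>

/* Toy stand-in for the kernel flag; the value is illustrative only. */
#define TOY_VM_MAYSHARE 0x80UL

/* Hypothetical, heavily reduced VMA: just the flags and a result marker. */
struct toy_vma {
    unsigned long vm_flags;
    int unmapped;   /* set to 1 when the traversal unmaps this VMA's page */
};

/*
 * Walk every VMA that maps the file and unmap the page from all private
 * mappings except the faulting one. Shared mappings have their own
 * reservation accounting, so they are skipped.
 * Returns the number of VMAs that were unmapped.
 */
size_t toy_unmap_private(struct toy_vma *vmas, size_t n,
                         const struct toy_vma *faulting)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        struct toy_vma *v = &vmas[i];

        if (v == faulting)
            continue;               /* the faulting VMA keeps its page */
        if (v->vm_flags & TOY_VM_MAYSHARE)
            continue;               /* shared mapping: skip it */

        v->unmapped = 1;
        count++;
    }
    return count;
}
```

With one faulting private VMA, one shared VMA, and one other private VMA, only the last one is unmapped.
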
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: SunDong <sund_sky@126.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      204e65d3
    • ocfs2/dlm: fix deadlock when dispatch assert master · 13587ce1
      Joseph Qi authored
      [ Upstream commit 012572d4 ]
      
      The order of the following three spinlocks should be:
      dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock
      
      But dlm_dispatch_assert_master() is called while holding
      dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls
      dlm_grab() which will take dlm_domain_lock.
      
      If another thread (for example, dlm_query_join_handler) has already
      taken dlm_domain_lock and then tries to take dlm_ctxt->spinlock, a
      deadlock occurs.
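
The ordering rule can be expressed as a simple rank check. The sketch below is only an illustration of the rule, not ocfs2 code: the `toy_lock` ranks and `toy_lock_order_ok()` are hypothetical names that mirror the documented order.

```c
#include <stddef.h>

/* Hypothetical ranks mirroring the documented order; lower rank is taken first. */
enum toy_lock {
    TOY_DOMAIN_LOCK = 0,   /* stands in for dlm_domain_lock */
    TOY_CTXT_LOCK   = 1,   /* stands in for dlm_ctxt->spinlock */
    TOY_RES_LOCK    = 2    /* stands in for dlm_lock_resource->spinlock */
};

/*
 * Return 1 if the locks in seq[] are acquired in strictly increasing
 * rank (the documented order), 0 if any lock is taken out of order.
 */
int toy_lock_order_ok(const enum toy_lock *seq, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (seq[i] <= seq[i - 1])
            return 0;
    }
    return 1;
}
```

The problematic path takes the context lock, then the resource lock, then the domain lock, so the check rejects it. A thread that follows the documented order passes.
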
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: "Junxiao Bi" <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      13587ce1
    • spi: spi-pxa2xx: Check status register to determine if SSSR_TINT is disabled · f6d08014
      Tan, Jui Nee authored
      [ Upstream commit 02bc933e ]
      
      On Intel Baytrail, there are cases where the interrupt handler is called
      but no SPI message is captured. The RX FIFO is in fact empty when the RX
      timeout pending interrupt (SSSR_TINT) fires.
      
      To reproduce, use a BIOS version where both HSUART and SPI are on the same
      IRQ. Both drivers use IRQF_SHARED when calling request_irq(). When two
      separate and independent SPI and HSUART applications generate data traffic
      on both components, the user will see messages like the one below on the
      console:
      
        pxa2xx-spi pxa2xx-spi.0: bad message state in interrupt handler
      
      This commit fixes this by first checking whether the Receiver Time-out
      Interrupt is enabled; if it is disabled, the handler ignores the request
      and returns without servicing it.
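
A rough userspace sketch of that check on a shared IRQ line follows. The bit positions, the enable-bit name `TOY_SSCR1_TINTE`, and `toy_ssp_int()` are assumptions made for illustration; only SSSR_TINT is named in the text above.

```c
#include <stdint.h>

/* Illustrative bit positions; treat both values as assumptions. */
#define TOY_SSSR_TINT    (1u << 19)  /* RX timeout interrupt pending (status) */
#define TOY_SSCR1_TINTE  (1u << 19)  /* RX timeout interrupt enable (control) */

enum toy_irqreturn { TOY_IRQ_NONE = 0, TOY_IRQ_HANDLED = 1 };

/*
 * status: snapshot of the status register
 * sccr1:  snapshot of the control register holding the enable bit
 * mask:   status bits this driver normally services
 */
enum toy_irqreturn toy_ssp_int(uint32_t status, uint32_t sccr1, uint32_t mask)
{
    /* Ignore a pending RX timeout if that interrupt is disabled. */
    if (!(sccr1 & TOY_SSCR1_TINTE))
        mask &= ~TOY_SSSR_TINT;

    /* Nothing of interest is pending: the IRQ belongs to the other device. */
    if (!(status & mask))
        return TOY_IRQ_NONE;

    return TOY_IRQ_HANDLED;
}
```

Returning a "not mine" result lets the other handler on the shared line service its own interrupt without the SPI driver reporting a bad message state.
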
      Signed-off-by: Tan, Jui Nee <jui.nee.tan@intel.com>
      Acked-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      f6d08014
    • spi: xtensa-xtfpga: fix register endianness · 3cd1f376
      Max Filippov authored
      [ Upstream commit b0b48550 ]
      
      XTFPGA SPI controller has native endian registers.
      Fix register accessors so that they work in big-endian configurations.
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      3cd1f376
    • spi: Fix documentation of spi_alloc_master() · 9ab87815
      Guenter Roeck authored
      [ Upstream commit a394d635 ]
      
      Actually, spi_master_put() after spi_alloc_master() must _not_ be followed
      by kfree(). The memory is already freed with the call to spi_master_put()
      through spi_master_class, which registers a release function. Calling both
      spi_master_put() and kfree() results in often nasty (and delayed) crashes
      elsewhere in the kernel, often in the networking stack.
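
The ownership rule can be illustrated with a reference-counted object whose release function frees the memory. This is a userspace sketch under simplifying assumptions, not the SPI core: `toy_master`, `toy_alloc_master()`, and `toy_master_put()` are hypothetical names.

```c
#include <stdlib.h>

/* Counts how often the release function ran, for demonstration only. */
static int toy_release_calls;

struct toy_master {
    int refcount;
};

/* Allocate an object holding one reference, owned by the caller. */
struct toy_master *toy_alloc_master(void)
{
    struct toy_master *m = calloc(1, sizeof(*m));

    if (m)
        m->refcount = 1;
    return m;
}

/* Release function: this is the only place the memory is freed. */
static void toy_release(struct toy_master *m)
{
    toy_release_calls++;
    free(m);
}

/* Drop one reference; the last put frees the object via toy_release(). */
void toy_master_put(struct toy_master *m)
{
    if (m && --m->refcount == 0)
        toy_release(m);
}
```

After the final put the pointer is dangling, so an additional free on it would be a double free, which matches the delayed crashes described above.
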
      
      This reverts commit eb4af0f5.
      
      Link to patch and concerns: https://lkml.org/lkml/2012/9/3/269
      or
      http://lkml.iu.edu/hypermail/linux/kernel/1209.0/00790.html
      
      Alexey Klimov: This revert becomes valid after
      94c69f76 when spi-imx.c
      has been fixed and there is no need to call kfree() so comment
      for spi_alloc_master() should be fixed.
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      9ab87815
    • s390/boot/decompression: disable floating point in decompressor · 8ba742c0
      Christian Borntraeger authored
      [ Upstream commit adc0b7fb ]
      
      My gcc 5.1 used an ldgr instruction with a register != 0,2,4,6 for
      spilling/filling into a floating point register in our decompressor.
      
      This will cause an AFP-register data exception as the decompressor
      did not set up the additional floating point registers via cr0.
      That causes a program check loop that looked like a hang with
      one "Uncompressing Linux... " message (directly booted via kvm)
      or a loop of "Uncompressing Linux... " messages (when booted via
      zipl boot loader).
      
      The offending code in my build was
      
         48e400:       e3 c0 af ff ff 71       lay     %r12,-1(%r10)
      -->48e406:       b3 c1 00 1c             ldgr    %f1,%r12
         48e40a:       ec 6c 01 22 02 7f       clij    %r6,2,12,0x48e64e
      
      but gcc could do spilling into an fpr at any function. We can
      simply disable floating point support at that early stage.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      8ba742c0
    • s390/compat: correct uc_sigmask of the compat signal frame · cdd0e80e
      Martin Schwidefsky authored
      [ Upstream commit 8d4bd0ed ]
      
      The uc_sigmask in the ucontext structure is an array of words to keep
      the 64 signal bits (or 1024 if you ask glibc but the kernel sigset_t
      only has 64 bits).
      
      For 64 bit the sigset_t contains a single 8 byte word, but for 31 bit
      there are two 4 byte words. The compat signal handler code uses a
      simple copy of the 64 bit sigset_t to the 31 bit compat_sigset_t.
      As s390 is a big-endian architecture this is incorrect, the two words
      in the 31 bit sigset_t array need to be swapped.
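
The word order issue can be shown with a by-value conversion that does not depend on how the 64 bit word is laid out in memory. This is a minimal sketch; `toy_sigset_to_compat()` is a hypothetical helper, and it assumes that word 0 of the 31 bit set carries signals 1..32.

```c
#include <stdint.h>

/*
 * Split a 64-bit signal mask into two 32-bit compat words by value.
 * compat[0] receives signals 1..32 (the low 32 bits) and compat[1]
 * receives signals 33..64, regardless of host byte order.
 *
 * A plain byte copy of the 64-bit word on a big-endian machine would
 * place the high half first, which is the swap described above.
 */
void toy_sigset_to_compat(uint64_t set, uint32_t compat[2])
{
    compat[0] = (uint32_t)(set & 0xffffffffu);
    compat[1] = (uint32_t)(set >> 32);
}
```

For example, a mask with only signal 1 set ends up in word 0, and a mask with only signal 33 set ends up in word 1.
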
      
      Cc: <stable@vger.kernel.org>
      Reported-by: Stefan Liebler <stli@linux.vnet.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      cdd0e80e
    • sched/core: Fix TASK_DEAD race in finish_task_switch() · 9992c7d5
      Peter Zijlstra authored
      [ Upstream commit 95913d97 ]
      
      So the problem this patch is trying to address is as follows:
      
              CPU0                            CPU1
      
              context_switch(A, B)
                                              ttwu(A)
                                                LOCK A->pi_lock
                                                A->on_cpu == 0
              finish_task_switch(A)
                prev_state = A->state  <-.
                WMB                      |
                A->on_cpu = 0;           |
                UNLOCK rq0->lock         |
                                         |    context_switch(C, A)
                                         `--  A->state = TASK_DEAD
                prev_state == TASK_DEAD
                  put_task_struct(A)
                                              context_switch(A, C)
                                              finish_task_switch(A)
                                                A->state == TASK_DEAD
                                                  put_task_struct(A)
      
      The argument being that the WMB will allow the load of A->state on CPU0
      to cross over and observe CPU1's store of A->state, which will then
      result in a double-drop and use-after-free.
      
      Now the comment states (and this was true once upon a long time ago)
      that we need to observe A->state while holding rq->lock because that
      will order us against the wakeup; however the wakeup will not in fact
      acquire (that) rq->lock; it takes A->pi_lock these days.
      
      We can obviously fix this by upgrading the WMB to an MB, but that is
      expensive, so we'd rather avoid that.
      
      The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
      which avoids the MB on some archs, but not important ones like ARM.
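
The ordering idea can be approximated with C11 atomics, where a release store keeps earlier loads and stores in the same thread from being reordered after it. This is a userspace analogy, not the scheduler code: `struct toy_task` and `toy_finish_switch()` are hypothetical, and the C11 memory model is only a stand-in for the kernel's smp_store_release().

```c
#include <stdatomic.h>

/* Hypothetical reduced task: just the two fields from the diagram. */
struct toy_task {
    _Atomic long state;
    _Atomic int on_cpu;
};

/*
 * Read the previous task's state, then publish on_cpu = 0 with release
 * semantics. The release store guarantees that the load of state above
 * it cannot be reordered after the store, so a waker that observes
 * on_cpu == 0 cannot have its later state change seen by this load.
 * Returns the state that was observed.
 */
long toy_finish_switch(struct toy_task *prev)
{
    long prev_state = atomic_load_explicit(&prev->state, memory_order_relaxed);

    atomic_store_explicit(&prev->on_cpu, 0, memory_order_release);
    return prev_state;
}
```

The waker side would need a matching acquire load of on_cpu for the pairing to be complete; that half is omitted here.
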
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org> # v3.1+
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: manfred@colorfullife.com
      Cc: will.deacon@arm.com
      Fixes: e4a52bcb ("sched: Remove rq->lock from the first half of ttwu()")
      Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      9992c7d5
    • x86/xen: Support kexec/kdump in HVM guests by doing a soft reset · 113da0df
      Vitaly Kuznetsov authored
      [ Upstream commit 0b34a166 ]
      
      Currently there are a number of issues preventing PVHVM Xen guests from
      doing successful kexec/kdump:
      
        - Bound event channels.
        - Registered vcpu_info.
        - PIRQ/emuirq mappings.
        - shared_info frame after XENMAPSPACE_shared_info operation.
        - Active grant mappings.
      
      Basically, the newly booted kernel stumbles upon already set up Xen
      interfaces and there is no way to reestablish them. In Xen-4.7 a new
      feature called 'soft reset' is coming. A guest performing a kexec/kdump
      operation is supposed to call the SCHEDOP_shutdown hypercall with the
      SHUTDOWN_soft_reset reason before jumping to the new kernel. The
      hypervisor (with some help from the toolstack) will do a full domain
      cleanup (but keep its memory and vCPU contexts intact), returning the
      guest to the state it had when it was first booted and thus allowing it
      to start over.
      
      Doing SHUTDOWN_soft_reset on Xen hypervisors which don't support it is
      probably OK as by default all unknown shutdown reasons cause domain
      destroy with a message in toolstack log: 'Unknown shutdown reason code
      5. Destroying domain.'  which gives a clue to what the problem is and
      eliminates false expectations.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      113da0df
    • x86/mm: Set NX on gap between __ex_table and rodata · 4502d698
      Stephen Smalley authored
      [ Upstream commit ab76f7b4 ]
      
      Unused space between the end of __ex_table and the start of
      rodata can be left W+x in the kernel page tables.  Extend the
      setting of the NX bit to cover this gap by starting from
      text_end rather than rodata_start.
      
        Before:
        ---[ High Kernel Mapping ]---
        0xffffffff80000000-0xffffffff81000000          16M                               pmd
        0xffffffff81000000-0xffffffff81600000           6M     ro         PSE     GLB x  pmd
        0xffffffff81600000-0xffffffff81754000        1360K     ro                 GLB x  pte
        0xffffffff81754000-0xffffffff81800000         688K     RW                 GLB x  pte
        0xffffffff81800000-0xffffffff81a00000           2M     ro         PSE     GLB NX pmd
        0xffffffff81a00000-0xffffffff81b3b000        1260K     ro                 GLB NX pte
        0xffffffff81b3b000-0xffffffff82000000        4884K     RW                 GLB NX pte
        0xffffffff82000000-0xffffffff82200000           2M     RW         PSE     GLB NX pmd
        0xffffffff82200000-0xffffffffa0000000         478M                               pmd
      
        After:
        ---[ High Kernel Mapping ]---
        0xffffffff80000000-0xffffffff81000000          16M                               pmd
        0xffffffff81000000-0xffffffff81600000           6M     ro         PSE     GLB x  pmd
        0xffffffff81600000-0xffffffff81754000        1360K     ro                 GLB x  pte
        0xffffffff81754000-0xffffffff81800000         688K     RW                 GLB NX pte
        0xffffffff81800000-0xffffffff81a00000           2M     ro         PSE     GLB NX pmd
        0xffffffff81a00000-0xffffffff81b3b000        1260K     ro                 GLB NX pte
        0xffffffff81b3b000-0xffffffff82000000        4884K     RW                 GLB NX pte
        0xffffffff82000000-0xffffffff82200000           2M     RW         PSE     GLB NX pmd
        0xffffffff82200000-0xffffffffa0000000         478M                               pmd
      Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1443704662-3138-1-git-send-email-sds@tycho.nsa.gov
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      4502d698
    • x86/process: Add proper bound checks in 64bit get_wchan() · 09be2e41
      Thomas Gleixner authored
      [ Upstream commit eddd3826 ]
      
      Dmitry Vyukov reported the following using trinity and the memory
      error detector AddressSanitizer
      (https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel).
      
      [ 124.575597] ERROR: AddressSanitizer: heap-buffer-overflow on
      address ffff88002e280000
      [ 124.576801] ffff88002e280000 is located 131938492886538 bytes to
      the left of 28857600-byte region [ffffffff81282e0a, ffffffff82e0830a)
      [ 124.578633] Accessed by thread T10915:
      [ 124.579295] inlined in describe_heap_address
      ./arch/x86/mm/asan/report.c:164
      [ 124.579295] #0 ffffffff810dd277 in asan_report_error
      ./arch/x86/mm/asan/report.c:278
      [ 124.580137] #1 ffffffff810dc6a0 in asan_check_region
      ./arch/x86/mm/asan/asan.c:37
      [ 124.581050] #2 ffffffff810dd423 in __tsan_read8 ??:0
      [ 124.581893] #3 ffffffff8107c093 in get_wchan
      ./arch/x86/kernel/process_64.c:444
      
      The address checks in the 64bit implementation of get_wchan() are
      wrong in several ways:
      
       - The lower bound of the stack is not the start of the stack
         page. It's the start of the stack page plus sizeof (struct
         thread_info)
      
       - The upper bound must be:
      
             top_of_stack - TOP_OF_KERNEL_STACK_PADDING - 2 * sizeof(unsigned long).
      
         The 2 * sizeof(unsigned long) is required because the stack pointer
         points at the frame pointer. The layout on the stack is: ... IP FP
         ... IP FP. So we need to make sure that both IP and FP are in the
         bounds.
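
The two bounds listed above can be written down as a small check. This is a sketch under stated assumptions, not the kernel function: `toy_sp_in_bounds()` and its parameters are hypothetical stand-ins for the stack page address, THREAD_SIZE, TOP_OF_KERNEL_STACK_PADDING, and sizeof(struct thread_info).

```c
/*
 * Return 1 if sp lies inside the usable stack range, 0 otherwise.
 *
 * bottom = start of the stack page plus the thread_info at its base
 * top    = end of the stack, minus the top padding, minus room for the
 *          saved FP and IP that the stack pointer points at
 */
int toy_sp_in_bounds(unsigned long stack_start, unsigned long thread_size,
                     unsigned long top_padding, unsigned long thread_info_size,
                     unsigned long sp)
{
    unsigned long bottom = stack_start + thread_info_size;
    unsigned long top = stack_start + thread_size - top_padding
                        - 2 * sizeof(unsigned long);

    return sp >= bottom && sp <= top;
}
```

Using unsigned long throughout, as the commit message suggests, keeps the arithmetic the same on 32 bit and 64 bit builds.
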
      
      Fix the bound checks and get rid of the mix of numeric constants, u64
      and unsigned long. Making all unsigned long allows us to use the same
      function for 32bit as well.
      
      Use READ_ONCE() when accessing the stack. This does not prevent a
      concurrent wakeup of the task and the stack changing, but at least it
      avoids TOCTOU.
      
      Also check task state at the end of the loop. Again that does not
      prevent concurrent changes, but it avoids walking for nothing.
      
      Add proper comments while at it.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Based-on-patch-from: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bp@alien8.de>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: kasan-dev <kasan-dev@googlegroups.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20150930083302.694788319@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      09be2e41
    • x86/asm/entry: Create and use a 'TOP_OF_KERNEL_STACK_PADDING' macro · 6dbba213
      Andy Lutomirski authored
      [ Upstream commit 3ee4298f ]
      
      x86_32, unlike x86_64, pads the top of the kernel stack, because the
      hardware stack frame formats are variable in size.
      
      Document this padding and give it a name.
      
      This should make no change whatsoever to the compiled kernel
      image. It also doesn't fix any of the current bugs in this area.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Acked-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/02bf2f54b8dcb76a62a142b6dfe07d4ef7fc582e.1426009661.git.luto@amacapital.net
      [ Fixed small details, such as a missed magic constant in entry_32.S pointed out by Denys Vlasenko. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      6dbba213
    • x86/kexec: Fix kexec crash in syscall kexec_file_load() · 9a2a1db5
      Lee, Chun-Yi authored
      [ Upstream commit e3c41e37 ]
      
      The original bug is a page fault crash that sometimes happens
      on big machines when preparing ELF headers:
      
          BUG: unable to handle kernel paging request at ffffc90613fc9000
          IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260
      
      The bug is caused by us under-counting the number of memory ranges
      and subsequently not allocating enough ELF header space for them.
      The bug is typically masked on smaller systems, because the ELF header
      allocation is rounded up to the next page.
      
      This patch modifies the code in fill_up_crash_elf_data() by using
      walk_system_ram_res() instead of walk_system_ram_range() to correctly
      count the max number of crash memory ranges. That's because the
      walk_system_ram_range() filters out small memory regions that
      reside in the same page, but walk_system_ram_res() does not.
      
      Here's how I found the bug:
      
      Tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback()
      shows that the code uses walk_system_ram_res() to fill in crash memory
      region information in the program headers, so it counts those small
      memory regions that reside within a page.
      
      But, when the kernel was using walk_system_ram_range() in
      fill_up_crash_elf_data() to count the number of crash memory regions,
      it filters out small regions.
      
      I printed those small memory regions, for example:
      
        kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0
      
      Based on the code in walk_system_ram_range(), this memory region
      will be filtered out:
      
        pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
        end_pfn = (0x77592258 + 0xdc0 -1 + 1) >> 12 = 0x77593
        end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE
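
The same arithmetic can be checked with a few lines of C. This is a sketch of the rounding only, with hypothetical names (`toy_whole_pages()`, `TOY_PAGE_SHIFT`); it takes the start address and size from the printed example above and assumes 4 KiB pages.

```c
#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)

/*
 * Count the complete pages covered by [start, start + size).
 * The start is rounded up to a page boundary and the end is rounded
 * down, so a region that contains no complete page yields 0 and would
 * not be counted by a page-granular walk.
 */
unsigned long long toy_whole_pages(unsigned long long start,
                                   unsigned long long size)
{
    unsigned long long end = start + size - 1;   /* inclusive end address */
    unsigned long long pfn = (start + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;
    unsigned long long end_pfn = (end + 1) >> TOY_PAGE_SHIFT;

    return end_pfn > pfn ? end_pfn - pfn : 0;
}
```

For the region at 0x77592258 with size 0xdc0, both frame numbers come out as 0x77593, so the function returns 0 and the region is dropped from a page-based count even though it still needs a program header.
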
      
      So, the max_nr_ranges that's counted by the kernel doesn't include
      small memory regions - causing us to under-allocate the required space.
      That causes the page fault crash that happens in a later code path
      when preparing ELF headers.
      
      This bug is not easy to reproduce on small machines that have few
      CPUs, because the allocated page aligned ELF buffer has more free
      space to cover those small memory regions' PT_LOAD headers.
      Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: kexec@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1443531537-29436-1-git-send-email-jlee@suse.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      9a2a1db5
    • x86/efi: Fix boot crash by mapping EFI memmap entries bottom-up at runtime, instead of top-down · 3b8db56e
      Matt Fleming authored
      [ Upstream commit a5caa209 ]
      
      UEFI v2.5 introduced EFI_PROPERTIES_TABLE, which signals that
      the firmware PE/COFF loader supports splitting
      code and data sections of PE/COFF images into separate EFI
      memory map entries. This allows the kernel to map those regions
      with strict memory protections, e.g. EFI_MEMORY_RO for code,
      EFI_MEMORY_XP for data, etc.
      
      Unfortunately, an unwritten requirement of this new feature is
      that the regions need to be mapped with the same offsets
      relative to each other as observed in the EFI memory map. If
      this is not done, crashes like this may occur:
      
        BUG: unable to handle kernel paging request at fffffffefe6086dd
        IP: [<fffffffefe6086dd>] 0xfffffffefe6086dd
        Call Trace:
         [<ffffffff8104c90e>] efi_call+0x7e/0x100
         [<ffffffff81602091>] ? virt_efi_set_variable+0x61/0x90
         [<ffffffff8104c583>] efi_delete_dummy_variable+0x63/0x70
         [<ffffffff81f4e4aa>] efi_enter_virtual_mode+0x383/0x392
         [<ffffffff81f37e1b>] start_kernel+0x38a/0x417
         [<ffffffff81f37495>] x86_64_start_reservations+0x2a/0x2c
         [<ffffffff81f37582>] x86_64_start_kernel+0xeb/0xef
      
      Here 0xfffffffefe6086dd refers to an address the firmware
      expects to be mapped but which the OS never claimed was mapped.
      The issue is that included in these regions are relative
      addresses to other regions which were emitted by the firmware
      toolchain before the "splitting" of sections occurred at
      runtime.
      
      Needless to say, we don't satisfy this unwritten requirement on
      x86_64 and instead map the EFI memory map entries in reverse
      order. The above crash is almost certainly triggerable with any
      kernel newer than v3.13 because that's when we rewrote the EFI
      runtime region mapping code, in commit d2f7cbe7 ("x86/efi:
      Runtime services virtual mapping"). For kernel versions before
      v3.13 things may work by pure luck depending on the
      fragmentation of the kernel virtual address space at the time we
      map the EFI regions.
      
      Instead of mapping the EFI memory map entries in reverse order,
      where entry N has a higher virtual address than entry N+1, map
      them in the same order as they appear in the EFI memory map to
      preserve this relative offset between regions.
      
      This patch has been kept as small as possible with the intention
      that it should be applied aggressively to stable and
      distribution kernels. It is very much a bugfix rather than
      support for a new feature, since when EFI_PROPERTIES_TABLE is
      enabled we must map things as outlined above to even boot - we
      have no way of asking the firmware not to split the code/data
      regions.
      
      In fact, this patch doesn't even make use of the more strict
      memory protections available in UEFI v2.5. That will come later.
      Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chun-Yi <jlee@suse.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: James Bottomley <JBottomley@Odin.com>
      Cc: Lee, Chun-Yi <jlee@suse.com>
      Cc: Leif Lindholm <leif.lindholm@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Jones <pjones@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1443218539-7610-2-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      3b8db56e
    • Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS · 1e4f2890
      Dirk Müller authored
      [ Upstream commit d2922422 ]
      
      The cpu feature flags are not ever going to change, so warning
      every time can cause a lot of kernel log spam
      (in our case more than 10GB/hour).
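
The effect of a once-only warning can be sketched in userspace C. This is a simplified illustration, not the kernel macro: `TOY_WARN_ON_ONCE`, `toy_warn_count`, and `toy_skip_instruction()` are hypothetical, the macro does not return the condition, and the static flag is not thread-safe.

```c
#include <stdio.h>

/* Counts emitted warnings so the behaviour can be observed. */
static int toy_warn_count;

/*
 * Warn the first time cond is true at this call site, then stay silent.
 * Each expansion gets its own static flag.
 */
#define TOY_WARN_ON_ONCE(cond)                                  \
    do {                                                        \
        static int toy_warned;                                  \
        if ((cond) && !toy_warned) {                            \
            toy_warned = 1;                                     \
            toy_warn_count++;                                   \
            fprintf(stderr, "warning: %s\n", #cond);            \
        }                                                       \
    } while (0)

/* A feature flag that never changes would otherwise warn on every call. */
void toy_skip_instruction(int has_nrips)
{
    TOY_WARN_ON_ONCE(!has_nrips);
}
```

Calling `toy_skip_instruction(0)` a thousand times produces a single line on stderr instead of a thousand.
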
      
      The warning seems to only occur when nested virtualization is
      enabled, so it's probably triggered by a KVM bug.  This is a
      sensible and safe change anyway, and the KVM bug fix might not
      be suitable for stable releases anyway.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Dirk Mueller <dmueller@suse.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      1e4f2890
    • x86/paravirt: Replace the paravirt nop with a bona fide empty function · 7cb9685d
      Andy Lutomirski authored
      [ Upstream commit fc57a7c6 ]
      
      PARAVIRT_ADJUST_EXCEPTION_FRAME generates this code (using nmi as an
      example, trimmed for readability):
      
          ff 15 00 00 00 00       callq  *0x0(%rip)        # 2796 <nmi+0x6>
                    2792: R_X86_64_PC32     pv_irq_ops+0x2c
      
      That's a call through a function pointer to a regular C function that
      does nothing on native boots, but that function isn't protected
      against kprobes, isn't marked notrace, and is certainly not
      guaranteed to preserve any registers if the compiler is feeling
      perverse.  This is bad news for a CLBR_NONE operation.
      
      Of course, if everything works correctly, once paravirt ops are
      patched, it gets nopped out, but what if we hit this code before
      paravirt ops are patched in?  This can potentially cause breakage
      that is very difficult to debug.
      
      A more subtle failure is possible here, too: if _paravirt_nop uses
      the stack at all (even just to push RBP), it will overwrite the "NMI
      executing" variable if it's called in the NMI prologue.
      
      The Xen case, perhaps surprisingly, is fine, because it's already
      written in asm.
      
      Fix all of the cases that default to paravirt_nop (including
      adjust_exception_frame) with a big hammer: replace paravirt_nop with
      an asm function that is just a ret instruction.
      
      The Xen case may have other problems, so document them.
      
      This is part of a fix for some random crashes that Sasha saw.
      Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/8f5d2ba295f9d73751c33d97fda03e0495d9ade0.1442791737.git.luto@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      7cb9685d
    • x86/platform: Fix Geode LX timekeeping in the generic x86 build · 6ea48cdc
      David Woodhouse authored
      [ Upstream commit 03da3ff1 ]
      
      In 2007, commit 07190a08 ("Mark TSC on GeodeLX reliable")
      bypassed verification of the TSC on Geode LX. However, this code
      (now in the check_system_tsc_reliable() function in
      arch/x86/kernel/tsc.c) was only present if CONFIG_MGEODE_LX was
      set.
      
      OpenWRT has recently started building its generic Geode target
      for Geode GX, not LX, to include support for additional
      platforms. This broke the timekeeping on LX-based devices,
      because the TSC wasn't marked as reliable:
      https://dev.openwrt.org/ticket/20531
      
      By adding a runtime check on is_geode_lx(), we can also include
      the fix if CONFIG_MGEODEGX1 or CONFIG_X86_GENERIC are set, thus
      fixing the problem.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Cc: Andres Salomon <dilinger@queued.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1442409003.131189.87.camel@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      6ea48cdc