Commits · 31ff0ceeb266a4ac96f3fc8cebb85df862a22f92 · Kirill Smelkov / linux

22 Aug, 2018 21 commits

target/iscsi: Allocate session IDs from an IDA · 31ff0cee

Matthew Wilcox authored Jun 19, 2018

Since the session is never looked up by ID, we can use the more
space-efficient IDA instead of the IDR.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

31ff0cee

iscsi target: fix session creation failure handling · 26abc916

Mike Christie authored Jul 26, 2018

The problem is that iscsi_login_zero_tsih_s1 sets conn->sess early in
iscsi_login_set_conn_values. If the function fails later like when we
alloc the idr it does kfree(sess) and leaves the conn->sess pointer set.
iscsi_login_zero_tsih_s1 then returns -Exyz and we then call
iscsi_target_login_sess_out and access the freed memory.

This patch has iscsi_login_zero_tsih_s1 either completely setup the
session or completely tear it down, so later in
iscsi_target_login_sess_out we can just check for it being set to the
connection.

Cc: stable@vger.kernel.org
Fixes: 0957627a ("iscsi-target: Fix sess allocation leak in...")
Signed-off-by: Mike Christie <mchristi@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>

26abc916

drm/vmwgfx: Convert to new IDA API · 4eb085e4

Matthew Wilcox authored Jun 18, 2018

Reorder allocation to avoid an awkward lock/unlock/lock sequence.
Simpler code due to being able to use ida_alloc_max(), even if we can't
eliminate the driver's spinlock.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Sinclair Yeh <syeh@vmware.com>

4eb085e4

dmaengine: Convert to new IDA API · 485258b4

Matthew Wilcox authored Jun 18, 2018

Simpler and shorter code.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Vinod Koul <vkoul@kernel.org>

485258b4

ppc: Convert vas ID allocation to new IDA API · cd38049e

Matthew Wilcox authored Jun 18, 2018

Removes a custom spinlock and simplifies the code.  Also fix an
error where we could allocate one ID too many.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

cd38049e

media: Convert entity ID allocation to new IDA API · 5a2ab034

Matthew Wilcox authored Jun 18, 2018

Removes a call to ida_pre_get().
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

5a2ab034

ppc: Convert mmu context allocation to new IDA API · b3fa6417

Matthew Wilcox authored Jun 18, 2018

ida_alloc_range is the perfect fit for this use case.  Eliminates
a custom spinlock, a call to ida_pre_get and a local check for the
allocated ID exceeding a maximum.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

b3fa6417

Convert net_namespace to new IDA API · 6e77cc47
Matthew Wilcox authored Jun 17, 2018
```
Signed-off-by: Matthew Wilcox <willy@infradead.org>
```
6e77cc47

cb710: Convert to new IDA API · 4c9ca2fd

Matthew Wilcox authored Jun 11, 2018

Eliminates the custom spinlock and the call to ida_pre_get.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>

4c9ca2fd

rsxx: Convert to new IDA API · 37ae133c

Matthew Wilcox authored Jun 11, 2018

Eliminate the custom spinlock and the call to ida_pre_get.
Also add a call to ida_free() in the card remove routine, which I believe
fixes a bug in this driver.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

37ae133c

osd: Convert to new IDA API · 5963e78d
Matthew Wilcox authored Jun 11, 2018
```
Slightly simpler code.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
```
5963e78d

sd: Convert to new IDA API · 94015080

Matthew Wilcox authored Jun 11, 2018

Allows us to remove an explicit spinlock.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

94015080

devpts: Convert to new IDA API · 0f0a0e54

Matthew Wilcox authored Jun 11, 2018

ida_alloc_max() matches what this driver wants to do. Also removes a
call to ida_pre_get(). We no longer need the protection of the mutex,
so convert pty_count to an atomic_t and remove the mutex entirely.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

0f0a0e54

fs: Convert namespace IDAs to new API · 169b480e

Matthew Wilcox authored Jun 11, 2018

We don't need to keep track of the starting value; the IDA is efficient.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

169b480e

fs: Convert unnamed_dev_ida to new API · 5a66847e

Matthew Wilcox authored Jun 11, 2018

The new API is much easier for this user.  Also add kerneldoc for
get_anon_bdev().
Signed-off-by: Matthew Wilcox <willy@infradead.org>

5a66847e

mtip32xx: Convert to new IDA API · 3aed4bc1

Matthew Wilcox authored Mar 21, 2018

Removes a use of ida_pre_get() and a personalised spinlock.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

3aed4bc1

ida: Add new API · 5ade60dd

Matthew Wilcox authored Mar 20, 2018

Add ida_alloc(), ida_alloc_min(), ida_alloc_max(), ida_alloc_range()
and ida_free().  The ida_alloc_max() and ida_alloc_range() functions
differ from ida_simple_get() in that they take an inclusive 'max'
parameter instead of an exclusive 'end' parameter.  Callers are about
evenly split whether they'd like inclusive or exclusive parameters and
'max' is easier to document than 'end'.

Change the IDA allocation to first attempt to allocate a bit using
existing memory, and only allocate memory afterwards.  Also change the
behaviour of 'min' > INT_MAX from being a BUG() to returning -ENOSPC.

Leave compatibility wrappers in place for ida_simple_get() and
ida_simple_remove() to avoid changing all callers.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

5ade60dd

ida: Lock the IDA in ida_destroy · 50d97d50

Matthew Wilcox authored Jun 21, 2018

The user has no need to handle locking between ida_simple_get() and
ida_simple_remove(). They shouldn't be forced to think about whether
ida_destroy() might be called at the same time as any of their other
IDA manipulation calls. Improve the documnetation while I'm in here.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

50d97d50

radix-tree: Fix UBSAN warning · 76f070b4

Matthew Wilcox authored Aug 18, 2018

get_slot_offset() can be called with a NULL 'parent' argument.
In this case, the calculated value will not be used, but calculating
it is undefined.  Rather than fixing the caller (__radix_tree_delete)
to not call get_slot_offset(), make get_slot_offset() robust against
being called with a NULL parent.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

76f070b4

radix tree test suite: Enable ubsan · d1c0d5e3

Matthew Wilcox authored May 19, 2018

Add support for the undefined behaviour sanitizer and fix the bugs
that ubsan pointed out.  Nothing major, and all in the test suite,
not the code.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

d1c0d5e3

radix tree test suite: Fix compilation · c9b93352

Matthew Wilcox authored Aug 21, 2018

An include of xarray.h was added to lib/idr.c without updating the test
suite.
Signed-off-by: Matthew Wilcox <willy@infradead.org>

c9b93352

12 Aug, 2018 4 commits

Linux 4.18 · 94710cac
Linus Torvalds authored Aug 12, 2018

94710cac

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 921195d3

Linus Torvalds authored Aug 12, 2018

Pull SCSI fixes from James Bottomley:
 "Eight fixes.

  The most important one is the mpt3sas fix which makes the driver work
  again on big endian systems. The rest are mostly minor error path or
  checker issues and the vmw_scsi one fixes a performance problem"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: vmw_pvscsi: Return DID_RESET for status SAM_STAT_COMMAND_TERMINATED
  scsi: sr: Avoid that opening a CD-ROM hangs with runtime power management enabled
  scsi: mpt3sas: Swap I/O memory read value back to cpu endianness
  scsi: fcoe: clear FC_RP_STARTED flags when receiving a LOGO
  scsi: fcoe: drop frames in ELS LOGO error path
  scsi: fcoe: fix use-after-free in fcoe_ctlr_els_send
  scsi: qedi: Fix a potential buffer overflow
  scsi: qla2xxx: Fix memory leak for allocating abort IOCB

921195d3

init: rename and re-order boot_cpu_state_init() · b5b1404d

Linus Torvalds authored Aug 12, 2018

This is purely a preparatory patch for upcoming changes during the 4.19
merge window.

We have a function called "boot_cpu_state_init()" that isn't really
about the bootup cpu state: that is done much earlier by the similarly
named "boot_cpu_init()" (note lack of "state" in name).

This function initializes some hotplug CPU state, and needs to run after
the percpu data has been properly initialized.  It even has a comment to
that effect.

Except it _doesn't_ actually run after the percpu data has been properly
initialized.  On x86 it happens to do that, but on at least arm and
arm64, the percpu base pointers are initialized by the arch-specific
'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init().

This had some unexpected results, and in particular we have a patch
pending for the merge window that did the obvious cleanup of using
'this_cpu_write()' in the cpu hotplug init code:

  -       per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
  +       this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);

which is obviously the right thing to do.  Except because of the
ordering issue, it actually failed miserably and unexpectedly on arm64.

So this just fixes the ordering, and changes the name of the function to
be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu
hotplug state, because the core CPU state was supposed to have already
been done earlier.

Marked for stable, since the (not yet merged) patch that will show this
problem is marked for stable.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Mian Yousaf Kaukab <yousaf.kaukab@suse.com>
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b5b1404d

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · d6dd6431

Linus Torvalds authored Aug 12, 2018

Pull vfs fixes from Al Viro:
 "A bunch of race fixes, mostly around lazy pathwalk.

  All of it is -stable fodder, a large part going back to 2013"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make sure that __dentry_kill() always invalidates d_seq, unhashed or not
  fix __legitimize_mnt()/mntput() race
  fix mntput/mntput race
  root dentries need RCU-delayed freeing

d6dd6431

11 Aug, 2018 9 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · ec0c9671

Linus Torvalds authored Aug 11, 2018

Pull networking fixes from David Miller:
 "Last bit of straggler fixes...

  1) Fix btf library licensing to LGPL, from Martin KaFai lau.

  2) Fix error handling in bpf sockmap code, from Daniel Borkmann.

  3) XDP cpumap teardown handling wrt. execution contexts, from Jesper
     Dangaard Brouer.

  4) Fix loss of runtime PM on failed vlan add/del, from Ivan
     Khoronzhuk.

  5) xen-netfront caches skb_shinfo(skb) across a __pskb_pull_tail()
     call, which potentially changes the skb's data buffer, and thus
     skb_shinfo(). Fix from Juergen Gross"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  xen/netfront: don't cache skb_shinfo()
  net: ethernet: ti: cpsw: fix runtime_pm while add/kill vlan
  net: ethernet: ti: cpsw: clear all entries when delete vid
  xdp: fix bug in devmap teardown code path
  samples/bpf: xdp_redirect_cpu adjustment to reproduce teardown race easier
  xdp: fix bug in cpumap teardown code path
  bpf, sockmap: fix cork timeout for select due to epipe
  bpf, sockmap: fix leak in bpf_tcp_sendmsg wait for mem path
  bpf, sockmap: fix bpf_tcp_sendmsg sock error handling
  bpf: btf: Change tools/lib/bpf/btf to LGPL

ec0c9671

xen/netfront: don't cache skb_shinfo() · d472b3a6

Juergen Gross authored Aug 09, 2018

skb_shinfo() can change when calling __pskb_pull_tail(): Don't cache
its return value.

Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d472b3a6

Merge branch 'cpsw-runtime-pm-fix' · 556fdd85

David S. Miller authored Aug 11, 2018

Grygorii Strashko says:

====================
net: ethernet: ti: cpsw: fix runtime pm while add/del reserved vid

Here 2 not critical fixes for:
- vlan ale table leak while error if deleting vlan (simplifies next fix)
- runtime pm while try to set reserved vlan
====================
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

556fdd85

net: ethernet: ti: cpsw: fix runtime_pm while add/kill vlan · 803c4f64

Ivan Khoronzhuk authored Aug 10, 2018

It's exclusive with normal behaviour but if try to set vlan to one of
the reserved values is made, the cpsw runtime pm is broken.

Fixes: a6c5d14f ("drivers: net: cpsw: ndev: fix accessing to suspended device")
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

803c4f64

net: ethernet: ti: cpsw: clear all entries when delete vid · be35b982

Ivan Khoronzhuk authored Aug 10, 2018

In cases if some of the entries were not found in forwarding table
while killing vlan, the rest not needed entries still left in the
table. No need to stop, as entry was deleted anyway. So fix this by
returning error only after all was cleaned. To implement this, return
-ENOENT in cpsw_ale_del_mcast() as it's supposed to be.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

be35b982

zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature · 4f7a7bea

Minchan Kim authored Aug 10, 2018

If zram supports writeback feature, it's no longer a
BD_CAP_SYNCHRONOUS_IO device beause zram does asynchronous IO operations
for incompressible pages.

Do not pretend to be synchronous IO device.  It makes the system very
sluggish due to waiting for IO completion from upper layers.

Furthermore, it causes a user-after-free problem because swap thinks the
opearion is done when the IO functions returns so it can free the page
(e.g., lock_page_or_retry and goto out_release in do_swap_page) but in
fact, IO is asynchronous so the driver could access a just freed page
afterward.

This patch fixes the problem.

  BUG: Bad page state in process qemu-system-x86  pfn:3dfab21
  page:ffffdfb137eac840 count:0 mapcount:0 mapping:0000000000000000 index:0x1
  flags: 0x17fffc000000008(uptodate)
  raw: 017fffc000000008 dead000000000100 dead000000000200 0000000000000000
  raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
  bad because of flags: 0x8(uptodate)
  CPU: 4 PID: 1039 Comm: qemu-system-x86 Tainted: G    B 4.18.0-rc5+ #1
  Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
  Call Trace:
    dump_stack+0x5c/0x7b
    bad_page+0xba/0x120
    get_page_from_freelist+0x1016/0x1250
    __alloc_pages_nodemask+0xfa/0x250
    alloc_pages_vma+0x7c/0x1c0
    do_swap_page+0x347/0x920
    __handle_mm_fault+0x7b4/0x1110
    handle_mm_fault+0xfc/0x1f0
    __get_user_pages+0x12f/0x690
    get_user_pages_unlocked+0x148/0x1f0
    __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
    try_async_pf+0x87/0x230 [kvm]
    tdp_page_fault+0x132/0x290 [kvm]
    kvm_mmu_page_fault+0x74/0x570 [kvm]
    kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
    kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
    do_vfs_ioctl+0xa2/0x630
    ksys_ioctl+0x70/0x80
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x55/0x100
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Link: https://lore.kernel.org/lkml/0516ae2d-b0fd-92c5-aa92-112ba7bd32fc@contabo.de/
Link: http://lkml.kernel.org/r/20180802051112.86174-1-minchan@kernel.org
[minchan@kernel.org: fix changelog, add comment]
 Link: https://lore.kernel.org/lkml/0516ae2d-b0fd-92c5-aa92-112ba7bd32fc@contabo.de/
 Link: http://lkml.kernel.org/r/20180802051112.86174-1-minchan@kernel.org
 Link: http://lkml.kernel.org/r/20180805233722.217347-1-minchan@kernel.org
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Tino Lehnig <tino.lehnig@contabo.de>
Tested-by: Tino Lehnig <tino.lehnig@contabo.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>	[4.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4f7a7bea

mm/memory.c: check return value of ioremap_prot · 24eee1e4

jie@chenjie6@huwei.com authored Aug 10, 2018

ioremap_prot() can return NULL which could lead to an oops.

Link: http://lkml.kernel.org/r/1533195441-58594-1-git-send-email-chenjie6@huawei.comSigned-off-by: chen jie <chenjie6@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: chenjie <chenjie6@huawei.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

24eee1e4

lib/ubsan: remove null-pointer checks · 3ca17b1f

Andrey Ryabinin authored Aug 10, 2018

With gcc-8 fsanitize=null become very noisy.  GCC started to complain
about things like &a->b, where 'a' is NULL pointer.  There is no NULL
dereference, we just calculate address to struct member.  It's
technically undefined behavior so UBSAN is correct to report it.  But as
long as there is no real NULL-dereference, I think, we should be fine.

-fno-delete-null-pointer-checks compiler flag should protect us from any
consequences.  So let's just no use -fsanitize=null as it's not useful
for us.  If there is a real NULL-deref we will see crash.  Even if
userspace mapped something at NULL (root can do this), with things like
SMAP should catch the issue.

Link: http://lkml.kernel.org/r/20180802153209.813-1-aryabinin@virtuozzo.comSigned-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3ca17b1f

MAINTAINERS: GDB: update e-mail address · 5832fcf9

Kieran Bingham authored Aug 10, 2018

This entry was created with my personal e-mail address. Update this entry
to my open-source kernel.org account.

Link: http://lkml.kernel.org/r/20180806143904.4716-4-kieran.bingham@ideasonboard.comSigned-off-by: Kieran Bingham <kbingham@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5832fcf9

10 Aug, 2018 2 commits

Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · f313b43b

Linus Torvalds authored Aug 10, 2018

Pull i2c fix from Wolfram Sang:
 "A single driver bugfix for I2C.

  The bug was found by systematically stress testing the driver, so I am
  confident to merge it that late in the cycle although it is probably
  unusually large"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: xlp9xx: Fix case where SSIF read transaction completes early

f313b43b

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · e91e2189

David S. Miller authored Aug 09, 2018

Daniel Borkmann says:

====================
pull-request: bpf 2018-08-10

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix cpumap and devmap on teardown as they're under RCU context
   and won't have same assumption as running under NAPI protection,
   from Jesper.

2) Fix various sockmap bugs in bpf_tcp_sendmsg() code, e.g. we had
   a bug where socket error was not propagated correctly, from Daniel.

3) Fix incompatible libbpf header license for BTF code and match it
   before it gets officially released with the rest of libbpf which
   is LGPL-2.1, from Martin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e91e2189

09 Aug, 2018 4 commits

make sure that __dentry_kill() always invalidates d_seq, unhashed or not · 4c0d7cd5

Al Viro authored Aug 09, 2018

RCU pathwalk relies upon the assumption that anything that changes
->d_inode of a dentry will invalidate its ->d_seq.  That's almost
true - the one exception is that the final dput() of already unhashed
dentry does *not* touch ->d_seq at all.  Unhashing does, though,
so for anything we'd found by RCU dcache lookup we are fine.
Unfortunately, we can *start* with an unhashed dentry or jump into
it.

We could try and be careful in the (few) places where that could
happen.  Or we could just make the final dput() invalidate the damn
thing, unhashed or not.  The latter is much simpler and easier to
backport, so let's do it that way.
Reported-by: "Dae R. Jeong" <threeearcat@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4c0d7cd5

fix __legitimize_mnt()/mntput() race · 119e1ef8

Al Viro authored Aug 09, 2018

__legitimize_mnt() has two problems - one is that in case of success
the check of mount_lock is not ordered wrt preceding increment of
refcount, making it possible to have successful __legitimize_mnt()
on one CPU just before the otherwise final mntpu() on another,
with __legitimize_mnt() not seeing mntput() taking the lock and
mntput() not seeing the increment done by __legitimize_mnt().
Solved by a pair of barriers.

Another is that failure of __legitimize_mnt() on the second
read_seqretry() leaves us with reference that'll need to be
dropped by caller; however, if that races with final mntput()
we can end up with caller dropping rcu_read_lock() and doing
mntput() to release that reference - with the first mntput()
having freed the damn thing just as rcu_read_lock() had been
dropped.  Solution: in "do mntput() yourself" failure case
grab mount_lock, check if MNT_DOOMED has been set by racing
final mntput() that has missed our increment and if it has -
undo the increment and treat that as "failure, caller doesn't
need to drop anything" case.

It's not easy to hit - the final mntput() has to come right
after the first read_seqretry() in __legitimize_mnt() *and*
manage to miss the increment done by __legitimize_mnt() before
the second read_seqretry() in there.  The things that are almost
impossible to hit on bare hardware are not impossible on SMP
KVM, though...
Reported-by: Oleg Nesterov <oleg@redhat.com>
Fixes: 48a066e7 ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

119e1ef8

fix mntput/mntput race · 9ea0a46c

Al Viro authored Aug 09, 2018

mntput_no_expire() does the calculation of total refcount under mount_lock;
unfortunately, the decrement (as well as all increments) are done outside
of it, leading to false positives in the "are we dropping the last reference"
test.  Consider the following situation:
	* mnt is a lazy-umounted mount, kept alive by two opened files.  One
of those files gets closed.  Total refcount of mnt is 2.  On CPU 42
mntput(mnt) (called from __fput()) drops one reference, decrementing component
	* After it has looked at component #0, the process on CPU 0 does
mntget(), incrementing component #0, gets preempted and gets to run again -
on CPU 69.  There it does mntput(), which drops the reference (component #69)
and proceeds to spin on mount_lock.
	* On CPU 42 our first mntput() finishes counting.  It observes the
decrement of component #69, but not the increment of component #0.  As the
result, the total it gets is not 1 as it should've been - it's 0.  At which
point we decide that vfsmount needs to be killed and proceed to free it and
shut the filesystem down.  However, there's still another opened file
on that filesystem, with reference to (now freed) vfsmount, etc. and we are
screwed.

It's not a wide race, but it can be reproduced with artificial slowdown of
the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.

Fix consists of moving the refcount decrement under mount_lock; the tricky
part is that we want (and can) keep the fast case (i.e. mount that still
has non-NULL ->mnt_ns) entirely out of mount_lock.  All places that zero
mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
before that mntput().  IOW, if mntput() observes (under rcu_read_lock())
a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
be dropped.
Reported-by: Jann Horn <jannh@google.com>
Tested-by: Jann Horn <jannh@google.com>
Fixes: 48a066e7 ("RCU'd vsfmounts")
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9ea0a46c

Merge branch 'bpf-fix-cpu-and-devmap-teardown' · 9c954201

Daniel Borkmann authored Aug 09, 2018

Jesper Dangaard Brouer says:

====================
Removing entries from cpumap and devmap, goes through a number of
syncronization steps to make sure no new xdp_frames can be enqueued.
But there is a small chance, that xdp_frames remains which have not
been flushed/processed yet.  Flushing these during teardown, happens
from RCU context and not as usual under RX NAPI context.

The optimization introduced in commt 389ab7f0 ("xdp: introduce
xdp_return_frame_rx_napi"), missed that the flush operation can also
be called from RCU context.  Thus, we cannot always use the
xdp_return_frame_rx_napi call, which take advantage of the protection
provided by XDP RX running under NAPI protection.

The samples/bpf xdp_redirect_cpu have a --stress-mode, that is
adjusted to easier reproduce (verified by Red Hat QA).
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

9c954201