1. 13 Dec, 2017 12 commits
    • Mike Snitzer's avatar
      dm: remove BIOSET_NEED_RESCUER based dm_offload infrastructure · 4a3f54d9
      Mike Snitzer authored
      Now that all of DM has been revised and/or verified to no longer require
      the use of BIOSET_NEED_RESCUER the dm_offload code may be removed.
      Suggested-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      4a3f54d9
    • Mike Snitzer's avatar
      dm: safely allocate multiple bioset bios · 318716dd
      Mike Snitzer authored
      DM targets can request multiple bios be sent to them by DM core (see:
      num_{flush,discard,write_same,write_zeroes}_bios).  But until now these
      bios were allocated in an unsafe manner than could potentially exhaust
      the DM device's bioset -- in the face of multiple threads each trying to
      do multiple allocations from the same DM device's bioset.
      
      Fix __send_duplicate_bios() by using the new alloc_multiple_bios().  The
      allocation strategy used by alloc_multiple_bios() models that used by
      dm-crypt.c:crypt_alloc_buffer().
      
      Neil Brown initially proposed this fix but the implementation has been
      revised enough that it inappropriate to attribute the entirety of it to
      him.
      Suggested-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      318716dd
    • NeilBrown's avatar
      dm: remove unused 'num_write_bios' target interface · f31c21e4
      NeilBrown authored
      No DM target provides num_write_bios and none has since dm-cache's
      brief use in 2013.
      
      Having the possibility of num_write_bios > 1 complicates bio
      allocation.  So remove the interface and assume there is only one bio
      needed.
      
      If a target ever needs more, it must provide a suitable bioset and
      allocate itself based on its particular needs.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      f31c21e4
    • NeilBrown's avatar
      dm: ensure bio submission follows a depth-first tree walk · 18a25da8
      NeilBrown authored
      A dm device can, in general, represent a tree of targets, each of which
      handles a sub-range of the range of blocks handled by the parent.
      
      The bio sequencing managed by generic_make_request() requires that bios
      are generated and handled in a depth-first manner.  Each call to a
      make_request_fn() may submit bios to a single member device, and may
      submit bios for a reduced region of the same device as the
      make_request_fn.
      
      In particular, any bios submitted to member devices must be expected to
      be processed in order, so a later one must never wait for an earlier
      one.
      
      This ordering is usually achieved by using bio_split() to reduce a bio
      to a size that can be completely handled by one target, and resubmitting
      the remainder to the originating device. bio_queue_split() shows the
      canonical approach.
      
      dm doesn't follow this approach, largely because it has needed to split
      bios since long before bio_split() was available.  It currently can
      submit bios to separate targets within the one dm_make_request() call.
      Dependencies between these targets, as can happen with dm-snap, can
      cause deadlocks if either bios gets stuck behind the other in the queues
      managed by generic_make_request().  This requires the 'rescue'
      functionality provided by dm_offload_{start,end}.
      
      Some of this requirement can be removed by changing the order of bio
      submission to follow the canonical approach.  That is, if dm finds that
      it needs to split a bio, the remainder should be sent to
      generic_make_request() rather than being handled immediately.  This
      delays the handling until the first part is completely processed, so the
      deadlock problems do not occur.
      
      __split_and_process_bio() can be called both from dm_make_request() and
      from dm_wq_work().  When called from dm_wq_work() the current approach
      is perfectly satisfactory as each bio will be processed immediately.
      When called from dm_make_request(), current->bio_list will be non-NULL,
      and in this case it is best to create a separate "clone" bio for the
      remainder.
      
      When we use bio_clone_bioset() to split off the front part of a bio
      and chain the two together and submit the remainder to
      generic_make_request(), it is important that the newly allocated
      bio is used as the head to be processed immediately, and the original
      bio gets "bio_advance()"d and sent to generic_make_request() as the
      remainder.  Otherwise, if the newly allocated bio is used as the
      remainder, and if it then needs to be split again, then the next
      bio_clone_bioset() call will be made while holding a reference a bio
      (result of the first clone) from the same bioset.  This can potentially
      exhaust the bioset mempool and result in a memory allocation deadlock.
      
      Note that there is no race caused by reassigning cio.io->bio after already
      calling __map_bio().  This bio will only be dereferenced again after
      dec_pending() has found io->io_count to be zero, and this cannot happen
      before the dec_pending() call at the end of __split_and_process_bio().
      
      To provide the clone bio when splitting, we use q->bio_split.  This
      was previously being freed by bio-based dm to avoid having excess
      rescuer threads.  As bio_split bio sets no longer create rescuer
      threads, there is little cost and much gain from restoring the
      q->bio_split bio set.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      18a25da8
    • NeilBrown's avatar
      dm io: remove BIOSET_NEED_RESCUER flag from bios bioset · c110a4b6
      NeilBrown authored
      The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
      do two allocations from the one bioset, and the second one could block
      until the first bio completes.
      
      dm_io() is called from make_request_fn() context.  The closest it comes
      to multiple allocations is in chunk_io() in dm-snap-persistent.  But
      there the code uses a separate thread to avoid problems.
      
      So BIOSET_NEED_RESCUER is not needed.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      c110a4b6
    • NeilBrown's avatar
      dm crypt: remove BIOSET_NEED_RESCUER flag · 80cd1757
      NeilBrown authored
      The BIOSET_NEED_RESCUER flag is only needed when a make_request_fn might
      do two allocations from the one bioset, and the second one could block
      until the first bio completes.
      
      dm-crypt does allocate from this bioset inside the dm make_request_fn,
      but does so using GFP_NOWAIT so that the allocation will not block.
      
      So BIOSET_NEED_RESCUER is not needed.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      80cd1757
    • NeilBrown's avatar
      dm: fix comment above dm_accept_partial_bio · c06b3e58
      NeilBrown authored
      Clarify that dm_accept_partial_bio isn't allowed for REQ_OP_ZONE_RESET
      bios.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      c06b3e58
    • Heinz Mauelshagen's avatar
      dm raid: use rs_is_raid*() · 552aa679
      Heinz Mauelshagen authored
      Cleanup, no functional change.
      Signed-off-by: default avatarHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      552aa679
    • Heinz Mauelshagen's avatar
      dm raid: simplify rs_get_progress() · 7c29744e
      Heinz Mauelshagen authored
      No need to calculate the reshaping progress because
      mddev->curr_resync_completed holds it.
      Signed-off-by: default avatarHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      7c29744e
    • Heinz Mauelshagen's avatar
      dm raid: ensure 'a' chars during reshape · dc15b943
      Heinz Mauelshagen authored
      During reshape, 'A' chars were reported in status rather than 'a'.
      Signed-off-by: default avatarHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      dc15b943
    • Heinz Mauelshagen's avatar
      dm raid: stop keeping raid set frozen altogether · 11e47232
      Heinz Mauelshagen authored
      In order to avoid redoing synchronization/recovery/reshape partially,
      the raid set got frozen until after all passed in table line flags had
      been cleared.  The related table reload sequence had to be precisely
      followed, or reshaping may lead to data corruption caused by the active
      mapping carrying on with a reshape when the inactive mapping already
      had retrieved a stale reshape position.
      
      Harden by retrieving the actual resync/recovery/reshape position
      during resume whilst the active table is suspended thus avoiding
      to keep the raid set frozen altogether.  This prevents superfluous
      redoing of an already resynchronized or recovered segment and,
      most importantly, potential for redoing of an already reshaped
      segment causing data corruption.
      
      Fixes: d39f0010 ("dm raid: fix raid_resume() to keep raid set frozen as needed")
      Signed-off-by: default avatarHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      11e47232
    • Heinz Mauelshagen's avatar
      dm raid: validate current raid sets redundancy · 53bf5384
      Heinz Mauelshagen authored
      Verifying the current raid sets redundancy based on retrieved
      superblock content has to use the superblock's raid level (e.g. raid0),
      not the constructor requested one (e.g. raid10).
      
      Using the requested raid level of raid10 lead to a "divide error"
      on raid0 which defines data copies divided by to be zero.
      
      Also check for bogus data copies.
      Signed-off-by: default avatarHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      53bf5384
  2. 08 Dec, 2017 13 commits
  3. 04 Dec, 2017 2 commits
    • monty_pavel@sina.com's avatar
      dm: fix various targets to dm_register_target after module __init resources created · 7e6358d2
      monty_pavel@sina.com authored
      A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
      processes race to load the dm-thin-pool module:
      
       PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
        #0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
        0000001 [ffff883cd743d660] crash_kexec at ffffffff810c5992
        0000002 [ffff883cd743d730] oops_end at ffffffff81515c90
        0000003 [ffff883cd743d760] no_context at ffffffff81049f1b
        0000004 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
        0000005 [ffff883cd743d800] bad_area at ffffffff8104a2ce
        0000006 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
        0000007 [ffff883cd743d950] do_page_fault at ffffffff81517bae
        0000008 [ffff883cd743d980] page_fault at ffffffff81514f95
           [exception RIP: kmem_cache_alloc+108]
           RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046
           RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0
           RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
           RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000
           R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0
           R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246
           ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
        0000009 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
       0000010 [ffff883cd743da80] mempool_create_node at ffffffff81122083
       0000011 [ffff883cd743dad0] mempool_create at ffffffff811220f4
       0000012 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
       0000013 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
       0000014 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
       0000015 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]
      
      The race results in a NULL pointer because:
      
      Process A (vgchange -ay -K):
       	a. send DM_LIST_VERSIONS_CMD ioctl;
       	b. pool_target not registered;
       	c. modprobe dm_thin_pool and wait until end.
      
      Process B (vgchange -ay -K):
       	a. send DM_LIST_VERSIONS_CMD ioctl;
       	b. pool_target registered;
       	c. table_load->dm_table_add_target->pool_ctr;
       	d. _new_mapping_cache is NULL and panic.
      Note:
       	1. process A and process B are two concurrent processes.
       	2. pool_target can be detected by process B but
       	_new_mapping_cache initialization has not ended.
      
      To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
      with the same problem, simply dm_register_target() after all resources
      created during module init (as labelled with __init) are finished.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarmonty <monty_pavel@sina.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      7e6358d2
    • Mike Snitzer's avatar
      dm table: fix regression from improper dm_dev_internal.count refcount_t conversion · afc567a4
      Mike Snitzer authored
      Multiple refcounts are needed if the device was already added.  The
      micro-optimization of setting the refcount to 1 on first added (rather
      than fall thru to a common refcount_inc) lost sight of the fact that the
      refcount_inc is also needed for the case when the device already exists
      and the mode need not be upgraded.
      
      Fixes: 2a0b4682 ("dm: convert dm_dev_internal.count from atomic_t to refcount_t")
      Reported-by: default avatarZdenek Kabelac <zkabelac@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      afc567a4
  4. 03 Dec, 2017 5 commits
  5. 02 Dec, 2017 4 commits
    • Linus Torvalds's avatar
      Merge tag 'nfs-for-4.15-2' of git://git.linux-nfs.org/projects/anna/linux-nfs · 2db767d9
      Linus Torvalds authored
      Pull NFS client fixes from Anna Schumaker:
       "These patches fix a problem with compiling using an old version of
        gcc, and also fix up error handling in the SUNRPC layer.
      
         - NFSv4: Ensure gcc 4.4.4 can compile initialiser for
           "invalid_stateid"
      
         - SUNRPC: Allow connect to return EHOSTUNREACH
      
         - SUNRPC: Handle ENETDOWN errors"
      
      * tag 'nfs-for-4.15-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
        SUNRPC: Handle ENETDOWN errors
        SUNRPC: Allow connect to return EHOSTUNREACH
        NFSv4: Ensure gcc 4.4.4 can compile initialiser for "invalid_stateid"
      2db767d9
    • Linus Torvalds's avatar
      Merge tag 'xfs-4.15-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux · 788c1da0
      Linus Torvalds authored
      Pull xfs fixes from Darrick Wong:
       "Here are some bug fixes for 4.15-rc2.
      
         - fix memory leaks that appeared after removing ifork inline data
           buffer
      
         - recover deferred rmap update log items in correct order
      
         - fix memory leaks when buffer construction fails
      
         - fix memory leaks when bmbt is corrupt
      
         - fix some uninitialized variables and math problems in the quota
           scrubber
      
         - add some omitted attribution tags on the log replay commit
      
         - fix some UBSAN complaints about integer overflows with large sparse
           files
      
         - implement an effective inode mode check in online fsck
      
         - fix log's inability to retry quota item writeout due to transient
           errors"
      
      * tag 'xfs-4.15-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
        xfs: Properly retry failed dquot items in case of error during buffer writeback
        xfs: scrub inode mode properly
        xfs: remove unused parameter from xfs_writepage_map
        xfs: ubsan fixes
        xfs: calculate correct offset in xfs_scrub_quota_item
        xfs: fix uninitialized variable in xfs_scrub_quota
        xfs: fix leaks on corruption errors in xfs_bmap.c
        xfs: fortify xfs_alloc_buftarg error handling
        xfs: log recovery should replay deferred ops in order
        xfs: always free inline data before resetting inode fork during ifree
      788c1da0
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-4.15-rc2_cleanups' of... · e1ba1c99
      Linus Torvalds authored
      Merge tag 'riscv-for-linus-4.15-rc2_cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux
      
      Pull RISC-V cleanups and ABI fixes from Palmer Dabbelt:
       "This contains a handful of small cleanups that are a result of
        feedback that didn't make it into our original patch set, either
        because the feedback hadn't been given yet, I missed the original
        emails, or we weren't ready to submit the changes yet.
      
        I've been maintaining the various cleanup patch sets I have as their
        own branches, which I then merged together and signed. Each merge
        commit has a short summary of the changes, and each branch is based on
        your latest tag (4.15-rc1, in this case). If this isn't the right way
        to do this then feel free to suggest something else, but it seems sane
        to me.
      
        Here's a short summary of the changes, roughly in order of how
        interesting they are.
      
         - libgcc.h has been moved from include/lib, where it's the only
           member, to include/linux. This is meant to avoid tab completion
           conflicts.
      
         - VDSO entries for clock_get/gettimeofday/getcpu have been added.
           These are simple syscalls now, but we want to let glibc use them
           from the start so we can make them faster later.
      
         - A VDSO entry for instruction cache flushing has been added so
           userspace can flush the instruction cache.
      
         - The VDSO symbol versions for __vdso_cmpxchg{32,64} have been
           removed, as those VDSO entries don't actually exist.
      
         - __io_writes has been corrected to respect the given type.
      
         - A new READ_ONCE in arch_spin_is_locked().
      
         - __test_and_op_bit_ord() is now actually ordered.
      
         - Various small fixes throughout the tree to enable allmodconfig to
           build cleanly.
      
         - Removal of some dead code in our atomic support headers.
      
         - Improvements to various comments in our atomic support headers"
      
      * tag 'riscv-for-linus-4.15-rc2_cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/linux: (23 commits)
        RISC-V: __io_writes should respect the length argument
        move libgcc.h to include/linux
        RISC-V: Clean up an unused include
        RISC-V: Allow userspace to flush the instruction cache
        RISC-V: Flush I$ when making a dirty page executable
        RISC-V: Add missing include
        RISC-V: Use define for get_cycles like other architectures
        RISC-V: Provide stub of setup_profiling_timer()
        RISC-V: Export some expected symbols for modules
        RISC-V: move empty_zero_page definition to C and export it
        RISC-V: io.h: type fixes for warnings
        RISC-V: use RISCV_{INT,SHORT} instead of {INT,SHORT} for asm macros
        RISC-V: use generic serial.h
        RISC-V: remove spin_unlock_wait()
        RISC-V: `sfence.vma` orderes the instruction cache
        RISC-V: Add READ_ONCE in arch_spin_is_locked()
        RISC-V: __test_and_op_bit_ord should be strongly ordered
        RISC-V: Remove smb_mb__{before,after}_spinlock()
        RISC-V: Remove __smp_bp__{before,after}_atomic
        RISC-V: Comment on why {,cmp}xchg is ordered how it is
        ...
      e1ba1c99
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 4b1967c9
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "The critical one here is a fix for fpsimd register corruption across
        signals which was introduced by the SVE support code (the register
        files overlap), but the others are worth having as well.
      
        Summary:
      
         - Fix FP register corruption when SVE is not available or in use
      
         - Fix out-of-tree module build failure when CONFIG_ARM64_MODULE_PLTS=y
      
         - Missing 'const' generating errors with LTO builds
      
         - Remove unsupported events from Cortex-A73 PMU description
      
         - Removal of stale and incorrect comments"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: context: Fix comments and remove pointless smp_wmb()
        arm64: cpu_ops: Add missing 'const' qualifiers
        arm64: perf: remove unsupported events for Cortex-A73
        arm64: fpsimd: Fix failure to restore FPSIMD state after signals
        arm64: pgd: Mark pgd_cache as __ro_after_init
        arm64: ftrace: emit ftrace-mod.o contents through code
        arm64: module-plts: factor out PLT generation code for ftrace
        arm64: mm: cleanup stale AIVIVT references
      4b1967c9
  6. 01 Dec, 2017 4 commits
    • Palmer Dabbelt's avatar
      RISC-V: Fixes for clean allmodconfig build · 3b62de26
      Palmer Dabbelt authored
      Olaf said: Here's a short series of patches that produces a working
      allmodconfig. Would be nice to see them go in so we can add build
      coverage.
      
      I've dropped patches 8 and 10 from the original set:
      
      * [PATCH 08/10] (RISC-V: Set __ARCH_WANT_RENAMEAT to pick up generic
        version) has a better fix that I've sent out for review, we don't want
        renameat.
      * [PATCH 10/10] (input: joystick: riscv has get_cycles) has already been
        taken into Dmitry Torokhov's tree.
      3b62de26
    • Palmer Dabbelt's avatar
      move libgcc.h to include/linux · 185e788c
      Palmer Dabbelt authored
      185e788c
    • Palmer Dabbelt's avatar
      7382fbde
    • Palmer Dabbelt's avatar
      RISC-V: User-Visible Changes · 07f8ba74
      Palmer Dabbelt authored
      This merge contains the user-visible, ABI-breaking changes that we want
      to make sure we have in Linux before our first release.   Highlights
      include:
      
      * VDSO entries for clock_get/gettimeofday/getcpu have been added.  These
        are simple syscalls now, but we want to let glibc use them from the
        start so we can make them faster later.
      * A VDSO entry for instruction cache flushing has been added so
        userspace can flush the instruction cache.
      * The VDSO symbol versions for __vdso_cmpxchg{32,64} have been removed,
        as those VDSO entries don't actually exist.
      
      Conflicts:
              arch/riscv/include/asm/tlbflush.h
      07f8ba74