1. 20 Jun, 2017 35 commits
    • ALSA: seq: Fix racy cell insertions during snd_seq_pool_done() · 2dbb155d
      Takashi Iwai authored
      commit c520ff3d upstream.
      
      When snd_seq_pool_done() is called, it marks the closing flag to
      refuse the further cell insertions.  But snd_seq_pool_done() itself
      doesn't clear the cells but just waits until all cells are cleared by
      the caller side.  That is, it's racy, and this leads to the endless
      stall as syzkaller spotted.
      
      This patch addresses the race by splitting the setup of the pool->closing
      flag out of snd_seq_pool_done(), and calling it properly before
      snd_seq_pool_done().
      
      BugLink: http://lkml.kernel.org/r/CACT4Y+aqqy8bZA1fFieifNxR2fAfFQQABcBHj801+u5ePV0URw@mail.gmail.com
      Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • ALSA: seq: Fix link corruption by event error handling · 2dbfb5cb
      Takashi Iwai authored
      commit f3ac9f73 upstream.
      
      The sequencer FIFO management has a bug that may lead to a corruption
      (shortage) of the cell linked list.  When a sequencer client faces an
      error at the event delivery, it tries to put back the dequeued cell.
      When the first queue was put back, this forgot the tail pointer
      tracking, and the link will be screwed up.
      
      Although there is no memory corruption, the sequencer client may stall
      forever at exit while flushing the pending FIFO cells in
      snd_seq_pool_done(), as spotted by syzkaller.
      
      This patch addresses the missing tail pointer tracking at
      snd_seq_fifo_cell_putback().  Also the patch makes sure to clear the
      cell->next pointer at snd_seq_fifo_event_in() for avoiding a similar
      mess-up of the FIFO linked list.
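
The tail-pointer problem described above can be illustrated with a small userspace sketch. The struct layout and function names are simplified stand-ins chosen for this example, not the kernel's definitions; the point is only that a put-back into an empty FIFO must also update the tail, and that an appended cell should start with a cleared next pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the sequencer cell and FIFO (hypothetical). */
struct cell {
    struct cell *next;
};

struct fifo {
    struct cell *head;
    struct cell *tail;
    int cells;
};

/* Append at the tail; clear c->next so a stale link cannot leak in. */
static void fifo_event_in(struct fifo *f, struct cell *c)
{
    assert(f != NULL && c != NULL);
    c->next = NULL;
    if (f->tail != NULL)
        f->tail->next = c;
    f->tail = c;
    if (f->head == NULL)
        f->head = c;
    f->cells++;
}

/* Remove and return the head cell, or NULL if the FIFO is empty. */
static struct cell *fifo_cell_out(struct fifo *f)
{
    struct cell *c = f->head;

    if (c != NULL) {
        f->head = c->next;
        if (f->tail == c)
            f->tail = NULL;
        c->next = NULL;
        f->cells--;
    }
    return c;
}

/* Put a dequeued cell back at the head, keeping the tail in sync. */
static void fifo_cell_putback(struct fifo *f, struct cell *c)
{
    assert(f != NULL && c != NULL);
    c->next = f->head;
    f->head = c;
    if (f->tail == NULL)    /* the FIFO was empty: c is also the tail */
        f->tail = c;
    f->cells++;
}

/* Count the cells reachable from the head, to compare with f->cells. */
static int fifo_walk(const struct fifo *f)
{
    int n = 0;

    for (const struct cell *c = f->head; c != NULL; c = c->next)
        n++;
    return n;
}
```

Without the tail update in fifo_cell_putback(), a later append would overwrite the head's link bookkeeping and the counted cells would no longer match the cells reachable from the head, which is the kind of shortage that can make a flush loop wait forever.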

      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • ALSA: timer: Reject user params with too small ticks · 7a3085a3
      Takashi Iwai authored
      commit 71321eb3 upstream.
      
      When a user sets too small a tick value with a fine-grained timer like
      hrtimer, the kernel tries to fire the timer irq too frequently.
      This may lead to heavy lock contention and eventually to a kernel
      spinlock lockup with warnings.
      
      For avoiding such a situation, we define a lower limit of the
      resolution, namely 1ms.  When the user passes a too small tick value
      that results in less than that, the kernel returns -EINVAL now.
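
A minimal sketch of such a lower-bound check, written as plain userspace C. The helper name, its parameters, and the assumption that the resolution is expressed in nanoseconds are illustrative choices, not taken from the patch itself.

```c
#include <assert.h>
#include <errno.h>

/* Lower bound on the effective timer period: 1 ms, in nanoseconds. */
#define MIN_PERIOD_NS 1000000ULL

/*
 * Hypothetical helper: resolution_ns is the timer resolution in
 * nanoseconds, ticks is the user-supplied tick count. Rejects any
 * combination whose period (resolution_ns * ticks) is below 1 ms.
 */
static int check_timer_ticks(unsigned long long resolution_ns,
                             unsigned long long ticks)
{
    if (resolution_ns == 0 || ticks == 0)
        return -EINVAL;
    /* Compare against a rounded-up quotient so nothing can overflow. */
    if (resolution_ns < MIN_PERIOD_NS &&
        ticks < (MIN_PERIOD_NS + resolution_ns - 1) / resolution_ns)
        return -EINVAL;
    return 0;
}
```

With a 1000 ns resolution, 999 ticks is rejected and 1000 ticks (exactly 1 ms) is accepted.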

      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • ALSA: seq: Don't handle loop timeout at snd_seq_pool_done() · 28567fb4
      Takashi Iwai authored
      commit 37a7ea4a upstream.
      
      snd_seq_pool_done() syncs with closing of all opened threads, but it
      aborts the wait loop with a timeout, and proceeds to release the
      resources even if not all threads have been closed.  The timeout was 5
      seconds, and under a heavy stress load it can easily be exceeded, which
      may result in an access to an invalid memory address -- this is what
      syzkaller detected in a bug report.
      
      As a fix, let the code graduate from naiveness, simply remove the loop
      timeout.
      
      BugLink: http://lkml.kernel.org/r/CACT4Y+YdhDV2H5LLzDTJDVF-qiYHUHhtRaW4rbb4gUhTCQB81w@mail.gmail.com
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • ALSA: seq: Fix race at creating a queue · 6dd5cf43
      Takashi Iwai authored
      commit 4842e98f upstream.
      
      When a sequencer queue is created in snd_seq_queue_alloc(), it adds the
      new queue element to the public list before referencing it.  Thus the
      queue might be deleted before the call of snd_seq_queue_use(), which
      results in a use-after-free error, as spotted by syzkaller.
      
      The fix is to reference the queue object at the right time.

      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • ALSA: hda - Fix up GPIO for ASUS ROG Ranger · bbcdcb83
      Takashi Iwai authored
      commit 85bcf96c upstream.
      
      ASUS ROG Ranger VIII with the ALC1150 codec requires an extra GPIO pin
      to be up for the front panel.  Just use the existing fixup for setting up
      the GPIO pins.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=189411
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • can: usb_8dev: Fix memory leak of priv->cmd_msg_buffer · d2ab6e52
      Marc Kleine-Budde authored
      commit 7c426313 upstream.
      
      The priv->cmd_msg_buffer is allocated in the probe function, but never
      kfree()ed. This patch converts the kzalloc() to resource-managed
      kzalloc.

      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • can: bcm: fix hrtimer/tasklet termination in bcm op removal · d5361ee8
      Oliver Hartkopp authored
      commit a06393ed upstream.
      
      When removing a bcm tx operation either a hrtimer or a tasklet might run.
      As the hrtimer triggers its associated tasklet and vice versa we need to
      take care to mutually terminate both handlers.

      Reported-by: Michael Josenhans <michael.josenhans@web.de>
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Tested-by: Michael Josenhans <michael.josenhans@web.de>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • can: ti_hecc: add missing prepare and unprepare of the clock · 2adddc0a
      Yegor Yefremov authored
      commit befa6011 upstream.
      
      In order to make the driver work with the common clock framework, this
      patch converts the clk_enable()/clk_disable() to
      clk_prepare_enable()/clk_disable_unprepare().
      
      Also add error checking for clk_prepare_enable().

      Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • can: c_can_pci: fix null-pointer-deref in c_can_start() - set device pointer · 165cc033
      Einar Jón authored
      commit c97c52be upstream.
      
      The priv->device pointer for c_can_pci is never set, but it is used
      without a NULL check in c_can_start(). Setting it in c_can_pci_probe()
      like c_can_plat_probe() prevents c_can_pci.ko from crashing, with and
      without CONFIG_PM.
      
      This might also cause the pm_runtime_*() functions in c_can.c to
      actually be executed for c_can_pci devices - they are the only other
      place where priv->device is used, but they all contain a null check.

      Signed-off-by: Einar Jón <tolvupostur@gmail.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • can: peak: fix bad memory access and free sequence · 3e7b58a3
      추지호 authored
      commit b67d0dd7 upstream.
      
      Fix a bad memory access while disconnecting. netdev is freed before
      the private data is freed, and dev is accessed after freeing netdev.

      This causes a slub problem and raises a kernel oops when the slub
      debugger is configured.

      Signed-off-by: Jiho Chu <jiho.chu@samsung.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • can: raw: raw_setsockopt: limit number of can_filter that can be set · cf796900
      Marc Kleine-Budde authored
      commit 332b05ca upstream.
      
      This patch adds a check to limit the number of can_filters that can be
      set via setsockopt on CAN_RAW sockets. Otherwise allocations > MAX_ORDER
      are not prevented resulting in a warning.
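
The kind of bound check described here can be sketched in userspace C as follows. The struct is a stand-in for a CAN filter entry, and both the helper name and the cap of 512 entries are assumptions made for illustration; consult the kernel headers for the actual limit.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one filter entry: a CAN ID and a mask (hypothetical). */
struct filter {
    uint32_t can_id;
    uint32_t can_mask;
};

static_assert(sizeof(struct filter) == 8, "unexpected filter layout");

/* Assumed cap on filters per socket; the value is illustrative. */
#define FILTER_MAX 512

/*
 * Validate the option length of a filter-setting request before any
 * allocation is attempted. Returns the filter count, or -EINVAL.
 */
static int validate_filter_optlen(size_t optlen)
{
    if (optlen % sizeof(struct filter) != 0)
        return -EINVAL;
    if (optlen > FILTER_MAX * sizeof(struct filter))
        return -EINVAL;
    return (int)(optlen / sizeof(struct filter));
}
```

Rejecting an oversized length up front means the later allocation size is bounded by FILTER_MAX entries, so a caller cannot request an arbitrarily large buffer.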
      
      Reference: https://lkml.org/lkml/2016/12/2/230
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Tested-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • ocfs2: fix BUG_ON() in ocfs2_ci_checkpointed() · 1918581a
      Tariq Saeed authored
      commit 3d46a44a upstream.
      
      PID: 614    TASK: ffff882a739da580  CPU: 3   COMMAND: "ocfs2dc"
        #0 [ffff882ecc3759b0] machine_kexec at ffffffff8103b35d
        #1 [ffff882ecc375a20] crash_kexec at ffffffff810b95b5
        #2 [ffff882ecc375af0] oops_end at ffffffff815091d8
        #3 [ffff882ecc375b20] die at ffffffff8101868b
        #4 [ffff882ecc375b50] do_trap at ffffffff81508bb0
        #5 [ffff882ecc375ba0] do_invalid_op at ffffffff810165e5
        #6 [ffff882ecc375c40] invalid_op at ffffffff815116fb
           [exception RIP: ocfs2_ci_checkpointed+208]
           RIP: ffffffffa0a7e940  RSP: ffff882ecc375cf0  RFLAGS: 00010002
           RAX: 0000000000000001  RBX: 000000000000654b  RCX: ffff8812dc83f1f8
           RDX: 00000000000017d9  RSI: ffff8812dc83f1f8  RDI: ffffffffa0b2c318
           RBP: ffff882ecc375d20   R8: ffff882ef6ecfa60   R9: ffff88301f272200
           R10: 0000000000000000  R11: 0000000000000000  R12: ffffffffffffffff
           R13: ffff8812dc83f4f0  R14: 0000000000000000  R15: ffff8812dc83f1f8
           ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
        #7 [ffff882ecc375d28] ocfs2_check_meta_downconvert at ffffffffa0a7edbd [ocfs2]
        #8 [ffff882ecc375d38] ocfs2_unblock_lock at ffffffffa0a84af8 [ocfs2]
        #9 [ffff882ecc375dc8] ocfs2_process_blocked_lock at ffffffffa0a85285 [ocfs2]

      The assert is tripped because the transaction is not checkpointed and the
      lock level is PR.

      Some time ago, a chmod command had been executed. As a result, the following
      call chain left the inode cluster lock in PR state, later causing the assert:
      system_call_fastpath
        -> my_chmod
         -> sys_chmod
          -> sys_fchmodat
           -> notify_change
            -> ocfs2_setattr
             -> posix_acl_chmod
              -> ocfs2_iop_set_acl
               -> ocfs2_set_acl
                -> ocfs2_acl_set_mode
      Here is how.
      1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
      1120 {
      1247         ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do.
      ..
      1258         if (!status && attr->ia_valid & ATTR_MODE) {
      1259                 status =  posix_acl_chmod(inode, inode->i_mode);
      
      519 posix_acl_chmod(struct inode *inode, umode_t mode)
      520 {
      ..
      539         ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS);
      
      287 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, ...
      288 {
      289         return ocfs2_set_acl(NULL, inode, NULL, type, acl, NULL, NULL);
      
      224 int ocfs2_set_acl(handle_t *handle,
      225                          struct inode *inode, ...
      231 {
      ..
      252                                 ret = ocfs2_acl_set_mode(inode, di_bh,
      253                                                          handle, mode);
      
      168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ...
      170 {
      183         if (handle == NULL) {
                          >>> BUG: inode lock not held in ex at this point <<<
      184                 handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb),
      185                                            OCFS2_INODE_UPDATE_CREDITS);
      
      At ocfs2_setattr.#1247 we unlock, and at #1259 we call posix_acl_chmod. When
      we reach ocfs2_acl_set_mode.#181 and do the trans, the inode cluster lock is
      not held in EX mode (it should be). How could this have happened?
      
      We are the lock master, were holding the lock in EX and have released it in
      ocfs2_setattr.#1247.  Note that there are no holders of this lock at
      this point.  Another node needs the lock in PR, and we downconvert from
      EX to PR.  So the inode lock is PR when we do the trans in
      ocfs2_acl_set_mode.#184.  The trans stays in core (not flushed to disk).
      Now another node wants the lock in EX, and the downconvert thread gets
      kicked (the one that tripped the assert above); it finds an unflushed trans
      but the lock is not EX (it is PR).  If the lock were at EX, it would have
      flushed the trans (ocfs2_ci_checkpointed -> ocfs2_start_checkpoint) before
      downconverting (to NULL) for the request.
      
      ocfs2_setattr must not drop the inode lock EX in this code path.  If it
      does, and takes it again before the trans, say in ocfs2_set_acl, another
      cluster node can get in between and execute another setattr, overwriting
      the one in progress on this node and resulting in a mode/acl/size combo
      that is a mix of the two.
      
      Orabug: 20189959
      Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • ocfs2: fix crash caused by stale lvb with fsdlm plugin · 4b93c1da
      Eric Ren authored
      commit e7ee2c08 upstream.
      
      The crash happens rather often when we reset some cluster nodes while
      nodes contend fiercely to do truncate and append.
      
      The crash backtrace is below:
      
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
         dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
         ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
         ocfs2: Beginning quota recovery on device (253,18) for slot 2
         ocfs2: Finishing quota recovery on device (253,18) for slot 2
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
         (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
         ------------[ cut here ]------------
         kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
         invalid opcode: 0000 [#1] SMP
         Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
         Supported: No, Unsupported modules are loaded
         CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
         task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
         RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
         RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
         RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
         RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
         RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
         R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
         R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
         FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
         CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
         Call Trace:
           ocfs2_setattr+0x698/0xa90 [ocfs2]
           notify_change+0x1ae/0x380
           do_truncate+0x5e/0x90
           do_sys_ftruncate.constprop.11+0x108/0x160
           entry_SYSCALL_64_fastpath+0x12/0x6d
         Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
         RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
      
      It's because ocfs2_inode_lock() gets us a stale LVB in which the i_size is
      not equal to the disk i_size.  We mistakenly trust the LVB because the
      underlying fsdlm dlm_lock() doesn't set lkb_sbflags with
      DLM_SBF_VALNOTVALID properly for us.  But why?

      The current code tries to downconvert the lock without the DLM_LKF_VALBLK
      flag to tell o2cb not to update the RSB's LVB if it's a PR->NULL conversion,
      even if the lock resource type needs the LVB.  This is not the right way for
      fsdlm.

      The fsdlm plugin behaves differently on DLM_LKF_VALBLK: it depends on
      DLM_LKF_VALBLK to decide if we care about the LVB in the LKB.  If
      DLM_LKF_VALBLK is not set, fsdlm will skip recovering the RSB's LVB from
      this lkb and setting DLM_SBF_VALNOTVALID appropriately when a node
      failure happens.
      
      The following diagram briefly illustrates how this crash happens:
      
      RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;
      
      The 1st round:
      
                   Node1                                    Node2
      RSB1: PR
                                                        RSB1(master): NULL->EX
      ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
        ocfs2_dlm_lock(no DLM_LKF_VALBLK)
      
      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      
      dlm_lock(no DLM_LKF_VALBLK)
        convert_lock(overwrite lkb->lkb_exflags
                     with no DLM_LKF_VALBLK)
      
      RSB1: NULL                                        RSB1: EX
                                                        reset Node2
      dlm_recover_rsbs()
        recover_lvb()
      
      /* The LVB is not trustable if the node with EX fails and
       * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
       */
      
       if(!(lkb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance
                 return;                    * to invalidate the LVB here.
                                            */
      
      The 2nd round:
      
               Node 1                                Node2
      RSB1(become master from recovery)
      
      ocfs2_setattr()
        ocfs2_inode_lock(NULL->EX)
          /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
          ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
        ocfs2_truncate_file()
            mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */
      
      The fix is quite straightforward.  We keep setting the DLM_LKF_VALBLK flag
      for dlm_lock() if the lock resource type needs the LVB and the fsdlm plugin
      is used.
      
      Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
      Signed-off-by: Eric Ren <zren@suse.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • cifs: Do not send echoes before Negotiate is complete · 4549e4af
      Sachin Prabhu authored
      commit 62a6cfdd upstream.
      
      commit 4fcd1813 ("Fix reconnect to not defer smb3 session reconnect
      long after socket reconnect") added support for Negotiate requests to
      be initiated by echo calls.
      
      To avoid delays in calling echo after a reconnect, I added the patch
      introduced by the commit b8c60012 ("Call echo service immediately
      after socket reconnect").
      
      This has however caused a regression with cifs shares which do not have
      support for echo calls to trigger Negotiate requests. On connections
      which need to call Negotiation, the echo calls trigger an error which
      triggers a reconnect which in turn triggers another echo call. This
      results in a loop which is only broken when an operation is performed on
      the cifs share. For an idle share, it can DOS a server.
      
      The patch uses the smb_operation can_echo() for cifs so that echo is
      called only if the connection has already been set up.
      
      kernel bz: 194531
      Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
      Tested-by: Jonathan Liu <net147@gmail.com>
      Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • fs/cifs: make share unaccessible at root level mountable · a9baa441
      Aurelien Aptel authored
      commit a6b5058f upstream.
      
      if, when mounting //HOST/share/sub/dir/foo we can query /sub/dir/foo but
      not any of the path components above:
      
      - store the /sub/dir/foo prefix in the cifs super_block info
      - in the superblock, set root dentry to the subpath dentry (instead of
        the share root)
      - set a flag in the superblock to remember it
      - use prefixpath when building path from a dentry
      
      fixes bso#8950
      Signed-off-by: Aurelien Aptel <aaptel@suse.com>
      Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • CIFS: remove bad_network_name flag · 60f2c2fa
      Germano Percossi authored
      commit a0918f1c upstream.
      
      STATUS_BAD_NETWORK_NAME can be received during node failover,
      causing the flag to be set and making the reconnect thread
      always unsuccessful, thereafter.
      
      Once the only place where it is set is removed, the remaining
      bits are rendered moot.
      
      Removing it does not prevent "mount" from failing when a non
      existent share is passed.
      
      What happens when the share really ceases to exist while the
      share is mounted is undefined now as much as it was before.

      Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Steve French <smfrench@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • CIFS: Fix a possible memory corruption in push locks · f9e74c2a
      Pavel Shilovsky authored
      commit e3d240e9 upstream.
      
      If maxBuf is not 0 but less than the size of an SMB2 lock structure,
      we can end up with memory corruption.
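
A hedged sketch of the guard this implies, in userspace C. The element layout and the helper name are hypothetical stand-ins; the idea is simply to refuse a buffer that cannot hold even one lock element instead of computing a zero element count and writing past a zero-length array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one on-the-wire lock element (layout is illustrative). */
struct lock_element {
    uint64_t offset;
    uint64_t length;
    uint32_t flags;
    uint32_t reserved;
};

static_assert(sizeof(struct lock_element) == 24, "unexpected layout");

/*
 * Compute how many lock elements fit into a buffer of max_buf bytes.
 * Returns -EINVAL when not even one element fits, so a caller never
 * allocates a zero-length array and then stores an element into it.
 */
static int max_lock_elements(size_t max_buf)
{
    if (max_buf < sizeof(struct lock_element))
        return -EINVAL;
    return (int)(max_buf / sizeof(struct lock_element));
}
```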

      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • CIFS: Fix missing nls unload in smb2_reconnect() · 85beff45
      Pavel Shilovsky authored
      commit 4772c795 upstream.

      Acked-by: Sachin Prabhu <sprabhu@redhat.com>
      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • CIFS: Fix a possible memory corruption during reconnect · e008a962
      Pavel Shilovsky authored
      commit 53e0e11e upstream.
      
      We can not unlock/lock cifs_tcp_ses_lock while walking through ses
      and tcon lists because it can corrupt list iterator pointers and
      a tcon structure can be released if we don't hold an extra reference.
      Fix it by moving a reconnect process to a separate delayed work
      and acquiring a reference to every tcon that needs to be reconnected.
      Also do not send an echo request on newly established connections.

      Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • md linear: fix a race between linear_add() and linear_congested() · d45256ff
      colyli@suse.de authored
      commit 03a9e24e upstream.
      
      Recently I received a bug report that on a Linux v3.0 based kernel, hot
      adding a disk to a md linear device causes a kernel crash at
      linear_congested(). From the crash image analysis, I find that in
      linear_congested(), mddev->raid_disks contains value N, but conf->disks[]
      only has N-1 pointers available. Then a NULL pointer dereference crashes
      the kernel.

      There is a race between linear_add() and linear_congested(); the RCU code
      used in these two functions cannot avoid the race. Since Linux v4.0 the
      RCU code has been replaced by mddev_suspend().  After checking the
      upstream code, it seems linear_congested() is not called in the
      generic_make_request() code path, so mddev_suspend() cannot prevent it
      from being called. The possible race still exists.
      
      Here I explain how the race still exists in the current code.  On a
      machine with many CPUs, linear_add() is called on one CPU to add a hard
      disk to a md linear device; at the same time on another CPU,
      linear_congested() is called to detect whether this md linear device is
      congested before issuing an I/O request to it.

      Now I use a possible code execution time sequence to demonstrate how the
      race can happen:
      
      seq    linear_add()                linear_congested()
       0                                 conf=mddev->private
       1   oldconf=mddev->private
       2   mddev->raid_disks++
       3                              for (i=0; i<mddev->raid_disks;i++)
       4                                bdev_get_queue(conf->disks[i].rdev->bdev)
       5   mddev->private=newconf
      
      In linear_add() mddev->raid_disks is increased at time seq 2, and on
      another CPU in linear_congested() the for-loop iterates conf->disks[i] up
      to the increased mddev->raid_disks at time seq 3 and 4. But conf, which
      should gain one more element (a pointer to struct dev_info) in
      conf->disks[], is not updated yet, so accessing its structure member at
      time seq 4 causes a NULL pointer dereference fault.
      
      To fix this race, the patch makes two modifications:
       1) Add 'int raid_disks' in struct linear_conf, as a copy of
          mddev->raid_disks. It is initialized in linear_conf(), and is always
          consistent with the number of pointers in 'struct dev_info disks[]'.
          When iterating conf->disks[] in linear_congested(), use
          conf->raid_disks instead of mddev->raid_disks in the for-loop, so the
          NULL pointer dereference will not happen again.
       2) Bring the RCU code back, and use kfree_rcu() in linear_add() to
          free the oldconf memory. Because oldconf may be referenced as
          mddev->private in linear_congested(), kfree_rcu() makes sure that its
          memory will not be released until no one uses it any more.
      Also, some code comments are added in this patch to make the modification
      easier to understand.
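
The first part of the fix can be sketched in userspace C. This is a simplified model, not the md driver's code: the field `congested` stands in for the per-device queue state, `priv` stands in for mddev->private, and the RCU-deferred free from the second part is not modeled at all.

```c
#include <assert.h>
#include <stdlib.h>

struct dev_info {
    int congested;          /* stand-in for the per-device queue state */
};

/* The configuration carries its own element count. */
struct linear_conf {
    int raid_disks;
    struct dev_info disks[];
};

struct mddev {
    int raid_disks;             /* may be bumped before priv is swapped */
    struct linear_conf *priv;   /* mddev->private in the kernel */
};

/* Allocate a zeroed configuration with raid_disks array slots. */
static struct linear_conf *linear_conf_alloc(int raid_disks)
{
    struct linear_conf *conf;

    assert(raid_disks >= 0);
    conf = calloc(1, sizeof(*conf) +
                     (size_t)raid_disks * sizeof(conf->disks[0]));
    if (conf != NULL)
        conf->raid_disks = raid_disks;
    return conf;
}

/* Bound the loop by conf->raid_disks, never by mddev->raid_disks. */
static int linear_congested(const struct mddev *mddev)
{
    const struct linear_conf *conf = mddev->priv;
    int ret = 0;

    for (int i = 0; i < conf->raid_disks && !ret; i++)
        ret = conf->disks[i].congested;
    return ret;
}
```

Because the loop bound travels with the array it describes, a reader that still sees the old configuration also sees the old count, regardless of how far the writer has progressed.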
      
      This patch can be applied to kernels since v4.0, after commit
      3be260cc ("md/linear: remove rcu protections in favour of
      suspend/resume"). But this bug was reported on a Linux v3.0 based kernel;
      people who maintain kernels before Linux v4.0 need to back port this
      patch themselves.
      
      Changelog:
       - V3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
             replace call_rcu() in linear_add().
       - v2: add RCU stuffs by suggestion from Shaohua and Neil.
       - v1: initial effort.

      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Neil Brown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • md:raid1: fix a dead loop when read from a WriteMostly disk · 7a56cc09
      Wei Fang authored
      commit 816b0acf upstream.
      
      If first_bad == this_sector when we get the WriteMostly disk
      in read_balance(), a valid disk will be returned with zero
      max_sectors. This leads to a dead loop in make_request(), and
      OOM will happen because of the endless allocation of struct bio.

      Since we can't get data from this disk in this case,
      continue with another disk.
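
The selection rule can be modeled in a few lines of userspace C. Everything here (the struct, the helper name, the way bad ranges are represented) is a hypothetical simplification of the read-balancing logic, meant only to show why a disk whose first bad sector equals the requested sector has to be skipped.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long sector_t;

/* Hypothetical per-disk view: where the first bad block starts, if any. */
struct disk {
    int has_bad;        /* nonzero if a bad range overlaps the request */
    sector_t first_bad; /* first bad sector, valid when has_bad is set */
};

/*
 * Pick a disk that can serve at least one sector starting at this_sector.
 * Returns the disk index and stores the readable length in *max_sectors,
 * or returns -1 if no disk can make progress.
 */
static int pick_read_disk(const struct disk *disks, int ndisks,
                          sector_t this_sector, sector_t sectors,
                          sector_t *max_sectors)
{
    assert(disks != NULL && max_sectors != NULL);
    for (int i = 0; i < ndisks; i++) {
        sector_t good = sectors;

        if (disks[i].has_bad) {
            /* Equality means zero good sectors, so skip this disk too. */
            if (disks[i].first_bad <= this_sector)
                continue;
            if (disks[i].first_bad - this_sector < good)
                good = disks[i].first_bad - this_sector;
        }
        *max_sectors = good;
        return i;
    }
    return -1;
}
```

A caller that loops until all sectors are read always advances by at least one sector per iteration, because a disk is never returned with a zero readable length.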

      Signed-off-by: Wei Fang <fangwei1@huawei.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Julia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • md/raid5: limit request size according to implementation limits · 523d7696
      Konstantin Khlebnikov authored
      commit e8d7c332 upstream.
      
      The current implementation keeps a 16-bit counter of active stripes
      in the lower bits of bio->bi_phys_segments. If a request is big enough
      to overflow this counter, the bio will be completed and freed too early.
      
      Fortunately this does not happen in the default configuration because
      several other limits prevent it: stripe_cache_size * nr_disks
      effectively limits the count of active stripes, and a small
      max_sectors_kb on the lower disks prevents it during normal read/write
      operations.
      
      Overflow easily happens in discard if it is enabled by the module
      parameter "devices_handle_discard_safely" and stripe_cache_size is set
      big enough.
      
      This patch limits the request size to 256 MiB - 8 KiB to prevent
      overflows.
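      The size limit can be checked with a little arithmetic. The sketch
      below assumes a 4 KiB stripe unit (8 sectors of 512 bytes) and a cap
      of 0xfffe stripes per request. These constants are assumptions chosen
      because they reproduce the figure quoted above, and the function
      names are made up for illustration.

```c
#include <stdint.h>

/* Assumed values; they reproduce the 256 MiB - 8 KiB figure. */
#define SECTOR_SIZE     512u     /* bytes per sector */
#define STRIPE_SECTORS  8u       /* one 4 KiB stripe unit, in sectors */
#define MAX_STRIPES     0xfffeu  /* stays below the 16-bit counter limit */

/* Largest request, in bytes, that touches at most MAX_STRIPES stripes. */
uint64_t max_request_bytes(void)
{
	return (uint64_t)MAX_STRIPES * STRIPE_SECTORS * SECTOR_SIZE;
}

/* Number of stripe units covered by an aligned request of 'bytes' bytes. */
uint64_t stripes_for(uint64_t bytes)
{
	uint64_t stripe_bytes = (uint64_t)STRIPE_SECTORS * SECTOR_SIZE;

	return (bytes + stripe_bytes - 1) / stripe_bytes;
}
```

      Under these assumptions a full 256 MiB request would need 65536
      stripes, one more than a 16-bit counter can hold, while the capped
      size needs 65534.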
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Neil Brown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      523d7696
    • Benjamin Marzinski's avatar
      dm space map metadata: fix 'struct sm_metadata' leak on failed create · 82559603
      Benjamin Marzinski authored
      commit 314c25c5 upstream.
      
      In dm_sm_metadata_create() we temporarily change the dm_space_map
      operations from 'ops' (whose .destroy function deallocates the
      sm_metadata) to 'bootstrap_ops' (whose .destroy function doesn't).
      
      If dm_sm_metadata_create() fails in sm_ll_new_metadata() or
      sm_ll_extend(), it exits back to dm_tm_create_internal(), which calls
      dm_sm_destroy() with the intention of freeing the sm_metadata, but it
      doesn't (because the dm_space_map operations are still set to
      'bootstrap_ops').
      
      Fix this by setting the dm_space_map operations back to 'ops' if
      dm_sm_metadata_create() fails when it is set to 'bootstrap_ops'.
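      The pattern can be sketched in plain C. All names below are simplified
      stand-ins invented for this illustration, not the dm-persistent-data
      API: sm_create() temporarily installs a destroy callback that does not
      free, and restores the freeing one on the error path so that the
      caller's cleanup releases the object.

```c
#include <stdlib.h>

struct space_map;

struct sm_ops {
	void (*destroy)(struct space_map *sm);
};

struct space_map {
	const struct sm_ops *ops;
};

int freed_count; /* instrumentation for the example only */

static void real_destroy(struct space_map *sm)
{
	free(sm);
	freed_count++;
}

static void bootstrap_destroy(struct space_map *sm)
{
	(void)sm; /* deliberately does not free */
}

static const struct sm_ops ops = { real_destroy };
static const struct sm_ops bootstrap_ops = { bootstrap_destroy };

/* Returns 0 on success, -1 on failure; step_fails simulates an error. */
int sm_create(struct space_map *sm, int step_fails)
{
	sm->ops = &bootstrap_ops;
	if (step_fails) {
		sm->ops = &ops; /* restore before bailing out */
		return -1;
	}
	sm->ops = &ops;
	return 0;
}

/* Caller that always cleans up through the ops table. */
int create_then_destroy(int step_fails)
{
	struct space_map *sm = malloc(sizeof(*sm));
	int r;

	if (!sm)
		return -1;
	r = sm_create(sm, step_fails);
	sm->ops->destroy(sm);
	return r;
}
```

      Without the restore on the error path, the caller would invoke
      bootstrap_destroy() and the allocation would leak.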
      
      [js] no nr_blocks test in 3.12 yet
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      82559603
    • Ondrej Kozina's avatar
      dm crypt: mark key as invalid until properly loaded · 3044e195
      Ondrej Kozina authored
      commit 265e9098 upstream.
      
      In crypt_set_key(), if a failure occurs while replacing the old key
      (e.g. tfm->setkey() fails) the key must not have DM_CRYPT_KEY_VALID flag
      set.  Otherwise, the crypto layer would have an invalid key that still
      has DM_CRYPT_KEY_VALID flag set.
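      A minimal sketch of the ordering, using made-up userspace types: the
      valid flag is cleared before the key is replaced and set again only
      after the setkey step succeeds. struct crypt_cfg, fake_setkey() and
      set_key() are hypothetical stand-ins, not dm-crypt code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define KEY_MAX 64

struct crypt_cfg {
	unsigned char key[KEY_MAX];
	size_t key_len;
	bool key_valid; /* stands in for DM_CRYPT_KEY_VALID */
};

/* Hypothetical stand-in for tfm->setkey(): rejects empty keys. */
int fake_setkey(const unsigned char *key, size_t len)
{
	(void)key;
	return len == 0 ? -1 : 0;
}

int set_key(struct crypt_cfg *cc, const unsigned char *key, size_t len)
{
	int r;

	if (len > KEY_MAX)
		return -1; /* nothing touched, old state stays as it was */

	cc->key_valid = false; /* invalid until the new key is loaded */
	memcpy(cc->key, key, len);
	cc->key_len = len;

	r = fake_setkey(cc->key, cc->key_len);
	if (r == 0)
		cc->key_valid = true;
	return r;
}
```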
      Signed-off-by: Ondrej Kozina <okozina@redhat.com>
      Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      3044e195
    • Dan Williams's avatar
      block: fix del_gendisk() vs blkdev_ioctl crash · 1dd3d3e6
      Dan Williams authored
      commit ac34f15e upstream.
      
      When tearing down a block device early in its lifetime, userspace may
      still be performing discovery actions like blkdev_ioctl() to re-read
      partitions.
      
      The nvdimm_revalidate_disk() implementation depends on
      disk->driverfs_dev to be valid at entry.  However, it is set to NULL in
      del_gendisk() and fatally this is happening *before* the disk device is
      deleted from userspace view.
      
      There's no reason for del_gendisk() to clear ->driverfs_dev.  That
      device is the parent of the disk.  It is guaranteed to not be freed
      until the disk, as a child, drops its ->parent reference.
      
      We could also fix this issue locally in nvdimm_revalidate_disk() by
      using disk_to_dev(disk)->parent, but let's fix it globally since
      ->driverfs_dev follows the lifetime of the parent.  Longer term we
      should probably just add a @parent parameter to add_disk(), and stop
      carrying this pointer in the gendisk.
      
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<ffffffffa00340a8>] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm]
       CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G           O    4.4.0-rc5 #2257
       [..]
       Call Trace:
        [<ffffffff8143e5c7>] rescan_partitions+0x87/0x2c0
        [<ffffffff810f37f9>] ? __lock_is_held+0x49/0x70
        [<ffffffff81438c62>] __blkdev_reread_part+0x72/0xb0
        [<ffffffff81438cc5>] blkdev_reread_part+0x25/0x40
        [<ffffffff8143982d>] blkdev_ioctl+0x4fd/0x9c0
        [<ffffffff811246c9>] ? current_kernel_time64+0x69/0xd0
        [<ffffffff812916dd>] block_ioctl+0x3d/0x50
        [<ffffffff81264c38>] do_vfs_ioctl+0x308/0x560
        [<ffffffff8115dbd1>] ? __audit_syscall_entry+0xb1/0x100
        [<ffffffff810031d6>] ? do_audit_syscall_entry+0x66/0x70
        [<ffffffff81264f09>] SyS_ioctl+0x79/0x90
        [<ffffffff81902672>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@fb.com>
      Reported-by: Robert Hu <robert.hu@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      1dd3d3e6
    • Mauricio Faria de Oliveira's avatar
      block: allow WRITE_SAME commands with the SG_IO ioctl · 5cb01741
      Mauricio Faria de Oliveira authored
      commit 25cdb645 upstream.
      
      The WRITE_SAME commands are not present in the blk_default_cmd_filter
      write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
      is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
      [ sg_io() -> blk_fill_sghdr_rq() -> blk_verify_command() -> -EPERM ]
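      The filtering idea can be sketched as a per-opcode bitmap. This is a
      simplified userspace model, not the kernel's blk_verify_command():
      struct cmd_filter and the helper names are invented for illustration,
      the return value -1 stands in for -EPERM, and the real check also
      considers read permissions. The opcode values are the SCSI WRITE AND
      VERIFY(10), WRITE SAME(10) and WRITE SAME(16) opcodes.

```c
#include <stdbool.h>
#include <stdint.h>

#define OP_WRITE_VERIFY   0x2e
#define OP_WRITE_SAME_10  0x41
#define OP_WRITE_SAME_16  0x93

/* One bit per SCSI opcode: set means "allowed without raw I/O capability". */
struct cmd_filter {
	uint8_t write_ok[256 / 8];
};

void filter_allow(struct cmd_filter *f, uint8_t op)
{
	f->write_ok[op / 8] |= (uint8_t)(1u << (op % 8));
}

bool filter_test(const struct cmd_filter *f, uint8_t op)
{
	return (f->write_ok[op / 8] >> (op % 8)) & 1u;
}

/* Returns 0 if the command may proceed, -1 (think -EPERM) otherwise. */
int verify_command(const struct cmd_filter *f, uint8_t opcode, bool has_rawio)
{
	if (has_rawio)
		return 0;
	return filter_test(f, opcode) ? 0 : -1;
}
```

      In this model an unprivileged WRITE SAME fails until its opcode is
      added to the bitmap, which mirrors the behaviour described above.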
      
      The problem can be reproduced with the sg_write_same command
      
        # sg_write_same --num 1 --xferlen 512 /dev/sda
        #
      
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_same --num 1 --xferlen 512 /dev/sda'
          Write same: pass through os error: Operation not permitted
        #
      
      For comparison, the WRITE_VERIFY command does not observe this problem,
      since it is in that list:
      
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
        #
      
      So, this patch adds the WRITE_SAME commands to the list, in order
      for the SG_IO ioctl to finish successfully:
      
        # capsh --drop=cap_sys_rawio -- -c \
          'sg_write_same --num 1 --xferlen 512 /dev/sda'
        #
      
      That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
      (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]),
      which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).
      
      In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
      which are translated to write-same calls in the guest kernel, and then into
      SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:
      
        [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
        [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
        [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
        [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
        [...] blk_update_request: I/O error, dev sda, sector 17096824
      
      Links:
      [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
      [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
      Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: Brahadambal Srinivasan <latha@linux.vnet.ibm.com>
      Reported-by: Manjunatha H R <manjuhr1@in.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      5cb01741
    • Omar Sandoval's avatar
      block: fix use-after-free in sys_ioprio_get() · 0f3a4aaa
      Omar Sandoval authored
      commit 8ba86821 upstream.
      
      get_task_ioprio() accesses the task->io_context without holding the task
      lock and thus can race with exit_io_context(), leading to a
      use-after-free. The reproducer below hits this within a few seconds on
      my 4-core QEMU VM:
      
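      #define _GNU_SOURCE
      #include <assert.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      
      int main(int argc, char **argv)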
      {
      	pid_t pid, child;
      	long nproc, i;
      
      	/* ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); */
      	syscall(SYS_ioprio_set, 1, 0, 0x6000);
      
      	nproc = sysconf(_SC_NPROCESSORS_ONLN);
      
      	for (i = 0; i < nproc; i++) {
      		pid = fork();
      		assert(pid != -1);
      		if (pid == 0) {
      			for (;;) {
      				pid = fork();
      				assert(pid != -1);
      				if (pid == 0) {
      					_exit(0);
      				} else {
      					child = wait(NULL);
      					assert(child == pid);
      				}
      			}
      		}
      
      		pid = fork();
      		assert(pid != -1);
      		if (pid == 0) {
      			for (;;) {
      				/* ioprio_get(IOPRIO_WHO_PGRP, 0); */
      				syscall(SYS_ioprio_get, 2, 0);
      			}
      		}
      	}
      
      	for (;;) {
      		/* ioprio_get(IOPRIO_WHO_PGRP, 0); */
      		syscall(SYS_ioprio_get, 2, 0);
      	}
      
      	return 0;
      }
      
      This gets us KASAN dumps like this:
      
      [   35.526914] ==================================================================
      [   35.530009] BUG: KASAN: out-of-bounds in get_task_ioprio+0x7b/0x90 at addr ffff880066f34e6c
      [   35.530009] Read of size 2 by task ioprio-gpf/363
      [   35.530009] =============================================================================
      [   35.530009] BUG blkdev_ioc (Not tainted): kasan: bad access detected
      [   35.530009] -----------------------------------------------------------------------------
      
      [   35.530009] Disabling lock debugging due to kernel taint
      [   35.530009] INFO: Allocated in create_task_io_context+0x2b/0x370 age=0 cpu=0 pid=360
      [   35.530009] 	___slab_alloc+0x55d/0x5a0
      [   35.530009] 	__slab_alloc.isra.20+0x2b/0x40
      [   35.530009] 	kmem_cache_alloc_node+0x84/0x200
      [   35.530009] 	create_task_io_context+0x2b/0x370
      [   35.530009] 	get_task_io_context+0x92/0xb0
      [   35.530009] 	copy_process.part.8+0x5029/0x5660
      [   35.530009] 	_do_fork+0x155/0x7e0
      [   35.530009] 	SyS_clone+0x19/0x20
      [   35.530009] 	do_syscall_64+0x195/0x3a0
      [   35.530009] 	return_from_SYSCALL_64+0x0/0x6a
      [   35.530009] INFO: Freed in put_io_context+0xe7/0x120 age=0 cpu=0 pid=1060
      [   35.530009] 	__slab_free+0x27b/0x3d0
      [   35.530009] 	kmem_cache_free+0x1fb/0x220
      [   35.530009] 	put_io_context+0xe7/0x120
      [   35.530009] 	put_io_context_active+0x238/0x380
      [   35.530009] 	exit_io_context+0x66/0x80
      [   35.530009] 	do_exit+0x158e/0x2b90
      [   35.530009] 	do_group_exit+0xe5/0x2b0
      [   35.530009] 	SyS_exit_group+0x1d/0x20
      [   35.530009] 	entry_SYSCALL_64_fastpath+0x1a/0xa4
      [   35.530009] INFO: Slab 0xffffea00019bcd00 objects=20 used=4 fp=0xffff880066f34ff0 flags=0x1fffe0000004080
      [   35.530009] INFO: Object 0xffff880066f34e58 @offset=3672 fp=0x0000000000000001
      [   35.530009] ==================================================================
      
      Fix it by grabbing the task lock while we poke at the io_context.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      0f3a4aaa
    • Daeho Jeong's avatar
      ext4: fix inode checksum calculation problem if i_extra_size is small · cd0d9254
      Daeho Jeong authored
      commit 05ac5aa1 upstream.
      
      We fixed the race condition in calculating the ext4 checksum value
      in commit b47820ed ("ext4: avoid modifying checksum fields directly
      during checksum verification"). However, with that change, the
      checksum of an inode whose i_extra_size is less than 4 is no longer
      calculated properly. This problem was found and reported by Nix.
      Thank you.
      Reported-by: Nix <nix@esperi.org.uk>
      Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
      Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      cd0d9254
    • Theodore Ts'o's avatar
      ext4: return EROFS if device is r/o and journal replay is needed · 48a5889b
      Theodore Ts'o authored
      commit 4753d8a2 upstream.
      
      If the file system requires journal recovery, and the device is
      read-only, return EROFS to the mount system call.  This allows xfstests
      generic/050 to pass.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      48a5889b
    • Theodore Ts'o's avatar
      ext4: preserve the needs_recovery flag when the journal is aborted · 399562b6
      Theodore Ts'o authored
      commit 97abd7d4 upstream.
      
      If the journal is aborted, the needs_recovery feature flag should not
      be removed.  Otherwise, the journal might not get replayed and
      this could lead to more data getting lost.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      399562b6
    • Jan Kara's avatar
      ext4: trim allocation requests to group size · 98f58e05
      Jan Kara authored
      commit cd648b8a upstream.
      
      If filesystem groups are artificially small (using parameter -g to
      mkfs.ext4), ext4_mb_normalize_request() can result in a request that is
      larger than a block group. Trim the request size to not confuse
      allocation code.
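      The trimming amounts to a clamp. A minimal sketch, with a made-up
      function name and plain integer parameters standing in for the
      allocation context:

```c
#include <stdint.h>

/*
 * Hypothetical helper: cap a normalized allocation request (in blocks)
 * at the size of one block group.
 */
uint32_t trim_request(uint32_t requested_blocks, uint32_t blocks_per_group)
{
	if (requested_blocks > blocks_per_group)
		return blocks_per_group;
	return requested_blocks;
}
```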
      Reported-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      98f58e05
    • Theodore Ts'o's avatar
      ext4: fix fencepost in s_first_meta_bg validation · 77bd57e6
      Theodore Ts'o authored
      commit 2ba3e6e8 upstream.
      
      It is OK for s_first_meta_bg to be equal to the number of block group
      descriptor blocks.  (It rarely happens, but it shouldn't cause any
      problems.)
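      
      The boundary condition can be sketched as below. The helper names and
      parameters are invented for this illustration. The number of
      descriptor blocks is taken as the group count divided by descriptors
      per block, rounded up, and a value equal to that count is accepted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of group descriptor blocks needed for 'groups' block groups. */
uint64_t desc_blocks(uint64_t groups, uint64_t desc_per_block)
{
	return (groups + desc_per_block - 1) / desc_per_block;
}

/* Accept first_meta_bg up to and including the descriptor block count. */
bool first_meta_bg_ok(uint64_t first_meta_bg, uint64_t groups,
		      uint64_t desc_per_block)
{
	return first_meta_bg <= desc_blocks(groups, desc_per_block);
}
```

      Using '<' here instead of '<=' would reject the equal case, which is
      the fencepost error being fixed.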
      
      https://bugzilla.kernel.org/show_bug.cgi?id=194567
      
      Fixes: 3a4b77cd
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      77bd57e6
    • Theodore Ts'o's avatar
      jbd2: don't leak modified metadata buffers on an aborted journal · 45f1a95e
      Theodore Ts'o authored
      commit e112666b upstream.
      
      If the journal has been aborted, we shouldn't mark the underlying
      buffer head as dirty, since that will cause the metadata block to get
      modified.  And if the journal has been aborted, we shouldn't allow
      this since it will almost certainly lead to a corrupted file system.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      45f1a95e
    • Eryu Guan's avatar
      ext4: validate s_first_meta_bg at mount time · 188b2ebb
      Eryu Guan authored
      commit 3a4b77cd upstream.
      
      Ralf Spenneberg reported that he hit a kernel crash when mounting a
      modified ext4 image. It turns out that the kernel crashed while
      calculating the fs overhead (ext4_calculate_overhead()) because the
      image has a very large s_first_meta_bg (debug code shows it's
      842150400), and ext4 overruns the memory in count_overhead() when
      setting bits in the bitmap buffer, which is PAGE_SIZE.
      
      ext4_calculate_overhead():
        buf = get_zeroed_page(GFP_NOFS);  <=== PAGE_SIZE buffer
        blks = count_overhead(sb, i, buf);
      
      count_overhead():
        for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
                ext4_set_bit(EXT4_B2C(sbi, s++), buf);   <=== buffer overrun
                count++;
        }
      
      I can reproduce this easily with the following script:
      
        #!/bin/bash
        rm -f fs.img
        mkdir -p /mnt/ext4
        fallocate -l 16M fs.img
        mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
        debugfs -w -R "ssv first_meta_bg 842150400" fs.img
        mount -o loop fs.img /mnt/ext4
      
      Fix it by validating s_first_meta_bg first at mount time, and
      refusing to mount if its value exceeds the largest possible meta_bg
      number.
      
      [js] use EXT4_HAS_INCOMPAT_FEATURE instead of new
           ext4_has_feature_meta_bg
      Reported-by: Ralf Spenneberg <ralf@os-t.de>
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      188b2ebb
  2. 19 Jun, 2017 5 commits
    • Theodore Ts'o's avatar
      ext4: add sanity checking to count_overhead() · d61f4e22
      Theodore Ts'o authored
      commit c48ae41b upstream.
      
      The commit "ext4: sanity check the block and cluster size at mount
      time" should prevent any problems, but in case the superblock is
      modified while the file system is mounted, add an extra safety check
      to make sure we won't overrun the allocated buffer.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      d61f4e22
    • Theodore Ts'o's avatar
      ext4: fix in-superblock mount options processing · bd652ad1
      Theodore Ts'o authored
      commit 5aee0f8a upstream.
      
      Fix a large number of problems with how we handle mount options in the
      superblock.  For one, if the string in the superblock is long enough
      that it is not null terminated, we could run off the end of the string
      and try to interpret superblock fields as characters.  It's unlikely
      this will cause a security problem, but it could result in an invalid
      parse.  Also, parse_options is destructive to the string, so in some
      cases if there is a comma-separated string, it would be modified in
      the superblock.  (Fortunately it only happens on file systems with a
      1k block size.)
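      
      Both problems go away if the parser works on a NUL-terminated private
      copy of the field. A userspace sketch follows, assuming a 64-byte
      options field. copy_mount_opts() and count_options() are hypothetical
      helpers, and strtok() stands in for a parser that modifies its input.

```c
#include <stdlib.h>
#include <string.h>

#define MOUNT_OPTS_LEN 64 /* assumed size of the on-disk options field */

/* Duplicate the fixed-size field and guarantee NUL termination. */
char *copy_mount_opts(const char *field)
{
	char *copy = malloc(MOUNT_OPTS_LEN + 1);

	if (!copy)
		return NULL;
	memcpy(copy, field, MOUNT_OPTS_LEN);
	copy[MOUNT_OPTS_LEN] = '\0';
	return copy;
}

/* Count comma-separated options; destructive on its argument. */
int count_options(char *opts)
{
	int n = 0;
	char *tok;

	for (tok = strtok(opts, ","); tok; tok = strtok(NULL, ","))
		n++;
	return n;
}
```

      Parsing the copy leaves the original field untouched, and a field
      with no terminator cannot be read past its end.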
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      bd652ad1
    • Theodore Ts'o's avatar
      ext4: use more strict checks for inodes_per_block on mount · 408d8245
      Theodore Ts'o authored
      commit cd6bb35b upstream.
      
      Centralize the checks for inodes_per_block and be more strict to make
      sure the inodes_per_block_group can't end up being zero.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      408d8245
    • Liu Bo's avatar
      Btrfs: fix memory leak in reading btree blocks · de714a8a
      Liu Bo authored
      commit 2571e739 upstream.
      
      So we can read a btree block via readahead or intentional read,
      and we can end up with a memory leak when something happens as
      follows,
      1) readahead starts to read block A but does not wait for read
         completion,
      2) btree_readpage_end_io_hook finds that block A is corrupted,
         and it needs to clear all block A's pages' uptodate bit.
      3) meanwhile an intentional read kicks in and checks block A's
         pages' uptodate to decide which page needs to be read.
      4) some pages still have the uptodate bit set during 3)'s check,
         so 3) doesn't count them in eb->io_pages, but 2) clears them
         later, so we have to readpage on those pages. We end up with
         the wrong eb->io_pages, which results in a memory leak of
         this block.
      
      This fixes the problem by first locking all of the pages and
      then checking the pages' uptodate bits.
      
         t1(readahead)                              t2(readahead endio)                                       t3(the following read)
      read_extent_buffer_pages                    end_bio_extent_readpage
        for pg in eb:                                for page 0,1,2 in eb:
            if pg is uptodate:                           btree_readpage_end_io_hook(pg)
                num_reads++                              if uptodate:
        eb->io_pages = num_reads                             SetPageUptodate(pg)              _______________
        for pg in eb:                                for page 3 in eb:                                     read_extent_buffer_pages
             if pg is NOT uptodate:                      btree_readpage_end_io_hook(pg)                       for pg in eb:
                 __extent_read_full_page(pg)                 sanity check reports something wrong                 if pg is uptodate:
                                                             clear_extent_buffer_uptodate(eb)                         num_reads++
                                                                 for pg in eb:                                eb->io_pages = num_reads
                                                                     ClearPageUptodate(page)  _______________
                                                                                                              for pg in eb:
                                                                                                                  if pg is NOT uptodate:
                                                                                                                      __extent_read_full_page(pg)
      
      So t3's eb->io_pages is not consistent with the number of pages it's reading,
      and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative
      number so that we're not able to free the eb.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      de714a8a
    • Jeff Mahoney's avatar
      Revert "Btrfs: don't delay inode ref updates during log, replay" · 0ee88216
      Jeff Mahoney authored
      commit 081fafdd upstream.
      
      This reverts commit 644d1071, upstream
      commit 6f896054.
      
      The original patch for mainline, 6f896054 (Btrfs: don't delay
      inode ref updates during log replay) lists 1d52c78a (Btrfs: try
      not to ENOSPC on log replay) as the only pre-3.18 dependency, but it
      also depends on 67de1176 (Btrfs: introduce the delayed inode ref
      deletion for the single link inode), which was introduced in 3.14
      and isn't in 3.12.y.
      
      The -stable commit added the check to btrfs_delayed_update_inode,
      which may look similar to btrfs_delayed_delete_inode_ref, but it's
      only superficial.  The tops of both functions handle typical
      delayed node boilerplate.  The upshot is that the patch is harmless
      since the caller already checks to see if we're doing log recovery,
      so we're not breaking anything.  It should be reverted because it
      makes it appear as if this issue was fixed for users who did
      backport 67de1176, when it is not.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      0ee88216