1. 08 Nov, 2013 11 commits
    • block: Enable sysfs nomerge control for I/O requests in the plug list · 23779fbc
      Alireza Haghdoost authored
      This patch enables sysfs to control I/O request merging in the plug
      list. While this control has been implemented for the request queue,
      it was overlooked in the plug list. As a result, the block layer
      merges (or attempts to merge) requests even if merging was disabled
      by setting the sysfs nomerges parameter to 2.

      This limitation directly affects the io_submit() system call, which
      lets a user submit a batch of I/O requests from user space through
      the struct iocb **ios argument. However, the unconditional merging
      in the plug list can merge these requests further down the stack.
      There is therefore no way to distinguish an application sending a
      batch of sequential I/Os from an application sending one large I/O:
      all requests generated by the former are merged in the plug list and
      look like those of the latter.

      While merging is desirable for improving the performance of the I/O
      subsystem for some applications, it is not useful at all for others,
      such as ours.
      Signed-off-by: Alireza Haghdoost <alireza@cs.umn.edu>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

      Coding style modified.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
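The behaviour described in this commit can be sketched in userspace. This is a minimal model under stated assumptions, not the kernel code: `struct req_model`, `struct queue_model`, `try_plug_merge()` and `submit_io_model()` are hypothetical names, and only back merges of contiguous sector ranges are modelled. It shows the plug-list merge attempt being skipped when the queue's nomerges value is 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PLUG_SIZE 16

/* Hypothetical model of a request: a contiguous sector range. */
struct req_model {
    unsigned long start;   /* first sector */
    unsigned long len;     /* number of sectors */
};

/* Hypothetical model of a queue with a small plug list. */
struct queue_model {
    int nomerges;                            /* mirrors the sysfs value */
    struct req_model plug[MODEL_PLUG_SIZE];  /* requests in the plug list */
    size_t nr_plugged;
};

/* Try to back-merge a new I/O into an already plugged request. */
static bool try_plug_merge(struct queue_model *q, unsigned long start,
                           unsigned long len)
{
    for (size_t i = 0; i < q->nr_plugged; i++) {
        struct req_model *rq = &q->plug[i];

        if (rq->start + rq->len == start) {
            rq->len += len;
            return true;
        }
    }
    return false;
}

/* Submit one I/O; returns false if the plug list is full. */
static bool submit_io_model(struct queue_model *q, unsigned long start,
                            unsigned long len)
{
    /* Skip the merge attempt entirely when merging is disabled (2). */
    if (q->nomerges != 2 && try_plug_merge(q, start, len))
        return true;

    if (q->nr_plugged == MODEL_PLUG_SIZE)
        return false;
    q->plug[q->nr_plugged].start = start;
    q->plug[q->nr_plugged].len = len;
    q->nr_plugged++;
    return true;
}
```

With nomerges set to 0, two adjacent submissions collapse into one plugged request; with nomerges set to 2 they stay separate, which is what lets a batch of sequential I/Os remain distinguishable from one large I/O.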
    • block: properly stack underlying max_segment_size to DM device · d82ae52e
      Mike Snitzer authored
      Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE
      (65536) even if the underlying device(s) have a larger value -- this is
      due to blk_stack_limits() using min_not_zero() when stacking the
      max_segment_size limit.
      
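      max_segment_size of the underlying device:
      1073741824

      max_segment_size of the DM device before patch:
      65536

      max_segment_size of the DM device after patch:
      1073741824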
      Reported-by: Lukasz Flis <l.flis@cyfronet.pl>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org # v3.3+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
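The stacking behaviour described in this commit can be sketched in userspace. `min_not_zero_model()` and `stack_max_segment_size()` are illustrative names; the first mirrors the semantics of the kernel's min_not_zero() macro. Starting the stacked device at UINT_MAX instead of the 65536 default is an assumption made here to show how a larger underlying value could survive stacking; the commit text does not spell out the actual change.

```c
#include <assert.h>
#include <limits.h>

/* Default named in the commit message (BLK_MAX_SEGMENT_SIZE). */
#define MODEL_BLK_MAX_SEGMENT_SIZE 65536u

/* Smaller of the two values, ignoring zeros. */
static unsigned int min_not_zero_model(unsigned int a, unsigned int b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/* Stack a bottom device's limit onto the top (DM) device's limit. */
static unsigned int stack_max_segment_size(unsigned int top,
                                           unsigned int bottom)
{
    return min_not_zero_model(top, bottom);
}
```

If the top device starts at the 65536 default, the result can never exceed 65536, whatever the bottom device reports. If it starts at a "no limit" value such as UINT_MAX, the bottom device's value is what remains.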
    • elevator: acquire q->sysfs_lock in elevator_change() · 7c8a3679
      Tomoki Sekiyama authored
      Acquire q->sysfs_lock in elevator_change() (an exported function) to
      ensure the lock is held to protect q->elevator from elevator_init(),
      even if elevator_change() is called from non-sysfs paths.
      The sysfs path (elv_iosched_store) uses __elevator_change(), the
      non-locking version, as the lock is already taken by
      elv_iosched_store().
      Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
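The split between a locking entry point and a non-locking helper can be sketched as follows. This is a userspace illustration using a pthread mutex, not the kernel implementation; `struct queue_model` and all `*_model` functions are hypothetical names, and the scheduler is reduced to a name string.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical stand-in for a request queue. */
struct queue_model {
    pthread_mutex_t sysfs_lock;
    char elevator[16];          /* name of the active scheduler */
};

/* Non-locking variant: the caller must already hold q->sysfs_lock. */
static int elevator_change_nolock_model(struct queue_model *q,
                                        const char *name)
{
    if (strlen(name) >= sizeof(q->elevator))
        return -1;
    strcpy(q->elevator, name);
    return 0;
}

/* Locking variant for callers outside the sysfs path. */
static int elevator_change_model(struct queue_model *q, const char *name)
{
    int ret;

    pthread_mutex_lock(&q->sysfs_lock);
    ret = elevator_change_nolock_model(q, name);
    pthread_mutex_unlock(&q->sysfs_lock);
    return ret;
}

/* sysfs store path: takes the lock itself, then calls the non-locking
 * variant so the lock is never acquired twice. */
static int elv_iosched_store_model(struct queue_model *q, const char *name)
{
    int ret;

    pthread_mutex_lock(&q->sysfs_lock);
    ret = elevator_change_nolock_model(q, name);
    pthread_mutex_unlock(&q->sysfs_lock);
    return ret;
}
```

Keeping the non-locking helper private to the file and exporting only the locking wrapper means external callers cannot forget the lock, while the sysfs path avoids a recursive acquisition.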
    • elevator: Fix a race in elevator switching and md device initialization · eb1c160b
      Tomoki Sekiyama authored
      The soft lockup below happens at boot time on a system that uses dm
      multipath and udev rules to switch the scheduler.
      
      [  356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
      [  356.127001] RIP: 0010:[<ffffffff81072a7d>]  [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
      ...
      [  356.127001] Call Trace:
      [  356.127001]  [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
      [  356.127001]  [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
      [  356.127001]  [<ffffffff810738b2>] del_timer_sync+0x52/0x60
      [  356.127001]  [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
      [  356.127001]  [<ffffffff812c98df>] elevator_exit+0x2f/0x50
      [  356.127001]  [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
      [  356.127001]  [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
      [  356.127001]  [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
      [  356.127001]  [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
      [  356.127001]  [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
      [  356.127001]  [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
      [  356.127001]  [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
      
      This is caused by a race between md device initialization by multipathd and
      a shell script that switches the scheduler using sysfs.
      
       - multipathd:
         SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
         -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
          q->elevator = elevator_alloc(q, e); // not yet initialized
      
       - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
         elevator_switch (in the call trace above)
          struct elevator_queue *old = q->elevator;
          q->elevator = elevator_alloc(q, new_e);
          elevator_exit(old);                 // lockup! (*)
      
       - multipathd: (cont.)
          err = e->ops.elevator_init_fn(q);   // init fails; q->elevator is modified
      
      (*) When del_timer_sync() is called, lock_timer_base() loops indefinitely
      while timer->base == NULL. In this case, as the timer will never be
      initialized, this results in a lockup.

      This patch acquires q->sysfs_lock around elevator_init() in
      blk_init_allocated_queue(), to provide mutual exclusion between
      initialization of q->elevator and switching of the scheduler.
      
      This should fix this bugzilla:
      https://bugzilla.redhat.com/show_bug.cgi?id=902012
      Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Replace __get_cpu_var uses · 170d800a
      Christoph Lameter authored
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processor's percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as :
      
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      
      __get_cpu_var() always only does an address determination. However, store
      and retrieve operations could use a segment prefix (or global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  Address calculations are thereby avoided and fewer
      registers are used when code is generated.
      
      At the end of the patch set all uses of __get_cpu_var have been removed so
      the macro is removed too.
      
      The patch set includes passes over all arches as well. Once these operations
      are used throughout, specialized macros can be defined in non-x86 arches
      as well in order to optimize per cpu access, for example by using a global
      register that may be set to the per cpu base.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processor's instance of a per cpu
      variable.

      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y);
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	this_cpu_inc(y)
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bdi: test bdi_init failure · 8077c0d9
      Mikulas Patocka authored
      There were two places where the return value from bdi_init was not tested.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix a probe argument to blk_register_region · a207f593
      Mikulas Patocka authored
      The probe function is supposed to return NULL on failure, as we can see in
      kobj_lookup: kobj = probe(dev, index, data); ... if (kobj) return kobj;

      However, in loop and brd, it returns a negative error from ERR_PTR.

      This causes a crash if we simulate disk allocation failure and run
      less -f /dev/loop0 because the negative number is interpreted as a pointer:
      
      BUG: unable to handle kernel NULL pointer dereference at 00000000000002b4
      IP: [<ffffffff8118b188>] __blkdev_get+0x28/0x450
      PGD 23c677067 PUD 23d6d1067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in: loop hpfs nvidia(PO) ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_stats cpufreq_ondemand cpufreq_userspace cpufreq_powersave cpufreq_conservative hid_generic spadfs usbhid hid fuse raid0 snd_usb_audio snd_pcm_oss snd_mixer_oss md_mod snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib dmi_sysfs snd_rawmidi nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd soundcore lm85 hwmon_vid ohci_hcd ehci_pci ehci_hcd serverworks sata_svw libata acpi_cpufreq freq_table mperf ide_core usbcore kvm_amd kvm tg3 i2c_piix4 libphy microcode e100 usb_common ptp skge i2c_core pcspkr k10temp evdev floppy hwmon pps_core mii rtc_cmos button processor unix [last unloaded: nvidia]
      CPU: 1 PID: 6831 Comm: less Tainted: P        W  O 3.10.15-devel #18
      Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
      task: ffff880203cc6bc0 ti: ffff88023e47c000 task.ti: ffff88023e47c000
      RIP: 0010:[<ffffffff8118b188>]  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
      RSP: 0018:ffff88023e47dbd8  EFLAGS: 00010286
      RAX: ffffffffffffff74 RBX: ffffffffffffff74 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
      RBP: ffff88023e47dc18 R08: 0000000000000002 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff88023f519658
      R13: ffffffff8118c300 R14: 0000000000000000 R15: ffff88023f519640
      FS:  00007f2070bf7700(0000) GS:ffff880247400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000002b4 CR3: 000000023da1d000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
       0000000000000002 0000001d00000000 000000003e47dc50 ffff88023f519640
       ffff88043d5bb668 ffffffff8118c300 ffff88023d683550 ffff88023e47de60
       ffff88023e47dc98 ffffffff8118c10d 0000001d81605698 0000000000000292
      Call Trace:
       [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
       [<ffffffff8118c10d>] blkdev_get+0x1dd/0x370
       [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
       [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
       [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
       [<ffffffff8118c365>] blkdev_open+0x65/0x80
       [<ffffffff8114d12e>] do_dentry_open.isra.18+0x23e/0x2f0
       [<ffffffff8114d214>] finish_open+0x34/0x50
       [<ffffffff8115e122>] do_last.isra.62+0x2d2/0xc50
       [<ffffffff8115eb58>] path_openat.isra.63+0xb8/0x4d0
       [<ffffffff81115a8e>] ? might_fault+0x4e/0xa0
       [<ffffffff8115f4f0>] do_filp_open+0x40/0x90
       [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
       [<ffffffff8116db85>] ? __alloc_fd+0xa5/0x1f0
       [<ffffffff8114e45f>] do_sys_open+0xef/0x1d0
       [<ffffffff8114e559>] SyS_open+0x19/0x20
       [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
      Code: 44 00 00 55 48 89 e5 41 57 49 89 ff 41 56 41 89 d6 41 55 41 54 4c 8d 67 18 53 48 83 ec 18 89 75 cc e9 f2 00 00 00 0f 1f 44 00 00 <48> 8b 80 40 03 00 00 48 89 df 4c 8b 68 58 e8 d5 a4 07 00 44 89
      RIP  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
       RSP <ffff88023e47dbd8>
      CR2: 00000000000002b4
      ---[ end trace bb7f32dbf02398dc ]---
      
      The brd change should be backported to stable kernels starting with 2.6.25.
      The loop change should be backported to stable kernels starting with 2.6.22.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org	# 2.6.22+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
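The mismatch between an error-pointer return and a NULL check can be sketched in userspace. `err_ptr_model()` and `is_err_model()` are simplified models of the kernel's ERR_PTR() and IS_ERR() helpers; `alloc_disk_model()` and `probe_model()` are hypothetical functions invented for this illustration and do not correspond to the loop or brd code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *err_ptr_model(long error)
{
    return (void *)(intptr_t)error;
}

/* True if the pointer lies in the range reserved for encoded errors. */
static int is_err_model(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct kobj_model { int id; };

static struct kobj_model the_disk = { .id = 7 };

/* Hypothetical allocator: returns an encoded errno when asked to fail. */
static struct kobj_model *alloc_disk_model(int fail)
{
    return fail ? err_ptr_model(-ENOMEM) : &the_disk;
}

/* Probe callback: the lookup code only tests for NULL, so an error
 * pointer must be converted to NULL before it is returned. */
static struct kobj_model *probe_model(int fail)
{
    struct kobj_model *kobj = alloc_disk_model(fail);

    if (is_err_model(kobj))
        return NULL;
    return kobj;
}
```

An encoded error is a non-NULL value, so a caller that checks `if (kobj)` would treat it as a valid object and dereference it. Converting at the probe boundary keeps the caller's NULL check sufficient.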
    • loop: fix crash if blk_alloc_queue fails · 3ec981e3
      Mikulas Patocka authored
      If blk_alloc_queue fails, loop_add cleans up, but it doesn't clean up the
      identifier allocated with idr_alloc. That causes a crash on module unload in
      idr_for_each(&loop_index_idr, &loop_exit_cb, NULL); where we attempt to
      remove a non-existent device with that id.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000380
      IP: [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
      PGD 43d399067 PUD 43d0ad067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      Modules linked in: loop(-) dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_ondemand cpufreq_conservative cpufreq_powersave spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc lm85 hwmon_vid snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq ohci_hcd freq_table tg3 ehci_pci mperf ehci_hcd kvm_amd kvm sata_svw serverworks libphy libata ide_core k10temp usbcore hwmon microcode ptp pcspkr pps_core e100 skge mii usb_common i2c_piix4 floppy evdev rtc_cmos i2c_core processor button unix
      CPU: 7 PID: 2735 Comm: rmmod Tainted: G        W    3.10.15-devel #15
      Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
      task: ffff88043d38e780 ti: ffff88043d21e000 task.ti: ffff88043d21e000
      RIP: 0010:[<ffffffff812057c9>]  [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
      RSP: 0018:ffff88043d21fe10  EFLAGS: 00010282
      RAX: ffffffffa05102e0 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffff88043ea82800 RDI: 0000000000000000
      RBP: ffff88043d21fe48 R08: 0000000000000000 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000ff
      R13: 0000000000000080 R14: 0000000000000000 R15: ffff88043ea82800
      FS:  00007ff646534700(0000) GS:ffff880447000000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000380 CR3: 000000043e9bf000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
       ffffffff8100aba4 0000000000000092 ffff88043d21fe48 ffff88043ea82800
       00000000000000ff ffff88043d21fe98 0000000000000000 ffff88043d21fe60
       ffffffffa05102b4 0000000000000000 ffff88043d21fe70 ffffffffa05102ec
      Call Trace:
       [<ffffffff8100aba4>] ? native_sched_clock+0x24/0x80
       [<ffffffffa05102b4>] loop_remove+0x14/0x40 [loop]
       [<ffffffffa05102ec>] loop_exit_cb+0xc/0x10 [loop]
       [<ffffffff81217b74>] idr_for_each+0x104/0x190
       [<ffffffffa05102e0>] ? loop_remove+0x40/0x40 [loop]
       [<ffffffff8109adc5>] ? trace_hardirqs_on_caller+0x105/0x1d0
       [<ffffffffa05135dc>] loop_exit+0x34/0xa58 [loop]
       [<ffffffff810a98ea>] SyS_delete_module+0x13a/0x260
       [<ffffffff81221d5e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
       [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
      Code: f0 4c 8b 6d f8 c9 c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 56 41 55 4c 8d af 80 00 00 00 41 54 53 48 89 fb 48 83 ec 18 <48> 83 bf 80 03 00 00 00 74 4d e8 98 fe ff ff 31 f6 48 c7 c7 20
      RIP  [<ffffffff812057c9>] del_gendisk+0x19/0x2d0
       RSP <ffff88043d21fe10>
      CR2: 0000000000000380
      ---[ end trace 64ec069ec70f1309 ]---
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org	# 3.1+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
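The error-path problem described in this commit, an identifier that stays registered after the rest of the setup is undone, can be sketched with a goto-based unwind. Everything here is a hypothetical userspace model: `id_table`, `id_alloc_model()`, `id_remove_model()` and `dev_add_model()` are invented for illustration, with a fixed array standing in for the IDR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_MAX_IDS 8

/* Hypothetical id table standing in for the driver's IDR. */
static void *id_table[MODEL_MAX_IDS];

/* Register ptr under the lowest free id; returns -1 if the table is full. */
static int id_alloc_model(void *ptr)
{
    for (int i = 0; i < MODEL_MAX_IDS; i++) {
        if (!id_table[i]) {
            id_table[i] = ptr;
            return i;
        }
    }
    return -1;
}

static void id_remove_model(int id)
{
    id_table[id] = NULL;
}

struct dev_model { int number; };

/* Add a device; queue_alloc_fails simulates the queue allocation failing. */
static int dev_add_model(bool queue_alloc_fails)
{
    struct dev_model *dev = malloc(sizeof(*dev));
    int id;

    if (!dev)
        return -1;

    id = id_alloc_model(dev);
    if (id < 0)
        goto out_free_dev;
    dev->number = id;

    if (queue_alloc_fails)
        goto out_free_id;   /* release the id as well as the memory */

    return id;

out_free_id:
    id_remove_model(id);
out_free_dev:
    free(dev);
    return -1;
}
```

Without the `out_free_id` step, the table would keep a pointer to freed memory, and a later walk over all registered ids (as happens on module unload) would operate on a device that no longer exists.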
    • blk-core: Fix memory corruption if blkcg_init_queue fails · fff4996b
      Mikulas Patocka authored
      If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
      to clean up structures allocated by the backing dev.
      
      ------------[ cut here ]------------
      WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
      ODEBUG: free active (active state 0) object type: percpu_counter hint:           (null)
      Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
      CPU: 0 PID: 2739 Comm: lvchange Tainted: G        W    3.10.15-devel #14
      Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
       0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
       ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
       ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
      Call Trace:
       [<ffffffff813c8fd4>] dump_stack+0x19/0x1b
       [<ffffffff810399eb>] warn_slowpath_common+0x6b/0xa0
       [<ffffffff81039a67>] warn_slowpath_fmt+0x47/0x50
       [<ffffffff8122aaaf>] ? debug_check_no_obj_freed+0xcf/0x250
       [<ffffffff81229a15>] debug_print_object+0x85/0xa0
       [<ffffffff8122abe3>] debug_check_no_obj_freed+0x203/0x250
       [<ffffffff8113c4ac>] kmem_cache_free+0x20c/0x3a0
       [<ffffffff811f6709>] blk_alloc_queue_node+0x2a9/0x2c0
       [<ffffffff811f672e>] blk_alloc_queue+0xe/0x10
       [<ffffffffa04c0093>] dm_create+0x1a3/0x530 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6c07>] dev_create+0x57/0x2b0 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6528>] ctl_ioctl+0x268/0x500 [dm_mod]
       [<ffffffff81097662>] ? get_lock_stats+0x22/0x70
       [<ffffffffa04c67ce>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
       [<ffffffff81161aad>] do_vfs_ioctl+0x2ed/0x520
       [<ffffffff8116cfc7>] ? fget_light+0x377/0x4e0
       [<ffffffff81161d2b>] SyS_ioctl+0x4b/0x90
       [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
      ---[ end trace 4b5ff0d55673d986 ]---
      ------------[ cut here ]------------
      
      This fix should be backported to stable kernels starting with 2.6.37. Note
      that in the kernels prior to 3.5 the affected code is different, but the
      bug is still there - bdi_init is called and bdi_destroy isn't.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org	# 2.6.37+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix race between request completion and timeout handling · 4912aa6c
      Jeff Moyer authored
      crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      
      Pid: 491, comm: scsi_eh_0 Tainted: G        W  ----------------   2.6.32-220.13.1.el6.x86_64 #1 IBM  -[8722PAX]-/00D1461
      RIP: 0010:[<ffffffff8124e424>]  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
      RSP: 0018:ffff881057eefd60  EFLAGS: 00010012
      RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
      RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
      RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
      R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
      FS:  0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
      Stack:
       0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
      <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
      <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
      Call Trace:
       [<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150
       [<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850
       [<ffffffff81362a23>] scsi_queue_insert+0x13/0x20
       [<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160
       [<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660
       [<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660
       [<ffffffff810908c6>] kthread+0x96/0xa0
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffff81090830>] ? kthread+0x0/0xa0
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
      RIP  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
       RSP <ffff881057eefd60>
      
      The RIP is this line:
              BUG_ON(blk_queued_rq(rq));
      
      After digging through the code, I think there may be a race between the
      request completion and the timer handler running.
      
      A timer is started for each request put on the device's queue (see
      blk_start_request->blk_add_timer).  If the request does not complete
      before the timer expires, the timer handler (blk_rq_timed_out_timer)
      will mark the request complete atomically:
      
      static inline int blk_mark_rq_complete(struct request *rq)
      {
              return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
      }
      
      and then call blk_rq_timed_out.  The latter function will call
      scsi_times_out, which will return one of BLK_EH_HANDLED,
      BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED.  If BLK_EH_RESET_TIMER is
      returned, blk_clear_rq_complete is called, and blk_add_timer is again
      called to simply wait longer for the request to complete.
      
      Now, if the request happens to complete while this is going on, what
      happens?  Given that we know the completion handler will bail if it
      finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
      handler running after that bit is cleared.  So, from the above
      paragraph, after the call to blk_clear_rq_complete.  If the completion
      sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
      there (I haven't seen this in the cores).  Next, if we get the
      completion before the call to list_add_tail, then the timer will
      eventually fire for an old req, which may either be freed or reallocated
      (there is evidence that this might be the case).  Finally, if the
      completion comes in *after* the addition to the timeout list, I think
      it's harmless.  The request will be removed from the timeout list,
      REQ_ATOM_COMPLETE will be set, and all will be well.
      
      This will only actually explain the coredumps *IF* the request
      structure was freed, reallocated *and* queued before the error handler
      thread had a chance to process it.  That is possible, but it may make
      sense to keep digging for another race.  I think that if this is what
      was happening, we would see other instances of this problem showing up
      as null pointer or garbage pointer dereferences, for example when the
      request structure was not re-used.  It looks like we actually do run
      into that situation in other reports.
      
      This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
      &req->atomic_flags)); from blk_add_timer to the only caller that could
      trip over it (blk_start_request).  It then inverts the calls to
      blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
      the race.  I've boot tested this patch, but nothing more.
      Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
      Acked-by: Hannes Reinecke <hare@suse.de>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
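The ordering change described in this commit can be sketched with C11 atomics. This is a single-threaded userspace model, so it illustrates the order of the steps without reproducing the race itself. `struct rq_model` and all `*_model` functions are hypothetical; the atomic flag stands in for REQ_ATOM_COMPLETE and a boolean stands in for membership of the timeout list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a request. */
struct rq_model {
    atomic_flag complete;   /* models REQ_ATOM_COMPLETE */
    bool on_timeout_list;
    bool finished;
};

/* Returns true if the bit was already set, i.e. another path owns
 * the request. */
static bool mark_rq_complete_model(struct rq_model *rq)
{
    return atomic_flag_test_and_set(&rq->complete);
}

static void clear_rq_complete_model(struct rq_model *rq)
{
    atomic_flag_clear(&rq->complete);
}

static void add_timer_model(struct rq_model *rq)
{
    rq->on_timeout_list = true;
}

/* Completion path: bails out if the timeout handler owns the request. */
static bool complete_request_model(struct rq_model *rq)
{
    if (mark_rq_complete_model(rq))
        return false;
    rq->on_timeout_list = false;
    rq->finished = true;
    return true;
}

/* Timeout handler extending the deadline (the BLK_EH_RESET_TIMER case).
 * The timer is re-armed while the bit is still set, and only then is
 * the bit cleared, so a completion cannot run between the two steps. */
static void reset_timer_model(struct rq_model *rq)
{
    add_timer_model(rq);
    clear_rq_complete_model(rq);
}
```

With the opposite order (clear first, then add the timer), a completion arriving between the two calls would finish the request and the timer would then be armed for a request that is already done.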
    • blktrace: Send BLK_TN_PROCESS events to all running traces · a404d557
      Jan Kara authored
      Currently each task sends a BLK_TN_PROCESS event to the first traced
      device it interacts with after a new trace is started. When there are
      several traced devices and the task accesses more than one of them,
      this logic can result in BLK_TN_PROCESS being sent several times to
      some devices while it is never sent to others. As a result, blkparse
      doesn't display the command name when parsing some blktrace files.

      Fix the problem by sending a BLK_TN_PROCESS event to all traced devices
      when a task interacts with any of them.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 03 Nov, 2013 3 commits
  3. 02 Nov, 2013 2 commits
  4. 01 Nov, 2013 20 commits
  5. 31 Oct, 2013 4 commits
    • Merge branch 'akpm' (fixes from Andrew Morton) · 4f794ee8
      Linus Torvalds authored
      Merge four more fixes from Andrew Morton.
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        lib/scatterlist.c: don't flush_kernel_dcache_page on slab page
        mm: memcg: fix test for child groups
        mm: memcg: lockdep annotation for memcg OOM lock
        mm: memcg: use proper memcg in limit bypass
    • lib/scatterlist.c: don't flush_kernel_dcache_page on slab page · 3d77b50c
      Ming Lei authored
      Commit b1adaf65 ("[SCSI] block: add sg buffer copy helper
      functions") introduced two sg buffer copy helpers that call
      flush_kernel_dcache_page() on pages in the SG list after these pages
      are written to.

      Unfortunately, the commit may introduce a potential bug:

       - Before sending some SCSI commands, a kmalloc() buffer may be passed
         to the block layer, so flush_kernel_dcache_page() can end up seeing
         a slab page.

       - According to cachetlb.txt, flush_kernel_dcache_page() is only called
         on "a user page", which surely can't be a slab page.

       - An architecture's implementation of flush_kernel_dcache_page() may
         use page mapping information for optimization, so page_mapping()
         will see the slab page and VM_BUG_ON() is triggered.

      Aaro Koskinen reported the bug on ARM/kirkwood when DEBUG_VM is enabled,
      and this patch fixes the bug by adding a test of '!PageSlab(miter->page)'
      before calling flush_kernel_dcache_page().
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Tested-by: Simon Baatz <gmbnomis@gmail.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Aaro Koskinen <aaro.koskinen@iki.fi>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
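The guard this commit adds can be sketched in userspace. The model below is an assumption-laden illustration: `struct page_model`, `flush_dcache_model()` and `copy_stop_model()` are hypothetical names, with `is_slab` standing in for the PageSlab() test and a counter standing in for the effect of flush_kernel_dcache_page().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page. */
struct page_model {
    bool is_slab;       /* stands in for PageSlab(page) */
    int flush_count;    /* how often the cache flush ran on this page */
};

/* Stand-in for flush_kernel_dcache_page(). */
static void flush_dcache_model(struct page_model *page)
{
    page->flush_count++;
}

/* End of a copy step: flush only pages that were written to and that
 * are not slab pages. */
static void copy_stop_model(struct page_model *page, bool wrote_to_page)
{
    if (wrote_to_page && !page->is_slab)
        flush_dcache_model(page);
}
```

A written-to user page is flushed once, while a slab page (for example a kmalloc() buffer) is skipped, so the flush routine never has to inspect mapping information for a slab page.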
    • mm: memcg: fix test for child groups · 696ac172
      Johannes Weiner authored
      When memcg code needs to know whether any given memcg has children, it
      uses the cgroup child iteration primitives and returns true/false
      depending on whether the iteration loop is executed at least once or
      not.
      
      Because a cgroup's list of children is RCU protected, these primitives
      require the RCU read-lock to be held, which is not the case for all
      memcg callers.  This results in the following splat when e.g.  enabling
      hierarchy mode:
      
        WARNING: CPU: 3 PID: 1 at kernel/cgroup.c:3043 css_next_child+0xa3/0x160()
        CPU: 3 PID: 1 Comm: systemd Not tainted 3.12.0-rc5-00117-g83f11a9c-dirty #18
        Hardware name: LENOVO 3680B56/3680B56, BIOS 6QET69WW (1.39 ) 04/26/2012
        Call Trace:
          dump_stack+0x54/0x74
          warn_slowpath_common+0x78/0xa0
          warn_slowpath_null+0x1a/0x20
          css_next_child+0xa3/0x160
          mem_cgroup_hierarchy_write+0x5b/0xa0
          cgroup_file_write+0x108/0x2a0
          vfs_write+0xbd/0x1e0
          SyS_write+0x4c/0xa0
          system_call_fastpath+0x16/0x1b
      
      In the memcg case, we only care about children when we are attempting to
      modify inheritable attributes interactively.  Racing with deletion could
      mean a spurious -EBUSY, no problem.  Racing with addition is handled
      just fine as well through the memcg_create_mutex: if the child group is
      not on the list after the mutex is acquired, it won't be initialized
      from the parent's attributes until after the unlock.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: lockdep annotation for memcg OOM lock · 0056f4e6
      Johannes Weiner authored
      The memcg OOM lock is a mutex-type lock that is open-coded due to
      memcg's special needs.  Add annotations for lockdep coverage.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>