1. 30 Apr, 2013 2 commits
    • Dmitry Monakhov's avatar
      relay: move remove_buf_file inside relay_close_buf · b8d4a5bf
      Dmitry Monakhov authored
      Currently remove_buf_file callback is called from from kobject
      release method. This result in follow issue:
      # blktrace -d /dev/sda1 -d /dev/sda -o test
      
      blktrace_setup()
       dir = create_dir()
       rchan = relay_open(dir,...)
       ->create_buf_file_callback
          buf_file  = debugfs_create_file(dir, )
      
      Userspace will open buf_file.
      Later we make a decision to stop tracing
      blktrace_down()
        relay_close(rhcan)  /* just decrement kobj reference  */
                            /* since it is not zero then callback not called */
        debugfs_remove(dir) /* FAIL due to non empty dir   */
      
      Later user space will close the file and file will be deleted,
      but directory still exist.
      user_space_close()
       ->file_release
         ->release_buf_file_callback
           ->debugfs_remove(buf_file
      ## TESTCASE:
      # blktrace -d /dev/sda1 -d /dev/sda -o test
      # After that blktrace infrastructure will remain broken in
      # an unusable state so: blktrace -d /dev/sda1 will not work.
      
      In fact this is general issue, blktrace is just one of examples.
      We can not reliably remove parent dir until all users close the
      buf_file.
      
      Solution: We don't have to wait that long. File should be deleted inside
      relay_close_buf().
      Signed-off-by: default avatarDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b8d4a5bf
    • Philippe De Muyter's avatar
      partitions/efi.c: replace useless kzalloc's by kmalloc's · ea56505b
      Philippe De Muyter authored
      In alloc_read_gpt_entries and alloc_read_gpt_header, the kzalloc'ated
      zones are either totally overwritten by the following read_lba call,
      or freed.  As kmalloc is cheaper than kzalloc, use kmalloc.
      Signed-off-by: default avatarPhilippe De Muyter <phdm@macqel.be>
      Cc: Matt Domsch <Matt_Domsch@dell.com>
      Cc: Panagiotis Issaris <takis@issaris.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ea56505b
  2. 29 Apr, 2013 1 commit
  3. 24 Apr, 2013 1 commit
    • James Bottomley's avatar
      block: fix max discard sectors limit · 871dd928
      James Bottomley authored
      linux-v3.8-rc1 and later support for plug for blkdev_issue_discard with
      commit 0cfbcafc
      (block: add plug for blkdev_issue_discard )
      
      For example,
      1) DISCARD rq-1 with size size 4GB
      2) DISCARD rq-2 with size size 1GB
      
      If these 2 discard requests get merged, final request size will be 5GB.
      
      In this case, request's __data_len field may overflow as it can store
      max 4GB(unsigned int).
      
      This issue was observed while doing mkfs.f2fs on 5GB SD card:
      https://lkml.org/lkml/2013/4/1/292
      
      Info: sector size = 512
      Info: total sectors = 11370496 (in 512bytes)
      Info: zone aligned segment0 blkaddr: 512
      [  257.789764] blk_update_request: bio idx 0 >= vcnt 0
      
      mkfs process gets stuck in D state and I see the following in the dmesg:
      
      [  257.789733] __end_that: dev mmcblk0: type=1, flags=122c8081
      [  257.789764]   sector 4194304, nr/cnr 2981888/4294959104
      [  257.789764]   bio df3840c0, biotail df3848c0, buffer   (null), len
      1526726656
      [  257.789764] blk_update_request: bio idx 0 >= vcnt 0
      [  257.794921] request botched: dev mmcblk0: type=1, flags=122c8081
      [  257.794921]   sector 4194304, nr/cnr 2981888/4294959104
      [  257.794921]   bio df3840c0, biotail df3848c0, buffer   (null), len
      1526726656
      
      This patch fixes this issue.
      Reported-by: default avatarMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: default avatarJames Bottomley <JBottomley@Parallels.com>
      Signed-off-by: default avatarNamjae Jeon <namjae.jeon@samsung.com>
      Tested-by: default avatarMax Filippov <jcmvbkbc@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      871dd928
  4. 09 Apr, 2013 2 commits
    • Jun'ichi Nomura's avatar
      blkcg: fix "scheduling while atomic" in blk_queue_bypass_start · e5072664
      Jun'ichi Nomura authored
      Since 749fefe6 in v3.7 ("block: lift the initial queue bypass mode
      on blk_register_queue() instead of blk_init_allocated_queue()"),
      the following warning appears when multipath is used with CONFIG_PREEMPT=y.
      
      This patch moves blk_queue_bypass_start() before radix_tree_preload()
      to avoid the sleeping call while preemption is disabled.
      
        BUG: scheduling while atomic: multipath/2460/0x00000002
        1 lock held by multipath/2460:
         #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
        Modules linked in: ...
        Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
        Call Trace:
         [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
         [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
         [<ffffffff814291e6>] schedule+0x64/0x66
         [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
         [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
         [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
         [<ffffffff814289e3>] wait_for_common+0x9d/0xee
         [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
         [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
         [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
         [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
         [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
         [<ffffffff8106fb18>] ? complete+0x21/0x53
         [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
         [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
         [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
         [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
         [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
         [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
         [<ffffffff811d8c41>] elevator_init+0xe1/0x115
         [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
         [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
         [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
         [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
         [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
         [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
         [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
         [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
         [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
         [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
         [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
         [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
      Signed-off-by: default avatarJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e5072664
    • Namjae Jeon's avatar
      Documentation: cfq-iosched: update documentation help for cfq tunables · fdc6fdc5
      Namjae Jeon authored
      Add the documentation text for latency, target_latency & group_idle
      tunnable parameters in the block/cfq-iosched.txt.
      Also fix few typo(spelling) mistakes.
      Signed-off-by: default avatarNamjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: default avatarAmit Sahrawat <a.sahrawat@samsung.com>
      
      Language somewhat modified by Jens.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fdc6fdc5
  5. 02 Apr, 2013 5 commits
    • Jens Axboe's avatar
      Merge branch 'writeback-workqueue' of... · 64f8de4d
      Jens Axboe authored
      Merge branch 'writeback-workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core
      
      Tejun writes:
      
      -----
      
      This is the pull request for the earlier patchset[1] with the same
      name.  It's only three patches (the first one was committed to
      workqueue tree) but the merge strategy is a bit involved due to the
      dependencies.
      
      * Because the conversion needs features from wq/for-3.10,
        block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
        with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
        workqueue conflicts from flaring up in block tree.
      
      * Resolving the issue that Jan and Dave raised about debugging
        requires arch-wide changes.  The patchset is being worked on[2] but
        it'll have to go through -mm after these changes show up in -next,
        and not included in this pull request.
      
      The three commits are located in the following git branch.
      
        git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
      
      Pulling it into block/for-3.10/core produces a conflict in
      drivers/md/raid5.c between the following two commits.
      
        e3620a3a ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
        2f6db2a7 ("raid5: use bio_reset()")
      
      The conflict is trivial - one removes an "if ()" conditional while the
      other removes "rbi->bi_next = NULL" right above it.  We just need to
      remove both.  The merged branch is available at
      
        git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
      
      so that you can use it for verification.  The test merge commit has
      proper merge description.
      
      While these changes are a bit of pain to route, they make code simpler
      and even have, while minute, measureable performance gain[3] even on a
      workload which isn't particularly favorable to showing the benefits of
      this conversion.
      
      ----
      
      Fixed up the conflict.
      
      Conflicts:
      	drivers/md/raid5.c
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      64f8de4d
    • Tejun Heo's avatar
      writeback: expose the bdi_wq workqueue · b5c872dd
      Tejun Heo authored
      There are cases where userland wants to tweak the priority and
      affinity of writeback flushers.  Expose bdi_wq to userland by setting
      WQ_SYSFS.  It appears under /sys/bus/workqueue/devices/writeback/ and
      allows adjusting maximum concurrency level, cpumask and nice level.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b5c872dd
    • Tejun Heo's avatar
      writeback: replace custom worker pool implementation with unbound workqueue · 839a8e86
      Tejun Heo authored
      Writeback implements its own worker pool - each bdi can be associated
      with a worker thread which is created and destroyed dynamically.  The
      worker thread for the default bdi is always present and serves as the
      "forker" thread which forks off worker threads for other bdis.
      
      there's no reason for writeback to implement its own worker pool when
      using unbound workqueue instead is much simpler and more efficient.
      This patch replaces custom worker pool implementation in writeback
      with an unbound workqueue.
      
      The conversion isn't too complicated but the followings are worth
      mentioning.
      
      * bdi_writeback->last_active, task and wakeup_timer are removed.
        delayed_work ->dwork is added instead.  Explicit timer handling is
        no longer necessary.  Everything works by either queueing / modding
        / flushing / canceling the delayed_work item.
      
      * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
        bdi_writeback->dwork.  On each execution, it processes
        bdi->work_list and reschedules itself if there are more things to
        do.
      
        The function also handles low-mem condition, which used to be
        handled by the forker thread.  If the function is running off a
        rescuer thread, it only writes out limited number of pages so that
        the rescuer can serve other bdis too.  This preserves the flusher
        creation failure behavior of the forker thread.
      
      * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
        bdi_writeback_workfn() about on-going bdi unregistration so that it
        always drains work_list even if it's running off the rescuer.  Note
        that the original code was broken in this regard.  Under memory
        pressure, a bdi could finish unregistration with non-empty
        work_list.
      
      * The default bdi is no longer special.  It now is treated the same as
        any other bdi and bdi_cap_flush_forker() is removed.
      
      * BDI_pending is no longer used.  Removed.
      
      * Some tracepoints become non-applicable.  The following TPs are
        removed - writeback_nothread, writeback_wake_thread,
        writeback_wake_forker_thread, writeback_thread_start,
        writeback_thread_stop.
      
      Everything, including devices coming and going away and rescuer
      operation under simulated memory pressure, seems to work fine in my
      test setup.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      839a8e86
    • Tejun Heo's avatar
      writeback: remove unused bdi_pending_list · 181387da
      Tejun Heo authored
      There's no user left.  Remove it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      181387da
    • Tejun Heo's avatar
      Merge tag 'v3.9-rc5' into wq/for-3.10 · 229641a6
      Tejun Heo authored
      Writeback conversion to workqueue will be based on top of wq/for-3.10
      branch to take advantage of custom attrs and NUMA support for unbound
      workqueues.  Mainline currently contains two commits which result in
      non-trivial merge conflicts with wq/for-3.10 and because
      block/for-3.10/core is based on v3.9-rc3 which contains one of the
      conflicting commits, we need a pre-merge-window merge anyway.  Let's
      pull v3.9-rc5 into wq/for-3.10 so that the block tree doesn't suffer
      from workqueue merge conflicts.
      
      The two conflicts and their resolutions:
      
      * e68035fb ("workqueue: convert to idr_alloc()") in mainline changes
        worker_pool_assign_id() to use idr_alloc() instead of the old idr
        interface.  worker_pool_assign_id() goes through multiple locking
        changes in wq/for-3.10 causing the following conflict.
      
        static int worker_pool_assign_id(struct worker_pool *pool)
        {
      	  int ret;
      
        <<<<<<< HEAD
      	  lockdep_assert_held(&wq_pool_mutex);
      
      	  do {
      		  if (!idr_pre_get(&worker_pool_idr, GFP_KERNEL))
      			  return -ENOMEM;
      		  ret = idr_get_new(&worker_pool_idr, pool, &pool->id);
      	  } while (ret == -EAGAIN);
        =======
      	  mutex_lock(&worker_pool_idr_mutex);
      	  ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
      	  if (ret >= 0)
      		  pool->id = ret;
      	  mutex_unlock(&worker_pool_idr_mutex);
        >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
      
      	  return ret < 0 ? ret : 0;
        }
      
        We want locking from the former and idr_alloc() usage from the
        latter, which can be combined to the following.
      
        static int worker_pool_assign_id(struct worker_pool *pool)
        {
      	  int ret;
      
      	  lockdep_assert_held(&wq_pool_mutex);
      
      	  ret = idr_alloc(&worker_pool_idr, pool, 0, 0, GFP_KERNEL);
      	  if (ret >= 0) {
      		  pool->id = ret;
      		  return 0;
      	  }
      	  return ret;
         }
      
      * eb283428 ("workqueue: fix possible pool stall bug in
        wq_unbind_fn()") updated wq_unbind_fn() such that it has single
        larger for_each_std_worker_pool() loop instead of two separate loops
        with a schedule() call inbetween.  wq/for-3.10 renamed
        pool->assoc_mutex to pool->manager_mutex causing the following
        conflict (earlier function body and comments omitted for brevity).
      
        static void wq_unbind_fn(struct work_struct *work)
        {
        ...
      		  spin_unlock_irq(&pool->lock);
        <<<<<<< HEAD
      		  mutex_unlock(&pool->manager_mutex);
      	  }
        =======
      		  mutex_unlock(&pool->assoc_mutex);
        >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
      
      		  schedule();
      
        <<<<<<< HEAD
      	  for_each_cpu_worker_pool(pool, cpu)
        =======
        >>>>>>> c67bf5361e7e66a0ff1f4caf95f89347d55dfb89
      		  atomic_set(&pool->nr_running, 0);
      
      		  spin_lock_irq(&pool->lock);
      		  wake_up_worker(pool);
      		  spin_unlock_irq(&pool->lock);
      	  }
        }
      
        The resolution is mostly trivial.  We want the control flow of the
        latter with the rename of the former.
      
        static void wq_unbind_fn(struct work_struct *work)
        {
        ...
      		  spin_unlock_irq(&pool->lock);
      		  mutex_unlock(&pool->manager_mutex);
      
      		  schedule();
      
      		  atomic_set(&pool->nr_running, 0);
      
      		  spin_lock_irq(&pool->lock);
      		  wake_up_worker(pool);
      		  spin_unlock_irq(&pool->lock);
      	  }
        }
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      229641a6
  6. 01 Apr, 2013 17 commits
    • Tejun Heo's avatar
      workqueue: update sysfs interface to reflect NUMA awareness and a kernel param... · d55262c4
      Tejun Heo authored
      workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
      
      Unbound workqueues are now NUMA aware.  Let's add some control knobs
      and update sysfs interface accordingly.
      
      * Add kernel param workqueue.numa_disable which disables NUMA affinity
        globally.
      
      * Replace sysfs file "pool_id" with "pool_ids" which contain
        node:pool_id pairs.  This change is userland-visible but "pool_id"
        hasn't seen a release yet, so this is okay.
      
      * Add a new sysf files "numa" which can toggle NUMA affinity on
        individual workqueues.  This is implemented as attrs->no_numa whichn
        is special in that it isn't part of a pool's attributes.  It only
        affects how apply_workqueue_attrs() picks which pools to use.
      
      After "pool_ids" change, first_pwq() doesn't have any user left.
      Removed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      d55262c4
    • Tejun Heo's avatar
      workqueue: implement NUMA affinity for unbound workqueues · 4c16bd32
      Tejun Heo authored
      Currently, an unbound workqueue has single current, or first, pwq
      (pool_workqueue) to which all new work items are queued.  This often
      isn't optimal on NUMA machines as workers may jump around across node
      boundaries and work items get assigned to workers without any regard
      to NUMA affinity.
      
      This patch implements NUMA affinity for unbound workqueues.  Instead
      of mapping all entries of numa_pwq_tbl[] to the same pwq,
      apply_workqueue_attrs() now creates a separate pwq covering the
      intersecting CPUs for each NUMA node which has online CPUs in
      @attrs->cpumask.  Nodes which don't have intersecting possible CPUs
      are mapped to pwqs covering whole @attrs->cpumask.
      
      As CPUs come up and go down, the pool association is changed
      accordingly.  Changing pool association may involve allocating new
      pools which may fail.  To avoid failing CPU_DOWN, each workqueue
      always keeps a default pwq which covers whole attrs->cpumask which is
      used as fallback if pool creation fails during a CPU hotplug
      operation.
      
      This ensures that all work items issued on a NUMA node is executed on
      the same node as long as the workqueue allows execution on the CPUs of
      the node.
      
      As this maps a workqueue to multiple pwqs and max_active is per-pwq,
      this change the behavior of max_active.  The limit is now per NUMA
      node instead of global.  While this is an actual change, max_active is
      already per-cpu for per-cpu workqueues and primarily used as safety
      mechanism rather than for active concurrency control.  Concurrency is
      usually limited from workqueue users by the number of concurrently
      active work items and this change shouldn't matter much.
      
      v2: Fixed pwq freeing in apply_workqueue_attrs() error path.  Spotted
          by Lai.
      
      v3: The previous version incorrectly made a workqueue spanning
          multiple nodes spread work items over all online CPUs when some of
          its nodes don't have any desired cpus.  Reimplemented so that NUMA
          affinity is properly updated as CPUs go up and down.  This problem
          was spotted by Lai Jiangshan.
      
      v4: destroy_workqueue() was putting wq->dfl_pwq and then clearing it;
          however, wq may be freed at any time after dfl_pwq is put making
          the clearing use-after-free.  Clear wq->dfl_pwq before putting it.
      
      v5: apply_workqueue_attrs() was leaking @tmp_attrs, @new_attrs and
          @pwq_tbl after success.  Fixed.
      
          Retry loop in wq_update_unbound_numa_attrs() isn't necessary as
          application of new attrs is excluded via CPU hotplug.  Removed.
      
          Documentation on CPU affinity guarantee on CPU_DOWN added.
      
          All changes are suggested by Lai Jiangshan.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      4c16bd32
    • Tejun Heo's avatar
      workqueue: introduce put_pwq_unlocked() · dce90d47
      Tejun Heo authored
      Factor out lock pool, put_pwq(), unlock sequence into
      put_pwq_unlocked().  The two existing places are converted and there
      will be more with NUMA affinity support.
      
      This is to prepare for NUMA affinity support for unbound workqueues
      and doesn't introduce any functional difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      dce90d47
    • Tejun Heo's avatar
      workqueue: introduce numa_pwq_tbl_install() · 1befcf30
      Tejun Heo authored
      Factor out pool_workqueue linking and installation into numa_pwq_tbl[]
      from apply_workqueue_attrs() into numa_pwq_tbl_install().  link_pwq()
      is made safe to call multiple times.  numa_pwq_tbl_install() links the
      pwq, installs it into numa_pwq_tbl[] at the specified node and returns
      the old entry.
      
      @last_pwq is removed from link_pwq() as the return value of the new
      function can be used instead.
      
      This is to prepare for NUMA affinity support for unbound workqueues.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      1befcf30
    • Tejun Heo's avatar
      workqueue: use NUMA-aware allocation for pool_workqueues · e50aba9a
      Tejun Heo authored
      Use kmem_cache_alloc_node() with @pool->node instead of
      kmem_cache_zalloc() when allocating a pool_workqueue so that it's
      allocated on the same node as the associated worker_pool.  As there's
      no no kmem_cache_zalloc_node(), move zeroing to init_pwq().
      
      This was suggested by Lai Jiangshan.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      e50aba9a
    • Tejun Heo's avatar
      workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq() · f147f29e
      Tejun Heo authored
      Break init_and_link_pwq() into init_pwq() and link_pwq() and move
      unbound-workqueue specific handling into apply_workqueue_attrs().
      Also, factor out unbound pool and pool_workqueue allocation into
      alloc_unbound_pwq().
      
      This reorganization is to prepare for NUMA affinity and doesn't
      introduce any functional changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      f147f29e
    • Tejun Heo's avatar
      workqueue: map an unbound workqueues to multiple per-node pool_workqueues · df2d5ae4
      Tejun Heo authored
      Currently, an unbound workqueue has only one "current" pool_workqueue
      associated with it.  It may have multple pool_workqueues but only the
      first pool_workqueue servies new work items.  For NUMA affinity, we
      want to change this so that there are multiple current pool_workqueues
      serving different NUMA nodes.
      
      Introduce workqueue->numa_pwq_tbl[] which is indexed by NUMA node and
      points to the pool_workqueue to use for each possible node.  This
      replaces first_pwq() in __queue_work() and workqueue_congested().
      
      numa_pwq_tbl[] is currently initialized to point to the same
      pool_workqueue as first_pwq() so this patch doesn't make any behavior
      changes.
      
      v2: Use rcu_dereference_raw() in unbound_pwq_by_node() as the function
          may be called only with wq->mutex held.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      df2d5ae4
    • Tejun Heo's avatar
      workqueue: move hot fields of workqueue_struct to the end · 2728fd2f
      Tejun Heo authored
      Move wq->flags and ->cpu_pwqs to the end of workqueue_struct and align
      them to the cacheline.  These two fields are used in the work item
      issue path and thus hot.  The scheduled NUMA affinity support will add
      dispatch table at the end of workqueue_struct and relocating these two
      fields will allow us hitting only single cacheline on hot paths.
      
      Note that wq->pwqs isn't moved although it currently is being used in
      the work item issue path for unbound workqueues.  The dispatch table
      mentioned above will replace its use in the issue path, so it will
      become cold once NUMA support is implemented.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      2728fd2f
    • Tejun Heo's avatar
      workqueue: make workqueue->name[] fixed len · ecf6881f
      Tejun Heo authored
      Currently workqueue->name[] is of flexible length.  We want to use the
      flexible field for something more useful and there isn't much benefit
      in allowing arbitrary name length anyway.  Make it fixed len capping
      at 24 bytes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      ecf6881f
    • Tejun Heo's avatar
      workqueue: add workqueue->unbound_attrs · 6029a918
      Tejun Heo authored
      Currently, when exposing attrs of an unbound workqueue via sysfs, the
      workqueue_attrs of first_pwq() is used as that should equal the
      current state of the workqueue.
      
      The planned NUMA affinity support will make unbound workqueues make
      use of multiple pool_workqueues for different NUMA nodes and the above
      assumption will no longer hold.  Introduce workqueue->unbound_attrs
      which records the current attrs in effect and use it for sysfs instead
      of first_pwq()->attrs.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      6029a918
    • Tejun Heo's avatar
      workqueue: determine NUMA node of workers accourding to the allowed cpumask · f3f90ad4
      Tejun Heo authored
      When worker tasks are created using kthread_create_on_node(),
      currently only per-cpu ones have the matching NUMA node specified.
      All unbound workers are always created with NUMA_NO_NODE.
      
      Now that an unbound worker pool may have an arbitrary cpumask
      associated with it, this isn't optimal.  Add pool->node which is
      determined by the pool's cpumask.  If the pool's cpumask is contained
      inside a NUMA node proper, the pool is associated with that node, and
      all workers of the pool are created on that node.
      
      This currently only makes difference for unbound worker pools with
      cpumask contained inside single NUMA node, but this will serve as
      foundation for making all unbound pools NUMA-affine.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      f3f90ad4
    • Tejun Heo's avatar
      workqueue: drop 'H' from kworker names of unbound worker pools · e3c916a4
      Tejun Heo authored
      Currently, all workqueue workers which have negative nice value has
      'H' postfixed to their names.  This is necessary for per-cpu workers
      as they use the CPU number instead of pool->id to identify the pool
      and the 'H' postfix is the only thing distinguishing normal and
      highpri workers.
      
      As workers for unbound pools use pool->id, the 'H' postfix is purely
      informational.  TASK_COMM_LEN is 16 and after the static part and
      delimiters, there are only five characters left for the pool and
      worker IDs.  We're expecting to have more unbound pools with the
      scheduled NUMA awareness support.  Let's drop the non-essential 'H'
      postfix from unbound kworker name.
      
      While at it, restructure kthread_create*() invocation to help future
      NUMA related changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      e3c916a4
    • Tejun Heo's avatar
      workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[] · bce90380
      Tejun Heo authored
      Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
      and wq_numa_possible_cpumask[] in preparation.  The former is the
      highest NUMA node ID + 1 and the latter is masks of possibles CPUs for
      each NUMA node.
      
      This patch only introduces these.  Future patches will make use of
      them.
      
      v2: NUMA initialization move into wq_numa_init().  Also, the possible
          cpumask array is not created if there aren't multiple nodes on the
          system.  wq_numa_enabled bool added.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      bce90380
    • Tejun Heo's avatar
      workqueue: move pwq_pool_locking outside of get/put_unbound_pool() · a892cacc
      Tejun Heo authored
      The scheduled NUMA affinity support for unbound workqueues would need
      to walk workqueues list and pool related operations on each workqueue.
      
      Move wq_pool_mutex locking out of get/put_unbound_pool() to their
      callers so that pool operations can be performed while walking the
      workqueues list, which is also protected by wq_pool_mutex.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      a892cacc
    • Tejun Heo's avatar
      workqueue: fix memory leak in apply_workqueue_attrs() · 4862125b
      Tejun Heo authored
      apply_workqueue_attrs() wasn't freeing temp attrs variable @new_attrs
      in its success path.  Fix it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarLai Jiangshan <laijs@cn.fujitsu.com>
      4862125b
    • Tejun Heo's avatar
      workqueue: fix unbound workqueue attrs hashing / comparison · 13e2e556
      Tejun Heo authored
      29c91e99 ("workqueue: implement attribute-based unbound worker_pool
      management") implemented attrs based worker_pool matching.  It tried
      to avoid false negative when comparing cpumasks with custom hash
      function; unfortunately, the hash and comparison functions fail to
      ignore CPUs which are not possible.  It incorrectly assumed that
      bitmap_copy() skips leftover bits in the last word of bitmap and
      cpumask_equal() ignores impossible CPUs.
      
      This patch updates attrs->cpumask handling such that impossible CPUs
      are properly ignored.
      
      * Hash and copy functions no longer do anything special.  They expect
        their callers to clear impossible CPUs.
      
      * alloc_workqueue_attrs() initializes the cpumask to cpu_possible_mask
        instead of setting all bits and explicit cpumask_setall() for
        unbound_std_wq_attrs[] in init_workqueues() is dropped.
      
      * apply_workqueue_attrs() is now responsible for ignoring impossible
        CPUs.  It makes a copy of @attrs and clears impossible CPUs before
        doing anything else.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      13e2e556
    • Tejun Heo's avatar
      workqueue: fix race condition in unbound workqueue free path · bc0caf09
      Tejun Heo authored
      8864b4e5 ("workqueue: implement get/put_pwq()") implemented pwq
      (pool_workqueue) refcnting which frees workqueue when the last pwq
      goes away.  It determined whether it was the last pwq by testing
      wq->pwqs is empty.  Unfortunately, the test was done outside wq->mutex
      and multiple pwq release could race and try to free wq multiple times
      leading to oops.
      
      Test wq->pwqs emptiness while holding wq->mutex.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      bc0caf09
  7. 31 Mar, 2013 5 commits
    • Linus Torvalds's avatar
      Linux 3.9-rc5 · 07961ac7
      Linus Torvalds authored
      07961ac7
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma · 0bb44280
      Linus Torvalds authored
      Pull slave-dmaengine fixes from Vinod Koul:
       "Two fixes for slave-dmaengine.
      
        The first one is for making slave_id value correct for dw_dmac and
        the other one fixes the endieness in DT parsing"
      
      * 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
        dw_dmac: adjust slave_id accordingly to request line base
        dmaengine: dw_dma: fix endianess for DT xlate function
      0bb44280
    • Linus Torvalds's avatar
      Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · a7b436d3
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
       "For a some fixes for Kernel 3.9:
         - subsystem build fix when VIDEO_DEV=y, VIDEO_V4L2=m and I2C=m
         - compilation fix for arm multiarch preventing IR_RX51 to be selected
         - regression fix at bttv crop logic
         - s5p-mfc/m5mols/exynos: a few fixes for cameras on exynos hardware"
      
      * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        [media] [REGRESSION] bt8xx: Fix too large height in cropcap
        [media] fix compilation with both V4L2 and I2C as 'm'
        [media] m5mols: Fix bug in stream on handler
        [media] s5p-fimc: Do not attempt to disable not enabled media pipeline
        [media] s5p-mfc: Fix encoder control 15 issue
        [media] s5p-mfc: Fix frame skip bug
        [media] s5p-fimc: send valid m2m ctx to fimc_m2m_job_finish
        [media] exynos-gsc: send valid m2m ctx to gsc_m2m_job_finish
        [media] fimc-lite: Fix the variable type to avoid possible crash
        [media] fimc-lite: Initialize 'step' field in fimc_lite_ctrl structure
        [media] ir: IR_RX51 only works on OMAP2
      a7b436d3
    • Linus Torvalds's avatar
      Merge tag 'for-linus-20130331' of git://git.kernel.dk/linux-block · d299c290
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Alright, this time from 10K up in the air.
      
        Collection of fixes that have been queued up since the merge window
        opened, hence postponed until later in the cycle.  The pull request
        contains:
      
         - A bunch of fixes for the xen blk front/back driver.
      
         - A round of fixes for the new IBM RamSan driver, fixing various
           nasty issues.
      
         - Fixes for multiple drives from Wei Yongjun, bad handling of return
           values and wrong pointer math.
      
         - A fix for loop properly killing partitions when being detached."
      
      * tag 'for-linus-20130331' of git://git.kernel.dk/linux-block: (25 commits)
        mg_disk: fix error return code in mg_probe()
        rsxx: remove unused variable
        rsxx: enable error return of rsxx_eeh_save_issued_dmas()
        block: removes dynamic allocation on stack
        Block: blk-flush: Fixed indent code style
        cciss: fix invalid use of sizeof in cciss_find_cfgtables()
        loop: cleanup partitions when detaching loop device
        loop: fix error return code in loop_add()
        mtip32xx: fix error return code in mtip_pci_probe()
        xen-blkfront: remove frame list from blk_shadow
        xen-blkfront: pre-allocate pages for requests
        xen-blkback: don't store dev_bus_addr
        xen-blkfront: switch from llist to list
        xen-blkback: fix foreach_grant_safe to handle empty lists
        xen-blkfront: replace kmalloc and then memcpy with kmemdup
        xen-blkback: fix dispatch_rw_block_io() error path
        rsxx: fix missing unlock on error return in rsxx_eeh_remap_dmas()
        Adding in EEH support to the IBM FlashSystem 70/80 device driver
        block: IBM RamSan 70/80 error message bug fix.
        block: IBM RamSan 70/80 branding changes.
        ...
      d299c290
    • Paul Walmsley's avatar
      Revert "lockdep: check that no locks held at freeze time" · dbf520a9
      Paul Walmsley authored
      This reverts commit 6aa97070.
      
      Commit 6aa97070 ("lockdep: check that no locks held at freeze time")
      causes problems with NFS root filesystems.  The failures were noticed on
      OMAP2 and 3 boards during kernel init:
      
        [ BUG: swapper/0/1 still has locks held! ]
        3.9.0-rc3-00344-ga937536b #1 Not tainted
        -------------------------------------
        1 lock held by swapper/0/1:
         #0:  (&type->s_umount_key#13/1){+.+.+.}, at: [<c011e84c>] sget+0x248/0x574
      
        stack backtrace:
          rpc_wait_bit_killable
          __wait_on_bit
          out_of_line_wait_on_bit
          __rpc_execute
          rpc_run_task
          rpc_call_sync
          nfs_proc_get_root
          nfs_get_root
          nfs_fs_mount_common
          nfs_try_mount
          nfs_fs_mount
          mount_fs
          vfs_kern_mount
          do_mount
          sys_mount
          do_mount_root
          mount_root
          prepare_namespace
          kernel_init_freeable
          kernel_init
      
      Although the rootfs mounts, the system is unstable.  Here's a transcript
      from a PM test:
      
        http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt
      
      Here's what the test log should look like:
      
        http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt
      
      Mailing list discussion is here:
      
        http://lkml.org/lkml/2013/3/4/221
      
      Deal with this for v3.9 by reverting the problem commit, until folks can
      figure out the right long-term course of action.
      Signed-off-by: default avatarPaul Walmsley <paul@pwsan.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Shawn Guo <shawn.guo@linaro.org>
      Cc: <maciej.rutecki@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ben Chan <benchan@chromium.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dbf520a9
  8. 30 Mar, 2013 1 commit
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 13d2080d
      Linus Torvalds authored
      Pull SCSI target fixes from Nicholas Bellinger:
       "This includes the bug-fix for a >= v3.8-rc1 regression specific to
        iscsi-target persistent reservation conflict handling (CC'ed to
        stable), and a tcm_vhost patch to drop VIRTIO_RING_F_EVENT_IDX usage
        so that in-flight qemu vhost-scsi-pci device code can detect the
        proper vhost feature bits.
      
        Also, there are two more tcm_vhost patches still being discussed by
        MST and Asias for v3.9 that will be required for the in-flight qemu
        vhost-scsi-pci device patch to function properly, and that should
        (hopefully) be the last target fixes for this round."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        target: Fix RESERVATION_CONFLICT status regression for iscsi-target special case
        tcm_vhost: Avoid VIRTIO_RING_F_EVENT_IDX feature bit
      13d2080d
  9. 29 Mar, 2013 6 commits