1. 16 May, 2014 6 commits
    • Tejun Heo's avatar
      cgroup: remove cgroup->parent · d51f39b0
      Tejun Heo authored
      cgroup->parent is redundant as cgroup->self.parent can also be used to
      determine the parent cgroup and we're moving towards using
      cgroup_subsys_states as the fundamental structural blocks.  This patch
      introduces cgroup_parent() which follows cgroup->self.parent and
      removes cgroup->parent.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      d51f39b0
    • Tejun Heo's avatar
      device_cgroup: remove direct access to cgroup->children · 5877019d
      Tejun Heo authored
      Currently, devcg::has_children() directly tests cgroup->children for
      list emptiness.  The field is not a published field and scheduled to
      go away.  In addition, the test isn't strictly correct as devcg should
      only care about children which are visible to userland.
      
      This patch converts has_children() to use css_next_child() instead.
      The subtle incorrectness is noted and will be dealt with later.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarAristeu Rozanski <aris@redhat.com>
      Acked-by: default avatarSerge Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      5877019d
    • Tejun Heo's avatar
      memcg: update memcg_has_children() to use css_next_child() · ea280e7b
      Tejun Heo authored
      Currently, memcg_has_children() and mem_cgroup_hierarchy_write()
      directly test cgroup->children for list emptiness.  It's semantically
      correct in traditional hierarchies as it actually wants to test for
      any children dead or alive; however, cgroup->children is not a
      published field and scheduled to go away.
      
      This patch moves out .use_hierarchy test out of memcg_has_children()
      and updates it to use css_next_child() to test whether there exists
      any children.  With .use_hierarchy test moved out, it can also be used
      by mem_cgroup_hierarchy_write().
      
      A side note: As .use_hierarchy is going away, it doesn't really matter
      but I'm not sure about how it's used in __memcg_activate_kmem().  The
      condition tested by memcg_has_children() is mushy when seen from
      userland as its result is affected by dead csses which aren't visible
      from userland.  I think the rule would be a lot clearer if we have a
      dedicated "freshly minted" flag which gets cleared when the first task
      is migrated into it or the first child is created and then gate
      activation with that.
      
      v2: Added comment noting that testing use_hierarchy is the
          responsibility of the callers of memcg_has_children() as suggested
          by Michal Hocko.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      ea280e7b
    • Michal Hocko's avatar
      memcg: remove tasks/children test from mem_cgroup_force_empty() · f61c42a7
      Michal Hocko authored
      Tejun has correctly pointed out that tasks/children test in
      mem_cgroup_force_empty is not correct because there is no other locking
      which preserves this state throughout the rest of the function so both
      new tasks can join the group or new children groups can be added while
      somebody is writing to memory.force_empty. A new task would break
      mem_cgroup_reparent_charges expectation that all failures as described
      by mem_cgroup_force_empty_list are temporal and there is no way out.
      
      The main use case for the knob as described by
      Documentation/cgroups/memory.txt is to:
      "
        The typical use case for this interface is before calling rmdir().
        Because rmdir() moves all pages to parent, some out-of-use page caches can be
        moved to the parent. If you want to avoid that, force_empty will be useful.
      "
      
      This means that reparenting is not really required as rmdir will
      reparent pages implicitly from the safe context. If we remove it from
      mem_cgroup_force_empty then we are safe even with existing tasks because
      the number of reclaim attempts is bounded. Moreover the knob still does
      what the documentation claims (modulo reparenting which doesn't make any
      difference) and users might expect. Longterm we want to deprecate the
      whole knob and put the reparented pages to the tail of parent LRU during
      cgroup removal.
      
      tj: Removed unused variable @cgrp from mem_cgroup_force_empty()
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      f61c42a7
    • Tejun Heo's avatar
      cgroup: remove css_parent() · 5c9d535b
      Tejun Heo authored
      cgroup in general is moving towards using cgroup_subsys_state as the
      fundamental structural component and css_parent() was introduced to
      convert from using cgroup->parent to css->parent.  It was quite some
      time ago and we're moving forward with making css more prominent.
      
      This patch drops the trivial wrapper css_parent() and let the users
      dereference css->parent.  While at it, explicitly mark fields of css
      which are public and immutable.
      
      v2: New usage from device_cgroup.c converted.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatar"David S. Miller" <davem@davemloft.net>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      5c9d535b
    • Tejun Heo's avatar
      cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css · 3b514d24
      Tejun Heo authored
      9395a450 ("cgroup: enable refcnting for root csses") enabled
      reference counting for root csses (cgroup_subsys_states) so that
      cgroup's self csses can be used to manage the lifetime of the
      containing cgroups.
      
      Unfortunately, this change was incorrect.  During early init,
      cgrp_dfl_root self css refcnt is used.  percpu_ref can't initialized
      during early init and its initialization is deferred till
      cgroup_init() time.  This means that cpu was using percpu_ref which
      wasn't properly initialized.  Due to the way percpu variables are laid
      out on x86, this didn't blow up immediately on x86 but ended up
      incrementing and decrementing the percpu variable at offset zero,
      whatever it may be; however, on other archs, this caused fault and
      early boot failure.
      
      As cgroup self csses for root cgroups of non-dfl hierarchies need
      working refcounting, we can't revert 9395a450.  This patch adds
      CSS_NO_REF which explicitly inhibits reference counting on the css and
      sets it on all normal (non-self) csses and cgroup_dfl_root self css.
      
      v2: cgrp_dfl_root.self is the offending one.  Set the flag on it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarStephen Warren <swarren@nvidia.com>
      Tested-by: default avatarStephen Warren <swarren@nvidia.com>
      Fixes: 9395a450 ("cgroup: enable refcnting for root csses")
      3b514d24
  2. 14 May, 2014 9 commits
    • Tejun Heo's avatar
      cgroup: use cgroup->self.refcnt for cgroup refcnting · 9d755d33
      Tejun Heo authored
      Currently cgroup implements refcnting separately using atomic_t
      cgroup->refcnt.  The destruction paths of cgroup and css are rather
      complex and bear a lot of similiarities including the use of RCU and
      bouncing to a work item.
      
      This patch makes cgroup use the refcnt of self css for refcnting
      instead of using its own.  This makes cgroup refcnting use css's
      percpu refcnt and share the destruction mechanism.
      
      * css_release_work_fn() and css_free_work_fn() are updated to handle
        both csses and cgroups.  This is a bit messy but should do until we
        can make cgroup->self a full css, which currently can't be done
        thanks to multiple hierarchies.
      
      * cgroup_destroy_locked() now performs
        percpu_ref_kill(&cgrp->self.refcnt) instead of cgroup_put(cgrp).
      
      * Negative refcnt sanity check in cgroup_get() is no longer necessary
        as percpu_ref already handles it.
      
      * Similarly, as a cgroup which hasn't been killed will never be
        released regardless of its refcnt value and percpu_ref has sanity
        check on kill, cgroup_is_dead() sanity check in cgroup_put() is no
        longer necessary.
      
      * As whether a refcnt reached zero or not can only be decided after
        the reference count is killed, cgroup_root->cgrp's refcnting can no
        longer be used to decide whether to kill the root or not.  Let's
        make cgroup_kill_sb() explicitly initiate destruction if the root
        doesn't have any children.  This makes sense anyway as unmounted
        cgroup hierarchy without any children should be destroyed.
      
      While this is a bit messy, this will allow pushing more bookkeeping
      towards cgroup->self and thus handling cgroups and csses in more
      uniform way.  In the very long term, it should be possible to
      introduce a base subsystem and convert the self css to a proper one
      making things whole lot simpler and unified.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      9d755d33
    • Tejun Heo's avatar
      cgroup: enable refcnting for root csses · 9395a450
      Tejun Heo authored
      Currently, css_get(), css_tryget() and css_tryget_online() are noops
      for root csses as an optimization; however, we're planning to use css
      refcnts to track of cgroup lifetime too and root cgroups also need to
      be reference counted.  Since css has been converted to percpu_refcnt,
      the overhead of refcnting is miniscule and this optimization isn't too
      meaningful anymore.  Furthermore, controllers which optimize the root
      cgroup often never even invoke these functions in their hot paths.
      
      This patch enables refcnting for root csses too.  This makes CSS_ROOT
      flag unused and removes it.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      9395a450
    • Tejun Heo's avatar
      cgroup: bounce css release through css->destroy_work · 25e15d83
      Tejun Heo authored
      css release is planned to do more and would require process context.
      Bounce it through css->destroy_work.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      25e15d83
    • Tejun Heo's avatar
      cgroup: remove cgroup_destory_css_killed() · 249f3468
      Tejun Heo authored
      cgroup_destroy_css_killed() is cgroup destruction stage which happens
      after all csses are offlined.  After the recent updates, it no longer
      does anything other than putting the base reference.  This patch
      removes the function and makes cgroup_destroy_locked() put the base
      ref at the end isntead.
      
      This also makes cgroup->nr_css unnecessary.  Removed.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      249f3468
    • Tejun Heo's avatar
      cgroup: move cgroup->sibling unlinking to cgroup_put() · 4e4e2847
      Tejun Heo authored
      Move cgroup->sibling unlinking from cgroup_destroy_css_killed() to
      cgroup_put().  This is later but still before the RCU grace period, so
      it doesn't break css_next_child() although there now is a larger
      window in which a dead cgroup is visible during css iteration.  As css
      iteration always could have included offline csses, this doesn't
      affect correctness; however, it does make css_next_child() fall back
      to reiterting mode more often.  This also makes cgroup_put() directly
      take cgroup_mutex, which limits where it can be called from.  These
      are not immediately problematic and will be dealt with later.
      
      This change enables simplification of cgroup destruction path.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      4e4e2847
    • Tejun Heo's avatar
      cgroup: move check_for_release(parent) call to the end of cgroup_destroy_locked() · 9e4173e1
      Tejun Heo authored
      Currently, check_for_release() on the parent of a destroyed cgroup is
      invoked from cgroup_destroy_css_killed().  This is because this is
      where the destroyed cgroup can be removed from the parent's children
      list.  check_for_release() tests the emptiness of the list directly,
      so invoking it before removing the cgroup from the list makes it think
      that the parent still has children even when it no longer does.
      
      This patch updates check_for_release() to use
      cgroup_has_live_children() instead of directly testing ->children
      emptiness and moves check_for_release(parent) earlier to the end of
      cgroup_destroy_locked().  As cgroup_has_live_children() ignores
      cgroups marked DEAD, check_for_release() functions correctly as long
      as it's called after asserting DEAD.
      
      This makes release notification slightly more timely and more
      importantly enables further simplification of cgroup destruction path.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      9e4173e1
    • Tejun Heo's avatar
      cgroup: separate out cgroup_has_live_children() from cgroup_destroy_locked() · cbc125ef
      Tejun Heo authored
      We're expecting another user.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      cbc125ef
    • Tejun Heo's avatar
      cgroup: rename cgroup->dummy_css to ->self and move it to the top · 9d800df1
      Tejun Heo authored
      cgroup->dummy_css is used as the placeholder css when performing css
      oriended operations on the cgroup.  We're gonna shift more cgroup
      management to this css.  Let's rename it to ->self and move it to the
      top.
      
      This is pure rename and field relocation.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      9d800df1
    • Tejun Heo's avatar
      cgroup: use restart_syscall() for mount retries · a015edd2
      Tejun Heo authored
      cgroup_mount() uses dumb delay-and-retry logic to wait for cgroup_root
      which is being destroyed.  The retry currently loops inside
      cgroup_mount() proper.  This patch makes it return with
      restart_syscall() instead so that retry travels out to userland
      boundary.
      
      This slightly simplifies the logic and more importantly makes the
      retry logic behave better when the wait for some reason becomes
      lengthy or infinite by allowing the operation to be suspended or
      terminated from userland.
      
      v2: The original patch forgot to free memory allocated for @opts.
          Fixed.  Caught by Li Zefan.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      a015edd2
  3. 13 May, 2014 25 commits
    • Tejun Heo's avatar
      cgroup: remove cgroup_tree_mutex · 8353da1f
      Tejun Heo authored
      cgroup_tree_mutex was introduced to work around the circular
      dependency between cgroup_mutex and kernfs active protection - some
      kernfs file and directory operations needed cgroup_mutex putting
      cgroup_mutex under active protection but cgroup also needs to be able
      to access cgroup hierarchies and cftypes to determine which
      kernfs_nodes need to be removed.  cgroup_tree_mutex nested above both
      cgroup_mutex and kernfs active protection and used to protect the
      hierarchy and cftypes.  While this worked, it added a lot of double
      lockings and was generally cumbersome.
      
      kernfs provides a mechanism to opt out of active protection and cgroup
      was already using it for removal and subtree_control.  There's no
      reason to mix both methods of avoiding circular locking dependency and
      the preceding cgroup_kn_lock_live() changes applied it to all relevant
      cgroup kernfs operations making it unnecessary to nest cgroup_mutex
      under kernfs active protection.  The previous patch reversed the
      original lock ordering and put cgroup_mutex above kernfs active
      protection.
      
      After these changes, all cgroup_tree_mutex usages are now accompanied
      by cgroup_mutex making the former completely redundant.  This patch
      removes cgroup_tree_mutex and all its usages.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      8353da1f
    • Tejun Heo's avatar
      cgroup: nest kernfs active protection under cgroup_mutex · 01f6474c
      Tejun Heo authored
      After the recent cgroup_kn_lock_live() changes, cgroup_mutex is no
      longer nested below kernfs active protection.  The two don't have any
      relationship now.
      
      This patch nests kernfs active protection under cgroup_mutex.  All
      cftype operations now require both cgroup_tree_mutex and cgroup_mutex,
      temporary cgroup_mutex releases over kernfs operations are removed,
      and cgroup_add/rm_cftypes() grab both mutexes.
      
      This makes cgroup_tree_mutex redundant, which will be removed by the
      next patch.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      01f6474c
    • Tejun Heo's avatar
      cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods · e76ecaee
      Tejun Heo authored
      Make __cgroup_procs_write() and cgroup_release_agent_write() use
      cgroup_kn_lock_live() and cgroup_kn_unlock() instead of
      cgroup_lock_live_group().  This puts the operations under both
      cgroup_tree_mutex and cgroup_mutex protection without circular
      dependency from kernfs active protection.  Also, this means that
      cgroup_mutex is no longer nested below kernfs active protection.
      There is no longer any place where the two locks interact.
      
      This leaves cgroup_lock_live_group() without any user.  Removed.
      
      This will help simplifying cgroup locking.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      e76ecaee
    • Tejun Heo's avatar
      cgroup: factor out cgroup_kn_lock_live() and cgroup_kn_unlock() · a9746d8d
      Tejun Heo authored
      cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write()
      share the logic to break active protection so that they can grab
      cgroup_tree_mutex which nests above active protection and/or remove
      self.  Factor out this logic into cgroup_kn_lock_live() and
      cgroup_kn_unlock().
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      a9746d8d
    • Tejun Heo's avatar
      cgroup: move cgroup->kn->priv clearing to cgroup_rmdir() · cfc79d5b
      Tejun Heo authored
      The ->priv field of a cgroup directory kernfs_node points back to the
      cgroup.  This field is RCU cleared in cgroup_destroy_locked() for
      non-kernfs accesses from css_tryget_from_dir() and
      cgroupstats_build().
      
      As these are only applicable to cgroups which finished creation
      successfully and fully initialized cgroups are always removed by
      cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir().
      
      This will help simplifying cgroup locking and shouldn't introduce any
      behavior difference.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      cfc79d5b
    • Tejun Heo's avatar
      cgroup: grab cgroup_mutex earlier in cgroup_subtree_control_write() · ddab2b6e
      Tejun Heo authored
      Move cgroup_lock_live_group() invocation upwards to right below
      cgroup_tree_mutex in cgroup_subtree_control_write().  This is to help
      the planned locking simplification.
      
      This doesn't make any userland-visible behavioral changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      ddab2b6e
    • Tejun Heo's avatar
      cgroup: collapse cgroup_create() into croup_mkdir() · b3bfd983
      Tejun Heo authored
      cgroup_mkdir() is the sole user of cgroup_create().  Let's collapse
      the latter into the former.  This will help simplifying locking.
      While at it, remove now stale comment about inode locking.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      b3bfd983
    • Tejun Heo's avatar
      cgroup: reorganize cgroup_create() · ba0f4d76
      Tejun Heo authored
      Reorganize cgroup_create() so that all paths share unlock out path.
      
      * All err_* labels are renamed to out_* as they're now shared by both
        success and failure paths.
      
      * @err renamed to @ret for the similar reason as above and so that
        it's more consistent with other functions.
      
      * cgroup memory allocation moved after locking so that freeing failed
        cgroup happens before unlocking.  While this moves more code inside
        critical section, memory allocations inside cgroup locking are
        already pretty common and this is unlikely to make any noticeable
        difference.
      
      * While at it, replace a stray @parent->root dereference with @root.
      
      This reorganization will help simplifying locking.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      ba0f4d76
    • Tejun Heo's avatar
      cgroup: remove cgroup->control_kn · b7fc5ad2
      Tejun Heo authored
      Now that cgroup_subtree_control_write() has access to the associated
      kernfs_open_file and thus the kernfs_node, there's no need to cache it
      in cgroup->control_kn on creation.  Remove cgroup->control_kn and use
      @of->kn directly.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      b7fc5ad2
    • Tejun Heo's avatar
      cgroup: convert "tasks" and "cgroup.procs" handle to use cftype->write() · acbef755
      Tejun Heo authored
      cgroup_tasks_write() and cgroup_procs_write() are currently using
      cftype->write_u64().  This patch converts them to use cftype->write()
      instead.  This allows access to the associated kernfs_open_file which
      will be necessary to implement the planned kernfs active protection
      manipulation for these files.
      
      This shifts buffer parsing to attach_task_by_pid() and makes it return
      @nbytes on success.  Let's rename it to __cgroup_procs_write() to
      clearly indicate that this is a write handler implementation.
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      acbef755
    • Tejun Heo's avatar
      cgroup: replace cftype->trigger() with cftype->write() · 6770c64e
      Tejun Heo authored
      cftype->trigger() is pointless.  It's trivial to ignore the input
      buffer from a regular ->write() operation.  Convert all ->trigger()
      users to ->write() and remove ->trigger().
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      6770c64e
    • Tejun Heo's avatar
      cgroup: replace cftype->write_string() with cftype->write() · 451af504
      Tejun Heo authored
      Convert all cftype->write_string() users to the new cftype->write()
      which maps directly to kernfs write operation and has full access to
      kernfs and cgroup contexts.  The conversions are mostly mechanical.
      
      * @css and @cft are accessed using of_css() and of_cft() accessors
        respectively instead of being specified as arguments.
      
      * Should return @nbytes on success instead of 0.
      
      * @buf is not trimmed automatically.  Trim if necessary.  Note that
        blkcg and netprio don't need this as the parsers already handle
        whitespaces.
      
      cftype->write_string() has no user left after the conversions and
      removed.
      
      While at it, remove unnecessary local variable @p in
      cgroup_subtree_control_write() and stale comment about
      CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
      
      This patch doesn't introduce any visible behavior changes.
      
      v2: netprio was missing from conversion.  Converted.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarAristeu Rozanski <arozansk@redhat.com>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      451af504
    • Tejun Heo's avatar
      cgroup: implement cftype->write() · b4168640
      Tejun Heo authored
      During the recent conversion to kernfs, cftype's seq_file operations
      are updated so that they are directly mapped to kernfs operations and
      thus can fully access the associated kernfs and cgroup contexts;
      however, write path hasn't seen similar updates and none of the
      existing write operations has access to, for example, the associated
      kernfs_open_file.
      
      Let's introduce a new operation cftype->write() which maps directly to
      the kernfs write operation and has access to all the arguments and
      contexts.  This will replace ->write_string() and ->trigger() and ease
      manipulation of kernfs active protection from cgroup file operations.
      
      Two accessors - of_cft() and of_css() - are introduced to enable
      accessing the associated cgroup context from cftype->write() which
      only takes kernfs_open_file for the context information.  The
      accessors for seq_file operations - seq_cft() and seq_css() - are
      rewritten to wrap the of_ accessors.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      b4168640
    • Tejun Heo's avatar
      cgroup: rename css_tryget*() to css_tryget_online*() · ec903c0c
      Tejun Heo authored
      Unlike the more usual refcnting, what css_tryget() provides is the
      distinction between online and offline csses instead of protection
      against upping a refcnt which already reached zero.  cgroup is
      planning to provide actual tryget which fails if the refcnt already
      reached zero.  Let's rename the existing trygets so that they clearly
      indicate that they're onliness.
      
      I thought about keeping the existing names as-are and introducing new
      names for the planned actual tryget; however, given that each
      controller participates in the synchronization of the online state, it
      seems worthwhile to make it explicit that these functions are about
      on/offline state.
      
      Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
      to css_tryget_online_from_dir().  This is pure rename.
      
      v2: cgroup_freezer grew new usages of css_tryget().  Update
          accordingly.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      ec903c0c
    • Tejun Heo's avatar
      cgroup: use release_agent_path_lock in cgroup_release_agent_show() · 46cfeb04
      Tejun Heo authored
      release_path is now protected by release_agent_path_lock to allow
      accessing it without grabbing cgroup_mutex; however,
      cgroup_release_agent_show() was still grabbing cgroup_mutex.  Let's
      convert it to release_agent_path_lock so that we don't have to worry
      about this one for the planned locking updates.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      46cfeb04
    • Tejun Heo's avatar
      cgroup: use restart_syscall() for retries after offline waits in cgroup_subtree_control_write() · 7d331fa9
      Tejun Heo authored
      After waiting for a child to finish offline,
      cgroup_subtree_control_write() jumps up to retry from after the input
      parsing and active protection breaking.  This retry makes the
      scheduled locking update - removal of cgroup_tree_mutex - more
      difficult.  Let's simplify it by returning with restart_syscall() for
      retries.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      7d331fa9
    • Tejun Heo's avatar
      cgroup: update and fix parsing of "cgroup.subtree_control" · d37167ab
      Tejun Heo authored
      I was confused that strsep() was equivalent to strtok_r() in skipping
      over consecutive delimiters.  strsep() just splits at the first
      occurrence of one of the delimiters which makes the parsing very
      inflexible, which makes allowing multiple whitespace chars as
      delimters kinda moot.  Let's just be consistently strict and require
      list of tokens separated by spaces.  This is what
      Documentation/cgroups/unified-hierarchy.txt describes too.
      
      Also, parsing may access beyond the end of the string if the string
      ends with spaces or is zero-length.  Make sure it skips zero-length
      tokens.  Note that this also ensures that the parser doesn't puke on
      multiple consecutive spaces.
      
      v2: Add zero-length token skipping.
      
      v3: Added missing space after "==".  Spotted by Li.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      d37167ab
    • Tejun Heo's avatar
      cgroup: css_release() shouldn't clear cgroup->subsys[] · 0ab7a60d
      Tejun Heo authored
      c1a71504 ("cgroup: don't recycle cgroup id until all csses' have
      been destroyed") made cgroup ID persist until a cgroup is released and
      add cgroup->subsys[] clearing to css_release() so that css_from_id()
      doesn't return a css which has already been released which happens
      before cgroup release; however, the right change here was updating
      offline_css() to clear cgroup->subsys[] which was done by e3297803
      ("cgroup: cgroup->subsys[] should be cleared after the css is
      offlined") instead of clearing it from css_release().
      
      We're now clearing cgroup->subsys[] twice.  This is okay for
      traditional hierarchies as a css's lifetime is the same as its
      cgroup's; however, this confuses unified hierarchy and turning on and
      off a controller repeatedly using "cgroup.subtree_control" can lead to
      an oops like the following which happens because cgroup->subsys[] is
      incorrectly cleared asynchronously by css_release().
      
       BUG: unable to handle kernel NULL pointer dereference at 00000000000000 08
       IP: [<ffffffff81130c11>] kill_css+0x21/0x1c0
       PGD 1170d067 PUD f0ab067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       Modules linked in:
       CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000
       RIP: 0010:[<ffffffff81130c11>]  [<ffffffff81130c11>] kill_css+0x21/0x1c0
       RSP: 0018:ffff88000e199dc8  EFLAGS: 00010202
       RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
       RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98
       RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000
       R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001
       R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00
       FS:  00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0
       Stack:
        ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80
        ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48
        ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002
       Call Trace:
        [<ffffffff8113537f>] cgroup_subtree_control_write+0x85f/0xa00
        [<ffffffff8112fd18>] cgroup_file_write+0x38/0x1d0
        [<ffffffff8126fc97>] kernfs_fop_write+0xe7/0x170
        [<ffffffff811f2ae6>] vfs_write+0xb6/0x1c0
        [<ffffffff811f35ad>] SyS_write+0x4d/0xc0
        [<ffffffff81d0acd2>] system_call_fastpath+0x16/0x1b
       Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 <48> 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff
       RIP  [<ffffffff81130c11>] kill_css+0x21/0x1c0
        RSP <ffff88000e199dc8>
       CR2: 0000000000000008
       ---[ end trace e7aae1f877c4e1b4 ]---
      
      Remove the unnecessary cgroup->subsys[] clearing from css_release().
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      0ab7a60d
    • Tejun Heo's avatar
      cgroup: cgroup_idr_lock should be bh · 54504e97
      Tejun Heo authored
      cgroup_idr_remove() can be invoked from bh leading to lockdep
      detecting possible AA deadlock (IN_BH/ON_BH).  Make the lock bh-safe.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      54504e97
    • Tejun Heo's avatar
      cgroup: fix offlining child waiting in cgroup_subtree_control_write() · 0cee8b77
      Tejun Heo authored
      cgroup_subtree_control_write() waits for offline to complete
      child-by-child before enabling a controller; however, it has a couple
      bugs.
      
      * It doesn't initialize the wait_queue_t.  This can lead to infinite
        hang on the following schedule() among other things.
      
      * It forgets to pin the child before releasing cgroup_tree_mutex and
        performing schedule().  The child may already be gone by the time it
        wakes up and invokes finish_wait().  Pin the child being waited on.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      0cee8b77
    • Tejun Heo's avatar
      Merge branch 'for-3.15-fixes' of... · f21a4f75
      Tejun Heo authored
      Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.16
      
      Pull to receive e37a06f1 ("cgroup: fix the retry path of
      cgroup_mount()") to avoid unnecessary conflicts with planned
      cgroup_tree_mutex removal and also to be able to remove the temp fix
      added by 36c38fb7 ("blkcg: use trylock on blkcg_pol_mutex in
      blkcg_reset_stats()") afterwards.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      f21a4f75
    • Tejun Heo's avatar
      cgroup: fix rcu_read_lock() leak in update_if_frozen() · 36e9d2eb
      Tejun Heo authored
      While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
      replace freezer->lock with freezer_mutex") introduced a bug in
      update_if_frozen() where it returns with rcu_read_lock() held.  Fix it
      by adding rcu_read_unlock() before returning.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarkbuild test robot <fengguang.wu@intel.com>
      36e9d2eb
    • Tejun Heo's avatar
      Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.16 · d39ea871
      Tejun Heo authored
      Pull to receive percpu_ref_tryget[_live]() changes.  Planned cgroup
      changes will make use of them.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      d39ea871
    • Tejun Heo's avatar
      cgroup_freezer: replace freezer->lock with freezer_mutex · e5ced8eb
      Tejun Heo authored
      After 96d365e0 ("cgroup: make css_set_lock a rwsem and rename it
      to css_set_rwsem"), css task iterators requires sleepable context as
      it may block on css_set_rwsem.  I missed that cgroup_freezer was
      iterating tasks under IRQ-safe spinlock freezer->lock.  This leads to
      errors like the following on freezer state reads and transitions.
      
        BUG: sleeping function called from invalid context at /work
       /os/work/kernel/locking/rwsem.c:20
        in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
        5 locks held by bash/462:
         #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
         #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
         #2:  (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
         #3:  (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
         #4:  (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
        Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460
      
        CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
         ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
         ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
         0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
        Call Trace:
         [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
         [<ffffffff810cf4f2>] __might_sleep+0x162/0x260
         [<ffffffff81d05974>] down_read+0x24/0x60
         [<ffffffff81133e87>] css_task_iter_start+0x27/0x70
         [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
         [<ffffffff81135a16>] freezer_write+0xf6/0x1e0
         [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
         [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f0756>] vfs_write+0xb6/0x1c0
         [<ffffffff811f121d>] SyS_write+0x4d/0xc0
         [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b
      
      freezer->lock used to be used in hot paths but that time is long gone
      and there's no reason for the lock to be IRQ-safe spinlock or even
      per-cgroup.  In fact, given the fact that a cgroup may contain large
      number of tasks, it's not a good idea to iterate over them while
      holding IRQ-safe spinlock.
      
      Let's simplify locking by replacing per-cgroup freezer->lock with
      global freezer_mutex.  This also makes the comments explaining the
      intricacies of policy inheritance and the locking around it as the
      states are protected by a common mutex.
      
      The conversion is mostly straight-forward.  The followings are worth
      mentioning.
      
      * freezer_css_online() no longer needs double locking.
      
      * freezer_attach() now performs propagation simply while holding
        freezer_mutex.  update_if_frozen() race no longer exists and the
        comment is removed.
      
      * freezer_fork() now tests whether the task is in root cgroup using
        the new task_css_is_root() without doing rcu_read_lock/unlock().  If
        not, it grabs freezer_mutex and performs the operation.
      
      * freezer_read() and freezer_change_state() grab freezer_mutex across
        the whole operation and pin the css while iterating so that each
        descendant processing happens in sleepable context.
      
      Fixes: 96d365e0 ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      e5ced8eb
    • Tejun Heo's avatar
      cgroup: introduce task_css_is_root() · 5024ae29
      Tejun Heo authored
      Determining the css of a task usually requires RCU read lock as that's
      the only thing which keeps the returned css accessible till its
      reference is acquired; however, testing whether a task belongs to the
      root can be performed without dereferencing the returned css by
      comparing the returned pointer against the root one in init_css_set[]
      which never changes.
      
      Implement task_css_is_root() which can be invoked in any context.
      This will be used by the scheduled cgroup_freezer change.
      
      v2: cgroup no longer supports modular controllers.  No need to export
          init_css_set.  Pointed out by Li.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      5024ae29