1. 26 Jul, 2011 40 commits
    • memcg: add memory.vmscan_stat · 82f9d486
      KAMEZAWA Hiroyuki authored
      The commit log of 0ae5e89c ("memcg: count the soft_limit reclaim
      in...") says it adds scanning stats to the memory.stat file.  But it
      doesn't, because we decided we needed to reach a consensus on such new
      APIs first.
      
      This patch is a trial to add memory.vmscan_stat. This shows
        - the number of scanned pages (total, anon, file)
        - the number of rotated pages (total, anon, file)
        - the number of freed pages (total, anon, file)
        - the elapsed time in nanoseconds (including sleep/pause time)

        for both direct and soft reclaim.

      The biggest difference from Ying's original version is that this file
      can be reset by a write, as in

        # echo 0 > ...../memory.vmscan_stat
      
      Here is example output, taken after a "make -j 6" kernel build under a
      300M limit.

        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
        scanned_pages_by_limit 9471864
        scanned_anon_pages_by_limit 6640629
        scanned_file_pages_by_limit 2831235
        rotated_pages_by_limit 4243974
        rotated_anon_pages_by_limit 3971968
        rotated_file_pages_by_limit 272006
        freed_pages_by_limit 2318492
        freed_anon_pages_by_limit 962052
        freed_file_pages_by_limit 1356440
        elapsed_ns_by_limit 351386416101
        scanned_pages_by_system 0
        scanned_anon_pages_by_system 0
        scanned_file_pages_by_system 0
        rotated_pages_by_system 0
        rotated_anon_pages_by_system 0
        rotated_file_pages_by_system 0
        freed_pages_by_system 0
        freed_anon_pages_by_system 0
        freed_file_pages_by_system 0
        elapsed_ns_by_system 0
        scanned_pages_by_limit_under_hierarchy 9471864
        scanned_anon_pages_by_limit_under_hierarchy 6640629
        scanned_file_pages_by_limit_under_hierarchy 2831235
        rotated_pages_by_limit_under_hierarchy 4243974
        rotated_anon_pages_by_limit_under_hierarchy 3971968
        rotated_file_pages_by_limit_under_hierarchy 272006
        freed_pages_by_limit_under_hierarchy 2318492
        freed_anon_pages_by_limit_under_hierarchy 962052
        freed_file_pages_by_limit_under_hierarchy 1356440
        elapsed_ns_by_limit_under_hierarchy 351386416101
        scanned_pages_by_system_under_hierarchy 0
        scanned_anon_pages_by_system_under_hierarchy 0
        scanned_file_pages_by_system_under_hierarchy 0
        rotated_pages_by_system_under_hierarchy 0
        rotated_anon_pages_by_system_under_hierarchy 0
        rotated_file_pages_by_system_under_hierarchy 0
        freed_pages_by_system_under_hierarchy 0
        freed_anon_pages_by_system_under_hierarchy 0
        freed_file_pages_by_system_under_hierarchy 0
        elapsed_ns_by_system_under_hierarchy 0
      
      The xxxx_under_hierarchy entries are for hierarchy management.
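
As a minimal sketch of how a userspace tool might consume this output, the helper below looks up one counter in text laid out as "name value" lines, and derives how many pages were freed per 100 scanned. The function names are hypothetical and not part of the patch; only the counter names come from the output above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text formatted like the memory.vmscan_stat
 * output: one "name value" pair per line.
 * Returns 0 and stores the value on success, -1 if the name is absent.
 */
int vmscan_stat_lookup(const char *text, const char *name,
                       unsigned long long *value)
{
    const char *line = text;

    while (*line) {
        char key[64];
        unsigned long long v;

        /* Exact name match, so "_under_hierarchy" variants stay distinct. */
        if (sscanf(line, "%63s %llu", key, &v) == 2 &&
            strcmp(key, name) == 0) {
            *value = v;
            return 0;
        }
        line = strchr(line, '\n');
        if (!line)
            break;
        line++;                 /* step past the newline */
    }
    return -1;
}

/* Pages freed per 100 pages scanned by limit reclaim, or -1 if unknown. */
int freed_per_100_scanned(const char *text)
{
    unsigned long long scanned, freed;

    if (vmscan_stat_lookup(text, "scanned_pages_by_limit", &scanned) ||
        vmscan_stat_lookup(text, "freed_pages_by_limit", &freed) ||
        scanned == 0)
        return -1;
    return (int)(freed * 100 / scanned);
}
```

With the sample numbers above (9471864 scanned, 2318492 freed), the ratio works out to 24 freed pages per 100 scanned.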
      
      This will be useful for further memcg development and needs to be
      developed before we do some complicated rework on LRU/softlimit
      management.

      This patch adds a new struct memcg_scanrecord to the scan_control struct.
      sc->nr_scanned et al. are not designed for exporting information.  For
      example, nr_scanned is reset frequently and is incremented by 2 when
      scanning mapped pages.
      
      To avoid complexity, I added a new param in scan_control which is for
      exporting scanning score.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andrew Bresticker <abrestic@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix behavior of mem_cgroup_resize_limit() · 108b6a78
      Daisuke Nishimura authored
      Commit 22a668d7 ("memcg: fix behavior under memory.limit equals to
      memsw.limit") introduced "memsw_is_minimum" flag, which becomes true
      when mem_limit == memsw_limit.  The flag is checked at the beginning of
      reclaim, and "noswap" is set if the flag is true, because using swap is
      meaningless in this case.
      
      This works well in most cases, but when we try to shrink mem_limit,
      which is the same as memsw_limit now, we might fail to shrink mem_limit
      because swap isn't used.
      
      This patch fixes this behavior by:
       - checking MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
       - not setting the "noswap" flag if it is set, even if memsw_is_minimum
         is true.
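
The decision described above can be sketched as a small pure function. This is only an illustration under stated assumptions: the flag names and bit values below are hypothetical stand-ins, not the kernel's definitions, and the real reclaim path considers more state than this.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; not the kernel's actual values. */
#define RECLAIM_NOSWAP  (1u << 0)
#define RECLAIM_SHRINK  (1u << 1)

/*
 * Decide whether reclaim should skip swap.
 * memsw_is_minimum is true when mem_limit == memsw_limit.
 */
bool reclaim_noswap(unsigned int flags, bool memsw_is_minimum)
{
    bool noswap = (flags & RECLAIM_NOSWAP) != 0;
    bool shrink = (flags & RECLAIM_SHRINK) != 0;

    /* Swap is normally pointless when mem_limit == memsw_limit ... */
    if (memsw_is_minimum && !shrink)
        noswap = true;
    /* ... but a limit shrink may need swap-out to make progress. */
    return noswap;
}
```

An explicit no-swap request is still honoured during a shrink; only the implicit "noswap" derived from memsw_is_minimum is suppressed.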
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix vmscan count in small memcgs · 4508378b
      KAMEZAWA Hiroyuki authored
      Commit 246e87a9 ("memcg: fix get_scan_count() for small targets")
      fixes the memcg/kswapd behavior against small targets and prevents the
      vmscan priority from getting too high.

      But the implementation is too naive and adds another problem for small
      memcgs.  It always forces a scan of 32 pages of file/anon and doesn't
      handle swappiness and other rotate_info.  It makes vmscan scan the anon
      LRU regardless of swappiness and makes reclaim worse.  This patch fixes
      it by adjusting the scan count with regard to swappiness et al.
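
To illustrate the idea of weighting a forced scan by swappiness, here is a simplified sketch. It assumes the anon priority equals swappiness and the file priority equals 200 - swappiness, and it deliberately ignores the rotate statistics that the commit also takes into account; the struct and function names are hypothetical.

```c
/* Scan budget split between the anon and file LRUs. */
struct scan_split {
    unsigned long anon;
    unsigned long file;
};

/*
 * Split a forced scan budget by swappiness (0..100).
 * Assumes anon priority = swappiness and file priority = 200 - swappiness;
 * rotate statistics are not modelled here.
 */
struct scan_split split_forced_scan(unsigned long budget,
                                    unsigned int swappiness)
{
    struct scan_split s;
    unsigned long anon_prio = swappiness;
    unsigned long file_prio = 200 - swappiness;

    s.anon = budget * anon_prio / (anon_prio + file_prio);
    s.file = budget * file_prio / (anon_prio + file_prio);
    return s;
}
```

With a budget of 32 pages and swappiness=20, this gives 3 anon pages and 28 file pages, rather than 32 of each.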
      
      In a test, "cat 1G file under 300M limit" (swappiness=20):
       before patch
              scanned_pages_by_limit 360919
              scanned_anon_pages_by_limit 180469
              scanned_file_pages_by_limit 180450
              rotated_pages_by_limit 31
              rotated_anon_pages_by_limit 25
              rotated_file_pages_by_limit 6
              freed_pages_by_limit 180458
              freed_anon_pages_by_limit 19
              freed_file_pages_by_limit 180439
              elapsed_ns_by_limit 429758872
       after patch
              scanned_pages_by_limit 180674
              scanned_anon_pages_by_limit 24
              scanned_file_pages_by_limit 180650
              rotated_pages_by_limit 35
              rotated_anon_pages_by_limit 24
              rotated_file_pages_by_limit 11
              freed_pages_by_limit 180634
              freed_anon_pages_by_limit 0
              freed_file_pages_by_limit 180634
              elapsed_ns_by_limit 367119089
              scanned_pages_by_system 0
      
      The number of scanned anon pages decreased (as expected), and the
      elapsed time was reduced.  With this patch, small memcgs will work
      better.
      (*) Because the amount of file cache is much bigger than anon,
          reclaim_stat's rotate-scan counter makes reclaim scan files more.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: change memcg_oom_mutex to spinlock · 1af8efe9
      Michal Hocko authored
      memcg_oom_mutex is used to protect the memcg OOM path and the eventfd
      interface for oom_control.  None of the critical sections which it
      protects sleep (eventfd_signal works from atomic context and the rest
      are simple linked list or oom_lock atomic operations).

      A mutex is also too heavyweight for those code paths because it triggers
      a lot of scheduling.  It also makes convoying effects more visible
      when we have a large number of OOM kills, because we take the lock
      multiple times during mem_cgroup_handle_oom, so we have multiple places
      where many processes can sleep.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: make oom_lock 0 and 1 based rather than counter · 79dfdacc
      Michal Hocko authored
      Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock
      counter which is incremented by mem_cgroup_oom_lock when we are about to
      handle a memcg OOM situation.  mem_cgroup_handle_oom falls back to a
      sleep if oom_lock > 1 to prevent multiple OOM kills at the same time.
      The counter is then decremented by mem_cgroup_oom_unlock called from the
      same function.

      This works correctly but it can lead to serious starvation when we have
      many processes triggering OOM and many CPUs available for them (I have
      tested with 16 CPUs).

      Consider a process (call it A) which gets the oom_lock (the first one
      that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
      processes that are blocked on the mutex.  When A releases the mutex and
      calls mem_cgroup_out_of_memory, the others will wake up (one after
      another), increase the counter and fall asleep (memcg_oom_waitq).

      Once A finishes mem_cgroup_out_of_memory it takes the mutex again,
      decreases oom_lock and wakes other tasks (if releasing memory by
      somebody else - e.g.  killed process - hasn't done it yet).
      
      A test case would look like:
        Assume malloc XXX is a program allocating XXX megabytes of memory
        which touches all allocated pages in a tight loop
        # swapoff SWAP_DEVICE
        # cgcreate -g memory:A
        # cgset -r memory.oom_control=0   A
        # cgset -r memory.limit_in_bytes=200M A
        # for i in `seq 100`
        # do
        #     cgexec -g memory:A   malloc 10 &
        # done
      
      The main problem here is that all processes still race for the mutex and
      there is no guarantee that we will get the counter back to 0 for those
      that got back to mem_cgroup_handle_oom.  In the end the whole convoy
      in/decreases the counter but we do not get to 1 that would enable
      killing, so nothing useful can be done.  The time is basically unbounded
      because it highly depends on scheduling and ordering on the mutex (I
      have seen this taking hours...).

      This patch replaces the counter by a simple {un}lock semantic.  As
      mem_cgroup_oom_{un}lock works on a subtree of a hierarchy we have to
      make sure that nobody else races with us, which is guaranteed by the
      memcg_oom_mutex.

      We have to be careful while locking subtrees because we can encounter a
      subtree which is already locked.  Consider this hierarchy:
      
                A
              /   \
             B     \
            /\      \
           C  D     E
      
      The B - C - D tree might already be locked.  While we want to allow
      locking the E subtree, because OOM situations cannot influence each
      other, we definitely do not want to allow locking A.

      Therefore we have to refuse the lock if any subtree is already locked
      and clear the lock for all nodes that have been set up to the failure
      point.
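
The refuse-and-roll-back rule can be sketched in plain C as follows. This is a userspace model only: struct group, its fixed-size child array and the function names are hypothetical, and the caller is assumed to serialize all calls (the role memcg_oom_mutex plays in the text).

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for a memory cgroup in a hierarchy. */
struct group {
    const char *name;
    bool oom_lock;
    struct group *child[MAX_CHILDREN];
    size_t nr_children;
};

/* Pre-order walk that sets oom_lock; returns the first node found locked. */
static struct group *lock_walk(struct group *g)
{
    if (g->oom_lock)
        return g;
    g->oom_lock = true;
    for (size_t i = 0; i < g->nr_children; i++) {
        struct group *failed = lock_walk(g->child[i]);
        if (failed)
            return failed;
    }
    return NULL;
}

/* Same pre-order walk; clears the locks we set, stopping at 'failed'. */
static bool undo_walk(struct group *g, const struct group *failed)
{
    if (g == failed)
        return true;            /* reached the failure point: stop */
    g->oom_lock = false;
    for (size_t i = 0; i < g->nr_children; i++)
        if (undo_walk(g->child[i], failed))
            return true;
    return false;
}

/* Lock the whole subtree, or lock nothing if any part is already locked. */
bool subtree_oom_trylock(struct group *root)
{
    struct group *failed = lock_walk(root);

    if (!failed)
        return true;
    undo_walk(root, failed);
    return false;
}
```

Because the undo pass repeats the same traversal order and stops at the failure point, it only clears locks set by the failed attempt and leaves locks held by earlier callers (B, C, D or E in the diagram) untouched.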
      
      On the other hand we have to make sure that the rest of the world will
      recognize that a group is under OOM even though it doesn't have a lock.
      Therefore we have to introduce an under_oom variable which is
      incremented for the whole subtree when we enter mem_cgroup_handle_oom
      and decremented when we leave it.  under_oom, unlike oom_lock, doesn't
      need to be updated under memcg_oom_mutex because its users only check a
      single group and they use atomic operations for that.
      
      This can be checked easily by the following test case:
      
        # cgcreate -g memory:A
        # cgset -r memory.use_hierarchy=1 A
        # cgset -r memory.oom_control=1   A
        # cgset -r memory.limit_in_bytes=100M A
        # cgset -r memory.memsw.limit_in_bytes=100M A
        # cgcreate -g memory:A/B
        # cgset -r memory.oom_control=1 A/B
        # cgset -r memory.limit_in_bytes=20M A/B
        # cgset -r memory.memsw.limit_in_bytes=20M A/B
        # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
        # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A
      
      While B gets oom_lock, A will not get it.  Both of them go to sleep and
      wait for an external action.  We can raise the limit for A to force it
      to wake up:
      
        # cgset -r memory.memsw.limit_in_bytes=300M A
        # cgset -r memory.limit_in_bytes=300M A
      
      malloc in A has to wake up even though it doesn't have oom_lock.
      
      Finally, the unlock path is very easy because we always unlock only the
      subtree we have locked previously while we always decrement under_oom.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: consolidate memory cgroup lru stat functions · bb2a0de9
      KAMEZAWA Hiroyuki authored
      In mm/memcontrol.c, there are many LRU stat functions, such as:
      
        mem_cgroup_zone_nr_lru_pages
        mem_cgroup_node_nr_file_lru_pages
        mem_cgroup_nr_file_lru_pages
        mem_cgroup_node_nr_anon_lru_pages
        mem_cgroup_nr_anon_lru_pages
        mem_cgroup_node_nr_unevictable_lru_pages
        mem_cgroup_nr_unevictable_lru_pages
        mem_cgroup_node_nr_lru_pages
        mem_cgroup_nr_lru_pages
        mem_cgroup_get_local_zonestat
      
      Some of them are under #if MAX_NUMNODES > 1 and others are not.
      This seems bad. This patch consolidates all functions into
      
        mem_cgroup_zone_nr_lru_pages()
        mem_cgroup_node_nr_lru_pages()
        mem_cgroup_nr_lru_pages()
      
      For these functions, "which LRU?" information is passed by a mask.
      
      example:
        mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))
      
      I also added some macros: ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.
      
      example:
        mem_cgroup_nr_lru_pages(mem, ALL_LRU)
      
      BTW, considering the NUMA memory placement of the counters, this patch
      seems to be better.

      Now, when we gather all LRU information, we scan in the following order:
          for_each_lru -> for_each_node -> for_each_zone.

      This means we'll touch cache lines on different nodes in turn.

      After the patch, we'll scan
          for_each_node -> for_each_zone -> for_each_lru(mask)

      Then, we'll gather information in the same cacheline at once.
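
A minimal sketch of the mask-based interface and the node -> zone -> lru iteration order follows. The macro names are taken from the commit message, but their definitions here are assumptions, as are the fixed two-node, two-zone topology and the struct lru_counts container.

```c
#define BIT(nr) (1UL << (nr))

enum lru_list {
    LRU_INACTIVE_ANON,
    LRU_ACTIVE_ANON,
    LRU_INACTIVE_FILE,
    LRU_ACTIVE_FILE,
    LRU_UNEVICTABLE,
    NR_LRU_LISTS
};

/* Mask names follow the commit message; the definitions are assumed. */
#define ALL_LRU_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
#define ALL_LRU_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
#define ALL_LRU      (BIT(NR_LRU_LISTS) - 1)

#define NR_NODES 2   /* hypothetical topology for this sketch */
#define NR_ZONES 2

/* All LRU counters of one zone sit next to each other in memory. */
struct lru_counts {
    unsigned long count[NR_NODES][NR_ZONES][NR_LRU_LISTS];
};

/* Sum the page counts of every LRU list selected by lru_mask. */
unsigned long nr_lru_pages(const struct lru_counts *c, unsigned long lru_mask)
{
    unsigned long total = 0;

    /* node -> zone -> lru: finish one zone's counters before moving on */
    for (int nid = 0; nid < NR_NODES; nid++)
        for (int zid = 0; zid < NR_ZONES; zid++)
            for (int lru = 0; lru < NR_LRU_LISTS; lru++)
                if (lru_mask & BIT(lru))
                    total += c->count[nid][zid][lru];
    return total;
}
```

A single function with a mask argument replaces the per-type variants, and the innermost loop walks counters that are adjacent in memory.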
      
      [akpm@linux-foundation.org: fix warnings, build error]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: export memory cgroup's swappiness with mem_cgroup_swappiness() · 1f4c025b
      KAMEZAWA Hiroyuki authored
      Each memory cgroup has a 'swappiness' value which can be accessed by
      get_swappiness(memcg).  The major user is try_to_free_mem_cgroup_pages()
      and swappiness is passed as an argument.  It's propagated by scan_control.

      get_swappiness() is a static function but some planned updates will need
      to get swappiness from files other than memcontrol.c.  This patch exports
      get_swappiness() as mem_cgroup_swappiness().  With this, we can remove the
      swappiness argument from try_to_free...  and drop swappiness from
      scan_control.  Only memcg uses it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rtc: fix hrtimer deadlock · b830ac1d
      Thomas Gleixner authored
      Ben reported a lockup related to rtc. The lockup happens due to:
      
      CPU0                                        CPU1

      rtc_irq_set_state()                         __run_hrtimer()
        spin_lock_irqsave(&rtc->irq_task_lock)    rtc_handle_legacy_irq();
                                                    spin_lock(&rtc->irq_task_lock);
        hrtimer_cancel()
          while (callback_running);
      
      So the running callback never finishes as it's blocked on
      rtc->irq_task_lock.
      
      Use hrtimer_try_to_cancel() instead and drop rtc->irq_task_lock while
      waiting for the callback.  Fix this for both rtc_irq_set_state() and
      rtc_irq_set_freq().
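
The retry pattern can be illustrated with a toy, single-threaded model. Everything here is hypothetical (struct toy_rtc, the helper names, the step counter standing in for a running callback); it only mimics the documented try-to-cancel convention of returning a negative value while the callback is executing.

```c
#include <stdbool.h>

/* Toy single-threaded model; every name here is hypothetical. */
struct toy_rtc {
    bool lock_held;          /* stands in for rtc->irq_task_lock */
    bool timer_armed;
    int  callback_steps;     /* > 0 while the timer callback is running */
};

/* Mimics try-to-cancel semantics: -1 while the callback is running. */
static int toy_try_to_cancel(struct toy_rtc *rtc)
{
    bool was_armed = rtc->timer_armed;

    if (rtc->callback_steps > 0)
        return -1;
    rtc->timer_armed = false;
    return was_armed ? 1 : 0;
}

/* The callback needs the lock, so it only advances while the lock is free. */
static void toy_callback_tick(struct toy_rtc *rtc)
{
    if (!rtc->lock_held && rtc->callback_steps > 0)
        rtc->callback_steps--;
}

/* Stop the timer without waiting for the callback while holding the lock. */
int toy_stop_timer(struct toy_rtc *rtc)
{
    int retries = 0;

    for (;;) {
        rtc->lock_held = true;
        if (toy_try_to_cancel(rtc) >= 0)
            break;
        rtc->lock_held = false;   /* let the callback take the lock */
        toy_callback_tick(rtc);
        retries++;
    }
    rtc->lock_held = false;
    return retries;
}
```

If the loop kept the lock while waiting, toy_callback_tick() could never make progress, which is the deadlock shown in the diagram.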
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Ben Greear <greearb@candelatech.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rtc: limit frequency · 431e2bcc
      Thomas Gleixner authored
      Due to the hrtimer self rearming mode a user can DoS the machine simply
      because it's starved by hrtimer events.
      
      The RTC hrtimer is self rearming.  We really need to limit the frequency
      to something sensible.
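
A frequency cap of this kind might look like the sketch below. The cap value and all names are assumptions made for illustration; the commit message does not state which limit was chosen.

```c
#include <errno.h>

/* Assumed cap for this sketch; the commit text does not state the value. */
#define SKETCH_MAX_FREQ 8192

#define SKETCH_NSEC_PER_SEC 1000000000LL

/*
 * Return the timer period in nanoseconds for a requested frequency in Hz,
 * or -EINVAL if the frequency is non-positive or above the cap.
 */
long long irq_period_ns(int freq)
{
    if (freq <= 0 || freq > SKETCH_MAX_FREQ)
        return -EINVAL;
    return SKETCH_NSEC_PER_SEC / freq;
}
```

Rejecting out-of-range requests up front bounds how often a self-rearming timer can fire, whatever value userspace asks for.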
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rtc: handle errors correctly in rtc_irq_set_state() · 2c4f57d1
      Thomas Gleixner authored
      The code checks the correctness of the parameters, but unconditionally
      arms/disarms the hrtimer.
      
      The result is that a random task might arm/disarm rtc timer and surprise
      the real owner by either generating events or by stopping them.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ben Greear <greearb@candelatech.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mn10300, exec: remove redundant set_fs(USER_DS) · b45d59fb
      Mathias Krause authored
      The address limit is already set in flush_old_exec() so this
      set_fs(USER_DS) is redundant.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/base/power/opp.c: fix dev_opp initial value · fc92805a
      Jonghwan Choi authored
      The initial value of dev_opp should be ERR_PTR(), since IS_ERR() is
      used to check for errors.
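
The pitfall can be shown with a userspace re-creation of the error-pointer helpers. The helper definitions below follow the well-known kernel convention (the top 4095 addresses encode errno values); struct opp_entry, find_entry() and the choice of -ENODEV are hypothetical and only mirror the pattern described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <errno.h>

#define MAX_ERRNO 4095

/* Userspace re-creation of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical table entry and lookup mirroring the pattern above. */
struct opp_entry {
    int id;
};

struct opp_entry *find_entry(struct opp_entry *table, size_t n, int id)
{
    /*
     * Start from an error pointer. IS_ERR(NULL) is false, so a NULL
     * default would look like success to a caller that checks IS_ERR().
     */
    struct opp_entry *found = ERR_PTR(-ENODEV);

    for (size_t i = 0; i < n; i++) {
        if (table[i].id == id) {
            found = &table[i];
            break;
        }
    }
    return found;
}
```

A caller that tests the result with IS_ERR() now sees the not-found case as an error instead of dereferencing NULL.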
      Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • frv, exec: remove redundant set_fs(USER_DS) · adc400f6
      Mathias Krause authored
      The address limit is already set in flush_old_exec() so those calls to
      set_fs(USER_DS) are redundant.
      
      Also removed the dead code in flush_thread().
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus · 6fd4ce88
      Linus Torvalds authored
      * 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus: (31 commits)
        MIPS: Close races in TLB modify handlers.
        MIPS: Add uasm UASM_i_SRL_SAFE macro.
        MIPS: RB532: Use hex_to_bin()
        MIPS: Enable cpu_has_clo_clz for MIPS Technologies' platforms
        MIPS: PowerTV: Provide cpu-feature-overrides.h
        MIPS: Remove pointless return statement from empty void functions.
        MIPS: Limit fixrange_init() to the FIXMAP region
        MIPS: Install handlers for software IRQs
        MIPS: Move FIXADDR_TOP into spaces.h
        MIPS: Add SYNC after cacheflush
        MIPS: pfn_valid() is broken on low memory HIGHMEM systems
        MIPS: HIGHMEM DMA on noncoherent MIPS32 processors
        MIPS: topdown mmap support
        MIPS: Remove redundant addr_limit assignment on exec.
        MIPS: AR7: Replace __attribute__((__packed__)) with __packed
        MIPS: AR7: Remove 'space before tabs' in platform.c
        MIPS: Lantiq: Add missing clk_enable and clk_disable functions.
        MIPS: AR7: Fix trailing semicolon bug in clock.c
        MAINTAINERS: Update MIPS entry.
        MIPS: BCM63xx: Remove duplicate PERF_IRQSTAT_REG definition
        ...
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client · ba5b56cb
      Linus Torvalds authored
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
        ceph: document unlocked d_parent accesses
        ceph: explicitly reference rename old_dentry parent dir in request
        ceph: document locking for ceph_set_dentry_offset
        ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
        ceph: protect d_parent access in ceph_d_revalidate
        ceph: protect access to d_parent
        ceph: handle racing calls to ceph_init_dentry
        ceph: set dir complete frag after adding capability
        rbd: set blk_queue request sizes to object size
        ceph: set up readahead size when rsize is not passed
        rbd: cancel watch request when releasing the device
        ceph: ignore lease mask
        ceph: fix ceph_lookup_open intent usage
        ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
        ceph: fix bad parent_inode calc in ceph_lookup_open
        ceph: avoid carrying Fw cap during write into page cache
        libceph: don't time out osd requests that haven't been received
        ceph: report f_bfree based on kb_avail rather than diffing.
        ceph: only queue capsnap if caps are dirty
        ceph: fix snap writeback when racing with writes
        ...
    • gma500: udelay(20000) it too long again · 243dd280
      Stephen Rothwell authored
      so replace it with mdelay(20).
      
      Fixes build error:
      
        ERROR: "__bad_udelay" [drivers/staging/gma500/psb_gfx.ko] undefined!
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • USB / Renesas: Fix build issue related to struct scatterlist · 9c646cfc
      Rafael J. Wysocki authored
      Fix build issue caused by undefined struct scatterlist in
      drivers/usb/renesas_usbhs/fifo.c.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • MMC / TMIO: Fix build issue related to struct scatterlist · 6c0cbef6
      Rafael J. Wysocki authored
      Fix build issue caused by undefined struct scatterlist in
      drivers/mmc/host/tmio_mmc.c.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 · 2ac232f3
      Linus Torvalds authored
      * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
        jbd: change the field "b_cow_tid" of struct journal_head from type unsigned to tid_t
        ext3.txt: update the links in the section "useful links" to the latest ones
        ext3: Fix data corruption in inodes with journalled data
        ext2: check xattr name_len before acquiring xattr_sem in ext2_xattr_get
        ext3: Fix compilation with -DDX_DEBUG
        quota: Remove unused declaration
        jbd: Use WRITE_SYNC in journal checkpoint.
        jbd: Fix oops in journal_remove_journal_head()
        ext3: Return -EINVAL when start is beyond the end of fs in ext3_trim_fs()
        ext3/ioctl.c: silence sparse warnings about different address spaces
        ext3/ext4 Documentation: remove bh/nobh since it has been deprecated
        ext3: Improve truncate error handling
        ext3: use proper little-endian bitops
        ext2: include fs.h into ext2_fs.h
        ext3: Fix oops in ext3_try_to_allocate_with_rsv()
        jbd: fix a bug of leaking jh->b_jcount
        jbd: remove dependency on __GFP_NOFAIL
        ext3: Convert ext3 to new truncate calling convention
        jbd: Add fixed tracepoints
        ext3: Add fixed tracepoints
      
      Resolve conflicts in fs/ext3/fsync.c due to fsync locking push-down and
      new fixed tracepoints.
    • ceph: document unlocked d_parent accesses · d79698da
      Sage Weil authored
      For the most part we don't care about racing with rename when directing
      MDS requests; either the old or new parent is fine.  Document that, and
      do some minor cleanup.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: explicitly reference rename old_dentry parent dir in request · 41b02e1f
      Sage Weil authored
      We carry a pin on the parent directory for the rename source and dest
      dentries.  For the source it's r_locked_dir; we need to explicitly
      reference the old_dentry parent as well, since the dentry's d_parent may
      change between when the request was created and pinned and when it is
      freed.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
    • Sage Weil's avatar
      4f177264
    • ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug · e5f86dc3
      Sage Weil authored
      Have the caller pass in a safely-obtained reference to the parent
      directory for calculating a dentry's hash value.

      While we're here, simplify the flow through ceph_encode_fh() so that
      there is a single exit point and cleanup.
      
      Also fix a bug with the dentry hash calculation: calculate the hash for the
      dentry we were given, not its parent.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: protect d_parent access in ceph_d_revalidate · bf1c6aca
      Sage Weil authored
      Protect d_parent with d_lock.  Carry a reference.  Simplify the flow so
      that there is a single exit point and cleanup.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: protect access to d_parent · 5f21c96d
      Sage Weil authored
      d_parent is protected by d_lock: use it when looking up a dentry's parent
      directory inode.  Also take a reference and drop it in the caller to avoid
      a use-after-free.
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: handle racing calls to ceph_init_dentry · 48d0cbd1
      Sage Weil authored
      The ->lookup() and prepopulate_readdir() callers are working with unhashed
      dentries, so we don't have to worry.  The export.c callers, though, need
      to initialize something they got back from d_obtain_alias() and are
      potentially racing with other callers.  Make sure we don't return unless
      the dentry is properly initialized (by us or someone else).
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
    • ceph: set dir complete frag after adding capability · dfabbed6
      Sage Weil authored
      Currently ceph_add_cap clears the complete bit if we are newly issued the
      FILE_SHARED cap, which is normally the case for a newly issued cap on a new
      directory.  That means we clear the just-set bit.  Move the check that sets
      the flag to after the cap is added/updated.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
    • rbd: set blk_queue request sizes to object size · 029bcbd8
      Josh Durgin authored
      This improves performance since more requests can be merged.
      Reviewed-by: default avatarYehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: default avatarJosh Durgin <josh.durgin@dreamhost.com>
      029bcbd8
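      The sizing idea in this commit can be illustrated with plain arithmetic.
      The sketch below assumes the object size is a power of two described by
      an order value and that a sector is 512 bytes; the helper names are
      hypothetical and are not the rbd driver's functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sector size in bytes (illustrative constant). */
#define SKETCH_SECTOR_SIZE 512u

/* Object size in bytes for a given order, assuming size = 2^order. */
static uint64_t sketch_object_size(unsigned int obj_order)
{
    assert(obj_order < 64);  /* the shift below must stay in range */
    return (uint64_t)1 << obj_order;
}

/* Largest request, in sectors, that still fits inside one object. */
static uint64_t sketch_max_sectors(unsigned int obj_order)
{
    return sketch_object_size(obj_order) / SKETCH_SECTOR_SIZE;
}
```

      With an order of 22 (4 MiB objects) this gives 8192 sectors per request,
      so a request limit of that size lets the block layer merge adjacent I/O
      up to a whole object.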
    • Yehuda Sadeh's avatar
      ceph: set up readahead size when rsize is not passed · e9852227
      Yehuda Sadeh authored
      This should improve the default read performance, as without it
      readahead is practically disabled.
      Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      e9852227
    • Yehuda Sadeh's avatar
      rbd: cancel watch request when releasing the device · 79e3057c
      Yehuda Sadeh authored
      We were missing this cleanup, so when a device was released the OSD did
      not clean up its watchers list.  Subsequent notifications could then be
      slow, because the OSD had to time out on the client.
      Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      79e3057c
    • Sage Weil's avatar
      ceph: ignore lease mask · 2f90b852
      Sage Weil authored
      The lease mask is no longer used (and it changed a while back).  Instead,
      use a non-zero duration to indicate that there is a lease being issued.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      2f90b852
    • Sage Weil's avatar
      ceph: fix ceph_lookup_open intent usage · 468640e3
      Sage Weil authored
      We weren't properly calling lookup_instantiate_filp when setting up the
      lookup intent, which could lead to file leakage on errors.  So:
      
       - use separate helper for the hidden snapdir translation, immediately
         following the mds request
       - use ceph_finish_lookup for the final dentry/return value dance in the
         exit path
       - call lookup_instantiate_filp on success
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      468640e3
    • Sage Weil's avatar
      ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC · 9bae113a
      Sage Weil authored
      We only need to put these on the directory unsafe list if they have
      side effects that fsync(2) should flush out.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      9bae113a
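      The rule described in this commit amounts to a flag test.  The sketch
      below is a userspace illustration using the standard open(2) flags; the
      function names and the opaque directory pointer are hypothetical and do
      not reflect the ceph client's actual code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/* An open has side effects worth flushing if it may create or truncate. */
static bool sketch_open_has_side_effects(int flags)
{
    return (flags & (O_CREAT | O_TRUNC)) != 0;
}

/*
 * Return the directory to link the request to, or NULL when the open
 * has no side effects that a directory fsync would need to flush.
 */
static const void *sketch_unsafe_dir(const void *dir, int flags)
{
    assert(dir != NULL);  /* caller must always supply the parent dir */
    return sketch_open_has_side_effects(flags) ? dir : NULL;
}
```

      A plain read-only open yields NULL here, so it is never tracked, while
      an open with O_CREAT or O_TRUNC keeps its parent directory.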
    • Sage Weil's avatar
      ceph: fix bad parent_inode calc in ceph_lookup_open · acda7657
      Sage Weil authored
      We were always getting NULL here because the intent file f_dentry is always
      NULL at this point, which means we were always passing NULL to
      ceph_mdsc_do_request.  In reality, this was fine, since this isn't
      currently ever a write operation that needs to get strung on the dir's
      unsafe list.
      
      Use the dir explicitly, and only pass it if this open has side-effects that
      a dir fsync should flush.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      acda7657
    • Sage Weil's avatar
      ceph: avoid carrying Fw cap during write into page cache · d8de9ab6
      Sage Weil authored
      The generic_file_aio_write call may block on balance_dirty_pages while we
      flush data to the OSDs.  If we hold a reference to the FILE_WR cap during
      that interval, revocation by the MDS (e.g., to do a stat(2)) may be very
      slow.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      d8de9ab6
    • Sage Weil's avatar
      libceph: don't time out osd requests that haven't been received · 4cf9d544
      Sage Weil authored
      Keep track of when an outgoing message is ACKed (i.e., the server fully
      received it and, presumably, queued it for processing).  Time out OSD
      requests only if it's been too long since they've been received.
      
      This prevents timeouts and connection thrashing when the OSDs are simply
      busy and are throttling the requests they read off the network.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      4cf9d544
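      The timeout rule in this commit can be sketched as a small predicate.
      This is a minimal illustration, assuming each request records whether
      and when the server ACKed it; the struct, field names, and time units
      are hypothetical and are not libceph's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request record; timestamps are in arbitrary ticks. */
struct sketch_request {
    uint64_t sent_at;   /* when the request was handed to the network */
    uint64_t acked_at;  /* when the server ACKed receipt (if acked) */
    bool acked;         /* false until the ACK arrives */
};

/* A request may time out only after the server has received it. */
static bool sketch_request_timed_out(const struct sketch_request *req,
                                     uint64_t now, uint64_t timeout)
{
    assert(req != NULL);
    if (!req->acked)
        return false;  /* not yet received: the server may just be busy */
    /* Measure from the ACK, not from the send time. */
    return now >= req->acked_at && now - req->acked_at > timeout;
}
```

      Measuring from the ACK rather than from the send time means a request
      that a throttling server has not yet read off the network is never
      declared timed out, however long ago it was sent.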
    • Greg Farnum's avatar
    • Sage Weil's avatar
      ceph: only queue capsnap if caps are dirty · e77dc3e9
      Sage Weil authored
      We used to go into this branch if i_wrbuffer_ref_head was non-zero.  This
      was an ancient check from before we were careful about dealing with all
      kinds of caps (and not just dirty pages).  It is cleaner to only queue a
      capsnap if there is an actual dirty cap.  If we are racing with...
      something...we will end up here with ci->i_wrbuffer_refs but no dirty
      caps.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      e77dc3e9
    • Sage Weil's avatar
      ceph: fix snap writeback when racing with writes · af0ed569
      Sage Weil authored
      There are two problems that come up when we try to queue a capsnap while a
      write is in progress:
      
       - The FILE_WR cap is held, but not yet dirty, so we may queue a capsnap
         with dirty == 0.  That will crash later in __ceph_flush_snaps().  Or
         on the FILE_WR cap if a write is in progress.
       - We may not have i_head_snapc set, which causes problems pretty quickly.
         Look to the snaprealm in this case.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      af0ed569
    • Sage Weil's avatar
      ceph: use flag bit for at_end readdir flag · 9cfa1098
      Sage Weil authored
      This saves us a word of memory per file.
      Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
      Signed-off-by: Sage Weil <sage@newdream.net>
      9cfa1098
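      The space saving in this commit comes from folding a boolean into an
      existing flags word.  A minimal sketch, assuming the per-file structure
      already carries a flags field; the struct, macro, and function names
      are hypothetical and are not the ceph client's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_F_SYNC   (1u << 0)  /* stand-in for a pre-existing flag */
#define SKETCH_F_ATEND  (1u << 1)  /* readdir has reached the end */

/* Hypothetical per-file state: one flags word instead of separate ints. */
struct sketch_file_info {
    unsigned int flags;
};

/* Set or clear the at-end bit without disturbing the other flags. */
static void sketch_set_at_end(struct sketch_file_info *fi, bool at_end)
{
    assert(fi != NULL);
    if (at_end)
        fi->flags |= SKETCH_F_ATEND;
    else
        fi->flags &= ~SKETCH_F_ATEND;
}

/* Report whether the at-end bit is currently set. */
static bool sketch_at_end(const struct sketch_file_info *fi)
{
    assert(fi != NULL);
    return (fi->flags & SKETCH_F_ATEND) != 0;
}
```

      Because the bit shares a word that is already present, dropping the
      dedicated integer field shrinks the structure by one word per open
      file.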