- 17 Aug, 2018 40 commits
-
-
Colin Ian King authored
The functions zs_page_isolate, zs_page_migrate, zs_page_putback, lock_zspage, trylock_zspage and structure zsmalloc_aops are local to source and do not need to be in global scope, so make them static. Cleans up sparse warnings: symbol 'zs_page_isolate' was not declared. Should it be static? symbol 'zs_page_migrate' was not declared. Should it be static? symbol 'zs_page_putback' was not declared. Should it be static? symbol 'zsmalloc_aops' was not declared. Should it be static? symbol 'lock_zspage' was not declared. Should it be static? symbol 'trylock_zspage' was not declared. Should it be static? [arnd@arndb.de: hide unused lock_zspage] Link: http://lkml.kernel.org/r/20180706130924.3891230-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20180624213322.13776-1-colin.king@canonical.comSigned-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Greg Thelen authored
Commit 93f78d88 ("writeback: move backing_dev_info->bdi_stat[] into bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in account_page_redirty(). Update comment to track that change. BDI_DIRTIED => WB_DIRTIED BDI_WRITTEN => WB_WRITTEN Link: http://lkml.kernel.org/r/20180625171526.173483-1-gthelen@google.comSigned-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Shakeel Butt authored
The buffer_head can consume a significant amount of system memory and is directly related to the amount of page cache. In our production environment we have observed that a lot of machines are spending a significant amount of memory as buffer_head and can not be left as system memory overhead. Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the allocation. The buffer_heads can be allocated in a memcg different from the memcg of the page for which buffer_heads are being allocated. One concrete example is memory reclaim. The reclaim can trigger I/O of pages of any memcg on the system. So, the right way to charge buffer_head is to extract the memcg from the page for which buffer_heads are being allocated and then use targeted memcg charging API. [shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging] Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.comSigned-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Shakeel Butt authored
Patch series "Directed kmem charging", v8. The Linux kernel's memory cgroup allows limiting the memory usage of the jobs running on the system to provide isolation between the jobs. All the kernel memory allocated in the context of the job and marked with __GFP_ACCOUNT will also be included in the memory usage and be limited by the job's limit. The kernel memory can only be charged to the memcg of the process in whose context kernel memory was allocated. However there are cases where the allocated kernel memory should be charged to the memcg different from the current processes's memcg. This patch series contains two such concrete use-cases i.e. fsnotify and buffer_head. The fsnotify event objects can consume a lot of system memory for large or unlimited queues if there is either no or slow listener. The events are allocated in the context of the event producer. However they should be charged to the event consumer. Similarly the buffer_head objects can be allocated in a memcg different from the memcg of the page for which buffer_head objects are being allocated. To solve this issue, this patch series introduces mechanism to charge kernel memory to a given memcg. In case of fsnotify events, the memcg of the consumer can be used for charging and for buffer_head, the memcg of the page can be charged. For directed charging, the caller can use the scope API memalloc_[un]use_memcg() to specify the memcg to charge for all the __GFP_ACCOUNT allocations within the scope. This patch (of 2): A lot of memory can be consumed by the events generated for the huge or unlimited queues if there is either no or slow listener. This can cause system level memory pressure or OOMs. So, it's better to account the fsnotify kmem caches to the memcg of the listener. However the listener can be in a different memcg than the memcg of the producer and these allocations happen in the context of the event producer. This patch introduces remote memcg charging API which the producer can use to charge the allocations to the memcg of the listener. There are seven fsnotify kmem caches and among them allocations from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and inotify_inode_mark_cachep happens in the context of syscall from the listener. So, SLAB_ACCOUNT is enough for these caches. The objects from fsnotify_mark_connector_cachep are not accounted as they are small compared to the notification mark or events and it is unclear whom to account connector to since it is shared by all events attached to the inode. The allocations from the event caches happen in the context of the event producer. For such caches we will need to remote charge the allocations to the listener's memcg. Thus we save the memcg reference in the fsnotify_group structure of the listener. This patch has also moved the members of fsnotify_group to keep the size same, at least for 64 bit build, even with additional member by filling the holes. [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it] Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.comSigned-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Roman Gushchin authored
Introduce the mem_cgroup_put() helper, which helps to eliminate guarding memcg css release with "#ifdef CONFIG_MEMCG" in multiple places. Link: http://lkml.kernel.org/r/20180623000600.5818-2-guro@fb.comSigned-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Luis R. Rodriguez authored
Some architectures just don't have PAGE_KERNEL_EXEC. The mm/nommu.c and mm/vmalloc.c code have been using PAGE_KERNEL as a fallback for years. Move this fallback to asm-generic. Link: http://lkml.kernel.org/r/20180510185507.2439-3-mcgrof@kernel.orgSigned-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Luis R. Rodriguez authored
Some architectures do not define certain PAGE_KERNEL_* flags, this is either because: a) The way to implement some of these flags is *not yet ported*, or b) The architecture *has no way* to describe them Over time we have accumulated a few PAGE_KERNEL_* fallback workarounds for architectures in the kernel which do not define them using *relatively safe* equivalents. Move these scattered fallback hacks into asm-generic. We start off with PAGE_KERNEL_RO using PAGE_KERNEL as a fallback. This has been in place on the firmware loader for years. Move the fallback into the respective asm-generic header. Link: http://lkml.kernel.org/r/20180510185507.2439-2-mcgrof@kernel.orgSigned-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Oscar Salvador authored
Callers of register_mem_sect_under_node() are always passing a valid memory_block (not NULL), so we can safely drop the check for NULL. In the same way, register_mem_sect_under_node() is only called in case the node is online, so we can safely remove that check as well. Link: http://lkml.kernel.org/r/20180622111839.10071-5-osalvador@techadventures.netSigned-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Oscar Salvador authored
link_mem_sections() and walk_memory_range() share most of the code, so we can use convert link_mem_sections() into a dummy function that calls walk_memory_range() with a callback to register_mem_sect_under_node(). This patch converts register_mem_sect_under_node() in order to match a walk_memory_range's callback, getting rid of the check_nid argument and checking instead if the system is still boothing, since we only have to check for the nid if the system is in such state. Link: http://lkml.kernel.org/r/20180622111839.10071-4-osalvador@techadventures.netSigned-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Oscar Salvador authored
When hotplugging memory, it is possible that two calls are being made to register_mem_sect_under_node(). One comes from __add_section()->hotplug_memory_register() and the other from add_memory_resource()->link_mem_sections() if we had to register a new node. In case we had to register a new node, hotplug_memory_register() will only handle/allocate the memory_block's since register_mem_sect_under_node() will return right away because the node it is not online yet. I think it is better if we leave hotplug_memory_register() to handle/allocate only memory_block's and make link_mem_sections() to call register_mem_sect_under_node(). So this patch removes the call to register_mem_sect_under_node() from hotplug_memory_register(), and moves the call to link_mem_sections() out of the condition, so it will always be called. In this way we only have one place where the memory sections are registered. Link: http://lkml.kernel.org/r/20180622111839.10071-3-osalvador@techadventures.netSigned-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Oscar Salvador authored
This is a small cleanup for the memhotplug code. A lot more could be done, but it is better to start somewhere. I tried to unify/remove duplicated code. The following is what this patchset does: 1) add_memory_resource() has code to allocate a node in case it was offline. Since try_online_node has some code for that as well, I just made add_memory_resource() to use that so we can remove duplicated code.. This is better explained in patch 1/4. 2) register_mem_sect_under_node() will be called only from link_mem_sections() 3) Make register_mem_sect_under_node() a callback of walk_memory_range() 4) Drop unnecessary checks from register_mem_sect_under_node() I have done some tests and I could not see anything broken because of this patchset. add_memory_resource() contains code to allocate a new node in case it is necessary. Since try_online_node() also has some code for this purpose, let us make use of that and remove duplicate code. This introduces __try_online_node(), which is called by add_memory_resource() and try_online_node(). __try_online_node() has two new parameters, start_addr of the node, and if the node should be onlined and registered right away. This is always wanted if we are calling from do_cpu_up(), but not when we are calling from memhotplug code. Nothing changes from the point of view of the users of try_online_node(), since try_online_node passes start_addr=0 and online_node=true to __try_online_node(). Link: http://lkml.kernel.org/r/20180622111839.10071-2-osalvador@techadventures.netSigned-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Andrew Morton authored
__list_lru_count_one() has a single callsite. Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sebastian Andrzej Siewior authored
shadow_lru_isolate() disables interrupts and acquires a lock. It could use spin_lock_irq() instead. It also uses local_irq_enable() while it could use spin_unlock_irq()/xa_unlock_irq(). Use proper suffix for lock/unlock in order to enable/disable interrupts during release/acquire of a lock. Link: http://lkml.kernel.org/r/20180622151221.28167-3-bigeasy@linutronix.deSigned-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Sebastian Andrzej Siewior authored
Patch series "mm: use irq locking suffix instead local_irq_disable()". A small series which avoids using local_irq_disable()/local_irq_enable() but instead does spin_lock_irq()/spin_unlock_irq() so it is within the context of the lock which it belongs to. Patch #1 is a cleanup where local_irq_.*() remained after the lock was removed. This patch (of 2): In 0c7c1bed ("mm: make counting of list_lru_one::nr_items lockless") the spin_lock(&nlru->lock); statement was replaced with rcu_read_lock(); in __list_lru_count_one(). The comment in count_shadow_nodes() says that the local_irq_disable() is required because the lock must be acquired with disabled interrupts and (spin_lock()) does not do so. Since the lock is replaced with rcu_read_lock() the local_irq_disable() is no longer needed. The code path is list_lru_shrink_count() -> list_lru_count_one() -> __list_lru_count_one() -> rcu_read_lock() -> list_lru_from_memcg_idx() -> rcu_read_unlock() Remove the local_irq_disable() statement. Link: http://lkml.kernel.org/r/20180622151221.28167-2-bigeasy@linutronix.deSigned-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
There is no real reason to blow up just because the caller doesn't know that __get_free_pages cannot return highmem pages. Simply fix that up silently. Even if we have some confused users such a fixup will not be harmful. [akpm@linux-foundation.org: mask off __GFP_HIGHMEM] Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.orgSigned-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jiankang Chen <chenjiankang1@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Huang Ying authored
This is to take better advantage of the general huge page copying optimization. Where, the target subpage will be copied last to avoid the cache lines of target subpage to be evicted when copying other subpages. This works better if the address of the target subpage is available when copying huge page. So hugetlbfs page fault handlers are changed to pass that information to hugetlb_cow(). This will benefit workloads which don't access the begin of the hugetlbfs huge page after the page fault under heavy cache contention. Link: http://lkml.kernel.org/r/20180524005851.4079-5-ying.huang@intel.comSigned-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Punit Agrawal <punit.agrawal@arm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Huang Ying authored
To take better advantage of general huge page copying optimization, the target subpage address will be passed to hugetlb_cow(), then copy_user_huge_page(). So we will use both target subpage address and huge page size aligned address in hugetlb_cow(). To distinguish between them, "haddr" is used for huge page size aligned address to be consistent with Transparent Huge Page naming convention. Now, only huge page size aligned address is used in hugetlb_cow(), so the "address" is renamed to "haddr" in hugetlb_cow() in this patch. Next patch will use target subpage address in hugetlb_cow() too. The patch is just code cleanup without any functionality changes. Link: http://lkml.kernel.org/r/20180524005851.4079-4-ying.huang@intel.comSigned-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Punit Agrawal <punit.agrawal@arm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Huang Ying authored
Huge page helps to reduce TLB miss rate, but it has higher cache footprint, sometimes this may cause some issue. For example, when copying huge page on x86_64 platform, the cache footprint is 4M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC (last level cache). That is, in average, there are 2.5M LLC for each core and 1.25M LLC for each thread. If the cache contention is heavy when copying the huge page, and we copy the huge page from the begin to the end, it is possible that the begin of huge page is evicted from the cache after we finishing copying the end of the huge page. And it is possible for the application to access the begin of the huge page after copying the huge page. In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing huge page"), to keep the cache lines of the target subpage hot, the order to clear the subpages in the huge page in clear_huge_page() is changed to clearing the subpage which is furthest from the target subpage firstly, and the target subpage last. The similar order changing helps huge page copying too. That is implemented in this patch. Because we have put the order algorithm into a separate function, the implementation is quite simple. The patch is a generic optimization which should benefit quite some workloads, not for a specific use case. To demonstrate the performance benefit of the patch, we tested it with vm-scalability run on transparent huge page. With this patch, the throughput increases ~16.6% in vm-scalability anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699 system (36 cores, 72 threads). The test case set /sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big anonymous memory area and populate it, then forked 36 child processes, each writes to the anonymous memory area from the begin to the end, so cause copy on write. For each child process, other child processes could be seen as other workloads which generate heavy cache pressure. At the same time, the IPC (instruction per cycle) increased from 0.63 to 0.78, and the time spent in user space is reduced ~7.2%. Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.comSigned-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Huang Ying authored
Patch series "mm, huge page: Copy target sub-page last when copy huge page", v2. Huge page helps to reduce TLB miss rate, but it has higher cache footprint, sometimes this may cause some issue. For example, when copying huge page on x86_64 platform, the cache footprint is 4M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC (last level cache). That is, in average, there are 2.5M LLC for each core and 1.25M LLC for each thread. If the cache contention is heavy when copying the huge page, and we copy the huge page from the begin to the end, it is possible that the begin of huge page is evicted from the cache after we finishing copying the end of the huge page. And it is possible for the application to access the begin of the huge page after copying the huge page. In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing huge page"), to keep the cache lines of the target subpage hot, the order to clear the subpages in the huge page in clear_huge_page() is changed to clearing the subpage which is furthest from the target subpage firstly, and the target subpage last. The similar order changing helps huge page copying too. That is implemented in this patchset. The patchset is a generic optimization which should benefit quite some workloads, not for a specific use case. To demonstrate the performance benefit of the patchset, we have tested it with vm-scalability run on transparent huge page. With this patchset, the throughput increases ~16.6% in vm-scalability anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699 system (36 cores, 72 threads). The test case set /sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big anonymous memory area and populate it, then forked 36 child processes, each writes to the anonymous memory area from the begin to the end, so cause copy on write. For each child process, other child processes could be seen as other workloads which generate heavy cache pressure. At the same time, the IPC (instruction per cycle) increased from 0.63 to 0.78, and the time spent in user space is reduced ~7.2%. This patch (of 4): In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing huge page"), to keep the cache lines of the target subpage hot, the order to clear the subpages in the huge page in clear_huge_page() is changed to clearing the subpage which is furthest from the target subpage firstly, and the target subpage last. This optimization could be applied to copying huge page too with the same order algorithm. To avoid code duplication and reduce maintenance overhead, in this patch, the order algorithm is moved out of clear_huge_page() into a separate function: process_huge_page(). So that we can use it for copying huge page too. This will change the direct calls to clear_user_highpage() into the indirect calls. But with the proper inline support of the compilers, the indirect call will be optimized to be the direct call. Our tests show no performance change with the patch. This patch is a code cleanup without functionality change. Link: http://lkml.kernel.org/r/20180524005851.4079-2-ying.huang@intel.comSigned-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jens Axboe authored
a_ops->readpages() is only ever used for read-ahead. Ensure that we pass this information down to the block layer. Link: http://lkml.kernel.org/r/20180621010725.17813-5-axboe@kernel.dkSigned-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jens Axboe authored
a_ops->readpages() is only ever used for read-ahead. Ensure that we pass this information down to the block layer. Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dkSigned-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jens Axboe authored
a_ops->readpages() is only ever used for read-ahead, yet we don't flag the IO being submitted as such. Fix that up. Any file system that uses mpage_readpages() as its ->readpages() implementation will now get this right. Since we're passing in whether the IO is read-ahead or not, we don't need to pass in the 'gfp' separately, as it is dependent on the IO being read-ahead. Kill off that member. Add some documentation notes on ->readpages() being purely for read-ahead. Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dkSigned-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jens Axboe authored
Patch series "Submit ->readpages() IO as read-ahead", v4. The only caller of ->readpages() is from read-ahead, yet we don't submit IO flagged with REQ_RAHEAD. This means we don't see it in blktrace, for instance, which is a shame. Additionally, it's preventing further functional changes in the block layer for deadling with read-ahead more intelligently. We already make assumptions about ->readpages() just being for read-ahead in the mpage implementation, using readahead_gfp_mask(mapping) as out GFP mask of choice. This small series fixes up mpage_readpages() to submit with REQ_RAHEAD, which takes care of file systems using mpage_readpages(). The first patch is a prep patch, that makes do_mpage_readpage() take an argument structure. This patch (of 4): We're currently passing 8 arguments to this function, clean it up a bit by packing the arguments in an args structure we pass to it. No intentional functional changes in this patch. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20180621010725.17813-2-axboe@kernel.dkSigned-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yang Shi authored
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed is used to record the counter of collapsed THP, but it just gets inc'ed in anonymous THP collapse path, do this for shmem THP collapse too. Link: http://lkml.kernel.org/r/1529622949-75504-2-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yang Shi authored
When merging anonymous page vma, if the size of the vma can fit in at least one hugepage, the mm will be registered for khugepaged for collapsing THP in the future. But it skips shmem vmas. Do so for shmem also, but not for file-private mappings when merging a vma in order to increase the odds of collapsing a hugepage via khugepaged. hugepage_vma_check() sounds like a good fit to do the check. And move the definition of it before khugepaged_enter_vma_merge() to avoid a build error. Link: http://lkml.kernel.org/r/1529697791-6950-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Jia-Ju Bai authored
The argument "gfp_t flags" is not used in kasan_unpoison_element() and remove_element(), so remove it. Link: http://lkml.kernel.org/r/20180621070332.16633-1-baijiaju1990@gmail.comSigned-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Greg Thelen authored
Use smaller scan_control fields for order, priority, and reclaim_idx. Convert fields from int => s8. All easily fit within a byte: - allocation order range: 0..MAX_ORDER(64?) - priority range: 0..12(DEF_PRIORITY) - reclaim_idx range: 0..6(__MAX_NR_ZONES) Since 6538b8ea ("x86_64: expand kernel stack to 16K") x86_64 stack overflows are not an issue. But it's inefficient to use ints. Use s8 (signed byte) rather than u8 to allow for loops like: do { ... } while (--sc.priority >= 0); Add BUILD_BUG_ON to verify that s8 is capable of storing max values. This reduces sizeof(struct scan_control): - 96 => 80 bytes (x86_64) - 68 => 56 bytes (i386) scan_control structure field order is changed to utilize padding. After this patch there is 1 bit of scan_control padding. akpm: makes my vmscan.o's .text 572 bytes smaller as well. Link: http://lkml.kernel.org/r/20180530061212.84915-1-gthelen@google.comSigned-off-by: Greg Thelen <gthelen@google.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
lookup_page_ext() finds 'struct page_ext' for a given page. It requires only read access to the given struct page. Current implemnentation takes 'struct page *' as an argument. It makes compiler complain when 'const struct page *' passed. Change the argument to 'const struct page *'. Link: http://lkml.kernel.org/r/20180531135457.20167-3-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
After commit bd33ef36 ("mm: enable page poisoning early at boot") PAGE_EXT_DEBUG_POISON is not longer used. Remove it. Link: http://lkml.kernel.org/r/20180531135457.20167-2-kirill.shutemov@linux.intel.comSigned-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Arnd Bergmann authored
get_seconds() is deprecated because it will lead to a 32-bit overflow in 2038 or 2106. We don't need the i_generation to be strictly monotonic anyway, and other file systems like ext4 and xfs just use prandom_u32(), so let's use the same one here. If this is considered too slow, we could also use ktime_get_seconds() or ktime_get_real_seconds() to keep the previous behavior. Both of these return a time64_t and are not deprecated, but only return a unique value once per second, and are predictable. Link: http://lkml.kernel.org/r/20180620082556.581543-1-arnd@arndb.deSigned-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vlastimil Babka authored
__alloc_pages_slowpath() has for a long time contained code to ignore node restrictions from memory policies for high priority allocations. The current code that resets the zonelist iterator however does effectively nothing after commit 7810e678 ("mm, page_alloc: do not break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset. Even before that commit, mempolicy restrictions were still not ignored, as they are passed in ac->nodemask which is untouched by the code. We can either remove the code, or make it work as intended. Since ac->nodemask can be set from task's mempolicy via alloc_pages_current() and thus also alloc_pages(), it may indeed affect kernel allocations, and it makes sense to ignore it to allow progress for high priority allocations. Thus, this patch resets ac->nodemask to NULL in such cases. This assumes all callers can handle it (i.e. there are no guarantees as in the case of __GFP_THISNODE) which seems to be the case. The same assumption is already present in check_retry_cpuset() for some time. The expected effect is that high priority kernel allocations in the context of userspace tasks (e.g. OOM victims) restricted by mempolicies will have higher chance to succeed if they are restricted to nodes with depleted memory, while there are other nodes with free memory left. It's not a new intention, but for the first time the code will match the intention, AFAICS. It was intended by commit 183f6371 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never really worked, as mempolicy restriction was already encoded in nodemask, not zonelist, at that time. So originally that was for ALLOC_NO_WATERMARK only. Then it was adjusted by e46e7b77 ("mm, page_alloc: recalculate the preferred zoneref if the context can ignore memory policies") and cd04ae1e ("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the current state. So even GFP_ATOMIC would now ignore mempolicies after the initial attempts fail - if the code worked as people thought it does. Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.czSigned-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christian Hansen authored
Add a flag which causes page-types to use the kernels's idle page tracking to mark pages idle. As the tool already prints the idle flag if set, subsequent runs will show which pages have been accessed since last run. [akpm@linux-foundation.org: simplify mark_page_idle()] [chansen3@cisco.com: reorganize mark_page_idle() logic, add docs] Link: http://lkml.kernel.org/r/20180706172237.21691-1-chansen3@cisco.com Link: http://lkml.kernel.org/r/20180612153223.13174-1-chansen3@cisco.comSigned-off-by: Christian Hansen <chansen3@cisco.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Christian Hansen authored
Add a new flag that will read kpagecount for each PFN and print out the number of times the page is mapped along with the flags in the listing view. This information is useful in understanding and optimizing memory usage. Identifying pages which are not shared allows us to focus on adjusting the memory layout or access patterns for the sole owning process. Knowing the number of processes that share a page tells us how many other times we must make the same adjustments or how many processes to potentially disable. Truncated sample output: voffset map-cnt offset len flags 561a3591e 1 15fe8 1 ___U_lA____Ma_b___________________________ 561a3591f 1 2b103 1 ___U_lA____Ma_b___________________________ 561a36ca4 1 2cc78 1 ___U_lA____Ma_b___________________________ 7f588bb4e 14 2273c 1 __RU_lA____M______________________________ [akpm@linux-foundation.org: coding-style fixes] [chansen3@cisco.com: add documentation, tweak whitespace] Link: http://lkml.kernel.org/r/20180705181204.5529-1-chansen3@cisco.com Link: http://lkml.kernel.org/r/20180612153205.12879-1-chansen3@cisco.comSigned-off-by: Christian Hansen <chansen3@cisco.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Yang Shi authored
Since commit eca56ff9 ("mm, shmem: add internal shmem resident memory accounting"), MM_SHMEMPAGES is added to separate the shmem accounting from regular files. So, all shmem pages should be accounted to MM_SHMEMPAGES instead of MM_FILEPAGES. And, normal 4K shmem pages have been accounted to MM_SHMEMPAGES, so shmem thp pages should be not treated differently. Account them to MM_SHMEMPAGES via mm_counter_file() since shmem pages are swap backed to keep consistent with normal 4K shmem pages. This will not change the rss counter of processes since shmem pages are still a part of it. The /proc/pid/status and /proc/pid/statm counters will however be more accurate wrt shmem usage, as originally intended. And as eca56ff9 ("mm, shmem: add internal shmem resident memory accounting") mentioned, oom also could report more accurate "shmem-rss". Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Pavel Tatashin authored
The role of zero_resv_unavail() is to make sure that every struct page that is allocated but is not backed by memory that is accessible by kernel is zeroed and not in some uninitialized state. Since struct pages are allocated in blocks (2M pages in x86 case), we can skip pageblock_nr_pages at a time, when the first one is found to be invalid. This optimization may help since now on x86 every hole in e820 maps is marked as reserved in memblock, and thus will go through this function. This function is called before sched_clock() is initialized, so I used my x86 early boot clock patches to measure the performance improvement. With 1T hole on i7-8700 currently we would take 0.606918s of boot time, but with this optimization 0.001103s. Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.comSigned-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Souptick Joarder authored
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Ref-> commit 1c8f4220 ("mm: change return type to vm_fault_t") In this patch all the caller of handle_mm_fault() are changed to return vm_fault_t type. Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PCSigned-off-by: Souptick Joarder <jrdr.linux@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Tony Luck <tony.luck@intel.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: James Hogan <jhogan@kernel.org> Cc: Ley Foon Tan <lftan@altera.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: James E.J. Bottomley <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: David S. Miller <davem@davemloft.net> Cc: Richard Weinberger <richard@nod.at> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Vlastimil Babka authored
In SLUB, prefetch_freepointer() is used when allocating an object from cache's freelist, to make sure the next object in the list is cache-hot, since it's probable it will be allocated soon. Commit 2482ddec ("mm: add SLUB free list pointer obfuscation") has unintentionally changed the prefetch in a way where the prefetch is turned to a real fetch, and only the next->next pointer is prefetched. In case there is not a stream of allocations that would benefit from prefetching, the extra real fetch might add a useless cache miss to the allocation. Restore the previous behavior. Link: http://lkml.kernel.org/r/20180809085245.22448-1-vbabka@suse.cz Fixes: 2482ddec ("mm: add SLUB free list pointer obfuscation") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kees Cook <keescook@chromium.org> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
NeilBrown authored
The documentation for seq_file suggests that it is necessary to be able to move the iterator to a given offset, however that is not the case. If the iterator is stored in the private data and is stable from one read() syscall to the next, it is only necessary to support first/next interactions. Implementing this in a client is a little clumsy. - if ->start() is given a pos of zero, it should go to start of sequence. - if ->start() is given the name pos that was given to the most recent next() or start(), it should restore the iterator to state just before that last call - if ->start is given another number, it should set the iterator one beyond the start just before the last ->start or ->next call. Also, the documentation says that the implementation can interpret the pos however it likes (other than zero meaning start), but seq_file increments the pos sometimes which does impose on the implementation. This patch simplifies the interface for first/next iteration and simplifies the code, while maintaining complete backward compatability. Now: - if ->start() is given a pos of zero, it should return an iterator placed at the start of the sequence - if ->start() is given a non-zero pos, it should return the iterator in the same state it was after the last ->start or ->next. This is particularly useful for interators which walk the multiple chains in a hash table, e.g. using rhashtable_walk*. See fs/gfs2/glock.c and drivers/staging/lustre/lustre/llite/vvp_dev.c A large part of achieving this is to *always* call ->next after ->show has successfully stored all of an entry in the buffer. Never just increment the index instead. Also: - always pass &m->index to ->start() and ->next(), never a temp variable - don't clear ->from when ->count is zero, as ->from is dead when ->count is zero. Some ->next functions do not increment *pos when they return NULL. To maintain compatability with this, we still need to increment m->index in one place, if ->next didn't increment it. Note that such ->next functions are buggy and should be fixed. A simple demonstration is dd if=/proc/swaps bs=1000 skip=1 Choose any block size larger than the size of /proc/swaps. This will always show the whole last line of /proc/swaps. This patch doesn't work around buggy next() functions for this case. [neilb@suse.com: ensure ->from is valid] Link: http://lkml.kernel.org/r/87601ryb8a.fsf@notabene.neil.brown.nameSigned-off-by: NeilBrown <neilb@suse.com> Acked-by: Jonathan Corbet <corbet@lwn.net> [docs] Tested-by: Jann Horn <jannh@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
NeilBrown authored
This flag was introduce in 2.1.37pre1 and the only place it was tested was removed in 2.1.43pre1. The flag was never set. Let's discard it properly. Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.nameSigned-off-by: NeilBrown <neilb@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Tetsuo Handa authored
Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes are initialized at __d_alloc(), we can't copy the whole size unconditionally. WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50) 636f6e66696766732e746d70000000000010000000000000020000000188ffff i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u ^ RIP: 0010:take_dentry_name_snapshot+0x28/0x50 RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246 RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002 RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60 RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001 R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00 R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0 take_dentry_name_snapshot+0x28/0x50 vfs_rename+0x128/0x870 SyS_rename+0x3b2/0x3d0 entry_SYSCALL_64_fastpath+0x1a/0xa4 0xffffffffffffffff Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jpSigned-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-