Commit f1f1f256 authored by Ivan Babrou's avatar Ivan Babrou Committed by Andrew Morton

proc: report open files as size in stat() for /proc/pid/fd

Many monitoring tools include open file count as a metric.  Currently the
only way to get this number is to enumerate the files in /proc/pid/fd.

The problem with the current approach is that it does many things people
generally don't care about when they need one number for a metric.  In our
tests for cadvisor, which reports open file counts per cgroup, we observed
that reading the number of open files is slow.  Out of 35.23% of CPU time
spent in `proc_readfd_common`, we see 29.43% spent in `proc_fill_cache`,
which is responsible for filling dentry info.  Some of this extra time is
spinlock contention, but it's a contention for the lock we don't want to
take to begin with.

We considered putting the number of open files in /proc/pid/status. 
Unfortunately, counting the number of fds involves iterating the
open_files bitmap, which has a linear complexity in proportion with the
number of open files (bitmap slots really, but it's close).  We don't want
to make /proc/pid/status any slower, so instead we put this info in
/proc/pid/fd as a size member of the stat syscall result.  Previously the
reported number was zero, so there's very little risk of breaking
anything, while still providing a somewhat logical way to count the open
files with a fallback if it's zero.

RFC for this patch included iterating open fds under RCU.  Thanks to Frank
Hofmann for the suggestion to use the bitmap instead.

Previously:

```
$ sudo stat /proc/1/fd | head -n2
  File: /proc/1/fd
  Size: 0         	Blocks: 0          IO Block: 1024   directory
```

With this patch:

```
$ sudo stat /proc/1/fd | head -n2
  File: /proc/1/fd
  Size: 65        	Blocks: 0          IO Block: 1024   directory
```

Correctness check:

```
$ sudo ls /proc/1/fd | wc -l
65
```

I added the docs for /proc/<pid>/fd while I'm at it.

[ivan@cloudflare.com: use bitmap_weight() to count the bits]
  Link: https://lkml.kernel.org/r/20221018045844.37697-1-ivan@cloudflare.com
[akpm@linux-foundation.org: include linux/bitmap.h for bitmap_weight()]
[ivan@cloudflare.com: return errno from proc_fd_getattr() instead of setting negative size]
  Link: https://lkml.kernel.org/r/20221024173140.30673-1-ivan@cloudflare.com
Link: https://lkml.kernel.org/r/20220922224027.59266-1-ivan@cloudflare.comSigned-off-by: default avatarIvan Babrou <ivan@cloudflare.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
parent 2122e2a4
...@@ -47,6 +47,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009 ...@@ -47,6 +47,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
3.10 /proc/<pid>/timerslack_ns - Task timerslack value 3.10 /proc/<pid>/timerslack_ns - Task timerslack value
3.11 /proc/<pid>/patch_state - Livepatch patch operation state 3.11 /proc/<pid>/patch_state - Livepatch patch operation state
3.12 /proc/<pid>/arch_status - Task architecture specific information 3.12 /proc/<pid>/arch_status - Task architecture specific information
3.13 /proc/<pid>/fd - List of symlinks to open files
4 Configuring procfs 4 Configuring procfs
4.1 Mount options 4.1 Mount options
...@@ -2149,6 +2150,22 @@ AVX512_elapsed_ms ...@@ -2149,6 +2150,22 @@ AVX512_elapsed_ms
the task is unlikely an AVX512 user, but depends on the workload and the the task is unlikely an AVX512 user, but depends on the workload and the
scheduling scenario, it also could be a false negative mentioned above. scheduling scenario, it also could be a false negative mentioned above.
3.13 /proc/<pid>/fd - List of symlinks to open files
-------------------------------------------------------
This directory contains symbolic links which represent open files
the process is maintaining. Example output::
lr-x------ 1 root root 64 Sep 20 17:53 0 -> /dev/null
l-wx------ 1 root root 64 Sep 20 17:53 1 -> /dev/null
lrwx------ 1 root root 64 Sep 20 17:53 10 -> 'socket:[12539]'
lrwx------ 1 root root 64 Sep 20 17:53 11 -> 'socket:[12540]'
lrwx------ 1 root root 64 Sep 20 17:53 12 -> 'socket:[12542]'
The number of open files for the process is stored in 'size' member
of stat() output for /proc/<pid>/fd for fast access.
-------------------------------------------------------
Chapter 4: Configuring procfs Chapter 4: Configuring procfs
============================= =============================
......
...@@ -7,6 +7,7 @@ ...@@ -7,6 +7,7 @@
#include <linux/namei.h> #include <linux/namei.h>
#include <linux/pid.h> #include <linux/pid.h>
#include <linux/ptrace.h> #include <linux/ptrace.h>
#include <linux/bitmap.h>
#include <linux/security.h> #include <linux/security.h>
#include <linux/file.h> #include <linux/file.h>
#include <linux/seq_file.h> #include <linux/seq_file.h>
...@@ -279,6 +280,30 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx, ...@@ -279,6 +280,30 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
return 0; return 0;
} }
static int proc_readfd_count(struct inode *inode, loff_t *count)
{
struct task_struct *p = get_proc_task(inode);
struct fdtable *fdt;
if (!p)
return -ENOENT;
task_lock(p);
if (p->files) {
rcu_read_lock();
fdt = files_fdtable(p->files);
*count = bitmap_weight(fdt->open_fds, fdt->max_fds);
rcu_read_unlock();
}
task_unlock(p);
put_task_struct(p);
return 0;
}
static int proc_readfd(struct file *file, struct dir_context *ctx) static int proc_readfd(struct file *file, struct dir_context *ctx)
{ {
return proc_readfd_common(file, ctx, proc_fd_instantiate); return proc_readfd_common(file, ctx, proc_fd_instantiate);
...@@ -319,9 +344,29 @@ int proc_fd_permission(struct user_namespace *mnt_userns, ...@@ -319,9 +344,29 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
return rv; return rv;
} }
static int proc_fd_getattr(struct user_namespace *mnt_userns,
const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
{
struct inode *inode = d_inode(path->dentry);
int rv = 0;
generic_fillattr(&init_user_ns, inode, stat);
/* If it's a directory, put the number of open fds there */
if (S_ISDIR(inode->i_mode)) {
rv = proc_readfd_count(inode, &stat->size);
if (rv < 0)
return rv;
}
return rv;
}
const struct inode_operations proc_fd_inode_operations = { const struct inode_operations proc_fd_inode_operations = {
.lookup = proc_lookupfd, .lookup = proc_lookupfd,
.permission = proc_fd_permission, .permission = proc_fd_permission,
.getattr = proc_fd_getattr,
.setattr = proc_setattr, .setattr = proc_setattr,
}; };
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment