Commit 0192236d authored by Kirill Smelkov's avatar Kirill Smelkov

vfs: use &file->f_pos directly on files that have position

Long ago vfs read/write operations were passing ppos=&file->f_pos directly to
.read / .write file_operations methods. That changed in 2004 in 55f09ec0
("read/write: pass down a copy of f_pos, not f_pos itself.") which started to
pass ppos=&local_var trying to avoid simultaneous read/write/lseek stepping
onto each other toes and overwriting file->f_pos racily. That measure was not
complete and in 2014 commit 9c225f26 ("vfs: atomic f_pos accesses as per
POSIX") added file->f_pos_lock to completely disable simultaneous
read/write/lseek runs. After f_pos_lock was introduced the reason to avoid
passing ppos=&file->f_pos directly due to concurrency vanished.

Linus explains[1]:

	In fact, we *used* to (long ago) pass in the address of "file->f_pos"
	itself to the low-level read/write routines. We then changed it to do
	that indirection through a local copy of pos (and
	file_pos_read/file_pos_write) because we didn't do the proper locking,
	so different read/write versions could mess with each other (and with
	lseek).

	But one of the things that commit 9c225f26 ("vfs: atomic f_pos
	accesses as per POSIX") did was to add the proper locking at least for
	the cases that we care about deeply, so we *could* say that we have
	three cases:

	 - FMODE_ATOMIC_POS: properly locked,
	 - FMODE_STREAM: no pos at all
	 - otherwise a "mostly don't care - don't mix!"

	and so we could go back to not copying the pos at all, and instead do
	something like

	        loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
	        ret = vfs_write(f.file, buf, count, ppos);

	and perhaps have a long-term plan to try to get rid of the "don't mix"
	case entirely (ie "if you use f_pos, then we'll do the proper
	locking")

	(The above is obviously surrounded by the fdget_pos()/fdput_pos() that
	implements the locking decision).

Currently for regular files we always set FMODE_ATOMIC_POS and change that to
FMODE_STREAM if stream_open is used explicitly on open. That leaves other
files, like e.g. sockets and pipes, for "mostly don't care - don't mix!" case.

Sockets, for example, always check that on read/write the initial pos they
receive is 0 and don't update it. And if it is !0 they return -ESPIPE.
That suggests that we can do the switch into passing &file->f_pos directly now
and incrementally convert to FMODE_STREAM files that were doing the stream-like
checking manually in their low-level .read/.write handlers.

Note: it is theoretically possible that a driver updates *ppos inside
even if read/write returns error. For such cases the conversion will
change IO semantic a bit. The semantic that is changing here was
introduced in 2013 in commit 5faf153e "don't call file_pos_write()
if vfs_{read,write}{,v}() fails".

[1] https://lore.kernel.org/linux-fsdevel/CAHk-=whJtZt52SnhBGrNMnuxFn3GE9X_e02x8BPxtkqrfyZukw@mail.gmail.com/Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kirill Smelkov's avatarKirill Smelkov <kirr@nexedi.com>
parent 3df9bc6c
...@@ -579,14 +579,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count) ...@@ -579,14 +579,7 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
ssize_t ret = -EBADF; ssize_t ret = -EBADF;
if (f.file) { if (f.file) {
loff_t pos, *ppos = file_ppos(f.file); ret = vfs_read(f.file, buf, count, file_ppos(f.file));
if (ppos) {
pos = *ppos;
ppos = &pos;
}
ret = vfs_read(f.file, buf, count, ppos);
if (ret >= 0 && ppos)
f.file->f_pos = pos;
fdput_pos(f); fdput_pos(f);
} }
return ret; return ret;
...@@ -603,14 +596,7 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count) ...@@ -603,14 +596,7 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
ssize_t ret = -EBADF; ssize_t ret = -EBADF;
if (f.file) { if (f.file) {
loff_t pos, *ppos = file_ppos(f.file); ret = vfs_write(f.file, buf, count, file_ppos(f.file));
if (ppos) {
pos = *ppos;
ppos = &pos;
}
ret = vfs_write(f.file, buf, count, ppos);
if (ret >= 0 && ppos)
f.file->f_pos = pos;
fdput_pos(f); fdput_pos(f);
} }
...@@ -1026,14 +1012,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec, ...@@ -1026,14 +1012,7 @@ static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
ssize_t ret = -EBADF; ssize_t ret = -EBADF;
if (f.file) { if (f.file) {
loff_t pos, *ppos = file_ppos(f.file); ret = vfs_readv(f.file, vec, vlen, file_ppos(f.file), flags);
if (ppos) {
pos = *ppos;
ppos = &pos;
}
ret = vfs_readv(f.file, vec, vlen, ppos, flags);
if (ret >= 0 && ppos)
f.file->f_pos = pos;
fdput_pos(f); fdput_pos(f);
} }
...@@ -1050,14 +1029,7 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec, ...@@ -1050,14 +1029,7 @@ static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
ssize_t ret = -EBADF; ssize_t ret = -EBADF;
if (f.file) { if (f.file) {
loff_t pos, *ppos = file_ppos(f.file); ret = vfs_writev(f.file, vec, vlen, file_ppos(f.file), flags);
if (ppos) {
pos = *ppos;
ppos = &pos;
}
ret = vfs_writev(f.file, vec, vlen, ppos, flags);
if (ret >= 0 && ppos)
f.file->f_pos = pos;
fdput_pos(f); fdput_pos(f);
} }
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment