• Kirill Smelkov's avatar
    vfs: use &file->f_pos directly on files that have position · 0192236d
    Kirill Smelkov authored
    Long ago vfs read/write operations were passing ppos=&file->f_pos directly to
    .read / .write file_operations methods. That changed in 2004 in 55f09ec0
    ("read/write: pass down a copy of f_pos, not f_pos itself.") which started to
    pass ppos=&local_var trying to avoid simultaneous read/write/lseek stepping
    onto each other toes and overwriting file->f_pos racily. That measure was not
    complete and in 2014 commit 9c225f26 ("vfs: atomic f_pos accesses as per
    POSIX") added file->f_pos_lock to completely disable simultaneous
    read/write/lseek runs. After f_pos_lock was introduced the reason to avoid
    passing ppos=&file->f_pos directly due to concurrency vanished.
    
    Linus explains[1]:
    
    	In fact, we *used* to (long ago) pass in the address of "file->f_pos"
    	itself to the low-level read/write routines. We then changed it to do
    	that indirection through a local copy of pos (and
    	file_pos_read/file_pos_write) because we didn't do the proper locking,
    	so different read/write versions could mess with each other (and with
    	lseek).
    
    	But one of the things that commit 9c225f26 ("vfs: atomic f_pos
    	accesses as per POSIX") did was to add the proper locking at least for
    	the cases that we care about deeply, so we *could* say that we have
    	three cases:
    
    	 - FMODE_ATOMIC_POS: properly locked,
    	 - FMODE_STREAM: no pos at all
    	 - otherwise a "mostly don't care - don't mix!"
    
    	and so we could go back to not copying the pos at all, and instead do
    	something like
    
    	        loff_t *ppos = f.file->f_mode & FMODE_STREAM ? NULL : &file->f_pos;
    	        ret = vfs_write(f.file, buf, count, ppos);
    
    	and perhaps have a long-term plan to try to get rid of the "don't mix"
    	case entirely (ie "if you use f_pos, then we'll do the proper
    	locking")
    
    	(The above is obviously surrounded by the fdget_pos()/fdput_pos() that
    	implements the locking decision).
    
    Currently for regular files we always set FMODE_ATOMIC_POS and change that to
    FMODE_STREAM if stream_open is used explicitly on open. That leaves other
    files, like e.g. sockets and pipes, for "mostly don't care - don't mix!" case.
    
    Sockets, for example, always check that on read/write the initial pos they
    receive is 0 and don't update it. And if it is !0 they return -ESPIPE.
    That suggests that we can do the switch into passing &file->f_pos directly now
    and incrementally convert to FMODE_STREAM files that were doing the stream-like
    checking manually in their low-level .read/.write handlers.
    
    Note: it is theoretically possible that a driver updates *ppos inside
    even if read/write returns error. For such cases the conversion will
    change IO semantic a bit. The semantic that is changing here was
    introduced in 2013 in commit 5faf153e "don't call file_pos_write()
    if vfs_{read,write}{,v}() fails".
    
    [1] https://lore.kernel.org/linux-fsdevel/CAHk-=whJtZt52SnhBGrNMnuxFn3GE9X_e02x8BPxtkqrfyZukw@mail.gmail.com/Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Kirill Smelkov's avatarKirill Smelkov <kirr@nexedi.com>
    0192236d
read_write.c 49.5 KB