- 12 Sep, 2014 6 commits
-
-
Christoph Hellwig authored
Factor out a helper for all per-extent work, and merge the now trivial functions for lseg allocation and parsing. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
This isn't device(id) related, so move it into the main file. Simple move for now, the next commit will clean it up a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Instead of overflowing the XDR send buffer with our extent list allocate pages and pre-encode the layoutupdate payload into them. We optimistically allocate a single page use alloc_page and only switch to vmalloc when we have more extents outstanding. Currently there is only a single testcase (xfstests generic/113) which can reproduce large enough extent lists for this to occur. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
The current GETDEVICELIST implementation is buggy in that it doesn't handle cursors correctly, and in that it returns an error if the server returns NFSERR_NOTSUPP. Given that there is no actual need for GETDEVICELIST, it has various issues and might get removed for NFSv4.2 stop using it in the blocklayout driver, and thus the Linux NFS client as whole. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
The kbuild test robot complained about a new sparse warning in objio_alloc_deviceid_node, but it turns out that this was just a moved reference to an existing variable. Fix it to have the right big endian annotated type. Note that there are some other endianess issues in this file that I didn't bother to sort out as they involve global headers. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
The kbuild test robot complained that we got the printk format wrong. Let's just kill these printks instead of fixing them as there is not point after the initial tree algorithm debugging. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
- 10 Sep, 2014 31 commits
-
-
Jeff Layton authored
To make sparse happy... Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Jeff Layton authored
sparse says: fs/nfs/file.c:543:60: warning: incorrect type in argument 1 (different address spaces) fs/nfs/file.c:543:60: expected struct rpc_xprt *xprt fs/nfs/file.c:543:60: got struct rpc_xprt [noderef] <asn:4>*cl_xprt fs/nfs/file.c:548:53: warning: incorrect type in argument 1 (different address spaces) fs/nfs/file.c:548:53: expected struct rpc_xprt *xprt fs/nfs/file.c:548:53: got struct rpc_xprt [noderef] <asn:4>*cl_xprt cl_xprt is RCU-managed, so we need to take care to dereference and use it while holding the RCU read lock. Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
The VFS never calls setattr with ATTR_SIZE on anything but regular files. Remove the if check and turn it into an assert similar to what some other file systems do. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
This will be used by the block layout driver when splitting extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
At a simple helper to issue a GETDEVICELIST operation and pre-load the device id cache based on the result. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Add support to the common pNFS core to issue GETDEVICEINFO calls on a device ID cache miss. The code is taken from the well debugged file layout implementation and calls out to the layoutdriver through a new alloc_deviceid_node method. The calling conventions for nfs4_find_get_deviceid are changed so that all information needed to send a GETDEVICEINFO request is passed to the common code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
This speads up truncate-heavy workloads like fsx by multiple orders of magnitude. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
This allows removing extents from the extent tree especially on truncate operations, and thus fixing reads from truncated and re-extended that previously returned stale data. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Currently the block layout driver tracks extents in three separate data structures: - the two list of pnfs_block_extent structures returned by the server - the list of sectors that were in invalid state but have been written to - a list of pnfs_block_short_extent structures for LAYOUTCOMMIT All of these share the property that they are not only highly inefficient data structures, but also that operations on them are even more inefficient than nessecary. In addition there are various implementation defects like: - using an int to track sectors, causing corruption for large offsets - incorrect normalization of page or block granularity ranges - insufficient error handling - incorrect synchronization as extents can be modified while they are in use This patch replace all three data with a single unified rbtree structure tracking all extents, as well as their in-memory state, although we still need to instance for read-only and read-write extent due to the arcane client side COW feature in the block layouts spec. To fix the problem of extent possibly being modified while in use we make sure to return a copy of the extent for use in the write path - the extent can only be invalidated by a layout recall or return which has to wait until the I/O operations finished due to refcounts on the layout segment. The new extent tree work similar to the schemes used by block based filesystems like XFS or ext4. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
The core nfs code handles setting pages uptodate on reads, no need to mess with the pageflags outselves. Also remove a debug function to dump page flags. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Use the new PNFS_READ_WHOLE_PAGE flag to offload read-modify-write handling to core nfs code, and remove a huge chunk of deadlock prone mess from the block layout writeback path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
If a layout driver keeps per-inode state outside of the layout segments it needs to be notified of any layout returns or recalls on an inode, and not just about the freeing of layout segments. Add a method to acomplish this, which will allow the block layout driver to handle the case of truncated and re-expanded files properly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Like all block based filesystems, the pNFS block layout driver can't read or write at a byte granularity and thus has to perform read-modify-write cycles on writes smaller than this granularity. Add a flag so that the core NFS code always reads a whole page when starting a smaller write, so that we can do it in the place where the VFS expects it instead of doing in very deadlock prone way in the writeback handler. Note that in theory we could do less than page size reads here for disks that have a smaller sector size which are served by a server with a smaller pnfs block size. But so far that doesn't seem like a worthwhile optimization. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Expedite layout recall processing by forcing a layout commit when we see busy segments. Without it the layout recall might have to wait until the VM decided to start writeback for the file, which can introduce long delays. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Trond Myklebust authored
gcc reports: linux/fs/nfs/write.c: In function ‘nfs_page_find_head_request_locked.isra.17’: linux/fs/nfs/write.c:121:64: warning: ‘cinfo.mds’ may be used uninitialized in this function [-Wmaybe-uninitialized] list_for_each_entry_safe(freq, t, &cinfo.mds->list, wb_list) { ^ linux/fs/nfs/write.c:110:25: note: ‘cinfo.mds’ was declared here struct nfs_commit_info cinfo; Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com> Cc: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
When we do non-page sized reads we can underflow the extent_length variable and read incorrect data. Fix the extent_length calculation and change to defensive <= checks for the extent length in the read and write path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Make sure the block queue is plugged when performing pNFS blocklayout I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Tell userspace what stage of GETDEVICEINFO failed so that there is a chance to debug it, especially with the userspace daemon clusterf***k in the block layout driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
The Linux VM subsystem can't support block sizes larger than page size for block based filesystems very well. While this can be hacked around to some extent for simple filesystems the read-modify-write cycles required for pnfs block invalid extents are extremly deadlock prone when operating on multiple pages. Reject this case early on instead of pretending to support it (badly). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Currently there is no XDR buffer space allocated for the per-layout driver layoutcommit payload, which leads to server buffer overflows in the blocklayout driver even under simple workloads. As we can't do per-layout sizes for XDR operations we'll have to splice a previously encoded list of pages into the XDR stream, similar to how we handle ACL buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
After we issued a layoutreturn operations the may free the layout stateid and will thus cause bad stateid error when the client uses it again. We currently try to avoid this case by chosing the open stateid if not lsegs are present for this inode. But various places can hold refererence on lsegs and thus cause the list not to be empty shortly after a layout return. Add an explicit flag to mark the current layout stateid invalid and force usage of the openstateid after we did a full file layoutreturn. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Currently we fall through to nfs4_async_handle_error when we get a bad stateid error back from layoutget. nfs4_async_handle_error with a NULL state argument will never retry the operations but return the error to higher layer, causing an avoiable fallback to MDS I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
When layoutget returns an entirely new layout stateid it should not check the generation counter as the new stateid will start with a new counter entirely unrelated to old one. The current behavior causes constant layoutget failures against a block server which allocates a new stateid after an recall that removed all outstanding layouts. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
Ensure the lsegs are initialized early so that we don't pass an unitialized one back to ->free_lseg during error processing. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Christoph Hellwig authored
pNFS servers may return arbitrarily large layouts. Trim back the I/O size to one that we can at least allocate the page array for. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Peng Tao authored
Following http://www.rfc-editor.org/errata_search.php?rfc=5661&eid=2751 Don't set layoutcommit for commit_through_mds case. For FILE_SYNC writes, don't set layoutcommit. For DATA_SYNC wirtes, set layout commit right after wirtes done. For UNSTABLE writes, set layout commit when commit done. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Peng Tao authored
Track lwb in nfs_commit_data so that we can use it to setup layoutcommit in commit_done callback. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Anna Schumaker authored
can_open_cached() reads values out of the state structure, meaning that we need the so_lock to have a correct return value. As a bonus, this helps clear up some potentially confusing code. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Chris Perl authored
When attempting to establish a local ephemeral endpoint for a TCP or UDP socket, do not explicitly call bind, instead let it happen implicilty when the socket is first used. The main motivating factor for this change is when TCP runs out of unique ephemeral ports (i.e. cannot find any ephemeral ports which are not a part of *any* TCP connection). In this situation if you explicitly call bind, then the call will fail with EADDRINUSE. However, if you allow the allocation of an ephemeral port to happen implicitly as part of connect (or other functions), then ephemeral ports can be reused, so long as the combination of (local_ip, local_port, remote_ip, remote_port) is unique for TCP sockets on the system. This doesn't matter for UDP sockets, but it seemed easiest to treat TCP and UDP sockets the same. This can allow mount.nfs(8) to continue to function successfully, even in the face of misbehaving applications which are creating a large number of TCP connections. Signed-off-by: Chris Perl <chris.perl@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Weston Andros Adamson authored
filelayout_retry_commit was recently split out from alloc_ds_commits, but was done in such a way that the bucket pointer always starts at index 0 no matter what the @idx argument is set to. The intention of the @idx argument is to retry commits starting at bucket @idx. This is called when alloc_ds_commits fails for a bucket. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
- 09 Sep, 2014 1 commit
-
-
Jeff Layton authored
This reverts commit 49a4bda2. Christoph reported an oops due to the above commit: generic/089 242s ...[ 2187.041239] general protection fault: 0000 [#1] SMP [ 2187.042899] Modules linked in: [ 2187.044000] CPU: 0 PID: 11913 Comm: kworker/0:1 Not tainted 3.16.0-rc6+ #1151 [ 2187.044287] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 2187.044287] Workqueue: nfsiod free_lock_state_work [ 2187.044287] task: ffff880072b50cd0 ti: ffff88007a4ec000 task.ti: ffff88007a4ec000 [ 2187.044287] RIP: 0010:[<ffffffff81361ca6>] [<ffffffff81361ca6>] free_lock_state_work+0x16/0x30 [ 2187.044287] RSP: 0018:ffff88007a4efd58 EFLAGS: 00010296 [ 2187.044287] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007a947ac0 RCX: 8000000000000000 [ 2187.044287] RDX: ffffffff826af9e0 RSI: ffff88007b093c00 RDI: ffff88007b093db8 [ 2187.044287] RBP: ffff88007a4efd58 R08: ffffffff832d3e10 R09: 000001c40efc0000 [ 2187.044287] R10: 0000000000000000 R11: 0000000000059e30 R12: ffff88007fc13240 [ 2187.044287] R13: ffff88007fc18b00 R14: ffff88007b093db8 R15: 0000000000000000 [ 2187.044287] FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000 [ 2187.044287] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 2187.044287] CR2: 00007f93ec33fb80 CR3: 0000000079dc2000 CR4: 00000000000006f0 [ 2187.044287] Stack: [ 2187.044287] ffff88007a4efdd8 ffffffff810cc877 ffffffff810cc80d ffff88007fc13258 [ 2187.044287] 000000007a947af0 0000000000000000 ffffffff8353ccc8 ffffffff82b6f3d0 [ 2187.044287] 0000000000000000 ffffffff82267679 ffff88007a4efdd8 ffff88007fc13240 [ 2187.044287] Call Trace: [ 2187.044287] [<ffffffff810cc877>] process_one_work+0x1c7/0x490 [ 2187.044287] [<ffffffff810cc80d>] ? process_one_work+0x15d/0x490 [ 2187.044287] [<ffffffff810cd569>] worker_thread+0x119/0x4f0 [ 2187.044287] [<ffffffff810fbbad>] ? trace_hardirqs_on+0xd/0x10 [ 2187.044287] [<ffffffff810cd450>] ? init_pwq+0x190/0x190 [ 2187.044287] [<ffffffff810d3c6f>] kthread+0xdf/0x100 [ 2187.044287] [<ffffffff810d3b90>] ? __init_kthread_worker+0x70/0x70 [ 2187.044287] [<ffffffff81d9873c>] ret_from_fork+0x7c/0xb0 [ 2187.044287] [<ffffffff810d3b90>] ? __init_kthread_worker+0x70/0x70 [ 2187.044287] Code: 0f 1f 44 00 00 31 c0 5d c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 8d b7 48 fe ff ff 48 8b 87 58 fe ff ff 48 89 e5 48 8b 40 30 <48> 8b 00 48 8b 10 48 89 c7 48 8b 92 90 03 00 00 ff 52 28 5d c3 [ 2187.044287] RIP [<ffffffff81361ca6>] free_lock_state_work+0x16/0x30 [ 2187.044287] RSP <ffff88007a4efd58> [ 2187.103626] ---[ end trace 0f11326d28e5d8fa ]--- The original reason for this patch was because the fl_release_private operation couldn't sleep. With commit ed9814d8 (locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped), this is no longer a problem so we can revert this patch. Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
- 08 Sep, 2014 1 commit
-
-
Cong Wang authored
I saw the following kernel warning: [ 1852.321222] ------------[ cut here ]------------ [ 1852.326527] WARNING: CPU: 0 PID: 118 at fs/proc/generic.c:521 remove_proc_entry+0x154/0x16b() [ 1852.335630] remove_proc_entry: removing non-empty directory 'fs/nfsfs', leaking at least 'volumes' [ 1852.344084] CPU: 0 PID: 118 Comm: kworker/u8:2 Not tainted 3.16.0+ #540 [ 1852.350036] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 1852.354992] Workqueue: netns cleanup_net [ 1852.358701] 0000000000000000 ffff880116f2fbd0 ffffffff819c03e9 ffff880116f2fc18 [ 1852.366474] ffff880116f2fc08 ffffffff810744ee ffffffff811e0e6e ffff8800d4e96238 [ 1852.373507] ffffffff81dbe665 ffff8800d46a5948 0000000000000005 ffff880116f2fc68 [ 1852.380224] Call Trace: [ 1852.381976] [<ffffffff819c03e9>] dump_stack+0x4d/0x66 [ 1852.385495] [<ffffffff810744ee>] warn_slowpath_common+0x7a/0x93 [ 1852.389869] [<ffffffff811e0e6e>] ? remove_proc_entry+0x154/0x16b [ 1852.393987] [<ffffffff8107457b>] warn_slowpath_fmt+0x4c/0x4e [ 1852.397999] [<ffffffff811e0e6e>] remove_proc_entry+0x154/0x16b [ 1852.402034] [<ffffffff8129c73d>] nfs_fs_proc_net_exit+0x53/0x56 [ 1852.406136] [<ffffffff812a103b>] nfs_net_exit+0x12/0x1d [ 1852.409774] [<ffffffff81785bc9>] ops_exit_list+0x44/0x55 [ 1852.413529] [<ffffffff81786389>] cleanup_net+0xee/0x182 [ 1852.417198] [<ffffffff81088c9e>] process_one_work+0x209/0x40d [ 1852.502320] [<ffffffff81088bf7>] ? process_one_work+0x162/0x40d [ 1852.587629] [<ffffffff810890c1>] worker_thread+0x1f0/0x2c7 [ 1852.673291] [<ffffffff81088ed1>] ? process_scheduled_works+0x2f/0x2f [ 1852.759470] [<ffffffff8108e079>] kthread+0xc9/0xd1 [ 1852.843099] [<ffffffff8109427f>] ? finish_task_switch+0x3a/0xce [ 1852.926518] [<ffffffff8108dfb0>] ? __kthread_parkme+0x61/0x61 [ 1853.008565] [<ffffffff819cbeac>] ret_from_fork+0x7c/0xb0 [ 1853.076477] [<ffffffff8108dfb0>] ? __kthread_parkme+0x61/0x61 [ 1853.140653] ---[ end trace 69c4c6617f78e32d ]--- It looks wrong that we add "/proc/net/nfsfs" in nfs_fs_proc_net_init() while remove "/proc/fs/nfsfs" in nfs_fs_proc_net_exit(). Fixes: commit 65b38851 (NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes) Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Dan Aloni <dan@kernelim.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> [Trond: replace uses of remove_proc_entry() with remove_proc_subtree() as suggested by Al Viro] Cc: stable@vger.kernel.org # 3.4.x : 65b38851: NFS: Fix /proc/fs/nfsfs/servers Cc: stable@vger.kernel.org # 3.4.x Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
- 07 Sep, 2014 1 commit
-
-
Linus Torvalds authored
-