Commit 1ffd04a5 authored by Kirill Smelkov

.

parent 0ec8ce51
- doc: notes on how things are organized in wendelin.core 2
wcfs:
- SIGSEGV is used only to track writes
==============================================
Additional notes to documentation in wcfs.go
==============================================
This file contains notes additional to the usage documentation and internal
organization overview in wcfs.go.
Notes on OS pagecache control
=============================
The cache of a snapshotted bigfile can be pre-made hot if the invalidated
region was already in the pagecache of head/bigfile/file:

- we can retrieve a region from the pagecache of head/file with FUSE_NOTIFY_RETRIEVE,
- we can store that retrieved data into the pagecache region of @<revX>/ with FUSE_NOTIFY_STORE, and
- we can invalidate a region from the pagecache of head/file with FUSE_NOTIFY_INVAL_INODE.
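
With a go-fuse-like server API the sequence could look roughly as in the
sketch below. The method names (InodeRetrieveCache, InodeNotifyStoreCache,
InodeNotify) and all variables are assumptions for illustration, not the
actual wcfs code::

    // pre-make @<revX>/file cache hot for the invalidated region by copying
    // still-valid data out of head/file pagecache before dropping it there.
    buf := make([]byte, size)
    n, st := srv.InodeRetrieveCache(headIno, off, buf)  // FUSE_NOTIFY_RETRIEVE
    if st == fuse.OK && n > 0 {
        srv.InodeNotifyStoreCache(revIno, off, buf[:n]) // FUSE_NOTIFY_STORE
    }
    srv.InodeNotify(headIno, off, size)                 // FUSE_NOTIFY_INVAL_INODE
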
We have to disable FUSE_AUTO_INVAL_DATA to tell the kernel that we are fully
responsible for invalidating the pagecache. If we don't, the kernel will be
clearing the whole cache of head/file on e.g. a change of its mtime.

Note: disabling FUSE_AUTO_INVAL_DATA does not fully prevent the kernel from
automatically invalidating the pagecache - e.g. it will still invalidate the
whole cache on file size changes:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/inode.c?id=e0bc833d10#n233
It was hoped that we could work around this by using writeback mode (see
!is_wb in the link above), but it turned out that in writeback mode the kernel
indeed does not invalidate the data cache on file size change, but neither
does it allow the filesystem to set the size due to an external event (see
https://git.kernel.org/linus/8373200b12 "fuse: Trust kernel i_size only").
This prevents us from using the writeback workaround, as we cannot even update
the file from being empty to having some data.

-> we made a patch for FUSE that adds a proper flag with which the filesystem
server tells the kernel it is fully responsible for invalidating the
pagecache. The patch is part of Linux 5.2:

https://git.kernel.org/linus/ad2ba64dd489
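
With that flag a filesystem server can opt in at mount time. Below is a sketch
assuming go-fuse-style mount options; the ExplicitDataCacheControl field name
is an assumption, not a claim about the exact API::

    opts := &fuse.MountOptions{
        // ask the kernel for FUSE_EXPLICIT_INVAL_DATA (Linux ≥ 5.2): only the
        // filesystem decides when pagecache of head/file is dropped.
        ExplicitDataCacheControl: true,
    }
    srv, err := fuse.NewServer(fsRoot, mountpoint, opts)
    if err != nil {
        log.Fatal(err)
    }
    go srv.Serve()
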
Invalidations to wcfs clients are delayed until block access
============================================================
Initially it was planned that wcfs would send invalidation messages to its
clients right after receiving an invalidation message from ZODB at
transaction-boundary time. That simplifies the logic, but requires that, for a
particular file, wcfs sends to clients the whole range of where the file was
changed.

Emitting the whole δR right at transaction-boundary time requires keeping the
whole ZBigFile.blktab index in RAM. Even though from the space point of view
this is somewhat acceptable (~ 0.01% of whole-file data size, i.e. ~ 128MB of
index for ~ 1TB of data), it is not good from the time-overhead point of view -
the initial open of a file would be potentially slow, as a full blktab scan -
including Tree _and_ Bucket nodes - would be required.

-> we took the approach where we send invalidation about a block to the client
lazily, only when the block is actually accessed.
XXX building δFtail lazily while serving FUSE reads within the scope of one
transaction is not trivial and creates concurrency bottlenecks if a simple
locking scheme is used, the main difficulty being to populate the δBtree
tracking set lazily. However, as a first approach, we can still build the
complete tracking set for a BTree at the time of file open: we need to scan
through all trees but _not_ buckets; this way we'll know the oids of all nodes
- trees _and_ buckets - and avoiding the loading of buckets is what makes this
approach practical: with default LOBTree settings (1 bucket = 60 objects,
1 tree = 500 buckets) it requires scanning ~ 20 trees to cover 1TB of data, and
we can scan those trees very quickly even if doing so serially. For 1PB of data
it requires scanning ~ 10⁴ trees; if the RTT to load one object is ~ 1ms, this
becomes ~ 10 seconds if done serially, though loading all those tree objects in
parallel makes it much less. Still, the number of trees to scan is linear in
the amount of data, and it would be good to address the shortcoming of doing a
whole-file index scan later.
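
For a rough sense of these numbers (a back-of-the-envelope estimate; the ~ 2MB
block size is an assumption used only for illustration)::

    1 bucket ≈ 60 entries   → ≈ 60 · 2MB    ≈ 120MB of data per bucket
    1 tree   ≈ 500 buckets  → ≈ 500 · 120MB ≈  60GB of data per tree

    1TB / 60GB ≈ 17        → ~ 20  trees to scan on open
    1PB / 60GB ≈ 1.7·10⁴   → ~ 10⁴ trees to scan on open
                             (~ 10 seconds at ~ 1ms RTT per object, if serial)
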
Changing mmapping while under pagefault is possible
===================================================
We can change a mapping while a page from it is under pagefault:

- the kernel, upon handling a pagefault, queues a read request to the
  filesystem server. As of Linux 4.20 this is done _while_ holding
  client->mm->mmap_sem::
      kprobe:fuse_readpages (client->mm->mmap_sem.count: 1)
          fuse_readpages+1
          read_pages+109
          __do_page_cache_readahead+401
          filemap_fault+635
          __do_fault+31
          __handle_mm_fault+3403
          handle_mm_fault+220
          __do_page_fault+598
          page_fault+30
- however, the read request is queued to be performed asynchronously - the
  kernel does not wait for it in fuse_readpages, because of

  * git.kernel.org/linus/c1aa96a5,
  * git.kernel.org/linus/9cd68455, and
  * go-fuse initially negotiating CAP_ASYNC_READ with the kernel.
- the kernel then _releases_ client->mm->mmap_sem and only then waits for the
  to-be-read pages to become ready:

  * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2411
  * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n2457
  * https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/filemap.c?id=v4.20-rc3-83-g06e68fed3282#n1301
- the filesystem server, upon receiving the read request, can manipulate the
  client's address space. This requires write-locking client->mm->mmap_sem,
  but we can be sure it won't deadlock because the kernel releases it before
  waiting (see the previous point).

  In practice the manipulation is done by another client thread, because on
  Linux it is not possible to change the mm of another process. However, the
  main point is that the manipulation is possible at all, because there will
  be no deadlock on client->mm->mmap_sem.
For reference, here is how the filesystem server reply looks under trace::
    kprobe:fuse_readpages_end
        fuse_readpages_end+1
        request_end+188
        fuse_dev_do_write+1921
        fuse_dev_write+78
        do_iter_readv_writev+325
        do_iter_write+128
        vfs_writev+152
        do_writev+94
        do_syscall_64+85
        entry_SYSCALL_64_after_hwframe+68
And here is a test program that demonstrates that it is possible to change a
mmapping while a pagefault to it is in progress:

https://lab.nexedi.com/kirr/go-fuse/commit/f822c9db

Starting from Linux 5.1 mmap_sem should generally be released while doing any
IO:

https://git.kernel.org/linus/6b4c9f4469

but for older kernels the analysis above remains FUSE-specific.
The property that it is possible to change a mmapping while under pagefault is
verified by the wcfs testsuite in the `test_wcfs_remmap_on_pin` test.
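
For illustration, the remapping itself boils down to an mmap with MAP_FIXED
over the faulting region, issued from another thread of the client. A sketch
(the function and its parameters are assumptions, not the actual wcfs client
code)::

    // remapUnderPagefault atomically replaces the mapping of
    // [addr, addr+length) with file fd mapped at offset off. MAP_FIXED makes
    // the switch valid even while another thread is currently in a pagefault
    // on this address range.
    func remapUnderPagefault(addr uintptr, length, off int64, fd int) error {
        _, _, errno := syscall.Syscall6(syscall.SYS_MMAP,
            addr, uintptr(length),
            uintptr(syscall.PROT_READ),
            uintptr(syscall.MAP_SHARED|syscall.MAP_FIXED),
            uintptr(fd), uintptr(off))
        if errno != 0 {
            return errno
        }
        return nil
    }
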
Client cannot be ptraced while under pagefault
==============================================
We cannot use ptrace to run code in a client thread that is under pagefault:

The kernel sends SIGSTOP to interrupt the tracee, but the signal will be
processed only when the process returns from kernel space, e.g. here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/common.c?id=v4.19-rc8-151-g23469de647c4#n160

This way the tracer won't receive the obligatory notification that the tracee
has stopped (via wait...), and even though ptrace(ATTACH) succeeds, all other
ptrace commands will fail:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n1140
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n207
My original idea was to use ptrace to run code in the client process to change
its memory mappings while the triggering thread is under pagefault/read to
wcfs, and the above shows this won't work - trying to ptrace the client from
under wcfs will just block forever (the kernel will be waiting for the read
operation to finish before acting on the ptrace request, and the read will in
turn be waiting for the ptrace stop to complete = deadlock).
Kernel locks page on read/cache store/... - we have to be careful not to deadlock
=================================================================================
The kernel, when doing FUSE operations, locks the corresponding pages. For
example it locks the page it is going to read data into before issuing the
FUSE read request. Correspondingly, on e.g. a cache store, the kernel also
locks the page where the data has to be stored.

It is easy to deadlock if we don't take these locks into account. For example,
if we try to upload data to the kernel pagecache from inside serving a read
request, this can deadlock.
Another case to be careful about is the interaction between uploadBlk and
zwatcher: zheadMu, being an RWMutex, does not allow new RLocks to be taken
once a Lock request has been issued. Thus the following scenario is possible::
    uploadBlk          os.Read           zwatcher

                       page.Lock
    zheadMu.RLock
                                         zheadMu.Lock
    page.Lock
                       zheadMu.RLock
- zwatcher is waiting for uploadBlk to release zheadMu;
- uploadBlk is waiting for os.Read to release page;
- os.Read is waiting for zwatcher to release zheadMu;
- deadlock.
To avoid such deadlocks zwatcher asks OS cache uploaders to pause while it is
running, and retries taking zheadMu.Lock until all uploaders are indeed paused.
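
A sketch of that pause/retry idea (identifiers such as pauseUpload are
assumptions, not the actual wcfs code)::

    // zwatcher side: ask uploaders to back off, then retry the write lock
    // until every in-flight uploader has released zheadMu.
    pauseUpload.Store(true)          // uploadBlk polls this atomic.Bool and,
                                     // when set, aborts before touching the
                                     // kernel pagecache under zheadMu.RLock
    for !zheadMu.TryLock() {
        time.Sleep(time.Millisecond) // let current RLock holders drain
    }
    // ... process ZODB invalidations under the zheadMu write lock ...
    zheadMu.Unlock()
    pauseUpload.Store(false)
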
digraph {
subgraph {
rank=same;
ordering=in;
wcfs [label="wcfs"]
invProto [label="open/isolation\nprotocol", style=filled fillcolor=grey95]
client [label="client", style=filled fillcolor=grey97]
wcfs -> invProto;
invProto -> client [dir=back]; // XXX = invProto <- client
}
// wcfs -> wcfs_simple;
// wcfs -> Sinvtree;
// wcfs -> δR;
wcfs -> liveCacheControl;
wcfs -> autoexit [color=grey];
wcfs -> wcfsInvProcess;
wcfs -> wcfsRead;
wcfs -> wcfsGC [color=grey];
wcfsInvProcess -> ZODB_go_inv;
wcfsInvProcess -> zconnCacheGet;
wcfsInvProcess -> zobj2file;
wcfsInvProcess -> δFtail;
wcfsInvProcess -> fuseRetrieveCache;
wcfsInvProcess -> _wcfs_zhead;
ZODB_go_inv -> fs1_go_inv;
ZODB_go_inv -> zeo_go_inv;
ZODB_go_inv -> neo_go_inv;
ZODB_go_inv -> zcache_go_inv [style=dashed, color=grey]; // wcfs works without raw cache now
wcfsRead -> blktabGet;
wcfsRead -> δFtail;
wcfsRead -> setupWatch;
wcfsRead -> headWatch;
zobj2file -> zblk2file;
zobj2file -> zbtree2file;
zbtree2file -> δBTree [color=grey];
// wcfs_simple -> Btree_read;
// wcfs_simple -> ZBlk_read;
// wcfs_simple -> autoexit;
client -> wcfsRead;
client -> setupWatch;
client -> clientInvHandle;
// client -> δR;
client -> nowcfs;
// client -> zodburl;
// client -> wcfs_spawn;
clientInvHandle -> headWatch;
headWatch -> fileSock;
_wcfs_zhead -> fileSock;
// Btree_read -> ZODB_read;
// ZBlk_read -> ZODB_read;
// ZODB_read -> ogorek_persref;
// wcfs_simple [label="wcfs no\ninvalidations", style=filled fillcolor=grey95]
// wcfs_spawn [label="spawn wcfs", style=filled fillcolor=lightyellow]
nowcfs [label="!wcfs mode", style=filled fillcolor=grey95]
wcfsInvProcess [label="process\nZODB invalidations", style=filled fillcolor=grey95]
zconnCacheGet [label="zconn.\n.Cache.Get", style=filled fillcolor=lightyellow]
zobj2file [label="Z* → file/[]#blk", style=filled fillcolor=grey95]
zblk2file [label="ZBlk*\n↓\nfile/[]#blk", style=filled fillcolor=lightyellow]
zbtree2file [label="BTree/Bucket\n↓\nfile/[]#blk"]
δBTree [label="δ(BTree)", style=filled fillcolor=grey95]
fuseRetrieveCache [label="FUSE:\nretrieve cache", style=filled fillcolor=lightyellow]
_wcfs_zhead [label=".wcfs/\nzhead", style=filled fillcolor=lightyellow]
wcfsRead [label="read(#blk)", style=filled fillcolor=grey95]
blktabGet [label="blktab.Get(#blk):\nmanually + → ⌈rev(#blk)⌉", style=filled fillcolor=grey95]
δFtail [style=filled fillcolor=lightyellow]
setupWatch [label="watches:\nregister/maint", style=filled fillcolor=grey95]
clientInvHandle [label="process\n#blk invalidations", style=filled fillcolor=grey95]
headWatch [label="#blk ← head/watch", style=filled fillcolor=grey95]
fileSock [label="FileSock", style=filled fillcolor=lightyellow]
ZODB_go_inv [label="ZODB/go\ninvalidations", style=filled fillcolor=grey95]
fs1_go_inv [label="fs1/go\ninvalidations", style=filled fillcolor=lightyellow]
zeo_go_inv [label="zeo/go\ninvalidations", style=filled fillcolor=lightyellow]
neo_go_inv [label="neo/go\ninvalidations"]
zcache_go_inv [label="ZCache/go\n←watchq", color=grey, fontcolor=grey]
// Btree_read [label="BTree read", style=filled fillcolor=lightyellow]
// ZBlk_read [label="ZBigFile / ZBlk* read", style=filled fillcolor=lightyellow]
// ZODB_read [label="ZODB deserialize object", style=filled fillcolor=lightyellow]
// ogorek_persref [label="ogórek:\npersistent references", style=filled fillcolor=lightyellow];
// Sinvtree [label="server: inv. tree"]
// δR [label="δR encoding"]
// zodburl [label="zstor -> zurl", style=filled fillcolor=grey95]
wcfsGC [label="GC\n@rev/"]
liveCacheControl [label="ZODB/go\nLiveCache fix", style=filled fillcolor=grey95]
autoexit [label="autoexit\nif !activity"]
}
@@ -67,74 +67,11 @@ type zBlk interface {
// returns data and revision of ZBlk.
loadBlkData(ctx context.Context) (data []byte, rev zodb.Tid, _ error)
// inΔFtail returns pointer to struct zblkInΔFtail embedded into this ZBlk.
// inΔFtail() *zblkInΔFtail
// XXX kill - in favour of inΔFtail
/*
// bindFile associates ZBlk as being used by file to store block #blk.
//
// A ZBlk may be bound to several blocks inside one file, and to
// several files.
//
// The information is preserved even when ZBlk comes to ghost
// state, but is lost if ZBlk is garbage collected.
//
// it is safe to call multiple bindFile simultaneously.
// it is not safe to call bindFile and boundTo simultaneously.
//
// XXX link to overview.
bindFile(file *BigFile, blk int64)
// XXX unbindFile
// XXX zfile -> bind map for it
// blkBoundTo returns ZBlk association with file(s)/#blk(s).
//
// The association returned is that was previously set by bindFile.
//
// blkBoundTo must not be called simultaneously wrt bindFile.
blkBoundTo() map[*BigFile]SetI64
*/
}
var _ zBlk = (*ZBlk0)(nil)
var _ zBlk = (*ZBlk1)(nil)
// XXX kill
/*
// ---- zBlkBase ----
// zBlkBase provides common functionality to implement ZBlk* -> zfile, #blk binding.
//
// The data stored by zBlkBase is transient - it is _not_ included into
// persistent state.
type zBlkBase struct {
	bindMu sync.Mutex          // used only for binding to support multiple loaders
	infile map[*BigFile]SetI64 // {} file -> set(#blk)
}
// bindFile implements zBlk.
func (zb *zBlkBase) bindFile(file *BigFile, blk int64) {
	zb.bindMu.Lock()
	defer zb.bindMu.Unlock()
	blkmap, ok := zb.infile[file]
	if !ok {
		blkmap = make(SetI64, 1)
		if zb.infile == nil {
			zb.infile = make(map[*BigFile]SetI64)
		}
		zb.infile[file] = blkmap
	}
	blkmap.Add(blk)
}
// blkBoundTo implements zBlk.
func (zb *zBlkBase) blkBoundTo() map[*BigFile]SetI64 {
	return zb.infile
}
*/
// ---- ZBlk0 ----