Commit e16e029a authored by Kirill Smelkov's avatar Kirill Smelkov

wcfs: Handle ZODB invalidations

Use ΔFtail.Track on every READ, and query accumulated ΔFtail upon
receiving ZODB invalidation to query it about which blocks of which
files have been changed. Then invalidate those blocks in OS file cache.

See added documentation to wcfs.go and notes.txt for details.

Now the filesystem is no longer stale: it provides view of data
that is uptodate wrt changes on ZODB storage.

Some preliminary history:

9b4a42a3    X invalidation design draftly settled
27d91d47    X δFtail settled
33e0dfce    X ΔTail draftly done
822366a7    X keeping fd to root opened prevents the filesystem from being unmounted
89ad3a79    X Don't keep ZBigFile activated during whole current transaction
245511ac    X Give pointer on from where to get nxd-fuse.ko
d1cd128c    X Hit FUSE-related deadlock
d134ee44    X FUSE lookup deadlock should be hopefully fixed
0e60e9ff    X wcfs: Don't noise ZWatcher trace logs with "select ..."
bf9a7405    X No longer rely on ZODB cache invariant for invalidations
parent cb14b213
==============================================
Additional notes to documentation in wcfs.go
==============================================
This file contains notes additional to usage documentation and internal
organization overview in wcfs.go .
Notes on OS pagecache control
=============================
The cache of snapshotted bigfile can be pre-made hot if invalidated region
was already in pagecache of head/bigfile/file:
- we can retrieve a region from pagecache of head/file with FUSE_NOTIFY_RETRIEVE.
- we can store that retrieved data into pagecache region of @<revX>/ with FUSE_NOTIFY_STORE.
- we can invalidate a region from pagecache of head/file with FUSE_NOTIFY_INVAL_INODE.
we have to disable FUSE_AUTO_INVAL_DATA to tell the kernel we are fully
responsible for invalidating pagecache. If we don't, the kernel will be
clearing whole cache of head/file on e.g. its mtime change.
Note: disabling FUSE_AUTO_INVAL_DATA does not fully prevent kernel from automatically
invalidating pagecache - e.g. it will invalidate whole cache on file size changes:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fuse/inode.c?id=e0bc833d10#n233
It was hoped that we could workaround it with using writeback mode (see !is_wb
in the link above), but it turned out that in writeback mode the kernel indeed
does not invalidate data cache on file size change, but neither it allows the
filesystem to set the size due to external event (see https://git.kernel.org/linus/8373200b12
"fuse: Trust kernel i_size only"). This prevents us to use writeback workaround
as we cannot even update the file from being empty to have some data.
-> we did the patch for FUSE to have proper flag for filesystem server to tell
the kernel it is fully responsible for invalidating pagecache. The patch is
part of Linux 5.2:
https://git.kernel.org/linus/ad2ba64dd489
Kernel locks page on read/cache store/... - we have to be careful not to deadlock
=================================================================================
The kernel, when doing FUSE operations, locks corresponding pages. For example
it locks a page, where it is going to read data into, before issuing FUSE read
request. Correspondingly, on e.g. cache store, the kernel also locks page where
data has to be stored.
It is easy to deadlock if we don't take this locks into account. For example
if we try to upload data to kernel pagecache from under serving read request,
this can deadlock.
Another case that needs to be cared about is interaction between uploadBlk and
zwatcher: zheadMu being RWMutex, does not allow new RLocks to be taken once
Lock request has been issued. Thus the following scenario is possible::
uploadBlk os.Read zwatcher
page.Lock
zheadMu.Rlock
zheadMu.Lock
page.Lock
zheadMu.Rlock
- zwatcher is waiting for uploadBlk to release zheadMu;
- uploadBlk is waiting for os.Read to release page;
- os.Read is waiting for zwatcher to release zheadMu;
- deadlock.
To avoid such deadlocks zwatcher asks OS cache uploaders to pause while it is
running, and retries taking zheadMu.Lock until all uploaders are indeed paused.
This diff is collapsed.
...@@ -40,7 +40,7 @@ from errno import EINVAL, ENOTCONN ...@@ -40,7 +40,7 @@ from errno import EINVAL, ENOTCONN
from resource import setrlimit, getrlimit, RLIMIT_MEMLOCK from resource import setrlimit, getrlimit, RLIMIT_MEMLOCK
from golang import go, chan, select, func, defer, b from golang import go, chan, select, func, defer, b
from golang import context, time from golang import context, time
from zodbtools.util import ashex as h from zodbtools.util import ashex as h, fromhex
import pytest; xfail = pytest.mark.xfail import pytest; xfail = pytest.mark.xfail
from pytest import raises, fail from pytest import raises, fail
from wendelin.wcfs.internal import io, mm from wendelin.wcfs.internal import io, mm
...@@ -299,6 +299,7 @@ def timeout(parent=context.background()): # -> ctx ...@@ -299,6 +299,7 @@ def timeout(parent=context.background()): # -> ctx
# DF represents a change in files space. # DF represents a change in files space.
# it corresponds to ΔF in wcfs.go .
class DF: class DF:
# .rev tid # .rev tid
# .byfile {} ZBigFile -> DFile # .byfile {} ZBigFile -> DFile
...@@ -307,6 +308,7 @@ class DF: ...@@ -307,6 +308,7 @@ class DF:
dF.byfile = {} dF.byfile = {}
# DFile represents a change to one file. # DFile represents a change to one file.
# it is similar to ΔFile in wcfs.go .
class DFile: class DFile:
# .rev tid # .rev tid
# .ddata {} blk -> data # .ddata {} blk -> data
...@@ -415,6 +417,10 @@ class tDB(tWCFS): ...@@ -415,6 +417,10 @@ class tDB(tWCFS):
t.tail = t.root._p_jar.db().storage.lastTransaction() t.tail = t.root._p_jar.db().storage.lastTransaction()
t.dFtail = [] # of DF; head = dFtail[-1].rev t.dFtail = [] # of DF; head = dFtail[-1].rev
# fh(.wcfs/zhead) + history of zhead read from there
t._wc_zheadfh = open(t.wc.mountpoint + "/.wcfs/zhead")
t._wc_zheadv = []
# tracked opened tFiles # tracked opened tFiles
t._files = set() t._files = set()
...@@ -443,6 +449,7 @@ class tDB(tWCFS): ...@@ -443,6 +449,7 @@ class tDB(tWCFS):
for tf in t._files.copy(): for tf in t._files.copy():
tf.close() tf.close()
assert len(t._files) == 0 assert len(t._files) == 0
t._wc_zheadfh.close()
# open opens wcfs file corresponding to zf@at and starts to track it. # open opens wcfs file corresponding to zf@at and starts to track it.
# see returned tFile for details. # see returned tFile for details.
...@@ -521,15 +528,18 @@ class tDB(tWCFS): ...@@ -521,15 +528,18 @@ class tDB(tWCFS):
# _wcsync makes sure wcfs is synchronized to latest committed transaction. # _wcsync makes sure wcfs is synchronized to latest committed transaction.
def _wcsync(t): def _wcsync(t):
# XXX stub: unmount/remount + close/reopen files until wcfs supports invalidations while len(t._wc_zheadv) < len(t.dFtail):
files = t._files.copy() l = t._wc_zheadfh.readline()
for tf in files: #print('> zhead read: %r' % l)
tf.close() l = l.rstrip('\n')
tWCFS.close(t) wchead = tAt(t, fromhex(l))
tWCFS.__init__(t) i = len(t._wc_zheadv)
for tf in files: if wchead != t.dFtail[i].rev:
tf.__init__(t, tf.zf, tf.at) raise RuntimeError("wcsync #%d: wczhead (%s) != zhead (%s)" % (i, wchead, t.dFtail[i].rev))
assert len(t._files) == len(files) t._wc_zheadv.append(wchead)
# head/at = last txn of whole db
assert t.wc._read("head/at") == h(t.head)
# tFile provides testing environment for one bigfile opened on wcfs. # tFile provides testing environment for one bigfile opened on wcfs.
...@@ -763,7 +773,7 @@ def _blkDataAt(t, zf, blk, at): # -> (data, rev) ...@@ -763,7 +773,7 @@ def _blkDataAt(t, zf, blk, at): # -> (data, rev)
# ---- actual tests to access data ---- # ---- actual tests to access data ----
# exercise wcfs functionality # exercise wcfs functionality
# plain data access. # plain data access + wcfs handling of ZODB invalidations.
@func @func
def test_wcfs_basic(): def test_wcfs_basic():
t = tDB(); zf = t.zfile t = tDB(); zf = t.zfile
...@@ -783,20 +793,20 @@ def test_wcfs_basic(): ...@@ -783,20 +793,20 @@ def test_wcfs_basic():
at1 = t.commit(zf, {2:'c1'}) at1 = t.commit(zf, {2:'c1'})
f.assertCache([0,0,0]) # initially not cached f.assertCache([0,0,0]) # initially not cached
f.assertData (['','','c1']) # TODO + mtime=t.head f.assertData (['','','c1'], mtime=t.head)
# >>> (@at2) commit again -> we can see both latest and snapshotted states # >>> (@at2) commit again -> we can see both latest and snapshotted states
# NOTE blocks e(4) and f(5) will be accessed only in the end # NOTE blocks e(4) and f(5) will be accessed only in the end
at2 = t.commit(zf, {2:'c2', 3:'d2', 5:'f2'}) at2 = t.commit(zf, {2:'c2', 3:'d2', 5:'f2'})
# f @head # f @head
#f.assertCache([1,1,0,0,0,0]) TODO enable after wcfs supports invalidations f.assertCache([1,1,0,0,0,0])
f.assertData (['','', 'c2', 'd2', 'x','x']) # TODO + mtime=t.head f.assertData (['','', 'c2', 'd2', 'x','x'], mtime=t.head)
f.assertCache([1,1,1,1,0,0]) f.assertCache([1,1,1,1,0,0])
# f @at1 # f @at1
f1 = t.open(zf, at=at1) f1 = t.open(zf, at=at1)
#f1.assertCache([0,0,1]) TODO enable after wcfs supports invalidations f1.assertCache([0,0,1])
f1.assertData (['','','c1']) # TODO + mtime=at1 f1.assertData (['','','c1']) # TODO + mtime=at1
...@@ -804,22 +814,59 @@ def test_wcfs_basic(): ...@@ -804,22 +814,59 @@ def test_wcfs_basic():
f2 = t.open(zf, at=at2) f2 = t.open(zf, at=at2)
at3 = t.commit(zf, {0:'a3', 2:'c3', 5:'f3'}) at3 = t.commit(zf, {0:'a3', 2:'c3', 5:'f3'})
#f.assertCache([0,1,0,1,0,0]) TODO enable after wcfs supports invalidations f.assertCache([0,1,0,1,0,0])
# f @head is opened again -> cache must not be lost
f_ = t.open(zf)
f_.assertCache([0,1,0,1,0,0])
f_.close()
f.assertCache([0,1,0,1,0,0])
# f @head # f @head
#f.assertCache([0,1,0,1,0,0]) TODO enable after wcfs supports invalidations f.assertCache([0,1,0,1,0,0])
f.assertData (['a3','','c3','d2','x','x']) # TODO + mtime=t.head f.assertData (['a3','','c3','d2','x','x'], mtime=t.head)
# f @at2 # f @at2
# NOTE f(2) is accessed but via @at/ not head/ ; f(2) in head/zf remains unaccessed # NOTE f(2) is accessed but via @at/ not head/ ; f(2) in head/zf remains unaccessed
#f2.assertCache([0,0,1,0,0,0]) TODO enable after wcfs supports invalidations f2.assertCache([0,0,1,0,0,0])
f2.assertData (['','','c2','d2','','f2']) # TODO mtime=at2 f2.assertData (['','','c2','d2','','f2']) # TODO mtime=at2
# f @at1 # f @at1
#f1.assertCache([1,1,1]) TODO enable after wcfs supports invalidations f1.assertCache([1,1,1])
f1.assertData (['','','c1']) # TODO mtime=at1 f1.assertData (['','','c1']) # TODO mtime=at1
# >>> f close / open again -> cache must not be lost
# XXX a bit flaky since OS can evict whole f cache under pressure
f.assertCache([1,1,1,1,0,0])
f.close()
f = t.open(zf)
if f.cached() != [1,1,1,1,0,0]:
assert sum(f.cached()) > 4*1/2 # > 50%
# verify all blocks
f.assertData(['a3','','c3','d2','','f3'])
f.assertCache([1,1,1,1,1,1])
# verify how wcfs processes ZODB invalidations when hole becomes a block with data.
@func
def test_wcfs_basic_hole2zblk():
t = tDB(); zf = t.zfile
defer(t.close)
f = t.open(zf)
t.commit(zf, {2:'c1'}) # b & a are holes
f.assertCache([0,0,0])
f.assertData(['','','c1'])
t.commit(zf, {1:'b2'}) # hole -> zblk
f.assertCache([1,0,1])
f.assertData(['','b2','c1'])
# TODO ZBlk copied from blk1 -> blk2 ; for the same file and for file1 -> file2
# TODO ZBlk moved from blk1 -> blk2 ; for the same file and for file1 -> file2
# verify that read after file size returns (0, ok) # verify that read after file size returns (0, ok)
# (the same behaviour as on e.g. ext4 and as requested by posix) # (the same behaviour as on e.g. ext4 and as requested by posix)
@func @func
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment