X wcfs: client: Handle fork

Without special care a forked child may interfere in parent-wcfs exchange via Python GC -> PyFileH.__del__ -> FileH.close -> message to WCFS sent from the child.

This actually happens for real when running test.py/neo-wcfs because NEO test cluster spawns master and storage nodes with just fork without exec.

-> detach from wcfs in child right after fork and deactivate all mappings in order not to provide stale data. See top-level comments added to wcfs/client/wcfs.cpp for details.

Commit 3f83469c authored Oct 30, 2020 by Kirill Smelkov
7 changed files
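
The interference described in the commit message comes from the fact that a forked child shares the parent's open file descriptors. The following Python sketch is only an illustration under assumptions: a plain pipe stands in for the wcfs watchlink, and `demo_shared_fd` is a hypothetical name that is not part of wcfs. It shows that bytes written by the child arrive in the same stream as bytes written by the parent.

```python
import os

def demo_shared_fd():
    """Return what a reader sees when parent and forked child write to one pipe."""
    r, w = os.pipe()              # the pipe stands in for the parent<->server link
    pid = os.fork()
    if pid == 0:
        # child: `w` was inherited and refers to the very same pipe
        os.write(w, b"child-bye\n")
        os._exit(0)               # skip cleanup prepared by the parent
    os.waitpid(pid, 0)            # child has written and exited
    os.write(w, b"parent-msg\n")
    os.close(w)
    data = b""
    while True:                   # read until EOF
        chunk = os.read(r, 4096)
        if not chunk:
            break
        data += chunk
    os.close(r)
    return data
```

The reader cannot tell which process produced which message, which is why a message sent by the child would confuse a request/reply protocol run by the parent.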
--- a/wcfs/client/client_test.py
+++ b/wcfs/client/client_test.py
@@ -29,7 +29,7 @@ from __future__ import print_function, absolute_import
 from golang import func, defer, error, b
 from wendelin.bigfile.file_zodb import ZBigFile
-from wendelin.wcfs.wcfs_test import tDB, tAt
+from wendelin.wcfs.wcfs_test import tDB, tAt, timeout, waitfor_, eprint
 from wendelin.wcfs import wcfs_test
 from wendelin.wcfs.internal.wcfs_test import read_mustfault
 from wendelin.wcfs.internal import mm
@@ -37,6 +37,9 @@ from wendelin.wcfs.internal import mm
 from pytest import raises
 from golang.golang_test import panics
+import os, multiprocessing, gc
 # so that e.g. testdb is set up + ...
 def setup_module():         wcfs_test.setup_module()
 def teardown_module():      wcfs_test.teardown_module()
@@ -275,6 +278,60 @@ def test_wcfs_client_down_efault():
    with raises(error, match=".*: file already closed"): fh1.mmap(2, 3) # fh1 was explicitly closed ^^^
+# verify that on fork client turns all child's wcfs mappings to efault and
+# detaches from wcfs. (else even automatic FileH.__del__ - caused by GC in child
+# - can send message to wcfs server and this way break parent-wcfs exchange).
+@func
+def test_wcfs_client_afterfork():
+    t = tDB(); zf = t.zfile; at0=t.at0
+    defer(t.close)
+    # initial setup
+    at1 = t.commit(zf, {1:'b1', 3:'d1'})
+    wconn = t.wc.connect(at1)
+    defer(wconn.close)
+    fh = wconn.open(zf._p_oid);  defer(fh.close)
+    m  = fh.mmap(0, 4);  tm = tMapping(t, m)
+    tm.assertBlk(0, '',   {})
+    tm.assertBlk(1, 'b1', {})
+    tm.assertBlk(2, '',   {})
+    tm.assertBlk(3, 'd1', {})
+    # fork child and verify that it does not interact with wcfs
+    def forkedchild():
+        tm.assertBlkFaults(0)
+        tm.assertBlkFaults(1)
+        tm.assertBlkFaults(2)
+        tm.assertBlkFaults(3)
+        fh.close()   # must be noop in child
+        gc.collect()
+        os._exit(0)  # NOTE not sys.exit not to execute deferred cleanup prepared by parent
+    p = multiprocessing.Process(target=forkedchild)
+    p.start()
+    if not waitfor_(timeout(), lambda: p.exitcode is not None):
+        eprint("\nC: child stuck")
+        eprint("-> kill it (pid %s) ...\n" % p.pid)
+        p.terminate()
+    p.join()
+    assert p.exitcode == 0
+    # make sure that parent can continue using wcfs normally
+    at2 = t.commit(zf, {1:'b2'})
+    tm.assertBlk(0, '',   {})
+    tm.assertBlk(1, 'b1', {1:at1})  # pinned @at1
+    tm.assertBlk(2, '',   {1:at1})
+    tm.assertBlk(3, 'd1', {1:at1})
+    wconn.resync(at2) # unpins 1 to @head
+    tm.assertBlk(0, '',   {})
+    tm.assertBlk(1, 'b2', {})
+    tm.assertBlk(2, '',   {})
+    tm.assertBlk(3, 'd1', {})
 # TODO try to unit test at wcfs client level wcfs.Mapping with dirty RW page -
 # that it stays in sync with DB after dirty discard.

--- a/wcfs/client/wcfs.cpp
+++ b/wcfs/client/wcfs.cpp
@@ -183,6 +183,25 @@
 // (*) see "Wcfs locking organization" in wcfs.go
 // (%) see related comment in Conn.__pin1 for details.
+// Handling of fork
+//
+// When a process calls fork, OS copies its memory and creates child process
+// with only 1 thread. That child inherits file descriptors and memory mappings
+// from parent. To correctly continue using Conn, FileH and Mappings, the child
+// must recreate pinner thread and reconnect to wcfs via reopened watchlink.
+// The reason here is that without reconnection - by using watchlink file
+// descriptor inherited from parent - the child would interfere into
+// parent-wcfs exchange and neither parent nor child could continue normal
+// protocol communication with WCFS.
+//
+// For simplicity, since fork is seldom used for things besides a followup
+// exec, wcfs client currently takes a straightforward approach: it disables
+// mappings and detaches from the WCFS server in the child right after fork.
+// This ensures that there is no interference with the parent-wcfs exchange
+// should the child decide not to exec and to continue running in the forked
+// thread. Without this protection the interference might come even
+// automatically via e.g.
+// Python GC -> PyFileH.__del__ -> FileH.close -> message to WCFS.
 #include "wcfs_misc.h"
 #include "wcfs.h"
@@ -301,6 +320,10 @@ pair<Conn, error> WCFS::connect(zodb::Tid at) {
    wconn->at       = at;
    wconn->_wlink   = wlink;
+    os::RegisterAfterFork(newref(
+        static_cast<os::_IAfterFork*>( wconn._ptr() )
+    ));
    context::Context pinCtx;
    tie(pinCtx, wconn->_pinCancel) = context::with_cancel(context::background());
    wconn->_pinWG = sync::NewWorkGroup(pinCtx);
@@ -317,6 +340,7 @@ static global<error> errConnClosed = errors::New("connection closed");
 //
 // opened fileh and mappings become invalid to use except close and unmap.
 error _Conn::close() {
+    // NOTE keep in sync with Conn.afterFork
    _Conn& wconn = *this;
    // lock virtmem early. TODO more granular virtmem locking (see __pin1 for
@@ -409,9 +433,46 @@ error _Conn::close() {
    if (!errors::Is(err, context::canceled)) // canceled - ok
        reterr1(err);
+    os::UnregisterAfterFork(newref(
+        static_cast<os::_IAfterFork*>( &wconn )
+    ));
    return E(eret);
 }
+// afterFork detaches from wcfs in child process right after fork.
+//
+// opened fileh are closed abruptly without sending "bye" not to interfere into
+// parent-wcfs exchange. Existing mappings become invalid to use.
+void _Conn::afterFork() {
+    // NOTE keep in sync with Conn.close
+    _Conn& wconn = *this;
+    // ↓↓↓ parallels Conn::close, but without locking and exchange with wcfs.
+    //
+    // After fork in child we are the only thread that exists/runs.
+    // -> no need to lock anything; trying to use locks could even deadlock,
+    // because lock state is snapshotted at fork time, when a lock could
+    // already be held by some thread.
+    bool alreadyClosed = (wconn._downErr == errConnClosed);
+    if (alreadyClosed)
+        return;
+    // close all files and make mappings efault.
+    while (!wconn._filehTab.empty()) {
+        FileH f = wconn._filehTab.begin()->second;
+        // close f even if f->_state < _FileHOpened
+        // (in parent closure of opening-in-progress files is done by
+        // Conn::open, but in child we are the only one to release resources)
+        f->_afterFork();
+    }
+    // NOTE no need to wlink->close() - wlink handles afterFork by itself.
+    // NOTE no need to signal pinner, as fork does not clone the pinner into child.
+}
 // _pinner receives pin messages from wcfs and adjusts wconn file mappings.
 error _Conn::_pinner(context::Context ctx) {
    Conn wconn = newref(this); // newref for go
@@ -944,6 +1005,7 @@ error _FileH::close() {
 // - virt_lock
 // - wconn.atMu
 error _FileH::_closeLocked(bool force) {
+    // NOTE keep in sync with FileH._afterFork
    _FileH& fileh = *this;
    Conn    wconn = fileh.wconn;
@@ -1018,6 +1080,40 @@ error _FileH::_closeLocked(bool force) {
    return E(eret);
 }
+// _afterFork is similar to _closeLocked and releases FileH resource and
+// mappings right after fork.
+void _FileH::_afterFork() {
+    // NOTE keep in sync with FileH._closeLocked
+    _FileH& fileh = *this;
+    Conn    wconn = fileh.wconn;
+    // ↓↓↓ parallels FileH._closeLocked but without locking and wcfs exchange.
+    //
+    // There is no locking (see Conn::afterFork for why) and we shut down the
+    // file even if ._state == _FileHClosing, because that state was copied
+    // from the parent: it is in the parent that another thread is currently
+    // closing *parent's* FileH.
+    if (fileh._state == _FileHClosed) // NOTE _not_ >= _FileHClosing
+        return;
+    // don't send to wcfs "stop watch f" not to disrupt parent-wcfs exchange.
+    // just close the file.
+    if (wconn->_filehTab.get(fileh.foid)._ptr() != &fileh)
+        panic("BUG: fileh._afterFork: wconn.filehTab[fileh.foid] != fileh");
+    wconn->_filehTab.erase(fileh.foid);
+    fileh._headf->close(); // ignore err
+    // change all fileh.mmaps to cause EFAULT on access
+    for (auto mmap : fileh._mmaps) {
+        mmap->__remmapAsEfault(); // ignore err
+    }
+    // done
+    fileh._state = _FileHClosed;
+}
 // mmap creates file mapping representing file[blk_start +blk_len) data as of wconn.at database state.
 //
 // If vma != nil, created mapping is associated with that vma of user-space virtual memory manager:
@@ -1427,6 +1523,9 @@ string _Mapping::String() const {
 _Conn::_Conn()  {}
 _Conn::~_Conn() {}
+void _Conn::incref() {
+    object::incref();
+}
 void _Conn::decref() {
    if (__decref())
        delete this;
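
The detach-on-fork behaviour added to `Conn`, `FileH` and `WatchLink` can be illustrated with a small Python analogue. This is a sketch under assumptions, not the wcfs implementation: `Link` and `demo_detach` are hypothetical names, a pipe stands in for the link to the server, and Python's `os.register_at_fork` plays the role of the registration done in `WCFS::connect`. Unlike the commit's `UnregisterAfterFork`, `os.register_at_fork` offers no way to unregister a hook.

```python
import os

class Link:
    """Hypothetical connection that says "bye" to a server on close."""

    def __init__(self, fd):
        self.fd = fd
        self.closed = False
        # run _after_fork in every child forked from this process
        os.register_at_fork(after_in_child=self._after_fork)

    def close(self):
        if self.closed:
            return                     # no-op, e.g. in a forked child
        os.write(self.fd, b"bye\n")    # normal close notifies the server
        os.close(self.fd)
        self.closed = True

    def _after_fork(self):
        # child: release the inherited descriptor without sending anything
        if self.closed:
            return
        os.close(self.fd)
        self.closed = True

def demo_detach():
    """Return what the "server" side of the pipe receives."""
    r, w = os.pipe()
    link = Link(w)
    pid = os.fork()
    if pid == 0:
        os.close(r)
        link.close()                   # must be a no-op in the child
        os._exit(0)
    os.waitpid(pid, 0)
    link.close()                       # parent closes normally
    data = os.read(r, 4096)
    os.close(r)
    return data
```

Only the parent's "bye" reaches the reader: the child's explicit `close()` finds the link already detached, which mirrors the `fh.close()   # must be noop in child` check in the test.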

--- a/wcfs/client/wcfs.h
+++ b/wcfs/client/wcfs.h
@@ -169,7 +169,7 @@ struct WCFS {
 // Conn logically mirrors ZODB.Connection .
 // It is safe to use Conn from multiple threads simultaneously.
 typedef refptr<struct _Conn> Conn;
-struct _Conn : object {
+struct _Conn : os::_IAfterFork, object {
    WCFS        *_wc;
    WatchLink   _wlink; // watch/receive pins for mappings created under this conn
@@ -193,6 +193,7 @@ private:
    ~_Conn();
    friend pair<Conn, error> WCFS::connect(zodb::Tid at);
 public:
+    void incref();
    void decref();
 public:
@@ -207,6 +208,8 @@ private:
    error __pinner(context::Context ctx);
    error _pin1(PinReq *req);
    error __pin1(PinReq *req);
+    void afterFork();
 };
 // FileH represent isolated file view under Conn.
@@ -264,6 +267,7 @@ public:
    error _open();
    error _closeLocked(bool force);
+    void  _afterFork();
 };
 // Mapping represents one memory mapping of FileH.

--- a/wcfs/client/wcfs_misc.cpp
+++ b/wcfs/client/wcfs_misc.cpp
@@ -23,6 +23,7 @@
 #include <golang/errors.h>
 #include <golang/fmt.h>
 #include <golang/io.h>
+#include <golang/sync.h>
 using namespace golang;
 #include <inttypes.h>
@@ -32,6 +33,7 @@ using namespace golang;
 #include <unistd.h>
 #include <sys/mman.h>
+#include <algorithm>
 #include <memory>
 // golang::
@@ -134,6 +136,59 @@ static error _pathError(const char *op, const string &path, int syserr) {
 }
+// afterfork
+static sync::Mutex         _afterForkMu;
+static bool                _afterForkInit;
+static vector<IAfterFork>  _afterForkList;
+// _runAfterFork runs handlers registered by RegisterAfterFork.
+static void _runAfterFork() {
+    // we were just forked: This is child process and there is only 1 thread.
+    // The state of memory was copied from parent.
+    // There are no other mutators except us.
+    // -> go through _afterForkList *without* locking.
+    for (auto obj : _afterForkList) {
+        obj->afterFork();
+    }
+    // reset _afterFork* state because child could want to fork again
+    new (&_afterForkMu) sync::Mutex;
+    _afterForkInit = false;
+    _afterForkList.clear();
+}
+void RegisterAfterFork(IAfterFork obj) {
+    _afterForkMu.lock();
+    defer([&]() {
+        _afterForkMu.unlock();
+    });
+    if (!_afterForkInit) {
+        int e = pthread_atfork(/*prepare=*/nil, /*parent=*/nil, /*child=*/_runAfterFork);
+        if (e != 0) {
+            string estr = fmt::sprintf("pthread_atfork: %s", v(_sysErrString(e)));
+            panic(v(estr));
+        }
+        _afterForkInit = true;
+    }
+    _afterForkList.push_back(obj);
+}
+void UnregisterAfterFork(IAfterFork obj) {
+    _afterForkMu.lock();
+    defer([&]() {
+        _afterForkMu.unlock();
+    });
+    // _afterForkList.remove(obj)
+    _afterForkList.erase(
+        std::remove(_afterForkList.begin(), _afterForkList.end(), obj),
+        _afterForkList.end());
+}
 // _sysErrString returns string corresponding to system error syserr.
 static string _sysErrString(int syserr) {
    char ebuf[128];
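
The register/unregister bookkeeping around `pthread_atfork` can be sketched in Python as follows. This is an analogue under assumptions, not a translation of the C++ code: `os.register_at_fork` replaces `pthread_atfork`, all names are hypothetical, and `Recorder` exists only so that the demo can observe which handlers ran in the child.

```python
import os
import threading

_mu = threading.Lock()
_hook_installed = False
_handlers = []

def _run_after_fork():
    """Run registered handlers in the freshly forked child."""
    global _mu
    # The child has a single thread, so the list is walked without locking.
    for obj in list(_handlers):
        obj.after_fork()
    # Reset state: the lock may have been copied in a locked state.
    # The hook itself stays installed in the child, so _hook_installed is
    # left as is (an assumption of this sketch, differing from the C++ code).
    _mu = threading.Lock()
    _handlers.clear()

def register_after_fork(obj):
    global _hook_installed
    with _mu:
        if not _hook_installed:        # install the process-wide hook once
            os.register_at_fork(after_in_child=_run_after_fork)
            _hook_installed = True
        _handlers.append(obj)

def unregister_after_fork(obj):
    """Undo register_after_fork; no-op if obj was not registered."""
    with _mu:
        _handlers[:] = [h for h in _handlers if h is not obj]

class Recorder:
    """Demo-only handler that reports its tag through a pipe."""
    def __init__(self, fd, tag):
        self.fd, self.tag = fd, tag
    def after_fork(self):
        os.write(self.fd, self.tag + b"\n")

def demo_registry():
    r, w = os.pipe()
    kept, dropped = Recorder(w, b"kept"), Recorder(w, b"dropped")
    register_after_fork(kept)
    register_after_fork(dropped)
    unregister_after_fork(dropped)     # must not run in the child
    pid = os.fork()
    if pid == 0:
        os._exit(0)                    # handlers already ran inside fork
    os.waitpid(pid, 0)
    unregister_after_fork(kept)
    os.close(w)
    data = os.read(r, 4096)
    os.close(r)
    return data
```

Only the handler that was still registered at fork time runs in the child, and the parent's list is unaffected because the child works on its own copy of memory.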

--- a/wcfs/client/wcfs_misc.h
+++ b/wcfs/client/wcfs_misc.h
@@ -107,6 +107,25 @@ tuple<File, error> open(const string &path, int flags = O_RDONLY,
                      S_IRGRP | S_IWGRP | S_IXGRP |
                      S_IROTH | S_IWOTH | S_IXOTH);
+// afterfork
+// IAfterFork is the interface that objects must implement to be notified after fork.
+typedef refptr<struct _IAfterFork> IAfterFork;
+struct _IAfterFork : public _interface {
+    // afterFork is called in just forked child process for objects that
+    // were previously registered in parent via RegisterAfterFork.
+    virtual void afterFork() = 0;
+};
+// RegisterAfterFork registers obj so that obj.afterFork is run after fork in
+// the child process.
+void RegisterAfterFork(IAfterFork obj);
+// UnregisterAfterFork undoes RegisterAfterFork.
+// It is noop if obj was not registered.
+void UnregisterAfterFork(IAfterFork obj);
 }   // os::
 // mm::

--- a/wcfs/client/wcfs_watchlink.cpp
+++ b/wcfs/client/wcfs_watchlink.cpp
@@ -65,6 +65,10 @@ pair<WatchLink, error> WCFS::_openwatch() {
    wlink->rx_eof     = makechan<structZ>();
+    os::RegisterAfterFork(newref(
+        static_cast<os::_IAfterFork*>( wlink._ptr() )
+    ));
    context::Context serveCtx;
    tie(serveCtx, wlink->_serveCancel) = context::with_cancel(context::background());
    wlink->_serveWG = sync::NewWorkGroup(serveCtx);
@@ -98,9 +102,24 @@ error _WatchLink::close() {
    if (err == nil)
        err = err3;
+    os::UnregisterAfterFork(newref(
+        static_cast<os::_IAfterFork*>( &wlink )
+    ));
    return E(err);
 }
+// afterFork detaches from wcfs in child process right after fork.
+void _WatchLink::afterFork() {
+    _WatchLink& wlink = *this;
+    // in child right after fork we are the only thread to run; in particular
+    // _serveRX is not running. Just release the file handle, that fork
+    // duplicated, to make sure that child cannot send anything to wcfs and
+    // interfere into parent-wcfs exchange.
+    wlink._f->close(); // ignore err
+}
 // closeWrite closes send half of the link.
 error _WatchLink::closeWrite() {
    _WatchLink& wlink = *this;
@@ -487,6 +506,9 @@ string rxPkt::to_string() const {
 _WatchLink::_WatchLink()    {}
 _WatchLink::~_WatchLink()   {}
+void _WatchLink::incref() {
+    object::incref();
+}
 void _WatchLink::decref() {
    if (__decref())
        delete this;

--- a/wcfs/client/wcfs_watchlink.h
+++ b/wcfs/client/wcfs_watchlink.h
@@ -70,7 +70,7 @@ static_assert(sizeof(rxPkt) == 256, "rxPkt miscompiled"); // NOTE 128 is too low
 //
 // It is safe to use WatchLink from multiple threads simultaneously.
 typedef refptr<class _WatchLink> WatchLink;
-class _WatchLink : public object {
+class _WatchLink : public os::_IAfterFork, object {
    WCFS            *_wc;
    os::File        _f;      // head/watch file handle
    string          _rxbuf;  // buffer for data already read from _f
@@ -102,6 +102,7 @@ private:
    ~_WatchLink();
    friend pair<WatchLink, error> WCFS::_openwatch();
 public:
+    void incref();
    void decref();
 public:
@@ -122,6 +123,8 @@ private:
    StreamID _nextReqID();
    tuple<chan<rxPkt>, error> _sendReq(context::Context ctx, StreamID stream, const string &req);
+    void afterFork();
    friend error _twlinkwrite(WatchLink wlink, const string &pkt);
 };