1. 27 Jan, 2022 4 commits
    • Kirill Smelkov's avatar
      trun: Fix returncode when test run is canceled · 56e52da6
      Kirill Smelkov authored
      To detect leaked processes in the end of the run, we are first waiting
      for remaining test processes via procps. If the main waiting loop was
      canceled without main test process first completed (p.poll calls never
      returned !None), then the waitpid(pid=p.pid) system call will be done
      via procps. Which leaves further waitpid(pid=p.pid) system call invoked
      by subprocess to get -ECHILD error and artificially report 0 exit
      status:
      
      https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/subprocess.py#L1094-L1107
      
      -> Fix it by propagating .returncode from procps to Popen instance so
      that it does not get lost.
      
      On sample .nxdtest with
      
          TestCase('sleep', ['sleep', '10'])
      
      Before the patch the output with CTRL+C was:
      
          $ nxdtest
          ...
          >>> sleep
          $ sleep 10
          ^C# Interrupt
          ok      sleep   0.604s  # 1t 0e 0f 0s       <-- NOTE
          # test run canceled
          # ran 1 test case:  1·ok                    <-- NOTE
      
      After the patch the output becomes:
      
          $ nxdtest
          ...
          >>> sleep
          $ sleep 100
          ^C# Interrupt
          error   sleep   1.006s  # 1t 1e 0f 0s       <-- NOTE
          # test run canceled
          # ran 1 test case:  1·error                 <-- NOTE
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!16
      56e52da6
    • Kirill Smelkov's avatar
      Move detection of leaked processes to trun · cf300184
      Kirill Smelkov authored
      When test run is canceled via signal, we want to send SIGTERM to trun
      and let it kill/wait the tested processes. However until now we were
      sending SIGTERM to trun and immediately checking whether some spawned
      tested processes remain without giving any time for trun to complete first.
      
      Fix this by moving the code that detects/cleanups leaked processes to
      trun itself, and in the main nxdtest driver to only send SIGTERM to trun
      and let trun do the cleanup/leaked processes detection.
      
      This fixes erroneous "leaked process detected" as noticed by @jerome in
      the following example:
      
          (nexedi/nxdtest!16 (comment 150835))
          $ nxdtest
          ...
          >>> sleep
          $ sleep 10
          ^C# Interrupt
          # stopping due to cancel
          # leaked pid=188877 'sleep' ['sleep', '10']		<-- "leaked" is a bit misleading here
          error   sleep   1.030s  # 1t 1e 0f 0s
          # test run canceled
          # ran 1 test case:  1·error
      
      After the patch the output in this case becomes:
      
          $ nxdtest
          ...
          >>> sleep
          $ sleep 10
          ^C# Interrupt
          ok      sleep   0.604s  # 1t 0e 0f 0s
          # test run canceled
          # ran 1 test case:  1·ok
      
      There is no more "leaked pid=...." emitted, but the change from "error" to "ok" status is wrong.
      We will fix it in the next patch.
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!16
      cf300184
    • Kirill Smelkov's avatar
      Fix thinko when finally killing processes · 548a5325
      Kirill Smelkov authored
      First we send SIGTERM to leaked processes, and then, after timeout,
      intend to send SIGKILL to the leaked processes and to the main test
      process if it is still alive.
      
      However there is a thiko in the code: we were sending SIGKILL only to
      the main test process, not to all leaked ones.
      
      -> Fix it.
      
      The bug was there since 0ad45a9c (Detect if a test leaks processes and
      terminate them).
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!16
      548a5325
    • Kirill Smelkov's avatar
      Cancel test run on SIGINT/SIGTERM · b0cf277d
      Kirill Smelkov authored
      In addition to canceling test run is master tells us to do so, also cancel the
      run if interrupted or terminated.
      
      With the following sample .nxdtest
      
          TestCase('sleep', ['sleep', '10'])
      
      before the patch it does not react to CTRL+C:
      
          $ nxdtest
          ...
          >>> sleep
          $ sleep 10
          ^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C# ran 0 test cases.   <-- no reaction to CTRL+C, finishes after 10 seconds
          Traceback (most recent call last):
            File "/home/kirr/src/tools/go/py3.venv2/bin/nxdtest", line 33, in <module>
              sys.exit(load_entry_point('nxdtest', 'console_scripts', 'nxdtest')())
            File "/home/kirr/src/tools/go/py3.venv2/lib/python3.9/site-packages/decorator.py", line 232, in fun
              return caller(func, *(extras + args), **kw)
            File "/home/kirr/src/tools/go/pygolang-master/golang/__init__.py", line 103, in _
              return f(*argv, **kw)
            File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 339, in main
              wg.wait()
          KeyboardInterrupt
      
      after the patch:
      
          $ nxdtest
          ...
          >>> sleep
          $ sleep 10
          ^C# Interrupt					<-- prompt reaction to CTRL+C
          # stopping due to cancel
          # leaked pid=188877 'sleep' ['sleep', '10']
          error   sleep   1.030s  # 1t 1e 0f 0s
          # test run canceled
          # ran 1 test case:  1·error
      
      Needs nexedi/pygolang!17 to work.
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!16
      b0cf277d
  2. 23 Dec, 2021 1 commit
    • Kirill Smelkov's avatar
      fixup! trun: Spawn user test with sole regular uid/gid in /etc/{passwd,group} database · 6f75fa90
      Kirill Smelkov authored
      Even though slapos.core tests seem to mock getgrnam calls[1], disk group
      is being looked up in /etc/groups for real which fails, if there is no
      such group, e.g. as
      
          ERROR: test_not_existing (slapos.tests.test_slapgrid.TestSlapgridWithDevPermManagerDevPermEmptyLsblk)
          ----------------------------------------------------------------------
          Traceback (most recent call last):
            File "/srv/slapgrid/slappart3/t/bvi/i/0/parts/slapos.core/slapos/tests/test_slapgrid.py", line 3246, in setUp
              self.setUpExpected()
            File "/srv/slapgrid/slappart3/t/bvi/i/0/parts/slapos.core/slapos/tests/test_slapgrid.py", line 3230, in setUpExpected
              gid = grp.getgrnam("disk").gr_gid
          KeyError: 'getgrnam(): name not found: disk'
      
      -> Fix it up by also creating "disk" group in our namespace environment.
      
      I'm not sure, but maybe the better long-term fix would be for
      slapos.core tests not to access /etc/groups for real and to instead mock
      access to this database completely.
      
      Amends commits e6b7993c and b42ccfa5.
      
      /reported-by @tomo
      /reported-at slapos!1107 (comment 148758)
      
      [1] https://lab.nexedi.com/nexedi/slapos.core/blob/1.7.1-28-g0b6bf2af4/slapos/tests/test_slapformat.py#L160-166
      6f75fa90
  3. 20 Dec, 2021 13 commits
    • Kirill Smelkov's avatar
      Propagate cancellation to spawned test jobs · 81b1907a
      Kirill Smelkov authored
      So that if a test run is canceled in ERP5 UI, nxdtest stops its run
      soon, instead of after several hours in case of
      SlapOS.SoftwareReleases.IntegrationTest-* tests.
      
      See the main (938b5455) and other patches for details.
      
      /helped-and-reviewed-by @jerome
      /reviewed-on !14
      
      * y/cancel:
        Add test for cancel propagation
        tests: Run nxdtest.main for each test in a separate thread, so that pytest.timeout generally works
        Propagate cancellation to spawned test jobs
        Raise ctx.err() if test run was cancelled
        Stop spawned process softly on ctx cancel
      81b1907a
    • Jérome Perrin's avatar
      Add test for cancel propagation · 5d656ccf
      Jérome Perrin authored
      /reviewed-by @kirr
      /reviewed-on nexedi/nxdtest!14
      5d656ccf
    • Kirill Smelkov's avatar
      tests: Run nxdtest.main for each test in a separate thread, so that pytest.timeout generally works · 8f59b689
      Kirill Smelkov authored
      Factor-out the driver to run_nxdtest in a separate thread done in
      0ad45a9c (Detect if a test leaks processes and terminate them) from
      test_run_procleak into run_nxdtest, so that all tests are run this way
      for pytest.timeout, if requested, to work universally for all tests.
      
      The logic of how we run nxdtest under tests and handle timeout will be,
      hopefully, soon reworked more, but it is anyway good to keep this logic
      in only one place.
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!14
      8f59b689
    • Kirill Smelkov's avatar
      Propagate cancellation to spawned test jobs · 938b5455
      Kirill Smelkov authored
      A user might cancel test result in ERP5 UI if e.g. some misbehaviour is
      detected and a new revision is ready to be tested. This works by
      test_result.start() returning None - indicating that there is no more
      test_result_lines to exercise. Master also indicates this cancellation
      via test_result.isAlive() returning False, but until now we were not
      using that information and were always waiting for completion of current
      test job that is already spawned.
      
      This works well in practice if individual tests are not long, but e.g.
      for SlapOS.SoftwareReleases.IntegrationTest-* it is not good, because
      there an individual test might takes _hours_ to execute.
      
      -> Fix it by first setting global context to where we'll propagate
      cancellation from test_result.isAlive, and by using that context as the
      base for all other activities. This should terminate spawned test
      process if test_result is canceled.
      
      The interval to check is picked up as 5 minutes not to overload master.
      @jerome says that
      
          We now have 341 active test nodes, but sometimes we are using
          more, we did in the past to stress test some new machines.
      
          For the developer, if we reduce the waiting time from a few hours to 1
          minutes or 5 minutes seems more or less equivalent.
      
      For 350 testnodes and each nxdtest checking its test_result status via
      isAlive query to master every 5 minutes, it results in ~ 1 isAlive
      request/second to master on average.
      
      Had to change time to golang.time to use time.after().
      Due to that time() and sleep() are changed to time.now() and
      time.sleep() correspondingly.
      
      /helped-and-reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!14
      938b5455
    • Kirill Smelkov's avatar
      Raise ctx.err() if test run was cancelled · bdd183fb
      Kirill Smelkov authored
      This is normal rule to return an error from a task if the task has to
      abort due to cancellation. We already do this in tee, but not in the
      function that is waiting for spawned process to complete(*).
      
      -> Fix that. Wrap corresponding wg.wait() into try/except and check if
      it fails due to cancellation, upon which we should not raise, but
      instead should continue to finish current test_result_line and only
      after stop test run normally (again without raise, but with regular one
      line log entry).
      
      (*) added in 0ad45a9c (Detect if a test leaks processes and terminate them)
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!14
      bdd183fb
    • Kirill Smelkov's avatar
      Stop spawned process softly on ctx cancel · 34cb7879
      Kirill Smelkov authored
      In 0ad45a9c (Detect if a test leaks processes and terminate them) we
      organized waiting for spawned processes with handing ctx and sending
      SIGKILL to the main spawned process on ctx cancel even though other
      leaked processes are always first sent with SIGTERM and - only after
      shutdown timeout - later with SIGKILL.
      
      This is too brutal. Rework the code to first send SIGTERM to the main
      spawned test process too, and leverage SIGKILL only later after shutdown
      timeout.
      
      This will be tested in a later patch which exercises how cancel from
      master is propagated.
      
      /reviewed-by @jerome
      /reviewed-on !14
      34cb7879
    • Kirill Smelkov's avatar
      Run each testcase with its own /tmp and /dev/shm · 5acd1359
      Kirill Smelkov authored
      to detect after each test run leaked temporary files, leaked mount
      entries, to isolate different test runs from each other, and to provide
      tmpfs on /tmp for every test.
      
      The main change and description is in patch1 (a191468f); the other
      patches fix that up step-by-step to work for real for all our tests.
      
      /helped-by @tomo
      /helped-and-reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!13
      
      * y/unshare:
        trun: Require FUSE to be working inside user-namespaces to activate them
        Factor checking whether user-namespaces are available into trun.userns_available()
        trun: Add test for how /etc/{passwd,group} is setup for spawned job
        trun: Spawn user test with sole regular uid/gid in /etc/{passwd,group} database
        trun: Deactivate most capabilities before spawning user test
        Run each testcase with its own /tmp and /dev/shm
      5acd1359
    • Kirill Smelkov's avatar
      trun: Require FUSE to be working inside user-namespaces to activate them · f5f52434
      Kirill Smelkov authored
      FUSE is needed for wendelin.core, and if we don't check that FUSE works
      inside and activate user namespaces on e.g. Linux 4.9 kernel,
      wendelin.core will fail to function. Since today wendelin.core is
      included into ERP5 as its base component(*) it is practical to require
      FUSE-in-userns support unconditionally.
      
      FUSE in user-namespaces started to work in Linux 4.18(+). Detect if it
      should work inside via checking whether running kernel is newer. I
      choosed this way for simplicity not to unroll test filesystem to try to
      mount inside and also not to slow-down startup time.
      
      (*) see nexedi/slapos!874
      (+) see https://git.kernel.org/linus/da315f6e0398 and https://git.kernel.org/linus/8cb08329b080.
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!13
      f5f52434
    • Kirill Smelkov's avatar
      Factor checking whether user-namespaces are available into trun.userns_available() · ef41d960
      Kirill Smelkov authored
      In the next patch we'll also add detection whether FUSE works inside
      user namespaces. Before doing that stop duplicating related code in
      between trun.py and nxdtest_test.py .
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!13
      ef41d960
    • Jérome Perrin's avatar
      trun: Add test for how /etc/{passwd,group} is setup for spawned job · b42ccfa5
      Jérome Perrin authored
      Even if we don't want to spawn full sshd to verify that ptys work, we can
      still minimally test that inside user/group database is setup as expected.
      
      /reviewed-by @kirr
      /reviewed-on nexedi/nxdtest!13
      b42ccfa5
    • Kirill Smelkov's avatar
      trun: Spawn user test with sole regular uid/gid in /etc/{passwd,group} database · e6b7993c
      Kirill Smelkov authored
      Even though libc.opentty stopped to insist on chown(group=tty) for
      /dev/pts/*, openssh still wants to do it and fails, preventing sshd from
      working. Fix it by spawning test workload with sole current user and
      group being present in the password database.
      
      We anyway don't have IDs for users/groups different from current uid/gid
      mapped from current namespace, and the kernel, when seeing a file owned
      by those, maps them to "nobody/nogroup" for existing files, and rejects
      chown to those original IDs obtained from parent's namespace
      /etc/{passwd,group} as EINVAL. For the same reason we don't try to mount
      our own /dev/pts instance, because we have only current uid/gid mapped
      to parent namespace and gid=5 maps to nogroup in parent. With existing
      /dev/pts mount entries are only listed as having nogroup, and from
      outside they _are_ owned by parent's tty group. If we would mount
      /dev/pts anew, parent won't see our /dev/pts/* at all which moves us
      a bit more far from desired behaviour.
      
      Still keep root and nobody/nogroup as @jerome reports that without those
      users Go tests fail on Debian9:
      
          nexedi/slapos!1095 (comment 147177)
          nexedi/slapos!1095 (comment 147201)
      
      See added comment about all this for more details.
      
      This patch fixes sshd to work under tryn.py(*)
      
      No test because libc.openpty works both with and without this patch, and
      it would need too spawn whole sshd under test to verify this.
      
      (*) the diff for sshd 1) running successfully under regular user, and 2)
          previously failing under trun.py without this patch is below:
      
          diff --git a/regular-nopam.txt b/trun-nopam.txt
          index 378ccb6..5b96c08 100644
          --- a/regular-nopam.txt
          +++ b/trun-nopam.txt
          @@ -1,4 +1,4 @@
          -(neo) (z-dev) (g.env) kirr@deca:~/tmp/trashme/sshd$ /sbin/sshd -d -p 2222 -h `pwd`/ssh_host_rsa_key -o UsePAM=no
          +kirr@deca:~/tmp/trashme/sshd$ /sbin/sshd -d -p 2222 -h `pwd`/ssh_host_rsa_key -o UsePAM=no
           debug1: sshd version OpenSSH_8.4, OpenSSL 1.1.1k  25 Mar 2021
           debug1: private host key #0: ssh-rsa SHA256:y+ujVDqqFBXTclDM2NLy4GME7wReutLcUYOWAeriXdc
           debug1: setgroups() failed: Operation not permitted
          @@ -91,35 +91,13 @@ debug1: session_input_channel_req: session 0 req pty-req
           debug1: Allocating pty.
           debug1: session_new: session 0
           debug1: SELinux support disabled
          -Attempt to write login records by non-root user (aborting)
          -debug1: session_pty_req: session 0 alloc /dev/pts/2
          -debug1: server_input_channel_req: channel 0 request env reply 0
          -debug1: session_by_channel: session 0 channel 0
          -debug1: session_input_channel_req: session 0 req env
          -debug1: server_input_channel_req: channel 0 request shell reply 1
          -debug1: session_by_channel: session 0 channel 0
          -debug1: session_input_channel_req: session 0 req shell
          -Starting session: shell on pts/2 for kirr from 127.0.0.1 port 44106 id 0
          -debug1: Setting controlling tty using TIOCSCTTY.
          -
          -debug1: Received SIGCHLD.
          -debug1: session_by_pid: pid 693948
          -debug1: session_exit_message: session 0 channel 0 pid 693948
          -debug1: session_exit_message: release channel 0
          -debug1: session_by_tty: session 0 tty /dev/pts/2
          -debug1: session_pty_cleanup2: session 0 release /dev/pts/2
          -Attempt to write login records by non-root user (aborting)
          -debug1: session_by_channel: session 0 channel 0
          -debug1: session_close_by_channel: channel 0 child 0
          -Close session: user kirr from 127.0.0.1 port 44106 id 0
          -debug1: channel 0: free: server-session, nchannels 1
          -Received disconnect from 127.0.0.1 port 44106:11: disconnected by user
          -Disconnected from user kirr 127.0.0.1 port 44106
          +chown(/dev/pts/2, 1000, 5) failed: Invalid argument
           debug1: do_cleanup
           debug1: temporarily_use_uid: 1000/1000 (e=1000/1000)
           debug1: restore_uid: (unprivileged)
          +debug1: session_pty_cleanup2: session 0 release /dev/pts/2
          +Attempt to write login records by non-root user (aborting)
          +debug1: audit_event: unhandled event 12
           debug1: do_cleanup
           debug1: temporarily_use_uid: 1000/1000 (e=1000/1000)
           debug1: restore_uid: (unprivileged)
          -debug1: audit_event: unhandled event 12
      
          (see nexedi/slapos!1095 (comment 147018))
      
      /helped-and-reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!13
      e6b7993c
    • Kirill Smelkov's avatar
      trun: Deactivate most capabilities before spawning user test · a9af3a8a
      Kirill Smelkov authored
      In the previous patch we asked unshare to keep capabilities so that FUSE
      mounting works from under regular user. However full set of capabilities
      is too much, and in particular if cap_dac_override is present(*), it means
      that writes to files that have read-only permission, is not rejected by
      kernel.
      
      -> Adjust trun to retain only those capabilities that we actually need
         = CAP_SYS_ADMIN to mount things.
      
      This should fix the following Go build failure:
      
          --- FAIL: TestReadOnlyWriteFile (0.00s)
              ioutil_test.go:90: Expected an error when writing to read-only file /tmp/TestReadOnlyWriteFile3940340549/blurp.txt
          FAIL
          FAIL	io/ioutil	0.053s
      
      P.S. And if we would unshare to root instead (unshare -Umr) it should be
      still a good idea to drop extra capabilities, as we still want to reject
      writes to read-only files.
      
      (*) see https://man7.org/linux/man-pages/man7/capabilities.7.html
      
      /helped-and-reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!13
      a9af3a8a
    • Kirill Smelkov's avatar
      Run each testcase with its own /tmp and /dev/shm · a191468f
      Kirill Smelkov authored
      and detect leaked temporary files and mount entries after each test run.
      
      Background
      
      Currently we have several testing-related problems that are
      all connected to /tmp and similar directories:
      
      Problem 1: many tests create temporary files for each run. Usually
      tests are careful to remove them on teardown, but due to bugs, many kind
      of tests, test processes being hard-killed (SIGKILL, or SIGSEGV) and
      other reasons, in practice this cleanup does not work 100% reliably and
      there is steady growth of files leaked on /tmp on testnodes.
      
      Problem 2: due to using shared /tmp and /dev/shm, the isolation in
      between different test runs of potentially different users is not
      strong. For example @jerome reports that due to leakage of faketime's
      shared segments separate test runs affect each other and fail:
      https://erp5.nexedi.net/bug_module/20211125-1C8FE17
      
      Problem 3: many tests depend on /tmp being a tmpfs instance. This are for
      example wendelin.core tests which are intensively writing to database,
      and, if /tmp is resided on disk, timeout due to disk IO stalls in fsync
      on every commit. The stalls are as much as >30s and lead to ~2.5x overall
      slowdown for test runs. However the main problem is spike of increased
      latency which, with close to 100% probability, always render some test
      as missing its deadline. This topic is covered in
      https://erp5.com/group_section/forum/Using-tmpfs-for--tmp-on-testnodes-JTocCtJjOd
      
      --------
      
      There are many ways to try to address each problem separately, but they
      all come with limitations and drawbacks. We discussed things with @tomo
      and @jerome, and it looks like that all those problems can be addressed
      in one go if we run tests under user namespaces with private mounts for
      /tmp and /dev/shm.
      
      Even though namespaces is generally no-go in Nexedi, they seem to be ok
      to use in tests. For example they are already used via private_tmpfs
      option in SlapOS:
      
      https://lab.nexedi.com/nexedi/slapos/blob/1876c150/slapos/recipe/librecipe/execute.py#L87-103
      https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo-input-schema.json#L121-124
      https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo.cfg.in#L11-16
      https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo.cfg.in#L30-34
      https://lab.nexedi.com/nexedi/slapos/blob/1876c150/software/neoppod/instance-neo.cfg.in#L170-177
      ...
      https://lab.nexedi.com/nexedi/slapos/blob/1876c150/stack/erp5/instance-zope.cfg.in#L227-230
      
      Thomas says that using private tmpfs for each test would be a better
      solution than implementing tmpfs for whole /tmp on testnodes. He also
      reports that @jp is OK to use namespaces for test as long as there is a
      fallback if namespaces aren't available.
      
      -> So let's do that: teach nxdtest to run each test case in its own
      private environment with privately-mounted /tmp and /dev/shm if we can
      detect that user namespaces are available. In an environment where user
      namespaces are indeed available this addresses all 3 problems because
      isolation and being-tmpfs are there by design, and even if some files
      will leak, the kernel will free everything when test terminates and the
      filesystem is automatically unmounted. We also detect such leakage and
      report a warning so that such problems do not go completely unnoticed.
      
      Implementation
      
      We leverage unshare(1) for simplicity. I decided to preserve uid/gid
      instead of becoming uid=0 (= `unshare -Umr`) for better traceability, so
      that it is clear from test output under which real slapuser a test is
      run(*). Not changing uid requires to activate ambient capabilities so
      that mounting filesystems, including FUSE-based needed by wendelin.core,
      continue to work under regular non-zero uid. Please see
      https://git.kernel.org/linus/58319057b784 for details on this topic. And
      please refer to added trun.py for details on how per-test namespace is setup.
      
      Using FUSE inside user namespaces requires Linux >= 4.18 (see
      https://git.kernel.org/linus/da315f6e0398 and
      https://git.kernel.org/linus/8cb08329b080), so if we are really to use
      this patch we'll have to upgrade kernel on our testnodes, at least where
      wendelin.core is used in tests.
      
      "no namespaces" detection is implemented via first running `unshare ...
      true` with the same unshare options that are going to be used to create
      and enter new user namespace for real. If that fails, we fallback into
      "no namespaces" mode where no private /tmp and /dev/shm are mounted(%).
      
      (*) for example nxdtest logs information about the system on startup:
      
          date:   Mon, 29 Nov 2021 17:27:04 MSK
          xnode:  slapuserX@test.node
          ...
      
      (%) Here is how nxdtest is run in fallback mode on my Debian 11 with
          user namespaces disabled via `sysctl kernel.unprivileged_userns_clone=0`
      
          (neo) (z-dev) (g.env) kirr@deca:~/src/wendelin/nxdtest$ nxdtest
          date:   Thu, 02 Dec 2021 14:04:30 MSK
          xnode:  kirr@deca.navytux.spb.ru
          uname:  Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64
          cpu:    Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
      
          >>> pytest
          $ python -m pytest
          # user namespaces not available. isolation and many checks will be deactivated.    <--- NOTE
          ===================== test session starts ======================
          platform linux2 -- Python 2.7.18, pytest-4.6.11, py-1.10.0, pluggy-0.13.1
          rootdir: /home/kirr/src/wendelin/nxdtest
          plugins: timeout-1.4.2
          collected 23 items
      
          nxdtest/nxdtest_pylint_test.py ....                      [ 17%]
          nxdtest/nxdtest_pytest_test.py ...                       [ 30%]
          nxdtest/nxdtest_test.py ......xx                         [ 65%]
          nxdtest/nxdtest_unittest_test.py ........                [100%]
      
          ============= 21 passed, 2 xfailed in 2.67 seconds =============
          ok      pytest  3.062s  # 23t 0e 0f 0s
          # ran 1 test case:  1·ok
      
      /helped-by @tomo
      /helped-and-reviewed-by @jerome
      /reviewed-on !13
      a191468f
  4. 09 Dec, 2021 2 commits
    • Kirill Smelkov's avatar
      Log that master is connected and for which test_result this run is · 4fe9ee16
      Kirill Smelkov authored
      Also log if master told us that we have nothing to do, and if the mode to run is local.
      
      This should make it a bit more clear what is going on just by looking at
      nxdtest log. See previous patch for more details and context.
      
      For the reference: here is how updated output looks like in the normal case:
      
          date:   Thu, 09 Dec 2021 04:20:37 CET
          xnode:  slapuser7@rapidspace-testnode-001
          uname:  Linux rapidspace-testnode-001 4.9.0-16-amd64 #1 SMP Debian 4.9.272-1 (2021-06-21) x86_64
          cpu:    Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
          # running for test_result_module/20211209-170FD3998
      
          >>> pytest
          $ python -m pytest
          ============================= test session starts ==============================
          platform linux2 -- Python 2.7.18, pytest-4.6.11, py-1.9.0, pluggy-0.13.1
          rootdir: /srv/slapgrid/slappart7/t/dfp/soft/47cc86af27d234f0464630f2a0d22a6f/parts/zodbtools-dev
          collected 46 items
      
          zodbtools/test/test_analyze.py .                                         [  2%]
          zodbtools/test/test_commit.py ..                                         [  6%]
          zodbtools/test/test_dump.py ...                                          [ 13%]
          zodbtools/test/test_restore.py ..                                        [ 17%]
          zodbtools/test/test_tidrange.py .............................            [ 80%]
          zodbtools/test/test_zodb.py .........                                    [100%]
      
          ========================== 46 passed in 9.15 seconds ===========================
          ok      pytest  12.433s # 46t 0e 0f 0s
          # ran 1 test case:  1·ok
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!15
      4fe9ee16
    • Kirill Smelkov's avatar
      Always log system info and run summary, even if master tells us to do nothing · f8ec5787
      Kirill Smelkov authored
      Nxdtest logs system information (bd91f6f1 "Include system information
      into log output") and run summary at the end (9f413221 "Emit run summary
      at the end"). However all that information currently is printed only if
      master is successfully connected and actually tells to run the tests.
      
      This behaviour is not very useful, because if nxdtest log output on
      testnode is empty, it is not clear - whether it was "we have to do
      nothing", or nxdtest stuck somewhere or something else.
      
      For example
      https://nexedijs.erp5.net/#/test_result_module/20211208-47D165B7/12 is
      currently marked as Running for many long hours already. And the log on
      testnode regarding nxdtest run is just:
      
          2021-12-08 15:42:17,314 INFO     $ PATH=/srv/slapgrid/slappart13/srv/slapos/soft/2956f419073cb2249ed953507fa6b173/bin:/opt/slapos/parts/bison/bin:/opt/slapos/parts/bzip2/bin:/opt/slapos/parts/gettext/bin:/opt/slapos/parts/glib/bin:/opt/slapos/parts/libxml2/bin:/opt/slapos/parts/libxslt/bin:/opt/slapos/parts/m4/bin:/opt/slapos/parts/ncurses/bin:/opt/slapos/parts/openssl/bin:/opt/slapos/parts/pkgconfig/bin:/opt/slapos/parts/python2.7/bin:/opt/slapos/parts/readline/bin:/opt/slapos/parts/sqlite3/bin:/opt/slapos/parts/swig/bin:/opt/slapos/bin:/opt/slapos/parts/patch/bin:/opt/slapos/parts/socat/bin:/usr/bin:/usr/sbin:/sbin:/bin SLAPOS_TEST_LOG_DIRECTORY=/srv/slapgrid/slappart13/var/log/testnode/dgd-xStX9safSG SLAPOS_TEST_SHARED_PART_LIST=/srv/slapgrid/slappart13/srv/shared:/srv/slapgrid/slappart13/t/dgd/shared /bin/sh /srv/slapgrid/slappart13/t/dgd/i/0/bin/runTestSuite --master_url $DISTRIBUTOR_URL --revision slapos=13977-ec686a708633f689382426063c21efbe3b2eab04,slapos.core=8698-91edab77ed36c160da8017cfdc1673fe7a8e10de --test_node_title rapidspace-testnode-008-3Nodes-DEPLOYTASK0 --test_suite SLAPOS-SR-TEST --test_suite_title SlapOS.SoftwareReleases.IntegrationTest-kirr.Python2 --project_title 'Rapid.Space Project'
      
      without anything else.
      
      With this patch nxdtest would print system information and report how many
      tests it had run, if its invocation did not stuck.
      
      In this patch we only move code that calls system_info and defer summary log
      before code that connects to master. In the following patch we'll add more
      logging around connecting to master.
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!15
      f8ec5787
  5. 11 Nov, 2021 1 commit
    • Kirill Smelkov's avatar
      Emit run summary at the end · 9f413221
      Kirill Smelkov authored
      Sometimes there is zero testcases to be executed on testnodes, and
      log output from nxdtest is just
      
          date:   Wed, 10 Nov 2021 12:31:50 MSK
          xnode:  kirr@deca.navytux.spb.ru
          uname:  Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64
          cpu:    Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
      
      it is not clear from such output did the run ended or the test got
      stuck. After this patch it becomes
      
          date:   Wed, 10 Nov 2021 12:31:50 MSK
          xnode:  kirr@deca.navytux.spb.ru
          uname:  Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64
          cpu:    Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
          # ran 0 test cases.
      
      And in general, when there are several testcases to be run, it is helpful to
      indicate end of such run and to print brief summary of result status for all
      ran test cases. Example output:
      
          wendelin.core$ nxdtest -k test.wcfs
          date:   Wed, 10 Nov 2021 12:35:34 MSK
          xnode:  kirr@deca.navytux.spb.ru
          uname:  Linux deca 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) x86_64
          cpu:    Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
          >>> test.wcfs/fs:1
          ...
          ok      test.wcfs/fs:1  25.035s # 35t 0e 0f 0s
      
          >>> test.wcfs/fs:2
          ...
          ok      test.wcfs/fs:2  21.033s # 35t 0e 0f 0s
      
          >>> test.wcfs/fs:
          ...
          ok      test.wcfs/fs:   21.056s # 35t 0e 0f 0s
          # ran 3 test cases:  3·ok
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!12
      9f413221
  6. 13 Aug, 2021 2 commits
    • Jérome Perrin's avatar
      support parsing pylint output · 72e36088
      Jérome Perrin authored
      This parses pylint output with a simple regexp and counts one failure per
      message reported.
      
      This has not been tested yet, but we decided to apply this commit already.
      
      /acked-by @kirr
      /reviewed-on !11
      72e36088
    • Jérome Perrin's avatar
      loadNXDTestFile: use `compile` for better tracebacks on errors · 7b5add47
      Jérome Perrin authored
      When using compile with the actual file path, we can have better tracebacks
      in case of errors.
      
      before:
      
          Traceback (most recent call last):
            File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/bin/nxdtest", line 22, in <module>
              sys.exit(nxdtest.main())
            File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 142, in main
              tenv = loadNXDTestFile('.nxdtest')
            File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 75, in loadNXDTestFile
              six.exec_(src, g)
            File "<string>", line 77, in <module>
          NameError: name 'Pylint' is not defined
      
      after:
      
          Traceback (most recent call last):
            File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/bin/nxdtest", line 22, in <module>
              sys.exit(nxdtest.main())
            File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 142, in main
              tenv = loadNXDTestFile('.nxdtest')
            File "/srv/slapgrid/slappart3/srv/runner/software/9544feb19475590d240ba2d32743c0a0/parts/nxdtest/nxdtest/__init__.py", line 75, in loadNXDTestFile
              six.exec_(compile(src, os.path.realpath(path), 'exec'), g)
            File "/srv/slapgrid/slappart3/srv/runner/instance/slappart8/var/nxdtest/.nxdtest", line 77, in <module>
              summaryf=Pylint.summary,
          NameError: name 'Pylint' is not defined
      
      /reviewed-by @kirr
      /reviewed-in nexedi/nxdtest!10
      7b5add47
  7. 12 Aug, 2021 1 commit
    • Kirill Smelkov's avatar
      Detect if a test leaks processes and terminate them · 0ad45a9c
      Kirill Smelkov authored
      For every TestCase nxdtest spawns test process to run with stdout/stderr
      redirected to pipes that nxdtest reads. Nxdtest, in turn, tees those
      pipes to its stdout/stderr until the pipes become EOF. If the test
      process, in turn, spawns other processes, those other processes will
      inherit opened pipes, and so the pipes won't become EOF untill _all_
      spawned test processes (main test process + other processes that it
      spawns) exit. Thus, if there will be any process, that the main test
      process spawned, but did not terminated upon its own exit, nxdtest will
      get stuck waiting for pipes to become EOF which won't happen at all if a
      spawned test subprocess persists not to terminate.
      
      I hit this problem for real on a Wendelin.core 2 test - there the main
      test processes was segfaulting and so did not instructed other spawned
      processes (ZEO, WCFS, ...) to terminate. As the result the whole test
      was becoming stuck instead of being promptly reported as failed:
      
          runTestSuite: Makefile:175: recipe for target 'test.wcfs' failed
          runTestSuite: make: *** [test.wcfs] Segmentation fault
          runTestSuite: wcfs: 2021/08/09 17:32:09 zlink [::1]:52052 - [::1]:23386: recvPkt: EOF
          runTestSuite: E0809 17:32:09.376800   38082 wcfs.go:2574] zwatch zeo://localhost:23386: zlink [::1]:52052 - [::1]:23386: recvPkt: EOF
          runTestSuite: E0809 17:32:09.377431   38082 wcfs.go:2575] zwatcher failed -> switching filesystem to EIO mode (TODO)
          <LONG WAIT>
          runTestSuite: PROCESS TOO LONG OR DEAD, GOING TO BE TERMINATED
      
      -> Fix it.
      
      /reviewed-by @jerome
      /reviewed-on nexedi/nxdtest!9
      0ad45a9c
  8. 01 Dec, 2020 1 commit
  9. 26 Nov, 2020 1 commit
    • Kirill Smelkov's avatar
      Switch tee from threading.Thread to sync.WorkGroup · 1e6a1cc6
      Kirill Smelkov authored
      The reason is that with threading.Thread if exception happens in that
      spawned thread, this error is not propagated to main driver, while with
      sync.WorkGroup an exception from any spawned worker is propagated back
      to main. For example with the following injected error
      
          --- a/nxdtest/__init__.py
          +++ b/nxdtest/__init__.py
          @@ -267,6 +267,7 @@ def main():
      
           # tee, similar to tee(1) utility, copies data from fin to fout appending them to buf.
           def tee(ctx, fin, fout, buf):
          +    1/0
               while 1:
      
      before this patch nxdtest behaves like ...
      
          (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ nxdtest
          date:   Tue, 24 Nov 2020 14:55:08 MSK
          xnode:  kirr@deco.navytux.spb.ru
          uname:  Linux deco 5.9.0-2-amd64 #1 SMP Debian 5.9.6-1 (2020-11-08) x86_64
          cpu:    Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
      
          >>> pytest
          $ python -m pytest
          Exception in thread Thread-2:
          Traceback (most recent call last):
            File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
              self.run()
            File "/usr/lib/python2.7/threading.py", line 754, in run
              self.__target(*self.__args, **self.__kwargs)
            File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 270, in tee
              1/0
          ZeroDivisionError: integer division or modulo by zero
      
          Exception in thread Thread-1:
          Traceback (most recent call last):
            File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
              self.run()
            File "/usr/lib/python2.7/threading.py", line 754, in run
              self.__target(*self.__args, **self.__kwargs)
            File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 270, in tee
              1/0
          ZeroDivisionError: integer division or modulo by zero
      
          error   pytest  0.583s  # 1t 1e 0f 0s
          (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ echo $?
          0
      
      Here the error in another thread is only printed, but nxdtest is not aborted.
      Above it reported "error", but e.g. when testing pygolang/py3 and raising an
      error in tee it even reported it was succeeding
      ( !6 (comment 121393) ):
      
          slapuser34@vifibcloud-rapidspace-hosting-007:~/srv/runner/instance/slappart0$ ./bin/runTestSuite
          date:   Tue, 24 Nov 2020 12:51:23 MSK
          xnode:  slapuser34@vifibcloud-rapidspace-hosting-007
          uname:  Linux vifibcloud-rapidspace-hosting-007 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64
          cpu:    Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz
      
          >>> thread
          $ python -m pytest
          Exception in thread Thread-1:
          Traceback (most recent call last):
            File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 926, in _bootstrap_inner
              self.run()
            File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 870, in run
              self._target(*self._args, **self._kwargs)
            File "/srv/slapgrid/slappart34/srv/runner/software/44fe7dd3f13ecd100894c6368a35c055/parts/nxdtest/nxdtest/__init__.py", line 268, in tee
              fout.write(data)
          TypeError: write() argument must be str, not bytes
      
          ok      thread  9.145s  # 1t 0e 0f 0s
      
          >>> gevent
          $ gpython -m pytest
          Exception in thread Thread-3:
          Traceback (most recent call last):
            File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 926, in _bootstrap_inner
              self.run()
            File "/srv/slapgrid/slappart34/srv/runner/shared/python3/5497998c60d97cbbf748337ccce21db2/lib/python3.7/threading.py", line 870, in run
              self._target(*self._args, **self._kwargs)
            File "/srv/slapgrid/slappart34/srv/runner/software/44fe7dd3f13ecd100894c6368a35c055/parts/nxdtest/nxdtest/__init__.py", line 268, in tee
              fout.write(data)
          TypeError: write() argument must be str, not bytes
      
          ok      gevent  21.980s # 1t 0e 0f 0s
      
      After this patch nxdtest correctly handles and propagates an error originating
      in spawned thread back to main driver:
      
          (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ nxdtest
          date:   Tue, 24 Nov 2020 14:54:19 MSK
          xnode:  kirr@deco.navytux.spb.ru
          uname:  Linux deco 5.9.0-2-amd64 #1 SMP Debian 5.9.6-1 (2020-11-08) x86_64
          cpu:    Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
      
          >>> pytest
          $ python -m pytest
          Traceback (most recent call last):
            File "/home/kirr/src/wendelin/venv/z4-dev/bin/nxdtest", line 11, in <module>
              load_entry_point('nxdtest', 'console_scripts', 'nxdtest')()
            File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 230, in main
              wg.wait()
            File "golang/_sync.pyx", line 237, in golang._sync.PyWorkGroup.wait
              pyerr_reraise(pyerr)
            File "golang/_sync.pyx", line 217, in golang._sync.PyWorkGroup.go.pyrunf
              f(pywg._pyctx, *argv, **kw)
            File "/home/kirr/src/wendelin/nxdtest/nxdtest/__init__.py", line 270, in tee
              1/0
          ZeroDivisionError: integer division or modulo by zero
          (neo) (z4-dev) (g.env) kirr@deco:~/src/wendelin/nxdtest$ echo $?
          1
      
      NOTE sync.WorkGroup requires every worker to handle context cancellation, so
      that whenever there is an error, all other workers are canceled. We add such
      cancellation handling to tee but only lightly: before going to block in
      read/write syscalls we check for whether ctx is canceled or not. However the
      proper handling would be to switch file descriptors into non-block mode and to
      select at every IO point on both potential IO events and potential
      cancellation. This is left as TODO for the future.
      
      /reviewed-on !7
      1e6a1cc6
  10. 24 Nov, 2020 5 commits
    • Jérome Perrin's avatar
      Unittest and Python3 support · 40e2c4ab
      Jérome Perrin authored
      These are the necessary changes to run `SlapOS.Eggs.UnitTest-*` and `SlapOS.SoftwareReleases.IntegrationTest-*` using nxdtest
      
      See merge request nexedi/nxdtest!6
      
      /reviewed-by @kirr
      40e2c4ab
    • Jérome Perrin's avatar
      Flush output right after printing running test name · a129b560
      Jérome Perrin authored
      If test program output on stderr (which is unbuffered), the
      output of the test program will appear before output from nxdtest
      advertising the program that is about to be executed, because nxdtest
      stdout is buffered (testnode does not set PYTHONUNBUFFERED, and eventhough
      nxdtest sets PYTHONUNBUFFERED in its own environ, this only applies to sub
      processes)
      a129b560
    • Jérome Perrin's avatar
    • Jérome Perrin's avatar
      Also pass stderr output to summary method · 53064e71
      Jérome Perrin authored
      While pytest sends everything in stdout, some other programs send on stderr.
      53064e71
    • Jérome Perrin's avatar
      Treat program output as binary for python3 support · beb9d47e
      Jérome Perrin authored
      While treating output as text would not really be impossible, treating it
      as bytes seems a better choice because:
       - we don't have to make assumptions about what output encoding the test
         program is using for output
       - `tee` can just read stream output bytes by bytes without having to worry
         about multi-bytes characters
       - testnode protocol uses xmlrpc.client.Binary, which uses bytes.
      
      Because using bufsize=1 implies reading subprocess output as text, we use
      bufsize=0 instead in the subprocess.Popen call, to prevent buffering.
      
      To make manipulation of strings and bytes easier, we add a dependency on
      pygolang, so that we can use its strings utility functions.
      
      Also add a few tests to verify general functionality.
      beb9d47e
  11. 09 Nov, 2020 1 commit
  12. 02 Nov, 2020 1 commit
  13. 30 Oct, 2020 2 commits
  14. 28 Oct, 2020 1 commit
  15. 20 Oct, 2020 1 commit
    • Kirill Smelkov's avatar
      Modularize · f74005e4
      Kirill Smelkov authored
      Create real nxdtest module, so that things could be imported from there.
      As the result switch `nxdtest` program from scripts to entry_point.
      f74005e4
  16. 08 Oct, 2020 1 commit
  17. 07 Oct, 2020 1 commit
  18. 05 Oct, 2020 1 commit
    • Kirill Smelkov's avatar
      --list and --run · 6991af9b
      Kirill Smelkov authored
      - To list which tests are in there,
      - To execute only selected tests.
      
      Apply only to local mode.
      Very handy for debugging.
      6991af9b