runner: if no "slapos-software" is defined in parameters, the auto build&run feature shouldn't work
-
Developer
Hello, I'm getting errors in tests because of this too (reverting this commit brings an error later on too)... I'm currently investigating.
-
Developer
Hi @kirr ,
Could you explain me more about what broke in your setup ?
According to the instance-runner-input-schema.json, auto-deploy and autorun have some influence only if slapos-software is defined. The purpose of these parameters was to order a webrunner & deploy a SR with only one request. I commited this change to conform with both of it, and also because it was causing the runner-import to build continuously the SR.
By knowing what is broken in your use case, I can provide a fix which fits the most our needs.
-
Owner
Hi @Nicolas, thanks for feedback. Since there is no autorestart functionality in plain slapos (supervisord can do it, but it has to be configured for it and slapos currently hardcodes autorestart=false in generated supervisor configuration for instances), we previously used autorun=true as a kind of substitution for it.
I understand schema description states it is only active when
slapos-software
is set, but in practice it used to work unconditionally.We can be ok with the change as long as there remains one or another practical way to have autorestarting watchdog for services under webrunner for all kind of software-releases - not only setup via
slapos-software
. -
Developer
Hi,
For the moment I will revert this commit, as the auto-restart feature seems to be needed but missing for the moment in the webrunner (the question of auto-restart processes came more than once to me on last week).
Therefore, the auto-restart feature should be implemented (I hope I can find some time for it), because using the autorun feature has 2 mains drawback in my opinion :
- it is run by a cron job every 2 minutes, which means if a process is down, we have to wait up to 2 minutes before getting the service up again. 2 minutes seems reasonable for waiting some service to get up again, but for a complex SR, it induces an important load on the computer (see next).
- autorun (and auto-deploy) restart processes by running a full processing of all the slapparts. It's an heavy process, and may produce unexpected results (if some manual change in the SR fail to compile, then it can end by more processes crashing, or some processes may be restarted/reloaded at every processing).
After talking with @rafael , we agreed on this :
- the auto-deploy feature is badly designed : autorun and auto-deploy features shouldn't have effect after the SR and the instances are deployed. Thus it shouldn't be handled by buildout (also for several other reasons, including the bug "slapos.git repo is sometimes unconditionnaly rewritten and changes are lost"). A slaprunner-bootstrapper should be written to get a more predictable result when bootstrapping a SR in a webrunner
- a clean auto-restart feature will be implemented, likely based on implementing watchdogs and the bang method within slapproxy.
-
Owner
hi @Nicolas, thanks for feedback; the plan looks good.
- a clean auto-restart feature will be implemented, likely based on implementing watchdogs and the bang method within slapproxy.
I suggest to simply use autorestart functionality already provided by supervisord. From https://github.com/Supervisor/supervisor/blob/master/docs/configuration.rst:
---- 8< ----
autorestart
Specifies if :program:
supervisord
should automatically restart a process if it exits when it is in theRUNNING
state. May be one offalse
,unexpected
, ortrue
. Iffalse
, the process will not be autorestarted. Ifunexpected
, the process will be restarted when the program exits with an exit code that is not one of the exit codes associated with this process' configuration (seeexitcodes
). Iftrue
, the process will be unconditionally restarted when it exits, without regard to its exit code. ...
---- 8< ----and btw supervisor events can be tracked and unexpected restarts counted and reported to monitoring, or via sms or telegram (or whatever) :
---- 8< ---- (https://github.com/Supervisor/supervisor/blob/master/docs/plugins.rst)
supervisor-alert
Send event notifications over Telegram or to an arbitrary command.
---- 8< ----and this will thus work not only webrunner but in general in slapos.
I mean the functionality is there, we don't need to reinvent the wheel - only use it.
-
Owner
@kiril This was already explained by JP few months ago to you. Please search his email and add on this thread please. (I don't want to start again for the 6th time the same discussion about supervisord restart features).
What monitors and notify things on VIFIB is promises, and only that.
-
Owner
@rafael thanks for reminding this and sorry for forgetting on my side - my bad (though with supervisor-alert, as I suggested, we can have the feedback to promises, master, etc). For the reference, I copy here relevant parts of our email conversations on this subject, as suggested by @rafael:
---- 8< ---- (@Tyagov, @kirr, @rafael and @jp says)
Yes, SlapOS currently just does not support configuring autorestart for services:
https://lab.nexedi.com/nexedi/slapos.core/blob/22beacb2/slapos/grid/templates/program_partition_supervisord.conf.in#L6
https://lab.nexedi.com/nexedi/slapos.core/blob/22beacb2/slapos/grid/SlapObject.py#L451The "autorestart" don't inform slapos master that a proccess was/is dead, neither restart the partition. So we can have infinity silent restarts and unchecked flacky problems. This would make master loose the control on what is working or not.
So the design is to let the watchdog "bang" the partition asking to buildout runs again.
Watchdog only bangs if we place the executable at etc/service I think (on-watch) as you may notice.
It looks like a missunderstanding of the design which is arround since day 1 of slapos.
We have autorestart=true for slapproxy under webrunner
https://lab.nexedi.com/nexedi/slapos/blob/ed3697d93/software/slaprunner/instance-runner.cfg#L680
https://lab.nexedi.com/nexedi/slapos/blob/ed3697d93/software/slaprunner/template/supervisord.conf.in#L54but it is only for slapproxy itself, not services running in it (= under webrunner). Again no autorestart feature supported for services.
There are two ways about this:
- either we teach slapos about services autorestarting, or
The restart is done by the re-run of buildout on the partition, re-run of buildout should be done by the call of bang, this is the design, w/o it we cannot monitor anything on slapos, as we loose control of what is been restarted.
If bang/re-run buildout don't work with slapproxy, slapproxy has to be fixed... and not he other way arround. It is required to update slapproxy code to behave like slapos master (ie.: handle bang, don't run the instances all the time) and slaprunner has to be improved to support "production", which means it has to behave like a computer (run slapos node instance command periodically, dont relaunch buildout all the time...)
I understand the value of restart a proccess, in speed terms it would make downtime from 1min to 1sec (we could eventually discuss restart control), however, autostart is a something forbidden.
- people continue to do with ad-hoc watchdogs like I did for gitlab
https://lab.nexedi.com/nexedi/slapos/blob/master/software/gitlab/watcher-sigkill.in
So, I understand why is writen, but this is a workarround and it should not reproduced. I would approciate an effort on fix slapproxy (or webrunner) instead propose workarrounds :). but I cannot prevent you to make dirty tricks instead fix the problems.
Rafael explanation is right. If one needs speed, then consider how to trigger run of buildout faster. (It would also accelerate initial deployments)
Jps.
Hi,
First, it would be good that this disucussion ends in a technical note since I explained it so many times already to one or the other, and sometimes more than once.
This way, we go faster next time.
Also, I am starting to forget and the risk of destroying the system by not understanding it is now increasing
So my question is if I can adjust bang call to happen every 1 minute ?
No. This is wrong. It would kill the system.
Bang is called whenever:
- called explicitely (by a promise for example, by a service itself)
- a process watched by the watchdog becomes in a state that is not the one it is supposed to be
In theory, buildout is run all the time, repeatedly and is supposed to have 0 execution time (theoretical model). But since that would take 100% of CPU, we have to call it less often. So, we find ways to call it less often.
buildout is called
- every X (this can be configured at the profile level)
- if promises are not all satisfied
- if requested services are not available
- as the result of bang
buildout is actually called by slapgrid. Slapgrid itself is called every Y (in theory, Y = 0, but in reality 1 minute). So, slapgrid is called:
- at lease every minute
- right after a slapgrid call if something happened in the previous call (ex. request of new service, failing promose) with an increasing delay to reduce CPU load
Please note that no slapproxy here and as this is an embedded device I prefer to not have it due to lack of resources and not needed complexity.
The solution is to ensure that slapgrid / slapos work as they are supposed to. In the case of your device, you need to re-run buldout after a bang that is trigerred by the "watchdog" finding out that the process dies (which itself is really bad, since processes are not supposed to die and restarting them automatically is a poor approach). I think that currently, bang has to go through the master. It is possible in future to consider a shortcut that does not go through the master. But I am not sure yet that it is a good idea. It is probably simpler and cleaner to run slapproxy locally if one needs full autonomy.
Regards,
JPS.
Noted here: https://nexedi.erp5.net/web_page_module/5506/