Commit 09252627 authored by iv's avatar iv

gitlab: start automation of gitlab-backup

  - git-clone and build git-backup sources
  - add a crontab doing `gitlab-backup pull` every 4 hours
  - add xnice to use fewer resources while backing up
  - add a lock to avoid concurrent calls of the backup script
parent 0abde9c5
@@ -30,6 +30,7 @@ parts =
service-redis
service-cron
cron-entry-gitlab-backup
on-reinstantiate
@@ -759,9 +760,7 @@ cron-entries = ${cron-dir:cron.d}
crontabs = ${cron-dir:crontabs}
cronstamps = ${cron-dir:cronstamps}
catcher = ${cron-simplelogger:wrapper}
dcrond-binary = {{ dcron_bin }}
  • @iv please avoid mixing unrelated style changes into the patch.

    Even if the adjustment is right, it should not be mixed in (git add -p is your friend). By the way, here the blank lines are used as delimiters between different sections:

    1. inside-instance setup
    2. software binary we got from software.cfg
    3. dependency

    So offhand I would leave it as is (but I'm certainly biased, since I'm the author of those lines).

    /cc @jerome to have a look at this patch as well.

depends =
${logrotate-entry-cron:recipe}
@@ -771,7 +770,6 @@ recipe = slapos.cookbook:simplelogger
wrapper = ${directory:bin}/${:_buildout_section_name_}
log = ${cron-dir:log}/cron.log
# base entry for clients who registers to cron
[cron-entry]
recipe = slapos.cookbook:cron.d
@@ -837,7 +835,11 @@ backup = ...
time = daily
command = ${logrotate:wrapper}
[cron-entry-gitlab-backup]
<= cron-entry
# run backup script every four hours
frequency = 0 */4 * * *
  • It would be better to get the backup frequency as an instance parameter, with the default being what you put directly here (a sketch is below).
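
    A hedged sketch of what that could look like (the backup_frequency context variable is hypothetical; it would be filled from an instance parameter, e.g. via the parameters passed to the jinja2 template, with the 4-hour value as default):

    [cron-entry-gitlab-backup]
    <= cron-entry
    # backup frequency comes from an instance parameter, default: every 4 hours
    frequency = {{ backup_frequency }}
    command = {{ backup_script }}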

command = {{ backup_script }}
# 6. on-reinstantiate actions
......
# GitLab "switch-softwaretype" instance
[buildout]
parts = switch-softwaretype
parts =
switch-softwaretype
# std stuff for slapos instance
eggs-directory = ${buildout:eggs-directory}
develop-eggs-directory = ${buildout:develop-eggs-directory}
offline = true
[switch-softwaretype]
recipe = slapos.cookbook:softwaretype
default = $${instance-gitlab.cfg:rendered}
# TODO -export, -import, -pull-backup
[instance-gitlab.cfg]
recipe = slapos.recipe.template:jinja2
mode = 0644
@@ -50,6 +49,9 @@ context =
raw ruby_location ${bundler-4gitlab:ruby-location}
raw watcher_sigkill ${watcher-sigkill:rendered}
# script
  • watcher_sigkill above is also a "script". Is there any reason we need to distinguish the backup-related one?

  • Maintainer

    Oops, I think I started a comment without finishing it...

raw backup_script $${backup_script.bash:rendered}
  • About naming: I would remove the _script suffix and use a more descriptive name for this program. One option could be gitlab-backup-pull. This would be in line with other in-instance programs like gitlab-rails & gitlab-rake.

    For the template, I suggest using the convention that its name is the name of what will be rendered plus a .in suffix.

# config files
raw database_yml_in ${database.yml.in:target}
raw gitconfig_in ${gitconfig.in:target}
@@ -64,3 +66,14 @@ context =
raw resque_yml_in ${resque.yml.in:target}
raw smtp_settings_rb_in ${smtp_settings.rb.in:target}
raw unicorn_rb_in ${unicorn.rb.in:target}
[backup_script.bash]
recipe = slapos.recipe.template:jinja2
mode = 0770
template = ${backup_script.bash.in:target}
rendered = $${buildout:directory}/backup_script.bash
context =
# XXX remove /usr/bin below? this is currently useful for command `which`
raw env_path /usr/bin:${coreutils:location}/bin:${grep:location}/bin:${tar:location}/bin:${gzip:location}/bin:$${buildout:directory}/bin:${gopath:bin-directory}:${git:location}/bin:${xnice-repository:location}/bin
raw backup_repo $${buildout:directory}/backup-gitlab.git
  • If I read correctly, the backup will end up in the slappartX/backup-gitlab.git folder. Usually it goes in the srv/backup subfolder, so storing this in slappartX/srv/backup/backup-gitlab.git would be more appropriate. There is a mkdir recipe you can use to make sure the directories are created (a sketch is below).
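
    A minimal sketch of that (section and key names hypothetical; the exact $${} / ${} escaping depends on which template this ends up in):

    [backup-directory]
    recipe = slapos.cookbook:mkdirectory
    srv = $${buildout:directory}/srv
    backup = $${:srv}/backup

    and then backup_repo would point to $${backup-directory:backup}/backup-gitlab.git.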

  • Maintainer

    @jerome: thanks. Now the backup repository is in ~/srv/runner/instance/slappart0/srv/backup/backup-gitlab.git; is that what you meant, or am I supposed to put it in ~/srv/runner/srv/backup/backup-gitlab.git?

  • Yes, this is what I meant. srv/backup/backup-gitlab.git relative to the partition root.

  • Maintainer

    @jerome OK, fine :)

raw bash_bin ${bash:location}/bin/bash
@@ -14,9 +14,12 @@ extends =
../../component/nginx/buildout.cfg
# for instance
../../component/coreutils/buildout.cfg
../../component/bash/buildout.cfg
../../component/grep/buildout.cfg
../../component/bzip2/buildout.cfg
../../component/curl/buildout.cfg
../../component/tar/buildout.cfg
  • I have not noticed any actual usage of tar or grep in the SR files, so this can be hard to understand for another person (and thus eventually broken unintentionally).

    I see those programs are needed for the actual gitlab-backup run, so IMHO it is better to mention that fact in the commit message; then, when someone wonders why we added them, they can find out.

  • Maintainer

    Indeed, this is needed for gitlab-backup, as we have to set up the PATH environment variable in the cron script. I'll add a line to the commit message.

../../component/gzip/buildout.cfg
../../component/dcron/buildout.cfg
../../component/logrotate/buildout.cfg
@@ -36,6 +39,7 @@ parts =
gitlab-shell/vendor
gitlab/vendor/bundle
gitlab-workhorse
git-backup
# for instance
instance.cfg
@@ -44,13 +48,15 @@ parts =
eggs
bash
coreutils
curl
watcher-sigkill
grep
gzip
tar
  • No need to explicitly list grep and tar in parts: since they are referenced as ${grep:location}, this will automatically trigger their installation.

watcher-sigkill
dcron-output
logrotate
############################
# Software compilation #
############################
@@ -65,7 +71,6 @@ recipe = z3c.recipe.scripts
interpreter = python2
eggs =
docutils
# rubygemsrecipe with fixed url and this way pinned rubygems version
[rubygemsrecipe]
recipe = rubygemsrecipe
@@ -112,7 +117,7 @@ git-executable = ${git:location}/bin/git
[gitlab-repository]
<= git-repository
#repository = https://gitlab.com/gitlab-org/gitlab-ce.git
repository = https://lab.nexedi.com/nexedi/gitlab-ce.git
  • Maintainer

    committed by mistake

  • N/P, I was reading it as temporary anyway. Though please be careful while preparing the clean patch; once again, git add -p is your friend.

repository = https://lab.nexedi.com/iv/gitlab-ce.git
# 8.8.X + NXD patches:
revision = v8.8.9-10-g967afbdc3a2d2bcc58ed31ca50d5293fa2a2e9ed
location = ${buildout:parts-directory}/gitlab
@@ -135,8 +140,6 @@ repository = https://lab.nexedi.com/nexedi/gitlab-workhorse.git
revision = v0.7.1-5-gd23a3247829fc3200e3dc784dcd57b5a0febac48
location = ${buildout:parts-directory}/gitlab-workhorse
# build needed-by-gitlab gems via bundler
[gitlab/vendor/bundle]
recipe = slapos.recipe.cmmi
@@ -151,6 +154,51 @@ make-binary =
make-targets= cd ${:path} &&
${:bundle} install --deployment --without development test mysql kerberos
# directories and repositories required by gitlab-backup
[gopath]
directory = ${buildout:directory}/work
  • We can conventionally name the go workspace .../go (or, if that name is already taken, .../go.work).

  • Maintainer

    I find go.work more explicit. Btw, in the Go docs they use gocode, or in another place just work.

    Edited by iv
  • Whatever it is, just keep it unambiguous. go.work seems to be an OK choice IMHO.

src-directory = ${:directory}/src
bin-directory = ${:directory}/bin
[git2go-repository]
<= git-repository
git2go-path = github.com/libgit2/git2go
  • If you change the naming for this from git2go-path to e.g. go.importpath, you can:

    factor out the support for go git repositories and then use it this way:

    [git2go-repository]
    <= go-git-repository
    go.importpath = github.com/libgit2/git2go
    revision = ...
    
    [git-backup-repository]
    <= go-git-repository
    go.importpath = lab.nexedi.com/kirr/git-backup
    revision = ...
repository = https://${:git2go-path}.git
# branch 'next' is required by git-backup
revision = next-g53594d7581617dbae7bb5960b4ac5f0ff513c184
location = ${gopath:src-directory}/${:git2go-path}
[git-backup-repository]
<= git-repository
git-backup-path = lab.nexedi.com/kirr/git-backup
repository = https://${:git-backup-path}.git
revision = 3ba6cf73ba224c40f67f1fb87c855b915eb91f58
location = ${gopath:src-directory}/${:git-backup-path}
[git-backup]
recipe = slapos.recipe.cmmi
path = ${git-backup-repository:location}
configure-command = :
make-binary =
  • What is the meaning of configure-command = :? Also, unless this section is overriding another of the same name, it seems cleaner to me to remove empty parameters.

  • I will try to answer, since I'm the author of this pattern in the gitlab SR:

    ":" is the same as "true" in shell. The line above means there is no configure step, and tells slapos.recipe.cmmi so by providing a configure command that does nothing and always succeeds. make-binary is likewise abused by being set empty, so that the whole executed command is what is given below in make-targets.

    It's a kind of hack to use slapos.recipe.cmmi for builds that do not use make, but where slapos.recipe.cmmi's machinery to download/extract/build is handy.
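
    For reference, a generic sketch of the pattern (section names and build command are hypothetical):

    [some-non-make-build]
    recipe = slapos.recipe.cmmi
    path = ${some-repository:location}
    # no configure step: ":" does nothing and always succeeds
    configure-command = :
    # empty make-binary: the whole build command lives in make-targets
    make-binary =
    make-targets = cd ${:path} && ./build.sh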

  • Oh, after grepping I see that this approach of tweaking make-binary to empty and configure-command to something is quite widespread in slapos.

  • Thank you for the tip!

make-targets= cd ${git2go-repository:location}
&& git submodule update --init
  • I understand we do not have submodule support in slapos.recipe.build:gitclone, so manually fetching them is ok for now.

    /cc @jm

  • Yes, it would be better, but I'm not interested in doing it if there's no other use in SlapOS. Let's wait and see.

    Edited by Julien Muchembled
&& make install
&& cd ${git-backup-repository:location}
&& install -d ${gopath:bin-directory}
&& go build -o ${gopath:bin-directory}/git-backup .
  • go build is more for manual checking / building in place. When we have the $GOPATH environment set up, we can just do go install lab.nexedi.com/kirr/git-backup without even going into the git-backup directory. That's all. (A sketch of the resulting make-targets is below.)
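
    A sketch of how the make-targets above could then be simplified (same sections as in the patch; untested):

    make-targets= cd ${git2go-repository:location}
        && git submodule update --init
        && make install
        && go install lab.nexedi.com/kirr/git-backup
        && install -d ${gopath:bin-directory}
        && cp ${git-backup-repository:location}/contrib/gitlab-backup ${gopath:bin-directory}
        && chmod +x ${gopath:bin-directory}/gitlab-backup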

&& cp ${git-backup-repository:location}/contrib/gitlab-backup ${gopath:bin-directory}
&& chmod +x ${gopath:bin-directory}/gitlab-backup
environment =
PATH=${cmake:location}/bin:${pkgconfig:location}/bin:${git:location}/bin:${golang16:location}/bin:${buildout:bin-directory}:%(PATH)s
GOPATH=${gopath:directory}
[xnice-repository]
# to get kirr's misc repo containing xnice script for executing processes
# with lower priority (used for backup script inside the cron)
  • Thinking out loud: maybe we should just make xnice generally available for all slapos instances. /cc @rafael

  • A dependency on https://lab.nexedi.com/kirr/misc.git does not seem like a good idea.

  • This script is very small. We can probably inline it somewhere in the slapos stack so that it can be used by other backup / monitoring operations.

  • This script is very small. We can probably inline it somewhere in the slapos stack so that it can be used by other backup / monitoring operations.

    I agree with @jerome. /cc people who work on resiliency backup: @Nicolas, @alain.takoudjou, @rafael.
    Do we maybe already have an example of how to include such utilities in a stack?

  • Hi,

    xnice seems to be used in only one place, backup_script.bash.in, so maybe it could be integrated as a function in this script? (A rough sketch of such a function is below.) Templating / inlining it doesn't seem like a bad idea, as this script is so simple that it won't get major bug fixes / improvements.

    One thought: xnice's description is "run something at lowest possible priority", so are you sure that, despite this, the backup will finish in less than 4 hours? In stack/resiliency the backup frequency is given through a slap parameter, so it is very easy to change the frequency on the fly, depending on the criticality or size of the project, without having to wait for the publication of a new version of the SR.
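
    For reference, a rough sketch of such an inline function (the actual xnice script in kirr/misc may differ):

    # run the given command at low CPU and I/O priority
    xnice() {
        nice -n 19 ionice -c 3 "$@"
    }

    xnice gitlab-backup pull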

  • @Nicolas, thanks for the feedback. It was already noted that the frequency has to be moved to an instance parameter. There is no technical problem with inlining xnice into the script, but IMHO it would be better if it could be made available to everyone, not only in the gitlab SR. The reason to run the backup at the lowest possible priority is to keep interference with the actual service as low as possible. And this problem (interference from backups) is not only about git - we are seeing it in customer projects too.

    And if a backup does not finish by the time the next one has to run, the logic here is that the next backup tries to run, sees its predecessor is not yet finished, and does nothing. This way they do not overlap. Btw, if I recall correctly there is a mechanism in slapos to make sure something runs only once, but I do not remember its location/name offhand. I know it is used in the resiliency stack, so any hint maybe?

  • Maintainer

    @kirr: I think it's equeue...

  • We can probably inline it somewhere in the slapos stack so that it can be used by other backup / monitoring operations. I can't provide any example of such a thing. Personally, I would go in the direction of creating a component that populates the bin directory of the SR. But someone with more experience than me may have a better idea.

    Btw, if I recall correctly there is a mechanism in slapos to make sure something runs only once, but I do not remember its location/name offhand. I know it is used in the resiliency stack, so any hint maybe?

    You must be referring to the equeue service. In the resilient stack, scripts to run are registered via the pubsub server, which sends a list of callbacks to the equeue process on an HTTP notification. For information on how to deploy it, you may have a look at the slapos source code in the resilient stack.

  • @iv, @Nicolas, thanks for the feedback. I was aware of the execution queue and pubsub service, and that registering something twice for execution there results in it being executed only once. But isn't there also a way for a program to just take a lock against a second instance of itself? In the resilient stack, bin/exporter is run, IIRC, not via equeue, and once again, if I recall correctly, it has that protection. Maybe I'm missing something though.

    P.S. On a customer project we got the pubsub service stuck many times. @Nicolas, I recall I conveyed my debugging notes on this topic to you.

  • OK, I've just checked. Here is the top of bin/exporter in one of my resilient webrunners:

    # Check for other instances
    pidfile=/srv/slapgrid/slappart16/var/pid/exporter.pid
    if [ -s $pidfile ]; then
      if pid=`pgrep -F $pidfile -f "$COMMAND" 2>/dev/null`; then
        echo "Already running with pid $pid."
        exit 1
      fi
    fi
    echo $$ > $pidfile
    
    ...
    Edited by Kirill Smelkov
  • Not sure this is completely 100% race-free. But we can use the wrapper recipe anyway, and if it is racy, just fix it inside the recipe.

  • Oh, I see I already touched that piece of code, without looking at the purpose of this option:

    if pidfile parameter is specified, then it will make the wrapper a singleton, accepting to run only if no other instance is running.

    OK, so using a pidfile can't be race-free and this has to be reviewed completely. I don't think it's possible in shell.

  • P.S. On a customer project we got the pubsub service stuck many times. @Nicolas, I recall I conveyed my debugging notes on this topic to you.

    About this, I've recently updated the pull and push scripts run in the PBS instance. The pull/push scripts are 2 notifier callbacks run by the pubsub server, which run some shell scripts, pull_raw and push_raw (and also write feeds, notify, etc.). Both the push_raw and pull_raw scripts contained infinite loops, which I removed in a recent commit. Now the notifier scripts can run the raw scripts a given number of times before declaring and reporting a real failure. It should fix the "stuck pubsub" issue. If you want to know more about it, please do not hesitate to mail/text me.

  • OK, so using a pidfile can't be race-free and this has to be reviewed completely. I don't think it's possible in shell.

    It is possible to do atomic operations in shell, at least this way: http://stackoverflow.com/questions/13828544/atomic-create-file-if-not-exists-from-bash-script

    (either via mkdir or >file with -o noclobber)
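
    For example (lock paths hypothetical):

    # a directory: mkdir fails if it already exists
    mkdir /path/to/lockdir || exit 1

    # a file: with noclobber, ">" fails if the file already exists
    ( set -o noclobber; echo "$$" > /path/to/lockfile ) || exit 1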


    @Nicolas, thanks for the heads-up. I've seen your recent commits on this topic, and IMHO it is right to remove those loops from the actual scripts and handle retries from outside; thanks for working on this. For now it is enough for me to see your cover description here and the commit messages themselves. I just want to point out, once again, that in my problematic cases the thing that got stuck was the pubsub server itself. Once again, thanks for the feedback and for offering help.

  • @kirr No. As I wrote above, it does not work if the script is SIGKILL'ed, since the script has no opportunity to remove it. One requirement is to use a kernel object that is automatically freed when the process ends.

  • I think:

    • if the process is killed, we probably want the lock to stay there, because internal data is possibly in an inconsistent state and maybe it is better to investigate. We don't want such a kill to happen unnoticed, right?
    • it is still possible to automatically clean up after a killed process with the help of an outside watchdog. Ideally such a watchdog, upon seeing a process SIGKILL'ed or something similar, should: a) generate an RSS alarm, and b) do the cleanup. The cleanup can be very trivial if we stick to the convention that the process state (at least what needs to be cleaned up) is inside one directory.

    An example of a SIGKILL watchdog: https://lab.nexedi.com/nexedi/slapos/blob/fb8912e8/software/gitlab/watcher-sigkill.in . In gitlab it is used because one of the subservices (sidekiq) is known to leak memory and for this reason SIGKILLs itself periodically after some RSS margin. There the watchdog only does restarting, but we can reuse the pattern and add the actions we need.

    Another possibility: even if the file is not removed, we can stick to the convention that a lock is not only created, but is also kept open by the running process. Then if the process is SIGKILL'ed the file will stay, but it won't be open anymore. When another process starts and sees the lock exists, it can check with lsof, or by directly inspecting /proc/<pid>/fd, whether the lock file is still open by a process or not. If not, it is safe to remove it and proceed.

    Another possibility: flock(1) - "manage locks from shell scripts": https://linux.die.net/man/1/flock. There it is just flock lockfile command-to-run ..., so essentially flock becomes such a watchdog (a sketch follows below).

    I mean: it should all be possible, we just need to pay careful attention and do it.
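
    For reference, a minimal sketch of the flock(1) approach (lock file path hypothetical):

    # flock keeps the lock for the duration of the command; the kernel releases it
    # automatically when the process exits, even on SIGKILL
    flock --nonblock /path/to/gitlab-backup.lock xnice gitlab-backup pull || exit 1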

  • if the process is killed, we probably want the lock to stay there, because internal data is possibly in an inconsistent state and maybe it is better to investigate. We don't want such a kill to happen unnoticed, right?

    No. It's better if your script detects that some manual intervention is required and only notifies at that moment.

    it is still possible to automatically clean up after a killed process with the help of an outside watchdog

    When we start putting logic outside, better to do it completely, and well, there's equeue (maybe overkill here, but I still prefer something that everyone can use reliably) (assuming also that equeue is fixed if it has bugs).

    I won't comment on the rest, because it's clear that any code using files for that becomes complicated and error-prone. More generally, most code using temporary resources that may outlive the process (temporary files, subprocesses) is buggy, because people never think about race conditions and exceptions. That leads to a lot of maintenance to clean up machines.

  • Maintainer

    @kirr: after discussion with @Nicolas, in the case of the equeue, it seems necessary to do something like this:

                      notifier (does                                     another notifier
    [ cron ] ------------------------------------> [ pubsub server ] -------------------------> [ equeue ]
               nothing but transmits callback)                         (transmitted callback)

    So if we want to use existing code from the PBS stack: the cron will call a notifier (which is a script wrapper) with a callback for the pubsub server, which will add the gitlab-backup-pull script to the equeue.

    Otherwise, it is possible to simplify the scheme above by creating a script that talks to the equeue directly, but:

    • having the pubsub server is nice if we want to tell another gitlab instance that there is new data and it should start pulling,
    • in case of changes in the equeue, the script will have to be rewritten.
    Edited by iv
  • @iv, for now I prefer that we simply use the wrapper recipe, with the pidfile parameter used for singletoning and an explicit comment about why and how, even if it is not completely race-free (a rough sketch is below).
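
    A rough sketch of that (assuming slapos.cookbook:wrapper; the exact option names, in particular pidfile, should be checked against the recipe; section name and run directory are hypothetical):

    [gitlab-backup-pull]
    recipe = slapos.cookbook:wrapper
    wrapper-path = ${directory:bin}/gitlab-backup-pull
    command-line = {{ backup_script }}
    # singleton: refuse to start if another instance still holds the pidfile
    # (not completely race-free, as discussed above)
    pidfile = ${directory:run}/gitlab-backup-pull.pid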

<= git-repository
repository = https://lab.nexedi.com/kirr/misc.git
revision = 4073572ea700bf1b115f3a135aebebe5b3b824e4
location = ${buildout:parts-directory}/misc
# build needed-by-gitlab-shell gems via bundler
# ( there is not vendor/ dir in gitlab-shell, so to avoid having buildout error
@@ -177,7 +225,6 @@ make-targets= ${:_buildout_section_name_}
environment =
PATH=${golang16:location}/bin:%(PATH)s
###############################
# Trampoline for instance #
###############################
@@ -194,7 +241,7 @@ eggs =
recipe = slapos.recipe.template
url = ${:_profile_base_location_}/instance.cfg.in
output = ${buildout:directory}/instance.cfg
md5sum = ef85f02c4f6070c586d773b859a2f4e2
md5sum = 11d8a2628dd170169a334877968e0bcc
[watcher-sigkill]
recipe = slapos.recipe.template:jinja2
@@ -222,6 +269,10 @@ destination = ${buildout:directory}/${:_buildout_section_name_}
url = ${:_profile_base_location_}/template/${:_buildout_section_name_}
[backup_script.bash.in]
<= download-template
md5sum = 272b05a7694b950cf3efa64ea14f8bbb
[database.yml.in]
<= download-template
md5sum = 61d1d04b9347b3168a1ad7676e4681ef
@@ -248,7 +299,7 @@ md5sum = 176939a6428a7aca4767a36421b0af2b
[instance-gitlab.cfg.in]
<= download-file
md5sum = 89914e4a225f6cdebfa196d46359f6f2
md5sum = bf3b277dcf32adb08fbb1deb18ab3cfd
[macrolib.cfg.in]
<= download-file
......
#! {{ bash_bin }}
export PATH="{{ env_path }}"
mkdir -p "{{ backup_repo }}"
cd "{{ backup_repo }}"
# calling_backup exists if backup script is currently executed
if [ -d calling_backup ] ; then
exit 1
fi
# create the lock to avoid concurrent calls
mkdir calling_backup
  • Please see how "run something only once" is handled in slapos in other places. I know the resiliency stack should be doing it.

  • Wrong reasoning. If the mkdir didn't fail when the directory already exists, you'd have a race condition. So remove the check above.

    The script would also leave the lock behind if any command failed afterwards. Using trap 0 to remove it is better:

    trap "rmdir calling_backup 2>/dev/null" EXIT
    mkdir calling_backup

    But it's still not enough if the script is killed with SIGKILL, or if the machine crashes. An abstract Unix socket is ideal for this kind of thing.

    If the trap way is enough for you, please use /bin/sh instead of bash, and another common good practice is to start with set -e.

  • If the mkdir didn't fail when the directory already exists, you'd have a race condition.

    Oops, I always assume the script runs with set -e. Either use it, or do mkdir calling_backup || exit

  • I agree with including the set -e directive. Adding it everywhere in stack/resilient let us find numerous bugs, and it also raises alerts when backups fail.

if ! git rev-parse --is-inside-git-dir ; then
git init --bare
fi
xnice gitlab-backup pull
# done, remove lock
rmdir calling_backup
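
For reference, a sketch of the script with the set -e / atomic mkdir / trap suggestions from the thread above folded in (same template variables as the original; the SIGKILL and machine-crash caveats discussed above still apply):

    #! {{ bash_bin }}
    set -e
    export PATH="{{ env_path }}"

    mkdir -p "{{ backup_repo }}"
    cd "{{ backup_repo }}"

    # take the lock atomically: mkdir fails if another run is still in progress
    mkdir calling_backup || exit 1
    # remove our own lock on any exit; the trap is installed only after we own the
    # lock, so a failed mkdir above does not remove somebody else's lock
    trap 'rmdir calling_backup 2>/dev/null' EXIT

    if ! git rev-parse --is-inside-git-dir ; then
        git init --bare
    fi
    xnice gitlab-backup pull
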
  • @iv, I think the patch should be more or less OK to go. My comments above are mostly about cosmetics / structure etc., and it seems like it should generally work. Maybe I've missed something, but IMHO after fixing the styling & other minor nits we can give the patch a try by putting it onto lab.nexedi.com.

    While you do the fixups, let's also give others a chance to speak up.

  • Maintainer

    OK, thanks for the review :)
