GitLab/Failover: Difference between revisions

imported>Jelto
No edit summary
imported>AOkoth
No edit summary

Revision as of 19:03, 4 August 2022

GitLab has an active host and one or more replicas. The replicas are currently cold standbys, meaning they don't serve any production traffic and hold data that is up to 24 hours old. For maintenance or in an emergency, it is possible to fail over from the active host to a replica. This page describes the process broadly.

The process takes around 1 to 1.5 hours, depending on backup size. GitLab is unavailable during that time.

Prerequisites

The host to failover to should be a proper GitLab replica, meaning:

  • has a second IPv4 and IPv6 address configured as profile::gitlab::service_ip_v4 and profile::gitlab::service_ip_v6
  • is running the puppet role(gitlab)
  • has enough disk space
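
The "enough disk space" requirement can be checked before starting. The following is a minimal sketch, not part of the Wikimedia tooling: the function names `free_kb_on` and `has_enough_space` and the 20% safety margin are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the existing GitLab tooling.

# free_kb_on DIR: print the free space in KiB on the filesystem holding DIR.
free_kb_on() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# has_enough_space FREE_KB NEEDED_KB [MARGIN_PERCENT]
# Succeeds if FREE_KB covers NEEDED_KB plus a safety margin
# (default 20%, an arbitrary illustrative value).
has_enough_space() {
    local free_kb=$1 needed_kb=$2 margin=${3:-20}
    local required_kb=$(( needed_kb + needed_kb * margin / 100 ))
    [ "$free_kb" -ge "$required_kb" ]
}

# Example (commented out): take the backup size measured on the old host
# and compare it with the free space on the replica.
# backup_kb=$(du -sk /srv/gitlab-backup/latest | cut -f1)
# has_enough_space "$(free_kb_on /srv/gitlab-backup)" "$backup_kb" \
#     || echo "not enough space for the backup" >&2
```

Note that a restore needs room for both the backup archive and the restored data, so the margin should be chosen accordingly.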

Planned Failover

A planned failover means the old production instance is responding and working properly, so a fresh backup can be created and no data is lost. The following steps are needed to fail over to a new host.

Before failover

  • copy ssh host keys for /etc/ssh-gitlab daemon from old host to new host
    • this can be done from a cumin host using (source is the old host, destination is the new host):
      sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock scp -3 <OLD_HOST>.wikimedia.org:/etc/ssh-gitlab/ <NEW_HOST>.wikimedia.org:/etc/ssh-gitlab/
      
  • apply gitlab-settings to new host (done for all replicas)
  • lower TTL for gitlab.wikimedia.org (example change 802090)
  • announce downtime a few days ahead on engineering-all and #wikimedia-gitlab
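
After copying the ssh host keys, it can be useful to confirm that both hosts hold identical files. The following is a minimal sketch under stated assumptions: `keys_checksum` is a hypothetical helper (not existing tooling) that prints one combined SHA-256 checksum for all files in a directory, so the value can be compared between hosts.

```shell
#!/bin/bash
# Hypothetical helper, not part of the existing GitLab tooling.

# keys_checksum DIR: print a single checksum covering the names and
# contents of all regular files below DIR (sorted for a stable result).
keys_checksum() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) \
        | sha256sum | cut -d' ' -f1
}

# Example (commented out): run on the old and the new host and compare
# the two printed values by eye or in a script.
# sudo bash -c "$(declare -f keys_checksum); keys_checksum /etc/ssh-gitlab"
```

If the two values differ, clients would see a changed host key after the failover, so the copy should be repeated before continuing.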

During failover

  • pause all GitLab Runners
  • stop puppet on old host with sudo disable-puppet "Failover in progress"
  • stop write access on nginx and ssh-gitlab on old host with gitlab-ctl stop nginx and systemctl stop ssh-gitlab
  • create full backup on old host:
/usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE="true" GITLAB_BACKUP_MAX_CONCURRENCY="4" GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY="1" && ls -t "/srv/gitlab-backup"/*gitlab_backup.tar | head -n1 | xargs -i cp {} "/srv/gitlab-backup"/latest/latest-data.tar
  • sync backup to new host: /usr/bin/rsync -avp /srv/gitlab-backup/latest/ rsync://<NEW_HOST>.wikimedia.org/data-backup
  • configure new host with profile::gitlab::service_name: 'gitlab.wikimedia.org' (example change 802150)
  • configure new host in profile::gitlab::active_host (example change 802150)
  • trigger restore on new host /srv/gitlab-backup/gitlab-restore.sh
  • overwrite home_page_url: on new host, run echo "ApplicationSetting.last.update(home_page_url: 'https://gitlab.wikimedia.org/explore')" | /usr/bin/gitlab-rails console
  • point DNS entry for gitlab.wikimedia.org to new host (example change 802473) and run authdns-update
  • verify installation (login, push, pull, look at metrics)
  • run puppet on new host
  • enable puppet on old host with sudo enable-puppet "Failover in progress"
  • unpause all GitLab Runners
  • announce end of downtime
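
The second half of the backup command above selects the newest archive and copies it to latest/latest-data.tar. The following sketch expresses the same step as a function with two additions that are assumptions of this example, not part of the original command: it fails loudly when no archive exists, and it verifies the copy with cmp. The name `copy_latest_backup` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not part of the existing GitLab tooling.

# copy_latest_backup DIR: copy the newest *gitlab_backup.tar in DIR to
# DIR/latest/latest-data.tar, verify the copy, and print the source path.
copy_latest_backup() {
    local dir=$1 newest
    # ls -t sorts by modification time, newest first.
    newest=$(ls -t "$dir"/*gitlab_backup.tar 2>/dev/null | head -n1)
    if [ -z "$newest" ]; then
        echo "no backup archive found in $dir" >&2
        return 1
    fi
    mkdir -p "$dir/latest"
    cp "$newest" "$dir/latest/latest-data.tar"
    # cmp -s exits non-zero if the copy differs from the source.
    cmp -s "$newest" "$dir/latest/latest-data.tar" && echo "$newest"
}

# Example (commented out):
# copy_latest_backup /srv/gitlab-backup
```

Checking the copy before the rsync step avoids transferring a truncated archive to the new host.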

Unplanned Failover

An unplanned failover means the old production instance is not responding or is lost, and it is not possible to create a new backup. Up to 24 hours of GitLab data may be lost.

Get the most recent data possible

Check the age of the backup in Bacula and on the existing replicas. If the backup is reasonably recent, use it (make sure to check GitLab/Backup and Restore#Fetch backups from bacula). If that backup is too old, try to manually schedule a database dump and rsync the git repositories. However, this is not an automated step and needs more planning.
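
Checking the age of a backup file on a replica can be scripted. The following is a minimal sketch: `file_age_hours` and `backup_is_fresh` are hypothetical helpers, and the 24-hour threshold in the example is an assumption taken from the replica data age mentioned above. It relies on GNU stat (`stat -c %Y`), which is assumed to be available on the hosts.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the existing GitLab tooling.

# file_age_hours FILE [NOW_EPOCH]: print the age of FILE in whole hours,
# based on its modification time. NOW_EPOCH defaults to the current time.
file_age_hours() {
    local file=$1 now=${2:-$(date +%s)} mtime
    mtime=$(stat -c %Y "$file") || return 1
    echo $(( (now - mtime) / 3600 ))
}

# backup_is_fresh FILE MAX_HOURS [NOW_EPOCH]: succeed if FILE is at most
# MAX_HOURS old.
backup_is_fresh() {
    local file=$1 max_hours=$2 now=${3:-$(date +%s)} age
    age=$(file_age_hours "$file" "$now") || return 1
    [ "$age" -le "$max_hours" ]
}

# Example (commented out):
# backup_is_fresh /srv/gitlab-backup/latest/latest-data.tar 24 \
#     || echo "backup is older than 24 hours" >&2
```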

During failover

The following steps assume that the old host is no longer available and that a replica with the most recent ("latest") backup is used for the failover:

  • configure new host with profile::gitlab::service_name: 'gitlab.wikimedia.org' (example change 802150)
  • configure new host in profile::gitlab::active_host (example change 802150)
  • if needed, trigger a restore on new host /srv/gitlab-backup/gitlab-restore.sh (not needed if a new backup can't be created)
  • overwrite home_page_url: on new host, run echo "ApplicationSetting.last.update(home_page_url: 'https://gitlab.wikimedia.org/explore')" | /usr/bin/gitlab-rails console
  • point DNS entry for gitlab.wikimedia.org to new host (example change 802473) and run authdns-update
  • verify installation (login, push, pull, look at metrics)
  • run puppet on new host
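
As part of verifying the installation, it may help to confirm that the DNS entry already resolves to the new host. The following is a minimal sketch: `dns_points_to` is a hypothetical helper that reads resolver output on stdin, and `<NEW_SERVICE_IP>` is a placeholder for the address configured as profile::gitlab::service_ip_v4 on the new host.

```shell
#!/bin/bash
# Hypothetical helper, not part of the existing GitLab tooling.

# dns_points_to EXPECTED_IP: read resolver output (one address per line)
# on stdin and succeed if one line is exactly EXPECTED_IP.
dns_points_to() {
    # -x matches whole lines only, -F treats the address as a fixed string.
    grep -qxF "$1"
}

# Example (commented out): <NEW_SERVICE_IP> is a placeholder.
# dig +short gitlab.wikimedia.org A | dns_points_to "<NEW_SERVICE_IP>" \
#     || echo "DNS does not point to the new host yet" >&2
```

Until the previously lowered TTL has expired, some resolvers may still return the old address, so a failed check shortly after authdns-update is not necessarily an error.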