
Server Lifecycle/Reimage

Revision as of 22:33, 6 October 2021 by imported>Volans (Add netbox data import from puppetdb step)

Reimaging is the clean installation of the OS (operating system) on a host. It can be either the first installation on a new host (imaging) or the upgrade of an existing host to a newer OS version (reimaging). During the reimage process all data on the host is lost, unless a partman recipe that retains data in specific partitions has been configured.

Physical hosts

The reimage process is performed through the sre.hosts.reimage cookbook. The --os option is mandatory and specifies which OS version should be installed. When asked for the management password, find it in the management file on pwstore.

How to run it

  • Check available options
sudo cookbook sre.hosts.reimage -h
  • Reimage of a generic host that doesn't need any specific options:
sudo cookbook sre.hosts.reimage --os bullseye -t T12345 somehost1001
  • Reimage of a generic host behind LVS:
sudo cookbook sre.hosts.reimage --os bullseye --conftool -t T12345 somehost1001
  • Reimage of a MediaWiki host:
sudo cookbook sre.hosts.reimage --os bullseye --conftool --httpbb -t T12345 somehost1001
  • Image of a new freshly racked host:
sudo cookbook sre.hosts.reimage --os bullseye --new -t T12345 somehost1001

Pre-reimage validation operations

  • Ensure the host to reimage is a physical host
  • If --new is set:
  • If --new is not set:
    • Ensure that the host exists in PuppetDB
  • Check that both the host and its management DNS names resolve correctly
  • Check that the host's IPMI is reachable and it's possible to execute commands
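
The DNS check in the list above can be illustrated with a minimal sketch. This is not the cookbook's code: the function names are hypothetical, and the management FQDN used in the usage note is only a guess at the naming scheme.

```python
import socket


def resolves(fqdn):
    """Return True if the name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(fqdn, None)) > 0
    except socket.gaierror:
        return False


def unresolved_names(host_fqdn, mgmt_fqdn, resolver=resolves):
    """Return the names that fail to resolve; an empty list means the check passes.

    The resolver is injectable so the logic can be exercised without real DNS.
    """
    return [name for name in (host_fqdn, mgmt_fqdn) if not resolver(name)]
```

For example, unresolved_names("somehost1001.eqiad.wmnet", "somehost1001.mgmt.eqiad.wmnet") would return the names that still need a DNS record (both FQDNs here are assumed, not taken from this page).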

Reimage operations

  • Update the Phabricator task saying that the reimage has been started (if -t/--task is set)
  • Downtime on Icinga (unless --new or --no-downtime is set)
  • Depool from conftool (if -c/--conftool is set)
  • Disable Puppet on the host (without failing if the host is unreachable or Puppet cannot be disabled)
  • Remove the host from Puppet and PuppetDB
  • Delete any existing Puppet certificate for the host
  • Remove the host from Debmonitor
  • Unless --no-pxe is set:
    • Generate a temporary DHCP snippet on the install server in the same datacenter as the host, with the specified OS version to use, and restart the DHCP server
    • Force next boot to go via PXE via IPMI
    • Reboot the host via IPMI (power cycle or power up based on current power status) and poll until reachable
    • Verify that the host has rebooted into the Debian installer environment
    • Poll until the host gets rebooted by the Debian installer into the new OS
    • Verify that the host has rebooted into a new OS and not again into the Debian installer
    • Delete the temporary DHCP snippet and restart the DHCP server as the assigned IP has been statically configured in the host at this point
  • Mask the provided systemd services (if --mask is set)
  • Generate a Puppet certificate request on the host, poll for the CSR on the active Puppet CA server, verify its fingerprint and sign the new certificate
  • Run Puppet in NOOP mode to compile the catalog and populate PuppetDB with the exported resources. Poll PuppetDB until the Nagios_host resource appears.
  • Downtime the new host in Icinga, first forcing a Puppet run on the Icinga server host so that it picks up all the check definitions generated for the new host by the exported resources
    • This step is always performed and is not affected by the --no-downtime option, which affects only the downtime prior to the reimage
  • Run Puppet for the first time (this step takes a long time), asking the user what to do on failure
  • Ensure that the BIOS boot parameters are back to normal to prevent an accidental reboot into PXE
  • Run Puppet on the host where the cookbook is running to get the known host key of the new host
  • Reboot the new host and poll until reachable
  • Poll until a successful Puppet run is completed
  • Run Httpbb /srv/deployment/httpbb-tests/appserver/* against the host (if --httpbb is set)
  • Print the command to unmask the masked units, if there are any
  • Force a recheck of all Icinga checks and poll the host status in Icinga until it reaches optimal status:
    • If the host status is optimal within a few minutes, remove the downtime
    • If the host status is still not optimal after a few minutes, print a warning and tell the user that the downtime has not been removed
  • Print the command to repool any depooled services, if there are any
  • Run the interface_automation.ImportPuppetDB Netbox script for the host to import all the data from PuppetDB and show its results
  • Update the Phabricator task with the result of all the actions performed (if -t/--task is set)
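
Several of the steps above poll until a condition holds (host reachable, CSR present, successful Puppet run). A minimal sketch of that pattern follows; the helper name, the default number of tries and the delay are illustrative assumptions, not the values the cookbook uses.

```python
import time


def poll_until(check, tries=30, delay=10.0, sleep=time.sleep):
    """Call check() until it returns True and return the number of attempts used.

    Raises TimeoutError if the condition is still false after `tries` attempts.
    The sleep function is injectable so the helper can be tested without waiting.
    """
    for attempt in range(1, tries + 1):
        if check():
            return attempt
        if attempt < tries:
            sleep(delay)  # wait before the next attempt, but not after the last one
    raise TimeoutError(f"condition not met after {tries} attempts")
```

A caller would pass a function that performs one probe, such as a single SSH connection attempt, and let the helper handle the retries.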

Virtual hosts

As of now (Oct. 2021) the reimage process usually involves the decommissioning of the existing VM and the provisioning of a new VM. There are plans to add reimage support for virtual machines too. See also Ganeti#Reinstall_/_Reimage_a_VM.

DHCP Automation

Workflow

  • On each install host the DHCP configuration includes a usually empty file: /etc/dhcp/automation/proxies/opt82-ttyS1-115200.conf
  • The reimage cookbook creates a DHCP snippet file in the /etc/dhcp/automation/opt82-ttyS1-115200 directory containing the host{...} block for just the host being reimaged, with the configuration required to assign it the primary IPv4 address recorded for it in Netbox.
  • The /usr/local/sbin/dhcpincludes script is then run. It populates the /etc/dhcp/automation/proxies/opt82-ttyS1-115200.conf file mentioned above with one include per available DHCP snippet file, checks that the whole DHCP configuration is correct and restarts the DHCP server.
  • Once the host has rebooted into the new OS and the assigned IP has been statically configured in its /etc/network/interfaces, the reimage cookbook deletes the DHCP snippet and re-runs the /usr/local/sbin/dhcpincludes script.
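
The "one include per snippet file" step of the workflow above can be sketched as follows. This is not the dhcpincludes script itself, whose implementation is not shown on this page; the function names are hypothetical, and the sketch omits the configuration check and the DHCP server restart.

```python
from pathlib import Path


def list_snippets(snippet_dir):
    """Return the regular files found in the snippet directory, in sorted order."""
    return sorted(p for p in Path(snippet_dir).iterdir() if p.is_file())


def render_includes(snippet_paths):
    """Render one dhcpd 'include' statement per snippet file.

    With no snippets the result is an empty string, which matches the
    "usually empty" proxy file described in the workflow.
    """
    return "".join(f'include "{path}";\n' for path in snippet_paths)
```

The rendered text would then be written to the proxy file that the main DHCP configuration already includes.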

Key concepts

All the switches have the DHCP relay agent information option (option 82) enabled. It injects additional information (switch hostname, interface name and VLAN name) into the DHCP packets destined for a DHCP server.

The DHCP configuration snippets are generated getting the data that will be injected by the switch from Netbox so that the DHCP server will be able to match the request packets with the host and assign the right IP to it. See the example below.

This approach made it possible to remove all the hardcoded MAC addresses from the Puppet repository, along with the need to keep the DHCP configuration for all physical hosts active all the time. Besides keeping the DHCP configuration much smaller and simpler, one benefit is that a host accidentally rebooted into PXE mode will not be assigned any IP and hence will not wipe its data.

Example snippet

host somehost1001 {
    host-identifier option agent.circuit-id "asw2-a-eqiad:ge-6/0/1.0:private1-a-eqiad";
    fixed-address 10.0.0.1;
    option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/bullseye-installer/";
}
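
As a sketch of how such a snippet could be generated from Netbox data, the function below rebuilds the example above from its parts. The function name and parameters are hypothetical, and both the "switch:interface:vlan" layout of the circuit ID and the "<os>-installer" path pattern are inferred from this single example rather than from the cookbook's code.

```python
def render_snippet(hostname, switch, interface, vlan, ipv4, os_codename):
    """Render a dhcpd host block that matches a host by its option 82 circuit ID."""
    # Assumed layout: the three values injected by the switch, joined by colons.
    circuit_id = f"{switch}:{interface}:{vlan}"
    # Assumed pattern: the installer path is derived from the OS codename.
    prefix = f"http://apt.wikimedia.org/tftpboot/{os_codename}-installer/"
    return (
        f"host {hostname} {{\n"
        f'    host-identifier option agent.circuit-id "{circuit_id}";\n'
        f"    fixed-address {ipv4};\n"
        f'    option pxelinux.pathprefix "{prefix}";\n'
        "}\n"
    )
```

Calling render_snippet("somehost1001", "asw2-a-eqiad", "ge-6/0/1.0", "private1-a-eqiad", "10.0.0.1", "bullseye") reproduces the example snippet.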

What to do if...

The reimage fails

Because the reimage process is idempotent as far as possible, it should be fine to run it again after a failure. If the host has already been removed from PuppetDB, the pre-reimage validation step will fail and tell the user to retry with the --new option. If the reimage fails after the Debian installer has successfully installed the new OS, it can be resumed with the --no-pxe option, which skips the reboot into PXE and the re-installation of the OS.
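
The retry guidance above can be summarised in a small sketch. The function and its parameters are hypothetical. Each flag is handled as the text describes, but passing --new and --no-pxe together is an assumption that this page does not confirm.

```python
def retry_flags(in_puppetdb, os_installed):
    """Suggest extra options for re-running the reimage after a failure."""
    flags = []
    if not in_puppetdb:
        # The validation step expects --new when the host is gone from PuppetDB.
        flags.append("--new")
    if os_installed:
        # The new OS is already installed, so skip PXE and the re-installation.
        flags.append("--no-pxe")
    return flags
```

For a host that is still in PuppetDB and already has the new OS installed, this yields ["--no-pxe"], to be added to the original cookbook invocation.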

IPMI fails

Follow the steps outlined in the Management Interfaces page to troubleshoot and fix the IPMI connection with the host.