Server Lifecycle

This page describes the lifecycle of Wikimedia servers, starting from the moment we acquire them and until the time we unrack them. A server has various states that it goes through, with several steps that need to happen in each state. The goal is to standardize our processes for 99% of the servers we deploy or decommission and ensure that some necessary steps are taken for consistency, manageability & security reasons.

Server states

Requested

  • New hardware is requested for use via the instructions on Operations_requests#Hardware_Requests.
  • Hardware Allocation Tech will review request, and detail on ticket if we already have a system that meets these requirements, or if one must be ordered.
  • If further details are needed, the task will be assigned back to requester for additional information.
  • If hardware is already available and request is approved by operations management, system will be allocated, skipping the rest of this process to the Existing System Allocation step.
  • If hardware must be ordered, the buyer will gather quotes from our approved vendors & perform initial reviews on quote(s).
  • At this time, quotes are still in RT
  • Technical review is done by operations team members familiar with hardware in question (see below), attaching their confirmation or corrections.
  • If there are corrections, the ticket goes back to the buyer and requester as needed until issues are clarified; otherwise, escalate to the Systems Architect(s).
  • System Architect(s) perform final technical review to ensure the technical correctness, cost effectiveness, & architecture/roadmap alignment; attaching confirmation or corrections.
  • If there are corrections, the ticket goes back to the buyer and requester as needed until issues are clarified; otherwise, escalate to Operations Management.
  • Operations Management reviews ticket and attaches approval(s) or questions as needed & assigns ticket back to buyer.
  • Buyer may create an on-site hardware confirmation task, this ticket will confirm all parts, cables, and assorted items are available for incoming system.
  • Ticket is assigned to on-site tech, who must confirm or request the required hardware accessories needed to support the system.
  • Order may proceed even if all hardware is not on site, depending on missing hardware and lead times.

Existing System Allocation

  • Only existing systems (not new) use this step if they are requested.
  • If a system must be ordered, please skip this section and proceed to Ordered section.
  • System name is changed if required.
  • All puppet files are confirmed to be free of old server declarations.
  • System should have already been cleared out of monitoring.
  • If the system is listed in the puppet decommission files (used to clear said monitoring), it should be removed from those files before it is allocated again.
  • All old system keys are confirmed to be removed from puppet master.
  • All old SSH authorized_keys entries are confirmed to be removed.
  • If all the above are good, the Hardware Allocation Tech will update your Phabricator ticket and the Server Spares page to reflect the allocation.
  • Skip following steps until the Installation section.

Ordered

  • Only new systems (not existing/reclaimed systems)
  • Buyer purchases hardware, attaching ordering details to ticket.
  • Once order ships, buyer places inbound shipment ticket with datacenter vendor. Point of Contact details here.
  • Buyer assigns RT procurement ticket to the on-site technician to receive in hardware.

Post Order

  • An installation/deployment task should be created (if it doesn't already exist) for the overall deployment of the system/OS/service & place in the #operations project.
  • You can include the following steps on this ticket for ease of reference (taken from the entirety of the lifecycle document):
 System Deployment Steps:
  [] - mgmt dns entries created/updated (both asset tag & hostname) [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
  [] - system bios and mgmt setup and tested [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
  [] - network switch setup (port description & vlan) [link sub-task for network configuration here, sub-task should include the network project]
  [] - production dns entries created/updated (just hostname, no asset tag entry) [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
  [] - install_server module updated (dhcp and netboot/partitioning) [done via this task when on-site subtasks complete]
  [] - install OS (note jessie or trusty) [done via this task when network sub-task(s) complete]
  [] - service implementation [done via this task post puppet acceptance]

Receiving Systems On-Site

  • Before the new hardware arrives on site, a shipment ticket must be placed to the datacenter to allow it to be received.
  • If the shipment has a long enough lead time, the buyer should enter a ticket with the datacenter site. Note sometimes the shipment lead times won't allow this & a shipment notification will instead be sent when shipment arrives. In that event, the on-site technician should enter the receipt ticket with the datacenter vendor.
  • New hardware arrives on site & datacenter vendor notifies us of shipment receipt.
  • Packing slip for delivery should list an RT # & the RT ticket should have been assigned to the on-site technician for receipt at this time.
  • Open boxes, compare box contents to packing slip. Note on slip if correct or incorrect, scan packing slip and attach to ticket.
  • Compare packing slip to order receipt in the RT ticket, note results on ticket.
  • If any part of the order is incorrect, reply on RT ticket with what is wrong, and assign the ticket to the buyer on the ticket.
  • If the entire order was correct, please note on the procurement ticket. Unless the ticket states otherwise, it can be resolved by the receiving on-site technician at that time.
  • Assign asset tag to system, enter system into Racktables immediately, even if not in rack location.
  • Some systems will have a hostname assigned to them at time of order (usually for clustered systems), if it has been assigned, it will be on the procurement ticket. If it has not been assigned, name the systems in racktables with the Asset Tag under server name, and leave visible label blank.
  • Entry into racktables should always include the following: asset tag, OEM Serial Number (or service tag), hardware type (dropdown), procurement rt#, purchase date (on order), hardware warranty expiration, and location of system (tag).
  • The location tag lets us see what's in a location but NOT racked, so please make sure to check which site for each system.
  • Hardware warranties should be listed on the order ticket, most servers are three years after ship date.
  • Network equipment has one year coverage, which we renew each year as needed for various hardware.

Racked

  • A Phabricator task should exist with racking location and other details; made during the post-order steps above.
  • All systems should have the following common bios/ilom settings set: cpu hyperthreading on, cpu virtualization off (except for virt and ganeti hosts), serial redirection to com2, redirection after post off, boot mode to legacy bios, ipmi enabled, confirm boot order to list disk first, set performance options to OS performance per watt (dells).
  • Hostname may be assigned (or system may refer to asset tag name until it is allocated for specific role)
  • Please see Server naming conventions for details on how hostnames are determined.
  • If hostname was not previously assigned, a label with name must be affixed to front and back of server.
  • DNS is updated for the mgmt network connections.
  • DNS for mgmt should include both the assettag.mgmt.site.wmnet as well as hostname.mgmt.site.wmnet.
  • DNS for production network will be set only for hostname, since a system will have a hostname before going on the production network, systems may not have this set if their usage isn't yet determined.
  • Racktables entry updated to reflect rack location.
  • System BIOS & out-of-band mgmt settings are configured at this time.
  • On-site Tech should fully test the mgmt interface to ensure it responds to ssh, they are able to log in, reboot the system, and watch a successful BIOS POST over serial console (see the sketch after this list).
  • Switch port(s) are assigned and labeled.
  • Label with hostname, if not available label with asset tag.
  • VLAN assignment is completed at this time only if system role is known.
  • After systems have been racked, if they are not immediately allocated to a service (IE: they are spare), a ticket should be created in core-ops and assigned to the HW Allocation Tech with the asset tags, so they can be added to the spares list.
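
A hedged sketch of that management-interface test, using a Dell iDRAC as the example (the hostname is illustrative, and other vendors use different commands; see Platform-specific documentation):

  ssh root@example1001.mgmt.eqiad.wmnet    # mgmt answers ssh and accepts a login
  racadm serveraction powercycle           # reboot the box from the management card
  console com2                             # attach to the serial console and watch the BIOS POST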

Installation

Some of the following steps can be automatically performed using the wmf-auto-reimage scripts. See the notes in square brackets ([]) on the right of each item.

  • Hostname must be assigned at this point.
  • Please see Server naming conventions for details on how hostnames are determined.
  • If hostname was not previously assigned, a label with name must be affixed to front and back of server.
  • DNS setup for production network.
  • $assettag.mgmt.$loc.wmnet should have been setup when the system was racked.
  • set up $name.mgmt.$loc.wmnet to the same IP as $assettag.mgmt.$loc.wmnet.
  • VLAN: Network port is set to proper vlan (and labeled with hostname if not yet labeled.)
    • Do not use the enable keyword (if it's not explicitly disabled, it's enabled)
  • Folks who can handle vlan assignments: Chris J, Faidon L, Mark B, Rob H., Arzhel Y.
  • Any ops folks who want this ability should speak to our network admins.
  • DHCP: Add server to appropriate file in Puppet, based on serial console port and speed (see the sketch after this list):
  • modules/install_server/files/dhcpd/linux-host-entries.ttyS0-9600 = com port 1, speed of 9600
  • modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200 = com port 1, speed of 115200
  • modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200 = com port 2, speed of 115200 (most hosts)
  • You can pull this information from the management of most systems, as described in their specific pages under Platform-specific documentation.
  • Decide on partition mapping & add server to modules/install_server/files/autoinstall/netboot.cfg
  • Detailed implementation details for our Partman install exist here.
  • The majority of systems should use automatic partitioning, which is set by inclusion on the proper line in netboot.cfg.
  • Any hardware raid would need to be setup manually via rebooting and entering raid bios.
  • Right now there is a mix of hardware and software raid availability.
  • File located @ puppet modules/install_server.
  • partman recipe used located in modules/install_server
  • Please note if you are uncertain on what to pick, you should lean towards LVM.
  • Many reasons for this, including ease of expansion in event of filling the disk.
  • Proceed with Installation
  • Reboot system and boot from network / PXE boot. [automatically done by wmf-auto-reimage unless the option --no-pxe is set]
  • acquires hostname in DNS
  • acquires DHCP/autoinstall entries
  • gets installed
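
As a rough illustration of the DHCP step above, the linux-host-entries files hold standard ISC dhcpd host declarations; the hostname and MAC address below are made up, so copy an existing entry from the same file as your real template:

  host example1001 {
      hardware ethernet aa:bb:cc:dd:ee:ff;
      fixed-address example1001.eqiad.wmnet;
  }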

Post-Install: Get puppet running

  • Nothing replaces fully understanding how our deployment of Puppet operates, as detailed on the service info page.
  • Warning: if you are rebuilding a pre-existing server (rather than a brand new name), on the puppet master (puppetmaster1001), run puppet cert destroy $server_fqdn to clear out the old certificate before beginning this process. If you already began, also run (on the server you're building, not the puppet master) find /var/lib/puppet/ssl -type f -exec rm {} \; to clean out the client. [automatically done by wmf-auto-reimage unless the option --no-pxe is set]
  • Login to the puppet master (puppetmaster1001).
  • from puppetmaster1001, sudo /usr/local/sbin/install-console $server_fqdn to log into $server
  • on $server, run puppet agent --test [automatically done by wmf-auto-reimage unless the option --no-pxe is set]
It should whine that it can't get its cert automatically: Exiting; no certificate found and waitforcert is disabled
  • on the puppet master (puppetmaster1001), run puppet cert -l to list all pending certificate signings. [automatically done by wmf-auto-reimage]
  • on the puppet master, run puppet cert -s $server_fqdn for the specific server you wish to sign keys for. [automatically done by wmf-auto-reimage]
  • Now again on $server, run puppet agent --enable to administratively enable puppet, and then puppet agent --test. It should now succeed. [automatically done by wmf-auto-reimage]
  • After your first couple of successful puppet runs, you should reboot just to make sure it comes up clean. [automatically done by wmf-auto-reimage unless the --no-reboot option is set]
  • Your host should now appear in puppet stored configs and therefore in icinga.
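
The signing dance above, condensed into one hedged sketch (the hostname is illustrative; wmf-auto-reimage performs the equivalent steps for you):

  # on puppetmaster1001, only when rebuilding a pre-existing hostname
  puppet cert destroy example1001.eqiad.wmnet
  # still on puppetmaster1001, open a console on the new host
  sudo /usr/local/sbin/install-console example1001.eqiad.wmnet
  # on the new host
  puppet agent --test                             # complains: no certificate found and waitforcert is disabled
  # back on puppetmaster1001
  puppet cert -l                                  # list pending signing requests
  puppet cert -s example1001.eqiad.wmnet          # sign this host's request
  # on the new host again
  puppet agent --enable && puppet agent --test    # should now succeed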

In Service

  • When a server is placed into service, documentation of the service (not specifically the server) needs to reflect the new server's status. This includes puppet file references, as well as wikitech documentation pages.
  • When a server is set to be decommissioned or reclaimed below, all wikitech documentation should be updated before a hardware-requests task for reclaim or decommissioning is filed.

Reinstallation

Most of the following steps can be automatically performed using the wmf-auto-reimage scripts. See the notes in square brackets ([]) on the right of each item. YOU SHOULD USE THIS SCRIPT FOR REINSTALLS, as it prevents you from missing steps.

  • A Phabricator ticket should be created detailing the reinstallation in progress.
  • System services must be confirmed to be offline. Checking everything needed for this step and documenting it on this specific page is not feasible at this time. Please ensure you understand the full service details and what software configuration files must be modified. This document will only list the generic steps required for the majority of servers.
  • The following instructions assume the system is online and responsive. If the system is offline, simply skip the steps requiring you to run something on the host.
  • If server is part of a service pool, ensure it is set to false or removed completely from pybal/LVS (a confctl sketch follows this list). [see wmf-auto-reimage option -c/--conftool]
  • Instructions on how to do so are listed on the LVS page.
  • If server is part of a service group, there will be associated files for removal or update. The service in question needs to be understood by tech performing the decommission (to the point they know when they can take things offline.) If assistance is needed, please seek out another operations team member to assist.
  • Example: db class machines are in associated db-X.php, memcached in mc.php.
  • Remove server entry from DSH node groups, if present (if the server is part of a service pool, this is most probably not necessary as dsh groups are populated from conftool)
  • These files are maintained in operations/puppet:hieradata/common/scap/dsh.yml
  • Put the host into downtime in Icinga if taking it down would otherwise generate pages. [automatically done by wmf-auto-reimage unless the option --no-downtime is set]
  • Manually revoke keys from puppet. [automatically done by wmf-auto-reimage unless the option --no-pxe is set]
  • Instructions on how to do so are on the Puppet service details page.
  • $ puppet cert clean <fqdn>
  • Power down system. [automatically done by wmf-auto-reimage unless the option --no-pxe is set]
  • After the above is done, system installation can proceed from the Installation section above.
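
A hedged confctl sketch for the depooling step above (the hostname is illustrative and the right selector depends on the service; see the LVS page, or let wmf-auto-reimage do it via -c/--conftool):

  sudo confctl select 'name=example1001.eqiad.wmnet' get              # check the current pooled state
  sudo confctl select 'name=example1001.eqiad.wmnet' set/pooled=no    # depool the host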

Reclaim to Spares OR Decommission

Steps for ANY Opsen

  • A Decommission ticket should be created detailing if system is being decommissioned (and removed from datacenter) or reclaimed (wiped of all services/data and set system as spare for reallocation).
    • Please put a full decommission checklist (https://phabricator.wikimedia.org/project/profile/3364/) of the steps in the main task description; this ensures none are accidentally missed.
  • System services must be confirmed to be offline. Checking everything needed for this step and documenting it on this specific page is not feasible at this time (but we are working to add them all). Please ensure you understand the full service details and what software configuration files must be modified. This document will only list the generic steps required for the majority of servers.
  • Disable ALL service level checks in icinga for host.
  • If server is part of a service pool, ensure it is set to false or removed completely from pybal/LVS.
    • Instructions on how to do so are listed on the LVS page.
  • If possible, use tcpdump to verify that no production traffic is hitting the services/ports (see the sketch after this list).
  • If server is part of a service group, there will be associated files for removal or update. The service in question needs to be understood by tech performing the decommission (to the point they know when they can take things offline.) If assistance is needed, please seek out another operations team member to assist.
    • Example: db class machines are in associated db-X.php, memcached in mc.php.
  • Remove server entry from DSH node groups (if any).
    • If the server is part of a service group, common DSH entries are populated from conftool, unless they're proxies or canaries
    • The list of dsh groups is in operations/puppet:hieradata/common/scap/dsh.yaml.
  • Remove system entries in site.pp, replace with system entry for role::spare::system, merge changes.
  • Remove all hiera data entries for host.
  • Run puppet on host to be reclaimed/decommissioned.
    • Leaving the host on, but with role::spare::system will allow it to receive security updates.
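
Two hedged sketches for the steps above. To look for remaining production traffic (interface name and port filter are illustrative):

  sudo tcpdump -n -i eth0 not port 22    # anything beyond your own ssh session deserves a closer look

And a site.pp stanza for the spare role, assuming the role() helper used elsewhere in site.pp (hostname illustrative):

  node 'example1001.eqiad.wmnet' {
      role(spare::system)
  }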

Steps for DC-OPS (with network switch access)

  • Confirm removal of all puppet manifest entries, DSH entries, and Hiera data.

These steps, once started, must be completed without interruption.

Some of the following steps are covered by the wmf-decommission-host script available on the cluster::management hosts (neodymium/sarin as of Aug. 2018). Those covered by the script are marked as [decom script].

  • Disable puppet on the host (puppet agent --disable)
    • Admin log whenever you disable or enable puppet on a host!
  • Remove all references in puppet:
    • remove from site.pp and from hiera data (both individual host files and entries in regex.yaml, if any)
    • remove from netboot.cfg (puppet:///modules/install_server/files/autoinstall/netboot.cfg)
    • remove from DHCPD lease file (puppet:///modules/install_server/files/dhcpd/linux-host-entries.ttyS... filename changes based on serial console settings)
    • Instructions on how to do so are on the Puppet service details page.
    • $ puppet node clean <fqdn> [decom script]
    • $ puppet node deactivate <fqdn> [decom script]
      • The two commands immediately preceding this should also remove the host from Icinga monitoring.
    • Run puppet on the icinga master (currently einsteinium.wikimedia.org), so that all alerts for the host are removed [not needed if the decom script is run]
    • Alternatively, put the host and all services into downtime for 1+ day, as it will then not alert when the host is powered down & the next puppet run on the icinga host will remove it from monitoring. [decom script]
    • Remove the host from DebMonitor: from one of the cluster::management hosts (neodymium/sarin as of Jul. 2018) run: [decom script]
      sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key
      
  • Remove host's port vlan config (a Junos sketch follows this list)
    • # show interfaces ge-x/y/z | display inheritance helps identify configuration applied to the port
  • disable host's port on switch.
    • If system is being reclaimed for spare, do not change port label.
    • If system is being decommissioned, please do not wipe port description until AFTER it is unracked.
    • THIS MUST BE DONE, or host can be powered up and will be network accessible (but not in puppet and not getting security updates)
    • Move the switch port to interface-range disabled
    • If you can't get puppet to run happily on icinga, get help. If help is not available, stop, re-enable puppet on the host, and start these steps again once you can get help with icinga.
    • You should either ensure monitoring is removed, or at minimum disable notifications for that host. Don't generate paging alerts for systems you are decommissioning.
  • Power down system.
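
A hedged Junos sketch for the switch-port steps above (the interface name is illustrative; adapt to whatever display inheritance shows, and keep the description in place if the host is being decommissioned):

  show interfaces ge-0/0/10 | display inheritance             # configuration mode: see where the port's config comes from
  delete interfaces ge-0/0/10 unit 0                          # if the vlan is set directly on the port; otherwise edit the range shown above
  set interfaces interface-range disabled member ge-0/0/10    # move the port into the disabled range
  commit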

End steps that must be completed without interruption. The following can be done one at a time and/or with long breaks in between.

  • Remove DNS entries for the production network.
    • Don't remove the mgmt DNS entries at this time!
      • Reclaims never have mgmt entries removed, and decom servers should keep them until they are wiped and unracked.
  • Update associated Phabricator ticket, detailing steps taken and resolution.
    • If system is decommissioned by on-site tech, they can resolve the ticket.
    • If system is reclaimed into spares, ticket should be assigned to the HW Allocation Tech so he can update spares lists for allocation.

Decommission Specific (can be done by DC Ops without network switch access)

  • A Phabricator ticket for the decommission of the system should be placed in the #hardware-request project and the appropriate datacenter-specific ops-* project.
  • All further decommission steps are handled by the on-site technician.
  • Wipe all disks on system with minimum of 3 passes.
  • We presently boot off a USB version of DBaN.
  • Reset all system bios, mgmt bios, & raid bios settings to factory defaults.
  • Unrack system & update Racktables, moving system from rack location to decommissioned rack.
  • Unless another system will be placed in the space vacated immediately, please remove all power & network cables from rack.
  • Once server is un-racked, do the following:
  • Remove its mgmt DNS entries.
  • Remove port description label for decommissioned host's switch port

Network devices specific

  • SRX only: ensure autorecovery is disabled (see Juniper doc)
  • Wipe the configuration
    • By either running the command request system zeroize media
    • Or pressing the reset button for 15s
  • Confirm the wipe is successful by logging in to the device via console (root/no password)

wmf-auto-reimage

The wmf-auto-reimage scripts automate most of the installation/re-image tasks outlined in this document. There are two scripts available:

  • wmf-auto-reimage-host: reimage a single host.
  • wmf-auto-reimage: reimage multiple hosts in parallel or in sequence, with an optional sleep between them.

The scripts are installed on the Cumin masters (see Cumin#Production infrastructure) and must be run in a screen/tmux session with sudo -i (to load conftool authentication); a usage sketch is given after the list below. The steps performed are:

  • update the Phabricator task with the start of the reimage [only if -p/--phab-task-id is set]
  • validate FQDN of hosts to image/reimage [unless --new or --no-verify are set]
  • set the hosts in downtime on Icinga [unless --no-downtime is set]
  • depool hosts via conftool [only if -c/--conftool is set]
  • set next boot in PXE mode, and check that the PXE is set (retry up to 3 times on failure) [unless --no-pxe is set]
  • power cycle or power on based on current power status [unless --no-pxe is set]
  • use the new hostname [if --rename is set, it requires that the new hostname is already set via DHCP and configured in DNS]
  • monitor the reboot [unless --no-pxe is set]
  • run puppet once to create the certificate and the signing request to the Puppet master [puppet client >= 4 only if there is no signed cert]
  • wait for a Puppet certificate to sign and perform the following tasks, unless the certificate was already signed:
    • mask all the provided systemd units (comma-separated list) to prevent them from starting automatically during the first Puppet run. Useful in case of worker roles like the MediaWiki videoscalers [only if --mask is set]
      Warning: if a service is started by Puppet, the mask will be removed and the service started anyway. For now, --mask works only for services started at boot by Debian packages, not for those started by Puppet.
    • make the first Puppet run
    • run Puppet on the Icinga host and set the new host in downtime (there are still many race conditions here that might prevent it from suppressing all the alarms)
  • verify that the BIOS parameters are back to normal (no override), print a warning message otherwise.
  • reboot the host [unless --no-reboot is set]
  • monitor the reboot and that the first puppet run is successful [unless --no-reboot is set]
  • run the apache-fast-test on the host after the reimage [only if -a/--apache is set]
  • print the commands to unmask the masked systemd units (there is no automatic unmasking on purpose) [only if --mask is set]
  • print the conftool commands to re-pool the host (there is no automatic repooling on purpose) [only if -c/--conftool is set]
  • update the Phabricator task with the result of the reimage [only if -p/--phab-task-id is set]

For a full list of the available options run them with the -h/--help parameter.
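
A hedged usage sketch (the task ID and hostname are illustrative; check -h/--help for the exact argument syntax), run from one of the Cumin masters:

  sudo -i                      # load the root environment and conftool authentication
  tmux new -s reimage          # or screen; keeps the run alive if you disconnect
  wmf-auto-reimage-host -p T123456 -c example1001.eqiad.wmnet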

Server reimage + rename

This is a rough outline of a procedure that can be followed to rename a server while doing the reimage:

  • Silence alerts for the host to be renamed
  • patch for puppet adjusting install/roles for the new server. Merge it.
  • patch for DNS adding the new mgmt FQDNs (don't delete the old ones yet). Merge it. (what about non-mgmt entries?)
  • disable puppet in the server to be reimaged + renamed
  • run the wmf-auto-reimage-host script (with the --rename option) (Run it where?)
  • patch for dns to clean up the old DNS entries. Merge it.
  • racktables/netbox entry update (don't change the physical label field, just the hostname field)
  • get the physical relabeling done (open a task for dc-ops) including the update to the racktables physical label field
  • update the network port description on switch
  • done

Examples of all of this: phab:T199521, phab:T199107.

Position Assignments

The cycle above references specific position/assignments, without referring to name. To keep the document generic, we'll keep the cycle with positions listed, and just list those folks here.

  • Buyer / HW Allocation Tech: Rob H (US), Mark B (EU)
  • On-site Tech EQIAD: Chris J
  • On-site Tech CODFW: Papaul T
  • On-site Tech ULSFO: Rob H
  • Director Technical Operations : Mark B
  • Operations Technical Review: Mark B, Faidon L

See also