
Server Lifecycle

This page describes the lifecycle of Wikimedia servers, starting from the moment we acquire them until the time we no longer own them. A server goes through various states, with several steps that need to happen in each state. The goal is to standardize our processes for 99% of the servers we deploy or decommission and to ensure that some necessary steps are taken for consistency, manageability & security reasons.
 
This page assumes the handling of '''bare metal hardware servers''', as it includes DC Ops steps. The general philosophy also applies to Virtual Machines in terms of steps and final status, but check [[Ganeti#VM_operations]] for the usually simplified steps regarding VMs.
 
The inventory tool used is [[Netbox]] and each state change for a host is documented throughout this page.
 
== States ==
 
{| class="wikitable" style="background: none;"
|-
! Server Lifecycle !! Netbox
!Racked
!Power
|-
| <code>requested</code> || none, not yet in Netbox
|no
|n/a
|-
| <code>spare</code> || <code>INVENTORY</code>
|yes or no
|off
|-
|<code>planned</code>
|<code>PLANNED</code>
|yes or no
|off
|-
| <code>staged</code> || <code>STAGED</code>
|yes
|on
|-
| <code>active</code> || <code>ACTIVE</code>
|yes
|on
|-
| <code>failed</code> || <code>FAILED</code>
|yes
|on or off
|-
| <code>decommissioned</code> || <code>DECOMMISSIONING</code>
|yes
|on or off
|-
| <code>unracked</code> || <code>OFFLINE</code>
|no
|n/a
|-
| <code>recycled</code> || none, not anymore in Netbox
|no
|n/a
|}
 
== Server transitions ==
[[File:Server Lifecycle Statuses.png|alt=Diagram of the Server Lifecycle transitions|thumb|635x635px|Diagram of the Server Lifecycle transitions
 
* Dashed lines are for the transitions to Failed state.
* Red dashed lines highlight the transition Active -> Failed -> Staged to distinguish it from the Staged <-> Failed one.
]]


=== Requested ===
* New hardware is requested for use via the instructions on the [https://phabricator.wikimedia.org/maniphest/task/edit/form/66/ Phabricator Procurement Form].
 
* The Hardware Allocation Tech will review the request and detail on the ticket whether we already have a system that meets these requirements, or if one must be ordered.


* If hardware is already available and request is approved by SRE management, system will be allocated, skipping the generation of quotes and ordering.
* If hardware must be ordered, then DC Operations will gather quotes from our approved vendors & perform initial reviews on the quote(s), working with the sub-team who requested the hardware.
 
 
==== Existing System Allocation ====
See the [[#Decommissioned -> Staged]] section below.
 
* Only existing systems (not new) use this step if they are requested.
:* If a system must be ordered, please skip this section and proceed to the [[Server_Lifecycle#Ordered|Ordered]] section.
* Spare pool allocations are detailed on the #Procurement task identically to new orders.
* Task is escalated to DC operations manager for approval of spare pool systems.
* Once approved, the same steps of updating the procurement gsheet & filing a racking task occur from the DC operations person triaging Procurement.
 
==== Ordered ====
* Only new systems (not existing/reclaimed systems)  
* Quotes are reviewed and selected, then escalated to either DC Operations Management or SRE Management (budget dependent) for order approvals.
* At the time of Phabricator order approval, a racking sub-task is created and our budget google sheets are updated. DC Ops then places the approved Phabricator task into Coupa for ordering.
* Coupa approvals and ordering takes place.
* Ordering task is updated by Procurement Manager (Finance) and reassigned to the on-site person for DC Operations to receive (in Coupa) and rack the hardware.
* Racking task is followed by DC Operations and resolved.


==== Post Order ====
An installation/deployment task should be created (if it doesn't already exist) for the overall deployment of the system/OS/service & have the <code>#sre</code> and <code>#DCOps</code> tags. It can be created following the Phabricator [[phab:maniphest/task/edit/form/80/|Hardware Racking Request]] form.
=== Requested -> Spare & Requested -> Planned ===
==== Receiving Systems On-Site ====


* Before the new hardware arrives on site, a shipment ticket must be placed to the datacenter to allow it to be received.
:* If the shipment has a long enough lead time, the buyer should enter a ticket with the datacenter site.  Note sometimes the shipment lead times won't allow this & a shipment notification will instead be sent when shipment arrives.  In that event, the on-site technician should enter the receipt ticket with the datacenter vendor.
* New hardware arrives on site & datacenter vendor notifies us of shipment receipt.
* Packing slip for delivery should list a Phabricator # or PO # & the Phabricator racking task should have been created in the correct datacenter project at the time of shipment arrival.
* Open boxes, compare box contents to packing slip.  Note on slip if correct or incorrect, scan packing slip and attach to ticket.
* Compare packing slip to order receipt in the Phabricator task, note results on Phabricator task.
* If any part of the order is incorrect, reply on Phabricator task with what is wrong, and escalate back to DC Ops Mgmt.
* If the entire order was correct, please note on the procurement ticket.  Unless the ticket states otherwise, it can be resolved by the receiving on-site technician at that time.
* Assign asset tag to system, enter system into [[Netbox]] immediately, even if not in rack location, with:
:* Device role (dropdown), Manufacturer (dropdown), Device type (dropdown), Serial Number (OEM Serial number or Service tag), Asset tag, Site (dropdown), Platform (dropdown), Purchase date, Support expiry date, Procurement ticket (Phabricator or RT)
:**For State and Name:
:***If host is scheduled to be commissioned: use the hostname from the procurement ticket as Name and <code>PLANNED</code> as State
:***If host is a pure spare host, not to be commissioned: use the asset tag as Name and <code>INVENTORY</code> as State
:* Hardware warranties should be listed on the order ticket, most servers are three years after ship date.
:* Network equipment has one year coverage, which we renew each year as needed for various hardware.
:*A [https://phabricator.wikimedia.org Phabricator] task should exist with racking location and other details; made during the post-order steps above.
 
=== Requested -> Planned additional steps & Spare -> Planned ===
 
* A hostname must be defined at this stage:  
**Please see [[Server naming conventions]] for details on how hostnames are determined.
** If hostname was not previously assigned, a label with name must be affixed to front and back of server.
***If system has a front LCD, please see instructions on how to set the name on it via [[Platform-specific documentation]]
 
* [[Netbox]] entry must be updated to reflect rack location and hostname
* Run the [https://netbox.wikimedia.org/extras/scripts/interface_automation/ProvisionServerNetwork/ Netbox ProvisionServerNetwork] script to assign mgmt IP, primary IPv4/IPv6, vlan and switch interface
*Follow [[DNS/Netbox#Update_generated_records]] to create and deploy the mgmt and primary IPs (the mgmt entries should include both <code>$assettag.mgmt.site.wmnet</code> and <code>$hostname.mgmt.site.wmnet</code>).
 
* Run [[Homer#Running_Homer_from_cluster_management_hosts_(recommended)|Homer]] to configure the switch interface (description, vlan).
 
* System Bios & out of band mgmt settings are configured at this time.
**See the [[Platform-specific documentation]] for setup instructions for each system type looking for the ''Initial System Setup'' section.
***'''NB: for Dell servers the process is automated''', see [[SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Automatic_setup]].
**'''Serial Redirection and mgmt must be tested at this time'''
***The on-site tech should fully test the mgmt interface to ensure that it responds to ssh, that they are able to log in, reboot the system, and watch a successful BIOS POST over the serial console (a console sketch follows this list).
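A minimal console sketch of the network/DNS steps and the mgmt test above, assuming a cumin host; the hostname, site, Homer target and commit messages are placeholders, and the exact cookbook/Homer arguments should be checked against [[DNS/Netbox]] and [[Homer]]. The <code>console com2</code> command is Dell iDRAC specific; other platforms differ (see [[Platform-specific documentation]]).

 cumin1001:~$ sudo cookbook sre.dns.netbox "Add mgmt and primary records for example1001"
 cumin1001:~$ homer "asw2-a-eqiad*" commit "Add switch port config for example1001"
 # basic mgmt test, from a host that can reach the mgmt network:
 $ ssh root@example1001.mgmt.eqiad.wmnet
 # then, from the iDRAC shell, attach to the serial console and watch a reboot/POST:
 console com2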
=== Planned -> Staged ===
 
==== Preparation ====
* Decide on partition mapping & add server to <code>modules/install_server/files/autoinstall/netboot.cfg</code>
**Detailed implementation details for our [[Partman]] install exist [[Partman|here]].
** The majority of systems should use automatic partitioning, which is set by inclusion on the proper line in <code>netboot.cfg</code>.
** Any hardware raid would need to be setup manually via rebooting and entering raid bios.
*:* Right now there is a mix of hardware and software raid availability.
** File located @ puppet <code>modules/install_server</code>.
*:* partman recipe used located in modules/install_server
*:* Please note if you are uncertain on what to pick, you should lean towards LVM.
*::* Many reasons for this, including ease of expansion in event of filling the disk.
*Check <code>site.pp</code> to ensure that the host will be reimaged into the <code>insetup</code> or <code>insetup_noferm</code> role based on the requirements. If in doubt, check with the service owner. A quick sanity check for both files is sketched below.
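As a quick sanity check of the two files above, you can grep a local checkout of <code>operations/puppet</code>; <code>example1001</code> is a placeholder hostname, and note that <code>site.pp</code> node definitions are usually regexes, so searching a shorter prefix may be needed:

 $ grep -n 'example1001' modules/install_server/files/autoinstall/netboot.cfg
 $ grep -n -A 2 'example10' manifests/site.pp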
==== Installation ====
 
''For virtual machines, where there is no physical BIOS to change, but there is virtual hardware to setup, check [[Ganeti#Create_a_VM]] instead.''
 
At this point the host can be installed. From now on the service owner should be able to take over and install the host automatically, asking DC Ops to have a look only if there are issues. As a rule of thumb, if the host is part of a larger cluster/batch order, it should install without issues and the service owner should try this path first. If instead the host is the first of a batch of new hardware, then it is probably better to ask DC Ops to install the first one. Consider it new hardware if it differs from the existing hosts by generation, management card, RAID controller, network cards, BIOS, etc.
 
===== Automatic Installation =====
See the [[Server_Lifecycle/Reimage]] page on how to use the reimage script to install a new server. Don't forget to set the <code>--new</code> CLI parameter.
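An illustrative invocation from a cumin host follows; the hostname and task ID are placeholders, and additional options (for example the OS to install) may be required, so treat [[Server_Lifecycle/Reimage]] as the authoritative reference:

 cumin1001:~$ sudo cookbook sre.hosts.reimage --new -t T123456 example1001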
 
===== Manual installation =====
 
<u>'''Warning''':</u> if you are rebuilding a pre-existing server (rather than a brand new name), on <code>puppetmaster</code> clear out the old certificate before beginning this process:
  puppetmaster$ sudo puppet cert destroy $server_fqdn
1. Reboot system and boot from network / PXE boot <br>
2. Acquires hostname in DNS<br>
3. Acquires DHCP/autoinstall entries<br>
4. OS installation<br>
 
'''Run Puppet for the first time'''<br><br>1. From the cumin hosts ({{CuminHosts}}) connect to <code>newserver</code> with install_console.
 
cumin1001:~$  sudo /usr/local/bin/install_console $newserver_fqdn
 
It is possible that ssh warns you of a bad key if an existing ssh fingerprint still exists on the cumin host, like:<syntaxhighlight lang="text">
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!    @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.
Please contact your system administrator.
Add correct host key in /dev/null to get rid of this message.
</syntaxhighlight>You can safely proceed with the installation; the next time puppet runs automatically on the puppetmaster this file will be updated.
 
Try then to do a mock puppet run (it will fail due to lack of certificate signage):
 
newserver# puppet agent --test
Exiting; no certificate found and waitforcert is disabled
 
2. On <code>puppetmaster</code>  list all pending certificate signings and sign this server's key
puppetmaster$ sudo puppet cert -l
puppetmaster$ sudo puppet cert -s $newserver_fqdn
3. Back to the <code>newserver</code>, enable puppet and test it
  newserver# puppet agent --enable
  newserver# puppet agent --test
4. After a couple of successful puppet runs, you should reboot <code>newserver</code> just to make sure it comes up clean.<br>5. The <code>newserver</code> should now appear in puppet and in Icinga.<br>6. If that is a new server, change the state in Netbox to <code>STAGED</code>
 
7. Run the [https://netbox.wikimedia.org/extras/scripts/interface_automation.ImportPuppetDB/ Netbox script] to update the device with its interfaces and related IP addresses (remember to Commit the change, the default run is just a preview).<br>
 
'''Note''': If you already began reinstalling the server before destroying its cert on the <code>puppetmaster</code>, you should clean out ON THE <code>newserver</code> (with care):
'''newserver#''' find /var/lib/puppet/ssl -type f -exec rm {} \;
 
=== Spare -> Failed & Planned -> Failed & Staged -> Failed ===
If a device in the Spare, Planned or Staged state has hardware failures it can be marked in Netbox as <code>FAILED</code>.
 
=== Spare -> Decommissioned ===
When a host in the spare pool has reached its end of life and must be unracked.
 
* DC Ops perform actions to safely unrack the host, see the [[#Reclaim to Spares OR Decommission]] section below.
 
=== Staged -> Active ===
 
* When a server is placed into service, documentation of the service (not specifically the server) needs to reflect the new server's state. This includes puppet file references, as well as Wikitech documentation pages.  
** Example: Some servers have [[Help:SSH_Fingerprints|SSH fingerprints]] listed.
*The service owner puts the host back in production.
*The service owner changes Netbox's state to <code>ACTIVE</code>.
 
=== Active -> Staged ===
This transition should be used when reimaging or when a ''rollback'' of the <code>STAGED -> ACTIVE</code> transition is needed.
 
* The service owner performs actions to remove it from production, see the [[#Remove from production]] section below.
*Perform the reimage using the available scripts, see [[Server_Lifecycle/Reimage]].
* Service owner changes Netbox's state to <code>STAGED</code> [TODO: include this step into the sre.hosts.reimage cookbook]
 
=== Active -> Failed ===
When a host fails and requires physical maintenance/debugging by DC Ops:
 
* The service owner performs actions to remove it from production, see the [[#Remove from production]] section below.
* The service owner changes Netbox's state to <code>FAILED</code>
*Once the failure is resolved the host will be put back into <code>STAGED</code>, not directly into <code>ACTIVE</code>/production.
 
=== Active -> Decommissioned ===
When the host has completed its life in a given role and should be decommissioned or returned to the spare pool for re-assignment.
 
*The service owner performs actions to remove it from production, see the [[#Remove from production]] section below.
*Follow instructions for [[Server Lifecycle#Reclaim to Spares OR Decommission]]
 
=== Failed -> Spare ===
When the failure of a Spare device has been fixed it can be set back to <code>INVENTORY</code> in Netbox.
 
=== Failed -> Planned ===
When the failure of a Planned device has been fixed it can be set back to <code>PLANNED</code> in Netbox.
 
=== Failed -> Staged ===
When the failure of an Active or Staged device has been fixed, it will go back to the Staged state. This is because, even if the host was <code>ACTIVE</code> before, it needs to be tested and brought back to production by its service owner.
 
* Change Netbox's state to <code>STAGED</code>
 
=== Failed -> Decommissioned ===
When the failure cannot be fixed and the host is no longer usable, it must be decommissioned before unracking it.
 
* Follow instructions for [[Server Lifecycle#Reclaim to Spares OR Decommission]].
 
=== Decommissioned -> Spare ===
When a decommissioned host is going to be part of the spare pool.
 
* DC Ops wipe and power down the host, see the [[#Reclaim to Spares OR Decommission]] section below.
* DC Ops changes Netbox's state to <code>INVENTORY</code>
 
=== Decommissioned -> Staged ===
When a host is decommissioned from one role and immediately returned to service in a different role, usually with a different hostname. (Ideally it should be wiped too.)
 
* Still follow the  [[#Reclaim to Spares OR Decommission]] steps first decommissioning and then re-allocating the host, optionally with a new name, but it requires some additional manual steps (TBD).
* Service owner changes Netbox's state to <code>STAGED</code>
 
=== Decommissioned -> Unracked ===
The host has completed its life and is being unracked
 
* DC Ops perform actions to safely unrack the host, see the [[#Reclaim to Spares OR Decommission]] section below.
 
=== Unracked -> Recycled ===
When the host physically leaves the datacenter.
 
* DC Ops perform actions to recycle the host, see the [[#Reclaim to Spares OR Decommission]] section below.
 
If it is a Juniper device, fill in the "Juniper Networks Service Waiver Policy" and send it to Juniper through a service request so the device is removed from Juniper's DB.
 
== Server actions ==
 
=== Reimage ===
See the [[Server Lifecycle/Reimage]] page.
 
=== Remove from production ===
 
{{Note|Please use the Phabricator form for decommission tasks: https://phabricator.wikimedia.org/project/profile/3364/}}
 
*'''A [https://phabricator.wikimedia.org Phabricator] ticket''' should be created detailing the reinstallation in progress.
*'''System services must be confirmed to be offline'''.  Make sure no other services depend on this server.
*'''Remove from pybal/[[LVS]]''' (if applicable) - see the <code>sre.hosts.reimage</code> cookbook option <code>-c/--conftool</code> and consult the [[LVS]] page (a minimal depool sketch follows this list)
*'''Check if server is part of a service group'''. For example db class machines are in associated db-X.php, memcached in mc.php.
*'''Remove server entry from DSH node groups''' (if applicable). For example check <code>operations/puppet:hieradata/common/scap/dsh.yaml</code>
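A minimal depool sketch using conftool, assuming the host is managed by conftool and the commands are run from a cluster management host; the hostname is a placeholder and the correct selector/service depends on the cluster, so consult the [[LVS]] page first:

 cumin1001:~$ sudo confctl select 'name=example1001.eqiad.wmnet' get
 cumin1001:~$ sudo confctl select 'name=example1001.eqiad.wmnet' set/pooled=no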
 
=== Rename while reimaging ===
{{Warning|content=Experimental procedure, not yet fully tested}}
'''Assumptions:'''


* The host will lose all its data.
* The host can change primary IPs. The following procedure doesn't guarantee that they will stay the same.
* If the host needs to be physically relocated, follow the additional steps inline.
* A change of the host's VLAN during the procedure is supported.

'''Procedure:'''

This procedure follows the <code>active -> decommissioned -> staged</code> path. '''All data on the host will be lost.'''

* Remove the host from active production (depool, failover, etc.)
* Run the  <code>'''sre.hosts.decommission'''</code> cookbook, see [[Spicerack/Cookbooks#Run_a_single_Cookbook]]
*If the host needs to be physically relocated:
**Physically relocate the host now.
**Update its device page on Netbox to reflect the new location.
*Update Netbox:
**Edit the device page to set the new name (use the hostname, not the FQDN) and set its status from '''DECOMMISSIONING''' to '''PLANNED'''.
**Rename the DNS Name of all its IPs (there should be only the management IP at this stage). In order to do so, search for them in the [https://netbox.wikimedia.org/ipam/ip-addresses/?q= IpAddresses] Netbox page (Search box on the right) using the current hostname (not FQDN in order to find the management IP too).[[File:NetboxConnectionDetails.png|thumb|300x300px|Netbox's connection details]]
**Take note of the primary interface connection details: '''Cable ID, Switch name, Switch port''' (see image on the right). They will be needed in a following step.
**[TODO: automate this step into the Netbox provisioning script] Go to the interfaces tab in the device's page on Netbox, select all the interfaces '''except the <code>mgmt</code> one''', proceed only if the selected interfaces have '''no IPs assigned to them.''' Delete the selected interfaces.
**Run the [https://netbox.wikimedia.org/extras/scripts/interface_automation.ProvisionServerNetwork/ interface_automation.ProvisionServerNetwork] Netbox script, filling the previously gathered data for switch, switch interface and cable ID (just the integer part). Fill out all the remaining data accordingly, ask for help if in doubt.
* Run the <code>sre.dns.netbox</code> cookbook: [[DNS/Netbox#Update_generated_records]]
*Run [[Homer]] against the switch the device is connected to, in order to configure the switch interface (initial) description and VLAN configuration.
**Note that netbox uses virtual names for switches, so e.g. <code>asw2-d1-eqiad</code> in netbox is <code>"asw2-d-eqiad*"</code> when using homer.
* Patch puppet:
**Adjust install/roles for the new server, hieradata, conftool, etc.
**Update partman entry.
**Get it reviewed, merge and deploy it.
*Run puppet on the install servers: <code>cumin 'A:installserver' 'run-puppet-agent -q'</code>
* Follow the reimage procedure at [[Server Lifecycle/Reimage]] using the <code>--new</code> option
*Edit the device page on Netbox, set its status from '''PLANNED''' to '''STAGED'''.
* Get the physical re-labeling done (open a task for dc-ops)
*Run [[Homer]] (again) against the switch the device is connected to, in order to update the port's description with the interface name assigned to the host during the reimage/install.
* Once the host is back in production, update its status in Netbox from '''STAGED''' to '''ACTIVE''' (a condensed console sketch of the command-line steps follows this list).
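A condensed, illustrative sketch of the command-line portion of this procedure; the old/new hostnames, Homer target and task ID are placeholders, the Netbox edits and Netbox scripts are done in the web UI, and the exact cookbook/Homer arguments should be double-checked against their documentation:

 cumin1001:~$ sudo cookbook sre.hosts.decommission oldname1001.eqiad.wmnet -t T123456
 # ... rename and re-provision the device in the Netbox web UI ...
 cumin1001:~$ sudo cookbook sre.dns.netbox "Rename oldname1001 to newname1001"
 cumin1001:~$ homer "asw2-d-eqiad*" commit "Switch port config for newname1001"
 cumin1001:~$ sudo cumin 'A:installserver' 'run-puppet-agent -q'
 # then reimage with the --new option as described in Server Lifecycle/Reimage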


=== Reclaim to Spares OR Decommission ===
TODO: this section should be split in three: Wipe, Unrack and Recycle.
==== Steps for non-LVS hosts ====
* Run the decomm cookbook (note: this will also schedule downtime for the host):
  $ cookbook sre.hosts.decommission  mc102[3-4].eqiad.wmnet -t T289657
* Remove any references in puppet, most notably from <code>site.pp</code> and <code> modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200 </code>
==== Steps for ANY Opsen ====
* A [https://phabricator.wikimedia.org/maniphest/task/edit/form/52/ Decommission] ticket should be created detailing if system is being decommissioned (and removed from datacenter) or reclaimed (wiped of all services/data and set system as spare for reallocation).
** Please put a [https://phabricator.wikimedia.org/maniphest/task/edit/form/52/ full decommission checklist] of the steps in the main task description, this ensures none are accidentally missed.
* System services must be confirmed to be offline.  Checking everything needed for this step and documenting it on this specific page is not feasible at this time (but we are working to add them all).  Please ensure you understand the full service details and what software configuration files must be modified.  This document will only list the generic steps required for the majority of servers.
* The following instructions assume the system is online and responsive.  If the system is offline, simply skip the steps requiring you do run something on the host.
* If server is part of a service pool, ensure it is set to false or removed completely from pybal/[[LVS]].
** Instructions on how to do so are listed on the [[LVS]] page.
*If possible, use tcpdump to verify that no production traffic is hitting the services/ports (a spot-check sketch follows this list)
* If server is part of a service group, there will be associated files for removal or update.  The service in question needs to be understood by the tech performing the decommission (to the point they know when they can take things offline).  If assistance is needed, please seek out another operations team member to assist.
** Example: db class machines are in associated db-X.php, memcached in mc.php.
* Remove server entry from DSH node groups (if any).
** If the server is part of a service group, common DSH entries are populated from conftool, unless they're proxies or canaries
** The list of dsh groups is in <code>operations/puppet:hieradata/common/scap/dsh.yaml</code>.
* Run the <code>'''sre.hosts.decommission'''</code> [[decom script|'''decom script''']] available on the <code>cluster::management</code> hosts ({{CuminHosts}}). '''The cookbook is destructive and will make the host unbootable'''. This script, unlike the <code>sre.hosts.reimage</code> one, '''works for both physical hosts and virtual machines'''. The script will check for remaining occurrences of the hostname or IP in any puppet or DNS files and warn about them. Since the workflow is to remove the host from site.pp and DHCP only after running it, it is normal to see warnings about those; you should check, though, whether it still appears in any other files where it is not expected. The most notable case is an mw appserver that happens to be an mcrouter proxy, which needs to be replaced before decom. The actions performed by the cookbook are:
** Downtime the host on Icinga (it will be removed at the next Puppet run on the Icinga host)
** Detect if Physical or Virtual host based on Netbox data.
** If virtual host (Ganeti VM)
*** Ganeti shutdown (tries OS shutdown first, pulls the plug after 2 minutes)
*** Force Ganeti->Netbox sync of VMs to update its state and avoid Netbox Report errors
** If physical host
*** Downtime the management host on Icinga (it will be removed at the next Puppet run on the Icinga host)
*** Wipe bootloaders to prevent it from booting again
*** Pull the plug (IPMI power off without shutdown) {{Warning|Every once in a while the remote IPMI command fails. Pay close attention that you do not get an error like in [[phab:T277780#6966775|T277780#6966775]] that says "Failed to power off". If this happens the host can end up in a state where it is wiped from DNS but still in PuppetDB, which means it will still be in Icinga (and alerting) while its mgmt DNS won't be reachable. This additionally breaks [[Memcached_for_MediaWiki/mcrouter#Generate_certs_for_a_new_host|adding mcrouter certs for new hosts]], because the script doing that asks PuppetDB for host names that it then tries to find in DNS, which fails for the "zombie server" (the script will now tell you which host is the culprit). The fix is to manually run 'puppet node deactivate <fqdn>' on the puppetmaster, followed by running puppet agent on the Icinga server; see [[phab:T277780#6968901|T277780#6968901]].}}
*** Update Netbox state to Decommissioning and delete all device interfaces and related IPs but the mgmt one
***Disable switch interface and remove vlan config in Netbox
** Remove it from DebMonitor
** Remove it from Puppet master and PuppetDB
** If virtual host (Ganeti VM), issue a VM removal that will destroy the VM. This can take a few minutes.
** Run the sre.dns.netbox cookbook to propagate the DNS changes or prompt the user for a manual patch if needed in order to remove [[DNS]] entries for the production network, and the hostname management entries, but '''leave the asset tag mgmt entries''' at this stage, servers should keep them until they are wiped and unracked.
**Remove switch port config by running [[Homer]].
** Update the related Phabricator task
*Remove all references from Puppet repository:
**<code>site.pp</code>
**DHCP config from lease file (<code>modules/install_server/files/dhcpd/linux-host-entries.ttyS...</code> filename changes based on serial console settings)
**Partman recipe in <code>modules/install_server/files/autoinstall/netboot.cfg</code>
**All Hiera references both individual and in <code>regex.yaml</code>
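Two illustrative spot-checks; the hostname is a placeholder. The first, run on the host itself before the decommission, confirms that nothing beyond your own SSH session is still talking to it; the second, run in a local checkout of <code>operations/puppet</code>, looks for leftover references:

 example1001:~$ sudo tcpdump -n -c 30 'not port 22'
 $ git grep -n 'example1001' -- manifests/ modules/ hieradata/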


==== Steps for DC-OPS (with network switch access) ====
* Confirm all puppet manifest entries removal, DSH removal, Hiera data removal.
 
*Remove host's port config on switch either manually (eqiad) or by running [[Homer]] (if not already done above).
**If manual: Move the switch port to <code>interface-range disabled</code> (a Junos sketch follows this list)
**<code># show interfaces ge-x/y/z | display inheritance</code> helps identify configuration applied to the port
* Update associated [https://phabricator.wikimedia.org Phabricator] ticket, detailing steps taken and resolution.
** If system is decommissioned by on-site tech, they can resolve the ticket.
** If system is reclaimed into spares, ticket should be assigned to the HW Allocation Tech so he can update spares lists for allocation.
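A minimal Junos sketch for the manual path, assuming the switch already defines an <code>interface-range disabled</code> group; the interface name and hostname are placeholders, and any existing port configuration (description, vlan) may also need to be cleaned up:

 > configure
 # show interfaces ge-1/0/5 | display inheritance
 # set interfaces interface-range disabled member ge-1/0/5
 # commit comment "decommission example1001"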


==== Decommission Specific (can be done by DC Ops without network switch access) ====
* A [[phab:maniphest/task/edit/form/52/|Phabricator]] ticket for the decommission of the system should be placed in the #decommission project and the appropriate datacenter-specific ops-* project.
:* The decom script can be run by anyone in SRE, but then reassign the server to the local DC ops engineer to wipe disks for return to service/spares, or reset bios and settings and unrack for decommission.
* Wipe all disks on system with minimum of 3 passes.
:* We presently boot off USB version of [http://www.dban.org/ DBaN].
* Run the [https://netbox.wikimedia.org/extras/scripts/offline_device/OfflineDevice/ Offline a device with extra actions] Netbox script that will set the device in Offline status and delete all its interfaces and associated IP addresses left.
** To run the script in '''dry-run''' mode, uncheck the '''Commit changes''' checkbox.
* Reset all system bios, mgmt bios, & raid bios settings to factory defaults.
* Remove its mgmt [[DNS]] entries: run the [[DNS/Netbox#Update generated records|sre.dns.netbox]] cookbook
* Unless another system will be placed in the space vacated immediately, please remove all power & network cables from rack.
 
==== Network devices specific ====
 
* SRX only: ensure autorecovery is disabled (see [https://kb.juniper.net/InfoCenter/index?page=content&id=KB25782 Juniper doc])
* Wipe the configuration
** By either running the command <code>request system zeroize media</code>
** Or Pressing the reset button for 15s
* Confirm the wipe is successful by logging in to the device via console (root/no password)


== Position Assignments ==
* Director Technical Operations : Mark B
* Operations Technical Review: Mark B, Faidon L
== See also ==
* [[m:Hardware donation program]] and [https://blog.wikimedia.org/tag/server-donation/ traditional blog post announcements]
* [[phabricator:hardware-requests]]
* [[Server Spares]]
[[Category:Operations]]
[[Category:SRE Infrastructure Foundations]]

Revision as of 15:25, 12 May 2022

This page describes the lifecycle of Wikimedia servers, starting from the moment we acquire them and until the time we don't own them anymore. A server has various states that it goes through, with several steps that need to happen in each state. The goal is to standardize our processes for 99% of the servers we deploy or decommission and ensure that some necessary steps are taken for consistency, manageability & security reasons.

This assumes the handling of bare metal hardware servers, as it includes DCOps steps. While the general philosophy applies also to Virtual Machines in terms of steps handling and final status, check Ganeti#VM_operations for the usually simplified steps regarding VMs.

The inventory tool used is Netbox and each state change for a host is documented throughout this page.

States

Server Lifecycle Netbox Racked Power
requested none, not yet in Netbox no n/a
spare INVENTORY yes or no off
planned PLANNED yes or no off
staged STAGED yes on
active ACTIVE yes on
failed FAILED yes on or off
decommissioned DECOMMISSIONING yes on or off
unracked OFFLINE no n/a
recycled none, not anymore in Netbox no n/a

Server transitions

Diagram of the Server Lifecycle transitions
Diagram of the Server Lifecycle transitions * Dashed lines are for the transitions to Failed state. * Red dashed lines highlight the transition Active -> Failed -> Staged to distinguish it from the Staged <-> Failed one.

Requested

  • Hardware Allocation Tech will review request, and detail on ticket if we already have a system that meets these requirements, or if one must be ordered.
  • If hardware is already available and request is approved by SRE management, system will be allocated, skipping the generation of quotes and ordering.
  • If hardware must be ordered, the then DC Operations will gather quotes from our approved vendors & perform initial reviews on quote(s), working with the sub-team who requested the hardware.


Existing System Allocation

See the #Decommissioned -> Staged section below.

  • Only existing systems (not new) use this step if they are requested.
  • If a system must be ordered, please skip this section and proceed to Ordered section.
  • Spare pool allocations are detailed on the #Procurement task identically to new orders.
  • Task is escalated to DC operations manager for approval of spare pool systems.
  • Once approved, the same steps of updating the procurement gsheet & filing a racking task occur from the DC operations person triaging Procurement.

Ordered

  • Only new systems (not existing/reclaimed systems)
  • Quotes are reviewed and selected, then escalated to either DC Operations Management or SRE Management (budget dependent) for order approvals.
  • At the time of Phabricator order approval, a racking sub-task is created and our budget google sheets are updated. DC Ops then places the approved Phabricator task into Coupa for ordering.
  • Coupa approvals and ordering takes place.
  • Ordering task is updated by Procurement Manager (Finance) and reassigned to the on-site person for DC Operations to receive (in Coupa) and rack the hardware.
  • Racking task is followed by DC Operations and resolved.

Post Order

An installation/deployment task should be created (if it doesn't already exist) for the overall deployment of the system/OS/service & have the #sre and #DCOps tags. It can be created following the Phabricator Hardware Racking Request form.

Requested -> Spare & Requested -> Planned

Receiving Systems On-Site

  • Before the new hardware arrives on site, a shipment ticket must be placed to the datacenter to allow it to be received.
  • If the shipment has a long enough lead time, the buyer should enter a ticket with the datacenter site. Note sometimes the shipment lead times won't allow this & a shipment notification will instead be sent when shipment arrives. In that event, the on-site technician should enter the receipt ticket with the datacenter vendor.
  • New hardware arrives on site & datacenter vendor notifies us of shipment receipt.
  • Packing slip for delivery should list an Phabricator # or PO # & the Phabricator racking task should have been created in the correct datacenter project at time of shipment arrival.
  • Open boxes, compare box contents to packing slip. Note on slip if correct or incorrect, scan packing slip and attach to ticket.
  • Compare packing slip to order receipt in the Phabricator task, note results on Phabricator task.
  • If any part of the order is incorrect, reply on Phabricator task with what is wrong, and escalate back to DC Ops Mgmt.
  • If the entire order was correct, please note on the procurement ticket. Unless the ticket states otherwise, it can be resolved by the receiving on-site technician at that time.
  • Assign asset tag to system, enter system into Netbox immediately, even if not in rack location, with:
  • Device role (dropdown), Manufacturer (dropdown), Device type (dropdown), Serial Number (OEM Serial number or Service tag), Asset tag, Site (dropdown), Platform (dropdown), Purchase date, Support expiry date, Procurement ticket (Phabricator or RT)
    • For State and Name:
      • If host is scheduled to be commissioned: use the hostname from the procurement ticket as Name and PLANNED as State
      • If host is a pure spare host, not to be commissioned: Use the asset tag as Name and INVENTORY as State
  • Hardware warranties should be listed on the order ticket, most servers are three years after ship date.
  • Network equipment has one year coverage, which we renew each year as needed for various hardware.
  • A Phabricator task should exist with racking location and other details; made during the post-order steps above.

Requested -> Planned additional steps & Spare -> Planned

  • A hostname must be defined at this stage:
    • Please see Server naming conventions for details on how hostnames are determined.
    • If a hostname was not previously assigned, a label with the name must be affixed to the front and back of the server.
  • The Netbox entry must be updated to reflect the rack location and hostname.
  • Run the Netbox ProvisionServerNetwork script to assign mgmt IP, primary IPv4/IPv6, vlan and switch interface
  • Follow DNS/Netbox#Update_generated_records to create and deploy the mgmt and primary IPs (the mgmt records should include both $assettag.mgmt.site.wmnet and $hostname.mgmt.site.wmnet).
  • Run Homer to configure the switch interface (description, vlan), as sketched below.
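
For illustration, a hedged sketch of the last two steps as run from a cumin host; the hostname, commit messages and switch pattern are hypothetical, and the exact arguments and privileges may differ, so see DNS/Netbox and the Homer page for the authoritative invocations:

 $ cookbook sre.dns.netbox "Add records for example1001"
 $ homer "asw2-d-eqiad*" commit "Configure switch port for example1001"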

Planned -> Staged

Preparation

  • Decide on partition mapping & add server to modules/install_server/files/autoinstall/netboot.cfg
    • Implementation details for our partman setup are documented here.
    • The majority of systems should use automatic partitioning, which is set by inclusion on the proper line in netboot.cfg (see the sketch after this list).
    • Any hardware RAID needs to be set up manually by rebooting and entering the RAID BIOS; right now there is a mix of hardware and software RAID across the fleet.
    • Both netboot.cfg and the partman recipes it references live in modules/install_server in the puppet repository.
    • If you are uncertain which recipe to pick, lean towards LVM; among other benefits, it makes it easier to expand a filesystem if a disk fills up.
  • Check site.pp to ensure that the host will be reimaged into the insetup or insetup_noferm roles based on the requirements. If in doubt check with the service owner.
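
As referenced in the list above, a purely illustrative sketch of a netboot.cfg entry; the hostname pattern and recipe name are hypothetical, and the surrounding shell case syntax may differ, so check the current modules/install_server/files/autoinstall/netboot.cfg for the real format and available partman recipes:

 # hypothetical host pattern mapped to a hypothetical partman recipe
 example100[1-4]) echo partman/custom/example-lvm.cfg ;; \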

Installation

For virtual machines, where there is no physical BIOS to change but there is virtual hardware to set up, check Ganeti#Create_a_VM instead.

At this point the host can be installed. From now on the service owner should be able to take over and install the host automatically, asking DC Ops to have a look only if there are issues. As a rule of thumb, if the host is part of a larger cluster/batch order, it should install without issues and the service owner should try this path first. If instead the host is the first of a batch of new hardware, then it is probably better to ask DC Ops to install the first one. Consider it new hardware if it differs from the existing hosts by generation, management card, RAID controller, network cards, BIOS, etc.

Automatic Installation

See the Server_Lifecycle/Reimage section on how to use the reimage script to install a new server. Don't forget to set the --new CLI parameter, as in the example below.
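
A hedged example of a first installation run from a cumin host; the hostname, OS codename and task ID are hypothetical, and the full set of flags is documented on Server_Lifecycle/Reimage:

 $ cookbook sre.hosts.reimage --os bullseye --new -t T000000 example1001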

Manual installation

Warning: if you are rebuilding a pre-existing server (rather than installing a brand new one), clear out the old certificate on the puppetmaster before beginning this process:

 puppetmaster$ sudo puppet cert destroy $server_fqdn

1. Reboot system and boot from network / PXE boot
2. Acquires hostname in DNS
3. Acquires DHCP/autoinstall entries
4. OS installation

Run Puppet for the first time

1. From the cumin hosts (cumin1001.eqiad.wmnet, cumin2002.codfw.wmnet) connect to newserver with install_console.

cumin1001:~$  sudo /usr/local/bin/install_console $newserver_fqdn

It is possible that ssh warns you of a bad key if an existing ssh fingerprint still exists on the cumin host, like:

    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    It is also possible that a host key has just been changed.
    The fingerprint for the ECDSA key sent by the remote host is
    SHA256:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.
    Please contact your system administrator.
    Add correct host key in /dev/null to get rid of this message.

You can safely proceed with the installation; the next time puppet runs automatically on the puppetmaster this file will be updated.

Then try a mock puppet run (it will fail because the certificate has not yet been signed):

newserver# puppet agent --test
Exiting; no certificate found and waitforcert is disabled

2. On puppetmaster list all pending certificate signings and sign this server's key

puppetmaster$ sudo puppet cert -l
puppetmaster$ sudo puppet cert -s $newserver_fqdn

3. Back to the newserver, enable puppet and test it

 newserver# puppet agent --enable
 newserver# puppet agent --test

4. After a couple of successful puppet runs, you should reboot newserver just to make sure it comes up clean.
5. The newserver should now appear in puppet and in Icinga.
6. If this is a new server, change its state in Netbox to STAGED

7. Run the Netbox script to update the device with its interfaces and related IP addresses (remember to Commit the change, the default run is just a preview).

Note: If you already began reinstalling the server before destroying its cert on the puppetmaster, you should clean out the old certificates ON THE newserver (with care):

newserver# find /var/lib/puppet/ssl -type f -exec rm {} \;

Spare -> Failed & Planned -> Failed & Staged -> Failed

If a device in the Spare, Planned or Staged state has hardware failures it can be marked in Netbox as FAILED.

Spare -> Decommissioned

When a host in the spare pool has reached its end of life and must be unracked.

Staged -> Active

  • When a server is placed into service, documentation of the service (not specifically the server) needs to reflect the new server's state. This includes puppet file references, as well as Wikitech documentation pages.
  • The service owner puts the host back in production.
  • The service owner changes the Netbox state to ACTIVE.

Active -> Staged

This transition should be used when reimaging or when a rollback of the STAGED -> ACTIVE transition is needed.

  • The service owner performs the actions to remove it from production; see the #Remove from production section below.
  • Perform the reimage using the available scripts, see Server_Lifecycle/Reimage.
  • The service owner changes the Netbox state to STAGED [TODO: include this step into the sre.hosts.reimage cookbook]

Active -> Failed

When a host fails and requires physical maintenance/debugging by DC Ops:

  • The service owner performs the actions to remove it from production; see the #Remove from production section below.
  • The service owner changes the Netbox state to FAILED.
  • Once the failure is resolved, the host is put back into STAGED, not directly into ACTIVE and production.

Active -> Decommissioned

When the host has completed its life in a given role and should be decommissioned or returned to the spare pool for re-assignment.

Failed -> Spare

When the failure of a Spare device has been fixed it can be set back to INVENTORY in Netbox.

Failed -> Planned

When the failure of a Planned device has been fixed it can be set back to PLANNED in Netbox.

Failed -> Staged

When the failure of an Active or Staged device has been fixed, it goes back to the Staged state. This is because, even if the host was ACTIVE before, it needs to be tested and brought back into production by its service owner.

  • Change Netbox's state to STAGED

Failed -> Decommissioned

When the failure cannot be fixed and the host is no longer usable, it must be decommissioned before being unracked.

Decommissioned -> Spare

When a decommissioned host is going to be part of the spare pool.

Decommissioned -> Staged

When a host is decommissioned from one role and immediately returned to service in a different role, usually with a different hostname. (Ideally it should be wiped too.)

  • Still follow the #Reclaim to Spares OR Decommission steps, first decommissioning and then re-allocating the host, optionally with a new name; this requires some additional manual steps (TBD).
  • The service owner changes the Netbox state to STAGED.

Decommissioned -> Unracked

The host has completed its life and is being unracked.

Unracked -> Recycled

When the host physically leaves the datacenter.

For Juniper devices, fill in the "Juniper Networks Service Waiver Policy" and send it to Juniper through a service request so the device is removed from Juniper's DB.

Server actions

Reimage

See the Server Lifecycle/Reimage page.

Remove from production

  • A Phabricator ticket should be created detailing the reinstallation in progress.
  • System services must be confirmed to be offline. Make sure no other services depend on this server.
  • Remove from pybal/LVS (if applicable) - see the sre.hosts.reimage cookbook option -c/--conftool and consult the LVS page; a depool sketch follows this list.
  • Check if server is part of a service group. For example db class machines are in associated db-X.php, memcached in mc.php.
  • Remove server entry from DSH node groups (if applicable). For example check operations/puppet:hieradata/common/scap/dsh.yaml
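
For hosts behind LVS, a minimal depool sketch using conftool; the hostname is hypothetical, and the LVS and Conftool pages remain the authoritative procedure (the reimage cookbook's -c/--conftool option can also handle the depool/repool for you):

 # inspect the current state, then depool every service the host is registered for
 $ sudo confctl select "name=example1001.eqiad.wmnet" get
 $ sudo confctl select "name=example1001.eqiad.wmnet" set/pooled=no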

Rename while reimaging

Assumptions:

  • The host will lose all its data.
  • The host can change primary IPs. The following procedure doesn't guarantee that they will stay the same.
  • If the host also needs to be physically relocated, follow the additional steps inline.
  • A change of the host's VLAN during the procedure is supported.

Procedure:

This procedure follows the active -> decommissioned -> staged path. All data on the host will be lost.

  • Remove the host from active production (depool, failover, etc.)
  • Run the sre.hosts.decommission cookbook, see Spicerack/Cookbooks#Run_a_single_Cookbook
  • If the host needs to be physically relocated:
    • Physically relocate the host now.
    • Update its device page on Netbox to reflect the new location.
  • Update Netbox:
    • Edit the device page to set the new name (use the hostname, not the FQDN) and set its status from DECOMMISSIONING to PLANNED.
    • Rename the DNS Name of all its IPs (there should be only the management IP at this stage). In order to do so, search for them in the IpAddresses Netbox page (Search box on the right) using the current hostname (not the FQDN, in order to find the management IP too).
      [Image: Netbox's connection details]
    • Take note of the primary interface connection details: Cable ID, Switch name, Switch port (see image on the right). They will be needed in a following step.
    • [TODO: automate this step into the Netbox provisioning script] Go to the interfaces tab in the device's page on Netbox, select all the interfaces except the mgmt one, proceed only if the selected interfaces have no IPs assigned to them. Delete the selected interfaces.
    • Run the interface_automation.ProvisionServerNetwork Netbox script, filling the previously gathered data for switch, switch interface and cable ID (just the integer part). Fill out all the remaining data accordingly, ask for help if in doubt.
  • Run the sre.dns.netbox cookbook: DNS/Netbox#Update_generated_records
  • Run Homer against the switch the device is connected to, in order to configure the switch interface (initial) description and VLAN configuration.
    • Note that Netbox uses virtual names for switches, so e.g. asw2-d1-eqiad in Netbox is "asw2-d-eqiad*" when using Homer.
  • Patch puppet:
    • Adjust install/roles for the new server, hieradata, conftool, etc.
    • Update partman entry.
    • Get it reviewed, merge and deploy it.
  • Run puppet on the install servers: cumin 'A:installserver' 'run-puppet-agent -q'
  • Follow the reimage procedure at Server Lifecycle/Reimage using the --new option
  • Edit the device page on Netbox, set its status from PLANNED to STAGED.
  • Get the physical re-labeling done (open a task for dc-ops)
  • Run Homer (again) against the switch the device is connected to, in order to update the port's description with the interface name assigned to the host during the reimage/install.
  • Once the host is back in production update its status in Netbox from STAGED to ACTIVE.

Reclaim to Spares OR Decommission

TODO: this section should be split in three: Wipe, Unrack and Recycle.

Steps for non-LVS hosts

  • Run the decommission cookbook. Note: this will also schedule downtime for the host.
 $ cookbook sre.hosts.decommission  mc102[3-4].eqiad.wmnet -t T289657
  • Remove any references in puppet, most notably from site.pp and modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200 (see the sketch below).
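
A quick way to spot leftover references in a local checkout of the operations/puppet repository (reusing the example hosts above):

 puppet$ git grep -n 'mc102[34]'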

Steps for ANY Opsen

  • A Decommission ticket should be created detailing if system is being decommissioned (and removed from datacenter) or reclaimed (wiped of all services/data and set system as spare for reallocation).
  • System services must be confirmed to be offline. Checking everything needed for this step and documenting it on this specific page is not feasible at this time (but we are working to add them all). Please ensure you understand the full service details and what software configuration files must be modified. This document only lists the generic steps required for the majority of servers.
  • If server is part of a service pool, ensure it is set to false or removed completely from pybal/LVS.
    • Instructions on how to do so are listed on the LVS page.
  • If possible, use tcpdump to verify that no production traffic is hitting the services/ports (see the sketch after this list).
  • If server is part of a service group, there will be associated files for removal or update. The service in question needs to be understood by tech performing the decommission (to the point they know when they can take things offline.) If assistance is needed, please seek out another operations team member to assist.
    • Example: db class machines are in associated db-X.php, memcached in mc.php.
  • Remove server entry from DSH node groups (if any).
    • If the server is part of a service group, common DSH entries are populated from conftool, unless they're proxies or canaries
    • The list of dsh groups is in operations/puppet:hieradata/common/scap/dsh.yaml.
  • Run the sre.hosts.decommission decom script available on the cluster::management hosts (cumin1001.eqiad.wmnet, cumin2002.codfw.wmnet). The cookbook is destructive and will make the host unbootable. This script, unlike sre.hosts.reimage, works for both physical hosts and virtual machines. It will check for remaining occurrences of the hostname or IP in any puppet or DNS files and warn about them. Since the workflow is to remove the host from site.pp and DHCP only after running it, warnings about those are normal; you should check, however, whether it still appears in any other files where it is not expected. The most notable case is an mw appserver that happens to be an mcrouter proxy, which needs to be replaced before decom. The actions performed by the cookbook are:
    • Downtime the host on Icinga (it will be removed at the next Puppet run on the Icinga host)
    • Detect if Physical or Virtual host based on Netbox data.
    • If virtual host (Ganeti VM)
      • Ganeti shutdown (tries OS shutdown first, pulls the plug after 2 minutes)
      • Force Ganeti->Netbox sync of VMs to update its state and avoid Netbox Report errors
    • If physical host
      • Downtime the management host on Icinga (it will be removed at the next Puppet run on the Icinga host)
      • Wipe bootloaders to prevent it from booting again
      • Pull the plug (IPMI power off without shutdown)
      • Update Netbox state to Decommissioning and delete all device interfaces and related IPs but the mgmt one
      • Disable switch interface and remove vlan config in Netbox
    • Remove it from DebMonitor
    • Remove it from Puppet master and PuppetDB
    • If virtual host (Ganeti VM), issue a VM removal that will destroy the VM. This can take a few minutes.
    • Run the sre.dns.netbox cookbook to propagate the DNS changes (or prompt the user for a manual patch if needed) in order to remove the DNS entries for the production network and the hostname management entries, but leave the asset tag mgmt entries at this stage; servers should keep them until they are wiped and unracked.
    • Remove switch port config by running Homer.
    • Update the related Phabricator task
  • Remove all references from Puppet repository:
    • site.pp
    • DHCP config from lease file (modules/install_server/files/dhcpd/linux-host-entries.ttyS... filename changes based on serial console settings)
    • Partman recipe in modules/install_server/files/autoinstall/netboot.cfg
    • All Hiera references both individual and in regex.yaml
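
As mentioned in the list above, a minimal tcpdump check to confirm that no production traffic still reaches the service; the interface name and port are hypothetical, so adjust the filter to the service in question:

 # watch for traffic on the service port, excluding SSH; a silent capture is a good sign
 server# tcpdump -n -i eno1 'port 3306 and not port 22'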

Steps for DC-OPS (with network switch access)

  • Confirm all puppet manifest entries removal, DSH removal, Hiera data removal.
  • Remove host's port config on switch either manually (eqiad) or by running Homer (if not already done above).
    • If manual: Move the switch port to interface-range disabled
    • # show interfaces ge-x/y/z | display inheritance helps identify configuration applied to the port
  • Update associated Phabricator ticket, detailing steps taken and resolution.
    • If system is decommissioned by on-site tech, they can resolve the ticket.
    • If system is reclaimed into spares, the ticket should be assigned to the HW Allocation Tech so they can update the spares list for allocation.

Decommission Specific (can be done by DC Ops without network switch access)

  • A Phabricator ticket for the decommission of the system should be placed in the #decommission project and the appropriate datacenter-specific ops-* project.
  • The decom script can be run by anyone in SRE, but then reassign the task to the local DC Ops engineer to wipe the disks for return to service/spares, or to reset the BIOS and settings and unrack the host for decommissioning.
  • Run the Offline a device with extra actions Netbox script that will set the device in Offline status and delete all its interfaces and associated IP addresses left.
    • To run the script in dry-run mode, uncheck the Commit changes checkbox.
  • Remove its mgmt DNS entries: run the sre.dns.netbox cookbook
  • Unless another system will immediately be placed in the vacated space, please remove all power & network cables from the rack.

Network devices specific

  • SRX only: ensure autorecovery is disabled (see Juniper doc)
  • Wipe the configuration
    • By either running the command request system zeroize media
    • Or Pressing the reset button for 15s
  • Confirm the wipe was successful by logging in to the device via console (root / no password)

Position Assignments

The cycle above references specific positions/assignments without naming individuals. To keep the document generic, the cycle is described in terms of positions, and the people currently holding them are listed here.

  • Buyer / HW Allocation Tech: Rob H (US), Mark B (EU)
  • On-site Tech EQIAD: Chris J
  • On-site Tech CODFW: Papaul T
  • On-site Tech ULSFO: Rob H
  • Director Technical Operations : Mark B
  • Operations Technical Review: Mark B, Faidon L

See also