
Difference between revisions of "SRE/Infrastructure Foundations/Ownership"

imported>Jobo
m
 

Revision as of 13:50, 4 June 2021

{| class="wikitable"
! Service !! Category !! Description !! Phabricator tag !! Notes
|-
|[[Install_server]]
|Bare metal Infrastructure
|An ''install server'' consists of DHCP, TFTP, web proxy (Squid) and ''[[apt.wikimedia.org]]'' ([[reprepro]]) servers.
|
|
|-
|[[Ganeti]]
|Bare metal Infrastructure
|A clustered virtual machine management tool built on top of existing virtualization technologies such as Xen or KVM and other open source software. At WMF, KVM is the only enabled hypervisor.
|
|
|-
|[[Puppet]]
|Configuration Management Systems
|Puppet is the main configuration management tool used on the Wikimedia clusters.
<code>puppet agent</code> is the client daemon that runs on all servers and manages each machine using configuration information gathered from <code>puppetmasterd</code>.
|https://phabricator.wikimedia.org/tag/puppet/
|Infrastructure part
|-
|[[PCC]]
|Configuration Management Systems
|PCC is the Puppet compiler. Compilers run the Puppet Server and PuppetDB services, as well as a file sync client. When triggered by a web endpoint, file sync takes changes from the working directory on the primary server and deploys the code to a live code directory. File sync then deploys that code to all compilers.
|
|
|-
|Puppetboard
|Configuration Management Systems
|Puppetboard is a web interface to PuppetDB that aims to replace the reporting functionality of the Puppet Enterprise console.
|
|https://puppetboard.wikimedia.org/
|-
|[[Debmonitor]]
|Configuration Management Systems
|DebMonitor is a Debian package tracker website and tool developed at the Wikimedia Foundation and used to track installed and upgradable packages across the fleet. It consists of the DebMonitor website and the DebMonitor client.
|
|
|-
|[[Homer]]
|Configuration Management Systems
|Homer is our homemade network configuration manager. It takes variables from Netbox and YAML files and runs them through Jinja templates to generate Juniper-compatible configuration. Homer can then send that configuration to selected network devices, either for a diff or for a safe commit.
|https://phabricator.wikimedia.org/tag/homer/
|
|-
|Cookbooks
|Orchestration Tooling
|
|
|
|-
|[[Spicerack]]
|Orchestration Tooling
|Spicerack is a Python library to orchestrate tasks in the Wikimedia Foundation production environment. It comes with an easy API and a cookbook entry point script that allows writing simple [[Spicerack/Cookbooks|Cookbooks]] to automate and orchestrate tasks.
|
|
|-
|[[Server Lifecycle#Reimage|wmf-auto-reimage]]
|Orchestration Tooling
|The <code>wmf-auto-reimage-host</code> (single host) and <code>wmf-auto-reimage</code> (multiple hosts) scripts automate some of the installation/re-image tasks, for example:

* Updates the Phabricator task
* Validates the FQDN of the hosts
* Sets downtime on Icinga
* Depools the hosts via conftool
* Sets the next boot to PXE mode
* Power cycles or powers on the host, depending on its current power state
* Uses the new hostname (if set)
* Runs Puppet once to create the certificate and the signing request to the Puppet master
* Masks all provided systemd units to prevent them from starting automatically during the first Puppet run
* Triggers the first Puppet run
* Runs Puppet on the Icinga host and sets it in downtime
* Reboots
* Checks whether the first Puppet run was successful
* Runs the Netbox script to update the device with its interfaces and related IPs
* Unmasks the masked systemd units
* Runs httpbb if the <code>-a, --httpbb</code> option is set
* Prints the <code>conftool</code> commands to re-pool the host (if <code>-c</code> is set)
|
|
|-
|[[Software deployment|Debdeploy]]
|Orchestration Tooling
|Debdeploy allows the deployment of software updates in Debian (or Debian-based) environments on a large scale. It is based on Cumin; updates are initiated via the debdeploy tool running on the Cumin master. Servers can be grouped into arbitrary sets of servers/services based on the Cumin syntax.
|
|
|-
|[[Conftool]]
|Orchestration Tooling
|Conftool is a set of tools we use to sync and manage the dynamic state configuration for services (the [[varnish]] backends, the [[pybal]] pools, the DNS discovery entries, and some variables in the [[Mediawiki]] configuration). This configuration is stored in the distributed key/value store [[Etcd]].
|
|
|-
|[[Dbctl]]
|Orchestration Tooling
|'''Dbctl''' is a tool based on [[conftool]] that stores MediaWiki's database load balancer configuration in [[etcd]].
In production, the only hosts with dbctl installed are the [[cumin]] cluster management hosts (e.g. [[cumin1001]]).
|
|
|-
|[[Cumin]]
|Orchestration Tooling
|Cumin is an automation and orchestration framework that provides a flexible and scalable way to execute multiple commands on multiple hosts in parallel.
It allows complex selections of hosts through a user-friendly query language that can interface with different backend modules and combine their results for a fine-grained selection. The transport layer can also be selected and can provide multiple execution strategies. The output of the executed commands is automatically grouped for an easy-to-read result.
|
|
|-
|[[Python/Wmflib|Wmflib]]
|Orchestration Tooling
|A Python package that contains custom modules to interact with the WMF production infrastructure.
It can be used in any script throughout the fleet because it does not require any special privileges to run, unlike [[Spicerack]] and its [[Spicerack/Cookbooks|Cookbooks]], and it removes the need to re-implement the same functionality over and over again.
|
|
|-
|[[PKI]]
|Infrastructure security and packaging
|A public key infrastructure is a set of roles, policies, hardware, software and procedures needed to create, manage, distribute, use, store and revoke digital certificates and manage public-key encryption. We currently use CFSSL to provide and manage PKI solutions. Clients can use the CFSSL API endpoint, which requires the Puppet agent certificate. In addition to the client authentication requirement, API requests also need to be signed with an HMAC using a secret key (available in the Puppet private repo).
|
|
|-
|[[CAS-SSO]]
|Infrastructure security and packaging
|The Wikimedia Developer SSO Portal at idp.wikimedia.org is a single sign-on (SSO) infrastructure built on Apereo CAS. When logging into a CAS-enabled website without an active SSO session, you will be redirected to the [https://idp.wikimedia.org/login CAS login page]. The CAS service collects LDAP group memberships and makes them available to services for making authorisation decisions. After authentication, users are redirected to the initiating service.
|https://phabricator.wikimedia.org/tag/cas-sso/
|
|-
|[[Reprepro]]
|Infrastructure security and packaging
|Reprepro can manage multiple repositories for multiple distribution versions in one package pool. It can process updates from an <code>incoming</code> directory, copy package (references) between distribution versions, list all packages and/or package versions available in the repository, etc. Reprepro maintains an internal database (a .DBM file) of the contents of the repository, which makes it quite fast and efficient.
|
|
|-
|[[Cowbuilder]]
|Infrastructure security and packaging
|A module used to populate a Debian/Ubuntu package building environment. It is meant to be used in the Wikimedia environment but could be adapted for other environments as well.
|
|
|-
|[[Netbox]]
|Infrastructure security
|Netbox is an "IP address management (IPAM) and data center infrastructure management (DCIM) tool".
|https://phabricator.wikimedia.org/tag/netbox/
|
|-
|[[Netmon]]
|Infrastructure security
|Netmon is a network monitoring system with high-performance traffic sniffing technology.
|
|
|-
|[[RPKI]]
|Infrastructure security
|Resource Public Key Infrastructure is a public key infrastructure framework to support improved security for the Internet's BGP routing infrastructure. RPKI provides a way to connect Internet number resource information to a trust anchor.
|
|
|-
|[[Cloudflare]]
|Infrastructure security
|Cloudflare's Magic Transit protects IP subnets from DDoS attacks. It uses Cloudflare's global network to mitigate attacks, employing two networking protocols, BGP and GRE, for routing and encapsulation.
|
|
|-
|[[NEL]]
|Infrastructure security
|Network Error Logging is a mechanism that can be configured via the NEL HTTP response header. This header allows websites and applications to opt in to receiving reports about failed (and, if desired, successful) network fetches from supporting browsers.
|
|
|-
|Failoid
|Miscellanea
|
|
|
|}
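
The NEL entry in the table above says that sites opt in through an HTTP response header and can ask for reports on failed and, optionally, successful fetches. The following is a minimal sketch of what such headers could look like, based on the field names in the W3C Network Error Logging draft (<code>report_to</code>, <code>max_age</code>, <code>failure_fraction</code>, <code>success_fraction</code>). The function name, the group name and the reporting URL are hypothetical and are not taken from the Wikimedia configuration.

```python
import json


def build_nel_headers(group, endpoint_url, max_age=86400,
                      failure_fraction=1.0, success_fraction=0.0):
    """Return the Report-To and NEL response headers as a dict of strings.

    group and endpoint_url are placeholders chosen by the caller; nothing
    here reflects the values used in production.
    """
    # Report-To names a reporting group and the endpoint(s) that collect reports.
    report_to = {
        "group": group,
        "max_age": max_age,
        "endpoints": [{"url": endpoint_url}],
    }
    # NEL points at that group and sets the sampling rate for failed
    # and successful fetches (each a fraction between 0.0 and 1.0).
    nel = {
        "report_to": group,
        "max_age": max_age,
        "failure_fraction": failure_fraction,
        "success_fraction": success_fraction,
    }
    return {
        "Report-To": json.dumps(report_to),
        "NEL": json.dumps(nel),
    }


if __name__ == "__main__":
    # Hypothetical group name and collector URL, for illustration only.
    headers = build_nel_headers("nel-example", "https://reports.example.org/nel")
    for name, value in headers.items():
        print(f"{name}: {value}")
```

With the defaults above, every failed fetch is reported and no successful ones; raising <code>success_fraction</code> above zero is what "if desired, successful" corresponds to in the description.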