You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Difference between revisions of "SRE"

From Wikitech-static
imported>Jforrester
(Make the klaxon link really obvious and prominent.)
imported>LSobanski
(Updating the Data Persistence description)
 
(7 intermediate revisions by 5 users not shown)
Line 1: Line 1:
'''TL;DR''':
== '''Site Reliability Engineering''' (SRE) ==
The team is responsible for developing and maintaining Wikimedia's production infrastructure. Previously known as Technical Operations, they are in charge of making sure all Wikimedia's sites and services used by the general public (including MediaWiki and all associated services) run reliably, securely, and with high performance.


* If you need help from SRE and it is an emergency, you can page us via https://klaxon.wikimedia.org.
* If you need help from SRE and it is an '''emergency''', you can page us via '''https://klaxon.wikimedia.org<nowiki/>.'''
* If it is not an emergency, but you do not know which team is responsible for your question, just open a generic task on Phabricator in the [https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=sre SRE project] and our Clinic Duty person of the week will route it. If it is more urgent or you just need a quick check, you can find us on IRC: {{Irc|wikimedia-sre|}}.
* If it is not an emergency, but you '''do not know which team''' is responsible for your question, just open a generic task on Phabricator in the [https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=sre '''SRE project'''] and our [[SRE/Clinic Duty|Clinic Duty]] engineer of the week will route it.


'''Site Reliability Engineering''' (SRE) is responsible for availability, performance, monitoring, emergency response, infrastructure security, and capacity planning, plus the maintenance of software used for that purpose. This is similar to what in many other organizations is handled by an Operations or System Administration team. SRE treats computer operations as a software problem and applies automation wherever possible. The Foundation has a number of sub-teams within SRE, each responsible for different areas. Check [[SRE/SRE Team requests|SRE Team Requests]] to see how to get in touch with those teams, and see [[mw:Wikimedia Site Reliability Engineering]] for a more detailed team structure.
* If it is more urgent or you just need a '''quick check''', you can find us on IRC: {{Irc|wikimedia-sre|}}.


* [[Dc-operations|SRE Data Center Operations]] - all things related to Data Centers, hardware maintenance and purchases
==== The Foundation has a number of sub-teams within SRE, each responsible for different areas: ====
* [[SRE/Data Persistence|SRE Data Persistence]] - Databases and Object storage (MariaDB and Swift)
<div class="mw-collapsible mw-collapsed"><div class="mw-collapsible-toggle toccolours" style="float:none;text-align:left;font-size: 1.2em;background:#efefef;border:0px solid #9c3434;border-top:10px solid #ffffff;border-bottom:5px solid #9c3434">'''SRE Data Center Operations'''<div class="floatright">▼</div></div> <div class="mw-collapsible-content">
* [[SRE/Infrastructure Foundations|SRE Infrastructure Foundations]] - Automation and Networking (cumin, netbox, puppet, spicerack)
<div style="border:4px solid #FFFFFF;background:#F7F7F7;padding: 10px">
* [[Observability|SRE Observability]] - Monitoring and Logging (Prometheus/Grafana and Elasticsearch, plus some Kafka)
{|class="sortable"
* [[SRE/Service Operations|SRE Service Operations]] - MediaWiki Operations and Supporting Services (Kubernetes, memcached, redis, Infrastructure for: GitLab, OTRS, Phabricator)
!
* [[Traffic|SRE Traffic]] - Caching and DNS (ATS, varnish, GeoDNS, wikidough)
[[SRE/Dc-operations|SRE Data Center Operations]] - all things related to Data Centers, hardware maintenance and purchases.


The Data Center Operations team is responsible for all of Wikimedia’s data center deployments and logistics as well as maintaining our presence in locations across the world. They perform on-site work and maintain the full 5-year life cycle (specs, purchasing, physical install, break/fix and decommissioning) for all hardware.


References:
{{Irc|wikimedia-dcops|}}
|-
|}
</div></div></div>
 
<div class="mw-collapsible mw-collapsed"><div class="mw-collapsible-toggle toccolours" style="float:none;text-align:left;font-size: 1.2em;background:#efefef;border:0px solid #9c3434;border-top:10px solid #ffffff;border-bottom:5px solid #9c3434">'''SRE Data Persistence'''<div class="floatright">▼</div></div> <div class="mw-collapsible-content">
<div style="border:4px solid #FFFFFF;background:#F7F7F7;padding: 10px">
{|class="sortable"
!
[[SRE/Data Persistence|SRE Data Persistence]] - Databases, Backups and Object storage (MariaDB, Bacula, Swift).
 
The Data Persistence team focuses on Wikimedia’s persistent data storage and retrieval systems, including RDBMS, backup systems and (distributed) object storage.
 
{{Irc|wikimedia-data-persistence|}}
|-
|}
</div></div></div>
 
<div class="mw-collapsible mw-collapsed"><div class="mw-collapsible-toggle toccolours" style="float:none;text-align:left;font-size: 1.2em;background:#efefef;border:0px solid #9c3434;border-top:10px solid #ffffff;border-bottom:5px solid #9c3434">'''SRE Infrastructure Foundations'''<div class="floatright">▼</div></div> <div class="mw-collapsible-content">
<div style="border:4px solid #FFFFFF;background:#F7F7F7;padding: 10px">
{|class="sortable"
!
[[SRE/Infrastructure Foundations|SRE Infrastructure Foundations]] - Automation and Networking (cumin, netbox, puppet, spicerack).
 
The team focuses on building and maintaining our base platform (“metal cloud”) that forms the foundations which nearly everything else in our infrastructure builds upon. On top of our bare metal deployments, their responsibilities include (but are not limited to) configuration management systems, infrastructure automation, orchestration tooling, infrastructure security and network operations.
 
{{Irc|wikimedia-sre-foundations|}}
 
|-
|}
</div></div></div>
 
<div class="mw-collapsible mw-collapsed"><div class="mw-collapsible-toggle toccolours" style="float:none;text-align:left;font-size: 1.2em;background:#efefef;border:0px solid #9c3434;border-top:10px solid #ffffff;border-bottom:5px solid #9c3434">'''SRE Observability'''<div class="floatright">▼</div></div> <div class="mw-collapsible-content">
<div style="border:4px solid #FFFFFF;background:#F7F7F7;padding: 10px">
{|class="sortable"
!
[[SRE/Observability|SRE Observability]] - Monitoring and Logging (Prometheus/Grafana and Elasticsearch, plus some Kafka).
 
The Observability team, or "o11y" for short, works across SRE and Technology to provide teams with tools, platforms and insights into how systems and services are performing. It leverages technologies such as Grafana, Kibana/Logstash, Prometheus, AlertManager and more.
 
{{Irc|wikimedia-observability|}}
 
|-
|}
</div></div></div>
 
<div class="mw-collapsible mw-collapsed"><div class="mw-collapsible-toggle toccolours" style="float:none;text-align:left;font-size: 1.2em;background:#efefef;border:0px solid #9c3434;border-top:10px solid #ffffff;border-bottom:5px solid #9c3434">'''SRE Service Operations'''<div class="floatright">▼</div></div> <div class="mw-collapsible-content">
<div style="border:4px solid #FFFFFF;background:#F7F7F7;padding: 10px">
{|class="sortable"
!
[[SRE/Service Operations|SRE Service Operations]] - MediaWiki Operations and Supporting Services (Kubernetes, memcached, redis, Infrastructure for: GitLab, OTRS, Phabricator).
 
The Service Operations team takes care of public and “user-visible” services alongside Technology and Product teams. This means, for example, our MediaWiki platform, but also the newer (micro)services that comprise our stack. It also includes miscellaneous services and components that we rely upon (think Phabricator, mail systems, OTRS, etc.). The team is also building our new SOA service infrastructure based on Kubernetes.
 
{{Irc|wikimedia-serviceops|}}
|-
|}
</div></div></div>
 
<div class="mw-collapsible mw-collapsed"><div class="mw-collapsible-toggle toccolours" style="float:none;text-align:left;font-size: 1.2em;background:#efefef;border:0px solid #9c3434;border-top:10px solid #ffffff;border-bottom:5px solid #9c3434">'''SRE Traffic'''<div class="floatright">▼</div></div> <div class="mw-collapsible-content">
<div style="border:4px solid #FFFFFF;background:#F7F7F7;padding: 10px">
{|class="sortable"
!
[[Traffic|SRE Traffic]] - Caching and DNS (ATS, varnish, GeoDNS, wikidough).
 
The Traffic team is responsible for the critical first layer of high-traffic infrastructure which now spans much of the globe, including our TLS termination and caching layers (ATS, Varnish), load balancing, DNS and our own network.
 
{{Irc|wikimedia-traffic|}}
|-
|}
</div></div></div>
 
 
Check [[SRE/SRE Team requests|'''SRE Team Requests''']] to see the '''most common''' types of '''requests'''.
 
References:  


* [https://how.complexsystems.fail/ How complex systems fail] This is where SRE works
* [https://sre.google/books Google's SRE books] Google formalized many of the concepts and coined the term SRE

Latest revision as of 08:44, 6 September 2021

Site Reliability Engineering (SRE)

The team is responsible for developing and maintaining Wikimedia's production infrastructure. Previously known as Technical Operations, they are in charge of making sure all Wikimedia's sites and services used by the general public (including MediaWiki and all associated services) run reliably, securely, and with high performance.

  • If you need help from SRE and it is an emergency, you can page us via https://klaxon.wikimedia.org.
  • If it is not an emergency, but you do not know which team is responsible for your question, just open a generic task on Phabricator in the SRE project and our Clinic Duty engineer of the week will route it.

The Foundation has a number of sub-teams within SRE, each responsible for different areas:

SRE Data Center Operations

SRE Data Center Operations - all things related to Data Centers, hardware maintenance and purchases.

The Data Center Operations team is responsible for all of Wikimedia’s data center deployments and logistics as well as maintaining our presence in locations across the world. They perform on-site work and maintain the full 5-year life cycle (specs, purchasing, physical install, break/fix and decommissioning) for all hardware.

IRC: #wikimedia-dcops

SRE Data Persistence

SRE Data Persistence - Databases, Backups and Object storage (MariaDB, Bacula, Swift).

The Data Persistence team focuses on Wikimedia’s persistent data storage and retrieval systems, including RDBMS, backup systems and (distributed) object storage.

IRC: #wikimedia-data-persistence

SRE Infrastructure Foundations

SRE Infrastructure Foundations - Automation and Networking (cumin, netbox, puppet, spicerack).

The team focuses on building and maintaining our base platform (“metal cloud”) that forms the foundations which nearly everything else in our infrastructure builds upon. On top of our bare metal deployments, their responsibilities include (but are not limited to) configuration management systems, infrastructure automation, orchestration tooling, infrastructure security and network operations.

IRC: #wikimedia-sre-foundations

SRE Observability

SRE Observability - Monitoring and Logging (Prometheus/Grafana and Elasticsearch, plus some Kafka).

The Observability team, or "o11y" for short, works across SRE and Technology to provide teams with tools, platforms and insights into how systems and services are performing. It leverages technologies such as Grafana, Kibana/Logstash, Prometheus, AlertManager and more.

IRC: #wikimedia-observability

SRE Service Operations

SRE Service Operations - MediaWiki Operations and Supporting Services (Kubernetes, memcached, redis, Infrastructure for: GitLab, OTRS, Phabricator).

The Service Operations team takes care of public and “user-visible” services alongside Technology and Product teams. This means, for example, our MediaWiki platform, but also the newer (micro)services that comprise our stack. It also includes miscellaneous services and components that we rely upon (think Phabricator, mail systems, OTRS, etc.). The team is also building our new SOA service infrastructure based on Kubernetes.

IRC: #wikimedia-serviceops

SRE Traffic

SRE Traffic - Caching and DNS (ATS, varnish, GeoDNS, wikidough).

The Traffic team is responsible for the critical first layer of high-traffic infrastructure which now spans much of the globe, including our TLS termination and caching layers (ATS, Varnish), load balancing, DNS and our own network.

IRC: #wikimedia-traffic


Check SRE Team Requests to see the most common types of requests.

References: