Difference between revisions of "SRE"

imported>Quiddity
(fix w:mystery meat navigation links ("click [here]"), and fix internal links, c/e)
imported>Wolfgang Kandek
Line 1: Line 1:
TL;DR: If you need help from SRE but do not know which team is responsible for your question, just open a generic task on Phabricator in the [https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=sre SRE project] and our Clinic Duty person of the week will route it.  
TL;DR: If you need help from SRE but do not know which team is responsible for your question, just open a generic task on Phabricator in the [https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=sre SRE project] and our Clinic Duty person of the week will route it.  


Site Reliability Engineering (SRE) is responsible for availability, performance, monitoring, emergency response, infrastructure security, and capacity planning, plus the maintenance of software used for that purpose. This is similar to what in many other organizations is handled by an Operations or System Administration team. SRE treats computer operations as a software problem and applies automation wherever possible. The Foundation has a number of sub-teams within SRE, each responsible for different areas. Check [[SRE Team requests]] to see how to get in touch with those teams, and see [[mw:Wikimedia Site Reliability Engineering]] for a more detailed team structure.
Site Reliability Engineering (SRE) is responsible for availability, performance, monitoring, emergency response, infrastructure security, and capacity planning, plus the maintenance of software used for that purpose. This is similar to what in many other organizations is handled by an Operations or System Administration team. SRE treats computer operations as a software problem and applies automation wherever possible. The Foundation has a number of sub-teams within SRE, each responsible for different areas. Check [[SRE/SRE Team requests|SRE Team Requests]] to see how to get in touch with those teams, and see [[mw:Wikimedia Site Reliability Engineering]] for a more detailed team structure.


* [[Dc-operations|SRE Data Center Operations]] - all things related to Data Centers, hardware maintenance and purchases
* [[Dc-operations|SRE Data Center Operations]] - all things related to Data Centers, hardware maintenance and purchases
* [[SRE/Data Persistence]] - Databases and Object storage (MariaDB and Swift)
* [[SRE/Data Persistence|SRE Data Persistence]] - Databases and Object storage (MariaDB and Swift)
* [[SRE/Infrastructure Foundations]] - Automation and Networking (cumin, netbox, puppet, spicerack)
* [[SRE/Infrastructure Foundations|SRE Infrastructure Foundations]] - Automation and Networking (cumin, netbox, puppet, spicerack)
* [[Observability|SRE Observability]] - Monitoring and Logging (Prometheus/Grafana and ElasticSearch, plus some Kafka)
* [[Observability|SRE Observability]] - Monitoring and Logging (Prometheus/Grafana and ElasticSearch, plus some Kafka)
* [[SRE/Service Operations]] - MediaWiki Operations and Supporting Services (Kubernetes, memcached, redis, Infrastructure for: Gitlab, OTRS, Phabricator)
* [[SRE/Service Operations|SRE Service Operations]] - MediaWiki Operations and Supporting Services (Kubernetes, memcached, redis, Infrastructure for: Gitlab, OTRS, Phabricator)
* [[Traffic|SRE Traffic]] - Caching and DNS (ATS, varnish, GenDNS, wikidough)
* [[Traffic|SRE Traffic]] - Caching and DNS (ATS, varnish, GeoDNS, wikidough)





Revision as of 20:49, 9 June 2021

TL;DR: If you need help from SRE but do not know which team is responsible for your question, just open a generic task on Phabricator in the SRE project and our Clinic Duty person of the week will route it.

Site Reliability Engineering (SRE) is responsible for availability, performance, monitoring, emergency response, infrastructure security, and capacity planning, plus the maintenance of software used for that purpose. This is similar to what in many other organizations is handled by an Operations or System Administration team. SRE treats computer operations as a software problem and applies automation wherever possible. The Foundation has a number of sub-teams within SRE, each responsible for different areas. Check SRE Team Requests to see how to get in touch with those teams, and see mw:Wikimedia Site Reliability Engineering for a more detailed team structure.

