SLO: Difference between revisions
imported>Wolfgang Kandek No edit summary |
imported>RLazarus m (→Published SLOs) |
(13 intermediate revisions by 7 users not shown)
Line 1: | Line 1: | ||
Service Level Objective (SLO) and Service Level Indicators (SLI) | '''Service Level Objective''' (SLO) and '''Service Level Indicators''' (SLI) | ||
Rationale: We’d love it if all our systems responded instantly and worked 100% of the time, but we also know that’s unrealistic. By choosing specific objectives, we can aim to keep our users happy, and still be able to prioritize other work as long as we’re meeting those objectives. If the performance starts to dip down toward the threshold, we know it’s time to refocus on short-term reliability work and put other things on hold. And by breaking up our complex production landscape into individual services, each with its own | Rationale: We’d love it if all our systems responded instantly and worked 100% of the time, but we also know that’s unrealistic. By choosing specific objectives based on what is important to our users, we can aim to keep our users happy, and still be able to prioritize other work as long as we’re meeting those objectives. If the performance starts to dip down toward the threshold, objectively we know it’s time to refocus on short-term reliability work and put other things on hold. And by breaking up our complex production landscape into individual services, each with its own SLOs, we know where to focus that work. | ||
== Service Level Indicator == | |||
A measurement of a system's behavior that can be used to monitor how well the system is functioning. In the Service Level context, an SLI is ideally expressed as a percentage. | |||
=== Examples === | |||
Examples | |||
* Speed of Response: requests that are handled under a set threshold - "requests that are fulfilled under 500 ms divided by all requests * 100" | * Speed of Response: requests that are handled under a set threshold - "requests that are fulfilled under 500 ms divided by all requests * 100" | ||
Line 12: | Line 11: | ||
* Freshness: pages that are served updated - "pages served are outdated less than 5 seconds/all pages served * 100" | * Freshness: pages that are served updated - "pages served are outdated less than 5 seconds/all pages served * 100" | ||
== Service Level Objective == | |||
Once we have SLIs we can reason about an objective | |||
=== Examples === | |||
Examples | |||
* We want 99% of all requests to be faster than 500 ms | * We want 99% of all requests to be faster than 500 ms | ||
Line 21: | Line 20: | ||
* At most 1% of pages are served outdated | * At most 1% of pages are served outdated | ||
== Published SLOs == | |||
* Introduction to SLOs at the Wikimedia Foundation - [https://docs.google.com/presentation/d/1XJ-FnshjHzY2NkbWEz8xGSY6GOQjO2o_u_-QsML6KyM/edit#slide=id.g18c2b41bb4_0_160 Google Slide Link] | * [[SLO/etcd main cluster|etcd main cluster]] | ||
* Implementing Service Level Objectives - [https://learning.oreilly.com/library/view/implementing-service-level/9781492076803/ O'Reilly book] - [https://docs.google.com/document/d/1rH_SxQuK5hUq3kPFmQ08EblJxOuR-z47GangoM67-FA/edit Sample Chapters] | * [[SLO/API Gateway|API Gateway]] | ||
* [https://sre.google/sre-book/service-level-objectives/ Service Level Objective Intro] in Site Reliability Engineering (SRE) book by Google | * [[SLO/Docker-registry|docker-registry]] (draft) | ||
* [[SLO/Varnish|Varnish caching]] | |||
* [[SLO/logstash|Logstash]] | |||
==SLO reporting == | |||
Quarterly, but offset by one month, i.e. December, January, February. | |||
Example: | |||
[[File:SLOs Q3 FY20-21 Tuning Session.pdf|alt=slide of Tuning Sessions for SLOs|none|thumb|slide of Tuning Sessions for SLOs]] | |||
==External links== | |||
*Introduction to SLOs at the Wikimedia Foundation - [https://docs.google.com/presentation/d/1XJ-FnshjHzY2NkbWEz8xGSY6GOQjO2o_u_-QsML6KyM/edit#slide=id.g18c2b41bb4_0_160 Google Slide Link] | |||
*Implementing Service Level Objectives - [https://learning.oreilly.com/library/view/implementing-service-level/9781492076803/ O'Reilly book] - [https://docs.google.com/document/d/1rH_SxQuK5hUq3kPFmQ08EblJxOuR-z47GangoM67-FA/edit Sample Chapters] | |||
*[https://sre.google/sre-book/service-level-objectives/ Service Level Objective Intro] in Site Reliability Engineering (SRE) book by Google | |||
*[https://queue.acm.org/detail.cfm?id=3096459 Dependency Math] at Google | *[https://queue.acm.org/detail.cfm?id=3096459 Dependency Math] at Google | ||
*[https://blog.acolyer.org/2020/02/26/meaningful-availability/ Meaningful availability] - downtime multiplied by the number of users online | |||
*[https://www.usenix.org/sites/default/files/conference/protected-files/srecon18emea_slides_fong-jones.pdf SLO Workshop] by Google at SRECon 2018 | |||
Service Level Objectives at the foundation | |||
*[https://docs.google.com/document/d/1aDPnadRRZFgGATaZ9rqjWNvvWdS6B4F50flMDVPeT1s/edit#heading=h.al2mmiq515rz Worksheet template] | |||
Additional Keywords: Error Budget, [https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/ RED Method], [https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals four golden signals] | |||
[[Category:SRE Service Operations]] |
Revision as of 00:48, 30 November 2021
Service Level Objective (SLO) and Service Level Indicators (SLI)
Rationale: We’d love it if all our systems responded instantly and worked 100% of the time, but we also know that’s unrealistic. By choosing specific objectives based on what is important to our users, we can aim to keep our users happy, and still be able to prioritize other work as long as we’re meeting those objectives. If the performance starts to dip down toward the threshold, objectively we know it’s time to refocus on short-term reliability work and put other things on hold. And by breaking up our complex production landscape into individual services, each with its own SLOs, we know where to focus that work.
Service Level Indicator
A measurement of a system's behavior that can be used to monitor how well the system is functioning. In the Service Level context, an SLI is ideally expressed as a percentage.
Examples
- Speed of Response: requests that are handled under a set threshold - "requests fulfilled in under 500 ms, divided by all requests * 100"
- Success: requests that are handled error-free - "all requests except those returning code 500, divided by all requests * 100"
- Freshness: pages that are served up to date - "pages served that are less than 5 seconds out of date, divided by all pages served * 100"
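The three example indicators above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Request record and its field names (latency_ms, status, staleness_s) are hypothetical, and a real SLI would normally be computed from metrics or logs rather than from an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request. All field names here are hypothetical."""
    latency_ms: float    # time taken to fulfil the request
    status: int          # HTTP-style return code
    staleness_s: float   # how out of date the served page was, in seconds


def percentage(good: int, total: int) -> float:
    """Good events divided by all events * 100; an empty window counts as 100%."""
    return 100.0 if total == 0 else 100.0 * good / total


def speed_sli(requests, threshold_ms=500):
    """Share of requests fulfilled in under threshold_ms."""
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return percentage(good, len(requests))


def success_sli(requests):
    """Share of requests that did not return code 500."""
    good = sum(1 for r in requests if r.status != 500)
    return percentage(good, len(requests))


def freshness_sli(requests, max_staleness_s=5):
    """Share of pages served less than max_staleness_s out of date."""
    good = sum(1 for r in requests if r.staleness_s < max_staleness_s)
    return percentage(good, len(requests))


# Made-up sample data, purely for illustration.
sample = [
    Request(120, 200, 0.5),
    Request(480, 200, 1.0),
    Request(900, 200, 7.0),
    Request(300, 500, 2.0),
    Request(200, 200, 9.0),
]
print(speed_sli(sample), success_sli(sample), freshness_sli(sample))
# prints: 80.0 80.0 60.0
```

Each function follows the same "good events divided by all events * 100" shape, which is what makes the indicators directly comparable to a percentage objective.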
Service Level Objective
Once we have SLIs, we can reason about an objective: a target value that each SLI should meet.
Examples
- We want 99% of all requests to be faster than 500 ms
- No more than 0.1% of requests result in errors
- At most 1% of pages are served outdated
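As a rough sketch, the example objectives above can be checked against measured SLI percentages, together with the remaining error budget (a term listed under the additional keywords at the end of this page). The names OBJECTIVES, error_budget_remaining and evaluate, and the measured values, are hypothetical and not taken from any Wikimedia tooling. The sketch assumes each objective is restated as a minimum SLI percentage, so "no more than 0.1% errors" becomes a success SLI of at least 99.9%.

```python
# Hypothetical targets mirroring the three example objectives,
# each expressed as the minimum acceptable SLI percentage.
OBJECTIVES = {
    "speed": 99.0,      # 99% of requests faster than 500 ms
    "success": 99.9,    # no more than 0.1% errors
    "freshness": 99.0,  # at most 1% of pages served outdated
}


def error_budget_remaining(sli: float, target: float) -> float:
    """Fraction of the allowed failures still unspent.

    1.0 means the budget is untouched, 0.0 means it is exactly used up,
    and a negative value means the objective has been missed.
    """
    allowed = 100.0 - target          # failures the objective tolerates
    if allowed <= 0:
        raise ValueError("a 100% target leaves no error budget")
    used = 100.0 - sli                # failures actually observed
    return 1.0 - used / allowed


def evaluate(measured: dict) -> dict:
    """Compare measured SLIs (in percent) against OBJECTIVES."""
    report = {}
    for name, target in OBJECTIVES.items():
        sli = measured[name]
        report[name] = {
            "met": sli >= target,
            "budget_remaining": error_budget_remaining(sli, target),
        }
    return report


# Made-up measurements for one reporting window.
measured = {"speed": 99.5, "success": 99.8, "freshness": 99.0}
for name, result in evaluate(measured).items():
    print(name, result["met"], round(result["budget_remaining"], 2))
```

A shrinking budget is the signal described in the rationale: as it approaches zero, it is time to refocus on reliability work for that service.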
Published SLOs
- etcd main cluster
- API Gateway
- docker-registry (draft)
- Varnish caching
- Logstash
SLO reporting
SLOs are reported quarterly, with each reporting quarter offset by one month from the calendar quarter, e.g. December, January, February.
Example: [Figure: slide of Tuning Sessions for SLOs, from "SLOs Q3 FY20-21 Tuning Session.pdf"]
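The offset reporting period can also be illustrated with a small helper. This sketch assumes that every reporting quarter follows the December, January, February pattern (so the others would be March to May, June to August, and September to November); the page only states the first of these, and the function name reporting_quarter is hypothetical.

```python
from datetime import date


def reporting_quarter(day: date) -> list:
    """Return the three (year, month) pairs of the reporting quarter containing day.

    Assumes quarters start in December, March, June and September.
    """
    # Position of the month within its quarter: Dec -> 0, Jan -> 1, Feb -> 2, Mar -> 0, ...
    position = (day.month % 12) % 3
    year, month = day.year, day.month - position
    if month < 1:
        # January and February belong to a quarter that started the previous December.
        year, month = year - 1, month + 12
    months = []
    for _ in range(3):
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


print(reporting_quarter(date(2021, 1, 15)))
# prints: [(2020, 12), (2021, 1), (2021, 2)]
```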
External links
- Introduction to SLOs at the Wikimedia Foundation - Google Slide Link
- Implementing Service Level Objectives - O'Reilly book - Sample Chapters
- Service Level Objective Intro in Site Reliability Engineering (SRE) book by Google
- Dependency Math at Google
- Meaningful availability - downtime multiplied by the number of users online
- SLO Workshop by Google at SRECon 2018
Service Level Objectives at the Foundation
- Worksheet template
Additional Keywords: Error Budget, RED Method, four golden signals