SLO

imported>Wolfgang Kandek

Revision as of 16:46, 22 January 2021

Service Level Objectives (SLOs) and Service Level Indicators (SLIs)

Rationale: We’d love it if all our systems responded instantly and worked 100% of the time, but we also know that’s unrealistic. By choosing specific objectives, we can aim to keep our users happy and still prioritize other work as long as we’re meeting those objectives. If performance starts to dip toward the threshold, we know it’s time to refocus on short-term reliability work and put other things on hold. And by breaking up our complex production landscape into individual services, each with its own Service Level Objective, we know where to focus that work.


Service Level Indicator: a measurement of a system's behavior that can be used to monitor how well the system functions. In the Service Level context, it is ideally expressed as a percentage.

Examples:

  • Speed of Response: requests that are handled under a set threshold - "requests that are fulfilled in under 500 ms, divided by all requests, * 100"
  • Success: requests that are handled error free - "all requests except those with return code 500, divided by all requests, * 100"
  • Freshness: pages that are served up to date - "pages served that are less than 5 seconds out of date, divided by all pages served, * 100"
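
The three example indicators all follow the same "good events divided by all events, times 100" pattern. A minimal Python sketch of that arithmetic is shown here; the `Request` record, its field names, and the choice to treat an empty measurement window as 100% are illustrative assumptions, not part of any Wikimedia tooling.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request. Field names are hypothetical, for illustration only."""
    latency_ms: float   # time taken to fulfil the request
    status_code: int    # HTTP return code
    staleness_s: float  # how many seconds out of date the served page was


def percentage(good: int, total: int) -> float:
    """Return good / total * 100; an empty window counts as 100 (assumption)."""
    return 100.0 if total == 0 else good / total * 100


def speed_sli(requests, threshold_ms=500):
    """Share of requests fulfilled in under threshold_ms."""
    return percentage(sum(r.latency_ms < threshold_ms for r in requests), len(requests))


def success_sli(requests):
    """Share of requests that did not return code 500."""
    return percentage(sum(r.status_code != 500 for r in requests), len(requests))


def freshness_sli(requests, max_staleness_s=5):
    """Share of pages served that were less than max_staleness_s out of date."""
    return percentage(sum(r.staleness_s < max_staleness_s for r in requests), len(requests))


# Made-up sample data: one slow request, one stale page, one error.
sample = [
    Request(latency_ms=120, status_code=200, staleness_s=0.5),
    Request(latency_ms=480, status_code=200, staleness_s=7.0),
    Request(latency_ms=900, status_code=200, staleness_s=1.0),
    Request(latency_ms=300, status_code=500, staleness_s=0.0),
]

print(speed_sli(sample))      # 75.0
print(success_sli(sample))    # 75.0
print(freshness_sli(sample))  # 75.0
```

In practice these counts would come from a metrics system rather than a list of individual requests; the list is used here only to keep the example self-contained.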


Service Level Objective: a target value for an SLI. Once we have SLIs, we can reason about the objective each one should meet.

Examples:

  • We want 99% of all requests to be faster than 500 ms
  • No more than 0.1% of requests result in errors
  • At most 1% of pages are served outdated
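
Each of these objectives can be restated as a minimum "good" percentage for the matching indicator, which makes checking them a simple comparison. The sketch below assumes that restatement; the names `SLO_TARGETS` and `evaluate` and the measured values are hypothetical and chosen only for illustration.

```python
# The three example objectives, each rewritten as a minimum "good" percentage.
SLO_TARGETS = {
    "speed": 99.0,      # 99% of requests faster than 500 ms
    "success": 99.9,    # no more than 0.1% errors
    "freshness": 99.0,  # at most 1% of pages served outdated
}


def evaluate(slis: dict) -> dict:
    """Map each objective name to True if the measured SLI meets its target."""
    return {name: slis[name] >= target for name, target in SLO_TARGETS.items()}


# Made-up measurements for one reporting window.
measured = {"speed": 99.4, "success": 99.85, "freshness": 99.0}
results = evaluate(measured)

for name, ok in results.items():
    print(f"{name}: {'met' if ok else 'missed'}")
# speed: met
# success: missed
# freshness: met
```

A missed objective, such as `success` in this sample, is the signal described in the rationale for refocusing on short-term reliability work.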

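Links

  • Introduction to SLOs at the Wikimedia Foundation - Google Slides: https://docs.google.com/presentation/d/1XJ-FnshjHzY2NkbWEz8xGSY6GOQjO2o_u_-QsML6KyM/edit#slide=id.g18c2b41bb4_0_160
  • Implementing Service Level Objectives - O'Reilly book: https://learning.oreilly.com/library/view/implementing-service-level/9781492076803/ - Sample Chapters: https://docs.google.com/document/d/1rH_SxQuK5hUq3kPFmQ08EblJxOuR-z47GangoM67-FA/edit
  • Service Level Objective Intro in the Site Reliability Engineering (SRE) book by Google: https://sre.google/sre-book/service-level-objectives/
  • Dependency Math at Google: https://queue.acm.org/detail.cfm?id=3096459


Existing Service Level Objectives

  • Worksheet etcd SLO (wiki page: SLO/worksheet etcd SLO)
  • Worksheet API Gateway


Additional Keywords: Error Budget, RED Method (https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/), four golden signals (https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals)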