SLO/template instructions

A service level objective (SLO) is an agreement, among everyone who works on a service, about how reliable that service needs to be.

Without such an agreement, different people may have different implicit notions of how much latency is tolerable, how many errors justify getting someone out of bed, or how much instability justifies rolling back a release. By defining a quantitative, measurable set of targets, and then measuring our performance against those targets, we ensure that we're on the same page.

At the same time, it allows teams to reason about the services they depend on. Suppose a frontend promises 100 ms responses at the 95th percentile, but it passes each request to a backend promising 500 ms at the 95th. The SLOs are incompatible -- the frontend can't reasonably promise to respond faster than the backend! But in the absence of explicitly-agreed SLOs, such mismatches in implicit expectations are common.

Thus an SLO is also an agreement between services. It's an agreement not only among the teams invested in a service, but also with the teams that depend on it. As a result, it's important that SLOs are consistently met. We should only miss an SLO due to truly unexpected circumstances -- and if that happens, we take it seriously and prioritize reliability fixes over other work, so that it doesn't happen again.

An SLO is only valuable if it can be meaningfully relied on; an SLO that's regularly not upheld is worse than none at all. Therefore, in writing your SLO, err on the side of caution. Set unambitious targets that you know you'll be able to meet right away, and plan to tighten them later if you like -- rather than failing to meet your initial goals and loosening them until they're achievable.

Make a copy of SLO/template for your service, and fill it out section by section. This page has details on how to address each of the questions in the template -- each section will take some research, discussion, and sometimes negotiation. The 📝 symbol is the instruction to return to the SLO document and write up your findings.

Organizational

What is the service?

📝 Name it, and link to appropriate documentation. Define it clearly -- which binary or binaries are covered? If some hosts or clusters are covered but not others, be explicit. (If some requests aren't covered, we'll cover that later.)

Who are the responsible teams?

An SLO is generally an agreement between two or more teams (though a team might still benefit when solely operating a service for its own use -- and thus acting as both the service provider and the user). In order for the agreement to be useful, representatives from each team should be involved in drafting it, and must be involved in finalizing it.

The teams listed here should include everyone involved in handling or preventing incidents in the service: the people who receive production alerts, the people who write the code, and the people who build and deploy new releases. It's generally not reasonable to expect a team to reprioritize their own work to meet an SLO that was written without their participation.

Representatives from client services may also be consulted, to provide input on their service's needs.

📝 List all the responsible teams. You're probably already talking to all of them -- but if not, start that conversation before continuing to the next section.

Architectural

Where does the service run?

Any service can only be as reliable as the platform on which it operates -- but with proper redundancy, that platform can be much more reliable than any single machine. While none of these platforms has an SLO of its own, you should take these rough availability assumptions into account.

For services that run on bare metal

[TODO]

For services that run on Ganeti virtual machines

[TODO]

For services that run on Kubernetes

[TODO]

What are the service's dependencies?

Definition: Your service has a dependency on another service if your service can't work correctly when that service isn't working.

Hard and soft dependencies

Every dependency is either a hard dependency or soft dependency. If any of your hard dependencies is completely broken, then your service is completely broken. If a soft dependency is completely broken, then your service operates in a degraded mode, offering either reduced features or reduced performance.

Examples: MediaWiki has a hard dependency on MariaDB; if the core database is unavailable, MediaWiki can only serve errors. But MediaWiki has a soft dependency on Thumbor: for the duration of a Thumbor outage, thumbnails won't appear for images newly added to wiki articles, but everything else will work normally.

Because your service can't work when a hard dependency is broken, it's impossible for your availability to be higher than theirs. If you're waiting for a response from them in order to serve a response of your own, it's impossible for your latency to be lower than theirs. Thus, your dependencies' SLOs create a boundary on what yours can be.
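
As a rough illustration of that boundary, the sketch below computes a ceiling on availability from a list of hard dependencies. The dependency names and figures are hypothetical, and the second function additionally assumes that dependency failures are independent of one another, which real systems often violate.

```python
from math import prod


def availability_ceiling(hard_dependency_availabilities):
    """Strict ceiling: no better than the least available hard dependency."""
    return min(hard_dependency_availabilities)


def availability_if_independent(hard_dependency_availabilities):
    """Fraction of time that all hard dependencies are up at once.

    Assumes failures are independent (an assumption, not a given).
    """
    return prod(hard_dependency_availabilities)


# Hypothetical hard dependencies and their availability targets.
hard_dependencies = {"database": 0.999, "session-store": 0.9995}

ceiling = availability_ceiling(hard_dependencies.values())
estimate = availability_if_independent(hard_dependencies.values())
print(f"ceiling: {ceiling:.4%}, if independent: {estimate:.4%}")
```

Neither number accounts for your own service's failures, so the target you can actually support will be lower still.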

Direct dependencies and proxies

The most common type of dependency is when your service sends a request to another service, and uses the response to do its own work. If it serves you an error, or exceeds its latency deadline, you have no choice but to serve an error yourself (hard dependency) or do your best without it (soft dependency). We'll call this type of relationship a direct dependency.

However, not every client-server relationship creates a dependency. If your service is a proxy, then it's doing its job correctly when it faithfully proxies an error message. (There's still an end user having an unsatisfactory experience, so error budgets should be consumed both upstream and downstream of your proxy, but the proxy itself is healthy.)

Distinguishing between a direct dependency and a proxy relationship can be nontrivial, since some proxies cache, mutate, or otherwise act on the response. To tell the difference between the two, ask whether your client cares where you send your traffic. MediaWiki could replace Thumbor with some other thumbnail-generating system, and MediaWiki's clients wouldn't mind as long as thumbnails continued to work. But clients of Envoy, Varnish, or ATS all have a specific destination in mind for their traffic.

This distinction affects your choice of SLIs. In a direct dependency, every error you serve to your users counts against your error budget, even if the error was your dependency's "fault," and likewise time spent waiting for your dependency counts as latency to your users. If this makes it impossible to meet your user-driven SLOs, you may need to reconsider your architecture: this service may be insufficiently reliable to depend on.

But for a proxy, a more reasonable SLI for error rate might be "percentage of requests which yielded an error response not proxied from the backend," and the latency SLI might exclude the time spent waiting for the backend. These comparatively forgiving definitions are offset by much tighter targets: since your backend's errors don't count against your error budget, you shouldn't need as large a budget.
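
To make the difference concrete, here is a minimal sketch that scores the same traffic both ways. The `ProxiedRequest` record and its `error_from_backend` flag are hypothetical; it assumes your proxy's logs can tell a passed-through backend error apart from one the proxy generated itself, and it treats any status of 500 or above as an error.

```python
from dataclasses import dataclass


@dataclass
class ProxiedRequest:
    status: int               # status code returned to the client
    error_from_backend: bool  # True if an error was faithfully passed through


def direct_availability(requests):
    """Percentage of requests with a non-error response, whoever caused the error."""
    good = sum(1 for r in requests if r.status < 500)
    return 100 * good / len(requests)


def proxy_availability(requests):
    """Percentage of requests that succeeded or carried a backend-originated error."""
    good = sum(1 for r in requests if r.status < 500 or r.error_from_backend)
    return 100 * good / len(requests)


# Hypothetical traffic: 95 successes, 4 backend errors, 1 proxy-originated error.
traffic = (
    [ProxiedRequest(200, False)] * 95
    + [ProxiedRequest(500, True)] * 4
    + [ProxiedRequest(503, False)]
)
print(direct_availability(traffic), proxy_availability(traffic))
```

With this sample, only the single proxy-originated error counts against the proxy's SLI, which is why its target can be much tighter.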

Indirect dependencies

Remember, a dependency is when your service can't work correctly when another service isn't working. It's possible for this relationship to exist even if your service doesn't send requests to the other, which we call an indirect dependency. One form of indirect dependency is a capacity cache, where a cache enables you to operate your service with less hardware by deduplicating work.

Example: Consider the ATS backend cache, which acts as a capacity cache in front of the application servers. Over 90% of incoming web requests are handled in the CDN without being proxied to MediaWiki at all, and as a result the app server fleet is only provisioned to serve a small fraction of the total load. If an incident in ats-be led to all traffic being forwarded -- for example, imagine a buggy ATS configuration that treats too many kinds of requests as "pass," i.e. uncacheable -- the resulting avalanche of traffic would overwhelm the app servers, causing a complete outage.
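
The size of that avalanche follows directly from the cache hit ratio. The sketch below is illustrative arithmetic only; the request counts are made up, with a 90% hit ratio standing in for the "over 90%" figure above.

```python
def backend_load_multiplier(total_requests, cache_hits):
    """How many times more traffic the backend sees if the cache stops absorbing hits."""
    passed_through = total_requests - cache_hits
    if passed_through <= 0:
        raise ValueError("the backend must normally receive at least some requests")
    return total_requests / passed_through


# Hypothetical sample: 1000 incoming requests, 900 of them served from cache.
multiplier = backend_load_multiplier(total_requests=1000, cache_hits=900)
print(f"backend load would grow {multiplier:.0f}x if every request were passed through")
```

The higher the normal hit ratio, the larger the multiplier, and the less likely it is that the backend has spare capacity to absorb it.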

Thus the application layer's ability to serve correctly depends on the CDN layer doing the right thing. Surprising conclusion: even though ATS doesn't have a direct dependency on MediaWiki (due to being a proxy), MediaWiki has an indirect dependency on ATS! But that's okay: a failure in that form would be catastrophic, but is sufficiently low-probability -- or, in other words, ATS's objective for "percentage of cacheable requests not cached," if it had such an SLI, is sufficiently strict -- that this isn't a significant concern relative to MediaWiki's other failure modes.

📝 List all dependencies, including links to their respective SLOs if applicable. You don't need to list your dependencies' dependencies, unless you also depend on them directly. For soft dependencies, also describe the expected degradation of service during their unavailability.

Client-facing

Who are the service's clients?

If all the clients are other internal WMF services, they should be identifiable. If some or all clients are external users (human or automated), characterize them in as much detail as possible, for the purpose of assessing their reliability needs.

For complicated services with a variety of distinct use cases, one way to catalogue your clients is to make a list of user journeys -- that is, scenarios like "a user logging into their account" or "a user editing an article" -- that eventually depend on your service's functionality. Then break down that list of user journeys by identifying, in each case, what piece of software directly contacts your service to play its role. ("For logged-in page views, service X calls us to fetch a key, but for edits, service Y fetches the key from us and then service Z writes it back.")

📝 List, or characterize, the service's clients. You don't need to list your clients' clients, unless they also depend on you directly.

What are the request classes?

Some services receive more than one kind of request, where each is subject to a different SLO. For example, read requests may have better latency guarantees than write requests.

Some classes of requests might be ineligible for any SLO guarantees, such as batch requests over a certain size. (These requests aren't invalid -- it goes without saying that malformed requests will be served errors. SLO-ineligible requests might still be served in a "best effort" fashion -- no guarantees, but in practice they'll often work.)

Ideally, the classes should be constructed such that a request can be classified based only on the request -- not on the response or server state. In other words, a client should be able to classify a request before sending it. In practice, this isn't always possible, and that's okay.
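
A classifier that meets this ideal is a pure function of the request. The sketch below is one hypothetical way to write it: the class names, the HTTP-method rule, and the batch-size threshold are all illustrative assumptions, not values from any real service.

```python
from dataclasses import dataclass

# Hypothetical cutoff: batches larger than this get best-effort service only.
MAX_ELIGIBLE_BATCH_SIZE = 100


@dataclass(frozen=True)
class IncomingRequest:
    method: str
    batch_size: int = 1


def classify(request):
    """Assign a request class using only fields the client knows before sending."""
    if request.batch_size > MAX_ELIGIBLE_BATCH_SIZE:
        return "ineligible"
    if request.method in ("GET", "HEAD"):
        return "read"
    return "write"


print(classify(IncomingRequest("GET")))
print(classify(IncomingRequest("POST", batch_size=500)))
```

Because nothing here depends on the response or on server state, a client can run the same logic to predict which guarantees apply to a request.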

As above, one way to catalogue your request classes is to make a list of user journeys and identify what types of request are made in each case. (In the above example, maybe the fetches from services X and Y are functionally the same, making up one request class, and the write from service Z belongs to another. Or maybe the fetch from service Y requires a freshness guarantee and consequently has a longer latency deadline.)

📝 If your service has only one request class, you can delete this section. Otherwise, list all request classes, and the criteria for determining what class a request belongs to. If any classes are ineligible for the SLO as described above, label them.

Write the service level indicators

Service level indicators (SLIs) are the metrics you'll use to evaluate your service's performance. An example of an SLI is "Percentage of requests that receive a non-5xx response." The full set of SLIs, combined with numeric targets that we'll select later on, makes up the SLO.

SLIs should be:

  • directly client-visible. Measure symptoms, not causes: each SLI should reflect the client service's (or user's, if user-facing) perception of service performance. It should be impossible for an SLI to significantly worsen without any clients observing a degradation in service.
  • comprehensive. It should be impossible for a client to observe a degradation in service without any SLIs significantly worsening, unless some element of the service is intentionally not covered by the SLO. (For example, a logs-processing system could measure the percentage of items processed eventually but make no guarantees about how quickly, in which case processing latency might not be one of its SLIs.)
  • under your control. The purpose of your SLO is to help you know how to prioritize reliability work. That means that if the performance measured by your SLIs declines, you should be able to identify engineering work to improve reliability; there shouldn't be SLIs that you're powerless to affect. This won't be absolute, because your dependencies' reliability will always be able to affect your own. As a thought exercise, suppose that all your dependencies meet their SLOs, but you still measure a decline in your SLI (considering each SLI in turn). If that scenario is either mathematically impossible or would not be actionable, the SLI may not be useful.
  • aligned with overall service health. Consider the developer motivations that will emerge from your SLIs. Avoid perverse incentives. For example, if the only latency SLI is median request latency, then 50% of requests have no coverage in the SLO. That would incentivize developers to disregard tail latency, even though it may be key to user perception of service quality.
  • fully defined and empirically determined. Starting from a common set of data, parties should always agree about how to calculate the SLI. Try to avoid ambiguity: does "99% of traffic" mean 99% by request count, or 99% of bytes? Does a day mean any 24-hour period, or a UTC calendar day? Does request latency mean the time to first byte or last byte? Does it include network time?

Sometimes, multiple service characteristics can be combined into one SLI. For example, a service could define a Satisfactory response as being a non-error response served within a particular latency deadline. Then a single SLI, defined as "percentage of eligible requests which receive a Satisfactory response," captures both errors and latency. A spike in either latency or error rate would impact the service's availability as measured by this SLI. This approach is well suited to services whose clients have particularly sharp latency requirements, such that if the server takes longer than a certain period to respond, it might as well not have responded at all.

Some standard options for SLIs are listed below. If possible, you should prefer to copy them into your SLO, rather than writing your own from scratch.

Choose the ones that make sense for your service and for the available monitoring data; don't take all of them. For example, the different latency SLIs are alternative formulations of the same idea, so keeping more than one would be redundant.

  • Latency SLI, percentile: The [fill in]th percentile request latency, as measured at the server side.
  • Latency SLI, acceptable fraction: The percentage of all requests that complete within [fill in] milliseconds, measured at the server side.
  • Availability SLI: The percentage of all requests receiving a non-error response, defined as [fill in, e.g. "HTTP status code 200", or "'status': 'ok' in the JSON response body", etc].
  • Combined latency-availability SLI: The percentage of all requests that complete within [fill in] milliseconds and receive a non-error response, defined as [fill in as above].
  • Proxy latency SLI, percentile: The [fill in]th percentile of request latency contributed by the proxy, excluding backend wait time.
  • Proxy latency SLI, acceptable fraction: The percentage of all requests where the latency contributed by the proxy, excluding backend wait time, was within [fill in] milliseconds.
  • Proxy availability SLI: The percentage of all requests receiving a non-error response, or where the proxy accurately delivered an error response originating at the backend. Errors originating at the proxy are measured as [fill in, e.g. "HTTP status code 503"].
  • Proxy combined latency-availability SLI: The percentage of all requests where the latency contributed by the proxy, excluding backend wait time, is within [fill in] milliseconds and the request receives a non-error response or an error response originating at the backend. Errors originating at the proxy are measured as [fill in as above].

📝 Copy a selection of appropriate SLIs into your document, and fill in the blanks. If necessary, add (and fully define) any other SLIs appropriate to your service -- and consider adding them here if they may be useful to others. In all cases, if they can be measured in Grafana, link to a graph for each. Don't set numeric targets yet; we'll think about that next.
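
For reference, the non-proxy SLIs above can be computed from raw request records roughly as follows. This is a sketch under stated assumptions: the `Sample` record is hypothetical, "non-error" is taken to mean a status below 500, and the percentile uses the nearest-rank method, which is only one of several common definitions. In practice these numbers would come from your monitoring system rather than hand-written code.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float
    status: int


def is_ok(sample):
    # Assumed definition of "non-error"; fill in your own.
    return sample.status < 500


def latency_percentile(samples, percentile):
    """Latency SLI, percentile (nearest-rank method)."""
    ordered = sorted(s.latency_ms for s in samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def latency_acceptable_fraction(samples, deadline_ms):
    """Latency SLI, acceptable fraction: percentage completing within the deadline."""
    return 100 * sum(s.latency_ms <= deadline_ms for s in samples) / len(samples)


def availability(samples):
    """Availability SLI: percentage of requests with a non-error response."""
    return 100 * sum(is_ok(s) for s in samples) / len(samples)


def combined_latency_availability(samples, deadline_ms):
    """Combined SLI: percentage that are both non-error and within the deadline."""
    good = sum(is_ok(s) and s.latency_ms <= deadline_ms for s in samples)
    return 100 * good / len(samples)
```

Note that the combined SLI can never be higher than either of its two components, since a request has to pass both tests to count as good.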

Operational

Every service experiences an outage sometimes, so the SLO should reflect its expected time to recovery. If the expected duration of a single outage exceeds the error budget, then the SLO reduces to "we promise not to make any mistakes." Relying on such an SLO is untenable.

Answer these questions for the service as it is, not as it ought to be, in order to arrive at a realistically supportable SLO. Alternatively, you may be able to make incremental improvements to the service as you progress through the worksheet. Resist the temptation to publish a more ambitious SLO than you can actually support immediately, even if it feels like you should be able to support it.

How is the service monitored?

Assuming that an outage ends when engineers mitigate or work around the underlying issue, you expect the outage to last at least as long as it takes someone to notice and respond to the outage. If all SLIs are monitored with paging alerts 24x7, this is the expected time between the time when the outage starts and the time when a responding engineer is hands-on-keyboard investigating it. (Remember to include any delay associated with the alert itself, such as a sampling interval or rolling-average window.)

If some SLIs don't have paging alerts, this period is likely much longer. For example, if some element of the service is only monitored during working hours, a breakage on Friday evening might go unnoticed all weekend. A single such outage would cause the service to miss a 99% quarterly uptime SLO.
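
The arithmetic behind that claim is short. The sketch below approximates a quarter as 90 days and uses a hypothetical outage running from Friday 18:00 to Monday 09:00; both figures are assumptions chosen for illustration.

```python
from datetime import timedelta

QUARTER = timedelta(days=90)  # approximation; real quarters run 89 to 92 days


def error_budget(target_percent, period=QUARTER):
    """Total downtime allowed in the period by an uptime target."""
    return period * (100 - target_percent) / 100


budget = error_budget(99)
weekend_outage = timedelta(hours=63)  # hypothetical: Friday 18:00 to Monday 09:00

print(f"budget: {budget}, outage: {weekend_outage}")
print("SLO missed" if weekend_outage > budget else "SLO met")
```

A 99% quarterly target allows a little under one day of downtime in total, so a single unnoticed weekend outage uses up the budget several times over.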

How complex is the service to troubleshoot?

After an engineer begins working on the problem, how long will it take to identify the necessary mitigative action? This is the least scientific question on this worksheet; it will likely be informed in part by experience.

Questions to consider: Does the team receiving pages for the service also fully understand its internals, or will they have to escalate to the developers and wait for help? If the engineer responding to the page is relatively inexperienced, can they still find all the information they need -- how to interpret monitoring, diagnose problems, and take mitigative action -- in documentation that's complete, up-to-date, and discoverable?

How is the service deployed?

Production incidents are often resolved by rolling out a code or configuration change, so a slower deployment process means a slower resolution. If the normal rollout is intentionally slowed by canary checks, it's reasonable to assume here that they're skipped for a rollback to a known-safe version, as long as such a process exists.

📝 Answer all the operational questions realistically, explaining how long you expect each phase of an outage response to last in the ideal case, and explaining why you think so. Consider linking to past incident reports to use as a comparison.

Write the service level objectives

The reporting period is the time interval over which you assess your service's performance against its SLO, and determine whether or not you met your objective. Although you'll continuously monitor your SLO, the binary success/failure result for the reporting period can be an input to your decisions about work prioritization: if you're already at risk of missing your SLO, you might delay risky deployments and focus on reliability improvements for the remainder of the period.

Every service at the Wikimedia Foundation uses the same reporting period: three calendar months, phased one month earlier than the fiscal quarter. Thus the four SLO reporting quarters are:

  • December 1 - February 28 (or 29)
  • March 1 - May 31
  • June 1 - August 31
  • September 1 - November 30
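
If you need to compute these boundaries in a script or dashboard, a small helper along the following lines would do it. The function name is hypothetical; the logic simply encodes the four quarters listed above.

```python
from datetime import date, timedelta


def reporting_quarter(day):
    """Return (first_day, last_day) of the SLO reporting quarter containing `day`."""
    # Quarters begin on the first of December, March, June, and September.
    quarter_index = (day.month % 12) // 3  # Dec-Feb -> 0, Mar-May -> 1, and so on
    start_month = 12 if quarter_index == 0 else 3 * quarter_index
    # January and February belong to the quarter that began the previous December.
    start_year = day.year - 1 if day.month <= 2 else day.year
    first_day = date(start_year, start_month, 1)

    # The next quarter starts three months later; step back one day for the end.
    next_month = start_month + 3
    next_year = start_year
    if next_month > 12:
        next_month -= 12
        next_year += 1
    last_day = date(next_year, next_month, 1) - timedelta(days=1)
    return first_day, last_day


print(reporting_quarter(date(2024, 1, 15)))
```

Deriving the last day from the start of the following quarter means leap years are handled by the standard library rather than by a lookup table.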

By reporting on SLOs every quarter, we can align with the existing cycle of planning and executing work, which we do at the Foundation with quarterly OKRs. Thus a service experiencing reliability problems in one quarter can prioritize efforts to correct them in the next quarter. The one-month offset allows us time to make that determination: if the SLO reporting quarter ended the day before the fiscal quarter started, we wouldn't have time to review SLO performance and take it into account when setting OKRs.

Calculate the realistic targets

Using monitoring data for each of your draft SLIs, review your service's past performance. If everything stays about the way it is now, what's the best performance you can achieve?

Review the list of dependencies you made earlier. Suppose that each of your dependencies exactly meets its SLO. (For dependencies without an SLO yet, or dependencies that habitually miss their SLO, assume that they maintain their historical performance, or worsen slightly but not dramatically.) For each of your draft SLIs, what's the best performance you could achieve?

Suppose that during the reporting period, your team accidentally merges one major code or configuration bug. Assume that automated monitoring detects the impact immediately when it's deployed to production, that the first responding engineer immediately decides to roll back the change, and that the rollback process works normally. Review your estimates earlier for how long it would take to resolve the incident. For each of your draft SLIs, what's the best performance you could achieve?
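
One way to turn those estimates into a number is sketched below. It is deliberately simplistic: it assumes the service is fully down for the whole incident, measures availability by time rather than by request count, and approximates the quarter as 90 days. The phase durations in the example are placeholders for the estimates from your own Operational section.

```python
from datetime import timedelta


def best_case_availability(response, diagnosis, rollback,
                           period=timedelta(days=90), incidents=1):
    """Availability (percent) if each incident costs one full response cycle."""
    downtime = incidents * (response + diagnosis + rollback)
    return 100 * (1 - downtime / period)


# Placeholder estimates: 5 minutes to respond, 20 to diagnose, 15 to roll back.
target = best_case_availability(
    response=timedelta(minutes=5),
    diagnosis=timedelta(minutes=20),
    rollback=timedelta(minutes=15),
)
print(f"{target:.3f}%")
```

If the result is lower than the target you were hoping to publish, that gap is a sign that detection, diagnosis, or rollback needs to get faster before the tighter target is supportable.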

📝 Fill in a realistic target for each SLI and, if applicable, each request class. If you base your results on any assumptions not already discussed, state them. This is not your final SLO, just one side of a bounding range; it's okay to approximate.

Calculate the ideal targets

Review the list of clients you made earlier. For clients with an SLO of their own, what level of service would they need you to provide in order to meet their SLO? For clients without an SLO (including end users who call your service directly), what level of service would they consider basically satisfactory?

For example, if your 75th-percentile latency went up by 10%, would the effect on your clients be such that you would deprioritize other work to restore it? By 20%?

📝 Fill in an ideal target for each SLI and, if applicable, each request class. If you base your results on any assumptions not already discussed, state them. This is not your final SLO, just the other side of the bounding range; it's okay to approximate.

Sidebar: Why isn't the ideal target 100%?

Errors are bad, right? So why shouldn't your error budget be zero?

It's always good to strive for perfection, but it's unrealistic to plan on it. As a rule of thumb, each additional nine of actual measured availability requires about the same amount of engineering effort: it takes roughly as much work to get from 99% to 99.9% as to improve further to 99.99%, and so on. But for any given service, there's a point of diminishing returns, where the extra sliver of availability is of limited practical benefit, and all that engineering effort would be better spent on other goals, like building new features or resolving technical debt.

Some projects do require 100% availability. In some engineering systems, even a single error can have life-threatening consequences, or disastrous financial cost, or can cause irreparable harm such as leaking users' secrets. Systems like this are possible, but require a different class of effort: their architecture, design, implementation, deployment, and operation are all handled differently and with orders of magnitude more work. For example, NASA coding standards require static upper bounds for every loop and prohibit all recursion, tightly limiting even benign code in order to minimize the potential for certain classes of bugs. The extra effort is justified by the high cost of exceeding a 0% error rate.

At the Wikimedia Foundation, we have no such projects. It's always better not to serve errors, but none of our services have a catastrophic failure mode. By forgoing a 100% availability target, we accept the chance of some volatility in order to free up engineering resources for better use, and by writing down a specific sub-100% target, we ensure that all parties are in agreement on what level of unreliability would require that we prioritize engineering work to correct it. We enable services to form realistic, specific expectations of their dependencies, and to design and operate with those expectations in mind.

Reconcile the realistic vs. ideal targets

Now that you've worked out what SLO targets you'd like to offer, and what targets you can actually support, compare them. If you're lucky, the realistic values are the same as or better than the ideal ones: that's great news. Publish the ideal values as your SLO, or choose a value in between. (Resist the urge to set a stricter SLO just because you can; it will constrain your options later.)

If you're less lucky, there's some distance between the SLO you'd like to offer and the one you can support. This is an uncomfortable situation, but it's also a natural one for a network of dependent services establishing their SLOs for the first time. Here, you'll need to make some decisions to close the gap. (Resist, even more strongly, the urge to set a stricter SLO just because you wish you could.)

One approach is to make the same decisions you would make if you already had an SLO and you were violating it. (In some sense, that's effectively the case: your service isn't reliable enough to meet its clients' expectations, you just didn't know it yet.) That means it's time to refocus engineering work onto the kind of projects that will bolster the affected SLIs. Publish an SLO that reflects the promises you can keep right now, but continue to tighten it over time as you complete reliability work.

The other approach is to do engineering work to relax clients' expectations. If they're relying on you for a level of service that you can't provide, there may be a way to make that level of service unnecessary. If your tail latency is high but you have spare capacity, they can use request hedging to avoid the tail. If they can't tolerate your rate of outages in a hard dependency, maybe they can rely on you as a soft dependency by adding a degraded mode.
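
For readers unfamiliar with request hedging, the idea is to send a second copy of a request if the first hasn't answered within a short delay, and to use whichever response arrives first. The sketch below shows the pattern with Python's standard thread pool; `hedged_call` and its parameters are hypothetical names, and a real client would also need to confirm that the request is safe to send twice and that the server has capacity for the extra load.

```python
import concurrent.futures
import time


def hedged_call(fn, hedge_after_s, executor):
    """Call fn(); if it hasn't finished after hedge_after_s seconds, start a
    second call and return the result of whichever finishes first."""
    first = executor.submit(fn)
    done, _ = concurrent.futures.wait([first], timeout=hedge_after_s)
    if done:
        return first.result()

    # The first attempt is slow: issue the hedge and race the two attempts.
    second = executor.submit(fn)
    done, _ = concurrent.futures.wait(
        [first, second], return_when=concurrent.futures.FIRST_COMPLETED
    )
    # Note: if the first attempt to finish raised, that exception propagates.
    return next(iter(done)).result()


if __name__ == "__main__":
    def backend_request():  # hypothetical stand-in for a real backend call
        time.sleep(0.01)
        return "response"

    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        print(hedged_call(backend_request, hedge_after_s=0.05, executor=pool))
```

This simple version doesn't cancel the slower attempt, which keeps running in the pool; setting the hedge delay near the server's tail-latency threshold keeps the number of duplicate requests small.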

Despite the use of "you" and "they" in the last couple of paragraphs, this is collaborative work toward a shared goal. The decision of which approach to take doesn't need to be adversarial or defensive.

You should also expect this work to comprise the majority of the effort involved in the SLO process. Where the earlier steps were characterized by documentation and gathering, here your work is directed at improving the practical reality of your software in production.

Regardless of the approach you take to reconciliation, you should publish a currently-realistic SLO, and begin measuring your performance against it, sooner rather than later. You can publish your aspirational targets too (as long as it's clearly marked that you don't currently guarantee to meet them) so that other teams can consider them in their longer-term planning. In the meantime, you'll be able to prioritize work to keep from backsliding on the progress you've already made.
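
The publishing rule described in this section can be summarized in a few lines. This sketch assumes an SLI where a higher number is better (such as an availability percentage); for a latency target the comparison would be reversed. The `Decision` type and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    published_target: float               # what you guarantee today
    aspirational_target: Optional[float]  # clearly marked as not yet guaranteed


def reconcile(realistic, ideal):
    """Pick the target to publish, for an SLI where higher is better."""
    if realistic >= ideal:
        # You can already support what clients need; don't promise more.
        return Decision(published_target=ideal, aspirational_target=None)
    # Promise only what you can keep, and record the goal you're working toward.
    return Decision(published_target=realistic, aspirational_target=ideal)


print(reconcile(realistic=99.95, ideal=99.9))
print(reconcile(realistic=99.5, ideal=99.9))
```

In the second case the gap between the two targets is the backlog of reliability work, or of client-side changes, that reconciliation is meant to surface.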

📝 Clearly document any decisions you made during reconciliation. Finally, clearly list the agreed SLOs -- that is, SLIs and associated targets. There should be as many SLOs as the number of SLIs multiplied by the number of request classes -- or, if some request classes are ineligible for any guarantee, say which.
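
As a quick completeness check on the final list, you can enumerate the expected combinations. The SLI and request class names below are placeholders.

```python
from itertools import product


def expected_slos(slis, request_classes, ineligible=()):
    """Every (SLI, request class) pair that should have a published target."""
    eligible = [c for c in request_classes if c not in ineligible]
    return list(product(slis, eligible))


# Placeholder names for illustration only.
pairs = expected_slos(
    slis=["latency", "availability"],
    request_classes=["read", "write", "large-batch"],
    ineligible={"large-batch"},
)
for sli, request_class in pairs:
    print(f"{request_class}: {sli} target = [fill in]")
```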

References

  • Jones, Wilkes, and Murphy with Smith, "Service Level Objectives" in Site Reliability Engineering, O'Reilly 2016 (free online)
  • Alex Hidalgo, Implementing Service Level Objectives, O'Reilly 2020 (WMF Tech copy)