
SLO/Docker-registry: Difference between revisions

Older revision by imported>JMeybohm (docker-registry nodes depend on ganeti as well):

{{Draft}}

==Summary==
This page covers the highly available [[Docker-registry|docker registry]] hosted at <code>docker-registry.wikimedia.org</code>.
{| class="wikitable"
|Owner
|Service Ops Team
|-
|Status
|ACTIVE
|-
|Dependencies
|Swift, PyBal, Ganeti
|-
|Services that depend on this one
|Kubernetes, CI
|}
==Architecture==
==Service Level Objectives==
===Service level objective 1: 95% of get manifest or tag operations will complete in less than 2s===
*Service level indicator 1: measured by DC, and will only take into account the active DC for measurements [https://grafana.wikimedia.org/d/StcefURWz/docker-registry-wip?orgId=1&var-datasource=codfw%20prometheus%2Fops&panelId=16&fullscreen&from=1556250746552&to=1556272346553 graph link]
*Checked every month or at least every quarter
===Service level objective 2: 95% of push manifest or tag operations will complete in less than 3s===
*Service level indicator 1: measured by DC, and will only take into account the active DC for measurements [https://grafana.wikimedia.org/d/StcefURWz/docker-registry-wip?orgId=1&var-datasource=codfw%20prometheus%2Fops&panelId=16&fullscreen&from=1556250746552&to=1556272346553 graph link]
*Checked every quarter or month
===Service level objective 3: registry is serving content (at least in read-only mode, pulling images) 99% of the time===
*Service level indicator 1: 5XX responses over 2XX ratio is less than 1%
*Checked every quarter or month.

[[Category:Docker]]

Newer revision by imported>RLazarus (reformat (with minor semantic differences) to fit the modern SLO format):

{{Draft}}
Status: '''draft''' <mark>(Replace with "approved" when the SLO is complete, agreed on by all responsible teams, and subject to quarterly reporting. You can still update it afterward, with all teams' agreement.)</mark>

== Organizational ==

=== Service ===
This page covers the highly available [[Docker-registry|Docker registry]] hosted at <code>docker-registry.wikimedia.org</code>.

=== Teams ===
The Service Ops SRE team is the service owner of the Docker registry.

== Architectural ==

=== Environmental dependencies ===
The Docker registry runs on Ganeti, active/passive via discovery DNS in eqiad and codfw, with traffic load-balanced via PyBal to two VMs in each data center.

=== Service dependencies ===
Beyond the environmental dependencies above, the Docker registry's only hard dependency is Swift, its storage backend. Redis, used as a blob cache, is a soft dependency: during a Redis outage, pulling and pushing images would be slower but would still complete successfully.

== Client-facing ==

=== Clients ===
The Docker registry is used by Kubernetes (via [[Dragonfly]]) and by CI.

=== Request Classes ===
Requests are classified by HTTP method and by the API endpoint:

* '''Manifest reads''' are HTTP <code>GET</code> or <code>HEAD</code> requests to URL paths of the form <code>/v2/&lt;name&gt;/manifests/&lt;reference&gt;</code>.

* '''Tag reads''' are HTTP <code>GET</code> requests to URL paths of the form <code>/v2/&lt;name&gt;/tags/list</code>.

* '''Blob reads''' are HTTP <code>GET</code> or <code>HEAD</code> requests to URL paths of the form <code>/v2/&lt;name&gt;/blobs/&lt;digest&gt;</code>.

* '''Manifest writes''' are HTTP <code>PUT</code> requests to URL paths of the form <code>/v2/&lt;name&gt;/manifests/&lt;reference&gt;</code>.

All other requests are ineligible for the SLO. Only requests sent to the active data center are eligible for the SLO.

== Service Level Indicators (SLIs) ==

* '''Latency SLI''': The 95th-percentile request latency, as measured on the server side.

* '''Availability SLI''': The percentage of all requests receiving a non-error response, defined as HTTP status code 2xx. (200 is the successful response status for reads, 201 or 202 as appropriate for writes.)

Both SLIs are computed over the [[SLO/template_instructions#Write_the_service_level_objectives|Foundation-standard reporting periods]]: three calendar months, phased one month earlier than the fiscal quarter.

== Service Level Objectives ==

<mark>TODO: Grafana links</mark>

* The 95th-percentile '''latency''' for '''manifest reads''' will be less than 2 seconds.

* The 95th-percentile '''latency''' for '''tag reads''' will be less than 2 seconds.

* The 95th-percentile '''latency''' for '''manifest writes''' will be less than 3 seconds.

* The '''availability''' for '''manifest, tag, and blob reads''', measured together, will be at least 99%.

Note that not all request classes have an objective for each SLI.

[[Category:Docker]]

Revision as of 02:48, 21 August 2021

Status: draft (Replace with "approved" when the SLO is complete, agreed on by all responsible teams, and subject to quarterly reporting. You can still update it afterward, with all teams' agreement.)

Organizational

Service

This page covers the highly available Docker registry hosted at docker-registry.wikimedia.org.

Teams

The Service Ops SRE team is the service owner of the Docker registry.

Architectural

Environmental dependencies

The Docker registry runs on Ganeti, active/passive via discovery DNS in eqiad and codfw, with traffic load-balanced via PyBal to two VMs in each data center.

Service dependencies

Beyond the environmental dependencies above, the Docker registry's only hard dependency is Swift, its storage backend. Redis, used as a blob cache, is a soft dependency: during a Redis outage, pulling and pushing images would be slower but would still complete successfully.

Client-facing

Clients

The Docker registry is used by Kubernetes (via Dragonfly) and by CI.

Request Classes

Requests are classified by HTTP method and by the API endpoint:

  • Manifest reads are HTTP GET or HEAD requests to URL paths of the form /v2/<name>/manifests/<reference>.
  • Tag reads are HTTP GET requests to URL paths of the form /v2/<name>/tags/list.
  • Blob reads are HTTP GET or HEAD requests to URL paths of the form /v2/<name>/blobs/<digest>.
  • Manifest writes are HTTP PUT requests to URL paths of the form /v2/<name>/manifests/<reference>.

All other requests are ineligible for the SLO. Only requests sent to the active data center are eligible for the SLO.
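
The classification rules above could be sketched roughly as follows. This is an illustrative example only, not the registry's or the dashboards' actual implementation: the function name `classify_request`, the class labels, the `sent_to_active_dc` flag, and the regular expressions are all assumptions made for the sketch. It assumes repository names may contain slashes and that the final path component (reference or digest) does not.

```python
import re
from typing import Optional

# Hypothetical rule table: (class label, allowed HTTP methods, path pattern).
# The labels and patterns are assumptions for illustration only.
_RULES = [
    ("manifest_read", {"GET", "HEAD"}, re.compile(r"^/v2/.+/manifests/[^/]+$")),
    ("tag_read", {"GET"}, re.compile(r"^/v2/.+/tags/list$")),
    ("blob_read", {"GET", "HEAD"}, re.compile(r"^/v2/.+/blobs/[^/]+$")),
    ("manifest_write", {"PUT"}, re.compile(r"^/v2/.+/manifests/[^/]+$")),
]


def classify_request(method: str, path: str,
                     sent_to_active_dc: bool = True) -> Optional[str]:
    """Return the request class, or None if the request is ineligible."""
    # Only requests sent to the active data center are eligible.
    if not sent_to_active_dc:
        return None
    for label, methods, pattern in _RULES:
        if method.upper() in methods and pattern.match(path):
            return label
    # Every other method/path combination is ineligible for the SLO.
    return None
```

For example, `classify_request("PUT", "/v2/some/image/manifests/latest")` falls into the manifest-write class, while a `POST` to a blob upload path matches no rule and is treated as ineligible.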

Service Level Indicators (SLIs)

  • Latency SLI: The 95th-percentile request latency, as measured on the server side.
  • Availability SLI: The percentage of all requests receiving a non-error response, defined as HTTP status code 2xx. (200 is the successful response status for reads, 201 or 202 as appropriate for writes.)

Both SLIs are computed over the Foundation-standard reporting periods: three calendar months, phased one month earlier than the fiscal quarter.
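
As a rough illustration of the two SLI definitions above, the following sketch computes them from raw samples. It is not how the production metrics are derived. The helper names `p95_latency` and `availability_percent` are hypothetical, and the nearest-rank percentile method is an assumption; a metrics backend may interpolate differently.

```python
import math
from typing import Sequence


def p95_latency(latencies_seconds: Sequence[float]) -> float:
    """95th-percentile latency using the nearest-rank method (an assumption)."""
    if not latencies_seconds:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_seconds)
    # Nearest rank: the smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def availability_percent(status_codes: Sequence[int]) -> float:
    """Percentage of requests that received an HTTP 2xx response."""
    if not status_codes:
        raise ValueError("no requests")
    ok = sum(1 for code in status_codes if 200 <= code < 300)
    return 100.0 * ok / len(status_codes)
```

In this sketch every non-2xx response counts against availability, which follows the definition of a non-error response given above.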

Service Level Objectives

TODO: Grafana links

  • The 95th-percentile latency for manifest reads will be less than 2 seconds.
  • The 95th-percentile latency for tag reads will be less than 2 seconds.
  • The 95th-percentile latency for manifest writes will be less than 3 seconds.
  • The availability for manifest, tag, and blob reads, measured together, will be at least 99%.

Note that not all request classes have an objective for each SLI.
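
The objectives above can be written down as a small table of thresholds and checked against measured values. The following is a minimal sketch under stated assumptions: the class labels, the constant names, and the function `evaluate_slos` are hypothetical and not part of any existing tooling. Latency objectives use a strict "less than" comparison and the availability objective uses "at least", matching the wording of the list.

```python
from typing import Dict

# Thresholds taken from the objectives listed above; the labels are hypothetical.
LATENCY_OBJECTIVES_SECONDS = {
    "manifest_read": 2.0,
    "tag_read": 2.0,
    "manifest_write": 3.0,
}
READ_AVAILABILITY_OBJECTIVE_PERCENT = 99.0


def evaluate_slos(p95_seconds: Dict[str, float],
                  read_availability_percent: float) -> Dict[str, bool]:
    """Return a pass/fail flag per objective for one reporting period."""
    results = {
        f"latency:{cls}": p95_seconds[cls] < limit
        for cls, limit in LATENCY_OBJECTIVES_SECONDS.items()
    }
    # Availability is measured across manifest, tag, and blob reads together.
    results["availability:reads"] = (
        read_availability_percent >= READ_AVAILABILITY_OBJECTIVE_PERCENT
    )
    return results
```

Blob reads have no latency entry and manifest writes have no availability entry, which reflects the note that not every request class has an objective for each SLI.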