
Difference between revisions of "SRE/Observability/Documentation"

imported>Jobo
 
imported>LMata
(moved resources from main page)
 
(One intermediate revision by one other user not shown)
{{Observability/Navigation}}


== SRE [[:Category:SRE Observability|Observability]] documentation ==
<categorytree mode=pages>SRE Observability</categorytree>The starting point for observability resources at Wikimedia SRE.


===Alerts===
*[https://icinga.wikimedia.org/alerts icinga.w.o/alerts]: central monitoring and alerting platform. See also [[Icinga]].
*[https://upload.wikimedia.org/wikipedia/labs/0/0a/Alerting_Infrastructure_design_document_%26_roadmap.pdf Alerting infrastructure roadmap] PDF
===Logs===
*[https://logstash.wikimedia.org/app/kibana Kibana] (a.k.a. logstash): central logging platform. See also [[Logstash]].
*[https://upload.wikimedia.org/wikipedia/labs/5/58/Logging_infrastructure_design_document.pdf Logging infrastructure design document] PDF
===Metrics===
*[https://grafana.wikimedia.org/ grafana.w.o]: central observability platform. See also [[Grafana.wikimedia.org|Grafana]].
*[[Prometheus]], recommended and supported metrics toolkit
*[[Graphite]], supported but deprecated time series framework
*[[Statsd]], supported but deprecated metrics aggregation
*[[Observability/Dashboard_guidelines]], ideas towards better dashboards

= How we work =
= Overview =
The Observability team maintains several tools while curating and building a collaborative roadmap for the Wikimedia Foundation. Maintaining several of these work streams is challenging, because work that is adjacent to, related to, or directly assigned to the Observability team workboard is not easily distinguishable.

The purpose of this document is to describe:

* How the #o11y team does its work
* How other teams can request assistance or time on our roadmap
* How we will surface scheduling dependencies
* How we will use Phabricator to boost visibility
* What tools we intend to use
 
= How to connect =
 
* Reach out:
** IRC #wikimedia-sre-observability
** Email [mailto:sre-observability@wikimedia.org sre-observability@wikimedia.org]
** Phabricator tags mentioned above
 
= How we do Intake =
Work comes from multiple sources, but most requests should land in Phabricator. The #observability tag/project is the one we use for all incoming work. Each task will be in one of six major states: intake, backlog, scheduled, in progress, radar, done/closed.

The Observability team grooms incoming tasks weekly, normally during planning meetings on Monday at 8:00 AM Pacific. Some requests may receive an out-of-band prioritization effort.

From there, each task or request goes through a quick prioritization. If it is actionable for the Observability team, it is either marked "done this quarter" (i.e. time sensitive) or placed in the backlog. Otherwise it goes to radar, or is marked as blocked in the backlog if it cannot move forward. Tasks without enough information to groom will receive a follow-up comment and remain unprioritized until enough information has been collected to perform the task effectively.
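
The grooming decision described above can be sketched in Python. This is only an illustration of one reading of the process, not team tooling: the <code>Request</code> fields, the <code>triage</code> function, and the mapping of "done this quarter" onto the "scheduled" state are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # All fields are hypothetical flags a groomer might assess during intake.
    has_enough_info: bool
    actionable_by_team: bool
    time_sensitive: bool = False
    blocked: bool = False


def triage(request):
    """Return the (assumed) state for a request after weekly grooming."""
    if not request.has_enough_info:
        # Stays unprioritized; a follow-up comment asks for more detail.
        return "intake"
    if request.actionable_by_team:
        # Assumption: "done this quarter" corresponds to the "scheduled" state.
        return "scheduled" if request.time_sensitive else "backlog"
    # Not actionable right now: blocked in the backlog, or watched on radar.
    return "backlog (blocked)" if request.blocked else "radar"
```

For instance, <code>triage(Request(has_enough_info=True, actionable_by_team=True, time_sensitive=True))</code> yields the "scheduled" state under these assumptions.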
 
= How we Roadmap =
Roadmap planning uses a rolling 1+ year roadmap, with the goal of having a list of tasks pre-groomed and prioritized periodically (quarterly).
 
There are 6 major work streams that drive work into the o11y team:
 
* Alerting
* Metrics
* Logs
* Tracing (future)
* Maintenance/Incidents
* Incident Management
 
The goal of this process is to drive each of these major workstreams by setting clear goals and deliverables, quantifying the effort and time invested in each workstream according to the level of investment the organization wants to make in each initiative, and allocating an adequate amount of time to each initiative.

Eventually these projects will be shown in a Gantt-like visual representation, which should make it easy to see whether too much is going on, either sequentially or in parallel, and to use that to guide prioritization decisions.
 
== Prioritization ==
Prioritization is both a scheduled and a continual effort to size up work and weigh the importance and impact of specific work streams. The team employs a simple forced-rank list of priorities that is fed from the intake process and groomed by the team. The list is then taken to a spreadsheet, where projects are scored for an overall feel of value and capacity.

Order of precedence for prioritization:

* Tier1 (high): incidents, security events, privacy concerns, PII in logs, unbreak now events
* Tier2 (medium): project work (OKR), outside requests, maintenance
* Tier3 (low): non-critical maintenance
* Tier4 (lowest): icebox
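
As a rough illustration of the tier ordering above, the following Python sketch ranks a handful of made-up tasks by tier. The <code>Tier</code> enum, the <code>Task</code> class, and the sample task titles are assumptions for this example only; they are not part of any Wikimedia or Phabricator tooling.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Priority tiers from the list above; a lower value means higher priority."""
    HIGH = 1     # incidents, security events, privacy concerns, PII in logs
    MEDIUM = 2   # project work (OKR), outside requests, maintenance
    LOW = 3      # non-critical maintenance
    LOWEST = 4   # icebox


@dataclass
class Task:
    title: str   # hypothetical task title
    tier: Tier


def forced_rank(tasks):
    """Order tasks by tier; ties keep their incoming order because sorted() is stable."""
    return sorted(tasks, key=lambda task: task.tier)


example_tasks = [
    Task("Tidy unused dashboards", Tier.LOWEST),
    Task("PII found in logs", Tier.HIGH),
    Task("Quarterly OKR project subtask", Tier.MEDIUM),
    Task("Non-critical package upgrade", Tier.LOW),
]

ranked_titles = [task.title for task in forced_rank(example_tasks)]
```

Because the sort is stable, tasks within the same tier stay in the order the team ranked them during grooming.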
 
== Project Work (OKR) ==
All project work is prioritized and groomed beforehand. Overarching project tasks are created in Phabricator with subtasks, both of which can be tagged with a FY or Quarter (or both) "milestone" to indicate scheduling for projects that span multiple quarters or years.
 
== Maintenance (non-OKR) ==
Planned maintenance will follow the same workflow as regular project work. Unplanned maintenance and requests will be groomed and prioritized based on urgency and severity.
 
= Our Work Cadence =
{| class="wikitable"
!Activity
!Frequency
!Where
|-
|Intake Grooming + Prioritization
|Weekly
|o11y office hours
|-
|Planning (rolling roadmap)
|Quarterly
|OKR Meetings
|-
|Annual Planning
|Yearly
|TBD
|}
 
= Our Workstreams =
 
* Alerting
* Logging
* Metrics
* Tracing (future)
 
= Phabricator Workflow =
 
* Tasks go into the Observability intake project (#observability)
* Tasks are then groomed and tagged with the appropriate subcomponent (subproject) area; one of the four:
** o11y-alerting
** o11y-metrics
** o11y-logging
** o11y-tracing (to be determined later)
 
== Phabricator Subproject ==
{| class="wikitable"
|Subprojects are full-power projects that are contained inside some parent project. You can use them to divide a large or complex project into smaller parts.
 
Subprojects have normal members and normal policies, but note that the policies of the parent project affect the policies of the subproject (see "Parent Projects", below).
 
Subprojects can have their own subprojects, milestones, or both. If a subproject has its own subprojects, it is both a subproject and a parent project. Thus, the parent project rules apply to it, and are stronger than the subproject rules.
 
Subprojects can have normal workboards.
 
The maximum subproject depth is 16. This limit is intended to grossly exceed the depth necessary in normal usage.
 
Objects may not be tagged with multiple projects that are ancestors or descendants of one another. For example, a task may not be tagged with both Stonework and Stonework → Masonry.
 
When a project tag is added that is the ancestor or descendant of one or more existing tags, the old tags are replaced. For example, adding Stonework → Masonry to a task tagged with Stonework will replace Stonework with the newer, more specific tag.
 
This restriction does not apply to projects which share some common ancestor but are not themselves mutual ancestors. For example, a task may be tagged with both Stonework → Masonry and Stonework → Sculpting.
 
This restriction does apply when the descendant is a milestone. For example, a task may not be tagged with both Stonework and Stonework → Iteration II.
|}
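
The tagging rule quoted above (a new tag replaces any ancestor or descendant tag already on the object, while unrelated siblings coexist) can be modelled with a short Python sketch. This is not Phabricator's implementation; representing each project as a tuple path and the helper names <code>is_related</code> and <code>add_tag</code> are assumptions chosen for illustration.

```python
def is_related(a, b):
    """True if one project path is an ancestor of (or equal to) the other."""
    shorter, longer = sorted((a, b), key=len)
    return longer[:len(shorter)] == shorter


def add_tag(tags, new_tag):
    """Return a new tag set with new_tag added, replacing ancestor or descendant tags."""
    kept = {tag for tag in tags if not is_related(tag, new_tag)}
    kept.add(new_tag)
    return kept


# Project paths as tuples, using the example names from the quoted text.
stonework = ("Stonework",)
masonry = ("Stonework", "Masonry")
sculpting = ("Stonework", "Sculpting")

tags = add_tag({stonework}, masonry)   # Stonework is replaced by Stonework → Masonry
tags = add_tag(tags, sculpting)        # siblings share an ancestor but can coexist
```

In this model a milestone such as Stonework → Iteration II is just another child path, so it also replaces its parent tag.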
 
* Generic tags keep individual components from being tied to specific software (Icinga, Logstash, Prometheus).

* Each quarter will be created as a subproject (milestone), allowing every board to display an approximation of how items are scheduled.

* Each component will have its own workboard with the following columns:
** Intake, backlog, scheduled, in progress, radar, done/closed
** This is an effort to provide a specific long-term roadmap view of each component.
* Similarly, we intend to create a workboard view that includes all subcomponents and milestones, to give an operational view of the active quarter, etc.
 
{| class="wikitable"
|Milestones are simple subprojects for tracking sprints, iterations, versions, or other similar blocks of work. Milestones make it easier to create and manage a large number of similar subprojects (for example: Sprint 1, Sprint 2, Sprint 3, etc).
 
Milestones can not have direct members or policies. Instead, the membership and policies of milestones are always the same as the milestone's parent project. This makes large numbers of milestones more manageable when changes occur.
 
Milestones can not have subprojects, and can not have their own milestones.
 
By default, Milestones do not have their own hashtags.
 
Milestones can have normal workboards.
 
Objects may not be tagged with two different milestones of the same parent project. For example, a task may not be tagged with both Stonework → Iteration III and Stonework → Iteration V.
 
When a milestone tag is added to an object which already has a tag from the same series of milestones, the old tag is removed. For example, adding the Stonework → Iteration V tag to a task which already has the Stonework → Iteration III tag will remove the Iteration III tag.
 
This restriction does not apply to milestones which are not part of the same series. For example, a task may be tagged with both Stonework → Iteration V and Heraldry → Iteration IX.
|}
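
The milestone rule quoted above (an object carries at most one milestone per parent project, and a newer milestone from the same series replaces the older one) can be sketched the same way. Again this is an illustrative model, not Phabricator code; storing tags as <code>(parent, milestone)</code> pairs and the function name <code>add_milestone</code> are assumptions.

```python
def add_milestone(tags, parent, milestone):
    """Return a new set of (parent, milestone) tags with at most one milestone per parent."""
    kept = {tag for tag in tags if tag[0] != parent}
    kept.add((parent, milestone))
    return kept


# Example names taken from the quoted text.
milestone_tags = {("Stonework", "Iteration III"), ("Heraldry", "Iteration IX")}

# Adding Iteration V removes Iteration III but leaves the Heraldry milestone alone.
milestone_tags = add_milestone(milestone_tags, "Stonework", "Iteration V")
```

Applied to quarterly milestones, this means moving a task to a new quarter within the same parent project automatically drops the previous quarter's tag.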

Latest revision as of 15:40, 12 July 2021
