You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Wikimediastatus.net: Difference between revisions

From Wikitech-static
imported>Krinkle
mNo edit summary
imported>CDanis
<big><big>https://www.wikimediastatus.net/</big></big>
''When distributing the link to others, prefer including the www. prefix, as that saves an HTTP redirect.''
'''wikimediastatus.net''' is a public and high-level uptime monitor. It is separated from our production infrastructure and hosted by Atlassian Statuspage.


It launched in January 2022 and is maintained by the [[SRE]] team. It is the spiritual successor to [[status.wikimedia.org]], which was hosted by Watchmouse. Unlike its predecessor, it is no longer under the wikimedia.org domain, both for security reasons ([[phab:T293504|T293504]]) and so that it remains available during an outage of Wikimedia DNS and/or our networking infrastructure.
 
== SRE usage instructions ==
https://office.wikimedia.org/wiki/SRE/Status_page
 
== Statograph (automated metrics upload) ==
{{infobox
|above=Statograph
|subheader=Automatically uploads time-series metrics to the public status page.
|image=[[File:Pantograph_animation.gif|120px|center|Animated illustration of a pantograph, the namesake of Statograph]]
|label1=URL
|data1=https://www.wikimediastatus.net/
|label2=Language
|data2=Python
|label3=Source code
|data3={{gitweb|project=operations/software/statograph}}
|label4=Puppet classes
|data4={{gitweb|project=operations/puppet|file=modules/statograph/manifests/init.pp|text=Puppet module}} <br/>{{gitweb|project=operations/puppet|file=hieradata/common/profile/statograph.yaml|text=hiera configuration}}
}}
<code>statograph</code> is a tool that uploads timeseries metrics from sources like Prometheus and Graphite to the metrics on your statuspage.io installation.
 
As configured at WMF, it runs on hosts with the <code>alerting_host</code> puppet role (e.g. <code>alert1001</code>, <code>alert2001</code>). It scrapes timeseries from the [[Thanos]] globally-aggregated [[Prometheus]], as well as one timeseries from [[Graphite]].
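Because the two sources return samples in different shapes, a scraper has to normalize them before uploading. The following is a minimal sketch of that step, not Statograph's actual code: the function names and the <code>Point</code> alias are hypothetical. It assumes only the documented JSON shapes of a Prometheus-style range query (samples are <code>[timestamp, "value"]</code>) and of a Graphite render response with <code>format=json</code> (datapoints are <code>[value, timestamp]</code>, with <code>null</code> for gaps).

```python
from typing import Any

# Hypothetical common shape: (unix timestamp in seconds, value).
Point = tuple[int, float]


def points_from_prometheus(response: dict[str, Any]) -> list[Point]:
    """Flatten the first series of a Prometheus-style range-query response."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    series = response["data"]["result"]
    if not series:
        return []
    # Each sample is [unix_timestamp, "value-as-string"].
    return [(int(ts), float(val)) for ts, val in series[0]["values"]]


def points_from_graphite(response: list[dict[str, Any]]) -> list[Point]:
    """Flatten the first target of a Graphite render (format=json) response."""
    if not response:
        return []
    # Each datapoint is [value, unix_timestamp]; value is None for gaps.
    return [
        (int(ts), float(val))
        for val, ts in response[0]["datapoints"]
        if val is not None
    ]
```

Note that the two formats put the timestamp and the value in opposite positions, which is an easy thing to get wrong when adding a new source.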
 
These metrics are intentionally chosen to be high-level and broad.  This means that not only do they show many kinds of possible outages, but also that they are hopefully understandable even to users with limited technical knowledge.
 
Said metrics may also be found on a [https://grafana.wikimedia.org/d/3u6RLsL7k/status-page Grafana dashboard] that (manually) mirrors {{gitweb|project=operations/puppet|file=hieradata/common/profile/statograph.yaml|text=Statograph's configuration}}.
 
Statograph is executed via a systemd timer that fires once a minute.  Runs are idempotent, so running it on more than one host is a simple way to get high availability.
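To illustrate why idempotent runs make redundancy cheap, here is a toy sketch. It is not how Statograph or the statuspage.io API actually works: <code>FakeStatusPage</code> and <code>run_once</code> are hypothetical names, and an in-memory dictionary stands in for the remote metric store. The idea is that each run uploads only points newer than what the page already holds, so a second run over the same data (for example from another host) changes nothing.

```python
# Hypothetical common shape: (unix timestamp in seconds, value).
Point = tuple[int, float]


class FakeStatusPage:
    """In-memory stand-in for the remote metric store (an assumption)."""

    def __init__(self) -> None:
        self.points: dict[int, float] = {}

    def latest_timestamp(self) -> int:
        # 0 when nothing has been uploaded yet.
        return max(self.points, default=0)

    def submit(self, batch: list[Point]) -> None:
        # Keyed by timestamp, so resubmitting a point overwrites it
        # instead of creating a duplicate.
        self.points.update(batch)


def run_once(page: FakeStatusPage, scraped: list[Point]) -> int:
    """Upload only points newer than the page's latest; return how many were sent."""
    cutoff = page.latest_timestamp()
    fresh = [(ts, val) for ts, val in scraped if ts > cutoff]
    page.submit(fresh)
    return len(fresh)


page = FakeStatusPage()
scraped = [(60, 1.0), (120, 2.0)]
first = run_once(page, scraped)   # uploads both points
second = run_once(page, scraped)  # same input again: nothing new to send
```

With this design, no coordination or leader election is needed between hosts: whichever run happens first does the work, and later runs over the same window are no-ops.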
 
More information on its execution model and on statuspage.io's API can be found in its {{gitweb|project=operations/software/statograph|file=statograph/uploader.py|text=Uploader class}}.
 
== Historical background ==
{{See also|phab:T202061}}
 
Our status page is primarily intended to serve the general public and the news media. We expect community members to use it as a resource too, but it is not meant to replace, for example, on-wiki technical village pumps. The focus is on highly visible, widespread outages.
 
We selected statuspage.io with the following considerations:
 
* Because we want the site to be working even in a widespread failure of Wikimedia infrastructure, any solution needs to be hosted externally
* We decided we did not want to take on the engineering effort needed to run scalable external hosting and a separate CDN
* There are very few FLOSS status page projects that are more than just "toy" projects, and of those which aren't, even fewer are actively maintained
* statuspage.io had some distinguishing features: not just the basic manually-posted up/down functionality, but also support for automated uploads of timeseries metrics, and SLO-like uptime history on each component


== See also ==

Revision as of 20:09, 15 March 2022
