== See also ==
* Launch task: [[phab:T202061]]
== External ==
When sharing the link with others, include the www. prefix: the HTTP redirect from the bare wikimediastatus.net domain is served from (offsite) WMF infrastructure, so it is not covered by the externally hosted status site itself.
wikimediastatus.net is a public and high-level uptime monitor. It is separated from our production infrastructure and hosted by Atlassian Statuspage.
It was launched in January 2022 and is maintained by the SRE team. It is the spiritual successor to status.wikimedia.org, which was hosted by Watchmouse. The new site is deliberately not under the wikimedia.org domain, both for security reasons (T293504) and so that it stays reachable during an outage of Wikimedia DNS and/or our networking infrastructure.
Instructions for users
Please see user instructions for how to read and interpret the page.
SRE usage instructions
Our status page is primarily intended to serve the general public and the news media. We of course expect community members to use it as a resource too, but it is not meant to replace, for example, on-wiki technical village pumps. The focus is on very visible, widespread outages.
We selected Atlassian's statuspage.io with the following considerations:
- Because we want the site to be working even in a widespread failure of Wikimedia infrastructure, any solution needs to be hosted externally
- We decided we did not want to take on the non-trivial engineering effort needed to run scalable external hosting plus a separate CDN
- It's critically important that the status site be scalable and able to serve large spikes of load, because that is exactly what it will face in the event of a major outage of Wikimedia infrastructure: not only will users be checking in, but the site is sure to be linked in popular news articles
- There are very few FLOSS status page projects that are more than just "toy" projects, and of those which aren't, even fewer are actively maintained
- statuspage.io had some distinguishing features: not just the basic manually-posted up/down functionality, but also support for automated uploads of timeseries metrics, and SLO-like uptime history on each component
What merits posting on the status page?
We intend to post only major outages. By “major outages” we mean problems so severe that the general public or the media might notice: issues like wikis being very slow or unreachable for many users. We don't intend to post about issues that only affect niche editing features, for example if automated citation generation is malfunctioning, if mathematical formula rendering is slow, or if the Job Queue has delays.
The status page will definitely be useful for the editor community and others directly involved in the projects, but it won’t be replacing forums for in-depth discussion like Technical Village Pumps or Phabricator – rather, it will supplement them, particularly as a place to check when the wikis are unreachable for you.
Statograph (automated metrics upload)
statograph is a tool that uploads timeseries metrics from sources like Prometheus and Graphite to the metrics on your statuspage.io installation.
As configured at WMF, it runs on hosts with the alerting_host puppet role (e.g. alert2001), and scrapes timeseries from Thanos (globally-aggregated Prometheus), as well as one timeseries from Graphite.
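To make the data flow concrete, here is a minimal sketch of the "scrape, then upload" step. It is not statograph's actual code. It assumes the standard Prometheus HTTP API range-query response format, and it assumes that a statuspage.io metric data point is submitted as a JSON object with a timestamp and a value (check this against Atlassian's API documentation before relying on it). The function names, the metric name and the sample values are hypothetical.

```python
import json
import math


def prometheus_matrix_to_points(response):
    """Flatten a Prometheus range-query response into sorted (timestamp, value) pairs."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    data = response["data"]
    if data["resultType"] != "matrix":
        raise ValueError("expected a range-query (matrix) result")
    points = []
    for series in data["result"]:
        for ts, raw in series["values"]:
            value = float(raw)  # Prometheus encodes sample values as strings
            if math.isnan(value) or math.isinf(value):
                continue  # skip samples that cannot be plotted
            points.append((int(ts), value))
    return sorted(points)


def statuspage_payload(timestamp, value):
    """Build the JSON body for one data point (assumed payload shape)."""
    return {"data": {"timestamp": timestamp, "value": value}}


# Hypothetical response, shaped like the output of /api/v1/query_range.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"__name__": "example_availability"},
                "values": [
                    [1700000060, "99.7"],
                    [1700000000, "99.5"],
                    [1700000120, "NaN"],
                ],
            }
        ],
    },
}

points = prometheus_matrix_to_points(sample_response)
payloads = [statuspage_payload(ts, value) for ts, value in points]
print(json.dumps(payloads, indent=2))
```

The sketch deliberately stops before any network call: building the payload as a pure function keeps the conversion easy to test without credentials for either service.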
These metrics are intentionally chosen to be high-level and broad. This means that not only do they show many kinds of possible outages, but also that they are hopefully understandable even to users with limited technical knowledge.
Said metrics may also be found on a Grafana dashboard that (manually) mirrors Statograph's configuration.
Statograph is executed via a systemd timer that fires once a minute. Runs are idempotent, so repeated or overlapping runs are harmless, which makes this a simple mechanism for providing high availability.
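The following sketch illustrates why idempotent runs make this scheme safe. It is a toy model rather than a description of statograph or of statuspage.io: the in-memory store, the minute alignment and all names are assumptions made for illustration. The point is only that if each run writes a data point keyed by metric and timestamp, a second run in the same minute overwrites the point instead of duplicating it.

```python
class FakeMetricStore:
    """In-memory stand-in for the remote metrics backend (hypothetical)."""

    def __init__(self):
        self.points = {}

    def put(self, metric_id, timestamp, value):
        # Writing the same (metric, timestamp) key again replaces the value,
        # so repeating an upload leaves the stored state unchanged.
        self.points[(metric_id, timestamp)] = value


def align_to_minute(epoch_seconds):
    """Round an epoch timestamp down to the start of its minute."""
    return epoch_seconds - (epoch_seconds % 60)


def run_once(store, metric_id, now, read_value):
    """One timer-triggered run: read the value for this minute and store it."""
    ts = align_to_minute(now)
    store.put(metric_id, ts, read_value(ts))
    return ts


def read_value(ts):
    """Hypothetical metric source; the value depends only on the aligned timestamp."""
    return 42.0


store = FakeMetricStore()
# Two runs fire a few seconds apart within the same minute,
# for example from timers on two different hosts.
ts_a = run_once(store, "example_metric", 1699999983, read_value)
ts_b = run_once(store, "example_metric", 1700000011, read_value)
print(ts_a == ts_b, len(store.points))
```

Under these assumptions, no leader election or locking is needed: if one run is skipped or a host is down, any other run in the same minute produces the same result.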
More information on its execution model and on statuspage.io's API can be found in its Uploader class.