Add reference numbers or images for what a "spike" might be defined as
Context: I'm looking at the graphs, and IIUC the scales at the sides are all dynamic. This makes it hard for a lay-person to instantly understand whether a spike is normal or a problem.
- I.e., during times of stability the graphs will always show many small spikes for a huge variety of reasons, whereas during outages the scale will change: small spikes will be flattened while one or more large spikes appear.
- E.g. Today the "User-reported connectivity errors" graph (screenshot) shows a graph of "0–1.5 reports/second". As a lay-person looking at that page for the first time, I am uncertain whether the brief spike to ~1.02/s indicated a small problem or a big problem.
I suggest adding something to the Status-page to help explain what a spike might look like or be defined as. E.g.
- Perhaps adding some numbers into the tooltips? (E.g. "A variation of 0–1/s is a normal baseline; major outages usually go over 9000/s.")
- Perhaps linking/embedding a screenshot of an old major problem? (or a gallery, as some graphs might remain stable when others are spiking. E.g. this screenshot includes the outage from March 29 but only shows spikes in 2 of the 5 graphs.)
I can currently get a slightly better understanding by looking at the "week" and "month" tabs, but if there's a completely/relatively smooth month then I wouldn't even have that!
- Great feedback, thanks @Quiddity.
- I'll update the tooltip text with some ideas of a normal range. We'll just have to make sure to keep this up-to-date.
- I also poked around in the management UI hoping to find a way to set a minimum max-y (instead of a hard max-y that then clips the graph) -- but it doesn't seem to exist. My thought is that this could give a nice visual hint of what the expected range is. I'll file a FR. ✍ CDanis 13:03, 5 April 2022 (UTC)
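The "minimum max-y" idea above can be sketched roughly as follows. This only illustrates the axis-scaling logic and is not a Statuspage feature or API: the function name `y_axis_max` and the floor value used in the example are hypothetical.

```python
def y_axis_max(samples, min_max_y):
    """Pick the upper bound of the y-axis for a graph.

    The axis never shrinks below min_max_y (the assumed "expected range"),
    but it still grows with the data, so large spikes are not clipped.
    """
    peak = max(samples, default=0.0)  # an empty series falls back to 0.0
    return max(peak, min_max_y)


# Quiet period: the axis stays at the floor, so small spikes look small.
quiet_axis = y_axis_max([0.1, 0.3, 1.02], min_max_y=5.0)

# Outage: the axis expands past the floor to fit the spike.
outage_axis = y_axis_max([0.2, 40.0, 12.5], min_max_y=5.0)
```

With a floor like this, a brief spike to about 1/s would occupy only a small part of the graph, which gives the visual hint of the expected range that a fully dynamic scale lacks.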
In the list of 5 graphs at https://www.wikimediastatus.net/#system-metrics, I almost didn't notice the "Day | Week | Month" UI elements because the links are colored grey. I suggest changing these links to be blue, so they are more noticeable and intuitively link-colored. (More context at mw:User:Quiddity/Blue link color). Cheers, Quiddity (talk) 19:13, 31 March 2022 (UTC)
- Good idea -- done! ✍ CDanis 21:20, 4 April 2022 (UTC)
I suggest adding a few links to more useful resources for visitors to the service's page. E.g.
- A link to this project page (https://wikitech.wikimedia.org/wiki/Wikimediastatus.net) for more context
- A link to the preferred feedback/questions location (this talkpage? and/or IRC?)
- A link to the logs-browser for #wikimedia-operations and/or for #wikimedia-tech so that people can see the latest messages without joining IRC directly, to determine if an incident is already reported.
- In general, I am wildly ambivalent about this.
- I really want to provide backlinks somewhere, but I'm worried about what will happen when we're having a large outage and the status page is getting a lot of traffic:
- either the links won't work because we're hard down for many users, or,
- we potentially create a greater outage, or an outage against some of our not-provisioned-for-whole-Internet-load monitoring tools (e.g. Grafana).
- I'm considering providing a mailing list for feedback (since email is async by nature). I'd still like to be able to link to a documentation page somewhere; it will probably be Wikitech as that's well-provisioned. But we probably should note on the page somehow that it might not be reachable in an outage? ✍ CDanis 13:06, 5 April 2022 (UTC)
- Ah, right. Good points!
- Hmmm. Maybe the links would be accessed significantly less, if they were placed in a collapsed-section? (cf. I've written an essay advising against using collapsed-sections as a UX (without due consideration) at mw:User:Quiddity/Collapsing and hiding, with 'decreased accessibility/readership' as one of the main reasons!) – Or located in a subpage like https://www.wikimediastatus.net/history with provisos/cautions/requests-for-hesitance highlighted at the top? (But I also grok the BEANS problem...)
- I.e. In my mind, 2 of the core use-cases for the status page are:
- (a) for Wikimedians who want to know if a problem is just affecting them (i.e. before asking at their local village-pump or realtime-chat platform), and those folks could be further helped by providing additional avenues for investigation/followup (I.e. "I am affected by a problem, and I see a spike in the status page, but nothing written about it (yet) in the Incident history. I want to make sure it has been reported. My next step is I should check [....].")
- (b) for external folks, like Press, who might include a link to the status page in their article/tweet/etc. And yeah, we want to not slashdot the links that were provided for group (a)!
- It's definitely difficult to balance the two... Quiddity (talk) 19:22, 5 April 2022 (UTC)
- I think I've figured out what I want to do here.
- This was actually prompted by something unrelated, which was trying out the 'publish postmortem' feature of Statuspage: https://www.wikimediastatus.net/incidents/jnqvz8gljzhy
- I don't mean for that to replace incident docs on Wikitech -- in fact I think we should link to a full Wikitech postmortem doc when we have one. But also I'd like for SRE to publish a very abbreviated version on Statuspage (max of a few sentences, and ideally sentences that would be suitable for Simple English Wikipedia).
- So now I think I've decided that linking to Wikitech is okay -- if it doesn't work for the user in an outage, it's not a great user experience, but it won't be any worse for us than what is already happening to the site. I don't think we should allow linking to Grafana, Phabricator, Gerrit, etc, directly from the status page -- those tools are much less well-provisioned than the main wiki cluster.
- I think I'll make a more user-facing version of the documentation page for the status page and start off by linking to it from the footer there. ✍ CDanis 13:53, 8 April 2022 (UTC)
- @CDanis That all sounds good & reasonable. :)
- I wondered if anything at all like this already existed, and found that Reporting a connectivity issue is currently linked to from a few places (e.g. mw:How to report a bug#Reporting a connectivity issue and w:en:Wikipedia:Village pump (technical)/FAQ), so perhaps updating that existing page would be good?
- HTH. Quiddity (talk) 17:49, 8 April 2022 (UTC)
- The new page at Wikimediastatus.net/User instructions looks good!
- I hesitantly suggest adding a section at the bottom for "Next steps"/"Report a problem" (or something) with a link to Reporting a connectivity issue and maybe to versions.toolforge.org. Quiddity (talk) 06:01, 24 April 2022 (UTC)
Alternative to Atlassian Statuspage
Hello, what is the license of Atlassian Statuspage, and have you thought about generating the site statically? I really like static websites, and I think it would be great if that were possible for this website. I think it could be an interesting experiment for the Hackathon in May to try to do that.--Hogü-456 (talk) 19:08, 3 April 2022 (UTC)
- Thanks for your comment! Atlassian Statuspage is a commercial, closed-source product. We had considered static site generators but decided that the automated publishing of timeseries metrics was a feature we really wanted. ✍ CDanis 13:01, 5 April 2022 (UTC)
- Will you participate in the Hackathon? Up to now I thought that the strategy of the Wikimedia Foundation was to use open-source or free software where possible. Maybe there will be an alternative to Atlassian Statuspage in the future. I will check whether I can find an alternative for the automated publishing of timeseries metrics.--Hogü-456 (talk) 18:33, 5 April 2022 (UTC)
Time zone of graphs
While the incidents have timestamps in UTC, the graphs seem to be aligned on "local time zone", whatever that means (computer-setup time, I presume). I haven't seen that information on the page - would it make sense to add it somewhere? --Isabelle Hurbain-Palatin (talk) 07:15, 12 April 2022 (UTC)
- Yes, indeed, the graphs use the user's computer clock time zone, whereas the incident dates are presented in UTC. I'd rather have everything presented in the user's "local" time zone, which would simplify things. I'll file a FR! ✍ CDanis 14:56, 12 April 2022 (UTC)
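To illustrate the mismatch described above, here is a small sketch that labels one instant both in UTC (as the incident list does) and in a viewer's local offset (as the graphs do). The helper name `label` and the UTC+2 viewer offset are assumptions made for the example; this is not how Statuspage itself renders times.

```python
from datetime import datetime, timedelta, timezone


def label(ts, tz):
    """Format a timezone-aware timestamp in the given zone, with its UTC offset."""
    return ts.astimezone(tz).strftime("%Y-%m-%d %H:%M %z")


# One instant, stored in UTC.
instant = datetime(2022, 4, 12, 7, 15, tzinfo=timezone.utc)

# Hypothetical viewer whose computer clock is set to UTC+2.
viewer_zone = timezone(timedelta(hours=2))

incident_label = label(instant, timezone.utc)  # what the incident list would show
graph_label = label(instant, viewer_zone)      # what the graph axis would show
```

Both labels refer to the same moment, but they differ by two hours on screen, so showing the offset or zone next to each time would remove the ambiguity.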
Suggestion: MediaWiki version numbers
Just really quick. I like the new page very much! 🤗 The design and focus on a few core metrics is superb for a wide audience. The only detail I personally miss is something like https://versions.toolforge.org. Just the top row with the version numbers that currently run on each wiki. --Thiemo Kreuz (WMDE) (talk) 07:41, 12 April 2022 (UTC)
- Thanks Thiemo! I hear you that the wiki version numbers are useful information for you and other developers, however I think that they aren't something generally-relevant enough to put on the main status page. The Toolforge app, or Grafana, or Logstash will be better places for that info. ✍ CDanis 14:53, 12 April 2022 (UTC)