Manually page SRE about emergencies.
Puppet classes: klaxon, profile::klaxon
Klaxon is a simple web application that allows Wikimedia Foundation staff, as well as other trusted contributors, to manually notify SRE about outages and other emergency situations.
It can be accessed at https://klaxon.wikimedia.org/.
What kinds of emergencies should I use Klaxon for?
- Outages that affect many users, demand an urgent response from SRE, and aren't already known to the SRE team.
- The compromise of credentials for shell accounts or accounts with NDA access.
- A security vulnerability that is being actively exploited on WMF-run sites.
What shouldn't I use Klaxon for?
- Issues where automated monitoring has already paged SRE. (This is visible in Klaxon itself.)
- Issues that are not urgent and can wait for business hours to be handled. (File a Phabricator ticket instead.)
- Contacting someone other than SRE.
Who receives pages submitted to Klaxon?
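Klaxon is just a webapp over an API provided by the Splunk On-Call service (previously known as VictorOps). This is the same service that the SRE team uses to receive push notifications, SMS messages, and phone calls when automated monitoring notices an issue. Pages submitted through Klaxon therefore reach SRE through the same channel as automated alerts.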
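To make "a webapp over an API" more concrete, here is a minimal sketch of how a manual page could be packaged as an HTTP request to an alerting service. It is not Klaxon's actual code. The field names (`message_type`, `entity_display_name`, `state_message`) are assumed to follow Splunk On-Call's generic REST endpoint integration, and whether Klaxon uses that integration is itself an assumption. The URL, function names, and example values are hypothetical placeholders.

```python
import json
import urllib.request

# Hypothetical placeholder URL; a real endpoint would embed an integration key
# and a routing key. This is NOT the address Klaxon uses.
ALERT_URL = "https://alert.example.invalid/integrations/generic/alert/API_KEY/ROUTING_KEY"


def build_page_payload(summary: str, description: str, reporter: str) -> dict:
    """Build the JSON body for a manual page (field names are assumptions)."""
    return {
        "message_type": "CRITICAL",  # assumed value that triggers a page
        "entity_display_name": summary,
        "state_message": f"{description}\n\nReported by: {reporter}",
    }


def build_request(payload: dict, url: str = ALERT_URL) -> urllib.request.Request:
    """Prepare, but do not send, the HTTP POST that would carry the page."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_page_payload(
    summary="Login failing on all wikis",
    description="Users report errors when logging in.",
    reporter="example-user",
)
request = build_request(payload)
# Actually sending the page would be: urllib.request.urlopen(request)
```

The request is only constructed here, never sent, so the sketch can be run safely without paging anyone.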
Who is allowed to send pages using Klaxon?
- The wmf group, for Wikimedia Foundation staff
- The wmde group, for Wikimedia Deutschland staff
- The nda group, for volunteer contributors who have signed a non-disclosure agreement
- The ops group, people who have root in production
Eventually, we are likely to expand this to anyone who has a shell account or MediaWiki/other service deployment access (which often, but not always, overlaps with one of the above groups).
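The group-based rule above amounts to a simple membership check. The following is an illustrative sketch only, not Klaxon's real authorization code: the group names come from the list above, while the function name and the way a user's groups are obtained are assumptions.

```python
# Group names are taken from the list above; everything else is illustrative.
ALLOWED_GROUPS = frozenset({"wmf", "wmde", "nda", "ops"})


def may_send_page(user_groups) -> bool:
    """Return True if the user belongs to at least one permitted group."""
    return not ALLOWED_GROUPS.isdisjoint(user_groups)


print(may_send_page(["nda", "deployment"]))  # True
print(may_send_page(["shell-only"]))         # False
```

Membership in any one of the groups is enough, so a user in several groups is not treated differently from a user in just one.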
Should I ever put confidential data or sensitive information in Klaxon?
If you need to urgently report a security issue being actively exploited, open a security issue with the details rather than putting them in Klaxon.
You can then refer to the task number within Klaxon.
Klaxon is hosted on Wikimedia infrastructure, and relies upon our SSO service also hosted there – isn't that a problem?
No. We believe our automated monitoring (which includes externally-hosted meta-monitoring) is more than sufficient to detect issues on the scale of "an entire datacenter went offline" or "lots of critical infrastructure suffered a hard failure".
Klaxon is not intended as a substitute for other kinds of defenses-in-depth; rather, it is intended to allow trusted users to easily escalate urgent issues which fell through the cracks of automated monitoring (which is invariably imperfect).
Why do you keep talking about "pages" and "paging"?
The term originates from so-called pager devices (also known as 'beepers'). Unfortunately, like many computing terms, it is also overloaded with multiple meanings.