You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Klaxon: Difference between revisions

From Wikitech-static
imported>Majavah
(tier 1 headers should only be used for page titles)
imported>Legoktm
m (ce, remove unnecessary  )
(5 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{infobox
{{infobox
|above=Klaxon
|above=Klaxon
|subheader=Manually page SRE about emergencies.
|subheader=Allows humans to contact [[mw:Wikimedia Site Reliability Engineering|SRE]] about urgent emergencies.
|image=[[Image:Megaphone_(iconfinder_3890930).svg|80px|center]]
|image=[[File:Megaphone_(iconfinder_3890930).svg|80px|center]]
|label1=URL
|label1=URL
|data1=https://klaxon.wikimedia.org/
|data1=https://klaxon.wikimedia.org/
Line 10: Line 10:
|data3={{gitweb|project=operations/puppet|file=modules/klaxon/manifests/init.pp|text=klaxon}} {{gitweb|project=operations/puppet|file=modules/profile/manifests/klaxon.pp|text=profile::klaxon}}
|data3={{gitweb|project=operations/puppet|file=modules/klaxon/manifests/init.pp|text=klaxon}} {{gitweb|project=operations/puppet|file=modules/profile/manifests/klaxon.pp|text=profile::klaxon}}
}}
}}
'''Klaxon''' is a simple web application that allows Wikimedia Foundation staff, as well as other trusted contributors, to manually notify [[mw:Wikimedia Site Reliability Engineering|SRE]] about outages and other emergency situations.


'''Klaxon''' is a simple web application that allows Wikimedia Foundation staff, as well as other trusted contributors, to manually notify SRE about outages and other emergency situations.
It can be accessed at https://klaxon.wikimedia.org/.


It can be accessed at https://klaxon.wikimedia.org/.
SREs and other technical contributors wishing to contribute to Klaxon, see also [[Klaxon/Administration]].


== FAQ ==
== FAQ ==
Line 20: Line 21:
* Outages that affect many users, that demand an urgent response from SRE, and which aren't already known to the SRE team.
* Outages that affect many users, that demand an urgent response from SRE, and which aren't already known to the SRE team.
* The compromise of credentials for [[production access|shell accounts]] or accounts with NDA access.
* The compromise of credentials for [[production access|shell accounts]] or accounts with NDA access.
* A security vulnerability that is being actively exploited on WMF-run sites.
* A security vulnerability that is being actively exploited on WMF-run sites or infrastructure.


=== What shouldn't I use Klaxon for? ===
=== What shouldn't I use Klaxon for? ===
* Issues where automated monitoring has already paged SRE.  (This is visible in Klaxon itself.)
* Issues where automated monitoring has already paged SRE.  (This is visible in Klaxon itself.)
* Issues that are not urgent / that can wait for business hours to be handled.  ([https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?tags=Operations File a Phabricator ticket] instead.)
* Issues that are not urgent / that can wait for business hours to be handled.  ([https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?tags=SRE File a Phabricator ticket] instead.)
* Contacting someone other than SRE.
* Contacting someone other than SRE.


Line 30: Line 31:
Members of the [[mw:Wikimedia Site Reliability Engineering|Wikimedia Site Reliability Engineering]] team, whether or not it is their working hours.  (Most SREs receive pages just during hours they're likely to be awake; some opt for [[w:24/7 service|24/7]] paging.)
Members of the [[mw:Wikimedia Site Reliability Engineering|Wikimedia Site Reliability Engineering]] team, whether or not it is their working hours.  (Most SREs receive pages just during hours they're likely to be awake; some opt for [[w:24/7 service|24/7]] paging.)


Klaxon is just a webapp over an [[w:API|API]] provided by [https://www.splunk.com/en_us/software/splunk-on-call.html Splunk On-Call] service (previously known as VictorOps).  This is the service that the SRE team uses to receive push notifications/SMS/phone calls when automated monitoring notices an issue.
Klaxon is just a webapp over an [[w:API|API]] provided by the [https://www.splunk.com/en_us/software/splunk-on-call.html Splunk On-Call] service (previously known as VictorOps).  This is the service that the SRE team uses to receive push notifications/SMS/phone calls when automated monitoring notices an issue.


=== Who is allowed to send pages using Klaxon? ===
=== Who is allowed to send pages using Klaxon? ===
Line 39: Line 40:
* The [https://ldap.toolforge.org/group/ops ops] group, people who have root in production
* The [https://ldap.toolforge.org/group/ops ops] group, people who have root in production


Eventually, it's likely we'd expand this to anyone who has a shell account or Mediawiki/other service deployment access (which often overlaps with one of the above groups, but doesn't always).
Eventually, it's likely we'd expand this to anyone who has a shell account or MediaWiki/other service deployment access (which often overlaps with one of the above groups, but doesn't always).


=== Should I ever put confidential data or sensitive information in Klaxon? ===
=== Should I ever put confidential data or sensitive information in Klaxon? ===
Line 45: Line 46:
No.
No.


If you need to share [[w:Personal data|PII]] as part of reporting an outage – even if it is just your own IP address – open a [https://phabricator.wikimedia.org/maniphest/task/edit/form/23/ WMF-NDA task] with the details.
If you need to share [[w:Personal data|PII]] as part of reporting an outage – even if it is just your own IP address – open a [https://phabricator.wikimedia.org/maniphest/task/edit/form/23/ WMF-NDA task] with the details.  If you don't have permission to do that, open a [https://phabricator.wikimedia.org/maniphest/task/edit/form/75/ security issue] instead.


If you need to urgently report a security issue being actively exploited, open a [https://phabricator.wikimedia.org/maniphest/task/edit/form/75/ security issue] with the details.
If you need to urgently report a security issue being actively exploited, open a [https://phabricator.wikimedia.org/maniphest/task/edit/form/75/ security issue] with the details.
Line 51: Line 52:
You can then refer to those task numbers within Klaxon.
You can then refer to those task numbers within Klaxon.


=== Klaxon is hosted on Wikimedia infrastructure, and relies upon our [[w:Single sign-on|SSO]] service also hosted there – isn't that a problem? ===
=== Klaxon is hosted on Wikimedia infrastructure, and relies upon our [[w:Single sign-on|SSO]] service also hosted there – isn't that a problem? ===
 
No.


No.  We believe our automated monitoring (which includes externally-hosted [[wikitech-static#Meta-monitoring|meta-monitoring]]) is more than sufficient to detect issues on the scale of "an entire datacenter went offline" or "lots of critical infrastructure suffered a hard failure".
We believe our automated monitoring (which includes externally-hosted [[wikitech-static#Meta-monitoring|meta-monitoring]]) is more than sufficient to detect issues on the scale of "an entire datacenter went offline" or "lots of critical infrastructure suffered a hard failure".


Klaxon is not intended as a substitute for other kinds of defenses-in-depth; rather, it is intended to allow trusted users to easily escalate urgent issues which fell through the cracks of automated monitoring (which is invariably imperfect).
Klaxon is not intended as a substitute for other kinds of defenses-in-depth; rather, it is intended to allow trusted users to easily escalate urgent issues which fell through the cracks of automated monitoring (which is invariably imperfect).
=== Klaxon looks not entirely unlike a status page – should it be used for that purpose? ===
While Klaxon ''is'' a quick way to check if SRE has been paged recently, it is ''not'' a proper user-facing status page.
For one thing, there's a complicated, not-completely-overlapping relationship between pages, automated alerts, and user-affecting incidents – an incident can exist without any pages ever occurring, and also, vice versa.  Moreover, Klaxon displays machine-generated alert summaries, which are often difficult to interpret even for the SRE team themselves.  And finally, a true status page would have to be hosted externally, not on WMF networks and infrastructure.
In order to produce a proper user-facing status page, much more work would be needed – not just technical work, but also process work.  This is out-of-scope for Klaxon, but future work will hopefully follow soon.


=== Why do you keep talking about "pages" and "paging"? ===
=== Why do you keep talking about "pages" and "paging"? ===
The term originates from so-called [[w:pager|pager]] devices (also known as 'beepers').  Unfortunately, like many computing terms, they are also overloaded with multiple meanings.
The term originates from so-called [[w:pager|pager]] devices (also known as 'beepers').  Unfortunately, like many computing terms, they are also overloaded with multiple meanings.
[[Category:Services]]

Revision as of 23:21, 22 October 2021

Klaxon
Allows humans to contact SRE about urgent emergencies.
URL: https://klaxon.wikimedia.org/
Source code: operations/software/klaxon
Puppet classes: klaxon, profile::klaxon

Klaxon is a simple web application that allows Wikimedia Foundation staff, as well as other trusted contributors, to manually notify SRE about outages and other emergency situations.

It can be accessed at https://klaxon.wikimedia.org/.

SREs and other technical contributors who wish to contribute to Klaxon should also see Klaxon/Administration.

FAQ

What kinds of emergencies should I use Klaxon for?

  • Outages that affect many users, demand an urgent response from SRE, and aren't already known to the SRE team.
  • The compromise of credentials for shell accounts or accounts with NDA access.
  • A security vulnerability that is being actively exploited on WMF-run sites or infrastructure.

What shouldn't I use Klaxon for?

  • Issues where automated monitoring has already paged SRE. (This is visible in Klaxon itself.)
  • Issues that are not urgent, or that can wait until business hours to be handled. (File a Phabricator ticket instead.)
  • Contacting someone other than SRE.

Who receives pages submitted to Klaxon?

Members of the Wikimedia Site Reliability Engineering team, whether or not it is during their working hours. (Most SREs receive pages only during hours they're likely to be awake; some opt for 24/7 paging.)

Klaxon is just a webapp over an API provided by the Splunk On-Call service (previously known as VictorOps). This is the service that the SRE team uses to receive push notifications/SMS/phone calls when automated monitoring notices an issue.
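To illustrate what "a webapp over an API" means in practice, here is a minimal sketch of how a page could be turned into an HTTP request for a paging service. It is not Klaxon's actual code: the endpoint URL, the JSON field names, and the function name `build_page_request` are all assumptions made up for this example. The request is only built, never sent.

```python
import json
import urllib.request

# Assumption: placeholder endpoint. The real paging API URL and its
# credentials are not documented on this page.
ALERT_URL = "https://paging.example.org/alert"


def build_page_request(summary: str, reporter: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that would raise a page."""
    # Assumption: these field names are illustrative, not a documented schema.
    payload = {
        "message_type": "CRITICAL",
        "entity_display_name": summary,
        "state_message": f"Manually paged by {reporter}",
    }
    return urllib.request.Request(
        ALERT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A real deployment would pass the resulting request to `urllib.request.urlopen` and handle errors; that step is left out here so the sketch has no side effects.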

Who is allowed to send pages using Klaxon?

Currently, staff and other trusted contributors, as determined by membership in one of the following LDAP groups:

  • The ops group, people who have root in production

Eventually, we will likely expand this to anyone who has a shell account or deployment access for MediaWiki or another service (which often, but not always, overlaps with one of the above groups).
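The group-based rule described above can be sketched as a simple set check. This is a hedged illustration, not Klaxon's implementation: only the `ops` group is named on this page, so the second group name and the function name `may_send_page` are hypothetical.

```python
# Assumption: "ops" is the only group named on this page; the other
# entry is a made-up placeholder for the remaining allowed groups.
ALLOWED_GROUPS = frozenset({"ops", "example-staff-group"})


def may_send_page(user_groups) -> bool:
    """Return True if the user belongs to at least one allowed LDAP group."""
    return not ALLOWED_GROUPS.isdisjoint(user_groups)
```

For example, `may_send_page({"ops", "other"})` is true, while a user whose groups do not intersect the allowed set is refused.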

Should I ever put confidential data or sensitive information in Klaxon?

No.

If you need to share PII as part of reporting an outage – even if it is just your own IP address – open a WMF-NDA task with the details. If you don't have permission to do that, open a security issue instead.

If you need to urgently report a security issue being actively exploited, open a security issue with the details.

You can then refer to those task numbers within Klaxon.
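As a small illustration of referring to tasks rather than pasting sensitive details, the sketch below extracts task numbers from a page message. It assumes Phabricator task IDs are written as "T" followed by digits (for example T12345); the function name `referenced_tasks` is hypothetical and nothing here is part of Klaxon itself.

```python
import re

# Assumption: task IDs look like "T" followed by one or more digits.
TASK_ID = re.compile(r"\bT\d+\b")


def referenced_tasks(message: str) -> list[str]:
    """Return the task IDs mentioned in a message, in order of appearance."""
    return TASK_ID.findall(message)
```

A message such as "Login outage, details in T12345" keeps the confidential details in the restricted task while still telling responders where to look.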

Klaxon is hosted on Wikimedia infrastructure, and relies upon our SSO service also hosted there – isn't that a problem?

No.

We believe our automated monitoring (which includes externally-hosted meta-monitoring) is more than sufficient to detect issues on the scale of "an entire datacenter went offline" or "lots of critical infrastructure suffered a hard failure".

Klaxon is not intended as a substitute for other kinds of defense in depth; rather, it is intended to allow trusted users to easily escalate urgent issues that fall through the cracks of automated monitoring (which is invariably imperfect).

Klaxon looks not entirely unlike a status page – should it be used for that purpose?

While Klaxon is a quick way to check if SRE has been paged recently, it is not a proper user-facing status page.

For one thing, there's a complicated, not-completely-overlapping relationship between pages, automated alerts, and user-affecting incidents – an incident can exist without any page ever being sent, and vice versa. Moreover, Klaxon displays machine-generated alert summaries, which are often difficult to interpret even for the SRE team itself. And finally, a true status page would have to be hosted externally, not on WMF networks and infrastructure.

Producing a proper user-facing status page would take much more work – not just technical work, but also process work. This is out of scope for Klaxon, but future work will hopefully follow soon.

Why do you keep talking about "pages" and "paging"?

The term originates from so-called pager devices (also known as 'beepers'). Unfortunately, like many computing terms, it is also overloaded with multiple meanings.