SRE/Clinic Duty: Difference between revisions

From Wikitech-static
imported>BCornwall
(Move schedule to subpage)
imported>BCornwall
(Move ''Handle incoming IRC requests'' down another heading to put it in line with the others)
 
Line 2: Line 2:
{{see|For the Clinic Duty schedule, see [[SRE/Clinic Duty/Schedule]].}}
{{see|For the Clinic Duty schedule, see [[SRE/Clinic Duty/Schedule]].}}


The SRE '''Clinic Duty''' triage duty was established to ensure that tickets (and thus requests and projects) are triaged and processed in a timely fashion, providing feedback and regular updates to SRE-supported projects/responsibilities.
The SRE '''Clinic Duty''' was established to ensure that tickets (and thus requests and projects) are triaged and processed in a timely fashion, providing feedback and regular updates to SRE-supported projects/responsibilities.


This is a duty that is fulfilled by a member of the Wikimedia SRE team (each member changes on a [[SRE/Clinic Duty/Schedule|rotating schedule]]).
This is a duty that is fulfilled by a member of the Wikimedia SRE team (each member changes on a [[SRE/Clinic Duty/Schedule|rotating schedule]]).
Line 18: Line 18:


;Be available and ready to do gruntwork
;Be available and ready to do gruntwork
:During SRE Clinic Duty the SRE on duty should remain available in IRC & email; This duty is fairly interrupt-driven, and will interrupt a person's normal workflow on the week they are on duty. The person on clinic duty is a first contact, including on IRC (timezone/availability permitting). However, This duty shouldn't normally require any adjustment to one's normal working schedule; if you work business hours in CET, then you wouldn't shift your hours on clinic duty for another time zone.
:During SRE Clinic Duty the SRE on duty should remain available in IRC and email; This duty is fairly interrupt-driven, and will interrupt a person's normal workflow on the week they are on duty. The person on clinic duty is a first contact, including on IRC (timezone/availability permitting). However, This duty shouldn't normally require any adjustment to one's normal working schedule; if you work business hours in CET, then you wouldn't shift your hours on clinic duty for another time zone. Clinic duty is not expected to be performed during weekends and holidays.


:Follow up with ticket owners and requestors as needed on old tickets to resolve, re-assign, or escalate as needed. Folks will, in turn, follow up with you after your shift is done. As the person on clinic duty you are welcome to join #wikimedia-clinic for assistance while carrying out your shift
:Follow up with ticket owners and requestors as needed on old tickets to resolve, re-assign, or escalate as needed. Folks will, in turn, follow up with you after your shift is done. As the person on clinic duty you are welcome to join #wikimedia-clinic for assistance while carrying out your shift
Line 26: Line 26:
{{note|An exception to the above rules is that clinic duty should not triage/escalate/work tasks in the ''S4 #procurement'' projects; These have a lot of out-of-Phabricator communications with vendors/engineers/finance and thus handled by Rob or Willy.}}
{{note|An exception to the above rules is that clinic duty should not triage/escalate/work tasks in the ''S4 #procurement'' projects; These have a lot of out-of-Phabricator communications with vendors/engineers/finance and thus handled by Rob or Willy.}}


==Hand-off / Takeover==
==Handing off duties==


Ideally all phabricator tasks are replied/commented upon in the process of reviewing and triaging, so no actual handoff of duties is required between weeks. Update the topic in IRC channel #wikimedia-operations, section 'SRE Clinic Duty:' with the person's name for that week; The topic on IRC and this page are currently the public facing methods of determining who is on duty.
Ideally all phabricator tasks are replied/commented upon in the process of reviewing and triaging, so no actual handoff of duties is required between weeks. Update the topic in IRC channel <code>#wikimedia-operations</code>, section ''SRE Clinic Duty:'' with the person's name for that week; The topic on IRC and this page are currently the public facing methods of determining who is on duty.


==Responsibilities==
==Responsibilities==
Line 41: Line 41:
===Review incoming tasks===
===Review incoming tasks===


:*Review all incoming tasks to the #sre-access-requests, #ldap-access-requests, #wmf-nda-requests that have also #SRE, #wikimedia-mailing-list (just list creation/maintenance columns), #patch-for-review that have also #SRE and #SRE projects workboards.
Review all incoming tasks to the following [[Phabricator]] projects workboards:
:*These are all included on the Workboard Links panel in the [[phab:dashboard/view/45/|SRE Clinic Duty Dashboard]]
:*Escalate, update, and follow up as needed for any incoming tasks to ensure they are worked upon.
::*Assign a priority to tasks that come in after consulting with the relevant team. Better: ask them to set a priority.
::*Ask for more data from requester if needed in order to confirm the request, such as date it must be completed by, additional details, etc.
::*Tag the task with all the relevant teams
::*If the request is relatively quick, just do it yourself


===Maintain the 'ops-maintenance' mails and calendar===
* #SRE
* #ldap-access-requests
* #patch-for-review (with #SRE tagged)
* #sre-access-requests
* #wikimedia-mailing-list (just list creation/maintenance columns)
* #wmf-nda-requests (with #SRE tagged)


:* Go to the Google group 'ops-maintenance'[1] and [https://groups.google.com/u/0/a/wikimedia.org/g/ops-maintenance/search?q=is%3Aunresolved filter by "resolved status: unresolved"]
{{note|These are all included on the Workboard Links panel in the [[phab:dashboard/view/45/|SRE Clinic Duty Dashboard]].}}
:**If you've opted out of the new Groups UI, instead go to the [https://groups.google.com/a/wikimedia.org/forum/#!forum/ops-maintenance Google group]. Go to "Filters", click the radio button next to "All unresolved" and then "Apply filter". ([//upload.wikimedia.org/wikipedia/labs/d/da/Maint-announce-filter-all-unresolved.png screenshot])
:*Your task is to process all messages you see now until this screen is empty. [2]
:*Check if there is a yellow banner that says '# messages pending', those were external messages blocked because the sender is not a member of the list. Click on it and review the messages, deciding if is spam and should be deleted or legit and should be posted to the list. Choose either post or post and always accept messages from this sender, on a case-by-case basis.
:*Open the [https://office.wikimedia.org/wiki/Office_IT/Calendars#Human_calendars gcal shared with all WMF named 'Ops vendor maintenance & contracts'] in a second tab. [3][4]
:*Read each message and determine if it needs an action or not.  Adding to the Google calendar is the only possible thing to do besides deciding that no action is needed [5].
:*If appropriate add an entry to the calendar.[6] From the calendar entry link back to the individual post in the group. You get the link from the context menu. ([//upload.wikimedia.org/wikipedia/labs/3/33/Maint-announce-get-post-link.png screenshot]) You may if you like add the tag 'added to calendar'; it's not required.
:**It is recommended to run <code>ops-maint-gcal.js</code> to automatically create calendar links in Google Groups. See https://github.com/wikimedia/operations-software/tree/master/clinic-duty for the code and instructions. If the email fails to parse please consider contributing to the script.
:*Click "Mark as complete" on each mail that has been processed in one way or another.
:*Repeat until there are no mails left that are shown with the filter "unresolved". You are done. [8]


[1]: You should have access either through individual membership or inherited permissions from being a member of the "[https://groups.google.com/a/wikimedia.org/g/sre sre]" group. If not, ask an existing member to add you, they should have the permissions to do so even if not owner/manager of the group. (Only add other SRE folks). Being a member gives you permissions to do things, it does _not_ necessarily mean you are also receiving emails to your personal inbox. It's entirely up to you whether you like to receive those mails in your personal inbox or just use the web interface while you're on duty.
Escalate, update, and follow up as needed for any incoming tasks to ensure they are worked upon.
*Assign a priority to tasks that come in after consulting with the relevant team. Better: ask them to set a priority.
:*Ask for more data from requester if needed in order to confirm the request, such as date it must be completed by, additional details, etc.
:*Tag the task with all the relevant teams
:*If the request is relatively quick, just do it yourself


[2]: Sometimes this doesn't seem to refresh and marked posts are not disappearing from your view immediately. If this happens, removing the filter and applying it again helps.
=== Maintain the maintenance calendar ===


[3]: If you are not able to create events, ask an SRE to add you (calendar settings => share this calendar).  
Wikimedia maintains a [https://calendar.google.com/calendar/embed?src=wikimedia.org_59rp973cn76evagsriqjs0eut8%40group.calendar.google.com calendar for Vendor Maintenance] events. This calendar must be kept up-to-date so that the team knows when datacenter maintenance occurs.


[4]: You probably want to add the GMT (not daylight) timezone to your calendar (calendar settings => general => add a timezone). In this way you'll be able to specify the correct timezone when creating events for planned maintenance (usually they are announced with UTC dates).
As a prerequisite, one should have access either through individual membership or inherited permissions from being a member of the [https://groups.google.com/a/wikimedia.org/g/sre sre group]. If not, ask an existing member to add you; they should have the permissions to do so even if not owner/manager of the group.


[5]: No action is needed if it's a duplicate/reminder for an event that has already been added to calendar, if it's just an "FYI" kind of mail like "reason for outage", simple spam or anything else that doesn't warrant a calendar entry.
Presently, processing these messages is highly manual:


[6]: Copy the important part of the subject line or the summary and use it as the event title. '''Mark the event as "Free" instead of "Busy".''' If the mail contains important information like a circuit ID or details on what is affected, paste them into the body part of the calendar event. You don't need to worry about changing subjects or date formats anymore since posts will be sorted by date anyways. You also don't need to reply with a "added to calendar" message anymore and there are no other status changes, just "action needed" or not (done).
# Go to the Google group ''ops-maintenance'' and [https://groups.google.com/u/0/a/wikimedia.org/g/ops-maintenance/search?q=is%3Aunresolved filter by "resolved status: unresolved"]
#:{{note|Check if there is a yellow banner that says '# messages pending', those were external messages blocked because the sender is not a member of the list. Audit the messages for spam if any are in the queue.}}
#Open the [https://office.wikimedia.org/wiki/Office_IT/Calendars#Human_calendars ''Ops vendor maintenance and contracts'' calendar] in a second tab.
#Read each message and determine if the email needs action. If so, add an event to the Google calendar:
##Copy the important part of the subject line or the summary and use it as the event title.
##Switch the calendar from your personal calendar to the ''Vendor Maintenance'' calendar.
##Mark the event as ''Free'' instead of ''Busy''.
##Copy the contents of the email into the description of the calendar event.
##Put the correct time from the message in, converting as necessary.
#Click "Mark as complete" on each mail that has been processed in one way or another.
#Repeat until there are no mails left that require action.


[7] It doesn't matter whether you added it to the calendar or determined it can be skipped, in either case _now_ there is "no action needed" (after you're done). We do it this way and don't use the "completed" status because the way Google groups works it forces you to actually _reply_ to a mail until it can be completed. We don't need that, that would just add unnecessary clicks and mail. Since both "no action needed" and "completed" are just different kinds of "resolution status" and the filter is based on "not resolved" the end result is the same and it is much simpler for us to just use that button.
These steps may be automated with [https://github.com/wikimedia/operations-software/tree/master/clinic-duty ops-maint-gcal.js]; See for the code and instructions in the repository for more details.


[8]: WARNING: Jaime realized that marking "no action needed" on the Google Group may mark later followups on the same thread, too. While followup are normally reminders, sometimes they are also meaningful updates and cancellations. I would recommend reading all new emails on the clinic duty window to not miss those updates.
{{note|You probably want to add the UTC timezone to your calendar (''calendar settings''→''general''→''add a timezone'')}}


==Manual==
===Handle incoming IRC requests===
This is a manual for the current "SRE on duty" in charge of triaging the Phabricator #SRE project.
If somebody asks SRE to do something via IRC, politely ask the requestor to turn their request into a [[Phabricator]] ticket and add the [https://phabricator.wikimedia.org/project/view/1025/ SRE] tag to it.
 
===How to handle IRC requests===
If somebody asks you to do something via IRC, if reasonable, politely ask requestor to turn their request into a [[Phabricator]] ticket and add the [https://phabricator.wikimedia.org/project/view/1025/ SRE] tag to it.


If you suspect the issue could be related to a recent deployment or need further investigation by deployers or developers, on the [[Phabricator]] ticket, add the [https://phabricator.wikimedia.org/project/view/1055/ Wikimedia-production-error] tag to it.
If you suspect the issue could be related to a recent deployment or need further investigation by deployers or developers, on the [[Phabricator]] ticket, add the [https://phabricator.wikimedia.org/project/view/1055/ Wikimedia-production-error] tag to it.


===Common, small "#SRE" tickets===
===Phabricator Administration===
 
====Phabricator Administration====


Please note that overall phabricator administration is handled by release engineering.  The SRE clinic duty person typically would only get involved if a file needed immediately deletion or some herald rule causing chaos.
{{Main|Phabricator#Administrative Commands}}


If an SRE clinic duty person has to login, please do so by accessing the phabricator servers.  These have role(phabricator) in site.pp, but are typically phab[12]001.
{{warning|Overall phabricator administration is handled by release engineering. The SRE clinic duty person typically would only get involved if a file needed immediately deletion or some herald rule causing chaos.}}


Once in the system, the admin account login can be generated via URL path, by running: <nowiki>sudo /srv/phab/phabricator/bin/auth recover admin
If an SRE clinic duty person has to login, please do so by accessing the phabricator servers.  These have role(phabricator) in <code>site.pp</code>, but are typically phab[12]001.
</nowiki> The system will output a full url path for a one time login token as the Admin user.  You can then navigate to the offending file or herald rule and delete it via the web ui.


See [[Phabricator#Administrative Commands]] for more information.
Once in the system, the admin account login can be generated via URL path, by running <code>sudo /srv/phab/phabricator/bin/auth recover admin</code>. The system will output a full url path for a one time login token as the Admin user.  You can then navigate to the offending file or herald rule and delete it via the web ui.


====Mail aliases====
===Mail aliases===


'''note''': SRE handles only role/group mail aliases, individual mail aliases are handled by ITS as outlined here [https://office.wikimedia.org/wiki/ITS/GroupsAliasMailman]
{{note|SRE handles only role/group mail aliases, individual mail aliases are handled by ITS as outlined in [https://office.wikimedia.org/wiki/ITS/GroupsAliasMailman their documentation].}}


'''note2''': more recently many aliases have been moved from SRE to ITS, and the goal is definitely NOT to add any new ones on our side unless they are strictly SRE-internal like monitoring etc. you can help by moving even more over to ITS, see [https://phabricator.wikimedia.org/T122144 T122144]
{{note|More recently many aliases have been moved from SRE to ITS, and the goal is not to add any new ones on our side unless they are strictly SRE-internal like monitoring, etc. You can help by moving even more over to ITS! see [https://phabricator.wikimedia.org/T122144 T122144].}}


Go to the puppet master ('''puppetmaster1001'''), cd to '''/srv/private/modules/privateexim/files/''' in the private repo, usually edit the file wikimedia.org (as root) and '''sudo git commit'''. This will create a mail to SRE about the commit, with your username automatically prepended to the commit message.
Go to the puppet master ('''puppetmaster1001'''), cd to '''/srv/private/modules/privateexim/files/''' in the private repo, usually edit the file wikimedia.org (as root) and '''sudo git commit'''. This will create a mail to SRE about the commit, with your username automatically prepended to the commit message.
Line 119: Line 117:
It is nice to add the corresponding Phab ticket number in a comment near changed aliases. Experience shows that it can be quite handy to be able to quickly answer questions like when exactly something has been changed and who requested it. There is one file or symlink per domain name. 95% of the time the requests are just regarding the "wikimedia.org" file. In other cases make sure you check for possible symlinks and realize which domains you are actually changing when editing a specific file.
It is nice to add the corresponding Phab ticket number in a comment near changed aliases. Experience shows that it can be quite handy to be able to quickly answer questions like when exactly something has been changed and who requested it. There is one file or symlink per domain name. 95% of the time the requests are just regarding the "wikimedia.org" file. In other cases make sure you check for possible symlinks and realize which domains you are actually changing when editing a specific file.


====Mailman mailing lists====
===Mailman mailing lists===
Public mailing lists should typically be requested through [[Phabricator]] tagged with "[https://phabricator.wikimedia.org/project/view/190/ Wikimedia-Mailing-lists]", and Phabricator-maintenance-bot will automatically add the SRE tag. Google mailing lists are managed by ITS. You know it's a mailman list if it's @lists.wikimedia.org. To check if an email address exists in Google you can do "exim4 -bt foo@wikimedia.org" on an MX server.
Public mailing lists should typically be requested through [[Phabricator]] tagged with "[https://phabricator.wikimedia.org/project/view/190/ Wikimedia-Mailing-lists]", and Phabricator-maintenance-bot will automatically add the SRE tag. Google mailing lists are managed by ITS. You know it's a mailman list if it's @lists.wikimedia.org. To check if an email address exists in Google you can do "exim4 -bt foo@wikimedia.org" on an MX server.
;Create a list
;Create a list
Line 130: Line 128:
:Login as administrator in lists.wikimedia.org (the password is in pwstore). Then go to the mailing list administrator page (e.g. [https://lists.wikimedia.org/postorius/lists/math.lists.wikimedia.org/ here is for math@] then go to Users tab and adjust Owners accordingly.
:Login as administrator in lists.wikimedia.org (the password is in pwstore). Then go to the mailing list administrator page (e.g. [https://lists.wikimedia.org/postorius/lists/math.lists.wikimedia.org/ here is for math@] then go to Users tab and adjust Owners accordingly.


==== Access requests ====
=== Access requests ===
See [[/Access requests]] for full instructions.
{{see|See [[SRE/Clinic Duty/Access requests]]}}


====Removing access====
===Removing access===


This typically isn't part of Clinic Duty, but if you need it you can find the relevant steps at [[SRE_Offboarding#All_Users]].
This typically isn't part of Clinic Duty, but if you need it you can find the relevant steps at [[SRE_Offboarding#All_Users]].


====Powercycling / reboots====
===Server power cycling===


SRE clinic duty paging for reboots is usually due to hardware failure, or immediate concerns of exploits.  Anything outside those issues would be handled by normal operations workflow, and would not necessarily fall to the SRE clinic duty person.
SRE clinic duty paging for reboots is usually due to hardware failure, or immediate concerns of exploits.  Anything outside those issues would be handled by normal operations workflow, and would not necessarily fall to the SRE clinic duty person.


Powercycling requires a passing familiarity with the different out of band management options we use (based on vendor).  Hardware type can be determined by looking up the hardware in question in [https://wikitech.wikimedia.org Netbox]; then you can determine the instructions from [[SRE/Dc-operations/Platform-specific documentation|Platform-specific_documentation]].
Power cycling requires a passing familiarity with the different out of band management options we use (based on vendor).  Hardware type can be determined by looking up the hardware in question in [https://wikitech.wikimedia.org Netbox]; then you can determine the instructions from [[SRE/Dc-operations/Platform-specific documentation|Platform-specific_documentation]].
 
=== Maps external usage requests ===


==== Maps external usage requests ====
{{Main|Maps/External usage}}
{{Main|Maps/External usage}}
The main thing to check is that the domain belongs to a project being supported by one of the [[m:Wikimedia movement affiliates|Wikimedia movement affiliates]]. Usually the requester will belong to an affiliate or there will be a wiki page explaining the project and which affiliate is backing it.
The main thing to check is that the domain belongs to a project being supported by one of the [[m:Wikimedia movement affiliates|Wikimedia movement affiliates]]. Usually the requester will belong to an affiliate or there will be a wiki page explaining the project and which affiliate is backing it.


If that all checks out, it should be fine to add the domain to the VCL regex.
If that all checks out, it should be fine to add the domain to the VCL regex.
[[Category:How-To]]
[[Category:How-To]]

Latest revision as of 22:09, 11 August 2022

The SRE Clinic Duty was established to ensure that tickets (and thus requests and projects) are triaged and processed in a timely fashion, providing feedback and regular updates to SRE-supported projects/responsibilities.

This duty is fulfilled by one member of the Wikimedia SRE team at a time; the assignee changes on a rotating schedule.

Expectations

This should be an infrequent event for your team
The same person should not serve two weeks in a row, and no team should be affected two weeks in a row. The roster currently includes only members of the SRE team, but this can eventually expand. People serve clinic duty at roughly equal frequencies.
The right people are conscripted
If someone is doing their first clinic duty, they are backed up by a more experienced clinician, in a similar time zone. The roster excludes managers and directors.
This is a routine affair
The schedule runs from Monday to the following Monday
Be available and ready to do gruntwork
During SRE Clinic Duty the SRE on duty should remain available on IRC and email. This duty is fairly interrupt-driven and will interrupt a person's normal workflow during the week they are on duty. The person on clinic duty is a first contact, including on IRC (timezone/availability permitting). However, this duty shouldn't normally require any adjustment to one's normal working schedule; if you work business hours in CET, then you wouldn't shift your hours on clinic duty for another time zone. Clinic duty is not expected to be performed during weekends and holidays.
Follow up with ticket owners and requestors on old tickets to resolve, re-assign, or escalate as needed. Folks will, in turn, follow up with you after your shift is done. As the person on clinic duty you are welcome to join #wikimedia-clinic for assistance while carrying out your shift.
Monitor the appropriate inboxes/mailing lists: Triage any mailing list requests for operations lists; Triage emails sent to root@ (if you don't receive them, you need to add your alias in the private repo). If you see a recurrent issue, please open a sub-task to T132324 and try to notify whoever you think can contribute to the task. Review the outstanding sub-tasks and follow up as needed.
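
The rotation rules above (Monday starts, no person and no team two weeks in a row) can be checked mechanically. The following is only a minimal sketch: the Shift record, the roster_violations function, and the sample names and teams are hypothetical and do not describe how the real schedule is stored or generated.

```python
from collections import namedtuple
from datetime import date

# Hypothetical record for one clinic duty week (illustration only).
Shift = namedtuple("Shift", ["week_start", "person", "team"])


def roster_violations(shifts):
    """Return human-readable problems found in a list of weekly shifts."""
    problems = []
    ordered = sorted(shifts, key=lambda s: s.week_start)

    # The schedule runs from Monday to the following Monday.
    for shift in ordered:
        if shift.week_start.weekday() != 0:  # 0 means Monday
            problems.append(f"{shift.week_start}: shift does not start on a Monday")

    # Compare each week with the one before it.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.person == prev.person:
            problems.append(f"{cur.week_start}: {cur.person} is on duty two weeks in a row")
        elif cur.team == prev.team:
            problems.append(f"{cur.week_start}: team {cur.team} is affected two weeks in a row")
    return problems


if __name__ == "__main__":
    roster = [
        Shift(date(2022, 8, 1), "alice", "infra"),
        Shift(date(2022, 8, 8), "bob", "infra"),
        Shift(date(2022, 8, 15), "bob", "data"),
    ]
    for problem in roster_violations(roster):
        print(problem)
```

The sketch only compares adjacent weeks; it does not attempt to balance how often each person serves.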

Handing off duties

Ideally, all Phabricator tasks are replied to or commented on in the process of reviewing and triaging, so no actual handoff of duties is required between weeks. Update the topic in IRC channel #wikimedia-operations, section SRE Clinic Duty:, with the person's name for that week. The topic on IRC and this page are currently the public-facing methods of determining who is on duty.

Responsibilities

  • Folks tend to have their own dashboard, which is fine when they are NOT on clinic duty. When you take clinic duty, you can install the SRE Clinic Duty dashboard to your homescreen for that time and swap back to your own when finished.
  • Please try to refrain from editing the SRE Clinic Duty dashboard to reflect non-clinic duties. There is a panel for 'tasks assigned to myself' at the bottom: most of SRE Clinic Duty is triaging and knocking down tasks, which tends not to involve long-running personal tasks, but even on clinic duty you need to see your own tasks.
  • You can search "to:alerts@wikimedia.org" in Gmail to see all things that have paged people, independent of timezones and individual settings. This is used to fill the "pages for awareness" section in the SRE meeting document.

Review incoming tasks

Review all incoming tasks on the following Phabricator project workboards:

  • #SRE
  • #ldap-access-requests
  • #patch-for-review (with #SRE tagged)
  • #sre-access-requests
  • #wikimedia-mailing-list (just list creation/maintenance columns)
  • #wmf-nda-requests (with #SRE tagged)

Escalate, update, and follow up as needed for any incoming tasks to ensure they are worked upon.

  • Assign a priority to tasks that come in after consulting with the relevant team. Better: ask them to set a priority.
  • Ask the requester for more data if needed in order to confirm the request, such as the date it must be completed by, additional details, etc.
  • Tag the task with all the relevant teams
  • If the request is relatively quick, just do it yourself
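
The workboard list above can also be read as a rule: some projects are always in scope, while others only count when the task is also tagged #SRE. The sketch below encodes that rule for illustration. The function name and the tag normalization are assumptions, it does not call the Phabricator API, and it does not model the column restriction on #wikimedia-mailing-list.

```python
# Workboards reviewed on clinic duty, taken from the list above.
# The value says whether the task must ALSO carry the #SRE tag.
REVIEWED_PROJECTS = {
    "sre": False,
    "ldap-access-requests": False,
    "patch-for-review": True,
    "sre-access-requests": False,
    "wikimedia-mailing-list": False,  # column restriction not modelled here
    "wmf-nda-requests": True,
}


def needs_clinic_review(tags):
    """Return True if a task with these project tags is in scope for review."""
    # Accept "#SRE", "SRE" or "sre" alike (normalization is an assumption).
    normalized = {tag.lstrip("#").lower() for tag in tags}
    has_sre = "sre" in normalized
    return any(
        project in normalized and (has_sre or not requires_sre)
        for project, requires_sre in REVIEWED_PROJECTS.items()
    )


if __name__ == "__main__":
    print(needs_clinic_review(["#patch-for-review"]))          # False
    print(needs_clinic_review(["#patch-for-review", "#SRE"]))  # True
    print(needs_clinic_review(["#ldap-access-requests"]))      # True
```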

Maintain the maintenance calendar

Wikimedia maintains a calendar for Vendor Maintenance events. This calendar must be kept up-to-date so that the team knows when datacenter maintenance occurs.

As a prerequisite, one should have access either through individual membership or inherited permissions from being a member of the sre group. If not, ask an existing member to add you; they should have the permissions to do so even if not owner/manager of the group.

Presently, processing these messages is highly manual:

  1. Go to the Google group ops-maintenance and filter by "resolved status: unresolved"
  2. Open the Ops vendor maintenance and contracts calendar in a second tab.
  3. Read each message and determine if the email needs action. If so, add an event to the Google calendar:
    1. Copy the important part of the subject line or the summary and use it as the event title.
    2. Switch the calendar from your personal calendar to the Vendor Maintenance calendar.
    3. Mark the event as Free instead of Busy.
    4. Copy the contents of the email into the description of the calendar event.
    5. Put the correct time from the message in, converting as necessary.
  4. Click "Mark as complete" on each mail that has been processed in one way or another.
  5. Repeat until there are no mails left that require action.

These steps may be automated with ops-maint-gcal.js; see the code and instructions in the repository for more details.
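
For the "converting as necessary" step, the conversion is easy to get wrong by hand. The following is a small sketch of one way to do it, assuming the vendor notice states a local time together with its UTC offset. The to_utc helper and the sample timestamp are made up for illustration and are not part of ops-maint-gcal.js.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_time, utc_offset_hours):
    """Convert 'YYYY-MM-DD HH:MM' at a fixed UTC offset into an aware UTC datetime."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    aware = datetime.strptime(local_time, "%Y-%m-%d %H:%M").replace(tzinfo=local_zone)
    return aware.astimezone(timezone.utc)


if __name__ == "__main__":
    # Hypothetical notice: maintenance starts at 22:00 local time, UTC-4.
    start = to_utc("2022-08-11 22:00", -4)
    print(start.isoformat())  # 2022-08-12T02:00:00+00:00
```

Note that the date can change during conversion, as in the example. If a notice names a time zone rather than an offset, Python's zoneinfo module can supply the zone instead of a fixed offset.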

Handle incoming IRC requests

If somebody asks SRE to do something via IRC, politely ask the requestor to turn their request into a Phabricator ticket and add the SRE tag to it.

If you suspect the issue could be related to a recent deployment or need further investigation by deployers or developers, on the Phabricator ticket, add the Wikimedia-production-error tag to it.

Phabricator Administration

Main article: Phabricator#Administrative Commands

If an SRE clinic duty person has to log in, please do so by accessing the Phabricator servers. These have role(phabricator) in site.pp, but are typically phab[12]001.

Once in the system, an admin account login can be generated by running sudo /srv/phab/phabricator/bin/auth recover admin. The system will output a full URL for a one-time login token as the Admin user. You can then navigate to the offending file or Herald rule and delete it via the web UI.

Mail aliases

Go to the puppet master (puppetmaster1001), cd to /srv/private/modules/privateexim/files/ in the private repo, usually edit the file wikimedia.org (as root) and sudo git commit. This will create a mail to SRE about the commit, with your username automatically prepended to the commit message.

You can then run puppet on mx1001 and mx2001 to confirm your changes have been applied.

There are 3 types of domains:

a) domains that have their own alias file (wikimedia.org, wikipedia.org and a few others), you will find these files in /srv/private/modules/privateexim/files, just edit them there, sudo git commit, and presto!!!, as with any other change in the private repo.

b) domains that just link to wikimediafoundation.org. These are just symlinks and puppet generates them. If you need to add a new one or change links, go to /srv/private/modules/privateexim/manifests/mail.pp. You will find it in class exim::aliases::private and should be self-explanatory.

c) domains that link to another domain. Currently this is just wikivoyage.de to .org; it works the same as in b) but has a separate definition in the puppet class.

It is nice to add the corresponding Phab ticket number in a comment near changed aliases. Experience shows that it can be quite handy to be able to quickly answer questions like when exactly something has been changed and who requested it. There is one file or symlink per domain name. 95% of the time the requests are just regarding the "wikimedia.org" file. In other cases make sure you check for possible symlinks and realize which domains you are actually changing when editing a specific file.
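
Because several domain names can resolve to one alias file, it helps to see which names share a file before editing it. The sketch below groups directory entries by the real file behind them. It is an illustration only: the function names are made up, and the directory it inspects is whatever path you pass in, so it assumes nothing about the layout of the private repo beyond "one file or symlink per domain name".

```python
import os
from collections import defaultdict


def domains_by_alias_file(directory):
    """Group the entries of an alias directory by the real file they resolve to."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # realpath follows symlinks, so linked domains share one key.
        groups[os.path.realpath(path)].append(name)
    return dict(groups)


def domains_affected_by(directory, filename):
    """List every domain name that changes when `filename` is edited."""
    target = os.path.realpath(os.path.join(directory, filename))
    return domains_by_alias_file(directory).get(target, [])


if __name__ == "__main__":
    # Inspect the current directory as a harmless demonstration.
    for real_file, names in domains_by_alias_file(".").items():
        if len(names) > 1:
            print(real_file, "is shared by", ", ".join(names))
```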

Mailman mailing lists

Public mailing lists should typically be requested through Phabricator tagged with "Wikimedia-Mailing-lists", and Phabricator-maintenance-bot will automatically add the SRE tag. Google mailing lists are managed by ITS. You know it's a mailman list if it's @lists.wikimedia.org. To check if an email address exists in Google you can do "exim4 -bt foo@wikimedia.org" on an MX server.

Create a list
Follow the normal procedure to create a Mailman mailing list.
Disable a list
When you get a request to disable a mailman list, you just have to run a shell script on the list server, see Mailman#Disable_or_re-enable_a_mailing_list. In addition it's nice if you log in once using the master password and remove the former admins' email addresses from the "list run by" field.
Add/remove owners
Log in as administrator on lists.wikimedia.org (the password is in pwstore). Then go to the mailing list administrator page (e.g. here is for math@), then go to the Users tab and adjust Owners accordingly.

Access requests

See SRE/Clinic Duty/Access requests for full instructions.

Removing access

This typically isn't part of Clinic Duty, but if you need it you can find the relevant steps at SRE_Offboarding#All_Users.

Server power cycling

SRE clinic duty paging for reboots is usually due to hardware failure, or immediate concerns of exploits. Anything outside those issues would be handled by normal operations workflow, and would not necessarily fall to the SRE clinic duty person.

Power cycling requires a passing familiarity with the different out-of-band management options we use (based on vendor). Hardware type can be determined by looking up the hardware in question in Netbox; then you can determine the instructions from the Platform-specific documentation.

Maps external usage requests

Main article: Maps/External usage

The main thing to check is that the domain belongs to a project being supported by one of the Wikimedia movement affiliates. Usually the requester will belong to an affiliate or there will be a wiki page explaining the project and which affiliate is backing it.

If that all checks out, it should be fine to add the domain to the VCL regex.