
Difference between revisions of "SRE/SRE Clinic Duty"

From Wikitech-static
imported>Kormat
(→‎Schedule: Upgrading name.)
imported>Neil P. Quinn-WMF
(Neil P. Quinn-WMF moved page SRE/SRE Clinic Duty to SRE/Clinic Duty: No need to duplicate "SRE")
 
(4 intermediate revisions by 4 users not shown)
{{See|For getting assistance from the SRE team, see [[SRE/SRE Team requests]].}}
#REDIRECT [[SRE/Clinic Duty]]
 
SRE Clinic Duty was established to ensure that tickets (and thus requests and projects) are triaged and processed in a timely fashion, and that SRE-supported projects and responsibilities receive feedback and regular updates.
 
This is a duty that is fulfilled by a member of the Wikimedia SRE team.
 
= Roster =
 
== Schedule ==
{| class="wikitable sortable"
!Week starting
!Clinician/''backup''
!Team
|-
|2021-01-04
|Giuseppe Lavagetto
|SRE-Service Operations
|-
|2021-01-11
|Arzhel Younsi
|SRE-Infrastructure Foundations
|-
|2021-01-18
|Jaime Crespo
|SRE-Data Persistence
|-
|2021-01-25
|Kunal Mehta
|SRE-Service Operations
|-
|2021-02-01
|Chris Danis
|SRE-Infrastructure Foundations
|-
|2021-02-08
|Valentín Gutierrez
|SRE-Traffic
|-
|2021-02-15
|Moritz Mühlenhoff
|SRE-Infrastructure Foundations
|-
|2021-02-22
|John Bond
|SRE-Infrastructure Foundations
|-
|2021-03-01
|Janis Meybohm
|SRE-Service Operations
|-
|2021-03-08
|Cas Rusnov
|SRE-Infrastructure Foundations
|-
|2021-03-15
|Riccardo Coccioli
|SRE-Infrastructure Foundations
|-
|2021-03-22
|Stevie Beth Mhaol
|SRE-Data Persistence
|-
|2021-03-29
|Effie Mouzeli
|SRE-Service Operations
|-
|2021-04-05
|Emanuele Rocca
|SRE-Traffic
|-
|2021-04-12
|Filippo Giunchedi
|SRE-Observability
|-
|2021-04-19
|Alexandros Kosiaris
|SRE-Service Operations
|-
|2021-04-26
|Rob Halsell
|SRE-Data Center Operations
|-
|2021-05-03
|Daniel Zahn
|SRE-Service Operations
|-
|2021-05-10
|Arzhel Younsi
|SRE-Infrastructure Foundations
|-
|2021-05-17
|Brandon Black
|SRE-Traffic
|-
|2021-05-24
|Manuel Aróstegui
|SRE-Data Persistence
|-
|2021-05-31
|Cole White
|SRE-Observability
|-
|2021-06-07
|Riccardo Coccioli
|SRE-Infrastructure Foundations
|-
|2021-06-14
|Sukhbir Singh
|SRE-Traffic
|-
|2021-06-21
|John Bond
|SRE-Infrastructure Foundations
|-
|2021-06-28
|Keith Herron
|SRE-Observability
|-
|2021-07-05
|''No clinic duty this week''
|
|-
|2021-07-12
|Valentín Gutierrez
|SRE-Traffic
|-
|2021-07-19
|Reuven Lazarus
|SRE-Service Operations
|-
|2021-07-26
|Kunal Mehta
|SRE-Service Operations
|-
|2021-08-02
|Moritz Mühlenhoff
|SRE-Infrastructure Foundations
|-
|2021-08-09
|Emanuele Rocca
|SRE-Traffic
|-
|2021-08-16
|Rob Halsell
|SRE-Data Center Operations
|-
|2021-08-23
|Jaime Crespo
|SRE-Data Persistence
|-
|2021-08-30
|Filippo Giunchedi
|SRE-Observability
|-
|2021-09-06
|Alexandros Kosiaris
|SRE-Service Operations
|-
|2021-09-13
|Cathal Mooney/Arzhel Younsi
|SRE-Infrastructure Foundations
|-
|2021-09-20
|Manuel Aróstegui
|SRE-Data Persistence
|-
|2021-09-27
|Giuseppe Lavagetto
|SRE-Service Operations
|-
|2021-10-04
|Stevie Shirley
|SRE-Data Persistence
|-
|2021-10-11
|Chris Danis
|SRE-Infrastructure Foundations
|-
|2021-10-18
|Daniel Zahn
|SRE-Service Operations
|-
|2021-10-25
|Sukhbir Singh
|SRE-Traffic
|-
|2021-11-01
|Effie Mouzeli
|SRE-Service Operations
|-
|2021-11-08
|Cole White
|SRE-Observability
|-
|2021-11-15
|''JW''/Janis Meybohm
|SRE-Service Operations
|-
|2021-11-22
|Marc Mandere/Brandon Black
|SRE-Traffic
|-
|2021-11-29
|Keith Herron
|SRE-Observability
|-
|2021-12-06
|Valentín Gutierrez
|SRE-Traffic
|-
|2021-12-13
|''MV''/Manuel Aróstegui
|SRE-Data Persistence
|-
|2021-12-20
|Reuven Lazarus
|SRE-Service Operations
|-
|2021-12-27
|''No clinic duty this week''
|
|-
|}
 
== Parameters ==
 
* The same person should not serve two weeks in a row.
* No team should supply the clinician two weeks in a row.
* The roster currently includes only members of the SRE team; this may expand in the future.
* Someone doing their first clinic duty is backed up by a more experienced clinician in a similar time zone.
* The roster excludes managers and directors.
* People serve clinic duty at roughly equal frequencies.
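
The first two rules above can be checked mechanically. The following is a minimal Python sketch, assuming the schedule table has been transcribed into a list of (week start, clinician, team) tuples. <code>ROSTER</code> and <code>roster_violations</code> are hypothetical names, not an existing tool. The sketch also checks that each week starts on a Monday and directly follows the previous one.

<syntaxhighlight lang="python">
from datetime import date, timedelta

# Hypothetical transcription of the first rows of the schedule table above:
# (week starting, clinician, team). Weeks without clinic duty would use None.
ROSTER = [
    (date(2021, 1, 4), "Giuseppe Lavagetto", "SRE-Service Operations"),
    (date(2021, 1, 11), "Arzhel Younsi", "SRE-Infrastructure Foundations"),
    (date(2021, 1, 18), "Jaime Crespo", "SRE-Data Persistence"),
    (date(2021, 1, 25), "Kunal Mehta", "SRE-Service Operations"),
]


def roster_violations(roster):
    """Return a list of human-readable problems found in the roster."""
    problems = []
    for week, _, _ in roster:
        if week.weekday() != 0:  # date.weekday(): Monday == 0
            problems.append("%s does not start on a Monday" % week)
    # Compare each week with the one immediately before it.
    for (prev_week, prev_person, prev_team), (week, person, team) in zip(roster, roster[1:]):
        if week - prev_week != timedelta(days=7):
            problems.append("%s does not directly follow %s" % (week, prev_week))
        if person is not None and person == prev_person:
            problems.append("%s: %s is on duty two weeks in a row" % (week, person))
        if team is not None and team == prev_team:
            problems.append("%s: %s is on duty two weeks in a row" % (week, team))
    return problems


if __name__ == "__main__":
    for problem in roster_violations(ROSTER):
        print(problem)
</syntaxhighlight>

The person check compares the whole cell text, so shared weeks such as "Cathal Mooney/Arzhel Younsi" would need to be split first. Run against the full table, the team rule would flag pairs such as the consecutive SRE-Infrastructure Foundations weeks starting 2021-02-15 and 2021-02-22.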
 
=Schedule=
 
*Shifts run Monday to Monday.
*During their shift, the SRE on clinic duty should remain available on IRC and email.

:*People will contact the SRE on clinic duty to follow up on existing tasks and to ask how to create new ones.

*This duty is fairly interrupt-driven and will disrupt your normal workflow during your week on duty.
*This duty shouldn't normally require any change to your working schedule: if you work business hours in CET, you don't need to shift your hours to cover another time zone while on clinic duty.

* As the person on clinic duty, you are welcome to join #wikimedia-clinic for assistance during your shift.
 
=Hand-off / Takeover=
 
*Ideally, every Phabricator task gets a reply or comment during review and triage, so no explicit hand-off is needed between weeks.

:*Update the 'SRE Clinic Duty:' section of the #wikimedia-operations IRC channel topic with the name of that week's clinician.
::*The IRC topic and this page are currently the public-facing ways to find out who is on duty.
 
=Exemptions=
This section would normally follow Responsibilities, but it is a much shorter list:
* The clinician should not triage, escalate, or work on tasks in the S4 #procurement projects.
** These involve a lot of communication outside Phabricator with vendors, engineers, and finance, and are therefore handled by Rob or Willy.
 
=Responsibilities=
 
*All incoming Clinic Duty tasks in Phabricator can be viewed on the [[phab:dashboard/view/45/|SRE Clinic Duty Dashboard]].

:*Most people use their own dashboard, which is fine when they are not on clinic duty. When you take clinic duty, you can install this dashboard as your home screen for that week and switch back to your own when finished.
:*Please refrain from editing the SRE Clinic Duty dashboard to reflect non-clinic work. Clinic duty is mostly about triaging and knocking down tasks rather than long-running personal work, but you still need to see your own tasks, so a panel for 'tasks assigned to myself' sits at the bottom.
 
==Review incoming tasks==
 
:*Review all incoming tasks on these workboards: #sre-access-requests, #ldap-access-requests and #wmf-nda-requests (tasks also tagged #SRE), #wikimedia-mailing-list (only the list creation/maintenance columns), #patch-for-review (tasks also tagged #SRE), and #SRE.
:*These are all included in the Workboard Links panel of the [[phab:dashboard/view/45/|SRE Clinic Duty Dashboard]].
:*Escalate, update, and follow up on incoming tasks as needed to ensure they are worked on.
::*Assign a priority to incoming tasks after consulting the relevant team; better still, ask the team to set the priority.
::*If needed, ask the requester for more information to confirm the request, such as the date it must be completed by or additional details.
::*Tag the task with all relevant teams.
::*If the request is relatively quick, just do it yourself.
 
==Maintain the 'ops-maintenance' mails and calendar==
 
:* Go to the Google group 'ops-maintenance' [1] and [https://groups.google.com/u/1/a/wikimedia.org/g/ops-maintenance/search?q=is%3Aunresolved filter by "resolved status: unresolved"].
:**If you've opted out of the new Groups UI, go to the [https://groups.google.com/a/wikimedia.org/forum/#!forum/ops-maintenance Google group] instead, open "Filters", select the radio button next to "All unresolved" and click "Apply filter". ([//upload.wikimedia.org/wikipedia/labs/d/da/Maint-announce-filter-all-unresolved.png screenshot])
:*Your task is to process every message shown until this view is empty. [2]
:*Check whether there is a yellow banner saying '# messages pending'. These are external messages held because the sender is not a member of the list. Click the banner and review each message, deciding whether it is spam (delete it) or legitimate (post it to the list). For legitimate messages, choose either "post" or "post and always accept messages from this sender" on a case-by-case basis.
:*Open the [https://office.wikimedia.org/wiki/Office_IT/Calendars#Human_calendars gcal shared with all WMF named 'Ops vendor maintenance & contracts'] in a second tab. [3][4]
:*Read each message and decide whether it needs action. Adding it to the Google calendar is the only possible action; otherwise, no action is needed [5].
:*If appropriate, add an entry to the calendar [6] and link from the calendar entry back to the individual post in the group; you can get the post link from the context menu. ([//upload.wikimedia.org/wikipedia/labs/3/33/Maint-announce-get-post-link.png screenshot]) You may also add the tag 'added to calendar', but this is not required.
:*Click "Mark as complete" on each message once it has been processed, whatever the outcome.
:*Repeat until the "unresolved" filter shows no messages. You are done. [8]
 
[1]: You should have access either through individual membership or through inherited permissions as a member of the "[https://groups.google.com/a/wikimedia.org/g/sre sre]" group. If not, ask an existing member to add you; they should have permission to do so even if they are not an owner or manager of the group. (Only add other SRE folks.) Membership gives you permission to act on the group, but it does ''not'' necessarily mean you receive its emails in your personal inbox. Whether you receive them there or just use the web interface while on duty is entirely up to you.
 
[2]: Sometimes the view does not refresh and marked posts do not disappear immediately. If this happens, remove the filter and apply it again.
 
[3]: If you are not able to create events, ask an SRE to add you (calendar settings => share this calendar).
 
[4]: You will probably want to add the GMT (no daylight saving) time zone to your calendar (calendar settings => general => add a timezone). This lets you specify the correct time zone when creating events for planned maintenance, which is usually announced with UTC dates.
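
Because vendors usually announce maintenance in UTC, it can help to double-check what an announced window means in your own time zone, especially when it crosses a date boundary. The following is a minimal Python sketch using the standard-library <code>zoneinfo</code> module; it assumes Python 3.9 or later and an installed system time zone database. The maintenance window and zone names are only examples.

<syntaxhighlight lang="python">
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def utc_window_to_local(start_utc, end_utc, tz_name):
    """Convert an announced UTC maintenance window to the given IANA time zone."""
    tz = ZoneInfo(tz_name)
    return start_utc.astimezone(tz), end_utc.astimezone(tz)


# Example window: 02:00-06:00 UTC on 27 August 2021 (illustrative values).
start = datetime(2021, 8, 27, 2, 0, tzinfo=timezone.utc)
end = datetime(2021, 8, 27, 6, 0, tzinfo=timezone.utc)

for zone in ("Europe/Berlin", "America/Los_Angeles"):
    local_start, local_end = utc_window_to_local(start, end, zone)
    print(zone, local_start.isoformat(), local_end.isoformat())
</syntaxhighlight>

In this example the window is 04:00 to 08:00 in Europe/Berlin but 19:00 to 23:00 on 26 August in America/Los_Angeles, which is easy to get wrong when choosing the day for an "all day" entry.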
 
[5]: No action is needed if it's a duplicate/reminder for an event that has already been added to calendar, if it's just an "FYI" kind of mail like "reason for outage", simple spam or anything else that doesn't warrant a calendar entry.
 
[6]: Copy the important part of the subject line or the summary and use it as the event title. If the mail contains important information such as a circuit ID or details of what is affected, paste it into the body of the calendar event. "All day" accuracy is usually good enough, rather than taking the time to enter exact start and end times and convert time zones, because the calendar entry links back to the full post with all the details. You no longer need to change subjects or date formats, since posts are sorted by date anyway. You also no longer need to reply with an "added to calendar" message, and there are no other status changes: a message either needs action or it doesn't (done).
 
[7]: It doesn't matter whether you added a message to the calendar or determined it can be skipped; in either case there is ''now'' "no action needed" (once you're done). We do it this way rather than using the "completed" status because Google Groups forces you to actually ''reply'' to a message before it can be marked completed. We don't need that; it would only add unnecessary clicks and mail. Since "no action needed" and "completed" are both kinds of "resolution status" and the filter is based on "not resolved", the end result is the same, and using that button is much simpler.
 
[8]: WARNING: Jaime noticed that marking "no action needed" in the Google Group may also mark later follow-ups on the same thread. Follow-ups are usually reminders, but sometimes they contain meaningful updates or cancellations, so it is recommended to read all new emails received during your clinic duty window to avoid missing them.
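
The decision rules in the steps and notes above can be summarised as a small loop. This is a hypothetical Python model for illustration only. <code>Post</code>, <code>triage</code> and the <code>kind</code> labels are invented here, classifying a message is a human judgement, and nothing in the sketch talks to Google Groups or Google Calendar.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Post:
    """Hypothetical model of one ops-maintenance post."""
    subject: str
    url: str
    # Your reading of the mail: "maintenance", "update", "cancellation",
    # "reminder", "fyi" or "spam" (labels invented for this sketch).
    kind: str
    resolved: bool = False


def triage(posts: List[Post], calendar: List[dict], on_calendar: Set[str]) -> List[Post]:
    """Process unresolved posts; return those needing a manual calendar edit."""
    needs_manual_edit = []
    for post in posts:
        if post.resolved:
            continue
        if post.kind == "maintenance" and post.subject not in on_calendar:
            # New maintenance: add an entry that links back to the post.
            calendar.append({"title": post.subject, "link": post.url})
            on_calendar.add(post.subject)
        elif post.kind in ("update", "cancellation"):
            # Follow-ups can change an existing entry (see note [8]).
            needs_manual_edit.append(post)
        # Duplicates, reminders, FYIs and spam need no action (see note [5]).
        post.resolved = True
    return needs_manual_edit
</syntaxhighlight>

Every processed post ends up resolved, which mirrors the goal of emptying the "unresolved" view; only updates and cancellations are singled out for a second look.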
 
==Be a first contact==
 
*Follow up with ticket owners and requesters on old tickets to resolve, reassign, or escalate them as needed.

*Be a first point of contact, including on IRC (time zone and availability permitting).
*Triage any mailing list requests for operations lists.
 
==Read mail to root@==
 
*Triage emails sent to root@ (if you don't receive them, you need to add your alias in the private repo). If you see a recurrent issue, please open a sub-task to [[phab:T132324|T132324]] and try to notify whoever you think can contribute to the task. Review the outstanding sub-tasks and follow up as needed.
 
==misc==
 
*Try to improve [[#Manual|the manual below]].
 
=Tips=
 
*There is a clinic duty [https://phabricator.wikimedia.org/dashboard/view/45/ dashboard] for Phabricator.

*You can search for "to:alerts@wikimedia.org" in Gmail to see everything that has paged people, regardless of time zones and individual settings. This is used to fill in the "pages for awareness" section of the SRE meeting document.
 
=Manual=
This is a manual for the current "SRE on duty" in charge of triaging the Phabricator #SRE project.
 
==How to handle IRC requests==
If somebody asks you to do something via IRC and the request is reasonable, politely ask them to turn it into a [[Phabricator]] ticket and add the [https://phabricator.wikimedia.org/project/view/1025/ SRE] tag to it.

If you suspect the issue is related to a recent deployment or needs further investigation by deployers or developers, add the [https://phabricator.wikimedia.org/project/view/1055/ Wikimedia-production-error] tag to the [[Phabricator]] ticket.
 
==Common, small "#SRE" tickets==
 
===Phabricator Administration===
 
Overall Phabricator administration is handled by Release Engineering. The person on SRE clinic duty would typically only get involved if a file needs immediate deletion or a Herald rule is causing chaos.

If the person on clinic duty has to log in as the admin user, do so from the Phabricator servers. These have role(phabricator) in site.pp and are typically phab[12]001.

Once on the server, generate a one-time login for the admin account by running <code>sudo /srv/phab/phabricator/bin/auth recover admin</code>. The command outputs a full URL containing a one-time login token for the admin user. You can then navigate to the offending file or Herald rule and delete it via the web UI.
 
See [[Phabricator#Administrative Commands]] for more information.
 
===Mail aliases===
 
'''Note''': SRE handles only role/group mail aliases; individual mail aliases are handled by ITS, as outlined at [https://office.wikimedia.org/wiki/ITS/GroupsAliasMailman ITS/GroupsAliasMailman].

'''Note 2''': many aliases have recently been moved from SRE to ITS, and the goal is definitely ''not'' to add new ones on the SRE side unless they are strictly SRE-internal (for example, monitoring). You can help by moving even more of them over to ITS; see [https://phabricator.wikimedia.org/T122144 T122144].

On the Puppet master ('''puppetmaster1001'''), cd to '''/srv/private/modules/privateexim/files/''' in the private repo, edit the relevant file (usually wikimedia.org) as root, and run '''sudo git commit'''. This sends a mail to SRE about the commit, with your username automatically prepended to the commit message.
 
You can then run puppet on '''mx1001 and mx2001''' to confirm your changes have been applied.
 
There are 3 types of domains:
 
a) Domains that have their own alias file (wikimedia.org, wikipedia.org and a few others). These files are in /srv/private/modules/privateexim/files; edit them there and run sudo git commit, as with any other change in the private repo.

b) Domains that simply link to wikimediafoundation.org. These are symlinks generated by Puppet. To add a new one or change a link, edit class exim::aliases::private in /srv/private/modules/privateexim/manifests/mail.pp; it should be self-explanatory.

c) Domains that link to another domain (currently only wikivoyage.de to wikivoyage.org). These work as in b), but with a separate definition in the Puppet class.

It is good practice to add the corresponding Phabricator ticket number in a comment next to changed aliases. Experience shows that it is handy to be able to quickly answer questions such as when exactly something was changed and who requested it. There is one file or symlink per domain name. 95% of the time, requests concern only the "wikimedia.org" file. In other cases, check for symlinks and make sure you know which domains you are actually changing when editing a specific file.
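
When a request is not about the "wikimedia.org" file, a quick way to see which domains an edit actually affects is to resolve the symlinks in the directory where the per-domain alias files are installed. The following is a minimal Python sketch. The directory is passed in as an argument because its location is an assumption here (check the Puppet class mentioned above for where the files and symlinks actually live), and <code>domains_using</code> is a hypothetical helper.

<syntaxhighlight lang="python">
import os
import sys


def domains_using(alias_dir, filename):
    """Return every per-domain entry in alias_dir that is, or resolves to, filename."""
    target = os.path.realpath(os.path.join(alias_dir, filename))
    return sorted(
        name
        for name in os.listdir(alias_dir)
        # realpath follows chains of symlinks, e.g. c) domains linking via b).
        if os.path.realpath(os.path.join(alias_dir, name)) == target
    )


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: domains_using.py ALIAS_DIR FILENAME
    for domain in domains_using(sys.argv[1], sys.argv[2]):
        print(domain)
</syntaxhighlight>

Editing a file whose result lists more than one domain changes the aliases of all of them.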
 
===Mailman mailing lists===
Public mailing lists should typically be requested through a [[Phabricator]] task tagged with "[https://phabricator.wikimedia.org/project/view/190/ Wikimedia-Mailing-lists]"; Phabricator-maintenance-bot will automatically add the SRE tag. Google mailing lists are managed by ITS. A list is a Mailman list if its address ends in @lists.wikimedia.org. To check whether an email address exists in Google, run <code>exim4 -bt foo@wikimedia.org</code> on an MX server.
====create a list====
Follow [[Lists.wikimedia.org#Create_a_mailing_list|the normal procedure]] to create a Mailman mailing list.
 
====password reset====
Another common task is handling password reset requests; see [[Mailman#Reset_the_admin_password_of_a_list]].
 
====disable a list====
When you get a request to disable a Mailman list, you just have to run a shell script on the list server; see [[Mailman#Disable_or_re-enable_a_mailing_list]]. It is also good practice to log in once with the master password and remove the former admins' email addresses from the "list run by" field.
 
====add/remove owners====
 
From the list server (check Puppet to see which host runs lists), you can change owners with the <tt>withlist</tt> utility. <tt>m.owner</tt> is a list of email addresses; for example, for {{Bug|T220641}}:
 
<pre>
root@fermium:/var/lib/mailman/bin# ./list_admins wikimania-program
List: wikimania-program, Owners: itait@wikimedia.org
root@fermium:/var/lib/mailman/bin# ./withlist -l wikimania-program
Loading list wikimania-program (locked)
The variable `m' is the wikimania-program MailList instance
>>> m.owner
['itait@wikimedia.org']
>>> m.owner = ['icueva@wikimedia.org']
>>> m.owner
['icueva@wikimedia.org']
>>> m.Save()
>>>
Unlocking (but not saving) list: wikimania-program
Finalizing
</pre>
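
If you change owners often, the interactive session above can be wrapped in a small script for withlist's <code>-r</code> option, which imports a module, calls a function with the opened MailList object as its first argument, and passes any extra command-line arguments after it. The following is a minimal sketch. It assumes Mailman 2.1's <code>withlist</code> (as in the session above) and that the file is saved somewhere on Mailman's Python path, such as the <code>bin</code> directory you run <code>withlist</code> from. <code>set_owners</code> is a hypothetical name, and the code is written to work under both Python 2 and 3, since Mailman 2 runs on Python 2.

<syntaxhighlight lang="python">
"""set_owners.py: replace the owners of a Mailman 2 list via withlist -r.

Hypothetical usage, from the Mailman bin directory:
    ./withlist -l -r set_owners wikimania-program icueva@wikimedia.org
"""


def set_owners(mlist, *owners):
    """Replace mlist.owner with the given addresses and save the list.

    withlist -l locks the list, but on exit it only unlocks it without
    saving, so Save() must be called explicitly (as in the session above).
    """
    if not owners:
        raise ValueError("refusing to leave the list without owners")
    print("Old owners: %s" % mlist.owner)
    mlist.owner = list(owners)
    mlist.Save()
    print("New owners: %s" % mlist.owner)
</syntaxhighlight>

Refusing an empty owner list guards against accidentally running the command without any addresses.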
 
===LDAP group changes===
Access to a range of mostly web-based services is granted via the "wmf" and "nda" groups. The specific permissions are listed at [https://wikitech.wikimedia.org/wiki/LDAP_Groups LDAP Groups].
The change should be tracked in a ticket.
 
*WMF staff can be added to the "wmf" group on request (not everyone needs that kind of access).
** To confirm that someone is a staff member, use [[Ldapsearch]] to check that their account exists on the OIT LDAP mirror (ldap-corp1001.wikimedia.org).
** Alternatively, you can search for them on Namely.
*Contractors do not appear on Namely; in that case, ask the person's manager or point of contact for approval on the Phabricator task.
*Volunteers and researchers can be added to the "nda" group (this requires a valid NDA; WMF staff are covered by the NDA in their work contract).
*All other groups need to be approved by the user's manager.
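
The rules above can be read as a small decision table. The following is a minimal Python sketch of that reading, for illustration only. The category labels are hypothetical, treating contractors like volunteers for the "nda" group is an assumption, and anything unusual should still follow [[LDAP Groups]] and [[Requesting shell access]].

<syntaxhighlight lang="python">
def ldap_group_checks(group, kind, has_nda_on_file=False):
    """Return what to confirm on the task before adding someone to an LDAP group.

    kind is one of the hypothetical labels "staff", "contractor",
    "volunteer" or "researcher".
    """
    staff_check = "confirm staff status (ldap-corp1001 via ldapsearch, or Namely)"
    if group == "wmf":
        if kind == "staff":
            return [staff_check]
        if kind == "contractor":
            return ["approval from the person's manager or point of contact"]
        raise ValueError("'wmf' is for WMF staff; volunteers and researchers use 'nda'")
    if group == "nda":
        if kind == "staff":
            # Staff are covered by the NDA in their work contract.
            return [staff_check]
        if not has_nda_on_file:
            # Assumption: contractors are treated like volunteers here.
            return ["valid NDA on file with Legal"]
        return []
    # Every other group needs sign-off from the user's manager.
    return ["approval from the user's manager"]
</syntaxhighlight>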
 
====Create/Get LDAP account====
In order to add or update a user's LDAP permissions, they first need an LDAP account. This can be created by either:
 
*having the user create a [https://wikitech.wikimedia.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page Wikitech account]
*if the user already has a Meta account, using it to create a [https://toolsadmin.wikimedia.org/profile/settings/accounts/ Toolforge account]
** Note that for a Toolforge-only account, Wikitech has no knowledge of the user account, even though it exists in LDAP and is valid for logging into other services. This is [[phab:T250189#6086726|expected but not widely understood]].

In either case you will need to know the username (for Wikitech) or shell account name (for Toolforge) that was used. You can also search LDAP to find it: <code>mwmaint1002$ ldapsearch -x mail=user@example.org</code>
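
If you search by a value containing LDAP filter metacharacters, for example a Wikitech username with a "(WMF)" suffix, the value must be escaped as described in RFC 4515, or the filter may be rejected or misparsed. The following is a minimal Python sketch that builds such a filter. <code>ldap_filter</code> is a hypothetical helper, and which attribute holds the Wikitech username in our LDAP (<code>cn</code> is assumed below) should be double-checked.

<syntaxhighlight lang="python">
def escape_filter_value(value):
    """Escape an assertion value for an LDAP search filter (RFC 4515, section 3)."""
    special = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}
    return "".join(special.get(ch, ch) for ch in value)


def ldap_filter(attribute, value):
    """Build a simple equality filter such as (mail=user@example.org)."""
    return "(%s=%s)" % (attribute, escape_filter_value(value))


print(ldap_filter("mail", "user@example.org"))
print(ldap_filter("cn", "Example (WMF)"))
</syntaxhighlight>

The second call produces <code>(cn=Example \28WMF\29)</code>, which can be passed in single quotes as the filter argument, e.g. <code>ldapsearch -x '(cn=Example \28WMF\29)'</code>.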
 
====Update data.yaml====
Check whether there's an existing entry in {{Gitweb|project=operations/puppet|file=modules/admin/data/data.yaml}}:
 
*If the user already has shell access, no further change is needed; proceed with the LDAP change below.
*If not, add the user to the ''ldap_only_users'' table at the end of the file, using their would-be shell username (LDAP <tt>uid=</tt>) as the key within ''ldap_only_users''. (This is just added for tracking/verification purposes. As you'll be making the LDAP changes yourself, no puppet run is required after this.)
**Add the ''realname'' of the user (most Cloud VPS accounts don't have a real name set)
**Add the ''email'' address of the user:
***If the user is WMF staff, use the email address of their Google account (you can double-check the account name in the Gmail interface). Some users have aliases for their nickname; don't use these, use the official Google account (this allows cross-checking data against corp LDAP).
***If the user is not staff, ask for a contact email address (to have a reliable contact, e.g. in case of an account compromise).
**If the user has time-limited access (e.g. interns, researchers with time-limited MOUs, or short-term contractors), add the estimated account end date as <code>expiry_date</code> (format YYYY-MM-DD) and a staff contact as <code>expiry_contact</code>.
 
The entry should look something like the following:
<syntaxhighlight lang=yaml>
exampleuser:
  ensure: present
  realname: Example User
  email: exampleuser@example.org
  expiry_date: 2038-01-19
  expiry_contact: examplestaff@wikimedia.org
</syntaxhighlight>
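
Before sending the patch, you can sanity-check a new entry against the points above. The following is a minimal Python sketch, assuming the entry has already been loaded into a dict (for example with PyYAML's <code>yaml.safe_load</code>, which turns an unquoted <code>2038-01-19</code> into a <code>datetime.date</code>). <code>check_ldap_only_user</code> is a hypothetical helper; the repository may have its own checks.

<syntaxhighlight lang="python">
import re
from datetime import date, datetime

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_ldap_only_user(username, entry):
    """Return a list of problems with one ldap_only_users entry (empty if none)."""
    problems = []
    for key in ("realname", "email"):
        if not entry.get(key):
            problems.append("%s: missing %s" % (username, key))
    expiry = entry.get("expiry_date")
    if expiry is not None:
        if isinstance(expiry, str):
            # Quoted dates stay strings; insist on the YYYY-MM-DD format.
            if not DATE_RE.fullmatch(expiry):
                problems.append("%s: expiry_date must be YYYY-MM-DD" % username)
            else:
                try:
                    datetime.strptime(expiry, "%Y-%m-%d")
                except ValueError:
                    problems.append("%s: expiry_date is not a valid date" % username)
        elif not isinstance(expiry, date):
            problems.append("%s: expiry_date must be a date" % username)
        if not entry.get("expiry_contact"):
            problems.append("%s: expiry_date set without expiry_contact" % username)
    return problems
</syntaxhighlight>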
 
====Modify LDAP groups====
After adding the user to data.yaml, make the LDAP change from one of the MediaWiki maintenance hosts such as mwmaint1002 (this step is planned to be automated later):
 
*Check if they are a member of the group from the Cloud VPS LDAP server: <code>ldapsearch -x cn=grpname</code>
*Add them if they are not there: <code>modify-ldap-group grpname</code> and add their entry in the editor window that pops up
*To remove someone from an ldap group you can <code>modify-ldap-group grpname</code> and delete their entry in the editor window that pops up
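
For groups with many members, scanning the <code>ldapsearch</code> output by eye is error-prone. The following is a minimal Python sketch that extracts member user IDs from LDIF text. It assumes that group entries list members as <code>member: uid=NAME,ou=people,...</code> lines (verify this against real output), and it unfolds LDIF continuation lines (RFC 2849), which ldapsearch may produce for long values. <code>group_member_uids</code> is a hypothetical helper.

<syntaxhighlight lang="python">
def unfold_ldif(text):
    """Join LDIF continuation lines (starting with one space) onto the previous line."""
    lines = []
    for raw in text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def group_member_uids(ldif_text, attribute="member"):
    """Return the set of uid values referenced by the group's member lines.

    Base64-encoded values ("member:: ...") are not handled in this sketch.
    """
    prefix = attribute + ": "
    uids = set()
    for line in unfold_ldif(ldif_text):
        if line.startswith(prefix):
            first_rdn = line[len(prefix):].split(",", 1)[0]
            if first_rdn.startswith("uid="):
                uids.add(first_rdn[len("uid="):])
    return uids
</syntaxhighlight>

Save the output of <code>ldapsearch -x cn=grpname</code> and test whether the user's shell name is in the returned set before and after running <code>modify-ldap-group</code>.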
 
''TIP: If a user has to be removed from special LDAP access, in most cases (e.g. contract termination) you may also want to notify @aklapper on the same ticket to remove or check their Phabricator access.''
 
For further instructions see [[Help:Access]], [[LDAP]] and [[LDAP Groups]].
 
====wmde access====
 
Anyone at Wikimedia Deutschland who wants to get added to the "wmde" LDAP group needs to sign an NDA with the Legal department of the WMF. Simply add [https://phabricator.wikimedia.org/p/KFrancis @KFrancis] to the Phab task and she'll deal with it.
 
In addition, access to "wmde" needs to be approved by an engineering manager from Wikimedia Deutschland. You can add any of the following to the Phab task:
* [https://phabricator.wikimedia.org/p/conny-kawohl_WMDE @conny-kawohl_WMDE]
* [https://phabricator.wikimedia.org/p/WMDE-leszek @WMDE-leszek]
* [https://phabricator.wikimedia.org/p/darthmon_wmde @darthmon_wmde]
* [https://phabricator.wikimedia.org/p/Tobi_WMDE_SW @Tobi_WMDE_SW]
* [https://phabricator.wikimedia.org/p/Lea_WMDE @Lea_WMDE]
* [https://phabricator.wikimedia.org/p/karapayneWMDE @karapayneWMDE]
 
===Access requests===
Access types and the reasons for requesting them are documented on [[Requesting shell access]]. Read and understand that page fully before processing any access requests, as this brief summary may not cover every required point.

If a request asks for things like new shell accounts, access to additional servers, log files or personal data, admin roles in systems like Mailman or Bugzilla, data center access, or opening a firewall rule, then it is an access request and should be moved into the [https://phabricator.wikimedia.org/project/view/956/ SRE-Access-Requests] project.
Once the initial request is made, a number of follow-up steps must be confirmed; all of them are included in the [[/access request checklist]]:
 
*User's direct supervisor has approved of access request via comment on phabricator task.
*Approval from project lead where user's access will be granted via comment on phabricator task.
*[https://phabricator.wikimedia.org/legalpad/signatures/ Confirmation] that the user has read, understood, and signed the [https://phabricator.wikimedia.org/L3 Acknowledgement of Wikimedia Server Access Responsibilities] document.
*ALL ACCESS REQUESTS REQUIRE AN NDA.
** Anyone who's Wikimedia staff has signed an NDA as part of their work contract. To validate that someone is staff, you can either (1) check ldap-corp1001 via an LDAP search for their GCorp username, or (2) have the person's manager confirm (they need to sign off anyway).
** Volunteers and researchers need to sign an NDA with the Legal department. You can check existing NDAs on file at https://docs.google.com/spreadsheets/d/1xQNx5s2yErvayCMzvk9VkIA2ZihFXSBEhT5Z5ziCsi4
** If no NDA is on file, add @[[phab:p/KFrancis|KFrancis]] to the Phabricator task to prepare one (she'll confirm on the task when that's completed).
*The SSH public key has to be submitted by the user via a Gerrit patchset, or by some other confirmed (non-email) method (suggestion: their wiki user page).
**To verify that the SSH key is not also used in WMCS, run the cross-validate-accounts command from the current MediaWiki maintenance host (<code>mwmaint1002</code> as of March 2021), for example:<syntaxhighlight lang="shell">
cross-validate-accounts --username USRNAME --uid 00000 --email address@example.com --real-name "Real User" --ssh-key "ssh-ed25519 AAAAC...." --kerberos
</syntaxhighlight>
*Any requests to add sudo rights to a ''group'' should be reviewed at the Infrastructure Foundations SRE meeting. Tag the task with [[phab:tag/Infrastructure-Foundations|#Infrastructure-Foundations]] and assign it to [[phab:p/joanna_borun|@joanna_borun]] to add it to the next meeting's agenda; don't grant access until it's discussed at the meeting. Skip this step for requests to add a ''user'' to an existing group (the most common kind of production access request).
*Keep the Phabricator task updated, as the requester receives notifications of updates.
*Please raise any security concerns on the task as comments.
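
The checklist above can be tracked as a simple record per task. The following is a minimal Python sketch, for illustration only. <code>AccessRequest</code> and <code>outstanding_items</code> are hypothetical names, each field is something you confirm by hand on the Phabricator task, and [[Requesting shell access]] remains the authoritative list.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import List


@dataclass
class AccessRequest:
    """One field per item of the checklist above; all confirmed manually."""
    supervisor_approved: bool = False
    project_lead_approved: bool = False
    signed_l3: bool = False  # Acknowledgement of Wikimedia Server Access Responsibilities
    nda_confirmed: bool = False  # staff contract, NDA on file, or Legal
    needs_ssh_key: bool = True
    ssh_key_via_trusted_channel: bool = False  # Gerrit patchset or wiki page, not email
    adds_sudo_to_group: bool = False
    discussed_at_if_meeting: bool = False


def outstanding_items(req: AccessRequest) -> List[str]:
    """Return the checklist items that still block granting access."""
    missing = []
    if not req.supervisor_approved:
        missing.append("direct supervisor approval on the task")
    if not req.project_lead_approved:
        missing.append("project lead approval on the task")
    if not req.signed_l3:
        missing.append("signed L3 acknowledgement")
    if not req.nda_confirmed:
        missing.append("NDA confirmed")
    if req.needs_ssh_key and not req.ssh_key_via_trusted_channel:
        missing.append("SSH key from a confirmed, non-email channel")
    if req.adds_sudo_to_group and not req.discussed_at_if_meeting:
        missing.append("discussion at the Infrastructure Foundations SRE meeting")
    return missing
</syntaxhighlight>

Access should only be granted once the returned list is empty.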
 
====Analytics Groups====
 
*There are multiple possible groups, detailed at [[Analytics/Data_access#Access_Groups]].
**The clinic duty person can often link the requester to this page and ask them to specify which of the groups they need.
**The clinic duty person should also point the requester to [[Analytics/Data access#User responsibilities]].
* Make sure to seek sign-off from Analytics folks on access tasks.
 
====Deployment Groups====
 
*Requires shell access, and approval from Release Engineering is needed to be added to the deployment group.
**The user should also be added to the Gerrit group [https://gerrit.wikimedia.org/r/#/admin/groups/21,members wmf-deployment].
 
====Creating new shell users====
Please see instructions in the puppet admin module's [https://github.com/wikimedia/puppet/blob/production/modules/admin/README#L46 README].
 
Some notable changes since February 2017:
 
*Add the ''realname'' of the user (most Cloud VPS accounts don't have a real name set)
*Add the ''email'' address of the users:
**If the user is WMF staff, use the email address of their Google account (usually the first letter of the first name followed by the surname; you can double-check the account name in the Gmail interface). Some users have aliases for their nickname; don't use these, use the official Google account (this allows cross-checking data against corp LDAP).
**If the user is a volunteer, researcher, or contractor without a wikimedia.org account, ask for a contact email address (to have a reliable contact, e.g. in case of an account compromise).
*If the user has time-limited access (e.g. interns, researchers with time-limited MOUs, or short-term contractors), add the estimated account end date as ''expiry_date'' (format YYYY-MM-DD) and a staff contact as ''expiry_contact''.
 
====Renaming shell users====
 
Sometimes we have to rename a shell user.  This is typically when their shell name doesn't match their login name, and they have issues logging into items requiring LDAP credentials.
 
Renaming a user requires several steps to happen in a very specific order. Since many users keep data in their home directories, backups can sometimes be made, but not always. (Private data that isn't allowed to be copied off the cluster should not be backed up to laptops.) The existing username has to be removed from the hosts, since the new username will reuse the old username's UID.

*Patchset is prepared, but not merged.
*Using Cumin for the batch commands, halt Puppet on all hosts that have the existing (to-be-replaced) username.
*Affected hosts should have the user (to be replaced) deleted.  DO NOT DELETE THE USER'S HOME DIRECTORY.
*Merge patchset with username change (UID remains the same).
*Run puppet on affected hosts, and they will create the new user (using the same UID.)
*Batch move the contents of the old user home into the new user home.
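
The last step can be done with <code>mv</code>, but a shell glob easily misses hidden files, and a name clash with a file already present in the new home would overwrite it. The following is a minimal Python sketch of that step for a single host, to be run as root once Puppet has created the new user. The paths are placeholders, <code>move_home_contents</code> is a hypothetical helper, and running it across hosts (for example via Cumin) is left out. Because the new user reuses the old UID, ownership of the moved files stays correct.

<syntaxhighlight lang="python">
import os
import shutil


def move_home_contents(old_home, new_home):
    """Move every entry (including dotfiles) from old_home into new_home.

    Refuses to do anything if a name already exists in new_home, so that
    nothing is overwritten; resolve such clashes by hand.
    """
    names = sorted(os.listdir(old_home))  # listdir includes hidden files
    clashes = [n for n in names if os.path.lexists(os.path.join(new_home, n))]
    if clashes:
        raise FileExistsError("already in %s: %s" % (new_home, ", ".join(clashes)))
    for name in names:
        shutil.move(os.path.join(old_home, name), os.path.join(new_home, name))
    return names


# Example (placeholder paths): move_home_contents("/home/olduser", "/home/newuser")
</syntaxhighlight>

Checking for clashes before moving anything keeps the old home intact if the operation has to be aborted.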
 
====IRC channel access====
To manage a channel's access list, open a query with ChanServ:

  /query chanserv
  help access
  access #channel list
  access #channel add *!*@wikimedia/cloak

  14:07 -ChanServ(ChanServ@services.)- Flags +Aiortv were set on ...

For anyone who wants to be a channel operator in #wikimedia-operations, first check that they have nick protection enabled:
 
  /msg nickserv info <nick>
  ...
  <nick> has enabled nick protection
 
and then
 
  /msg chanserv flags #wikimedia-operations <nick> +Aiotv
 
 
==== Check on a Phabricator user ====
 
As part of an access request, you might want to check first whether a Phabricator user is actually who they say they are.
There is a shell command for this on the Phabricator server; see [[Phabricator#Check_on_a_Phabricator_user]].
 
===Removing access===
 
This typically isn't part of Clinic Duty, but if you need it you can find the relevant steps at [[SRE_Offboarding#All_Users]].
 
===Google search console access===
 
Documented at [[Google Search Console access]]. Google Search Console access is extremely limited compared to access to other services, due to limitations of the service itself.
 
Revocations are done manually: at the moment, an entry is added to the main-announcement calendar and requires [[Google Search Console access#Removing_Users_from_Domains|manual action]].
 
===Powercycling / reboots===
 
RT duty paging for reboots is usually due to hardware failure or immediate concerns about exploits. Anything else is handled through the normal operations workflow and would not necessarily fall to the RT triage duty person.

Powercycling requires a passing familiarity with the different out-of-band management options we use (which depend on the vendor). You can determine the hardware type by looking up the host in [https://racktables.wikimedia.org Racktables], and then find the instructions in [[Platform-specific_documentation]].
 
[[Category:How-To]]

Latest revision as of 18:23, 27 August 2021
