Phabricator

Phabricator is an open-source software development platform. In Wikimedia, Phabricator is used for project management, software bug reporting, and feature requests. See mw:Phabricator for more details on end user usage.

phabricator.wikimedia.org runs on phab1001 in eqiad.

The Phabricator install relies on db1072 (m3 eqiad master). Other DB hosts (backup slaves) are db1117 (eqiad), db2042 (codfw) and db2078 (codfw). Database access is routed through dbproxy1003, a.k.a. m3-master.

A disaster recovery plan for phabricator.wikimedia.org is being drafted at Phabricator/Disaster_Recovery.

Metrics are on https://grafana.wikimedia.org/d/000000587/phabricator

Operations Projects Workflows

The operations-specific projects on Phabricator[1] include:

Project                          Description
Operations                       General Operations Team Project
Labs                             Labs Team Project
DC-Ops                           Datacenter Team Project
domains                          Domain support/changing/issues
hardware requests                Server Allocation Requests
procurement                      Vendor & Procurement Tasks. Direct ordering of SSL certificates.
network                          Network Requests
Ops Access Requests              Access requests to any Operations systems
ops-codfw                        Onsite queue for codfw
ops-eqdfw                        Onsite queue for eqdfw
ops-eqiad                        Onsite queue for eqiad
ops-eqord                        Onsite queue for eqord
ops-esams                        Onsite queue for esams
ops-ulsfo                        Onsite queue for ulsfo
DBA                              Database administration requests
Operations Software Development  Software development projects

Hardware Request Stage

  • Tasks assigned to others are reviewed less often, because they are awaiting input from the assignee. If the assignee neglects them long term, they will likely be rejected or have the hardware-requests project removed from the task.
  • If the system specification matches an on-site spare, system allocation may proceed.
  • This allocation step is typically processed by Rob and approved by Mark. (It involves a general overview of the roadmap and system procurement planning.)
  • If the system specifications require an order of hardware, the following occurs:
  • An RT procurement queue ticket is created for each set of vendor quotes.
  • Example: if a caching system could at this time come from Dell or HP, we create two RT tickets, one for each vendor to provide quotes for the system specification in question.
  • Quotes are generated and reviewed by Rob, Mark, and the requestors for the hardware.
  • Quotes are approved for purchase by Mark/Damon/Lila (escalation dependent on overall cost) and are typically placed by Rob (for US ordering) or Mark (for EU ordering).
  • The hardware-requests task will have the system details noted (hostname/asset tag) and the task will be linked to the system setup task.
  • These are kept separate to make future searches of the hardware allocation history easier, so it is useful to leave a task with the hardware request in said project.

Hardware/Server Setup / Deployment Stage Workflow

  • This task is the primary tracking task for the setup and deployment of the server.
  • Task should include the following (base template):
  • System Deployment Steps:
   [] - mgmt dns entries created/updated (both asset tag & hostname) [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
   [] - system bios and mgmt setup and tested [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
   [] - network switch setup (port description & vlan) [link sub-task for network configuration here, sub-task should include the network project]
   [] - production dns entries created/updated (just hostname, no asset tag entry) [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
   [] - install-server module updated (dhcp and netboot/partitioning) [done via this task when on-site subtasks complete]
   [] - install OS (note jessie or trusty) [done via this task when network sub-task(s) complete]
   [] - service implementation [done via this task post puppet acceptance]
  • The main task is basically for all the software setup, and the sub-tasks are for the specific on-site or networking tasks.
  • Many times, the network task isn't created, as the person doing the software work can also do the network configuration.

Misc. Production Virtual Machine Requests Workflow

  • Tasks assigned to others are reviewed less often, because they are awaiting input from the assignee. If the assignee neglects them long term, they will likely be rejected or have the vm-requests project removed from the task.
  • If the system specifications meet all requirements for approval/allocation of a production virtual machine, Alex will process and grant the request.

Administrative Commands

  • All Phabricator documentation refers to scripts in the phabricator bin directory. On our setup, that is: /srv/phab/phabricator/bin/


Dump the entire database

Write the entire contents of phabricator's databases to disk, compressed:

cd /srv/phab/phabricator
sudo ./bin/storage dump --output /srv/dumps/phabricator_db_$(date +%Y%m%d%H%M%S).sql.gz --compress
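
Before relying on a dump, it can help to confirm that the file was written and that the compressed stream is intact. The following is a minimal sketch, assuming the dump is a gzip file as the .sql.gz name suggests; verify_dump is a hypothetical helper, not part of Phabricator.

```shell
# Hypothetical helper: sanity-check a compressed dump before relying on it.
# Returns 0 only if the file is non-empty and passes the gzip integrity test.
verify_dump() {
  dump_file=$1
  [ -s "$dump_file" ] || { echo "missing or empty: $dump_file" >&2; return 1; }
  gzip -t "$dump_file" 2>/dev/null || { echo "corrupt gzip: $dump_file" >&2; return 1; }
  echo "ok: $dump_file"
}
```

This only checks the compression layer. It does not prove that the SQL inside is complete or restorable.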


Remove a repo

First you need the repo's callsign, an all-uppercase identifier that Phabricator uses, prefixed with 'r', in URLs and elsewhere to refer to the repo. For example, Puppet's callsign is OPUP. SSH to phab1001N, then:

cd /srv/phab/phabricator
sudo ./bin/remove destroy rFOO
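
Because the destroy command is irreversible, a small guard that builds the 'r'-prefixed name from a callsign can catch typos first. This is a sketch only: repo_monogram is a hypothetical helper, and it assumes callsigns consist solely of the uppercase letters A to Z.

```shell
# Hypothetical helper: turn a callsign such as FOO into the name rFOO that is
# passed to "remove destroy". Rejects empty input and anything that is not
# purely uppercase A-Z (an assumption about valid callsigns).
repo_monogram() {
  callsign=$1
  case "$callsign" in
    ''|*[!ABCDEFGHIJKLMNOPQRSTUVWXYZ]*)
      echo "invalid callsign: $callsign" >&2
      return 1
      ;;
  esac
  echo "r$callsign"
}
```

For example, repo_monogram FOO prints rFOO, while repo_monogram rFOO is rejected because the prefix has already been added.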

Remove a file

First you need the file's ID, prefixed with 'F'. SSH to phab1001N, then:

cd /srv/phab/phabricator
sudo ./bin/remove destroy Fxxxxxxxx

Removing Two Factor Authentication

  • Please note that removal of 2FA is a serious request, and all too easily socially engineered. All requests of this nature should be treated with the same degree of security and confirmation as ssh key changes. The user guidelines require either one month between the user posting their committed identity hash on their wiki user page and the reset request, or verification via a video call.
  • When copying the text phrase from a Phabricator Paste, make sure to use View Raw File and save the file, to avoid issues with line breaks via copy&paste. Afterwards, run cat file | sha512sum.
  • Once confirmed, the actual command is simple. Run it on the Phabricator host:
  sudo /srv/phab/phabricator/bin/auth strip --all-types --user <username>
  • You will be prompted with a yes or no to remove the multi-factor authentication types from the user.
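
The hash comparison described above can be scripted so that it is not done by eye. The sketch below assumes the committed hash is written as lowercase hex, as sha512sum prints it; verify_identity_hash is a hypothetical helper name.

```shell
# Hypothetical helper: compare the SHA-512 of the saved raw Paste file with the
# hash the user committed to on their wiki user page.
# Usage: verify_identity_hash <phrase-file> <committed-hash>
verify_identity_hash() {
  phrase_file=$1
  committed_hash=$2
  # Refuse to compare if either side is missing, so two empty values never "match".
  [ -r "$phrase_file" ] || { echo "cannot read: $phrase_file" >&2; return 1; }
  [ -n "$committed_hash" ] || { echo "no committed hash given" >&2; return 1; }
  # Same digest as "cat file | sha512sum", keeping only the hex field.
  actual_hash=$(sha512sum < "$phrase_file" | cut -d' ' -f1)
  if [ "$actual_hash" = "$committed_hash" ]; then
    echo "match"
  else
    echo "MISMATCH" >&2
    return 1
  fi
}
```

A match only confirms the phrase. The waiting period or video-call verification described above still applies.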

Revoking a Conduit token

Users can do this themselves with the big red "Terminate Tokens" button in Settings > Conduit API Tokens. If it needs to be forced for some reason, you can do it from a phabricator server:

ssh phab1001.eqiad.wmnet
    sudo /srv/phab/phabricator/bin/auth revoke --type conduit --from @<username> 

Revoking a user's sessions

This invalidates any active sessions and forces the user to log in again.

ssh phab1001
    sudo /srv/phab/phabricator/bin/auth revoke --type session --from @<username>

Revoking a user's ssh keys

This invalidates any authorized ssh keys that the user has configured in phabricator.

ssh phab1001
    sudo /srv/phab/phabricator/bin/auth revoke --type ssh --from @<username>
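
When an account is compromised, all three revocations above are often wanted together. The following sketch prints the three commands for review instead of running them; print_revoke_commands is a hypothetical helper, and the commands it prints are the ones documented above.

```shell
# Hypothetical helper: print (not run) the conduit, session and ssh revoke
# commands for one user, so they can be reviewed before being pasted on the
# Phabricator host.
print_revoke_commands() {
  username=$1
  [ -n "$username" ] || { echo "usage: print_revoke_commands <username>" >&2; return 1; }
  for cred_type in conduit session ssh; do
    echo "sudo /srv/phab/phabricator/bin/auth revoke --type $cred_type --from @$username"
  done
}
```

Printing rather than executing keeps a human in the loop for a destructive action.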

Rebuild phabricator search index

Warning: This takes a really long time, probably more than 8 hours. The service stays online during the reindex, but search quality will be degraded.

ssh phab1001
   sudo /srv/phab/phabricator/bin/search init
   sudo /srv/phab/phabricator/bin/search index --all --force --background

Revert all activity of a given user

Warning: This removes most of the user's activity from Phabricator and it is a destructive operation. This should only be done when cleaning up vandalism and after taking appropriate precautions such as taking a database snapshot immediately prior to running the script.

The rollback script attempts to undo edits made by a given user. With the optional --delete argument it will also remove all traces of the corresponding transactions from the phabricator activity log. Any field which has been edited by someone after the vandal's edit will be treated as an edit conflict and the field will be left alone to avoid potentially overwriting useful edits by other users.

The tool works by replaying the edit transactions in reverse, from newest to oldest. Each transaction in Phabricator stores the field name, the old value and the new value. To revert a user's activity, the tool does the following: for each transaction, if the new value matches the field's current value, the old value is applied to the field. After all transactions have been replayed, if any field was changed, the record is saved back to the database. Finally, if --delete was also specified, all the replayed transactions are deleted to clean up the activity history.

ssh phab1001
    sudo /srv/phab/libext/misc/bin/rollback execute --delete --user <username>
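
The reverse-replay rule described above can be illustrated with a toy model. This is not the rollback script's code: the file formats (field|value for the current state, field|old|new for transactions listed newest first) and the name replay_reverse are assumptions made for the example, and values must not contain '|'.

```shell
# Toy model of the reverse replay, for illustration only.
# $1: state file, one "field|value" line per field (current values)
# $2: transaction file, newest first, one "field|old|new" line per edit
# Prints the state after replay in the same "field|value" format.
replay_reverse() {
  awk -F'|' -v state="$1" '
    BEGIN {
      # Load the current value of every field, remembering the input order.
      while ((getline line < state) > 0) {
        split(line, kv, "|")
        cur[kv[1]] = kv[2]
        order[++n] = kv[1]
      }
      close(state)
    }
    # Undo an edit only if its new value is still the current value.
    # Otherwise someone edited the field later: leave it alone (edit conflict).
    ($1 in cur) && cur[$1] == $3 { cur[$1] = $2 }
    END { for (i = 1; i <= n; i++) print order[i] "|" cur[order[i]] }
  ' "$2"
}
```

In this model, a title the vandal changed to SPAM is restored, while a priority that another user changed after the vandal's edit keeps its current value.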

Converting a parent project into a subproject

ssh phab1001
    sudo /srv/phab/phabricator/bin/move_project --subproject --child "<projectname1>" --parent "<projectname2>" --keep-members child

See phab:T221112 for more information.

Note:

  • Avoid using --keep-members for milestones (which cannot have members), as the script does not block you from doing that. (phab:T224420)
  • The script does not check whether tasks are in both the future parent and the future subproject, and it does not remove the future parent project from such tasks. Remove it manually before running the script to avoid DB corruption. (phab:T224421)
  • The workboard appears to be wiped, so you have to record locally which task is in which column beforehand and restore that manually afterwards.
  • Moving a non-parent project to non-parent project is not supported. See https://phabricator.wikimedia.org/T219608#5181020 for the manual steps to perform.

For an example list of steps that works around the first two restrictions, see https://phabricator.wikimedia.org/T230831#5433325


Run a bulk job silently (suppressing notification spam)

First set up a bulk job in phabricator's GUI, then get the bulk job id and run the make-silent command below, specifying your bulk job id. Finally, start the job in the GUI and it will run without sending notifications.

ssh phab1001
    sudo /srv/phab/phabricator/bin/bulk make-silent  --id <bulkid>

Read-only mode / restarting MariaDB

To put phabricator into read-only mode, which allows it to continue serving requests during a master database restart, do the following on the active phabricator server:

ssh phab1001
    sudo /srv/phab/phabricator/bin/config set cluster.read-only true
    # restart database server
    sudo /srv/phab/phabricator/bin/config set cluster.read-only false
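
A common risk with this procedure is forgetting to switch read-only mode off again. The sketch below wraps an arbitrary command between the two config calls; with_read_only and the PHAB_CONFIG override variable are hypothetical names, and the function is assumed to run with sufficient privileges (for example from a root shell).

```shell
# Path from this page. PHAB_CONFIG is a hypothetical override, useful for
# trying the wrapper against a stub instead of the real binary.
PHAB_CONFIG=${PHAB_CONFIG:-/srv/phab/phabricator/bin/config}

# Run a command with Phabricator in read-only mode, then switch read-only mode
# off again even if the command fails. Returns the wrapped command's status.
with_read_only() {
  "$PHAB_CONFIG" set cluster.read-only true || return 1
  "$@"
  cmd_status=$?
  "$PHAB_CONFIG" set cluster.read-only false || {
    echo "WARNING: Phabricator may still be in read-only mode" >&2
    return 1
  }
  return "$cmd_status"
}
```

The wrapped command would be whatever restarts the database server. The sketch does not handle signals, so an interrupted run still needs the second config call by hand.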

Network Architecture

Phabricator is currently hosted on phab1001.eqiad.wmnet / phab2001.codfw.wmnet.

The full path of traffic from the public internet through to the database is as follows:

cache_text esams -> cache_text codfw -> cache_text eqiad -> phab1001 -> dbproxy1003 -> db1043

Fixing Common Problems

PhutilMissingSymbolException

Some Phabricator applications may throw exceptions like Failed to load class or interface "Phabricator*". This can sometimes be resolved by running arc liberate inside /srv/phab/phabricator, which updates the library map, as in this commit.

Phabricator is intermittently down or slow

Check the logs in /var/log/apache2/phabricator_error.log

Check the host in Icinga for more failed checks (eg. PHD should be running).

Check the status of the phd process (sudo service phd status).

Do not run the aphlict server using websockets proxied through the same Apache that is also running the main Phabricator site.
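
When checking the error log, it is often useful to know how frequently a given error appears, for example the PhutilMissingSymbolException mentioned earlier. This is a minimal sketch: count_log_matches is a hypothetical helper, it matches a literal string only, and it makes no assumption about the log line format.

```shell
# Hypothetical helper: count log lines that contain a literal string.
# $1: string to look for
# $2: log file (defaults to the Apache error log path given on this page)
count_log_matches() {
  needle=$1
  log_file=${2:-/var/log/apache2/phabricator_error.log}
  [ -r "$log_file" ] || { echo "cannot read: $log_file" >&2; return 1; }
  # grep -c prints 0 but exits 1 when nothing matches, so mask that status.
  grep -c -F -- "$needle" "$log_file" || true
}
```

For example, count_log_matches PhutilMissingSymbolException prints the number of matching lines in the default log.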


Failure Scenarios / Failover

Simple failure of the phabricator server

A simple failure of the phabricator server, e.g. a disk failure or other hardware failure on phab1001.

Take a look at a previous fail-over ticket at T238956.

Code changes needed for the actual fail-over can be seen at the topic branch phab-buster. Decommissioning of the previous server can be seen at the topic branch phab1003-decom.

Additionally the etherpad Phabricator-migration-20191203 was used.

Steps to fail-over an existing Phabricator server to a new server

If there are 2 existing servers, just follow the steps. If the existing prod server died, assume "old_server" means the warm standby in the other data center. If the standby server died see the section below.

  1. install a new server and add the role::phabricator puppet class on it, run puppet agent
  2. rsync /srv/repos from old_server to new_server, run it with --delete as well and ensure both sides have the same size. (rsyncd / ferm rules for this are already puppetized on all servers)
  3. verify code in /srv/phab is up to date and both servers are on the same git tag (if not use scap to deploy to new server / run 'scap pull' on it)
  4. switch the "phabricator dumps host" to the new server (code change)
  5. (optional) put phab on new_server in maintenance mode (phab admin action)
  6. set downtimes for both servers in Icinga
  7. change the "phabricator_server" setting to the new server name (code change)
  8. (changing the "active server" setting is not needed anymore, setup has been simplified)
  9. switch the discovery record in DNS to the new server (code change). The TTL is 300 seconds by default for all discovery records. It does not need to be changed, but be aware there might be a 5 minute window where clients could get the old server.
  10. switch the config for varnish to the new server (code change)
  11. switch the mail destination on mx to the new server (code change)
  12. using systemctl, restart the "ssh-phab" service on the new server to make it listen on IPv6
  13. using conftool, depool the "vcs" service on the old server, change conftool data to use the new server (code change) and pool it
  14. (if reimage script failed in the past and you have ongoing Icinga alerts about pybal and the vcs server): delete stale confd files on puppetmaster to clear Icinga alerts about confd template compilation failing
  15. make the "phd" service run on the new server to avoid breakage of repos (code change)
  16. verify things work and remove Icinga downtimes
  17. (a few days later) decom the old server following the usual decom steps and as outlined in the phab1003-decom branch linked above
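
The rsync step above asks for both sides to have the same size. One way to compare them is to print a file count and a total byte count on each host and check that the two lines agree. This is a sketch only: tree_summary is a hypothetical helper, and it assumes file paths contain no newlines.

```shell
# Hypothetical helper: print "<file count> <total bytes>" for a directory tree.
# Run it on both servers (for example against /srv/repos) and compare the output.
tree_summary() {
  # wc -c prints one "<bytes> <path>" line per file, plus "<bytes> total"
  # lines when it is given several files. Skip those and sum the rest.
  find "$1" -type f -exec wc -c {} + |
    awk '$2 != "total" { bytes += $1; files++ } END { print files + 0, bytes + 0 }'
}
```

Equal counts and sizes are a quick sanity check, not proof that the contents are identical.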

Steps to re-create a warm standby server

If the non-active server died and you want to re-create it under a new host name:

  1. install a new server and add the role::phabricator puppet class on it, run puppet agent
  2. rsync /srv/repos from the prod server to the new_server, run it with --delete as well and ensure both sides have the same size. (rsyncd / ferm rules for this are already puppetized on all servers)
  3. verify code in /srv/phab is up to date and both servers are on the same git tag (if not use scap to deploy to new server / run 'scap pull' on it)
  4. Add the new host name to the list of "phabricator_servers" in Hiera in hieradata/role/common/phabricator.yaml.
  5. using systemctl, restart the "ssh-phab" service on the new server to make it listen on IPv6
  6. using conftool, depool the "vcs" service on the old server, change conftool data to use the new server (code change) and pool it
  7. You do NOT have to worry about the phd service running, it's only needed on the active server.

Complete datacenter failover

Complete datacenter failover, e.g. some major event takes down eqiad and we need to fail over to codfw.

How to make codfw master writable

root@cumin1001:~# mysql --skip-ssl -hm3-master.codfw.wmnet

Master database failure

The master database fails, and we need to fail over to a slave and promote that slave to master.

If the master goes down, the proxy would automatically fail over to the existing slave. That slave is read-only, so an admin would need to set read_only=OFF on it.

References

  1. The Operations-specific Phabricator projects were discussed in T119944 in early 2016.
