
ORES/Deployment

This page is a guide on how to deploy a new version of ORES to the servers.

Prepare the source code

PyPI

So, your patches are merged into ores/revscoring/other dependencies. Next, you need to increment the version number; try to do that in a SemVer fashion, e.g. only bumping the patch level (0.5.8 -> 0.5.9). The version needs to be updated in setup.py and __init__.py (and probably somewhere else too; use grep to check where the current version is used).
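
For example, to find every file that still references the current version (0.5.8 here is just a placeholder):

grep -rn "0.5.8" --include='*.py' .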

Then you need to push the new version to PyPI using:

python setup.py sdist bdist_wheel upload

If you have a GPG/PGP key, you can add sign to the command above to also sign the wheel and the sdist.
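
For example, assuming a GPG key is configured (--sign is the upload command's signing flag):

python setup.py sdist bdist_wheel upload --sign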

Update models

If you are making breaking changes to revscoring, the old model files probably won't work, so you need to rebuild the models. Do this using the Makefile in the editquality & wikiclass repos. If a model changes substantially (new features, new algorithm, etc.), make sure to increment the model versions in the Makefile too.
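
A rebuild might look something like this (the target name is hypothetical; check each repo's Makefile for the real targets):

cd editquality
make -j4 enwiki_models   # hypothetical target; repeat for each affected wiki/model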

Update wheels

First, clone https://github.com/wiki-ai/ores-wmflabs-deploy:

git clone https://github.com/wiki-ai/ores-wmflabs-deploy

There is a file in ores-wmflabs-deploy called "requirements.txt". Update the version numbers there, then build the wheels by creating a virtualenv and installing everything in it:

virtualenv -p python3 tmp
source tmp/bin/activate
pip install --upgrade pip
pip install wheel
pip wheel -w wheels/ -r requirements.txt

It's critical to do this in an environment that is binary-compatible with the production cluster; ores-misc-01.ores.eqiad1.wikimedia.cloud is designed for that. Don't forget to install the C dependencies beforehand, and investigate any error that occurs during the build.
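
For example (package names are illustrative; the exact list depends on the revscoring release, so check its documentation):

sudo apt-get install build-essential python3-dev libenchant-dev aspell-en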

Once the wheels are ready, they go into a Gerrit repo called wheels (research/ores/wheels), where we keep wheels and nltk data. You need to git clone it, update the wheels and make a patch:

git clone ssh://YOURUSERNAME@gerrit.wikimedia.org:29418/research/ores/wheels

Then, copy the new versions into the wheels folder, delete the old ones, and make a new patch:

cd wheels
git commit -m "New wheels for wiki-ai 1.2" -a
git review -R

To rebuild the production wheels, use frozen-requirements.txt rather than requirements.txt.

Update ores-wmflabs-deploy

After the patch is +2'd and merged, you should update ores-wmflabs-deploy.

NOTE: This is not a required step for production, but we like to keep the repos in sync.

cd ores-wmflabs-deploy
git checkout -b wiki_ai_1.2
source tmp/bin/activate
pip freeze | grep -v setuptools > frozen-requirements.txt
cd submodules/wheels
git pull
cd ../..
git commit -m "Release wiki-ai 1.2"
git push -f origin wiki_ai_1.2

After that, make a PR on GitHub; once it's merged, it's good to go!

If you want to deploy to production as well (ores.wikimedia.org), you need to backport your commits to Gerrit too (ewww). The Gerrit repos are:

git clone ssh://YOURUSERNAME@gerrit.wikimedia.org:29418/mediawiki/services/ores

For ores.

And:

  • "mediawiki/services/ores/deploy" for ores-wmflabs-deploy (note that these repos have diverged [FIXME: Mande?])
  • "mediawiki/services/ores/editquality" for editquality
  • "mediawiki/services/ores/wikiclass" for wikiclass

Merge the code and prepare to deploy

There are two use cases: updating repositories with models and updating the ORES deploy repository.

Updating model repositories

In this case we want, for example, to update or add a model in one of the repositories, such as https://github.com/wikimedia/editquality. After doing all the work, the first step is to send a pull request to the GitHub repository and wait for approval from the WMF Machine Learning team before merging. For example: https://github.com/wikimedia/editquality/pull/233

Once the change is merged, we need to propagate the Git LFS objects from GitHub to Gerrit (since we deploy the Gerrit repositories in production), following what is suggested in https://phabricator.wikimedia.org/T212818#4865070:

$ git clone https://github.com/wikimedia/editquality
$ cd editquality
$ git lfs pull
$ git remote add gerrit https://gerrit.wikimedia.org/r/scoring/ores/editquality
$ git lfs push gerrit master

Updating the ORES deploy repository

This is the repository that we deploy to production; it includes all the more specific model repositories as git submodules. If you don't need to change the submodules, just change the code, send a Gerrit patch, and wait for the WMF Machine Learning team to review and merge it.

If you need to update a submodule, for example editquality:

# Assumption - the working directory is the ores/deploy one
cd submodules/editquality/
# Checkout new changes
git checkout master
git fetch origin master
# Confirm that the diff between origin and local is the expected one
git diff origin/master
git pull
cd ../../
# Now you should see a diff in the submodule sha
git diff
# Proceed with git add, commit and review

Deploy to the test server

Please deploy to the beta cluster well in advance of any production deployment: at least an hour, and ideally several days, to give time for smoke-testing and log-watching.

We have a series of increasingly production-like environments available for smoke-testing each release; please take the time to go through each step: labs staging -> beta -> production. There is also an automatic canary deployment during scap, which pauses after pushing to ores1001 and gives you the opportunity to compare that server's health to its brethren's.

Labs (ores.wmflabs.org)

NOTE: This is not a required step for production, but we like to keep the repos in sync.

First, go to staging. Simply make your changes in the ores-wmflabs-deploy repo and do fab stage (don't forget to log it in #wikimedia-cloud by typing: "!log ores-staging deploying <HASH> into staging").
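
In other words, from the ores-wmflabs-deploy checkout:

fab stage
# then, in #wikimedia-cloud:
# !log ores-staging deploying <HASH> into staging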

Then check ores-staging.wmflabs.org to see if everything is healthy. If so, you are good to go to the labs setup. Rebase the "deploy" branch onto master.

git checkout deploy
git rebase origin/master
git push -f origin deploy

If everything works as expected, deploy with "fab deploy_web" and then "fab deploy_celery". Once that's done, test ores.wmflabs.org to confirm everything still works.
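
That is, from the same checkout:

fab deploy_web
fab deploy_celery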

Beta (ores-beta.wmflabs.org)

Monitoring

If something does go wrong, you'll want to read the diagnostic messages; see /srv/log/ores/main.log and app.log. Monitor the logs throughout each of these deployment stages by going to the target server (for beta this is currently deployment-ores01.deployment-prep.eqiad1.wikimedia.cloud) and running:

sudo tail -f /srv/log/ores/*.log

You can also view these logs on https://logstash-beta.wmflabs.org

Open the beta cluster grafana dashboard for the ORES service: https://grafana-labs.wikimedia.org/dashboard/db/ores-beta-cluster?orgId=1

Open the beta cluster ORES extension graphs at: https://grafana-labs.wikimedia.org/dashboard/db/ores-extension?orgId=1

Read the recent server admin log messages for beta: https://tools.wmflabs.org/sal/deployment-prep

Configuration

The beta cluster configuration should match production; the only time it's appropriate for the config to differ is when you're testing new configuration that will be included with this deployment. Since the beta cluster configuration is applied as an override on top of the production configuration, the usual case is to make sure that InitialiseSettings-labs.php and CommonSettings-labs.php contain no ORES-specific configuration.
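
One quick way to check, assuming a checkout of the mediawiki-config repository (the grep is just illustrative):

grep -in ores wmf-config/InitialiseSettings-labs.php wmf-config/CommonSettings-labs.php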

If you do plan to deploy a configuration change, consider what will happen if the code is rolled back. The safest type of change can be deployed either code-first or configuration-first. If one cannot be deployed without the other, please review your rollback plan with the rest of the team.

Deploy to beta

  1. ssh deployment-deploy01.eqiad1.wikimedia.cloud
  2. cd /srv/deployment/ores/deploy
  3. git pull && git submodule update --init
  4. Record the new revision (NEWHASH) at the top of git log -1
  5. Prepare a message and send it to #wikimedia-cloud: "!log deployment-prep deploying ores <NEWHASH>"
  6. Deploy with scap deploy -v "<relevant task -- e.g. T1234>" and check whether everything works as expected.

Deploy to production

Production cluster (ores.wikimedia.org)

You are doing a dangerous thing. Remember, breaking the site is extremely easy! Be careful at every step and try to have someone from the team and ops supervising you. Also remember that ORES depends on a huge number of Puppet configurations; check whether your change is compatible with the Puppet configs and change them if necessary.

Monitoring

It's crucial to watch all of these places; sometimes the service side won't error, but it will still cause the wikis themselves to burst into flames.

Production ORES service graphs: https://grafana.wikimedia.org/dashboard/db/ores?orgId=1.

Production ORES extension graphs: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1

Site-wide error graphs: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1

Watch the logs, especially for ERROR-level messages: https://logstash.wikimedia.org/app/kibana#/dashboard/ORES

Watch MediaWiki fatal logs: https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors

Note that the service "Scores processed" graph is the only indication of what's happening on each machine's Celery workers. This is the best place to watch for canary health. All of the "scores returned" graphs are only showing behavior at the uWSGI layer.

Prep work

We'll double-check the hash that is currently deployed, in case we need to revert, and then update the code to the current master.

  1. ssh ores1001.eqiad.wmnet
  2. cd /srv/deployment/ores/deploy
  3. Record the latest revision (OLDHASH) with git log -1 (in case you need to roll back). Note that the revision on the deployment server (tin) is not a 100% reliable reference: it's possible that the code was rolled back, incompletely deployed, or that the last person was doing a deployment to an experimental cluster. You need to get the current revision from the production server itself.

Deploy to canary

Then you need to deploy to a single node to check that it works as expected; this is called the canary node. Right now, it's ores1001.eqiad.wmnet.

  1. ssh deployment.eqiad.wmnet
  2. Update the deploy repository with:
    1. cd /srv/deployment/ores/deploy
    2. git log (and verify that HEAD is the hash retrieved in Prep Work on ores1001)
    3. git fetch
    4. git log origin (and inspect the commits between origin and local branch)
    5. git pull
    6. git submodule update --init
  3. scap deploy -v "<relevant task -- e.g. T1234>" (This will automatically post a log line in #wikimedia-operations.)
  4. Let it run, but when prompted to continue do not hit "y" yet! You have just deployed to the canary server, please smoke test.
  5. ssh ores1001.eqiad.wmnet and check the service internally by running curl http://0.0.0.0:8081/v3/scores/fakewiki/$(date +%s)
    • It would be great to test other aspects too if you are changing them (e.g. check that a newly added model returns data).
    • Note that you are testing uWSGI on the canary server, so any gross errors will show up, but if the request makes a call through Celery (most requests do), you won't necessarily be running code on the canary server, but on any node in the cluster. Try running the curl command 10 times for a reasonable chance (94%) of hitting the canary server, and make sure to include ?features in the request to circumvent the cache (see the sketch after this list).
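
A minimal sketch of that repeated check, run from the canary host (the revision ID from date +%s is just a stand-in; ?features bypasses the cache):

for i in $(seq 1 10); do curl "http://0.0.0.0:8081/v3/scores/fakewiki/$(date +%s)?features"; sleep 1; done
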
Continue deployment to prod

If everything works as expected, we're ready to continue.

  1. Deploy it fully by answering "y" to the scap prompt.
  2. If everything looks OK, say "Victory! ORES deploy looks good" (or something equally effusive) in #wikimedia-operations.

In case of a production accident

The ORES extension has the potential to break a few critical pages, such as Special:RecentChanges. An issue with these pages is serious, and should be handled in basically the same way as if you took down the entire site.

Rollback

Your first instinct should be to roll back whatever you just deployed. Take the OLDHASH you recorded before deploying, and follow these steps:

  1. Announce the problem and your intention to roll back in #wikimedia-operations.
  2. scap deploy -v -r <OLDHASH>

Disable the ORES extension

In the unlikely event that a rollback isn't going fast enough, or for some reason doesn't work, please disable the ORES extension on any sites that are having problems, or globally if appropriate.

  1. Announce what steps you'll take in #wikimedia-operations.
  2. Make a patch in the mediawiki-config repo, in wmf-config/InitialiseSettings.php, to disable $wmgUseORES on the sites you have identified.
  3. From the deployment server:
  4. cd /srv/mediawiki-staging
  5. git fetch
  6. git log HEAD..origin/master -- Make sure you're only pulling in your own change.
  7. git rebase
  8. scap deploy-file wmf-config/InitialiseSettings.php "<Explain why you're doing this>"

Monitor

Make sure the situation stabilizes. Sorry but you break it, you buy it. Please stay on-duty until you can be certain that nothing else is happening, or someone else on the team agrees to adopt your putrid albatross.

Incident report

When you're feeling better, within a day or two, explain what happened.

  1. Create a wiki page as a subpage under Incident documentation, use the template and follow instructions there.
  2. You should have just emailed ops@?
  3. Create a Phabricator task and tag with #wikimedia-incident

Unusual maintenance actions

Clear threshold cache

Thresholds are normally cached for a day, so if you want changes to threshold code to be reflected immediately, you'll have to purge the caches manually. Calculated threshold values are cached separately for every wiki and model. Clear them by logging into the deployment server and running, for example:

mwscript eval.php --wiki frwiki

$cache = MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
$key = $cache->makeKey( 'ORES', 'threshold_statistics', 'damaging', 1 );
$cache->delete($key);
$key = $cache->makeKey( 'ORES', 'threshold_statistics', 'goodfaith', 1 );
$cache->delete($key);

Restarting Redis

Celery is unhappy when its Redis backing is restarted. Any time Redis crashes or is intentionally restarted, you must restart the Celery workers. If this is an intentional restart, then stop all Celery workers prior to shutting down Redis.
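
An illustrative sequence for a planned Redis restart (the Redis unit name, and which host each command runs on, will vary; celery-worker-ores is the worker service used elsewhere on this page):

sudo service celery-worker-ores stop      # on every Celery node
sudo service redis-server restart         # on the Redis host; the actual unit name may differ
sudo service celery-worker-ores start     # on every Celery node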

Enabling ORES on a new wiki

TODO: bug T182054

Puppet-managed config changes

First, our configuration can be found in several places. In the code, it lives in the "config" folder. Then, in the deploy repository, there is another "config" folder that overrides the code configs, and finally there's the Puppet ores module, which holds the final configs that override the other two.

If you want to change configs in the code or in the deploy repo, you just need to make the change, get it merged, and deploy it: deployment restarts the services so they pick up the new config. Changing the Puppet-managed configs, however, does not restart the services. You need to wait until the Puppet agent runs on each ORES node (like ores1001) and changes the config file. The files can be found at /etc/ores/*.yaml, and once they've changed you need to manually restart the ORES services:

sudo service uwsgi-ores restart
sudo service celery-worker-ores restart

You need to do this on all nodes in both datacenters. You can test it on one or two nodes as canaries, and if everything's fine, use pssh (or fabric, capistrano, your choice) to run it automatically on the rest. TODO: Make a script for this.
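
A rough pssh example (the host list is hypothetical; build it from the real node list for each datacenter):

pssh -i -H "ores1002.eqiad.wmnet ores1003.eqiad.wmnet" 'sudo service uwsgi-ores restart && sudo service celery-worker-ores restart'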