ORES/Deployment: Difference between revisions
Revision as of 13:31, 27 August 2021
This page is a guide to deploying a new version of ORES to the servers.
Prepare the source code
Once your patches are merged into ores, revscoring, or other dependencies, you need to increment the version number. Try to do that in a SemVer fashion, for example by bumping only the patch level (e.g. 0.5.8 -> 0.5.9). You need to do it in setup.py and __init__.py, and probably in some other places too; use grep to check where the current version is used.
Then push the new version to PyPI using:
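The grep check mentioned above could be sketched roughly as follows. The helper name find_version_refs is hypothetical, and the version string is the example from this page; substitute whatever version is current.

```shell
# Hypothetical helper (not part of any ORES repo): list every file and line
# that mentions the given version string, so no occurrence is missed when bumping.
find_version_refs() {
    version="$1"      # e.g. 0.5.8
    dir="${2:-.}"     # directory to search, defaults to the current one
    # -r recurse, -n show line numbers, -F treat the version as a fixed string
    # (so the dots are not regex wildcards); skip git internals.
    grep -rnF --exclude-dir=.git -- "$version" "$dir"
}
```

Run it from the repository root as find_version_refs 0.5.8 and update each reported location before tagging the release.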
python setup.py sdist bdist_wheel upload
If you have GPG/PGP set up, you can try adding sign to the command above to also sign the wheel and the sdist.
If you are making breaking changes to revscoring, old model files probably won't work, so you need to rebuild the models. Do this using the Makefile in the editquality and wikiclass repos. If a model changes substantially (new features, new algorithm, etc.), make sure to increment the model version in the Makefile too.
First, clone https://github.com/wiki-ai/ores-wmflabs-deploy:
git clone https://github.com/wiki-ai/ores-wmflabs-deploy
There is a file in ores-wmflabs-deploy called "requirements.txt". Update the version numbers in it, then build wheels by creating a virtualenv and installing everything into it:
virtualenv -p python3 tmp
source tmp/bin/activate
pip install --upgrade pip
pip install wheel
pip wheel -w wheels/ -r requirements.txt
It's critical to do this in an environment that is binary-compatible with the production cluster. ores-misc-01.ores.eqiad1.wikimedia.cloud is designed for that. Don't forget to install the C dependencies beforehand, and watch the output carefully for errors of any kind.
Once the wheels are ready, update the wheels repo in Gerrit (research/ores/wheels), where we keep the wheels and the NLTK data. You need to git clone it, update the wheels and make a patch:
git clone ssh://YOURUSERNAME@gerrit.wikimedia.org:29418/research/ores/wheels
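Then copy the new versions into the wheels folder, delete the old ones, and make a new patch. Note that git commit -a only picks up tracked files, so stage the newly copied wheels first:
cd wheels
git add -A
git commit -m "New wheels for wiki-ai 1.2" -a
git review -R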
To rebuild the production wheels, use frozen-requirements.txt rather than requirements.txt.
After the patch is +2ed and merged, you should update ores-wmflabs-deploy.
NOTE: This is not a required step for production, but we like to keep the repos in sync.
cd ores-wmflabs-deploy
git checkout -b wiki_ai_1.2
source tmp/bin/activate
pip freeze | grep -v setuptools > frozen-requirements.txt
cd submodules/wheels
git pull
cd ../..
git commit -m "Release wiki-ai 1.2"
git push -f origin wiki_ai_1.2
After that, make a PR on GitHub; once it's merged, it's good to go!
If you want to deploy to prod as well (ores.wikimedia.org), you need to backport your commits to Gerrit too (ewww). The Gerrit repos are:
git clone ssh://YOURUSERNAME@gerrit.wikimedia.org:29418/mediawiki/services/ores
- "mediawiki/services/ores/deploy" for ores-wmflabs-deploy (note that these repos have diverged [FIXME: Mande?])
- "mediawiki/services/ores/editquality" for editquality
- "mediawiki/services/ores/wikiclass" for wikiclass
Merge the code and prepare to deploy
There are two use cases: updating repositories with models and updating the ORES deploy repository.
Updating model repositories
In this case, we want to update or add a model in one of the model repositories, for example https://github.com/wikimedia/editquality. Once the work is done, the first step is to send a pull request to the GitHub repository and wait for approval from the WMF Machine Learning team before merging. For example: https://github.com/wikimedia/editquality/pull/233
Once the change is merged, we need to propagate the Git LFS objects from GitHub to Gerrit (since we deploy the Gerrit repositories in production), following the suggestion in https://phabricator.wikimedia.org/T212818#4865070:
$ git clone https://github.com/wikimedia/editquality
$ cd editquality
$ git lfs pull
$ git remote add gerrit https://gerrit.wikimedia.org/r/scoring/ores/editquality
$ git lfs push gerrit master
Updating the ORES deploy repository
This is the repository that we deploy in production. It includes all the more specific model repositories as git submodules. If you don't need to change any git submodules, just change the code, send a Gerrit patch, and wait for the WMF Machine Learning team to review and merge it.
If you need to update a submodule, for example editquality:
# Assumption - the working directory is the ores/deploy one
cd submodules/editquality/
# Checkout new changes
git checkout master
git fetch origin master
# Confirm that the diff between origin and local is the expected one
git diff origin/master
git pull
cd ../../
# Now you should see a diff in the submodule sha
git diff
# Proceed with git add, commit and review
Deploy to the test server
Please deploy to the beta cluster well in advance of any production deployment, to give time for smoke-testing and log-watching: at least an hour ahead, though several days is better.
We have a series of increasingly production-like environments available for smoke testing each release. Please take the time to go through each step: labs staging -> beta -> production. There is also an automatic canary deployment during scap, which stops after pushing to ores1001 and gives you the opportunity to compare that server's health to its brethren's.
NOTE: This is not a required step for production, but we like to keep the repos in sync.
First, go to staging. Simply make your changes in the ores-wmflabs-deploy repo and run:
fab stage
Don't forget to log it in #wikimedia-cloud by typing: "!log ores-staging deploying <HASH> into staging".
Then check ores-staging.wmflabs.org to see if everything is healthy. If so, you are good to go to the labs setup. Rebase the "deploy" branch onto master.
git checkout deploy
git rebase origin/master
git push -f origin deploy
If working as expected, deploy with "fab deploy_web" and then "fab deploy_celery". Once it's done, test ores.wmflabs.org to see if everything is working as expected.
If something does go wrong, you'll want to read the diagnostic messages. See app.log. Monitor the logs throughout each of these deployment stages by going to the target server (for beta this is currently deployment-ores01.deployment-prep.eqiad1.wikimedia.cloud) and running:
sudo tail -f /srv/log/ores/*.log
You can also view these logs on https://logstash-beta.wmflabs.org
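When tailing is too noisy, a small filter can help pick out problems. The following is a minimal sketch, assuming that error lines in the ORES logs contain the literal string ERROR (the exact log format is not described on this page); recent_errors is a hypothetical helper name.

```shell
# Hypothetical helper: print the last N lines containing "ERROR"
# from the given log files. Assumes error lines carry the literal
# level name ERROR; adjust the pattern if the log format differs.
recent_errors() {
    count="$1"
    shift
    # -h suppresses file name prefixes so lines from several logs read uniformly
    grep -h 'ERROR' "$@" | tail -n "$count"
}
```

On the beta host this might be used as sudo sh -c 'grep -h ERROR /srv/log/ores/*.log | tail -n 20', since the log directory normally needs elevated permissions.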
Open the beta cluster grafana dashboard for the ORES service: https://grafana-labs.wikimedia.org/dashboard/db/ores-beta-cluster?orgId=1
Open the beta cluster ORES extension graphs at: https://grafana-labs.wikimedia.org/dashboard/db/ores-extension?orgId=1
Read the recent server admin log messages for beta: https://tools.wmflabs.org/sal/deployment-prep
The beta cluster configuration should match production. The only time it's appropriate for the config to differ is when you're testing new configuration that will be included with this deployment. Since the beta cluster configuration is applied as an override on top of the production configuration, the usual case is that you will make sure that InitialiseSettings-labs.php and CommonSettings-labs.php contain no ORES-specific configuration.
If you do plan to deploy a configuration change, consider what will happen if the code is rolled back. The safest type of change can be deployed either code- or configuration- first. If one cannot be deployed without the other, please review your rollback plan with the rest of the team.
Deploy to beta
- ssh deployment-deploy01.eqiad1.wikimedia.cloud
- cd /srv/deployment/ores/deploy
- git pull && git submodule update --init
- Record the NEWHASH at the top of git log -1
- Prepare a message with the new revision (NEWHASH) to send to #wikimedia-cloud: "!log deployment-prep deploying ores <NEWHASH>"
- Deploy with scap deploy -v "<relevant task -- e.g. T1234>" and check whether everything works as expected.
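As a small convenience, the log message from the steps above can be built directly from the checked-out revision. This is only a sketch; beta_log_message is a hypothetical helper, and the message text is the one quoted in the list.

```shell
# Hypothetical helper: format the !log line for a beta deployment.
# $1 is the revision hash being deployed (NEWHASH).
beta_log_message() {
    printf '!log deployment-prep deploying ores %s\n' "$1"
}
```

From /srv/deployment/ores/deploy, running beta_log_message "$(git rev-parse HEAD)" prints the line to paste into #wikimedia-cloud.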
Deploy to production
Production cluster (ores.wikimedia.org)
You are doing a dangerous thing. Remember, breaking the site is extremely easy! Be careful at every step and try to have someone from the team and ops supervising you. Also remember that ORES depends on a huge number of puppet configurations: check whether your change is compatible with the puppet configs, and change them if necessary.
It's crucial to watch all of these places. Sometimes the service side won't error but will cause the wikis themselves to burst into flames.
Production ORES service graphs: https://grafana.wikimedia.org/dashboard/db/ores?orgId=1.
Production ORES extension graphs: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1
Site-wide error graphs: https://grafana.wikimedia.org/dashboard/file/varnish-http-errors.json?refresh=5m&orgId=1
Watch the logs, especially for ERROR-level messages: https://logstash.wikimedia.org/app/kibana#/dashboard/ORES
Watch MediaWiki fatal logs: https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors
Note that the service "Scores processed" graph is the only indication of what's happening on each machine's Celery workers. This is the best place to watch for canary health. All of the "scores returned" graphs are only showing behavior at the uWSGI layer.
- Prep work
We'll double check the hash that is deployed in case we need to revert and then update the code to current master.
- Record the latest revision (OLDHASH) with git log -1 (in case you need to roll back). Note that the revision on the deployment server (tin) is not a 100% reliable reference: it's possible that the code was rolled back, incompletely deployed, or that the last person was deploying to an experimental cluster. You need to get the current revision from the production server itself.
- Deploy to canary
Then you need to deploy to a single node to check that it works as expected. This is called the canary node. Right now, it's ores1001.eqiad.wmnet.
- Update the deploy repository with:
git log (and verify that HEAD is the hash retrieved in Prep Work on ores1001)
git log origin (and inspect the commits between origin and local branch)
git submodule update --init
scap deploy -v "<relevant task -- e.g. T1234>" (This will automatically post a log line in #wikimedia-operations.)
- Let it run, but when prompted to continue do not hit "y" yet! You have just deployed to the canary server, please smoke test.
ssh ores1001.eqiad.wmnet and check the service internally by running
curl http://0.0.0.0:8081/v3/scores/fakewiki/$(date +%s)
- It would be great if you test other aspects if you are changing them (e.g. test if it returns data if you are adding a new model).
- Note that you are testing uWSGI on the canary server, so any gross errors will show up. However, if the request makes a call through celery (most requests do), you won't necessarily be running code on the canary server; the task can run on any node in the cluster. Try running the curl command 10 times for a reasonable chance (94%) of hitting the canary server. Make sure to include ?features in the request to circumvent the cache.
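The 94% figure can be reproduced with a quick calculation. If each request is equally likely to land on any of n workers, the chance that at least one of k requests hits the canary is 1 - (1 - 1/n)^k. The sketch below assumes that uniform model; n = 4 is an inference that happens to match the quoted 94%, not a documented cluster size. Both function names are hypothetical.

```shell
# Probability that at least one of k requests reaches the canary,
# assuming n workers that are all equally likely to pick up a task.
canary_hit_probability() {
    awk -v n="$1" -v k="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - 1/n)^k }'
}

# Repeat the smoke-test request from the step above; run this on the canary host.
# $1 is the number of requests (defaults to 10). The URL and the ?features
# parameter are the ones given on this page.
smoke_test_canary() {
    i=0
    while [ "$i" -lt "${1:-10}" ]; do
        curl -s "http://0.0.0.0:8081/v3/scores/fakewiki/$(date +%s)?features"
        echo
        i=$((i + 1))
    done
}
```

For example, canary_hit_probability 4 10 prints 0.9437. With more workers in the pool, the same ten requests give a lower chance, so increase the request count accordingly.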
- Continue deployment to prod
If everything works as expected, we're ready to continue.
- Deploy it fully by answering "y" to the scap prompt.
- If everything looks OK, say "Victory! ORES deploy looks good" (or something equally effusive) in #wikimedia-operations.
In case of a production accident
The ORES extension has the potential to break a few critical pages, such as Special:RecentChanges. An issue with these pages is serious, and should be handled in basically the same way as if you took down the entire site.
Your first instinct should be to roll back whatever you just deployed.
- Announce the problem and your intention to roll back in #wikimedia-operations.
- Take the OLDHASH you recorded before deploying, and run this command:
scap deploy -v -r <OLDHASH>
Disable the ORES extension
In the unlikely event that a rollback isn't going fast enough, or for some reason doesn't work, please disable the ORES extension on any sites that are having problems, or globally if appropriate.
- Announce what steps you'll take in #wikimedia-operations.
- Make a patch to wmf-config/InitialiseSettings.php to disable $wmgUseORES on the sites you have identified.
- From the deployment server:
git log HEAD..origin/master (make sure you're only pulling in your own change)
scap deploy-file wmf-config/InitialiseSettings.php "<Explain why you're doing this>"
Make sure the situation stabilizes. Sorry but you break it, you buy it. Please stay on-duty until you can be certain that nothing else is happening, or someone else on the team agrees to adopt your putrid albatross.
When you're feeling better, within a day or two, explain what happened.
- Create a wiki page as a subpage under Incident documentation, use the template and follow instructions there.
- You should have just emailed ops@?
- Create a Phabricator task and tag with #wikimedia-incident
Unusual maintenance actions
Clear threshold cache
Thresholds are normally cached for a day, so if you want changes to threshold code to be reflected immediately, you'll have to purge the caches manually. Calculated threshold values are cached separately for every wiki and model. Clear by logging into the deployment server and running, for example,
mwscript eval.php --wiki frwiki
$cache = MediaWiki\MediaWikiServices::getInstance()->getMainWANObjectCache();
$key = $cache->makeKey( 'ORES', 'threshold_statistics', 'damaging', 1 );
$cache->delete($key);
$key = $cache->makeKey( 'ORES', 'threshold_statistics', 'goodfaith', 1 );
$cache->delete($key);
Restarting Redis
Celery is unhappy when its Redis backing is restarted. Any time Redis crashes or is intentionally restarted, you may need to restart the Celery workers.
There are multiple Redis services for each datacenter:
- two instances (master/replica) holding the celery queue (not persisted on disk)
- two instances (master/replica) holding the ORES score cache (persisted on disk)
The two master instances are running on the same rdb node (different ports), same thing for the replicas. If you want to restart or reboot one of the redis instances, you can follow this simple procedure:
1. Merge a code change like https://gerrit.wikimedia.org/r/c/operations/puppet/+/715209 to change ORES' config to point to the replica instance (a quick git grep in puppet should be sufficient to find the hostnames).
2. After merging the above change, execute on a cumin node:
cumin -m async -b 1 -s 30 'A:ores-codfw' 'run-puppet-agent' 'depool' 'sleep 5' 'systemctl restart celery-ores-worker ; systemctl restart uwsgi-ores' 'sleep 5' 'pool'
3. Monitor the ORES grafana dashboards and verify that no TCP connections are hitting the Redis node you are about to reboot (a simple netstat on the node is enough).
4. Revert the change from step 1.
5. Execute the cumin command again.
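The netstat check in the monitoring step could look roughly like the sketch below. It counts established TCP connections whose local address ends in a given port, reading netstat -tn output from standard input. The helper name is hypothetical, and the port in the usage note is only Redis's upstream default; this page does not state which ports the ORES Redis instances actually use.

```shell
# Hypothetical helper: count ESTABLISHED connections to a local port.
# Expects `netstat -tn` output on stdin, where column 4 is the local
# address (host:port) and column 6 is the connection state.
count_established_to_port() {
    awk -v port="$1" '
        $6 == "ESTABLISHED" && $4 ~ (":" port "$") { n++ }
        END { print n + 0 }   # print 0 rather than an empty line when nothing matches
    '
}
```

On the Redis node, netstat -tn | count_established_to_port 6379 should print 0 once all ORES hosts have switched to the replica (replace 6379 with the real port of the instance being restarted).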
Enabling ORES on a new wiki
TODO: bug T182054
Puppet-managed config changes
First, our configuration lives in several places. In the code, it is in the "config" folder. The deploy repository has another "config" folder that overrides the code configs, and finally the puppet ores module holds the configs that override the other two.
If you want to change configs in the code or the deploy repo, you just need to make the change, get it merged and deploy it. Deployment causes the services to restart and pick up the new config. Changing the puppet-managed configs, however, does not cause the services to restart. You need to wait until the puppet agent runs on each ores node (like ores1001) and changes the config file. The files can be found at /etc/ores/*.yaml, and once they have changed you need to manually restart the ores services:
sudo service uwsgi-ores restart
sudo service celery-worker-ores restart
You need to do it on all nodes in both datacenters. You can test it on one or two nodes as canary and if everything's fine, use pssh (or fabric, capistrano, your choice) to run it automatically on the rest. TODO: Make a script for this.
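Until such a script exists, something along the following lines could serve as a starting point. This is a rough sketch, not a tested production tool: restart_ores_on_hosts is a hypothetical name, it uses plain ssh rather than pssh, and it defaults to a dry run that only prints what would be executed. The service names are the ones given above.

```shell
# Hypothetical sketch: restart the ORES services on each host in turn.
# Dry run by default; set DRY_RUN=0 to really run the commands over ssh.
restart_ores_on_hosts() {
    cmd='sudo service uwsgi-ores restart && sudo service celery-worker-ores restart'
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Only show what would be executed
            echo "ssh $host '$cmd'"
        else
            # Stop at the first failure so a bad restart is not repeated everywhere
            ssh "$host" "$cmd" || { echo "restart failed on $host" >&2; return 1; }
        fi
    done
}
```

Start with one or two canary nodes, for example DRY_RUN=0 restart_ores_on_hosts ores1001.eqiad.wmnet, check the dashboards, and only then pass the remaining hosts.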