These are directions for deploying zotero, a nodejs service. More detailed but more general directions for nodejs services are at Migrating_from_scap-helm#Code_deployment/configuration_changes.
Locate build candidate
From Gerrit, locate the candidate build. PipelineBot will post a message with the build name, e.g.:
PipelineBot  Mar 15 9:22 PM  Patch Set 4:
Wikimedia Pipeline Image Build SUCCESS
IMAGE: docker-registry.discovery.wmnet/wikimedia/mediawiki-services-zotero
TAGS: 2019-03-15-211530-candidate, 950e3b4468f2f84d3bb2b0343c
'2019-03-15-211530-candidate' is the name of the build.
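The tag can simply be copied from the bot comment by hand, but if you have the comment in a shell variable or file, a grep pulls it out. This is only a convenience sketch, assuming the date-stamped tag format shown above:

```shell
# Extract the '-candidate' image tag from a saved PipelineBot comment.
msg='Patch Set 4: Wikimedia Pipeline Image Build SUCCESS IMAGE: docker-registry.discovery.wmnet/wikimedia/mediawiki-services-zotero TAGS: 2019-03-15-211530-candidate, 950e3b4468f2f84d3bb2b0343c'
build=$(printf '%s\n' "$msg" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}-candidate')
echo "$build"    # prints 2019-03-15-211530-candidate
```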
Add change via gerrit
- Clone the deployment-charts repo.
- Edit values.yaml for zotero (e.g. vi values.yaml)
main_app:
  image: wikimedia/mediawiki-services-zotero
  limits:
    cpu: 10
    memory: 4Gi
  liveness_probe:
    tcpSocket:
      port: 1969
  port: 1969
  requests:
    cpu: 200m
    memory: 200Mi
  version: 2019-01-17-114541-candidate-change-me
- Make a CR changing the version value for all three environments (staging, codfw, and eqiad), and merge it after a successful review.
- After the merge, a cron job on the deployment server (running every minute) updates the /srv/deployment-charts directory with the contents from git.
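The edit itself is a one-line change per environment. A sed invocation like the following can do the bump; this is a sketch demonstrated on a local values.yaml fragment — in the real repo, the values.yaml for each of staging, codfw and eqiad needs the change:

```shell
# Demonstrate the version bump on a local values.yaml fragment.
cat > values.yaml <<'EOF'
main_app:
  version: 2019-01-17-114541-candidate-change-me
EOF
new_version=2019-03-15-211530-candidate
# Replace whatever follows '  version: ' with the new candidate tag.
sed -i "s/^\(  version: \).*/\1${new_version}/" values.yaml
grep 'version:' values.yaml    # prints:   version: 2019-03-15-211530-candidate
```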
Log into deployment server
SSH into the deployment server.
Navigate to /srv/deployment-charts/helmfile.d/services/
Zotero runs on two of the available server farms, codfw and eqiad. There is also a staging server. You can test out changes on the staging server first.
cd /srv/deployment-charts/helmfile.d/services/zotero
helmfile -e staging -i apply
helmfile apply may take a while. Checking the status again afterwards shows whether all instances have been restarted yet.
Verify Zotero is running on staging with a curl request:
curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://staging.svc.eqiad.wmnet:4969/web
helmfile -e codfw -i apply
Repeat for eqiad:
helmfile -e eqiad -i apply
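The staging → codfw → eqiad sequence can also be scripted. This is only a sketch: the run wrapper echoes the commands by default so it is safe to paste, and setting DRY_RUN to empty on the deployment server would execute the real helmfile:

```shell
# Apply the change to all three environments in order, staging first.
# By default this only echoes the commands (DRY_RUN=echo); set DRY_RUN=
# on the deployment server to actually run helmfile.
DRY_RUN=${DRY_RUN-echo}
run() { $DRY_RUN helmfile "$@"; }
for env in staging codfw eqiad; do
  run -e "$env" -i apply
done
```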
Verify Zotero is running with a curl request:
curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.svc.eqiad.wmnet:4969/web
curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.svc.codfw.wmnet:4969/web
curl -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.discovery.wmnet:4969/web
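Checking the three endpoints one at a time gets tedious; a loop like the following (same URLs and test document as above, with timeouts added so a dead endpoint fails fast) prints the HTTP status for each. This is a convenience sketch, not part of the official procedure:

```shell
# Probe each zotero endpoint and report its HTTP status code
# (curl prints 000 when it cannot connect at all).
report=""
for url in \
    https://zotero.svc.eqiad.wmnet:4969/web \
    https://zotero.svc.codfw.wmnet:4969/web \
    https://zotero.discovery.wmnet:4969/web; do
  code=$(curl -sk --connect-timeout 2 -m 5 -o /dev/null -w '%{http_code}' \
      -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' \
      -H 'Content-Type: text/plain' "$url" || true)
  report="${report}${url} ${code}
"
done
printf '%s' "$report"
```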
Logging has been disabled in 827891af because the logs were actively harmful (not merely useless) to our environment. If you need to chase down a request that caused an issue, Citoid logs may help: aside from monitoring, Citoid is the only service talking to zotero. Citoid logs are in Logstash.
Zotero is a single-threaded Node.js app. That means it has an event queue, but while it is doing something CPU-intensive (like parsing a large PDF) it is unable to serve that queue. It also does not offer any endpoint that could be used for a readinessProbe. Effectively, while a given replica is serving a large request it can't serve anything else AND it is not depooled from rotation, so requests still head its way. In the majority of cases this is OK: Citoid tolerates zotero requests timing out and falls back to its internal parser, and by the time zotero finishes the large request, the items queued behind it are USUALLY (judging from CPU graphs) far cheaper CPU-wise.
So when we get paged, it is one of the following scenarios:
- All (or most) replicas get some repeated request (a user submits the URL they want cited multiple times), end up CPU-pegged and unable to service requests, including monitoring, which times out 3 times and raises the alert.
- A replica gets a large request and becomes unable to serve more requests for a while; in a really unlucky situation monitoring requests flow its way, time out and raise the alert.
- Something in the spectrum between those two extremes.
An HTTP GET endpoint in zotero like /healthz, usable as a readinessProbe, would effectively mitigate the above and render most of this note moot. However, we have never invested in that.
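For reference, if such an endpoint existed, the chart could point a readinessProbe at it next to the existing liveness_probe stanza. The following is purely hypothetical — zotero has no /healthz today, and the exact chart key name is an assumption mirroring the liveness_probe shown earlier:

```yaml
main_app:
  readiness_probe:     # hypothetical: mirrors the liveness_probe stanza above
    httpGet:
      path: /healthz   # this endpoint does not exist in zotero
      port: 1969
```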
Rolling back changes
If you need to roll back a change because something went wrong:
- Revert the git commit to the deployment-charts repo
- Merge the revert (with review if needed)
- Wait one minute for the cron job to pull the change to the deployment server
ENV=<staging,eqiad,codfw> kube_env zotero $ENV; helmfile -e $ENV diff (to see what you'll be changing)
helmfile -e $ENV apply
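The revert itself is ordinary git. Here is a throwaway-repo sketch of the flow — in reality you would revert the merged commit in deployment-charts and push it to Gerrit for review:

```shell
# Demonstrate reverting a bad version bump in a scratch repo.
git init -q revert-demo && cd revert-demo
git config user.email 'demo@example.org'
git config user.name 'demo'
echo 'version: 2019-01-17-114541-candidate' > values.yaml
git add values.yaml && git commit -qm 'known-good version'
echo 'version: 2019-03-15-211530-candidate' > values.yaml
git commit -qam 'bump version (the bad change)'
git revert --no-edit HEAD        # creates a commit undoing the bump
cat values.yaml                  # prints: version: 2019-01-17-114541-candidate
```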
Rolling back in an emergency
This is discouraged but noted here for completeness. Only use it in really, truly big emergencies. If you are wondering whether a situation qualifies as a truly big emergency, it almost certainly does not. Reach out to more senior SREs and the service owner first.
If you can't wait the one minute, or the cron job that updates from git fails, etc., then it is possible to roll back manually using helm.
- Find the revision to roll back to
kube_env zotero <staging,eqiad,codfw>; helm history <production> --tiller-namespace YOUR_SERVICE_NAMESPACE
- e.g. perhaps the penultimate one
REVISION  UPDATED                   STATUS      CHART          DESCRIPTION
1         Tue Jun 18 08:39:20 2019  SUPERSEDED  termbox-0.0.2  Install complete
2         Wed Jun 19 08:20:42 2019  SUPERSEDED  termbox-0.0.3  Upgrade complete
3         Wed Jun 19 10:33:34 2019  SUPERSEDED  termbox-0.0.3  Upgrade complete
4         Tue Jul  9 14:21:39 2019  SUPERSEDED  termbox-0.0.3  Upgrade complete
- Rollback with:
kube_env zotero <staging,eqiad,codfw>; helm rollback <production> 3 --tiller-namespace YOUR_SERVICE_NAMESPACE
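If you just want "one revision before the latest", the number can be computed from the helm history table instead of read by eye. This sketch parses a captured copy of the output format shown above; on a deployment server you would pipe the real `helm history` output into awk instead:

```shell
# Compute the revision just before the latest from 'helm history' output.
history_table='REVISION  UPDATED                   STATUS      CHART          DESCRIPTION
1         Tue Jun 18 08:39:20 2019  SUPERSEDED  termbox-0.0.2  Install complete
2         Wed Jun 19 08:20:42 2019  SUPERSEDED  termbox-0.0.3  Upgrade complete
3         Wed Jun 19 10:33:34 2019  SUPERSEDED  termbox-0.0.3  Upgrade complete
4         Tue Jul  9 14:21:39 2019  SUPERSEDED  termbox-0.0.3  Upgrade complete'
# Skip the header, remember the last REVISION seen, print one less.
prev=$(printf '%s\n' "$history_table" | awk 'NR > 1 { last = $1 } END { print last - 1 }')
echo "$prev"    # prints 3
```

That value can then be passed to the rollback command above in place of the hand-picked revision number.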