Zotero/Deploying zotero

From Wikitech-static
Figure: Zotero request flow (File:Zotero request flow.png)

These are directions for deploying zotero, a nodejs service. More detailed directions that apply to nodejs services in general are at Migrating_from_scap-helm#Code_deployment/configuration_changes.

Locate build candidate

In gerrit, locate the candidate build. PipelineBot will post a message with the build name, for example:

PipelineBot
Mar 15 9:22 PM

Patch Set 4:

Wikimedia Pipeline
Image Build SUCCESS

IMAGE:
 docker-registry.discovery.wmnet/wikimedia/mediawiki-services-zotero

TAGS:
 2019-03-15-211530-candidate, 950e3b4468f2f84d3bb2b0343c

'2019-03-15-211530-candidate' is the name of the build.
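If the PipelineBot message is copied into a file or the clipboard, the build name can be pulled out mechanically. The following is a minimal sketch, not part of the official tooling: the function name extract_candidate_tag is hypothetical, and it assumes candidate tags always follow the date-time-candidate pattern shown above.

```shell
# Hypothetical helper: read a PipelineBot message on standard input and
# print the first tag that looks like YYYY-MM-DD-HHMMSS-candidate.
extract_candidate_tag() {
  grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{6}-candidate' | head -n 1
}
```

For example, extract_candidate_tag < message.txt prints the build name if message.txt holds the text of the bot comment.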

Add change via gerrit

  1. Clone deployment-charts repo.
  2. Edit values.yaml (for example with vi values.yaml). The relevant section looks like this:
main_app:
  image: wikimedia/mediawiki-services-zotero
  limits:
    cpu: 10
    memory: 4Gi
  liveness_probe:
    tcpSocket:
      port: 1969
  port: 1969
  requests:
    cpu: 200m
    memory: 200Mi
  version: 2019-01-17-114541-candidate-change-me
  3. Make a CR to change the version value for all three servers (staging, codfw, and eqiad), and after a successful review, merge it.
  4. After the merge, log into a deployment server. A cron job that runs every minute updates the /srv/deployment-charts directory with the contents from git.
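The version bump in the values file can also be scripted instead of edited by hand. This is a sketch under assumptions: set_version is a hypothetical helper, it assumes the file contains a single version: key as in the excerpt above, and the check that the tag ends in -candidate is an illustrative safeguard rather than a rule of the pipeline.

```shell
# Hypothetical helper: set_version FILE TAG
# Rewrites the value of every "version:" line in FILE to TAG,
# keeping the original indentation.
set_version() {
  # Refuse anything that does not look like a pipeline candidate tag.
  case "$2" in
    *-candidate) ;;
    *) echo "unexpected tag: $2" >&2; return 1 ;;
  esac
  # Write to a temporary file first so a failed sed leaves FILE untouched.
  sed -E "s/^([[:space:]]*version:[[:space:]]*).*/\1$2/" "$1" > "$1.new" &&
    mv "$1.new" "$1"
}
```

Calling set_version values.yaml 2019-03-15-211530-candidate once per values file changes only the version line, so the resulting diff in the CR stays easy to review.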

Log into deployment server

SSH into the deployment server.

ssh deployment.eqiad.wmnet

Navigate to /srv/deployment-charts/helmfile.d/services/

Zotero runs on two of the available server farms, codfw and eqiad. There is also a staging server. You can test out changes on the staging server first.

Staging server

cd /srv/deployment-charts/helmfile.d/services/zotero
helmfile -e staging -i apply

helmfile apply may take a while.

This checks status again so you can see if all instances have been restarted yet.

Verify Zotero is running on staging with a curl request:

curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://staging.svc.eqiad.wmnet:4969/web

Production server

helmfile -e codfw -i apply

Repeat for eqiad:

helmfile -e eqiad -i apply

Verify

Verify Zotero is running with a curl request:

curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.svc.eqiad.wmnet:4969/web

curl -k -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.svc.codfw.wmnet:4969/web

curl -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' -H 'Content-Type: text/plain' https://zotero.discovery.wmnet:4969/web
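The three checks above can be wrapped in a small loop that prints only the HTTP status per endpoint. This is a sketch, not an official check: the function names are hypothetical, treating status 200 as success is an assumption, the 30 second timeout is an arbitrary choice, and -k is passed for every host although the discovery command above omits it.

```shell
# Hypothetical helper: build the /web URL for a host (port 4969 as above).
zotero_web_url() {
  printf 'https://%s:4969/web\n' "$1"
}

# Hypothetical helper: POST the test URL to a host and print only the
# HTTP status code of the response.
zotero_status() {
  curl -k -s -o /dev/null -w '%{http_code}' --max-time 30 \
    -H 'Content-Type: text/plain' \
    -d 'http://www.nytimes.com/2018/06/11/technology/net-neutrality-repeal.html' \
    "$(zotero_web_url "$1")"
}

# Hypothetical helper: check all three endpoints and return non-zero if
# any of them answers with something other than 200 (assumed to mean OK).
verify_zotero() {
  rc=0
  for host in zotero.svc.eqiad.wmnet zotero.svc.codfw.wmnet zotero.discovery.wmnet; do
    code=$(zotero_status "$host")
    if [ "$code" = "200" ]; then
      echo "OK   $host"
    else
      echo "FAIL $host ($code)"
      rc=1
    fi
  done
  return "$rc"
}
```

Running verify_zotero on the deployment server after the apply gives a one-line result per endpoint. The individual curl commands above remain the better tool for inspecting the returned citation data.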

Logs

Logs were disabled in 827891af because they were useless and actively harmful to our environment. If you need to chase down a request that caused an issue, Citoid logs might be helpful: aside from monitoring, Citoid is the only service talking to zotero. Citoid logs are in Logstash.

Notes

Zotero is a single-threaded nodejs app. That means it has an event queue, but whenever it is doing something CPU-intensive (like parsing a large PDF) it is unable to serve the queue. It also does not offer any endpoint that can be used for a readinessProbe. That effectively means that while a given replica is serving a large request it can't serve anything else AND it is not depooled from the rotation, so requests will still head its way. In the majority of cases, that's ok. Citoid is usually ok with zotero requests timing out and falls back to its internal parser. Plus, once zotero is done with the large request it will serve node's event queue, which USUALLY (judging from CPU graphs) will be way cheaper CPU-wise.

So what happens when we get paged is one of the following scenarios:

  1. All or most replicas get some repeated request (a user submits the URL they want cited multiple times) and end up CPU pegged and unable to service requests, including monitoring, which times out 3 times and raises the alert.
  2. A single replica gets a large request and becomes unable to serve more requests for some time. In a really unlucky situation the monitoring requests flow its way, time out, and raise the alert.
  3. Something in the spectrum between those 2 extremes.

An HTTP GET endpoint in zotero like /healthz would effectively mitigate the above and render most of this note moot. However, we have never invested in that.

Rolling back changes

If you need to roll back a change because something went wrong:

  1. Revert the git commit to the deployment-charts repo
  2. Merge the revert (with review if needed)
  3. Wait one minute for the cron job to pull the change to the deployment server
  4. Run ENV=<staging,eqiad,codfw>; kube_env zotero $ENV; helmfile -e $ENV diff to see what you'll be changing (the semicolon after the assignment matters: without it, $ENV is expanded before it is set)
  5. Run helmfile -e $ENV apply

Rolling back in an emergency

If you can't wait the one minute, or the cron job that updates from git fails, it is possible to roll back manually using helm.

  1. Find the revision to roll back to
    1. Run kube_env zotero <staging,eqiad,codfw>; helm history <production> --tiller-namespace YOUR_SERVICE_NAMESPACE
    2. Pick the revision to roll back to from the output, e.g. perhaps the penultimate one (the sample output below comes from the termbox service):
      REVISION        UPDATED                         STATUS          CHART           DESCRIPTION     
      1               Tue Jun 18 08:39:20 2019        SUPERSEDED      termbox-0.0.2   Install complete
      2               Wed Jun 19 08:20:42 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
      3               Wed Jun 19 10:33:34 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
      4               Tue Jul  9 14:21:39 2019        SUPERSEDED      termbox-0.0.3   Upgrade complete
      
  2. Roll back with: kube_env zotero <staging,eqiad,codfw>; helm rollback <production> 3 --tiller-namespace YOUR_SERVICE_NAMESPACE
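
Picking the penultimate revision out of the helm history table can be done with a short awk filter. This is a minimal sketch: penultimate_revision is a hypothetical helper, and it assumes the table layout shown above, with the revision number in the first column and rows in ascending order.

```shell
# Hypothetical helper: read `helm history` output on standard input and
# print the second-to-last revision number. Fails if fewer than two
# revisions are listed.
penultimate_revision() {
  awk '$1 ~ /^[0-9]+$/ { prev = last; last = $1; n++ }
       END { if (n >= 2) print prev; else exit 1 }'
}
```

Piping the helm history command from step 1 into penultimate_revision prints the number to pass to helm rollback. Whether the penultimate revision is actually the right target still needs a human decision, so treat the output as a suggestion.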