You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Changeprop: Difference between revisions

From Wikitech-static
imported>Quiddity
(dummy edit to fix mis-display in Category:Pages with syntax highlighting errors? (purge didn't work))
imported>Tim Starling
(→‎Deploying: unholy sshpipe automation)
 
(7 intermediate revisions by 2 users not shown)
Line 1: Line 1:
'''changeprop''' (or '''Change Propagation''') is the name given to a service that processes change events generated by Mediawiki and stored in Kafka. Various actions are taken based on the messages read from [[Kafka]]. Common actions take the form of HTTP requests or CDN purges.
'''changeprop''' (or '''Change Propagation''') is the name given to a service that processes change events generated by MediaWiki and stored in Kafka. Various actions are taken based on the messages read from [[Kafka]]. Common actions take the form of HTTP requests or CDN purges.
 
=What it does=


== What it does ==
* Changeprop uses Kafka to guarantee delivery. We use the Apache Kafka message broker to attain ''at least once'' delivery semantics: once an event is in Kafka, we can be sure that it will be processed and that an event will follow. This allows us to build very long and complex sequences of dependencies without fear of losing events.
* Changeprop uses Kafka to guarantee delivery. We use the Apache Kafka message broker to attain ''at least once'' delivery semantics: once an event is in Kafka, we can be sure that it will be processed and that an event will follow. This allows us to build very long and complex sequences of dependencies without fear of losing events.
* Automatic retries with exponential delays, large job deduplication, and persistent error tracking via a dedicated error topic in Kafka
* Automatic retries with exponential delays, large job deduplication, and persistent error tracking via a dedicated error topic in Kafka
Line 8: Line 7:
* A fine-grained monitoring dashboard allows us to track rates and delays for individual topics, rates of event production and much more. Changeprop graphs can occasionally be used to discover bugs in other parts of the infrastructure around it.
* A fine-grained monitoring dashboard allows us to track rates and delays for individual topics, rates of event production and much more. Changeprop graphs can occasionally be used to discover bugs in other parts of the infrastructure around it.


=How it works=
== How it works ==
Changeprop reads events from [[Kafka]]. The topics changeprop reads from are defined in <code>config.yaml</code> - the <code>dc_name</code> variable is a prefix to the topic defined on a per-rule basis. So for example in eqiad for the mw_purge rule which uses the resource_change topic, the full topic will be <code>eqiad.resource_change</code>. Each rule specifies the topic to which it subscribes.  
Changeprop reads events from [[Kafka]]. The topics changeprop reads from are defined in <code>config.yaml</code> - the <code>dc_name</code> variable is a prefix to the topic defined on a per-rule basis. So for example in eqiad for the mw_purge rule which uses the resource_change topic, the full topic will be <code>eqiad.resource_change</code>. Each rule specifies the topic to which it subscribes.  


==Rules==
=== Rules ===
Each rule defines a list of cases to which it responds. General rule properties configure behaviour such as retries and delays.
Each rule defines a list of cases to which it responds. General rule properties configure behaviour such as retries and delays.


Line 17: Line 16:
Headers and other parameters can be defined for an exec section - see [https://github.com/wikimedia/change-propagation/blob/master/config.example.wikimedia.yaml the existing rules for details].  
Headers and other parameters can be defined for an exec section - see [https://github.com/wikimedia/change-propagation/blob/master/config.example.wikimedia.yaml the existing rules for details].  


==Service interactions==
=== Service interactions ===
Changeprop talks to [[Redis#Services|Redis]] to manage rate limiting and exclusion lists for problematic or high-traffic articles. All communication is done via [[Nutcracker]]. In Kubernetes, a local Nutcracker sidecar container runs within the changeprop pod, proxying access to a list of redis servers.
Changeprop talks to [[Redis#Services|Redis]] to manage rate limiting and exclusion lists for problematic or high-traffic articles. All communication is done via [[Nutcracker]]. In Kubernetes, a local Nutcracker sidecar container runs within the changeprop pod, proxying access to a list of redis servers.


Many of changeprop's operations are accomplished by sending HTTP requests to [[RESTBase]].  
Many of changeprop's operations are accomplished by sending HTTP requests to [[RESTBase]].  


=Where it runs=
== Where it runs ==
Changeprop currently runs in [[Kubernetes]] in codfw and eqiad. There is also an instance in the staging cluster that does not process prod traffic. In labs, changeprop runs in regular Docker on deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud.   
Changeprop currently runs in [[Kubernetes]] in codfw and eqiad. There is also an instance in the staging cluster that does not process prod traffic. In labs, changeprop runs in regular Docker on deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud.   


=Adding features=
== Adding features ==
==Adding a new rule==
 
=== Adding a new rule ===
#TODO  
#TODO  


=Deploying=
== Deploying ==
==To scb==
 
Changeprop has been removed from scb and cannot be redeployed there.
=== To Kubernetes ===
Changeprop uses the [[Kubernetes/Deployments]] workflow to deploy changes.


==To Kubernetes==
=== To deployment-prep ===
Depending on what needs to be changed in a Kubernetes deploy of changeprop, edits might need to take place in one of two locations - the Helmfile or the Helm chart. Whether the change is to the Helm chart itself, or the Helmfile that configures it, the deploy process to Kubernetes is the same.
In the Beta Cluster, Changeprop runs in Docker on deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud. The configuration passed to changeprop is generated by scripts in the <code>deployment-charts</code> repository, in order to use the same templates and avoid deviation. This means that if you want to change the configuration in beta/deployment-prep, you will first need to edit the configuration in <code>deployment-charts</code>. The values for deployment-prep are stored in the <code>values-beta.yaml</code> file.
===Applying changes===
For the purposes of this section <code>$env</code> means one of <code>eqiad</code>, <code>codfw</code> or <code>staging</code>. Once your changes have been reviewed and merged by giving a +2 to your change when no rebase is required:
* a user with root will need to ssh to deploy1001 and sudo to root
* cd to /srv/deployment-charts/helmfile.d/services/changeprop/.
* Do a <code>git log -n1</code> to ensure that your change has been merged and is present in the local checkout of the repo.
* Check the impact of your changes on configuration files etc by running <code>helmfile -e $env diff</code>
* If everything looks okay, run a <code>helmfile -e $env sync</code> and then monitor <code>kubectl get pods</code> to ensure everything comes back up healthy


===Helmfile changes===
===Generating the configuration===
For the purposes of this section we'll assume that all changes will be against the staging environment. Helmfile changes happen in the [https://phabricator.wikimedia.org/source/operations-deployment-charts/browse/master/helmfile.d/services/staging/changeprop/values.yaml helmfile.d section] of the deployment-charts repository. Typically a change to this section will relate to changing an existing configured value for deployed instances (ie: adding Kafka or Redis servers, changing the Varnish multicast IP address).  
In <code>deployment-charts</code>, cd to <code>charts/changeprop</code> and <code>./make_beta_config.py  .</code>. The output from this command will be the configuration to be deployed.


===Helm chart changes===
For example, to generate the changeprop configuration:
# Make your changes to the chart
# Bump the version flag in the changeprop/Chart.yaml file.
# Add the Chart.yaml file to your change for review


==To deployment-prep==
<pre>
Changeprop runs in Docker in deployment-prep on deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud. The configuration passed to changeprop is generated by scripts in the <code>deployment-charts</code> repository, in order to use the same templates and avoid deviation. This means that if you want to change the configuration in beta/deployment-prep, you will first need to edit the configuration in <code>deployment-charts</code>. The values for deployment-prep are stored in the <code>values-beta.yaml</code> file.
ssh deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud \
  'cd /srv/deployment-charts/charts/changeprop && ./make_beta_config.py . changeprop'
</pre>


===Generating the configuration===
To generate the jobqueue configuration:
In <code>deployment-charts</code>, cd to <code>charts/changeprop</code> and <code>./make_beta_config.py .</code>. The output from this command will be the configuration to be deployed.
 
<pre>
ssh deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud \
  'cd /srv/deployment-charts/charts/changeprop && ./make_beta_config.py . jobqueue'
</pre>


===Deploying the configuration===
===Deploying the configuration===
The configuration lives in a docker volume on deployment-docker-changeprop01.deployment-prep.eqiad1.wikimeria.cloud, named <code>changeprop</code>. Configuration needs to be edited within this volume. To edit, run <code>sudo docker run -it -v changeprop:/srv/changeprop alpine /bin/sh</code> and edit /srv/changeprop/config.yaml as required. Then run <code>service changeprop restart</code> to load the configuration. Files other than config.yaml in this volume will be ignored.  
The configuration is in config.yaml in a docker volume on deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud and deployment-docker-cpjobqueue01.deployment-prep.eqiad1.wikimedia.cloud, named <code>changeprop</code> and <code>cpjobqueue</code> respectively. Configuration needs to be edited within this volume. The host directory can be discovered using <code>docker volume inspect</code>.


=Testing=
Ensure that the config is world readable when copying in a new file. Then run <code>service changeprop restart</code> to load the configuration. Files other than config.yaml in this volume will be ignored.
 
For example, to generate and deploy the changeprop configuration, here is an unholy sshpipe monster you could use:
 
<pre>
ssh deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud \
  'cd /srv/deployment-charts/charts/changeprop && ./make_beta_config.py . changeprop' \
  | \
ssh deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud \
  sudo sh -xc \''cat > $(docker volume inspect changeprop -f {{.Mountpoint}})/config.yaml && systemctl restart changeprop'\'
</pre>
 
To generate and deploy the cpjobqueue configuration:
 
<pre>
ssh deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud \
  'cd /srv/deployment-charts/charts/changeprop && ./make_beta_config.py . jobqueue' \
  | \
ssh deployment-docker-cpjobqueue01.deployment-prep.eqiad1.wikimedia.cloud \
  sudo sh -xc \''cat > $(docker volume inspect cpjobqueue -f {{.Mountpoint}})/config.yaml && systemctl restart cpjobqueue'\'
</pre>
 
Ideally the docker volume would have been pre-created with a fixed host path.
 
== Testing ==
changeprop can be tested by issuing events to Kafka that changeprop will consume. An example test command against the resource_change topic for the k8s staging cluster is: <code>cat mw_purge_example.json | kafkacat -b localhost:9092 -p 0 -t 'staging.resource_change'</code>.  
changeprop can be tested by issuing events to Kafka that changeprop will consume. An example test command against the resource_change topic for the k8s staging cluster is: <code>cat mw_purge_example.json | kafkacat -b localhost:9092 -p 0 -t 'staging.resource_change'</code>.  


Line 73: Line 94:
</syntaxhighlight>
</syntaxhighlight>


=How to monitor it=
== How to monitor it ==
There is a [https://grafana.wikimedia.org/d/000000201/change-propagation?orgId=1&refresh=1m Grafana dashboard for Changeprop]. The various graphs provide information  about things such as rule execution rate and rule backlogs for each rule for various streams.
There is a [https://grafana.wikimedia.org/d/000000201/change-propagation?orgId=1&refresh=1m Grafana dashboard for Changeprop]. The various graphs provide information  about things such as rule execution rate and rule backlogs for each rule for various streams.


Rule backlog is the time between the creation of an event and the beginning of its processing. If the backlog grows over time, change propagation can't keep up with the event rate, and either concurrency should be increased or some other action taken. Backlogs can have occasional spikes, but steady backlog growth is a clear indication of a problem.
Rule backlog is the time between the creation of an event and the beginning of its processing. If the backlog grows over time, change propagation can't keep up with the event rate, and either concurrency should be increased or some other action taken. Backlogs can have occasional spikes, but steady backlog growth is a clear indication of a problem.


=Debugging=
== Debugging ==
==Querying configuration==
 
=== Querying configuration ===
Changeprop's configuration can be queried if you have access to deploy1001:
Changeprop's configuration can be queried if you have access to deploy1001:
# ssh to deploy1001.eqiad.wmnet
# ssh to the deploy server for the datacenter
# cd to the appropriate directory (for example /srv/deployment-charts/helmfile.d/services/staging/changeprop)
# cd to the appropriate directory (for example /srv/deployment-charts/helmfile.d/services/staging/changeprop)
# run <code>source .hfenv</code> to set up your environment
# run <code>kube_env changeprop $CLUSTER</code> to set up your Kubernetes environment
# show the configuration via <code>kubectl describe configmap changeprop-staging-base-config</code>
# show the configuration via <code>kubectl describe configmap changeprop-staging-base-config</code>


Line 95: Line 117:
This can be ignored as long as the occurrences aren't too close together (currently they happen roughly once every hour in staging); they will not interrupt normal operation of changeprop.
This can be ignored as long as the occurrences aren't too close together (currently they happen roughly once every hour in staging); they will not interrupt normal operation of changeprop.


=Where it lives=
== Where it lives ==
* Changeprop's code can be cloned from [[Gerrit]] at <code>ssh://gerrit.wikimedia.org:29418/mediawiki/services/change-propagation</code>. It can be [https://phabricator.wikimedia.org/diffusion/MSCP/ browsed in Phabricator].
* Changeprop's code can be cloned from [[Gerrit]] at <code>ssh://gerrit.wikimedia.org:29418/mediawiki/services/change-propagation</code>. It can be [https://phabricator.wikimedia.org/diffusion/MSCP/ browsed in Phabricator].
* Changeprop is deployed to Kubernetes as a [[Helm]] chart. It lives in the [https://phabricator.wikimedia.org/source/operations-deployment-charts/browse/master/charts/changeprop/ deployment-charts] repo.
* Changeprop is deployed to Kubernetes as a [[Helm]] chart. It lives in the [https://phabricator.wikimedia.org/source/operations-deployment-charts/browse/master/charts/changeprop/ deployment-charts] repo.
Line 101: Line 123:
* There is a per-environment Helmfile values file which overrides the defaults configured in the Helm chart's values file. [https://phabricator.wikimedia.org/source/operations-deployment-charts/browse/master/helmfile.d/services/staging/changeprop/values.yaml This] is the file for staging values, there are corresponding production files in the per-DC directories.  
* There is a per-environment Helmfile values file which overrides the defaults configured in the Helm chart's values file. [https://phabricator.wikimedia.org/source/operations-deployment-charts/browse/master/helmfile.d/services/staging/changeprop/values.yaml This] is the file for staging values, there are corresponding production files in the per-DC directories.  


=Related pages=
== See also ==
* Changeprop emerged out of the older and now decommissioned [[Event_Platform/EventBus]] system. This page is largely out of date and does not represent the current system. A more modern overview of the Event systems currently in use can be seen on [[Event*]]
* Changeprop emerged out of the older and now decommissioned [[Event_Platform/EventBus|EventBus]] system. This page is largely out of date and does not represent the current system. A more modern overview of the Event systems currently in use can be seen on [[Event*]]
* [[mw:Requests for comment/Requirements for change propagation]] ([[Phab:T102476|T102476]]) - RFC that describes the different approaches being explored in the development of Changeprop
* [[mw:Requests for comment/Requirements for change propagation]] ([[Phab:T102476|T102476]]) - RFC that describes the different approaches being explored in the development of Changeprop

Latest revision as of 01:06, 17 June 2022

changeprop (or Change Propagation) is the name given to a service that processes change events generated by MediaWiki and stored in Kafka. Various actions are taken based on the messages read from Kafka. Common actions take the form of HTTP requests or CDN purges.

What it does

  • Changeprop uses Kafka to guarantee delivery. We use the Apache Kafka message broker to attain at least once delivery semantics: once an event is in Kafka, we can be sure that it will be processed and that an event will follow. This allows us to build very long and complex sequences of dependencies without fear of losing events.
  • Automatic retries with exponential delays, large job deduplication, and persistent error tracking via a dedicated error topic in Kafka
  • The config system allows us to add simple update rules with only a few lines of YAML and without code changes or deploys
  • A fine-grained monitoring dashboard allows us to track rates and delays for individual topics, rates of event production and much more. Changeprop graphs can occasionally be used to discover bugs in other parts of the infrastructure around it.
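
The "automatic retries with exponential delays" item above can be illustrated with a minimal Python sketch. The base delay, growth factor and retry count are hypothetical values chosen for illustration; changeprop's real retry settings are defined per rule and are not described on this page.

```python
from typing import List


def retry_delays(base_delay_s: float, factor: float, max_retries: int) -> List[float]:
    """Return the wait before each retry, growing by `factor` every attempt.

    All three parameters are illustrative assumptions, not changeprop settings.
    """
    return [base_delay_s * factor ** attempt for attempt in range(max_retries)]


if __name__ == "__main__":
    # With a 60 second base delay that doubles each time, three retries
    # wait 60, 120 and 240 seconds respectively.
    print(retry_delays(60, 2, 3))
```

The point of the growth factor is that a briefly unavailable downstream service is retried quickly, while a persistently failing one is contacted less and less often.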

How it works

Changeprop reads events from Kafka. The topics changeprop reads from are defined in config.yaml - the dc_name variable is a prefix to the topic defined on a per-rule basis. So for example in eqiad for the mw_purge rule which uses the resource_change topic, the full topic will be eqiad.resource_change. Each rule specifies the topic to which it subscribes.
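
The topic naming described above can be sketched in a few lines of Python. The function name is hypothetical; only the "dc_name as a dot-separated prefix" behaviour comes from the text.

```python
def full_topic(dc_name: str, topic: str) -> str:
    """Build the Kafka topic a rule subscribes to from the datacenter prefix.

    `full_topic` is an illustrative helper, not part of changeprop itself.
    """
    return f"{dc_name}.{topic}"


if __name__ == "__main__":
    # The mw_purge rule uses the resource_change topic; in eqiad this becomes:
    print(full_topic("eqiad", "resource_change"))
```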

Rules

Each rule defines a list of cases to which it responds. General rule properties configure behaviour such as retries and delays.

The "match" section of a rule dictates a pattern to match, which can include URL matching and tag matching (for example, mw_purge events also contain "tags":["purge"] and will only match if both the tags and the URL match the pattern specified). URL match patterns are frequently used to target specific sites (for example, to have a rule apply only to Wiktionary) or classes of article. Matches can also be fine-tuned to exclude events using not_match. If the match is satisfied, the exec section is executed. The exec will generally be an HTTP request of a defined method to the specified URI. A rule can have multiple match and corresponding exec sections in its cases list: if the matches are mutually exclusive, a rule can act as a switch statement using the same topic and the same semantics but different matches. Headers and other parameters can be defined for an exec section - see the existing rules for details.
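
The switch-statement behaviour of a cases list can be sketched in Python as follows. This is a simplified model under stated assumptions: real rules are written in YAML and evaluated by changeprop itself, and the rule layout, the example.invalid URIs and the helper names below are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical rule with two mutually exclusive cases on the same topic.
RULE = {
    "topic": "resource_change",
    "cases": [
        {
            "match": {"uri": r"^https://[^/]+\.wiktionary\.org/wiki/.+$", "tags": ["purge"]},
            "exec": {"method": "post", "uri": "https://example.invalid/wiktionary"},
        },
        {
            "match": {"uri": r"^https://[^/]+\.wikipedia\.org/wiki/.+$", "tags": ["purge"]},
            "not_match": {"uri": r"/wiki/Draft:"},
            "exec": {"method": "post", "uri": "https://example.invalid/wikipedia"},
        },
    ],
}


def _matches(pattern: dict, event: dict) -> bool:
    """True if the event satisfies both the URI regex and every required tag."""
    uri = event.get("meta", {}).get("uri", "")
    if "uri" in pattern and not re.search(pattern["uri"], uri):
        return False
    return all(tag in event.get("tags", []) for tag in pattern.get("tags", []))


def select_exec(rule: dict, event: dict) -> Optional[dict]:
    """Return the exec section of the first case that applies, or None."""
    for case in rule["cases"]:
        if not _matches(case.get("match", {}), event):
            continue
        not_match = case.get("not_match")
        if not_match and _matches(not_match, event):
            continue  # explicitly excluded by not_match
        return case["exec"]
    return None
```

In this sketch a purge event for a French Wikipedia article selects the second exec, while a purge of a Draft: page or an event tagged only null_edit selects nothing.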

Service interactions

Changeprop talks to Redis to manage rate limiting and exclusion lists for problematic or high-traffic articles. All communication is done via Nutcracker. In Kubernetes, a local Nutcracker sidecar container runs within the changeprop pod, proxying access to a list of redis servers.

Many of changeprop's operations are accomplished by sending HTTP requests to RESTBase.

Where it runs

Changeprop currently runs in Kubernetes in codfw and eqiad. There is also an instance in the staging cluster that does not process prod traffic. In labs, changeprop runs in regular Docker on deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud.

Adding features

Adding a new rule

  1. TODO

Deploying

To Kubernetes

Changeprop uses the Kubernetes/Deployments workflow to deploy changes.

To deployment-prep

In the Beta Cluster, Changeprop runs in Docker on deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud. The configuration passed to changeprop is generated by scripts in the deployment-charts repository, in order to use the same templates and avoid deviation. This means that if you want to change the configuration in beta/deployment-prep, you will first need to edit the configuration in deployment-charts. The values for deployment-prep are stored in the values-beta.yaml file.

Generating the configuration

In deployment-charts, cd to charts/changeprop and run ./make_beta_config.py with . as its first argument. The output from this command will be the configuration to be deployed.

For example, to generate the changeprop configuration:

ssh deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud \
  'cd /srv/deployment-charts/charts/changeprop && ./make_beta_config.py . changeprop'

To generate the jobqueue configuration:

ssh deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud \
  'cd /srv/deployment-charts/charts/changeprop && ./make_beta_config.py . jobqueue'

Deploying the configuration

The configuration is in config.yaml in a docker volume on deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud and deployment-docker-cpjobqueue01.deployment-prep.eqiad1.wikimedia.cloud, named changeprop and cpjobqueue respectively. Configuration needs to be edited within this volume. The host directory can be discovered using docker volume inspect.

Ensure that the config is world readable when copying in a new file. Then run service changeprop restart to load the configuration. Files other than config.yaml in this volume will be ignored.
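
The "world readable" requirement above can be checked, and fixed if needed, with a short Python sketch. The helper names are hypothetical, and the path to pass in is whatever mountpoint docker volume inspect reports on the host.

```python
import os
import stat


def is_world_readable(path: str) -> bool:
    """True if the 'other' read bit is set on the file."""
    return bool(os.stat(path).st_mode & stat.S_IROTH)


def make_world_readable(path: str) -> None:
    """Add read permission for user, group and other, keeping existing bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```

A file copied in with mode 0600 becomes 0644 after make_world_readable, which is enough for the service inside the container to read it.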

For example, to generate and deploy the changeprop configuration, here is an unholy sshpipe monster you could use:

ssh deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud \
  'cd /srv/deployment-charts/charts/changeprop && ./make_beta_config.py . changeprop' \
  | \
ssh deployment-docker-changeprop01.deployment-prep.eqiad1.wikimedia.cloud \
  sudo sh -xc \''cat > $(docker volume inspect changeprop -f {{.Mountpoint}})/config.yaml && systemctl restart changeprop'\'

To generate and deploy the cpjobqueue configuration:

ssh deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud \
  'cd /srv/deployment-charts/charts/changeprop && ./make_beta_config.py . jobqueue' \
  | \
ssh deployment-docker-cpjobqueue01.deployment-prep.eqiad1.wikimedia.cloud \
  sudo sh -xc \''cat > $(docker volume inspect cpjobqueue -f {{.Mountpoint}})/config.yaml && systemctl restart cpjobqueue'\'

Ideally the docker volume would have been pre-created with a fixed host path.

Testing

changeprop can be tested by issuing events to Kafka that changeprop will consume. An example test command against the resource_change topic for the k8s staging cluster is: cat mw_purge_example.json | kafkacat -b localhost:9092 -p 0 -t 'staging.resource_change'.

All IDs in these examples are random UUIDs. Reusing a UUID between tests risks the event being seen as a duplicate and skipped. The "dt" field should also be changed to be close to the current date and time, as changeprop will not take action on older events.

mw_purge

{"$schema":"/resource_change/1.0.0","meta":{"dt": "2020-04-02T17:16:25Z", "uri":"https://en.wikipedia.org/wiki/Draft:Editta_Braun","id":"22350141-bbe2-488d-9f73-a1aa6094ac5c","domain":"en.wikipedia.org","stream":"resource_change"},"tags":["purge"]}

null_edit

{"$schema":"/resource_change/1.0.0","meta":{"uri":"https://fr.wikipedia.org/wiki/Oribiky","id":"b92d40b0-3206-469d-9615-2fbf61a04418","dt":"2020-04-02T17:16:28Z","domain":"fr.wikipedia.org","stream":"resource_change"},"tags":["null_edit"]}
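
Because each test event needs a fresh UUID and a current "dt", it can be convenient to generate the JSON rather than edit it by hand. The following Python sketch builds an event with the same fields as the examples above; the function name is hypothetical and the field set is copied from those examples rather than from the schema definition.

```python
import json
import uuid
from datetime import datetime, timezone


def make_resource_change_event(uri: str, domain: str, tag: str) -> str:
    """Return a resource_change test event as a single-line JSON string."""
    event = {
        "$schema": "/resource_change/1.0.0",
        "meta": {
            # Current UTC time, so changeprop does not discard the event as old.
            "dt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "uri": uri,
            # Fresh random UUID, so the event is not treated as a duplicate.
            "id": str(uuid.uuid4()),
            "domain": domain,
            "stream": "resource_change",
        },
        "tags": [tag],
    }
    return json.dumps(event)


if __name__ == "__main__":
    print(make_resource_change_event(
        "https://en.wikipedia.org/wiki/Draft:Editta_Braun", "en.wikipedia.org", "purge"
    ))
```

The printed line can be piped into the kafkacat command shown earlier in place of cat mw_purge_example.json.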

How to monitor it

There is a Grafana dashboard for Changeprop. The various graphs provide information about things such as rule execution rate and rule backlogs for each rule for various streams.

Rule backlog is the time between the creation of an event and the beginning of its processing. If the backlog grows over time, change propagation can't keep up with the event rate, and either concurrency should be increased or some other action taken. Backlogs can have occasional spikes, but steady backlog growth is a clear indication of a problem.
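
The backlog definition above amounts to a simple subtraction, sketched here in Python. The "steadily growing" check is a deliberately naive, hypothetical heuristic for telling sustained growth from a single spike; it is not how the Grafana dashboard computes anything.

```python
from datetime import datetime, timezone
from typing import List


def backlog_seconds(event_dt: str, processing_started: datetime) -> float:
    """Seconds between event creation (its "dt" field) and start of processing."""
    created = datetime.strptime(event_dt, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (processing_started - created).total_seconds()


def is_steadily_growing(samples: List[float]) -> bool:
    """True only if every backlog sample is larger than the one before it."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))
```

A series such as 1, 5, 2, 1 is a spike that recovers, whereas 1, 2, 3, 4 never recovers and would warrant investigation.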

Debugging

Querying configuration

Changeprop's configuration can be queried if you have access to deploy1001:

  1. ssh to the deploy server for the datacenter
  2. cd to the appropriate directory (for example /srv/deployment-charts/helmfile.d/services/staging/changeprop)
  3. run kube_env changeprop $CLUSTER to set up your Kubernetes environment
  4. show the configuration via kubectl describe configmap changeprop-staging-base-config

The suffixes nutcracker-config and metrics-config are also available as configmaps.

Non-issues

Periodically Changeprop will log a message along the lines of the following:

{"name":"change-propagation","hostname":"changeprop-staging-684b9ddbd-4wdkn","pid":141,"level":"ERROR","err":{"message":"Local: Broker transport failure","name":"changeprop-staging","stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/service/node_modules/node-rdkafka/lib/error.js:334:10)\n    at /srv/service/node_modules/node-rdkafka/lib/kafka-consumer.js:448:29","code":-195,"errno":-195,"origin":"kafka","rule_name":"page_create","executor":"RuleExecutor","levelPath":"error/consumer"},"msg":"Local: Broker transport failure","time":"2020-04-29T13:10:17.443Z","v":0}

This can be ignored as long as the occurrences aren't too close together (currently they happen roughly once every hour in staging); they will not interrupt normal operation of changeprop.
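
To judge whether these messages are "too close together", the gaps between them can be computed from the JSON log lines. The Python sketch below assumes log lines shaped like the example above (a "msg" field and a "time" field with millisecond precision); the function name is hypothetical, and what counts as too close is left to the reader.

```python
import json
from datetime import datetime
from typing import Iterable, List

FAILURE_MSG = "Local: Broker transport failure"


def transport_failure_gaps(log_lines: Iterable[str]) -> List[float]:
    """Return the seconds between consecutive broker transport failure logs."""
    times = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if isinstance(entry, dict) and entry.get("msg") == FAILURE_MSG:
            times.append(datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%S.%fZ"))
    times.sort()
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Gaps of around 3600 seconds match the roughly hourly pattern described above; much shorter gaps would be worth a closer look.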

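Where it lives

  • Changeprop's code can be cloned from Gerrit at ssh://gerrit.wikimedia.org:29418/mediawiki/services/change-propagation. It can be browsed in Phabricator.
  • Changeprop is deployed to Kubernetes as a Helm chart. It lives in the deployment-charts repo.
  • There is a per-environment Helmfile values file which overrides the defaults configured in the Helm chart's values file. The staging values file is helmfile.d/services/staging/changeprop/values.yaml in the deployment-charts repo; there are corresponding production files in the per-DC directories.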

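See also

  • Changeprop emerged out of the older and now decommissioned EventBus system. That page is largely out of date and does not represent the current system. A more modern overview of the Event systems currently in use can be seen on Event*
  • mw:Requests for comment/Requirements for change propagation (T102476) - RFC that describes the different approaches being explored in the development of Changeprop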