Event Platform/EventGate/Administration
EventGate is deployed using the WMF Helm & Kubernetes deployment pipeline. This page describes how to build and deploy EventGate services, and how to administer and debug EventGate in beta and production.
Our deployments of EventGate are done using the eventgate-wikimedia repository. This is an npm module that implements a WMF specific EventGate factory, and specifies EventGate as a dependency. It launches service-runner via the EventGate module with config provided here that sets eventgate_factory_module to eventgate-wikimedia.js.
Mediawiki Vagrant development
See: Event_Platform/EventGate#Development_in_Mediawiki_Vagrant
Beta / deployment-prep
Since we deploy Docker images to Kubernetes in production, we want to run these same images in beta. This is done by including the role::beta::docker_services class on a deployment-prep node via the Horizon Puppet Configuration interface. The configuration of the service and image is done by editing Hiera config in the same Horizon interface. deployment-eventgate-3 is a good example. The EventBus Mediawiki extension in beta is configured with $wgEventServices that point to these instances.
Production
Primary documentation for Kubernetes Deployments is here: Deployments_on_kubernetes
Production deployments of EventGate use WMF's Service Deployment Pipeline. Deploying new code and configuration to this pipeline currently has several steps. You should first be familiar with the various technologies and phases of this pipeline. Here's some reading material for ya!
- Deployment pipeline
- Deployment Pipeline Design (AKA Streamlined Service Delivery Design)
- Blubber - Dockerfile generator, ensures consistent Docker images.
- Helm - Manages deployment releases to Kubernetes clusters. Helm Charts describe e.g. Docker images and versions, service config templating, automated monitoring, metrics and logging, service replica scaling, etc.
- Kubernetes - Containerized cloud clusters made up of 'pods'. Each pod can run multiple containers.
Deployment Pipeline Overview
Here's a general overview of how a code and then a Helm chart change in EventGate makes it to production. Code changes require Docker image rebuilds, and eventgate Helm chart changes require a new chart version and release upgrade.
Each EventGate service is deployed via the same eventgate Helm chart. Each service runs in its own Kubernetes namespace and has a distinct release name. The services are configured and deployed using helmfile custom values files and commands.
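For orientation, a per-service helmfile directory typically contains a helmfile.yaml plus shared and per-cluster values files. A sketch (file names here are illustrative; check the actual directory for your service):
# On the deployment server
ls /srv/deployment-charts/helmfile.d/services/eventgate-analytics
# e.g. helmfile.yaml  values.yaml  values-eqiad.yaml  values-codfw.yaml  values-staging.yaml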
Current services (as of 2020-03)
- eventgate-main - Produces lower volume 'production' events to Kafka main-* clusters.
- eventgate-analytics - Produces high volume 'analytics' events to Kafka jumbo-eqiad cluster.
- eventgate-analytics-external - Produces medium volume client side 'analytics' events to Kafka jumbo-eqiad cluster.
- eventgate-logging-external - Produces client side error logs to the Kafka logging-* cluster for use in logstash.
In case you get confused, here are the Helm and Kubernetes terms for the eventgate-analytics service:
- main app (service) name: eventgate-analytics
- docker image name: eventgate-wikimedia (built from the eventgate-wikimedia gerrit repository)
- Helm chart: eventgate
- Helm release name: canary or production
- Kubernetes cluster name: staging, eqiad or codfw
- Kubernetes namespace: eventgate-analytics
In the eventgate-analytics service examples below, you will be deploying the eventgate-wikimedia Docker image via the eventgate Helm chart, applying values with helmfile.
There are 3 repositories that may need changes.
- EventGate - This is the generic pluggable library & service
- eventgate-wikimedia - Wikimedia specific implementation code and deployment pipeline Blubber files.
- deployment-charts - Helm charts and helmfile values, specifies configs for service deployment.
If you make a change to EventGate or eventgate-wikimedia, you must trigger a rebuild of the eventgate Docker image, then change the image version in the eventgate chart and deploy. If you just need to make a config or chart change, then you only need to build a new chart and deploy.
EventGate / eventgate-wikimedia Code Change
If this is an EventGate change, first push the change to the EventGate repository, then change the eventgate dependency SHA version in eventgate-wikimedia package.json.
1. Change is merged to eventgate-wikimedia. This will trigger a service-pipeline-build
2. Jenkins trigger-service-pipeline-test-and-publish is triggered and launches the service-pipeline-test-and-publish job.
3. Once service-pipeline-test-and-publish finishes, the image will be available in our Docker registry https://docker-registry.wikimedia.org. You can list existing image tags with:
curl https://docker-registry.wikimedia.org/v2/wikimedia/eventgate-wikimedia/tags/list
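The tags/list response is JSON, so you can filter it; for example with jq (a sketch, assuming jq is installed on the host):
# Print only the tag names, one per line
curl -s https://docker-registry.wikimedia.org/v2/wikimedia/eventgate-wikimedia/tags/list | jq -r '.tags[]' | sort | tail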
Once the image is available, we can upgrade the appropriate release(s) in Kubernetes clusters.
4. Edit the appropriate helm values.yaml file(s) in the deployment-charts repo, e.g. helmfile.d/services/eventgate-analytics/values.yaml, and update the image version. Merge this change. About a minute later, the updated values file will be pulled onto the deployment server.
5. Jump to deployment.eqiad.wmnet. Upgrade the eventgate-analytics service in Kubernetes and verify that it works. Again, to do this follow the instructions at Deployments_on_kubernetes#Code_deployment/configuration_changes.
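As a sketch of step 5 (see Deployments_on_kubernetes for the authoritative procedure; the environments here are illustrative):
# On deployment.eqiad.wmnet
cd /srv/deployment-charts/helmfile.d/services/eventgate-analytics
helmfile -e staging -i apply   # review the diff, confirm, then verify staging works
helmfile -e eqiad -i apply     # then upgrade each production cluster
helmfile -e codfw -i apply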
eventgate-wikimedia schema repository change
Most eventgate services at wikimedia use remote schema repositories, so they do not require an image rebuild and deploy to pick up a new schema. However, if you modify an existing schema version (hopefully you never have to do this), or if you need eventgate-main to use a new schema or schema version, you'll need an image rebuild and a deploy/restart of the eventgate service.
To bump the schema repo, edit eventgate-wikimedia/.pipeline/blubber.yaml and change the git SHA(s) associated with the schema repository you want to update.
builder:
  # Clone Wikimedia event schema repositories into /srv/service/schemas/event/*
  # If you update a schema repository, you'll need to update
  # the SHAs that are checked out here, and then rebuild docker images.
  command:
    - >-
      mkdir -p /srv/service/schemas/event &&
      git clone --single-branch -- https://gerrit.wikimedia.org/r/schemas/event/primary /srv/service/schemas/event/primary && cd /srv/service/schemas/event/primary && git reset --hard d725698 &&
      git clone --single-branch -- https://gerrit.wikimedia.org/r/schemas/event/secondary /srv/service/schemas/event/secondary && cd /srv/service/schemas/event/secondary && git reset --hard 7405981 # <-- change these SHAs
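To find the SHA to pin without cloning a schema repository, you can ask gerrit directly, e.g.:
# Print the SHA of the current master of the secondary schema repo
git ls-remote https://gerrit.wikimedia.org/r/schemas/event/secondary refs/heads/master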
Commit and merge this change. The deployment pipeline will automatically build a new Docker image version and post a comment on the Gerrit change with the image version.
Follow steps 4 and 5 above to deploy the change.
eventgate service values config change
Service specific configs are kept in values.yaml files inside of helmfile.d. To make a simple config value change, edit the appropriate service / cluster(s) values.yaml files, e.g. deployment-charts/helmfile.d/services/eventgate-analytics/values*.yaml. Commit and merge the change, wait up to 1 minute for the change to be synced on the deployment server, then follow the upgrade process described in step 5 above.
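Before applying, you can preview exactly what will change in the release. A sketch (helmfile's diff subcommand requires the helm-diff plugin; assuming it is available on the deployment server):
# Preview, then interactively apply, the config change
helmfile -e staging diff
helmfile -e staging -i apply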
eventgate chart change
To modify the Helm chart to e.g. change a template or default values, do the following:
1. Edit the eventgate chart in the deployment-charts repository.
2. Test locally in Minikube (more below).
3. Once satisfied, bump the chart version in Chart.yaml. (NOTE: The chart version is independent of the EventGate code version.)
4. Commit and submit the changes to gerrit for review. Once merged, the new chart release should show up at https://helm-charts.wikimedia.org/api/stable/charts/eventgate.
5. Follow the above instructions at Deployments_on_kubernetes#Code_deployment/configuration_changes to upgrade your service to the new deployment.
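For step 2, you can also sanity-check a chart change without a cluster at all, using standard Helm commands (nothing WMF-specific here):
# Check the chart for errors and render its templates locally
helm lint ./eventgate
helm template ./eventgate | less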
EventStreamConfig change
EventGate instances are configured to request stream configuration from the MediaWiki EventStreamConfig API, but the way they do so varies depending on configuration. For most 'production' instances, stream configuration is not often edited. To avoid runtime coupling, these production instances are configured to look up their pertinent stream configs only once, when the service starts. However, eventgate-wikimedia also supports 'dynamic' runtime stream config lookup, meaning that if a stream is being produced for which EventGate does not have stream configuration, it will attempt to look up that configuration from the remote EventStreamConfig API.
eventgate-analytics-external is meant for feature instrumentation, and has a higher rate of stream configuration changes. It is the only EventGate instance (as of 2020-08) that looks up event stream configuration at runtime.
To make a change to stream config, either to add a new stream or to change a setting:
1. Edit wgEventStreams in mediawiki-config InitialiseSettings.php. This might look like:
'resource-purge' => [
'schema_title' => 'resource_change',
'destination_event_service' => 'eventgate-main',
],
The stream config entry is keyed by stream name, and must minimally specify the schema_title setting (the title field of the event schemas that will be allowed in this stream) and the destination_event_service setting (the name of the EventGate service that is allowed to produce this event stream). Other stream config settings may be used by services other than EventGate (e.g. the EventLogging extension). Some default settings are set for all streams in wgEventStreamsDefaultSettings, but can be overridden for specific streams.
2. Merge and sync this change.
What happens next depends on whether the EventGate instance uses static or dynamic stream config:
3a. If this stream config change is for an EventGate instance that uses dynamic stream config, no action is needed; the new stream config will be automatically looked up when it is used.
3b. If this was a change for an EventGate that uses static stream config, you'll have to restart the pods to get them to look up the change.
See Event_Platform/EventGate/Administration#Recreate_all_k8s_pods_(AKA_full_service_restart)
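To check what stream config is actually live, you can query the EventStreamConfig API directly. A sketch (the streamconfigs action is provided by the EventStreamConfig extension; the exact parameters here are illustrative):
# Fetch the live config for a single stream
curl 'https://meta.wikimedia.org/w/api.php?action=streamconfigs&format=json&streams=resource-purge'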
Troubleshooting in production
All helmfile and kubectl commands below assume your CWD is a helmfile.d service directory on the deployment server, e.g. /srv/deployment-charts/helmfile.d/services/staging/eventgate-analytics
curl a specific pod
# Get pods, copy an IP address
kube_env eventgate-analytics staging; kubectl get pods -o wide
# curl the http (not https) port (usually 8192 for all eventgates)
curl 10.64.75.101:8192/v1/stream-configs
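You can also POST a test event directly to the pod to exercise the full produce path. A sketch, reusing the test.event example from the Minikube section below (this assumes the instance's stream config allows a test.event stream):
# POST a test event directly to a pod's http port
curl -v -H 'Content-Type: application/json' -d '{"$schema": "/test/event/0.0.2", "meta": {"stream": "test.event", "id": "12345678-1234-5678-1234-567812345678", "dt": "2019-01-01T00:00:00Z", "domain": "wikimedia.org"}, "test": "test value"}' 10.64.75.101:8192/v1/events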
Get detailed status of Helm release
See Migrating_from_scap-helm#Seeing_the_current_status
Upgrade a Helm release
See Migrating_from_scap-helm#Code_deployment/configuration_changes
Rollback to a previous Helm chart version
See Migrating_from_scap-helm#Rolling_back_changes
Targeting a specific release with Helmfile (e.g. canary)
helmfile -e eqiad --selector name=canary ...
List k8s pods and their k8s host nodes
kube_env eventgate-analytics eqiad; kubectl get pods -o wide
Recreate all k8s pods (AKA full service restart)
helmfile -e eqiad --state-values-set roll_restart=1 sync
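In another shell, you can watch the pods get recreated while the roll restart proceeds:
kube_env eventgate-analytics eqiad; kubectl get pods -w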
Delete a specific k8s pod
sudo -i; kube_env admin <CLUSTER>; kubectl -n <tiller_namespace> delete pod <pod_name>
(<tiller_namespace> is likely the service name, e.g. eventgate-main.)
Delete all k8s pods in a cluster
You shouldn't do this in production!
sudo -i; kube_env admin <CLUSTER>; kubectl delete pod -n <tiller_namespace> --all
(<tiller_namespace> is likely the service name, e.g. eventgate-main.)
Tail stdout/logs on all pods in a service
for pod in $(kube_env eventgate-analytics eqiad; kubectl get pods -o wide | grep eventgate | awk '{print $1}'); do kube_env eventgate-analytics eqiad; kubectl logs -f --since 1h -c $TILLER_NAMESPACE $pod & done | jq .
Tail stdout/logs on a specific k8s pod container
In staging (automatically using the single active pod id):
kube_env eventgate-analytics eqiad; kubectl logs -c $TILLER_NAMESPACE -f --since 60m $(kube_env eventgate-analytics eqiad; kubectl get pods -l app=$TILLER_NAMESPACE -o wide | tail -n 1 | awk '{print $1}') | jq .
For a specific pod:
kube_env eventgate-analytics eqiad; kubectl logs -c $TILLER_NAMESPACE -f --since 60m <pod_name> | jq .
Get a shell on a specific k8s pod container
In staging (automatically using the single active pod id):
kube_env eventgate-analytics eqiad; sudo KUBECONFIG=/etc/kubernetes/admin-staging.config kubectl exec -ti -n $TILLER_NAMESPACE -c $TILLER_NAMESPACE $(kube_env eventgate-analytics eqiad; kubectl get pods -l app=$TILLER_NAMESPACE -o wide | tail -n 1 | awk '{print $1}') bash
For a specific pod:
CLUSTER=eqiad  # or codfw
kube_env eventgate-analytics $CLUSTER; sudo KUBECONFIG=/etc/kubernetes/admin-$CLUSTER.config kubectl exec -ti -n $TILLER_NAMESPACE -c $TILLER_NAMESPACE <pod_name> bash
strace on a process in a specific pod container
First find the host node your pod is running on (see kubectl get pods above), then ssh into that node.
# Get the docker container id in your pod. This will be $1 in the output.
sudo docker ps | grep <pod_name> | grep nodejs
# now get the pid
sudo docker top <container_id> | grep '/usr/bin/node'
# strace it:
sudo strace -p <node_pid>
Or, all in one command (after finding your pod_name and logging into the k8s node):
pod_name=eventgate-analytics-7b6fbdf7b6-bmlh6
sudo strace -p $(sudo docker top $(sudo docker ps | grep $pod_name | grep nodejs | head -n 1 | awk '{print $1}') | grep /usr/bin/node | head -n 1 | awk '{print $2}')
Get a root shell on a specific k8s pod container
Again, find the node where your pod is running and log into that node. Then:
sudo docker exec -ti -u root $(sudo docker ps |grep <pod_name> | grep nodejs | tail -n 1 | awk '{print $1}') /bin/bash
Helm Chart Development
User:Alexandros_Kosiaris/Benchmarking_kubernetes_apps has some instructions on setting up Minikube and Helm for chart development and then benchmarking. This section provides some EventGate specific instructions.
EventGate Helm development environment setup
1. Install Minikube. Follow instructions at https://kubernetes.io/docs/tasks/tools/install-minikube/. Minikube is a local, virtualized, single-host Kubernetes cluster for development.
If Minikube is not started, you can start it with:
minikube start
You'll also need to turn on promiscuous mode so that the Kafka pod will work properly:
minikube ssh
sudo ip link set docker0 promisc on
exit
2. Install kubectl. Follow instructions on https://kubernetes.io/docs/tasks/tools/install-kubectl/
3. Install Helm. Follow instructions at https://docs.helm.sh/using_helm/#installing-helm. You will need to download the appropriate version for your OS and place it in the $PATH (or %PATH% if you are on Windows)
4. Install Blubber. Follow instructions at https://wikitech.wikimedia.org/wiki/Blubber/Download.
5. Use Minikube as your Docker host:
eval $(minikube docker-env)
6. clone the eventgate-wikimedia repository
git clone https://gerrit.wikimedia.org/r/eventgate-wikimedia
cd eventgate-wikimedia
7. Build a local eventgate-wikimedia development Docker image using Blubber:
blubber .pipeline/blubber.yaml development > Dockerfile && docker build -t eventgate-dev .
There are several variants in the blubber.yaml file. Here development is selected, and the Docker image is tagged with eventgate-dev.
8. If you don't already have it, clone the operations/deployment-charts repository.
git clone https://gerrit.wikimedia.org/r/operations/deployment-charts
9. Install the Kafka development Helm chart into Minikube:
cd deployment-charts/charts
helm install ./kafka-dev
This will install a Zookeeper and Kafka pod and keep it running.
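You can confirm the Kafka and Zookeeper containers are up before proceeding:
# Look for the kafka-dev pod with STATUS Running (the pod name will vary)
kubectl get pods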
10. Install a development chart release into Minikube:
helm install -n eventgate-dev --set main_app.image=eventgate-dev ./eventgate
11. Test that it works:
# Consume from the Kafka test event topic
kafkacat -C -b $(minikube ip):30092 -t datacenter1.test.event
# In another shell, define a handy service alias:
alias service="echo $(minikube ip):$(kubectl get svc --namespace default eventgate-development -o jsonpath='{.spec.ports[0].nodePort}')"
# POST to the eventgate-development service in Minikube
curl -v -H 'Content-Type: application/json' -d '{"$schema": "/test/event/0.0.2", "meta": {"stream": "test.event", "id": "12345678-1234-5678-1234-567812345678", "dt": "2019-01-01T00:00:00Z", "domain": "wikimedia.org"}, "test": "specific test value"}' $(service)/v1/events
You should see some output from curl like:
...
< HTTP/1.1 201 All 1 out of 1 events were accepted.
...
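If the event does not appear in the kafkacat consumer, you can check that the broker is reachable and the topic exists (kafkacat -L lists broker and topic metadata):
kafkacat -L -b $(minikube ip):30092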
12. Now that the development release is running, you can make local changes to it and re-deploy those changes in Minikube:
helm delete --purge eventgate-dev && helm install -n eventgate-dev --set main_app.image=eventgate-dev ./eventgate
Benchmarking
Benchmarking of EventGate was done during the initial estimation of its production k8s pod requirements, following User:Alexandros_Kosiaris/Benchmarking_kubernetes_apps. The initial results are not documented, but a phabricator comment indicates that a single instance (with certain resource settings) can handle around 1800 events per second.