Envoy
What is Envoy proxy
Envoy (GitHub) is an L7 proxy and communication bus designed for large modern service-oriented architectures. It provides several reverse-proxy features, including but not limited to:
- HTTP2 support.
- L3/L4 filter architecture, so it can be used for TLS termination, traffic mirroring, and other use cases.
- Good observability and tracing, supporting statsd, Zipkin, etc.
- Rate limiting and circuit breaker support.
- dynamic configuration through the xDS protocol.
- service discovery.
- gRPC, Redis, MongoDB proxy support.
Envoy at WMF
There are several use cases for envoy at WMF:
- Act as a TLS terminator / proxy for internal services. This is done for services:
  - in the deployment pipeline (via the TLS helpers in the deployment charts), where it works as a sidecar container to the service if TLS is enabled for the specific chart.
  - not in the pipeline, using profile::tlsproxy::envoy
- Act as a local proxy to other services for MediaWiki (for now), via profile::services_proxy::envoy
- Act as a gateway for external API requests, see API Gateway and REST Gateway .
TLS termination
If you want to add TLS termination to a new deployment chart, just use the scaffold script - it will create your starting chart with tls termination primitives already in place. If you want to add TLS termination to an existing chart, you just have to:
- Enable mesh.configuration, mesh.name, mesh.service and mesh.networkpolicy using sextant.
- Add the relevant templating for the mesh modules in your chart
See https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1039247 for an example of the logic of adding the mesh modules.
If you want to add TLS termination to a service in puppet, include profile::tlsproxy::envoy in its role in puppet, and add the hiera configuration following the suggestions in the class documentation.
Services Proxy
The services proxy is installed on all servers that run MediaWiki, and exposes the configured services via HTTP on localhost:<PORT> . Some endpoints might also require a specific Host header.
The services proxy offers:
- Persistent connections
- Advanced TLS tunneling (envoy supports TLS 1.3)
- Retry logic
- Circuit breaking (still not implemented)
- Header rewriting
- Telemetry for all backends
- Tracing (still not implemented)
- Precise timeouts (microsecond resolution)
See intro presentation (restricted access).
Add a new service (listener)
The current services are defined in hieradata/common/profile/services_proxy/envoy.yaml .
You can define your proxy to point to any valid DNS record, which will be re-resolved periodically. This means it also works with discovery records in DNS.
To add a new service you just need to add an entry to that list. A basic example may look like:
- name: mathoid
  port: 6013
  timeout: "5s"
  service: mathoid
  keepalive: "4.5s"
  retry:
    retry_on: "5xx"
    num_retries: 1
Please refer to the class documentation in puppet for details: modules/profile/manifests/services_proxy/envoy.pp .
Note: If you are adding a listener for a service that uses Kubernetes/Ingress, be sure to include sets_sni: true in your listener list entry.
Use a listener
To make use of a configured listener, it needs to be enabled for your host or within your kubernetes helm chart.
For hosts:
- Include profile::services_proxy::envoy in your puppet role
- Add the listener(s) you would like to enable in hiera key profile::services_proxy::envoy::enabled_listeners (like this example for MW installations)
For kubernetes:
- Include common_templates/0.2/_tls_helpers.tpl in your helm chart (you probably already have, this comes with the default scaffold)
- Add the listener(s) you would like to enable in helm key .Values.discovery.listeners
You then need to configure the application to use http://localhost:<listener_port> to connect to the upstream service via the envoy listener.
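As a rough illustration of that last step, the sketch below builds the local URL for an enabled listener. The `ENABLED_LISTENERS` mapping and the `local_url` helper are hypothetical names introduced here; only the mathoid port (6013) is taken from the example listener entry earlier on this page.

```python
# Hypothetical mapping of enabled listener names to their local ports.
# Only mathoid/6013 comes from the example entry on this page.
ENABLED_LISTENERS = {"mathoid": 6013}


def local_url(listener_name: str, path: str = "/") -> str:
    """Build the plain-HTTP URL an application uses to reach a listener."""
    port = ENABLED_LISTENERS[listener_name]  # KeyError if the listener is not enabled
    if not path.startswith("/"):
        path = "/" + path
    return f"http://localhost:{port}{path}"


print(local_url("mathoid"))  # → http://localhost:6013/
```

Keeping the listener ports in one place means application code never hard-codes upstream hostnames; TLS, retries and timeouts stay in the envoy configuration.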
Example (calling mw-api)
To call the MediaWiki API from your application, add the "mwapi-async" listener as described above and send your requests to http://localhost:6500 . Because the request now goes to localhost, you need to set a proper Host header to reach the wiki you want:
import requests

def getPageDict(title: str, wiki_id: str, api_url: str) -> dict:
    [...]
    # This will only work for wikipedias, but it's just an example
    mwapi_host = "{0}.wikipedia.org".format(
        wiki_id.replace("wiki", "").replace("_", "-")
    )
    headers = {
        "User-Agent": "mwaddlink",
        "Host": mwapi_host,
    }
    req = requests.get(api_url, headers=headers, params=params)
    [...]

getPageDict(page_title, wiki_id, "http://localhost:6500/w/api.php")
Please note: wikipedia.org, wikidata.org, and wikimedia.org hosts all use MediaWiki, and one might expect them to use one of the mw-api-* envoy listeners. However, it is important to take note of the actual service that serves the endpoint you are trying to access. For example, in the table below, although the language_pairs and pageviews endpoints have wikimedia.org as part of their host header, they use different envoy listeners:
| endpoint name | endpoint host header | endpoint URI | envoy listener |
|---|---|---|---|
| language_pairs | cxserver.wikimedia.org | http://localhost:6015/v1/languagepairs | cxserver |
| pageviews | wikimedia.org | http://localhost:6033/wikimedia.org/v1/metrics/pageviews | rest-gateway |
| wikipedia | {source}.wikipedia.org | http://localhost:6500/w/api.php | mw-api-int-async-ro |
| wikidata | www.wikidata.org | http://localhost:6500/w/api.php | mw-api-int-async-ro |
| event_logger | intake-analytics.wikimedia.org | http://localhost:6004/v1/events?hasty=true | eventgate-analytics |
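One way to keep these pairings straight in application code is to store them as data. The sketch below copies the rows of the table into a lookup and returns the URL and Host header to hand to an HTTP client. The `Endpoint` type and `request_args` helper are hypothetical names, not an existing library.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    host_header: str  # value for the Host header; may contain a {source} placeholder
    url: str          # local envoy listener URL
    listener: str     # name of the envoy listener serving it


# Values copied from the table above.
ENDPOINTS = {
    "language_pairs": Endpoint("cxserver.wikimedia.org", "http://localhost:6015/v1/languagepairs", "cxserver"),
    "pageviews": Endpoint("wikimedia.org", "http://localhost:6033/wikimedia.org/v1/metrics/pageviews", "rest-gateway"),
    "wikipedia": Endpoint("{source}.wikipedia.org", "http://localhost:6500/w/api.php", "mw-api-int-async-ro"),
    "wikidata": Endpoint("www.wikidata.org", "http://localhost:6500/w/api.php", "mw-api-int-async-ro"),
    "event_logger": Endpoint("intake-analytics.wikimedia.org", "http://localhost:6004/v1/events?hasty=true", "eventgate-analytics"),
}


def request_args(endpoint_name: str, **placeholders: str) -> dict:
    """Return the url and headers to pass to an HTTP client for an endpoint."""
    endpoint = ENDPOINTS[endpoint_name]
    host = endpoint.host_header.format(**placeholders)
    return {"url": endpoint.url, "headers": {"Host": host}}


print(request_args("wikipedia", source="en"))
```

The result can be passed straight to a client, for example `requests.get(**request_args("wikipedia", source="en"), params=...)`.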
Runtime configuration
Envoy allows you to change parts of its configuration at runtime, using the administration interface. You will find that exposed via localhost:9631 on instances and localhost:1666 or /var/run/envoy/admin.sock in kubernetes pods.
The following example increases the log level of the http logger to debug and configures the logger for the mwapi-async listener to log all requests (instead of just errors) in a format that is similar, but not identical, to the Apache combined log format (see https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage#config-access-log and https://blog.getambassador.io/understanding-envoy-proxy-and-ambassador-http-access-logs-fee7802a2ec5 ).
curl -XPOST localhost:1666/logging?http=debug
curl -XPOST localhost:1666/runtime_modify?mwapi-async_min_log_code=200
curl -XPOST --unix-socket /var/run/envoy/admin.sock http://localhost/logging?http=debug
For easier access to the port inside of kubernetes pods/containers, use nsenter on the kubernetes node the container runs on or take a look at k8sh .
From a kubernetes host you can do the following to find the socket path and then use curl (i.e. without nsenter):
- docker ps # find the container id (first column)
- docker inspect <id> --format '{{.GraphDriver.Data.MergedDir}}'
- cd to the directory above
- curl -XPOST --unix-socket run/envoy/admin.sock http://localhost/logging?http=debug
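If you prefer scripting the admin calls over typing curl commands, the same two TCP requests can be sketched with the Python standard library. This is a minimal sketch: `admin_request` is a hypothetical helper, it only covers the TCP port (not the unix socket), and it builds the requests without sending them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Admin port inside kubernetes pods, as described in this section.
ADMIN_BASE = "http://localhost:1666"


def admin_request(path: str, **params: str) -> Request:
    """Build (but do not send) a POST request to the envoy admin interface."""
    query = urlencode(params)
    url = f"{ADMIN_BASE}/{path.lstrip('/')}"
    if query:
        url += "?" + query
    return Request(url, method="POST")


# Same calls as the first two curl commands in this section.
debug_logging = admin_request("logging", http="debug")
log_all = admin_request("runtime_modify", **{"mwapi-async_min_log_code": "200"})

print(debug_logging.get_method(), debug_logging.full_url)
print(log_all.get_method(), log_all.full_url)
# To actually send one of them: urllib.request.urlopen(debug_logging)
```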
Telemetry
Envoy telemetry data is already embedded in many service dashboards on Grafana.wikimedia.org. For generic dashboards, go to:
- https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1
- https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1
Building and updating Envoy for WMF
Start by choosing a target version, and review the release notes for the intermediate versions.
- If you're jumping forward more than one minor version, it's generally sufficient (but use the release dates to check) to read the notes for each new minor release, and for each patch release in the last minor version. Example: In going from 1.11.0 to 1.14.2, you'd review 1.12.0, 1.13.0, 1.14.0, 1.14.1, 1.14.2.
- Check for any backwards-incompatible behavior changes that affect our Envoy use cases, any configuration changes (such as deprecated fields) that affect our configs, and any stats changes (such as renamed metrics) that affect our dashboards.
- Most of the release notes are for features we don't use, so it's quick to rule out any impact; only a few notes will need close attention, but you have to find them in the long list.
- Note that we have a number of Envoy configs and config templates, spread out in both the puppet and deployment-charts repos. When checking for a config key, don't forget to search both.
- Deprecated features and config fields can be addressed after the upgrade, as long as they're still supported (logged warnings are OK), but you should handle them right after the upgrade so they don't block the next one.
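The rule of thumb for which release notes to read can be sketched as a small filter: keep every new minor release (x.y.0) after the current version, plus every patch release in the target minor version. `notes_to_review` is a hypothetical helper and the release list is illustrative rather than an authoritative list of Envoy releases; as noted above, still use the release dates as a cross-check.

```python
def parse(version: str) -> tuple:
    """Turn '1.14.2' into (1, 14, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def notes_to_review(current: str, target: str, releases: list) -> list:
    """Select the releases whose notes should be read for current -> target."""
    cur, tgt = parse(current), parse(target)
    selected = []
    for release in sorted(releases, key=parse):
        ver = parse(release)
        if not (cur < ver <= tgt):
            continue  # already running it, or beyond the target
        is_new_minor = ver[2] == 0
        in_target_minor = ver[:2] == tgt[:2]
        if is_new_minor or in_target_minor:
            selected.append(release)
    return selected


# Illustrative release list (assumption, not taken from upstream).
releases = ["1.11.0", "1.11.1", "1.11.2", "1.12.0", "1.12.1", "1.13.0",
            "1.13.1", "1.14.0", "1.14.1", "1.14.2", "1.14.3"]
print(notes_to_review("1.11.0", "1.14.2", releases))
# → ['1.12.0', '1.13.0', '1.14.0', '1.14.1', '1.14.2']
```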
Prepare a new version
The Envoy project does offer Debian packages, but they don't meet our needs. For example, they don't include the hot restarter or systemd unit which we need for non-Kubernetes environments. (They also create the system user envoyproxy rather than envoy, which would be a more difficult transition than it's worth.)
As a result, we build our own packages, but we do use their prebuilt binaries rather than compiling from source.
The operations/debs/envoyproxy repository includes just the debian control files (starting from envoy version 1.23). Part of the build process is to download the release tarball and verify its sha512 hash against what upstream provides (a trusted source for the public key of their signature could not be found).
Because the build downloads the tarball, you will need to set HTTP proxy variables for internet access on the build host.
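For reference, a sha512 check of a downloaded file looks roughly like the sketch below. This is not the code used by the package build; `sha512_matches` is a hypothetical helper, shown only to make clear what the verification does.

```python
import hashlib
import hmac


def sha512_matches(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's sha512 digest equals the expected hex string."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        # Read in chunks so large tarballs do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return hmac.compare_digest(digest.hexdigest(), expected_hex.strip().lower())


# Usage (both values are placeholders you would supply yourself):
#   sha512_matches("path/to/release.tar.gz", "<hash published by upstream>")
```

Note that a matching hash only shows the download is the file upstream published the hash for; without a trusted signature it does not prove who produced it.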
The general process to follow is:
- Check out operations/debs/envoyproxy on your workstation
- Decide if you want to update an existing version (switch to the corresponding vX.Y branch) or add a new version (create a new vX.Y branch based off of the latest one)
- Create a patch to bump the debian changelog
export NEW_VERSION=1.26.1 # envoy version you want to package
dch -v ${NEW_VERSION?}-1 -D bullseye-wikimedia "Update to v${NEW_VERSION?}"
git commit debian/changelog
# If adding a new minor version, create a new branch based on the previous one
ssh -p 29418 gerrit.wikimedia.org gerrit create-branch operations/debs/envoyproxy v1.26 v1.23
# Make sure to submit the patch to the correct branch
git review vX.Y
- Merge
- Check out operations/debs/envoyproxy on the build host
- Build the packages:
git checkout vX.Y
# Ensure you allow networking in pbuilder
# This option needs to be in the file, an environment variable will *not* work!
echo "USENETWORK=yes" >> ~/.pbuilderrc
# Build the package
https_proxy=http://webproxy.$(hostname -d):8080 DIST=bullseye pdebuild
Import to envoy-future with reprepro
Use the envoy-future component to test the new version before deploying it everywhere. It only exists for bullseye currently.
# On apt1002, copy the packages from the build host
rsync -vaz build2001.codfw.wmnet::pbuilder-result/bullseye-amd64/envoyproxy*${PACKAGE_VERSION?}* .
sudo -i reprepro -C component/envoy-future include bullseye-wikimedia $HOME/envoyproxy_${PACKAGE_VERSION?}_amd64.changes
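The names used in the steps above all derive from the Envoy version. The sketch below spells that out; `packaging_names` is a hypothetical helper, and it assumes that PACKAGE_VERSION in the rsync and reprepro commands is the Envoy version plus the Debian revision (for example 1.26.1-1), matching the version passed to dch.

```python
def packaging_names(envoy_version: str, debian_revision: int = 1) -> dict:
    """Derive the branch, package version and .changes file name for a release."""
    major, minor, _patch = envoy_version.split(".")
    # Assumption: PACKAGE_VERSION is "<envoy version>-<debian revision>".
    package_version = f"{envoy_version}-{debian_revision}"
    return {
        "branch": f"v{major}.{minor}",
        "package_version": package_version,
        "changes_file": f"envoyproxy_{package_version}_amd64.changes",
    }


print(packaging_names("1.26.1"))
```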
Build the envoy-future Docker image
The envoy-future Dockerfile installs Envoy from the envoy-future component of our APT repository. Now that the new package is installed in the envoy-future component, we can rebuild the envoy-future image and use it for testing without affecting any users of the latest tag on the normal envoy image. (Using that tag is discouraged but not unheard of. In the future we may simplify this process, at the risk of unexpectedly giving latest users what they asked for.)
- Bump the changelog of the envoy-future image (example)
# in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/
export NEW_VERSION=1.26.1 # envoy version you want to package
cd images/envoy-future
# Bump changelog
dch -D wikimedia --force-distribution -c changelog -v ${NEW_VERSION}-1
- Test locally:
$ cd ../.. # Back to production-images/
$ docker-pkg -c config.yaml build --select '*envoy-future*' images # Builds without pushing
$ docker run docker-registry.wikimedia.org/envoy-future:${NEW_VERSION}-1 --version # Should print the correct version
- Merge.
- Go to a build server (role role::builder in puppet) and run:
$ cd /srv/images/production-images
# If someone's been naughty and hand patched the repo, this will alert you before messing with the local git history
$ sudo git pull --ff-only
$ sudo build-production-images --select '*envoy-future*'
Update CI
We're using envoy in operations/deployment-charts to lint and verify auto-generated envoy config.
To determine the envoy version, run "envoy --version" within the helm-linter image. You can do this on your laptop:
docker run --pull always --rm -it --entrypoint /usr/bin/envoy docker-registry.wikimedia.org/releng/helm-linter --version
To update the envoy version used there, bump the changelog at dockerfiles/helm-linter/changelog which triggers an update to the latest version:
dch -D wikimedia --force-distribution -c changelog
And add the new version to jjb/operations-misc.yaml in a second patch (example).
When this is merged and built, run CI to verify the new envoy version against our config (rebuilding the last run at https://integration.wikimedia.org/ci/job/helm-lint/ may be enough).
Validate the new version
You can change the version of Envoy used in any Kubernetes service by updating the helmfile.d/services/<SERVICE>/values.yaml files in the deployment-charts repository. Change the value of mesh.image_version (or insert it, overriding the default) to your image version, i.e. ${NEW_VERSION}-1. At this stage of the rollout, you also want to use the envoy-future image, so additionally change the value of mesh.image_name to envoy-future. Then merge and deploy the change.
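To make the shape of that override concrete, the sketch below applies it to a values structure held as a Python dict. `with_envoy_override` is a hypothetical helper, and the `mesh.enabled` key in the sample is only a stand-in for whatever the service's values.yaml already contains; the two keys being set are the ones named above.

```python
import copy


def with_envoy_override(values: dict, image_version: str,
                        image_name: str = "envoy-future") -> dict:
    """Return a copy of a service's values with the Envoy image overridden."""
    updated = copy.deepcopy(values)
    mesh = updated.setdefault("mesh", {})  # insert the mesh section if missing
    mesh["image_name"] = image_name
    mesh["image_version"] = image_version
    return updated


# Hypothetical existing content of a service's values.yaml.
values = {"mesh": {"enabled": True}}
print(with_envoy_override(values, "1.26.1-1"))
```

Reverting later is the inverse: remove both keys again so the service falls back to the infra-wide default.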
- During a MediaWiki Infrastructure deployment window, temporarily upgrade mw-debug in one data center so that you can test it with the WikimediaDebug browser extension. For a brief test, you can use helmfile --set:
$ cd /srv/deployment-charts/helmfile.d/services/mw-debug
$ helmfile -e ${DC?} -i apply -l name=pinkunicorn --set mesh.image_name=envoy-future --set mesh.image_version=${NEW_VERSION?}-1 --context=5
Ensure it starts up, with no errors in the tls-proxy container's logs, and (using the browser extension to test) ensure it serves traffic successfully. Note any deprecation warnings in the logs for post-upgrade followup work. When you're done, and before your deployment window ends, be sure to clean up your helmfile diffs (i.e. roll back to the older version) with
helmfile -e ${DC?} -i apply -l name=pinkunicorn --context=5
If you need to leave mw-debug upgraded for longer, either for further testing or because the Envoy upgrade isn't rollback-safe, don't use --set; apply the change properly by modifying values.yaml, merging, and deploying.
- Choose a low-traffic non-MediaWiki service and upgrade it by setting its mesh.image_name to envoy-future and mesh.image_version to the image you created. Ensure it starts up, with no errors in the tls-proxy container's logs, and serves traffic successfully. Note any deprecation warnings in the logs for post-upgrade followup work.
- Upgrade the API Gateway and REST Gateway staging environment.
- These services don't use an Envoy sidecar -- Envoy is the main application.
- They have their own Envoy config, templated in the api-gateway Helm chart, instead of using the mesh module shared by most charts. The rest-gateway service is a second helmfile installing the same api-gateway chart.
- For deployment instructions see their respective Wikitech pages.
Copy from component envoy-future to main with reprepro
You've already imported the exact version of Envoy you want in the envoy-future component, so you don't have to rebuild it in order to make it available in main. (If you're joining this procedure midway through, build and import the package instead.)
On apt1002, find the existing .deb file under /srv/wikimedia/pool/component/envoy-future/e/envoyproxy . Then import it to main:
sudo -i reprepro -C main includedeb bullseye-wikimedia path/to/envoyproxy_1.XX.X-1_amd64.deb
# Copy the package to other distributions as needed (this is possible because it only contains static binaries)
sudo -i reprepro copy bookworm-wikimedia bullseye-wikimedia envoyproxy
sudo -i reprepro copy trixie-wikimedia bullseye-wikimedia envoyproxy
Build the envoy Docker image
This is the same thing we did earlier with envoy-future, now with envoy.
- Bump the changelog of the envoy image (example)
# in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/
export NEW_VERSION=1.26.1 # envoy version you want to package
cd images/envoy
# Bump changelog
dch -D wikimedia --force-distribution -c changelog -v ${NEW_VERSION}-1
- Test locally:
$ cd ../.. # Back to production-images/
$ docker-pkg -c config.yaml build --select '*envoy:*' images # Builds without pushing
$ docker run docker-registry.wikimedia.org/envoy:${NEW_VERSION}-1 --version # Should print the correct version
- Merge.
- Go to a build server (role role::builder in puppet) and run:
$ cd /srv/images/production-images
$ sudo git pull --ff-only
$ sudo build-production-images --select '*envoy:*'
Roll out the new version on bare-metal hosts
Use Debdeploy and debmonitor as usual.
As of this writing, the mwdebug hosts still exist but aren't reachable via WikimediaDebug ; you can use them as an installation and startup test, then move on to other hosts.
After installing on any host, you can run
curl -s localhost:9631/server_info | jq .version
to ensure the expected version is running. If not, try sending SIGHUP to the hot restarter process (find it with pgrep -f envoyproxy-hot-restarter) to prompt it to start a new Envoy and drain the old one. Check /server_info again to ensure the new Envoy is now serving traffic.
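When checking many hosts, the same comparison can be scripted. The sketch below fetches /server_info and extracts the release number from the version field. It assumes the version string has the layout `<commit>/<release>/<build details>`; treat that layout, the helper names, and the sample payload as assumptions to verify against a real host.

```python
import json
from urllib.request import urlopen


def running_version(server_info: dict) -> str:
    """Extract the release number from the server_info version string.

    Assumes a "<commit>/<release>/<build details...>" layout; if there is no
    slash, the whole string is returned unchanged.
    """
    version = server_info["version"]
    parts = version.split("/")
    return parts[1] if len(parts) > 1 else version


def fetch_server_info(admin: str = "http://localhost:9631") -> dict:
    """Fetch /server_info from the admin interface of a bare-metal host."""
    with urlopen(f"{admin}/server_info", timeout=5) as resp:
        return json.load(resp)


# Illustrative payload, not captured from a real host.
sample = {"version": "0123abcd/1.26.1/Clean/RELEASE/BoringSSL"}
print(running_version(sample) == "1.26.1")
# On a host: running_version(fetch_server_info()) == "<expected version>"
```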
Envoy logs are in /var/log/envoy . At startup time, check /var/log/envoy/syslog.log for errors and warnings (note the two Envoys' logs will be interleaved during the transition but they can be differentiated by the PID near the left side). For tailing access logs, sudo tail -f /var/log/envoy/*.log is simplest. Use the envoy telemetry dashboard (bare metal edition) for metrics.
You can deploy on the remaining bare-metal hosts concurrently with the Kubernetes deployment. For example, it's a good idea to upgrade A:restbase-canary at the same time as the MediaWiki canaries on Kubernetes, then upgrade the rest of A:restbase along with the bulk of the MediaWiki services.
Roll out the new version in Kubernetes
- Upgrade mw-debug (permanently this time, in both data centers; test again with WikimediaDebug) and the canary release of all MediaWiki deployments that have a canary release (helmfile.d/services/mw-*/values-canary.yaml).
- After merging the values-canary.yaml changes, you can use scap to deploy the helmfile diffs in the usual way.
- There should be no config surprises at this stage, but monitor for startup errors and for any significant diffs in the performance metrics -- that is, does the new Envoy version add significant latency or resource consumption? Use the envoy telemetry dashboard (Kubernetes edition) for metrics.
- (Note that after deploying to the canaries, it takes some time for traffic to stabilize; don't be alarmed by the immediate drop in canary requests.)
- Upgrade the API Gateway and REST Gateway prod environment.
- You can revert your changes to the staging environment (removing the image name and version overrides) at the same time.
- Upgrade the MediaWiki services that have MW Deployments.
- For those services with canaries, you can revert your changes in values-canary.yaml at the same time.
- As of this writing, those are: mw-api-ext, mw-api-int, mw-jobrunner, mw-misc, mw-parsoid, mw-web, mw-wikifunctions.
- Upgrade mw-videoscaler.
- This is a special case: when changing the Envoy version you should simultaneously increment mercurius.generation in its values.yaml. (This is required to pick up the change because Mercurius is a Job rather than a Deployment, Jobs have immutable Pod templates, and unusually we aren't deploying a new MediaWiki at the same time.)
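The mw-videoscaler special case amounts to changing two values together. A minimal sketch on a values structure held as a Python dict follows; `bump_videoscaler` is a hypothetical helper, and the starting generation in the sample is made up.

```python
import copy


def bump_videoscaler(values: dict, image_version: str) -> dict:
    """Set the Envoy image version and increment mercurius.generation together."""
    updated = copy.deepcopy(values)
    updated.setdefault("mesh", {})["image_version"] = image_version
    mercurius = updated.setdefault("mercurius", {})
    # Jobs have immutable Pod templates, so the generation must change too.
    mercurius["generation"] = int(mercurius.get("generation", 0)) + 1
    return updated


# Hypothetical current values; the generation number is invented.
current = {"mercurius": {"generation": 3}}
print(bump_videoscaler(current, "1.26.1-1"))
```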
Set the new version as default for all chart deployments
- Set the infra-wide default, for all chart deployments, by changing the value of default.mesh.image_version in hiera key profile::kubernetes::deployment_server::general of hieradata/role/common/deployment_server/kubernetes.yaml
- Revert all the image name/version overrides you set earlier, now that the new version is the default. Restoring the image version won't produce any helmfile diffs, but changing envoy-future back to envoy will (though with no real effect since the image is the same) so you should helmfile apply those services.