You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
API Gateway: Difference between revisions
Previous revision by Ppchelko; this revision by Alex Paskulin (→How it works: Add diagram), which added the figure "Wikimedia API Gateway architecture diagram" (File:Wikimedia-API-Gateway-architecture-v1.png) to the "How it works" section.
Revision as of 22:27, 7 October 2020
The API Gateway is an Envoy-based service that runs in Kubernetes. The service implements many features central to serving the unified API and the API portal.
What it does
The API Gateway serves pages for api.wikimedia.org. It does this by rewriting requests for the unified API to URIs that are understood by the respective APIs on the application servers, and also by serving pages for the API Portal wiki in the same way other wikis would be served. The API Gateway also uses metadata from JSON Web Tokens (JWTs) to apply rate limits to clients using the APIs.
How it works
Envoy functions much like any other API router or reverse proxy: it answers requests for the configured domain, selectively manipulates them, and then requests page content from the appservers via their LVS endpoint.
Rate limiting
The API Gateway applies rate limits to clients based upon the JWT provided (or not) by the client. During beta, unauthenticated clients are limited to 500 requests an hour, and authenticated clients that pass a valid JWT are limited to 5000 requests an hour. These values are entirely temporary and will be changed as the platform moves towards general release. Clients obtain JWTs by requesting OAuth 2.0 clients on Meta and, in future, on the API Portal.
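As a rough illustration of the two beta limits described above, the following Python sketch counts requests per client in a fixed one-hour window. It is not how the Gateway's ratelimit service is implemented: the class name, the fixed-window strategy and the client keys are assumptions made for this example. Only the 500 and 5000 requests-per-hour figures come from this page.

```python
import time

# Beta limits quoted on this page (requests per hour).
ANON_LIMIT = 500
JWT_LIMIT = 5000
WINDOW_SECONDS = 3600


class FixedWindowLimiter:
    """Hypothetical fixed-window counter; the real service may work differently."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._windows = {}  # key -> (window_start, request_count)

    def allow(self, key, has_valid_jwt):
        """Return True if the request fits within the caller's hourly limit."""
        limit = JWT_LIMIT if has_valid_jwt else ANON_LIMIT
        now = self._clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= WINDOW_SECONDS:
            start, count = now, 0  # window expired, start a new one
        if count >= limit:
            self._windows[key] = (start, count)
            return False
        self._windows[key] = (start, count + 1)
        return True
```

Passing a custom `clock` makes the window boundary easy to exercise in tests without waiting an hour.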
Routing
The API Gateway maps API URIs requested on the Gateway's hostname (api.wikimedia.org) to the relevant APIs understood by the application servers. For example, the Gateway's configuration maps https://api.wikimedia.org/core/v1/wikipedia/en/page/pizza to https://en.wikipedia.org/w/rest.php/v1/page/pizza. As of September 2020, this requires a relatively complex rewriting method that uses Lua and multiple definitions of URL patterns, as seen in the values.yaml file; this is expected to be fixed in Envoy 1.16.0. Currently, all APIs offered by the API Gateway are also directly accessible via the traditional API routes at the per-service level.
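The mapping in the example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Gateway's actual rewrite rules, which live in the Envoy configuration and Lua: the regular expression and the function name are hypothetical, and the pattern is generalised from the single pizza example on this page, so other projects or path shapes may be handled differently in practice.

```python
import re

# Hypothetical pattern generalised from the one example on this page:
# /core/v1/<project>/<language>/<rest>
#   -> https://<language>.<project>.org/w/rest.php/v1/<rest>
CORE_ROUTE = re.compile(
    r"^/core/v1/(?P<project>[a-z]+)/(?P<lang>[a-z-]+)/(?P<rest>.+)$"
)


def rewrite(path):
    """Return the backend URL for a unified API path, or None if unmatched."""
    match = CORE_ROUTE.match(path)
    if match is None:
        return None
    return "https://{lang}.{project}.org/w/rest.php/v1/{rest}".format(
        **match.groupdict()
    )


print(rewrite("/core/v1/wikipedia/en/page/pizza"))
# -> https://en.wikipedia.org/w/rest.php/v1/page/pizza
```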
JSON Web Tokens
The API Gateway verifies the signatures of JWTs passed in Authorization headers alongside requests. If a JWT is valid, a different limit is applied. This limit can be configured per environment via the Helm values file (Values.main_app.ratelimiter.default_limit.unit for valid JWTs and Values.main_app.ratelimiter.anon_limit.unit for anonymous users).
API Portal
The API Gateway is the means by which all clients access the API Portal. The API Portal is simply a customised MediaWiki instance, and the API Gateway serves requests to it by proxying them to the appservers. Unlike other wikis, however, the API Portal is only accessible via the API Gateway.
Logs and analytics
Logs are shipped from the API Gateway to EventGate using fluentd. Fluentd runs in its own container, continuously parsing JSON request log output and reposting these logs to EventGate.
Where it runs
The API Gateway runs in Kubernetes in staging, eqiad and codfw. The instance in staging does not receive external traffic but can be accessed internally at https://api-gateway.svc.eqiad.wmnet:8087. Changes should be deployed to staging and tested via curl on this endpoint.
How it's configured
The API Gateway uses the reserved port 8087 internally and is registered in Service ports.
The core configuration for the API Gateway helm chart is documented in the default values.yaml file. Note that there are configuration overrides for production in general, and also for eqiad and codfw specifically (and staging, which does not serve public requests).
JWTs are verified using the public key of the keypair used to sign OAuth tokens on meta.wikimedia.org. This key has been converted to the JWKS format (RFC 7517) required for JWT support and is distributed as a secret via puppet.
How to deploy changes
The API Gateway's configurable components all live within the deployment-charts repository. The components that are of interest are the api-gateway chart itself and the aforementioned helmfile.d configuration for the service. Note: when changing configuration in the API Gateway chart, make sure to bump the version in Chart.yaml. Not bumping this value will lead to your changes not being deployed.
Changes to the API Gateway chart or configuration files follow a standard code review process. Once you have received a +1 in Gerrit, submitting a +2 will trigger the auto-merge process for the deployment-charts repository. Once the change is merged, always deploy it to staging first and then deploy to the production environments using the standard deployment process.
There are currently no specific deployment windows for the API Gateway, but if deploying a change ad hoc without PET's knowledge, it is best to both !log liberally and make sure that someone from the team is on hand if you're doing something risky.
How to roll back changes
Follow the standard rollback procedures. If a change is affecting user experience in any way (increases in error codes served, timeouts etc - always refer to the dashboards when deploying), use the emergency procedure to limit the public impact of a change.
How to test changes
In development
Given the API Gateway's interactions with the appservers, testing changes locally can be difficult. However, there is limited support for testing changes: if you have a local setup like minikube or similar, you can install a local version of the API Gateway by running helm install -f api-gateway/values-devel.yaml api-gateway in the charts directory. You will also need to build the echoapi container beforehand. This is required only once; see the chart's README for more details. Once your install is complete and you have forwarded the requisite ports, requests will be passed to a fake backend service that returns the headers and parameters of any request it receives. This can be used to check that basic behaviour changes match what you expect, that the Envoy syntax is valid and that URL mappings behave as expected, amongst other things.
In staging
When changes have been deployed to staging, they can be tested using curl from any internal host. This can make it difficult to test changes that rely on MediaWiki changes, but it is unlikely that helm will be used to change the API Portal's behaviour in lieu of the standard mediawiki-config deployment process.
For example, to test a change to the API routing, run curl -k https://staging.svc.eqiad.wmnet:8087/core/v1/wikipedia/ga/page/Veigeat%C3%B3ireachas -v. When deploying new changes to staging, verify that the change has had no impact on the API in general, and specifically on any API paths that have been modified or added. The normal operation of the API Portal should also be tested: nothing too extensive, but make sure that the main page loads okay.
How to debug it
Logs
To read and follow the logs for an API Gateway instance (codfw in this example):
hnowlan@deploy1001:~$ kube_env "api-gateway" "codfw"
hnowlan@deploy1001:~$ kubectl get pods | grep Running
api-gateway-production-5cd8c54ddb-rcg77   5/5   Running   0   5d5h
tiller-deploy-77f47486d6-fxhpx            1/1   Running   0   6d3h
hnowlan@deploy1001:~$ kubectl logs api-gateway-production-5cd8c54ddb-rcg77 api-gateway-production --tail 10 -f
This will show the last 10 lines of the logs and then follow output.
Note that Envoy's log format is extremely verbose and dumping whole logs may take a few seconds. Following logs can be challenging at times because they may seem non-linear: many requests are interleaved with one another. One aid in sorting through logs is to follow the [Cxxxxx] fields, which are unique connection IDs that can be used to track requests as they are received and answered.
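To untangle interleaved output, the connection IDs can be used to group lines after the fact. The Python sketch below does this for lines read from a saved log; it assumes only that each relevant line contains a field of the form [C followed by digits and a closing bracket, as described above. The function name and the sample lines are made up for illustration and do not reproduce real Envoy output.

```python
import re
from collections import defaultdict

# Matches connection ID fields such as [C101].
CONN_ID = re.compile(r"\[C(\d+)\]")


def group_by_connection(lines):
    """Map each connection ID to the log lines that mention it, in order."""
    grouped = defaultdict(list)
    for line in lines:
        match = CONN_ID.search(line)
        if match:
            grouped[match.group(1)].append(line)
    return dict(grouped)


# Made-up sample lines; real Envoy output contains far more detail.
sample = [
    "[C101] new connection",
    "[C102] new connection",
    "[C101] request headers complete",
    "[C102] response sent",
]
for conn_id, conn_lines in group_by_connection(sample).items():
    print(conn_id, conn_lines)
```

Lines without a connection ID are skipped, so this is best used on a narrowed-down excerpt rather than as a complete view of the log.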
The above example can also be used to monitor the ratelimit service: in place of api-gateway-production, simply substitute production-ratelimit. This pattern applies to the other services within the pod, but their log output is not always useful.
WikimediaDebug
The WikimediaDebug plugin is supported for accessing the API Portal. It is not currently supported for routing API requests.
How to monitor it
There is a Grafana dashboard available that monitors many features of the API Gateway.
How to assign a client to rate limit tier
All clients are assigned to the default ratelimit tier. To change the tier, use the setClientTierName.php maintenance script.
Log in to Mwmaint1001 and execute:
mwscript extensions/OAuthRateLimiter/maintenance/setClientTierName.php --wiki metawiki --client <client_id> --tier <tier_name>
At the time of writing 3 tiers exist:
- Default rate limit class: 5000 API calls/hour per client ID/user ID pair (with null user ID counting as a pair here)
- Preferred rate limit class: 25,000 API calls/hour per client ID/user ID pair
- Internal rate limit class: 100,000 API calls/hour per client ID/user ID pair
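The tier list above can be read as a simple lookup from tier to hourly limit, with counters kept per client ID/user ID pair. The Python sketch below illustrates that reading only. The tier keys ("default", "preferred", "internal") and both function names are hypothetical and may not match the names accepted by setClientTierName.php; only the numeric limits come from this page.

```python
# Hourly limits per tier, as listed on this page at the time of writing.
# The dictionary keys are hypothetical labels, not confirmed tier names.
TIER_LIMITS = {"default": 5000, "preferred": 25000, "internal": 100000}


def limit_for(tier_name):
    """Return the hourly limit for a tier, falling back to the default tier."""
    return TIER_LIMITS.get(tier_name, TIER_LIMITS["default"])


def counter_key(client_id, user_id=None):
    """Build a counter key: one bucket per client ID/user ID pair.

    A missing (null) user ID still forms its own pair with the client ID.
    """
    return (client_id, user_id)


# The same client with and without a user ID is counted separately.
print(counter_key("abc123") != counter_key("abc123", 42))
```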
Known issues
- An issue has been seen where users will occasionally see
  {"httpCode":503,"httpReason":"upstream connect error or disconnect/reset before headers. reset reason: connection termination"}
  instead of being served the API Portal. This issue could relate to connection reuse or to TLS termination issues within Envoy itself; the cause is not clear. However, a fix limiting the amount and length of connection reuse when connecting to upstream hosts in Envoy has limited the impact. For more details see T262490.
- If a user receives an error of
  {"httpCode":401,"httpReason":"Jwt issuer is not configured"}
  it is because the "iss" field in the token does not match the one configured on the API Gateway. Depending on what has been changed, this could be a misconfiguration of the OAuth token creation process or of the Gateway itself. Envoy is very strict about the issuer being set (although this is changing), and a mismatch will lead to tokens being rejected.
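When debugging the issuer error above, it can help to look at the "iss" claim a token actually carries. The following Python sketch decodes the payload segment of a JWT using only the standard library. It deliberately does not verify the signature, so it is suitable for inspection only and never for authentication. The function names are hypothetical, and the issuer value to compare against is whatever is configured on the Gateway, which is not stated on this page.

```python
import base64
import json


def jwt_claims(token):
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def issuer_matches(token, expected_issuer):
    """Check whether the token's "iss" claim equals the expected issuer."""
    return jwt_claims(token).get("iss") == expected_issuer
```

If `issuer_matches` returns False for the issuer configured on the Gateway, the mismatch described above is the likely cause of the 401 response.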