API Gateway: Difference between revisions

imported>Hnowlan
imported>Hnowlan
(→‎How to debug it: wikimediadebug)

Revision as of 13:10, 23 September 2020

The API Gateway is an Envoy-based service that runs in Kubernetes. The service implements many features central to serving the unified API and the API portal.

What it does

The API Gateway serves pages for api.wikimedia.org. It does this by rewriting requests for the unified API to URIs that are understood by the respective APIs on the application servers, and also by serving pages for the API Portal wiki in the same way other wikis would be served. The API Gateway also uses metadata from JSON Web Tokens (JWTs) to apply rate limits to clients using the APIs.

How it works

Envoy functions much like any other API router or reverse proxy. The proxy answers requests for the configured domain, selectively manipulates them and then asks the appservers, via their LVS endpoint, for page content.

Rate limiting

The API Gateway applies rate limits to clients based upon the JWT provided (or not) by the client. During the beta, unauthenticated clients are limited to 500 requests an hour, and authenticated clients that pass a valid JWT are limited to 5000 requests. These values are entirely temporary and will be changed as the platform moves towards general release. Clients obtain JWTs by requesting OAuth 2.0 clients on Meta and, in future, on the API Portal.
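
As a rough illustration of the two tiers described above, the following Python sketch picks a limit based on whether a valid JWT was presented and counts requests in a fixed window. The real ratelimit service's algorithm is not described here, so the fixed-window approach, the class name and the assumption that both tiers use an hourly window are all hypothetical.

```python
from collections import defaultdict

# Beta values quoted above; both are described as temporary.
ANON_LIMIT = 500       # unauthenticated clients
AUTH_LIMIT = 5000      # clients presenting a valid JWT
WINDOW_SECONDS = 3600  # assumption: both tiers are counted per hour


class FixedWindowLimiter:
    """Hypothetical fixed-window counter, not the gateway's actual limiter."""

    def __init__(self):
        # Maps (client key, window index) to the number of requests seen.
        self.counts = defaultdict(int)

    def allow(self, client_key, jwt_is_valid, now):
        """Return True if this request is within the client's limit."""
        limit = AUTH_LIMIT if jwt_is_valid else ANON_LIMIT
        window = int(now // WINDOW_SECONDS)
        self.counts[(client_key, window)] += 1
        return self.counts[(client_key, window)] <= limit
```

Passing the current time in explicitly (for example `limiter.allow("203.0.113.7", False, time.time())`) keeps the sketch easy to test.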

Routing

The API Gateway maps API URIs passed to the Gateway's hostname (api.wikimedia.org) to the relevant APIs understood by the application servers. For example, https://api.wikimedia.org/core/v1/wikipedia/en/page/pizza is mapped to https://en.wikipedia.org/w/rest.php/v1/page/pizza by the Gateway's configuration language. As of September 2020, this requires a relatively complex rewriting method that uses Lua and multiple definitions of URL patterns, as seen in the values.yaml file, but this will be fixed in Envoy 1.16.0. Currently, all APIs offered by the API Gateway are also directly accessible via the traditional API routes on their per-service level.
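
The mapping in the example above can be sketched in Python as follows. This is only an illustration of the rewrite, not the Lua and Envoy configuration the Gateway actually uses; the regular expression and the function name are assumptions based solely on the single example URL.

```python
import re

# Hypothetical pattern inferred from the example:
# /core/v1/<project>/<language>/<rest of path>
CORE_ROUTE = re.compile(
    r"^/core/v1/(?P<project>[a-z]+)/(?P<lang>[a-z-]+)/(?P<rest>.+)$"
)


def rewrite_core_uri(path):
    """Map a unified-API path to (upstream host, upstream path).

    Returns None when the path does not match the assumed pattern.
    """
    match = CORE_ROUTE.match(path)
    if match is None:
        return None
    host = "{lang}.{project}.org".format(**match.groupdict())
    upstream_path = "/w/rest.php/v1/" + match.group("rest")
    return host, upstream_path


print(rewrite_core_uri("/core/v1/wikipedia/en/page/pizza"))
# prints ('en.wikipedia.org', '/w/rest.php/v1/page/pizza')
```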

JSON Web Tokens

The API Gateway verifies the signatures of JWTs passed in the Authorization header of requests. If a JWT is valid, the authenticated rate limit is applied instead of the anonymous one. Both limits can be configured per environment via the Helm values file (Values.main_app.ratelimiter.default_limit.unit for valid JWTs and Values.main_app.ratelimiter.anon_limit.unit for anonymous users).

API Portal

The API Gateway is the means by which all clients access the API Portal. The API Portal is simply a customised MediaWiki instance, and the API Gateway serves it by proxying requests to the appservers. Unlike other wikis, however, the API Portal is only accessible via the API Gateway.

Logs and analytics

Logs are shipped from the API Gateway to EventGate using fluentd. Fluentd runs in its own container, continuously parsing JSON request log output and reposting these logs to EventGate.

Where it runs

The API Gateway runs in Kubernetes in staging, eqiad and codfw. The instance in staging does not receive external traffic but can be accessed internally at https://api-gateway.svc.eqiad.wmnet:8087. Changes should be deployed to staging and tested via curl on this endpoint.

How it's configured

The API Gateway uses the reserved port 8087 internally and is registered in Service ports.

The core configuration for the API Gateway helm chart is documented in the default values.yaml file. Note that there are configuration overrides for production in general, and also for eqiad and codfw specifically (and staging, which does not serve public requests).

JWTs are verified using the public key of the keypair used to sign OAuth tokens on meta.wikimedia.org. This key has been converted to the JWKS format (RFC 7517) required for JWT verification and is distributed as a secret via Puppet.

How to deploy changes

The API Gateway's configurable components all live within the deployment-charts repository. The components that are of interest are the api-gateway chart itself and the aforementioned helmfile.d configuration for the service. Note: when changing configuration in the API Gateway chart, make sure to bump the version in Chart.yaml. Not bumping this value will lead to your changes not being deployed.

Changes to the API Gateway chart or configuration files follow a standard code review process. Once you have received a +1 in Gerrit, submitting a +2 will trigger the auto-merge process for the deployment-charts repository. Once the change is merged, always deploy it to staging first and then deploy to the production environments using the standard deployment process.

There are currently no specific deployment windows for the API Gateway, but if you deploy a change ad hoc without PET's knowledge, it is best to !log liberally and to make sure that someone from the team is on hand if you are doing something risky.

How to roll back changes

Follow the standard rollback procedures. If a change is affecting user experience in any way (increases in error codes served, timeouts etc - always refer to the dashboards when deploying), use the emergency procedure to limit the public impact of a change.

How to test changes

In development

Given the API Gateway's interactions with the appservers, testing changes locally can be difficult. However, there is limited support for it: if you have a local setup like minikube or similar, you can install a local version of the API Gateway by running helm install -f api-gateway/values-devel.yaml api-gateway in the charts directory. You will also need to build the echoapi container beforehand. This is required only once; see the chart's README for more details. Once your install is complete and you have forwarded the requisite ports, requests will be passed to a fake backend service that returns the headers and parameters of the requests and responses for any request it receives. This can be used to check that basic behaviour changes match what you expect, that the Envoy syntax is valid and that URL mappings behave as intended, amongst other things.

In staging

When changes have been deployed to staging, they can be tested using curl from any internal host. This can make it difficult to test changes that rely on MediaWiki changes, but it is unlikely that helm will be used to change the API Portal's behaviour in lieu of the standard mediawiki-config deployment process.

For example, to test a change to the API routing, run curl -k https://staging.svc.eqiad.wmnet:8087/core/v1/wikipedia/ga/page/Veigeat%C3%B3ireachas -v. When deploying new changes to staging, it should be verified that the change has had no impact on the API in general and specifically any API paths that have been modified or added. The normal operation of the API Portal should be tested - nothing too extensive but make sure that the main page loads okay.

How to debug it

Logs

To read and follow the logs for an API Gateway instance (codfw in this example):

hnowlan@deploy1001:~$ kube_env "api-gateway" "codfw"
hnowlan@deploy1001:~$ kubectl get pods | grep Running
api-gateway-production-5cd8c54ddb-rcg77   5/5     Running   0          5d5h
tiller-deploy-77f47486d6-fxhpx            1/1     Running   0          6d3h
hnowlan@deploy1001:~$ kubectl logs api-gateway-production-5cd8c54ddb-rcg77 api-gateway-production --tail 10 -f

This will show the last 10 lines of the logs and then follow output.

Note that Envoy's log format is extremely verbose and dumping whole logs may take a few seconds. Following logs can be challenging at times, as they can seem non-linear when many requests are interleaved with one another. One aid in sorting through logs is the [Cxxxxx] fields, which are unique connection IDs that can be used to follow requests as they are received and answered.
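
One way to untangle interleaved output is to group lines by their connection ID. The Python sketch below assumes only that each relevant line contains a field of the form [C followed by digits]; the function name and the sample lines are made-up placeholders rather than real Envoy output.

```python
import re
from collections import defaultdict

# Matches connection ID fields such as [C12345].
CONNECTION_ID = re.compile(r"\[(C\d+)\]")


def group_by_connection(lines):
    """Group log lines by connection ID, preserving their order.

    Lines without a connection ID field are skipped.
    """
    grouped = defaultdict(list)
    for line in lines:
        match = CONNECTION_ID.search(line)
        if match:
            grouped[match.group(1)].append(line)
    return dict(grouped)


# Placeholder lines for illustration only, not real Envoy log output.
sample = [
    "[C101] placeholder: request received",
    "[C102] placeholder: request received",
    "[C101] placeholder: response sent",
]
for connection, entries in group_by_connection(sample).items():
    print(connection, len(entries))
```

The same function can be fed the saved output of the kubectl logs command shown earlier, read line by line from a file.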

The above example can also be used to monitor the ratelimit service: in place of api-gateway-production, simply substitute production-ratelimit. This pattern applies to the other services within the pod, but their log output is not always useful.

WikimediaDebug

The WikimediaDebug plugin is supported for accessing the API Portal. It is not currently supported for routing API requests.

How to monitor it

There is a Grafana dashboard available that monitors many features of the API Gateway.

Known issues

  • An issue has been seen where users occasionally see {"httpCode":503,"httpReason":"upstream connect error or disconnect/reset before headers. reset reason: connection termination"} instead of being served the API portal. This issue could relate to connection reuse or TLS termination within Envoy itself; the cause is not clear. However, a fix limiting the amount and length of connection reuse when connecting to upstream hosts in Envoy has limited the impact. For more details see T262490.
  • If a user receives an error of {"httpCode":401,"httpReason":"Jwt issuer is not configured"}, it is because the "iss" field in the token does not match the one configured on the API Gateway. Depending on what has been changed, this could be a misconfiguration of the OAuth token creation process or of the Gateway itself. Envoy is very strict about the issuer being set (although this is changing) and a mismatch will lead to tokens being rejected.
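
When debugging the issuer error above, it can help to inspect the "iss" claim of the token in question. The following Python sketch decodes the payload segment of a JWT without verifying its signature, so it is suitable for inspection only; the function names are hypothetical, and the expected issuer value must be taken from the Gateway's own configuration.

```python
import base64
import json


def jwt_claims(token):
    """Decode a JWT's payload segment WITHOUT verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded with the padding stripped.
    padding = "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64 + padding))


def issuer_matches(token, expected_issuer):
    """Return True if the token's "iss" claim equals expected_issuer."""
    return jwt_claims(token).get("iss") == expected_issuer
```

For example, `jwt_claims(token).get("iss")` shows the issuer a client is sending, which can then be compared by hand with the value configured for the environment.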

Related