Global traffic routing
This page covers our mechanisms for routing user requests through our Traffic infrastructure layers. The routing can be modified through administrative actions to improve performance and/or reliability, and/or respond to site/network outage conditions.
Sites
There are currently four total sites involved. All four sites can receive direct user traffic; however, eqiad and codfw are Primary sites where application layer services can be hosted, while ulsfo and esams are Edge sites for cache-only traffic.
GeoDNS
The first point of entry is when the client performs a DNS request for one of our public hostnames. Our authoritative DNS servers perform GeoIP resolution and hand out one of several distinct IP addresses for each hostname, routing users approximately to their nearest cache datacenter. We can disable a site for direct user access through DNS configuration updates. Our DNS TTLs are commonly 10 minutes long, and some rare user-side caches will violate the spec and cache records longer. The bulk of the traffic should switch within 10 minutes, though, with a fairly linear progression over that window.
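As a rough illustration of the traffic migration described above (a deliberate simplification, not a model of real resolver behavior), one can picture client caches expiring uniformly across a single TTL window:

```python
# Rough sketch: fraction of user traffic seeing the new DNS answer
# t seconds after a change, assuming (hypothetically) that client
# caches expire uniformly across one TTL window.
TTL = 600  # our common DNS TTL: 10 minutes

def fraction_switched(t_seconds):
    """Approximate fraction of traffic that has switched at time t."""
    return min(max(t_seconds / TTL, 0.0), 1.0)

print(fraction_switched(300))  # halfway through the window -> 0.5
print(fraction_switched(900))  # past one full TTL -> 1.0
```

The handful of spec-violating caches mentioned above would show up as a small tail beyond the window, which this sketch ignores.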
Disabling a Site
To disable a site as an edge destination for user traffic in GeoDNS:
In the operations/dns repo, edit the file admin_state. There are instructions inside for complex changes, but for the basic operation of completely disabling a site, the line you need to add at the bottom (e.g. to disable esams) is:

geoip/generic-map/esams => DOWN
... and then deploy the DNS change in the usual way: merge through gerrit, ssh to any one of our 3x authdns servers (baham, radon, and eeden), and execute authdns-update as root.
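The semantics of marking a site DOWN can be pictured with a toy selection function (illustrative only; the real gdnsd GeoIP logic is far more involved, and the per-user preference orders below are hypothetical): a DOWN site is simply removed from the candidate set before the nearest remaining site is chosen.

```python
# Toy illustration of GeoDNS failover semantics (not gdnsd's actual
# logic): marking a site DOWN in admin_state removes it from
# consideration, and users fall through to their next-nearest site.

# Hypothetical per-user preference orders, nearest cache first.
PREFERENCE = {
    "user_in_europe": ["esams", "eqiad", "codfw", "ulsfo"],
    "user_in_us_west": ["ulsfo", "codfw", "eqiad", "esams"],
}

def resolve(user, down_sites):
    """Return the first preferred site not marked DOWN."""
    for site in PREFERENCE[user]:
        if site not in down_sites:
            return site
    raise RuntimeError("all sites are down")

# With esams marked DOWN, European users spill to the next-nearest site.
print(resolve("user_in_europe", down_sites={"esams"}))  # -> eqiad
print(resolve("user_in_europe", down_sites=set()))      # -> esams
```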
Inter-cache routing
Once a user's request has entered the front edge of our Traffic infrastructure through GeoDNS, it flows through one or more cache datacenters before reaching the application layer. The flow of traffic through our cache datacenters is currently controlled via hieradata. If one or more cache datacenters route their traffic through another site on their way to the app layer, and that site is down, you will want to re-route that traffic around the outage. Each cache cluster has its own routing table.
In the operations/puppet repo, there are per-cluster files hieradata/role/common/cache/*.yaml (there are currently 4 of them: text, upload, misc, maps). There you'll see a cache route table that looks like:

cache::route_table:
  eqiad: 'direct'
  codfw: 'eqiad'
  ulsfo: 'codfw'
  esams: 'eqiad'
Sites which map to direct access the application layer directly. Traffic entering a non-direct site will essentially recurse through lookups in this routing table until it reaches direct. In the example above (current for all clusters as of this writing), a user request which first enters our Traffic infrastructure via ulsfo will pass from there to codfw, then to eqiad, and then finally to the application layer itself.
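The recursion through the route table can be sketched in a few lines (an illustration of the lookup semantics, not the actual Varnish/puppet implementation):

```python
# Sketch of cache::route_table recursion: starting from the entry
# site, follow the table until reaching 'direct' (the app layer).
ROUTE_TABLE = {
    "eqiad": "direct",
    "codfw": "eqiad",
    "ulsfo": "codfw",
    "esams": "eqiad",
}

def cache_path(entry_site, route_table=ROUTE_TABLE):
    """Return the ordered list of cache sites a request traverses."""
    path = [entry_site]
    while route_table[path[-1]] != "direct":
        path.append(route_table[path[-1]])
    return path

print(cache_path("ulsfo"))  # -> ['ulsfo', 'codfw', 'eqiad']
print(cache_path("esams"))  # -> ['esams', 'eqiad']
```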
Disabling a Site
To disable a site for all inter-cache routing in all clusters, you must remove right-hand-side references to it from the table, and re-route the affected sites to another pathway towards direct.
To disable routing through codfw, one would only need to change ulsfo's entry, pointing it at eqiad instead (in theory esams would also work as a destination, but that would be extremely suboptimal given the physical geography of the datacenters!). The updated route table would look like:

cache::route_table:
  eqiad: 'direct'
  codfw: 'eqiad'
  ulsfo: 'eqiad' # was 'codfw', but changed due to codfw outage!
  esams: 'eqiad'
After merging this through gerrit + puppet-merge, puppet agent needs to be run on the affected caches before this takes effect.
Cache-to-application routing
The final step is routing requests out the back edge of the Traffic infrastructure into the application layer. The application layer services can exist at one of two primary datacenters: eqiad or codfw. This is controlled by per-application route entries found in hieradata.
In the operations/puppet repo, there are per-cluster files hieradata/common/cache/*.yaml (there are currently 4 of them: text, upload, misc, maps; note the slightly different filenames than for inter-cache routing above!). Within these files, underneath the apps key, you will see one stanza per application layer service used by each cluster. Within each application service, there is a backends key which defines the available hostnames for this service at eqiad and/or codfw. Ideally all services should exist at both. There is also a per-service route key for selecting which datacenter to route the requests to. The value of route can be set to eqiad or codfw.
Example of the current apps stanza for the text cluster:

apps:
  appservers:
    route: 'eqiad'
    backends:
      eqiad: 'appservers.svc.eqiad.wmnet'
      codfw: 'appservers.svc.codfw.wmnet'
  appservers_debug:
    route: 'eqiad'
    backends:
      eqiad: 'hassium.eqiad.wmnet'
      codfw: 'hassaleh.codfw.wmnet'
  api:
    route: 'eqiad'
    backends:
      eqiad: 'api.svc.eqiad.wmnet'
      codfw: 'api.svc.codfw.wmnet'
  rendering:
    route: 'eqiad'
    backends:
      eqiad: 'rendering.svc.eqiad.wmnet'
      codfw: 'rendering.svc.codfw.wmnet'
  restbase:
    route: 'eqiad'
    backends:
      eqiad: 'restbase.svc.eqiad.wmnet'
      codfw: 'restbase.svc.codfw.wmnet'
  cxserver:
    route: 'eqiad'
    backends:
      eqiad: 'cxserver.svc.eqiad.wmnet'
  citoid:
    route: 'eqiad'
    backends:
      eqiad: 'citoid.svc.eqiad.wmnet'
  security_audit:
    route: 'eqiad'
    backends:
      eqiad: []
In order to change the routing, one needs to commit changes to this data altering the necessary `route` attribute(s). After merging this through gerrit + puppet-merge, puppet agent needs to be run on the affected caches before this takes effect.
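The effect of the route key can be sketched as a simple lookup (illustrative only; the real selection happens in our puppet/Varnish configuration, and the stanza below is a trimmed copy of the example above):

```python
# Sketch: for each app service, the 'route' key picks which
# datacenter's backend hostname the caches send requests to.
APPS = {
    "appservers": {
        "route": "eqiad",
        "backends": {
            "eqiad": "appservers.svc.eqiad.wmnet",
            "codfw": "appservers.svc.codfw.wmnet",
        },
    },
    "cxserver": {
        "route": "eqiad",
        "backends": {"eqiad": "cxserver.svc.eqiad.wmnet"},
    },
}

def backend_for(service, apps=APPS):
    """Return the hostname requests for this service are routed to."""
    svc = apps[service]
    return svc["backends"][svc["route"]]

print(backend_for("appservers"))  # -> appservers.svc.eqiad.wmnet
```

Flipping a service's route from eqiad to codfw would make this lookup return the codfw hostname instead, which is why a service ideally has backends defined at both sites.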
Future directions
The current state of affairs is an iterative improvement on the previous situation, where making changes to our routing would have been a very complex, manual, and error-prone process. However, where we're at now is still only an intermediate state on the way to better things. Specifically:
- Support split routing on a per-service basis: within a given cache cluster, some services will be (or already are!) ready for active:active split routing before others.
- We should move routing metadata (for both the app layer and inter-cache routing) to etcd, so that route changes are not configuration changes through puppet, but instead accomplished via confctl commands.
- As with GeoDNS today, structure that data such that it's possible to simply mark a given site 'down' in one place and have the routing react as best it can, rather than having the administrator have to think through the implications of manual route changes.