
Global traffic routing

From Wikitech-static
Revision as of 22:22, 10 March 2016 by imported>Krinkle

This page covers our mechanisms for routing user requests through our Traffic infrastructure layers. The routing can be modified through administrative actions to improve performance and/or reliability, and/or respond to site/network outage conditions.

Sites

There are currently four sites involved. All four can receive direct user traffic. However, eqiad and codfw are Primary sites where application layer services can be hosted, while ulsfo and esams are Edge sites for cache-only traffic:

Map of Wikimedia Foundation data centers.

GeoDNS

The first point of entry is the client's DNS request for one of our public hostnames. Our authoritative DNS servers perform GeoIP resolution and hand out one of several distinct IP addresses for each hostname, routing users approximately to their nearest cache datacenter. We can disable a site from direct user access through DNS configuration updates. Our DNS TTLs are commonly 10 minutes long, and a few user caches will violate the spec and hold records longer. The bulk of the traffic should still switch within 10 minutes, with a fairly linear progression over that window.
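The "fairly linear progression" can be sketched as a rough estimate. This is a minimal sketch, assuming cached records expire uniformly over one TTL window and ignoring the spec-violating caches mentioned above; the function name fraction_switched is hypothetical and not part of any tooling.

```python
def fraction_switched(minutes_elapsed, ttl_minutes=10.0):
    """Estimate the share of traffic that has moved away from a disabled site.

    Assumption: cache expiries are spread evenly across one TTL window,
    so the share grows linearly and is capped at 1.0.
    """
    if ttl_minutes <= 0:
        raise ValueError("ttl_minutes must be positive")
    return min(max(minutes_elapsed / ttl_minutes, 0.0), 1.0)


for t in (0, 5, 10, 15):
    print(t, fraction_switched(t))
```

Under these assumptions, roughly half of the traffic would have switched 5 minutes after the DNS change is deployed.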

Disabling a Site

To disable a site as an edge destination for user traffic in GeoDNS:

In the operations/dns repo, edit the file admin_state

There are instructions inside for complex changes, but for the basic operation of completely disabling a site, the line you need to add at the bottom for e.g. disabling esams is:

 geoip/generic-map/esams => DOWN

Then deploy the DNS change in the usual way: merge through gerrit, ssh to any one of our three authdns servers (baham, radon, and eeden), and execute authdns-update as root.

Inter-cache routing

Once a user's request has entered the front edge of our Traffic infrastructure through GeoDNS, it flows through one or more cache datacenters before reaching the application layer. The flow of traffic through our cache datacenters is currently controlled via hieradata. If one or more cache datacenters route their traffic through another site on their way to the app layer, and that site is down, you'll want to re-route the traffic around it. Each cache cluster has its own routing table.

In the operations/puppet repo, there are per-cluster files hieradata/role/common/cache/*.yaml (there are currently 4 of them: text, upload, misc, maps).

There you'll see a cache route table that looks like:

 cache::route_table:
   eqiad: 'direct'
   codfw: 'eqiad'
   ulsfo: 'codfw'
   esams: 'eqiad'

Sites which map to direct access the application layer directly. Traffic entering a non-direct site is forwarded to the site named in its entry, and the lookup repeats from there until it reaches direct. In the example above (current for all clusters as of this writing), a user request which first enters our Traffic infrastructure via ulsfo will pass from there to codfw, then to eqiad, and then finally to the application layer itself.
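The lookup described above can be sketched in a few lines. This is only an illustration of the table semantics, not the code the caches run; the helper name resolve_path is hypothetical, and the loop check is an added safeguard that the source does not mention.

```python
# Copy of the example cache::route_table shown above.
ROUTE_TABLE = {
    "eqiad": "direct",
    "codfw": "eqiad",
    "ulsfo": "codfw",
    "esams": "eqiad",
}


def resolve_path(table, site):
    """Return the cache sites a request entering at `site` passes through."""
    path = []
    current = site
    while current != "direct":
        if current in path:
            raise ValueError(f"routing loop at {current!r}")
        if current not in table:
            raise KeyError(f"no route entry for {current!r}")
        path.append(current)
        current = table[current]
    return path


print(resolve_path(ROUTE_TABLE, "ulsfo"))  # ['ulsfo', 'codfw', 'eqiad']
```

The last site in the returned list is the one that talks to the application layer.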

Disabling a Site

To disable a site for all inter-cache routing in all clusters, you must remove right-hand-side references to it from the table, and re-route the affected sites to another pathway towards direct.

To disable routing through codfw, one would only need to change ulsfo's entry, pointing it at eqiad instead (or in theory, esams would work as a destination as well, but that would be extremely suboptimal given the physical geography of the datacenters!). The updated route table would look like:

 cache::route_table:
   eqiad: 'direct'
   codfw: 'eqiad'
   ulsfo: 'eqiad' # was 'codfw', but changed due to codfw outage!
   esams: 'eqiad'

After merging this through gerrit + puppet-merge, puppet agent needs to be run on the affected caches before this takes effect.
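Before committing such a change, it may help to confirm that no right-hand-side reference to the disabled site remains. The sketch below is a hypothetical check written for illustration (sites_routing_through is not an existing tool); it works on a plain dictionary copy of the route table.

```python
def sites_routing_through(table, down_site):
    """Return the sites whose route entry still points at `down_site`."""
    return sorted(site for site, nxt in table.items() if nxt == down_site)


before = {
    "eqiad": "direct",
    "codfw": "eqiad",
    "ulsfo": "codfw",
    "esams": "eqiad",
}
# Same change as in the updated table: ulsfo now points at eqiad.
after = dict(before, ulsfo="eqiad")

print(sites_routing_through(before, "codfw"))  # ['ulsfo']
print(sites_routing_through(after, "codfw"))   # []
```

An empty result means no cache site forwards traffic to the disabled site any more. The check would need to be repeated for each per-cluster file.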

Cache-to-application routing

The final step is routing requests out the back edge of the Traffic infrastructure into the application layer. The application layer services can exist at one of two primary datacenters: eqiad or codfw. This is controlled by per-application route entries found in hieradata.

In the operations/puppet repo, there are per-cluster files hieradata/common/cache/*.yaml (there are currently 4 of them: text, upload, misc, maps - note the slightly different path compared to inter-cache routing above!). Within these files, underneath the apps key, you will see one stanza per application layer service used by each cluster. Within each application service, the backends key defines the available hostnames for this service at eqiad and/or codfw. Ideally all services should exist at both. There is also a per-service route key that selects which datacenter to route the requests to. The value of route can be set to eqiad or codfw.

Example of current apps stanza for the text cluster:

apps:
 appservers:
   route: 'eqiad'
   backends:
     eqiad: 'appservers.svc.eqiad.wmnet'
     codfw: 'appservers.svc.codfw.wmnet'
 appservers_debug:
   route: 'eqiad'
   backends:
     eqiad: 'hassium.eqiad.wmnet'
     codfw: 'hassaleh.codfw.wmnet'
 api:
   route: 'eqiad'
   backends:
     eqiad: 'api.svc.eqiad.wmnet'
     codfw: 'api.svc.codfw.wmnet'
 rendering:
   route: 'eqiad'
   backends:
     eqiad: 'rendering.svc.eqiad.wmnet'
     codfw: 'rendering.svc.codfw.wmnet'
 restbase:
   route: 'eqiad'
   backends:
     eqiad: 'restbase.svc.eqiad.wmnet'
     codfw: 'restbase.svc.codfw.wmnet'
 cxserver:
   route: 'eqiad'
   backends:
     eqiad: 'cxserver.svc.eqiad.wmnet'
 citoid:
   route: 'eqiad'
   backends:
     eqiad: 'citoid.svc.eqiad.wmnet'
 security_audit:
   route: 'eqiad'
   backends:
     eqiad: []

To change the routing, commit changes to this data altering the necessary route attribute(s). After merging this through gerrit + puppet-merge, puppet agent needs to be run on the affected caches before this takes effect.
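How the route and backends keys combine can be sketched as follows. This is an illustration under stated assumptions: APPS is a hand-copied subset of the stanza above, and the helpers selected_backend and services_missing_at are hypothetical names, not part of puppet or the cache configuration.

```python
# Hand-copied subset of the text cluster's apps stanza shown above.
APPS = {
    "appservers": {
        "route": "eqiad",
        "backends": {
            "eqiad": "appservers.svc.eqiad.wmnet",
            "codfw": "appservers.svc.codfw.wmnet",
        },
    },
    "cxserver": {
        "route": "eqiad",
        "backends": {"eqiad": "cxserver.svc.eqiad.wmnet"},
    },
}


def selected_backend(apps, service):
    """Return the hostname chosen by the service's route key."""
    entry = apps[service]
    route = entry["route"]
    if route not in entry["backends"]:
        raise ValueError(f"{service} has no backend at {route}")
    return entry["backends"][route]


def services_missing_at(apps, site):
    """Return the services that define no backend at `site`."""
    return sorted(name for name, e in apps.items() if site not in e["backends"])


print(selected_backend(APPS, "appservers"))  # appservers.svc.eqiad.wmnet
print(services_missing_at(APPS, "codfw"))    # ['cxserver']
```

A check like services_missing_at shows which services could not follow a route change to codfw, since they currently list only an eqiad backend.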

Future directions

The current state of affairs is an iterative improvement on the previous situation, where making changes to our routing would have been a very complex, manual, and error-prone process. However, the current setup is still only an intermediate state on the way to better things. Specifically:

  1. Support split routing on a per-service basis: within a given cache cluster, some services will be (or already are!) ready for active:active split routing before others.
  2. Move routing metadata (for both the app layer and inter-cache routing) to etcd, so that route changes are no longer configuration changes through puppet, but are instead made via confctl commands.
  3. As with GeoDNS today, structure that data such that it's possible to simply mark a given site 'down' in one place and have the routing react as best it can, rather than having the administrator have to think through the implications of manual route changes.