Global traffic routing
This page covers our mechanisms for routing user requests through our Traffic infrastructure layers. The routing can be modified through administrative actions to improve performance and/or reliability, and/or respond to site/network outage conditions.
== Sites ==
There are currently four total sites involved. All four sites can receive direct user traffic; however, <code>eqiad</code> and <code>codfw</code> are ''Primary sites'' which also host application layer services, while <code>ulsfo</code> and <code>esams</code> are ''Edge sites'' which do not.
{{ClusterMap}}

== Global Routing Overview ==
[[File:WMF Global Traffic Routing.svg|frameless|768x768px]]

User traffic can enter through the front edge of any of the sites, and is then routed on to eventually reach an application service in a primary site.
Ideally all of our application-layer services operate in an active/active configuration like <code>App1</code> above, meaning they can directly accept user traffic in both primary sites simultaneously. Some application services are active/passive like <code>App2</code> above, meaning that at any given time they accept user traffic in only one of the two primary sites. Active/active services might also be temporarily configured to use only a single one of the primary sites for various operational maintenance or outage reasons.
In the active/active application's case (<code>App1</code> above), global traffic is effectively split and does not cross the inter-cache route between the two primary sites. Users whose traffic enters at either of <code>ulsfo</code> or <code>codfw</code> would reach the application service in <code>codfw</code>, and users whose traffic enters at <code>esams</code> or <code>eqiad</code> would reach the application service in <code>eqiad</code>.

When an application is active/passive, the primary site which does not have a direct route to the application configured (e.g. <code>eqiad</code> for <code>App2</code> above) will forward to the other primary site's cache to reach that application.
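The split described above can be illustrated with a minimal sketch (Python here purely for illustration; the nearest-primary mapping and the <code>App1</code>/<code>App2</code> active-site sets below are assumptions matching the examples on this page, not anything deployed):

<syntaxhighlight lang="python">
# Illustrative sketch: which primary site ends up serving a request,
# depending on where the user's traffic enters and whether the
# application is active/active or active/passive.

# Assumption: each entry site drains toward its "nearest" primary site.
NEAREST_PRIMARY = {"ulsfo": "codfw", "codfw": "codfw",
                   "esams": "eqiad", "eqiad": "eqiad"}

# Assumption: App1 is active/active, App2 is active only in codfw.
ACTIVE_SITES = {"App1": {"eqiad", "codfw"}, "App2": {"codfw"}}

def serving_primary(entry_site: str, app: str) -> str:
    """Return the primary site whose applayer handles the request."""
    primary = NEAREST_PRIMARY[entry_site]
    if primary in ACTIVE_SITES[app]:
        return primary  # active/active, or already at the active side
    # Active/passive: cross over to the one primary where the app is active.
    (only_active,) = ACTIVE_SITES[app]
    return only_active

if __name__ == "__main__":
    for app in ("App1", "App2"):
        for site in ("ulsfo", "codfw", "esams", "eqiad"):
            print(f"{app}: enters at {site} -> applayer in {serving_primary(site, app)}")
</syntaxhighlight>

Running it prints the site-split behaviour for <code>App1</code> and <code>codfw</code> for every entry point of <code>App2</code>, matching the description above.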
== GeoDNS (User-to-Edge Routing) ==

The first point of entry is when the client performs a DNS request on one of our public hostnames. Our authoritative DNS servers perform GeoIP resolution and hand out one of several distinct IP addresses, sending users approximately to their nearest site. We can disable sending users directly to a particular site through DNS configuration updates. Our DNS TTLs are commonly 10 minutes long, and some rare user caches will violate specs and cache them longer. The bulk of the traffic should switch inside of 10 minutes, though, with a fairly linear progression over that window.
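As a rough worked example of that drain-off (an assumption-laden sketch, not a measurement: it assumes resolver cache expiries are spread uniformly over the TTL window and that the 10-minute TTL is honoured):

<syntaxhighlight lang="python">
def residual_fraction(minutes_since_change: float, ttl_minutes: float = 10.0) -> float:
    """Fraction of users still reaching the old site, assuming resolver cache
    expiry times are spread uniformly across the TTL window."""
    if minutes_since_change >= ttl_minutes:
        return 0.0  # ignoring the rare spec-violating resolvers
    return 1.0 - minutes_since_change / ttl_minutes

# e.g. roughly half the traffic has moved after 5 minutes, all of it by 10
for t in (0, 2, 5, 8, 10, 15):
    print(f"t={t:>2} min: ~{residual_fraction(t):.0%} still on the old site")
</syntaxhighlight>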
=== Disabling a Site ===

To disable a site as an edge destination for user traffic in GeoDNS:

In the <code>operations/dns</code> repo, edit the file <code>admin_state</code>. There are instructions inside for complex changes, but for the basic operation of completely disabling a site, the line you need to add at the bottom for e.g. disabling <code>esams</code> is:

<syntaxhighlight>
geoip/generic-map/esams => DOWN
</syntaxhighlight>
... and then deploy the DNS change in the usual way: merge through gerrit, ssh to any '''one''' of our 3x authdns servers (<code>baham</code>, <code>radon</code>, and <code>eeden</code>), and execute <code>authdns-update</code> as root.
=== Hard enforcement of GeoDNS-disabled sites ===

In the case that we need to '''guarantee''' that zero requests are flowing into the user-facing edge of a disabled site for a given cache cluster (or all clusters), we can forcibly block all traffic at the front edge. This should only be done when strictly necessary, and only long after (e.g. 24 hours after) making the DNS switch above, to avoid impacting those with minor trailing DNS cache update issues. To lock traffic out of the frontends for a given cluster in a given site, you'll need to merge and deploy a puppet hieradata update which sets the key <code>cache::traffic_shutdown</code> to <code>true</code> for the applicable cluster/site combinations.

For example, to lock all traffic out of the text cluster in eqiad, add the following line to <code>hieradata/role/eqiad/cache/text.yaml</code>:
<syntaxhighlight lang="yaml">
cache::traffic_shutdown: true
</syntaxhighlight>
== Inter-cache (Inter-Site) Routing ==
Once a user's request has entered the front edge of our Traffic infrastructure through GeoDNS, inter-cache routing then takes place to route the request towards a primary site where the application service lives. The flow of traffic through our sites is currently controlled via hieradata. If one or more sites route their traffic '''through''' another site on their way to the app layer, and that site goes down, we need to re-route their traffic around it. Each cache cluster has its own routing table.
In the <code>operations/puppet</code> repo, there are per-cluster files <code>hieradata/role/common/cache/*.yaml</code> (there are currently 4 of them: text, upload, misc, maps).

There you'll see a cache route table mapping sources to destinations that looks like:

<syntaxhighlight lang="yaml">
cache::route_table:
  eqiad: 'codfw'
  codfw: 'eqiad'
  ulsfo: 'codfw'
  esams: 'eqiad'
</syntaxhighlight>
Note that the two ''primary'' sites (<code>eqiad</code> and <code>codfw</code>) intentionally route to each other in a loop. This is so that each can route to the other for services which are active/passive in only one of the primary sites. The ''edge'' sites (<code>ulsfo</code> and <code>esams</code>) should normally point at one of the ''primary'' sites (it is possible to point an edge at another edge and route through it, but that would be a rare operational scenario).
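When editing the table it can help to sanity-check the result before merging. Below is a minimal sketch of such a check (a hypothetical helper, not an existing tool; it only encodes the rules stated above: every site must eventually reach a primary, and only the two primaries form a loop):

<syntaxhighlight lang="python">
# Sketch of a sanity check for a proposed cache::route_table edit.
# Assumption: 'eqiad' and 'codfw' are the primary sites, as on this page.
PRIMARIES = {"eqiad", "codfw"}

def check_route_table(route_table: dict) -> list:
    """Return a list of problems; an empty list means the table looks sane."""
    problems = []
    for site in route_table:
        seen = set()
        current = site
        # Walk the route until we land on a primary site.
        while current not in PRIMARIES:
            if current in seen:
                problems.append(f"{site}: non-primary routing loop via {current}")
                break
            seen.add(current)
            current = route_table.get(current)
            if current is None:
                problems.append(f"{site}: route leads to an unknown site")
                break
    # The two primaries are expected to point at each other.
    if not all(route_table.get(p) in PRIMARIES - {p} for p in PRIMARIES):
        problems.append("primary sites should route to each other")
    return problems

print(check_route_table({
    "eqiad": "codfw", "codfw": "eqiad",
    "ulsfo": "codfw", "esams": "eqiad",
}))  # -> []
</syntaxhighlight>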
=== Disabling a Site ===

If an edge site is malfunctioning, it usually won't be the right-hand destination of any route, so there's no change to be made here.

If a primary site is malfunctioning, it should be removed from the right-hand destinations of edge sites.

'''The loop between the two primary sites should be left alone'''. Scenarios in which we might alter the loop between the primaries fall outside the scope of a simple instructional wiki page.
To disable routing through <code>codfw</code> due to malfunction, one would only need to change <code>ulsfo</code>'s entry, pointing it at <code>eqiad</code> instead:
<syntaxhighlight lang="yaml">
cache::route_table:
  eqiad: 'codfw'
  codfw: 'eqiad'
  ulsfo: 'eqiad' # was 'codfw', but changed due to codfw outage!
  esams: 'eqiad'
</syntaxhighlight>
After merging this through gerrit + puppet-merge, puppet agent needs to be run on the affected caches before this takes effect.
== Cache-to-application routing ==

The final step is routing requests out the back edge of the Traffic caching infrastructure into the application layer. The application layer services can exist at one or both of the two primary sites (<code>eqiad</code> and <code>codfw</code>) on a case-by-case basis. This is controlled by per-application routing entries found in the same hieradata files as inter-cache routing above.

In the <code>operations/puppet</code> repo, there are per-cluster files <code>hieradata/role/common/cache/*.yaml</code> (there are currently 4 of them: text, upload, misc, maps).
Within these files, underneath the <code>cache::app_directors</code> key, you will see one stanza per application layer service used by each cluster. Within each application service's stanza, a <code>backends</code> map defines the available hostnames for this service at <code>eqiad</code> and/or <code>codfw</code>. Ideally all services should exist active/active at both, but currently many are active/passive instead. For active/passive services with a hot standby available, the inactive side will probably already be specified in the hieradata file but commented out, to make changes easier.
Example of current <code>cache::app_directors</code> stanza for the text cluster, with all services active/passive (most active only in <code>eqiad</code>, but <code>appservers_debug</code> active only in <code>codfw</code>):
<syntaxhighlight lang="yaml">
cache::app_directors:
  appservers:
    backends:
      eqiad: 'appservers.svc.eqiad.wmnet'
      # codfw: 'appservers.svc.codfw.wmnet'
  api:
    backends:
      eqiad: 'api.svc.eqiad.wmnet'
      # codfw: 'api.svc.codfw.wmnet'
  rendering:
    backends:
      eqiad: 'rendering.svc.eqiad.wmnet'
      # codfw: 'rendering.svc.codfw.wmnet'
  security_audit:
    backends:
      eqiad: 'appservers.svc.eqiad.wmnet'
      # codfw: 'appservers.svc.codfw.wmnet'
  appservers_debug:
    be_opts:
      max_connections: 20
    backends:
      # eqiad: 'hassium.eqiad.wmnet'
      codfw: 'hassaleh.codfw.wmnet'
  restbase_backend:
    be_opts:
      port: 7231
      max_connections: 5000
    backends:
      eqiad: 'restbase.svc.eqiad.wmnet'
      # codfw: 'restbase.svc.codfw.wmnet'
  cxserver_backend:
    be_opts:
      port: 8080
    backends:
      eqiad: 'cxserver.svc.eqiad.wmnet'
      # codfw: 'cxserver.svc.codfw.wmnet'
  citoid_backend:
    be_opts:
      port: 1970
    backends:
      eqiad: 'citoid.svc.eqiad.wmnet'
      # codfw: 'citoid.svc.codfw.wmnet'
</syntaxhighlight>
Within each <code>backends</code> stanza, the primary site listed on the left names the site where the traffic would exit the cache layer, and the hostname on the right is the applayer hostname it will contact to do so. The code which operates on this data doesn't care whether the hostname on the right actually resides in the site named on the left. This allows for interesting operational possibilities such as:
<syntaxhighlight lang="yaml">
cache::app_directors:
  appservers:
    backends:
      eqiad: 'appservers.svc.eqiad.wmnet'
      codfw: 'appservers.svc.eqiad.wmnet'
</syntaxhighlight>
This would cause inter-cache routing to behave like an active/active service (dropping from the cache to the applayer directly at both primary sites), but both sites' caches will contact only the eqiad applayer service. This is not how we would prefer to operate under normal conditions, but it can be a useful step during complex transitions and testing.
'''Important Caveat:''' Because changes to this configuration roll out asynchronously to many cache hosts, swapping a single-site backends list from one primary site to the other in a single commit step will cause temporary traffic-routing loops as caches with different versions of the configuration forward traffic to each other. The caches will detect the looping requests immediately and return HTTP error code <code>508 Loop Detected</code> for the affected requests, causing a spike in user-facing errors until the situation resolves itself a short time later, when the async config deployment process finishes. To avoid this, it's best to do an intermediate commit which enables both primary sites' caches to reach the application layer. In other words, you want this sequence of states to get from <code>eqiad</code>-only to <code>codfw</code>-only (a small simulation of the failure mode follows these examples):
Initial State:
<syntaxhighlight lang="yaml">
backends:
  eqiad: 'appservers.svc.eqiad.wmnet'
  # codfw: 'appservers.svc.codfw.wmnet'
</syntaxhighlight>

Intermediate State (temporarily active/active):
<syntaxhighlight lang="yaml">
backends:
  eqiad: 'appservers.svc.eqiad.wmnet'
  codfw: 'appservers.svc.codfw.wmnet'
</syntaxhighlight>

Final State:
<syntaxhighlight lang="yaml">
backends:
  # eqiad: 'appservers.svc.eqiad.wmnet'
  codfw: 'appservers.svc.codfw.wmnet'
</syntaxhighlight>
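To see why the intermediate active/active step matters, here is a small simulation sketch (illustrative only: it reduces each primary site to a single cache and reuses the <code>appservers</code> hostnames from the examples above) of what happens while the two sites temporarily hold different revisions of the configuration:

<syntaxhighlight lang="python">
# During an async rollout, one primary site's caches may already have the new
# configuration while the other site's caches still have the old one.
ROUTE_TABLE = {"eqiad": "codfw", "codfw": "eqiad"}

def handle(site, backends_by_site, hops=0):
    """Follow a request through the (simplified, one-cache-per-site) primaries."""
    if hops > 4:
        return "508 Loop Detected"
    cfg = backends_by_site[site]      # the backends list *this* cache currently has
    if site in cfg:
        return f"applayer via {cfg[site]}"
    return handle(ROUTE_TABLE[site], backends_by_site, hops + 1)

# One-step swap, mid-rollout: the eqiad cache already has the new codfw-only
# config, while the codfw cache still has the old eqiad-only config, so each
# forwards to the other indefinitely.
mixed_bad = {
    "eqiad": {"codfw": "appservers.svc.codfw.wmnet"},   # new config
    "codfw": {"eqiad": "appservers.svc.eqiad.wmnet"},   # old config
}
print(handle("eqiad", mixed_bad))   # -> 508 Loop Detected

# With the intermediate active/active commit in the mix, any combination of
# old/intermediate configs can still exit to the applayer without looping.
mixed_ok = {
    "eqiad": {"eqiad": "appservers.svc.eqiad.wmnet",
              "codfw": "appservers.svc.codfw.wmnet"},   # intermediate config
    "codfw": {"eqiad": "appservers.svc.eqiad.wmnet"},   # still old config
}
print(handle("codfw", mixed_ok))    # -> applayer via appservers.svc.eqiad.wmnet
</syntaxhighlight>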
== A code-level view of inter-cache and cache->app routing ==
For some readers, the details of inter-cache and cache->app routing are easier to understand as pseudo-code operating on the hieradata described above.
Each cache handles requests according to the following pseudo-code logic:
<syntaxhighlight>
$req           = <incoming request from user or forwarded from another cache>
$route_table   = <hieradata cache::route_table for this cache cluster>
$app_directors = <hieradata cache::app_directors for this cache cluster>
$req_handling  = <hieradata cache::req_handling for this cache cluster>
$my_site       = <the local site name>

# Work out which application service this request is for.
$which_app = parse($req, $req_handling);

if ($app_directors[$which_app].has_key?($my_site)) {
    # The application is reachable from this site: exit to the applayer here.
    send_to_applayer_at_hostname($req, $app_directors[$which_app][$my_site])
} else {
    # Otherwise, forward to the next cache site per the route table.
    forward_to_another_cache($req, $route_table[$my_site])
}
</syntaxhighlight>
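The same logic as a runnable Python sketch (the route table, app directors, and hostname-to-application mapping below are toy stand-ins for the hieradata, and <code>parse()</code> is reduced to a plain lookup; this is not the actual VCL or puppet code), extended with a small driver that traces a request from its entry site all the way to the applayer:

<syntaxhighlight lang="python">
# Toy stand-ins for the hieradata this page describes.
ROUTE_TABLE = {"eqiad": "codfw", "codfw": "eqiad",
               "ulsfo": "codfw", "esams": "eqiad"}

APP_DIRECTORS = {
    # active/passive: only reachable from the eqiad caches
    "appservers": {"eqiad": "appservers.svc.eqiad.wmnet"},
    # active/active: reachable from both primary-site caches
    "restbase_backend": {"eqiad": "restbase.svc.eqiad.wmnet",
                         "codfw": "restbase.svc.codfw.wmnet"},
}

# Reduced req_handling: map a request's Host header to an application.
REQ_HANDLING = {"text.example.org": "appservers",
                "rest.example.org": "restbase_backend"}

def handle_request(host: str, my_site: str, hops: int = 0) -> str:
    """One cache's decision, applied repeatedly until the applayer is reached."""
    which_app = REQ_HANDLING[host]
    backends = APP_DIRECTORS[which_app]
    if my_site in backends:
        return f"applayer at {backends[my_site]} (after {hops} inter-cache hops)"
    # No local applayer route: forward to the next cache per the route table.
    return handle_request(host, ROUTE_TABLE[my_site], hops + 1)

for entry in ("ulsfo", "esams", "codfw", "eqiad"):
    print(f"{entry}: {handle_request('text.example.org', entry)}")
    print(f"{entry}: {handle_request('rest.example.org', entry)}")
</syntaxhighlight>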
== Future directions ==
The current state of affairs is an iterative improvement on the previous situation, but there's still a ways to go! We're still missing some simplification of process, and the most important remaining piece of the puzzle is moving all of these routing-state controls to etcd/confctl, so that changes no longer require the (much slower and task-inappropriate) full configuration commit->deploy process they do today.
[[Category:Caching]]