
Global traffic routing

This page covers our mechanisms for routing user requests through our Traffic infrastructure layers. The routing can be modified through administrative actions to improve performance and/or reliability, and/or to respond to site/network outage conditions.
== Sites ==


There are currently [[Data centers|six total data centers]] involved.  All locations can receive direct user traffic; however, <code>eqiad</code> and <code>codfw</code> also host ''Core application services'', whereas <code>ulsfo</code>, <code>esams</code>, <code>drmrs</code>, and <code>eqsin</code> are limited to ''Edge caching''.


{{ClusterMap}}


== Global Routing Overview ==


User traffic can enter through the front edge of any of the sites, and is then routed on to eventually reach an application service in a primary site (either eqiad or codfw).

Ideally all of our application-layer services operate in an active/active configuration, meaning they can directly accept user traffic in both primary sites simultaneously.  Some application services are active/passive, meaning that they accept user traffic in only one of the primary sites at any given time.  Active/active services might also be temporarily configured to use only one of the primary sites for operational maintenance or outage reasons.

In the active/active case, global traffic is effectively split: for example, users whose traffic enters at <code>ulsfo</code> or <code>codfw</code> reach the application service in <code>codfw</code>, while users whose traffic enters at <code>esams</code> or <code>eqiad</code> reach the application service in <code>eqiad</code>.

== GeoDNS (User-to-Edge Routing) ==
The first point of entry is when the client performs a DNS request on one of our public hostnames.  Our authoritative DNS servers perform GeoIP resolution and hand out one of several distinct IP addresses, sending users approximately to their nearest site.  We can disable sending users directly to a particular site through DNS configuration updates.  Our DNS TTLs are commonly 10 minutes long, and some rare user caches will violate specs and cache them longer.  The bulk of the traffic should switch inside of 10 minutes, though, with a fairly linear progression over that window.
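
To see roughly what GeoDNS is doing from any vantage point, you can compare the answer handed out by the authoritative servers with whatever your local resolver has cached. This is only an illustrative sketch using standard <code>dig</code>; <code>ns0.wikimedia.org</code> is assumed here to be one of the public authoritative nameservers, and the addresses returned depend entirely on where the query originates.

<syntaxhighlight lang="bash">
# Query an authoritative server directly; the answer is GeoIP-mapped based on
# the source address of the query.
dig +short en.wikipedia.org @ns0.wikimedia.org

# Compare with the (possibly still cached) answer from the local resolver.
dig +short en.wikipedia.org
</syntaxhighlight>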


=== Disabling a Site ===


To disable a site as an edge destination for user traffic in GeoDNS:
Downtime the matching site alert in https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=traffic+drop


In the <code>operations/dns</code> repo, edit the file <code>admin_state</code>. There are instructions inside for complex changes, but for the basic operation of completely disabling a site, the line you need to add at the bottom for e.g. disabling esams is:
   geoip/generic-map/esams => DOWN


... and then deploy the DNS change in the usual way: merge through gerrit, ssh to any '''one''' of our authdns servers (<code>authdns[12]001.wikimedia.org</code>), and execute <code>authdns-update</code> as root.
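
As a concrete sketch of that deployment step (the host below is one of the two matched by <code>authdns[12]001.wikimedia.org</code>; <code>sudo</code> access is assumed):

<syntaxhighlight lang="bash">
# After the change has been merged in gerrit, from any one authdns host:
ssh authdns1001.wikimedia.org
sudo authdns-update
</syntaxhighlight>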


=== Hard enforcement of GeoDNS-disabled sites ===


In the case that we need to '''guarantee''' that zero requests are flowing into the user-facing edge of a disabled site for a given cache cluster (or all clusters), we can forcibly block all traffic at the front edge.  This should only be done when strictly necessary, and only long after (e.g. 24H after) making the DNS switch above, to avoid impacting those with minor trailing DNS cache update issues.  To lock traffic out of the cache frontends for a given cluster in a given site, you'll need to merge and deploy a puppet hieradata update which sets the key <code>cache::traffic_shutdown</code> to <code>true</code> for the applicable cluster/site combinations.


For example, to lock all traffic out of the text cluster in eqiad, add the following line to <code>hieradata/role/eqiad/cache/text.yaml</code>:


<syntaxhighlight lang="yaml">
cache::traffic_shutdown: true
</syntaxhighlight>


Once the change is merged and applied to the nodes with puppet, all requests sent to eqiad will get an HTTP 403 response from the cache frontends instead of being served from cache or routed to the appropriate origin server.
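
To spot-check the lockout from the outside, pin a request to the eqiad text frontends and look at the status line. This is a sketch only: the <code>text-lb.eqiad.wikimedia.org</code> hostname is assumed from the usual <code>text-lb.&lt;site&gt;.wikimedia.org</code> frontend naming.

<syntaxhighlight lang="bash">
# Force the connection to the eqiad text frontends regardless of GeoDNS and
# expect a 403 status line while the lockout is in place.
curl -sI https://en.wikipedia.org \
  --connect-to en.wikipedia.org:443:text-lb.eqiad.wikimedia.org:443 | head -1
</syntaxhighlight>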


== Cache-to-application routing ==
 
Upon entering a given data center, HTTP requests reach a cache frontend host running Varnish. At this layer, caching is controlled by either the <code>cache::req_handling</code> or the <code>cache::alternate_domains</code> hiera setting. The former is used by the main sites such as the wikis and upload.wikimedia.org, while the latter is used by miscellaneous sites such as [[Phabricator|phabricator.wikimedia.org]] and [[grafana.wikimedia.org]]. Which of the two to use depends on whether the site is handled by the regular or the misc VCL; new services almost always fall under misc, so they should generally be added to <code>cache::alternate_domains</code>. If in doubt, contact the traffic team. The format of both data structures is:


<syntaxhighlight lang="yaml">
cache::alternate_domains:
  hostname1:
    caching: 'normal'
  hostname2:
    caching: 'pass'
</syntaxhighlight>


In Puppet terms, these structures have a dedicated data type: <code>[https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/types/cache/sites.pp Profile::Cache::Sites]</code>. The <code>caching</code> attribute is particularly interesting; see its [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/types/cache/caching.pp type definition].
A value of '''normal''' for the <code>caching</code> attribute means that Varnish will cache responses for this site unless '''Cache-Control''' says otherwise. Conversely, '''pass''' means that objects for this site are never cached. It is preferable to specify '''normal''' and ensure that the origin returns '''Cache-Control''' with appropriate values for responses that should not be cached, but where this is not possible '''pass''' can be used. For sites that need to support websockets, such as Phabricator and Etherpad, use '''websockets'''. A sample of the production values for <code>cache::alternate_domains</code> as of July 2020 follows.


<syntaxhighlight lang="yaml">
cache::alternate_domains:
  15.wikipedia.org:
    caching: 'normal'
  analytics.wikimedia.org:
    caching: 'normal'
  annual.wikimedia.org:
    caching: 'normal'
  blubberoid.wikimedia.org:
    caching: 'pass'
  bienvenida.wikimedia.org:
    caching: 'normal'
  etherpad.wikimedia.org:
    caching: 'websockets'
</syntaxhighlight>


If there is no cache hit at the frontend layer, requests are sent to a cache backend running [[ATS]] in the same DC. Backend selection is done by applying consistent hashing to the request URL. If there is no cache hit at the backend layer either, the final step is routing the request out the back edge of the Traffic caching infrastructure into the application layer.  The application layer services can exist at one or both of the two primary sites (<code>eqiad</code> and <code>codfw</code>) on a case-by-case basis.  This is controlled by ATS remap rules mapping the '''Host''' header to a given origin server hostname. The hiera setting controlling the rules is <code>profile::trafficserver::backend::mapping_rules</code>, and for production it is specified in <code>hieradata/common/profile/trafficserver/backend.yaml</code>. For most services, whether the service is active/active or active/passive is configured via [[DNS/Discovery]]. The exception to this rule is services that are available in only one primary DC, such as pivot (eqiad-only) in the example below:


<syntaxhighlight lang="yaml">
profile::trafficserver::backend::mapping_rules:
    - type: map
      target: http://15.wikipedia.org
      replacement: https://webserver-misc-apps.discovery.wmnet
    - type: map
      target: http://phabricator.wikimedia.org
      replacement: https://phabricator.discovery.wmnet
    - type: map
      target: http://pivot.wikimedia.org
      replacement: https://an-tool1007.eqiad.wmnet
</syntaxhighlight>


Any administrative action such as depooling a primary site for active/active services, or moving an active/passive service from one primary DC to the other, can be performed via [[DNS/Discovery#How_to_manage_a_DNS_Discovery_service|DNS discovery updates]].
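
For example, depooling one primary site for an active/active service typically boils down to a single conftool command against its discovery object. The snippet below is a sketch only: <code>examplesvc</code> is a placeholder, and the authoritative procedure and current syntax are documented on the [[DNS/Discovery]] page.

<syntaxhighlight lang="bash">
# Sketch: depool codfw for a hypothetical active/active service "examplesvc",
# then pool it again once codfw is healthy.
confctl --object-type discovery select 'dnsdisc=examplesvc,name=codfw' set/pooled=false
confctl --object-type discovery select 'dnsdisc=examplesvc,name=codfw' set/pooled=true
</syntaxhighlight>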


When adding a new service to <code>profile::trafficserver::backend::mapping_rules</code>, ensure that the public hostname (i.e. the hostname component of <code>target</code>) is included in the Subject Alternative Name (SAN) list of the certificate served by <code>replacement</code>. This is needed for ATS to successfully establish a TLS connection to the origin server.


The following command provides an example for how to verify that the hostname '''phabricator.wikimedia.org''' is included in the SAN of the certificate offered by '''phabricator.discovery.wmnet''':


<syntaxhighlight lang="bash">
$ echo | openssl s_client -connect phabricator.discovery.wmnet:443 2>&1 | openssl x509 -noout -text | grep -q DNS:phabricator.wikimedia.org && echo OK || echo KO
OK
</syntaxhighlight>


If the above command fails, you might have to update the origin server certificate to include the public hostname. See [[Cergen]].
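
To inspect exactly which names the origin's current certificate covers, the SAN list can be printed directly. This uses only standard openssl options (<code>-ext</code> requires OpenSSL 1.1.1 or newer):

<syntaxhighlight lang="bash">
# Print the Subject Alternative Name list of the certificate currently served
# by the origin behind the discovery hostname.
echo | openssl s_client -connect phabricator.discovery.wmnet:443 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName
</syntaxhighlight>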


To further verify that HTTPS requests are served properly by the configured origin, and everything works including the TLS handshake:


<syntaxhighlight lang="bash">
# get the IP address of phabricator.discovery.wmnet
$ host phabricator.discovery.wmnet
phabricator.discovery.wmnet is an alias for phab1001.eqiad.wmnet.
phab1001.eqiad.wmnet has address 10.64.16.8
# test an HTTPS request
$ curl -I https://phabricator.wikimedia.org --resolve phabricator.wikimedia.org:443:10.64.16.8
HTTP/1.1 200 OK
[...]
</syntaxhighlight>


[[Category:Caching]]
