You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Difference between revisions of "Anycast"

From Wikitech-static
imported>Ayounsi
 
Line 1: Line 1:
Still WIP
== External ==
In discussion: https://phabricator.wikimedia.org/T98006


== Internal ==
Line 6: Line 7:
[[Anycast recursive DNS]]


=== How does it work? ===
=== How? ===
<br />
 
==== How does it work? ====
https://en.wikipedia.org/wiki/Anycast
 
* The VIP (virtual IP) is configured on the server's loopback
* Bird (routing daemon) advertises the VIP to the routers using [[:en:Border_Gateway_Protocol|BGP]]
* (optional) A [[:en:Bidirectional_Forwarding_Detection|BFD]] session is established between Bird and the routers to ensure fast failover in case of server or link failure
* [https://github.com/unixsurfer/anycast_healthchecker Anycast_healthchecker] monitors the local (anycasted) service by querying it every second
* If a service failure is detected, the VIP stops being advertised to the routers
* When the service is restored, anycast_healthchecker waits 10s before re-advertising the IP to avoid flaps
* The bird service is bound (systemd bind) to the anycast_healthchecker service, so bird is stopped if anycast_healthchecker crashes or is not running
* The time between a local service failure and clients being redirected to a different server (advertising the same VIP) is at most 1s
* All servers advertise the same VIP worldwide. Clients are routed to the closest server in the BGP sense (same DC, then shorter AS path, etc.), not the one with the lowest latency
* Routers do per-flow load balancing (ECMP) between all local (same site) servers. Hashing is done on IP and port (L4)
* As a last-resort backup, in case all servers stop advertising the VIP (e.g. global misconfiguration), the eqiad and codfw routers have less specific (/30) backup static routes pointing to their local servers


=== How to deploy a new service? ===
==== How to deploy a new service? ====


# Assign an IP in DNS, from the 10.3.0.0/24 range (e.g. [[gerrit:c/operations/dns/+/524045|Gerrit CR 524045]])
# Configure the server side (e.g. [[gerrit:c/operations/puppet/+/524037|Gerrit CR 524037]])
## Add <code>include ::profile::bird::anycast</code> where you see fit (usually to the service's role)
## Configure the VIP and its attributes (usually <code>hieradata/role/common/</code>)<syntaxhighlight lang="yaml" line="1">
profile::bird::advertise_vips:
  <vip_fqdn>:  # used as identifier
    address: 10.3.x.x # VIP to advertise
    check_cmd: '/bin/true' # Any command to check the health of the service
</syntaxhighlight>Notes:
##*check_cmd is run once per second as user "bird"
##*anycast-healthchecker uses the return code of the health-check script: 0 = good, anything else is considered a failure
# Configure the router side:
##<code>set protocols bgp group Anycast4 neighbor <server_IP></code>
# Add monitoring to the VIP, similar to any Icinga checks, but in [[phab:source/operations-puppet/browse/production/modules/profile/manifests/bird/anycast_monitoring.pp|modules/profile/manifests/bird/anycast_monitoring.pp]]
# (Optional) if deploying a new type of service, ask Netops to add a backup static route
 
==== What other configuration bits are relevant? ====
Hiera keys:<syntaxhighlight lang="yaml">
# service to bind bird to. Usually the anycast-healthchecker
# this means that if anycast-healthchecker crashes, Bird will stop as well
# Usually set globally for Bird
profile::bird::bind_service: 'anycast-healthchecker.service'
 
# Router IPs with which Bird establishes BGP sessions
# Usually set per site
profile::bird::neighbors_list:
  - routerIP
  - other_router_IP
 
# Usually set per service (role)
# But can be set for a specific host as well, for example to specifically remove the VIP from a host to be decommissioned.
profile::bird::advertise_vips:
   <vip_fqdn>: # Used as identifier
     address: 10.3.x.x # VIP to advertise (required)
     check_cmd: '/bin/true' # Any command to check the health of the service, run as user "bird" once per second (required)
     ensure: present # Set to absent to cleanly remove the check (optional, present by default)
     bfd: true # Fast failure detection between router and server (optional, true by default)
profile::bird::bind_service: 'foobar.service' # Stop bird if linked service goes down (optional, none by default)
</syntaxhighlight>
</syntaxhighlight>Some notes:
 
##* The check_cmd needs to run in less than 1s (check interval)
==== How are the routers configured? ====
<syntaxhighlight lang="bash" line="1">
# show protocols bgp group Anycast4
type external;
/* T209989 */
multihop {
    ttl 193;
}
local-address 208.80.153.193; # Router's loopback
import anycast_import;  # See below
family inet {
    unicast {
        prefix-limit {
            maximum 50; # Take the session down if more than 50 prefixes
            teardown;  # learned from the servers (e.g. misconfiguration)
        }
    }
}
export NONE;
peer-as 64605;  # Server's ASN
bfd-liveness-detection {
    minimum-interval 300; # Take the session down after 3*300ms failures
}
multipath;  # Enable load balancing (remove for active/passive)
neighbor 208.80.153.111;  # Server IPs
neighbor 208.80.153.77; 
 
 
# show policy-options policy-statement anycast_import
term anycast4 {
    from {
        prefix-list-filter anycast-internal4 longer; # Only accept prefixes in the defined range
    }
    then accept;
}
then reject;
 
# show policy-options prefix-list anycast-internal4     
10.3.0.0/24;


# show routing-options static route 10.3.0.0/30
next-hop 208.80.153.111;
readvertise;
no-resolve;
</syntaxhighlight>
====How to monitor anycast_healthchecker logs?====
<code>/var/log/anycast-healthchecker/anycast-healthchecker.log</code>
<br />
<br />
====How to know which routes a router takes to a specific VIP?====
Here both next hops (servers) are load balanced, as they are under the same *[BGP] block.<syntaxhighlight lang="bash" line="1">
> show route 10.3.0.1
10.3.0.1/32        *[BGP/170] 1w4d 08:54:21, localpref 100, from 208.80.153.77
                      AS path: 64605 I, validation-state: unverified
                      to 208.80.153.77 via ae3.2003
                    > to 208.80.153.111 via ae4.2004
</syntaxhighlight>MTR can also be used for a coarser, per-site view, e.g.:<syntaxhighlight lang="bash">
bast5001:~$ mtr 10.3.0.1 --report
Start: Fri Apr  5 16:48:21 2019
HOST: bast5001                    Loss%  Snt  Last  Avg  Best  Wrst StDev
  1.|-- ae1-510.cr2-eqsin.wikimed  0.0%    10    0.3  0.7  0.2  4.4  1.1
  2.|-- ae0.cr1-eqsin.wikimedia.o  0.0%    10    0.2  1.0  0.2  7.8  2.3
  3.|-- xe-5-1-2.cr1-codfw.wikime  0.0%    10  195.1 195.3 195.1 196.5  0.3
  4.|-- recdns.anycast.wmnet      0.0%    10  195.1 195.1 195.1 195.1  0.0
</syntaxhighlight>
====How to temporarily depool a server====
Disable Puppet, stop the bird service.
====How to long term depool a server====
Several options:
* Deactivate the neighbor IP on the router side
* (Cleaner) Add a specific <code>profile::bird::advertise_vips</code> with the same identifier to the server, and <code>check_cmd: /bin/false</code> or <code>ensure: absent</code>
=== Limitations ===
*The servers monitor themselves: if a server fails in a way where BGP stays up but DNS is unreachable from outside via the VIP (e.g. iptables), this will cause an outage
*By the nature of Anycast, Icinga only checks the health of the VIP instance closest to it
**This could be worked around by checking the anycasted service health from various vantage points (e.g. bastion hosts)
**Health checks to the servers' real IPs still work
=== Future evolution ===
*IPv6 is supported by both Bird and anycast_healthchecker, but not implemented in Puppet (no current need)
*Upgrade anycast_healthchecker to 0.9.0 or more recent (and rollback https://gerrit.wikimedia.org/r/c/operations/puppet/+/520643)
*Implement BGP graceful shutdown on the server side to drain traffic before depooling
*Send anycast_healthchecker logs to central syslog server
*Use BGP metrics to influence anycast routing (e.g. if eqiad's resolvers fail, send eqiad traffic to codfw rather than esams)
*Investigate BGP routing policies between sites (e.g. eqiad only sends public prefixes to esams via BGP) - [[phab:T227808|T227808]]
*https://packages.debian.org/buster/prometheus-bird-exporter

Revision as of 20:02, 25 July 2019

External

In discussion: https://phabricator.wikimedia.org/T98006

Internal

In production

Anycast recursive DNS

How?

How does it work?

https://en.wikipedia.org/wiki/Anycast

  • The VIP (virtual IP) is configured on the server's loopback
  • Bird (routing daemon) advertises the VIP to the routers using BGP
  • (optional) A BFD session is established between Bird and the routers to ensure fast failover in case of server or link failure
  • Anycast_healthchecker monitors the local (anycasted) service by querying it every second
  • If a service failure is detected, the VIP stops being advertised to the routers
  • When the service is restored, anycast_healthchecker waits 10s before re-advertising the IP to avoid flaps
  • The bird service is bound (systemd bind) to the anycast_healthchecker service, so bird is stopped if anycast_healthchecker crashes or is not running
  • The time between a local service failure and clients being redirected to a different server (advertising the same VIP) is at most 1s
  • All servers advertise the same VIP worldwide. Clients are routed to the closest server in the BGP sense (same DC, then shorter AS path, etc.), not the one with the lowest latency
  • Routers do per-flow load balancing (ECMP) between all local (same site) servers. Hashing is done on IP and port (L4)
  • As a last-resort backup, in case all servers stop advertising the VIP (e.g. global misconfiguration), the eqiad and codfw routers have less specific (/30) backup static routes pointing to their local servers
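The withdraw and re-advertise behaviour described above can be sketched as a small state machine. This is only an illustration, not anycast_healthchecker's actual code: the class name `VipState`, the helper `run_check`, and the assumption that the 10s wait restarts whenever a check fails again are all hypothetical.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class VipState:
    """Tracks whether a VIP should currently be advertised (hypothetical model)."""
    hold_down: float = 10.0                 # seconds of sustained health before re-advertising
    advertised: bool = True                 # assumption: the VIP starts out advertised
    healthy_since: Optional[float] = None   # start of the current healthy streak

    def update(self, healthy: bool, now: float) -> bool:
        """Feed one check result taken at time `now`; return whether to advertise."""
        if not healthy:
            # Any failure withdraws the VIP immediately and resets the streak.
            self.advertised = False
            self.healthy_since = None
        else:
            if self.healthy_since is None:
                self.healthy_since = now
            # Re-advertise only after the service has stayed healthy long enough.
            if now - self.healthy_since >= self.hold_down:
                self.advertised = True
        return self.advertised


def run_check(check_cmd: str) -> bool:
    """Exit code 0 means healthy; any other code, or a timeout, is a failure."""
    try:
        return subprocess.run(check_cmd, shell=True, timeout=1).returncode == 0
    except subprocess.TimeoutExpired:
        return False
```

A driver loop would call `state.update(run_check(cmd), time.monotonic())` once per second and add or remove the VIP from Bird's advertised prefixes whenever the returned value changes.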

How to deploy a new service?

  1. Assign an IP in DNS, from the 10.3.0.0/24 range - (eg. Gerrit CR 524045)
  2. Configure the server side (eg. Gerrit CR 524037)
    1. Add include ::profile::bird::anycast where you see fit (usually to the service's role)
    2. Configure the VIP and its attributes (usually hieradata/role/common/)
      profile::bird::advertise_vips:
        <vip_fqdn>:  # used as identifier
          address: 10.3.x.x # VIP to advertise
          check_cmd: '/bin/true' # Any command to check the health of the service
      
      Notes:
      • check_cmd is run once per second as user "bird"
      • anycast-healthchecker uses the return code of the health-check script: 0 = good, anything else is considered a failure
  3. Configure the router side:
    1. set protocols bgp group Anycast4 neighbor <server_IP>
  4. Add monitoring to the VIP, similar to any Icinga checks, but in modules/profile/manifests/bird/anycast_monitoring.pp
  5. (Optional) if deploying a new type of service, ask Netops to add a backup static route
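As an illustration of what a `check_cmd` might look like in place of `/bin/true`, here is a minimal sketch of a TCP probe. The script, its function names and its arguments are hypothetical and not part of the Puppet module; the 0.5s timeout is an assumption chosen so the check finishes well inside the one-second check interval.

```python
import socket
import sys


def port_is_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts and timeouts.
        return False


def main(argv):
    """Map the probe result to an exit code: 0 = healthy, 1 = unhealthy, 2 = usage error."""
    if len(argv) != 3:
        print(f"usage: {argv[0]} HOST PORT", file=sys.stderr)
        return 2
    return 0 if port_is_open(argv[1], int(argv[2])) else 1
```

Saved as a script that ends with `sys.exit(main(sys.argv))`, this could be referenced from `check_cmd` with a host and port as arguments. A check that exercises the service itself (for example a real DNS query for a resolver) would be a stronger signal than a bare TCP connect.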

What other configuration bits are relevant?

Hiera keys:

# service to bind bird to. Usually the anycast-healthchecker
# this means that if anycast-healthchecker crashes, Bird will stop as well
# Usually set globally for Bird
profile::bird::bind_service: 'anycast-healthchecker.service'

# Router IPs with which Bird establishes BGP sessions
# Usually set per site
profile::bird::neighbors_list:
  - routerIP
  - other_router_IP

# Usually set per service (role)
# But can be set for a specific host as well, for example to specifically remove the VIP from a host to be decommissioned.
profile::bird::advertise_vips:
  <vip_fqdn>: # Used as identifier
    address: 10.3.x.x # VIP to advertise (required)
    check_cmd: '/bin/true' # Any command to check the health of the service, run as user "bird" once per second (required)
    ensure: present # Set to absent to cleanly remove the check (optional, present by default)
    bfd: true # Fast failure detection between router and server (Optional, true by default)
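The required and optional attributes of `profile::bird::advertise_vips` can be illustrated with a small validation sketch. This is not how Puppet processes the Hiera data; `normalize_vip` is a hypothetical helper that only restates the rules from the comments above (required `address` and `check_cmd`, `ensure` defaulting to `present`, `bfd` defaulting to `true`) plus the 10.3.0.0/24 range from the deploy steps.

```python
import ipaddress

ANYCAST_RANGE = ipaddress.ip_network("10.3.0.0/24")  # range named in the deploy steps
DEFAULTS = {"ensure": "present", "bfd": True}        # defaults stated in the Hiera comments


def normalize_vip(name, attrs):
    """Validate one advertise_vips entry and fill in the documented defaults."""
    missing = [key for key in ("address", "check_cmd") if key not in attrs]
    if missing:
        raise ValueError(f"{name}: missing required key(s): {', '.join(missing)}")

    address = ipaddress.ip_address(attrs["address"])
    if address not in ANYCAST_RANGE:
        raise ValueError(f"{name}: {address} is outside {ANYCAST_RANGE}")

    if attrs.get("ensure", "present") not in ("present", "absent"):
        raise ValueError(f"{name}: ensure must be 'present' or 'absent'")

    # Explicit attributes override the defaults.
    return {**DEFAULTS, **attrs, "address": str(address)}
```

For example, `normalize_vip("recdns.anycast.wmnet", {"address": "10.3.0.1", "check_cmd": "/bin/true"})` returns the entry with `ensure` and `bfd` filled in.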

How are the routers configured?

# show protocols bgp group Anycast4 
type external;
/* T209989 */
multihop {
    ttl 193;
}
local-address 208.80.153.193; # Router's loopback
import anycast_import;  # See below
family inet {
    unicast {
        prefix-limit {
            maximum 50; # Take the session down if more than 50 prefixes
            teardown;  # learned from the servers (e.g. misconfiguration)
        }
    }
}
export NONE;
peer-as 64605;  # Server's ASN
bfd-liveness-detection {
    minimum-interval 300; # Take the session down after 3*300ms failures
}
multipath;  # Enable load balancing (remove for active/passive)
neighbor 208.80.153.111;  # Server IPs
neighbor 208.80.153.77;   


# show policy-options policy-statement anycast_import 
term anycast4 {
    from {
        prefix-list-filter anycast-internal4 longer; # Only accept prefixes in the defined range
    }
    then accept;
}
then reject;

# show policy-options prefix-list anycast-internal4      
10.3.0.0/24;

# show routing-options static route 10.3.0.0/30 
next-hop 208.80.153.111;
readvertise;
no-resolve;
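The effect of the import policy and of the backup static route can be sketched with Python's `ipaddress` module. This models the behaviour, not Junos itself: the function names are hypothetical, and the sketch assumes `longer` matches only routes strictly more specific than the listed prefix.

```python
import ipaddress

ANYCAST_INTERNAL4 = [ipaddress.ip_network("10.3.0.0/24")]  # prefix-list anycast-internal4


def import_accepts(prefix):
    """Accept only routes strictly more specific than a prefix in the list."""
    net = ipaddress.ip_network(prefix)
    return any(
        net.subnet_of(allowed) and net.prefixlen > allowed.prefixlen
        for allowed in ANYCAST_INTERNAL4
    )


def best_route(destination, routes):
    """Longest-prefix match over (prefix, label) pairs; None if nothing matches."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), label)
        for prefix, label in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]
```

With both `("10.3.0.1/32", "bgp")` and `("10.3.0.0/30", "static")` in the table, `best_route("10.3.0.1", ...)` picks the BGP route. Once every server has withdrawn the /32, the same lookup falls back to the static /30, which is the last-resort behaviour described earlier.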

How to monitor anycast_healthchecker logs?

/var/log/anycast-healthchecker/anycast-healthchecker.log

How to know which routes a router takes to a specific VIP?

Here both next hops (servers) are load balanced, as they are under the same *[BGP] block.

> show route 10.3.0.1
10.3.0.1/32        *[BGP/170] 1w4d 08:54:21, localpref 100, from 208.80.153.77
                      AS path: 64605 I, validation-state: unverified
                      to 208.80.153.77 via ae3.2003
                    > to 208.80.153.111 via ae4.2004
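Per-flow load balancing between the two next hops shown above can be illustrated as follows. Real routers use a vendor-specific hardware hash; SHA-256 here is only a stand-in, and `pick_next_hop` is a hypothetical name. The point is that every packet of a flow (same IPs and ports) maps to the same server, while different flows spread across servers.

```python
import hashlib


def pick_next_hop(src_ip, dst_ip, src_port, dst_port, next_hops):
    """Choose a next hop deterministically from the flow's IPs and L4 ports."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the first 8 bytes of the digest to an index into the next-hop list.
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


servers = ["208.80.153.77", "208.80.153.111"]  # next hops from the output above
# 192.0.2.10 is a documentation address used as a made-up client.
first = pick_next_hop("192.0.2.10", "10.3.0.1", 40000, 53, servers)
again = pick_next_hop("192.0.2.10", "10.3.0.1", 40000, 53, servers)
# The same flow always lands on the same server.
assert first == again
```

Because the choice depends only on the flow, a client that changes its source port (as DNS clients normally do for each query) may be served by a different local server.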

MTR can also be used for a coarser, per-site view, e.g.:

bast5001:~$ mtr 10.3.0.1 --report
Start: Fri Apr  5 16:48:21 2019
HOST: bast5001                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae1-510.cr2-eqsin.wikimed  0.0%    10    0.3   0.7   0.2   4.4   1.1
  2.|-- ae0.cr1-eqsin.wikimedia.o  0.0%    10    0.2   1.0   0.2   7.8   2.3
  3.|-- xe-5-1-2.cr1-codfw.wikime  0.0%    10  195.1 195.3 195.1 196.5   0.3
  4.|-- recdns.anycast.wmnet       0.0%    10  195.1 195.1 195.1 195.1   0.0

How to temporarily depool a server

Disable Puppet, stop the bird service.

How to long term depool a server

Several options:

  • Deactivate the neighbor IP on the router side
  • (Cleaner) Add a specific profile::bird::advertise_vips with the same identifier to the server, and check_cmd: /bin/false or ensure: absent

Limitations

  • The servers monitor themselves: if a server fails in a way where BGP stays up but DNS is unreachable from outside via the VIP (e.g. iptables), this will cause an outage
  • By the nature of Anycast, Icinga only checks the health of the VIP instance closest to it
    • This could be worked around by checking the anycasted service health from various vantage points (e.g. bastion hosts)
    • Health checks to the servers' real IPs still work

Future evolution

  • IPv6 is supported by both Bird and anycast_healthchecker, but not implemented in Puppet (no current need)
  • Upgrade anycast_healthchecker to 0.9.0 or more recent (and rollback https://gerrit.wikimedia.org/r/c/operations/puppet/+/520643)
  • Implement BGP graceful shutdown on the server side to drain traffic before depooling
  • Send anycast_healthchecker logs to central syslog server
  • Use BGP metrics to influence anycast routing (e.g. if eqiad's resolvers fail, send eqiad traffic to codfw rather than esams)
  • Investigate BGP routing policies between sites (e.g. eqiad only sends public prefixes to esams via BGP) - T227808
  • https://packages.debian.org/buster/prometheus-bird-exporter