Anycast recursive DNS

In the current setup, each server has the closest pair of recursive DNS (rdns) servers configured in its resolv.conf file.
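
For illustration, a server's resolv.conf under this scheme looks roughly like the following (placeholder addresses, not the actual production resolvers):

# placeholder addresses for the two closest site recursors
nameserver 10.0.0.2
nameserver 10.0.0.3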

To improve the resiliency of the service, this POC explores advertising a single VIP from all the recursive DNS servers.

Task: https://phabricator.wikimedia.org/T186550

CR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/397723/

Current limitations

  • Some services don't fail over quickly enough to the second server listed in resolv.conf when the first one fails
  • If both servers of a site (or the whole site) fail, servers relying on them will experience an outage
  • LVS/PyBal itself depends on DNS, which makes an LVS-based solution a chicken-and-egg problem

High level

(Diagram: Anycast rec-dns diagram.png)

The VIP is configured on the servers' loopback interface.
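
On a server this amounts to something like the following (a sketch; in production the address is managed by Puppet):

$ sudo ip addr add 10.3.0.1/32 dev lo  # configure the anycast VIP on the loopback
$ ip -4 addr show dev lo               # verify the VIP is present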

Bird (routing daemon) advertises the VIP to the routers using BGP.

A BFD session is established between Bird and the routers to ensure fast failover in case of server or link failure.
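
The state of the BGP sessions can be inspected from the server side with BIRD's CLI (sketch; protocol names depend on the generated configuration):

$ sudo birdc show protocols  # BGP sessions to the routers and their state
$ sudo birdc show route      # the VIP (10.3.0.1/32) should appear here if it is being advertised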

Anycast_healthchecker monitors the local DNS server by querying it every second.

If a DNS failure is detected, the VIP stops being advertised.

When the service is restored, anycast_healthchecker waits 10 seconds before re-advertising the VIP to avoid flapping.

The bird service is bound to the anycast-healthchecker service (systemd binding), so bird is stopped if anycast-healthchecker is not running.
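
The binding can be checked on a server with systemctl (sketch):

$ systemctl show bird.service --property=BindsTo
# expected output if the binding is in place: BindsTo=anycast-healthchecker.service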

The time between an incident and the VIP being withdrawn is at most 1 second.

All servers advertise the same VIP worldwide; clients are routed to the closest server in the BGP sense (same DC, then shortest AS path, etc.).

Routers do per-flow load balancing between the two connected DNS servers.
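
Besides multipath on the BGP group (see the router configuration below), Juniper routers typically also need a load-balancing policy exported to the forwarding table for per-flow hashing to take effect; a sketch (an assumption, not taken from the linked change):

# show policy-options policy-statement LOAD-BALANCE
then {
    load-balance per-packet;  # despite the name, this hashes per flow on current platforms
}
# show routing-options forwarding-table
export LOAD-BALANCE;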

Configuration

Router side

# show protocols bgp group Anycast4 
type external;
multihop {
    ttl 2; # Needed as we're peering with the router's loopback
}
local-address 208.80.153.193; # Router's loopback
import anycast_import;  # See below
family inet {
    unicast {
        prefix-limit {
            maximum 50; # Take the session down if more than 50 prefixes
            teardown;  # learned from the servers (e.g. misconfiguration)
        }
    }
}
export NONE;
peer-as 64605;  # Server's ASN
bfd-liveness-detection {
    minimum-interval 300; # Take the session down after 3*300ms failures
}
multipath;  # Enable load balancing (remove for active/passive)
neighbor 208.80.153.111;  # Servers' IPs
neighbor 208.80.153.77;


# show policy-options policy-statement anycast_import 
term anycast4 {
    from {
        prefix-list-filter anycast-internal4 longer; # Only accept prefixes in the defined range
    }
    then accept;
}
then reject;

# show policy-options prefix-list anycast-internal4      
10.3.0.0/24;

Server (puppet) side

modules/role/manifests/dnsrecursor.pp:

include ::profile::bird::anycast

hieradata/role/common/recursor.yaml (global)

profile::bird::advertise_vips:
  rec-dns-anycast-vip:
    address: 10.3.0.1  # VIP to advertise
profile::bird::bind_service: 'anycast-healthchecker.service'  # Service to bind Bird to
profile::bird::healthchecks:
  recdns.anycast.wmnet:
    anycast_vip: 10.3.0.1  # anycast-healthchecker will not start if more than one healthcheck is defined per VIP
    check_cmd: '/usr/lib/nagios/plugins/check_dns -H www.wikipedia.org -s 10.3.0.1 -t 1 -c 1'
    ensure: present  # optional, set to "absent" to remove the check

hieradata/role/codfw/recursor.yaml (per site)

profile::bird::neighbors_list:
  - 208.80.153.192 # cr1-codfw loopback
  - 208.80.153.193 # cr2-codfw loopback
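
With the VIP advertised everywhere, a client only needs a single resolver entry; a hypothetical resolv.conf using the anycast address would be:

# recdns.anycast.wmnet, the anycast VIP
nameserver 10.3.0.1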

Health Checks

anycast-healthchecker uses the return code of the health-check script: 0 means healthy, everything else is considered a failure.

In this case we re-use an Icinga NRPE check, installed on all the servers:

/usr/lib/nagios/plugins/check_dns -H www.wikipedia.org -s 10.3.0.1 -t 1 -c 1
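
To reproduce what anycast-healthchecker sees, the check can be run by hand and its exit code inspected:

$ /usr/lib/nagios/plugins/check_dns -H www.wikipedia.org -s 10.3.0.1 -t 1 -c 1
$ echo $?  # 0 = healthy (keep advertising the VIP), anything else = failure (withdraw it)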

Troubleshooting

Know which server a client is redirected to

Unless the topology changes (e.g. a server going up or down), a client will always be routed to the same server:

$ dig @10.3.0.1 CHAOS TXT id.server. +short

Show which routes a router takes to a specific VIP

Here both next hops (servers) are load balanced, as they are under the same *[BGP] block.

> show route 10.3.0.1
10.3.0.1/32        *[BGP/170] 1w4d 08:54:21, localpref 100, from 208.80.153.77
                      AS path: 64605 I, validation-state: unverified
                      to 208.80.153.77 via ae3.2003
                    > to 208.80.153.111 via ae4.2004

Monitor anycast_healthchecker logs

/var/log/anycast-healthchecker/anycast-healthchecker.log

How to

Temporarily depool a server

Disable Puppet, stop the bird service, then make sure no more queries are reaching the VIP (e.g. with tcpdump).
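
A sketch of the corresponding commands (the Puppet-disabling step may differ depending on local tooling):

$ sudo puppet agent --disable "temporary recdns depool"
$ sudo systemctl stop bird.service
$ sudo tcpdump -ni any host 10.3.0.1 and port 53  # should go quiet once the route is withdrawn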

Long-term depool of a server

Deactivate the neighbor IP on the router side
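
For example, on a Juniper router (configuration mode, using the group and neighbor IPs shown above):

# deactivate protocols bgp group Anycast4 neighbor 208.80.153.111
# commit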

Limitations

  • The server "self-monitors" itself: if it fails in a way where BGP stays up but DNS on the VIP is unreachable from the outside (e.g. iptables), this will cause an outage
  • By the nature of anycast, Icinga will only check the health of the VIP instance closest to it
    • This could be worked around by checking DNS health from various vantage points (e.g. bastion hosts), as sketched below
    • DNS health checks against the servers' real IPs still work
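
A minimal sketch of such a multi-vantage-point check (hostnames are placeholders):

$ for host in bastion1 bastion2; do ssh "$host" dig @10.3.0.1 CHAOS TXT id.server. +short; done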

Future evolution

  • IPv6 is supported by both Bird and anycast_healthchecker, but not implemented in Puppet (no current need)
  • Add Icinga monitoring to anycast_healthchecker (script available)
  • Implement BGP graceful shutdown on the server side to drain traffic before depooling
  • Send anycast_healthchecker logs to central syslog server