Anycast authoritative DNS
Work in progress.
In order to improve latency and resilience of our authoritative DNS, this setup leverages BGP and anycast.
Tracking task: https://phabricator.wikimedia.org/T98006
Limitation of a non-anycast setup
By definition, GeoDNS can't be used to steer users to their closest nameserver (NS) the way we do for websites, because the nameserver is chosen before any of our own DNS answers are served.
When asked for a record (e.g. fr.wikipedia.org), the .org zone presents all 3 of our NS to the client and leaves it to the client to decide which one to use.
Client-side selection is not always good [citation needed], so anycast offloads that decision to BGP.
Configuration
Server side
The server side is a regular internal anycast setup.
modules/profile/manifests/dns/auth.pp and modules/profile/manifests/dns/recursor.pp include ::profile::bird::anycast
hieradata/role/common/dnsbox.yaml and hieradata/role/common/dns/auth.yaml
profile::bird::advertise_vips:
  nsa.wikimedia.org:
    address: 198.35.27.27 # VIP to advertise (limited to a /32)
    check_cmd: '/usr/lib/nagios/plugins/check_dns_query -H 198.35.27.27 -a -l -d www.wikipedia.org -t 1'
    ensure: present
    service_type: authdns
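How the check_cmd result feeds into the BGP announcement is not spelled out on this page. The following is a minimal Python sketch, assuming the usual Nagios plugin convention (exit status 0 means OK) and assuming the VIP is only advertised while the check passes. The function name vip_should_be_advertised is hypothetical and is not part of the Puppet profile.

```python
import shlex
import subprocess


def vip_should_be_advertised(check_cmd: str, timeout: float = 5.0) -> bool:
    """Return True when check_cmd exits 0 (Nagios "OK"), False otherwise.

    Assumption: a failing, missing, or hung check means the VIP should be
    withdrawn. The real health checker may apply retries or thresholds.
    """
    try:
        result = subprocess.run(
            shlex.split(check_cmd),       # split the string like a shell would
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Plugin not installed, not executable, or it did not finish in time.
        return False
    return result.returncode == 0
```

Called with the check_cmd string from the hiera data above, the function returns True only while the local DNS server answers the test query on the VIP.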
Router side
The policy below creates (and thus advertises) the /24 anycast aggregate only if the router learns a contributing route for it locally.
policy-options {
    policy-statement BGP_from_anycast {
        term BGP_local_anycast {
            from {
                protocol bgp;
                as-path local_anycast;
            }
            then accept;
        }
        then reject;
    }
    as-path local_anycast "^64605$";
}
routing-options {
    aggregate {
        route 198.35.27.0/24 policy BGP_from_anycast;
    }
}
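The effect of this policy can be modelled in a few lines of Python. This is only a sketch of the decision logic, not of how Junos evaluates it internally: the Route class and the function names are made up for illustration, and the routing table is reduced to a list of (prefix, protocol, AS path) entries.

```python
import ipaddress
from dataclasses import dataclass

AGGREGATE = ipaddress.ip_network("198.35.27.0/24")
LOCAL_ANYCAST_AS_PATH = (64605,)  # equivalent of the as-path regex "^64605$"


@dataclass(frozen=True)
class Route:
    prefix: str      # e.g. "198.35.27.27/32"
    protocol: str    # e.g. "bgp", "static"
    as_path: tuple   # AS numbers, e.g. (64605,)


def is_contributing(route: Route) -> bool:
    """True if the route may contribute to the aggregate under the policy."""
    net = ipaddress.ip_network(route.prefix)
    more_specific = (
        net.version == AGGREGATE.version
        and net.prefixlen > AGGREGATE.prefixlen
        and net.subnet_of(AGGREGATE)
    )
    # Mirrors term BGP_local_anycast: learned via BGP, AS path exactly 64605.
    accepted = route.protocol == "bgp" and route.as_path == LOCAL_ANYCAST_AS_PATH
    return more_specific and accepted


def aggregate_is_advertised(routes) -> bool:
    """The /24 exists (and can be advertised) only with a contributing route."""
    return any(is_contributing(r) for r in routes)
```

With at least one healthy local server announcing 198.35.27.27/32 from AS 64605, aggregate_is_advertised returns True. With no such route it returns False, which corresponds to the router withdrawing the /24 from its peers.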
Troubleshooting
Know which server a client is routed to
$ dig +nsid @nsa.wikimedia.org en.wikipedia.org A |grep NSID
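The +nsid flag makes dig add the EDNS NSID option (RFC 5001) to the query, and the answering server returns its identifier in the same option. For scripted checks, the same exchange can be sketched in Python with only the standard library. The helper names below are hypothetical, error handling is minimal, and the sketch only handles UDP replies.

```python
import socket
import struct

NSID_OPTION_CODE = 3   # RFC 5001
OPT_RR_TYPE = 41       # RFC 6891


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def build_nsid_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build an A/IN query carrying an OPT record with an empty NSID option."""
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 1)  # 1 question, 1 additional
    question = encode_name(name) + struct.pack("!HH", 1, 1)   # type A, class IN
    opt = b"\x00" + struct.pack("!HHIH", OPT_RR_TYPE, 1232, 0, 4)
    opt += struct.pack("!HH", NSID_OPTION_CODE, 0)
    return header + question + opt


def skip_name(msg: bytes, pos: int) -> int:
    """Return the offset just past the (possibly compressed) name at pos."""
    while True:
        length = msg[pos]
        if length == 0:
            return pos + 1
        if (length & 0xC0) == 0xC0:   # compression pointer, always 2 bytes
            return pos + 2
        pos += 1 + length


def extract_nsid(msg: bytes):
    """Return the raw NSID bytes from a DNS message, or None if absent."""
    _id, _flags, qd, an, ns, ar = struct.unpack("!HHHHHH", msg[:12])
    pos = 12
    for _ in range(qd):
        pos = skip_name(msg, pos) + 4            # skip QTYPE and QCLASS
    for _ in range(an + ns + ar):
        pos = skip_name(msg, pos)
        rtype, _cls, _ttl, rdlen = struct.unpack("!HHIH", msg[pos:pos + 10])
        pos += 10
        rdata = msg[pos:pos + rdlen]
        pos += rdlen
        if rtype != OPT_RR_TYPE:
            continue
        i = 0
        while i + 4 <= len(rdata):               # walk the EDNS options
            code, olen = struct.unpack("!HH", rdata[i:i + 4])
            if code == NSID_OPTION_CODE:
                return rdata[i + 4:i + 4 + olen]
            i += 4 + olen
    return None


def query_nsid(server: str, name: str, timeout: float = 2.0):
    """Send the query over UDP port 53 and return the NSID bytes (or None)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_nsid_query(name), (server, 53))
        reply, _addr = sock.recvfrom(4096)
    return extract_nsid(reply)
```

Calling query_nsid("nsa.wikimedia.org", "en.wikipedia.org") from different vantage points should return the identifier of whichever server anycast routes that client to.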
Failure tests
Total local AuthDNS failure
- Stop gdnsd on all ulsfo servers
- The anycast prefix stops being advertised to the routers
- The routers no longer have any contributing routes for the less specific prefix
- The routers stop advertising the prefix to their peers
- Start gdnsd again
- The prefixes are re-advertised
Limitations
- L3 header LB: ICMP "packet too big" messages sent by routers along the path will not consistently be routed to the correct server
- Non-consistent hashing: if a routing change on the Internet causes a TCP packet to arrive through a different router, that router will not consistently route it to the proper server
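Both limitations come from the router picking a server by hashing packet headers. The toy model below illustrates this under explicit assumptions: real routers use vendor-specific hash functions, so SHA-256 with a per-router seed is only a stand-in, the hash covers just the L3 source and destination addresses, and all names and addresses other than the VIP are invented for the example.

```python
import hashlib

VIP = "198.35.27.27"


def pick_server(router_seed: str, fields: tuple, n_servers: int) -> int:
    """Map a tuple of header fields to a server index, as a router might."""
    key = repr((router_seed,) + fields).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_servers


def count_mismatches(n_flows: int = 1000, n_servers: int = 3):
    """Count flows that would break under the two limitations."""
    rerouted = 0    # same client, but the packet enters through another router
    icmp_lost = 0   # ICMP error reaches a different server than its flow
    hop_ip = "203.0.113.1"  # hypothetical on-path router sending the ICMP error
    for i in range(n_flows):
        client_ip = f"10.{i // 250}.0.{i % 250 + 1}"
        home = pick_server("router-a", (client_ip, VIP), n_servers)
        # Non-consistent hashing: router-b hashes the same flow independently.
        if pick_server("router-b", (client_ip, VIP), n_servers) != home:
            rerouted += 1
        # L3 header LB: the ICMP packet's source is the hop, not the client,
        # so it is hashed without any relation to the original flow.
        if pick_server("router-a", (hop_ip, VIP), n_servers) != home:
            icmp_lost += 1
    return rerouted, icmp_lost
```

In this model a sizeable share of flows lands on a different server in each case, which is why a mid-connection path change or a path MTU problem can affect TCP queries to the anycast address.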