Anycast recursive DNS

From Wikitech-static
Jump to navigation Jump to search

In order to improve resiliency of recursive DNS, this setup leverages BGP and anycast.

Task: https://phabricator.wikimedia.org/T186550

CR: https://gerrit.wikimedia.org/r/c/operations/puppet/+/397723/

Limitation of a non-anycast setup

  • Some services don't fail over fast enough to the 2nd server listed on resolv.conf when one fails
  • If the two servers of a site (or the whole site) fails, servers relying on them will experience an outage
  • LVS/pybal depends on DNS and thus making it a chicken/egg problem

Configuration

Anycast rec-dns diagram.png

Server side

modules/role/manifests/dnsrecursor.pp  include ::profile::bird::anycast

hieradata/role/common/recursor.yaml (global)

profile::bird::advertise_vips:
  recdns.anycast.wmnet:
    address: 10.3.0.1 # VIP to advertise (limited to a /32)
    check_cmd: '/usr/lib/nagios/plugins/check_dns_query -H 10.3.0.1 -l -d www.wikipedia.org -t 1'
    service_type: recdns


check_cmd

In this case we re-use an Icinga NRPE check, installed on all the servers:

/usr/lib/nagios/plugins/check_dns_query -H 10.3.0.1 -l -d www.wikipedia.org -t 1

Troubleshooting

Know which server a client is redirected to

$ dig @10.3.0.1 CHAOS TXT id.server. +short

Ensure all servers can reach the VIP

bblack@cumin1001:~$ sudo cumin '*' 'dig @10.3.0.1 CHAOS TXT id.server. +short'

Failure tests

Single local recursor failure

bblack@backup2001:~$ while [ 1 ]; do echo ======; date; dig @10.3.0.1 CHAOS TXT id.server. +short; sleep 1; done

"so I can see a result once a second, I've tried stopping just healthchecker, stopping or killing the recursor, etc"

Traffic routes to the one working local node within the second.

Double local recursor failure

Eg. take down dns2001/dns2002

Request end up on dns1001/dns1002

Limitations

  • If the DNS recursors have the anycast VIP as only resolver in resolv.conf, then processes depending on DNS will fail until pdnsd starts as they will try to connect to the local recdns service instead of being routed to the closest server.
    • Workaround is to either hardcode real recursors IPs or have a daemon that remove the VIP loopback

Future evolution

  • Add Icinga monitoring to check local recursors work (eg. Icinga check on bastX hosts that check it's dnsX that reply and not dnsY)