Revision as of 12:05, 9 August 2018 by imported>Ema

DNS Discovery is a simple dynamic service discovery system that returns the closest active endpoint of a given service running in multiple data centers.

This solution is meant only for simple discovery entries; if more complex data needs to be driven dynamically, a Confd / etcd managed configuration is required.

Active/active services

If a service is running in active/active mode, it can be contacted in any data center. In this case the entry service-name.discovery.wmnet returns the IP of the endpoint in the same data center as the host performing the resolution, provided that endpoint is pooled.

For example, with both data centers pooled, a host in eqiad resolving service-name.discovery.wmnet will get the IP of service-name.svc.eqiad.wmnet, while a host in codfw will get the IP of service-name.svc.codfw.wmnet.

If the codfw data center entry is depooled, a host in codfw will get the IP of the endpoint in eqiad, provided that one is pooled.
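The selection logic above can be sketched as follows. This is an illustrative model, not the actual DNS implementation, and the endpoint IPs are made up:

```python
# Sketch of active/active DNS Discovery selection (illustrative only):
# prefer the client's local data center if its endpoint is pooled,
# otherwise fall back to any other pooled data center.

ENDPOINTS = {  # hypothetical service-name.svc.<dc>.wmnet endpoint IPs
    "eqiad": "10.2.2.1",
    "codfw": "10.2.1.1",
}

def resolve_active_active(client_dc, pooled):
    """Return the endpoint IP for a client resolving from client_dc.

    pooled is a dict like {"eqiad": True, "codfw": False}.
    """
    if pooled.get(client_dc):
        return ENDPOINTS[client_dc]
    for dc, is_pooled in pooled.items():
        if is_pooled:
            return ENDPOINTS[dc]
    return None  # no data center pooled; see the failure scenario below

# With both DCs pooled, each client gets its local endpoint:
print(resolve_active_active("codfw", {"eqiad": True, "codfw": True}))   # 10.2.1.1
# With codfw depooled, a codfw client gets the eqiad endpoint:
print(resolve_active_active("codfw", {"eqiad": True, "codfw": False}))  # 10.2.2.1
```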

Dns-discovery active-active.png

Active/passive services

If a service is running in active/passive mode, it can be contacted only in the primary data center, not in the passive one. In this case the entry service-name.discovery.wmnet always returns the IP of the endpoint in the primary data center.
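In contrast to the active/active case, the client's location plays no role here. A minimal illustrative sketch (again with made-up IPs):

```python
# Sketch of active/passive selection (illustrative only): the discovery
# record always points at the primary data center, regardless of where
# the client resolving the name is located.

ENDPOINTS = {
    "eqiad": "10.2.2.1",  # hypothetical service-name.svc.eqiad.wmnet
    "codfw": "10.2.1.1",  # hypothetical service-name.svc.codfw.wmnet
}

def resolve_active_passive(client_dc, primary_dc):
    # client_dc is ignored: only the primary may serve traffic
    return ENDPOINTS[primary_dc]

# Clients in both data centers get the primary's endpoint:
print(resolve_active_passive("codfw", primary_dc="eqiad"))  # 10.2.2.1
print(resolve_active_passive("eqiad", primary_dc="eqiad"))  # 10.2.2.1
```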

Dns-discovery active-passive.png

Read-only and read-write

If a service can handle reads in an active/active way, but writes only in an active/passive way, two DNS Discovery records can be created, service-name-ro and service-name-rw, so that they can be treated as two different services: one active/active and the other active/passive.

Failure scenario

To handle the failure case in which no data center is pooled for a given service, a failoid service was created that always closes the connection on any TCP port. This way DNS Discovery can use the failoid IPs as a fallback and is always able to return an IP, avoiding negative DNS caching. The failoid service is present in both the eqiad and codfw data centers, and the IP of the local instance is returned.
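The fallback behavior can be sketched like this; the failoid IPs below are hypothetical placeholders, not the real ones:

```python
# Sketch of the no-DC-pooled fallback (illustrative IPs): when no data
# center is pooled, DNS Discovery answers with the local failoid IP
# instead of an empty response, so resolvers never negatively cache the
# name. Failoid then closes every TCP connection made to it.

FAILOID = {  # hypothetical per-data-center failoid IPs
    "eqiad": "10.64.0.100",
    "codfw": "10.192.0.100",
}

def fallback_ip(client_dc, pooled):
    """Return the local failoid IP if no data center is pooled."""
    if not any(pooled.values()):
        return FAILOID[client_dc]  # always answer; connections fail fast
    return None  # at least one DC pooled: normal selection applies

# With everything depooled, each client gets its local failoid IP:
print(fallback_ip("eqiad", {"eqiad": False, "codfw": False}))  # 10.64.0.100
```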

How to manage a DNS Discovery service

TODO: Add more details for the Puppet configuration

The DNS configuration is managed in Puppet, while the current pooled/depooled state and the TTL are stored in etcd and can be managed via Conftool, either from the CLI or by using it as a library. For example:

  • Get the current live state of the three main MediaWiki discovery entries:
$ confctl --quiet --object-type discovery select 'dnsdisc=(appservers|api|imagescaler)-rw' get
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=imagescaler-rw"}
{"eqiad": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=imagescaler-rw"}
{"eqiad": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=api-rw"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=api-rw"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=appservers-rw"}
{"eqiad": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=appservers-rw"}
  • Get the current live state of the parsoid entry:
$ confctl --quiet --object-type discovery select 'dnsdisc=parsoid' get
{"eqiad": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=parsoid"}
{"codfw": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=parsoid"}
  • Depool the codfw entry of the imagescaler-ro service:
$ confctl --object-type discovery select 'dnsdisc=imagescaler-ro,name=codfw' set/pooled=false

Remove a service from production

With the goal of removing a service named example from production:

  1. Remove any reference to example.discovery.wmnet and example.svc.{eqiad,codfw}.wmnet from the configuration of other services
  2. Remove the discovery entries for example.discovery.wmnet as well as the geo-config-test part from DNS. Example
  3. Remove discovery's hieradata entries from the puppet repo (hieradata/common/discovery.yaml). Example
  4. Downtime the LVS endpoints in icinga
  5. Remove the lvs configuration from hiera (Example) and then EITHER:
    1. move the hosts to role::spare, or
    2. remove role::lvs::realserver from the hosts configuration
  6. (Optional) Run puppet on
  7. Run puppet on the affected load balancers and rolling-restart pybal. To identify which load balancers need to be restarted, look at the class attribute of the service being removed on hieradata/common/lvs/configuration.yaml and see which lvs hosts belong to that class.
  8. (Optional) PyBal does not automatically remove ipvsadm services once they're gone from the configuration; that can be done by hand with ipvsadm
  9. Remove conftool-data entries. Example
  10. Remove example.svc.{eqiad,codfw}.wmnet entries from DNS Example