Wikidough/Monitoring: Difference between revisions

imported>Sukhbir Singh
(improve basic check resolving message)
 
imported>Sukhbir Singh
(→‎Wikidough Basic Check: there is no more bird6)

Revision as of 13:48, 5 July 2022

Wikidough Basic Check

What does it mean?

If this check fails, it means that a Wikidough host is not responding on ports 443 and/or 853 on its IPv4 and/or IPv6 address. This happens when the host is down, depooled, or the dnsdist.service is inactive or has failed.
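To reproduce the symptom by hand, a plain TCP probe of the two ports can help narrow things down. The following is a minimal sketch, not the actual Icinga check: the function names `probe_port` and `check_host` are hypothetical, it relies on bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command, and it only confirms that a connection opens on each port (it does not send a DNS query).

```shell
#!/usr/bin/env bash
# Sketch: TCP reachability probe for ports 443 and 853.
# Function names are illustrative; this is not the production check.

# probe_port HOST PORT: succeed if a TCP connection opens within 3 seconds.
probe_port() {
    local host="$1" port="$2"
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# check_host HOST: probe both ports, print one line per port,
# and return non-zero if either port did not answer.
check_host() {
    local host="$1" port rc=0
    for port in 443 853; do
        if probe_port "$host" "$port"; then
            echo "$host port $port: open"
        else
            echo "$host port $port: NOT responding"
            rc=1
        fi
    done
    return "$rc"
}
```

Calling `check_host` once with the host's IPv4 address and once with its IPv6 address covers both address families mentioned above.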

Resolving this message

  • Head to Icinga and check if the host is up and if there are other checks that are failing (which may indicate a problem with the host itself).
  • Is the dnsdist.service active? Check with: systemctl status dnsdist.service.
    • If it has stopped or failed, try restarting it. If it fails again, check the journal output to see why it is failing.
  • Since this check queries the host IP and not the anycast IP, it is unlikely that there is an issue with anycast-healthchecker or the bird service.
    • Nevertheless, checking the status of the above two services might be worthwhile.
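The service checks in the list above can be run in one pass on the affected host. This is a rough sketch under stated assumptions: `report_services` is a hypothetical helper, and the unit names other than dnsdist.service (for example `anycast-healthchecker.service` and `bird.service`) are assumed rather than taken from this page. It uses only `systemctl is-active`, and points to `journalctl -u` for any unit that is not active.

```shell
#!/usr/bin/env bash
# Sketch of the triage steps. Unit names other than dnsdist.service
# are assumptions; adjust them to match the host.

# report_services UNIT...: print each unit's state as reported by
# `systemctl is-active`, flagging anything that is not "active".
report_services() {
    local unit state rc=0
    for unit in "$@"; do
        # is-active exits non-zero for inactive or failed units, so ignore its status
        state="$(systemctl is-active "$unit" 2>/dev/null)" || true
        if [ "$state" = "active" ]; then
            echo "$unit: active"
        else
            echo "$unit: ${state:-unknown} (needs attention; see: journalctl -u $unit)"
            rc=1
        fi
    done
    return "$rc"
}

# Example invocation (unit names assumed):
#   report_services dnsdist.service anycast-healthchecker.service bird.service
```

The function only reports state; restarting a stopped or failed unit is left as a deliberate manual step.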

Service Restart Check

What does it mean?

A failure of this check indicates that the configuration file for either dnsdist or pdns-recursor was changed but the service itself was not restarted. A CRITICAL alert is raised if the time delta between the configuration file change and service restart exceeds 24 hours.

This check is meant to be a warning alert and does not signify an error in the service.
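The 24-hour rule can be illustrated with a little timestamp arithmetic. The sketch below is one possible reading of the description, not the real check's code: it assumes the alert turns CRITICAL when the configuration file is newer than the last service start and that change is more than 24 hours old. The function name `restart_status`, the `PENDING` label for the window before 24 hours, and the configuration path in the comment are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: classify a service from three epoch timestamps.
# Assumed interpretation of the check; labels are illustrative.

# restart_status CONFIG_MTIME SERVICE_START NOW (all in epoch seconds)
restart_status() {
    local config_mtime="$1" service_start="$2" now="$3"
    local threshold=86400   # 24 hours in seconds
    if [ "$config_mtime" -le "$service_start" ]; then
        echo "OK"           # service was (re)started after the last config change
    elif [ $((now - config_mtime)) -gt "$threshold" ]; then
        echo "CRITICAL"     # config changed over 24 hours ago, still no restart
    else
        echo "PENDING"      # config changed recently, restart not yet done
    fi
}

# On a systemd host with GNU coreutils, the inputs might be gathered like this
# (the config path is an assumption):
#   mtime="$(stat -c %Y /etc/dnsdist/dnsdist.conf)"
#   started="$(date -d "$(systemctl show -p ActiveEnterTimestamp --value dnsdist.service)" +%s)"
#   restart_status "$mtime" "$started" "$(date +%s)"
```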

Resolving this message

Please contact the Traffic team before performing the steps below, as restarting either of these services clears its cache.

From a cumin host, restart the service mentioned in the alert on the Wikidough hosts:

sudo cumin -b 1 -s 5 'A:wikidough' 'systemctl restart dnsdist.service'

or,

sudo cumin -b 1 -s 5 'A:wikidough' 'systemctl restart pdns-recursor.service'