
Analytics/Systems/Druid/Alerts


We have a number of alerts set up in Icinga and Alertmanager that relate to Druid and its ingestion jobs.

This page exists as a set of instructions or runbooks to help identify what courses of action might be needed if one or more of these alerts is triggered.

Druid Netflow Supervisor

This alert triggers if the realtime netflow ingestion job receives fewer events than a given threshold over a 30-minute period.

The critical value is 0 and the warning value is 30.
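To make the thresholds above concrete, here is a minimal Python sketch of how such a check could classify a 30-minute window. It is an illustration only, not the actual alert definition: the helper name `netflow_alert_state`, the choice to sum the samples in the window, and the inclusive comparisons are all assumptions.

```python
from typing import Sequence

# Thresholds taken from the text above. How the real alert aggregates
# events over the window is an assumption made for this sketch.
WARNING_THRESHOLD = 30
CRITICAL_THRESHOLD = 0


def netflow_alert_state(event_counts: Sequence[int]) -> str:
    """Classify one 30-minute window of event-count samples (hypothetical helper)."""
    total = sum(event_counts)
    # Inclusive comparisons are an assumption: a count cannot fall below 0,
    # so "critical" is treated here as "no events received at all".
    if total <= CRITICAL_THRESHOLD:
        return "CRITICAL"
    if total <= WARNING_THRESHOLD:
        return "WARNING"
    return "OK"
```

For example, a window with no events at all would be classified as critical, while a window with only a handful of events would be a warning.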

The Grafana dashboard showing the trend data is here.

Druid Segments Unavailable

This alert triggers for each data source if the cluster reports more than a given number of unavailable segments over a 15-minute period.

The critical value is 30 segments unavailable for each data source.

The warning value is 20 segments unavailable.
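The per-data-source thresholds above can be sketched in Python as follows. This is a hedged illustration rather than the real alert rule: the function name `segments_alert_states`, the data source names in the usage example, and the inclusive comparisons are assumptions.

```python
from typing import Dict, Mapping

# Thresholds taken from the text above; whether the real alert uses
# ">" or ">=" is not stated, so ">=" is an assumption of this sketch.
WARNING_SEGMENTS = 20
CRITICAL_SEGMENTS = 30


def segments_alert_states(unavailable: Mapping[str, int]) -> Dict[str, str]:
    """Map each data source to an alert state based on its unavailable segment count."""
    states: Dict[str, str] = {}
    for datasource, count in unavailable.items():
        if count >= CRITICAL_SEGMENTS:
            states[datasource] = "CRITICAL"
        elif count >= WARNING_SEGMENTS:
            states[datasource] = "WARNING"
        else:
            states[datasource] = "OK"
    return states


# Usage with hypothetical data source names.
example = segments_alert_states(
    {"example_datasource_a": 0, "example_datasource_b": 25, "example_datasource_c": 40}
)
```

Because the state is computed per data source, one data source can be critical while the others remain healthy.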

The Grafana dashboard showing the trend is here.

Druid may well heal this unavailability on its own, so check the general Troubleshooting techniques before taking any action.
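One way to see whether the cluster is recovering by itself is to poll the Druid coordinator's load status endpoint, which reports how many segments are still left to load for each data source. The Python sketch below assumes direct HTTP access to a coordinator. The address in `COORDINATOR_URL` is a placeholder, not a real host, and the helper names are hypothetical.

```python
import json
from typing import Dict, Mapping
from urllib.request import urlopen

# Placeholder address (assumption): replace with the real coordinator host and port.
COORDINATOR_URL = "http://druid-coordinator.example.org:8081"


def fetch_segments_left_to_load(base_url: str = COORDINATOR_URL) -> Dict[str, int]:
    """Fetch the per-data-source count of segments still left to load."""
    url = f"{base_url}/druid/coordinator/v1/loadstatus?simple"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def still_unavailable(load_status: Mapping[str, int]) -> Dict[str, int]:
    """Keep only the data sources that still have segments left to load."""
    return {ds: count for ds, count in load_status.items() if count > 0}
```

Calling `still_unavailable(fetch_segments_left_to_load())` a few minutes apart and seeing the counts shrink would suggest the cluster is healing without intervention.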