SLO/logstash

From Wikitech-static
Revision as of 19:31, 22 October 2021 by imported>Herron (→‎Service Level Objectives)

SLO Worksheet - Logstash

Service

Logstash is a free and open server-side data processing pipeline that ingests data from multiple sources, transforms it, and then outputs it for search. In our infrastructure Logstash is a component of the logging pipeline, which consists of Kafka -> Logstash -> Elasticsearch <- Kibana.

Teams

Logstash is owned by the SRE Observability team, which is responsible for operation, scalability, and software updates. Contact: sre-observability@wikimedia.org and https://office.wikimedia.org/wiki/Contact_list#Observability

Architectural

Logstash consists of two clusters per site:

  • A production cluster which consumes logs from Kafka, transforms them, and outputs to Elasticsearch.
  • A barebones legacy cluster which ingests logs directly via TCP/UDP and outputs them to Kafka for consumption by the production cluster.

Hard Dependencies

  • Elasticsearch - This is where log data is stored. Logstash will block if Elasticsearch becomes unavailable.
  • Kafka - Logstash ingests log messages from the kafka-logging cluster.
  • Hardware - Dedicated servers, Ganeti instances, and networking.

Soft Dependencies

none

Client-facing

Clients

software | use | connection | interval | failure mode (Logstash down)
Kafka | Aggregates and queues log messages for consumption by Logstash | Pull via TCP | Continuous | Kafka consumer lag will spike and alarm.
Elasticsearch | Storage/archival of log data for search | Push via TCP | Continuous | Logstash will block and stop consuming log events; Kafka consumer lag will spike and alarm.
SCAP | Pre-flight error checks to support deployments | Pull via TCP using logstash_checker.py in puppet | Manual | False negative/positive result during the deployment pre-flight check.

Service Level Indicators (SLIs)

Errors - Percentage of log events that fail to be indexed by Elasticsearch

Latency - Percentage of messages ingested from Kafka logging by Logstash without consumer lag (as defined by Kafka Burrow)
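
As a rough illustration of how these two indicators could be turned into numbers, here is a minimal sketch. The function names, the input counts, and the choice to express latency as the share of events ingested while no consumer lag was reported are assumptions made for illustration; they are not the team's actual queries or dashboards.

```python
def error_sli(failed_events: int, total_events: int) -> float:
    """Percentage of events that failed to be indexed by Elasticsearch."""
    if total_events <= 0:
        return 0.0  # no traffic in the window: nothing failed
    return 100.0 * failed_events / total_events


def latency_sli(events_without_lag: int, total_events: int) -> float:
    """Percentage of events ingested from Kafka while no consumer lag was reported."""
    if total_events <= 0:
        return 100.0  # no traffic in the window: nothing was delayed
    return 100.0 * events_without_lag / total_events


if __name__ == "__main__":
    # Hypothetical counts for one datacenter over one reporting window.
    print(f"errors:  {error_sli(1_200, 1_000_000):.2f}% failed to index")
    print(f"latency: {latency_sli(997_500, 1_000_000):.2f}% ingested without lag")
```

In practice the counts would come from whatever metrics backend records indexing failures and consumer lag; the sketch only shows the arithmetic.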

Monitoring

Logstash is monitored via a suite of health checks and metrics, including:

  • Icinga checks - Host based service up/down checks
  • Kafka consumer lag - Is Logstash able to consume logs from the Kafka queue at least as fast as they appear, or is the queue growing faster than Logstash (and Elasticsearch) can process it?
  • Elasticsearch indexing failures - Is Logstash able to output events to Elasticsearch, or do a significant number of log messages fail to be stored?
  • Logstash event rate today vs. yesterday - Is the overall log volume significantly higher or lower than 24h ago?
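
The consumer lag and event rate questions above can be sketched as two small predicates. This is a deliberately simplified illustration: the function names, the sample values, and the factor-of-two threshold are hypothetical, and the lag check is much cruder than the evaluation a real lag monitor performs.

```python
from typing import Sequence


def lag_is_growing(lag_samples: Sequence[int]) -> bool:
    """True if lag never decreased across the window and ended higher than it started."""
    if len(lag_samples) < 2:
        return False
    never_decreased = all(b >= a for a, b in zip(lag_samples, lag_samples[1:]))
    return never_decreased and lag_samples[-1] > lag_samples[0]


def rate_is_anomalous(rate_now: float, rate_24h_ago: float, max_ratio: float = 2.0) -> bool:
    """True if the event rate differs from 24h ago by more than max_ratio in either direction."""
    if rate_24h_ago <= 0:
        return rate_now > 0  # any traffic is a change when there was none before
    ratio = rate_now / rate_24h_ago
    return ratio > max_ratio or ratio < 1.0 / max_ratio


if __name__ == "__main__":
    # Hypothetical samples: lag in messages, rates in events per second.
    print(lag_is_growing([10, 20, 20, 35]))
    print(rate_is_anomalous(rate_now=3000, rate_24h_ago=1000))
```

A symmetric ratio threshold is used so that both a surge and a drop in log volume are flagged, matching the "higher or lower" wording of the check.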

Deployment

Logstash is installed from a Debian package and its configuration is deployed via Puppet.

Service Level Objectives

  • Errors - 99.5% of events are indexed successfully, per datacenter. The remainder allows for log producers that emit invalid messages which cannot be parsed and are dropped, that exceed rate limits, or that output more logs than can reasonably be ingested.
  • Latency - 99.5% of events are ingested from Kafka logging without consumer lag (as defined by Kafka Burrow), per datacenter.
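
To make the 99.5% target concrete, the following sketch checks a per-datacenter event count against it and reports how much of the error budget is left. The function name, the report fields, the datacenter labels, and the sample counts are all hypothetical; only the 99.5% figure comes from the objectives above.

```python
SLO_TARGET = 0.995  # 99.5%, taken from the objectives above


def slo_report(good_events: int, total_events: int, target: float = SLO_TARGET) -> dict:
    """Compare a good/total event count against the target and compute the remaining budget."""
    if total_events <= 0:
        return {"ratio": 1.0, "met": True, "budget_remaining": 0.0}
    ratio = good_events / total_events
    allowed_bad = (1.0 - target) * total_events  # events allowed to miss the objective
    actual_bad = total_events - good_events
    return {
        "ratio": ratio,
        "met": ratio >= target,
        "budget_remaining": allowed_bad - actual_bad,  # negative means the budget is overspent
    }


if __name__ == "__main__":
    # Hypothetical (good, total) counts per datacenter for one reporting window.
    per_datacenter = {"dc_a": (996_000, 1_000_000), "dc_b": (497_000, 500_000)}
    for name, (good, total) in per_datacenter.items():
        print(name, slo_report(good, total))
```

Because the objectives are stated per datacenter, the sketch evaluates each site separately rather than pooling the counts, so a healthy site cannot mask a failing one.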