
HAProxyKafka

From Wikitech-static

HAProxyKafka is a daemon running on all cache hosts to read logs produced by HAProxy and forward them to Kafka to be used in analytics pipelines.

HaproxyKafka replaced VarnishKafka to provide better observability of incoming requests, since HAProxy is now our main entry point for all CDN requests.

The daemon is written in Go and uses the librdkafka library installed on the cache hosts (and NOT the copy embedded in the Go Kafka client package).

Useful links:

Building the package and deploying a new version

GitLab CI should build the package just fine, using the usual branch-naming convention to choose the target distribution. Upload the binary package to the internal APT repo as usual and deploy it manually (or with a tool such as Cumin).

HAProxy configuration

To let HAProxyKafka correctly parse and dispatch messages, HAProxy must be configured:

1. To log to a unix domain socket (created by HAProxyKafka in advance with the correct permissions).
2. To structure logs with the expected log-format, in RFC 5424 format. This includes the captured headers in the correct order.
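As a rough illustration, such a configuration could look like the following fragment (the socket path, header names and format string here are made up for this example; the real values come from Puppet):

```haproxy
global
    # Log to the unix domain socket pre-created by HAProxyKafka
    # (path is hypothetical), using the RFC 5424 syslog format.
    log /run/haproxykafka/haproxykafka.sock len 8192 format rfc5424 local0 info

frontend text-https
    # Captured headers must appear in the exact order HAProxyKafka expects.
    capture request header Host len 100
    capture request header User-Agent len 250
    # Illustrative log-format; the real one carries many more fields.
    log-format "%ci %ft %ST %B %hr"
```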

Luckily, using Puppet this is all configured automatically, along with the HAProxyKafka configuration file and services. Refer to the existing Puppet Hiera/code for details.
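For reference, each log line arriving on the socket is an RFC 5424 syslog frame (`<PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID STRUCTURED-DATA MSG`). A minimal Go sketch of splitting such a frame into its header fields and free-form message, illustrative only and not HAProxyKafka's actual parser:

```go
package main

import (
	"fmt"
	"strings"
)

// rfc5424Msg holds a few header fields of an RFC 5424 syslog frame.
type rfc5424Msg struct {
	Pri, Host, App, Msg string
}

// parseRFC5424 extracts priority, hostname, app-name and the free-form
// message from a single RFC 5424 line. Illustrative sketch only.
func parseRFC5424(line string) (rfc5424Msg, error) {
	var m rfc5424Msg
	end := strings.IndexByte(line, '>')
	if !strings.HasPrefix(line, "<") || end < 0 {
		return m, fmt.Errorf("missing priority field")
	}
	m.Pri = line[1:end]
	// HEADER = VERSION SP TIMESTAMP SP HOSTNAME SP APP-NAME SP PROCID SP MSGID
	parts := strings.SplitN(line[end+1:], " ", 7)
	if len(parts) < 7 {
		return m, fmt.Errorf("truncated header")
	}
	m.Host, m.App = parts[2], parts[3]
	// parts[6] starts with STRUCTURED-DATA ("-" when empty), then the MSG.
	if rest := strings.SplitN(parts[6], " ", 2); len(rest) == 2 {
		m.Msg = rest[1]
	}
	return m, nil
}

func main() {
	line := `<134>1 2024-01-01T00:00:00Z cp5017 haproxy 1234 - - 192.0.2.1 fe_text 200`
	m, err := parseRFC5424(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Host, m.App, m.Msg) // cp5017 haproxy 192.0.2.1 fe_text 200
}
```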

Procedures

Restarting

Restarting HaproxyKafka only takes a brief amount of time, so the HAProxy log buffer should be sufficient, under normal conditions, to avoid losing messages. If HaproxyKafka is unavailable for a longer period, on the other hand, HAProxy messages will be silently discarded and lost.

Alerts

HaproxyKafkaExporterDown

This can be caused by the service being stopped or the process having been killed, or in some cases by the process being up while the Prometheus exporter is unable to publish metrics (or other debug information). Usually a restart fixes this, but for more in-depth debugging consider capturing at least the pprof output (if possible) to analyze later, or attaching to the process with strace to check whether some evident issue is going on.

curl -v -o pprof_goroutine.out http://localhost:9341/debug/pprof/goroutine
curl -v -o pprof_profile.out http://localhost:9341/debug/pprof/profile # this can take a while

Analyzing these files locally with go tool pprof is often useful for finding the culprit:

go tool pprof -http=":8000" ./pprof_goroutine.out
go tool pprof -http=":8000" ./pprof_profile.out

HaproxyKafkaNoMessages

This alert fires on a mismatch between the requests received by HAProxy and the messages sent by HaproxyKafka, meaning that something is going wrong either at the HAProxy logging level or at the HaproxyKafka level.

A good way to debug this (other than checking the haproxykafka service journal on the impacted host) is to check the number of established connections between haproxykafka and the kafka-jumbo cluster (usually on port 9093). A low number of connections, or none at all, means that haproxykafka isn't even trying to send messages to the Kafka brokers.
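On the cache host, something like `ss -tn 'dport = :9093'` (or reading /proc/net/tcp directly) shows those connections. As an illustration of what is being counted, here is a small Go sketch that counts ESTABLISHED entries toward a given remote port in /proc/net/tcp-format data; the sample data and port are fabricated:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// countEstablished counts ESTABLISHED (state 01) entries in
// /proc/net/tcp-format data whose remote port matches. On a real host
// you would pass the contents of /proc/net/tcp.
func countEstablished(data string, port int) int {
	n := 0
	for _, line := range strings.Split(data, "\n")[1:] { // skip header line
		f := strings.Fields(line)
		if len(f) < 4 || f[3] != "01" { // f[3] is the connection state
			continue
		}
		rem := strings.Split(f[2], ":") // rem_address is ADDR:PORT in hex
		if len(rem) != 2 {
			continue
		}
		p, err := strconv.ParseInt(rem[1], 16, 32)
		if err == nil && int(p) == port {
			n++
		}
	}
	return n
}

func main() {
	// Fabricated sample: one ESTABLISHED and one TIME_WAIT entry
	// toward port 9093 (0x2385).
	sample := `  sl  local_address rem_address   st
   0: 0100007F:9C41 0A400001:2385 01
   1: 0100007F:9C42 0A400001:2385 06`
	fmt.Println(countEstablished(sample, 9093)) // 1
}
```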

As HAProxy communicates with HaproxyKafka over a unix socket, there's no easy way to inspect the messages flowing from HAProxy to HaproxyKafka, but in this case stracing write calls can be useful to detect which side has the issue.

Another option is to connect to a host in the kafka-jumbo cluster and check that messages are actually arriving from the affected host, e.g.:

fabfur@kafka-jumbo1007:~$ kafkacat -q -C -b localhost:9092 -t webrequest_frontend_text -o end | grep cp5017 | jq

The pprof commands described for the alert above can also shed some light in case of software misbehavior.

As this kind of scenario should be pretty unusual, it's a good idea to open a ticket with the Traffic team to notify them about it.

HaproxyKafkaRestarted

WARNING: This is still a WIP; the alert and watchdog functionality are not active yet

This alert is triggered when systemd auto-restarts the service, usually due to a hang (haproxykafka uses systemd's watchdog feature to force a restart when the application is not responding).

When this happens:

  • Check the actual number of restarts with systemctl show haproxykafka.service -p NRestarts (if > 0, systemd restarted the service due to a crash)
  • A manual restart (systemctl restart) clears the issue
  • Investigate the cause of the crash, usually by looking at the unit journal and checking whether /var/tmp/core/ or /var/lib/systemd/coredump contains a core dump file

HaproxyKafkaSocketDroppedMessages

This means that HaproxyKafka is dropping messages from the socket buffer, most probably due to high memory pressure (a sudden spike in the number of messages to be processed). Looking at other metrics can help pinpoint the root cause.

See also