
HAProxyKafka

From Wikitech

HAProxyKafka is a daemon running on all cache hosts to read logs produced by HAProxy and forward them to Kafka to be used in analytics pipelines.

HaproxyKafka replaced VarnishKafka to improve observability of incoming requests, given that HAProxy is our main entry point for all CDN requests.

The daemon is written in Go and uses the librdkafka library installed on the cache hosts (and NOT the embedded copy provided by the Go bindings).

Useful links:

Building the package and deploying a new version

GitLab CI should build the package just fine, using the usual branch naming to choose the target distribution. Upload the binary package to the internal APT repo and deploy it manually (or with a tool like cumin) as usual.

HAProxy configuration

To let HAProxyKafka correctly parse and dispatch messages, HAProxy must be configured:

1. To log to a unix domain socket (created by HAProxyKafka in advance with the correct permissions)
2. To structure logs with the expected log-format, in RFC 5424 format, including the captured headers in the correct order

Luckily, Puppet configures all of this automatically, along with the HAProxyKafka configuration file and services. Refer to the existing Puppet hiera/code for details.
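As a rough illustration only (not the production configuration, which lives in Puppet), an HAProxy setup along these lines would satisfy both requirements; the socket path, captured headers and log-format string here are all hypothetical:

```
global
    # hypothetical socket path; HAProxyKafka must create it before HAProxy starts
    log /run/haproxykafka/haproxykafka.sock len 8192 format rfc5424 local0 info

frontend fe_example
    bind :80
    # captured headers must appear in the order HAProxyKafka expects
    capture request header Host len 200
    capture request header User-Agent len 400
    # illustrative log-format only; %hr expands the captured request headers
    log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %hr %{+Q}r"
```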

Caveats

  • If a required log field (e.g. the server PID) is missing from the original HAProxy log line, HaproxyKafka fails to process the line and a debug message with the original content is sent to the DLQ.
  • If an "unexpected" field (e.g. a newly added one) is found in the HAProxy log line and HaproxyKafka has no mapping for it, it is silently discarded from the message sent to the broker; the rest of the fields are parsed as expected.
  • If a non-required field is missing from the HAProxy log line, HaproxyKafka populates it with the default value (the Go zero value for that specific type) and sends the message to the broker.
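The three caveats above can be sketched in Go. This is a hypothetical, heavily simplified schema ("pid" required, "status" optional); the real parser, field names and DLQ plumbing differ:

```go
package main

import (
	"fmt"
	"strconv"
)

// Message is a made-up, minimal schema; the real webrequest schema is
// much larger.
type Message struct {
	PID    string
	Status int
}

// processLine mimics the three caveats: a missing required field is an
// error (the real daemon routes the original line to the DLQ), fields
// with no mapping are silently discarded, and missing optional fields
// keep their Go zero value.
func processLine(fields map[string]string) (Message, error) {
	pid, ok := fields["pid"]
	if !ok {
		return Message{}, fmt.Errorf("required field pid missing, line goes to DLQ: %v", fields)
	}
	msg := Message{PID: pid}
	if s, ok := fields["status"]; ok {
		msg.Status, _ = strconv.Atoi(s)
	}
	// any other key in fields has no mapping in Message and is ignored
	return msg, nil
}

func main() {
	// unknown "new_field" is dropped; missing "status" defaults to 0
	msg, err := processLine(map[string]string{"pid": "1234", "new_field": "x"})
	fmt.Println(msg, err) // {1234 0} <nil>

	// a missing required "pid" fails the whole line
	_, err = processLine(map[string]string{"status": "200"})
	fmt.Println(err != nil) // true
}
```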

Procedures

Restarting

Restarting HaproxyKafka only takes a brief amount of time, so the HAProxy log buffer should be sufficient, under normal conditions, to avoid losing messages. If HaproxyKafka is unavailable for a longer period, on the other hand, HAProxy messages will be silently discarded and lost.

Adding fields for webrequest

Adding a new variable or header to webrequest involves several steps, roughly:

  • Add the new field to the HAProxy log and distribute the new configuration with Puppet as usual. HaproxyKafka will ignore the new field for now.
  • Open an MR against the HaproxyKafka main branch to add the new field to the processor. Add any needed conversions and/or tests for the new entry, following the existing ones.
  • Merge the changes, port them to the packaging branches (depending on the version) and let GitLab CI create the Debian package as usual.
  • Distribute the new version to all cache hosts, restarting the service after upgrading the package.

Alerts

HaproxyKafkaExporterDown

This can be caused by the service being stopped or the process being killed, or in some cases by the process being up while the Prometheus exporter is unable to publish metrics (or other debug information). Usually a restart fixes this, but for more in-depth debugging consider capturing at least the pprof output (if possible) to analyze later, or attaching to the process with strace to check whether some evident issue is going on.

curl -v -o pprof_goroutine.out http://localhost:9341/debug/pprof/goroutine
curl -v -o pprof_profile.out http://localhost:9341/debug/pprof/profile # this can take a while

Analyzing these files locally with go tool pprof is often useful for finding culprits:

go tool pprof -http=":8000" ./pprof_goroutine.out
go tool pprof -http=":8000" ./pprof_profile.out

HaproxyKafkaNoMessages

This alert is due to a mismatch between the requests received by HAProxy and the messages sent by HaproxyKafka, meaning that something is wrong either at the HAProxy logging level or at the HaproxyKafka level.

A good way to debug this (other than checking the haproxykafka service journal on the impacted host) is to check the number of established connections between haproxykafka and the Kafka jumbo cluster (usually on port 9093). A low number of connections, or none at all, means that haproxykafka isn't even trying to send messages to the Kafka brokers.

As HAProxy communicates with HaproxyKafka over a unix socket, there's no "easy" way to inspect the messages that flow from HAProxy to HaproxyKafka, but in this case an strace on write calls could be useful to detect which side has the issue.

Another option is to connect to a host in the Kafka jumbo cluster and issue a command to check that messages are actually coming from that specific host, e.g.:

fabfur@kafka-jumbo1007:~$ kafkacat -q -C -b localhost:9092 -t webrequest_frontend_text -o end |grep cp5017 | jq

The pprof commands from the alert right above could also shed some light in case of software misbehavior.

As this kind of scenario should be pretty unusual, it's a good idea to open a ticket with the Traffic team to notify them about it.

HaproxyKafkaRestarted

WARNING: This is still a WIP, the alert and watchdog functionalities are not active yet

This alert is triggered when systemd auto-restarts the service, usually due to a hang (haproxykafka uses systemd's watchdog features so that systemd can force a restart if the application stops responding).

When this happens:

  • Check the actual number of restarts with systemctl show haproxykafka.service -p NRestarts (if > 0, systemd restarted the service due to a crash)
  • A manual restart (systemctl restart) clears the issue
  • Investigate the cause of the crash, usually by looking at the unit journal and checking whether /var/tmp/core/ or /var/lib/systemd/coredump contains a core dump file

HaproxyKafkaSocketDroppedMessages

This means that HaproxyKafka is dropping messages from the socket buffer, most probably due to high memory pressure (a sudden spike in the number of messages to be processed). Looking at other metrics can help pinpoint the root cause.
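The dropping behavior boils down to a non-blocking send into a bounded buffer: when the buffer is full, the message is dropped and a counter (exported via Prometheus in the real daemon) is incremented. A simplified Go sketch, with made-up buffer size and names:

```go
package main

import "fmt"

// sendOrDrop attempts a non-blocking send of each line into buf and
// returns how many lines were dropped because the buffer was full.
// Hypothetical sketch: the real daemon exports the drop count as a
// Prometheus metric instead of returning it.
func sendOrDrop(buf chan string, lines []string) int {
	dropped := 0
	for _, line := range lines {
		select {
		case buf <- line:
		default: // buffer full: drop rather than block HAProxy's log socket
			dropped++
		}
	}
	return dropped
}

func main() {
	buf := make(chan string, 2) // tiny buffer to force drops
	fmt.Println("dropped:", sendOrDrop(buf, []string{"a", "b", "c", "d"})) // dropped: 2
}
```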

See also