HAProxyKafka
HAProxyKafka is a daemon running on all cache hosts to read logs produced by HAProxy and forward them to Kafka to be used in analytics pipelines.
HAProxyKafka replaced VarnishKafka to improve observability of incoming requests, given that HAProxy is our main entry point for all CDN requests.
The daemon is written in Go and uses the librdkafka library installed on the cache hosts (and NOT the embedded one provided by the Go package).
Useful links:
- Code repository on GitLab
- Grafana Dashboard
Building the package and deploying a new version
GitLab CI should build the package using the usual branch naming to choose the target distribution. Upload the binary package to the internal APT repository and deploy it manually (or with a tool like Cumin), as usual.
HAProxy configuration
To let HAProxyKafka correctly parse and dispatch messages, HAProxy must be configured:
1. To log to a Unix domain socket (created in advance by HAProxyKafka with the correct permissions)
2. To structure logs with the expected log-format, including the captured headers in the correct order, in RFC5424 format.
Luckily, Puppet configures this automatically, along with the HAProxyKafka configuration file and services; refer to the existing Puppet Hiera/code for the details.
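For illustration only, such a configuration looks roughly like the fragment below; the socket path, facility and field list here are assumptions, and the authoritative version lives in Puppet:

```
# Send logs to the Unix domain socket created by HAProxyKafka, in RFC5424 format
log /run/haproxykafka/haproxykafka.sock len 8192 format rfc5424 local0 info

# Example log-format: the field order and the captured headers (declared
# elsewhere with "http-request capture") must match what HAProxyKafka expects
log-format "%ci %cp %tr %HM %HU %ST %B %[capture.req.hdr(0)] %[capture.req.hdr(1)]"
```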
Caveats
- If a required log field (e.g. the server PID) is missing from the original HAProxy log line, HAProxyKafka fails to process the line and a debug message with the original content is sent to the dead-letter queue (DLQ)
- If an "unexpected" field (e.g. a newly added one) is found in the HAProxy log line and HAProxyKafka has no mapping for it, it is silently discarded from the message sent to the broker; the rest of the fields are parsed as expected
- If a non-required field is missing from the HAProxy log line, HAProxyKafka populates it with the default value (the Go zero value for that specific type) and sends it to the broker
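The three behaviors above can be sketched as follows; the struct, the field names and the process function are illustrative assumptions, not the actual HAProxyKafka code:

```go
package main

import "fmt"

// Webrequest is an illustrative subset of the message sent to Kafka.
type Webrequest struct {
	ClientIP  string // non-required: defaults to "" if absent
	ServerPID int    // required: its absence sends the line to the DLQ
	Status    int    // non-required: defaults to 0 if absent
}

// process maps parsed HAProxy log fields onto a Webrequest. A missing
// required field is an error (the real daemon forwards the original
// line to the DLQ); unknown fields are simply never copied over.
func process(fields map[string]string) (Webrequest, error) {
	var rec Webrequest // zero values are Go's defaults: "", 0, ...
	pid, ok := fields["pid"]
	if !ok {
		return rec, fmt.Errorf("required field pid missing: line goes to DLQ")
	}
	fmt.Sscanf(pid, "%d", &rec.ServerPID)
	if ip, ok := fields["client_ip"]; ok {
		rec.ClientIP = ip
	}
	if s, ok := fields["status"]; ok {
		fmt.Sscanf(s, "%d", &rec.Status)
	}
	// Any other key in fields (e.g. a newly added header) has no
	// mapping here, so it is silently dropped from the message.
	return rec, nil
}

func main() {
	rec, err := process(map[string]string{"pid": "1234", "client_ip": "10.0.0.1", "new_field": "x"})
	fmt.Println(rec, err) // new_field dropped, Status left at its zero value

	_, err = process(map[string]string{"client_ip": "10.0.0.1"})
	fmt.Println(err) // missing required pid: the line would go to the DLQ
}
```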
Procedures
Restarting
Restarting HAProxyKafka takes only a brief amount of time, so the HAProxy log buffer should be sufficient, under normal conditions, to avoid losing messages. If, on the other hand, HAProxyKafka is unavailable for an extended period, HAProxy messages will be silently discarded and lost.
Adding fields for webrequest
Adding a new variable or header to webrequest involves several steps, roughly:
- Add the new field in the HAProxy log and distribute the new configuration with puppet as usual. HaproxyKafka will ignore the new field.
- Open an MR against the HAProxyKafka main branch to add the new field to the processor. Add any needed conversions and/or tests for the new entry, following the existing ones.
- Merge the changes, port them to the packaging branches (depending on the version) and let GitLab CI create the Debian package as usual.
- Distribute the new version to all cache hosts, restarting the service after upgrading the package
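The processor change in the second step can be sketched as below, assuming a table-driven layout; the converters map, the convert function and the field names are hypothetical, so check the actual repository for the real structure:

```go
package main

import (
	"fmt"
	"strconv"
)

// converters maps an HAProxy log field name to a function converting
// its raw string value. Hypothetical layout, for illustration only.
var converters = map[string]func(string) (any, error){
	"status": func(v string) (any, error) { return strconv.Atoi(v) },
	"host":   func(v string) (any, error) { return v, nil },
}

// Adding a new field to the processor boils down to registering
// its converter (done here in init for the sake of the example).
func init() {
	converters["backend_time"] = func(v string) (any, error) {
		return strconv.ParseFloat(v, 64) // e.g. a duration in seconds
	}
}

func convert(name, raw string) (any, error) {
	c, ok := converters[name]
	if !ok {
		return nil, fmt.Errorf("no mapping for %q: field dropped", name)
	}
	return c(raw)
}

func main() {
	v, _ := convert("backend_time", "0.042")
	fmt.Println(v) // 0.042
}
```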
Alerts
HaproxyKafkaExporterDown
This can be caused by the service being stopped (or the process being killed), or in some cases by the process being up while the Prometheus exporter is unable to publish metrics (or other debug information). Usually a restart fixes this, but for more in-depth debugging consider at least capturing the pprof output (if possible) to analyze later, or attaching to the process with strace to check whether some evident issue is going on.
curl -v -o pprof_goroutine.out http://localhost:9341/debug/pprof/goroutine
curl -v -o pprof_profile.out http://localhost:9341/debug/pprof/profile # this can take a while
Analyzing these files locally with go tool pprof is often useful to find culprits:
go tool pprof -http=":8000" ./pprof_goroutine.out
go tool pprof -http=":8000" ./pprof_profile.out
HaproxyKafkaNoMessages
This alert is due to a mismatch between the requests received by HAProxy and the messages sent by HAProxyKafka, meaning that something is going wrong either at the HAProxy logging level or at the HAProxyKafka level.
A good way to debug this (other than checking the haproxykafka service journal on the impacted host) is to check the number of established connections between haproxykafka and the Kafka jumbo cluster (usually on port 9093). A low number of connections, or none at all, means that haproxykafka isn't even trying to send messages to the Kafka brokers.
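A quick way to do that on the affected cache host (the broker port 9093 is an assumption; adjust it to the port actually in use):

```shell
# Established connections towards the Kafka brokers (TLS port 9093 assumed).
# Run as root with -p to also see the owning process name.
ss -tn state established '( dport = :9093 )'
```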
As HAProxy communicates with HAProxyKafka over a Unix socket, there's no "easy" way to inspect the messages flowing from HAProxy to HAProxyKafka, but in this case an strace on write calls could be useful to detect which side has the issue.
Another option is to connect to a host in the Kafka jumbo cluster and issue a command to check that messages are actually coming from that specific host, e.g.:
fabfur@kafka-jumbo1007:~$ kafkacat -q -C -b localhost:9092 -t webrequest_frontend_text -o end | grep cp5017 | jq
The commands described for the alert above could also shed some light in case of software misbehavior.
As this kind of scenario should be quite unusual, it's a good idea to open a ticket for the Traffic team to notify them about it.
HaproxyKafkaRestarted
WARNING: This is still a WIP, the alert and watchdog functionalities are not active yet
This alert is triggered when systemd auto-restarts the service, usually due to a hang (haproxykafka uses systemd's watchdog feature to force a restart when the application stops responding).
When this happens:
- Check the actual number of restarts (if > 0, systemd restarted the service due to a crash) with systemctl show haproxykafka.service -p NRestarts
- A manual restart (systemctl restart) clears the issue
- Investigate the crash cause, usually by looking at the unit journal and checking whether /var/tmp/core/ or /var/lib/systemd/coredump contains a core dump file
HaproxyKafkaSocketDroppedMessages
This means that HAProxyKafka is dropping messages from the socket buffer, most probably due to high memory pressure (a sudden spike in the number of messages to be processed). Looking at other metrics can help pinpoint the root cause.