Distributed tracing
Distributed tracing starts with a single incoming request from a user, and tracks and records all the sub-queries issued between different microservices to handle that request.
If you are new to distributed tracing, we recommend beginning with Tutorial/Start.
We implement distributed tracing with the OpenTelemetry Collector (otelcol) running on each production host as a collection point, with Jaeger as the indexer and search/display interface to the trace data. The data at rest lives on the same OpenSearch cluster that also backs Logstash.
As of August 2024, the only thing in production that emits data to otelcol is Envoy proxy. Our Envoy configuration offers simple opt-in tracing of both incoming and outgoing requests, as long as your application propagates tracing context -- see #Enabling tracing for a service and also /Propagating tracing context.
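Propagating context means copying the tracing headers from the incoming request onto every outgoing sub-request, so Envoy can stitch the spans into one trace. A minimal sketch, assuming W3C Trace Context (traceparent) headers; the header names below are the standard W3C/common ones, not anything specific to our Envoy configuration:

```python
import secrets

# Headers that carry tracing context and must be forwarded unchanged.
TRACE_HEADERS = ("traceparent", "tracestate", "x-request-id")

def propagated_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Copy tracing headers from an incoming request onto an
    outgoing sub-request; start a fresh context if none exists."""
    out = {k: v for k, v in incoming.items() if k.lower() in TRACE_HEADERS}
    if "traceparent" not in out:
        # W3C traceparent: version "00", 16-byte trace-id,
        # 8-byte parent-id, "01" = sampled flag.
        out["traceparent"] = "00-%s-%s-01" % (
            secrets.token_hex(16), secrets.token_hex(8))
    return out

incoming = {"traceparent": "00-" + "ab" * 16 + "-" + "cd" * 8 + "-01",
            "accept": "application/json"}
print(propagated_headers(incoming))  # only the traceparent is forwarded
```

In a real service you would let an OpenTelemetry SDK or middleware do this for you rather than handling headers by hand.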
Any interested service owners who want to emit OTel themselves should get in touch at TODO (irc, phab, and/or slack?)
Enabling tracing for a service
TODO simple configuration instructions for helm charts in production
After enabling tracing, you should also do a brief audit for any easily-removable PII embedded in traces. Some PII is inevitable, but especially-sensitive data may be scrubbed by writing an otelcol processor rule. SRE is happy to assist with this; you can find existing examples under transform/scrub: in helmfile.d/admin_ng/opentelemetry-collector/values.yaml
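For illustration only, a scrub rule in the otelcol transform processor might look like the following. This is a hypothetical sketch using OTTL statements, not a copy of our production rules; the attribute names and patterns are invented for the example:

```yaml
processors:
  # Hypothetical example: scrub sensitive span data before export.
  transform/scrub:
    trace_statements:
      - context: span
        statements:
          # Drop a request header captured as a span attribute.
          - delete_key(attributes, "http.request.header.cookie")
          # Redact a query-string secret embedded in the URL attribute.
          - replace_pattern(attributes["http.url"], "token=[^&]+", "token=REDACTED")
```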
TODO: a way for service owners to e2e test in staging? There is no otelcol deployment there yet -- https://phabricator.wikimedia.org/T365809
More user-facing documentation
To find the trace for a specific request on trace.wikimedia.org, search for the tag guid:x-request-id=$your-request-id