
Distributed tracing


Distributed tracing starts with a single incoming request from a user, and tracks and records all the sub-queries issued between different microservices to handle that request.

If you are new to distributed tracing, we recommend beginning with Tutorial/Start

We implement distributed tracing with the OpenTelemetry Collector (otelcol) running on each production host as a collection point, and Jaeger as the indexer and search/display interface for the trace data. The data at rest lives on the same OpenSearch cluster that backs Logstash.

As of August 2024, the only production component that emits data to otelcol is the Envoy proxy. Our Envoy configuration offers simple opt-in tracing of both incoming and outgoing requests, as long as your application propagates tracing context -- see #Enabling tracing for a service and also /Propagating tracing context.
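Context propagation means each service must copy the incoming trace headers onto any outgoing requests it makes, so Envoy can stitch the spans into one trace. A minimal sketch of the idea, assuming the standard W3C Trace Context headers (`traceparent`/`tracestate`) that Envoy understands; the helper name is hypothetical, see /Propagating tracing context for the real guidance:

```python
def forward_trace_headers(incoming_headers, outgoing_headers=None):
    """Copy W3C Trace Context headers from an incoming request's headers
    onto the headers of an outgoing sub-request, so the trace stays connected.

    Only the tracing headers are forwarded; everything else in the incoming
    request is deliberately left behind.
    """
    outgoing = dict(outgoing_headers or {})
    for name in ("traceparent", "tracestate"):
        value = incoming_headers.get(name)
        if value is not None:
            outgoing[name] = value
    return outgoing


# Example: a service handling a traced request makes its own sub-query.
incoming = {
    # Example traceparent value from the W3C Trace Context specification.
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "cookie": "session=secret",  # must NOT be forwarded
}
sub_request_headers = forward_trace_headers(incoming, {"accept": "application/json"})
```

In a real service you would call the equivalent of this helper (or let your OTel/HTTP client library do it) on every outgoing request issued while handling a traced incoming request.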

Service owners who want to emit OTel data themselves should get in touch at TODO (irc, phab, and/or slack?)


Enabling tracing for a service

TODO simple configuration instructions for helm charts in production

After enabling tracing, you should also do a brief audit for any easily-removable PII embedded in traces. Some PII is inevitable, but especially sensitive data can be scrubbed by writing an otelcol processor rule. SRE is happy to assist with this; existing examples can be found under transform/scrub: in helmfile.d/admin_ng/opentelemetry-collector/values.yaml
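For illustration, a scrub rule of this kind might look like the following sketch of an otelcol transform processor. The attribute name here is a made-up example, not one of our actual rules; the real rules live in the values.yaml file above:

```yaml
# Illustrative only -- the attribute names and rule names are hypothetical.
processors:
  transform/scrub:
    trace_statements:
      - context: span
        statements:
          # Drop a sensitive request header that ended up as a span attribute.
          - delete_key(attributes, "http.request.header.cookie")
```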

TODO a way for service owners to e2e test in staging? no otelcol deployment there https://phabricator.wikimedia.org/T365809

More user-facing documentation

[17:30]  <    taavi> how do i search trace.wikimedia.org with a specific request id?
[09:33]  <   claime> taavi: search for guid:x-request-id=$your-request-id

Tutorial sub-page: how to read a trace

/Tutorial/Start