Purged

Revision as of 19:46, 17 June 2022

purged (pronounced "purge-dee") is a daemon running on all cache hosts that reads Kafka purge messages, parses them, and turns them into HTTP purge requests to be sent to the local ATS and Varnish daemons.
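
As a rough illustration of the last step described above (turning a purge message into an HTTP PURGE request), here is a minimal Go sketch. It is not taken from the purged source code: the function names `sendPurge` and `newFakeCache`, the hostname `example.org`, and the path are hypothetical, and a local test server stands in for ATS or Varnish so that the example runs anywhere.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// sendPurge issues an HTTP PURGE request for path to the cache listening at
// baseURL and returns the response status code. The Host header is set to the
// public hostname of the resource being purged (hypothetical helper).
func sendPurge(client *http.Client, baseURL, host, path string) (int, error) {
	req, err := http.NewRequest("PURGE", baseURL+path, nil)
	if err != nil {
		return 0, err
	}
	req.Host = host
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// newFakeCache starts a test server standing in for a local cache daemon.
// It accepts PURGE and rejects every other method (assumption for the demo).
func newFakeCache() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != "PURGE" {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := newFakeCache()
	defer srv.Close()

	status, err := sendPurge(srv.Client(), srv.URL, "example.org", "/wiki/Example")
	if err != nil {
		fmt.Println("purge failed:", err)
		return
	}
	fmt.Println("PURGE status:", status)
}
```

The real daemon additionally consumes and parses the Kafka messages and fans requests out to worker pools; none of that is modelled here.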

Detailed information about running purged instances can be found on the purged Grafana dashboard: https://grafana.wikimedia.org/d/RvscY1CZk/purged

The daemon is written in Go; see the operations-software-purged repository (https://github.com/wikimedia/operations-software-purged).

Building the package

To target Debian Buster, build the package as follows on the build host (deneb at the time of this writing):

WIKIMEDIA=yes BACKPORTS=yes ARCH=amd64 DIST=buster GIT_PBUILDER_AUTOCONF=no gbp buildpackage -jauto -us -uc -sa --git-builder=git-pbuilder

Alerts

The details about which pages need to be purged come from Kafka. For this reason we monitor the time elapsed since the last Kafka message was received and alert if it exceeds a certain threshold. Here is what the alert looks like:

Time elapsed since the last kafka event processed by purged on cp2041 is CRITICAL: cluster=cache_text instance=cp2041 job=purged site=codfw topic={codfw.resource-purge,eqiad.resource-purge} https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2041

If this happens and nobody from Traffic is around to take a look, the best course of action is to kill purged and check the journal to confirm that systemd restarts it properly:

$ sudo pkill -9 purged
$ sudo journalctl -u purged --since today
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Main process exited, code=killed, status=9/KILL
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Failed with result 'signal'.
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Consumed 2d 16h 43min 48.527s CPU time.
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Service RestartSec=100ms expired, scheduling restart.
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Scheduled restart job, restart counter is at 58.
Jun 29 08:57:05 cp2041 systemd[1]: Stopped Purger for ATS and Varnish.
Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Consumed 2d 16h 43min 48.527s CPU time.
Jun 29 08:57:05 cp2041 systemd[1]: Started Purger for ATS and Varnish.
Jun 29 08:57:05 cp2041 purged[44684]: 2020/06/29 08:57:05 Listening for topics eqiad.resource-purge,codfw.resource-purge
Jun 29 08:57:05 cp2041 purged[44684]: 2020/06/29 08:57:05 Process purged started with 48 backend and 4 frontend workers. Metrics at :2112/metrics
Jun 29 08:57:05 cp2041 purged[44684]: 2020/06/29 08:57:05 Start consuming topics [eqiad.resource-purge codfw.resource-purge] from kafka
Jun 29 08:57:05 cp2041 purged[44684]: 2020/06/29 08:57:05 Reading from 239.128.0.112,239.128.0.115 with maximum datagram size 4096
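
When scanning a longer journal, it can help to pull out the systemd restart counter shown in the output above. The following Go sketch does that for a single line; the helper name `restartCounter` and the regular expression are assumptions based only on the sample log line, not on any existing tooling.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// restartRe matches the counter in systemd's "Scheduled restart job" message,
// as it appears in the sample journal output.
var restartRe = regexp.MustCompile(`restart counter is at (\d+)`)

// restartCounter extracts the restart counter from a journal line.
// The boolean is false when the line does not contain a counter.
func restartCounter(line string) (int, bool) {
	m := restartRe.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	line := "Jun 29 08:57:05 cp2041 systemd[1]: purged.service: Scheduled restart job, restart counter is at 58."
	if n, ok := restartCounter(line); ok {
		fmt.Println("restart counter:", n)
	}
}
```

A counter that keeps climbing shortly after the manual kill would suggest that purged is crash-looping rather than recovering.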


Gather information for bug reports

[Figure: purged CPU profile diagram]

If purged seems to be misbehaving, data such as perf reports, call graphs and Go profiles can be useful to diagnose the issue.

One minute of perf data can be gathered on the host where purged is running with:

sudo timeout 60 perf record -p `pidof purged`
sudo perf report --stdio

Similarly, a 60-second Go CPU profile can be collected with:

curl 'http://localhost:2112/debug/pprof/profile?seconds=60' > cpuprof

Copy the file cpuprof to a system with Go installed, and run the following command to print a breakdown of CPU usage to standard output:

go tool pprof -top cpuprof

A PNG profile diagram can be created with the following command (this requires Graphviz to be installed):

go tool pprof -png cpuprof

See also

Source code: https://gerrit.wikimedia.org/g/operations/software/purged