The following diagram is meant as a companion to the excellent explanation in https://github.com/wikimedia/eventlogging
The diagram has been created with https://www.draw.io/. If you want the source code, please check https://gist.github.com/elukey/975fab2bcf2ea6398fe1
statsv runs on hafnium.eqiad.wmnet and is a daemon responsible for aggregating performance data before sending it to Statsd. Some metrics in the EventLogging dashboard, like https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=11&fullscreen, depend on this service working properly. If you observe datapoint loss in the metric, please check the status of the statsv service on hafnium and restart it if needed.
As depicted in the diagram, we are monitoring:
- Lag between messages landing in Kafka topics and the message consumption rate of EventLogging's processes, using Burrow. The alarm is triggered as an email sent to analytics-alert@.
- Insertion rate into the MySQL master from the consumer processes, using Graphite/Icinga (config file). You will see alerts in the wikimedia-analytics IRC channel.
- [IN PROGRESS] Replication lag for MySQL slaves (https://phabricator.wikimedia.org/T124306)
Consumption Lag "Alarms": Burrow
Alarms from Burrow report numbers like:
eventlogging-valid-mixed:2 (1454622892611, 199315032, 17) -> (1454622901672, 199315032, 17)
(timestamp, offset, #of messages behind (lag))
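To make the tuple format concrete, here is a small sketch (not part of Burrow itself) that pulls the end-of-window lag out of an alarm line like the one above:

```shell
# Hedged sketch: extract the lag (messages behind) from a Burrow alarm line.
line='eventlogging-valid-mixed:2 (1454622892611, 199315032, 17) -> (1454622901672, 199315032, 17)'
# The third field of the end tuple is the lag at the end of the window.
end_lag=$(echo "$line" | sed -E 's/.*-> \([0-9]+, [0-9]+, ([0-9]+)\).*/\1/')
echo "end-of-window lag: $end_lag messages"
```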
- Burrow evaluates lag over a window of 10 offsets by default. Since we commit offsets every second, that would make us evaluate lag every 10 seconds, which seems too frequent, so we have increased the lag window length to 100 offsets (roughly 100 seconds).
- Lag is evaluated approximately every couple of minutes (we are fine-tuning this).
- Docs about Burrow lag: https://github.com/linkedin/Burrow/wiki/Consumer-Lag-Evaluation-Rules
An interesting alert use case is the following one:
Cluster: eqiad Group: eventlogging-00 Status: ERR Complete: true Errors: 1 partitions have problems
Format: Topic:Partition (timestamp, start-offset, start-lag) -> (timestamp, end-offset, end-lag)
eventlogging-client-side:11 (1455812030875, 188196575, 0) -> (1455813022220, 188205237, 0)
As you can see, the end offset is greater than the start one (while the lag stays at zero), yet the partition status is ERROR. This is due to Burrow's current rule 4 in https://github.com/linkedin/Burrow/wiki/Consumer-Lag-Evaluation-Rules, which should be read this way:
Burrow will alert you when a consumer is so slow that the time elapsed between the last committed offset and now is bigger than the time taken to commit all the offsets belonging to the last window.
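A worked example of this check, using the commit timestamps (in milliseconds) from the alert above; the "now" timestamp here is hypothetical, and this is a simplified sketch of the rule rather than Burrow's actual code:

```shell
# Timestamps (ms) of the first and last commits in the window, from the alert above.
start_ts=1455812030875
end_ts=1455813022220
now_ts=1455814500000                    # hypothetical current time
window_duration=$((end_ts - start_ts))  # time taken to commit the window's offsets
since_last_commit=$((now_ts - end_ts))  # time elapsed since the last commit
if [ "$since_last_commit" -gt "$window_duration" ]; then
  echo "rule 4 fires: consumer looks stalled"
else
  echo "consumer OK"
fi
```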
This can happen when a Kafka broker that is a partition leader goes offline. EventLogging needs a bit of time to recognize the problem and request new metadata from Kafka, so the related consumer status according to Burrow will look stalled or completely blocked even though it is only a temporary stop.
MySQL Master/Slave replication
EventLogging databases (m4 shard): db1046 (m4-master), db1047 (analytics slave 1), dbstore1002 (analytics slave 2), and dbstore2002 (Dallas backup) use a custom replication mechanism:
- eventlogging_sync.init, used to bootstrap the following script and keep it running.
- eventlogging_sync.sh, used to replicate the master's tables to the slaves.
For more information please check T124306.
EventLogging Routine Maintenance for the on-call
- Check grafana for problems with raw vs. validated events, or other apparent problems
- Check database for any gaps. A few different scripts exist, Milimetric's gist for example.
- Decide whether we need to deploy that week; avoid Friday deployments
- Remember to log all actions to the SAL (!log <something> on the ops channel)
- Report outages as part of Wikimedia's incident reports so there is a reference
- Follow up on any alarms that might be raised
- Ensure you can access graphite (wikitech credentials).
- Get Access to eventlog1001.eqiad.wmnet with root access.
- Ask Andrew (ottomata on IRC) to setup the system to send you alarms.
- In case of DB errors, the person to contact is Jaime Crespo (jcrespo on IRC), usually on #wikimedia-operations channel.
Restart all EventLogging processes
sudo eventloggingctl stop && sleep 2 && sudo eventloggingctl start
sudo eventloggingctl stop
The upstart config that sets up logging and the like is at:
Start/Stop/Restart individual EventLogging processes
EventLogging processes are managed by upstart. Each config file in /etc/eventlogging.d/*/* corresponds to a single eventlogging process. Let's call the pieces of this hierarchy /etc/eventlogging.d/$service/$name.
To stop one of them, you can do something like:
stop eventlogging/consumer NAME=mysql-m4-master-00 CONFIG=/etc/eventlogging.d/consumers/mysql-m4-master-00
You can stop, start, and restart any individual EventLogging process using variations of this command.
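The mapping from a config path to an upstart invocation can be sketched as follows (the config path is the illustrative one from above; note that the upstart job name drops the trailing "s" from the directory name):

```shell
# Derive the upstart invocation from a /etc/eventlogging.d/$service/$name path.
config=/etc/eventlogging.d/consumers/mysql-m4-master-00
service=$(basename "$(dirname "$config")")   # "consumers"
name=$(basename "$config")                   # "mysql-m4-master-00"
# Strip the trailing "s" to get the upstart job name ("consumer").
echo "stop eventlogging/${service%s} NAME=$name CONFIG=$config"
```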
For example, if your schema name is MobileBetaWatchList, you can monitor new events with
zsub vanadium.eqiad.wmnet:8600 | grep '"MobileBetaWatchList"'
Raw logs are at:
Process logs are at:
Fix graphite counts if not working
There seem to be some problems with upstart and EL; we have had to start processes by hand. If graphite counts are affected, it is likely that consumers on either hafnium or eventlog1001.eqiad.wmnet are not running. Note that there are consumers on both machines: global counts are reported from eventlog1001.eqiad.wmnet and per-schema counts are reported from hafnium.
To restart consumers on hafnium do:
start eventlogging/consumer NAME=graphite CONFIG=/etc/eventlogging.d/consumers/graphite
You should see a process or a set of processes named "eventlogging-consumer", similar to the following:
/usr/bin/python /usr/local/bin/eventlogging-consumer @/etc/eventlogging.d/consumers/graphite
You can use tcpdump to see what is sent to statsd (statsd.eqiad.wmnet); you should see something like:
18:03:13.590338 IP 184.108.40.206.39839 > 10.64.32.155.8125: UDP, length 28
	0x0000:  4500 0038 33dc 4000 4011 715e d050 9a4f  E..83.@.@.q^.P.O
	0x0010:  0a40 209b 9b9f 1fbd 0024 95b0 6576 656e  .@.......$..even
	0x0020:  746c 6f67 6769 6e67 2e73 6368 656d 612e  tlogging.schema.
	0x0030:  4564 6974 3a31 7c6d                      Edit:1|m
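The ASCII payload visible in the dump is a plain-text statsd datagram of the form "&lt;metric&gt;:&lt;value&gt;|&lt;type&gt;" ("m" is the statsd meter type). A small sketch parsing the one seen above:

```shell
# Parse a statsd datagram into its metric name, value, and type.
datagram='eventlogging.schema.Edit:1|m'
metric=${datagram%%:*}   # everything before the first ":"
rest=${datagram#*:}
value=${rest%%|*}        # everything before the "|"
mtype=${rest#*|}         # the statsd metric type
echo "metric=$metric value=$value type=$mtype"
```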
Get a first feeling of end-to-end issues on the whole pipeline of a schema
If you're interested, for example, in NavigationTiming_10374055, run
mysql --defaults-extra-file=/etc/mysql/conf.d/research-client.cnf --host dbstore1002.eqiad.wmnet -e "select left(timestamp,10) ts , COUNT(*) from log.NavigationTiming_10374055 where left(timestamp,8) >= '20150101' group by ts order by ts;" >out && tail -n 100 out
on stat1003 (you need to be in the researchers group to access research-client.cnf). That will dump recent hourly totals to the screen; if you prefer graphs, more data is stored in out and only waiting to get plotted. If you need a different granularity, just change the 10 in the query to the granularity you need (like 8 => per day, 12 => per minute, 14 => per second).
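The substring trick works because EventLogging timestamps are yyyymmddHHMMSS strings, so taking the first N characters buckets events at coarser or finer granularities. A quick illustration (sample timestamp is made up):

```shell
# EventLogging timestamps are yyyymmddHHMMSS; prefixes define the bucket size.
ts=20150101123456
echo "per day (8):     ${ts:0:8}"
echo "per hour (10):   ${ts:0:10}"
echo "per minute (12): ${ts:0:12}"
```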
If the numbers you get indicate issues, you can go to Graphite to sanity-check the early parts of the pipeline. A first indicator is typically the overall counts, like comparing eventlogging.overall.valid.rate. Then one can bisect into eventlogging.server_side_events.*, or directly drill down into per-schema counts, like looking at the graph for the schema in question. If you're good at staring at graphs, go right to the all-in-one graph.
If the numbers you get indicate issues, you can also repeat the database query against the m4 master database directly (credentials are on vanadium). That allows you to exclude replication issues.
Troubleshoot events coming in in real time
- Incoming event counts are logged to graphite; both the count of validating and non-validating events per schema are available. Using those, users can get a sense of change: graphite is populated in real time, and if all of a sudden events for a schema do not validate, it is clearly visible.
- The EventLogging slave database (accessible from 1002 for users with access to the 'research' user) is also populated in real time.
Lastly, EventLogging events coming in in real time are written to log files that are synced to 1002 once a day; these logs can be found here:
If you detect an issue or a suspicious change, please notify analytics@ and escalate with the analytics devs.
Troubleshoot insufficient permission
"error: insufficient permission for adding an object to repository database .git/objects"
Run groups to see if you are in the wikidev group; if so, it is likely that some files in the .git directory are not writable by the "wikidev" group. Make them so.
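A minimal sketch of the fix on a scratch directory; on the real host you would run the chmod on the deploy repo's .git, and also chgrp -R wikidev if the group ownership itself is wrong (assuming you have permission to do so):

```shell
# Simulate a .git tree that has lost group write access, then restore it.
repo=$(mktemp -d)
mkdir -p "$repo/.git/objects"
chmod -R g-w "$repo/.git"   # the broken state: group cannot write
chmod -R g+w "$repo/.git"   # the fix: restore group write access
perm=$(stat -c %A "$repo/.git/objects")
echo "$perm"                # group write bit should now be set
```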
EventLogging is deployed using scap3. The scap deployment configuration for the various EventLogging deployments can be found in specific scap repos in gerrit: eventlogging/scap/<deployment-name>. The EventLogging Analytics deployment scap configs are at eventlogging/scap/analytics. Deploy from deployment.eqiad.wmnet using:
# ssh to production deploy server
ssh deployment.eqiad.wmnet
# cd to the EventLogging Analytics instance deploy source
cd /srv/deployment/eventlogging/analytics
# Checkout the revision you want to deploy
git pull
# Update the submodules
git submodule update --init
# Run scap3 deployment
deploy
ssh eventlog1001.eqiad.wmnet (or wherever eventlogging is deployed.)
Go to /srv/deployment/eventlogging/analytics
See that the checkout is there from what you just pulled in from tin (via git log).
Restart EL on target host (eventlog1001.eqiad.wmnet)
eventloggingctl stop
eventloggingctl start
Check various logs in /var/log/upstart/eventlogging_* to see that things are running as they should.
Check that /srv/log/eventlogging/all-events.log has data flowing in.
Hop into the Ops IRC channel and !log that you upgraded & restarted EventLogging, adding the commit hash that you deployed.
!log Redeployed eventlogging with revert to batch/queue size change - https://gerrit.wikimedia.org/r/#/c/258384/
Now please deploy latest code to Beta Cluster to keep things in sync: EventLogging/Testing/BetaLabs#How_to_deploy_code
Blacklist a schema
Resolve lagging replication between MySQL Master and Slaves
If replication is lagging, we should open a ticket with the info we have and tag it with the Phabricator DBA tag so the DBAs get pinged. An example of a prior problem: 
Keep in mind that while database monitoring is here: , the lag reported there does not apply, as EL replication doesn't go through the regular channels.
Ad hoc replication script is here: https://github.com/wikimedia/operations-software/blob/master/dbtools/eventlogging_sync.sh
Check replication script