
Analytics/EventLogging/How to

== Restart EventLogging ==
Check the upstart configuration:
 
  /etc/init/eventlogging/init.conf
 
To restart:
 
  sudo eventloggingctl restart
 
To stop completely:
 
  sudo eventloggingctl stop
 
The consumer configuration that upstart applies (which determines where logs are written) lives in:
 
  /etc/eventlogging.d/consumers/
 
For some reason this is occasionally wrong: instead of saying
 
  nuria@vanadium:~$ more /etc/eventlogging.d/consumers/all-events-log
  tcp://127.0.0.1:8600?socket_id=all-events.log
  file:///srv/log/eventlogging/all-events.log
 
it says
 
  nuria@vanadium:~$ more /etc/eventlogging.d/consumers/all-events-log
  tcp://127.0.0.1:8600?socket_id=all-events.log
  file:///all-events.log
 
Note that the latter output path will not work and needs to be changed.
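A sketch of a fix, assuming the correct output path is the one from the working example above (adjust if your log directory differs):

```shell
# Rewrite the broken output URI in place; target path assumed from
# the working config shown above.
sudo sed -i 's|file:///all-events.log|file:///srv/log/eventlogging/all-events.log|' \
  /etc/eventlogging.d/consumers/all-events-log
# Restart so the consumer picks up the corrected config:
sudo eventloggingctl restart
```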
 
== Backfilling ==
[[EventLogging/Backfilling]]
 
For example, if your schema name is <code>MobileBetaWatchList</code>, you can monitor new events with <code>zsub vanadium.eqiad.wmnet:8600 | grep '"MobileBetaWatchList"'</code>
 
== See logs ==
Raw logs are at:
  /var/log/eventlogging
 
Process (upstart) logs are at:
  /var/log/upstart/
 
== Fix graphite counts not working ==
There seem to be some problems with upstart and EL; we have had to start processes by hand. If Graphite counts are affected, it is likely that consumers on either hafnium or eventlog1001.eqiad.wmnet are not running. Note that there are consumers on both machines: global counts are reported from eventlog1001.eqiad.wmnet and per-schema counts from hafnium.
 
To restart consumers on hafnium, run:
 
start eventlogging/consumer NAME=graphite CONFIG=/etc/eventlogging.d/consumers/graphite
 
You should then see a process or set of processes similar to the following (one or more "eventlogging-consumer" processes):
 
/usr/bin/python /usr/local/bin/eventlogging-consumer @/etc/eventlogging.d/consumers/graphite
 
You can use tcpdump to see what is sent to statsd (statsd.eqiad.wmnet); you should see packets like:
 
18:03:13.590338 IP 208.80.154.79.39839 > 10.64.32.155.8125: UDP, length 28
0x0000:  4500 0038 33dc 4000 4011 715e d050 9a4f  E..83.@.@.q^.P.O
0x0010:  0a40 209b 9b9f 1fbd 0024 95b0 6576 656e  .@.......$..even
0x0020:  746c 6f67 6769 6e67 2e73 6368 656d 612e  tlogging.schema.
0x0030:  4564 6974 3a31 7c6d                      Edit:1|m
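A capture like the one above can be produced with a command along these lines (a sketch; run it on the EventLogging host, and note that 8125 is statsd's conventional UDP port, matching the capture):

```shell
# -A prints packet payloads as ASCII, so statsd metric lines such as
# "eventlogging.schema.Edit:1|m" are readable; -n skips DNS lookups.
sudo tcpdump -A -n udp and dst port 8125
```

The payload is statsd's plain-text wire format, one `name:value|type` line per metric.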
 
== Get a first impression of end-to-end issues across a schema's pipeline ==
 
If, for example, you're interested in [https://meta.wikimedia.org/w/index.php?title=Schema:NavigationTiming&oldid=10374055 NavigationTiming 10374055], run
 
  mysql --defaults-extra-file=/etc/mysql/conf.d/research-client.cnf --host dbstore1002.eqiad.wmnet -e "select left(timestamp,10) ts , COUNT(*) from log.NavigationTiming_10374055 where left(timestamp,8) >= '20150101' group by ts order by ts;" >out && tail -n 100 out
 
on stat1003 (you need to be in the researchers group to access <code>research-client.cnf</code>). That will dump recent hourly totals to the screen; if you prefer graphs, more data is stored in <code>out</code>, waiting to be plotted.
 
If you need different granularity, just change the <code>10</code> in the query to the granularity you need (like 8 => per day, 12 => per minute, 14 => per second).
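The granularity trick works because MediaWiki timestamps are fixed-width <code>YYYYMMDDHHMMSS</code> strings, so taking a left prefix buckets events by time. For instance, the daily variant of the query above (same host and credentials as before):

```shell
# left(timestamp,8) keeps YYYYMMDD, i.e. one bucket per day.
mysql --defaults-extra-file=/etc/mysql/conf.d/research-client.cnf \
  --host dbstore1002.eqiad.wmnet \
  -e "select left(timestamp,8) ts, COUNT(*) from log.NavigationTiming_10374055 where left(timestamp,8) >= '20150101' group by ts order by ts;"
```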
 
If the numbers you get indicate issues, you can go to [[Graphite]] to sanity-check the early parts of the pipeline. A first indicator is typically the <code>overall</code> counts, like [https://graphite.wikimedia.org/render/?width=1613&height=780&_salt=1420797818.805&target=eventlogging.overall.raw.rate&target=eventlogging.overall.valid.rate&from=-3days comparing <code>eventlogging.overall.raw.rate</code> to <code>eventlogging.overall.valid.rate</code>]. Then one can bisect into [https://graphite.wikimedia.org/render/?width=1613&height=780&_salt=1420798525.397&target=eventlogging.client_side_events.raw.rate&target=eventlogging.client_side_events.valid.rate&from=-3days <code>eventlogging.client_side_events.*</code>] or [https://graphite.wikimedia.org/render/?width=1613&height=780&_salt=1420798583.589&from=-3days&target=eventlogging.server_side_events.raw.rate&target=eventlogging.server_side_events.valid.rate <code>eventlogging.server_side_events.*</code>], or drill directly down into per-schema counts, like looking at the [https://graphite.wikimedia.org/render/?width=1613&height=780&_salt=1420798046.608&from=-3days&target=eventlogging.schema.NavigationTiming.rate graph for <code>eventlogging.schema.NavigationTiming.rate</code>].
If you're good at staring at graphs, go right to the [https://graphite.wikimedia.org/render/?width=1613&height=780&_salt=1420798959.213&from=-3days&colorList=8100ff%2Cad59ff%2Ca45b00%2Cff8e00%2C007c4f%2C00ef98%2Cff0000%2C000000&target=eventlogging.overall.raw.rate&target=eventlogging.overall.valid.rate&target=eventlogging.client_side_events.raw.rate&target=eventlogging.client_side_events.valid.rate&target=eventlogging.server_side_events.raw.rate&target=eventlogging.server_side_events.valid.rate&target=eventlogging.schema.MediaViewer.rate&target=eventlogging.schema.NavigationTiming.rate all in one graph].
 
If the numbers you get indicate issues, you can also repeat the database query against the <code>m2</code> master database directly (credentials are in <code>/etc/eventlogging.d/consumers/mysql-m2-master</code> on <code>[[vanadium]]</code>). That allows you to rule out replication issues.
 
== Troubleshoot incoming events in real time ==
 
* Incoming event counts are logged to Graphite; both the count of validating and of non-validating events per schema are available. Graphite is populated in real time, so if events for a schema suddenly stop validating, it is clearly visible.
 
* The EventLogging slave database (accessible from stat1002 for users with access to the 'research' user) is also populated in real time.
 
Lastly, incoming EventLogging events are written to logs that are synced to stat1002 once a day; these logs can be found at:
  stat1002:/a/eventlogging/archive
 
If you detect an issue or a suspicious change, please notify analytics@ and escalate with the Analytics developers.
 
== Troubleshoot insufficient permission ==
 
"error: insufficient permission for adding an object to repository database .git/objects"
 
Run <code>groups</code> to see whether you are in the wikidev group. If you are, likely some files in the .git directory are not writable by the wikidev group; make them so.
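Assuming the repository is meant to be group-writable by wikidev, a fix along these lines (run from the repository root) usually works:

```shell
# Check group membership first:
groups
# Re-own the object database and make it group-writable:
sudo chgrp -R wikidev .git
sudo chmod -R g+w .git
```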
 
==Deploying EventLogging==
 
(Crude brain dump; might already be outdated.)
 
# Deploy on [[tin]] using [[Trebuchet#Deploying|git deploy]].
#: Go to: /srv/deployment/eventlogging/EventLogging
#: git-deploy does all the tagging in the git repo and brings the new code to vanadium and hafnium.
#: “git deploy sync” will ask whether you want to continue because only 1/5 machines completed some step. Say “y”es.
#: It will ask the same again for some other step; again, say “y”es.
# (NOTE: we should at some point clean up the deployment config to stop trying vanadium, osmium, etc.)
 
 
Sample sequence to deploy the master branch:
 
  nuria@tin:/srv/deployment/eventlogging/EventLogging$ git deploy start
  nuria@tin:/srv/deployment/eventlogging/EventLogging$ git checkout master
  nuria@tin:/srv/deployment/eventlogging/EventLogging$ git pull
  nuria@tin:/srv/deployment/eventlogging/EventLogging$ git deploy sync
 
 
# Install the new code on the EventLogging box:
#: Go to /srv/deployment/eventlogging/EventLogging and verify the checkout contains what you just synced from tin.
#: Build the python package (in /srv/deployment/eventlogging/EventLogging/server):
#::  python setup.py install
#: Stop and start EventLogging:
#::  eventloggingctl stop
#::  eventloggingctl start
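After restarting, it is worth confirming that the daemons actually came back; a quick check (the bracketed first letter keeps grep's own process out of the match):

```shell
# Lists running EventLogging processes without matching the grep
# command line itself.
ps auxww | grep '[e]ventlogging'
```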
 
 
# Repeat these steps on [[hafnium]].
# Hop in the Ops IRC channel and !log that you upgraded & restarted EventLogging and add the commit hash that you deployed.
 
 
Now please deploy the latest code to beta labs to keep things in sync: [[EventLogging/Testing/BetaLabs#How_to_deploy_code]]

Latest revision as of 19:01, 13 July 2017