Analytics/Systems/Cluster/Edit history administration
We rebuild the MediaWiki edit history from the new DB replicas on labs; those databases only hold public data.
Administration
See sqoop crons
Sqoop jobs run from analytics1003.
sudo -u hdfs crontab -u hdfs -l
See sqoop errors
Logs are in /var/log/refinery; grep for ERROR:
nuria@analytics1003:/var/log/refinery$ more sqoop-mediawiki.log | grep ERROR | more
2018-03-02T10:22:27 ERROR ERROR: zhwiki.revision (try 1)
2018-03-02T10:31:20 ERROR ERROR: zhwiki.pagelinks (try 1)
2018-03-02T11:09:17 ERROR ERROR: svwiki.pagelinks (try 1)
2018-03-02T11:30:38 ERROR ERROR: zhwiki.pagelinks (try 2)
2018-03-02T13:17:17 ERROR ERROR: viwiki.pagelinks (try 1)
QA: Assessing quality of a snapshot
Once denormalization has run, we need to check that the newly created snapshot is of good quality (i.e. the data should match the last snapshot; bugs might have been introduced since the last snapshot was run).
Automatic validation steps
Two similar automatic validation steps check the newly generated snapshot against the previously generated one, for both the denormalized and the reduced datasets.
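If you want to spot-check the same thing by hand, a minimal sketch (assuming the wmf.mediawiki_history table with its snapshot partition, run from a host with the hive CLI; snapshot names and wiki below are illustrative) is to compare per-wiki event counts between the new snapshot and the previous one:

hive -e "
  SELECT snapshot, wiki_db, COUNT(*) AS events
  FROM wmf.mediawiki_history
  WHERE snapshot IN ('2018-02', '2018-03')
    AND wiki_db = 'enwiki'
  GROUP BY snapshot, wiki_db;
"

Differences much larger than a month's worth of new edits are usually a sign that something went wrong during reconstruction.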
Manually compare data with available data sources
Example: data is available for all Wikipedias on pages like https://en.wikipedia.org/wiki/Special:Statistics
That page lists, among other things, the number of articles for each Wikipedia; does that number match the data returned by the request below?
https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/content/monthly/2001010100/2018032900
A handy link to transform JSON data into CSV that can be imported into a spreadsheet for easy computations: [1]
Is there data for all types of editors including anonymous editors?
https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-projects/anonymous/all-page-types/monthly/2016030100/2018042400
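As a rough illustration of both checks (assuming curl and jq are available, and assuming the AQS response fields are named new_pages and edits respectively, which may need adjusting), the numbers can be pulled and turned into CSV for a spreadsheet:

# New content pages per month for en.wikipedia, as CSV (timestamp,new_pages)
curl -s 'https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/content/monthly/2001010100/2018032900' | jq -r '.items[0].results[] | [.timestamp, .new_pages] | @csv'
# Anonymous edits per month across all projects, as CSV (timestamp,edits)
curl -s 'https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-projects/anonymous/all-page-types/monthly/2016030100/2018042400' | jq -r '.items[0].results[] | [.timestamp, .edits] | @csv'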
Data Loading
Analytics/Systems/Data_Lake/Edits/Pipeline/Data_loading
How is this data gathered: public data from labs
The sqoop job runs on analytics1003 (although that might change; check puppet) and so far it logs to /var/log/refinery/sqoop-mediawiki.log
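To check on the most recent run, tailing that log and counting errors is usually enough (plain shell, nothing analytics-specific):

tail -n 50 /var/log/refinery/sqoop-mediawiki.log
grep -c ERROR /var/log/refinery/sqoop-mediawiki.log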
How is this data gathered: ad-hoc private replicas
Let's go over how to run this process for ad-hoc private replicas (which we do once in a while to be able to analyze editing data that's not public).
- Keep in mind that after you do the next step, the following job will trigger automatically if the _SUCCESS flags are written and the Oozie datasets are considered updated: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/mediawiki/history/denormalize
- Run the same cron that pulls from labs replicas, but with the following changes (see the sketch after this list):
- $wiki_file = '/mnt/hdfs/wmf/refinery/current/static_data/mediawiki/grouped_wikis/prod_grouped_wikis.csv'
- $db_host = 'analytics-store.eqiad.wmnet'
- $db_user = 'research'
- $db_password_file = '/user/hdfs/mysql-analytics-research-client-pw.txt'
- $log_file = '/var/log/refinery/sqoop-mediawiki-<<something like manual-2017-07_private>>.log'
- For the command itself:
- --job-name sqoop-mediawiki-monthly-<<YYYY-MM>>_private
- --snapshot <<YYYY-MM>>_private
- --timestamp <<YYYY(MM+1)>>01000000, i.e. the first day of the month after the snapshot month in YYYYMMDDHHMMSS format (so 20170801000000 if you're doing the 2017-07 snapshot)
- remove --labsdb
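Putting the overrides together, the invocation ends up looking roughly like the sketch below. The script path is deliberately left as a placeholder (take it from the cron/puppet definition), and only the arguments this page mentions are shown:

<<refinery sqoop script from the cron>> \
  --job-name sqoop-mediawiki-monthly-<<YYYY-MM>>_private \
  --snapshot <<YYYY-MM>>_private \
  --timestamp <<YYYY(MM+1)>>01000000 \
  <<other cron arguments unchanged, using the $db_host, $db_user, $db_password_file, $wiki_file and $log_file values above, and without --labsdb>>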
IMPORTANT NOTE: After this sqoop is done, you'll probably want to run the mediawiki reconstruction and denormalization job. To do this, you'll need to do three things:
- Put the _SUCCESS flag in all /wmf/data/raw/mediawiki/tables/<<table>>/snapshot=<<YYYY-MM>>_private directories (see the commands sketched after this list)
- Run the oozie job with the _private suffix as implemented in this change: https://gerrit.wikimedia.org/r/#/c/370322/
- IMPORTANT: Copy the latest project_namespace_map snapshot to the same folder + _private because the spark job requires this, despite the correct path being configured on the oozie job. This is probably a small bug that we can fix if we end up running more than a handful of private snapshots.
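For the HDFS steps above (the _SUCCESS flags and the project_namespace_map copy), something like the sketch below should do. It needs to be run as a user with write access (e.g. hdfs), the table list is only illustrative, and the project_namespace_map path is an assumption based on the same /wmf/data/raw/mediawiki layout used for the tables:

# Write the _SUCCESS flag into each sqooped table's _private snapshot directory (table list is illustrative)
for table in archive ipblocks logging page pagelinks redirect revision user user_groups; do
  sudo -u hdfs hdfs dfs -touchz "/wmf/data/raw/mediawiki/tables/${table}/snapshot=2017-07_private/_SUCCESS"
done
# Copy the latest project_namespace_map snapshot to a _private sibling folder (assumed path)
sudo -u hdfs hdfs dfs -cp /wmf/data/raw/mediawiki/project_namespace_map/snapshot=2017-07 /wmf/data/raw/mediawiki/project_namespace_map/snapshot=2017-07_private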