
Analytics/Archive/2015 data warehouse experiments


Goal

This document describes our current use cases for the data warehouse. We need the data warehouse because the existing database schema is not well suited to the intensive queries we run to calculate editing data, especially for large wikis. We found that we simply cannot calculate the data we need in a timely fashion.

Although we have some established initial use cases, we are signing up for a continuous effort to optimize and evolve with new ones. New metrics and new ways of using the data warehouse will appear in the near future, and we hope the schema we are building lets us adapt more easily than the vanilla MediaWiki schema does.

Changesets

We made a rough draft of the schema so that we have something concrete to talk about.

Also, see how to manage migrations in the warehouse via alembic: [1]
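What a tool like alembic automates is essentially a versioned sequence of schema changes, each applied exactly once, with the current version recorded in the database. A minimal hand-rolled sketch of that idea (table and column names are illustrative, not the real warehouse schema; alembic itself tracks versions in its own `alembic_version` table):

```python
import sqlite3

# Ordered schema migrations, applied once each. Names are illustrative only.
MIGRATIONS = [
    ("0001_create_user", "CREATE TABLE user (user_id INTEGER, wiki TEXT)"),
    ("0002_add_registration", "ALTER TABLE user ADD COLUMN user_registration TEXT"),
]

def migrate(conn):
    # Record applied versions so re-running the migration set is a no-op.
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version TEXT)")
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    for version, ddl in MIGRATIONS:
        if version not in applied:
            conn.execute(ddl)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # second run skips everything already applied
columns = [row[1] for row in conn.execute("PRAGMA table_info(user)")]
print(columns)  # ['user_id', 'wiki', 'user_registration']
```

Alembic adds downgrade paths, autogeneration from model diffs, and branching on top of this basic bookkeeping.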

Use Cases

Our primary use cases deal with retrieving editing and registration data for all wikis on a daily basis. Here are three sample use cases; there are many more, but they all deal with roughly the same data and do a slightly different set of computations.

Newly Registered Users

Thorough description here

Sample SQL

Newly registered users on enwiki for September 1st, 2014:

 select count(*)
   from user
  where user_registration between '2014-09-01 00:00:00' and '2014-09-01 23:59:59'
    and wiki='enwiki'
    and registration_type='create';
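As a quick sanity check, the same query can be run against a toy replica of the `user` table (sqlite3 stands in for the MariaDB warehouse here; the reduced schema and the rows are illustrative, not real warehouse data):

```python
import sqlite3

# Toy `user` table, reduced to the columns the query touches.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user (
    user_id INTEGER, wiki TEXT, user_registration TEXT, registration_type TEXT)""")
conn.executemany("INSERT INTO user VALUES (?, ?, ?, ?)", [
    (1, 'enwiki', '2014-09-01 08:00:00', 'create'),      # counted
    (2, 'enwiki', '2014-09-01 12:30:00', 'autocreate'),  # wrong registration type
    (3, 'enwiki', '2014-08-31 23:59:59', 'create'),      # day before
    (4, 'dewiki', '2014-09-01 09:00:00', 'create'),      # wrong wiki
])

(new_users,) = conn.execute("""
    select count(*)
      from user
     where user_registration between '2014-09-01 00:00:00' and '2014-09-01 23:59:59'
       and wiki = 'enwiki'
       and registration_type = 'create'
""").fetchone()
print(new_users)  # 1
```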

Rolling recurring old active editors

Thorough description here

Note that this metric looks back 60 days when it comes to registration data and 30 days for editing data.

Sample SQL

Rolling recurring old active editors for enwiki as of September 1st, 2014:

 select edit.user_id
   from edit
        inner join user
                on edit.user_id = user.user_id
               and edit.wiki = user.wiki
               and edit.time between user.valid_from and user.valid_to
  where edit.time between '2014-07-01' and '2014-09-01'
    and user.user_registration < '2014-07-01'
    and user.in_bot_user_group = 0
    and edit.wiki = 'enwiki'
  group by edit.user_id
 having sum(if(edit.time <= '2014-08-01', 1, 0)) >= 5
    and sum(if(edit.time >  '2014-08-01', 1, 0)) >= 5
  order by null; -- skips the filesort MySQL otherwise adds after group by

analytics-store warehouse: 2 min 45 sec
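The two having clauses split the 60-day edit window into two 30-day halves and require at least five edits in each. The same query can be exercised on toy data (sqlite3 stands in for the MariaDB warehouse, so MySQL's if() is rewritten as case when; all rows below are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (user_id INTEGER, wiki TEXT, user_registration TEXT,
                   in_bot_user_group INTEGER, valid_from TEXT, valid_to TEXT);
CREATE TABLE edit (user_id INTEGER, wiki TEXT, time TEXT);
""")
conn.executemany("INSERT INTO user VALUES (?,?,?,?,?,?)", [
    (1, 'enwiki', '2014-01-15', 0, '2014-01-15', '9999-12-31'),  # qualifies
    (2, 'enwiki', '2014-01-20', 0, '2014-01-20', '9999-12-31'),  # edits in one half only
])
edits  = [(1, 'enwiki', '2014-07-%02d' % d) for d in range(2, 7)]    # 5 edits before Aug 1
edits += [(1, 'enwiki', '2014-08-%02d' % d) for d in range(2, 7)]    # 5 edits after Aug 1
edits += [(2, 'enwiki', '2014-07-%02d' % d) for d in range(10, 16)]  # first half only
conn.executemany("INSERT INTO edit VALUES (?,?,?)", edits)

rows = conn.execute("""
    select edit.user_id
      from edit
      join user on edit.user_id = user.user_id
               and edit.wiki = user.wiki
               and edit.time between user.valid_from and user.valid_to
     where edit.time between '2014-07-01' and '2014-09-01'
       and user.user_registration < '2014-07-01'
       and user.in_bot_user_group = 0
       and edit.wiki = 'enwiki'
     group by edit.user_id
    having sum(case when edit.time <= '2014-08-01' then 1 else 0 end) >= 5
       and sum(case when edit.time >  '2014-08-01' then 1 else 0 end) >= 5
""").fetchall()
print(rows)  # [(1,)] -- only user 1 clears both 30-day thresholds
```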

Rolling new active editors

Thorough description here

Sample SQL

Rolling new active editors for enwiki as of September 1st, 2014. Note that this metric looks back 30 days for both registration data and edits.

 select edit.user_id
   from edit
        inner join user
                on edit.user_id = user.user_id
               and edit.wiki = user.wiki
               and edit.time between user.valid_from and user.valid_to
  where edit.time between '2014-08-01 00:00:00' and '2014-09-01 23:59:59'
    and user.user_registration between '2014-08-01 00:00:00' and '2014-09-01 23:59:59'
    and user.in_bot_user_group = 0
    and edit.wiki = 'enwiki'
  group by edit.user_id
 having count(*) > 5
  order by null; -- skips the filesort MySQL otherwise adds after group by

analytics-store warehouse: 4 sec
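The same kind of toy-data check works for this single-window variant, keeping the count(*) > 5 threshold exactly as written above (sqlite3 again stands in for the warehouse; the rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (user_id INTEGER, wiki TEXT, user_registration TEXT,
                   in_bot_user_group INTEGER, valid_from TEXT, valid_to TEXT);
CREATE TABLE edit (user_id INTEGER, wiki TEXT, time TEXT);
""")
conn.executemany("INSERT INTO user VALUES (?,?,?,?,?,?)", [
    (1, 'enwiki', '2014-08-10 09:00:00', 0, '2014-08-10 09:00:00', '9999-12-31'),
    (2, 'enwiki', '2014-06-01 09:00:00', 0, '2014-06-01 09:00:00', '9999-12-31'),  # too old
])
conn.executemany("INSERT INTO edit VALUES (?,?,?)",
    [(1, 'enwiki', '2014-08-%02d 10:00:00' % d) for d in range(11, 17)]   # 6 edits
  + [(2, 'enwiki', '2014-08-%02d 10:00:00' % d) for d in range(11, 17)])

rows = conn.execute("""
    select edit.user_id
      from edit
      join user on edit.user_id = user.user_id
               and edit.wiki = user.wiki
               and edit.time between user.valid_from and user.valid_to
     where edit.time between '2014-08-01 00:00:00' and '2014-09-01 23:59:59'
       and user.user_registration between '2014-08-01 00:00:00' and '2014-09-01 23:59:59'
       and user.in_bot_user_group = 0
       and edit.wiki = 'enwiki'
     group by edit.user_id
    having count(*) > 5
""").fetchall()
print(rows)  # [(1,)] -- user 2 registered before the 30-day window
```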

Daily Pages Created

Note that this metric also counts deleted pages and pages in any namespace. See the detailed description here

Sample SQL

Pages created on enwiki on October 27th, 2014:

select count(page_id)
  from page
 where wiki='enwiki'
   and page_is_redirect = 0
   and valid_from between '2014-10-27 00:00:00' and '2014-10-27 23:59:59';
   -- optional: and (@includeArchived = 1 or archived = 0) -- where @includeArchived would be passed in a parameterized query
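The optional @includeArchived switch in the comment is easy to exercise as a parameterized query. A sqlite3 sketch with illustrative rows (the pages_created helper is ours, not part of the warehouse):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE page (
    page_id INTEGER, wiki TEXT, page_is_redirect INTEGER,
    archived INTEGER, valid_from TEXT)""")
conn.executemany("INSERT INTO page VALUES (?,?,?,?,?)", [
    (10, 'enwiki', 0, 0, '2014-10-27 03:00:00'),  # live page, always counted
    (11, 'enwiki', 0, 1, '2014-10-27 04:00:00'),  # deleted (archived) page
    (12, 'enwiki', 1, 0, '2014-10-27 05:00:00'),  # redirect, never counted
    (13, 'enwiki', 0, 0, '2014-10-26 23:00:00'),  # created the day before
])

def pages_created(day, include_archived):
    # When the flag is 1 the archived filter is short-circuited away,
    # so deleted pages are included in the count.
    (n,) = conn.execute("""
        select count(page_id)
          from page
         where wiki = 'enwiki'
           and page_is_redirect = 0
           and valid_from between ? || ' 00:00:00' and ? || ' 23:59:59'
           and (? = 1 or archived = 0)
    """, (day, day, 1 if include_archived else 0)).fetchone()
    return n

print(pages_created('2014-10-27', include_archived=True))   # 2
print(pages_created('2014-10-27', include_archived=False))  # 1
```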

Initial Data Verifications

Data verifications have been moved to their own pages, for readability:

Meetings

6/1/2015

Sean has used the alembic changes to modify the schema on the warehouse and has started the backfilling again, this time from 2010, for enwiki and dewiki. He estimates this will take 24 hours to complete.

After backfilling is done, the Analytics team will vet the data again and verify that the bugs we found earlier are fixed.

If the vetting is positive, Sean will set up the warehouse on repurposed hardware and we will merge the alembic changes. There is no need to request hardware, as hardware is already available.

11/24/2014

Next Steps:

  • Analytics vets the data; indexes are created
  • Backfill data to 2007
  • Decide how the warehouse will be used from labs. Can we run queries on prod and just "move" the results to labs (so there is no need to optimize data gathering on labs)?

12/2/2014

  • While doing the data vetting we found things that need correction. We agreed to explore alembic as a tool for managing schema changes and field migrations. If it works out, we will add the alembic changes to the changeset in which we currently have the warehouse schema.
  • Once the changes are done, the data needs to be vetted again. Can we have some automated way of doing this vetting? Unit tests? (nuria: sounds more like monitoring than unit tests)
  • Sean brought up that in the near term there is no way we could replicate the whole warehouse data set to labs, given labs' space constraints. (Side note: replicating some of the Event Logging data to labs might pave the way a bit for this process.) He suggests that 1) we compute metrics in prod and move those to labs, or 2) we move wikimetrics to prod.

12/9/2014

  • Sean to reload the data warehouse data
  • Nuria to vet the data once it is loaded again
  • Nuria and Dan to investigate testing/vetting the data in an automated fashion
  • Nuria and Dan to research an automated way of loading data (Kettle, the Pentaho stack)

02/03/2015

  • Dan and Marcel will work on the ETL scripts by loading a small time slice from enwiki and dewiki into staging. When this is complete, they will merge the verified scripts into the data-warehouse repository, and Sean will use them to run over all the data.
  • Dan will create a script that adds indices to the Edit schema's event logging table in the log database, to facilitate analysis for the VE team. Once merged in the data-warehouse repository, Sean will apply it to the table.
  • Nuria is working on removing the auto-increment id field from Event Logging (work in progress). An interesting FYI: different variants of UUIDs exist, with different characteristics that may be useful in the future: http://en.wikipedia.org/wiki/Universally_unique_identifier#Variants_and_versions