Analytics/Data access
In addition to a variety of [[meta:Research:Data|publicly-available data sources]], Wikimedia has a parallel set of private data sources. The main reason is to allow a carefully vetted set of users to perform research and analysis on confidential user data (such as the IP addresses of readers and editors), which is stored according to our [[foundation:Privacy_policy|privacy policy]] and [[metawiki:Data_retention_guidelines|data retention guidelines]]. This private infrastructure also provides duplicate copies of publicly-available data for ease of use.


== Access ==
This private data lives in the same server cluster that runs Wikimedia's production websites. This means you will need [[production shell access]] to get it (see also [https://docs.google.com/document/d/1BwB92e-wNc-y6c5DYfBj7ZxdRFmYlKa-ijzp4t-2f0c/edit these notes] on configuring SSH specifically for the purpose of working with the stats servers).
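As a rough sketch only (the bastion host below is a placeholder; use the bastion and shell username from your own access setup, as described in the notes linked above), connecting to one of the stats servers typically looks like this:<syntaxhighlight lang="bash">
# Production hosts are reachable only through a Wikimedia bastion, so jump through one.
# Replace <shell-name> and <bastion-host> with your shell username and your assigned bastion.
ssh -J <shell-name>@<bastion-host>.wikimedia.org <shell-name>@stat1007.eqiad.wmnet
</syntaxhighlight>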


However, since this access gets you closer to both those production websites and this confidential data, it is not freely given out. First, you have to demonstrate a need for these resources. Second, you need to have a non-disclosure agreement with the Wikimedia Foundation. If you're a Foundation employee, this was included as part of your employment agreement. Otherwise, if you're a volunteer, you'll need to find an employee who will sponsor you through the [[Volunteer NDA|volunteer NDA process]]. If you're a researcher, it's possible to be sponsored through [[mw:Wikimedia_Research/Formal_collaborations|a formal collaboration with the Wikimedia Foundation's Research team]].


=== User responsibilities ===
 
If you get this access, you '''must''' remember that this access is extremely sensitive. '''You have a duty to protect the privacy of our users'''. As Uncle Ben says, "with great power comes great responsibility." Always follow the rules outlined in the [[phab:L3|Acknowledgement of Server Access Responsibilities]], which you have signed if you have access to this data.
In addition, keep in mind the following important principles:


*'''Be paranoid about personally identifiable information''' (PII). Familiarize yourself with the data you are working on and determine whether it contains any PII. It's better to double- and triple-check than to assume anything; if you have any doubt, ask the Analytics team (via IRC, email, or Phabricator). Please see the [[metawiki:Data_retention_guidelines|data retention guidelines]].
*'''Don't copy sensitive data''' (for example, data accessible only to users in the analytics-privatedata-users group) from its original location to anywhere else (in HDFS or on any other host) unless strictly necessary, and only if you know what you are doing. If you are in doubt, reach out to the Analytics team first.
*'''Restrict access'''. If you do need to copy sensitive data somewhere, make sure that you are the only one able to access it. For example, if you copy webrequest data from its location on HDFS to your /user/$your-username directory, set the permissions so that other users with access to HDFS cannot read it (see the sketch after this list). This is essential to avoid accidental leaks of PII/sensitive data or retention beyond our guidelines (https://meta.wikimedia.org/wiki/Data_retention_guidelines).
*'''Clean up copies of data'''. Make sure that any data you copied is deleted as soon as your work is done.
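As a minimal sketch of the "restrict access" point above (the directory name is just an example), you can lock down a private copy in your HDFS home directory like this:<syntaxhighlight lang="bash">
# Example only: remove group and other read/write/execute permissions from a private copy.
hdfs dfs -chmod -R go-rwx /user/$USER/webrequest_sample
# Verify the resulting permissions on the directory itself.
hdfs dfs -ls -d /user/$USER/webrequest_sample
</syntaxhighlight>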


If you ever have any questions or doubts, err on the side of caution and [[Analytics#Contact|contact the Analytics team]]. We are very friendly and happy to help!
 
=== Access Groups ===
To get access, submit a request on Phabricator and tag it with SRE-Access-Requests. You will need to specify which access group you need.

'analytics-*' groups have access to the Analytics Cluster (which mostly means Hadoop). 'statistics-*' groups get access to the stat* servers for local (non-distributed) compute resources. These groups overlap in which servers they grant ssh access to, but further posix permissions restrict access to things like MySQL, Hadoop, and files.

Here's a summary of the groups you might need (as of 2016-10-18):

;researchers
:Access to stat1006 and the credentials for the MariaDB slaves in <code>/etc/mysql/conf.d/research-client.cnf</code>.
;statistics-users
:Access to stat1006 for number crunching and accessing non-private log files hosted there.
;statistics-privatedata-users
:Access to stat100[56], public data like sampled webrequest logs (stored under <code>/a/log/webrequest/archive</code>), and the MariaDB slave credentials in <code>/etc/mysql/conf.d/statistics-private-client.cnf</code>.
;analytics-wmde
:WMDE-specific group (mostly used for crons). Access to stat1007 and to the MariaDB slaves in <code>/etc/mysql/conf.d/research-wmde-client.cnf</code>.
;analytics-users
:Access to stat1004 to connect to the Analytics Cluster (Hadoop/Hive), with '''no''' access to Hadoop private data.
;analytics-privatedata-users
:Access to stat1007 and stat1004 to connect to the Analytics Cluster (Hadoop/Hive) and to query private data hosted there, including webrequest logs. Access to the MariaDB slaves in <code>/etc/mysql/conf.d/analytics-research-client.cnf</code>. If you want Hadoop access, you probably want this group.

The list of users currently in each group is maintained in the access configuration of the operations/puppet repository.<ref>Other groups including statistics-admins, analytics-admins, eventlogging-admins, and statistics-web-users are for people doing system maintenance and administration, so you don't need them just to access data.</ref>
'''Data access granted'''
{| class="wikitable"
!Access Groups
!Hadoop access (no private data)
!Hadoop access (private data)
!research-client.cnf
!statistics-private-client.cnf
!research-wmde-client.cnf
!analytics-research-client.cnf
|-
|researchers
|
|
|X
|
|
|
|-
|statistics-users
|
|
|
|
|
|
|-
|statistics-privatedata-users
|
|
|
|X
|
|
|-
|analytics-users
|X
|
|
|
|
|
|-
|analytics-privatedata-users
|X
|X
|
|
|
|X
|-
|analytics-wmde
|
|
|
|
|X
|
|}


==Infrastructure==
=== Analytics clients ===
Once you have access to the production cluster, there are several servers which you can use to access the various private data sources and do general statistical computation. There are two types: the stat servers, designed for command-line use, and the [[SWAP]] servers, designed for [https://jupyter.org Jupyter] notebook use. Together, these are called the [[phab:source/operations-puppet/browse/production/modules/profile/manifests/analytics/cluster/client.pp|analytics clients]], since they act as clients accessing data from various other databases.
{| class="wikitable"
!Name
!Hostname
!Access available
|-
|[[stat1007]]
|stat1007.eqiad.wmnet
|Hadoop, MariaDB
|-
|[[stat1006]]
|stat1006.eqiad.wmnet
|MariaDB
|-
|[[stat1004]]
|stat1004.eqiad.wmnet
|Hadoop
|-
|notebook1003
|notebook1003.eqiad.wmnet
|Hadoop, MariaDB
|-
|notebook1004
|notebook1004.eqiad.wmnet
|Hadoop, MariaDB
|}
Please note: stat1005 was [[phab:T205846|replaced]] by stat1007 in December 2018.
 
You may need to access the internet from the stats machines (for example, to download a Python script using <code>pip</code>). By default, this will fail because the machines are tightly firewalled. You'll have to use the [[Http proxy|HTTP proxy]].
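For example, a <code>pip</code> install through the proxy might look like the following (the proxy URL is the standard webproxy for these hosts; double-check the [[Http proxy|HTTP proxy]] page if it doesn't work):<syntaxhighlight lang="bash">
# Route outbound HTTP/HTTPS traffic through the cluster web proxy for this shell session.
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
# Downloads now work as usual, for example:
pip install --user requests
</syntaxhighlight>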
 
===MariaDB replicas===
The SRE team (Data Persistence) maintains several dedicated MariaDB replicas. These contain copies of the production [[Mw:Manual:Database layout|MediaWiki databases]] (both actively-used mainstream projects and small internal-facing wikis, like various projects' Arbitration Committees) as well as the [[Analytics/EventLogging|EventLogging]] databases.
 
<br />
 
{{warning|1= analytics-store.eqiad.wmnet (dbstore1002) has been deprecated in [[phab:T210478]]}}
 
<br />
 
{{warning|1= The sX-analytics-slave.eqiad.wmnet CNAMEs have been deprecated in favor of using only analytics-store.eqiad.wmnet.}}
 
Until [[phab:T210478|T210478]], all the wiki databases were replicated to one single MySQL instance/host, named <code>analytics-store.eqiad.wmnet</code>, known for short as <code>analytics-store</code> (this hostname was actually an alias for <code>dbstore1002.eqiad.wmnet</code>). Several CNAMEs like <code>x1-analytics-slave.eqiad.wmnet</code> were maintained to ease access to dbstore1002. This environment has been deprecated in favor of a multi-instance/host solution: each section (sX or x1) now runs in a separate MySQL database instance. To ease access, the following CNAMEs have been created:
 
* s1-analytics-replica.eqiad.wmnet
* s2-analytics-replica.eqiad.wmnet
* ...
* s8-analytics-replica.eqiad.wmnet
* x1-analytics-replica.eqiad.wmnet
* staging-db-analytics.eqiad.wmnet
 
The main gotcha now is figuring out which port to use to connect to each section, since, as noted above, there are multiple MySQL instances running. This is the scheme to use (see the worked example after the list):
 
* 331 + the digit of the section in case of sX. Example: s5 will be accessible at s5-analytics-replica.eqiad.wmnet:3315
* 3320 for x1. Example: x1-analytics-replica.eqiad.wmnet:3320
* 3350 for staging
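For example, to open an interactive session against a wiki hosted on s5 (assuming here that dewiki lives in section s5; check the dblists or the <code>analytics-mysql</code> wrapper described below if unsure):<syntaxhighlight lang="bash">
# s5 -> port 3315 according to the scheme above.
mysql -h s5-analytics-replica.eqiad.wmnet -P 3315 -A dewiki
# x1 (extension tables for all wikis) always uses port 3320.
mysql -h x1-analytics-replica.eqiad.wmnet -P 3320 -A dewiki
</syntaxhighlight>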
 
We have created DNS [[:en:SRV_record|SRV]] records to make it easier to use the dbstore shards from scripts (see the Python functions below for an example):
 
* _s1-analytics._tcp.eqiad.wmnet
* ...
* _s8-analytics._tcp.eqiad.wmnet
* _x1-analytics._tcp.eqiad.wmnet
* _staging-analytics._tcp.eqiad.wmnet
 
These records look "weird" if you are not familiar with them, but they allow us to get a hostname/port combination from one DNS name! <u>Please note: the mysql client is sadly not capable of reading those, so you cannot use them directly with it.</u> 
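As a small workaround sketch from the shell, you can resolve the SRV record yourself with <code>dig</code> and pass the resulting host and port to <code>mysql</code> by hand (the <code>analytics-mysql</code> wrapper described below does essentially this for you):<syntaxhighlight lang="bash">
# An SRV answer has the form "<priority> <weight> <port> <target>".
dig +short SRV _s7-analytics._tcp.eqiad.wmnet
# Parse the port and target out of the answer and hand them to the mysql client.
read -r _ _ port host < <(dig +short SRV _s7-analytics._tcp.eqiad.wmnet)
mysql -h "${host%.}" -P "$port" -A hewiki
</syntaxhighlight>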
 
You can access these analytics replicas from either stat1007 or stat1006. To use the <code>mysql</code> client interactively, type: <code>mysql -h sX-analytics-replica.eqiad.wmnet -P 331X -A</code>. The <code>-A</code> flag skips the automatic loading of table and column names for autocompletion, which makes startup noticeably faster. You'll then be dropped into the MySQL command line.<ref>The stat machines automatically authenticate to the MariaDB replicas using passwords stored in preloaded config (<code>.cnf</code>) files. On stat1006, the file is <code>/etc/mysql/conf.d/research-client.cnf</code>, while on stat1007 it's <code>/etc/mysql/conf.d/analytics-research-client.cnf</code>. These files are automatically referenced from <code>/etc/mysql/my.cnf</code>, which the command line <code>mysql</code> program reads by default. Other programs may not do this automatically, and require an explicit pointer to the underlying <code>.cnf</code> file.</ref>
 
If you'd rather generate a TSV file<ref>The <code>mysql</code> utility doesn't have the ability to generate files in other formats like CSV.</ref> and then retrieve it later, you can also do so from the command line. You can type:
 
<code>mysql -h s1-analytics-replica.eqiad.wmnet -P 3311 <nowiki>{{database name}}</nowiki> -e "<nowiki>{{your query;}}</nowiki>" > <nowiki>{{filename}}</nowiki>.tsv</code>
 
It'll go off to generate the file on its own.<ref>The file extension you choose doesn't actually affect the command, but since <code>-e</code> generates a TSV file, you should use the corresponding file extension so other programs load the resulting file correctly.</ref>
 
As well as connecting directly, it's also possible to connect automatically from your programming language of choice, be it R or Python. For Python, we have the [http://mysql-python.sourceforge.net/MySQLdb.html MySQLdb] module installed on stat1006 and stat1007. For R, we have [http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf RMySQL].
 
The following Python snippet is an example of how to find the correct MySQL hostname and port combination for a given wiki database name:<syntaxhighlight lang="python3">
# Needs dnspython (for dns.resolver) to resolve the SRV records.
import glob
import dns.resolver


def get_mediawiki_section_dbname_mapping(mw_config_path, use_x1):
    """Map each wiki db name to its section (s1..s8, or 'all' for x1) using the dblist files."""
    db_mapping = {}
    if use_x1:
        dblist_section_paths = [mw_config_path.rstrip('/') + '/dblists/all.dblist']
    else:
        dblist_section_paths = glob.glob(mw_config_path.rstrip('/') + '/dblists/s[0-9]*.dblist')
    for dblist_section_path in dblist_section_paths:
        # The section name is the file name without the '.dblist' extension, e.g. 's2'.
        section = dblist_section_path.split('/')[-1][:-len('.dblist')]
        with open(dblist_section_path, 'r') as f:
            for db in f.readlines():
                db_mapping[db.strip()] = section

    return db_mapping


def get_dbstore_host_port(db_mapping, use_x1, dbname):
    """Resolve the analytics replica host and port for a wiki via its DNS SRV record."""
    if dbname == 'staging':
        shard = 'staging'
    elif use_x1:
        shard = 'x1'
    else:
        try:
            shard = db_mapping[dbname]
        except KeyError:
            raise RuntimeError("The database {} is not listed among the dblist files of the supported sections."
                               .format(dbname))
    answers = dns.resolver.query('_' + shard + '-analytics._tcp.eqiad.wmnet', 'SRV')
    host, port = str(answers[0].target), answers[0].port
    return (host, port)
</syntaxhighlight>Example of a Python script to retrieve data for a specific wiki from a Notebook (credits to Neil for the work!):<syntaxhighlight lang="python">
import dns.resolver
import mysql.connector as mysql
 
 
query = "select count(*) from recentchanges"
ans = dns.resolver.query('_s7-analytics._tcp.eqiad.wmnet', 'SRV')[0]
conn = mysql.connect(
    host=str(ans.target),
    port=ans.port,
    database="hewiki",
    option_files='/etc/mysql/conf.d/research-client.cnf',
    charset='binary',
    autocommit=False
)
cursor = conn.cursor()
cursor.execute(query)
cursor.fetchall()
</syntaxhighlight>
This is of course just a quick solution; it could be done in any number of ways. If you have comments, suggestions, or improvements, please feel free to improve it :)
 
The first function, get_mediawiki_section_dbname_mapping, needs the path of a local checkout of the [[gerrit:#/admin/projects/operations/mediawiki-config|mediawiki-config repository]]. '''If you run your script on a stat100* or notebook100* host, you can find the repository checked out under /srv/mediawiki-config'''. This is one source of the mapping between MediaWiki db sections (s1..s8, x1) and wiki names. For example, itwiki is listed in two files:
 
* dblists/s2.dblist
* dblists/all.dblist
 
The above files mean two things:

1) itwiki's database will be available on s2-analytics-replica.eqiad.wmnet

2) itwiki's database will also be available on x1-analytics-replica.eqiad.wmnet (if you need the tables related to extensions, since x1 contains all the wikis).

This is a prerequisite for using get_dbstore_host_port, which starts from the mapping above and uses the DNS SRV records as explained earlier.
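As a quick shell-level sanity check of this mapping (using the checkout path mentioned above), you can grep the dblist files directly; the Python functions above do the same lookup programmatically:<syntaxhighlight lang="bash">
# Which section dblist mentions itwiki? (run on a stat100* or notebook100* host)
grep -l -x 'itwiki' /srv/mediawiki-config/dblists/s[0-9]*.dblist
# all.dblist is the list used for x1, which contains every wiki.
grep -l -x 'itwiki' /srv/mediawiki-config/dblists/all.dblist
</syntaxhighlight>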
 
====MySQL wrapper====
On all the Analytics nodes (stat100[4,6,7] and notebook100[3,4]) there is a tool called <code>analytics-mysql</code> that makes it easier to use the new databases:<syntaxhighlight lang="bash">
elukey@stat1006:~$ analytics-mysql itwiki --print-target
dbstore1004.eqiad.wmnet:3312
 
elukey@stat1006:~$ analytics-mysql itwiki --print-target --use-x1
dbstore1005.eqiad.wmnet:3320
 
elukey@stat1006:~$ analytics-mysql itwiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 465860
Server version: 10.1.37-MariaDB MariaDB Server
 
Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
mysql:research@dbstore1004.eqiad.wmnet. [itwiki]> Bye
 
elukey@stat1006:~$ analytics-mysql itwiki -e 'show tables limit 3'
ERROR 1064 (42000) at line 1: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'limit 3' at line 1
elukey@stat1006:~$ analytics-mysql itwiki -e 'show tables'
+--------------------------+
| Tables_in_itwiki        |
+--------------------------+
[..]
</syntaxhighlight>
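Since <code>analytics-mysql</code> accepts <code>-e</code> just like the plain <code>mysql</code> client (as shown above), you can combine it with a shell redirect to save the result of a query as a TSV file, mirroring the pattern used for the replicas earlier:<syntaxhighlight lang="bash">
# Run a query against itwiki through the wrapper and save the tab-separated output.
analytics-mysql itwiki -e 'SELECT COUNT(*) FROM recentchanges;' > itwiki_recentchanges_count.tsv
</syntaxhighlight>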
 
===Hadoop===
Finally, we have Hadoop - our storage system for large amounts of data. The easiest way to query the Hadoop data is through [[Analytics/Cluster/Hive|Hive]], which can be accessed from [[stat1007]] and stat1004 - simply type <code>beeline</code> in the terminal, switch to the <code>wmf</code> database, and input your query.
 
At the moment there are no recommended Hive access packages for R or Python. In the meantime, the best way to get data out of the system is to treat it as you would the MariaDB replicas: in the terminal, type:
 
<code>beeline -f my_query.hql > file_name.tsv</code>
 
For information about writing HQL to query this data, see the [https://cwiki.apache.org/confluence/display/Hive/LanguageManual Hive language manual].
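As a worked example of the <code>beeline -f</code> pattern above, the sketch below queries the aggregated pageview data described later on this page; the table and partition fields (<code>wmf.pageview_hourly</code> partitioned by <code>year</code>/<code>month</code>/<code>day</code>/<code>hour</code>) are the usual layout, but run <code>DESCRIBE wmf.pageview_hourly;</code> first if you want to be sure:<syntaxhighlight lang="bash">
# Write a small HQL query to a file...
cat > my_query.hql <<'EOF'
SELECT project, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2019 AND month = 3 AND day = 1
GROUP BY project
ORDER BY views DESC
LIMIT 10;
EOF
# ...and run it, saving the result as a TSV file.
beeline -f my_query.hql > pageviews_2019_03_01.tsv
</syntaxhighlight>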
 
==Data sources==
Data sets and data streams can be found in [https://wikitech.wikimedia.org/wiki/Category:Data_stream Category:Data_stream]
 
=== Data Dashboards: Superset and Turnilo (previously called Pivot) ===
Superset: http://superset.wikimedia.org
Pivot: http://pivot.wikimedia.org
 
You need a Wikitech login that is in the "wmf" or "nda" LDAP group. If you are not in either group, please create a task like https://phabricator.wikimedia.org/T160662
 
Before requesting access, please make sure you:
* have a functioning Wikitech login. Get one: https://toolsadmin.wikimedia.org/register/
* are an employee or contractor with wmf OR have signed an NDA
Depending on the above, you can request to be added to either the wmf group or the nda group. Please explain on the task why you need access, and ping the Analytics team if you don't hear back soon from the Opsen on duty.
 
===MediaWiki application data===
You can do a lot of work with the data stored by MediaWiki in the normal course of running itself. This includes data about:
 
*Users' edit counts (consult the <code>user</code> table)
*Edits to a particular page (consult the <code>revision</code> table, joined with the <code>page</code> table if necessary)
*Account creations (consult the <code>logging</code> table)
 
==== Databases ====
You can access this data using the replica MariaDB databases. These are accessible from the stat100* machines, as [[Analytics/Data_access#Stats_machines|detailed below]].
 
For an overview of how the data is laid out in those databases, consult the [[mediawikiwiki:Manual:Database_layout|database layout manual]].
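For instance, here is a sketch of the "edits to a particular page" case from the list above, run against a replica from the command line; the <code>revision</code>/<code>page</code> join is standard MediaWiki schema, while the wiki (enwiki, which lives on s1) and the page title are just examples:<syntaxhighlight lang="bash">
# Five most recent revisions of an example article on enwiki, via the s1 replica (port 3311).
mysql -h s1-analytics-replica.eqiad.wmnet -P 3311 -A enwiki -e "
  SELECT rev_id, rev_timestamp
  FROM revision
  JOIN page ON rev_page = page_id
  WHERE page_namespace = 0
    AND page_title = 'Wikipedia'
  ORDER BY rev_timestamp DESC
  LIMIT 5;" > recent_revisions.tsv
</syntaxhighlight>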
 
There are a few things that aren't available from the database replicas. The main example of this is the actual content of pages and revisions. Instead, you can access them [[#API|through the API]] or in the XML dumps, which are both described below.
 
==== API ====
A subset of this application data, which doesn't present privacy concerns, is also publicly accessible through the API (except for ''private'' wikis, which you shouldn't really need to perform research on anyway!). A good way to understand it, and to test queries, is [[Special:ApiSandbox]], which provides a way of easily constructing API calls and testing them. The output includes "Request URL" - a direct URL for making that query in the future, that should work on any and all Wikimedia production wikis.
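For example, a sketch of fetching the wikitext of the latest revision of a page through the public API from the command line (the exact parameters are easiest to build and verify in [[Special:ApiSandbox]]):<syntaxhighlight lang="bash">
# Fetch the latest revision content of an example page from the English Wikipedia API.
curl -s 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Wikipedia&rvprop=content&rvslots=main&format=json' | head -c 500
</syntaxhighlight>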
 
If you're interested in common API tasks, and don't feel like reinventing the wheel, there are a number of Python-based API wrappers and MediaWiki utilities. Our very own Aaron Halfaker maintains [https://pypi.python.org/pypi/mediawiki-utilities#downloads MediaWiki Utilities], which includes a module dedicated to API interactions. There's no equivalent for R yet.
 
====Database dumps====
Every month, [http://dumps.wikimedia.org/ XML snapshots] of the databases are generated. Since they're generated monthly, they're always slightly outdated, but make up for it by being incredibly cohesive (and [http://dumps.wikimedia.org/enwiki/20161001/ incredibly large]). They contain both the text of each revision of each page, and snapshots of the database tables. As such, they're a really good way of getting large amounts of diffs or information on revisions without running into the query limits on the API.
 
Aaron's [https://pypi.python.org/pypi/mediawiki-utilities#downloads MediaWiki-utilities] package contains a set of functions for handling and parsing through the XML dumps, which should drastically simplify dealing with them. They're also stored internally, as well as through dumps.wikimedia.org, and can be found in <code>/mnt/data/xmldatadumps/public</code> on stat1007.
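As a small illustration of working with the internal copies (the wiki and dump date below are just examples; list the directory to see what is actually available):<syntaxhighlight lang="bash">
# See which dump runs are available for a wiki on stat1007.
ls /mnt/data/xmldatadumps/public/itwiki/
# Peek at the start of a pages-articles dump without decompressing the whole file.
bzcat /mnt/data/xmldatadumps/public/itwiki/20190301/itwiki-20190301-pages-articles.xml.bz2 | head -n 40
</syntaxhighlight>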
 
===EventLogging data===
One analytics-specific source of data is [[Analytics/EventLogging|EventLogging]]. This allows us to track things we're interested in as researchers that MediaWiki doesn't normally log. Examples include:
 
#A log of changes to user preferences;
#A/B testing data;
#Clicktracking data.
 
These datasets are stored in the <code>log</code> database on '''analytics-slave.eqiad.wmnet''' (a CNAME for db1108). The schemas that set out each table, and what they contain, can be found on Meta in the [https://meta.wikimedia.org/w/index.php?title=Special%3AAllPages&from=&to=&namespace=470 Schema namespace].
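A quick sketch of looking around the EventLogging store from one of the stats machines (the table name in the second command is a placeholder; EventLogging tables are named after their schema and revision, e.g. <code>SchemaName_12345678</code>):<syntaxhighlight lang="bash">
# List the EventLogging tables available in the log database...
mysql -h analytics-slave.eqiad.wmnet -A log -e 'SHOW TABLES;' | head
# ...then inspect the columns of the one you are interested in (placeholder name).
mysql -h analytics-slave.eqiad.wmnet -A log -e 'DESCRIBE SchemaName_12345678;'
</syntaxhighlight>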
 
===Pageviews data===
An important piece of community-facing data is information on our pageviews; what articles are being read, and how much? This is currently stored in [[Analytics/Cluster/Hive#Access|our Hadoop cluster]], which contains [[Analytics/Data/Pageview hourly|aggregated pageview data]] as well as the mostly-raw [[Analytics/Data/Webrequest|database of web requests]]. See the detailed documentation [[Analytics/Data/Pageview_hourly|here]].
 
==== Turnilo ====
[[Analytics/Systems/Turnilo-Pivot#Access]]
 
===Geolocation data===
When you have IP addresses - be they from the RequestLogs, EventLogging or MediaWiki itself - you can do geolocation. This can be a very useful way of understanding user behaviour and evaluating how our ecosystem works. We currently use the MaxMind geolocation services, which are accessible on both stat1006 and stat1007: a full guide to geolocation and some examples of how to do it can be found [[Analytics/Geolocation|on the 'geolocation' page]].
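As a rough sketch only (the database path, file name, and the availability of the <code>mmdblookup</code> tool on these hosts are assumptions; see the [[Analytics/Geolocation|geolocation page]] for the authoritative instructions):<syntaxhighlight lang="bash">
# Look up the country code of an example IP address against a locally installed MaxMind database.
mmdblookup --file /usr/share/GeoIP/GeoIP2-Country.mmdb --ip 91.198.174.192 country iso_code
</syntaxhighlight>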


== Notes ==
<references />
