Analytics/Data access

This page documents the internal and external data sources that Analytics uses, the information stored within them, and how to get access.

==Data sources==

Data sets and data streams can be found in [https://wikitech.wikimedia.org/wiki/Category:Data_stream Category:Data_stream]


===MediaWiki application data===
You can do a lot of work with the data MediaWiki stores in the normal course of running itself: not dedicated analytics logging, just the records the wiki needs to do its job. This includes data about:


*Users' edit counts (consult the <code>user</code> table)
*Edits to a particular page (consult the <code>revision</code> table, joined with the <code>page</code> table if necessary)
*Account creations (consult the <code>logging</code> table)


==== Databases ====
You can access this data directly using the replica MariaDB databases on the [[#MariaDB slaves|MariaDB slaves]], discussed below. For an overview of how the data is laid out in those databases, consult the [[mediawikiwiki:Manual:Database_layout|database layout manual]].

There are a few things that aren't available from the database replicas. The main example is the actual content of pages and revisions. Instead, you can access that content [[#API|through the API]] or in the XML dumps, both described below.
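
For example, here is a hedged Python sketch of that kind of query, using the <code>MySQLdb</code> module and the replica credentials described under [[#MariaDB slaves|MariaDB slaves]] below; the table and column names come from the database layout manual, while the host and configuration file paths are simply copied from later sections of this page:

<syntaxhighlight lang="python">
# Sketch: count the revisions of one article on English Wikipedia.
# Connection details are described in the "MariaDB slaves" section below;
# on stat1002, use analytics-research-client.cnf instead.
import MySQLdb

conn = MySQLdb.connect(
    host="analytics-store.eqiad.wmnet",
    db="enwiki",
    read_default_file="/etc/mysql/conf.d/research-client.cnf",
)
cur = conn.cursor()
cur.execute("""
    SELECT COUNT(*)
    FROM revision
    JOIN page ON rev_page = page_id
    WHERE page_namespace = 0
      AND page_title = %s
""", ("Cat",))
print(cur.fetchone()[0])
conn.close()
</syntaxhighlight>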


==== API ====
A subset of this application data, the parts that don't present privacy concerns, is also publicly accessible through the API (except on ''private'' wikis, which you shouldn't really need to perform research on anyway!). A good way to understand the API, and to test queries, is [[Special:ApiSandbox]], which makes it easy to construct and test API calls. The output includes a "Request URL": a direct URL for making that query again in the future, which should work on any Wikimedia production wiki.


If you're interested in common API tasks, and don't feel like reinventing the wheel, there are a number of Python-based API wrappers and MediaWiki utilities. Our very own Aaron Halfaker maintains [https://pypi.python.org/pypi/mediawiki-utilities#downloads MediaWiki Utilities], which includes a module dedicated to API interactions. There's no equivalent for R yet.
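
To give a concrete (and hedged) picture of what a raw API call looks like, the Python 3 sketch below fetches basic metadata for one arbitrary article using only the standard library; it is the same kind of request whose "Request URL" the sandbox shows you:

<syntaxhighlight lang="python">
# Fetch basic info about an article via the public API (Python 3, standard library only).
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "action": "query",
    "prop": "info",
    "titles": "Cat",
    "format": "json",
})
url = "https://en.wikipedia.org/w/api.php?" + params

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

# prop=info returns one entry per page, keyed by page ID.
for page in data["query"]["pages"].values():
    print(page["title"], page["length"])
</syntaxhighlight>
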
====Database dumps====
Every month, [http://dumps.wikimedia.org/ XML snapshots] of the databases are generated. Because they're only produced monthly, they're always slightly outdated, but they make up for it by being incredibly cohesive (and [http://dumps.wikimedia.org/enwiki/20140304/ incredibly large]). They contain both the text of each revision of each page and snapshots of the database tables. As such, they're a really good way of getting large amounts of diffs or revision information without running into the query limits on the API.

Aaron's [https://pypi.python.org/pypi/mediawiki-utilities#downloads MediaWiki-utilities] package contains a set of functions for handling and parsing the XML dumps, which should drastically simplify dealing with them. The dumps are also stored internally, as well as on dumps.wikimedia.org, and can be found in <code>/mnt/data/xmldatadumps/public</code> on stat1002.
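
If you want to work through a dump directly without installing anything, the hedged Python 3 sketch below counts revisions per page using only the standard library; the file name is illustrative, so point it at whichever dump under <code>/mnt/data/xmldatadumps/public</code> you actually need:

<syntaxhighlight lang="python">
# Count revisions per page in a MediaWiki XML dump (Python 3, standard library only).
# The dump path below is illustrative; any pages-meta-history file will do.
import bz2
import xml.etree.ElementTree as ET

DUMP = "/mnt/data/xmldatadumps/public/simplewiki/20160801/simplewiki-20160801-pages-meta-history.xml.bz2"

with bz2.open(DUMP, "rb") as f:
    title = None
    revisions = 0
    for _, elem in ET.iterparse(f, events=("end",)):
        local = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace prefix
        if local == "title":
            title = elem.text
        elif local == "revision":
            revisions += 1
        elif local == "page":
            print("%s\t%d" % (title, revisions))
            revisions = 0
            elem.clear()  # keep memory use flat on big dumps
</syntaxhighlight>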


===EventLogging data===
One analytics-specific source of data is [[Analytics/EventLogging|EventLogging]]. This allows us to track things we're interested in as researchers that MediaWiki doesn't normally log. Examples include:


#A log of changes to user preferences;
#A/B testing data;
#Clicktracking data.


These datasets are stored in the <code>log</code> database on analytics-store, one of the MariaDB slaves. The schemas that set out each table, and what they contain, can be found on Meta in the [https://meta.wikimedia.org/w/index.php?title=Special%3AAllPages&from=&to=&namespace=470 Schema namespace].
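
As a hedged illustration, the sketch below connects to the <code>log</code> database from Python and summarizes recent events for one schema; the <code>NavigationTiming_15033442</code> table name is hypothetical, so use <code>SHOW TABLES</code> to find the real table (schema name plus schema revision) you care about:

<syntaxhighlight lang="python">
# Sketch: peek at one EventLogging table in the "log" database. Tables are named
# after their schema plus a revision number; the table name below is hypothetical.
import MySQLdb

conn = MySQLdb.connect(
    host="analytics-store.eqiad.wmnet",
    db="log",
    read_default_file="/etc/mysql/conf.d/research-client.cnf",
)
cur = conn.cursor()

# Find the real table(s) for the schema you're interested in.
cur.execute("SHOW TABLES LIKE 'NavigationTiming%'")
print(cur.fetchall())

# Count events per day for the last week (timestamps are yyyymmddhhmmss strings).
cur.execute("""
    SELECT LEFT(timestamp, 8) AS day, COUNT(*)
    FROM NavigationTiming_15033442
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
""")
for day, events in cur.fetchall():
    print(day, events)
conn.close()
</syntaxhighlight>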


===Pageviews data===
An important piece of community-facing data is information on our pageviews: what articles are being read, and how much? This is currently stored in [[#Hadoop|our Hadoop cluster]], which contains [[Analytics/Data/Pageview hourly|aggregated pageview data]] as well as the mostly-raw [[Analytics/Data/Webrequest|database of web requests]]; see those pages for detailed documentation.


===Geolocation data===
When you have IP addresses - be they from the webrequest logs, EventLogging, or MediaWiki itself - you can do geolocation. This can be a very useful way of understanding user behaviour and evaluating how our ecosystem works. We currently use the MaxMind geolocation services, which are accessible on both stat1003 and stat1002: a full guide to geolocation and some examples of how to do it can be found [[Analytics/Geolocation|on the 'geolocation' page]].
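
As a hedged sketch of what that looks like in code, the example below uses the <code>geoip2</code> Python package and a MaxMind city database; both the package and the database path are assumptions here, so check the geolocation page for what is actually installed and where:

<syntaxhighlight lang="python">
# Sketch: geolocate an IP address with a MaxMind GeoIP2 city database.
# Both the geoip2 package and the database path are assumptions; see the
# geolocation page for the authoritative setup on the stat machines.
import geoip2.database

reader = geoip2.database.Reader("/usr/share/GeoIP/GeoIP2-City.mmdb")  # assumed path
response = reader.city("91.198.174.192")  # an example IP address
print(response.country.iso_code, response.subdivisions.most_specific.name, response.city.name)
reader.close()
</syntaxhighlight>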


==Infrastructure==

=== Stats machines ===
The production Wikimedia cluster contains several dedicated statistics servers used to do general statistical computation and to access various internal data sources like the MariaDB replicas and the webrequest data in Hadoop.


Since these machines are in the production cluster, you'll need production shell access to use them.
{| class="wikitable"
!Name
!Hostname
!Access available
|-
|[[stat1002]]
|stat1002.eqiad.wmnet
|Hadoop, MariaDB
|-
|[[stat1003]]
|stat1003.eqiad.wmnet
|MariaDB
|-
|[[stat1004]]
|stat1004.eqiad.wmnet
|Hadoop
|}


===MariaDB slaves===
The Operations team maintains several dedicated MariaDB slaves. These contain replicas of the production [[mw:Manual:Database layout|MediaWiki databases]] (both actively-used mainstream projects and small internal-facing wikis, like those of various projects' Arbitration Committees) as well as the [[Analytics/EventLogging|EventLogging]] databases.


The main one is <code>analytics-store.eqiad.wmnet</code>, known for short as <code>analytics-store</code> (this hostname is actually an alias for <code>dbstore1002.eqiad.wmnet</code>, but it's easier to remember). It contains replicas of all the MediaWiki and EventLogging databases. It also contains a <code>flowdb</code> database with a copy of the data stored by the [[mediawikiwiki:Extension:Flow|Flow discussion system]], a <code>staging</code> database in which researchers can create their own tables, and [[Analytics/Datasets|a <code>datasets</code> database]] containing useful reference tables. Almost all production wikis are replicated to <code>analytics-store</code>; you can get a complete list via [https://meta.wikimedia.org/w/api.php?action=sitematrix the SiteMatrix API]. (As of writing, the 'vewikimedia' database is also here but is no longer configured in MediaWiki, and 'labswiki' is not available here.)


In addition, <code>x1-analytics-slave.eqiad.wmnet</code>, known as <code>x1-analytics-slave</code> (an alias for <code>db1031.eqiad.wmnet</code>), contains replicas of the data stored by [[mw:Extension:Echo|Echo]], the notifications system, as well as the Flow data.


You can access these slaves from either stat1003 or stat1002. Once you're on those machines, access is via a password for the "research" user, which is stored in a configuration file: on stat1003 the file is <code>/etc/mysql/conf.d/research-client.cnf</code>,<ref>Stat1003 also contains a password in <code>/srv/passwords/research/</code>, but this is outdated.</ref> while on stat1002 it's <code>/etc/mysql/conf.d/analytics-research-client.cnf</code>. The command-line <code>mysql</code> program looks for its configuration in <code>~/.my.cnf</code> by default, so if you create a [[:en:Symbolic_link#POSIX_and_Unix-like_operating_systems|symbolic link]] at that location pointing to the correct configuration file, you can save yourself from typing <code>--defaults-file=/etc/mysql/conf.d/research-client.cnf</code> every time.


To use the <code>mysql</code> client interactively, type: <code>mysql -h analytics-store.eqiad.wmnet -A</code> (including the <code>--defaults-file</code> argument if you haven't created a symbolic link). The <code>-A</code> disables tab autocompletion, which can be bothersome when pasting multi-line queries into the prompt. You'll then be dropped into the MySQL command line.


If you'd rather generate a TSV file<ref>The <code>mysql</code> utility doesn't have the ability to generate files in other formats like CSV.</ref> and then retrieve it later, you can also do that from the command line by typing:


<code>mysql -h analytics-store.eqiad.wmnet <nowiki>{{database name}}</nowiki> -e "<nowiki>{{your query;}}</nowiki>" > <nowiki>{{filename}}</nowiki>.tsv</code>


It'll go off to generate the file on its own.<ref>The file extension you choose doesn't actually affect the command, but since <code>-e</code> generates a TSV file, you should use the corresponding file extension so other programs load the resulting file correctly.</ref>


In case you need to check the database server's load, view [http://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&h=dbstore1002.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 this] [[Ganglia]] report.
As well as connecting directly, you can also connect programmatically from your language of choice, be it R or Python. For Python, we have the [http://mysql-python.sourceforge.net/MySQLdb.html MySQLdb] module installed on stat1003 and stat1002. For R, we have [http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf RMySQL].
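
For example, here is a hedged Python sketch that runs a query and writes the result to a TSV file, relying on the <code>~/.my.cnf</code> symlink described above for credentials (pass <code>read_default_file</code> explicitly if you haven't created it; the query itself is arbitrary):

<syntaxhighlight lang="python">
# Sketch: run a query from Python and save the result as a TSV file.
import csv
import os
import MySQLdb

conn = MySQLdb.connect(
    host="analytics-store.eqiad.wmnet",
    db="enwiki",
    read_default_file=os.path.expanduser("~/.my.cnf"),  # the symlink described above
    charset="utf8",
)
cur = conn.cursor()
cur.execute("SELECT user_name, user_editcount FROM user ORDER BY user_editcount DESC LIMIT 10")

with open("top_editors.tsv", "w") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["user_name", "user_editcount"])
    writer.writerows(cur.fetchall())
conn.close()
</syntaxhighlight>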


The page [[MariaDB]] contains detailed internal information about the setup of the slaves.


===Hadoop===
Finally, we have Hadoop, our storage system for large amounts of data. The easiest way to query the Hadoop data is through [[Analytics/Cluster/Hive|Hive]], which can be accessed from stat1002 and stat1004: simply type <code>beeline</code> in the terminal, switch to the <code>wmf</code> database, and input your query.


At the moment there are no recommended Hive access packages for R or Python. In the meantime, the best way to get data out of the system is to treat it as you would the MariaDB slaves; through the terminal, type:


<code>beeline -f my_query.hql > file_name.tsv</code>
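
If you want to script this, a hedged sketch (in Python, to match the earlier examples) is to shell out to <code>beeline</code>. It assumes <code>beeline</code> accepts an inline query with <code>-e</code>, and the field and partition names are taken from the [[Analytics/Data/Pageview hourly|pageview_hourly documentation]]; double-check both before relying on it:

<syntaxhighlight lang="python">
# Sketch: aggregate one day of pageviews by shelling out to beeline.
# Field and partition names follow the Analytics/Data/Pageview hourly docs;
# treat them as assumptions and check the table with DESCRIBE first.
import subprocess

query = """
SELECT page_title, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2016 AND month = 8 AND day = 1
  AND project = 'en.wikipedia'
  AND agent_type = 'user'
GROUP BY page_title
ORDER BY views DESC
LIMIT 20;
"""

# Mirror the beeline -f pattern above: capture standard output in a file.
with open("pageviews_20160801.tsv", "w") as out:
    subprocess.check_call(["beeline", "-e", query], stdout=out)
</syntaxhighlight>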
==Production access==
To be able to access a number of internal data sources (such as logs, replicas of the production databases, and EventLogging data) as well as machines used for data crunching (e.g. stat1003), you will need shell access to the production Wikimedia cluster. You can read more about that access, and how to request it, elsewhere on this wiki, but keep one thing in mind: production shell access is extremely sensitive. You must follow the rules outlined in the Acknowledgement of Server Access Responsibilities. If you have questions or doubts, ask Tech Ops.

=== Access Groups ===
When submitting your access request, you will need to specify what access group you need. Note that you will usually need to be added to the <code>bastiononly</code> group if you're not already in a group that grants bastion access (<code>deployment</code>, <code>restricted</code>, <code>parsoid-admin</code>, <code>ocg-render-admins</code>). Mention this in the ticket since it's occasionally missed.<ref>[[Gerrit:227327]] was intended to fix this, but has been abandoned.</ref>


Here's a summary of groups you might need (as of 2014-09):


;<code>researchers</code>
: Access to stat1003 and the credentials for the MariaDB slaves in <code>/etc/mysql/conf.d/research-client.cnf</code>.
;<code>statistics-users</code>
: Access to stat1003 for number crunching and accessing non-private log files hosted there.
;<code>statistics-privatedata-users</code>
: Access to stat1002 where private webrequest logs are hosted.
;<code>analytics-users</code>
: Access to stat1004 to connect to the [[Analytics/Cluster]] (Hadoop/Hive) (NO PRIVATE DATA).
;<code>analytics-privatedata-users</code>
: Access to stat1002 and stat1004 to connect to the [[Analytics/Cluster]] (Hadoop/Hive) and to query private data hosted there, including webrequest logs. If you want Hadoop access, you probably want this.


The list of users currently in each group is available in this [https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml configuration file].<ref>Other groups including <code>statistics-admins</code>, <code>analytics-admins</code>, <code>eventlogging-admins</code>, and <code>statistics-web-users</code> are for people doing system maintenance and administration, so you don't need them just to access data.</ref>
== Notes ==
<references />
