Analytics/Data access
This page documents the internal and external data sources that Analytics uses, the information stored within them, and how to get access.
Data sources
Data sets and data streams can be found in Category:Data_stream
MediaWiki application data
You can do a lot of work with the data stored by MediaWiki in the normal course of running itself. This includes data about:
- Users' edit counts (consult the user table)
- Edits to a particular page (consult the revision table, joined with the page table if necessary)
- Account creations (consult the logging table)
Databases
You can access this data using the replica MariaDB databases. These are accessible from the stat100* machines, as detailed below.
For an overview of how the data is laid out in those databases, consult the database layout manual.
There are a few things that aren't available from the database replicas. The main example is the actual content of pages and revisions. Instead, you can access them through the API or in the XML dumps, both of which are described below.
API
A subset of this application data, which doesn't present privacy concerns, is also publicly accessible through the API (except for private wikis, which you shouldn't really need to perform research on anyway!). A good way to understand it, and to test queries, is Special:ApiSandbox, which provides a way of easily constructing API calls and testing them. The output includes a "Request URL", a direct URL for making that query again in the future, which should work on any Wikimedia production wiki.
If you're interested in common API tasks, and don't feel like reinventing the wheel, there are a number of Python-based API wrappers and MediaWiki utilities. Our very own Aaron Halfaker maintains MediaWiki Utilities, which includes a module dedicated to API interactions. There's no equivalent for R yet.
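As a starting point, here is a rough sketch in Python using the requests package (an assumption; it is not one of the wrappers above, which also handle details like continuation and rate limiting for you). It fetches recent revision metadata for a page, much like a query built in ApiSandbox:

# A rough sketch of the same kind of query you would build in ApiSandbox.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # any public production wiki works

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Main Page",
    "rvprop": "ids|timestamp|user|comment",
    "rvlimit": 5,
    "format": "json",
    "formatversion": 2,
}

response = requests.get(API_URL, params=params,
                        headers={"User-Agent": "data-access-docs-example"})
response.raise_for_status()

for page in response.json()["query"]["pages"]:
    for rev in page.get("revisions", []):
        print(rev["revid"], rev["timestamp"], rev["user"])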
Database dumps
Every month, XML snapshots of the databases are generated. Since they're generated monthly, they're always slightly outdated, but make up for it by being incredibly cohesive (and incredibly large). They contain both the text of each revision of each page, and snapshots of the database tables. As such, they're a really good way of getting large amounts of diffs or information on revisions without running into the query limits on the API.
Aaron's MediaWiki-utilities package contains a set of functions for handling and parsing the XML dumps, which should drastically simplify dealing with them. Besides being published at dumps.wikimedia.org, the dumps are also stored internally and can be found in /mnt/data/xmldatadumps/public on stat1005.
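If you just want a quick look inside a dump, here is a rough sketch using only the Python standard library (not the mediawiki-utilities API mentioned above); the file name is a placeholder, and real dump files are large, so stream them rather than loading them whole:

# Stream page titles out of an XML dump without loading it all into memory.
import bz2
import xml.etree.ElementTree as ET

DUMP_PATH = "pages-articles.xml.bz2"  # placeholder; real dumps live under /mnt/data/xmldatadumps/public

with bz2.open(DUMP_PATH, "rb") as f:
    for event, elem in ET.iterparse(f):
        # Dump elements are namespaced, so match on the tag suffix only.
        if elem.tag.endswith("}title"):
            print(elem.text)
        elem.clear()  # free memory as we go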
EventLogging data
One analytics-specific source of data is EventLogging. This allows us to track things we're interested in as researchers that MediaWiki doesn't normally log. Examples include:
- A log of changes to user preferences;
- A/B testing data;
- Clicktracking data.
These datasets are stored in the log database on analytics-slave.eqiad.wmnet (a CNAME for db1108). The schemas that set out each table, and what they contain, can be found on Meta in the Schema namespace.
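As an illustration, here is a sketch of peeking at those tables from Python; it assumes the MySQLdb module and the credentials file discussed under "MariaDB replicas" below, and "ExampleSchema" is a hypothetical schema name (real tables are named after the schema and its revision ID):

# A sketch only: assumes MySQLdb and the research credentials file described
# later on this page; "ExampleSchema" is a hypothetical schema name.
import MySQLdb

conn = MySQLdb.connect(
    host="analytics-slave.eqiad.wmnet",
    db="log",
    read_default_file="/etc/mysql/conf.d/research-client.cnf",
)
cur = conn.cursor()

# Each schema revision gets its own table, named <SchemaName>_<revisionId>.
cur.execute("SHOW TABLES LIKE 'ExampleSchema%'")
for (table_name,) in cur.fetchall():
    print(table_name)

conn.close()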
Pageviews data
An important piece of community-facing data is information on our pageviews: which articles are being read, and how often? This is currently stored in our Hadoop cluster, which contains aggregated pageview data as well as the mostly-raw database of web requests. See the detailed documentation here.
Pivot
See Analytics/Systems/Druid#Access_to_Pivot.
Geolocation data
When you have IP addresses - be they from the RequestLogs, EventLogging or MediaWiki itself - you can do geolocation. This can be a very useful way of understanding user behaviour and evaluating how our ecosystem works. We currently use the MaxMind geolocation services, which are accessible on both stat1006 and stat1005: a full guide to geolocation and some examples of how to do it can be found on the 'geolocation' page.
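For a flavour of what that looks like, here is a sketch that assumes the geoip2 Python package and a MaxMind City database at a guessed path; consult the geolocation page for the modules and database locations actually available on the stat machines:

# A sketch only: the package and database path are assumptions.
import geoip2.database

reader = geoip2.database.Reader("/usr/share/GeoIP/GeoIP2-City.mmdb")  # assumed path

# Placeholder IP from a reserved documentation range; substitute a real client
# address (reserved ranges raise geoip2.errors.AddressNotFoundError).
response = reader.city("203.0.113.42")
print(response.country.iso_code, response.city.name)

reader.close()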
Infrastructure
Stats machines
The production Wikimedia cluster contains several dedicated statistics servers used to do general statistical computation and to access various internal datasources like the MariaDB replicas and the webrequest data in Hadoop.
You may need to access the internet from the stats machines (for example, to download a Python package using pip). By default, this will fail because the machines are tightly firewalled; you'll have to use the HTTP proxy.
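For instance, here is a sketch of routing a Python request through the proxy; the proxy address is an assumption, so check the HTTP proxy page for the current host and port (exporting http_proxy and https_proxy in your shell achieves the same thing for pip):

# A sketch assuming the standard production web proxy; verify the address
# on the HTTP proxy page before relying on it.
import requests

PROXY = "http://webproxy.eqiad.wmnet:8080"  # assumed proxy address

r = requests.get(
    "https://pypi.org/simple/",
    proxies={"http": PROXY, "https": PROXY},
    timeout=10,
)
print(r.status_code)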
Since these machines are in the production cluster, you'll need production shell access to use them.
Name | Hostname | Access available
---|---|---
stat1005 | stat1005.eqiad.wmnet | Hadoop, MariaDB
stat1006 | stat1006.eqiad.wmnet | MariaDB
stat1004 | stat1004.eqiad.wmnet | Hadoop
MariaDB replicas
The Operations team maintains several dedicated MariaDB replicas. These contain copies of the production MediaWiki databases (both actively-used mainstream projects and small internal-facing wikis, like various projects' Arbitration Committees) as well as the EventLogging databases.
The main one is analytics-store.eqiad.wmnet, known for short as analytics-store (this hostname is actually an alias for dbstore1002.eqiad.wmnet, but it's easier to remember). It contains a flowdb database with a copy of the data stored by the Flow discussion system, a staging database in which researchers can create their own tables, and a datasets database containing useful reference tables. Almost all production wikis are replicated to analytics-store.eqiad.wmnet, and you can get a complete list via the SiteMatrix API. The EventLogging log database used to be replicated there via a custom script, but is not any more (more info in T156844).
x1-analytics-slave.eqiad.wmnet, known as x1-analytics-slave, used to be an alias for db1031.eqiad.wmnet, containing replicas of the data stored by Echo (the notifications system) as well as the Flow data. The domain now points to dbstore1002.eqiad.wmnet.
You can access these analytics replicas from either stat1005 or stat1006. To use the mysql client interactively, type mysql -h analytics-store.eqiad.wmnet -A. The -A flag disables tab autocompletion, which can be bothersome when pasting multi-line queries into the prompt. You'll then be dropped into the MySQL command line.[1]
If you'd rather generate a TSV file[2] and then retrieve it later, you can also do so from the command line. Type:
mysql -h analytics-store.eqiad.wmnet {{database name}} -e "{{your query;}}" > {{filename}}.tsv
It'll go off to generate the file on its own.[3]
As well as connecting directly, it's also possible to connect automatically from your programming language of choice, be it R or Python. For Python, we have the MySQLdb module installed on stat1006 and stat1005. For R, we have RMySQL.
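For example, here is a minimal sketch with MySQLdb; the credentials file below is the one used on stat1006 (see the notes at the bottom of this page), and on stat1005 you would point at analytics-research-client.cnf instead:

# Minimal sketch of querying the replicas from Python with MySQLdb.
import MySQLdb

conn = MySQLdb.connect(
    host="analytics-store.eqiad.wmnet",
    db="enwiki",
    read_default_file="/etc/mysql/conf.d/research-client.cnf",
)
cur = conn.cursor()

# Count the revisions to a single page, joining revision against page
# (the tables described under "MediaWiki application data" above).
cur.execute(
    """
    SELECT COUNT(*)
    FROM revision
    JOIN page ON page_id = rev_page
    WHERE page_namespace = 0 AND page_title = %s
    """,
    ("Barack_Obama",),
)
print(cur.fetchone()[0])

conn.close()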
The page MariaDB contains detailed internal information about the setup of the slaves.
Hadoop
Finally, we have Hadoop, our storage system for large amounts of data. The easiest way to query the Hadoop data is through Hive, which can be accessed from stat1005 and stat1004: simply type beeline in the terminal, switch to the wmf database, and input your query.
At the moment there are no recommended Hive access packages for R or Python. In the meantime, the best way to get data out of the system is to treat it as you would the Analytics slaves; through the terminal, type:
beeline -f my_query.hql > file_name.tsv
For information about writing HQL to query this data, see the Hive language manual.
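If you need the results in Python anyway, here is a sketch that simply wraps beeline from a script, in the same spirit as the command above; the table and columns in the query are illustrative, so check the wmf database documentation for the actual schemas:

# A sketch that shells out to beeline and reads its tab-separated output.
import csv
import subprocess

query = """
SELECT project, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2018 AND month = 3 AND day = 1
GROUP BY project
ORDER BY views DESC
LIMIT 10;
"""

result = subprocess.run(
    ["beeline", "--outputformat=tsv2", "-e", query],
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)

# tsv2 output is plain tab-separated rows, so the csv module can read it.
for row in csv.reader(result.stdout.splitlines(), delimiter="\t"):
    print(row)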
SWAP
We also have SWAP, an internal Jupyter server that provides a nice notebook interface for crunching data.
Production access
To be able to access a number of internal data sources (such as logs, replicas of the production databases, and EventLogging data) as well as machines used for data crunching (e.g. stat1006), you will need shell access to the production Wikimedia cluster (see also these notes on configuring SSH specifically for the purpose of working with the stats servers). You can read more about that access and how to request it at the link, but keep one thing in mind: production shell access is extremely sensitive. You must follow the rules outlined in Acknowledgement of Server Access Responsibilities. If you have questions or doubts, ask Tech Ops.
Access Groups
When submitting your access request, you will need to specify what access group you need.
The 'analytics-*' groups have access to the Analytics Cluster (which mostly means Hadoop). The 'statistics-*' groups get access to the stat* servers for local (non-distributed) compute resources. These groups overlap in which servers they grant SSH access to, but further POSIX permissions restrict access to things like MySQL, Hadoop, and files.
Here's a summary of groups you might need (as of 2016-10-18):
researchers
- Access to stat1006 and the credentials for the MariaDB slaves in /etc/mysql/conf.d/research-client.cnf.

statistics-users
- Access to stat1006 for number crunching and accessing non-private log files hosted there.

statistics-privatedata-users
- Access to stat100[56], public data like sampled Webrequest logs (stored under /a/log/webrequest/archive), and credentials for the MariaDB slaves in /etc/mysql/conf.d/statistics-private-client.cnf.

analytics-wmde
- WMDE-specific group (mostly used for crons). Access to stat1005 and to the MariaDB slaves in /etc/mysql/conf.d/research-wmde-client.cnf.

analytics-users
- Access to stat1004 to connect to the Analytics/Cluster (Hadoop/Hive) (NO HADOOP PRIVATE DATA).

analytics-privatedata-users
- Access to stat1005 and stat1004 to connect to the Analytics/Cluster (Hadoop/Hive) and to query private data hosted there, including webrequest logs. Access to the MariaDB slaves in /etc/mysql/conf.d/analytics-research-client.cnf. If you want Hadoop access, you probably want this group.
The list of users currently in each group is available in this configuration file.[4]
Host access granted
Access Groups | Access to stat1005 | Access to stat1006 | Access to stat1004
---|---|---|---
researchers | | X |
statistics-users | | X |
statistics-privatedata-users | X | X |
analytics-users | X | | X
analytics-privatedata-users | X | | X
analytics-wmde | X | |
Data access granted
Access Groups | Hadoop access (No private data) | Hadoop access (Private data) | research-client.cnf | statistics-private-client.cnf | research-wmde-client.cnf | analytics-research-client.cnf
---|---|---|---|---|---|---
researchers | | | X | | |
statistics-users | | | | | |
statistics-privatedata-users | | | | X | |
analytics-users | X | | | | |
analytics-privatedata-users | X | X | | | | X
analytics-wmde | | | | | X |
Of special interest for external researchers
If you are an external researcher trying to get access to our data, you must have signed an NDA. Once you have sorted out that part of the process, here is a ticket that can guide you through requesting access to systems: task T141634.
Like any developer at the Foundation, you will need an account. Please follow the Sign Up link at https://www.mediawiki.org/wiki/Developer_access, as per https://wikitech.wikimedia.org/wiki/Production_shell_access.
Notes
1. The stat machines automatically authenticate to the MariaDB replicas using passwords stored in preloaded config (.cnf) files. On stat1006, the file is /etc/mysql/conf.d/research-client.cnf, while on stat1005 it's /etc/mysql/conf.d/analytics-research-client.cnf. These files are automatically referenced from /etc/mysql/my.cnf, which the command-line mysql program reads by default. Other programs may not do this automatically, and require an explicit pointer to the underlying .cnf file.
2. The mysql utility doesn't have the ability to generate files in other formats like CSV.
3. The file extension you choose doesn't actually affect the command, but since -e generates a TSV file, you should use the corresponding file extension so other programs load the resulting file correctly.
4. Other groups, including statistics-admins, analytics-admins, eventlogging-admins, and statistics-web-users, are for people doing system maintenance and administration, so you don't need them just to access data.