Analytics/Data access
This page documents the internal and external data sources that Analytics uses, the information stored within them, and how to get access.
Access to WMF machines
To be able to access a number of internal data sources (such as logs, replicas of the production databases, and EventLogging data), as well as the machines used for data crunching (e.g. stat1003), you will need to request shell access.
Access requests
If you're looking for access to a service or machine that you don't currently have permission to use, the process is relatively simple. See Requesting shell access for details. Note that you will always need explicit bastion-only access; mention this in the ticket, as it is occasionally missed. Gerrit:227327 will fix this, but as of writing it has not been merged.
Access Groups
When submitting your access request, please specify what access group you need to be added to. Here's a summary of groups you might need (as of 2014-09):
- researchers: Access to /srv/passwords/research and /etc/mysql/conf.d/research-client.cnf on stat1003 (credentials for the SQL slaves).
- statistics-users: Access to stat1003 for number crunching and accessing non-private log files hosted there.
- statistics-privatedata-users: Access to stat1002, where private webrequest logs are hosted.
- analytics-users: Access to stat1004 to connect to the Analytics/Cluster (Hadoop/Hive). No private data.
- analytics-privatedata-users: Access to stat1002 and stat1004 to connect to the Analytics/Cluster (Hadoop/Hive) and to query the private data hosted there, including webrequest logs. If you want Hadoop access, you probably want this.
The list of users currently in each group is available in this configuration file (other groups such as statistics-admins, analytics-admins, eventlogging-admins, statistics-web-users should not be required for tasks other than system maintenance/administration).
Configuring SSH
See Server access responsibilities#SSH.
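For reference, the analytics hosts are normally reached through a bastion host. The ~/.ssh/config sketch below shows one way to set that up; the bastion hostname and username are placeholders, so substitute the values from your own access request and the current bastion documentation.
# ~/.ssh/config (sketch; bastion hostname and username are placeholders)
Host bastion.example.wikimedia.org
    User your-shell-username
Host stat1002.eqiad.wmnet stat1003.eqiad.wmnet stat1004.eqiad.wmnet
    User your-shell-username
    ProxyCommand ssh -W %h:%p bastion.example.wikimedia.org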
Server access responsibilities
People with access to our private data stores are expected to have signed the Non-Disclosure Agreement, and to adhere to the server access responsibilities. Most crucial are the sections on security and the handling of sensitive data. Please read these guidelines before doing anything; access can and will be removed if you fail to live up to your responsibilities.
Data types
MediaWiki data (private/large-scale)
A lot of our work is based on data stored within MediaWiki itself - not dedicated analytics logging, but just things that are needed for the wiki to do its job. Examples of this would be:
- You want data about users' edit counts. Consult the "user" table.
- You want data about edits to a particular page. Consult the "revision" table.
- You want data about account creations. Consult the "logging" table.
Much of this data is only accessible from inside the database - other pieces that are publicly accessible through the API (for example, linking usernames to userIDs) can still be faster to gather internally if you have a large request.
This data can be found in MySQL databases on the analytics slaves, discussed below. A good guide to the tables inside those databases can be found in the MediaWiki database guide.
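As a concrete illustration, a query such as the following counts one user's edits to a particular article; it is a sketch to run on the analytics slaves described below, and the username and page title are placeholder values.
-- Sketch: count a user's edits to an article on the English Wikipedia replica.
-- 'ExampleUser' and 'Example_page' are placeholders.
USE enwiki;
SELECT COUNT(*) AS edit_count
FROM revision
JOIN page ON rev_page = page_id
WHERE rev_user_text = 'ExampleUser'
  AND page_namespace = 0
  AND page_title = 'Example_page';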
MediaWiki data (public)
Some pieces of MediaWiki data are not only public, but only really accessible publicly. The primary example of this is the actual content of pages and revisions, which isn't available in the analytics databases. Instead, it's made available through the API. This content can also be found in the XML dumps of our projects, which are described in the database dumps section below.
High-volume MediaWiki data
A couple of MediaWiki extensions generate so much data that, regardless of what wiki the data is coming from, they're actually stored in a completely different cluster of databases - the X1 cluster. As with the normal databases, there is a dedicated analytics slave for this data.
The only current examples of extensions that rely on this cluster are Echo (the notifications system) and Flow (the new discussion system). Both are stored in slightly different ways, which is discussed in the section on the analytics slaves.
EventLogging data
One analytics-specific source of data is EventLogging. This allows us to track things we're interested in as researchers that MediaWiki doesn't normally log. Examples would be:
- A log of changes to user preferences;
- A/B testing data;
- Clicktracking data.
These datasets are stored in their own database, the 'log' database, which is described in the "Analytics slaves" section below. The schemas that set out each table, and what they contain, can be found on Meta in the Schema namespace.
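For example, once connected to the analytics slaves (see below), you can list the tables in the log database and inspect recent events for a schema. The table name in this sketch is hypothetical; EventLogging tables are named after the schema and its revision, so look up the exact name first.
-- Sketch: find and inspect an EventLogging table in the 'log' database.
-- 'SchemaName_12345678' is a hypothetical table name.
USE log;
SHOW TABLES LIKE 'SchemaName%';
SELECT * FROM SchemaName_12345678 ORDER BY timestamp DESC LIMIT 10;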
Pageviews data
An important piece of community-facing data is information on our pageviews; what articles are being read, and how much? This is currently stored in Hive, which is described below.
The data structure is:
Column name | Type | Description | Example |
---|---|---|---|
project | string | The project the page is on | En |
page | string | The page title | Getting_Things_Done |
views | integer | The number of views for that page in that hour | 1 |
bytes | integer | The size of the revision in bytes | 230243 |
year | integer | The year | 2014 |
month | integer | The month | 01 |
day | integer | The day | 01 |
hour | integer | The hour | 00 |
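A minimal Hive query over this data might look like the sketch below. The table name is a placeholder (run SHOW TABLES in the wmf database to find the real one), and you should always filter on the year/month/day/hour columns, which are typically partition keys, so Hive scans only the data you need.
-- Sketch: top viewed pages for one hour. 'pageviews_table' is a placeholder.
USE wmf;
SELECT page, views
FROM pageviews_table
WHERE year = 2014 AND month = 1 AND day = 1 AND hour = 0
  AND project = 'en'
ORDER BY views DESC
LIMIT 20;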
Webrequest logs
Another important source of reader data is the RequestLogs - the logs of actual requests to Wikimedia machines. These can be found in two different places, with two different data structures, depending on the sort of data you're interested in.
For mobile request logs, you should look in our Hive cluster. The data has the following format:
Column name | Type | Description | Example |
---|---|---|---|
hostname | String | The name of the cache server the request is processing through | cp1060.eqiad.wmnet |
sequence | Integer | The per-host request number. It increases by 1 for each request on that host. | 919939520 |
dt | String | The date and time of the request, in UTC | 2014-01-03T20:08:27 |
ip | String | The IP address of the client | 192.168.0.1 |
time_firstbyte | Floating-point number | The amount of time (in seconds) before the first byte of the requested content was transmitted | 0.001473904 |
cache_status | String | The cache's response code | hit |
http_status | Integer | The HTTP status code associated with the request | 200 |
response_size | Integer | The size of the returned content, in bytes | 87 |
http_method | String | The HTTP method associated with the request | GET |
uri_host | String | The site the request was aimed at | meta.m.wikimedia.org |
uri_path | String | The page the request was aimed at | /wiki/Special:BannerRandom |
uri_query | String | Any parameters (say, ?action=edit) associated with the request | ?uselang=en&sitename=Wikipedia&project=wikipedia&anonymous=true |
content_type | String | The MIME type of the returned content | text/html |
referer | String | The referring page | http://google.com |
x_forwarded_for | String | The X-Forwarded-For header | - |
user_agent | String | The user agent of the client | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0 |
accept_language | String | The user language (or language variant) | en-US,en;q=0.8 |
x_analytics | String | An instrumentation field used by the Wikipedia Zero team | - |
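A query against this data might look like the sketch below. The table name and the year/month/day/hour partition columns are assumptions on our part (check the DESCRIBE output or ask the Analytics team), and as always you should restrict the query to specific partitions rather than scanning everything.
-- Sketch: mobile requests per HTTP status for one hour.
-- Table name and partition columns are assumptions; verify with DESCRIBE.
SELECT http_status, COUNT(*) AS requests
FROM mobile_webrequest_table
WHERE year = 2014 AND month = 1 AND day = 3 AND hour = 20
GROUP BY http_status
ORDER BY requests DESC;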
For desktop and historical mobile request logs, we have the sampled request logs, which are sampled and stored at a 1:1000 ratio. These currently stretch from May 2013 to the present and are stored in the /a/squid/archive/sampled directory on stat1002 as .tar.gz archives. They take the format:
Column name | Type | Description | Example |
---|---|---|---|
squid | string | The name of the cache server the request is processing through | cp1060.eqiad.wmnet |
sequence | Integer | The per-host request number. It increases by 1 for each request on that host. | 919939520 |
dt | String | The date and time of the request, in UTC | 2014-01-03T20:08:27 |
time_firstbyte | Floating-point number | The amount of time (in seconds) before the first byte of the requested content was transmitted | 0.001473904 |
ip | String | The IP address of the client | 192.168.0.1 |
cache_status/http_status | String | The cache's response code and the HTTP status code associated with the request | HIT/200 |
response_size | Integer | The size of the returned content, in bytes | 87 |
http_method | String | The HTTP method associated with the request | GET |
URL | String | The URL of the requested element | http://en.m.wikipedia.org/wiki/Colin_Hanks |
squid_status | String | Status of the squid | N/A - ignore this. |
content_type | String | The MIME type of the returned content | text/html |
referer | String | The referring page | http://google.com |
x_forwarded_for | String | The X-Forwarded-For header | - |
user_agent | String | The user agent of the client | Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0 |
accept_language | String | The user language (or language variant) | en-US,en;q=0.8 |
x_analytics | String | An instrumentation field used by the Wikipedia Zero team | - |
The sampled logs' format is described on the Cache log format page.
These files are not quoted and lack headers, which can make parsing them a bit of a pain. At the moment we do not have standardised scripts for doing so, although that will (hopefully!) change.
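In the meantime, a quick way to peek at the data from the shell is to stream an archive and split it on whitespace; the sketch below pulls out the timestamp (field 3) and URL (field 9) from the table above. The archive name is a placeholder and the delimiter is assumed to be plain whitespace, so inspect a few raw lines first and adjust as needed.
# Sketch: extract timestamp and URL from one sampled-log archive.
# The archive name is a placeholder; fields 3 and 9 follow the table above.
tar -xzOf /a/squid/archive/sampled/ARCHIVE_NAME.tar.gz | awk '{ print $3, $9 }' | head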
Geolocation data
When you have IP addresses - be they from the RequestLogs, EventLogging or MediaWiki itself - you can do geolocation. This can be a very useful way of understanding user behaviour and evaluating how our ecosystem works. We currently use the MaxMind geolocation services, which are accessible on both stat1003 and stat1002: a full guide to geolocation and some examples of how to do it can be found on the 'geolocation' page.
Data sources
API
The API is a core component of every wiki we run - with the exception of private wikis, which you shouldn't really need to perform research on anyway ;). A good way to understand it, and to test queries, is Special:ApiSandbox, which provides a way of easily constructing API calls and testing them. The output includes "Request URL" - a direct URL for making that query in the future, that should work on any and all Wikimedia production wikis.
If you're interested in common API tasks, and don't feel like reinventing the wheel, there are a number of Python-based API wrappers and MediaWiki utilities. Our very own Aaron Halfaker maintains MediaWiki Utilities, which includes a module dedicated to API interactions. There's no equivalent for R - yet.
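As an example of the Request URL pattern, the call below asks the English Wikipedia API for the five most recent revisions of a page. It uses only standard action=query parameters; the page title is a placeholder.
# Sketch: fetch the five most recent revisions of a page from the public API.
# 'Example_page' is a placeholder title.
curl 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Example_page&rvlimit=5&rvprop=ids|timestamp|user|comment&format=json'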
Database dumps
Another common public datasource is the collection of XML snapshots. These are generated each month, and so are always slightly outdated, but make up for it by being incredibly cohesive (and incredibly large). They contain both the text of each revision of each page, and snapshots of the database tables. As such, they're a really good way of getting large amounts of diffs or information on revisions without running into the query limits on the API.
Aaron's MediaWiki-utilities package contains a set of functions for handling and parsing the XML dumps, which should drastically simplify dealing with them. The dumps are also available internally, as well as through dumps.wikimedia.org, and can be found in /mnt/data/xmldatadumps/public on stat1002.
Analytics slaves
The Operations team (praise be) maintains several dedicated analytics slaves. These are copies of the MediaWiki databases for each of our production websites (en.wikipedia.org, fr.wikipedia.org, de.wikisource.org...).
The analytics slaves contain both "production" wikis (actively-used mainstream projects) and non-production wikis (for example, the wikis for various projects' Arbitration Committees, or defunct projects). The active "production" wikis (and their locations) are in the collapsed table below.
Accessing one of the analytics slaves is simple, if you have access to either stat1003 or stat1002 (if you don't, see the access requests section). Taking the English-language Wikipedia, which lives on the analytics-store.eqiad.wmnet host, as an example, you'd SSH into stat1003, and then type:
mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -h analytics-store.eqiad.wmnet -A
If you are on stat1002, type:
mysql --defaults-file=/etc/mysql/conf.d/analytics-research-client.cnf -h analytics-store.eqiad.wmnet -A
You'll then be dropped into the MySQL command line. Type USE enwiki; and then run whatever query you need.
In case you need to check the database server's load, view this Ganglia report.
As well as connecting directly, it's also possible to connect automatically from your programming language of choice, be it R or Python. For Python, we have the MySQLdb module installed on stat1003 and stat1002. For R, we have RMySQL.
On the other hand, if you just want to generate a TSV or CSV and then retrieve the data from that file later, you can easily do so from the command line. Taking the English-language Wikipedia example from above, you'd type:
mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -h analytics-store.eqiad.wmnet enwiki -e "your query goes here;" > file_name.tsv
For CSVs, just change the file extension; note, though, that mysql's batch output is tab-separated either way, so you may need to convert the delimiters yourself for a true CSV. The command will go off and generate the file on its own.
If you're interested in accessing high-volume data, such as data around Flow and Echo, this can be found on analytics-store.eqiad.wmnet; Flow data is stored in its own database (flowdb), while Echo-related data is stored in per-wiki databases. EventLogging data, meanwhile, is stored in the 'log' database on analytics-store.eqiad.wmnet, with each schema as its own table. Almost all production wikis are replicated to analytics-store.eqiad.wmnet, and you can get a complete list via the SiteMatrix API. (As of writing, the 'vewikimedia' database is also here but is no longer configured in MediaWiki, and 'labswiki' is not available here.)
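For instance, once connected to analytics-store.eqiad.wmnet you can switch between these databases directly. The sketch below only lists tables; the specific table names vary, so look them up before querying.
-- Sketch: explore the high-volume and EventLogging databases.
USE flowdb;
SHOW TABLES;
USE log;
SHOW TABLES;
USE enwiki;
SHOW TABLES LIKE 'echo%';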
The page MariaDB contains detailed internal information about the setup of the slaves.
Hive
Finally, we have Hive, our storage system for large amounts of data. Hive can be accessed from stat1002 and stat1004: simply type beeline in the terminal, switch to the wmf database, and input your query.
At the moment there are no recommended Hive access packages for R or Python, although we're actively investigating possible solutions. In the meantime, the best way to get data out of the system is to treat it as you would the Analytics slaves; through the terminal, type:
beeline -f my_query.hql > file_name.tsv
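Here, my_query.hql is just a text file containing the HiveQL you want to run. A minimal sketch of such a file is shown below; the table name and partition values are placeholders, so adapt them to the table you are actually querying.
-- my_query.hql (sketch): table name and partition values are placeholders.
USE wmf;
SELECT uri_host, COUNT(*) AS requests
FROM some_webrequest_table
WHERE year = 2016 AND month = 5 AND day = 31
GROUP BY uri_host
ORDER BY requests DESC
LIMIT 10;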