Analytics/Data access
In addition to a variety of publicly-available data sources, Wikimedia has a parallel set of private data sources. The main reason is to allow a carefully vetted set of users to perform research and analysis on confidential user data (such as the IP addresses of readers and editors), which is stored according to our privacy policy and data retention guidelines. This private infrastructure also provides duplicate copies of publicly-available data for ease of use.
Shell access
This private data lives in the same server cluster that runs Wikimedia's production websites. This means you will need production shell access to get to it (see also these notes on configuring SSH specifically for the purpose of working with the stats servers).
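For illustration only, connecting to a stat host typically looks something like the following once your access is set up; the bastion hostname and shell username here are placeholders, so follow the SSH configuration notes linked above for the actual values.
# Hypothetical example: jump through a production bastion host to a stat server.
# Replace BASTION_HOST and your-shell-name with the values from the SSH notes.
ssh -J your-shell-name@BASTION_HOST your-shell-name@stat1007.eqiad.wmnet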
However, since this access gets you closer to both those production websites and this confidential data, it is not freely given out. First, you have to demonstrate a need for these resources. Second, you need to have a non-disclosure agreement with the Wikimedia Foundation. If you're a Foundation employee, this was included as part of your employment agreement. If you're a researcher, it's possible to be sponsored through a formal collaboration with the Wikimedia Foundation's Research team.
User responsibilities
If you get this access, you must remember that this access is extremely sensitive. You have a duty to protect the privacy of our users. As Uncle Ben says, "with great power comes great responsibility." Always follow the rules outlined in the Acknowledgement of Server Access Responsibilities, which you have signed if you have access to this data.
In addition, keep in mind the following important principles:
- Be paranoid about personally identifiable information (PII). Familiarize yourself with the data you are working on and determine whether it contains any PII. It's better to double- and triple-check than to assume anything, and if you have any doubt, ask the Analytics team (via IRC, email, or Phabricator). Please see the data retention guidelines.
- Don't copy sensitive data (for example, data accessible only by the users in analytics-privatedata-users) from its original location to anywhere else (in HDFS or on any other host or storage) unless strictly necessary, and only if you know what you are doing. If you are in doubt, please reach out to the Analytics team first.
- Restrict access. If you do need to copy sensitive data somewhere, please make sure that you are the only one able to access it. For example, if you copy Webrequest data from its location on HDFS to your /user/$your-username directory, make sure the permissions are set so that other HDFS users cannot read it (see the example after this list). This is essential to avoid accidental leaks of PII/sensitive data or retention beyond our guidelines (https://meta.wikimedia.org/wiki/Data_retention_guidelines).
- Clean up copies of data. Please make sure that any data that you copied is deleted as soon as your work has been done.
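As a rough sketch of the "restrict access" and "clean up" principles above (the paths here are purely illustrative, not real datasets):
# Copy a dataset into your own HDFS home directory (illustrative paths only).
hdfs dfs -cp /wmf/data/some/dataset /user/$USER/tmp_analysis
# Restrict the copy so that only you can read it.
hdfs dfs -chmod -R 700 /user/$USER/tmp_analysis
# Delete the copy as soon as your work is done, skipping the trash so no extra copy lingers.
hdfs dfs -rm -r -skipTrash /user/$USER/tmp_analysis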
If you ever have any questions or doubts, err on the side of caution and contact the Analytics team. We are very friendly and happy to help!
Access Groups
To get shell access, submit a request on Phabricator tagged with SRE-Access-Requests, following the instructions at Production shell access#Requesting access. You will need to specify which access group you need.
'analytics-*' groups have access to the Analytics Cluster (which mostly means Hadoop) and to the stat* servers for local (non-distributed) compute resources. These groups overlap in which servers they grant SSH access to, but further POSIX permissions restrict access to things like MySQL, Hadoop, and files.
Here's a summary of groups you might need (as of 2020-03-16):
Team specific:
analytics-wmde-users
- For Wikimedia Deutschland employees, mostly used for crons. Grants access to all stat100x hosts and to the MariaDB replicas via
/etc/mysql/conf.d/research-wmde-client.cnf
analytics-search-users
- For members of the Wikimedia Foundation Search Platform team, used for various Analytics-Search jobs. Grants access to all stat100x hosts, to an-launcher1001, and to the MariaDB replicas.
Generic users:
researchers
- Grants access to all the analytics clients and the credentials for the MariaDB replicas in
/etc/mysql/conf.d/research-client.cnf
. It is called "researchers" because it was originally created for the Research team, but that is no longer the case. This group is deprecated, but might still be useful for some particular use cases; reach out to the Analytics team if you think it fits yours :)
analytics-users
- Grants access to all the analytics clients. Doesn't grant access to Hadoop or any private data.
analytics-privatedata-users
- Grants access to all the analytics clients, the analytics cluster (Hadoop/Hive) and the private data hosted there, and to the MariaDB replicas, using the credentials at
/etc/mysql/conf.d/analytics-research-client.cnf
- Users in this group also need a Kerberos authentication principal. If you're already a group member and don't have one, follow the instructions in the Kerberos user guide. If you're requesting membership in this group, the SRE team will create one for you when they add you to the group.
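As a minimal sketch, once your principal exists, authenticating before using Hadoop usually amounts to the following; see the Kerberos user guide for the authoritative steps.
# Obtain a Kerberos ticket (you will be prompted for your Kerberos password).
kinit
# Check that you now hold a valid ticket before running Hadoop or Hive commands.
klist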
The list of users currently in each group is available in this configuration file.[1]
Host access granted
There used to be many differences in which hosts an Analytics POSIX group could access, but these days all of the groups grant access to the same hosts.
Data access granted
Access Groups | Hadoop access (No private data) | Hadoop access (Private data) | research-client.cnf | research-wmde-client.cnf | analytics-research-client.cnf
---|---|---|---|---|---
researchers | | | X | |
analytics-users | | | | |
analytics-privatedata-users | X | X | | | X
analytics-wmde-users | | | | X |
Data access expiration
Data access is given to collaborators and contractors with a time limit. Normally the end date is set to the contract or collaboration end date. For staff, data access terminates upon employment termination unless there is a collaboration in place.
Once a user's access is terminated, their home directory is deleted. If the team wishes to preserve some of the user's work (work, not data, as data has strict guidelines for deletion), it can be archived to Hadoop. Please file a Phabricator ticket to have this done. Archival to Hadoop happens in the following directory:
/wmf/data/archive/user/<username>
LDAP access
Some Analytics systems, including Superset, Turnilo, and Jupyter, require a developer account in the wmf or nda LDAP groups for access.
If you need this access, first make sure you have a working developer account (if you can log into this wiki, you have one). If you need one, you can create one at mw:Developer_account.
Note that a developer account comes with two different usernames; some services need one and some services need the other. You can find both by logging into this wiki and visiting the "user profile" section of Special:Preferences. Your Wikitech username is listed under "Username", while your developer shell username is listed under "Instance shell account name". Thankfully, there's only one password!
Then, create a Phabricator task tagged with LDAP-access-requests asking to be added to the appropriate group. Make sure you include both your usernames. For an example task, see T208822.
Note that this access has similar requirements to shell access: you will need to either be a Wikimedia Foundation employee or have a signed volunteer NDA.
Infrastructure
Analytics clients
Once you have access to the production cluster, there are several servers which you can use to access private data sources and run your analysis. There are two types: the stat servers, designed for command-line use, and the SWAP servers, designed for Jupyter notebook use. For more information, see Analytics/Systems/Clients and/or SWAP.
MariaDB
The Analytics MariaDB cluster contains copies of the production MediaWiki databases (both actively-used mainstream projects and small internal-facing wikis, like various projects' Arbitration Committees).
Hadoop
- As of November 2019, Hadoop is authenticated via Kerberos. See Kerberos User guide and Hadoop testing cluster.
Hadoop is our storage system for large amounts of data. The easiest way to query the Hadoop data is through Hive, which can be accessed from most of the Analytics clients. Simply type beeline in the terminal, switch to the wmf database, and input your query.
At the moment there are no recommended Hive access packages for R or Python. In the meantime, the best way to get data out of the system is to treat it as you would the Analytics MariaDB replicas: in the terminal, type:
beeline -f my_query.hql > file_name.tsv
For information about writing HQL to query this data, see the Hive language manual.
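As a concrete sketch, a query might look like the following; wmf.webrequest and its partition columns (year/month/day/hour) are used here as an assumed example, so adapt it to whatever table you actually need.
# Illustrative query run through beeline; always constrain the partition
# columns so the query doesn't scan the entire dataset.
beeline -e "SELECT uri_host, COUNT(*) AS requests
            FROM wmf.webrequest
            WHERE year = 2020 AND month = 6 AND day = 1 AND hour = 0
            GROUP BY uri_host
            LIMIT 10;" > top_hosts.tsv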
Data sources
Data sets and data streams can be found in Category:Data_stream
Data Dashboards: Superset and Turnilo (previously called Pivot)
Superset: http://superset.wikimedia.org
Pivot: http://pivot.wikimedia.org
You need a Wikitech login that is in the "wmf" or "nda" LDAP groups. If you don't have one, please create a task like https://phabricator.wikimedia.org/T160662
Before requesting access, please make sure you:
- have a functioning Wikitech login. Get one: https://toolsadmin.wikimedia.org/register/
- are an employee or contractor with wmf OR have signed an NDA
Depending on the above, you can request to be added to the wmf group or the nda group. Please explain on the task why you need access, and ping the Analytics team if you don't hear back soon from the Opsen on duty.
MediaWiki application data
You can do a lot of work with the data stored by MediaWiki in the normal course of running itself. This includes data about:
- Users' edit counts (consult the user table)
- Edits to a particular page (consult the revision table, joined with the page table if necessary)
- Account creations (consult the logging table)
Databases
You can access this data using the replica MariaDB databases. These are accessible from the stat100* machines, as detailed below.
For an overview of how the data is laid out in those databases, consult the database layout manual.
There are a few things that aren't available from the database replicas. The main example of this is the actual content of pages and revisions. Instead, you can access them through the API or in the XML dumps, which are both described below.
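As an illustrative sketch (the replica hostname below is a placeholder, and the credentials file shown is the one for analytics-privatedata-users; check the MariaDB documentation for the correct host, and possibly port, for the wiki you need):
# Connect to a replica with the credentials file matching your access group.
# REPLICA_HOST is a placeholder; depending on the replica setup you may also
# need to pass a port with -P.
mysql --defaults-file=/etc/mysql/conf.d/analytics-research-client.cnf \
      -h REPLICA_HOST -A enwiki \
      -e "SELECT COUNT(*) FROM user;"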
API
A subset of this application data, which doesn't present privacy concerns, is also publicly accessible through the API (except for private wikis, which you shouldn't really need to perform research on anyway!). A good way to understand it, and to test queries, is Special:ApiSandbox, which provides a way of easily constructing API calls and testing them. The output includes a "Request URL": a direct URL for making that query in the future, which should work on any Wikimedia production wiki.
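For example, a "Request URL" produced by the sandbox can be fetched directly; the particular parameters below are just an illustration.
# Example API request: basic page info (including the latest revision ID) for one article.
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Earth&prop=info&format=json"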
If you're interested in common API tasks, and don't feel like reinventing the wheel, there are a number of Python-based API wrappers and MediaWiki utilities. Our very own Aaron Halfaker maintains MediaWiki Utilities, which includes a module dedicated to API interactions. There's no equivalent for R yet.
Database dumps
Every month, XML snapshots of the databases are generated. Since they're generated monthly, they're always slightly outdated, but make up for it by being incredibly cohesive (and incredibly large). They contain both the text of each revision of each page, and snapshots of the database tables. As such, they're a really good way of getting large amounts of diffs or information on revisions without running into the query limits on the API.
Aaron's MediaWiki-utilities package contains a set of functions for handling and parsing the XML dumps, which should drastically simplify dealing with them. They're also stored internally, as well as on dumps.wikimedia.org, and can be found in /mnt/data/xmldatadumps/public on stat1006, stat1007, notebook1003, and notebook1004.
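For example, to see which dump runs are available for a given wiki from one of those hosts (the wiki name here is just an example):
# List the dump directories available for English Wikipedia; each dated
# subdirectory contains the pages/revisions XML plus table snapshots.
ls /mnt/data/xmldatadumps/public/enwiki/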
EventLogging data
One analytics-specific source of data is EventLogging. This allows us to track things we're interested in as researchers that MediaWiki doesn't normally log. Examples include:
- A log of changes to user preferences;
- A/B testing data;
- Clicktracking data.
These datasets are stored in the event and event_sanitized Hive databases, subject to HDFS access control.
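For example, an EventLogging table can be queried from Hive like any other table; the schema name below is an assumption, so substitute the one you actually need.
# Illustrative EventLogging query (the table name is an assumption).
# Event tables are partitioned by year/month/day/hour, so constrain those columns.
beeline -e "SELECT COUNT(*)
            FROM event.navigationtiming
            WHERE year = 2020 AND month = 6 AND day = 1;"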
Pageviews data
An important piece of community-facing data is information on our pageviews: which articles are being read, and how much? This is currently stored in our Hadoop cluster, which contains aggregated pageview data as well as the mostly-raw database of web requests. See the detailed documentation here.
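As a sketch, the aggregated data can be queried from Hive as follows; wmf.pageview_hourly and its columns are the assumed table here, so see the linked documentation for the authoritative schema.
# Illustrative pageview query: most viewed articles on English Wikipedia for one hour.
beeline -e "SELECT page_title, SUM(view_count) AS views
            FROM wmf.pageview_hourly
            WHERE year = 2020 AND month = 6 AND day = 1 AND hour = 0
              AND project = 'en.wikipedia'
            GROUP BY page_title
            ORDER BY views DESC
            LIMIT 10;"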
Turnilo
Analytics/Systems/Turnilo-Pivot#Access
Geolocation data
When you have IP addresses - be they from the RequestLogs, EventLogging or MediaWiki itself - you can do geolocation. This can be a very useful way of understanding user behaviour and evaluating how our ecosystem works. We currently use the MaxMind geolocation services, which are accessible on both stat1006 and stat1007: a full guide to geolocation and some examples of how to do it can be found on the 'geolocation' page.
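As a rough example of a one-off lookup on one of those hosts (both the database path and the availability of the mmdblookup tool are assumptions; the geolocation page describes the supported approaches):
# Illustrative MaxMind lookup from the command line; the .mmdb path is an assumption.
mmdblookup --file /usr/share/GeoIP/GeoIP2-City.mmdb --ip 203.0.113.5 country names en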
Notes
- ↑ Other groups, including statistics-admins, analytics-admins, eventlogging-admins, and statistics-web-users, are for people doing system maintenance and administration, so you don't need them just to access data.