Analytics/Data access
In addition to a variety of [[meta:Research:Data|publicly-available data sources]], Wikimedia has a parallel set of private data sources. The main reason is to allow a carefully vetted set of users to perform research and analysis on confidential user data (such as the IP addresses of readers and editors), which is stored according to our [[foundation:Privacy_policy|privacy policy]] and [[metawiki:Data_retention_guidelines|data retention guidelines]]. This private infrastructure also provides duplicate copies of publicly-available data for ease of use.


== Do you need it? ==
Private data lives in the same server cluster that runs Wikimedia's production websites. Often, this means you will need [[production shell access]] to get it.


However, since this access gets you closer to both those production websites and this confidential data, it is not freely given out. First, you have to demonstrate a need for these resources. Second, you need to have a non-disclosure agreement with the Wikimedia Foundation. If you're a Foundation employee, this was included as part of your employment agreement. If you're a researcher, it's possible to be sponsored through [[mw:Wikimedia_Research/Formal_collaborations|a formal collaboration with the Wikimedia Foundation's Research team]].


=== {{Anchor|Responsibilities}}User responsibilities ===
You '''must''' remember this access is extremely sensitive. '''You have a duty to protect the privacy of our users'''. As Uncle Ben says, "with great power comes great responsibility." Always follow the rules outlined in the [[phab:L3|Acknowledgement of Server Access Responsibilities]], even if you haven't requested ssh access to the stat100x clients, since it contains good guidelines about how to handle sensitive data.


In addition, keep in mind the following important principles:
*'''Read the [https://wikitech.wikimedia.org/wiki/Analytics/Data_Access_Guidelines data access guidelines]'''. This is important.
*'''Be paranoid about personally identifiable information''' (PII). Familiarize yourself with the data you are working on, and determine if it contains any PII. It's better to double and triple check than to assume anything, and if you have any doubt, ask the Analytics team (via IRC, email, or Phabricator). Please see the [[metawiki:Data_retention_guidelines|data retention guidelines]].
*'''Don't copy sensitive data''' (for example, data accessible only by users in <code>analytics-privatedata-users</code>) from its origin location to elsewhere (in HDFS or on any other host or storage medium) unless strictly necessary, and most importantly, only if you know what you are doing. If you are in doubt, please reach out to the Analytics team first.
*'''Restrict access'''. If you do need to copy sensitive data somewhere, please make sure that you are the only one able to access it. For example, if you copy webrequest data from its location on HDFS to your <code>/user/$your-username</code> directory, make sure that the permissions are set so that not everybody with access to HDFS can read the data (see the sketch after this list). This is essential to avoid accidental leaks of PII/sensitive data or retention beyond our [[metawiki:Data_retention_guidelines|data retention guidelines]].
*'''Clean up copies of data'''. Please make sure that any data that you copied is deleted as soon as your work is done.
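For example, here's a minimal sketch of locking down and then deleting a temporary copy on HDFS (the <code>webrequest_tmp</code> path is just a hypothetical example):

<syntaxhighlight lang="bash">
# Remove group/other permissions from the copied data (hypothetical path).
hdfs dfs -chmod -R go-rwx /user/$USER/webrequest_tmp

# Double-check that only your user can read it.
hdfs dfs -ls /user/$USER

# Delete the copy as soon as your work is done; -skipTrash avoids
# leaving a second copy in the HDFS trash.
hdfs dfs -rm -r -skipTrash /user/$USER/webrequest_tmp
</syntaxhighlight>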


If you ever have any questions or doubts, err on the side of caution and [[Analytics#Contact|contact the Analytics team]]. We are very friendly and happy to help!


== Requesting access ==


If, after reading the above, you do need access to WMF analytics data and/or tools, you'll need to submit a request on Phabricator and add the project tag <code>SRE-Access-Requests</code>. Follow the steps at [[Production access#Access Request Process]].


If you already have access and you only need to get kerberos credentials, it is sufficient to create a task with the project tag <code>Analytics</code>: [https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?title=Requesting%20Kerberos%20access%20for%20%3CYOUR%20USERNAME%3E&description=*%20My%20username%20on%20wikitech.wikimedia.org%20is%3A%20%0D%0A*See%20https%3A%2F%2Fwikitech.wikimedia.org%2Fwiki%2FAnalytics%2FData_access&projects=analytics Create a ticket requesting kerberos credentials].


Read the following sections to figure out what access levels you should request in your ticket.


Please follow the [[Production_access#Filing_the_request|Production access request instructions]] for any of the access types. We need a paper trail and a standard form in order to keep track of requests and understand why they are happening. When submitting the Phabricator ticket, you may edit the description to match the request you are making; e.g. if you don't need SSH access, you don't need to provide an SSH key.


== Access Levels ==
There are a few varying levels and combinations of access that we support.


<code>analytics-*</code> groups have access to the [[Analytics/Cluster|Analytics Cluster]] (which mostly means Hadoop) and to stat* servers for local (non-distributed) compute resources. These groups overlap in what servers they grant ssh access to, but further posix permissions restrict access to things like MySQL, Hadoop, and files.


* LDAP membership in the <tt>wmf</tt> or <tt>nda</tt> LDAP group allows you to log in and authenticate via web tools like Superset and Turnilo.
* Shell (posix) membership in the <code>analytics-privatedata-users</code> group allows you to read private data stored in tools like Hadoop, Hive, and Presto.
* An ssh key for your shell user allows you to ssh into the analytics client servers (AKA stat boxes) and access tools like [[Analytics/Systems/Jupyter|Jupyter]] (which also needs LDAP membership).
* A Kerberos principal allows you to access data in Hadoop directly.
* Team-specific shell (posix) group membership allows management of team-specific jobs and data.


This might all be confusing if you are just trying to figure out what to put in your Phabricator SRE-Access-Requests ticket. Here are a few common use cases of what you might be trying to request.
 
== What access should I request? ==
 
If you need access to...
 
=== Dashboards in web tools like Turnilo and/or Superset that do not access private data ===
* LDAP membership in the <tt>wmf</tt> or <tt>nda</tt> LDAP group.
 
=== Dashboards in Superset / Hive interfaces (like Hue) that do access private data ===
* LDAP membership in the <tt>wmf</tt> or <tt>nda</tt> LDAP group.
* Shell (posix) membership in the <code>analytics-privatedata-users</code> group
 
''Note to SREs granting this access: this can be done by declaring the user in Puppet as usual, but with an empty array of <tt>ssh_keys</tt>.''
 
=== ssh login to analytics client servers (AKA stat boxes) without Hadoop, Hive, Presto access ===
This is a rare need, but you might want it if you just want to use a GPU on a stat box, or need access to the MediaWiki analytics MariaDB instances.
* LDAP membership in the <tt>wmf</tt> or <tt>nda</tt> LDAP group.
* Shell (posix) membership in the <code>analytics-privatedata-users</code> group
* An ssh key for your shell user
 
=== ssh login to analytics client servers (AKA stat boxes) with Hadoop, Hive, Presto access ===
* LDAP membership in the <tt>wmf</tt> or <tt>nda</tt> LDAP group.
* Shell (posix) membership in the <code>analytics-privatedata-users</code> group
* An ssh key for your shell user
* A Kerberos principal
 
=== All of the above ===
If you are a WMF engineer wanting to work with analytics data, most likely you'll want all of these access levels together:
 
* LDAP membership in the <tt>wmf</tt> or <tt>nda</tt> LDAP group.
* Shell (posix) membership in the <code>analytics-privatedata-users</code> group
* An ssh key for your shell user
* A Kerberos principal
 
If needed for work on your team, you may also want team-specific shell (posix) group membership (see below).
 
== Analytics shell (posix) groups explained ==
 
=== Generic data access (can go together with the Team specific ones) ===
<code>'''analytics-privatedata-users (no kerberos, no ssh)'''</code>
 
The Analytics team offers various UIs to fetch data from Hadoop, like Turnilo and Superset. Both are guarded by CAS authentication (requiring the user to be in either the wmf or the nda LDAP group) and fetch data from Druid (currently not authenticated). Superset is also able to fetch data from Hadoop/Hive on behalf of the logged-in user via a (read-only) tool called Presto. There are two use cases:
 
* SQL Lab panel: the user can run SQL-like queries on Hadoop datasets (pageviews, events, etc.) without needing to log in to a stat100x host.
* Dashboards: data visualized in dashboards is fetched from Hadoop.
 
In both cases, Superset works on behalf of the user, so the user ultimately needs read permissions on the Hadoop data to visualize what they requested. This is guaranteed by membership in <code>analytics-privatedata-users</code>, which gets deployed to the Hadoop master nodes (without ssh access) to define user permissions on HDFS. This is why some users might want to be in the group without either kerberos or ssh.
 
Additionally, the user needs to be added to the <tt>wmf</tt> LDAP group. Make sure to add them (if you are an SRE) or mention it on the ticket (if you are the requestor).
 
<code>'''analytics-privatedata-users (no kerberos)'''</code>
 
Grants access to the [[Analytics/Systems/Clients|analytics clients]], GPUs and to [[Analytics/Systems/MariaDB|MariaDB replicas]] (using the credentials at <code>/etc/mysql/conf.d/analytics-research-client.cnf</code>).

<code>'''analytics-privatedata-users (with kerberos)'''</code>

Grants access to all the [[Analytics/Systems/Clients|analytics clients]], the [[Analytics/Cluster|analytics cluster]] (Hadoop/Hive) and the '''private''' data hosted there, and to [[Analytics/Systems/MariaDB|MariaDB replicas]], using the credentials at <code>/etc/mysql/conf.d/analytics-research-client.cnf</code>.

Users in this group also need a [[Analytics/Systems/Kerberos|Kerberos]] authentication principal. If you're already a group member and don't have one, follow the [[Analytics/Systems/Kerberos/UserGuide#Get_a_password_for_Kerberos|instructions in the Kerberos user guide]]. If you're requesting membership in this group, the [[SRE|SRE team]] will [[Analytics/Systems/Kerberos#Create_a_principal_for_a_real_user|create this for you]] when they add you to the group.
 
The list of users currently in each group is available in this [https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.yaml configuration file].<ref>Other groups including <code>statistics-admins</code>, <code>analytics-admins</code>, <code>eventlogging-admins</code>, and <code>statistics-web-users</code> are for people doing system maintenance and administration, so you don't need them just to access data.</ref>
 
=== Team specific (they do not grant access to PII data on Hadoop, for that see analytics-privatedata-users) ===
;<code>analytics-wmde-users</code>
:For [[meta:Wikimedia Deutschland|Wikimedia Deutschland]] employees, mostly used for crons running automation jobs as the <code>analytics-wmde</code> system user. Grants access to all stat100x hosts, to the [[Analytics/Systems/MariaDB|MariaDB replicas]] via <code>/etc/mysql/conf.d/research-wmde-client.cnf</code> and to the <code>analytics-wmde</code> system user. It is not required that every WMDE user be placed into this group; only those who need to take care of the aforementioned automation require access (so they'll ask for it explicitly).
;<code>analytics-search-users</code>
: For members of the [[mw:Wikimedia Search Platform|Wikimedia Foundation Search Platform team]], used for various Analytics-Search jobs. Grants access to all stat100x hosts, an-airflow1001, and to the <code>analytics-search</code> system user.
;<code>analytics-product-users</code>
:For members of the Product Analytics team, used for various analytics jobs. Grants access to all stat100x hosts, and to the <code>analytics-product</code> system user.
;<code>analytics-research-users</code>
:For members of the Research team, used for various jobs. Grants access to all stat100x hosts, an Airflow instance, and to the <code>analytics-research</code> system user.
;<code>analytics-platform-eng-users</code>
:For members of the Platform Engineering team, used for various jobs. Grants access to all stat100x hosts, an Airflow instance, and to the <code>analytics-platform-eng</code> system user.
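As a rough illustration, membership in one of these groups is what allows running commands as the corresponding system user on a stat host (the exact sudo policy is defined in Puppet; <code>analytics-search</code> below is just one example):

<syntaxhighlight lang="bash">
# Confirm your group memberships took effect (log out and in again first).
id

# Run a trivial command as the team's system user; this only works if
# your POSIX user is in the matching analytics-* group.
sudo -u analytics-search id
</syntaxhighlight>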


=== Groups to avoid (deprecated) ===


;<code>researchers</code>
;<code>analytics-users</code>


===Host access granted===
There used to be many differences in which hosts each Analytics POSIX group could access, but these differences have since been eliminated.


===Data access granted===
{| class="wikitable"
!Access Groups
!Hadoop access (no private data)
!Hadoop access (private data)
!MariaDB credentials
!System user
!Other
|-
|<code>analytics-privatedata-users</code>
|<code>yes</code>
|<code>yes</code>
|<code>analytics-research-client.cnf</code>
|<code>analytics-privatedata</code>
|
|-
|<code>analytics-wmde-users</code>
|
|
|<code>research-wmde-client.cnf (only on stat1007)</code>
|<code>analytics-wmde</code>
|
|-
|<code>analytics-search-users</code>
|
|
|
|
|<code>Airflow admin</code>
|-
|<code>analytics-product-users</code>
|
|
|
|<code>analytics-product</code>
|
|}


=== Shell access expiration ===
Data access is given to collaborators and contractors with a time limit. Normally the end date is set to the contract or collaboration end date. For staff, data access terminates when employment ends, unless there is a collaboration in place.

Once a user's access is terminated, their home directory is deleted. If the team wishes to preserve some of the user's work (work, not data, as data has strict guidelines for deletion), it can be archived to Hadoop; please file a Phabricator ticket to have this done. Archival to Hadoop happens in the following directory: <code>/wmf/data/archive/user/<username></code>
 
== LDAP access ==
Some Analytics systems, including [[Analytics/Systems/Superset|Superset]], [[Analytics/Systems/Turnilo|Turnilo]], and [[Analytics/Systems/Jupyter|Jupyter]], require a [[mw:developer account|developer account]] in the <code>wmf</code> or <code>nda</code> [[LDAP/Groups|LDAP groups]] for access.
 
If you need this access, first make sure you have a working developer account (if you can [[Special:Login|log into this wiki]], you have one). If you need one, you can create one at [[mw:Developer_account]].
 
Note that a developer account comes with ''two'' different usernames; some services need one and some services need the other. You can find both by [[Special:Login|logging into this wiki]] and visiting [[Special:Preferences#mw-prefsection-personal|the "user profile" section of Special:Preferences]]. Your ''Wikitech username'' is listed under "Username", while your ''developer shell username'' is listed under "Instance shell account name". Thankfully, there's only one password!
 
Then, create a Phabricator task: read and follow [[phab:project/profile/1564/|the instructions for LDAP-access-requests]] to request being added to the appropriate group. Make sure you include both of your usernames.
 
Note that this access has similar requirements to shell access: you will need to either be a Wikimedia Foundation employee or have a signed volunteer NDA.
 
== Accounts and passwords explained: LDAP/Wikitech/MW Developer vs shell/ssh/posix vs Kerberos ==
There are too many different accounts and passwords one has to deal with in order to access analytics systems.  For now it's what we've got.  Let's try to explain them all explicitly.
 
 
 
=== tl;dr ===
* LDAP AKA Wikitech AKA MediaWiki Developer accounts are the same.  There are 2 usernames for this account, but only one password.
* POSIX AKA shell AKA ssh accounts are the same.  The username is the same as your 'shell username' for your LDAP account.  There is no password, only an ssh key pair.
* Kerberos uses your shell username and a separate Kerberos account password, and grants you access to distributed systems like Hadoop.
 
=== LDAP ===
LDAP is used mostly for web logins.  An LDAP account has 2 usernames, the 'Wikitech' username and the shell username, as described above.  The password for these is the same.
Since LDAP account creation is handled by MediaWiki and also allows you to log into Wikitech (this wiki), LDAP accounts are sometimes referred to as your 'Wikitech' account or your 'MediaWiki developer account'. These terms all mean the same thing.
 
Analytics web UIs (like Jupyter, Turnilo, Superset, etc.) require that you have an LDAP account in specific groups.  Membership in these groups authorizes access.
 
=== POSIX ===
To log into a production server, you need an explicit POSIX shell account created for you.  This is handled by SRE.  POSIX user accounts are often also referred to as your shell or ssh account, as ssh allows you to remotely log in and get a shell (terminal) on a production server.  At WMF, POSIX user accounts do not use passwords.  Instead, you log in via ssh using an ssh key pair.
 
Access to specific production servers is managed by membership of your POSIX account in specific groups, e.g. analytics-privatedata-users.
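For example, logging into one of the stat boxes goes through a bastion host (a sketch only; <code><bastion></code> is a placeholder for your assigned bastion, and stat1008 is one of the analytics clients; see [[Production access]] for the canonical setup):

<syntaxhighlight lang="bash">
# Jump through the bastion to reach a stat box on the internal network.
ssh -J <bastion>.wikimedia.org stat1008.eqiad.wmnet

# A ProxyJump entry in ~/.ssh/config makes this permanent, so that
# "ssh stat1008.eqiad.wmnet" works directly.
</syntaxhighlight>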


=== Kerberos ===
[[Analytics/Systems/Kerberos|Kerberos]] is only needed when using a distributed system like Hadoop.  You can ssh into a single production server with your POSIX account, but other production servers that you are not directly logged into have no way of knowing you are authorized to access them.  Kerberos solves this problem.  After logging into a server with ssh, you authenticate to Kerberos with <tt>kinit</tt> and your Kerberos password (this is a totally different password than your LDAP one).  Then, when using a distributed system, other servers can interact with Kerberos to determine if your access should be authorized.
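For example (a minimal sketch; <code>kinit</code> and <code>klist</code> are the standard Kerberos client commands available on the stat hosts):

<syntaxhighlight lang="bash">
# Authenticate to Kerberos; this prompts for your Kerberos password.
kinit

# Check that you now hold a valid ticket, and when it expires.
klist

# Kerberized tools such as HDFS or Hive commands will now work.
hdfs dfs -ls /wmf/data
</syntaxhighlight>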


==Infrastructure==
===Analytics clients===
The [[Analytics/Systems/Clients|analytics clients]] are servers in the production cluster where you can run your code and queries. In fact, you ''should'' use them to run all your analysis, so that sensitive data never leaves the production cluster.

They have a number of useful capabilities, from large amounts of memory to [[Analytics/Systems/Jupyter|Jupyter notebooks]].
===MariaDB===
The [[Analytics/Systems/MariaDB|Analytics MariaDB cluster]] contains copies of the production [[Mw:Manual:Database layout|MediaWiki databases]] (both actively-used mainstream projects and small internal-facing wikis, like various projects' Arbitration Committees).


=== Data Lake ===
We store large amounts of data in analysis-friendly formats in the [[Analytics/Data Lake|Data Lake]].
==Scripting access==
If you're writing some analysis code, you will probably need to access data first. There are a couple of software packages that have been developed to make this easy. Note that both of them are designed to work on the analytics clients only.


For Python, there is [https://github.com/wikimedia/wmfdata-python wmfdata]. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating custom Spark sessions.


For R, there is [https://github.com/wikimedia/wikimedia-discovery-wmf wmf]. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.


==Data sources==
Data sets and data streams can be found in [[wikitech:Category:Data_stream|Category:Data_stream]].
 
===Data dashboards: Superset and Turnilo===
* Superset: http://superset.wikimedia.org
* Turnilo: http://turnilo.wikimedia.org


You need a wikitech login that is in the "wmf" or "nda" LDAP groups. If you don't have it, please create a Phabricator task by following instructions on [[phab:tag/ldap-access-requests/]].


Before requesting access, please make sure you:
*have a functioning Wikitech login. Get one: https://toolsadmin.wikimedia.org/register/
*are an employee or contractor with the WMF, OR have signed an NDA
Depending on the above, you can request to be added to the wmf group or the nda group. Please indicate on the task why you need access, and ping the Analytics team if you don't hear any feedback soon from the Opsen on duty.


===MediaWiki application data===
You can do a lot of work with the data stored by MediaWiki in the normal course of running itself. This includes data about:


* Users' edit counts (consult the <code>user</code> table)
* Edits to a particular page (consult the <code>revision</code> table, joined with the <code>page</code> table if necessary)
* Account creations (consult the <code>logging</code> table)


====Databases====
You can access this data using the replica MariaDB databases.  These are accessible from the stat100* machines via <code>analytics-mysql <wiki-id></code>. For more details [[Analytics/Systems/MariaDB|see here]].
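For example, a quick sketch from a stat host (assuming, as the wrapper normally does, that extra arguments are passed through to the underlying <code>mysql</code> client):

<syntaxhighlight lang="bash">
# Open an interactive session against the English Wikipedia replica.
analytics-mysql enwiki

# Or run a single query non-interactively and save the result as a TSV.
analytics-mysql enwiki -e "SELECT user_name, user_editcount FROM user ORDER BY user_editcount DESC LIMIT 10;" > top_editors.tsv
</syntaxhighlight>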


For an overview of how the data is laid out in those databases, consult the [[mediawikiwiki:Manual:Database_layout|database layout manual]].  


There are a few things that aren't available from the database replicas. The main example is the actual content of pages and revisions. Instead, you can access them [[#API|through the API]] or in the XML dumps, which are both described below.


====API====
A subset of this application data, which doesn't present privacy concerns, is also publicly accessible through the API (except for ''private'' wikis, which you shouldn't really need to perform research on anyway!). A good way to understand it, and to test queries, is [[Special:ApiSandbox]], which provides a way of easily constructing API calls and testing them. The output includes "Request URL" - a direct URL for making that query in the future, that should work on any and all Wikimedia production wikis.
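For instance, this single call fetches the latest revision metadata for one page as JSON (parameters as documented on [[mw:API:Revisions]]):

<syntaxhighlight lang="bash">
# Query the English Wikipedia API for revision ids, timestamps and users.
curl "https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Main_Page&rvprop=ids|timestamp|user&format=json"
</syntaxhighlight>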


If you're interested in common API tasks, and don't feel like reinventing the wheel, there are a number of Python-based API wrappers and MediaWiki utilities. Our very own Aaron Halfaker maintains [https://pypi.python.org/pypi/mediawiki-utilities#downloads MediaWiki Utilities], which includes a module dedicated to API interactions. There's no equivalent for R yet.


====Database dumps====
Every month, [http://dumps.wikimedia.org/ XML snapshots] of the databases are generated. Since they're generated monthly, they're always slightly outdated, but make up for it by being incredibly cohesive (and [http://dumps.wikimedia.org/enwiki/20161001/ incredibly large]). They contain both the text of each revision of each page, and snapshots of the database tables. As such, they're a really good way of getting large amounts of diffs or information on revisions without running into the query limits on the API.


Aaron's [https://pypi.python.org/pypi/mediawiki-utilities#downloads MediaWiki-utilities] package contains a set of functions for handling and parsing through the XML dumps, which should drastically simplify dealing with them. They're also stored internally, as well as through dumps.wikimedia.org, and can be found in <code>/mnt/data/xmldatadumps/public</code> on stat1004, stat1005, stat1006, stat1007, and stat1008.
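For example, from one of those hosts (assuming the internal copy mirrors the directory layout of dumps.wikimedia.org; exact file names vary per dump run):

<syntaxhighlight lang="bash">
# List the available English Wikipedia dump runs on the internal copy.
ls /mnt/data/xmldatadumps/public/enwiki/

# Peek at a compressed stub-history file without decompressing it fully.
zcat /mnt/data/xmldatadumps/public/enwiki/<run-date>/enwiki-<run-date>-stub-meta-history.xml.gz | head
</syntaxhighlight>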


===EventLogging data===
One analytics-specific source of data is [[Analytics/EventLogging|EventLogging]]. This allows us to track things we're interested in as researchers that MediaWiki doesn't normally log. Examples include:


#A log of changes to user preferences;
#A/B testing data;
#Clicktracking data.


These datasets are stored in the <code>event</code> and <code>event_sanitized</code> Hive databases, subject to HDFS access control.
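For example, a sketch of querying one of these tables from a stat host after <code>kinit</code> (<code>navigationtiming</code> is one example schema; event tables are partitioned by <code>year/month/day/hour</code>):

<syntaxhighlight lang="bash">
# Count one hour of events in an example EventLogging table.
hive -e "SELECT COUNT(*) FROM event.navigationtiming WHERE year = 2021 AND month = 1 AND day = 1 AND hour = 0;"
</syntaxhighlight>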
|thwikibooks
|-
|Thai
|Wikiquote
|thwikiquote
|-
|Thai
|Wikisource
|thwikisource
|-
|Thai
|Wiktionary
|thwiktionary
|-
|Tigrinya
|Wikipedia
|tiwiki
|-
|Tigrinya
|Wiktionary
|tiwiktionary
|-
|Turkmen
|Wikipedia
|tkwiki
|-
|Turkmen
|Wiktionary
|tkwiktionary
|-
|Tagalog
|Wikipedia
|tlwiki
|-
|Tagalog
|Wikibooks
|tlwikibooks
|-
|Tagalog
|Wiktionary
|tlwiktionary
|-
|Tswana
|Wikipedia
|tnwiki
|-
|Tswana
|Wiktionary
|tnwiktionary
|-
|Tongan
|Wikipedia
|towiki
|-
|Tok Pisin
|Wikipedia
|tpiwiki
|-
|Tok Pisin
|Wiktionary
|tpiwiktionary
|-
|Turkish
|Wikibooks
|trwikibooks
|-
|Turkish
|Wikinews
|trwikinews
|-
|Turkish
|Wikiquote
|trwikiquote
|-
|Turkish
|Wikisource
|trwikisource
|-
|Turkish
|Wiktionary
|trwiktionary
|-
|Tsonga
|Wikipedia
|tswiki
|-
|Tsonga
|Wiktionary
|tswiktionary
|-
|Tatar
|Wikipedia
|ttwiki
|-
|Tatar
|Wikibooks
|ttwikibooks
|-
|Tatar
|Wiktionary
|ttwiktionary
|-
|Tumbuka
|Wikipedia
|tumwiki
|-
|Twi
|Wikipedia
|twwiki
|-
|Tuvan
|Wikipedia
|tyvwiki
|-
|Tahitian
|Wikipedia
|tywiki
|-
|Udmurt
|Wikipedia
|udmwiki
|-
|Uyghur
|Wikipedia
|ugwiki
|-
|Uyghur
|Wiktionary
|ugwiktionary
|-
|Ukrainian
|Wikibooks
|ukwikibooks
|-
|Ukrainian
|Wikinews
|ukwikinews
|-
|Ukrainian
|Wikiquote
|ukwikiquote
|-
|Ukrainian
|Wikisource
|ukwikisource
|-
|Ukrainian
|Wikivoyage
|ukwikivoyage
|-
|Ukrainian
|Wiktionary
|ukwiktionary
|-
|Urdu
|Wikipedia
|urwiki
|-
|Urdu
|Wikibooks
|urwikibooks
|-
|Urdu
|Wikiquote
|urwikiquote
|-
|Urdu
|Wiktionary
|urwiktionary
|-
|Uzbek
|Wikipedia
|uzwiki
|-
|Uzbek
|Wikiquote
|uzwikiquote
|-
|Uzbek
|Wiktionary
|uzwiktionary
|-
|Venetian
|Wikipedia
|vecwiki
|-
|Venetian
|Wikisource
|vecwikisource
|-
|Venetian
|Wiktionary
|vecwiktionary
|-
|Vepsian
|Wikipedia
|vepwiki
|-
|Venda
|Wikipedia
|vewiki
|-
|Vietnamese
|Wikibooks
|viwikibooks
|-
|Vietnamese
|Wikiquote
|viwikiquote
|-
|Vietnamese
|Wikisource
|viwikisource
|-
|Vietnamese
|Wikivoyage
|viwikivoyage
|-
|Vietnamese
|Wiktionary
|viwiktionary
|-
|West Flemish
|Wikipedia
|vlswiki
|-
|Volapük
|Wikipedia
|vowiki
|-
|Volapük
|Wiktionary
|vowiktionary
|-
|Waray-Waray
|Wikipedia
|warwiki
|-
|Walloon
|Wikipedia
|wawiki
|-
|Walloon
|Wiktionary
|wawiktionary
|-
|Wolof
|Wikipedia
|wowiki
|-
|Wolof
|Wikiquote
|wowikiquote
|-
|Wolof
|Wiktionary
|wowiktionary
|-
|Wu
|Wikipedia
|wuuwiki
|-
|Kalmyk
|Wikipedia
|xalwiki
|-
|Xhosa
|Wikipedia
|xhwiki
|-
|Mingrelian
|Wikipedia
|xmfwiki
|-
|Yiddish
|Wikipedia
|yiwiki
|-
|Yiddish
|Wikisource
|yiwikisource
|-
|Yiddish
|Wiktionary
|yiwiktionary
|-
|Yoruba
|Wikipedia
|yowiki
|-
|Zhuang
|Wikipedia
|zawiki
|-
|Zeelandic
|Wikipedia
|zeawiki
|-
|Classical Chinese
|Wikipedia
|zh_classicalwiki
|-
|Min Nan
|Wikipedia
|zh_min_nanwiki
|-
|Min Nan
|Wikisource
|zh_min_nanwikisource
|-
|Min Nan
|Wiktionary
|zh_min_nanwiktionary
|-
|Cantonese
|Wikipedia
|zh_yuewiki
|-
|Chinese
|Wikibooks
|zhwikibooks
|-
|Chinese
|Wikinews
|zhwikinews
|-
|Chinese
|Wikiquote
|zhwikiquote
|-
|Chinese
|Wikisource
|zhwikisource
|-
|Chinese
|Wikivoyage
|zhwikivoyage
|-
|Chinese
|Wiktionary
|zhwiktionary
|-
|Zulu
|Wikipedia
|zuwiki
|-
|Zulu
|Wiktionary
|zuwiktionary
|-
|NA
|Wikidata
|wikidatawiki
|-
|Persian
|Wikipedia
|fawiki
|-
|Hebrew
|Wikipedia
|hewiki
|-
|Hungarian
|Wikipedia
|huwiki
|-
|Korean
|Wikipedia
|kowiki
|-
|Romanian
|Wikipedia
|rowiki
|-
|Ukrainian
|Wikipedia
|ukwiki
|-
|Vietnamese
|Wikipedia
|viwiki
|-
|}
</div></div></noinclude>


===Hive===
Finally, we have [[Analytics/Cluster/Hive|Hive]] - our storage system for large amounts of data. Hive can be accessed from stat1002: simply type '<code>hive</code>' in the terminal, switch to the <code>wmf_raw</code> database, and input your query.

At the moment there are no recommended Hive access packages for R or Python, although we're actively investigating possible solutions. In the meantime, the best way to get data out of the system is to treat it as you would the Analytics slaves: through the terminal, type

<code>hive --database wmf_raw -e "query goes here" > file_name.tsv</code>

Again, switching out .tsv for .csv alters the format the file is saved in.

User responsibilities

In addition to the rules in the Acknowledgement of Server Access Responsibilities, keep in mind the following important principles:

  • Be paranoid about personally identifiable information (PII). Familiarize yourself with the data you are working on and determine whether it contains any PII. It's better to double- and triple-check than to assume anything; if you have any doubt, ask the Analytics team (via IRC, email, or Phabricator). Please see the data retention guidelines.
  • Don't copy sensitive data (for example, data accessible only by users in the analytics-privatedata-users group) from its origin location to anywhere else (in HDFS or on any other host or storage medium) unless strictly necessary. Most importantly, do it only if you know what you are doing. If you are in doubt, please reach out to the Analytics team first.
  • Restrict access. If you do need to copy sensitive data somewhere, make sure that you are the only one able to access it. For example, if you copy Webrequest data from its location on HDFS to your /user/$your-username directory, make sure the permissions are set so that not everyone with access to HDFS can read the data. This is essential to avoid accidental leaks of PII/sensitive data or retention beyond our guidelines (https://meta.wikimedia.org/wiki/Data_retention_guidelines).
  • Clean up copies of data. Please make sure that any data you copied is deleted as soon as your work is done. A minimal sketch of these last two steps follows this list.
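
For example, here is a minimal shell sketch of the "restrict access" and "clean up" steps, assuming a hypothetical scratch directory on HDFS (hdfs dfs is the standard Hadoop filesystem CLI):

  # restrict a copied dataset so that only you can read it (hypothetical path)
  hdfs dfs -chmod -R go-rwx /user/$USER/webrequest_scratch
  hdfs dfs -ls -d /user/$USER/webrequest_scratch    # verify the new permissions
  # ...and once your analysis is done, delete the copy right away
  hdfs dfs -rm -r -skipTrash /user/$USER/webrequest_scratch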

If you ever have any questions or doubts, err on the side of caution and contact the Analytics team. We are very friendly and happy to help!

Requesting access

If after reading the above you do need access to WMF analytics data and/or tools, you'll need to submit a request on Phabricator and add the project tag SRE-Access-Requests: Follow the steps at Production access#Access Request Process.

If you already have access and you only need to get kerberos credentials, it is sufficient to create a task with the project tag Analytics: Create a ticket requesting kerberos credentials.

Read the following sections to figure out what access levels you should request in your ticket.

Please follow the Production access request instructions for any of the access types. We need a paper trail and a standard form in order to keep track of requests and understand why they are happening. When submitting the Phabricator ticket, you may edit the description to match the request you are making. E.g. if you don't need SSH access, you don't need to provide an SSH key.

Access Levels

There are a few varying levels and combinations of access that we support.

'analytics-*' groups have access to the Analytics Cluster (which mostly means Hadoop) and to the stat* servers for local (non-distributed) compute resources. These groups overlap in which servers they grant ssh access to, but further posix permissions restrict access to things like MySQL, Hadoop, and files.

  • LDAP membership in the wmf or nda LDAP group allows you to log in and authenticate via web tools like Superset and Turnilo.
  • Shell (posix) membership in the `analytics-privatedata-users` group allows you to read private data stored in tools like Hadoop, Hive, and Presto.
  • An ssh key for your shell user allows you to ssh into the analytics client servers (AKA stat boxes) and access tools like Jupyter (which also requires LDAP membership).
  • A Kerberos principal allows you to access data in Hadoop directly.
  • Team-specific shell (posix) group membership allows management of team-specific jobs and data.

This might all be confusing if you are just trying to figure out what to put in your Phabricator SRE-Access-Requests ticket. Here are a few common use cases of what you might be trying to request.

What access should I request?

If you need access to...

Dashboards in web tools like Turnilo and/or Superset that do not access private data

  • LDAP membership in the wmf or nda LDAP group.

Dashboards in Superset / Hive interfaces (like Hue) that do access private data

  • LDAP membership in the wmf or nda LDAP group.
  • Shell (posix) membership in the `analytics-privatedata-users` group

Note to SREs granting this access: This can be done by declaring the user in Puppet as usual, but with an empty array of ssh_keys.

ssh login to analytics client servers (AKA stat boxes) without Hadoop, Hive, Presto access

This is a rare need, but you might want it if you just want to use a GPU on a stat box or need access to the MediaWiki analytics MariaDB instances.

  • LDAP membership in the wmf or nda LDAP group.
  • Shell (posix) membership in the `analytics-privatedata-users` group
  • An ssh key for your shell user

ssh login to analytics client servers (AKA stat boxes) with Hadoop, Hive, Presto access

  • LDAP membership in the wmf or nda LDAP group.
  • Shell (posix) membership in the `analytics-privatedata-users` group
  • An ssh key for your shell user
  • A Kerberos principal

All of the above

If you are a WMF engineer wanting to work with analytics data, most likely you'll want all of these access levels together:

  • LDAP membership in the wmf or nda LDAP group.
  • Shell (posix) membership in the `analytics-privatedata-users` group
  • An ssh key for your shell user
  • A Kerberos principal

If needed for work on your team, you may also want team-specific shell (posix) group membership (see below).

Analytics shell (posix) groups explained

Generic data access (can go together with the Team specific ones)

analytics-privatedata-users (no kerberos, no ssh)

The Analytics team offers various UIs to fetch data from Hadoop, like Turnilo and Superset. They are both guarded by CAS authentication (requiring the user to be in either the wmf or the nda LDAP group) and fetch data from Druid (currently not authenticated). Superset is also able to fetch data from Hadoop/Hive on behalf of the logged-in user via a (read-only) tool called Presto. There are two use cases:

  • SQL Lab panel: the user can run SQL-like queries on Hadoop datasets (pageviews, events, etc.) without needing to log in on a stat100x host.
  • Dashboards: data visualized in dashboards is fetched from Hadoop.

In both cases, Superset works on behalf of the user, so the username eventually needs to hold read permissions on the Hadoop data in order to correctly visualize what is requested. This is guaranteed by membership in analytics-privatedata-users, which gets deployed to the Hadoop master nodes (without ssh access) to define user permissions on HDFS. This is why some users might want to be in the group without either kerberos or ssh.

Additionally the user needs to be added to the "wmf" LDAP group. Make sure to add them (if you are an SRE) or mention it on the ticket (if you are the requestor).

analytics-privatedata-users (no kerberos)

Grants access to the analytics clients, GPUs and to MariaDB replicas (using the credentials at /etc/mysql/conf.d/analytics-research-client.cnf).

analytics-privatedata-users (with kerberos)
Grants access to all the analytics clients, the analytics cluster (Hadoop/Hive) and the private data hosted there, and to MariaDB replicas, using the credentials at /etc/mysql/conf.d/analytics-research-client.cnf.
Users in this group also need a Kerberos authentication principal. If you're already a group member and don't have one, follow the instructions in the Kerberos user guide. If you're requesting membership in this group, the SRE team will create this for you when they add you to the group.

The list of users currently in each group is available in this configuration file.[1]

Team specific (they do not grant access to PII data on Hadoop, for that see analytics-privatedata-users)

analytics-wmde-users
For Wikimedia Deutschland employees, mostly used for crons running automation jobs as the analytics-wmde system user. Grants access to all stat100x hosts, to the MariaDB replicas via /etc/mysql/conf.d/research-wmde-client.cnf, and to the analytics-wmde system user. Not every WMDE user needs to be in this group; only those who need to take care of the aforementioned automation require access (and will ask for it explicitly).
analytics-search-users
For members of the Wikimedia Foundation Search Platform team, used for various Analytics-Search jobs. Grants access to all stat100x hosts, to an-airflow1001, and to the analytics-search system user.
analytics-product-users
For members of the Product Analytics team, used for various analytics jobs. Grants access to all stat100x hosts, and to the analytics-product system user.
analytics-research-users
For members of the Research team, used for various jobs. Grants access to all stat100x hosts, an Airflow instance, and to the analytics-research system user.
analytics-platform-eng-users
For members of the Platform Engineering team, used for various jobs. Grants access to all stat100x hosts, an Airflow instance, and to the analytics-platform-eng system user.

Groups to avoid (deprecated)

researchers
analytics-users

Host access granted

There used to be many differences in which hosts the various Analytics POSIX groups had access to, but these no longer exist: the groups now grant access to the same hosts.

Data access granted

Access group                | Hadoop access (no private data) | Hadoop access (private data) | Mariadb credentials                         | System user           | Other
analytics-privatedata-users | yes                             | yes                          | analytics-research-client.cnf               | analytics-privatedata |
analytics-wmde-users        |                                 |                              | research-wmde-client.cnf (only on stat1007) | analytics-wmde        |
analytics-search-users      |                                 |                              |                                             |                       | Airflow admin
analytics-product-users     |                                 |                              |                                             | analytics-product     |

Shell access expiration

Data access is given to collaborators and contractors with a time limit. Normally the end date is set to the contract or collaboration end date. For staff, data access terminates upon employment termination unless there is a collaboration in place.

Once a user's access is terminated, their home directory is deleted. If the team wishes to preserve some of the user's work (work, not data: data has strict guidelines for deletion), it can be archived to Hadoop. Please file a Phabricator ticket to have this done. Archival to Hadoop happens in the following directory:

/wmf/data/archive/user/<username>

LDAP access

Some Analytics systems, including Superset, Turnilo, and Jupyter, require a developer account in the wmf or nda LDAP groups for access.

If you need this access, first make sure you have a working developer account (if you can log into this wiki, you have one). If you need one, you can create one at mw:Developer_account.

Note that a developer account comes with two different usernames; some services need one and some services need the other. You can find both by logging into this wiki and visiting the "user profile" section of Special:Preferences. Your Wikitech username is listed under "Username", while your developer shell username is listed under "Instance shell account name". Thankfully, there's only one password!

Then, create a Phabricator task: read and follow the instructions for LDAP-access-requests to request being added to the appropriate group. Make sure you include both your usernames.

Note that this access has similar requirements to shell access: you will need to either be a Wikimedia Foundation employee or have a signed volunteer NDA.

Accounts and passwords explained: LDAP/Wikitech/MW Developer vs shell/ssh/posix vs Kerberos

There are too many different accounts and passwords one has to deal with in order to access analytics systems. For now it's what we've got. Let's try to explain them all explicitly.


tl;dr

  • LDAP AKA Wikitech AKA Mediawiki Developer accounts are the same. There are 2 usernames for this account, but only one password.
  • POSIX AKA shell AKA ssh accounts are the same. The username is the same as your 'shell username' for your LDAP account. There is no password, only an ssh key pair.
  • Kerberos uses your shell username and a separate Kerberos account password, and grants you access to distributed systems like Hadoop.

LDAP

LDAP is used mostly for web logins. An LDAP account has 2 usernames, the 'Wikitech' username and the shell username, as described above. The password for these is the same. Since LDAP account creation is handled by Mediawiki and also allows you to log into Wikitech (this wiki), LDAP accounts are sometimes referred to as your 'Wikitech' account or your 'Mediawiki developer account'. These terms all mean the same thing.

Analytics web UIs (like Jupyter, Turnilo, Superset, etc.) require that you have an LDAP account in specific groups; membership in these groups authorizes access.

POSIX

To log into a production server, you need an explicit POSIX shell account created for you. This is handled by SRE. POSIX user accounts are often also referred to as your shell or ssh account, as ssh allows you to remote login and get a shell (terminal) on a production server. At WMF, POSIX user accounts do not use passwords. Instead, you login via ssh using an ssh key pair.

Access to specific production servers is managed by membership of your POSIX account in specific groups, e.g. analytics-privatedata-users.
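
For illustration, reaching a stat box goes through a bastion host; a minimal sketch, with example hostnames that may differ from your setup (see the production access documentation for the current bastions):

  # jump through a bastion to an analytics client
  ssh -J bast1003.wikimedia.org stat1007.eqiad.wmnet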

Kerberos

Kerberos is only needed when using a distributed system like Hadoop. You can ssh into a single production server with your POSIX account, but other production servers that you are not directly logged into have no way of knowing you are authorized to access them. Kerberos solves this problem. After logging into a server with ssh, you authenticate to Kerberos with kinit and your Kerberos password (this is a totally different password than your LDAP one). Then, when using a distributed system, other servers can interact with Kerberos to determine if your access should be authorized.
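
For example, a typical session on a stat box might start like this (kinit and klist are the standard Kerberos command-line tools; the HDFS path is just an example):

  kinit          # prompts for your Kerberos password (not your LDAP password)
  klist          # confirm that you now hold a valid ticket
  hdfs dfs -ls /wmf/data    # Hadoop commands now authenticate via Kerberos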

Infrastructure

Analytics clients

The analytics clients are servers in the production cluster where you can run your code and queries. In fact, you should use them to run all your analysis, so that sensitive data never leaves the production cluster.

They have a number of useful capabilities, from large amounts of memory to Jupyter notebooks.

MariaDB

The Analytics MariaDB cluster contains copies of the production MediaWiki databases (both actively-used mainstream projects and small internal-facing wikis, like those of various projects' Arbitration Committees).

Data Lake

We store large amounts of data in analysis-friendly formats in the Data Lake.

Scripting access

If you're writing some analysis code, you will probably need to access data first. There are a couple of software packages that have been developed to make this easy. Note that both of them are designed to work on the analytics clients only.

For Python, there is wmfdata. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating custom Spark sessions.

For R, there is wmf. It can access data from MariaDB and Hive and has many other useful functions, particularly for graphing and statistics.

Data sources

Data sets and data streams can be found in Category:Data_stream

Data dashboards: Superset and Turnilo

Superset: http://superset.wikimedia.org
Turnilo: http://turnilo.wikimedia.org

You need a wikitech login that is in the "wmf" or "nda" LDAP groups. If you don't have it, please create a Phabricator task by following instructions on phab:tag/ldap-access-requests/.

Before requesting access, please make sure you meet the requirements described above (being a Wikimedia Foundation employee or covered by a signed NDA).

Depending on the above, you can request to be added to the wmf group or the nda group. Please indicate on the task why you need access, and ping the analytics team if you don't hear any feedback soon from the Opsen on duty.

MediaWiki application data

You can do a lot of work with the data stored by MediaWiki in the normal course of running itself. This includes data about:

  • Users' edit counts (consult the user table)
  • Edits to a particular page (consult the revision table, joined with the page table if necessary)
  • Account creations (consult the logging table)

Databases

You can access this data using the replica MariaDB databases. These are accessible from the stat100* machines via analytics-mysql <wiki-id>. For more details see here.
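
As a sketch, a session inspecting English Wikipedia might look like this (user_id and user_editcount are standard columns of the MediaWiki user table):

  # open a client connection to the English Wikipedia replica
  analytics-mysql enwiki
  # then, at the MySQL prompt, e.g. the ten accounts with the highest edit counts:
  SELECT user_id, user_editcount FROM user ORDER BY user_editcount DESC LIMIT 10;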

For an overview of how the data is laid out in those databases, consult the database layout manual.

There are a few things that aren't available from the database replicas. The main example is the actual content of pages and revisions. Instead, you can access them through the API or in the XML dumps, both described below.

API

A subset of this application data, which doesn't present privacy concerns, is also publicly accessible through the API (except for private wikis, which you shouldn't really need to perform research on anyway!). A good way to understand it, and to test queries, is Special:ApiSandbox, which provides a way of easily constructing API calls and testing them. The output includes "Request URL" - a direct URL for making that query in the future, that should work on any and all Wikimedia production wikis.
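
For example, the sandbox can produce a Request URL like the following, which fetches the timestamp and user of the latest revision of one article (standard MediaWiki API parameters):

  curl "https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Earth&rvprop=timestamp%7Cuser&format=json"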

If you're interested in common API tasks, and don't feel like reinventing the wheel, there are a number of Python-based API wrappers and MediaWiki utilities. Our very own Aaron Halfaker maintains MediaWiki Utilities, which includes a module dedicated to API interactions. There's no equivalent for R yet.

Database dumps

Every month, XML snapshots of the databases are generated. Since they're generated monthly, they're always slightly outdated, but make up for it by being incredibly cohesive (and incredibly large). They contain both the text of each revision of each page, and snapshots of the database tables. As such, they're a really good way of getting large amounts of diffs or information on revisions without running into the query limits on the API.

Aaron's MediaWiki-utilities package contains a set of functions for handling and parsing the XML dumps, which should drastically simplify dealing with them. The dumps are also stored internally, as well as being available through dumps.wikimedia.org, and can be found in /mnt/data/xmldatadumps/public on stat1004, stat1005, stat1006, stat1007, and stat1008.
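
For example, to locate the English Wikipedia dumps on a stat box (assuming the internal directory mirrors the per-wiki layout of dumps.wikimedia.org, with one subdirectory per dump date):

  ls /mnt/data/xmldatadumps/public/enwiki/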

EventLogging data

One analytics-specific source of data is EventLogging. This allows us to track things we're interested in as researchers that MediaWiki doesn't normally log. Examples include:

  1. A log of changes to user preferences;
  2. A/B testing data;
  3. Clicktracking data.

These datasets are stored in the event and event_sanitized Hive databases, subject to HDFS access control.
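
For example, from a stat box (with a valid Kerberos ticket) you can list the available event tables and sample one; the table name below is hypothetical, and you should always constrain the time partitions:

  hive --database event -e 'SHOW TABLES;'
  hive --database event -e 'SELECT * FROM someschema WHERE year=2022 AND month=5 AND day=1 LIMIT 10;'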

Pageviews data

An important piece of community-facing data is information on our pageviews; what articles are being read, and how much? This is currently stored in our Hadoop cluster, which contains aggregated pageview data as well as the mostly-raw database of web requests. See the detailed documentation here.
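
As a sketch, here is a Hive query against the aggregated pageview data (the pageview_hourly table in the wmf database; column names as per its documentation):

  hive --database wmf -e "
    SELECT page_title, SUM(view_count) AS views
    FROM pageview_hourly
    WHERE year=2022 AND month=5 AND day=1 AND project='en.wikipedia'
    GROUP BY page_title
    ORDER BY views DESC
    LIMIT 10;" > top_pages_2022-05-01.tsv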

Turnilo

See Analytics/Systems/Turnilo-Pivot#Access.

Geolocation data

When you have IP addresses - be they from the RequestLogs, EventLogging or MediaWiki itself - you can do geolocation. This can be a very useful way of understanding user behaviour and evaluating how our ecosystem works. We currently use the MaxMind geolocation services, which are accessible on stat boxes: a full guide to geolocation and some examples of how to do it can be found on the 'geolocation' page.
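
As a quick orientation, the MaxMind database files usually live in the conventional location on the stat boxes (treat this path as an assumption and check the geolocation page for the authoritative details):

  ls /usr/share/GeoIP/    # e.g. GeoIP2-City.mmdb, GeoIP2-Country.mmdb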

Notes

  1. Other groups including statistics-admins, analytics-admins, eventlogging-admins, and statistics-web-users are for people doing system maintenance and administration, so you don't need them just to access data.