You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
Analytics/Data Lake/Edits/Geoeditors
This is a private dataset that shows, in aggregate, what countries editors edit from. It is available only on the private cluster for now as we work out how to publish it for broader use. You can access this dataset on Superset or in the wmf.geoeditors_monthly
Hive table.
For details on the infrastructure that produces this data source, see Analytics/Systems/Geoeditors.
Data
col_name | data_type | comment |
---|---|---|
wiki_db | string | The wiki database the editors worked in |
country_code | string | The 2-letter ISO country code this group of editors geolocated to, including Unknown (--) |
users_are_anonymous | boolean | Whether or not this group of editors edited anonymously |
activity_level | string | How many edits this group of editors performed
possible values: "1 to 4", "5 to 99", "100 or more" |
distinct_editors | bigint | Number of editors meeting this activity level |
namespace_zero_distinct_editors | bigint | Number of editors meeting this activity level, with only namespace zero edits |
month | string | [partition] The month in YYYY-MM format |
NOTE: while users are grouped by their user_id
, which is a solid grouping, anonymous users are grouped by a hash of their User Agent and IP. This can still group multiple editors as one, but it should be rare enough that the numbers are indicative of anonymous editing patterns.
From this table, you can find the number of editors, with different activity levels, anonymous or logged-in, in a specific project, editing from a specific country. An example query gets all this data for French Wikipedia in March 2018:
select
country_code,
users_are_anonymous,
activity_level,
sum(distinct_editors) as editor_count
from wmf.geowiki_monthly
where month='2018-03' and wiki_db = 'frwiki'
group by country_code, users_are_anonymous, activity_level;
Access
The easiest way to get to this data is by using Superset, a UI that you can use to access most of our datasources. The example geowiki dashboard is at https://superset.wikimedia.org/superset/dashboard/9. As an example, here's a map generated from the data:
Generation
To compile this dataset, we import the cu_changes
table and filter out all the entries for administrative actions. This result is available as the wmf_raw.mediawiki_private_cu_changes
table.
We then aggregate at the day and user level, labeling anonymous users and generate the wmf.geowiki_daily
table. Data in these two raw tables is available for 60 days on the cluster, because it directly includes the country of individual editors. From these tables, we them aggregate at the monthly level and get the wmf.geowiki_monthly
table. The schemas for the raw tables:
wmf.geowiki_daily
wiki_db string 'The wiki database of origin', country_code string 'The 2-letter ISO country code this group of edits geolocated to, including Unknown (--)', user_fingerprint_or_id string 'If an anonymous user, this is a hash of the IP + UA, otherwise it is their user id in this wiki db', user_is_anonymous boolean 'Whether or not this user edited this group of edits anonymously', date string 'The YYYY-MM-DD date for this group of edits', edit_count bigint 'The total count of edits for this grouping', namespace_zero_edit_count bigint 'The total count of edits to namespace zero for this grouping'
wmf_raw.mediawiki_private_cu_changes
cuc_id bigint 'Primary key, artificial', cuc_namespace bigint 'When pages are renamed, their RC entries do _not_ change.', cuc_title string 'When pages are renamed, their RC entries do _not_ change.', cuc_user bigint 'user_id, rev_user, the id of the user performing the action', cuc_user_text string 'user_text, rev_user_text, the name of the user at the time', cuc_actiontext string 'Unknown, undocumented', cuc_comment string 'The revision comment', cuc_minor boolean 'Whether this was a minor edit', cuc_page_id bigint 'The id of the page being edited', cuc_this_oldid bigint 'rev_id of the revision represented by this change', cuc_last_oldid bigint 'rev_id of the previous revision on this page', cuc_type int 'see https://www.mediawiki.org/wiki/Manual:Recentchanges_table#rc_type', cuc_timestamp string 'mediawiki-formatted timestamp of this event', cuc_ip string 'clear-text IP address of the user responsible for this event', cuc_agent string 'clear-text user agent of the user responsible for this event'
Data loss
- 2017: November 13th to November 20th, data not computed while the old database was decommissioned. Data will be lost if it's not re-computed by December 13th. As of this writing, we do not have capacity to save this data.
- 2017: October 25th to November 4th, data not computed while changing database hosts. Data will be lost if it's not re-computed by November 24th (as of this writing, a few days have already been lost)