You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Analytics/Data Lake/Edits/Geoeditors/Public

From Wikitech
Revision as of 21:22, 7 November 2019 by imported>Krinkle

This is a public version of the Geoeditors Monthly dataset. It reports the number of active editors per country per month for a set number of countries.

Data Format

The data is released monthly as a flat file, available here. The file has the following columns:

  • wiki db: the code name for the wiki, for example "enwiki" for English Wikipedia or "commonswiki" for Wikimedia Commons.
  • country: the name of the country with editors of this wiki
  • activity level: how many edits each editor in this group has made in the past month (either 5 to 99, or 100 or more)
  • lower bound: at least this many editors in this group
  • upper bound: at most this many editors in this group
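As a rough illustration of how the columns above might be read, here is a minimal sketch. It assumes the flat file is tab-separated, has no header row, and lists the columns in the order shown; none of that is stated on this page, so adjust the delimiter and header handling to the actual file. The names GeoeditorsRow and parse_geoeditors are hypothetical, and the sample rows are made up.

```python
import csv
import io
from typing import NamedTuple


class GeoeditorsRow(NamedTuple):
    wiki_db: str         # e.g. "enwiki"
    country: str         # country name
    activity_level: str  # activity bucket label, kept as text
    lower_bound: int     # at least this many editors
    upper_bound: int     # at most this many editors


def parse_geoeditors(text: str, delimiter: str = "\t") -> list[GeoeditorsRow]:
    """Parse headerless rows in the column order listed above (an assumption)."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not record:
            continue  # skip blank lines
        wiki_db, country, activity_level, lower, upper = record
        rows.append(
            GeoeditorsRow(wiki_db, country, activity_level, int(lower), int(upper))
        )
    return rows


# Made-up sample rows, for illustration only (not real data).
sample = "etwiki\tRomania\t5 to 99\t1\t10\nenwiki\tFrance\t5 to 99\t11\t20\n"
rows = parse_geoeditors(sample)
```

Keeping the bounds as integers makes it easy to sum lower and upper bounds separately when aggregating across countries or wikis.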

Privacy

Because this data raises several privacy concerns, this public release applies the following changes so that the data reveals less while still providing value to a public audience:

Blacklist of countries

We are not releasing data for countries identified by independent organizations as dangerous for journalists or for internet freedom. Each year we will review lists such as those published by Reporters Without Borders and Freedom House's Freedom on the Net report, and combine the lowest-rated countries into a blacklist. For example, this is the list for 2019.
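The yearly step described above can be sketched as a set union followed by a filter. This is only an illustration: the function names build_blacklist and drop_blacklisted are hypothetical, the rows are modeled as plain dictionaries, and the country names are placeholders rather than entries from any real list.

```python
def build_blacklist(*source_lists):
    """Combine the lowest-rated countries from each source list into one set."""
    blacklist = set()
    for countries in source_lists:
        blacklist.update(countries)
    return blacklist


def drop_blacklisted(rows, blacklist):
    """Keep only rows whose country is not on the blacklist."""
    return [row for row in rows if row["country"] not in blacklist]


# Placeholder inputs, for illustration only.
list_one = ["Country A", "Country B"]
list_two = ["Country B", "Country C"]
blacklist = build_blacklist(list_one, list_two)

rows = [
    {"wiki_db": "enwiki", "country": "Country A"},
    {"wiki_db": "enwiki", "country": "Country D"},
]
public_rows = drop_blacklisted(rows, blacklist)
```

Taking the union means a country is excluded if it appears on any one of the source lists.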

No exact counts

To add a small amount of imprecision to the data, we report ranges instead of exact counts: rather than saying that there are 5 editors editing Estonian Wikipedia from Romania, we say there are between 1 and 10. This does not dramatically improve the privacy of the dataset, but it adds a small amount of uncertainty for anyone trying to guess the country of an editor. The amount of uncertainty depends not on the bucket size but on the number of countries that have editors for a given project.
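The replacement of exact counts by ranges can be sketched as follows. The only bucket this page confirms is "between 1 and 10"; the sketch assumes that every bucket has the same width of 10 (1 to 10, 11 to 20, and so on), which may not match the real release. The function name bucket_bounds is hypothetical.

```python
def bucket_bounds(count: int, width: int = 10) -> tuple[int, int]:
    """Return the (lower, upper) bounds of the bucket containing count.

    Assumes fixed-width buckets starting at 1: 1-10, 11-20, ...
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    lower = ((count - 1) // width) * width + 1
    return lower, lower + width - 1


# The example from the text: 5 editors are reported as "between 1 and 10".
example = bucket_bounds(5)
```

These two values correspond to the lower bound and upper bound columns of the released file.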

Only active wikis

We are only releasing data for wikis with at least 3 active editors in any given month, that is, three distinct editors who each make 5 or more edits in that month. Past research indicates that less activity than that can't support the healthy collaboration and exchange of ideas essential to wikis.
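A minimal sketch of this check, assuming one month of edits is available as (wiki db, editor id) pairs with one pair per edit. That input shape and the name releasable_wikis are assumptions made for illustration; the thresholds come from the paragraph above.

```python
from collections import Counter

ACTIVE_EDIT_THRESHOLD = 5  # edits per month for an editor to count as active
MIN_ACTIVE_EDITORS = 3     # active editors a wiki needs to be released


def releasable_wikis(edits):
    """Return the set of wikis with enough active editors in one month.

    edits: iterable of (wiki_db, editor_id) pairs, one pair per edit.
    """
    edits_per_editor = Counter(edits)
    active_editors_per_wiki = Counter(
        wiki
        for (wiki, _editor), n_edits in edits_per_editor.items()
        if n_edits >= ACTIVE_EDIT_THRESHOLD
    )
    return {
        wiki
        for wiki, n_active in active_editors_per_wiki.items()
        if n_active >= MIN_ACTIVE_EDITORS
    }


# Made-up month of edits: "wiki_a" has three editors with 5 edits each;
# "wiki_b" has two such editors plus one with only 4 edits.
edits = (
    [("wiki_a", e) for e in ("e1", "e2", "e3") for _ in range(5)]
    + [("wiki_b", e) for e in ("e1", "e2") for _ in range(5)]
    + [("wiki_b", "e3")] * 4
)
result = releasable_wikis(edits)
```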

Risk Assessment

Initial Risk: Medium

Mitigations: Aggregation, Blacklist

Residual Risk: Low

The Wikimedia Foundation has developed a process for reviewing datasets prior to release in order to determine a privacy risk level, appropriate mitigations, and a residual risk level. WMF takes privacy very seriously, and seeks to be as transparent as possible while still respecting the privacy of our readers and editors.

Our Privacy Risk Review process first documents the anticipated benefits of releasing a dataset. Because we feel transparency is so crucial to free information, generally WMF takes a release-by-default approach - that is, release unless there is a compelling reason not to. Often, however, there are additional reasons for releasing a particular dataset, such as supporting research. We want to capture those reasons and account for them.

Second, WMF identifies populations that might be impacted by the release of a dataset. We also specifically identify potential impacts to particularly vulnerable populations, such as political dissidents, ethnic minorities, and religious minorities.

Next, we catalog potential threat actors, such as organized crime, data aggregators, or other malicious actors that might seek to violate a user’s privacy. We work to identify the potential motivations of these actors and the populations they may target.

Finally, we analyze the Opportunity, Ease, and Probability of action by a threat actor against a potential target, along with the Magnitude of privacy harm to arrive at an initial risk score. Once we have identified our initial risks, we develop a mitigation strategy to minimize the risks we can, resulting in a residual (or post-mitigation) risk level.

WMF does not publish this information because we do not want to motivate threat actors or give them additional ideas for potential abuse of data. Unlike a published security vulnerability in code, which can be patched, a publicly released dataset cannot be “patched” - it has already been made public.

Any dataset that contains this notice has been reviewed using this process.