
Analytics/Data Lake/Traffic/referrer daily

Revision as of 15:48, 29 April 2021 by imported>Isaac Johnson (add related datasets)

The table referrer_daily (available in the wmf database on Hive) contains pre-aggregated daily counts of how many Wikipedia pageviews were referred from common search engines. The data is split by country, language edition, browser family, and OS family. Because this table contains sensitive geographic information, a privacy threshold of 500 is enforced: any set of facets (search engine, country, language, OS family, browser family) that did not refer at least 500 pageviews is reported as other. This keeps the overall search engine referral counts accurate and complete while reducing privacy risks.
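The thresholding described above can be illustrated with a minimal sketch. This is not the production job. It assumes that every facet of a suppressed combination is replaced by the literal string other and that the suppressed counts are summed into that single bucket. The function name apply_threshold and the dict-based row format are hypothetical.

```python
from collections import defaultdict

THRESHOLD = 500  # privacy threshold stated in the text
FACETS = ("search_engine", "country", "lang", "os_family", "browser_family")


def apply_threshold(rows, threshold=THRESHOLD):
    """Collapse facet combinations below the threshold into one 'other' bucket.

    rows: iterable of dicts holding the FACETS keys plus 'num_referrals',
    all for a single day. Assumption: a suppressed combination has every
    facet replaced by the string 'other'.
    """
    # First, total the referrals for each distinct facet combination.
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[f] for f in FACETS)
        totals[key] += row["num_referrals"]

    # Then move every combination below the threshold into the 'other' bucket.
    published = defaultdict(int)
    for key, count in totals.items():
        if count < threshold:
            key = ("other",) * len(FACETS)
        published[key] += count
    return dict(published)
```

Under these assumptions the sum over all published rows equals the sum over the raw rows, which is the property the text describes as retaining complete counts.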

Schema

hive (default)> DESCRIBE wmf.referrer_daily;

# col_name	        data_type	        comment
  country             	string              	Reader country per IP geolocation
  lang                	string              	Wikipedia language -- e.g., en for English
  browser_family      	string              	Browser family from user-agent
  os_family           	string              	OS family from user-agent
  search_engine       	string              	One of ~20 standard search engines (e.g., Google)
  num_referrals       	int                 	Number of pageviews from the referral source
  year                	int                 	Unpadded year of request
  month               	int                 	Unpadded month of request
  day                 	int                 	Unpadded day of request  

# Partition Information
# col_name            	data_type           	comment
  year                	int                 	Unpadded year of request
  month               	int                 	Unpadded month of request
  day                 	int                 	Unpadded day of request
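Because the partition columns are unpadded integers, a query for a single day filters on year, month, and day as numbers rather than on a date string. The sketch below builds such a HiveQL query as a Python string. The helper names partition_predicate and top_engines_query are hypothetical, and how the query is submitted to Hive is left out.

```python
from datetime import date


def partition_predicate(d):
    """Return a HiveQL predicate selecting the partition for one day.

    date attributes are plain ints, so the values come out unpadded
    (month = 4, not month = 04), matching the schema comments.
    """
    return f"year = {d.year} AND month = {d.month} AND day = {d.day}"


def top_engines_query(d, limit=10):
    """Build a query ranking search engines by referred pageviews on day d."""
    return (
        "SELECT search_engine, SUM(num_referrals) AS referrals\n"
        "FROM wmf.referrer_daily\n"
        f"WHERE {partition_predicate(d)}\n"
        "GROUP BY search_engine\n"
        "ORDER BY referrals DESC\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(top_engines_query(date(2021, 4, 1)))
```

Filtering on all three partition columns restricts the scan to a single daily partition.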

Availability

Beyond Hive, the data is available in the following places.

Stat machines

Any stat machine with access to Hadoop can access daily TSV dumps of the data at /mnt/hdfs/wmf/data/archive/referrer/daily.
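A TSV dump can be read and re-aggregated with the standard library alone. The sketch below is illustrative only: it assumes the TSV columns follow the order of the non-partition columns in the Hive schema, which this page does not confirm, and it skips any row whose count field is not an integer (for example a header line, if one exists). The file naming inside the archive directory is not documented here, so the reader takes any iterable of lines, such as an open file handle. The names read_referrals and total_by are hypothetical.

```python
import csv
from collections import Counter

# Assumed column order: the non-partition columns of wmf.referrer_daily.
COLUMNS = (
    "country",
    "lang",
    "browser_family",
    "os_family",
    "search_engine",
    "num_referrals",
)


def read_referrals(lines):
    """Yield one dict per TSV row, with num_referrals converted to int."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        row = dict(zip(COLUMNS, fields))
        count = row.get("num_referrals", "")
        if not count.isdigit():
            continue  # header line or malformed row
        row["num_referrals"] = int(count)
        yield row


def total_by(rows, facet):
    """Sum num_referrals over a single facet, e.g. 'country' or 'lang'."""
    totals = Counter()
    for row in rows:
        totals[row[facet]] += row["num_referrals"]
    return totals
```

For example, total_by(read_referrals(handle), "country") gives per-country totals for one dump, where handle is a file opened from the archive directory; passing "lang" instead aggregates by language edition.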

Dashboard

Because this data has many orthogonal facets (for example, one person may want to aggregate by country while another may want to aggregate by language), it is also made available via a prototype public Turnilo instance. See the Dashboard main page for more information.

Privacy

For more details and discussion around the privacy risks of this dataset, see task T270140.

See also