Analytics/Data Lake/Traffic/referrer daily/Dashboard

Revision as of 16:08, 29 April 2021 by Isaac Johnson (draft documentation)

The referrer_daily dataset has many facets, and there are many ways in which someone might want to aggregate or split the data. Turnilo is currently the best solution at Wikimedia for visualizing data with these properties, but our existing Turnilo instance is designed for private datasets, so a public instance was created to share this dataset more broadly. Some technical details are given below. The dashboard can be found at: https://turnilo-public.wmcloud.org/

Turnilo instance

The Turnilo dashboard is hosted on a Cloud VPS instance by the Wikimedia Research team. The code for setting up the instance can be found here: https://github.com/wikimedia/research-api-endpoint-template/tree/turnilo

Data backend

Turnilo generally depends on a Druid database backend, but it also supports flat files in JSON, TSV, or CSV format. Given the relatively small size of the dataset, this instance uses a JSON flat-file backend for simplicity. The trade-off is that the Turnilo instance must be restarted each time the data is updated (daily), but startup is quick, and this is currently considered simpler than building a public Druid database endpoint.

Updating

Updates are currently handled manually, but an automated pipeline that runs daily will be built as follows:

  • Extract the new TSV from HDFS
  • Reformat the data to match Turnilo's expected format and append it to a single flat file
  • Export the data to the Cloud VPS instance
  • Restart the Turnilo endpoint with the updated data backend
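
The reformat-and-append step above could be sketched roughly as follows. This is a minimal illustration and not the pipeline's actual code. It assumes the extracted TSV has a header row, that the flat file holds one JSON object per line, and that each record needs a time field. The column names (country, search_engine, num_referrals), the time field name, and the function names are all hypothetical. Extracting from HDFS, exporting to Cloud VPS, and restarting Turnilo are left out because they depend on the deployment.

```python
import csv
import io
import json


def tsv_to_json_lines(tsv_text, day):
    """Convert header-prefixed TSV text into JSON strings, one per data row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        record = dict(row)
        # Assumption: the count column is called "num_referrals".
        # Cast it to int so it can be summed as a measure.
        record["num_referrals"] = int(record["num_referrals"])
        # Assumption: the dashboard reads a "time" field in ISO 8601 form.
        record["time"] = f"{day}T00:00:00Z"
        lines.append(json.dumps(record, sort_keys=True))
    return lines


def append_lines(path, lines):
    """Append the JSON strings to the flat file, one record per line."""
    with open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical one-row extract for a single day.
    sample = "country\tsearch_engine\tnum_referrals\nFrance\tGoogle\t1200\n"
    append_lines("referrer_daily.json", tsv_to_json_lines(sample, "2021-04-28"))
```

Appending keeps each daily run cheap because only the new day's rows are written. A real pipeline would also need to guard against appending the same day twice.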