Analytics/Data Lake/Traffic/Webrequest/Tagging

Revision as of 22:45, 2 August 2017 by imported>Nuria (→‎Webrequest Tagging)

Webrequest Tagging

We are in the process of adding a "tags" column to webrequest. This column is an array that can hold values such as "portal", "wikidata", and "pageview". The pageview refinement process will be enhanced with a tagging step, in which some requests (pageviews or not) will be marked with one or more tags.

Once the tagging phase is complete, a second process will read the tags column. A small number of tags will be used to split the webrequest dataset into smaller datasets using Hive dynamic partitioning. Many of our regular data-generation jobs read every record in webrequest when they actually need only a portion of it. Splitting the data into pre-filtered datasets will optimize these jobs, since they will be able to read only the pertinent data.

Not all tags will be used for partitioning, only a small subset. Other tags might be short-lived and used to select records from the webrequest table more efficiently.
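The split described above could be sketched roughly as follows. This is only an illustration of Hive dynamic partitioning, not the actual job: the target table webrequest_by_tag, its columns (uri_host, uri_path), and the choice of "portal" and "pageview" as partitioning tags are all assumptions made for the example.

```sql
-- Hypothetical target table: name, columns, and partition layout are assumptions.
CREATE TABLE IF NOT EXISTS webrequest_by_tag (
    uri_host STRING,
    uri_path STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT, tag STRING);

-- Allow Hive to create partitions from values found in the data.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- year/month/day/hour are static; tag is dynamic and must come last in the SELECT.
INSERT OVERWRITE TABLE webrequest_by_tag
PARTITION (year = 2017, month = 9, day = 9, hour = 9, tag)
SELECT
    w.uri_host,
    w.uri_path,
    t.tag
FROM webrequest w
LATERAL VIEW explode(w.tags) t AS tag   -- one row per (request, tag) pair
WHERE w.year = 2017 AND w.month = 9 AND w.day = 9 AND w.hour = 9
  AND t.tag IN ('portal', 'pageview');  -- hypothetical set of partitioning tags
```

In this sketch a request carrying two partitioning tags is written to both partitions. Whether the real process duplicates such records is not stated here.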

Usage of tags column in SQL

The tags column is an array<string>, a Hive complex type. A query that selects it can look like:

SELECT tags FROM webrequest WHERE year=2017 AND month=9 AND day=9 AND hour=9 LIMIT 5;

This might return something like:

["wikidata-query","sparql"]
["portal"]
["wikidata-query","sparql"]
[]
[]

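Individual elements can be selected by index:

SELECT tags[0] FROM webrequest WHERE year=2017 AND month=9 AND day=9 AND hour=9 LIMIT 5;

This returns the first element of each array, or NULL when the array is empty.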
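Two other patterns are likely to be useful with an array column: filtering on a tag and counting requests per tag. The queries below are a minimal sketch using Hive's built-in array_contains and explode functions. They assume the same table and partition as the examples above, and the alias names (w, t, tag, requests) are arbitrary.

```sql
-- Keep only requests that carry a given tag.
SELECT tags
FROM webrequest
WHERE year = 2017 AND month = 9 AND day = 9 AND hour = 9
  AND array_contains(tags, 'portal')
LIMIT 5;

-- Count requests per tag. explode() yields one row per array element,
-- so a request with two tags is counted once under each of them,
-- and requests with an empty tags array contribute no rows.
SELECT t.tag, COUNT(*) AS requests
FROM webrequest w
LATERAL VIEW explode(w.tags) t AS tag
WHERE w.year = 2017 AND w.month = 9 AND w.day = 9 AND w.hour = 9
GROUP BY t.tag
ORDER BY requests DESC;
```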