You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Analytics/Data/Browser general: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Mforns
(Created page with "'''wmf.browser_general''' is an intermediate table that makes it easier and faster to get pageview statistics broken down by user-agent related dimensions like OS family, OS m...")
 
imported>Milimetric
 
Line 1: Line 1:
'''wmf.browser_general''' is an intermediate table that makes it easier and faster to get pageview statistics broken down by user-agent related dimensions like OS family, OS major, browser family, browser major, etc. In addition to ad-hoc queries, it is also used to generate the [https://browser-reports.wmflabs.org Browser Reports].
#REDIRECT [[Analytics/Data Lake/Traffic/Browser general]]
 
=== Current Schema ===
<pre>
$ hive --database wmf
 
hive (wmf)> describe browser_general;
OK
col_name  data_type comment
access_method            string                  (desktop|mobile web|mobile app)
os_family                string                  OS family: Windows, Android, etc.
os_major                string                  OS major version: 8, 10, etc.
browser_family          string                  Browser family: Chrome, Safari, etc.
browser_major            string                  Browser major version: 47, 11, etc.
view_count              bigint                  Number of pageviews.
year                    int                      Unpadded year of request.
month                    int                      Unpadded month of request.
day                      int                      Unpadded day of request.
         
# Partition Information       
# col_name              data_type                comment           
         
year                    int                      Unpadded year of request.
month                    int                      Unpadded month of request.
day                      int                      Unpadded day of request.
</pre>
Notice the ''year'', ''month'' and ''day'' fields. These are Hive [http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ partitions], and are explicit mappings to daily files in HDFS.  You must include at least one partition predicate in the ''where'' clause of your queries (even if it is just <code>year > 0</code>). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns your results. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
 
=== Loading of anonymized data ===
'''wmf.browser_general''' is loaded through [https://github.com/wikimedia/analytics-refinery/tree/master/oozie/browser/general this oozie job]. It is a basic query that extracts data from ''wmf.pageview_hourly'' and ''wmf.projectview_hourly'', except for the fact that it applies a threshold-based anonymization to it. For privacy and readability reasons, all the rows that initially represent less than K% of the total view count are grouped into a single row where all dimension columns (os_family, os_major, browser_family, browser_major) are assigned an unknown value (see job config for exact K and unknown values).
 
=== Example query ===
<pre>
SELECT
  browser_family,
  browser_major,
  SUM(view_count) as view_count
FROM wmf.browser_general
WHERE
  access_method IN ('desktop', 'mobile web') AND
  CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2016-01-01' AND
  CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) < '2016-02-01'
GROUP BY
  browser_family,
  browser_major
ORDER BY view_count DESC
LIMIT 1000;
</pre>
 
=== Changes and known problems since 2016-03-21 ===
{| class="wikitable"
|-
! Date from !! Date until !! Task !! Details
|-
|...
|...
|...
|...
|}

Latest revision as of 13:48, 7 April 2017