Analytics/Cluster/Hive
[[File:Pageview_@_Wikimedia_(WMF_Analytics_lightning_talk,_June_2015).pdf|thumb|350px|page=6|Hive/Hadoop (rounded box at the bottom) within the Wikimedia Foundation's pageview data pipeline]]
[[:en:Apache Hive|Apache Hive]] is an abstraction built on top of MapReduce that allows SQL to be used to query various file formats stored in [[:en:Apache_Hadoop#HDFS|HDFS]].  WMF's first use case was to enable querying of unsampled webrequest logs.
== Access ==
=== Cluster Access ===
To use Hive you need shell access to [[stat1002]] and membership in the <code>analytics-privatedata-users</code> shell user group. Per [[Requesting shell access]], open a Phabricator ticket to request both.
For details see [[Analytics/Cluster/Access]].
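Once access is granted, a session typically looks like the sketch below. The hostname is illustrative and may have changed since this page was written; confirm the current client host on [[Analytics/Cluster/Access]].

```
ssh stat1002.eqiad.wmnet   # analytics client host (illustrative)
hive                       # start the Hive command-line client
```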
== Querying ==
* [[Analytics/Cluster/Hive/Queries]] (includes a FAQ about common tasks and problems)
* [[Analytics/Cluster/Hive/QueryUsingUDF]]
While Hive supports SQL, its dialect (HiveQL) differs from standard SQL in places: see the Hive Language Manual for more info.
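For example, a simple aggregation over the refined webrequest data might look like the sketch below. The table and field names (<code>wmf.webrequest</code>, <code>uri_host</code>, and the <code>year</code>/<code>month</code>/<code>day</code>/<code>hour</code> partition fields) follow the [[Analytics/Data/Webrequest|Webrequest]] documentation, but verify the current schema with <code>DESCRIBE wmf.webrequest</code> before running. Always restrict the partition fields so Hive does not scan the full dataset.

```sql
-- Count requests per host for a single hour of refined webrequest data.
-- Restricting year/month/day/hour limits the scan to one partition.
SELECT
    uri_host,
    COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = 2017 AND month = 4 AND day = 7 AND hour = 13
GROUP BY uri_host
ORDER BY requests DESC
LIMIT 10;
```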
== Maintained Tables ==
''(see also [[Analytics/Data]])''
* [[Analytics/Data/Webrequest|Webrequest]] (raw and refined)
* [[Analytics/Data/Pageview hourly|pageview_hourly]]
* [[Analytics/Data/Projectview hourly|projectview_hourly]]
* [[Analytics/Data/Pagecounts-all-sites|pagecounts_all_sites]]
* [[Analytics/Data/Mediacounts|mediacounts]]
* ...
=== Notes ===
* The wmf_raw and wmf databases contain Hive tables maintained by Ops.  You can create your own tables in Hive, but please be sure to create them in a different database, preferably one named after your shell username.
* Hive has the ability to map tables on top of almost any data structure.  Since the raw webrequest logs are JSON, the Hive tables must be configured with a JSON SerDe (serializer/deserializer) to read and write JSON.  We use the JsonSerDe included with Hive-HCatalog.
* The HCatalog .jar will be automatically added to a Hive client's auxpath.  You shouldn't need to think about it.
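To illustrate both notes above, the sketch below creates a personal database and a JSON-backed table inside it. The database name <code>jdoe</code>, the table name, and the column list are placeholders; the SerDe class is the one shipped with Hive-HCatalog, but confirm the exact class name against your Hive version.

```sql
-- Create a personal database (named after your shell username) so that
-- nothing is written into the Ops-maintained wmf or wmf_raw databases.
CREATE DATABASE IF NOT EXISTS jdoe;   -- 'jdoe' is a placeholder username

-- A table over newline-delimited JSON, using the HCatalog JsonSerDe to
-- deserialize each JSON record into the declared columns.
CREATE TABLE IF NOT EXISTS jdoe.sample_events (
    dt        STRING,
    uri_host  STRING,
    status    INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```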
== Troubleshooting ==
== See also ==
* [[Analytics/Cluster/Access]] - on how to access Hive and Hue (Hadoop User Experience, a GUI for Hive)
== References ==
* [[mediawikiwiki:Analytics/Kraken/Researcher_analysis|Early Hive at WMF Researcher Analysis]] (2013)
* [[Analytics/Cluster/Hive/Compression|Hive Compression Test]]
* Hadoop SequenceFile
* Introduction to Hive Partitioning

Latest revision as of 13:44, 7 April 2017