Analytics/Cluster/Hive

[[File:Pageview_@_Wikimedia_(WMF_Analytics_lightning_talk,_June_2015).pdf|thumb|350px|page=6|Hive/Hadoop (rounded box at the bottom) within the Wikimedia Foundation's pageview data pipeline]]
[http://hive.apache.org Apache Hive] is an abstraction built on top of MapReduce that allows SQL to be used on various file formats stored in [[:en:Apache_Hadoop#HDFS|HDFS]].  WMF's first use case was to enable querying of unsampled webrequest logs.
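
A Hive table is just metadata layered over files in HDFS; you can ask Hive where a table's data lives and how it is stored. A minimal sketch, assuming the refined <code>wmf.webrequest</code> table described in the Maintained tables section below:

<syntaxhighlight lang="sql">
-- From the hive CLI: show the table's schema along with its HDFS location,
-- input format and SerDe.
USE wmf;
DESCRIBE FORMATTED webrequest;
</syntaxhighlight>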
 
 
== Access ==
=== Cluster access ===
To use Hive on the Analytics Cluster you need shell access to [[stat1002]] and membership in the <code>analytics-privatedata-users</code> shell user group. Per [[Requesting shell access]], create a Phabricator ticket to request both.
 
Once you have access, see [[Analytics/Cluster/Access]] for how to connect to the servers.
 
== Querying ==
* [[Analytics/Cluster/Hive/Queries]] (includes a FAQ about common tasks and problems)
* [[Analytics/Cluster/Hive/QueryUsingUDF]]
While Hive supports SQL, there are some differences; see the [https://cwiki.apache.org/confluence/display/Hive/LanguageManual Hive Language Manual] for details.
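
For example, the large maintained tables are partitioned, and the partition columns (<code>webrequest_source</code>, <code>year</code>, <code>month</code>, <code>day</code>, <code>hour</code> on webrequest) appear in queries like ordinary columns but should always be constrained in the <code>WHERE</code> clause so Hive only reads the data you need. A minimal sketch, assuming the refined webrequest table's partition layout and field names (check them with <code>DESCRIBE</code>):

<syntaxhighlight lang="sql">
-- Partitions map to HDFS directories; listing them is cheap.
SHOW PARTITIONS wmf.webrequest;

-- Restrict queries to the partitions you need, otherwise Hive launches
-- MapReduce jobs over the entire dataset.
SELECT uri_host, COUNT(*) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2017 AND month = 4 AND day = 7 AND hour = 0
GROUP BY uri_host
ORDER BY requests DESC
LIMIT 10;
</syntaxhighlight>

Note that <code>ORDER BY</code> forces a single reducer in Hive, so only use it when the result set is already small, as it is here after the <code>GROUP BY</code> and <code>LIMIT</code>.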
 
== Maintained tables ==
''(see also [[Analytics/Data]])''
* [[Analytics/Data/Webrequest|Webrequest]] (raw and refined)
* [[Analytics/Data/Pageview hourly|pageview_hourly]]
* [[Analytics/Data/Projectview hourly|projectview_hourly]]
* [[Analytics/Data/Pagecounts-all-sites|pagecounts_all_sites]]
* [[Analytics/Data/Mediacounts|mediacounts]]
* ...
 
=== Notes ===
* The <code>wmf_raw</code> and <code>wmf</code> databases contain Hive tables maintained by Ops.  You can create your own tables in Hive, but please be sure to create them in a different database, preferably one named after your shell username (see the sketch after these notes).
* Hive can map tables on top of almost any data structure.  Since webrequest logs are JSON, the Hive tables must be told to use a JSON [https://cwiki.apache.org/confluence/display/Hive/SerDe SerDe] to serialize and deserialize the records.  We use the JsonSerDe included with [https://cwiki.apache.org/confluence/display/Hive/HCatalog Hive-HCatalog].
* The HCatalog .jar will be automatically added to a Hive client's auxpath.  You shouldn't need to think about it.
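
As an illustration of both points, a personal database and an external JSON table might be created roughly like this (a sketch only: the database name, table name, columns and HDFS path are placeholders; the SerDe class is the HCatalog JsonSerDe mentioned above):

<syntaxhighlight lang="sql">
-- Create a personal database named after your shell username (placeholder name).
CREATE DATABASE IF NOT EXISTS my_shell_username;

-- Map an external table onto JSON files already sitting in HDFS, using the
-- HCatalog JsonSerDe.  Dropping an EXTERNAL table removes only the metadata,
-- never the underlying files.
CREATE EXTERNAL TABLE IF NOT EXISTS my_shell_username.my_json_events (
  dt          STRING,
  uri_host    STRING,
  http_status STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/my_shell_username/my_json_events';
</syntaxhighlight>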
 
== Troubleshooting ==
[[Analytics/Cluster/Hive/Troubleshooting]]
 
== Subpages of {{PAGENAME}} ==
{{Special:PrefixIndex/{{PAGENAME}}/|stripprefix=1}}
 
== See also ==
* [[Analytics/Cluster/Access]] - on how to access Hive and Hue (Hadoop User Experience, a GUI for Hive)
 
== References ==
 
* [[mediawikiwiki:Analytics/Kraken/Researcher_analysis|Early Hive at WMF Researcher Analysis]] (2013)
* [[Analytics/Cluster/Hive/Compression|Hive Compression Test]]
* [http://wiki.apache.org/hadoop/SequenceFile Hadoop SequenceFile]
* [http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ Introduction to Hive Partitioning]
