
Analytics/Cluster/Hive: Difference between revisions

imported>Ottomata
imported>Milimetric
(Milimetric moved page Analytics/Cluster/Hive to Analytics/Systems/Cluster/Hive: Reorganizing documentation)
 
[[File:Pageview_@_Wikimedia_(WMF_Analytics_lightning_talk,_June_2015).pdf|thumb|350px|page=6|Hive/Hadoop (rounded box at the bottom) within the Wikimedia Foundation's pageview data pipeline]]
#REDIRECT [[Analytics/Systems/Cluster/Hive]]
[[File:Apache Hive logo.svg|thumb|Apache Hive logo]]
[http://hive.apache.org Apache Hive] is an abstraction built on top of MapReduce that allows SQL to be used on various file formats stored in [[:en:Apache_Hadoop#HDFS|HDFS]].  WMF's first use case was to enable querying of unsampled [[Analytics/Data/Webrequest|webrequest]] logs.
 
 
== Access ==
=== Cluster access ===
To get shell access to the Analytics Cluster and use Hive, you need to be added to either the <code>analytics-privatedata-users</code> or the <code>analytics-users</code> shell user group. Per [[Requesting shell access]], create a Phabricator ticket to request this. Some data generated by the Analytics team (such as the webrequest logs) is considered private, and only members of <code>analytics-privatedata-users</code> can access it. If you are getting access to Hive, you will probably want to be in this group.
 
Once you have credentials, see [[Analytics/Cluster/Access]] for how to access the servers.
 
== Querying ==
[[File:Introduction to Hive.pdf|thumb|400px|A presentation introducing the Hive cluster]]
 
* [[Analytics/Cluster/Hive/Queries]] (includes a FAQ about common tasks and problems)
* [[Analytics/Cluster/Hive/QueryUsingUDF]]
While Hive supports SQL, its dialect differs from standard SQL in some respects; see the [https://cwiki.apache.org/confluence/display/Hive/LanguageManual Hive Language Manual] for more info.
 
== Maintained tables ==
''(see also [[Analytics/Data]])''
* [[Analytics/Data/Webrequest|Webrequest]] (raw and refined)
* [[Analytics/Data/Pageview hourly|pageview_hourly]]
* [[Analytics/Data/Projectview hourly|projectview_hourly]]
* [[Analytics/Data/Pagecounts-all-sites|pagecounts_all_sites]]
* [[Analytics/Data/Mediacounts|mediacounts]]
* ...
 
=== Notes ===
* The <code>wmf_raw</code> and <code>wmf</code> databases contain Hive tables maintained by Ops. You can create your own tables in Hive, but please create them in a different database, preferably one named after your shell username.
* Hive can map tables onto almost any data structure. Since webrequest logs are JSON, the Hive tables must be told to use a JSON [https://cwiki.apache.org/confluence/display/Hive/SerDe SerDe] (serializer/deserializer) to read and write the records. We use the JsonSerDe included with [https://cwiki.apache.org/confluence/display/Hive/HCatalog Hive-HCatalog].
* The HCatalog .jar is automatically added to a Hive client's auxpath, so you shouldn't need to think about it.
* It is also possible to [[Analytics/EventLogging|import EventLogging data into Hive]], although (as of April 2016) this is not widely tested yet.
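
As a rough illustration of the first two notes, the following sketch creates a personal database and maps an external table onto JSON files using the HCatalog JsonSerDe. The database name <code>your_username</code>, the table name, the columns, the partition layout and the HDFS path are all hypothetical placeholders, not the schema of any maintained table; adjust them to your own data.

<syntaxhighlight lang="sql">
-- Hypothetical example: every name and path below is a placeholder.
-- Keep personal tables out of wmf and wmf_raw by using your own database.
CREATE DATABASE IF NOT EXISTS your_username;

-- Map a table onto JSON text files, one JSON object per line.
-- The column names must match the keys in the JSON records.
CREATE EXTERNAL TABLE IF NOT EXISTS your_username.example_json_events (
  hostname  STRING,
  dt        STRING,
  uri_path  STRING
)
PARTITIONED BY (year INT, month INT, day INT)
-- JsonSerDe class shipped with Hive-HCatalog
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/your_username/example_json_events';
</syntaxhighlight>

Because the table is declared <code>EXTERNAL</code>, dropping it removes only the Hive metadata and leaves the underlying files in HDFS untouched. Partitions still have to be registered (for example with <code>ALTER TABLE ... ADD PARTITION</code>) before their data shows up in queries.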
 
== Troubleshooting ==
See the [[Analytics/Cluster/Hive/Queries#FAQ|FAQ]]
 
== Subpages of {{PAGENAME}} ==
{{Special:PrefixIndex/{{PAGENAME}}/|stripprefix=1}}
 
== See also ==
* [[Analytics/Cluster/Access]] - on how to access Hive and Hue (Hadoop User Experience, a GUI for Hive)
 
== References ==
 
* [[mediawikiwiki:Analytics/Kraken/Researcher_analysis|Early Hive at WMF Researcher Analysis]] (2013)
* [[Analytics/Cluster/Hive/Compression|Hive Compression Test]]
* [http://wiki.apache.org/hadoop/SequenceFile Hadoop SequenceFile]
* [http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ Introduction to Hive Partitioning]

Latest revision as of 13:44, 7 April 2017