You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Analytics/Systems/Cluster/Hive: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Neil P. Quinn-WMF
(Update version number)
imported>Btullis
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
[[File:Pageview_@_Wikimedia_(WMF_Analytics_lightning_talk,_June_2015).pdf|thumb|350px|page=6|Hive/Hadoop (rounded box at the bottom) within the Wikimedia Foundation's pageview data pipeline]]
#REDIRECT [[Data Engineering/Systems/Cluster/Hive]]
[[File:Apache Hive logo.svg|thumb|Apache Hive logo]]
[http://hive.apache.org Apache Hive] is an abstraction built on top of MapReduce that allows SQL to be used on various file formats stored in [[:en:Apache_Hadoop#HDFS|HDFS]].  WMF's first use case was to enable querying of unsampled [[Analytics/Data/Webrequest|webrequest]] logs.
 
As of February 2021, we are running Hive 2.3.6.
 
__TOC__
 
{{anchor|Cluster access}}
== Access ==
In order to access Hive, you need an account with [[production shell access]] in the <code>analytics-privatedata-users</code> user group. For more instructions, see [[Analytics/Data access]].
 
Some of the data in Hive, like the [[Analytics/Data Lake/Traffic/Webrequest|webrequest]] logs, are private data so only <code>analytics-privatedata-users</code> can access it.  If you are requesting access to Hive, you probably want to be in this group.
 
Once you have the credentials, see [[Analytics/Cluster/Access|Analytics/Systems/Cluster/Access]] for instructions on using the web UI and SSH tunneling.
 
== Create your own database ==
 
Hive uses databases to organize tables. You can create databases for your own use, and by convention we use our shell username as database name. Here is an example of command to create a database:
<syntaxhighlight lang="sql">
CREATE DATABASE my_user_name;
</syntaxhighlight>
 
== Querying ==
[[File:Introduction to Hive.pdf|thumb|400px|A presentation introducing the Hive cluster]]
 
* [[Analytics/Cluster/Hive/Queries]] (includes a FAQ about common tasks and problems)
* [[Analytics/Cluster/Hive/QueryUsingUDF]]
While hive supports SQL, there are some differences: see the [https://cwiki.apache.org/confluence/display/Hive/LanguageManual Hive Language Manual] for more info.
 
== Loading data ==
=== TSV file ===
If you have a data file you'd like to load in to Hive (perhaps to join with an existing Hive table), start by copying it onto one of the stats or notebook machines. Then, create a table in Hive with a "delimited" row format:
<syntaxhighlight lang="sql">
CREATE TABLE tablename (tablespec)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
</syntaxhighlight>
 
You can easily change the terminator string from "\t" to "," if you have a CSV file.
 
Finally, use the <code>hive</code> command line client on that machine to run the following query:
<syntaxhighlight lang="sql">
LOAD DATA LOCAL INPATH '{{local path to file}}'
OVERWRITE INTO TABLE {{name}}
</syntaxhighlight>
 
Note that you cannot use <code>beeline</code> since it will look on the Hive server instead for your data file, even when you use the <code>LOCAL</code> keywoard.
 
== Maintained tables ==
''(see also [[Analytics/Data]])''
* [[Analytics/Data/Webrequest|Webrequest]] (raw and refined)
* [[Analytics/Data/Pageview hourly|pageview_hourly]]
* [[Analytics/Data/Projectview hourly|projectview_hourly]]
* [[Analytics/Data/Pagecounts-all-sites|pagecounts_all_sites]]
* [[Analytics/Data/Mediacounts|mediacounts]]
* [[Analytics/Data/mobile_apps_session_metrics|mobile_apps_session_metrics]]
* [[Analytics/Systems/EventLogging|EventLogging data]], in the <code>event</code> database ([[Analytics/Systems/EventLogging#Hadoop_&_Hive|details]])
 
=== Notes ===
* The wmf_raw and wmf databases contain Hive tables maintained by Analytics.  You can create your own tables in Hive, but please be sure to create them in a different database, preferably one named after your shell use''rname.
* Hive has the ability to map tables on top of almost any data structure.  Since webrequest logs are JSON, the Hive tables must be told to use a JSON [https://cwiki.apache.org/confluence/display/Hive/SerDe SerDe] to be able to serialize/deserialize to/from JSON.  We use the JsonSerDe included with [https://cwiki.apache.org/confluence/display/Hive/HCatalog Hive-HCatalog].
* The HCatalog .jar will be automatically added to a Hive client's auxpath.  You shouldn't need to think about it.
 
== Troubleshooting ==
See the [[Analytics/Cluster/Hive/Queries#FAQ|FAQ]]
 
== Subpages of {{PAGENAME}} ==
{{Special:PrefixIndex/{{PAGENAME}}/|stripprefix=1}}
 
== References ==
 
* [[mediawikiwiki:Analytics/Kraken/Researcher_analysis|Early Hive at WMF Researcher Analysis]] (2013)
* [[Analytics/Cluster/Hive/Compression|Hive Compression Test]]
* [http://wiki.apache.org/hadoop/SequenceFile Hadoop SequenceFile]
* [http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ Introduction to Hive Partitioning]

Latest revision as of 11:55, 12 January 2023