
Analytics/Cluster/Hive

Revision as of 21:52, 20 October 2015

File:Pageview @ Wikimedia (WMF Analytics lightning talk, June 2015).pdf

Apache Hive is an abstraction built on top of MapReduce that allows SQL to be used on various file formats stored in HDFS. WMF's first use case was to enable querying of unsampled webrequest logs.


Access

Cluster access

To get shell access to the analytics cluster through Hive, you need access to stat1002 and membership in the analytics-privatedata-users shell user group. As described in Requesting shell access, create a Phabricator ticket to request both.

Once you have the credentials, see Analytics/Cluster/Access for how to access the servers.

Querying

File:Introduction to Hive.pdf

Hive supports SQL, but its dialect differs from standard SQL in places; see the Hive Language Manual for details.
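As a minimal sketch of running a single statement from a shell on the cluster, the Python snippet below builds a command line for the Hive CLI's `-e` option, which executes a query string passed on the command line. The helper name `hive_command` is hypothetical, and the statement shown only lists the tables in the wmf database; the snippet prints the command rather than running it.

```python
import shlex
from typing import List


def hive_command(query: str) -> List[str]:
    """Build the argument list for running one HiveQL statement via `hive -e`."""
    return ["hive", "-e", query]


# Hypothetical usage: list the tables in the wmf database.
cmd = hive_command("SHOW TABLES IN wmf;")

# Print a shell-quoted form that could be pasted into a terminal.
print(shlex.join(cmd))
```

To actually run the statement, the same list could be passed to `subprocess.run(cmd, check=True)` on a host where the `hive` client is installed.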

Maintained tables

(see also Analytics/Data)

Notes

  • The wmf_raw and wmf databases contain Hive tables maintained by Ops. You can create your own tables in Hive, but please create them in a different database, preferably one named after your shell username.
  • Hive can map tables onto almost any data structure. Because the webrequest logs are JSON, their Hive tables must be configured with a JSON SerDe to serialize to and deserialize from JSON. We use the JsonSerDe included with Hive-HCatalog.
  • The HCatalog .jar is automatically added to a Hive client's auxpath, so you shouldn't need to think about it.
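The first two notes can be illustrated with a short sketch. The Python snippet below generates HiveQL that creates a database named after a user and an external table in it that reads JSON through the Hive-HCatalog JsonSerDe (class `org.apache.hive.hcatalog.data.JsonSerDe`). The helper name `personal_table_ddl`, the table name, the columns, the HDFS location, and the fallback username `jdoe` are all hypothetical and chosen only for illustration; they do not describe the maintained webrequest tables.

```python
import os

# SerDe class shipped with Hive-HCatalog for reading and writing JSON rows.
JSON_SERDE = "org.apache.hive.hcatalog.data.JsonSerDe"


def personal_table_ddl(database, table, columns, location):
    """Return HiveQL statements creating a JSON-backed table in a personal database.

    columns is a list of (name, hive_type) pairs; location is an HDFS directory.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return [
        f"CREATE DATABASE IF NOT EXISTS {database};",
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"ROW FORMAT SERDE '{JSON_SERDE}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}';",
    ]


if __name__ == "__main__":
    # Assumption: the shell username is available in $USER; "jdoe" is a placeholder.
    user = os.environ.get("USER", "jdoe")
    # Hypothetical table, columns, and location, for illustration only.
    statements = personal_table_ddl(
        user,
        "json_example",
        [("hostname", "string"), ("dt", "string")],
        f"/user/{user}/json_example",
    )
    for statement in statements:
        print(statement)
```

Keeping the generated table in a database of your own leaves wmf_raw and wmf reserved for the tables maintained by Ops.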

Troubleshooting

Analytics/Cluster/Hive/Troubleshooting

Subpages of Analytics/Cluster/Hive
