You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Analytics/Systems/Cluster/Hive: Difference between revisions

From Wikitech-static
imported>HaeB
(update)
imported>Neil P. Quinn-WMF
(Consolidate and prune access instructions)
Line 7: Line 7:
{{anchor|Cluster access}}
== Access ==
In order to get shell access to the Analytics Cluster through Hive you need to be added to either the <code>analytics-privatedata-users</code> or the <code>analytics-users</code> shell user group. Per [[Requesting shell access]], create a Phabricator ticket for such request.  Some analytics team generated data (like the webrequest logs) are considered private data, and only <code>analytics-privatedata-users</code> can access it.  If you are getting access to Hive, you will probably want to be in this group.
In order to access Hive, you need an account with [[production shell access]] in either the <code>analytics-privatedata-users</code> or the <code>analytics-users</code> user group. For more instructions, see [[Analytics/Data access]].


For how to access the servers (once you have the credentials), see: [[Analytics/Cluster/Access]]
Some of the data in Hive, like the [[Analytics/Data Lake/Traffic/Webrequest|webrequest]] logs, is private, so only members of <code>analytics-privatedata-users</code> can access it. If you are requesting access to Hive, you probably want to be in this group.
 
Once you have the credentials, see [[Analytics/Cluster/Access|Analytics/Systems/Cluster/Access]] for instructions on using the web UI and SSH tunneling.


== Querying ==
Line 38: Line 40:
== Subpages of {{PAGENAME}} ==
{{Special:PrefixIndex/{{PAGENAME}}/|stripprefix=1}}
== See also ==
* [[Analytics/Cluster/Access]] - on how to access Hive and Hue (Hadoop User Experience, a GUI for Hive)


== References ==

Revision as of 00:16, 4 May 2018

File:Pageview @ Wikimedia (WMF Analytics lightning talk, June 2015).pdf


Apache Hive is an abstraction built on top of MapReduce that allows SQL to be used on various file formats stored in HDFS. WMF's first use case was to enable querying of unsampled webrequest logs.

Access

In order to access Hive, you need an account with production shell access in either the analytics-privatedata-users or the analytics-users user group. For more instructions, see Analytics/Data access.

Some of the data in Hive, like the webrequest logs, is private, so only members of analytics-privatedata-users can access it. If you are requesting access to Hive, you probably want to be in this group.

Once you have the credentials, see Analytics/Systems/Cluster/Access for instructions on using the web UI and SSH tunneling.

Querying

File:Introduction to Hive.pdf

While Hive supports SQL, its dialect differs from standard SQL in some respects; see the Hive Language Manual for more information.
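As a small illustration of such differences, the metadata statements below are HiveQL rather than standard SQL. This is only a sketch for exploring what is available: the wmf database is the one named under Maintained tables, and the table name webrequest is assumed from the webrequest logs described on this page.

```sql
-- List the databases visible to your account.
SHOW DATABASES;

-- Switch to the wmf database (named under "Maintained tables").
USE wmf;

-- List the tables in the current database.
SHOW TABLES;

-- Show the columns of a table.
-- Assumption: the webrequest logs are exposed as a table named webrequest.
DESCRIBE webrequest;
```

These statements only read metadata, so they are a cheap way to check column names and types before writing a query.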

Maintained tables

(see also Analytics/Data)

Notes

  • The wmf_raw and wmf databases contain Hive tables maintained by Ops. You can create your own tables in Hive, but please be sure to create them in a different database, preferably one named after your shell username.
  • Hive can map tables on top of almost any data structure. Since webrequest logs are JSON, the Hive tables must be told to use a JSON SerDe (serializer/deserializer) to read and write the data. We use the JsonSerDe included with Hive-HCatalog.
  • The HCatalog .jar will be automatically added to a Hive client's auxpath. You shouldn't need to think about it.
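
The notes above can be sketched in HiveQL as follows. This is a minimal, hypothetical example, not the definition of any maintained table: the database name your_username, the table name example_json_events, and its columns are placeholders made up for illustration. The SerDe class is the JsonSerDe shipped with Hive-HCatalog.

```sql
-- Create a personal database, ideally named after your shell username
-- (your_username is a placeholder), instead of using wmf or wmf_raw.
CREATE DATABASE IF NOT EXISTS your_username;
USE your_username;

-- Hypothetical table over JSON text data: one JSON object per line.
-- The column names are illustrative and must match the JSON keys.
CREATE TABLE IF NOT EXISTS example_json_events (
  hostname STRING,
  dt       STRING,
  uri_path STRING
)
-- Tell Hive to read and write rows with the HCatalog JSON SerDe.
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```

Because the HCatalog .jar is already on the Hive client's auxpath, no ADD JAR statement should be needed before the CREATE TABLE.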

Troubleshooting

See the FAQ

Subpages of Analytics/Systems/Cluster/Hive
