Revision as of 15:59, 28 January 2021 by Neil P. Quinn-WMF (Mention version used.)

File:Pageview @ Wikimedia (WMF Analytics lightning talk, June 2015).pdf


Apache Hive is an abstraction built on top of MapReduce that allows SQL to be used on various file formats stored in HDFS. WMF's first use case was to enable querying of unsampled webrequest logs.

As of January 2021, we are running Hive 1.1.0.


In order to access Hive, you need an account with production shell access in the analytics-privatedata-users user group. For more instructions, see Analytics/Data access.

Some of the data in Hive, like the webrequest logs, is private, so only members of analytics-privatedata-users can access it. If you are requesting access to Hive, you probably want to be in this group.

Once you have the credentials, see Analytics/Systems/Cluster/Access for instructions on using the web UI and SSH tunneling.

Create your own database

Hive uses databases to organize tables. You can create databases for your own use; by convention, we use our shell username as the database name. Here is an example command to create a database:

CREATE DATABASE my_user_name;
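
As a minimal sketch of how a personal database might be set up and used, the statements below create it only if it is missing, switch to it, and list its tables. The name my_user_name is a placeholder for your own shell username.

```sql
-- my_user_name is a placeholder; substitute your shell username.
CREATE DATABASE IF NOT EXISTS my_user_name;

-- Make it the default database for the rest of the session.
USE my_user_name;

-- List the tables it currently contains (none for a new database).
SHOW TABLES;
```

Using IF NOT EXISTS makes the statement safe to re-run, since a plain CREATE DATABASE fails when the database already exists.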


File:Introduction to Hive.pdf

While Hive supports SQL, there are some differences: see the Hive Language Manual for more info.

Loading data

TSV file

If you have a data file you'd like to load into Hive (perhaps to join with an existing Hive table), start by copying it onto one of the stats or notebook machines. Then, create a table in Hive with a "delimited" row format:

CREATE TABLE tablename (tablespec)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";

You can easily change the terminator string from "\t" to "," if you have a CSV file.
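
For illustration, a comma-delimited version of such a table could look like the sketch below. The database, table, and column names (my_user_name.page_counts, page_title, view_count) are hypothetical, chosen only to make the statement concrete.

```sql
-- Hypothetical table for a two-column CSV file: a title and a count.
CREATE TABLE my_user_name.page_counts (
  page_title STRING,
  view_count BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','   -- use '\t' instead for a TSV file
STORED AS TEXTFILE;
```

Keep in mind that the delimited row format splits on every occurrence of the terminator, so a CSV file with quoted fields that contain commas will not be parsed as intended.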

Finally, use the hive command line client on that machine to run the following query:

LOAD DATA LOCAL INPATH '{{local path to file}}'
INTO TABLE tablename;

Note that you cannot use beeline for this: it looks for your data file on the Hive server rather than on the machine you are working on, even when you use the LOCAL keyword.
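
A load followed by a quick sanity check might look like the following sketch. It assumes a delimited table named my_user_name.page_counts already exists with columns matching the file; both the table name and the file path are hypothetical.

```sql
-- Run from the hive command line client on the machine that holds the file.
-- The path and table name are placeholders.
LOAD DATA LOCAL INPATH '/home/my_user_name/page_counts.csv'
INTO TABLE my_user_name.page_counts;

-- Inspect a few rows to confirm the columns were split as expected.
SELECT * FROM my_user_name.page_counts LIMIT 10;
```

Without the OVERWRITE keyword, LOAD DATA appends to whatever the table already contains, so running the load twice duplicates the rows.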

Maintained tables

(see also Analytics/Data)


  • The wmf_raw and wmf databases contain Hive tables maintained by Analytics. You can create your own tables in Hive, but please be sure to create them in a different database, preferably one named after your shell username.
  • Hive has the ability to map tables on top of almost any data structure. Since webrequest logs are JSON, the Hive tables must be told to use a JSON SerDe to be able to serialize/deserialize to/from JSON. We use the JsonSerDe included with Hive-HCatalog.
  • The HCatalog .jar will be automatically added to a Hive client's auxpath. You shouldn't need to think about it.
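
To illustrate how a table can be mapped onto JSON data with that SerDe, here is a minimal sketch. The database, table, column names, and HDFS location are all hypothetical and do not reflect the actual webrequest schema; only the SerDe class name is the one shipped with Hive-HCatalog.

```sql
-- Hypothetical external table over a directory of JSON text files,
-- one JSON object per line. Column names must match the JSON keys.
CREATE EXTERNAL TABLE my_user_name.json_events (
  hostname STRING,
  dt       STRING,
  uri_path STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/my_user_name/json_events';  -- placeholder HDFS path
```

Declaring the table EXTERNAL means that dropping it removes only the table definition and leaves the underlying files in place.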


See the FAQ
