Data Platform/Systems/Hive
Apache Hive is an abstraction built on top of MapReduce that allows SQL to be used on various file formats stored in HDFS . WMF's first use case was to enable querying of unsampled webrequest logs.
As of January 2025, we are running Hive 2.3.6.
Deprecation
Querying on Hive (also called Hive-on-MapReduce) has been officially deprecated since 2015 . As of January 2025, there are general plans, but no specific schedules, for deprecating and then removing Wmfdata's Hive module ( T384541 ) and removing the Hive CLI ( T379385 ).
What to use instead
For most or all queries, Spark should be a drop-in replacement for Hive. You can also use Presto , but it has some syntax differences with Hive.
There is currently no drop-in replacement for Wmfdata's
hive.load_csv
, but one will be added soon.
Use
The Hive command-line interface is available on all the analytics clients . Here is how to use it:
$ kinit
$ hive
(
hive
takes a few seconds to start. You should see a prompt like
hive (default)>
when it is ready.)
Hive can also be easily accessed through Python using wmfdata-python
#!/usr/bin/env python3
import wmfdata as wmf
# Returns a Pandas dataframe
wmf.hive.run("SHOW TABLES FROM event")
Access
In order to access Hive, you need an account with
production shell access
in the
analytics-privatedata-users
user group. For more instructions, see
Data Platform/Data access
.
Some of the data in Hive, like the
webrequest
logs, are private data so only
analytics-privatedata-users
can access it. If you are requesting access to Hive, you probably want to be in this group.
Querying
- Data Platform/Systems/Hive/Queries (includes a FAQ about common tasks and problems)
- Data Platform/Systems/Hive/QueryUsingUDF
While hive supports SQL, there are some differences: see the Hive Language Manual for more info.
Maintained tables
(see also Data Platform/Data Lake )
- Webrequest (raw and refined)
- pageview_hourly
- projectview_hourly
- pagecounts_all_sites
- mediacounts
- mobile_apps_session_metrics
-
EventLogging data
, in the
eventdatabase ( details )
Notes
- The wmf_raw and wmf databases contain Hive tables maintained by Analytics. You can create your own tables in Hive, but please be sure to create them in a different database, preferably one named after your shell use rname.
- Hive has the ability to map tables on top of almost any data structure. Since webrequest logs are JSON, the Hive tables must be told to use a JSON SerDe to be able to serialize/deserialize to/from JSON. We use the JsonSerDe included with Hive-HCatalog .
- The HCatalog .jar will be automatically added to a Hive client's auxpath. You should not need to think about it.
Troubleshooting
See the FAQ