
Analytics/Cluster/Hive

[[File:Pageview_@_Wikimedia_(WMF_Analytics_lightning_talk,_June_2015).pdf|thumb|350px|page=6|Hive/Hadoop (rounded box at the bottom) within the Wikimedia Foundation's pageview data pipeline]]
[http://hive.apache.org Apache Hive] is an abstraction built on top of MapReduce that allows SQL to be used on various file formats stored in [[:en:Apache_Hadoop#HDFS|HDFS]].  WMF's first use case was to enable querying of unsampled webrequest logs.
== Access ==
=== Cluster Access ===
In order to get shell access to the analytics cluster through Hive, you need access to [[stat1002]] and need to be added to the <code>analytics-privatedata-users</code> shell user group. Per [[Requesting shell access]], create a Phabricator ticket for such a request.
For access see: [[Analytics/Cluster/Access]]
== Maintained Tables ==
''(see also [[Analytics/Data]])''
* [[Analytics/Data/Mediacounts|mediacounts]]
* ...
== Querying ==
To access data, start the Hive command line interface:
  hive
Inside the Hive CLI, you can query the [[Analytics/Data/Webrequest|wmf.webrequest]] and other tables:
<source lang="SQL" enclose="div">
-- Calculate per-country mobile page views for 2015-04-10 00:00--01:00 (1st hour)
SELECT
    geocoded_data['country_code'] country, COUNT(*) cc_count
FROM wmf.webrequest
WHERE
    webrequest_source = 'mobile'  -- could be 'text', 'upload', 'bits', 'misc', ...
    AND year = 2015             
    AND month = 4
    AND day = 10
    AND hour = 0
    AND is_pageview
GROUP BY
    geocoded_data['country_code'];
</source>
Notice the <tt>webrequest_source</tt>, <tt>year</tt>, <tt>month</tt>, <tt>day</tt>, and <tt>hour</tt> fields.  These are Hive [http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ partitions], and are explicit mappings to hourly imports in HDFS.  You must include at least one partition predicate in the <tt>where</tt> clause of your queries (even if it is just <tt>year > 0</tt>).  Partitions allow you to reduce the amount of data that Hive must parse and process before it returns your results.  For example, if you are only interested in data during a particular day, you could add <tt>where year = 2014 and month = 1 and day = 12</tt>.  This will instruct Hive to only process data for partitions that match that partition predicate.  You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
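If you want to see which hourly partitions currently exist for a table, you can list them directly (<tt>SHOW PARTITIONS</tt> is standard Hive syntax; <tt>wmf.webrequest</tt> is the table described above):
<source lang="SQL" enclose="div">
-- List every partition that currently exists for the webrequest table
SHOW PARTITIONS wmf.webrequest;
</source>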
Most SQL is supported by Hive, with a few differences.  See the [https://cwiki.apache.org/confluence/display/Hive/LanguageManual Hive Language Manual] for more info.
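One class of differences is Hive's support for complex column types, such as the map-typed <tt>geocoded_data</tt> column used above, or arrays flattened with <tt>LATERAL VIEW explode()</tt>. A minimal sketch (the table <tt>my_database.my_table</tt> and its <tt>tags</tt> array column are illustrative, not actual WMF tables):
<source lang="SQL" enclose="div">
-- Hypothetical example: flatten an array column into one row per element
-- with LATERAL VIEW explode(), then aggregate as usual.
SELECT
    tag, COUNT(*) AS tag_count
FROM my_database.my_table
LATERAL VIEW explode(tags) t AS tag
GROUP BY tag;
</source>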


=== Notes ===
* The <code>wmf_raw</code> and <code>wmf</code> databases contain Hive tables maintained by Ops.  You can create your own tables in Hive, but please be sure to create them in a different database, preferably one named after your shell username.
* Hive has the ability to map tables on top of almost any data structure.  Since webrequest logs are JSON, the Hive tables must be told to use a JSON [https://cwiki.apache.org/confluence/display/Hive/SerDe SerDe] to be able to serialize/deserialize to/from JSON.  We use the JsonSerDe included with [https://cwiki.apache.org/confluence/display/Hive/HCatalog Hive-HCatalog] (see the sketch after this list).
* The HCatalog .jar will be automatically added to a Hive client's auxpath.  You shouldn't need to think about it.
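As a concrete sketch of the first two notes above, creating a personal database and a JSON-backed table might look like the following. The database, table, and column names are illustrative only; the SerDe class name is the one shipped with Hive-HCatalog:
<source lang="SQL" enclose="div">
-- Create a personal database, preferably named after your shell username
-- (the name here is a placeholder).
CREATE DATABASE IF NOT EXISTS my_username;

-- Map a table onto JSON text data using the Hive-HCatalog JsonSerDe.
CREATE TABLE my_username.my_json_table (
    dt          STRING,
    uri_path    STRING,
    http_status STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
</source>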
== Sample Queries ==
* [[Analytics/Cluster/Hive/Queries]] (includes a FAQ about common tasks and problems)
* [[Analytics/Cluster/Hive/QueryUsingUDF]]
== Killing a running query ==
Once you submit a query, it is handed off to Hadoop.  Hadoop runs the query as a YARN application.  The Hive CLI is then detached from the actual application.  If you Ctrl-C your Hive CLI, you will quit the interface you used to submit the query, but will not actually kill the application.  To kill the application, you have to tell YARN you want it dead.
Note the application ID from when your query started.  You should see something like:
  Starting Job = job_1387838787660_12241, Tracking URL = http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_12241/
The application ID in this case is <tt>application_1387838787660_12241</tt> (the <tt>job_</tt> prefix of the job name is replaced with <tt>application_</tt>).  To kill this application, run:
  yarn application -kill application_1387838787660_12241
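If you no longer have the application ID at hand, you can list the applications YARN currently knows about and find yours (this uses the standard YARN CLI):
  yarn application -list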


== Troubleshooting ==
See [[Analytics/Cluster/Hive/Troubleshooting]].


== References ==
* [[mediawikiwiki:Analytics/Kraken/Researcher_analysis|Early Hive at WMF Researcher Analysis]] (2013)
* [[Analytics/Cluster/Hive/Compression|Hive Compression Test]]
* [http://wiki.apache.org/hadoop/SequenceFile Hadoop SequenceFile]
* [http://www.brentozar.com/archive/2013/03/introduction-to-hive-partitioning/ Introduction to Hive Partitioning]
