You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
Note: The Hive client that is launched by using
hivecli, is being deprecated, and is being replaced with
beelineinterface for HiveServer2 instead. Querying works the same way as this tutorial will explain, but you can use
beeline in place of
hive. See Analytics/Cluster/Beeline for full usage documentation.
Once you can
ssh to stat1004 you can simply access the hive command-line interface by entering hive. Here's the start of a sample interactive session:
nuria@stat1004:~$ beeline 0: jdbc:hive2://analytics1003.eqiad.wmnet:100> show databases; wmf ... 0: jdbc:hive2://analytics1003.eqiad.wmnet:100> use wmf; No rows affected (0.019 seconds) 0: jdbc:hive2://analytics1003.eqiad.wmnet:100> show tables; tab_name aqs_hourly browser_general last_access_uniques_daily last_access_uniques_monthly mediacounts mobile_apps_session_metrics mobile_apps_session_metrics_by_os mobile_apps_uniques_daily mobile_apps_uniques_monthly pagecounts_all_sites pageview_hourly pageview_unexpected_values pageview_whitelist projectview_hourly webrequest webrequest_empty 16 rows selected (0.028 seconds)
You can also write your query in a text file, e.g. named my_query.hql, and execute it. This example redirects the output of the query (by default in tab-separated values format) to results.txt:
$ beeline -f my_query.hql > results.txt
use wmf database (instead of the
wmf_raw database) if you can, or your queries will be slow. The
wmf database includes filtered and preprocessed data.
Output to a file
beeline -f my_query.hql > out.txt
Run queries in Cron
# The USER variable needs to be set with the username # that will run the queries. Example: USER=elukey USER=<user> /usr/local/bin/beeline -f my_query.hql > out.txt
Always restrict queries to a date range (partitioning)
There is a lot of data in Hive, so you want to make sure your queries have time bounds so they complete in a reasonable amount of time. For most tables you will be copying and pasting
- WHERE year=2022 AND month=8 AND day=16
to all your queries, and maybe appending AND hour=NN as well.
In fact, you must restrict your query to some Hive "partition", otherwise Hive will fail with the error "No partition predicate found". The partitioning of most tables is by year/month/day/hour. For example, seeing this simple query fail:
0: jdbc:hive2://analytics1003.eqiad.wmnet:100> SELECT agent_type FROM webrequest LIMIT 5; Error: Error while compiling statement: FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "webrequest" Table "webrequest" (state=42000,code=10041)
you must restrict it, for example:
0: jdbc:hive2://analytics1003.eqiad.wmnet:100> SELECT agent_type FROM webrequest WHERE webrequest_source='misc' AND year=2016 AND month=4 AND day=29 and hour=0 LIMIT 5; agent_type user user user spider user
Run long queries in a screen session and in the nice queue
- Screen: If your queries take a long time to run you can execute them in a
screensession. If you are accidentally disconnected, or you intentionally disconnect, from a screen session, your remote shell and queries continue to run.
- Nice queue: For long queries, it's good practice to let fast queries take precedence for resources on the cluster. In order to do that, long queries should be run in the
niceyarn queue. You can do so by
- In hive CLI, execute the command:
- In batch mode, add a parameter:
hive --hiveconf mapred.job.queue.name=nice ...
- In hive CLI, execute the command:
You can create or reuse UDFs (user-defined functions), see Analytics/Cluster/Hive/QueryUsingUDF.
WMF has developed several UDFs, such as
Some are run to create its derived tables from raw Hadoop information.
Here are sample queries from the tables in the
wmf database. Run beeline and enter
use wmf; before running these.
Using the pre-aggregated table pageview_hourly
SELECT concat(month,'/',day,'/',year), access_method, sum(view_count) FROM wmf.pageview_hourly WHERE year = 2015 AND month = 8 AND agent_type = "user" AND country = "Canada" AND project = "en.wikipedia" GROUP BY year, month, day, access_method;
SELECT COUNT(DISTINCT ip) AS hits FROM webrequest WHERE year = 2014 AND month = 5 AND day = 5 AND hour = 13 AND uri_host = "en.wikipedia.org" AND http_status = 200;
Top raw referrals for an article
SELECT referer, COUNT(DISTINCT ip) AS hits FROM webrequest WHERE year = 2014 AND month = 5 AND day = 5 AND hour = 13 AND uri_path = "/wiki/London" AND uri_host = "en.wikipedia.org" AND http_status = 200 GROUP BY referer ORDER BY hits DESC LIMIT 50;
Top internal referrals for an article
SELECT SUBSTR(referer,30) AS source, COUNT(DISTINCT ip) AS hits FROM webrequest WHERE year = 2014 AND month = 5 AND day = 5 AND hour = 13 AND uri_path = "/wiki/London" AND uri_host = "en.wikipedia.org" AND referer LIKE "http://en.wikipedia.org/wiki/%" AND http_status = 200 GROUP BY SUBSTR(referer,30) ORDER BY hits DESC LIMIT 50;
Top outbound clicks from an article
SELECT SUBSTR(uri_path,7) AS target, COUNT(DISTINCT ip) AS hits FROM webrequest WHERE year = 2014 AND month = 5 AND day = 5 AND hour = 13 AND uri_host = "en.wikipedia.org" AND referer LIKE "http://en.wikipedia.org/wiki/London%" AND http_status = 200 GROUP BY SUBSTR(uri_path,7) ORDER BY hits DESC LIMIT 50;
Counting rows which match a criterion
A common use case in querying data is counting the number of rows which match a criterion. With Hive, a good way to do this is by casting an expression which produces a boolean value to an integer and taking the sum. For example:
select sum(cast(distinct_editors >= 500 as int)) from geowiki_monthly where month = "2018-03"
This counts the number of rows where the
month is March 2018 and the value of
distinct_editors is greater than 500.
You can't simply use
count() on a boolean expression, as in
count(distinct_editors >= 500), because that counts the rows where the expression is not null—which includes rows where it is false!
If you get errors like:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found
then enter the following:
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar ;
Killing a running query
Once you submit a query, it is handed off to Hadoop. Hadoop runs the query as a YARN application. The Hive CLI is then detached from the actual application. If you Ctrl-C your Hive CLI, you will quit the interface you used to submit the query, but will not actually kill the application. To kill the application, you have to tell YARN you want it dead.
Note the application ID from when your query started. You should see something like:
Starting Job = job_1387838787660_12241, Tracking URL = http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_12241/
The application ID in this case is application_1387838787660_12241. To kill this application, run:
yarn application -kill application_1387838787660_12241
A simple SELECT LIMIT like the following will prepare all data from a partition before applying the LIMIT clause, so it's always best to specify the lowest-level relevant partition for the query.
SELECT http_status,uri_path FROM webrequest WHERE year = 2014 LIMIT 10;
You can create your own database using regular SQL
CREATE DATABASE dartar;
You can save the output of a hive query into a TSV by using hive in batch mode, e.g.
hive -e "USE wmf; SELECT foo;" > ./bar.tsv
Dates are expressed as integers due to how we have configured this data. You may encounter this error if you treat dates as strings by quoting:
> Diagnostic Messages for this Task: > Error: java.io.IOException: java.lang.reflect.InvocationTargetException > [...] > Caused by: java.lang.reflect.InvocationTargetException > [...] > Caused by: java.io.EOFException > [...]
Out of Memory Errors on Client
When performing large queries, the hive client may run out of memory. Just look for an out of memory error at the top of the stack trace.
Invoke Hive via the following to allocate more memory
export HADOOP_HEAPSIZE=2048 && beeline
Out of Memory Errors when exporting large datasets
If your query generates a large dataset and you are writing it out to a file (e.g.
beeline -f query.hql > output.tsv), your query might fail due to beeline buffering all the data before writing it out, thereby running out of memory. Passing the parameter
--incremental=true to make beeline print to stdout incrementally as data comes in. The previous example turns into the following command:
beeline --incremental=true -f query.hql > output.tsv
Another solution is of course to switch to a data processing approach that does not require the data to be exported.
Standard Hadoop practice involves launching a long running job and then realizing you forgot a parameter. Here is how you kill a job from the command line:
yarn application -kill [applicationID]
Your application id can be found in the hive messages -- it's the application_XXX string at the end of the tracking URL:
Starting Job = job_1409078537822_2754, Tracking URL = http://analytics1001.eqiad.wmnet:8088/proxy/application_1409078537822_2754/
INSERT OVERWRITE ... IF NOT EXISTS
INSERT OVERWRITE with
IF NOT EXISTS be sure to use unqualified tables. Otherwise, the
IF NOT EXISTS gets silently ignored.
Earliest data available in hadoop
What’s the most efficient way to obtain the earliest, non-truncated hour or date of data still available in hadoop?
show partitions webrequest;
Will show you all of the available partitions. Webrequest data in Hadoop is currently pruned after 62 days.
Avoiding overgreedy scans / Operator precedence
AND takes precedence over
So when trying to select all “mobile” and “text” partitions for 2015-01-01,
... WHERE year=2015 AND month=1 AND day=1 AND webrequest_source='mobile' OR webrequest_source='text'
would not do the trick, as Hive parses it as
... WHERE (year=2015 AND month=1 AND day=1 AND webrequest_source='mobile') OR webrequest_source='text'
and would select all “text” partitions. This unintentionally skews the data you want to obtain, makes the query take longer, and keeps all “text” partitions locked.
You can use parenthesis to override operator precedence. So for example
... WHERE year=2015 AND month=1 AND day=1 AND (webrequest_source='mobile' OR webrequest_source='text')
would select the “mobile” and “text” partitions for 2015-01-01.
My query does not start
My query does not progress / finish
Querying through Hive is not as fast as querying a plain SQL table. So expect to wait an hour for your result to show up.
But if the progress counter in your query does not increase (like at least a 1% per minute), you are either unintentionally querying a lot of data, or the cluster is stalled.
If you are querying “a lot” of data (a week worth of webrequest logs is definitely already more “a lot”!), consider splitting up your query. That will increase performance for you, cut down runtime of your query, and it will leave cluster resources available to other users. If you run into blockers, let the Analytics team know.
Search through logs
If your job is finished, you can find all of the job logs in HDFS at:
Where USER is your username (echo $USER at a shell prompt) and application_id is the
application_XXXXXX_XXXX number that yarn gives you when your app starts.
You can also access these via the yarn CLI, like:
yarn logs -applicationId <applicationId>
More debgugging information in CLI
You can get more debugging information in your CLI by launching Hive with a special configuration option:
hive -hiveconf hive.root.logger=INFO,console
Vertical query results
!set outputformat vertical