Analytics/Systems/Cluster/Hive/Queries
== Writing queries ==


=== Handy column names ===
By default, Hive prefixes each result column with its table name; to get bare column names, run:
<syntaxhighlight lang="sql">
set hive.resultset.use.unique.column.names=false;
</syntaxhighlight>

=== Use partitions ===
There is a ''lot'' of data in Hive, so you want to make sure your queries have time bounds so they complete in a reasonable amount of time.
For most tables you will be copying and pasting
:<kbd>WHERE year={{CURRENTYEAR}} AND month={{CURRENTMONTH1}} AND day={{CURRENTDAY}}</kbd>
to all your queries, and maybe appending <kbd>AND hour=''NN''</kbd> as well.

In fact, you ''must'' restrict your query to some Hive "partition", otherwise Hive will fail with the error "No partition predicate found". Most tables are partitioned by year/month/day/hour.
 
For example, you might see this simple query fail:
<pre>
0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> SELECT agent_type FROM webrequest LIMIT 5;
Error: Error while compiling statement: FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "webrequest" Table "webrequest" (state=42000,code=10041)</pre>
you must restrict it, for example:
<pre>
0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> SELECT agent_type FROM webrequest WHERE webrequest_source='misc' AND year=2016 AND month=4 AND day=29 and hour=0 LIMIT 5;
agent_type
user
user
user
spider
user
</pre>
 
=== INSERT OVERWRITE ... IF NOT EXISTS ===
When using <code>INSERT OVERWRITE</code> with <code>IF NOT EXISTS</code> be sure to use unqualified tables. Otherwise, the <code>IF NOT EXISTS</code> gets silently ignored.
 
===Earliest data available in hadoop===
What’s the most efficient way to obtain the earliest, non-truncated hour or date of data still available in hadoop?
<syntaxhighlight lang="bash"> 
show partitions webrequest;
</syntaxhighlight>
This will show you all of the available partitions. Webrequest data in Hadoop is currently pruned after 62 days.
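If you need the earliest partition programmatically, the output of <code>show partitions</code> (one <code>key=value/key=value</code> line per partition) is easy to parse. A minimal sketch in Python, using made-up partition strings in webrequest's layout:

```python
# Hypothetical sample of `show partitions webrequest` output; the real
# command prints one key=value/key=value line per partition.
partitions = [
    "webrequest_source=text/year=2022/month=5/day=21/hour=3",
    "webrequest_source=text/year=2022/month=4/day=2/hour=0",
    "webrequest_source=misc/year=2022/month=4/day=2/hour=7",
]

def partition_time(line):
    """Return (year, month, day, hour) as integers for chronological sorting."""
    fields = dict(kv.split("=") for kv in line.split("/"))
    return tuple(int(fields[k]) for k in ("year", "month", "day", "hour"))

# The chronologically smallest partition is the earliest data still available.
earliest = min(partitions, key=partition_time)
print(earliest)
```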
 
===Avoiding overgreedy scans / Operator precedence===
In HiveQL, <code>AND</code> takes precedence over <code>OR</code>.
 
So when trying to select all “mobile” and “text” partitions for 2015-01-01,
<syntaxhighlight lang="sql"> 
  ... WHERE year=2015 AND month=1 AND day=1 AND webrequest_source='mobile' OR webrequest_source='text'
</syntaxhighlight>
would not do the trick, as Hive parses it as
<syntaxhighlight lang="sql"> 
  ... WHERE (year=2015 AND month=1 AND day=1 AND webrequest_source='mobile') OR webrequest_source='text'
</syntaxhighlight>
and would select ''all'' “text” partitions. This unintentionally skews the data you want to obtain, makes the query take longer, and keeps all “text” partitions locked.
 
You can use parentheses to override operator precedence. So for example
<syntaxhighlight lang="sql">
  ... WHERE year=2015 AND month=1 AND day=1 AND (webrequest_source='mobile' OR webrequest_source='text')
</syntaxhighlight>
would select the “mobile” and “text” partitions for 2015-01-01.
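The same precedence pitfall can be reproduced in plain Python, whose <code>and</code> also binds tighter than <code>or</code> (a local illustration, not Hive itself):

```python
# Pretend this row comes from a "text" partition *outside* the wanted date range.
in_date_range = False
source = "text"

# Without parentheses the date predicate only guards the 'mobile' branch,
# so every 'text' row slips through regardless of the date:
unparenthesized = in_date_range and source == "mobile" or source == "text"

# With parentheses the date predicate guards both branches:
parenthesized = in_date_range and (source == "mobile" or source == "text")

print(unparenthesized, parenthesized)
```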
 
=== User-defined functions ===
You can create or reuse UDFs (user-defined functions), see [[Analytics/Cluster/Hive/QueryUsingUDF]].
WMF has developed several UDFs, such as <code>is_wikimedia_bot</code>, <code>geocoded_data</code>, etc.
Some of them are used to build the derived tables from raw Hadoop data.
 
===Slow queries===
A simple SELECT LIMIT like the following will prepare all data from a partition before applying the LIMIT clause, so it's always best to specify the lowest-level relevant partition for the query.
<syntaxhighlight lang="SQL">
SELECT http_status,uri_path FROM webrequest WHERE year = 2014 LIMIT 10;
</syntaxhighlight>
===Counting rows which match a criterion===
A common use case in querying data is counting the number of rows which match a criterion. With Hive, a good way to do this is by casting an expression which produces a boolean value to an integer and taking the sum. For example:<syntaxhighlight lang="sql">
select sum(cast(distinct_editors >= 500 as int))
from geowiki_monthly
where month = "2018-03"
</syntaxhighlight>This counts the number of rows where the <code>month</code> is March 2018 and the value of <code>distinct_editors</code> is at least 500.
 
You can't simply use <code>count()</code> on a boolean expression, as in <code>count(distinct_editors >= 500)</code>, because that counts the rows where the expression is not null—which includes rows where it is false!
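The difference is easy to see outside Hive. A small Python analogue of <code>SUM(CAST(...))</code> versus <code>COUNT(...)</code>, with made-up sample values:

```python
# Made-up stand-in for a distinct_editors column, including one NULL.
distinct_editors = [120, 560, None, 730, 40]

# SUM(CAST(expr AS INT)) analogue: only rows where the predicate is true count.
matches = sum(int(v >= 500) for v in distinct_editors if v is not None)

# COUNT(expr) analogue: every non-NULL evaluation counts, whether true or false.
non_null = sum(1 for v in distinct_editors if v is not None)

print(matches, non_null)
```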


== Running queries ==
=== Command line ===
{{Warn|Before you can use <code>hive</code> or <code>beeline</code>, you first need to authenticate with Kerberos. See also [[Analytics/Systems/Kerberos/UserGuide#Authenticate via Kerberos|Authenticate via Kerberos]].}}
There are two command line programs available for accessing Hive, <code>hive</code> and <code>beeline</code>. <code>hive</code> is officially deprecated in favor of <code>beeline</code>, but as of October 2018, the Analytics team does not recommend migrating to <code>beeline</code>: the <code>hive</code> client still has significantly better error reporting and a few other advantages.


Generally, the two clients can be used identically; see [[Analytics/Cluster/Beeline]] for Beeline-specific documentation.


Once you can <code>ssh</code> to stat1004 you can simply access the Hive command-line interface by entering <code>hive</code> or <code>beeline</code>. Here's the start of a sample interactive session:
<syntaxhighlight lang="bash">
nuria@stat1004:~$ beeline

0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> show databases;
  wmf
  ...
0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> use wmf;
No rows affected (0.019 seconds)

0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> show tables;
tab_name
aqs_hourly
browser_general
last_access_uniques_daily
last_access_uniques_monthly
mediacounts
mobile_apps_session_metrics
mobile_apps_session_metrics_by_os
mobile_apps_uniques_daily
mobile_apps_uniques_monthly
pagecounts_all_sites
pageview_hourly
pageview_unexpected_values
pageview_whitelist
projectview_hourly
webrequest
webrequest_empty
16 rows selected (0.028 seconds)
</syntaxhighlight>


You can also write your query in a text file, e.g. named <tt>my_query.hql</tt>, and execute it. This example redirects the output of the query (by default in tab-separated values format) to <tt>results.txt</tt>:
<syntaxhighlight lang="bash">
  $ beeline -f my_query.hql > results.txt
</syntaxhighlight>
 


=== Use wmf ===

You should <syntaxhighlight lang=sql inline>use wmf</syntaxhighlight> (instead of the <code>wmf_raw</code> database) if you can, or your queries will be slow. The <code>wmf</code> database includes filtered and preprocessed data.


=== Output to a file ===
<syntaxhighlight lang="bash">
beeline -f my_query.hql > out.txt
</syntaxhighlight>

=== Run queries in Cron ===
<syntaxhighlight lang="bash">
# The USER variable needs to be set with the username
# that will run the queries. Example: USER=elukey
USER=<user> /usr/local/bin/beeline -f my_query.hql > out.txt
</syntaxhighlight>


=== Run long queries in a screen session and in the nice queue ===
* '''Screen''': If your queries take a long time to run you can execute them in a <code>[[w:GNU_Screen|screen]]</code> session. If you are accidentally or intentionally disconnected from a screen session, your remote shell and queries continue to run.
* '''Nice queue''': For long queries, it's good practice to let fast queries take precedence for resources on the cluster. To do that, run long queries in the <code>nice</code> yarn queue:
** In the hive CLI, execute the command: <code>SET mapred.job.queue.name=nice;</code>
** In batch mode, add a parameter: <code>hive --hiveconf mapred.job.queue.name=nice ...</code>
* '''Email notification''': For long-running queries, one may want to get an email alert as soon as the result is available. See [[phab:P7948]] for an example of how to do this using a bash script.


=== Vertical query results ===
<code>!set outputformat vertical</code>


== Editing Schemas ==
Hive allows <code>ALTER TABLE</code> statements, but there are gotchas compared to a traditional RDBMS where you would issue such commands:

* There's something really weird going on with the wmf_raw database. When doing <code>ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;</code> sometimes it just doesn't work, and Hive gives an error when trying to, for example, <code>desc wmf_raw.webrequest</code>. This is not consistent, but it did happen to both ottomata and milimetric.
* Adding a column to wmf_raw.webrequest broke selects from that table for existing partitions. Both <code>select * ...</code> and <code>select hostname ...</code> broke, so it's not a matter of Hive metadata. The fix is to drop a partition and recreate it; querying will then work for that partition only.
== Troubleshooting ==


=== Query fails with generic "Execution Error" ===
You may find that queries fail with an error message like "FAILED: Execution Error" and an error code like "return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask".


In this case, the first thing to do is make sure you have the detailed error message. For some reason, [https://cwiki.apache.org/confluence/display/Hive/HiveServer2%2BClients#HiveServer2Clients-BeelineHiveCommands HiveServer2 or Thrift clients] like the Hue interface, the <code>beeline</code> command line program, or the [https://github.com/cloudera/impyla impyla] Python library don't give specific error messages. Try rerunning your query using the <code>hive</code> command line program, which should give you more detail on the error you've encountered.


=== Hadoop containers run out of memory ===
If your query fails with a diagnostic message like the following:
  Container [pid=40261,containerID=container_e85_1538672679435_0171_01_002286] is running beyond physical memory limits. Current usage: 2.3 GB of 2 GB physical memory used; 4.1 GB of 4.2 GB virtual memory used. Killing container.
try increasing the [https://stackoverflow.com/questions/24070557/ddg#25945896 map memory setting] with the command <code>SET mapreduce.map.memory.mb=4096</code>.


=== JsonSerDe Errors ===
If you get errors like:
  FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found
then enter the following:
  ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar ;
Alternatively, you could try
  ADD JAR /usr/lib/hive/lib/hive-serde.jar;


=== Killing a running query ===
Once you submit a query, it is handed off to Hadoop. Hadoop runs the query as a YARN application. The Hive CLI is then detached from the actual application. If you Ctrl-C your Hive CLI, you will quit the interface you used to submit the query, but will not actually kill the application. To kill the application, you have to tell YARN you want it dead.

Note the application ID from when your query started. You should see something like:
  Starting Job = job_1387838787660_12241, Tracking URL = http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_12241/
Then kill it with:
   yarn application -kill application_1387838787660_12241


===Scratch space===
You can create your own database using regular SQL
<syntaxhighlight lang="SQL">
CREATE DATABASE dartar;
</syntaxhighlight>
===Batch mode===
You can save the output of a hive query into a TSV by using hive in batch mode, e.g.
<syntaxhighlight lang="bash">
hive -e "USE wmf; SELECT foo;" > ./bar.tsv
</syntaxhighlight>
===Dates===
Dates are expressed as integers due to how we have configured this data. You may encounter this error if you treat dates as strings by quoting:
<syntaxhighlight lang="java">
> Diagnostic Messages for this Task:
> Error: java.io.IOException: java.lang.reflect.InvocationTargetException
> Caused by: java.io.EOFException
>      [...]
</syntaxhighlight>
===Out of Memory Errors on Client===
When performing large queries, the hive client may run out of memory. Just look for an out of memory error at the top of the stack trace.

Invoke Hive via the following to allocate more memory:
<syntaxhighlight lang="bash">
  export HADOOP_HEAPSIZE=2048 && beeline
</syntaxhighlight>


===Out of Memory Errors when exporting large datasets===
If your query generates a large dataset and you are writing it out to a file (e.g. <code>beeline -f query.hql > output.tsv</code>), your query might fail because beeline buffers all the data before writing it out, thereby running out of memory. Pass the parameter <code>--incremental=true</code> to make beeline print to stdout incrementally as data comes in. The previous example turns into the following command:

<syntaxhighlight lang="bash">
beeline --incremental=true -f query.hql > output.tsv
</syntaxhighlight>
 


Another solution is of course to switch to a data processing approach that does not require the data to be exported.


===My query does not start===
Check if the [[Analytics/Cluster/Hadoop/Load|cluster is overloaded]]. If it is, [[Analytics/Cluster/Hadoop/Load#What_to_do_if_the_cluster_is_stalling.3F|let the Analytics team know]].




If your query is small, nice and neat, and does not progress nonetheless, please check if the [[Analytics/Cluster/Hadoop/Load|cluster is overloaded]]. If it is, [[Analytics/Cluster/Hadoop/Load#What_to_do_if_the_cluster_is_stalling.3F|let the Analytics team know]].


=== Search through logs ===
   yarn logs -applicationId <applicationId>


=== More debugging information in CLI ===
You can get more debugging information in your CLI by launching Hive with a special configuration option:
  hive -hiveconf hive.root.logger=INFO,console


=== Use RLIKE instead of LIKE when working with multi-line text ===
As documented in [https://issues.apache.org/jira/browse/HIVE-22008 this Hive issue], the <code>LIKE</code> operator in Hive doesn't match patterns over multi-line strings.

Example: <code>my_string LIKE 'abc%def'</code> will return false if <code>my_string</code> is <code>abc\ndef</code>.

To make the above work, use <code>RLIKE</code> with the dot-matches-all flag <code>(?s)</code>, since by default <code>.</code> doesn't match newlines (this is the reason for the behavior in Hive).

'''Warning''': don't forget to use the beginning-of-string <code>^</code> and end-of-string <code>$</code> regex markers when using RLIKE, as RLIKE matches substrings while LIKE matches entire strings.
 
Example: <code>my_string RLIKE '(?s)^abc.*def$'</code> matches <code>abc\ndef</code>
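The same behaviour can be checked locally with Python's <code>re</code> module, where <code>(?s)</code> has the identical meaning:

```python
import re

s = "abc\ndef"

# Without (?s), '.' refuses to cross the newline, like Hive's pattern default.
plain = re.match(r"^abc.*def$", s)

# With (?s), '.' also matches newlines, so the whole string matches.
dotall = re.match(r"(?s)^abc.*def$", s)

# RLIKE-style matching finds substrings too; anchors make it match whole strings.
substring = re.search(r"(?s)c.*d", s)

print(plain is None, dotall is not None, substring is not None)
```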
 
=== EventLogging ===
To query EventLogging data from Hadoop using Hive, see:
* [[Analytics/Systems/EventLogging#Accessing data|EventLogging#Accessing data]]
* [[Analytics/Data access#EventLogging data]]
* [[Analytics/Systems/EventLogging/Data retention and auto-purging|EventLogging/Data retention and auto-purging]]

Latest revision as of 18:26, 4 September 2021

Writing queries

Handy column names

set hive.resultset.use.unique.column.names=false;

Use partitions

There is a lot of data in Hive, so you want to make sure your queries have time bounds so they complete in a reasonable amount of time. For most tables you will be copying and pasting

WHERE year=2022 AND month=5 AND day=20

to all your queries, and maybe appending AND hour=NN as well.

In fact, you must restrict your query to some Hive "partition", otherwise Hive will fail with the error "No partition predicate found". The partitioning of most tables is by year/month/day/hour.

For example, seeing this simple query fail:

0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> SELECT agent_type FROM webrequest LIMIT 5;
Error: Error while compiling statement: FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "webrequest" Table "webrequest" (state=42000,code=10041)

you must restrict it, for example:

0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> SELECT agent_type FROM webrequest WHERE webrequest_source='misc' AND year=2016 AND month=4 AND day=29 and hour=0 LIMIT 5;
agent_type
user
user
user
spider
user

INSERT OVERWRITE ... IF NOT EXISTS

When using INSERT OVERWRITE with IF NOT EXISTS be sure to use unqualified tables. Otherwise, the IF NOT EXISTS gets silently ignored.

Earliest data available in hadoop

What’s the most efficient way to obtain the earliest, non-truncated hour or date of data still available in hadoop?

  
show partitions webrequest;

Will show you all of the available partitions. Webrequest data in Hadoop is currently pruned after 62 days.

Avoiding overgreedy scans / Operator precedence

In HiveQL, AND takes precedence over OR.

So when trying to select all “mobile” and “text” partitions for 2015-01-01,

  
  ... WHERE year=2015 AND month=1 AND day=1 AND webrequest_source='mobile' OR webrequest_source='text'

would not do the trick, as Hive parses it as

  
  ... WHERE (year=2015 AND month=1 AND day=1 AND webrequest_source='mobile') OR webrequest_source='text'

and would select all “text” partitions. This unintentionally skews the data you want to obtain, makes the query take longer, and keeps all “text” partitions locked.

You can use parenthesis to override operator precedence. So for example

  
  ... WHERE year=2015 AND month=1 AND day=1 AND (webrequest_source='mobile' OR webrequest_source='text')

would select the “mobile” and “text” partitions for 2015-01-01.

User-defined functions

You can create or reuse UDFs (user-defined functions), see Analytics/Cluster/Hive/QueryUsingUDF. WMF has developed several UDFs, such as is_wikimedia_bot, geocoded_data, etc. Some are run to create its derived tables from raw Hadoop information.

Slow queries

A simple SELECT LIMIT like the following will prepare all data from a partition before applying the LIMIT clause, so it's always best to specify the lowest-level relevant partition for the query.

SELECT http_status,uri_path FROM webrequest WHERE year = 2014 LIMIT 10;

Counting rows which match a criterion

A common use case in querying data is counting the number of rows which match a criterion. With Hive, a good way to do this is by casting an expression which produces a boolean value to an integer and taking the sum. For example:

select sum(cast(distinct_editors >= 500 as int))
from geowiki_monthly
where month = "2018-03"

This counts the number of rows where the month is March 2018 and the value of distinct_editors is greater than 500.

You can't simply use count() on a boolean expression, as in count(distinct_editors >= 500), because that counts the rows where the expression is not null—which includes rows where it is false!

Running queries

Command line

There are two command line programs available for accessing Hive, hive and beeline. hive is officially deprecated in favor of beeline, but as of October 2018, the Analytics team does not recommend migrating to it. The hive client still has significantly better error reporting and a few other advantages.

Generally, the two clients can be used identically; see Analytics/Cluster/Beeline for Beeline-specific documentation.

Once you can ssh to stat1004 you can simply access the hive command-line interface by entering hive or beeline. Here's the start of a sample interactive session:

nuria@stat1004:~$ beeline

0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> show databases;
 wmf
 ...
0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> use wmf;
No rows affected (0.019 seconds)

0: jdbc:hive2://analytics-hive.eqiad.wmnet:100> show tables;
tab_name
aqs_hourly
browser_general
last_access_uniques_daily
last_access_uniques_monthly
mediacounts
mobile_apps_session_metrics
mobile_apps_session_metrics_by_os
mobile_apps_uniques_daily
mobile_apps_uniques_monthly
pagecounts_all_sites
pageview_hourly
pageview_unexpected_values
pageview_whitelist
projectview_hourly
webrequest
webrequest_empty
16 rows selected (0.028 seconds)

You can also write your query in a text file, e.g. named my_query.hql, and execute it. This example redirects the output of the query (by default in tab-separated values format) to results.txt:

 $ beeline -f my_query.hql > results.txt

Use wmf

You should use wmf database (instead of the wmf_raw database) if you can, or your queries will be slow. The wmf database includes filtered and preprocessed data.

Output to a file

beeline -f my_query.hql > out.txt 

Run queries in Cron

# The USER variable needs to be set with the username
# that will run the queries. Example: USER=elukey
USER=<user> /usr/local/bin/beeline -f my_query.hql > out.txt

Run long queries in a screen session and in the nice queue

  • Screen: If your queries take a long time to run you can execute them in a screen session. If you are accidentally disconnected, or you intentionally disconnect, from a screen session, your remote shell and queries continue to run.
  • Nice queue: For long queries, it's good practice to let fast queries take precedence for resources on the cluster. In order to do that, long queries should be run in the nice yarn queue. You can do so by
    • In hive CLI, execute the command: SET mapred.job.queue.name=nice;
    • In batch mode, add a parameter: hive --hiveconf mapred.job.queue.name=nice ...
  • Email notification: For long-running queries, one may want to get an email alert as soon as the result is available. See phab:P7948 for an example of how to do this using a bash script.

Vertical query results

!set outputformat vertical

Editing Schemas

Hive allows ALTER TABLE statements, but there are gotchas compared to a traditional RDBMS where you would issue such commands:

  • There's something really weird going on with the wmf_raw database. When doing ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar; sometimes it just doesn't work, and Hive gives an error when trying to, for example, desc wmf_raw.webrequest. This is not consistent but it did happen to both ottomata and milimetric.
  • Adding a column to wmf_raw.webrequest broke selects from that table for existing partitions. Both select * ... and select hostname ... broke, so it's not a matter of Hive metadata. The fix is to drop a partition and recreate it, querying will then work for that partition only.

Troubleshooting

Query fails with generic "Execution Error"

You may find that queries fail with an error message like "FAILED: Execution Error" and an error code like "return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask".

In this case, the first thing to do is make sure you have the detailed error message. For some reason, HiveServer2 Thrift clients like the Hue interface, the beeline command-line program, and the impyla Python library don't give specific error messages. Try rerunning your query using the hive command-line program, which should give you more detail on the error you've encountered.

Hadoop containers run out of memory

If your query fails with a diagnostic message like the following:

Container [pid=40261,containerID=container_e85_1538672679435_0171_01_002286] is running beyond physical memory limits. Current usage: 2.3 GB of 2 GB physical memory used; 4.1 GB of 4.2 GB virtual memory used. Killing container.

Try increasing the map memory setting with the command SET mapreduce.map.memory.mb=4096;

JsonSerDe Errors

If you get errors like:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found

then enter the following:

ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar ;

Alternatively, you could try

ADD JAR /usr/lib/hive/lib/hive-serde.jar;

Killing a running query

Once you submit a query, it is handed off to Hadoop. Hadoop runs the query as a YARN application. The Hive CLI is then detached from the actual application. If you Ctrl-C your Hive CLI, you will quit the interface you used to submit the query, but will not actually kill the application. To kill the application, you have to tell YARN you want it dead.

Note the application ID from when your query started. You should see something like:

 Starting Job = job_1387838787660_12241, Tracking URL = http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_12241/

The application ID in this case is application_1387838787660_12241. To kill this application, run:

 yarn application -kill application_1387838787660_12241
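If you didn't note the application ID when the query started, you can look it up first; both subcommands below are standard yarn CLI (the ID shown is illustrative):

```shell
# List running YARN applications and filter for your own ...
yarn application -list | grep $USER
# ... then kill the one you want dead.
yarn application -kill application_1387838787660_12241
```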

Scratch space

You can create your own database using regular SQL

CREATE DATABASE dartar;
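You can then create scratch tables inside your own database, for example (table and column names are illustrative):

```sql
-- Create a personal database (if it doesn't exist yet) and a scratch
-- table inside it.
CREATE DATABASE IF NOT EXISTS dartar;
CREATE TABLE dartar.scratch_agent_counts (
  agent_type STRING,
  requests   BIGINT
);
```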

Batch mode

You can save the output of a hive query into a TSV by using hive in batch mode, e.g.

hive  -e "USE wmf; SELECT foo;" > ./bar.tsv

Dates

Dates are expressed as integers (not strings) due to how we have configured this data. You may encounter an error like the following if you treat dates as strings by quoting them:

> Diagnostic Messages for this Task:
> Error: java.io.IOException: java.lang.reflect.InvocationTargetException
>       [...]
> Caused by: java.lang.reflect.InvocationTargetException
>       [...]
> Caused by: java.io.EOFException
>       [...]

Out of Memory Errors on Client

When performing large queries, the Hive client may run out of memory; look for an out-of-memory error at the top of the stack trace.

Allocate more memory to the client by setting HADOOP_HEAPSIZE before invoking it:

 export HADOOP_HEAPSIZE=2048 && beeline

Out of Memory Errors when exporting large datasets

If your query generates a large dataset and you are writing it out to a file (e.g. beeline -f query.hql > output.tsv), your query might fail because beeline buffers all the data before writing it out, thereby running out of memory. Pass the parameter --incremental=true to make beeline print to stdout incrementally as data comes in. The previous example then becomes the following command:

beeline --incremental=true -f query.hql > output.tsv

Another solution is of course to switch to a data processing approach that does not require the data to be exported.

My query does not start

Check if the cluster is overloaded. If it is, let the Analytics team know.

My query does not progress / finish

Querying through Hive is not as fast as querying a plain SQL table, so your result may take up to an hour to show up.

But if the progress counter of your query does not increase (by at least 1% per minute), you are either unintentionally querying a lot of data, or the cluster is stalled.

If you are querying “a lot” of data (a week's worth of webrequest logs is definitely already “a lot”!), consider splitting up your query. That will increase performance, cut down the runtime of your query, and leave cluster resources available to other users. If you run into blockers, let the Analytics team know.

If your query is small, nice and neat, and does not progress nonetheless, please check if the cluster is overloaded. If it is, let the Analytics team know.

Search through logs

If your job is finished, you can find all of the job logs in HDFS at:

/mnt/hdfs/var/log/hadoop-yarn/apps/<user>/

Where <user> is your username (echo $USER at a shell prompt). Within that directory, logs are stored per application ID, the application_XXXXXX_XXXX number that yarn gives you when your application starts.

You can also access these via the yarn CLI, like:

 yarn logs -applicationId <applicationId>
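Since the aggregated logs can be very long, it helps to page or filter them with ordinary shell pipes (the application ID is illustrative):

```shell
# Page through the logs, or grep them for the actual error message.
yarn logs -applicationId application_1387838787660_12241 | less
yarn logs -applicationId application_1387838787660_12241 | grep -i -A 5 'error'
```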

More debugging information in CLI

You can get more debugging information in your CLI by launching Hive with a special configuration option:

hive  -hiveconf hive.root.logger=INFO,console

Use RLIKE instead of LIKE when working with multi-line text

As documented in this Hive issue, the LIKE operator in Hive doesn't match patterns over multi-line strings.

Example: my_string LIKE 'abc%def' will return false if my_string is abc\ndef.

In order to make the above work, you need to use RLIKE with the dot-matches-all flag (?s) set, as by default . doesn't match newlines (this is the cause of the LIKE behavior in Hive).

Warning: don't forget to use the beginning-of-string ^ and end-of-string $ regex markers when using RLIKE, as it matches substrings, while LIKE matches entire strings.

Example: my_string RLIKE '(?s)^abc.*def$' matches abc\ndef
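A minimal sketch of the difference (recent Hive versions support SELECT without a FROM clause; on older versions, select from any one-row table instead):

```sql
-- LIKE fails across the newline; RLIKE with (?s) matches it.
SELECT 'abc\ndef' LIKE 'abc%def';          -- false: pattern doesn't match across the newline
SELECT 'abc\ndef' RLIKE '(?s)^abc.*def$';  -- true: (?s) makes . match newlines too
```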

EventLogging

To query EventLogging data from Hadoop using Hive, see: