Data Platform/Systems/Presto
Presto is an SQL engine which you can use to query data in the Data Lake . We currently run Presto in read-only mode.
As of June 2025, we are running Presto 0.288.1, so the most accurate documentation is at prestodb.io/docs/0.288/.1 .
Use
The Presto command-line interface is available on all the stat hosts . Here's how to use it:
$ kinit
$ presto --catalog analytics_hive
Presto can also be easily accessed through Python using Wmfdata :
#!/usr/bin/env python3
import wmfdata as wmf
# Returns a Pandas dataframe
wmf.presto.run("SHOW TABLES FROM event")
Catalogs
Presto can connect to many different data sources . The data from each source is represented by a different catalog .
For example, you can use Presto to query data in
analytics_hive.wmf.mediawiki_history
. In this case:
-
analytics_hiveis the catalog -
wmfis the schema (sometimes called the "database") -
mediawiki_historyis the table
List of catalogs
We have the following catalogs in our Presto setup:
-
analytics_hive: Data Lake tables using the Hive table format -
analytics_iceberg: Data Lake tables using the Iceberg table format -
thanos: experimental , Prometheus operational metrics, accessed via Thanos
Accessing a catalog
You can access a particular catalog by passing the right argument to your query tool:
-
for Wmfdata,
wmf.presto.run(query, catalog=...)-
Wmfdata defaults to
analytics_hive
-
Wmfdata defaults to
-
for the command-line interface,
presto --catalog ...
You can also always access a particular table by using the full name, including the catalog, in the query. For example,
SELECT * FROM thanos.default.haproxy_frontend_http_responses_total
will always work no matter what catalog has been specified by the query too.
SQL quirks
-
Strings in double quotes (e.g.
"string") are treated as identifiers (e.g. the name of a table or field). Strings in single quotes (e.g.'string') are treated as string literals. -
Unlike with
Hive
or
Spark
, we run presto in read-only mode so you can only run
SELECTqueries.
Presto on Superset
Presto is available in Superset via the SQL Lab panel:
- go to https://superset.wikimedia.org/sqllab
-
select database
presto_analytics_hive -
select schema
eventfor Eventlogging (just an example, or any available one) - select any table to get a preview of the content, attributes, etc..
Please note: you will be able to see only the data that your user is allowed to, according to the POSIX permissions in puppet.
Background
We have been looking for a query-engine that would facilitate querying the mediawiki-history dataset . The dataset is somewhat "big" but not one of our largest (~750Gb, 3 billion rows) and the expected queries would be analytics-style queries (group by, count, sum rather than random-read of single rows). Our requirements are to provide a SQL compliant query interface, with interesting SQL-analytics features (window functions). This two requirements are "functionally" satisfied by Hive but Hive has significant issues when it comes to performance, there is a significant time-overhead for launching jobs and relying on MapReduce for computation makes the ratio of job-duration to data-size very bad for small-ish data.
We had several alternatives for this use-case: Hive , Druid , Clickhouse , and Presto .
Presto has been chosen as the best technology fitting our needs. It was developed by Facebook to solve Hive issues with speed.
Reasons why we choose Presto:
- It matches all the SQL needs with the advantage of being SQL-ANSI compliant, by opposition to all other systems that use dialects
- It is really faster than Hive for small/medium size data. A bit less fast than Clickhouse and Druid for the queries Druid can process (Druid is actually not a general SQL-engine [ 1 ] ).
- It reads from HDFS and other big-data storage systems, making it easy to load/reload/maintain datasets (unlike Clickhouse and Druid).
- It takes advantage of Parquet, the standard Hadoop columnar data format
- It is the preferred tool of many other big players for querying analytics-oriented data in an exploratory way. It has a live ecosystem.
Administration
Please check Data_Platform/Systems/Presto/Administration .
- ↑ As of today (September 2018) there two drawbacks on using Druid as a general SQL query engine: there is a significant scope of SQL that Druid would not be able to parse, and a broad range of queries (nested group-by for instance) would fail at computation-stage.