
Data Platform/Systems/Presto

From Wikitech

Presto is an SQL engine which you can use to query data in the Data Lake. We currently run Presto in read-only mode.

As of June 2025, we are running Presto 0.288.1, so the most accurate documentation is at prestodb.io/docs/0.288/.

Use

The Presto command-line interface is available on all the stat hosts. Here's how to use it:

$ kinit

$ presto --catalog analytics_hive

Presto can also be easily accessed through Python using Wmfdata:

#!/usr/bin/env python3
import wmfdata as wmf

# Returns a Pandas dataframe
wmf.presto.run("SHOW TABLES FROM event")
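The returned object is an ordinary Pandas DataFrame, so standard Pandas operations apply. A minimal sketch, using a stand-in DataFrame since running the real query requires cluster access:

```python
import pandas as pd

# Stand-in for the DataFrame that wmf.presto.run("SHOW TABLES FROM event")
# would return on a stat host (the real table list will differ).
tables = pd.DataFrame({"Table": ["mediawiki_page_create", "navigationtiming"]})

# Filter the result like any other DataFrame.
mediawiki_tables = tables[tables["Table"].str.startswith("mediawiki")]
print(mediawiki_tables["Table"].tolist())  # ['mediawiki_page_create']
```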

Catalogs

Presto can connect to many different data sources. The data from each source is represented by a different catalog.

For example, you can use Presto to query data in analytics_hive.wmf.mediawiki_history. In this case:

  • analytics_hive is the catalog
  • wmf is the schema (sometimes called the "database")
  • mediawiki_history is the table

List of catalogs

We have the following catalogs in our Presto setup:

Accessing a catalog

You can access a particular catalog by passing the right argument to your query tool:

  • for Wmfdata, wmf.presto.run(query, catalog=...)
    • Wmfdata defaults to analytics_hive
  • for the command-line interface, presto --catalog ...

You can also always access a particular table by using the full name, including the catalog, in the query. For example, SELECT * FROM thanos.default.haproxy_frontend_http_responses_total will always work, no matter what catalog has been specified by the query tool.
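As a sketch, a hypothetical helper (not part of Presto or Wmfdata) that assembles such fully qualified names, quoting each part as an identifier so that unusual names are handled safely:

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a fully qualified Presto table name, quoting each part as an
    identifier (double quotes, with embedded double quotes doubled)."""
    parts = (catalog, schema, table)
    return ".".join('"' + p.replace('"', '""') + '"' for p in parts)

print(qualified_name("thanos", "default", "haproxy_frontend_http_responses_total"))
# "thanos"."default"."haproxy_frontend_http_responses_total"
```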

SQL quirks

  • Strings in double quotes (e.g. "string") are treated as identifiers (e.g. the name of a table or field). Strings in single quotes (e.g. 'string') are treated as string literals.
  • Unlike Hive or Spark, we run Presto in read-only mode, so you can only run SELECT queries.
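To illustrate the quoting quirk, here is a sketch of two hypothetical helpers (not part of Presto or Wmfdata) that apply the two rules when building a query string:

```python
def as_identifier(name: str) -> str:
    # Double quotes: Presto treats this as the name of a table or field.
    return '"' + name.replace('"', '""') + '"'

def as_literal(value: str) -> str:
    # Single quotes: Presto treats this as a string value.
    return "'" + value.replace("'", "''") + "'"

query = (
    "SELECT " + as_identifier("page_title")
    + " FROM wmf.mediawiki_history"
    + " WHERE " + as_identifier("event_entity") + " = " + as_literal("revision")
)
print(query)
# SELECT "page_title" FROM wmf.mediawiki_history WHERE "event_entity" = 'revision'
```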

Presto on Superset

Presto is available in Superset via the SQL Lab panel:

  • go to https://superset.wikimedia.org/sqllab
  • select database presto_analytics_hive
  • select schema event for EventLogging data (as an example; any available schema works)
  • select any table to get a preview of its content, columns, etc.

Please note: you will only be able to see the data that your user is allowed to access, according to the POSIX permissions defined in Puppet.

Background

We had been looking for a query engine that would facilitate querying the mediawiki-history dataset. The dataset is somewhat "big" but not one of our largest (~750 GB, 3 billion rows), and the expected queries are analytics-style (group by, count, sum rather than random reads of single rows). Our requirements were a SQL-compliant query interface with useful SQL analytics features (such as window functions). These two requirements are functionally satisfied by Hive, but Hive has significant performance issues: there is a substantial time overhead for launching jobs, and relying on MapReduce for computation makes the ratio of job duration to data size very poor for small-ish data.

We had several alternatives for this use case: Hive, Druid, Clickhouse, and Presto.

Presto was chosen as the technology best fitting our needs. It was developed by Facebook to address Hive's issues with speed.

Reasons why we chose Presto:

  • It matches all our SQL needs with the advantage of being ANSI SQL compliant, as opposed to all the other systems, which use dialects
  • It is much faster than Hive for small/medium-sized data, and only somewhat slower than Clickhouse and Druid for the queries Druid can process (Druid is actually not a general SQL engine [1]).
  • It reads from HDFS and other big-data storage systems, making it easy to load/reload/maintain datasets (unlike Clickhouse and Druid).
  • It takes advantage of Parquet, the standard Hadoop columnar data format
  • It is the preferred tool of many other big players for querying analytics-oriented data in an exploratory way, and it has an active ecosystem.

Administration

Please check Data_Platform/Systems/Presto/Administration.

  1. As of September 2018, there are two drawbacks to using Druid as a general SQL query engine: there is a significant scope of SQL that Druid is not able to parse, and a broad range of queries (nested group-bys, for instance) fail at the computation stage.