You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
Presto is an SQL engine which you can use to query data in the Data Lake. We currently run Presto in read-only mode.
As of August 2022, we are running Presto 0.273.3, so the most accurate documentation is at prestodb.io/docs/0.273.3/.
The Presto command-line interface is available on all the analytics clients. Here's how to use it:
$ kinit $ presto --catalog analytics_hive
Presto can also be easily accessed through Python using wmfdata-python
#!/usr/bin/env python3 import wmfdata as wmf # Returns a Pandas dataframe wmf.presto.run("SHOW TABLES FROM event")
- Strings in double quotes (e.g.
"string") are treated as identifiers (e.g. the name of a table or field). Strings in single quotes (e.g.
'string') are treated as string literals.
- Unlike with Hive or Spark, we run presto in read-only mode so you can only run
Presto on Superset
Presto is available in Superset via the SQL Lab panel:
- go to https://superset.wikimedia.org/superset/sqllab
- select database
- select schema
eventfor Eventlogging (just an example, or any available one)
- select any table to get a preview of the content, attributes, etc..
Please note: you will be able to see only the data that your user is allowed to, according to the POSIX permissions in puppet.
We have been looking for a query-engine that would facilitate querying the mediawiki-history dataset. The dataset is somewhat "big" but not one of our largest (~750Gb, 3 billion rows) and the expected queries would be analytics-style queries (group by, count, sum rather than random-read of single rows). Our requirements are to provide a SQL compliant query interface, with interesting SQL-analytics features (window functions). This two requirements are "functionally" satisfied by Hive but Hive has significant issues when it comes to performance, there is a significant time-overhead for launching jobs and relying on MapReduce for computation makes the ratio of job-duration to data-size very bad for small-ish data.
We had several alternatives for this use-case: Hive, Druid, Clickhouse, and Presto.
Presto has been chosen as the best technology fitting our needs. It was developed by Facebook to solve Hive issues with speed.
Reasons why we choose Presto:
- It matches all the SQL needs with the advantage of being SQL-ANSI compliant, by opposition to all other systems that use dialects
- It is really faster than Hive for small/medium size data. A bit less fast than Clickhouse and Druid for the queries Druid can process (Druid is actually not a general SQL-engine).
- It reads from HDFS and other big-data storage systems, making it easy to load/reload/maintain datasets (unlike Clickhouse and Druid).
- It takes advantage of Parquet, the standard Hadoop columnar data format
- It is the preferred tool of many other big players for querying analytics-oriented data in an exploratory way. It has a live ecosystem.
Please check Analytics/Systems/Presto/Administration.
- ↑ As of today (September 2018) there two drawbacks on using Druid as a general SQL query engine: there is a significant scope of SQL that Druid would not be able to parse, and a broad range of queries (nested group-by for instance) would fail at computation-stage.