You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org
Presto is a distributed SQL query engine for big data open sourced by Facebook (like Hive). We are currently (as of September 2018) on the way to deploy an experimental setup on 3 machines to provide labs users with qurying capabilities over the mediawiki-history dataset.
Why Presto ? Value proposition
We have been looking for a query-engine that would facilitate querying the mediawiki-history dataset. The dataset is somewhat "big" but not one of our largest (~750Gb, 3 billion rows) and the expected queries would be analytics-style queries (group by, count, sum rather than random-read of single rows). Our requirements are to provide a SQL compliant query interface, with interesting SQL-analytics features (window functions). This two requirements are satisfied by Hive but it has a major drawback in that is that it is slow (time-overhead for launching jobs and relying on MapReduce for computation makes the ratio (job-duration/ data-size) very bad for small-ish data).
Presto has been choosen as the best technology fitting our needs. It was developed to solve Hive issues with speed.
Reasons why we choose Presto:
- It matches all the SQL needs with the advantage of being SQL-ANSI compliant, by opposition to all other systems that use dialects
- It is really faster than Hive for small/medium size data. A bit less fast than Clickhouse and Druid for the queries Druid can process (Druid is actually not a general SQL-engine).
- It reads from HDFS and other big-data storages, making it easy to load/reload/maintain datasets (by opposition to Clickhouse and Druid).
- It takes advantage of hadoop-standard columnar data format (Parquet)
- It is the preferred tool of many other big players for querying analytics-oriented data in an exploratory way. It has a live ecosystem.
- As of today (September 2018) there two drawbacks on using Druid as a general SQL query engine: there is a significant scope of SQL that Druid would not be able to parse, and a broad range of queries (nested group-by for instance) would fail at computation-stage.