You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Analytics/Systems/Presto

From Wikitech-static
< Analytics‎ | Systems
Revision as of 15:02, 21 September 2018 by imported>Joal (Updates in core and druid precisions additions.)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Presto is a distributed SQL query engine for big data open sourced by Facebook (like Hive). We are currently (as of September 2018) on the way to deploy an experimental setup on 3 machines to provide labs users with qurying capabilities over the mediawiki-history dataset.

Why Presto ? Value proposition

We have been looking for a query-engine that would facilitate querying the mediawiki-history dataset. The dataset is somewhat "big" but not one of our largest (~750Gb, 3 billion rows) and the expected queries would be analytics-style queries (group by, count, sum rather than random-read of single rows). Our requirements are to provide a SQL compliant query interface, with interesting SQL-analytics features (window functions). This two requirements are satisfied by Hive but it has a major drawback in that is that it is slow (time-overhead for launching jobs and relying on MapReduce for computation makes the ratio (job-duration/ data-size) very bad for small-ish data).

We had several alternatives for this use-case: Hive, Druid, Clickhouse, and Presto.

Presto has been choosen as the best technology fitting our needs. It was developed to solve Hive issues with speed.

Reasons why we choose Presto:

  • It matches all the SQL needs with the advantage of being SQL-ANSI compliant, by opposition to all other systems that use dialects
  • It is really faster than Hive for small/medium size data. A bit less fast than Clickhouse and Druid for the queries Druid can process (Druid is actually not a general SQL-engine[1]).
  • It reads from HDFS and other big-data storages, making it easy to load/reload/maintain datasets (by opposition to Clickhouse and Druid).
  • It takes advantage of hadoop-standard columnar data format (Parquet)
  • It is the preferred tool of many other big players for querying analytics-oriented data in an exploratory way. It has a live ecosystem.
  1. As of today (September 2018) there two drawbacks on using Druid as a general SQL query engine: there is a significant scope of SQL that Druid would not be able to parse, and a broad range of queries (nested group-by for instance) would fail at computation-stage.