Analytics/Systems/Presto

Presto is a distributed SQL query engine for big data, open-sourced by Facebook (like Hive).

Why Presto?

We have been looking for a query engine to facilitate querying the mediawiki-history dataset. The dataset is somewhat "big" but not one of our largest (~750 GB, 3 billion rows), and the expected queries are analytics-style (group by, count, sum rather than random reads of single rows). Our requirements are a SQL-compliant query interface with interesting SQL analytics features (window functions). These two requirements are functionally satisfied by Hive, but Hive has significant performance issues: there is a large time overhead for launching jobs, and relying on MapReduce for computation makes the ratio of job duration to data size very poor for small-ish data.

We had several alternatives for this use-case: Hive, Druid, Clickhouse, and Presto.

Presto has been chosen as the technology that best fits our needs. It was developed by Facebook to address Hive's speed issues.

Reasons why we chose Presto:

  • It matches all our SQL needs, with the advantage of being ANSI SQL compliant, as opposed to the other systems, which all use dialects
  • It is significantly faster than Hive for small/medium-size data, though somewhat slower than Clickhouse and Druid for the queries Druid can process (Druid is actually not a general SQL engine[1])
  • It reads from HDFS and other big-data storage systems, making it easy to load/reload/maintain datasets (unlike Clickhouse and Druid)
  • It takes advantage of Parquet, the standard Hadoop columnar data format
  • It is the preferred tool of many other big players for querying analytics-oriented data in an exploratory way, and it has a lively ecosystem

Usage on analytics cluster

The Presto CLI is deployed on all the analytics clients. Here's how to use it:

$ kinit

$ presto --catalog analytics_hive
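
Once connected, you can run standard SQL statements terminated with a semicolon. Here is a minimal session sketch: the event schema holds EventLogging data (the same schema used in the Python example below), and navigationtiming is just an illustrative table, so substitute any table your user is allowed to read:

presto> SHOW SCHEMAS;
presto> SHOW TABLES FROM event;
presto> DESCRIBE event.navigationtiming;
presto> quit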

Presto can also be easily used from a Python environment on stat/notebook hosts. Here is an example:

#!/usr/bin/env python3
#
# IMPORTANT - before first use, in a terminal on the notebook host:
# pip3 install presto-python-client['kerberos']
#
import os

import prestodb

# The Presto coordinator; connections go over TLS, hence the CA bundle.
server = {
    'host': 'an-coord1001.eqiad.wmnet',
    'port': 8281,
    'ca_bundle': '/etc/presto/ca.crt.pem',
}

# Authenticate via Kerberos (run kinit first) and connect to the
# analytics_hive catalog, which exposes the Hive databases as schemas.
conn = prestodb.dbapi.connect(
    http_scheme='https',
    host=server['host'],
    port=server['port'],
    user=os.environ['USER'],
    catalog='analytics_hive',
    auth=prestodb.auth.KerberosAuthentication(
        config='/etc/krb5.conf',
        service_name='presto',
        principal='{}@WIKIMEDIA'.format(os.environ['USER']),
        ca_bundle=server['ca_bundle'],
    ),
)

# List the tables in the event database (EventLogging data).
cursor = conn.cursor()
cursor.execute('SHOW TABLES FROM event')
for row in cursor.fetchall():
    print(row)
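
The same cursor can run arbitrary SQL, not just metadata statements. Below is a minimal sketch of an analytics-style aggregate; the navigationtiming table and its year/month partition columns are illustrative assumptions, so substitute a table and filter your user can read. Note that, unlike the CLI, queries passed to cursor.execute() should not end with a semicolon.

# A sketch of an aggregate query on the same connection. The table
# and partition columns (year/month) are illustrative assumptions.
cursor.execute("""
    SELECT COUNT(*) AS events
    FROM event.navigationtiming
    WHERE year = 2021 AND month = 1
""")
print(cursor.fetchall())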

Presto on Superset

Presto is available in Superset via the SQL Lab panel:

  • open https://superset.wikimedia.org/superset/sqllab
  • select the presto_analytics_hive database
  • select a schema, for example event for EventLogging (or any available one)
  • select any table to get a preview of its content, attributes, etc.

Please note: you will only be able to see the data that your user is allowed to access, according to the POSIX permissions configured in Puppet.

Administration

Please check Analytics/Systems/Presto/Administration

  1. As of today (September 2018) there are two drawbacks to using Druid as a general SQL query engine: there is a significant subset of SQL that Druid cannot parse, and a broad range of queries (nested group-bys, for instance) fail at the computation stage.