Difference between revisions of "Analytics/Systems/Druid"

imported>Nuria
(Undo revision 1758259 by Nuria (talk))
imported>BryanDavis
(emphasize that piviot was FOSS but is no longer.)

Revision as of 22:10, 20 May 2017

Druid is an analytics data store, currently (as of August 2016) in experimental use for the upcoming Analytics/Data_Lake. It comprises many services, each of which is fully redundant.

The Analytics team is using a Node.js web UI application called Pivot as an experimental tool to explore Druid's data.

Why Druid. Value Proposition

When looking for an analytics columnar datastore, we wanted a product that could fit our use cases, scale, and in the future support real-time ingestion of data. We had several alternatives: Druid, Cassandra, ElasticSearch and, of late, ClickHouse. All of these are open source choices that served our use cases to different degrees.

Druid offered the best value proposition:

  • It is designed for analytics, so it can handle creation of cubes with many different dimensions without having to precompute them (as Cassandra requires)
  • It has easy loading specs and supports real-time ingestion
  • It provides front-end caching that repeated queries benefit from (ClickHouse is designed as a fast datastore for analytics but it doesn't have a front-end cache)
  • Druid also shipped with a convenient UI for basic data exploration that was also open source: Pivot


Access to Druid Data via Pivot

Analytics/Systems/Pivot

Druid Administration

Delete a data set from deep storage

Disable the datasource in the coordinator (needed before deep-storage deletion). This step is reversible: the data is still present in deep storage and can be reloaded easily.

 curl -X DELETE http://localhost:8081/druid/coordinator/v1/datasources/DATASOURCE_NAME

Hard-delete deep storage data - Irreversible

curl -X 'POST' -H 'Content-Type:application/json' -d "{ \"type\":\"kill\", \"id\":\"kill_task-tiles-poc-`date --iso-8601=seconds`\",\"dataSource\":\"DATASOURCE_NAME\", \"interval\":\"2016-11-01T00:00:00Z/2017-01-04T00:00:00Z\" }" localhost:8090/druid/indexer/v1/task
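
The escaped quotes in the one-liner above are easy to get wrong. As a minimal sketch, the kill-task payload can be built by a small helper and reviewed before it is submitted. The helper name build_kill_task and the example task id are hypothetical; the task fields and the overlord endpoint are the ones used in the command above.

```shell
#!/bin/bash
# Sketch only: build_kill_task is a hypothetical helper, not part of Druid.
# It prints the kill-task JSON for a datasource, interval and task id.
build_kill_task() {
  local datasource="$1" interval="$2" task_id="$3"
  printf '{"type":"kill","id":"%s","dataSource":"%s","interval":"%s"}\n' \
    "$task_id" "$datasource" "$interval"
}

# Print the payload for review; nothing is sent to Druid here.
build_kill_task "DATASOURCE_NAME" \
  "2016-11-01T00:00:00Z/2017-01-04T00:00:00Z" \
  "kill_task-DATASOURCE_NAME-example"

# To submit it, pipe the payload to the same overlord endpoint as above:
#   build_kill_task ... | curl -X POST -H 'Content-Type:application/json' \
#     -d @- localhost:8090/druid/indexer/v1/task
```

Printing the payload first gives a chance to double-check the datasource name and interval, which matters because the hard delete is irreversible.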

Administration UI

ssh -N druid1003.eqiad.wmnet -L 8081:druid1003.eqiad.wmnet:8081
http://localhost:8081/#/datasources/pageviews-hourly


Full Restart of services

To restart all druid services, you must restart each service on each Druid node individually. It is best to do them one at a time, but the order does not particularly matter.

Note that Druid is still experimental and does not yet have much WMF operational experience behind it.

# for each Druid node (druid100[123]):
service druid-broker restart
service druid-coordinator restart
service druid-historical restart
service druid-middlemanager restart
service druid-overlord restart

Bash snippet to automate the restart:

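#!/bin/bash
# Restart every Druid service on this node, one at a time.
# -x echoes each command; -e aborts on the first failure.
set -x
set -e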

sudo service druid-broker restart
sudo service druid-broker status
sleep 5
sudo service druid-coordinator restart
sudo service druid-coordinator status
sleep 5
sudo service druid-historical restart
sudo service druid-historical status
sleep 5
sudo service druid-middlemanager restart
sudo service druid-middlemanager status
sleep 5
sudo service druid-overlord restart
sudo service druid-overlord status
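
The snippet above runs on a single node. A possible sketch for walking all three nodes from one place follows. It assumes SSH access to each host and that the nodes are named druid1001 to druid1003 under eqiad.wmnet (by analogy with druid1003.eqiad.wmnet used elsewhere on this page). DRY_RUN and restart_node are hypothetical names for this example. By default it only prints the commands it would run.

```shell
#!/bin/bash
# Sketch only: prints (or runs) the restart sequence for every Druid node.
# DRY_RUN and restart_node are hypothetical names; hostnames are assumed.
set -e

NODES="druid1001 druid1002 druid1003"
SERVICES="broker coordinator historical middlemanager overlord"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = run them over ssh

restart_node() {
  local node="$1" svc cmd
  for svc in $SERVICES; do
    cmd="sudo service druid-${svc} restart && sudo service druid-${svc} status"
    if [ "$DRY_RUN" = "1" ]; then
      echo "ssh ${node}.eqiad.wmnet '${cmd}'"
    else
      ssh "${node}.eqiad.wmnet" "$cmd"
      sleep 5   # same pause between services as the single-node script
    fi
  done
}

# One node at a time, one service at a time.
for node in $NODES; do
  restart_node "$node"
done
```

Running it with DRY_RUN=0 would execute the commands; reviewing the dry-run output first seems prudent given the limited operational experience noted above.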

We also intend to run a dedicated Zookeeper cluster for Druid on the Druid nodes. For now (August 2016), Druid uses the main Zookeeper cluster on conf100[123]. In the future, when the Druid nodes run Zookeeper, you may also want to restart Zookeeper on each node.

service zookeeper restart