You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Analytics/Systems/Druid

From Wikitech-static
< Analytics‎ | Systems
Revision as of 18:29, 21 September 2017 by imported>Ottomata (→‎Full Restart of services)
Jump to navigation Jump to search

Druid is an analytics data store, currently (as of August 2016) in experimental use for the upcoming Analytics/Data_Lake. It is comprised of many services, each of which is fully redundant.

The Analytics team is using a nodejs Web UI application called Pivot as experimental tool to explore Druid's data.

Why Druid. Value Proposition

When looking for an analytics columnar datastore we wanted a product that could fit our use cases and scale and that in the future we could use to support real time ingestion of data. We had several alternatives: Druid, Cassandra , ElasticSearch and of late, Clickhouse. All these are open source choices that served our use cases to different degrees.

Druid offered the best value proposition:

  • It is designed for analytics so it can handle creation of cubes with many different dimensions without having to have those precomputed (like cassandra does)
  • It has easy loading specs and supports real time ingestion
  • It provides front caching that repeated queries benefit from (clickhouse is desined as a fast datastore for analytics but it doesn't have a fronetend cache)
  • Druid shipped also with a convenient UI to do basic exploration of data that was also open source: Pivot


Access to Druid Data via Pivot

Analytics/Systems/Pivot

Druid Administration

Naming convention

For homogeneity across systems, underscores _ should be used in datasource names and field names instead of hyphens -.

Delete a data set from deep storage

Disable datasource in coordinator (needed before deep-storage deletion) This step is not irreversible, data is still present in deep-storage and can reloaded easily

 curl -X DELETE http://localhost:8081/druid/coordinator/v1/datasources/DATASOURCE_NAME

Hard-delete deep storage data - Irreversible

curl -X 'POST' -H 'Content-Type:application/json' -d "{ \"type\":\"kill\", \"id\":\"kill_task-tiles-poc-`date --iso-8601=seconds`\",\"dataSource\":\"DATASOURCE_NAME\", \"interval\":\"2016-11-01T00:00:00Z/2017-01-04T00:00:00Z\" }" localhost:8090/druid/indexer/v1/task

Administration UI

ssh -N druid1003.eqiad.wmnet -L 8081:druid1003.eqiad.wmnet:8081
http://localhost:8081/#/datasources/pageviews-hourly


Indexing Logs

Located at "/var/lib/druid/indexing-logs"

Full Restart of services

To restart all druid services, you must restart each service on each Druid node individually. It is best to do them one at a time, but the order does not particularly matter.

NOTE: druid-historical can take a while to restart up, as it needs to re-read indexes.

# for each Druid node (druid100[123]):
service druid-broker restart
service druid-coordinator restart
service druid-historical restart
service druid-middlemanager restart
service druid-overlord restart

Bash snippet to automate the restart:

#!/bin/bash
set -x
set -e

sudo service druid-broker restart
sudo service druid-broker status
sleep 5
sudo service druid-coordinator restart
sudo service druid-coordinator status
sleep 5
sudo service druid-historical restart
sudo service druid-historical status
sleep 120 # check that historical startup finishes in /var/log/druid/historical.log
sudo service druid-middlemanager restart
sudo service druid-middlemanager status
sleep 5
sudo service druid-overlord restart
sudo service druid-overlord status

We intend to also run a dedicated Zookeeper cluster for druid on the druid nodes. For now (August 2016), druid uses the main Zookeeper cluster on conf100[123]. In the future, when the Druid nodes run Zookeeper, you may also want to restart Zookeeper on each node.

service zookeeper restart