Analytics/Systems/Druid
= Pivot =
[http://pivot.wikimedia.org http://pivot.wikimedia.org] is a user interface for non-programmatic access to data. Most of the data available in Pivot at this time comes from Hadoop. (See also a [https://usercontent.irccloud-cdn.com/file/xuIMGKl0/Screen%20Shot%202017-04-07%20at%2012.18.24%20PM.png snapshot] of available data cubes as of April 2017, with update schedules etc.)

== Access to Pivot ==
You need a Wikitech login that is in the "wmf" or "nda" LDAP groups. If you don't have one, please create a task like https://phabricator.wikimedia.org/T160662

Before requesting access, please make sure you:
* have a functioning Wikitech login. Get one: https://toolsadmin.wikimedia.org/register/
* are an employee or contractor with WMF, OR have signed an NDA

Depending on the above, you can request to be added to the wmf group or the nda group. Please explain on the task why you need access, and ping the Analytics team if you don't hear back soon from the Opsen on duty.


== Administration ==
=== Logs ===
On stat1001 everybody can read <code>/var/log/pivot/syslog.log</code>
=== Deploy ===
Deployment steps for deployment.eqiad.wmnet:
 cd /srv/deployment/analytics/pivot/deploy
 git pull
 git submodule update --init
 scap deploy
The code that renders https://pivot.eqiad.wmnet runs entirely on stat1001.eqiad.wmnet and is split into two parts:
* an Apache httpd Virtual Host that takes care of Basic Authentication via an LDAP check of Wikitech credentials.
* a nodejs application deployed via scap and stored in the https://gerrit.wikimedia.org/r/#/admin/projects/analytics/pivot/deploy repo (https://gerrit.wikimedia.org/r/#/admin/projects/analytics/pivot is a submodule).
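As an illustrative sketch only (not the production configuration), an Apache vhost enforcing LDAP Basic Authentication in front of a local nodejs service could look like the fragment below. The directives come from mod_authnz_ldap and mod_proxy, but every value (server name, LDAP URL, group DNs, backend port) is a placeholder, not WMF's real settings:

```apache
<VirtualHost *:443>
    ServerName pivot.wikimedia.org

    # Require an LDAP login before any request reaches the backend.
    <Location />
        AuthType Basic
        AuthName "Wikitech LDAP login"
        AuthBasicProvider ldap
        # Hypothetical LDAP URL; the real server and base DN differ.
        AuthLDAPURL "ldaps://ldap.example.org/ou=people,dc=example,dc=org?cn"
        # Allow members of either group (group DNs are placeholders).
        Require ldap-group cn=wmf,ou=groups,dc=example,dc=org
        Require ldap-group cn=nda,ou=groups,dc=example,dc=org
    </Location>

    # Hand authenticated requests to the local nodejs application.
    ProxyPass / http://localhost:9090/
    ProxyPassReverse / http://localhost:9090/
</VirtualHost>
```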


Revision as of 17:51, 3 May 2017

Druid is an analytics data store, currently (as of August 2016) in experimental use for the upcoming [[Analytics/Data Lake]]. It comprises many services, each of which is fully redundant.

The Analytics team is using a nodejs Web UI application called Pivot as an experimental tool to explore Druid's data.

== Why Druid. Value Proposition ==

When looking for an analytics columnar datastore, we wanted a product that could fit our use cases and scale, and that we could later use to support real-time ingestion of data. We had several alternatives: Druid, Cassandra, Elasticsearch and, of late, ClickHouse. All of these are open-source choices that served our use cases to different degrees.

Druid offered the best value proposition:

* It is designed for analytics, so it can handle creation of cubes with many different dimensions without those having to be precomputed (as Cassandra requires)
* It has easy loading specs and supports real-time ingestion
* It provides a front-end cache that repeated queries benefit from (ClickHouse is designed as a fast datastore for analytics, but it doesn't have a front-end cache)
* Druid also shipped with a convenient open-source UI for basic exploration of data: [http://pivot.imply.io/ Pivot]


== Access to Druid Data via Pivot ==
[[Analytics/Systems/Pivot]]

== Druid Administration ==

=== Delete a data set from deep storage ===

Disable the datasource in the coordinator (needed before deep-storage deletion). This step is reversible: the data is still present in deep storage and can be reloaded easily.

 curl -X DELETE http://localhost:8081/druid/coordinator/v1/datasources/DATASOURCE_NAME
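Since disabling is reversible, the datasource can later be re-enabled through the coordinator. The endpoint below is assumed from the upstream Druid coordinator API (verify it against the deployed Druid version); the datasource name is the same placeholder as above:

```shell
# Re-enable a previously disabled datasource via the coordinator API.
# DATASOURCE_NAME is a placeholder, as in the disable example above.
datasource="DATASOURCE_NAME"
url="http://localhost:8081/druid/coordinator/v1/datasources/${datasource}"
echo "${url}"
# curl -X POST "${url}"
```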

=== Hard-delete deep storage data - Irreversible ===

 curl -X 'POST' -H 'Content-Type:application/json' -d "{ \"type\":\"kill\", \"id\":\"kill_task-tiles-poc-`date --iso-8601=seconds`\",\"dataSource\":\"DATASOURCE_NAME\", \"interval\":\"2016-11-01T00:00:00Z/2017-01-04T00:00:00Z\" }" localhost:8090/druid/indexer/v1/task
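The escaped one-liner above is hard to read. As a sketch, the same kill task can be written to a file first and then submitted; the datasource name and interval are the placeholders from the example, and the overlord endpoint is unchanged:

```shell
# Build the kill task spec as a readable JSON file instead of inline escaped JSON.
task_id="kill_task-tiles-poc-$(date --iso-8601=seconds)"
cat > /tmp/kill-task.json <<EOF
{
  "type": "kill",
  "id": "${task_id}",
  "dataSource": "DATASOURCE_NAME",
  "interval": "2016-11-01T00:00:00Z/2017-01-04T00:00:00Z"
}
EOF

# Submit it to the overlord (same endpoint as the one-liner above):
# curl -X POST -H 'Content-Type: application/json' -d @/tmp/kill-task.json localhost:8090/druid/indexer/v1/task
```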

=== Administration UI ===

 ssh -N druid1003.eqiad.wmnet -L 8081:druid1003.eqiad.wmnet:8081
 http://localhost:8081/#/datasources/pageviews-hourly


=== Full Restart of services ===

To restart all druid services, you must restart each service on each Druid node individually. It is best to do them one at a time, but the order does not particularly matter.

Note that Druid is still experimental and does not yet have much WMF operational experience behind it.

 # for each Druid node (druid100[123]):
 service druid-broker restart
 service druid-coordinator restart
 service druid-historical restart
 service druid-middlemanager restart
 service druid-overlord restart
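The per-node steps above can also be driven from one machine over ssh. This is only a hedged sketch: the hostnames come from the note above, and it assumes working ssh access and sudo rights on each node:

```shell
# Restart every Druid service on every node, one node at a time
# (order of services does not particularly matter, per the note above).
for host in druid1001.eqiad.wmnet druid1002.eqiad.wmnet druid1003.eqiad.wmnet; do
  for svc in broker coordinator historical middlemanager overlord; do
    echo "restarting druid-${svc} on ${host}"
    # ssh "${host}" sudo service "druid-${svc}" restart
  done
done
```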

Bash snippet to automate the restart:

 #!/bin/bash
 set -x
 set -e
 
 sudo service druid-broker restart
 sudo service druid-broker status
 sleep 5
 sudo service druid-coordinator restart
 sudo service druid-coordinator status
 sleep 5
 sudo service druid-historical restart
 sudo service druid-historical status
 sleep 5
 sudo service druid-middlemanager restart
 sudo service druid-middlemanager status
 sleep 5
 sudo service druid-overlord restart
 sudo service druid-overlord status
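After restarting, each service can also be health-checked over HTTP. This is a hedged sketch: the <code>/status</code> endpoint is standard Druid, but the port numbers below are upstream Druid defaults and may not match the WMF configuration:

```shell
# Poll each Druid service's /status endpoint after a restart.
# Ports are upstream Druid defaults; adjust to the local configuration.
declare -A ports=(
  [broker]=8082
  [coordinator]=8081
  [historical]=8083
  [middlemanager]=8091
  [overlord]=8090
)
for svc in "${!ports[@]}"; do
  echo "druid-${svc}: port ${ports[$svc]}"
  # curl -sf "http://localhost:${ports[$svc]}/status" >/dev/null && echo "druid-${svc} OK"
done
```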

We intend to also run a dedicated Zookeeper cluster for druid on the druid nodes. For now (August 2016), druid uses the main Zookeeper cluster on conf100[123]. In the future, when the Druid nodes run Zookeeper, you may also want to restart Zookeeper on each node.

 service zookeeper restart