Analytics/Systems/Superset

[ Superset] is an Apache incubator project, originally started at Airbnb. It enables visualizations and dashboards built from various analytics data sources. Like [[Analytics/Systems/Turnilo-Pivot|Turnilo]], WMF's Superset instance provides access to various [[Analytics/Systems/Druid|Druid]] tables, as well as data in Hive (and elsewhere) via Presto.
== Access ==
To access Superset, you need <code>wmf</code> or <code>nda</code> LDAP access. For more details, see [[Analytics/Data access#LDAP access]]. Once done, you can log in using your Wikitech username and password.
== Druid datasources vs Druid tables ==
Where possible, we should stop adding datasources to the [ "Druid Datasources" list] in favor of Druid tables (backed by Druid SQLAlchemy mappings). To add a Druid table, do the following:
* Go to Sources -> Tables
* Hit the + sign to add one
* Use <code>Druid Analytics SQL</code> as the database, <code>druid</code> as the schema, and the Druid datasource name as the table name (e.g. <code>edits_hourly</code>, <code>wmf_netflow</code>)
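Once mapped this way, the table is queryable through Druid SQL. As a rough sketch of what such a query looks like, the following builds the JSON body that Druid's SQL endpoint (<code>POST /druid/v2/sql</code>) accepts. The datasource name matches the <code>wmf_netflow</code> example above, but the query itself is a hypothetical illustration, not what Superset literally emits:

```python
import json

# Sketch: the JSON body Druid's SQL endpoint (POST /druid/v2/sql) accepts.
# The column name "bytes" is an assumption for illustration only.
payload = {
    "query": (
        "SELECT __time, SUM(bytes) AS total_bytes "
        "FROM wmf_netflow "
        "GROUP BY __time "
        "LIMIT 10"
    )
}

body = json.dumps(payload)
print(body)
```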
== Usage notes ==
* The [ "Druid Datasources" list] shows ingested tables that are available for querying. As of October 2018, this includes e.g. daily and hourly [[Analytics/Data Lake/Traffic/Pageview hourly|pageviews data]] (the daily version is only updated once a month, but goes further back), a sampled excerpt of [[webrequest]] data, unique devices, and a few select EventLogging schemas. If a recently created Druid datasource is not yet visible in the list, try clicking [ "Scan New Datasources"].
* NULL values don't show up properly in the values selection dropdown for filters (i.e. one can't use that dropdown to exclude NULL values from a chart, or to limit it to NULL values). One can use the regex option instead: type in <code>.+</code> (without quotes) and accept the offer to create that as an option.
* Always use predefined SUM metrics when available. When choosing a metric and picking the SUM aggregation function, the aggregation is managed by Superset and uses the <code>floatSum</code> operator. This operator uses 32-bit floats instead of 64-bit longs or doubles, leading to inaccuracies. Predefined <code>SUM(...)</code> metrics are usually available and should be used, as they are manually defined using the 64-bit <code>doubleSum</code> or <code>longSum</code> operators.
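The inaccuracy of 32-bit accumulation is easy to demonstrate. A minimal Python sketch (emulating a <code>floatSum</code>-style 32-bit accumulator by rounding through <code>struct</code>; this is an illustration of the precision issue, not Druid code):

```python
import struct

def f32(x):
    """Round a Python float to the nearest 32-bit float (like a floatSum accumulator)."""
    return struct.unpack("f", struct.pack("f", x))[0]

value = f32(0.1)  # the 32-bit float closest to 0.1
n = 100_000
exact = n * value  # reference value, computed in 64-bit arithmetic

# 64-bit accumulator: what a doubleSum-style aggregation would do
sum64 = 0.0
for _ in range(n):
    sum64 += value

# 32-bit accumulator: what a floatSum-style aggregation does
sum32 = 0.0
for _ in range(n):
    sum32 = f32(sum32 + value)

# The 32-bit accumulator drifts much further from the exact total
print(sum64, sum32)
```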
== SQL Lab ==
Superset allows users to query data via SQL using a dedicated tool called SQL Lab. Multiple databases are available to query:
* <code>presto_analytics_hive</code> to explore Hive's databases/tables via Presto.
* <code>Druid Analytics SQL</code> to explore all Druid datasources via SQL (warning: the full power of SQL will not be available)
* etc.
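For example, here is the shape of a query one might paste into SQL Lab against <code>presto_analytics_hive</code>. The table and partition columns follow the [[webrequest]] dataset mentioned above, but they are assumptions for illustration; verify them against the Hive metastore before running:

```python
# A sketch of a Presto query for SQL Lab. The table (wmf.webrequest) and
# partition columns (year/month/day) are assumptions to illustrate the
# shape of such a query, not a verified schema.
query = """
SELECT uri_host,
       COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = 2023 AND month = 1 AND day = 1
GROUP BY uri_host
ORDER BY requests DESC
LIMIT 10
"""
print(query.strip())
```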
== Administration ==
=== Upgrading ===
To upgrade, first follow the instructions in the [ analytics/superset/deploy README] to update the deploy repository.  Once deployed, activate the superset virtualenv, add /etc/superset to PYTHONPATH (to allow superset to pick up configuration) and follow the [ Superset upgrade instructions] (minus the <tt>pip install superset --upgrade</tt> part).  This should be something like:
<source lang="bash">
. /srv/deployment/analytics/superset/venv/bin/activate
export PYTHONPATH=/etc/superset
superset db upgrade
superset init
</source>
=== Deploy ===
This assumes that one has already filed the change for the superset deploy repository. The first thing to do is test the change on the staging instance, an-tool1005.eqiad.wmnet:<syntaxhighlight lang="bash">
# ssh to deploy1001 and set the working directory
ssh deploy1001.eqiad.wmnet
cd /srv/deployment/analytics/superset/deploy
# create a new branch from the master one, name it as you prefer
git checkout -B testing_something_important
# cherry pick the change in the new branch created
git cherry-pick $change-from-gerrit
# deploy only to an-tool1005, without logging in the ops's sal
scap deploy --no-log-message -f -l an-tool1005.eqiad.wmnet "Test deployment for something important"
</syntaxhighlight>Then check if Superset works as expected:<syntaxhighlight lang="bash">
# Create a ssh tunnel and then test it via localhost:8080 on the browser
ssh -L 8080:an-tool1005.eqiad.wmnet:80 an-tool1005.eqiad.wmnet
</syntaxhighlight>If you are happy with the Superset version, then merge and deploy to the production host:<syntaxhighlight lang="bash">
# ssh to deploy1001 and set the working directory
ssh deploy1001.eqiad.wmnet
cd /srv/deployment/analytics/superset/deploy
scap deploy -l analytics-tool1004.eqiad.wmnet "Deployment for something important"
</syntaxhighlight>
== How to ==
=== See status ===
<syntaxhighlight lang="bash">
systemctl status superset.service
</syntaxhighlight>
=== Bounce ===
<syntaxhighlight lang="bash">
systemctl restart superset
</syntaxhighlight>
==See also==
* Wikimedia fork of Superset

Latest revision as of 12:10, 12 January 2023