You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Difference between revisions of "Analytics/Systems/Superset"

From Wikitech-static
Jump to navigation Jump to search
imported>Elukey
imported>Razzi
(→‎Upgrading: Add back necessary `export PYTHONPATH=/etc/superset `)
(16 intermediate revisions by 6 users not shown)
Line 1: Line 1:
[https://superset.incubator.apache.org/ Superset] is an Apache incubator project, originally started at AirBnB.  It enables visualizations and dashboards built from various analytics data sources.  WMF's Superset instance can be found at https://superset.wikimedia.org. Like [[Analytics/Systems/Turnilo-Pivot|Turnilo]], it provides access to various [[Analytics/Systems/Druid|Druid]] tables.
[https://superset.incubator.apache.org/ Superset] is an Apache incubator project, originally started at AirBnB.  It enables visualizations and dashboards built from various analytics data sources.  WMF's Superset instance can be found at https://superset.wikimedia.org. Like [[Analytics/Systems/Turnilo-Pivot|Turnilo]], it provides access to various [[Analytics/Systems/Druid|Druid]] tables, as well as data in Hive (and elsewhere) via Presto.


== Access ==
== Access ==
To access Superset, you need <code>wmf</code> or <code>nda</code> LDAP access. For more details, see [[Analytics/Data access#LDAP access]].
To access Superset, you need <code>wmf</code> or <code>nda</code> LDAP access. For more details, see [[Analytics/Data access#LDAP access]]. Once done, you can log in using your Wikitech username and password.  
 
If you have that access, you can log in at [https://superset.wikimedia.org superset.wikimedia.org] with your developer shell username and password.


== Usage notes ==
== Usage notes ==
Line 11: Line 9:
* NULL values don't show up properly in the values selection dropdown list for filters (i.e. one can't use that dropdown to exclude NULL values from a chart or limit it to NULL values). But one can use the regex option instead: Type in ".+" (without the quotes), and accept the offer to create that as an option.
* NULL values don't show up properly in the values selection dropdown list for filters (i.e. one can't use that dropdown to exclude NULL values from a chart or limit it to NULL values). But one can use the regex option instead: Type in ".+" (without the quotes), and accept the offer to create that as an option.
*By default, always use predefined SUM metrics when available. When choosing a metric then picking the SUM aggregation function, the aggregation is managed by superset and uses the <code>floatSum</code> operator. This operator uses 32 bits floats instead of 64 bits longs or double, leading to inaccuracies. Usually predefined <code>SUM(...)</code> metrics are available and should be used, as they are manually defined using <code>doubleSum</code> or <code>longSum</code> 64 bits operators.
*By default, always use predefined SUM metrics when available. When choosing a metric then picking the SUM aggregation function, the aggregation is managed by superset and uses the <code>floatSum</code> operator. This operator uses 32 bits floats instead of 64 bits longs or double, leading to inaccuracies. Usually predefined <code>SUM(...)</code> metrics are available and should be used, as they are manually defined using <code>doubleSum</code> or <code>longSum</code> 64 bits operators.
*If you build a chart based on a table with structs, you won't be able to access the fields of the struct because Superset recognizes the struct as a single string column. The workaround is to add a computed column with <code>struct.field</code> as the SQL expression.
*Superset expects time columns to be in SQL timestamp string format (<code>2021-01-01 00:00:00</code>) and has trouble with columns in ISO 8601 string format (<code>2021-01-01T00:00:00Z</code>). To fix this, it's best to create a computed column that casts the time to the right format using an SQL expression like (<code>CAST(TO_ISO8601_TIMESTAMP(dt) AS VARCHAR)</code>.
== Druid datasources vs Druid tables ==
We should stop adding datasources to the [https://superset.wikimedia.org/druiddatasourcemodelview/list/ "Druid Datasources" list] if possible in favor of Druid tables (backed by Druid SQLAlchemy mappings). In order to add a Druid table, do the following:
* Go to Sources -> Tables
* Hit the + sign to add one
* Use <code>Druid Analytics SQL</code> as database, <code>druid</code> as schema and the Druid datasource name as table name (like <code>edits_hourly</code>, <code>wmf_netflow</code>, etc..)
== Migrate a chart to Druid tables ==
It is easy to spot if a chart is not using a Druid table (but an old Druid datasource) simply checking the [https://superset.wikimedia.org/chart/list/ chart list]:
[[File:Superset Druid datasource example.png|none|thumb|894x894px|Example of old Druid datasource: the name pageviews_daily is not prepended with the prefix 'druid.']]In the above example the "pageviews_daily" datasource is not prepended with "druid." (like druid.pageviews_daily) so it is an old Druid datasource. If you want to move the  chart to the Druid table, hit the edit button for it and switch the datasource name (it will likely change some default settings so make sure that the chart is correctly visualized) and then save it. You dashboards will be updated automatically!
If the druid.your-datasource-name is not present yet, please check the above section, since it may need to be defined via Druid -> Tables menu'.


== SQL Lab ==
== SQL Lab ==


Superset allows user to query data via SQL using a dedicated tool called SQL Lab: https://superset.wikimedia.org/superset/sqllab
Superset allows user to query data via SQL using a dedicated tool called SQL Lab: https://superset.wikimedia.org/superset/sqllab. Multiple databases are available to query:
Multiple databases are available to query:
* <code>presto_analytics_hive</code> to explore [[Analytics/Data Lake|Data Lake]] data using Presto
- <code>presto_analytics_hive</code> to explore Hive's databases/tables via Presto.
* <code>Druid Analytics SQL</code> to explore [[Analytics/Systems/Druid|Druid]] data cubes (note: the full power of SQL is not available for Druid data)
- <code>Druid Analytics SQL</code> to explore all Druid datasources via SQL (warning: not all the full power of SQL will be available)
* etc..
- etc..


== Administration ==
== Administration ==
=== Sync the staging database with the production one ===
When testing a new release in the staging environment it is nice to get the same dashboards as in production, since two different databases are used and they get out of sync very quickly (nobody updates dashboards in staging). The procedure is the following:<syntaxhighlight lang="bash">
# The two databases live on the same mariadb instance
ssh an-coord1001.eqiad.wmnet
# Moving to /srv since it is on a separate partition with more free space
cd /srv
# Dump the production db
sudo sh -c 'mysqldump superset_production > superset_production_$(date +%s).sql'
# Get the filename of the production dump
ls -l
# Connect to mysql and drop the staging db
sudo mysql
# Good practice to avoid specific commands to be replicated via binlog.
set session sql_log_bin=0
# Drop and re-create the staging database
drop database superset_staging;
create database superset_staging DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
exit
# Load the production database into the staging one. Change the filename according
# to the dump previously taken!
sudo mysql superset_staging < superset_production_111111.sql
# Then remember that some specific production things need to be changed once superset
# runs. For example, in databases -> presto-analytics (in the superset ui) the kerberos principal
# used is saved in the config and changes between production and staging (the hostname is different).
</syntaxhighlight>
=== Upgrading ===
=== Upgrading ===
To upgrade, first follow the instructions in the [https://github.com/wikimedia/analytics-superset-deploy analytics/superset/deploy README] to update the deploy repository.  Once deployed, activate the superset virtualenv, add /etc/superset to PYTHONPATH (to allow superset to pick up configuration) and follow the [https://superset.incubator.apache.org/installation.html#upgrading Superset upgrade instructions] (minus the <tt>pip install superset --upgrade</tt> part).  This should be something like:
To upgrade, first follow the instructions in the [https://github.com/wikimedia/analytics-superset-deploy analytics/superset/deploy README] to update the deploy repository.  Once deployed, activate the superset virtualenv, add /etc/superset to PYTHONPATH (to allow superset to pick up configuration) and follow the [https://superset.incubator.apache.org/installation.html#upgrading Superset upgrade instructions] (minus the <tt>pip install superset --upgrade</tt> part).  This should be something like:


<source lang="bash">
<syntaxhighlight lang="bash">
. /srv/deployment/analytics/superset/venv/bin/activate
source /srv/deployment/analytics/superset/venv/bin/activate
export PYTHONPATH=/etc/superset
export PYTHONPATH=/etc/superset
superset db upgrade
superset db upgrade
superset init
superset init
</source>


=== Deploy ===
</syntaxhighlight>
This assumes that one has already filed the change for the superset deploy repository. The first thing to do is test the change on the staging instance, an-tool1005.eqiad.wmnet:<syntaxhighlight lang="bash">
 
# ssh to deploy1001 and set the working directory
=== Deploy to staging ===
ssh deploy1001.eqiad.wmnet
This assumes you have a change open for the [https://gerrit.wikimedia.org/r/q/project:analytics%252Fsuperset%252Fdeploy analytics/superset/deploy repository].<syntaxhighlight lang="bash">
# ssh to deploy1002 and set the working directory
ssh deploy1002.eqiad.wmnet
cd /srv/deployment/analytics/superset/deploy
cd /srv/deployment/analytics/superset/deploy


Line 47: Line 91:




</syntaxhighlight>Then check if Superset works as expected:<syntaxhighlight lang="bash">
</syntaxhighlight>
 
=== Test staging via ssh tunnel ===
<syntaxhighlight lang="bash">
# Create a ssh tunnel and then test it via localhost:8080 on the browser
# Create a ssh tunnel and then test it via localhost:8080 on the browser
ssh -L 8080:an-tool1005.eqiad.wmnet:80 an-tool1005.eqiad.wmnet
ssh -L 8080:an-tool1005.eqiad.wmnet:80 an-tool1005.eqiad.wmnet
</syntaxhighlight>If you are happy with the Superset version, then merge and deploy to the production host:<syntaxhighlight lang="bash">
</syntaxhighlight>Upon accessing http://localhost:8080, you will be presented with a basic auth challenge that allows you to authenticate via LDAP.
 
=== Test as different user on staging ===
In order to test features such as permissions, you may want to impersonate a different user.
 
First, add an extension to your browser that allows you to set custom headers, such as https://addons.mozilla.org/en-US/firefox/addon/modify-header-value/ for Firefox.
 
Then, set the following headers for localhost:8080:<syntaxhighlight lang="bash">
X-Remote-User: <user you want to impersonate>
X-Forwarded-Proto: http
</syntaxhighlight>Connect directly to the superset server by opening an SSH tunnel on an-tool1005 port 9080:<syntaxhighlight lang="bash">
ssh -NL 8080:an-tool1005.eqiad.wmnet:9080 an-tool1005.eqiad.wmnet
</syntaxhighlight>Open http://localhost:8080 in your browser, and you should be logged in as the user specified by X-Remote-User. You can see who you are logged in as by viewing your profile from the Settings menu.
 
If you change the header to be a different user, logout via the Settings menu to change to be the new user.
 
=== Deploy to production ===
<syntaxhighlight lang="bash">
# ssh to deploy1001 and set the working directory
# ssh to deploy1001 and set the working directory
ssh deploy1001.eqiad.wmnet
ssh deploy1001.eqiad.wmnet
cd /srv/deployment/analytics/superset/deploy
cd /srv/deployment/analytics/superset/deploy


scap deploy -l analytics-tool1004.eqiad.wmnet "Deployment for something important"
scap deploy -l an-tool1010.eqiad.wmnet "Deployment for something important"




</syntaxhighlight>
</syntaxhighlight>Don't forget to run migrations after you deploy (see [[Analytics/Systems/Superset#Upgrading]]).


== How to  ==
== How to  ==
Line 65: Line 129:
  systemctl status superset.service
  systemctl status superset.service


== Bounce ==
=== Bounce ===
  systemctl restart superset
  systemctl restart superset


==See also==
==See also==
* https://github.com/wikimedia/incubator-superset Wikimedia fork of Superset
* https://github.com/wikimedia/incubator-superset Wikimedia fork of Superset

Revision as of 22:59, 18 October 2021

Superset is an Apache incubator project, originally started at AirBnB. It enables visualizations and dashboards built from various analytics data sources. WMF's Superset instance can be found at https://superset.wikimedia.org. Like Turnilo, it provides access to various Druid tables, as well as data in Hive (and elsewhere) via Presto.

Access

To access Superset, you need wmf or nda LDAP access. For more details, see Analytics/Data access#LDAP access. Once done, you can log in using your Wikitech username and password.

Usage notes

  • The "Druid Datasources" list shows ingested tables that are available for querying. As of October 2018, this includes e.g. daily and hourly pageviews data (the daily version is only updates once a month, but goes further back), a sampled excerpt of webrequest data, unique devices, and a few select EventLogging schemas. If a recently created Druid datasource is not yet visible in the list, try clicking "Scan New Datasouces".
  • NULL values don't show up properly in the values selection dropdown list for filters (i.e. one can't use that dropdown to exclude NULL values from a chart or limit it to NULL values). But one can use the regex option instead: Type in ".+" (without the quotes), and accept the offer to create that as an option.
  • By default, always use predefined SUM metrics when available. When choosing a metric then picking the SUM aggregation function, the aggregation is managed by superset and uses the floatSum operator. This operator uses 32 bits floats instead of 64 bits longs or double, leading to inaccuracies. Usually predefined SUM(...) metrics are available and should be used, as they are manually defined using doubleSum or longSum 64 bits operators.
  • If you build a chart based on a table with structs, you won't be able to access the fields of the struct because Superset recognizes the struct as a single string column. The workaround is to add a computed column with struct.field as the SQL expression.
  • Superset expects time columns to be in SQL timestamp string format (2021-01-01 00:00:00) and has trouble with columns in ISO 8601 string format (2021-01-01T00:00:00Z). To fix this, it's best to create a computed column that casts the time to the right format using an SQL expression like (CAST(TO_ISO8601_TIMESTAMP(dt) AS VARCHAR).

Druid datasources vs Druid tables

We should stop adding datasources to the "Druid Datasources" list if possible in favor of Druid tables (backed by Druid SQLAlchemy mappings). In order to add a Druid table, do the following:

  • Go to Sources -> Tables
  • Hit the + sign to add one
  • Use Druid Analytics SQL as database, druid as schema and the Druid datasource name as table name (like edits_hourly, wmf_netflow, etc..)

Migrate a chart to Druid tables

It is easy to spot if a chart is not using a Druid table (but an old Druid datasource) simply checking the chart list:

File:Superset Druid datasource example.png
Example of old Druid datasource: the name pageviews_daily is not prepended with the prefix 'druid.'

In the above example the "pageviews_daily" datasource is not prepended with "druid." (like druid.pageviews_daily) so it is an old Druid datasource. If you want to move the chart to the Druid table, hit the edit button for it and switch the datasource name (it will likely change some default settings so make sure that the chart is correctly visualized) and then save it. You dashboards will be updated automatically!

If the druid.your-datasource-name is not present yet, please check the above section, since it may need to be defined via Druid -> Tables menu'.

SQL Lab

Superset allows user to query data via SQL using a dedicated tool called SQL Lab: https://superset.wikimedia.org/superset/sqllab. Multiple databases are available to query:

  • presto_analytics_hive to explore Data Lake data using Presto
  • Druid Analytics SQL to explore Druid data cubes (note: the full power of SQL is not available for Druid data)
  • etc..

Administration

Sync the staging database with the production one

When testing a new release in the staging environment it is nice to get the same dashboards as in production, since two different databases are used and they get out of sync very quickly (nobody updates dashboards in staging). The procedure is the following:

# The two databases live on the same mariadb instance
ssh an-coord1001.eqiad.wmnet
# Moving to /srv since it is on a separate partition with more free space
cd /srv
# Dump the production db
sudo sh -c 'mysqldump superset_production > superset_production_$(date +%s).sql'
# Get the filename of the production dump
ls -l
# Connect to mysql and drop the staging db
sudo mysql
# Good practice to avoid specific commands to be replicated via binlog.
set session sql_log_bin=0
# Drop and re-create the staging database
drop database superset_staging;
create database superset_staging DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
exit

# Load the production database into the staging one. Change the filename according
# to the dump previously taken!
sudo mysql superset_staging < superset_production_111111.sql

# Then remember that some specific production things need to be changed once superset
# runs. For example, in databases -> presto-analytics (in the superset ui) the kerberos principal
# used is saved in the config and changes between production and staging (the hostname is different).

Upgrading

To upgrade, first follow the instructions in the analytics/superset/deploy README to update the deploy repository. Once deployed, activate the superset virtualenv, add /etc/superset to PYTHONPATH (to allow superset to pick up configuration) and follow the Superset upgrade instructions (minus the pip install superset --upgrade part). This should be something like:

source /srv/deployment/analytics/superset/venv/bin/activate
export PYTHONPATH=/etc/superset
superset db upgrade
superset init

Deploy to staging

This assumes you have a change open for the analytics/superset/deploy repository.

# ssh to deploy1002 and set the working directory
ssh deploy1002.eqiad.wmnet
cd /srv/deployment/analytics/superset/deploy

# create a new branch from the master one, name it as you prefer
git checkout -B testing_something_important

# cherry pick the change in the new branch created
git cherry-pick $change-from-gerrit

# deploy only to an-tool1005, without logging in the ops's sal
scap deploy --no-log-message -f -l an-tool1005.eqiad.wmnet "Test deployment for something important"

Test staging via ssh tunnel

# Create a ssh tunnel and then test it via localhost:8080 on the browser
ssh -L 8080:an-tool1005.eqiad.wmnet:80 an-tool1005.eqiad.wmnet

Upon accessing http://localhost:8080, you will be presented with a basic auth challenge that allows you to authenticate via LDAP.

Test as different user on staging

In order to test features such as permissions, you may want to impersonate a different user.

First, add an extension to your browser that allows you to set custom headers, such as https://addons.mozilla.org/en-US/firefox/addon/modify-header-value/ for Firefox.

Then, set the following headers for localhost:8080:

X-Remote-User: <user you want to impersonate>
X-Forwarded-Proto: http

Connect directly to the superset server by opening an SSH tunnel on an-tool1005 port 9080:

ssh -NL 8080:an-tool1005.eqiad.wmnet:9080 an-tool1005.eqiad.wmnet

Open http://localhost:8080 in your browser, and you should be logged in as the user specified by X-Remote-User. You can see who you are logged in as by viewing your profile from the Settings menu.

If you change the header to be a different user, logout via the Settings menu to change to be the new user.

Deploy to production

# ssh to deploy1001 and set the working directory
ssh deploy1001.eqiad.wmnet
cd /srv/deployment/analytics/superset/deploy

scap deploy -l an-tool1010.eqiad.wmnet "Deployment for something important"

Don't forget to run migrations after you deploy (see Analytics/Systems/Superset#Upgrading).

How to

See status

systemctl status superset.service

Bounce

systemctl restart superset

See also