You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
Analytics/Systems/Superset: Difference between revisions
imported>Elukey |
imported>Razzi (→Sync the staging database with the production one: Add sudo sh -c for permission to write redirect output) |
||
Line 33: | Line 33: | ||
cd /srv | cd /srv | ||
# Dump the production db | # Dump the production db | ||
sudo mysqldump superset_production > superset_production_$(date +%s).sql | sudo sh -c 'mysqldump superset_production > superset_production_$(date +%s).sql' | ||
# Get the filename of the production dump | # Get the filename of the production dump | ||
ls -l | ls -l |
Revision as of 16:16, 16 December 2020
Superset is an Apache incubator project, originally started at AirBnB. It enables visualizations and dashboards built from various analytics data sources. WMF's Superset instance can be found at https://superset.wikimedia.org. Like Turnilo, it provides access to various Druid tables, as well as data in Hive (and elsewhere) via Presto.
Access
To access Superset, you need wmf
or nda
LDAP access. For more details, see Analytics/Data access#LDAP access. Once done, you can log in using your Wikitech username and password.
Druid datasources vs Druid tables
We should stop adding datasources to the "Druid Datasources" list if possible in favor of Druid tables (backed by Druid SQLAlchemy mappings). In order to add a Druid table, do the following:
- Go to Sources -> Tables
- Hit the + sign to add one
- Use
Druid Analytics SQL
as database,druid
as schema and the Druid datasource name as table name (likeedits_hourly
,wmf_netflow
, etc..)
Usage notes
- The "Druid Datasources" list shows ingested tables that are available for querying. As of October 2018, this includes e.g. daily and hourly pageviews data (the daily version is only updates once a month, but goes further back), a sampled excerpt of webrequest data, unique devices, and a few select EventLogging schemas. If a recently created Druid datasource is not yet visible in the list, try clicking "Scan New Datasouces".
- NULL values don't show up properly in the values selection dropdown list for filters (i.e. one can't use that dropdown to exclude NULL values from a chart or limit it to NULL values). But one can use the regex option instead: Type in ".+" (without the quotes), and accept the offer to create that as an option.
- By default, always use predefined SUM metrics when available. When choosing a metric then picking the SUM aggregation function, the aggregation is managed by superset and uses the
floatSum
operator. This operator uses 32 bits floats instead of 64 bits longs or double, leading to inaccuracies. Usually predefinedSUM(...)
metrics are available and should be used, as they are manually defined usingdoubleSum
orlongSum
64 bits operators.
SQL Lab
Superset allows user to query data via SQL using a dedicated tool called SQL Lab: https://superset.wikimedia.org/superset/sqllab Multiple databases are available to query:
presto_analytics_hive
to explore Hive's databases/tables via Presto.Druid Analytics SQL
to explore all Druid datasources via SQL (warning: not all the full power of SQL will be available)- etc..
Administration
Sync the staging database with the production one
When testing a new release in the staging environment it is nice to get the same dashboards as in production, since two different databases are used and they get out of sync very quickly (nobody updates dashboards in staging). The procedure is the following:
# The two databases live on the same mariadb instance
ssh an-coord1001.eqiad.wmnet
# Moving to /srv since it is on a separate partition with more free space
cd /srv
# Dump the production db
sudo sh -c 'mysqldump superset_production > superset_production_$(date +%s).sql'
# Get the filename of the production dump
ls -l
# Connect to mysql and drop the staging db
sudo mysql
# Good practice to avoid specific commands to be replicated via binlog.
set session sql_log_bin=0
# Drop and re-create the staging database
drop database superset_staging;
create database superset_staging DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
exit
# Load the production database into the staging one. Change the filename according
# to the dump previously taken!
sudo mysql superset_staging < superset_production_111111.sql
# Then remember that some specific production things need to be changed once superset
# runs. For example, in databases -> presto-analytics (in the superset ui) the kerberos principal
# used is saved in the config and changes between production and staging (the hostname is different).
Upgrading
To upgrade, first follow the instructions in the analytics/superset/deploy README to update the deploy repository. Once deployed, activate the superset virtualenv, add /etc/superset to PYTHONPATH (to allow superset to pick up configuration) and follow the Superset upgrade instructions (minus the pip install superset --upgrade part). This should be something like:
. /srv/deployment/analytics/superset/venv/bin/activate
export PYTHONPATH=/etc/superset
superset db upgrade
superset init
Deploy
This assumes that one has already filed the change for the superset deploy repository. The first thing to do is test the change on the staging instance, an-tool1005.eqiad.wmnet:
# ssh to deploy1001 and set the working directory
ssh deploy1001.eqiad.wmnet
cd /srv/deployment/analytics/superset/deploy
# create a new branch from the master one, name it as you prefer
git checkout -B testing_something_important
# cherry pick the change in the new branch created
git cherry-pick $change-from-gerrit
# deploy only to an-tool1005, without logging in the ops's sal
scap deploy --no-log-message -f -l an-tool1005.eqiad.wmnet "Test deployment for something important"
Then check if Superset works as expected:
# Create a ssh tunnel and then test it via localhost:8080 on the browser
ssh -L 8080:an-tool1005.eqiad.wmnet:80 an-tool1005.eqiad.wmnet
If you are happy with the Superset version, then merge and deploy to the production host:
# ssh to deploy1001 and set the working directory
ssh deploy1001.eqiad.wmnet
cd /srv/deployment/analytics/superset/deploy
scap deploy -l analytics-tool1004.eqiad.wmnet "Deployment for something important"
How to
See status
systemctl status superset.service
Bounce
systemctl restart superset
See also
- https://github.com/wikimedia/incubator-superset Wikimedia fork of Superset