Data Platform/Systems/Superset
Apache Superset enables visualizations and dashboards built from various analytics data sources.
Our primary Superset instance is version 4.0.2 and can be found at superset.wikimedia.org.
We also run a staging instance at superset-next.wikimedia.org, which is used to test new versions and features of Superset before they are promoted to the primary instance.
Like Turnilo, Superset provides access to various Druid tables, as well as data in Hive (and elsewhere) via Presto.
Access
To access Superset, you need wmf or nda LDAP access. For more details, see Data Platform/Data access#LDAP access. Once you have access, you can log in using your Wikimedia developer account.
Query timeouts
Superset queries are set to time out after 3 minutes (task T294771, task T407799). This applies to both manual queries in SQL Lab and the automatically generated queries Superset runs every time a graph is loaded.
Usage notes
- NULL values do not show up properly in the value selection dropdown for filters (i.e. one cannot use that dropdown to exclude NULL values from a chart or limit it to NULL values). As a workaround, use the regex option instead: type in ".+" (without the quotes), and accept the offer to create that as an option.
- By default, always use predefined SUM metrics when available. When you choose a metric and then pick the SUM aggregation function, the aggregation is managed by Superset and uses the floatSum operator. This operator uses 32-bit floats instead of 64-bit longs or doubles, leading to inaccuracies. Predefined SUM(...) metrics are usually available and should be used, as they are manually defined using the 64-bit doubleSum or longSum operators.
- If you build a chart based on a table with structs, you will not be able to access the fields of the struct because Superset recognizes the struct as a single string column. The workaround is to add a computed column with struct.field as the SQL expression.
- Superset expects time columns to be in SQL timestamp string format (2021-01-01 00:00:00) and has trouble with columns in ISO 8601 string format (2021-01-01T00:00:00Z). To fix this, it is best to create a computed column that casts the time to the right format using an SQL expression like CAST(TO_ISO8601_TIMESTAMP(dt) AS VARCHAR).
- When creating filters for the charts on your dashboard, please do not create Filter Box charts. Instead, use the Native filters available in the Dashboard Filter sidebar. For detailed information, see https://docs.preset.io/docs/dashboard-filtering
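To illustrate why the note on SUM metrics above matters, here is a minimal Python sketch of what happens when a running total is kept in 32-bit floating point. It is a simplified model, not Druid's actual implementation: the helper names (to_float32, float_sum, long_sum) and the sample values are hypothetical and chosen only to make the rounding visible.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def float_sum(values):
    """Hypothetical model of a 32-bit float sum: round after every addition."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


def long_sum(values):
    """Hypothetical model of a 64-bit integer sum: exact for these inputs."""
    return sum(int(v) for v in values)


# Assumed sample data: one large count followed by a few small ones.
# 16_777_216 is 2**24, the point past which 32-bit floats can no longer
# represent every integer.
counts = [16_777_216, 1, 1, 1]

print(float_sum(counts))  # 16777216.0 (the three small counts are lost)
print(long_sum(counts))   # 16777219
```

Once the running total reaches 2**24, adding 1 in 32-bit precision rounds back to the same value, so small increments silently disappear. A 64-bit sum keeps them.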
Advanced time ranges
Superset accepts a lot of different inputs in the advanced time range boxes, including SQL time functions and any human-language times accepted by the Parsedatetime library. Unfortunately, neither Superset nor Parsedatetime documents this, so there is no complete list of the possibilities. However, when you enter a value, Superset gives you immediate feedback about whether it is valid and how it is interpreted. You can also use a variety of helpful date functions such as DATETRUNC and DATEADD.
Here are some useful examples:
| desired range | start time value | end time value |
|---|---|---|
| the last 26 full Monday-Sunday weeks (useful for a weekly graph covering the last half year) | this Monday 26 weeks ago | this Monday |
| up to yesterday (useful when working with hourly data at daily granularity and today is incomplete) | | DATETRUNC(DATETIME("yesterday"), day) or DATETRUNC(DATEADD(DATETIME("now"), -1, day), day) |
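Since Superset reports how it interprets each value, it can help to know in advance what bounds you expect. The following Python sketch computes the two ranges from the table independently of Superset. It does not call Superset or Parsedatetime; the function names are hypothetical, and it assumes that "this Monday" is meant as 00:00 on the Monday of the current week.

```python
from datetime import datetime, timedelta


def truncate_to_day(moment: datetime) -> datetime:
    """Drop the time-of-day part, similar in spirit to DATETRUNC(..., day)."""
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


def yesterday_bound(now: datetime) -> datetime:
    """Mirror DATETRUNC(DATEADD(DATETIME("now"), -1, day), day)."""
    return truncate_to_day(now - timedelta(days=1))


def last_full_weeks(now: datetime, weeks: int = 26):
    """Return (start, end) for the last `weeks` full Monday-Sunday weeks.

    Assumption: the end bound is 00:00 on the Monday of the current week,
    and the start bound is exactly `weeks` weeks before that.
    """
    this_monday = truncate_to_day(now) - timedelta(days=now.weekday())
    return this_monday - timedelta(weeks=weeks), this_monday


# Example with a fixed, made-up "now" (a Wednesday) so the result is repeatable.
now = datetime(2024, 5, 15, 13, 45)
start, end = last_full_weeks(now)
print(start, end)            # 2023-11-13 00:00:00 2024-05-13 00:00:00
print(yesterday_bound(now))  # 2024-05-14 00:00:00
```

If the feedback Superset shows for a time range differs from the bounds computed this way, the expression is probably being interpreted differently than intended.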
SQL Lab
Superset allows users to query data via SQL using a dedicated tool called SQL Lab: https://superset.wikimedia.org/superset/sqllab. Multiple databases are available to query:
- presto_analytics_hive to explore Data Lake data using Presto
- Druid Analytics SQL to explore Druid data cubes (note: the full power of SQL is not available for Druid data)
- mysql wikishared: contains data from a MariaDB database. It has data from the x1 cluster.
- mysql_staging: a staging database created for analysts to dump data from various wiki projects. This is not used anymore.
- Note: SQL Lab does not have all the MariaDB replicas integrated yet.
- etc...
Troubleshooting
Accessing a dashboard fails with "SupersetApiError: Not found"
If you try to access a dashboard and it fails with "Unexpected error: SupersetApiError: Not found", the dashboard is still a "draft", which means it is accessible only to its owners and to Superset admins (task T405535). The owner will need to either publish the dashboard or add you as a co-owner. To figure out who the owner is, consider where you got the link or ask the Data Platform SRE team for help (for example, in the #talk-to-data-engineering channel in the Wikimedia Foundation Slack).
Administration
Please see Superset/Administration for more information.
See also
- phab:tag/superset - Bug reports and feature requests in mw:Phabricator
- https://github.com/wikimedia/incubator-superset Wikimedia fork of Superset
- phab:T211706 Phabricator ticket with misc tips and history of ongoing maintenance