
Analytics/Systems/Superset: Difference between revisions

[https://superset.incubator.apache.org/ Superset] is an Apache incubator project, originally started at Airbnb, that enables building visualizations and dashboards from various analytics data sources. WMF's Superset instance can be found at https://superset.wikimedia.org. Like [[Analytics/Systems/Turnilo-Pivot|Turnilo]], it provides access to various [[Analytics/Systems/Druid|Druid]] tables, as well as data in Hive (and elsewhere) via Presto.
 
== Access ==
To access Superset, you need <code>wmf</code> or <code>nda</code> LDAP access. For more details, see [[Analytics/Data access#LDAP access]]. Once done, you can log in using your Wikitech username and password.
 
== Druid datasources vs Druid tables ==
Where possible, avoid adding new datasources to the [https://superset.wikimedia.org/druiddatasourcemodelview/list/ "Druid Datasources" list]; add Druid tables (backed by Druid SQLAlchemy mappings) instead. To add a Druid table:
* Go to Sources -> Tables
* Hit the + sign to add a new one
* Use <code>Druid Analytics SQL</code> as the database, <code>druid</code> as the schema, and the Druid datasource name as the table name (e.g. <code>edits_hourly</code>, <code>wmf_netflow</code>, etc.)
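Once added, the table can be queried from SQL Lab like any other. A sketch of what such a query could look like, assuming the <code>wmf_netflow</code> table from above (only <code>__time</code> is a standard Druid column; everything else depends on the table's schema):

<syntaxhighlight lang="sql">
-- Row counts per hour over the last day (Druid SQL).
SELECT FLOOR(__time TO HOUR) AS hour,
       COUNT(*) AS row_count
FROM druid.wmf_netflow
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY 1
</syntaxhighlight>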
 
== Usage notes ==
* The [https://superset.wikimedia.org/druiddatasourcemodelview/list/ "Druid Datasources" list] shows ingested tables that are available for querying. As of October 2018, this includes e.g. daily and hourly [[Analytics/Data Lake/Traffic/Pageview hourly|pageviews data]] (the daily version is only updated once a month, but goes further back), a sampled excerpt of [[webrequest]] data, unique devices, and a few select EventLogging schemas. If a recently created Druid datasource is not yet visible in the list, try clicking [https://superset.wikimedia.org/druid/scan_new_datasources/ "Scan New Datasources"].
 
* NULL values don't show up properly in the values selection dropdown list for filters (i.e. that dropdown can't be used to exclude NULL values from a chart or to limit the chart to NULL values). As a workaround, use the regex option instead: type in ".+" (without the quotes), and accept the offer to create that as an option.
* Always use predefined SUM metrics when available. When you pick a column and choose the SUM aggregation function yourself, the aggregation is managed by Superset and uses the <code>floatSum</code> operator. This operator accumulates in 32-bit floats instead of 64-bit longs or doubles, leading to inaccuracies. Predefined <code>SUM(...)</code> metrics are usually available and should be preferred, as they are manually defined using the 64-bit <code>doubleSum</code> or <code>longSum</code> operators.
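The 32-bit rounding problem behind this advice can be sketched in a few lines of Python (an illustration of IEEE-754 single-precision accumulation, not Superset or Druid code):

<syntaxhighlight lang="python">
import struct

def to_float32(x):
    # Round-trip through IEEE-754 single precision, emulating a 32-bit accumulator.
    return struct.unpack('f', struct.pack('f', x))[0]

total32 = 0.0  # accumulated with float32 rounding, like floatSum
total64 = 0.0  # accumulated in double precision, like doubleSum
for _ in range(100):
    total32 = to_float32(total32 + 1_000_000_003.0)
    total64 += 1_000_000_003.0

print(total64)  # 100000000300.0 (exact: every partial sum fits in a double)
print(total32)  # noticeably off: float32 cannot even represent the true total
</syntaxhighlight>

With values around 10<sup>9</sup>, a float32 can only resolve steps of 64 or more, so the trailing digits are silently dropped at every addition.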
 
== SQL Lab ==
 
Superset allows users to query data via SQL using a dedicated tool called SQL Lab: https://superset.wikimedia.org/superset/sqllab
Multiple databases are available to query, including:
* <code>presto_analytics_hive</code> to explore Hive databases/tables via Presto.
* <code>Druid Analytics SQL</code> to explore all Druid datasources via SQL (warning: not all SQL features are available).
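As an illustration, a Presto query in SQL Lab might look like the following (the <code>wmf.webrequest</code> table and its <code>year</code>/<code>month</code>/<code>day</code>/<code>hour</code> partition columns are used here as an example; always filter on partitions to avoid scanning the full table):

<syntaxhighlight lang="sql">
-- Top 10 hosts by request count for one hour of webrequest data (Presto).
SELECT uri_host,
       COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = 2021 AND month = 1 AND day = 1 AND hour = 0
GROUP BY uri_host
ORDER BY requests DESC
LIMIT 10;
</syntaxhighlight>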
 
== Administration ==
=== Upgrading ===
To upgrade, first follow the instructions in the [https://github.com/wikimedia/analytics-superset-deploy analytics/superset/deploy README] to update the deploy repository.  Once deployed, activate the Superset virtualenv, add <tt>/etc/superset</tt> to <tt>PYTHONPATH</tt> (so that Superset picks up its configuration) and follow the [https://superset.incubator.apache.org/installation.html#upgrading Superset upgrade instructions] (minus the <tt>pip install superset --upgrade</tt> part).  This should be something like:
 
<source lang="bash">
. /srv/deployment/analytics/superset/venv/bin/activate
export PYTHONPATH=/etc/superset
superset db upgrade
superset init
</source>
 
=== Deploy ===
This assumes that one has already filed the change for the superset deploy repository. The first thing to do is test the change on the staging instance, an-tool1005.eqiad.wmnet:<syntaxhighlight lang="bash">
# ssh to deploy1001 and set the working directory
ssh deploy1001.eqiad.wmnet
cd /srv/deployment/analytics/superset/deploy
 
# create a new branch from the master one, name it as you prefer
git checkout -B testing_something_important
 
# cherry pick the change in the new branch created
git cherry-pick $change-from-gerrit
 
# deploy only to an-tool1005, without logging to the operations SAL (Server Admin Log)
scap deploy --no-log-message -f -l an-tool1005.eqiad.wmnet "Test deployment for something important"
 
 
</syntaxhighlight>Then check if Superset works as expected:<syntaxhighlight lang="bash">
# Create a ssh tunnel and then test it via localhost:8080 on the browser
ssh -L 8080:an-tool1005.eqiad.wmnet:80 an-tool1005.eqiad.wmnet
</syntaxhighlight>If you are happy with the Superset version, then merge and deploy to the production host:<syntaxhighlight lang="bash">
# ssh to deploy1001 and set the working directory
ssh deploy1001.eqiad.wmnet
cd /srv/deployment/analytics/superset/deploy
 
scap deploy -l analytics-tool1004.eqiad.wmnet "Deployment for something important"
 
 
</syntaxhighlight>
 
== How to ==
 
=== See status ===
<syntaxhighlight lang="bash">
systemctl status superset.service
</syntaxhighlight>
 
=== Bounce ===
<syntaxhighlight lang="bash">
systemctl restart superset
</syntaxhighlight>
 
==See also==
* [https://github.com/wikimedia/incubator-superset incubator-superset] – Wikimedia fork of Superset

Latest revision as of 12:10, 12 January 2023