You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Analytics/Tutorials/Dashboards: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Elukey
imported>Btullis
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
This page describes how to easily build dashboards that visualize statistics on WMF's data sources using Analytics' pipeline and tools.
#REDIRECT [[Data Engineering/Tutorials/Dashboards]]
 
== Overview ==
 
===Supported data sources===
Analytics' dashboarding pipeline supports all data that can be accessed through stat1007.eqiad.wmnet and stat1006.eqiad.wmnet. Examples include:
*MediaWiki database replicas
*EventLogging database replica
*Hive tables on top of pageview, search, edit, and other data
===Supported layouts and visualizations===
Different dashboard layouts allow you to present different types of data.  Available layouts are:
*The '''tabs''' layout lets you organize your visualizations in a tab navigation menu. See: [https://analytics.wikimedia.org/dashboards/browsers/ User agent breakdowns]
*The '''metrics-by-project''' layout addresses the problem of showing stats on multiple projects and languages on the same visualization. See: [https://analytics.wikimedia.org/dashboards/vital-signs/ Vital signs]
*The '''compare''' layout can help you if your metrics are split into 2 realms that are to be compared, for example: Wikitext editor vs. VisualEditor. See: [https://edit-analysis.wmflabs.org/compare/ Edit analysis]
These layouts can be customized to use the available visualizations:
*Line chart
*Tabular text data
*Sunburst chart
*Stacked bar chart
New layouts and visualizations can be added fairly quickly.  The Analytics team will do this for you if necessary.
 
=== Necessary knowledge ===
You don't need to be a software developer to create dashboards using this pipeline. However a basic know-how of the following concepts will be needed:
* SQL to adapt your queries to the expected conventions
* Git and Gerrit for downloading tools and submitting changes for review
 
=== Tools used ===
* '''[[labsconsole:Analytics/Reportupdater|Reportupdater]]'''  It's a python program that periodically executes custom SQL queries against given data sources and progressively builds TSV reports.
* '''[https://github.com/wikimedia/analytics-dashiki Dashiki]'''  It's a serverless dashboarding application written in JavaScript that visualizes TSV reports (among other data sources) and is configured on wiki.
 
=== Work time and calendar time expectation ===
The time it takes to have a dashboard up and running depends on the number of visualizations that you want to include in your dashboard. Also, if this is the first time you use this pipeline, you'll have to spend some time reading this documentation, familiarizing yourself with the tools, creating a repository and deploying a dashboard instance. Finally, the process includes a code review by the Analytics team. As a rough approximation consider: <big>2-3h</big> getting started <big>'''+''' (1-2h</big> work <big>* #visualizations) '''+''' 1-2d</big> waiting for Analytics <big>'''+''' 1-2h</big> testing and deploying.
 
== Step by step ==
The following section explains in detail what you need to do to spin up a dashboard using Analytics dashboarding pipeline. It assumes that you already have one or more SQL queries that work on one of the supported data sources. The steps can be divided into 2 blocks: generating reports and generating visualizations. In case you already have TSV reports in analytics.wikimedia.org/datasets/, please skip the ''Generating reports'' section and go directly to ''Generating visualiations''.
 
=== Generating reports ===
 
==== Adapt your SQL queries to reportupdater's conventions ====
Reportupdater needs the queries to return a '''single data point'''. So modify your queries to look like:<syntaxhighlight lang="sql">
SELECT
    DATE('{timestamp_from}') AS date,
    COUNT(*) AS metric_1,
    SUM(some_value) AS metric_2
FROM some_table
WHERE
    timestamp >= '{timestamp_from}' AND
    timestamp < '{timestamp_to}'
GROUP BY date;
</syntaxhighlight>Note that this query is using 2 reportupdater placeholders: ''{timestamp_from}'' and ''{timestamp_to}''. Reportupdater will replace those with the timestamps of a given time interval. The first column the query selects should be named ''date'' and should be formatted as ''YYYY-MM-DD''. So the final output of the query, for a given <code>timestamp_from = 20160101000000</code> and <code>timestamp_to = 20160102000000</code> will look like this:<pre>
date        metric_1    metric_2
2016-01-01  34          2532
</pre>This is likely to be the only adaptation you need. But you might also want to use reportupdater's ''explode by wiki'' feature. With this option enabled, reportupdater will execute your query for each wiki in a list and generate a separate report for each one of them. To enable ''explode by wiki'', put a new placeholder named {wiki_db} in the FROM statement of your query, like this:<syntaxhighlight lang="sql">
FROM {wiki_db}.some_table
</syntaxhighlight>If you want to dig deeper into all reportupdater features and options, please take a look at the full [[labsconsole:Analytics/Reportupdater|reportupdater documentation]].
 
==== Write the configuration for your reports ====
Create a folder and put all your queries in it. Also, create a new file in it named ''config.yaml''. The folder structure should look like:<pre>
query_folder
    |__ query_1.sql
    |__ query_2.sql
    ...
    |__ query_N.sql
    \__ config.yaml
</pre>The config.yaml file will contain the pieces of information reportupdater needs to execute the queries and create the reports. It has 3 sections: databases, defaults and reports. This is an example of a config.yaml file:<syntaxhighlight lang="yaml">
databases:
    el:
        host: "analytics-store.eqiad.wmnet"
        port: 3306
        creds_file: /a/.my.cnf.research
        db: log
 
defaults:
    db: el
 
reports:
    query_1:
        granularity: days
        starts: 2015-04-01
    query_2:
        granularity: weeks
        starts: 2016-02-01
        explode_by:
            wiki_db: enwiki, dewiki, frwiki
</syntaxhighlight>In the databases section, specify the host and port you want to connect to, point to the credentials file and give a default database. In the defaults section, specify that the default db is the one you just configured. And in the reports section, write a config block for each query you want to run. For each report, you have to specify granularity and start date. If you are using the ''explode by wiki'' feature, indicate which wiki databases you want reportupdater to execute your queries on. Please, look at the full [[labsconsole:Analytics/Reportupdater|reportupdater documentation]] for more detail.
 
==== Test locally ====
A great way to know if your queries and configuration will work in production is testing the generation of your TSV report files on your local machine. To do that, you have to:
# Clone [https://gerrit.wikimedia.org/r/#/admin/projects/analytics/reportupdater reportupdater repository].
# In a separate terminal, run <code>ssh -L 3307:analytics-store.eqiad.wmnet:3306 stat1006.eqiad.wmnet</code> to create an ssh tunnel to the database host. Replace ''analytics-store.eqiad.wmnet'' with your host if necessary.
# Create a credentials file as explained [https://github.com/wikimedia/analytics-reportupdater/blob/master/test/locally/README.md here] and put it into <code>reportupdater/test/locally/</code> folder. Warning: '''never''' push the credentials file to Gerrit!
# Modify the database section in your ''config.yaml'' file to point to <code>localhost:3307</code>, and to point to the credentials file you just created. You can copy the example shown in <code>reportupdater/test/locally/config_example.yaml</code>. Don't forget to revert that when you're finished testing.
# Run reportupdater with <code>python update_reports.py /path/to/your/query/folder/ /tmp/output -l info</code>.
You should see reportupdater running and outputing info logs. If everything goes well, the report files will be written to <code>/tmp/output</code>. Note that this test does not work with queries to the Hive cluster! If this feels too complicated, please let us Analytics know and we'll help you :-).
 
==== Deploy reportupdater job ====
Once reportupdater is able to execute your queries, you can submit a change to gerrit. Add your folder containing the queries and config to the root level of [https://gerrit.wikimedia.org/r/#/admin/projects/analytics/reportupdater-queries reportupdater-queries repository]. If you already have a separate query repository or want to use your own new repository, that's OK. In any case, add someone in Analytics as a reviewer of your patch and ping us. We will review the code and merge it. We'll also create a trigger in puppet to launch reportupdater for you. If everything goes well, after a couple days, you should see your report files made public in [https://analytics.wikimedia.org/datasets/periodic/reports/ analytics.wikimedia.org/datasets/periodic/reports].
 
=== Generating visualizations ===
 
==== Write the configuration for your dashboard ====
Dashiki dashboards are configured on wiki, so you'll need to create a wiki page on meta.wikimedia.org and fill in some lines of JSON configuration. You should name the page <code>Config:Dashiki:NameOfYourDashboard</code>. Depending on which Dashiki layout you choose, your configuration will be different.
 
===== Config for the tabs layout =====
The most generic Dashiki layout is the ''tabs layout''. With it you can organize your visualizations under a tabs navigation menu. Looks like this: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os
 
To use it, write a JSON config following this example:<syntaxhighlight lang="json">
{
    "title": "Dashboard Title",
    "subtitle": "Dashboard Subtitle",
    "desc": "A description of your dashboard",
    "tabs": [
        {
            "title": "Tab 1 Title",
            "dataRange": {
                "startDate": "2015-06-01" // graphs will show data from this day on
            },
            "graphs": [
                {
                    "title": "Graph 1 title",
                    "type": "dygraphs-timeseries", // visualizer name
                    "path": "path/to/report/file1.tsv", // relative to analytics.wikimedia.org/datasets/periodic
                    "format": "percent" // number format
                }, {
                    "title": "Graph 2 title",
                    "type": "hierarchy",
                    "path": "path/to/report/file2.tsv",
                    "pivot": {
                        "dimension": "column to pivot by",
                        "metric": "column to aggregate (if needed) when pivoting"
                    }
                },
                ...
            ]
        }, {
            "title": "Tab 2 title",
            "graphs": [
                ...
            ]
        },
        ...
    ]
}
 
</syntaxhighlight>You can find a full working example [[m:Config:Dashiki:Sample/tabs|in the Sample/tabs dashboard config]]. Note that the <code>//comments</code> and <code>...</code> in the JSON code above are just explanatory, and they need to be removed.
 
In the ''graph-type'' field you can  specify one of: ''dygraphs-timeseries'', ''table-timeseries'' or ''hierarchy''. Choose ''dygraphs-timeseries'' if you want to graph your data as a line chart. Tip: each column in your report (except the first that holds the date) will become a line in the chart. Choose ''table-timeseries'' if you want to graph your data as a table. Or choose ''hierarchy'' if you prefer graphing your data as a sunburst hierarchical chart. In the latter, your report file needs to have a cube-like structure with a header like: <code>date  dimension1  dimension2  ...  dimensionN  metric</code>, where dimension values are categorical strings and metric is a number. In the ''graph-path'' field you have to write the path to the report file relative to <code><nowiki>https://analytics.wikimedia.org/datasets/periodic/reports/metrics/</nowiki></code>.
 
===== Config for the metrics-by-project layout =====
If you have a metric that you want to visualize for all wikimedia projects (all 200+) this is the layout you want to use, it looks like this: https://analytics.wikimedia.org/dashboards/vital-signs/#projects=ruwiki,frwiki/metrics=Pageviews
 
This Dashiki layout is very useful for reports that use the ''explode by wiki'' feature (see: Generating reports). It will let the users select the wikis they want to add to the chart in a friendly way. To use it, you have to add a bit of JSON config to an existing page in meta.wikimedia.org first: [[m:Dashiki:CategorizedMetrics|Dashiki:CategorizedMetrics]]. Open it for editing and add a config block for your queries:<syntaxhighlight lang="json">
{
    "name": "Category your metrics belong to",
    "metrics": [
        {
            "definition": "https://meta.wikimedia.org/wiki/Some_page_that_explains_your_metric",
            "name": "A Readable Metric Name",
            "metric": "name_of_your_query_folder",
            "submetric": "name_of_your_query",
            "api": "datasets"
        },
        ...
    ]
}
</syntaxhighlight>Be careful not to write any JSON incompatible code, otherwise other dashboards may break. You can check if everything is OK by visiting [https://analytics.wikimedia.org/dashboards/vital-signs this dashboard]. If you see data, everything's fine; otherwise, please revert your change to ''Dashiki:CategorizedMetrics'' and try again. After that, you have to write the actual configuration for your dashboard in <code>Config:Dashiki:NameOfYourDashboard</code> page, but this one is really short:<syntaxhighlight lang="json">
{
    "defaultProjects": [
        "enwiki", "dewiki", "frwiki"
    ],
    "defaultMetrics": [
        "Metric Name 1"
    ],
    "metrics": [
        "Metric Name 1", "Metric Name 2", "Metric Name 3"
    ]
}
 
</syntaxhighlight>Note that you can only list as much ''defaultProjects'' as wiki files exist in your reports folder. And the metric names should be specified in the ''Dashiki:CategorizedMetrics'' page.
 
===== Config for the compare layout =====
The compare layout works well when you have two families of reports that share the same metrics, and you want to compare them. An example could be an A/B test, where you want to compare the metrics of both groups A and B. Dashiki expects the report files to be structured like this:<pre>
https://analytics.wikimedia.org/datasets/periodic/reports/metrics
    |__query_name_1
    |  |__group_name_1
    |  |  |__enwiki.tsv
    |  |  |__dewiki.tsv
    |  |  \__frwiki.tsv
    |  \__group_name_2
    |      |__enwiki.tsv
    |      |__dewiki.tsv
    |      \__frwiki.tsv
    |__query_name_2
    |  |__group_name_1
    |  |  |__enwiki.tsv
    ... ... ...
   
</pre>Note that this layout is also designed specifically for the ''explode by wiki'' feature! So write a configuration that follows this example:<syntaxhighlight lang="json">
{
    "a": "group_name_1",
    "b": "group_name_2",
    "startDate": "2015-04-01",
    "comparisons": [
        {
            "title": "Readable Metric Name 1",
            "type": "timeseries",
            "metric": "query_name_1",
            "desc": "Description of the chart 1"
        },
        {
            "title": "Readable Metric Name 2",
            "type": "timeseries",
            "metric": "query_name_2",
            "desc": "Description of the chart 2"
        },
        ...
    ]
}
 
</syntaxhighlight>
 
==== Test locally ====
Once you have the configuration in place on ''meta.wikimedia.org'', you can easily test the dashboard on your local machine (your desktop, no VM needed) following these steps:
# You need [https://www.npmjs.com/ npm] installed in your machine.
## In addition to npm, the two packages [https://bower.io bower] and [https://gulpjs.com gulp] need to be installed and available from the command line.
# Clone [https://gerrit.wikimedia.org/r/#/admin/projects/analytics/dashiki Dashiki repository] and <code>cd</code> into it.
# Execute <code>npm install</code> to setup the dependencies.
# Execute<code>cd semantic && gulp build</code>.
# Build the project with <code>gulp --layout <LAYOUT> --config <CONFIG></code>. Where <LAYOUT> is either tabs, metrics-by-project or compare; and <CONFIG> is the title of your config page in ''meta.wikimedia.org'' without the ''Config:'' prefix.
# Spin up a local file server, for example by executing <code>python -m SimpleHTTPServer 5000</code> from the <code>dist</code> subdirectory that contains the output generated by gulp.
# Visit <code>localhost:5000</code> on your browser to see your dashboard, hopefully full of colorful charts :-)
 
==== Deploy your dashboard ====
Once you've tested that the dashboard works in your local machine, you can deploy it for other people to see. Add a block like this one to the end of the config.yaml file living in dashiki's root directory:<syntaxhighlight lang="yaml">
name-of-your-dashboard:
    layout: tabs
    config: MyDashboardConfig
    hostname: name-of-your-dashboard-test.wmflabs.org
</syntaxhighlight>And submit a patch to [https://gerrit.wikimedia.org/r/#/admin/projects/analytics/dashiki Dashiki's repository] in Gerrit. We Analytics will review, merge and deploy it for you!
 
== Support ==
Even when this documentation is not amazing by far, we Analytics will love to help you in configuring your dashboards with this pipeline. So, for any question, feedback or comment on this process, please send an email to the [[mail:Analytics|Analytics mailing list]] or open a ticket on [[phab:|Phabricator]] tagging it with the ''Analytics'' project. Thanks!

Latest revision as of 16:40, 18 March 2022