You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
The Technical Wishes team wants to collect product analytics to understand how people use the features we write. For example, we count the number of times users toggle the syntax highlighting feature, and in which editing interfaces.
Our features emit raw statistics or events into a backend (StatsD or EventGate). The collected data is stored in the analytics cluster.
Statistics (StatsD) are simple enough that they are tallied directly in Graphite, these are usually a counter or gauge. Any desired segmentation, for example splitting by wiki, must be done at the site of collection.
Events (EventGate) are multidimensional and are collected as discrete payloads, one for each event. These must be aggregated, reduced down to a small number of simpler metrics.
Finally, we visualize the data, usually as a graphical dashboard in Grafana.
For more help, please visit IRC:
Freenode/#wikimedia-analytics to ask about events or the analytics infrastructure.
Freenode/#wikimedia-observability to talk about Graphite
The following template aims to help clearing up the goal and the tools in a structured way. It should simplify the communication between developers and the product team.
|dimension||possible values||include as variable?|
|<name of metric>||Toggle||...|
|description||describe metric||number of button clicks in the editor (to activate or deactivate syntax highlighting)||...|
|path||Graphite path to metric||MediaWiki.CodeMirror.toggles||…|
|dimensions||name available dimensions, defined above||byEnabled, byWiki, byEditor||...|
|normalization||describe if and how the data should be normalized||per session (not per user)||...|
|graphs||name wanted graphs and link to the ones that are done||Average activations per session
Activation per session, by wiki
compare toggle activations and deactivators in wikitext_2010 and in wikitext_2017
We're also experimenting with this format, written like a right-to-left data flow:
|Research questions||Dashboard ▭||Panel ▭ ←||Aggregate metrics ═ ←||Aggregation ◯ ←||Collected metrics ═ ←||Collected by ◯ ←||Initiated by ▭||Status summary||History||Deprecation plan|
|Reference Previews Usage||reference_previews.baseline.footnote_clicks_per_pageview.$wiki||reference-previews/baseline||ReferencePreviewsBaseline||Cite extension||Click on reference footnote||Active||Starts 2019-11||Drop once the probation period ends. *Note: This period doesn't really exist as a thing, yet.|
Note: do not create new EventLogging schemas, this is a legacy system. Use the “modern event platform” schemas (EventGate) instead.
Before deploying any major feature, we believe it's important to collect baseline data for at least one month, to help us understand the impact of our changes.
Data can be continuously collected until no longer needed, but we encourage ourselves to sunset all analytics and remove at the appropriate time.
If a very large number of events (for example, all pageviews) would cause events to be recorded, it's important to randomly sample events to keep the volume down. This should be done on a per-session basis when possible, or per-pageview, rather than a per-event basis. In other words, sample across users so that e.g. frequent usage is correctly measured. Use the built-in functions like mw.eventLog.inSample for this. Try to match your sample rate to that of related features, for example VisualEditor is sampling at 6.25%.
- Clone the mediawiki-extensions-EventLogging repo.
- See EventLogging/devserver/README.md file for up-to-date instructions.
Analytics cluster access
You’ll need at least some level of analytics access in order to work on the aggregation code. Shell access gives the most flexibility, allowing you to use Jupyter and directly run queries against Hive or the MariaDB replicas. Just getting Superset access might be good enough for some minimal cases, this gives you access to the data but through a web UI. The drawback of using Superset is that we don’t expect the queries run there to behave identically when run on the cluster.
Gaining access rights
- Read through responsibilities and the access request process.
- Generate a new password protected SSH key pair to use exclusively for Wikimedia production access.
- File an access request on Phabricator:
- Generic Phabricator template for access requests.
- The required access group for our use case is analytics-privatedata-users (details and other groups).
- Monitor the ticket for any additional information SRE may need.
Setting up environment
1. Configure SSH access
2. Connecting to analytics client
In order to facilitate easy file sharing within our team, we should primarily use the same analytics client, in this case stat1005. If your SSH config is configured correctly connecting should be as simple as:
If prompted, verify the corresponding fingerprints:
stat1005: Help:SSH Fingerprints/stat1005.eqiad.wmnet - Wikitech List of analytics clients: Analytics/Systems/Clients - Wikitech
3. Configuring Kerberos access
When connected to an analytics client you must authenticate with Kerberos prior to running any queries:
When asked for a password check your email for a temporary Kerberos password (this email should have been sent at roughly the same time production access was granted). You should be prompted to change this password after successfully authenticating. This command needs to be run at least every 5 days to reauthenticate.
4. [Optional] Installing wmfdata-python
wmfdata-python is a helpful library provided by the WMF analytics team which provides an easy way to run both HiveQL and SQL queries.
Verify that you have a default venv in your home directory on the analytics client:
If venv is not present:
Activate the default venv using:
In order for the pip install to work you will need to run the following before installing (otherwise you may receive a “Connection to pypi.org timed out” error):
export http_proxy=http://webproxy.eqiad.wmnet:8080 export https_proxy=http://webproxy.eqiad.wmnet:8080
Use pip to install wmfdata-python:
pip install git+https://github.com/wikimedia/wmfdata-python.git@release
JupyterLab instances on the analytics client can be accessed through SSH port tunneling. Disconnect any active SSH session with the analytics client and reconnect using:
ssh -N stat1005.eqiad.wmnet -L 8000:127.0.0.1:8000
Open a browser and navigate to:
http://localhost:8000/user/<username>/lab (replace <username> with your shell username)
Login using your wiki-tech credentials.
If you get a “Server not running” error, click “Launch Server” and you should be automatically redirected to the lab interface.
Before being able to execute any queries you will need to log into Kerberos.
In JupyterLab, go to File > New > Terminal and in the new terminal run:
and enter your Kerberos credentials when asked.
Now you are ready to create a new notebook and get to querying.
In JupyterLab, go to File > New > Notebook and when asked, select the preferred kernel (should be “Python 3”). If you followed the install instructions for wmfdata-python, then this is the kernel which has the python library installed.
Happy querying! (See also examples further below)
For more info on using JupyterLab:
Hive queries can be run using
beeline, either from the interactive prompt or as a batch script. The database can be selected just like in MySQL, for example “use event”.
MariaDB queries can be run from the
mysql command-line, but you’ll need to set up the relevant replica host in
~/.my.cnf, for example:
# enwiki host = s1-analytics-replica.eqiad.wmnet port = 3311
The mapping from wiki database to shard is documented here. To smoke-test a reportupdater-queries script, use a command-line like this:
bash template_dialog_opens 2021-01-13
Superset is another tool that can be used to smoke test SQL queries. It’s also possible to test Hadoop queries but as they can only be written in SQL (using Presto) instead of HiveQL it’s important to verify that these queries work with Hive before using them in the cronjob.
Although it seems like a good one-stop-shop for querying and visualization we may not be able to replace Grafana with it as it still seems to require users to be in the analytics-privatedata-users access group for both querying and visualizing:
“In both cases, Superset works on behalf of the user, so eventually the username will need to hold read permissions for Hadoop data to correctly visualize what requested. This is guaranteed by being into analytics-privatedata-users, that gets deployed on the Hadoop master nodes (without ssh access) to outline user permissions on HDFS.” from https://wikitech.wikimedia.org/wiki/Analytics/Data_access.
Along with other unforeseen pitfalls.
Spoof SQL queries locally
When writing queries for production SQL replicas it’s also feasible to initially test these queries against a locally installed database (i.e. MediaWiki dev env). This has the benefit of requiring no extra permissions and will result in faster query times. The downside is lack of testing data and that you must either configure your MediaWiki install appropriately for the queries you are trying to run or create mock database tables/entries.
Library provided by the WMF analytics team which provides an easy way to run both HiveQL and SQL queries.
from wmfdata import hive, mariadb
Example MariaDB (SQL)
mariadb.run( """ SELECT COUNT(*) AS total FROM user_properties WHERE up_property = 'usecodeeditor' AND up_value = 1 """, "enwiki" )
Example Hadoop (HiveQL)
hive.run( """ SELECT COUNT(*) AS total, event.editor_interface, wiki FROM event.EditAttemptStep WHERE event.action = 'ready' AND event.editor_interface IN ( 'visualeditor', 'wikitext-2017' ) AND year = 2021 AND month = 1 AND day = 1 GROUP BY event.editor_interface, wiki """ )
ReportUpdater is a WMF tool which we use to periodically run SQL and Hive queries. The results of these queries are then subsequently stored in Graphite for use with Grafana.
For more information on analytics-reportupdater: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater
Installing a development branch
Reportupdater normally runs on an-launcher1002, a private analytics instance that we cannot access. To help debug queries, you can also run your own copy of the script. Make sure not to send data to Graphite.
- Connect to the stat1005 client
- Create a working directory (e.g. reportupdater)
mkdir ~/reportupdater cd ~/reportupdater
- Clone analytics-reportupdater
git clone https://github.com/wikimedia/analytics-reportupdater.git
- Clone analytics-reportupdater-queries
git clone https://github.com/wikimedia/analytics-reportupdater-queries
- Create a new venv for reportupdater
python3 -m venv --system-site-packages analytics-reportupdater/venv
- Activate the new venv
- Install requirements
export http_proxy=<nowiki>http://webproxy.eqiad.wmnet:8080</nowiki> export https_proxy=<nowiki>http://webproxy.eqiad.wmnet:8080</nowiki> pip install -r analytics-reportupdater/requirements.txt
- Create an output directory
- Connect to the stat1005 client
- Authenticate with Kerberos
- Activate the venv
- (optional) Pull your changes for (or directly modify) analytics-reportupdater-queries
- (important) Adjustments in the config.yaml
- If you are working on CodeMirror you would adjust the following config
- Speed up the query by adjusting all starts dates closer to the current date
- ! important ! Remove all mentions of graphite and any of its values (prevents breaking production aggregates in graphite)
- ! important ! If you are running SQL queries you will need to replace
- If you are working on CodeMirror you would adjust the following config
- Run the queries (example for codemirror)
python ~/reportupdater/analytics-reportupdater/update_reports.py -l info ~/reportupdater/analytics-reportupdater-queries/codemirror/ ~/reportupdater/output
Working on queries
- Before any query is added to the Reportupdater-queries it should be smoke tested (verifying that the queries work) using either CLI or JupyterLab
- Explain the path notation in config.yaml
- Hive query results must have a column which represents the wiki name
- SQL query results do not need to have such a column (reportupdater runs the query for each wiki database separately according to what is configured in the config.yaml)
Visualization in Grafana
Grafana is the tool we use to visualize statistics (about grafana and wikimedia | list of grafana functions). Tech Wish dashboards are collected in a folder (also findable with tags techwish and wmde).
Tips for designing a dashboard
- Link to the ticket from the board.
- Add an introductory text element at the top of the board.
- Name a responsible person.
- Create one graph for each metric (and if it makes sense for each dimension).
- Use variables to include more than one dimension if it makes sense.
- Do not stack charts and use colors to make it clear it is a stream and not a line graph.
- Use mean numbers and percentages.
- Use total numbers in addition to display scale differences.
Any change to the analytics data pipeline must be considered in light of side-effects it may have on already deployed processes. It’s similar to a database schema, but each step of the pipeline has different fields. For example, when collection is changed to include a new field, and an old field is deprecated (must omit or send null, both are equivalent), then the aggregation must take into account the field change and should be liberal about accepting either old or new events.
There are strict rules for how a schema can be updated. A simplified guideline is that old events must still be valid when tested against a newer schema. Practically, this means that the old legal changes are to add new optional fields, and to add enum options to an existing list. EventGate includes linting rules that will prevent you from breaking the guidelines.
Install the git commit hook from the top level of the schema repo (e.g.
current.yaml for your schema, increment the version number, and let the git pre-commit hook take care of propagating the changes into generated files.
Once a schema is updated and published, we can update the producer’s
attributes.EventLogging.Schemas to target the newer schema revision.
Schema changes are deployed by merging patches in the schemas-event-secondary repository. This causes your topic to be automatically pushed to production and causes a schema update in Hadoop.
Production schemas are visible on schema.wikimedia.org, for example. There is a time lag of up to an hour while we wait for puppet to copy new schemas to the machines.
Schema “examples” events specified in the YAML are pushed through a separate “canary” domain roughly hourly, these test that the topic and its validator are working as expected.
Monitor the eventgate-validator dashboard and make sure that the collection code isn’t sending invalid events.
Most of our metrics will be used for short-term questions surrounding product launch. These should be associated with a clean-up ticket, and clearly marked as temporary in source files. When we no longer have need for the metrics, they should be removed starting with the visualization and working backwards to aggregation and collection.
Long-term metrics should be carefully selected, these should be only the most useful indicators and if possible any invasive measurements (e.g. raw user ID or edit count) can be removed from collection.