Analytics/AQS/Media metrics


Latest revision as of 13:06, 12 September 2019

The terms media counts, media requests and media views refer to different ways of measuring the number of times that images, videos and sounds from Wikimedia Commons are viewed or played. This page describes the three approaches.

Media counts

Main article: Analytics/Data Lake/Traffic/Mediacounts

The mediacounts stream holds counts of how often an image, video, or audio file from upload.wikimedia.org has been transferred to users.

See the Phabricator ticket for the long-standing community request: https://phabricator.wikimedia.org/T210313

Media requests

Media requests is the proposed name of the set of AQS endpoints that will serve media request data. It will feature roughly the same endpoints as the current pageview API:

  • Monthly aggregates per project
  • Media requests per file
  • Top 1000 files per month/day

Media requests share the caveats of media counts: prefetches and requests for media that the user never ends up viewing are counted as valid traffic, so the metric is noisy. Nevertheless, it has historical value, because any metric that actually gauges whether a file has been viewed by a user (media views, described below) will be instrumented as an event and cannot be backfilled.

Project as a dimension

All of our Wikistats metrics currently have a project dimension. In the case of media requests it's a bit tricky because files do not belong intrinsically to a specific wiki. However, for roughly half of the webrequests to files we can retrieve the project that it was requested from by using the referer string.

This means that instead of a traditional project field, these metrics will have a referer field, which can be one of {project}, internal, external, search-engine, unknown, or none.
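As a rough illustration of how a referer string might map onto these six values, here is a minimal sketch. The domain lists, the search engine names, the treatment of "-" as an empty referer, and the collapsing of mobile hostnames are all assumptions made for the example; this page does not specify the real classification rules.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete lists (assumptions, not the real rules).
PROJECT_DOMAINS = ("wikipedia.org", "wiktionary.org", "wikibooks.org", "wikiquote.org")
INTERNAL_DOMAINS = ("wikimedia.org", "mediawiki.org", "wikidata.org")
SEARCH_ENGINES = {"google", "bing", "duckduckgo", "yahoo", "baidu", "yandex"}


def _under(host, domains):
    """True if host equals one of the domains or is a subdomain of it."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify_referer(referer):
    """Map a raw referer header to one of the referer field values."""
    # Missing or empty referer (assumption: "-" is how logs mark an empty one).
    if referer is None or referer.strip() in ("", "-"):
        return "none"
    try:
        host = urlsplit(referer.strip()).hostname
    except ValueError:
        return "unknown"
    if not host:
        # Present but not parseable as a URL with a hostname.
        return "unknown"
    host = host.lower()
    if _under(host, PROJECT_DOMAINS):
        # Assumption: mobile hosts (en.m.wikipedia.org) collapse to the project.
        return host.replace(".m.", ".")
    if _under(host, INTERNAL_DOMAINS):
        return "internal"
    if SEARCH_ENGINES.intersection(host.split(".")):
        return "search-engine"
    return "external"
```

For example, under these assumptions classify_referer("https://en.m.wikipedia.org/wiki/Cat") yields the project en.wikipedia.org, while a referer on commons.wikimedia.org is counted as internal.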

Study of signal/noise

As agreed during the 2019 Analytics Team offsite, to make sure that this data is useful we need to check what proportion of incoming requests for media files are prefetches versus media actually viewed by users. If the team considers that the data has enough value, we'll proceed with the productionization steps described below.

Steps to put in production

Oozie job

The current mediacounts Oozie job loads data hourly into the Hive mediacounts table and generates a daily public dump. This job needs to be modified to generate the aggregates per project. The regular expression that classifies files into types of media seems to be badly out of date and needs to be revamped.
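To make the classification step concrete, here is a minimal sketch of sorting a requested file into a media type by its extension. It is not the regular expression used by the existing job, which is not reproduced on this page; the category names and extension lists are hypothetical and incomplete.

```python
import posixpath
from urllib.parse import unquote, urlsplit

# Hypothetical categories and extensions (assumptions for illustration only).
MEDIA_TYPES = {
    "image": {"jpg", "jpeg", "png", "gif", "svg", "tif", "tiff", "webp"},
    "video": {"webm", "ogv", "mpg", "mpeg"},
    "audio": {"ogg", "oga", "mp3", "wav", "flac", "mid"},
    "document": {"pdf", "djvu"},
}


def classify_media(uri_path):
    """Return the media type for a request path, or "other" if unrecognised."""
    # Only the last path segment matters, so thumbnails and transcodes are
    # classified by the file that is actually served.
    filename = posixpath.basename(unquote(urlsplit(uri_path).path))
    extension = posixpath.splitext(filename)[1].lower().lstrip(".")
    for media_type, extensions in MEDIA_TYPES.items():
        if extension in extensions:
            return media_type
    return "other"
```

A lookup table like this is easier to keep current than a single large regular expression, although it cannot tell an audio-only .ogg from a video one; that distinction is left open here.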

Endpoints in AQS

Endpoint description: example URIs

Aggregate of media requests per referer:
  • /metrics/legacy/mediarequests/aggregate/en.wikipedia.org/[all-media-types]/[all-sites]/[all-agents]/daily/2010100100/2010103100
  • /metrics/legacy/mediarequests/aggregate/external/[all-media-types]/[all-sites]/[all-agents]/daily/2010100100/2010103100

Media requests per file:
  • /metrics/legacy/mediarequests/per-file/es.wiktionary/[mobile-web]/[all-agents]/caprichoso/daily/2015110100/2015110100
  • /metrics/legacy/mediarequests/per-file/internal/[mobile-web]/[all-agents]/caprichoso/daily/2015110100/2015110100

Top media by number of requests:
  • /metrics/legacy/mediarequests/top/pt.wikipedia/[all-media-types]/[mobile-app]/2015/11/01
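The example URIs follow a fixed segment order, which the sketch below assembles from its parts. The function and parameter names are hypothetical, the endpoints are still a proposal, and the bracketed values are passed through verbatim exactly as they appear in the examples.

```python
from datetime import datetime

BASE = "/metrics/legacy/mediarequests"


def _hourly_stamp(moment):
    """Format a datetime as YYYYMMDDHH, as used in the example URIs."""
    return moment.strftime("%Y%m%d%H")


def aggregate_path(referer, media_type, site, agent, granularity, start, end):
    """Path for the aggregate of media requests per referer."""
    return "/".join([BASE, "aggregate", referer, media_type, site, agent,
                     granularity, _hourly_stamp(start), _hourly_stamp(end)])


def per_file_path(referer, site, agent, file_name, granularity, start, end):
    """Path for media requests per file."""
    return "/".join([BASE, "per-file", referer, site, agent, file_name,
                     granularity, _hourly_stamp(start), _hourly_stamp(end)])


def top_path(referer, media_type, site, day):
    """Path for the top media by number of requests on a given day."""
    return "/".join([BASE, "top", referer, media_type, site,
                     day.strftime("%Y/%m/%d")])


# Reproduces the first aggregate example.
print(aggregate_path("en.wikipedia.org", "[all-media-types]", "[all-sites]",
                     "[all-agents]", "daily",
                     datetime(2010, 10, 1), datetime(2010, 10, 31)))
```

Because the referer takes the place of the project segment, the same helper covers both a wiki such as en.wikipedia.org and a class such as external or internal.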

All endpoints need to be tested in beta. Fun!

For requests per file, only the daily granularity needs to be stored in Cassandra: monthly granularity is computed on the fly by AQS, as with pageviews per article.
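The on-the-fly monthly roll-up amounts to summing the stored daily rows by calendar month. The sketch below shows the idea on an in-memory mapping; the function name and data shape are assumptions and do not reflect how AQS implements it.

```python
from collections import Counter
from datetime import date


def monthly_totals(daily_requests):
    """Sum per-day request counts into per-(year, month) totals."""
    totals = Counter()
    for day, requests in daily_requests.items():
        totals[(day.year, day.month)] += requests
    return dict(totals)


# Illustrative numbers only, not real traffic data.
daily = {
    date(2015, 11, 1): 10,
    date(2015, 11, 2): 5,
    date(2015, 12, 1): 7,
}
print(monthly_totals(daily))
```

Storing only daily rows keeps the Cassandra footprint smaller, at the cost of a sum over at most 31 rows per month at query time.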

Cassandra loading

According to SRE's calculations we should be OK regarding storage capacity for these endpoints.

Media requests need to be added to the Cassandra bundle. Depending on the decisions we make about dimensionality (see note above about access-type and access-site), loading time should be around 2 weeks. The loading needs to be monitored periodically.

Wikistats UI

Unless we decide to add a new area for media, the only action item in the Wikistats UI will be to add the configuration for the three metrics and decide on their position in the dashboard, if any.

Media views

Media views are the final stage of measuring viewership of media in the wikis. They are the equivalent of pageviews in that they should be a verified measure that a media file has been viewed by a user, either inside an article or in the media player. The approach discussed so far is to use something like a MediaView event that is triggered when an image, video, or sound is scrolled over or played.

This can't be done with the dataset that the two approaches above use (webrequests). It would require someone on the MediaWiki side to instrument the code to send these events, which we would then aggregate to generate the metrics.

This work will probably be planned for FY2020-2021, but we should start coordinating with Audiences folks to instrument code.