Analytics/Archive/Wikipedia Zero

From Wikitech-static

Wikipedia Zero was a set of scripts used to automatically analyze the traffic of Wikipedia Zero. The generated files were available through http://gp.wmflabs.org/ (cf. domain description). They were deprecated and taken offline on 2015-03-20 (task T92920) in favor of username/password-protected graphs on https://zero.wikimedia.org/ .

Source code

The source code for the Wikipedia Zero scripts themselves is at https://gerrit.wikimedia.org/r/#/admin/projects/analytics/wp-zero .

Used streams

The Wikipedia Zero scripts rely on

Generated data

It is currently not clear whether we are allowed to publicly deep-link to the generated data files, so we do not give a full list here. If you are interested in the generated data, browse http://gp.wmflabs.org/ , or contact the Analytics team.

What is ZERO traffic?

ZERO traffic is determined from the following data structure: IP range + time range -> configuration block
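
The lookup described by this data structure can be sketched as follows. This is a minimal illustration, not the code used by the wp-zero scripts: the `Entry` class, its field names, the CIDR notation for IP ranges, and the half-open time interval are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass
class Entry:
    """One (IP range, time range) -> configuration block mapping (hypothetical layout)."""
    network: str              # IP range in CIDR notation (assumption)
    start: datetime           # start of validity, inclusive (assumption)
    end: Optional[datetime]   # end of validity, exclusive; None means open-ended
    config: dict              # the configuration block


def find_config(entries, ip, when):
    """Return the config block of the first entry covering `ip` at time `when`, else None."""
    addr = ip_address(ip)
    for entry in entries:
        in_range = addr in ip_network(entry.network)
        in_time = entry.start <= when and (entry.end is None or when < entry.end)
        if in_range and in_time:
            return entry.config
    return None
```

A request whose IP address and timestamp match no entry gets `None`, which a caller would treat as "not ZERO".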

At the time of the request, find the appropriate config block. The request is ZERO if:

  • "enabled" is true or missing
  • "sites" is
    • missing: m.wikipedia.org and zero.wikipedia.org are ok
    • empty array: ALL projects are ok
    • values: Whichever domain values are there
  • "whitelistedLangs" is empty or the request’s language is in there
  • If the request is HTTPS, "enableHttps" must be present and true (not implemented yet)
  • "proxies" (not implemented yet) – If the X-Forwarded-By header is present, the request is NOT zero unless the "proxies" field is present in the config and the value of the header is listed in it.
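
The rules above can be sketched as a single predicate. This is an illustrative reading of the list, not the actual implementation: the function and parameter names are hypothetical, domains are matched as host suffixes (so `en.m.wikipedia.org` matches `m.wikipedia.org`), and a missing "whitelistedLangs" is treated like an empty one. The HTTPS and proxy checks are included even though the list marks them as not implemented yet.

```python
# Domains accepted when "sites" is missing from the config block.
DEFAULT_SITES = ("m.wikipedia.org", "zero.wikipedia.org")


def host_matches(host, site):
    """Assumed matching rule: host equals the site or is a subdomain of it."""
    return host == site or host.endswith("." + site)


def is_zero(config, host, lang, https=False, forwarded_by=None):
    """Return True if a request counts as ZERO under the given config block."""
    # "enabled" must be true or missing.
    if config.get("enabled", True) is not True:
        return False

    # "sites": missing -> defaults; empty array -> all projects; else listed domains.
    sites = config.get("sites")
    if sites is None:
        sites = DEFAULT_SITES
    if len(sites) > 0 and not any(host_matches(host, s) for s in sites):
        return False

    # "whitelistedLangs" must be empty or contain the request's language.
    langs = config.get("whitelistedLangs") or []
    if langs and lang not in langs:
        return False

    # HTTPS requests need "enableHttps" present and true.
    if https and config.get("enableHttps") is not True:
        return False

    # With X-Forwarded-By set, the header value must be listed in "proxies".
    if forwarded_by is not None and forwarded_by not in (config.get("proxies") or []):
        return False

    return True
```

For example, `is_zero({}, "en.m.wikipedia.org", "en")` is true under these assumptions, while the same call with host `"en.wikipedia.org"` is false because the default sites only cover the mobile and zero domains.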

Procedure to generate new dashboards

This section is obsolete. The "free as of" column should no longer be used (instead, we plot every data point that we have), and as of 2014-04-11, the Wikipedia-Zero team removed the MCC-MNC column in the wikitable.

In order to get a dashboard for the new carrier $CARRIER:

  1. The Wikipedia Zero team creates a JSON definition page (with $CARRIER's MCC-MNC as its name) in the Zero namespace of metawiki. (All information on that page is authoritative, but note that “enabled” != “free as of” in the carrier wikitable.)
  2. The Wikipedia Zero team adds a row for $CARRIER to the carrier wikitable. (Of that table, only the “free as of” column is authoritative; it is connected to $CARRIER's JSON configuration via the MCC-MNC column. In case of a mismatch between the MCC-MNC's link text and its link target, the link text is used to map to $CARRIER.)
  3. The Wikipedia Zero team notifies the Analytics team to generate dashboards for $CARRIER.
  4. The Analytics team schedules creation of dashboard/data for $CARRIER.
  5. The Analytics team updates Kraken with configuration for $CARRIER.
  6. The Analytics team updates the wp-zero repo with configuration for $CARRIER.
  7. The Analytics team backfills data for $CARRIER between the “free as of” date and now.
  8. The Analytics team creates a dashboard for $CARRIER on http://gp.wmflabs.org/ .
  9. The Analytics team asks the Wikipedia Zero team to sign off on $CARRIER's dashboard.

Which parts of the different wiki pages are authoritative was agreed on in July/August 2013. The only exception is the mapping of the carrier wikitable to MCC-MNCs, because back then the MCC-MNCs were plain text. When the Wikipedia Zero team added links to the Zero configuration, the Analytics team initially used the link target. As link target and link text started to diverge for some carriers, it seemed that the Wikipedia Zero team cared more about the link text than the link target, so the Analytics team switched to using the link text. There has been no discussion or agreement on whether link text or link target should be treated as authoritative.

Known problems since 2013-10-01

Besides the problems stemming from the used streams themselves, the generated files are affected by the following problems:

Date from Date until Bug Number Details
* Wrong interpretation of X-Forwarded-For bug 54783
* Missing pageviews due to not capturing API requests bug 54782
2014-08-16 ~22:45 bug 69663 udp2log was restarted to remove some filters and there was a small request drop