
Analytics/Data/Pagecounts-all-sites: Difference between revisions

From Wikitech-static
imported>Milimetric
:'''''See also the [[pageviews API]], available since the end of 2015'''''.
#REDIRECT [[Analytics/Data Lake/Traffic/Pagecounts-all-sites]]
 
'''NOTE: This dataset has been deprecated since 2016-08-01; see [https://lists.wikimedia.org/pipermail/analytics/2016-August/005339.html this thread].'''
 
The '''[https://dumps.wikimedia.org/other/pagecounts-all-sites/ pagecounts-all-sites]''' dataset holds output that mimics [[Analytics/Data/Pagecounts-raw|pagecounts-raw]] files but is generated from Hadoop data using Hive. It also extends<ref>Note that this extension to the mobile and zero sites does not solve the long-standing issues with webstatscollector's pageview definition. It is a stop-gap measure and comes with all the issues of webstatscollector's pageview definition.</ref> the [[Analytics/Webstatscollector#Used_Page_View_definition|webstatscollector pageview definition]] to [[#Requests_to_the_mobile_site_and_requests_from_mobile_devices_or_apps|mobile]] and [[#Requests_to_the_zero_site_and_Wikipedia_Zero_requests|zero sites]].
 
This stream is owned by the [[mw:Analytics|Analytics Team]].
 
== Contained data ==
 
(If you are familiar with the [[Analytics/Pagecounts-raw|pagecounts-raw]] dataset, you might want to look at the [[#Differences_to_the_pagecounts-raw_dataset|differences between those two datasets]] right away.)
 
{{#lst:Analytics/Data/Pagecounts-raw|fields}}
=== Disambiguating abbreviations ending in “.m” ===
 
There are two ways for an abbreviation to end in <code>.m</code>: either the domain is a whitelisted project on <code>wikimedia.org</code> (like <code>commons.wikimedia.org</code> being abbreviated to <code>commons.m</code>), or the domain is the mobile site of Wikipedia (like <code>en.m.wikipedia.org</code> being abbreviated to <code>en.m</code>).
 
Since the whitelisted <code>wikimedia.org</code> projects (see abbreviation table above) never match a language code on Wikipedia, the mapping between domain name and abbreviation is bijective.
 
Although this solution requires an <code>if</code> for the edge case of summing up pageviews across all mobile sites, it stays compatible with [[Analytics/Pagecounts-raw|pagecounts-raw]]'s abbreviations while keeping the concept and semantics of abbreviating domain names. It also makes it easier to automate comparisons between this dataset and TSVs (like [[Analytics/Requests_stream|sampled-1000]]) or Hive data.
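
The <code>if</code> mentioned above could look roughly like the following Python sketch. It is an illustration under assumptions, not the team's implementation: the helper names are hypothetical, the whitelist contains only <code>commons</code> (the one project named in this section) and would need to be completed from the abbreviation table, and each line is assumed to consist of four space-separated fields as in the sample lines on this page.

```python
# Projects on wikimedia.org whose abbreviation ends in ".m".
# ASSUMPTION: only "commons" is confirmed by the text; complete this set
# from the abbreviation table before relying on it.
WIKIMEDIA_ORG_PROJECTS = {"commons"}


def is_mobile_site(abbreviation):
    """Return True if the abbreviation denotes a mobile site such as en.m."""
    parts = abbreviation.split(".")
    # Edge case: "<project>.m" for a whitelisted wikimedia.org project
    # (commons.m -> commons.wikimedia.org) is not a mobile site.
    if len(parts) == 2 and parts[1] == "m" and parts[0] in WIKIMEDIA_ORG_PROJECTS:
        return False
    # Otherwise an "m" component after the first component marks a mobile
    # site (en.m -> en.m.wikipedia.org, de.m.voy -> de.m.wikivoyage.org).
    return "m" in parts[1:]


def sum_mobile_pageviews(lines):
    """Sum the request counts of all lines that belong to a mobile site."""
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) != 4:
            continue  # skip lines that do not have the assumed four fields
        abbreviation, _title, count, _size = fields
        if is_mobile_site(abbreviation):
            total += int(count)
    return total
```

Calling <code>is_mobile_site("commons.m")</code> returns <code>False</code>, while <code>is_mobile_site("en.m")</code> and <code>is_mobile_site("de.m.voy")</code> return <code>True</code>.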
 
=== Differences to the pagecounts-raw dataset ===
 
Every line that is in [[Analytics/Pagecounts-raw|pagecounts-raw]] is also in pagecounts-all-sites.
 
Additionally, pagecounts-all-sites counts requests to the mobile site. For example, the line
 
  de.m.voy Berlin 176 314159
 
counts the mobile site page de.m.wikivoyage.org/wiki/Berlin. It also counts requests to the zero site. For example, the line
 
  ms.zero Cinta_Elysa 4 32944
 
counts the zero site page ms.zero.wikipedia.org/wiki/Cinta_Elysa.
 
Apart from that, there should be no differences between [[Analytics/Pagecounts-raw|pagecounts-raw]] and pagecounts-all-sites.
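
The sample lines in this section can be split into typed fields with a few lines of Python. This is a minimal sketch: the names <code>PagecountLine</code> and <code>parse_line</code> are hypothetical, and the meaning of the third and fourth fields (request count and bytes transferred) is an assumption based on the pagecounts-raw field description, which is transcluded rather than spelled out on this page.

```python
from typing import NamedTuple


class PagecountLine(NamedTuple):
    abbreviation: str  # abbreviated domain name, e.g. "de.m.voy"
    page_title: str    # page title, e.g. "Berlin"
    requests: int      # ASSUMPTION: number of requests in the hour
    bytes_sent: int    # ASSUMPTION: total bytes transferred


def parse_line(line):
    """Parse one space-separated line into a PagecountLine.

    Raises ValueError if the line does not have exactly four fields or
    if the last two fields are not integers.
    """
    abbreviation, page_title, requests, bytes_sent = line.rstrip("\n").split(" ")
    return PagecountLine(abbreviation, page_title, int(requests), int(bytes_sent))
```

For example, <code>parse_line("de.m.voy Berlin 176 314159")</code> yields a record whose <code>abbreviation</code> is <code>de.m.voy</code> and whose <code>requests</code> is <code>176</code>.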
 
=== Requests to the mobile site and requests from mobile devices or apps ===
 
“Mobile site” refers to the site that was requested (URLs having <code>.m.</code> before <code>wikipedia.org</code>, …), not to device identification. Note, however, that mobile phones and tablets are redirected to the mobile sites by default.
 
Also, traffic from mobile apps is not singled out, and according to the [[Analytics/Webstatscollector#Used_Page_View_definition|webstatscollector pageview definition]], API requests are not counted.
 
=== Requests to the zero site and Wikipedia Zero requests ===
 
Wikipedia Zero requests can (depending on the setup for the Wikipedia Zero partner) hit either
* the mobile site (having “.m.” in the unabbreviated domain name), or
* the zero site (having “.zero.” in the unabbreviated domain name).
 
Hence, aggregating all lines that have “.zero” in the domain abbreviation, like
 
  ms.zero Cinta_Elysa 4 32944
 
does ''not'' yield the total volume of Wikipedia Zero traffic; it ''only gives the total volume of traffic to the zero site''.
The bigger part of Wikipedia Zero traffic goes to the mobile site. Note, however, that the mobile site sees both Wikipedia Zero and non-Wikipedia Zero traffic.
So there is no way to compute the “total volume of Wikipedia Zero traffic” from this dataset.
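
To make the limitation concrete, the following Python sketch sums the requests of all lines whose abbreviation contains a <code>zero</code> component. The function names are hypothetical and the four-field, space-separated line layout is assumed from the sample line above. The result is the traffic to the zero site only, not the total Wikipedia Zero traffic.

```python
def is_zero_site(abbreviation):
    """Return True if the abbreviation denotes a zero site such as ms.zero."""
    # A "zero" component after the first component marks the zero site
    # (ms.zero -> ms.zero.wikipedia.org).
    return "zero" in abbreviation.split(".")[1:]


def zero_site_requests(lines):
    """Sum request counts of zero-site lines.

    This is traffic to the zero site only. Wikipedia Zero requests that
    hit the mobile site are not included and cannot be separated out.
    """
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) == 4 and is_zero_site(fields[0]):
            total += int(fields[2])
    return total
```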
 
== Availability ==
 
=== dumps.wikimedia.org ===
The stream is available as hourly files at http://dumps.wikimedia.org/other/pagecounts-all-sites/.
 
To maintain compatibility with [[Analytics/Pagecounts-raw|pagecounts-raw]], the date in the file name refers to the ''end'' of the capturing period, not the beginning.
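
Because the file name carries the ''end'' of the capturing period, the covered hour has to be derived by subtracting one hour. The Python sketch below shows one way to do this. It assumes that file names follow the pattern <code>pagecounts-YYYYMMDD-HHMMSS</code> (the pattern is an assumption and is not stated on this page), and the function name <code>capture_period</code> is hypothetical.

```python
import re
from datetime import datetime, timedelta

# ASSUMPTION: file names contain "pagecounts-YYYYMMDD-HHMMSS".
FILE_NAME_RE = re.compile(r"pagecounts-(\d{8})-(\d{6})")


def capture_period(file_name):
    """Return (start, end) of the hour covered by an hourly file."""
    match = FILE_NAME_RE.search(file_name)
    if match is None:
        raise ValueError("unexpected file name: " + file_name)
    # The timestamp in the name marks the END of the capturing period.
    end = datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")
    start = end - timedelta(hours=1)
    return start, end
```

Under these assumptions, a file stamped <code>20141009-000000</code> would cover 2014-10-08 23:00 up to 2014-10-09 00:00.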
 
=== stat1002.eqiad.wmnet ===
 
The stream is available as hourly files at <code>/mnt/hdfs/wmf/data/archive/pagecounts-all-sites</code> on stat1002.
 
To maintain compatibility with [[Analytics/Pagecounts-raw|pagecounts-raw]], the date in the file name refers to the ''end'' of the capturing period, not the beginning.
 
=== Analytics cluster ===
 
The stream is available as hourly files at <code>/wmf/data/archive/pagecounts-all-sites</code> in the Analytics cluster.
 
To maintain compatibility with [[Analytics/Pagecounts-raw|pagecounts-raw]], the date in the file name refers to the ''end'' of the capturing period, not the beginning.
 
== Events and known problems since 2014-10-01 ==
 
You can [https://wikitech.wikimedia.org/w/index.php?title=Analytics/Pagecounts-all-sites&feed=atom&action=history follow the feed] for these incident updates.
 
{| class="wikitable"
|-
! Date from !! Date until !! Bug !! Details
|-
| 2014-10-08 23:02
| 2014-10-08 23:11
| {{bug|71876}}
| ULSFO connectivity issues causing duplicates and missing requests worth <2 minutes
|-
| 2014-10-13 13:37:15
| 2014-10-13 13:38:26
| {{bug|72028}}
| analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <2 seconds.
|-
| *
| 2014-10-15 19:00:00
| {{bug|66352}}
| Pageviews to “undefined” and “Undefined” pages have been counted
|-
| *
| 2014-10-15 19:00:00
| {{bug|71790}}
| Redirects have been counted
|-
| 2014-10-20T02:05:08
| 2014-10-20T02:05:16
| {{bug|72252}}
| analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <2 seconds.
|-
| 2014-10-20 13:07
| 2014-10-20 13:26
| {{bug|72296}}
| ULSFO connectivity issues causing duplicates and missing requests worth ~3 minutes of data.
|-
| 2014-10-21 11:41
| 2014-10-21 12:00
| {{bug|72352}}
| ULSFO connectivity issues causing duplicates and missing requests worth ~80 seconds of data.
|-
| 2014-10-27T07:12:29
| 2014-10-27T07:12:32
| {{bug|72550}}
| analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <<1 second for the text cluster.
|-
| 2014-11-23 15:16
| 2014-11-23 15:28
| No bugzilla, no bug :-(
| ULSFO connectivity issues causing duplicates worth ~10 minutes of data and missing requests worth ~15 seconds of data.
|-
| 2014-12-04 16:22:36
| 2014-12-04 16:26:55
| {{PhabT|85312}}
| analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth ~30 seconds of total traffic
|-
| 2014-12-10 14:18
| 2014-12-10 14:18
| {{PhabT|85675}}
| analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth ~1 second of total traffic
|-
| 2014-12-10 15:27
| 2014-12-10 15:27
| {{PhabT|85675}}
| Leader re-election brought analytics1021 back into the set of partition leaders. No duplicates, but missing lines worth <1 second of traffic
|-
| 2014-12-11 14:54:33
| 2014-12-11 14:54:35
| {{PhabT|85712}}
| analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <1 second of total traffic
|-
| 2014-12-26 06:02:18
| 2014-12-26 06:02:20
| {{PhabT|85709}}
| analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <1 second of total traffic
|-
| 2014-12-29 17:23:21
| 2014-12-29 17:45:22
| {{PhabT|85695}}
| Broken varnishkafka configuration got picked up by three mobile caches and caused missing data worth 50 seconds of total traffic.
|-
| 2015-01-03 10:21:12
| 2015-01-03 10:21:14
| {{PhabT|85758}}
| analytics1021 dropped out of its partition leader role. No duplicates, but missing lines worth <1 second of total traffic
|-
| 2016-08-15
| today
| {{PhabT|130656}}
|  pagecounts-raw and pagecounts-all-sites are no longer published: https://lists.wikimedia.org/pipermail/analytics/2016-August/005339.html
|}
 
== Idiosyncrasies ==
 
=== Capitalization is split up ===
Some requests look as though they were made to EN.WIKIPEDIA.ORG...  To stay compatible with the original files, we separate the counts per project and per different capitalization.  So, for example, you might see:
 
<pre>
en - 12345 123456
EN - 1 1234
En - 89 12345
</pre>
 
Although the lowercase <code>en</code> entry is the main one and has most of the requests, other requests to English Wikipedia are hiding in the other entries.
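
Consumers who want a single figure per project can fold the differently capitalized entries together. The following Python sketch is one possible way to do so: the function name is hypothetical and the four-field, space-separated layout is assumed from the example above. Only the project abbreviation is lowercased; page titles are left untouched.

```python
from collections import defaultdict


def merge_capitalizations(lines):
    """Merge entries whose project abbreviation differs only in case.

    Returns a dict mapping (lowercased abbreviation, page title) to a
    (count, size) tuple holding the summed third and fourth fields.
    """
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) != 4:
            continue  # skip lines that do not have the assumed four fields
        project, title, count, size = fields
        key = (project.lower(), title)
        totals[key][0] += int(count)
        totals[key][1] += int(size)
    return {key: tuple(value) for key, value in totals.items()}
```

Applied to the three example lines above, the result has a single key <code>("en", "-")</code> with the value <code>(12435, 137035)</code>.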
 
== Note ==
<references/>
 
[[Category:Data stream]]

Revision as of 13:48, 7 April 2017