
Analytics/Data/Redirects: Difference between revisions

imported>BryanDavis
(Replace <syntaxhighlight> with preformatted text)
imported>MarcoAurelio
m (Bot: Fixing double redirect to Analytics/Data Lake/Traffic/Pageviews/Redirects)
This page discusses the following question: What happens when a request goes through redirects like
/wiki/Something Notable
/wiki/Something%20Notable
/wiki/Something_Notable
or
/index.php?title=Something Notable
/index.php?title=Something%20Notable
/index.php?title=Something_Notable
And how do we handle pageview identification and counting on those requests?
 
== Types of redirects ==
The example above shows 2 kinds of redirects, but there are others. Here is a list of the possible cases:
 
=== Direct correct request ===
This is not a redirect, but it serves as a baseline for comparison with the other examples. The browser sends a request for, say, ''Something_Notable'' and Varnish responds with a 200. The Cluster recognizes this as a Pageview.
 
=== URI encodings performed by the browser ===
These are performed before the request is sent. For example: ''Something Notable'' becomes ''Something%20Notable'', and ''"Awesome"'' becomes ''%22Awesome%22''. They have no effect on the pageview computation, because both representations are supported by the PageviewDefinition UDFs, which ultimately normalize them.
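
The following minimal Python sketch illustrates why the encoded and unencoded forms end up as the same title. It is not the PageviewDefinition code; the function name ''normalize_title'' and the exact normalization steps (percent-decoding, then spaces to underscores) are assumptions made for illustration.

```python
from urllib.parse import unquote


def normalize_title(raw_title: str) -> str:
    """Hypothetical normalizer: decode percent-escapes, then use underscores for spaces."""
    return unquote(raw_title).replace(" ", "_")


# The three spellings from the introduction collapse to a single title.
for raw in ("Something Notable", "Something%20Notable", "Something_Notable"):
    print(raw, "->", normalize_title(raw))
```

Under these assumptions, a per-title count keyed on the normalized value would not be split across the different encodings.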
 
=== Capitalization of the first letter ===
Whenever a request is sent with a lower-case first letter, the response is a 301, where the target is the article with a capitalized first letter. The browser will send another request to the new target this time, which should return a 200. The PageviewDefinition does not identify 301 requests as pageviews, so it will only count the second request to the correct page as a pageview.
 
=== Conversion of spaces ===
Conversion of spaces (%20) to underscore is the same case as first letter capitalization. Whenever a request is sent with spaces (%20) in between words, the response is a 301, where the target is the article with underscores instead of spaces (%20). The browser will send another request to the new target, which should return a 200. The PageviewDefinition does not identify 301 requests as pageviews, so it will only count the second request to the correct page as a pageview.
 
=== Other spellings covered by a redirect page ===
This covers any other spelling (alternate spellings, misspellings, abbreviations, translations, capitalizations, plural-vs-singular, etc.) for which the corresponding project has a page that acts as a hard redirect, i.e. its contents start with #REDIRECT [[<target>]]. Such requests are handled by Varnish or the server side and return a 200 response with the contents of the target page, plus a small redirect note like "(Redirected from ...)". However, Varnish logs the URL of the redirect page (before conversion). This is the only potentially problematic scenario, because the cluster computes a pageview for the redirect page even though the contents shown to the user are those of the target page. Nevertheless, it computes only 1 pageview; there are no duplicates.
 
=== Alternate spellings NOT covered by a redirect page ===
If no page exists that covers the requested spelling, Varnish or the server returns a 404, so no pageview is computed for that request.
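
As a rough illustration of the cases above, the sketch below reduces pageview identification to the HTTP status code alone. This is a deliberate simplification: it models only the three status codes discussed on this page, and the table ''PAGEVIEW_BY_STATUS'', the function ''counted_uris'' and the sample requests are all hypothetical. The real PageviewDefinition applies further criteria that are not modeled here.

```python
# Only the status codes discussed on this page are modeled (an assumption of this sketch).
PAGEVIEW_BY_STATUS = {
    200: True,   # content served: counted
    301: False,  # capitalization / space-to-underscore redirect: not counted
    404: False,  # no page covers the spelling: not counted
}


def counted_uris(requests):
    """Return the URIs that would be counted; raises KeyError for unmodeled status codes."""
    return [uri for uri, status in requests if PAGEVIEW_BY_STATUS[status]]


# Hypothetical log of one visit that starts with a lower-case first letter.
visit = [
    ("/wiki/something_Notable", 301),
    ("/wiki/Something_Notable", 200),
]
print(counted_uris(visit))
```

Only the second request is counted, which matches the description of the 301 cases. A hard redirect page, by contrast, would appear as a single 200 request logged under the redirect's own URI.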
 
== Potential problems ==
 
=== Per article analyses ===
The only redirect scenarios that can be confusing (or may be wrong) are the alternate spellings covered by a redirect page. They do not alter global counts, or counts per project, but they alter per article analyses. For example, in the per-article endpoint of the Pageview API, the page "Barack_Obama":
 
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barack_Obama/daily/2016010100/2016010100
 
returns 26166 pageviews, whereas its redirect page "Barack_obama" (note the lower-case 'o'):
 
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barack_obama/daily/2016010100/2016010100
 
returns 30415 pageviews for the same period. All the users who generated these 30415 pageviews actually read the contents of Barack_Obama with capital 'O', but we are counting them as Barack_obama (lower-case 'o'). The research paper mentioned by Aaron:
 
https://mako.cc/academic/hill_shaw-consider_the_redirect.pdf
 
suggests that 55% of the articles in the main namespace are redirects to other pages, so this is surely not a small proportion of pageviews or articles.
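
To compare a page with one of its redirects, the per-article queries quoted above can be assembled programmatically. The sketch below only builds the URLs, following the path layout visible in the example; the function name ''per_article_url'' and its default arguments are assumptions for illustration, and no request is sent.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(title, project="en.wikipedia", access="all-access",
                    agent="user", granularity="daily",
                    start="2016010100", end="2016010100"):
    """Build a per-article URL with the same path segments as the example above."""
    # Percent-encode the title so that characters such as '/' stay within one path segment.
    return "/".join([BASE, project, access, agent, quote(title, safe=""),
                     granularity, start, end])


# Target page and its lower-case redirect, as discussed in the text.
for title in ("Barack_Obama", "Barack_obama"):
    print(per_article_url(title))
```

Because the counts are stored per requested title, the target's readership would have to be obtained by summing the result for the target and for each of its redirect pages.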
 
== Possible solutions? ==
 
=== X-Analytics ===
Add a ''redirectedTo'' field to the X-Analytics header that holds the target URL of the redirect. Note: if the request to the redirect page has ?redirect=no, the ''redirectedTo'' field should be left blank. The ''PageviewDefinition'' would then take the page title from the ''redirectedTo'' field whenever it is not empty.
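
A minimal Python sketch of how this proposal could work is shown below. It is hypothetical throughout: ''redirectedTo'' is only the field proposed here, the helper names are invented for illustration, the header is assumed to consist of semicolon-separated key=value pairs, and the field is assumed to carry a value from which the title can be read directly (a bare title in this example).

```python
def parse_x_analytics(header: str) -> dict:
    """Parse a header assumed to look like 'key1=value1;key2=value2'."""
    fields = {}
    for pair in header.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def title_for_pageview(requested_title: str, x_analytics: str) -> str:
    """Prefer the proposed redirectedTo field; fall back to the requested title."""
    redirected_to = parse_x_analytics(x_analytics).get("redirectedTo", "")
    # A blank field (e.g. for ?redirect=no) or a missing field keeps the requested title.
    return redirected_to or requested_title


print(title_for_pageview("Barack_obama", "ns=0;redirectedTo=Barack_Obama"))
print(title_for_pageview("Barack_obama", "ns=0;redirectedTo="))
```

With this rule, a view that arrives through a redirect page would be attributed to the target, while a request with ?redirect=no would still be attributed to the redirect page itself.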

Latest revision as of 18:59, 13 July 2017