Difference between revisions of "Analytics/Data Lake/Traffic/Unique Devices/Last access solution"
Latest revision as of 19:04, 7 July 2020
- 1 Objective
- 2 Caveats
- 3 Privacy
- 4 Technicalities
- 4.1 How we are counting: Plain English
- 4.2 How we are counting: Technical explanation
- 5 Data Quality Analysis
- 6 More docs
The WMF Analytics Engineering team maintains counts of unique devices per-domain and per project-family, daily and monthly, in a way that does not uniquely identify, fingerprint or otherwise track users. The outcome is reports on the number of Unique Devices per domain or project for a given month or day. This is achieved by setting cookies with a Last-Access day on clients and counting sightings of browsers with an old cookie or no cookie at all.
Reports in the following format:
Redirects (HTTP response codes such as 302 and 307) were originally filtered out from the unique devices computation. While this is still the case for the per-domain unique devices, they have to be included in the per-project-family computation. See Technicalities for more details.
- We are not able to identify users from the data passed in the cookie. The cookie contains only a year, month and day.
To produce the above report we need the following cookies. Each cookie stores the last access date per project.
<<language>>.m.<<project>>.org Mobile site uniques for <<project>> and <<language>>
<<language>>.<<project>>.org desktop site uniques for <<project>> and <<language>>
*.<<project>>.org Uniques for <<project>>
How we are counting: Plain English
Unique devices are computed by adding two numbers: one derived from the WMF-Last-Access or WMF-Last-Access-Global cookie, and an offset. (In the rest of this section, we will use WMF-Last-Access, knowing that the same mechanism applies to WMF-Last-Access-Global.)
A very high level explanation of how this works can be found on our blog: https://blog.wikimedia.org/2016/03/30/unique-devices-dataset/.
For a more technical explanation you can keep on reading.
How we are counting: Technical explanation
1) Request comes in, user does not have a WMF-Last-Access cookie: We issue one with today as the last access date and a future expiry time (any expiry time over a month will work). The cookie value is "14-Dec-2015", for example.
2) Request comes in, user already has a WMF-Last-Access cookie: We re-issue the cookie with today's date and a future expiry time, and record the old date from the cookie in the X-Analytics header. In our prior example, one day has gone by between requests, so the value of the cookie is reset to "15-Dec-2015" and we store the following in the X-Analytics header:
X-analytics["WMF-Last-Access"] = "14-Dec-2015"
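The two steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not the code that runs on the caching layer: the function name `handle_request` and the way the cookie and header are represented are assumptions made for this example.

```python
from datetime import date

# Date format used in the examples above, e.g. "14-Dec-2015".
COOKIE_FORMAT = "%d-%b-%Y"


def handle_request(last_access_cookie, today):
    """Return (new cookie value, X-Analytics entries) for one request.

    `last_access_cookie` is the value of WMF-Last-Access sent by the
    client, or None if the client sent no such cookie (hypothetical
    representation chosen for this sketch).
    """
    x_analytics = {}
    if last_access_cookie is not None:
        # Step 2: record the old date so it ends up in the request logs.
        x_analytics["WMF-Last-Access"] = last_access_cookie
    # Steps 1 and 2: (re)issue the cookie with today's date.
    new_cookie = today.strftime(COOKIE_FORMAT)
    return new_cookie, x_analytics


# First visit on 14 December, second visit one day later.
cookie, header = handle_request(None, date(2015, 12, 14))
cookie, header = handle_request(cookie, date(2015, 12, 15))
```

After the second call, `cookie` holds today's date and `header` carries the date of the previous visit, which is the value the counting queries look at.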
In order to count unique devices for, say, January, we get from the webrequest table all requests from January that do not have a January date recorded in x-analytics["WMF-Last-Access"] (this includes requests without any Last-Access cookie at all). All those requests are counted toward the January uniques, because that device has not visited in January before (according to the Last-Access cookie).
Same logic for daily: to count uniques on December 15th we get all requests for December 15th that have an X-analytics["WMF-Last-Access"] value with a date older than December 15th. So the request in our example above will be counted. Those are uniques for December 15th.
Note that this method of counting assumes that requests come from real users whose browsers accept cookies: if we set a cookie, we expect to be able to retrieve it in a subsequent request. Although we only look at traffic tagged as "user" in the cluster when counting, we have to be aware of bots that are not reported as such. In order to discount those requests, we only count requests that have nocookie=0, meaning that those requests came to us with 'some' cookie set. This method of counting, by definition, underreports users, as it does not count users with a fresh session or users browsing without cookies.
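The cookie-based part of the count can be sketched as follows. This is a minimal Python illustration rather than the production query: the request fields `day`, `last_access` and `nocookie` are hypothetical names standing in for the corresponding webrequest data, with `last_access` set to `None` when no Last-Access value was recorded.

```python
from datetime import date


def is_daily_unique(req, day):
    """True if the request counts toward the cookie-based uniques of `day`."""
    if req["day"] != day or req["nocookie"] != 0:
        return False
    last = req["last_access"]
    # No Last-Access value, or a value older than the day being counted.
    return last is None or last < day


def is_monthly_unique(req, year, month):
    """True if the request counts toward the cookie-based uniques of a month."""
    d = req["day"]
    if (d.year, d.month) != (year, month) or req["nocookie"] != 0:
        return False
    last = req["last_access"]
    # No Last-Access value, or a value from a different month.
    return last is None or (last.year, last.month) != (year, month)


requests = [
    {"day": date(2015, 12, 15), "last_access": date(2015, 12, 14), "nocookie": 0},
    {"day": date(2015, 12, 15), "last_access": date(2015, 12, 15), "nocookie": 0},
    {"day": date(2015, 12, 15), "last_access": None, "nocookie": 0},
    {"day": date(2015, 12, 15), "last_access": None, "nocookie": 1},
    {"day": date(2015, 12, 15), "last_access": date(2015, 11, 30), "nocookie": 0},
]

daily = sum(is_daily_unique(r, date(2015, 12, 15)) for r in requests)
monthly = sum(is_monthly_unique(r, 2015, 12) for r in requests)
```

In this sample, the first request counts for the day but not for the month (the device was already seen on 14 December), and the request with nocookie=1 is left out of both counts.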
Per the X-Analytics documentation, every request that comes in without any cookies whatsoever is tagged with nocookie=1. Such requests include bots, users browsing with cookies off, and users who just launched a fresh "incognito" session in their browser. We did some research on this, and it turns out that nocookie=1 is a cheap proxy to rule out a bunch of what might be bot traffic, see Analytics/Unique_clients/Last_access_solution/BotResearch.
When possible, we want to make sure that we count devices that might be coming to Wikipedia with a fresh session without cookies at all. Thus, we also count as uniques requests with nocookie=1 whose signature appears only once in a day or month. The signature is calculated with a hash of (ip, user_agent, accept_language) per project. The reasoning is that a real user who did not reset their browser session during the day can make only one request without cookies: the first one. Subsequent requests will send the WMF-Last-Access cookie.
This methodology has two caveats:
- It underreports fresh sessions on mobile, due to NAT-ing of IP addresses that are shared among many users.
- It will overreport a device whose IP changes frequently and/or whose cookies are deleted frequently, as it will appear as two different fresh sessions to our logic. This is less prevalent than the mobile underreporting described above.
We add this offset to the numbers that result from looking at WMF-Last-Access cookie.
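The offset computation can be illustrated with a short Python sketch. The field names and the choice of SHA-256 are assumptions made for this example; the text above only says that the signature is a hash of (ip, user_agent, accept_language) per project.

```python
import hashlib
from collections import Counter


def signature(req):
    """Hash of (ip, user_agent, accept_language); SHA-256 is an assumption."""
    raw = "\x1f".join([req["ip"], req["user_agent"], req["accept_language"]])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def uniques_offset(requests):
    """Count nocookie=1 signatures seen exactly once per project in the period."""
    counts = Counter(
        (r["project"], signature(r)) for r in requests if r["nocookie"] == 1
    )
    return sum(1 for n in counts.values() if n == 1)


def total_uniques(uniques_underestimate, offset):
    """Final estimate: cookie-based count plus the offset."""
    return uniques_underestimate + offset


base = {"ip": "192.0.2.1", "user_agent": "UA-1", "accept_language": "en"}
requests = [
    # Same signature twice without cookies: likely a bot, not counted.
    {**base, "project": "en.wikipedia", "nocookie": 1},
    {**base, "project": "en.wikipedia", "nocookie": 1},
    # Seen once without cookies: counted as a fresh session.
    {**base, "ip": "192.0.2.2", "project": "en.wikipedia", "nocookie": 1},
    # Same signature, but on another project: counted there as well.
    {**base, "ip": "192.0.2.2", "project": "en.wiktionary", "nocookie": 1},
    # Requests that carry cookies never contribute to the offset.
    {**base, "ip": "192.0.2.3", "project": "en.wikipedia", "nocookie": 0},
]

offset = uniques_offset(requests)
```

The requests passed in are assumed to be already restricted to the day or month being counted.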
How big of a percentage does the offset represent from the total?
For projects with more than 100,000 uniques, the offset represents between 5% and 60% of the total if counted daily; variability is high depending on the project. The offset represents a smaller percentage on mobile domains, as we know numbers are underreported for fresh sessions on mobile. The offset also represents a higher percentage of monthly numbers, as fresh sessions are likely to be more numerous as we expand the time period for counting them.
One other thing to keep in mind is that a high offset doesn't mean bad quality data. In fact, for project families with a mostly fact-checking usage pattern, having (many) more offsets than cookie-based unique devices is expected. Wiktionary is a good example: every now and then you go there to check a spelling or whether a word exists, but following inner links is not as usual on Wiktionaries as it is on Wikipedias.
The redirect issue on unique device counts for project families
When a mobile device with a fresh session (no cookies) visits the desktop version of one of our projects, for example www.wikidata.org, it gets redirected with a 302 to the mobile version of the website, here m.wikidata.org. In this transaction two cookies are set:
- 1. Cookie on the global domain (*.wikidata.org) when the server responds with the 302 redirect
- 2. Cookie set on m.wikidata.org on the regular 200 response from the Wikidata mobile site.
Our per-domain computation filters 302 requests (as those are not pageviews). That works well in the per-domain case, as the cookie is set on the 200 response. But it doesn't work for the global domain uniques calculation, as the cookie is being set "earlier".
In our example, if we filter out redirects for the project-family computation (counting devices on *.wikidata.org) and we only count the 200 responses (pageviews), we would be missing fresh sessions that exhibit the behaviour described above. In order to solve that issue, in June 2017 we updated the filter for project-family unique devices computation to accept redirects that lead to a pageview (phab:T167005).
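The difference between the two filters can be sketched as below. This is a simplified Python illustration, not the actual filter from phab:T167005: the fields `http_status`, `is_pageview` and `redirects_to_pageview` are hypothetical, and how a redirect is recognised as "leading to a pageview" is not described in this page.

```python
# Redirect status codes mentioned in this page.
REDIRECT_CODES = {302, 307}


def counts_for_domain(req):
    """Per-domain uniques: only pageviews are considered, redirects are dropped."""
    return req["is_pageview"]


def counts_for_project_family(req):
    """Per-project-family uniques: also accept redirects that lead to a pageview."""
    is_redirect = req["http_status"] in REDIRECT_CODES
    return req["is_pageview"] or (is_redirect and req["redirects_to_pageview"])


# The wikidata.org example: a 302 from www to m, then the 200 on m.wikidata.org.
redirect = {"http_status": 302, "is_pageview": False, "redirects_to_pageview": True}
pageview = {"http_status": 200, "is_pageview": True, "redirects_to_pageview": False}
```

With these definitions the redirect is ignored by the per-domain filter but kept by the project-family filter, so the cookie-less request that received the global-domain cookie is no longer lost.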
Data Quality Analysis
Unique devices per-domain
That is, unique devices for en.m.wikimedia.org (mobile site) or en.wikimedia.org (desktop site).
We recommend that if you are using the unique devices number per domain, you only consider domains having at least 1000 unique devices daily. Domains with fewer than 1000 daily unique devices show too much random variation for the data to be actionable; in other words, the data is too noisy.
- Depending on the project family you are looking at (Wikipedia or Wiktionary for instance), it is interesting to keep an eye on the uniques_offset. For projects where you can follow links (like Wikipedia), uniques_offset is less important, but for projects mostly used for fact checking (Wiktionary for instance), uniques_offset represents a much larger portion of the total uniques.
- We also recommend being very aware of the variability issues when looking at uniques per country (only for WMF employees, or people under NDA), as the variability per host per country is higher than the variability per host.
How did we determine this 1000 number for the per-domain uniques
We ran an analysis on our daily dataset trying to measure randomness, taking into account the weekly rhythm of our traffic. We took 9 weeks of data, among which one week is taken as reference to compute variation of the eight other weeks day per day per domain.
Example: On Dec 23, a Friday, we compute for es.m.wikipedia how much the number of unique devices we have (say, 100) differs from the number of unique devices on the Friday of the reference week (say, 110). If this difference is 10, our variation is 10% or 0.1. Since we have 8 weeks of data plus one reference week, we have a series of 8 points per day of the week. Given this series, we compute the standard deviation of all variations over the 8-week period per domain. Small standard deviations are good, as they mean that variation is acceptable; standard deviations >1 mean that data varied more than 100% from the reference week. We consider that to be too large and a sign that the data is too "noisy".
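As a rough illustration of this calculation, here is a small Python sketch. Several details are assumptions, since the text does not spell them out: the variation is taken as the absolute difference divided by the observed value (which reproduces the 0.1 of the example), the population standard deviation is used, and the function names are made up for this example.

```python
from statistics import pstdev


def variation(observed, reference):
    """Relative difference between a day and the same weekday of the reference week.

    Dividing by the observed value is an assumption that matches the
    example above (100 vs 110 gives 0.1).
    """
    return abs(observed - reference) / observed


def noise_score(observed_series, reference):
    """Standard deviation of the variations (population std dev, an assumption)."""
    return pstdev(variation(o, reference) for o in observed_series)


def is_too_noisy(observed_series, reference):
    """Flag a series whose variations have a standard deviation above 1."""
    return noise_score(observed_series, reference) > 1


# Eight Fridays for one domain, compared with the Friday of the reference week.
stable = [100, 105, 95, 110, 90, 100, 120, 80]
erratic = [1, 50, 2, 40, 1, 60, 3, 30]

stable_is_noisy = is_too_noisy(stable, reference=110)
erratic_is_noisy = is_too_noisy(erratic, reference=10)
```

The numbers in `stable` and `erratic` are invented solely to show one series that stays below the threshold and one that exceeds it.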
We also computed, for each project, the median value of unique devices over the 9 weeks of data.
This data is plotted below, using log scales.
In this chart the red line represents the mean of the unique_devices estimate. Projects are plotted from higher to lower: the left side of the chart has projects like es.wikipedia.org, which have millions of users and a variation of less than 0.1 (10%). The right side plots projects with a small number of uniques. The blue line represents the variation, and it can easily be seen that it is a lot higher for those projects, sometimes as high as 100% (a standard deviation of 1 in this case).
Unique devices per project-family
Daily per-project-family unique devices (on *.wikimedia.org for instance) display less variation: in the plot below we can see that standard deviations are a lot lower. In the case of the wikipedia.org domain the variability is about 2%. Note that these calculations aggregate results for a project family (example: *.wikipedia.org). If you are splitting results further (per country, for example) you should take into account that variability will be higher.
Variation calculations are done with data from March and April 2017, week of reference being from March 1st to March 7th.
Remember that the standard deviation is computed on a variation, meaning a value of 0.02 is actually a variation of 2% of the total number of uniques.
|Host||Standard deviation||Median||Total uniques estimate|