<div>This article describes the work done to validate the unique-device numbers generated using the last-access cookie. Throughout this article, users = devices.<br />
<br />
=== '''Why validate?''' ===<br />
The WMF-Last-Access cookie reaches our refined webrequest logs through the X-Analytics request header. Its value is either null or a date, like 30-Jun-2015. The number of uniques for a day can be estimated by counting all requests whose cookie is null or set to a date earlier than the day being counted. Given that the cookie expires after 31 days, the nulls come from users who have never visited the site before or whose cookie has expired, while the older dates come from returning users making their first visit of the day. In both cases we respond to the client by setting the cookie to the current date, so subsequent requests from that user for the rest of the day can be ignored. <br />
<br />
However, a bot that doesn't identify itself as a spider, or a user who refuses cookies, will always send a null value, so each of their requests gets counted as a unique. We attempted to estimate approximately what percentage our uniques count was inflated by, so that we could offset our numbers accordingly.<br />
<br />
=== '''Terms''' ===<br />
'''Overcount:''' the uniques estimate produced by the last-access-cookie method. It is bloated by the high number of requests with a null cookie value coming from bots and from users who don't accept cookies, hence the name. Overcount = uniques estimated through the last-access-cookie method.<br />
<br />
'''Repeats:''' to get a lower bound on uniques, we also compute a count that ignores the nulls and counts only requests with an earlier date set. This is the number of users who have visited the site repeatedly within an arbitrary 31-day window, hence the name repeats. Although this number has no real significance on its own, it establishes the floor below which our true uniques cannot be.<br />
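To make the two counts concrete, here is a minimal Python sketch (illustrative only, not the production Hive pipeline) that applies the rule described above to a day's worth of cookie values:<br />

```python
from datetime import date

# Illustrative sketch: classify one day's requests into "overcount" and
# "repeats" using the last-access-cookie rule. Each request is modelled
# as just the cookie's last-access date, or None when no cookie was sent.

def count_overcount_and_repeats(cookie_dates, day):
    """cookie_dates: iterable of date or None; day: the date being counted."""
    overcount = 0  # cookie is null OR set to a date earlier than `day`
    repeats = 0    # cookie is present AND earlier than `day`
    for last_access in cookie_dates:
        if last_access is None or last_access < day:
            overcount += 1
            if last_access is not None:
                repeats += 1
    return overcount, repeats

requests = [None, date(2015, 6, 9), date(2015, 6, 10), None, date(2015, 5, 20)]
print(count_overcount_and_repeats(requests, date(2015, 6, 10)))  # (4, 2)
```

The SQL queries below implement the same predicate at scale over the webrequest table.<br />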
<br />
=== '''Other Caveats''' ===<br />
The cookie is deployed per domain, so numbers cannot be added up across domains. For example, to estimate uniques for enwiki as a whole (desktop and mobile), the numbers for en.wikipedia.org and en.m.wikipedia.org cannot simply be summed. Consequently, most of the numbers in this report are calculated for a single domain, most commonly en.wikipedia.org.<br />
<br />
=== '''SQL to count uniques''' ===<br />
These are the basic SQL queries to count daily and monthly overcounts and repeats. Users with access to stat1002 can run them using Hive or Hue.<br />
<br />
==== '''Daily overcounts and repeats grouped by uri_host, for June 10 2015''' ====<br />
<syntaxhighlight lang="sql" line="1"><br />
USE wmf;<br />
<br />
SELECT uri_host,<br />
SUM(IF ((x_analytics_map['WMF-Last-Access'] IS NULL<br />
OR (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <<br />
unix_timestamp(to_date(dt), 'yyyy-MM-dd'))),<br />
1, 0)) AS overcount,<br />
SUM(IF ((x_analytics_map['WMF-Last-Access'] IS NOT NULL<br />
AND (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') <<br />
unix_timestamp(to_date(dt), 'yyyy-MM-dd'))),<br />
1, 0)) AS repeats<br />
FROM webrequest<br />
WHERE x_analytics_map IS NOT NULL<br />
AND agent_type = 'user'<br />
AND is_pageview = TRUE<br />
AND webrequest_source IN ('mobile', 'text')<br />
AND YEAR = 2015<br />
AND MONTH = 6<br />
AND DAY = 10<br />
GROUP BY uri_host<br />
ORDER BY overcount DESC LIMIT 5000;<br />
</syntaxhighlight><br />
<br />
==== '''Monthly overcounts and repeats grouped by uri_host, for June 2015''' ====<br />
<syntaxhighlight lang="sql" line="1"><br />
USE wmf;<br />
SELECT uri_host,<br />
SUM(IF ((x_analytics_map['WMF-Last-Access'] IS NULL<br />
OR (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') < <br />
unix_timestamp(to_date(dt), 'yyyy-MM'))), 1, 0)) AS overcount,<br />
SUM(IF ((x_analytics_map['WMF-Last-Access'] IS NOT NULL<br />
AND (unix_timestamp(x_analytics_map['WMF-Last-Access'], 'dd-MMM-yyyy') < <br />
unix_timestamp(to_date(dt), 'yyyy-MM'))), 1, 0)) AS repeats<br />
FROM webrequest<br />
WHERE x_analytics_map IS NOT NULL<br />
AND agent_type = 'user'<br />
AND is_pageview = TRUE<br />
AND webrequest_source IN ('mobile', 'text')<br />
AND YEAR = 2015<br />
AND MONTH = 6<br />
GROUP BY uri_host<br />
ORDER BY overcount DESC LIMIT 5000;<br />
</syntaxhighlight><br />
<br />
==== '''Sample numbers for en.wikipedia.org from May - September 2015''' ====<br />
{|class="wikitable"<br />
| '''month''' || '''day''' || '''Overcount''' || '''Repeats'''<br />
|-<br />
| 5 || 1 || 92605826 || 13771272<br />
|-<br />
| 5 || 2 || 78458424 || 10768860<br />
|-<br />
| 5 || 3 || 79083954 || 12577781<br />
|-<br />
| 5 || 31 || 46871121 || 16132685<br />
|-<br />
| 6 || 1 || 58419722 || 22250586<br />
|-<br />
| 6 || 2 || 60721513 || 22543029<br />
|-<br />
| 6 || 30 || 45369844 || 18148775<br />
|-<br />
| 7 || 1 || 44480024 || 17798867<br />
|-<br />
| 7 || 2 || 46065819 || 17589004<br />
|-<br />
| 7 || 30 || 48625679 || 17770151<br />
|-<br />
| 7 || 31 || 48195861 || 17082829<br />
|-<br />
| 8 || 1 || 38402477 || 13236340<br />
|-<br />
| 8 || 2 || 36832098 || 14130095<br />
|-<br />
| 8 || 30 || 37230296 || 13854064<br />
|-<br />
| 8 || 31 || 46552884 || 19112208<br />
|-<br />
| 9 || 1 || 48486743 || 19500402<br />
|-<br />
| 9 || 9 || 61030474 || 19925116<br />
|-<br />
| 9 || 10 || 53026323 || 19815455<br />
|}<br />
<br />
==== '''Overcounts and repeat trend from May - September 2015 (en.wikipedia.org)''' ====<br />
<br />
[[File:Uniques_trend_May_to_September.png|800px]]<br />
<br />
This graph plots the uniques from May 1 to September 10, 2015. The huge spikes in August show how bot activity can have an unpredictable impact on these numbers.<br />
<br />
==== Monthly numbers from May-August 2015 for en.wikipedia.org ====<br />
{|class="wikitable"<br />
| '''month''' || '''overcount''' || '''repeats'''<br />
|-<br />
| 5 || 2,866,072,382 || 110,839,956<br />
|-<br />
| 6 || 973,394,414 || 95,424,659<br />
|-<br />
| 7 || 969,566,054 || 85,946,392<br />
|-<br />
| 8 || 927,450,746 || 93,108,562<br />
|}<br />
<br />
=== Validation approaches ===<br />
We took a few approaches to see if we could find a percentage range that our overcount numbers are off by.<br />
<br />
==== Fingerprinting for a short time ====<br />
The first idea was to count uniques under the assumption that the combination of an IP address and a user agent identifies a user. For a short range of dates, we calculated overcount, fingerprint, and repeat numbers for en.wikipedia.org to see what we could learn.<br />
<br />
[[File:Uniques_with_fingerprint.png|800px]]<br />
<br />
Since the cookie was only deployed on April 28th, the high overcount numbers on the first few days can be attributed to the many users who arrived without the cookie. The numbers stabilize over the last few days, and all three lines follow the same trend. However, the overcount numbers are ~2.3 times the fingerprint numbers. Fingerprinting itself has multiple drawbacks; for daily numbers, we initially assumed fingerprint counts would be ~80% accurate. With the overcount 2.3 times the fingerprint count, it is inconclusive which value the true unique count is closer to. Further, this approach cannot be extended to mobile, because many users are assigned the same public IP, nor can it be extended to calculate monthly uniques. If our overcount and fingerprint numbers had been close (off by 20% or so), we might have observed behavior over a longer period and come to some conclusion, but the numbers we got didn't help us much.<br />
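For reference, the fingerprint count can be sketched as below. This is illustrative only (the data and function name are hypothetical, not the actual analysis code), and a real run would also have to contend with user-agent churn and shared IPs:<br />

```python
# Illustrative sketch: estimate uniques by fingerprinting, where a "user"
# is approximated by an (IP, user-agent) pair, as described above.
def count_fingerprint_uniques(requests):
    """requests: iterable of (ip, user_agent) tuples for one day."""
    return len({(ip, ua) for ip, ua in requests})

# Hypothetical requests for one day:
day_requests = [
    ("10.0.0.1", "Mozilla/5.0"),
    ("10.0.0.1", "Mozilla/5.0"),   # same device, second pageview
    ("10.0.0.1", "Opera/9.80"),    # same IP, different browser
    ("10.0.0.2", "Mozilla/5.0"),
]
print(count_fingerprint_uniques(day_requests))  # 3
```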
<br />
==== Compare numbers from mobile UUID based uniques ====<br />
We have existing uniques numbers for the mobile apps (Android and iOS), based on the wmfuuid value set in [[X-Analytics]]. We checked whether our last-access-based uniques for the mobile apps were similar. The drawback of this approach is that apps are not ordinary HTTP clients and do not have to handle cookies the same way browsers do. On Android, however, the numbers were quite close, off by only 5-6%; iOS was inconsistent and inconclusive. The Android trend tells us that our counting mechanism is valid: in a bot-free universe, it would give us true uniques.<br />
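The comparison itself is just a relative difference between the two counting methods; a quick sketch with hypothetical numbers (the 5-6% figure above is the measured Android result, not these made-up inputs):<br />

```python
# Illustrative sketch: relative difference between last-access-based
# uniques and the existing wmfuuid-based uniques for the same app/day.
def percent_off(last_access_uniques, uuid_uniques):
    return abs(last_access_uniques - uuid_uniques) / uuid_uniques * 100

# Hypothetical: uuid-based counting reports 1,000,000 Android uniques,
# the last-access method reports 945,000.
print(round(percent_off(945_000, 1_000_000), 1))  # 5.5
```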
<br />
==== Bot filtering ====<br />
<br />
===== Regex based user-agent bot filtering =====<br />
The first approach was to filter out bots by regex-matching the user-agent string against commonly known bot signatures. This is the regular expression we used to filter:<syntaxhighlight lang="text"><br />
".*([Bb]ot|[Ss]pider|WordPress|AppEngine|AppleDictionaryService|Python-urllib|python-requests|Google-HTTP-Java-Client|[Ff]acebook|[Yy]ahoo|RockPeaks).*|Java/1\\..*|curl.*|PHP/.*|-|"<br />
</syntaxhighlight>This was done on en.wikipedia.org for July 2015, and here's a snapshot of the results:<br />
{|class="wikitable"<br />
| day || overcount || repeats || overcount_excluding_bots || repeat_excluding_bots<br />
|-<br />
| 7/1/2015 || 44480024 || 17798867 || 43176159 || 17796810<br />
|-<br />
| 7/2/2015 || 46065819 || 17589004 || 44641079 || 17586246<br />
|-<br />
| 7/3/2015 || 39400385 || 14773861 || 38011278 || 14771951<br />
|}<br />
The regex based filtering reduced the overcounted uniques by an average of 4.5% over all days in July. This is good, but not sufficient to narrow down our estimates.<br />
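The same filter can be sketched in Python. This is illustrative only: in Hive the pattern sits inside a string literal, so <code>Java/1\\.</code> there corresponds to <code>Java/1\.</code> here, and whole-string matching is assumed (which the <code>.*</code> wrappers and the bare <code>-</code> alternative suggest):<br />

```python
import re

# Illustrative sketch: the bot-filtering regex from the article, applied
# to user-agent strings in Python (whole-string match assumed).
BOT_RE = re.compile(
    r".*([Bb]ot|[Ss]pider|WordPress|AppEngine|AppleDictionaryService"
    r"|Python-urllib|python-requests|Google-HTTP-Java-Client"
    r"|[Ff]acebook|[Yy]ahoo|RockPeaks).*"
    r"|Java/1\..*|curl.*|PHP/.*|-|"
)

def is_bot(user_agent):
    # fullmatch: the pattern must cover the entire user-agent string
    return BOT_RE.fullmatch(user_agent) is not None

print(is_bot("Googlebot/2.1"))                  # True
print(is_bot("curl/7.43.0"))                    # True
print(is_bot("Mozilla/5.0 (Windows NT 6.1)"))   # False
```

Note the final empty alternative and the bare <code>-</code>: user agents that are empty or literally "-" are also treated as bots.<br />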
<br />
==== How many requests makes a bot ====<br />
Another short test: assuming an IP/user-agent combination identifies a user, if a user issued a large number of requests, say n, all with the last-access cookie set to NULL, those requests were perhaps coming from a bot. To estimate n, we plotted the distribution of request counts to the enwiki desktop site and took the second-highest peak of the distribution, 450, as the cutoff. This test is harder to run, as it involves building a blacklist of all users with over 450 requests and no cookie value set, and excluding them while counting uniques. We tried this for three days, and the results looked like:<br />
{|class="wikitable"<br />
| day || overcount_with_bots || repeat_with_bots || overcount_excluding_bots2 || repeat_excluding_bots2 || % (overcount excluding bots / overcount with bots)<br />
|-<br />
| 7/1/2015 || 44480024 || 17798867 || 27942921 || 16380726 || 62.8212813<br />
|-<br />
| 7/2/2015 || 46065819 || 17589004 || 26410269 || 15777171 || 57.33159547<br />
|-<br />
| 7/3/2015 || 39400385 || 14773861 || 23167075 || 13789624 || 58.79910818<br />
|}<br />
This difference is a lot more significant and, on the surface, promising. However, this assumption cannot be extended across wikis, especially smaller ones. It also cannot be extended to mobile devices, because 3G IP assignment means many users share the same IP, nor can it be used to estimate monthly uniques.<br />
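The blacklist construction described above can be sketched as follows. This is illustrative only (hypothetical data and function name, not the actual pipeline); the 450 cutoff comes from the distribution analysis described earlier:<br />

```python
from collections import Counter

# Illustrative sketch: the request-threshold heuristic. A "user" is an
# (ip, user_agent) pair; a user who sends more than `threshold` requests
# in a day, all with a NULL last-access cookie, is treated as a bot and
# excluded from the uniques count.

def build_bot_blacklist(requests, threshold=450):
    """requests: iterable of (ip, user_agent, last_access_or_None)."""
    totals = Counter()
    has_cookie = set()
    for ip, ua, last_access in requests:
        totals[(ip, ua)] += 1
        if last_access is not None:
            has_cookie.add((ip, ua))
    return {user for user, n in totals.items()
            if n > threshold and user not in has_cookie}

# Hypothetical data: one chatty cookieless client and one normal one.
reqs = [("10.0.0.1", "x", None)] * 500 + [("10.0.0.2", "y", None)] * 3
print(build_bot_blacklist(reqs))  # {('10.0.0.1', 'x')}
```

Uniques would then be counted as usual, skipping any request whose (ip, user_agent) pair appears in the blacklist.<br />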
<br />
=== Conclusion ===<br />
At the moment, we've concluded that the variability in the number of bot requests is far too high to determine a percentage range by which the last-access-cookie-based uniques can be offset. We would have to pursue serious statistics-based bot detection on our request stream, or perhaps an alternate solution such as setting the cookie via JavaScript. Otherwise, it does not seem possible to estimate uniques using this method.</div>