Data Platform/Data Lake/Traffic/Pageview hourly/K Anonymity Threshold Analysis
In a plan to secure our user data, the pageview_hourly dataset needs to be sanitized in such a way that it does not allow to track user path.
See this page for a broad view on the pageview_hourly sanitization project, and this one for a detailed description of the algorithm proposal.
This page go through the various analysis we have been doing in defining K, the minimum number of distinct IPs a finger-printing group must have not to go through anonymization (see an introduction on k-anonymity ).
Dataset specification
|
CREATE TABLE pv_san.idents_buckets STORED AS PARQUET AS
WITH wrong_ips AS (
SELECT
client_ip AS w_ip,
COUNT(DISTINCT user_agent) AS user_agent_count,
COUNT(1) AS request_count
FROM wmf.webrequest
WHERE webrequest_source IN ('mobile', 'text')
AND year=${YEAR} AND month=${MONTH} AND day=${DAY} AND hour=${HOUR}
AND is_pageview
GROUP BY client_ip
HAVING (COUNT(DISTINCT user_agent) >= ${IP_UA_THRESHOLD}
OR COUNT(1) >= ${IP_REQ_THRESHOLD} )
)
SELECT
pageview_info['project'] as project,
pageview_info['language_variant'] as language,
agent_type,
access_method,
agent_type,
x_analytics_map['zero'] AS zero_carrier,
geocoded_data['continent'] AS continent,
geocoded_data['country'] AS country,
geocoded_data['subdivision'] AS subdivision,
geocoded_data['city'] AS city,
user_agent_map,
COUNT(DISTINCT client_ip) AS ips_count,
COLLECT_SET(client_ip) AS unique_ips,
COUNT(DISTINCT pageview_info['page_title']) AS pages_count,
COLLECT_SET(pageview_info['page_title']) AS unique_pages,
COUNT(1) AS request_count
FROM wmf.webrequest webrequest
LEFT OUTER JOIN wrong_ips ON (webrequest.client_ip = wrong_ips.w_ip)
WHERE wrong_ips.w_ip IS NULL and
AND webrequest_source IN ('mobile', 'text')
AND year=${YEAR} and month=${MONTH} and day=${DAY} and hour=${HOUR}
AND geocoded_data['continent'] not in ('Unknown', '--', '-')
AND geocoded_data['country'] not in ('Unknown', '--', '-')
AND geocoded_data['subdivision'] not in ('Unknown', '--', '-')
AND geocoded_data['city'] not in ('Unknown', '--', '-')
AND is_pageview
AND agent_type = 'user'
GROUP BY
pageview_info['project'],
pageview_info['language_variant'],
access_method,
agent_type,
x_analytics_map['zero'],
geocoded_data['continent'],
geocoded_data['country'],
geocoded_data['subdivision'],
geocoded_data['city'],
user_agent_map;
|
General shape of data
- bucket_count
- The number of unique sets of bucketed values
- request_count
- The total number of requests
|
SELECT
COUNT(1) AS bucket_count,
SUM(request_count) AS request_count
FROM pv_san.idents_buckets
LIMIT 100000;
|
It looks like we see ~10 requests per unique bucket on average, but the distribution is highly skewed as can be seen below.
Finger-printing groups, distinct ips, distinct pages and requests
We only dig into finger-printing groups that have 10 or less distinct IPs, or 10 or less distinct pages since those are where the anonymization problem exists.
Tables below show show two results:
- Number of request per IP is stable at 2.4 and number of request per page is stable at 1.2, making number of pages per IP stable at 2.
- Both for distinct ips and distinct pages, number of buckets and number of requests follow power laws (more or less steep).
Distinct IPs
|
SELECT
ips_count,
COUNT(1) AS bucket_count,
SUM(request_count) AS request_count
FROM pv_san.idents_buckets
WHERE ips_count <= 10
GROUP BY ips_count
ORDER BY ips_count
LIMIT 10;
|
Distinct pages
|
SELECT
pages_count,
COUNT(1) AS bucket_count,
SUM(request_count) AS request_count
FROM pv_san.idents_buckets
WHERE pages_count <= 10
GROUP BY pages_count
ORDER BY pages_count
LIMIT 10;
|
Example of pages viewed together
By number of distinct IPs
2 distinct IPs
|
SELECT
unique_pages
FROM pv_san.idents_buckets
WHERE ips_count = 2
AND project = 'en.wikipedia'
LIMIT 10;
|
3 distinct IPs
|
SELECT
unique_pages
FROM pv_san.idents_buckets
WHERE ips_count = 3
LIMIT 100;
|
By number of distinct pages (with more than only 1 IP)
Showing 20 random samples having X distinct pages and more than one IP, we can see that even when only 2 distinct pages are seen, they are from different topics, preventing having a (single) section reconstructed.
2 distinct pages
|
SELECT
unique_pages
FROM pv_san.idents_buckets
WHERE pages_count = 2
AND ips_count > 1
AND project = 'en.wikipedia'
LIMIT 100;
|
3 distinct pages
4 distinct pages
5 distinct pages - Detailed analysis
|
SELECT
unique_pages
FROM pv_san.idents_buckets
WHERE pages_count = 5
AND ips_count > 1
AND project = 'en.wikipedia'
LIMIT 100;
|