Analytics/Data/PreventingIdentityReconstruction
= Pageview_hourly dataset =
This page refers to data in our pageview_hourly table on Hive. This data is private at this time; once we feel confident about its anonymization, we will make it public.
 
== Goal ==
Identify possible identity reconstruction using only the pageview_hourly dataset (long lived; we keep it forever), assuming that you have access to the cluster and can use some of the webrequest data (short lived; kept only 60 days). Explore whether it is possible to reconstruct an identity with pageview_hourly data alone. This exercise helps us reduce the bits of information in the datasets we release, so as to protect the privacy of our users.
 
== Results ==
Using '''city''', '''country''', '''user_agent_map''' and '''method of access''' it is possible to reconstruct an identity and trace it through our pageview_hourly dataset. It is important to point out that the user_agent_map is not the raw user agent but rather a processed version of it. While having access to the webrequest table makes this reconstruction easier, it can be done just as well using only data in the pageview_hourly table.
 
It is easier to narrow down an identity if the user_agent_map is a rare one; we have successfully done so for BlackBerry user agents and for iOS user agents that report the version of the Wikipedia mobile app installed on the device. This does not necessarily pinpoint a user, but rather a device plus a geographical location. In some instances the identity could be a library computer; in others it is a user's mobile device. '''The parts of our dataset that carry the most information are the user_agent_map and, within the user_agent_map, the "wmf_app_version" field.'''
 
== Next Steps ==
To anonymize the dataset further we can:
 
1. Remove user_agent_map entirely
 
2. Remove user agents that appear less often than a certain percentage of the time

We should also explore whether, without the user_agent_map and using only geographical location, we can still reconstruct an identity.
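Option 2 above amounts to a simple frequency filter. A minimal Python sketch on toy rows follows; the 0.5% threshold and the row layout are assumptions for illustration, not values chosen in this analysis:

<source lang="python">
from collections import Counter

def filter_rare_user_agents(rows, min_share=0.005):
    """Drop rows whose user agent appears in less than min_share of all rows.

    rows: list of dicts with a 'user_agent' key.
    min_share: assumed threshold (0.5%); a real cutoff would be tuned.
    """
    counts = Counter(r["user_agent"] for r in rows)
    total = len(rows)
    return [r for r in rows if counts[r["user_agent"]] / total >= min_share]

# 999 requests from a common browser, 1 from a rare device
rows = [{"user_agent": "common-browser"}] * 999 + [{"user_agent": "rare-device"}]
kept = filter_rare_user_agents(rows)
# the rare device (1/1000 = 0.1% < 0.5%) is dropped; the 999 common rows remain
</source>

Rare user agents are exactly the ones that make a signature unique, so dropping them removes the highest-information rows first.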
 
== Methodology ==
 
=== Plain English ===
 
1. We look in the webrequest table, for one hour of data, for cities from which we see fewer than 10 distinct IPs during that hour.
 
2. For cities in which, in that hour, we only have one IP, we calculate a hash from pageview_hourly data using:
* user_agent_map
* city
* country
* access_method
 
3. We look in the pageview_hourly table, over a time period a couple of months back, to see whether any of the hashes we computed are present.
 
4. We inspect by hand the browsing history of some of the hashes and try to establish whether the same identity is behind both sessions.
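The four steps above can be sketched end to end in Python. This is a toy in-memory version of the Hive queries shown later on this page: the rows are invented, and Python's built-in hash stands in for Hive's hash() function:

<source lang="python">
from collections import defaultdict

# Toy webrequest rows for one hour: (client_ip, city, country)
webrequest_hour = [
    ("10.0.0.1", "SomeTown", "Italy"),
    ("10.0.0.2", "BigCity", "Italy"),
    ("10.0.0.3", "BigCity", "Italy"),
]

# Toy pageview_hourly rows: (user_agent, city, country, access_method, page)
pageview_hour = [
    ("iOS app 5.0.1", "SomeTown", "Italy", "mobile app", "Pagina_principale"),
    ("Chrome 44", "BigCity", "Italy", "desktop", "Roma"),
]
pageview_month = [
    ("iOS app 5.0.1", "SomeTown", "Italy", "mobile app", "Forum_(programma_televisivo)"),
    ("Chrome 44", "BigCity", "Italy", "desktop", "Milano"),
]

# Step 1: cities with very few (here: exactly one) distinct IPs in the hour
ips_per_city = defaultdict(set)
for ip, city, country in webrequest_hour:
    ips_per_city[(city, country)].add(ip)
single_ip_cities = {k for k, ips in ips_per_city.items() if len(ips) == 1}

# Step 2: signature hash over user agent, location and access method
def signature(ua, city, country, access_method):
    # stand-in for Hive's hash(); any deterministic hash works for the sketch
    return hash((ua, city, country, access_method))

hour_hashes = {
    signature(ua, city, country, am)
    for ua, city, country, am, _page in pageview_hour
    if (city, country) in single_ip_cities
}

# Step 3: look for the same signatures in an earlier period
matches = [
    (signature(ua, city, country, am), page)
    for ua, city, country, am, page in pageview_month
    if signature(ua, city, country, am) in hour_hashes
]
# Step 4 (inspecting the matched histories by hand) is left to the analyst
</source>

Only the SomeTown signature matches, because BigCity had two IPs in the hour and was excluded in step 1.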
 
=== Caveats ===
 
Having one IP for a given city within an hour can indicate a very small number of users accessing the site, but it can also, as we have seen, indicate that the whole town uses an internet service that makes all users appear under the same IP. It is also worth mentioning that mobile carriers in many instances use a very small range of IPs for their whole user base.
 
Even in big towns we can find very small numbers of IPs if only a small set of people there have internet access.
 
=== Examples ===
''Hourly Dataset:'' 1 hour of traffic on September 1st, available in webrequest table.
 
''Monthly Dataset:'' 1 month of traffic (July) for hashes that are present on hourly dataset.
 
==== Example 1, same set of pages visited. iOS App ====
Data from the '''hourly dataset'''; note that the hash computed includes the iOS application version:
 
{| class="wikitable"
|-
! Hash
! Page
! City
! Country
! user_agent_map
! access_method
|-
| 1309608194
| Pagina_principale
| SomeTown
| Italy
| {"browser_major":"-","os_family":"iOS","os_major":"8","device_family":"Other","browser_family":"Mobile Safari","os_minor":"3","wmf_app_version":"some app version"}
| mobile app
|-
|1309608194
|Forum_(programma_televisivo)
|SomeTown
|Italy
|{"browser_major":"-","os_family":"iOS","os_major":"8","device_family":"Other","browser_family":"Mobile Safari","os_minor":"3","wmf_app_version":"some app version"}
|mobile app
|}
 
Data from the '''monthly dataset'''; note the pages viewed are the same as in the hourly dataset (which means we could follow this hash and possibly find additional pages viewed by this identity):
{| class="wikitable"
|-
! Hash
! Page
! City
! Country
! date of access
|-
|1309608194
|Forum_(programma_televisivo)
|SomeTown
|Italy
|2015-7-30
|-
|1309608194
|Pagina_principale
|SomeTown
|Italy
|2015-7-30
|}
 
==== Example 2, similar set of pages. Blackberry device ====
 
Data from the '''hourly dataset'''; note that the hash computed includes a fairly uncommon BlackBerry device. The pages searched appear academic in nature.
{| class="wikitable"
|-
! Hash
! Page
! City
! Country
! user_agent_map
! access_method
|-
| 989888942
|Feeble-minded
|SomeBigCity
| Nigeria
|{"browser_major":"7","os_family":"BlackBerry OS","os_major":"7","device_family":"BlackBerry 9810","browser_family":"BlackBerry WebKit","os_minor":"0","wmf_app_version":"-"}
|mobile web
|-
| 989888942
|The_School_of_Athens
|SomeBigCity
|Nigeria
|{"browser_major":"7","os_family":"BlackBerry OS","os_major":"7","device_family":"BlackBerry 9810","browser_family":"BlackBerry WebKit","os_minor":"0","wmf_app_version":"-"}
|mobile web
|}
 
Data from the '''monthly dataset''', with more academic topics:
{| class="wikitable"
|-
! Hash
! Page
! City
! Country
! date of access
|-
| 989888942
|Friction_(disambiguation)
|SomeBigCity
|Nigeria
|2015-7-10
|-
|989888942
|Dating
|SomeBigCity
|Nigeria
|2015-7-11
|-
|989888942
|Jerk_(physics)
|SomeBigCity
|Nigeria
|2015-7-16
|-
|989888942
|Dating
|SomeBigCity
|Nigeria
|2015-7-11
|-
|989888942
|Videos
|SomeBigCity
|Nigeria
|2015-7-11
|-
|989888942
|Quotient_rule
|SomeBigCity
|Nigeria
|2015-7-11
|-
|989888942
|List_of_Nigerian_films
|SomeBigCity
|Nigeria
|2015-7-9
|-
|989888942
|Youtube
|SomeBigCity
|Nigeria
|2015-7-9
|}
 
=== SQL ===
<source lang="sql">
-- geocoded data on webrequest is like:
-- {"city":"Unknown","country_code":"--","longitude":"-1","postal_code":"Unknown","timezone":"Unknown","subdivision":"Unknown","continent":"Unknown","latitude":"-1","country":" Unknown"}
 
    use wmf;
    -- find cities with less than 10 ips
    -- Keep only IPs with reasonable number of hits
 
    with cities as (
    select geocoded_data['country'] as country, geocoded_data['city'] as city,
            count(distinct client_ip) as cnt,
            collect_set(client_ip) as unique_ips_for_this_city
      from webrequest
      where webrequest_source in('mobile', 'text') and month=09 and day=01 and hour=01
        and geocoded_data['country'] not in ('Unknown', '--', '-')
        and geocoded_data['city'] not in ('Unknown', '--', '-')
      group by geocoded_data['country'], geocoded_data['city']
    having count(distinct client_ip) < 10
    ),
 
    -- use cities that only have 1 IP for the hour, compute a signature
    unique_signatures_plus_pageview as (
    select hash(ph.user_agent_map, ph.city, ph.country, ph.access_method) ua_geo_hash,
              ph.page_title, cities.city, cities.country,
              ph.user_agent_map, ph.access_method
        from cities
        join pageview_hourly ph
      on
      cities.city = ph.city
      and
      cities.country = ph.country
      where cnt = 1
      and  (ph.year=2015 and ph.month=09 and ph.day=01 and ph.hour=01)
      and ph.agent_type="user" and ph.user_agent_map['os_family'] not in ("Other", "-")),
 
    -- get all distinct signatures
    uniques as (
    select distinct ua_geo_hash, city, country,
              access_method, user_agent_map
          from unique_signatures_plus_pageview
    )
 
    -- go back one month
    select uniques.ua_geo_hash, ph2.page_title, ph2.city, ph2.country,
            concat_ws("-",cast(year as string), cast(month as string), cast(day as string))
      from uniques
      join pageview_hourly as ph2
      on
      uniques.city = ph2.city
      and
      uniques.country = ph2.country
      and
      uniques.user_agent_map = ph2.user_agent_map
      and
    uniques.access_method = ph2.access_method
    where (ph2.year=2015 and ph2.month=07 )
    and ph2.agent_type="user" and ph2.user_agent_map['os_family'] not in ("Other", "-")
</source>
 
 
== Checks And Balances ==
 
We also took our computed signatures that had one IP associated within a one-hour window (about 25,000)
and tried to establish whether they had few or many addresses associated with them over a longer time period, a day.
We found about 6,000 of our signatures present the following day, and of those about 3,000 still had only one IP associated
with the signature.
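A minimal Python sketch of this check, with invented signature hashes and IPs standing in for the Hive data:

<source lang="python">
from collections import defaultdict

# Signatures that had exactly one IP during the original hour (invented values)
hour_signatures = {111, 222, 333}

# Toy next-day webrequest rows: (signature_hash, client_ip)
next_day = [
    (111, "10.0.0.1"),   # 111 is still seen from a single IP
    (222, "10.0.0.5"),
    (222, "10.0.0.6"),   # 222 now has two IPs
]                        # 333 does not reappear at all

ips_per_sig = defaultdict(set)
for sig, ip in next_day:
    if sig in hour_signatures:
        ips_per_sig[sig].add(ip)

# Signatures still tied to a single IP the next day remain the easiest to trace
still_single = {sig for sig, ips in ips_per_sig.items() if len(ips) == 1}
</source>

In this toy run, one of the three original signatures reappears and is still bound to a single IP, mirroring the roughly 3,000-out-of-6,000 ratio observed above.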
 
<source lang="sql">
    -- now find how many IPs are associated with this signature
    -- in the following days in webrequest; we know it is at least 1
    -- ('uniques' refers to the CTE defined in the query above)
    select hash(w.user_agent_map, geocoded_data['city'], geocoded_data['country'], w.access_method) as ua_geo_hash_w, count(distinct client_ip) as cnt,
        collect_set(client_ip) as unique_ips_for_this_city
    from uniques as u
    join webrequest as w
    on
        u.ua_geo_hash = hash(w.user_agent_map, geocoded_data['city'], geocoded_data['country'], w.access_method)
    where
        webrequest_source in('mobile', 'text') and
        agent_type='user' and
        (month=09  and day=02 and year=2015)
        group by hash(w.user_agent_map, geocoded_data['city'], geocoded_data['country'], w.access_method)
</source>
 
== Other exploits ==
 
It would be easy to geo-locate editors of pages with very few pageviews, as an edit is normally preceded by a view.

Revision as of 21:00, 1 October 2015