Difference between revisions of "Analytics/Systems/Event Sanitization"

imported>Bearloga
(Moved & updated from Analytics/Systems/EventLogging/Data retention)
 
imported>Ottomata
 
(One intermediate revision by one other user not shown)
Line 1: Line 1:
This page describes the event sanitization processes used with [[Event Platform]] and [[Event Platform/EventLogging legacy|legacy EventLogging]] (EL) data for retaining certain event data beyond the 90 day retention period per WMF's [[wmf:Privacy_policy|Privacy Policy]] and [[meta:Data retention guidelines|Data Retention Guidelines]].
This page describes the event sanitization processes used with [[Event Platform]] data to retain certain event data in Hive beyond the 90-day retention period set by WMF's [[wmf:Privacy_policy|Privacy Policy]] and [[meta:Data retention guidelines|Data Retention Guidelines]].


== Event Platform ==
== Data retention ==
Coming soon! This work is tracked in [[phab:T273789|T273789]].
To learn more about data retention practices for events, see [[Analytics/Systems/Event Data retention|Event Data retention]].


== Legacy EventLogging ==
== Hive ==
To learn more about data retention practices in the legacy EL system, see [[Analytics/Systems/EventLogging/Data retention|this page]].
The Analytics [[Analytics/Data_Lake|Data Lake]] stores event streams as Hive tables, including those with a very high volume. It uses two databases: <code>event</code> and <code>event_sanitized</code>. The <code>event</code> database stores original (unsanitized) events, while the <code>event_sanitized</code> database stores sanitized events. Sanitization happens shortly after events are generated (with a lag of a couple of hours), so unsanitized and sanitized events co-exist in the two databases for 90 days. After that, unsanitized events are automatically deleted from the <code>event</code> database, and the only events that persist indefinitely are those in <code>event_sanitized</code>.


===Hive store===
===Hive event sanitization job===
The Analytics' Hive cluster stores all EventLogging schemas, including those with a very high volume. It uses 2 databases: <code>event</code> and <code>event_sanitized</code>. The <code>event</code> database stores original (unsanitized) events, while <code>event_sanitized</code> database stores sanitized events. Sanitization happens right after events are generated (with a couple hours lag). So, unsanitized and sanitized events co-exist in 2 different databases during 90 days. After that, however, unsanitized events older than 90 days are automatically deleted from the event database, and the only events that persist indefinitely are those in <code>event_sanitized</code>.
The sanitization job is a [[gerrit:plugins/gitiles/analytics/refinery/source/+/refs/heads/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/RefineSanitize.scala|job that lives in analytics/refinery/source]] and is run every hour by a cron job. It reads the new unsanitized events from the <code>event</code> database, sanitizes them using the allowlists, and copies them over to the <code>event_sanitized</code> database. Only tables that are present in the [https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/sanitization/ allowlists] are sanitized and copied over to the <code>event_sanitized</code> database.
===Hive sanitization job===
It's a [[gerrit:plugins/gitiles/analytics/refinery/source/+/refs/heads/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/EventLoggingSanitization.scala|job that lives in analytics/refinery/source]] and it's run every hour by a cron job. It reads the new unsanitized events from the <code>event</code> database, sanitizes them using the allowlist and copies them over to the <code>event_sanitized</code> database. Only schemas that are present in the [[gerrit:plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/eventlogging/whitelist.yaml|allowlist]] will be copied over the <code>event_sanitized</code> database.


A second sanitization job runs 45 days after data was received, just in case any changes were made to the allowlist in that time.


=== Allowlist ===
The [[gerrit:plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/eventlogging/whitelist.yaml|EL sanitization allowlist]] is a YAML file with the following format: The first level corresponds to schema names. All schemas that we want to partially or fully keep need to be there, otherwise, the whole contents of that schema's table is going to be purged. Under each schema name, at the second level of the YAML, there have to be the field names that we want to keep indefinitely. Each field name must have the label 'keep'. See example:
schemaName:
    fieldName1: keep
    fieldName2: keep
The allowlist supports partially allowlisting nested fields. Actually, the main event information is enclosed in the event nested object. Thus, the allowlist for EL schemas should look like this (note capsule fields are still outside the scope of event):
schemaName:
    event:
        fieldName1: keep
        fieldName2: keep
    capsuleFieldName1: keep
    capsuleFieldName2: keep
This feature can also be used for nested fields like <code>userAgent</code> or <code>geocoded_data</code>, but this will work for Hive only.


Important notes:
*For EventLogging, using the keep label for nested fields or for whole schemas is not allowed.
*The allowlist is schema-centric (not table-centric), meaning it serves for all revisions of a given schema. This way, when a schema is altered, the allowlist continues to work.
===Hashing (and salting)===
The EventLogging sanitization process has a feature that allows for string fields that are privacy sensitive to be automatically hashed when copied over to event_sanitized. To do that, instead of 'keep', use the 'hash' label in the allowlist. All fields hashed this way will also be salted (appended a cryptographic salt before applying hash function) to increase the security of the hash. The EventLogging sanitization salt is rotated (replaced by a new one) every 3 months, coinciding with the start of quarter, and the old salt is thrown away.


'''IMPORTANT NOTE''': Because of rotating salts, hashed identifiers will be linkable within the same quarter, but not across quarters. In other words, you will not be able to group events by the identifier across quarters, only within one quarter.
=== Allowlists ===
The [https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/sanitization/ allowlists] are YAML files with the following format. The first level corresponds to table names. Every table that we want to keep, partially or fully, must be listed there; otherwise, the whole contents of that table are purged. Under each table name, at the second level of the YAML, are the names of the fields that we want to keep indefinitely. Each field name must have the tag <code>keep</code> (retained as-is) or <code>hash</code>.


'''ANOTHER IMPORTANT NOTE''': If you decide to hash (and salt) an identifier field, then all other identifiers of the same schema have to be hashed as well. This applies even for temporary identifiers like session tokens. Otherwise, those identifiers can be used to match hashed (and salted) fields around the period of salt rotation. And this would invalidate the protection that salting and hashing offers.
Analytics/instrumentation event tables must explicitly list all fields they want to keep or hash.  Main production event tables are more permissive and may use the <code>keep_all</code> tag to keep all fields for the table.  There are two separate allowlists for this purpose.
===How to change the purging strategy of a schema===
 
*Submit a Gerrit patch to the [[gerrit:plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/eventlogging/whitelist.yaml|EL sanitization allowlist in puppet]] where you add the schema and fields you want to keep indefinitely. Please, take the sanitization rationale into account when selecting them. Then add someone in the [[gerrit:admin/groups/d34747bee94be39cff54b5fda1ae36b575107792,members|Analytics team]] to review the patch, who will review and merge it – and that's it!
Example:
*Alternatively, create a [[phab:|Phabricator]] task named i.e. "Add <SchemaName_123> fields to EL purging allowlist" and tag it with the "[[phab:project/view/11/|Analytics]]" project. In the task description, mention which field you'd like to keep. Analytics team will update the allowlist and that's it. This option might take a bit longer, because it might take a couple days until it gets looked at, prioritized, and worked on from the backlog.
<syntaxhighlight lang="yaml">
table_name:
    event:
        field_name1: keep
        field_name2: keep
        identifier1: hash
</syntaxhighlight>
 
{{Note|content=Because of rotating salts, hashed identifiers will be linkable within the same quarter, but not across quarters. In other words, you will not be able to group events by the identifier across quarters, only within one quarter.}}
 
The allowlist supports partially allowlisting nested fields.
 
{{Note|content=If you decide to hash (and salt) an identifier field, then all other identifiers of the same schema have to be hashed as well. This applies even for temporary identifiers like session tokens. Otherwise, those identifiers can be used to match hashed (and salted) fields around the period of salt rotation. And this would invalidate the protection that salting and hashing offers.|type=warning}}
 
'''Important notes:'''
*For analytics/instrumentation events, using the keep label for nested fields or for whole schemas is not allowed.
*The allowlist is table-centric, meaning it applies to all versions of a given event schema. This way, when an event schema is altered, the allowlist continues to work.
*The event sanitization process can automatically hash privacy-sensitive string fields when they are copied over to <code>event_sanitized</code>. To do that, use the 'hash' label instead of 'keep' in the allowlist. All fields hashed this way are also salted (a cryptographic salt is appended before the hash function is applied) to increase the security of the hash. The event sanitization salt is rotated (replaced by a new one) every 3 months, coinciding with the start of each quarter, and the old salt is thrown away.
 
==== Modifying the allowlist ====
*Submit a Gerrit patch to an [https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/sanitization/ allowlist YAML file] where you add the table and fields you want to keep indefinitely. Please take the sanitization rationale into account when selecting them. Then add someone in the [[gerrit:admin/groups/d34747bee94be39cff54b5fda1ae36b575107792,members|Analytics team]] to review the patch, who will review and merge it – and that's it!
*Alternatively, create a [[phab:|Phabricator]] task named, e.g., "Add table_name fields to event sanitization allowlist" and tag it with the "[[phab:project/view/11/|Analytics]]" project. In the task description, mention which fields you'd like to keep. The Analytics team will update the allowlist for you. This option might take a bit longer, because it can take a couple of days until the task gets looked at, prioritized, and worked on from the backlog.
*Allowlist updates are automatically deployed on the [[Deployments/Train|weekly train]]. If you need an update to be deployed sooner, you can ask Analytics to do a manual deploy.
*Analytics team will reach out to Legal if they have concerns about retaining any specific fields.
Line 45: Line 48:
=== F.A.Q. ===


====Is the information about purging that lives in the schema talk pages correct?====
We can ''not'' ensure that the purging strategy that is mentioned in the schema talk pages is the actual one that is implemented in the allowlist. Listing the purging strategy in the talk pages was a decision that came out to be non-practical, and in the end we decided that the allowlist would be the place for that.
====Will the schema talk pages ever have correct purging info?====
There's a [https://phabricator.wikimedia.org/T170019 task in Analytics' backlog] to write a script that automatically updates the talk pages with the changes to the allowlist, but it's been declined.  Please update the talk page whenever you edit purge configuration.
====What is the default purging strategy for new schemas?====
The default strategy for new schemas is full purge. This is a security measure to avoid losing control of the sensitive data inside EventLogging databases. This means, if you create a new schema and don't take action to allowlist its fields, the events produced to that schema are going to be purged after 90 days.
The default strategy for new schemas is full purge. This is a security measure to avoid losing control of the sensitive data inside <code>event</code> databases. This means, if you create a new schema and don't take action to allowlist its fields, the events produced to that schema are going to be purged after 90 days.
 
====Should I allowlist all fields of my schema every time I modify it?====
No. The allowlist is schema-centric – meaning it does ''not'' observe revisions. All the fields that are in the allowlist for previous revisions of your schema will also apply to the new revision.
No. The allowlist is table-centric – meaning it does ''not'' know about schema versions. All the fields that are in the allowlist for previous versions of your schema will also apply to the new version.
====When are allowlist changes effective?====
After being merged, changes need to be deployed with the analytics refinery code. This normally happens weekly on Wednesdays, but a deployment may be skipped if there are not enough changes or if a significant part of the team is unavailable due to ops issues, holidays, or offsites.
Line 58: Line 58:
The new field won't be in the allowlist, because it's new. So by default, it will be purged after 90 days. Note that all other allowlisted fields will still be kept. If you want to keep the new field, follow the steps described in "Modifying the allowlist".
====If I remove fields from my schema, should I remove them from the allowlist?====
Normally no. The older fields, will continue to allowlist older revisions of your schema. If you do not need the data contained in older revisions of your schema, feel free to remove the fields from the allowlist.
You cannot remove fields from your schema; this would be a backwards incompatible change.  Technically, this can be done but it requires a lot of manual intervention and migration planning.
 
If this does happen for some reason, you should not remove the fields from the allowlist: the older fields will continue to allowlist events created with older versions of your schema. If you do not need the data contained in older versions of your schema, feel free to remove the fields from the allowlist.
 
More correctly: because the allowlist applies to the Hive table, NOT the source event schema, the allowlist should match the Hive table schema.
 
====What happens when I rename fields in a schema?====
Renaming fields is not possible for event schemas. Hive does also accept field renames, but it does not actually rename the previous field, it considers schema renames as a deletion of the original field plus a creation of a new field. The resulting refined table will have both old and new names as columns. If you decide to rename a schema field anyway, please remember to update the EL sanitization allowlist accordingly, otherwise the newly named field will be purged.
Renaming fields is not possible for event schemas. Hive accepts field renames, but it does not actually rename the existing field; it treats a rename as a deletion of the original field plus the creation of a new field. The resulting refined table will have both the old and new names as columns. If you decide to rename a schema field anyway, please remember to update the sanitization allowlist accordingly; otherwise the newly named field will be purged.

Latest revision as of 14:43, 6 May 2021

This page describes the event sanitization processes used with Event Platform data to retain certain event data in Hive beyond the 90-day retention period set by WMF's Privacy Policy and Data Retention Guidelines.

Data retention

To learn more about data retention practices for events, see Event Data retention.

Hive

The Analytics Data Lake stores event streams as Hive tables, including those with a very high volume. It uses two databases: event and event_sanitized. The event database stores original (unsanitized) events, while the event_sanitized database stores sanitized events. Sanitization happens shortly after events are generated (with a lag of a couple of hours), so unsanitized and sanitized events co-exist in the two databases for 90 days. After that, unsanitized events are automatically deleted from the event database, and the only events that persist indefinitely are those in event_sanitized.

Hive event sanitization job

The sanitization job lives in analytics/refinery/source and is run every hour by a cron job. It reads the new unsanitized events from the event database, sanitizes them using the allowlists, and copies them over to the event_sanitized database. Only tables that are present in the allowlists are sanitized and copied over to the event_sanitized database.

A second sanitization job runs 45 days after data was received, just in case any changes were made to the allowlist in that time.
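
The timing described above can be sketched as simple date arithmetic. This is only an illustration, not the scheduling code of the actual job: the function name `sanitization_timeline` is hypothetical, and the two-hour lag is an assumed stand-in for "a couple hours".

```python
from datetime import datetime, timedelta

# Figures taken from the text above; the exact offsets used in production
# are assumptions made for this illustration.
FIRST_PASS_LAG = timedelta(hours=2)      # "a couple hours lag"
SECOND_PASS_DELAY = timedelta(days=45)   # second sanitization pass
RETENTION = timedelta(days=90)           # unsanitized data is deleted after this


def sanitization_timeline(received: datetime) -> dict:
    """Return approximate milestones for an event received at `received`."""
    return {
        "first_sanitization": received + FIRST_PASS_LAG,
        "second_sanitization": received + SECOND_PASS_DELAY,
        "purged_from_event_db": received + RETENTION,
    }


timeline = sanitization_timeline(datetime(2021, 1, 1))
for milestone, when in timeline.items():
    print(f"{milestone}: {when.isoformat()}")
```

An allowlist change merged within 45 days of an event's arrival is therefore still picked up by the second pass, well before the unsanitized copy is deleted.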


Allowlists

The allowlists are YAML files with the following format. The first level corresponds to table names. Every table that we want to keep, partially or fully, must be listed there; otherwise, the whole contents of that table are purged. Under each table name, at the second level of the YAML, are the names of the fields that we want to keep indefinitely. Each field name must have the tag keep (retained as-is) or hash.

Analytics/instrumentation event tables must explicitly list all fields they want to keep or hash. Main production event tables are more permissive and may use the keep_all tag to keep all fields for the table. There are two separate allowlists for this purpose.

Example:

table_name:
    event:
        field_name1: keep
        field_name2: keep
        identifier1: hash

The allowlist supports partially allowlisting nested fields.
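
To make the format concrete, here is a minimal Python sketch of how a nested allowlist entry could be applied to one event. It is not the RefineSanitize implementation: the function `sanitize`, the placeholder `hash_fn`, and the sample field values are all hypothetical, and the sketch does not enforce the extra restrictions that apply to analytics/instrumentation tables.

```python
def sanitize(event: dict, allowlist: dict, hash_fn=lambda value: "<hashed>") -> dict:
    """Return a copy of `event` containing only allowlisted fields.

    `allowlist` mirrors one table's YAML entry: a leaf is the tag "keep" or
    "hash", and a nested dict allowlists part of a nested field.
    `hash_fn` is a placeholder for the real salted hash.
    """
    sanitized = {}
    for field, rule in allowlist.items():
        if field not in event:
            continue
        value = event[field]
        if isinstance(rule, dict):
            # Partially allowlisted nested field: recurse into it.
            if isinstance(value, dict):
                nested = sanitize(value, rule, hash_fn)
                if nested:
                    sanitized[field] = nested
        elif rule == "keep":
            sanitized[field] = value
        elif rule == "hash":
            sanitized[field] = hash_fn(value)
        else:
            raise ValueError(f"unknown allowlist tag {rule!r} for field {field!r}")
    return sanitized


# The entry under table_name from the example above, as a Python dict.
allowlist = {"event": {"field_name1": "keep", "field_name2": "keep", "identifier1": "hash"}}
raw_event = {
    "event": {"field_name1": "a", "field_name2": 2, "identifier1": "token", "free_text": "x"},
    "ip": "192.0.2.1",
}
print(sanitize(raw_event, allowlist))
```

Fields that are not listed (here `free_text` and `ip`) are simply absent from the sanitized copy, which is what "purged" means in practice once the unsanitized data is deleted.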

Important notes:

  • For analytics/instrumentation events, using the keep label for nested fields or for whole schemas is not allowed.
  • The allowlist is table-centric, meaning it applies to all versions of a given event schema. This way, when an event schema is altered, the allowlist continues to work.
  • The event sanitization process can automatically hash privacy-sensitive string fields when they are copied over to event_sanitized. To do that, use the 'hash' label instead of 'keep' in the allowlist. All fields hashed this way are also salted (a cryptographic salt is appended before the hash function is applied) to increase the security of the hash. The event sanitization salt is rotated (replaced by a new one) every 3 months, coinciding with the start of each quarter, and the old salt is thrown away.
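
The effect of quarterly salt rotation can be illustrated with a short Python sketch. The hash function (SHA-256), the salt length, and the `SaltStore` class are assumptions chosen for the example; the page does not say which algorithm or storage the real job uses.

```python
import hashlib
import secrets
from datetime import date


def quarter_of(day: date) -> tuple:
    """Return (year, quarter) for a date, e.g. (2021, 2) for April 2021."""
    return (day.year, (day.month - 1) // 3 + 1)


class SaltStore:
    """Hypothetical in-memory store that holds only one quarter's salt."""

    def __init__(self):
        self._salts = {}

    def salt_for(self, day: date) -> str:
        quarter = quarter_of(day)
        if quarter not in self._salts:
            # Rotation: generate a fresh random salt and throw the old one away.
            self._salts = {quarter: secrets.token_hex(32)}
        return self._salts[quarter]


def hash_identifier(value: str, salt: str) -> str:
    """Append the salt to the value, then hash (SHA-256 is an assumption)."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()


store = SaltStore()
january = hash_identifier("session-abc", store.salt_for(date(2021, 1, 15)))
march = hash_identifier("session-abc", store.salt_for(date(2021, 3, 30)))
april = hash_identifier("session-abc", store.salt_for(date(2021, 4, 1)))
print(january == march)  # same quarter, same salt
print(january == april)  # different quarter, different salt
```

Because the salt for the first quarter is gone once the second quarter begins, the same identifier hashes to unrelated values in the two quarters and cannot be joined across them.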

Modifying the allowlist

  • Submit a Gerrit patch to an allowlist YAML file where you add the table and fields you want to keep indefinitely. Please take the sanitization rationale into account when selecting them. Then add someone in the Analytics team to review the patch, who will review and merge it – and that's it!

  • Alternatively, create a Phabricator task named, e.g., "Add table_name fields to event sanitization allowlist" and tag it with the "Analytics" project. In the task description, mention which fields you'd like to keep. The Analytics team will update the allowlist for you. This option might take a bit longer, because it can take a couple of days until the task gets looked at, prioritized, and worked on from the backlog.
  • Allowlist updates are automatically deployed on the weekly train. If you need an update to be deployed sooner, you can ask Analytics to do a manual deploy.
  • The Analytics team will reach out to Legal if they have concerns about retaining any specific fields.

F.A.Q.

What is the default purging strategy for new schemas?

The default strategy for new schemas is full purge. This is a security measure to avoid losing control of the sensitive data inside the event database. This means that if you create a new schema and don't take action to allowlist its fields, the events produced to that schema will be purged after 90 days.

Should I allowlist all fields of my schema every time I modify it?

No. The allowlist is table-centric – meaning it does not know about schema versions. All the fields that are in the allowlist for previous versions of your schema will also apply to the new version.

When are allowlist changes effective?

After being merged, changes need to be deployed with the analytics refinery code. This normally happens weekly on Wednesdays, but a deployment may be skipped if there are not enough changes or if a significant part of the team is unavailable due to ops issues, holidays, or offsites.

If I add a new field to an existing schema, what will happen?

The new field won't be in the allowlist, because it's new. So by default, it will be purged after 90 days. Note that all other allowlisted fields will still be kept. If you want to keep the new field, follow the steps described in "Modifying the allowlist".
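
One way to spot such fields before they are purged is to compare the table's columns with its allowlist entry. The helper below is a hypothetical sketch, not an existing tool: `unlisted_fields` and the sample column types are made up for illustration.

```python
def unlisted_fields(table_fields: dict, allowlist: dict, prefix: str = "") -> list:
    """List dotted paths of table fields that the allowlist does not cover.

    `table_fields` is a nested dict of column names (leaves are type names);
    `allowlist` is the table's allowlist entry with "keep"/"hash" leaves.
    """
    missing = []
    for name, subfields in table_fields.items():
        path = f"{prefix}{name}"
        rule = allowlist.get(name)
        if rule is None:
            missing.append(path)
        elif isinstance(subfields, dict) and isinstance(rule, dict):
            # Nested field that is only partially allowlisted: check inside it.
            missing.extend(unlisted_fields(subfields, rule, path + "."))
    return missing


table_fields = {
    "event": {"field_name1": "string", "field_name2": "bigint", "new_field": "string"},
    "dt": "string",
}
allowlist = {"event": {"field_name1": "keep", "field_name2": "keep"}}
print(unlisted_fields(table_fields, allowlist))  # ['event.new_field', 'dt']
```

Every path in the result would be dropped from the sanitized copy until it is added to the allowlist with a keep or hash tag.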

If I remove fields from my schema, should I remove them from the allowlist?

You cannot remove fields from your schema; this would be a backwards-incompatible change. Technically, this can be done, but it requires a lot of manual intervention and migration planning.

If this does happen for some reason, you should not remove the fields from the allowlist: the older fields will continue to allowlist events created with older versions of your schema. If you do not need the data contained in older versions of your schema, feel free to remove the fields from the allowlist.

More correctly: because the allowlist applies to the Hive table, NOT the source event schema, the allowlist should match the Hive table schema.

What happens when I rename fields in a schema?

Renaming fields is not possible for event schemas. Hive accepts field renames, but it does not actually rename the existing field; it treats a rename as a deletion of the original field plus the creation of a new field. The resulting refined table will have both the old and new names as columns. If you decide to rename a schema field anyway, please remember to update the sanitization allowlist accordingly; otherwise the newly named field will be purged.