Coming soon! This work is tracked in T273789.
To learn more about data retention practices in the legacy EL system, see this page.
The Analytics Hive cluster stores data for all EventLogging schemas, including those with a very high volume. It uses 2 databases:
The event database stores original (unsanitized) events, while
the event_sanitized database stores sanitized events. Sanitization happens shortly after events are generated (with a lag of a couple of hours), so unsanitized and sanitized events co-exist in the 2 databases for 90 days. After that, unsanitized events older than 90 days are automatically deleted from the event database, and the only events that persist indefinitely are those in the event_sanitized database.
Hive sanitization job
The sanitization job lives in analytics/refinery/source and is run every hour by a cron job. It reads the new unsanitized events from the event database, sanitizes them using the allowlist, and copies them over to the event_sanitized database. Only schemas that are present in the allowlist are copied over to the event_sanitized database.
A second sanitization job runs 45 days after data was received, just in case any changes were made to the allowlist in that time.
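The timing described above can be sketched with a little date arithmetic. This is only an illustration: the function name and the dictionary keys are hypothetical, and the 45-day and 90-day figures are taken from the text, not from the job's actual configuration.

```python
from datetime import date, timedelta

# Figures taken from the text above; the real job configuration may differ.
SECOND_PASS_DAYS = 45   # second sanitization pass after data was received
RETENTION_DAYS = 90     # unsanitized events are deleted after this many days


def retention_timeline(received: date) -> dict:
    """Return the key dates for an event received on `received` (hypothetical helper)."""
    return {
        "second_sanitization": received + timedelta(days=SECOND_PASS_DAYS),
        "unsanitized_purge": received + timedelta(days=RETENTION_DAYS),
    }


timeline = retention_timeline(date(2021, 1, 1))
print(timeline["second_sanitization"])  # 2021-02-15
print(timeline["unsanitized_purge"])    # 2021-04-01
```

In this sketch, an allowlist change merged before the second pass date would still be applied to the event, since the unsanitized copy exists until the purge date.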
The EL sanitization allowlist is a YAML file with the following format. The first level corresponds to schema names. Every schema that we want to keep, partially or fully, must be listed there; otherwise, the entire contents of that schema's table will be purged. Under each schema name, at the second level of the YAML, list the field names that we want to keep indefinitely. Each field name must have the label 'keep'. See example:
schemaName:
  fieldName1: keep
  fieldName2: keep
The allowlist supports partial allowlisting of nested fields. In fact, the main event information is enclosed in the nested event object. Thus, the allowlist for EL schemas should look like this (note that capsule fields are still outside the scope of event):
schemaName:
  event:
    fieldName1: keep
    fieldName2: keep
  capsuleFieldName1: keep
  capsuleFieldName2: keep
This feature can also be used for nested fields like geocoded_data, but this will work for Hive only.
- For EventLogging, using the keep label for nested fields or for whole schemas is not allowed.
- The allowlist is schema-centric (not table-centric), meaning it serves for all revisions of a given schema. This way, when a schema is altered, the allowlist continues to work.
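To make the allowlist semantics concrete, here is a minimal Python sketch of how such rules could be applied to a single event. It is not the refinery implementation: the function names are hypothetical, the allowlist is written as a plain dict equivalent to the parsed YAML above, and only the 'keep' label is handled.

```python
def sanitize(event: dict, rules: dict) -> dict:
    """Return a copy of `event` containing only the fields allowlisted in `rules`."""
    sanitized = {}
    for field, rule in rules.items():
        if field not in event:
            continue
        if rule == "keep":
            sanitized[field] = event[field]
        elif isinstance(rule, dict) and isinstance(event[field], dict):
            # Nested object (e.g. `event`): apply the nested rules recursively.
            nested = sanitize(event[field], rule)
            if nested:
                sanitized[field] = nested
    return sanitized


def sanitize_for_schema(schema: str, event: dict, allowlist: dict):
    """Return the sanitized event, or None if the schema is not allowlisted at all."""
    rules = allowlist.get(schema)
    if rules is None:
        return None  # schema absent from the allowlist: nothing is kept
    return sanitize(event, rules)


# Dict form of the YAML example above (assumed equivalent to the parsed file).
ALLOWLIST = {
    "schemaName": {
        "event": {"fieldName1": "keep", "fieldName2": "keep"},
        "capsuleFieldName1": "keep",
    }
}

raw = {
    "event": {"fieldName1": "a", "fieldName2": "b", "notListed": "x"},
    "capsuleFieldName1": "c",
    "otherCapsuleField": "d",
}
print(sanitize_for_schema("schemaName", raw, ALLOWLIST))
# {'event': {'fieldName1': 'a', 'fieldName2': 'b'}, 'capsuleFieldName1': 'c'}
```

Fields that are not listed (here `notListed` and `otherCapsuleField`) are simply dropped from the sanitized copy, and an event for a schema with no allowlist entry yields nothing.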
Hashing (and salting)
The EventLogging sanitization process can automatically hash privacy-sensitive string fields when they are copied over to event_sanitized. To do that, use the 'hash' label in the allowlist instead of 'keep'. All fields hashed this way are also salted (a cryptographic salt is appended to the value before the hash function is applied) to increase the security of the hash. The EventLogging sanitization salt is rotated (replaced by a new one) every 3 months, coinciding with the start of the quarter, and the old salt is thrown away.
IMPORTANT NOTE: Because of rotating salts, hashed identifiers will be linkable within the same quarter, but not across quarters. In other words, you will not be able to group events by the identifier across quarters, only within one quarter.
ANOTHER IMPORTANT NOTE: If you decide to hash (and salt) an identifier field, then all other identifiers of the same schema have to be hashed as well. This applies even for temporary identifiers like session tokens. Otherwise, those identifiers can be used to match hashed (and salted) fields around the period of salt rotation. And this would invalidate the protection that salting and hashing offers.
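The effect of salt rotation on linkability can be illustrated with a short Python sketch. The hash function (SHA-256), the way the salt is generated, and all names here are assumptions made for illustration; the text above only states that a salt is appended before hashing.

```python
import hashlib
import secrets


def salted_hash(value: str, salt: str) -> str:
    """Append the salt to the value and hash the result (SHA-256 is an assumption)."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()


# Hypothetical salts: one per quarter, the old one discarded at rotation.
q1_salt = secrets.token_hex(16)
q2_salt = secrets.token_hex(16)

token = "example-session-token"  # illustrative identifier value

same_quarter_a = salted_hash(token, q1_salt)
same_quarter_b = salted_hash(token, q1_salt)
next_quarter = salted_hash(token, q2_salt)

print(same_quarter_a == same_quarter_b)  # True: linkable within the quarter
print(same_quarter_a == next_quarter)    # False: not linkable across quarters
```

If a second identifier were kept unhashed, events on either side of the rotation that share that identifier could be matched, and the two different hashes of `token` would be linked again. That is why all identifiers of a schema have to be hashed together.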
How to change the purging strategy of a schema
- Submit a Gerrit patch to the EL sanitization allowlist in puppet, adding the schema and fields you want to keep indefinitely. Please take the sanitization rationale into account when selecting them. Then add someone from the Analytics team as a reviewer; they will review and merge the patch, and that's it!
- Alternatively, create a Phabricator task named e.g. "Add <SchemaName_123> fields to EL purging allowlist" and tag it with the "Analytics" project. In the task description, mention which fields you'd like to keep. The Analytics team will update the allowlist, and that's it. This option might take a bit longer, because it might take a couple of days until the task gets looked at, prioritized, and worked on from the backlog.
- Allowlist updates are automatically deployed on the weekly train. If you need an update to be deployed sooner, you can ask Analytics to do a manual deploy.
- Analytics team will reach out to Legal if they have concerns about retaining any specific fields.
Is the information about purging that lives in the schema talk pages correct?
We cannot guarantee that the purging strategy mentioned in the schema talk pages is the one actually implemented in the allowlist. Listing the purging strategy in the talk pages turned out to be impractical, and in the end we decided that the allowlist would be the place for that.
Will the schema talk pages ever have correct purging info?
There was a task in the Analytics backlog to write a script that automatically updates the talk pages with changes to the allowlist, but it has been declined. Please update the talk page whenever you edit the purge configuration.
What is the default purging strategy for new schemas?
The default strategy for new schemas is full purge. This is a security measure to avoid losing control of the sensitive data inside the EventLogging databases. This means that if you create a new schema and don't take action to allowlist its fields, the events produced to that schema are going to be purged after 90 days.
Should I allowlist all fields of my schema every time I modify it?
No. The allowlist is schema-centric – meaning it does not observe revisions. All the fields that are in the allowlist for previous revisions of your schema will also apply to the new revision.
When are allowlist changes effective?
After being merged, changes need to be deployed with the analytics refinery code. This normally happens weekly, on Wednesdays, but a week may be skipped if there are not enough changes or if a significant part of the team is unavailable due to ops issues, holidays, or offsites.
If I add a new field to an existing schema, what will happen?
The new field won't be in the allowlist, because it's new. So by default, it will be purged after 90 days. Note that all other allowlisted fields will still be kept. If you want to keep the new field, follow the steps described in "How to change the purging strategy of a schema".
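As a quick check before or after changing a schema, the comparison can be sketched in a few lines of Python. The helper name and the field lists are hypothetical; in practice the allowlisted fields would come from the allowlist YAML file.

```python
def fields_purged_after_retention(schema_fields, allowlisted_fields):
    """Return the schema fields that have no allowlist entry, in schema order."""
    allowlisted = set(allowlisted_fields)
    return [field for field in schema_fields if field not in allowlisted]


# Illustrative data: the allowlist was written for an earlier revision.
allowlisted = ["fieldName1", "fieldName2"]
new_revision = ["fieldName1", "fieldName2", "newField"]

purged = fields_purged_after_retention(new_revision, allowlisted)
print(purged)  # ['newField']
```

An empty result means every field of the current revision is already covered by the allowlist.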
If I remove fields from my schema, should I remove them from the allowlist?
Normally no. The older fields in the allowlist will continue to apply to older revisions of your schema. If you do not need the data contained in older revisions of your schema, feel free to remove the fields from the allowlist.
What happens when I rename fields in a schema?
Renaming fields is not supported for event schemas. Hive does accept field renames, but it does not actually rename the previous field: it treats a rename as the deletion of the original field plus the creation of a new field. The resulting refined table will have both the old and new names as columns. If you decide to rename a schema field anyway, please remember to update the EL sanitization allowlist accordingly; otherwise, the newly named field will be purged.