Labsdb redaction

From Wikitech-static
Revision as of 15:17, 26 September 2017 by imported>BryanDavis (Tag with Category:Wiki Replica admin)

This page is a work in progress.

This page documents how data is sanitized for the public databases that Wikimedia Cloud Services provides.

Step 1 Sanitarium

See MariaDB/Sanitarium and Labsdbs for more details.

Sanitarium runs 7 MySQL instances to replicate the DB shards. These instances remove sensitive columns, tables and databases in the simple case where the redaction is unconditional (e.g. ensuring user_password does not go into labs).

  • Tables that should not be replicated are excluded with the replicate-wild-ignore-table MySQL config option, which is set from the $private_tables puppet variable
  • Databases that should not be replicated (private wikis) are excluded with replicate-wild-ignore-table as well, set from the databases in the $private_wikis puppet variable (note: this is separate from private.dblist)
  • Columns that should be redacted are redacted via triggers, which are created from the list of columns in modules/role/files/mariadb/filtered_tables.txt
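
The three mechanisms above can be sketched roughly as follows. This is a minimal illustration, not the code in operations/puppet: the function names, the sample table and wiki names, and the trigger naming scheme are all hypothetical, and the real format of filtered_tables.txt is not assumed here. It relies only on the `db.table` pattern syntax of replicate-wild-ignore-table and on MariaDB's BEFORE INSERT / BEFORE UPDATE triggers.

```python
def ignore_table_options(private_tables, private_wikis):
    """Build replicate-wild-ignore-table lines for a MySQL config file.

    The option takes a `db.table` pattern in which % matches any run of
    characters. Note that _ is also a single-character wildcard, so an
    unescaped name matches slightly more broadly than written.
    """
    lines = []
    for table in private_tables:
        # Ignore this table in every database.
        lines.append(f"replicate-wild-ignore-table = %.{table}")
    for wiki in private_wikis:
        # Ignore every table of a private wiki's database.
        lines.append(f"replicate-wild-ignore-table = {wiki}.%")
    return lines


def redaction_triggers(db, table, columns):
    """Return MariaDB statements that blank `columns` on insert and update."""
    assignments = ", ".join(f"NEW.{col} = ''" for col in columns)
    statements = []
    for event in ("INSERT", "UPDATE"):
        name = f"{table}_{event.lower()}_redact"  # hypothetical naming scheme
        statements.append(
            f"CREATE TRIGGER {db}.{name} BEFORE {event} ON {db}.{table} "
            f"FOR EACH ROW SET {assignments}"
        )
    return statements


if __name__ == "__main__":
    # Sample names below are illustrative only.
    for line in ignore_table_options(["example_private"], ["examplewiki"]):
        print(line)
    for stmt in redaction_triggers("enwiki", "user", ["user_password"]):
        print(stmt + ";")
```

Generating the statements as text keeps the sketch independent of any database driver; an actual deployment would apply them on the sanitarium instances.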

Data from this host is then replicated onto the labsdb hosts. Doing this redaction on a separate host outside of labs helps isolate the sensitive data and ensures that a privilege escalation on labs does not compromise the very sensitive data in the DB.

There is also a report, check_private_data_report, that checks that redaction happened properly (FIXME: How is this run?)
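
As an illustration of what such an audit might check, the sketch below counts rows where a column that should be redacted still holds a value. It is not the actual check_private_data code: the function name `find_unredacted` is hypothetical, the column list is a hand-written stand-in for the filtered list, and an in-memory SQLite database stands in for a MariaDB replica so the example is self-contained.

```python
import sqlite3


def find_unredacted(conn, filtered_columns):
    """Return (table, column, count) for each filtered column that still holds data."""
    leaks = []
    for table, column in filtered_columns:
        # Identifiers come from a trusted, hand-maintained list in this sketch;
        # never interpolate untrusted input into SQL like this.
        query = (
            f"SELECT COUNT(*) FROM {table} "
            f"WHERE {column} IS NOT NULL AND {column} != ''"
        )
        (count,) = conn.execute(query).fetchone()
        if count:
            leaks.append((table, column, count))
    return leaks


# Stand-in data: one row was redacted correctly, one still has an email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_name TEXT, user_password TEXT, user_email TEXT)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [("Alice", "", ""), ("Bob", "", "bob@example.org")],
)

leaks = find_unredacted(conn, [("user", "user_password"), ("user", "user_email")])
print(leaks)  # [('user', 'user_email', 1)]
```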

The code related to sanitarium currently lives in operations/puppet in modules/role/files/mariadb.

  • modules/role/files/mariadb/redact_sanitarium.sh: adds triggers to redact the appropriate columns
  • modules/role/files/mariadb/filtered_tables.txt: lists which columns to filter
  • modules/role/files/mariadb/check_private_data_report and check_private_data.py: audit to make sure no private data is present
  • $private_wikis and $private_tables in manifests/realm.pp

This was formerly part of operations/software/redactron.git, but that repo is no longer used.

Step 2 Labsdb views

In operations/puppet.git, modules/role/templates/labs/db/views/maintain-views.yaml defines the views that determine what is public. It contains conditional redactions that cannot be done at sanitarium (e.g. revision deletion), and also serves as defense in depth in case one of the sanitarium redactions fails.
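
A conditional redaction of this kind can be sketched as a view that hides a column only when a flag on the row says so. The example below is an assumption-laden illustration rather than an excerpt from maintain-views.yaml: it uses an in-memory SQLite database instead of MariaDB, the underlying table name `revision_private` is hypothetical, and only the comment bit of MediaWiki's rev_deleted bitfield (value 2) is handled.

```python
import sqlite3

DELETED_COMMENT = 2  # MediaWiki rev_deleted bit for a hidden edit summary

conn = sqlite3.connect(":memory:")
conn.executescript(
    f"""
    -- Hypothetical underlying table holding the unredacted rows.
    CREATE TABLE revision_private (
        rev_id INTEGER PRIMARY KEY,
        rev_comment TEXT,
        rev_deleted INTEGER NOT NULL
    );
    INSERT INTO revision_private VALUES
        (1, 'fix typo', 0),
        (2, 'contains private info', {DELETED_COMMENT});

    -- Public view: the comment is exposed only when its deletion bit is unset.
    CREATE VIEW revision AS
    SELECT rev_id,
           CASE WHEN rev_deleted & {DELETED_COMMENT} THEN NULL
                ELSE rev_comment END AS rev_comment,
           rev_deleted
    FROM revision_private;
    """
)

rows = conn.execute("SELECT rev_id, rev_comment FROM revision ORDER BY rev_id").fetchall()
print(rows)  # [(1, 'fix typo'), (2, None)]
```

Because the condition depends on each row's own rev_deleted value, it cannot be expressed as a blanket column or table filter, which is why it belongs in the view layer.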

Document redaction decisions

TODO: include documentation/rationale for any info that is publicly exposed here but not publicly exposed by MediaWiki.

Other

Note: operations/software/redactron.git and operations/software/labsdb-auditor.git contain historical software which is no longer used.