Email System Revamp

From Wikitech-static
Revision as of 21:42, 16 June 2022 by imported>JHathaway

Glossary

  • ARC (Authenticated Received Chain): Provides a method for mail relays to digitally sign the results of their SPF and DKIM checks, which may then be verified by the next hop mail server, RFC 8617.
  • DKIM (DomainKeys Identified Mail): Adds a cryptographic signature from the sender’s domain to each message. This signature allows a receiver to verify that the signed headers and body were not tampered with in transit. Notably the From header is always signed by DKIM, RFC 6376.
  • DMARC (Domain-based Message Authentication, Reporting, and Conformance): Provides a policy and reporting mechanism to describe what a receiving domain should do with emails which fail their SPF or DKIM checks, RFC 7489.
  • Null Client: A mail server that only relays mail to another mail server for the domain. The receiving mail server then handles the actual delivery to the email recipient (also referred to as a “satellite config”).
  • Milter: Mail filter for an MTA, portmanteau of mail & filter, e.g. rspamd.
  • MTA (Mail Transfer Agent): A general mail server, RFC 5321.
  • MTA-STS (MTA Strict Transport Security): Allows a domain to specify that incoming emails should be delivered only over TLS encrypted sessions, RFC 8461.
  • MX: Mail exchange DNS record, specifies which servers accept mail for a given domain.
  • SPF (Sender Policy Framework): Allows a domain to declare via DNS the mail servers which are authorized to send email for the domain, RFC 7208.
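
To make the policy mechanism concrete, the following is a minimal sketch of reading a DMARC policy from the text of a DNS TXT record. The record value and the reporting address are hypothetical examples, not Wikimedia's published policy; only the `tag=value;` layout comes from RFC 7489.

```python
def parse_dmarc_record(txt: str) -> dict[str, str]:
    """Split a DMARC TXT record into its tag=value pairs."""
    tags = {}
    for part in txt.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed tag: {part!r}")
        tags[name.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record")
    return tags


# Hypothetical record; "p" is the policy a receiver is asked to apply
# and "rua" is where aggregate reports should be sent.
record = "v=DMARC1; p=none; rua=mailto:dmarc-reports@example.org"
policy = parse_dmarc_record(record)
print(policy["p"], policy["rua"])
```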

Current Email Services Overview

All current email services use Exim for delivery.

Inbound Email Services

  1. Wikimedia Foundation Email: @wikimedia.org
  2. VRT Contacts:
    • (permissions, info)-(en, ko, tr, etc)
      • @wikimedia.org
      • wikis
        1. @wikipedia.org
        2. @wikibooks.org
        3. @wikivoyage.org
        4. etc.
  3. Wiki Operators:
    • postmaster, abuse, hostmaster, etc.
  4. Phabricator Replies: Replies to issues, @phabricator.wikimedia.org
  5. Mailing Lists: wikitech-l, mediawiki-l, etc, @lists.wikimedia.org

Inbound Email Routing

  • mx(1001,2001).wikimedia.org
    • Wikimedia Foundation Email: Internet ➡ mx(1001,2001).wikimedia.org ➡ Gmail
    • VRT Contacts: Internet ➡ mx(1001,2001).wikimedia.org ➡ otrs1001.eqiad.wmnet
    • Wiki Operators: Internet ➡ mx(1001,2001).wikimedia.org ➡ Gmail
    • Phabricator Replies: Internet ➡ mx(1001,2001).wikimedia.org ➡ phab1001.eqiad.wmnet
  • lists.wikimedia.org
    • Mailing Lists: Internet ➡ lists1001.wikimedia.org

Outbound Email Services

  1. Wiki Email: Email generated by MediaWiki
    • Password resets
    • Login notifications
    • Article edits
  2. Infra Email: Email generated by our infrastructure
    • Email to local users, e.g. root@HOST.(eqiad, codfw, etc).wmnet
    • Jenkins
    • Wikitech
    • AlertManager
    • GitLab
  3. VRT Email: Volunteer Response Team replies from Znuny (formerly OTRS)
    • Replies for wiki domains: wikimedia.org, wikipedia.org, wiktionary.org, etc
  4. Mailing List Email: Email from Mailman3 to subscribers, hosted on lists1001.wikimedia.org
  5. Phabricator Email: Email from Phabricator, comments, notifications, etc (phabX.wikimedia.org)
  6. Gerrit Email: Email from Gerrit, comments, notifications, etc (gerrit1001.wikimedia.org)

Outbound Email Routing

  • mx(1001,2001).wikimedia.org
    • Wiki Email: (MediaWiki Servers) ➡ mx(1001,2001).wikimedia.org ➡ Internet
    • Infra Email: (Servers) ➡ mx(1001,2001).wikimedia.org ➡ Internet
    • VRT Replies: otrs1001.eqiad.wmnet ➡ mx(1001,2001).wikimedia.org ➡ Internet
    • Gerrit Email: gerrit1001.wikimedia.org ➡ mx(1001,2001).wikimedia.org ➡ Internet
  • lists.wikimedia.org
    • Mailing List Email: lists1001.wikimedia.org ➡ Internet
  • phabricator.wikimedia.org
    • Phabricator Email: phab1001.eqiad.wmnet ➡ Internet

Problems with Current Design

  1. Exim as the MTA: Exim has had a long string of security vulnerabilities, including many with remote code execution:
    1. 16 Debian Security Advisories & 6 Debian Long Term Support Security Advisories since 2010.
    2. Vulnerabilities have been found in both new and old code.
  2. MTAs serve domains with mixed user groups
    1. Crosses organizational boundaries: e.g. ITS owns Gmail, but SRE operates the MTA for inbound Wikimedia Foundation email.
    2. Makes it more difficult to tighten our DMARC and SPF configurations.
    3. IP reputation is mixed: e.g. Wiki email egresses through the same servers as Infra email.
    4. Maintenance impacts more domains.
  3. Inbound & outbound MTAs are the same servers, T175362
    1. Makes it more difficult to monitor and administer the mail queues.
    2. Maintenance affects sending and receiving email

Benefits to Improving our Current Design

  1. Improved software security
  2. Reduced operational burden
  3. Improved spam identification

Risks of Keeping the Status Quo

  1. Continued possibility of a server compromise via an Exim vulnerability
  2. Operational burdens continue
  3. Spam levels stay the same or increase

Design Considerations

Subdomains or new domains for email services

Some of our email services have been moved to subdomains of wikimedia.org, e.g. lists.wikimedia.org. An alternative option is to move services to a separate domain, e.g. wikimedia-lists.org.

Options:

  1. Subdomain
    1. Pros
      1. Easy to create & manage
      2. Allows separate MX, SPF, DKIM, & DMARC
    2. Cons
      1. Some email service providers, such as Gmail, will alter the spam reputation of the apex domain, e.g. wikimedia.org, if a subdomain, e.g. lists.wikimedia.org, produces substantial spam. [1][2]
  2. Domain
    1. Pros
      1. Allows separate MX, SPF, DKIM, & DMARC
      2. Spam scoring is completely separate from other domains
    2. Cons
      1. Domain recognition may be more confusing for end users, e.g. should they trust wikimedia-lists.org?

Option (1) is used in the proposed design as that method was already used for the mailing list addresses and the risk of high levels of spam egressing our systems appears low. However, if egress spam becomes a problem we can reconsider the decision.

Multiple User Groups of @wikimedia.org

There are two primary user groups who use @wikimedia.org email addresses:

  1. VRT Contacts
  2. Wikimedia Foundation Email

All email for @wikimedia.org addresses ingresses through SRE-managed email servers and is then routed, depending on the user group, to either Gmail’s or VRTS’s email servers. The current setup complicates administration, as ITS owns the Wikimedia Foundation employees’ email concerns while SRE owns the email ingress servers and VRTS. In addition, this setup is suboptimal because Gmail’s spam detection works best when email ingresses their MTAs directly. However, changing the current setup is not without difficulties and risks.

Options:

  1. Keep the Status Quo
    1. Pros
      1. Around 80% of VRT email is addressed to the wikimedia.org domain. VRT email addresses, e.g. <info@wikimedia.org>, have been publicized for many years and are documented in numerous places on and off wiki.
    2. Cons
      1. More spam to Wikimedia Foundation Email Users since Gmail is not the first point of ingestion.
      2. Requires some coordination between ITS, SRE, & VRT.
  2. Separate domains for VRT Contacts & Wikimedia Foundation Email
    1. Pros
      1. Separates ITS handled email and SRE handled email
      2. Wikimedia Foundation email would ingress Gmail directly for better spam protection
    2. Cons
      1. Significant migration effort needed to move VRT addresses off of the wikimedia.org domain. We would need to support old domain addresses for many years, if not indefinitely. Alternatively, Wikimedia Foundation email could be moved off of the wikimedia.org domain, but this would also require a long, painful migration process.

Option (1) is used in the proposed design. Though separating the domains for VRT email and Wikimedia Foundation email is attractive from an organizational perspective, the migration for either user group would be so disruptive as to negate any gains realized by the separation. Instead, we keep the status quo of both user groups sharing the wikimedia.org domain. We also assume the user groups will share the domain for the foreseeable future, which makes investment in easing the complexity burden worthwhile.

Proposed Design Overview

Replace Exim with Postfix

Postfix offers a more security-focused implementation than Exim. Postfix has had only one Debian Security Advisory since 2010, compared with 16 for Exim, and is structured so as to minimize attack vectors, T232343.

  • Pros
    • Improved security design
  • Cons
    • More difficult configuration
    • Requires porting Exim’s configuration to Postfix’s
    • Debian has had a similar debate for many years, but has stuck with Exim

Split Inbound and Outbound MTAs

Splitting our inbound and outbound email delivery should allow for easier monitoring and security controls, T175362.

  • Pros
    • Easier to monitor
    • Easier to reason about configuration changes
  • Cons
    • More servers to monitor and maintain

Proposed Design

Inbound Email Routing

Two hosts per inbound MX record (one per site, eqiad & codfw), with redundancy provided by the DNS MX records for each domain.

  • mx-wiki (MX record for wikimedia.org & wikis: wikipedia.org, wikivoyage.org, etc)
    • Wikimedia Foundation Email: Internet ➡ mx-wiki ➡ Gmail
    • VRT Contacts: Internet ➡ mx-wiki ➡ otrs1001.eqiad.wmnet
    • Wiki Operators: Internet ➡ mx-wiki ➡ Gmail
  • mx-infra (MX record for phabricator.wikimedia.org)
    • Phabricator Replies: Internet ➡ mx-infra ➡ phab1001.eqiad.wmnet
  • mx-lists (MX record for lists.wikimedia.org)
    • Mailing Lists: Internet ➡ mx-lists ➡ lists1001.wikimedia.org
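
The inbound routing above can be sketched as a small lookup from recipient address to MX pool and next hop. This is an illustration only: the `VRT_LOCAL_PARTS` set and the prefix-matching rule are assumptions made for the example, and the real split between VRT and Gmail recipients lives in the mail routing configuration.

```python
# Domains with a dedicated MX pool and a single next hop (from the list above).
DEDICATED_ROUTES = {
    "phabricator.wikimedia.org": ("mx-infra", "phab1001.eqiad.wmnet"),
    "lists.wikimedia.org": ("mx-lists", "lists1001.wikimedia.org"),
}

# Domains served by mx-wiki; the list is abbreviated here.
WIKI_DOMAINS = {"wikimedia.org", "wikipedia.org", "wikibooks.org", "wikivoyage.org"}

# Assumption for illustration: VRT queues are recognized by local-part prefix,
# e.g. info, info-en, permissions-ko.
VRT_LOCAL_PARTS = {"info", "permissions"}


def route_inbound(address: str) -> tuple[str, str]:
    """Return (mx_pool, next_hop) for a recipient address."""
    local, _, domain = address.lower().rpartition("@")
    if not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    if domain in DEDICATED_ROUTES:
        return DEDICATED_ROUTES[domain]
    if domain in WIKI_DOMAINS:
        queue = local.split("-", 1)[0]
        if queue in VRT_LOCAL_PARTS:
            return ("mx-wiki", "otrs1001.eqiad.wmnet")
        return ("mx-wiki", "Gmail")
    raise LookupError(f"no inbound route for domain {domain!r}")


print(route_inbound("info-en@wikipedia.org"))
print(route_inbound("wikitech-l@lists.wikimedia.org"))
```

A table like this could also double as the expected-results fixture for the high level routing tests listed in the task outline.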

Outbound Email Routing

Two hosts per outbound relay (one per site, eqiad & codfw), with redundancy provided by including both servers in Null Client configs. Alternatively, each pair of servers could be fronted by LVS for redundancy.

  • mta-relay-wiki
    • Wiki Email: (MediaWiki Servers) ➡ mta-relay-wiki ➡ Internet
    • VRT Replies: otrs1001.eqiad.wmnet ➡ mta-relay-wiki ➡ Internet
  • mta-relay-infra
    • Infra Email: (Servers) ➡ mta-relay-infra ➡ Internet
    • Phabricator Email: phab1001.eqiad.wmnet ➡ mta-relay-infra ➡ Internet
    • Gerrit Email: gerrit1001.wikimedia.org ➡ mta-relay-infra ➡ Internet
  • mta-relay-lists
    • Mailing List Email: lists1001.wikimedia.org ➡ mta-relay-lists ➡ Internet

Task Outline

These tasks are listed in approximate dependency order, though some of them could be done in parallel.

Improve Test Coverage & Methodologies for Mail Config Changes

  1. High level domain, address and routing tests
  2. Config syntax checking

Provision a Development Setup on OpenStack

  1. Create Postfix configuration
  2. Add spam & security features, via Postfix or separate milters, rspamd, opendkim, etc.
    1. SPF
    2. DKIM
    3. DMARC
    4. ARC

Create Puppet Modules

  1. Evaluate Puppet Forge modules for suitability
  2. Create or augment an existing Postfix module to support
    1. Null Client MTAs
    2. Inbound MTAs
    3. Outbound MTAs
  3. Create or augment an existing milter module to support
    1. SPF
    2. DKIM
    3. DMARC
    4. ARC
    5. MTA-STS, T203883
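
As an illustration of what an MTA-STS module would need to publish, here is a minimal sketch that renders a policy file in the key/value format defined by RFC 8461. The MX hostnames are hypothetical placeholders; the policy would be served over HTTPS at `/.well-known/mta-sts.txt` on the `mta-sts` host of the domain.

```python
VALID_MODES = {"enforce", "testing", "none"}
MAX_AGE_LIMIT = 31557600  # upper bound on max_age set by RFC 8461 (seconds)


def render_mta_sts_policy(mx_hosts, mode="testing", max_age=86400) -> str:
    """Render an MTA-STS policy body with CRLF line endings."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if not 0 <= max_age <= MAX_AGE_LIMIT:
        raise ValueError("max_age out of range")
    if mode != "none" and not mx_hosts:
        raise ValueError("at least one mx pattern is required")
    lines = ["version: STSv1", f"mode: {mode}"]
    lines += [f"mx: {host}" for host in mx_hosts]
    lines.append(f"max_age: {max_age}")
    return "\r\n".join(lines) + "\r\n"


# Hypothetical MX hostnames, one per site.
policy = render_mta_sts_policy(["mx-a.example.org", "mx-b.example.org"])
print(policy)
```

Starting in `testing` mode lets senders report TLS failures without refusing delivery, which fits a gradual rollout.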

Set Up Postfix Monitoring

  1. Export Postfix metric data to Prometheus to match existing Exim metrics
  2. Add alerts on key metrics
  3. Assess our DMARC report monitoring and improve if needed
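
One possible shape for the metric export is sketched below: count delivery outcomes from Postfix log lines and render them in the Prometheus text exposition format. The metric name `postfix_delivery_status_total` and the sample log lines are assumptions for illustration, and a production exporter would need to handle far more of the log format.

```python
import re
from collections import Counter

# Matches delivery log lines such as:
#   postfix/smtp[1234]: 4F1A2C0D3E: to=<a@example.org>, ..., status=sent (...)
STATUS_RE = re.compile(r"postfix/\w+\[\d+\]: \w+: to=<[^>]*>.*\bstatus=(\w+)")


def count_statuses(lines) -> Counter:
    """Count delivery attempts by status (sent, deferred, bounced, ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def render_metrics(counts: Counter) -> str:
    """Render the counts in Prometheus text exposition format."""
    out = ["# TYPE postfix_delivery_status_total counter"]
    for status in sorted(counts):
        out.append(f'postfix_delivery_status_total{{status="{status}"}} {counts[status]}')
    return "\n".join(out) + "\n"


sample_log = [
    "Jun 16 21:42:01 mx1 postfix/smtp[1234]: 4F1A2C0D3E: to=<a@example.org>, "
    "relay=mail.example.org[192.0.2.25]:25, delay=0.5, dsn=2.0.0, status=sent (250 ok)",
    "Jun 16 21:42:02 mx1 postfix/smtp[1235]: 5A2B3C4D5F: to=<b@example.org>, "
    "relay=none, delay=30, dsn=4.4.1, status=deferred (connection timed out)",
    "Jun 16 21:42:03 mx1 postfix/smtp[1236]: 6B3C4D5E6A: to=<c@example.org>, "
    "relay=mail.example.org[192.0.2.25]:25, delay=0.4, dsn=2.0.0, status=sent (250 ok)",
    "Jun 16 21:42:04 mx1 postfix/qmgr[900]: 6B3C4D5E6A: removed",
]
counts = count_statuses(sample_log)
print(render_metrics(counts))
```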

Set Up Outbound Postfix Email Servers

Two hosts per outbound MTA relay (one per site, eqiad & codfw), with redundancy provided by including both servers in Null Client configs. Alternatively, each pair of servers could be fronted by LVS for redundancy. IP reputation is helpful for ensuring messages are not marked as spam. However, it is difficult to retain the same IPs while setting up new outbound servers. Instead, new IPs will be used and DMARC reports will be monitored to determine if there is an elevated number of messages being marked as spam.
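
As a rough sketch of that monitoring, the snippet below summarizes a DMARC aggregate report per sending IP. It assumes the XML layout from RFC 7489 without namespaces, and the sample report and IP addresses are invented. Aggregate reports carry SPF/DKIM results and the disposition the receiver applied, so the failure count here is only a proxy for how mail from the new IPs is being treated.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_report(xml_text: str) -> dict:
    """Return {source_ip: {"messages": n, "dmarc_fail": n}} for one report."""
    root = ET.fromstring(xml_text)
    totals = defaultdict(lambda: {"messages": 0, "dmarc_fail": 0})
    for row in root.iter("row"):
        ip = row.findtext("source_ip")
        count = int(row.findtext("count", default="0"))
        dkim = row.findtext("policy_evaluated/dkim")
        spf = row.findtext("policy_evaluated/spf")
        totals[ip]["messages"] += count
        # DMARC fails only when neither aligned DKIM nor aligned SPF passes.
        if dkim != "pass" and spf != "pass":
            totals[ip]["dmarc_fail"] += count
    return dict(totals)


# Invented sample: one healthy new relay IP and one failing source.
SAMPLE_REPORT = """
<feedback>
  <record><row>
    <source_ip>198.51.100.20</source_ip><count>40</count>
    <policy_evaluated><disposition>none</disposition><dkim>pass</dkim><spf>pass</spf></policy_evaluated>
  </row></record>
  <record><row>
    <source_ip>198.51.100.21</source_ip><count>5</count>
    <policy_evaluated><disposition>none</disposition><dkim>fail</dkim><spf>fail</spf></policy_evaluated>
  </row></record>
</feedback>
""".strip()

summary = summarize_report(SAMPLE_REPORT)
for ip, stats in sorted(summary.items()):
    print(ip, stats)
```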

  • General MTA Relay Steps
    1. Provision Servers
    2. Build Postfix config
    3. Add DKIM records, or reuse existing records
    4. Test delivery
    5. Update DNS SPF records to include new servers
    6. Monitor delivery rates
    7. Remove Exim servers from SPF records
    8. Update DMARC records for domains
  1. MTA Relay Wiki: Outbound MTA Relay for wiki domains.
    1. General MTA Relay Steps
    2. Point MediaWiki to new hosts
    3. Point VRTS to new hosts
  2. MTA Relay Infra: Outbound MTA Relay for infrastructure
    1. General MTA Relay Steps
    2. Point servers to new hosts
  3. MTA Relay Lists: Outbound MTA Relay for Mailing Lists
    1. General MTA Relay Steps
    2. ARC Setup
    3. Point mailing list servers to new hosts
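
The SPF steps in the general relay procedure (add the new servers, then later remove the Exim servers) can be sketched as two small string operations. The record and the IP addresses below are documentation-range placeholders, not the published Wikimedia SPF record.

```python
def add_mechanism(record: str, mechanism: str) -> str:
    """Add a mechanism to an SPF record, keeping the 'all' term last."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF record")
    if mechanism in terms:
        return record
    for index, term in enumerate(terms):
        # 'all' may carry a qualifier: +all, -all, ~all, ?all
        if term.lstrip("+-~?").lower() == "all":
            terms.insert(index, mechanism)
            break
    else:
        terms.append(mechanism)
    return " ".join(terms)


def remove_mechanism(record: str, mechanism: str) -> str:
    """Drop a mechanism from an SPF record."""
    return " ".join(term for term in record.split() if term != mechanism)


# Hypothetical addresses: 192.0.2.10 is an old Exim host,
# 198.51.100.20 is a new Postfix relay.
current = "v=spf1 ip4:192.0.2.10 ~all"
during_migration = add_mechanism(current, "ip4:198.51.100.20")
after_migration = remove_mechanism(during_migration, "ip4:192.0.2.10")
print(during_migration)
print(after_migration)
```

Keeping both sets of addresses in the record during the monitoring step means mail from either generation of servers continues to pass SPF.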

Convert Null Clients to Postfix

  1. Replace Exim Null Clients with Postfix equivalents
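
A Postfix Null Client needs only a handful of settings. The sketch below renders them as `main.cf` text with a primary relay and a fallback relay in the other site, which is one way to realize the "both servers in Null Client configs" redundancy. The hostnames are hypothetical, and using `smtp_fallback_relay` for the second relay is an assumption about how the redundancy might be expressed, not the decided configuration.

```python
def render_null_client(myorigin: str, primary_relay: str, fallback_relay: str) -> str:
    """Render a minimal Postfix null client main.cf fragment."""
    settings = {
        "myorigin": myorigin,
        # Square brackets tell Postfix to skip the MX lookup for the relay.
        "relayhost": f"[{primary_relay}]",
        "smtp_fallback_relay": f"[{fallback_relay}]",
        # Accept mail from local processes only and deliver nothing locally.
        "inet_interfaces": "loopback-only",
        "mydestination": "",
    }
    return "".join(f"{key} = {value}".rstrip() + "\n" for key, value in settings.items())


# Hypothetical relay hostnames, one per site.
config = render_null_client(
    myorigin="example.org",
    primary_relay="mta-relay-infra-a.example.org",
    fallback_relay="mta-relay-infra-b.example.org",
)
print(config)
```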

Set Up Inbound Postfix Email Servers

The goal will be to provision Postfix servers in front of the existing Exim servers for inbound traffic. This setup should provide a fair bit of increased security, since Exim will no longer be exposed directly to the internet, but it will also allow us to temporarily keep the existing Exim mail routing logic in place.

  • General MX Steps
    1. Provision Servers
    2. Build Postfix config
    3. Set up MTA-STS
    4. Test delivery
    5. Add new servers with a higher value MX record (lower priority)
    6. Monitor delivery rates
    7. Raise priority of MX records
    8. Remove Exim servers from MX records
  1. MX Wiki: Inbound MX server for wiki domains.
    1. General MX Steps
  2. MX Lists: Inbound MX server for mailing lists
    1. General MX Steps
    2. ARC Setup
  3. MX Infra: Inbound MX server for Infrastructure services
    1. General MX Steps
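
The MX cutover in the general steps relies on senders trying the lowest preference value first. The sketch below models the three DNS states of the migration. The preference values (10 and 50) and the new Postfix hostnames are assumptions for illustration; the Exim hostnames are the existing mx(1001,2001).wikimedia.org hosts.

```python
EXIM_HOSTS = ["mx1001.wikimedia.org", "mx2001.wikimedia.org"]
# Hypothetical names for the new Postfix MX hosts.
POSTFIX_HOSTS = ["mx-wiki-a.example.org", "mx-wiki-b.example.org"]


def mx_records(preferred, backup=()):
    """Build (preference, host) pairs; a lower preference value is tried first."""
    return [(10, host) for host in preferred] + [(50, host) for host in backup]


def delivery_order(records):
    """Order in which a sender would try hosts.

    Hosts sharing a preference are chosen among at random by real senders;
    they are sorted by name here only to keep the output stable.
    """
    return [host for _, host in sorted(records)]


phases = {
    "add new servers at lower priority": mx_records(EXIM_HOSTS, POSTFIX_HOSTS),
    "raise priority of new servers": mx_records(POSTFIX_HOSTS, EXIM_HOSTS),
    "remove Exim servers": mx_records(POSTFIX_HOSTS),
}

for name, records in phases.items():
    print(name, "->", delivery_order(records))
```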

Remove Exim MTAs

  1. Move inbound routing rules to Postfix
  2. Monitor for changes in delivery rates
  3. Shut down the Exim servers once they are no longer routing any mail