You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

HTTPS: Difference between revisions

From Wikitech-static
imported>EddieGP
mNo edit summary
imported>BBlack
 
(16 intermediate revisions by 9 users not shown)
Line 1: Line 1:
== Design ==
{{Navigation Wikimedia infrastructure|expand=traffic}}
{{Outdated}}
'''[[w:HTTPS|HTTPS]]''' (also called '''HTTP over Transport Layer Security (TLS)''', '''HTTP over SSL''', and '''HTTP Secure''') is a communications protocol for secure communication over a computer network which is widely used on the Internet. The Wikimedia family of wikis and services use HTTPS encryption to prevent eavesdropping and [[w:Man-in-the-middle_attack|man-in-the-middle attacks]]. This page and its related sub-pages attempt to document the current best practices and standards for both server and client side protections.
=== Service names ===


For HTTP we used name-based virtual hosts for all domains, where the appservers knew which service to serve based on the Host header. For HTTPS we chose to use IP-based virtual hosts. When a client connects, the server has to present the certificate for the virtual host being requested, but that host is not known during the handshake unless [[w:en:Server Name Indication|Server Name Indication (SNI)]] is used. SNI is only supported in fairly modern browsers (e.g. IE7, FX2, Opera8).
== Current policies and standards ==
Security standards are constantly updated, and Wikimedia follows these changes as well. When older standards are dropped, this is done gradually and you might see the page https://www.wikipedia.org/sec-warning giving you information about why your browser will not be supported in the future.


In our previous CNAME approach we used three service names:
== For all public-facing Web sites and services under Wikimedia control ==
# text.wikimedia.org
# bits.wikimedia.org
# upload.wikimedia.org
All project domains (wikipedia, wikimedia, etc.), languages (en.wikipedia, de.wikinews, etc.) and sites (commons.wikimedia, meta.wikimedia, etc.) were CNAME'd to text.wikimedia.org.


text.wikimedia.org was also a CNAME, to enable GeoDNS. Depending on the [[DNS#Geographic_DNS|DNS scenario]] we are in, the 'text' CNAME points to either:
These policies and standards apply to all services having hostnames within our canonical domains (see below), even for sites run by third parties on our behalf.
* text.esams.wikimedia.org : Amsterdam, NL
* text.pmtpa.wikimedia.org : Tampa, USA
* text.eqiad.wikimedia.org : Ashburn, USA (not yet in production)


To support IP based virtual hosts, we created service CNAMEs on a per project basis. The <tt>-lb</tt> suffix means "load balancing".
We currently rely on https://www.ssllabs.com/ssltest/ to audit sites for basic TLS security issues.  Sites must get an A+ rating there.  Failing to reach A+ on that audit can happen for a very long list of reasons detailed in: https://github.com/ssllabs/research/wiki/SSL-Server-Rating-Guide .   A few more-specific issues (or issues that are not explicitly validated by an A+ rating) are listed here:


* wikimedia-lb.wikimedia.org
* '''HTTPS enabled''' - ...with a minimum allowed protocol version of TLSv1.2 and supporting at least version TLSv1.2.
* wikipedia-lb.wikimedia.org
* '''Good Certs''' - Certificates must validate correctly in all common browsers, must send chain certs attaching them to known roots.
* wiktionary-lb.wikimedia.org
* '''Decent Ciphers''' - Must offer forward-secret AEAD ciphers (''e.g.'' ECDHE-*-AES128-GCM), should offer only forward-secret ciphers, and should not allow non-AEAD ciphers (''e.g.'' AES CBC modes)
* wikiquote-lb.wikimedia.org
* '''HTTP service''' - If available at all, must exist solely for the purpose of redirecting to HTTPS and not serve actual content. Preferred mechanics are that all GET and HEAD requests emit a 301 redirect to the same URL over HTTPS, and all other methods emit a 403 error.
* wikibooks-lb.wikimedia.org
* '''HSTS''' - All HTTPS responses must include an [[:en:HTTP_Strict_Transport_Security|HSTS]] header with a minimum max-age value of 1 year, which includes sub-domains and allows preloading. Example: <code>Strict-Transport-Security: max-age=31536000; includeSubDomains; preload</code>.
* wikisource-lb.wikimedia.org
* wikinews-lb.wikimedia.org
* wikiversity-lb.wikimedia.org
* mediawiki-lb.wikimedia.org
* foundation-lb.wikimedia.org


These CNAMEs, like the original text.wikimedia.org, point to <servicename>.<datacenter>.wikimedia.org based on the DNS scenario. The referenced records are A records.  This means that each service needs one IP address per datacenter. With the ten services listed above and three datacenters, this requires 30 IP addresses.
== Certificate Issuance and Renewal ==


text.wikimedia.org has been replaced by text.svc, a backend IP, as described in the next section.
Most of our certificates are automated via [[Acme-chief]] using certs from [https://letsencrypt.org/ Let's Encrypt].


=== Load balancing ===
For our most important use case, the "unified" certificate that covers all of the canonical domain names of the main Foundation projects and thus most of our production traffic, we maintain a pair of independently issued certificates from independent certificate authorities as a defense against renewal issues and/or realtime [[:en:Online_Certificate_Status_Protocol|OCSP]] outages by the CAs.  One of the pair is from our standard LE / [[Acme-chief]] automation, and the other is a commercial certificate issued by Digicert.  We deploy the LE cert at our US edges and the [https://digicert.com Digicert] cert at our non-US edges, so that both see constant live use and are known-good options in case emergency operations require us to switch all edge sites to just one of the two.


We use LVS-DR for load balancing. This means the LVS server (aka the director) will direct incoming traffic for the services to a number of realservers. Each realserver binds the service IP address to the <code>lo</code> device. The realserver answers directly to the client, bypassing the director.
For the manual, commercial renewals such as the Digicert certificate above, it's important that a new certificate is aged by a few days (ideally as much as a week) before deployment to avoid rejection by clients with bad clocks as referenced in [[:phab:T196248]].


The fact that the realserver binds the IP address to lo is problematic for a couple of reasons:
== For the Foundation's canonical domain names ==


# Since we are simply doing SSL termination, we want to decrypt the connection, and proxy it to the port 80 service. The port 80 service has the same IP. Since the IP is bound to lo, it would end up sending the backend requests back to itself.
While the Foundation may own many other domains for trademark, legal, or project/redirect reasons, there is only one small set which are considered to be the canonical set for our actual projects and content, which are subjected to higher standards.
# pybal conducts health checks on the realserver to ensure it is alive and can properly serve traffic. Since we are using IP based virtual hosts, the health checks would need to check the service IP, and not the realserver IP. This isn't possible from the LVS server.


To bypass problem #1 we have changed text.wikimedia.org to text.svc.<datacenter>.wmnet (a private routable address), which is used as the backend. We took the same approach for bits.wikimedia.org and upload.wikimedia.org: each is assigned a private routable address (bits.svc.<datacenter>.wmnet and upload.svc.<datacenter>.wmnet) that serves as its backend.
The current list of canonical domains is:
* wikipedia.org
* wikimedia.org
* wiktionary.org
* wikiquote.org
* wikifunctions.org
* wikibooks.org
* wikisource.org
* wikinews.org
* wikiversity.org
* wikidata.org
* wikivoyage.org
* wikimediafoundation.org
* mediawiki.org
* wmfusercontent.org
* w.wiki


To bypass problem #2 we disable normal content health checks but keep the idle connection health check. To re-enable the content health checks, we use the SSH health check and have it make requests to the service address directly on the host.
In addition to the basic per-service standards above, for all services hosted within these domains, the domains themselves must comply with additional policy at the domain level:
* Must be registered to the Wikimedia Foundation, and must be delegated by the registrar directly to the Foundation's name servers (currently <code>ns0.wikimedia.org</code>, <code>ns1.wikimedia.org</code>, and <code>ns2.wikimedia.org</code>).
* Must have valid [[:en:DNS_Certification_Authority_Authorization|CAA]] records denoting one or more legitimate certificate vendors designated by the Operations team.
* Must be submitted to (and eventually successfully included in) the STS preload list maintained by the Chromium project at https://hstspreload.org/ .


=== Certificates ===
== Related information ==
See: https://office.wikimedia.org/wiki/SSL_Certificates


=== SSL termination ===
* [[HTTPS/Browser Recommendations]] - Browser security recommendations aimed at end-users.
* [[HTTPS/Domains]] - Some tracking/auditing on minor non-standard sites that are in-scope.
* [https://grafana.wikimedia.org/dashboard/db/tls-ciphersuite-explorer TLS cipher suite stats dashboard]
* [https://diff.wikimedia.org/2015/06/12/securing-wikimedia-sites-with-https/ Blog announcement of our switch to HTTPS-only back in mid-2015]
* {{phabricator|T104681}} - Phabricator task tracking the long tail of securing minor sites in the wake of the switch above for major projects.
* [[HTTPS/Archived-Pre-2015|HTTPS/Archived-Pre-2015]] - Old outdated information from this page, mostly predating the above.
* The [[X-Analytics]] header contains a "https" field.
* [[Cergen#Cheatsheet|Creating TLS certificates with cergen (for envoyproxy et al)]]
* [[User:Giuseppe_Lavagetto/Add_Tls_On_Kubernetes|Adding TLS on Kubernetes]]
* [[User:Jbond/Encryption]]
* [[mw:Manual:HTTPS]]


To perform SSL termination we are using a cluster of nginx servers. The nginx servers answer requests on IP based virtual hosts and proxy the requests directly to the backends unencrypted. Headers are set for the requested host, the client's real IP, forwarded-for information, and forwarded-protocol information.
[[Category:TLS]]
 
__NOTOC__
SSL termination servers in [[esams]] talk to services in esams, and failover to services in pmtpa. When [[eqiad]] is brought online, it will behave in the same way.  SSL termination servers in pmtpa talk to services only in [[pmtpa]].
 
For testing puppet changes to the SSL terminators, see [[https/testing]].
 
=== Logging ===
 
Logging generally occurs at the Squid level. When using SSL termination, however, the IP addresses that the Squids see are those of the SSL terminators, not the clients. It's possible to use the X-Forwarded-For header, but we can only trust this header if the request is coming from the SSL terminators (as they strip and set that header). This is painful in Squid.
 
Normally this wouldn't be terribly problematic: you'd just write the logs in Squid format on nginx and combine them. We, however, don't use log files; Squid sends the logs as UDP packets to a log collector. To address this we modified a UDP syslog logging module for nginx to send logs in our format, without the extra syslog information, to servers and ports of our choosing.
 
=== geoiplookup support ===
 
Our bits cluster has support for providing geographical JSON data based on the client's IP address. Like logging, since the bits cluster is behind the SSL terminators, it sees the IP address of the SSL terminators, not the client, which causes the bits cluster to send back the geographical information of the SSL terminators (which isn't terribly useful).
 
To solve this we modified the geoip inline C in the varnish VCL to use X-Forwarded-For if the client IP is one of the SSL terminators.
 
=== Secure cookies ===
 
Since we are doing SSL termination, MediaWiki receives requests over HTTP and so does not see incoming traffic as HTTPS. This is problematic when sending cookies. When users log in using HTTPS, we need to protect their cookies in case an attacker forces them to HTTP, they accidentally visit an HTTP link to our sites, or mixed content causes unencrypted requests to travel to our sites.
 
To solve this we used the X-Forwarded-Proto header. If the header is set, and is HTTPS, we mark cookies as secure. Like X-Forwarded-For in geoiplookup, we only trust this header if it is coming from the SSL terminators. In Squid and Varnish we strip this header if the request is not being sent by the SSL terminators.
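
The trust rule described above (and the analogous X-Forwarded-For rule for geoiplookup) can be illustrated with a minimal Python sketch. It is not the actual MediaWiki, Squid, or Varnish code; the function names and the terminator addresses (taken from documentation ranges) are hypothetical.

```python
from ipaddress import ip_address
from typing import Iterable, Mapping

# Hypothetical SSL terminator addresses (RFC 5737 documentation range).
TERMINATORS = ["192.0.2.10", "192.0.2.11"]


def _from_terminator(peer_ip: str, terminators: Iterable[str]) -> bool:
    """True if the TCP peer is one of the known SSL terminators."""
    trusted = {ip_address(t) for t in terminators}
    return ip_address(peer_ip) in trusted


def is_https_request(peer_ip: str, headers: Mapping[str, str],
                     terminators: Iterable[str]) -> bool:
    """Honour X-Forwarded-Proto only when a terminator set it.

    Header names are looked up exactly as spelled here; a real server
    would match them case-insensitively.
    """
    if not _from_terminator(peer_ip, terminators):
        return False
    return headers.get("X-Forwarded-Proto", "").lower() == "https"


def effective_client_ip(peer_ip: str, headers: Mapping[str, str],
                        terminators: Iterable[str]) -> str:
    """Use X-Forwarded-For only when the peer is a terminator."""
    forwarded = headers.get("X-Forwarded-For", "").strip()
    if forwarded and _from_terminator(peer_ip, terminators):
        # The terminators strip and set this header, so the last entry
        # is the one they wrote.
        return forwarded.split(",")[-1].strip()
    return peer_ip
```

A request is treated as HTTPS (and its cookies marked secure) only if <code>is_https_request</code> returns true; a header sent by any other peer is ignored.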
 
=== Protocol-relative URLs ===
 
To make HTTP and HTTPS coexist happily, we must use protocol-relative URLs like <code>//en.wikipedia.org/wiki/Main_Page</code> whenever we link off-domain to one of our sites (images, interwiki links, etc.). This also ensures that we don't split our Squid and Varnish caches by caching pages with HTTPS and HTTP links.  Of course, this also means that our parser, Squid, and Varnish caches need to be fully purged to properly enable HTTPS.
 
Enabling the use of protocol-relative URLs required many changes to MediaWiki core as well as configuration. See the [[Server Admin Log]] and the commit log for these changes.
 
=== Failover ===
 
Ignoring our normal GeoDNS-based datacenter failover, the SSL termination cluster needs to fail over from the caching datacenter's backends to the backends in the primary datacenter. The difficulty here is that traffic between esams and pmtpa must travel over the WAN, which means we can't terminate SSL in esams and send that traffic on unencrypted.
 
To address this, for the caching datacenter we configured nginx with two backends. One backend is in-datacenter, and is http, and the other backend is out-of-datacenter, and is https. Two location directives are used. The first directive is for /, which proxies in-datacenter, and if that proxy_pass fails, it falls back to an @fallback directive, which proxies to the out-of-datacenter backend, using "error_page 502 503 504 = @fallback".
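
The fallback decision described above can be modelled in a few lines of Python. This is only an illustration of the logic behind "error_page 502 503 504 = @fallback", not nginx configuration; the two callables stand in for the in-datacenter (http) and out-of-datacenter (https) backends and are hypothetical.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (status code, body)

# Statuses that trigger the fallback, as in the error_page directive.
FALLBACK_STATUSES = {502, 503, 504}


def proxy_with_fallback(local: Callable[[], Response],
                        remote: Callable[[], Response]) -> Response:
    """Try the in-datacenter backend, then the remote one on 502/503/504."""
    status, body = local()
    if status in FALLBACK_STATUSES:
        # The local backend is down or failing: retry out-of-datacenter.
        return remote()
    # Any other status, including 4xx errors, is passed through unchanged.
    return status, body
```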
 
Of course, there's a possible issue here. If the sh LVS scheduler hashes all three of the caching datacenter's SSL terminator IPs to the same SSL terminator server in the primary datacenter, it will likely overload that server. In this situation we're likely to just fail over to the primary datacenter anyway, though.
 
== Performance settings ==
 
* HTTP keepalive: 65 seconds, 100 requests
** Lowering requests likely a good idea
* SSL cache: shared, 50m (roughly 200,000 sessions)
** should use roughly 1.1GB RAM for all open sessions
* SSL timeout: default (5 minutes)
* Limit ssl_ciphers: RC4-SHA:RC4-MD5:DES-CBC3-SHA:AES128-SHA:AES256-SHA
** Set server preference to avoid BEAST
* Used a chained certificate
* Disabled access log
* Worker connections set to 32768
* Worker processes set to number of cores
* esams servers set to hit esams squids, then pmtpa SSL terminators if esams squids are down or failing
* Proxy buffering is disabled to avoid responses eating all memory
* sh LVS scheduler used to allow session reuse, and to ensure session cache is maximized
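
The "roughly 200,000 sessions" figure for the shared SSL cache can be reproduced with a short calculation. The sketch below assumes the approximation given in the nginx documentation for ssl_session_cache, that one megabyte holds about 4,000 sessions; the function name is illustrative.

```python
# Approximate figure from the nginx ssl_session_cache documentation.
SESSIONS_PER_MB = 4000


def approx_cached_sessions(cache_mb: int) -> int:
    """Estimate how many TLS sessions fit in a shared cache of cache_mb MB."""
    return cache_mb * SESSIONS_PER_MB


print(approx_cached_sessions(50))
```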
 
=== Initial connection testing ===
 
Using [http://httpd.apache.org/docs/2.0/programs/ab.html ab] we were able to get an average of 5,100 requests per second on a single processor, quad core server, with 4GB RAM. We used the following command, which was run three times concurrently:
 
ab -c2000 -n100000 \
-H 'Host: upload.wikimedia.org' \
-H 'User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)' \
https://wikimedia-lb.wikimedia.org/pybaltestfile.txt
 
This test has 6,000 concurrent clients, making 300,000 requests. Since we are testing the number of requests per second based on initial connections for each request, we select a small and static resource for the request to ensure speed isn't heavily affected by the backend. We pull from the backend to ensure that we are opening connections both for the client and for the backend, and to ensure that any backend related issues will also be reflected.
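
The totals quoted for this and the following tests come from multiplying the ab flags by the three concurrent runs. A small Python sketch of that arithmetic (the helper names are illustrative):

```python
from typing import Tuple


def ab_totals(concurrency: int, requests: int, parallel_runs: int) -> Tuple[int, int]:
    """Return (total concurrent clients, total requests) across parallel ab runs."""
    return concurrency * parallel_runs, requests * parallel_runs


def seconds_at_rate(total_requests: int, requests_per_second: float) -> float:
    """Run time if the average rate held for the whole test."""
    return total_requests / requests_per_second


# -c2000 -n100000, run three times concurrently
clients, total = ab_totals(2000, 100000, 3)
print(clients, total)
```

The same function gives the 1,500 clients and 60,000 requests of the image transfer test (-c500 -n20000).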
 
The server's total CPU usage was on average 85%. Memory usage was roughly 1GB.
 
=== Image transfer with keepalive testing ===
 
Using ab, we were able to get an average of 600 requests per second. The hardware tested was the same as in the initial connection testing. We used the following command, which was run three times concurrently:
 
ab -k -c500 -n20000 \
-H 'Host: upload.wikimedia.org' \
-H 'User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)' \
https://wikimedia-lb.wikimedia.org/wikipedia/commons/thumb/f/ff/Viaduc_Saillard.jpg/691px-Viaduc_Saillard.jpg
 
This test has 1,500 concurrent clients, using keepalive, making 60,000 requests. We used keepalive since we were testing the number of thumbnail requests per second, allowing us to bypass the overhead of the initial connection. The thumbnail chosen was the size shown on an image page.
 
The server's total CPU usage was on average 25%, suggesting there is likely a bottleneck in the client when testing. Running the same test against the http backend directly had a similar number of requests per second. Memory usage was negligible.
 
=== Text transfer with keepalive testing ===
 
Using ab, we were able to get an average of 1,400 requests per second. The hardware tested was the same as in the initial connection testing. We used the following command, which was run three times concurrently:
 
ab -k -c2000 -n100000 \
-H 'Host: meta.wikimedia.org' \
-H 'User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)' \
https://wikimedia-lb.wikimedia.org/wiki/Main_Page
 
This test has 6,000 concurrent clients, using keepalive, making 300,000 requests. We used keepalive since we were testing the number of text requests per second, allowing us to bypass the overhead of the initial connection.
 
The server's total CPU usage was on average 20%, suggesting there is likely a bottleneck in the client when testing. Running the same test against the http backend directly had a similar number of requests per second. Memory usage was negligible.
 
== Security settings ==
 
* Limit protocols to exclude SSLv3 or lower (since [[gerrit:167015]]; [http://thread.gmane.org/gmane.org.wikimedia.analytics/1036 discussion], [http://article.gmane.org/gmane.org.wikimedia.analytics/1038 IE6 stats]).
* Limit ssl_ciphers to exclude less secure ones.
 
Comparing the [https://www.ssllabs.com/ssldb/analyze.html?d=https%3A%2F%2Fsecure.wikimedia.org SSL security of secure] against the [https://www.ssllabs.com/ssldb/analyze.html?d=wikimania2005.wikimedia.org SSL security of the new cluster] shows secure with a score of 52 (a C), and the new cluster with a score of 85 (an A).
 
On Thu Apr 24 18:41:49 UTC 2014 ssllabs.com assessed en.wikipedia.org with a score of 100 for Certificate, 90 for Protocol Support, 90 for Key Exchange and 90 for Cipher Strength. The overall grade was A-, downgraded because forward secrecy (PFS) is not supported and the RC4 cipher is used with TLS 1.1 or newer.
 
== Future work ==
 
[[/Future work|Future work]] - our planning guide for future changes.
 
== Testing ==
 
[[/testing|Testing]] - How to test HTTPS changes.
 
[[Category:Security]]
[[Category:Services]]

Latest revision as of 13:52, 4 November 2022

HTTPS (also called HTTP over Transport Layer Security (TLS), HTTP over SSL, and HTTP Secure) is a communications protocol for secure communication over a computer network which is widely used on the Internet. The Wikimedia family of wikis and services use HTTPS encryption to prevent eavesdropping and man-in-the-middle attacks. This page and its related sub-pages attempt to document the current best practices and standards for both server and client side protections.

Current policies and standards

Security standards are constantly updated, and Wikimedia follows these changes as well. When older standards are dropped, this is done gradually and you might see the page https://www.wikipedia.org/sec-warning giving you information about why your browser will not be supported in the future.

For all public-facing Web sites and services under Wikimedia control

These policies and standards apply to all services having hostnames within our canonical domains (see below), even for sites run by third parties on our behalf.

We currently rely on https://www.ssllabs.com/ssltest/ to audit sites for basic TLS security issues. Sites must get an A+ rating there. Failing to reach A+ on that audit can happen for a very long list of reasons detailed in: https://github.com/ssllabs/research/wiki/SSL-Server-Rating-Guide . A few more-specific issues (or issues that are not explicitly validated by an A+ rating) are listed here:

  • HTTPS enabled - ...with a minimum allowed protocol version of TLSv1.2 and supporting at least version TLSv1.2.
  • Good Certs - Certificates must validate correctly in all common browsers, must send chain certs attaching them to known roots.
  • Decent Ciphers - Must offer forward-secret AEAD ciphers (e.g. ECDHE-*-AES128-GCM), should offer only forward-secret ciphers, and should not allow non-AEAD ciphers (e.g. AES CBC modes)
  • HTTP service - If available at all, must exist solely for the purpose of redirecting to HTTPS and not serve actual content. Preferred mechanics are that all GET and HEAD requests emit a 301 redirect to the same URL over HTTPS, and all other methods emit a 403 error.
  • HSTS - All HTTPS responses must include an HSTS header with a minimum max-age value of 1 year, which includes sub-domains and allows preloading. Example: Strict-Transport-Security: max-age=31536000; includeSubDomains; preload.
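
The "HTTP service" and "HSTS" items above can be expressed as two small checks. The following Python sketch is one possible reading of those rules, not the production implementation; the function names are hypothetical.

```python
from typing import Dict, Tuple

ONE_YEAR = 31536000  # seconds; the minimum max-age required above


def plain_http_response(method: str, host: str, path: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a request that arrived over plain HTTP."""
    if method.upper() in ("GET", "HEAD"):
        # Redirect to the same URL over HTTPS.
        return 301, {"Location": f"https://{host}{path}"}
    # All other methods are refused rather than redirected.
    return 403, {}


def hsts_header_ok(value: str) -> bool:
    """Check a Strict-Transport-Security header value against the policy."""
    max_age = None
    flags = set()
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        name = name.strip().lower()
        if name == "max-age":
            arg = arg.strip().strip('"')
            if not (arg.isascii() and arg.isdigit()):
                return False
            max_age = int(arg)
        else:
            flags.add(name)
    return (
        max_age is not None
        and max_age >= ONE_YEAR
        and "includesubdomains" in flags
        and "preload" in flags
    )
```

For example, the header shown above passes the check, while the same header with max-age=86400 does not.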

Certificate Issuance and Renewal

Most of our certificates are automated via Acme-chief using certs from Let's Encrypt.

For our most important use case, the "unified" certificate that covers all of the canonical domain names of the main Foundation projects and thus most of our production traffic, we maintain a pair of independently issued certificates from independent certificate authorities as a defense against renewal issues and/or realtime OCSP outages by the CAs. One of the pair is from our standard LE / Acme-chief automation, and the other is a commercial certificate issued by Digicert. We deploy the LE cert at our US edges and the Digicert cert at our non-US edges, so that both see constant live use and are known-good options in case emergency operations require us to switch all edge sites to just one of the two.

For the manual, commercial renewals such as the Digicert certificate above, it's important that a new certificate is aged by a few days (ideally as much as a week) before deployment to avoid rejection by clients with bad clocks as referenced in phab:T196248.
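
The aging rule can be checked mechanically before a deployment. The sketch below is a minimal illustration that assumes the certificate's notBefore timestamp has already been read; the function names and the seven-day default are assumptions based on the "ideally as much as a week" guidance above.

```python
from datetime import datetime, timedelta, timezone


def earliest_deploy_time(not_before: datetime, aging_days: int = 7) -> datetime:
    """Earliest time at which a certificate counts as sufficiently aged."""
    return not_before + timedelta(days=aging_days)


def is_aged_enough(not_before: datetime, now: datetime, aging_days: int = 7) -> bool:
    """True once the certificate has aged for at least aging_days."""
    return now >= earliest_deploy_time(not_before, aging_days)


# Hypothetical certificate issued on 1 November 2022 (UTC).
issued = datetime(2022, 11, 1, tzinfo=timezone.utc)
print(earliest_deploy_time(issued).isoformat())
```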

For the Foundation's canonical domain names

While the Foundation may own many other domains for trademark, legal, or project/redirect reasons, there is only one small set which are considered to be the canonical set for our actual projects and content, which are subjected to higher standards.

The current list of canonical domains is:

  • wikipedia.org
  • wikimedia.org
  • wiktionary.org
  • wikiquote.org
  • wikifunctions.org
  • wikibooks.org
  • wikisource.org
  • wikinews.org
  • wikiversity.org
  • wikidata.org
  • wikivoyage.org
  • wikimediafoundation.org
  • mediawiki.org
  • wmfusercontent.org
  • w.wiki

In addition to the basic per-service standards above, for all services hosted within these domains, the domains themselves must comply with additional policy at the domain level:

  • Must be registered to the Wikimedia Foundation, and must be delegated by the registrar directly to the Foundation's name servers (currently ns0.wikimedia.org, ns1.wikimedia.org, and ns2.wikimedia.org).
  • Must have valid CAA records denoting one or more legitimate certificate vendors designated by the Operations team.
  • Must be submitted to (and eventually successfully included in) the STS preload list maintained by the Chromium project at https://hstspreload.org/ .
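
The CAA requirement can be audited by comparing a domain's CAA records with the list of designated vendors. The Python sketch below works on records already fetched as text in DNS presentation format (it performs no DNS lookup); the function names and the allowed-issuer set are illustrative assumptions, not the Operations team's actual list.

```python
from typing import Iterable, Set

# Illustrative allow-list only; the real list is set by the Operations team.
ALLOWED_ISSUERS = {"letsencrypt.org", "digicert.com"}


def caa_issuers(records: Iterable[str]) -> Set[str]:
    """Collect issuer domains from 'issue' records such as: 0 issue "ca.example"."""
    issuers = set()
    for record in records:
        _flags, tag, value = record.split(None, 2)
        if tag.lower() != "issue":
            continue  # ignore issuewild, iodef and other tags in this sketch
        # Drop quotes and any parameters that follow a semicolon.
        domain = value.strip().strip('"').split(";", 1)[0].strip()
        if domain:
            issuers.add(domain.lower())
    return issuers


def caa_policy_ok(records: Iterable[str], allowed: Iterable[str]) -> bool:
    """True if at least one issuer is named and every issuer is allowed."""
    issuers = caa_issuers(records)
    return bool(issuers) and issuers <= set(allowed)
```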

Related information