
Edge uniques

From Wikitech

The Wikimedia Foundation will implement a first-party cookie named WMF-Uniq to enable A/B testing and add protections against distributed denial of service (DDoS) attacks while preserving user privacy. This solution will process identifiers only at our private CDN edge servers and will not create profiles of individual readers' browsing histories or patterns that could be linked to a specific person over time.

We will use this cookie to help us better understand how many people visit the Wikimedia sites, to make our wikis work better for our readers and communities through A/B testing, and to improve our ability to protect against denial of service attacks by helping staff identify legitimate human traffic. The Wikimedia Foundation will implement this cookie in a privacy-preserving way, designed to collect the minimal amount of data possible and to ensure all uses of the cookie meet or exceed the privacy requirements outlined in our privacy policy and cookie use guidelines. Visitors to Wikimedia websites will be able to clear the cookie at any time.

This design document precedes implementation and provides insights into our technical design process. For more information about this project, see the Edge Uniques page on Meta Wiki.

Problem Statements

A/B Testing: For years, the Foundation has needed a way to conduct accurate A/B tests of user experience (UX) changes with our reader populations in order to better assess the user impact of proposed changes. While we have attempted some limited, manually-defined tests in the past, those attempts were often awkward and had significant flaws in the way they enumerated, identified, and/or grouped users for testing purposes. These flaws reduce the accuracy of test group cardinality and produce inconsistent user experiences:

  • Attempting to use IP addresses as identifiers causes some users (e.g., mobile) to fall in and out of test groups randomly as their address shifts rapidly, causing a disjointed and inconsistent user experience.
  • Using IP addresses as identifiers also causes NAT and even CGNAT groups (whole homes, offices, campuses, mobile networks) to be treated as one individual, making it nearly impossible to specify or even know what percentage of total agents are included in a group.
  • JavaScript (JS)-based random identifiers have limited lifetimes relative to test periods, high potential for privacy issues, and, most importantly, cannot easily be used to vary the delivery of our edge-cached content to differentiate content outputs for different test groups of anonymous readers; they cannot do so at all on the initial request from a fresh user agent. They also introduce bias, as they rely on JS being enabled in the user's browser.

DDoS Attacks: Over the past several years, we have seen rising DDoS attacks on our infrastructure that harm our wikis' reliability and uptime. One of the key challenges in fending off these attacks is finding better ways to reliably differentiate our actual normal reader traffic from high-volume automated traffic, which seeks to emulate the behavior of legitimate readers. The harder it is to tell the two groups apart, the harder it is to preserve access for the real readers while deploying blunt weapons to thwart the unwelcome requests at our network edge. Similarly to the A/B test scenarios above, when we try to limit per-agent resource consumption by IP address, we run into all the same issues: a limit that seems appropriate on a per-agent basis punishes a CGNAT network's users, while a CGNAT-level limit applied to all addresses leaves us relatively unprotected.

Unique Device Metrics: Unique device metrics are one of our more important organizational metrics. These tell us how many anonymous readers we have in the various projects and their language editions over time and how those populations vary. We employ heuristic methods to count them today based on our analytics dataset, but we could potentially get more accurate counts from a scheme that actually enumerates the legitimate user agents who visit our sites.

A central theme running through all of these problem areas points to a possible shared solution in the form of a trustworthy, unique client identifier that carries some minimal historical metadata and is retained persistently. Discussions in the past about implementing an industry-standard tracking cookie mechanism to solve some of these problems have been stymied by reasonable privacy concerns. This proposal aims to strike a reasonable set of tradeoffs: a cookie-based mechanism that poses the least possible danger to user privacy while also enabling solutions to the above and other related problems. It has a unique identifier, but we do not store or track that identifier in our server-side systems; it is stored only in users' own browsers.

Overview

The basic idea of Edge Uniques is to have WMF's private CDN edge network implement a mechanism that can reliably differentiate unique user agents (e.g., browsers, phones, bots with persistent storage) that repeatedly visit any WMF site hosted by our CDN over the long term, using standard cookie technology. Some important clarifications:

  • Agents do not have a one-to-one mapping to users. Multiple users may share an agent (e.g., a family PC with a single login), and one user may have multiple agents (e.g., their phone and laptop browsers), but the ratio in either direction tends to be small.
  • Complying with this mechanism (returning the cookies we attempt to set) will not be required for basic service, but refusal could sometimes result in a degraded experience when necessary (e.g., the absence of the cookie could result in rate limiting or an inconsistent UX when an experiment is running).
  • This mechanism uses first-party, secure, HTTP-only cookies that cannot be accessed by JavaScript in the browser.
  • This mechanism also does not cross 2LD boundaries; wikipedia.org gets different uniques than wikiversity.org, but all language editions and mobile (.m.) variants of wikipedia.org do share uniques.
  • This mechanism will be tamper-proof against many important kinds of tampering and at least tamper-resistant against others. Users will not be able to arbitrarily set their own values in order to confuse us.

Privacy Tradeoff Concerns

Adding a new unique identifier as a cookie on all anonymous reader traffic obviously raises legitimate privacy concerns. This is the tradeoff we have to weigh against the utility the cookie offers. If this identifier were mishandled, leaked, or generally passed around all of our infrastructure and stored in arbitrary data stores, it could compromise the relative privacy of our anonymous readers, especially if we do not trust our future selves to handle that data carefully.

In the interest of mitigating these concerns as best we can, we’re committing ourselves to certain important implementation limitations and practices:

  • These identifying cookies will be generated and directly consumed only by our CDN edge servers. These are the same servers responsible for the initial TLS termination of all user traffic and the caching of public content around the world.  By existing design and policy, these servers never persistently store any PII locally.
  • These servers discard the raw unique identifier as soon as they are done making use of it locally. They do not store or record its raw value locally and do not forward its raw value to any other infrastructure or services, not even our own analytics servers.
  • However, when functionally necessary, our edge servers will create derivative, irreversible, temporary one-way hashes of the long-term unique identifier, which may be forwarded to other systems. These would generally be for small subsets of traffic, and/or have short lifetimes, and/or have very few bits of actual entropy. Examples:
    • When a controlled experiment is being performed, only for those clients who are directly involved in a small random test group,  only for the duration of the test, and only for those metrics that were pre-identified as test measurements, we may derive Hash(unique∥testid) and send it along with those metric events when forwarding them to our analytics clusters for analysis. These can’t be converted back to the raw ID values or correlated with similar records from unrelated tests.
    • For unique device metrics purposes, we may derive irreversible reductions which are used in counting algorithms. For example, for HyperLogLog we can calculate Hash(unique∥datestamp) and then only send a handful of the top bits (e.g., 16-24-ish) from that hash combined with the count of leading binary zeros from the bottom 64 bits, as that’s all the algorithm requires to estimate counts with the necessary accuracy.
    • A similar datestamp-hashed construct with few actual bits of entropy would potentially be useful in proving that multiple “users” logged in from a single naive agent over a relatively short period of time for fraud/sockpuppet investigations.
  • Given these rules and practices, the full data set of all the raw unique identifiers we’ve handed out to actual user agents is just a mathematical concept. It does not exist as an actual collection of data that can be queried or inferred or correlated from any combination of our storage systems.
  • Both the design and implementation of the cookies and their handling will be transparent, published as open source code, as is standard practice at the Foundation and consistent with our guiding principles.
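As a rough sketch of the derivative hashing described above (the function names are invented for illustration, and a plain unkeyed Blake2b stands in for whatever keyed construction the edge servers actually use):

```python
import hashlib

def derive_test_id(unique: bytes, test_id: str) -> str:
    """Irreversible per-test identifier: Hash(unique ∥ testid).
    It cannot be inverted back to the raw id, and the same client
    yields unrelated values for unrelated tests."""
    return hashlib.blake2b(unique + test_id.encode(), digest_size=16).hexdigest()

def derive_hll_observation(unique: bytes, datestamp: str, index_bits: int = 16):
    """HyperLogLog-style reduction of Hash(unique ∥ datestamp): a register
    index from the top bits, plus the leading-zero rank of the bottom 64
    bits, which is all the counting algorithm needs."""
    digest = hashlib.blake2b(unique + datestamp.encode(), digest_size=16).digest()
    h = int.from_bytes(digest, "big")
    index = h >> (128 - index_bits)       # top bits select a register
    low64 = h & ((1 << 64) - 1)           # bottom 64 bits of the hash
    rank = 64 - low64.bit_length() + 1    # count of leading zeros, plus one
    return index, rank
```

Only the small register index and the leading-zero rank would ever leave the edge; neither can be correlated back to the raw unique identifier.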

High Level Design

We will set a cookie value on all clients of our CDN network via Set-Cookie with a fixed expiry of 1 year and the Secure and HttpOnly flags set (only transmitted over TLS, no JS access), and expect that "normal" conforming/default UAs will return this cookie to our CDN on future requests. We will refresh cookies that are returned to us roughly once per week to keep them fresh and to update some minimal metadata inside them. From the agent's point of view, this looks a lot like a first-party persistent login/session cookie of some sort (but it is not). Agents who never return these cookies will be issued fresh cookies on every request, which they are free to ignore if they wish.

This cookie value has several important properties:

  • It contains a random, unique id we've assigned to this client agent, which persists indefinitely on the client side for common repetitive agents, assuming they don’t disable/clear cookies and don’t go a long time without visiting.
  • It contains, to one-day resolution, the date the id was initially created.
  • It records the count of distinct weeks since creation during which the cookie has been sent back to our servers and refreshed with new metadata.
  • We can be reasonably cryptographically certain that the value was generated by our infrastructure (none of it can be faked). This is accomplished by using a server-side secret key to compute a Message Authentication Code (MAC) over all the important metadata encoded in the cookie, including the MAC in the cookie, and verifying it upon receipt.
  • The implementation must be performant and scalable . We expect that some attack scenarios could cause a need to cryptographically generate and/or validate very large volumes of these cookies at the outermost layers of our CDN network in a scalable way. Therefore the algorithms must be efficient relative to the costs of normal traffic handling (over TLS), and must be horizontally scalable and require no additional requests to other servers, databases, or resources within our infrastructure (in other words: local and stateless on CDN machines).

Runtime Flow

  • When our CDN network receives any request, very early in its processing, it checks whether the client has sent a cookie previously set by this mechanism.
  • If such a cookie exists, the CDN will cryptographically validate the value based on the MAC. If it does not validate, it is not used.
  • If the cookie does not exist (or was invalid above), a new cookie value will be generated for this request. From this point forward, all requests have a cookie value (either legitimate from the client, or a fresh one we’ve invented and not yet told the client about).
  • The CDN splits out the multiple metadata values for use directly in CDN processing for various use-cases (unique id, creation date, rotation count, optional flags), which may be used for further additional processing and/or calculating derivatives for forwarding.
  • The cookie value may be (re-)sent to the client via Set-Cookie in a number of scenarios:
    • If it was newly-generated
    • At most once per week to refresh the week-counting metadata.
    • Possibly, at random with low probability for an existing cookie value, just to maintain freshness for expiry purposes (although the above may be sufficient to cover this).
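The flow above can be sketched as a simplified model (MAC validation, transport, and the low-probability random refresh are omitted; the field names follow the Low Level Design section below):

```python
import os

def mint(now_day: int) -> dict:
    """Fresh cookie fields: new random unique id U, creation day D,
    last-signed week W = 0, distinct-week counter C = 0."""
    return {"U": os.urandom(16), "D": now_day, "W": 0, "C": 0}

def on_request(fields, now_day: int):
    """Per-request decision sketch: reuse a valid cookie, mint a fresh one
    otherwise, and re-send Set-Cookie at most once per distinct week."""
    if fields is None:                    # missing, or failed MAC validation
        return mint(now_day), True        # newly generated: always Set-Cookie
    f = dict(fields)
    week = (now_day - f["D"]) // 7        # seven-day periods since creation
    if week > f["W"]:                     # first request seen in a new week
        f["W"] = week
        f["C"] = min(f["C"] + 1, 255)     # 8-bit saturating counter
        return f, True                    # refresh metadata via Set-Cookie
    return f, False                       # valid and current: nothing to send
```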

Server-side key rotation

  • This scheme requires secret cryptographic keys on our servers which are rotated on a regular, automated basis.
  • The main key will rotate roughly on an annual basis. This key is generated from a high-quality source of randomness in our infrastructure and deployed manually by the SRE team through normal configuration management mechanisms. The main key could also be arbitrarily rotated by human decision, in case of e.g., suspected compromise of the current key by an attack on our infrastructure.
  • Each main key has an associated 16-bit unique key tag value, which is recorded in the cookie’s metadata.
  • The current key is used to sign the cookie’s data with a MAC.
  • As with most such rotation schemes, prior and near-future keys will be available for validation, which will allow for imperfect time synchronization and asynchronous deployment of new keys across the infrastructure without disrupting cookie issuance and validation.
  • In the rare event that, after a rotation due to compromise risk, we discover evidence that the prior key material was actually stolen, we may have to disruptively remove the old key from our configuration earlier than planned. This will effectively instantly revoke the validity of many stale cookies that may exist in user agents’ storage and cause them to acquire fresh cookies on their next request, which will disrupt any active A/B tests or metrics collection activities.  Staff will have to handle those disruptions manually, including restarting ongoing tests, flagging windows of metrics as invalid and/or inaccurate, and averaging past them, etc.
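A minimal sketch of key-tag lookup during validation (the key ring contents, tag values, and function names here are assumptions for illustration):

```python
import hashlib
import hmac
import os

# Hypothetical key ring: 16-bit key tag -> secret key. The current key signs
# new cookies; prior (and near-future) keys stay available for validation.
KEYS = {1: os.urandom(32), 2: os.urandom(32)}
CURRENT_TAG = 2

def mac(msg: bytes, key: bytes) -> bytes:
    return hashlib.blake2b(msg, digest_size=16, key=key).digest()

def sign(msg: bytes):
    """Sign with the current key; the tag travels in the cookie's K field."""
    return CURRENT_TAG, mac(msg, KEYS[CURRENT_TAG])

def validate(msg: bytes, tag: int, m: bytes) -> bool:
    """Look up the key by the cookie's tag. Cookies signed with a key that
    has been removed from the ring (e.g., after a suspected compromise)
    simply fail here, and the client is issued a fresh cookie."""
    key = KEYS.get(tag)
    return key is not None and hmac.compare_digest(mac(msg, key), m)
```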

Schema for defining A/B tests

  • The current JSON schema exists in the wmfuniq repo at: https://gitlab.wikimedia.org/repos/sre/libvmod-wmfuniq/-/tree/main/schema?ref_type=heads
  • This allows describing A/B tests via:
    • The unique name of the experiment
    • Start/end timestamp
    • Applicable domains in the experiment (e.g., “en.wikipedia.org”, etc)
    • Within each domain: named groups (e.g., “controlA”, “testgroupB”, etc)
    • The bucket-mapping of groups (100K buckets, 0.001% granularity).
    • Additional hash selector inputs (for either sharing the ID derivation and bucketing among a set of related experiments and/or splitting the natural 2LD sharing within a project for distinct per-language derivation and bucketing).
    • Attributes about whether and to which degree cache splitting happens (some experiments may be JS-only, or only need to vary a subset of the URI space).
  • This is the logical interface between whatever system defines and controls our experiments and the edge uniques mechanism in the CDN. We will have to develop code or scripts that translate test definitions into the language of this JSON specification and then operationally deploy updated JSON files to the CDN edge nodes as configuration changes (which they can reload on the fly).
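As a purely hypothetical illustration of the shape such a definition might take (the field names below are invented for this sketch and are not taken from the actual schema in the repo linked above):

```json
{
  "name": "sticky-header-2025",
  "start": "2025-03-01T00:00:00Z",
  "end": "2025-03-15T00:00:00Z",
  "domains": {
    "en.wikipedia.org": {
      "groups": {
        "control": { "buckets": [0, 999] },
        "treatment": { "buckets": [1000, 1999] }
      }
    }
  },
  "hash_extra": "sticky-header-family",
  "cache_split": { "enabled": true, "uri_prefix": "/wiki/" }
}
```

With 100K total buckets, a 1000-bucket group like each of the above would enroll 1% of agents.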

Low Level Design

Below are all of the data fields used to construct a cookie value. Each is given a 1-letter abbreviation:

  • U (128 bits) is the unique identifier, randomly generated by a CSPRNG . Does not change when refreshing an existing valid cookie.
  • D (16 bits) is the date stamp (one-day resolution) of U creation in UTC. Day zero is Jan 1 2024, range goes a bit past Y2200. As with U, this is unchanging over the long term.
  • W (16 bits) is the week-number during which our servers last signed this cookie to update its metadata, counting the first week as the seven-day period beginning at D.
  • C (8 bits) Saturating counter of the distinct weeks since D during which the client has returned this identifier to our servers and had it re-signed to update the W number.
  • Z (8 bits) Reserved for future use.
  • K (16 bits) A short unique Key Tag identifier that tells the server which server-side key was last used to sign this cookie.
  • S (64 bits) Salt - randomly generated any time we sign a cookie, from the same CSPRNG as U. This adds an important source of additional entropy when re-signing the same cookie metadata (e.g., U,D) on a regular basis over time, which helps mitigate potential future attacks that might otherwise recover bits of our signing keys.
  • M (128 bits) MAC - Output of Blake2b128 (Key=Keys[K], Salt=S, Pers="WMF-Unique Cookies", Msg=U∥D∥W∥C∥Z∥K). Cryptographically validates that our infrastructure (which knows the secrets Keys[K]) generated this cookie value and that none of the values have been tampered with.  Note Blake2b is of cryptographic quality, does not exhibit length-extension problems, and is being fed fixed-length input.

Given the above field definitions:

  • Total bit length of raw component fields above is 384 bits (48 bytes).
  • The actual cookie value is the base64url encoding of the concatenation of all of these fields in order, resulting in a 64-character base64url string.
  • For uniqueness from an analytics or security perspective, U∥D should be considered the client's full unique id. This avoids any extreme edge case where a duplicate U might be generated on a different D.
  • Note that the concatenation U∥D (144 raw bits, 18 bytes) ends on an even base64url boundary: it encodes to exactly the first 24 characters of the 64-character cookie value. This is a nice property, as we can chop this prefix out of the string and use it without decoding the rest, where that may matter operationally for e.g., ratelimiter keys.
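Putting the field layout together, here is a sketch of encoding and validation (the key ring and helper names are assumptions; note that Python's hashlib caps the Blake2b `person` parameter at 16 bytes, so the 18-byte personalization string from the design is truncated to a 16-byte stand-in here):

```python
import base64
import hashlib
import hmac
import os
import struct

KEY_TAG = 1
KEYS = {KEY_TAG: os.urandom(32)}    # stand-in for the rotated server-side keys
PERSON = b"WMF-Uniq Cookies"        # 16-byte stand-in for "WMF-Unique Cookies"

def _mac(msg: bytes, salt: bytes, key: bytes) -> bytes:
    # hashlib zero-pads salt (up to 16 bytes) and person internally
    return hashlib.blake2b(msg, digest_size=16, key=key,
                           salt=salt, person=PERSON).digest()

def encode_cookie(u: bytes, d: int, w: int, c: int, z: int = 0) -> str:
    """Pack U∥D∥W∥C∥Z∥K (24 bytes), sign with a fresh 8-byte salt S,
    append S and the 16-byte MAC M, and base64url-encode the 48 bytes."""
    msg = u + struct.pack(">HHBBH", d, w, c, z, KEY_TAG)
    s = os.urandom(8)
    return base64.urlsafe_b64encode(msg + s + _mac(msg, s, KEYS[KEY_TAG])).decode()

def decode_cookie(cookie: str):
    """Return the unpacked fields if the MAC validates, else None."""
    raw = base64.urlsafe_b64decode(cookie)
    if len(raw) != 48:
        return None
    msg, s, m = raw[:24], raw[24:32], raw[32:]
    d, w, c, z, k = struct.unpack(">HHBBH", msg[16:])
    key = KEYS.get(k)
    if key is None or not hmac.compare_digest(_mac(msg, s, key), m):
        return None
    return {"U": msg[:16], "D": d, "W": w, "C": c}
```

The 48 raw bytes encode to exactly 64 base64url characters with no padding, matching the sizes given above.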

The implementation code

The code implementation of this concept as a Varnish VMOD can be found in our GitLab at: gitlab:repos/sre/libvmod-wmfuniq

Implementation environment details

  • The CDN network’s Varnish layer can most easily and efficiently implement this. Any traffic already rejected at the earlier haproxy TLS termination layer from e.g., stick-tables doesn’t need to hit this mechanism. Potentially, in the future, we could look at whether the TLS terminator might want to participate in validation for the purposes of rate/concurrency-limiting, though.
  • Our Varnish config already links libsodium (which we used for the Differential Privacy implementation), which has very efficient and trustworthy implementations of all the primitives this scheme needs for random number generation and the Blake2 series of hash functions.
  • The actual low-level code implementation for the cookie handling will be a Varnish VMOD linking libsodium.

Initial deployment considerations

  • We probably want to turn on the Set-Cookie code progressively over a reasonable period of time, e.g., the first month or so that we’re able to do so.
  • This allows for some reality checks on the induced load with live production traffic.
  • More importantly: it avoids having virtually all of our existing active users acquire nearly identical initial creation date stamps.
  • Our cache configuration will generally ignore these new cookies for cache-variance purposes unless the user is actively enrolled in an experimental test group that requires cache variance. No such tests would be running during the initial deployment window.
  • While we could dream up many complex schemes for progressive rollout based on request statistics, the simplest mechanism would probably be to enable the code on a per-CDN-server basis at each site and rely on our client IP hashing to do the job. This would allow us to roll out to ~12.5% of the users of any given CDN site per deployment.
  • We would logically start with just one CDN server, then roll out to one per site, and progressively turn up the rest of the servers at all the sites.
  • Metrics start dates wouldn’t be valid until quite some time after the initial rollout is complete. If we spend a month turning it up, perhaps wait at least a month past that to declare the start time of any initial window of valid metrics for “real” consumption/use. This gives time for any early operational fixups, etc.

See also