Event Platform/Stream Configuration
Stream configuration refers to configuration that distributed producers or consumers of a stream might want, e.g. the sampling rate or the schema title of the events that are allowed in the stream. Stream configuration was originally a requested feature of Event Platform for Product engineers, so they could more easily vary some event stream producer setting without having to do code deploys. It has since become a critical part of Event Platform, used by multiple services.
Declaring streams
To produce events to WMF's Event Platform, you must declare the stream in Stream Configuration in mediawiki-config/wmf-config/ext-EventStreamConfig.php. A minimal stream config entry looks like:
// Declare a stream named "domain_namespace.my_event_stream_name":
'domain_namespace.my_event_stream_name' => [
    // schema_title must match the title of the JSONSchema.
    'schema_title' => 'domain_namespace/my/event/schema_uri',
],
This declares a stream named "domain_namespace.my_event_stream_name", which only accepts events whose JSONSchema title matches "domain_namespace/my/event/schema_uri".
Each key in the stream's config object is a stream config setting. Some top-level stream settings are handled by the platform (e.g. schema_title), while others are settings that producer and consumer clients should respect. A stream's final settings are its declared settings merged with wgEventStreamsDefaultSettings.
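The merge with defaults can be pictured with a small sketch. This is not the extension's actual implementation: the default values, the `example_sampling_rate` key, and the use of `array_replace_recursive` (declared values override defaults key by key) are all assumptions made for illustration.

```php
<?php
// Hypothetical defaults standing in for wgEventStreamsDefaultSettings.
$defaultSettings = [
    'canary_events_enabled' => true,
    'example_sampling_rate' => 1.0, // hypothetical setting name
];

// Settings declared for one stream.
$declaredSettings = [
    'schema_title' => 'domain_namespace/my/event/schema_uri',
    'canary_events_enabled' => false,
];

// Assumption: declared settings win over defaults, and any default
// that the stream does not declare is kept as-is.
$finalSettings = array_replace_recursive($defaultSettings, $declaredSettings);
```

Under these assumptions, `$finalSettings` keeps the declared `schema_title` and `canary_events_enabled` values and inherits `example_sampling_rate` from the defaults.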
See also the EventStreamConfig README.
Stream versioning
Streams are similar to an API. They:
- Have a single producer
- Have many consumers
- Declare a contract (schema)
Sometimes, API maintainers need to make breaking changes. A standard approach is to version the API, and bump the major version number when a breaking change is needed. In this way, multiple major versions of the API can be maintained simultaneously, allowing the API maintainers to provide a deprecation period.
Streams sometimes have this need as well, so we accomplish this in a similar way: include the major version in the stream name. We do this purely by convention. Each unique stream name is a distinct stream, so bumping the version in a stream name effectively declares a new stream.
Stream versioning is opt-in; you only need to do this if you intend to maintain your stream as an API for other consumers to use.
Versioned streams should be declared with a major version suffix, like: mediawiki.page_change.v1. All events produced to this v1 stream should have a major schema version of 1.
While developing a new stream, you may choose to suffix it with a temporary 'version', e.g. mediawiki.page_change.rc0 or mediawiki.page_change.dev1. You can do this to indicate that the stream should not be used by other consumers, and that events in the stream may change schemas at will.
NOTE: If you produce different events with incompatible schemas to a stream, you may break automatic downstream consumers, e.g. consumers.analytics_hadoop_ingestion. Consider setting this to false while you are in development mode.
Upgrading the stream version is equivalent to declaring a new stream with the bumped major version, e.g. mediawiki.page_change.v2. How you coordinate this upgrade is up to you. You should consider:
- Maintaining both streams during an announced deprecation period
- Backfilling your stream with historical data in the new schema format
- Coordinating the decommissioning of the old stream with known consumers
More context can be found in task T332212.
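As an illustration of maintaining two major versions side by side during a deprecation period, the sketch below declares both under the naming convention described above. The stream names, the `$exampleStreams` variable, and the reuse of one schema_title for both entries are assumptions; the text above only states that events in a v1 stream should have a major schema version of 1.

```php
<?php
// Hypothetical declarations; in practice these entries would live in
// mediawiki-config/wmf-config/ext-EventStreamConfig.php.
$exampleStreams = [
    // Existing stream, kept alive during the announced deprecation period.
    'domain_namespace.my_event_stream_name.v1' => [
        'schema_title' => 'domain_namespace/my/event/schema_uri',
    ],
    // New stream for events that follow the next major schema version.
    'domain_namespace.my_event_stream_name.v2' => [
        'schema_title' => 'domain_namespace/my/event/schema_uri',
    ],
];
```

Because each stream name is a distinct stream, removing the v1 entry later decommissions the old stream without touching v2.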
Common Settings Documentation
In lieu of a better place, we'll try to document some of the common stream config settings here.
stream
wgEventStreams is keyed by stream name. The stream name is also available as the stream setting in API results.
schema_title
This must exactly match the title of the event JSONSchema that is allowed in this stream.
destination_event_service
This refers to the name of the EventGate HTTP event intake service the stream should be produced through. Producer clients use this to figure out where to send the stream. The EventGate services also use this to determine if a stream is allowed to be produced through them.
While this setting is technically only needed if the producer is using EventGate, it must be set if canary_events_enabled is true, as the canary event producer uses EventGate.
NOTE: This should one day be moved into a producers config subobject.
canary_events_enabled
This aids in monitoring ingestion pipelines for event streams. If this is true (the default if not set), artificial canary events will periodically be produced into the stream. The canary events are created from the first event example in the schema, but with meta.dt set to the current timestamp and meta.domain set to "canary". Consumers of streams with canary_events_enabled: true should filter out all events where meta.domain == "canary".
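A consumer-side filter for canary events might look like the following minimal sketch. It assumes events have already been decoded into associative arrays; the function name `filterOutCanaryEvents` and the sample events (including the `page_id` field) are hypothetical.

```php
<?php
/**
 * Return only the events that are not canary events.
 * An event counts as a canary when meta.domain is exactly "canary".
 */
function filterOutCanaryEvents(array $events): array {
    return array_values(array_filter(
        $events,
        // Events without meta.domain are kept; only "canary" is dropped.
        static fn (array $event): bool =>
            ($event['meta']['domain'] ?? null) !== 'canary'
    ));
}

// Hypothetical decoded events.
$events = [
    ['meta' => ['domain' => 'en.wikipedia.org'], 'page_id' => 1],
    ['meta' => ['domain' => 'canary'], 'page_id' => 2],
];

$realEvents = filterOutCanaryEvents($events);
```

Applying the filter as early as possible in a consumer keeps the artificial events out of any downstream counts or aggregates.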
consumers and producers
These subobject config settings should be used to configure specific clients that produce or consume this stream. The keys in each subobject should be the names of the clients. Clients look up their configuration from the API by this name.
As of 2021-09, this is only used for the Analytics Hadoop ingestion pipeline. See also https://phabricator.wikimedia.org/T273235.
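The lookup by client name can be sketched as follows. The shape of the stream config, the `enabled` key inside the client's settings, and the variable names are assumptions made for illustration; only the client name analytics_hadoop_ingestion comes from this page.

```php
<?php
// Hypothetical settings for one stream, as a client might receive them.
$streamConfig = [
    'schema_title' => 'domain_namespace/my/event/schema_uri',
    'consumers' => [
        // Keyed by client name. The 'enabled' key is an assumption.
        'analytics_hadoop_ingestion' => [
            'enabled' => false,
        ],
    ],
];

// A client looks up its own settings by name, and falls back to an
// empty array when the stream does not configure it.
$clientName = 'analytics_hadoop_ingestion';
$clientSettings = $streamConfig['consumers'][$clientName] ?? [];
```

Falling back to an empty array lets a client apply its own defaults for streams that do not mention it.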
EventStreamConfig
EventStreamConfig is a MediaWiki extension that implements a PHP API and an HTTP API for requesting stream configuration. Stream configuration entries are declared in the $wgEventStreams global list in mediawiki-config/wmf-config/ext-EventStreamConfig.php.
This centralized EventStreamConfig is used by several services to automate discovery and configuration of stream producer and consumer clients:
- EventGate service clusters use stream config to restrict which types of events are allowed in which streams via the schema_title setting.
- The MediaWiki EventLogging extension uses stream config to vary things like event stream sampling rate.
- The Analytics Cluster uses stream config to automate ingestion of streams into Hive.
- EventStreams uses stream config to discover streams and auto-generate OpenAPI docs.
Because this API is a MediaWiki extension deployed to all (most?) Wikimedia Foundation wikis, it can be requested from any wiki. Because the configuration of the streams is in mediawiki-config, specific per-wiki settings can be provided.
It is expected that 'global' configuration be requested from meta.wikimedia.org in production. You can then override things like sample rate per wiki by configuring the override for that wiki, and then requesting the config you need from that wiki's action API URL.
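The difference between a 'global' and a per-wiki request comes down to which wiki's action API URL is used. The sketch below only builds the URLs and makes no network request. The module name `streamconfigs`, the `streams` parameter, and the helper `streamConfigUrl` are assumptions that should be checked against the EventStreamConfig README.

```php
<?php
/**
 * Build an action API URL that asks one wiki for stream config.
 * Assumption: the module is "streamconfigs" and takes a
 * pipe-separated "streams" parameter.
 */
function streamConfigUrl(string $wikiHost, array $streamNames): string {
    $query = http_build_query([
        'action' => 'streamconfigs',
        'format' => 'json',
        'streams' => implode('|', $streamNames),
    ]);
    return "https://{$wikiHost}/w/api.php?{$query}";
}

$streams = ['domain_namespace.my_event_stream_name'];

// 'Global' configuration, requested from meta.wikimedia.org.
$globalUrl = streamConfigUrl('meta.wikimedia.org', $streams);

// Per-wiki configuration, including any overrides set for that wiki.
$perWikiUrl = streamConfigUrl('en.wikipedia.org', $streams);
```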