Event Platform/Instrumentation How To

Wikimedia's Event Platform supports both production "Tier 1" events, as well as analytics "Tier 2" events.  To support both, the backend systems that handle these events are more reliable and scalable, but also more complex than legacy EventLogging.  This page provides an instrumentation event stream lifecycle example for developing, deploying, evolving, and decommissioning MediaWiki client-side JS instrumentation event streams from [[mw:Extension:WikimediaEvents]] via [[mw:Extension:EventLogging]].
{{Navigation Event Platform}}


= Differences from the legacy EventLogging backend =
The EventLogging extension was originally built as an all-in-one system to capture MediaWiki analytics events.  It managed schemas, client side event submission, server side event validation, and server side event ingestion (into e.g. MySQL).  The [[Event_Platform|Event Platform]] program was conceived to unify event collection for production and analytics events.  EventLogging's tier 2 and analytics focus and breadth was not suitable to support this unification.  Many of the features of WMF's Event Platform are the same as the legacy EventLogging system, but are more modular and scalable.  From an instrumentation-only perspective, it may not be clear why things have to be different, but there are good engineering reasons for all of these changes.


The EventLogging extension has been repurposed as a MediaWiki instrumentation event producer library only.  On-wiki schemas and backend validation are no longer supported by EventLogging.
You can learn more about the differences between how Event Platform works and how the legacy EventLogging backend systems work on the [[Event Platform/EventLogging legacy|EventLogging legacy]] page.


 
== EventLogging legacy vs. Event Platform ==
{| class="wikitable"
|-
!  !! EventLogging legacy !! Event Platform
|-
| '''Schema repositories''' || EventLogging schemas were stored as centralized [https://meta.wikimedia.org/w/index.php?title=Special%3AAllPages&from=&to=&namespace=470 wiki pages on metawiki], and all environments (development, beta, production, etc.) had to use this same schema repository.
| Event Platform schemas are in decentralized git repositories.  (Analytics instrumentation schemas are in the [https://gerrit.wikimedia.org/r/admin/projects/schemas/event/secondary schemas/event/secondary] repository. Schema repositories are also readable at https://schema.wikimedia.org/#!/ )
|-
| '''Streams, not schemas''' || EventLogging schemas were single use.  Each schema corresponded to only one instrumentation, and eventually only one downstream SQL table.
| Event Platform schemas are like data types for a dataset.  A realtime event data set is called an 'event stream' (or just 'stream' for shorthand).  Each stream must specify its schema, and a schema may be used by multiple streams.
|-
| '''Schema versions''' || EventLogging schema versions were wiki page revisions.  Each event specified its <tt>schema</tt> name and <tt>revision</tt>. || Event Platform schemas are semantically versioned, and each event declares its schema and version in a <tt>$schema</tt> URI.
|-
| '''Schema compatibility''' || Each EventLogging schema revision could change the schema in any way, which led to backwards incompatible changes. || Event Platform schema versions must be backwards compatible; i.e. only adding new optional fields is allowed.
|-
| '''Stream config''' || None.  Changes to the way events were emitted (like sampling rate) required a [[Heterogeneous_deployment/Train_deploys|code deployment]]. || Streams are configured in mediawiki-config and can be modified via a [[Backport_windows|Backport window deployment]].
|}
 
 
 
= Event Streams and Schemas =


Both event schemas and streams must be declared before they can be used by EventLogging.  Doing so will require three changes:
* To create or modify a schema, you will create and edit a current.yaml JSONSchema file in the schemas/event/secondary repository.  You can read more in depth about how Event Platform schemas work at [[Event_Platform/Schemas]].  Please also read [[Event_Platform/Schemas/Guidelines]] before creating or modifying schemas.


* To declare an event stream, you will edit the [https://gerrit.wikimedia.org/r/admin/projects/operations/mediawiki-config mediawiki-config] repository and add a stream config entry to <tt>$wgEventStreams</tt> (in [[gerrit:plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/ext-EventStreamConfig.php|wmf-config/ext-EventStreamConfig.php]] and/or [[gerrit:plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/InitialiseSettings-labs.php|wmf-config/InitialiseSettings-labs.php]]).


* To tell EventLogging that it should look up stream config for a stream (and that it is allowed to produce that stream), you will add an entry to <tt>$wgEventLoggingStreamNames</tt> (in [[gerrit:plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/InitialiseSettings.php|wmf-config/InitialiseSettings.php]] and/or [[gerrit:plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/InitialiseSettings-labs.php|wmf-config/InitialiseSettings-labs.php]]).


= Instrumentation Event Stream Lifecycle Example =
Event Platform's instrumentation development lifecycle is still a work in progress, so please be patient as we work to improve this.  Feedback and ideas are very welcome!


Assuming you'll be using the WikimediaEvents extension to produce your events, you'll need to make changes to 3 git repositories: [https://gerrit.wikimedia.org/r/admin/projects/schemas/event/secondary schemas/event/secondary], [https://gerrit.wikimedia.org/r/admin/projects/mediawiki/extensions/WikimediaEvents WikimediaEvents], and finally for deployment (in beta and production) [https://gerrit.wikimedia.org/r/admin/projects/operations/mediawiki-config mediawiki-config].


This lifecycle example will demonstrate creating a new stream to log whenever a user hovers over an interwiki link.  We'll create a new event stream called 'mediawiki.interwiki_link_hover' that conforms to a new 'analytics/link_hover' schema.


== Development ==
{{anchor|Setup}}
=== In [[mw:MediaWiki-Docker|MediaWiki Docker]] ===
See [[mw:MediaWiki-Docker/Configuration_recipes/EventLogging|MediaWiki Docker EventLogging Configuration recipe]].

=== In [[mw:MediaWiki-Vagrant|MediaWiki Vagrant]] ===
Enabling the wikimediaevents role will also include the eventlogging role for you, and set up other Event Platform backend components on MediaWiki Vagrant including EventGate.


<syntaxhighlight lang="shell-session">
$ vagrant roles enable wikimediaevents --provision
$ vagrant git-update
</syntaxhighlight>
This will clone WikimediaEvents into mediawiki/extensions/WikimediaEvents and the schemas/event/secondary repository at srv/schemas/event/secondary (and also install its npm dependencies for [[Event_Platform/Schemas#Materializing_the_schema|schema materialization]] and [[Event_Platform/Schemas#Testing_schemas|tests]]).


MediaWiki Vagrant's EventLogging and EventGate setup will allow events of any schema into any stream.  For now, you will not have to think about stream config in your development environment.
 
Events will be written to <tt>/vagrant/logs/eventgate-events.json</tt>. EventGate logs, including validation errors, are in <tt>/vagrant/logs/eventgate-wikimedia.log</tt>.
 
To verify that eventgate is working properly, you can force a test event to be produced by curl-ing http://localhost:8192/v1/_test/events.  You should see a test event logged into eventgate-events.json.
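For example, from a shell inside the Vagrant VM (using the URL and log path given above):

<syntaxhighlight lang="shell-session">
$ curl http://localhost:8192/v1/_test/events
$ tail /vagrant/logs/eventgate-events.json
</syntaxhighlight>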
 
=== In your local dev environment with eventgate-devserver ===
If you aren't using MediaWiki-Vagrant, or you'd rather have more manual control over your development environment, EventLogging comes with an 'eventgate-devserver' that will accept events and write them to a local file. Clone <code>mediawiki/extensions/EventLogging</code> and run
 
<syntaxhighlight lang="shell-session">
$ cd extensions/EventLogging/devserver
$ npm install --no-optional
$ npm run eventgate-devserver
</syntaxhighlight>
 
This should download EventGate and other dependencies and run the eventgate-devserver accepting events at http://localhost:8192 and writing them to <tt>./events.json</tt>.  See the [https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/EventLogging/+/refs/heads/master/devserver/README.md devserver/README.md] for more info.
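You can also sanity-check the devserver by POSTing an event with curl.  This is a minimal sketch: it assumes the analytics/link_hover schema and stream name created later in this example, and that EventGate accepts a single event object on its /v1/events route.

<syntaxhighlight lang="shell-session">
$ curl -X POST -H 'Content-Type: application/json' \
    -d '{"$schema": "/analytics/link_hover/1.0.0", "meta": {"stream": "mediawiki.interwiki_link_hover"}, "dt": "2020-04-02T19:11:20.942Z", "link_href": "https://www.mediawiki.org", "link_title": "MediaWiki"}' \
    http://localhost:8192/v1/events
$ cat events.json
</syntaxhighlight>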


=== Creating a new schema ===
EventLogging instrumentation schemas will be added to the schemas/event/secondary repository in the jsonschema/analytics namespace.  We want to create a new schema that represents a link hover event.  Because this type of event can be modeled pretty generically, we are going to create a generic link hover event schema, not one that is specific to interwiki links.  This will allow the schema to possibly be reused by other types of link hover events.


Create a new file in schemas/event/secondary at jsonschema/analytics/link_hover/current.yaml:


<syntaxhighlight lang="shell-session">
$ cd srv/schemas/event/secondary
$ mkdir -p jsonschema/analytics/link_hover
</syntaxhighlight>

Create jsonschema/analytics/link_hover/current.yaml with this content:

<syntaxhighlight lang="yaml">
title: analytics/link_hover
description: Represents an html link mouseover hover event
$id: /analytics/link_hover/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
  - $ref: /fragment/analytics/common/1.0.0#
  # event data fields.
  - properties:
      link_href:
        type: string
        description: href attribute of http anchor link
      link_title:
        type: string
        description: title attribute of http anchor link

examples:
  - {"$schema": {"$ref": "#/$id"}, "meta": {"dt": "2020-04-02T19:11:20.942Z", "stream": "mediawiki.interwiki_link_hover"}, "dt": "2020-04-02T19:11:20.942Z", "link_href": "mw:Extension:EventLogging", "link_title": "mw:Extension:EventLogging"}
</syntaxhighlight>


NOTE: [[Event_Platform/Schemas/Guidelines]] has rules and conventions for schemas.


Run <code>npm run build-new</code> and then git add and commit the new files.
<syntaxhighlight lang="shell-session">
$ npm run build-new jsonschema/analytics/link_hover/current.yaml
# ...

$ git add jsonschema/analytics/link_hover/*

$ git commit -m 'Add new analytics/link_hover schema'
</syntaxhighlight>


As you can see, you now have many more files committed to git than just current.yaml.  Event Platform will use the statically versioned schema files to validate your events.
If you want to materialize a current.yaml schema file without committing it to git, you can run the materialize command manually:
<syntaxhighlight lang="shell-session">
$ ./node_modules/.bin/jsonschema-tools materialize jsonschema/analytics/link_hover/current.yaml
</syntaxhighlight>


You can read more about how and why we materialize schema versions over at [[Event_Platform/Schemas]].


=== Event stream configuration ===


Event streams do not need to be configured to test instrumentation if you use debug mode, which can be enabled by logging in to an account and turning the <code>eventlogging-display-web</code> user preference on:<syntaxhighlight lang="javascript">
mw.loader.using('mediawiki.api.options')
    .then(
        () => new mw.Api().saveOption('eventlogging-display-web', '1')
    );
</syntaxhighlight>You will see the events as notifications, letting you check that they're generated when they should be and that they contain the data you expect. However, '''for event data to actually be sent, the stream needs to be configured''' because ''streams that are not in the stream configuration are never in-sample''.
When developing with MediaWiki-Vagrant you do not need to specify <code>sampling</code> configuration for your streams, but you at least need to register them locally (and they will use 100% sampling by default). After enabling the "wikimediaevents" role and provisioning, add the following to [[gerrit:plugins/gitiles/mediawiki/vagrant/+/master/LocalSettings.php|LocalSettings.php]] in your clone:
<syntaxhighlight lang="php">
$wgEventStreams = [
    [
        'stream' => 'analytics.link_hover',
        'schema_title' => 'analytics/link_hover',
        'destination_event_service' => 'eventgate-analytics-external',
    ],
];
$wgEventLoggingStreamNames = [
'analytics.link_hover',
];
</syntaxhighlight>
You should be able to check that the stream config works by querying your local MediaWiki API: '''<nowiki>http://dev.wiki.local.wmftest.net:8080/w/api.php?action=streamconfigs&format=json</nowiki>''' or '''<nowiki>http://dev.wiki.local.wmftest.net:8080/w/api.php?action=streamconfigs&format=json&constraints=destination_event_service=eventgate-analytics-external</nowiki>''', and you should see something like <code><nowiki>{"streams":{"mediawiki.interwiki_link_hover":{"destination_event_service":"eventgate-analytics-external"}}}</nowiki></code>.
'''Note''': <code>$wgEventStreams</code> holds the stream configuration, and <code>$wgEventLoggingStreamNames</code> registers streams with EventLogging, making their configuration available to EventLogging. If your instrumentation does not appear to log events to a certain stream, check that the stream is in both of these.
'''Reminder''': any streams you add/use locally will need to be added to production config by adding a stream config entry to <code>$wgEventStreams</code> (in [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/ext-EventStreamConfig.php wmf-config/ext-EventStreamConfig.php] and/or [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/master/wmf-config/InitialiseSettings-labs.php wmf-config/InitialiseSettings-labs.php] in [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/ mediawiki-config]) for your instrumentation to actually send events once deployed.
=== Writing MediaWiki instrumentation code using the EventLogging extension ===

==== In JavaScript ====
Assuming you already have a ResourceLoader package module in WikimediaEvents, you need to add code that calls the <code>mw.eventLog.submit</code> function.  We want to fire events whenever an interwiki link is hovered over.


<syntaxhighlight lang="javascript">
// 'a.extiw' will match anchors that have a extiw class.  extiw is used for interwiki links.
$( '#content' ).on( 'mouseover', 'a.extiw', function ( jqEvent ) {
	var link = jqEvent.target;
	var linkHoverEvent = {
		// $schema is required and must be set to the exact value of $id that you set in your schema.
		$schema: '/analytics/link_hover/1.0.0',
		link_href: link.href,
		link_title: link.title,
	};
	var streamName = 'mediawiki.interwiki_link_hover';
	mw.eventLog.submit( streamName, linkHoverEvent );
} );
</syntaxhighlight>


Now when you hover over a link, <code>mw.eventLog.submit</code> will be called and an event will be sent to the 'mediawiki.interwiki_link_hover' stream.


'''In summary:'''


* Make sure your event data includes <code>$schema</code> which should match <code>$id</code> and is set to the path (starting with /) and (extensionless) version. This tells EventGate which schema and specifically which version of that schema the instrumentation conforms to.
* <code>mw.eventLog.submit</code> needs: (1) the stream name (this must match what will be configured in production in <code>wgEventStreams</code> stream config), and (2) the event data, which must include <code>$schema</code>.
 
 
==== In PHP ====
The EventLogging PHP interface is the same as the JavaScript one.  If you were building a server side event that will be sent with the EventLogging extension, you'd use the <code>EventLogging::submit</code> function.

<syntaxhighlight lang="php">
$exampleEvent = [
	'$schema' => '/analytics/example_schema/1.0.0',
	'field_a' => 'value_a',
	// ... Other event data fields from /analytics/example_schema/1.0.0
];

$streamName = 'mediawiki.example.stream';
EventLogging::submit( $streamName, $exampleEvent );
</syntaxhighlight>


== Deployment ==
Once your schema and instrumentation code have been reviewed and merged, you are ready for deployment.  You will make some changes to the mediawiki-config repository to configure your new stream, as well as register it for use by the EventLogging extension.


Clone [https://gerrit.wikimedia.org/r/admin/projects/operations/mediawiki-config mediawiki-config] and edit wmf-config/ext-EventStreamConfig.php (for <tt>$wgEventStreams</tt>) and wmf-config/ext-EventLogging.php (for <tt>$wgEventLoggingStreamNames</tt>). ''NOTE: You can configure these same settings for beta only by editing wmf-config/InitialiseSettings-labs.php instead. The production config files will be used in both beta and production if no corresponding values are found in InitialiseSettings-labs.php.''


 
=== Stream Configuration ===
First declare your stream in the <tt>wgEventStreams</tt> config variable.
<syntaxhighlight lang="php">
'wgEventStreams' => [
	'default' => [
		// ...
		'mediawiki.interwiki_link_hover' => [
			'schema_title' => 'analytics/link_hover',
			'destination_event_service' => 'eventgate-analytics-external',
		],
	],
],
</syntaxhighlight>
This is the minimal stream config required. This config is used to ensure that only events that have a schema with a title that matches the <code>schema_title</code> value here are allowed in the stream. (Note that this is NOT the same as the schema's <code>$id</code> field. The <code>$id</code> field is a versioned URI. Each event data's <code>$schema</code> URI field will be used to look up the schema at that schema URI.)
Future work will add support for other stream configs, like [https://phabricator.wikimedia.org/T234594 adjusting sampling rate] without having to deploy code.


See [[Event_Platform/Stream_Configuration#Common_Settings_Documentation]] for documentation on common stream config settings.
 
=== Register your stream for use by EventLogging ===
If producing events using the EventLogging extension (most likely), you need to list your stream in <tt>wgEventLoggingStreamNames</tt> so that EventLogging will get the config for your stream and be able to produce these events.
 
<syntaxhighlight lang="php">
'wgEventLoggingStreamNames' => [
	'default' => [
		// ...
		'mediawiki.interwiki_link_hover',
	],
],
</syntaxhighlight>
If you've made these changes in InitialiseSettings-labs.php, you can find a reviewer to just merge your change and the config will be automatically synced to the beta cluster.  If your schema change and instrumentation code change are also merged, you'll be able to send these events in beta.
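Once the config is synced, you can verify that the stream is registered by querying the streamconfigs API on any beta-cluster wiki, e.g.:

<syntaxhighlight lang="shell-session">
$ curl 'https://en.wikipedia.beta.wmflabs.org/w/api.php?action=streamconfigs&format=json'
</syntaxhighlight>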


If you've made these changes in the production config files (ext-EventStreamConfig.php / ext-EventLogging.php), you'll need to schedule a [[Backport_windows|Backport window deployment]] to sync out your config change to the production cluster. See [[Deployments]] and [[Backport_windows|Backport windows]] for instructions.


== Viewing and querying events ==
=== Beta ===
 
Event Platform components are set up in the Cloud VPS deployment-prep project (AKA 'beta') similar to how they run in production.  The EventLogging extension there is configured to send events to https://intake-analytics.wikimedia.beta.wmflabs.org/v1/events?hasty=true.  If you are not using EventLogging (e.g. from a mobile app), you will have to configure your 'beta' installation of the app to also POST events to this URL.
 
An [[EventStreams]] instance in the Cloud VPS deployment-prep project (AKA 'beta') exists that allows you to consume any stream declared in stream config.  As this instance is in beta, it only has events that are generated in beta.  You can consume this stream via an EventSource/SSE client (see https://stream-beta.wmflabs.org/?doc) or in your browser using the EventStreams GUI at https://stream-beta.wmflabs.org/v2/ui.
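A plain HTTP client works too, since EventStreams speaks SSE.  A sketch, assuming the standard <code>/v2/stream/{stream}</code> EventStreams route and that our example stream is declared in beta stream config:

<syntaxhighlight lang="shell-session">
$ curl -s 'https://stream-beta.wmflabs.org/v2/stream/mediawiki.interwiki_link_hover'
</syntaxhighlight>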
 
Alternatively, you can log into a host in the deployment-prep Cloud VPS project and consume events directly from Kafka using a Kafka client like kafkacat:
 
<syntaxhighlight lang="bash">
sudo apt-get install kafkacat # if needed
kafkacat -C -b deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud -t eqiad.<your_stream_name>
</syntaxhighlight>
 
=== Production ===


==== Data Lake ====
Events in production eventually make their way into the [[Analytics/Data_Lake]] where they are ingested into a Hive table in the <tt>event</tt> Hive database. From there they are queryable using [[Analytics/Systems/Cluster/Hive|Hive]], [[Analytics/Systems/Cluster/Spark|Spark]], or [[Analytics/Systems/Presto|Presto]], and also available for dashboarding in [[Analytics/Systems/Superset|Superset]] via Presto.


The Hive table name will be a normalized version of the stream name. Our example stream's Hive table will be <tt>event.mediawiki_interwiki_link_hover</tt>.
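For example, once events have been ingested, a Hive query from an analytics client host might look like the following.  This is a sketch: it assumes the standard <code>year</code>/<code>month</code>/<code>day</code>/<code>hour</code> partition columns of event tables, and the <code>meta.dt</code> field from the common analytics fragment.

<syntaxhighlight lang="shell-session">
$ hive -e "SELECT meta.dt, link_href, link_title FROM event.mediawiki_interwiki_link_hover WHERE year = 2023 AND month = 3 AND day = 8 LIMIT 10;"
</syntaxhighlight>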


==== Kafka ====
<syntaxhighlight lang="php">
Events are also available in real time by consuming them directly from [[Kafka]]. The stream name is prefixed with the source datacenter for the Kafka topic. WMF has two main datacenters: 'eqiad' and 'codfw'. To consume all events from Kafka for your stream, you should consume from both of these topics. Our example's Kafka topics will be <code>eqiad.mediawiki.interwiki_link_hover</code> and <code>codfw.mediawiki.interwiki_link_hover</code>.


Legacy EventLogging streams are named in a different way and not split by data center: for example <code>eventlogging_InukaPageView</code>. Note that Kafka topic names are case sensitive, so make sure you use the correct capitalization, as listed in the [https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L21164 <code>wgEventStreams</code> configuration variable].


Here's a useful command which fetches and pretty-prints the last 5 messages from a topic:<syntaxhighlight lang="shell">
kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t {{topic name}} -o -5 -e -q | jq .
</syntaxhighlight>If you want to listen for new events (perhaps when deploying new instrumentation), you can use this instead:<syntaxhighlight lang="shell">
kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t {{topic name}} -o end | jq .
</syntaxhighlight>


==== EventStreams ====
Production has an internal [[EventStreams]] instance that allows you to consume any stream declared in our stream config. <code>eventstreams-internal.discovery.wmnet</code> is not publicly accessible, so you have to access it through an SSH tunnel:


<syntaxhighlight lang="bash">
# Tunnel local traffic to port 4992 to eventstreams-internal.discovery.wmnet:4992 via bast1003.wikimedia.org
# Replace bast1003.wikimedia.org with your preferred bastion or other accessible production host.
ssh -N -L4992:eventstreams-internal.discovery.wmnet:4992 bast1003.wikimedia.org
</syntaxhighlight>


Then, in your browser, you can navigate to https://localhost:4992/v2/ui/#/ to use the GUI. Note that you will probably have to add a security exception to do so, since the site uses an "unofficial" HTTPS certificate. You can also consume with your own EventSource/SSE client (see https://localhost:4992/?doc).


[[File:Eventstreams-internal gui.png|1000px]]
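If you prefer a terminal client over the GUI, you can consume a stream through the same tunnel.  A sketch, assuming the standard <code>/v2/stream/{stream}</code> route; <code>-k</code> skips verification of the "unofficial" certificate mentioned above:

<syntaxhighlight lang="shell-session">
$ curl -sk 'https://localhost:4992/v2/stream/mediawiki.interwiki_link_hover'
</syntaxhighlight>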


=== Viewing schema validation errors ===
Events that do not validate with their schema will not be produced.  For events that are produced via EventLogging, validation error events will be produced into the stream <code>eventgate-analytics-external.error.validation</code> and will be ingested into Hive in the <code>event.eventgate_analytics_external_error_validation</code> table.
 
These validation error events are also ingested into logstash and can be viewed in Kibana using [https://logstash.wikimedia.org/app/dashboards#/view/AXN5OoJu3_NNwgAUlbUT this dashboard].
 
The rate of validation errors per stream can be seen in the [https://grafana.wikimedia.org/goto/x_3wAwJ7k EventGate Grafana dashboard].


The <code>eventgate-analytics-external.error.validation</code> stream is a stream like any other, so you can also view the stream using an EventStreams GUI in beta or production (see above).
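For example, following the datacenter-prefixed topic naming described above, you can tail recent validation errors directly from Kafka:

<syntaxhighlight lang="shell-session">
$ kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eqiad.eventgate-analytics-external.error.validation -o -5 -e -q | jq .
</syntaxhighlight>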
 
== Evolving your schema ==
Let's say our 'mediawiki.interwiki_link_hover' stream is operating fine in production, and now we want to also log the link's text value.  We'll need to add a new <tt>link_text</tt> field to the 'analytics/link_hover' schema.


Edit jsonschema/analytics/link_hover/current.yaml again: add the new field and bump the version number in the <tt>$id</tt>.  The new content of current.yaml should look like:
<syntaxhighlight lang="yaml">
title: analytics/link_hover
description: Represents an html link mouseover hover event
$id: /analytics/link_hover/1.1.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
  - $ref: /fragment/analytics/common/1.0.0#
  # event data fields.
  - properties:
      link_href:
        type: string
        description: href attribute of http anchor link
      link_title:
        type: string
        description: title attribute of http anchor link
      link_text:
        type: string
        description: text value of http anchor link

examples:
  - {"$schema": {"$ref": "#/$id"}, "meta": {"dt": "2020-04-02T19:11:20.942Z", "stream": "mediawiki.interwiki_link_hover"}, "dt": "2020-04-02T19:11:20.942Z", "link_href": "mw:Extension:EventLogging", "link_title": "mw:Extension:EventLogging", "link_text": "EventLogging extension"}
</syntaxhighlight>


Now, run <code>npm run build-modified</code> and commit the new 1.1.0 version files.


<syntaxhighlight lang="shell-session">
$ npm run build-modified
# ...
$ git add jsonschema/analytics/link_hover/*
$ git commit -m 'analytics/link_hover - add link_text field and bump to version 1.1.0'
</syntaxhighlight>


New analytics/link_hover/1.1.0 files have been created, and the jsonschema/analytics/link_hover/latest symlinks have been updated to point to the 1.1.0 versions.
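You can check the result in the schema directory; the listing should look something like:

<syntaxhighlight lang="shell-session">
$ ls jsonschema/analytics/link_hover
1.0.0  1.0.0.json  1.0.0.yaml  1.1.0  1.1.0.json  1.1.0.yaml  current.yaml  latest  latest.json  latest.yaml
</syntaxhighlight>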


Now edit your instrumentation code to produce the event data with the <tt>link_text</tt> field and the updated versioned <tt>$schema</tt> URI.  Your instrumentation code should now look like this:
<syntaxhighlight lang="javascript">
<syntaxhighlight lang="javascript">
$( '#content' ).on( 'mouseover', 'a.extiw', function ( jqEvent ) {
$( '#content' ).on( 'mouseover', 'a.extiw', function ( jqEvent ) {
Line 381: Line 295:
link_text: link.text,
link_text: link.text,
}
}
var streamName = 'analytics.interwiki_link_hover'
var streamName = 'mediawiki.interwiki_link_hover'
mw.eventLog.submit( streamName, linkHoverEvent );
mw.eventLog.submit( streamName, linkHoverEvent );
} );
} );
</syntaxhighlight>
</syntaxhighlight>


'''Note that only backwards compatible changes are allowed'''. This means that the only type of change you can make to a schema is to add new optional fields. jsonschema-tools will ensure that all schema versions are backwards compatible (as well as ensuring that the schema repository is in good shape).  Jenkins will run CI tests when you push your schema change to gerrit, but you can also run the tests manually:


<syntaxhighlight lang="shell-session">
$ npm test
# ...
        ✓ 1.1.0 must be compatible with 1.0.0
...
</syntaxhighlight>

=== Backwards incompatible schema changes ===
In general, backwards incompatible schema changes are not allowed.  This is because they are not possible to do without manual intervention.  Your code may be the only producer of this data, but there may be many consumers, and making a backwards incompatible change requires coordination with all of them.  For instrumentation, likely the only uses of your schema will be by EventGate for validation, and Hive for querying your data.  If you absolutely need to make a backwards incompatible change to an instrumentation schema, the procedure is:
* Make a Phabricator ticket that describes your request and tag Data-Engineering.
* Manual steps for Data-Engineering:
** Merge the schema change, wait at least 30 minutes (or run puppet on schema* nodes to apply the merge).
** Drop the corresponding Hive table(s).
As long as the schema change is a new version (not an edit of an existing version), EventGate does not need a restart.  If for some forsaken reason you need to edit an existing schema version, you'll need to do a rolling restart of the eventgate-analytics-external service.
== Overriding event stream config settings ==
=== In production for specific wikis / groups ===
Because [[Event_Platform/Stream Configuration|EventStreamConfig]] uses [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master mediawiki-config], we can override settings for beta, wiki groups, or even specific wikis. Unfortunately, this ONLY works for clients that request their config via the specific wiki you are overriding.
EventGate is a service that is not wiki specific, so it requests all stream configs from meta.wikimedia.org.  This means that you MUST add entries in <code>wgEventStreams</code> for at least metawiki (usually done in the 'default' configs).
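You can check exactly what EventGate will see by querying metawiki's streamconfigs API.  A sketch; the <code>streams</code> parameter is assumed here to filter the result, and omitting it should fetch all streams:

<syntaxhighlight lang="shell-session">
$ curl 'https://meta.wikimedia.org/w/api.php?action=streamconfigs&format=json&streams=mediawiki.interwiki_link_hover'
</syntaxhighlight>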
EventLogging is a MediaWiki extension that produces events, and is usually executed within the context of a specific wiki.  It will get configuration for the specific wiki it is being run in.  That means you can override configs that EventLogging uses for specific wikis.
Example: EventLogging respects a <code>sample</code> setting to produce only a sample of events.  Suppose we only want to send 1 in 10 events on enwiki.  We've already got the mediawiki.interwiki_link_hover stream declared in the <code>wgEventStreams</code> <code>default</code> section.  Let's add an override for the <code>sample</code> setting on enwiki.
<syntaxhighlight lang="php">
'wgEventStreams' => [
'default' => [
// ...
'mediawiki.interwiki_link_hover' => [
'schema_title' => 'analytics/link_hover',
'destination_event_service' => 'eventgate-analytics-external',
],
],
    // Add the overrides for enwiki.
    // The '+' prefix indicates that these settings should be recursively merged with 'default' settings.
    // See: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/src/StaticSiteConfiguration.php#64
    '+enwiki' => [
        'mediawiki.interwiki_link_hover' => [
            'sample' => [
                'rate' => 0.1,
            ],
        ],
    ]
],
</syntaxhighlight>
When enwiki MediaWiki JavaScript requests event stream config for the mediawiki.interwiki_link_hover stream, it will get something like
<syntaxhighlight lang="javascript">
{
  "streams": {
    "mediawiki.interwiki_link_hover": {
      "schema_title": "analytics/link_hover",
      "destination_event_service": "eventgate-analytics-external",
      "sample": {
        "rate": 0.1,
      },
    }
  }
}
</syntaxhighlight>
with the default config merged together.
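To see the merged result for yourself, query enwiki's streamconfigs API (same hedged <code>streams</code> parameter as above):

<syntaxhighlight lang="shell-session">
$ curl 'https://en.wikipedia.org/w/api.php?action=streamconfigs&format=json&streams=mediawiki.interwiki_link_hover'
</syntaxhighlight>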
=== In beta ===
Beta (AKA deployment-prep) configs are in InitialiseSettings-labs.php.  For the most part, overrides here work exactly as described above.  However, the 'default' configs will not be merged if you set them, which means that you cannot use the 'default' config section in <code>wgEventStreams</code> in InitialiseSettings-labs.php.  You can, however, merge overrides in from the <code>InitialiseSettings.php</code> 'default' section if you use the per-wiki or group merge override as described above.  E.g. adding the following in InitialiseSettings-labs.php will override the sample rate for en.wikipedia.beta.wmflabs.org:
<syntaxhighlight lang="php">
'wgEventStreams' => [
   
    // Add the overrides for enwiki in beta
    // The '+' prefix indicates that these settings should be recursively merged with 'default' settings from InitialiseSettings.php.
    '+enwiki' => [
        'mediawiki.interwiki_link_hover' => [
            'sample' => [
                'rate' => 1.0,
            ],
        ],
    ]
],
</syntaxhighlight>


== Decommissioning ==


When deploying a new instrumentation event stream, you should also plan to decommission it one day.  A schema should never be deleted, but all of the stream related code and configuration can be removed at any time to stop producing an event stream.  In the rare cases where the presence of the schema itself is problematic, we can delete it, but that requires a coordinated effort to exclude it and prevent alarms along the data pipeline.
 
[[Category:Event Platform]]

Latest revision as of 14:18, 8 March 2023

Wikimedia's Event Platform supports both production "Tier 1" events, as well as analytics "Tier 2" events. To support both, the backend systems that handle these events are more reliable and scalable, but also more complex than legacy EventLogging. This page will provide an instrumentation event stream lifecycle example for developing, deploying, evolving, and decommissioning MediaWiki client side JS instrumentation event streams from the mw:Extension:WikimediaEvents via mw:Extension:EventLogging.

You can learn more about the differences between how Event Platform works and how the legacy EventLogging backend systems work on the EventLogging legacy page.

Event Streams and Schemas

Both event schemas and streams must be declared before they can be used by EventLogging. Doing so will require three changes:

  • To create or modify a schema, you will create and edit a current.yaml JSONSchema file in the schemas/event/secondary repository. You can read more in depth about how Event Platform schemas work at Event_Platform/Schemas. Please also read Event_Platform/Schemas/Guidelines before creating or modifying scheams.

Instrumentation Event Stream Lifecycle Example

Event Platform's instrumention development lifecycle is still a work in progress, so please be patient as we work to improve this. Feedback and ideas are very welcome!

Assuming you'll be using the WikimediaEvents extension to produce your events, you'll need to make changes to 3 git repositories: schemas/events/secondary, WikimediaEvents, and finally for deployment (in beta and production) mediawiki-config.

This lifecycle example will demonstrate creating a new stream to log whenever a user hovers over an interwiki link. We'll create a new event stream called 'mediawiki.interwiki_link_hover' that conforms to a new 'analytics/link_hover' schema.

Development

In MediaWiki Docker

See MediaWiki Docker EventLogging Configuration recipe.

In MediaWiki Vagrant

Enabling the wikimediaevents role will also include the eventlogging role for you, and set up other Event Platform backend components on MediaWiki Vagrant including EventGate.

$ vagrant roles enable wikimediaevents --provision
$ vagrant git-update

This will clone WikimediaEvents into mediawiki/extensions/WikimediaEvents and the schemas/event/secondary repository at srv/schemas/event/secondary (and also install its npm dependencies for schema materialization and tests)

MediaWiki Vagrant's EventLogging and EventGate setup will allow events of any schema into any stream. For now, you will not have to think about stream config in your development environment.

Events will be written to /vagrant/logs/eventgate-events.json. eventgate logs, including validation errors, are in /vagrant/logs/eventgate-wikimedia.log.

To verify that eventgate is working properly, you can force a test event to be produced by curl-ing http://localhost:8192/v1/_test/events. You should see a test event logged into eventgate-events.json.

In your local dev environment with eventgate-devserver

If you aren't using Mediawiki-Vagrant, or you'd rather have more manual control over your development environment, EventLogging comes with an 'eventgate-devserver' that will accept events and write them to a local file. Clone mediawiki/extensions/EventLogging and run

$ cd extensions/EventLogging/devserver
$ npm install --no-optional
$ npm run eventgate-devserver

This should download EventGate and other dependencies and run the eventgate-devserver accepting events at http://localhost:8192 and writing them to ./events.json. See the devserver/README.md for more info.

Creating a new schema

EventLogging instrumentation schemas will be added to the schemas/event/secondary repository in the jsonschema/analytics namespace. We want to create a new schema that represents a link hover event. Because this type of event can be modeled pretty generically, we are going to create a generic link hover event schema, not one that is specific to interwiki links. This will allow the schema to possibly be reused by other types of link hover events.

Create a new file in schemas/event/secondary at jsonschema/analytics/link_hover/current.yaml:

$ cd srv/schemas/event/secondary
$ mkdir -p jsonschema/analytics/link_hover

Create jsonschema/analytics/link_hover/current.yaml with this content:

title: analytics/link_hover
description: Represents an html link mouseover hover event
$id: /analytics/link_hover/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
  - $ref: /fragment/analytics/common/1.0.0#
  # event data fields.
  - properties:
      link_href:
        type: string
        description: href attribute of http anchor link
      link_title:
        type: string
        description: title attribute of http anchor link

examples:
  - {$schema":{$ref: "#/$id"},"meta":{"dt":"2020-04-02T19:11:20.942Z" ,"stream":"mediawiki.interwiki_link_hover"},"dt": "2020-04-02T19:11:20.942Z", "link_href": "mw:Extension:EventLogging", "link_title": "mw:Extension:EventLogging"}

NOTE: Event_Platform/Schemas/Guidelines has rules and conventions for schemas.

Run npm run build-new and then git commit this new file.

$ npm run build-new jsonschema/analytics/link_hover/current.yaml
# ...

$ git add jsonschema/analytics/link_hover/*

$ git commit -m 'Add new analytics/link_hover schema'

As you can see, you now have many more files committed to git than just current.yaml. Event Platform will use the statically versioned schema files to validate your events.

You can read more about how and why we materialize schema versions over at Event_Platform/Schemas.

Writing MediaWiki instrumentation code using the EventLogging extension

In JavaScript

Assuming you already have a ResourceLoader package module in WikimediaEvents, you need to add code that calls the mw.eventLog.submit function. We want to fire events whenever an interwiki link is hovered over.

// 'a.extiw' will match anchors that have a extiw class.  extiw is used for interwiki links.
$( '#content' ).on( 'mouseover', 'a.extiw', function ( jqEvent ) {
	var link = jqEvent.target;
	var linkHoverEvent = {
		// $schema is required and must be set to the exact value of $id that you set in your schema.
		$schema: '/analytics/link_hover/1.0.0',
		link_href: link.href,
		link_title: link.title,
	};
	var streamName = 'mediawiki.interwiki_link_hover';
	mw.eventLog.submit( streamName, linkHoverEvent );
} );

Now when you hover over a link, mw.eventLog.submit will be called and an event will be sent to the 'mediawiki.interwiki_link_hover' stream.

In summary:

  • Make sure your event data includes $schema which should match $id and is set to the path (starting with /) and (extensionless) version. This tells EventGate which schema and specifically which version of that schema the instrumentation conforms to.
  • mw.eventLog.submit needs: (1) the stream name (this must match what will be configured in production in wgEventStreams stream config, and (2) the event data must include $schema.


In PHP

The EventLogging PHP interface is the same as the JavaScript one. If you are building a server-side event that will be sent with the EventLogging extension, use the EventLogging::submit function.

$exampleEvent = [
	'$schema' => '/analytics/example_schema/1.0.0',
	'field_a' => 'value_a',
	// ... Other event data fields from /analytics/example_schema/1.0.0
];

$streamName = 'mediawiki.example.stream';
EventLogging::submit( $streamName, $exampleEvent );

Deployment

Once your schema and instrumentation code have been reviewed and merged, you are ready for deployment. You will make some changes to the mediawiki-config repository to configure your new stream, as well as register it for use by the EventLogging extension.

Clone mediawiki-config and edit wmf-config/ext-EventLogging.php. NOTE: You can configure these same settings for beta only by editing wmf-config/InitialiseSettings-labs.php instead. ext-EventLogging.php will be used in both beta and production if no corresponding values are found in InitialiseSettings-labs.php.

Stream Configuration

First declare your stream in the wgEventStreams config variable.

'wgEventStreams' => [
	'default' => [
		// ...
		'mediawiki.interwiki_link_hover' => [
			'schema_title' => 'analytics/link_hover',
			'destination_event_service' => 'eventgate-analytics-external',
		],
	],
],

This is the minimal stream config required. This config ensures that only events whose schema title matches the schema_title value here are allowed in the stream. (Note that this is NOT the same as the schema's $id field. The $id field is a versioned URI. Each event's $schema URI field will be used to look up the schema at that URI.)

Future work will add support for other stream configs, like adjusting sampling rate without having to deploy code.

See Event_Platform/Stream_Configuration#Common_Settings_Documentation for documentation on common stream config settings.

Register your stream for use by EventLogging

If you are producing events using the EventLogging extension (most likely), you need to list your stream in wgEventLoggingStreamNames so that EventLogging will get the config for your stream and be able to produce these events.

'wgEventLoggingStreamNames' => [
	'default' => [
		// ...
		'mediawiki.interwiki_link_hover',
	],
],

If you've made these changes in InitialiseSettings-labs.php, you can find a reviewer to merge your change, and the config will be automatically synced to the beta cluster. Once your schema change and instrumentation code change are also merged, you'll be able to send these events in beta.

If you've made these changes in ext-EventLogging.php, you'll need to schedule a Backport window deployment to sync out your config change to the production cluster. See Deployments and Backport windows for instructions.

Viewing and querying events

Beta

Event Platform components are set up in the Cloud VPS deployment-prep project (AKA 'beta') similar to how they run in production. The EventLogging extension there is configured to send events to https://intake-analytics.wikimedia.beta.wmflabs.org/v1/events?hasty=true. If you are not using EventLogging (e.g. from a mobile app), you will have to configure your 'beta' installation of the app to also POST events to this URL.

An EventStreams instance in the Cloud VPS deployment-prep project (AKA 'beta') exists that allows you to consume any stream declared in stream config. As this instance is in beta, it only has events that are generated in beta. You can consume this stream via an EventSource/SSE client (see https://stream-beta.wmflabs.org/?doc) or in your browser using the EventStreams GUI at https://stream-beta.wmflabs.org/v2/ui.
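
Any EventSource/SSE client can also consume these streams programmatically. A minimal browser JavaScript sketch, assuming the beta instance exposes the standard /v2/stream/{stream} route:

// Consume the beta stream with the browser's built-in EventSource.
var source = new EventSource(
	'https://stream-beta.wmflabs.org/v2/stream/mediawiki.interwiki_link_hover'
);
source.onmessage = function ( message ) {
	// message.data is a JSON-serialized event.
	console.log( JSON.parse( message.data ) );
};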

Alternatively, you can log into a host in the deployment-prep Cloud VPS project and consume events directly from Kafka using a Kafka client like kafkacat:

sudo apt-get install kafkacat # if needed
kafkacat -C -b deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud -t eqiad.<your_stream_name>

Production

Data Lake

Events in production eventually make their way into the Analytics/Data_Lake where they are ingested into a Hive table in the event Hive database. From there they are queryable using Hive, Spark, or Presto, and also available for dashboarding in Superset via Presto.

The Hive table name will be a normalized version of the stream name. Our example stream's Hive table will be event.mediawiki_interwiki_link_hover.
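
For example, you could query the table from a PySpark session on an analytics client. This is a sketch: it assumes a SparkSession named spark and the usual year/month/day/hour partition columns on event tables.

# Count hovers per link over one day (partition column names are an assumption).
spark.sql("""
    SELECT link_href, COUNT(*) AS hovers
    FROM event.mediawiki_interwiki_link_hover
    WHERE year = 2020 AND month = 4 AND day = 2
    GROUP BY link_href
    ORDER BY hovers DESC
    LIMIT 10
""").show(truncate=False)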

Kafka

Events are also available in real time by consuming them directly from Kafka. The stream name is prefixed with the source datacenter for the Kafka topic. WMF has two main datacenters: 'eqiad' and 'codfw'. To consume all events from Kafka for your stream, you should consume from both of these topics. Our example's Kafka topics will be eqiad.mediawiki.interwiki_link_hover and codfw.mediawiki.interwiki_link_hover.

Legacy EventLogging streams are named in a different way and not split by data center: for example eventlogging_InukaPageView. Note that Kafka topic names are case sensitive, so make sure you use the correct capitalization, as listed in the wgEventStreams configuration variable.

Here's a useful command which fetches and pretty-prints the last 5 messages from a topic:

kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t {{topic name}} -o -5 -e -q | jq .

If you want to listen for new events (perhaps when deploying new instrumentation), you can use this instead:

kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t {{topic name}} -o end | jq .

EventStreams

Production has an internal EventStreams instance that allows you to consume any stream declared in our stream config. eventstreams-internal.discovery.wmnet is not publicly accessible, so you have to access it through an SSH tunnel:

# Tunnel local traffic to port 4992 to eventstreams-internal.discovery.wmnet:4992 via bast1003.wikimedia.org
# Replace bast1003.wikimedia.org with your preferred bastion or other accessible production host.
ssh -N -L4992:eventstreams-internal.discovery.wmnet:4992 bast1003.wikimedia.org

Then, in your browser, you can navigate to https://localhost:4992/v2/ui/#/ to use the GUI. Note that you will probably have to add a security exception to do so, since the site uses an "unofficial" HTTPS certificate. You can also consume with your own EventSource/SSE client (see https://localhost:4992/?doc).
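
You can also consume a stream through the same tunnel from the command line, for example with curl (-k skips certificate verification, needed for the same reason as the browser security exception; -N disables output buffering):

$ curl -skN https://localhost:4992/v2/stream/mediawiki.interwiki_link_hover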

[Screenshot: the internal EventStreams GUI]

Viewing schema validation errors

Events that do not validate with their schema will not be produced. For events that are produced via EventLogging, validation error events will be produced into the stream eventgate-analytics-external.error.validation and will be ingested into Hive in the event.eventgate_analytics_external_error_validation table.

These validation error events are also ingested into logstash and can be viewed in Kibana using this dashboard.

The rate of validation errors per stream can be seen in the EventGate Grafana dashboard.

The eventgate-analytics-external.error.validation stream is a stream like any other, so you can also view the stream using an EventStreams GUI in beta or production (see above).
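
For example, to watch for new validation errors while deploying an instrumentation change (a sketch; it assumes the error stream's Kafka topics are datacenter-prefixed like any other stream's):

kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eqiad.eventgate-analytics-external.error.validation -o end | jq .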

Evolving your schema

Let's say our 'mediawiki.interwiki_link_hover' stream is operating fine in production, and now we want to also log the link's text value. We'll need to add a new link_text field to the 'analytics/link_hover' schema.

Edit jsonschema/analytics/link_hover/current.yaml again: add the new field and bump the version number in the $id. The new content of current.yaml should look like this:

title: analytics/link_hover
description: Represents an html link mouseover hover event
$id: /analytics/link_hover/1.1.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
  - $ref: /fragment/analytics/common/1.0.0#
  # event data fields.
  - properties:
      link_href:
        type: string
        description: href attribute of http anchor link
      link_title:
        type: string
        description: title attribute of http anchor link
      link_text:
        type: string
        description: text value of http anchor link
examples:
  - {$schema":{$ref: "#/$id"},"meta":{"dt":"2020-04-02T19:11:20.942Z" ,"stream":"mediawiki.interwiki_link_hover"},"dt": "2020-04-02T19:11:20.942Z", "link_href": "mw:Extension:EventLogging", "link_title": "mw:Extension:EventLogging", "link_text": "EventLogging extension"}

Now, run npm run build-modified and commit new 1.1.0 version files.

$ npm run build-modified
# ...
$ git add jsonschema/analytics/link_hover/*
$ git commit -m 'analytics/link_hover - add link_text field and bump to version 1.1.0'

New analytics/link_hover/1.1.0 files have been created, and the jsonschema/analytics/link_hover/latest symlinks have been updated to point to the 1.1.0 versions.

Now edit your instrumentation code to produce the event data with the link_text field and the updated versioned $schema URI. Your instrumentation code should now look like this:

$( '#content' ).on( 'mouseover', 'a.extiw', function ( jqEvent ) {
	var link = jqEvent.target;
	var linkHoverEvent = {
		// $schema is required and must be set to the exact value of $id that you set in your schema.
		// We've added link_text, which was added in schema version 1.1.0, so we need to specify that this
		// event should be validated using schema version 1.1.0.
		$schema: '/analytics/link_hover/1.1.0',
		link_href: link.href,
		link_title: link.title,
		link_text: link.text
	};
	var streamName = 'mediawiki.interwiki_link_hover';
	mw.eventLog.submit( streamName, linkHoverEvent );
} );

Note that only backwards compatible changes are allowed. This means that the only type of change you can make to a schema is adding new optional fields. jsonschema-tools will ensure that all schema versions are backwards compatible (as well as ensuring that the schema repository is in good shape). Jenkins will run CI tests when you push your schema change to gerrit, but you can also run the tests manually:

$ npm test
...
  Schema Compatibility in Repository ./jsonschema/
    analytics/link_hover
      Major Version 1
        ✓ 1.1.0 must be compatible with 1.0.0
...

Backwards incompatible schema changes

In general, backwards incompatible schema changes are not allowed, because they are not possible to deploy without manual intervention. Your code may be the only producer of this data, but there may be many consumers, and making a backwards incompatible change requires coordination with all of them. For instrumentation, likely the only users of your schema will be EventGate for validation and Hive for querying your data. If you absolutely need to make a backwards incompatible change to an instrumentation schema, the procedure is:

  • Make a Phabricator ticket that describes your request and tag Data-Engineering.
  • Manual steps for Data-Engineering:
    • Merge the schema change, then wait at least 30 minutes (or run puppet on the schema* nodes to apply the merge).
    • Drop the corresponding Hive table(s).

As long as the schema change is a new version (not an edit of an existing version), EventGate does not need a restart. If for some reason you do need to edit an existing schema version, you'll need to perform a rolling restart of the eventgate-analytics-external service.

Overriding event stream config settings

In production for specific wikis / groups

Because EventStreamConfig uses mediawiki-config, we can override settings for beta, wiki groups, or even specific wikis. Unfortunately, this ONLY works for clients that request their config via the specific wiki you are overriding.

EventGate is a service that is not wiki specific, so it requests all stream configs from meta.wikimedia.org. This means that you MUST add entries in wgEventStreams for at least metawiki (usually done in the 'default' config section).

EventLogging is a MediaWiki extension that produces events, and is usually executed within the context of a specific wiki. It will get configuration for the specific wiki it is being run in. That means you can override configs that EventLogging uses for specific wikis.

Example: EventLogging respects a sample setting to produce only a sample of events. Suppose we only want to send 1 in 10 events on enwiki. We've already got the mediawiki.interwiki_link_hover stream declared in the wgEventStreams 'default' section. Let's add an override for the sample setting on enwiki.

'wgEventStreams' => [
	'default' => [
		// ...
		'mediawiki.interwiki_link_hover' => [
			'schema_title' => 'analytics/link_hover',
			'destination_event_service' => 'eventgate-analytics-external',
		],
	],

    // Add the overrides for enwiki.
    // The '+' prefix indicates that these settings should be recursively merged with 'default' settings.
    // See: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/src/StaticSiteConfiguration.php#64
    '+enwiki' => [
        'mediawiki.interwiki_link_hover' => [
            'sample' => [
                'rate' => 0.1,
            ],
        ],
    ]
],

When enwiki MediaWiki JavaScript requests event stream config for the mediawiki.interwiki_link_hover stream, it will get something like

{
  "streams": {
    "mediawiki.interwiki_link_hover": {
      "schema_title": "analytics/link_hover",
      "destination_event_service": "eventgate-analytics-external",
      "sample": {
        "rate": 0.1
      }
    }
  }
}

with the default config merged together.

In beta

Beta (AKA deployment-prep) configs are in InitialiseSettings-labs.php. For the most part, overrides here work exactly as described above. However, a 'default' section set here will not be merged with the production one, which means that you cannot use the 'default' config section of wgEventStreams in InitialiseSettings-labs.php. You can, however, merge overrides in from the InitialiseSettings.php 'default' section by using the per-wiki or per-group merge override described above. E.g. adding the following in InitialiseSettings-labs.php will override the sample rate for en.wikipedia.beta.wmflabs.org:

'wgEventStreams' => [
     
    // Add the overrides for enwiki in beta
    // The '+' prefix indicates that these settings should be recursively merged with 'default' settings from InitialiseSettings.php.
    '+enwiki' => [
        'mediawiki.interwiki_link_hover' => [
            'sample' => [
                'rate' => 1.0,
            ],
        ],
    ]
],

Decommissioning

When deploying a new instrumentation event stream, you should also plan to decommission it one day. A schema should never be deleted, but all of the stream-related code and configuration can be removed at any time to stop producing an event stream. In the rare case where the presence of the schema itself is problematic, we can delete it, but doing so requires a coordinated effort to exclude it from, and prevent alarms along, the data pipeline.