Difference between revisions of "Event Platform/Schemas"

imported>Ottomata
imported>Ottomata
 
(12 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{Navigation Event Platform}}
== Motivation and Overview ==
== Motivation and Overview ==


Event Schemas are the core of an Event Streaming Platform.  They allow disparate continuously changing producers and consumers to reliably communicate with each other.  By explicitly declaring the shape of data, schemas ease integration between various systems.
Event Schemas are essential for an Event Streaming Platform.  They allow disparate continuously changing producers and consumers to reliably communicate with each other.  By explicitly declaring the shape of data, schemas ease integration between various systems.


WMF uses JSON as our preferred data serialization format, and as such we have chosen to use JSONSchema for our event schemasThere are plenty of other schema technologies out there, (Avro, Thrift, etc.) but JSON and JSONSchema fit our use cases better than any of those. (For more information about how JSONSchema was chosen, see [https://phabricator.wikimedia.org/T198256 RFC: Modern Event Platform - Choose Schema Tech] and https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/.
Schemas should be readily available for any producer or consumer code that might need itSchemas are needed to validate data, but they can also be used to automate data integration problems, e.g. auto creation of SQL tables in which events will be imported. Access of those schemas should be reliable and immutable for any given deployed service.


JSONSchema provides powerful data validation, but unlike Avro, it does not have schema evolution built in.  That is, to JSONSchema, each schema is distinct, and there is no way to explicitly declare that a given schema is just a new version of another. Schema evolution is necessary to be able to reliably upgrade producer and consumer code.
WMF uses JSON as our preferred in-flight data serialization format, and as such we have chosen* to use JSONSchema for our event schemas. Schema evolution is necessary to be able to reliably upgrade producer and consumer code, but unfortunately, JSONSchema does not have any built-in features for schema evolution. Therefore, each change (even a small one) requires the creation of a totally separate JSONSchema file.  
 
Schemas should also be easily available for any producer or consumer code that might need it.  Schemas are needed to validate data, but they can also be used to automate data integration problems, e.g. auto creation of SQL tables in which events will be imported.  Accessing should also be reliable and immutable for any given deployed service.


WMF has chosen to distribute schemas using Git.  This allows us to do development, CI, versioning and deployment for schemas the same way we do any code project.  However, even though we use Git, we do not rely on Git history for schema versioning.  Each schema version is an explicit static file in the schema repository.  For more background, see [https://phabricator.wikimedia.org/T201643 RFC: Modern Event Platform: Schema Registry].
WMF has chosen to distribute schemas using Git.  This allows us to do development, CI, versioning and deployment for schemas the same way we do any code project.  However, even though we use Git, we do not rely on Git history for schema versioning.  Each schema version is an explicit static file in the schema repository.  For more background, see [https://phabricator.wikimedia.org/T201643 RFC: Modern Event Platform: Schema Registry].


WMF has developed the [https://github.com/wikimedia/jsonschema-tools jsonschema-tools] library to aide developing versioned and backwards compatible schemas in Git.
To make development of many schema versions files in git easier, WMF has developed the [https://github.com/wikimedia/jsonschema-tools jsonschema-tools] library. This tooling makes it easier for developers to design and evolve schemas dynamically while allowing production services can use static and immutable versions of those schemas.
This tooling makes it easier for developers to design and evolve schemas dynamically, event while production services can use static and immutable versions of those schemas.


[https://github.com/wikimedia/jsonschema-tools jsonschema-tools] will be used in the rest of this documentation to set up and develop schemas in a Git schema repository.  Please skim the [https://github.com/wikimedia/jsonschema-tools#jsonschema-tools jsonschema-tools README] before proceeding.
[https://github.com/wikimedia/jsonschema-tools jsonschema-tools] will be used in the rest of this documentation to set up and develop schemas in a Git schema repository.  Please skim the [https://github.com/wikimedia/jsonschema-tools#jsonschema-tools jsonschema-tools README] before proceeding.


jsonschema-tools is a NodeJS module, so you'll need a recent (Node 10 or greater) version of NodeJS and npm installed.  You can get NodeJS and npm at https://nodejs.org/en/.
jsonschema-tools is a NodeJS module, so you'll need a recent (Node 10 or greater) version of NodeJS and npm installed.  You can get NodeJS and npm at [https://nodejs.org/en/ nodejs.org]. Once installed, <code>cd</code> to the schema repository and run <code>npm install</code>. '''Heads-up''': the full path to the directory cannot contain spaces. For example, <code>~/Documents/analytics\ engineering/event\ schemas/primary</code> is likely to yield errors, but <code>~/Documents/analytics-engineering/event-schemas/primary</code> would be fine.
 
''*There are plenty of other schema technologies out there, (Avro, Thrift, etc.) but JSON and JSONSchema fit our use cases better than any of those. (For more information about how JSONSchema was chosen, see [https://phabricator.wikimedia.org/T198256 RFC: Modern Event Platform - Choose Schema Tech], https://techblog.wikimedia.org/2020/09/10/wikimedias-event-data-platform-or-json-is-ok-too/ and https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/.''


== Event Schema Design Rules and Conventions ==
== Event Schema Design Rules and Conventions ==
Line 31: Line 32:
     │   │   ├── 1.0.0 -> 1.0.0.yaml
     │   │   ├── 1.0.0 -> 1.0.0.yaml
     │   │   ├── 1.0.0.yaml
     │   │   ├── 1.0.0.yaml
     │   │   └── current.yaml
     │   │   ├── current.yaml
    │   │   └── latest -> 1.0.0
     │   └── release
     │   └── release
     │      ├── 1.0.0 -> 1.0.0.yaml
     │      ├── 1.0.0 -> 1.0.0.yaml
     │      ├── 1.0.0.yaml
     │      ├── 1.0.0.yaml
    │       ├── 1.0.1 -> 1.0.1.yaml
     │      ├── 1.0.1.yaml
     │      ├── 1.0.1.yaml
     │      └── current.yaml
     │      ├── current.yaml
     │       └── latest -> 1.0.1
     └── page_preview
     └── page_preview
         └── visibility_change
         └── visibility_change
Line 43: Line 47:
             ├── 2.0.0 -> 2.0.0.yaml
             ├── 2.0.0 -> 2.0.0.yaml
             ├── 2.0.0.yaml
             ├── 2.0.0.yaml
             └── current.yaml
             ├── current.yaml
            └── latest -> 2.0.0
 
</pre>
</pre>


Line 78: Line 84:
   "scripts": {
   "scripts": {
     "test": "mocha test/jsonschema",
     "test": "mocha test/jsonschema",
     "postinstall": "$(npm bin)/jsonschema-tools install-git-hook"
     "postinstall": "jsonschema-tools install-git-hook"
   },
   },
   "devDependencies": {
   "devDependencies": {
Line 108: Line 114:
Once you are working in a repository with jsonschema-tools, we can create new schemas.  By 'new schema', we mean a brand new schema lineage, not just a new schema version.  To create a new schema, we need to first decide on its title (and hierarchy), create the directory structure, write a new current.yaml schema file, and materialize the schema.  For this example, we'll create a new event schema that represents a Mediawiki UI button click.
Once you are working in a repository with jsonschema-tools, we can create new schemas.  By 'new schema', we mean a brand new schema lineage, not just a new schema version.  To create a new schema, we need to first decide on its title (and hierarchy), create the directory structure, write a new current.yaml schema file, and materialize the schema.  For this example, we'll create a new event schema that represents a Mediawiki UI button click.


''NOTE: since will be writing JSONSchema, you should probably know how to do that.  See this [http://json-schema.org/learn/getting-started-step-by-step.html tutorial] if this is your first time working with JSONSchema.''
''NOTE: since will be writing JSONSchema, you should probably know how to do that.  See this [http://json-schema.org/learn/getting-started-step-by-step.html tutorial] and [https://json-schema.org/understanding-json-schema/reference/index.html reference] for help working with JSONSchema.''


<syntaxhighlight lang=bash>
<syntaxhighlight lang=bash>
Line 146: Line 152:
</syntaxhighlight>
</syntaxhighlight>


==== Event meta data ====
==== Required event data ====
In addition to the <code>$schema</code> field, WMF has defined common fields for
event data. These common fields allow us to have some consistency all event data, and are also used to support backend functionality (deduplication, Hive table ingestion, etc.)
 
===== <code>$schema</code> =====
Each event needs to identify it's schema.  Right now we are just writing the schema, but later on your code
Each event needs to identify it's schema.  Right now we are just writing the schema, but later on your code
will produce JSON event data that conforms to this schema.  We need to be able to look up the schema
will produce JSON event data that conforms to this schema.  We need to be able to look up the schema
Line 162: Line 172:
</syntaxhighlight>
</syntaxhighlight>


In addition to the <code>$schema</code> field, WMF has defined common 'meta' fields for
===== Timestamps: <code>meta.dt</code> and <code>dt</code> =====
event data. These common fields allow us to have some consistency all event data.
These timestamps have different semantics, but in most cases they will be very close, if not the same.  These are both ISO-8601 UTC datetime strings, e.g. '2020-07-01T00:00:00Z'.
Every event happens at a certain date-time.  That event time is stored
in the <code>meta.dt</code> field as an ISO-8601 datetime string.


Also, every event should belong to a certain dataset or stream of eventsEach event needs to
Every event happens at a certain date-timeThat event time should be stored in the <code>dt</code> field.
specify which stream it belongs to. For example, the [https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/resource_change resource_change schema] is re-used in the `mediawiki.resource_change`, `transcludes.resource_change`, `change-prop.retry.resource_change`, etc. streams. You might want to design a generic button_clicked schema that is generic for all button clicks, but keep the different types of button click events in different streams.
We do this using the <code>meta.stream</code> field. (meta.stream is used for routing incoming events to specific streams and downstream 'datasets'. Each distinct meta.stream will correspond with certain Kafka topics and a Hive table.)


There are a few more common meta fields that WMF defines, but we don't need explain
<code>meta.dt</code> can be used as the event ingestion time, i.e. the time at which the intake system has received the event. Depending on the pipeline your event is flowing through, this might be set be different levels.  For events that are received first by our intake service (EventGate), this will be set by it, if it is not already set by the client.
 
NOTE: <code>meta.dt</code> will be used as the Kafka timestamp as well as for Hive hourly partitioning. If you don't have strict control over your event producers (e.g. remote browser clients), you should allow EventGate to fill in this field so that you don't end up with incorrect timestamps.
 
NOTE: As of 2020-12, these <tt>meta.dt</tt> and <tt>dt</tt> conventions are not fully adopted in existent schemas, but all new schemas should use these conventions. See [https://phabricator.wikimedia.org/T240460 T240460] and [https://phabricator.wikimedia.org/T267648 T267648].
 
===== <code>meta.stream</code> =====
Every event should belong to a named dataset. While events are in flight, this dataset is called a stream of events.  Each event needs to specify which stream it belongs to. For example, the [https://schema.wikimedia.org/repositories//primary/jsonschema/resource_change/current.yaml resource_change schema] is re-used in the `mediawiki.resource_change`, `transcludes.resource_change`, `change-prop.retry.resource_change`, etc. streams. You might want to design a generic button_clicked schema that is generic for all button clicks, but keep the different types of button click events in different streams.
We do this using the <code>meta.stream</code> field. (meta.stream is used for routing incoming events to specific streams and downstream 'datasets'. Each distinct meta.stream will correspond with certain Kafka topics and a Hive table.  In most cases, the Kafka topic will be the stream name prefixed with the datacenter name where the event was received.)
 
There are a few more common and optional meta fields that WMF defines, but we don't need explain
them all here.  For now we will write out just these 2 example <code>meta</code> fields.
them all here.  For now we will write out just these 2 example <code>meta</code> fields.
Later we will show how to include the event meta schema using <code>$ref</code>.
Later we will show how to include the event meta schema using <code>$ref</code>.
Line 283: Line 299:


== Modifying schemas ==
== Modifying schemas ==
Versioned schemas should be (mostly) immutable.  Once committed and merged, they may be used by many active producers and consumers.  Changing an existent version should not be done (unless you really know what you are doing).  Instead, to modify a schema you should just create a new backwards compatible version.
Versioned schemas should be (mostly) immutable.  Once committed and merged, they may be used by many active producers and consumers.  Changing an existent version should not be done (if you think you need to do it, get in touch with the Analytics or Core Platform Engineering teams).  Instead, to modify a schema you should just create a new backwards compatible version.


Let's add a user_id to our event data.  Edit <code>jsonschema/mediawiki/desktop/button/click/current.yaml</code> and add the following at the bottom of the schema.
Let's add a user_id to our event data.  Edit <code>jsonschema/mediawiki/desktop/button/click/current.yaml</code> and add the following at the bottom of the schema.
Line 326: Line 342:
When materializing schemas, jsonschema-tools will dereference any <code>$ref</code> pointers and merge any [https://json-schema.org/understanding-json-schema/reference/combining.html#allof <code>allOf</code>] it finds.  This allows us to DRY up common subschemas to avoid copy/paste bugs.  It may also potentially allow us to standardize common fields (<code>page_title</code>) into a data dictionary reference (TBD).
When materializing schemas, jsonschema-tools will dereference any <code>$ref</code> pointers and merge any [https://json-schema.org/understanding-json-schema/reference/combining.html#allof <code>allOf</code>] it finds.  This allows us to DRY up common subschemas to avoid copy/paste bugs.  It may also potentially allow us to standardize common fields (<code>page_title</code>) into a data dictionary reference (TBD).


For WMF, all event schemas should have a <code>$schema</code> event field, as well as use a common event meta sub object.  As of 2019-09, the common schema is in the [https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/common mediawiki/event-schemas repository] at [https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/common /common].  This process is WIP, but for now this documentation will describe how the mediawiki/event-schemas repository includes common schemas.
For WMF, all event schemas should have a <code>$schema</code> event field, as well as use a common event meta sub object.  The Wikimedia common schema is in the [https://schema.wikimedia.org/#!/primary/jsonschema primary schema repository] at [https://schema.wikimedia.org/repositories/primary/jsonschema/fragment/common /fragment/common].


{{Note|TODO figure out how multiple schema repositories should $ref the same /common schemas and document here. https://phabricator.wikimedia.org/T233432}}
In our example schema repository, assume we have a common schema at jsonschema/fragment/common/1.0.0 as:
 
In our example schema repository, assume we have a common schema at jsonschema/common/1.0.0 as:
<syntaxhighlight lang=yaml>
<syntaxhighlight lang=yaml>
title: common
title: common
Line 348: Line 362:
     type: object
     type: object
     required:
     required:
      - id
       - dt
       - dt
       - stream
       - stream
Line 362: Line 375:
       id:
       id:
         type: string
         type: string
        pattern: '^[a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12}$'
         maxLength: 36
         maxLength: 36
         description: Unique ID of this event
         description: Unique ID of this event
Line 392: Line 404:
type: object
type: object
allOf:
allOf:
- $ref: /common/1.0.0
- $ref: /fragment/common/1.0.0
- properties:
properties:
    button_name:
  button_name:
      type: string
    type: string
      description: Name of the button that was clicked
    description: Name of the button that was clicked
    page_title:
  page_title:
      type: string
    type: string
      description: Page the button appeared on when clicked
    description: Page the button appeared on when clicked
    user_id:
  user_id:
      type: string
    type: string
      description: ID of the user
    description: ID of the user


examples:
examples:
Line 412: Line 424:
<syntaxhighlight lang=bash>
<syntaxhighlight lang=bash>
git add ./jsonschema/mediawiki/desktop/button/click/current.yaml
git add ./jsonschema/mediawiki/desktop/button/click/current.yaml
git commit -m 'Using $ref to common in new vesrion mediawiki/desktop/button/click 1.2.0'
git commit -m 'Using $ref to common in new version mediawiki/desktop/button/click 1.2.0'
...
...
</syntaxhighlight>
</syntaxhighlight>


The newly materialized <code>./jsonschema/mediawiki/desktop/button/click/1.2.0.yaml</code> has both our schema and the included common schema:
The newly materialized <code>./jsonschema/mediawiki/desktop/button/click/1.2.0.yaml</code> has both our schema and the included common schema merged together:


<syntaxhighlight lang=yaml>
<syntaxhighlight lang=yaml>
Line 447: Line 459:
     type: object
     type: object
     required:
     required:
      - id
       - dt
       - dt
       - stream
       - stream
Line 461: Line 472:
       id:
       id:
         type: string
         type: string
        pattern: '^[a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12}$'
         maxLength: 36
         maxLength: 36
         description: Unique ID of this event
         description: Unique ID of this event
Line 489: Line 499:


=== How this works ===
=== How this works ===
When jsonschema-tools encounters a <code>$ref</code>, it will attempt to resolve it and then replace it with the resolved content.
When jsonschema-tools encounters a <code>$ref</code>, it will attempt to resolve it and then replace it with the resolved content.  After dereferencing, anything <code>allOf</code> is merged together with the top level schema fields to create a fully dereferenced and merged schema without any <code>$ref</code> or <code>allOf</code> keywords.


==== Absolute <code>$ref</code>====
==== Absolute <code>$ref</code>====
If the <code>$ref</code> starts with a URI protocol (http:// or file://), it will attempt to load it as is.
If the <code>$ref</code> starts with a URI protocol (http:// or file://), it will attempt to load it as is.
<code>$ref: http://schema-beta.wmflabs.org/repositories/mediawiki/jsonschema/common/1.0.0</code> will load the content at that URL.
<code>$ref: https://schema.wikimedia.org/repositories/primary/jsonschema/fragment/common/1.0.0</code> will load the content at that URL.


==== Relative to <code>baseSchemaUris</code>. ====
==== Relative to <code>baseSchemaUris</code>. ====
jsonschema-tools can be configured (in <code>[https://github.com/wikimedia/jsonschema-tools#jsonschema-tools-config-files .jsonschema-tools.yaml]</code> with multiple <code>baseSchemaUris</code>, the default of which is just the <code>schemaBasePath</code> (in our case, <code>./jsonschema</code>).  When a <code>$ref</code> starts with a slash (<code>/</code>), jsonschema-tools will iterate through each of thhe configured
jsonschema-tools can be configured (in <code>[https://github.com/wikimedia/jsonschema-tools#jsonschema-tools-config-files .jsonschema-tools.yaml]</code> with multiple <code>baseSchemaUris</code>, the default of which is just the <code>schemaBasePath</code> (in our case, <code>./jsonschema</code>).  When a <code>$ref</code> starts with a slash (<code>/</code>), jsonschema-tools will iterate through each of the configured
<code>baseSchemaUris</code>, prepend the base URI to the <code>$ref</code> value, and attempt to resolve it.  If your
<code>baseSchemaUris</code>, prepend the base URI to the <code>$ref</code> value, and attempt to resolve it.  If your
<code>baseSchemaUris: [./jsonschema, http://schema-beta.wmflabs.org/repositories]</code>, jsonschema-tools will look for your <code>$ref</code> path in both of those locations.
<code>baseSchemaUris: [./jsonschema, https://schema.wikimedia.org/repositories/primary/jsonschema/]</code>, jsonschema-tools will look for your <code>$ref</code> path in both of those locations.


== Testing schemas ==
== Testing schemas ==


jsonschema-tools comes with a series of tests that ensure your schema repository is nice and clean.  We showed how to install these tests in the section above about Creating a New Schema Repository.  These are mocha tests, so all we need to do is run <code>npm test</code>. These tests will ensure that your schema repository structure is correct, that your schemas have required fields, and that schema versions are backwards compatible.
jsonschema-tools comes with a series of tests that ensure your schema repository is nice and clean.  We showed how to install these tests in the section above about Creating a New Schema Repository.  These are mocha tests, so all we need to do is run <code>npm test</code>. These tests will ensure that your schema repository structure is correct, that your schemas have required fields, and that schema versions are backwards compatible.
[[Category:Event Platform]]

Latest revision as of 20:20, 20 September 2021

Motivation and Overview

Event Schemas are essential for an Event Streaming Platform. They allow disparate, continuously changing producers and consumers to reliably communicate with each other. By explicitly declaring the shape of data, schemas ease integration between various systems.

Schemas should be readily available to any producer or consumer code that might need them. Schemas are needed to validate data, but they can also be used to automate data integration tasks, e.g. automatically creating the SQL tables into which events will be imported. Access to those schemas should be reliable, and the schemas themselves immutable, for any given deployed service.

WMF uses JSON as our preferred in-flight data serialization format, and as such we have chosen* to use JSONSchema for our event schemas. Schema evolution is necessary to be able to reliably upgrade producer and consumer code, but unfortunately, JSONSchema does not have any built-in features for schema evolution. Therefore, each change (even a small one) requires the creation of a totally separate JSONSchema file.

WMF has chosen to distribute schemas using Git. This allows us to do development, CI, versioning and deployment for schemas the same way we do any code project. However, even though we use Git, we do not rely on Git history for schema versioning. Each schema version is an explicit static file in the schema repository. For more background, see RFC: Modern Event Platform: Schema Registry.

To make developing many schema version files in Git easier, WMF has developed the jsonschema-tools library. This tooling makes it easier for developers to design and evolve schemas dynamically, while allowing production services to use static and immutable versions of those schemas.

jsonschema-tools will be used in the rest of this documentation to set up and develop schemas in a Git schema repository. Please skim the jsonschema-tools README before proceeding.

jsonschema-tools is a NodeJS module, so you'll need a recent (Node 10 or greater) version of NodeJS and npm installed. You can get NodeJS and npm at nodejs.org. Once installed, cd to the schema repository and run npm install. Heads-up: the full path to the directory cannot contain spaces. For example, ~/Documents/analytics\ engineering/event\ schemas/primary is likely to yield errors, but ~/Documents/analytics-engineering/event-schemas/primary would be fine.

*There are plenty of other schema technologies out there (Avro, Thrift, etc.), but JSON and JSONSchema fit our use cases better than any of those. For more information about how JSONSchema was chosen, see RFC: Modern Event Platform - Choose Schema Tech, https://techblog.wikimedia.org/2020/09/10/wikimedias-event-data-platform-or-json-is-ok-too/ and https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/.

Event Schema Design Rules and Conventions

Event Platform/Schemas/Guidelines

Schema Repositories

A schema repository is a Git repository with a hierarchy of versioned JSONSchema files, with a file layout something like:

jsonschema
└── analytics
    ├── button
    │   ├── click
    │   │   ├── 1.0.0 -> 1.0.0.yaml
    │   │   ├── 1.0.0.yaml
    │   │   ├── current.yaml
    │   │   └── latest -> 1.0.0
    │   └── release
    │       ├── 1.0.0 -> 1.0.0.yaml
    │       ├── 1.0.0.yaml
    │       ├── 1.0.1 -> 1.0.1.yaml
    │       ├── 1.0.1.yaml
    │       ├── current.yaml
    │       └── latest -> 1.0.1
    └── page_preview
        └── visibility_change
            ├── 1.0.0 -> 1.0.0.yaml
            ├── 1.0.0.yaml
            ├── 2.0.0 -> 2.0.0.yaml
            ├── 2.0.0.yaml
            ├── current.yaml
            └── latest -> 2.0.0

JSONSchema has title and $id fields that we use to associate event data with a schema, as well as for semantically versioning schemas. The actual hierarchy layout shown here is arbitrary, but each schema's title and $id must match the layout in a specific way. More on this below.

Note the 'current.yaml' files. These files represent the current working version of the schema. The current schemas are never themselves used as a schema for validation or data integration. Instead, they are 'materialized' by jsonschema-tools into static versioned schema files. These versioned schema files are the canonical schemas used by event processing systems.
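In the layout above, each `latest` entry points at the highest materialized version in its directory. A minimal sketch of that selection follows. The helper name `latestVersion` is hypothetical, and this is an illustration of semantic version ordering, not how jsonschema-tools itself creates the link.

```javascript
// Hypothetical helper (not part of jsonschema-tools): given the file names in
// one schema directory, return the highest semantic version that has a
// materialized X.Y.Z.yaml file, or null if there is none.
function latestVersion(fileNames) {
  const versions = fileNames
    // Keep only materialized version files; this skips current.yaml,
    // the extensionless symlinks and the latest link itself.
    .map((name) => /^(\d+)\.(\d+)\.(\d+)\.yaml$/.exec(name))
    .filter((match) => match !== null)
    // Turn the three captured groups into numbers: [major, minor, patch].
    .map((match) => match.slice(1, 4).map(Number));

  if (versions.length === 0) {
    return null;
  }

  // Compare numerically, so that 10.0.0 sorts after 2.0.0.
  versions.sort((a, b) => a[0] - b[0] || a[1] - b[1] || a[2] - b[2]);
  return versions[versions.length - 1].join('.');
}

console.log(latestVersion(['1.0.0', '1.0.0.yaml', '1.0.1', '1.0.1.yaml', 'current.yaml']));
```

Comparing the parts as numbers rather than strings matters once any part reaches two digits.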

Hierarchy Rules

Each schema's title should match its relative path in the schema repository. E.g. all schema version files in namespace1/entity1/verbB should have title: namespace1/entity1/verbB. Each schema's $id field should be set to the path (starting with /) and (extensionless) version. E.g. namespace1/entity1/verbB/1.0.1.yaml should have $id: /namespace1/entity1/verbB/1.0.1.
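The title and $id rule can be sketched as a small function. The name `expectedTitleAndId` is hypothetical and only illustrates the convention; jsonschema-tools performs its own checks.

```javascript
const path = require('path');

// Hypothetical helper: derive the title and $id that a versioned schema file
// should declare, given its path relative to the schema base directory.
function expectedTitleAndId(relativeFile) {
  // The title is the file's parent directory, e.g. namespace1/entity1/verbB.
  const title = path.posix.dirname(relativeFile);
  // The version is the file name without its extension, e.g. 1.0.1.
  const version = path.posix.basename(relativeFile, path.posix.extname(relativeFile));
  // The $id is the title with a leading slash, followed by the version.
  return { title, $id: `/${title}/${version}` };
}

console.log(expectedTitleAndId('namespace1/entity1/verbB/1.0.1.yaml'));
```

A repository test could compare these expected values against the title and $id actually declared in each materialized file.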

This layout, combined with the title and $id, allows event data to point to its schema via a relative URI. By semantically versioning schema files, jsonschema-tools is able to associate schemas with the same title and enforce backwards compatibility. The relative and versioned $id URIs can also be used as JSON $ref links and with JSON Pointers. More on this below as well.

Creating a new schema repository

Most likely you will already be working with a schema repository. If so, skip to Creating a new schema or Modifying schemas.

jsonschema-tools is a NodeJS library and CLI for managing JSONSchema Git repositories. To create a new schema repository, you'll create a package.json file, install and configure jsonschema-tools, and set up jsonschema-tools tests.

mkdir my_schema_repository
cd my_schema_repository
git init .

# Our schemas will go in the jsonschema/ directory
mkdir jsonschema

# Create a configuration file for jsonschema-tools.
echo -e 'schemaBasePath: ./jsonschema/\nlogLevel: info' > .jsonschema-tools.yaml

# Create a package.json file.  (Modify this as desired.)
echo '
{
  "name": "my_schema_repository",
  "scripts": {
    "test": "mocha test/jsonschema",
    "postinstall": "jsonschema-tools install-git-hook"
  },
  "devDependencies": {
    "@wikimedia/jsonschema-tools": "^0.6.0",
    "mocha": "^6.2.0"
  }
}
' > package.json

# Install jsonschema-tools.  The npm postinstall script will install a git
# pre-commit hook to auto materialize versioned schema files when current
# schema files are modified.
npm install .

# Install jsonschema-tools tests.
mkdir -p test/jsonschema
echo "
'use strict';
require('@wikimedia/jsonschema-tools').tests.all({ logLevel: 'info' });
" > test/jsonschema/repository.test.js

# Create the first git commit.
echo 'node_modules**' >> .gitignore
git add .
git commit -m 'New schema repository'

Creating a new schema

Once you are working in a repository with jsonschema-tools, we can create new schemas. By 'new schema', we mean a brand new schema lineage, not just a new schema version. To create a new schema, we need to first decide on its title (and hierarchy), create the directory structure, write a new current.yaml schema file, and materialize the schema. For this example, we'll create a new event schema that represents a Mediawiki UI button click.

NOTE: since you will be writing JSONSchema, you should probably know how to do that. See this tutorial and reference for help working with JSONSchema.

mkdir -p jsonschema/mediawiki/desktop/button/click

Open jsonschema/mediawiki/desktop/button/click/current.yaml. We'll build this up piece by piece and explain each part.

Schema meta data

First we need some schema meta data that describe and identify the schema. Note that this schema meta data is not describing any aspect of your event data.

# This is the title of the schema.
# It should match the relative path to this file's parent directory.
title: mediawiki/desktop/button/click

# Document what the schema represents.
description: Mediawiki desktop web button clicked

# The $id uniquely identifies this schema.  It should be a versioned (and extensionless) URI.
$id: /mediawiki/desktop/button/click/1.0.0

# This is the meta-schema of this schema.  This should probably always be the same
# for every schema, and should point to the main JSONSchema meta-schema at json-schema.org.
$schema: https://json-schema.org/draft-07/schema#


Event fields

...continuing on to event data fields. Your event should be a JSON object with each field explicitly declared here.

type: object
additionalProperties: false
properties:

Required event data

In addition to the $schema field, WMF has defined common fields for event data. These common fields give us some consistency across all event data, and are also used to support backend functionality (deduplication, Hive table ingestion, etc.).

$schema

Each event needs to identify its schema. Right now we are just writing the schema, but later on your code will produce JSON event data that conforms to this schema. We need to be able to look up the schema for any given event just from the event data itself. To do this, we re-use the JSONSchema $schema field in the event properties.

  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be
      a short URI containing only the name and version at the end of the
      URI path.  e.g. /schema_name/1.0.0 is acceptable. This should match
      the schema's $id field.

Timestamps: meta.dt and dt

The dt and meta.dt timestamps have different semantics, but in most cases they will be very close, if not the same. Both are ISO-8601 UTC datetime strings, e.g. '2020-07-01T00:00:00Z'.

Every event happens at a certain date-time. That event time should be stored in the dt field.

meta.dt can be used as the event ingestion time, i.e. the time at which the intake system has received the event. Depending on the pipeline your event is flowing through, this might be set at different levels. For events that are received first by our intake service (EventGate), EventGate will set it if it is not already set by the client.

NOTE: meta.dt will be used as the Kafka timestamp as well as for Hive hourly partitioning. If you don't have strict control over your event producers (e.g. remote browser clients), you should allow EventGate to fill in this field so that you don't end up with incorrect timestamps.

NOTE: As of 2020-12, these meta.dt and dt conventions are not fully adopted in existing schemas, but all new schemas should use them. See T240460 and T267648.
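
To make the difference concrete, here is a small sketch of the "fill in meta.dt only if the client did not set it" behaviour described above. The helper names iso_utc and fill_meta_dt are made up for illustration; this is not EventGate's actual code.

```python
from datetime import datetime, timezone


def iso_utc(moment):
    """Format an aware datetime as an ISO-8601 UTC string ending in 'Z'."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def fill_meta_dt(event, received_at):
    """Set meta.dt to the receive time unless the producer already set it."""
    meta = event.setdefault("meta", {})
    meta.setdefault("dt", iso_utc(received_at))
    return event


# dt is the time the event happened; meta.dt is left for the intake side.
event = {
    "dt": "2020-07-01T00:00:00Z",
    "meta": {"stream": "mediawiki.desktop.button-click"},
}
received = datetime(2020, 7, 1, 0, 0, 2, tzinfo=timezone.utc)
fill_meta_dt(event, received)
print(event["meta"]["dt"])  # prints 2020-07-01T00:00:02Z
```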

meta.stream

Every event should belong to a named dataset. While events are in flight, this dataset is called a stream of events, and each event needs to specify which stream it belongs to. A single schema can be re-used by several streams: for example, the resource_change schema is used in the `mediawiki.resource_change`, `transcludes.resource_change`, `change-prop.retry.resource_change`, etc. streams. Likewise, you might design one generic button_clicked schema for all button clicks, but keep the different types of button click events in different streams. We do this using the meta.stream field, which is used for routing incoming events to specific streams and downstream 'datasets'. Each distinct meta.stream will correspond with certain Kafka topics and a Hive table. In most cases, the Kafka topic will be the stream name prefixed with the datacenter name where the event was received.
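
As a rough illustration of the stream-to-topic relationship, the sketch below builds a topic name from meta.stream and a datacenter name. The function kafka_topic_for, the dot separator, and the datacenter value "eqiad" are assumptions made for this example only.

```python
def kafka_topic_for(event, datacenter):
    """Return the topic an event would land in, per the convention above."""
    stream = event["meta"]["stream"]
    # Assumption: the datacenter prefix is joined to the stream with a dot.
    return "%s.%s" % (datacenter, stream)


event = {"meta": {"stream": "mediawiki.resource_change"}}
print(kafka_topic_for(event, "eqiad"))  # prints eqiad.mediawiki.resource_change
```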

There are a few more common and optional meta fields that WMF defines, but we don't need to explain them all here. For now we will write out just these two example meta fields. Later we will show how to include the event meta schema using $ref.

  ### Meta data object.  All event schemas should have this.
  meta:
    type: object
    properties:
      dt:
        type: string
        # Whenever a format is used on a field, we require that maxLength is also set.
        # See https://github.com/epoberezkin/ajv#security-risks-of-trusted-schemas
        format: date-time
        maxLength: 128
        description: Time stamp of the event, in ISO-8601 format
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
    required:
      - dt
      - stream

Event data fields

Finally we can add any fields that we really want our event to have.

  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked

The new schema

Here is the new schema we just wrote:

title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.0.0
$schema: https://json-schema.org/draft-07/schema#
type: object
properties:
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be
      a short URI containing only the name and version at the end of the
      URI path.  e.g. /schema_name/1.0.0 is acceptable. This often will
      (and should) match the schema's $id field.
  ### Meta data object.  All event schemas should have this.
  meta:
    type: object
    properties:
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: Time stamp of the event, in ISO-8601 format
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
    required:
      - dt
      - stream
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked

examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.0.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser"}

Note the examples field. It is optional, but can be nice if you want to give schema readers an example of what you expect event data to look like. Notice how the event's $schema matches exactly the schema's $id.
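
The relationship between the example event and the schema can be spot-checked with a few lines of code. This is a deliberately partial sketch: check_event is a hypothetical helper that only looks at the $schema/$id match, the required meta fields, and top-level string types. It is not a JSONSchema validator, and the schema below is a trimmed copy of the one above.

```python
def check_event(event, schema):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if event.get("$schema") != schema["$id"]:
        problems.append("$schema does not match the schema's $id")
    meta_schema = schema["properties"]["meta"]
    meta = event.get("meta", {})
    for field in meta_schema.get("required", []):
        if field not in meta:
            problems.append("meta.%s is required" % field)
    for name, value in event.items():
        declared = schema["properties"].get(name)
        if declared is None:
            problems.append("%s is not declared in the schema" % name)
        elif declared.get("type") == "string" and not isinstance(value, str):
            problems.append("%s should be a string" % name)
    return problems


schema = {
    "$id": "/mediawiki/desktop/button/click/1.0.0",
    "properties": {
        "$schema": {"type": "string"},
        "meta": {
            "type": "object",
            "properties": {"dt": {"type": "string"}, "stream": {"type": "string"}},
            "required": ["dt", "stream"],
        },
        "button_name": {"type": "string"},
        "page_title": {"type": "string"},
    },
}
event = {
    "$schema": "/mediawiki/desktop/button/click/1.0.0",
    "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click"},
    "button_name": "Edit source",
    "page_title": "Delayed-choice quantum eraser",
}
print(check_event(event, schema))  # prints []
```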

Materializing the schema

jsonschema-tools calls the process of dereferencing, merging and generating the static versioned files 'materializing'. So far, we've saved our new schema as ./jsonschema/mediawiki/desktop/button/click/current.yaml. current.yaml will be the 'current working copy' of a schema. It can contain $ref URI pointers (more on this below). Any changes we make to schemas should always be done on their current.yaml files. We'll use jsonschema-tools to materialize current.yaml into a statically versioned schema file.

When we set up our schema repository, we installed a Git pre-commit hook to auto-materialize schemas. So, if we commit our new schema, the hook materializes it and prints output like the following:

git add ./jsonschema/mediawiki/desktop/button/click/current.yaml
git commit -m 'Created mediawiki/desktop/button/click 1.0.0'

[2019-09-19 16:24:53.057 +0000]: Looking for modified current.yaml schema files in ./jsonschema/
[2019-09-19 16:24:53.093 +0000]: Materializing /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/current.yaml...
[2019-09-19 16:24:53.097 +0000]: Dereferencing schema with $id /mediawiki/desktop/button/click/1.0.0 using schema base URIs ./jsonschema/
[2019-09-19 16:24:53.120 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0.yaml.
[2019-09-19 16:24:53.121 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0.json.
[2019-09-19 16:24:53.122 +0000]: Created extensionless symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0 -> 1.0.0.yaml.
[2019-09-19 16:24:53.123 +0000]: New schema files have been materialized. Adding them to git: /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0.yaml,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.0.0.json
[master ac7b60d] Created mediawiki/desktop/button/click 1.0.0
 Date: Thu Sep 19 10:52:26 2019 -0400
 5 files changed, 130 insertions(+)
 create mode 120000 jsonschema/mediawiki/desktop/button/click/1.0.0
 create mode 100644 jsonschema/mediawiki/desktop/button/click/1.0.0.json
 create mode 100644 jsonschema/mediawiki/desktop/button/click/1.0.0.yaml
 create mode 100644 jsonschema/mediawiki/desktop/button/click/current.yaml

jsonschema-tools will notice any newly modified current.yaml and materialize them on git commit. The version to materialize will be obtained from the value of $id in current.yaml. Both yaml and json (by default) files will be materialized, and the versioned extensionless symlink will point to the versioned yaml file (by default).
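
The file names in the log output above follow directly from the $id. The sketch below reproduces that mapping; materialized_paths is a hypothetical helper, and it assumes the target directory mirrors the $id path under the schema base path, as it does in this example.

```python
import posixpath


def materialized_paths(schema_base_path, schema_id, content_types=("yaml", "json")):
    """Return (versioned files, (symlink path, symlink target)) for a schema $id."""
    # "/mediawiki/desktop/button/click/1.0.0" -> directory and version.
    directory, version = posixpath.split(schema_id.lstrip("/"))
    target_dir = posixpath.join(schema_base_path, directory)
    files = [
        posixpath.join(target_dir, "%s.%s" % (version, extension))
        for extension in content_types
    ]
    # The extensionless symlink points at the first content type (yaml here).
    symlink = (
        posixpath.join(target_dir, version),
        "%s.%s" % (version, content_types[0]),
    )
    return files, symlink


files, symlink = materialized_paths(
    "./jsonschema", "/mediawiki/desktop/button/click/1.0.0"
)
for path in files:
    print(path)
print("%s -> %s" % symlink)
```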

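Alternatively, you can manually materialize a schema using the jsonschema-tools CLI. See $(npm bin)/jsonschema-tools --help for more information (on newer npm versions, where the npm bin command is no longer available, npx jsonschema-tools --help should work instead).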

Modifying schemas

Versioned schemas should be (mostly) immutable. Once committed and merged, they may be used by many active producers and consumers. An existing version should not be changed (if you think you need to do so, get in touch with the Analytics or Core Platform Engineering teams). Instead, to modify a schema, create a new backwards compatible version.

Let's add a user_id to our event data. Edit jsonschema/mediawiki/desktop/button/click/current.yaml and add the following at the bottom of the schema.

# ...
  user_id:
    type: string
    description: ID of the user

# Add a user_id to our examples field too (its $schema refers to the new 1.1.0 version set below):
examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.1.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser", "user_id": "123"}

Since we've changed the schema, we MUST manually change the version in the schema's $id field. According to semantic versioning, our addition of the user_id field should be a minor version increment. So change $id to:

$id: /mediawiki/desktop/button/click/1.1.0
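
If you prefer to compute the next version rather than edit it by hand, the semantic versioning rule is easy to sketch. bump_schema_id is a hypothetical helper, not a jsonschema-tools feature; you still have to write the result into current.yaml yourself.

```python
def bump_schema_id(schema_id, part="minor"):
    """Return schema_id with its trailing semantic version incremented."""
    path, _, version = schema_id.rpartition("/")
    major, minor, patch = (int(number) for number in version.split("."))
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError("part must be 'major', 'minor' or 'patch'")
    return "%s/%d.%d.%d" % (path, major, minor, patch)


# Adding an optional field is a minor version increment.
print(bump_schema_id("/mediawiki/desktop/button/click/1.0.0"))
# prints /mediawiki/desktop/button/click/1.1.0
```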

Since we've changed the version, jsonschema-tools will materialize new 1.1.0 version files on git commit:

git add ./jsonschema/mediawiki/desktop/button/click/current.yaml
git commit -m 'Added user_id and created mediawiki/desktop/button/click 1.1.0'

[2019-09-19 16:24:53.057 +0000]: Looking for modified current.yaml schema files in ./jsonschema/
[2019-09-19 16:24:53.093 +0000]: Materializing /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/current.yaml...
[2019-09-19 16:24:53.097 +0000]: Dereferencing schema with $id /mediawiki/desktop/button/click/1.1.0 using schema base URIs ./jsonschema/
[2019-09-19 16:24:53.120 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.yaml.
[2019-09-19 16:24:53.121 +0000]: Materialized schema at /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.json.
[2019-09-19 16:24:53.122 +0000]: Created extensionless symlink /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0 -> 1.1.0.yaml.
[2019-09-19 16:24:53.123 +0000]: New schema files have been materialized. Adding them to git: /home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.yaml,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0,/home/user/my_schema_repository/jsonschema/mediawiki/desktop/button/click/1.1.0.json
[master 827d1a6] Added user_id and created mediawiki/desktop/button/click 1.1.0
 4 files changed, 106 insertions(+), 2 deletions(-)
 create mode 120000 jsonschema/mediawiki/desktop/button/click/1.1.0
 create mode 100644 jsonschema/mediawiki/desktop/button/click/1.1.0.json
 create mode 100644 jsonschema/mediawiki/desktop/button/click/1.1.0.yaml

Including sub schemas

When materializing schemas, jsonschema-tools will dereference any $ref pointers and merge any allOf it finds. This allows us to DRY up common subschemas to avoid copy/paste bugs. It may also potentially allow us to standardize common fields (e.g. page_title) into a data dictionary reference (TBD).

For WMF, all event schemas should have a $schema event field, as well as use a common event meta sub object. The Wikimedia common schema is in the primary schema repository (https://schema.wikimedia.org/#!/primary/jsonschema) at /fragment/common.

In our example schema repository, assume we have a common schema at jsonschema/fragment/common/1.0.0 as:

title: fragment/common
description: Common schema fields for all WMF schemas
$id: /fragment/common/1.0.0
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
properties:
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be a short
      URI containing only the name and revision at the end of the URI path. 
      e.g. /schema_name/1.0.0 is acceptable. This should match
      the schema's $id field.
  meta:
    type: object
    required:
      - dt
      - stream
    properties:
      uri:
        type: string
        format: uri-reference
        maxLength: 8192
        description: Unique URI identifying the event / resource
      request_id:
        type: string
        description: Unique ID of the request that caused the event
      id:
        type: string
        maxLength: 36
        description: Unique ID of this event
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: 'Time stamp of the event, in ISO-8601 format'
      domain:
        type: string
        description: Domain the event pertains to
        minLength: 1
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
        minLength: 1
required:
  - $schema
  - meta

We want to include this schema (including its required properties) in our button/click example schema. Let's make a new version of our schema and include the common schema using $ref. Edit jsonschema/mediawiki/desktop/button/click/current.yaml to read:

title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.2.0
$schema: https://json-schema.org/draft-07/schema#
type: object
allOf:
- $ref: /fragment/common/1.0.0
properties:
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked
  user_id:
    type: string
    description: ID of the user

examples:
  - {"$schema": "/mediawiki/desktop/button/click/1.2.0", "meta": {"dt": "2019-01-01T00:00:00Z", "stream": "mediawiki.desktop.button-click", "id": "12345678-1234-5678-1234-567812345678"}, "button_name": "Edit source", "page_title": "Delayed-choice quantum eraser", "user_id": "123"}

Notice that we've bumped the version number in $id again to 1.2.0. Commit and materialize this new schema.

git add ./jsonschema/mediawiki/desktop/button/click/current.yaml
git commit -m 'Using $ref to common in new version mediawiki/desktop/button/click 1.2.0'
...

The newly materialized ./jsonschema/mediawiki/desktop/button/click/1.2.0.yaml has both our schema and the included common schema merged together:

title: mediawiki/desktop/button/click
description: Mediawiki desktop web button clicked
$id: /mediawiki/desktop/button/click/1.2.0
$schema: 'https://json-schema.org/draft-07/schema#'
type: object
examples:
  - $schema: /mediawiki/desktop/button/click/1.2.0
    meta:
      dt: '2019-01-01T00:00:00Z'
      stream: mediawiki.desktop.button-click
      id: 12345678-1234-5678-1234-567812345678
    button_name: Edit source
    page_title: Delayed-choice quantum eraser
    user_id: '123'
required:
  - $schema
  - meta
properties:
  $schema:
    type: string
    description: >
      The URI identifying the JSONSchema for this event. This should be a short
      URI containing only the name and revision at the end of the URI path. e.g.
      /schema_name/1.0.0 is acceptable. This should match the schema's $id
      field.
  meta:
    type: object
    required:
      - dt
      - stream
    properties:
      uri:
        type: string
        format: uri-reference
        maxLength: 8192
        description: Unique URI identifying the event / resource
      request_id:
        type: string
        description: Unique ID of the request that caused the event
      id:
        type: string
        maxLength: 36
        description: Unique ID of this event
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: 'Time stamp of the event, in ISO-8601 format'
      domain:
        type: string
        description: Domain the event pertains to
        minLength: 1
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
        minLength: 1
  button_name:
    type: string
    description: Name of the button that was clicked
  page_title:
    type: string
    description: Page the button appeared on when clicked
  user_id:
    type: string
    description: ID of the user

How this works

When jsonschema-tools encounters a $ref, it will attempt to resolve it and then replace it with the resolved content. After dereferencing, anything in allOf is merged together with the top level schema fields to create a fully dereferenced and merged schema without any $ref or allOf keywords.
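
The merge step can be pictured with the simplified sketch below. It is not jsonschema-tools' implementation: merge_all_of is a hypothetical function that assumes every $ref has already been replaced by its content, handles only one level of allOf, merges properties shallowly, and unions the required lists. Top level fields such as title and $id win over those of the included schemas.

```python
import copy


def merge_all_of(schema):
    """Merge already-dereferenced allOf entries into the top level schema."""
    parts = list(schema.get("allOf", []))
    # The top level schema goes last so that its own fields take precedence.
    parts.append({key: value for key, value in schema.items() if key != "allOf"})
    merged = {}
    for part in parts:
        for key, value in part.items():
            if key == "properties":
                merged.setdefault("properties", {}).update(copy.deepcopy(value))
            elif key == "required":
                required = merged.setdefault("required", [])
                for name in value:
                    if name not in required:
                        required.append(name)
            else:
                merged[key] = copy.deepcopy(value)
    return merged


common = {
    "title": "fragment/common",
    "$id": "/fragment/common/1.0.0",
    "type": "object",
    "properties": {"$schema": {"type": "string"}, "meta": {"type": "object"}},
    "required": ["$schema", "meta"],
}
click = {
    "title": "mediawiki/desktop/button/click",
    "$id": "/mediawiki/desktop/button/click/1.2.0",
    "type": "object",
    "allOf": [common],  # stands in for the resolved $ref
    "properties": {"button_name": {"type": "string"}},
}
merged = merge_all_of(click)
print(merged["$id"])                  # prints /mediawiki/desktop/button/click/1.2.0
print(sorted(merged["properties"]))   # prints ['$schema', 'button_name', 'meta']
print(merged["required"])             # prints ['$schema', 'meta']
```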

Absolute $ref

If the $ref starts with a URI scheme (e.g. http:// or file://), jsonschema-tools will attempt to load it as is. For example, $ref: https://schema.wikimedia.org/repositories/primary/jsonschema/fragment/common/1.0.0 will load the content at that URL.

Relative to baseSchemaUris

jsonschema-tools can be configured (in .jsonschema-tools.yaml) with multiple baseSchemaUris, the default of which is just the schemaBasePath (in our case, ./jsonschema). When a $ref starts with a slash (/), jsonschema-tools will iterate through each of the configured baseSchemaUris, prepend the base URI to the $ref value, and attempt to resolve it. For example, if baseSchemaUris is [./jsonschema, https://schema.wikimedia.org/repositories/primary/jsonschema/], jsonschema-tools will look for your $ref path in both of those locations.
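
The two lookup rules can be summarized as a list of candidate locations for a given $ref. This is only a sketch: candidate_uris is a hypothetical function, and both the slash handling when joining and the order in which locations are tried are assumptions. It builds the list but does not load anything.

```python
def candidate_uris(ref, base_schema_uris):
    """Return the locations that would be tried for a $ref, in order."""
    if "://" in ref:
        # Absolute $ref: loaded as is.
        return [ref]
    if ref.startswith("/"):
        # Relative to baseSchemaUris: prepend each configured base URI.
        return [base.rstrip("/") + ref for base in base_schema_uris]
    # Other forms of $ref are outside the scope of this sketch.
    return [ref]


base_schema_uris = [
    "./jsonschema",
    "https://schema.wikimedia.org/repositories/primary/jsonschema/",
]
for uri in candidate_uris("/fragment/common/1.0.0", base_schema_uris):
    print(uri)
```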

Testing schemas

jsonschema-tools comes with a series of tests that ensure your schema repository is nice and clean. We showed how to install these tests in the section above about Creating a New Schema Repository. These are mocha tests, so all we need to do is run npm test. These tests will ensure that your schema repository structure is correct, that your schemas have required fields, and that schema versions are backwards compatible.
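
To give a feel for what a backwards compatibility test looks for, here is a simplified sketch. compatibility_problems is a hypothetical function and the rules it applies (no field removed, no field type changed) are an assumption for illustration; the actual checks performed by the jsonschema-tools tests may differ.

```python
def compatibility_problems(old, new, path=""):
    """List the ways in which schema `new` breaks readers of schema `old`."""
    problems = []
    new_properties = new.get("properties", {})
    for name, old_field in old.get("properties", {}).items():
        field_path = path + name
        new_field = new_properties.get(name)
        if new_field is None:
            problems.append("%s was removed" % field_path)
        elif old_field.get("type") != new_field.get("type"):
            problems.append("%s changed type" % field_path)
        elif old_field.get("type") == "object":
            # Recurse into nested objects such as meta.
            problems.extend(
                compatibility_problems(old_field, new_field, field_path + ".")
            )
    return problems


v1_0_0 = {"properties": {
    "button_name": {"type": "string"},
    "page_title": {"type": "string"},
}}
# Adding user_id keeps every existing field intact.
v1_1_0 = {"properties": dict(v1_0_0["properties"], user_id={"type": "string"})}
print(compatibility_problems(v1_0_0, v1_1_0))  # prints []
```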