
Event Platform/Schemas

Revision as of 20:49, 18 September 2019 by imported>Ottomata

Motivation and Overview

Event Schemas are the core of an Event Streaming Platform. They allow disparate, continuously changing producers and consumers to communicate reliably with each other. By explicitly declaring the shape of data, schemas ease integration between various systems.

At WMF, we use JSON as our preferred data serialization format, and as such we have chosen to use JSONSchema for our event schemas. There are plenty of other schema technologies out there (Avro, Thrift, etc.), but JSON and JSONSchema fit our use cases better than any of those. (For more information about how JSONSchema was chosen, see RFC: Modern Event Platform - Choose Schema Tech and https://blog.wikimedia.org/2017/01/13/json-hadoop-kafka/.)

JSONSchema provides powerful data validation, but unlike Avro, it does not provide schema evolution. That is, each schema is distinct, and there is no way to explicitly declare that a given schema is just a new version of another. Schema evolution is necessary to be able to reliably upgrade producer and consumer code.

Schemas should also be easily available to any producer or consumer code that might need them. Schemas are needed to validate data, but they can also be used to automate data integration tasks, e.g. auto creation of SQL tables into which events will be imported. Access to schemas should also be reliable, and the schemas themselves immutable, for any given deployed service.
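As a rough illustration of the table auto-creation idea mentioned above, here is a minimal sketch that turns a flat JSONSchema into a CREATE TABLE statement. The function name createTableStatement, the type mapping, and the choice to mark required fields NOT NULL are all assumptions made for this example. It does not describe how any WMF system actually imports events.

```javascript
'use strict';

// Assumed, simplified mapping from JSONSchema types to SQL column types.
const SQL_TYPES = {
    string: 'TEXT',
    integer: 'BIGINT',
    number: 'DOUBLE PRECISION',
    boolean: 'BOOLEAN'
};

// Hypothetical helper: build a CREATE TABLE statement from the top-level
// properties of a schema. Nested objects and arrays are not handled.
function createTableStatement(tableName, schema) {
    const required = new Set(schema.required || []);
    const columns = Object.entries(schema.properties || {}).map(([name, prop]) => {
        const sqlType = SQL_TYPES[prop.type];
        if (!sqlType) {
            throw new Error(`Unsupported type for ${name}: ${prop.type}`);
        }
        // Assumption: required fields become NOT NULL columns.
        const nullability = required.has(name) ? ' NOT NULL' : '';
        return `  ${name} ${sqlType}${nullability}`;
    });
    return `CREATE TABLE ${tableName} (\n${columns.join(',\n')}\n);`;
}
```

A real integration would also need to handle nested fields and schema changes between versions, which is one reason explicit schema versioning matters.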

WMF has chosen to distribute schemas using git. This allows us to do development, CI, versioning and deployment for schemas the same way we do any code project. However, even though we use git, we do not rely on git history for schema versioning. Each schema version is an explicit static file in the schema repository. For more background, see RFC: Modern Event Platform: Schema Registry.

WMF has developed the jsonschema-tools library to aid in developing versioned, backwards-compatible schemas in git.

Schema Repositories

A schema repository is a git repository with a hierarchy of versioned JSONSchema files, with a layout something like:

jsonschema
└── namespace1
    ├── entity1
    │   ├── verbA
    │   │   ├── 1.0.0 -> 1.0.0.yaml
    │   │   ├── 1.0.0.yaml
    │   │   └── current.yaml
    │   └── verbB
    │       ├── 1.0.0 -> 1.0.0.yaml
    │       ├── 1.0.0.yaml
    │       ├── 1.0.1.yaml
    │       └── current.yaml
    └── entity2
        └── verbC
            ├── 1.0.0 -> 1.0.0.yaml
            ├── 1.0.0.yaml
            ├── 2.0.0 -> 2.0.0.yaml
            ├── 2.0.0.yaml
            └── current.yaml

JSONSchema has title and $id fields that we use to associate event data with a schema and to semantically version schemas. The hierarchy layout shown here is arbitrary, but each schema version file's title and $id must match the layout in a specific way. More on this below.

Note the 'current.yaml' files. These files represent the current working schema. Current schema files are never themselves used as schemas for validation or data integration; instead, they are 'materialized' by jsonschema-tools into static versioned schema files. These versioned schema files are the canonical schemas used by event processing systems.

Hierarchy Rules

Each schema's title should match its path in the schema repository. E.g. all schema version files in namespace1/entity1/verbB should have title: namespace1/entity1/verbB. Each schema's $id field should be set to the path (starting with /) and (extensionless) version. E.g. namespace1/entity1/verbB/1.0.1.yaml should have $id: /namespace1/entity1/verbB/1.0.1.
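The two rules above can be expressed as a small check. The following is a minimal sketch, not part of jsonschema-tools: expectedTitleAndId and checkSchema are hypothetical helper names, and the path passed in is assumed to be relative to the schema base directory (jsonschema/ in the layout above).

```javascript
'use strict';
const path = require('path');

// Hypothetical helper: derive the title and $id that the hierarchy rules
// imply for a versioned schema file, given its path relative to the schema
// base directory, e.g. 'namespace1/entity1/verbB/1.0.1.yaml'.
function expectedTitleAndId(relativePath) {
    // dir is the schema's directory; name is the file name without extension.
    const { dir, name } = path.posix.parse(relativePath);
    return { title: dir, $id: `/${dir}/${name}` };
}

// Hypothetical helper: return a list of rule violations for a parsed schema.
function checkSchema(relativePath, schema) {
    const expected = expectedTitleAndId(relativePath);
    const errors = [];
    if (schema.title !== expected.title) {
        errors.push(`title should be ${expected.title}`);
    }
    if (schema.$id !== expected.$id) {
        errors.push(`$id should be ${expected.$id}`);
    }
    return errors;
}
```

For example, checkSchema('namespace1/entity1/verbB/1.0.1.yaml', schema) returns an empty list when the schema has title: namespace1/entity1/verbB and $id: /namespace1/entity1/verbB/1.0.1.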

This layout, combined with the title and $id, allows event data to point to a specific schema via a URI. Because schema files are semantically versioned, jsonschema-tools is able to associate schemas that share a title and enforce backwards compatibility between them.
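To make the association by title concrete, here is a minimal sketch that splits schema URIs into title and version and groups the versions of each title in ascending semantic-version order. The helper names (parseSchemaUri, compareVersions, groupByTitle) are hypothetical and are not the jsonschema-tools API. The compatibility rules that jsonschema-tools enforces between versions are not reproduced here.

```javascript
'use strict';

// Split a schema URI such as '/namespace1/entity1/verbB/1.0.1' into its
// title ('namespace1/entity1/verbB') and numeric version ([1, 0, 1]).
function parseSchemaUri(uri) {
    const match = /^\/(.+)\/(\d+)\.(\d+)\.(\d+)$/.exec(uri);
    if (!match) {
        throw new Error(`Not a versioned schema URI: ${uri}`);
    }
    const [, title, major, minor, patch] = match;
    return { title, version: [major, minor, patch].map(Number) };
}

// Compare two [major, minor, patch] arrays numerically, so that 1.10.0
// sorts after 1.2.0 (a plain string comparison would get this wrong).
function compareVersions(a, b) {
    for (let i = 0; i < 3; i++) {
        if (a[i] !== b[i]) {
            return a[i] - b[i];
        }
    }
    return 0;
}

// Group versions by schema title, each group sorted oldest to newest.
function groupByTitle(uris) {
    const groups = new Map();
    for (const uri of uris) {
        const { title, version } = parseSchemaUri(uri);
        if (!groups.has(title)) {
            groups.set(title, []);
        }
        groups.get(title).push(version);
    }
    for (const versions of groups.values()) {
        versions.sort(compareVersions);
    }
    return groups;
}
```

With versions grouped this way, a tool can compare each version of a title against its predecessor, which is the kind of check that semantic versioning makes possible.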

Creating a new schema repository

jsonschema-tools is a NodeJS library and CLI for managing JSONSchema git repositories. To create a new schema repository, you'll create a package.json file, install and configure jsonschema-tools, and set up jsonschema-tools tests.

mkdir my_schema_repository
cd my_schema_repository
git init .

# Our schemas will go in the jsonschema/ directory
mkdir jsonschema

# Create a configuration file for jsonschema-tools.
echo 'schemaBasePath: ./jsonschema/' > .jsonschema-tools.yaml

# Create a package.json file.  (Modify this as desired.)
echo '
{
  "name": "my_schema_repository",
  "scripts": {
    "test": "mocha test/jsonschema",
    "postinstall": "$(npm bin)/jsonschema-tools install-git-hook"
  },
  "devDependencies": {
    "@wikimedia/jsonschema-tools": "^0.4.2",
    "mocha": "^6.2.0"
  }
}
' > package.json

# Install jsonschema-tools.  The npm postinstall script will install a git
# pre-commit hook to auto materialize versioned schema files when current
# schema files are modified.
npm install .

# Install jsonschema-tools tests.
mkdir -p test/jsonschema
echo "
'use strict';
require('@wikimedia/jsonschema-tools').tests.all({ logLevel: 'warn' });
" > test/jsonschema/repository.test.js


Creating a new schema