Event Platform/Schemas/Guidelines

From Wikitech-static
Revision as of 20:57, 19 September 2019 by imported>Ottomata (→‎Conventions)

This page documents conventions, guidelines, and rules for designing and evolving event schemas.

Rules

Required fields

All schemas must have the following fields.

$schema
A URI identifying the JSONSchema for this event. This should match an event schema $id in a schema repository. E.g. /schema_name/1.0.0
meta
Event metadata sub-object. This field contains common metadata fields for an event.
meta.stream
Name of the stream/queue that this event belongs in. This is used for routing incoming events to specific streams and downstream 'datasets'. This will likely correspond with a Kafka topic and a Hive table.
meta.dt
Timestamp of the event, in ISO-8601 format. This is the 'event time'. This field will be used as the Kafka message timestamp and for time-based partitioning of Hive tables.
meta.id
UUID of this event. This field is used for deduplication, so every event (and especially every event in the same stream) needs a unique id.
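
As a rough sketch, the required fields above could be declared in a schema along the following lines. The descriptions are paraphrased from this page, and marking every listed field as required is inferred from the rule above rather than copied from an existing schema.

```yaml
# Sketch only; /schema_name/1.0.0 is the placeholder used on this page.
$id: /schema_name/1.0.0
type: object
properties:
  $schema:
    type: string
    description: URI identifying the JSONSchema for this event
  meta:
    type: object
    properties:
      stream:
        type: string
        description: Name of the stream/queue that this event belongs in
      dt:
        type: string
        format: date-time
        # Fields that use format or pattern also need maxLength.
        maxLength: 128
        description: Event time, in ISO-8601 format
      id:
        type: string
        description: UUID of this event, used for deduplication
    required:
      - stream
      - dt
      - id
required:
  - $schema
  - meta
```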

No union types

Union types are not allowed. E.g. type: [string, integer] and type: [string, null] are not allowed. This means that null JSON values in data are not useful: to set a field to null, its type would have to be exactly type: "null", in which case every value of that field would be null. See the Optional Fields Guidelines documentation below. See also https://json-schema.org/understanding-json-schema/reference/null.html.
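
The contrast can be sketched as follows. The field names are hypothetical, and the suggestion to omit an optional field instead of sending null is an assumption drawn from the rule above, not a quotation of the Optional Fields Guidelines.

```yaml
# Not allowed: a union type (shown commented out).
# nullable_field:
#   type: [string, "null"]

# Allowed: a single type. Assumption: leave the field out of `required`
# so that producers can omit it when there is no value to send.
optional_field:
  type: string
```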

No object additionalProperties

The full structure of your data must be known ahead of time. The exception to this is map type fields (see below), since maps fully specify the data types of keys and values.
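
A minimal sketch of a fully specified object field is shown here, with hypothetical field names. Every nested property is declared with its type, and the object does not open itself up to arbitrary extra properties.

```yaml
# Hypothetical field: the complete structure is declared up front.
user_agent:
  type: object
  properties:
    browser_family:
      type: string
    is_mobile:
      type: boolean
  # Not allowed: free-form extra properties (shown commented out).
  # additionalProperties: true
```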

arrays

Arrays must specify their items type.

array_field:
  type: array
  items:
    type: string

Backwards compatible modifications only

In practice, the only allowed schema modification is adding new optional fields.
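
As an illustration, a compatible change between two versions might look like the sketch below. The field names are hypothetical, and bumping the version from 1.0.0 to 1.1.0 is an assumed numbering convention rather than something this page specifies.

```yaml
# Version 1.0.0 (hypothetical schema)
$id: /schema_name/1.0.0
type: object
properties:
  page_title:
    type: string
required:
  - page_title
---
# Assumed next version: adds one optional field and changes nothing else.
$id: /schema_name/1.1.0
type: object
properties:
  page_title:
    type: string
  page_namespace:
    type: integer
required:
  - page_title
```

The new page_namespace field is not listed under required, so events produced against the older version still validate against the newer one.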

No type changes

Type changes are the worst kind of backwards incompatible change. They can severely break downstream data consumers.

Do not remove fields

Field removals may cause downstream components to break. While field removals may be possible if they are done carefully, the best practice is just not to do them.

Do not rename fields

Renaming a field is the same as removing a field and adding a new one, and removing fields is not allowed.

Conventions

No Capital letters / Use snake_case

  • All field names should be in snake_case and should be all lower case.
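
For example, with a hypothetical field name:

```yaml
# Avoid: pageTitle, PageTitle, page-title
# Prefer snake_case, all lower case:
page_title:
  type: string
```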

datetimes / timestamps

All datetimes / timestamps should be serialized in JSON data as ISO-8601 datetime strings, ideally in UTC and suffixed with the 'Z' (zero offset) timezone qualifier. E.g. 2015-12-20T09:10:56Z.

Datetime field names should be suffixed with '_dt'.

session_start_dt:
  type: string
  format: date-time
  maxLength: 128

Note the maxLength: 128. This is a security constraint that limits the amount of work a JSONSchema validator has to do in order to validate a format. All fields that use format or pattern need to specify maxLength.

There may be cases where sending an integer Unix epoch timestamp instead of an ISO-8601 string is required. In this case the field type should be integer, the timestamp should be sent as integer milliseconds, and the field name should be suffixed with '_ts_ms'. If you must send seconds instead of milliseconds, suffix your field name with '_ts_s'.

session_start_ts_ms:
  type: integer
  description: Unix epoch in milliseconds

Elapsed time fields

If possible, time duration fields should be sent as integer milliseconds. Please suffix these fields with the time unit, e.g. '_ms', '_ns', '_s', etc.
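
A duration field following this convention might be declared as in the sketch below. The field name is hypothetical.

```yaml
# Hypothetical field: a duration sent as integer milliseconds,
# with the unit carried in the '_ms' suffix.
page_load_time_ms:
  type: integer
  description: Time taken to load the page, in milliseconds
```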

map types

There may be times when you don't know all field names ahead of time. Since neither union types nor arbitrary objects (AKA structs) are supported, you'll want to use a 'map type'. JSON serialization does not differentiate between a 'map' and an 'object', but downstream systems need to know the difference. A map type in JSONSchema can be represented by an object with additionalProperties allowed, but for which all properties have the same type.

map_field:
  type: object
  additionalProperties:
    type: string
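
Event data matching the map_field declaration above could look like the following sketch, written in YAML for consistency with the rest of this page. The keys and values are made up for illustration. Any key is accepted, but every value must be a string.

```yaml
# Hypothetical event data: arbitrary keys, string values only.
map_field:
  color: blue
  size: large
```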