This page documents conventions, guidelines, and rules for designing and evolving Event Schemas.
All event schemas must have the following fields.
- A URI identifying the JSONSchema for this event. This should match an event schema $id in a schema repository. E.g. /schema_name/1.0.0
- Event meta data sub-object. This field contains common meta data fields for an event.
- Name of the stream/queue that this event belongs in. This is used for routing incoming events to specific streams and downstream 'datasets'. This will likely correspond with a Kafka topic and a Hive table.
- Timestamp of the event, in ISO-8601 format. This is the 'event time'. This field will be used as the Kafka message timestamp and for Hive table time-based partitioning.
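As a rough illustration, a minimal event carrying only the required fields might be built as in the sketch below. The field names `$schema`, `meta`, `stream`, and `dt` are assumptions made for this example (the list above describes the fields without naming them), and the stream name `example.stream` is hypothetical.

```python
import json
from datetime import datetime, timezone


def make_minimal_event(schema_uri: str, stream: str) -> dict:
    """Build an event holding only the required fields described above.

    The field names ($schema, meta, stream, dt) are assumed for illustration.
    """
    return {
        "$schema": schema_uri,
        "meta": {
            "stream": stream,
            # Event time as an ISO-8601 string in UTC.
            "dt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    }


event = make_minimal_event("/schema_name/1.0.0", "example.stream")
print(json.dumps(event, indent=2))
```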
No union types
Union types are not allowed. E.g. type: [string, integer] and type: [string, null] are not allowed. This means that null JSON values in data are not useful: to set a field to null, its type would have to be exactly type: "null", which would mean that every value of that field is null. See the Optional Fields Guidelines documentation below. See also https://json-schema.org/understanding-json-schema/reference/null.html.
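One way to catch union types before a schema is published is to walk the schema and report every field whose type is a list. The helper below is a minimal sketch, not part of any existing tooling; the name `find_union_types` is hypothetical, and the schema is assumed to be already loaded into a Python dict with any $ref resolved.

```python
def find_union_types(schema: dict, path: str = "") -> list:
    """Return the paths of all fields whose 'type' is a list (a union type)."""
    found = []
    if isinstance(schema.get("type"), list):
        found.append(path or "<root>")
    # Descend into declared object properties.
    for name, sub in schema.get("properties", {}).items():
        found.extend(find_union_types(sub, f"{path}.{name}" if path else name))
    # Also descend into array items and map values.
    for key in ("items", "additionalProperties"):
        sub = schema.get(key)
        if isinstance(sub, dict):
            found.extend(find_union_types(sub, f"{path}.{key}" if path else key))
    return found


schema = {
    "type": "object",
    "properties": {
        "ok_field": {"type": "string"},
        "bad_field": {"type": ["string", "null"]},
    },
}
print(find_union_types(schema))
```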
No object additionalProperties
The full structure of your data must be known ahead of time. The exception to this is map type fields (see below), since maps fully specify the data types of keys and values.
Arrays must specify their items type.
    array_field:
      type: array
      items:
        type: string
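The array rule can be checked mechanically. The following sketch lists array fields that do not declare an items type; `arrays_missing_items_type` is a hypothetical helper, and it assumes the schema is a plain dict with any $ref already resolved.

```python
def arrays_missing_items_type(schema: dict, path: str = "") -> list:
    """Return the paths of array fields that do not specify an items type."""
    missing = []
    if schema.get("type") == "array":
        items = schema.get("items")
        if isinstance(items, dict) and "type" in items:
            # The items type is declared; check nested arrays or objects too.
            missing.extend(arrays_missing_items_type(items, f"{path}[]"))
        else:
            missing.append(path or "<root>")
    for name, sub in schema.get("properties", {}).items():
        missing.extend(
            arrays_missing_items_type(sub, f"{path}.{name}" if path else name)
        )
    return missing


schema = {
    "type": "object",
    "properties": {
        "good_array": {"type": "array", "items": {"type": "string"}},
        "bad_array": {"type": "array"},
    },
}
print(arrays_missing_items_type(schema))
```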
No Capital letters - Use snake_case
All field names should be in snake_case and should be all lower case. Event fields are often imported into case-insensitive RDBMS SQL systems. Mixing capital and lower case letters in e.g. Hive or MySQL table and field names can be confusing and cause issues in systems and code that access those SQL systems.
Keys in map type objects may use mixed case. However, unless capital letters are strictly necessary, it is recommended that your producer convert map keys to lowercase. Doing so avoids awkward case-insensitive SQL queries later.
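Both recommendations are easy to apply in code. The sketch below reports schema field names that are not lower snake_case and shows a producer-side helper that lowercases map keys. The function names are hypothetical, and allowing a leading '$' in field names is an assumption of this example.

```python
import re

# Lowercase snake_case, optionally prefixed with '$' (an assumption of this sketch).
SNAKE_CASE = re.compile(r"^\$?[a-z][a-z0-9]*(?:_[a-z0-9]+)*$")


def non_snake_case_fields(schema: dict, path: str = "") -> list:
    """Return the paths of declared fields whose names are not snake_case."""
    bad = []
    for name, sub in schema.get("properties", {}).items():
        full = f"{path}.{name}" if path else name
        if not SNAKE_CASE.match(name):
            bad.append(full)
        bad.extend(non_snake_case_fields(sub, full))
    return bad


def lowercase_map_keys(mapping: dict) -> dict:
    """Lowercase the keys of a map type field before the event is sent."""
    return {key.lower(): value for key, value in mapping.items()}


schema = {
    "type": "object",
    "properties": {
        "page_title": {"type": "string"},
        "pageId": {"type": "integer"},
        "http": {"type": "object", "properties": {"clientIP": {"type": "string"}}},
    },
}
print(non_snake_case_fields(schema))
print(lowercase_map_keys({"User-Agent": "curl", "Content-Type": "text/html"}))
```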
Backwards compatible modifications only
In practice, this means that the only allowed schema modification is to add new optional fields.
No type changes
Type changes are the worst kind of backwards incompatible change. They can severely break downstream data consumers.
Do not remove fields
Field removals may cause downstream components to break. While field removals may be possible if they are done carefully, the best practice is just not to do them.
Do not rename fields
Renaming a field is the same as removing a field and adding a new one, and removing fields is not allowed.
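The rules above (no type changes, no removals, no renames, only new optional fields) can be approximated by comparing two versions of a schema. This is a minimal sketch under the assumption that both versions are plain dicts with $ref resolved; `incompatible_changes` is a hypothetical helper, not an existing tool, and it only inspects `properties`, `type`, and `required`.

```python
def incompatible_changes(old: dict, new: dict, path: str = "") -> list:
    """List backwards-incompatible differences between two schema versions."""
    problems = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    for name, old_field in old_props.items():
        full = f"{path}.{name}" if path else name
        if name not in new_props:
            # A rename looks exactly like a removal plus an addition.
            problems.append(f"removed (or renamed) field: {full}")
            continue
        new_field = new_props[name]
        if old_field.get("type") != new_field.get("type"):
            problems.append(f"type change: {full}")
        problems.extend(incompatible_changes(old_field, new_field, full))
    # Newly added fields must be optional.
    added = set(new_props) - set(old_props)
    for name in sorted(added & set(new.get("required", []))):
        full = f"{path}.{name}" if path else name
        problems.append(f"new field is required: {full}")
    return problems


old = {"properties": {"a": {"type": "string"}, "b": {"type": "integer"}}}
new = {
    "properties": {"a": {"type": "integer"}, "c": {"type": "string"}},
    "required": ["c"],
}
for problem in incompatible_changes(old, new):
    print(problem)
```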
Optional / Missing fields
As long as a field is not in the list of required fields, it is 'optional'. There may be times when you don't want to set a field's value. To do so, just leave the field out of the event data when you send it. In downstream SQL systems (e.g. Hive), missing fields will be set to NULL during ingestion.
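On the producer side, leaving a field out usually means dropping unset values before serializing, rather than sending a JSON null. A minimal sketch, assuming unset values are represented as None in the producer; `drop_unset_fields` is a hypothetical helper.

```python
import json


def drop_unset_fields(event: dict) -> dict:
    """Recursively remove keys whose value is None, so no JSON null is sent."""
    cleaned = {}
    for key, value in event.items():
        if value is None:
            continue
        cleaned[key] = drop_unset_fields(value) if isinstance(value, dict) else value
    return cleaned


event = {"page_id": 42, "comment": None, "http": {"client_ip": None, "method": "GET"}}
print(json.dumps(drop_unset_fields(event)))
```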
JSONSchema has an examples annotation. Your schema should include at least one example of an event that validates with the schema. Analytics Engineering uses these examples to produce canary (AKA heartbeat) events into event streams for monitoring of ingestion. jsonschema-tools tests will ensure that your examples validate with your schema.
You should always add at least one event example.
datetimes / timestamps
Datetime field names should be suffixed with '_dt'.
    session_start_dt:
      type: string
      format: date-time
      maxLength: 128
Note the maxLength: 128. This is a security constraint that limits the amount of work a JSONSchema validator has to do in order to validate a format (all fields that use format or pattern will need to specify maxLength).
There may be cases where sending an integer unix epoch timestamp instead of an ISO-8601 string is required. In this case the field type should be integer, the timestamp should be sent as integer milliseconds, and the field name should be suffixed with '_ts_ms'. If you must send seconds instead of milliseconds, then you should suffix your field name with '_ts_s'.
    session_start_ts_ms:
      type: integer
      description: Unix epoch in milliseconds
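To make the two conventions concrete, the sketch below converts one timezone-aware Python datetime into both an ISO-8601 string for a '_dt' field and integer milliseconds for a '_ts_ms' field. The helper names `to_dt` and `to_ts_ms` are hypothetical, and the input is assumed to carry timezone information.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_dt(moment: datetime) -> str:
    """Format a timezone-aware datetime as an ISO-8601 UTC string ('_dt' fields)."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def to_ts_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to integer Unix epoch milliseconds ('_ts_ms' fields)."""
    # Integer division of timedeltas avoids floating point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


start = datetime(2020, 1, 1, tzinfo=timezone.utc)
event = {
    "session_start_dt": to_dt(start),
    "session_start_ts_ms": to_ts_ms(start),
}
print(event)
```

A real schema would normally use only one of the two representations for a given timestamp; both appear here only for comparison.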
Elapsed time fields
If possible, time duration fields should be sent as integer milliseconds. Please suffix these fields with the time unit, e.g. '_ms', '_ns', '_s', etc.
    page_preview_visible_time_ms:
      type: integer
      description: Time the page preview popup was shown to the user
There may be times when you don't know all field names ahead of time. Since neither union types nor arbitrary objects (AKA structs) are supported, you'll want to use a 'map type'. JSON serialization does not differentiate between a 'map' and an 'object', but downstream systems need to know the difference. A map type in JSONSchema can be represented by an object with additionalProperties allowed, but for which all properties have the same type.
    map_field:
      type: object
      additionalProperties:
        type: string
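A downstream tool could tell a map from a struct by looking at how additionalProperties is declared. The check below is one possible reading of the convention, not the rule used by any particular system; `is_map_field` is a hypothetical helper.

```python
def is_map_field(field: dict) -> bool:
    """Return True if an object field declares one value type for all of its keys."""
    extra = field.get("additionalProperties")
    return (
        field.get("type") == "object"
        and isinstance(extra, dict)
        and isinstance(extra.get("type"), str)
    )


map_field = {"type": "object", "additionalProperties": {"type": "string"}}
struct_field = {"type": "object", "properties": {"name": {"type": "string"}}}
print(is_map_field(map_field), is_map_field(struct_field))
```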
Modeling state changes
We model state changes by providing the current (new) state as well as the previous state in the same event, rather than attempting to provide a diff of the change. We do so by providing the current state as data fields in the main event body, and providing the previous state as a subobject called 'prior_state' with the same state fields from the main event body.
Example (from mediawiki/user/blocks-change):
    ### user/blocks_change specific fields
    blocks:
      description: >
        The current state of blocks for the target user of this user change event.
      type: object
      properties:
        name:
          description: Whether the name or IP should be suppressed (hidden).
          type: boolean
        email:
          description: Whether sending email is blocked.
          type: boolean
        user_talk:
          description: Whether the user is blocked from editing their own talk page.
          type: boolean
        account_create:
          description: Whether the user/IP is blocked from creating accounts.
          type: boolean
        expiry_dt:
          description: >
            The timestamp the block expires in ISO8601 format. If missing,
            the blocks do not expire.
          type: string
          format: date-time
          maxLength: 128

    prior_state:
      description: >
        The prior state of the entity before this event. If a top level
        entity field is not present in this object, then its value has
        not changed since the prior event. For user blocks changes, if
        prior_state is not present, then the User or IP did not have
        any existing blocks in place at the time this event was emitted.
        This does not mean this User or IP never had any blocks. It is
        possible that the User's block had automatically expired and
        were no longer in place when this event was emitted.
      type: object
      properties:
        blocks:
          description: >
            The prior state of blocks for the target user of this user change event.
          type: object
          properties:
            name:
              type: boolean
            email:
              type: boolean
            user_talk:
              type: boolean
            account_create:
              type: boolean
            expiry_dt:
              type: string
              format: date-time
              maxLength: 128
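A producer following this pattern might assemble the event as in the sketch below: the current state goes into the main body, and prior_state holds only the top-level state fields whose value changed. `with_prior_state` is a hypothetical helper and the sample values are made up for illustration.

```python
def with_prior_state(current: dict, previous: dict) -> dict:
    """Return the current state plus a prior_state holding the changed fields.

    prior_state is left out entirely when there is no differing previous state.
    """
    event = dict(current)
    changed = {
        name: value for name, value in previous.items() if current.get(name) != value
    }
    if changed:
        event["prior_state"] = changed
    return event


current = {"blocks": {"name": False, "email": True, "user_talk": False}}
previous = {"blocks": {"name": False, "email": False, "user_talk": False}}
print(with_prior_state(current, previous))
```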
Frequently used fields
This section documents some fields that may appear in many different schemas. These fields may be DRYed up into $ref-able subschemas, and/or they may be moved into a 'Data Dictionary'.
Events are often associated with an HTTP request/response. It can be useful for the event to record some information about the HTTP session. If you need to do this, for consistency we suggest the following:
    http:
      type: object
      properties:
        uri:
          type: string
          description: The full URI of this HTTP request
        method:
          type: string
          description: The HTTP request method (GET, POST, etc.)
        client_ip:
          type: string
          description: The http client's IP address
        request_headers:
          type: object
          description: Request headers sent by the client.
          additionalProperties:
            type: string
        response_headers:
          type: object
          description: Response headers sent by the server.
          additionalProperties:
            type: string
        has_cookies:
          type: boolean
          description: True if the http request has any cookies set
Your schema does not need to set all of these fields in the http object, e.g. if you don't need http.response_headers, then leave it out. While this convention does have a convenient place to store common request headers like 'User-Agent' and 'Content-Type', you are not required to do so. If it is more useful for you to have a top level user_agent field, feel free to do so.
There is an http fragment schema that you may $ref to include these fields in your schema, if you don't mind having all of them defined.
(See this discussion for more context.)
Common Mediawiki event fields
- Mediawiki database name, e.g. enwiki, dewiki, etc.
- Information about the user that triggered the event
Includes all of mediawiki/common and
- (database) id of page
- textual page title
- wiki namespace
- boolean if the head revision of the page is a redirect
- The head revision of a page at the time of the page event
Includes all of mediawiki/common and
- page_id the revision belongs to
- page_title the revision belongs to
- namespace of page the revision belongs to
- boolean if this revision is a redirect
- (database) rev_id
- revision id of this revision's parent
- revision create time in ISO-8601 format. This does not end in '_dt' to better match the naming convention in the Mediawiki database.