You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Event Platform/Schemas/Guidelines: Difference between revisions

From Wikitech-static
imported>Ottomata
imported>Jameel Kaisar
Latest revision as of 14:00, 27 April 2023


This page documents conventions, guidelines, and rules for designing and evolving Event Schemas.

WMF Schema Repositories

WMF maintains two separate git schema repositories:

  • schemas/event/primary
  • schemas/event/secondary

You can think of these as a single schema repository (and namespace), since software that uses schemas at WMF generally uses both. The only reason there is more than one schema repository is to allow for different gerrit/git merge rights. schemas/event/secondary is more permissive and should be used for schemas that are not used by 'tier 1' production systems.

Rules

Required fields

All event schemas must have the following fields.

$schema
A URI identifying the JSONSchema for this event. This should match an event schema $id in a schema repository. E.g. /schema_name/1.0.0
meta
Event metadata sub-object. This field contains common metadata fields for an event.
meta.stream
Name of the stream/queue that this event belongs in. This is used for routing incoming events to specific streams and downstream 'datasets'. This will likely correspond with a Kafka topic and a Hive table.
meta.dt
A timestamp for the event, in ISO-8601 format. This should be the time the event was ingested (i.e. processing or system time). This is a required property in schemas, but EventGate fills it in on the server side if a client does not set it. If you are using EventGate, leave this field out of your event data and allow EventGate to set it to the server-side receive time.
dt
A timestamp for the event, in ISO-8601 format. This should be set by your client and should represent the 'event time' of the event, that is, the actual time the event happened.
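
As a rough illustration of the required fields described above, a schema could declare them along the following lines. This is only a sketch under stated assumptions: the schema name /button/click, the descriptions, and the choice to list every field under required are hypothetical and are not copied from any existing WMF schema.

```yaml
# Hypothetical schema sketch; the name and descriptions are illustrative only.
$id: /button/click/1.0.0
type: object
required:
  - $schema
  - meta
  - dt
properties:
  $schema:
    type: string
    description: URI identifying the JSONSchema for this event.
  meta:
    type: object
    required:
      - stream
      - dt
    properties:
      stream:
        type: string
        description: Name of the stream this event belongs in.
      dt:
        type: string
        format: date-time
        maxLength: 128
        description: Ingestion time. Clients normally omit it so EventGate can set it.
  dt:
    type: string
    format: date-time
    maxLength: 128
    description: Event time, set by the client.
```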


NOTE: As of 2020-12, these meta.dt and dt conventions are not fully adopted in existing schemas, but all new schemas should use these conventions. See T240460 and T267648.

No union types / No null values

Union types are not allowed. E.g. type: [string, integer] and type: [string, null] are not allowed. This means that null JSON values in data are not useful: for a field to accept null, its type would have to be exactly type: "null", in which case null would be the only value the field could ever hold. See the Optional Fields Guidelines documentation below. See also https://json-schema.org/understanding-json-schema/reference/null.html.
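
A minimal sketch of what this rule means in practice might look like the following. The field name referrer is hypothetical and is used only for illustration.

```yaml
# Illustrative only; `referrer` is a hypothetical field name.
properties:
  # Not allowed: a union type that admits both a string and null.
  # referrer:
  #   type: [string, "null"]

  # Allowed: a single type. Leave the field out of `required`, and omit it
  # from the event data when there is no value to send.
  referrer:
    type: string
```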

No object additionalProperties

The full structure of your data must be known ahead of time. The exception to this is map type fields (see below), since maps fully specify the data types of keys and values.
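
One way to picture the difference, assuming a hypothetical details field, is sketched below. The commented-out form leaves the structure unknown ahead of time, while the second form declares every property and its type.

```yaml
# Illustrative only; field names are hypothetical.

# Not allowed: a schemaless object that accepts arbitrary keys and values.
# details:
#   type: object
#   additionalProperties: true

# Allowed: every property and its type is declared ahead of time.
details:
  type: object
  properties:
    browser_family:
      type: string
    is_mobile:
      type: boolean
```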

arrays

Arrays must specify their items type, and all items must have exactly the same type.

array_field:
  type: array
  items:
    type: string

The array items type may be a complex object, as long as that object also explicitly declares all properties and types.

links_hovered:
  type: array
  items:
    type: object
    properties:
      link_url:
        type: string
      hover_time_ms:
        type: integer

Identifier Naming Rules

No Capital letters - Use snake_case

All identifiers (schema names, field names, etc.) should be in snake_case and should be all lower case. Event fields are often imported into case-insensitive RDBMS SQL systems. Mixing capital and lower case letters in e.g. Hive or MySQL table and field names can be confusing and cause issues in systems and code that access those SQL systems.

Keys in map type objects may use mixed case. However, unless capital letters are truly necessary, it is recommended that your producer convert map keys to lowercase. Doing so avoids awkward case-insensitive SQL queries later.

See also: camelCase is bad

No special characters

Avoid using characters that would require special quoting in SQL systems. E.g. do not use hyphens '-', spaces ' ', '@', etc. in your schema or field names.

Acceptable identifier regex

jsonschema-tools uses this regex to ensure identifiers are properly named.

/^[$a-z]+[a-z0-9_]*$/

Avoid using the allowed dollar sign '$'. This is a special case used only for JSONSchema's $schema URI field.
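
To make the naming rules concrete, here is a small sketch of field names that would and would not match the regex above. All of the field names are hypothetical.

```yaml
# Illustrative only; all field names here are hypothetical.
properties:
  # These match /^[$a-z]+[a-z0-9_]*$/ :
  page_id:
    type: integer
  hover_time_ms:
    type: integer

  # These would not match and should be avoided:
  # pageId:      contains a capital letter
  # page-title:  contains a hyphen
  # 2nd_click:   starts with a digit
```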

Backwards compatible modifications only

This basically means that the only allowed schema modification is to add new optional fields.
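
As a sketch of an allowed modification, the excerpt below adds one optional field to a hypothetical schema. The schema name, the field names, and the choice of 1.1.0 as the new version number are assumptions made for illustration.

```yaml
# Hypothetical schema evolving from 1.0.0 to 1.1.0; names are illustrative.
$id: /button/click/1.1.0
type: object
required:
  - button_name          # unchanged from 1.0.0
properties:
  button_name:           # unchanged: same name, same type
    type: string
  # New in 1.1.0: an optional field. It is not added to `required`,
  # so event data produced for 1.0.0 still validates.
  button_color:
    type: string
```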

No type changes

Type changes are the worst kind of backwards incompatible change. They can severely break downstream data consumers.

Do not remove fields

Field removals may cause downstream components to break. While field removals may be possible if they are done carefully, the best practice is just not to do them.

Do not rename fields

Renaming a field is the same as removing a field and adding a new one, and removing fields is not allowed.

Do not delete schemas

Even if your producer code is no longer producing events, your event data might be out there somewhere, used by a consumer. We want to be able to always map from an event to its schema, and this is not possible if the schema has been deleted.

It may be possible to delete schemas with a lot of collaboration with ALL consumers of data, but the best rule is just not to do it.

Conventions

Optional / Missing fields

As long as a field is not in the list of required fields, it is 'optional'. There may be times when you don't want to set a field's value. To do so, just leave the field out of the event data when you send it. In downstream SQL systems (i.e. Hive), missing fields will be set to NULL during ingestion.

Note that null is not a valid value for any field, unless it has type: null, which is useless. See also: #No_union_types
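
For example, a schema with one required and one optional field could look roughly like this. The field names are hypothetical, and the event data in the comments is made up.

```yaml
# Illustrative only; field names are hypothetical.
type: object
required:
  - search_query
properties:
  search_query:
    type: string
  result_count:          # optional: not listed under `required`
    type: integer

# An event with no result count simply leaves the field out:
#   {"search_query": "cats"}
# rather than sending a null value:
#   {"search_query": "cats", "result_count": null}
```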

examples

JSONSchema has an examples annotation. Your schema should include at least one example of an event that validates with the schema. Data Engineering uses these examples to produce canary (AKA heartbeat) events into event streams for monitoring of ingestion. jsonschema-tools tests will ensure that your examples validate with your schema.

You should always add at least one event example. If you don't, jsonschema-tools will generate one with random values for you. It is preferred that you add your own example.
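
A sketch of an examples annotation is shown below. The schema name, stream name, and all values are hypothetical, and the $schema and meta property declarations are left out for brevity.

```yaml
# Hypothetical schema excerpt; names and values are illustrative only.
$id: /button/click/1.0.0
type: object
properties:
  # ... $schema, meta and dt declarations omitted for brevity ...
  button_name:
    type: string
examples:
  - $schema: /button/click/1.0.0
    meta:
      stream: button_click
      dt: '2015-12-20T09:10:57Z'
    dt: '2015-12-20T09:10:56Z'
    button_name: save
```

The timestamps are quoted so that YAML parsers keep them as strings, which is what the date-time format expects.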

datetimes / timestamps

All date time / timestamps should be serialized in JSON data as ISO-8601 datetime strings, ideally in UTC time, suffixed with the 'Z' (Zero) timezone qualifier. E.g. 2015-12-20T09:10:56Z.

Datetime fields should be named suffixed with '_dt'.

session_start_dt:
  type: string
  format: date-time
  maxLength: 128

Note the maxLength: 128. This is a security constraint that limits the amount of work a JSONSchema validator has to do in order to validate a format (all fields that use format or pattern will need to specify maxLength).

There may be cases where sending an integer unix epoch timestamp instead of an ISO-8601 string is required. In this case the field type should be integer, the timestamp should be sent as integer milliseconds, and the field name should be suffixed with '_ts_ms'. If you must send seconds instead of milliseconds, then you should suffix your field name with '_ts_s'.

session_start_ts_ms:
  type: integer
  description: Unix epoch in milliseconds

Elapsed time fields

If possible, time duration fields should be sent as integer milliseconds. Please suffix these fields with the time unit, e.g. '_ms', '_ns', '_s', etc.

page_preview_visible_time_ms:
  type: integer
  description: Time the page preview popup was shown to the user


map types

There may be times when you don't know all field names ahead of time. Since neither union types nor schemaless objects (AKA structs) are supported, you'll want to use a 'map type'. JSON serialization does not differentiate between a 'map' and an 'object', but downstream systems need to know the difference. A map type in JSONSchema can be represented by an object that specifies an additionalProperties schema, so that all of its values have the same type.

map_field:
  type: object
  additionalProperties:
    type: string

This would allow you to emit event data like:

{
  "map_field": {
    "key1": "value1",
    "key2": "value2"
  }
}

Complex value types are supported, e.g.

map_field:
  type: object
  additionalProperties:
    type: object
    properties:
      p1:
        type: string
      p2:
        type: integer

Which would allow you to emit event data like:

{
  "map_field": {
    "key1": {"p1": "foo", "p2": 123}, 
    "key2": {"p1": "bar", "p2": 456}
  }
}

In this example, the map_field has string keys and complex struct (object) values.

Modeling state changes

We model state changes by providing the current (new) state as well as the previous state in the same event, rather than attempting to provide a diff of the change. We do so by providing the current state as data fields in the main event body, and providing the previous state as a subobject called 'prior_state' with the same state fields from the main event body.

Example (from mediawiki/user/blocks-change):

      ### user/blocks_change specific fields
      blocks:
        description: >
          The current state of blocks for the target user of this user change event.
        type: object
        properties:
          name:
            description: Whether the name or IP should be suppressed (hidden).
            type: boolean
          email:
            description: Whether sending email is blocked.
            type: boolean
          user_talk:
            description: Whether the user is blocked from editing their own talk page.
            type: boolean
          account_create:
            description: Whether the user/IP is blocked from creating accounts.
            type: boolean
          expiry_dt:
            description: >
              The timestamp the block expires in ISO8601 format.
              If missing, the blocks do not expire.
            type: string
            format: date-time
            maxLength: 128

      prior_state:
        description: >
          The prior state of the entity before this event. If a top level entity
          field is not present in this object, then its value has not changed
          since the prior event.  For user blocks changes, if prior_state is not
          present, then the User or IP did not have any existing blocks in place
          at the time this event was emitted.  This does not mean this User
          or IP never had any blocks.  It is possible that the User's block
          had automatically expired and were no longer in place when this event
          was emitted.
        type: object
        properties:
          blocks:
            description: >
              The prior state of blocks for the target user of this user change event.
            type: object
            properties:
              name:
                type: boolean
              email:
                type: boolean
              user_talk:
                type: boolean
              account_create:
                type: boolean
              expiry_dt:
                type: string
                format: date-time
                maxLength: 128
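
An event following this pattern might carry data like the sketch below. It is written as YAML for readability, all values are made up, and required fields such as $schema and meta are omitted.

```yaml
# Illustrative event data only; all values are made up.
# Current (new) state, in the main event body:
blocks:
  name: false
  email: true
  user_talk: false
  account_create: true
  expiry_dt: '2023-05-01T00:00:00Z'
# State before this event, using the same fields:
prior_state:
  blocks:
    name: false
    email: false
    user_talk: false
    account_create: true
    expiry_dt: '2023-04-01T00:00:00Z'
```

A consumer can compare blocks with prior_state.blocks to see that email was newly blocked and the expiry was extended, without having to apply a diff.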

Schema fragments

jsonschema-tools allows us to reference and include fields from other schemas by their schema URI. To keep dependencies easy to reason about, we only reference 'fragment' schemas stored in the fragment namespace/directory in our schema repositories.

  • Events should never use fragment schema as their $schema URI
  • Concrete (non fragment) schemas should only ever $ref fragment/ schemas
  • Avoid making fields required in fragment schemas. Concrete schemas should be able to decide what fields are required
  • Concrete (non fragment) schemas should define all fields at the root level, and only use allOf for $refs.

This is the preferred way of $refing and merging fragment schemas and defining concrete properties:

allOf:
  - $ref: /fragment/common/2.0.0#
properties:
  test:
    type: string
    default: default value

You may see some schemas that do this in a less than ideal way. Please avoid this:

allOf:
  - $ref: /fragment/common/2.0.0#
  - properties:
      test:
        type: string
        default: default value

Frequently used fields

This section documents some fields that may appear in many different schemas. These fields may be DRYed up into $ref-able subschemas, and/or they may be moved into a 'Data Dictionary'.

http information

Events are often associated with an HTTP request/response. It can be useful for the event to record some information about the HTTP request and response. If you need to do this, for consistency we suggest $ref-ing the fragment http schema:

allOf:
  - $ref: /fragment/common/2.0.0#
  - $ref: /fragment/http/1.2.0#


Note that in that schema, client_ip is not captured. If you want to collect the client IP address of the producer of the event, the http.client_ip property is the appropriate place to do it. There is a separate fragment http/client_ip schema for this purpose. $ref it along with the fragment/http schema:

allOf:
  - $ref: /fragment/common/2.0.0#
  - $ref: /fragment/http/1.2.0#
  - $ref: /fragment/http/client_ip/1.0.0#

(See this discussion for more context.)

Common Mediawiki event fields

In the schemas/event/primary repository, there are several fragment subschemas for Mediawiki related entity (page, revision, etc.) events.

mediawiki/common

database
Mediawiki database name, e.g. enwiki, dewiki, etc.
performer
Information about the user that triggered the event

mediawiki/page/common

Includes all of mediawiki/common and:

page_id
(database) id of page
page_title
textual page title
page_namespace
wiki namespace
page_is_redirect
boolean if the head revision of the page is a redirect
rev_id
The head revision of a page at the time of the page event

mediawiki/revision/common

Includes all of mediawiki/common and:

page_id
page_id the revision belongs to
page_title
page_title the revision belongs to
page_namespace
namespace of page the revision belongs to
page_is_redirect
boolean if this revision is a redirect
rev_id
(database) rev_id
rev_parent_id
revision id of this revision's parent
rev_timestamp
revision create time in ISO-8601 format. This does not end in '_dt' to better match the naming convention in the Mediawiki database.

Common analytics fields

In the schemas/event/secondary repository, there are several fragment subschemas for standardized fields in analytics events (e.g. identifiers such as device_id, session_id, pageview_id and many others).

See Event Platform/Analytics/Fragments for documentation on these.

Automatically populated fields

If your events are produced through a Wikimedia EventGate service (as most events are), some fields can be auto-populated if they are present in the schema but not present in the event. This is done through custom code in the eventgate-wikimedia repository. For up-to-date info on which fields are populated, read the code documentation for the makeSetWikimediaDefaults function there. As of 2020-10, if your schema has the following properties defined, but the fields are not present in the event, they will be populated by EventGate as follows:

$schema
To value of schema_uri query param if provided in HTTP request.
meta.stream
To value of stream query param if provided in HTTP request.
meta.dt
To current ISO-8601 UTC timestamp
meta.id
To new uuid
meta.request_id
To value of X-Request-ID request header if set.
http.client_ip
To value of X-Client-IP request header if set.
http.request_headers['user-agent']
To value of User-Agent request header if set.
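
To illustrate the list above, the sketch below shows an event as a client might send it, followed by what it could look like after the defaults are filled in. It assumes the schema defines all of these properties and that the request carried the corresponding headers. The event data is written as two YAML documents for readability, and every name and value is made up.

```yaml
# Illustrative only; schema name, stream name and all values are made up.

# Event as sent by the client:
$schema: /button/click/1.0.0
meta:
  stream: button_click
dt: '2020-10-01T12:00:00Z'
button_name: save
---
# The same event after EventGate has populated the missing fields:
$schema: /button/click/1.0.0
meta:
  stream: button_click
  dt: '2020-10-01T12:00:01Z'                        # current time at receipt
  id: 0a1b2c3d-0000-4000-8000-000000000000          # newly generated uuid
  request_id: 11111111-2222-4333-8444-555555555555  # from X-Request-ID
http:
  client_ip: 203.0.113.7                            # from X-Client-IP
  request_headers:
    user-agent: ExampleClient/1.0                   # from User-Agent
dt: '2020-10-01T12:00:00Z'
button_name: save
```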

Event Data Modeling and Schema Naming

Data modeling is hard, and these guidelines are intentionally vague about prescribing any data modeling rules. It'd be difficult to cover all the use cases out there. However, there is one guideline we'd like to note. We are modeling events here, so we should try to design these schemas as events!

An event should represent something happening at a specific time, so it is good practice to name your event schema as an action happening to something.

We've got many examples where we've named schemas after the entity (noun) and verb the event represents. E.g. mediawiki/api/request: a MediaWiki API endpoint was requested. Or, mediawiki/user/blocks-change: a MediaWiki User's blocks have changed.

When attempting to model a user's interaction with a website, perhaps you'd choose to model an entity as a specific interface (a button, a text-input-element, a sidebar, a popup) and then describe the action with a verb (button/clicked, text-input-element/entered, popup/displayed or popup/hidden). Or you might choose to use the same schema to model several types of state changes to the same entity, e.g. popup/visibility-change with a field action: displayed or action: hidden. Either of these ways is fine, as are many variations on them. What is important is that an event data model and schema name represent a real event happening.
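
The second approach could be sketched as follows. This is a hypothetical excerpt for the popup/visibility-change example named above, not an existing schema, and the use of enum to restrict the action values is an assumption.

```yaml
# Hypothetical excerpt for a popup/visibility-change schema.
properties:
  action:
    type: string
    description: Which visibility change this event represents.
    enum:
      - displayed
      - hidden
```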

When you name a schema, the question 'What is a <schema_name> event?' should make sense. E.g. "What is a 'user/create' event?" makes sense. "What is a 'user' event?" does not. (What would it mean for a user to be an event?)

Examples of good event model / schema names

  • entity create, entity delete, entity state change
  • button click, button hover, UI element display
  • interface feature user interaction
  • funnel user state change
  • search request

Examples of bad event model / schema names

  • mobile app
  • user
  • page
  • recommendation