Analytics/Systems/EventLogging

Revision as of 02:18, 13 September 2019

EventLogging (EL for short) is a platform for modelling, logging, and processing arbitrary analytic data.

This documentation is about the specific EventLogging instance that collects data on Wikimedia sites.

EventLogging architecture

For users


Here's the list of the existing schemas. Note that many of them are active, but not all. Some schemas are still in development (not active yet) and others may be obsolete and listed for historical reference.

The schema's discussion page is the place to comment on the schema design and related topics. It contains a template that specifies the schema maintainer(s), the team and project the schema belongs to, its status (active, inactive, in development), and its purging strategy.

Creating a schema

There's thorough documentation on designing and creating a new schema here:

These are some special guidelines to create a schema that Druid can digest easily:

Please don't forget to fill in the schema's talk page with the template. Note that for new schemas the purging strategy should read: automatically purge all events older than 90 days (see the Data retention and purging section). If you want to keep your data indefinitely, please contact the Analytics team.

Send events

See Extension:EventLogging/Programming for how to instrument your MediaWiki code.

If the schema is expected to generate a high rate of events, it should be limited to Hive and prevented from creating tables on MariaDB (where a high rate is more likely to create problems), see Analytics/Systems/EventLogging/Data_retention_and_auto-purging#Black-listed_schemas.

Client-side events

Client-side events are logged by requesting a web beacon on the project's hostname, with the path beacon/event and a query string containing all the event fields (with percent-encoded punctuation). For example:

Decoding the punctuation, this looks like:

{
  "event": { "action": "abort", ... },
  "schema": "Edit",
  "revision": 1234,
  "webHost": "",
  "wiki": "enwiki"
}

Because this data is sent through a URL, we can't use URLs that are longer than browsers can cope with. Therefore, EventLogging limits unencoded client-side events to 1000 characters.
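The encoding and the size limit described above can be sketched in Python. This is an illustrative helper, not the actual EventLogging client code; the function name is made up, and the 1000-character default comes from the limit stated above.

```python
import json
import urllib.parse

# Sketch: serialize an event capsule into a /beacon/event query string with
# percent-encoded punctuation, enforcing the 1000-character limit on the
# unencoded payload (illustrative only, not the real client implementation).
def build_beacon_query(capsule, limit=1000):
    payload = json.dumps(capsule, separators=(',', ':'))
    if len(payload) > limit:
        raise ValueError('event exceeds %d unencoded characters' % limit)
    return '/beacon/event?' + urllib.parse.quote(payload)

query = build_beacon_query({
    'event': {'action': 'abort'},
    'schema': 'Edit',
    'revision': 1234,
    'wiki': 'enwiki',
})
```

Punctuation such as `{`, `"` and `:` comes out percent-encoded (`%7B`, `%22`, `%3A`), which is why the raw beacon URLs in request logs look scrambled until decoded.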

Note that you can send events from any domain, but in production only events coming from Wikimedia domains are processed. There are many clones of Wikimedia sites running our code that in turn send us events we do not want.

Accessing data


Availability of schemas in MariaDB

Only events in certain whitelisted EventLogging schemas are ingested into MySQL/MariaDB. If you want to deploy a new event schema and have it ingested into MySQL, please file a Phabricator task and tag 'Analytics'.

Eventlogging Capsule

EventLogging analytics data is wrapped in what is called the EventCapsule. (EventLogging EventBus, used for non on-wiki schemas, does not use the EventCapsule.)

In MySQL, the event data appears in columns prefixed with event_, while EventCapsule data appears as columns without the event_ prefix. Note that the EventCapsule JSONSchema does not match what you see in MySQL: the MySQL consumer backend adds a lot of special casing in order to store event data. The top-level non-event fields in MySQL (originally from EventCapsule) are:

id                                 | int(11)         | autoincrement field, used for replication purposes
uuid                               | binary(32)      | unique identifier for event 
timestamp                          | varbinary(14)   | Mediawiki format: yyyymmddhhmmss.
userAgent                          | varbinary(1024) | Parsed user agent JSON string
webHost                            | varbinary(1024) | web domain like
wiki                               | varbinary(1024) | db name like 'enwiki'

Data stored by EventLogging for the various schemas has varying degrees of privacy, including personally identifiable information and sensitive information, hence access to it requires an NDA. Also, by default, EL data is only kept for 90 days, unless otherwise specified, see Analytics/Systems/EventLogging/Data retention and auto-purging.

See Analytics/EventLogging/Data representations for an explanation on where the data lives and how to access it.


See also: Analytics/Data access#EventLogging data and Analytics/Data access#Production_access.

Table names

Tables are in the "log" database and generally have the name <schema name>_<revision id>, e.g. log.NavigationTiming_10785754 has data from this revision of the NavigationTiming schema.
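The naming convention can be expressed as a tiny helper. This function is purely illustrative (it is not part of EventLogging); it just encodes the `<schema name>_<revision id>` pattern described above.

```python
# Illustrative helper (not part of the EventLogging codebase): build the
# MariaDB table name for a schema revision, following the
# <schema name>_<revision id> convention in the "log" database.
def log_table_name(schema, revision):
    return 'log.%s_%d' % (schema, revision)
```

For example, `log_table_name('NavigationTiming', 10785754)` yields `log.NavigationTiming_10785754`, the table named above.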

In March 2017, a change to the EventLogging capsule necessitated renaming and archiving of all tables that were receiving events at that time. Tables were renamed from, for example: WikipediaZeroUsage_14574251 to WikipediaZeroUsage_14574251_15423246 (15423246 being the revision of the EventCapsule schema at that time).

Sample query

Note that you need to have ssh access to stat1006 and also be authorized to access the db.

On stat1006.eqiad.wmnet, type this command:

mysql --defaults-extra-file="/etc/mysql/conf.d/research-client.cnf" -h analytics-slave.eqiad.wmnet -e "select left(timestamp,8) ts , COUNT(*) from log.NavigationTiming_10785754 where timestamp >= '20150402062613' group by ts order by ts";

Hadoop & Hive

Raw JSON data is imported into HDFS from Kafka, and then further refined into Parquet-backed Hive tables. These tables live in 2 Hive databases: event and event_sanitized, and are stored in HDFS at hdfs:///wmf/data/event and hdfs:///wmf/data/event_sanitized. event stores the original data during 90 days (data older than 90 days is automatically deleted). event_sanitized stores the sanitized data indefinitely. The sanitization process uses a whitelist that indicates which tables and fields can be stored indefinitely, see: Analytics/Systems/EventLogging/Data retention and auto-purging. You can access all this data through Hive, Spark, or other Hadoop methods. New data should be available for querying at most every hour.

Incompatibilities with the MariaDB setup
  • The canonical EventCapsule data that all EventLogging analytics schemas share uses an ISO-8601 string dt field. In the MariaDB EventLogging tables, this has been converted into a timestamp which uses MediaWiki's yyyymmddhhmmss format, but the Hive EventLogging tables only provide the unconverted dt field. (More detail: phab:T179540)
    • A UDF has been provided in Hive to convert the dt field into a MediaWiki timestamp (phab:T186155). It can be used as follows:
ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
SELECT GetMediaWikiTimestamp('2019-02-20T12:34:56Z') AS timestamp;
  • In the Hive EventLogging tables, the userAgent field has a parsed nested subobject, whereas on MariaDB it is stored as an unstructured varchar field containing a JSON string (see also phab:T164224).
  • Not all EventLogging analytics schemas are 'refinable'. Some schemas specify invalid field names, e.g. with dots '.' in them, or have field type changes between different records. If this happens, it will not be possible to store the data in a Hive table, and as such it won't appear in the list of refined tables. If your schema has this problem, you should fix it. (Dashes '-' in field names are automatically converted to underscores '_' during the refine process before the data is ingested into Hive, cf. phab:T216096#4955417.)
  • NOTE: Hadoop and Hive (in the JVM) are strongly typed, whereas the source EventLogging JSON data is not. This can cause problems when importing into Hive, as the refinement step needs to figure out what to do if it encounters type changes. Type changes are not supported: please do not ever change the type of an EventLogging field. You may add new fields as you need and stop using old ones, but do not change types. Some type changes are tolerated during the refinement stage: e.g. if the schema contains an integer but later data contains a decimal number, the refinement step will log a warning but still finish refinement. However, the record with the offending field will have all its fields set to NULL (not just the offending field).
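The dt-to-MediaWiki-timestamp conversion that the GetMediaWikiTimestamp Hive UDF performs can also be done client-side. This is a Python sketch of the same conversion, assuming the ISO-8601 input format shown in the UDF example above:

```python
from datetime import datetime

# Sketch of the conversion the GetMediaWikiTimestamp Hive UDF performs:
# an ISO-8601 dt string ('YYYY-MM-DDTHH:MM:SSZ') becomes the 14-digit
# MediaWiki 'yyyymmddhhmmss' format used in the MariaDB tables.
def mediawiki_timestamp(dt):
    return datetime.strptime(dt, '%Y-%m-%dT%H:%M:%SZ').strftime('%Y%m%d%H%M%S')
```

For example, `mediawiki_timestamp('2019-02-20T12:34:56Z')` returns `'20190220123456'`, matching the UDF's behavior on the same input.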

EventLogging analytics data is imported into event and event_sanitized databases in Hive. Example:

SELECT
  event.userID,
  count(*) as cnt
FROM ...
WHERE
  year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID

event.userid	cnt
NULL           1848
333333           87
222229           59
111113           29
111125           21
466534           17
433542           10
754324            7
121346            7
123452            6

Note that the EventLogging schema fields are in the event field struct. You can access them with dotted notation, e.g. event.userID.

Errors for schemas

Errors are available in the eventerror table in the event database. Sample select:

select * from eventerror where event.schema like 'MobileWikiApp%' and year=2018 and month=11 and day=1 limit 10;

Spark can access data directly through HDFS, or as SQL tables in Hive. Refer to the Spark documentation for how to do so. Examples:

Spark 2 Scala SQL & Hive:
// spark2-shell

val query = """
SELECT
  event.userID,
  count(*) as cnt
FROM ...
WHERE
  year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID
"""

val result = spark.sql(query)

|  userID| cnt|
|    null|1848|
|  333333|  87|
|  222229|  59|
|  111113|  29|
|  111125|  21|
|  466534|  17|
|  433542|  10|
|  754324|   7|
|  121346|   7|
|  123452|   6|
Spark 2 Python SQL & Hive:
# pyspark2

query = """
SELECT
  event.userID,
  count(*) as cnt
FROM ...
WHERE
  year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID
"""

result = spark.sql(query)

|  userID| cnt|
|    null|1848|
|  333333|  87|
|  222229|  59|
|  111113|  29|
|  111125|  21|
|  466534|  17|
|  433542|  10|
|  754324|   7|
|  121346|   7|
|  123452|   6|
Spark 2 R SQL & Hive:
# spark2R

query <- "
SELECT
  event.userID,
  count(*) as cnt
FROM ...
WHERE
  year = 2017 AND month = 11 AND day = 20 AND hour = 19
GROUP BY event.userID
"

result <- collect(sql(query))

     userID  cnt
1        NA 1848
2    333333   87
3    222229   59
4    111113   29
5    111125   21
6    466534   17
7    433542   10
8    754324    7
9    121346    7
10   123452    6
Hadoop: Archived Data

In 2017, some big EventLogging tables were archived from MySQL to Hadoop. Tables were exported with Sqoop into Avro format files, and Hive tables were created according to the corresponding schema. Thus far we have the following tables archived in Hadoop, in the archive database; this means they are no longer available in MySQL:


You can query these tables just like any other table in hive. A tip regarding dealing with binary types:

select *  from Some_tbl where (cast(uuid as string) )='ed663031e61452018531f45b4b5502cb';

Caveat: This process does not preserve the data type for e.g. bigint or boolean fields. The archived Hive table will contain them as strings instead, which will need to be converted back (e.g. CAST(field AS BIGINT)).

Hadoop Raw Data

Raw EventLogging JSON data is imported hourly into Hadoop by Camus. It is unlikely that you will ever need to access this raw data directly. Instead, use the refined event Hive tables as described above.

Raw data is written to directories named after each schema in hourly partitions in HDFS. /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_<schema>/hourly/<year>/<month>/<day>/<hour>. There are a myriad of ways to access this data, including Hive and Spark. Below are a few examples. There may be many (better!) ways to do this.
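The hourly partition layout above can be expressed as a small path helper. This is illustrative only (not part of the EventLogging codebase); it just encodes the directory pattern described in the previous paragraph, with zero-padded month, day, and hour as seen in the examples below.

```python
# Illustrative helper: build the HDFS directory for one hourly partition of
# raw EventLogging data, following the layout
# /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_<schema>/hourly/<year>/<month>/<day>/<hour>
def raw_eventlogging_dir(schema, year, month, day, hour):
    return ('/mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_%s'
            '/hourly/%04d/%02d/%02d/%02d' % (schema, year, month, day, hour))
```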

For backup purposes, we keep 90 days of events coming from the eventlogging-client-side topic in /mnt/hdfs/wmf/data/raw/eventlogging_client_side/hourly/<year>/<month>/<day>/<hour>. File:EventLogging on Kafka - Lightning Talk.pdf

Note that all EventLogging data in Hadoop is automatically purged after 90 days; the whitelist of fields to retain is not used, but this feature could be added in the future if there is sufficient demand.


Hive has a couple of built in functions for parsing JSON. Since EventLogging records are stored as JSON strings, you can access this data by creating a Hive table with a single string column and then parsing that string in your queries:

ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;

-- Make sure you don't create tables in the default Hive database.
USE otto;

-- Create a table with a single string field, partitioned by import hour
CREATE EXTERNAL TABLE `CentralNoticeBannerHistory` (
  `json_string` string
)
PARTITIONED BY (
  year int,
  month int,
  day int,
  hour int
);

-- Add a partition
ALTER TABLE CentralNoticeBannerHistory
ADD PARTITION (year=2015, month=9, day=17, hour=16)
LOCATION '/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2015/09/17/16';

-- Parse the single string field as JSON and select a nested key out of it
SELECT get_json_object(json_string, '$.event.l.b') as banner_name
FROM CentralNoticeBannerHistory
WHERE year=2015;

Spark Python (pyspark):

import json

# Load the raw sequence file and parse each value as JSON.
data = sc.sequenceFile("/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2015/09/17/07")
records = data.map(lambda x: json.loads(x[1])) x: (x['event']['l'][0]['b'], 1)).countByKey()
# Out[33]: defaultdict(<class 'int'>, {'WMES_General_Assembly': 5})

MobileWikiAppFindInPage events with SparkSQL in Spark Python (pyspark 1):

# Load the JSON string values out of the compressed sequence file.
# Note that this uses * globs to expand to all data in 2016.
data = sc.sequenceFile(
    "/wmf/data/raw/eventlogging/eventlogging_MobileWikiAppFindInPage/hourly/2016/*/*/*"
).map(lambda x: x[1])

# parse the JSON strings into a DataFrame
json_data = sqlCtx.jsonRDD(data) # replace with for pyspark 2
# Register this DataFrame as a temp table so we can use SparkSQL.
json_data.registerTempTable("MobileWikiAppFindInPage")

top_k_page_ids = sqlCtx.sql(
    """SELECT event.pageID, count(*) AS cnt
    FROM MobileWikiAppFindInPage
    GROUP BY event.pageID
    ORDER BY cnt DESC
    LIMIT 10""")
for r in top_k_page_ids.collect():
    print "%s: %s" % (r.pageID, r.cnt)

Edit events with SparkSQL in Spark scala (spark-shell):

// Load the JSON string values out of the compressed sequence file
// and parse them as a DataFrame.

val rawDataPath = "/wmf/data/raw/eventlogging/eventlogging_Edit/hourly/2015/10/21/16"

val edits = sqlContext.read.json(
    spark.sparkContext.sequenceFile[Long, String](rawDataPath).map(_._2)
)
// Register this DataFrame as a temp table so we can use SparkSQL.
edits.registerTempTable("edits")

// SELECT top 10 edited wikis
val top_k_edits = sqlContext.sql(
    """SELECT wiki, count(*) AS cnt
    FROM edits
    GROUP BY wiki
    ORDER BY cnt DESC
    LIMIT 10""")
// Print them out


There are many Kafka tools with which you can read the EventLogging data streams. kafkacat is one that is installed on stat1007.

# Uses the kafkacat CLI to print window ($1)
# seconds of data from topic ($2)
function kafka_timed_subscribe {
    timeout $1 kafkacat -C -b kafka-jumbo1001 -t $2
}

# Prints the top K most frequently
# occurring values from stdin.
function top_k {
    sort        |
    uniq -c     |
    sort -nr    |
    head -n $1
}

while true; do
    date; echo '------------------------------'
    # Subscribe to the eventlogging_Edit topic for 5 seconds
    kafka_timed_subscribe 5 eventlogging_Edit |
    # Filter for the "wiki" field
    jq .wiki |
    # Count the top 10 wikis that had the most edits
    top_k 10
    echo ''
done

Generating reports and dashboards

In addition to ad-hoc queries, there are a couple of tools that make it easy to generate periodic reports on EventLogging data and display them in the form of dashboards. You can find more info on them here:

Publishing data

See Analytics/EventLogging/Publishing for how to proceed if you want to publish reports based on EventLogging data, or datasets that contain EventLogging data.

Verify received events

Logstash has eventlogging EventError events. You can view all of these at

Validation errors are visible on application logs located at


In production they also end up in the kafka topic


There is also a Hive table named event.eventerror.

The processor is the component that handles validation, so, for example, its logs will have an error like the following if events are invalid:

Unable to validate: ?{
  "event": {
    "pagename": "Recentchanges",
    "namespace": null,
    "invert": false,
    "associated": false,
    "hideminor": false,
    "hidebots": true,
    "hideanons": false,
    "hideliu": false,
    "hidepatrolled": false,
    "hidemyself": false,
    "hidecategorization": true,
    "tagfilter": null
  },
  "schema": "ChangesListFilters",
  "revision": 15876023,
  "clientValidated": false,
  "wiki": "nowikimedia",
  "webHost": "",
  "userAgent": "Apple-PubSub/65.28"
}; cp1066.eqiad.wmnet 42402900 2016-09-26T07:01:42 -

This happens when client code has a bug and is sending events that are not valid according to the schema. We normally try to identify the schema at fault and pass that info back to the devs so they can fix it. See this ticket for an example of how we deal with these errors:

As of T205437, validation error logs are also available in Logstash for up to 30 days. A handy link to the associated Kibana search is available on a schema's talk page, provided that it's documented using the SchemaDoc template.

Note well that access to Logstash requires a Wikimedia developer account with membership in a user group indicating that the user has signed an NDA.

User agent sanitization

Main article: Analytics/Systems/EventLogging/User agent sanitization

The userAgent field is sanitized immediately upon storage; the content is replaced with a parsed version in JSON format.

Data retention and purging

Starting in August 2016, all EventLogging data is by default purged after 90 days to comply with WMF's data retention guidelines. Individual columns within tables can be whitelisted so that their data is retained indefinitely in the MariaDB databases; generally, all columns can be whitelisted except the clientIp and userAgent fields. To have your data added to this whitelist, contact Analytics. Note that new schemas will be purged by default unless otherwise discussed.

Read more on this topic at Analytics/EventLogging/Data retention and auto-purging. For implementation details, see T108850.

Operational support

Tier 2 support



Any outages that affect EventLogging will be tracked on Incident documentation (also listed below) and announced to the lists and


Alarms at this time come to the Analytics team. We are working on being able to claim alarms in icinga.


You can contact the analytics team at:

For developers


The EventLogging python codebase can be found at


See Analytics/EventLogging/Architecture for EventLogging architecture.


On this page you'll find information about Event Logging performance, such as load tests and benchmarks:

Size limitation

There is a limitation on the size of individual EventLogging events due to the underlying infrastructure (the limited size of URLs in Varnish's varnishncsa/varnishlog, as well as Wikimedia UDP packets). For the purpose of this limitation, an "entry" is a /beacon request URL containing urlencoded JSON-stringified event data. Entries longer than 1014 bytes are truncated. When an entry is truncated, it fails validation because the result is no longer valid JSON.

This should be taken into account when creating a schema. Large schemas should be avoided and schema fields with long keys and/or values, too. Consider splitting up a very large schema, or replacing long fields with shorter ones.

To aid with testing the length of schemas, EventLogging's dev-server logs a warning into the console for each event that exceeds the size limit.
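To see why truncation causes validation failures, here is a small Python sketch (illustrative only) showing that cutting a JSON payload at an arbitrary byte limit almost always leaves invalid JSON behind:

```python
import json

# Why a truncated entry fails validation: cutting a JSON-stringified event
# at a byte limit leaves an unparseable fragment.
payload = json.dumps({'schema': 'Edit', 'event': {'action': 'saveSuccess'}})
truncated = payload[:20]  # simulate truncation at an arbitrary limit

def parses_as_json(s):
    try:
        json.loads(s)
        return True
    except ValueError:
        return False
```

Here `parses_as_json(payload)` is true while `parses_as_json(truncated)` is false, which is exactly the failure mode the validator reports for truncated entries.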


You can use various tools to monitor operational metrics, read more in this dedicated page:


The EventLogging extension can easily be tested on Vagrant, as described at Extension:EventLogging. The server side of EventLogging (the consumer of events) does not have a Vagrant setup for testing, but can be tested in the Beta Cluster:

How do I ...?

Visit this EventLogging how to page. It contains some dev-ops tips and tricks for EventLogging like: deploying, troubleshooting, restarting, etc. Please, add here any step-by-step on EventLogging dev-ops tasks.

Administration: On call

Here's a list of routine tasks to do when oncall for EventLogging.

Data Quality Issues

Changes and Known Problems with Dataset

Date from Date until Task Details
2017-11 2017-11 Task T179625 Canonical EventLogging data (parsed and validated and stored in Kafka) did not match EventCapsule schema. This was fixed, and data was transformed before insertion into MySQL for backwards compatibility. This helped standardize all event data so that it could be refined and made available in Hive.
2017-07-10 2017-07-12 Task T170486 Some data was not inserted in MySQL, but was backfilled for all schemas but page-create. During the backfill, bot events were also accidentally backfilled, resulting in extra data during this time.
2017-05-24 onwards Task T67508 Do not accept data from bots on EventLogging unless the bot user agent matches "MediaWiki".
2017-03-29 onwards Task T153207 Changed the userAgent field in the event capsule.
2019-03-19 (hours 14 to 22) Task T218831 The EventLogging MySQL consumer was restarting for several hours, during which it could not insert any data into the database.
2019-04-01 Task T219842 Kafka Jumbo outage from 22:00 to midnight. Data loss during those hours.
2019-09-12 Third-party domain data is not getting refined (so sites that run clones of our code do not send us their requests).


Here's a list of all related incidents and their post-mortems. To add a new page to this generated list, use the "EventLogging/Incident_documentation" category.

For all the incidents (including ones not related to EventLogging) see: Incident documentation.

Limits of the eventlogging replication script

The log database is replicated to the eventlogging slave databases via a custom script (stored in operations/puppet, for the curious). While working on a related task we realized that the script could not replicate high-volume events in real time, showing a lot of replication lag (even days in the worst case). Please review the task for more info, or contact the Analytics team if you have more questions.

See also