Analytics/Data Lake/Traffic/mediawiki api request
event.mediawiki_api_request
The event.mediawiki_api_request table contains api.php request data. The data is produced by the ApiMain::logRequest() method in MediaWiki, which writes to the "api-request" debug logging channel. Messages on this channel are encoded as JSON and handled by the EventBus extension, which sends the events to EventGate; EventGate then produces them to Kafka. Camus imports the JSON data from Kafka, and a 'Refine' job uses the JSONSchema to create a Hive table and import the data hourly.
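You can check the result of the hourly Refine import from within Hive. A minimal sketch, assuming access to the analytics cluster:

-- Show the table layout and the hourly partitions created by Refine.
DESCRIBE event.mediawiki_api_request;
SHOW PARTITIONS event.mediawiki_api_request;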
Schema
The MediaWiki event JSONSchemas are stored in the mediawiki/event-schemas git repository. The mediawiki/api/request schema can be found at https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/api/request/.
As of 2018-05, the Hive schema is:

_schema              string
meta                 struct<uri:string,request_id:string,id:string,dt:string,domain:string,stream:string>
http                 struct<method:string,client_ip:string,request_headers:map<string,string>>
database             string
backend_time_ms      bigint
api_error_codes      array<string>
params               map<string,string>
datacenter           string
year                 bigint
month                bigint
day                  bigint
hour                 bigint

# Partition Information
# col_name           data_type            comment
datacenter           string
year                 bigint
month                bigint
day                  bigint
hour                 bigint
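For example, errors logged in api_error_codes can be counted per wiki with a query like the following (a sketch; the partition values are placeholders, and `database` must be backquoted because it is a Hive keyword):

-- Count Action API errors per wiki for a single hour.
-- Always constrain the partition columns so Hive prunes the scan.
SELECT
  `database` AS wiki,
  error_code,
  COUNT(*) AS requests
FROM event.mediawiki_api_request
LATERAL VIEW EXPLODE(api_error_codes) errors AS error_code
WHERE year = 2018 AND month = 5 AND day = 1 AND hour = 0
GROUP BY `database`, error_code
ORDER BY requests DESC
LIMIT 20;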
And the JSONSchema is:
title: mediawiki/api/request
description: A single request to the MediaWiki Action API (api.php)
$id: /mediawiki/api/request/0.0.1
$schema: http://json-schema.org/draft-07/schema#
type: object
properties:
# global event fields
$schema:
type: string
description: >
The URI identifying the jsonschema for this event. This may be just
a short uri containing only the name and revision at the end of the
URI path. e.g. /schema_name/12345 is acceptable. This often will
(and should) match the schema's $id field.
meta:
type: object
properties:
uri:
type: string
format: uri-reference
maxLength: 8192
description: the unique URI identifying the event / resource
request_id:
type: string
description: The unique ID of the request that caused the event.
id:
type: string
pattern: '^[a-fA-F0-9]{8}(-[a-fA-F0-9]{4}){3}-[a-fA-F0-9]{12}$'
maxLength: 36
description: the unique ID of this event; should match the dt field
# -- ApiAction .ts
dt:
type: string
format: date-time
maxLength: 26
description: the time stamp of the event, in ISO8601 format
domain:
type: string
description: the domain the event pertains to
minLength: 1
stream:
type: string
description: The name of the stream/queue that this event belongs in.
minLength: 1
required:
- id
- dt
- domain
- stream
# http request/response info
http:
type: object
description: Information about the HTTP request that generated this event.
properties:
method:
type: string
description: The HTTP request method (GET, POST, etc.)
client_ip:
type: string
description: The http client's IP address
request_headers:
type: object
description: Request headers sent by the client.
additionalProperties:
type: string
# event-specific fields
# -- ApiAction .wiki
database:
type: string
description: The name of the wiki database this event belongs to.
minLength: 1
backend_time_ms:
type: integer
description: Elapsed milliseconds to process request as measured by MediaWiki
api_error_codes:
type: array
description: >
List of error codes for any errors encountered, e.g. 'apierror-missingparam'
If this field is missing or empty, no errors were encountered.
items:
type: string
params:
type: object
description: >
Request parameters. These might come from http query string,
or from the POST body. Sensitive params are redacted.
additionalProperties:
type: string
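The meta and http structs and the params map map directly onto the Hive types above; struct fields are accessed with dot notation and map entries with bracket notation. A minimal sketch (partition values are placeholders):

-- Break down one hour of requests by HTTP method and API action.
SELECT
  http.method AS method,
  params['action'] AS action,
  COUNT(*) AS requests
FROM event.mediawiki_api_request
WHERE year = 2018 AND month = 5 AND day = 1 AND hour = 0
GROUP BY http.method, params['action']
ORDER BY requests DESC
LIMIT 10;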
wmf_raw.ApiAction
2018-05: This table is being deprecated in favor of event.mediawiki_api_request
The wmf_raw.ApiAction table contains 'raw' api.php request data. The data is produced by the ApiMain::logRequest() method in MediaWiki, which writes to the "ApiAction" debug logging channel. Messages on this channel are encoded as binary Avro and delivered to Kafka. An Oozie job consumes the Avro data from the "mediawiki_ApiAction" Kafka topic and stores it in HDFS.
Avro schema
{
"type": "record",
"name": "ApiAction",
"namespace": "org.wikimedia.analytics.schemas",
"doc": "A single request to the MediaWiki Action API (api.php)",
"fields": [
{
"name": "ts",
"doc": "The timestamp, in unix time, that the request was made",
"type": "int",
"default": 0
},
{
"name": "ip",
"doc": "The requesting IP address (either ipv4 or ipv6) in string notation",
"type": "string",
"default": ""
},
{
"name": "userAgent",
"doc": "The HTTP User-Agent header prefixed with the Api-user-agent header if present",
"type": "string",
"default": ""
},
{
"name": "wiki",
"doc": "The wiki making this request, such as dewiki or enwiktionary",
"type": "string",
"default": ""
},
{
"name": "timeSpentBackend",
"doc": "Elapsed milliseconds to process request as measured by MediaWiki",
"type": "int",
"default": -1
},
{
"name": "hadError",
"doc": "Were errors encountered when processing this request?",
"type": "boolean",
"default": false
},
{
"name": "errorCodes",
"doc": "List of error codes for any errors encountered",
"type": {"type": "array", "items": "string"},
"default": []
},
{
"name": "params",
"doc": "Request parameters",
"type": {"type": "map", "values": "string"},
"default": {}
}
]
}
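Since the ts field is unix time, convert it with from_unixtime() when reading the table. A minimal sketch, assuming the table carries the usual year/month/day/hour import partitions:

-- Sample a few raw rows from one hour, with a readable timestamp.
SELECT
  wiki,
  from_unixtime(ts) AS request_time,
  timeSpentBackend AS backend_ms,
  hadError
FROM wmf_raw.ApiAction
WHERE year = 2018 AND month = 5 AND day = 1 AND hour = 0
LIMIT 10;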
wmf.action_*
Note: These tables are actually currently in bd808.action_* pending the resolution of T137321.
The raw data from wmf_raw.ApiAction is used to compute various dimensional rollup tables for easier analysis.
-- Create tables for Action API stats
--
-- Usage:
-- hive -f create-action-tables.sql --database wmf
CREATE TABLE IF NOT EXISTS action_ua_hourly (
userAgent STRING COMMENT 'Raw user-agent',
wiki STRING COMMENT 'Target wiki (e.g. enwiki)',
ipClass STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by user-agent and wiki'
PARTITIONED BY (
year INT COMMENT 'Unpadded year of request',
month INT COMMENT 'Unpadded month of request',
day INT COMMENT 'Unpadded day of request',
hour INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;
CREATE EXTERNAL TABLE IF NOT EXISTS action_action_hourly (
action STRING COMMENT 'Action parameter value',
wiki STRING COMMENT 'Target wiki (e.g. enwiki)',
ipClass STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action and wiki'
PARTITIONED BY (
year INT COMMENT 'Unpadded year of request',
month INT COMMENT 'Unpadded month of request',
day INT COMMENT 'Unpadded day of request',
hour INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;
CREATE EXTERNAL TABLE IF NOT EXISTS action_param_hourly (
action STRING COMMENT 'Action parameter value',
param STRING COMMENT 'Parameter name, can be prop, list, meta, generator, etc',
value STRING COMMENT 'Parameter value',
wiki STRING COMMENT 'Target wiki (e.g. enwiki)',
ipClass STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action, parameter, value and wiki'
PARTITIONED BY (
year INT COMMENT 'Unpadded year of request',
month INT COMMENT 'Unpadded month of request',
day INT COMMENT 'Unpadded day of request',
hour INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;
-- NOTE: there are many params whose distinct values we do not want to count
-- at all (e.g. maxlag, smaxage, maxage, requestid, origin, centralauthtoken,
-- titles, pageids). Rather than trying to make an extensive blacklist and
-- potentially allow new parameters to slip through which have high
-- cardinality or sensitive information, the ETL process will use a whitelist
-- approach to count params that have been deemed to be useful.
--
-- The initial whitelist is (query, prop), (query, list), (query, meta),
-- (flow, module), (*, generator). The prop, list and meta parameters will
-- additionally be split on '|' with each component counted separately.
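To illustrate the whitelist approach, here is a hypothetical sketch of how the (query, prop) bucket could be filled, splitting the pipe-delimited prop value so each component is counted separately. The partition values and the hardcoded ipClass are placeholders; the real ETL derives ipClass from the client IP:

-- Sketch: populate action_param_hourly for (query, prop) for one hour.
INSERT OVERWRITE TABLE action_param_hourly
PARTITION (year = 2018, month = 5, day = 1, hour = 0)
SELECT
  'query' AS action,
  'prop' AS param,
  prop_value AS value,
  wiki,
  'internet' AS ipClass,  -- placeholder; derived from the client IP in practice
  COUNT(*) AS viewCount
FROM wmf_raw.ApiAction
LATERAL VIEW EXPLODE(SPLIT(params['prop'], '\\|')) p AS prop_value
WHERE year = 2018 AND month = 5 AND day = 1 AND hour = 0
  AND params['action'] = 'query'
  AND params['prop'] IS NOT NULL
GROUP BY wiki, prop_value;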