Revision as of 17:13, 15 December 2020 by imported>Neil P. Quinn-WMF (→‎WMF-specific: Update link)

reportupdater is a Python program that periodically executes given SQL queries (or scripts with SQL-like output) and appends their results to timeline TSV report files. If you have cron jobs harvesting data on any of the analytics data machines, you can likely benefit from using reportupdater. Think of it as a more "robust" way to execute cron jobs.

Some datasets generated by reportupdater:

Some of its features are:

  • Detects which data points are missing in the reports and executes the corresponding queries to update those missing values.
  • Supports different report granularities: daily, weekly and monthly.
  • Allows parametrizing the query (or script) to generate several reports, one for each value of the parameter.
  • Generates consistent and comprehensive logs with different levels.
  • It is robust against interrupted or parallel execution, so report files don't get corrupted.
  • You can collect data from the MediaWiki databases, the EventLogging database and from the Hive cluster.


Get the code from gerrit; it is also mirrored to Wikimedia's Github.


You need Python 2.7 to run reportupdater. To install its dependencies, cd into reportupdater's root directory and execute pip install -r requirements.txt.

Queries and scripts

You should write a set of SQL queries or executable scripts (or a mix of both) to be run by reportupdater. Regardless of which you choose, your code must return a single data point per metric. For example, if you split by wiki, the metric path should include {wiki}, and the output should contain exactly one row for each value of the wiki column. Multi-line output requires the funnel report flag. reportupdater will then execute your code for each specified date interval to build the whole timeline report. There are a couple of conventions your code should follow:

SQL Query conventions

  1. Use the placeholders {from_timestamp} and {to_timestamp} to timebound the results, for example: WHERE timestamp >= '{from_timestamp}' AND timestamp < '{to_timestamp}' (note that from_timestamp should be inclusive, while to_timestamp should be exclusive). Their format is YYYYMMDD000000.
  2. The first column of the results must be DATE('{from_timestamp}') AS date. This is an unnecessary limitation and might be removed in the future, but for now it's like that :/.
  3. There is no restriction on the number, format or contents of the subsequent columns.
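For illustration, a minimal query that follows these conventions might look like this (the table and column names, edit_events and edits, are assumptions for the example, not real tables):

```sql
-- Hypothetical example query: counts rows in the interval
-- [{from_timestamp}, {to_timestamp}), with the mandatory date first column.
SELECT
    DATE('{from_timestamp}') AS date,
    COUNT(*) AS edits
FROM edit_events
WHERE timestamp >= '{from_timestamp}'
  AND timestamp < '{to_timestamp}';
```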

Script conventions

  1. The script file MUST be marked as EXECUTABLE, otherwise it will give a cryptic permission denied error
  2. The first 2 parameters passed to the script are start_date and end_date, their format is YYYY-MM-DD. Use them to timebound the results (note that start_date should be inclusive, while end_date should be exclusive).
  3. The output of the report must be in TSV format, and must contain a first row with a header.
  4. The first column must be equal to start_date parameter (consider naming it date). This is an unnecessary limitation and might be removed in the future, but for now it's like that :/.
  5. There is no restriction on the number, format or contents of the subsequent columns.
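A minimal script following these conventions could be sketched as below. All names here (the edits column, the constant count) are illustrative assumptions; a real script would query an actual data source. Remember to chmod +x the file.

```python
#!/usr/bin/env python3
# Hypothetical reportupdater script: prints a TSV report with a header
# row, whose first column equals the start_date argument.
import sys


def build_report(start_date, end_date):
    # A real script would query a data source here, bounded by the
    # half-open interval [start_date, end_date); this sketch returns
    # a constant placeholder value.
    edit_count = 42
    return "date\tedits\n{}\t{}\n".format(start_date, edit_count)


if __name__ == "__main__" and len(sys.argv) >= 3:
    # reportupdater passes start_date and end_date (YYYY-MM-DD) as the
    # first two arguments; the length guard lets the file be imported
    # without arguments.
    sys.stdout.write(build_report(sys.argv[1], sys.argv[2]))
```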

Where to put these files?

You should put all queries and scripts inside the same dedicated directory. When executing reportupdater, you'll pass that directory as an argument, so reportupdater knows where to find your code. If you want your datasets to be hosted, put your queries and scripts in a repo and then ask the Analytics team to clone them to the appropriate place.

Sample queries executed by other teams

Against eventlogging database:

Config file

You also need to write a YAML config file that will tell reportupdater how to run the reports. There are four main sections in the config file:

The databases section

If you're not using SQL queries, you can skip this section. The databases section specifies which databases to connect to. Usually, it should include the host, the port, the credentials file and the database name, like this:

        host: analytics-slave.eqiad.wmnet
        port: 3306
        creds_file: /etc/mysql/conf.d/research-client.cnf
        db: log

But if your queries are hitting the analytics replicas (formerly dbstore1002, also known as analytics-store), the configuration should be different. The analytics replicas have been restructured for performance and maintainability, and they are now divided into shards. reportupdater provides a specific configuration option so you don't need to deal with shards yourself: set `auto_find_db_shard` to true, and drop `host` and `port`, like this:

        auto_find_db_shard: true
        creds_file: /some/path/to/.my.cnf
        db: enwiki

Gotcha: if the table that you're querying belongs to an extension (e.g. the Echo tables), it lives in a separate shard called x1. You have to indicate this by setting `use_x1` to true:

        auto_find_db_shard: true
        use_x1: true
        creds_file: /some/path/to/.my.cnf
        db: enwiki

Now, if you are using the `explode_by` feature, e.g. with `explode_by: wiki`, a single query connects to several databases. To specify that, set `wiki_db_placeholder` to the name of the placeholder you're exploding the database names with (in this case `wiki`), and remove the `db` parameter:

        auto_find_db_shard: true
        wiki_db_placeholder: wiki
        creds_file: /some/path/to/.my.cnf
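Putting the pieces together: the databases section is a mapping from database identifiers (the names referenced by a report's `db` field, see below) to connection settings. The identifier names here (`el`, `replicas`) are hypothetical examples, not required names:

```yaml
databases:
    el:
        host: analytics-slave.eqiad.wmnet
        port: 3306
        creds_file: /etc/mysql/conf.d/research-client.cnf
        db: log
    replicas:
        auto_find_db_shard: true
        wiki_db_placeholder: wiki
        creds_file: /some/path/to/.my.cnf
```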

The graphite section

It specifies which graphite server to send stats to, and should include the host and the port. It can optionally include a dictionary of lookups, in which each key is an identifier for a value in a metric execution. For example, if your metric explodes by wiki, you can provide a lookup dictionary for the identifier "wiki"; this could be used to translate the wiki database name, like enwiki, to the wiki hostname, like en.wikipedia. The value in the lookups dictionary is the path of a file to load as that lookup. Or, if you'd like to translate values from a column in your report, you can provide a dictionary for "column1", for example.

graphite:
    port: 1234
    lookups:
        wiki: /some/path/to/wiki-lookup.yaml
        column1: /some/path/to/column1-lookup.yaml

If you're not sending stats to graphite, you can skip this section.
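Presumably, a lookup file is a simple YAML mapping from the original value to the value that should appear in the metric path; a hypothetical wiki-lookup.yaml might look like this (format and contents are assumptions based on the description above):

```yaml
# hypothetical wiki-lookup.yaml: maps wiki database names to hostnames
enwiki: en.wikipedia
dewiki: de.wikipedia
eswiki: es.wikipedia
```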

The reports section

It describes the reports to be executed by reportupdater. For each report, there are three mandatory fields you must provide: the report id, the granularity and the start date; and there are some optional fields you can also specify, like this:

# The report ids (report1 ... report4) and the metric names under
# metrics are placeholders; use your own names.
reports:
    report1:
        granularity: days
        starts: 2016-01-01
    report2:
        granularity: months
        starts: 2014-06-15
        type: script
        lag: 10800
    report3:
        granularity: weeks
        starts: 2000-01-01
        max_data_points: 10
        explode_by:
            dimension1: valueA, valueB, valueC
            dimension2: valueK, valueL
    report4:
        granularity: days
        funnel: true
        explode_by:
            wiki: enwiki, dewiki, eswiki
        graphite:
            path: "{_metric}.{column0}.{wiki}"
            metrics:
                metric1: column1ToSendAsValue
                metric2: column2ToSendAsValue
The report id
It must match the name of the custom executable file. For example, if you're using a SQL query named random_report.sql, then the report id must be random_report (without .sql). If you're using a script, then name it the same as the report id. The report id also dictates the name of the generated report: if you use the id random_report, the resulting report will be named random_report.tsv.

granularity
days, weeks or months. Depending on this parameter, reportupdater will execute your queries/scripts every day, week or month, passing the corresponding date range to them. Mandatory.

starts
Indicates the first date to be computed (inclusive). Its format should be YYYY-MM-DD. Mandatory.

type
sql or script. If sql, reportupdater will interpret your custom executable as SQL code. If script, it will execute your custom code via the command line. Optional. Default: sql.

lag
Positive integer. The time to wait (in seconds) before starting the execution of the query/script after the desired time interval has finished. It can be useful to wait until the source data has been loaded into the database. Optional. Default: 0.

db
String. Should be one of the database identifiers specified in the databases section. Indicates which database reportupdater should connect to in order to execute the query. It only works with sql reports; if your report is of type script, this option is ignored. Optional. Default: see defaults section.

funnel
true or false. If true, the program will assume that the results of the query/script contain more than one line. If false, it will assume only one-line results. This option should be named multiline instead of funnel, which is a terrible name. Optional. Default: false.

max_data_points
Positive integer. The maximum number of data points the report will contain. Note that it depends on the granularity: if you have weekly granularity and set max_data_points to 10, your report will hold the data for the last 10 weeks. Optional. If not set, the report will hold all data points since the specified starts parameter.

execute
String. You may want two or more reports to use the same query or script. You can do so by assigning a query or script name to the execute field of a report. Optional. If not set, reportupdater will execute the query/script that matches the report identifier.

explode_by
Dictionary<string: list> (in YAML format). If you want N identical reports, one for each section of your product, you can configure them using explode_by. For example: wiki: enwiki, dewiki, jawiki, frwiki would generate 4 reports, one for each wiki. Note that you would have to include the {wiki} placeholder in your SQL query to slice its results. If you're using a script, all explode_by values will be passed as arguments to the script (right after the date arguments) in alphabetical order. If you want, you can specify a file path instead of the value list; if you do so, reportupdater will read that file to get the value list (one value per line).

graphite
Dictionary. If you want to send your report results to graphite, you must configure the metric names and values that will be sent. In this section, you define two things. First is a path template string: a Python format string that will be filled in with values from a single report result row. It is formatted with values from the row itself (keyed by column name), values from the explode_by section (keyed by the explode_by key), and the special {_metric} described next. Second is a dictionary of metrics to send. Each key in this dictionary is substituted into the path string where you use {_metric}, and the value is the column holding the actual value to send to graphite. For a complete example using graphite, see
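Conceptually, the path substitution works like the following sketch. This is not reportupdater's actual code, and all the variable names and values are illustrative assumptions; it only demonstrates how a row, the explode_by values, and {_metric} combine into a graphite path:

```python
# One report result row, keyed by column name (assumed names).
row = {"column0": "desktop", "column1ToSendAsValue": 123}
# Values from the explode_by section, keyed by the explode_by key.
explode_values = {"wiki": "enwiki"}
# The path template and the metrics dictionary from the config.
path_template = "{_metric}.{column0}.{wiki}"
metrics = {"metric1": "column1ToSendAsValue"}

for metric_name, value_column in metrics.items():
    # {_metric} is replaced by the metric name; the other placeholders
    # come from the row and the explode_by values.
    path = path_template.format(_metric=metric_name, **row, **explode_values)
    value = row[value_column]
    print(path, value)  # metric1.desktop.enwiki 123
```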

The defaults section

Any value that you assign here will apply to all reports. If a report also has a value for that same key, it'll override the default value.

The examples below show a config with defaults compared to an equivalent config without defaults:

# With defaults (the report ids report1/report2/report3 are placeholders):
defaults:
    type: script
    starts: 2018-01-01
    explode_by:
        wiki: wikis.txt

reports:
    report1:
        granularity: days
    report2:
        granularity: weeks
    report3:
        granularity: months

# Equivalent, without defaults:
reports:
    report1:
        type: script
        starts: 2018-01-01
        granularity: days
        explode_by:
            wiki: wikis.txt
    report2:
        type: script
        starts: 2018-01-01
        granularity: weeks
        explode_by:
            wiki: wikis.txt
    report3:
        type: script
        starts: 2018-01-01
        granularity: months
        explode_by:
            wiki: wikis.txt

How to execute?

Use the executable in reportupdater's root directory.

See source: [1]

usage: [-h] [--config-path CONFIG_PATH]
                         [--wikis_path WIKIS_PATH] [-l LOG_LEVEL]
                         query_folder output_folder

Periodically execute SQL queries or scripts and write/update the results into
TSV files.

positional arguments:
  query_folder          Folder with *.sql files and scripts.
  output_folder         Folder to write the TSV files to.

optional arguments:
  -h, --help            show this help message and exit
  --config-path CONFIG_PATH
                        Yaml configuration file. Default:
  -l LOG_LEVEL, --log-level LOG_LEVEL

You'll also find it useful to create a cron job that executes reportupdater periodically, for example every hour.
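For example, a crontab entry along these lines (all paths here are illustrative, and <executable> stands for the update executable in reportupdater's root directory):

```
# hypothetical crontab entry: run reportupdater at minute 0 of every hour
0 * * * * /usr/bin/python3 /srv/reportupdater/reportupdater/<executable> --config-path /srv/jobs/config.yaml /srv/jobs/queries /srv/reportupdater/output
```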


Sometimes, at the time the queries run, the data sources are incomplete or corrupt. In those cases, reportupdater might generate wrong values for a given date. If this happens, you can use reportupdater's re-run functionality, provided by an executable in the root directory.

usage: [-h] [--config-path CONFIG_PATH] [-r REPORT]
                        query_folder start_date end_date

Mark reports to be re-run for a given date range.

positional arguments:
  query_folder          Folder with *.sql files and scripts.
  start_date            Start of the date range to be rerun (YYYY-MM-DD,
  end_date              End of the date range to be rerun (YYYY-MM-DD,

optional arguments:
  -h, --help            show this help message and exit
  --config-path CONFIG_PATH
                        Yaml configuration file. Default:
  -r REPORT, --report REPORT
                        Report to be re-run. Several reports can be specified
                        like this. If none is specified, all reports listed in
                        the config file are marked for re-run.

Note that this execution does not re-run the reports per se. It only marks the given reports and dates so that they are re-run during reportupdater's next execution. Please do not alter report files manually! Doing so can result in permission problems.

Example: rerunning all history for the active-boards report from the limn-flow-data repository, in the flow folder:

python /srv/reportupdater/reportupdater/ -r active-boards /srv/reportupdater/jobs/limn-flow-data/flow 2009-11-08 2017-12-11


How to test?

To test your queries against the real database, just run them there before submitting a code patch.

To test on any analytics client machine:

  • You need both repositories cloned: reportupdater-queries and reportupdater.
  • Check out the reportupdater-queries changeset you are interested in testing.
  • If you're testing Hive queries, make sure you've authenticated through kinit.
  • If you're testing queries against MariaDB databases, make sure the credentials file points to the right place.
  • /usr/bin/python3 /path/to/reportupdater/ -l info /path/to/reportupdater-queries/<your directory>/ ~/reportupdater-output
  • It might be necessary to execute as the analytics user; if you do so, grant it permissions on the reportupdater-queries repo so that the pid file can be created.

How to productionize?

  1. Add your queries/scripts and config file to a new folder within this repo, and ask someone on the Analytics team to review.
  2. Ask the Analytics team to puppetize the execution of your reportupdater instance, or create a Puppet patch yourself adding your reportupdater job in [2], and ask someone on the Analytics team to review. There are two Puppet modules that run jobs: one for MySQL and one for Hive.


Where are the logs?

For reports running on cron, logs are written to /srv/reportupdater/log. For reports running on systemd timers, logs are accessed as described here.

Where is puppet code?


reportupdater is maintained by WMF's Analytics team.

Next features

  • Today, queries and scripts have to output the date as their first column. This is not actually necessary and could be removed.
  • The configuration option is_funnel could also be removed: reportupdater could automatically identify whether a query returns one row or multiple rows and handle it properly.
  • Make all report config fields defaultable in the defaults section, like db.