You are browsing a read-only backup copy of Wikitech. The primary site can be found at

Search/MLR Pipeline: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Dam io
(Fix broken link to data_pipeline script on github)
Line 14: Line 14:
= Training data =
= Training data =

The training data is assembled by the mjolnir data pipeline and the cli script [].
The training data is assembled by the mjolnir data pipeline and the cli script [].

== Grouping queries ==
== Grouping queries ==

Revision as of 14:31, 3 December 2018

The MLR pipeline is a set of offline tools and processes used to train models for ranking search results on WMF wiki sites using CirrusSearch.


The high level overview of the pipeline is to assemble search queries with visited pages to generate labels used in a machine learning algorithm. The ML algorithm will then produce a model realized as a json file that can be uploaded to the production elasticsearch cluster in a format understood by the LTR plugin. MLR Pipeline Sequence Diagram.png

Data preparation

Generation of query clicks

This process joins the cirrus requests data with webrequests. This process is managed by oozie. The code is available in the wikimedia/discovery/analytics project (patch is still in gerrit: 317019). The resulting data is available in two tables discovery.query_clicks_hourly and discovery.query_clicks_daily.

Training data

The training data is assembled by the mjolnir data pipeline and the cli script

Grouping queries

In order to maximize the number of labels for a query we need to group similar queries together. The technique uses two passes:

  • group queries together using a lucene stemmer
  • collect the top 5 results from raw queries and apply a naive clustering algorithm to explode large groups where the stemmer was too aggressive

See the code for more details.


The resulting data may be too large to be processed by the training pipeline so we need to sample the input data. Sampling is not trivial as it needs to take into account the popularity of each query to not bias the training data towards popular queries. The technique employed is to bucketize the queries per percentiles using spark approxQuantile. Each bucket can then be sampled.

See the code for more details.

Labels generation

Labels are generated thanks to clickmodels; the implementation used is the DbnModel described in the research paper "A Dynamic Bayesian Network Click Model for Web Search Ranking".

See the code for more details.

Feature vectors

The last step in preparing the training is to collect feature values for every pair of feature : hit for every query. There are several ways to collect feature vectors with mjolnir; one can use the logging endpoint of the LTR plugin or directly send individual queries to elasticsearch.

The most convenient way to train a new model with new features is to prepare a featureset with the LTR plugin. Mjolnir will then be able to collect feature vectors directly from the plugin.

Due to firewall constraints mjolnir running on the analytics cluster cannot directly access relforge100x machines where test indices are usually created. Kafka is used as the service to transfer queries from mjolnir to the relforge elasticsearch cluster. Results follow the same path in the other direction.

See the code for more details.

Once this process is done the training data will be available in hdfs, containing queries, labels and feature vectors.

Notes on the kafka workflow

TODO: describe the daemon and the 3 topics used.


The training process is based on xgboost. TODO: hyperopt, folding and cross validation.

Evaluate feature importance

TODO: notes on how to build feature importance with paws.

Deploy a model to production

Once a model has been trained mjolnir should have created a json file that can be uploaded directly to the production clusters. A simple mediawiki config patch can be deployed to switch production traffic to this new model by changing the wmgCirrusSearchMLRModel config in wmf-config/InitialiseSettings.php.