
Search/MLR Pipeline

Revision as of 15:29, 13 October 2017 by DCausse.

The MLR pipeline is a set of offline tools and processes used to train models for ranking search results on WMF wiki sites using CirrusSearch.


At a high level, the pipeline combines search queries with the pages users visited to generate labels for a machine learning algorithm. The algorithm then produces a model, materialized as a JSON file, that can be uploaded to the production Elasticsearch cluster in a format understood by the LTR plugin.

Data preparation

Generation of query clicks

This process, managed by Oozie, joins the CirrusSearch requests data with webrequests. The code is available in the wikimedia/discovery/analytics project (the patch is still in Gerrit: 317019). The resulting data is available in two tables: discovery.query_clicks_hourly and discovery.query_clicks_daily.
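As an illustration only, the join can be pictured as attaching page visits to the search request that produced them. The sketch below is plain Python, not the Oozie job itself; the field names (session_id, hit_page_ids, page_id) and the matching rule (same session, visited page among the returned hits) are assumptions made for the example.

```python
from collections import defaultdict

def join_query_clicks(search_requests, page_views):
    """Attach clicked page ids to each search request.

    Both inputs are lists of dicts with hypothetical field names:
      search_requests: {"session_id", "query", "hit_page_ids"}
      page_views:      {"session_id", "page_id"}
    """
    # Index page views by session so each search only looks at its own session.
    views_by_session = defaultdict(set)
    for view in page_views:
        views_by_session[view["session_id"]].add(view["page_id"])

    joined = []
    for req in search_requests:
        visited = views_by_session.get(req["session_id"], set())
        # Iterate over the hits so clicks keep the order of the result list.
        clicks = [pid for pid in req["hit_page_ids"] if pid in visited]
        joined.append({
            "session_id": req["session_id"],
            "query": req["query"],
            "hit_page_ids": req["hit_page_ids"],
            "clicked_page_ids": clicks,
        })
    return joined
```

Keeping the clicks in result-list order makes it possible to recover click positions later, which click models rely on.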

Training data

The training data is assembled by the mjolnir data pipeline and the CLI script.

Grouping queries

To maximize the number of labels per query, similar queries need to be grouped together. The technique uses two passes:

  • group queries together using a Lucene stemmer
  • collect the top 5 results of each raw query and apply a naive clustering algorithm to split large groups where the stemmer was too aggressive

See the code for more details.
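The two passes can be sketched roughly as follows. This is not the mjolnir code: toy_stem is a crude stand-in for a Lucene stemmer, and the overlap rule (min_shared results in common with an existing cluster) is an assumed, deliberately naive clustering criterion.

```python
from collections import defaultdict

def toy_stem(query):
    """Crude stand-in for a Lucene stemmer: lowercase and strip a plural 's'."""
    stems = []
    for token in query.lower().split():
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        stems.append(token)
    return " ".join(stems)

def group_queries(top_hits, min_shared=1):
    """top_hits maps each raw query to its list of top result page ids."""
    # Pass 1: group raw queries by their stemmed form.
    by_stem = defaultdict(list)
    for query in top_hits:
        by_stem[toy_stem(query)].append(query)

    # Pass 2: split a stem group when the top 5 results do not overlap.
    groups = []
    for queries in by_stem.values():
        clusters = []  # each cluster: (set of result ids, list of queries)
        for query in queries:
            hits = set(top_hits[query][:5])
            for cluster_hits, members in clusters:
                if len(hits & cluster_hits) >= min_shared:
                    members.append(query)
                    cluster_hits |= hits  # in-place update of the cluster's set
                    break
            else:
                clusters.append((hits, [query]))
        groups.extend(members for _, members in clusters)
    return groups
```

For example, "apple" and "apples" stay together when their results overlap, while a query that stems to the same form but returns unrelated pages ends up in its own group.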


The resulting data may be too large for the training pipeline to process, so the input data needs to be sampled. Sampling is not trivial: it must take query popularity into account so that the training data is not biased towards popular queries. The technique employed is to bucket the queries by popularity percentile using Spark's approxQuantile. Each bucket can then be sampled.

See the code for more details.
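A minimal sketch of the idea in plain Python, assuming popularity is a per-query session count. It uses exact quantiles from the standard library where the pipeline uses Spark's approxQuantile, and the fixed per_bucket sample size is an assumption for the example.

```python
import bisect
import random
import statistics
from collections import defaultdict

def sample_by_popularity(query_counts, n_buckets=4, per_bucket=2, seed=0):
    """query_counts maps query -> number of sessions; returns sampled queries."""
    # Bucket boundaries are the popularity quantiles (exact here; Spark's
    # approxQuantile would return approximate ones on large data).
    boundaries = statistics.quantiles(query_counts.values(), n=n_buckets)
    buckets = defaultdict(list)
    for query, count in sorted(query_counts.items()):
        buckets[bisect.bisect_left(boundaries, count)].append(query)

    # Sample the same number of queries from every bucket so that popular
    # queries do not dominate the result.
    rng = random.Random(seed)
    sampled = []
    for bucket_id in sorted(buckets):
        members = buckets[bucket_id]
        sampled.extend(rng.sample(members, min(per_bucket, len(members))))
    return sampled
```

Sampling each bucket separately keeps rare and popular queries both represented, which a uniform sample over sessions would not.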

Labels generation

Labels are generated using clickmodels. The implementation used is the DbnModel described in the research paper "A Dynamic Bayesian Network Click Model for Web Search Ranking".

See the code for more details.
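To give an intuition for how a click model turns clicks into labels, the sketch below implements the simplified variant of the DBN described in the same paper, where the user is assumed to examine every result down to the last click. It is not the DbnModel used by the pipeline (which is estimated iteratively), and it omits any smoothing priors; the session tuple layout is an assumption for the example.

```python
from collections import defaultdict

def sdbn_relevance(sessions):
    """Estimate relevance per (query, page) with the simplified DBN.

    Each session is (query, [page ids in rank order], set of clicked page ids).
    """
    shown = defaultdict(int)      # times examined (ranked at or above last click)
    clicked = defaultdict(int)    # times clicked
    satisfied = defaultdict(int)  # times it was the last click of a session

    for query, hits, clicks in sessions:
        click_positions = [i for i, page in enumerate(hits) if page in clicks]
        if not click_positions:
            continue  # sessions without clicks carry no signal in this variant
        last = click_positions[-1]
        for i, page in enumerate(hits[: last + 1]):
            key = (query, page)
            shown[key] += 1
            if page in clicks:
                clicked[key] += 1
                if i == last:
                    satisfied[key] += 1

    relevance = {}
    for key, n_shown in shown.items():
        attractiveness = clicked[key] / n_shown
        satisfaction = satisfied[key] / clicked[key] if clicked[key] else 0.0
        # The label is the probability of being clicked and satisfying the user.
        relevance[key] = attractiveness * satisfaction
    return relevance
```

A page that is always clicked but rarely the last click gets a lower label than a page that ends the session, which is the behaviour the DBN is designed to capture.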

Feature vectors

The last step in preparing the training data is to collect, for every query, the value of each feature for each hit (one value per feature : hit pair). There are several ways to collect feature vectors with mjolnir: one can use the logging endpoint of the LTR plugin or send individual queries directly to Elasticsearch.

The most convenient way to train a new model with new features is to prepare a featureset with the LTR plugin. Mjolnir will then be able to collect feature vectors directly from the plugin.

Due to firewall constraints, mjolnir running on the analytics cluster cannot directly access the relforge100x machines, where test indices are usually created. Kafka is used to transfer queries from mjolnir to the relforge Elasticsearch cluster, and results follow the same path in the other direction.

See the code for more details.
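For orientation, a feature-logging search request for the LTR plugin looks roughly like the body built below. This follows the request shape shown in the plugin's documentation and is not taken from mjolnir; the featureset name, the "keywords" template parameter and the log entry name are placeholders that depend on how the featureset was defined.

```python
import json

def build_feature_logging_request(featureset, query_text, page_ids):
    """Build a search body asking the LTR plugin to log feature values.

    featureset, the "keywords" parameter and "log_entry1" are placeholders.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    # Restrict the search to the hits we want features for.
                    {"terms": {"_id": [str(pid) for pid in page_ids]}},
                    # Run the stored featureset against those documents.
                    {
                        "sltr": {
                            "_name": "logged_featureset",
                            "featureset": featureset,
                            "params": {"keywords": query_text},
                        }
                    },
                ]
            }
        },
        # Ask the plugin to attach the computed feature values to each hit.
        "ext": {
            "ltr_log": {
                "log_specs": {
                    "name": "log_entry1",
                    "named_query": "logged_featureset",
                }
            }
        },
    }

body = build_feature_logging_request("my_featureset", "rambo", [7555, 1370])
print(json.dumps(body, indent=2))
```

One such request per query returns one feature vector per hit, which is the feature : hit pairing the training data needs.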

Once this process is done, the training data is available in HDFS, containing queries, labels and feature vectors.

Notes on the kafka workflow

TODO: describe the daemon and the 3 topics used.


The training process is based on xgboost. TODO: hyperopt, folding and cross validation.

Evaluate feature importance

TODO: notes on how to build feature importance with paws.

Deploy a model to production

Once a model has been trained, mjolnir should have created a JSON file that can be uploaded directly to the production clusters. A simple MediaWiki config patch can then be deployed to switch production traffic to the new model by changing the wmgCirrusSearchMLRModel setting in wmf-config/InitialiseSettings.php.