Search/MLR Pipeline

The MLR pipeline is a set of offline tools and processes used to train models for ranking search results on WMF wiki sites using CirrusSearch.

= Overview =

At a high level, the pipeline joins search queries with the pages visited from their results to generate labels for a machine learning algorithm. The algorithm then produces a model, realized as a json file, that can be uploaded to the production elasticsearch cluster in a format understood by the [https://github.com/o19s/elasticsearch-learning-to-rank LTR plugin]. The diagram below is dated: oozie and the indicated operator have since been replaced by an apache airflow instance that builds models weekly.

[[File:MLR Pipeline Sequence Diagram.png]]

= Data preparation =

== Generation of query clicks ==

This process joins the [[Analytics/Data_Lake/Traffic/Cirrus|cirrus requests]] data with [[Analytics/Data_Lake/Traffic/Webrequest|webrequests]] and is managed by airflow. The code is available in the [https://gerrit.wikimedia.org/r/#/projects/wikimedia/discovery/analytics,dashboards/default wikimedia/discovery/analytics] project, labeled query_clicks. The resulting data is available in two tables: [[Analytics/Data Lake/Traffic/CirrusQueryClicks#discovery.query_clicks_hourly|discovery.query_clicks_hourly]] and [[Analytics/Data Lake/Traffic/CirrusQueryClicks#discovery.query_clicks_daily|discovery.query_clicks_daily]].
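As a rough illustration only, the join can be pictured as matching each search request to the later pageviews it produced. The sketch below is hedged: the table schemas and the search_token column linking a results page to its clicks are assumptions, not the actual layout used by the query_clicks job.

<syntaxhighlight lang="python">
# Hedged sketch of the query-clicks join; real schemas differ, and the
# search_token linking a SERP to its clicks is hypothetical here.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

requests = spark.table("event.mediawiki_cirrussearch_request").select(
    "search_token", "query", "hit_page_ids")
pageviews = spark.table("wmf.webrequest").where(F.col("is_pageview")).select(
    "search_token", "page_id", "dt")

# Attribute each pageview to the search request carrying the same token,
# yielding (query, results shown, page clicked) rows.
query_clicks = requests.join(pageviews, "search_token")
query_clicks.write.mode("overwrite").saveAsTable("discovery.query_clicks_hourly")
</syntaxhighlight>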

= Training data =

The training data is assembled by the mjolnir data pipeline and the cli script [https://github.com/wikimedia/search-MjoLniR/blob/master/mjolnir/utilities/data_pipeline.py data_pipeline.py].

== Grouping queries ==

In order to maximize the number of labels for a query we need to group similar queries together. The technique uses two passes:

* group queries together using a lucene stemmer
* collect the top 5 results from raw queries and apply a naive clustering algorithm to explode large groups where the stemmer was too aggressive

See the code for more details.
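As a toy sketch of the two passes, not mjolnir's implementation: a Porter stemmer stands in for the lucene stemmer, and the naive clustering is reduced to splitting a stem group whenever two queries share no top-5 results.

<syntaxhighlight lang="python">
# Toy sketch of the two-pass grouping. PorterStemmer stands in for the
# lucene stemmer used in production; the clustering rule is simplified.
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_key(query):
    # Pass 1: queries that stem to the same string share a group.
    return " ".join(sorted(stemmer.stem(w) for w in query.lower().split()))

def split_group(queries, top5):
    # Pass 2: explode a stem group when member queries return
    # disjoint top-5 hit sets (the stemmer was too aggressive).
    clusters = []
    for q in queries:
        for cluster in clusters:
            if any(top5[q] & top5[other] for other in cluster):
                cluster.add(q)
                break
        else:
            clusters.append({q})
    return clusters

# query -> set of top-5 hit page ids (made-up data)
top5 = {"running shoes": {1, 2, 3}, "run shoe": {2, 3, 4}, "runs": {9}}
groups = defaultdict(list)
for q in top5:
    groups[stem_key(q)].append(q)
final = [c for qs in groups.values() for c in split_group(qs, top5)]
</syntaxhighlight>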

== Sampling ==

The resulting data may be too large to be processed by the training pipeline, so the input data is sampled. Sampling is not trivial: it must take the popularity of each query into account so as not to bias the training data towards popular queries. The technique employed is to bucketize the queries by percentile using spark approxQuantile; each bucket can then be sampled.

See the code for more details.
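A minimal sketch of percentile bucketing with approxQuantile follows; the table name, column names, and sampling fraction are assumptions for illustration.

<syntaxhighlight lang="python">
# Sketch: bucket queries by popularity percentile with approxQuantile,
# then sample each bucket. Table and column names are illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.getOrCreate()
clicks = spark.table("discovery.query_clicks_daily")

# Popularity of each query = number of rows it appears in.
counts = (clicks.groupBy("query")
          .agg(F.count("*").alias("n_sessions"))
          .withColumn("n_sessions", F.col("n_sessions").cast("double")))

# Decile boundaries over query popularity; dedupe in case of skew.
splits = counts.approxQuantile("n_sessions",
                               [i / 10.0 for i in range(1, 10)], 0.01)
splits = sorted(set(splits))
bucketizer = Bucketizer(splits=[-float("inf")] + splits + [float("inf")],
                        inputCol="n_sessions", outputCol="bucket")
buckets = bucketizer.transform(counts)

# Sample a fixed fraction of queries from every popularity bucket.
fractions = {float(b): 0.1 for b in range(len(splits) + 1)}
sampled = buckets.sampleBy("bucket", fractions, seed=42)
</syntaxhighlight>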

== Labels generation ==

Labels are generated using clickmodels; the implementation used is the DbnModel described in the research paper "A Dynamic Bayesian Network Click Model for Web Search Ranking".

See the code for more details.
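To give an intuition for what the DBN produces, here is a heavily simplified toy estimator, not the clickmodels implementation: it assumes γ = 1 (every hit up to the last click was examined) and uses plain counting instead of EM. Attractiveness is how often a result is clicked when examined, satisfaction is how often a click on it ends the session, and the label is their product.

<syntaxhighlight lang="python">
# Toy DBN-style label estimator (gamma = 1, counting instead of EM).
# Each session is the ordered result list with a clicked flag.
from collections import Counter

clicked, examined, last_click = Counter(), Counter(), Counter()

sessions = [
    [("A", True), ("B", False), ("C", True)],   # last click on C
    [("A", True), ("B", False), ("C", False)],  # last click on A
]
for session in sessions:
    last = max(i for i, (_, c) in enumerate(session) if c)
    for i, (page, c) in enumerate(session):
        if i <= last:              # with gamma = 1, everything up to
            examined[page] += 1    # the last click counts as examined
        if c:
            clicked[page] += 1
            if i == last:
                last_click[page] += 1

for page in examined:
    attract = clicked[page] / examined[page]
    satisfy = last_click[page] / clicked[page] if clicked[page] else 0.0
    print(page, attract * satisfy)   # DBN-style relevance label
</syntaxhighlight>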

== Feature vectors ==

The next step in preparing the training data is to collect feature values for every (feature, hit) pair for every query. There are several ways to collect feature vectors with mjolnir: one can use the logging endpoint of the LTR plugin or send individual queries directly to elasticsearch.

The most convenient way to train a new model with new features is to [https://github.com/o19s/elasticsearch-learning-to-rank/blob/master/docs/building-features.rst#create-a-feature-set prepare a featureset] with the LTR plugin. Mjolnir will then be able to collect feature vectors directly from the plugin.
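For illustration, creating a featureset boils down to one REST call against the plugin; the request shape follows the LTR plugin docs, while the host, featureset name, and feature template below are made up.

<syntaxhighlight lang="python">
# Sketch: register a featureset with the LTR plugin. The host, names
# and template are illustrative; see the plugin docs for the format.
import requests

featureset = {
    "featureset": {
        "features": [
            {
                "name": "title_match",
                "params": ["keywords"],
                "template_language": "mustache",
                # Feature = score of a match query against the title.
                "template": {"match": {"title": "{{keywords}}"}},
            }
        ]
    }
}
resp = requests.post(
    "http://relforge1001:9200/_ltr/_featureset/my_featureset",
    json=featureset)
resp.raise_for_status()
</syntaxhighlight>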

Due to firewall constraints, mjolnir running on the analytics cluster cannot directly access the relforge100x machines where test indices are usually created. Kafka is used as the transport for queries from mjolnir to the relforge elasticsearch cluster; results follow the same path in the other direction. This is managed by the mjolnir-msearch-daemon.
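A minimal sketch of that bridge pattern is below; the topic names, message format, and hosts are assumptions, not the daemon's actual protocol.

<syntaxhighlight lang="python">
# Sketch of the kafka bridge: consume query bundles, run them against
# the relforge cluster, publish results back. Names are made up.
import json
from kafka import KafkaConsumer, KafkaProducer
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://relforge1001:9200"])
consumer = KafkaConsumer("mjolnir.msearch-requests",
                         bootstrap_servers="kafka1001:9092")
producer = KafkaProducer(bootstrap_servers="kafka1001:9092")

for record in consumer:
    request = json.loads(record.value)
    # Each message carries an msearch body prepared by mjolnir.
    response = es.msearch(body=request["msearch"])
    producer.send("mjolnir.msearch-responses", json.dumps({
        "run_id": request["run_id"],
        "responses": response["responses"],
    }).encode("utf-8"))
</syntaxhighlight>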

See the [https://github.com/wikimedia/search-MjoLniR/blob/master/mjolnir/features.py code] for more details.

=== Notes on the kafka workflow ===

TODO: describe the daemon and the 3 topics used.

== Feature Selection ==

Initially around 250 features are collected for each labeled data point. For runtime performance reasons this is trimmed to 50 features using [https://en.wikipedia.org/wiki/Feature_selection#Minimum-redundancy-maximum-relevance_(mRMR)_feature_selection minimum redundancy maximum relevance (mRMR)] feature selection via [https://github.com/sramirez/spark-infotheoretic-feature-selection spark-infotheoretic-feature-selection].

Once this process is done, the training data will be available in HDFS, containing queries, labels and feature vectors.
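The selection itself runs on the cluster via the Spark package above; purely as a conceptual illustration, greedy mRMR on discretized features looks roughly like this.

<syntaxhighlight lang="python">
# Conceptual greedy mRMR on discretized features; the real pipeline
# uses spark-infotheoretic-feature-selection on the cluster.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    # X: (n_samples, n_features) integer-discretized matrix, y: labels.
    n_features = X.shape[1]
    selected, remaining = [], list(range(n_features))
    relevance = [mutual_info_score(y, X[:, j]) for j in range(n_features)]
    while len(selected) < k and remaining:
        def score(j):
            # Relevance to the label minus mean redundancy with the
            # features already chosen.
            if not selected:
                return relevance[j]
            red = np.mean([mutual_info_score(X[:, j], X[:, s])
                           for s in selected])
            return relevance[j] - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

X = np.random.randint(0, 5, size=(1000, 8))
y = (X[:, 0] + X[:, 3] > 4).astype(int)
print(mrmr(X, y, k=3))
</syntaxhighlight>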

= Training =

The training process is based on [https://github.com/dmlc/xgboost xgboost] and [https://github.com/hyperopt/hyperopt hyperopt]. Models are trained with either 3- or 5-fold cross validation using 100-150 hyperopt rounds. The final model is trained on all data using the best parameters found during hyperparameter tuning.
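A condensed sketch of that loop follows; the parameter ranges, file path, and boosting-round count are illustrative, and mjolnir's actual search space differs.

<syntaxhighlight lang="python">
# Sketch: tune xgboost ranking params with hyperopt under 5-fold CV,
# then train the final model on all data. Ranges are illustrative.
import xgboost as xgb
from hyperopt import fmin, tpe, hp, Trials

# Assumes a libsvm-format file with an accompanying .group file so
# xgboost knows which rows belong to which query.
dtrain = xgb.DMatrix("training_data.txt")

space = {
    "eta": hp.loguniform("eta", -3, 0),
    "max_depth": hp.quniform("max_depth", 3, 8, 1),
    "min_child_weight": hp.quniform("min_child_weight", 1, 100, 1),
}

def objective(params):
    params = dict(params, objective="rank:ndcg",
                  max_depth=int(params["max_depth"]),
                  eval_metric="ndcg@10")
    cv = xgb.cv(params, dtrain, num_boost_round=100, nfold=5)
    # hyperopt minimizes, so return negative ndcg of the last round.
    return -cv["test-ndcg@10-mean"].iloc[-1]

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=150,
            trials=trials)
final = xgb.train(dict(best, objective="rank:ndcg",
                       max_depth=int(best["max_depth"])),
                  dtrain, num_boost_round=100)
</syntaxhighlight>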

== Evaluate feature importance ==

TODO: notes on how to build feature importance with paws.

= Deploy a model to production =

Once a model has been trained, mjolnir will have created a json file that can be uploaded directly to the production clusters. A simple mediawiki config patch can then switch production traffic to the new model by changing the wmgCirrusSearchMLRModel setting in wmf-config/InitialiseSettings.php.
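As an illustration, the upload amounts to a _createmodel call against the LTR plugin, after which the config change points ranking at the new model. The host, featureset name, and model name below are placeholders.

<syntaxhighlight lang="python">
# Sketch: push a trained xgboost model to the LTR plugin. Host and
# names are placeholders; the json definition comes from mjolnir.
import requests

with open("model.json") as f:
    definition = f.read()

body = {"model": {
    "name": "enwiki-20220518-v1",
    "model": {"type": "model/xgboost+json", "definition": definition},
}}
resp = requests.post(
    "http://search.svc.eqiad.wmnet:9200"
    "/_ltr/_featureset/my_featureset/_createmodel",
    json=body)
resp.raise_for_status()
</syntaxhighlight>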