
Add Link

Revision as of 14:01, 27 January 2021

This page contains information about the infrastructure used for the Add a Link structured task project (task T252822).

High-level summary

The Link Recommendation Service recommends phrases of text in an article to link to other articles on a wiki. Users can then accept or reject these recommendations.

  1. The service is an application hosted on Kubernetes with an API accessible via HTTP (see task T258978). It accepts a POST request containing the wikitext of an article and returns a structured response with link recommendations for that article. It does not have caching or storage; the client (MediaWiki) is responsible for that (task T261411).
  2. The search index stores metadata about which articles have link recommendations, via a field we set per article (task T261407, task T262226).
  3. A MySQL table per wiki is used for caching the actual link recommendations (task T261411); each row contains serialized link recommendations for a particular article.
  4. A maintenance script (task T261408) runs hourly per enabled wiki. It generates link recommendations by iterating over each article topic (see Search/articletopic) and calling the Link Recommendation Service to request recommendations.
    • The maintenance script caches the results in the MySQL table, then sends an event to EventGate (see Event_Platform/EventGate), where the Search pipeline ensures that the index is updated with the links/nolinks metadata for the article.
    • On page edit (when the edit is not done via the Add Link UX), link recommendations are regenerated via the job queue, using the same code and APIs as the maintenance script (n.b. we might do this differently; not yet implemented).
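The flow in items 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the GrowthExperiments code. The endpoint URL, the request and response field names (`page_title`, `wikitext`, `links`), the `link_recommendations` table layout, and the event shape are all assumptions made for the example, and SQLite stands in for the per-wiki MySQL cache table.

```python
import json
import sqlite3
import urllib.request

# Hypothetical endpoint: the real service URL and path are not stated on this page.
SERVICE_URL = "http://localhost:8000/query"


def build_request(page_title, wikitext):
    """Build (but do not send) a POST request carrying the article wikitext."""
    body = json.dumps({"page_title": page_title, "wikitext": wikitext}).encode("utf-8")
    return urllib.request.Request(
        SERVICE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_recommendations(request):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def open_cache(path=":memory:"):
    """Open the cache; one row holds the serialized recommendations for one article."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS link_recommendations "
        "(page_id INTEGER PRIMARY KEY, data TEXT NOT NULL)"
    )
    return db


def refresh_page(db, page_id, page_title, wikitext,
                 fetch=fetch_recommendations, send_event=print):
    """Request recommendations, cache them, then emit a links/nolinks event."""
    recommendations = fetch(build_request(page_title, wikitext))
    db.execute(
        "REPLACE INTO link_recommendations (page_id, data) VALUES (?, ?)",
        (page_id, json.dumps(recommendations)),
    )
    db.commit()
    # The event only says whether the article has recommendations; the search
    # pipeline would use it to update the index metadata.
    send_event({"page_id": page_id, "has_links": bool(recommendations.get("links"))})
    return recommendations
```

Passing `fetch` and `send_event` as parameters keeps the service call and the event transport replaceable, so the caching logic can be exercised without a running service.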

Diagram: Fetching and completing link recommendation tasks (task fetch and completion)

Source: Add_Link/Diagram:_Fetching_and_completing_link_recommendation_tasks

Link Recommendation Service

Repository

The repository for training the link recommendation model, as well as for the query service, is at https://gerrit.wikimedia.org/r/plugins/gitiles/research/mwaddlink/. An explanation of how the model works can be found on the meta-research-page.

Deployment

The service is deployed in production using the Deployment pipeline.

Dataset pipeline

The link recommendation model is trained on the stats1008 server, where several MySQL tables per wiki are filled with dictionary lookup data. Those tables are exported and published via Datasets.wikimedia.org. The production query service (that MediaWiki interacts with) will import those datasets into its own MySQL instance in Kubernetes (task T266826).
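The export and import steps can be illustrated with a small round trip. This is a sketch under stated assumptions, not the actual pipeline: the dump format (CSV), the table name `enwiki_anchors`, and its columns are hypothetical, and SQLite stands in for both the training-side and production MySQL instances.

```python
import csv
import io
import sqlite3


def export_table(db, table):
    """Dump every row of `table` as CSV text, with a header row first."""
    cursor = db.execute(f'SELECT * FROM "{table}"')
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


def import_table(db, table, dump):
    """Replace the contents of `table` with the rows found in a CSV dump."""
    reader = csv.reader(io.StringIO(dump))
    columns = next(reader)  # header row written by export_table
    column_list = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    db.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_list})')
    db.execute(f'DELETE FROM "{table}"')
    db.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', reader
    )
    db.commit()
```

Note that a CSV round trip turns every value into text, which is acceptable for phrase lookup data in this example but would need type handling for numeric columns.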

Grafana

Grafana dashboard: https://grafana.wikimedia.org/d/XkUDlMLGz/linkrecommendation?orgId=1&refresh=10s

Resolved questions / decisions

  • 10 December: How to get a MySQL database from a stat* server to a production MySQL instance (SRE/Analytics) (task T266826)
  • 23 October: Store the link recommendations in WANObjectCache or in a MySQL table? task T261411 (needs SRE/DBA input)
  • 15 October: Use wikitext for training the model, generating dictionary data, and as input to the mwaddlink query service. Will search for phrases in VE's editable content surface rather than attempt to apply offsets from wikitext / Parsoid HTML.
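To illustrate the 15 October decision: instead of carrying character offsets from wikitext over to another representation, a client can look for the recommended phrase in the text it actually displays. The sketch below is hypothetical. The function name and the idea of identifying a match by its occurrence index are assumptions for the example, not the VisualEditor integration itself.

```python
def find_phrase(text, phrase, occurrence=0):
    """Return (start, end) of the given occurrence of `phrase` in `text`, or None.

    `occurrence` is zero-based: 0 is the first match, 1 the second, and so on.
    """
    start = -1
    for _ in range(occurrence + 1):
        start = text.find(phrase, start + 1)
        if start == -1:
            return None  # fewer matches than requested
    return start, start + len(phrase)


text = "The cat saw another cat."
print(find_phrase(text, "cat"))      # first match
print(find_phrase(text, "cat", 1))   # second match
```

Searching the displayed text avoids having to map positions between wikitext and rendered content, at the cost of needing a rule for repeated phrases.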

Deployment

Note: the canonical documentation is at Deployments on kubernetes.

  1. SSH to the deployment server
  2. cd /srv/deployment-charts/helmfile.d/services/linkrecommendation
  3. helmfile -e {staging|eqiad|codfw} -i apply
  4. service-checker-swagger staging.svc.eqiad.wmnet https://staging.svc.eqiad.wmnet:4005 -s /apispec_1.json
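As a convenience, steps 2 and 3 can be composed per environment by a small helper. This sketch only builds the command strings shown above and validates the environment name; it does not run anything. The helper name `deploy_commands` is hypothetical, and the `service-checker-swagger` step is left out because the page only gives it for staging.

```python
import shlex

# Environments and chart directory as listed in the steps above.
ENVIRONMENTS = ("staging", "eqiad", "codfw")
CHART_DIR = "/srv/deployment-charts/helmfile.d/services/linkrecommendation"


def deploy_commands(environment):
    """Return the shell commands for steps 2 and 3 for one environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return [
        f"cd {shlex.quote(CHART_DIR)}",
        shlex.join(["helmfile", "-e", environment, "-i", "apply"]),
    ]


for command in deploy_commands("staging"):
    print(command)
```

Rejecting unknown environment names up front avoids invoking helmfile with a mistyped target.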

Updates

9 November - 10 December 2020

  • Growth / Research: Continued refactoring of research/mwaddlink for production ready status
  • Growth: Backend patches for GrowthExperiments for consuming research/mwaddlink data
  • Growth / SRE: Deployed linkrecommendation service to production (no datasets yet though)
  • DBA: Created database and read/write users for production kubernetes instance to access
  • Search: Working on consuming event(s) generated by service

2 - 6 November 2020

26 - 30 October 2020

  • Growth / Research: Recap architecture and discuss milestones
  • Growth / SRE / DBA: Agreed to use MySQL for lookup tables for the link recommendation service
  • Growth: Continued prototyping of the VisualEditor integration; continued work on deployment pipeline; initial work on HTTP API via Flask; addition of MySQL cache table in GrowthExperiments along with general infrastructure for reading/writing to the cache

19 - 23 October 2020

  • Growth / Research: Working on deployment pipeline for mwaddlink
  • Growth: Prototyping VisualEditor integration
  • Growth: Beginning work on maintenance script and supporting classes

12 - 16 October 2020

  • Growth / Research: Parsoid HTML vs wikitext, repo structure, MySQL vs SQLite, misc other things
  • Growth: Engineers meet to discuss schedule, order of tasks, etc

5 - 9 October 2020

  • Growth / Editing: Exploring ways to bring link recommendation data into VisualEditor
  • Growth / Research: Discussing repository structures in preparation for deployment pipeline setup
  • Growth / SRE / Research: Discussing how to get mwaddlink-query / mwaddlink into production

Teams / Contact

Growth (primary stakeholder, technical contact for project is Kosta Harlan, product owner is Marshall Miller). Other teams: Search Platform, SRE, Release Engineering, Research, Editing, Parsing

Roles / responsibilities

  • Growth: User facing code, integration with our existing newcomer tasks framework, plus maintenance script to populate cache with recommendations
  • Research: Implementing code to train models and provide a query client (research/mwaddlink repo)
  • SRE: Working with Growth + Research to put the link recommendation service into production
  • Search Platform: Implementing the event pipeline to update the search index metadata for a document when new link recommendations are generated
  • Release Engineering: Consulting with Growth for deployment pipeline
  • Editing: Consulting with Growth for VE integration
  • Parsing: Consulting with Growth for VE integration

Background reading