

SLO/ToneCheck Model

From Wikitech

Status: approved

Organizational

Service

  • ToneCheck is a microservice hosted on Lift Wing, running in the ML-Serve Kubernetes clusters. It is deployed as an Inference Service (isvc), like most machine learning models on Lift Wing.
  • The ToneCheck Inference Service is available through either the internal endpoint via Discovery or, from outside the production WMF network, the external endpoint via the API Gateway.
  • The ToneCheck model is a fine-tuned language model based on multilingual BERT. ToneCheck is designed for use with text that users are preparing to add to Wikipedia, and it returns a prediction of whether the text contains promotional, derogatory, or otherwise subjective language.

Teams

  • The Machine Learning (ML) team is solely responsible for the development, deployment, and maintenance of the ToneCheck Inference Service.
  • The Editing team is the primary client of the ToneCheck Inference Service. (See Edit check/Tone Check Project)

Architectural

Environmental dependencies

  • Kubernetes/Clusters#ml-serve: This is a hard dependency, as these Kubernetes nodes are the backbone of Lift Wing, the platform on which this service is deployed.
  • Thanos Swift for the storage of model binaries: This is a soft dependency. We use Swift as the model storage in Lift Wing, fetching the model into the pod through an S3-compatible API during initialization of a model service. If Swift is unavailable, running services are unaffected, but new releases that need to deploy a new version of the model will fail.
  • WMF Docker Registry ( SLO ) to store and fetch the Docker images of services running on the cluster: This is also a soft dependency, as existing services operate normally even if the registry is unavailable. New releases that reuse the same Docker image can fall back to images cached locally on the node; deploying a new image, however, requires the registry.

So, Swift and WMF Docker Registry are only needed during service (re)starts.
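The registry's soft-dependency behaviour amounts to a simple decision at pod (re)start. A minimal sketch (the function and return values are hypothetical; the real behaviour is implemented by the kubelet's image-pull policy, not by service code):

```python
def resolve_image(registry_available: bool, image_cached_locally: bool) -> str:
    """Decide whether a pod (re)start can proceed, mirroring the
    soft-dependency behaviour of the WMF Docker Registry."""
    if registry_available:
        return "pull"       # fetch the (possibly new) image from the registry
    if image_cached_locally:
        return "use-cache"  # same image tag: the node's local cache suffices
    return "fail"           # new image + registry down: restart cannot proceed

# A restart that reuses a cached image survives a registry outage:
print(resolve_image(registry_available=False, image_cached_locally=True))  # use-cache
```

A restart needing a brand-new image during a registry outage is the only failing case.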

Service dependencies

  • API Gateway ( SLO ) to route external traffic: a hard dependency for external clients only.

If the API Gateway is down, the service remains accessible through the internal endpoint via Discovery.

Client-facing

Who are the service’s clients?

The main client for the ToneCheck model is VisualEditor, which makes requests to this service via a project called Tone Check. Tone Check is one of multiple Edit checks integrated with the visual editor; it sends text to this service and surfaces messages to users. The Edit Check project maintains its own SLO.

Clients can also be any external or internal developer who wants to integrate ToneCheck into their application.
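A client call might look like the following. The request schema and model name are assumptions for illustration; consult the Lift Wing API documentation for the actual external endpoint and payload shape:

```python
import json

# Hypothetical external endpoint and model name -- check the Lift Wing
# docs for the real values.
API_GATEWAY_URL = "https://api.wikimedia.org/service/lw/inference/v1/models/tonecheck:predict"

def build_request(sentences):
    """Build a JSON body for a batch prediction request (assumed schema:
    one instance per sentence)."""
    return json.dumps({"instances": [{"text": s} for s in sentences]})

body = build_request(["She is the best artist of her generation."])
print(body)
```

The body would then be POSTed to the endpoint with an HTTP client of choice, authenticating via the API Gateway for external callers.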

Request Classes

Service Level Indicators (SLIs)

  • Latency SLI, acceptable fraction: The percentage of all successful requests (2xx) that complete within 1000 milliseconds (1 sec), measured at the server side.
  • Service availability SLI: The percentage of all requests receiving a non-error response (non 5xx).
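Both SLIs can be computed from per-request records. A minimal sketch (the record fields are illustrative; in production these figures come from server-side metrics):

```python
def latency_sli(requests, threshold_ms=1000):
    """Fraction of successful (2xx) requests completing within threshold_ms."""
    ok = [r for r in requests if 200 <= r["status"] < 300]
    if not ok:
        return 1.0
    return sum(r["latency_ms"] <= threshold_ms for r in ok) / len(ok)

def availability_sli(requests):
    """Fraction of all requests that did not receive a 5xx response."""
    if not requests:
        return 1.0
    return sum(r["status"] < 500 for r in requests) / len(requests)

sample = [
    {"status": 200, "latency_ms": 210},
    {"status": 200, "latency_ms": 1300},
    {"status": 400, "latency_ms": 50},
    {"status": 503, "latency_ms": 5},
]
print(latency_sli(sample))       # 0.5: one of two 2xx requests within 1 second
print(availability_sli(sample))  # 0.75: three of four requests non-5xx
```

Note that the latency SLI deliberately excludes non-2xx requests, while the availability SLI counts 4xx responses as available.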

Operational

Monitoring

ToneCheck is monitored through several dashboards and alerts.

Dashboards

  • Inference Services (kserve container resource, preprocess latency, predict latency): dashboard
  • Istio-sidecar (top 20 service calls, traffic, response codes, latency/bytes): dashboard
  • Queue proxy (sits in between the istio sidecar inbound traffic and the kserve container): dashboard

Alerts

Troubleshooting

The ML team will be solely responsible for troubleshooting this service and no additional support from SREs is needed. We currently have 1 ML SRE and 5 MLEs in the team. Since most ML team engineers work similar hours, incidents occurring outside working hours may affect response times.

ML inference services are manually deployed rather than automatically deployed, which means most deployment issues are caught in staging before reaching production.

Typical incidents involve service crashes, unresponsive systems, or 5xx errors. When an isvc incident occurs, we examine relevant dashboards and logs to identify the root cause (for example, https://phabricator.wikimedia.org/T362503#9713711 ), and determine whether it's a networking issue or a model/inference service issue.

The time needed for troubleshooting varies from hours to days, primarily based on our familiarity with the issue. However, we typically decide on our response (such as a quick fix, restarting the service) within a few hours to immediately mitigate the issue and minimize error budget burn.

Deployment

ToneCheck is deployed on Kubernetes with helmfile. Changes to the inference services involve two steps: 1) CI builds a new production image, and 2) the deployment chart is updated with the new image and synced to update the isvc.

For service-level configuration changes (network, resource allocation, etc.), only a helmfile change and sync are needed to update the isvc.

The time needed for deployment includes code review, patch merging, helmfile updates, and syncing, and typically totals less than 1 hour.

Service Level Objectives

Realistic targets

Based on our staging load tests ( https://phabricator.wikimedia.org/P75923 ), a realistic target for the Latency SLI acceptable fraction would be 90%. Test 1 with 50 users showed a 99th percentile request latency of 1300 milliseconds, while Test 2, representing a more realistic use case, showed a 99th percentile request latency of 210 milliseconds.

A realistic target for Service Availability would be 95%, a figure agreed upon in discussions with the Editing team, the primary users of this service. Since this service is only guaranteed to be troubleshot during EU working hours, downtime could span several hours. After a preliminary analysis, we are setting an initial availability target of 95%. This allows for up to 5% downtime over a 90-day window: about 108 hours, or roughly 36 hours per month. While comparable services have not historically seen outages of this magnitude, starting with this conservative SLO gives us headroom for unforeseen incidents and a clear baseline to improve upon.
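The downtime arithmetic behind these figures:

```python
def downtime_budget_hours(availability_target, window_days):
    """Maximum allowed downtime, in hours, for a given availability
    target over a given window."""
    return (1 - availability_target) * window_days * 24

print(downtime_budget_hours(0.95, 90))   # about 108 hours over a 90-day window
print(downtime_budget_hours(0.95, 30))   # about 36 hours, roughly per month
print(downtime_budget_hours(0.999, 90))  # about 2.16 hours at a 99.9% target
```

The last line shows why the ideal 99.9% target discussed below effectively requires most incidents to be resolved within about 2 hours.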

Ideal targets

An ideal target for the Success Ratio SLO would be 99%. Some requests are malformed, or contain malformed instances within a batch of requests to ToneCheck. Because the service treats the instances of a batch individually, malformed instances receive a 400 status code in the response list while correct instances receive a 200. The service processes all instances and returns a corresponding response for each: if an instance is well-formed, its data is fed into the ToneCheck model, which returns the desired output for that instance; if it is malformed, it is not passed to the model and the service returns an error message for that specific instance. The response is a list of per-instance responses in the same order as the request batch.
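The per-instance batch semantics can be sketched like this (the validation rule and response shape are hypothetical; the real schema is defined by the service):

```python
def predict_batch(instances, model):
    """Return one response per instance, in request order.

    Well-formed instances are scored by the model (status 200); malformed
    ones get a per-instance error (status 400) instead of failing the
    whole batch.
    """
    responses = []
    for inst in instances:
        text = inst.get("text") if isinstance(inst, dict) else None
        if not isinstance(text, str) or not text.strip():
            responses.append({"status": 400, "error": "malformed instance"})
        else:
            responses.append({"status": 200, "prediction": model(text)})
    return responses

# Toy stand-in for the ToneCheck model:
toy_model = lambda text: {"tone_subjective": "best" in text.lower()}

out = predict_batch(
    [{"text": "She is the best artist of her generation."}, {"txt": "oops"}],
    toy_model,
)
print([r["status"] for r in out])  # [200, 400]
```

One malformed instance therefore burns success-ratio budget without affecting the rest of the batch, which is why the Success Ratio SLO is tracked per instance rather than per HTTP request.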

An ideal target for the Service Availability SLO would be 99.9%. This implies that most problems must be fixed in under 2 hours.

Reconciliation

  • Latency SLO, acceptable fraction: 90% of all successful requests (2xx) that complete within 1000 milliseconds, measured at the server side.
  • Service Availability SLO: 95% of all requests receiving a non-error response (non 5xx).