Machine Learning/LiftWing/KServe


KServe is a Python framework and K8s infrastructure aimed to standardize the way that people run and deploy HTTP servers wrapping up ML models. The Machine Learning team uses it in the LiftWing K8s cluster, to implement the new model serving infrastructure that should replace ORES.

You can check this video (slides) about KServe to get an overview!

How does KServe fit into the Kubernetes picture?

As described above, KServe represents two things:

  • A Python framework to load model binaries and wrap them around a consistent and standard HTTP interface/server.
  • A set of Kubernetes resources and controllers able to deploy the aforementioned HTTP servers.

Before concentrating on Kubernetes, it is wise to learn a bit about how the Python framework works and how to write custom code to serve your model. Once you have learned KServe's internals and architecture, it should be relatively easy to start playing with Docker to test a few things. After that, the ML team will help you add the K8s configuration to deploy the model on Lift Wing.

KServe architecture

KServe uses Tornado and asyncio behind the scenes: it assumes that the code that handles/wraps the model is as async as possible (composed of coroutines rather than blocking code). The idea is to have the following split:

  • Transformer code, which takes care of the client's inputs and retrieves the necessary features from services/feature stores/etc. It corresponds to a separate Docker image and container.
  • Predictor code, which gets features via HTTP from the Transformer and passes them to the model, which computes a score. The result is then returned to the client.

By default both the Transformer and the Predictor run a Tornado IOLoop based on Python's asyncio, so any blocking code limits the scalability of the service. KServe also offers the possibility to use Ray workers to parallelize models; the ML team has experimented with this approach.

When writing the code of your model server, keep in mind the following:

  • In Python only one thread can run at any given time due to the GIL. Any CPU-bound code holds the GIL until it finishes, stopping all other threads/executions in the meantime (especially I/O-bound ones, like sending/receiving HTTP requests).
  • Python upstream suggests using multiple processes for CPU-bound code. The ML team is writing libraries to ease the execution of CPU-bound functions/code in separate processes; please reach out to them if you think your code needs it.
  • Any HTTP or similar operation in your code should use a corresponding asyncio library, like aiohttp (libraries like requests and urllib are not async, for example). Please reach out to the ML team for more info if you need it.
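To illustrate the last two points, CPU-bound work can be offloaded to a separate process with asyncio's run_in_executor, which keeps the event loop free to serve I/O. This is a minimal standard-library sketch; the function names are illustrative, not the ML team's library code:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_bound_score(n: int) -> int:
    # Stand-in for CPU-heavy model code: it holds the GIL in its *own*
    # process, so the main process's event loop keeps serving requests.
    return sum(i * i for i in range(n))

async def handle_request(pool: ProcessPoolExecutor, n: int) -> int:
    loop = asyncio.get_running_loop()
    # Offload to the pool instead of blocking the IOLoop directly.
    return await loop.run_in_executor(pool, cpu_bound_score, n)

async def main() -> int:
    with ProcessPoolExecutor(max_workers=2) as pool:
        return await handle_request(pool, 10)

if __name__ == "__main__":
    print(asyncio.run(main()))  # 285
```

The same pattern works for any synchronous function; with `run_in_executor(None, func)` you get a thread pool instead, which helps for blocking I/O but not for CPU-bound code (the GIL still applies across threads).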


The starting point is surely the inference-services repository, where we keep all our configurations and Python code needed to generate the Docker images that will run on Kubernetes.

New service

If you have a new service that you want the ML Team to deploy on Lift Wing, we would suggest you first build and test your own model server using KServe locally via Docker.

This is an example of how to build a model server for image classification using a pre-trained AlexNet model.

Create your model server

In model-server/, the AlexNetModel class extends the kserve.Model base class for creating a custom model server.

The base model class defines a group of handlers:

  • load: loads your model into memory from a local file system or remote model storage.
  • preprocess: pre-processes the raw input data or applies custom transformation logic.
  • predict: executes the inference for your model.
  • postprocess: post-processes the prediction result or turns the raw prediction result into a user-friendly inference response.

Based on your needs, you can write custom code for these handlers. Note that the latter three handlers are executed in sequence, meaning that the output of preprocess is passed to predict as its input, and the output of predict is passed to postprocess as its input. In this AlexNet example, you will see that we only write custom code for the load and predict handlers, so everything (preprocess, predict, postprocess) effectively happens in the single predict handler.
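The handler chain can be pictured with a toy stand-in. This is not the real kserve.Model API (the actual base class uses async handlers and richer request/response types); it only illustrates the preprocess → predict → postprocess flow described above:

```python
class Model:
    """Toy stand-in for kserve.Model: runs the three handlers in sequence."""

    def __init__(self, name: str):
        self.name = name
        self.ready = False

    def load(self):
        # Real servers read model.bin from local or remote storage here.
        self.ready = True

    def preprocess(self, inputs: dict) -> dict:
        return inputs

    def predict(self, features: dict) -> dict:
        raise NotImplementedError

    def postprocess(self, prediction: dict) -> dict:
        return prediction

    def __call__(self, inputs: dict) -> dict:
        # The output of preprocess feeds predict; predict feeds postprocess.
        return self.postprocess(self.predict(self.preprocess(inputs)))


class EchoModel(Model):
    def predict(self, features: dict) -> dict:
        return {"predictions": [features]}


model = EchoModel("demo-model")
model.load()
print(model({"rev_id": 1097728152}))
# {'predictions': [{'rev_id': 1097728152}]}
```

Overriding only predict, as the AlexNet example does, means the inherited preprocess/postprocess are pass-throughs and all the work happens in one handler.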

Having a separate Transformer to do pre/post-processing is not mandatory, but it is recommended for more complex services (see the outlink-topic-model example below, which uses both a transformer and a predictor).

In model-server/requirement.txt, add the dependencies for your service, together with the KServe dependencies below that align with our production environment.


Create a Dockerfile

Docker provides the ability to package and run an application in an isolated environment. If you look at the Dockerfile in the example, you will see that we first specify a base image, "python3-build-buster:0.1.0", from the Wikimedia Docker Registry, to make sure the application can run in our WMF environment. The rest of the steps in the Dockerfile are simple: 1) copy the model-server directory into the container; 2) pip install the necessary dependencies from requirement.txt; 3) define an entry point for the KServe application to run the script.

Deploy locally and Test

Please follow the instructions to deploy Alexnet model locally and test.

Multiprocessing with asyncio

Enabling multiprocessing with asyncio can be done by specifying the corresponding variables in the helmfile values. At the moment this is only available for revscoring model servers.

      value: "True"
      value: "5"
    - name: PREPROCESS_MP
      value: "True"
    - name: INFERENCE_MP
      value: "False"

Setting PREPROCESS_MP to True/False will enable/disable multiprocessing for preprocessing the models input, while doing the same with INFERENCE_MP will toggle multiprocessing for the inference part.
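The effect of these flags can be pictured with a small dispatcher that reads an environment variable and decides whether to run a step in a separate process. The variable handling mirrors the helmfile values above, but the function names are illustrative, not the actual inference-services code:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def env_flag(name: str, default: str = "False") -> bool:
    # The helmfile passes the values as the strings "True"/"False".
    return os.environ.get(name, default) == "True"

def extract_features(rev_id: int) -> dict:
    # Stand-in for the CPU-heavy preprocess step of a revscoring server.
    return {"rev_id": rev_id, "features": []}

def run_preprocess(rev_id: int) -> dict:
    # Multiprocess when PREPROCESS_MP is "True", run inline otherwise.
    if env_flag("PREPROCESS_MP"):
        with ProcessPoolExecutor(max_workers=1) as pool:
            return pool.submit(extract_features, rev_id).result()
    return extract_features(rev_id)

os.environ["PREPROCESS_MP"] = "True"
print(run_preprocess(1097728152))
# {'rev_id': 1097728152, 'features': []}
```

An INFERENCE_MP toggle would wrap the predict step the same way; the result is identical either way, only the process doing the work changes.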

The team has run load-testing experiments for the revscoring model servers, which demonstrated a significant improvement in the model server's robustness for the editquality models; for other model servers, enabling multiprocessing did not improve the server's capacity to handle increased load. More information, along with test results, can be found on the relevant Phabricator task.

Service already present in the inference-services repository

Testing services already present in the inference-services repository locally is possible with Docker, but it requires a little knowledge of how KServe works.

Example 1 - Testing enwiki-goodfaith

Let's imagine that we want to run the enwiki revscoring editquality goodfaith model locally, to test how it works.


In the inference-service repo, run the following commands to build the Docker image:

blubber .pipeline/editquality/blubber.yaml production | docker build --tag SOME-DOCKER-TAG-THAT-YOU-LIKE --file - .

If you are curious about what Dockerfile gets built, remove the docker build command and inspect the output of Blubber. After the build process is done, you should see a Docker image named after the tag passed to the docker build command. Use the following command to check:

docker image ls

Next, check the file related to editquality (contained in the model-server directory) and familiarize yourself with the __init__() function. All the environment variables retrieved there are usually passed to the container by Kubernetes settings, so with Docker we'll have to set them explicitly.

Now you can create your specific playground directory under /tmp or somewhere else. The important bit is that you place the model binary file inside it. In this example, let's suppose that we are under /tmp/test-kserve, and that the model binary is stored in a subdirectory called models (so the binary's path is /tmp/test-kserve/models/model.bin). The name of the model binary is important: the standard is model.bin (so please rename your binary if it doesn't match).
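A few lines can sanity-check this layout before starting the container. The helper name is illustrative; /mnt/models is where the -v flag in the next step mounts your local models directory:

```python
import os

def model_binary_path(mount_dir: str = "/mnt/models") -> str:
    # Kubernetes (or docker run's -v flag) mounts the model storage at
    # mount_dir; KServe expects the binary to be named model.bin.
    path = os.path.join(mount_dir, "model.bin")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"expected model binary at {path}")
    return path
```

Running this check inside the container (or against your local models directory) catches the most common startup failure: a binary that exists but is not named model.bin.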

Run the following command to start a container:

docker run -p 8080:8080 -e INFERENCE_NAME=enwiki-goodfaith -e WIKI_URL= --rm -v `pwd`/models:/mnt/models SOME-DOCKER-TAG-THAT-YOU-LIKE

If everything goes fine, you'll see something like:

[I 220725 09:06:00 model_server:150] Registering model: enwiki-goodfaith
[I 220725 09:06:00 model_server:123] Listening on port 8080
[I 220725 09:06:00 model_server:125] Will fork 1 workers
[I 220725 09:06:00 model_server:128] Setting max asyncio worker threads as 8

Now we are ready to test the model server! First, create a file called input.json with the following content:

{ "rev_id": 1097728152 }

Open another terminal, execute:

curl localhost:8080/v1/models/enwiki-goodfaith:predict -i -X POST -d@input.json --header "Content-type: application/json" --header "Accept: application/json"

If everything goes fine, you should see some scores in the HTTP response.

Example 2 - Testing calling a "fake" EventGate (two containers)

A more complicated example is how to test code that needs to call other services (besides the MW API). One example is testing a code change that adds support for EventGate. The new code allows us to create and send specific JSON events via HTTP POST to EventGate, but for local testing we don't need to re-create the whole infrastructure; a simple HTTP server that echoes the POST content is enough to verify the functionality.

The Docker daemon creates containers in a default network called bridge, that we can use to connect two containers together. The idea is to:

  • Create a KServe container as explained in Example 1.
  • Create an HTTP server in another container using Python.

The latter is simple. Let's create a directory with two files, a Dockerfile and a small Python HTTP server script:

FROM python:3-alpine


RUN mkdir /ws
COPY /ws/


CMD ["python", "", "6666"]
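The server script's filename and contents are elided on this copy of the page; a minimal stand-in that echoes POST bodies could look like the sketch below (any filename works, as long as the Dockerfile's COPY and CMD lines reference it — the CMD passes the port, 6666, as a command-line argument):

```python
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the POSTed body and send it straight back, so we can see
        # exactly what the KServe container would have sent to EventGate.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print("received:", body.decode(errors="replace"))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def serve(port: int) -> None:
    # Blocks forever; the Dockerfile's CMD would call this with argv[1].
    HTTPServer(("0.0.0.0", port), EchoHandler).serve_forever()
```

Every event the model server emits is printed to the container's stdout, so `docker logs` on this container shows the POSTed JSON.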

We can then build and execute the container:

  • docker build . -t simple-http-server
  • docker run --rm -it -p 6666 simple-http-server

Before creating the KServe container, let's check the running container's IP:

  • docker ps (to get the container id)
  • docker inspect #container-id | grep IPAddress (note the address; we'll use it in the EventGate URL below)

Two new variables have been added to __init__ in the code change: EVENTGATE_URL and EVENTGATE_STREAM. So let's add them to the run command:

docker run -p 8080:8080 -e EVENTGATE_STREAM=test -e EVENTGATE_URL="" -e INFERENCE_NAME=enwiki-goodfaith -e WIKI_URL= --rm -v `pwd`/models:/mnt/models SOME-DOCKER-TAG

Now you can test the new code via curl, and you should see the HTTP POST sent by the KServe container to the "fake" EventGate simple HTTP server!

Example 3 - Testing outlink-topic-model (two containers)

The KServe architecture highly encourages the use of Transformers for the pre/post-process functions (so basically for feature engineering) and a Predictor for the models. Transformer and Predictor are separate Docker containers, which also become separate pods in k8s (but we don't need to worry much about this last bit).

This example is a variation of the second one, since it involves spinning up two containers and using the default bridge network to let them communicate with each other. The Transformer can be instructed to contact the Predictor on a certain IP:Port combination, to pass it the features collected during the preprocess step.

Let's use the outlink model example (at the moment the only transformer/predictor example in inference-services) to see the steps:

Build the Transformer's Docker image locally:

blubber .pipeline/outlink/transformer.yaml production | docker build --tag outlink-transformer --file - .

Build the Predictor's Docker image locally:

blubber .pipeline/outlink/blubber.yaml production | docker build --tag outlink-predictor --file - .

Download the model into a temp path (see Example 2), then start the predictor:

docker run --rm -v `pwd`:/mnt/models outlink-predictor

(note: `pwd` represents the directory that will be mounted in the container; it needs to contain the model binary downloaded above, renamed to 'model.bin')

Run docker ps and docker inspect #container-id to find the IP address of the Predictor's container (see Example 2 for more info).

Run the transformer:

docker run -p 8080:8080 --rm outlink-transformer --predictor_host PREDICTOR_IP:8080 --model_name outlink-topic-model

(note: PREDICTOR_IP needs to be replaced with what you found during the previous step)

Then you can send requests to localhost:8080 via curl or your preferred HTTP client. You'll hit the Transformer first; the features will be retrieved and then sent to the Predictor. The score will be generated and then returned to the client.