Machine Learning/LiftWing/KServe

KServe is a Python framework and K8s infrastructure that aims to standardize the way people run and deploy HTTP servers wrapping ML models. The Machine Learning team uses it in the LiftWing K8s cluster to implement the new model serving infrastructure that should replace ORES.

How does KServe fit into the Kubernetes picture?

As described above, KServe represents two things:

  • A Python framework to load model binaries and wrap them in a consistent, standard HTTP interface/server.
  • A set of Kubernetes resources and controllers able to deploy the aforementioned HTTP servers.

Before concentrating on Kubernetes, it is wise to learn a bit about how the Python framework works and how to write custom code to serve your model. Once you have learned KServe's internals and architecture, it should be relatively easy to start playing with Docker to test a few things. Once that is done, the ML team will help you add the K8s configuration to deploy the model on Lift Wing.

KServe architecture

KServe uses Tornado behind the scenes, and it assumes that the code that handles/wraps the model is as async as possible (so composed of coroutines, not blocking code). The idea is to have the following split:

  • Transformer code, which handles the client's inputs and retrieves the necessary features from services/feature stores/etc. It corresponds to a separate Docker image and container.
  • Predictor code, which gets the features via HTTP from the Transformer and passes them to the model, which computes a score. The result is then returned to the client.

By default both the Transformer and the Predictor run a Tornado IOLoop, so any blocking code limits the scalability of the service. KServe also offers the possibility to use Ray workers to parallelize models; see what the ML Team tested in https://phabricator.wikimedia.org/T309624.
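
To make this split concrete, here is a minimal sketch of a Transformer, assuming the kserve==0.8.0 API pinned later on this page (where a Model with predictor_host set forwards predict() calls to the Predictor over HTTP): preprocess() is written as a coroutine so feature gathering does not block the IOLoop. This is not WMF production code; the feature payload and the predictor_host value are placeholders.

import kserve


class ExampleTransformer(kserve.Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        # The base class forwards predict() calls to this host over HTTP.
        self.predictor_host = predictor_host
        self.ready = True

    async def preprocess(self, inputs: dict) -> dict:
        # Gather features here with non-blocking calls (MW API, feature store, ...).
        # This placeholder simply wraps the incoming rev_id as a fake feature set.
        rev_id = inputs.get("rev_id")
        return {"instances": [{"rev_id": rev_id}]}


if __name__ == "__main__":
    transformer = ExampleTransformer("example-model", predictor_host="localhost:8081")
    kserve.ModelServer().start([transformer])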

Repositories

The starting point is surely the inference-services repository, where we keep all our configurations and Python code needed to generate the Docker images that will run on Kubernetes.

New service

If you have a new service that you want the ML Team to deploy on Lift Wing, we would suggest you first build and test your own model server using KServe locally via Docker.

This example, https://github.com/AikoChou/kserve-example/tree/main/alexnet-model, shows how to build a model server for image classification using a pre-trained AlexNet model.

Step 1: Create your model server (model.py).

In model-server/model.py, the AlexNetModel class extends the kserve.Model base class to create a custom model server.

The base model class defines a group of handlers:

  • load: loads your model into memory from a local file system or remote model storage.
  • preprocess: pre-processes the raw input data or applies custom transformation logic.
  • predict: executes the inference for your model.
  • postprocess: post-processes the prediction result or turns the raw prediction into a user-friendly inference response.

Based on your needs, you can write custom code for these handlers. Note that the latter three handlers are executed in sequence, meaning that the output of preprocess is passed to predict as input, and the output of predict is passed to postprocess as input. In the AlexNet example, you will see that we write custom code only for the load and predict handlers, so we basically do everything (preprocess, predict, postprocess) in a single predict handler.
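
As an illustration of that pattern, here is a simplified sketch (not the actual code from the example repository) of a custom model server that only implements load and predict; load_binary() and score() are hypothetical placeholders for your own framework's calls:

import kserve


class DemoModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.ready = False

    def load(self):
        # KServe mounts the model storage under /mnt/models by convention.
        self.model = load_binary("/mnt/models/model.bin")  # hypothetical loader
        self.ready = True

    def predict(self, request: dict) -> dict:
        # Pre-processing, inference and post-processing all happen here.
        inputs = request["instances"]
        scores = [self.model.score(x) for x in inputs]  # hypothetical scoring call
        return {"predictions": scores}


if __name__ == "__main__":
    model = DemoModel("demo-model")
    model.load()
    kserve.ModelServer().start([model])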

Having a separate Transformer to do the pre/post-processing is not mandatory, but it is recommended. For a more complex example with a transformer and a predictor, see https://github.com/AikoChou/kserve-example/tree/main/outlink-topic-model.

Step 2: Create a requirements.txt

In model-server/requirements.txt, you should add all the dependencies for your service, plus the following KServe dependencies that align with our production environment:

kserve==0.8.0
kubernetes==17.17.0
protobuf==3.19.1
ray==1.9.0

Step 3: Create a Dockerfile

Docker provides the ability to package and run an application in an isolated environment. If you look at the Dockerfile in the example, you will see that we first specify a base image, "python3-build-buster:0.1.0", from the Wikimedia Docker Registry, to make sure the application can run in our WMF environment. The rest of the steps in the Dockerfile are simple: 1) copy the model-server directory into the container; 2) pip install the necessary dependencies from requirements.txt; 3) define an entry point for the KServe application that runs the model.py script.

Step 4: Deploy locally and Test

Please follow the instructions to deploy the AlexNet model locally and test it.

Services already present in the inference-services repository

Testing services already present in the inference-services repository locally is possible with Docker, but it requires a bit of knowledge about how KServe works.

Example 1

Let's imagine that we want to run the enwiki revscoring editquality goodfaith model locally, to test how it works:

  • First of all, we need to clone the inference-services repository (see the related section for more info).
  • We need to have Blubber available locally.
  • We need to get the model binary version that we need (in our case, they are available in https://github.com/wikimedia/editquality/tree/master/models).
  • In the inference-services repo, change directory to revscoring/editquality
  • Run the following command to build the Docker image: blubber ../../.pipeline/editquality/blubber.yaml production | docker build --tag SOME-DOCKER-TAG-THAT-YOU-LIKE --file - .
    • If you are curious about what Dockerfile gets built, remove the docker build command and see the output of Blubber.
  • At this point, you should see a Docker image in your local environment named after the tag added to the docker build command (use docker image ls to check).
  • Check the model.py file related to editquality (contained in the model-server directory) and familiarize yourself with the __init__() function. All the environment variables retrieved there are usually passed to the container via Kubernetes settings, so with Docker we'll have to set them explicitly.
  • Now you can create your specific playground directory under /tmp or somewhere else. The important bit is that you place the model binary file inside it. In this example, let's suppose that we are under /tmp/test-kserve, and that the model binary is stored in a subdirectory called models (so the binary's path is /tmp/test-kserve/models/model.bin). The name of the model binary is important: the standard is model.bin (so please rename your binary in case it doesn't match).
  • Run something like the following: docker run -p 8080:8080 -e INFERENCE_NAME=enwiki-goodfaith -e WIKI_URL=https://en.wikipedia.org --rm -v `pwd`/models:/mnt/models SOME-DOCKER-TAG-THAT-YOU-LIKE
  • Now we are ready to test the model server!
    • Create a file called input.json with the following content: { "rev_id": 1097728152 }
    • Execute: curl localhost:8080/v1/models/enwiki-goodfaith:predict -i -X POST -d@input.json --header "Content-type: application/json" --header "Accept-Encoding: application/json"
    • If everything goes fine, you should see some scores in the HTTP response.
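
The curl call in the last step can also be made from Python, if you prefer; a minimal client sketch (assuming the requests library is installed) looks like this:

import json

import requests

# Same payload as input.json above.
payload = {"rev_id": 1097728152}

resp = requests.post(
    "http://localhost:8080/v1/models/enwiki-goodfaith:predict",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
)
print(resp.status_code)
print(resp.json())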

Example 2

A more complicated case is testing code that needs to call other services (besides the MW API). One example is the testing of https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/808247

In the above code change, we are trying to add support for EventGate. The new code would allow us to create and send specific JSON events via HTTP POSTs to EventGate, but in our case we don't need to re-create the whole infrastructure locally; a simple HTTP server to echo the POST content is enough to verify the functionality.

The Docker daemon creates containers in a default network called bridge, which we can use to connect two containers together. The idea is to:

  • Create a KServe container as explained in Example 1.
  • Create an HTTP server in another container using Python.

The latter is simple. Let's create a directory with two files, a Dockerfile and a server.py script. The Dockerfile:

# Minimal image for the echo server that stands in for EventGate.
FROM python:3-alpine

# Port that the KServe container will POST events to.
EXPOSE 6666

RUN mkdir /ws
COPY server.py /ws/server.py

WORKDIR /ws

CMD ["python", "server.py"]

We can then build and execute the container:

  • docker build . -t simple-http-server
  • docker run --rm -it -p 6666 simple-http-server

Before creating the KServe container, let's check the running container's IP:

  • docker ps (to get the container id)
  • docker inspect #container-id | grep IPAddress (let's assume it is 172.19.0.3)

As you can see in https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/808247, two new variables have been added to __init__: EVENTGATE_URL and EVENTGATE_STREAM. So let's add them to the run command:

docker run -p 8080:8080 -e EVENTGATE_STREAM=test -e EVENTGATE_URL="http://172.19.0.3:6666" -e INFERENCE_NAME=enwiki-goodfaith -e WIKI_URL=https://en.wikipedia.org --rm -v `pwd`/models:/mnt/models SOME-DOCKER-TAG

Now you can test the new code via curl, and you should see the HTTP POST sent by the KServe container to the "fake" EventGate HTTP server!

Example 3 - Transformer and Predictor

The KServe architecture highly encourages the use of Transformers for the pre/post-processing functions (so basically for feature engineering) and of a Predictor for the models. The Transformer and the Predictor are separate Docker containers, which will also run as separate pods in k8s (but we don't need to worry much about this last bit).

This example is a variation of the second one, since it involves spinning up two containers and using the default bridge network to let them communicate with each other. The Transformer can be instructed to contact the Predictor on a certain IP:port combination, to pass it the features collected during the preprocess step.

Let's use the outlink model example (at the moment the only transformer/predictor example in inference-services) to see the steps:

  • Build the Transformer's Docker image locally:
    • cd inference-services/outlink-topic-model
    • blubber ../.pipeline/outlink/transformer.yaml production | docker build --tag outlink-transformer --file - .
  • Build the Predictor's Docker image locally:
    • cd inference-services/outlink-topic-model
    • blubber ../.pipeline/outlink/blubber.yaml production | docker build --tag outlink-predictor --file - .
  • Download the model from https://analytics.wikimedia.org/published/datasets/one-off/isaacj/articletopic/model_alloutlinks_202012.bin into a temp path (see Example 1)
  • Start the predictor: docker run --rm -v `pwd`:/mnt/models outlink-predictor (note: `pwd` represents the directory that will be mounted in the container; it needs to contain the model binary downloaded above, renamed to model.bin).
  • Run docker ps and docker inspect #container-id to find the IP address of the Predictor's container (see Example 2 for more info).
  • Run the transformer: docker run -p 8080:8080 --rm outlink-transformer --predictor_host PREDICTOR_IP:8080 --model_name outlink-topic-model (note: PREDICTOR_IP needs to be replaced with what you found during the previous step).
  • Then you can send requests to localhost:8080 via curl or your preferred HTTP client. You'll hit the Transformer first, the features will be retrieved and then sent to the Predictor. The score will be generated and then returned to the client.
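
As in Example 1, you can exercise the Transformer from Python as well. Note that the exact request schema is defined by the outlink transformer's preprocess code; the lang/page_title payload below is only an assumption for illustration, so check the transformer in inference-services for the real field names:

import json

import requests

# Hypothetical payload - verify the expected fields in the outlink transformer code.
payload = {"lang": "en", "page_title": "Toni Morrison"}

resp = requests.post(
    "http://localhost:8080/v1/models/outlink-topic-model:predict",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
)
print(resp.status_code)
print(resp.json())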