
User:Alexandros Kosiaris/Benchmarking kubernetes apps

Revision as of 14:48, 12 December 2018 by imported>Alexandros Kosiaris (→‎Actual benchmarking)

Let's see if we can work this out in a way that warrants its own wikitech page so others can benefit and amend

What I just did (and is probably worth reproducing)

Setting everything up

On my machine, with direct internet access (i.e. not via some HTTP proxy):

  • install minikube on my local machine
  • install kubectl on my local machine (that's actually optional)
  • install helm on my local machine
  • install apache benchmark (ab)

The first three are golang binaries; the last one comes with Apache, but googling around suggests there are solutions for installing it even on Windows. In the wikitech page we should have links to the releases, but for now let me proceed.

$ minikube start

wait it out

$ helm init

wait it out until helm version returns successfully
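Since `helm version` only succeeds once tiller answers, the waiting can be scripted; a minimal sketch, where `wait_for` is a hypothetical helper of my own, not a helm or minikube command:

```shell
# Hypothetical helper: poll a command until it succeeds, give up after N tries.
wait_for() {
  local tries=$1; shift
  local i=0
  until "$@" >/dev/null 2>&1; do
    i=$((i + 1))
    if [ "$i" -ge "$tries" ]; then
      return 1
    fi
    sleep 1
  done
}

# Usage: block until tiller responds, give up after roughly a minute:
# wait_for 60 helm version
```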

TODO: This will probably need some changes in the future, but for now it's fine

Open grafana

$ minikube addons enable heapster

$ minikube addons open heapster

Wait it out a bit, it should start recording.

$ git clone https://gerrit.wikimedia.org/r/operations/deployment-charts

$ git pull <blubberoid's change> # i.e. in this case git pull https://gerrit.wikimedia.org/r/operations/deployment-charts refs/changes/26/479026/2

$ cd deployment-charts

  1. Note: this should not be needed normally when the pipeline tags latest
  2. Get the wanted app version using

$ curl https://docker-registry.wikimedia.org/v2/wikimedia/blubberoid/tags/list | jq '.'

$ helm install --set main_app.version=20181210183809-production blubberoid

wait it out (alternatively pass --wait to helm install)

Execute the commands about setting MINIKUBE_HOST and SERVICE_PORT in your terminal (you can skip that if you feel like hardcoding them)
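For reference, those variables can be set along these lines. The service name `blubberoid`, the nodePort jsonpath, and the literal fallback values are all assumptions of mine (the fallbacks are placeholders so the snippet doesn't fail outright without a running minikube):

```shell
# Assumed: the chart exposes a NodePort service named "blubberoid".
# The literal fallbacks are placeholders for when minikube/kubectl are unavailable.
MINIKUBE_HOST=$(minikube ip 2>/dev/null || echo 192.168.99.100)
SERVICE_PORT=$(kubectl get svc blubberoid \
  -o jsonpath='{.spec.ports[0].nodePort}' 2>/dev/null || echo 30000)
export MINIKUBE_HOST SERVICE_PORT
echo "http://${MINIKUBE_HOST}:${SERVICE_PORT}/test"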

Actual benchmarking

Note in grafana the CPU and memory use of the pod under nominal load (aka zero stress). That becomes the requests stanza of resources in your values.yaml.
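As a sketch, such a stanza looks something like this in the chart's values.yaml (the numbers are illustrative; read the real ones off grafana for your own service):

```yaml
resources:
  requests:
    cpu: 100m      # CPU observed at nominal load, plus some headroom (illustrative)
    memory: 128Mi  # memory observed at nominal load, plus some headroom (illustrative)
```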

Now here comes the hard part. Blubberoid is actually easy because it effectively exposes a single API endpoint we care about[1], but the generic advice would be something like:

"Run this process for every one of the endpoints your service exposes. The idea is to stress test all endpoints so that we know how the service behaves under nominal/normal/high loads. Choose your payloads, you know them better than anyone."

That would be the /[variant] of a blubber spec.

Using the example from the docs

version: v3
base: docker-registry.wikimedia.org/nodejs-slim
apt: { packages: [librsvg2-2] }
lives:
  in: /srv/service
variants:
  build:
    base: docker-registry.wikimedia.org/nodejs-devel
    apt: { packages: [librsvg2-dev, git, pkg-config, build-essential] }
    node: { requirements: [package.json] }
    runs: { environment: { LINK: g++ } }
  test:
    includes: [build]
    entrypoint: [npm, test]

That gives us the /test and /build endpoints. In blubberoid's case, however, they are the same thing, so let's just benchmark /test.

Using the shell variables set above, and with the above stanza saved in a file called datafile, this returns successfully:

curl --data-binary @datafile http://${MINIKUBE_HOST}:${SERVICE_PORT}/test

Then transform this into an actual ab test:

ab -n30000 -c1 -T 'application/x-www-form-urlencoded' -p datafile http://${MINIKUBE_HOST}:${SERVICE_PORT}/test

That's 30k requests with a concurrency of one. Note the results in grafana and then gradually increase the concurrency

I did this with a for loop in my case:

for i in `seq 1 30`; do ab -n30000 -c${i} -T 'application/x-www-form-urlencoded' -p datafile http://${MINIKUBE_HOST}:${SERVICE_PORT}/test; done

This is probably way more load than the blubberoid service will ever see, so it should suffice. For other services it may be necessary to run the benchmarks from different hosts with many CPUs, and with multiple invocations of the ab program (or other HTTP benchmarking suites; ab can only do so much).

Interpreting the results

After this is done, mark the maximal CPU+memory usage. Memory is the easy one (that's actually a lie, but for now it should be enough). That becomes your limits stanza in values.yaml.
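As a sketch, the corresponding stanza in values.yaml would look something like this (numbers illustrative, taken from your grafana readings under load):

```yaml
resources:
  limits:
    cpu: 1         # maximal CPU observed while benchmarking (illustrative)
    memory: 300Mi  # maximal memory observed while benchmarking (illustrative)
```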

However, CPU is way more difficult to interpret. For starters, it varies heavily with the architecture of the host being tested on. In minikube it will probably be very limited, probably 1 core if the app is written decently. If it's not, it's probably prudent to profile the endpoint and figure out if it can be improved. On hosts with higher CPU counts this will vary, depending heavily on the app. If a maximum is reached despite increasing concurrency, great, you've got your number. If not, the application is probably well written, congrats, but we will still have to figure out a number; this will need revisiting.

In blubberoid's case the app does seem to be well written, so the limit should probably come from expected usage. My take is 1 core is fine.

Things to take into account

grep the ab output for 'Requests per second'. If the number sounds good for one instance of the service (in production we will have multiple), that's fine. If not, it's probably prudent to figure out why.
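ab's summary includes a throughput line you can grep or parse; a quick sketch (the number in the sample line below is made up, the line's shape is ab's):

```shell
# Sample of ab's summary line (the figure here is made up):
sample='Requests per second:    1234.56 [#/sec] (mean)'
# Strip the label and the units, keeping just the number:
rps=$(printf '%s\n' "$sample" | awk -F: '/Requests per second/ {print $2 + 0}')
echo "$rps"   # -> 1234.56
```

The same awk invocation works against a file of saved ab output instead of a pipe.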

Do the same for 'Failed requests'. If you have anything other than 0, figure out why. It may be that you found a bug (like a race condition), or simply that the service is crumbling under the load. In that case it's important to understand the breaking point, which is not always easy to figure out. It might be fully related to the current load, at which point the amount of concurrency applied will tell you how many users a single instance of the service can support, or it might be a byproduct of the benchmarking process. Applications written in frameworks with lazy GC might fall into this category, as the benchmarking does not allow them to GC in time and they end up either consuming the entire memory of the host or being killed by kubernetes. Whatever the reason, try to figure it out and, most importantly, document it. Ask for help.

[1] https://gerrit.wikimedia.org/r/plugins/gitiles/blubber/+/refs/heads/master/cmd/blubberoid/main.go