
Appserver-buster-upgrade-2021


Observations of differences between MediaWiki servers on stretch vs buster, before and after task T245757, January 2021

TCP errors

Are there more TCP errors?

At first glance it appears as if TCP errors have been reduced when looking at these 2 example hosts. The data gap in the middle is when the reimaging to buster happened.

No, not really

But once you zoom out and look at the entire week before the upgrade, it turns out this isn't actually a pattern.

mw1268 - TCP errors - over an entire week before the upgrade

Disk utilization

Did disk utilization go up?

Similarly, at first it looks as if disk utilization went through the roof after the upgrade:

No, seems like spikes are unrelated

But once you zoom out, you see the same spikes also occurring independently of the upgrade event:

mw1268 - disk utilization over a week leading up to the buster upgrade day
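The zoom-out check above can be made slightly more rigorous: a spike after the upgrade is only noteworthy if it exceeds the variability already present in the baseline week. A minimal sketch in Python, with invented utilization samples (real numbers would come from the host-overview dashboard):

```python
from statistics import mean, stdev

# Invented disk-utilization samples (%) for the week before the reimage;
# real data would come from the host-overview dashboard.
baseline_week = [12, 15, 11, 14, 80, 13, 12, 16, 78, 14, 13, 15]

# Invented samples observed after the reimage.
after_upgrade = [13, 16, 82, 14, 12]

mu, sigma = mean(baseline_week), stdev(baseline_week)

def is_anomalous(value, z=3.0):
    # A spike only deserves attention if it exceeds the week's normal range.
    return abs(value - mu) > z * sigma

# The baseline week already contains spikes of similar size (80, 78),
# so the post-upgrade spike at 82 falls within normal variability.
print(any(is_anomalous(v) for v in after_upgrade))  # False
```

The same check applies to the TCP-error graphs above: the week-long baseline already contains the pattern seen after the reimage.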


Performance (avg response time)

Is it actually getting slower?

Looking at average response time, a buster server can appear to actually be slower, for example mw1268 (stretch) vs mw1267 (buster) over a 6-hour span:

mw1268 (stretch) vs mw1267 (buster) - avg response time - over 6 hours on 2021-01-27

Similarly if we compare these hosts over a week:

mw1268 (stretch) vs mw1267 (buster) - avg response time - over a week in Jan 2021

Or over a full 30 days (mw1267 was reimaged on Jan 8):

mw1268 (stretch) vs mw1267 (buster) - avg response time - over 30 days

Compare another set of servers

App

But mw1267/mw1268 are really old hardware and will soon be replaced anyway. Let's take another set of appservers, one unchanged on stretch (mw1403) and one reimaged to buster (mw1405), to see whether we can confirm this on more modern hardware.

mw1403 (stretch) vs mw1405 (buster) - response time over 6 hours on 2021-01-27 - can't see a difference here

File:Mw1403-mw1405-response-time-6hours-Screenshot at 2021-01-27 15-46-50.png

API

Let's do the same for API servers: mw1404 (buster) vs mw1406 (stretch->buster), avg response time over 12 hours on 2021-01-27. No obvious difference.

File:Mw1404-mw1406-API-response-time-12hours-Screenshot at 2021-01-27 15-48-21.png

Let's check the same hardware before and after

To make really sure, let's check the same machines before and after reimaging.

Before

First, let's take a screenshot of the baseline before touching them. mw1402 and mw1404 are API servers, both on stretch. mw1268 and mw1269 are app servers, both on stretch in this image. We are now looking at 24-hour and 7-day windows, and also at the 95th percentile:

File:Mw1402-mw1404-both-stretch-response-time-24hours-Screenshot at 2021-01-27 15-51-16.png File:Mw1402-mw1404-both-stretch-response-time-7days-Screenshot at 2021-01-27 15-52-15.png File:Mw1268-mw1269-app-both-stretch-response-time-7days-Screenshot at 2021-01-27 15-54-26.png
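It is worth spelling out why the 95th percentile is checked in addition to the average: a handful of slow requests barely move the mean but dominate the p95. A minimal sketch with invented latency samples, using the nearest-rank percentile method:

```python
import math

# Invented latency samples (ms); real numbers would come from the dashboards.
latencies_ms = sorted([110, 120, 130, 140, 150, 160, 170, 180, 190, 900])

def p95(sorted_xs):
    # Nearest-rank 95th percentile of an already-sorted sample.
    rank = math.ceil(0.95 * len(sorted_xs))
    return sorted_xs[rank - 1]

# One slow request lifts the average only moderately,
# but it becomes the p95 outright.
print(sum(latencies_ms) / len(latencies_ms))  # 225.0
print(p95(latencies_ms))                      # 900
```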

After

Now, after reimaging one of the two servers: stretch on the left, buster on the right. Here we again see a slower average response time on buster. File:Mw1402-mw1404-stretch-buster-response-time-Screenshot at 2021-01-28 15-58-25.png

But... the percentage of responses under 250ms is actually getting better

File:Mw1402-mw1404-stretch-buster-responses under 250-Screenshot at 2021-01-28 15-59-18.png
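This is not actually a contradiction: the average is sensitive to a few slow outliers, while the share of responses under 250ms is not. A minimal sketch with invented latency numbers (not measured on these hosts):

```python
# Both latency lists are invented to illustrate the effect,
# not measured on mw1402/mw1404.

# "stretch"-like: most requests moderate, no extreme outliers (ms).
stretch_ms = [200, 220, 240, 260, 230, 210, 250, 240]

# "buster"-like: more fast requests, but a few slow outliers raise the mean.
buster_ms = [120, 130, 140, 150, 160, 170, 180, 2000]

def avg(xs):
    return sum(xs) / len(xs)

def pct_under(xs, threshold=250):
    # Share of requests served in under `threshold` ms.
    return 100 * sum(1 for x in xs if x < threshold) / len(xs)

print(avg(stretch_ms), pct_under(stretch_ms))  # 231.25 75.0
print(avg(buster_ms), pct_under(buster_ms))    # 381.25 87.5
```

So a higher average and a better under-250ms percentage can coexist when the latency distribution is skewed.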

The same comparison for appservers instead of API servers. Here stretch and buster are reversed: buster on the left, stretch on the right.

File:Mw1268-mw1269-buster-stretch-response-time-Screenshot at 2021-01-28 16-01-25.png

Results still inconsistent? Is there a pattern? File:Mw1268-mw1269-buster-stretch-responses under 250-Screenshot at 2021-01-28 16-02-17.png

Sources

Example Grafana dashboards used: host-overview, application-servers-red-dashboard-wkandek