Databases: additional details
The database issues are certainly related (there was no parallel problem at the same time, as seen from HTTP traffic), although not necessarily caused directly by the deploy; they were apparently only triggered by the dblist changes. I think the database problems were unavoidable, but they could have been minimized. When there are database issues, errors tend to pile up, probably through a combination of retries at the load balancer / connection pileups / inability to get replication control, plus jobqueue overload (the jobqueue also makes extensive use of the databases), possibly bot edit retries, and too-large API timeouts. That means that MediaWiki code, in certain bad states, does not react reliably. I reported this issue as a comment on the already ongoing security ticket (as it affects reliability in a reproducible way) at https://phabricator.wikimedia.org/T180918#3929364 and hope Performance can give it a second look.
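The retry-amplification effect described above can be sketched numerically. The function and numbers below are purely illustrative assumptions, not MediaWiki code or measurements from this outage; they only show why blind immediate retries multiply the query load precisely when the database is least able to serve it.

```python
# Hypothetical sketch: expected number of queries actually hitting the
# database per client request, when every failed attempt is immediately
# retried up to max_retries times (no backoff, no circuit breaker).
def db_query_load(requests, failure_rate, max_retries):
    expected_attempts = 0.0
    p_attempt = 1.0  # probability that the n-th attempt happens at all
    for _ in range(max_retries + 1):
        expected_attempts += p_attempt
        p_attempt *= failure_rate  # a retry happens only after a failure
    return requests * expected_attempts

# Illustrative numbers (assumed, not measured):
healthy = db_query_load(1000, 0.01, 3)   # ~1010 queries for 1000 requests
degraded = db_query_load(1000, 0.90, 3)  # ~3439 queries for the same 1000
```

Under the assumed 90% failure rate, the same client demand generates roughly 3.4x the query load on a database that is already struggling, which matches the observed pattern of errors piling up once problems start.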
Something to notice is that the database anomaly wasn't just the 2 spikes commented on earlier: there was a higher volume of queries from 19:15 until 20:59. It mostly affected enwiki, but looking at the several graphs it is difficult to understand the actual user impact; for example, edits spiked up after the "hard" part of the outage ( https://grafana.wikimedia.org/dashboard/db/edit-count?orgId=1&from=1517248290599&to=1517256840245&refresh=5m https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=1&fullscreen&orgId=1&from=1517249242761&to=1517259961641 ). -- Jcrespo 18:16, 31 January 2018 (UTC)